Upgrade synthetic control to model multiple treated units #456

drbenvincent · 2025-04-19T10:45:09Z

What

Currently, the synthetic control functionality is constrained to a single treated unit. One treated unit is the minimum needed for a working synthetic control solution. This still offers non-trivial functionality: we have docs with a generic example using simulated data, and also an example estimating the effects of Brexit (where the UK is the only treated unit).

However, there are many situations where you will have more than one treated unit. This can happen in many different domains, but it is particularly common in marketing, in geolift tests. We have a docs page on geolift with a single treated geo, and another on multi-cell geolift analysis where there are multiple treated geos. The latter currently walks through a pooled analysis approach, where we simply take the average of the outcome variable across the treated geos and then model it as a single treated unit case of synthetic control. The alternative is to treat the geos as unpooled, in which case we simply run multiple independent single treated unit synthetic control analyses.
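To make the difference between the two existing options concrete, here is a minimal NumPy sketch. The array of treated-geo outcomes is simulated and its layout (rows are time points, columns are treated geos) is an assumption for illustration only; it is not how the docs pages load their data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_time, n_treated = 6, 3

# Hypothetical outcomes: rows are time points, columns are treated geos
treated = rng.normal(loc=10.0, scale=1.0, size=(n_time, n_treated))

# Pooled: average across treated geos, then model the result
# as a single treated unit
pooled_target = treated.mean(axis=1)

# Unpooled: one target series per treated geo, each of which would be
# passed to its own independent single treated unit analysis
unpooled_targets = [treated[:, j] for j in range(n_treated)]
```

The pooled approach yields one series of length `n_time`, while the unpooled approach yields `n_treated` separate series, each needing its own model fit.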

Why

This issue proposes that we add the ability to model multiple treated units (or geos). This has a number of motivations:

it is a more general solution
it would allow a single modeling approach to geo testing (or any other multiple treatment unit situation)
it would retain the flexibility of the pooled and unpooled analysis approaches, and would newly allow a partially pooled analysis in which information is shared across the weights of different treated units.
it would lay the foundation for implementing synthetic difference-in-differences (see Add Synthetic Difference-in-Differences #47)

Changes

Changes to the `WeightedSumFitter` class

This pymc model class would need to be changed so that we have a weight matrix, rather than a weight vector.

CausalPy/causalpy/pymc_models.py

Lines 254 to 271 in 4227edf

    
    def build_model(self, X, y, coords):
        """
        Defines the PyMC model
        """
        with self:
            self.add_coords(coords)
            n_predictors = X.shape[1]
            X = pm.Data("X", X, dims=["obs_ind", "coeffs"])
            y = pm.Data("y", y[:, 0], dims="obs_ind")
            # TODO: There we should allow user-specified priors here
            beta = pm.Dirichlet("beta", a=np.ones(n_predictors), dims="coeffs")
            # beta = pm.Dirichlet(
            #     name="beta", a=(1 / n_predictors) * np.ones(n_predictors),
            #     dims="coeffs"
            # )
            sigma = pm.HalfNormal("sigma", 1)
            mu = pm.Deterministic("mu", pm.math.dot(X, beta), dims="obs_ind")
            pm.Normal("y_hat", mu, sigma, observed=y, dims="obs_ind")

So rather than `dims="coeffs"` (where coeffs correspond to control units), it would be `dims=("control_units", "treated_units")`. This would give us an unpooled set of weights over the control units for each of the treated units. A later step could then implement partial pooling over these weights (across the `treated_units` dimension).
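The shape arithmetic behind a weight matrix can be sketched in plain NumPy. This is only an illustration of the dimensions, not the PyMC model itself: the data are simulated, and the names `W`, `n_control` and `n_treated` are hypothetical. Each treated unit gets its own weight vector over the control units, drawn here from a flat Dirichlet so that it is non-negative and sums to one.

```python
import numpy as np

rng = np.random.default_rng(42)
n_time, n_control, n_treated = 50, 4, 2

# Hypothetical control-unit outcomes, shape (time, control_units)
X = rng.normal(size=(n_time, n_control))

# One Dirichlet draw per treated unit. rng.dirichlet returns shape
# (treated_units, control_units), so transpose to get the
# (control_units, treated_units) layout described above.
W = rng.dirichlet(np.ones(n_control), size=n_treated).T

# Synthetic control prediction for every treated unit at once,
# shape (time, treated_units)
mu = X @ W
```

Each column of `W` is an independent (unpooled) weight vector. One design point to keep in mind: PyMC's `Dirichlet` places the simplex along the last axis, so an actual model may need the weight variable declared with the control-unit dimension last and transposed before the dot product.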

The `WeightedSumFitter.build_model` method would also change to reflect that the raw data would no longer be in long form: the incoming data (currently a design matrix `X`) would instead be a 2D matrix, probably with dims `("time", "unit")`.
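As a rough picture of what a `("time", "unit")` layout could look like, the sketch below pivots a small long-form table into wide form with pandas and then splits it into treated and control columns. The column names (`time`, `unit`, `y`), the unit labels and the treated/control split are all hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical long-form data: one row per (time, unit) observation
long_df = pd.DataFrame(
    {
        "time": [0, 0, 0, 1, 1, 1],
        "unit": ["a", "b", "c", "a", "b", "c"],
        "y": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
    }
)

# Wide form: rows are time points, columns are units
wide = long_df.pivot(index="time", columns="unit", values="y")

# Assumed split into treated and control units
treated_units = ["a"]
control_units = ["b", "c"]

y_treated = wide[treated_units].to_numpy()  # shape (time, treated_units)
X_control = wide[control_units].to_numpy()  # shape (time, control_units)
```

With the data in this form, selecting treated and control units is a matter of column selection, which is one reason a formula-based design matrix may no longer be needed.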

Changes to the `SyntheticControl` class

`SyntheticControl` would no longer inherit from the `PrePostFit` class, so all the logic currently in `PrePostFit.__init__` would move to the new `SyntheticControl.__init__`. This would leave `InterruptedTimeSeries` as the only class that inherits from `PrePostFit`, so there would be an opportunity to collapse that class hierarchy, but that is a peripheral issue. The core point is that `SyntheticControl` would change a lot.
The incoming dataframe is still split into pre and post treatment
Remove the formula argument and no longer use a design matrix approach (with patsy). This would result in quite a lot of change to the logic in SyntheticControl.__init__
Update the _bayesian_plot method.

Changes to tests

Update all the integration tests to deal with the changed API
Add new tests to cover the new multiple treated unit case

Changes to docs

We'd have to update the docs to use the new API.
We would also want to update the existing multi-cell geolift analysis docs.


drbenvincent added enhancement New feature or request geo project Related to geo-testing labels Apr 19, 2025

drbenvincent self-assigned this Apr 19, 2025

This was referenced Apr 19, 2025

Separate SyntheticControl and InterruptedTimeSeries classes, removing the PrePostFit abstract class #457

Closed

API change for the SyntheticControl experiment class #460

Draft
