Upgrade synthetic control to model multiple treated units #456

Open
drbenvincent opened this issue Apr 19, 2025 · 0 comments
Assignees
Labels
enhancement New feature or request geo project Related to geo-testing

What

Currently, the synthetic control functionality is limited to a single treated unit. One treated unit is the minimum needed for a working synthetic control solution, and this already offers non-trivial functionality: we have docs with a generic example using simulated data, and another on the effects of Brexit (where the UK is the only treated unit).

However, there are many situations with more than one treated unit. This can happen in many domains, but it is especially common in marketing geolift tests. We have a docs page on geolift with a single treated geo, and another on multi-cell geolift analysis with multiple treated geos. The latter currently walks through a pooled analysis, where we simply average the outcome variable across the treated geos and then model the result as a single treated unit with synthetic control. The alternative is to treat the geos as unpooled, in which case we simply run multiple independent single-treated-unit synthetic control analyses.

Why

This issue proposes that we add the ability to model multiple treated units (or geos). There are several motivations:

  • it is a more general solution
  • it would allow a single modeling approach to geo testing (or any other multiple treatment unit situation)
  • it would retain the full flexibility of the pooled and unpooled analysis approaches, and would newly allow partially pooled analysis, where information is shared across weights
  • it would lay the foundation for implementing synthetic difference-in-differences (see "Add Synthetic Difference-in-Differences", #47)

Changes

Changes to the WeightedSumFitter class

This PyMC model class would need to change so that it uses a weight matrix rather than a weight vector. The current build_model method is:

# Excerpt from WeightedSumFitter; assumes `import numpy as np` and `import pymc as pm`.
def build_model(self, X, y, coords):
    """
    Defines the PyMC model
    """
    with self:
        self.add_coords(coords)
        n_predictors = X.shape[1]
        X = pm.Data("X", X, dims=["obs_ind", "coeffs"])
        y = pm.Data("y", y[:, 0], dims="obs_ind")
        # TODO: we should allow user-specified priors here
        beta = pm.Dirichlet("beta", a=np.ones(n_predictors), dims="coeffs")
        # beta = pm.Dirichlet(
        #     name="beta", a=(1 / n_predictors) * np.ones(n_predictors),
        #     dims="coeffs"
        # )
        sigma = pm.HalfNormal("sigma", 1)
        mu = pm.Deterministic("mu", pm.math.dot(X, beta), dims="obs_ind")
        pm.Normal("y_hat", mu, sigma, observed=y, dims="obs_ind")

So rather than dims="coeffs" (where coeffs correspond to control units), it would be dims=("control_units", "treated_units"). This would give us an unpooled set of weights: one weight per control unit for each treated unit. A later step could then implement partial pooling over these weights (across the treated_units dimension).
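To make the shapes concrete, here is a minimal NumPy sketch of the forward computation only; it is not the PyMC model. The sizes, the random data, and the variable names are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_time, n_control, n_treated = 50, 4, 2

# Control-unit outcomes in wide form: one row per time point,
# one column per control unit.
X = rng.normal(size=(n_time, n_control))

# Unpooled weights: one independent Dirichlet draw per treated unit.
# rng.dirichlet returns shape (n_treated, n_control) with each row summing
# to 1, so we transpose to get the ("control_units", "treated_units") layout.
beta = rng.dirichlet(np.ones(n_control), size=n_treated).T

# Synthetic control for every treated unit at once:
# (time, control_units) @ (control_units, treated_units) -> (time, treated_units)
mu = X @ beta

print(beta.shape, mu.shape)
```

One detail to watch in a PyMC version: pm.Dirichlet applies the simplex constraint along the last axis, so the control-unit dimension would probably need to come last in the random variable (for example dims=("treated_units", "control_units")), with a transpose inside the dot product.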

The WeightedSumFitter.build_model method would also need to change because the raw data would no longer be in long form: the incoming data (currently a design matrix X) would instead be a 2D matrix, probably with dims ("time", "unit").
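As a rough illustration of the long-to-wide change, the sketch below pivots a small long-form table into a ("time", "unit") matrix with pandas. The column names time, unit, and y and the values are hypothetical; the real data layout may differ.

```python
import pandas as pd

# Hypothetical long-form data: one row per (time, unit) observation.
long_df = pd.DataFrame(
    {
        "time": [0, 0, 0, 1, 1, 1],
        "unit": ["a", "b", "c", "a", "b", "c"],
        "y": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
    }
)

# Wide form: rows are time points, columns are units.
wide_df = long_df.pivot(index="time", columns="unit", values="y")

print(wide_df.shape)
```

With the data in this shape, the control and treated units are simply subsets of the columns, so no formula or design matrix is needed to separate predictors from outcomes.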

Changes to the SyntheticControl class

  • SyntheticControl would no longer inherit from the PrePostFit class, so all the logic currently in PrePostFit.__init__ would move to the new SyntheticControl.__init__. This would leave InterruptedTimeSeries as the only class that inherits from PrePostFit, so there would be an opportunity to collapse that class hierarchy, but that is a peripheral issue. The core point is that SyntheticControl would change a lot.
  • The incoming dataframe is still split into pre and post treatment
  • Remove the formula argument and no longer use a design matrix approach (with patsy). This would result in quite a lot of change to the logic in SyntheticControl.__init__
  • Update the _bayesian_plot method.
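The data handling in a formula-free SyntheticControl.__init__ might look roughly like the sketch below. The helper name split_pre_post, the control_units and treated_units arguments, and the convention that the post period starts at treatment_time are all assumptions for illustration, not the existing or planned API.

```python
import pandas as pd


def split_pre_post(data, treatment_time, control_units, treated_units):
    """Hypothetical helper: split wide-form data into pre/post periods and
    into control (predictor) and treated (outcome) columns."""
    pre = data[data.index < treatment_time]
    post = data[data.index >= treatment_time]
    return {
        "X_pre": pre[control_units],
        "y_pre": pre[treated_units],
        "X_post": post[control_units],
        "y_post": post[treated_units],
    }


# Small made-up dataset: rows are time points, columns are units.
data = pd.DataFrame(
    {
        "c1": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5],
        "c2": [2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
        "t1": [1.5, 1.6, 1.7, 1.8, 2.4, 2.5],
        "t2": [1.4, 1.5, 1.6, 1.7, 2.2, 2.3],
    },
    index=range(6),
)

parts = split_pre_post(
    data, treatment_time=4, control_units=["c1", "c2"], treated_units=["t1", "t2"]
)
print({name: part.shape for name, part in parts.items()})
```

Selecting columns by name replaces the role the formula currently plays in deciding which units are predictors and which are outcomes.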

Changes to tests

  • Update all the integration tests to deal with the changed API
  • Add new tests to cover the new multiple treated unit case

Changes to docs

  • We'd have to update the docs to use the new API.
  • We would also want to update the existing multi-cell geolift analysis docs.