Understanding Synthetic Control Methods | by Matteo Courthoud | Jul, 2022

By Jessie Hobb On Jul 30, 2022

A detailed guide to one of the most popular causal inference techniques in the industry

Cover, image by Author

It is now widely accepted that the gold standard technique to compute the causal effect of a treatment (a drug, ad, product, …) on an outcome of interest (a disease, firm revenue, customer satisfaction, …) is AB testing, a.k.a. randomized experiments. We randomly split a set of subjects (patients, users, customers, …) into a treatment and a control group and give the treatment to the treatment group. This procedure ensures that ex-ante, the only expected difference between the two groups is caused by the treatment.

One of the key assumptions in AB testing is that there is no contamination between the treatment and control group. Giving a drug to one patient in the treatment group does not affect the health of patients in the control group. This might not be the case for example if we are trying to cure a contagious disease and the two groups are not isolated. In the industry, frequent violations of the contamination assumption are network effects — my utility of using a social network increases as the number of friends on the network increases — and general equilibrium effects — if I improve one product, it might decrease the sales of another similar product.

Because of this reason, often experiments are carried out at a sufficiently large scale so that there is no contamination across groups, such as cities, states, or even countries. Then another problem arises because of the larger scale: the treatment becomes more expensive. Giving a drug to 50% of patients in a hospital is much less expensive than giving a drug to 50% of cities in a country. Therefore, often only a few units are treated but often over a longer period of time.

In these settings, a very powerful method emerged around 10 years ago: Synthetic Control. The idea of synthetic control is to exploit the temporal variation in the data instead of the cross-sectional one (across time instead of across units). This method is extremely popular in the industry — e.g. in companies like Google, Uber, Facebook, Microsoft, and Amazon — because it is easy to interpret and deals with a setting that emerges often at large scales. In this post we are going to explore this technique by means of example: we will investigate the effectiveness of self-driving cars for a ride-sharing platform.

Suppose you were a ride-sharing platform and you wanted to test the effect of self-driving cars in your fleet.

As you can imagine, there are many limitations to running an AB/test for this type of feature. First of all, it’s complicated to randomize individual rides. Second, it’s a very expensive intervention. Third, and statistically most important, you cannot run this intervention at the ride level. The problem is that there are spillover effects from treated to control units: if indeed self-driving cars are more efficient, it means that they can serve more customers in the same amount of time, reducing the customers available to normal drivers (the control group). This spillover contaminates the experiment and prevents a causal interpretation of the results.

For all these reasons, we select only one city. Given the synthetic vibe of the article we cannot but select… (drum roll)… Miami!

I generate a simulated dataset in which we observe a panel of U.S. cities over time. The revenue data is made up, while the socio-economic variables are taken from the OECD 2022 Metropolitan Areas database. I import the data generating process dgp_selfdriving() from src.dgp. I also import some plotting functions and libraries from src.utils.

from src.utils import *
from src.dgp import dgp_selfdrivingtreatment_year = 2013
treated_city = 'Miami'
df = dgp_selfdriving().generate_data(year=treatment_year, city=treated_city)
df.head()

We have information on the largest 46 U.S. cities for the period 2002–2019. The panel is balanced, which means that we observe all cities for all time periods. Self-driving cars were introduced in 2013.

Is the treated unit, Miami, comparable to the rest of the sample? Let’s use the create_table_one function from Uber’s causalml package to produce a covariate balance table, containing the average value of our observable characteristics, across treatment and control groups. As the name suggests, this should always be the first table you present in causal inference analysis.

from causalml.match import create_table_onecreate_table_one(df, 'treated', ['density', 'employment', 'gdp', 'population', 'revenue'])

As expected, the groups are not balanced: Miami is more densely populated, poorer, larger, and has a lower employment rate than the other cities in the US in our sample.

We are interested in understanding the impact of the introduction of self-driving cars on revenue.

One initial idea could be to analyze the data as we would in an A/B test, comparing the control and treatment group. We can estimate the treatment effect as a difference in means in revenue between the treatment and control groups, after the introduction of self-driving cars.

smf.ols('revenue ~ treated', data=df[df['post']==True]).fit().summary().tables[1]

The effect of self-driving cars seems to be negative but not significant.

The main problem here is that treatment was not randomly assigned. We have a single treated unit, Miami, and it’s hardly comparable to other cities.

One alternative procedure is to compare revenue before and after the treatment, within the city of Miami.

smf.ols('revenue ~ post', data=df[df['city']==treated_city]).fit().summary().tables[1]

The effect of self-driving cars seems to be positive and statistically significant.

However, the problem with this procedure is that there might have been many other things happening after 2013. It’s quite a stretch to attribute all differences to self-driving cars.

We can better understand this concern if we plot the time trend of revenue over cities. First, we need to reshape the data into a wide format, with one column per city and one row per year.

df = df.pivot(index='year', columns='city', values='revenue').reset_index()

Now, let’s plot the revenue over time for Miami and for the other cities.

cities = [c for c in df.columns if c!='year']
df['Other Cities'] = df[[c for c in cities if c != treated_city]].mean(axis=1)

The answer is yes! Synthetic control allows us to do causal inference when we have as few as one treated unit and many control units and we observe them over time. The idea is simple: combine untreated units so that they mimic the behavior of the treated unit as closely as possible, without the treatment. Then use this “synthetic unit” as a control. The method was first introduced by Abadie, Diamond, and Hainmueller (2010) and has been called “the most important innovation in the policy evaluation literature in the last few years”. Moreover, it is widely used in the industry because of its simplicity and interpretability.

Setting

We assume that for a panel of i.i.d. subjects i = 1, …, n over time t=1, …,T we observed a set of variables (Xᵢₜ,Dᵢ,Yᵢₜ) that includes

a treatment assignment Dᵢ∈{0,1} (treated)
a response Yᵢₜ∈ℝ(revenue)
a feature vector Xᵢₜ∈ℝⁿ (population, density, employment and GDP)

Moreover, one unit (Miami in our case) is treated at time t* (2013 in our case). We distinguish time periods before treatment and time periods after treatment.

Crucially, treatment D is not randomly assigned, therefore a difference in means between the treated unit(s) and the control group is not an unbiased estimator of the average treatment effect.

The Problem

The problem is that, as usual, we do not observe the counterfactual outcome for treated units, i.e. we do not know what would have happened to them if they had not been treated. This is known as the fundamental problem of causal inference.

The simplest approach would be just to compare pre- and post-treatment periods. This is called the event study approach.

However, we can do better than this. In fact, even though treatment was not randomly assigned, we still have access to some units that were not treated.

For the outcome variable, we observe the following values

Where Y⁽ᵈ⁾ₐ,ₜ is the outcome at time t, given treatment assignment a and treatment status d. We basically have a missing data problem since we do not observe Y⁽⁰⁾ₜ,ₚₒₛₜ: what would have happened to treated units (a=t) without treatment (d=0).

The Solution

Following Doudchenko and Inbens (2018), we can formulate an estimate of the counterfactual outcome for the treated unit as a linear combination of the observed outcomes for the control units.

Synthetic control outcome, image by Author

where

the constant α allows for different averages between the two groups
the weights βᵢ are allowed to vary across control units i (otherwise, it would be a difference-in-differences)

How should we choose which weights to use? We want our synthetic control to approximate the outcome as closely as possible, before the treatment. The first approach could be to define the weights as

Synthetic control weights, image by Author

I.e. the weights are such that they minimize the distance between observable characteristics of control units Xc and the treated unit Xₜ before the treatment.

You might notice a very close similarity to linear regression. Indeed, we are doing something very similar.

In linear regression, we usually have many units (observations), few exogenous features, and one endogenous feature and we try to express the endogenous feature as a linear combination of the endogenous features, for each unit.

With synthetic control, we instead have many time periods (features), few control units and a single treated unit and we try to express the treated unit as a linear combination of the control units, for each time period.

To perform the same operation, we essentially need to transpose the data.

After the swap, we compute the synthetic control weights, exactly as we would compute regression coefficients. However, now one observation is a time period and one feature is a unit.

Synthetic control regression, image by Author

Notice that this swap is not innocent. In linear regression we assume that the relationship between the exogenous features and the endogenous feature is the same across units, instead in synthetic control we assume that the relationship between the treated units and the control unit is the same over time.

Back to self-driving cars

Let’s go back to the data now! First, we write a synth_predict function that takes as input a model that is trained on control cities and tries to predict the outcome of the treated city, Miami, before the introduction of self-driving cars.

png — Observed and synthetic Miami revenue, image by Author

png — Observed and synthetic Miami revenue, image by Author

What about inference? Is the estimate significantly different from zero? Or, more practically, “how unusual is this estimate under the null hypothesis of no policy effect?”.

We are going to perform a randomization/permutation test in order to answer this question. The idea is that if the policy has no effect, the effect we observe for Miami should not be significantly different from the effect we observe for any other city.

Therefore, we are going to replicate the procedure above, but for all other cities and compare them with the estimate for Miami.

fig, ax = plt.subplots()
for city in cities:
synth_predict(df, SyntheticControl(), city, treatment_year)
plot_difference(df, city, treatment_year, vline=False, alpha=0.2, color='C1', lw=3)
plot_difference(df, treated_city, treatment_year)

png — Distribution of synthetic effects, image by Author

From the graph, we notice two things. First, the effect for Miami is quite extreme and therefore likely not to be driven by random noise.

Second, we also notice that there are a couple of cities for which we cannot fit the pre-trend very well. In particular, there is a line that is sensibly lower than all others. This is expected since, for each city, we are building the counterfactual trend as a convex combination of all other cities. Cities that are quite extreme in terms of revenue are very useful to build the counterfactuals of other cities, but it’s hard to build a counterfactual for them.

Not to bias the analysis, let’s exclude states for which we cannot build a “good enough” counterfactual, in terms of pre-treatment MSE.

Pre-treatment prediction mean squared error, image by Author

As a rule of thumb, Abadie, Diamond, and Hainmueller (2010) suggest excluding units for which the prediction MSE is larger than twice the MSE of the treated unit.

# Reference mse
mse_treated = synth_predict(df, SyntheticControl(), treated_city, treatment_year).mse# Other mse
fig, ax = plt.subplots()
for city in cities:
mse = synth_predict(df, SyntheticControl(), city, treatment_year).mse
if mse < 2 * mse_treated:
plot_difference(df, city, treatment_year, vline=False, alpha=0.2, color='C1', lw=3)
plot_difference(df, treated_city, treatment_year)

png — Distribution of synthetic effects without unpredictable units, image by Author

After excluding extreme observations, it looks like the effect for Miami is very unusual.

One statistic that Abadie, Diamond, and Hainmueller (2010) suggest to perform a randomization test is the ratio between pre-treatment MSE and post-treatment MSE.

Pre-post mean squared prediction error ratio, image by author

We can compute a p-value as the number of observations with a higher ratio.

png — Distribution of pre-post MSE ratios over cities, image by author.

In this article, we have explored a very popular method for causal inference when we have few treated units, but many time periods. This setting emerges often in industry settings when the treatment has to be assigned at the aggregate level and randomization might not be possible. The key idea of synthetic control is to combinate control units into one synthetic control unit to use as counterfactual to estimate the causal effect of the treatment.

One of the main advantages of synthetic control is that, as long as we use positive weights that are constrained to sum to one, the method avoids extrapolation: we will never go out of the support of the data. Moreover, synthetic control studies can be “pre-registered”: you can specify the weights before the study to avoid p-hacking and cherry-picking. Another reason why this method is so popular in the industry is that weights make the counterfactual analysis explicit: one can look at the weights and understand which comparison we are making.

This method is relatively young and many extensions are appearing every year. Some notable ones are the generalized synthetic control by Xu (2017), the synthetic difference-in-differences by Doudchenko and Imbens (2017), and the penalized synthetic control of Abadie e L’Hour (2020), and the matrix completion methods of Athey et al. (2021). Last but not least, if you want to have the method explained by one of its inventors, there is this great lecture by Alberto Abadie at the NBER Summer Institute freely available on Youtube.

References

[1] A. Abadie, A. Diamond, J. Hainmueller, Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program (2010), Journal of the American Statistical Association.

[2] A. Abadie, Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects (2021), Journal of Economic Perspectives.

[3] N. Doudchenko, G. Imbens, Balancing, Regression, Difference-In-Differences and Synthetic Control Methods: A Synthesis (2017), working paper.

[4] Y. Xu, Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models (2018), Political Analysis.

[5] A. Abadie, J. L’Hour, A Penalized Synthetic Control Estimator for Disaggregated Data (2020), Journal of the American Statistical Association.

[6] S. Athey, M. Bayati, N. Doudchenko, G. Imbens, K. Khosravi, Matrix Completion Methods for Causal Panel Data Models (2021), Journal of the American Statistical Association.

Data

Code

You can find the original Jupyter Notebook here:

Thank you for reading!

I really appreciate it! 🤗 If you liked the post and would like to see more, consider following me. I post once a week on topics related to causal inference and data analysis. I try to keep my posts simple but precise, always providing code, examples, and simulations.

Also, a small disclaimer: I write to learn so mistakes are the norm, even though I try my best. Please, when you spot them, let me know. I also appreciate suggestions on new topics!

A detailed guide to one of the most popular causal inference techniques in the industry

Suppose you were a ride-sharing platform and you wanted to test the effect of self-driving cars in your fleet.

For all these reasons, we select only one city. Given the synthetic vibe of the article we cannot but select… (drum roll)… Miami!

from src.utils import *
from src.dgp import dgp_selfdrivingtreatment_year = 2013
treated_city = 'Miami'
df = dgp_selfdriving().generate_data(year=treatment_year, city=treated_city)
df.head()

from causalml.match import create_table_onecreate_table_one(df, 'treated', ['density', 'employment', 'gdp', 'population', 'revenue'])

As expected, the groups are not balanced: Miami is more densely populated, poorer, larger, and has a lower employment rate than the other cities in the US in our sample.

We are interested in understanding the impact of the introduction of self-driving cars on revenue.

smf.ols('revenue ~ treated', data=df[df['post']==True]).fit().summary().tables[1]

The effect of self-driving cars seems to be negative but not significant.

The main problem here is that treatment was not randomly assigned. We have a single treated unit, Miami, and it’s hardly comparable to other cities.

One alternative procedure is to compare revenue before and after the treatment, within the city of Miami.

smf.ols('revenue ~ post', data=df[df['city']==treated_city]).fit().summary().tables[1]

The effect of self-driving cars seems to be positive and statistically significant.

However, the problem with this procedure is that there might have been many other things happening after 2013. It’s quite a stretch to attribute all differences to self-driving cars.

We can better understand this concern if we plot the time trend of revenue over cities. First, we need to reshape the data into a wide format, with one column per city and one row per year.

df = df.pivot(index='year', columns='city', values='revenue').reset_index()

Now, let’s plot the revenue over time for Miami and for the other cities.

cities = [c for c in df.columns if c!='year']
df['Other Cities'] = df[[c for c in cities if c != treated_city]].mean(axis=1)

Setting

We assume that for a panel of i.i.d. subjects i = 1, …, n over time t=1, …,T we observed a set of variables (Xᵢₜ,Dᵢ,Yᵢₜ) that includes

a treatment assignment Dᵢ∈{0,1} (treated)
a response Yᵢₜ∈ℝ(revenue)
a feature vector Xᵢₜ∈ℝⁿ (population, density, employment and GDP)

Moreover, one unit (Miami in our case) is treated at time t* (2013 in our case). We distinguish time periods before treatment and time periods after treatment.

Crucially, treatment D is not randomly assigned, therefore a difference in means between the treated unit(s) and the control group is not an unbiased estimator of the average treatment effect.

The Problem

The simplest approach would be just to compare pre- and post-treatment periods. This is called the event study approach.

However, we can do better than this. In fact, even though treatment was not randomly assigned, we still have access to some units that were not treated.

For the outcome variable, we observe the following values

The Solution

Following Doudchenko and Inbens (2018), we can formulate an estimate of the counterfactual outcome for the treated unit as a linear combination of the observed outcomes for the control units.

where

the constant α allows for different averages between the two groups
the weights βᵢ are allowed to vary across control units i (otherwise, it would be a difference-in-differences)

I.e. the weights are such that they minimize the distance between observable characteristics of control units Xc and the treated unit Xₜ before the treatment.

You might notice a very close similarity to linear regression. Indeed, we are doing something very similar.

To perform the same operation, we essentially need to transpose the data.

After the swap, we compute the synthetic control weights, exactly as we would compute regression coefficients. However, now one observation is a time period and one feature is a unit.

Back to self-driving cars

png — Observed and synthetic Miami revenue, image by Author

png — Observed and synthetic Miami revenue, image by Author

What about inference? Is the estimate significantly different from zero? Or, more practically, “how unusual is this estimate under the null hypothesis of no policy effect?”.

Therefore, we are going to replicate the procedure above, but for all other cities and compare them with the estimate for Miami.

fig, ax = plt.subplots()
for city in cities:
synth_predict(df, SyntheticControl(), city, treatment_year)
plot_difference(df, city, treatment_year, vline=False, alpha=0.2, color='C1', lw=3)
plot_difference(df, treated_city, treatment_year)

png — Distribution of synthetic effects, image by Author

From the graph, we notice two things. First, the effect for Miami is quite extreme and therefore likely not to be driven by random noise.

Not to bias the analysis, let’s exclude states for which we cannot build a “good enough” counterfactual, in terms of pre-treatment MSE.

As a rule of thumb, Abadie, Diamond, and Hainmueller (2010) suggest excluding units for which the prediction MSE is larger than twice the MSE of the treated unit.

# Reference mse
mse_treated = synth_predict(df, SyntheticControl(), treated_city, treatment_year).mse# Other mse
fig, ax = plt.subplots()
for city in cities:
mse = synth_predict(df, SyntheticControl(), city, treatment_year).mse
if mse < 2 * mse_treated:
plot_difference(df, city, treatment_year, vline=False, alpha=0.2, color='C1', lw=3)
plot_difference(df, treated_city, treatment_year)

png — Distribution of synthetic effects without unpredictable units, image by Author

After excluding extreme observations, it looks like the effect for Miami is very unusual.

One statistic that Abadie, Diamond, and Hainmueller (2010) suggest to perform a randomization test is the ratio between pre-treatment MSE and post-treatment MSE.

We can compute a p-value as the number of observations with a higher ratio.

png — Distribution of pre-post MSE ratios over cities, image by author.

References

[2] A. Abadie, Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects (2021), Journal of Economic Perspectives.

[3] N. Doudchenko, G. Imbens, Balancing, Regression, Difference-In-Differences and Synthetic Control Methods: A Synthesis (2017), working paper.

[4] Y. Xu, Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models (2018), Political Analysis.

[5] A. Abadie, J. L’Hour, A Penalized Synthetic Control Estimator for Disaggregated Data (2020), Journal of the American Statistical Association.

[6] S. Athey, M. Bayati, N. Doudchenko, G. Imbens, K. Khosravi, Matrix Completion Methods for Causal Panel Data Models (2021), Journal of the American Statistical Association.

Data

Code

You can find the original Jupyter Notebook here:

Thank you for reading!

Also, a small disclaimer: I write to learn so mistakes are the norm, even though I try my best. Please, when you spot them, let me know. I also appreciate suggestions on new topics!

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – [email protected]. The content will be deleted within 24 hours.

Understanding Synthetic Control Methods | by Matteo Courthoud | Jul, 2022

A detailed guide to one of the most popular causal inference techniques in the industry

Setting

The Problem

The Solution

Back to self-driving cars

Restricted Weights

References

Related Articles

Data

Code

Thank you for reading!

A detailed guide to one of the most popular causal inference techniques in the industry

Setting

The Problem

The Solution

Back to self-driving cars

Restricted Weights

References

Related Articles

Data

Code

Thank you for reading!