Causal Inference Using Synthetic Control | by diksha tiwari | Nov, 2022

By Jessie Hobb On Nov 29, 2022

Exploring one of the latest quasi-experimentation techniques

It is widely accepted that A/B testing is the gold standard for causal inference. Also known as Randomized Controlled Trials (RCTs), these tests split the subjects randomly into treatment and control groups. Randomization ensures that, on average, any difference in outcomes between the groups is due to the applied treatment. A/B testing has been widely adopted by businesses to test new products, features, and marketing strategies. It helps them capture customer reactions, issues with the product, etc. early in a product or strategy cycle.

However, there are numerous situations where randomly splitting the subjects into treatment and control groups may not be the best solution. An example is social media testing, where network effects may lead to contamination between test and control. Similarly, randomization may raise ethical concerns (for example, in medical trials), may be too expensive, or may be infeasible due to technical limitations. In these situations we use quasi-experimental techniques such as Difference in Differences (DID) analysis, matched pair testing, regression, etc. The synthetic control method, which is the focus of this article, is one such technique. This article explores:

1) the details of the synthetic control method,

2) its advantages and disadvantages, and

3) the data requirements for the technique

Synthetic Control Method (SCM):

What is a synthetic control method?

Synthetic control methods were originally proposed in Abadie and Gardeazabal (2003) with the aim of estimating the effects of aggregate interventions on some aggregate outcome of interest [1]. Here, aggregate interventions are interventions that are implemented at an aggregate level and affect a small number of large units (such as cities, regions, or countries). The method is based on the idea that when observations are made at the level of aggregate entities, a combination of unaffected units may provide a more appropriate comparison than any single unaffected unit. To put it simply, it compares the treated unit to a weighted combination of control units. SCM has two major advantages over traditional quasi-experimental techniques:

1) It can account for the effect of confounders changing over time, whereas DID relies on the parallel trends assumption, i.e. that without the intervention, outcomes for the treated and control groups would have followed parallel trajectories over time [1].

2) This methodology formalizes the selection of the comparison units using a data-driven procedure [2].

Model details:

Let us assume that there are J units, with j=1 being the treatment unit and j=2, …, J being the control units. Let Y be the outcome variable, and let Y_(1t)^N be the value of the outcome variable that would have been observed for the treatment unit at time t in the absence of the intervention. Let T_0 be the time of intervention, Y_(1t) be the value of the outcome variable observed for the treatment unit after the intervention, and Y_(jt) be the value of the outcome variable for control unit j at time t. Using w_j as the weight associated with control unit j, the value of Y_(1t)^N can be represented as below:

$$Y_{1t}^{N} = \sum_{j=2}^{J} w_j \, Y_{jt} \qquad (1)$$

If τ_(1t) is the impact of the intervention on the treatment unit at time t > T_0, then

$$\tau_{1t} = Y_{1t} - Y_{1t}^{N} \qquad (2)$$
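The counterfactual and impact definitions above can be illustrated with a few lines of NumPy. This is only a minimal sketch: the weights are assumed to be known already, the function names are hypothetical, and the numbers are made up for illustration rather than taken from the Proposition 99 data.

```python
import numpy as np

def synthetic_outcome(Y_controls, w):
    """Weighted combination of control outcomes.

    Y_controls: array of shape (T, J-1), one column per control unit.
    w:          array of shape (J-1,), one weight per control unit.
    """
    return np.asarray(Y_controls, dtype=float) @ np.asarray(w, dtype=float)

def treatment_effect(y_treated, Y_controls, w):
    """Observed treated outcome minus its synthetic counterfactual."""
    return np.asarray(y_treated, dtype=float) - synthetic_outcome(Y_controls, w)

# Made-up post-intervention outcomes for two control units over three periods.
Y_controls = np.array([[100.0, 120.0],
                       [98.0, 118.0],
                       [96.0, 116.0]])
w = np.array([0.25, 0.75])                 # assumed weights, summing to 1
y_treated = np.array([115.0, 105.0, 95.0])  # made-up treated outcomes

effect = treatment_effect(y_treated, Y_controls, w)
print(effect)
```

A negative entry in `effect` means the treated unit's outcome is below what the synthetic control suggests it would have been without the intervention.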

Here, Y_(1t) is observed directly after the intervention, whereas Y_(1t)^N is obtained from Eq. (1). The question remains: how do we obtain the weights for the above equation? Abadie, Diamond, and Hainmueller (2010) propose choosing the weights that minimize the distance between the pre-intervention characteristics of the treated unit and those of the weighted controls:

$$W^{*} = \arg\min_{W} \left\lVert X_{t,\mathrm{pre}} - X_{c,\mathrm{pre}} W \right\rVert$$

Here W is a (J-1)x1 matrix of weights w_j, X_(t,pre) is the vector of pre-intervention characteristics for the exposed region, and X_(c,pre) is the matrix of the same pre-intervention characteristics for the controls, with one column per control unit.

The pre-intervention characteristics, a.k.a. covariates, can be any variables that appropriately represent the treatment unit. For instance, in Abadie, Diamond, and Hainmueller (2010), while estimating the impact of Proposition 99 on California, the covariates used were the average retail price of cigarettes, the mean of per capita state personal income (logged), the percentage of the population aged 15–24, and the mean of per capita beer consumption, all measured over the pre-intervention period. These variables were augmented by three years of lagged smoking consumption (which is also the outcome variable). One may use any number of years of lagged data to model the treatment unit [4].
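To make the construction of these characteristics concrete, here is a small pandas sketch that builds one column of predictors per unit from a long-format panel: pre-intervention means of the covariates plus the outcome in a few chosen lag years. The function name, the column names (`state`, `year`, `cigsale`, `retprice`), and the toy values are all assumptions made for illustration; they are not the layout of the data used later in this article.

```python
import pandas as pd

def build_predictors(panel, unit_col, time_col, outcome_col,
                     covariate_cols, lag_years, t0):
    """Return a (predictors x units) table of pre-intervention characteristics."""
    pre = panel[panel[time_col] < t0]
    # Pre-intervention mean of each covariate, one row per unit.
    means = pre.groupby(unit_col)[covariate_cols].mean()
    # Outcome in the selected lag years, one column per lag year.
    lags = (pre[pre[time_col].isin(lag_years)]
            .pivot(index=unit_col, columns=time_col, values=outcome_col))
    lags.columns = [f"{outcome_col}_{year}" for year in lags.columns]
    # Transpose so that each column is one unit, as in X_(t,pre) and X_(c,pre).
    return means.join(lags).T

# Toy panel with two hypothetical units, "A" and "B".
panel = pd.DataFrame({
    "state": ["A"] * 4 + ["B"] * 4,
    "year": [1986, 1987, 1988, 1989] * 2,
    "cigsale": [100.0, 98.0, 96.0, 80.0, 120.0, 118.0, 116.0, 114.0],
    "retprice": [1.0, 1.2, 1.4, 2.0, 0.9, 1.0, 1.1, 1.2],
})

X = build_predictors(panel, "state", "year", "cigsale",
                     covariate_cols=["retprice"],
                     lag_years=[1986, 1988], t0=1989)
print(X)
```

The column for the treated unit plays the role of X_(t,pre), and the remaining columns together form X_(c,pre).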

The formula for the weights, although quite similar to linear regression, has subtle differences. The model imposes the following constraints, which distinguish it from the classical linear regression model:

The last two constraints safeguard the method against extrapolation. Because a synthetic control is a weighted average of the available control units, this method makes explicit: (1) the relative contribution of each control unit to the counterfactual of interest; and (2) the similarities (or lack thereof) between the unit affected by the event or intervention of interest and the synthetic control, in terms of pre-intervention outcomes and other predictors of post-intervention outcomes. Relative to traditional regression methods, transparency and a safeguard against extrapolation are the two attractive features of the method [4].
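As a rough sketch of how such constrained weights could be computed, the snippet below minimizes the squared distance between the treated unit's characteristics and the weighted controls, with the weights restricted to be non-negative and to sum to one (the restrictions used in Abadie, Diamond, and Hainmueller (2010)). It is a simplification: every characteristic is weighted equally, whereas the paper also weights the predictors by importance. The function name and toy numbers are assumptions, and SciPy's SLSQP solver is just one convenient choice.

```python
import numpy as np
from scipy.optimize import minimize

def fit_weights(x_treated, X_controls):
    """Weights w >= 0 with sum(w) = 1 minimizing ||x_treated - X_controls @ w||^2.

    x_treated:  array of shape (k,), characteristics of the treated unit.
    X_controls: array of shape (k, J-1), one column per control unit.
    """
    x_treated = np.asarray(x_treated, dtype=float)
    X_controls = np.asarray(X_controls, dtype=float)
    n_controls = X_controls.shape[1]

    def loss(w):
        diff = x_treated - X_controls @ w
        return float(diff @ diff)

    result = minimize(
        loss,
        x0=np.full(n_controls, 1.0 / n_controls),   # start from equal weights
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_controls,            # non-negative weights
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
        options={"ftol": 1e-12, "maxiter": 500},
    )
    return result.x

# Toy example: the treated unit is an exact 50/50 mix of the first two controls.
X_controls = np.array([[1.0, 3.0, 10.0],
                       [2.0, 4.0, 0.0]])
x_treated = np.array([2.0, 3.0])

w = fit_weights(x_treated, X_controls)
print(np.round(w, 3))
```

Because the weights are confined to the simplex, the synthetic control always stays within the range spanned by the control units, which is the safeguard against extrapolation described above.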

Implementation Example:

For this exercise, I have used publicly available data, the details of which are described in [6]. The code from the article Understanding Synthetic Control Methods has been used for this example.

In this example, we will try to estimate the impact of Proposition 99 on annual per capita cigarette consumption at the state level, which is measured as per capita cigarette sales in our dataset. Thus, our outcome variable of interest is “annual per capita cigarette sales”. The sample period begins in 1970 and ends in 2000. Proposition 99 went into effect in California in January 1989. Let us start by looking at the contextual requirements of this method [1]:

1) The effect is large relative to the volatility of the outcome. Small effects on a highly volatile outcome will be difficult to measure with this method.

2) A comparison group is available, i.e. not all units adopt interventions similar to that of the treatment group. For this example, the states that introduced formal state-wide tobacco control programs or increased tax on cigarettes by more than 50 cents during the time period of our study were excluded.

3) There is no spillover effect on the control units, i.e. implementing the intervention on the treatment unit does not impact the outcome variable of interest in the control units. This example assumes that there is no spillover effect between treatment and control units.

4) Before the intervention, the differences between the characteristics of the synthetic control and those of the affected unit are small, i.e. X_(t,pre) ≈ X_(c,pre)W.
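One way to put a number on the last requirement is to measure how closely the synthetic control tracks the treated unit before the intervention, for example with the root mean squared prediction error (RMSPE) of the pre-intervention outcome. The article does not prescribe this particular check; the function name and the toy numbers below are assumptions made for illustration.

```python
import numpy as np

def pre_period_rmspe(y_treated, Y_controls, w, years, t0):
    """RMSPE between the treated outcome and the synthetic control before t0.

    y_treated:  array of shape (T,), outcome of the treated unit.
    Y_controls: array of shape (T, J-1), one column per control unit.
    w:          array of shape (J-1,), synthetic control weights.
    years:      array of shape (T,), time index of each row.
    """
    pre = np.asarray(years) < t0
    synthetic = np.asarray(Y_controls, dtype=float)[pre] @ np.asarray(w, dtype=float)
    gap = np.asarray(y_treated, dtype=float)[pre] - synthetic
    return float(np.sqrt(np.mean(gap ** 2)))

# Made-up data: three pre-intervention years and one post-intervention year.
years = np.array([1986, 1987, 1988, 1989])
y_treated = np.array([116.0, 112.0, 112.0, 90.0])
Y_controls = np.array([[100.0, 120.0],
                       [98.0, 118.0],
                       [96.0, 116.0],
                       [94.0, 114.0]])
w = np.array([0.25, 0.75])  # assumed weights

rmspe = pre_period_rmspe(y_treated, Y_controls, w, years, t0=1989)
print(rmspe)
```

A pre-period RMSPE that is small relative to the level of the outcome suggests the synthetic control is a reasonable stand-in for the treated unit; what counts as small enough is a judgment call that depends on the data.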

Given that we have taken care of the contextual requirements, let us look at the data: