Understanding Contamination Bias | by Matteo Courthoud | Jun, 2022
Problems and solutions of linear regression with multiple treatments
In many causal inference settings, we might be interested in the effect of not just one treatment, but many mutually exclusive treatments. For example, we might want to test alternative UX designs, or drugs, or policies. Depending on the context, there might be many reasons why we want to test different treatments at the same time, but generally, it can help reduce the sample size, as we need just a single control group. A simple way to recover the different treatment effects is a linear regression of the outcome of interest on the different treatment indicators.
However, in causal inference, we often condition the analysis on other observable variables (often called control variables), either to increase power or, especially in quasi-experimental settings, to identify a causal parameter instead of a simple correlation. There are cases in which adding control variables can backfire, but otherwise, we used to think that linear regression recovers the average treatment effect.
In a breakthrough paper, Goldsmith-Pinkham, Hull and Kolesár (2022) have recently shown that in the case of multiple mutually exclusive treatments and in the presence of control variables, the regression coefficients do not identify causal effects. However, not everything is lost: the authors propose a simple solution to this problem that still makes use of linear regression.
In this blog post, I am going to go through a simple example illustrating the nature of the problem and the solution proposed by the authors.
Suppose we are an online store and we are not satisfied with our current checkout page. In particular, we would like to change our checkout button to increase the probability of a purchase. Our UX designer comes up with two alternative checkout buttons, which are displayed below.
In order to understand which button to use, we run an A/B test, or randomized controlled trial. In particular, when people arrive at the checkout page, we show them one of the three options at random. Then, for each user, we record the revenue generated, which is our outcome of interest.
I generate a synthetic dataset using dgp_buttons() from src.dgp as the data generating process. I also import plotting functions and standard libraries from src.utils.
from src.utils import *
from src.dgp import dgp_buttons

dgp = dgp_buttons()
df = dgp.generate_data()
df.head()
We have information on 1000 users, for whom we observe their checkout button (default, button1, or button2), the revenue they generate, and whether they accessed the page from desktop or mobile.
We notice too late that we have a problem with randomization: we showed button1 more frequently to desktop users and button2 more frequently to mobile users. Only the control group, which sees the default button, is balanced.
What should we do? What happens if we simply compare revenue across groups? Let's do it by regressing revenue on the group dummy variables.
smf.ols('revenue ~ group', data=df).fit().summary().tables[1]
From the regression results, we estimate a negative and significant effect at the 5% level, for both buttons. Should we believe these estimates? Are they causal?
It is unlikely that what we have estimated are the true treatment effects.
In fact, there might be substantial differences in purchase attitudes between desktop and mobile users. Since we do not have a comparable number of mobile and desktop users across treatment arms, it might be that the observed differences in revenue are due to the device used and not the button design.
Because of this, we decide to condition our analysis on the device used, and we include the mobile dummy variable in the regression.
smf.ols('revenue ~ group + mobile', data=df).fit().summary().tables[1]
Now the coefficient of button1 is positive and significant. Should we recommend its implementation?
The answer is, surprisingly, no. Goldsmith-Pinkham, Hull, Kolesár (2022) show that this type of regression does not identify the average treatment effects when:
- there are mutually exclusive treatment arms (in our case, the groups)
- we are controlling for some variable X (in our case, mobile)
- the treatment effects are heterogeneous in X
This is true even if the treatment is “as good as random” once we condition on X.
Indeed, in our case, the true treatment effects are the ones reported in the following table.
The first button has no effect on revenue, irrespective of the device, while the second button has a positive effect on mobile users and a negative effect on desktop users. Our (wrong) regression specification instead estimates a positive effect of the first button.
Let’s now dig into the math in more detail to understand why this is happening.
This section borrows heavily from Goldsmith-Pinkham, Hull, Kolesár (2022). For a great summary of the paper, I recommend this excellent Twitter thread by one of the authors, Paul Goldsmith-Pinkham.
Suppose we are interested in the effect of a treatment D on an outcome Y. First, let’s consider the standard case of a single treatment arm so that the treatment variable is binary, D∈{0,1}. Also, consider a single binary control variable X∈{0,1}. We also assume that treatment assignment is as good as random, conditionally on X. This assumption is called conditional ignorability, and it means that there might be systematic differences between the treatment and control groups, however, these differences are fully accounted for by X. Formally we write
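In the potential-outcomes notation introduced just below, the condition is:

```latex
\big(Y_i(0),\, Y_i(1)\big) \ \perp\!\!\!\perp\ D_i \ \big|\ X_i
```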
Where Yᵢ(d) denotes the potential outcome of individual i when their treatment status is d. For example, Yᵢ(0) indicates the potential outcome of individual i in case they are not treated. This notation comes from Rubin’s potential outcomes framework. We can write the individual treatment effect of individual i as
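That is, the difference between the two potential outcomes:

```latex
\tau_i = Y_i(1) - Y_i(0)
```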
In this setting, the regression of interest is
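With a single treatment and a single control variable:

```latex
Y_i = \alpha + \beta D_i + \gamma X_i + \varepsilon_i
```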
The coefficient of interest is β.
Angrist (1998) shows that the regression coefficient β identifies a causal effect: a weighted average of the group-specific average treatment effects τ(x), with convex weights. In this particular setting, we can write it as
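Following Angrist (1998), the decomposition is:

```latex
\beta = \lambda\,\tau(1) + (1-\lambda)\,\tau(0), \qquad \lambda \in [0, 1]
```

where τ(x) = E[Yᵢ(1) − Yᵢ(0) | Xᵢ = x] is the average treatment effect in group X = x.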
The weights λ and (1-λ) are given by the within-group treatment variance. Hence, the OLS estimator gives less weight to groups where we have less treatment variance, i.e., where treatment is more imbalanced. Groups where treatment is distributed 50–50 get the most weight.
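In symbols, the weight on the X = 1 group is proportional to that group's share times its conditional treatment variance:

```latex
\lambda = \frac{P(X_i = 1)\,\operatorname{Var}(D_i \mid X_i = 1)}
               {P(X_i = 1)\,\operatorname{Var}(D_i \mid X_i = 1) + P(X_i = 0)\,\operatorname{Var}(D_i \mid X_i = 0)}
```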
The weights can be derived using the Frisch-Waugh-Lovell theorem to express β₁ as the OLS coefficient of a univariate regression of Y on Dᵢ₁(⊥X), where Dᵢ₁(⊥X) are the residuals from regressing D₁ on X. If you are not familiar with the Frisch-Waugh-Lovell theorem, I wrote an introductory blog post here.
The second term of the central expression disappears because the residual Dᵢ₁(⊥X) is by construction mean independent of the control variable X, i.e.
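In symbols:

```latex
\mathbb{E}\big[D_{i1}(\perp X) \,\big|\, X_i\big] = 0
```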
This mean independence property is crucial to obtaining an unbiased estimate and its failure in the multiple-treatment case is the source of the contamination bias.
Let’s now consider the case of multiple treatment arms, D∈{0,1,2}, where 1 and 2 indicate two mutually-exclusive treatments. We still assume conditional ignorability, i.e., treatment assignment is as good as random, conditional on X.
In this case, we have two different individual treatment effects, one per treatment.
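One per arm, each relative to the untreated state:

```latex
\tau_{i1} = Y_i(1) - Y_i(0), \qquad \tau_{i2} = Y_i(2) - Y_i(0)
```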
The regression of interest is
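This is the two-treatment analogue of the earlier specification:

```latex
Y_i = \alpha + \beta_1 D_{i1} + \beta_2 D_{i2} + \gamma X_i + \varepsilon_i
```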
Do the linear regression coefficients β₁ and β₂ identify an average treatment effect?
It would be very tempting to say yes. In fact, it looks like not much has changed with respect to the previous setup: we just have one extra treatment, and the potential outcomes are still conditionally independent of it. However, this is not the case.
The linear regression coefficients do not identify an average treatment effect.
Where is the issue?
Let’s concentrate on β₁ (the same applies to β₂). As before, we can rewrite β₁ using the Frisch-Waugh-Lovell theorem as the OLS coefficient of a univariate regression of Yᵢ on Dᵢ₁(⊥X, D₂), where Dᵢ₁(⊥X, D₂) are the residuals from regressing D₁ on D₂ and X.
The problem is the last term. Without the last term, we could still write β₁ as a convex combination of the individual treatment effects. However, the last term biases the estimator by adding a component that depends on the treatment effect of D₂, τ₂. Why does this term not disappear?
The problem is that Dᵢ₁(⊥X, D₂) is not mean independent of Dᵢ₂, i.e.
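In symbols, the mean-independence condition now fails:

```latex
\mathbb{E}\big[D_{i1}(\perp X, D_2) \,\big|\, D_{i2},\, X_i\big] \neq 0
```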
The reason lies in the fact that the treatments are mutually exclusive. This implies that when Dᵢ₁=1, Dᵢ₂ must be zero, regardless of the value of Xᵢ. Therefore, the last term does not cancel out and it introduces a contamination bias.
Solution
Goldsmith-Pinkham, Hull, Kolesár (2022) propose different solutions to the problem. The simplest one, an estimator first proposed by Imbens and Wooldridge (2009), removes the bias while still relying on linear regression. The procedure is the following.
- De-mean the control variable: X̃ = X − X̅
- Regress Y on the interaction between the treatment indicators D and the demeaned control variable X̃
This estimator is an unbiased estimator of the average treatment effect.
This estimator requires just a simple linear regression. Moreover, it is unbiased also for continuous control variables X, not only for the binary one we have considered so far.
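To see the mechanics end-to-end, here is a minimal self-contained sketch in plain NumPy (not the post's src code): it simulates two mutually exclusive treatments whose assignment depends on a binary control, then fits both the biased additive specification and the de-meaned interaction specification. The probabilities and effect sizes are illustrative assumptions, chosen so that both true average treatment effects are zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Binary control (1 = mobile) and two mutually exclusive treatments whose
# assignment probabilities depend on the control, as in the button example.
x = rng.binomial(1, 0.5, n).astype(float)
p1 = np.where(x == 1, 0.2, 0.4)              # treatment 1 shown more on desktop
p2 = np.where(x == 1, 0.4, 0.2)              # treatment 2 shown more on mobile
u = rng.uniform(size=n)
d1 = (u < p1).astype(float)
d2 = ((u >= p1) & (u < p1 + p2)).astype(float)

# Heterogeneous effects: treatment 1 has zero effect everywhere; treatment 2
# helps when x = 1 (+1) and hurts when x = 0 (-1), so both ATEs are zero.
y = 2 * x + d2 * (2 * x - 1) + rng.normal(size=n)

def ols(X, y):
    """OLS coefficients via least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Biased additive specification: Y ~ 1 + D1 + D2 + X.
beta_bad = ols(np.column_stack([np.ones(n), d1, d2, x]), y)

# Interaction estimator: de-mean X and interact it with each treatment.
x_dm = x - x.mean()
beta_ok = ols(np.column_stack([np.ones(n), d1, d2, x_dm, d1 * x_dm, d2 * x_dm]), y)

print("additive   beta1:", beta_bad[1])   # contaminated, drifts away from 0
print("interacted beta1:", beta_ok[1])    # close to the true ATE of 0
```

The coefficient on the first treatment in the additive regression picks up part of the second treatment's heterogeneous effect, while the interaction specification recovers a value near zero.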
In order to better understand both the problem and the solution, let’s run some simulations.
We run an estimator over different draws from the data generating process dgp_buttons(). Note that this is only possible with synthetic data; we do not have this luxury in reality. For each sample, we record the estimated coefficient and the corresponding p-value.
First, let’s try it with the old estimator that regresses revenue on both the group and mobile dummy variables.
ols_estimator = lambda x: smf.ols('revenue ~ group + mobile', data=x).fit()
results = simulate(dgp, ols_estimator)
I plot the distribution of the coefficient estimates of button1 over 1000 simulations, highlighting the statistically significant ones at the 5% level. I also mark the true value of the coefficient, zero, with a vertical dotted bar.
plot_results(results)
As we can see, we reject the null hypothesis of no effect of button1 in 45% of the simulations. Since we set a significance level of 5%, we would have expected around 5% of rejections. Our estimator is biased.
As we have seen above, the problem is that the estimator is not a convex combination of the effects of button1 across mobile and desktop users (which are zero for both), but is contaminated by the effect of button2.
Let’s now try the estimator proposed by Imbens and Wooldridge (2009). First, we de-mean our control variable, mobile. Then, we regress revenue on the interaction between group and the de-meaned control variable, mobile_res.
df['mobile_res'] = df['mobile'] - np.mean(df['mobile'])
smf.ols('revenue ~ group * mobile_res', data=df).fit().summary().tables[1]
The estimated coefficients are now close to their true values: the estimated coefficient for button1 is not significant, while the estimated coefficient for button2 is negative and significant.
Let’s check whether this result holds across samples by running a simulation. We repeat the estimation procedure 1000 times and plot the distribution of estimated coefficients for button1.
def new_estimator(x):
    x = x.assign(mobile_res=x['mobile'] - x['mobile'].mean())
    return smf.ols('revenue ~ group * mobile_res', data=x).fit()

new_results = simulate(dgp, new_estimator)
plot_results(new_results)
Now the distribution of the estimated coefficient of button1 is centered around the true value of zero. Moreover, we reject the null hypothesis of no effect in only 1% of the simulations, consistent with the chosen 5% significance level.
In this post, we have seen the dangers of running a regression with multiple mutually exclusive treatment arms and treatment effects that are heterogeneous in a control variable. In this case, because the treatments are not independent of each other, the regression coefficients are not a convex combination of the within-group average treatment effects: they also capture the treatment effects of the other treatments, introducing contamination bias. The solution to the problem is both simple and elegant, requiring just a linear regression.
However, the problem is more general than this setting: it concerns every setting in which all of the following hold.
- We have multiple treatments that depend on each other
- We need to condition the analysis on a control variable
- The treatment effects are heterogeneous in the control variable
Another popular example is the case of the Two-Way Fixed Effects (TWFE) estimator with staggered treatments.
References
[1] J. Angrist, Estimating the Labor Market Impact of Voluntary Military Service Using Social Security Data on Military Applicants (1998), Econometrica.
[2] D. Rubin, Causal Inference Using Potential Outcomes (2005), Journal of the American Statistical Association.
[3] G. Imbens, J. Wooldridge, Recent Developments in the Econometrics of Program Evaluation (2009), Journal of Economic Literature.
[4] P. Goldsmith-Pinkham, P. Hull, M. Kolesár, Contamination Bias in Linear Regressions (2022), working paper.
Code
You can find the original Jupyter Notebook here:
Thank you for reading!
I really appreciate it! 🤗 If you liked the post and would like to see more, consider following me. I post once a week on topics related to causal inference and data analysis. I try to keep my posts simple but precise, always providing code, examples, and simulations.
Also, a small disclaimer: I write to learn so mistakes are the norm, even though I try my best. Please, when you spot them, let me know. I also appreciate suggestions on new topics!