
Understanding Contamination Bias. Problems and solutions of linear regression with multiple treatments | by Matteo Courthoud | Jun, 2022



Problems and solutions of linear regression with multiple treatments

Image by Author

In many causal inference settings, we might be interested in the effect of not just one treatment, but many mutually exclusive treatments. For example, we might want to test alternative UX designs, or drugs, or policies. Depending on the context, there might be many reasons why we want to test different treatments at the same time, but generally, it can help reduce the sample size, as we need just a single control group. A simple way to recover the different treatment effects is a linear regression of the outcome of interest on the different treatment indicators.

However, in causal inference we often condition the analysis on other observable variables (often called control variables), either to increase power or, especially in quasi-experimental settings, to identify a causal parameter instead of a simple correlation. There are cases in which adding control variables can backfire, but otherwise the conventional wisdom is that linear regression recovers the average treatment effect.

In a breakthrough paper, Goldsmith-Pinkham, Hull and Kolesár (2022) have recently shown that with multiple, mutually exclusive treatments and in the presence of control variables, the regression coefficients do not identify causal effects. However, not everything is lost: the authors propose a simple solution to this problem that still makes use of linear regression.

In this blog post, I am going to go through a simple example illustrating the nature of the problem and the solution proposed by the authors.

Suppose we run an online store and we are not satisfied with our current checkout page. In particular, we would like to change our checkout button to increase the probability of a purchase. Our UX designer comes up with two alternative checkout buttons, displayed below.

Default option and alternative treatments, image by Author

In order to understand which button to use, we run an A/B test, or randomized controlled trial: when people arrive at the checkout page, we show them one of the three options at random. Then, for each user, we record the revenue generated, which is our outcome of interest.

I generate a synthetic dataset using dgp_buttons() from src.dgp as data generating process. I also import plotting functions and standard libraries from src.utils.

from src.utils import *
from src.dgp import dgp_buttons
dgp = dgp_buttons()
df = dgp.generate_data()
df.head()
Data snapshot, image by Author

We have information on 1000 users, for which we observe their checkout button (default, button1 or button2), the revenue they generate and whether they accessed the page from desktop or mobile.

We notice too late that we have a problem with randomization: we showed button1 more frequently to desktop users and button2 more frequently to mobile users. The control group, which sees the default button, is instead balanced.

Treatment assignment imbalance, image by Author
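Since the actual `dgp_buttons()` code from `src.dgp` is not shown in the post, here is a minimal, hypothetical sketch of a data generating process with the same structure: device-dependent assignment probabilities (made-up numbers) and a button2 effect that is heterogeneous in the device, while button1 does nothing.

```python
import numpy as np
import pandas as pd

# Hypothetical sketch of a DGP like dgp_buttons() (the actual src.dgp
# code is not shown in the post): assignment probabilities depend on
# the device, so that desktop users see button1 more often and mobile
# users see button2 more often, while the default is balanced.
rng = np.random.default_rng(1)
n = 1000
mobile = rng.binomial(1, 0.5, n)

# Columns: P(default), P(button1), P(button2), by device (made up).
p = np.where(mobile[:, None] == 1, [1/3, 1/6, 1/2], [1/3, 1/2, 1/6])
u = rng.random(n)
g = (u > p[:, 0]).astype(int) + (u > p[:, 0] + p[:, 1]).astype(int)
group = np.array(['default', 'button1', 'button2'])[g]

# Illustrative outcome: button1 does nothing, button2 helps on mobile
# and hurts on desktop.
tau2 = np.where(mobile == 1, 1.0, -2.0)
revenue = 10 + mobile + tau2 * (group == 'button2') + rng.normal(0, 1, n)

df = pd.DataFrame({'group': group, 'revenue': revenue, 'mobile': mobile})
# The column-wise shares reproduce the imbalance described above.
print(pd.crosstab(df['group'], df['mobile'], normalize='columns').round(2))
```

The cross-tabulation shows each button's share by device: button1 is over-represented among desktop users, button2 among mobile users, and the default is roughly balanced.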

What should we do? What happens if we simply compare revenue across groups? Let’s do it by regressing revenue on group dummy variables.

smf.ols('revenue ~ group', data=df).fit().summary().tables[1]
Regression summary, image by Author

From the regression results, we estimate a negative and significant effect at the 5% level, for both buttons. Should we believe these estimates? Are they causal?

It is unlikely that what we have estimated are the true treatment effects.

In fact, there might be substantial differences in purchase attitudes between desktop and mobile users. Since we do not have a comparable number of mobile and desktop users across treatment arms, it might be that the observed differences in revenue are due to the device used and not the button design.

Because of this, we decide to condition our analysis on the device used and we include the mobile dummy variable in the regression.

smf.ols('revenue ~ group + mobile', data=df).fit().summary().tables[1]
Regression summary, image by Author

Now the coefficient of button1 is positive and significant. Should we recommend its implementation?

The answer is surprisingly no. Goldsmith-Pinkham, Hull, Kolesár (2022) show that this type of regression does not identify the average treatment effect when:

  • there are mutually exclusive treatment arms (in our case, groups)
  • we are controlling for some variable X (in our case, mobile)
  • the treatment effects are heterogeneous in X

This is true even if the treatment is “as good as random” once we condition on X.

Indeed, in our case, the true treatment effects are the ones reported in the following table.

True average treatment effects, image by Author

The first button has no effect on revenue, irrespective of the device, while the second button has a positive effect on mobile users and a negative effect on desktop users. Our (wrong) regression specification instead estimates a positive effect of the first button.

Let’s now dig more in detail into the math, to understand why this is happening.

This section borrows heavily from Goldsmith-Pinkham, Hull, Kolesár (2022). For a great summary of the paper, I recommend this excellent Twitter thread by one of the authors, Paul Goldsmith-Pinkham.

Twitter thread on contamination bias

Suppose we are interested in the effect of a treatment D on an outcome Y. First, let’s consider the standard case of a single treatment arm so that the treatment variable is binary, D∈{0,1}. Also, consider a single binary control variable X∈{0,1}. We also assume that treatment assignment is as good as random, conditionally on X. This assumption is called conditional ignorability, and it means that there might be systematic differences between the treatment and control groups, however, these differences are fully accounted for by X. Formally we write

Conditional ignorability assumption, image by Author

Where Yᵢ(d) denotes the potential outcome of individual i when their treatment status is d. For example, Yᵢ(0) indicates the potential outcome of individual i in case they are not treated. This notation comes from Rubin’s potential outcomes framework. We can write the individual treatment effect of individual i as

Individual treatment effects, image by Author

In this setting, the regression of interest is

Regression specification, image by Author

The coefficient of interest is β.

Angrist (1998) shows that the regression coefficient β identifies a causal effect: a weighted average of the within-group average treatment effects τ(x), with convex weights. In this particular setting, we can write it as

Coefficient decomposition, image by Author

The weights λ and (1-λ) are proportional to the within-group treatment variance. Hence, the OLS estimator gives less weight to groups with less treatment variance, i.e., where treatment is more imbalanced. Groups where treatment is split 50–50 get the most weight.

Group weights, image by author
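To make the weighting concrete, here is a minimal numerical sketch (the assignment probabilities below are made up, not those of the post's example): λ(x) is proportional to P(X = x) · Var(D | X = x).

```python
import numpy as np

# Angrist (1998) weights for one binary treatment D and binary X:
# beta = lambda * tau(1) + (1 - lambda) * tau(0), where
# lambda(x) ∝ P(X = x) * Var(D | X = x), and for a binary treatment
# Var(D | X = x) = p_x * (1 - p_x) with p_x = P(D = 1 | X = x).
p_x = np.array([0.5, 0.5])     # P(X = 0), P(X = 1)
p_d = np.array([0.5, 0.1])     # P(D = 1 | X = x): balanced vs imbalanced
lam = p_x * p_d * (1 - p_d)    # unnormalized weights
lam = lam / lam.sum()
print(lam.round(3))            # the balanced group dominates
```

With these numbers, the group where treatment is split 50–50 receives roughly three quarters of the weight, even though both groups are equally large.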

The weights can be derived using the Frisch–Waugh–Lovell theorem to express β as the OLS coefficient of a univariate regression of Y on Dᵢ(⊥X), where Dᵢ(⊥X) are the residuals from regressing D on X. If you are not familiar with the Frisch–Waugh–Lovell theorem, I wrote an introductory blog post here.

Coefficient decomposition using FWL theorem, image by Author

The second term of the central expression disappears because the residual Dᵢ₁(⊥X) is by construction mean independent of the control variable X, i.e.

Mean independence, image by Author

This mean independence property is crucial to obtaining an unbiased estimate and its failure in the multiple-treatment case is the source of the contamination bias.
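The Frisch–Waugh–Lovell step above is easy to check numerically. In this sketch (simulated data, illustrative coefficients), the coefficient on D from the full regression of Y on (1, D, X) coincides with the slope from a univariate regression of Y on the residualized treatment:

```python
import numpy as np

# Numerical check of the Frisch–Waugh–Lovell theorem: the coefficient
# on D in a regression of Y on (1, D, X) equals the coefficient from a
# univariate regression of Y on the residual of D after partialling
# out (1, X). Simulated data, illustrative only.
rng = np.random.default_rng(3)
n = 5_000
x = rng.binomial(1, 0.5, n)
d = rng.binomial(1, 0.3 + 0.4 * x)          # treatment depends on x
y = 1 + 2 * d + 0.5 * x + rng.normal(0, 1, n)

# Full regression of y on (1, d, x): take the coefficient on d.
Z = np.column_stack([np.ones(n), d, x])
beta_full = np.linalg.lstsq(Z, y, rcond=None)[0][1]

# Residualize d on (1, x), then regress y on the residual.
W = np.column_stack([np.ones(n), x])
d_res = d - W @ np.linalg.lstsq(W, d, rcond=None)[0]
beta_fwl = (d_res @ y) / (d_res @ d_res)

print(np.isclose(beta_full, beta_fwl))  # True
```

Both routes recover the same number, which is why the decomposition of β in terms of the residualized treatment is valid.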

Let’s now consider the case of multiple treatment arms, D∈{0,1,2}, where 1 and 2 indicate two mutually-exclusive treatments. We still assume conditional ignorability, i.e., treatment assignment is as good as random, conditional on X.

Conditional ignorability assumption, image by Author

In this case, we have two different individual treatment effects, one per treatment.

Individual treatment effects, image by Author

The regression of interest is

Regression specification, image by Author

Do the linear regression coefficients β₁ and β₂ identify an average treatment effect?

It would be very tempting to say yes. In fact, it looks like not much has changed with respect to the previous setup: we just have one extra treatment, and the potential outcomes are still conditionally independent of it. However, this is not the case.

The linear regression coefficients do not identify an average treatment effect.

Where is the issue?

Let’s concentrate on β₁ (the same applies to β₂). As before, we can rewrite β₁ using the Frisch–Waugh–Lovell theorem as the OLS coefficient of a univariate regression of Yᵢ on Dᵢ₁(⊥X, D₂), where Dᵢ₁(⊥X, D₂) are the residuals from regressing D₁ on D₂ and X.

Coefficient decomposition using the FWL theorem, image by Author

The problem is the last term. Without the last term, we could still write β₁ as a convex combination of the individual treatment effects. However, the last term biases the estimator by adding a component that depends on the treatment effect of D₂, τ₂. Why does this term not disappear?

The problem is that Dᵢ₁(⊥X, D₂) is not mean independent of Dᵢ₂, i.e.

Violation of mean independence, image by Author

The reason lies in the fact that the treatments are mutually exclusive. This implies that when Dᵢ₁=1, Dᵢ₂ must be zero, regardless of the value of Xᵢ. Therefore, the last term does not cancel out and it introduces a contamination bias.
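This failure can be checked numerically. In the sketch below (illustrative, X-dependent assignment probabilities), the FWL residual of D₁ is orthogonal to D₂ on average, as OLS guarantees, yet its mean shifts with D₂ within each value of X, because D₁ is forced to zero whenever D₂ = 1:

```python
import numpy as np

# Check the failure of mean independence numerically. With mutually
# exclusive treatments D in {0, 1, 2} and X-dependent assignment
# (illustrative probabilities), the FWL residual of D1 is linearly
# orthogonal to D2, but not mean independent of D2 given X.
rng = np.random.default_rng(0)
n = 100_000
x = rng.binomial(1, 0.5, n)
p = np.where(x[:, None] == 1, [1/3, 1/6, 1/2], [1/3, 1/2, 1/6])
u = rng.random(n)
d = (u > p[:, 0]).astype(int) + (u > p[:, 0] + p[:, 1]).astype(int)
d1, d2 = (d == 1).astype(float), (d == 2).astype(float)

# FWL residual: regress D1 on a constant, D2 and X.
Z = np.column_stack([np.ones(n), d2, x])
res = d1 - Z @ np.linalg.lstsq(Z, d1, rcond=None)[0]

# Linear orthogonality holds by construction ...
print(round((res * d2).mean(), 6))                 # ~ 0
# ... but within a value of X the residual mean shifts with D2,
# because D1 must be 0 whenever D2 = 1.
print(round(res[(x == 1) & (d2 == 1)].mean(), 3))  # > 0
print(round(res[(x == 0) & (d2 == 1)].mean(), 3))  # < 0
```

Because these within-X cell means are nonzero with opposite signs, they no longer cancel once they are multiplied by a τ₂(X) that varies with X, and the contamination term survives.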

Solution

Goldsmith-Pinkham, Hull, Kolesár (2022) propose different solutions to the problem. The simplest one, an estimator first proposed by Imbens and Wooldridge (2009), removes the bias. The procedure is the following.

  1. De-mean the control variable: X̃ = X − X̅
  2. Regress Y on the interaction between the treatment indicators D and the demeaned control variable

This estimator is an unbiased estimator of the average treatment effect.

This estimator requires nothing more than a linear regression. Moreover, it is unbiased not only for the binary control variable we have considered so far, but also for continuous control variables X.
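The two steps above can be sketched end to end on one simulated draw. Since the post's `src.dgp` code is not shown, the data generating process below is hypothetical: button1 has a true effect of zero, while button2's effect is +1 on mobile and -2 on desktop, for a true ATE of -0.5.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical DGP with imbalanced assignment and heterogeneous
# effects: tau1 = 0 everywhere; tau2 = +1 on mobile, -2 on desktop,
# so the true ATE of button2 is 0.5*1 + 0.5*(-2) = -0.5.
rng = np.random.default_rng(2)
n = 10_000
mobile = rng.binomial(1, 0.5, n)
p = np.where(mobile[:, None] == 1, [1/3, 1/6, 1/2], [1/3, 1/2, 1/6])
u = rng.random(n)
g = (u > p[:, 0]).astype(int) + (u > p[:, 0] + p[:, 1]).astype(int)
# Ordered categorical so that 'default' is the reference category.
group = pd.Categorical.from_codes(g, categories=['default', 'button1', 'button2'])
tau2 = np.where(mobile == 1, 1.0, -2.0)
revenue = 10 + mobile + tau2 * (group == 'button2') + rng.normal(0, 1, n)
df = pd.DataFrame({'revenue': revenue, 'group': group, 'mobile': mobile})

# Contaminated specification: button1's coefficient picks up tau2.
biased = smf.ols('revenue ~ group + mobile', data=df).fit()

# Imbens-Wooldridge: interact treatment with the de-meaned control.
df['mobile_res'] = df['mobile'] - df['mobile'].mean()
fixed = smf.ols('revenue ~ group * mobile_res', data=df).fit()

print(biased.params['group[T.button1]'].round(2))  # noticeably > 0
print(fixed.params['group[T.button1]'].round(2))   # close to 0
print(fixed.params['group[T.button2]'].round(2))   # close to -0.5
```

The contaminated regression assigns button1 a sizable positive coefficient despite its true effect being zero, while the interacted, de-meaned specification recovers both ATEs.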

In order to better understand both the problem and the solution, let’s run some simulations.

We run an estimator over different draws from the data generating process dgp_buttons(). Note that this is only possible with synthetic data and we do not have this luxury in reality. For each sample, we record the estimated coefficient and the corresponding p-value.

First, let’s try it with the old estimator that regresses revenue on both group and mobile dummy variables.

ols_estimator = lambda x: smf.ols('revenue ~ group + mobile', data=x).fit()
results = simulate(dgp, ols_estimator)

I plot the distribution of the coefficient estimates of button1 over 1000 simulations, highlighting the statistically significant ones at the 5% level. I also highlight the true value of the coefficient, zero, with a vertical dotted bar.

plot_results(results)
Image by Author

As we can see, we reject the null hypothesis of no effect of button1 in 45% of the simulations. Since we set a significance level of 5%, we would have expected around 5% of rejections. Our estimator is biased.

As we have seen above, the problem is that the estimator is not a convex combination of the effect of button1 across mobile and desktop users (which is zero for both), but it is contaminated by the effect of button2.

Let’s now try the estimator proposed by Imbens and Wooldridge (2009). First, we need to de-mean our control variable, mobile. Then, we regress revenue on the interaction between group and the de-meaned control variable, mobile_res.

df['mobile_res'] = df['mobile'] - np.mean(df['mobile'])
smf.ols('revenue ~ group * mobile_res', data=df).fit().summary().tables[1]
Regression summary, image by Author

The estimated coefficients are now close to their true values. The estimated coefficient for button1 is not significant, while the estimated coefficient for button2 is negative and significant.

Let’s check whether this result holds across samples by running a simulation. We repeat the estimation procedure 1000 times and we plot the distribution of estimated coefficients for button1.

new_estimator = lambda x: smf.ols('revenue ~ group * mobile_res', data=x.assign(mobile_res=x['mobile'] - x['mobile'].mean())).fit()
new_results = simulate(dgp, new_estimator)
plot_results(new_results)
Image by Author

Now the distribution of the estimated coefficient of button1 is centered around the true value of zero. Moreover, we reject the null hypothesis of no effect in only 1% of the simulations, in line with the chosen 5% significance level.

In this post, we have seen the dangers of running a regression with multiple mutually exclusive treatment arms when treatment effects are heterogeneous in a control variable. In this case, because the treatments are not independent of each other, the regression coefficients are not a convex combination of the within-group average treatment effects: they also pick up the effects of the other treatments, introducing contamination bias. The solution to the problem is both simple and elegant, requiring just a linear regression.

However, the problem is more general than this setting and concerns every setting in which all of the following hold:

  1. We have multiple treatments that depend on each other
  2. We need to condition the analysis on a control variable
  3. The treatment effects are heterogeneous in the control variable

Another popular example is the case of the Two-Way Fixed Effects (TWFE) estimator with staggered treatments.

References

[1] J. Angrist, Estimating the Labor Market Impact of Voluntary Military Service Using Social Security Data on Military Applicants (1998), Econometrica.

[2] D. Rubin, Causal Inference Using Potential Outcomes (2005), Journal of the American Statistical Association.

[3] G. Imbens, J. Wooldridge, Recent Developments in the Econometrics of Program Evaluation (2009), Journal of Economic Literature.

[4] P. Goldsmith-Pinkham, P. Hull, M. Kolesár, Contamination Bias in Linear Regressions (2022), working paper.


Code

You can find the original Jupyter Notebook here:

Thank you for reading!

I really appreciate it! 🤗 If you liked the post and would like to see more, consider following me. I post once a week on topics related to causal inference and data analysis. I try to keep my posts simple but precise, always providing code, examples, and simulations.

Also, a small disclaimer: I write to learn so mistakes are the norm, even though I try my best. Please, when you spot them, let me know. I also appreciate suggestions on new topics!

