Unlock the Power of Causal Inference: A Data Scientist’s Guide to Understanding the Backdoor Adjustment Formula | by Graham Harrison | Jan, 2023

By Jessie Hobb On Jan 19, 2023

A fully working example of the backdoor adjustment formula using Python and the pgmpy library

In probability theory it is very straightforward to look at a dataset and calculate the probability of an event based on knowing something about other variables.

For example:

i.e. the probability of a sale is equal to the probability of a click on the link given that the product has been searched.

However, this approach breaks down when causal effects exist in the data, and this is where causal inference comes in. There is a range of approaches depending on the pattern of causality, and this article focuses on unlocking the power of the backdoor adjustment formula.

The “backdoor criterion” applies when the causal effect of X on Y is “confounded” by a third factor, Z, that influences both X and Y …

In this instance the formula 𝑝(𝑌|𝑋) does not work because of the confounding effect of Z and the backdoor adjustment formula needs to be applied from Pearlean “do” calculus to get the correct result:
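For reference, when a variable (or set of variables) Z satisfies the backdoor criterion relative to X and Y, the adjustment formula is usually written in the standard form below. The lower-case x and z, standing for particular values of X and Z, are notational assumptions.

```latex
P\big(Y \mid do(X = x)\big) \;=\; \sum_{z} P\big(Y \mid X = x,\, Z = z\big)\, P\big(Z = z\big)
```

The left-hand side is the interventional quantity we want. Everything on the right-hand side is an ordinary conditional or marginal probability that can be estimated from observational data.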

This looks scarily complicated, especially to those who are new to “do” calculus, but it is actually very easy to understand and apply.

By the end of this article you will understand how to apply the backdoor adjustment formula in Python and pgmpy, why it is so powerful and what is happening behind the scenes.

To get started we need some standard imports and a test dataset which is entirely synthetic and fictitious.

The data is read in from a csv in the code below but if you would like to know how to generate synthetic binary data the following article provides a full explanation –

The synthetic data represents the results of a fictitious drug trial for 1000 subjects who all had a medical condition that the drug would be tested on. For example, the first row represents a male (male=1) who took the drug (drug=1) and recovered (recovery=1).

Let’s start by taking a quick look at the traditional probabilities in the data relating to drug-taking and patient recovery using a single line of Python code:

drug  recovery
0     1           0.826
      0           0.174
1     1           0.778
      0           0.222
Name: recovery, dtype: float64

The result is that 77.8% of patients who took the drug recovered but 82.6% of patients who did not take the drug recovered. The traditional, probabilistic approach clearly suggests that the drug has a negative impact and that the drug trial should end.
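A table of this kind can be produced with a pandas groupby. The sketch below assumes a DataFrame named df_drug with binary columns male, drug and recovery. The eight rows are hypothetical placeholders, not the article's 1000-subject dataset, so the printed numbers will differ from those quoted above.

```python
import pandas as pd

# Hypothetical placeholder rows; NOT the article's synthetic drug-trial data.
df_drug = pd.DataFrame({
    "male":     [1, 1, 1, 0, 0, 0, 1, 0],
    "drug":     [1, 1, 0, 0, 0, 1, 1, 0],
    "recovery": [1, 1, 1, 0, 1, 0, 0, 1],
})

# Two-level table of P(recovery = r | drug = d) for every (d, r) pair.
recovery_table = df_drug.groupby("drug")["recovery"].value_counts(normalize=True)
print(recovery_table)

# The mean of a 0/1 column within each group is P(recovery = 1 | drug = d).
p_recovery_given_drug = df_drug.groupby("drug")["recovery"].mean()
print(p_recovery_given_drug)
```

Both expressions only condition on drug, so neither accounts for any variable that influences drug-taking and recovery at the same time.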

However, there is causality between the features and that means that causal inference and the “do” operator need to be applied to establish the true effect.

The cause-and-effect relationships in the data can be visualised using a simple Directed Acyclic Graph (or DAG)…

The code to create the DAG has been left out of this article but a full explanation can be found here:

The causal diagram shows that whilst taking the drug has a causal effect on recovery, it is not that simple. Gender (male) has a causal impact on both “drug” and “recovery” because …

A higher proportion of males decide to take the drug compared to females
Males have a higher recovery rate than females

Therefore we must “de-confound” the effect of “male” from the effect of taking the drug to get the true effect.

The ideal scenario is that we travel back in time and force everyone to take the drug and measure the impact. We then travel back in time again and this time prevent everyone from taking the drug. We then simply compare the two results and we have our answer!

However, that solution suffers from the impossibility of time travel and the ethical, moral, and legal aspects to forcing or preventing drug-taking.

There is a way forward though. It is the Pearlean “do” calculus which provides a formula for converting an interventional “do” into an equivalent formula containing only observational data which we know.

The remainder of this article is going to provide a simple causal solution using the pgmpy library and a second version which performs all the calculus by hand to show how it works.

The first stage is to create a pgmpy causal model using the causal relationships defined in the Directed Acyclic Graph. Pgmpy creates a set of Conditional Probability Tables that describe the causal relationships which can easily be displayed to see what is going on …

The following code will call TabularCPD.__str__ in the pgmpy library to display the conditional probability tables …

… but I have displayed them below in a more visual and understandable format …

The next stage is to run the “do” operator twice on the model, once for drug=1 and again for drug=0. The second result can then be subtracted from the first to calculate the overall effect of taking the drug, independent of and de-confounded from “male” …

If the drug is taken by everyone p(recovery)=0.8301
If the drug is not taken by anyone p(recovery)=0.7779
The improvement in recovery rate by everyone taking the drug is 5.2%

So pgmpy has been able to perform the magic trick of travelling back in time and re-running the drug trial. The first re-run forced everyone to take the drug, the second prevented anyone from taking the drug and then a simple subtraction provided the answer, but how does pgmpy work this magic? …

We have already concluded from the DAG that both “drug” and “recovery” are confounded by “male” and that in causal inference this pattern is referred to as the “backdoor” criterion.

The task therefore is to simulate an intervention (the time travel piece!) by writing a mathematical formula for the intervention and then to “adjust” it such that it is expressed in terms of data we can observe.

The backdoor adjustment formula from the “Introduction” section can be expressed as follows for the drug trial data –
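Written out for the three variables in the drug trial, with “male” as the adjustment variable, the standard form of the formula becomes the expression below. The symbols d and m, standing for particular values of drug and male, are notational assumptions.

```latex
P\big(\text{recovery}=1 \mid do(\text{drug}=d)\big)
  \;=\; \sum_{m \in \{0,1\}}
        P\big(\text{recovery}=1 \mid \text{drug}=d,\ \text{male}=m\big)\,
        P\big(\text{male}=m\big)
```

Note that the weights are the marginal P(male=m), not P(male=m | drug=d). That is what removes the influence of gender on who takes the drug.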

From here it is a straightforward task to calculate the overall effect of the drug as follows –

1. Calculate the effect of intervening or “do-ing” Drug=1 using the backdoor adjustment formula.
2. Calculate the effect of intervening or “do-ing” Drug=0 using the backdoor adjustment formula.
3. Subtract the result of part 2 from the result of part 1.
4. If the drug is having a positive impact the overall result will be a positive number.
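The steps above can be sketched in a few lines of plain Python. The function name backdoor_adjust and the probability values are hypothetical, chosen only to illustrate the arithmetic; they are not the conditional probability tables from the drug trial data.

```python
def backdoor_adjust(p_y_given_x_z, p_z, x):
    """Return P(Y=1 | do(X=x)) = sum over z of P(Y=1 | X=x, Z=z) * P(Z=z)."""
    return sum(p_y_given_x_z[(x, z)] * p_z[z] for z in p_z)


# Hypothetical probabilities for illustration only (not the article's tables).
p_male = {1: 0.5, 0: 0.5}  # P(male = m)
p_recovery = {             # P(recovery = 1 | drug = d, male = m), keyed by (d, m)
    (1, 1): 0.9, (1, 0): 0.7,
    (0, 1): 0.8, (0, 0): 0.6,
}

p_do_drug_1 = backdoor_adjust(p_recovery, p_male, x=1)  # part 1: do(drug=1)
p_do_drug_0 = backdoor_adjust(p_recovery, p_male, x=0)  # part 2: do(drug=0)
ace = p_do_drug_1 - p_do_drug_0                         # part 3: subtract

print(f"P(recovery | do(drug=1)) = {p_do_drug_1:.3f}")
print(f"P(recovery | do(drug=0)) = {p_do_drug_0:.3f}")
print(f"Average causal effect    = {ace:+.3f}")
print("Positive impact" if ace > 0 else "No positive impact")  # part 4
```

Keeping the adjustment in one small function makes it easy to swap in the real conditional probabilities once they are known.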

The result is called the “Average Causal Effect” and can be denoted as follows –

… and in the drug example …

Substituting the left and right side with the backdoor adjustment formula gives the following
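In standard notation, and assuming the variable names used in the drug trial data, the Average Causal Effect and its expansion through the backdoor adjustment formula can be written as:

```latex
\begin{aligned}
\text{ACE}
  &= P\big(\text{recovery}=1 \mid do(\text{drug}=1)\big)
   - P\big(\text{recovery}=1 \mid do(\text{drug}=0)\big) \\[4pt]
  &= \sum_{m \in \{0,1\}} P\big(\text{recovery}=1 \mid \text{drug}=1,\ \text{male}=m\big)\, P\big(\text{male}=m\big) \\
  &\quad - \sum_{m \in \{0,1\}} P\big(\text{recovery}=1 \mid \text{drug}=0,\ \text{male}=m\big)\, P\big(\text{male}=m\big)
\end{aligned}
```

Each sum has two terms, one for male=1 and one for male=0, so the whole calculation needs only six probabilities: four conditional recovery rates and the two values of P(male).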

So we need to solve the left side of the minus sign (i.e. the intervention where the drug is taken by everyone) by adding together the results for Male=1 and Male = 0 as follows –

These probabilities could all be easily calculated directly from the df_drug DataFrame but they have already been nicely summarised for us in the conditional probability tables so they can be immediately substituted as follows …
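As a sketch of the first option mentioned above, the ingredients of the formula can be estimated directly from a DataFrame with two pandas expressions. The column names follow the article, but the eight rows are hypothetical placeholders rather than the real df_drug data, and the helper p_recovery_do is an illustrative name.

```python
import pandas as pd

# Hypothetical placeholder rows; NOT the article's synthetic drug-trial data.
df_drug = pd.DataFrame({
    "male":     [1, 1, 1, 1, 0, 0, 0, 0],
    "drug":     [1, 1, 0, 0, 1, 1, 0, 0],
    "recovery": [1, 0, 1, 1, 1, 1, 0, 0],
})

# P(male = m), indexed by m.
p_male = df_drug["male"].value_counts(normalize=True)

# P(recovery = 1 | drug = d, male = m), indexed by (d, m).
p_recovery = df_drug.groupby(["drug", "male"])["recovery"].mean()


def p_recovery_do(drug):
    """Backdoor adjustment: weight each stratum's recovery rate by P(male = m)."""
    return sum(p_recovery.loc[(drug, m)] * p_male.loc[m] for m in p_male.index)


ace = p_recovery_do(1) - p_recovery_do(0)
print(f"P(recovery | do(drug=1)) = {p_recovery_do(1):.3f}")
print(f"P(recovery | do(drug=0)) = {p_recovery_do(0):.3f}")
print(f"Average causal effect    = {ace:+.3f}")
```

This approach needs at least one row for every combination of drug and male; an empty stratum would make the lookup fail.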

Now we need to solve the right hand side of the minus sign in the expanded ACE formula (i.e. for the intervention where everyone is prevented from taking the drug) …

Again, we can just read off and substitute the probabilities from the conditional probability tables …

The Average Causal Effect (ACE) can now be calculated by subtracting the second result from the first …

So the improvement in recovery rate by everyone taking the drug is 5.2%, which exactly matches the calculations produced by using the pgmpy library!

Traditional probabilistic approaches fail to produce the correct answers when causal relationships exist in the data, so causal techniques are required to calculate the correct results.

This article has used a synthetic dataset to show that the true effect of taking a drug on patient recovery was a positive impact of 5.2% when the traditional probabilistic approach suggested a negative impact of 5%.
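The two headline numbers can be checked with simple arithmetic on the figures quoted earlier in the article. The variable names below are illustrative.

```python
# Figures quoted in the article.
p_recovery_given_drug_1 = 0.778  # observed P(recovery=1 | drug=1)
p_recovery_given_drug_0 = 0.826  # observed P(recovery=1 | drug=0)
p_recovery_do_drug_1 = 0.8301    # P(recovery=1 | do(drug=1))
p_recovery_do_drug_0 = 0.7779    # P(recovery=1 | do(drug=0))

# Naive comparison of the two observed groups.
naive_difference = p_recovery_given_drug_1 - p_recovery_given_drug_0

# Average Causal Effect from the two interventions.
ace = p_recovery_do_drug_1 - p_recovery_do_drug_0

print(f"Observational difference: {naive_difference:+.3f}")  # -0.048
print(f"Average Causal Effect:    {ace:+.4f}")               # +0.0522
```

The observational difference of about -4.8 percentage points is the "negative impact of 5%", and the interventional difference of about +5.2 percentage points is the de-confounded effect.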

Applying causal inference techniques requires a “Directed Acyclic Graph” to define the causal relationships, which is then used along with the dataset as an input to a causal calculation to show the true effect.

The pgmpy library performed the seemingly impossible magic trick of travelling back in time to intervene in the drug trial not once but twice, first forcing everyone to take the drug and then preventing anyone from taking the drug.

However, it is not magic. It is done by applying the “backdoor adjustment formula” as defined in Pearlean “do” calculus and the long-hand calculations were explained and then verified by matching the results back to the pgmpy library.

If you would like to know more about the pgmpy library the full documentation can be found here: https://pgmpy.org/index.html.
