
The Causal Inference “do” Operator Fully Explained, with an End-to-End Example Using Python and DoWhy | by Graham Harrison | Dec, 2022



Photo by Bradyn Trollip on Unsplash

Fully explained, end-to-end examples of causal inference that have actual, working source code are very hard to find on the Internet or in books, as I have discovered in my journey to understand how this emerging technology works and why it is so important.

But if you persevere it is certainly well worth the effort, because causal inference can solve a type of problem that has no solution using other machine learning techniques.

Traditional machine learning models can predict what is likely to happen if the future broadly turns out like the past but they cannot tell you what you should do differently to achieve the desired outcomes.

For example, a classification algorithm can predict if bank loan customers are likely to default but it cannot answer questions like “If we change the repayment term of the loan, will more customers avoid defaulting?”

Here are a few more examples of the types of questions causal inference can answer that traditional predictive models cannot:

  • Does a proposed change to a system improve people’s outcomes?
  • What led to a change in a system’s outcomes?
  • What changes to a system are likely to improve outcomes for people?

There are many examples of online articles that go through the detail of the mathematics involved in causal inference, but very few that provide a worked example with a full explanation and all of the source code.

If you stick with me and read through to the end of this article, I promise to equip you with a full explanation and all of the source code that will enable you to do something truly amazing that is just not possible with other machine learning techniques.

The first thing we need is some data. I have created a purely synthetic dataset that was inspired by the famous LaLonde data, which observed and recorded the impact of an employment skills training programme on earnings in the 1970s.

As the LaLonde data and study provided the inspiration, there is a citation in the references section at the end of the article.

Image by Author

It is worth taking a moment to understand the key aspects of the synthetic dataset –

  • received_training holds a 1 if the individual attended a fictitious training programme designed to provide employment skills and to increase earnings potential. In the synthetic dataset 640 individuals attended the training programme and 1360 did not attend.
  • age is the age in years.
  • education_years holds the number of years of school education.
  • received_benefits holds a 1 if the individual has ever been in receipt of unemployment benefits.
  • university_degree is 1 if the individual studied for and attained a degree at university.
  • single is 1 if the individual is single (i.e. not married or in a civil partnership).
  • top_earner holds a 1 if the individual is in the top quartile of earners.
  • age_group is a categorical version of age.
  • earnings is the amount the individual was earning 3 years after the completion of the fictitious employment skills training programme and is the “target” or feature of interest.
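A minimal sketch of loading the data and confirming those counts (the filename is a placeholder; the synthetic dataset itself is not published in this article):

```python
import pandas as pd

# Load the synthetic dataset - "training_data.csv" is a hypothetical filename.
df = pd.read_csv("training_data.csv")

# Expect 640 attendees (1) and 1360 non-attendees (0).
print(df["received_training"].value_counts())
print(df.head())
```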

Now let’s take a look at the data to see what impact the training programme had on the participants’ earnings …

Image by Author

According to the analysis, the impact of attending the training programme is negative –

  • The apparent impact of attending the training programme is a decrease in annual earnings of $1,065.29.
  • The probability of being a top earner is 0.19 for those attending the training and 0.28 for those not attending.
  • The median earnings for those who received the training is $3,739 and $4,893 for those who were not trained.
  • The mean earnings for those who received the training is $6,067 and $7,132 for those who were not trained.
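For reference, here is a sketch of how those four measures could be computed with pandas, using the column names described above:

```python
trained = df[df["received_training"] == 1]
untrained = df[df["received_training"] == 0]

# The four measures quoted above.
print(f"Mean earnings difference: "
      f"{trained['earnings'].mean() - untrained['earnings'].mean():,.2f}")
print(f"P(top earner | trained) = {trained['top_earner'].mean():.2f}, "
      f"P(top earner | untrained) = {untrained['top_earner'].mean():.2f}")
print(f"Median earnings: {trained['earnings'].median():,.0f} (trained) "
      f"vs {untrained['earnings'].median():,.0f} (untrained)")
print(f"Mean earnings: {trained['earnings'].mean():,.0f} (trained) "
      f"vs {untrained['earnings'].mean():,.0f} (untrained)")
```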

Based on the analysis, the clear advice would be to stop the training programme, as four different measures all show that the impact on earnings is consistently negative.

Intuitively this conclusion does not seem right though. Even if the training was absolutely terrible it does not feel right that attending the training would make a participant less employable and damage their future earnings.

At this point, traditional approaches such as probabilities and predictive models cannot get us any further. Any application of these techniques will conclude that the training should be cancelled.

In order to break through those limitations and really understand what is going on, we need to build a causal model and apply the magic “do” operator.

If you want to know the true impact of the training programme and why using traditional probabilities and predictive models can cause erroneous and even dangerous outcomes, please read on …

Let’s start the journey to a more accurate evaluation by taking a more detailed look at some of the features in the dataset …

Image by author

Clearly there is a significantly different pattern in education between those that attended the training and those that did not.

At this point in a real-world project we would be working with domain experts to understand these patterns, but even without domain expertise it is reasonable to conclude that education may have a causal effect on both who attends the training and their ability to earn.

Image by Author

It is a similar story for age. For those receiving training there is a roughly linear pattern, with more younger people and fewer older people. For those who did not attend there is a spike in the 30–40 age group. One hypothesis might be that many 30–40 year olds are already earning well and do not want any job-skills related training.

Again, this would suggest that age is affecting both whether an individual is likely to attend the training and their earnings potential.

This simple additional analysis has revealed the presence of “confounders”. This term is bandied around in many of the available examples, often with the accompanying calculus and formulas, but usually with no clear explanation.

Simply stated, the impact of features like age and education is mixed in with the main effect of interest (that of training on earnings), and when we apply traditional approaches that standalone effect cannot be separated out.

It is not possible to discover causality from the data alone. The data needs to be supplemented with a “Directed Acyclic Graph” (DAG) that is constructed by “discovering” the causal relationships using domain expertise and other techniques.

For a more detailed exploration please take a look at my article on discovering causal relationships –

The next step is going to use my DirectedAcyclicGraph class. I have left the source code out of the article to keep it more concise but here is the link to the full source in case you would like to run the code for yourself – https://gist.github.com/grahamharrison68/9733b0cd4db8e3e049d5be7fc17b7602.

If you do decide to use it and you like it, why not consider buying me a coffee? …

Here is my proposal for the causal relationships in the data –

Image by Author

The DAG can be interpreted as follows –

  • received_training (i.e. attending the training programme) has a causal impact on earnings (i.e. future earnings).
  • All the other features have a causal effect on whether an individual is likely to join the training programme or not.
  • All the other features also have a causal effect on future earnings.

For example, an individual’s age is “causing” them to attend the training or not, perhaps because more young people want to be trained, and age is also “causing” earnings, possibly because older people with more experience can earn more.
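To make the proposed structure concrete in code as well as a picture, here is a sketch that builds an equivalent GML string. The article’s DirectedAcyclicGraph class exposes this via its gml_graph property; my choice to exclude the derived top_earner and age_group columns is an assumption:

```python
# A sketch of the proposed DAG in GML format. The training_model.gml_graph
# property used later in the article emits an equivalent structure.
confounders = ["age", "education_years", "received_benefits",
               "university_degree", "single"]

nodes = "".join(f'node [ id "{n}" label "{n}" ] '
                for n in ["received_training", "earnings"] + confounders)

# Treatment -> outcome, plus each confounder -> treatment and -> outcome.
edges = 'edge [ source "received_training" target "earnings" ] '
for c in confounders:
    edges += f'edge [ source "{c}" target "received_training" ] '
    edges += f'edge [ source "{c}" target "earnings" ] '

causal_graph = f"graph [ directed 1 {nodes}{edges}]"
```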

This pattern is quite common. When statisticians cannot carry out a Randomized Controlled Trial (RCT), they may condition on, or control for, the variables that are mixing with the main effect.

This means that for age they could split the observations up into age groups, look at the relationship between the treatment and earnings within each group, and then take a proportionally weighted average across the groups to estimate the true overall effect.
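As a minimal sketch of that stratified estimate, assuming the age_group column defines the strata:

```python
# Stratify on age_group, estimate the treated-vs-untreated difference in
# mean earnings within each stratum, then weight each stratum by its size.
effect = 0.0
for group, stratum in df.groupby("age_group"):
    treated = stratum.loc[stratum["received_training"] == 1, "earnings"]
    untreated = stratum.loc[stratum["received_training"] == 0, "earnings"]
    if treated.empty or untreated.empty:
        continue  # strata with no observations for one group are skipped
    weight = len(stratum) / len(df)
    effect += weight * (treated.mean() - untreated.mean())

print(f"Stratified estimate of the training effect: {effect:,.2f}")
```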

However, there are some problems with this approach. For example, how do you define the boundaries for the groups? What if the key impact was on 16–18 year olds but the boundary had been set at 16–30? And what if there were no observations for 40–45 year olds?

Another approach is not to observe, but to intervene. We could simply force everyone to do the training and then we would see the true impact. But what if the observations were historical (as in the LaLonde data) and it was too late to intervene? Or what if this was a study of smoking or obesity?

The subjects could not be forced to smoke or to become obese just to prove our theories!

This is where the “do” operator comes in. It sounds like magic but it is genuinely possible to build a causal inference model that can accurately simulate these interventions without having to do them in the real world.

This will save significant time and money, remove the need for randomised controlled trials that condition for large numbers of variables and enable studies of factors, like smoking and obesity, that would raise moral and ethical concerns in real-world experiments.

Let’s imagine that rather than observing a group of people, some of whom did the training and some of whom did not, we can travel back in time, intervene instead of observing, and make them all do the training.

In that instance the DAG would look like this –

Image by Author

This is effectively what the magic “do” operator is doing. If you intervene and do(received_training=1) you are effectively “rubbing out” all the incoming lines of causality, because no matter how age, education and the other features affect the probability of doing the training, it is now always going to happen.
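As a tiny illustration of that “rubbing out”, here is a sketch using networkx to build the intervened graph by hand from the GML string sketched earlier (DoWhy does the equivalent internally):

```python
import networkx as nx

# Parse the GML string from the earlier sketch and remove every edge that
# points into received_training - this is the intervened (mutilated) graph.
g = nx.parse_gml(causal_graph)
g_do = g.copy()
g_do.remove_edges_from(list(g_do.in_edges("received_training")))

print(sorted(g_do.edges()))  # no edges point into received_training any more
```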

Behind the scenes the DoWhy library simulates this intervention. It does this by using the rules of do-calculus to convert p(earnings | do(received_training=1)), which cannot be calculated directly unless we make a physical intervention, into an expression over observational quantities that can be estimated from the data.
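For readers who want just one formula, that conversion rests on the backdoor adjustment: writing the confounders in the DAG as $Z$,

$$p(\text{earnings} \mid do(\text{received\_training}=1)) = \sum_{z} p(\text{earnings} \mid \text{received\_training}=1, Z=z)\, p(Z=z)$$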

I have deliberately left the details of the maths out of this article. There are plenty of articles that show the maths but very few that show a working example with the Python code, hence that is the focus in this article.

Note: you will need my DirectedAcyclicGraph class if you want to run the code so if you haven’t downloaded it already head over to https://gist.github.com/grahamharrison68/9733b0cd4db8e3e049d5be7fc17b7602 and don’t forget to consider buying me a coffee if you like it!

Here is the full source code to perform the “do” operator on the data …

Image by Author
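Since the code appears above only as an image, here is a sketch reconstructing it from the line-by-line walkthrough that follows. The parameter names and values match the article’s description; the seed value, and my use of the causal_graph GML string sketched earlier in place of the article’s training_model.gml_graph property, are assumptions:

```python
import numpy as np
import pandas as pd
import dowhy.api  # noqa: F401 - importing this adds the .causal accessor to DataFrames

# Variable types: "d" = discrete, "c" = continuous (see the discussion below).
# Integer columns such as age and education_years are declared "c".
variable_types = {
    "received_training": "d",
    "age": "c",
    "education_years": "c",
    "received_benefits": "d",
    "university_degree": "d",
    "single": "d",
    "earnings": "c",
}

np.random.seed(42)  # the exact seed value is my assumption; re-set before every call
df_do = df.causal.do(
    x={"received_training": 1},        # the intervention we want to "do"
    outcome="earnings",                # the effect we are measuring
    dot_graph=causal_graph,            # the article passes training_model.gml_graph
    variable_types=variable_types,
    proceed_when_unidentifiable=True,  # suppress the interactive prompt
)
```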

Before we get into the truly amazing results it would be useful to walk through the code line-by-line.

Firstly, importing dowhy.api magically extends pandas DataFrame so that the class gains a new causal.do method.

Next, setting the random seed in numpy ensures that the do method results are reproducible. The DoWhy documentation does not mention anything about setting a random seed; this was found by pure trial-and-error. Also note that the random seed needs to be set in the statement immediately before every call to causal.do, not just before the first one.

The next mystery of causal.do is the variable_types parameter. The DoWhy documentation is incomplete and inconsistent. Trying out lots of different things has led to the following conclusions –

  • Despite what the documentation says there are only two types that matter — “d” for discrete and “c” for continuous.
  • In statistics an integer is discrete, but DoWhy produces some very odd results if integers are declared as “d”. Based on poring over the DoWhy documentation and examples, my conclusion is that integers need to be declared as “c” for continuous.
  • Inside the DoWhy source code there is a method called infer_variable_types, but it is stubbed out with no code, so I have written my own implementation, available as a static method in DirectedAcyclicGraph.infer_variable_types() (one plausible approach is sketched below).
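Here is a minimal sketch of what such a helper might look like, following the conclusions above. This is my own illustration, not the author’s gist implementation; treating binary flags as discrete and all other numeric columns as continuous is an assumption:

```python
import pandas as pd

def infer_variable_types(df: pd.DataFrame) -> dict:
    """Map each column to DoWhy's 'd' (discrete) or 'c' (continuous) type."""
    types = {}
    for col in df.columns:
        if df[col].nunique() <= 2 or not pd.api.types.is_numeric_dtype(df[col]):
            types[col] = "d"  # binary flags and categoricals are discrete
        else:
            types[col] = "c"  # other numerics, including integers, as continuous
    return types
```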

Here is what the all-important causal.do method parameters mean –

  • x={"received_training": 1} is saying what it is we want to “do”. In this case we want to see what happens if everyone were forced to take the training which is represented in the data by received_training=1.
  • outcome="earnings" – this is the outcome or effect we are looking for, i.e. what is the effect of “doing” received_training=1 on the individual’s earnings?
  • dot_graph=training_model.gml_graph is informing the do operator of the causal relationships we believe exist in the data. training_model is an instance of my DirectedAcyclicGraph class and I have given it a property which spits out the structure in gml format –
  • The do method requires either common_causes or dot_graph to be passed in to describe the causal relationships.
  • The dot_graph parameter will accept a structure in either dot or gml format but this is not mentioned anywhere in the documentation; gml is much better in my opinion as it is used everywhere else in DoWhy.
  • Specifying a graph is much better than setting common_causes as a graph can capture any type of structure whereas common_causes is much more limiting. Again this is not mentioned anywhere in the DoWhy documentation.
  • The variable_types parameter has already been explained.
  • proceed_when_unidentifiable=True avoids an annoying user prompt that interrupts the calculation.

The causal.do method returns a new DataFrame that effectively simulates a forced intervention and provides the data that would have been collected had everyone done the training –

Image by Author

DoWhy is different from most of the other Python causal libraries in this respect, as most of them just return a number rather than a DataFrame.

Returning a DataFrame is initially a bit confusing but dig a little deeper and it is a powerful, flexible and informative approach.

Without knowing the details of the internal implementation my conclusion is that DoWhy is simulating a Randomized Control Trial (RCT) by sampling the data based on the groups that need to be used to “de-confound” the mixing effect described earlier in the article.

For example, take a look at a comparison of the following feature across the original observed data and the new intervention data –

Image by Author

Clearly DoWhy has resampled the intervention data for this feature in a very different way.
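A comparison like that can be reproduced for any feature with a pair of histograms; education_years below is my arbitrary choice, as the article does not name the feature shown in its chart:

```python
import matplotlib.pyplot as plt

# Compare a feature's distribution in the observed vs intervention data.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
df["education_years"].hist(ax=axes[0])
axes[0].set_title("Observed data")
df_do["education_years"].hist(ax=axes[1])
axes[1].set_title("After do(received_training=1)")
plt.show()
```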

Now all that remains is to interpret the true impact of the training on earnings by taking a look inside the df_do DataFrame …

Image by Author
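A minimal sketch of that headline comparison, assuming df_do is the DataFrame returned by causal.do above:

```python
# Observed mean earnings for the trained group versus the mean earnings
# under the simulated intervention.
observed_mean = df.loc[df["received_training"] == 1, "earnings"].mean()
do_mean = df_do["earnings"].mean()
print(f"Observed (trained):            {observed_mean:,.0f}")  # ~6,067 in the article
print(f"Under do(received_training=1): {do_mean:,.0f}")        # ~7,392 in the article
```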

The traditional approach of using probabilities on observational data suggested that those attending the training would actually have a lower salary than if they had not bothered with the training.

The mean salary of those trained from the observational data is $6,067 whilst the causal inference “do” approach to simulating an intervention revealed the true impact — a salary increase and a mean salary of $7,392.

Instead of cancelling the training programme the advice after applying causal inference approaches would be to expand the training programme because it is providing more equitable opportunities for groups that need help to increase their long-term earnings.

I promised something amazing by the end of the article and if this result has the same revelatory impact on you as it did for me then I hope it has lived up to that promise.

Whenever there are causal effects at work in data, traditional predictive approaches can lead to the wrong conclusions and recommendations and this makes causal inference an essential tool for all data scientists to have in their tool bag.

If you enjoyed this article please consider …

Joining Medium with my referral link (I will receive a proportion of the fees if you sign up using this link).

Subscribing to a free e-mail whenever I publish a new story.

Taking a quick look at my previous articles.

Downloading my free strategic data-driven decision making framework.

Visiting my data science website — The Data Blog.

LaLonde dataset –

  • Citation: LaLonde, Robert J, 1986. “Evaluating the Econometric Evaluations of Training Programs with Experimental Data,” American Economic Review, American Economic Association, vol. 76(4), pages 604–620, September.


