
Using Causal ML Instead of A/B Testing
by Samuele Mazzanti, November 2022



In complex environments, Causal ML is a powerful tool: it is more flexible than A/B testing, and it doesn’t require a pre-designed experiment

[Image by Author]

Counterfactual questions are among the most important in business. I hear companies asking this kind of question all the time.

“We took this action. Afterward, the average user spending was $100. But how do we know what users would have spent if we hadn’t taken it?”

These problems are usually addressed through A/B testing. However, A/B tests come with several requirements, among them not running too many tests at the same time.

But real organizations are extremely messy, with different processes going on continuously. So, it is often impossible to assume that, while you run a test, everything else is being held constant.

This is why, in this article, I will go through a different tool for addressing counterfactual questions: Causal Machine Learning. The benefits of Causal ML are that it is much more flexible and, most importantly, that it can be used ex-post, without the need for a prior experiment design.

Let’s take a simple example. Say we have 4 users. We would like to know if giving them a discount will lead them to spend more money on our platform.

This is like asking to observe two parallel universes.

In one universe, all the users get the discount. In the other universe, none of the users get the discount. Note that the two universes are exactly the same until the moment we give (or do not give) the discount to the users.

Imagine we can actually observe the two universes. In the following image, we can see the result of our experiment:

Testing a marketing action in the multiverse: in one universe, all the users are given a discount. In another universe, none of them gets the discount. [Image by Author]

In the first universe, the users spent $120 overall, whereas in the second universe they spent a total of $100. Thus, we can conclude that the discount made the users spend $5 more on average ($120 minus $100, divided by 4 users).

Great, the discount works!

Unfortunately, in practice, we don’t have the privilege of observing different universes at the same time. So we have to find a different way. This way is provided by A/B testing.

An A/B test is a smart trick to generate different universes within our single universe. In an A/B test, we split our users into two groups: one called the treatment group and the other the control group. We then give the discount only to the treatment group.

Assuming that the two groups are “similar” enough and that the two groups don’t influence each other, this is very similar to observing two different universes.

A/B testing, aka testing a marketing action in a single universe. [Image by Author]

Thus, all we have to do is compare the average user spending in the treatment group ($30) to the average user spending in the control group ($25) to conclude that the discount made the users spend $5 more on average ($30 minus $25).
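
As a quick illustration, the A/B comparison boils down to a difference in group means (the spending values below are made up; a real analysis would also check statistical significance, e.g. with a t-test):

import numpy as np
from scipy import stats

treatment_spend = np.array([35.0, 25.0, 32.0, 28.0])  # hypothetical values, mean $30
control_spend = np.array([22.0, 28.0, 24.0, 26.0])    # hypothetical values, mean $25

effect = treatment_spend.mean() - control_spend.mean()  # estimated lift: $5
t_stat, p_value = stats.ttest_ind(treatment_spend, control_spend)
print(f"estimated effect: ${effect:.2f}, p-value: {p_value:.3f}")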

So far, so good: A/B testing works just fine.

But one of the requirements of A/B tests is not running too many tests at the same time, because they can “contaminate” each other’s outcomes.

Real organizations are incredibly messy and real processes often don’t respect the assumptions required by A/B testing.

What I see happening in real companies is that different teams run different marketing actions on the same users. Then they ask you (the data scientist) to assess the efficacy of the actions they took.

So you analyze the data and you find out that:

  1. some teams forgot to keep aside a control group;
  2. some teams kept aside a control group that is not similar enough to the treated group;
  3. some teams kept aside a control group that is too small;
  4. different teams sent conflicting marketing campaigns (e.g. retention campaigns and up-selling campaigns) to the same users;
  5. some teams sent a marketing campaign to users belonging to the control group of another team.

How can we deal with such messy processes while keeping a solid statistical foundation?

Causal inference offers a framework that adapts very well to our situation.

We have a set of covariates, i.e. a set of information about our users. The marketing teams make decisions based on this information. These decisions take the form of marketing actions (a.k.a. treatments). For instance, they may decide to send a discount to users who didn’t buy anything over the last 3 months.

The combination of covariates and treatments generates the outcomes, which are the KPIs we are interested in. In this case, this may be users’ spending.

Let’s recap this process in a diagram:

Causal diagram. [Image by Author]
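
To make the diagram concrete, here is a hypothetical simulated data-generating process in which a covariate drives the treatment assignment and both drive the outcome (all numbers are made up for illustration):

import numpy as np

rng = np.random.default_rng(42)
n_users = 1_000

# covariate: months since the user's last purchase
months_inactive = rng.integers(0, 12, n_users)

# treatment: the marketing team targets inactive users with a discount
got_discount = (months_inactive > 3).astype(int)

# outcome: spending depends on both the covariate and the treatment
spending = 50 - 2 * months_inactive + 10 * got_discount + rng.normal(0, 5, n_users)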

Let’s now imagine a small dataset. It is made of 3 tables, one for each causal element: covariates live in the users table, treatments (actions) in the campaigns table, and outcomes in the purchases table.

A toy dataset. [Image by Author]
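
Here is a minimal sketch of what the three tables might contain (column names and values are assumptions made for illustration, mirroring the example below):

import pandas as pd

# covariates: one row per user
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "country": ["FR", "DE", "FR"],
})

# treatments: one row per campaign sent to a user
campaigns = pd.DataFrame({
    "campaign": ["A", "A", "B"],
    "user_id": [1, 2, 1],
    "sent_on": pd.to_datetime(["2022-09-15", "2022-09-15", "2022-10-04"]),
})

# outcomes: one row per purchase
purchases = pd.DataFrame({
    "user_id": [1, 3],
    "date": pd.to_datetime(["2022-10-18", "2022-09-30"]),
    "amount": [75.0, 40.0],
})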

We can visualize each user’s journey on a timeline:

Customer journey of each user. [Image by Author]

Let’s take Alice. She was targeted by campaign A on the 15th of September and by campaign B on the 4th of October. She then made a purchase of $75 on the 18th of October.

Now, how do we know whether that purchase was due to campaign A, campaign B, or a combination of the two? But most of all, how do we know whether she would have purchased anyway if we didn’t send her any campaign?

To address this problem, we need to make a simplification that is often used in machine learning. We need to choose a point in time (let’s call it the cutoff point) and assume that everything that happened before that moment (actions) caused everything that happened after it (outcomes).

Time dimension of predictive models. [Image by Author]

Now that we have agreed on this, we just have to decide how long the two periods should be. This depends heavily on the business, and adjustments can be made (for instance, weighting events differently depending on how far they are from the cutoff point, as sketched below).
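
Such a weighting might, for instance, decay exponentially with distance from the cutoff. A hypothetical sketch (the 30-day time scale is an arbitrary assumption):

import numpy as np
import pandas as pd

cutoff = pd.Timestamp("2022-09-15")
event_dates = pd.to_datetime(["2022-09-10", "2022-08-25", "2022-07-01"])

# events closer to the cutoff get a weight closer to 1
days_before_cutoff = (cutoff - event_dates).days
weights = np.exp(-np.asarray(days_before_cutoff) / 30)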

For the sake of simplicity, let’s assume that both the actions period (yellow band in the image) and the outcomes period (blue band) are one month long.

With this approach, we can go back to our data. We know that campaign A was sent on the 15th of September. Thus, we can take this day as the cutoff point for all the users.

Yellow background: observation of actions. Blue background: observation of outcomes. [Image by Author]

Note that when we observe the outcomes (blue background), we simply ignore the actions. All that matters in that window is the outcomes.

We can repeat this process taking the sending date of campaign B as the cutoff point. In this case, we will have the following:

Yellow background: observation of actions. Blue background: observation of outcomes. [Image by Author]

Putting covariates, treatments, and outcomes together leads us to rearrange our initial data into the following form:

Our dataset after data preparation. [Image by Author]

So now we have a set of independent variables (matrix X) made of covariates and treatments and one target variable (vector y) containing the outcome.
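
A minimal sketch of this preparation step, using the toy tables above and the 15th of September as the cutoff (one month of actions before, one month of outcomes after; all names are assumptions):

cutoff = pd.Timestamp("2022-09-15")

# treatments: one 0/1 column per campaign sent in the month before the cutoff
sent = campaigns[campaigns["sent_on"].between(cutoff - pd.DateOffset(months=1), cutoff)]
treatments = (
    sent.assign(flag=1)
    .pivot_table(index="user_id", columns="campaign", values="flag", fill_value=0)
    .add_prefix("campaign_")
)

# outcomes: total spending in the month after the cutoff
spend = purchases[purchases["date"].between(cutoff, cutoff + pd.DateOffset(months=1))]

# X: covariates joined with the treatment flags; y: outcome per user
X = (
    users.set_index("user_id")
    .drop(columns="name")  # identifiers are not features
    .join(treatments)
    .fillna(0)  # users who got no campaign get a 0 flag
)
X["country"] = X["country"].astype("category")  # LightGBM handles categoricals natively
y = spend.groupby("user_id")["amount"].sum().reindex(X.index, fill_value=0.0)

# reset to a default 0..n-1 index (the KFold split below yields positional indices)
X, y = X.reset_index(drop=True), y.reset_index(drop=True)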

With these two ingredients, we are now able to train any machine learning model.

Note that the model can learn not only the relation between the treatments and the target variable (as would happen in an A/B test) but also the interactions between the treatments and the covariates. For instance, it may learn that French users are more responsive to campaign A, while German users are more responsive to campaign B.

Until now, we have treated this problem as a typical machine learning problem, but you may object that we don’t have anything to predict in this case. So, why did we use ML in the first place?

Our initial purpose was to answer questions such as:

What would have happened if we didn’t send the campaign to any user? What would have happened if we sent the campaign to all the users?

Causal ML allows us to answer counterfactual questions. Or — if you want to put it in even more high-sounding terms — Causal ML allows simulating different universes.

For example, let’s say we want to simulate two scenarios:

  • Universe A: What would have happened if we didn’t send the campaign to any user?
  • Universe B: What would have happened if we sent the campaign to all the users?

Simulating these scenarios means changing the value of the treatment variables in the predictor matrix. I want to stress this point because it is very important: we can only change treatment variables, never covariates.

Graphically,

Simulation of different scenarios using Causal ML. [Image by Author]

Do you see how flexible this approach is? We can simulate practically any scenario, on any subset of users we may be interested in.

At this point, you may have a question: how do we make predictions on the same data on which we trained the model? Isn’t that the first “don’t” of machine learning?

It’s enough to split the dataset into folds and train a different model for each fold, exactly as you would do in cross-validation:

import pandas as pd
from sklearn.model_selection import KFold
from lightgbm import LGBMRegressor

# initialize the number of folds and a dictionary to store the folds
n_folds = 5
folds = {fold: dict() for fold in range(n_folds)}

# for each fold...
for fold, (ix_train, ix_test) in enumerate(KFold(n_splits=n_folds).split(X=X)):

    # ... store the test index and the model trained on the remaining folds.
    # note: KFold yields positional indices, so .loc works here only because X
    # has a default 0..n-1 index; with a custom index, use .iloc instead
    folds[fold]["ix_test"] = ix_test
    folds[fold]["model"] = LGBMRegressor().fit(
        X=X.loc[ix_train, :],
        y=y.loc[ix_train],
    )

Now, for each fold, we have a model that was trained out-of-sample (i.e. on the remaining folds). At this point, we can use these models to predict what would happen in each of the two universes:

# make counterfactual datasets for campaign A
X_zeros = X.replace({"campaign_A": {1: 0}})  # universe A: nobody gets the discount
X_ones = X.replace({"campaign_A": {0: 1}})   # universe B: everybody gets the discount

pred_zeros = pd.Series(index=X.index, dtype=float)
pred_ones = pd.Series(index=X.index, dtype=float)

# for each fold, use its model to make predictions on the test individuals and store them
for fold in folds.keys():

    ix_test = folds[fold]["ix_test"]
    model = folds[fold]["model"]

    pred_zeros.loc[ix_test] = model.predict(X_zeros.loc[ix_test, :])
    pred_ones.loc[ix_test] = model.predict(X_ones.loc[ix_test, :])

pred_ones answers the question “what would have happened if we sent campaign A to all users, everything else being held constant?”

pred_zeros answers the question “what would have happened if we didn’t send campaign A to any user, everything else being held constant?”

Now that we have pred_zeros and pred_ones we can calculate pretty much everything we want regarding campaign A: the average treatment effect in a specific subgroup, the median treatment effect, or any other measure we may be interested in.

For example, the average treatment effect can be computed as:

ate = (pred_ones - pred_zeros).mean()
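
The same two prediction vectors also give us subgroup and distributional effects. For instance, assuming a hypothetical country covariate in X:

# average effect on French users only (a conditional average treatment effect)
cate_france = (pred_ones - pred_zeros)[X["country"] == "FR"].mean()

# median effect across all users
median_effect = (pred_ones - pred_zeros).median()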

Note that the same logic we have seen above may be applied to binary response variables (for instance: did the user buy, yes or no?). In that case, remember to calibrate the models, to make sure they predict actual probabilities.
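
A minimal sketch of that binary variant for a single train/test split (reusing ix_train and ix_test from the fold loop above; the model choice, the calibration method, and the y_binary target are assumptions for illustration):

from lightgbm import LGBMClassifier
from sklearn.calibration import CalibratedClassifierCV

# y_binary would be 1 if the user bought in the outcome window, 0 otherwise
model = CalibratedClassifierCV(LGBMClassifier(), method="isotonic", cv=3)
model.fit(X.loc[ix_train, :], y_binary.loc[ix_train])

# calibrated purchase probabilities under the two counterfactual universes
p_ones = model.predict_proba(X_ones.loc[ix_test, :])[:, 1]
p_zeros = model.predict_proba(X_zeros.loc[ix_test, :])[:, 1]
uplift = p_ones - p_zeros  # estimated effect on purchase probability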

Real organizations are complex, and many different processes constantly involve the same users. In this situation, the requirements on which A/B tests rely are not necessarily met.

In this case, Causal ML can be used to answer counterfactual questions. Causal ML has many advantages: it is flexible, it is model-agnostic, and it allows the simulation of practically any scenario.

Thank you for reading! I hope you enjoyed this article. If you’d like, add me on LinkedIn!

