What I learned after running AB tests for one year as a data scientist — Part 2/2 | by Alex Vamvakaris | Jan, 2023


Photo by Luca Bravo on Unsplash

Welcome to the second and last part of the series! In Part 1, we covered the design of an experiment and sample size estimations. In this part of the series, we will focus on the knowledge and skills you will need to analyze AB tests. This is very much at the core of how data scientists provide value in a business. Instead of making decisions based on our subjective beliefs, we rely on data and statistical tests to steer the company in the right direction.

  • Understanding the Fundamentals: Formulating the test hypotheses and exploring type I and type II errors
  • Exploratory Data Analysis (EDA): Recruitment into the test, trends over time, and distribution of KPIs
  • Analyzing the Results of the AB Test: Explaining the central limit theorem, p-values, confidence intervals, and running parametric and non-parametric bootstrapping

We will continue using the case study from Part 1. You can explore the two versions of the AB test and the selection of KPIs in the link below 👇

Photo by Stefan Steinbauer on Unsplash

In every statistical test, there is an initial and an alternative hypothesis. The initial hypothesis assumes no difference for our two KPIs between the control (£10 off) and treatment (20% off), while the alternative hypothesis assumes that there is a difference. The initial hypothesis is often called the null hypothesis, as it represents the default or “normal” state of affairs (i.e., no difference).

Formulating the initial and alternative hypothesis for our two KPIs [Image by the author]

A tricky part to be aware of in data science interviews is how you express your objective around these hypotheses. We are trying to either reject or fail to reject the initial hypothesis. We will never be able to accept it. When communicating results to stakeholders, this distinction will rarely matter. But when talking to other data scientists (or interviewers), it will!

Let’s look at the example of black swans. Imagine a world where we have only witnessed white swans. We will never be able to accept the initial hypothesis that there are no black swans. We can collect unlimited data, and still, we might have missed that one black swan that we needed. So we can fail to reject the hypothesis that all swans are white but never accept it! On the other hand, we can always reject the initial hypothesis if we witness a single black swan. Now, granted, black swans represent rare events, which we will rarely try to measure in business scenarios (minimum detectable effects, MDEs, are usually set at around 2%), but it is still a useful way to understand how we approach AB tests as data scientists.

Okay, so we have our initial and alternative hypotheses, and we can either reject or fail to reject the initial hypothesis. That translates into four eventualities, as shown in the table below.

Type I and Type II errors [Image by the author]

Using the data we gathered from running our experiment, we will either accept (got you, didn’t I? Remember, it is fail to reject) or reject the null hypothesis. But no methodology is perfect, i.e., it is impossible to eliminate false positives and negatives entirely (the red boxes in the image above). In the words of Neyman and Pearson (1933):

We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis

That doesn’t sound very promising, does it? Fortunately, they continue by adding the following:

But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we ensure that, in the long run of experience, we shall not be too often wrong

This above is an important point to understand. In statistics, we do not deal with certainty. We will never infer anything with 100% certainty. However, we have statistical tools (like p-values and confidence intervals, more on that later) that we can use to help us make decisions. Decisions or rules, if you prefer, that in the long run will give us a certain (acceptable) probability of making a type I or type II error.

Type I Error

  • Occurs when we conclude that there is a difference, even though there is none. In other words, we reject the null hypothesis even though it is true
  • The probability of this error is called the significance level of a test and is symbolized by the Greek letter α. The significance level of a test is usually set at 5% (see the short simulation sketch below for what this means in the long run)

Type II Error

  • Occurs when we conclude that there is no difference, even though there is one. In other words, we fail to reject the null hypothesis even though it is false
  • The probability of this error is symbolized by the Greek letter β. The complement of that probability (1 − β) is called the power of a test and is usually set at 80%
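
To make the “long run” framing concrete, here is a small simulation sketch (my own illustration with made-up numbers, not part of the case study): we run many A/A tests, where both groups are drawn from exactly the same population, test each one at α = 5%, and count how often we wrongly declare a winner.

# Minimal A/A simulation: how often do we falsely reject at alpha = 0.05?
# (Illustrative sketch only; sample size and distribution are made up)
set.seed(123)

alpha <- 0.05
n_sims <- 5000
n_per_group <- 500

false_positives <- replicate(n_sims, {
  control   <- rnorm(n_per_group, mean = 25, sd = 10)  # same "population"...
  treatment <- rnorm(n_per_group, mean = 25, sd = 10)  # ...for both groups
  t.test(treatment, control)$p.value < alpha           # did we (wrongly) reject?
})

mean(false_positives)  # long-run share of false positives, close to 0.05
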
Photo by Scott Webb on Unsplash

This must be by far the step that is most often rushed or skipped altogether, despite being present in all the major frameworks. However, there are real benefits to performing an EDA, as it answers important questions we need to address before we start analyzing the results.

In this part, we just want to check that there were no issues with the allocation of visitors to control and treatment:

  • Recruitment only occurs on Fridays, as we intended, with the same volume of visitors recruited into control and treatment
  • The majority of visitors enter the test on the first Friday, and then the volume of recruits decreases. This is a well-known trend. The most engaged (existing) visitors will be recruited immediately, and then after two or three weeks, we will be mostly recruiting new visitors
Recruitment of visitors in the test by variant [Image by the author]

Next, we want to examine any trends or seasonality in our KPIs of interest. Let’s take the graph below that showcases average daily revenue (as a proxy for 28-day ARPV):

  • There is no pre-test difference between the two versions. If the two versions differed in the weeks before the test started, we would need to adjust our KPIs in the test period to correct for any pre-test biases
  • The metric is not very volatile (no extreme ups and downs from day to day)
  • Treatment is higher than the control for every Friday since we launched the test (clear spikes). The uplift is increasing as time passes (the last two Fridays have the highest uplift)
Time series of average daily revenue [Image by the author]

Finally, we want to examine the distributions of control and treatment:

  • Looking at revenue, there are no extreme outliers
  • We can clearly see a third peak for revenue in treatment (between £70 and £90). The first is at £0 and the second at £50
  • ARPV is higher for treatment by 19% (up by £4.70, £29.2 vs £24.5; see the short dplyr sketch after this list)
  • Looking at ATPV, there is also a clear uplift for treatment (more visitors making 3 or 4 transactions)
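
As referenced above, these headline figures can be reproduced with a quick dplyr summary of the test dataset. A minimal sketch, assuming the same dataset and column names (variant, revenue_test, transactions_test) used in the bootstrap code later in the article:

library(dplyr)

# ARPV and ATPV by variant
kpi_summary <- dataset %>%
  group_by(variant) %>%
  summarise(
    visitors = n(),
    arpv = mean(revenue_test),
    atpv = mean(transactions_test)
  )
kpi_summary

# % uplift of treatment over control, e.g. ARPV: (29.2 - 24.5) / 24.5 ≈ 19%
arpv_uplift <- (kpi_summary$arpv[kpi_summary$variant == "Treatment"] -
                kpi_summary$arpv[kpi_summary$variant == "Control"]) /
  kpi_summary$arpv[kpi_summary$variant == "Control"]
arpv_uplift
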
Distributions of ARPV and ATPV split by variant [Image by the author]
Photo by Tim Stief on Unsplash

As you can imagine, if we had taken a different sample of the same size, we would have observed different statistics (ARPV and ATPV). This means that a (sample) statistic is a random variable. Like any other random variable, a statistic has a distribution called a sampling distribution. You can think of the sampling distribution as the population of all possible values for the statistic if we were to take exhaustively all the possible samples of size n from a population of size N and compute the statistic for each one. Let’s see how the sampling distribution and bootstrapping will help us analyze the results of the AB test.
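
To make the idea of a sampling distribution more tangible, here is a small simulation sketch (purely illustrative, not the case-study data): we treat a skewed set of revenue-like values as the population, repeatedly draw samples of size n, and look at how the sample mean is distributed across those draws.

set.seed(42)

# A skewed, revenue-like "population" (illustrative only)
population <- rexp(100000, rate = 1 / 25)  # mean around £25

# Draw many samples of size n and record the sample mean of each
n <- 1000
sample_means <- replicate(2000, mean(sample(population, size = n)))

# The histogram of these means approximates the sampling distribution of the mean
hist(sample_means, breaks = 50,
     main = "Sampling distribution of the sample mean",
     xlab = "Sample mean")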

Let’s start with the hero of most data science interviews, the p-value. Simply put, the p-value is the probability of observing data at least as extreme as the data we collected, assuming there is in fact no difference (i.e., assuming the null hypothesis is true). The lower the p-value, the more statistically significant the difference. That is because it would have been that much more improbable (due to luck alone) to observe the given dataset if there was, in fact, no difference. If the p-value is lower than our acceptable level of statistical significance (p-value < α), we reject the null hypothesis.

If you can write a result as a confidence interval instead of a p-value, you should

Confidence intervals (CIs) can answer the same questions as p-values, with the advantage that they provide more information and are more straightforward to interpret. If you want to test whether two versions are statistically different, you can construct a 95% confidence interval (1 − α, with α = 5%) of their difference and check whether the interval includes zero. In the process, you get the added bonus of learning how precise your estimate is (a wider interval indicates higher uncertainty).
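
Before moving on to the bootstrap, note that for a simple difference in means you can get both the p-value and the 95% CI of the difference from a single Welch t-test. A minimal sketch, assuming the same dataset and column names as the bootstrap code below:

treat_rev <- dataset$revenue_test[dataset$variant == "Treatment"]
contr_rev <- dataset$revenue_test[dataset$variant == "Control"]

rev_ttest <- t.test(treat_rev, contr_rev)  # Welch two-sample t-test, 95% CI by default
rev_ttest$p.value   # reject the null hypothesis if below alpha (0.05)
rev_ttest$conf.int  # 95% CI of the difference in means (treatment minus control)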

Bootstrapping is a method for estimating the sampling distribution of a statistic and computing CIs

Ordinary bootstrapping, also known as simple bootstrapping, is the basic version of the method and the one we will be using for our analysis. It involves taking repeated samples of the same size as the original sample (with replacement) and calculating the statistic of interest for each one. We can then use all these resamples (and their respective statistics) to calculate a 95% confidence interval. This can be non-parametric, based on the percentiles (between the 2.5th and 97.5th percentiles), or we can follow a parametric approach and compute Normal confidence intervals. We will do both.
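
To see what ordinary bootstrapping actually does before we hand it over to the boot package, here is a minimal from-scratch sketch for the ARPV % difference. It resamples the two groups independently (the boot-based code below instead resamples paired rows, a minor variation) and computes both the percentile and the Normal-approximation interval, assuming the same column names as before.

set.seed(2023)

treat_rev <- dataset$revenue_test[dataset$variant == "Treatment"]
contr_rev <- dataset$revenue_test[dataset$variant == "Control"]

# 2000 bootstrap resamples of the % difference in ARPV (treatment vs control)
boot_diffs <- replicate(2000, {
  t_sample <- sample(treat_rev, replace = TRUE)  # resample with replacement
  c_sample <- sample(contr_rev, replace = TRUE)
  (mean(t_sample) - mean(c_sample)) / mean(c_sample)
})

# Non-parametric 95% CI: the 2.5th and 97.5th percentiles of the resampled statistic
quantile(boot_diffs, c(0.025, 0.975))

# Parametric 95% CI: Normal approximation using the spread of the resampled statistic
mean(boot_diffs) + c(-1.96, 1.96) * sd(boot_diffs)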

For our purposes, we will use the boot package in R. First, we will need to create the following inputs:

  • The datasets we will use in the boot function (see snapshot below)
  • A function that computes the % difference in our statistics within each sample
##########################################
# Load libraries
##########################################
library(dplyr)  # data wrangling (filter, select)
library(boot)   # boot() and boot.ci()

##########################################
# Create bootstrap datasets
##########################################
# One column per variant (equal group sizes, as per the 50/50 split)
boot_revenue <- data.frame(
  dataset %>%
    filter(variant == "Treatment") %>%
    select(revenue_treat = revenue_test),
  dataset %>%
    filter(variant == "Control") %>%
    select(revenue_contr = revenue_test)
)

boot_transactions <- data.frame(
  dataset %>%
    filter(variant == "Treatment") %>%
    select(transactions_treat = transactions_test),
  dataset %>%
    filter(variant == "Control") %>%
    select(transactions_contr = transactions_test)
)

##########################################
# Create function for bootstrap
##########################################
# d: data frame with treatment in column 1 and control in column 2
# i: row indices of the bootstrap resample (supplied by boot())
diff.means <- function(d, i) {
  (mean(d[i, 1]) - mean(d[i, 2])) / mean(d[i, 2])
}

Overview of bootstrap input datasets [Image by the author]

We will then take 2000 samples and compute 95% confidence intervals of the % difference between control and treatment for ARPV and ATPV, respectively (using the boot.ci function, with the result of the boot function as input).

##########################################
# Run bootstrap for ARPV
##########################################
boot_results_revenue <- boot(boot_revenue, diff.means, R = 2000)
plot(boot_results_revenue)
boot.ci(
  boot_results_revenue,
  type = c("norm", "basic", "perc")
)

##########################################
# Run bootstrap for ATPV
##########################################
boot_results_transactions <- boot(boot_transactions, diff.means, R = 2000)
plot(boot_results_transactions)
boot.ci(
  boot_results_transactions,
  type = c("norm", "basic", "perc")
)

Confidence intervals from ordinary bootstrapping [Image by the author]

So let’s interpret the results:

  • The 95% CI for the % difference in ARPV between treatment and control is between 17.7% and 20.6% in favor of treatment
  • The 95% CI for the % difference in ATPV between treatment and control is between 14.7% and 18% in favor of treatment
  • Both differences are significant as the 95% intervals are far from including zero

In case you are wondering why the normal 95% CI was so close to the non-parametric percentile one, there is a very well-known theorem that sheds some light.

According to the CLT, given a sufficiently large sample size from a population, the (sampling) distribution of the sample means will be approximately normal

When the underlying distribution is Normal, the sample mean is exactly Normally distributed, so we do not even need the CLT. What happens in all other cases, though? Well, the further the underlying distribution deviates from Normal, the larger the sample size we need for the approximation to hold. The boot function provides a nice plot to check the sampling distribution of your statistic, which you can see below for our ARPV % difference. Both plots show that the sampling distribution is approximately Normal, which explains why the non-parametric CIs and the Normal one are almost identical.
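
If you want to look beyond the built-in plot, the bootstrap replicates are stored in the boot object itself (in the t component, with the original statistic in t0), so you can run your own checks. A short sketch, using the ARPV results from the code above:

# Bootstrap replicates of the ARPV % difference
replicates <- boot_results_revenue$t[, 1]

# Visual checks: histogram and Normal Q-Q plot (roughly what plot() shows)
hist(replicates, breaks = 50, main = "Bootstrap replicates: ARPV % difference")
qqnorm(replicates)
qqline(replicates)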

Checking normality assumption from ordinary bootstrapping [Image by the author]

And with that last step, we have successfully run an AB test! 🚀🚀

✅ We explained basic concepts like error types I and II, p-values, confidence intervals, and the CLT

✅ We ran through the EDA checklist for an AB test

✅ We analyzed the results of our experiment using parametric and non-parametric bootstrap confidence intervals

I hope the above was useful in helping you get a step closer to your goal of landing your first data science job. I tried to cover both the theory you will need for data science interviews and the practical skills to run and analyze your own experiments like a data scientist!

If you enjoyed reading this article and want to learn more, don’t forget to subscribe to get my stories sent directly to your inbox.

On the link below, you can also find a free PDF Walkthrough on completing a Customer Cluster Analysis in a real-life business scenario using data science techniques and best practices in R. 👇
