
How to Select the Right Statistical Tests for Different A/B Metrics

by Mintao Wei | Aug 2022



A Discussion of the go-to methods for 5 Types of A/B Metrics

This article groups A/B test evaluation metrics into 5 categories and outlines the suggested statistical test for assessing significance in each case, as summarized in the table below.

Table by Author
  1. Why Should We Care
  2. User Average Metrics
  3. User-level Conversion Metrics
  4. Pageview-level Conversion Metrics
  5. Percentile Metrics
  6. SUM Metrics
  7. Summary and Practical Tips
  8. Note

While t-tests are powerful, they are not universally applicable: the data world is populated by business metrics whose distributions differ significantly from the nice uni-modal normal distribution. For instance, the number of gifts sent or shares made per user is usually highly skewed, with serious outliers. This is because user behavior does not follow tidy statistical rules; it is emotional and often characterized by extreme actions.

As data scientists, it is our duty, and also where our value lies, to delve deeper into the appropriate testing methods for different business indicators that make scientific sense from the perspective of statistics. We shall always remind ourselves to check the sampling distribution before proceeding with the t-tests.

In the sections that follow, I discuss the appropriate testing methods for each of the 5 metric categories. In short, the two key aspects in determining which method to use are statistical and experimental. Specifically, we would like to ask ourselves two questions:

  1. Does the sampling distribution match the assumption (e.g. independence, normality, etc.) of the proposed test methods?
  2. Does the randomization unit align with the analysis unit?
Picture by Towfiqu Barbhuiya on Unsplash

1. User Average Metrics (e.g. like/user, staytime/user, etc)

Suggested Test: Welch’s T-Test if the metric is not heavily sparse and skewed. Otherwise, choose Mann-Whitney Rank Sum U-test

What are user average metrics?

User average metrics are indicators that average total business performance over the number of unique users, such as the average number of likes per user and the average stay time per user during the experiment period. They are frequently used evaluation criteria in A/B tests, and comparing whether the mathematical expectations (population means) are the same between the control and experiment groups, based on the observed sample means, is one of the most common and classic statistical testing scenarios.

How to test user average metrics

The usual method for hypothesis testing on sample means is the two-sample t-test. In plain terms, the two-sample t-test determines whether the expected means of a metric in two populations are equal by comparing the sample means of the two groups. Stated candidly, if a little imprecisely, its two assumptions are:

  • Each sample observation needs to be independent and follow a normal or quasi-normal distribution (i.i.d.): The independence assumption is often satisfied because each user can be seen as an independent individual*, and the randomization unit for an experiment is usually the user (most experiments determine which group a user falls into by user ID). With independence, the Central Limit Theorem (CLT) can then be applied to derive normality of the sample mean, which satisfies this assumption for the two-sample t-test. It is worth reiterating that although the t-test is recognized as robust under the CLT — its result is generally effective even if the population distribution is not technically normal (only quasi/asymptotically normal), which is often the case in real-world problems — that robustness is bounded for distributions that depart more than moderately from normality (Bartlett, 1935; Snijders, 2011). The test becomes less trustworthy if the underlying distributions are severely skewed, multi-modal, or sparse with extreme outliers.

    As a side note, Quantile-Quantile (QQ) plots, histograms, the Shapiro-Wilk test, and the Kolmogorov–Smirnov (KS) test are common ways to check normality.
  • The variances of the evaluation metric in the two populations can be unknown, but Student's t-test assumes they are homogeneous: This is not a stringent requirement because we can employ Welch's t-test, or the unequal-variances t-test, to test for equal means. The mathematical differences between Welch's t-test and Student's t-test are mainly in the degrees of freedom and the sample variance estimates. In fact, we should always use Welch's t-test instead of Student's t-test. This is not only implied by academic articles (Delacre et al., 2017), but also suggested by experiment-platform engineers at tech companies in my experience, because pulling all the data from the database to verify that the variances are equal is computationally expensive and time-consuming, and the equal-variance claim is usually a false proposition anyway (calculating variances can be very onerous at the scale of millions or billions of rows).

Delacre et al. (2017) show that Welch's t-test provides better control of the Type 1 error rate when the assumption of homogeneity of variance is not met, and that it loses little robustness compared to Student's t-test when the assumptions are met. Therefore, they argue that Welch's t-test should be used as the default.

Equation 1: t-statistic
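As a quick illustration, a minimal Python sketch of Welch's t-test on a user average metric might look like this (the samples below are synthetic and purely hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# hypothetical likes/user samples for the control and treatment groups
control = rng.normal(loc=5.0, scale=2.0, size=10_000)
treatment = rng.normal(loc=5.1, scale=2.3, size=10_000)

# equal_var=False requests Welch's (unequal-variance) two-sample t-test
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```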

For highly skewed metrics, the nonparametric approach for two independent groups — the Mann-Whitney rank-sum U-test (MW test) — is more appropriate because it uses ranks rather than raw values to determine whether the difference in metrics is significant, which bypasses the distortion introduced by extreme values. What's more, the sensitivity (power) of the MW test holds up even though it discards information by substituting ranks for absolute values, because parametric tests (e.g., the t-test) suffer a more substantial drop in their ability to make true positive inferences when the distribution is highly skewed — the VK Team ran simulations demonstrating this idea in great detail, and I strongly recommend their Medium article (see References).

However, the downside of the MW test is that it is computationally intensive because it requires sorting the complete sample set to generate the ranks, so one should take the data size into consideration before deciding to go with the MW test.
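For completeness, a minimal sketch of the MW test in Python, again with hypothetical, heavily skewed data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# hypothetical heavily skewed metric, e.g. gifts sent per user
control = rng.lognormal(mean=0.0, sigma=1.5, size=10_000)
treatment = rng.lognormal(mean=0.05, sigma=1.5, size=10_000)

# the Mann-Whitney U test compares ranks, so extreme values cannot distort it
u_stat, p_value = stats.mannwhitneyu(treatment, control, alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {p_value:.4f}")
```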

2. User-level Conversion Metrics

Suggested Test: Proportional Z-test

What are user-level conversion metrics?

User-level conversions are metrics built from binary outcomes — whether a user retains or not, whether a user converts or not, and so on. In other words, each user contributes exactly one observation, an indicator of either 1 or 0, making user-level conversion metrics essentially Binomial proportion statistics (# converted users / # all users).

How to test user-level conversion metrics?

According to the Central Limit Theorem (CLT), we can safely approximate the distribution of the Binomial proportion statistic with a normal distribution. Another way to think about this is that dividing the Binomial count by the number of users is just taking an average of 0/1 outcomes, and the CLT asserts normality for the distribution of a sample mean when the number of users is sufficiently large. As a result, the z-test becomes a natural candidate.

Equation 2: Binomial Proportion

The assumption for the normal approximation is usually satisfied, compared with the user average metrics discussed above, because the underlying sampling distribution is Binomial rather than drastically skewed, thanks to the nice properties of binary Bernoulli events. Additionally, the number of users is generally large enough to support a reasonable approximation (a common rule of thumb is N·p > 5 and N·(1−p) > 5; see the references for more details).

It is worth noting that the Central Limit Theorem cannot be trusted and the z-test cannot be applied if the sample values in the experiment are not independent and identically distributed (i.i.d.). This is not a concern in user-level-metric scenarios because by default the randomization unit is the user, and we believe every person behaves individually. However, this independence assumption might not hold when the metrics are aggregated at a more granular level than users, such as the pageview-level conversion metrics in section three below.

Equation 3: z-statistic
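A minimal sketch of the two-sample proportion z-test, assuming hypothetical converted/total user counts and using statsmodels:

```python
from statsmodels.stats.proportion import proportions_ztest

# hypothetical counts: converted users and total users in control vs. treatment
converted = [1320, 1405]
totals = [10_000, 10_000]

z_stat, p_value = proportions_ztest(count=converted, nobs=totals, alternative="two-sided")
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")
```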

3. Pageview-level Conversion Metrics

Suggested Test: Delta Method + T-Test

What are some pageview-level conversion metrics?

One common example is the click-through rate (CTR), defined as # click events / # view events.

What are the problems of simply using t-tests here?

It is usually complicated to analyze the statistical significance of pageview-level conversion metrics, such as click-through rate (CTR), in A/B experiments. The essential problem is that the unit of analysis (event level — click events and impression events) differs from the unit of randomization (user level), which can invalidate our t-test and lead us to underestimate the variance.

Specifically, when we calculate the p-values for CTRs between the experiment and control groups, we are essentially aggregating all the click events, dividing by all the impression events of all the users in each group, and then t-testing the difference. This process carries an inherent assumption that the data samples are drawn independently. Put differently, since each sample observation is a view event, we need all the view events to be independent. Otherwise, we cannot use the CLT to assert normality for our mean statistics.

However, this independence assumption is violated. In A/B experiments where we randomize by user (usually the default option, because each user would have an inconsistent product experience if we randomized by session), a user can have multiple click/view events, and common sense says these events are correlated — at most we can assert that different individuals behave independently, not that the same user's behavior today and tomorrow is uncorrelated. As a result, the plain sample variance is no longer an unbiased estimate of the variance of our mean statistic — the true variance could be higher than our estimate because a covariance term steps in that our calculation misses.

Therefore, for metrics at event-level granularity, sticking to the traditional t-test inevitably underestimates the variance. The direct consequence of underestimating the sample variance is false positives — the true, higher variance can drive the sample means away from the null hypothesis and mislead us into low p-values even when the treatment is not effective.

How to solve this sample-dependency problem?

There are two ways to resolve this issue: (1) t-test the difference using the delta method, or (2) t-test with an empirical variance estimated via bootstrapping. Both are emphasized in Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing:

Having the randomization unit be coarser than the analysis unit, such as
randomizing by the user and analyzing the click-through rate (by page), will work, but requires more nuanced analysis methods such as bootstrap or the delta method (Deng et al. 2017, Deng, Knoblich and Lu 2018, Tang et al. 2010, Deng et al. 2011).

  • Delta Method: The delta method is generally considered the more efficient approach. The key idea is to rewrite the pageview-level CTR ratio metric as the ratio of two user-level average metrics and assert the CLT on this ratio (using a Taylor expansion and Slutsky's theorem), so that the granularity of analysis is converted from page level to user level, now consistent with our unit of diversion (the user). In this way, we restore i.i.d. because both the numerator and the denominator are averages of i.i.d. user-level samples. Using CTR as an example, AVG.clicks and AVG.views are jointly bivariate normal in the limit, making CTR, the ratio of the two averages, also normally distributed (Kohavi, Tang, & Xu, 2020). A code implementation can be found in Ahmad Nur Aziz's Medium post (see References); a minimal sketch also follows after this list.
Equation 4: Click-through Rate (CTR) Definition
Equation 5: t-statistic using the Delta Method
  • Bootstrapping: Bootstrapping is a simulation-based method to empirically estimate the variance of the sample statistic. The key idea is to repeatedly resample, within each group, the view events grouped by user ID (i.e., whole users at a time), calculate the metric on each bootstrap sample, and estimate the variance across these bootstrap statistics. Resampling whole users keeps each user's correlated events together, so the variance estimate is trustworthy.
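As referenced in the Delta Method bullet, here is a minimal sketch of the delta-method variance for a user-randomized ratio metric such as CTR, assuming arrays of per-user click and view totals as inputs:

```python
import numpy as np

def delta_ratio_variance(clicks: np.ndarray, views: np.ndarray) -> float:
    """Delta-method variance of CTR = sum(clicks) / sum(views) when randomization
    is by user; `clicks` and `views` hold one total per user, so rows are i.i.d."""
    n = len(views)
    mu_x, mu_y = views.mean(), clicks.mean()          # x = views/user, y = clicks/user
    var_x, var_y = views.var(ddof=1), clicks.var(ddof=1)
    cov_xy = np.cov(clicks, views, ddof=1)[0, 1]
    # Var(ybar/xbar) ~= (var_y/mu_x^2 - 2*mu_y*cov_xy/mu_x^3 + mu_y^2*var_x/mu_x^4) / n
    return (var_y / mu_x**2
            - 2 * mu_y * cov_xy / mu_x**3
            + mu_y**2 * var_x / mu_x**4) / n

# z-statistic for the CTR difference between treatment (t) and control (c):
# (ctr_t - ctr_c) / sqrt(delta_ratio_variance(clicks_t, views_t) + delta_ratio_variance(clicks_c, views_c))
```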
Picture by Stephen Phillips on Unsplash

However, bootstrapping is computationally expensive, as the simulation essentially manipulates and aggregates tens of billions of user logs for many iterations (complexity ~ O(nB), where n is the sample size and B is the number of bootstrap iterations). As a result, despite its flexibility, it is usually not the first go-to option at tech companies and is used more as a bullet-proof double-check benchmark for critical decisions.
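For comparison, a minimal sketch of the user-level bootstrap for the same CTR variance — far more expensive than the delta method, but making no analytic approximations:

```python
import numpy as np

def bootstrap_ctr_variance(clicks: np.ndarray, views: np.ndarray,
                           n_boot: int = 1_000, seed: int = 0) -> float:
    """Estimate Var(CTR) by resampling whole users with replacement, so that all of a
    user's correlated events stay together in each bootstrap replicate."""
    rng = np.random.default_rng(seed)
    n_users = len(views)
    ctrs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_users, size=n_users)   # draw users, not individual events
        ctrs[b] = clicks[idx].sum() / views[idx].sum()
    return ctrs.var(ddof=1)
```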

4. Percentile Metrics

Suggested Test: Central Limit Theorem (CLT) + Z-Test

What are percentile metrics?

Quantile metrics such as 95th percentile page load time are critical to A/B testing as many business aspects are characterized by edge cases and thus are better described by quantile metrics.

What are the challenges of testing percentile metrics?

Most percentile metrics, for example the 95th-percentile page load latency, are engineering performance indicators defined at page-view granularity. Consequently, the randomization unit (users) does not match the analysis unit (page views), which invalidates the independence assumption and the plain sample-variance estimate of the population variance.

How to test percentile metrics?

  • Bootstrapping: Bootstrapping, as a ‘universally applicable’ tool, can be used here to estimate the empirical variance and test the sample percentiles. Again, however, its computation is prohibitively expensive and does not scale well.
  • CLT + Proportion Z-test: There is quite a lot of math behind the scenes, but I will try to present the main ideas in simple terms (I strongly recommend Deng (2021) if you are interested). To begin with, we probe into the distribution of quantile metrics: statisticians have proved that sample percentiles are approximately normal:
Equation 6: An Illustration of Asymptotic Normality for 95th Percentile Metrics.

σ refers to the standard deviation of the indicator of whether a data point is smaller than the 95th percentile (it cannot be estimated in the ‘traditional’ Bernoulli way because the observations here are dependent), and F is the unknown probability density function of the underlying data.

Great! Since sample quantiles follow a normal distribution, we can use the z-test to calculate p-values. That is generally the direction we are headed, but there is one crucial blocker — both σ and the probability density function F are unknown and can be hard to estimate.

So our goal, for now, is to derive a good estimate of the variance term in Equation 6. There is more than one approach employed in industry. Data scientists at LinkedIn, Quora and Wish estimate the density function directly, while Microsoft and TikTok follow a novel idea introduced by Deng, Knoblich and Lu (2018) that bypasses estimating the density F. Instead, they first estimate the confidence interval for the true percentile by investigating the distribution of the Binomial proportion of observations falling below the percentile, as denoted in Equation 7:

Equation 7: Confidence Interval for 95th Percentile Metrics

Secondly, they work backward to derive the variance estimate based on the length of the approximated confidence interval.

Equation 8: Deriving Standard Deviation from Confidence Interval

Lastly, with the variance estimate in hand, we can calculate the z-statistic for the sample percentile and the corresponding p-values, just as for the average metrics.
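To make the recipe concrete, here is a simplified sketch of the rank-based confidence-interval trick (Equations 7 and 8) for a 95th-percentile metric. It assumes i.i.d. observations and omits the dependence correction for σ discussed above, so treat it as illustrative only:

```python
import numpy as np
from scipy import stats

def percentile_point_and_sd(x: np.ndarray, q: float = 0.95, alpha: float = 0.05):
    """Return the sample q-th percentile and a standard-deviation estimate backed out
    from the rank-based confidence interval, assuming i.i.d. data."""
    x = np.sort(np.asarray(x))
    n = len(x)
    z = stats.norm.ppf(1 - alpha / 2)
    se_rank = np.sqrt(n * q * (1 - q))                 # sd of the rank of the percentile
    lo_idx = max(int(np.floor(n * q - z * se_rank)), 0)
    hi_idx = min(int(np.ceil(n * q + z * se_rank)), n - 1)
    point = x[min(int(np.ceil(n * q)) - 1, n - 1)]
    sd = (x[hi_idx] - x[lo_idx]) / (2 * z)             # Equation 8: back out sd from CI width
    return point, sd

# z-statistic for the difference between two groups' percentiles:
# (p_t - p_c) / sqrt(sd_t**2 + sd_c**2)
```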

5. SUM Metrics (Not Recommended)

Suggested Test: Simulations

What are SUM metrics?

SUM metrics refer to aggregated totals such as the total number of article reads, total GMV, and total videos posted in the experiment and control groups.

What are the challenges here?

SUM metrics are usually north-star metrics for product development, but they can be difficult to test due to a lack of statistical validity and confounding errors. Specifically, a SUM metric can be decomposed into a user average metric multiplied by the user count in the experiment or control group, which means SUM metrics are affected not only by actual business fluctuations (the user average metric) but also by the inevitable random error introduced by traffic diversion:

Equation 9: Disentangling SUM Metrics

While the user average metric follows a (quasi) normal distribution and the traffic diversion follows a Binomial distribution with p = 0.5, classical statistics offers no complete characterization of the distribution of the product of the two.

How should we test SUM metrics?

We can approach this problem with simulation. Since the two distributions are known and independent, we can obtain each one's probability density function (PDF) by iteratively drawing samples. Multiplying paired draws from the two distributions then yields an empirical distribution for the SUM metric under the null. The p-value can be calculated by checking where the SUM statistic we actually observe in the treatment group falls within this simulated distribution.
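A minimal sketch of this simulation, with hypothetical numbers for the eligible user count and the per-user metric under the null:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims = 100_000
n_total = 20_000        # hypothetical users eligible for the 50/50 traffic split
mu, sd = 5.0, 2.0       # hypothetical per-user mean and sd of the metric under H0

# draw the two independent components: group size ~ Binomial(N, 0.5), and the
# group's average metric from its approximately normal sampling distribution
group_size = rng.binomial(n_total, 0.5, size=n_sims)
group_avg = rng.normal(mu, sd / np.sqrt(group_size))
sim_sum = group_size * group_avg            # simulated null distribution of the SUM metric

observed_sum = 51_500                       # hypothetical SUM observed in the treatment group
p_value = 2 * min((sim_sum >= observed_sum).mean(), (sim_sum <= observed_sum).mean())
print(f"two-sided p = {p_value:.4f}")
```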

This is a relatively daunting process with a fair amount of approximation, so it is usually not recommended. I would suggest testing the statistical significance of the product feature using the user average metric, and calculating the incremental lift in the SUM metric separately if management really cares about it.

Picture by Diego PH on Unsplash

Summary and Practical Tips

The question of which statistical test is correct is essentially asking which test has the most sensitivity, or statistical power, given the experiment and control group data.

The power is closely connected with the underlying distribution of our metrics and the independence of our samples. Therefore, it is always good practice to check the distribution skewness and think through the assumptions before proceeding with the analysis.

Below are three tips based on my personal experience and my learning from the references:

  1. When the data is skewed, the t-test is not that robust and the Mann-Whitney test is not that weak.
  2. Consider transforming the metric (e.g. log-transformation) to alleviate the skewness or even replace it with a more normally distributed one.
  3. If you don’t want to mess with the daunting statistics, another way to go is to try variance reduction techniques such as CUPED. After all, our purpose is to attain high power and trustworthy significance inferences.
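On the last tip, a minimal sketch of CUPED, assuming a pre-experiment covariate (e.g., the same metric measured for each user before the experiment) is available:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Return the CUPED-adjusted metric y - theta * (x - mean(x)), where x is a
    pre-experiment covariate correlated with y; the adjustment reduces variance
    without changing the expected treatment effect."""
    theta = np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)
    return y - theta * (x - x.mean())
```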
Note

  1. All the equations were hand-coded by the author.
  2. *Each user is an independent individual: assuming no network effect. The below piece quoted from Alex Deng (2021) explains the idea of independence assumption in the A/B test scenario very well:

In online experimentation, a user is often regarded as the basic unit of autonomy and hence commonly used as the randomization unit……A general rule of thumb says that we can always treat the randomization unit as i.i.d. We name this the randomization unit principle (RUP). For example, when an experiment is randomized by user, page-view (S. Deng et al. 2011), cookie-day (Tang et al. 2010) or session/visit, RUP suggests it is reasonable to assume observations at each of these levels respectively are i.i.d. There were no previous published work that explicitly stated RUP. But it is widely used in analyses of A/B tests by the community

  • Aziz, A.N. (2021). Applying Delta Method in A/B Tests Analysis. https://medium.com/@ahmadnuraziz3/applying-delta-method-for-a-b-tests-analysis-8b1d13411c22
  • Bartlett, M. S. (1935). The effect of non-normality on the t distribution. Mathematical Proceedings of the Cambridge Philosophical Society, 31(2), 223–231. Cambridge University Press.
  • Deng, A., Knoblich, U., & Lu, J. (2018). Applying the Delta method in metric analytics: A practical guide with novel ideas. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 233–242).
  • Deng, A. (2021). ‘8.6 Confidence Interval and Variance Estimation for Percentile metrics’. In Causal Inference and Its Applications in Online Industry. https://alexdeng.github.io/causal/abstats.html#indvar
  • Delacre, M., Lakens, D., & Leys, C. (2017). Why psychologists should by default use Welch’s t-test instead of Student’s t-test. International Review of Social Psychology, 30(1).
  • Li, Q.K. (2021). How Wish A/B tests percentiles. https://towardsdatascience.com/how-wish-a-b-tests-percentiles-35ee3e4589e7
  • Liu, M., Sun, X., Varshney, M., & Xu, Y. (2019). Large-scale online experimentation with quantile metrics. arXiv preprint arXiv:1903.08762.
  • Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. In Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing (p. I). Cambridge: Cambridge University Press.
  • Snijders, T. A. (2011). Statistical methods: robustness. Retrieved from http://www.stats.ox.ac.uk/ Statistical Methods MT2011 Snijders.
  • VK Team. (2020). Practitioner’s Guide to Statistical Test. https://vkteam.medium.com/practitioners-guide-to-statistical-tests-ed2d580ef04f#9e58
  • Xing, T and Chong, K.Z. (2018). Two-Sample Hypothesis Tests for Differences in Percentiles. https://quoradata.quora.com/Two-Sample-Hypothesis-Tests-for-Differences-in-Percentiles

