
Geo Lift Experiments II: Spotify Blend Case Study | by Kaushik Sureshkumar | Jul, 2022



This is part 2 of a 2 part series which looks at geo lift experiments in a product context:

  1. Understanding Difference in Differences
  2. Spotify Blend Case Study

In this part we’re going to apply what we explored in part 1 to a case study. Let’s dive in.

Photo by Deepak Choudhary on Unsplash

Product Context: Spotify Blend

Let us assume we’re a PM on the Blend Playlist Experience team at Spotify. For anyone unfamiliar, Blend is a feature which allows a user to create a playlist with a friend of their choosing. The playlist is then auto-generated with songs that both the user and their friend like. It also shows the user a taste match score between their music taste and the friend’s. As the user listens to new music this taste match score changes, and the recommendations provided by the playlist get better.

As a PM on this feature, let’s assume we discover that users are having some friction with the experience of setting up the blend. When inviting friends to the blend, users are shown a native share modal to share an invite link with friends (as shown below). When the friends click on the link shared with them, they are added to the blend. Let’s assume that, looking at the data, we see a considerable drop-off between users clicking on the Invite button and their friends successfully accepting their invite. We don’t have much data on exactly why this is: it could be that users aren’t following through with sending the invite through the sharing modal, or it could be that the friends aren’t responding to the invite message.

The native sharing modal when you click on the Invite button on the Spotify Blend feature. (Image by Author)

Hypothesis

We hypothesise that by keeping this experience within the Spotify app, users are more likely to successfully invite their friends and create a blend. We want to change this part of the feature to allow users to search for other users by username (or pick from a dropdown of followers/following) in order to add them to the blend. Being a data-driven PM, we’d like to run this as an experiment to validate our assumptions and hypothesis.

Issues

Since interaction between multiple users is critical to the feature, the way one user engages with the feature is not independent of the way another user engages with it. As such, if we were to run this as a traditional AB test, with each user as the randomisation unit, we’d be violating SUTVA (the Stable Unit Treatment Value Assumption, which requires that one unit’s outcome is unaffected by another unit’s treatment assignment). From a user perspective, if a user could search and invite their friend within Spotify, but their friend was in the control group and couldn’t see the invitation, that would disrupt their experience and impact the way they use the feature.

One way to solve this could be to break the network up into clusters of users who follow each other, and perform the random assignment at the cluster level. However, with this being one of the first multi-user experiences at Spotify, let’s assume there isn’t any infrastructure in place to run this type of experiment. Setting up that infrastructure would be costly in terms of development resources, and we’d rather be resourceful and iterate faster.

Let’s assume that, through some exploratory data analysis, we discover that 98% of users only create blends with other users from their own geography. By restricting our feature change to a specific geography we get very close to re-establishing SUTVA. As such, this seems like an excellent candidate for a geo lift experiment. We acknowledge that the results of the geo lift experiment aren’t as strong an indicator of causality as if we were to run a social cluster experiment, or if 100% of our users only created blends with users from their own geo. But we think we’d get a decent indication of the impact of this change by running it as a geo lift experiment.

Our control geo will be geo_1 and our treatment geo will be geo_2. We will only release the change in geo_2, where we believe it will increase the likelihood that a user who clicks on the invite button has the invite accepted by at least 1 friend. As such, our success metric will be a conversion metric representing whether a user successfully invited at least 1 friend to the blend or not.

Confounders

Let’s assume that we’ve chosen geo_1 and geo_2 carefully so that any confounders we can control for are applied consistently across the two geographies. Any user acquisition (UA) campaigns that may have an impact on the metrics for this feature are run across both geographies.

Now that we’re clear on the context, let’s assume we’ve run this experiment for some time and have collected enough data. So let’s dive into the analysis.

Analysis

Before getting stuck into the analysis, it’s worth noting that the data used here is artificially generated, and so is already in the ideal format for this exercise. In the real world, it’s likely that we’d need to perform some ETL operations in order to get the data in this format. However this is outside the scope of this post. The data is also not representative of actual Spotify user engagement data.

Let’s import our dependencies and explore the experiment data.

import pandas as pd
import numpy as np
import pymc3 as pm
import theano.tensor as tt
import matplotlib.pyplot as plt
import arviz as az

experiment_data = pd.read_csv('experiment_data.csv')
experiment_data.head()

We see that the data contains a hash representing the userId, which group the user was in, and a converted column representing whether a user’s invite was successful or not. Let’s go ahead and summarise the conversion rates of the different groups, as sketched below.
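The original post doesn’t show this summary step, so here is a minimal sketch of how it might look, assuming the columns are named group and converted as described above:

# Hypothetical summary of conversions per group; assumes 'group' and 'converted' columns.
group_summary = (
    experiment_data
    .groupby('group')['converted']
    .agg(users='count', conversions='sum', conversion_rate='mean')
)
display(group_summary)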

We will model the conversion rate of each group as a variable with a Beta distribution, as we’ve done before when doing Bayesian modelling of conversions. Rather than estimating the parameters of each Beta distribution by inspection, we will model them as uniformly distributed variables between 0 and the respective group sample size. So let’s start off by doing that:

diff_model = pm.Model()

with diff_model:
    geo_1_pre_a = pm.Uniform('geo_1_pre_a', lower=0, upper=1367)
    geo_1_pre_b = pm.Uniform('geo_1_pre_b', lower=0, upper=1367)

    geo_1_post_a = pm.Uniform('geo_1_post_a', lower=0, upper=1893)
    geo_1_post_b = pm.Uniform('geo_1_post_b', lower=0, upper=1893)

    geo_2_pre_a = pm.Uniform('geo_2_pre_a', lower=0, upper=1522)
    geo_2_pre_b = pm.Uniform('geo_2_pre_b', lower=0, upper=1522)

    geo_2_post_a = pm.Uniform('geo_2_post_a', lower=0, upper=1408)
    geo_2_post_b = pm.Uniform('geo_2_post_b', lower=0, upper=1408)

Now that we’ve modelled our distribution parameters, we can model the distribution of each conversion rate.

with diff_model:
    geo_1_pre_cr = pm.Beta('geo_1_pre_cr', alpha=geo_1_pre_a, beta=geo_1_pre_b)
    geo_1_post_cr = pm.Beta('geo_1_post_cr', alpha=geo_1_post_a, beta=geo_1_post_b)
    geo_2_pre_cr = pm.Beta('geo_2_pre_cr', alpha=geo_2_pre_a, beta=geo_2_pre_b)
    geo_2_post_cr = pm.Beta('geo_2_post_cr', alpha=geo_2_post_a, beta=geo_2_post_b)

Having modelled each conversion rate, we can now model the differences between the geos before and after the release of the feature, along with the difference in differences, which is our uplift.

with diff_model:
    diff_pre = pm.Deterministic('diff_pre', geo_2_pre_cr - geo_1_pre_cr)
    diff_post = pm.Deterministic('diff_post', geo_2_post_cr - geo_1_post_cr)
    diff_in_diff = pm.Deterministic('diff_in_diff', diff_post - diff_pre)

Now that we’ve modelled our difference in differences hierarchically, we can add our observed values. But before that, we might as well model the lift within each geo too, to help us separate the lift due to the feature from the lift due to confounders.

with diff_model:
    diff_geo_1 = pm.Deterministic('diff_geo_1', geo_1_post_cr - geo_1_pre_cr)
    diff_geo_2 = pm.Deterministic('diff_geo_2', geo_2_post_cr - geo_2_pre_cr)

Finally we add our observed conversions to the model and then sample from it. We model these conversions as Bernoulli variables which use the previously modelled conversion rates as the probability of conversion.
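The code below reads the observed outcomes from a conversion_values dictionary, which the post doesn’t construct explicitly. A minimal sketch of how it might be built, assuming the group column takes the values geo_1_pre, geo_1_post, geo_2_pre and geo_2_post, could be:

# Hypothetical construction of the observed 0/1 conversion arrays per group;
# assumes 'group' contains the four geo/period labels used below.
conversion_values = {
    group: experiment_data.loc[experiment_data['group'] == group, 'converted'].values
    for group in ['geo_1_pre', 'geo_1_post', 'geo_2_pre', 'geo_2_post']
}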

with diff_model:
    geo_1_pre_conversions = pm.Bernoulli('geo_1_pre_conversions', p=geo_1_pre_cr, observed=conversion_values['geo_1_pre'])
    geo_1_post_conversions = pm.Bernoulli('geo_1_post_conversions', p=geo_1_post_cr, observed=conversion_values['geo_1_post'])
    geo_2_pre_conversions = pm.Bernoulli('geo_2_pre_conversions', p=geo_2_pre_cr, observed=conversion_values['geo_2_pre'])
    geo_2_post_conversions = pm.Bernoulli('geo_2_post_conversions', p=geo_2_post_cr, observed=conversion_values['geo_2_post'])

    trace = pm.sample()

Once the model has been sampled we can plot the sampled distributions.

with diff_model:
    az.plot_trace(trace, compact=False)

Looking at the diff_in_diff graph it seems very likely that the uplift is greater than 0, with the highest probability of uplift being around 5%.
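One way to quantify “very likely” is to compute the share of posterior samples in which the uplift is positive. A quick sketch, assuming pm.sample() returned a standard pymc3 MultiTrace as above:

# Fraction of posterior samples where the diff-in-diff uplift is greater than 0.
prob_positive = (trace['diff_in_diff'] > 0).mean()
print(f'P(uplift > 0) = {prob_positive:.2f}')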

We can now inspect the summary stats of all our modelled variables and see where their 95% credible intervals lie.

with diff_model:
    display(az.summary(trace, kind='stats', round_to=2, hdi_prob=0.95))

In particular we want to look at the stats for the last 3 variables: diff_in_diff, diff_geo_1 and diff_geo_2.

It looks like our diff_in_diff has a mean of 5% with the 95% credible interval lying between 0–9%. So while it’s likely that the change did impact how successful the user’s invites were, it’s also likely that it was only a small impact.

Looking at our control geo, diff_geo_1 suggests a mean difference of 5% between the periods before and after releasing the change in geo_2, with the 95% credible interval lying between 2–9%. Since we didn’t release the feature change in this geo, it’s likely that this change is due to confounders.

Conclusion

While it looks like the change we wanted to test is likely to have had an impact on how successful the user’s invites were, the impact is likely to have been quite small. If we were happy with this uplift we could roll the feature out to multiple geographies and keep monitoring the success metrics. However if we feel like the impact is too small, we could re-evaluate our assumptions and hypotheses to ideate other ways of improving the user experience.


