Better Churn Prediction — using survival analysis | by Iyar Lin | Oct, 2022
Answering the “when” question
In a previous post I made the case that survival analysis is essential for better churn prediction. My main argument was that churn is not a question of “who” but rather of “when”.
In the “when” question we ask: when will a subscriber churn? Put differently, how long does a subscriber stay subscribed on average? Answering that lets us address one of the most important questions: what is the average subscriber lifetime value?
Let’s roll up our sleeves and dive right in: The survival curve S(t) measures the probability that a subscriber will “survive” (not churn) until time t since starting his subscription. For example, S(3)=0.8 means a subscriber has an 80% chance of not churning by the 3rd month of subscription.
The most common way of estimating S(t) is the Kaplan-Meier curve, whose formula is given by:
S(t) = \prod_{i: t_i \le t} (1 - d_i/n_i)
where t_i are all the times at which at least one subscriber has churned, d_i is the number of subscribers who churned at time t_i, and n_i is the number of subscribers who survived until at least t_i. We can think of the term d_i/n_i as the churn rate at time t_i.
To illustrate let’s calculate the survival curve for the following subscriber data:
The column t denotes how long a user has been subscribed as of today. If he has churned, it is the time from subscribing until he churned.
We have 2 times at which churn events happened: t_i = {2,6}.
For t < 2 we have S(t)=1 since no one churned up to that point.
At t_1=2 we have d_1=2 (subscribers 3 and 6) and n_1=5 (all subscribers but 4). Using the above formula we get:
S(2) = 1 - 2/5 = 0.6
At t_2=6 we have d_2=1 (subscriber 2) and n_2=1 (again, just subscriber 2), so S(6) = 0.6 × (1 - 1/1) = 0.
We thus have:
S(t) = 1 for t < 2
S(t) = 0.6 for 2 ≤ t < 6
S(t) = 0 for t ≥ 6
Let’s plot that curve:
One thing to notice here is that at every point along the curve we only consider subscribers who survived up to that point. If a subscriber joined very recently (e.g. subscriber 4), he is only counted in n_i for times up to his current subscription length, so he won’t play a major role in the calculation.
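To make the hand calculation concrete, here is a minimal Python sketch of the Kaplan-Meier product described above. The subscriber table is not reproduced here, so the records below are hypothetical: the churn times of subscribers 2, 3 and 6 follow the text, while the subscription lengths of subscribers 1, 4 and 5 are assumptions chosen only so that n_1=5 and n_2=1. The function name kaplan_meier is likewise just an illustrative choice.

```python
# Hypothetical records, consistent with the worked example:
# (subscriber id, months subscribed so far, churned?)
subscribers = [
    (1, 3, False),  # assumed: still subscribed after 3 months
    (2, 6, True),   # from the text: churned at t=6
    (3, 2, True),   # from the text: churned at t=2
    (4, 1, False),  # assumed: joined recently, still subscribed
    (5, 4, False),  # assumed: still subscribed after 4 months
    (6, 2, True),   # from the text: churned at t=2
]


def kaplan_meier(records):
    """Return [(t_i, S(t_i))] for every time t_i with at least one churn."""
    event_times = sorted({t for _, t, churned in records if churned})
    curve = []
    survival = 1.0
    for t_i in event_times:
        # d_i: subscribers who churned exactly at t_i
        d_i = sum(1 for _, t, churned in records if churned and t == t_i)
        # n_i: subscribers who survived until at least t_i
        n_i = sum(1 for _, t, _ in records if t >= t_i)
        survival *= 1 - d_i / n_i
        curve.append((t_i, survival))
    return curve


km_curve = kaplan_meier(subscribers)
for t_i, s in km_curve:
    print(f"S({t_i}) = {s:.2f}")
# prints:
# S(2) = 0.60
# S(6) = 0.00
```

Subscriber 4 never enters n_i here because his subscription length is below the first churn time, which is the sense in which recent joiners carry little weight.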
In practice you’d be better off using the survival curve implementation in the R survival package or the Python lifelines library.
So why go through the hassle of calculating S(t) in the first place? It turns out that the expected lifetime is the area under the survival curve (I won’t go into proving that here).
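So in our example above, the curve equals 1 for the first 2 months and 0.6 for the next 4 months, which gives:
Expected lifetime = 2 × 1 + 4 × 0.6 = 4.4 months
If a user’s monthly plan costs, for example, $10, then we can say that his expected LTV (lifetime value) is 4.4 × $10 = $44.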
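The area calculation can be sketched in a few lines of Python. This is an illustration rather than a library routine: the function name expected_lifetime and the (time, survival) pair representation are assumptions, and the curve values are the ones from the example (S drops to 0.6 at t=2 and to 0 at t=6). The sketch only sums the area up to the last churn time, so it gives the full expected lifetime only when the curve reaches 0 there, as it does in this example.

```python
def expected_lifetime(curve):
    """Area under a step survival curve.

    `curve` is a list of (t_i, S(t_i)) pairs sorted by time, with S(t) = 1
    before the first pair. The area is accumulated up to the last t_i only.
    """
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t_i, s in curve:
        # The curve is flat at prev_s between prev_t and t_i
        area += (t_i - prev_t) * prev_s
        prev_t, prev_s = t_i, s
    return area


example_curve = [(2, 0.6), (6, 0.0)]  # step curve from the worked example
months = expected_lifetime(example_curve)

monthly_fee = 10.0  # example plan price in dollars
ltv = monthly_fee * months
print(f"expected lifetime: {months:.1f} months, LTV: ${ltv:.0f}")
# prints: expected lifetime: 4.4 months, LTV: $44
```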
In this post we’ve seen how survival curves let us answer the “when” question: how long does the average subscription last? We also saw how that answer can be used to estimate the dollar value of a subscriber.
Sometimes we may actually be interested in the “who” question as well. For example: “Which subscribers are most likely to churn within the first month of subscription?” In my next post I’ll show that using survival curves we can better answer that question as well!