The Complete Introduction to Survival Analysis in Python | by Marco Peixeiro | May, 2023


Understand survival analysis, its use in the industry, and how to apply it in Python

Photo by Ricky Kharawala on Unsplash

Survival analysis is a fascinating field, but one that is rarely used or even mentioned by data science practitioners, at least in my experience. While it is mostly used in the healthcare industry, survival analysis can also be applied in a wide variety of domains.

The objective of this article is to introduce you to survival analysis and its possible applications in your industry.

In this article, we first define survival analysis and motivate its use in different industries. Then, we see how we can implement different algorithms for survival analysis and learn how to evaluate them.

Let’s get started!

Survival analysis is a branch of statistics that measures the expected duration of time until an event occurs. The name comes from clinical research where we are literally interested in a patient’s survival, or in other words, extending the duration of time until death.

Survival analysis can take many different names depending on the field in which it's applied: in engineering, we talk about reliability analysis; in economics, duration analysis. Time-to-event analysis is another common name.

Regression or classification?

Since survival analysis measures the length of time until an event occurs, it may look like a type of regression.

However, the output of survival analysis is not a continuous number.

Instead, we aim to generate either a survival function or a hazard function.

A survival function expresses the probability of the event not occurring as a function of time.

Alternatively, the hazard function denotes the probability of the event occurring at a point in time.
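
For reference, writing T for the random time until the event, the standard definitions are:

S(t) = P(T > t), the probability that the event has not occurred by time t, and

h(t) = lim Δt→0 P(t ≤ T < t + Δt | T ≥ t) / Δt, the instantaneous rate of the event at time t, given survival up to time t.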

We will take a look at those functions in more detail when we implement different models for survival analysis.

Now that we understand that survival analysis measures the time to an event, we can see how it is not exclusive to the healthcare domain.

In fact, survival analysis can be used anytime we need to act before the event occurs.

Take the employee churn problem for example. Suppose that we have the following dataset.


+--------+----------------+-------------------+----------------+----------+
| Salary | Years employed | Nb. of promotions | Direct manager | Has left |
+--------+----------------+-------------------+----------------+----------+
|  75000 |              2 |                 1 | John Doe       |        0 |
| 105000 |              4 |                 2 | Jane Doe       |        0 |
|  40000 |              2 |                 0 | John Doe       |        1 |
+--------+----------------+-------------------+----------------+----------+

Intuitively, we frame this as a binary classification problem, where the employee either stays at the company or leaves.

If our model predicts the employee stays, how long will they stay? And if the model predicts the employee leaves, how long before they leave? As a classification problem, either we do not know the answers to those questions, or we need to fix a time period; for example, labelling our data according to whether the employee left within 6 months.

Treating the employee churn as a survival analysis problem, we can instead have the probability of an employee staying at the company as a function of time. That way, we can take action before the employee actually leaves. Plus, we can analyze our features and determine how each impacts the length of time an employee stays.

Now, this was just an example, but survival analysis can be applied in many more situations like:

  • How long before a piece of equipment needs maintenance?
  • How long before a client pays back their debt?
  • How long before a software bug is resolved?
  • How long before a client makes another purchase?

The take-home message is that we should use survival analysis when we are not only interested in whether an event occurs, but also in the duration of time before it occurs.

Since survival analysis is concerned with the occurrence of an event and the duration of time until it occurs, our data has to contain that information.

Let’s bring back our sample dataset on employee churn.

+--------+----------------+-------------------+----------------+----------+
| Salary | Years employed | Nb. of promotions | Direct manager | Has left |
+--------+----------------+-------------------+----------------+----------+
|  75000 |              2 |                 1 | John Doe       |        0 |
| 105000 |              4 |                 2 | Jane Doe       |        0 |
|  40000 |              2 |                 0 | John Doe       |        1 |
+--------+----------------+-------------------+----------------+----------+

Here, we can see that we have the necessary information for survival analysis: the years of employment and whether the employee left.

Of course, we also have data on employees that have not left the company yet.

This is what we call censored data. For two employees, we do not know when they will leave the company, or if they will leave at all.

Still, we will not wait for all employees to leave before taking action, and so survival analysis is especially designed to work with censored data.
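
As a preview of the tooling we use later, here is one way to encode such a table for the scikit-survival package. This is a minimal sketch with made-up values from our hypothetical employee dataset:

import pandas as pd
from sksurv.util import Surv

# Hypothetical employee data: 'Has left' is the event indicator and
# 'Years employed' is the observed duration (censored when 'Has left' is 0)
df = pd.DataFrame({
    'Salary': [75000, 105000, 40000],
    'Years employed': [2, 4, 2],
    'Has left': [0, 0, 1],
})

# Surv.from_arrays builds the structured array that scikit-survival expects
y = Surv.from_arrays(event=df['Has left'].astype(bool), time=df['Years employed'])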

Types of censored data

There are three types of censored data:

  1. Right-censored
  2. Left-censored
  3. Interval-censored

Let’s take a look at each type.

Right-censored data

For right-censored data, we know the starting point for every subject, but we do not observe the event for everyone, as shown below.

Right-censored data: we know the starting point, but do not know if the event occurs for every subject. Image by the author.

In the figure above, subject A quit the company, so we observed the event. Subject B remained in the company for the duration of the experiment, but we cannot know if they left afterwards, so their data is right-censored.

Subject C is another example of right-censored data, where someone simply leaves the experiment. This can happen in a clinical trial scenario, when someone is still alive, but leaves the trial, so we lose track of them.

Left-censored data

To understand left-censored data, let’s consider the example of being infected by a virus.

Here, left-censored data would mean that we do not know when the person was infected, but we know that they have the virus.

Left-censored data: we don’t know when the event occurred, but we observe it later on. Image by the author.

In the figure above, the event happened before we observed it, but we cannot know the exact time. Still, this type of data can be used for survival analysis.

Interval-censored data

Finally, interval-censored data is when the event occurs between two moments of observation, but we do not know exactly when.

Interval-censored data: the event occurred between two points of observation. Image by the author.

To understand this, suppose that you go to the doctor and test for the virus. On your first visit, the result is negative and you are not infected.

Then, the following week, you go back to the doctor and test for the virus. Now, the test is positive, meaning that you were infected sometime between the two visits, but we do not know when exactly.

Now that we have an overall understanding of survival analysis, its applications, and the type of data it works with, let's actually apply survival analysis with an example.

Here, we work with the Veterans Administration Lung Cancer Trial dataset (Kalbfleisch, J.D., Prentice, R.L.: “The Statistical Analysis of Failure Time Data.” John Wiley & Sons, Inc. (2002)). The dataset is publicly available for download.

This dataset contains data on 137 patients with several covariates, and it is right-censored: not all patients died during the study.

Here, there are two main objectives to this experiment:

  1. Can a different treatment improve survival time?
  2. Can we predict the survival time of a patient?

To answer the first question, let’s implement our first survival analysis method: the Kaplan-Meier estimator.

Implement the Kaplan-Meier estimator

The Kaplan-Meier estimator is a non-parametric statistic that estimates the survival function and works well with right-censored data.

As we will see soon, the Kaplan-Meier estimator will generate a survival function represented by a series of declining horizontal steps, like a staircase. Given enough samples, that function will approach the true survival function.

The main assumption of the Kaplan-Meier estimator is that censored data has the same probability of survival as uncensored data. Therefore, if someone leaves the experiment, we will assume the same probability of survival as someone who stayed in the experiment and that we observe.
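
For reference, the Kaplan-Meier estimate is the product-limit formula:

Ŝ(t) = ∏ (1 − dᵢ/nᵢ), taken over all event times tᵢ ≤ t,

where dᵢ is the number of events at time tᵢ and nᵢ is the number of subjects still at risk just before tᵢ. Each event time contributes one downward step, which is why the curve looks like a staircase.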

To implement it, we will use the scikit-survival Python package. This is a library that comes with datasets and common survival analysis models.

First, let’s read the data.

from sksurv.datasets import load_veterans_lung_cancer

X, y = load_veterans_lung_cancer()
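
As a quick sanity check (my own addition, not part of the original walkthrough), X is a pandas DataFrame of covariates and y is a structured NumPy array holding the event indicator and the time to event:

print(X.head())
print(y.dtype.names)  # expect ('Status', 'Survival_in_days')
print(y[:3])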

Great! Now, we can import the Kaplan-Meier estimator and generate the survival curve. For that, we need to pass in two parameters: the status of the patient (deceased or not) and the time to the event (how many days elapsed from the beginning of the study to the observation).

from sksurv.nonparametric import kaplan_meier_estimator

time, survival_prob = kaplan_meier_estimator(y['Status'], y['Survival_in_days'])

Then, we can plot the survival function.

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.step(time, survival_prob, where='post')
ax.set_ylabel('Probability of survival')
ax.set_xlabel('Time')

plt.tight_layout()

The survival function from the Kaplan-Meier estimator on the Veterans Administration Lung Cancer Trial dataset. Notice how the curve is a series of declining horizontal steps as expected. Image by the author.

From the figure above, we can see that the survival function is indeed a series of declining horizontal steps, as expected. When the study started, most patients were alive, so the probability of survival was high. Then, the curve decreases quickly, meaning that most patients died within the first 400 days of the study.
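
As a side note, we can read an estimated median survival time off this curve: the earliest time where the survival probability drops to 0.5 or below. A minimal sketch, reusing the time and survival_prob arrays from above:

# Estimated median survival: first time point where S(t) <= 0.5
median_survival = time[survival_prob <= 0.5][0]
print(median_survival)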

Now, it is possible to generate a survival function for different groups of patients. For example, in the dataset, two different treatments were administered. It would be interesting to see if one treatment increased the chances of survival.

Let’s see this for ourselves! We separate the data into two groups depending on the treatment type and generate the survival function. Then, we plot it.

for treatment_type in ('standard', 'test'):
    mask_treatment = X['Treatment'] == treatment_type

    time, survival_prob = kaplan_meier_estimator(
        y['Status'][mask_treatment],
        y['Survival_in_days'][mask_treatment]
    )

    plt.step(time, survival_prob, where='post', label=f'{treatment_type}')

plt.ylabel('Probability of survival')
plt.xlabel('Time')
plt.legend(loc='best')
plt.tight_layout()

Survival functions for two different treatment types. Image by the author.

From the figure above, it seems that the test treatment produces a longer-lasting survival curve than the standard treatment.

But is that difference significant?

To answer that question, we use the logrank test, a statistical test to determine whether two survival curves are significantly different. Here, the null hypothesis states that there is no difference between the two survival curves.

Using scikit-survival, we can run the test and output the p-value.

from sksurv.compare import compare_survival

group_indicator = X.loc[:, 'Treatment']

chi2, pvalue = compare_survival(y, group_indicator)

print(pvalue)

This outputs a p-value of 0.93. Since it is not less than 0.05, we fail to reject the null hypothesis: the survival curves are not significantly different. Therefore, we cannot conclude that the treatment type helped patients live longer.

The Kaplan-Meier estimator is a great starting point, but because it is a non-parametric model, it cannot take into account any of the features of our dataset. Therefore, we turn our attention to models that can use features to estimate the survival function.

First, let’s one-hot encode our data so that the features can be used.

from sksurv.preprocessing import OneHotEncoder

X_num = OneHotEncoder().fit_transform(X)

Great! Now, we can apply a model that takes into account covariates to assess survival. Here, we use the Cox Proportional Hazard model.

Cox Proportional Hazard model

The Cox Proportional Hazard model is one of the models that can evaluate the effect of different factors on survival. That way, we can determine what factors can improve survival, and what factors reduce the chance of survival.

Here, the model actually estimates the hazard function. In other words, it calculates the probability of the event occurring at a point in time, in contrast to the survival function, which estimates the probability of the event not having occurred by that point in time.

Visualizing the hazard and survival functions. The hazard function increases over time, while the survival function decreases over time. Image by Marta Sestelo from A Short Course on Survival Analysis.

From the figure above, we can see that the survival function decreases over time, while the hazard function increases over time. Of course, once we have one function, we can easily calculate the other one.
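
Concretely, the survival function can be recovered from the hazard through the cumulative hazard H(t):

S(t) = exp(−H(t)), where H(t) = ∫₀ᵗ h(u) du,

so knowing either function determines the other.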

The Cox model is therefore a hazard function expressed as:

Cox Proportional Hazard equation. Image by the author.
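
Written out, the model takes the standard form:

h(t | x) = h₀(t) · exp(b₁x₁ + b₂x₂ + … + bₚxₚ)

where h₀(t) is the baseline hazard shared by all subjects.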

Here, h represents the hazard (the probability of the event happening) and the covariates are represented by x. Then, the coefficient b can be used to interpret the impact of each covariate:

  • if b = 0, then the feature has no impact
  • if b > 0, then the feature increases the hazard (so survival decreases)
  • if b < 0, then the feature decreases the hazard (so survival increases)

An important assumption of this model is that hazard is proportional, and that proportion is independent of time. Therefore, if the risk of death for subject A is twice as much as the risk of death for subject B at an initial point in time, that proportion remains the same, no matter where we are in time.
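
In other words, for any two subjects A and B, the hazard ratio is:

h(t | x_A) / h(t | x_B) = exp(b₁(x_A₁ − x_B₁) + … + bₚ(x_Aₚ − x_Bₚ))

The baseline hazard h₀(t) cancels out, so the ratio does not depend on time.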

Apply the Cox Proportional Hazard model

Now that we understand the Cox model, let’s apply it on our dataset.

We simply have to initialize the model and fit it on our data.

from sksurv.linear_model import CoxPHSurvivalAnalysis

estimator = CoxPHSurvivalAnalysis()
estimator.fit(X_num, y)
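
Before predicting, we can also inspect the fitted coefficients to see each covariate's estimated effect on the hazard, following the interpretation of b above. A short sketch (coef_ holds the fitted coefficients in scikit-survival):

import pandas as pd

# Positive coefficients increase the hazard (shorter survival),
# negative coefficients decrease it (longer survival)
print(pd.Series(estimator.coef_, index=X_num.columns).sort_values())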

Then, given data on unseen patients, the model can generate the survival function for each one of them. Note the use of the predict_survival_function method to get the survival function, and not the hazard function.

# Create a set of 4 synthetic patients
import numpy as np
import pandas as pd

X_test = pd.DataFrame.from_dict({
    1: [65, 0, 0, 1, 60, 1, 0, 1],
    2: [65, 0, 0, 1, 60, 1, 0, 0],
    3: [65, 0, 1, 0, 60, 1, 0, 0],
    4: [65, 0, 1, 0, 60, 1, 0, 1]},
    columns=X_num.columns, orient='index')

# Estimate the survival functions
pred_surv = estimator.predict_survival_function(X_test)

# Plot the survival function for each new patient
time_points = np.arange(1, 1000)

for i, surv_func in enumerate(pred_surv):
    plt.step(time_points, surv_func(time_points), where='post', label=f'Sample {i+1}')

plt.ylabel('Probability of survival')
plt.xlabel('Time')
plt.legend(loc='best')
plt.tight_layout()

Survival function for each new patient from the Cox Proportional Hazard model. Image by the author.

From the figure above, we can see that the model generates a unique survival function for each patient. The survival probability decreases fastest for sample 4 and slowest for sample 2.

So, we got predictions from our Cox Proportional Hazard model and got survival curves. But how do we know if those predictions are any good?

Common evaluation metrics in survival analysis are the concordance index or c-index, and the time-dependent ROC AUC. Let’s explore both in more detail.

Concordance index (c-index)

A survival model predicts a hazard, or risk score. Therefore, a sample with a higher predicted hazard should have a shorter survival time.

Then, to calculate the c-index, we consider each pair of samples and apply the following rules (a minimal sketch follows the list):

  • If both samples in the pair are censored, the pair is ignored (no impact on the c-index)
  • If the sample with the higher predicted hazard has a shorter survival time than the other sample in the pair, the pair is concordant (the c-index increases)
  • If the sample with the higher predicted hazard has a longer survival time than the other sample in the pair, the pair is discordant (the c-index decreases)
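
Here is a naive, quadratic-time sketch of those rules (my own illustration; it ignores ties in survival time, which the scikit-survival implementation handles properly):

def c_index(event, time, risk):
    # Compare every pair of samples
    concordant, comparable = 0.0, 0.0
    n = len(time)
    for i in range(n):
        for j in range(i + 1, n):
            # Order the pair so that subject a has the shorter observed time
            a, b = (i, j) if time[i] < time[j] else (j, i)
            if not event[a]:
                continue  # the shorter time is censored: the pair is not comparable
            comparable += 1
            if risk[a] > risk[b]:
                concordant += 1    # higher risk failed earlier: concordant
            elif risk[a] == risk[b]:
                concordant += 0.5  # tied risk scores count as half
    return concordant / comparable

# Toy example: risk scores perfectly ordered with survival times
print(c_index(event=[True, True, False], time=[5, 10, 20], risk=[0.9, 0.5, 0.1]))  # 1.0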

Then, to interpret the c-index, we use the same logic as with the ROC AUC in classification:

  • 0.5 is a random model
  • 1.0 is a perfect model
  • 0 is a model that misses every time

Thus, you want your c-index to be at least greater than 0.5, and the closer to 1 the better.

Evaluate our Cox model with the c-index

Now, let’s calculate the c-index of our Cox model.

from sksurv.metrics import concordance_index_censored

# The score method returns Harrell's c-index; the explicit equivalent is:
estimator.score(X_num, y)
c_index, *_ = concordance_index_censored(y['Status'], y['Survival_in_days'], estimator.predict(X_num))

This returns a c-index of 0.74, meaning that our model does better than random, which is a good sign.

Time-dependent ROC AUC

If you have worked with classification problems, you have probably encountered the ROC AUC as an evaluation metric. You basically measure the area under the ROC curve to assess the performance of your model. Again, you want the area to be greater than 0.5 and as close as possible to 1.

Now, in survival analysis, the outcome is a duration, meaning that the ROC curve changes over time, unlike in binary classification. For example, a client can be paying their credit card on time, but then start defaulting at a future point in time.

Thus, the time-dependent ROC AUC is useful to assess your model's ability to predict an event occurring by time t.

Evaluate our Cox model with the dynamic ROC AUC

Using scikit-survival, let’s evaluate our Cox model using the time-dependent ROC AUC.

from sklearn.model_selection import train_test_split
from sksurv.metrics import cumulative_dynamic_auc

# Split into train/test, stratifying on the event indicator
X_train, X_test, y_train, y_test = train_test_split(
    X_num, y, test_size=0.2, stratify=y['Status'], random_state=42
)

# Fit the model on the training set
cph = CoxPHSurvivalAnalysis()
cph.fit(X_train, y_train)

# Compute the time-dependent ROC AUC on a weekly grid of time points
time_interval = np.arange(8, 184, 7)

cph_risk_scores = cph.predict(X_test)
cph_auc, cph_mean_auc = cumulative_dynamic_auc(y_train, y_test, cph_risk_scores, time_interval)

fig, ax = plt.subplots()

ax.plot(time_interval, cph_auc, marker='o')
ax.axhline(cph_mean_auc, ls='--')
ax.set_xlabel('Days after enrollment')
ax.set_ylabel('Time-dependent AUC')

plt.grid(True)
plt.tight_layout()

Time-dependent ROC AUC for our Cox model. The average AUC is 0.85. Image by the author.

As you can see in the figure above, the performance of our model varies over time. Its highest performance is between 75 and 125 days. Therefore, our model predicts best in the medium term.

From here, we could develop other survival models to see if they predict events better in the short term or the long term, complementing the Cox model.

Congratulations on making it to the end! There was a lot of information, especially if you are completely new to survival analysis.

I hope that you learned something new and that I inspired you to discover the field of survival analysis and apply it in your projects!

Let’s keep in touch on LinkedIn!

Cheers 🍻

