
Cancer Prediction vs Telecom Churn Modelling | by Diego Unzueta | Jul, 2022



The Similarities Between Cancer Prediction and Churn Modelling

Image from Shutterstock by tadamichi

Introduction

Over the past year, I have been working at a telecommunications company as a Data Scientist. The telecom sector has many applications for machine learning, and one of the main ones is churn prediction. Accurately predicting churn can help improve customer retention and increase revenue for the business. Although churn prediction is seemingly unrelated to cancer prediction, these two problems can actually be solved very similarly. In this article, I’ll go through the best ways to engineer features and train machine learning models for both applications.

So what is churn?

A churner is someone who stops using or paying for a service. The goal is to predict churn ahead of time so that we can launch targeted marketing campaigns aimed at the users we expect to churn. Reducing churn and increasing customer retention raises the lifetime value of the average customer, and therefore the company’s revenue.

Ideally, we want to predict churn as far in the future as possible. The further in the future we are able to predict, the easier it is for marketing teams to react and organize a marketing campaign. However, the further in the future we predict, the less accurate our predictions will be. We need to strike a balance between the performance of the model and the usefulness of the results.

Analogy to Cancer Prediction

To explain churn modelling, I’ll start off with a case interview question I got a while back:

Say there is a very deadly form of cancer that can be cured if treated early enough. However, by the time the cancer is diagnosed in routine medical checks, it is already too late. If it is found 1–3 months before it would usually be detected by that routine diagnosis, it can easily be cured.

You are given access to a healthcare dataset with medical records of many patients across the country. The dataset also contains monthly blood checks for each patient, as well as whether they were diagnosed with this cancer or not.

How would you go about building a model to predict this form of cancer in time?

Tables by Author (made-up data, image by author)

It’s probably a good moment for you to pause and think about how you would go about building a model here. This is a question that came up in a Data Science interview, and very similar scenarios have come up in my Data Science role.

Building the Dataset

The end goal is to build a classifier that takes in information about a patient and their blood test results and predicts whether they have an undiagnosed cancer (a cancer that will be diagnosed by the monthly check-ups in the next 1–3 months).

The datasets are extremely large, in the tens of millions of rows. The data is relational, so a language like SQL is most appropriate. The datasets can be joined on the “Patient ID” column with a left join, and these queries can be executed on cloud services such as Google’s BigQuery.
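To make the join concrete, here is a minimal sketch using pandas rather than SQL. All table names, file names and columns here are assumptions for illustration; in practice this would be a SQL query run directly on BigQuery.

```python
import pandas as pd

# Illustrative placeholders: in reality these tables live in the warehouse.
patients = pd.read_csv("patients.csv")        # one row per patient (demographics)
blood_tests = pd.read_csv("blood_tests.csv")  # one row per patient per monthly check-up
diagnoses = pd.read_csv("diagnoses.csv")      # one row per diagnosed patient, with the diagnosis month

# Left joins on "Patient ID" keep every blood test, even for patients who
# were never diagnosed (their diagnosis columns simply become NaN).
df = (blood_tests
      .merge(patients, on="Patient ID", how="left")
      .merge(diagnoses, on="Patient ID", how="left"))
```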

To train the classifier, we need to build the training dataset. Each datapoint will be one patient’s monthly blood test results, labelled as either positive or negative. Positive datapoints are the results of a patient who will be diagnosed with cancer in the next 1–3 months; negative datapoints come from patients who will not.

Fig. 1: Patient monthly check data (Image by Author)

In the image above I’ve shown data from 2 patients. Patient 1 is diagnosed with cancer in May; Patient 2 is never diagnosed with cancer. We want a model that would classify Patient 1’s blood tests in February, March and April as positive, and would classify all of Patient 2’s blood test results as negative:

Fig. 2: Training datapoints shown in Fig. 1 split into negative, positive and not used (Image by Author)

The blood test where Patient 1 is diagnosed with cancer (May) is not used in the training data, as by then it is already too late to cure. We want to predict the cancer 1–3 months beforehand, so those data points are labelled as positive. Note how Patient 1’s blood test in January is labelled as negative, as the diagnosis only came 4 months later.

Note how we are using every blood test for each patient as a separate training data point. We cannot simply use the data for cancers diagnosed in the most recent month: these data are insufficient, as only an extremely small number of patients will be diagnosed with cancer. Training these models is already a difficult challenge because the resulting dataset will be very unbalanced. We need as many cancer positives as possible in the training data to improve model performance, and therefore we need to take data from as far back as we can.
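A minimal sketch of this labelling rule follows. The columns "month_idx" (an integer month index for each blood test) and "diagnosis_month_idx" (the month of diagnosis, NaN for patients never diagnosed) are assumptions for illustration.

```python
# How many months ahead the diagnosis is, relative to each blood test.
horizon = df["diagnosis_month_idx"] - df["month_idx"]

df["label"] = 0                             # default: negative
df.loc[horizon.between(1, 3), "label"] = 1  # diagnosed 1-3 months later: positive
df = df[horizon != 0]                       # drop the diagnosis month itself: already too late
# (depending on the use case, rows after a diagnosis could also be dropped)
```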

The reason we set a limit on how far into the future we predict is simply that the predictive signal becomes weaker and weaker the further forward we try to look. We want to build a strong and robust model, where if it predicts somebody has cancer we are almost certain the prediction is true.

Feature Engineering

If we only looked at the most recent blood test we might miss patterns in how the results have changed over the past few months. For example, a steady decrease in a variable such as white blood cell count may be a predictor. Capturing this in the training data is extremely important and will massively affect the performance of the model.

Feature engineered variables may include:

  • The rate of change of important variables
  • Average of important variables over the past few months
  • Variance of variables over the past few months
  • Binary variables, such as: were sodium levels ever outside the normal range (roughly 135–145 mmol/L) in the past few months?
  • One hot encodings of categorical variables
  • Scaled continuous variables

What features to include will largely depend on the use case and the dataset. It is up to the data scientist to exploit patterns in the data, understand the underlying problem and find the best features to feed the machine learning model.
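As a hedged sketch of a few such features, assuming the joined frame from before has a white blood cell count column ("wbc"), a "sodium" column and a categorical "sex" column (all names invented for illustration):

```python
import pandas as pd

df = df.sort_values(["Patient ID", "month_idx"])
g = df.groupby("Patient ID")

df["wbc_change"]  = g["wbc"].diff()                                                    # month-on-month change
df["wbc_mean_3m"] = g["wbc"].transform(lambda s: s.rolling(3, min_periods=1).mean())   # recent average
df["wbc_var_3m"]  = g["wbc"].transform(lambda s: s.rolling(3, min_periods=1).var())    # recent variance
df["sodium_high"] = g["sodium"].transform(lambda s: s.rolling(3, min_periods=1).max()) > 145  # binary flag

df = pd.get_dummies(df, columns=["sex"])  # one-hot encode a categorical variable
```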

Telecom Churn Modelling

Hopefully, after reading the cancer prediction modelling described above, you can see how it all applies to churn modelling.

We are trying to predict whether a customer will churn in the next 1–3 months. We need that time to target them with marketing and prevent them from churning. Instead of monthly blood tests, we gather data on their interactions with the service sold to them: how much they use it, what they use it for, and so on.

The main difference between churn modelling and cancer prediction is that churn is extremely seasonal. Models need to be retrained every month; otherwise, they severely underperform.

The next big difference is that in telecom, the features usually arrive as streams of data rather than one data point a month, as with a blood test. These streams still need to be aggregated into distinct data points, which often correspond to a particular day of the month on which the models are periodically retrained and executed.
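A sketch of collapsing a raw event stream into one datapoint per customer per month. The "events" frame and its columns (customer_id, timestamp, data_mb, call_minutes) are assumptions for illustration.

```python
# Snapshot the stream at a monthly grain.
events["month"] = events["timestamp"].dt.to_period("M")

monthly = (events
           .groupby(["customer_id", "month"])
           .agg(total_data_mb=("data_mb", "sum"),
                total_call_minutes=("call_minutes", "sum"),
                n_events=("timestamp", "size"))
           .reset_index())
```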

Training and Evaluating Predictive Models

The resulting datasets for both problems will be extremely unbalanced, because most patients will not have cancer and most users will not churn. Ways to deal with unbalanced classification include weighted classification and undersampling of the majority class. There are others, but in my experience these two work best.
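A minimal sketch of both options with scikit-learn, assuming a labelled frame "train_df" with a binary "label" column (the 5:1 undersampling ratio is an arbitrary illustrative choice, not a recommendation):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Option 1: weighted classification. Most scikit-learn classifiers accept
# class_weight="balanced", which up-weights the rare positive class.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# Option 2: undersample the majority (negative) class before training.
neg = train_df[train_df["label"] == 0]
pos = train_df[train_df["label"] == 1]
neg_down = resample(neg, n_samples=len(pos) * 5, replace=False, random_state=42)
balanced_train = pd.concat([pos, neg_down])
```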

Depending on the size of the resulting datasets, different supervised learning classifiers may work best. For smaller datasets (in the tens of thousands or sometimes hundreds of thousands of rows), since this is tabular data, ensemble methods tend to work best. Ensemble methods include random forests (bagged decision trees) and gradient boosting.

It’s important to follow standard machine learning best practice techniques such as:

  • Standardizing or min-max scaling continuous variables
  • One hot encoding discrete variables
  • Train test splitting
  • For churn, saving the most recent churners (a month or so) for validation, as in the sketch below
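Here is a hedged sketch tying these steps together, continuing from the "monthly" frame above with an assumed churn "label" column and an invented "tariff_plan" categorical feature:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["total_data_mb", "total_call_minutes"]  # assumed feature names
cat_cols = ["tariff_plan"]

latest = monthly["month"].max()
valid = monthly[monthly["month"] == latest]   # most recent month held out for validation
rest = monthly[monthly["month"] < latest]

X, y = rest[num_cols + cat_cols], rest["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scale continuous variables, one-hot encode discrete ones, then fit an ensemble.
model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])),
    ("clf", GradientBoostingClassifier()),
])
model.fit(X_train, y_train)
```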

To improve the models, you should tweak the model to improve its results on the test data. This may involve further feature engineering, hyperparameter tuning, etc. Once this is done, you should evaluate the model on the validation dataset to get a good idea of the performance you can expect in production.

The most useful metric when it comes to churn modelling is the ROC AUC score. The receiver operating characteristic (ROC) is a plot of the true positive rate against the false positive rate. It is very closely related to precision and recall, which I explained in detail in this article.

Fig. 3: Precision-Recall Curve vs ROC Curve (Image by Author)

In the image above I’ve recycled the results of a model from a previous article. On the left you can see the precision-recall curves of two models; on the right you can see the ROC curves. By varying the threshold, we were able to change the precision and recall for each class.

In the same way, by varying the threshold of the classifier we can obtain different true positive and false positive rates. The area under the ROC curve, integrated over the false positive rate from 0 to 1, gives the AUC. A perfect model has an AUC of 1, achieving a perfect true positive rate even at the lowest false positive rates. A purely random model traces a diagonal ROC curve with an AUC of 0.5 (half the area of the 1-by-1 square). What counts as a good ROC AUC score depends on the application and the data, but the closer to 1, the better.
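Continuing the pipeline sketch above, computing the score is a couple of lines with scikit-learn:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Score the held-out data with the predicted probability of the positive class.
test_scores = model.predict_proba(X_test)[:, 1]
print("Test ROC AUC:", roc_auc_score(y_test, test_scores))

# roc_curve returns one (false positive rate, true positive rate) point per
# threshold; a random model traces the diagonal and scores an AUC of 0.5.
fpr, tpr, thresholds = roc_curve(y_test, test_scores)
```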

This measure is a good way to compare the overall generalization capability of machine learning models. It is very useful for any kind of unbalanced classification problem, and is therefore appropriate both for cancer prediction models and for churn models.

Conclusions

Early cancer detection and churn prediction seem unrelated, but in reality these two problems can be tackled almost identically. In this article, I discussed how you might go about building a machine learning model for either of these challenges. I also discussed how to build the dataset, which is arguably the hardest part of the problem and requires a large amount of feature engineering and pre-processing. Machine learning can only do so much: at the end of the day, the quality of the model will largely depend on the quality of the data and features you feed it. To finish the article, I discussed how best to evaluate these classifiers, recommending the ROC AUC score. Hopefully this article will inspire you to train your own models and provide meaningful insights through the use of data!

Support me

Hopefully this helped you; if you enjoyed it, you can follow me!

You can also become a Medium member using my referral link, and get access to all my articles and more: https://diegounzuetaruedas.medium.com/membership

Other articles you might enjoy

Support Vector Machines

Precision vs Recall

