
Findings from benchmarking churn prediction methods



Insights from comparing widely used customer churn prediction approaches

Photo by Robert Bye on Unsplash.

This article presents the results and findings from my previous (more technical) article, “A pipeline for benchmarking churn prediction approaches,” in which I explained how to build a generic pipeline to benchmark churn prediction approaches. The motivation behind this article is a paper by Geiler et al. (2022) that benchmarks different churn prediction methods.

  • Building on my previous article, common customer churn prediction approaches were benchmarked on five freely available customer churn data sets.
  • Due to the small sample size (only five data sets), these findings might not be representative, but they can give you ideas of which approaches to consider for your next churn prediction project.
  • When dealing with high class imbalance, sampling methods do not always improve your model’s performance.
  • Depending on your preferred error metric, the best approaches are soft voting-classifier models (Logistic Regression + XGB + Random Forest) with no sampling (PR AUC) or a logistic regression with a hybrid SMOTE oversampling and random undersampling (RND) approach (F2 score).

Before discussing the results, I would like to give you a quick recap of the pipeline and the methodology used. Figure 1 shows the overall structure of the benchmarking pipeline, from loading the data, through the pre-cleaning steps, to the actual benchmarking and the final visualization of the results.

Figure 1. Overview of the benchmarking process (image by author).

The “Dynamic part” (colored in green) represents the different approaches, i.e., combinations of machine learning models and sampling methods, that are “dynamically” attached to the static part (imputer, scaler, and encoder) of the pipeline.
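
As a rough illustration, the static and dynamic parts could be wired up with scikit-learn and imbalanced-learn as in the following sketch. The column names, the SMOTE sampler, and the logistic regression model are illustrative placeholders, not the exact configuration from the previous article.

```python
# Minimal sketch of the static + dynamic pipeline idea (illustrative only).
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature names, used only to show the structure.
numeric_features = ["tenure", "monthly_charges"]
categorical_features = ["contract_type", "payment_method"]

# Static part: imputation plus scaling/encoding, shared by all approaches.
static_part = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), numeric_features),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("encoder", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_features),
])

# Dynamic part: any sampler/model combination can be attached here.
approach = Pipeline([
    ("static", static_part),
    ("sampler", SMOTE(random_state=42)),           # swap for ADASYN, TomekLinks, ...
    ("model", LogisticRegression(max_iter=1000)),  # swap for rf, xgb, a voting classifier, ...
])
```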

🛢 Data sets

For this article, I only used customer churn data sets that are free to use and have a “realistic” churn rate. By a “realistic” churn rate, I mean a high class imbalance with 20% churners or less. If your data has a churn rate of, let’s say, 40%, you should consider whether you really want to do churn prediction or rather analyze your business model, since almost half of your customers are leaving.

A summary of the used data sets can be seen below in Table 1.

Table 1. Summary of the used data sets (image by author).

The first two rows show, for each data set, the number of rows (observations) after the pre-cleaning step and the corresponding churn rate. The following rows show the number of columns (features) per data set and how many columns there are of each data type.

The mentioned pre-cleaning step removes columns

  • with more than 20% missing values
  • with constant values only
  • that are unique identifiers (e.g., user ID or address)

In addition to these rules, I removed columns in the KDD data set that contain more than 1,000 distinct categorical values. Otherwise, I would have run into a curse-of-dimensionality issue when applying the one-hot encoding step.
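
A hedged pandas sketch of these pre-cleaning rules might look as follows; the helper name pre_clean, the identifier column names, and the default thresholds are illustrative assumptions rather than the exact code from the previous article.

```python
import pandas as pd

def pre_clean(df: pd.DataFrame, id_like_cols=("user_id", "address"),
              max_missing=0.20, max_categories=1_000) -> pd.DataFrame:
    """Drop columns according to the pre-cleaning rules described above (sketch)."""
    # Columns that act as unique identifiers (names are assumed here).
    identifiers = [c for c in id_like_cols if c in df.columns]

    # Columns with more than 20% missing values.
    too_many_missing = df.columns[df.isna().mean() > max_missing]

    # Columns containing constant values only.
    constant = df.columns[df.nunique(dropna=False) <= 1]

    # Categorical columns with too many distinct values (the KDD-specific rule).
    high_cardinality = [
        c for c in df.select_dtypes(include=["object", "category"]).columns
        if df[c].nunique() > max_categories
    ]

    to_drop = set(identifiers) | set(too_many_missing) | set(constant) | set(high_cardinality)
    return df.drop(columns=list(to_drop))
```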

The data sets used for this article are the following:

  • ACM KDD Cup — 2009 (kdd): Marketing database from the French telecom company Orange, used to predict the propensity of customers to switch providers (CC0: Public Domain). It is the data set with the most observations and features.
  • IBM HR Analytics Employee Attrition & Performance (ibm_hr): Fictional data set created by IBM Data Scientists that contains factors that lead to employee attrition (Database Contents License (DbCL)).
  • Customer Churn Prediction 2020 (ccp): Churn data (artificial data, based on claims similar to the real world). The data is also part of an R package (GPL (>= 2) license).
  • Portuguese Bank Marketing Data Set (prt_bank): Data set about a direct phone call marketing campaign of a Portuguese bank (CC BY 4.0).
  • Newspaper Churn (news): Data set about newspaper subscribers (CC0: Public Domain).

⚖️ Class Imbalance

Dealing with a class balance of nearly 50% is quite rare in real-life cases. Especially in churn prediction, one usually has to deal with a high imbalance (the churners are in the minority). The imblearn package comes with a collection of methods to deal with this. The following ones (and combinations of them) were used; a short sketch of how they can be instantiated follows the list:

  • No sampling (no_sampling) — we do not apply any methods
  • SMOTE (o_SMOTE) — oversampling
  • ADASYN (o_ADASYN) — oversampling
  • TomekLinks (u_TomekLinks) — undersampling
  • NCR (u_NCR) — undersampling
  • SMOTE and RND (h_SMOTE_RND) — over- and undersampling
  • SMOTE and TomekLinks (h_SMOTE_TomekLinks) — over- and undersampling
  • SMOTE and NCR (h_SMOTE_NCR) — over- and undersampling
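
Below is a rough sketch of how these strategies can be instantiated with imbalanced-learn (default parameters only; the exact configuration is in the previous article). In particular, I interpret RND here as random undersampling, in line with the “over- and undersampling” label above, and the toy data set is just for illustration.

```python
# Sketch of the sampling strategies listed above (illustrative defaults only).
from collections import Counter
from imblearn.over_sampling import ADASYN, SMOTE
from imblearn.under_sampling import (NeighbourhoodCleaningRule, RandomUnderSampler,
                                     TomekLinks)
from sklearn.datasets import make_classification

samplers = {
    "no_sampling": [],
    "o_SMOTE": [SMOTE()],
    "o_ADASYN": [ADASYN()],
    "u_TomekLinks": [TomekLinks()],
    "u_NCR": [NeighbourhoodCleaningRule()],
    # Hybrids: SMOTE oversampling followed by an undersampling step.
    "h_SMOTE_RND": [SMOTE(), RandomUnderSampler()],
    "h_SMOTE_TomekLinks": [SMOTE(), TomekLinks()],
    "h_SMOTE_NCR": [SMOTE(), NeighbourhoodCleaningRule()],
}

# Toy imbalanced data set standing in for one of the churn data sets.
X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=42)
for step in samplers["h_SMOTE_RND"]:
    X, y = step.fit_resample(X, y)
print(Counter(y))  # class counts after resampling
```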

📦 Models

The voting classifiers use soft voting, which means their outcome is based on the mean of the predicted probabilities of their member models.
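
As a minimal sketch (assuming xgboost is installed and with purely illustrative hyperparameters), a soft voting classifier along the lines of lr_xgb_rf could be built like this:

```python
# Soft voting: the ensemble averages the predicted probabilities of its members.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

lr_xgb_rf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("xgb", XGBClassifier(eval_metric="logloss")),
        ("rf", RandomForestClassifier()),
    ],
    voting="soft",  # average predict_proba outputs instead of majority voting
)
```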

⏱️ Error Metrics

The following error metrics have been used: F1 (macro), F2, lift score, precision, recall, ROC AUC, and PR AUC.

There are many discussions about the “right” error metric. First of all, there is no silver bullet. Choosing the right error metric not only depends on your use case or priorities (e.g., do you have higher costs on false negatives or false positives?) but also on your data (e.g., strong class imbalance) and on whether you want to predict classes (threshold-dependent) or probabilities (threshold-independent). Therefore, it makes more sense to consider several metrics to get a clearer picture of your model’s performance.

The F1 score treats precision and recall equally. However, in the case of customer churn prediction, we usually have higher costs on false negatives (the costs of customer acquisition are usually much higher than the costs of keeping a customer). The F2 score adds a higher weight on the recall and less weight on the precision part of the equation.
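
Concretely, this is the usual F-beta weighting, F_beta = (1 + beta²) · precision · recall / (beta² · precision + recall), with beta = 2. The toy labels below are made up just to show scikit-learn’s fbeta_score:

```python
# F-beta treats recall as beta times more important than precision;
# beta = 2 gives the F2 score used in this article.
from sklearn.metrics import fbeta_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # made-up labels (1 = churn)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up predictions
print(fbeta_score(y_true, y_pred, beta=2))
```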

If we were to change the default threshold (0.5) of our classifiers, for example to 0.34 because we believe it is a better one, our F2 score and recall would change. Threshold-independent metrics include ROC AUC and PR AUC. For imbalanced data sets, Saito and Rehmsmeier (2015) and Czakon (2022) prefer PR AUC over ROC AUC.

* One can also think of PR AUC as the average of precision scores calculated for each recall threshold (Czakon, 2022).
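
Both metrics are computed from predicted probabilities rather than hard class labels; in scikit-learn, average_precision_score is a common way to summarize the precision-recall curve (closely related to PR AUC). The labels and scores below are made up for illustration:

```python
# Threshold-independent metrics use predicted probabilities, not class labels.
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                      # made-up labels (1 = churn)
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]     # made-up churn probabilities

print("PR AUC :", average_precision_score(y_true, y_score))
print("ROC AUC:", roc_auc_score(y_true, y_score))
```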

🔬 Cross-Validation

To calculate the scores, I used repeated stratified k-fold cross-validation with n_repeats = 5 and n_splits = 5. This method is commonly used when dealing with imbalanced data sets: each fold contains approximately the same percentage of samples of each target class.
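
A minimal sketch of this setup with scikit-learn is shown below; the toy data set and the plain logistic regression stand in for the actual data sets and model/sampler pipelines from the benchmark.

```python
# Repeated stratified k-fold: 5 splits, repeated 5 times (assumed values).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Toy imbalanced data set standing in for one of the churn data sets.
X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=42)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)
scores = cross_validate(
    LogisticRegression(max_iter=1000),   # any model/sampler pipeline fits here
    X, y, cv=cv,
    scoring=["average_precision", "roc_auc", "recall", "precision"],
)
print(scores["test_average_precision"].mean())
```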

The results can be interpreted using charts like box plots or a tabular visualization. I will use both in the following, starting with the F2 score. Figure 2 below shows the F2 score for each model and sampling approach over all five data sets. The mean is represented by the green triangle markers.

Figure 2. Box plots of F2 scores for each model using different sampling approaches over all data sets (image by author).

The box plots show that the tree-based models (rf, lgb, xgb) generally have a much wider spread than the deep learning models (gev_nn, ffnn). We can also observe that with the hybrid sampling approaches (h_SMOTE_RND, h_SMOTE_TomekLinks, h_SMOTE_NCR) the spread of the linear model (lr) is small and its performance is quite good compared to the other approaches.

The table below (Table 2) shows the mean F2 scores for each approach.

Table 2. Mean F2 values for each model using different sampling approaches over all data sets (image by author).

The table is read row-wise. For each model, the F2 score in bold is its highest (best) score across the sampling approaches. The cell highlighted in bold and green is the best overall score. The logistic regression (lr) model achieved the overall highest F2 score when combined with the hybrid SMOTE+RND sampling approach.

As mentioned previously, the F2 score is threshold-dependent. A threshold-independent metric is PR AUC, shown below (Figure 3).

Figure 3. Box plots of PR AUC scores for each model using different sampling approaches over all data sets (image by author).

We can see that the Gaussian Naive Bayes (gnb) shows the largest spread, while the voting-classifier approaches (lr_xgb_rf, lr_xgb_rf_ffnn) show the smallest. Looking at Table 3, we can see that the best approaches are the voting-classifier models with no sampling.

Table 3. Mean PR AUC values for each model using different sampling approaches over all data sets (image by author).

In our specific case, the soft voting classifier using logistic regression, an XGB classifier, and a random forest (lr_xgb_rf) performs slightly better than the soft voting classifier that additionally includes the feedforward neural network (lr_xgb_rf_ffnn).

For this article, different approaches were benchmarked using five freely available customer churn data sets. It is important to point out that this small number of data sets is probably not representative enough to draw reliable conclusions.

However, I hope this article gives you ideas about which approaches (e.g., a soft voting classifier with no sampling) you could try out for your next churn prediction project. The code for the models and a more detailed explanation can be found in my previous article.

If you are interested in the values for the other mentioned metrics, please see the appendix below.

Geiler, L., Affeldt, S., Nadif, M., 2022. A survey on machine learning methods for churn prediction. Int J Data Sci Anal. https://doi.org/10.1007/s41060-022-00312-5

Munkhdalai, L., Munkhdalai, T., Ryu, K.H., 2020. GEV-NN: A deep neural network architecture for class imbalance problem in binary classification. Knowledge-Based Systems. https://doi.org/10.1016/j.knosys.2020.105534

Brownlee, J. (2020, February). A Gentle Introduction to the Fbeta-Measure for Machine Learning. Machinelearningmastery. https://machinelearningmastery.com/fbeta-measure-for-machine-learning/

Czakon, J. (2022, July 21). F1 Score vs ROC AUC vs Accuracy vs PR AUC: Which Evaluation Metric Should You Choose? neptune.ai. Retrieved September 18, 2022, from https://neptune.ai/blog/f1-score-accuracy-roc-auc-pr-auc

Saito, T., & Rehmsmeier, M. (2015, March 4). The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE, 10(3), e0118432. https://doi.org/10.1371/journal.pone.0118432

Table A1. F1 macro score (image by author).
Table A2. F2 score (image by author).
Table A3. Lift score (image by author).
Table A4. Precision (image by author).
Table A5. Recall (image by author).
Table A6. ROC AUC (image by author).

