
Create Powerful Model Explanations for Classification Problems with Logistic Regression | by Jin Cui | Dec, 2022



Photo by Pablo García Saldaña on Unsplash

Logistic Regression is commonly used for modelling classification problems. It is a parametric algorithm whose output lends itself to powerful model explanations (often referred to as explainable ML). In particular, it overcomes the known limitations of a Linear Regression for modelling classification problems and, in contrast to non-parametric Tree-based algorithms, it can readily tell users how a step change in a particular feature influences the target variable. I’ll demonstrate this in later sections of this article using a dataset shared by IBM.

In the first instance, I’ll demonstrate the advantages of using a Logistic Regression over a Linear Regression for a classification problem with an example.

In the insurance context, a lapse refers to the event that a policyholder exercises the option to terminate the insurance contract with the insurer. Commercially, it’s in the insurer’s interest to understand whether a policyholder is likely to lapse at the next policy renewal, as this typically helps the insurer prioritize its retention efforts. This then becomes a classification problem as the response variable takes the binary form of 0 (non-lapse) or 1 (lapse), given the attributes of a particular policyholder.

In the synthetic dataset underlying the chart below, we record the lapse behaviour of 30 policyholders, with 1 denoting a lapse and 0 otherwise.

Chart 1: Observed lapses of 30 policyholders. Chart by author

It would be counter-intuitive to model lapses in this instance using a Linear Regression, as indicated by the grey line in the chart below. Evidently, a Logistic Regression, indicated by the green line, provides a better fit.

Chart 2: Linear Regression vs. Logistic Regression. Chart by author

Logistic Regression belongs to the family of Generalised Linear Models (“GLM”). Assuming we want to model the probability of lapse for a policyholder, denoted by p, a Logistic Regression has a variance function of the form V(p) = p(1 − p), which is minimized when p takes the value of 0 or 1.

This is a desirable property when modelling the probability of lapse for a policyholder, as the observed lapse events can only take the value of 0 or 1: the model effectively gives greater credibility to observations at 0 and 1. The same reasoning extends naturally to other binary classification problems.

In this section, I’ll be demonstrating the advantages of using a Logistic Regression over Tree-based algorithms using an IBM dataset. I’ll start by describing the model output of a Logistic Regression.

The Log-odds Ratio

Predictive models aim to express the relationship between the independent features (“X”) and the target variable (“Y”). In a Linear Regression, the relationship may be expressed as:
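In its simplest form, with two features X₁ and X₂ and coefficients β₀, β₁ and β₂, this relationship (referred to as equation (2) below) can be written as:

Y = β₀ + β₁X₁ + β₂X₂        (2)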

In this instance, Y may represent the property price, and X₁ and X₂ may represent the property size and number of bedrooms in the property respectively, in which case we would expect a positive relationship between the independent and dependent variables in the form of positive coefficients for β₁ and β₂.

On the other hand, Logistic Regression aims to model the probability (“p”) for an event (e.g. lapse of a policyholder). This is in the first instance expressed by replacing the Y in equation (2) with the Log-odds Ratio as shown in equation (3) below:
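Using the same notation, equation (3) models the logarithm of the odds ratio p / (1 − p) as a linear function of the features:

log( p / (1 − p) ) = β₀ + β₁X₁ + β₂X₂        (3)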

Mathematically, re-arranging equation (3) gives p as shown in equations (4) and (5) below:
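Exponentiating equation (3) and then solving for p gives:

p / (1 − p) = exp(β₀ + β₁X₁ + β₂X₂)        (4)

p = exp(β₀ + β₁X₁ + β₂X₂) / (1 + exp(β₀ + β₁X₁ + β₂X₂))        (5)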

Equations (3), (4) and (5) make interpretation of the Logistic Regression model outputs straightforward, for the following reasons:

  • Equations (3) and (4) express the odds ratio using a structure that maintains linearity in the features.
  • Equation (5) maps the features to a probability ranging from 0 to 1. This enables users of the model to allocate the output probability p for each input data point (e.g. probability of lapse for each policyholder, from which retention efforts can be prioritized).

The β Coefficients

By having an intercept term β₀, a baseline scenario is easily established for benchmarking. For example, under a Logistic Regression, if β₂ is the estimated coefficient for a categorical variable which has been encoded to take the value of 1 or 0 (e.g. for high or low income level), then equation (6) below can be used to show how much the output probability p differs by this particular feature (i.e. between customers of high and low income level).
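Dividing the odds in equation (4) evaluated at X₂ = 1 (high income) by the odds evaluated at X₂ = 0 (low income), with all other features held fixed, gives what is referred to as equation (6) below:

[ p₁ / (1 − p₁) ] / [ p₀ / (1 − p₀) ] = exp(β₂)        (6)

where p₁ and p₀ denote the probabilities of lapse for high- and low-income customers respectively.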

Equation (6) implies that the effect of income level on the odds of lapse is a constant multiplicative factor of exp(β₂). When the β₂ coefficient is sufficiently small, the proportional change in the odds can be approximated directly by the coefficient β₂ itself (as exp(β) ≈ 1 + β when β is small).

Moreover, it can be inferred from equation (4) that the sign of each β coefficient shows the direction in which the corresponding feature influences the output probability p.

However, in practice, not all of the fitted features are necessarily significant. As a rule of thumb, a feature is generally considered statistically significant if its β coefficient has a p-value of less than 0.05, which suggests that the coefficient has been estimated with relatively small variance.

β coefficients of large variance indicate that less reliance should be placed on the corresponding feature, as the estimated coefficients may vary over a wide range.

In summary, with the β coefficients estimated by a Logistic Regression, users can output the probability of a target event as well as show how each feature influences the output probability at an individual data-point level. This aids greatly with model explanations, which I’ll demonstrate later on.

The IBM Dataset

The dataset I’ll be using to demonstrate model explanations under a Logistic Regression is the well-known IBM Telco Churn dataset. It contains 20 independent features and 1 target variable, “Churn”, which indicates whether the customer discontinued using the Telco’s service. It was originally intended for training a classification model that predicts this target variable.

This dataset was sourced from the official IBM GitHub repository¹. A data dictionary for this dataset is provided in the table below. All features are categorical apart from tenure, MonthlyCharges (i.e. monthly premiums) and TotalCharges.

Table 3: Data dictionary. Table by author

Model Fitting

For the purpose of this demonstration, I’ve fitted a Logistic Regression in R with a subset of the features set out in Table 3. The R code for model fitting is provided below, noting that a base customer profile was set using the relevel method for each of the categorical features. This allows us to quantify the relative change in the predicted probability of churn by a particular feature against a pre-defined base customer profile.

## 1. Load Telco Churn data
data_raw <- read.csv('Directory/Telco-Customer-Churn.csv', header = TRUE)
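# Note: Churn_Flag and SeniorCitizen_Flag are assumed to be Yes/No (or 0/1) flags
# derived from the raw Churn and SeniorCitizen columns during preprocessing (not shown here).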

## 2. 70/30 Train-Test Split
y = data_raw$Churn_Flag

set.seed(268)
sample_size <- floor(0.7 * nrow(data_raw))
sample_indi <- sample(seq_len(nrow(data_raw)), size = sample_size)
d_train <- data_raw[sample_indi,]
d_test <- data_raw[-sample_indi,]
y_train <- y[sample_indi]
y_test <- y[-sample_indi]

## 3. Logistic Regression Model Fitting
glm_1 <- glm(Churn_Flag ~

tenure
+ MonthlyCharges
+ relevel(factor(gender), ref = "Female")
+ relevel(factor(SeniorCitizen_Flag), ref = "Yes")
+ relevel(factor(PhoneService), ref = "No")
+ relevel(factor(InternetService), ref = "DSL")
+ relevel(factor(Contract), ref = "Month-to-month")
+ relevel(factor(PaperlessBilling), ref = "No")
+ relevel(factor(PaymentMethod), ref = "Bank transfer (automatic)")
#+ Partner
#+ Dependents
#+ MultipleLines
#+ OnlineSecurity
#+ OnlineBackup
#+ DeviceProtection
#+ TechSupport
#+ StreamingTV
#+ StreamingMovies
, data = d_train
, family = binomial("logit")
)

summary(glm_1)

The screenshot below shows the output of the glm_1 model as fitted above. In particular:

  • The Estimate column stores the estimated β coefficient for each feature fitted.
  • The Pr(>|z|) column stores the p-value for each feature (which can be loosely viewed as the probability of observing a coefficient at least this extreme if the feature truly had no effect), most of which are < 0.05.
Table 4: R output of Logistic Regression. Table by author

On a separate note, the model glm_1 achieves an AUC of 0.84 (which is not bad at all), although model performance is not critically important here, as this demonstration focuses on model explanations.
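As a rough sketch of how an AUC can be computed for glm_1 on the hold-out set, assuming the pROC package is installed and that d_test retains the Churn_Flag outcome:

## Predicted churn probabilities on the hold-out set
library(pROC)
pred_test <- predict(glm_1, newdata = d_test, type = "response")

## AUC of the fitted model on the test data
roc_test <- roc(d_test$Churn_Flag, pred_test)
auc(roc_test)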

With the estimated β coefficients above, we can immediately calculate the probability of churn for a base customer profile. Specifically, using equation (5), the probability of churn for the base customer profile set via the reference levels in the code is 32% (i.e. almost 1 in 3!).

This is calculated by taking the sum product of the Estimate and Feature Value columns as shown in the table below (which gives -0.7557), taking the exponential of this value and then dividing it by one plus the exponential of this value, per equation (5).

Table 5: Estimating base probability. Table by author
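In R, this final step is just the logistic (inverse-logit) transformation of the linear predictor. A minimal sketch, taking the -0.7557 sum product from Table 5 as given:

## Linear predictor for the base customer profile (sum product from Table 5)
lp_base <- -0.7557

## Equation (5): p = exp(lp) / (1 + exp(lp)); plogis() computes the same thing in one step
exp(lp_base) / (1 + exp(lp_base))
plogis(lp_base)   ## ~0.32, i.e. roughly a 1-in-3 probability of churn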

Moreover, the probability of churn can be calculated for any customer in the data by populating the Feature Value column in Table 5 with the profile of the customer of interest. This can help with prioritizing engagement with the customers who have the highest probability of churn.
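As a sketch, the fitted model can generate these probabilities directly for any set of customers (here the hold-out set d_test), which can then be ranked to prioritize retention outreach:

## Predicted probability of churn for each customer in the hold-out set
churn_prob <- predict(glm_1, newdata = d_test, type = "response")

## Customers with the highest predicted probability of churn
head(sort(churn_prob, decreasing = TRUE), 10)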

In addition, the sign of the β coefficients informs the direction in which a particular feature influences churn. For example, a customer is more likely to leave the Telco if they are Male or have higher MonthlyCharges (which is intuitive), and less likely if they have locked in a two-year contract.

Taking model explanations one step further, again using equation (5), we can show how a step change in a particular feature influences churn relative to the 32% probability for an ‘average’ customer. The table below shows the (additive) change in probability of churn relative to the base customer profile.

Table 6: Change in probability of churn by feature. Table by author
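A minimal sketch of this calculation for a single feature, taking the base linear predictor of -0.7557 as given and assuming the fibre-optic level is labelled "Fiber optic" in the raw data, so that its coefficient can be picked out of glm_1 by name:

## Base probability of churn for the reference customer profile
lp_base <- -0.7557
p_base  <- plogis(lp_base)

## Illustrative example: move the base profile from DSL to Fiber optic internet
b_fibre <- coef(glm_1)[grep("Fiber optic", names(coef(glm_1)))]
p_fibre <- plogis(lp_base + b_fibre)

## Additive change in probability of churn relative to the base profile
p_fibre - p_base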

In summary, the Logistic Regression has effectively informed users about the following:

  • The significant drivers of churn as identified in Table 4, given the p value mechanism naturally filters out the non-significant drivers;
  • Probability of churn for any customer in the data;
  • Change in probability of churn by a particular feature relative to a pre-defined base customer profile.

Table 6 can also be visualized in the waterfall chart below, where the green bars represent a decrease and the black bars represent an increase in probability of churn by a particular feature relative to the base customer profile.

Chart 7: Change in probability of churn by feature. Chart by author

Based on the output of the model, if I were a decision maker at the Telco, I would start proactively managing churn by focusing retention efforts on customers who:

  • Have the Fibre Optic Internet Service (may need to investigate whether this is cause or correlation)
  • Pay by Electronic Cheque
  • Pay high Monthly Premiums
  • Have Paper Billing

There are some known limitations of applying Logistic Regression (or more generally, GLM) in practice, including:

  • Due to the parametric nature of the model, it requires significant feature engineering effort, as features may need to be manually fitted. This includes the fitting of interaction terms, where the effect of one feature may depend on the level of another feature (see the sketch after this list). One example of such an interaction in the insurance context is that the effect of a premium increase may differ by age. The number of possible interaction terms grows rapidly with the number of features, and although Logistic Regression allows investigation into interaction terms, they may prove difficult to interpret.
  • The observations are assumed to be independent of one another, which may not be true in practice. For the use case of predicting churn, churn events may need to be measured and segmented by time periods, which may introduce the same customers appearing in multiple periods and therefore correlated observations.
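As a minimal sketch of what fitting an interaction term looks like in R, using the tenure and Contract features from the same dataset (this particular interaction is for illustration only and was not part of glm_1):

## '*' adds both main effects and their interaction; ':' would add the interaction term only
glm_int <- glm(Churn_Flag ~ tenure * relevel(factor(Contract), ref = "Month-to-month"),
               data = d_train,
               family = binomial("logit"))

summary(glm_int)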

Logistic Regression is a great model for classification problems, as its output allows for comprehensive model explanations, especially for a non-technical audience.

In practice, it’s my view that it is in a practitioner’s best interest to compare the performance of the Logistic Regression model against other models known for solving classification problems (such as Tree-based models) for best inferences. One use case in which the two types of models can collaborate is to use a Tree-based model to guide the fitting of numerical features such as age and tenure, where different segments of these features are expected to influence the target variable differently.

