
Model Selection with Imbalance Data: Only AUC may Not Save you | by Marco Cerliani | Feb, 2023



Photo by Mpho Mojapelo on Unsplash

Most data scientists who attend meetings to present ML results to business stakeholders end up answering questions like this:

AUC? What is it? Could you please elaborate?

Terms and concepts that are standard in a data scientist's daily routine may be unfamiliar to most people. This happens frequently when artificial intelligence products are developed to solve real-world problems. In this scenario, data scientists collaborate with domain experts to understand the dynamics of the field and incorporate them into automated solutions.

Taking a critical view of the added value that artificial intelligence provides in solving a business problem is crucial. In many situations, adopting machine learning is pointless because the task can be solved with simple automation rules, or because there is no evidence in the available data that justifies the use of artificial intelligence techniques. That said, choosing the most appropriate metrics to evaluate the effectiveness of a proposed solution is a very important step.

The choice of proper metrics is domain-dependent and changes according to the needs at hand. Choosing AUC as the metric to present the strengths of the adopted machine learning approach to business stakeholders may be risky: first, because the definition of AUC may not be clear to everyone; second, because it isn't easy to give AUC an economic meaning. Business people are money-oriented. If they don't understand that a proposed solution saves them time or money, they will likely reject it.

In this post, we don't suggest a method for choosing the correct business metric. Instead, we focus on a more technical problem that is closely related to the choice of metric: model selection. We want to test the effectiveness of model selection in an imbalanced binary classification context. The aim is to investigate how simple decisions (like metric selection or threshold tuning) influence the final results and how they relate to the business goals.

We start by simulating an unbalanced tabular dataset with 90% negative and 10% positive target samples.

Target distribution of simulated data [image by the author]
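
The article does not show the data-generation code. A minimal sketch with scikit-learn's make_classification could look like the following; the sample size and feature settings are assumptions, only the 90%/10% class ratio comes from the text.

from sklearn.datasets import make_classification

# Simulate an unbalanced binary dataset (all settings other than the
# class ratio are illustrative assumptions)
X, y = make_classification(
    n_samples=10_000,
    n_features=20,
    n_informative=10,
    weights=[0.9, 0.1],   # ~90% negatives, ~10% positives
    random_state=1234,
)

print(f"positive rate: {y.mean():.2%}")   # roughly 10%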

We can think of the minority class (10% of the samples in our case) as the customers who churned within a fixed time window, the failures that occurred in an engine system, or the frauds that were committed. For data scientists, working with unbalanced data in the real world is the norm.

Imbalance is difficult to deal with. Instead of fighting extreme imbalance, a better approach, which is simple and works in most cases, is to embrace it during the learning phase. In other words, rather than experimenting with oversampling methodologies, it is often best to simply downsample the majority class or leave the data as is. With a reasonable undersampling ratio, the models can still learn from the data, and the unbalanced nature of the phenomenon is preserved and reproducible at inference time.

Comparing techniques to handle target imbalance [image by the author]
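
As a rough illustration of the downsampling idea described above (the ratio and the plain NumPy implementation are assumptions, not the author's exact recipe), undersampling the majority class can be as simple as:

import numpy as np

def undersample_majority(X, y, ratio=3.0, seed=1234):
    # Keep all positives and at most `ratio` times as many negatives
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_neg = min(len(neg_idx), int(ratio * len(pos_idx)))
    keep = np.concatenate([pos_idx,
                           rng.choice(neg_idx, n_neg, replace=False)])
    keep = rng.permutation(keep)
    return X[keep], y[keep]

X_res, y_res = undersample_majority(X, y, ratio=3.0)   # assumes NumPy arrays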

With this simple modeling strategy in mind, we are ready to dive deep into model selection.

Machine Learning use cases lifecycle [image by the author]

When searching for the best model or set of parameters in imbalanced scenarios, the natural choice of metric falls on scoring-based ones, i.e., all the metrics that evaluate the goodness of fit using the predicted probabilities. In a binary classification context, the best-known scoring metrics are AUC, average precision, and cross-entropy.

Using a scoring metric in this situation seems a reasonable solution: we evaluate the goodness of fit independently of a hard threshold, like the one used to compute accuracy, precision, recall, or Fbeta.
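
To make the distinction concrete, here is a small sketch: scoring metrics consume the predicted probabilities directly, while threshold-based metrics first discretize them. The fitted classifier clf, the held-out split (X_test, y_test), and the 0.5 cut-off are assumptions for illustration.

from sklearn.metrics import (
    roc_auc_score, average_precision_score, log_loss,
    precision_score, recall_score, fbeta_score,
)

proba = clf.predict_proba(X_test)[:, 1]   # probability of the positive class

# Scoring metrics: no threshold involved
auc = roc_auc_score(y_test, proba)
ap = average_precision_score(y_test, proba)
ce = log_loss(y_test, proba)

# Threshold-based metrics: probabilities are discretized first
pred = (proba > 0.5).astype(int)
prec = precision_score(y_test, pred)
rec = recall_score(y_test, pred)
fb = fbeta_score(y_test, pred, beta=0.1)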

Coming back to our simulated use case, suppose that one of the requirements defined by the business stakeholders is to obtain high precision on the minority class. How can we carry out model selection and parameter tuning to satisfy this request?

from scipy import stats
from xgboost import XGBRFClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

model = RandomizedSearchCV(
    XGBRFClassifier(random_state=1234),
    dict(n_estimators=stats.randint(50, 300)),
    n_iter=20, random_state=1234,
    cv=5, n_jobs=-1,
    refit=False, error_score='raise',
    scoring={
        'fbeta': make_scorer(fbeta_score, beta=0.1),
        'roc_auc': 'roc_auc',
        'average_precision': 'average_precision'
    },
).fit(X, y)

We set up a randomized search over a random forest, searching for the optimal number of trees. We record cross-validated scores for AUC, average precision, and Fbeta. We choose Fbeta with a low beta value (0.1) as an approximation of precision (the quantity we are trying to optimize). The trial results are reported in the plots below.

Fbeta as a function of AUC (on the left) and average precision (on the right) [image by the author]

As expected, there is no clear relation between AUC/average precision and Fbeta. Choosing the model with the best AUC doesn’t guarantee the choice of the model with the best Fbeta.
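
With refit=False, the search object exposes the cross-validated scores in cv_results_, and comparing the rankings makes the mismatch explicit. A short sketch (the column names follow scikit-learn's multi-metric naming convention):

import numpy as np

results = model.cv_results_

best_by_auc = np.argmax(results['mean_test_roc_auc'])
best_by_fbeta = np.argmax(results['mean_test_fbeta'])

# The candidate with the highest AUC is not necessarily
# the one with the highest Fbeta
print(results['params'][best_by_auc])
print(results['params'][best_by_fbeta])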

At this point, with our “optimal” parameter configuration selected according to AUC, we have to perform an additional tuning step, on a new set of data, to select a hard threshold that maximizes precision and makes our stakeholders happy.
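
That two-step procedure would look roughly like the sketch below; the held-out split (X_val, y_val) and the selected n_estimators value are hypothetical.

import numpy as np
from sklearn.metrics import fbeta_score
from xgboost import XGBRFClassifier

# Refit with the configuration chosen by the search (hypothetical value)
best_model = XGBRFClassifier(n_estimators=150, random_state=1234).fit(X, y)

# Sweep candidate thresholds on a separate validation set
proba_val = best_model.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0, 1, 200)[1:-1]
scores = [fbeta_score(y_val, (proba_val > th).astype(int), beta=0.1)
          for th in thresholds]
best_th = thresholds[np.argmax(scores)]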

There is nothing wrong with doing this, but is there a more efficient approach? Can we tie the threshold tuning to the choice of parameters?

Embedding the threshold search inside model training is straightforward. With the ThresholdClassifier estimator below, it's possible to tune a binary classification threshold while optimizing a given scoring function (Fbeta in our case). This is done automatically on a validation set carved out of the training data. The predicted classes are obtained by discretizing the probabilities according to the tuned threshold.

import numpy as np
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split
from sklearn.base import clone, BaseEstimator, ClassifierMixin


class ThresholdClassifier(BaseEstimator, ClassifierMixin):
    """Wrap a binary classifier and tune its decision threshold on a validation split."""

    def __init__(self, estimator, refit=True, val_size=0.3):
        self.estimator = estimator
        self.refit = refit
        self.val_size = val_size

    def fit(self, X, y):

        def scoring(th, y, prob):
            # Negative Fbeta (beta=0.1, close to precision) so that lower is better
            pred = (prob > th).astype(int)
            return 0 if not pred.any() else \
                -fbeta_score(y, pred, beta=0.1)

        X_train, X_val, y_train, y_val = train_test_split(
            X, y, stratify=y, test_size=self.val_size,
            shuffle=True, random_state=1234
        )

        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X_train, y_train)

        # Sweep candidate thresholds on the held-out validation split
        prob_val = self.estimator_.predict_proba(X_val)[:, 1]
        thresholds = np.linspace(0, 1, 200)[1:-1]
        scores = [scoring(th, y_val, prob_val)
                  for th in thresholds]
        self.score_ = np.min(scores)
        self.th_ = thresholds[np.argmin(scores)]

        # Optionally refit on the full training data, keeping the tuned threshold
        if self.refit:
            self.estimator_.fit(X, y)
        if hasattr(self.estimator_, 'classes_'):
            self.classes_ = self.estimator_.classes_

        return self

    def predict(self, X):
        # Hard predictions use the tuned threshold instead of the default 0.5
        proba = self.estimator_.predict_proba(X)[:, 1]
        return (proba > self.th_).astype(int)

    def predict_proba(self, X):
        # Probabilities are returned untouched
        return self.estimator_.predict_proba(X)

The ThresholdClassifier estimator is model-agnostic and can be used with any binary classifier that outputs probabilities. In our example, we wrap it around our random forest and, as before, search for the optimal parameters.
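
For instance, the earlier search can be reused with the wrapper. This is a sketch, where the estimator__n_estimators prefix is how scikit-learn addresses parameters of the nested forest; it reuses the imports from the earlier RandomizedSearchCV example.

model = RandomizedSearchCV(
    ThresholdClassifier(XGBRFClassifier(random_state=1234)),
    dict(estimator__n_estimators=stats.randint(50, 300)),
    n_iter=20, random_state=1234,
    cv=5, n_jobs=-1,
    refit=False, error_score='raise',
    scoring={
        'fbeta': make_scorer(fbeta_score, beta=0.1),
        'roc_auc': 'roc_auc',
        'average_precision': 'average_precision'
    },
).fit(X, y)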

Fbeta as a function of AUC (on the left) and average precision (on the right) [image by the author]

Not surprisingly, there is still no clear relationship between AUC/average precision and Fbeta. Comparing the scores obtained by the raw random forest and by the random forest with threshold tuning, we observe a difference in the Fbeta values.

Fbeta obtained w/ (red) and w/o threshold tuning (blue) for the same set of parameters [image by the author]

Searching for an optimal classification threshold yields better precision on the minority class for the same set of parameters. The procedure does not affect the predicted probabilities, so scoring metrics like AUC or average precision remain unaltered.

Fbeta, as a function of AUC (on the left) and average precision (on the right), obtained w/ (red) and w/o threshold tuning (blue) [image by the author]

We are not trying to claim the best-performing model by chasing marginal improvements in validation metrics; we must pursue business goals. In our simulated scenario, it is evident that simple tricks can yield better precision. The notable point is that we obtain these results without any additional validation data, by combining the parameter search with threshold tuning.

In this post, we outlined the main differences between scoring metrics and threshold-based (hard-prediction) ones, and we saw how they behave in an unbalanced binary classification context aimed at solving real business problems. If our goal is to measure how good we are at detecting churned customers, identifying frauds, or finding failed engine components, relying only on AUC may produce incomplete or suboptimal solutions. As always, we must understand the business requirements deeply from the beginning and try to satisfy them.

