Techno Blender

Hyperparameter Tuning and Sampling Strategy | V Vaseekaran



Finding the best sampling strategy using pipelines and hyperparameter tuning

One of the go-to steps in handling imbalanced machine learning problems is to resample the data. We can either undersample the majority class and/or oversample the minority class. However, there is a question that needs to be addressed: to what number should we reduce the majority class, and/or increase the minority class? An easy but time-consuming method is to alter the resampling values of the majority and minority classes one by one to find the best match. Thanks to the imbalanced-learn library and hyperparameter tuning, we can devise an efficient and relatively simple method to identify the best resampling strategy.

Photo by Dylan McLeod on Unsplash

Building the pipeline

The fraud detection dataset, which is CC licensed and can be accessed from the OpenML platform, is chosen for the experiment. The dataset has 31 columns: 28 of them (V1 to V28) are numerical features transformed with PCA to preserve confidentiality; ‘Time’ is the number of seconds elapsed between each transaction and the first transaction; ‘Amount’ is the transaction amount; and ‘Class’ is the label indicating whether the transaction is fraudulent. The data is highly imbalanced, as only 0.172% of all transactions are fraudulent.

To determine the best sampling ratio, we need to build a machine-learning pipeline. Many practitioners prefer scikit-learn’s (sklearn) Pipeline, as it provides a simple way to chain preprocessing steps and a model. However, undersampling and oversampling cannot be done with the regular sklearn Pipeline: samplers change the number of rows, and resampling must be applied only during training, never to the test data. The Pipeline class implemented by the imbalanced-learn (imblearn) library remedies this, as it ensures that resampling occurs only during the fit method.

First, we load the data. Then we separate the features and labels and create a train-test split, so the test split can be used to evaluate the model trained on the train split.

Once the train and test sets are created, the pipeline can be instantiated. The pipeline comprises a sequence of steps: transforming the data, resampling, and finally the model. To keep it simple, we use a numerical scaler (sklearn’s RobustScaler) on the numerical fields (all features in this dataset are numerical), followed by an undersampling step (the RandomUnderSampler class), an oversampling step (the SMOTE algorithm), and finally a machine-learning model (LightGBM, a gradient-boosting framework). As an initial benchmark, the majority class is undersampled to 10,000 samples and the minority class is oversampled to 10,000 samples.

Initial Modeling

Before moving on to hyperparameter tuning, an initial model is trained: the pipeline constructed in the previous section is fit on the train split and evaluated on the test split. As the data is highly imbalanced (before resampling), accuracy alone is misleading; a model that labels every transaction as non-fraudulent would still score over 99% accuracy. Therefore, using sklearn’s classification report, we monitor the precision, recall, and f1-score of both classes.

Classification report for evaluating the base pipeline. Image by the author.

The test results show that although the model classifies non-fraudulent transactions almost perfectly, it is poor at detecting fraud: the precision and f1-score for the fraud class are low. With this benchmark, we’ll see how hyperparameter tuning can help us find a better sampling ratio.

Tuning to Find the Best Sampling Ratio

In this article, we focus only on the sampling strategy of the undersampling and oversampling techniques. First, two lists of candidate sampling strategies are created, one for the undersampler and one for the oversampler. The search will then select the best combination from these candidates.

GridSearchCV and RandomizedSearchCV are two hyperparameter tuning classes from sklearn: the former exhaustively evaluates every combination of the parameter values provided, while the latter samples random combinations for a user-specified number of iterations. In this experiment, we use GridSearchCV. It requires two main arguments: estimator and param_grid. The estimator is the model (in our case, the pipeline), and param_grid is a dictionary whose keys name the parameters to tune and whose values are the candidate values for each parameter.

Using a pipeline to resample and build the model makes it easy to tune the sampling strategy. Since we named the undersampling and oversampling steps in the pipeline (“undersampler” and “oversampler”), we can address their parameters in param_grid with the step__parameter naming convention (e.g. “undersampler__sampling_strategy”).

Once the hyperparameter tuning is completed, we can obtain the best set of sampling strategies for undersampling and oversampling from the specified list, and then use them to train the pipeline and evaluate its performance.

Classification report for evaluating the hp-tuned pipeline. Image by the author.

Using hyperparameter tuning to find the best sampling strategy is effective, as the pipeline has significantly improved in detecting fraudulent transactions.

The repository for the workings of this article can be found here.

Final Words

Finding the sweet spot of the sampling ratio when resampling is time-consuming and complex, but machine learning pipelines and hyperparameter tuning can provide a simple solution to alleviate the problem.

In this article, the impact of hyperparameter tuning to determine the best sampling ratio for undersampling and oversampling techniques is examined, and the solution is relatively easy to implement.

I hope you found this article useful, and I would love to hear your feedback/criticism about this article, as it would assist me in improving my writing and coding skills.

Cheers!


