
3-Step Feature Selection Guide in Sklearn to Supercharge Your Models | by Bex T. | Oct, 2022



Develop a robust Feature Selection workflow for any supervised problem

Learn how to face one of the biggest challenges of machine learning with the best of Sklearn feature selectors.

Photo by Steve Johnson

Introduction

Today, it is common for datasets to have hundreds if not thousands of features. On the surface, this might seem like a good thing — more features give more information about each sample. But more often than not, these additional features don’t provide much value and introduce complexity.

The biggest challenge of Machine Learning is to create models that have robust predictive power using as few features as possible. But given the massive sizes of today’s datasets, it is easy to lose track of which features are important and which ones aren’t.

That’s why there is an entire skill to be learned in the ML field — feature selection. Feature selection is the process of choosing a subset of the most important features while trying to retain as much information as possible (An excerpt from the first article in this series).

As feature selection is such a pressing issue, there is a myriad of solutions to choose from 🤦‍♂️. To spare you some pain, I will teach you 3 feature selection techniques that, when used together, can supercharge any model’s performance.

This article will give you an overview of these techniques and how to use them without wondering too much about the internals. For a deeper understanding, I have written separate posts for each with the nitty-gritty explained. Let’s get started!

Intro to the dataset and the problem statement

We will be working with the Ansur Male dataset, which contains more than 100 different body measurements of US Army personnel. I have been using this dataset extensively throughout this feature selection series because it contains 98 numeric features — a perfect dataset to teach feature selection.
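A minimal sketch of loading the data, assuming a local CSV named ansur_male.csv (the file name is an assumption) and keeping only the numeric columns:

```python
import pandas as pd

# Assumed file name; point this at wherever the ANSUR Male CSV lives
ansur = pd.read_csv("ansur_male.csv")

# Keep only the numeric body-measurement columns
df = ansur.select_dtypes(include="number")
df.shape
```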


We will be trying to predict the weight in pounds, so it is a regression problem. Let’s establish a base performance with simple Linear Regression. LR is a good candidate for this problem because we can expect body measurements to be linearly correlated with weight:
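Something along these lines, assuming the target column is named weightlbs (the real column name may differ) and scoring with cross-validated R²:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Assumed name of the target column (weight in pounds)
X = df.drop("weightlbs", axis=1)
y = df["weightlbs"]

lr = LinearRegression()
print(cross_val_score(lr, X, y, cv=5, scoring="r2").mean())
```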

For the base performance, we got an impressive R-squared of 0.956. However, this might be because there is also a weight-in-kilograms column among the features, giving the algorithm all it needs (we are trying to predict weight in pounds). So, let’s try without it:
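Assuming the kilogram column is named weightkg, dropping it and re-scoring might look like this:

```python
# Assumed name of the weight-in-kilograms column
X = X.drop("weightkg", axis=1)
print(cross_val_score(lr, X, y, cv=5, scoring="r2").mean())
```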

Now, we have 0.945, but we managed to reduce model complexity.

Step I: Variance Thresholding

The first technique targets the individual properties of each feature. The idea behind Variance Thresholding is that features with low variance do not contribute much to overall predictions. Such features have distributions with too few unique values, or variances too low to matter. Sklearn’s VarianceThreshold (VT) estimator helps us remove them.

One concern before applying VT is the scale of the features. Variance depends on scale: multiply a feature’s values by a constant and its variance is multiplied by the square of that constant. This means features measured on different scales have variances that cannot be safely compared. So, we must apply some form of normalization to bring all features to the same scale and only then apply VT. Here is the code:
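A minimal sketch of the step; 0.005 is an illustrative threshold, not necessarily the exact value used originally:

```python
from sklearn.feature_selection import VarianceThreshold

# Dividing each feature by its mean puts all variances on a comparable scale
normalized_X = X / X.mean()

vt = VarianceThreshold(threshold=0.005)  # illustrative threshold between 0 and 1
vt.fit(normalized_X)

# Boolean mask of features to keep; subsetting this way preserves column names
mask = vt.get_support()
X_reduced = X.loc[:, mask]
X_reduced.shape
```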

After normalization (here, we are dividing each feature by its mean), you should choose a threshold between 0 and 1. Instead of using the .transform() method of the VT estimator, we are using get_support(), which gives a boolean mask (True values mark the features that should be kept). The mask can then be used to subset the data while preserving the column names.

This may be a simple technique, but it can go a long way toward eliminating useless features. For deeper insight and more explanation of the code, you can head over to this article:

Step II: Pairwise Correlation

We will further trim our dataset by focusing on the relationships between features. One of the best metrics for measuring a linear connection is Pearson’s correlation coefficient (denoted r). The logic behind using r for feature selection is simple. If the correlation between features A and B is 0.9, their values move together in an almost perfectly linear fashion, so A already tells the model most of what B can. In other words, in a dataset where A is present, you can discard B or vice versa.

Photo by author

There isn’t a Sklearn estimator that implements feature selection based on correlation. So, we will do it on our own:
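One possible implementation; the function name identify_correlated and the details below are a reconstruction of what the next paragraphs describe:

```python
import numpy as np
import pandas as pd

def identify_correlated(df: pd.DataFrame, threshold: float) -> list:
    """Return names of columns whose correlation with another column exceeds `threshold`."""
    corr_matrix = df.corr().abs()

    # Boolean mask that keeps only the values below the diagonal,
    # so each pair of features is considered exactly once
    mask = np.tril(np.ones_like(corr_matrix, dtype=bool), k=-1)
    reduced_matrix = corr_matrix.where(mask)

    # A column is flagged for dropping if it is strongly correlated
    # with at least one other column
    to_drop = [
        col for col in reduced_matrix.columns
        if (reduced_matrix[col] > threshold).any()
    ]
    return to_drop
```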

This function is a shorthand that returns the names of columns that should be dropped based on a custom correlation threshold. Usually, the threshold will be over 0.8 to be safe.

In the function, we first create a correlation matrix using .corr(). Next, we create a boolean mask to only include correlations below the correlation matrix’s diagonal. We use this mask to subset the matrix. Finally, in a list comprehension, we find the names of features that should be dropped and return them.

There is a lot I didn’t explain about the code. Even though this function works well, I suggest reading my separate article on feature selection based on the correlation coefficient. I fully explained the concept of correlation and how it is different from causation. There is also a separate section on plotting the perfect correlation matrix as a heatmap and, of course, the explanation of the above function.

For our dataset, we will choose a threshold of 0.9:
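Using the function defined above:

```python
to_drop = identify_correlated(X_reduced, threshold=0.9)
len(to_drop)
```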

The function tells us to drop 13 features:
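Dropping the flagged columns and checking what is left:

```python
X_reduced = X_reduced.drop(to_drop, axis=1)
X_reduced.shape
```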

Now, only 35 features are remaining.

Step III: Recursive Feature Elimination with Cross-Validation (RFECV)

Finally, we will choose the final set of features based on how they affect model performance. Most of the Sklearn models have either .coef_ (linear models) or .feature_importances_ (tree-based and ensemble models) attributes that show the importance of each feature. For example, let’s fit the Linear Regression model to the current set of features and see the computed coefficients:
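A sketch of inspecting the coefficients (coef_df is just a name introduced here for illustration):

```python
import pandas as pd

lr.fit(X_reduced, y)

# Pair each remaining feature with its coefficient and sort by magnitude
coef_df = pd.DataFrame(
    {"feature": X_reduced.columns, "coefficient": lr.coef_}
).sort_values("coefficient", key=abs)

coef_df.head()
```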


The above DataFrame shows the features with the smallest coefficients. The smaller the weight or coefficient of a feature, the less it contributes to the model’s predictive power. With this idea in mind, Recursive Feature Elimination removes features one by one, using cross-validation at each step, until the smallest best-performing set of features remains.

Sklearn implements this technique under the RFECV class which takes an arbitrary estimator and several other arguments:
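A minimal configuration might look like this, with illustrative values for step, cv, and scoring:

```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

rfecv = RFECV(
    estimator=LinearRegression(),
    step=1,        # how many features to remove at each iteration
    cv=3,          # cross-validation folds used to score each subset
    scoring="r2",
    n_jobs=-1,
)
rfecv.fit(X_reduced, y)
```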

After fitting the estimator to the data, we can get a boolean mask with True values encoding the features that should be kept. We can finally use it to subset the original data one last time:
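For example (X_final is just a name for the final subset):

```python
# Boolean mask with True for the features RFECV decided to keep
mask = rfecv.get_support()
X_final = X_reduced.loc[:, mask]
X_final.shape
```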

After applying RFECV, we managed to discard 5 more features. Let’s evaluate a final GradientBoostingRegressor model on this feature-selected dataset and see its performance:
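A sketch of the final evaluation, again with cross-validated R²:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

gb = GradientBoostingRegressor()
print(cross_val_score(gb, X_final, y, cv=5, scoring="r2").mean())
```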

Even though we got a slight drop in performance, we managed to remove almost 70 features, reducing model complexity significantly.

In a separate article, I further discussed the .coef_ and .feature_importances_ attributes as well as extra details of what happens in each elimination round of RFE:

Summary

Feature selection should not be taken lightly. Besides reducing model complexity, some algorithms can even see an increase in performance because the distracting features are gone from the dataset. It is also not wise to rely on a single method. Instead, approach the problem from different angles, using various techniques.

Today, we saw how to apply feature selection to a dataset in three stages:

  1. Based on the properties of each feature using Variance Thresholding.
  2. Based on the relationships between features using Pairwise Correlation.
  3. Based on how features affect a model’s performance.

Using these techniques in succession should give you reliable results for any supervised problem you face.

Further Reading on Feature Selection

