Feature Selection with Boruta in Python | by Andrew D #datascience

By Jessie Hobb On Aug 9, 2022

Learn how the Boruta algorithm works for feature selection. Explanation + template

The feature selection process is fundamental in any machine learning project. In this post we’ll go through the Boruta algorithm, which lets us rank our features from the most important to the least impactful for our model. Boruta is simple to use yet powerful, and analysts should consider incorporating it into their pipelines.

Boruta is not a stand-alone algorithm: it sits on top of the Random Forest algorithm. In fact, the name Boruta comes from the name of the spirit of the forest in Slavic mythology. To understand how the algorithm works, we’ll start with a brief introduction to Random Forest.

Random Forest is based on the concept of bagging — which is about creating many random samples from the training set and training a different statistical model for each one. For a classification task the result is the majority of votes from the models, while for a regression task the result is the average of the various models.
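To make the bagging idea concrete, here is a minimal sketch for a regression task. The synthetic data, the number of models and the tree depth are assumptions chosen only for illustration; the point is the bootstrap sampling and the final averaging.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical toy data: the target depends on the first two columns only
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

n_models = 25
models = []
for _ in range(n_models):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(max_depth=4, random_state=0)
    models.append(tree.fit(X[idx], y[idx]))

# Regression: the bagged prediction is the average over all models
all_preds = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
bagged_pred = all_preds.mean(axis=0)
```

For a classification task the last step would take the most frequent predicted class instead of the mean.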

The difference between canonical bagging and Random Forest is that the latter uses only decision trees as base models. In addition, each decision tree takes into account only a limited, random subset of the features. This allows the Random Forest algorithm to estimate the importance of each feature, since it can track how the prediction error changes depending on which features are used.

Let’s consider a classification task. The way RF estimates feature importance works in two phases. First, each decision tree makes and stores its predictions. Second, the values of one feature at a time are randomly permuted across the samples and the previous step is repeated, recording the predictions again. For a single decision tree, the importance of a feature is the difference in performance between the predictions on the original features and those on the permuted features, divided by the number of examples in the training set. The importance of a feature for the whole forest is the average of these measurements across all trees. What this procedure does not do is compute a z-score for each feature, that is, the mean importance divided by its standard deviation across trees. This is where Boruta comes into play.
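As a rough illustration of the permutation idea, the sketch below uses scikit-learn's `permutation_importance`. Note the assumptions: this function scores the whole forest on a held-out set rather than each tree separately, so it is a close relative of the procedure described above and not the exact same computation. The toy data and the `z_like` ratio (mean importance over its standard deviation across repeats) are hypothetical names introduced only for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical toy data: only the first two columns drive the label
X = rng.normal(size=(600, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Shuffle one column at a time and measure the drop in accuracy
result = permutation_importance(
    forest, X_test, y_test, n_repeats=20, random_state=0
)

# A z-score-like ratio: mean drop divided by its spread across repeats
z_like = result.importances_mean / (result.importances_std + 1e-12)

for j in range(X.shape[1]):
    print(f"feature {j}: mean drop = {result.importances_mean[j]:.3f}")
```

The informative columns should show a clear drop in accuracy when shuffled, while the noise columns should stay close to zero.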

The idea underlying Boruta is both fascinating and simple at the same time: for every feature in the original dataset, we create a randomly shuffled copy (called a shadow feature) and train classifiers on this extended dataset. To understand the importance of a feature, we compare it to all the generated shadow features. Only features that are statistically more important than these synthetic features are retained, as they contribute more to model performance. Let’s see the steps in a bit more detail.
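Creating shadow features takes only a few lines of NumPy. The helper name `add_shadow_features` is an assumption made for this sketch; it shuffles every column independently and appends the shuffled copies to the right of the original matrix.

```python
import numpy as np

def add_shadow_features(X, rng):
    """Return X extended with one independently shuffled copy of each column."""
    shadows = np.column_stack(
        [rng.permutation(X[:, j]) for j in range(X.shape[1])]
    )
    return np.hstack([X, shadows])

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
X_ext = add_shadow_features(X, rng)

print(X_ext.shape)  # original columns first, shadow columns after them
```

Each shadow column keeps the same values as its source column, so its distribution is unchanged, but the shuffling breaks any link with the target.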

Creates a copy of the training set features and merges it with the original features
Randomly permutes the values of these synthetic features to remove any correlation between them and the target variable y. In other words, each synthetic feature is a shuffled version of the original feature it derives from
Re-randomizes the synthetic features at each new iteration
At each new iteration, computes the z-score of all original and synthetic features. A feature is considered relevant in that iteration if its importance is higher than the maximum importance of all synthetic features
Applies a statistical test to all original features and keeps a record of the results. The null hypothesis is that the importance of a feature is equal to the maximum importance of the synthetic features. It is rejected when the importance of a feature is significantly higher or significantly lower than that maximum
Removes features deemed unimportant from both the original and the synthetic dataset
Repeats all the steps for n iterations, or until every feature has been either removed or confirmed as important
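The steps above can be sketched as a short loop. This is a deliberately simplified illustration, not the reference implementation: it uses the forest's impurity-based `feature_importances_` instead of z-scores, runs the statistical test only once at the end, and does not drop rejected features between iterations. The names `simple_boruta` and `binom_tail`, the toy data and the Bonferroni threshold are all assumptions made for this example.

```python
from math import comb

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def binom_tail(hits, n, p=0.5):
    """P(X >= hits) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(hits, n + 1))

def simple_boruta(X, y, n_iter=20, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for i in range(n_iter):
        # Fresh shadow features at every iteration
        shadows = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(n_features)]
        )
        forest = RandomForestRegressor(
            n_estimators=50, max_depth=5, random_state=seed + i
        )
        forest.fit(np.hstack([X, shadows]), y)
        imp = forest.feature_importances_
        # A "hit": the real feature beats the best shadow feature
        hits += imp[:n_features] > imp[n_features:].max()
    # Two one-sided binomial tests against "a hit is a coin flip"
    p_confirm = np.array([binom_tail(h, n_iter) for h in hits])
    p_reject = np.array([binom_tail(n_iter - h, n_iter) for h in hits])
    threshold = alpha / n_features  # Bonferroni correction (assumed here)
    decision = np.full(n_features, "tentative", dtype=object)
    decision[p_confirm < threshold] = "confirmed"
    decision[p_reject < threshold] = "rejected"
    return hits, decision

# Hypothetical toy data: only the first two columns matter
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=300)

hits, decision = simple_boruta(X, y)
for j, (h, d) in enumerate(zip(hits, decision)):
    print(f"feature {j}: hits = {h:2d} -> {d}")
```

Features that beat the best shadow feature in almost every iteration end up confirmed, those that almost never do are rejected, and the rest stay tentative.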

It should be noted that Boruta acts as a heuristic: there are no guarantees about its performance. It is therefore advisable to run the process several times and evaluate the results iteratively.

Let’s see how Boruta works in Python with its dedicated library, BorutaPy (installed as the boruta package). We will use the load_diabetes() dataset from sklearn.datasets to test Boruta on a regression problem.

The feature set X is made up of the following variables:

age (in years)
sex
bmi (body mass index)
bp (mean blood pressure)
s1 (tc, total cholesterol)
s2 (ldl, low-density lipoproteins)
s3 (hdl, high-density lipoproteins)
s4 (tch, total cholesterol / HDL ratio)
s5 (ltg, log of the triglyceride level)
s6 (glu, blood sugar level)

The target y is a quantitative measure of disease progression one year after baseline.

Result of 10 iterations of Boruta on the Sklearn diabetes dataset. Image by Author.
