
Significantly Increase Your Grid-Search Results With These Parameters | by Tomer Gabay | Dec, 2022



Grid search over any machine learning pipeline step using an EstimatorSwitch

Photo by Héctor J. Rivas on Unsplash

A very common step in building a machine learning model is to grid search over a classifier’s parameters on the train set, using cross-validation, to find the best parameters. What is less well known is that you can also grid search over virtually any pipeline step, such as feature engineering steps. For example, which imputation strategy works best for numerical values: mean, median or arbitrary? Which categorical encoding method should you use: one-hot encoding, or maybe ordinal?

In this article, I’ll guide you through the steps to be able to answer such questions in your own machine-learning projects using grid searches.

To install all the required Python packages for this article:

pip install extra-datascience-tools feature-engine

The dataset

Let’s consider the following very simple public domain data set I created, which has two columns: last_grade and course_passed. The last_grade column contains the grade the student achieved on their last exam, and the course_passed column is a boolean column that is True if the student passed the course and False if the student failed it. Can we build a model that predicts whether a student passed the course based on their last grade?

Let us first explore the dataset:

import pandas as pd

df = pd.read_csv('last_grades.csv')
df.isna().sum()

OUTPUT
last_grade 125
course_passed 0
dtype: int64

Our target variable course_passed has no nan values, so no need for dropping rows here.

Of course, to prevent any data leakage we should split our data set into a train and test set first before continuing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
df[['last_grade']],
df['course_passed'],
random_state=42)

Because most machine learning models don’t allow for nan values, we must consider different imputation strategies. Of course, in general, you would start with EDA (exploratory data analysis) to determine whether nan values are MAR (Missing at Random), MCAR (Missing Completely at Random) or MNAR (Missing not at Random). A good article that explains the differences between these can be found here:

Instead of analyzing why for some students their last grade is missing, we are simply going to try to grid search over different imputation techniques to illustrate how to grid search over any pipeline step, such as this feature engineering step.
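
To make the three candidate strategies concrete, here is a minimal sketch using plain pandas on a made-up series of grades. The values are invented for illustration and are not taken from last_grades.csv. In a real pipeline the mean and median are learned from the train set only, which is what the imputers used later in this article do when they are fitted.

```python
import pandas as pd

# Made-up grades with one missing value (illustrative only)
grades = pd.Series([4.0, None, 7.0, 10.0, 8.0], name="last_grade")

imputed = pd.DataFrame(
    {
        "mean": grades.fillna(grades.mean()),      # fills with 7.25
        "median": grades.fillna(grades.median()),  # fills with 7.5
        "arbitrary": grades.fillna(999),           # fills with a sentinel value
    }
)
print(imputed)
```

The arbitrary value stands far outside the range of real grades, so a tree-based model can split the imputed rows off from the observed ones.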

Let’s explore the distribution of the independent variable last_grade :

import seaborn as sns

sns.histplot(data=X_train, x='last_grade')

Distribution of last_grade (Image by Author)

It looks like the last grades are normally distributed with a mean of ~6.5 and values between ~3 and ~9.5.

Let’s also look at the distribution of the target variable to determine which scoring metric to use:

y_train.value_counts()
OUTPUT
True 431
False 412
Name: course_passed, dtype: int64

The target variable is roughly equally divided, which means we can use scikit-learn’s default scorer for classification tasks, which is the accuracy score. If the target variable is unequally divided, accuracy can be misleading, because a model that always predicts the majority class already scores high; use e.g. F1 instead.
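
For the unbalanced case, a different metric can be passed through GridSearchCV’s scoring parameter. The sketch below uses synthetic, unbalanced stand-in data (not the article’s dataset), so the variable names and numbers are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a minority positive class (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 1.3).astype(int)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [None, 2, 5]},
    scoring="f1",  # rank candidates by F1 instead of the default accuracy
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```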

Grid searching

Next, we are going to set up the model and the grid-search and run it by just optimizing the classifier’s parameters, which is how I see most data scientists use a grid-search. We’ll use feature-engine’s MeanMedianImputer for now to impute the mean and scikit-learn’s DecisionTreeClassifier for predicting the target variable.

from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

from feature_engine.imputation import MeanMedianImputer

model = Pipeline(
[
("meanmedianimputer", MeanMedianImputer(imputation_method="mean")),
("tree", DecisionTreeClassifier())
]
)

param_grid = [
{"tree__max_depth": [None, 2, 5]}
]

gridsearch = GridSearchCV(model, param_grid=param_grid)
gridsearch.fit(X_train, y_train)

pd.DataFrame(gridsearch.cv_results_).loc[:,
['rank_test_score',
'mean_test_score',
'param_tree__max_depth']
].sort_values('rank_test_score')

Results from code above (Image by Author)

As we can see from the table above, using GridSearchCV we learned that we can increase the accuracy of the model by ~0.55 just by changing the max_depth of the DecisionTreeClassifier from its default value None to 5. This clearly illustrates the positive impact grid searching can have.

However, we don’t know whether imputing the missing last_grades with the mean is actually the best imputation strategy. What we can do is grid search over three different imputation strategies using extra-datascience-tools’ EstimatorSwitch:

  • Mean imputation
  • Median imputation
  • Arbitrary number imputation (by default 999 for feature-engine’s ArbitraryNumberImputer)

from feature_engine.imputation import (
ArbitraryNumberImputer,
MeanMedianImputer,
)
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from extra_ds_tools.ml.sklearn.meta_estimators import EstimatorSwitch

# create a pipeline with two imputation techniques
model = Pipeline(
[
("meanmedianimputer", EstimatorSwitch(
MeanMedianImputer()
)),
("arbitraryimputer", EstimatorSwitch(
ArbitraryNumberImputer()
)),
("tree", DecisionTreeClassifier())
]
)

# specify the parameter grid for the classifier
classifier_param_grid = [{"tree__max_depth": [None, 2, 5]}]

# specify the parameter grid for feature engineering
feature_param_grid = [
{"meanmedianimputer__apply": [True],
"meanmedianimputer__estimator__imputation_method": ["mean", "median"],
"arbitraryimputer__apply": [False],
},
{"meanmedianimputer__apply": [False],
"arbitraryimputer__apply": [True],
},

]

# join the parameter grids together
model_param_grid = [
{
**classifier_params,
**feature_params
}
for feature_params in feature_param_grid
for classifier_params in classifier_param_grid
]
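
To check what the joined grid will actually try, it can be expanded with scikit-learn’s ParameterGrid, which is the same expansion GridSearchCV performs internally. This sketch repeats the grids from above so that it runs on its own; it needs only scikit-learn, since the parameter names are plain strings here.

```python
from sklearn.model_selection import ParameterGrid

classifier_param_grid = [{"tree__max_depth": [None, 2, 5]}]
feature_param_grid = [
    {
        "meanmedianimputer__apply": [True],
        "meanmedianimputer__estimator__imputation_method": ["mean", "median"],
        "arbitraryimputer__apply": [False],
    },
    {
        "meanmedianimputer__apply": [False],
        "arbitraryimputer__apply": [True],
    },
]

# every feature engineering grid is merged with every classifier grid
model_param_grid = [
    {**classifier_params, **feature_params}
    for feature_params in feature_param_grid
    for classifier_params in classifier_param_grid
]

candidates = list(ParameterGrid(model_param_grid))
# 2 imputation methods x 3 depths + 1 arbitrary imputer x 3 depths = 9
print(len(candidates))
for params in candidates:
    print(params)
```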

Some important things to notice here:

  • We enclosed both imputers in the Pipeline within extra-datascience-tools’ EstimatorSwitch because we don’t want to use both imputers at the same time. This is because after the first imputer has transformed X there will be no nan values left for the second imputer to transform.
  • We split the parameter grid between a classifier parameter grid and a feature engineering parameter grid. At the bottom of the code, we join these two grids together so that every feature engineering grid is combined with every classifier grid, because we want to try a max_depth of None, 2 and 5 for both the ArbitraryNumberImputer and the MeanMedianImputer.
  • We use a list of dictionaries instead of a single dictionary in the feature parameter grid, to prevent the MeanMedianImputer and the ArbitraryNumberImputer from being applied at the same time. Using the apply parameter of EstimatorSwitch we can simply turn either imputer on or off. Of course, you could also run the code twice, once with the first imputer commented out and once with the second imputer commented out. However, this would lead to errors in our parameter grid, so we would need to adjust that as well, and the results of the different imputation strategies wouldn’t be available in the same grid search cv results, which makes them much more difficult to compare.
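
To build some intuition for how such an on/off wrapper can work, here is a minimal sketch of a switchable transformer. ToySwitch is a hypothetical class written for this illustration; it is not extra-datascience-tools’ implementation, which may differ in its details. Only the parameter names estimator and apply are taken from the parameter grid above. scikit-learn’s SimpleImputer is used here as a stand-in imputer to keep the example dependency-free.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.impute import SimpleImputer


class ToySwitch(TransformerMixin, BaseEstimator):
    """Hypothetical on/off wrapper around a transformer (illustration only)."""

    def __init__(self, estimator, apply=True):
        self.estimator = estimator
        self.apply = apply

    def fit(self, X, y=None):
        # only fit the wrapped transformer when the switch is on
        if self.apply:
            self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        # pass the data through untouched when the switch is off
        return self.estimator_.transform(X) if self.apply else X


X = np.array([[1.0], [np.nan], [3.0]])
on = ToySwitch(SimpleImputer(strategy="mean"), apply=True).fit_transform(X)
off = ToySwitch(SimpleImputer(strategy="mean"), apply=False).fit_transform(X)
print(on.ravel())   # missing value replaced by the mean
print(off.ravel())  # missing value still present

# nested names such as "estimator__strategy" work like in the grid above
switch = ToySwitch(SimpleImputer())
switch.set_params(apply=False, estimator__strategy="median")
```

Because apply and the wrapped estimator are ordinary constructor parameters, a grid search can set them by name, which is what makes keys like meanmedianimputer__apply possible.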

Let us look at the new results:

gridsearch = GridSearchCV(model, param_grid=model_param_grid)
gridsearch.fit(X_train, y_train)

pd.DataFrame(gridsearch.cv_results_).loc[:,
['rank_test_score',
'mean_test_score',
'param_tree__max_depth',
'param_meanmedianimputer__estimator__imputation_method']
].sort_values('rank_test_score')

Grid-search results on feature engineering (image by Author)

We now see a new best model, which is the decision tree with a max_depth of 2, using the ArbitraryNumberImputer . We improved the accuracy by 1.4% by implementing a different imputation strategy! And as a welcome bonus, our tree depth has shrunk to two, which makes the model easier to interpret.

Of course, grid searching can already take quite some time, and searching over other pipeline steps in addition to the classifier makes it take even longer. There are a few methods to keep the extra time to a minimum:

  • First grid search over the classifier’s parameters, and then over other steps such as feature engineering steps, or vice versa, depending on the situation.
  • Use extra-datascience-tools’ filter_tried_params to prevent duplicate parameter settings of a grid-search.
  • Use scikit-learn’s HalvingGridSearchCV or HalvingRandomSearchCV instead of a GridSearchCV (still in the experimental phase).
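
As a rough sketch of the last option: because the halving searches are experimental, scikit-learn requires an explicit enabling import before they can be used. The data below is synthetic stand-in data, not the article’s dataset, and the factor value is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(loc=6.5, scale=1.0, size=(600, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=600) > 6.0).astype(int)

search = HalvingGridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [None, 2, 5]},
    factor=2,        # keep roughly the best half of the candidates per round
    random_state=0,  # makes the subsampling of rows reproducible
)
search.fit(X, y)
print(search.best_params_)
```

The search starts all candidates on a small subset of the rows and gives more rows only to the candidates that survive each round, so weak parameter settings are discarded cheaply.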

Besides using grid searching to optimize a classifier such as a decision tree, we saw you can actually optimize virtually any step in a machine learning pipeline using extra-datascience-tools’ EstimatorSwitch by e.g. grid searching over the imputation strategy. Some more examples of pipeline steps which are worth grid searching over beside the imputation strategy and the classifier itself are:


