Significantly Increase Your Grid-Search Results With These Parameters | by Tomer Gabay | Dec, 2022
Grid search over any machine learning pipeline step using an EstimatorSwitch
A very common step in building a machine learning model is to grid search over a classifier’s parameters on the train set, using cross-validation, to find the optimal parameters. What is less well known is that you can also grid search over virtually any pipeline step, such as feature engineering steps. For example, which imputation strategy works best for numerical values: mean, median or arbitrary? Which categorical encoding method should you use: one-hot encoding, or maybe ordinal?
In this article, I’ll guide you through the steps to be able to answer such questions in your own machine-learning projects using grid searches.
To install all the required Python packages for this article:
pip install extra-datascience-tools feature-engine
The dataset
Let’s consider the following very simple public domain data set I created, which has two columns: last_grade and course_passed. The last_grade column contains the grade the student achieved on their last exam, and the course_passed column is a boolean column with True if the student passed the course and False if the student failed the course. Can we build a model that predicts whether a student passed the course based on their last grade?
Let us first explore the dataset:
import pandas as pd

df = pd.read_csv('last_grades.csv')
df.isna().sum()
OUTPUT
last_grade 125
course_passed 0
dtype: int64
Our target variable course_passed has no nan values, so there is no need to drop rows here.
Of course, to prevent any data leakage we should split our data set into a train and test set first before continuing.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[['last_grade']],
    df['course_passed'],
    random_state=42)
Because most machine learning models don’t allow for nan values, we must consider different imputation strategies. Of course, in general, you would start with EDA (exploratory data analysis) to determine whether nan values are MAR (Missing at Random), MCAR (Missing Completely at Random) or MNAR (Missing Not at Random). A good article that explains the differences between these can be found here:
Instead of analyzing why for some students their last grade is missing, we are simply going to try to grid search over different imputation techniques to illustrate how to grid search over any pipeline step, such as this feature engineering step.
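To make the three imputation strategies concrete, here is a minimal pure-Python sketch. The helper names `fit_fill_value` and `impute` and the grade values are made up for illustration and are not part of feature-engine; the only thing taken from the libraries discussed in this article is the arbitrary default of 999. Note that the fill value is derived from the train values only, which is also what a pipeline does during cross-validation.

```python
import math
import statistics


def fit_fill_value(train_values, strategy, arbitrary_number=999):
    """Derive the replacement value from the observed training values only."""
    observed = [v for v in train_values if not math.isnan(v)]
    if strategy == "mean":
        return statistics.mean(observed)
    if strategy == "median":
        return statistics.median(observed)
    if strategy == "arbitrary":
        return arbitrary_number
    raise ValueError(f"unknown strategy: {strategy}")


def impute(values, fill_value):
    """Replace every nan with the given fill value."""
    return [fill_value if math.isnan(v) else v for v in values]


# Made-up grades for illustration; they are not taken from last_grades.csv.
train_grades = [6.0, 7.0, float("nan"), 8.0, 3.0]
test_grades = [float("nan"), 5.5]

imputed_test = {
    strategy: impute(test_grades, fit_fill_value(train_grades, strategy))
    for strategy in ("mean", "median", "arbitrary")
}
print(imputed_test)
```

Each strategy fills the same missing test grade with a different value, which is exactly why it is worth letting the grid search decide which one works best.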
Let’s explore the distribution of the independent variable last_grade:
import seaborn as sns

sns.histplot(data=X_train, x='last_grade')
It looks like the last grades are normally distributed with a mean of ~6.5 and values between ~3 and ~9.5.
Let’s also look at the distribution of the target variable to determine which scoring metric to use:
y_train.value_counts()
OUTPUT
True 431
False 412
Name: course_passed, dtype: int64
The target variable is roughly equally divided, which means we can use scikit-learn’s default scorer for classification tasks, which is the accuracy score. When the target variable is unequally divided, accuracy can be misleading because always predicting the majority class already scores well; use e.g. the F1 score instead.
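A small worked example shows why accuracy is a poor guide on an unbalanced target. The 95/5 split below is invented for illustration (our own data set is balanced), and the two helper functions are a plain-Python sketch rather than scikit-learn’s scorers.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred, positive=True):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical unbalanced target: 95 students failed, 5 passed.
y_true = [False] * 95 + [True] * 5
# A useless model that always predicts the majority class.
y_pred = [False] * 100

acc_score = accuracy(y_true, y_pred)
f1_score = f1(y_true, y_pred)
print(acc_score, f1_score)
```

The model never identifies a single student who passed, yet its accuracy is 0.95, while its F1 score for the positive class is 0.0.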
Grid searching
Next, we are going to set up the model and the grid search and run it by optimizing only the classifier’s parameters, which is how I see most data scientists use a grid search. For now, we’ll use feature-engine’s MeanMedianImputer to impute the mean and scikit-learn’s DecisionTreeClassifier to predict the target variable.
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# the import name uses an underscore; only the pip package name has a hyphen
from feature_engine.imputation import MeanMedianImputer

model = Pipeline(
    [
        ("meanmedianimputer", MeanMedianImputer(imputation_method="mean")),
        ("tree", DecisionTreeClassifier())
    ]
)
param_grid = [
    {"tree__max_depth": [None, 2, 5]}
]
gridsearch = GridSearchCV(model, param_grid=param_grid)
gridsearch.fit(X_train, y_train)

# show the candidates ordered from best to worst
pd.DataFrame(gridsearch.cv_results_).loc[:,
    ['rank_test_score',
     'mean_test_score',
     'param_tree__max_depth']
].sort_values('rank_test_score')
As we can see from the table above, using GridSearchCV we learned that we can increase the accuracy of the model by ~0.55 just by changing the max_depth of the DecisionTreeClassifier from its default value None to 5. This clearly illustrates the positive impact grid searching can have.
However, we don’t know whether imputing the missing last_grade values with the mean is actually the best imputation strategy. What we can do is grid search over three different imputation strategies using extra-datascience-tools’ EstimatorSwitch:
- Mean imputation
- Median imputation
- Arbitrary number imputation (by default 999 for feature-engine’s ArbitraryNumberImputer).
from feature_engine.imputation import (
    ArbitraryNumberImputer,
    MeanMedianImputer,
)
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from extra_ds_tools.ml.sklearn.meta_estimators import EstimatorSwitch

# create a pipeline with two imputation techniques
model = Pipeline(
    [
        ("meanmedianimputer", EstimatorSwitch(
            MeanMedianImputer()
        )),
        ("arbitraryimputer", EstimatorSwitch(
            ArbitraryNumberImputer()
        )),
        ("tree", DecisionTreeClassifier())
    ]
)
# specify the parameter grid for the classifier
classifier_param_grid = [{"tree__max_depth": [None, 2, 5]}]
# specify the parameter grid for feature engineering
feature_param_grid = [
    {"meanmedianimputer__apply": [True],
     "meanmedianimputer__estimator__imputation_method": ["mean", "median"],
     "arbitraryimputer__apply": [False],
     },
    {"meanmedianimputer__apply": [False],
     "arbitraryimputer__apply": [True],
     },
]
# join the parameter grids together
model_param_grid = [
    {
        **classifier_params,
        **feature_params
    }
    for feature_params in feature_param_grid
    for classifier_params in classifier_param_grid
]
Some important things to notice here:
- We enclosed both imputers in the Pipeline within extra-datascience-tools’ EstimatorSwitch because we don’t want to use both imputers at the same time. This is because after the first imputer has transformed X there will be no nan values left for the second imputer to transform.
- We split the parameter grid into a classifier parameter grid and a feature engineering parameter grid. At the bottom of the code, we join these two grids together so that every feature engineering grid is combined with every classifier grid, because we want to try a max_depth of None, 2 and 5 for both the ArbitraryNumberImputer and the MeanMedianImputer.
- We use a list of dictionaries instead of a single dictionary in the feature parameter grid, so that we prevent the MeanMedianImputer and the ArbitraryNumberImputer from being applied at the same time. Using the apply parameter of EstimatorSwitch we can simply turn one of the two imputers on or off. Of course, you could also run the code twice, once with the first imputer commented out and once with the second imputer commented out. However, this would lead to errors in our parameter grid, so we would need to adjust that as well, and the results of the different imputation strategies wouldn’t be available in the same grid search cv results, which makes them much more difficult to compare.
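To see which parameter settings the joined grid actually produces, the sketch below expands it by hand using only the standard library. The `expand` helper is a name introduced here for illustration; scikit-learn’s ParameterGrid performs a comparable expansion internally. The grids themselves are copied from the code above.

```python
from itertools import product

classifier_param_grid = [{"tree__max_depth": [None, 2, 5]}]
feature_param_grid = [
    {"meanmedianimputer__apply": [True],
     "meanmedianimputer__estimator__imputation_method": ["mean", "median"],
     "arbitraryimputer__apply": [False]},
    {"meanmedianimputer__apply": [False],
     "arbitraryimputer__apply": [True]},
]
model_param_grid = [
    {**classifier_params, **feature_params}
    for feature_params in feature_param_grid
    for classifier_params in classifier_param_grid
]


def expand(grid):
    """Return every combination of values in one parameter dictionary."""
    keys = list(grid)
    return [
        dict(zip(keys, combination))
        for combination in product(*(grid[key] for key in keys))
    ]


# Each dictionary in the list is expanded separately, never across dictionaries.
candidates = [c for grid in model_param_grid for c in expand(grid)]

print(len(candidates))
for candidate in candidates:
    print(candidate)
```

This yields nine candidates: three tree depths for each of mean, median and arbitrary number imputation. Because the two feature dictionaries are never mixed, no candidate has both imputers switched on.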
Let us look at the new results:
gridsearch = GridSearchCV(model, param_grid=model_param_grid)
gridsearch.fit(X_train, y_train)

# show the candidates ordered from best to worst
pd.DataFrame(gridsearch.cv_results_).loc[:,
    ['rank_test_score',
     'mean_test_score',
     'param_tree__max_depth',
     'param_meanmedianimputer__estimator__imputation_method']
].sort_values('rank_test_score')
We now see a new best model, which is the decision tree with a max_depth of 2, using the ArbitraryNumberImputer. We improved the accuracy by 1.4% by implementing a different imputation strategy! And as a welcome bonus, our tree depth has shrunk to two, which makes the model easier to interpret.
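If you are curious how a switch can turn a pipeline step on or off from within a parameter grid, the sketch below shows one way it could work. `ToySwitch` is a hypothetical stand-in written for this illustration and is not the implementation of extra-datascience-tools’ EstimatorSwitch. To keep the example free of extra dependencies, it assumes scikit-learn’s SimpleImputer in place of feature-engine’s imputers and uses synthetic grades instead of last_grades.csv.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


class ToySwitch(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: applies the wrapped transformer only if apply=True."""

    def __init__(self, estimator, apply=True):
        self.estimator = estimator
        self.apply = apply

    def fit(self, X, y=None):
        if self.apply:
            # fit a fresh copy so the estimator passed in stays untouched
            self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        if not self.apply:
            return X  # switched off: pass the data through unchanged
        return self.estimator_.transform(X)


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
grades = rng.normal(6.5, 1.5, size=200)
passed = grades >= 5.5
X = grades.reshape(-1, 1).copy()
X[rng.random(200) < 0.15, 0] = np.nan  # knock out roughly 15% of the grades

model = Pipeline(
    [
        ("stat_imputer", ToySwitch(SimpleImputer())),
        ("constant_imputer", ToySwitch(
            SimpleImputer(strategy="constant", fill_value=999)
        )),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ]
)
param_grid = [
    {"stat_imputer__apply": [True],
     "stat_imputer__estimator__strategy": ["mean", "median"],
     "constant_imputer__apply": [False]},
    {"stat_imputer__apply": [False],
     "constant_imputer__apply": [True]},
]
search = GridSearchCV(model, param_grid=param_grid).fit(X, passed)
print(len(search.cv_results_["params"]))
print(search.best_params_)
```

Because the wrapper inherits from BaseEstimator, the grid search can reach both the `apply` flag and the wrapped estimator’s own parameters through the usual double-underscore names.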
Of course, grid searching can already take quite some time, and searching over other pipeline steps in addition to the classifier makes it take even longer. There are a few ways to keep the extra time to a minimum:
- First grid search over the classifier’s parameters, and then over other steps such as feature engineering steps, or vice versa, depending on the situation.
- Use extra-datascience-tools’ filter_tried_params to prevent duplicate parameter settings in a grid search.
- Use scikit-learn’s HalvingGridSearchCV or HalvingRandomSearchCV instead of a GridSearchCV (still in the experimental phase).
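As a rough sketch of the last option, this is how a halving search can be set up. Because the halving estimators are experimental, scikit-learn requires an explicit enabling import first. The synthetic data and the choice of `factor=3` are assumptions made for this example, not settings used elsewhere in this article.

```python
import numpy as np

# This import enables the experimental halving estimators and must come first.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic grades for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(6.5, 1.5, size=(400, 1))
y = X[:, 0] >= 5.5

search = HalvingGridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [None, 2, 5]},
    factor=3,  # keep roughly the best third of the candidates each round
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(search.n_candidates_)  # number of candidates evaluated per round
print(search.n_resources_)   # number of samples used per round
```

Early rounds evaluate all candidates on a small sample, and only the best performers are re-evaluated on more data, which is where the time saving comes from.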
Besides using grid searching to optimize a classifier such as a decision tree, we saw that you can optimize virtually any step in a machine learning pipeline using extra-datascience-tools’ EstimatorSwitch, e.g. by grid searching over the imputation strategy. Some more examples of pipeline steps that are worth grid searching over besides the imputation strategy and the classifier itself are: