
A step-by-step guide to robust ML classification | by Ryan Burke | Mar, 2023



Photo by Luca Bravo on Unsplash

In previous articles, I focused mainly on presenting individual algorithms that I found interesting. Here, I walk through a complete ML classification project. The goal is to touch on some of the common pitfalls in ML projects and describe how to avoid them. I will also demonstrate how we can go further by analysing our model errors to gain important insights that would otherwise go unseen.

If you would like to see the whole notebook, please check it out → here

Below, you will find a list of the libraries I used for today’s analyses. They consist of the standard data science toolkit along with the necessary sklearn libraries.

import sys
import os
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
%matplotlib inline

import plotly.offline as py
import plotly.graph_objs as go
import plotly.tools as tls
py.init_notebook_mode(connected=True)

import warnings
warnings.filterwarnings('ignore')

from pandas import set_option
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler, MinMaxScaler, QuantileTransformer, RobustScaler
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.feature_selection import RFECV, SelectFromModel, SelectKBest, f_classif
from sklearn.metrics import classification_report, confusion_matrix, balanced_accuracy_score, ConfusionMatrixDisplay, f1_score
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from scipy.stats import uniform

from imblearn.over_sampling import ADASYN

import swifter

# Always good to set a seed for reproducibility
SEED = 8
np.random.seed(SEED)

Today's dataset is the forest cover type data, which comes ready to use with sklearn. Here's a description from sklearn's site.

Data Set Characteristics:

The samples in this dataset correspond to 30×30m patches of forest in the US, collected for the task of predicting each patch’s cover type, i.e. the dominant species of tree. There are seven cover types, making this a multi-class classification problem. Each sample has 54 features, described on the dataset’s homepage. Some of the features are boolean indicators, while others are discrete or continuous measurements.

Number of Instances: 581 012

Feature information (Name / Data Type / Measurement / Description)

  • Elevation / quantitative / meters / Elevation in meters
  • Aspect / quantitative / azimuth / Aspect in degrees azimuth
  • Slope / quantitative / degrees / Slope in degrees
  • Horizontal_Distance_To_Hydrology / quantitative / meters / Horz Dist to nearest surface water features
  • Vertical_Distance_To_Hydrology / quantitative / meters / Vert Dist to nearest surface water features
  • Horizontal_Distance_To_Roadways / quantitative / meters / Horz Dist to nearest roadway
  • Hillshade_9am / quantitative / 0 to 255 index / Hillshade index at 9am, summer solstice
  • Hillshade_Noon / quantitative / 0 to 255 index / Hillshade index at noon, summer solstice
  • Hillshade_3pm / quantitative / 0 to 255 index / Hillshade index at 3pm, summer solstice
  • Horizontal_Distance_To_Fire_Points / quantitative / meters / Horz Dist to nearest wildfire ignition points
  • Wilderness_Area (4 binary columns) / qualitative / 0 (absence) or 1 (presence) / Wilderness area designation
  • Soil_Type (40 binary columns) / qualitative / 0 (absence) or 1 (presence) / Soil Type designation

Number of classes:

  • Cover_Type (7 types) / integer / 1 to 7 / Forest Cover Type designation

Here’s a simple function to load this data into your notebook as a dataframe.

columns = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area_0', 'Wilderness_Area_1', 'Wilderness_Area_2',
'Wilderness_Area_3', 'Soil_Type_0', 'Soil_Type_1', 'Soil_Type_2', 'Soil_Type_3', 'Soil_Type_4', 'Soil_Type_5', 'Soil_Type_6', 'Soil_Type_7', 'Soil_Type_8',
'Soil_Type_9', 'Soil_Type_10', 'Soil_Type_11', 'Soil_Type_12', 'Soil_Type_13', 'Soil_Type_14', 'Soil_Type_15', 'Soil_Type_16', 'Soil_Type_17', 'Soil_Type_18',
'Soil_Type_19', 'Soil_Type_20', 'Soil_Type_21', 'Soil_Type_22', 'Soil_Type_23', 'Soil_Type_24', 'Soil_Type_25', 'Soil_Type_26', 'Soil_Type_27', 'Soil_Type_28',
'Soil_Type_29', 'Soil_Type_30', 'Soil_Type_31', 'Soil_Type_32', 'Soil_Type_33', 'Soil_Type_34', 'Soil_Type_35', 'Soil_Type_36', 'Soil_Type_37', 'Soil_Type_38',
'Soil_Type_39']

from sklearn import datasets

def sklearn_to_df(sklearn_dataset):
    df = pd.DataFrame(sklearn_dataset.data, columns=columns)
    df['target'] = pd.Series(sklearn_dataset.target)
    return df

df = sklearn_to_df(datasets.fetch_covtype())
df_name=df.columns
df.head(3)

Using df.info() and df.describe() to get to know our data better, we see that there are no missing data and that all of the columns are numeric. The dataset is also rather large (> 580,000 rows). I originally tried to run this on the entire dataset, but it took FOREVER, so I recommend using a fraction of the data.

Regarding the target variable, which is the forest cover class, using df.target.value_counts(), we see the following distribution (in descending order):

Class 2 = 283,301
Class 1 = 211,840
Class 3 = 35,754
Class 7 = 20,510
Class 6 = 17,367
Class 5 = 9,493
Class 4 = 2,747

It is important to note that our classes are imbalanced and we will need to keep this in mind when selecting a metric to evaluate our models.

One of the most common mistakes in ML projects is preprocessing the data before splitting it. Why is this a problem?

Let's say we plan on scaling our data using the whole dataset. The equations below are taken from the sklearn documentation.

Ex1 StandardScaler()

z = (x - u) / s

Ex2 MinMaxScaler()

X_std = (X - X.min()) / (X.max() - X.min())
X_scaled = X_std * (max - min) + min

The key point is that these formulas use dataset-wide statistics: the mean, standard deviation, minimum, and maximum. If we apply them before splitting, the scaled features in our training set are computed using information from the test set. This is an example of data leakage.

Data leakage is when information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not, and in turn invalidate the estimated performance of the model being constructed.
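
To make the distinction concrete, here is a minimal sketch of the two workflows; X_features stands in for any feature table, and the correct pattern is exactly what we apply to the real data later in the article.

# LEAKY: mean and std are computed over ALL rows, including the future test rows
X_leaky = StandardScaler().fit_transform(X_features)

# CORRECT: split first, then fit the scaler on the training rows only
X_tr, X_te = train_test_split(X_features, test_size=0.2, random_state=SEED)
scaler = StandardScaler().fit(X_tr)       # statistics come from the training split only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)      # the test set is transformed, never fitted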

Therefore, the first step after getting to know our dataset is to split it and keep your test set unseen until the very end. In the code below, we split the data into 80% (training set) and 20% (test set). You will also note that I have only kept 50,000 total samples to reduce the time it takes to train & evaluate our models. Trust me, you will thank me later!

It is also worth noting that we are stratifying on the target variable. This is good practice for imbalanced datasets as it maintains the distribution of classes in the train and test set. If we don’t do this, there’s a chance that some of the underrepresented classes aren’t even present in our train or test sets.

# first, separate the df into features (X) and target (y)
X = df[df_name[0:54]]
Y = df[df_name[54]]

# now split into training (80%) and test (20%) sets. The test set won't be seen until we evaluate our top model!
X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    train_size=40_000,
                                                    test_size=10_000,
                                                    random_state=SEED,
                                                    stratify=df['target'])  # stratify to keep similar class distributions in train/test
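
To verify that the stratification worked, a quick check (not in the original notebook) is to compare the normalised class counts in the two splits; the proportions should be nearly identical:

# class proportions should be nearly identical in both splits
print(y_train.value_counts(normalize=True).sort_index())
print(y_test.value_counts(normalize=True).sort_index())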

With our train and test sets ready, we can now work on the fun stuff. The first step in this project is to generate some features that could add useful information to train our models.

This step can be a little tricky. In the real world, it requires domain-specific knowledge of the subject you are working on. To be completely transparent with you, despite being a lover of nature and everything outdoors, I am no expert in why certain trees grow in specific areas.

For this reason, I have consulted [1] [2] [3], whose authors understand this domain better than I do. I have amalgamated the knowledge from these references to create the features you will find below.

# engineer new columns from our df
def FeatureEngineering(X):

    X['Aspect'] = X['Aspect'] % 360
    X['Aspect_120'] = (X['Aspect'] + 120) % 360

    X['Hydro_Elevation_sum'] = X['Elevation'] + X['Vertical_Distance_To_Hydrology']
    X['Hydro_Elevation_diff'] = abs(X['Elevation'] - X['Vertical_Distance_To_Hydrology'])

    X['Hydro_Euclidean'] = np.sqrt(X['Horizontal_Distance_To_Hydrology']**2 +
                                   X['Vertical_Distance_To_Hydrology']**2)
    X['Hydro_Manhattan'] = abs(X['Horizontal_Distance_To_Hydrology']) + abs(X['Vertical_Distance_To_Hydrology'])

    X['Hydro_Distance_sum'] = X['Horizontal_Distance_To_Hydrology'] + X['Vertical_Distance_To_Hydrology']
    X['Hydro_Distance_diff'] = abs(X['Horizontal_Distance_To_Hydrology'] - X['Vertical_Distance_To_Hydrology'])

    X['Hydro_Fire_sum'] = X['Horizontal_Distance_To_Hydrology'] + X['Horizontal_Distance_To_Fire_Points']
    X['Hydro_Fire_diff'] = abs(X['Horizontal_Distance_To_Hydrology'] - X['Horizontal_Distance_To_Fire_Points'])
    X['Hydro_Fire_mean'] = (X['Horizontal_Distance_To_Hydrology'] + X['Horizontal_Distance_To_Fire_Points'])/2

    X['Hydro_Road_sum'] = X['Horizontal_Distance_To_Hydrology'] + X['Horizontal_Distance_To_Roadways']
    X['Hydro_Road_diff'] = abs(X['Horizontal_Distance_To_Hydrology'] - X['Horizontal_Distance_To_Roadways'])
    X['Hydro_Road_mean'] = (X['Horizontal_Distance_To_Hydrology'] + X['Horizontal_Distance_To_Roadways'])/2

    X['Road_Fire_sum'] = X['Horizontal_Distance_To_Roadways'] + X['Horizontal_Distance_To_Fire_Points']
    X['Road_Fire_diff'] = abs(X['Horizontal_Distance_To_Roadways'] - X['Horizontal_Distance_To_Fire_Points'])
    X['Road_Fire_mean'] = (X['Horizontal_Distance_To_Roadways'] + X['Horizontal_Distance_To_Fire_Points'])/2

    X['Hydro_Road_Fire_mean'] = (X['Horizontal_Distance_To_Hydrology'] + X['Horizontal_Distance_To_Roadways'] +
                                 X['Horizontal_Distance_To_Fire_Points'])/3

    return X

X_train = X_train.swifter.apply(FeatureEngineering, axis=1)
X_test = X_test.swifter.apply(FeatureEngineering, axis=1)

On a side note, when you are working with large datasets, pandas apply can be somewhat slow. Using swifter, as you can see in the last two lines above, can significantly reduce the time it takes to apply a function to your dataframe. The article → here compares several methods for speeding this process up.
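
If you want to measure the difference on your own machine, a rough timing comparison (assuming FeatureEngineering is defined as above) might look like this:

import time

start = time.perf_counter()
_ = X_train.apply(FeatureEngineering, axis=1)           # plain pandas apply
print('pandas apply:  %.1f s' % (time.perf_counter() - start))

start = time.perf_counter()
_ = X_train.swifter.apply(FeatureEngineering, axis=1)   # swifter-accelerated apply
print('swifter apply: %.1f s' % (time.perf_counter() - start))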

At this point we have more than 70 features. If the goal is to end up with the best-performing model, you could try using all of them as inputs. With that said, in business there is often a trade-off between performance and complexity that needs to be considered.

As an example, suppose our model achieves 94% accuracy using all of these features, and 89% accuracy with only 4 features. What cost are we willing to pay for a more interpretable model? Always weigh performance against complexity.

Keeping that in mind, I will perform feature selection to reduce the complexity right away. Sklearn provides many options worth considering. In this example, I will use SelectKBest, which keeps the k features with the highest scores on a univariate statistical test (here the ANOVA F-value, f_classif). Below, I have requested (and listed) the 15 best-scoring features. These are the features I will use to train the models in the following section.

selector = SelectKBest(f_classif, k=15)
selector.fit(X_train, y_train)
mask = selector.get_support()
X_train_reduced_cols = X_train.columns[mask]

X_train_reduced_cols

>>> Index(['Elevation', 'Wilderness_Area_3', 'Soil_Type_2', 'Soil_Type_3',
'Soil_Type_9', 'Soil_Type_37', 'Soil_Type_38', 'Hydro_Elevation_sum',
'Hydro_Elevation_diff', 'Hydro_Road_sum', 'Hydro_Road_diff',
'Hydro_Road_mean', 'Road_Fire_sum', 'Road_Fire_mean',
'Hydro_Road_Fire_mean'],
dtype='object')
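
One detail to note: the code further on refers to X_train_reduced and X_test_reduced. These are presumably just the train and test sets restricted to the selected columns, along these lines:

# keep only the 15 selected features (reconstructed step; the notebook's exact code is not shown here)
X_train_reduced = X_train[X_train_reduced_cols]
X_test_reduced = X_test[X_train_reduced_cols]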

In this section I will compare three different classifiers: k-nearest neighbours (KNN), random forest (RF), and extra trees (ET).

I have provided links for those who wish to investigate each model further. They will also be helpful in the section on hyperparameter tuning, where you can find all modifiable parameters when trying to improve your models. Below you will find two functions to define and evaluate the baseline models.

# baseline models
def GetBaseModels():
    baseModels = []
    baseModels.append(('KNN', KNeighborsClassifier()))
    baseModels.append(('RF', RandomForestClassifier()))
    baseModels.append(('ET', ExtraTreesClassifier()))

    return baseModels

def ModelEvaluation(X_train, y_train, models):
    # define the number of folds and the evaluation metric
    num_folds = 10
    scoring = "f1_weighted"  # suitable for imbalanced classes

    results = []
    names = []
    for name, model in models:
        kfold = StratifiedKFold(n_splits=num_folds, random_state=SEED, shuffle=True)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring, n_jobs=-1)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)

    return names, results
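
The call that produces the baseline results is not shown in the article; it presumably looks something like this, evaluating the three models on the 15 selected features:

# 10-fold stratified CV of the three baseline models on the selected features
models = GetBaseModels()
names, results = ModelEvaluation(X_train_reduced, y_train, models)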

There are some key elements in the second function that are worth discussing further. The first is StratifiedKFold. Recall that we split the original dataset into 80% training and 20% test; the test set is reserved for the final evaluation of our top-performing model.

Using cross-validation provides a better evaluation of our models. Specifically, I have set up a 10-fold cross-validation. For those not familiar, the model is trained on k - 1 folds and validated on the remaining fold at each step. At the end you have the average and the variation across the k models, which gives better insight than a single train-test evaluation. Stratified k-fold, as I alluded to earlier, ensures that each fold has approximately equal representation of the target classes.

The second point worth discussing is the scoring metric. There are many metrics available to evaluate the performance of your models, and often there are several that could suit your project. It’s important to keep in mind what you are trying to demonstrate with the results. If you work in a business setting, often the metric that is most easily explained to those without a data background is preferred.

On the other hand, there are metrics that are unsuitable for your analyses. For this project, we have imbalanced classes. If you go to the link provided above, you will find options for this case. I opted to use the weighted F1 score. Let's briefly discuss why I chose this metric.

A very common classification metric is accuracy, the percentage of correct classifications. While this may seem like an excellent option, suppose we have a binary classification problem where the target classes are uneven (e.g. group 1 = 90 samples, group 2 = 10). It is possible to reach 90% accuracy, which sounds great, yet on closer inspection we have correctly classified all of group 1 and none of group 2. In this case our model is not terribly informative.

Had we used the weighted F1 score, the result would have been 42.6%. If you're interested in reading more on the F1 score → here is an article explaining how it is calculated.

After training the baseline models, I have plotted the results from each below. The baseline models all performed relatively well. Remember, at this point I have done nothing to the data (e.g. no transformation or outlier removal). The extra trees classifier had the highest weighted F1 score at 86.9%.

Results from the 10-fold CV. KNN had the lowest at 78.8%, the RF was second with 85.9%, and the ET had the highest weighted F1 score at 86.9%. Image provided by author
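
The plotting code for this figure is not included in the article; a minimal sketch of a comparison boxplot built from the names and results returned by ModelEvaluation could be:

# compare the spread of the 10-fold CV scores per model
fig, ax = plt.subplots(figsize=(8, 5))
ax.boxplot(results)
ax.set_xticklabels(names)
ax.set_ylabel('Weighted F1 score')
ax.set_title('Baseline model comparison (10-fold CV)')
plt.show()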

The next step in this project looks at the effects of data transformation on model performance. Whereas many decision-tree-based algorithms are not sensitive to the magnitude of the data, it is reasonable to expect models that measure distances between samples, such as KNN, to perform differently when the data are scaled [4] [5]. In this section, we will scale our data using StandardScaler and MinMaxScaler as described above. Below you will find a function that builds a pipeline to apply the scaler and then train the model on the scaled data.

def GetScaledModel(nameOfScaler):

    if nameOfScaler == 'standard':
        scaler = StandardScaler()
    elif nameOfScaler == 'minmax':
        scaler = MinMaxScaler()

    pipelines = []
    pipelines.append((nameOfScaler+'KNN', Pipeline([('Scaler', scaler), ('KNN', KNeighborsClassifier())])))
    pipelines.append((nameOfScaler+'RF', Pipeline([('Scaler', scaler), ('RF', RandomForestClassifier())])))
    pipelines.append((nameOfScaler+'ET', Pipeline([('Scaler', scaler), ('ET', ExtraTreesClassifier())])))

    return pipelines
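
These pipelines plug straight into the evaluation function from the previous section; the scaled results below presumably come from a call along these lines:

# evaluate the StandardScaler pipelines with the same 10-fold CV routine
scaled_names, scaled_results = ModelEvaluation(X_train_reduced, y_train, GetScaledModel('standard'))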

The results using the StandardScaler are presented below. We see that our hypothesis regarding scaling the data appears to hold. The random forest and extra trees classifiers performed nearly identically to their baselines, whereas the KNN improved by roughly five percentage points (from 78.8% to 83.8%). Despite this increase, the two tree-based classifiers still outperform the scaled KNN.

Results from the 10-fold CV using the StandardScaler to transform our data. KNN still had the lowest although the performance increased to 83.8%. The RF was second with 85.8%, and the ET once again had the highest weighted F1 score at 86.8%. Image provided by author

Similar results can be seen when the MinMaxScaler is used. The results from all models are almost identical to those presented using the StandardScaler.

Results from the 10-fold CV using the MinMaxScaler to transform our data. Each performed almost identically to those using StandardScaler. KNN still had the lowest at 83.9%. The RF was second with 86.0%, and the ET once again had the highest weighted F1 score at 87.0%. Image provided by author

It is worth noting at this point that I also checked the effect of removing outliers. For this, I removed values that were beyond +/- 3 SD for each feature. I am not presenting the results here because there were no values outside this range. If you are interested in seeing how this was performed, please feel free to check out the notebook found at the link provided at the beginning of this article.

The next step is to try and improve our models by tuning the hyperparameters. We will do so on the scaled data because it had the best average performance when considering our three models. Sklearn discusses this in more detail → here.

I chose to use GridSearchCV (CV for cross-validated). Below you will find a small class that wraps a 10-fold cross-validated grid search for the models we have been using. The one additional detail is that we need to provide the grid of hyperparameter values we want evaluated.

Up to this point, we have not even looked at our test set. Before commencing the grid search, we will scale our train and test data using the StandardScaler. We are doing this here because we are going to find the best hyperparameters for each model and use those as inputs into a VotingClassifier (as we will discuss in the next section).

To properly scale our full dataset we have to follow the procedure below. You will see that the scaler is only fit on the training data. Both the training and test set are transformed based on the scaling parameters found with the training set, thus eliminating any chance of data leakage.

scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train_reduced), columns=X_train_reduced.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test_reduced), columns=X_test_reduced.columns)

class GridSearch(object):

    def __init__(self, X_train, y_train, model, hyperparameters):
        self.X_train = X_train
        self.y_train = y_train
        self.model = model
        self.hyperparameters = hyperparameters

    def GridSearch(self):
        cv = 10
        clf = GridSearchCV(self.model,
                           self.hyperparameters,
                           cv=cv,
                           verbose=0,
                           n_jobs=-1,
                           scoring='f1_weighted')  # keep the same metric used throughout the analysis
        # fit the grid search
        best_model = clf.fit(self.X_train, self.y_train)
        print("Best: %f using %s" % (best_model.best_score_, best_model.best_params_))

        return best_model, best_model.best_params_

    def BestModelPredict(self, X_train):
        best_model, _ = self.GridSearch()
        pred = best_model.predict(X_train)
        return pred

Next, I have provided the grid search parameters that were tested for each of the models.

# 1) KNN
model_KNN = KNeighborsClassifier()
neighbors = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]  # number of neighbours to use for kneighbors queries
param_grid_KNN = dict(n_neighbors=neighbors)

# 2) RF
model_RF = RandomForestClassifier()
n_estimators_value = [50, 100, 150, 200, 250, 300]  # the number of trees
criterion = ['gini', 'entropy', 'log_loss']  # the function to measure the quality of a split
param_grid_RF = dict(n_estimators=n_estimators_value, criterion=criterion)

# 3) ET
model_ET = ExtraTreesClassifier()
param_grid_ET = dict(n_estimators=n_estimators_value, criterion=criterion)  # same grid as the RF
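
The article does not show the search calls themselves; with the class and grids above, running the three searches would look roughly like this:

# 10-fold cross-validated grid search for each model on the scaled training data
knn_best, knn_params = GridSearch(X_train_scaled, y_train, model_KNN, param_grid_KNN).GridSearch()
rf_best, rf_params = GridSearch(X_train_scaled, y_train, model_RF, param_grid_RF).GridSearch()
et_best, et_params = GridSearch(X_train_scaled, y_train, model_ET, param_grid_ET).GridSearch()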

We have determined the best combination of parameters to optimise our models. These parameters will be used as inputs to a VotingClassifier, an ensemble estimator that trains several models and then aggregates their predictions for a more robust result. I found this → article, which provides a detailed overview of the voting classifier and the different ways to use it.

The best parameters for each model are listed below. The output from the voting classifier shows that we achieved a weighted F1 score of 87.5% on the training set and 88.4% on the test set. Not bad!

param = {'n_neighbors': 1}
model1 = KNeighborsClassifier(**param)

param = {'criterion': 'entropy', 'n_estimators': 300}
model2 = RandomForestClassifier(**param)

param = {'criterion': 'gini', 'n_estimators': 300}
model3 = ExtraTreesClassifier(**param)

# create the models based on the parameters above
estimators = [('KNN', model1), ('RF', model2), ('ET', model3)]

# create the ensemble model
kfold = StratifiedKFold(n_splits=10, random_state=SEED, shuffle=True)
ensemble = VotingClassifier(estimators)
results = cross_val_score(ensemble, X_train_scaled, y_train, cv=kfold, scoring='f1_weighted')  # score with the same metric as before
print('F1 weighted score on train: ', results.mean())
ensemble_model = ensemble.fit(X_train_scaled, y_train)
pred = ensemble_model.predict(X_test_scaled)
print('F1 weighted score on test: ', f1_score(y_test, pred, average='weighted'))

>>> F1 weighted score on train: 0.8747
>>> F1 weighted score on test: 0.8836

The performance of our model is pretty good. With that said, it can be very insightful to investigate where the model failed. Below, you will find the code to generate a confusion matrix. Let’s see if we can learn something.

# plot_confusion_matrix has been removed from recent versions of sklearn;
# ConfusionMatrixDisplay (imported above) is the replacement
cfm_raw = ConfusionMatrixDisplay.from_estimator(ensemble_model, X_test_scaled, y_test, values_format='')
# add normalize='true' for a recall-normalised matrix or normalize='pred' for a precision-normalised matrix
plt.savefig("cfm_raw.png")

Confusion matrix on the test set. Image by author

Right away, it becomes quite evident that the underrepresented classes are not learned very well. This is an important point: even with a metric that is appropriate for imbalanced classes, you cannot make a model learn patterns that are barely present in the data.

To analyse our errors, we could create visualisations; however, with 15 features and 7 classes this can start to feel like one of those trippy stereogram images that you stare at until an image forms. An alternative approach is the following.

In this section I am going to compare the predicted values to the ground truth in our test set and create a new variable, ‘error’. Below, I am setting up a dataset to be used in a binary classification analysis, where the target is error vs. no error using the same features as above.

Since we already know that the underrepresented classes were not well learned, the goal here is to see which features were most associated with errors independent of class.

# rebuild a test dataframe from the scaled test features and the ground-truth labels
# (assumed step; the notebook's exact construction is not shown here)
test_df = X_test_scaled.copy()
test_df['target'] = y_test.values

# add the predicted values to test_df to compare with the ground truth
test_df['predicted'] = pred

# create classes: 0 = no error, 1 = error
test_df['error'] = (test_df['target'] != test_df['predicted']).astype(int)

# create our error classification set
X_error = test_df[['Elevation', 'Wilderness_Area_3', 'Soil_Type_2', 'Soil_Type_3', 'Soil_Type_9', 'Soil_Type_37', 'Soil_Type_38',
                   'Hydro_Elevation_sum', 'Hydro_Elevation_diff', 'Hydro_Road_sum', 'Hydro_Road_diff', 'Hydro_Road_mean', 'Road_Fire_sum',
                   'Road_Fire_mean', 'Hydro_Road_Fire_mean']]

X_error_names = X_error.columns
y_error = test_df['error']

With our new dataset, the next step is to build a classification model. This time we are going to add a step using SHAP. This will allow us to understand how each feature impacts the model, which in our case is error.

Below, we have used a random forest to fit the data. Once again we use k-fold cross-validation to get a better estimate of the contribution of each feature. At the bottom, I have generated a dataframe with the average, standard deviation, and maximum of the absolute SHAP values.

import shap

kfold = StratifiedKFold(n_splits=10, random_state=SEED, shuffle=True)

list_shap_values = list()
list_test_sets = list()
for train_index, test_index in kfold.split(X_error, y_error):
    X_error_train, X_error_test = X_error.iloc[train_index], X_error.iloc[test_index]
    y_error_train, y_error_test = y_error.iloc[train_index], y_error.iloc[test_index]
    X_error_train = pd.DataFrame(X_error_train, columns=X_error_names)
    X_error_test = pd.DataFrame(X_error_test, columns=X_error_names)

    # train the model
    clf = RandomForestClassifier(criterion='entropy', n_estimators=300, random_state=SEED)
    clf.fit(X_error_train, y_error_train)

    # explain the model
    explainer = shap.TreeExplainer(clf)
    shap_values = explainer.shap_values(X_error_test)
    # for each iteration, save the test-set index and the shap_values
    list_shap_values.append(shap_values)
    list_test_sets.append(test_index)

# flatten the list of lists and keep the SHAP values for one class
# (for binary classification the two classes' values mirror each other, so one class is enough)
shap_values_av = np.vstack([sv[1] for sv in list_shap_values])
sv = np.abs(shap_values_av).mean(0)
sv_std = np.abs(shap_values_av).std(0)
sv_max = np.abs(shap_values_av).max(0)
importance_df = pd.DataFrame({
    "column_name": X_error_names,
    "shap_values_av": sv,
    "shap_values_std": sv_std,
    "shap_values_max": sv_max
})
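
The code for the summary plot itself is not shown in the article; using the per-fold test indices saved above, it could be generated roughly as follows:

# stack the per-fold test rows in the same order as the stacked SHAP values, then plot
X_error_concat = X_error.iloc[np.concatenate(list_test_sets)]
shap.summary_plot(shap_values_av, X_error_concat, feature_names=list(X_error_names))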

For a better visual experience, below is a SHAP summary plot. On the left-hand side we have the feature names. The plot shows the impact of each feature on the model for different values of that feature. Whereas the dispersion (how far to the right or left) describes the overall impact of a feature on the model, the colouring provides a little extra information.

SHAP summary plot for the error classification model. Image by author

The first thing we notice is that the features with the greatest impact on the model are the distance-based features (i.e. distances to water, roads, or fire ignition points) rather than the wilderness area or soil type.

Next, when we look at the colour distribution, we see a clearer differentiation of high vs low values for the first feature, Hydro_Road_Fire_mean, than for the rest. The same might be said for Road_Fire_mean, albeit to a lesser degree.

To interpret what this means, we can formulate a statement like the following: when the average distance to water, fire ignition points, and roads is low, the model is more likely to make an error.

Once again, I must insist that my forestry ‘expertise’ is limited to a couple of weeks. I did do some research to help me interpret what this could mean and came across a couple of articles [6] [7] that suggest the distance to the road is a significant factor in the risk of forest fires.

This leads me to hypothesise that forest fire may be a significant factor influencing the errors made on our dataset. It seems logical to me that areas impacted by fire would have a very different representation of forest diversity to those unaffected by fire. I’m sure someone with more experience could let me know if that makes sense 🙂

Today, we went through a step-by-step ML multi-class classification problem. We touched on some important considerations when conducting these analyses, namely the importance of splitting the dataset before we start to manipulate it. This is one of the most common pitfalls in ML projects and can seriously limit our ability to generalise our findings.

We also touched on the importance of selecting an appropriate metric to evaluate our models. Here, we used the weighted F1 score, which was appropriate for imbalanced classes. Despite this, we still saw that the underrepresented classes were not well learned.

In my notebook, I also included a section on oversampling to create balanced classes using ADASYN, which is a variation of SMOTE. To save you the suspense, upsampling significantly improved the results on the training set (but not the test set).

This leads us to the error analysis, which is an important part of any ML project. A binary error classification was performed and may suggest that forest fires were implicated in many of the model errors. This could also explain, to a certain extent, why upsampling didn’t improve our final model.

Finally, I want to thank you all for taking the time to read this article! I hope some of you found it to be helpful 🙂

