SHAP for Categorical Features with CatBoost | by Conor O’Sullivan | Aug, 2022

Avoid post-processing the SHAP values of categorical features

Photo by Andrew Ridley on Unsplash

Typically, to model a categorical feature it first needs to be transformed using one-hot encodings. We end up with a binary variable for each category. The problem with this is that each variable will have its own SHAP value. This makes it difficult to see the overall contribution of the original categorical feature. In a previous article, we explored a solution to this. It involved digging into the SHAP values object and manually adding the individual SHAP values. As an alternative, we can use CatBoost.

CatBoost is a gradient boosting library. A major advantage over other libraries is that it can handle non-numerical features. That is categorical features can be used without transforming them. A resulting benefit is that the SHAP values of a CatBoost model are easy to interpret. Unlike other models, there will be only one SHAP value for each categorical feature.

We will explore how to calculate and interpret SHAP values of CatBoost models. We will also apply some of the aggregations provided by the SHAP package. We will see that the aggregations are limited, specifically when it comes to understanding the nature of the relationship of a categorical feature. So, to address this limitation, we will explore how we can create a beeswarm plot for an individual feature. Along the way, we will walk through the Python code used to get these results.

For this analysis, we will be using the same dataset as before. That is a mushroom classification dataset. You can see a snapshot of this dataset in Figure 1. The target variable is the mushroom’s class. That is if the mushroom is poisonous (p) or edible (e). You can find this dataset in UCI’s MLR.

Figure 1: Mushroom dataset snapshot (source: author) (dataset source: UCI) (licence: CC BY 4.0)

For model features, we have 22 categorical features. For each feature, the categories are represented by a letter. For example, odor has 9 unique categories: almond (a), anise (l), creosote (c), fishy (y), foul (f), musty (m), none (n), pungent (p) and spicy (s). This is what the mushroom smells like.

We’ll walk you through the code used to analyse this dataset and you can find the full script on GitHub. To start, we will be using the Python packages below. We have some common packages for handling and visualising data (lines 2–4). We use CatBoostClassifier for modelling (line 6). Finally, we use shap to understand how our model works (line 8). Make sure you have all these packages installed.
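The embedded code gists from the original post are not reproduced on this page, so the line references in the text point to those gists. A minimal set of imports along the lines described would be:

```python
# Data handling and visualisation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Modelling
from catboost import CatBoostClassifier

# Model explanations
import shap
```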

We import our dataset (line 2). We need a numerical target variable so we transform it by setting poisonous = 1 and edible = 0 (line 6). We also get the categorical features (line 7). At this point in the previous article, we needed to transform these features. With CatBoost we can use them as is.
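A sketch of the data preparation, assuming the UCI mushroom data has been saved locally as mushrooms.csv with the label in a column named class (both names are assumptions):

```python
# Load the mushroom dataset (file name is an assumption)
data = pd.read_csv("mushrooms.csv")

# Numerical target: poisonous (p) = 1, edible (e) = 0
y = data["class"].map({"p": 1, "e": 0})

# The remaining 22 columns are the categorical features
X = data.drop("class", axis=1)
```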

You can see this when we train our model below (line 7). We pass the non-numerical features (X), the target variable (y) and a list indicating which features are categorical (cat_features). All of our features are categorical, so cat_features is a list of numbers from 0 to 21. In the end, the classifier is made up of 20 trees, each with a maximum depth of 3. It had an accuracy of 98.7% on the training set.
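A training sketch consistent with the settings reported above (20 trees of depth 3); the exact constructor arguments are assumptions, not the author's original settings:

```python
# Mark every column as categorical (indices 0 to 21)
cat_features = list(range(X.shape[1]))

# 20 trees, each with a maximum depth of 3
model = CatBoostClassifier(iterations=20, depth=3, verbose=False)
model.fit(X, y, cat_features=cat_features)

# Accuracy on the training set
print(model.score(X, y))
```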

We can now move on to understanding how our model is making these predictions. If you are unfamiliar with SHAP or the Python package, I suggest reading the article below. We go in-depth on how to interpret SHAP values. We also explore some of the aggregations used in this article.

Waterfall plot

We start by calculating the SHAP values (lines 2–3). We then visualise the SHAP values of the first prediction using a waterfall plot (line 6). You can see this plot in Figure 2. This tells us how each of the categorical feature values has contributed to the prediction. For example, we can see that this mushroom has an almond (a) odor. This has decreased the log odds by 0.85. In other words, it has decreased the likelihood that the mushroom is poisonous.
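A sketch of the SHAP calculation: shap's TreeExplainer supports CatBoost models, although the exact behaviour can depend on the shap and catboost versions installed.

```python
# Explain the model; TreeExplainer is selected automatically for CatBoost
explainer = shap.Explainer(model)
shap_values = explainer(X)

# Waterfall plot for the first prediction
shap.plots.waterfall(shap_values[0])
```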

Figure 2: SHAP waterfall for CatBoost (source: author)

From the chart above, it is easy to see the contributions of each feature. In comparison, we have the waterfall plot in Figure 3. As mentioned, this was created in the previous article. To model the categorical features, we first transformed them using one-hot encodings. This means that each of the binary features has its own SHAP value. For example, odor will have 9 SHAP values. One for every unique category. As a result, it is difficult to understand the overall contribution of odor to the prediction.

Figure 3: SHAP waterfall for one-hot encodings (source: author)

We are able to take the SHAP values in Figure 3 to create a plot similar to Figure 2. That is so we have only one SHAP value for each categorical feature. To do this we need to “post-process” the SHAP values by adding all the values for one categorical feature together. Unfortunately, there is no straightforward way to do this. We need to manually update the SHAP values object ourselves. We have seen that by using CatBoost we can avoid this process.

Absolute mean SHAP

The SHAP aggregations also work for CatBoost. For example, we use the mean SHAP plot in the code below. Looking at Figure 4, we can use this plot to highlight important categorical features. For example, we can see that odor tends to have large positive or negative SHAP values.
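The mean |SHAP| chart can be produced with shap's built-in bar plot; a minimal sketch:

```python
# Mean absolute SHAP value per feature (Figure 4)
shap.plots.bar(shap_values)
```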

Figure 4: mean SHAP plot (source: author)

Beeswarm

Another common aggregation is the beeswarm plot. For continuous variables, this plot is useful as it can help explain the nature of the relationships. That is we can see how SHAP values are associated with the feature values. However, for the categorical features, the feature values are not numerical. As a result, in Figure 5, you can see the SHAP values are all given the same colour. We need to create our own plots to understand the nature of these relationships.
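The standard beeswarm comes straight from the shap package; a minimal sketch:

```python
# Beeswarm over all features (Figure 5); the points cannot be coloured by
# feature value because the feature values are categorical
shap.plots.beeswarm(shap_values)
```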

Figure 5: beeswarm plot (source: author)

Beeswarm for one feature

One way is to use a beeswarm plot for an individual feature. You can see what we mean in Figure 6. Here we have grouped the SHAP values for the odor feature based on the odor category. For example, you can see that a foul smell leads to higher SHAP values. These mushrooms are more likely to be poisonous. In the previous article, we used boxplots to get similar results.

Figure 6: beeswarm for odor (source: author)

We won’t discuss the code for this plot in detail. In a nutshell, we need to create a new SHAP values object, shap_values_odor. This is done by “post-processing” the SHAP values so they are in the form we want. We replace the original SHAP values with the SHAP values for odor (line 24). We also replace the feature names with the odor categories (line 43). If we create shap_values_odor correctly, we can use the beeswarm function to create the plot (line 46).
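The author's gist for the modified shap_values_odor object is not reproduced here. As a rough alternative that gives a similar category-by-category view, the odor SHAP values can be plotted directly with matplotlib; everything below is an illustrative sketch rather than the original code:

```python
# Illustrative alternative to the post-processed Explanation object:
# a jittered scatter of odor SHAP values, grouped by odor category
odor_idx = list(X.columns).index("odor")      # position of the odor feature
odor_shap = shap_values.values[:, odor_idx]   # one SHAP value per mushroom
odor_cats = X["odor"].to_numpy()

fig, ax = plt.subplots(figsize=(8, 5))
categories = sorted(set(odor_cats))
for i, cat in enumerate(categories):
    vals = odor_shap[odor_cats == cat]
    jitter = np.random.uniform(-0.2, 0.2, size=len(vals))
    ax.scatter(vals, np.full(len(vals), i) + jitter, s=10, alpha=0.5)

ax.set_yticks(range(len(categories)))
ax.set_yticklabels(categories)
ax.set_xlabel("SHAP value for odor (log odds)")
ax.set_ylabel("odor category")
plt.show()
```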

In the end, SHAP and CatBoost are powerful tools to analyse categorical features. Both packages work together seamlessly. The downside is that you may not want to use CatBoost. If you are working with models like RandomForest, XGBoost or neural networks then you’ll need to use the alternative solution. You can find this in the article below. We also go into more detail on how we can post-process SHAP values.


