SHAP for Categorical Features with CatBoost | by Conor O’Sullivan | Aug, 2022

Avoid post-processing the SHAP values of categorical features

Photo by Andrew Ridley on Unsplash

Typically, to model a categorical feature it first needs to be transformed using one-hot encodings. We end up with a binary variable for each category. The problem with this is that each variable will have its own SHAP value. This makes it difficult to see the overall contribution of the original categorical feature. In a previous article, we explored a solution to this. It involved digging into the SHAP values object and manually adding the individual SHAP values. As an alternative, we can use CatBoost.

CatBoost is a gradient boosting library. A major advantage over other libraries is that it can handle non-numerical features. That is categorical features can be used without transforming them. A resulting benefit is that the SHAP values of a CatBoost model are easy to interpret. Unlike other models, there will be only one SHAP value for each categorical feature.

We will explore how to calculate and interpret SHAP values of CatBoost models. We will also apply some of the aggregations provided by the SHAP package. We will see that the aggregations are limited, specifically when it comes to understanding the nature of the relationship of a categorical feature. So, to address this limitation, we will explore how we can create a beeswarm plot for an individual feature. Along the way, we will walk through the Python code used to get these results.

For this analysis, we will be using the same dataset as before. That is a mushroom classification dataset. You can see a snapshot of this dataset in Figure 1. The target variable is the mushroom’s class. That is if the mushroom is poisonous (p) or edible (e). You can find this dataset in UCI’s MLR.

Figure 1: Mushroom dataset snapshot (source: author) (dataset source: UCI) (licence: CC BY 4.0)

For model features, we have 22 categorical features. For each feature, the categories are represented by a letter. For example, odor has 9 unique categories: almond (a), anise (l), creosote (c), fishy (y), foul (f), musty (m), none (n), pungent (p) and spicy (s). This is what the mushroom smells like.

We’ll walk you through the code used to analyse this dataset and you can find the full script on GitHub. To start, we will be using the Python packages below. We have some common packages for handling and visualising data (lines 2–4). We use CatBoostClassifier for modelling (line 6). Finally, we use shap to understand how our model works (line 8). Make sure you have all these packages installed.
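The embedded code gists from the original post are not reproduced on this page, so the line references in the text point to those gists. A minimal set of imports along the lines described would be:

```python
# Data handling and visualisation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Modelling
from catboost import CatBoostClassifier

# Model explanations
import shap
```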

We import our dataset (line 2). We need a numerical target variable so we transform it by setting poisonous = 1 and edible = 0 (line 6). We also get the categorical features (line 7). At this point in the previous article, we needed to transform these features. With CatBoost we can use them as is.
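A sketch of the data preparation, assuming the UCI mushroom data has been saved locally as mushrooms.csv with the label in a column named class (both names are assumptions):

```python
# Load the mushroom dataset (file name is an assumption)
data = pd.read_csv("mushrooms.csv")

# Numerical target: poisonous (p) = 1, edible (e) = 0
y = data["class"].map({"p": 1, "e": 0})

# The remaining 22 columns are the categorical features
X = data.drop("class", axis=1)
```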

You can see this when we train our model below (line 7). We pass the non-numerical features (X), the target variable (y) and a list indicating which features are categorical (cat_features). All of our features are categorical, so cat_features is a list of numbers from 0 to 21. In the end, the classifier is made up of 20 trees, each with a maximum depth of 3. It had an accuracy of 98.7% on the training set.
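A training sketch consistent with the settings reported above (20 trees of depth 3); the exact constructor arguments are assumptions, not the author's original settings:

```python
# Mark every column as categorical (indices 0 to 21)
cat_features = list(range(X.shape[1]))

# 20 trees, each with a maximum depth of 3
model = CatBoostClassifier(iterations=20, depth=3, verbose=False)
model.fit(X, y, cat_features=cat_features)

# Accuracy on the training set
print(model.score(X, y))
```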

We can now move on to understanding how our model is making these predictions. If you are unfamiliar with SHAP or the Python package, I suggest reading the article below. We go in-depth on how to interpret SHAP values. We also explore some of the aggregations used in this article.

Waterfall plot

We start by calculating the SHAP values (lines 2–3). We then visualise the SHAP values of the first prediction using a waterfall plot (line 6). You can see this plot in Figure 2. This tells us how each of the categorical feature values has contributed to the prediction. For example, we can see that this mushroom has an almond (a) odor. This has decreased the log odds by 0.85. In other words, it has decreased the likelihood that the mushroom is poisonous.
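A sketch of the SHAP calculation: shap's TreeExplainer supports CatBoost models, although the exact behaviour can depend on the shap and catboost versions installed.

```python
# Explain the model; TreeExplainer is selected automatically for CatBoost
explainer = shap.Explainer(model)
shap_values = explainer(X)

# Waterfall plot for the first prediction
shap.plots.waterfall(shap_values[0])
```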

Figure 2: SHAP waterfall for CatBoost (source: author)

From the chart above, it is easy to see the contributions of each feature. In comparison, we have the waterfall plot in Figure 3. As mentioned, this was created in the previous article. To model the categorical features, we first transformed them using one-hot encodings. This means that each of the binary features has its own SHAP value. For example, odor will have 9 SHAP values. One for every unique category. As a result, it is difficult to understand the overall contribution of odor to the prediction.

Figure 3: SHAP waterfall for one-hot encodings (source: author)

We are able to take the SHAP values in Figure 3 to create a plot similar to Figure 2. That is so we have only one SHAP value for each categorical feature. To do this we need to “post-process” the SHAP values by adding all the values for one categorical feature together. Unfortunately, there is no straightforward way to do this. We need to manually update the SHAP values object ourselves. We have seen that by using CatBoost we can avoid this process.

Absolute mean SHAP

The SHAP aggregations also work for CatBoost. For example, we use the mean SHAP plot in the code below. Looking at Figure 4, we can use this plot to highlight important categorical features. For example, we can see that odor tends to have large positive or negative SHAP values.
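The mean |SHAP| chart can be produced with shap's built-in bar plot; a minimal sketch:

```python
# Mean absolute SHAP value per feature (Figure 4)
shap.plots.bar(shap_values)
```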

Figure 4: mean SHAP plot (source: author)

Beeswarm

Another common aggregation is the beeswarm plot. For continuous variables, this plot is useful as it can help explain the nature of the relationships. That is we can see how SHAP values are associated with the feature values. However, for the categorical features, the feature values are not numerical. As a result, in Figure 5, you can see the SHAP values are all given the same colour. We need to create our own plots to understand the nature of these relationships.
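The standard beeswarm comes straight from the shap package; a minimal sketch:

```python
# Beeswarm over all features (Figure 5); the points cannot be coloured by
# feature value because the feature values are categorical
shap.plots.beeswarm(shap_values)
```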

Figure 5: beeswarm plot (source: author)

Beeswarm for one feature

One way is to use a beeswarm plot for an individual feature. You can see what we mean in Figure 6. Here we have grouped the SHAP values for the odor feature based on the odor category. For example, you can see that a foul smell leads to higher SHAP values. These mushrooms are more likely to be poisonous. In the previous article, we used boxplots to get similar results.

Figure 6: beeswarm for odor (source: author)

We won’t discuss the code for this plot in detail. In a nutshell, we need to create a new SHAP values object, shap_values_odor. This is done by “post-processing” the SHAP values so they are in the form we want. We replace the original SHAP values with the SHAP values for odor (line 24). We also replace the feature names with the odor categories (line 43). If we create shap_values_odor correctly, we can use the beeswarm function to create the plot (line 46).
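The author's gist for the modified shap_values_odor object is not reproduced here. As a rough alternative that gives a similar category-by-category view, the odor SHAP values can be plotted directly with matplotlib; everything below is an illustrative sketch rather than the original code:

```python
# Illustrative alternative to the post-processed Explanation object:
# a jittered scatter of odor SHAP values, grouped by odor category
odor_idx = list(X.columns).index("odor")      # position of the odor feature
odor_shap = shap_values.values[:, odor_idx]   # one SHAP value per mushroom
odor_cats = X["odor"].to_numpy()

fig, ax = plt.subplots(figsize=(8, 5))
categories = sorted(set(odor_cats))
for i, cat in enumerate(categories):
    vals = odor_shap[odor_cats == cat]
    jitter = np.random.uniform(-0.2, 0.2, size=len(vals))
    ax.scatter(vals, np.full(len(vals), i) + jitter, s=10, alpha=0.5)

ax.set_yticks(range(len(categories)))
ax.set_yticklabels(categories)
ax.set_xlabel("SHAP value for odor (log odds)")
ax.set_ylabel("odor category")
plt.show()
```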

In the end, SHAP and CatBoost are powerful tools to analyse categorical features. Both packages work together seamlessly. The downside is that you may not want to use CatBoost. If you are working with models like RandomForest, XGBoost or neural networks then you’ll need to use the alternative solution. You can find this in the article below. We also go into more detail on how we can post-process SHAP values.


