
Using SHAP with Cross-Validation in Python
By Dan Kirk, December 2022



Photo from Michael Dziedzic on Unsplash

Introduction

In many situations, machine learning models are preferred over traditional linear models because of their superior predictive performance and their ability to handle complex nonlinear data. However, a common criticism of machine learning models is their lack of interpretability. For example, ensemble methods such as XGBoost and Random Forest combine the predictions of many individual learners to generate their output. Although this often leads to superior performance, it makes it hard to know how much each feature in the dataset contributes to the output.

To get around this, explainable AI (xAI) was conceived and is growing in popularity. The xAI field aims to explain how such unexplainable models (so-called black-box models) make their predictions, allowing the best of both worlds: prediction accuracy and explainability. The motivation for this is that many real-world applications of machine learning require not just good predictive performance but also an explanation of how the results were generated. For example, in the medical field, where lives may be lost or saved based on decisions made by a model, it is important to know what the drivers of the decision were. Additionally, being able to identify important variables can be informative for identifying mechanisms or treatment avenues.

One of the most popular and effective xAI techniques is SHAP. The SHAP concept was introduced in 2017 by Lundberg & Lee but actually builds on Shapley values from game theory, which existed long before. Briefly, SHAP values work by calculating the marginal contribution of each feature: for each observation, the prediction is compared across many models built with and without that feature, the contribution is weighted in each of these reduced-feature-set models, and the weighted contributions are then summed. Those who would appreciate a more thorough description can see the links above, but for our purposes here it suffices to say: the larger the absolute SHAP value for an observation, the larger the effect on the prediction. It thus follows that the larger the average of the absolute SHAP values across all observations for a given feature, the more important that feature is.

Implementing SHAP values in Python is easy with the SHAP library, and many walkthroughs already exist online explaining how this can be done. However, I found two major shortcomings in all of the guides that I came across for incorporating SHAP values into Python code.

The first is that most of the guides use SHAP values on basic train/test splits but not on cross-validation (see Figure 1). Using cross-validation gives a much better idea about the generalizability of your results, whereas the results from a simple train/test split are liable to drastic changes based on how the data is partitioned. As I explain in my recent article on “Machine learning in Nutrition Research”, cross-validation should almost always be preferred over a simple train/test split, unless the dataset you are dealing with is huge.

Figure 1: Different evaluation procedures in machine learning, taken from my article “Machine learning in Nutrition Research” (Kirk et al., 2022).

The other shortcoming is that none of the guides I came across used multiple repeats of cross-validation to derive their SHAP values. Whilst cross-validation is a big improvement on a simple train/test split, it should ideally be repeated multiple times using different splits of the data each time. This is especially important with smaller datasets where the results can change greatly depending on how the data is split. This is why it is often advocated to repeat cross-validation 100x in order to have confidence in your results.

To deal with these shortcomings, I decided to write some code to implement this myself. This walk-through will show you how to get SHAP values for multiple repeats of cross-validation and incorporate a nested cross-validation scheme. For our model dataset, we will use the Boston Housing dataset, and our algorithm of choice will be the powerful but uninterpretable Random Forest.

Implementation of SHAP Values

Whenever you’re building code with various loops, as we will be, it usually makes sense to start with the innermost loop and work outwards. If you instead start from the outside and build the code in the order it will eventually run, it’s easier to get confused and harder to troubleshoot when things go wrong.

Thus, we start with the basic implementation of SHAP values. I will presume you’re familiar with the general use of SHAP and what the code for its implementation looks like, so I won’t spend too long on an explanation. I leave comments throughout (as is always good to do) so you can check those, and if you’re still unsure then check out the links in the introduction or the docs of the library. I also import libraries as they’re used rather than all at once at the start, just to help with intuition.
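For orientation, a minimal sketch of this first step might look like the following (the data-loading approach, variable names, and seeds are illustrative assumptions; the Boston data is fetched from its original source here because load_boston has been removed from recent scikit-learn versions):

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Load the Boston Housing data from its original source
# (sklearn.datasets.load_boston has been removed from recent scikit-learn versions)
url = "http://lib.stat.cmu.edu/datasets/boston"
raw = pd.read_csv(url, sep=r"\s+", skiprows=22, header=None)
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
X = pd.DataFrame(np.hstack([raw.values[::2, :], raw.values[1::2, :2]]),
                 columns=feature_names)
y = pd.Series(raw.values[1::2, 2], name="MEDV")

# Simple train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Fit a default random forest and explain the test set
model = RandomForestRegressor(random_state=1).fit(X_train, y_train)
explainer = shap.TreeExplainer(model)        # tree-specific explainer for the forest
shap_values = explainer.shap_values(X_test)  # one row of SHAP values per test sample

shap.summary_plot(shap_values, X_test)       # Figure 2
```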

Figure 2: SHAP on a simple train/test split

Incorporating Cross-Validation with SHAP Values

We are most often used to seeing cross-validation implemented in an automated fashion by using sklearn’s cross_val_score or similar. The problem with this is everything happens behind the scenes, and we don’t have access to the data from each fold. Of course, if we want to get SHAP values for all our data points, we need to access each data point (recall that every data point is used exactly one time in the test set and k-1 times in training). To get around this, we can use KFold in combination with .split.

By looping through our KFold object with .split, we can get the training and test indices of each fold. Here, fold is a tuple, where fold[0] is the training indices and fold[1] is the test indices of each fold.
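As a small illustration of what the fold object contains (the CV name and settings are assumptions carried through the rest of the sketches):

```python
from sklearn.model_selection import KFold

CV = KFold(n_splits=5, shuffle=True, random_state=1)

for fold in CV.split(X):
    # fold is a tuple: fold[0] holds the training indices, fold[1] the test indices
    print(len(fold[0]), len(fold[1]))  # e.g. 404 102 for the first fold of the Boston data
```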

Now, we can use this to select the training and test data from our original data frame ourselves, and thus also extract the information we want. We do this by making a new for loop to get the training and test indices of each fold, and then simply performing our regression and SHAP procedure as normal. We then only need to add an empty list outside of the loops to keep track of the SHAP values per sample, appending to it at the end of each loop iteration. I use #-#-# to denote these new additions:
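A sketch of how that loop might look, building on the snippets above (the list names are illustrative):

```python
SHAP_values_per_fold = []   #-#-# list to collect the SHAP values of every test sample
test_index_per_fold = []    #-#-# matching sample indices, needed for plotting later

for fold in CV.split(X):
    train_index, test_index = fold[0], fold[1]
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Regression and SHAP exactly as before, just once per fold
    model = RandomForestRegressor(random_state=1).fit(X_train, y_train)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)

    SHAP_values_per_fold.extend(shap_values)  #-#-# append this fold's SHAP values
    test_index_per_fold.extend(test_index)    #-#-# and the indices they belong to
```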

Now we have SHAP values for every sample, instead of just the samples in one test split of the data, and we can plot these easily using the SHAP library. We first just have to update the index of X to match the order in which the samples appear in each test set of each fold; otherwise, the color-coded feature values will be all wrong. Notice that we re-order X within the summary_plot function so that we don’t save our changes to the original X data frame.
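Concretely, this might look like the following, re-using the fold indices collected in the sketch above:

```python
# Re-index X inside the plotting call so the original data frame is left untouched;
# the SHAP values were collected fold by fold, so X must follow that same order
shap.summary_plot(np.array(SHAP_values_per_fold), X.iloc[test_index_per_fold])  # Figure 3
```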

Figure 3: SHAP with cross-validation, thus including all data points

We can see from the plot that there are now many more data points (all of them, in fact) compared to when we just use a train/test split. Already, this improves our procedure since we get to make use of the entire dataset instead of just a portion.

But we still have no idea about stability, i.e., how the results would change had the data been split differently. Fortunately, we can code our way out of this problem below.

Repeating Cross-Validation

Using cross-validation greatly increases the robustness of your work, especially with smaller datasets. If we really want to do good data science, however, cross-validation should be repeated over many different splits of the data.

First of all, we now need to consider not just the SHAP values of each fold, but also the SHAP values of each fold for each repeat, and then combine them to plot on one graph. Dictionaries are powerful tools in Python, and they are what we will use to track the SHAP values of each sample in each fold.

First, we decide how many cross-validation repeats we want to perform and then establish a dictionary that will store the SHAP values of each sample for each repeat. This is achieved by looping through all of the samples in our dataset and creating a key for them in our empty dictionary, and then within each sample creating another key to represent the cross-validation repeat.
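A possible sketch of that setup (CV_repeats and the dictionary name are assumptions used in the remaining snippets):

```python
CV_repeats = 100  # how many times the whole cross-validation procedure is repeated

# One key per sample, and within each sample one entry per CV repeat,
# which will later hold that sample's SHAP values for that repeat
shap_values_per_cv = {}
for sample in X.index:
    shap_values_per_cv[sample] = {}
    for CV_repeat in range(CV_repeats):
        shap_values_per_cv[sample][CV_repeat] = []
```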

Then, we add some new lines to our existing code that allow us to repeat the cross-validation procedure CV_repeats times and add the SHAP values of each repeat to our dictionary. This is achieved easily by updating some lines at the end of the code so that, instead of appending a list of the SHAP values per sample to a list, we now update the dictionary. (Note: collecting test scores per fold is probably also relevant, and although we don’t do that here because the focus is on SHAP values, this can easily be added with another dictionary, with CV repeats as the keys and test scores as the values.)

The code looks like this, with #-#-# denoting updates to the existing code:
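Here is a sketch of how it might look, continuing from the earlier snippets (X, y, and shap_values_per_cv as defined above):

```python
for CV_repeat in range(CV_repeats):  #-#-# outer loop over the repeats
    # A different random_state per repeat gives a different split of the data each time
    CV = KFold(n_splits=5, shuffle=True, random_state=CV_repeat)  #-#-#

    for fold in CV.split(X):
        train_index, test_index = fold[0], fold[1]
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        model = RandomForestRegressor(random_state=1).fit(X_train, y_train)
        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(X_test)

        # Store each sample's SHAP values under its index and the current repeat,
        # instead of appending them to a flat list
        for position, sample_index in enumerate(test_index):                     #-#-#
            shap_values_per_cv[sample_index][CV_repeat] = shap_values[position]  #-#-#
```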

To visualize this, let’s say we want to check the fifth cross-validation repeat of the sample with index number 10. We just write:
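Using the dictionary from the sketch above (and remembering that Python indexing is zero-based, so the fifth repeat is index 4):

```python
shap_values_per_cv[10][4]  # sample with index 10, fifth CV repeat
```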

where the first square bracket denotes the sample number and the second denotes the repeat number. The output is the SHAP values for each of the columns in X for sample number 10 after the fifth cross-validation repeat.

To see the SHAP values of all the cross-validation repeats of one individual, we would just type the number in the first square bracket:
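Again with the dictionary name from the sketch above:

```python
shap_values_per_cv[10]  # all CV repeats for the sample with index 10
```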

However, this is not really much use to us (other than for troubleshooting purposes). What we really need is a plot to visualize this.

We first need to average the SHAP values per sample per cross-validation repeat into one value for plotting (you could also use the median or other statistics if you preferred). Taking the average is convenient but can hide variability within the data that might also be relevant to know. Thus, while we’re taking the average we will also get other statistics, such as the minimum, the maximum, and the standard deviation:
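One way to collect these statistics, continuing from the snippets above (the list names are illustrative):

```python
average_shap_values, stds, ranges = [], [], []  # per-sample summary statistics

for i in range(0, len(X)):
    # One data frame per sample: each X variable is a row, each CV repeat a column
    df_per_obs = pd.DataFrame.from_dict(shap_values_per_cv[i])
    # Aggregate across the repeats, i.e., across the columns (axis=1)
    average_shap_values.append(df_per_obs.mean(axis=1).values)
    stds.append(df_per_obs.std(axis=1).values)
    # The range (max minus min) summarises how far apart the extremes are
    ranges.append(df_per_obs.max(axis=1).values - df_per_obs.min(axis=1).values)

# Convert the per-sample lists back into data frames (samples as rows, features as columns)
average_shap_values = pd.DataFrame(average_shap_values, columns=X.columns)
stds = pd.DataFrame(stds, columns=X.columns)
ranges = pd.DataFrame(ranges, columns=X.columns)
```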

The above code says: for each sample index in our original data frame, make a data frame from that sample’s dictionary of SHAP values (one entry per cross-validation repeat). This data frame contains each X variable as a row and each cross-validation repeat as a column. We then take the average, the standard deviation, and the minimum and maximum (combined into a range) across the repeats using the appropriate functions, with axis = 1 so the calculations run across the columns. Finally, we convert each of these per-sample lists back into data frames.

Now, we simply plot the average values as we would plot the usual values. We also don’t need to re-order the index here because we take the SHAP values from our dictionary, which is in the same order as X.
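Using the averaged data frame from the sketch above:

```python
# The dictionary preserved the original sample order, so no re-indexing is needed
shap.summary_plot(average_shap_values.values, X)  # Figure 4
```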

Figure 4: Average SHAP values after repeating cross-validation multiple times

Since our results have been averaged over many repeats of cross-validation, they are more robust and trustworthy than a simple train/test split performed only once. However, you may be disappointed if you compare the plots before and after and see that, other than the extra data points, not much changes. But don’t forget, we’re using a model data set which is nice and tidy with well-behaved features that have strong relationships with the outcome. In less utopic circumstances, techniques like repeated cross-validation will expose the instability of real-world data in terms of results and feature importance.

If we wanted to beef up our results even more (which of course we do) we could add some plots to get an idea about the variability of our proposed feature importances. This is relevant because taking the average SHAP value per sample might mask how much they change with different splits of the data.

To do this, we have to convert our data frame to long form, after which we can make a catplot using the seaborn library:
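A sketch using the ranges data frame from above (the long-form column names are illustrative):

```python
import seaborn as sns

# Melt the per-sample ranges into long form: one row per (sample, feature) pair
ranges_long = ranges.melt(var_name="Feature", value_name="SHAP value range")

# One strip of points per feature, showing how much each sample's SHAP value
# varied across the CV repeats
sns.catplot(data=ranges_long, x="Feature", y="SHAP value range",
            kind="strip", height=5, aspect=2)  # Figure 5
```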

Figure 5: The range of the largest and smallest SHAP values per observation per feature

In the catplot above, we see the range (maximum value minus minimum value) across the CV repeats for each sample. Ideally, we would want the values on the Y-axis to be as small as possible, since this would imply more consistent feature importances.

We should keep in mind though that this variability is also sensitive to the absolute feature importance, i.e., features deemed more important will naturally have data points with larger ranges. We can partially account for this by scaling the data.
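The exact scaling used for Figure 6 is not spelled out here, so as an illustrative assumption the sketch below divides each observation’s range by the mean range of that feature:

```python
# An assumed scaling: divide each observation's range by the mean range of that
# feature, so variability is expressed relative to the feature's typical magnitude
scaled_ranges = ranges / ranges.mean(axis=0)

scaled_long = scaled_ranges.melt(var_name="Feature", value_name="Scaled SHAP range")
sns.catplot(data=scaled_long, x="Feature", y="Scaled SHAP range",
            kind="strip", height=5, aspect=2)  # Figure 6
```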

Figure 6: As Figure 5 but now each observation is scaled by the mean per feature

Notice how different things look for LSTAT and RM, two of our most important features. Now we get a better reflection of variability scaled by the overall importance of the feature, which may or may not be more relevant depending on our research question.

We could conjure up similar plots based on the other statistics that we collected, such as standard deviation, for example.

Nested Cross-Validation

All this is great, but one thing is missing: our random forest is in its default mode. Although it performs pretty well on this data set with the default parameters, this might not be the case in other situations. And besides, why should we not try to maximize our results?

We should be careful not to fall into a trap that seems to be all too common in machine learning examples these days which is to optimize the model hyperparameters on data that is also present in the test set. With a simple train/test split, this is easy to avoid — you simply optimize the hyperparameters on the training data.

But as soon as cross-validation enters the equation, this concept seems to be forgotten. Indeed, all too often people use cross-validation to optimize hyperparameters and then use that same cross-validation to score the model. In this case, data leakage has occurred, and our results will be (even if only slightly) over-optimistic.

Nested cross-validation is our solution to this. It involves taking each of the training folds from our normal cross-validation scheme (called here the “outer loop”) and optimizing hyperparameters by using another cross-validation on the training data of each fold (called the “inner loop”). This means that we optimize the hyperparameters on training data and then can still get a less biased idea about how well the optimized models perform on unseen data.

The concept can be a bit tricky to understand, but for those wishing for a little more detail, I explain it in my article linked above. In any case, the code is not so difficult, and reading through it will probably help with comprehension. In fact, we already have much of the code prepared from our process above, and only small tweaks are required. Let’s see it in action.

The main consideration with nested cross-validation, especially with many repeats as we use, is that it takes a lot of time to run. For that reason, we will keep our parameter space small and use a randomized search rather than a grid search (though random search performs adequately in most situations anyway). If you did want to be more thorough, you might need to reserve some time on an HPC. Anyway, outside of our initial for loop we will establish the parameter space:
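A deliberately small, illustrative parameter space might look like this (the specific hyperparameters and values are assumptions):

```python
from scipy.stats import randint

# Small, illustrative parameter space for the random forest
param_grid = {
    "n_estimators": randint(50, 200),
    "max_depth": [3, 5, 10, None],
    "max_features": ["sqrt", "log2", None],
    "min_samples_split": randint(2, 10),
}
n_iter = 10  # number of random parameter combinations sampled per inner CV
```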

We then make the following changes to the original code:

  • CV will now become cv_outer because, now that we have two cross-validations, we need to refer to each appropriately
  • Inside our for loop in which we loop through the training and test IDs, we add our inner cross-validation scheme, cv_inner
  • Then, we use RandomizedSearchCV to optimize our model on cv_inner, select the best model, and use it to derive SHAP values from the test data (the test data here being the outer-fold test set).

And that’s it. For demonstration purposes, we reduce CV_repeats to 2, since otherwise we could be here a while. Under real circumstances, you’d want to keep this high enough to maintain robust results with optimal parameters, for which you might need an HPC (or a lot of patience).

See the code below for these changes, again with #-#-# denoting new additions.
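A sketch of how these changes might fit together, continuing from the earlier snippets (X, y, shap_values_per_cv, param_grid, and n_iter as defined above):

```python
from sklearn.model_selection import KFold, RandomizedSearchCV

CV_repeats = 2  # kept low here purely for demonstration

for CV_repeat in range(CV_repeats):
    cv_outer = KFold(n_splits=5, shuffle=True, random_state=CV_repeat)  #-#-# renamed from CV

    for fold in cv_outer.split(X):
        train_index, test_index = fold[0], fold[1]
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        cv_inner = KFold(n_splits=5, shuffle=True, random_state=CV_repeat)  #-#-# inner CV, training data only

        # Tune the forest on the outer training folds only, then keep the best model   #-#-#
        search = RandomizedSearchCV(RandomForestRegressor(random_state=1), param_grid,  #-#-#
                                    n_iter=n_iter, cv=cv_inner, n_jobs=-1,              #-#-#
                                    random_state=1)                                     #-#-#
        best_model = search.fit(X_train, y_train).best_estimator_                       #-#-#

        # Explain the outer test fold with the tuned model
        explainer = shap.TreeExplainer(best_model)
        shap_values = explainer.shap_values(X_test)

        for position, sample_index in enumerate(test_index):
            shap_values_per_cv[sample_index][CV_repeat] = shap_values[position]
```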

Conclusion

The ability to explain complex AI models is becoming increasingly important. SHAP values are a great way to do this; however, the results of single train/test splits cannot always be trusted, especially with smaller datasets. By repeating procedures such as (nested) cross-validation multiple times, you increase the robustness of your results and can better gauge how they might change if the underlying data changed.

Note: All uncredited images are my own.


