
The Ultimate Guide to PDPs and ICE Plots | by Conor O’Sullivan | Jun, 2022



The intuition, maths and code (R and Python) behind partial dependence plots and individual conditional expectation plots


Both PDPs and ICE Plots can help us understand how our models make predictions. Using PDPs, we can visualise the relationship between model features and the target variable. They can tell us whether a relationship is linear, non-linear or absent. ICE Plots go a step further and are particularly useful when there are interactions between features. We will go into depth on both methods.

We start with PDPs. We will take you step-by-step through how PDPs are created. You will see that they are an intuitive method. Even so, we will also explain the mathematics behind PDPs. We then move on to ICE Plots. You will see that, with a good understanding of PDPs, these are easy to understand. Along the way, we discuss different applications and variations of these methods, including:

  • PDPs and ICE Plots for continuous and categorical features
  • PDPs for binary target variables
  • PDPs for 2 model features
  • Derivative PDP
  • Feature importance based on PDPs and ICE Plots

We make sure to discuss the advantages and limitations of both methods. These are important sections. They help us understand when the methods are most appropriate. They also tell us how they can lead to incorrect conclusions in some situations.

We end by walking you through both R and Python code for these methods. For R, we apply the ICEbox and iml packages to create visualisations. We also use vip to calculate feature importance. For Python, we will be using scikit-learn’s implementation, PartialDependenceDisplay. You can find links to the GitHub repos with code in these sections.

We start with a step-by-step walk-through of PDPs. To explain this method, we randomly generated a dataset with 1000 rows. It contains details on the sales of second-hand cars. You can see the features in Table 1. We want to predict the price of a car using the first 5 features. You can find this dataset on Kaggle.

Table 1: overview of second-hand car sales dataset (source: author) (dataset: kaggle)(licence: CC0)

To predict price we first need to train a model using this dataset. In our case, we trained a random forest with 100 trees. The exact model is not important as PDPs are a model agnostic method. We will see now that they are built using model predictions. We do not consider the inner workings of a model. This means the visualisations we explore will be similar for random forests, XGBoost, neural networks, etc…

In Table 2, we have one observation from the dataset used to train the model. In the last column, we can see the predicted price for this car. To create a PDP, we vary the value of one of the features and record the resulting predictions. For example, if we changed car_age we would get a different prediction. We do this while holding the other features constant at their real values. For example, owner_age will remain at 19 and km_driven will remain at 27,544.

Table 2: training data and prediction example (source: author)

Looking at Figure 1, we can see the result of this process. This shows the relationship between the predicted price (partial yhat) and car_age for this specific observation. You can see that as car_age increases the predicted price decreases. The black point gives the original predicted price (4,654) and car_age (4.03).

Figure 1: predicted price for varying car_age for observation 1 (source: author)

We then repeat this process for every observation (or a subset of observations) in our dataset. In Figure 2, you can see the prediction lines for 100 observations. To be clear, for each observation we have only varied car_age. We hold the remaining features constant at their original values. These will not have the same values as the observation we saw in Table 2. This explains why each line starts at a different level.

Figure 2: prediction lines for 100 observations (source: author)

To create a PDP, the final step is to calculate the average prediction at each value for car_age. This gives us the bold yellow line in Figure 3. This line is the PDP. By holding the other features constant and averaging over the observations we are able to isolate the relationship with car_age. We can see that the predicted price tends to decrease with car_age.

Figure 3: partial dependence plot for car age (source: author)

You may have noticed the short lines on the x-axis. These mark the deciles of car_age. That is, 10% of car_age values are less than the first line and 90% are less than the last line. This is known as a rug plot. It shows us the distribution of the feature. In this case, the values of car_age are fairly evenly distributed across its range. We will understand why this is useful when we discuss the limitations of PDPs.

You may also have noticed that not all the individual prediction lines follow the PDP trend. Some of the higher lines seem to increase. This suggests that, for those observations, price has the opposite relationship with car_age. That is the predicted prices increase as car_age increases. Keep this in mind. We will come back to it later when we discuss ICE Plots.
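Before moving on to the mathematics, here is a minimal sketch of this whole procedure in Python. It assumes a fitted model called model and a pandas feature matrix X containing a car_age column, as in the dataset described above.

import numpy as np

def pdp_line(model, X, feature, grid):
    """Average prediction at each grid value, holding the other features
    at their observed values (the bold average line in Figure 3)."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[feature] = value                         # set the feature to this value for every row
        averages.append(model.predict(X_mod).mean())   # average over the individual prediction lines
    return np.array(averages)

grid = np.linspace(X["car_age"].min(), X["car_age"].max(), num=50)
pdp_car_age = pdp_line(model, X, "car_age", grid)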

Mathematics behind PDPs

For the more mathematically minded, there is also a formal definition of PDPs. Let’s start with the function for the PDP we just created. This is given by Equation 1. Set C contains all the features excluding car_age. For a given car_age value and observation i, we find the predicted price using the original values of the features in C. We do this for all 100 observations and take the average. This equation gives us the bold yellow line we saw in Figure 3.

Equation 1: PD function for car_age (source: author)
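Reconstructed from the description above, the function has the standard partial dependence form:

\hat{f}_{\text{car\_age}}(x) = \frac{1}{100} \sum_{i=1}^{100} \hat{f}\left(x, x_C^{(i)}\right)

where \hat{f} is the trained model, x is a value of car_age and x_C^{(i)} gives the original values of the features in C for observation i.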

In Equation 2, we have generalised the above equation. This is the PD function for a set of features S. C will contain all the features excluding those in S. We now also average over n observations. This can be up to the total number of observations in the dataset (i.e. 1000).

Equation 2: approximated PD function for feature set S (source: author)
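In the same notation, the approximated PD function for a feature set S is:

\hat{f}_S(x_S) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}\left(x_S, x_C^{(i)}\right)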

Until now, we have only discussed cases when S consists of one feature. In the previous section, we had S = {car_age}. Later, we will show you a PDP of 2 features. We will generally not include more than 2 features in S. Otherwise, it becomes difficult to visualise the PD function.

The above equation is actually only an approximation of the PD function. The true mathematical definition is given in Equation 3. For given values of the features in set S, we find the expected prediction with respect to the features in set C. To do this, we integrate the model function over the probability distribution of the values in set C. To fully understand this equation you will need some familiarity with probability theory and integration.

Equation 3: PD function for feature set S (source: author)
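Written out, the expectation described above is:

f_S(x_S) = \mathbb{E}_{X_C}\left[\hat{f}(x_S, X_C)\right] = \int \hat{f}(x_S, x_C)\, d\mathbb{P}(x_C)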

When working with PDPs, an understanding of the approximation is good enough. The true PD function is not practical to implement. Firstly, it would be more computationally expensive than the approximation. Secondly, given our limited number of observations, we can only estimate the probability distribution of the features in C; we cannot know it exactly. Lastly, many models are not continuous functions, which makes them difficult to integrate.

PDPs for continuous features

With an understanding of how we create PDPs we will move on to using them. Using these plots we can understand the nature of the relationship between a model feature and the target variable. For example, we have already seen the car_age PDP in Figure 4. The predicted price decreases at a fairly constant rate. This suggests that car_age has a linear relationship with price.

Figure 4: car_age PDP (source: author)

In Figure 5, we can see the PDP for another feature, repairs. This is the number of repairs/services the car received. Initially, the predicted price tends to increase with the number of repairs. We would expect a reliable car to have received some regular maintenance. Then, at around 6/7 repairs, the price tends to decrease. Excessive repairs may indicate that something is wrong with the car. From this, we can see that price has a non-linear relationship with repairs.

Figure 5: repairs PDP (source: author)

From the above, we can see that PDPs are useful when it comes to visualising non-linear relationships. We explore this topic in more depth in the article below. When dealing with many features it may be impractical to look at all the PDPs. So we also discuss using mutual information and feature importance to help find non-linear relationships.

In Figure 6, we can see an example of a PDP when a feature has no relationship with the target variable. For owner_age the PDP is constant. This tells us that the predicted price does not change when we vary owner_age. Later, we will see how we can use this idea of PDP variation to create a feature importance score.

Figure 6: owner_age PDP (source: author)

PDPs for categorical features

The features we have discussed above have all been continuous. We can also create PDPs for categorical features. For example, see the plot for car_type in Figure 7. Here we calculate the average prediction for each type of car — normal (0) or classic (1). Instead of a line, we visualise these with a bar chart. We can see that classic cars tend to be sold at a higher price.

Figure 7: car_type PDP (source: author)

PDP for a binary target variable

PDPs for binary target variables are similar to those for continuous targets. Suppose we want to predict whether the car price is above (1) or below (0) average. We build a random forest to predict this binary variable using the same features. We can create PDPs for this model using the same process as before, except now our prediction is a probability.

For example, in Figure 8, you can see the PDP for car_age. We now have a predicted probability on the y-axis. This is the probability that the car is above the average price for a second-hand car. We can see that the probability tends to decrease with the age of the car.

Figure 8: PDP for a binary target variable (source: author)

PDP of 2 features

Going back to our continuous target variable, we can also visualise the PDP of two features. In Figure 9, we can see the average predictions at different combinations of km_driven and car_age. This chart is created in the same way as the PDP of one feature. That is, by keeping the remaining features at their original values.

Figure 9: 2 feature PDP (source: author)

These PDPs are useful for visualising interactions between features. The above chart suggests a possible interaction between km_driven and car_age. That is the predicted price tends to be lower when both features have larger values. You should be cautious when drawing these types of conclusions.

This is because we can get similar results if the two features are correlated. Later, when we discuss the limitations of PDPs, we will see that this is actually the case. That is, km_driven is correlated with car_age. The distance a car has driven tends to be higher when the car is older. This is why we see a lower predicted price when both features are higher.

Derivative PDP

A derivative PDP is a variation of a PDP. It shows the slope/derivative of the PDP. It can be used to get a better understanding of the original PDP. However, in most cases, the insight we can gain from these plots is limited. They are generally most useful for non-linear relationships.

For example, take the derivative PDP for repairs in Figure 10. This is the derivative of the line we saw earlier in Figure 5. We can see that the derivative is 0 at around 6 repairs. At this point, the derivative changes from positive to negative. In other words, the original PDP changes from increasing to decreasing w.r.t. repairs. This tells us that after 6 repairs a car’s price will tend to decrease.

Figure 10: derivative PDP for repairs (source: author)

PDP feature importance

To end this section, we have a feature importance score based on PDPs. This is done by determining the “flatness” of the PDP for each feature. Specifically, for continuous variables, we calculate the standard deviation of the values of the plot. For categorical variables, we estimate the SD by first taking the range. That is the maximum less the minimum PDP value. We then divide the range by 4. This calculation comes from a concept called the range rule.
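As a rough sketch of this calculation in Python, using scikit-learn's partial_dependence function (covered in the Python section later) to obtain the PDP values:

from sklearn.inspection import partial_dependence

def pdp_importance(model, X, feature, categorical=False):
    """PDP-based importance: how 'flat' the PDP for one feature is."""
    avg = partial_dependence(model, X, [feature], kind="average")["average"][0]
    if categorical:
        return (avg.max() - avg.min()) / 4   # range rule estimate of the standard deviation
    return avg.std()                         # standard deviation of the PDP values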

You can see the PDP-based feature importance for our features in Figure 11. Notice that the score for owner_age is relatively low. This makes sense if we think back to the PDP in Figure 6. We saw that the PDP was relatively constant. In other words, the y-axis values are always close to their mean value. They have a low standard deviation.

Figure 11: feature importance based on PDP (source: author)

There are better-known feature importance scores, such as permutation feature importance. Even so, you may prefer the PDP-based score because it is consistent with your analysis: if you are analysing feature trends with PDPs, the importance score is calculated using similar logic, and you avoid having to explain two different methods.

With PDPs out of the way, let’s move on to ICE Plots. You will be happy to know that we’ve already discussed the process of creating them. Take Figure 12 below. This is the plot we created just before the car_age PDP in Figure 3. It is an ICE Plot for car_age. ICE Plots are made up of the prediction lines for each individual observation.

Figure 12: ICE Plot for car_age (source: author)

ICE plots are useful when there are interactions in your model. That is if the relationship of a feature with the target variable depends on the value of another feature. It may be hard to see this in the above chart. To make things clearer we can centre our ICE plot. In Figure 13 we have done this by making all the prediction lines start at 0. It is now clear that for some observations the predicted price tends to increase with car_age.

Figure 13: centred ICE Plot (source: author)

To understand what is causing this behaviour we can change the colour of the ICE Plot. In Figure 14, we have changed the colour based on car_type. We make the lines blue for classic cars and red for normal cars. We can now see that this relationship comes from an interaction between car_age and car_type. Intuitively, it makes sense that a classic car would increase in value as it got older.

Figure 14: coloured ICE Plot (source: author)

A final addition is to add the PDP line to the plot. In this way, we can combine ICE Plots and PDPs. This emphasises how some of the observations deviate from the average trend. We can see that if we had relied only on the PDP we would have missed this interaction. That is, using the PDP in Figure 3 alone, we would have concluded that price tends to decrease with age for all cars.

Figure 15: combined PDP and ICE Plot (source: author)

As with PDPs, ICE Plots can help us visualise important relationships in our data. To find those relationships, we may need to use metrics like feature importance. An alternative metric, used specifically to highlight interactions in a model, is Friedman's H-statistic. We discuss using all of these methods in the article below.

ICE Plots of categorical features

We can use boxplots to visualise ICE Plots of categorical features. For example, we have the ICE Plot for car_type in Figure 16. The bold lines in the middle of the boxes give the average prediction. In other words, they are the PDPs. We can see the predicted price tends to be lower for normal cars (0).

Figure 16: car_type ICE plot (source: author)

ICE Plot-based feature importance

We can also calculate an ICE Plot-based feature importance score. This is similar to the PDP-based score except we no longer consider the average prediction line. We now calculate the score using the individual prediction lines. This means the score will consider interactions between features.
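One way to sketch this score is to compute the flatness of each individual prediction line and then average over observations. This mirrors, but may not exactly match, the vip implementation used later.

from sklearn.inspection import partial_dependence

def ice_importance(model, X, feature):
    """ICE-based importance: average 'flatness' of the individual prediction lines."""
    ice = partial_dependence(model, X, [feature], kind="individual")["individual"][0]
    return ice.std(axis=1).mean()   # SD of each observation's line, averaged over observations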

We have the ICE Plot-based scores for our model in Figure 17. We can compare these to the PDP-based scores in Figure 11. The biggest difference is the score for car_age is now larger. It increased from 300 to 345. This makes sense in light of the analysis we did above. We saw that there is an interaction that impacts the relationship between car_age and price. The PDP-based feature importance does not consider this interaction.

Figure 17: ICE plot-based feature importance (source: author)

By now, hopefully, we have a good understanding of PDPs and ICE Plots and the insights we can gain from them. We’re going to move on to discuss the advantages of these approaches. Then in the next section, we will discuss the limitations. It is crucial to understand these so that you do not draw incorrect conclusions from the plots.

Isolate feature trends

We can visualise relationships in our data using scatter plots but data is messy. For example, we can see the interaction between car_age and car_type in Figure 18. The points vary around the true underlying trend. This is because of statistical noise and the fact that price also has relationships with other features. In a real dataset, this problem will likely be worse. Ultimately, it can be difficult to see trends in data.

Figure 18: scatter plot of car_age and car_type interaction (source: author)

Working with PDPs and ICE Plots, we are no longer working with raw data values. We are working with model predictions. If built correctly, a model will capture underlying relationships in our data and ignore statistical noise. We can then isolate the trend of a particular feature. This is done by holding the other feature values constant and averaging over observations.

This is what makes PDPs and ICE Plots so useful. They allow us to strip out noise and the effect of other features. This makes it easier to see the underlying relationships in our data. In this sense, these methods can be used for data exploration and not just for understanding our models.

Straightforward to explain

Hopefully, walking through the process of building a PDP made the method easy to understand. You can gain an intuitive understanding of it without needing a mathematical definition. This also means the methods are easy to explain to a non-technical person, which can be useful in an industry setting.

Easy to implement

These methods are also easy to implement. We just need to vary feature values and record the resulting predictions. We do not even need to consider the inner workings of the model. This means the same implementation can be used with any model. When we discuss the R and Python code you will see that there are already good implementations for these methods.

Assume feature independence

Moving on to limitations, we will start with the main problem with these methods. This is that they assume features are independent. This is not always the case as features can be correlated or associated. For example, take the scatter plot of km_driven and car_age in Figure 19. There is a clear correlation. Intuitively this makes sense. Older cars will tend to have driven longer distances.

Figure 19: scatter plot of km_driven vs car_age (source: author)

The issue is that when we build a prediction line for an observation we will sample all possible values of a feature. For example, suppose we want to build a PDP for km_driven. Take the observation given in red in Figure 20. It has a car_age of 10. To build the prediction line we will vary km_driven for all values in the dotted oval. However, in reality, observations with this car_age will only have driven distances within the solid oval.

Figure 20: issues with random sampling (source: author)

Our model was not trained on observations outside of the solid oval. Still, we are creating the PDP with predictions at these points. The result is that the prediction line is built on feature combinations that the model has never “seen” before. This can produce unintuitive results and lead to incorrect conclusions about feature trends.

Equal focus on all feature values

Even if features are uncorrelated, we can still come to incorrect conclusions. For each observation, we sample over all possible feature values. This gives equal weight to all values. In reality, some values of the feature will be less common, such as those at the extremes of the feature’s distribution. There will be more uncertainty about the trend at these values.

Considering this, it is common to include a rug plot. These help us understand the distribution of the feature. Earlier we saw a quantile-based version of a rug plot. Figure 21 gives an alternative version. Here we have an individual line for each observation. We can see that there are fewer observations for higher values of km_driven. We should, therefore, interpret the trend at these values with more care.

Figure 21: alternative PDP for km_driven (source: author)

Conclusions depend on your model

As mentioned in the advantages, working with model predictions can help us see relationships more clearly. The issue is that models can make incorrect predictions. An underfitted model can miss important relationships. By modelling noise, an overfitted model can present relationships that are not really there. Ultimately, the conclusions we draw will depend on our model. It is important to consider the performance of the model.

We may still have issues even with an accurate model. A model can ignore some relationships in favour of others. We can make the incorrect conclusion that the ignored features do not have relationships with the target variable. This means that, when doing data exploration, you may want to restrict the features to the subset you are interested in exploring.

PDPs ignore interactions

As mentioned before, by using an average, the PDPs can miss interactions. As a result, the PDP-based feature importance will also miss these interactions. This means that, if interactions are present, the score can underestimate the importance of a feature. We saw this with the car_age feature. One solution is to use ICE Plots or just stick to permutation feature importance.

Limitations of implementations

In the next sections, we will walk you through the code used to implement these methods. We will see that each implementation has its own pros and cons. Some packages will not have implementations for all the plots we discussed. For example, if you are working with Python there is no implementation (that I know of) for derivative PDPs or feature importance.

In this section, we walk you through the R code used to create PDPs and ICE Plots. We will look at three different packages. We use ICEbox and iml to create the plots. Combined, these allow us to create all the plots we discussed above. For the feature importance scores, we use vip. You can find all the code we discuss on GitHub.

We start by loading our dataset (line 1). This is the same one we discussed in Table 1 at the beginning of the article. We also set car_type to a categorical feature (line 2).

Modelling

Before we create PDPs and ICE Plots we need a model. We use the randomForest package to do this (line 1). We build a model using price and the 6 features (lines 4–6). Specifically, we have used a random forest with 100 trees (line 6).

For most of the plots below, we will be using the model, rf. This has been trained on the continuous target variable. We also want to build a model, rf_binary, on a binary target variable. This is to show you how the code and output differ for the different types of targets.

To start, we create our binary target variable. It has a value of 1 if the original car price is above average and 0 if it is below average (lines 2–4). We then build a random forest just as before (lines 7–9). With these models, we can now move on to using PDPs and ICE Plots to understand how they work. As we go forward, the output will be displayed below the relevant code.

Package — ICEbox

We will start with the ICEbox package (line 1). We will use it to create a PDP for car_age. We create an iceplot object for car_age using the ice function (lines 4–7). We pass our model, features and target variable (lines 4–6). This object will contain all the individual prediction lines for car_age. It will also contain the average prediction line (i.e. the PDP).

We then use the plot function to display the iceplot object (lines 9–11). By default, this package will always display ICE Plots. To create a PDP, we need to hide the individual prediction lines. We do this by making them all white (line 11). We also hide the points that give the original car_age values (line 10).


To create an ICE plot we can use the same iceplot object. Now, instead of hiding the individual lines, we colour them by car_type (line 5). Classic and normal cars are distinguished by making the lines blue and red respectively. We have also centred the ICE Plot (line 3). We have restricted the plot to 100 observations by only plotting 10% of the individual lines (line 2).


We can also use the ICEbox package to plot derivative PDPs. We create an iceplot object for repairs in the same way as before (lines 1–4). We then use this to create a dice object (line 5). Finally, we plot the dice object (lines 7–10). We have set plot_sd = F (line 10). This hides the standard deviation of the individual prediction lines.


The last plot we create with this package is a PDP for the binary target variable. The code is similar to before. The only other difference, besides using rf_binary, is that we pass in a prediction function (lines 4–6). This lets the ice function know that our predictions are given in terms of probabilities.


The ICEbox package has some advantages. By colouring an ICE Plot with another feature, it allows us to clearly see interactions. It is also the only package here that has implemented derivative PDPs. In terms of limitations, it does not handle categorical features. This means we cannot create plots for the car_type feature. It has also not implemented PDPs for 2 features.

Package — iml

The next package, iml, can address some of these limitations. We’ll start by using it to create a PDP for car_age. We create a predictor object using our random forest and dataset (line 4). Using this, we then create a feature effect object (lines 7–9). We are using “pdp” as the feature effect method (line 9). Finally, we plot this feature effect object (line 10).


We use similar code to create an ICE Plot for car_age. We use the same predictor object as before (line 1). We now set the method to “pdp+ice” (line 3). This will give us a combined PDP and ICE Plot. We have also centred the plot so all prediction lines start at 0 (line 4). Setting the method to “ice” would remove the yellow PDP.


The iml package can also handle categorical features. Below we create a PDP for car_type (lines 2–5). This gives us the bar chart below. Similarly, we create an ICE Plot for car_type (lines 8–11). This gives us the boxplots.


Another advantage of iml is it has implemented PDPs for 2 features. Below we create a PDP for car_age and km_driven. The code is similar to before. The only difference is we pass a vector with both of the feature names (line 2).


Lastly, we create a PDP for our binary target variable. We create the predictor object with rf_binary (line 1) but the rest of the code is the same as before. You can see we now have two plots. Notice that they are inverses of each other. This is because the first plot gives the probability that the car is below average (0), and the second gives the probability that it is above average (1). Having a plot for each class is useful when your target variable has more than 2 values.


Package — vip

Neither of the above packages has implemented feature importance scores. To calculate those we use the vip package (line 1). It is straightforward to create the PDP-based scores (line 4) and ICE-based scores (line 7). We have set the method to “firm”. This stands for feature importance ranking measure.


In this section, we walk you through Python code used to create PDPs and ICE Plots. You can also find this on GitHub. We will be using scikit-learn’s implementation, PartialDependenceDisplay. An alternative package, which we won’t discuss, is PDPbox.

We start by importing Python packages. We import some common packages for handling and visualising data (lines 2–4). We have two different modelling packages — RandomForestRegressor (line 6) and xgboost (line 7). Lastly, we have the scikit-learn functions used to create PDPs and ICE Plots (lines 9–10). The first, PartialDependenceDisplay, is used to visualise the plots. The second, partial_dependence, returns the predictions used to create the plots. This comes in handy later on when we explore the categorical features.

These functions are provided by scikit-learn, but you can still use them with models that were not created with a scikit-learn package. You will see this later when we model a binary target variable with the xgboost package.

Continuous target variable

We’ll start with the continuous target variable. We load our dataset (line 2). This is the same one we discussed in Table 1 at the beginning of the article. We get our target variable (line 5) and features (line 6). We use these to train a random forest (lines 9–10). Specifically, the random forest is made up of 100 trees. Each tree has a maximum depth of 4.
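As a rough sketch of this setup (the file name and exact column names are assumptions based on the features in Table 1):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay, partial_dependence

# load the second-hand car dataset (hypothetical file name)
data = pd.read_csv("car_sales.csv")

y = data["price"]
X = data[["car_age", "km_driven", "repairs", "owner_age", "car_type"]]

# random forest with 100 trees, each with a maximum depth of 4
rf = RandomForestRegressor(n_estimators=100, max_depth=4)
rf.fit(X, y)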

We can now use PartialDependenceDisplay to understand how this model works. To start, we will create a PDP for car_age. To do this we use the from_estimator function. We pass in the model, the X feature matrix and the feature name. You can see the output below the code.
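In outline, continuing from the setup above:

# PDP for car_age (kind='average' is the default)
PartialDependenceDisplay.from_estimator(rf, X, ["car_age"])
plt.show()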


To create an ICE Plot, we need to set the kind parameter to ‘individual’ (line 4). We have also centred the plot (line 6). By default, these functions only display the feature between its 5th and 95th percentiles. You need to change this if you want to display the full range of car_age, that is, from 0 to 40. We do this on line 5.
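A sketch of these options (the centered parameter requires scikit-learn 1.1 or later):

PartialDependenceDisplay.from_estimator(
    rf, X, ["car_age"],
    kind="individual",    # individual prediction lines (ICE)
    percentiles=(0, 1),   # plot the full range of car_age
    centered=True         # start every line at 0
)
plt.show()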


We explore some other options below. Setting the kind parameter to ‘both’ will display both the PDP and ICE Plot (line 4). We can also use the ice_lines_kw (line 4) and pd_line_kw (line 5) parameters to change the style of the ICE Plot and PDP lines respectively.
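For example (the colours and line widths here are placeholder choices):

PartialDependenceDisplay.from_estimator(
    rf, X, ["car_age"],
    kind="both",                                   # ICE lines plus the PDP
    ice_lines_kw={"color": "grey", "alpha": 0.3},  # style of the individual lines
    pd_line_kw={"color": "gold", "linewidth": 3},  # style of the average (PDP) line
)
plt.show()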


Creating a PDP of 2 features is similar to before. All we need to do is pass an array of feature names instead of an individual feature name. You can see how we do this for car_age and km_driven below.
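A sketch:

# 2-feature PDP: pass a pair of feature names
PartialDependenceDisplay.from_estimator(rf, X, [("car_age", "km_driven")])
plt.show()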


A downside of the scikit-learn implementation is that it treats categorical features as continuous features. You can see the result of this below. The PDP and ICE Plots are given by lines. Bar charts and boxplots would give a better interpretation.


We can get around this by creating our own plots. We do this using the partial_dependence function (line 3). We use this function in the same way as the from_estimator function. The difference is that it does not display a plot. It only returns the data used to create the plots. pd_ice will contain the data for both the PDP and ICE Plot of car_type.

You can see how we can use this to create a PDP below. We start by getting the averages from pd_ice (line 2). This will be the average prediction for normal and classic cars. Using these, we plot a bar chart (lines 5–14).
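A sketch of both steps. car_type only takes the values 0 and 1, so the grid has two points; depending on your scikit-learn version, the grid itself is returned under 'values' or 'grid_values'.

# PDP and ICE data for car_type
pd_ice = partial_dependence(rf, X, ["car_type"], kind="both")

# average predicted price for each car type (0 = normal, 1 = classic)
avg = pd_ice["average"][0]

plt.bar(["normal (0)", "classic (1)"], avg)
plt.ylabel("average predicted price")
plt.show()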


For an ICE Plot, we can create a boxplot instead. We get the individual predictions from pd_ice (line 2). We then split these into the normal and classic car predictions (lines 4–6). Finally, we plot our boxplot (lines 9–14).
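Continuing from pd_ice above:

# individual predictions: one row per observation, one column per car_type value
ice = pd_ice["individual"][0]

normal = ice[:, 0]    # predicted prices with car_type set to 0
classic = ice[:, 1]   # predicted prices with car_type set to 1

plt.boxplot([normal, classic], labels=["normal (0)", "classic (1)"])
plt.ylabel("predicted price")
plt.show()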


Binary target variable

To end, we will create an ICE Plot for a binary target variable. To start, we create the variable (lines 2–3). It has a value of 1 if the original car price is above average and 0 if it is below average. We train a model using this binary variable (lines 6–7). This time we have used an XGBClassifier with a max_depth of 2 and 100 trees.

We plot an ICE Plot for car_age just as before. You will notice that the y-axis now gives (centred) probabilities instead of prices. You can also see that it is straightforward to use the scikit-learn packages with other modelling packages.
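A sketch of this final example, continuing from the setup above (the xgboost import is included here for completeness):

import xgboost as xgb

# 1 if the price is above average, 0 otherwise
y_binary = (data["price"] > data["price"].mean()).astype(int)

# XGBoost classifier with 100 trees of maximum depth 2
model_xgb = xgb.XGBClassifier(n_estimators=100, max_depth=2)
model_xgb.fit(X, y_binary)

# centred ICE Plot for car_age; the y-axis is now a probability
PartialDependenceDisplay.from_estimator(
    model_xgb, X, ["car_age"],
    kind="individual",
    centered=True
)
plt.show()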


The downside to using Python is that there are no implementations (that I know of) for derivative PDPs or the feature importance scores. That said, for most problems, the above charts are all you’ll need. If you need the other methods you can implement them yourself using the partial_dependence function.

I hope you found this article useful! I really do want it to be the ULTIMATE guide to PDPs and ICE Plots. If you think I’ve missed anything please reach out in the comments. Also, let me know if anything is unclear. I’d be happy to update the article 🙂


The intuition, maths and code (R and Python) behind partial dependence plots and individual conditional expectation plots

(source: author)

Both PDPs and ICE plots can help us understand how our models make predictions. Using PDPs we can visualise the relationship between model features and the target variable. They can tell us if a relationship is linear, non-linear or if there is no relationship. Similarly, ICE Plots can be used when there are interactions between features. We will go into depth on these two methods.

We start with PDPs. We will take you step-by-step through how PDPs are created. You will see that they are an intuitive method. Even so, we will also explain the mathematics behind PDPs. We then move on to ICE Plots. You will see that, with a good understanding of PDPs, these are easy to understand. Along the way, we discuss different applications and variations of these methods, including:

  • PDPs and ICE Plots for continuous and categorical features
  • PDPs for binary target variables
  • PDPs for 2 model features
  • Derivative PDP
  • Feature importance based on PDPs and ICE Plots

We make sure to discuss the advantages and limitations of both methods. These are important sections. They help us understand when the methods are most appropriate. They also tell us how they can lead to incorrect conclusions in some situations.

We end by walking you through both R and Python code for these methods. For R, we apply the ICEbox and iml packages to create visualisations. We also use vip to calculate feature importance. For Python, we will be using scikit-learn’s implementation, PartialDepenceDisplays. You can find links to the GitHub repos with code in these sections.

We start with a step-by-step walk-through of PDPs. To explain this method, we randomly generated a dataset with 1000 rows. It contains details on the sales of second-hand cars. You can see the features in Table 1. We want to predict the price of a car using the first 5 features. You can find this dataset on Kaggle.

Table 1: overview of second-hand car sales dataset (source: author) (dataset: kaggle)(licence: CC0)

To predict price we first need to train a model using this dataset. In our case, we trained a random forest with 100 trees. The exact model is not important as PDPs are a model agnostic method. We will see now that they are built using model predictions. We do not consider the inner workings of a model. This means the visualisations we explore will be similar for random forests, XGBoost, neural networks, etc…

In Table 2, we have one observation in our dataset used to train the model. In the last column, we can see the predicted price for this car. To create a PDP, we vary the value for one of the features and record the resulting predictions. For example, if we changed car_age we would get a different prediction. We do this while holding the other features constant at their real values. For example, owner_age will remain at 19 and km_drive will remain at 27,544.

Table 2: training data and prediction example (source: author)

Looking at Figure 1, we can see the result of this process. This shows the relationship between the predicted price (partial yhat) and car_age for this specific observation. You can see that as car_age increases the predicted price decreases. The black point gives the original predicted price (4,654) and car_age (4.03).

Figure 1: predicted price for varying car_age for observation 1 (source: author)

We then repeat this process for every observation or subset of observations in our dataset. In Figure 2, you can see the prediction lines for 100 observations. To be clear, for each observation we have only varied car_age. We hold the remaining feature constant at their original values. These will not have the same values as the observation we saw in Table 2. This explains why each line starts at a different level.

Figure 2: prediction lines for 100 observations (source: author)

To create a PDP, the final step is to calculate the average prediction at each value for car_age. This gives us the bold yellow line in Figure 3. This line is the PDP. By holding the other features constant and averaging over the observations we are able to isolate the relationship with car_age. We can see that the predicted price tends to decrease with car_age.

Figure 3: partial dependence plot for car age (source: author)

You may have noticed the short lines on the x-axis. These are the quartiles of car_age. That is 10% of car_age values are less than the first line and 90% are less than the last line. This is known as the rug plot. It shows us the distribution of the feature. In this case, the values of car_age are fairly evenly distributed across its range. We will understand why this is useful when we discuss the limitations of PDPs.

You may also have noticed that not all the individual prediction lines follow the PDP trend. Some of the higher lines seem to increase. This suggests that, for those observations, price has the opposite relationship with car_age. That is the predicted prices increase as car_age increases. Keep this in mind. We will come back to it later when we discuss ICE Plots.

Mathematics behind PDPs

For the more mathematically minded, there is also a formal definition of PDPs. Let’s start with the function for PDP we just created. This is given by Equation 1. Set C will contain all the features excluding car_age. For a given car_age value and observation i, we find the predicted price using the original values for the features in C. We do this for all 100 observations and find the average. This equation will give us the bold yellow line we saw in Figure 3.

Equation 1: PD function for car_age (source: author)

In Equation 2, we have generalised the above equation. This is the PD function for a set of features S. C will contain all the features excluding those in S. We now also average over n observations. This can be up to the total number of observations in the dataset (i.e. 1000).

Equation 2: approximated PD function for feature set S (source: author)

Until now, we have only discussed cases when S consists of one feature. In the previous section, we had S = {car_age}. Later, we will show you a PDP of 2 features. We will generally not include more than 2 features in S. Otherwise, it becomes difficult to visualise the PD function.

The above equation is actually only an approximation of the PD function. The true mathematical definition is given in Equation 3. For given values in set S, we find the expected prediction w.r.t. the set C. To do this, we need to integrate our model function w.r.t. the probability of observing the values in set C. To fully understand this equation you will need some experience with stochastic calculus.

Equation 3: PD function for feature set S (source: author)

When working with PDPs an understanding of the approximation is good enough. The true PD function is not practical to implement. Firstly, it will be more computationally expensive to implement than the approximation. Secondly, given our limited number of observations, it is only possible to approximate the probabilities. That is we can not find the true probabilities of each observation. Lastly, many models will not be continuous functions making them difficult to integrate.

PDPs for continuous features

With an understanding of how we create PDPs we will move on to using them. Using these plots we can understand the nature of the relationship between a model feature and the target variable. For example, we have already seen the car_age PDP in Figure 4. The predicted price decreases at a fairly constant rate. This suggests that car_age has a linear relationship with price.

Figure 4: car_age PDP (source: author)

In Figure 5, we can see the PDP for another feature, repairs. This is the number of repairs/services the car received. Initially, the predicted price tends to increase with the number of repairs. We would expect a reliable car to have received some regular maintenance. Then, at around 6/7 repairs, the price tends to decrease. Excessive repairs may indicate that something is wrong with the car. From this, we can see that price has a non-linear relationship with repairs.

Figure 5: repairs PDP (source: author)

From the above, we can see that PDPs are useful when it comes to visualising non-linear relationships. We explore this topic in more depth in the article below. When dealing with many features it may be impractical to look at all the PDPs. So we also discuss using mutual information and feature importance to help find non-linear relationships.

In Figure 6, we can see an example of a PDP when a feature has no relationship with the target variable. For owner_age the PDP is constant. This tells us that the predicted price does not change when we vary owner_age. Later, we will see how we can use this idea of PDP variation to create a feature importance score.

Figure 6: owner_age PDP (source: author)

PDPs for categorical features

The features we have discussed above have all been continuous. We can also create PDPs for categorical features. For example, see the plot for car_type in Figure 7. Here we calculate the average prediction for each type of car — normal (0) or classic (1). Instead of a line, we visualise these with a histogram. We can see that classic cars tend to be sold at a higher price.

Figure 7: car_type PDP (source: author)

PDP for a binary target variable

PDPs for binary target variables are similar to those with continuous targets. Suppose we want to predict if the car price is above (1) or below (0) average. We build a random forest to predict this binary variable using the same features. We can create PDPs for this model using the same process as before. Except now our prediction is a probability.

For example, in Figure 8, you can see the PDP for car_age. We now have a predicted probability on the y-axis. This is the probability that the car is above the average price for a second-hand car. We can see that the probability tends to decrease with the age of the car.

Figure 8: PDP for a binary target variable (source: author)

PDP of 2 features

Going back to our continuous target variable. We can also visualise the PDP of two features. In Figure 9, we can see the average predictions at different combinations of km_driven and car_age. This chart is created in the same way as the PDP of one feature. That is by keeping the remaining features at their original values.

Figure 9: 2 feature PDP (source: author)

These PDPs are useful for visualising interactions between features. The above chart suggests a possible interaction between km_driven and car_age. That is the predicted price tends to be lower when both features have larger values. You should be cautious when drawing these types of conclusions.

This is because we can get similar results if the two features are correlated. Later, when we discuss the limitations of PDPs, will see that this is actually the case. That is km_driven is correlated with car_age. The amount the car has driven tends to be higher when the car is older. This is why we see a lower predicted price when both features are higher.

Derivative PDP

Derivative PDP is a variation of a PDP. It shows the slope/derivative of a PDP. It can be used to get a better understanding of the original PDP. However, in most cases, the insight we can gain from these plots is limited. They are generally more useful for non-linear relationships.

For example, take the derivative PDP for repairs in Figure 10. This is the derivative of the line we saw earlier in Figure 5. We can see that the derivative is 0 at around 6 repairs. At this point, the derivative changes from positive to negative. In other words, the original PDP changes from increasing to decreasing w.r.t. repairs. This tells us that after 6 repairs a car’s price will tend to decrease.

Figure 10: derivative PDP for repairs (source: author)

PDP feature importance

To end this section, we have a feature importance score based on PDPs. This is done by determining the “flatness” of the PDP for each feature. Specifically, for continuous variables, we calculate the standard deviation of the values of the plot. For categorical variables, we estimate the SD by first taking the range. That is the maximum less the minimum PDP value. We then divide the range by 4. This calculation comes from a concept called the range rule.

You can see the PDP-based feature importance for our features in Figure 11. Notice that the score for owner_age is relatively low. This makes sense if we think back to the PDP in Figure 6. We saw that the PDP was relatively constant. In other words, the y-axis values are always close to their mean value. They have a low standard deviation.

Figure 11: feature importance based on PDP (source: author)

There are better-known feature importance scores such as permutation feature importance. You may prefer to use this PDP-based score as it can provide some consistency. If you are analysing feature trends you can now use a feature importance score calculated using similar logic. You can also avoid explaining the logic of two different methods.

With PDPs out of the way, let’s move on to ICE Plots. You will be happy to know that we’ve already discussed the process of creating them. Take Figure 12 below. This is the plot we created just before the car_age PDP in Figure 3. It is an ICE Plot for car_age. ICE Plots are made up of the prediction lines for each individual observation.

Figure 12: ICE Plot for car_age (source: author)

ICE plots are useful when there are interactions in your model. That is if the relationship of a feature with the target variable depends on the value of another feature. It may be hard to see this in the above chart. To make things clearer we can centre our ICE plot. In Figure 13 we have done this by making all the prediction lines start at 0. It is now clear that for some observations the predicted price tends to increase with car_age.

Figure 13: centred ICE Plot (source: author)

To understand what is causing this behaviour we can change the colour of the ICE Plot. In Figure 14, we have changed the colour based on car_type. We make the lines blue for classic cars and red for normal cars. We can now see that this relationship comes from an interaction between car_age and car_type. Intuitively, it makes sense that a classic car would increase in value as it got older.

Figure 14: coloured ICE Plot (source: author)

A final addition is to add the PDP line to the plot. In this way, we can combine ICE Plots and PDPs. This can emphasise how some of the observations deviate from the average trend. We can see that if we had only relied on the PDP we would have missed this interaction. That is when using the PDP in Figure 3, we concluded that price tends to decrease with age for all cars.

Figure 15: combined PDP and ICE Plot (source: author)

Like with PDPs, ICE Plots can help us visualise important relationships in our data. To find those relationships we may need to use metrics like feature importance. An alternative metric used specifically to highlight interactions in a model is the Friedman’s H-statistic. We discuss using all of these methods in the article below.

ICE Plots of categorical features

We can use boxplots to visualise ICE Plots of categorical features. For example, we have the ICE Plot for car_type in Figure 16. The bold lines in the middle of the boxes give the average prediction. In other words, they are the PDPs. We can see the predicted price tends to be lower for normal cars (0).

Figure 16: car_type ICE plot (source: author)

ICE Plot-based feature importance

We can also calculate an ICE Plot-based feature importance score. This is similar to the PDP-based score except we no longer consider the average prediction line. We now calculate the score using the individual prediction lines. This means the score will consider interactions between features.

We have the ICE Plot-based scores for our model in Figure 17. We can compare these to the PDP-based scores in Figure 11. The biggest difference is the score for car_age is now larger. It increased from 300 to 345. This makes sense in light of the analysis we did above. We saw that there is an interaction that impacts the relationship between car_age and price. The PDP-based feature importance does not consider this interaction.

Figure 17: ICE plot-based feature importance (source: author)

By now, hopefully, we have a good understanding of PDPs and ICE Plots and the insights we can gain from them. We’re going to move on to discuss the advantages of these approaches. Then in the next section, we will discuss the limitations. It is crucial to understand these so that you do not draw incorrect conclusions from the plots.

Isolate feature trends

We can visualise relationships in our data using scatter plots but data is messy. For example, we can see the interaction between car_age and car_type in Figure 18. The points vary around the true underlying trend. This is because of statistical noise and the fact that price also has relationships with other features. In a real dataset, this problem will likely be worse. Ultimately, it can be difficult to see trends in data.

Figure 18: scatter plot of car_age and car_type interaction (source: author)

Working with PDPs and ICE Plots, we are no longer working with raw data values. We are working with model predictions. If built correctly, a model will capture underlying relationships in our data and ignore statistical noise. We can then isolate the trend of a particular feature. This is done by holding the other feature values constant and averaging over observations.

This is what makes PDPs and ICE Plots so useful. They allow us to strip out noise and the effect of other features. This makes it easier to see the underlying relationships in our data. In this sense, these methods can be used for data exploration and not just for understanding our models.

Straight forward to explain

Hopefully, by walking you through the process of building a PDP it was easy to understand. In this way, you can gain an intuitive understanding of the method without the need for a mathematical definition. This also means that the methods are easy to explain to a non-technical person. This can be useful in an industry setting.

Easy to implement

These methods are also easy to implement. We just need to vary feature values and record the resulting predictions. We do not even need to consider the inner workings of the model. This means the same implementation can be used with any model. When we discuss the R and Python code you will see that there are already good implementations for these methods.

Assume feature independence

Moving on to limitations, we will start with the main problem with these methods. This is that they assume features are independent. This is not always the case as features can be correlated or associated. For example, take the scatter plot of km_driven and car_age in Figure 19. There is a clear correlation. Intuitively this makes sense. Older cars will tend to have driven longer distances.

Figure 19: scatter plot of km_driven vs car_age (source: author)

The issue is that when we build a prediction line for an observation we will sample all possible values of a feature. For example, suppose we want to build a PDP for km_driven. Take the observation given in red in Figure 20. It has a car_age of 10. To build the prediction line we will vary km_driven for all values in the dotted oval. However, in reality, observations with this car_age will only have driven distances within the solid oval.

Figure 20: issues with random sampling (source: author)

Our model was not trained on observations outside of the solid oval. Still, we are creating the PDP with predictions from these observations. The result is that the prediction line is built on observations that the model has not “seen” before. This can produce unintuitive results and lead to incorrect conclusions about feature trends.

Equal focus on all feature values

Even if features are uncorrelated, we can still come to incorrect conclusions. For each observation, we are sampling over all possible feature values. This gives equal weight to all values. In reality, some values of the feature will be less common. Such as at the extremes of the feature’s distribution. There will be more uncertainty about the trend at these values.

Considering this, it is common to include a rug plot. These help us understand the distribution of the feature. Before we saw a quantile version of a rug plot. Figure 21, gives an alternative version. Here we have an individual line for each observation. We can see that there are fewer observations for higher values of km_driven. We should, therefore, interpret the trend at these values with more care.

Figure 21: alternative PDP for km_driven (source: author)

Conclusions depend on your model

As mentioned in the advantages, working with model predictions can help us see relationships more clearly. The issue is that models can make incorrect predictions. An underfitted model can miss important relationships. By modelling noise, an overfitted model can present relationships that are not really there. Ultimately, the conclusions we draw will depend on our model. It is important to consider the performance of the model.

We may still have issues even with an accurate model. A model can ignore some relationships in favour of others. We can make the incorrect conclusion that the ignored features do not have relationships with the target variable. This means that, when doing data exploration, you may want to restrict the features to the subset you are interested in exploring.

PDPs ignore interactions

As mentioned before, by using an average, the PDPs can miss interactions. As a result, the PDP-based feature importance will also miss these interactions. This means that, if interactions are present, the score can underestimate the importance of a feature. We saw this with the car_age feature. One solution is to use ICE Plots or just stick to permutation feature importance.

Limitations of implementations

In the next sections, we will walk you through the code used to implement these methods. We will see that each implementation has its own pros and cons. Some packages will not have implementations for all the plots we discussed. For example, if you are working with Python there is no implementation (that I know of) for derivative PDPs or feature importance.

In this section, we will walk you through R code used to create PDPs and ICE Plots. We will look at using three different packages. We use ICEbox and iml to create the plots. Combined these allow us to create all the plots we discussed above. For the feature importance scores, we use vip. You can find all the code we discuss on GitHub.

We start by loading our dataset (line 1). This is the same one we discussed in Table 1 at the beginning of the article. We also set car_type to a categorical feature (line 2).

Modelling

Before we create PDPs and ICE Plots we need a model. We use the randomForest package to do this (line 1). We build a model using price and the 6 features (lines 4–6). Specifically, we have used a random forest with 100 trees (line 6).

For most of the plots below, we will be using the model, rf. This has been trained on the continuous target variable. We also want to build a model, rf_binary, on a binary target variable. This is to show you how the code and output differ for the different types of targets.

To start, we create our binary target variable. It has a value of 1 if the original car price is above average and 0 if it is below average (lines 2–4). We then build a random forest just as before (lines 7–9). With these models, we can now move on to using PDPs and ICE Plots to understand how they work. As we go forward, the output will be displayed below the relevant code.

Package — ICEbox

We will start with the ICEbox package (line 1). We will use it to create a PDP for car_age. We create an iceplot object for car_age using the ice function (lines 4–7). We pass our model, features and target variable (lines 4–6). This object will contain all the individual prediction lines for car_age. It will also contain the average prediction line (i.e. the PDP).

We then use the plot function to display the iceplot object (lines 9–11). By default, this package will always display ICE Plots. To create a PDP, we need to hide the individual prediction lines. We do this by making them all white (line 11). We also hide the points that give the original car_age values (line 10).

(source: author)

To create an ICE Plot we can use the same iceplot object. Now, instead of hiding the individual lines, we colour them by their car_type (line 5). Classic and normal cars are distinguished by making the lines blue and red respectively. We have also centred the ICE Plot (line 3). We have restricted the plot to 100 observations by only plotting 10% of the individual lines (line 2).

(source: author)

We can also use the ICEbox package to plot derivative PDPs. We create an iceplot object for repairs in the same way as before (lines 1–4). We then use this to create a dice object (line 5). Finally, we plot the dice object (lines 7–10). We have set plot_sd = F (line 10), which hides the standard deviation of the individual prediction lines.

(source: author)

The last plot we create with this package is a PDP for the binary target variable. The code is similar to before. The only other difference, besides using rf_binary, is that we pass in a prediction function (lines 4–6). This lets the ice function know that our predictions are given in terms of probabilities.

(source: author)

The ICEbox package has some advantages. By colouring an ICE Plot with another feature, it allows us to clearly see interactions. It is also the only package that has implemented derivative PDPs. In terms of limitations, it does not handle categorical features. This means we cannot create plots for the car_type feature. It has also not implemented PDPs for 2 features.

Package — iml

The next package, iml, can address some of these limitations. We’ll start by using it to create a PDP for car_age. We create a predictor object using our random forest and dataset (line 4). Using this, we then create a feature effect object (lines 7–9). We are using “pdp” as the feature effect method (line 9). Finally, we plot this feature effect object (line 10).

(source: author)

We use similar code to create an ICE Plot for car_age. We use the same predictor object as before (line 1). We now set the method to “pdp+ice” (line 3). This will give us a combined PDP and ICE Plot. We have also centred the plot so all prediction lines start at 0 (line 4). Setting the method to “ice” would remove the yellow PDP.

(source: author)

The iml package can also handle categorical features. Below, we create a PDP for car_type (lines 2–5), which gives us the histogram. Similarly, we create an ICE Plot for car_type (lines 8–11), which gives us the boxplots.

(source: author)

Another advantage of iml is it has implemented PDPs for 2 features. Below we create a PDP for car_age and km_driven. The code is similar to before. The only difference is we pass a vector with both of the feature names (line 2).

(source: author)

Lastly, we create a PDP for our binary target variable. We create the predictor object with rf_binary (line 1) but the rest of the code is the same as before. You can see we now have two plots. Notice that they are inverses of each other. This is because the first plot gives the probability that the car is below average (0) and the second plot gives the probability that it is above average (1). Having plots for each value is useful when your target variable has more than 2 values.

(source: author)

Package — vip

Neither of the above packages has implemented feature importance scores. To calculate these we use the vip package (line 1). It is straightforward to create the PDP-based scores (line 4) and ICE-based scores (line 7). We have set the method to “firm”, which stands for feature importance ranking measure.

(source: author)

In this section, we walk you through Python code used to create PDPs and ICE Plots. You can also find this on GitHub. We will be using scikit-learn’s implementation, PartialDependenceDisplay. An alternative package, which we won’t discuss, is PDPbox.

We start by importing Python packages. We import some common packages for handling and visualising data (lines 2–4). We have two different modelling packages: RandomForestRegressor (line 6) and xgboost (line 7). Lastly, we have the imports used to create PDPs and ICE Plots (lines 9–10). The first, PartialDependenceDisplay, is used to visualise the plots. The second, partial_dependence, is used to get the predictions used to create the plots. This comes in handy later on when we explore the categorical features.
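The original code is embedded as gists that do not appear in this repost, so the snippet below is a reconstruction of these imports; the aliases are assumptions.

```python
import pandas as pd                    # handling data
import numpy as np
import matplotlib.pyplot as plt        # visualising data

from sklearn.ensemble import RandomForestRegressor   # model for the continuous target
import xgboost as xgb                                 # model for the binary target

from sklearn.inspection import PartialDependenceDisplay  # draws PDPs and ICE Plots
from sklearn.inspection import partial_dependence        # returns the underlying predictions
```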

The PDP packages are provided by scikit-learn. You can still use them with models that were not created with a scikit-learn package. You will see this later when we model a binary target variable with the xgb package.

Continuous target variable

We’ll start with the continuous target variable. We load our dataset (line 2). This is the same one we discussed in Table 1 at the beginning of the article. We get our target variable (line 5) and features (line 6). We use these to train a random forest (lines 9–10). Specifically, the random forest is made up of 100 trees. Each tree has a maximum depth of 4.
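A minimal sketch of this step, assuming the dataset is saved locally as a CSV with the Table 1 columns already in numeric form (the file name is hypothetical):

```python
df = pd.read_csv("car_sales.csv")   # hypothetical file name

y = df["price"]                     # target variable
X = df.drop("price", axis=1)        # model features

# Random forest with 100 trees, each with a maximum depth of 4
rf = RandomForestRegressor(n_estimators=100, max_depth=4)
rf.fit(X, y)
```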

We can now use the PartialDependenceDisplay package to understand how this model works. To start we will create a PDP for car_age. To do this we use the from_estimator function. We pass in the model, X feature matrix and the feature name. You can see the output below the code.
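In code, this is a single call (a sketch reusing the rf model and X feature matrix from above):

```python
# PDP for car_age: the average prediction over a grid of car_age values
PartialDependenceDisplay.from_estimator(rf, X, ["car_age"])
plt.show()
```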

(source: author)

To create an ICE Plot, we need to set the kind parameter to ‘individual’ (line 4). We have also centred the plot (line 6). By default, these functions only display the 5% to 95% percentiles of the feature. You need to change this if you want to display the full range of car_age, that is, from 0 to 40. We do this on line 5.
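Something along these lines, where the keyword arguments are scikit-learn’s own (centered needs a fairly recent release):

```python
# ICE Plot for car_age over its full range, with every line starting at 0
PartialDependenceDisplay.from_estimator(
    rf, X, ["car_age"],
    kind="individual",    # individual prediction lines only
    percentiles=(0, 1),   # show the full range of car_age (0 to 40)
    centered=True,        # centre each ICE line at its first value
)
plt.show()
```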

(source: author)

We explore some other options below. Setting the kind parameter to ‘both’ will display both the PDP and ICE Plot (line 4). We can also use the ice_lines_kw (line 4) and pd_line_kw (line 5) parameters to change the style of the ICE Plot and PDP lines respectively.
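For example (the particular colours and widths are just illustrative choices):

```python
# Combined PDP and ICE Plot with custom line styles
PartialDependenceDisplay.from_estimator(
    rf, X, ["car_age"],
    kind="both",
    ice_lines_kw={"color": "lightblue", "alpha": 0.3},  # ICE line style
    pd_line_kw={"color": "red", "linewidth": 3},        # PDP line style
)
plt.show()
```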

(source: author)

Creating a PDP of 2 features is similar to before. All we need to do is pass an array of feature names instead of an individual feature name. You can see how we do this for car_age and km_driven below.
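Roughly, the pair of features is passed as a tuple inside the feature list:

```python
# Two-way PDP for car_age and km_driven
PartialDependenceDisplay.from_estimator(rf, X, [("car_age", "km_driven")])
plt.show()
```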

(source: author)

A downside of the scikit-learn package is that it treats categorical features as continuous features. You can see the result of this below. The PDP and ICE Plots are given by lines. Histograms and boxplots would give a better interpretation.

(source: author)

We can get around this by creating our own plots. We do this using the partial_dependence function (line 3). We use this function in the same way as the from_estimator function. The difference is that it does not display a plot; it only returns the data used to create the plots. pd_ice will contain the data for both the PDP and ICE Plots of car_type.

You can see how we can use this to create a PDP below. We start by getting the averages from pd_ice (line 2). This will be the average prediction for normal and classic cars. Using these, we plot a bar chart (lines 5–14).
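A sketch of both steps is below. Which encoded value of car_type corresponds to normal or classic cars is an assumption, and the grid of feature values is stored under the key 'values' in older scikit-learn releases and 'grid_values' in newer ones.

```python
# Get the PDP and ICE data for car_type without drawing anything
pd_ice = partial_dependence(rf, X, ["car_type"], kind="both")

# Average prediction for each car_type value (i.e. the PDP)
averages = pd_ice["average"][0]

# Bar chart with one bar per car type (label order is an assumption)
fig, ax = plt.subplots()
ax.bar(["normal", "classic"], averages)
ax.set_xlabel("car_type")
ax.set_ylabel("partial yhat")
plt.show()
```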

(source: author)

For an ICE Plot, we can create a boxplot instead. We get the individual predictions from pd_ice (line 2). We then split these into the normal and classic car predictions (lines 4–6). Finally, we plot our boxplot (lines 9–14).
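A sketch, again assuming car_type is encoded as 0 (normal) and 1 (classic):

```python
# Individual (ICE) predictions: one row per observation, one column per car_type value
ice = pd_ice["individual"][0]

normal_preds = ice[:, 0]    # predictions with every car set to type 0
classic_preds = ice[:, 1]   # predictions with every car set to type 1

fig, ax = plt.subplots()
ax.boxplot([normal_preds, classic_preds], labels=["normal", "classic"])
ax.set_xlabel("car_type")
ax.set_ylabel("partial yhat")
plt.show()
```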

(source: author)

Binary target variable

To end, we will create an ICE Plot for a binary target variable. To start we create the variable (lines 2–3). It has a value of 1 if the original car price is above average and 0 if it is below average. We train a model using this binary variable (lines 6–7). This time we have used an XGBClassifier with a max_depth of 2 and 100 trees.

We create an ICE Plot for car_age just as before. You will notice that the y-axis now gives (centred) probabilities instead of prices. You can also see that it is straightforward to use the scikit-learn plotting functions with other modelling packages.
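Putting these steps together, a sketch looks like this (the hyperparameters follow the text above):

```python
# Binary target: 1 if the car's price is above average, 0 otherwise
y_binary = (y > y.mean()).astype(int)

# XGBoost classifier with 100 trees, each with a maximum depth of 2
model = xgb.XGBClassifier(max_depth=2, n_estimators=100)
model.fit(X, y_binary)

# Centred ICE Plot for car_age; the y-axis is now a (centred) probability
PartialDependenceDisplay.from_estimator(
    model, X, ["car_age"],
    kind="individual",
    centered=True,
)
plt.show()
```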

(source: author)

The downside to using Python is that there are no implementations (that I know of) for derivative PDPs or the feature importance scores. Although, for most problems, the above charts are all you’ll need. If you do need the other methods, you can implement them yourself using the partial_dependence function, as sketched below.
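For instance, a rough PDP-based score in the spirit of FIRM can be computed as the standard deviation of each feature's partial dependence curve. The sketch below is only an illustration of that idea, not a reimplementation of the vip package:

```python
# Rough PDP-based importance: how much the PDP varies for each feature.
# A flat curve (low standard deviation) suggests a less important feature.
def pdp_importance(model, X, features):
    scores = {}
    for feature in features:
        result = partial_dependence(model, X, [feature], kind="average")
        scores[feature] = np.std(result["average"][0])
    return scores

print(pdp_importance(rf, X, X.columns))
```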

I hope you found this article useful! I really do want it to be the ULTIMATE guide to PDPs and ICE Plots. If you think I’ve missed anything please reach out in the comments. Also, let me know if anything is unclear. I’d be happy to update the article 🙂
