
The Ultimate Guide to PDPs and ICE Plots | by Conor O’Sullivan | Jun, 2022



The intuition, maths and code (R and Python) behind partial dependence plots and individual conditional expectation plots


Both PDPs and ICE Plots can help us understand how our models make predictions. Using PDPs, we can visualise the relationship between model features and the target variable. They can tell us whether a relationship is linear, non-linear or absent. ICE Plots go a step further and are particularly useful when there are interactions between features. We will go into depth on both methods.

We start with PDPs. We will take you step-by-step through how PDPs are created. You will see that they are an intuitive method. Even so, we will also explain the mathematics behind PDPs. We then move on to ICE Plots. You will see that, with a good understanding of PDPs, these are easy to understand. Along the way, we discuss different applications and variations of these methods, including:

  • PDPs and ICE Plots for continuous and categorical features
  • PDPs for binary target variables
  • PDPs for 2 model features
  • Derivative PDP
  • Feature importance based on PDPs and ICE Plots

We make sure to discuss the advantages and limitations of both methods. These are important sections. They help us understand when the methods are most appropriate. They also tell us how they can lead to incorrect conclusions in some situations.

We end by walking you through both R and Python code for these methods. For R, we apply the ICEbox and iml packages to create visualisations. We also use vip to calculate feature importance. For Python, we will be using scikit-learn’s implementation, PartialDependenceDisplay. You can find links to the GitHub repos with code in these sections.

We start with a step-by-step walk-through of PDPs. To explain this method, we randomly generated a dataset with 1000 rows. It contains details on the sales of second-hand cars. You can see the features in Table 1. We want to predict the price of a car using the first 5 features. You can find this dataset on Kaggle.

Table 1: overview of second-hand car sales dataset (source: author) (dataset: kaggle)(licence: CC0)

To predict price we first need to train a model using this dataset. In our case, we trained a random forest with 100 trees. The exact model is not important as PDPs are a model agnostic method. We will see now that they are built using model predictions. We do not consider the inner workings of a model. This means the visualisations we explore will be similar for random forests, XGBoost, neural networks, etc…

In Table 2, we have one observation from the dataset used to train the model. In the last column, we can see the predicted price for this car. To create a PDP, we vary the value of one of the features and record the resulting predictions. For example, if we changed car_age we would get a different prediction. We do this while holding the other features constant at their real values. For example, owner_age will remain at 19 and km_driven will remain at 27,544.

Table 2: training data and prediction example (source: author)

Looking at Figure 1, we can see the result of this process. This shows the relationship between the predicted price (partial yhat) and car_age for this specific observation. You can see that as car_age increases the predicted price decreases. The black point gives the original predicted price (4,654) and car_age (4.03).

Figure 1: predicted price for varying car_age for observation 1 (source: author)

We then repeat this process for every observation (or a subset of observations) in our dataset. In Figure 2, you can see the prediction lines for 100 observations. To be clear, for each observation we have only varied car_age. We hold the remaining features constant at their original values. These will not have the same values as the observation we saw in Table 2. This explains why each line starts at a different level.

Figure 2: prediction lines for 100 observations (source: author)

To create a PDP, the final step is to calculate the average prediction at each value for car_age. This gives us the bold yellow line in Figure 3. This line is the PDP. By holding the other features constant and averaging over the observations we are able to isolate the relationship with car_age. We can see that the predicted price tends to decrease with car_age.

Figure 3: partial dependence plot for car age (source: author)

You may have noticed the short lines on the x-axis. These mark the deciles of car_age. That is, 10% of car_age values are less than the first line and 90% are less than the last line. This is known as a rug plot. It shows us the distribution of the feature. In this case, the values of car_age are fairly evenly distributed across its range. We will understand why this is useful when we discuss the limitations of PDPs.

You may also have noticed that not all the individual prediction lines follow the PDP trend. Some of the higher lines seem to increase. This suggests that, for those observations, price has the opposite relationship with car_age. That is the predicted prices increase as car_age increases. Keep this in mind. We will come back to it later when we discuss ICE Plots.
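Before moving on to the mathematics, here is a minimal sketch of this whole procedure in Python. It assumes a fitted model called model and a pandas feature matrix X containing a car_age column, as in the dataset described above.

import numpy as np

def pdp_line(model, X, feature, grid):
    """Average prediction at each grid value, holding the other features
    at their observed values (the bold average line in Figure 3)."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[feature] = value                         # set the feature to this value for every row
        averages.append(model.predict(X_mod).mean())   # average over the individual prediction lines
    return np.array(averages)

grid = np.linspace(X["car_age"].min(), X["car_age"].max(), num=50)
pdp_car_age = pdp_line(model, X, "car_age", grid)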

Mathematics behind PDPs

For the more mathematically minded, there is also a formal definition of PDPs. Let’s start with the function for the PDP we just created. This is given by Equation 1. Set C contains all the features excluding car_age. For a given car_age value and observation i, we find the predicted price using the original values of the features in C. We do this for all 100 observations and take the average. This equation gives us the bold yellow line we saw in Figure 3.

Equation 1: PD function for car_age (source: author)
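Reconstructed from the description above, the function has the standard partial dependence form:

\hat{f}_{\text{car\_age}}(x) = \frac{1}{100} \sum_{i=1}^{100} \hat{f}\left(x, x_C^{(i)}\right)

where \hat{f} is the trained model, x is a value of car_age and x_C^{(i)} gives the original values of the features in C for observation i.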

In Equation 2, we have generalised the above equation. This is the PD function for a set of features S. C will contain all the features excluding those in S. We now also average over n observations. This can be up to the total number of observations in the dataset (i.e. 1000).

Equation 2: approximated PD function for feature set S (source: author)
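In the same notation, the approximated PD function for a feature set S is:

\hat{f}_S(x_S) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}\left(x_S, x_C^{(i)}\right)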

Until now, we have only discussed cases when S consists of one feature. In the previous section, we had S = {car_age}. Later, we will show you a PDP of 2 features. We will generally not include more than 2 features in S. Otherwise, it becomes difficult to visualise the PD function.

The above equation is actually only an approximation of the PD function. The true mathematical definition is given in Equation 3. For given values of the features in set S, we find the expected prediction with respect to the features in set C. To do this, we integrate the model function over the probability distribution of the values in set C. To fully understand this equation you will need some familiarity with probability theory and integration.

Equation 3: PD function for feature set S (source: author)
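Written out, the expectation described above is:

f_S(x_S) = \mathbb{E}_{X_C}\left[\hat{f}(x_S, X_C)\right] = \int \hat{f}(x_S, x_C)\, d\mathbb{P}(x_C)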

When working with PDPs, an understanding of the approximation is good enough. The true PD function is not practical to implement. Firstly, it would be more computationally expensive than the approximation. Secondly, given our limited number of observations, we can only estimate the probability distribution of the features in C; we cannot know it exactly. Lastly, many models are not continuous functions, which makes them difficult to integrate.

PDPs for continuous features

With an understanding of how we create PDPs we will move on to using them. Using these plots we can understand the nature of the relationship between a model feature and the target variable. For example, we have already seen the car_age PDP in Figure 4. The predicted price decreases at a fairly constant rate. This suggests that car_age has a linear relationship with price.

Figure 4: car_age PDP (source: author)

In Figure 5, we can see the PDP for another feature, repairs. This is the number of repairs/services the car received. Initially, the predicted price tends to increase with the number of repairs. We would expect a reliable car to have received some regular maintenance. Then, at around 6/7 repairs, the price tends to decrease. Excessive repairs may indicate that something is wrong with the car. From this, we can see that price has a non-linear relationship with repairs.

Figure 5: repairs PDP (source: author)

From the above, we can see that PDPs are useful when it comes to visualising non-linear relationships. We explore this topic in more depth in the article below. When dealing with many features it may be impractical to look at all the PDPs. So we also discuss using mutual information and feature importance to help find non-linear relationships.

In Figure 6, we can see an example of a PDP when a feature has no relationship with the target variable. For owner_age the PDP is constant. This tells us that the predicted price does not change when we vary owner_age. Later, we will see how we can use this idea of PDP variation to create a feature importance score.

Figure 6: owner_age PDP (source: author)

PDPs for categorical features

The features we have discussed above have all been continuous. We can also create PDPs for categorical features. For example, see the plot for car_type in Figure 7. Here we calculate the average prediction for each type of car — normal (0) or classic (1). Instead of a line, we visualise these with a bar chart. We can see that classic cars tend to be sold at a higher price.

Figure 7: car_type PDP (source: author)

PDP for a binary target variable

PDPs for binary target variables are similar to those for continuous targets. Suppose we want to predict whether the car price is above (1) or below (0) average. We build a random forest to predict this binary variable using the same features. We can create PDPs for this model using the same process as before, except now our prediction is a probability.

For example, in Figure 8, you can see the PDP for car_age. We now have a predicted probability on the y-axis. This is the probability that the car is above the average price for a second-hand car. We can see that the probability tends to decrease with the age of the car.

Figure 8: PDP for a binary target variable (source: author)

PDP of 2 features

Going back to our continuous target variable, we can also visualise the PDP of two features. In Figure 9, we can see the average predictions at different combinations of km_driven and car_age. This chart is created in the same way as the PDP of one feature. That is, by keeping the remaining features at their original values.

Figure 9: 2 feature PDP (source: author)

These PDPs are useful for visualising interactions between features. The above chart suggests a possible interaction between km_driven and car_age. That is the predicted price tends to be lower when both features have larger values. You should be cautious when drawing these types of conclusions.

This is because we can get similar results if the two features are correlated. Later, when we discuss the limitations of PDPs, we will see that this is actually the case. That is, km_driven is correlated with car_age. The distance a car has driven tends to be higher when the car is older. This is why we see a lower predicted price when both features are higher.

Derivative PDP

A derivative PDP is a variation of a PDP. It shows the slope/derivative of the PDP. It can be used to get a better understanding of the original PDP. However, in most cases, the insight we can gain from these plots is limited. They are generally most useful for non-linear relationships.

For example, take the derivative PDP for repairs in Figure 10. This is the derivative of the line we saw earlier in Figure 5. We can see that the derivative is 0 at around 6 repairs. At this point, the derivative changes from positive to negative. In other words, the original PDP changes from increasing to decreasing w.r.t. repairs. This tells us that after 6 repairs a car’s price will tend to decrease.

Figure 10: derivative PDP for repairs (source: author)

PDP feature importance

To end this section, we have a feature importance score based on PDPs. This is done by determining the “flatness” of the PDP for each feature. Specifically, for continuous variables, we calculate the standard deviation of the values of the plot. For categorical variables, we estimate the SD by first taking the range. That is the maximum less the minimum PDP value. We then divide the range by 4. This calculation comes from a concept called the range rule.
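As a rough sketch of this calculation in Python, using scikit-learn's partial_dependence function (covered in the Python section later) to obtain the PDP values:

from sklearn.inspection import partial_dependence

def pdp_importance(model, X, feature, categorical=False):
    """PDP-based importance: how 'flat' the PDP for one feature is."""
    avg = partial_dependence(model, X, [feature], kind="average")["average"][0]
    if categorical:
        return (avg.max() - avg.min()) / 4   # range rule estimate of the standard deviation
    return avg.std()                         # standard deviation of the PDP values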

You can see the PDP-based feature importance for our features in Figure 11. Notice that the score for owner_age is relatively low. This makes sense if we think back to the PDP in Figure 6. We saw that the PDP was relatively constant. In other words, the y-axis values are always close to their mean value. They have a low standard deviation.

Figure 11: feature importance based on PDP (source: author)

There are better-known feature importance scores, such as permutation feature importance. Even so, you may prefer the PDP-based score because it is consistent with your analysis: if you are analysing feature trends with PDPs, the importance score is calculated using similar logic, and you avoid having to explain two different methods.

With PDPs out of the way, let’s move on to ICE Plots. You will be happy to know that we’ve already discussed the process of creating them. Take Figure 12 below. This is the plot we created just before the car_age PDP in Figure 3. It is an ICE Plot for car_age. ICE Plots are made up of the prediction lines for each individual observation.

Figure 12: ICE Plot for car_age (source: author)

ICE plots are useful when there are interactions in your model. That is if the relationship of a feature with the target variable depends on the value of another feature. It may be hard to see this in the above chart. To make things clearer we can centre our ICE plot. In Figure 13 we have done this by making all the prediction lines start at 0. It is now clear that for some observations the predicted price tends to increase with car_age.

Figure 13: centred ICE Plot (source: author)

To understand what is causing this behaviour we can change the colour of the ICE Plot. In Figure 14, we have changed the colour based on car_type. We make the lines blue for classic cars and red for normal cars. We can now see that this relationship comes from an interaction between car_age and car_type. Intuitively, it makes sense that a classic car would increase in value as it got older.

Figure 14: coloured ICE Plot (source: author)

A final addition is to add the PDP line to the plot. In this way, we can combine ICE Plots and PDPs. This emphasises how some of the observations deviate from the average trend. We can see that if we had relied only on the PDP we would have missed this interaction. That is, using the PDP in Figure 3 alone, we would have concluded that price tends to decrease with age for all cars.

Figure 15: combined PDP and ICE Plot (source: author)

As with PDPs, ICE Plots can help us visualise important relationships in our data. To find those relationships, we may need to use metrics like feature importance. An alternative metric, used specifically to highlight interactions in a model, is Friedman's H-statistic. We discuss using all of these methods in the article below.

ICE Plots of categorical features

We can use boxplots to visualise ICE Plots of categorical features. For example, we have the ICE Plot for car_type in Figure 16. The bold lines in the middle of the boxes give the average prediction. In other words, they are the PDPs. We can see the predicted price tends to be lower for normal cars (0).

Figure 16: car_type ICE plot (source: author)

ICE Plot-based feature importance

We can also calculate an ICE Plot-based feature importance score. This is similar to the PDP-based score except we no longer consider the average prediction line. We now calculate the score using the individual prediction lines. This means the score will consider interactions between features.
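One way to sketch this score is to compute the flatness of each individual prediction line and then average over observations. This mirrors, but may not exactly match, the vip implementation used later.

from sklearn.inspection import partial_dependence

def ice_importance(model, X, feature):
    """ICE-based importance: average 'flatness' of the individual prediction lines."""
    ice = partial_dependence(model, X, [feature], kind="individual")["individual"][0]
    return ice.std(axis=1).mean()   # SD of each observation's line, averaged over observations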

We have the ICE Plot-based scores for our model in Figure 17. We can compare these to the PDP-based scores in Figure 11. The biggest difference is the score for car_age is now larger. It increased from 300 to 345. This makes sense in light of the analysis we did above. We saw that there is an interaction that impacts the relationship between car_age and price. The PDP-based feature importance does not consider this interaction.

Figure 17: ICE plot-based feature importance (source: author)

By now, hopefully, we have a good understanding of PDPs and ICE Plots and the insights we can gain from them. We’re going to move on to discuss the advantages of these approaches. Then in the next section, we will discuss the limitations. It is crucial to understand these so that you do not draw incorrect conclusions from the plots.

Isolate feature trends

We can visualise relationships in our data using scatter plots but data is messy. For example, we can see the interaction between car_age and car_type in Figure 18. The points vary around the true underlying trend. This is because of statistical noise and the fact that price also has relationships with other features. In a real dataset, this problem will likely be worse. Ultimately, it can be difficult to see trends in data.

Figure 18: scatter plot of car_age and car_type interaction (source: author)

Working with PDPs and ICE Plots, we are no longer working with raw data values. We are working with model predictions. If built correctly, a model will capture underlying relationships in our data and ignore statistical noise. We can then isolate the trend of a particular feature. This is done by holding the other feature values constant and averaging over observations.

This is what makes PDPs and ICE Plots so useful. They allow us to strip out noise and the effect of other features. This makes it easier to see the underlying relationships in our data. In this sense, these methods can be used for data exploration and not just for understanding our models.

Straightforward to explain

Hopefully, walking through the process of building a PDP made the method easy to understand. You can gain an intuitive understanding of it without needing a mathematical definition. This also means the methods are easy to explain to a non-technical person, which can be useful in an industry setting.

Easy to implement

These methods are also easy to implement. We just need to vary feature values and record the resulting predictions. We do not even need to consider the inner workings of the model. This means the same implementation can be used with any model. When we discuss the R and Python code you will see that there are already good implementations for these methods.

Assume feature independence

Moving on to limitations, we will start with the main problem with these methods. This is that they assume features are independent. This is not always the case as features can be correlated or associated. For example, take the scatter plot of km_driven and car_age in Figure 19. There is a clear correlation. Intuitively this makes sense. Older cars will tend to have driven longer distances.

Figure 19: scatter plot of km_driven vs car_age (source: author)

The issue is that when we build a prediction line for an observation we will sample all possible values of a feature. For example, suppose we want to build a PDP for km_driven. Take the observation given in red in Figure 20. It has a car_age of 10. To build the prediction line we will vary km_driven for all values in the dotted oval. However, in reality, observations with this car_age will only have driven distances within the solid oval.

Figure 20: issues with random sampling (source: author)

Our model was not trained on observations outside of the solid oval. Still, we are creating the PDP with predictions at these points. The result is that the prediction line is built on feature combinations that the model has never “seen” before. This can produce unintuitive results and lead to incorrect conclusions about feature trends.

Equal focus on all feature values

Even if features are uncorrelated, we can still come to incorrect conclusions. For each observation, we sample over all possible feature values. This gives equal weight to all values. In reality, some values of the feature will be less common, such as those at the extremes of the feature’s distribution. There will be more uncertainty about the trend at these values.

Considering this, it is common to include a rug plot. These help us understand the distribution of the feature. Earlier we saw a quantile-based version of a rug plot. Figure 21 gives an alternative version. Here we have an individual line for each observation. We can see that there are fewer observations for higher values of km_driven. We should, therefore, interpret the trend at these values with more care.

Figure 21: alternative PDP for km_driven (source: author)

Conclusions depend on your model

As mentioned in the advantages, working with model predictions can help us see relationships more clearly. The issue is that models can make incorrect predictions. An underfitted model can miss important relationships. By modelling noise, an overfitted model can present relationships that are not really there. Ultimately, the conclusions we draw will depend on our model. It is important to consider the performance of the model.

We may still have issues even with an accurate model. A model can ignore some relationships in favour of others. We can make the incorrect conclusion that the ignored features do not have relationships with the target variable. This means that, when doing data exploration, you may want to restrict the features to the subset you are interested in exploring.

PDPs ignore interactions

As mentioned before, by using an average, the PDPs can miss interactions. As a result, the PDP-based feature importance will also miss these interactions. This means that, if interactions are present, the score can underestimate the importance of a feature. We saw this with the car_age feature. One solution is to use ICE Plots or just stick to permutation feature importance.

Limitations of implementations

In the next sections, we will walk you through the code used to implement these methods. We will see that each implementation has its own pros and cons. Some packages will not have implementations for all the plots we discussed. For example, if you are working with Python there is no implementation (that I know of) for derivative PDPs or feature importance.

In this section, we walk you through the R code used to create PDPs and ICE Plots. We will look at three different packages. We use ICEbox and iml to create the plots. Combined, these allow us to create all the plots we discussed above. For the feature importance scores, we use vip. You can find all the code we discuss on GitHub.

We start by loading our dataset (line 1). This is the same one we discussed in Table 1 at the beginning of the article. We also set car_type to a categorical feature (line 2).

Modelling

Before we create PDPs and ICE Plots we need a model. We use the randomForest package to do this (line 1). We build a model using price and the 6 features (lines 4–6). Specifically, we have used a random forest with 100 trees (line 6).

For most of the plots below, we will be using the model, rf. This has been trained on the continuous target variable. We also want to build a model, rf_binary, on a binary target variable. This is to show you how the code and output differ for the different types of targets.

To start, we create our binary target variable. It has a value of 1 if the original car price is above average and 0 if it is below average (lines 2–4). We then build a random forest just as before (lines 7–9). With these models, we can now move on to using PDPs and ICE Plots to understand how they work. As we go forward, the output will be displayed below the relevant code.

Package — ICEbox

We will start with the ICEbox package (line 1). We will use it to create a PDP for car_age. We create an iceplot object for car_age using the ice function (lines 4–7). We pass our model, features and target variable (lines 4–6). This object will contain all the individual prediction lines for car_age. It will also contain the average prediction line (i.e. the PDP).

We then use the plot function to display the iceplot object (lines 9–11). By default, this package will always display ICE Plots. To create a PDP, we need to hide the individual prediction lines. We do this by making them all white (line 11). We also hide the points that give the original car_age values (line 10).


To create an ICE plot we can use the same iceplot object. Now, instead of hiding the individual lines, we colour them by car_type (line 5). Classic and normal cars are distinguished by making the lines blue and red respectively. We have also centred the ICE Plot (line 3). We have restricted the plot to 100 observations by only plotting 10% of the individual lines (line 2).


We can also use the ICEbox package to plot derivative PDPs. We create an iceplot object for repairs in the same way as before (lines 1–4). We then use this to create a dice object (line 5). Finally, we plot the dice object (lines 7–10). We have set plot_sd = F (line 10). This hides the standard deviation of the individual prediction lines.


The last plot we create with this package is a PDP for the binary target variable. The code is similar to before. The only other difference, besides using rf_binary, is that we pass in a prediction function (lines 4–6). This lets the ice function know that our predictions are given in terms of probabilities.


The ICEbox package has some advantages. By colouring an ICE Plot with another feature, it allows us to clearly see interactions. It is also the only package here that has implemented derivative PDPs. In terms of limitations, it does not handle categorical features. This means we cannot create plots for the car_type feature. It has also not implemented PDPs for 2 features.

Package — iml

The next package, iml, can address some of these limitations. We’ll start by using it to create a PDP for car_age. We create a predictor object using our random forest and dataset (line 4). Using this, we then create a feature effect object (lines 7–9). We are using “pdp” as the feature effect method (line 9). Finally, we plot this feature effect object (line 10).


We use similar code to create an ICE Plot for car_age. We use the same predictor object as before (line 1). We now set the method to “pdp+ice” (line 3). This will give us a combined PDP and ICE Plot. We have also centred the plot so all prediction lines start at 0 (line 4). Setting the method to “ice” would remove the yellow PDP.


The iml package can also handle categorical features. Below we create a PDP for car_type (lines 2–5). This gives us the bar chart below. Similarly, we create an ICE Plot for car_type (lines 8–11). This gives us the boxplots.


Another advantage of iml is it has implemented PDPs for 2 features. Below we create a PDP for car_age and km_driven. The code is similar to before. The only difference is we pass a vector with both of the feature names (line 2).


Lastly, we create a PDP for our binary target variable. We create the predictor object with rf_binary (line 1) but the rest of the code is the same as before. You can see we now have two plots. Notice that they are inverses of each other. This is because the first plot gives the probability that the car is below average (0), and the second gives the probability that it is above average (1). Having a plot for each class is useful when your target variable has more than 2 values.


Package — vip

Neither of the above packages has implemented feature importance scores. To calculate those we use the vip package (line 1). It is straightforward to create the PDP-based scores (line 4) and ICE-based scores (line 7). We have set the method to “firm”. This stands for feature importance ranking measure.


In this section, we walk you through Python code used to create PDPs and ICE Plots. You can also find this on GitHub. We will be using scikit-learn’s implementation, PartialDependenceDisplay. An alternative package, which we won’t discuss, is PDPbox.

We start by importing Python packages. We import some common packages for handling and visualising data (lines 2–4). We have two different modelling packages — RandomForestRegressor (line 6) and xgboost (line 7). Lastly, we have the scikit-learn functions used to create PDPs and ICE Plots (lines 9–10). The first, PartialDependenceDisplay, is used to visualise the plots. The second, partial_dependence, returns the predictions used to create the plots. This comes in handy later on when we explore the categorical features.

These functions are provided by scikit-learn, but you can still use them with models that were not created with a scikit-learn package. You will see this later when we model a binary target variable with the xgboost package.

Continuous target variable

We’ll start with the continuous target variable. We load our dataset (line 2). This is the same one we discussed in Table 1 at the beginning of the article. We get our target variable (line 5) and features (line 6). We use these to train a random forest (lines 9–10). Specifically, the random forest is made up of 100 trees. Each tree has a maximum depth of 4.
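As a rough sketch of this setup (the file name and exact column names are assumptions based on the features in Table 1):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay, partial_dependence

# load the second-hand car dataset (hypothetical file name)
data = pd.read_csv("car_sales.csv")

y = data["price"]
X = data[["car_age", "km_driven", "repairs", "owner_age", "car_type"]]

# random forest with 100 trees, each with a maximum depth of 4
rf = RandomForestRegressor(n_estimators=100, max_depth=4)
rf.fit(X, y)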

We can now use PartialDependenceDisplay to understand how this model works. To start, we will create a PDP for car_age. To do this we use the from_estimator function. We pass in the model, the X feature matrix and the feature name. You can see the output below the code.
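In outline, continuing from the setup above:

# PDP for car_age (kind='average' is the default)
PartialDependenceDisplay.from_estimator(rf, X, ["car_age"])
plt.show()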


To create an ICE Plot, we need to set the kind parameter to ‘individual’ (line 4). We have also centred the plot (line 6). By default, these functions only display the feature between its 5th and 95th percentiles. You need to change this if you want to display the full range of car_age, that is, from 0 to 40. We do this on line 5.
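A sketch of these options (the centered parameter requires scikit-learn 1.1 or later):

PartialDependenceDisplay.from_estimator(
    rf, X, ["car_age"],
    kind="individual",    # individual prediction lines (ICE)
    percentiles=(0, 1),   # plot the full range of car_age
    centered=True         # start every line at 0
)
plt.show()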


We explore some other options below. Setting the kind parameter to ‘both’ will display both the PDP and ICE Plot (line 4). We can also use the ice_lines_kw (line 4) and pd_line_kw (line 5) parameters to change the style of the ICE Plot and PDP lines respectively.
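For example (the colours and line widths here are placeholder choices):

PartialDependenceDisplay.from_estimator(
    rf, X, ["car_age"],
    kind="both",                                   # ICE lines plus the PDP
    ice_lines_kw={"color": "grey", "alpha": 0.3},  # style of the individual lines
    pd_line_kw={"color": "gold", "linewidth": 3},  # style of the average (PDP) line
)
plt.show()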


Creating a PDP of 2 features is similar to before. All we need to do is pass an array of feature names instead of an individual feature name. You can see how we do this for car_age and km_driven below.
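A sketch:

# 2-feature PDP: pass a pair of feature names
PartialDependenceDisplay.from_estimator(rf, X, [("car_age", "km_driven")])
plt.show()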


A downside of the scikit-learn implementation is that it treats categorical features as continuous features. You can see the result of this below. The PDP and ICE Plots are given by lines. Bar charts and boxplots would give a better interpretation.


We can get around this by creating our own plots. We do this using the partial_dependence function (line 3). We use this function in the same way as the from_estimator function. The difference is that it does not display a plot. It only returns the data used to create the plots. pd_ice will contain the data for both the PDP and ICE Plot of car_type.

You can see how we can use this to create a PDP below. We start by getting the averages from pd_ice (line 2). This will be the average prediction for normal and classic cars. Using these, we plot a bar chart (lines 5–14).
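A sketch of both steps. car_type only takes the values 0 and 1, so the grid has two points; depending on your scikit-learn version, the grid itself is returned under 'values' or 'grid_values'.

# PDP and ICE data for car_type
pd_ice = partial_dependence(rf, X, ["car_type"], kind="both")

# average predicted price for each car type (0 = normal, 1 = classic)
avg = pd_ice["average"][0]

plt.bar(["normal (0)", "classic (1)"], avg)
plt.ylabel("average predicted price")
plt.show()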


For an ICE Plot, we can create a boxplot instead. We get the individual predictions from pd_ice (line 2). We then split these into the normal and classic car predictions (lines 4–6). Finally, we plot our boxplot (lines 9–14).
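Continuing from pd_ice above:

# individual predictions: one row per observation, one column per car_type value
ice = pd_ice["individual"][0]

normal = ice[:, 0]    # predicted prices with car_type set to 0
classic = ice[:, 1]   # predicted prices with car_type set to 1

plt.boxplot([normal, classic], labels=["normal (0)", "classic (1)"])
plt.ylabel("predicted price")
plt.show()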


Binary target variable

To end, we will create an ICE Plot for a binary target variable. To start, we create the variable (lines 2–3). It has a value of 1 if the original car price is above average and 0 if it is below average. We train a model using this binary variable (lines 6–7). This time we have used an XGBClassifier with a max_depth of 2 and 100 trees.

We plot an ICE Plot for car_age just as before. You will notice that the y-axis now gives (centred) probabilities instead of prices. You can also see that it is straightforward to use the scikit-learn packages with other modelling packages.
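A sketch of this final example, continuing from the setup above (the xgboost import is included here for completeness):

import xgboost as xgb

# 1 if the price is above average, 0 otherwise
y_binary = (data["price"] > data["price"].mean()).astype(int)

# XGBoost classifier with 100 trees of maximum depth 2
model_xgb = xgb.XGBClassifier(n_estimators=100, max_depth=2)
model_xgb.fit(X, y_binary)

# centred ICE Plot for car_age; the y-axis is now a probability
PartialDependenceDisplay.from_estimator(
    model_xgb, X, ["car_age"],
    kind="individual",
    centered=True
)
plt.show()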


The downside to using Python is that there are no implementations (that I know of) for derivative PDPs or the feature importance scores. That said, for most problems, the above charts are all you’ll need. If you need the other methods you can implement them yourself using the partial_dependence function.

I hope you found this article useful! I really do want it to be the ULTIMATE guide to PDPs and ICE Plots. If you think I’ve missed anything please reach out in the comments. Also, let me know if anything is unclear. I’d be happy to update the article 🙂


The intuition, maths and code (R and Python) behind partial dependence plots and individual conditional expectation plots

(source: author)

Both PDPs and ICE plots can help us understand how our models make predictions. Using PDPs we can visualise the relationship between model features and the target variable. They can tell us if a relationship is linear, non-linear or if there is no relationship. Similarly, ICE Plots can be used when there are interactions between features. We will go into depth on these two methods.

We start with PDPs. We will take you step-by-step through how PDPs are created. You will see that they are an intuitive method. Even so, we will also explain the mathematics behind PDPs. We then move on to ICE Plots. You will see that, with a good understanding of PDPs, these are easy to understand. Along the way, we discuss different applications and variations of these methods, including:

  • PDPs and ICE Plots for continuous and categorical features
  • PDPs for binary target variables
  • PDPs for 2 model features
  • Derivative PDP
  • Feature importance based on PDPs and ICE Plots

We make sure to discuss the advantages and limitations of both methods. These are important sections. They help us understand when the methods are most appropriate. They also tell us how they can lead to incorrect conclusions in some situations.

We end by walking you through both R and Python code for these methods. For R, we apply the ICEbox and iml packages to create visualisations. We also use vip to calculate feature importance. For Python, we will be using scikit-learn’s implementation, PartialDepenceDisplays. You can find links to the GitHub repos with code in these sections.

We start with a step-by-step walk-through of PDPs. To explain this method, we randomly generated a dataset with 1000 rows. It contains details on the sales of second-hand cars. You can see the features in Table 1. We want to predict the price of a car using the first 5 features. You can find this dataset on Kaggle.

Table 1: overview of second-hand car sales dataset (source: author) (dataset: kaggle)(licence: CC0)

To predict price we first need to train a model using this dataset. In our case, we trained a random forest with 100 trees. The exact model is not important as PDPs are a model agnostic method. We will see now that they are built using model predictions. We do not consider the inner workings of a model. This means the visualisations we explore will be similar for random forests, XGBoost, neural networks, etc…

In Table 2, we have one observation in our dataset used to train the model. In the last column, we can see the predicted price for this car. To create a PDP, we vary the value for one of the features and record the resulting predictions. For example, if we changed car_age we would get a different prediction. We do this while holding the other features constant at their real values. For example, owner_age will remain at 19 and km_drive will remain at 27,544.

Table 2: training data and prediction example (source: author)

Looking at Figure 1, we can see the result of this process. This shows the relationship between the predicted price (partial yhat) and car_age for this specific observation. You can see that as car_age increases the predicted price decreases. The black point gives the original predicted price (4,654) and car_age (4.03).

Figure 1: predicted price for varying car_age for observation 1 (source: author)

We then repeat this process for every observation or subset of observations in our dataset. In Figure 2, you can see the prediction lines for 100 observations. To be clear, for each observation we have only varied car_age. We hold the remaining feature constant at their original values. These will not have the same values as the observation we saw in Table 2. This explains why each line starts at a different level.

Figure 2: prediction lines for 100 observations (source: author)

To create a PDP, the final step is to calculate the average prediction at each value for car_age. This gives us the bold yellow line in Figure 3. This line is the PDP. By holding the other features constant and averaging over the observations we are able to isolate the relationship with car_age. We can see that the predicted price tends to decrease with car_age.

Figure 3: partial dependence plot for car age (source: author)

You may have noticed the short lines on the x-axis. These are the quartiles of car_age. That is 10% of car_age values are less than the first line and 90% are less than the last line. This is known as the rug plot. It shows us the distribution of the feature. In this case, the values of car_age are fairly evenly distributed across its range. We will understand why this is useful when we discuss the limitations of PDPs.

You may also have noticed that not all the individual prediction lines follow the PDP trend. Some of the higher lines seem to increase. This suggests that, for those observations, price has the opposite relationship with car_age. That is the predicted prices increase as car_age increases. Keep this in mind. We will come back to it later when we discuss ICE Plots.

Mathematics behind PDPs

For the more mathematically minded, there is also a formal definition of PDPs. Let’s start with the function for PDP we just created. This is given by Equation 1. Set C will contain all the features excluding car_age. For a given car_age value and observation i, we find the predicted price using the original values for the features in C. We do this for all 100 observations and find the average. This equation will give us the bold yellow line we saw in Figure 3.

Equation 1: PD function for car_age (source: author)

In Equation 2, we have generalised the above equation. This is the PD function for a set of features S. C will contain all the features excluding those in S. We now also average over n observations. This can be up to the total number of observations in the dataset (i.e. 1000).

Equation 2: approximated PD function for feature set S (source: author)

Until now, we have only discussed cases when S consists of one feature. In the previous section, we had S = {car_age}. Later, we will show you a PDP of 2 features. We will generally not include more than 2 features in S. Otherwise, it becomes difficult to visualise the PD function.

The above equation is actually only an approximation of the PD function. The true mathematical definition is given in Equation 3. For given values in set S, we find the expected prediction w.r.t. the set C. To do this, we need to integrate our model function w.r.t. the probability of observing the values in set C. To fully understand this equation you will need some experience with stochastic calculus.

Equation 3: PD function for feature set S (source: author)

When working with PDPs an understanding of the approximation is good enough. The true PD function is not practical to implement. Firstly, it will be more computationally expensive to implement than the approximation. Secondly, given our limited number of observations, it is only possible to approximate the probabilities. That is we can not find the true probabilities of each observation. Lastly, many models will not be continuous functions making them difficult to integrate.

PDPs for continuous features

With an understanding of how we create PDPs we will move on to using them. Using these plots we can understand the nature of the relationship between a model feature and the target variable. For example, we have already seen the car_age PDP in Figure 4. The predicted price decreases at a fairly constant rate. This suggests that car_age has a linear relationship with price.

Figure 4: car_age PDP (source: author)

In Figure 5, we can see the PDP for another feature, repairs. This is the number of repairs/services the car received. Initially, the predicted price tends to increase with the number of repairs. We would expect a reliable car to have received some regular maintenance. Then, at around 6/7 repairs, the price tends to decrease. Excessive repairs may indicate that something is wrong with the car. From this, we can see that price has a non-linear relationship with repairs.

Figure 5: repairs PDP (source: author)

From the above, we can see that PDPs are useful when it comes to visualising non-linear relationships. We explore this topic in more depth in the article below. When dealing with many features it may be impractical to look at all the PDPs. So we also discuss using mutual information and feature importance to help find non-linear relationships.

In Figure 6, we can see an example of a PDP when a feature has no relationship with the target variable. For owner_age the PDP is constant. This tells us that the predicted price does not change when we vary owner_age. Later, we will see how we can use this idea of PDP variation to create a feature importance score.

Figure 6: owner_age PDP (source: author)

PDPs for categorical features

The features we have discussed above have all been continuous. We can also create PDPs for categorical features. For example, see the plot for car_type in Figure 7. Here we calculate the average prediction for each type of car — normal (0) or classic (1). Instead of a line, we visualise these with a histogram. We can see that classic cars tend to be sold at a higher price.

Figure 7: car_type PDP (source: author)

PDP for a binary target variable

PDPs for binary target variables are similar to those with continuous targets. Suppose we want to predict if the car price is above (1) or below (0) average. We build a random forest to predict this binary variable using the same features. We can create PDPs for this model using the same process as before. Except now our prediction is a probability.

For example, in Figure 8, you can see the PDP for car_age. We now have a predicted probability on the y-axis. This is the probability that the car is above the average price for a second-hand car. We can see that the probability tends to decrease with the age of the car.

Figure 8: PDP for a binary target variable (source: author)

PDP of 2 features

Going back to our continuous target variable. We can also visualise the PDP of two features. In Figure 9, we can see the average predictions at different combinations of km_driven and car_age. This chart is created in the same way as the PDP of one feature. That is by keeping the remaining features at their original values.

Figure 9: 2 feature PDP (source: author)

These PDPs are useful for visualising interactions between features. The above chart suggests a possible interaction between km_driven and car_age. That is the predicted price tends to be lower when both features have larger values. You should be cautious when drawing these types of conclusions.

This is because we can get similar results if the two features are correlated. Later, when we discuss the limitations of PDPs, will see that this is actually the case. That is km_driven is correlated with car_age. The amount the car has driven tends to be higher when the car is older. This is why we see a lower predicted price when both features are higher.

Derivative PDP

Derivative PDP is a variation of a PDP. It shows the slope/derivative of a PDP. It can be used to get a better understanding of the original PDP. However, in most cases, the insight we can gain from these plots is limited. They are generally more useful for non-linear relationships.

For example, take the derivative PDP for repairs in Figure 10. This is the derivative of the line we saw earlier in Figure 5. We can see that the derivative is 0 at around 6 repairs. At this point, the derivative changes from positive to negative. In other words, the original PDP changes from increasing to decreasing w.r.t. repairs. This tells us that after 6 repairs a car’s price will tend to decrease.

Figure 10: derivative PDP for repairs (source: author)

PDP feature importance

To end this section, we have a feature importance score based on PDPs. This is done by determining the “flatness” of the PDP for each feature. Specifically, for continuous variables, we calculate the standard deviation of the values of the plot. For categorical variables, we estimate the SD by first taking the range. That is the maximum less the minimum PDP value. We then divide the range by 4. This calculation comes from a concept called the range rule.

You can see the PDP-based feature importance for our features in Figure 11. Notice that the score for owner_age is relatively low. This makes sense if we think back to the PDP in Figure 6. We saw that the PDP was relatively constant. In other words, the y-axis values are always close to their mean value. They have a low standard deviation.

Figure 11: feature importance based on PDP (source: author)

There are better-known feature importance scores such as permutation feature importance. You may prefer to use this PDP-based score as it can provide some consistency. If you are analysing feature trends you can now use a feature importance score calculated using similar logic. You can also avoid explaining the logic of two different methods.

With PDPs out of the way, let’s move on to ICE Plots. You will be happy to know that we’ve already discussed the process of creating them. Take Figure 12 below. This is the plot we created just before the car_age PDP in Figure 3. It is an ICE Plot for car_age. ICE Plots are made up of the prediction lines for each individual observation.

Figure 12: ICE Plot for car_age (source: author)

ICE plots are useful when there are interactions in your model. That is if the relationship of a feature with the target variable depends on the value of another feature. It may be hard to see this in the above chart. To make things clearer we can centre our ICE plot. In Figure 13 we have done this by making all the prediction lines start at 0. It is now clear that for some observations the predicted price tends to increase with car_age.

Figure 13: centred ICE Plot (source: author)

To understand what is causing this behaviour we can change the colour of the ICE Plot. In Figure 14, we have changed the colour based on car_type. We make the lines blue for classic cars and red for normal cars. We can now see that this relationship comes from an interaction between car_age and car_type. Intuitively, it makes sense that a classic car would increase in value as it got older.

Figure 14: coloured ICE Plot (source: author)

A final addition is to add the PDP line to the plot. In this way, we can combine ICE Plots and PDPs. This can emphasise how some of the observations deviate from the average trend. We can see that if we had only relied on the PDP we would have missed this interaction. That is when using the PDP in Figure 3, we concluded that price tends to decrease with age for all cars.

Figure 15: combined PDP and ICE Plot (source: author)

Like with PDPs, ICE Plots can help us visualise important relationships in our data. To find those relationships we may need to use metrics like feature importance. An alternative metric used specifically to highlight interactions in a model is the Friedman’s H-statistic. We discuss using all of these methods in the article below.

ICE Plots of categorical features

We can use boxplots to visualise ICE Plots of categorical features. For example, we have the ICE Plot for car_type in Figure 16. The bold lines in the middle of the boxes give the average prediction. In other words, they are the PDPs. We can see the predicted price tends to be lower for normal cars (0).

Figure 16: car_type ICE plot (source: author)

ICE Plot-based feature importance

We can also calculate an ICE Plot-based feature importance score. This is similar to the PDP-based score except we no longer consider the average prediction line. We now calculate the score using the individual prediction lines. This means the score will consider interactions between features.

We have the ICE Plot-based scores for our model in Figure 17. We can compare these to the PDP-based scores in Figure 11. The biggest difference is the score for car_age is now larger. It increased from 300 to 345. This makes sense in light of the analysis we did above. We saw that there is an interaction that impacts the relationship between car_age and price. The PDP-based feature importance does not consider this interaction.

Figure 17: ICE plot-based feature importance (source: author)

By now, hopefully, we have a good understanding of PDPs and ICE Plots and the insights we can gain from them. We’re going to move on to discuss the advantages of these approaches. Then in the next section, we will discuss the limitations. It is crucial to understand these so that you do not draw incorrect conclusions from the plots.

Isolate feature trends

We can visualise relationships in our data using scatter plots but data is messy. For example, we can see the interaction between car_age and car_type in Figure 18. The points vary around the true underlying trend. This is because of statistical noise and the fact that price also has relationships with other features. In a real dataset, this problem will likely be worse. Ultimately, it can be difficult to see trends in data.

Figure 18: scatter plot of car_age and car_type interaction (source: author)

Working with PDPs and ICE Plots, we are no longer working with raw data values. We are working with model predictions. If built correctly, a model will capture underlying relationships in our data and ignore statistical noise. We can then isolate the trend of a particular feature. This is done by holding the other feature values constant and averaging over observations.

This is what makes PDPs and ICE Plots so useful. They allow us to strip out noise and the effect of other features. This makes it easier to see the underlying relationships in our data. In this sense, these methods can be used for data exploration and not just for understanding our models.

Straight forward to explain

Hopefully, by walking you through the process of building a PDP it was easy to understand. In this way, you can gain an intuitive understanding of the method without the need for a mathematical definition. This also means that the methods are easy to explain to a non-technical person. This can be useful in an industry setting.

Easy to implement

These methods are also easy to implement. We just need to vary feature values and record the resulting predictions. We do not even need to consider the inner workings of the model. This means the same implementation can be used with any model. When we discuss the R and Python code you will see that there are already good implementations for these methods.

Assume feature independence

Moving on to limitations, we will start with the main problem with these methods. This is that they assume features are independent. This is not always the case as features can be correlated or associated. For example, take the scatter plot of km_driven and car_age in Figure 19. There is a clear correlation. Intuitively this makes sense. Older cars will tend to have driven longer distances.

Figure 19: scatter plot of km_driven vs car_age (source: author)

The issue is that when we build a prediction line for an observation we will sample all possible values of a feature. For example, suppose we want to build a PDP for km_driven. Take the observation given in red in Figure 20. It has a car_age of 10. To build the prediction line we will vary km_driven for all values in the dotted oval. However, in reality, observations with this car_age will only have driven distances within the solid oval.

Figure 20: issues with random sampling (source: author)

Our model was not trained on observations outside of the solid oval. Still, we are creating the PDP with predictions from these observations. The result is that the prediction line is built on observations that the model has not “seen” before. This can produce unintuitive results and lead to incorrect conclusions about feature trends.

Equal focus on all feature values

Even if features are uncorrelated, we can still come to incorrect conclusions. For each observation, we are sampling over all possible feature values. This gives equal weight to all values. In reality, some values of the feature will be less common. Such as at the extremes of the feature’s distribution. There will be more uncertainty about the trend at these values.

Considering this, it is common to include a rug plot. These help us understand the distribution of the feature. Before we saw a quantile version of a rug plot. Figure 21, gives an alternative version. Here we have an individual line for each observation. We can see that there are fewer observations for higher values of km_driven. We should, therefore, interpret the trend at these values with more care.

Figure 21: alternative PDP for km_driven (source: author)

Conclusions depend on your model

As mentioned in the advantages, working with model predictions can help us see relationships more clearly. The issue is that models can make incorrect predictions. An underfitted model can miss important relationships. By modelling noise, an overfitted model can present relationships that are not really there. Ultimately, the conclusions we draw will depend on our model. It is important to consider the performance of the model.

We may still have issues even with an accurate model. A model can ignore some relationships in favour of others. We can make the incorrect conclusion that the ignored features do not have relationships with the target variable. This means that, when doing data exploration, you may want to restrict the features to the subset you are interested in exploring.

PDPs ignore interactions

As mentioned before, by using an average, the PDPs can miss interactions. As a result, the PDP-based feature importance will also miss these interactions. This means that, if interactions are present, the score can underestimate the importance of a feature. We saw this with the car_age feature. One solution is to use ICE Plots or just stick to permutation feature importance.

Limitations of implementations

In the next sections, we will walk you through the code used to implement these methods. We will see that each implementation has its own pros and cons. Some packages will not have implementations for all the plots we discussed. For example, if you are working with Python there is no implementation (that I know of) for derivative PDPs or feature importance.

In this section, we will walk you through R code used to create PDPs and ICE Plots. We will look at using three different packages. We use ICEbox and iml to create the plots. Combined these allow us to create all the plots we discussed above. For the feature importance scores, we use vip. You can find all the code we discuss on GitHub.

We start by loading our dataset (line 1). This is the same one we discussed in Table 1 at the beginning of the article. We also set car_type to a categorical feature (line 2).

Modelling

Before we create PDPs and ICE Plots we need a model. We use the randomForest package to do this (line 1). We build a model using price and the 6 features (lines 4–6). Specifically, we have used a random forest with 100 trees (line 6).

For most of the plots below, we will be using the model, rf. This has been trained on the continuous target variable. We also want to build a model, rf_binary, on a binary target variable. This is to show you how the code and output differ for the different types of targets.

To start, we create our binary target variable. It has a value of 1 if the original car price is above average and 0 if it is below average (lines 2–4). We then build a random forest just as before (lines 7–9). With these models, we can now move on to using PDPs and ICE Plots to understand how they work. As we go forward, the output will be displayed below the relevant code.

Package — ICEbox

We will start with the ICEbox package (line 1). We will use it to create a PDP for car_age. We create an iceplot object for car_age using the ice function (lines 4–7). We pass our model, features and target variable (lines 4–6). This object will contain all the individual prediction lines for car_age. It will also contain the average prediction line (i.e. the PDP).

We then use the plot function to display the iceplot object (lines 9–11). By default, this package will always display ICE Plots. To create a PDP, we need to hide the individual prediction lines. We do this by making them all white (line 11). We also hide the points that give the original car_age values (line 10).

(source: author)

To create an ICE Plot we can use the same iceplot object. Now, instead of hiding the individual lines, we colour them by their car_type (line 5). Classic and normal cars are distinguished by making the lines blue and red respectively. We have also centred the ICE Plot (line 3). We have restricted the plot to 100 observations by only plotting 10% of the individual lines (line 2).

(source: author)

We can also use the ICEbox package to plot derivative PDPs. We create an iceplot object for repairs in the same way as before (lines 1–4). We then use this to create a dice object (line 5). Finally, we plot the dice object (lines 7–10). We have set plot_sd = F (line 10), which hides the standard deviation of the individual prediction lines.

(source: author)

The last plot we create with this package is a PDP for the binary target variable. The code is similar to before. The only other difference, besides using rf_binary, is that we pass in a prediction function (lines 4–6). This lets the ice function know that our predictions are given in terms of probabilities.

(source: author)

The ICEbox package has some advantages. By colouring an ICE Plot with another feature, it allows us to clearly see interactions. It is also the only package that has implemented derivative PDPs. In terms of limitations, it does not handle categorical features. This means we cannot create plots for the car_type feature. It has also not implemented PDPs for 2 features.

Package — iml

The next package, iml, can address some of these limitations. We’ll start by using it to create a PDP for car_age. We create a predictor object using our random forest and dataset (line 4). Using this, we then create a feature effect object (lines 7–9). We are using “pdp” as the feature effect method (line 9). Finally, we plot this feature effect object (line 10).

(source: author)

We use similar code to create an ICE Plot for car_age. We use the same predictor object as before (line 1). We now set the method to “pdp+ice” (line 3). This will give us a combined PDP and ICE Plot. We have also centred the plot so all prediction lines start at 0 (line 4). Setting the method to “ice” would remove the yellow PDP.

(source: author)

The iml package can also handle categorical features. Below, we create a PDP for car_type (lines 2–5), which gives us the histogram. Similarly, we create an ICE Plot for car_type (lines 8–11), which gives us the boxplots.

(source: author)

Another advantage of iml is it has implemented PDPs for 2 features. Below we create a PDP for car_age and km_driven. The code is similar to before. The only difference is we pass a vector with both of the feature names (line 2).

(source: author)

Lastly, we create a PDP for our binary target variable. We create the predictor object with rf_binary (line 1) but the rest of the code is the same as before. You can see we now have two plots. Notice that they are inverses of each other. This is because the first plot gives the probability that the car is below average (0) and the second plot gives the probability that it is above average (1). Having plots for each value is useful when your target variable has more than 2 values.

(source: author)

Package — vip

Neither of the above packages has implemented feature importance scores. To calculate these we use the vip package (line 1). It is straightforward to create the PDP-based scores (line 4) and ICE-based scores (line 7). We have set the method to “firm”, which stands for feature importance ranking measure.

(source: author)

In this section, we walk you through Python code used to create PDPs and ICE Plots. You can also find this on GitHub. We will be using scikit-learn’s implementation, PartialDependenceDisplay. An alternative package, which we won’t discuss, is PDPbox.

We start by importing Python packages. We import some common packages for handling and visualising data (lines 2–4). We have two different modelling packages: RandomForestRegressor (line 6) and xgboost (line 7). Lastly, we have the imports used to create PDPs and ICE Plots (lines 9–10). The first, PartialDependenceDisplay, is used to visualise the plots. The second, partial_dependence, is used to get the predictions used to create the plots. This comes in handy later on when we explore the categorical features.
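The original code is embedded as gists that do not appear in this repost, so the snippet below is a reconstruction of these imports; the aliases are assumptions.

```python
import pandas as pd                    # handling data
import numpy as np
import matplotlib.pyplot as plt        # visualising data

from sklearn.ensemble import RandomForestRegressor   # model for the continuous target
import xgboost as xgb                                 # model for the binary target

from sklearn.inspection import PartialDependenceDisplay  # draws PDPs and ICE Plots
from sklearn.inspection import partial_dependence        # returns the underlying predictions
```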

The PDP packages are provided by scikit-learn. You can still use them with models that were not created with a scikit-learn package. You will see this later when we model a binary target variable with the xgb package.

Continuous target variable

We’ll start with the continuous target variable. We load our dataset (line 2). This is the same one we discussed in Table 1 at the beginning of the article. We get our target variable (line 5) and features (line 6). We use these to train a random forest (lines 9–10). Specifically, the random forest is made up of 100 trees. Each tree has a maximum depth of 4.
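A minimal sketch of this step, assuming the dataset is saved locally as a CSV with the Table 1 columns already in numeric form (the file name is hypothetical):

```python
df = pd.read_csv("car_sales.csv")   # hypothetical file name

y = df["price"]                     # target variable
X = df.drop("price", axis=1)        # model features

# Random forest with 100 trees, each with a maximum depth of 4
rf = RandomForestRegressor(n_estimators=100, max_depth=4)
rf.fit(X, y)
```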

We can now use the PartialDependenceDisplay package to understand how this model works. To start we will create a PDP for car_age. To do this we use the from_estimator function. We pass in the model, X feature matrix and the feature name. You can see the output below the code.
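In code, this is a single call (a sketch reusing the rf model and X feature matrix from above):

```python
# PDP for car_age: the average prediction over a grid of car_age values
PartialDependenceDisplay.from_estimator(rf, X, ["car_age"])
plt.show()
```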

(source: author)

To create an ICE Plot, we need to set the kind parameter to ‘individual’ (line 4). We have also centred the plot (line 6). By default, these functions only display the 5% to 95% percentiles of the feature. You need to change this if you want to display the full range of car_age, that is, from 0 to 40. We do this on line 5.
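Something along these lines, where the keyword arguments are scikit-learn’s own (centered needs a fairly recent release):

```python
# ICE Plot for car_age over its full range, with every line starting at 0
PartialDependenceDisplay.from_estimator(
    rf, X, ["car_age"],
    kind="individual",    # individual prediction lines only
    percentiles=(0, 1),   # show the full range of car_age (0 to 40)
    centered=True,        # centre each ICE line at its first value
)
plt.show()
```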

(source: author)

We explore some other options below. Setting the kind parameter to ‘both’ will display both the PDP and ICE Plot (line 4). We can also use the ice_lines_kw (line 4) and pd_line_kw (line 5) parameters to change the style of the ICE Plot and PDP lines respectively.
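For example (the particular colours and widths are just illustrative choices):

```python
# Combined PDP and ICE Plot with custom line styles
PartialDependenceDisplay.from_estimator(
    rf, X, ["car_age"],
    kind="both",
    ice_lines_kw={"color": "lightblue", "alpha": 0.3},  # ICE line style
    pd_line_kw={"color": "red", "linewidth": 3},        # PDP line style
)
plt.show()
```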

(source: author)

Creating a PDP of 2 features is similar to before. All we need to do is pass an array of feature names instead of an individual feature name. You can see how we do this for car_age and km_driven below.
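Roughly, the pair of features is passed as a tuple inside the feature list:

```python
# Two-way PDP for car_age and km_driven
PartialDependenceDisplay.from_estimator(rf, X, [("car_age", "km_driven")])
plt.show()
```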

(source: author)

A downside of the scikit-learn package is that it treats categorical features as continuous features. You can see the result of this below. The PDP and ICE Plots are given by lines. Histograms and boxplots would give a better interpretation.

(source: author)

We can get around this by creating our own plots. We do this using the partial_dependence function (line 3). We use this function in the same way as the from_estimator function. The difference is that it does not display a plot; it only returns the data used to create the plots. pd_ice will contain the data for both the PDP and ICE Plots of car_type.

You can see how we can use this to create a PDP below. We start by getting the averages from pd_ice (line 2). This will be the average prediction for normal and classic cars. Using these, we plot a bar chart (lines 5–14).
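A sketch of both steps is below. Which encoded value of car_type corresponds to normal or classic cars is an assumption, and the grid of feature values is stored under the key 'values' in older scikit-learn releases and 'grid_values' in newer ones.

```python
# Get the PDP and ICE data for car_type without drawing anything
pd_ice = partial_dependence(rf, X, ["car_type"], kind="both")

# Average prediction for each car_type value (i.e. the PDP)
averages = pd_ice["average"][0]

# Bar chart with one bar per car type (label order is an assumption)
fig, ax = plt.subplots()
ax.bar(["normal", "classic"], averages)
ax.set_xlabel("car_type")
ax.set_ylabel("partial yhat")
plt.show()
```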

(source: author)

For an ICE Plot, we can create a boxplot instead. We get the individual predictions from pd_ice (line 2). We then split these into the normal and classic car predictions (lines 4–6). Finally, we plot our boxplot (lines 9–14).
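A sketch, again assuming car_type is encoded as 0 (normal) and 1 (classic):

```python
# Individual (ICE) predictions: one row per observation, one column per car_type value
ice = pd_ice["individual"][0]

normal_preds = ice[:, 0]    # predictions with every car set to type 0
classic_preds = ice[:, 1]   # predictions with every car set to type 1

fig, ax = plt.subplots()
ax.boxplot([normal_preds, classic_preds], labels=["normal", "classic"])
ax.set_xlabel("car_type")
ax.set_ylabel("partial yhat")
plt.show()
```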

(source: author)

Binary target variable

To end, we will create an ICE Plot for a binary target variable. To start we create the variable (lines 2–3). It has a value of 1 if the original car price is above average and 0 if it is below average. We train a model using this binary variable (lines 6–7). This time we have used an XGBClassifier with a max_depth of 2 and 100 trees.

We create an ICE Plot for car_age just as before. You will notice that the y-axis now gives (centred) probabilities instead of prices. You can also see that it is straightforward to use the scikit-learn plotting functions with other modelling packages.
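Putting these steps together, a sketch looks like this (the hyperparameters follow the text above):

```python
# Binary target: 1 if the car's price is above average, 0 otherwise
y_binary = (y > y.mean()).astype(int)

# XGBoost classifier with 100 trees, each with a maximum depth of 2
model = xgb.XGBClassifier(max_depth=2, n_estimators=100)
model.fit(X, y_binary)

# Centred ICE Plot for car_age; the y-axis is now a (centred) probability
PartialDependenceDisplay.from_estimator(
    model, X, ["car_age"],
    kind="individual",
    centered=True,
)
plt.show()
```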

(source: author)

The downside to using Python is that there are no implementations (that I know of) for derivative PDPs or the feature importance scores. Although, for most problems, the above charts are all you’ll need. If you do need the other methods, you can implement them yourself using the partial_dependence function, as sketched below.
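For instance, a rough PDP-based score in the spirit of FIRM can be computed as the standard deviation of each feature's partial dependence curve. The sketch below is only an illustration of that idea, not a reimplementation of the vip package:

```python
# Rough PDP-based importance: how much the PDP varies for each feature.
# A flat curve (low standard deviation) suggests a less important feature.
def pdp_importance(model, X, features):
    scores = {}
    for feature in features:
        result = partial_dependence(model, X, [feature], kind="average")
        scores[feature] = np.std(result["average"][0])
    return scores

print(pdp_importance(rf, X, X.columns))
```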

I hope you found this article useful! I really do want it to be the ULTIMATE guide to PDPs and ICE Plots. If you think I’ve missed anything please reach out in the comments. Also, let me know if anything is unclear. I’d be happy to update the article 🙂
