
SHAP for Time Series Event Detection | by Nakul Upadhya | Feb, 2023



Photo by Luke Chesser on Unsplash

Using a modified KernelSHAP for time-series event detection

Feature importance is a widespread technique used to explain how machine learning models make their predictions. The technique assigns a score or weight to each feature, indicating how much that feature contributes to the prediction. The scores can be used to identify the most important features and to understand how the model is making its predictions. One frequently used version of this is Shapley values, a model-agnostic metric based on game theory that distributes the “payout” (the prediction) fairly among the features [1]. One extension for Shapley values is KernelSHAP which uses a kernel trick along with local-surrogate models to approximate the value of the Shapley values, which allows it to compute feature importance values for more complex models such as Neural Networks [2].

KernelSHAP is often applied to explain time-series predictions, but it does come with some significant constraints and drawbacks in this domain:

  1. Time series prediction often involves large windows of past data, which can cause numerical underflow errors when applying KernelSHAP, especially in multivariate time-series prediction [3].
  2. KernelSHAP assumes feature independence. This can often work in tabular data cases, but feature and time-step independence is an exception rather than the norm in time series [3].
  3. KernelSHAP uses the coefficients of a linear model fit to perturbations of the data. In the time-series case, however, a Vector AutoRegressive (VAR) model is often better suited to modeling the process than a plain linear model [3].

To fix these issues, researchers at J.P. Morgan’s AI Research division (Villani et al.) proposed variations of KernelSHAP that are more suited to time series data in their October 2022 paper [3]:

  1. The researchers first created VARSHAP, a KernelSHAP alteration that uses a VAR model instead of a linear model as the local surrogate. The researchers also derived a closed-form method to calculate SHAP values for AR, MA, ARMA, and VARMAX models.
  2. Along with VARSHAP as the basis, the researchers proposed Time-Consistent SHAP which leverages the temporal component of the problem to reduce the computation of SHAP values.

Using the Time-Consistent SHAP measure, the researchers showcased a promising method for event detection by capturing surges of feature importance.

In this post, I will first explain how KernelSHAP is calculated and how to modify it for VARSHAP. I will then also explain how to get Time-Consistent SHAP (TC-SHAP) and how to use TC-SHAP to identify events in time-series analysis.

The formula of the SHAP value as provided by [2] is:

Equation 1: SHAP equation

$\phi_i(v) = \sum_{S \in P(C \setminus \{i\})} \frac{|S|!\,(N - |S| - 1)!}{N!}\,\Delta(S, i)$, where $\Delta(S, i) = v(S \cup \{i\}) - v(S)$

Here, phi is the SHAP value of feature i given the value function v (the value function is usually the model prediction). C is the set of all features and N is the size of C, i.e. the number of features. P(C \ {i}) is the powerset of all features excluding feature i. Delta(S, i) is the change in prediction that feature i causes when added to the feature coalition S (a set drawn from this powerset).

The equation boils down to: “add the weighted marginal contribution of feature i to each possible coalition of features that doesn’t include i”.
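To make Equation 1 concrete, here is a minimal sketch of the exact (brute-force) Shapley computation. The function names and the toy additive value function are illustrative, not from the paper:

```python
from itertools import combinations
from math import factorial

def shapley_values(v, n_features):
    """Exact Shapley values for a value function v over features 0..n-1.

    v maps a coalition (a frozenset of feature indices) to a payout,
    e.g. the model prediction with the features outside the coalition masked.
    """
    phi = []
    for i in range(n_features):
        others = [f for f in range(n_features) if f != i]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                S = frozenset(S)
                # weight |S|! (N - |S| - 1)! / N! from Equation 1
                w = (factorial(len(S)) * factorial(n_features - len(S) - 1)
                     / factorial(n_features))
                total += w * (v(S | {i}) - v(S))  # Delta(S, i)
        phi.append(total)
    return phi

# Toy additive "model": the payout is the sum of the included feature values.
x = [1.0, 2.0, 3.0]
v = lambda S: sum(x[j] for j in S)
print(shapley_values(v, 3))  # for an additive game, phi_i equals x_i
```

The nested loop over `combinations` is exactly the powerset enumeration that makes the exact computation intractable for large N, motivating KernelSHAP's approximation below.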

The issue that KernelSHAP handles is that this computation gets incredibly large, as the powerset size scales exponentially with the number of features. The KernelSHAP values are calculated by solving the following problem:

Equation 2: KernelSHAP Equation [3]

$g^{*} = \arg\min_{g} \sum_{z \in Z} \left[ f_{\theta}(h_x(z)) - g(z) \right]^2 \pi_x(z)$

Here h_x is a masking function applied to z, a binary vector sampled from the set Z of all possible feature coalitions. This function maps the coalition represented by z to a masked data point, which is then fed into our model (f_theta). The goal is to find the linear model g that best approximates the model's output across all the masks; the coefficients of this linear model are the KernelSHAP values. This is all made possible by a combinatorial kernel defined by:

Equation 3: Combinatorial Kernel [3]

$\pi_x(z) = \frac{N - 1}{\binom{N}{|z|}\,|z|\,(N - |z|)}$
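Equations 2 and 3 can be sketched in a few lines of numpy. This is a simplified enumeration-based version (feasible only for small N) that masks absent features with a background value and solves the weighted least-squares problem directly; the helper names are my own, and the infinite-weight empty/full coalitions are simply skipped here rather than handled as constraints:

```python
import numpy as np
from itertools import product
from math import comb

def kernel_weight(N, s):
    # Combinatorial kernel pi_x(z) for a coalition of size s out of N features.
    # The empty and full coalitions get infinite weight; omitted in this sketch.
    return (N - 1) / (comb(N, s) * s * (N - s))

def kernel_shap(f, x, background, N):
    """Minimal KernelSHAP sketch: enumerate all coalitions, mask absent
    features with a background value (h_x), and solve weighted least squares."""
    Z, y, w = [], [], []
    for z in product([0, 1], repeat=N):
        s = sum(z)
        if s == 0 or s == N:
            continue
        masked = np.where(np.array(z) == 1, x, background)  # h_x(z)
        Z.append(z); y.append(f(masked)); w.append(kernel_weight(N, s))
    Z = np.column_stack([np.ones(len(Z)), np.array(Z)])  # intercept + indicators
    W = np.diag(w)
    # Weighted least squares: the coefficients after the intercept
    # are the KernelSHAP values.
    beta = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ np.array(y))
    return beta[1:]

f = lambda v: 2.0 * v[0] + 1.0 * v[1] - 3.0 * v[2]
x = np.array([1.0, 1.0, 1.0]); bg = np.zeros(3)
print(kernel_shap(f, x, bg, 3))  # ≈ [2, 1, -3] for this linear model
```

For a linear model with a zero background the regression fit is exact, so the recovered coefficients match the model's own weights, which is a useful sanity check for any KernelSHAP implementation.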

To instead calculate VARSHAP, simply replace the linear representation of g with a VAR model. According to the authors, since both the coefficients of a linear model and a VAR model are estimated through Ordinary Least Squares, all the math for KernelSHAP holds, and becomes more representative for a time series [3].
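The OLS equivalence the authors lean on can be seen in a small numpy sketch. This simulates a VAR(1) process (my own toy setup, not the paper's) and recovers its coefficient matrix by ordinary least squares, the same estimator KernelSHAP uses for its linear surrogate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a stable 2-variable VAR(1) process: y_t = A @ y_{t-1} + noise
A_true = np.array([[0.6, 0.2],
                   [0.1, 0.5]])
T = 2000
y = np.zeros((T, 2))
for t in range(1, T):
    y[t] = A_true @ y[t - 1] + 0.1 * rng.standard_normal(2)

# VAR coefficients are estimated by ordinary least squares on lagged data --
# the same machinery behind the linear surrogate, which is why swapping
# g for a VAR model (VARSHAP) leaves the rest of the KernelSHAP math intact.
X, Y = y[:-1], y[1:]
A_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T
print(np.round(A_hat, 2))
```

With enough samples the OLS estimate converges to the true coefficient matrix, illustrating that fitting a VAR surrogate is no harder, mechanically, than fitting a linear one.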

As mentioned before, SHAP is a method that interprets the features of a model as players in a game and uses Shapley values to find fair allocations of rewards. However, for games that develop over time, these allocations may not provide enough incentive for all parties to pursue the initial goal. To avoid this, game theorists use imputation schedules and the concept of time consistency to manage incentives across time [3].

This idea of competing interests through time extends to features as well, as the traditional SHAP methods consider the same feature at a different time step as a different player in the game. According to the authors, we can potentially bridge this gap by adding time consistency [3].

The time consistency of SHAP values can be presented as follows:

Equation 4: The time consistency of Shapley Values [3]

In this equation, beta represents an imputation schedule of payments made to player (feature) i across t time steps and phi(0,i) is the total value that the player contributes to the game (prediction). Think of this as similar to a business partnership.

Each individual (AKA the feature) pays an initial amount into the startup fund (which is phi at time step 0). Then, in future time steps, the individual is periodically paid returns as they contribute more to the business outcomes (AKA the end prediction). These payouts also disincentivize any individual from acting against the business's interests. Framed this way, TC-SHAP works much better in the context of time series, since the different time steps of a feature are modeled as one entity instead of as separate players.

To use these in practice, the following steps can be taken:

  1. Compute the total SHAP contribution of each feature (phi at time step 0) by masking the feature, i.e. replacing it with either zeros or its average. Repeat this for all features.
  2. Then we need to compute the “subgame SHAPs” for each time step t-w in our window. This is done by changing the masking mechanism in Equation 2 so that instead of masking only time step t-w, we mask all the time steps between t-w and t (again with either zeros or the mean).
  3. Then we simply calculate the imputation schedule using equations 4 and 5.
Equation 5: Imputation schedule [3]
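The cumulative masking in step 2 can be sketched with a hypothetical helper (the function name, `fill` options, and window layout are my own, following the paper's description):

```python
import numpy as np

def cumulative_mask(window, step, fill="mean"):
    """TC-SHAP-style masking sketch: instead of masking only time step `step`,
    mask every step from `step` to the end of the window.
    `window` has shape (timesteps, features)."""
    masked = window.copy()
    fill_vals = (window.mean(axis=0) if fill == "mean"
                 else np.zeros(window.shape[1]))
    masked[step:] = fill_vals
    return masked

w = np.arange(12, dtype=float).reshape(4, 3)  # 4 time steps, 3 features
print(cumulative_mask(w, 2, fill="zero"))
# rows 2 and 3 are zeroed; rows 0 and 1 keep their original values
```

Feeding these cumulatively masked windows through Equation 2, one subgame per time step, yields the per-step "subgame SHAPs" of step 2.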

Step 1 calculates the “initial investment”. Step 2 then enforces the idea that we have N features across multiple timesteps (W) instead of having N*W features. Step 3 wraps it all together by providing the imputation schedule (or the periodic returns of each “investor”).

This procedure also has the benefit of reducing the number of computations from 2^(N*W) to W*2^N where W is the number of timesteps used for predictions and N is the number of features [3].
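The scale of that reduction is easy to underappreciate; a quick back-of-the-envelope check for a modest problem size (my own illustrative numbers):

```python
N, W = 5, 20                     # 5 features over a 20-step window
naive = 2 ** (N * W)             # every feature-timestep pair as its own player
tc = W * 2 ** N                  # TC-SHAP: W subgames over N features each
print(f"{naive:.2e} vs {tc}")    # 1.27e+30 vs 640
```

Even for this small window, the naive feature-timestep game is astronomically larger than the W subgames TC-SHAP evaluates.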

Once calculated, we can interpret the TC-SHAP values as “how, at a given time step, the evolution of features will affect the coalition of other feature trajectories.” In simpler terms, TC-SHAP represents how a feature at a given time step changes how other features contribute together in future time steps. The feature-timestep points that heavily impact future collaboration will, by definition, heavily impact the end predictions.

While getting the importance of a given time step for a single prediction is useful, time series analysis often involves analyzing multiple predictions and patterns, and we may want to know which time steps are most influential overall in the model's predictions [3].

According to the authors, we can find influential time steps by adding up the TC-SHAP values for a given set of predictions (or all predictions if we want a global event detection mechanism). By plotting this out, we can easily see which time steps are important and where some important events may have happened [3].
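The aggregation step is just a sum over the prediction axis. A small sketch with synthetic values (the array shape and the injected surge are my own illustration, not the paper's data):

```python
import numpy as np

# Hypothetical TC-SHAP array of shape (n_predictions, timesteps, features).
rng = np.random.default_rng(1)
tc_shap = 0.01 * rng.standard_normal((50, 30, 3))
tc_shap[:, 12, 1] += 0.5  # inject a surge of importance at step 12, feature 1

# Summing over predictions gives a per-timestep, per-feature "event signal";
# spikes in it flag influential time steps.
event_signal = tc_shap.sum(axis=0)           # shape (timesteps, features)
print(int(event_signal[:, 1].argmax()))      # the surge stands out at step 12
```

Plotting `event_signal` against the target series is essentially what the authors' event detection convolutions in Figure 1 show.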

The authors demonstrate the effectiveness of this approach with the Individual Household Electricity dataset. The authors trained an LSTM network followed by a dense layer to predict power consumption. They then calculated the TC-SHAP values and summed them up to get the event detection convolutions, which they overlaid onto the target time series.

Figure 1: Event Detection Convolution (Blue) for Sub-Metering 2 and 3 compared to the target (Figure from [3])

As shown in Figure 1, the large shifts in the target variables can be explained by large spikes in the event convolutions. For example, there is a large spike in the event convolution for sub_metering_2 right after time-step 25. This was then followed by a large drop in the target soon after. Similarly, a large drop in the event convolution was followed by a large drop in the target around time step 75 in sub_metering_3. Most of the large shifts can be explained by some sub-meter shifts.

The modifications to KernelSHAP fill a large hole in the literature: there is not much existing work on post-hoc interpretability methods that specifically address time series feature importance. TC-SHAP helps tackle this issue and is sorely needed.

There are some concerns and further work needed around this new method, however. One such concern (which the authors also address) is the significant difference in the explanations between VARSHAP and TC-SHAP, indicating that more work is needed to examine the exact interpretation of these values. Additionally, while TC-SHAP in theory overcomes the independence issue, more experimentation is necessary to fully confirm this claim.

Additionally, model-agnostic methods in general can be misleading as they can only provide an estimation of importance, but not the true importance. However, for most use cases this rough evaluation is enough, and having a method that addresses temporal dependencies is incredibly useful.

  1. SHAP Package for Python: https://shap.readthedocs.io/en/latest/index.html
  2. A more in-depth explanation of Kernel SHAP: https://christophm.github.io/interpretable-ml-book/shap.html#kernelshap

References

[1] L. Shapley. A value for n-person games. (1953). Contributions to the Theory of Games 2.28.

[2] S.M. Lundberg, S-I. Lee. A Unified Approach to Interpreting Model Predictions. (2017). Advances in Neural Information Processing Systems, 30.

[3] M. Villani, J. Lockhart, D. Magazzeni. Feature Importance for Time Series Data: Improving KernelSHAP (2022). ICAIF Workshop on Explainable Artificial Intelligence in Finance


