
Is Attention Explanation? | by Denis Vorotyntsev | June 2022



Photo by Ethan M. on Unsplash

The problem of explaining models’ predictions is ever-present in machine learning. We want glass-box models for various reasons. First, data scientists wish to understand models’ predictions to catch data leaks and to be able to debug. Project managers want to understand the reasoning behind a model’s choices to avoid blind decision-making, especially in “touchy” fields such as banking or healthcare. Finally, users want to know why a model made a prediction before they trust it.

In the Natural Language Processing field, there are many ways to explain models’ predictions. One of the most frequently used is to look at an attention map. The idea is to train a neural network with attention layers and use these layers to highlight important parts of the text.

Recently I came across two papers, “Attention is not Explanation” and “Attention is not not Explanation”, which argue about the applicability of the attention mechanism for explaining models in NLP. Quite a beef, huh? I want to share some insights from these papers in this blog post.

Before we jump into the papers’ polemic, let’s refresh our memory on attention.

Attention mechanisms were introduced in the 1990s, but they became popular after the publication of “Attention Is All You Need”. An attention mechanism tries to enhance some parts of the input data while diminishing others. The motivation is that some parts of the input are more critical than others, so we should pay more attention to them. Which parts are more important depends on the context, and the model tries to learn this.

There are multiple ways to calculate attention. One of the most frequently used is scaled dot-product attention, which was introduced in the “Attention Is All You Need” paper:

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

The input consists of queries and keys of dimension dk, and values of dimension dv. We compute the dot products of the query with all keys, divide each by the square root of dk, and apply a softmax function to obtain the weights on the values.

The figure is taken from the “Attention Is All You Need” paper.
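To make the definition concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention (no masking, no multiple heads); the toy shapes and random inputs are for illustration only:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # compatibility of each query with each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights  # weighted sum of values, plus the attention map

# Toy example: 2 queries, 3 key-value pairs, d_k = d_v = 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
output, attention_map = scaled_dot_product_attention(Q, K, V)
print(attention_map)  # each row sums to 1; these weights are what attention maps visualize
```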

An example of attention visualization for an aspect-based sentiment analysis task is presented below. Words are highlighted according to their attention scores. We see that the words “location”, “price”, and “excellent” are important for the “Hotel location” task, while the words “cleanest room” and “bathroom” are important for the “Hotel cleanliness” task. These words are more or less expected to be important, so we may conclude that the model is doing fine.

The figure is taken from the “Attention in Natural Language Processing” paper.

The attention mechanism is one of the core parts of Transformer-like models, which are the state of the art in modern NLP. Attention improves the model’s performance and, it is naturally assumed, can be used to understand the model’s decision-making. However, little research has been done to justify such interpretations. So, can we use attention maps to understand a model?

Jain and Wallace published their paper “Attention is not Explanation” in 2019. The paper assesses the claim that we can use attention weights as models’ explanations. In the paper, they present two properties that attention weights should have for an “attention as explanation” approach to hold:

  1. Attention weights should correlate with feature importance measures (e.g., gradient-based measures);
  2. Alternative (or counterfactual) attention weight configurations ought to yield corresponding changes in prediction (and if they do not, then they are equally plausible as explanations).

The authors ran several experiments to test these properties on different NLP tasks and datasets. They aimed to answer the following questions: do learned attention weights agree with alternative, natural measures of feature importance? And had we attended to different features, would the prediction have been different?

In one such experiment, the authors analyzed the correlation between attention-based explanations and alternative explanation approaches: gradient-based feature importance and a leave-one-out (LOO) measure. They found that the observed correlations were modest and concluded that, in general, attention weights do not strongly or consistently agree with standard feature importance scores.

Mean and std. dev. of correlations between gradient/leave-one-out importance measures and attention weights. The figure is taken from “Attention is not Explanation” paper.
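A rough sketch of such a correlation check, using Kendall τ as the rank-correlation measure; the per-token scores below are made up for illustration, not taken from the paper:

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical per-token scores for a single sentence:
# attention weights from the model and a gradient-based importance measure.
attention_weights = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
gradient_importance = np.array([0.20, 0.10, 0.35, 0.25, 0.10])

tau, p_value = kendalltau(attention_weights, gradient_importance)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
# Computing this per instance and aggregating the mean and standard deviation
# over a dataset gives the kind of numbers shown in the figure above.
```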

In another experiment, the authors tried to generate an alternative attention map that produces a close prediction. If an alternative attention map is very different from the initial one yet yields the same predictions, the reliability of the explanations is in question. They called such an alternative attention map “adversarial attention”. The authors managed to find adversarial attention for many instances using random permutations.

For example, in the picture below, we see an attention map for an instance from the AG News dataset. The words “motors” and “daimlerchrysler” carry most of the attention. Replacing the attention map with an adversarial one, so that the word “their” becomes the most important, barely changes the prediction. The prediction delta equals 0.006, but the “explanation” is entirely different!

The figure is taken from “Attention is not Explanation” paper.
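A minimal sketch of the permutation idea: shuffle the attention weights over tokens and check whether the prediction stays (almost) the same. The `predict_with_attention` callable is hypothetical; it stands for re-running the model’s output layer with the supplied attention weights.

```python
import numpy as np

def total_variation_distance(p, q):
    """TVD between two prediction distributions: half the L1 distance."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def find_adversarial_attention(predict_with_attention, attention, n_trials=100, eps=0.01, seed=0):
    """Randomly permute attention weights; return a permutation whose
    prediction stays within eps of the original, if one is found."""
    rng = np.random.default_rng(seed)
    original_pred = predict_with_attention(attention)
    for _ in range(n_trials):
        permuted = rng.permutation(attention)  # same weights, shuffled across tokens
        if total_variation_distance(predict_with_attention(permuted), original_pred) < eps:
            return permuted  # "adversarial" attention: different map, nearly the same prediction
    return None
```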

The authors conclude that an attention map’s ability to provide transparency or meaningful explanations for model predictions is, at best, questionable.

Soon after the publication of the “Attention is not Explanation” paper, Wiegreffe and Pinter responded with “Attention is not not Explanation”, in which they challenged assumptions made in the prior work. They support their position with two claims: existence does not entail exclusivity, and the attention distribution is not a primitive.

Existence does not entail exclusivity

Attention provides an explanation, not the explanation. That is, the existence of alternative attention maps that yield the same predictions doesn’t prove that an attention map can’t be used as an explanation. Given the degrees of freedom of LSTM models, it’s no surprise that we can get the same predictions with different attention maps. The presence of an alternative attention map that yields close predictions doesn’t disprove the usefulness of the original one as an explanation.

Attention distribution is not a primitive

Attention is a model component whose parameters are learned during training. Adversarial attention, as constructed above, removes the linkage between attention and the other layers. To make adversarial attention “fair”, the authors ran the following experiment: they trained and fine-tuned a model whose goal is to make predictions similar to the initial model’s while having a different attention map. If we could achieve the same performance as the initial model with a completely different attention distribution, that would call the meaningful link between attention and predictions into question.

To train the model, the authors used the following loss function:

The figure is taken from “Attention is not not Explanation” paper.

where TVD (Total Variation Distance) is used to compare prediction scores, and JSD (Jensen-Shannon Divergence) is used to compare attention weight distributions:

The figure is taken from “Attention is not not Explanation” paper.

This loss encourages the adversarial model to minimize the distance between predictions while maximizing the distance between the initial and adversarial attention distributions.
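As a rough sketch (not the authors’ implementation), the two distances and a combined objective of this shape could look as follows in NumPy; the weighting factor `lam` is a hypothetical hyperparameter standing in for the trade-off between the two terms:

```python
import numpy as np

def tvd(p, q):
    """Total Variation Distance between two prediction distributions."""
    return 0.5 * np.abs(p - q).sum()

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q), smoothed to avoid log(0)."""
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    """Jensen-Shannon Divergence between two attention distributions."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def adversarial_loss(pred_adv, pred_base, attn_adv, attn_base, lam=1.0):
    """Keep predictions close (small TVD) while pushing attention apart (large JSD)."""
    return tvd(pred_adv, pred_base) - lam * jsd(attn_adv, attn_base)
```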

The figure is taken from “Attention is not not Explanation” paper.

The results of the experiments show that the authors failed to achieve similar performance with the adversarial model. This indicates that the trained attention learned something meaningful about the relationship between tokens and predictions, something that cannot easily be “hacked” adversarially.

The main conclusion of the “Attention is not not Explanation” paper was that Jain and Wallace did not disprove the usefulness of attention mechanisms for explainability. However, the authors of both papers agree that further research is needed.

So, is attention explanation? No, attention should not be blindly used as an explanation, especially for decision-making.

As pointed out in Thoughts on “Attention is Not Not Explanation”, a model that provides a plausible but unfaithful explanation would be the most dangerous possible outcome. However, attention weights can be used for low-risk tasks such as sanity checks and model debugging. We can think about “attention is not explanation” in the same way as “correlation is not causation”: correlation can be causation, but generally speaking, it’s not.

References

  1. “Attention is not Explanation” (Jain and Wallace, 2019)
  2. “Attention is not not Explanation” Medium blog post
  3. “Attention is not not Explanation” paper (Wiegreffe and Pinter)
  4. “Attention in Natural Language Processing”
  5. Thoughts on “Attention is Not Not Explanation”

