Quantifying the Uncertainty for Speech Recognition

by Kacper Kubara



When and How to Trust Your Speech Recognition Model

Logo of the Attendi Speech Service. Displayed with the author’s permission
  • What is Uncertainty?
  • Uncertainty Estimation Methods for Speech Recognition
  • How Can We Benefit from Uncertainty Estimation?
  • Summary
  • About Me
  • References

In the past few years, automatic speech recognition (ASR) has shifted towards bigger and more complex neural network architectures. The added complexity improves the model's performance but, on the other hand, makes the results harder to trust. A complex neural network is essentially a black box, and we can only hope that it will work equally well on unseen data.

At Attendi, we offer a tailored speech service for (health)care professionals in the Netherlands. The audio recorded by users can contain background noise, jargon, or accents that the model hasn’t seen (or in this case, heard 🙂) before. Knowing how much we can trust the predictions is beneficial in this scenario.

By measuring the uncertainty of predictions, we can potentially:

  • Find the utterances most likely to be wrong and highlight them so the users can check them
  • Finetune the model on the most ‘uncertain’ samples
  • Monitor the performance of the model

But how can we measure uncertainty? Fortunately, there are a few relevant methods for uncertainty estimation that can be applied to ASR systems. In this article, we will look at how we can define uncertainty and what methods could be applied to speech recognition models. Lastly, we will learn in which practical settings we can apply uncertainty estimation to ASR systems.

What is Uncertainty?

Uncertainty measures how confident the model is in making its predictions [1]. Depending on the source of uncertainty, we can categorize it into two types: aleatoric and epistemic.

The aleatoric type is a statistical uncertainty that appears due to an inherently random process. For example, the outcome of a coin flip has aleatoric uncertainty. The coin flip outcome is either heads or tails,
either of which can occur with a 50% chance. The aleatoric type is an irreducible part of the uncertainty. It is not possible to change the probability of getting heads or tails, as this event is described by an inherently random process.

Epistemic uncertainty is a result of an inappropriate model architecture, training procedure, a distributional shift in data, or the appearance of previously unknown data (e.g. an image of a bird for dog/cat
image classification) [1]. This is a reducible part of the uncertainty which means that we can increase the reliability of the predictions by improving the model and its training parameters, or by improving the variability in the training data.

In more practical settings, we are mostly interested in predictive uncertainty, i.e. uncertainty that can be estimated after a single forward-pass of the network. This type of uncertainty is quite useful, as it is straightforward to implement into an existing ASR pipeline. We would also prefer a method that focuses on the epistemic type since it’s the only reducible part of the uncertainty.

Uncertainty Estimation Methods for Speech Recognition

Ideally, we would like to know whether the model’s output is uncertain for the given input data X. In the ASR domain, it is common to define a method that outputs a score S(X) after each of the model’s predictions. Given a value range between s₀ and s₁, we can then determine whether the output of the model is uncertain.

A simple threshold function that indicates the uncertainty, Sel(X), can be defined as follows:

Eq 1. Given the score function S(X), we can define which data makes the model’s output uncertain.
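The equation image from the original article is not reproduced here. A plausible reconstruction, assuming the output is flagged as uncertain whenever the score falls outside the accepted range, is:

Sel(X) = \begin{cases} 0\ (\text{certain}) & \text{if } s_0 \le S(X) \le s_1 \\ 1\ (\text{uncertain}) & \text{otherwise} \end{cases}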

where s₀ and s₁ are the lower and upper thresholds, respectively. The values of s₀ and s₁ are determined on a case-by-case basis, as they depend on many factors, e.g. the model architecture or the uncertainty estimation method.

With this simple threshold function, we can identify data that the ASR model struggles with. The main issue, however, is that the score function S(X) does not have access to ground-truth labels, since we make predictions on ‘live’ data which is unlabelled. Therefore, such score functions can only compute uncertainty from the model’s raw output, or have to make assumptions about the ground-truth label distribution.

In the following subsections, we will look at a selection of these methods and understand how they can generate scores S(X).

Maximum Softmax Probability (MSP)

A simple approach is to estimate the uncertainty using the softmax distribution of the model [2]. To compute the score S(X), we can take a maximum value of the softmax output of the model.

Eq 2. Taking the max value from the softmax output
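The equation image is not shown here; the standard MSP score from [2] can be written as:

S(X) = \max_i\ \mathrm{softmax}(f(X))_i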

But how do we generalize this approach to ASR? The above equation considers only a non-sequential output ([batch_size, no. classes]), which is not the case when we want to predict a whole utterance ([batch_size, no. predicted tokens, no. classes]). We need to define another function that aggregates the score over the whole utterance:

Eq 3. Averaging MSP scores over the [no. predicted tokens] dimension
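A reconstruction of this aggregation, assuming the utterance consists of T predicted tokens:

S(X) = \frac{1}{T} \sum_{t=1}^{T} \max_i\ \mathrm{softmax}(Y_t)_i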

where softmax(Yₜ) is the softmax distribution output for token t.

In the above equation, we compute the average uncertainty score over the whole utterance. It is also possible to use a max, min, or another aggregation function instead of the mean. However, while this approach provides a single score for the whole utterance, such scores are not as informative as per-token scores: after aggregation, we can no longer flag individual incorrect words or tokens. The best we can do is to flag an incorrect utterance. Simply speaking, we lose granularity.
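As a minimal sketch of how this could look in practice (assuming a PyTorch-style decoder that outputs a logits tensor of shape [batch_size, no. predicted tokens, no. classes]; the names are illustrative, not Attendi’s actual pipeline):

```python
import torch

def msp_scores(logits: torch.Tensor):
    """Per-token MSP scores and their mean over the utterance."""
    probs = torch.softmax(logits, dim=-1)          # [batch, tokens, classes]
    per_token = probs.max(dim=-1).values           # [batch, tokens]: max softmax per token
    per_utterance = per_token.mean(dim=-1)         # [batch]: averaged over the tokens
    return per_token, per_utterance

# Example with random logits: 2 utterances, 50 tokens, 32 output classes
token_scores, utterance_scores = msp_scores(torch.randn(2, 50, 32))
```

Keeping both outputs lets us flag individual low-confidence tokens before the aggregation throws that granularity away.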

ODIN

There are a few disadvantages to the MSP approach. Neural network models tend to make overconfident predictions on out-of-domain (OOD) data [3]. Often, they are not well-calibrated, which means that the softmax output does not correlate well with the confidence in making the predictions [3].

ODIN attempts to address this problem by adding temperature scaling and small perturbations to the input data [4]. The equations are very similar to the MSP approach.

First, temperature scaling is used to calibrate the softmax output so that it aligns better with the confidence of the prediction:

Eq 4. ODIN modifies the softmax equation by adding a temperature scaling parameter T
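The temperature-scaled softmax from [4] can be written as:

S_i(x; T) = \frac{\exp\left(f_i(x)/T\right)}{\sum_{j=1}^{N} \exp\left(f_j(x)/T\right)}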

where the neural network f = (f₁, …, fₙ) is trained to classify N classes. To obtain the final score S(x), we take the maximum value of the scores Sᵢ (as in Equation 2).

In the paper [4], the authors observed that the temperature has to be sufficiently large, and reported the best performance with T = 1000 for their model.

As a second step, we add input perturbations. This helps widen the gap in softmax scores between in- and out-of-distribution samples. To add the input perturbation, we need to do the following:

  1. Compute the maximum softmax score (Eq. 4 and Eq. 2)
  2. Compute the cross-entropy loss, using the predicted class as the label
  3. Backpropagate the loss to obtain its gradient w.r.t. the input x
  4. Take the sign of the gradient, scale it by epsilon, and subtract it from the input x

This procedure can be explained with the equation below:

Eq 5. Input perturbation to help separate scores for in- and out-of-distribution samples
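The perturbation from [4] can be written as:

\tilde{x} = x - \varepsilon \cdot \mathrm{sign}\left(-\nabla_x \log S_{\hat{y}}(x; T)\right)

where ŷ is the class with the maximum temperature-scaled softmax score and ε controls the perturbation magnitude.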

To apply ODIN to the sequential ASR task, we can follow the same approach as for MSP: aggregating the per-token scores over the [no. predicted tokens] dimension reduces them to a single numerical score.
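A hedged sketch of these steps for a single, non-sequential prediction, assuming a generic differentiable PyTorch model that maps an input tensor to logits (the T and epsilon values are illustrative; [4] reports T = 1000 working well):

```python
import torch
import torch.nn.functional as F

def odin_score(model, x, T: float = 1000.0, epsilon: float = 0.002):
    x = x.clone().detach().requires_grad_(True)
    logits = model(x) / T                          # temperature-scaled logits (Eq. 4)
    pred = logits.argmax(dim=-1)                   # use the model's own prediction as the label
    loss = F.cross_entropy(logits, pred)
    loss.backward()                                # gradient of the loss w.r.t. the input
    x_perturbed = x - epsilon * x.grad.sign()      # Eq. 5: small step that raises the max softmax
    with torch.no_grad():
        probs = torch.softmax(model(x_perturbed) / T, dim=-1)
    return probs.max(dim=-1).values                # final ODIN score S(x)
```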

GradNorm

The max function in MSP and ODIN cherry-picks a single value from the softmax distribution and disregards the other output values. On top of that, these methods don’t look at how the input signal propagates through the network. All they do is blindly look at a single element of the final softmax layer.

GradNorm [5] approaches this problem from a different angle and looks at the model holistically. To compute the uncertainty score, GradNorm looks at how gradients are propagated throughout the network and at the final output distribution.

The main idea behind GradNorm is as follows. We assume that if the input data X is uncertain, the output of the final softmax layer will be (more or less) uniformly distributed. If the model is confident in its prediction, we will see a spike for a certain class in the softmax output.

To compute the uncertainty, we first compute the KL divergence loss between the softmax distribution and the uniform distribution:

Eq 6. We assume that ground truth distribution u is uniform. Based on that, we compute the KL divergence loss between softmax(f(x)) and u.
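With u the uniform distribution over the N output classes (uᵢ = 1/N), the loss can be written as:

D_{\mathrm{KL}}\big(u \,\|\, \mathrm{softmax}(f(x))\big) = \sum_{i=1}^{N} u_i \log \frac{u_i}{\mathrm{softmax}(f(x))_i}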

To compute the final score, we backpropagate the KL divergence loss and then compute the gradient magnitude (p-norm) of the parameters of a chosen layer. In the paper, the authors found that the best scores come from the L1 norm (p = 1) of the gradients of the very last layer:

Eq 7. To compute the final score, backpropagate KL divergence loss and compute the p-norm of the pth layer. p=1 gives the best results and is also the most computationally efficient.
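Denoting by w the parameters of the chosen layer, the score can be written as:

S(x) = \left\| \nabla_w\, D_{\mathrm{KL}}\big(u \,\|\, \mathrm{softmax}(f(x))\big) \right\|_p

A hedged PyTorch sketch of this computation, assuming `model` returns logits and `last_layer` is its final linear layer (the names are illustrative):

```python
import torch
import torch.nn.functional as F

def gradnorm_score(model, last_layer: torch.nn.Linear, x) -> torch.Tensor:
    model.zero_grad()
    log_probs = F.log_softmax(model(x), dim=-1)                  # log softmax(f(x))
    uniform = torch.full_like(log_probs, 1.0 / log_probs.shape[-1])
    # KL(u || softmax(f(x))); F.kl_div expects log-probabilities as its first argument
    loss = F.kl_div(log_probs, uniform, reduction="batchmean")
    loss.backward()
    return last_layer.weight.grad.abs().sum()                    # L1 norm of the last layer's gradients
```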

How Can We Benefit from Uncertainty Estimation?

Knowing the uncertainty of predictions can be quite useful in practical use-cases. In this section, we will look at some of the most useful applications.

User Correction Input

User correction input from the Attendi app. Displayed with the author’s permission.

Uncertainty scores allow us to find utterances that are likely to be incorrect. We can highlight those utterances for the user, so they can find and correct them quickly.

Active Learning

Active learning assumes that we have a certain quota for adding ground truth labels to new data. In the context of ASR, we can hire annotators to transcribe a certain number of audio samples to help finetune the model. But which audio should we prioritize for manual transcription?

In this case, we can compute the uncertainty scores and select the audio samples with the desired S(X). We can take the most ‘uncertain’ samples, or choose a specific range of S(X).
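A minimal sketch of such a selection step, assuming a lower S(X) means a less confident prediction (the names and data structures are illustrative):

```python
def select_for_annotation(audio_ids: list, scores: list, budget: int) -> list:
    """Pick the `budget` most 'uncertain' audio samples for manual transcription."""
    ranked = sorted(zip(audio_ids, scores), key=lambda pair: pair[1])  # most uncertain first
    return [audio_id for audio_id, _ in ranked[:budget]]
```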

Monitoring

A deployed ASR model can be exposed to diverse audio signals. They can contain different environmental noise, jargon, or accents, which can influence the model’s performance. We hope that, by monitoring the uncertainty scores of incoming audio, we can detect when the model’s performance degrades and act upon it. However, this is still somewhat of a work in progress for us: we first need to establish whether the model’s performance (WER, in this case) correlates with the uncertainty.

Summary

In this article, we defined what uncertainty is and how we can use a score S(X) to measure the model’s confidence in its predictions.

There are different methods for computing uncertainty. Maximum Softmax Probability (MSP) and ODIN compute the uncertainty score using the softmax output. The drawback of these approaches, however, is that they look only at the model’s output. GradNorm takes a different approach and looks both at the output distribution and at how the gradient backpropagates through the model.

Lastly, we looked at the potential use-cases of the uncertainty score estimation.

