Introduction to Speech Enhancement: Part 1

by Mattia Di Gangi, January 2023


Photo by Wan San Yip on Unsplash

Speech enhancement is a set of methods and techniques that aim to improve speech quality, in terms of intelligibility and/or perceptual quality, by means of audio signal processing [Wikipedia]. It has many practical use cases, including removing noise from the speech signal in hearing aids, or recovering a hard-to-understand speech signal from a noisy channel or environment.

It is similar to, but distinct from, speech separation [1], which is the algorithmic separation of one audio signal into two channels containing, respectively, the speech and the background. It is possible to apply speech enhancement to obtain the speech signal and then apply additional algorithms to compute the residual signal, that is, the difference between the original signal and the enhanced speech.

Different speech enhancement methods are applied to reduce the effect of different noise models (e.g. stationary vs. non-stationary). While this field of research is far from new, deep learning approaches have flourished in recent years and improved the quality of speech enhancement in the most challenging scenario, non-stationary noise, even at a low signal-to-noise ratio (SNR).
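For reference, the SNR is the ratio between the power of the clean speech signal and the power of the noise, usually expressed in decibels:

$$\mathrm{SNR}_{\mathrm{dB}} = 10 \log_{10} \frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}}$$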

Terminology and Background

Signal: the observation of a physical quantity that varies and is measurable. Audio, video, and images all fall under this definition.

Noise model: a mathematical model of the stochastic process underlying a noise signal. Since the process is stochastic, it is usually described in terms of a probability distribution.

(Non-)Stationary process: a signal is assumed to be generated by an underlying process, which can be deterministic or stochastic. Since a deterministic process is fully known, the interest lies in stochastic processes. A stochastic process is said to be stationary if its unconditional joint distribution does not depend on time: given a model for the process, the probability of an event X occurring at time t does not depend on t itself. By extension, a signal produced by a stationary process is also called stationary. In the audio domain, a signal is stationary when its frequency (spectral) content does not change over time. For speech enhancement, the real challenge is therefore non-stationary noise: it is less predictable and harder to distinguish from speech, because the human voice is itself non-stationary, with frequencies that change all the time.
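In symbols, a process is (strictly) stationary when the joint distribution of its values is invariant to any time shift:

$$F_X(x_{t_1+\tau}, \ldots, x_{t_n+\tau}) = F_X(x_{t_1}, \ldots, x_{t_n}) \quad \text{for all } n,\, t_1, \ldots, t_n,\, \tau$$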

As in most empirical research fields, in speech enhancement we need datasets for both training and evaluating our models, metrics to evaluate the results, and hardware and software tools to run our algorithms.

On the hardware side, we need a modern computer equipped with (at least) one NVIDIA GPU for training, while a modern multi-core CPU can be enough for inference, although high CPU usage is to be expected.

As for software, we need a codebase that uses a deep learning library such as TensorFlow, PyTorch, or JAX. The FullSubNet+ repository used in the example below can be a good starting point.

Then we come to data. The easiest way to measure quality and enable supervised learning is to have a reference signal that the network is supposed to reconstruct. For this reason, training sets are usually built from a collection of clean speech signals and a collection of noise signals, which are combined by adding them at different SNR levels. This way, during training it is possible to generate a very large number of speech/noise combinations, which the network must learn to tell apart in order to produce an enhanced speech signal. Speech enhancement is encouraged in the research community by means of shared tasks, so public datasets are available from those sources. The Deep Noise Suppression (DNS) Challenge is held annually and presented at Interspeech; in the related repo you can find links to the provided datasets as well as additional resources.
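As an illustration of this mixing step, here is a minimal NumPy sketch (the function name and the commented-out loader are my own, not part of any specific toolkit) that scales a noise signal so that the mixture reaches a target SNR:

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a clean speech signal with noise at a target SNR (in dB)."""
    # Tile or trim the noise so it matches the length of the clean signal.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    # Scale the noise so that clean_power / scaled_noise_power = 10^(snr_db / 10).
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Example: create training pairs at several SNR levels.
# clean, noise = load_wav("clean.wav"), load_wav("noise.wav")  # hypothetical loader
# noisy_0db = mix_at_snr(clean, noise, snr_db=0.0)
# noisy_10db = mix_at_snr(clean, noise, snr_db=10.0)
```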

Evaluation is usually performed by means of manual and automatic assessment. The trade-off between the two has been stated many times, but it is worth repeating. Manual evaluation is performed by humans: it is costly and slow, because evaluators need to be found, instructed, and paid, and of course they need to listen to the audio before judging it. Automatic metrics are fast and cheap, so they can be repeated throughout a development cycle, but they capture only some aspects of the evaluation and may not be very reliable in some cases.

In the case of speech enhancement, four automatic metrics are commonly used: perceptual evaluation of speech quality (PESQ), log-likelihood ratio (LLR), cepstral distance (CD), and weighted spectral slope distance (WSS). All of them compute a distance between the clean speech and the enhanced speech. PESQ compares the two signals and produces a score between -0.5 and 4.5, the higher the better. The other three metrics compute distances between different properties of the two spectra: the formants, the log spectral distance, or a weighted distance of the spectral slopes. When it matters, human evaluation can be added to assess the perceived quality or to count the recognizable words in the enhanced signal. Speech recognition may be used to count the words automatically, but then we need to cope with its own errors.
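For quick checks during development, PESQ can be computed with the open-source pesq Python package. A small sketch, assuming two WAV files at the same sampling rate (PESQ only supports 8 kHz and 16 kHz; the file names are placeholders):

```python
import soundfile as sf   # pip install soundfile
from pesq import pesq    # pip install pesq

# Load the reference (clean) and the enhanced signals as mono waveforms.
ref, sr = sf.read("clean.wav")
deg, _ = sf.read("enhanced_audio.wav")

# "wb" selects wide-band PESQ (16 kHz); use "nb" for narrow-band (8 kHz).
score = pesq(sr, ref, deg, "wb")
print(f"PESQ: {score:.2f}")  # between -0.5 and 4.5, higher is better
```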

As for the signal representation, some speech enhancement models work in the time domain and others in the frequency domain. The frequency-domain representation is obtained by applying the Fourier transform to the signal; it is an atemporal representation that describes the signal in terms of its frequency components. In practice, the frequency domain has proven more useful than the time domain for speech enhancement, and most state-of-the-art methods use it.
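In practice, the transform is applied to short overlapping windows (the short-time Fourier transform, STFT), so the resulting spectrogram still retains time information. A minimal SciPy sketch, assuming a mono file named audio.wav:

```python
import numpy as np
import soundfile as sf
from scipy.signal import stft

# Load a mono waveform (time domain).
x, sr = sf.read("audio.wav")

# Short-time Fourier transform: 32 ms windows, 50% overlap (the default).
# Zxx is a complex matrix of shape (frequencies, frames).
nperseg = int(0.032 * sr)
f, t, Zxx = stft(x, fs=sr, nperseg=nperseg)

magnitude = np.abs(Zxx)   # what many models enhance
phase = np.angle(Zxx)     # often reused to reconstruct the waveform
print(magnitude.shape)    # (nperseg // 2 + 1, number_of_frames)
```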

Example

Let’s look at an example to hear the effect of speech enhancement. We start from a video commercial containing loud music that may make the speech hard to hear.

Released with a Creative Commons license

We first extract the audio track by using the famous ffmpeg tool:
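The exact command is not reproduced here; a typical invocation, assuming the video file is named commercial.mp4 and a 16 kHz mono WAV (a common sample rate for enhancement models) is wanted, would be:

```bash
# -vn drops the video stream, -ac 1 downmixes to mono, -ar 16000 resamples to 16 kHz
ffmpeg -i commercial.mp4 -vn -ac 1 -ar 16000 audio.wav
```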

Then, we apply the FullSubNet+ algorithm [2], a deep-learning-based method with open-source code and a pre-trained model available. It is very easy to reproduce this result by following the instructions in the repo.

And finally, to satisfy your curiosity, we can also listen to the residual audio obtained by spectral subtraction.
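The author does not specify the exact procedure, but a minimal sketch of such a residual computation, assuming the noisy track is audio.wav and the model output is enhanced_audio.wav, is to subtract the enhanced magnitude spectrum from the noisy one and reuse the noisy phase:

```python
import numpy as np
import soundfile as sf
from scipy.signal import stft, istft

noisy, sr = sf.read("audio.wav")
enhanced, _ = sf.read("enhanced_audio.wav")
n = min(len(noisy), len(enhanced))
noisy, enhanced = noisy[:n], enhanced[:n]

nperseg = int(0.032 * sr)
_, _, N = stft(noisy, fs=sr, nperseg=nperseg)
_, _, E = stft(enhanced, fs=sr, nperseg=nperseg)

# Subtract the enhanced magnitude from the noisy magnitude (floored at zero)
# and reuse the noisy phase to go back to the time domain.
residual_mag = np.maximum(np.abs(N) - np.abs(E), 0.0)
residual_spec = residual_mag * np.exp(1j * np.angle(N))
_, residual = istft(residual_spec, fs=sr, nperseg=nperseg)

sf.write("residual.wav", residual, sr)
```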

The speech quality in enhanced_audio.wav is noticeably improved and the music is, depending on the moment, either removed entirely or considerably lowered in volume. The result shows the effectiveness of modern speech enhancement, but it also shows that the results are not perfect and that research in this field is ongoing. Also, by running the code, you will notice that this operation is computationally quite expensive. A GPU makes the process much faster, but reducing the computational cost while keeping or improving the quality remains an important research goal.

That’s all for the first part of this series! We have discussed the ideas and the reasons for speech enhancement, and given some background about digital signals.

In the second part of this series we will go into the details of a publicly available state-of-the-art model for speech enhancement.

In the meantime, I encourage you to read through the resources linked in the article, and to go through teaching material on digital signal processing if the topic interests you. One freely available book on the topic is Think DSP. You can read it online or download the PDF, and of course buy a physical copy to support its author (I do not receive any money from it).

Thank you for reading so far and stay tuned for the next part!

[1] Bahmaninezhad, Fahimeh, et al. “A unified framework for speech separation.” arXiv preprint arXiv:1912.07814 (2019).

[2] Chen, Jun, et al. “FullSubNet+: Channel Attention FullSubNet with Complex Spectrograms for Speech Enhancement.” ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022. https://arxiv.org/abs/2203.12188

