VALL-E — The Future of Text to Speech? | by Elad Rapaport | Apr, 2023


DALL-E 2: a record player that receives text on one side, and outputs sound waves on the other side. digital art.

Hello readers,

In this article, we will dive deep into a new and exciting text-to-speech model developed by Microsoft Research, called VALL-E. The paper presenting the work was released on January 5, 2023, and has been gaining much attention online since then. It is worth noting that as of this writing, no pre-trained model has been released, so the only way to battle-test the model is currently to train it yourself.

Nevertheless, the idea presented in this paper is novel and interesting and worth digging into, regardless of whether I can immediately clone my voice with it or not.

This article will be organized as follows:

  • Part 1 — Introduction to Text to Speech, Basic Concepts
  • Part 2 — VALL-E: Text to Speech as a Language Model
  • Part 3 — Encodec: The Workhorse Behind VALL-E
  • Part 4 — Problem formulation and training of VALL-E
  • Part 5 — Some Coding
  • Part 6 — Conclusions & Thoughts Ahead

The technology of text-to-speech is not new and has been around since the “Voder” — the first electronic voice synthesizer from Bell Labs in 1939 which required manual operation. Since then, the field has seen incredible developments and up until ~2017, the dominant technology was concatenative speech synthesis. This technology is based on the concatenation of pre-recorded speech segments to create intelligible speech. Although this technology can produce lifelike results, its drawbacks are obvious — it cannot generate new voices which don’t exist in the pre-recorded database, and it cannot generate speech with a different tone or emotion.

Fast-forward to the era of deep learning. Nowadays, the dominant strategy in text-to-speech synthesis is summarized in Figure 1. Let’s go over its different parts.

Figure 1. A modern neural text-to-speech pipeline. Image by author.
  • First, we have a phonemizer that transforms text into phonemes. Phonemes are a textual representation of the pronunciation of words (for example, the word tomato has different phonemes in American and British accents), and this representation helps the downstream model achieve better results.
  • Afterward, we have an acoustic model that transforms these phonemes into a Mel spectrogram, which is a representation of audio in the time × frequency domain. A spectrogram is obtained by applying a short-time Fourier transform (STFT) to overlapping time windows of a raw audio waveform (here is an excellent explanation of the Mel spectrogram — https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53); a small code sketch follows this list. Of course, in this case the spectrogram is created by a statistical model, as no input audio exists at inference time in text-to-speech. Examples of recent model architectures include Tacotron 2, DeepVoice 3, and TransformerTTS.
  • The final stage is the conversion of the Mel spectrogram into a waveform. A waveform is usually sampled at 24/48 kHz, where each sample is digitized into a 16-bit number. These numbers represent the air pressure at each moment in time, which is the sound we eventually hear in our ears. Why can’t we just deterministically convert the spectrogram into a waveform? Because it requires major upsampling in the time domain, which means creating information that doesn’t explicitly exist in the spectrogram, and because spectrograms retain only magnitude information and discard phase. So, as with the conversion of phonemes to a Mel spectrogram, here too we need a statistical model to convert the spectrogram into a waveform. These models are called vocoders; examples include WaveNet, WaveRNN, and MelGAN.
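To make the intermediate representation concrete, here is a minimal sketch (my own illustration, not tied to any specific TTS system) of computing a log-Mel spectrogram from a waveform with librosa; the file name and parameter values are arbitrary choices:

import librosa

# Load a waveform at 24 kHz (the file name here is hypothetical).
waveform, sr = librosa.load("example_speech.wav", sr=24000)

# STFT over overlapping windows, then a Mel filter bank, then log scaling.
mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel)

print(waveform.shape)  # (n_samples,): tens of thousands of samples per second
print(log_mel.shape)   # (80, n_frames): a much coarser time axis

In the pipeline of Figure 1, the acoustic model has to predict such a matrix directly from phonemes, and the vocoder has to invert it back into raw samples.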

Additionally, there are recent models such as VITS and YourTTS, which employ an end-to-end model to generate waveforms from text input. Another example of such an end-to-end system is the DeepMind paper titled “End-to-End Adversarial Text-to-Speech” (which is excellently explained by Yannic Kilcher here — https://www.youtube.com/watch?v=WTB2p4bqtXU). In this paper, they employ a GAN-like training procedure to produce realistic speech sound waves. They also need to tackle the alignment problem, which is the degree to which word utterances in the generated samples align in time with the same word utterances in the ground-truth samples. This problem does not “solve itself” and requires explicit handling in the model architecture.

The main drawback of these end-to-end TTS models is their complexity. Text and speech are very different modalities, so such models must handle problems like alignment, speaker identity, and language explicitly, which makes them highly intricate. The charm of VALL-E, which we will soon dive into, is that it takes the relative simplicity of generative language models and applies it creatively to speech generation. For people like me who are new to TTS and speech in general but have some experience in NLP, it offers a good entry point into this fascinating field.

This short overview did not do justice to the immense field of TTS, which one can spend a lifetime studying and understanding (I do encourage you to dive a bit deeper). Yet, we are here today to talk about VALL-E, so allow me to jump straight to it.

As in other text-to-speech systems, the input to VALL-E is phonemicized text, and the output is the corresponding sound waveform. Additionally, VALL-E employs a prompting mechanism in which a 3-second audio sample is fed as additional input to the model. This allows the generation of a speech utterance of the input text which is conditioned on the given audio prompt — in practice, this means the ability to perform zero-shot speech generation, which is the generation of speech from a voice unseen in the training data. The high-level structure of VALL-E is presented in Figure 2.

Figure 2. The high-level structure of VALL-E. Image taken from the original paper [1].

Let’s understand what happens in this pipeline. First, the text is converted into phonemes, which, as we already saw, is a standard procedure that doesn’t require any learned component. So that the model can process these phonemes, a phoneme embedding layer takes as input a vector of indices into the phoneme vocabulary and outputs the matrix of embeddings corresponding to those indices.

The 3-second acoustic prompt, which the output speech is conditioned on, is fed into an audio codec encoder. In VALL-E they use a pre-trained audio encoder for this — Encodec (developed by Facebook Research — https://arxiv.org/abs/2210.13438). Encodec takes as input a speech waveform and outputs a compressed discrete representation of it via residual vector quantization (RVQ) using an encoder-decoder neural architecture. We will dive into Encodec in Part 3 of this article, but for now, let’s just assume that it outputs a discrete representation of the audio signal by splitting it into fixed time windows and assigning each window a representation from a known vocabulary of audio embeddings (conceptually, very similar to word embeddings).

Once the model receives these two inputs, it can act as an autoregressive language model and output the next discrete audio representation. Because the audio representations come from a fixed vocabulary learned by Encodec, we can think of this simply as predicting the next word in a sentence out of a fixed vocabulary of words (a fixed vocabulary of sound representations, in our case). After these sound representations are predicted, they are transformed back into a waveform using the decoder part of the Encodec model.
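To make this concrete, here is roughly how the pre-trained Encodec library can be used to turn a waveform into discrete codes and back, following its published usage examples; this is my own sketch, not code from VALL-E, and the input file name is hypothetical:

import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Pre-trained 24 kHz Encodec model; the target bandwidth controls how many codebooks are kept.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("prompt_3s.wav")                      # hypothetical 3-second prompt
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)                                          # add a batch dimension

with torch.no_grad():
    frames = model.encode(wav)                                  # list of (codes, scale) tuples
codes = torch.cat([c for c, _ in frames], dim=-1)               # shape: [batch, n_codebooks, n_timesteps]
print(codes.shape)

# VALL-E's job is to predict a tensor like `codes` from text plus an acoustic prompt;
# Encodec's decoder then turns such codes back into audio:
with torch.no_grad():
    reconstructed = model.decode(frames)

The n_codebooks axis of that tensor is exactly the set of quantizers we will meet again in Part 4.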

In Figure 3 we compare the pipeline of VALL-E to the traditional neural TTS pipeline. We see that the main difference is the intermediate representation of audio. In VALL-E they gave up on the Mel spectrogram and used the representation created by the Encodec model. It is worth noting though that under the hood Encodec uses a spectrogram representation as well, so it is still somewhat in use in this architecture, albeit less prominently.

Figure 3. The VALL-E pipeline vs. the traditional neural TTS pipeline. Image by author.

In the results section of the VALL-E paper, the authors show that they outperform the previous state-of-the-art zero-shot TTS model, YourTTS, on the LibriSpeech dataset on several metrics, including human evaluations such as the similarity mean opinion score (SMOS) and algorithmic evaluations such as the word error rate (WER). In an interesting ablation study, they show that the phoneme prompt contributes to the content of the generation (by reducing WER) and the audio prompt contributes to speaker similarity (by improving a speaker similarity metric).

We will now dive into the Encodec model which is responsible for converting audio to discrete tokens and back and is the enabler for using a language model approach to audio generation in this paper.

In Figure 4 we can see the Encodec architecture. It is an encoder-decoder architecture that learns a condensed representation of the audio signal via the task of reconstruction. Let’s go over its different parts to understand what is going on under the hood.

Figure 4. The Encodec architecture. Image taken from the original paper [2].

On the far left we have our original waveform, which is sampled at 24/48 kHz, with each sample represented by 16 bits (65,536 possible values). The raw signal is passed into the encoder, which includes 1D convolution operations for downsampling and a two-layer LSTM for sequence modeling. The output of the encoder is 75/150 latent timesteps per second (compare this to the original 24,000/48,000 samples per second, a roughly 320x reduction in the time dimension!), with a depth dimension of 128.

The decoder is simply a mirrored version of the encoder, using transposed convolutions in order to upsample the latent space and construct the audio waveform (here is a good explanation on transposed convolutions https://towardsdatascience.com/what-is-transposed-convolutional-layer-40e5e6e31c11).

The interesting bit here is, of course, the quantizer. How does Encodec quantize the continuous domain of sound? Using a technique called residual vector quantization (RVQ) which consists of projecting an input vector onto the closest entry in a codebook of a given size. Let’s break that sentence down. First, what is a codebook?

In the case of VALL-E, a codebook is a dictionary of 1024 entries, where each entry is a vector of size 128. Our goal in vector quantization is to map a given vector to the closest vector in the codebook (by Euclidean distance, for example), after which it can be represented simply by the index of that vector in the codebook (assuming everyone has access to the codebook). Of course, in this way we lose a lot of information. What if no vector in the codebook accurately resembles our vector? Hence the “residual” in RVQ!

In Figure 5, I show how a vector is quantized using residual vector quantization (a small code sketch follows the figure). In the example, we have 3 codebooks. The input vector is compared to each of the vectors in the first codebook and assigned to the closest one (C1,1). Then, the residual between C1,1 and the input is calculated, and we try to match that residual to the next codebook, and so on until we reach the end of our codebooks. The final RVQ representation is the list of indices that were matched in each of the codebooks (1, 3, 2 in our example). This encoding method is extremely efficient: with 8 codebooks of 1024 entries each, we can represent 1024⁸ ≈ 1.2e+24 different vectors while storing only 1024*8 = 8192 codebook entries! Of course, the sender and receiver must hold the same codebooks for this quantization method to work. If you want to learn more about RVQ, such as how the codebooks are trained, I recommend reading another paper that Encodec is based on, called SoundStream — https://arxiv.org/abs/2107.03312 (yes, this is a rabbit hole).

Figure 5. Example of Residual Vector Quantization. Image by author.
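Here is a minimal, self-contained sketch of the RVQ encode/decode idea from Figure 5, using random codebooks with the sizes quoted above (real codebooks are learned, and real systems run this in batches on the GPU):

import numpy as np

rng = np.random.default_rng(0)

# Toy setup mirroring the numbers in the text: 8 codebooks, 1024 entries each, dimension 128.
n_codebooks, codebook_size, dim = 8, 1024, 128
codebooks = rng.normal(size=(n_codebooks, codebook_size, dim))

def rvq_encode(x, codebooks):
    """Return one index per codebook by repeatedly quantizing the residual."""
    residual = x
    indices = []
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)  # Euclidean distance to every entry
        idx = int(np.argmin(dists))                    # closest entry in this codebook
        indices.append(idx)
        residual = residual - cb[idx]                  # what the next codebook has to explain
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct the vector as the sum of the selected entries."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
print(codes)  # 8 integers in [0, 1024)
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))  # reconstruction error (large here, since these codebooks are random, not learned)

With learned codebooks, each successive quantizer explains a finer residual, which is why the first quantizer carries the coarse acoustic content and the later ones add detail.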

Back to the Encodec pipeline in Figure 4, let’s notice 3 additional details which are relevant to its training procedure:

  1. Mel spectrograms are created both from the input audio and from the generated audio. These spectrograms are compared, and the resulting signal is used as a reconstruction loss to direct the model training (see the sketch after this list).
  2. Several discriminators are used in order to compare a short-time Fourier transform (STFT) of the original and synthetic waveform. This GAN loss gives a different signal than the Mel spectrogram comparison and was found useful for Encodec.
  3. The quantizer contains transformers that are used for further compression of the audio signal. This is not the transformer in VALL-E that predicts the next token of speech, as confusing as it may be. For further understanding of the transformers in Encodec, I recommend reading the paper or watching the video by Aleksa Gordic — https://www.youtube.com/watch?v=mV7bhf6b2Hs.
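As a toy illustration of item 1 above (the real Encodec loss combines L1/L2 terms over several spectrogram resolutions, plus the adversarial and quantization terms; this single-resolution version is only meant to convey the idea):

import torch
import torchaudio

# One Mel-spectrogram transform shared by the reference and the reconstruction.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=24000, n_fft=1024, hop_length=256, n_mels=64)

def mel_reconstruction_loss(reference: torch.Tensor, reconstruction: torch.Tensor) -> torch.Tensor:
    # L1 distance between the two Mel spectrograms.
    return torch.mean(torch.abs(mel(reference) - mel(reconstruction)))

x = torch.randn(1, 24000)               # 1 second of fake "reference" audio
x_hat = x + 0.01 * torch.randn_like(x)  # pretend decoder output
print(mel_reconstruction_loss(x, x_hat))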

Let’s summarize what we know so far. VALL-E is a text-to-speech model that resembles language models in its operational mode, such that it predicts the next discrete audio token for a given prompt, which consists of phonemicized text and audio input. These discrete tokens are learned by another model called Encodec (which itself is based on SoundStream) that uses an encoder-decoder architecture with residual vector quantization to convert audio into discrete codes.

VALL-E contains two transformer models which are used to process the input data (phonemicized text and audio) — an autoregressive (AR) transformer that attends only to past data, and a non-autoregressive (NAR) transformer that attends to all points in time. Let’s see why.

Eight different codebooks are used in VALL-E as part of the Encodec model, where each codebook consists of 1024 entries. The codes from the first quantizer (codebook) are processed by the AR model according to Equation 1. Let’s first clarify some terminology:

  • C represents the generated output — as discrete audio codes
  • C~ is the 3-second input acoustic prompt
  • x is the input text as a phoneme sequence
  • C:,₁ represents data from the first quantizer/codebook for C

So, Equation 1 shows that the output for the first quantizer is conditioned on the input data, and on the previous timesteps’ outputs for the first quantizer (just like an autoregressive language model).

Equation 1. The autoregressive model, applied to the first quantizer. Image taken from the original paper [1].
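In LaTeX, my transcription of this factorization (using the notation above, with θ_AR denoting the AR model's parameters) reads roughly as:

p(\mathbf{C}_{:,1} \mid \mathbf{x}, \tilde{\mathbf{C}}_{:,1}; \theta_{AR}) = \prod_{t=0}^{T} p(\mathbf{c}_{t,1} \mid \mathbf{c}_{<t,1}, \tilde{\mathbf{C}}_{:,1}, \mathbf{x}; \theta_{AR})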

In Equation 2 we see the generation of codes for quantizers 2 to 8. Unlike the previous case, here the output for each quantizer is conditioned on all of the timesteps of the previous quantizers (when calculating codes for quantizer #7, the model is conditioned on the data generated for quantizers 1 to 6). In contrast to the AR model, this allows parallel generation of all timesteps within a single quantizer, because each timestep depends only on the previous quantizers’ codes and not on previous timesteps of the same quantizer. The authors stress this point because fast inference is especially important for text-to-speech models, which need to generate speech in real-time scenarios.

Equation 2. The non-autoregressive model, applied to the second to eighth quantizers. Image taken from the original paper [1].
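Again as my own LaTeX transcription (with θ_NAR denoting the NAR model's parameters), Equation 2 reads roughly as:

p(\mathbf{C}_{:,2:8} \mid \mathbf{x}, \tilde{\mathbf{C}}; \theta_{NAR}) = \prod_{j=2}^{8} p(\mathbf{c}_{:,j} \mid \mathbf{C}_{:,<j}, \mathbf{x}, \tilde{\mathbf{C}}; \theta_{NAR})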

Equations 1 and 2 are depicted visually in Figure 6, which shows the AR and NAR models together and highlights the differences between them. We see that the AR transformer is used to predict only C:,₁, the tokens for the first quantizer, and while doing so it attends to the previous tokens it has generated. The NAR transformer, on the other hand, attends to the previous quantizers and not to previous timesteps (the previous tokens of the current quantizer are not available in the NAR model).

Figure 6. AR and NAR models in VALL-E. Image taken from the original paper [1].
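As a toy illustration of that difference (my own sketch, not VALL-E code): the AR transformer applies a causal attention mask over timesteps, while the NAR transformer applies no temporal mask at all and instead receives the earlier quantizers’ codes as conditioning.

import torch

T = 8  # toy number of acoustic-token timesteps

# AR transformer: causal mask, so position t may attend only to positions <= t.
# (In PyTorch's boolean mask convention, True marks positions that must NOT be attended to.)
ar_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# NAR transformer: no temporal masking; every position sees every other position.
nar_mask = torch.zeros(T, T, dtype=torch.bool)

print(ar_mask.int())
print(nar_mask.int())

Either mask could be passed to a standard transformer encoder layer; the absence of a causal mask is exactly what lets the NAR model emit all timesteps of a quantizer in parallel.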

VALL-E was trained on 60K hours of audio from the LibriLight dataset, containing about 7,000 distinct speakers (over 100 times more data than previous state-of-the-art systems used). The dataset is audio-only, so an automatic speech recognition model was used to produce the transcriptions for labeling. The Encodec model is used as a pre-trained component, and as far as I could tell, no fine-tuning is performed on it for VALL-E.

For training, random 10–20 second samples were taken from LibriLight. For the acoustic prompt, another 3 seconds were taken from the same utterance. The authors used 16 Tesla V100 GPUs to train the model, which is very modest compared to large SOTA language models!

We have covered the training procedure and the data; now let’s try the unofficial PyTorch implementation of VALL-E on GitHub.

VALL-E doesn’t have an official implementation on GitHub, so for my experiments I will rely on the unofficial version — https://github.com/enhuiz/vall-e. Moreover, no model checkpoint has been released, so you have to train it from scratch.

There is also a Google Colab notebook to follow along with a simple training example — https://colab.research.google.com/drive/1wEze0kQ0gt9B3bQmmbtbSXCoCTpq5vg-?usp=sharing. In this example, they overfit the model on a single utterance of “hello world” and they show that the model is able to reproduce this single utterance. I was interested in two things:

  1. I wanted to replicate their “hello world” experiment with my own voice, just to see that the pipeline is working properly
  2. I wanted to replicate the experiment done by James Skelton from Paperspace — https://blog.paperspace.com/training-vall-e-from-scratch-on-your-own-voice-samples/, where he trained a model on a very small subset of his own recordings, and managed to replicate his voice with it (on something he already recorded)

Why the limited experiments? Because training this model from scratch takes many resources which I don’t currently have, plus I assume a pre-trained model will be released sooner or later.

So how did it go? I managed to replicate the “hello world” experiment, but unfortunately, I didn’t manage to replicate the Paperspace experiment — I just got a model that creates a garbled sound vaguely resembling my voice. This is probably due to a lack of resources (I trained it on a Google Colab instance, which times out after 12 hours). Still, I want to go over the process with you. My version of the VALL-E notebook is here — https://colab.research.google.com/drive/1NNOsvfiOfGeV-BBgGkwf0pyGAwAgx3Gi#scrollTo=SbWtNBVg_Tfd.

Once you run the following line in the Colab notebook —

!git clone --recurse-submodules https://github.com/enhuiz/vall-e.git

You will see a directory called vall-e in your file browser. The path content/vall-e/data/test contains the data for the “hello world” experiment. Notice that it contains two recordings, because for some reason the pipeline breaks with only one. To replicate this experiment, delete the files in the data directory using !rm content/vall-e/data/test/*, record yourself saying “hello world”, and save it as two .wav files with different names. Put the .wav files in the data directory along with two text files containing the words “hello world” (the text files should have the same names as the .wav files, with a .normalized.txt suffix); see the snippet below for one way to create them.
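For example, assuming you uploaded two recordings named hello_1.wav and hello_2.wav (the names are my own; any consistent names work), the matching transcript files can be created directly from the notebook:

# Create the matching .normalized.txt transcripts for two hypothetical recordings,
# hello_1.wav and hello_2.wav, placed in /content/vall-e/data/test/.
for name in ["hello_1", "hello_2"]:
    with open(f"/content/vall-e/data/test/{name}.normalized.txt", "w") as f:
        f.write("hello world")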

Following that, you will run these two cells:

!python -m vall_e.emb.qnt data/test
!python -m vall_e.emb.g2p data/test

The first cell will run the Encodec model on your own data and perform quantization, just as we discussed earlier. The second cell will convert the text “hello world” into phonemes.

Afterward, the processed data is ready and you can run the cells that operate the training procedure. The AR and NAR models are trained separately (recall from earlier that, at inference time, the NAR model depends on the first-quantizer codes produced by the AR model, while the AR model uses and produces only first-quantizer data and is therefore independent of the NAR model).

!python -m vall_e.train yaml=config/test/ar.yml
!python -m vall_e.train yaml=config/test/nar.yml

After both models have finished training, you will run this cell:

!mkdir -p zoo
!python -m vall_e.export zoo/ar.pt yaml=config/test/ar.yml
!python -m vall_e.export zoo/nar.pt yaml=config/test/nar.yml

This saves the latest automatically created model checkpoints into a directory called zoo.

Finally, you will perform inference with the model using:

!python -m vall_e 'hello world' /content/vall-e/data/test/hello_world.wav toy.wav --ar-ckpt zoo/ar.pt --nar-ckpt zoo/nar.pt

This will run the model with a text prompt of “Hello world”, and an audio prompt of the same utterance. It will save the generated sample as toy.wav which you can then listen to using:

from IPython.display import Audio
Audio('toy.wav')

And that’s it! You created your own VALL-E “hello world”. Unless you have a lot of computing resources, it is probably best to wait for a pre-trained model before making further use of VALL-E.

In this article, we saw VALL-E, a new text-to-speech architecture by Microsoft Research. VALL-E generates audio in a language-model-like manner, which differentiates it from recent state-of-the-art methods that are usually end-to-end or follow a text->spectrogram->waveform creation pipeline.

We also talked about the Encodec model, which performs audio quantization and is used as a pre-trained model in the training of VALL-E. Encodec is fascinating in itself and manages to create super-condensed audio representations using residual vector quantization. The creators of VALL-E leveraged this feature and built a generative “language” model on top of this quantization.

Finally, we saw some code and replicated the “hello world” experiment from the unofficial code with our own voice. The official code for this paper hasn’t been released, nor has a model checkpoint been released. It would be interesting to see and use a pre-trained model for VALL-E, which I assume will turn up sooner or later. Nevertheless, this was an interesting learning journey.

See you next time!

Elad

[1] https://arxiv.org/abs/2301.02111 — The VALL-E paper (Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers)

[2] https://arxiv.org/abs/2210.13438 — The Encodec paper (High Fidelity Neural Audio Compression)

[3] https://wiki.aalto.fi/display/ITSP/Concatenative+speech+synthesis — Explanation on concatenative speech synthesis

[4] https://www.youtube.com/watch?v=aLBedWj-5CQ&t=1s — Deep dive into speech synthesis meetup (HuggingFace)

[5] https://www.youtube.com/watch?v=MA8PCvmr8B0 — Pushing the frontier of neural text to speech (Microsoft Research)

[6] https://www.youtube.com/watch?v=G9k-2mYl6Vo&t=5593s — Excellent video by John Tan Chong Min about VALL-E

[7] https://www.youtube.com/watch?v=mV7bhf6b2Hs — Excellent video by Aleksa Gordic about Encodec

