Lessons Learned from Making an AI Opera | by Nico Westerbeck | Oct, 2022


Read this before starting your next singing voice synthesis project

Hello TDS I can sing an opera.

Have you ever wondered what it takes to stage an AI opera on one of the most prestigious stages in Germany? Or if not, do you wonder now as you read this punchline? This post will give you an idea of the lessons we learned while building a singing voice synthesis (SVS) system for the first professional opera in which an AI had a main role. Chasing Waterfalls was staged at Semperoper Dresden in September 2022. This post is more a collection of pitfalls we fell for than a cohesive story, and it is aimed at people with some previous knowledge of TTS or SVS systems. We believe mistakes are worth sharing and are actually more valuable than the things that worked out of the box. But first, what do we mean by AI opera?

Scene from »chasing waterfalls« © Semperoper Dresden/Ludwig Olah

In short, Chasing Waterfalls is an attempt to stage an opera about AI that itself uses AI for visual and acoustic elements. Specifically, the opera was composed for six human singers and one singing voice synthesis system (the “AI voice”), which perform together with a human orchestra and electronic sound scenes. In addition to human-composed appearances throughout the opera, there is one scene where the AI character is supposed to compose for itself. In this post, we focus only on the singing voice synthesis, as this is what we at T-Systems MMS were tasked with. The compositional AI was built by the artist collective Kling Klang Klong based on GPT-3 and a sheet music transformer. The human-made parts of the opera were composed by Angus Lee, with concept, coordination and more by Phase 7 (full list of contributors).

Our requirement was to synthesize a convincing opera voice for previously unseen sheet music and lyrics from the opera. Furthermore, we were tasked with satisfying artistic needs that came up during the project. The resulting architecture is based on HifiSinger and DiffSinger: we use a transformer encoder-decoder adjusted with ideas from HifiSinger, combined with a shallow diffusion decoder and HiFi-GAN as a vocoder. We use Global Style Tokens for controllability and obtain phoneme alignments through the Montreal Forced Aligner. We recorded our own dataset with the help of the amazing Eir Inderhaug. We publish our code in three repositories: one for the acoustic model, one for the vocoder and one for a browser-based inference frontend. To let you experiment with our work, we include preprocessing routines for the CSD dataset format, but note that the CSD dataset does not permit commercial use and contains children's songs sung by a pop singer, so do not expect to get an opera voice when training on that data.

Did it work? Well, the reviews of the opera as a whole are mixed; some were amazed and some called it clickbait on TikTok. The artistic reviews rarely went into detail on the technical quality of the SVS system, except for one statement in tag24.de, loosely translated by us:

Then, the central scene. While the avatars are sleeping, the AI is texting, composing and singing. […] But: That moment isn’t particularly spectacular. If you didn’t know, you would not recognize the work of the AI as such.

That is basically the best compliment we could have gotten and means we subjectively match human performance, at least for this reviewer. We still see that the AI occasionally misses some consonants and that the transition between notes is a bit choppy. The quality could certainly be improved, but that would require more data, time and hardware. With the resources available to us, we managed to train a model that is not completely out of place on a professional opera stage. But judge for yourself:

So, what does it sound like? Here are some sneak peeks; you can find all of them in our GitHub repository.

This section is mainly interesting if you are planning your next deep learning project. The project ran from November 2021 until August 2022, with the premiere of the opera in September. We had our dataset ready in May, so effective experimentation happened from May to August. In this time, we trained 96 different configurations of the acoustic model and 25 of the vocoder on our dedicated hardware. The machine we worked on had two A100 GPUs, 1 TB of RAM and 128 CPU cores, and it was busy training something basically all the time; we scheduled our experiments to make the best use of the hardware available to us. We estimate an energy consumption of ca. 2 MWh for the project. The final training took 20 h for the transformer acoustic model (not pretrained), 30 h for the diffusion decoder (also not pretrained), 120 h to pretrain the vocoder on LJSpeech and 10 h to fine-tune the vocoder. For inference, we need ca. 6 GB of GPU RAM and the real-time factor is ca. 10 for the entire pipeline, meaning we can synthesize 10 s of audio in 1 s of GPU time. Our dataset consisted of 56 pieces, of which 50 were present in 3 different interpretations, summing to 156 pieces and 3 h 32 min of audio.

In the literature, there is no clear distinction between time-aligned MIDIs and sheet music MIDIs, so what do we mean by that? For training FastSpeech 2, a phoneme alignment is obtained through the Montreal Forced Aligner (see Section 2.2 of that paper), which we also use for our duration predictor training. FastSpeech 1 obtains these alignments from a teacher-student model and HifiSinger uses nAlign, but essentially all FastSpeech-like models require time-aligned information. Unfortunately, the timing with which phonemes are actually sung is not really comparable to the sheet music timing.

  • In some cases, there is no time-wise overlap between the note timespan and where the phonemes were actually sung due to rhythmic variations added by the singer or consonants shortly before or after the note.
  • Breathing pauses are generally not noted in the sheet music, hence the singer places them during notes, often at the end.
  • If notes are not sung in a connected fashion, small pauses between phonemes are present.
  • If notes are sung in a connected fashion, it is not perfectly clear where one note ends and the next one starts, especially if two vowels follow each other.

These discrepancies pose a question about how data is fed to the model. If the time-aligned information is used directly as training data, the model is incapable of singing sheet music, as the breaks and timings are missing during inference. If sheet music timing is used as training data, the phoneme-level duration predictor targets are unclear, as only syllable-level durations are present in the sheet music. There are two fundamental ways to deal with this problem. If enough data is available, directly feeding syllable embeddings to the model should yield the best results, as a duration predictor becomes unnecessary (the syllable durations are known at inference time). Training syllable embeddings was not possible with the limited amount of data available to us, so we chose to use phoneme embeddings and preprocess the data to be as close to sheet music as possible. First, we remove unvoiced sections detected by the aligner that have no corresponding unvoiced section in the sheet music, to prevent gaps in the duration predictor targets; we extend the neighbouring phonemes to span the resulting gaps while keeping their relative lengths constant. Phonemes not labelled by the aligner get a default length in the middle of the section they should appear in. Very long phonemes and notes are split into multiple smaller ones.
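To make the pause-removal step more concrete, here is a minimal sketch of the idea. The data structures and function names are hypothetical and the real preprocessing in our repository handles more edge cases; treat it as an illustration under those assumptions, not as our production code.

```python
# Hedged sketch: drop aligner pauses that have no rest in the sheet music and
# stretch the neighbouring phonemes over the gap, preserving their length ratio.
from dataclasses import dataclass

@dataclass
class Segment:
    phoneme: str   # "SIL" marks an unvoiced segment from the aligner (assumed label)
    start: float   # seconds
    end: float

def has_rest(sheet_rests, seg):
    """True if the sheet music contains a rest overlapping this segment."""
    return any(r_start < seg.end and seg.start < r_end for r_start, r_end in sheet_rests)

def remove_spurious_pauses(segments, sheet_rests):
    cleaned, i = [], 0
    while i < len(segments):
        seg = segments[i]
        spurious = seg.phoneme == "SIL" and not has_rest(sheet_rests, seg)
        if spurious and cleaned and i + 1 < len(segments):
            prev, nxt = cleaned[-1], segments[i + 1]
            # Split the gap proportionally to the neighbours' lengths so their
            # relative durations stay constant.
            total = (prev.end - prev.start) + (nxt.end - nxt.start)
            share = (prev.end - prev.start) / total if total > 0 else 0.5
            split = seg.start + share * (seg.end - seg.start)
            prev.end = split                                       # extend previous phoneme
            cleaned.append(Segment(nxt.phoneme, split, nxt.end))   # pull next phoneme forward
            i += 2
        else:
            cleaned.append(Segment(seg.phoneme, seg.start, seg.end))
            i += 1
    return cleaned
```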

FastSpeech 1 recommends training the duration predictor in log space:

We predict the length in the logarithmic domain, which makes them more Gaussian and easier to train.

(see Section 3.3 of the FastSpeech paper). This gives two options for implementing the log space, exponentiating the duration predictor outputs before the loss calculation or transforming the targets to the log domain, plus the baseline of not predicting in log space at all:

  1. mse(exp(x), y)
  2. mse(x, log(y))
  3. Do not predict in log space

ming024’s FastSpeech implementation uses Option 2, and xcmyz’s implementation does not predict in log space at all, as in Option 3. The argument is that log space makes the durations more Gaussian, and indeed, if we look at the distributions, the raw durations look more Poisson-like while in log space they look closer to a Gaussian.

Mathematically, Option 1 does not make the quantity inside the MSE more Gaussian, hence it does not alleviate the bias and should not make sense in this context. With an MSE loss, Option 2 should therefore be the more favorable one, while Option 1 should be roughly equivalent to Option 3 except for better numerical stability in the output layer. In line with expectations, we find the duration predictor to have a better validation loss and less bias with Option 2, but astonishingly the subjective overall quality of the generated singing is better with Option 1. It almost seems like having a biased duration predictor is a good thing. This only holds with activated syllable guidance, where the errors of the duration predictor across a syllable are corrected to yield the exact syllable duration from the sheet music. We did not conduct a MOS study to prove this point, and the subjective judgement is based only on the perception of us and the artists with whom we collaborated, so it is up to the reader to experiment on their own. However, we believe this to be an interesting question for future SVS publications. Options 1 and 3 do not really differ much, except that we ran into heavy gradient clipping with Option 3 and thus chose Option 1.
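For concreteness, here is a minimal PyTorch sketch of the three loss variants, assuming x is the raw per-phoneme output of the duration predictor and y is the target duration in frames. It is an illustration of the options above, not our training code.

```python
import torch
import torch.nn.functional as F

def duration_loss(x, y, option):
    """x: raw duration predictor outputs, y: target durations in frames (> 0)."""
    if option == 1:
        # Option 1: exponentiate the prediction, compare in the linear domain.
        return F.mse_loss(torch.exp(x), y)
    if option == 2:
        # Option 2: transform the targets, compare in the log domain.
        return F.mse_loss(x, torch.log(y))
    # Option 3: no log space at all, predict durations directly.
    return F.mse_loss(x, y)

x = torch.randn(8, requires_grad=True)                 # predictor outputs for 8 phonemes
y = torch.tensor([3., 5., 2., 12., 7., 4., 6., 9.])    # target frame counts
print(duration_loss(x, y, option=2))
```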

We had the requirement to synthesize snippets of at least 16 seconds during inference to be compatible with the compositional AI. However, training on 16 s snippets with global attention exhausted our hardware budget to such an extent that training would have become infeasible. The bottleneck is the quadratic complexity of the attention mechanism combined with the high mel resolution recommended by HifiSinger of ca. 5 ms hop size. As a result, the decoder had to form attention matrices of more than 4000×4000 elements, which neither fit into GPU memory nor yielded sensible results. After brief experiments with linear-complexity attention patterns, which resolved the hardware bottleneck but still did not yield sensible results, we switched to local attention in the decoder. We not only gained the capability of synthesizing longer snippets, but also improved overall subjective quality. After also switching the encoder to local attention, we saw another improvement in subjective quality.

To us, this makes a lot of sense. Training a global attention mechanism on snippets makes it a snippet-local attention mechanism, meaning attention is never calculated across a snippet border. Actually using local attention means that each token can always attend to at least N tokens in both directions, where N is the local attention context. Furthermore, a token cannot attend further than N tokens, which makes sense for speech processing: while features like singing style might span many tokens, most of the information for generating a mel frame should come from the note and phoneme sung at that point in time. To incorporate singing style we adapt GST, which further lowers the amount of information that needs a wide attention span. Capping the attention window makes this explicit; the model does not have to learn that the attention matrix should be largely diagonal, as it is technically constrained to produce at least some sort of diagonality. Hence, we observe a quality improvement, and we recommend local attention as a possible improvement to both TTS and SVS systems.
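As an illustration of what "local" means here, the sketch below builds a banded attention mask in which each frame may only attend to the N frames on either side. A real local attention implementation never materialises the full matrix (that is the whole point), so this is a conceptual sketch with assumed sizes, not our model code.

```python
import torch

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where entry (i, j) is True iff frame i may attend to frame j,
    i.e. |i - j| <= window."""
    idx = torch.arange(seq_len)
    return (idx[:, None] - idx[None, :]).abs() <= window

# Roughly 16 s of mel frames at a 5 ms hop size; window size is an assumption.
mask = local_attention_mask(seq_len=3200, window=64)
# In a masked-softmax formulation one would apply
#   scores.masked_fill(~mask, float("-inf"))
# before the softmax to enforce locality.
```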

In the interaction with our artist colleagues, it became clear that they would like some control over what the model synthesizes. For a human opera singer, this is incorporated through feedback from the conductor during rehearsals, which takes forms such as “Sing this part less restrained” or “More despair in bars 78 to 80”. While being able to respect such textual feedback would be great, this is a research effort of its own and exceeded the scope of the project, so we had to implement a different control mechanism. We considered three options:

  1. A FastSpeech 2-like Variance Adapter (see Section 2.3 of the FastSpeech 2 paper), which uses extracted or labelled features to feed additional embeddings to the decoder
  2. An unsupervised approach like Global Style Tokens which trains a limited number of tokens through features extracted from the mel targets, which can be manually activated during inference
  3. A semi-supervised approach that takes textual labels to extract emotion information.

Options 1 and 3 require additional labelling work, or at least sophisticated feature extraction, so we tried Option 2 first. We found GSTs to deliver reasonable results that satisfy the requirement of changing something, although the level of control was lower than we wanted. When trained with four tokens, we consistently had at least two tokens representing unwanted features such as mumbling or distortion, and the tokens generally tended to be very sensitive to small changes during inference. We believe that more data could alleviate these problems, as unsupervised approaches generally need a lot of data to work well, which we did not have.
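For readers unfamiliar with Global Style Tokens, the sketch below shows the core idea in a deliberately simplified single-head form: a small bank of learned token embeddings is mixed through attention weights that come from a reference embedding during training and can be set by hand at inference. Names, dimensions and the single-head simplification are our own assumptions, not the original GST implementation or ours.

```python
import torch
import torch.nn as nn

class StyleTokenLayer(nn.Module):
    """Minimal single-head GST-style layer (illustrative only)."""
    def __init__(self, num_tokens: int = 4, token_dim: int = 256, ref_dim: int = 128):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim) * 0.3)
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding=None, weights=None):
        if weights is None:
            # Training: attention weights come from a reference embedding that is
            # computed from the target mel spectrogram by a reference encoder.
            query = self.query_proj(ref_embedding)                     # (B, token_dim)
            scores = query @ self.tokens.t() / self.tokens.shape[1] ** 0.5
            weights = torch.softmax(scores, dim=-1)                    # (B, num_tokens)
        # Inference: weights can instead be chosen manually to steer the style.
        return weights @ self.tokens                                   # (B, token_dim)

gst = StyleTokenLayer()
style = gst(weights=torch.tensor([[0.7, 0.1, 0.1, 0.1]]))  # hand-picked token mix
```

The resulting style vector is then conditioned into the decoder; adding small random noise to the hand-picked weights is what we refer to below when creating a choir from one snippet.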

You can have a listen for yourself: remember the sample Hello TDS I can sing an opera? Here are adaptations of it with different style tokens. We can also synthesize multiple versions of the same snippet with random noise added to the style tokens to create a choir.

Copyright is complicated, especially for music. We had two problems: we were unsure which songs we could use for model training and to whom the model belongs after it is trained.

It is unclear which data can be used for training an SVS model. Multiple rulings are possible here: the most extreme would be that you can freely use any audio to train an SVS, or, in the other direction, that no part of the dataset may carry any copyright, neither the composition nor the recording. A possible middle ground could be that using compositions is not an infringement if they are re-recorded, as the resulting SVS, if it is not overfitting, will not remember the specific compositions but will reflect the voice timbre of the singer in the recording. So far, however, no relevant court rulings under German law are known to us, so we assumed the strictest interpretation and used royalty-free compositions recorded by a professional opera singer who agreed to the recordings being used for model training. A big thanks again to Eir Inderhaug for her incredible performance.

Furthermore, we had to ask who would be the copyright-eligible owner of the model outputs and of the model itself. It could be the singer whose songs were used as training data, it could be us who trained the model, nobody, or something completely unexpected. After consulting multiple legal experts in our company, we came to the conclusion: nobody knows. It is still an open legal question to whom the models and inference outputs belong. If a court rules that the creators of the dataset always have maximal ownership of the model, that would mean you and I probably own GPT-3, as it was trained on data crawled from the entire internet. If the courts rule that dataset creation does not entitle anyone to model ownership at all, there would be no legal way to stop deepfakes. Future cases will likely fall somewhere in between, but as we did not have enough precedents in German law, we assumed the worst possible ruling. For machine learning projects that rely on crawled datasets, however, this is an immense risk and a possible deal-breaker that should be assessed at project start. Music copyright especially has seen some extreme rulings. Hopefully, the legal situation will stabilize in the mid term to reduce the margin of uncertainty.

A 22 kHz HiFi-GAN does not work on 44 kHz audio. This is unfortunate, because there are plenty of speech datasets at 22 kHz that can be used for pretraining, but even fine-tuning on 44 kHz after pretraining at 22 kHz does not work at all. This makes sense, because the convolutions suddenly see everything at twice the frequency, but it meant that we had to upsample our pretraining dataset for the vocoder and start from a blank model instead of using a pretrained model from the internet. The same holds for changing mel parameters: a completely new training was necessary when we adjusted the lower and upper mel frequency boundaries.
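Upsampling the 22.05 kHz pretraining data to 44.1 kHz is straightforward, for example with torchaudio; the file names below are placeholders and this is only a sketch of the idea. Note that resampling merely makes the sample rates match, it cannot restore content above the original Nyquist frequency.

```python
import torchaudio

# Load a 22.05 kHz LJSpeech file (placeholder path) and resample it to 44.1 kHz.
waveform, sr = torchaudio.load("LJ001-0001.wav")
resample = torchaudio.transforms.Resample(orig_freq=sr, new_freq=44100)
upsampled = resample(waveform)
torchaudio.save("LJ001-0001_44k.wav", upsampled, 44100)
```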

Check your data! This lesson applies to basically any data science project. Long story short, we lost a lot of time training on poorly labelled data. In our case, we did not notice that the labelled notes were of a different pitch than what the singer produced, a mistake that happened through mixing up sheet music files. To somebody without perfect pitch, such a discrepancy is not immediately obvious, and even less so to a team of data scientists who are musically illiterate compared to the artists we worked with. We only found the mistake because one of the style tokens learned to represent pitch and we could not figure out why. In future projects, we will schedule explicit data reviews where domain experts check the data, even for mistakes nobody expects. A good rule of thumb: if you spend less than half of your time working directly with the data, you are probably overly focused on architecture and hyperparameters.
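One cheap sanity check that would likely have caught our mistake is comparing the labelled note pitch against an f0 estimate from the recording, for example with librosa's pyin. This is a hedged sketch with placeholder file names, an example MIDI label and a rough tolerance, not the check we actually ran.

```python
import numpy as np
import librosa

# Placeholder: a recording of a single sustained note.
y, sr = librosa.load("take_01_note_12.wav", sr=None)
f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C7"), sr=sr)
median_f0 = np.nanmedian(f0[voiced])          # rough pitch of the sung note

labelled_midi = 69                            # e.g. A4 from the sheet music label
expected_hz = librosa.midi_to_hz(labelled_midi)
# Flag the file if the sung pitch is more than ~2 semitones away from the label.
if abs(12 * np.log2(median_f0 / expected_hz)) > 2:
    print("Label/recording pitch mismatch, please review this file")
```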

Especially at the start of a project, be very divergent in your technology choices. Early on, we discovered MLP Singer, which seemed like a good starting system because at that time it was the only deep SVS with open-source code and an openly available dataset (CSD). By the time we learned that adapting it for the opera would likely be more effort than implementing something based on HifiSinger, we had already committed to the CSD dataset format and to similar songs. However, as mentioned previously, this format and the choice of songs have their flaws. We could have avoided being locked into that format, and the hassle that came with it, if we had spent more time critically evaluating the dataset and framework choice early on instead of focusing on getting a working prototype.

This was a very experimental project with plenty of lessons, and we grew as a team during the making. Hopefully, we managed to share some of those lessons here. If you are interested in the opera, you have the option to see it in Hong Kong until 06.11.2022. If you would like more information, contact us via mail (Maximilian.Jaeger ät t-systems.com, Robby.Fritzsch ät t-systems.com, Nico.Westerbeck ät t-systems.com) and we will be happy to provide more details.

