Music Genre Classification Using a Divide & Conquer CRNN | by Max Hilsdorf | Sep, 2022
An Effective Method for Smaller Datasets
After CNNs took off in image processing in 2012, they were quickly applied to music genre classification, and with great success! Today, training CNNs on so-called spectrograms has become state of the art, displacing almost all previously used methods based on hand-crafted features, MFCCs, and/or support vector machines (SVMs).
In recent years, it has become apparent that adding recurrent layers like LSTMs or GRUs to classical CNN architectures yields better classification results. An intuitive explanation for this is that the convolutional layers figure out the WHAT and the WHEN, while the recurrent layer finds meaningful relationships between the WHAT and the WHEN. This architecture is known as a convolutional recurrent neural network (CRNN).
When learning about divide & conquer for the first time, I was confused, because I had heard of it in a military context. The military definition is
“to make a group of people disagree and fight with one another so that they will not join together against one” (www.merriam-webster.com)
However, this is almost the opposite of its meaning for our purpose! See Table 1, where the three-step process behind divide & conquer in computer science and music genre classification is laid out.
Although the definitions are not the same, they do overlap substantially, and personally, I find the term divide & conquer very fitting in both cases. In music genre classification, the term was first used (to my knowledge) by Dong (2018). However, other researchers like Nasrullah & Zhao (2019) have also used this method — just not by the same name.
Often, genre classification is based on exactly one 30-second snippet of a track. This is partly because commonly used music data sources like the GTZAN or FMA datasets or the Spotify Web API provide tracks of this length. However, there are three major advantages to applying the divide & conquer approach:
1. More Data
Given that you have full-length tracks available, taking a 30-second slice and throwing away the rest of a 3–4 minute track is very data inefficient. Using divide & conquer, most of the audio signal can be used. Moreover, by allowing for overlap between slices, even more snippets can be drawn. In fact, I would argue that this is a form of natural and fairly seamless data augmentation.
For example, you can get more than 80x the training data from a 3-minute track if you draw 3-second snippets with an overlap of 1 second compared to drawing one 30-second snippet per track. Even if you only have 30-second tracks available, you can get 14 snippets out of each of them with 3-second snippets and 1 second of overlap.
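The snippet counts above follow from simple window arithmetic. Here is a minimal sketch, assuming snippets are drawn at a fixed hop of snippet length minus overlap and that an incomplete snippet at the end of a track is discarded. The function name is hypothetical and not part of SLAPP.

```python
def count_snippets(track_seconds, snippet_seconds, overlap_seconds):
    """Number of full snippets that fit into a track for a given overlap."""
    if overlap_seconds >= snippet_seconds:
        raise ValueError("overlap must be shorter than the snippet length")
    if track_seconds < snippet_seconds:
        return 0
    # Each new snippet starts one hop after the previous one.
    hop = snippet_seconds - overlap_seconds
    return int((track_seconds - snippet_seconds) // hop) + 1


print(count_snippets(180, 3, 1))  # 3-minute track   -> 89
print(count_snippets(30, 3, 1))   # 30-second track  -> 14
```

So a 3-minute track yields 89 snippets instead of a single 30-second one, and a 30-second track yields 14.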
2. Lower-Dimensional Data
A 30-second snippet produces quite a large spectrogram. With common parameters, one spectrogram can have a shape of (1290 x 120), i.e. over 150k data points. Naturally, a 3-second snippet with the same parameters will produce a ~(129 x 120) spectrogram with only 15k data points. Depending on the machine learning model and architecture you are using, this can reduce the model complexity significantly.
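To see where shapes like these come from, here is a rough sketch of the frame arithmetic. It assumes a sample rate of 22,050 Hz and a hop length of 512 samples (librosa's defaults, not values stated in this article) and a centered STFT, for which the number of frames is 1 + n_samples // hop_length. Under these assumptions, the counts land close to the rounded figures above.

```python
def spectrogram_shape(duration_seconds, n_mels, sample_rate=22050, hop_length=512):
    """Approximate (frames, mel bands) of a mel spectrogram from a centered STFT."""
    n_samples = int(duration_seconds * sample_rate)
    n_frames = 1 + n_samples // hop_length
    return n_frames, n_mels


for seconds in (30, 3):
    frames, mels = spectrogram_shape(seconds, n_mels=120)
    print(f"{seconds:>2} s -> ({frames} x {mels}) = {frames * mels:,} data points")
```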
Side note: In case you are unaware of what spectrograms are or why and how they are used for audio classification, I recommend this article by The Experimental Writer for a nice and intuitive explanation.
3. Indifferent to Audio Input Length
If you want to apply your trained model in the real world, you are going to encounter tracks of all lengths. Instead of having to struggle with finding just the right 30-second snippet to extract from it, you just start drawing 3-second snippets until the whole track is covered. And what if the input track is less than 30 seconds long? Is your model robust enough to deal with 10-second jingles which are zero-padded to reach 30 seconds? With divide & conquer, this is no issue.
Disadvantages
There are three major downsides to using divide & conquer. Firstly, it adds extra processing steps to split the audio files and to aggregate the snippet predictions for a full track. Secondly, the snippet-based approach requires a track-wise train-validation-test split: if snippets of the same track end up in different splits, the training, validation, and test datasets are correlated and the evaluation scores become overly optimistic.
Luckily, both of these steps are already included in my single-label audio processing pipeline SLAPP, which is freely available on GitHub.
Lastly, the individual snippet predictions are usually aggregated using some sort of majority vote. With that, your model is completely oblivious to any musical relationships which unfold over more than, e.g., 3 seconds of audio, unless you develop another, more complex meta classifier for the aggregation process.
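To make the track-wise split more concrete, here is a minimal sketch of the idea. It is not SLAPP's implementation; the function name, the snippet naming scheme, and the fixed seed are assumptions for illustration. The key point is that whole tracks are assigned to a split, so all snippets of a track end up in the same dataset.

```python
import random
from collections import defaultdict


def track_wise_split(snippet_to_track, val_share=0.1, test_share=0.1, seed=42):
    """Assign whole tracks (never single snippets) to train, val, or test."""
    tracks = sorted(set(snippet_to_track.values()))
    random.Random(seed).shuffle(tracks)

    n_val = round(len(tracks) * val_share)
    n_test = round(len(tracks) * test_share)

    # Decide the split once per track.
    split_of_track = {}
    for i, track in enumerate(tracks):
        if i < n_val:
            split_of_track[track] = "val"
        elif i < n_val + n_test:
            split_of_track[track] = "test"
        else:
            split_of_track[track] = "train"

    # Every snippet inherits the split of its track.
    splits = defaultdict(list)
    for snippet, track in snippet_to_track.items():
        splits[split_of_track[track]].append(snippet)
    return dict(splits)


# Toy example: 10 tracks with 14 snippets each (hypothetical names).
snippets = {
    f"track{t:02d}_snip{s:02d}": f"track{t:02d}"
    for t in range(10)
    for s in range(14)
}
splits = track_wise_split(snippets)
print({name: len(items) for name, items in splits.items()})
```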
When to Use Divide & Conquer
If you either have lots of data and require no more of it or if you want to analyze musical structures which unfold over longer time frames, you may not want to use a divide & conquer approach. However, if you have limited data and want to build a robust and flexible classifier with it, do consider this exciting and effective approach!
What is SLAPP?
This article also serves as a showcase for my newly developed single-label audio processing pipeline (SLAPP). This tool automates the entire data processing workflow for single-label audio classification tasks. This includes splitting the tracks into snippets, computing spectrograms, performing a track-wise train-validation-test split, and much more.
Since processing audio data takes a really long time on a home computer, SLAPP allows you to shut down your computer at any time and reload your progress without any significant loss of time or data. After having developed this pipeline for my bachelor’s thesis, I tried it out on a couple of classification tasks and had great fun and success with it.
Check out SLAPP on GitHub and try it out for yourself.
Using SLAPP to Process GTZAN
GTZAN is a well-known and publicly available dataset for genre classification, featuring 100 30-second-long tracks for each of 10 different genres, i.e. 1,000 tracks in total. This dataset is perfect for our purposes because there is a huge body of research behind it and because it only has a limited amount of data.
In order to use SLAPP, you need to place your MP3 (not WAV) files in a folder structure just like the one shown in Figure 1.
Since GTZAN comes in exactly such a folder structure (yay!), all we need to do now is to convert the WAV files into MP3 files and we can get started. I suggest you do this by looping through the directory and copying each file into a new directory with the same structure using
from pydub import AudioSegment
AudioSegment.from_wav("/input/file.wav").export("/output/file.mp3", format="mp3")
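The loop around this call could look roughly like the sketch below. The folder names and both function names are hypothetical. The path planning uses only the standard library, while the conversion itself relies on the same pydub call as above and therefore needs pydub and FFmpeg installed.

```python
from pathlib import Path


def plan_conversions(src_root, dst_root):
    """Map every WAV file under src_root to an MP3 path under dst_root,
    keeping the genre subfolders intact."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    return [
        (wav, (dst_root / wav.relative_to(src_root)).with_suffix(".mp3"))
        for wav in sorted(src_root.rglob("*.wav"))
    ]


def convert_all(src_root, dst_root):
    """Convert all planned files from WAV to MP3."""
    # Imported here so that the planning step works without pydub installed.
    from pydub import AudioSegment

    for wav, mp3 in plan_conversions(src_root, dst_root):
        mp3.parent.mkdir(parents=True, exist_ok=True)
        AudioSegment.from_wav(str(wav)).export(str(mp3), format="mp3")


# convert_all("gtzan_wav", "gtzan_mp3")  # hypothetical folder names
```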
Next, all you need to do is clone the SLAPP repository by navigating into your desired directory and using:
git clone https://github.com/MaxHilsdorf/single_label_audio_processing_pipeline
Make sure that your system fulfills all the requirements to use SLAPP, for instance the Python libraries Pydub and Librosa as well as the multimedia framework FFmpeg (described in detail in the repository).
Now, within SLAPP, we open “pipeline_parameters.json” and set the parameters like this:
This way, we extract up to 14 3-second slices from each 30-second track within GTZAN, allowing for an overlap of 1 second. By running “build_dataset.py”, the dataset is built, and by running “process_dataset.py”, the dataset is then normalized and shuffled. That is all you have to do. SLAPP takes care of the rest for you.
With validation and test splits of 10% each, you are going to obtain 11,186 training spectrograms and 1,386 spectrograms each for the validation and test data. Each spectrogram has a shape of (130 x 100).
Side note: If you are not just reading along, but actually coding along, this processing step will take a while, because there are lots of computational steps. At any time during the process, you can stop the scripts and relaunch them to resume where you left off. If you experience any technical issues with SLAPP, feel free to let me know, since this is my first software release and it may have lots of bugs or my documentation may be insufficient.
Model Building
Using the Keras library, I built a CRNN like this:
The idea behind this CRNN architecture is to use convolutional layers + max pooling to squeeze the frequency dimension to a scalar value while keeping the time dimension as a vector. This allows us to apply a gated recurrent unit (GRU) to the features extracted by the convolutional blocks. This type of architecture was (to my knowledge) introduced by Choi et al. (2017) and has been adopted in several studies, for instance by Nasrullah & Zhao (2019) and Bisharad & Laskar (2019).
See Figure 2 for a detailed overview of the architecture used.
In the development phase, I tried out different variations of this architecture and compared the validation loss between these models. This particular model achieved top results with only ~400k parameters.
Side Note: SLAPP will, in this case, give you a training data tensor of shape (11186, 100, 130). However, to train this specific CRNN, the input needs to be reshaped to (11186, 130, 100, 1), effectively swapping the time- and frequency axes (numpy.swapaxes) and adding another dimension (numpy.expand_dims).
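As a small sketch of this reshaping step, the snippet below uses a zero-filled placeholder array instead of the real SLAPP output. The variable names are hypothetical, and only 8 snippets are used to keep the example light.

```python
import numpy as np

# Placeholder with the layout described above: (n_snippets, 100, 130).
# The real training tensor would have 11,186 snippets instead of 8.
x_train = np.zeros((8, 100, 130), dtype=np.float32)

# Swap the last two axes, then append a channel axis for the conv layers.
x_swapped = np.swapaxes(x_train, 1, 2)          # -> (8, 130, 100)
x_ready = np.expand_dims(x_swapped, axis=-1)    # -> (8, 130, 100, 1)

print(x_train.shape, "->", x_ready.shape)
```

The same two calls need to be applied to the validation and test tensors before they are passed to the model.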
Model Training
I trained the model with the Adam optimizer, a learning rate of 0.0001, categorical cross-entropy loss, and a batch size of 32. Additionally, I used the EarlyStopping, ReduceLROnPlateau, and ModelCheckpoint callbacks provided by Keras to ensure smooth training and that only the best model is saved.
As you can see in Figure 3, the training metrics reach precision and recall scores close to 90%, while the validation precision and recall scores converge somewhere between 62 and 73%. Seeing as we are dealing with only 3 seconds of audio and 10 balanced classes (i.e. 10% accuracy through random guessing), this is already quite impressive. Overall, the best model in this training run achieved a validation accuracy of 67.32%.
To achieve more stable results, 10 training runs were started and evaluated to obtain mean scores and standard deviations for the metrics of interest. Keep in mind that this article does not aim to offer a sophisticated statistical evaluation, but rather to show the effectiveness of divide & conquer classification using an example.
How to Compute Track-Based Predictions?
For the track-based predictions, I loaded each track in the test dataset as an MP3 file. The test track names are automatically written to “train_val_test_dict.json” in your data directory by SLAPP. From there, the procedure I used is the following:
- Split the track into as many snippets as possible (3-second snippets with 2 seconds of overlap in this case).
- Transform each snippet into a mel spectrogram that suits the input shape of the trained model.
- Get the raw class predictions for each snippet.
- Sum the raw scores for each class across all snippets.
- Your global prediction is the class with the highest score sum.
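The first and the last two steps of this procedure can be sketched as follows. This is not the module mentioned in this article. The function names are hypothetical, and the snippet scores are made-up numbers standing in for the raw outputs of a trained model.

```python
import numpy as np


def snippet_starts(track_seconds, snippet_seconds=3.0, overlap_seconds=2.0):
    """Start times (in seconds) of all full snippets in a track."""
    if track_seconds < snippet_seconds:
        return []
    hop = snippet_seconds - overlap_seconds
    n_snippets = int((track_seconds - snippet_seconds) // hop) + 1
    return [i * hop for i in range(n_snippets)]


def aggregate_track_prediction(snippet_scores):
    """Sum the per-class scores over all snippets and pick the best class."""
    scores = np.asarray(snippet_scores, dtype=float)
    if scores.ndim != 2 or scores.shape[0] == 0:
        raise ValueError("expected a non-empty (n_snippets, n_classes) array")
    class_sums = scores.sum(axis=0)
    return int(np.argmax(class_sums)), class_sums


# Made-up scores for 3 snippets and 3 classes.
toy_scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
]
predicted_class, class_sums = aggregate_track_prediction(toy_scores)
print(len(snippet_starts(30.0)), "snippets for a 30-second track")
print("predicted class:", predicted_class)
```

In this toy example, class 0 wins two of the three snippets, but class 1 has the highest score sum and becomes the track prediction. Summing raw scores therefore takes the confidence of each snippet prediction into account, which a plain majority vote would not.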
Since this can get quite complicated for beginners, I am going to provide you with another module that automates this track-based prediction process: Please find the module and its documentation here.
Classification Results
I trained and evaluated the model described above ten times and averaged the accuracy scores to get rid of some of the randomness of the process.
See Figure 4 for a visual overview of the accuracy scores obtained. The first thing we notice is that the model performed approximately equally well on the validation and test data, which is a good sign. Further, all standard deviations are really small, which tells us that we can rely on differences in mean accuracy.
It is easy to see that the track-based classification outperformed the snippet-based classification by an average difference of ~12.2 percentage points. While we are not going to do any statistical tests for the sake of simplicity, this gap is large compared to the small standard deviations, so the track-based classification through divide & conquer is clearly superior in this case.
What If We Had Used The Full 30-second Snippets?
So far, we have shown that (in this example) the track-based divide & conquer classification outperforms the 3-second single-snippet classification. However, maybe we would have achieved a much higher accuracy if we had built a CRNN on the 1000 30-second audio slices in “raw” GTZAN.
That is possible, so I tried it out. I built another CRNN of a similar architecture (although ~half the parameters because there is less data) based on 30-second slices and trained it 10 times to average the accuracy scores obtained. This method averages out at an accuracy of only 55.8% (SD=0.043), underperforming any of the 3-second models by a large margin.
Divide & Conquer Does In Fact Conquer
This article shows how a divide & conquer approach can help to build a strong classifier for audio classification problems even when not much training data is present. The data processing pipeline SLAPP makes it really easy to perform the processing steps necessary to build a divide & conquer classifier for single-label audio classification tasks.
3 Seconds are Better Than 30 Seconds
How come 30-second slices are even worse than 3-second slices although they hold more information? It seems that increasing the dataset size by drawing 3-second slices helps the model to extract meaningful features, whereas it seems to get lost when dealing with a really small dataset consisting of huge spectrograms.
Arguably, if other data formats like MFCCs or hand-crafted features with a smaller dimensionality had been used, the model would have dealt more easily with the 30-second audio files. However, this article suggests that even a small audio dataset can be classified well with spectrogram data if divide & conquer is used.
Other Applications of SLAPP + Divide & Conquer
If you want to try the proposed method on another similar problem, here are some ideas that may inspire your next data science project:
- Sex recognition based on speech data
- Accent recognition based on speech data
- Emotion classifier for music
- Building classifiers using audio data from machine sensors
Thank you very much for your interest in my work! Please let me know if anything isn’t working for you or if you have trouble understanding or applying the concepts.
[1] Bisharad, D. & Laskar, R. H. (2019). Music Genre Recognition Using Convolutional Recurrent Neural Network Architecture. In: Expert Systems 36,4.
[2] Choi, K.; Fazekas, G.; Sandler, M. & Cho, K. (2017). Convolutional Recurrent Neural Networks for Music Classification. In: International Conference on Acoustics, Speech, and Signal Processing 2017.
[3] Dong, M. (2018). Convolutional Neural Network Achieves Human-level Accuracy in Music Genre Classification. In: Conference on Cognitive Computational Neuroscience, 5–8 September 2018, Philadelphia, Pennsylvania.
[4] Nasrullah, Z. & Zhao, Y. (2019). Music Artist Classification from Audio, Text, and Images Using Deep Features. In: arXiv, DOI: 10.48550/arXiv.1707.04916