
Classifying Music Genres with LightGBM | by Louis Magowan | Jun, 2022



Optuna Hyperparameter Optimization and Dimension Reduction

Image by Tima Miroshnichenko from Pexels

This article outlines a process for using tuned LightGBM models to classify songs into genres, based on their audio and lyric features. It draws on publicly available music data from a Kaggle dataset. LightGBM is a gradient-boosting framework, based on decision tree algorithms, that is regarded as one of the best out-of-the-box models available today. It comes with a wide range of tunable parameters, which we will attempt to optimize using Optuna, a hyperparameter optimization framework.

The article’s outline is as follows:

  • Data Overview and Pre-processing
  • Exploratory Analysis
  • Exploratory Dimension Reduction of Lyric Features
  • Modelling and Hyperparameter Optimization
  • Results
  • Take-Home Points

The Kaggle dataset contains over 18,000 songs, with information on each song's audio features (e.g. how energetic or speech-heavy it is, or what tempo or key it is played in), as well as its lyrics. A more detailed overview of the available columns can be seen on Kaggle.

In its initial form the data is a bit hard to work with, so it required some filtering and tidying up before it could be used. The full outline of exactly what pre-processing steps were taken can be viewed within the GitHub repo. The prep_kaggle_data.py script can be used to process the data from Kaggle, or, alternatively, the processed data is also available as a zipped CSV within the repo.

The pre-processing steps are the least interesting part of this project, so we’ll gloss over them in a summary:

  1. Filter the data to include only English-language songs and remove songs of the “Latin” genre (these were almost entirely in Spanish and would have introduced a significant class imbalance).
  2. Tidy up song lyrics by lowercasing them and removing punctuation and stopwords. Count how many times each remaining word appears in a song’s lyrics, then filter out the words that are least frequent (messy data/noise) across all songs.
  3. Transform the remaining word counts so that each song has columns counting how many times each word appeared in its lyrics.

All the code for this section and the following one can be found within the eda.ipynb notebook in the repo.

Let’s check out the class balance for our labels (genres).

Image by Author

There appears to be a bit of class imbalance, so cross-validation and train-test splits should probably be stratified.

What about the sparsity of the audio and lyric features? The dataframe below shows their respective percentage sparsities. The audio features are essentially dense: the only sparsity in them comes from the mode column (mode=1 if the song is in a major key, mode=0 if minor). By contrast, the lyric features are extremely sparse. This makes sense, as the majority of non-stopword words in lyrics aren’t going to appear consistently across different songs.

Image by Author
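A minimal sketch (not the repo's exact code) of how such percentage sparsities could be computed with pandas and NumPy, using invented example columns:

```python
# Percentage sparsity = share of entries that are exactly zero.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "danceability": [0.7, 0.5, 0.9],  # dense audio feature
    "mode":         [1, 0, 1],        # the only audio source of zeros
    "word_love":    [0, 0, 2],        # sparse lyric count columns
    "word_night":   [1, 0, 0],
})

def pct_sparsity(frame: pd.DataFrame) -> float:
    # Fraction of zero entries across the whole frame, as a percentage.
    return 100 * (frame.to_numpy() == 0).mean()

audio_cols = ["danceability", "mode"]
lyric_cols = ["word_love", "word_night"]
print(pct_sparsity(df[audio_cols]))
print(pct_sparsity(df[lyric_cols]))
```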

The sparsity of the lyric features could be a good indication that they would be well-suited for dimension reduction.

Many machine learning algorithms can perform worse if they deal with data that has an extremely large number of features (dimensions). This is particularly the case if many of those features are highly sparse. This is where dimension reduction can be useful.

The idea is to project the high dimensional data into a lower dimension subspace, while retaining as much of the variance present in the data as possible.

We will initially use two methods (PCA and t-SNE) to explore whether it is appropriate to use dimension reduction on our lyric data, as well as get an early indication of what a good range of dimensions to reduce into might be. The actual dimension reduction used during modelling will be implemented slightly differently (user’s choice of Truncated SVD, PCA and a Keras autoencoder), as we’ll see later.

Principal Component Analysis

A commonly used method of dimension reduction is Principal Component Analysis, or PCA (a good primer on it can be found here). We can use it to look at how much variance in the lyric columns we can explain relative to the number of dimensions we reduce them into. For example, by reducing the lyrics to ~400 dimensions (principal components, in this case) we still retain 60% of the variance in the lyrics; at ~800 dimensions we retain 80%. An added advantage of reducing the dimensions is that it eliminates the sparsity of the lyric features, making them easier to model with.

Image by Author

The code for the above plot and PCA can be found below.

Gist by Author
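As an illustrative stand-in for the gist, here is a minimal sketch of the explained-variance-vs-components calculation, run on synthetic count data (the real analysis uses the ~1,800 lyric columns):

```python
# Cumulative explained variance from PCA, and the smallest number of
# components needed to retain a target share of variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.poisson(0.1, size=(200, 50)).astype(float)  # sparse-ish stand-in counts

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components retaining at least 60% of the variance.
n_60 = int(np.searchsorted(cumvar, 0.60) + 1)
print(n_60, cumvar[-1])
```

Plotting `cumvar` against the component index gives the curve shown above.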

t-SNE Visualisation

We can also go a step further and visualise how separable our data is across a range of dimension reductions. The t-SNE algorithm can be used to further reduce our lyric principal components into 2 dimensions, i.e. a graph the human brain can perceive. For more information on t-SNE, a good article is here. Essentially, the graph shows what our lyric data looks like when projected into 2-D space using, e.g., all 1806 features, or 1000, 500, or 100 principal components.

Image by Author

Ideally, we would want to see that at some particular number of reduced dimensions (e.g. cutoff = 1000) the genres became much more separable.

However, based on the results of the t-SNE plots, it does not appear as if any particular number of dimensions/principal components is going to result in data that is easier to separate. All the genres seem to be fairly mixed in lyric features. Accurate classification of all genres is thus likely to be difficult.

The code for the t-SNE plots is as follows:

Gist by Author
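A hedged sketch of the t-SNE step, on stand-in random data: reduce to principal components first, then project to 2-D for plotting. The perplexity value and data shapes are illustrative only.

```python
# PCA followed by t-SNE down to 2 dimensions for visualisation.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 40))  # stand-in for the lyric feature matrix

# First reduce to a manageable number of principal components (the
# "cutoff" varied in the article's grid of plots).
X_pca = PCA(n_components=20).fit_transform(X)

# Then embed into 2-D; each point would be coloured by its genre label.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X_pca)
print(X_2d.shape)
```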

We can get the data into its final, model-ready form using the following code. We do it this way so that we can easily change both the method of dimension reduction applied to the lyric features and the number of dimensions to reduce into. We’ll be experimenting with PCA, Truncated SVD and a Keras undercomplete autoencoder across a range of output dimensions.

In the interests of brevity, the code for the autoencoder has been omitted but can be found in the autoencode function within the custom_functions.py file in the repo.

Gist by Author
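The swappable-reducer idea can be sketched with a helper like the one below. The function name and signature are hypothetical, not the repo's exact code, and the autoencoder branch is omitted here as it is in the article.

```python
# A reducer that can be switched between PCA and Truncated SVD, so the
# method and output dimension are easy to vary between experiments.
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

def reduce_lyrics(X: np.ndarray, method: str = "pca", n_dims: int = 400) -> np.ndarray:
    if method == "pca":
        reducer = PCA(n_components=n_dims)
    elif method == "svd":
        # TruncatedSVD can work directly on sparse matrices, unlike PCA.
        reducer = TruncatedSVD(n_components=n_dims)
    else:
        raise ValueError(f"Unknown method: {method}")
    return reducer.fit_transform(X)

X = np.random.default_rng(0).poisson(0.2, size=(100, 60)).astype(float)
print(reduce_lyrics(X, "svd", n_dims=10).shape)
```

In practice the reducer should be fit on the training split only, then applied to the test split, to avoid leakage.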

Now that we have the data normalised, transformed and reduced in the way we want, we can start modelling with it.

Building the Model

First things first, let’s define an evaluation metric to assess our model’s performance with and to optimize against. As the genres/classes in our data are slightly imbalanced, a macro F1 score could be a good option to use as it values the contributions of the classes equally. It is defined below, along with some fixed parameters that we’ll apply to all of our LightGBM models.

Gist by Author
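As an illustrative sketch of such a metric and fixed-parameter setup (the values shown are plausible examples, not necessarily the article's exact choices): custom LightGBM eval metrics return a `(name, value, is_higher_better)` tuple, and for multiclass problems receive class probabilities.

```python
# A macro F1 eval metric in the shape LightGBM's sklearn API expects,
# plus example fixed parameters shared across all models.
import numpy as np
from sklearn.metrics import f1_score

def macro_f1(y_true, y_pred_proba):
    # For multiclass, LightGBM passes probabilities; take the argmax class.
    y_pred = np.asarray(y_pred_proba).argmax(axis=1)
    return "macro_f1", f1_score(y_true, y_pred, average="macro"), True

fixed_params = {
    "objective": "multiclass",  # illustrative fixed parameters
    "random_state": 42,
    "verbosity": -1,
}

proba = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
name, score, _ = macro_f1([0, 1, 1], proba)
print(name, score)
```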

Next, we have to define an objective function for Optuna to optimize against. It’s a function that returns a performance metric, which in our case is going to be a stratified 5-fold cross-validation, supercalifragilistic (jk jk, but it sure is a mouthful), macro F1 score.

LightGBM comes with tonnes of tunable parameters, so there’s quite a lot going on in this bit of code. If you’d like to learn more about them, the documentation is good, as are this and this article.

The key takeaway, however, is that in the param argument we are defining a search space of possible hyperparameters (initially quite a broad search space) which Optuna is going to test our model with.

Gist by Author

Hyperparameter Optimization

Now that we have our objective function, we can use Optuna to tune the hyperparameters for our model.

We create a “study”, which runs our objective function with different hyperparameter values, with each run of the model being referred to as a “trial”. The study keeps a record of which hyperparameter values/combinations were used in a particular trial, as well as the model’s performance in that trial (in terms of 5-fold, stratified CV macro F1).

We can also assign a pruner to the study in order to reduce training times. The HyperbandPruner will prematurely end, or “prune”, a trial if it is clear early on that the current selection of hyperparameters will lead to a poor performance.

Gist by Author

We can then visualize the performance of the different hyperparameter combinations using Optuna’s visualization module. For example, we can use plot_param_importances(study) to see which hyperparameters mattered most for model performance.

Graph by Author

We can also use plot_parallel_coordinate(study) to see which combinations/ranges of hyperparameters were tried that led to high objective values (good performances).

Image by Author

We can then use plot_optimization_history to look at what the best objective value/strongest model performance was versus the number of trials that were run.

Graph by Author

Finally, we can then either:

  • Run our final model using the best hyperparameters identified by the study. The best hyperparameters are stored within the study.best_params attribute. The params argument in the final model would then just need to be updated to params = {**fixed_params, **study.best_params}, as seen in the code below.
  • Or, run further rounds of hyperparameter tuning/studies, narrowing your search space/hyperparameter ranges to be closer to the previously identified best hyperparameter values with each extra round. Then run your final model using study.best_params.
Gist by Author
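The merge step itself is a one-liner. A minimal sketch, with illustrative stand-in dictionaries in place of the real `fixed_params` and `study.best_params`:

```python
# Merge fixed parameters with the study's best hyperparameters.
fixed_params = {"objective": "multiclass", "random_state": 42}
best_params = {"num_leaves": 31, "learning_rate": 0.05}  # e.g. study.best_params

# Later keys win on conflict, so tuned values override any fixed defaults.
params = {**fixed_params, **best_params}
print(params)
```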

Okay, now we’ve got our final model let’s evaluate it! We’re going to look at train, cross-validation and test F1 scores. We can also store our results into a dataframe along with the dimension reduction method used, the number of reduced dimensions used, and the number of trials we ran the study for.

Gist by Author

By saving the results of all the above code into a CSV, we can compare a range of reduction methods, trials, and reduced dimensions to see which ones give the best model overall. Of the values tried, PCA with 400 reduced dimensions and 1000 trials appears to have yielded the best model, achieving a test macro F1 score of 66.48%.

Image by Author

More values for reduced dimensions and greater numbers of trials could be tried, but this quickly becomes computationally expensive (hours and hours to run).

Take-Home Points

  1. Pre-process the music data, tidying up lyrics into columns that count occurrences of non-stopword words across songs.
  2. Exploratory data analysis: Consider class balance and feature sparsity. Other EDA can be found in the eda.ipynb notebook too.
  3. Exploratory dimension reduction of highly sparse lyric features: Use PCA and then t-SNE to visualise (project into 2-D space) the music genres over a range of reduced dimension options/cutoffs.
  4. Split the data into test and train sets, then process it with your chosen method of dimension reduction (Truncated SVD, PCA or Keras undercomplete encoder).
  5. Define your objective function/build the LightGBM model. Use the Optuna study to trial a range of hyperparameter values (search space) for it.
  6. Optional: Repeat step 5, but narrow the ranges of hyperparameter values to be closer to the best values identified by the first study/round of hyperparameter tuning.
  7. Run your final model using the best hyperparameter values found by the Optuna study.
  8. Evaluate the model on the test set.
References

  • Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019, July). Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2623–2631).
  • Autoencoder Feature Extractions, Machine Learning Mastery
  • Dataset: Kaggle Dataset of Music Data, Muhammad Nakhaee. Originally scraped from Spotify API, licensed for limited commercial use for Non-Streaming SDAs.
  • Kaggler’s Guide to LightGBM Hyperparameter Tuning with Optuna in 2021, Bex T.
  • You Are Missing Out on LightGBM. It Crushes XGBoost in Every Aspect, Bex T.

