
Songs to playlist classification using NLP | by Gabriele Albini | Nov, 2022



A guided approach to assign new songs to Spotify playlists, using word2vec and logistic regression

Photo by israel palacio on Unsplash

This article presents an NLP project aimed at assigning songs to playlists.

Two playlists were selected from Spotify and, via the Spotify API, information such as artist, song title, and popularity was downloaded. The song lyrics, not available through the API, were obtained via web scraping.

Next, some data pre-processing steps were performed on the raw lyrics in order to train a Word2Vec model and encode the text into high-dimensional vectors.

A 2D representation of each playlist was generated using PCA. Finally, we approached the task of assigning new songs to playlists, solving it with a Logistic Regression model and presenting the result graphically.

Here’s an overview of the approach used:

Image by author

The article is based on the song lyrics found in these two Spotify playlists:

  • The first is the Global playlist, which includes the 50 most-listened-to songs across the platform. The playlist is updated daily and generally includes trending pop songs.
  • The second playlist is a metal mix, including the 50 most-streamed metal songs.

The music genres of the two playlists seem quite “far” apart: let’s verify whether this “distance” also appears in the vectors we will obtain with NLP models, and whether it helps us in a playlist assignment task.

1.1 Downloading playlist information with the Spotify API

In order to use the Spotify API, we first need to create a developer account at https://developer.spotify.com/ and register a new app.

Next, by clicking on our app, we can obtain the credentials needed to authenticate (Client ID and Client Secret):

Image by author

With the above information, we can connect to the API:
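Here is a minimal sketch of the connection, assuming the popular spotipy client library (the article’s original code is not shown, so this is an illustration, not the author’s exact implementation):

```python
# A minimal sketch using spotipy (an assumption).
# The credentials come from the app dashboard shown above.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

auth_manager = SpotifyClientCredentials(
    client_id="YOUR_CLIENT_ID",          # from the app dashboard
    client_secret="YOUR_CLIENT_SECRET",  # from the app dashboard
)
sp = spotipy.Spotify(auth_manager=auth_manager)
```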

Finally, by passing the playlist IDs (which we can get from the playlist URLs on Spotify), we can develop a function to get the information we want, specifically: Artist, Title, Album, Popularity.

(Genre is a field that was manually created to differentiate between the two playlists).
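A minimal sketch of such a function, again assuming spotipy (the playlist ID below is a placeholder):

```python
def get_playlist_tracks(sp, playlist_id, genre):
    """Return Artist / Title / Album / Popularity for each playlist track."""
    rows = []
    for item in sp.playlist_items(playlist_id)["items"]:
        track = item["track"]
        rows.append({
            "Artist": track["artists"][0]["name"],
            "Title": track["name"],
            "Album": track["album"]["name"],
            "Popularity": track["popularity"],
            "Genre": genre,  # manual label to differentiate the two playlists
        })
    return rows

global_songs = get_playlist_tracks(sp, "GLOBAL_PLAYLIST_ID", "global")
```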

Image by author

1.2 Scraping song lyrics

At the time of writing, the Spotify API doesn’t allow extracting song lyrics. However, with the artist and title, we can perform some web scraping in order to obtain them.

The code developed for this purpose is quite long and beyond the scope of this article. I created a tutorial on web scraping, which I reference here.

With this step done, the data finally looks like:

Image by author

The data used in the article has been uploaded to this GitHub repository.

1.3 Data Pre-processing

Our goal is to embed words into vectors using the Word2Vec model.

In order to do so, we will proceed with the following steps, so that the raw text is cleaned and transformed into a format the model can use (a code sketch follows the list):

  • Remove useless lines (e.g. “Line repeat” or “Chorus” markers that can be found in lyrics), numbers, and symbols
  • Lowercase each word
  • Lemmatize words: we preferred lemmatization to stemming. The two techniques gave very similar results in the project, but while stemming truncates words, lemmatization transforms them into a common “base” form, giving more readable results. For example, the word “caring” would be truncated to “car” with stemming, but transformed into “care” with lemmatization
  • (Stopwords, which are normally eliminated in NLP pre-processing, were kept here. Removing them didn’t improve results on this project, probably because the vocabulary is not that large and the Word2Vec model can make use of them when evaluating context)
  • (Finally, word tokenization will be performed at model training)
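Here is a minimal sketch of these cleaning steps, assuming NLTK’s WordNet lemmatizer and a `raw_lyrics` string per song (the line-filter pattern is a hypothetical example):

```python
import re
from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

def clean_song(raw_lyrics):
    cleaned_lines = []
    for line in raw_lyrics.splitlines():
        # Skip structural markers such as "Chorus" or "Line repeat"
        if re.fullmatch(r"\s*(\[.*\]|chorus.*|line repeat.*)\s*", line, re.IGNORECASE):
            continue
        # Remove numbers and symbols, then lowercase
        line = re.sub(r"[^A-Za-z\s]", " ", line).lower()
        # Lemmatize (POS tags omitted for brevity; verbs may need pos="v")
        words = [lemmatizer.lemmatize(w) for w in line.split()]
        if words:
            cleaned_lines.append(" ".join(words))
    return cleaned_lines  # one element per song line
```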

After these steps, each song is converted to a list in which each element is a song line:

Image by author

1.4 Playlists overview with word clouds

We will now have a very first look at the data by generating two word cloud charts, one per playlist.

To produce the images below, the lyrics of all songs were merged into two variables, one per playlist, and the word clouds were then generated using masks.
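A minimal sketch with the wordcloud package (`metal_text` and the mask file are assumptions):

```python
import numpy as np
from PIL import Image
from wordcloud import WordCloud

mask = np.array(Image.open("metal_mask.png"))  # hypothetical mask image
wc = WordCloud(background_color="white", mask=mask).generate(metal_text)
wc.to_file("metal_wordcloud.png")
```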

These charts are based on word occurrences, highlighting the words that occur the most in each playlist:

Metal playlist | Image by author
Global playlist | Image by author

In order to train a model able to assign new songs to the playlists, we will need to embed lyrics into vectors.

There are several strategies to do that, which could be grouped into two categories:

  1. Models based on term frequencies, known as “bag of words” approaches, such as the Tf-Idf model or N-grams
  2. Models that use simple neural networks to extract a vector representation that takes the “distance” (or similarity) between words into account. These models are more advanced, and their starting point is that two words appearing in the same context should be “close” in meaning. With these models, it is therefore possible to capture similarities between words based on the contexts they appear in. They can be further categorized into two approaches:
  • Global matrix factorization methods (e.g. LSA, LDA)
  • Local context window methods (e.g. Word2Vec using CBOW / skip-gram). This will be our focus area.

(Note: There are many other approaches that can be used, such as GloVe which aims at extracting word meaning from the distribution of word occurrences using the full corpus — [1] Pennington et al., 2014).

Word2Vec will be used in this project, as the Python Gensim library provides built-in methods that exactly serve our purposes.

2.1 Word2Vec model overview

Mikolov et al. (2013) [2] developed this model, which consists of a one-hidden-layer neural network trained on a word classification task. The network learns the syntactic and semantic relationships of a word with its context (using both preceding and following words in a given window).

There are two possible algorithms that can be used within the model: Skip-gram and continuous bag-of-words (CBOW).

In the skip-gram variant, the objective is to predict a word’s context given the word; CBOW is the mirror image: its objective is to predict the word given its context.

The skip-gram model is the one that performed better in our project, and its architecture is the following:

Image by Mikolov et al. (2013) [2]

In order to obtain a vector representation of each word that successfully predicts its surrounding words, the algorithm maximizes the average log probability of observing the context words given the current word:

Image by Mikolov et al. (2013) [3]
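For reference, this objective from [3] can be written as:

$$\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p\left(w_{t+j} \mid w_t\right)$$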

In the formula:

  • c is the “window”, or the size of the context of a word. It is a model hyperparameter expressed in “words”, e.g. 7 means that we’re using a context of 7 words around our current word w_t
  • T indicates the training size (i.e. all the words in the corpus used to train the model)
  • The conditional probability is defined with the following softmax function, where v and v′ represent the input and output vector representations of the word w in the neural network:
Image by Mikolov et al. (2013) [3]
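In LaTeX form, the softmax from [3] reads (W being the number of words in the vocabulary):

$$p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}$$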

The final output will be vectors (one per word in the vocabulary) of a desired dimension (another model hyperparameter).

These vectors should capture the relationships among words with high accuracy. For instance, Mikolov et al., 2013 [3] obtained vectors that could be used to perform linear operations: <King> – <Man> + <Woman> gives a result very close to the vector corresponding to <Queen>.

2.2 Probability approximation

The above maximisation would require the computation of the gradient of the log-probability.

The paper by Mikolov et al., 2013 [3] shows how the computation of this gradient has a complexity of O(V), i.e. proportional to the size of the vocabulary (normally huge). This represents an implementation bottleneck.

The same paper, however, presents several strategies to implement word2vec more efficiently, such as:

  • An approximation of the softmax via the hierarchical softmax. This simplifies the output layer by using a binary tree representation, reducing the number of output nodes that need to be evaluated.
  • Negative sampling: this technique trains the model to distinguish the target word from words drawn from a noise distribution, so that the full softmax never has to be computed.

Both techniques were used on the project, and the hierarchical softmax approximation gave better results.

3.1 Word2Vec Model Training

To train the word2vec model, we first tokenize the lyrics and choose the hyperparameters that gave the best performance on the classification task we’ll present below (the Gensim training call is sketched after the list):

  • sg = 1 to use the Skip-gram algorithm
  • hs = 1 to use the hierarchical softmax approach to approximate probabilities
  • vector_size = 300 means that the output vectors, for each word, will have 300 coordinates. This is the same vector dimension used by word2vec models trained on the entire Google News corpus
  • window = 7 means that our context will be based on 7 words. Different window sizes were tested, and values between 5 and 10 gave the best results in this project.
  • min_count = 3 means that words appearing fewer than 3 times will be excluded from the corpus. This filters out words that are too rare to be of any use and decreases the size of our vocabulary.
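With Gensim, the training call is a one-liner (the `tokenized_lyrics` variable is an assumption: a list of token lists, one per lyric line, as produced in section 1.3):

```python
from gensim.models import Word2Vec

w2v = Word2Vec(
    sentences=tokenized_lyrics,  # iterable of token lists
    sg=1,             # Skip-gram algorithm
    hs=1,             # hierarchical softmax
    vector_size=300,  # output vector dimension
    window=7,         # context window size
    min_count=3,      # ignore words appearing fewer than 3 times
)
print(len(w2v.wv))    # vocabulary size (1223 on this dataset)
```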

With these values, the model trained very fast and the vocabulary size on this small dataset is 1223.

3.2 Plotting playlist centroids

After training, we obtain one vector per word in our playlists. We can therefore compute the playlist centroids and the most representative words of each playlist:

  • First, we extracted the vector corresponding to each word in a playlist and averaged them to obtain the centroids
  • Then, we extracted the words most similar to each centroid: these are the most representative words of each playlist (see the sketch below).
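A minimal sketch of both steps, assuming `metal_words` holds the tokens of the metal playlist:

```python
import numpy as np

def playlist_centroid(words, w2v):
    # Average the vectors of all in-vocabulary words of a playlist
    vectors = [w2v.wv[w] for w in set(words) if w in w2v.wv]
    return np.mean(vectors, axis=0)

metal_centroid = playlist_centroid(metal_words, w2v)

# Words closest to the centroid (cosine similarity): the "top" words
print(w2v.wv.similar_by_vector(metal_centroid, topn=10))
```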

The “top” 10 words per metal and global playlists, respectively, are:

Image by author

It is interesting to notice that the most representative words do not necessarily correspond to the most frequent words (shown in the word cloud images above). As expected, the algorithm goes beyond word frequency to calculate the vector coordinates.

We can also visualize these words; we’ll use PCA to project our 300-coordinate vectors into 2D:
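A minimal sketch with scikit-learn and matplotlib (`top_metal` and `top_global` are assumptions: the lists of top-10 words extracted above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = top_metal + top_global               # the 10 + 10 top words
X = np.array([w2v.wv[w] for w in words])     # 20 x 300 matrix

coords = PCA(n_components=2).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.show()
```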

Image by author

Similarly, we can plot the PCA of the two playlist centroids:

Image by author

We can separate the two playlists in terms of their most representative words and the two centroids. Let’s exploit these distances to assign new songs to playlists.

3.3 Song Classification with Logistic Regression

We will now use our vectors and playlist labels to train a logistic regression model. The model will be tested on a test set composed of 6 unseen songs. (These songs were previously part of these two playlists on Spotify but are no longer included in them.) We will check whether the model assigns the songs to the right playlist:

Image by author

Logistic regression is a generalised linear model and a very common classification technique, especially used for binary classification (2 classes), although there are adaptations of the model for multi-class problems.

The model returns the probability that a record belongs to “class 1”; a threshold can be set in order to “hard”-assign records to class 1 only if the probability is above that threshold.
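Concretely, for a song vector x, the model estimates:

$$p(y = 1 \mid \mathbf{x}) = \sigma\left(\mathbf{w}^{\top}\mathbf{x} + b\right) = \frac{1}{1 + e^{-(\mathbf{w}^{\top}\mathbf{x} + b)}}$$

and the record is assigned to class 1 when this probability exceeds the chosen threshold (commonly 0.5).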

Data pre-processing:

Let’s first prepare the data sets:

  • We will use the lyrics we’ve already imported to compute an average vector per song. This will be our train set.
  • We will load the new songs and compute one vector per song (averaging its word vectors), using the same word2vec model. This will be our test set (see the sketch after this list).
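A minimal sketch of both steps (`train_lyrics`, `test_lyrics`, and `train_labels` are assumptions: pre-processed tokens per song and the playlist label of each training song):

```python
import numpy as np

def song_vector(tokens, w2v):
    # One vector per song: the average of its in-vocabulary word vectors
    vectors = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vectors, axis=0)

X_train = np.array([song_vector(t, w2v) for t in train_lyrics])
X_test = np.array([song_vector(t, w2v) for t in test_lyrics])
```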

Hyperparameter tuning:

Next, we run a grid search with cross-validation to identify the best model hyperparameters:
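A minimal sketch of the tuning step; the parameter grid below is an assumption, not the article’s exact grid:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.01, 0.1, 1, 10],          # inverse regularisation strength
    "solver": ["lbfgs", "liblinear"],
}

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_train, train_labels)
print(search.best_params_)
```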

Model training & testing:

We can now train the optimal model on the train set and then test it by classifying the new songs:
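A minimal sketch of this final step (by default, GridSearchCV refits the best model on the full train set):

```python
best_model = search.best_estimator_          # already refit on X_train

train_accuracy = best_model.score(X_train, train_labels)
predictions = best_model.predict(X_test)     # playlist label per new song
probabilities = best_model.predict_proba(X_test)

print(train_accuracy, predictions, probabilities)
```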

We obtain the predicted labels and performances below: very high accuracy on the train set and one misclassified song on the test set.

Image by author

Graphical overview:

Using PCA on the Word2Vec model vectors, we can now visualize the test-set songs (each dot represents a song, obtained by averaging its word vectors). Let’s check how close the new songs are to the two playlist centroids:

Image by author

With this representation, we can also see whether new songs lie very close to a playlist centroid, which makes the classification more robust. In fact, we can see why one song was misclassified: it is much closer to the “wrong” playlist centroid.

This article presents a possible strategy to assign new songs to existing playlists based on their lyrics.

By using word2vec for lyrics embedding and logistic regression for classification, very good results were achieved.

In order to generalise this strategy, different embedding techniques and different regression models could be compared, ideally using a much larger dataset, which normally improves the word embedding task.

Thank you for reading!

