
Unleash the Hidden Patterns: A Guide to Unsupervised Machine Learning for an Article Recommender System

By Suhas Maddali, March 2023



Photo by Salomé Watel on Unsplash

There has been a lot of talk lately about the incredible capabilities of artificial intelligence and machine learning. As ML is applied to ever more frontiers, the value it can generate keeps growing. Companies such as Google, Microsoft, and NVIDIA are pushing the boundaries of where AI and machine learning can be used for technological and societal benefit.

Amid all this AI hype, an often less-discussed topic is unsupervised machine learning. Most people have enjoyed streaming services such as Netflix and Amazon Prime Video with their state-of-the-art recommendation systems. However, far less is said about how capable these recommendation systems are, and many of them rely on unsupervised machine learning.

In this article, we will focus on building an article recommender system based on the articles a user has previously read, giving them a positive experience and an inclination to read more articles of a similar kind. This falls under unsupervised machine learning because we have no labels indicating whether a particular article is related to any other. Instead, we are given only textual information, from which we must extract and visualize similarities before recommending items to users.

We start by reading the data and understanding its feature types. We then visualize key insights from the data. Finally, we build a recommender system based on cosine similarity scores between an article and the full list of articles.

Reading the Data

The first step is to read the data; a sketch of the reading code appears below. After reading the data, we perform exploratory data analysis.

Note: The dataset was taken from "Articles sharing and reading from CI&T DeskDrop" on Kaggle, released under the Database Contents License (DbCL) v1.0 from Open Data Commons.
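
Below is a minimal sketch of the reading step; the filename shared_articles.csv is an assumption based on the Kaggle DeskDrop dataset files.

```python
import pandas as pd

# Read the DeskDrop articles file (filename assumed from the Kaggle dataset)
df = pd.read_csv("shared_articles.csv")
print(df.shape)
df.head()
```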

Input Data (Image by Author)

The title, text, and language features are the most important ones in our data. Other features, such as the content type and author user agent, are not very useful for building a recommendation system. We will remove those features and keep only the title, text, and language. Note that not all features are shown in the figure, as the image would become too large.

Data Information (Image by Author)

We inspect the data and count the non-null values in the dataset. Features such as authorSessionId and authorUserAgent contain more than 50 percent missing values. We remove these features, along with the others discussed above, and focus only on the title, text, and language features.
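
A sketch of this cleanup, assuming the DeskDrop column names; contentType and lang are kept for the moment because the EDA below still needs them.

```python
# Fraction of missing values per column
missing = df.isnull().mean().sort_values(ascending=False)
print(missing.head(10))

# Keep only the features we need (column names assumed from the DeskDrop schema);
# contentType and lang are retained temporarily for the EDA that follows
df = df[["contentType", "lang", "title", "text"]]
```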

Exploratory Data Analysis (EDA)

Let us now explore the dataset and build an understanding of it through visuals and graphs. The first step is to look at the types of content present in the data.

Countplot of content types (Image by Author)

The plot shows that a large portion of the content is in the form of HTML links, which contain the actual text and titles. Our focus is on extracting meaningful insights from text rather than from other formats such as 'video' and 'rich', so we can remove those categories from the project.
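
A sketch of this step with seaborn; the 'HTML' category label is assumed from the dataset.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Countplot of content types; most entries turn out to be HTML articles
sns.countplot(x="contentType", data=df)
plt.show()

# Keep only HTML content and drop the column afterwards
df = df[df["contentType"] == "HTML"].drop(columns=["contentType"])
```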

Countplot of languages (Image by Author)

A large portion of our texts and titles are in English. Since our focus is only on the English language, we remove the other language categories, which also improves the efficiency of our recommendation model.
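
A sketch of the language filter, assuming the ISO code 'en' in the lang column.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Countplot of languages, then keep only English-language articles
sns.countplot(x="lang", data=df)
plt.show()

df = df[df["lang"] == "en"].drop(columns=["lang"])
```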

Title Wordcloud (Image by Author)

A word cloud gives a good view of the most frequently occurring words in our text corpus: the more often a word occurs, the larger it is drawn. As seen from the plot, terms such as 'Google', 'Machine Learning', and 'Apple' appear frequently in the titles.
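
The word cloud itself can be produced with the wordcloud package, as in this sketch:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Word cloud of the titles; more frequent words are drawn larger
titles_text = " ".join(df["title"].dropna())
wc = WordCloud(width=800, height=400, background_color="white").generate(titles_text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```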

Text Wordcloud (Image by Author)

Now that we have explored the titles and found the most common words, it is time to explore the article text itself for interesting trends and patterns. A large proportion of articles contain words such as 'data', 'user', and 'time'.

Percentage of Variance explained (Image by Author)

After converting the titles into TF-IDF vectors and applying principal component analysis (PCA), the plot above shows the cumulative percentage of variance explained as components are added. Naturally, the explained variance grows with the number of components, since each additional component captures more of the information in the data.
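
A sketch of this step with scikit-learn; the vocabulary size and component count are illustrative choices, and the same recipe applies to the full text for the next plot.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

# TF-IDF vectors for the titles (vocabulary size is an assumption)
tfidf = TfidfVectorizer(stop_words="english", max_features=2000)
X_title = tfidf.fit_transform(df["title"].fillna("")).toarray()

# Cumulative percentage of variance explained by the principal components
pca = PCA(n_components=100).fit(X_title)
plt.plot(np.cumsum(pca.explained_variance_ratio_) * 100)
plt.xlabel("Number of components")
plt.ylabel("Cumulative % of variance explained")
plt.show()
```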

Percentage of Variance explained for the full text (Image by Author)

This plot shows the variance explained by the principal components of the TF-IDF vectors computed on the full text rather than just the title. It differs markedly from the previous plot: here, only a few principal components explain a large portion of the variance in the data. This is helpful for dimensionality reduction, since a small number of features capture most of the information.

K-means clustering plot (Image by Author)

Clustering is a technique in which similar data points are grouped together to surface interesting patterns and commonalities. Recommendations can then be based on the cluster in which a data point lies.

To determine the right number of clusters, we fit k-means models for a range of k values and apply the elbow method. In our case, the best value is k = 11, where the inertia curve bends like an elbow.
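
A sketch of the elbow search; the TF-IDF settings and the 50-component PCA reduction are assumptions that mirror the title pipeline above.

```python
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# TF-IDF on the full article text, then PCA for dimensionality reduction
X_text = TfidfVectorizer(stop_words="english", max_features=2000).fit_transform(
    df["text"].fillna("")
).toarray()
X_text_pca = PCA(n_components=50).fit_transform(X_text)

# Elbow method: fit k-means for a range of k and plot the inertia
ks = range(2, 20)
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_text_pca).inertia_
    for k in ks
]
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia")
plt.show()
```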

PCA with clustering 2D plot (Image by Author)

After performing PCA and clustering with the optimum number of clusters, it is time to visualize the results. Based on the plot above, the clustering works quite well, as clear patterns appear in the clusters.
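
A sketch of the 2D visualization, continuing from the elbow search above:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Cluster in the PCA space with the chosen k and color the points by label
labels = KMeans(n_clusters=11, n_init=10, random_state=42).fit_predict(X_text_pca)
plt.scatter(X_text_pca[:, 0], X_text_pca[:, 1], c=labels, cmap="tab20", s=10)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```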

PCA with clustering 3D plot (Image by Author)

Let us see how the plot looks in 3D to find the underlying patterns. As can be seen, the clusters separate well, which helps our recommender system give good suggestions based on the text a user has previously read.
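
Only the plotting step changes for the 3D view, as in this sketch:

```python
import matplotlib.pyplot as plt

# 3D view of the first three principal components, colored by cluster
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X_text_pca[:, 0], X_text_pca[:, 1], X_text_pca[:, 2],
           c=labels, cmap="tab20", s=10)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_zlabel("PC 3")
plt.show()
```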

There are other dimensionality reduction techniques, such as t-SNE and kernel PCA. Trying each of them to see which produces the best clustering is worthwhile for a recommender system, so we visualize the text data points under each technique and look for interesting patterns.

TSNE with clustering 2D plot (Image by Author)

Reducing the dimensions with t-SNE and visualizing the representations with 11 clusters, we see that the data points are spread out, but similar articles are less likely to land in the same cluster. By comparison, PCA performed better for clustering and for determining the optimum number of clusters. We also use 3D visualizations to sharpen our understanding of the clustering.
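
A sketch of the t-SNE embedding; the perplexity value is an assumption.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# 2D t-SNE embedding of the PCA-reduced text vectors
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_text_pca)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap="tab20", s=10)
plt.show()
```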

TSNE with clustering 3D plot (Image by Author)

After clustering, we see that many of the text embeddings lie quite close to one another. Clustering accurately is difficult when points are not spread out in different directions, so we look for alternative dimensionality reduction techniques.

Kernel PCA with clustering 2D Plot (Image by Author)

Kernel PCA is another popular dimensionality reduction method. As can be seen, the data representations are well spread out, which makes the clusters easy to identify. Overall, the algorithm reduces the dimensions and separates the data points well. Let us also look at the 3D representation of the clusters produced with this technique.
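
A sketch of the kernel PCA embedding; the RBF kernel and the component count are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import KernelPCA

# Non-linear dimensionality reduction with kernel PCA
X_kpca = KernelPCA(n_components=3, kernel="rbf").fit_transform(X_text)
plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=labels, cmap="tab20", s=10)
plt.show()
```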

Kernel PCA with clustering 3D Plot (Image by Author)

In the 3D representation with kernel PCA, the clustering works well and the points are well spread out. Therefore, this representation is a good basis for cluster-based recommendations.

Say a user visits a website and reads an interesting article. The articles in the same cluster of the kernel PCA representation can then be recommended to that user. If the user finds these articles engaging, the business grows.

After performing the previous steps, we define a function that generates a list of useful features that recommender system models can use to make recommendations. The features in the function below give a good representation of the type of text along with its content and readability.
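
Since the original function is shown only as an image, here is an illustrative stand-in; every feature below, and the textstat dependency, is an assumption rather than the article's exact implementation.

```python
import pandas as pd
import textstat  # readability metrics (an assumed library choice)

def generate_text_features(text: str) -> dict:
    """Illustrative content and readability features for one article."""
    words = text.split()
    n = max(len(words), 1)
    return {
        "num_words": len(words),
        "num_unique_words": len(set(words)),
        "avg_word_length": sum(len(w) for w in words) / n,
        "readability": textstat.flesch_reading_ease(text),
    }

# Apply the function across the data frame to build a feature table
feature_df = df["text"].fillna("").apply(generate_text_features).apply(pd.Series)
```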

We apply this function to our data frame to generate a new set of features for the recommender. Finally, cosine similarity is used to measure the distance between the text of interest and every other text and article.

After computing the cosine similarity between the current article's representation (the generated features together with the kernel PCA embedding) and those of all existing articles, we select the articles at the smallest cosine distance. Those articles are then recommended to the user as engaging, relevant reads.

We import the libraries needed for measuring cosine similarity and generate recommendations for the articles that are most similar in structure and content.
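
A sketch of the recommendation step with scikit-learn's cosine_similarity; the representation matrix X_kpca and the recommend helper are illustrative names.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def recommend(article_idx, X, titles, top_n=5):
    # Similarity of one article's representation to every article in the corpus
    sims = cosine_similarity(X[article_idx : article_idx + 1], X)[0]
    order = np.argsort(sims)[::-1]  # most similar first
    order = [i for i in order if i != article_idx][:top_n]
    return [(titles[i], float(sims[i])) for i in order]

# Example: top-5 recommendations for the first article in the corpus
print(recommend(0, X_kpca, df["title"].tolist()))
```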

Conclusion

Having gone through this article, you should now have a good idea of how one particular article recommendation system can be implemented. Performing dimensionality reduction reduces compute requirements and lessens the impact of outliers when making predictions or providing recommendations. Thanks for taking the time to read this article.

