Recommender System: Collaborative Filtering with Matrix Factorization | by Christie Natashia | Apr, 2023


Implementation Contents

  • Data Import
  • Data Pre-Processing
  • Implementation #1: Matrix Factorization in Python from Scratch
  • Implementation #2: Matrix Factorization with Surprise Package

The complete notebook on Matrix Factorization implementation is available here.

Since we are developing a recommendation system like Netflix's, but do not have access to their big data, we will use a great dataset from MovieLens for this practice [1], with permission. You can read their README files for the usage license and other details. The dataset comprises movies, users, and millions of users' past rating interactions.

After extracting the zip file, you will find four CSV files, as follows:

Snapshot of data -Image by Author

Note that Collaborative Filtering suffers from the user cold-start problem. The cold-start problem refers to a situation in which a system or algorithm cannot make accurate predictions or recommendations for new users, items, or entities that have no prior information. This happens when little or no historical data is available for the new users or items, making it difficult for the system to understand their preferences or characteristics.

The cold-start problem is a common challenge in recommendation systems, where the system needs to provide personalized recommendations for users with limited or no interaction history.

In this stage, we are going to select users who have interacted with at least 2,000 movies, and movies that have been rated by at least 1,000 users. This is a good way to reduce the size of the data and, of course, reduce the amount of null data. (Besides, my RAM could never handle the full table.)
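The filtering step above can be sketched with pandas `value_counts` plus `isin`. This is a minimal illustration, not the article's exact code: the column names follow the MovieLens `ratings.csv` schema, and the thresholds are shrunk so the toy data below passes them (in the article they are 2,000 and 1,000).

```python
import pandas as pd

# Toy stand-in for the MovieLens ratings table (userId, movieId, rating).
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 20, 10],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

min_user_ratings = 2   # article uses 2000
min_movie_ratings = 2  # article uses 1000

# Keep only users with enough rated movies...
active_users = ratings["userId"].value_counts()
active_users = active_users[active_users >= min_user_ratings].index
dense = ratings[ratings["userId"].isin(active_users)]

# ...and only movies rated by enough of the remaining users.
popular_movies = dense["movieId"].value_counts()
popular_movies = popular_movies[popular_movies >= min_movie_ratings].index
dense = dense[dense["movieId"].isin(popular_movies)]
```

Filtering movies after filtering users (rather than independently) keeps the two conditions consistent on the reduced table.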

My RAM condition -Source: KC Green’s 2013 webcomic

Alternatively, you can use the small 100k-ratings subset that MovieLens provides. I just want to make the most of my computer's resources while keeping null data to a minimum.

Data output after data pre-processing -Image by Author

As is customary, we will divide the data into two groups, a training set and a testing set, by utilizing the train_test_split method.
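A minimal sketch of that split with scikit-learn, on a toy ratings frame standing in for the pre-processed data (the column names and the 80/20 ratio are assumptions, not taken from the article's notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the pre-processed (userId, movieId, rating) table.
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "movieId": [10, 20, 10, 30, 20, 30, 10, 20, 30, 10],
    "rating":  [4.0, 3.5, 2.0, 4.5, 3.0, 5.0, 4.0, 2.5, 3.5, 4.5],
})

# Hold out 20% of the ratings for testing; fix the seed for reproducibility.
train_df, test_df = train_test_split(ratings, test_size=0.2, random_state=42)
```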

The information we require is all present, but it is not laid out in a way that is easy for humans to read. The table below presents the same data in a more comprehensible format.

Raw data -Image by Author
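Rearranging the long (user, movie, rating) triples into a user-item matrix is typically done with a pandas pivot. A small sketch, again with assumed MovieLens-style column names; unrated pairs become `NaN`:

```python
import pandas as pd

# Toy training data in long format.
train_df = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 20],
    "rating":  [4.0, 3.5, 2.0, 5.0],
})

# Pivot to one row per user and one column per movie;
# missing (user, movie) pairs are filled with NaN.
rating_matrix = train_df.pivot(index="userId", columns="movieId", values="rating")
print(rating_matrix)
```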

Here is the Python snippet for implementing Matrix Factorization with gradient descent. The matrix_factorization function returns two matrices: nP (the user matrix) and nQ (the item matrix).
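For readers without the notebook at hand, a self-contained sketch of such a function follows. It is not the author's exact implementation: it assumes the rating matrix uses `np.nan` for missing entries, and it adds L2 regularization (`beta`) on both factor matrices, a standard choice for this technique.

```python
import numpy as np

def matrix_factorization(R, K=5, steps=200, alpha=0.01, beta=0.02):
    """Factor R (users x items, np.nan = unrated) into nP (users x K)
    and nQ (items x K) by stochastic gradient descent on observed entries."""
    num_users, num_items = R.shape
    rng = np.random.default_rng(0)
    nP = rng.normal(scale=0.1, size=(num_users, K))
    nQ = rng.normal(scale=0.1, size=(num_items, K))
    observed = np.argwhere(~np.isnan(R))  # indices of known ratings only
    for _ in range(steps):
        for u, i in observed:
            err = R[u, i] - nP[u] @ nQ[i]  # prediction error on this rating
            p_u = nP[u].copy()             # use pre-update user factors below
            # Gradient step with L2 regularization on both factor vectors.
            nP[u] += alpha * (err * nQ[i] - beta * p_u)
            nQ[i] += alpha * (err * p_u - beta * nQ[i])
    return nP, nQ

# Small demo: factor a 3x3 matrix with two missing ratings.
R = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 1.0],
              [1.0, 1.0, 5.0]])
nP, nQ = matrix_factorization(R, K=2, steps=500, alpha=0.02)
```

Only observed entries contribute to the loss, so the learned factors generalize to the `NaN` cells rather than fitting them to zero.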

Then, fit the training dataset to the model; here I set the number of latent factors n_factor K = 5. Following that, predictions can be computed by taking the dot product of nP and the transpose of nQ, as illustrated in the code snippet below.
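The prediction step is a single matrix product: `nP @ nQ.T` yields a full users-by-items matrix of predicted ratings. A tiny illustration with made-up factor matrices (K = 2 here instead of 5, purely for brevity):

```python
import numpy as np

# Toy stand-ins for the factors returned by matrix_factorization.
nP = np.array([[0.9, 0.2],    # user factors: 2 users x K=2
               [0.1, 1.1]])
nQ = np.array([[1.0, 0.0],    # item factors: 3 items x K=2
               [0.0, 1.0],
               [0.5, 0.5]])

# Predicted rating for every (user, item) pair at once.
pred = nP @ nQ.T              # equivalently np.dot(nP, nQ.T)
print(pred.shape)             # (2, 3): 2 users x 3 items
```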

As a result, here is the final prediction that matrix_factorization produces:

New predicted ratings on the train set -Image from Author

Prediction on the Test Set

The following snippet leverages the learned nP (user matrix) and nQ (movie matrix) to make predictions on the test set.
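One plausible shape for that snippet, not the author's exact code: for each test pair, look up the row of nP and nQ belonging to that user and movie and take their dot product. The id-to-row mappings (`user_index`, `movie_index`) are hypothetical helpers; in practice they come from the pivot step.

```python
import numpy as np
import pandas as pd

# Hypothetical learned factors and id-to-row mappings from training.
nP = np.array([[0.9, 0.2], [0.1, 1.1]])   # user factors
nQ = np.array([[1.0, 0.0], [0.0, 1.0]])   # movie factors
user_index = {1: 0, 2: 1}
movie_index = {10: 0, 20: 1}

test_df = pd.DataFrame({"userId": [1, 2], "movieId": [20, 10]})

# Predicted rating = dot product of the matching user and movie factor rows.
test_df["pred_rating"] = [
    nP[user_index[u]] @ nQ[movie_index[m]]
    for u, m in zip(test_df["userId"], test_df["movieId"])
]
```

Test pairs whose user or movie never appeared in training have no factor row; that is the cold-start problem from earlier, and here such pairs would simply have to be skipped or given a fallback prediction.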

The rating and pred_rating output of test set-Image from Author

Evaluating The Prediction Performance

There are various evaluation metrics for Recommender Systems, such as Precision@K, Recall@K, MAP@K, and so on. For this exercise, I will employ a basic accuracy metric, namely RMSE. I will likely cover other evaluation metrics in greater detail in a subsequent article.

As a result, the RMSE on the test set is 0.829, which is pretty decent even before any hyperparameter tuning. We could certainly tune parameters such as the learning rate, n_factor, and the number of epochs for better outcomes.
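For completeness, RMSE is just the square root of the mean squared difference between actual and predicted ratings; a minimal version:

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((actual - predicted) ** 2))

score = rmse([4.0, 3.0, 5.0], [3.5, 3.0, 4.0])
```

Because errors are squared before averaging, RMSE penalizes a few large misses more heavily than many small ones, which is why it is a common default for rating prediction.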


