Implementation Contents
- Data Import
- Data Pre-Processing
- Implementation #1: Matrix Factorization in Python from Scratch
- Implementation #2: Matrix Factorization with Surprise Package
The complete notebook on Matrix Factorization implementation is available here.
Since we are developing a Netflix-style recommendation system but do not have access to Netflix's big data, we will use a great dataset from MovieLens for this practice [1], with permission. You can also read their README files for the usage licenses and other details. This dataset comprises movies, users, and millions of users' past ratings.
After extracting the zip file, there are four CSV files, as follows:
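A minimal loading sketch with pandas, assuming the ml-latest-small layout (links.csv, movies.csv, ratings.csv, tags.csv); a tiny inline sample stands in for ratings.csv here so the snippet runs on its own:

```python
import io
import pandas as pd

# In practice: ratings = pd.read_csv("ml-latest-small/ratings.csv"),
# and likewise for movies.csv, links.csv, and tags.csv.
sample = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,964982703\n"
    "1,3,4.0,964981247\n"
    "2,1,3.5,1112486027\n"
)
ratings = pd.read_csv(sample)
print(ratings.shape)  # (3, 4)
```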
Note that Collaborative Filtering suffers from the user cold-start problem. The cold-start problem refers to a situation in which a system or algorithm cannot make accurate predictions or recommendations for new users, items, or entities that have no prior information. This can happen when there is little or no historical data available for the new users or items, making it difficult for the system to understand their preferences or characteristics.
The cold-start problem is a common challenge in recommendation systems, where the system needs to provide personalized recommendations for users with limited or no interaction history.
In this stage, we will select users who have interacted with at least 2000 movies, and movies that have been rated by at least 1000 users. This is a good way to reduce the size of the data and, as a bonus, leaves fewer null entries; besides, my RAM could never handle the full table.
Alternatively, you can use the small 100k-ratings subset provided by MovieLens. I just want to use my computer's resources as efficiently as I can, with less null data.
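The filtering step above can be sketched with pandas `groupby`/`transform`. The function and its thresholds mirror the article's description (2000 movies per user, 1000 users per movie); the tiny synthetic demo uses smaller thresholds so it runs on its own:

```python
import pandas as pd

def filter_sparse(ratings, min_movies_per_user, min_users_per_movie):
    """Keep users with at least `min_movies_per_user` ratings and
    movies rated by at least `min_users_per_movie` users."""
    active = ratings.groupby("userId")["movieId"].transform("count") >= min_movies_per_user
    dense = ratings[active]
    popular = dense.groupby("movieId")["userId"].transform("count") >= min_users_per_movie
    return dense[popular]

# Tiny synthetic demo (the article uses thresholds of 2000 and 1000)
demo = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 20, 10],
    "rating":  [4.0, 3.5, 5.0, 4.0, 2.0, 3.0],
})
dense = filter_sparse(demo, min_movies_per_user=2, min_users_per_movie=2)
print(dense)  # user 3 and movie 30 are filtered out
```

Note that filtering movies can in turn drop some users back below their threshold; for a strict guarantee the two filters can be repeated until the frame stops shrinking.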
As is customary, we will divide the data into two groups, a training set and a testing set, by utilizing the train_test_split method.
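A minimal sketch of the split, assuming scikit-learn's `train_test_split` on a small stand-in ratings frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "movieId": [10, 20, 10, 30, 20, 30, 10, 20, 30, 10],
    "rating":  [4.0, 3.5, 5.0, 2.0, 3.0, 4.5, 4.0, 3.0, 2.5, 5.0],
})
# Hold out 20% of the interactions for evaluation
train_df, test_df = train_test_split(ratings, test_size=0.2, random_state=42)
print(len(train_df), len(test_df))  # 8 2
```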
While the information we require is all present, it is not laid out in a way that is easy for humans to read. To fix that, I have pivoted the same data into a user-item table, a format that is much easier to understand.
Here is the Python snippet for implementing Matrix Factorization with gradient descent. The matrix_factorization function returns two matrices: nP (the user matrix) and nQ (the item matrix).
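A minimal from-scratch sketch of such a function, using stochastic gradient descent with L2 regularization over the observed entries (the hyperparameter defaults here are illustrative assumptions, not the article's exact values):

```python
import numpy as np

def matrix_factorization(R, K=5, steps=1000, alpha=0.01, beta=0.02):
    """Factorize R (np.nan marks missing ratings) into nP @ nQ.T.

    K     : number of latent factors
    alpha : learning rate for stochastic gradient descent
    beta  : L2 regularization strength
    """
    num_users, num_items = R.shape
    rng = np.random.default_rng(42)
    nP = rng.normal(scale=0.1, size=(num_users, K))  # user matrix
    nQ = rng.normal(scale=0.1, size=(num_items, K))  # item matrix
    observed = [(u, i) for u in range(num_users)
                for i in range(num_items) if not np.isnan(R[u, i])]
    for _ in range(steps):
        for u, i in observed:
            err = R[u, i] - nP[u] @ nQ[i]
            p_u = nP[u].copy()
            # Gradient step on the squared error plus L2 penalty
            nP[u] += alpha * (err * nQ[i] - beta * nP[u])
            nQ[i] += alpha * (err * p_u - beta * nQ[i])
    return nP, nQ

# Tiny demo: a 5-user x 4-movie rating matrix with missing entries
R = np.array([
    [5, 3, np.nan, 1],
    [4, np.nan, np.nan, 1],
    [1, 1, np.nan, 5],
    [1, np.nan, np.nan, 4],
    [np.nan, 1, 5, 4],
])
nP, nQ = matrix_factorization(R, K=2)
pred = nP @ nQ.T
mask = ~np.isnan(R)
train_rmse = np.sqrt(np.mean((R[mask] - pred[mask]) ** 2))
print(round(train_rmse, 3))  # small: the observed entries are fit closely
```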
Then, fit the training dataset to the model; here I set the number of latent factors to K = 5. Following that, predictions can be computed by multiplying nP and the transpose of nQ with the dot product method, as illustrated in the code snippet below.
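Assuming `matrix_factorization` has returned `nP` and `nQ` (tiny hand-made stand-ins below), the full prediction matrix is simply their dot product, which can then be wrapped back into a labeled DataFrame:

```python
import numpy as np
import pandas as pd

# Stand-in factors for 3 users and 4 movies with K = 2 latent features;
# in practice these come from matrix_factorization(train_matrix, K=5)
nP = np.array([[1.2, 0.8], [0.9, 1.1], [1.5, 0.4]])              # user matrix
nQ = np.array([[1.0, 0.5], [0.7, 1.3], [1.1, 0.9], [0.4, 1.6]])  # item matrix

pred = nP @ nQ.T  # (num_users, num_movies) predicted ratings
pred_df = pd.DataFrame(pred, index=[1, 2, 3], columns=[10, 20, 30, 40])
print(pred_df.round(2))
```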
As a result, here is the final prediction that matrix_factorization produces.
Prediction on the Test Set
The following snippet leverages the returned nP (user matrix) and nQ (movie matrix) to make predictions on the test set.
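One way to sketch this, assuming the predictions have been wrapped in a DataFrame indexed by the training users and movies: look up the predicted rating for each (user, movie) pair in the test set, falling back to NaN for users or movies unseen in training:

```python
import numpy as np
import pandas as pd

# Stand-in predicted user-item matrix (in practice nP @ nQ.T)
pred_df = pd.DataFrame(
    [[4.1, 3.2, 2.5], [3.8, 2.9, 4.4]],
    index=[1, 2], columns=[10, 20, 30],
)
test_df = pd.DataFrame({
    "userId":  [1, 2, 2],
    "movieId": [20, 10, 30],
    "rating":  [3.0, 4.0, 4.5],
})

# Look up the prediction for each (user, movie) pair in the test set
test_df["pred"] = [
    pred_df.at[u, m] if (u in pred_df.index and m in pred_df.columns) else np.nan
    for u, m in zip(test_df["userId"], test_df["movieId"])
]
print(test_df)
```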
Evaluating The Prediction Performance
There are various evaluation metrics for Recommender Systems, such as Precision@K, Recall@K, MAP@K, and so on. For this exercise, I will employ a basic accuracy metric, namely RMSE. I may cover the other evaluation metrics in greater detail in a subsequent article.
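RMSE is the square root of the mean squared difference between actual and predicted ratings; a small helper suffices:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

print(round(rmse([3.0, 4.0, 5.0], [2.5, 4.0, 4.5]), 4))  # 0.4082
```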
As a result, the RMSE on the test set is 0.829, which is pretty decent even before any hyperparameter tuning. We can definitely tune several parameters, such as the learning rate, n_factor, and the number of epochs, for better outcomes.