
Your First Recommendation System: From Data Preparation to ML Debugging and Improvements Assessment

By Alexander Chaptykov, February 2023



Image by Vska from freepik.com

So you've begun to develop your first production recommendation system, and although you have experience in programming and ML, you are bombarded with an enormous amount of new information: model selection, metric selection, inference problems and quality assurance.

This article covers the steps for creating the first working version of an ML model: data processing, model selection, metric selection, ML debugging, interpretation of results and evaluation of improvements.

The code for the article is here; it can serve as a starting point for your own work. The file rec_sys.ipynb contains a step-by-step guide.

To start, we need a minimal dataset consisting of 3 entities: user, item, rating. Each record of this dataset describes one user's interaction with one item. For this article we have chosen the MovieLens dataset [1] (used with permission) with 100k records containing 943 unique users and 1682 movies. In this dataset, user corresponds to UserId, item to MovieId, and rating to Rating.

MovieLens also contains metadata for each movie, including its genres. We will need this to interpret the predictions.

Here are the preprocessing steps specific to recommendation systems. I will skip the obvious ones, such as dropping NaNs, removing duplicates and general data cleaning.

Rating generation

If we don’t have a rating, then create a rating column with the value 1.

df['rating'] = 1

Also, if the rating is not explicit, it can be derived with various aggregation functions, for example based on the number of interactions, viewing duration, etc.
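For example, here is a minimal sketch of deriving an implicit rating from interaction counts; the events DataFrame and its column names are illustrative, not part of the original code.

# hypothetical raw interaction log with one row per view event;
# count events per user-item pair and use the count as an implicit rating
df = (
    events.groupby(['user', 'item'])
          .size()
          .rename('rating')
          .reset_index()
)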

Entity encoding

If the item and user fields contain arbitrary objects, they need to be converted to a numeric format. A great way to accomplish this is scikit-learn's LabelEncoder.

from sklearn.preprocessing import LabelEncoder

u_transf = LabelEncoder()
item_transf = LabelEncoder()

# encoding: map raw user/item ids to consecutive integers
df['user'] = u_transf.fit_transform(df['user'])
df['item'] = item_transf.fit_transform(df['item'])

# decoding: restore the original ids, e.g. after predictions are made
df['item'] = item_transf.inverse_transform(df['item'])
df['user'] = u_transf.inverse_transform(df['user'])

Sparsity index

The sparsity index is the share of the user-item matrix with no interactions, and it must be lowered for the model to train well.

What does a high value of this index tell us? It means we have a lot of users who have watched only a few movies, and movies with a small audience. The more inactive users and unpopular movies we have, the higher this index is.
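As a quick check, the index can be computed directly from the interactions table; a minimal sketch, assuming the encoded user and item columns from the previous step:

# sparsity index: the share of the user-item matrix with no interactions
n_users = df['user'].nunique()
n_items = df['item'].nunique()
sparsity = 1 - len(df) / (n_users * n_items)
print(f"Sparsity index: {sparsity:.2%}")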

This situation happens most often when, for example, the number of users suddenly increases, or when we drastically expand the movie library and the new titles have no views yet.

Reducing sparsity is critical for training. Let's say you've loaded your data, you're trying to train a model, and you're getting extremely low metrics. Don't rush into hyperparameter search or into looking for better models. Start by checking the sparsity index.

You can see from the graph that reducing this index by almost 2% has a very positive effect on the metrics.

The graph shows that 81% of users are inactive (they have watched fewer than 20 movies). They need to be removed, and this function will help us with that:

def reduce_sparsity(df, min_items_per_user, min_user_per_item, user_col=USER_COL, item_col=ITEM_COL):
    # keep only users who interacted with enough items
    good_users = df[user_col].value_counts()[df[user_col].value_counts() > min_items_per_user].index
    df = df[df[user_col].isin(good_users)]

    # keep only items watched by enough users
    good_items = df[item_col].value_counts()[df[item_col].value_counts() > min_user_per_item].index
    df = df[df[item_col].isin(good_items)].reset_index(drop=True)

    return df

So we had to remove some users and movies, but this will allow us to train the model better. Reduce the sparsity level carefully and choose the threshold for each dataset based on the situation. In my experience, a sparsity index of about 98% is already sufficient for training.

There are good articles detailing popular metrics for recommender systems. For example, “Recommender Systems: Machine Learning Metrics and Business Metrics” by Zuzanna Deutschman and “Automatic Evaluation of Recommendation Systems: Coverage, Novelty and Diversity” by Zahra Ahmad. For this article I decided to focus on 4 metrics that can serve as a minimum set to get you started.

Precision@k

Precision@k = (number of relevant items in the top k) / k

This is a simple metric that does not take the order of predictions into account, so it will have higher values than MAP. It is also sensitive to changes in the model, which can be useful for model monitoring and evaluation, and it is easy to interpret, so we include it in our list of metrics.
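A minimal sketch of Precision@k for a single user, assuming plain lists of recommended and relevant item ids (this is not the evaluation code from the repository):

def precision_at_k(recommended, relevant, k=10):
    # fraction of the top-k recommended items the user actually interacted with
    top_k = list(recommended)[:k]
    return len(set(top_k) & set(relevant)) / k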

MAP (Mean Average Precision)

Unlike the previous metric, MAP takes the order of predictions into account: the closer to the top of the recommendation list a mistake is, the greater the penalty.

Coverage

Coverage = (number of unique items in the recommendations) / (number of unique items in the catalogue)

This metric shows what percentage of the catalogue the recommendation system actually uses. It is usually very important for a business to make sure that the content (in this case, movies) on its site is used to its full potential.
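A hedged sketch of how this can be computed from the predictions and the training data (column names follow the dataset described above):

def coverage(preds, train, item_col='item'):
    # share of the catalogue that appears at least once in the recommendations
    return preds[item_col].nunique() / train[item_col].nunique()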

Diversity

The purpose of this metric is to calculate how diverse the recommendations are.

In the paper “Automatic Evaluation of Recommendation Systems: Coverage, Novelty and Diversity” by Zahra Ahmad, diversity is the average similarity for top_n.

But in this article diversity will be treated differently: as the median number of unique genres in each user's recommendation list. High diversity values mean that users have an opportunity to discover new genres, diversify their experience, and spend more time on the site. As a rule, this increases retention and has a positive impact on revenue. This way of calculating the metric is highly interpretable for the business, unlike an abstract mean similarity ratio.
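A minimal sketch of this version of the metric, assuming a genres table with pipe-separated genre names per movie, as in the MovieLens movies.csv:

def diversity(preds, genres, user_col='user', item_col='item'):
    # median number of unique genres across each user's recommendation list
    merged = preds.merge(genres, on=item_col, how='left')
    per_user = (
        merged.assign(genre=merged['genres'].str.split('|'))
              .explode('genre')
              .groupby(user_col)['genre']
              .nunique()
    )
    return per_user.median()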

Interpretation of metrics

There is an excellent repository on recommender systems that contains not only the models themselves, but also an analysis of the metrics. Studying this table gives us an understanding of the possible range of metrics, and intuition in model evaluation. For example, Precision@k values below 0.02, in most cases, should be considered bad.

So we have quality metrics that are both rank-aware and rank-agnostic, metrics that serve as proxies for business value, and metrics for content usage and availability. Now we can move on to choosing a model.

ALS matrix factorization

This is a great model to start with. Implemented in Spark, the algorithm is relatively simple.

During training, the model initialises the User matrix and the Item matrix and trains them to minimize the error of reconstructing the Rating matrix. Each vector of the User matrix is a representation of a particular user, and each vector of the Item matrix is a representation of a particular item. Accordingly, the prediction is the dot (scalar) product of the corresponding vectors from the User and Item matrices.

It's a great model to start with because it's extremely easy to implement, and at the research stage it is often better to start with simple models: this model trains quickly, which shortens iteration time and speeds up the project considerably. It also will not overload memory and copes well when there is a lot of data, which saves on infrastructure later.
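A minimal sketch of training ALS with Spark MLlib; the hyperparameters here are illustrative, not the exact settings used in the repository:

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.getOrCreate()
ratings = spark.createDataFrame(df)  # pandas DataFrame with user, item, rating columns

als = ALS(userCol='user', itemCol='item', ratingCol='rating',
          rank=64, regParam=0.05, coldStartStrategy='drop')
model = als.fit(ratings)
top_k = model.recommendForAllUsers(10)  # top-10 recommendations per user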

Bilateral Variational Autoencoder (BiVAE)

The model is based on the paper “Bilateral variational autoencoder for collaborative filtering” by Quoc-Tuan Truong, Aghiles Salah and Hady W. Lauw.

The model is broadly similar to the previous one: in the process of training, the user matrix Theta and the item matrix Beta are learned.

But the structure of the model is much more complex than in ALS. There is a User encoder and an Item encoder, each consisting of a sequence of linear layers. Their task is to learn the latent variables Theta and Beta respectively. Decoding and inference are done by taking the dot product of these two variables. The loss function (the evidence lower bound, in this case) is computed twice: once between the reconstructed user vectors and the actual values, and then in the same way for the item encoder.

This model was chosen because it came out on top in the comparison table. It is part of the Cornac zoo of recommender models, with PyTorch under the hood, and it has a custom training loop. It is slower than ALS and will require more spending on support and infrastructure, but its higher metrics may be worth it.
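A hedged sketch of training BiVAE through Cornac; the hyperparameters are illustrative, loosely following the examples in the recommenders repository:

import cornac

# Cornac expects (user, item, rating) triplets
train_set = cornac.data.Dataset.from_uir(
    list(train[['user', 'item', 'rating']].itertuples(index=False, name=None))
)
bivae = cornac.models.BiVAECF(
    k=50,                     # dimension of the latent vectors Theta and Beta
    encoder_structure=[100],  # hidden layers of the user/item encoders
    act_fn='tanh',
    likelihood='pois',
    n_epochs=200,
    batch_size=128,
    learning_rate=0.001,
    use_gpu=True,
)
bivae.fit(train_set)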

Most popular

Yes, the simplest model, and in some situations the most effective.

df[item_col].value_counts()[:top_n]

Although this approach seems too simple, it allows us to compare metrics and, for example, see how far the ML models have moved beyond such a simple baseline. Having such a model can justify or refute the need for ML.

Random model

This model simply recommends random items. It provides a useful contrast when evaluating the metrics and predictions of the ML models.
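A minimal sketch of such a baseline, sampling top_n random items per user from the training catalogue (this is illustrative, not the repository's RandomModel implementation):

import numpy as np

def random_recommendations(train, top_n=10, user_col='user', item_col='item'):
    # sample top_n distinct random items for every user seen in training
    items = train[item_col].unique()
    return {
        user: np.random.choice(items, size=top_n, replace=False)
        for user in train[user_col].unique()
    }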

So, we chose 4 models for the experiment. One is optimized for speed, another for quality, and 2 others for comparison and a better understanding of the results. We are now ready to begin training.

We will train all 4 models at once to make it convenient to compare them, using the settings from the recommenders repository.

import json
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

from models import RandomModel, MostPopular, ModelALS, BIVAE, evaluate
from setup import ITEM_COL, TOP_K_METRICS, TOP_K_PRED

def main(out_folder='outputs'):
    # interactions: user, item, rating
    df = pd.read_csv('personalize.csv.zip', compression='zip').iloc[:, :3]

    # movie metadata used for the diversity metric
    genres = pd.read_csv('movies.csv').rename({"movieId": ITEM_COL}, axis=1).dropna()

    train, test = train_test_split(df, test_size=None, train_size=0.75, random_state=42)

    metrics = {}
    for model_cls in [BIVAE, ModelALS, RandomModel, MostPopular]:
        model = model_cls()
        model.fit(train)

        preds = model.transform(TOP_K_PRED)

        preds.to_csv(Path(out_folder) / f"{model_cls.__name__}_preds.csv", index=False)
        metrics[model_cls.__name__] = evaluate(train, test, preds, genres, TOP_K_METRICS)

    with open(Path(out_folder) / 'metrics.json', 'w') as fp:
        json.dump(metrics, fp)


if __name__ == "__main__":
    main()

Predictions for recommender systems have their own specifics. Predicting all items for all users, we get a matrix of n_users x n_items. Accordingly, we can only predict for the users and items that were present at training time.

Then:

  1. Remove from the predictions the items that the user already interacted with in the training set.
  2. Sort the predictions by rating in descending order for each user; otherwise the metrics will be poor.

This is an important point: people often forget to remove seen (training) items, which hurts the metrics, because the top of the list ends up filled with items that are not in the test dataset. In addition, users will have a negative experience because the model keeps recommending what they have already seen.
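A hedged sketch of both post-processing steps (the helper and column names are illustrative, not part of the repository code):

def postprocess(preds, train, k=10, user_col='user', item_col='item', rating_col='rating'):
    # 1) drop user-item pairs that were already seen in training
    seen = train.set_index([user_col, item_col]).index
    preds = preds[~preds.set_index([user_col, item_col]).index.isin(seen)]
    # 2) keep the top-k highest-rated predictions per user
    return (preds.sort_values([user_col, rating_col], ascending=[True, False])
                 .groupby(user_col)
                 .head(k))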

As we can see, the BIVAE model showed the best precision metrics; it was able to adjust very precisely to the tastes of users. But good precision has a downside: Coverage and Diversity are worse than for the ALS model. This means that a large amount of content will likely never get a chance to be seen by users. In this particular case, BIVAE still looks preferable.

But sometimes Diversity is more important to the business than Precision. This can happen, for example, if users on your site watch only romantic comedies and you would like to encourage the audience to explore other genres.

The same can be said about the MostPopular model: its metrics are better than those of the ALS machine learning model. It is tempting to ask why we need all this ML complexity when we have a ready-made model. But look carefully: Coverage is very low, and as the amount of content grows, that percentage will fall even further. Here we have only 1682 films, but what happens if tomorrow the business decides to expand the library by 100k films? Coverage would drop even more for this model. The same rule works in the opposite direction: the less data you have, the more likely it is that a simple MostPopular model will be enough.

It is also interesting to consider the RandomModel: in terms of Precision it does not look too bad compared to ALS, and its Coverage is 100%. Again, don't jump to conclusions; the small number of films in this dataset contributes to this success.

In the end, a suitable model with high quality and acceptable Coverage and Diversity is BIVAE. We can build our base recommendation system on this model.

Sometimes debugging ML can be very difficult. Where and how do you look for problems if the metrics are not very good? In the code? In the data? In the choice of model or its hyperparameters?

There are some tips:

  1. If you have low metrics, for example Precision@k below 0.1, and you don't know whether the cause is the data, the model or the metric calculation, take the MovieLens dataset and train your model on it. If the metrics are low on MovieLens too, the cause is in the model; if they are good, the likely cause lies in the preprocessing and postprocessing stages.
  2. If the Random and MostPopular metrics are close to those of the ML models, it is worth checking the data: maybe the number of unique items is too low, we have very little data, or there is a bug in the training code.
  3. Precision@k values above 0.5 look suspiciously high; check whether there is a bug in the script or whether we are lowering the sparsity index too aggressively.
  4. Always compare how many users and items are left after lowering the sparsity index. Sometimes, in the pursuit of quality, you can lose almost all users, so look for a compromise.

So what do we do next if we want to get the model into production? We need to figure out what else has to be done and how much work it will take.

BIVAE optimization and functionality expansion

  • There are no cold-start mechanisms out of the box, and since we lowered the sparsity index, many users simply will not have predictions.
  • A batch predict algorithm needs to be implemented. Right now prediction is done one user at a time (batch_size=1), which naturally slows everything down.
  • Use metadata about users and items (movies).

Postprocessing

  • Advanced diversification algorithms may be required; their development can sometimes take as long as the model itself.

Monitoring

  • development of new metrics
  • data quality check

Of course, in each case this list will be different; I have listed only the most likely improvements. Perhaps, after compiling such a list, you will want to use another model that has these features out of the box.

  1. We tested different models, including recent solutions, to assess their effectiveness and learned how to choose the most suitable model based on technical and business metrics.
  2. We made the first predictions that can already be shown to users.
  3. We outlined a further action plan for developing the model and releasing it to production.

You can find all the code for the article here.

I would like to tell you more about the BiVAE architecture, about cold start and increasing diversification, about downstream tasks such as item/user similarity, about how quality can be controlled, about online versus offline inference, and about how to move all of this into pipelines. But that is well beyond the scope of this article; if readers like it, I may release a sequel where I go into everything in more detail.

Subscribe to get notified when I publish a new story.

References:

[1] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872

All images unless otherwise noted are by the author.

