Introduction to Embedding-Based Recommender Systems | by Dr. Robert Kübler | Jan, 2023


Photo by Johannes Plenio on Unsplash

They are everywhere: these sometimes fantastic, sometimes poor, and sometimes even funny recommendations on major websites like Amazon, Netflix, or Spotify, telling you what to buy, watch or listen to next. While recommender systems are convenient for us users — we get inspired to try new things — the companies especially benefit from them.

To understand to what extent, let us take a look at some numbers from the paper Measuring the Business Value of Recommender Systems by Dietmar Jannach and Michael Jugovac [1]. From their paper:

  • Netflix: “75 % of what people watch is from some sort of recommendation” (this one is even from Medium!)
  • Youtube: “60 % of the clicks on the home screen are on the recommendations”
  • Amazon: “about 35 % of their sales originate from cross-sales (i.e., recommendation)”, where “their” refers to Amazon

In the same paper [1], you can find more interesting statements about the increased CTRs, engagement, and sales that you can get from employing recommender systems.

So, it seems like recommenders are the greatest thing since sliced bread, and I agree that they are among the best and most interesting things to have emerged from the field of machine learning. That’s why in this article, I want to show you

  • how to design an easy collaborative recommender
  • how to implement it in TensorFlow
  • what the advantages and disadvantages are.

Before we start, let us grab some data we can play with.

If you don’t have it yet, get tensorflow_datasets via pip install tensorflow-datasets. You can download any dataset it offers, but we will stick to a true classic: MovieLens! We take a small version of the MovieLens data consisting of 1,000,000 rows, so training is faster later.

import tensorflow_datasets as tfds

data = tfds.load("movielens/1m-ratings")

data is a dictionary containing TensorFlow datasets, which are great. But to keep things simpler, let’s cast it into a pandas dataframe, so everyone is on the same page.

Note: Usually, you would keep it as a TensorFlow dataset, especially if the data gets even larger, since pandas is extremely hungry for RAM. Do not try to convert the 25,000,000-row version of the MovieLens dataset to a pandas dataframe!

df = tfds.as_dataframe(data["train"])
print(df.head(5))
Image by the author.

⚠️ Warning: Don’t print the entire dataframe since this is a styled dataframe that’s configured to display all 1,000,000 rows by default!

We can see an abundance of data. Each row consists of

  • a user (user_id),
  • a movie (movie_id),
  • the rating that the user gave to the movie (user_rating), expressed as an integer between 1 and 5 (stars), and
  • a lot more features about the user and the movie.

In this tutorial, let us only use the bare minimum: user_id, movie_id, and user_rating, since very often this is the only data we have. Having more features about users and movies is usually a luxury, so let us directly deal with the harder but more broadly applicable case.

We will also keep the timestamp to conduct a temporal train-test split since this resembles how we train in real life: we train now, but we want the model to work well tomorrow. So we should assess the model quality in the same way.

filtered_data = (
    df
    .filter(["timestamp", "user_id", "movie_id", "user_rating"])
    .sort_values("timestamp")
    .astype({"user_id": int, "movie_id": int, "user_rating": int})  # nicer types
    .drop(columns=["timestamp"])  # don't need the timestamp anymore
)

train = filtered_data.iloc[:900000] # chronologically first 90% of the dataset
test = filtered_data.iloc[900000:] # chronologically last 10% of the dataset

filtered_data contains

Image by the author.

Cold Start Problem

If we split the data in any way, we may run into something called the cold start problem, meaning that some users or movies are only present in the test set, but not in the training set. In our case, funnily enough, user 1 is such an example.

print(train.query("user_id == 1").shape[0])
print(test.query("user_id == 1").shape[0])

# Output:
# 0
# 53

It is a bit like a category of a categorical feature that only appears in the test set. It makes learning harder, but still, the model has to deal with it somehow. The recommender that we will build soon is quite prone to the cold start problem, but there are other types of recommenders that can deal with new users or movies in a better way. This is something for another article, though.
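A quick way to quantify how widespread the problem is in our split (computed from the train and test dataframes defined above; the exact counts depend on how you split):

# How many users and movies appear only in the test set?
new_users = set(test["user_id"]) - set(train["user_id"])
new_movies = set(test["movie_id"]) - set(train["movie_id"])

print(f"Users only in the test set: {len(new_users)}")
print(f"Movies only in the test set: {len(new_movies)}")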

Let’s build train and test dataframes and move on.

X_train = train.drop(columns=["user_rating"])
y_train = train["user_rating"]

X_test = test.drop(columns=["user_rating"])
y_test = test["user_rating"]

Now that we know what the data looks like, let us define the model signature, meaning what goes in and what comes out. In our case, it is quite simple: The input should be a user_id and a movie_id, and the output should be the user_rating, i.e. how the user rates the movie.

Image by the author.

But what could such a model look like? This is a tough one, especially for data science beginners. The users and movies are categories, even if we encoded them as integers. So, treating them like numbers and simply training a model on them does not make much sense.

Something Horrible!

For the curious readers, I will do it anyway. The following is an example of how not to do it:

# BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

hgb = HistGradientBoostingRegressor(random_state=0)
hgb.fit(X_train, y_train)
print(hgb.score(X_test, y_test), mean_absolute_error(y_test, hgb.predict(X_test)))

# Output:
# 0.07018701410615702 0.8508620798953698

# BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD BAD

The r² is about 0.07, which is about as good as a regressor that only outputs the mean of the ratings, independently of the user and movie inputs. The mean absolute error is about 0.85, meaning that we miss the true rating by about 0.85 stars on average.
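If you want to verify the baseline claim, scikit-learn’s DummyRegressor always predicts the mean of the training ratings; its r² on the test set is close to zero by construction, which is the point of comparison for the 0.07 above:

from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# A regressor that ignores the inputs and always predicts the mean training rating.
dummy = DummyRegressor(strategy="mean")
dummy.fit(X_train, y_train)
print(dummy.score(X_test, y_test), mean_absolute_error(y_test, dummy.predict(X_test)))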

Instead of doing it like this, I will show you how to use embeddings to build a more meaningful and better model.

One-Hot Encoding as a Special Case of Embeddings

One way to encode categorical variables such as our users or movies is with vectors, i.e. tuples of numbers — called embeddings in this context. This is a useful technique to keep in mind, not only for recommender systems but whenever you deal with categorical data.

Image by the author.

A very simple example of turning categories into numbers is one-hot/dummy encoding. However, the resulting embeddings are high-dimensional for high-cardinality categorical features, leading us right into the curse of dimensionality trap when trying to work with them.

Another drawback is that each pair of vectors has the same distance from each other. As an example, if you take a feature with three categories that are encoded as [1, 0, 0], [0, 1, 0], and [0, 0, 1], each category has the same distance to every other category under common metrics such as the Euclidean and other Minkowski distances, and also the same cosine similarity. This might be fine for nominal features, but for ordinal features such as hot, mild, or cold weather, it would be nicer if hot were closer to mild than to cold.
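A minimal numpy check of this claim, using the three one-hot vectors from the example:

import numpy as np

# The three one-hot encoded categories from above.
hot, mild, cold = np.eye(3)

# Every pair has the same Euclidean distance (sqrt(2)) ...
print(np.linalg.norm(hot - mild), np.linalg.norm(mild - cold), np.linalg.norm(hot - cold))

# ... and the same cosine similarity (0, since the vectors are orthogonal).
print(hot @ mild, mild @ cold, hot @ cold)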

Clearly, this is bad, so we have to think of something different.

The Real Deal

Embeddings allow us to create shorter vectors with more meaning than one-hot encoded vectors.

They are readily available within deep learning frameworks such as TensorFlow and PyTorch. On a very high level, they work like this:

  1. You specify an embedding dimension, i.e. how long the vector should be. This is a hyperparameter that you could tune, among others.
  2. The embeddings for each category get initialized randomly, just as any other weight in your neural network.
  3. Training pushes the embeddings to be more useful to the model.

This is actually not a new operation or layer since you can simulate it by first one-hot encoding the category and then applying a linear (dense) layer without an activation function or bias. Embedding layers are just more performant since they use a simple lookup instead of the matrix product in the linear layer.

Image by the author.
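If you want to convince yourself of this equivalence, here is a small sketch with made-up toy sizes: a lookup in an embedding layer returns the same vector as multiplying the one-hot encoding with the layer’s weight matrix.

import numpy as np
import tensorflow as tf

# Toy sizes, chosen only for this illustration.
n_categories, embedding_dim = 4, 3

embedding = tf.keras.layers.Embedding(input_dim=n_categories, output_dim=embedding_dim)
_ = embedding(np.array([0]))  # call the layer once so its (4, 3) weight matrix is created
weights = embedding.get_weights()[0]

category = 2
one_hot = np.eye(n_categories, dtype="float32")[category]  # [0., 0., 1., 0.]

via_dense = one_hot @ weights                            # one-hot vector times weight matrix
via_lookup = embedding(np.array([category])).numpy()[0]  # direct row lookup

print(np.allclose(via_dense, via_lookup))  # True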

So, now that we have all of the ingredients, let’s build a model! First, we will define the high-level architecture of the model, and then we will build it in TensorFlow, although it is similarly easy in PyTorch if you prefer this.

Architecture

Alright, so two categorical variables (user_id and movie_id) enter the model, and we embed them. We end up with two vectors, preferably of the same length. In the end, we want a single number, the user_rating.

Note: We will model it as a regression problem, but you can also see it as a classification task.

So, how can we make a single number out of two vectors of the same length? There are many ways, but one of the easiest and most efficient ones is by just taking the dot product.

Image by the author.
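As a tiny reminder of what the dot product does, here it is in numpy:

import numpy as np

# 1*4 + 2*5 + 3*6 = 32: two vectors in, one number out.
print(np.dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))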

Nothing too crazy, I would argue. Now we are able to look at how the model should work:

Image by the author.

As a formula, we created this:

Image by the author.
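In symbols (a plain-text reconstruction of the formula in the image, with eᵤ denoting the embedding of user u and eₘ the embedding of movie m):

r(u, m) = eᵤ · eₘ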

which reads as “the rating of movie m from user u equals the dot product of the embedding of user u and the embedding of movie m”.

Implementation in TensorFlow, Version One

The implementation is actually a piece of cake if you know basic TensorFlow. The only thing to pay attention to is that the embedding layers want the categories to be represented as small integers, i.e. from 0 up to the number of categories. Very often you find people populating some dictionary like {“user_8323”: 1, “user_1122”: 2, …} and an inverse dictionary like {1: “user_8323”, 2: “user_1122”, …} to achieve this, but TensorFlow has some nice layers to take care of that as well. We will use the IntegerLookup here. A nice feature of this layer: unknown categories get mapped to 0 by default.
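Here is a minimal sketch of how the IntegerLookup layer behaves, with a made-up vocabulary of three user ids:

import tensorflow as tf

# Known ids are mapped to 1, 2, 3 (in vocabulary order); unknown ids get the OOV index 0.
lookup = tf.keras.layers.IntegerLookup(vocabulary=[101, 205, 310])
print(lookup(tf.constant([205, 101, 999], dtype="int64")).numpy())  # [2 1 0]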

Before we start, we have to grab all the unique users and movies from the training set first.

all_users = train["user_id"].unique()
all_movies = train["movie_id"].unique()

Using the functional API of Keras, you can implement the above ideas like this:

import tensorflow as tf

# user pipeline
user_input = tf.keras.layers.Input(shape=(1,), name="user")
user_as_integer = tf.keras.layers.IntegerLookup(vocabulary=all_users)(user_input)
user_embedding = tf.keras.layers.Embedding(input_dim=len(all_users)+1, output_dim=32)(user_as_integer)

# movie pipeline
movie_input = tf.keras.layers.Input(shape=(1,), name="movie")
movie_as_integer = tf.keras.layers.IntegerLookup(vocabulary=all_movies)(movie_input)
movie_embedding = tf.keras.layers.Embedding(input_dim=len(all_movies)+1, output_dim=32)(movie_as_integer)

# dot product
dot = tf.keras.layers.Dot(axes=2)([user_embedding, movie_embedding])
flatten = tf.keras.layers.Flatten()(dot)

# model input/output definition
model = tf.keras.Model(inputs=[user_input, movie_input], outputs=flatten)

model.compile(loss="mse", metrics=[tf.keras.metrics.MeanAbsoluteError()])

Since we gave the user and movie input layers nice names, we can train the model like this:

model.fit(
    x={
        "user": X_train["user_id"],
        "movie": X_train["movie_id"],
    },
    y=y_train.values,
    batch_size=256,
    epochs=100,
    validation_split=0.1,  # for early stopping
    callbacks=[
        tf.keras.callbacks.EarlyStopping(patience=1, restore_best_weights=True)
    ],
)

# Output (for me):
# ...
# Epoch 18/100
# 3165/3165 [==============================] - 8s 3ms/step - loss: 0.7357 - mean_absolute_error: 0.6595 - val_loss: 11.4699 - val_mean_absolute_error: 2.9923

We could evaluate this model on the test set now, but we can already see here that it’s probably quite bad because val_mean_absolute_error is about 3. That means that we are on average 3 stars off, which is horrible in a 5-star system. This is even worse than our bad model from before, which is quite an achievement. 😅 But why is that? Let’s explore this in the next section.

Implementation in TensorFlow, Version Two

So far, we have built a regression model that can potentially output any real number. It is very hard for the model to learn that it should output numbers in the narrow range between 1 and 5, but we can make it easier with a simple trick: just squash the output, as we do in logistic regression. Instead of the [0, 1] interval, though, let’s scale and shift it to [1, 5].

user_input = tf.keras.layers.Input(shape=(1,), name="user")
user_as_integer = tf.keras.layers.IntegerLookup(vocabulary=all_users)(user_input)
user_embedding = tf.keras.layers.Embedding(input_dim=len(all_users) + 1, output_dim=32)(user_as_integer)

movie_input = tf.keras.layers.Input(shape=(1,), name="movie")
movie_as_integer = tf.keras.layers.IntegerLookup(vocabulary=all_movies)(movie_input)
movie_embedding = tf.keras.layers.Embedding(input_dim=len(all_movies) + 1, output_dim=32)(movie_as_integer)

dot = tf.keras.layers.Dot(axes=2)([user_embedding, movie_embedding])
flatten = tf.keras.layers.Flatten()(dot)

# this is new!
squash = tf.keras.layers.Lambda(lambda x: 4*tf.nn.sigmoid(x) + 1)(flatten)

model = tf.keras.Model(inputs=[user_input, movie_input], outputs=squash)

model.compile(loss="mse", metrics=[tf.keras.metrics.MeanAbsoluteError()])

As a formula:

Image by the author.
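In symbols (reconstructed to match the code above):

r(u, m) = 4 · σ(eᵤ · eₘ) + 1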

where σ is the sigmoid function. Training is as above and evaluating it on the test set gives us

model.evaluate(
    x={"user": X_test["user_id"], "movie": X_test["movie_id"]},
    y=y_test,
)

# Output:
# [...] loss: 0.9701 - mean_absolute_error: 0.7683

This is much better than the previous model and also better than our bad baseline. In case you want an r² score as well:

from sklearn.metrics import r2_score

r2_score(
    y_test,
    model.predict(
        {"user": X_test["user_id"], "movie": X_test["movie_id"]}
    ).ravel(),
)
# Output:
# 0.1767611765807019

Let’s do a small final adjustment to end up with an even better model.

Implementation in TensorFlow, Final Version

In addition to the embeddings, we can also associate a bias term with each movie and user. This captures that some users tend to give rather positive (or negative) ratings, and that some movies tend to get rather positive (or negative) reviews. This way, the biases can do the rough work while the embeddings do the fine-tuning. For example, a user who gives mostly 4 stars for everything will have some fixed bias. The embeddings then only have to explain why this user sometimes gives 3 or 5 stars.

The formula then becomes

Image by the author.
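In symbols (again reconstructed to match the code):

r(u, m) = 4 · σ(eᵤ · eₘ + bᵤ + bₘ) + 1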

where bᵤ and bₘ are the biases of user u and movie m respectively. As code:

user_input = tf.keras.layers.Input(shape=(1,), name="user")
user_as_integer = tf.keras.layers.IntegerLookup(vocabulary=all_users)(user_input)
user_embedding = tf.keras.layers.Embedding(input_dim=len(all_users) + 1, output_dim=32)(user_as_integer)
user_bias = tf.keras.layers.Embedding(input_dim=len(all_users) + 1, output_dim=1)(user_as_integer)

movie_input = tf.keras.layers.Input(shape=(1,), name="movie")
movie_as_integer = tf.keras.layers.IntegerLookup(vocabulary=all_movies)(movie_input)
movie_embedding = tf.keras.layers.Embedding(input_dim=len(all_movies) + 1, output_dim=32)(movie_as_integer)
movie_bias = tf.keras.layers.Embedding(input_dim=len(all_movies) + 1, output_dim=1)(movie_as_integer)

dot = tf.keras.layers.Dot(axes=2)([user_embedding, movie_embedding])
add = tf.keras.layers.Add()([dot, user_bias, movie_bias])
flatten = tf.keras.layers.Flatten()(add)
squash = tf.keras.layers.Lambda(lambda x: 4 * tf.nn.sigmoid(x) + 1)(flatten)

model = tf.keras.Model(inputs=[user_input, movie_input], outputs=squash)

model.compile(loss="mse", metrics=[tf.keras.metrics.MeanAbsoluteError()])

If you like the plot_model output of Keras:

Image by the author.
Image by the author, created with https://netron.app/.
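Such a diagram can be generated with Keras’ built-in utility (this requires pydot and graphviz to be installed; the file name is arbitrary):

tf.keras.utils.plot_model(model, to_file="model.png", show_shapes=True)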

As already indicated, the model performance improves again.

  • MSE ≈ 0.89
  • MAE ≈ 0.746
  • r² ≈ 0.245

Nice! We got the lowest MAE and MSE (and hence the highest r²) with this version.

In this article, we have seen that recommenders have a large impact on businesses, which is why they are widely used. Building a good recommender is not as straightforward as building other models since you often have to deal with high-cardinality categorical features, rendering simple tricks like one-hot encoding useless.

We learned how to circumvent this problem by using embeddings in our neural network architecture. We added some more simple tricks to end up with a not-too-shabby model, even without tuning any hyperparameters. We could improve the model even further by

  • optimizing the embedding dimension (that we just set to 32 so far)
  • applying regularization to the embeddings (see the sketch after this list)
  • building a proper time-split validation set, not a random one as we did
  • retraining it on the complete training dataset (including the validation set) after we know the best hyperparameters
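For the regularization point, the Keras Embedding layer accepts an embeddings_regularizer argument. A minimal sketch for the user embedding (the penalty strength 1e-6 is an arbitrary placeholder, not a tuned value):

user_embedding = tf.keras.layers.Embedding(
    input_dim=len(all_users) + 1,
    output_dim=32,
    embeddings_regularizer=tf.keras.regularizers.l2(1e-6),  # small L2 penalty on the embedding weights
)(user_as_integer)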

One of the biggest advantages is that we can apply the model in most contexts since we only need interaction (rating) data of users and movies. We do not need to know anything more about the users and movies, such as age, gender, or genre, so usually we can get going immediately.
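To actually produce recommendations from the trained model, one simple (hypothetical) recipe is to score every known movie for a given user and take the highest-rated ones; in practice, you would also filter out movies the user has already rated:

import numpy as np

user_id = train["user_id"].iloc[0]  # any user the model has seen during training
candidate_movies = all_movies       # score every movie from the training set

predicted_ratings = model.predict(
    {
        "user": np.full(len(candidate_movies), user_id),
        "movie": candidate_movies,
    }
).ravel()

# The ten movies with the highest predicted rating for this user.
top_10 = candidate_movies[np.argsort(predicted_ratings)[::-1][:10]]
print(top_10)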

The price that we pay for this is that we cannot output meaningful embeddings for unknown users or movies — the cold start problem. The model will output something, but the quality will be horrible.

However, if we happen to have user and movie data, we can do smarter things and incorporate these features in a straightforward way as well. This mitigates the cold start problem and might even improve the model on known users and movies. More about this in the next article!

