
Breaking Down YouTube’s Recommendation Algorithm

by Samuel Flender, April 2023



(Logo design Eyestetix Studio, background design by Dan Cristian Pădureț)

Recommender systems have become one of the most ubiquitous industrial Machine Learning applications of our times, but little is being published about how they actually work in practice.

A notable exception is the 2016 paper by Covington et al., “Deep Neural Networks for YouTube Recommendations”, which is packed with practical insights about YouTube’s deep-learning-powered recommendation algorithm. It provides a rare window not just into the inner workings of a modern industrial recommender system, but also into the problems that today’s ML engineers are trying to tackle.

If you’re looking to deepen your understanding of modern recommender systems, preparing for ML design interviews, or simply curious about how YouTube gets people hooked, read on. In this post, we’ll break down 8 key insights from the paper that help explain YouTube’s (and any modern recommender system’s) success.

Let’s get started.

1 — Recommendation = candidate generation + ranking

YouTube’s recommender system is broken down into 2 stages: the candidate generation stage, which filters the pool of billions of videos down to a few hundred, and the ranking stage, which further narrows down and sorts the candidates that end up in front of the user.

Technically, both stages contain a two-tower neural network — a special architecture with two arms for user ids and video ids, respectively — but their training objectives differ:

  • for the candidate generation model, the learning problem is formulated as an extreme multi-class classification problem: out of all existing videos, predict the ones that the user engaged with.
  • for the ranking model, the learning problem is formulated as a (weighted) logistic regression problem: given a user/video pair, predict whether the user engaged with that video or not.

The motivation behind this design choice is to break down the problem of finding the optimal content into recall optimization and precision optimization: candidate generation optimizes for recall, i.e. making sure we’re capturing all relevant content, while ranking optimizes for precision, i.e. making sure we show the best content first. Breaking the problem down in this way is key to enabling recommendation at the scale of billions of users and videos.
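To make the two-stage funnel concrete, here is a minimal Python sketch. The random embeddings, corpus size, and scoring logic are stand-ins (YouTube’s real models are the deep networks described above); the point is only to show a cheap, recall-oriented retrieval step feeding a more expensive, precision-oriented ranking step.

```python
import numpy as np

# Toy stand-ins for learned video embeddings; everything here is illustrative.
rng = np.random.default_rng(0)
n_videos, dim = 100_000, 32
video_emb = rng.normal(size=(n_videos, dim)).astype(np.float32)

def generate_candidates(user_emb, k=500):
    """Stage 1 (recall): pull the top-k videos by dot product with the user embedding."""
    scores = video_emb @ user_emb
    return np.argpartition(-scores, k)[:k]          # unordered top-k, cheap at scale

def rank(user_emb, candidate_ids, k=20):
    """Stage 2 (precision): re-score the few hundred candidates and sort them."""
    scores = video_emb[candidate_ids] @ user_emb    # stand-in for a richer ranking model
    return candidate_ids[np.argsort(-scores)[:k]]

user_emb = rng.normal(size=dim).astype(np.float32)
recommendations = rank(user_emb, generate_candidates(user_emb))
print(recommendations)
```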

YouTube’s 2-stage recommendation funnel. From Covington et al. (2016), Deep Neural Networks for YouTube Recommendations

2 — Implicit labels work better than explicit labels

Explicit user feedback, such as Like, Share, or Comment, is extremely scarce: out of all users watching a particular video, only a small fraction will leave explicit feedback. A model trained only on Likes, for example, would therefore leave a lot of signal on the table.

Implicit labels, such as user clicks and watch times, are a bit noisier (users may click accidentally), but orders of magnitude more abundant. At YouTube’s scale, label quantity beats label quality, and using implicit feedback as the training objective therefore works better in their models.

3 — Watch sequence matters

The sequence formed by a user’s watch history isn’t random; it contains patterns with asymmetric co-watch probabilities. For example, after watching 2 videos from the same creator, a user is likely to watch another video from that same creator. A model that simply learns to predict whether an impressed video will be watched, irrespective of the user’s recent watch history, doesn’t perform well: again, it’s leaving information on the table.

Instead, YouTube’s model learns to predict the next watch, given the user’s latest watch (and search) history. Technically, it does this by feeding the user’s 50 most recently watched videos and 50 most recent search queries, as of the time of the training example, into the model as features.
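As a rough, hedged sketch of how a variable-length history becomes a fixed-size input (the vocabulary size and dimensions below are made-up numbers, not YouTube’s), each of the last 50 watched video ids is embedded and the embeddings are averaged into a single “watch vector”; the same is done for tokenized search queries:

```python
import torch
import torch.nn as nn

# Illustrative sizes only.
vocab_size, emb_dim, history_len = 100_000, 32, 50
video_embedding = nn.Embedding(vocab_size, emb_dim)

# A batch of one user with their 50 most recently watched video ids.
last_watches = torch.randint(0, vocab_size, (1, history_len))
watch_vector = video_embedding(last_watches).mean(dim=1)   # shape: (1, 32)
# watch_vector is concatenated with other features and fed into the network.
```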

4 — The ranking model is trained using weighted logistic regression

Positive training examples (impressions with clicks) are weighted by their observed watch time, while negative training examples (impressions without clicks) receive unit weights. The purpose of this weighting scheme is to down-weight click-bait content and up-weight content that leads to more meaningful and longer engagement.

Mathematically, the odds learned by such a weighted logistic regression model are approximately equal to the expected watch time. At inference time, we can therefore convert the model’s output into an expected watch time simply by applying the exponential function. Being able to predict watch times enables the next critical insight:
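As a quick intuition for why this works: with N training examples of which k are clicks, and watch times T_i on the clicked ones, the learned odds come out to roughly (sum of T_i) / (N − k), which is close to the expected watch time as long as the click probability is small. Below is a hedged code sketch of the objective and the serving-time transform, with toy numbers rather than YouTube’s actual pipeline:

```python
import torch
import torch.nn.functional as F

# Three toy impressions: raw model outputs (logits), click labels, and watch times.
logits = torch.tensor([2.1, -0.5, 0.3])
labels = torch.tensor([1.0, 0.0, 1.0])
watch_time = torch.tensor([120.0, 0.0, 35.0])   # seconds

# Positives are weighted by their watch time, negatives get unit weight.
weights = torch.where(labels == 1.0, watch_time, torch.ones_like(watch_time))
loss = F.binary_cross_entropy_with_logits(logits, labels, weight=weights)

# At serving time, exponentiating the logit yields the odds, i.e. (approximately)
# the expected watch time, which is what the candidates are ranked by.
expected_watch_time = torch.exp(logits)
print(loss.item(), expected_watch_time)
```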

5 — Ranking by predicted watch time works better than ranking by click-through rate

This is because ranking by click-through rate promotes click-bait content with low watch times: users click, but go back right away. Ranking by predicted watch time down-ranks click-bait, leading to more engaging recommendations.

6 — A diverse feature set is key to high model performance

The advantage of deep learning over linear or tree-based models is that it can handle a diverse set of input signals. YouTube’s model looks at:

  • watch history: which videos has the user watched recently?
  • search history: which keywords has the user searched for recently?
  • demographic features, such as user gender, age, geographic location, and device, which provide priors for “cold-start” users, i.e. users with no history.

Indeed, feature diversity is key to achieving high model performance: the authors show that a model trained on all of these features improves holdout MAP from 6% to 13%, relative to a model trained only on watch history.
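Conceptually, the ranking network just concatenates all of these heterogeneous signals into one input vector for a feedforward network. Here is a minimal, hypothetical sketch; the layer sizes, vocabulary size, and feature selection are assumptions for illustration, not the paper’s exact architecture:

```python
import torch
import torch.nn as nn

class RankingNet(nn.Module):
    """A toy ranking model: embedded sparse features + dense features -> one logit."""
    def __init__(self, n_videos=100_000, emb_dim=32, n_dense=8):
        super().__init__()
        self.video_emb = nn.Embedding(n_videos, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim + n_dense, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),   # logit for the weighted logistic regression objective
        )

    def forward(self, impression_id, watch_history, dense_features):
        impression = self.video_emb(impression_id)            # (B, 32)
        history = self.video_emb(watch_history).mean(dim=1)   # (B, 32): averaged last-50 watches
        x = torch.cat([impression, history, dense_features], dim=1)
        return self.mlp(x)

net = RankingNet()
logit = net(torch.randint(0, 100_000, (4,)),        # impression video ids
            torch.randint(0, 100_000, (4, 50)),     # last 50 watched video ids
            torch.randn(4, 8))                      # normalized dense features (age, device, ...)
```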

YouTube’s ranking neural network takes in a diverse set of features. From Covington et al. (2016), Deep Neural Networks for YouTube Recommendations

7 — Content stays fresh thanks to “Example Age” feature

ML systems are often biased towards the past, simply because they’re trained on historic data. For YouTube, this is a problem because users usually prefer recently uploaded, “fresh” content over content that has been uploaded long ago.

In order to fix this “past bias”, YouTube uses the age of the training example as a feature in the model, and sets it to 0 at inference time to reflect that the model is making predictions at the very end of the training window. For example, if the training data covers a window of 30 days, this “example age” feature would range from roughly 30 days (for the oldest examples) down to 0 (for the most recent ones).
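Here is a hedged illustration of the idea (the dates and helper function are made up): compute each example’s age relative to the end of the training window, and force it to zero at serving time so the model behaves as if it were predicting “right now”:

```python
import datetime as dt

TRAINING_END = dt.datetime(2023, 4, 30)   # hypothetical end of a 30-day training window

def example_age_days(log_time, serving=False):
    """Age of a training example in days; zero at serving time."""
    if serving:
        return 0.0
    return (TRAINING_END - log_time).total_seconds() / 86_400

print(example_age_days(dt.datetime(2023, 4, 1)))    # 29.0 -> an old example
print(example_age_days(dt.datetime(2023, 4, 30)))   # 0.0  -> the most recent example
print(example_age_days(None, serving=True))         # 0.0  -> inference time
```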

The authors show that introducing this feature lets the model capture the popularity spike of freshly uploaded videos, rather than averaging it out over the training window, which is exactly what YouTube wants.

8 — Sparse features are encoded as low-dimensional embeddings

YouTube’s ranking model uses a large number of high-cardinality (“sparse”) categorical features, such as

  • video id and user id,
  • tokenized search queries,
  • the last 50 videos watched by a user, or
  • the “seed” video that started the current watch session.

These sparse features are one-hot encoded and mapped to 32-dimensional embeddings that are learned during model training, and then stored as embedding tables for inference.

In order to limit the memory footprint of these embedding tables, id spaces are truncated to include only the most common ids. For example, if a video has only been watched once in the training period, it’s not worth giving it its own place in the embedding table, and it will therefore be treated the same as a video that was never watched.

Another notable trick is that sparse features within the same id space share the same underlying embeddings. For example, there is a single, global embedding table of video ids that many distinct features use, such as the video id of the impression, the last video id watched by the user, or the video id that seeded the current session (see the sketch after the list below). Sharing embeddings in this way has 3 benefits:

  1. it saves memory, because there are fewer embedding tables that need to be stored,
  2. it speeds up model training, since fewer parameters need to be learned, and
  3. it improves generalization, because it enables the model to have more context about each id.
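
Here is a minimal sketch of both tricks together: one shared video-id embedding table, truncated to the most common ids, reused by every feature that refers to a video. The id-to-index mapping and the learned fallback row are illustrative assumptions (the paper maps out-of-vocabulary ids to a zero embedding):

```python
import torch
import torch.nn as nn

# One shared table for all video-id features; row 0 is a fallback for rare/unseen ids.
top_n, emb_dim = 100_000, 32
shared_video_emb = nn.Embedding(top_n + 1, emb_dim)

# Hypothetical mapping from raw video ids to table rows, built from frequency counts.
id_to_index = {"video_a": 1, "video_b": 2}

def lookup(video_id):
    index = id_to_index.get(video_id, 0)   # truncated ids fall back to row 0
    return shared_video_emb(torch.tensor([index]))

impression_vec = lookup("video_a")     # video id of the impression
last_watch_vec = lookup("video_b")     # last video the user watched
seed_vec = lookup("rare_video_xyz")    # rare id -> shared fallback embedding
```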

