
Biases in Recommender Systems: Top Challenges and Recent Breakthroughs | by Samuel Flender | Feb, 2023



Image generated by the author with Midjourney

Recommender systems have become ubiquitous in our daily lives, from online shopping to social media to entertainment platforms. These systems use complex algorithms to analyze historic user engagement data and make recommendations based on their inferred preferences and behaviors.

While these systems can be incredibly useful in helping users discover new content or products, they are not without their flaws: recommender systems are plagued by various forms of bias that can lead to poor recommendations and therefore poor user experience. One of today’s main research threads around recommender systems is therefore how to de-bias them.

In this article, we’ll dive into 5 of the most prevalent biases in recommender systems, and learn about some of the recent research from Google, YouTube, Netflix, Kuaishou, and others.

Let’s get started.

1 — Clickbait bias

Wherever there’s an entertainment platform, there’s clickbait: sensational or misleading headlines or video thumbnails designed to grab a user’s attention and entice them to click, without providing any real value. “You won’t believe what happened next!”

If we train a ranking model using clicks as positives, naturally that model will be biased in favor of clickbait. This is bad, because such a model would promote even more clickbait to users, and therefore amplify the damage it does.

One solution for de-biasing ranking models against clickbait, proposed by Covington et al (2016) in the context of YouTube video recommendations, is weighted logistic regression, where the weights are the watch time for positive training examples (impressions with clicks) and unity for negative training examples (impressions without clicks).

Mathematically, it can be shown that such a weighted logistic regression model learns odds that approximate the expected watch time of a video. At serving time, videos are ranked by their predicted odds, so that videos with long expected watch times land at the top of the recommendations and clickbait (with the lowest expected watch times) ends up at the bottom.
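To make this concrete, here is a minimal sketch of the weighting scheme using scikit-learn. The features, click labels, and watch times below are random placeholders, not the signals used in the paper, and a production ranker would be a large neural network rather than a plain logistic regression:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one row per impression.
rng = np.random.default_rng(0)
X = rng.random((1000, 16))                         # (user, video) features
clicked = rng.integers(0, 2, size=1000)            # 1 = click, 0 = no click
watch_time = rng.exponential(scale=60, size=1000)  # seconds watched

# Weighted logistic regression: positive examples are weighted by their
# watch time, negative examples get unit weight.
sample_weight = np.where(clicked == 1, watch_time, 1.0)
model = LogisticRegression(max_iter=1000).fit(X, clicked, sample_weight=sample_weight)

# At serving time, rank candidates by predicted odds, which under this
# weighting scheme approximate expected watch time.
log_odds = model.decision_function(X)
ranking = np.argsort(-np.exp(log_odds))            # longest expected watch time first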

Unfortunately, Covington et al don’t share all of their experimental results, but they do say that weighted logistic regression performs “much better” than predicting clicks directly.

2 — Duration bias

Weighted logistic regression works well for solving the clickbait problem, but it introduces a new problem: duration bias. Simply put, longer videos tend to be watched for a longer time, not necessarily because they’re more relevant, but simply because they’re longer.

Think about a video catalog that contains 10-second short-form videos along with 2-hour long-form videos. A watch time of 10 seconds means something completely different in the two cases: it’s a strong positive signal in the former, and a weak positive (perhaps even a negative) signal in the latter. Yet, the Covington approach would not be able to distinguish between these two cases, and would bias the model in favor of long-form videos (which generate longer watch times simply because they’re longer).

A solution to duration bias, proposed by Zhan et al (2022) from Kuaishou, is quantile-based watch-time prediction.

The key idea is to bucket all videos into duration quantiles, and then bucket all watch times within a duration bucket into quantiles as well. For example, with 10 quantiles, such an assignment could look like this:

(training example 1)
video duration = 120min --> video quantile 10
watch duration = 10s --> watch quantile 1

(training example 2)
video duration = 10s --> video quantile 1
watch duration = 10s --> watch quantile 10
...

By translating all time intervals into quantiles, the model understands that 10s is “high” in the latter example but “low” in the former, or so the authors hypothesize. At training time, we provide the model with the video quantile and task it with predicting the watch quantile. At inference time, we simply rank all videos by their predicted watch time, which is now de-confounded from the video duration itself.
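Here is a rough sketch of how that two-level bucketing could be done with pandas. The column names and toy values are made up, and I use 2 quantiles instead of the 10 in the example above just to keep the toy data readable:

import pandas as pd

# Hypothetical log of training examples; column names are illustrative.
df = pd.DataFrame({
    "video_duration_s": [7200, 10, 300, 45, 5400, 12, 900, 30],
    "watch_time_s":     [10,   10, 120,  5, 600,  3,  60,  8],
})

# Step 1: bucket videos into duration quantiles (2 here for the toy data;
# the paper uses many more).
df["duration_quantile"] = pd.qcut(df["video_duration_s"], q=2, labels=False)

# Step 2: within each duration bucket, bucket watch times into quantiles.
# This per-bucket quantile is the label the model is trained to predict.
df["watch_quantile"] = df.groupby("duration_quantile")["watch_time_s"].transform(
    lambda s: pd.qcut(s.rank(method="first"), q=2, labels=False)
)

print(df)  # a 10s watch is "high" for a 10s video, but "low" for a 2h video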

And indeed, this approach appears to work. Using A/B testing, the authors report

  • 0.5% improvements in total watch time compared to weighted logistic regression (the idea from Covington et al), and
  • 0.75% improvements in total watch time compared to predicting watch time directly.

The results show that removing duration bias can be a powerful approach on platforms that serve both long-form and short-form videos. Perhaps counter-intuitively, removing the bias in favor of long videos in fact improves overall user watch times.

3 — Position bias

Position bias means that the highest-ranked items create the most engagement not because they’re actually the best content for the user, but simply because they’re ranked highest, and users start to blindly trust the ranking they’re being shown. The model predictions become a self-fulfilling prophecy, but this is not what we really want: we want to predict what users want, not make them want what we predict.

Position bias can be mitigated by techniques such as rank randomization, intervention harvesting, or using the ranks themselves as features, which I covered in my other post here.
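To illustrate the simplest of these, ranks as features, here is a hedged sketch: the position at which an item was shown is fed to the model during training and frozen to a constant at serving time, so that candidates are compared on content alone. All names and data below are hypothetical:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical training data: content features plus the position (rank)
# at which each item was actually shown to the user.
content_features = rng.random((1000, 8))
shown_position = rng.integers(1, 21, size=(1000, 1))   # slots 1..20
clicked = rng.integers(0, 2, size=1000)

# Train with position as an explicit feature, so the model can attribute
# part of the click probability to the slot rather than to the content.
X_train = np.hstack([content_features, shown_position])
model = LogisticRegression(max_iter=1000).fit(X_train, clicked)

# At serving time, freeze the position feature to the same value for every
# candidate, so items are compared on content alone.
candidates = rng.random((50, 8))
fixed_position = np.ones((50, 1))
scores = model.decision_function(np.hstack([candidates, fixed_position]))
ranking = np.argsort(-scores)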

Particularly problematic is that position bias will always make our models look better on paper than they actually are. Our models may be slowly degrading in quality, but we wouldn’t know what is happening until it’s too late (and users have churned away). It is therefore important, when working with recommender systems, to monitor multiple quality metrics of the system, including metrics that quantify user retention and the diversity of recommendations.

4 — Popularity bias

Popularity bias refers to the tendency of the model to give higher rankings to items that are more popular overall (due to the fact that they’ve been rated by more users), rather than being based on their actual quality or relevance for a particular user. This can lead to a distorted ranking, where less popular or niche items that could be a better fit for the user’s preferences are not given adequate consideration.

Yi et al (2019) from Google propose a simple but effective algorithmic tweak to de-bias a video recommendation model from popularity bias. During model training, they replace the logits in their logistic regression layer as follows:

logit(u,v) <-- logit(u,v) - log(P(v))

where

  • logit(u,v) is the logit function (i.e., the log-odds) for user u engaging with video v, and
  • log(P(v)) is the log-frequency of video v.

Of course, the right hand side is equivalent to:

log[ odds(u,v)/P(v) ]

In other words, they simply normalize the predicted odds for a user/video pair by the video probability. Extremely high odds from popular videos count as much as moderately high odds from not-so-popular videos. And that’s the entire magic.
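Here is a minimal numpy sketch of that correction, with P(v) estimated from hypothetical video counts in the training data:

import numpy as np

# Hypothetical raw logits for one user over four candidate videos, and
# impression counts used to estimate each video's popularity P(v).
raw_logits = np.array([4.1, 2.3, 2.0, 1.2])       # logit(u, v)
video_counts = np.array([90000, 500, 800, 50])    # occurrences in training data

p_v = video_counts / video_counts.sum()           # P(v), popularity estimate

# Popularity correction: logit(u, v) <-- logit(u, v) - log(P(v))
corrected_logits = raw_logits - np.log(p_v)

# Ranking by corrected logits no longer rewards sheer popularity: the most
# popular video drops below the niche videos with moderately high odds.
ranking = np.argsort(-corrected_logits)
print(ranking)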

And indeed, the magic appears to work: in online A/B tests, the authors find a 0.37% improvement in overall user engagements with the de-biased ranking model.

5 — Single-interest bias

Suppose you watch mostly drama movies, but sometimes you like to watch a comedy, and from time to time a documentary. You have multiple interests, yet a ranking model trained to maximize your watch time may over-emphasize drama movies because that’s what you’re most likely to engage with. This is single-interest bias, the failure of a model to understand that users inherently have multiple interests and preferences.

In order to remove single-interest bias, a ranking model needs to be calibrated. Calibration simply means that, if you watch drama movies 80% of the time, then the model’s top 100 recommendations should in fact include around 80 drama movies (and not 100).

Netflix’s Harald Steck (2018) demonstrates the benefits of model calibration with a simple post-processing technique called Platt scaling. He presents experimental results that demonstrate the effectiveness of the method in improving the calibration of Netflix recommendations, which he quantifies with KL divergence scores. The resulting movie recommendations are more diverse — in fact, as diverse as the actual user preferences — and result in improved overall watch times.
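To make the calibration idea concrete, here is a small sketch, not the method from the paper, that quantifies how well a recommendation list matches a user’s genre distribution using KL divergence; the genre shares are toy numbers:

import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two discrete distributions (e.g. over genres)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical genre shares: [drama, comedy, documentary]
user_history = np.array([0.80, 0.15, 0.05])   # what the user actually watches
uncalibrated = np.array([1.00, 0.00, 0.00])   # top-100 list that is all drama
calibrated   = np.array([0.78, 0.16, 0.06])   # list matching the user's interests

print(kl_divergence(user_history, uncalibrated))  # large: poorly calibrated
print(kl_divergence(user_history, calibrated))    # near zero: well calibrated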

