Hashing in Modern Recommender Systems: A Primer | by Samuel Flender | Mar, 2023

Understanding the most underrated trick in applied Machine Learning

(Midjourney)

Hashing is one of the most common “tricks” used in industrial Machine Learning applications, yet it doesn’t get nearly as much attention as it deserves.

The biggest advantage of hashing, especially in modern recommender systems, is its finite-memory guarantee: without hashing, it would be highly impractical to learn the relevance of billions of videos, news articles, photos, or web pages for billions of users without running out of memory.

But we’re getting ahead of ourselves here. This post is a primer, so let’s go back to where it all started: the famous 2009 “hashing trick” paper.

The paper that started it all

The idea of using hashing as a way to process features, as well as the term "hashing trick", was first introduced in a 2009 paper by a team of researchers from Yahoo, led by Kilian Weinberger, in the context of email spam detection. An email, after all, is a sequence of words, and each word can be thought of as a feature. With hashing, the authors explain, we can represent emails as vectors (where each index corresponds to a particular word), and use these vectors in a collaborative spam-filtering model.

For example, the phrase “Your prescription is ready” in a 20-dimensional hash space may be equivalent to

[0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1]

where the positions of the ones depend on the particular choice of hash function used.
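In code, the basic hashing trick looks something like the following sketch (the md5-based hash function and the 20-dimensional output are illustrative choices here, not the paper's exact setup):

```python
import hashlib

def hashing_trick(words, dim=20):
    """Map a list of words to a fixed-size binary feature vector.

    Each word is hashed to an index in [0, dim); that entry is set to 1.
    A different hash function (or dim) would place the ones elsewhere.
    """
    vec = [0] * dim
    for word in words:
        # md5 gives a stable hash across runs (Python's hash() is salted).
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[idx] = 1
    return vec

vec = hashing_trick("your prescription is ready".split())
```

No matter how many distinct words exist, the vector never grows beyond dim entries, which is the finite-memory guarantee in action.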

However, different words may have different relevance to different people. The word “prescription” may signal spam to one user, but not to another. In order to account for this difference in relevance, the authors introduce “personalized” tokens, in which they combine the user id with the words themselves. For the phrase above, they’d not just hash the tokens

"your", "prescription",  "is", "ready",

but also

"user123_your", "user123_prescription", "user123_is", "user123_ready",

and so on. Given their dataset of 40 million unique words and 400,000 users, this personalized tokenization results in a total of 16 trillion possible features.
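The personalized token expansion itself is simple; a minimal sketch (the user id and the prefixing scheme are illustrative):

```python
def personalized_tokens(user_id, words):
    """Return the global tokens plus per-user variants, as in the 2009 paper.

    Keeping both lets the model learn a global weight for each word
    and a per-user correction on top of it.
    """
    return list(words) + [f"{user_id}_{w}" for w in words]

tokens = personalized_tokens("user123", ["your", "prescription", "is", "ready"])
# tokens: ["your", ..., "ready", "user123_your", ..., "user123_ready"]
```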

By hashing these personalized features into a hash space of size 2²², or roughly 4.2 million (a reduction of almost 7 orders of magnitude!), the authors achieve a total spam reduction of 30% compared to a baseline model with no hashing-powered personalization, one of the first clear demonstrations of the usefulness of hashing in ML problems.

Hashing in modern recommender systems

Fast forward from 2009 to today, and we see that a lot of the ML technology has changed (deep neural nets have largely replaced linear collaborative filters), but hashing is still there.

Modern recommender systems are usually some variant of a two-tower neural network, where one tower learns user embeddings from user ids and the other learns, say, video embeddings from video ids (for a video recommender system). At training time the model is given user/video pairs from historical engagement data, such as video impressions with clicks as positives and those without clicks as negatives, resulting in a shared embedding space for users and videos. We can then store the embeddings for all of our users and videos in a set of embedding tables, and use these tables to make recommendations at serving time, for example with a k-nearest-neighbor (kNN) search.
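As a toy illustration of the serving side, here is a dot-product scorer over pre-computed embedding tables (the random tables, ids, and sizes are stand-ins for trained tower outputs, and a real system would use approximate rather than brute-force kNN):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # embedding dimension (illustrative)

# Random stand-ins for the embeddings the two towers would produce.
user_emb = {uid: rng.normal(size=DIM) for uid in ["alice", "bob"]}
video_emb = {vid: rng.normal(size=DIM) for vid in ["v1", "v2", "v3"]}

def score(uid, vid):
    """Relevance = dot product in the shared embedding space."""
    return float(user_emb[uid] @ video_emb[vid])

def recommend(uid, k=2):
    """Brute-force kNN: rank all videos by score for this user."""
    ranked = sorted(video_emb, key=lambda vid: score(uid, vid), reverse=True)
    return ranked[:k]
```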

So far so good, but where does hashing fit in here?

Well, consider a model that stores a modest 100-dimensional embedding for each of 1B videos and 1B users: at 4 bytes per float, that's already 800GB of memory. Such a huge memory footprint makes the model impractical (if not impossible) and expensive to serve. With hashing, we can first reduce the "vocabulary" of videos and users down to, say, 10M each, resulting in a far more manageable footprint of 8GB.
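The arithmetic behind those numbers, assuming 4-byte (float32) embedding entries:

```python
def table_bytes(num_rows, dim=100, bytes_per_float=4):
    """Size of one embedding table of float32 vectors, in bytes."""
    return num_rows * dim * bytes_per_float

# One table for 1B users plus one for 1B videos...
full = table_bytes(1_000_000_000) * 2
# ...versus hashing both vocabularies down to 10M buckets each.
hashed = table_bytes(10_000_000) * 2

print(full // 10**9, "GB")    # 800 GB
print(hashed // 10**9, "GB")  # 8 GB
```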

Hashing, in other words, allows us to scale modern recommender systems to billions of users and billions of items. Without it, the scale of recommender systems would be fundamentally limited by the amount of memory we can afford.

Towards collision-free hashing

The downside of hashing is hash collisions: multiple different ids will eventually be mapped to the same hash, so multiple users or items end up sharing the same embedding in the embedding table. Obviously, this cramming together of information can degrade the resulting recommendations. One of the most important research questions in recommender systems today is therefore how to make hashing collision-free.

One idea is "Deep Hash Embeddings" (DHE), proposed in 2021 by a team from Google Brain led by Wang-Cheng Kang. Instead of a single hash function, they use a large number of hash functions, concatenate all the hashes for a single id into a dense vector, and feed that vector into a deep neural net which learns higher-order representations of the ids. The key idea is that, if the number of hash functions is large enough, hash collisions become vanishingly unlikely.
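A simplified sketch of the DHE encoding step (the md5-based hash family, the modulus, and the normalization are illustrative choices, and the downstream deep network is omitted):

```python
import hashlib
import numpy as np

def dhe_encode(raw_id, num_hashes=1024, modulo=1_000_000):
    """Encode one id as a dense vector of many hash values.

    Each of num_hashes salted hashes contributes one entry; the values
    are normalized to roughly [-1, 1) to serve as network inputs.
    """
    values = [
        int(hashlib.md5(f"{k}:{raw_id}".encode()).hexdigest(), 16) % modulo
        for k in range(num_hashes)
    ]
    return (np.array(values) / modulo) * 2 - 1

enc = dhe_encode("video_42")  # dense vector that an MLP would consume
```

Because every id gets 1024 independent hash values, two ids would have to collide on all of them at once to share a representation, which is the sense in which collisions become vanishingly unlikely.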

And indeed, their approach shows promise: using DHE with 1024 hash functions, the authors see an improvement of 0.25% in AUC over regular hashing on a set of public benchmark datasets. One caveat, though, is that it doesn't work for multivalent features, only for ids directly. (For example, DHE, as proposed here, couldn't encode the list of movies you've watched on Netflix in the past 30 days, which would be a multivalent feature.)

Another promising collision-free approach is "Cuckoo hashing", first introduced in 2001 by Rasmus Pagh and Flemming Rodler at Aarhus University, Denmark. Cuckoo hashing is used in Monolith, ByteDance's online recommendation system, introduced in their 2022 paper led by Zhuoran Liu.

In Cuckoo hashing we maintain not one but multiple hash tables; ByteDance's paper uses two. When we encounter a new id, by default we hash it into the first table. If that slot is already occupied by another id, we evict the occupant (hence the name of the algorithm) and re-hash it into a slot in the second table. This process repeats until there are no more evictions and the set of hash tables stabilizes without collisions. Compared to regular hashing (with collisions), the authors find an improvement of 0.5% AUC on a public benchmark dataset with their collision-free approach.
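A toy two-table cuckoo insert can make the eviction loop concrete (the hash functions here are deliberately simple and purely illustrative, nothing like Monolith's actual implementation):

```python
class CuckooTable:
    """Minimal two-table cuckoo hash storing string ids (illustrative sketch)."""

    def __init__(self, size=8):
        self.size = size
        self.tables = [[None] * size, [None] * size]

    def _slot(self, key, which):
        # Toy hash family: one simple function per table.
        h = sum(ord(c) for c in key)
        return (h * (2 * which + 1) + which) % self.size

    def insert(self, key, max_evictions=32):
        which = 0
        for _ in range(max_evictions):
            slot = self._slot(key, which)
            key, self.tables[which][slot] = self.tables[which][slot], key
            if key is None:        # found an empty slot, done
                return True
            which = 1 - which      # the evicted id tries the other table
        return False               # a real system would rehash or grow here

    def lookup(self, key):
        return any(self.tables[w][self._slot(key, w)] == key for w in (0, 1))
```

Every stored id occupies exactly one slot in one of the two tables, so lookups check at most two positions and no two ids ever share a slot.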

Coda: the most underrated trick in applied Machine Learning

To recap:

  • Hashing allows us to scale modern recommender systems to billions of users and billions of items with a finite-memory guarantee.
  • The idea of using hashing in ML applications dates back to a 2009 paper from Yahoo, which demonstrated its usefulness in an email spam detection model.
  • Hashing, however, introduces hash collisions, which can degrade recommendations because multiple users and items share the same embedding.
  • For this reason, recent research in recommender systems focuses on how to make hashing collision-free. Notable examples are Deep Hash Embeddings (Google Brain) and Cuckoo hashing (ByteDance).

For many years, hashing has been among the most underrated tricks in the applied ML literature, with much of the attention focusing instead on model architecture, data selection, or feature engineering.

This may begin to change. As the recent papers from Google Brain, ByteDance, and others have shown, optimizing recommender systems for hash collisions can enable notable gains in performance. The surprising rise in popularity of TikTok (owned by ByteDance) may, at least in part, be explained by better hashing.

Watch this space: new breakthroughs are certainly on the horizon.
