
Solving Machine Learning’s Generalization-Memorization Dilemma: 3 Promising Paradigms

by Samuel Flender, April 2023



(Midjourney)

The “holy grail” of Machine Learning is the ability to build systems that can both memorize known patterns in the training data as well as generalize to unknown patterns in the wild.

It’s the holy grail because this is how we humans learn as well. You can recognize your grandma in an old photo, but you could also recognize a Xoloitzcuintli as a dog even though you’ve never actually seen one before. Without memorization we’d have to constantly re-learn everything from scratch, and without generalization we wouldn’t be able to adapt to an ever-changing world. To survive, we need both.

Traditional statistical learning theory tells us that this is impossible: models can either generalize well or memorize well, but not both. It’s the well-known bias-variance trade-off, one of the first things we learn in standard ML curricula.
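For reference, the textbook statement of this trade-off decomposes a model’s expected squared error at a point x into three terms (here f is the true function, f̂ the model fitted on a random training set, and σ² the variance of the label noise):

E[(y − f̂(x))²] = (E[f̂(x)] − f(x))² + E[(f̂(x) − E[f̂(x)])²] + σ²
               =       bias²        +       variance        + irreducible noise

Simple models tend to have high bias (they under-fit and fail to memorize), while very flexible models tend to have high variance (they over-fit and memorize noise), which is why the two goals appear to be at odds.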

How then can we build such universal learning systems? Is the holy grail within reach?

In this post, let’s dive into 3 paradigms from the literature:

  1. Generalize first, memorize later
  2. Generalize and memorize at the same time
  3. Generalize with machines, memorize with humans

Let’s get started.

BERT revolutionized Machine Learning with its introduction of the pre-training/fine-tuning paradigm: after pre-training in an unsupervised way on a massive amount of text data, the model can be rapidly fine-tuned on a specific downstream task with relatively few labels.
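As a rough sketch of what this looks like in practice, using the Hugging Face transformers library (the checkpoint name and the label count, here the 9 CoNLL03 NER tags, are illustrative choices, not a prescription):

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load weights learned during unsupervised pre-training...
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=9,  # ...and attach a small, randomly initialized classification head
)
# From here, fine-tuning is ordinary supervised training of `model` on
# (tokenized sentence, per-token label) pairs for a handful of epochs.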

Surprisingly, this pre-training/fine-tuning approach turns out to solve the generalization/memorization problem: Michael Tänzer and collaborators from Imperial College London show in a 2022 paper that BERT can generalize as well as memorize.

In particular, the authors show that during fine-tuning, BERT learns in 3 distinct phases:

  1. Fitting (epoch 1): the model learns simple, generic patterns that explain as much of the training data as possible. During this phase, both training and validation performance increase.
  2. Settling (epochs 2–5): there are no more simple patterns left to learn. Both training and validation performance saturate, forming a plateau in the learning curve.
  3. Memorization (epochs 6+): the model starts to memorize specific examples in the training set, including noise, which improves training performance but degrades validation performance.

How did they figure this out? By starting with a noise-free training set (CoNLL03, a named-entity-recognition benchmark dataset), and then gradually introducing more and more artificial label noise. Comparing the learning curves with different amounts of noise clearly reveals the 3 distinct phases: more noise results in a steeper drop during phase 3.
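A minimal sketch of this kind of experiment is shown below; the train_one_epoch, evaluate, and build_fresh_model helpers stand in for an ordinary fine-tuning loop and are assumptions for illustration, not code from the paper:

import copy
import random

def flip_labels(examples, noise_rate, num_classes, seed=0):
    # Return a copy of the dataset with a fraction of labels replaced at random.
    rng = random.Random(seed)
    noisy = copy.deepcopy(examples)  # list of dicts, each with a "label" field
    for ex in noisy:
        if rng.random() < noise_rate:
            ex["label"] = rng.randrange(num_classes)
    return noisy

for noise_rate in [0.0, 0.1, 0.3, 0.5]:
    train_set = flip_labels(clean_train_set, noise_rate, num_classes=9)
    model = build_fresh_model()  # hypothetical helper: re-initialize for each run
    for epoch in range(1, 21):
        train_one_epoch(model, train_set)
        train_score = evaluate(model, train_set)     # keeps improving as noise gets memorized
        val_score = evaluate(model, validation_set)  # drops once phase 3 (memorization) begins
        print(noise_rate, epoch, train_score, val_score)

Plotting val_score per epoch for each noise level reproduces the qualitative picture described above: the clean run plateaus, while noisier runs drop more steeply once memorization kicks in.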

Fine-tuning BERT exhibits 3 distinct learning phases. Figure taken from Tänzer et al, “Memorisation versus Generalisation in Pre-trained Language Models” (link)

Tänzer et al also show that memorization in BERT requires repetition: BERT memorizes a specific training example only once it has seen that example a certain number of times. This can be deduced from the learning curve for the artificially introduced noise: it’s a step function, which improves with each epoch. In other words, during phase 3 BERT can eventually memorize the entire training set, if we just let it train for a sufficient number of epochs.
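One way to measure this repetition effect, continuing the hypothetical sketch above (predict, corrupted_examples, and the training helpers are again assumptions): record, after every epoch, how many corrupted examples the model now reproduces with their flipped label.

memorized_at = {}  # example id -> first epoch at which the flipped label is reproduced
for epoch in range(1, 21):
    train_one_epoch(model, train_set)
    for ex in corrupted_examples:  # the subset whose labels were flipped
        if ex["id"] not in memorized_at and predict(model, ex) == ex["label"]:
            memorized_at[ex["id"]] = epoch
    print(epoch, len(memorized_at), "of", len(corrupted_examples), "examples memorized")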

BERT, in conclusion, appears to generalize first and memorize later, as evidenced by the observation of its 3 distinct learning phases during fine-tuning. In fact, it can also be shown that this behavior is a direct consequence of pre-training: Tänzer et al show that a randomly initialized BERT model does not share the same 3 learning phases. This leads to the conclusion that the pre-training/fine-tuning paradigm may be a possible solution to the generalization/memorization dilemma.

Let’s leave the world of natural language processing and enter the world of recommender systems.

In modern recommender systems, the ability to memorize and generalize at the same time is critical. YouTube, for example, wants to show you videos that are similar to the ones you’ve watched in the past (memorization), but also new ones that are a little bit different and that you didn’t even know you’d like (generalization). Without memorization you’d get frustrated, and without generalization you’d get bored.

The best recommender systems today need to do both. But how?

In a 2016 paper, Heng-Tze Cheng and collaborators from Google propose what they call “Wide and Deep Learning” to address this problem. The key idea is to build a single neural network that has both a deep component (a deep neural net with embedding inputs) for generalization as well as a wide component (a linear model with a large number of sparse inputs) for memorization. The authors demonstrate the effectiveness of this approach on recommendations within the Google Play store, which recommends apps to users.

The inputs to the deep component are dense features as well as embeddings of categorical features such as user language, user gender, impressed app, installed apps, and so on. These embeddings are initialized randomly, and then tuned during model training along with the other parameters in the neural network.
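A minimal sketch of the deep component’s input handling in PyTorch is shown below; the feature names, vocabulary sizes, and layer widths are made up for illustration and are not the paper’s values:

import torch
import torch.nn as nn

class DeepComponent(nn.Module):
    def __init__(self, num_dense=4):
        super().__init__()
        # Each categorical feature gets its own randomly initialized embedding table,
        # trained jointly with the rest of the network.
        self.user_language = nn.Embedding(100, 16)     # e.g. 100 languages -> 16-dim vectors
        self.impressed_app = nn.Embedding(10_000, 32)  # e.g. 10k apps -> 32-dim vectors
        self.mlp = nn.Sequential(
            nn.Linear(16 + 32 + num_dense, 64), nn.ReLU(),
            nn.Linear(64, 1),  # the deep component's logit
        )

    def forward(self, language_id, app_id, dense_features):
        x = torch.cat([self.user_language(language_id),
                       self.impressed_app(app_id),
                       dense_features], dim=-1)
        return self.mlp(x)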

The inputs to the wide component of the network are granular cross-features such as

AND(user_installed_app=netflix, impression_app=hulu),

the value of which is 1 if the user has Netflix installed and the impressed app is Hulu.
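In code, such a cross-feature is nothing more than a binary indicator; a tiny sketch mirroring the example above:

def cross_feature(user_installed_apps, impression_app,
                  installed="netflix", impressed="hulu"):
    # 1 if the user has `installed` on their device AND the impressed app is `impressed`.
    return 1 if (installed in user_installed_apps and impression_app == impressed) else 0

cross_feature({"netflix", "spotify"}, "hulu")  # -> 1
cross_feature({"spotify"}, "hulu")             # -> 0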

It’s easy to see why the wide component enables a form of memorization: if 99% of users who install Netflix also end up installing Hulu, the wide component will be able to learn this piece of information, while it may get lost in the deep component. Having both the wide and deep components really is the key to peak performance, argue Cheng et al.
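Putting the two together, here is a minimal sketch of the combined model, building on the DeepComponent sketched above: the wide part is a single linear layer over the sparse cross-features, and the two logits are summed before the sigmoid. This follows the joint-training idea of the paper but simplifies away most details.

class WideAndDeep(nn.Module):
    def __init__(self, num_cross_features, num_dense=4):
        super().__init__()
        self.wide = nn.Linear(num_cross_features, 1)  # one weight per cross-feature (memorization)
        self.deep = DeepComponent(num_dense)          # embeddings + MLP (generalization)

    def forward(self, cross_features, language_id, app_id, dense_features):
        logit = self.wide(cross_features) + self.deep(language_id, app_id, dense_features)
        return torch.sigmoid(logit)  # e.g. the probability of an app install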

And in fact, the experimental results confirm the authors’ hypothesis. The wide & deep model outperformed both a wide-only model (by 2.9%) and a deep-only model (by 1%) in terms of online acquisition gain in the Google Play store. These experimental results indicate that “Wide & Deep” is another promising paradigm to solve the generalization/memorization dilemma.

Both Tänzer et al and Cheng et al propose approaches that tackle the generalization/memorization dilemma with machines alone. However, machines have a hard time memorizing singular examples: Tänzer et al find that BERT requires at least 25 instances of a class to learn to predict it at all, and 100 instances to predict it “with some accuracy”.

Taking a step back, we don’t have to let machines do all of the work. Instead of fighting our machines’ failure to memorize, why not embrace it? Why not build a hybrid system that combines ML with human expertise?

That’s precisely the idea behind Chimera, Walmart’s production system for large-scale e-commerce item classification, presented in a 2014 paper by Chong Sun and collaborators from Walmart Labs. The premise behind Chimera is that Machine Learning alone is not enough to handle item classification at scale, due to the large number of edge cases with little training data.

For example, Walmart may agree to carry a limited number of new products from a new vendor on a trial basis. An ML system may not be able to accurately classify these products because there’s not enough training data. However, human analysts can write rules to cover these cases precisely. These rule-based decisions can then later be used in the model training, so that after some time the model can catch up to the new patterns.
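A highly simplified sketch of such a hybrid loop is shown below. This is my illustration of the idea, not Chimera’s implementation: the rule format, the vendor name, and the helper methods are all made up.

rules = [
    # Analyst-written rule for an edge case with little training data (hypothetical).
    (lambda item: "vendor_x" in item["description"].lower(), "Trial Products / Vendor X"),
]

def classify(item, model):
    for condition, category in rules:
        if condition(item):                  # human-written rules win on edge cases
            return category, "rule"
    return model.predict(item), "model"      # the ML model handles everything else

def retrain(model, recent_items, labeled_examples):
    # Feed rule-based decisions back into the training data so that, over time,
    # the model catches up to the new patterns.
    for item in recent_items:
        category, source = classify(item, model)
        if source == "rule":
            labeled_examples.append((item, category))
    model.fit(labeled_examples)
    return model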

The authors conclude:

We use both machine learning and hand-crafted rules extensively. Rules in our system are not “nice to have”. They are absolutely essential to achieving the desired performance, and they give domain analysts a fast and effective way to provide feedback into the system. As far as we know, we are the first to describe an industrial-strength system where both [Machine] learning and rules co-exist as first-class citizens.

Indeed, this co-existence may also be the key to solving the generalization/memorization problem.

Let’s recap. Building systems that can both memorize known patterns and generalize to unknown ones is the holy grail of Machine Learning. As of today, no one has yet cracked this problem completely, but we’ve seen a few promising directions:

  • BERT has been shown to generalize first and memorize later during fine-tuning, a capability that’s possible due to being pre-trained beforehand.
  • Wide & Deep neural networks have been designed to both generalize (using the deep component) and memorize (using the wide component) at the same time, outperforming both wide-only and deep-only networks in Google Play store recommendations.
  • Walmart’s hybrid production system Chimera leverages human experts to write rules for edge cases that their ML models fail to memorize. By adding these rule-based decisions back into the training data, over time the ML models can catch up, but ultimately ML and rules co-exist as first-class citizens.

And this is really just a small glimpse of what’s out there. Any industrial ML team fundamentally needs to tackle some version of the memorization/generalization dilemma. If you’re working in ML, chances are you will eventually encounter it as well.

Ultimately, solving this problem will not just enable us to build vastly superior ML systems. It will enable us to build systems that learn more like us.


