Large Language Models, MirrorBERT — Transforming Models into Universal Lexical and Sentence Encoders

Discover how mirror augmentation generates training data and boosts BERT's performance on semantic similarity tasks

Introduction

It is no secret that BERT-like models play a fundamental role in modern NLP applications. Despite their phenomenal performance on downstream tasks, most of these models perform far from perfectly on specific problems without fine-tuning. Constructing embeddings from a raw pretrained model usually leads to metrics that are far from state-of-the-art results. At the same time, fine-tuning is a heavy procedure and usually requires at least thousands of annotated data samples for the model to understand the domain data well. This becomes problematic when annotated data cannot simply be collected or comes at a high price.

MirrorBERT was designed to overcome this issue. Instead of the standard fine-tuning algorithm, MirrorBERT relies on self-supervision by smartly augmenting the initial data without any external knowledge. This approach allows MirrorBERT to reach performance comparable to fully fine-tuned models on semantic similarity problems. Furthermore, by using its innovative contrastive learning technique, MirrorBERT can transform pretrained models like BERT or RoBERTa into universal lexical encoders in less than a minute!

Large Language Models: RoBERTa — A Robustly Optimized BERT Approach

With the help of the official MirrorBERT paper, we will dive into its crucial details to understand how it works under the hood. The knowledge gained here is broadly applicable, as the discussed techniques can also be used for other NLP models dealing with similarity tasks.

Methodology

To put it simply, MirrorBERT is the same BERT model, apart from several extra steps introduced in its learning process. Let us discuss each of them.

MirrorBERT learning process

1. Self-duplication

As the name suggests, MirrorBERT simply duplicates the initial data.

Self-duplication

This duplicated data is then used to further construct two different embedding representations of the same strings.

2. Data augmentation

The authors of the paper propose two intuitive techniques that slightly modify dataset texts. According to them, in the vast majority of cases these text corruptions do not change their meaning.

2.1. Input augmentation

Given a pair of strings (xᵢ, x̄ᵢ), the algorithm randomly chooses one of them and applies random span masking, which replaces a random substring of fixed length k in the text with the [MASK] token.

Input augmentation through random span masking
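To make this concrete, here is a minimal Python sketch of random span masking. It is only an illustration, not the authors' code: the function name is made up, the span is taken over raw characters following the article's "substring of fixed length k" wording (a real implementation may instead operate on subword tokens), and k = 5 is the value reported later in the training details.

```python
import random

def random_span_mask(text: str, k: int = 5) -> str:
    """Replace one random span of length k in the text with the [MASK] token.

    Illustrative sketch: the span is taken over raw characters here, while an
    actual implementation may work on the tokenizer's subword tokens instead.
    """
    if len(text) <= k:
        return "[MASK]"
    start = random.randint(0, len(text) - k)
    return text[:start] + "[MASK]" + text[start + k:]

# Self-duplication + input augmentation: mask one of the two copies at random.
x = "mirror augmentation turns BERT into a universal sentence encoder"
pair = (x, x)
pair = (random_span_mask(pair[0]), pair[1]) if random.random() < 0.5 \
       else (pair[0], random_span_mask(pair[1]))
print(pair)
```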

2.2. Feature augmentation

Random span masking operates at the sentence / phrase level. For the model to also work well on word-level tasks, another augmentation mechanism is needed, one that operates on shorter text fragments. Feature augmentation solves this problem by using dropout.

The dropout mechanism refers to randomly turning off a fraction p of the neurons in a certain network layer. This is equivalent to zeroing out the corresponding neuron activations.

The authors of the paper propose using dropout for data augmentation. When a pair of strings (xᵢ, x̄ᵢ) is passed through a network with dropout layers, their output representations will be slightly different, because the dropout layers disable a different random subset of neurons on each forward pass.

The great aspect of using dropout for feature augmentation is that dropout layers are already included in the BERT / RoBERTa architecture, meaning no additional implementation is needed!

While random span masking is applied to only one string in each pair, dropout is applied to all of them.
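Below is a minimal PyTorch sketch of feature augmentation through dropout. The tiny encoder is a made-up stand-in for BERT / RoBERTa (only its dropout layer matters here); it is kept in training mode so that each forward pass applies a different dropout mask, producing two slightly different views of the same input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a BERT-like encoder; the only relevant property is that it
# contains a dropout layer, just like the real BERT / RoBERTa architecture.
encoder = nn.Sequential(
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Dropout(p=0.1),   # p = 0.1, the value reported in the training details
    nn.Linear(64, 32),
)
encoder.train()          # keep dropout active, as during fine-tuning

x = torch.randn(1, 64)   # stands in for the embedding of one input string

view_1 = encoder(x)      # first forward pass: one random dropout mask
view_2 = encoder(x)      # second forward pass: a different dropout mask

# The two representations of the same input differ slightly, which is exactly
# what gives us a "free" positive pair for contrastive learning.
print(nn.functional.cosine_similarity(view_1, view_2))
```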

3. Contrastive learning

Contrastive learning is a machine learning technique that learns data representations in such a way that similar objects lie close to each other in the embedding space while dissimilar objects lie far away from each other.

One way to implement contrastive learning is to use a contrastive loss function. The one chosen for MirrorBERT is InfoNCELoss. Let us understand how it works.

InfoNCELoss

At first sight, the formula for InfoNCELoss might look intimidating, so let us gradually come to it step by step.

  1. The cosine similarity between two vectors measures how closely they align with each other, taking values in the range from -1 to 1, with greater values indicating higher similarity.
Cosine similarity between two vectors
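Since the original formula is shown as an image, here is a LaTeX reconstruction of the standard cosine similarity between the embeddings of two strings (the notation f(x) for the encoder's output for string x is mine):

\[
\cos\big(f(x_i), f(x_j)\big) = \frac{f(x_i) \cdot f(x_j)}{\lVert f(x_i) \rVert \, \lVert f(x_j) \rVert} \in [-1, 1]
\]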

2. To better understand the next steps, it is necessary to know that InfoNCELoss uses a softmax transformation with a temperature parameter T controlling the smoothness of the output softmax distribution. That is why the similarities are divided by T.

For more information about softmax temperature, refer to this article explaining it in more detail.

Cosine similarity divided by the temperature

3. As in the standard softmax formula, the similarity is then exponentiated.

Exponent of cosine similarity
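Reconstructed from steps 2 and 3 above (same notation as before), the exponent of the temperature-scaled similarity looks like this:

\[
\exp\!\left( \frac{\cos\big(f(x_i), f(x_j)\big)}{T} \right)
\]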

4. In the normal softmax formula, the numerator contains the exponent of one class logit, whereas the denominator is the sum of the exponents of all class logits. In the case of similarities in InfoNCELoss, the formula analogously follows this logic:

  • The numerator contains the exponentiated similarity of the two slightly modified identical strings (xᵢ, x̄ᵢ), which can be thought of as a positive example.
  • The denominator consists of the sum of exponentiated similarities between xᵢ and all other dataset strings xⱼ, which can be seen as the set of all negative examples.
Softmax formula for cosine similarity. Nᵢ denotes all dataset strings except for xᵢ and x̄ᵢ.
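A LaTeX reconstruction of this softmax-style ratio, following the description above (Nᵢ contains only the negatives for xᵢ, as in the caption; some formulations also keep the positive term in the denominator):

\[
\frac{\exp\!\big(\cos(f(x_i), f(\bar{x}_i)) / T\big)}{\sum_{x_j \in N_i} \exp\!\big(\cos(f(x_i), f(x_j)) / T\big)}
\]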

5. In the ideal scenario, we want the similarity between the identical strings (xᵢ, x̄ᵢ) to be high and the similarity of xᵢ with other strings xⱼ to be low. If this is the case, the numerator in the formula above increases while the denominator decreases, making the whole expression larger.

Loss functions work the other way around: in ideal cases they take small values, and in bad situations they heavily penalise the model. To make the formula above compatible with this principle, let us wrap the whole expression in a negative logarithm.

Negative log of softmax similarities. This expression can be viewed as a loss value for a single string xᵢ.
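With the negative logarithm added, the per-string loss from the reconstruction above can be written as:

\[
\ell_i = -\log \frac{\exp\!\big(\cos(f(x_i), f(\bar{x}_i)) / T\big)}{\sum_{x_j \in N_i} \exp\!\big(\cos(f(x_i), f(x_j)) / T\big)}
\]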

6. The expression in the previous step already corresponds to a loss value for a single string xᵢ. Since the dataset consists of many strings, we need to take all of them into account. For that, let us sum up this expression for all the strings.

InfoNCELoss
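Summing the per-string losses over all N strings gives the full loss, reconstructed here in the same notation:

\[
\mathcal{L} = \sum_{i=1}^{N} \ell_i = -\sum_{i=1}^{N} \log \frac{\exp\!\big(\cos(f(x_i), f(\bar{x}_i)) / T\big)}{\sum_{x_j \in N_i} \exp\!\big(\cos(f(x_i), f(x_j)) / T\big)}
\]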

The obtained formula is exactly the InfoNCELoss!

InfoNCELoss tries to group similar objects close to each other while pushing away the dissimilar ones in the embedding space.
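For readers who prefer code, here is a compact PyTorch sketch of an InfoNCE-style loss over a batch of (xᵢ, x̄ᵢ) embedding pairs. It is a common implementation pattern rather than the authors' exact code: the function name is made up, and for numerical convenience it is written with cross-entropy, so the positive term also appears in the denominator.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.04) -> torch.Tensor:
    """InfoNCE-style loss for a batch of positive pairs.

    z1[i] and z2[i] are the two embeddings (views) of the same string x_i;
    every other embedding in the batch acts as a negative for x_i.
    """
    z = torch.cat([z1, z2], dim=0)                 # (2N, d)
    z = F.normalize(z, dim=-1)                     # cosine similarity becomes a dot product
    sim = z @ z.T / temperature                    # (2N, 2N) similarity matrix
    sim.fill_diagonal_(float("-inf"))              # never contrast a string with itself

    n = z1.size(0)
    # For row i, the positive sits at column i + n (and vice versa).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Usage with random embeddings standing in for encoder outputs:
z1, z2 = torch.randn(8, 768), torch.randn(8, 768)
print(info_nce_loss(z1, z2))
```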

Triplet loss used in SBERT is another example of contrastive learning loss.

Large Language Models: SBERT — Sentence-BERT

Training resources

A surprising fact about MirrorBERT is that it does not require a lot of data to be fine-tuned. Furthermore, this data does not have to be external as the whole training process is self-supervised.

The researchers report that for fine-tuning lexical representations, they use only the 10k most frequent words in each language. For sentence-level tasks, 10k sentences are used.

Training details

The details on the MirrorBERT training are listed below:

  • The temperature is set to T = 0.04 in sentence-level tasks and to T = 0.2 in word-level tasks.
  • In random span masking, k is set to 5.
  • Dropout is set to p = 0.1.
  • AdamW optimizer is used with a learning rate of 2e-5.
  • The batch size is set to 200 (or 400 with duplicates).
  • Lexical models are trained for 2 epochs and sentence-level models are trained for a single epoch.
  • Instead of mean pooling over all output token representations, the [CLS] token representation is used.

A single MirrorBERT training epoch takes only 10–20 seconds.
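To tie the hyperparameters above together, here is a hedged sketch of what a mirror fine-tuning loop could look like. It uses the Hugging Face transformers API for the encoder and reuses the hypothetical random_span_mask and info_nce_loss helpers sketched earlier; the placeholder sentence list and the overall structure are simplified assumptions, not the authors' training script.

```python
import random
import torch
from torch.utils.data import DataLoader
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)
encoder.train()                                      # dropout stays active
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)

sentences = ["replace me with ~10k real sentences"] * 10_000   # placeholder data
loader = DataLoader(sentences, batch_size=200, shuffle=True)

def embed(texts):
    batch = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]  # [CLS] token representation

for epoch in range(1):                               # one epoch for sentence-level models
    for batch_texts in loader:
        # Self-duplication + input augmentation: mask a random span in one copy.
        view_1, view_2 = [], []
        for x in batch_texts:
            a, b = x, x
            if random.random() < 0.5:
                a = random_span_mask(a)              # hypothetical helper from the sketch above
            else:
                b = random_span_mask(b)
            view_1.append(a)
            view_2.append(b)
        # Feature augmentation happens implicitly: dropout differs per forward pass.
        loss = info_nce_loss(embed(view_1), embed(view_2), temperature=0.04)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```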

Evaluations

The authors evaluated mirror fine-tuning on a set of benchmarks. The results were reported on three types of tasks: lexical, sentence-level and cross-lingual. In each of them, MirrorBERT demonstrated performance comparable to other fine-tuned BERT-like models.

The results also showed that a range of 10k to 20k training examples is optimal for fine-tuning: with more training examples, the performance of the model gradually decreases.

Conclusion

Mirror fine-tuning acts like a magic spell: instead of a heavy fine-tuning procedure, the mirror framework requires far less time and no external data, while staying on par on semantic similarity tasks with other fine-tuned models like BERT, SBERT or RoBERTa.

As a result, MirrorBERT can transform BERT-like pretrained models into universal encoders that capture linguistic knowledge with high efficiency.

Resources

All images unless noted are by the author

