
Generating State-of-the-Art Text Embeddings with Hardware Accessible by Everyone | by Jan Schmitz | Jul, 2022



OpenAI GPT-3 — Do I really need it to generate state-of-the-art text embeddings?

Source: Created by the author

In this article, I'll demonstrate that large language models such as GPT-3 do not generate the best dense text embeddings for many NLP (natural language processing) tasks, and show how everyone can generate state-of-the-art embeddings with widely available hardware.

Dense text embeddings are vector representations of text that encode the meaning of words: words that are close in meaning end up close together in the vector space. These embeddings are extremely useful for complex tasks such as comparing the similarity of textual content, semantic search, paraphrase mining, etc. In the past years, language models have grown exponentially in size, making it virtually impossible for many practitioners to experiment with the latest state-of-the-art models due to resource limitations. I might have a solution for you 😉.

Source: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter — https://arxiv.org/pdf/1910.01108.pdf

The constraint most of us face is that we lack access to, or the financial means to deploy, large GPU/TPU clusters to run large language models such as T5, MPNet or GPT-3. To overcome these limitations and still obtain state-of-the-art text embeddings, I spent a lot of time on the following two steps, which I'd like to share with you:

  1. Compare and rank all relevant language models based on their performance and number of model parameters, to find the most effective pre-trained model.
  2. Apply and compare how different fine-tuning techniques can improve the performance of a pre-trained model.

Of course, I have my own reason to generate high-quality text embeddings. My objective is to use them as input for my own models that support my financial asset selection. Since I have received a lot of requests to make my models public, I initiated a project called PinkLion.

I designed PinkLion for on-the-fly portfolio optimisation and asset analytics by providing access to the underlying prediction models. The models have access to daily asset data for thousands of stocks, funds/ETFs, and cryptocurrencies.

Feel free to give it a try and please share feedback 🙏. (It is still in a rough state) www.pinklion.xyz

As an initial step, we have to obtain a good overview of what language models are out there, how they perform and how big they are. Hence, the objective is to compare different language models based on various benchmark tasks and determine how well they perform in relation to the number of their parameters.

Definition of model performance — GLUE and SuperGLUE

While reading through many research papers, it became surprisingly evident that fully transparent model evaluations across multiple benchmark tasks such as SentEval [1], GLUE [2], SQuAD [3], RACE [4], or SWAG [5] are rare. After looking further into the topic, I concluded that GLUE (the General Language Understanding Evaluation) [2] is the most widely used benchmark for evaluating language model performance.

GLUE is a collection of tools and datasets to evaluate the performance of a model across a set of nine natural language tasks. These nine tasks focus on the areas of text similarity and paraphrases, classification, and inference.

Equipped with this conclusion, I started to gather the GLUE evaluation results for all relevant language models. Even though GLUE is widely used, it was difficult to obtain all relevant scores, especially for smaller model versions. All scores in this article were obtained, in order of preference, from the GLUE leaderboard, the model's research paper, or the GitHub page of the respective model.

For everyone interested in the details of the GLUE [2] benchmark, here are the task names, acronyms, descriptions, and metrics that comprise it. If you are not interested, skip to the next section, Model comparison and ranking.

  • Corpus of Linguistic Acceptability (CoLA): Determine if a sentence is grammatically correct or not. — Metric: Matthews correlation
  • Stanford Sentiment Treebank (SST-2): Determine if the sentence has a positive or negative sentiment. — Metric: Accuracy
  • Microsoft Research Paraphrase Corpus (MRPC): Determine if two sentences are paraphrases from one another or not. — Metrics: Accuracy/F1
  • Semantic Textual Similarity Benchmark (STS-B): Determine the similarity of two sentences with a score from 1 to 5. — Metrics: Pearson correlation/Spearman correlation
  • Question-answering Natural Language Inference (QNLI): Determine if the answer to a question is in the second sentence or not. (This dataset is built from the SQuAD dataset.) — Metric: Accuracy
  • Quora Question Pairs (QQP): Determine if two questions are semantically equivalent or not. — Metrics: Accuracy/F1
  • Multi-Genre Natural Language Inference (MNLI): Determine if a sentence entails, contradicts, or is unrelated to a given hypothesis. (This dataset has two versions: one where the validation and test sets come from the same distribution, and another, called mismatched, where they are out-of-domain.) — Metric: Accuracy
  • Recognizing Textual Entailment (RTE): Determine if a sentence entails a given hypothesis or not. — Metric: Accuracy
  • Winograd Natural Language Inference (WNLI): Determine if a sentence with an anonymous pronoun and a sentence with this pronoun replaced are entailed or not. (This dataset is built from the Winograd Schema Challenge dataset.) — Metric: Accuracy

The final GLUE score is the average across all tasks. For tasks with two tracking metrics, such as MRPC and STS-B, the two metrics are averaged first and then counted as a single task score.

Source: GLUE task descriptions — Table can be found here — Created by author
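To make the aggregation concrete, here is a minimal Python sketch with placeholder per-task numbers (not real results from any model): two-metric tasks are collapsed to their mean before the overall average is taken.

```python
# Placeholder per-task scores (illustrative numbers, not real model results).
task_scores = {
    "CoLA":  [60.5],         # Matthews correlation
    "SST-2": [94.9],         # accuracy
    "MRPC":  [85.4, 89.3],   # accuracy / F1 -> averaged first
    "STS-B": [87.6, 86.5],   # Pearson / Spearman -> averaged first
    "QQP":   [71.2, 89.2],   # accuracy / F1 -> averaged first
    "QNLI":  [92.7],
    "MNLI":  [86.7],
    "RTE":   [70.1],
    "WNLI":  [65.1],
}

# Collapse two-metric tasks to their mean, then average over all nine tasks.
per_task = {task: sum(scores) / len(scores) for task, scores in task_scores.items()}
glue_score = sum(per_task.values()) / len(per_task)
print(f"GLUE score: {glue_score:.1f}")
```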

Model comparison and ranking

Having defined how we measure model performance, here are the collected insights for different language model families. The table below shows, for each model, its number of parameters, its size in terms of memory storage (these numbers were obtained from TensorFlow Hub or the GitHub page of the respective model), and the GLUE score described above. The final column, Model Performance/Size Ratio, shows a ratio calculated from a model's GLUE score and its number of parameters.

Model Performance/Size Ratio = GLUE Score / Number of model parameters (in millions)

Described in words, the ratio is higher for a model if the model performs better on the GLUE benchmark with fewer parameters. Thus, the higher the number in the Model Performance/Size Ratio column the better.
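As a quick sanity check, the ratio can be recomputed for two models whose figures appear later in this article; small deviations from the table come from rounding of the parameter counts.

```python
# Performance/size ratio = GLUE score / number of parameters (in millions).
models = {
    "T5-small":    {"glue": 77.4, "params_millions": 60},
    "ALBERT-base": {"glue": 83.9, "params_millions": 12},
}

for name, m in models.items():
    ratio = m["glue"] / m["params_millions"]
    print(f"{name}: {ratio:.2f}")   # T5-small: 1.29, ALBERT-base: ~6.99
```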

Source: Language model comparison — Full evaluation here — Created by author

From an overall perspective, it is evident that the performance/size ratio drops rapidly as model size increases. In other words, each added parameter has strongly diminishing returns. This becomes very obvious when looking at the T5 model family published by Google [11].

Source: T5 family comparison — Full evaluation here — Created by author

The smallest T5 model, T5-small, has 60 million parameters, a GLUE score of 77.4, and consequently a performance/size ratio of 1.29. In comparison, the largest T5 model, T5-11B, has a GLUE score of 90.3, 12.9 points more than T5-small.

To achieve a score increase of 12.9 points, or 16.6%, the model size had to grow from 60 million to 11 billion parameters, an increase of roughly 18300% 🤯.

Right now you might be wondering why GPT-3 did not show up in the table above. The reason is that GPT-3 is not officially evaluated on the regular GLUE benchmark, but rather on the SuperGLUE [6] benchmark.

SuperGLUE is a newer benchmark styled after GLUE, with a new set of more difficult language understanding tasks.

Luckily, the SuperGLUE leaderboard lists GPT-3 as well as multiple models from the regular GLUE comparison (BERT-large, RoBERTa-large, T5-11B). Comparing BERT-base with GPT-3, GPT-3 scores only 2.8 points higher on SuperGLUE (69.0 → 71.8), even though it has 1605x more parameters. Similarly, RoBERTa-large and T5-11B have fewer parameters than GPT-3 but achieve even higher scores, 84.6 and 89.3 respectively.

Source: SuperGLUE comparison — Full evaluation here — Created by author

Equipped with the knowledge that truly massive models such as GPT-3 are outperformed on SuperGLUE by other models (RoBERTa-large and T5-11B), we can reasonably assume that GPT-3 would also be outperformed by these models on the regular GLUE benchmark.

Coming back to our regular GLUE comparison, we can see that the language model family with the best performance/size ratio is the ALBERT (A Lite BERT) family.

Source: ALBERT family comparison — Full evaluation here — Created by author

The ALBERT-base model achieves a performance/size ratio of 6.986, the highest among all tested language models. The high ratio is explained by a very good GLUE score of 83.9 combined with ALBERT-base being the smallest of the tested models, with only 12 million parameters. Hence, ALBERT-base provides the best performance per deployed parameter. Thanks to the small number of parameters, ALBERT-base and ALBERT-large can easily be trained on a K80 or P100 GPU.

Conclusion of model comparison

The best model family to consider when generating dense text embeddings with limited resources is the ALBERT family. ALBERT-base and ALBERT-large have the best performance/size ratios among all the evaluated language models. They have 12 and 18 million parameters and GLUE scores of 83.8 and 85.7, respectively. This makes them among the smallest evaluated models while providing performance comparable to the T5 models, which are all significantly larger. In addition, GPT-3 does not perform well at all on the SuperGLUE benchmark and should not be the model of choice unless the downstream task is text generation related. The full evaluation can be found here.

In the following sections, I’ll be showing how different fine-tuning techniques can be used to further increase the performance of ALBERT-base.

After selecting ALBERT-base as our language model of choice, we now want to fine-tune the generically trained model and adapt it to our own application. There are many approaches to fine-tuning a model to generate text embeddings. However, different fine-tuning approaches lead to very different qualitative results. Since comparing different fine-tuning techniques requires substantial effort, I'd like to share my own results to make it easier for everyone else in the future.

In this section, we will be comparing three different fine-tuned ALBERTs which have been all altered with different fine-tuning techniques. Here is an initial overview of the different approaches and subsequent results.

Source: Comparison of ALBERT-base fine-tuning architectures — Full evaluation here — Created by author
  1. Frozen ALBERT: The Frozen ALBERT consists of a regular pre-trained ALBERT-base model, which is frozen and hence does not update its parameters during fine-tuning. On top of it sits a trainable classification head, consisting of a dense layer and a softmax classification layer. This model serves as our baseline, since it represents how ALBERT-base performs without significant fine-tuning.
  2. Trained ALBERT: The Trained ALBERT has exactly the same architecture as the Frozen ALBERT, but all components are fully trainable and fine-tuneable. This means the ALBERT-base component and the classification head (dense layer + softmax layer) can fully adjust their weights to the classification fine-tuning task. (A minimal sketch of both the frozen and the trained setup follows after this list.)
  3. Siamese ALBERT: The Siamese ALBERT is inspired by the general siamese framework [7], which in our case comprises three identical ALBERT-base networks with shared weights. In addition to the three sub-networks, a classification head represented by a softmax layer is added for the classification fine-tuning task.
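To make the difference between the first two setups concrete, here is a minimal Keras sketch; the TF Hub handle, its dict-based signature, the sequence length, and the 256-unit hidden layer are assumptions, not the article's exact configuration.

```python
import tensorflow as tf
import tensorflow_hub as hub

NUM_CLASSES = 41   # HuffPost news categories
SEQ_LEN = 64       # illustrative maximum headline length

def build_classifier(trainable_encoder: bool) -> tf.keras.Model:
    """Frozen ALBERT: trainable_encoder=False; Trained ALBERT: trainable_encoder=True."""
    word_ids = tf.keras.Input(shape=(SEQ_LEN,), dtype=tf.int32, name="input_word_ids")
    mask     = tf.keras.Input(shape=(SEQ_LEN,), dtype=tf.int32, name="input_mask")
    type_ids = tf.keras.Input(shape=(SEQ_LEN,), dtype=tf.int32, name="input_type_ids")

    # Pre-trained ALBERT-base encoder from TF Hub (handle and signature assumed).
    encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/albert_en_base/2",
                             trainable=trainable_encoder)
    outputs = encoder({"input_word_ids": word_ids,
                       "input_mask": mask,
                       "input_type_ids": type_ids})

    # Classification head: dense layer + softmax over the 41 categories
    # (the 256-unit hidden size is an assumption).
    hidden = tf.keras.layers.Dense(256, activation="relu")(outputs["pooled_output"])
    probs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(hidden)
    return tf.keras.Model(inputs=[word_ids, mask, type_ids], outputs=probs)

frozen_albert  = build_classifier(trainable_encoder=False)   # baseline
trained_albert = build_classifier(trainable_encoder=True)    # fully fine-tuned
```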

Dataset

To see how the different fine-tuning approaches compare in a real-world scenario, I selected the HuffPost News Classification dataset [8] as a text corpus. The dataset contains 200853 news records with headlines and descriptions collected from HuffPost between 2012 and 2018, distributed across 41 unique news categories. Thus, this dataset is a perfect fit for seeing how well we can fine-tune a model for a classification task. Here is a sample of the dataset.

Source: HuffPost News classification data sample — Created by author

The objective of our fine-tuning task will be to predict the news category based on a given news headline.

Data preparation

Since we only want to use the news headlines as input to predict the news categories, we first obtain a better overview of the dataset by calculating and plotting the distribution of character and word counts per news headline.

For the word counts of the news headlines, the minimum is 0, the maximum is 44, and the average headline has 9 words. However, the distribution shows that the average is pulled up by a tail of headlines with a large number of words.

The distribution of character counts of the news headlines shows a similar picture, where a significant number of headlines have a large number of characters. For the character counts, 0 is the minimum count, 320 is the max count, and on average a headline has 57 characters.

Source: HuffPost News headline probability distribution of words and characters — Created by author
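For reference, here is a minimal pandas sketch of how these word and character statistics can be computed; the file name assumes the Kaggle release of the dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# The file name assumes the Kaggle release; adjust the path to your local copy.
df = pd.read_json("News_Category_Dataset_v2.json", lines=True)

word_counts = df["headline"].str.split().str.len()
char_counts = df["headline"].str.len()

print(word_counts.min(), word_counts.mean(), word_counts.max())   # word count stats
print(char_counts.min(), char_counts.mean(), char_counts.max())   # character count stats

# Plot both distributions side by side.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
word_counts.plot.hist(bins=45, density=True, ax=ax1, title="Words per headline")
char_counts.plot.hist(bins=60, density=True, ax=ax2, title="Characters per headline")
plt.show()
```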

To increase the quality of our input data, we remove all news records whose headline has fewer than 10 words. Applying this condition reduces the number of records from 200853 to 189339.

After dropping all records with headlines of fewer than 10 words and applying an 80/20 training/validation split, the category distribution looks as follows.

Source: HuffPost Training and validation split news categories — Created by author
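The filtering and the 80/20 split described above could look roughly like this; stratifying the split by category is my assumption, the article only specifies the ratio.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_json("News_Category_Dataset_v2.json", lines=True)  # path is illustrative

# Keep only records whose headline has at least 10 words.
filtered = df[df["headline"].str.split().str.len() >= 10]

# 80/20 training/validation split; stratifying by category is an assumption.
train_df, val_df = train_test_split(
    filtered,
    test_size=0.2,
    stratify=filtered["category"],
    random_state=42,
)
print(len(filtered), len(train_df), len(val_df))
```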

It is now obvious that we have a high class imbalance, with Politics being the dominant class. This bias will be corrected during the training process by assigning different weights to each class (more on this later).
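Here is a minimal sketch of that class-weight correction, assuming scikit-learn's balanced weighting and integer-encoded labels; it continues from the split sketch above.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# train_df comes from the split sketch above; labels are the category names.
labels = train_df["category"].to_numpy()
classes = np.unique(labels)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)

# Map class index -> weight, assuming the labels are integer-encoded in the order
# given by np.unique (rare classes get weights > 1, frequent ones < 1).
class_weight = {i: w for i, w in enumerate(weights)}

# Later passed to the classification fine-tuning: model.fit(..., class_weight=class_weight)
```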

Comparing fine-tuning techniques

After training our three models (Frozen ALBERT, Trained ALBERT, Siamese ALBERT) on the 160682-record training dataset to classify news categories, we can now look at the model performances in more detail.

All metrics below are derived from the validation dataset. From an overall perspective, they show that the Siamese ALBERT performs best, followed by the Trained ALBERT and then the Frozen ALBERT.

The Siamese ALBERT performs the best on almost all metrics

Comparing first the Frozen ALBERT, which uses the unchanged weights of the ALBERT-base network, with the Trained ALBERT, whose weights were fine-tuned during the HuffPost training process, we can already see a significant performance boost. The Trained ALBERT achieves an accuracy of 58%, an F1 score of 60%, and an AUC-micro value of 0.996, whereas the Frozen ALBERT achieves 32%, 35%, and 0.905, respectively.

Source: Comparison of ALBERT-base fine-tuning metrics — Full evaluation here — Created by author

Comparing the Siamese ALBERT in a second step, we can see that the siamese version outperforms the other two models on almost every metric. The Siamese ALBERT reaches an accuracy of 63% on the validation dataset, almost twice the accuracy of the Frozen ALBERT and also higher than that of the Trained ALBERT. Summing up, the Siamese ALBERT has an accuracy of 63%, an F1 score of 64%, and an AUC-micro value of 0.946, the latter being slightly lower than the equivalent score of the Trained ALBERT approach.

Since our initial objective was to create the best text embeddings possible, here are the embeddings generated by the Siamese ALBERT. As we can see, we obtain distinct clusters for the different news headline categories. One interesting characteristic of the siamese embeddings is that they do not just form clusters for distinct classes: classes of news headlines with similar content, such as The Worldpost, Worldpost, and World News, are also located close to each other in the embedding space.

Source: Siamese ALBERT news headline embeddings — Created by author
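For orientation, here is a hedged sketch of how such an embedding plot can be produced once the siamese encoder from the implementation section below is trained; the names siamese_model, val_inputs, and val_labels are hypothetical, and t-SNE is just one possible choice for the 2D projection.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# siamese_model, val_inputs and val_labels are hypothetical names for the trained
# encoder and the tokenised validation data (with integer-encoded category labels).
embeddings = siamese_model.predict(val_inputs)                  # shape: (n_samples, 64)
coords = TSNE(n_components=2, random_state=42).fit_transform(embeddings)

plt.figure(figsize=(8, 8))
plt.scatter(coords[:, 0], coords[:, 1], c=val_labels, s=3, cmap="tab20")
plt.title("Siamese ALBERT headline embeddings (2D projection)")
plt.show()
```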

The entire code for the different evaluation steps and embedding generations is made available on my Github.

The Siamese ALBERT presented above is the best approach for producing high-quality text embeddings under hardware restrictions. Hence, I'd like to also share how the Siamese ALBERT works, how one can implement the network, and how to reproduce the results from above.

Let's first have a look at a high-level representation of the architecture hidden behind the single Siamese ALBERT box. The Siamese ALBERT consists of three ALBERT-base sub-networks which share their weights.

Source: Construction of Siamese ALBERT network — Code — Created by author

The input to the siamese system is a triplet of text headlines and news categories. A triplet consists of two headlines from the same class (e.g. World News) and one headline from a different class (e.g. Politics). The headlines are passed through the three sub-networks, yielding three outputs that are fed to the Triplet-Semi-Hard-Loss function. Within the embedding space, the loss pulls the two inputs of the same class closer together and simultaneously pushes away the third input, which belongs to a different class. Even though the triplet siamese architecture appears to have three networks, there is effectively only one network whose weights are shared across the sub-networks.

Triplet Loss

The triplet loss was introduced in the FaceNet research paper in 2015 [9] as a scalable approach to tasks such as face recognition, which are characterised by a high number of classes and a low number of samples per class.

Source: Triplet loss embedding illustration — Created by author

In general, a triplet-loss siamese architecture can be used for very different purposes, such as text and image similarity tasks. This is because it directly learns a mapping from its inputs to a compact Euclidean space in which distances correspond to a measure of similarity. In other words, the triplet loss minimizes the distance between an anchor and a positive and maximizes the distance between the anchor and a negative. The anchor and the positive belong to the same class, whereas the negative is sampled from a different class. The triplet loss itself is based on Euclidean distance.
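Written out as code, the loss for a batch of (anchor, positive, negative) embeddings looks roughly like this; the margin value is illustrative.

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(||a - p||^2 - ||a - n||^2 + margin, 0), computed per triplet."""
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    return tf.maximum(pos_dist - neg_dist + margin, 0.0)
```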

Source: Triplet loss illustration — https://www.tensorflow.org/addons/tutorials/losses_triplet

Siamese ALBERT implementation

But before we get lost in theory, let's start with the implementation. As always, the first step is to load the dataset, in our case the HuffPost dataset, and then clean and prepare the data so that it can be served to our network. These steps were briefly described in an earlier section; the code for the data preparation, together with the full implementation, can be found here.

After cleaning and preparing our data, we load the ALBERT tokenizer and set up a data generator function to have a flexible implementation that serves our data as batches to the siamese model.
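Here is a hedged sketch of this step, using the Hugging Face AlbertTokenizer as a stand-in for whichever ALBERT tokenizer the original code loads; the batch size and sequence length are assumptions.

```python
import numpy as np
from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")   # stand-in tokenizer
SEQ_LEN = 64                                                    # illustrative max length

def batch_generator(headlines, labels, batch_size=32):
    """Yield tokenised batches in the (word_ids, mask, type_ids) format used below.

    labels: integer-encoded category ids (required by the triplet loss used later).
    """
    while True:
        order = np.random.permutation(len(headlines))
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            enc = tokenizer([headlines[i] for i in idx],
                            padding="max_length", truncation=True,
                            max_length=SEQ_LEN, return_tensors="np")
            inputs = (enc["input_ids"], enc["attention_mask"], enc["token_type_ids"])
            yield inputs, np.array([labels[i] for i in idx])
```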

Next, we load the pre-trained ALBERT-base model from TensorFlow Hub. In this example the Siamese ALBERT network takes three inputs (word_ids, mask, type_ids). These are not the anchor, positive, and negative of the triplet loss from before, but the normal inputs that an ALBERT model requires. After the three parallel inputs, the loaded ALBERT-base network is incorporated into the Siamese ALBERT as a single layer, from which we extract the pooled_output weights and feed them into a 64-dimensional dense layer. The dense layer is followed by an L2 normalization layer; this final L2 layer is crucial for training success, as the network otherwise has unstable training iterations and issues converging.
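A minimal sketch of the encoder just described, assuming the same TF Hub handle and dict-based signature as before: pooled_output feeds a 64-dimensional dense layer followed by L2 normalisation.

```python
import tensorflow as tf
import tensorflow_hub as hub

SEQ_LEN = 64   # must match the tokenizer settings above

word_ids = tf.keras.Input(shape=(SEQ_LEN,), dtype=tf.int32, name="input_word_ids")
mask     = tf.keras.Input(shape=(SEQ_LEN,), dtype=tf.int32, name="input_mask")
type_ids = tf.keras.Input(shape=(SEQ_LEN,), dtype=tf.int32, name="input_type_ids")

albert = hub.KerasLayer("https://tfhub.dev/tensorflow/albert_en_base/2",
                        trainable=True, name="albert_encoder")
pooled = albert({"input_word_ids": word_ids,
                 "input_mask": mask,
                 "input_type_ids": type_ids})["pooled_output"]

# 64-dimensional embedding, L2-normalised so that distances lie on the unit sphere.
embedding = tf.keras.layers.Dense(64)(pooled)
embedding = tf.keras.layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=1),
                                   name="l2_norm")(embedding)

siamese_model = tf.keras.Model(inputs=[word_ids, mask, type_ids], outputs=embedding)
```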

You might be wondering now: why are there no three separate sub-networks or input streams that supply the network with an anchor, a positive, and a negative example?

Indeed, the implementation I'm sharing does not require configuring a network architecture that consists of three sub-networks. When defining the loss function of our network, we use the TensorFlow Addons Triplet-Semi-Hard-Loss [10] implementation, which takes care of this in the background. The loss function finds triplets within the ingested batches. This process is known as online triplet mining and ensures that we only train (in the case of our semi-hard loss function) on semi-hard triplets.

Furthermore, we define a callback function to save our training runs. Lastly, we initiate training with model.fit(). These are all the steps needed to train a reasonable version of a siamese network. The full code is accessible here.
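Putting the last steps together, here is a hedged sketch of the training setup: the TensorFlow Addons TripletSemiHardLoss [10] mines semi-hard triplets within each batch, a ModelCheckpoint callback saves the runs, and model.fit launches training. The optimizer, learning rate, batch size, epoch count, and the names train_headlines/train_labels are assumptions.

```python
import tensorflow as tf
import tensorflow_addons as tfa

siamese_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),   # learning rate assumed
    loss=tfa.losses.TripletSemiHardLoss(),                    # online semi-hard triplet mining
)

# Checkpoint callback to save the training runs (path pattern is illustrative).
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "checkpoints/siamese_albert_{epoch:02d}", save_weights_only=True)

BATCH_SIZE = 32
# train_headlines / train_labels: hypothetical names for the prepared training data,
# fed through the batch_generator from the tokenizer sketch above.
train_gen = batch_generator(train_headlines, train_labels, BATCH_SIZE)

siamese_model.fit(
    train_gen,
    steps_per_epoch=len(train_headlines) // BATCH_SIZE,
    epochs=3,                                                 # epoch count assumed
    callbacks=[checkpoint],
)
```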

When facing hardware limitations, ALBERT-base and ALBERT-large are excellent language models for generating high-quality text embeddings. Both have among the best performance/size ratios of all currently available pre-trained language models. GPT-3 is not a state-of-the-art model for many language tasks and should therefore not be considered unless the application is language generation.

When fine-tuning to further improve a model's performance, the siamese architecture yields a boost across various metrics compared to the classical approach of simply adding a fine-tuning head on top of a pre-trained network.

I hope that my findings might help someone in the future to set up their own projects and products.

