Everything You Need to Know to Get Started with NLP


Photo by Benjamin Suter on Unsplash

Stop googling around, read this first!

Lately I have been reviewing and studying the most important topics in NLP in order to face interviews with as much peace of mind as possible. The first thing I did was go over the basics of the subject so that I would be prepared for the questions I might be asked. However, I had difficulty finding all the fundamental concepts in a single article and spent a lot of time googling, so I decided to write one myself! 😜 (This article is based on some documents I cited at the end.)

  • Introduction
  • Text Pre-processing
  • Categorical word representations
  • Weighted word representations
  • Representation Learning : Non Contextual word representations
  • Representation Learning : Contextual word representations
  • Transformer-based Pre-trained Language Models
  • Downstream Learning
  • Countering Catastrophic Forgetting
  • Model Compression
  • NLP libraries

Introduction

Text data are growing at an exorbitant rate, and most of them are unstructured, which makes it difficult to derive useful information from text. Examples of this huge amount of low-quality data released daily are tweets, social media posts, and online forum comments. People write their comments using language that is not always correct, often using dialect or emojis to convey their emotions.
The main goal of the different NLP approaches is to achieve human-like text comprehension. NLP helps us examine large amounts of unstructured text and extract its main features.

Generally, in an NLP task there are standard steps we have to follow. The principal problem is that Machine Learning algorithms cannot operate on raw words, so we have to find an appropriate numeric representation of our texts. To generate this representation we have to clean the data of any noise and then perform feature extraction, that is, transform the raw text into numeric data that machines can understand.

Text Representation Pipeline (Image by Author)

Text Pre-processing

Let’s examine the most common steps of text pre-processing. Note that below I describe a list of preprocessing steps, but you don’t necessarily have to perform all of them to solve your task. On the contrary, when using transformers you will often leave the text almost unchanged to preserve the context of the sentences. So it’s up to you as an NLP practitioner to figure out what you need (a small code sketch follows the list).

  • Tokenization: tokenization is the process that splits a text into atomic units. For example, atomic units can be words, subwords, or even individual characters.
  • Removal of Noise, URLs, Hashtags and User-mentions: often a dataset is very dirty, for example because it was scraped from the internet. In many cases, symbols or useless strings in the text, like HTML tags, do not add information but only create noise.
  • Hashtag Segmentation : in text we often find words preceded by a hashtag (# is used a lot on social media). These words can be very important for understanding the topic of the text, so we should be able to identify them and remove the # symbol.
  • Replacing Emoticons and Emojis: emoticons and emojis can be displayed incorrectly; I hope I am not the only one who has had encoding problems. Sometimes it’s appropriate to remove them, but in a sentiment analysis task, for instance, they can be very useful. So it’s very much up to your discretion.
  • Replacing elongated characters: heeeeeelloooo → hello
  • Correction of Spelling mistakes: a great article about this.
  • Expanding Contractions: I’ll → I will. (Here’s how to do this)
  • Removing Punctuations
  • Removing Numbers
  • Lower-casing all words
  • Removing Stop-words : these are words that appear many times and therefore do not add much information to the text, for example words like “the”, “a”, “ok”.
  • Stemming: the technique of removing suffixes and affixes to get the root, base or stem word. Ex: eats → eat.
  • Lemmatization : the purpose of lemmatization is the same as stemming, which is to reduce words to their base or root forms. However, in lemmatization the inflexion of words is not just chopped off; lexical knowledge is used to transform words into their base forms. Ex: wolves → wolf.
  • Part of Speech (POS) Tagging : is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context.
  • Handling Negations : Negations can totally change the meaning of the sentence. Finding them explicitly could therefore greatly help your algorithm.
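
To make a few of these steps concrete, here is a minimal sketch (lower-casing, tokenization, stop-word removal, stemming and lemmatization) using NLTK. It assumes NLTK is installed and can download the listed resources; exact resource names can vary slightly between NLTK versions.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources
# (newer NLTK versions may also ask for "punkt_tab")
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "The wolves were eating quickly near the #forest!!!"

tokens = word_tokenize(text.lower())                                 # lower-casing + tokenization
tokens = [t for t in tokens if t.isalpha()]                          # drop punctuation and numbers
tokens = [t for t in tokens if t not in stopwords.words("english")]  # stop-word removal

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])          # e.g. "eating" -> "eat"
print([lemmatizer.lemmatize(t) for t in tokens])  # e.g. "wolves" -> "wolf"
```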

Categorical word representations

  • One hot encoding: this is the first and easiest method for representing text in a vector space. Each word is associated with a vector whose length equals the cardinality of the dictionary. All entries in each vector are 0, except for one position where we find a 1. So each vector is different, but the vectors are all at the same distance from one another, so no words are more similar than others.
One hot encoding (Image by Author)
  • Bag-of-Words (BoW): this is simply an extension of one-hot encoding. It adds up the one-hot representations of words in the sentence.
BOW (Image by Author)
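
As a quick illustration, here is a minimal sketch of both representations using scikit-learn’s CountVectorizer (the toy corpus is invented for the example, and scikit-learn is assumed to be installed).

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Bag-of-Words: each document becomes a vector of word counts
bow = CountVectorizer()
X = bow.fit_transform(corpus)
print(bow.get_feature_names_out())  # the vocabulary, one column per word
print(X.toarray())                  # e.g. "the" is counted twice in each sentence

# One-hot-style variant: binary=True only records presence/absence of each word
onehot = CountVectorizer(binary=True)
print(onehot.fit_transform(corpus).toarray())
```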

Weighted word representations

The previous representations do not take into account the frequency of words in the text, which, however, can be an important signal of their importance.

Term Frequency (TF): instead of using only 0 and 1, it uses the frequency of a word. The TF of a word is computed by counting how many times the word appears in a document and dividing by the total number of words in that document (so as not to penalize short documents with respect to longer ones).

Term Frequency-Inverse Document Frequency (TF-IDF) : if you think about it, words like “the” appear many times in a document but are not very informative. So a word should have a high weight if it appears many times in a document but only in a few different documents of your dataset.

TF-IDF (source : https://kinder-chen.medium.com/introduction-to-natural-language-processing-tf-idf-1507e907c19 )

Where d refers to a document, t to a term, N is the total number of documents, and df is the number of documents containing the term t.
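
A minimal sketch of TF-IDF with scikit-learn’s TfidfVectorizer is shown below. The toy corpus is invented for the example; note that scikit-learn uses a smoothed variant of IDF, so the numbers differ slightly from the textbook formula.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))  # words appearing everywhere, like "the", get low weights
```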

Representation Learning : Non Contextual word representations

The categorical representations just seen are very intuitive and easy to use but present several problems.
First, they fail to capture the syntactic and semantic meaning of words, and they also suffer from the so-called curse of dimensionality: a vector representing a word has a length equal to the size of the word vocabulary of a language. If we were working on texts in multiple languages at once, the dimensionality would grow by a huge amount!

For this reason, we now look at models that learn word representations as vectors with a fixed and limited dimensionality. The most significant benefit of these vectors, or word embeddings, is that they provide a more efficient and expressive representation: they are low-dimensional and they preserve the similarity of words that appear in similar contexts. So these vectors tend to share the attributes of their neighbours’ vectors, and they capture the similarity between words. This kind of representation is called continuous word representation.

Word2Vec : the great innovation introduced here is that word embeddings are the weights of a neural network! The model uses a shallow neural network with a single hidden layer to create a vector for each word. The word vectors learned by the Continuous Bag of Words (CBOW) and Skip-gram variants of word2vec are supposed to carry the semantic and syntactic information of words:

  • Continuous Bag of Words (CBOW)
  • Skip-gram
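
Here is a minimal sketch of training word2vec with gensim on a toy corpus (the sentences are invented for the example and far too small for meaningful embeddings; the keyword arguments follow the gensim 4.x API).

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of already-tokenized words
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "animals"],
]

# sg=0 -> CBOW, sg=1 -> Skip-gram
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"][:5])           # first 5 dimensions of the "cat" embedding
print(model.wv.most_similar("cat"))  # nearest neighbours (noisy on such a tiny corpus)
```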

Global Vectors (GloVe): the Global Vectors for Word Representation, or GloVe, algorithm is quite similar to Word2Vec. However, the method is a little different: GloVe works with word-to-word contextual information, building a word-by-word co-occurrence matrix that encodes how likely it is to find word k in the context of word j.
The main purpose of this technique is to find vector representations such that the dot product of two word vectors approximates the logarithm of their co-occurrence. GloVe embeddings give great results for relating words in context to each other.

FastText : FastText breaks a word into character n-grams instead of feeding the full word into the neural network, which lets it capture relationships between characters and pick up the semantics of words, even rare or unseen ones.
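
A minimal gensim sketch of FastText on a toy corpus follows (the data are invented; min_n and max_n are the character n-gram sizes, following the gensim 4.x API).

```python
from gensim.models import FastText

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# min_n / max_n control the character n-gram sizes used for subwords
model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=5, epochs=50)

# Thanks to subwords, even a misspelled or unseen word gets a vector
print(model.wv["catt"][:5])
```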

Representation Learning : Contextual word representations

Why do we care about context? It is easiest to explain with an example. A word can take on different meanings depending on the context in which it is placed. In the sentence “I left my phone on the left side of the table.”, the word left assumes two different meanings, and therefore should have two different representations. Creating the embedding of a word depending on its context can be a very strong boost for your ML model. Be careful, though, with pre-processing steps that might alter the context.

Generic Context word representation (Context2Vec) : this method uses an LSTM-based neural network to efficiently learn a generic context embedding function from large corpora. The main goal of this model is to learn a generic, task-independent embedding function for sentential contexts of varying lengths around target words.

Contextualized word representations Vectors (CoVe): uses a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation to contextualize word vectors. CoVe word embeddings are therefore a function of the entire input sequence. These word embeddings can then be used in downstream tasks by concatenating them with GloVe embeddings, and then feeding these in as features for the task-specific models.

Embedding from Language Models (ELMo) : the final word vectors are learned from a bidirectional language model. ELMo uses a linear combination of the representations learnt at all layers of the bidirectional language model, instead of just the final layer representations like other contextual word representations. In different sentences, ELMo provides different representations for the same word.

For an easy and good tutorial:

Universal Language Model Fine-Tuning (ULMFiT) : it relies heavily on transfer learning: a language model is trained on one corpus (a set of documents) and can then be refined on different corpora, building on what was learned in the previous step. It is an architecture and transfer learning method that can be applied to NLP tasks, and it uses a 3-layer AWD-LSTM architecture for its representations. Training consists of three steps: 1) general language model pre-training on Wikipedia-based text, 2) fine-tuning the language model on the target task, and 3) fine-tuning the classifier on the target task.

Transformer-based Pre-trained Language Models

Before you study transformer-based models, there are two resources you absolutely must read:

Generative Pre-Training – GPT (OpenAI Transformer) : the first Transformer-based pre-trained LM that can effectively capture the semantics of words in context. It is based on the decoder part of the transformer and models language auto-regressively: the model predicts the next word according to its previous context. By learning on a massive set of free text corpora, GPT extends the unsupervised LM to a much larger scale. One drawback of GPT is that it is uni-directional, i.e., the model is only trained to predict the future left-to-right context.

GPT (source : https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)

GPT-2 (OpenAI Transformer) : the OpenAI team released a scaled-up variant of GPT in 2019. It incorporates some slight improvements compared to the original, concerning the placement of layer normalisation and the residual connections. Overall, there are four distinct GPT-2 variants, with the smallest being identical to GPT, the medium one being similar in size to BERT-Large, and the extra-large one, released with 1.5B parameters, being the actual GPT-2 standard.

GPT-3 (OpenAI Transformer) : the third-generation language prediction model in the GPT-n series (it is amazing, try it!).

BERT : this model proposes a masked language modelling (MLM) objective, where some of the tokens of an input sequence are randomly masked, and the objective is to predict these masked positions taking the corrupted sequence as input. BERT applies a Transformer encoder to attend to bi-directional contexts during pre-training. In addition, BERT uses a next-sentence-prediction (NSP) objective. Given two input sentences, NSP predicts whether the second sentence is the actual next sentence of the first sentence.

BERT (source : https://arxiv.org/pdf/1810.04805.pdf)
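
To see the MLM objective in action, here is a minimal sketch using the Hugging Face transformers library and the standard bert-base-uncased checkpoint (assuming transformers is installed and the model can be downloaded).

```python
from transformers import pipeline

# A pipeline wrapping a pre-trained BERT together with its masked-language-modelling head
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts plausible tokens for the [MASK] position
for prediction in unmasker("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```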

If you want to learn how to use BERT on real tasks, check out my articles:

DistilBERT : a distilled (“compressed”) version of BERT; it reduces the size of BERT by 40% while retaining 97% of its language understanding capabilities and being 60% faster. I used DistilBERT in my articles [1] [2].

RoBERTa : makes a few changes to the released BERT model and achieves substantial improvements. The changes include: (1) Training the model longer with larger batches and more data (2) Removing the NSP objective (3) Training on longer sequences (4) Dynamically changing the masked positions during pretraining.

ALBERT : it presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT:

  • Splitting the embedding matrix into two smaller matrices.
  • Using repeating layers split among groups.

XLNet : a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation.

ELECTRA : compared to BERT, ELECTRA proposes a more effective pretraining method. Instead of corrupting some positions of the input with [MASK], ELECTRA replaces some tokens of the input with plausible alternatives sampled from a small generator network. It then trains a discriminator to predict whether each token in the corrupted input was replaced by the generator or not. The pre-trained discriminator can then be used in downstream tasks.

MASS : BERT cannot be easily used for natural language generation. MASS uses masked sequences to pretrain sequence-to-sequence models. More specifically, MASS adopts an encoder-decoder framework and extends the MLM objective. The encoder takes as input a sequence where consecutive tokens are masked and the decoder predicts these masked consecutive tokens autoregressively.

T5 : is an extremely large new neural network model that is trained on a mixture of unlabeled text and labelled data from popular natural language processing tasks, then fine-tuned individually for each of the tasks that the authors aim to solve [3].

BART : BART is a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text.[4]

Downstream Learning

Once the embeddings (vectors) of a text have been obtained, they can be used to solve various NLP tasks, called downstream tasks. Contextual embeddings have demonstrated impressive performance compared to non-contextual embeddings. But the question now is: “How can we use contextual embeddings for downstream tasks?”

Feature-based

With this method you freeze your pre-trained model, so when you solve your task the model is not trained again on your custom dataset. You only use the pre-trained model to generate features (the embeddings), which you then feed as input to, for example, a classifier. To see how to do this look at my article.
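
A minimal sketch of this approach is shown below: a frozen DistilBERT produces sentence embeddings that feed a scikit-learn classifier. The texts and labels are toy data invented for the example; transformers, torch and scikit-learn are assumed to be installed.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: in practice these come from your own dataset
texts = ["great movie", "terrible plot", "loved it", "boring and slow"]
labels = [1, 0, 1, 0]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()  # the pre-trained model stays frozen: no gradient updates

with torch.no_grad():
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # Use the first-token hidden state as a simple sentence embedding
    features = model(**batch).last_hidden_state[:, 0, :].numpy()

# Any classical classifier can be trained on top of the frozen features
clf = LogisticRegression().fit(features, labels)
print(clf.predict(features))
```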

Fine-tuning

Unlike the previous method, the pre-trained model will be trained for a few more epochs on your downstream dataset in order to fit the particular case.
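
For comparison, here is a minimal fine-tuning sketch: the same kind of pre-trained model, but now all of its weights are updated for a few epochs on the downstream data. The texts and labels are toy data invented for the example; in practice you would use a DataLoader, a validation set and so on.

```python
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

texts = ["great movie", "terrible plot", "loved it", "boring and slow"]
labels = torch.tensor([1, 0, 1, 0])

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)  # a small learning rate is typical for fine-tuning

model.train()
for epoch in range(3):  # "a few more epochs" on the downstream data
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss = {outputs.loss.item():.4f}")
```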

Adapters

Adapters are small modules inserted between the layers of a pre-trained model, in order to obtain models that can be trained in a multitasking style. The parameters of the pre-trained model are frozen while the adapters are trained.
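
Here is a minimal sketch of the adapter idea itself, a bottleneck layer with a residual connection; it only illustrates the module, not a full integration into a pre-trained transformer.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: project down, non-linearity, project up, plus a residual connection."""
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # the residual keeps the original signal

# During training, only the adapter parameters receive gradients;
# the surrounding pre-trained layers stay frozen.
adapter = Adapter(hidden_size=768)
hidden_states = torch.randn(2, 10, 768)  # (batch, sequence length, hidden size)
print(adapter(hidden_states).shape)
```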

Countering Catastrophic Forgetting

Whenever we train a pre-trained model to fit a particular downstream task, we do so to improve the performance of that model in our particular case. But changing the pre-trained parameters can lead the model to completely forget the things it had previously learned. For example, if I take a language model that understands Italian well and fine-tune it on the Sicilian dialect, the model may forget Italian altogether. Research on catastrophic forgetting is still very active, but let’s look at some methods to alleviate this effect:

  • Freezing layers : it is possible to freeze all layers, or to freeze all but the last k layers. Another method is to unfreeze and train only one layer at a time (the “chain-thaw” method). See the sketch after this list.
  • Adaptive learning rates : in NLP, as in computer vision, the lower layers are thought to capture the most general features. So a lower learning rate can be used for the first few layers.
  • Regularization : regularization (penalization of weights) limits a model’s ability to learn.
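
Below is a minimal sketch of the first two tactics using a Hugging Face model (DistilBERT is chosen just as an example; the attribute names are specific to that architecture). The two tactics are alternatives and would normally not be combined exactly as written.

```python
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# Tactic 1 -- freezing: keep the whole pre-trained encoder fixed and train only the new head.
# (Skip this block if you use the discriminative learning rates below instead.)
for param in model.distilbert.parameters():
    param.requires_grad = False

# Tactic 2 -- adaptive (discriminative) learning rates:
# earlier transformer blocks get a smaller learning rate than later ones and the head.
layers = model.distilbert.transformer.layer  # DistilBERT has 6 transformer blocks
param_groups = [
    {"params": block.parameters(), "lr": 1e-5 * (i + 1) / len(layers)}
    for i, block in enumerate(layers)
]
param_groups.append({"params": model.classifier.parameters(), "lr": 5e-5})
optimizer = AdamW(param_groups)
```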

Model Compression

To date, deep learning models have become huge, containing millions and millions of parameters. Besides requiring gigantic computational resources, these models are also harmful to the environment: it has been estimated that training a single large model can emit as much CO2e as the lifetime emissions of five average American cars. Fortunately, ways are being explored to reduce the size of these networks; let’s look at some of them:

  • Pruning : I already addressed this topic while writing my thesis, and you can read the article I wrote on the implementation of pruning in Julia. Pruning attempts to remove the less important weights from the network, thus decreasing its size while keeping performance roughly constant.
  • Knowledge distillation : this is the process of transferring knowledge from a large model to a smaller one. An example of a model and its distilled version is BERT and DistilBERT.
  • Weight quantization : the process of reducing the precision of the weights, biases, and activations so that they consume less memory (see the sketch after this list).
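
As an illustration, here is a minimal sketch of dynamic weight quantization with PyTorch, applied to a Hugging Face model chosen as an example (transformers and torch are assumed to be installed).

```python
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# Dynamic quantization: weights of nn.Linear layers are stored as 8-bit integers
# and de-quantized on the fly at inference time, reducing their memory footprint.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

print(model.classifier)      # Linear(in_features=768, out_features=2, bias=True)
print(quantized.classifier)  # DynamicQuantizedLinear(..., dtype=torch.qint8)
```
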
NLP libraries

  • CoreNLP
  • NLTK
  • Gensim
  • spaCy
  • PyTorch
  • TensorFlow

I hope that with this short guide you will not waste too much time, as I did, googling around for the foundational things to know in NLP, and can instead focus on learning this fantastic subject!

Marcello Politi

Linkedin, Twitter, CV



