
Fine-tune a Large Language Model with Python | by Marcello Politi | Apr, 2023



Photo by Manouchehr Hejazi on Unsplash

Learn how to fine-tune BERT from scratch on a custom dataset

In this article, we will walk through fine-tuning BERT for sentiment classification using PyTorch. BERT is a large language model that offers a good balance between popularity and model size, and it can be fine-tuned on a single GPU. We can download a pre-trained BERT from Hugging Face (HF), so there is no need to train it from scratch. In particular, we will use the distilled (smaller) version of BERT, called DistilBERT.

DistilBERT is widely used in production since it has 40% fewer parameters than bert-base-uncased, runs 60% faster, and retains over 95% of BERT’s performance on the GLUE language understanding benchmark.

We start by installing all the necessary libraries. The first line is to capture the output of the installation and keep your notebook clean.
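A minimal version of this install cell might look like the following (the exact package list is an assumption; you may need additional libraries such as pandas):

```python
%%capture
!pip install torch transformers pandas
```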

I will use Deepnote to run the code in this article, but you can also use Google Colab if you prefer.

You can also check the version of the libraries you are using with the following line of code.
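A simple sketch of such a check, just printing the installed versions:

```python
import torch
import transformers

print("torch version        :", torch.__version__)
print("transformers version :", transformers.__version__)
```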

Now you need to specify some general settings, including the number of epochs and the device (CPU or GPU). We also set a fixed random seed, which helps with the reproducibility of the experiment.
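A sketch of this setup cell, with placeholder values for the seed, the number of epochs, and the batch size (the exact values are assumptions):

```python
import torch

RANDOM_SEED = 123   # assumption: any fixed seed works
NUM_EPOCHS = 3      # assumption
BATCH_SIZE = 16     # assumption

torch.manual_seed(RANDOM_SEED)
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", DEVICE)
```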

Loading the IMDb movie review dataset

Let’s see how to prepare and tokenize the IMDb movie review dataset and fine-tune DistilBERT on it. First, fetch the compressed data and unzip it.
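One possible way to do this, assuming we fetch the original Stanford archive of 50,000 labeled reviews (the source URL is an assumption):

```python
import urllib.request
import tarfile

# Assumption: download the original Stanford IMDb archive
URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
urllib.request.urlretrieve(URL, "aclImdb_v1.tar.gz")

# Extract the compressed archive into the working directory
with tarfile.open("aclImdb_v1.tar.gz", "r:gz") as tar:
    tar.extractall()
```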

As usual, we need to split the data into training, validation, and test sets.
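A simple split by index might look like this, assuming the reviews have already been collected and shuffled into a pandas DataFrame df with a text column and a binary label column (column names and split sizes are assumptions):

```python
# Assumption: df has 50,000 shuffled rows with columns "text" and "label" (0 = negative, 1 = positive)
train_texts  = df.iloc[:35_000]["text"].values
train_labels = df.iloc[:35_000]["label"].values

valid_texts  = df.iloc[35_000:40_000]["text"].values
valid_labels = df.iloc[35_000:40_000]["label"].values

test_texts   = df.iloc[40_000:]["text"].values
test_labels  = df.iloc[40_000:]["label"].values
```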

Let’s tokenize the texts into individual word tokens using the tokenizer implementation inherited from the pre-trained model class.

Tokenization is used in natural language processing to split paragraphs and sentences into smaller units that can be more easily assigned meaning.

With Hugging Face you will always find a tokenizer associated with each model. If you are not doing research or experiments on tokenizers, it’s always preferable to use the standard one.
Padding is a strategy for ensuring tensors are rectangular by adding a special padding token to shorter sentences. On the other hand, sometimes a sequence may be too long for a model to handle. In this case, you’ll need to truncate the sequence to a shorter length.
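Putting tokenization, padding, and truncation together, a sketch of this step with the distilbert-base-uncased tokenizer:

```python
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

# Pad shorter reviews and truncate longer ones to the model's maximum length
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True)
valid_encodings = tokenizer(list(valid_texts), truncation=True, padding=True)
test_encodings  = tokenizer(list(test_texts),  truncation=True, padding=True)
```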

Let’s pack everything into a Python class that we are going to name IMDbDataset. We are also going to use this custom dataset to create the corresponding dataloaders.
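A sketch of such a dataset class, following the standard PyTorch Dataset pattern:

```python
import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Dictionary comprehension: keep input_ids and attention_mask for this example
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
```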

The encodings variable stores a lot of information about the tokenized text. We can extract only the most relevant information via dictionary comprehension. The dictionary contains:

  • input_ids: the indices corresponding to each token in the sentence.
  • labels: the class labels.
  • attention_mask: indicates whether a token should be attended to or not (padding tokens, for example, are not attended to).

Let’s construct datasets and corresponding dataloaders.
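For example, reusing the batch size from the setup above:

```python
from torch.utils.data import DataLoader

train_dataset = IMDbDataset(train_encodings, train_labels)
valid_dataset = IMDbDataset(valid_encodings, valid_labels)
test_dataset  = IMDbDataset(test_encodings,  test_labels)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader  = DataLoader(test_dataset,  batch_size=BATCH_SIZE, shuffle=False)
```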

Loading and fine-tuning BERT

Finally, we are done with the data preprocessing, and we can start fine-tuning our model. Let’s define a model and an optimization algorithm, Adam in this case.
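A sketch of this step (the learning rate of 5e-5 is an assumption, but it is a common default for fine-tuning BERT-like models):

```python
import torch
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
model.to(DEVICE)

optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
```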

DistilBertForSequenceClassification specifies the downstream task we want to fine-tune the model on, which is sequence classification in this case. Note that “uncased” means that the model does not distinguish between upper- and lower-case letters.

Before training the model, we need to define some metrics to track the model’s improvements. In this simple case, we can use conventional classification accuracy. Notice that this function is quite long because we load the dataset batch by batch to work around RAM and GPU memory limitations. Usually, these resources are never enough when fine-tuning on large datasets.
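A sketch of such an accuracy function, evaluated batch by batch:

```python
def compute_accuracy(model, data_loader, device):
    model.eval()
    correct_pred, num_examples = 0, 0

    with torch.no_grad():
        for batch in data_loader:
            # Move the current batch to the GPU (or CPU)
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            # The model returns a SequenceClassifierOutput;
            # argmax over the logits gives the predicted class
            outputs = model(input_ids, attention_mask=attention_mask)
            predicted_labels = torch.argmax(outputs.logits, dim=1)

            num_examples += labels.size(0)
            correct_pred += (predicted_labels == labels).sum()

    return (correct_pred.float() / num_examples * 100).item()
```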

In the compute_accuracy function, we load a given batch and then obtain the predicted labels from the model’s outputs. While doing this, we keep track of the total number of examples via the variable num_examples, and of the number of correct predictions via the correct_pred variable. After we have iterated over the complete dataloader, we compute the accuracy with the final division.

You can also see how the model is used inside the compute_accuracy function. We feed it the input_ids along with the attention_mask, which denotes whether a token is an actual text token or padding. The model returns a SequenceClassifierOutput object, from which we take the logits and convert them into class predictions with the argmax function.

Training (fine-tuning) loop

If you know how to code a training loop in PyTorch you won’t have any issues understanding this fine-tuning loop. As in any neural network, we give inputs to the network, calculate the output, compute the loss, and do parameter updates based on this loss.
At regular intervals we print the training progress to get feedback.
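A sketch of the fine-tuning loop, reusing the model, optimizer, dataloaders, and compute_accuracy defined above (the logging interval is an assumption):

```python
import time

start_time = time.time()

for epoch in range(NUM_EPOCHS):
    model.train()

    for batch_idx, batch in enumerate(train_loader):
        # Move the batch to the device
        input_ids = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        labels = batch["labels"].to(DEVICE)

        # Forward pass: the model computes the loss when labels are provided
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        # Backward pass and parameter update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch_idx % 250 == 0:
            print(f"Epoch {epoch + 1:02d}/{NUM_EPOCHS:02d} "
                  f"| Batch {batch_idx:04d}/{len(train_loader):04d} "
                  f"| Loss {loss.item():.4f}")

    # Evaluate at the end of each epoch
    print(f"Training accuracy:   {compute_accuracy(model, train_loader, DEVICE):.2f}%")
    print(f"Validation accuracy: {compute_accuracy(model, valid_loader, DEVICE):.2f}%")

print(f"Total training time: {(time.time() - start_time) / 60:.2f} min")
```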

In this article, we have seen how to fine-tune a large language model such as BERT using a plain PyTorch training loop. There is actually a much faster and even smarter way to do this using the Transformers library from Hugging Face: it lets us create a Trainer object for fine-tuning, where we can specify parameters such as the number of epochs in just a few lines of code. Follow me if you are curious to see how to do it in the next article! 😉

Marcello Politi

Linkedin, Twitter, Website



