
A Step-by-step Guide to Solving 4 Real-life Problems With Transformers and Hugging Face | by Zoumana Keita | Jan, 2023



Image by Aditya Vyas on Unsplash

In the field of Natural Language Processing (NLP), researchers have made significant contributions over the past decades, resulting in innovative advancements in various domains. Some examples of NLP in practice are provided below:

  • Siri, a personal assistant developed by Apple, can assist users with tasks like setting alarms, sending texts, and answering questions.
  • In the medical field, NLP is being utilized to speed up drug discovery.
  • Additionally, NLP is also being used to bridge language barriers through translation.

The purpose of this article is to discuss Transformers, an extremely powerful model in Natural Language Processing. It will begin by highlighting the advantages of Transformers over recurrent neural networks to further your comprehension of the model. Then, it will provide practical examples of using Hugging Face transformers in real-world scenarios.

Before delving into the fundamental idea of transformers, it’s important to gain a basic understanding of recurrent models, including their limitations.

Recurrent networks use an encoder-decoder structure and are typically used for tasks involving input and output sequences in a specific order. Some of the most common applications of recurrent networks include machine translation and modeling time series data.

Challenges with recurrent networks

As an example, let’s take a French sentence and translate it into English. The encoder receives the original French sentence as input, and the decoder produces the translated output.

Simple illustration of the recurrent network (By Author)
  • The encoder processes the input French sentence word by word, and the decoder generates the translated words in the same sequential fashion, which makes these models time-consuming to train.
  • The hidden state of the current word depends on the hidden states of the previous words, which makes parallel computation impossible, regardless of the computational power available.
  • Sequence-to-sequence networks are prone to exploding (and vanishing) gradients when processing long sequences, resulting in poor performance.
  • Long Short-Term Memory (LSTM) networks, another type of recurrent network, were developed to address the issue of vanishing gradients, but they are even slower than traditional sequence models.

Wouldn’t it be beneficial to have a model that combines the advantages of recurrent networks and enables parallel computation?

Here is where transformers come in handy.

In 2017, Google Brain introduced Transformers, a new, powerful neural network architecture, in their renowned research paper “Attention is all you need.” It is based on the attention mechanism rather than the sequential computation found in recurrent networks.

Like recurrent networks, transformers also consist of two main components: an encoder and a decoder, each incorporating a self-attention mechanism. The following section provides an overall understanding of the primary elements of each component of transformers.

General Architecture of Transformers (adapted by Author)

Input sentence preprocessing stage

This stage involves two primary steps: (1) creating the embeddings of the input sentence, and (2) calculating the positional vector of each word in the input sentence. These computations are carried out in the same way for both the input sentence (prior to the encoder block) and the output sentence (before the decoder block).

Embedding of the input data

Before creating the embeddings of the input data, we begin by tokenizing it, then creating the embedding for each individual word without considering their relationship within the sentence.
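As a quick, illustrative sketch of the first step, a Hugging Face tokenizer splits raw text into sub-word units before any embedding is computed (the model choice below is arbitrary and only used for illustration):

from transformers import AutoTokenizer

# Illustrative sketch: tokenize a sentence with a pre-trained tokenizer
# ("bert-base-uncased" is an arbitrary choice here)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Transformers rely on attention rather than recurrence"))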

Positional encoding

The tokenization process eliminates any notion of the word order that existed in the input sentence. Positional encoding aims to restore this information by creating, with cyclical (sine and cosine) functions, a positional context vector for each word.
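For concreteness, here is a minimal sketch of the sinusoidal positional encoding proposed in the original paper; it is one common choice, not the only one:

import numpy as np

# Minimal sketch of the sinusoidal positional encoding from "Attention is all you need"
def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, np.newaxis]           # shape (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                 # shape (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])              # sine on even dimensions
    encoding[:, 1::2] = np.cos(angles[:, 1::2])              # cosine on odd dimensions
    return encoding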

Encoder block

As a result of the previous step, we obtain a combination of two vectors for each word: (1) the embedding and (2) its context vector. These vectors are combined to create a single vector for each word, which is then sent to the encoder.

Multi-head attention

As previously stated, all sense of relationship is lost. The purpose of the attention layer is to identify the contextual connections between different words in the input sentence. This step ultimately results in the creation of an attention vector for each word.
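As a simplified, single-head sketch (the real layer runs several such heads in parallel and adds learned projections), the attention vector of each word is a weighted sum of the value vectors of all the words:

import numpy as np

# Minimal single-head sketch of scaled dot-product attention (no learned projections)
def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # similarity of each word with every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the words
    return weights @ V                                         # one attention vector per word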

Position-wise feed-forward net (FFN)

In this step, a feed-forward neural network is applied to each attention vector to change them into a format that can be used by the following multi-head attention layer in the decoder.

Decoder block

The decoder block is made up of three main layers: a masked multi-head attention layer, a multi-head attention layer, and a position-wise feed-forward network. The last two layers are similar to those in the encoder.

The decoder is used during training and takes two inputs: the attention vectors of the input sentence being translated and the corresponding target sentences in English.

So, what is the masked multi-head attention layer responsible for?

During the generation of the next English word, the network is allowed to use all the words from the French sentence. However, when dealing with a given word in the target sequence (the English translation), the network must only have access to the previous words, because making the next ones available would lead the network to “cheat” instead of making any effort to learn properly. This is where the masked multi-head attention layer shows its benefits: it masks those future words (their contribution is forced to zero) so that the attention network cannot use them.

The result of the masked multi-head attention layer passes through the rest of the layers in order to predict the next word by generating a probability score.
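In practice, the masking is usually done by adding a very large negative value to the attention scores of future positions before the softmax, which drives their attention weights to zero. A minimal sketch:

import numpy as np

# Minimal sketch of the causal mask used by masked self-attention:
# positions after the current word receive a score of -inf, so their softmax weight is zero
def causal_mask(seq_len):
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)          # 1s strictly above the diagonal
    return np.where(upper == 1, -np.inf, 0.0)

def masked_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1]) + causal_mask(Q.shape[0])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # future positions get weight 0
    return weights @ V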

Training deep neural networks such as transformers from scratch is not an easy task, and might present the following challenges: (1) finding the required amount of data for the target problem can be time-consuming, and (2) getting the necessary computation resources like GPUs to train such deep networks can be very costly.

Imagine building a model from scratch to translate the Mandingo language into Wolof, both of which are low-resource languages. Gathering data related to those languages is costly. Instead of going through all these challenges, one can re-use pre-trained deep neural networks as the starting point for training the new model.

Such models have been trained on a huge corpus of data, made available by someone else (an individual, an organization, etc.), and evaluated to work very well on language translation tasks such as French to English.

But what does it mean to re-use deep neural networks?

The re-use of the model involves choosing the pre-trained model that is similar to your use case, refining the input-output pair data of your target task, and retraining the higher layers of the pre-trained model by using your data.
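A minimal sketch of this idea with a Hugging Face model is shown below; the model choice and number of labels are illustrative, and in practice you would then fine-tune on your own labelled data (for example with the Trainer API):

from transformers import AutoModelForSequenceClassification

# Illustrative sketch: start from a pre-trained model and only retrain the task head
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Freeze the pre-trained encoder layers; only the classification head stays trainable
for param in model.base_model.parameters():
    param.requires_grad = False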

The introduction of transformers has led to the development of state-of-the-art transfer learning models such as:

  • BERT, short for Bidirectional Encoder Representations from Transformers, was developed by Google researchers in 2018. It helps to solve the most common language tasks such as named entity recognition, sentiment analysis, question answering, text summarization, etc.
  • GPT-3 (Generative Pre-trained Transformer 3), proposed by OpenAI researchers. It is a multi-layer transformer, mainly used to generate any type of text. GPT models are capable of producing human-like text responses to a given question.

Hugging Face is an AI community and Machine Learning platform created in 2016 by Julien Chaumond, Clément Delangue, and Thomas Wolf. It aims to democratize NLP by providing Data Scientists, AI practitioners, and Engineers with immediate access to over 20,000 pre-trained models based on the state-of-the-art transformer architecture. These models can be applied to:

  • Text, in over 100 languages, for performing tasks such as classification, information extraction, question answering, generation, and translation.
  • Speech, for tasks such as audio classification and speech recognition.
  • Vision, for object detection, image classification, and segmentation.

Hugging Face Transformers also provides almost 2,000 datasets and layered APIs, allowing programmers to easily interact with those models using the three most popular deep learning libraries: PyTorch, TensorFlow, and JAX.

Another key component of Hugging Face Transformers is the pipeline. Pipelines are objects that abstract away the complexity of the code in the library and make it easy to use all these models for inference.
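For example, a single pipeline call wraps tokenization, model inference, and post-processing; the snippet below lets the library pick its default model for the task, so treat it as a sketch:

from transformers import pipeline

# A pipeline bundles tokenizer + model + post-processing behind one call
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers make NLP much easier to work with!"))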

Now that you have a better understanding of transformers and the Hugging Face platform, we will walk you through the following real-world scenarios: language translation, zero-shot classification, sentiment classification, and question answering.

Prerequisites

The first step is to install the transformers library as follows:

pip install transformers

We will be using the Internet News and Consumer Engagement data set from Kaggle. This data set is made freely available under the CC0: Public Domain license and was created to predict the popularity of an article before its publication.

For simplicity’s sake, the tutorial will be using only three examples from the data, and the analysis is based on the description column.

import pandas as pd

# Create wrapper to properly format the text
from textwrap import TextWrapper

# Wrap text to 80 characters.
wrapper = TextWrapper(width=80)

# Load the data
news_data = pd.read_csv("consumer_engagement_data.csv")

# Choose candidate descriptions
description_76 = news_data.iloc[76]["description"]
description_118 = news_data.iloc[118]["description"]
description_178 = news_data.iloc[178]["description"]

english_texts = [description_76, description_118, description_178]

for english_text in english_texts:
    print(wrapper.fill(english_text))
    print("\n")

Original news in English (Image by Author)

Language Translation

MarianMT is an efficient Machine Translation framework. It uses the MarianNMT engine under the hood, which is developed purely in C++ by Microsoft and several academic institutions such as the University of Edinburgh and Adam Mickiewicz University in Poznań. The same engine is currently behind the Microsoft Translator service.

The NLP group from the University of Helsinki open-sourced multiple translation models on Hugging Face Transformers, and they are all in the following format: Helsinki-NLP/opus-mt-{src}-{tgt}, where {src} and {tgt} correspond respectively to the source and target languages.

So, in our case, the source language is English (en) and the target language is French (fr).

MarianMT is one of those models, previously trained using Marian on parallel data collected from OPUS.

  • MarianMT requires sentencepiece in addition to Transformers:
pip install sentencepiece
from transformers import MarianTokenizer, MarianMTModel
  • Select the pre-trained model, get the tokenizer and load the pre-trained model
# Get the name of the model
trans_model_name = 'Helsinki-NLP/opus-mt-en-fr'
# Get the tokenizer
trans_model_tkn = MarianTokenizer.from_pretrained(trans_model_name)
# Instantiate the model
trans_model = MarianMTModel.from_pretrained(trans_model_name)
  • Add the special token >>{tgt}<< in front of each source (English) text with the help of the following function.
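A minimal sketch of such a helper is shown below; the name format_batch_texts is illustrative:
# Minimal sketch of a helper that prepends the target-language token (name is illustrative)
def format_batch_texts(language_code, batch_texts):
    return [f">>{language_code}<< {text}" for text in batch_texts]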
  • Implement the batch translation logic with the help of the following function, a batch being a list of texts to be translated.
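A minimal sketch of that batch translation logic, reusing the model, tokenizer, and helper defined above (the function name perform_translation is illustrative):
# Minimal sketch of the batch translation logic (function name is illustrative)
def perform_translation(batch_texts, model, tokenizer, language="fr"):
    # Prepend the target-language token to each source text
    formatted_batch = format_batch_texts(language, batch_texts)
    # Tokenize the batch, generate the translations, then decode them back to text
    encoded = tokenizer(formatted_batch, return_tensors="pt", padding=True)
    generated = model.generate(**encoded)
    return [tokenizer.decode(t, skip_special_tokens=True) for t in generated]

translated_texts = perform_translation(english_texts, trans_model, trans_model_tkn)
for text in translated_texts:
    print(wrapper.fill(text))
    print("\n")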
Translated texts

Zero-shot classification

Most of the time, training a Machine Learning model requires all the candidate labels/targets to be known beforehand, meaning that if your training labels are science, politics, or education, you will not be able to predict the healthcare label unless you retrain your model, taking into consideration that label and the corresponding input data.

Zero-shot classification removes this constraint: the multilingual model used below can predict the target of a text in about 15 languages without having seen any of the candidate labels. We can use this model by simply loading it from the hub.

The goal here is to try to classify the category of each of the previous descriptions, whether it is tech, politics, business, or finance.

  • Import the pipeline module
from transformers import pipeline
  • Define candidate labels. These correspond to what we want to predict: tech, politics, business, or finance
candidate_labels = ["tech", "politics", "business", "finance"]
  • Define the classifier with the multi-lingual option
my_classifier = pipeline("zero-shot-classification",
                         model="joeddav/xlm-roberta-large-xnli")
  • Implement the prediction logic using this helper function.
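A minimal sketch of that helper, reusing my_classifier and candidate_labels from above; the name run_predictions and the output format are chosen to match the calls below:
# Minimal sketch of the prediction helper (name and output format are illustrative)
def run_predictions(text):
    prediction = my_classifier(text, candidate_labels)
    return {
        "Text": prediction["sequence"],
        "Result": dict(zip(prediction["labels"],
                           [round(score, 2) for score in prediction["scores"]])),
    }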
  • Run the predictions on the first and the last descriptions
# For the first description
prediction_desc_76 = run_predictions(english_texts[0])

print(wrapper.fill(prediction_desc_76["Text"]))

Original textual description
print(prediction_desc_76["Result"])
Text predicted to be mainly about finance (Image by Author)

This result shows that the text is mainly about finance, with a score of 81%.

For the last description, we get the following result:

# For the last description
prediction_desc_178 = run_predictions(english_texts[-1])

print(wrapper.fill(prediction_desc_178["Text"]))

print(prediction_desc_178["Result"])
Text predicted to be mainly about tech (Image by Author)

This result shows that the text is mainly about tech, with a score of 95%.

Sentiment classification

Most models performing sentiment classification require proper training. The Hugging Face pipeline module makes it easy to run sentiment analysis predictions by using a specific model available on the hub, simply by specifying its name.

from transformers import pipeline
  • Choose the task to perform and load the corresponding model. Here, we want to perform sentiment classification using a DistilBERT base model fine-tuned on SST-2.
distil_bert_model = pipeline(task="sentiment-analysis",
                             model="distilbert-base-uncased-finetuned-sst-2-english")
  • The model is ready! Let’s analyze the underlying sentiments behind the last two sentences.
# Run the predictions
distil_bert_model(english_texts[1:])
Sentiment scores (Image by Author)

The model predicted the first text to have a negative sentiment with 96% confidence, and the second text to have a positive sentiment with 52% confidence.

Question Answering

Imagine dealing with a report much longer than the description about Apple, where all you are interested in is the date of the event being mentioned. Instead of reading the whole report to find that key information, we can use a question-answering model from Hugging Face that will provide the answer we are interested in.

This can be done by providing the model with proper context (Apple’s report) and the question we are interested in finding the answer to.

  • Import the question-answering class and tokenizer from transformers
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
  • Instantiate the model using its name and its tokenizer.
model_name = "deepset/roberta-base-squad2"
QA_model = pipeline('question-answering', model=model_name,
                    tokenizer=model_name)
  • Request the model by asking the question and specifying the context.
QA_input = {
    'question': 'when is Apple hosting an event?',
    'context': english_texts[-1]
}
  • Get the result of the model
model_response = QA_model(QA_input)
pd.DataFrame([model_response])
Question Answering Result (Image by Author)

The model answered that Apple’s event is on September 10th, with a high confidence of 97%. It even specifies where the answer is located in the text by providing the starting and ending character positions.

In this article, we’ve covered the evolution of natural language technology from recurrent networks to transformers and how Hugging Face has democratized the use of NLP through its platform.

If you are still hesitant about using transformers, we believe it is time to give them a try and add value to your business cases.

Also, if you like reading my stories and wish to support my writing, consider becoming a Medium member. With a $5-a-month commitment, you unlock unlimited access to stories on Medium.

Feel free to follow me on Medium, Twitter, and YouTube, or say Hi on LinkedIn. It is always a pleasure to discuss AI, ML, Data Science, NLP, and MLOps stuff!

The article When Should You Consider Using Datatable Instead of Pandas to Process Large Data? could be a good next step.



