
An Intro to Hugging Face With Implementation of 6 NLP Tasks | by Farzad Mahmoodinobar | Apr, 2023



Photo by Duy Pham on Unsplash

Hugging Face is an open-source AI community for and by machine learning practitioners with a focus on Natural Language Processing (NLP), computer vision and audio/speech processing tasks. Whether you already work in one of these areas or aspire to enter this realm in the future, you will benefit from learning how to use Hugging Face tools and models.

In this post we are going to go over six of the most frequently used NLP tasks by leveraging pre-trained models available on Hugging Face, as follows:

  1. Text Generation (a.k.a. Language Modeling)
  2. Question Answering
  3. Sentiment Analysis
  4. Text Classification
  5. Text Summarization
  6. Machine Translation

Before jumping into the tasks, let’s take a minute to talk about the distinction between “Training” and “Inference”, which are two important concepts in machine learning, in order to clarify what we will be working on today.

Let’s get started!

Training is the process of feeding a machine learning model with large amounts of data. During this process the model “learns” from the provided data (by optimizing an objective function), and hence this process is called “Training”. Once we have a trained model, we can use it to make predictions on new data that the model has not seen before. This process is called “Inference”. In short, training is the learning process for the model, while inference is the model making predictions (i.e. when we actually use the model).

Now that we understand the distinction between training and inference, we can more concretely define what we will be working on today. In this post, we will be using various pre-trained models for inference. In other words, we will not be going through the expensive process of training any new models here. Instead, we are going to leverage the myriad of existing pre-trained models in the Hugging Face Hub and use them for inference (i.e. to make predictions).

1. Text Generation

I decided to start with this task, given the recent surge of interest in generative AI such as ChatGPT. This task is usually called language modeling, and what the models do is predict missing parts of text (this can be a word, a token or larger strings of text). What has attracted a lot of interest recently is that these models can generate text without necessarily having seen such prompts before.

Let’s see how it works in practice!

1.1. Text Generation — Implementation

In order to implement text generation, we will import pipeline from the transformers library, use one of the GPT models and take the steps below. I have also added comments in the code so that you can more easily follow the steps:

  1. Import libraries
  2. Specify the name of the pre-trained model to be used for this specific task
  3. Specify the task
  4. Create an instance of pipeline as generator
  5. Specify the input sentence, which will be completed by the model
  6. Perform the text generation and store the results as output
  7. Return the results

The code block below follows these steps.

# Import libraries
from transformers import pipeline

# Specify the model
model = "gpt2"

# Specify the task
task = "text-generation"

# Instantiate pipeline
generator = pipeline(model = model, task = task, max_new_tokens = 30)

# Specify input text
input_text = "If you are interested in learing more about data science, I can teach you how to"

# Perform text generation and store the results
output = generator(input_text)

# Return the results
output

Results:

Text Generation Results

We can see in the results that the model took our provided input text and generated additional text, given the data it was trained on and the sentence that we provided. Note that I limited the length of the output to 30 tokens using the max_new_tokens argument to prevent a lengthy response. The generated text sounds reasonable and relevant to the context.
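
As a side note, the text-generation pipeline passes additional keyword arguments through to the model’s generate method. The short sketch below is my own addition rather than part of the original walkthrough; it reuses the generator instance from above and samples several candidate completions (the specific parameter values are just illustrative).

# Sample multiple candidate completions instead of a single greedy one
outputs = generator(
    input_text,
    do_sample = True,            # sample tokens instead of greedy decoding
    temperature = 0.7,           # lower values make the output more focused
    num_return_sequences = 3     # return three candidate completions
)

# Print each candidate completion
for candidate in outputs:
    print(candidate["generated_text"])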

But what about a case where we would like to ask the model a question? Can the model answer a question, instead of just completing an incomplete sentence? Let’s explore that next.

2. Question Answering

Question answering, as the name suggests, is a task where the model answers a question provided by the user. There are generally two types of question answering tasks:

  1. Extractive (i.e. context-dependent): the user describes a situation to the model in the question/prompt and asks the model to generate a response, given that provided information. In this scenario, the model picks the relevant parts of the information from the prompt and returns the results.
  2. Abstractive (i.e. context-independent): the user asks the model a question without providing any context.

Let’s look at how question answering can be implemented.

2.1. Question Answering — Implementation

The implementation process is similar to the language modeling task. We will use two different models so that we can compare the results.

Let’s start with distilbert-base-cased-distilled-squad.

# Specify model
model = 'distilbert-base-cased-distilled-squad'

# Instantiate pipeline
answerer = pipeline(model = model, task="question-answering")

# Specify question and context
question = "What does NLP stand for?"
context = "Today we are talking about machine learning and specifically the natural language processing, which enables computers to understand, process and generate languages"

# Generate predictions
preds = answerer(
    question = question,
    context = context,
)

# Return results
print(
    f"score: {round(preds['score'], 4)}, start: {preds['start']}, end: {preds['end']}, answer: {preds['answer']}"
)

Results:

Question Answering Results (distilbert-base-cased-distilled-squad Model)

We can see in the results that the model was able to determine which part of the context was relevant to answering the question.
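
One detail worth knowing (my addition, assuming the answerer instance defined above): extractive question-answering pipelines score every candidate answer span, and the top_k argument lets us inspect the runner-up answers rather than just the single best one.

# Ask the pipeline for the 3 highest-scoring answer spans
candidates = answerer(question = question, context = context, top_k = 3)

# With top_k > 1 the pipeline returns a list of candidate answers
for candidate in candidates:
    print(f"answer: {candidate['answer']}, score: {round(candidate['score'], 4)}")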

Let’s implement the same problem using a different model, named deepset/roberta-base-squad2.

# Specify model
model = "deepset/roberta-base-squad2"

# Specify task
task = "question-answering"

# Instantiate pipeline
answerer = pipeline(task = task, model = model, tokenizer = model)

# Specify input
qa_input = {
    'question': 'What does NLP stand for?',
    'context': 'Today we are talking about machine learning and specifically the natural language processing, which enables computers to understand, process and generate languages'
}

# Generate predictions
output = answerer(qa_input)

# Return results
output

Results:

Question Answering Results (deepset/roberta-base-squad2 Model)

As we see in the above example, the second model was also able to identify that NLP stands for natural language processing, given the context that we provided.

Let’s continue our journey in the NLP tasks by looking at sentiment analysis next.

3. Sentiment Analysis

Sentiment analysis is the process of categorizing the sentiment of a text as positive, negative or neutral. There is a wide range of applications for sentiment analysis across industries, such as monitoring customer sentiment in product reviews, or gauging public interest in a given topic during an election year. The focus of this post is on using Hugging Face for various tasks, so we will not dive deeper into each topic here.

3.1. Sentiment Analysis — Implementation

In order to implement sentiment analysis, we will again rely on pipeline from the transformers library and take the steps below. I have also added comments in the code so that you can more easily follow the steps.

  1. Import libraries
  2. Specify the name of the pre-trained model to be used for this specific task (i.e. sentiment analysis)
  3. Specify the task (i.e. sentiment analysis)
  4. Specify the sentence, which will be sentiment analyzed
  5. Create an instance of pipeline as analyzer
  6. Perform the sentiment analysis and save the results as output
  7. Return the results

# Specify pre-trained model to use
model = 'distilbert-base-uncased-finetuned-sst-2-english'

# Specify task
task = 'sentiment-analysis'

# Text to be analyzed
input_text = 'Performing NLP tasks using HuggingFace pipeline is super easy!'

# Instantiate pipeline
analyzer = pipeline(task, model = model)

# Store the output of the analysis
output = analyzer(input_text)

# Return output
output

Results:

Sentiment Analysis Results

The results indicate that the sentiment of the sentence is a positive one with a score of ~85%. The sentence sounds pretty positive to me so I like the results so far. Feel free to replicate the process for other sentences and test it out!
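
If you do want to test multiple sentences, note that pipelines also accept a list of inputs. The sketch below is my own illustration (the two review sentences are made up) and reuses the analyzer instance from the code above.

# A list input returns one {'label': ..., 'score': ...} dict per sentence
reviews = [
    "The product arrived on time and works perfectly.",
    "Terrible experience, the item broke after one day."
]

# Analyze all sentences in one call
output = analyzer(reviews)

# Return output
output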

Let’s move on to a different type of text classification.

4. Text Classification

Sentiment analysis, which we just covered, can be considered a special case of text classification, where the categories (or classes) are only positive, negative or neutral. Text classification is more generic in that it can classify (or categorize) the incoming text (e.g. a sentence, paragraph or document) into pre-defined classes. Let’s see what this means in practice.

4.1. Text Classification — Implementation

We will use the same pipeline function and take steps very similar to what we did for sentiment analysis, as follows:

# Specify model
model = 'facebook/bart-large-mnli'

# Specify Task
task = 'zero-shot-classification'

# Specify input text
input_text = 'This is a tutorial about using pre-trained models through HuggingFace'

# Identify the classes/categories/labels
labels = ['business', 'sports', 'education', 'politics', 'music']

# Instantiate pipeline
classifier = pipeline(task, model = model)

# Store the output of the analysis
output = classifier(input_text, candidate_labels = labels)

# Return output
output

Results:

Text Classification Results

The results are quite interesting! The scores correspond to each label, sorted from largest to smallest for ease of reading. For example, the results indicate that our sentence is labeled as “education” with a score of ~40%, followed by “business” at ~22%, while the “music”, “sports” and “politics” labels have very low scores, which makes sense to me overall.
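
A related knob worth mentioning (my addition, not part of the original results): by default the zero-shot classification pipeline assumes exactly one label applies, so the scores sum to 1. Passing multi_label = True scores each label independently, which is a better fit when a text can plausibly belong to several classes at once.

# Score each label independently instead of forcing scores to sum to 1
output = classifier(input_text, candidate_labels = labels, multi_label = True)

# Return output
output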

Let’s move on to our next task, which is summarization.

5. Text Summarization

Text summarization is the task of automatically summarizing textual input while still conveying the main points/gist of the incoming text. One example of the business motivation for such summarization models is the situation where humans read incoming text communications (e.g. customer emails); human representatives can read summaries of the customer emails instead of the entire emails, resulting in improved operational efficiency by saving human time and cost.

Let’s look at how we can implement text summarization.

5.1. Text Summarization — Implementation

Similar to other tasks, we will use the pipeline for summarization. For this specific task, we will first use a text-to-text pre-trained model from Google named T5 to summarize the description that we just read about “Text Summarization” in the section above. We will then repeat the same exercise using a different model from Google to see how the results vary. Let’s see how we can implement this.

# Specify model and tokenizer
model = "t5-base"
tokenizer = "t5-base"

# Specify task
task = "summarization"

# Specify input text
input_text = "Text summarization is the task of automatically summarizing textual input, while still conveying the main points and gist of the incoming text. One example of the business intuition behind the need for such summarization models is the situations where humans read incoming text communications (e.g. customer emails) and using a summarization model can save human time. "

# Instantiate pipeline (framework = "tf" loads the TensorFlow weights, so TensorFlow must be installed)
summarizer = pipeline(task = task, model = model, tokenizer = tokenizer, framework = "tf")

# Summarize and store results
output = summarizer(input_text)

# Return output
output

Results:

Text Summarization Results (T5 Model)

As you see in the results, the T5 model took the input text, which was rather long, and returned a brief summary of what it considered the main points of the input text. I like the summary since it explains what text summarization is and what benefits it can provide — that’s a good summary!
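
As a quick aside (my own addition): the summarization pipeline also accepts generation arguments, so we can bound the length of the summary, as we will do for Pegasus below. A minimal sketch reusing the summarizer instance from above, with illustrative values:

# Bound the summary length
output = summarizer(input_text, max_length = 50, min_length = 10)

# Return output
output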

Let’s try another model from Google named Pegasus to see if and how the results change, when we use a different model.

# Specify model
model = 'google/pegasus-cnn_dailymail'

# Specify task
task = 'summarization'

# Specify input text
input_text = "Text summarization is the task of automatically summarizing textual input, while still conveying the main points and gist of the incoming text. One example of the business intuition behind the need for such summarization models is the situations where humans read incoming text communications (e.g. customer emails) and using a summarization model can save human time. "

# Instantiate pipeline
summarizer = pipeline(task = task, model = model)

# Summarize and store results
output = summarizer(input_text, max_length = 75, min_length = 25)

# Return output
output

Results:

Text Summarization Results (Pegasus Model)

As expected, the resulting outputs of the two models differ, since each is trained with its own data and training objectives, but both accomplished the task. I personally prefer the output of the T5 model, since it more succinctly states the point of the input text.

The last, but not least, task that we will be looking at is machine translation.

6. Machine Translation

Machine translation is the task of generating the translation of an input text in a target language. This is similar to what Google Translate and other translation engines provide. One of the benefits of using Hugging Face for machine translation is that we get to choose which model to use for our translation, which can potentially provide a more accurate translation for the specific language we are working with.

Let’s look at the implementation of machine translation in Hugging Face.

6.1. Machine Translation — Implementation

In order to generate translations, we will use two of the most common pre-trained models to translate the same sentence from English to French. Implementation of each slightly varies but the overall process is the same as other tasks that we have implemented so far.

6.1.1. T5

T5 is an encoder-decoder pre-trained model developed by Google, which works well on multiple tasks, including machine translation. In order to prompt T5 to perform a task such as translation from language X to language Y, we prepend a string (called a “prefix”) to the input sentence as follows: "translate X to Y: sentence_to_be_translated".

This is easier to understand in practice, so let’s just translate a sentence from English to French using T5 and see how it works.

# Specify prefix
original_language = 'English'
target_language = 'French'
prefix = f"translate {original_language} to {target_language}: "

# Specify input text
input_text = f"{prefix}This is a post on Medium about various NLP tasks using Hugging Face."

# Specify model
model = "t5-base"

# Specify task
task = "translation"

# Instantiate pipeline
translator = pipeline(task = task, model = model)

# Perform translation and store the output
output = translator(input_text)

# Return output
output

Results:

Machine Translation Results (T5 Model)

I looked up this translation on Google Translate and this looks like a good translation! I was concerned about this verification methodology, since T5 is also developed by Google. I do not know what exact model Google Translate uses for translation but we will see how much the results vary when we run the same translation task using mBART in the next section.
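
As a side note (my addition, not from the original post): for common language pairs, transformers also exposes pair-specific translation tasks such as translation_en_to_fr, which add the T5 prefix internally, so the raw sentence can be passed directly.

# Instantiate a pipeline for the English-to-French task
translator_en_fr = pipeline(task = "translation_en_to_fr", model = "t5-base")

# Translate the raw sentence (no prefix needed)
output = translator_en_fr("This is a post on Medium about various NLP tasks using Hugging Face.")

# Return output
output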

6.1.2. mBART

mBART is a multilingual encoder-decoder pre-trained model developed by Meta, which is primarily intended for machine translation tasks. Unlike T5, mBART does not require a prefix in the prompt, but we do need to identify the source and target languages to the model.

Let’s implement the same task in mBART.

# Import packages
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Specify input text
input_text = "This is a post on Medium about various NLP tasks using Hugging Face."

# Specify model
model_name = "facebook/mbart-large-50-many-to-many-mmt"

# Instantiate model and tokenizer
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)

# Specify source language
tokenizer.src_lang = "en_XX"

# Encode input text
encoded_en = tokenizer(input_text, return_tensors="pt")

# Perform translation to the target language
generated_tokens = model.generate(**encoded_en, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])

# Decode the translation and store the output
output = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

# Return output
output

Results:

Machine Translation Results (mBART Model)

The results seem very similar to what T5 generated, with the exception of “poste” having been replaced by “post”. Regardless of the difference between the two outcomes, the main point of the exercise was to demonstrate how these pre-trained models can perform machine translation, which we have accomplished using both models.
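
One nice property of this many-to-many checkpoint is that reversing the translation direction only requires swapping the source language code and the forced target-language token. The sketch below is my own addition, reusing the model and tokenizer loaded above with a made-up French sentence.

# Source language is now French
tokenizer.src_lang = "fr_XX"

# Encode an illustrative French sentence
encoded_fr = tokenizer("Ceci est un article sur Medium.", return_tensors="pt")

# Force the decoder to start generating in English
generated_tokens = model.generate(**encoded_fr, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])

# Decode the translation and return the output
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)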

In this post we introduced Hugging Face, an open-source AI community used by and for many machine learning practitioners in NLP, computer vision and audio/speech processing tasks. We then walked through the implementation of such pre-trained models within the Hugging Face platform to accomplish downstream NLP tasks, such as text generation, question answering, sentiment analysis, text classification, text summarization and machine translation.

If you found this post helpful, please follow me on Medium and subscribe to receive my latest posts!

(All images, unless otherwise noted, are by the author.)

