
How Few-Shot Learning is Automating Document Labeling | by Walid Amamou | Apr, 2023



Photo by DeepMind on Unsplash

Manual document labeling is a time-consuming, tedious process that often requires significant resources and is prone to errors. However, recent advances in machine learning, particularly the technique known as few-shot learning, are making it easier to automate the labeling process. Large Language Models (LLMs) in particular are excellent few-shot learners thanks to their emergent capability for in-context learning.

In this article, we’ll take a closer look at how few-shot learning is transforming document labeling, specifically for Named Entity Recognition (NER), one of the most important tasks in document processing. We will show how UBIAI’s platform makes it easier than ever to automate this critical task using few-shot labeling techniques.

Few-shot learning is a machine learning technique that enables models to learn a given task from only a few labeled examples. Without modifying its weights, the model can be steered toward a specific task by concatenating training examples of that task in its input and asking the model to predict the output for a target text. Here is an example of few-shot learning for the task of Named Entity Recognition (NER) using three examples:

###Prompt
Extract entities from the following sentences without changing original words.

###
Sentence: " and storage components. 5+ years of experience delivering scalable and resilient services at large enterprise scale, including experience in data platforms including large-scale analytics on relational, structured and unstructured data. 3+ years of experience as a SWE/Dev/Technical lead in an agile environment including 1+ years of experience operating in a DevOps model. 2+ years of experience designing secure, scalable and cost-efficient PaaS services on the Microsoft Azure (or similar) platform. Expert understanding of"
DIPLOMA: none
DIPLOMA_MAJOR: none
EXPERIENCE: 3+ years, 5+ years, 5+ years, 5+ years, 3+ years, 1+ years, 2+ years
SKILLS: designing, delivering scalable and resilient services, data platforms, large-scale analytics on relational, structured and unstructured data, SWE/Dev/Technical, DevOps, designing, PaaS services, Microsoft Azure
###

Sentence: "8+ years demonstrated experience in designing and developing enterprise-level scale services/solutions. 3+ years of leadership and people management experience. 5+ years of Agile Experience Bachelors degree in Computer Science or Engineering, or a related field, or equivalent alternative education, skills, and/or practical experience Other 5+ years of full-stack software development experience to include C# (or similar) experience with the ability to contribute to technical architecture across web, mobile, middle tier, data pipeline"
DIPLOMA: Bachelors
DIPLOMA_MAJOR: Computer Science
EXPERIENCE: 8+ years, 3+ years, 5+ years, 5+ years, 5+ years, 3+ years
SKILLS: designing, developing enterprise-level scale services/solutions, leadership and people management experience, Agile Experience, full-stack software development, C#, designing
###

Sentence: "5+ years of experience in software development. 3+ years of experience in designing and developing enterprise-level scale services/solutions. 3+ years of experience in leading and managing teams. 5+ years of experience in Agile Experience. Bachelors degree in Computer Science or Engineering, or a related field, or equivalent alternative education, skills, and/or practical experience."
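A prompt like the one above can be assembled programmatically. Here is a minimal sketch (a hypothetical helper, not UBIAI's actual implementation) that concatenates the instruction, the "###"-delimited labeled examples, and the target sentence whose labels the model should complete:

```python
def build_ner_prompt(instruction, examples, target_sentence):
    """Assemble a few-shot NER prompt: instruction, then '###'-delimited
    labeled examples, then the target sentence for the model to label."""
    parts = [instruction]
    for ex in examples:
        block = 'Sentence: "{}"'.format(ex["sentence"])
        for label, values in ex["labels"].items():
            # empty label lists are rendered as 'none', matching the format above
            block += "\n{}: {}".format(label, ", ".join(values) if values else "none")
        parts.append(block)
    # the final block carries only the target sentence; the model completes the labels
    parts.append('Sentence: "{}"'.format(target_sentence))
    return "\n###\n".join(parts)

examples = [{
    "sentence": "5+ years of experience in software development.",
    "labels": {"DIPLOMA": [], "EXPERIENCE": ["5+ years"],
               "SKILLS": ["software development"]},
}]
prompt = build_ner_prompt(
    "Extract entities from the following sentences without changing original words.",
    examples,
    "3+ years of leadership experience.",
)
print(prompt)
```

The completion returned for the final block can then be mapped back onto the document as labels.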

The prompt typically begins by instructing the model to perform a specific task, such as “Extract entities from the following sentences without changing original words.” Notice that we’ve added the instruction “without changing original words” to prevent the LLM from hallucinating random text, which it is notoriously prone to doing. This has proven critical for obtaining consistent responses from the model.

The few-shot learning phenomenon has been studied extensively in this article, which I highly recommend. Essentially, the paper demonstrates that, under mild assumptions, the model’s pretraining distribution is a mixture of latent tasks that can be efficiently learned through in-context learning. In this view, in-context learning is more about identifying the task than about learning it by adjusting the model’s weights.

Few-shot learning has an excellent practical application in the data labeling space, often referred to as few-shot labeling. Here, we provide the model with a few labeled examples and ask it to predict the labels of subsequent documents. However, integrating this capability into a functional data labeling platform is easier said than done. Here are a few challenges:

  • LLMs are inherently text generators and tend to produce variable output. Prompt engineering is critical to make them generate predictable output that can later be used to auto-label the data.
  • Token limitation: LLMs such as OpenAI’s GPT-3 are limited to roughly 4,000 tokens per request, which limits the length of the documents that can be sent at once. Chunking and splitting the data before sending the request becomes essential.
  • Span offset calculation: after receiving the output from the model, we need to search for each predicted entity’s occurrences in the document and label them correctly.
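The last two challenges are straightforward to sketch in code. Below is a simplified illustration (hypothetical helpers, not UBIAI's production code): a naive word-based chunker as a rough proxy for token-limit splitting, and a span finder that turns a predicted entity string into character offsets for annotation:

```python
def chunk_text(text, max_words=300):
    """Naive word-based chunking to stay under the model's token limit.
    Word count is only a rough proxy; exact limits depend on the tokenizer."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def find_spans(document, entity):
    """Return (start, end) character offsets of every occurrence of a
    predicted entity string, so it can be labeled in place."""
    spans = []
    start = document.find(entity)
    while start != -1:
        spans.append((start, start + len(entity)))
        start = document.find(entity, start + 1)
    return spans

doc = "5+ years of Python. 3+ years of leadership. 5+ years of Agile."
print(find_spans(doc, "5+ years"))  # two occurrences, as offsets
```

Exact string matching is why the “without changing original words” instruction matters: if the model paraphrases an entity, its span can no longer be located in the source document.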

We’ve recently added a few-shot labeling capability by integrating OpenAI’s GPT-3 Davinci into the UBIAI annotation tool. The tool currently supports few-shot NER for unstructured and semi-structured documents such as PDFs and scanned images.

To get started:

  1. Simply label 1–5 examples
  2. Enable few-shot GPT model
  3. Run prediction on a new unlabeled document
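Behind step 3, the model’s completion comes back in the line-oriented LABEL: value format shown earlier and has to be parsed back into structured labels. A minimal parsing sketch (a hypothetical helper, assuming that output format):

```python
def parse_ner_output(completion):
    """Parse a line-oriented completion ('LABEL: v1, v2, ...') into a
    dict mapping each label to its list of entity strings."""
    labels = {}
    for line in completion.strip().splitlines():
        if ":" not in line:
            continue  # skip anything that isn't a 'LABEL: values' line
        label, _, values = line.partition(":")
        values = values.strip()
        labels[label.strip()] = (
            [] if values == "none" else [v.strip() for v in values.split(",")]
        )
    return labels

out = ("DIPLOMA: Bachelors\n"
       "DIPLOMA_MAJOR: Computer Science\n"
       "EXPERIENCE: 8+ years, 3+ years\n"
       "SKILLS: none")
print(parse_ner_output(out))
```

Note the fragility here: commas inside an entity (e.g. a skill phrase containing a comma) would be split incorrectly, which is one reason prompt engineering for predictable output is so important.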

Here is an example of few-shot NER on a job description, with five examples provided:

Image by Author: Few Shot NER on unstructured text

The GPT model accurately predicts most entities with just five in-context examples. Because LLMs are trained on vast amounts of data, this few-shot learning approach can be applied to various domains, such as legal, healthcare, HR, insurance documents, etc., making it an extremely powerful tool.

However, the most surprising aspect of few-shot learning is its adaptability to semi-structured documents with limited context. In the example below, I provided GPT-3 with only one labeled, OCR’d invoice example and asked it to label the next one. The model surprisingly predicted many entities accurately. With even more examples, it does an exceptional job of generalizing to semi-structured documents as well.

Image by Author: Few Shot NER on PDF

Few-shot learning is revolutionizing the document labeling process. By integrating few-shot labeling capabilities into functional data labeling platforms, such as UBIAI’s annotation tool, it is now possible to automate critical tasks like Named Entity Recognition (NER) in unstructured and semi-structured documents. This does not imply that LLMs will replace human labelers anytime soon. Instead, they augment human labelers’ capabilities by making them more efficient. With the power of few-shot learning, LLMs can label vast amounts of data across multiple domains, such as legal, healthcare, HR, and insurance documents, to train smaller, more accurate specialized models that can be deployed efficiently.

We’re currently adding support for few-shot relation extraction and document classification. Stay tuned!

Follow us on Twitter @UBIAI5 or subscribe here!



