
Large Language Models as Zero-shot Labelers

By Devin Soni | March 2023



Using LLMs to obtain labels for supervised models


Labeling data is a critical step in building supervised machine learning models, as the quantity and quality of labels is often the main factor that determines model performance.

However, labeling data can be very time-consuming and expensive, especially for complex tasks that involve domain knowledge or reading large amounts of data.

In recent years, large language models (LLMs) have emerged as a powerful solution for obtaining labels on text data. Through zero-shot learning, we can obtain labels for unlabeled data using only the output of the LLM, rather than asking human annotators to provide them. This can significantly lower the cost of obtaining labels, and it makes the process far more scalable.

In this article, we will further explore the concept of zero-shot learning and how LLMs can be used for this purpose.


Zero-shot learning (ZSL) is a problem setup in machine learning in which the model is asked to solve a prediction task that it was not trained on. This often involves recognizing or classifying data into concepts it had not explicitly seen during training.

In traditional supervised learning, this is not possible, as the model can only output predictions for tasks it was trained on (i.e., tasks it had labels for). However, in the ZSL paradigm, models can generalize to an arbitrary unseen task and perform at a reasonable level. Note that in most cases, a supervised model trained on a given task will still outperform a model using ZSL, so ZSL is most useful before supervised labels are readily available.

One of the most promising applications of ZSL is in data labeling, where it can significantly reduce the cost of obtaining labels. If a model is able to automatically classify data into categories without having been trained on that task, it can be used to generate labels for a downstream supervised model. These labels can be used to bootstrap a supervised model, in a paradigm similar to active learning or human-in-the-loop machine learning.


LLMs like GPT-3 are powerful tools for ZSL because their robust pre-training process allows them to have a holistic understanding of natural language that is not based on a certain supervised task’s labels.

Embedding search

The contextual embeddings of an LLM are able to capture the semantic concepts in a given piece of text, which makes them very useful for ZSL.

Libraries such as sentence-transformers offer LLMs that have been trained so that semantically similar pieces of text have embeddings that are close to each other.

If we have the embeddings for a few labeled pieces of data, we can use a nearest-neighbor search to find pieces of unlabeled data with similar embeddings.

If two pieces of text are very close to each other in the embedding space, then they likely have the same label.
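Below is a minimal sketch of this idea using sentence-transformers and scikit-learn. The model name, example sentences, and single-neighbor lookup are illustrative assumptions, not details prescribed by the article.

```python
# Label propagation via embedding nearest-neighbor search (illustrative sketch).
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model works

# A handful of labeled seed examples.
labeled_texts = ["I love this product", "Terrible customer service"]
labels = ["positive", "negative"]

# Unlabeled data we want cheap labels for.
unlabeled_texts = ["Absolutely fantastic experience", "Would not recommend at all"]

labeled_emb = model.encode(labeled_texts, normalize_embeddings=True)
unlabeled_emb = model.encode(unlabeled_texts, normalize_embeddings=True)

# For each unlabeled point, copy the label of its nearest labeled neighbor.
nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(labeled_emb)
_, idx = nn.kneighbors(unlabeled_emb)
pseudo_labels = [labels[i[0]] for i in idx]
print(list(zip(unlabeled_texts, pseudo_labels)))
```

In practice you would likely use more than one neighbor (e.g., a majority vote over k neighbors) and a distance threshold to avoid assigning labels to points that are far from every labeled example.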

In-context learning

In-context learning is an emergent ability of LLMs that allows them to learn to solve new tasks simply by seeing input-output pairs. No parameter updates are needed for the model to learn arbitrary new tasks.

We can use this ability to obtain labels by providing a few input-output pairs for our downstream task and then asking the model to label the remaining unlabeled data points.

In the context of ZSL, this means that we can provide a few handcrafted examples of text with their associated supervised labels, and have the model learn the labeling function on-the-fly.

As a trivial example, we can teach ChatGPT to classify whether or not a sentence is about frogs via in-context learning, as sketched below.
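Here is a rough sketch of what such a few-shot labeling prompt could look like. The `llm_complete` function is a hypothetical placeholder for whichever completion API you use (it is not a real library call), and the frog examples are toy data.

```python
# Few-shot labeling via in-context learning (illustrative sketch).
FEW_SHOT_PROMPT = """Classify whether each sentence is about frogs. Answer 'Yes' or 'No'.

Sentence: The frog jumped into the pond.
Label: Yes

Sentence: I bought a new laptop yesterday.
Label: No

Sentence: {sentence}
Label:"""


def label_sentence(sentence: str, llm_complete) -> str:
    """Ask the LLM to continue the prompt with a label for the new sentence.

    `llm_complete` is assumed to be a callable that takes a prompt string and
    returns the model's completion as a string.
    """
    prompt = FEW_SHOT_PROMPT.format(sentence=sentence)
    return llm_complete(prompt).strip()
```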

Generative models

Recent advances in alignment methods such as RLHF (Reinforcement Learning from Human Feedback) in generative LLMs have made it possible to simply ask the model to label data for you.

Models such as ChatGPT are able to provide labels for input data by simply replying (in language) with the desired label. Their vast knowledge of the world, obtained through pre-training on such large amounts of data, has endowed these models with the ability to solve novel tasks using only their semantic understanding of the question being asked.

This process can be automated using open-source models such as FLAN-T5 by asking the model to respond with only items in your label set (e.g. "Respond with 'Yes' or 'No'") and checking which option has the highest output probability, as sketched below.
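A hedged sketch of this idea with Hugging Face transformers follows. The specific checkpoint (`google/flan-t5-base`), the prompt wording, and the choice to score only the first generated token of each option are assumptions made for illustration.

```python
# Constrained labeling with FLAN-T5: score each allowed answer (illustrative sketch).
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")


def label_yes_no(text: str) -> str:
    prompt = f"Is the following sentence about frogs? Respond with 'Yes' or 'No'.\n\n{text}"
    inputs = tokenizer(prompt, return_tensors="pt")

    # Score each candidate label by the probability of its first generated token.
    decoder_start = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)

    scores = {
        option: probs[tokenizer(option, add_special_tokens=False).input_ids[0]].item()
        for option in ["Yes", "No"]
    }
    return max(scores, key=scores.get)


print(label_yes_no("The frog croaked loudly by the pond."))
```

Restricting the answer to the label set in this way avoids having to parse free-form generations, and the per-option probabilities can double as a rough confidence score for filtering noisy labels.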

ChatGPT can not only provide a label, but also explain its reasoning for that label.

Labeling data is a critical step in supervised machine learning, but it can be costly to obtain large amounts of labeled data.

With zero-shot learning and LLMs, we can significantly reduce the cost of label acquisition.

LLMs pre-trained on huge amounts of data encode a semantic understanding of the world's information that allows them to perform well on arbitrary, unseen tasks. These models can automatically label data for us with high accuracy, allowing us to bootstrap supervised models at a low cost.


