
How to Use Large Language Models (LLM) in Your Own Domains | by Eileen Pangu | Mar, 2023



A few canonical, research-proven techniques for adapting large language models to domain-specific tasks, and the intuition behind why they are effective.


Ever since being popularized by ChatGPT in late 2022, large language models (LLMs) have attracted intense interest from the research and industry communities. While a general-purpose chatbot is an obvious application of large language models, enterprises are also thinking about how to integrate them into their business workflows to leverage this latest AI advance. A fundamental prerequisite of business-specific integration, though, is being able to adapt the large language model to the bespoke business domain, since LLMs are usually trained on open internet data, which contains a lot of noise and is not always closely relevant to the specific business context.

While there are many good blog posts out there detailing large language models themselves, there seems to be a lack of solid introductions to how to leverage LLMs. In this blog post, we examine a few canonical ways, drawn from recent research literature, of adapting LLMs to domain-specific tasks. The goal is to spark some inspiration to actually democratize LLMs and make them accessible to the wider world.

The scenario this blog post postulates is that you somehow get hold of a general-purpose large language model that is already pretrained, and you can access all the parameters in the model. The model may be open source, a commercial offering, obtained through a partnership with another organization (e.g., Google’s PaLM or OpenAI’s GPT-3), or trained from scratch by your own organization. Now you have a variety of tasks (Q&A, summarization, reasoning, etc.) in a specific business context that you want to build on top of the large language model.

Traditional Fine-tuning

The traditional method of adapting a general machine learning model to a specific task is to use labeled data from the specific domain to uptrain the general model end-to-end. During the uptraining, some or all of the learnable parameters in the model are fine-tuned via backpropagation. This type of fine-tuning is often undesirable with large language models. Today's LLMs are simply too big; some have hundreds of billions of parameters. End-to-end fine-tuning not only consumes a huge amount of computational resources, but also requires a decent amount of domain-specific labeled data, which is expensive to acquire. As the field of AI advances, models are likely only getting larger, making it increasingly cumbersome to fine-tune the entire model end-to-end for every single bespoke task.

One form of end-to-end fine-tuning that is often desired, though, is instruction fine-tuning [1]. Large language models are typically trained on general text. A simple analogy is that the LLM is like a person who has read tons of books (make a note of this analogy; we'll keep referring back to it to build our intuition), but he does not know what to do with all that knowledge. The purpose of instruction fine-tuning is to get the model into the habit of performing some common tasks. This is done by prefixing the input with templated instructions such as “answer the following question”, “summarize the following document”, “compute the result of”, “translate this sentence”, etc. The output is then the expected outcome of those instructions. Using this kind of input/output pair to fine-tune the model end-to-end makes the model more amenable to “taking action” on future input. Note that instruction fine-tuning does not need to be domain specific unless your domain requires an unusual “action”. And it’s likely that the pretrained large language model you have is already instruction-fine-tuned (such as Google’s Flan-PaLM).
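To make this concrete, below is a minimal, hypothetical sketch (in Python) of how raw task examples might be wrapped with templated instructions to build instruction-tuning pairs. The template strings, task names, and function names are illustrative assumptions, not any particular library's API.

```python
# Minimal, illustrative sketch of building instruction-tuning pairs.
# The templates and field names below are assumptions for illustration,
# not a specific framework's API.

INSTRUCTION_TEMPLATES = {
    "qa": "Answer the following question:\n{text}",
    "summarize": "Summarize the following document:\n{text}",
    "translate": "Translate this sentence:\n{text}",
}

def to_instruction_pair(task: str, text: str, target: str) -> dict:
    """Wrap a raw (text, target) example with a templated instruction prefix."""
    return {
        "input": INSTRUCTION_TEMPLATES[task].format(text=text),
        "output": target,
    }

# Example: a tiny instruction-tuning dataset built from raw task examples.
dataset = [
    to_instruction_pair("qa", "What is the capital of France?", "Paris."),
    to_instruction_pair("summarize", "Large language models are ...", "LLMs in one line."),
]
```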

Prompting

Before going into the methods of adapting LLMs to domain-specific tasks, we need to introduce the idea of prompting, which the rest of the blog post is based on.

Prompting is how we interact with LLMs. LLMs are effectively sequence-to-sequence text generators. You can think of them as recurrent neural networks if that helps build intuition, though note that today's state-of-the-art LLMs are built on the Transformer architecture, more specifically the decoder part of the Transformer, which is not an RNN. The prompt is the input sequence to the model.

Going back to our “knowledgeable person” analogy above, prompting is the act of asking the person questions. Obviously, to get useful answers, your questions need to be good. There are some resources online about how to ask clear and specific questions to elicit a good answer from LLMs. Those are useful tips, but they are not the kind of domain-adapting prompts we'll cover in this blog post.

Come to think of it, why does “prompting” work? Because the model is trained to condition its output on the input sequence. In the case of LLMs trained on the open internet, all of human “knowledge” is packed inside the model, reincarnated as numbers. Prompting sets up the mathematical conditions so that an appropriate output can be constructed. The best mathematical conditions may not be “clear and specific” in the traditional sense, although that's nonetheless a good general rule. And most importantly, as you may have guessed, those mathematical conditions are domain specific, and they are what you should focus on tuning to adapt the LLM to your domain. The model parameters themselves, however, stay unchanged. Using our “knowledgeable person” analogy again: the person is knowledgeable already, so there is no need to change him. In fact, since he has acquired all of human knowledge, he already possesses the latent knowledge of your domain, as your domain is ultimately built on human knowledge.

The Art and Science of Prompting

So how should we prompt the model in a way that fine-tunes it to the specific business domain? The following are a few canonical ways.

Few-shot Exemplars

The simplest, yet very effective, way is to provide a few examples in the prompt. The academic term is few-shot learning by exemplars [2]. To illustrate with a simple example, let's say the task you want to perform is arithmetic calculation.

Input:
Jane has 2 apples. She bought 3 more. How many apples does she have in total?

Expected Output:
The answer is 5.

Now, if you just feed the above input to the LLM, you probably won't get the correct result, because the “knowledgeable person”, though he possesses the ability to do arithmetic, does not know that he is being asked to do arithmetic. So what you need to do is encode a few examples of what you want from the LLM directly in the input.

Input:
Joe has 3 oranges and he got 1 more. How many oranges does he have?
The answer is 4.
Jack has 8 pears and he lost 1. How many pears does Jack have now?
The answer is 7.
Jane has 2 apples. She bought 3 more. How many apples does she have in total?

Expected Output:
The answer is 5.

The final output just answers the last question in the input. However, LLMs can condition on the prior text in the input to get a hint of “what to do”. Obviously, the final interface of your task just accepts the real question from users; the examples are prepended to the user question behind the scenes. You'll then need to experiment a bit to find a few reasonably good examples to use as the prefix of the model input.
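As a rough illustration, here is a minimal sketch of how that behind-the-scenes assembly might look in code. The exemplar text is taken from the example above; call_llm is a hypothetical placeholder for whatever model endpoint you actually use.

```python
# Minimal sketch of assembling a few-shot prompt behind the scenes.

FEW_SHOT_PREFIX = """\
Joe has 3 oranges and he got 1 more. How many oranges does he have?
The answer is 4.
Jack has 8 pears and he lost 1. How many pears does Jack have now?
The answer is 7.
"""

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM completion call (e.g., an HTTP request)."""
    raise NotImplementedError("plug in your model endpoint here")

def answer(user_question: str) -> str:
    # The exemplars are prepended to the real user question behind the scenes.
    prompt = FEW_SHOT_PREFIX + user_question + "\n"
    return call_llm(prompt)

# answer("Jane has 2 apples. She bought 3 more. How many apples does she have in total?")
# would ideally return "The answer is 5."
```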

Chain-of-Thought

Building on the few-shot exemplars above, we want to tell the LLM not only “what to do” but also “how to do it”. This can be achieved via chain-of-thought prompting. The intuition is that if the “knowledgeable person” sees a few examples of how to do the task, he will mimic the “reasoning” as well. So the above arithmetic scenario becomes:

Input:
Joe has 3 oranges and he got 1 more. How many oranges does he have?
Starting with 3 oranges, then add 1, the result is 4.
The answer is 4.
Jack has 8 pears and he lost 1. How many pears does Jack have now?
Starting with 8 pears, then minus 1, the result is 7.
The answer is 7.
Jane has 2 apples. She bought 3 more. How many apples does she have in total?

Expected Output:
Starting with 2 apples, then add 3, the result is 5.
The answer is 5.

Research [2] has shown that chain-of-thought prompting significantly boosts the performance of LLMs. And you get to pick whether you want to surface the reasoning part (“Starting with 2 apples, then add 3, the result is 5”) to end users.
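If it helps, here is a small, hypothetical sketch of chain-of-thought prompting with that choice of hiding or surfacing the reasoning. Again, call_llm is a placeholder, and the answer-extraction logic simply assumes completions end with a line of the form “The answer is ...”.

```python
# Minimal sketch of chain-of-thought prompting, with the choice of whether
# to surface the reasoning to end users.

COT_PREFIX = """\
Joe has 3 oranges and he got 1 more. How many oranges does he have?
Starting with 3 oranges, then add 1, the result is 4.
The answer is 4.
Jack has 8 pears and he lost 1. How many pears does Jack have now?
Starting with 8 pears, then minus 1, the result is 7.
The answer is 7.
"""

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM completion call."""
    raise NotImplementedError("plug in your model endpoint here")

def answer_with_cot(user_question: str, show_reasoning: bool = False) -> str:
    completion = call_llm(COT_PREFIX + user_question + "\n")
    if show_reasoning:
        return completion
    # Keep only the final "The answer is ..." line for end users.
    answer_lines = [l for l in completion.splitlines() if l.startswith("The answer is")]
    return answer_lines[-1] if answer_lines else completion
```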

To further improve the results of chain-of-thought prompting, recognize that there are usually multiple reasoning paths to the same result. Humans can solve a problem in multiple ways, and if multiple solutions lead to the same result, we have more confidence in the result. This intuition can again be incorporated. LLMs are probabilistic models; the output is sampled each time, so if you run the model multiple times, the output may differ. What we are interested in are the outputs where the reasonings are different while the final answers are the same. This models the human thought process of deliberately solving a problem in multiple ways and gaining confidence from the “self-consistency” [3] of the output.

Input:
[same]

Output1:
Starting with 2 apples, then add 3, the result is 5.
The answer is 5. [correct]

Output2:
2 apples and 3 apples make 6 apples.
The answer is 6. [incorrect]

Output3 [repeat of final result 5 with a new reasoning - good]:
2 apples plus 3 apples equal 5 apples.
The answer is 5. [correct]

Output4 [repeat of final result 6 with identical reasoning - ignore]:
2 apples and 3 apples make 6 apples.
The answer is 6. [incorrect]

Simply put, the final answer supported by the largest number of distinct reasonings wins. Output1 and Output3 above nail the correct final answer, 5.
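A minimal sketch of that voting scheme might look like the following. The sampling function is a placeholder for a temperature-sampled completion call, and the answer extraction assumes the “The answer is ...” format used above; both are assumptions for illustration.

```python
# Minimal sketch of self-consistency: sample several chain-of-thought
# completions and let distinct reasoning paths vote on the final answer.
from collections import Counter

def sample_llm(prompt: str) -> str:
    """Placeholder: one temperature-sampled completion (reasoning + answer)."""
    raise NotImplementedError("plug in your model endpoint here")

def extract_answer(completion: str) -> str:
    """Assumes the completion ends with a line like 'The answer is 5.'"""
    return completion.rsplit("The answer is", 1)[-1].strip(" .\n")

def self_consistent_answer(prompt: str, num_samples: int = 5) -> str:
    votes = Counter()
    seen_reasonings = set()
    for _ in range(num_samples):
        completion = sample_llm(prompt)
        if completion in seen_reasonings:
            continue  # identical reasoning adds no new evidence (like Output4 above)
        seen_reasonings.add(completion)
        votes[extract_answer(completion)] += 1
    # The answer supported by the most distinct reasonings wins.
    best_answer, _ = votes.most_common(1)[0]
    return best_answer
```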

Learnable Prompts

The above methods only use a few examples from your domain-specific labeled dataset. But if you have more data, you naturally want to make good use of it. The other question you should ask is: how can you be sure the examples you pick are mathematically the best? This is where learnable prompts come in.

The key insight is that the prefix in your input does not have to come from a fixed vocabulary. At the end of the day, each token in the input is transformed into an embedding before being fed to the model. An embedding is just a vector of numbers, and the best numbers can be learned.

What you'll do is have a set of prefix tokens before the real input. Those prefix tokens can be initialized by sampling words from your domain-specific vocabulary. The embeddings of those prefix tokens are then updated via backpropagation. The model parameters themselves are still frozen, but the gradients are propagated from the expected-vs-actual output delta, through the model parameters, all the way to the input layer, where those prefix token embeddings are updated. After running the training over your domain-specific labeled dataset, the learned token embeddings become your fixed input prefix at inference time. In a sense, this is a kind of soft prompt: the input prefix is no longer constrained to be drawn empirically from a fixed vocabulary. Rather, it is optimized mathematically for your domain. This is called prompt tuning [4] (Figure-1).

Figure-1 Prompt Tuning: a trainable prefix is concatenated with the domain-specific input to form the model input. Gradients are propagated from the output layer, through the model parameters, to the input layer to update the trainable prefix. All other parameters in the architecture stay unchanged.
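For intuition, here is a minimal PyTorch-style sketch of the idea, under the assumption that the frozen model can be called directly on input embeddings; the FrozenLM interface here is hypothetical, not a real library API.

```python
# Minimal PyTorch-style sketch of prompt tuning. The frozen_lm module stands in
# for your actual decoder-only LLM; calling it directly on input embeddings is
# an assumption made for illustration.
import torch
import torch.nn as nn

class PromptTunedLM(nn.Module):
    def __init__(self, frozen_lm: nn.Module, embed_dim: int, prefix_len: int = 20):
        super().__init__()
        self.lm = frozen_lm
        for p in self.lm.parameters():
            p.requires_grad = False  # the LLM itself stays unchanged
        # The only trainable parameters: one embedding vector per prefix position.
        self.prefix = nn.Parameter(torch.randn(prefix_len, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim), the already-embedded domain input.
        batch_size = input_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch_size, -1, -1)
        return self.lm(torch.cat([prefix, input_embeds], dim=1))

# Training sketch: only the prefix embeddings receive gradient updates.
# optimizer = torch.optim.Adam([model.prefix], lr=1e-3)
```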

Epilogue

This blog post provides an intuitive explanation of common and effective fine-tuning mechanisms that you can employ to adapt large language models (LLMs) to your domain-specific tasks in a data-efficient and compute-efficient way. This is a rapidly changing field, and new research results are coming out at a dazzling pace. But hopefully this blog post provides some solid ground for you to get started with making use of LLMs.

References

[1] Scaling Instruction-Finetuned Language Models https://arxiv.org/abs/2210.11416

[2] Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165

[3] Self-Consistency Improves Chain of Thought Reasoning in Language Models https://arxiv.org/abs/2203.11171

[4] The Power of Scale for Parameter-Efficient Prompt Tuning https://arxiv.org/abs/2104.08691

