
When Should You Fine-Tune LLMs? | by Skanda Vivek | May 2023



The problem of supplying the model with all the information it needs to answer questions is now offloaded from the model architecture to a database containing document chunks.

The relevant documents can then be found by computing similarities between the question and the document chunks. This is typically done by converting the chunks and the question into embedding vectors, computing cosine similarities between each chunk and the question, and keeping only those chunks above a certain similarity threshold as relevant context.
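
To make this concrete, below is a minimal sketch of the retrieval step. It assumes the sentence-transformers library; the embedding model name and the 0.7 cutoff are illustrative choices, not prescriptions from this article.

import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative embedding model choice; any sentence-embedding model would do.
model = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_context(question, chunks, threshold=0.7):
    # Embed the question and all chunks (returns numpy arrays).
    question_vec = model.encode(question)
    chunk_vecs = model.encode(chunks)
    # Cosine similarity = dot product of L2-normalized vectors.
    question_vec = question_vec / np.linalg.norm(question_vec)
    chunk_vecs = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = chunk_vecs @ question_vec
    # Keep only the chunks above the similarity threshold as relevant context.
    return [c for c, s in zip(chunks, sims) if s >= threshold]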

Finally, the question and context can be combined into a prompt as below, and fed into an LLM API like ChatGPT:

prompt=f"Answer the question. Context: {context}\n Question: {question}"

You might ask: why not feed the entire documents and the question into the prompt, instead of splitting the documents into chunks? The reason is that LLMs have limits on maximum input and output length. For ChatGPT, this limit is 4,096 tokens, or roughly 6–7 pages of text.
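
One simple (if naive) way to produce such chunks is to split on token counts, for example with the tiktoken library. The 500-token chunk size below is an arbitrary illustrative choice; production pipelines often split on sentence or section boundaries instead:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def chunk_document(text, max_tokens=500):
    # Encode to tokens, slice into fixed-size windows, decode back to text.
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]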

Offloading documents to a database and querying a closed LLM API can work well when the answer is plainly present in those documents. But there are a number of cases where this yields suboptimal results, such as when you want ChatGPT to give an expert perspective on a niche topic, for example speculating on the financial outlook of the U.S. economy.

Ok, so you tried out ChatGPT or Bard, and you didn't like it. The answers were a bit off. For example, you asked ChatGPT:

“How long is the U.S. economy likely to have inflation?”

And it returned:

As an AI language model, I cannot predict the future with certainty, especially with regards to complex economic systems. However, I can provide you with some information and perspectives that may help you understand the current situation.

Inflation is a complex phenomenon that can be caused by a variety of factors such as an increase in demand, supply constraints,….

Unfortunately that is not good enough for you. Of course you can feed ChatGPT some recent news about Fed Chairman Jerome Powell’s statements. But that does not give you the same rich domain experience that you would get if you spoke to — well, Jerome Powell, who else! Or another expert.

Think about what it takes to be an expert in a certain field. While some amount of this is reading books on the topic, a lot is also interacting with subject matter experts in the field, and learning from experience. While ChatGPT has been trained on an incredible number of finance books, it probably hasn’t been trained by top financial experts or experts in other specific fields. So how would you make an LLM that is an “expert” in the finance sector? This is where fine-tuning comes in.

Before I discuss fine-tuning LLMs, let's talk about fine-tuning smaller language models like BERT, which was commonplace before LLMs. For models like BERT and RoBERTa, fine-tuning amounts to passing in some context and labels for a well-defined task, such as extracting answers from a context or classifying emails as spam vs. not spam. I've written a couple of blog posts on fine-tuning such language models that might be useful if you are interested.
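
As a refresher, here is a minimal sketch of that style of fine-tuning: spam classification with BERT via the Hugging Face transformers library. The two training examples are placeholders; a real dataset would have thousands:

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder labeled examples: text plus a 0/1 label (not spam / spam).
data = Dataset.from_list([
    {"text": "Win a free cruise now!!!", "label": 1},
    {"text": "Minutes from Tuesday's meeting attached.", "label": 0},
])

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=64)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="spam-bert", num_train_epochs=3),
    train_dataset=data.map(tokenize, batched=True),
)
trainer.train()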

However, the reason large language models (LLMs) are all the rage is that they can perform many tasks seamlessly just by changing how you frame the prompt, and the experience is similar to talking with a person on the other end. What we want now is to fine-tune an LLM to be an expert in a certain subject that can engage in conversation like a "person." This is quite different from fine-tuning a model like BERT on specific tasks.

One of the earliest open-source breakthroughs came from a group of Stanford researchers who fine-tuned the 7B LLaMA model (released by Meta earlier that year) on 52K instructions for less than $600, calling the result Alpaca. Soon after, the Vicuna team released a 13-billion-parameter model that achieves roughly 90% of ChatGPT quality.

Very recently, the MPT-7B transformer was released, able to ingest 65k tokens, 16x the input size of ChatGPT! Its training was done from scratch over 9.5 days for about $200k. As an example of a domain-specific LLM, Bloomberg released BloombergGPT, a GPT-style model built for finance and also trained from scratch.

Recent advancements in training and fine-tuning open-source models are just the beginning for small and medium-sized companies looking to enrich their offerings with customized LLMs. So how do you decide when it makes sense to fine-tune, or to train an entire domain-specific LLM?

First off, it is important to clearly establish the limitations of closed-source LLM APIs in your domain, and to make the case for letting customers chat with a domain expert at a fraction of the cost. Fine-tuning a model on a hundred thousand instructions or so is not very expensive, but getting the right instructions requires careful thought. This is also where you need to be a bit bold: I can't yet think of many areas where a fine-tuned model has been shown to perform significantly better than ChatGPT on domain-specific tasks, but I believe this is right around the corner, and any company that does this well will be rewarded.

Which brings me to the case for training an LLM completely from scratch. Yes, this could easily cost upwards of hundreds of thousands of dollars, but if you make a solid case, investors will be glad to pitch in. In a recent interview with IBM, Hugging Face CEO Clem Delangue commented that customized LLMs could soon be as commonplace as proprietary codebases, and a significant component of what it takes to be competitive in an industry.

LLMs applied to specific domains can be extremely valuable in industry. There are three levels of increasing cost and customizability:

  1. Closed-source APIs + document embedding database: This first solution is probably the easiest to get started with, and given the high quality of the ChatGPT API, it might even give you good enough (if not the best) performance. And it's cheap!
  2. Fine-tune LLMs: Recent progress in fine-tuning LLaMA-like models has shown that it costs ~$500 to reach baseline performance similar to ChatGPT in certain domains. It could be worthwhile if you have a database of ~50–100k instructions or conversations to fine-tune a baseline model on (see the sketch after this list).
  3. Train from scratch: As LLaMA and the more recent MPT-7B have shown, this costs ~$100–200k and takes a week or two.
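
To make option 2 concrete, here is a heavily simplified sketch of Alpaca-style instruction fine-tuning of a causal LM with Hugging Face transformers. The checkpoint path, prompt template, and data are placeholders, and real recipes add details omitted here (loss masking of prompt tokens, parameter-efficient methods like LoRA, multi-GPU setup):

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "path/to/llama-7b"  # placeholder: a LLaMA checkpoint in HF format
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder instruction data; Alpaca used ~52K such pairs.
data = Dataset.from_list([
    {"instruction": "Summarize the Fed's latest rate decision.",
     "response": "The Fed raised rates by 25 basis points..."},
])

def tokenize(example):
    text = (f"### Instruction:\n{example['instruction']}\n"
            f"### Response:\n{example['response']}")
    tokens = tokenizer(text, truncation=True, padding="max_length", max_length=512)
    # Causal LM objective: the labels are the input tokens themselves.
    # (Real setups mask prompt and padding positions with -100.)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finance-llama", num_train_epochs=3),
    train_dataset=data.map(tokenize, remove_columns=data.column_names),
)
trainer.train()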

Now that you have the knowledge, go forth and build your custom domain-specific LLM applications!

