
Estimating the Cost of Training LLMs



Photo by Pixabay

In the recent past, machine learning was considered a complex, niche technology that only a select few could comprehend. However, as ML applications have become more powerful, public interest has surged, leading to a flood of content about Artificial Intelligence. This interest culminated in November 2022 with the release of ChatGPT and continued in March 2023 with GPT-4, when even the most skeptical were surprised at what modern neural networks can do.

Asking ChatGPT about its capabilities. Image by Author created using ChatGPT

While some of this content is undoubtedly valuable, a significant portion of it spreads fear and misinformation, such as claims that robots will replace all human jobs or that there are secret ways to make huge sums of money with neural networks. As a result, it has become increasingly important to dispel misconceptions about machine learning and large language models and to provide informative content that helps people understand these technologies better.

This article discusses a crucial aspect of modern machine learning that is often overlooked or misunderstood: the cost of training large language models. Along the way, we will briefly look at what an LLM is and at some techniques for optimizing LLM inference. By working through concrete examples, I hope to convince you that these technologies do not appear out of thin air. Once you have a sense of the scale of the data and the underlying computation, you will understand these powerful tools much better.

For the most part, I will rely on the recent LLaMA paper by Meta AI, because it clearly reports the amount of data and compute the team used to train their models. The post is divided into the following sections:

  1. First, we’ll briefly look at what modern LLMs are;
  2. Then, we'll discuss how much it costs to train such models;
  3. In the end, we briefly consider some popular techniques to optimize language models for inference.

Stay tuned as we delve deeper into the world of large language models and you will see that everything is very simple and very complicated at the same time.

Before we explore the costs associated with training Large Language Models (LLMs), let’s first briefly define what a language model is.

Parameter counts of several language models released in 2018–2019. Modern LLMs usually have tens to hundreds of billions of parameters. Figure 1 from DistilBERT paper

In simple terms, a language model is a machine learning algorithm designed to understand or generate human language. Recently, it is generative models in particular that have become increasingly popular, most notably the GPT family developed by OpenAI: ChatGPT, GPT-4, and so on (GPT stands for Generative Pre-trained Transformer, a nod to the Transformer architecture on which it is based).

Less popular, but still important examples include GPT-3 (175B), BLOOM (176B), Gopher (280B), Chinchilla (70B), and LLaMA (65B), where B refers to billions of parameters, although many of these models also have smaller versions.

The parameter counts of ChatGPT and especially GPT-4 have not been disclosed, but they appear to be of a similar order of magnitude.

Some popular LLM architectures. Image by Author

These models are “trained” using vast amounts of text data, enabling them to learn the complex patterns and structures of natural language. However, the task they solve during training is very simple: they just predict the next word (or token) in a sequence.

You may have heard such a model called autoregressive, meaning it uses its own past outputs as input for future predictions and generates output step by step. You can see this, among other places, in ChatGPT:

ChatGPT generates a response. Gif by Author created using ChatGPT

You can notice that the model generates the answer gradually and in chunks that are sometimes less than one word. These chunks are called tokens and they are very useful in NLP, although not so important for us now.

At each time step, the model appends its previous output to the current input and keeps generating. It does so until it produces the special End of Sequence (EOS) token. Omitting the prompt for simplicity and treating words as tokens, the process can be illustrated as follows.

Illustrating text generation for autoregressive models. Image by Author
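To make the loop concrete, here is a minimal sketch of greedy autoregressive decoding. It uses the small open GPT-2 model and the Hugging Face transformers library purely as an illustration; the prompt and model are examples, not what ChatGPT actually runs:

```python
# Minimal greedy autoregressive decoding sketch (illustrative; GPT-2 stands in for an LLM).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Start from a prompt and append one token at a time.
input_ids = tokenizer("Large language models are", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                                              # generate at most 20 new tokens
        logits = model(input_ids).logits                             # shape: (1, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # most likely next token
        input_ids = torch.cat([input_ids, next_token], dim=-1)       # feed the output back as input
        if next_token.item() == tokenizer.eos_token_id:              # stop at the End of Sequence token
            break

print(tokenizer.decode(input_ids[0]))
```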

This simple mechanism together with a huge amount of data (more than any person could read in several lifetimes) allows the model to generate coherent and contextually appropriate text, mimicking human-like writing.

Note that we are talking about generative models only here. Why, when other model families exist?

The reason is quite simple: text generation is one of the most difficult tasks to solve and, at the same time, one of the most impressive. ChatGPT reached 1 million users in just 5 days, faster than any application before it, and keeps growing at the same pace.

So-called encoders (the BERT model family) may be less exciting, but they also solve a variety of problems with human-level performance and help with tasks like text classification or Named Entity Recognition (NER).

I will not provide particular examples of what LLMs can do — the Internet is already full of them. The best way to get an idea is to try ChatGPT yourself, but you can also find plenty of exciting resources like the Awesome ChatGPT prompts repo. Despite their impressive capabilities, current large language models have some limitations. The most popular and significant of them include:

  1. Bias and staticity: Since LLMs are trained on data from various sources, they inadvertently learn and reproduce biases present in those sources. They are also static in the sense that they cannot adapt to new data or update their knowledge in real time without re-training.
  2. Comprehension and disinformation: Although LLMs can generate human-like text, they may not always fully understand the context of the input. Also, the autoregressive way of generating output text does not prohibit the model from generating lies or nonsense.
  3. Resource-intensive: Training LLMs requires substantial computing resources, which translates to high costs and energy consumption. This factor can limit the accessibility of LLMs for smaller organizations or individual researchers.

These and other drawbacks are active topics for the research community. It is worthwhile to note that the field is growing so fast that it is impossible to predict what limitations will be overcome in just a few months — but without a doubt, new ones will arise.

One example: earlier models simply grew in parameter count, but it is now considered better to train smaller models for longer and feed them more data. This reduces the model size and therefore the cost of using the model during inference.

That is how the LLaMA release untied enthusiasts' hands: these models have already been run locally on ordinary computers, on a Raspberry Pi, and even on phones!

With this big picture of what an LLM is, let's move on to the main section of this article: estimating the cost of training large language models.

To estimate the cost of training large language models, it is essential to consider the three key ingredients of any machine learning algorithm:

  • Data,
  • Compute resources, and
  • Architecture (or the algorithm itself).

Let’s delve deeper into each of these aspects to better understand their impact on training costs.

Data

LLMs require massive amounts of data to learn the patterns and structures of natural language. Estimating the cost of data can be challenging since companies often use data accumulated over time through their business operations together with open-sourced datasets.

Additionally, data needs to be cleaned, labeled, organized, and stored efficiently, considering the scale of LLMs. Data management and processing costs can add up quickly, especially when factoring in the infrastructure, tools, and data engineers required for these tasks.

To give a concrete example: LLaMA was trained on a dataset containing 1.4 trillion tokens with a total size of 4.6 terabytes!

Training dataset of LLaMA models. Table 1 from LLaMA paper

Smaller models (7B and 13B) were trained on 1T tokens, while larger ones (33B and 65B) used the full dataset of 1.4T tokens.
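As a quick back-of-the-envelope check (a sketch that takes the 4.6 TB and 1.4 trillion token figures above at face value), the dataset works out to only a few bytes of raw text, i.e. a few characters, per token, which is roughly what you would expect from a subword tokenizer:

```python
# Back-of-the-envelope: average bytes of raw text per training token in LLaMA's dataset.
# Assumes the 4.6 TB / 1.4 trillion token figures quoted above; 1 TB taken as 1e12 bytes.
dataset_bytes = 4.6e12
n_tokens = 1.4e12

print(f"{dataset_bytes / n_tokens:.1f} bytes per token")  # -> 3.3 bytes per token
```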

Training loss over tokens for LLaMA models. Figure 1 from LLaMA paper

I think you now understand that no one is exaggerating when they call these datasets huge, and why training on them simply wasn't feasible ten years ago. But things get even more interesting with computing resources.

Compute

The actual training process accounts for a significant portion of the LLM budget. Training large language models is resource-intensive and is done on powerful Graphics Processing Units (GPUs) because of their massive parallel processing capabilities. NVIDIA releases new data-center GPUs every year, and a single multi-GPU server built around them can cost hundreds of thousands of dollars.

The cost of cloud computing services for training these models can be huge and reach several million dollars, especially considering iterating through various configurations.

Returning to the LLaMA paper, the authors report that they trained the largest 65B model for 21 days on 2048 GPUs with 80 GB of RAM each.

Amount of computing resources for training the LLaMA model. Image from LLaMA paper

The NVIDIA A100 GPU the authors used is a popular choice for modern neural network training. Google Cloud Platform offers these GPUs for $3.93 per hour.

Price of NVIDIA A100 GPU. Screenshot of a public GCP pricing page

So let’s do some quick calculations:

2048 GPUs × $3.93 per GPU-hour × 24 hours × 21 days ≈

$4.06 million
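Here is the same arithmetic as a tiny script. It assumes the on-demand GCP price above, while large customers typically negotiate discounts, so treat it as an order-of-magnitude estimate:

```python
# Back-of-the-envelope cost of the LLaMA-65B training run at GCP on-demand A100 prices.
n_gpus = 2048               # A100 80 GB GPUs, as reported in the LLaMA paper
price_per_gpu_hour = 3.93   # USD per GPU-hour (GCP on-demand price quoted above)
hours = 24 * 21             # 21 days of continuous training

total_cost = n_gpus * price_per_gpu_hour * hours
print(f"Estimated cost: ${total_cost:,.0f}")  # -> Estimated cost: $4,056,515
```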

Four million dollars is a budget that not every researcher can afford, right? And that is a single run! For another example, this article estimates the cost of training GPT-3 at about 355 GPU-years and 4.6 million dollars.

You may have heard that "neural networks train very quickly on GPUs", but no one says quickly relative to what.

They really do train fast given the enormous amount of computation involved; without these GPUs, training would take decades. So yes, 21 days is pretty fast for an LLM.

Architecture (and Infrastructure)

The development of state-of-the-art LLMs also depends on the work of skilled researchers and engineers to develop the architecture and configure the training process properly. The architecture is the foundation of the model, dictating how it learns and generates text.

Expertise in various computer science areas is required for designing, implementing, and controlling these architectures. Engineers and researchers responsible for publishing and delivering cutting-edge results can command salaries reaching hundreds of thousands of dollars. It is worth noting that the skill set required for LLM development may differ significantly from the skill set of a “classic” machine learning engineer.

Machine learning system infrastructure. Figure 1 from Hidden Technical Debt in Machine Learning Systems paper

By now you should have no doubt that training LLMs is a hard, resource-intensive engineering problem.

Now let’s briefly discuss some methods for making the process of LLM inference more efficient and cost-effective.

Do we actually need optimization?

Inference refers to the process of using a trained language model to generate predictions or responses, usually as an API or web service. Given the resource-intensive nature of LLMs, it is essential to optimize them for efficient inference.

For example, the GPT-3 model has 175 billion parameters, which amounts to 700 GB of float32 numbers. The activations take roughly the same amount of memory again, and remember that all of this has to fit in (GPU) RAM.

To serve predictions without any optimization technique, we will need 16 A100 GPUs with 80 GB of memory each!
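To see where such numbers come from, here is a rough sketch of the weight memory alone at different precisions (activations, caches, and framework overhead come on top, roughly doubling the requirement, as mentioned above):

```python
# Rough memory footprint of GPT-3's weights at different numeric precisions (weights only).
n_params = 175e9  # 175 billion parameters
bytes_per_value = {"float32": 4, "float16": 2, "int8": 1}

for dtype, nbytes in bytes_per_value.items():
    gb = n_params * nbytes / 1e9  # 1 GB taken as 1e9 bytes, matching the 700 GB figure above
    print(f"{dtype}: {gb:,.0f} GB of weights, i.e. >= {gb / 80:.1f} A100-80GB GPUs just to hold them")
```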

Several popular techniques can help reduce memory requirements and model latency, including model parallelism, quantization, and others.

Model Parallelism

Parallelism is a technique that distributes the computation of a single model across multiple GPUs and can be used both during training and inference.

Splitting the model’s layers or parameters across multiple devices can dramatically improve the overall inference speed and is very often used in practice.
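In practice, libraries hide most of this complexity. As a minimal sketch (the model name is only an example, and it assumes the transformers and accelerate libraries are installed), passing device_map="auto" spreads the model's layers across the available GPUs:

```python
# Minimal sketch of layer-wise model parallelism via Hugging Face accelerate's device_map.
# Requires `transformers` and `accelerate`; the model name is an illustrative example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")  # layers split across GPUs
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("Model parallelism lets us", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```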

Quantization

Quantization involves reducing the precision of the model’s numerical values (such as weights). By converting floating-point numbers to lower-precision integers, quantization can result in significant memory savings and faster computation without a substantial loss in model performance.

The simple idea that comes to mind first is to use float16 instead of float32 and cut the memory in half. It turns out that model weights can even be converted to int8 with almost no loss in accuracy, because the weights tend to be clustered within a narrow range on the number line.
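As a hedged sketch of what this looks like in code (exact flags vary across library versions, and load_in_8bit relies on the bitsandbytes integration in transformers):

```python
# Minimal sketch: loading the same model at lower precision to cut memory roughly 2x and 4x.
# Assumes `transformers`, `accelerate`, and `bitsandbytes` are installed; flags may differ by version.
import torch
from transformers import AutoModelForCausalLM

model_name = "facebook/opt-6.7b"  # illustrative example

model_fp16 = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
model_int8 = AutoModelForCausalLM.from_pretrained(model_name, load_in_8bit=True, device_map="auto")

print(model_fp16.get_memory_footprint() / 1e9, "GB")  # roughly half of the float32 size
print(model_int8.get_memory_footprint() / 1e9, "GB")  # roughly a quarter of the float32 size
```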

Other Techniques

Finding ways to optimize LLMs is an active area of research, and other techniques include:

  • Knowledge distillation — training a smaller student model to mimic the behavior of a larger teacher;
  • Parameter pruning — removing redundant or less important parameters from the model to reduce its size and computational requirements;
  • And using frameworks like ONNX Runtime (ORT) to optimize the computation graph with techniques like operator fusion and constant folding (see the sketch after this list).
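To illustrate that last point with a toy example (a tiny model stands in for an LLM here; exporting a real transformer requires extra care with dynamic shapes and cached key/value states), ONNX Runtime applies graph optimizations such as operator fusion and constant folding when it builds an inference session:

```python
# Minimal sketch: export a toy PyTorch model to ONNX and run it with ONNX Runtime (ORT).
import torch
import onnxruntime as ort

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 128),
).eval()
dummy_input = torch.randn(1, 128)

torch.onnx.export(model, dummy_input, "tiny_model.onnx", input_names=["x"], output_names=["y"])

# Graph-level optimizations (fusion, constant folding) are applied when the session is created.
session = ort.InferenceSession("tiny_model.onnx", providers=["CPUExecutionProvider"])
(output,) = session.run(None, {"x": dummy_input.numpy()})
print(output.shape)  # (1, 128)
```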

Overall, optimizing large language models for inference is a critical aspect of their deployment. By applying various optimization techniques, developers can ensure that their LLMs are not only powerful and accurate but also cost-effective and scalable.

After all the above, one might wonder why OpenAI decided to open access to ChatGPT, given the high costs associated with training and inference. While we cannot be certain of the company’s exact motivations, we can analyze the benefits and potential strategic reasons behind this decision.

First and foremost, OpenAI has gained significant popularity by making state-of-the-art LLMs more accessible to the broader public. By demonstrating the practical applications of large language models, the company has captured the attention of investors, customers, and the tech community at large.

Secondly, OpenAI’s mission revolves around the creation and advancement of AI. By opening access to ChatGPT, the company is arguably moving closer to fulfilling its mission and preparing society for unavoidable changes. Providing access to powerful AI tools encourages innovation, driving the field of AI research forward. This progress can lead to the development of more efficient models, more extensive applications, and novel solutions to various challenges. It’s worth noting here that the architecture of ChatGPT and GPT-4 is closed, but that’s another discussion.

While the costs associated with training and maintaining large language models are undoubtedly significant, the benefits and strategic advantages that come with opening access to these tools can outweigh the expenses for some organizations. In OpenAI's case, opening access to ChatGPT has not only increased the company's popularity and established it as a leader in the AI field, but has also allowed it to collect more data to train even more powerful models. This strategy has helped advance its mission and, in some sense, contribute to the broader development of AI and LLM technologies.

Asking ChatGPT why OpenAI is giving free access to ChatGPT. Image by Author created using ChatGPT

As we have seen, the cost of training large language models is influenced by various factors, including not only expensive computing resources but also big data management and the expertise required to develop cutting-edge architectures.

Modern LLMs have billions of parameters, are trained on trillions of tokens, and cost millions of dollars.

I hope you now better understand the scale of training and inferencing large language models, as well as their limitations and pitfalls.

The field of NLP has been experiencing its ImageNet moment for several years, and now it’s the turn of generative models. The widespread application and adoption of generative language models have the potential to revolutionize various industries and aspects of our lives. While it is difficult to predict exactly how these changes will unfold, we can be certain that LLMs will have some impact on the world.

Personally, I like the recent tendency of training “smarter”, not just “larger” models. By exploring more elegant ways to develop and deploy LLMs, we can push the boundaries of AI and NLP, opening the door to innovative solutions and a brighter future for the field.

If after reading the article you become interested in LLMs and want to learn more about them, here are some resources that can help you with that:



