GPT-4 is impressive, really impressive. It’s remained on top of every LLM leaderboard for over a year now, and it performs extremely well on thousands of tasks. It’s also overkill. You probably don’t need it for your use case, and you may be spending way too much money as a result.
What if we told you there was a way to get that same GPT-4 performance but for a fraction of the cost and much lower latency? How, you might ask? Model distillation.
This blog post will cover the what, why, and how of model distillation. Let’s dive in!
We’ll dig into the why of model distillation first. But, to make sure we don’t leave you with a cliffhanger on what model distillation even is, here’s a quick (to be expanded on!) definition:
Model distillation is the process of teaching a small LLM the knowledge hidden in a large LLM.
But why would we want to do that? Well, that leads us into the very next section.
Generally, the larger the LLM, the better it does on tasks. But, larger LLMs are far more expensive to use and are far slower to run. Especially for targeted use cases, the performance isn’t worth the cost and latency.
Take GPT-4. It has ~10x the parameters that GPT-3.5 does (and GPT-3.5 has 175 billion of them). GPT-4 also outperforms GPT-3.5 on pretty much every task it’s given.
GPT-4 outperforms GPT-3.5 on many benchmarks (OpenAI)
But, GPT-4 is also 20x the price of GPT-3.5. For 1 million tokens (approximately 1,000 user chats), it would cost $20 on GPT-4 vs. $1 on GPT-3.5. Over millions of requests, that difference adds up. Even open-source models, with their lower costs, suffer from the same issue. Llama 2 with 70B parameters is far better than Llama 2 with 13B parameters. It’s also roughly 17x as expensive per generated token.
Larger LLMs are also slower: more parameters means more computation for every token of output. For users, this can be a frustrating experience. An average reader reads at 238 words per minute, so an LLM needs to generate output faster than that. Otherwise, a reader will read a word, then have to wait a bit before reading the next word, completely interrupting their flow.
But large LLMs have much higher latencies (2x higher!) than their smaller counterparts. They can produce enough words per second to clear that 238-words-per-minute bar, but doing so requires specialized, expensive hardware.
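To put numbers on that, here’s a quick back-of-the-envelope calculation of the generation speed an LLM needs just to keep pace with a reader. The ~0.75 words-per-token ratio is a rough rule of thumb for English, not an exact figure:

```python
# Rough check: how fast does an LLM need to generate text to keep up with a
# reader at 238 words per minute? Assumes ~0.75 words per token (a common
# rule of thumb for English, used here as an illustrative assumption).
words_per_minute = 238
words_per_token = 0.75

tokens_per_second = words_per_minute / 60 / words_per_token
print(f"Minimum generation speed: {tokens_per_second:.1f} tokens/second")  # ~5.3
```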
The energy needed to power LLMs also matters. One estimate says that running GPT-4, with its trillion-plus parameters, consumes as much electricity as 175,000 residents of Denmark.
All this to say, large LLMs are very performant, but they are not very efficient. They cost a lot of money to run, they take longer to respond to queries, and they consume tons of energy.
If we could somehow get their same high performance in a smaller model, it would be a win-win-win. Luckily, that’s exactly what model distillation does.
Model distillation is the process by which we can replicate large LLM performance in a much smaller LLM. We do this by using the output of the large LLM as training data for the small LLM. In industry parlance, the larger LLM here is called the “teacher” and the smaller LLM is called the “student.”
To distill the teacher model, we give it a set of questions (or tasks) and note down its responses. Then, we fine-tune the student model: we give the student that same set of questions and corresponding responses, measure how the student’s responses diverge from the teacher’s, and train the student to align its output with the teacher’s. With enough examples, the student model learns to respond the same way the teacher model does.
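Here’s a minimal sketch of that loop in Python, using Hugging Face transformers. The model names ("gpt2" as the teacher, "distilgpt2" as the student) are small stand-ins chosen so the snippet stays runnable; in a real setup the teacher would be a much larger model, possibly behind an API, and you’d train over many more examples and epochs:

```python
# Minimal sketch of response-based distillation: record the teacher's answers,
# then fine-tune the student to reproduce them. "gpt2"/"distilgpt2" are small
# stand-ins here; a real setup would use a much larger teacher.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # both models share this vocab
teacher = AutoModelForCausalLM.from_pretrained("gpt2").eval()
student = AutoModelForCausalLM.from_pretrained("distilgpt2")

prompts = [
    "Q: How many cookies did you sell if you sold 320 chocolate cookies "
    "and 270 vanilla cookies?\nA:",
]

# Step 1: give the teacher the questions and note down its responses.
distillation_texts = []
with torch.no_grad():
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = teacher.generate(
            **inputs, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id
        )
        distillation_texts.append(
            tokenizer.decode(output_ids[0], skip_special_tokens=True)
        )

# Step 2: fine-tune the student on the teacher's (question, response) pairs
# with ordinary next-token cross-entropy. (In practice you'd usually mask out
# the prompt tokens from the loss; omitted here for brevity.)
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)
student.train()
for text in distillation_texts:
    batch = tokenizer(text, return_tensors="pt")
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```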
This process works shockingly well. A research team was able to distill a teacher BERT model (another type of AI model built with the same technology powering LLMs) into a student BERT model that was 7.5x smaller and 9.4x faster. Across a wide range of tasks, the student model hit 96.8% of the performance of the teacher. In another case, a 770M parameter student model outperformed a 540B parameter teacher model.
The concept that powers model distillation is that large LLMs learn a huge amount about the world from the vast training data they’re fed. Because of that knowledge, they have good general reasoning capabilities and can pick out the important information in any task. A smaller LLM doesn’t have the capacity to learn all of that from the raw data on its own.
As an analogy, imagine two people: one is an expert data analyst and the other is a novice. The human expert is able to look at a spreadsheet of data and instantly identify the 2–3 numbers that matter. A novice wouldn’t know to pick up on those crucial numbers. But, that same expert can teach the novice how to quickly identify what matters.
The same happens with the teacher and student LLM in model distillation.
Here’s a basic example. Assume we have a large teacher LLM that does well on math word problems such as “How many cookies did you sell if you sold 320 chocolate cookies and 270 vanilla cookies?” We want to teach a smaller student LLM to do well on these kinds of math word problems too. To distill knowledge from the teacher to the student, we would give a bunch of math word problems to the teacher and note down its answers. In the cookie example, the teacher might realize that the flavors of the cookies (or even the fact they’re cookies) are irrelevant, and all it needs to do is “sum 320 and 270.” Then, we fine-tune the student model on the responses (and logic) from the teacher model.
With this, we can train the student to change its responses to mirror the teacher’s.
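To make that concrete, here’s what a single distillation training record for the cookie problem might look like if we also keep the teacher’s reasoning. The field names are purely illustrative, not a fixed schema:

```python
# One hypothetical training record for the cookie problem. Keeping the
# teacher's reasoning ("rationale") alongside its answer lets the student
# learn *how* the teacher solved it, not just the final number.
training_example = {
    "question": (
        "How many cookies did you sell if you sold 320 chocolate cookies "
        "and 270 vanilla cookies?"
    ),
    "teacher_rationale": "The flavors are irrelevant; just sum 320 and 270.",
    "teacher_answer": "590",
}
# During fine-tuning, the student sees the question and is trained to produce
# the rationale followed by the answer.
```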
Of course, this example is very simplified. In actuality, there are many different variants of model distillation. There are options for what kinds of models to pick for the teacher and student, what “knowledge” to extract from the teacher, and even the specific distillation algorithm (there are many!). But for most use cases, simply taking the responses of the teacher LLM (and its logits, which are a good proxy for the internal representation of the model) and training the student LLM on that is enough to transfer the teacher’s knowledge.
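For the logit-based variant in particular, the core training objective is typically a KL-divergence term that nudges the student’s next-token distribution toward the teacher’s. Below is a minimal sketch in PyTorch; the temperature scaling follows the classic Hinton-style distillation recipe and assumes the teacher and student share a vocabulary, which won’t hold for every model pairing:

```python
# Minimal sketch of a logit-distillation loss: the student is trained to match
# the teacher's probability distribution over next tokens, not just its text.
# Assumes teacher and student share the same vocabulary.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Both logit tensors have shape (batch, seq_len, vocab_size).
    vocab_size = student_logits.size(-1)
    student_log_probs = F.log_softmax(
        student_logits.reshape(-1, vocab_size) / temperature, dim=-1
    )
    teacher_probs = F.softmax(
        teacher_logits.reshape(-1, vocab_size) / temperature, dim=-1
    )
    # KL divergence, scaled by T^2 so gradient magnitudes stay comparable
    # across temperatures (as in Hinton et al.'s original recipe).
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```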
Keep in mind, model distillation can transmit errors: if the teacher model is wrong about something, that too will be passed on to the student model. Fine-tuning the student model on factual data can root out those errors.
The usual process for distilling LLMs is quite complex. We would need to choose the architecture of the teacher and student model, evaluate the teacher model rigorously to make sure its responses are high-quality, choose which parts of the teacher to distill, and set up a data pipeline. This is all before we even get one response from the teacher to train the student on. Not the easiest process to set up, despite its many benefits.
We’re changing that at Verta. We strongly believe distilling models can be good for companies, end users, and even the environment. We’ll be rolling model distillation out to our GenAI Workbench very soon. With that, you’ll be able to get GPT-4 performance for the cost and latency of GPT-2.
Learn more
For an in-depth look at our Workbench capabilities, you can read our full launch blog post and check out the platform here.