LLMs are a transformative technology, but their biggest limitation today is their size. The most powerful models have hundreds of billions or even trillions of parameters and are trained on trillions of tokens. That makes them very expensive to run and puts the development of cutting-edge LLMs out of reach for all but a few resource-rich companies. It also raises responsible AI and ethical concerns, from model governance and control to energy consumption.
For these reasons, we’ve been excitedly following research into how to make LLMs smaller while maintaining their capabilities. An important recent paper in this space is “Textbooks Are All You Need II: phi-1.5 technical report” from Microsoft Research. It describes how the authors built phi-1.5, a small LLM that achieves common sense reasoning benchmark results comparable to models 10x its size. We’ll sum up our main takeaways here.
Phi-1.5, which the researchers have open-sourced for anyone to access, has 1.3 billion parameters and was trained on a synthetic dataset of roughly 30 billion tokens. Phi-1.5 is not the first small LLM to get attention in recent months. The models that inspired this research include TinyStories, a 10 million(!) parameter model that can produce coherent English, and phi-1, a 1.3 billion parameter model that performs close to state-of-the-art (SOTA) on Python coding benchmarks.
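If you want to get a feel for the model yourself, the checkpoint is published on Hugging Face as “microsoft/phi-1_5”. Below is a minimal sketch of loading it with the transformers library and generating a completion; the prompt and generation settings are our own illustrative choices, not anything from the paper.

```python
# Minimal sketch: load phi-1.5 from Hugging Face and generate a completion.
# The prompt and generation settings are illustrative; older transformers
# releases may also require trust_remote_code=True for this checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-1_5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

prompt = "Explain, step by step, why smaller language models are cheaper to serve."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```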
What makes phi-1.5 particularly notable, however, is that it is one of the first small models to exhibit almost all of the classic traits (both positive and negative) of large LLMs. It can perform step-by-step (chain-of-thought) reasoning and in-context learning, and it is also prone to hallucinations and potentially biased assumptions. Its performance generalizes well enough to surpass most non-SOTA LLMs on complex reasoning tasks like basic math and coding.
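To make the in-context learning point concrete, here is the kind of few-shot, step-by-step prompt that a base model like phi-1.5 can complete; the wording is ours, not an example from the paper.

```python
# Illustrative few-shot, chain-of-thought style prompt (our own wording).
# Fed to the model via the loading sketch above, a prompt like this typically
# elicits a step-by-step continuation for the final question.
prompt = """Question: Alice has 3 apples and buys 4 more. How many apples does she have?
Answer: Alice starts with 3 apples and buys 4 more, so 3 + 4 = 7. The answer is 7.

Question: A train travels 60 miles per hour for 2 hours. How far does it go?
Answer:"""
```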
Small LLMs are especially appealing because of their shorter training time, faster inference, and lower memory usage. Consider, for example, how phi-1.5’s footprint trades off against LLaMA 2 with 7 billion parameters (the smallest size of Meta’s foundational LLM).
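To put rough numbers on the memory side of that trade-off, here is a back-of-envelope sketch of weight memory at 16-bit precision; it ignores activations, KV cache, and optimizer state, so treat the figures as approximations rather than measurements from the paper.

```python
# Back-of-envelope weight memory at 16-bit precision (2 bytes per parameter).
# Ignores activations, KV cache, and optimizer state, so these are rough
# lower bounds, not measured figures.
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    return num_params * bytes_per_param / 1e9

print(f"phi-1.5 (1.3B params): ~{weight_memory_gb(1.3e9):.1f} GB")  # ~2.6 GB
print(f"LLaMA 2 7B (7B params): ~{weight_memory_gb(7e9):.1f} GB")   # ~14.0 GB
```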
Phi-1.5 uses an approach the researchers have coined “Textbooks Are All You Need.” The idea is that if you can get textbook-quality data, you can train much better models with less data and fewer parameters than if you rely on lower-quality web data (which has been the standard for most LLMs so far). For phi-1.5, the researchers created “textbook-like” data covering roughly 20,000 carefully chosen general-knowledge topics, such as science and philosophy.
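The paper does not publish its generation prompts or pipeline, but a synthetic “textbook-like” data pipeline could look something like the sketch below; the topic list, prompt template, and generate_text() stub are hypothetical placeholders for whatever teacher model and seed topics you choose.

```python
# Hypothetical sketch of "textbook-like" synthetic data generation.
# The topics, template, and generate_text() stub are illustrative placeholders,
# not the pipeline described in the paper.
TOPICS = ["photosynthesis", "Newton's laws of motion", "the Socratic method"]

PROMPT_TEMPLATE = (
    "Write a clear, self-contained textbook passage for a curious student "
    "about {topic}. Define key terms and include one worked example or analogy."
)

def generate_text(prompt: str) -> str:
    # Stand-in for a call to a capable teacher LLM.
    return f"[synthetic passage for prompt: {prompt!r}]"

synthetic_corpus = [generate_text(PROMPT_TEMPLATE.format(topic=t)) for t in TOPICS]
print(len(synthetic_corpus), "passages generated")
```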
“The experience gained in the process of creating the training data for both phi-1 and phi-1.5 leads us to the conclusion that the creation of a robust and comprehensive dataset demands more than raw computation power,” the researchers wrote. “It requires intricate iterations, strategic topic selection, and a deep understanding of knowledge gaps to ensure quality and diversity of the data. We speculate that the creation of synthetic datasets will become, in the near future, an important technical skill and a central topic of research in AI.”
Anyone who works with LLMs knows that one of their major challenges is the potential to generate biased or toxic content. Reinforcement learning from human feedback (RLHF) is one strategy to mitigate this, but it only works if you have a human in the loop, which isn’t desirable or possible for all LLM applications.
The researchers were curious whether their smaller LLM, trained on synthetic data, would produce fewer problematic responses, since massive amounts of web data contain plenty of toxic and biased content. Indeed, on questions designed to test the models’ boundaries (such as asking the model to reflect on certain groups of people, or prompting it to say what it would do if it gained self-awareness), phi-1.5 outperformed larger LLMs by giving safe, textbook responses.
That said, web data is still important for LLM performance. The researchers created another model, phi-1.5-web, trained on an additional 95 billion tokens of filtered web data. On reasoning tasks such as math and coding, phi-1.5 already performs quite well (even outperforming most larger models), but phi-1.5-web does significantly better.
As with many LLM papers, this report contains some great examples of phi-1.5’s responses to different prompts, and we encourage you to check it out. You’ll also find plenty of stats and architectural details that we didn’t have room to cover here.
At Verta, we love keeping a finger on the pulse of new LLM developments. We’re excited about the potential of smaller LLMs to reduce inference latency, keep costs down for businesses, and democratize the LLM development and research space.
Want to see Verta support new LLM models? This is an active area of development for us. Please reach out to learn more!