Is GPT-4 better than Claude 2? Is a hammer better than a screwdriver? It depends on the task.
From answering questions to writing code, large language models (LLMs) can perform seemingly any task, making them hard to benchmark. It’s also not obvious what “better” even means. Most AI research has emphasized accuracy, but what about fairness, efficiency, and other qualities?
Stanford’s Center for Research on Foundation Models has put forth HELM (Holistic Evaluation of Language Models), a benchmark that measures LLMs against dozens of scenarios along several dimensions beyond just accuracy. In this post, we’ll give an overview of HELM, how to use it as a starting point, and how to go a step further when choosing an LLM for your product.
HELM is a benchmark for LLMs that improves upon prior benchmarks in three main ways:

- Broad coverage: it evaluates models across a deliberately mapped-out range of scenarios, and it is explicit about which scenarios are still missing.
- Multi-metric measurement: each scenario is scored on several dimensions, not accuracy alone.
- Standardization: many models are evaluated under the same conditions, so results are directly comparable.
Figure 5 depicting a “run” from the HELM research paper (Liang et al., 2023)
Rather than arbitrarily selecting scenarios and metrics, HELM takes a very principled approach, as we’ll explore next.
HELM frames scenarios as a combination of a task, a domain (which they define as the who, what, and when of a scenario), and a language.
Figure 8 showing scenario taxonomy from the HELM research paper (Liang et al., 2023)
The space of all possible scenarios is vast, and it isn’t feasible to test every single one. To limit scope, HELM focuses primarily on the English language and otherwise tries to cover a diverse set of tasks and domains. HELM also prioritizes scenarios that correspond to common real-world applications of LLMs.
HELM’s scenario taxonomy provides a structure for rigorously thinking about benchmarks. The research paper acknowledges that HELM’s coverage of scenarios is incomplete and will need to expand over time. As people devise new applications for LLMs, these novel use cases can be included in the taxonomy and added as benchmarks.
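To make the taxonomy concrete, here is a minimal sketch of how a scenario could be represented in code as a task, a domain (the who, what, and when), and a language. This is our own illustration, not HELM’s actual codebase, and the class and field names are ours:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Domain:
    """The 'who, what, and when' of a scenario."""
    who: str    # e.g., "web users"
    what: str   # e.g., "news articles"
    when: str   # e.g., "2022"

@dataclass(frozen=True)
class Scenario:
    """A HELM-style scenario: a task, a domain, and a language."""
    task: str       # e.g., "question answering"
    domain: Domain
    language: str   # HELM focuses primarily on English

# Example: question answering over recent news articles, in English
qa_news = Scenario(
    task="question answering",
    domain=Domain(who="web users", what="news articles", when="2022"),
    language="English",
)
```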
Accuracy is, of course, important for LLMs. It’s hard to use a model you can’t trust, and hallucinations are a well-known shortcoming of LLMs.
HELM goes a step further than just measuring accuracy. The framework encompasses several other metrics:

- Calibration: does the model express appropriate uncertainty in its answers?
- Robustness: does performance hold up under typos, paraphrases, and other perturbations of the input?
- Fairness: does the model perform consistently across demographic groups?
- Bias: does the model reproduce or amplify societal stereotypes?
- Toxicity: does the model generate harmful or offensive content?
- Efficiency: how much time and energy does the model consume?
With this ensemble of metrics, it’s possible to evaluate tradeoffs. For instance, a 2% improvement in accuracy may not merit a 25% decrease in efficiency.
Making such tradeoffs is not a one-size-fits-all decision, which is why HELM makes no claims about which LLM is best. Instead, HELM lets you decide for yourself which LLM fits your use case.
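As a toy illustration of how you might make that call, you can turn your priorities into explicit weights and score candidates against them. The metric values and weights below are made up for the example, not taken from HELM:

```python
# Hypothetical metric values (higher is better) for two candidate models.
model_a = {"accuracy": 0.80, "efficiency": 0.80}
model_b = {"accuracy": 0.82, "efficiency": 0.60}  # ~2 points more accurate, 25% less efficient

# Illustrative priorities -- yours to choose, not prescribed by HELM.
weights = {"accuracy": 0.5, "efficiency": 0.5}

def weighted_score(metrics: dict, weights: dict) -> float:
    """Collapse several metrics into one number using your own priorities."""
    return sum(weights[name] * value for name, value in metrics.items())

print(weighted_score(model_a, weights))  # ≈ 0.80
print(weighted_score(model_b, weights))  # ≈ 0.71 -- the accuracy gain doesn't pay for the efficiency loss
```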
HELM’s results are publicly available and continually updated. As of this writing, the Stanford team has benchmarked 119 models and advocates benchmarking new models as they come to market.
You can use the HELM leaderboard to compare models according to specific scenarios and metrics:
Screenshot of the HELM leaderboard
To get the most out of HELM, start by identifying the scenario most relevant to your LLM use case. (You can also choose a group of scenarios if more than one applies.) Separately, work out which metrics matter most to you and in what order.
Then, you can look at the leaderboard and decide which model is the best fit. Don’t forget to consider cost, terms of use, technical support, and other factors as well.
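Putting those steps together, here is a rough sketch of the selection process. The leaderboard rows, metric values, and prices below are placeholders you would replace with real HELM results and vendor pricing:

```python
# Hypothetical leaderboard excerpt for one scenario (all values are placeholders).
candidates = [
    {"model": "model-x", "accuracy": 0.78, "robustness": 0.70, "fairness": 0.75, "usd_per_1k_tokens": 0.002},
    {"model": "model-y", "accuracy": 0.82, "robustness": 0.68, "fairness": 0.72, "usd_per_1k_tokens": 0.030},
    {"model": "model-z", "accuracy": 0.74, "robustness": 0.76, "fairness": 0.80, "usd_per_1k_tokens": 0.001},
]

# Your priorities: weight the benchmark metrics, then filter on budget separately.
weights = {"accuracy": 0.6, "robustness": 0.2, "fairness": 0.2}
max_price = 0.01  # illustrative budget cap per 1k tokens

def score(candidate: dict) -> float:
    """Weighted benchmark score, ignoring non-benchmark factors like cost."""
    return sum(w * candidate[metric] for metric, w in weights.items())

affordable = [c for c in candidates if c["usd_per_1k_tokens"] <= max_price]
for c in sorted(affordable, key=score, reverse=True):
    print(f'{c["model"]}: {score(c):.3f}')
```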
HELM is a great resource for the AI community as a whole. It pushes the field to standardize comparisons between models, and it emphasizes the importance of fair and equitable AI.
That said, general-purpose benchmarks can only go so far. While HELM has tested thousands of configurations, it may not be representative of every element of your LLM product:

- Does HELM include a scenario that matches the exact task your application performs?
- Does it evaluate models the way you will actually use them, with your prompts and configuration, rather than just the base model out of the box?
HELM isn’t meant to be the one-stop shop to choose a large language model for your production application. It’s a research framework and corpus that also happens to be a good starting point for selecting an LLM. If you’re serious about evaluating all the options for building an AI application, you need something bespoke.
That means addressing both of the questions above. First, it’s important to test models against the exact task you intend to use them for. Second, evaluation doesn’t stop with base models: choosing effective prompts is just as important, if not more so.
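As a rough sketch of what that bespoke evaluation can look like, you can score every prompt-and-model combination on examples drawn from your actual task. The `call_model` and `grade` functions below are stand-ins for whatever client and grading logic your stack uses, and the models, prompts, and examples are placeholders:

```python
from itertools import product

# Stand-in for your real LLM client -- replace with a call to your provider.
def call_model(model: str, prompt: str) -> str:
    return f"[{model}] response to: {prompt}"

# Stand-in grader -- replace with a scorer suited to your task (exact match, rubric, LLM judge, ...).
def grade(output: str, expected: str) -> float:
    return 1.0 if expected.lower() in output.lower() else 0.0

models = ["model-x", "model-y"]                                   # shortlisted, e.g., from HELM
prompts = ["Answer concisely: {q}", "You are a support agent. {q}"]
examples = [{"q": "How do I reset my password?", "expected": "reset"}]  # your own task data

# Average score for every (model, prompt) combination on your examples.
results = {
    (model, prompt): sum(
        grade(call_model(model, prompt.format(q=ex["q"])), ex["expected"]) for ex in examples
    ) / len(examples)
    for model, prompt in product(models, prompts)
}
for (model, prompt), avg in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{avg:.2f}  {model}  |  {prompt}")
```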
We built the Verta GenAI Workbench to help you iterate on your LLM application and compare results between models and prompts. The Workbench comes with built-in benchmarking, custom-made for your use case. The leaderboards feature lets you iterate toward the ideal combination of prompt and model to best serve your users.