Is GPT-4 better than Claude 2? Is a hammer better than a screwdriver? It depends on the task.
From answering questions to writing code, large language models (LLMs) can perform seemingly any task, making them hard to benchmark. It’s also not obvious what “better” even means. Most AI research has emphasized accuracy, but what about fairness, efficiency, and other qualities?
Stanford’s Center for Research on Foundation Models has put forth HELM (Holistic Evaluation of Language Models), a benchmark that measures LLMs against dozens of scenarios along several dimensions beyond just accuracy. In this post, we’ll give an overview of HELM, how to use it as a starting point, and how to go a step further when choosing an LLM for your product.
HELM is a benchmark for LLMs that improves upon prior benchmarks in three main ways:

- Broad coverage: it evaluates models across a deliberately mapped-out range of scenarios, and it is explicit about which scenarios are still missing.
- Multi-metric measurement: each scenario is scored on several dimensions, not accuracy alone.
- Standardization: many models are evaluated under the same conditions, so results are directly comparable.
Figure 5 depicting a “run” from the HELM research paper (Liang et al., 2023)
Rather than arbitrarily selecting scenarios and metrics, HELM takes a very principled approach, as we’ll explore next.
HELM frames scenarios as a combination of a task, a domain (which they define as the who, what, and when of a scenario), and a language.
Figure 8 showing scenario taxonomy from the HELM research paper (Liang et al., 2023)
The space of all possible scenarios is vast, and it isn’t feasible to test every single one. To limit scope, HELM focuses primarily on the English language and otherwise tries to cover a diverse set of tasks and domains. HELM also prioritizes scenarios that correspond to common real-world applications of LLMs.
HELM’s scenario taxonomy provides a structure for rigorously thinking about benchmarks. The research paper acknowledges that HELM’s coverage of scenarios is incomplete and will need to expand over time. As people devise new applications for LLMs, these novel use cases can be included in the taxonomy and added as benchmarks.
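To make the taxonomy concrete, here is a minimal sketch of how a scenario could be represented in code as a task, a domain (the who, what, and when), and a language. This is our own illustration, not HELM’s actual codebase, and the class and field names are ours:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Domain:
    """The 'who, what, and when' of a scenario."""
    who: str    # e.g., "web users"
    what: str   # e.g., "news articles"
    when: str   # e.g., "2022"

@dataclass(frozen=True)
class Scenario:
    """A HELM-style scenario: a task, a domain, and a language."""
    task: str       # e.g., "question answering"
    domain: Domain
    language: str   # HELM focuses primarily on English

# Example: question answering over recent news articles, in English
qa_news = Scenario(
    task="question answering",
    domain=Domain(who="web users", what="news articles", when="2022"),
    language="English",
)
```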
Accuracy is, of course, important for LLMs. It’s hard to use a model you can’t trust, and hallucinations are a well-known shortcoming of LLMs.
HELM goes a step further than just measuring accuracy. The framework encompasses several other metrics:

- Calibration: does the model express appropriate uncertainty in its answers?
- Robustness: does performance hold up under typos, paraphrases, and other perturbations of the input?
- Fairness: does the model perform consistently across demographic groups?
- Bias: does the model reproduce or amplify societal stereotypes?
- Toxicity: does the model generate harmful or offensive content?
- Efficiency: how much time and energy does the model consume?
With this ensemble of metrics, it’s possible to evaluate tradeoffs. For instance, a 2% improvement in accuracy may not merit a 25% decrease in efficiency.
Making such tradeoffs is not a one-size-fits-all decision, which is why HELM makes no claims about which LLM is best. Instead, HELM lets you decide for yourself which LLM fits your use case.
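As a toy illustration of how you might make that call, you can turn your priorities into explicit weights and score candidates against them. The metric values and weights below are made up for the example, not taken from HELM:

```python
# Hypothetical metric values (higher is better) for two candidate models.
model_a = {"accuracy": 0.80, "efficiency": 0.80}
model_b = {"accuracy": 0.82, "efficiency": 0.60}  # ~2 points more accurate, 25% less efficient

# Illustrative priorities -- yours to choose, not prescribed by HELM.
weights = {"accuracy": 0.5, "efficiency": 0.5}

def weighted_score(metrics: dict, weights: dict) -> float:
    """Collapse several metrics into one number using your own priorities."""
    return sum(weights[name] * value for name, value in metrics.items())

print(weighted_score(model_a, weights))  # ≈ 0.80
print(weighted_score(model_b, weights))  # ≈ 0.71 -- the accuracy gain doesn't pay for the efficiency loss
```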
HELM’s results are publicly available and continually updated. As of this writing, the Stanford team has benchmarked 119 models and advocates benchmarking new models as they come to market.
You can use the HELM leaderboard to compare models according to specific scenarios and metrics:
Screenshot of the HELM leaderboard
To get the most out of HELM, start by identifying the scenario most relevant to your LLM use case. (You can also choose a group of scenarios if more than one applies.) Separately, work out which metrics matter most to you and in what order.
Then, you can look at the leaderboard and decide which model is the best fit. Don’t forget to consider cost, terms of use, technical support, and other factors as well.
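Putting those steps together, here is a rough sketch of the selection process. The leaderboard rows, metric values, and prices below are placeholders you would replace with real HELM results and vendor pricing:

```python
# Hypothetical leaderboard excerpt for one scenario (all values are placeholders).
candidates = [
    {"model": "model-x", "accuracy": 0.78, "robustness": 0.70, "fairness": 0.75, "usd_per_1k_tokens": 0.002},
    {"model": "model-y", "accuracy": 0.82, "robustness": 0.68, "fairness": 0.72, "usd_per_1k_tokens": 0.030},
    {"model": "model-z", "accuracy": 0.74, "robustness": 0.76, "fairness": 0.80, "usd_per_1k_tokens": 0.001},
]

# Your priorities: weight the benchmark metrics, then filter on budget separately.
weights = {"accuracy": 0.6, "robustness": 0.2, "fairness": 0.2}
max_price = 0.01  # illustrative budget cap per 1k tokens

def score(candidate: dict) -> float:
    """Weighted benchmark score, ignoring non-benchmark factors like cost."""
    return sum(w * candidate[metric] for metric, w in weights.items())

affordable = [c for c in candidates if c["usd_per_1k_tokens"] <= max_price]
for c in sorted(affordable, key=score, reverse=True):
    print(f'{c["model"]}: {score(c):.3f}')
```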
HELM is a great resource for the AI community as a whole. It pushes the field to standardize comparisons between models, and it emphasizes the importance of fair and equitable AI.
That said, general-purpose benchmarks can only go so far. While HELM has tested thousands of configurations, it may not be representative of every element of your LLM product:

- Does HELM include a scenario that matches the exact task your application performs?
- Does it evaluate models the way you will actually use them, with your prompts and configuration, rather than just the base model out of the box?
HELM isn’t meant to be the one-stop shop to choose a large language model for your production application. It’s a research framework and corpus that also happens to be a good starting point for selecting an LLM. If you’re serious about evaluating all the options for building an AI application, you need something bespoke.
That means addressing both of the questions above. First, it’s important to test models against the exact task you intend to use them for. Second, evaluation doesn’t stop with base models: choosing effective prompts is just as important, if not more so.
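As a rough sketch of what that bespoke evaluation can look like, you can score every prompt-and-model combination on examples drawn from your actual task. The `call_model` and `grade` functions below are stand-ins for whatever client and grading logic your stack uses, and the models, prompts, and examples are placeholders:

```python
from itertools import product

# Stand-in for your real LLM client -- replace with a call to your provider.
def call_model(model: str, prompt: str) -> str:
    return f"[{model}] response to: {prompt}"

# Stand-in grader -- replace with a scorer suited to your task (exact match, rubric, LLM judge, ...).
def grade(output: str, expected: str) -> float:
    return 1.0 if expected.lower() in output.lower() else 0.0

models = ["model-x", "model-y"]                                   # shortlisted, e.g., from HELM
prompts = ["Answer concisely: {q}", "You are a support agent. {q}"]
examples = [{"q": "How do I reset my password?", "expected": "reset"}]  # your own task data

# Average score for every (model, prompt) combination on your examples.
results = {
    (model, prompt): sum(
        grade(call_model(model, prompt.format(q=ex["q"])), ex["expected"]) for ex in examples
    ) / len(examples)
    for model, prompt in product(models, prompts)
}
for (model, prompt), avg in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{avg:.2f}  {model}  |  {prompt}")
```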
We built the Verta GenAI Workbench to help you iterate on your LLM application and compare results between models and prompts. The Workbench comes with built-in benchmarking, custom-made for your use case. The leaderboards feature lets you iterate toward the ideal combination of prompt and model to best serve your users.