A Supreme Court justice and Sherlock Holmes have both been credited with versions of the same line: “I know it when I see it.” Data matters when making judgments, but it isn’t everything. Sometimes human insight is the most effective way to describe the intangibles.
As the ecosystem around evaluation for large language models (LLMs) matures, it’s important that we continue to keep humans in the loop. In this post, I’ll share how it’s possible to incorporate a human touch while still being rigorous. We’ll start with why this is important in the first place and then cover actionable strategies for building an LLM evaluation system.
When evaluating any ML model, and especially an LLM, most people fall back on rigid tests and single metrics. For example, you might say that a model should reach an accuracy of at least 0.7 on the test set to be considered “good,” or that an LLM should never admit it’s an AI system.
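To make this concrete, here is a minimal sketch of what such hard-coded guardrails often look like in practice; the accuracy threshold and the banned phrase are illustrative assumptions, not rules from any particular system.

```python
# A minimal sketch of rigid, rules-based guardrails (illustrative thresholds only).
def passes_guardrails(test_accuracy: float, sample_responses: list[str]) -> bool:
    # Rule 1: a fixed accuracy cutoff on the held-out test set.
    if test_accuracy < 0.7:
        return False
    # Rule 2: the model must never admit it's an AI system.
    banned_phrase = "i am an ai"
    return all(banned_phrase not in r.lower() for r in sample_responses)
```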
Very quickly, these kinds of guardrails encounter limitations. They’re usually implemented as rules-based systems that someone has to manually codify based on domain knowledge, making them very difficult to scale.
Not only are these rules taxing to create and maintain, but they’re often ignored. It’s not uncommon for teams to opt for a model that performs worse on paper but feels better in practice.
Relying on intuition and holistic evaluation isn’t necessarily a bad thing. Actually, it highlights how traditional evaluation doesn’t capture the full scope of what matters in applied AI. Humans can recognize patterns that would otherwise be hard to articulate and even harder to encode into rules for repeated testing.
I’m not suggesting we abandon principled LLM evaluation in favor of gut instinct; that would be far too erratic. Rather, a full-fledged LLM evaluation system ought to include direct human evaluation as an integral component.
In practical terms, I’ve found there are three key components to a successful human-centered LLM evaluation system.
Comparison-based evaluation
Humans are notoriously bad at coming up with objective scores. One person’s 4-star rating might be 3 stars for someone else, and some raters leave the extreme ends of the scale unused just in case something better (or worse) comes along.
People are generally much more effective at comparing two things and saying which one is better. To minimize statistical noise and produce more meaningful data, LLM evaluation systems need to have comparison-based interfaces.
Be mindful that pairwise comparisons can get very expensive: for n items there are n(n-1)/2 comparisons, so 10 items require 45 and 20 items require 190. Usually, it’s not necessary to make every single comparison to determine the best choice. I’d recommend drawing from the field of active learning for techniques that make only the most informative comparisons, as in the sketch below.
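Here is a minimal sketch of that idea, assuming Elo-style ratings and a near-tie selection heuristic; the item names, K-factor, and update rule are my own illustrative choices, not a prescribed method.

```python
import math
from itertools import combinations

# 45 pairwise comparisons for 10 candidate responses.
items = [f"response_{i}" for i in range(10)]
print(math.comb(len(items), 2))

# Start every response at the same rating; human judgments will separate them.
ratings = {item: 1000.0 for item in items}

def next_pair(ratings, already_compared):
    """Active-learning-style heuristic: the most informative comparison is the
    pair of not-yet-judged items whose current ratings are closest (a near-tie)."""
    candidates = [p for p in combinations(ratings, 2)
                  if frozenset(p) not in already_compared]
    return min(candidates, key=lambda p: abs(ratings[p[0]] - ratings[p[1]]))

def record_judgment(ratings, winner, loser, k=32.0):
    """Elo-style update once a human rater picks a winner for the pair."""
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
    ratings[winner] += k * (1.0 - expected_win)
    ratings[loser] -= k * (1.0 - expected_win)
```

In use, you would call `next_pair` to pick the comparison to show a rater, then `record_judgment` with their choice, and repeat until the ranking stabilizes rather than exhausting every pair.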
Multiple evaluation metrics
LLM accuracy isn’t the only metric that matters. Qualities like bias, friendliness, and the contextual relevance of responses are also critical in determining whether a model is ready for production.
The nice thing about human evaluation is that it can inherently measure many of these qualities. That said, you can also be explicit in asking human raters for feedback: instead of only asking generically which model is “better,” ask raters to weigh in on the specific characteristics you’re trying to optimize.
Gathering feedback on multiple criteria makes it easier to iterate on an LLM. Developers get a better sense of where a particular model is weak relative to others, so prompt engineering or fine-tuning can focus on that specific pain point.
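As a sketch of what this might look like in practice (the criteria names, record fields, and helper function are hypothetical, not a specific product schema):

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Judgment:
    prompt_id: str
    model_a: str
    model_b: str
    # Per-criterion winners, e.g. {"accuracy": "model_a", "friendliness": "model_b"}.
    winners_by_criterion: dict = field(default_factory=dict)

def weak_spots(judgments: list, model: str) -> Counter:
    """Count how often a model loses on each criterion, to guide the next iteration."""
    losses = Counter()
    for j in judgments:
        if model not in (j.model_a, j.model_b):
            continue
        for criterion, winner in j.winners_by_criterion.items():
            if winner != model:
                losses[criterion] += 1
    return losses
```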
Incremental evaluation that leverages existing feedback
Building with LLMs is an iterative process, so your evaluation system should match that cadence. It doesn’t make sense to discard feedback from earlier iterations and start fresh with each version. If anything, direct comparisons against past models are the best way to validate that the changes you’ve made are a net positive.
Leveraging past feedback can also help you be strategic about what evaluations to run next. If you have sufficient evidence that your LLM is accurate for a given domain, subsequent evaluations can concentrate on other dimensions that you’re looking to improve.
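One way to operationalize this, sketched below with an assumed record format and made-up thresholds, is to only queue new comparisons for criteria that earlier rounds haven’t already settled:

```python
# A hedged sketch of reusing past feedback: only gather new comparisons for
# criteria where earlier rounds haven't already produced a clear verdict.
def criteria_still_open(past_results, all_criteria, min_votes=30, min_win_rate=0.7):
    """past_results maps each criterion to (wins_for_new_model, total_votes)
    accumulated from earlier evaluation rounds."""
    open_criteria = []
    for criterion in all_criteria:
        wins, total = past_results.get(criterion, (0, 0))
        if total < min_votes or wins / total < min_win_rate:
            open_criteria.append(criterion)
    return open_criteria

# Example: accuracy is already well-established, so focus raters elsewhere.
past = {"accuracy": (48, 60), "friendliness": (9, 20)}
print(criteria_still_open(past, ["accuracy", "friendliness", "bias"]))
# ['friendliness', 'bias']
```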
Having done a PhD in optimization and then spent several years in industry, I’ve seen firsthand how academic AI and applied AI target very different objectives. LLM benchmarks are great research tools, but in a production setting they’re usually too contrived.
Companies that rely on LLMs to serve end users have unique challenges spanning domains that benchmarks can only approximate. There’s no substitute for human feedback. That said, many companies have yet to systematize how they collect feedback and iterate based on it, so there’s certainly room for improvement.
At Verta, we’re building an LLM evaluation platform to help companies launch and land GenAI products. We’re implementing best practices such as multiple evaluation metrics and iterative feedback loops so that builders don’t have to reinvent and rediscover these systems on their own.