Why machine learning benchmarks have worked, the crisis they now face, and how to preserve the old engine of progress.
Machine learning turns on one simple trick: Split your data into training and test sets. Anything goes on the training set. Rank models on the test set. Let model builders compete. Call it a benchmark.
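As a concrete illustration, here is a minimal sketch of that protocol in scikit-learn. The dataset and the competing models are arbitrary stand-ins; any choices would do.

```python
# Minimal sketch of the benchmark protocol: split, train, rank.
# The dataset and the contenders below are placeholders.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

contenders = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(),
}

# Anything goes on the training set; the test set only produces the ranking.
leaderboard = sorted(
    ((name, model.fit(X_train, y_train).score(X_test, y_test))
     for name, model in contenders.items()),
    key=lambda item: item[1],
    reverse=True,
)
for rank, (name, accuracy) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name}: {accuracy:.3f}")
```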
Machine learning researchers cherish a good tradition of lamenting the apparent shortcomings of machine learning benchmarks. Critics argue that static test sets and metrics promote narrow research objectives, stifling more creative scientific pursuits. Benchmarks also incentivize gaming; in fact, Goodhart's law cautions against applying competitive pressure to statistical measurement. Over time, they say, researchers overfit to benchmarks, building models that exploit data artifacts. As a result, test set performance draws a skewed picture of model capabilities, deceiving us especially when comparing humans and machines. Top off the list of issues with a slew of reasons why things don't transfer from benchmarks to the real world.
These scorching critiques go hand in hand with serious ethical objections. Benchmarks reinforce and perpetuate biases in our representation of people, social relationships, culture, and society. Worse, the creation of massive human-annotated datasets extracts labor from a marginalized workforce disenfranchised from the economic gains their efforts enable.
All of this is true.
Many have said it well. Many have argued it convincingly. I'm particularly drawn to the claim that benchmarks serve industry objectives, giving big tech labs a structural advantage. The case against benchmarks is clear, in my view.
What's far less clear is the scientific case for benchmarks.
There is the undeniable fact that benchmarks have been successful as a driver of progress in the field. ImageNet was inseparable from the deep learning revolution of the 2010s, with companies competing fiercely over the best dog breed classifiers. The difference between a Blenheim Spaniel and a Welsh Springer became a matter of serious rivalry. A decade later, language model benchmarks reached geopolitical significance in the global competition over artificial intelligence. Tech CEOs now recite their company's number on MMLU—a set of 14,042 college-level multiple choice questions—in presentations to shareholders. I'm writing not long after news broke that David beat Goliath on reasoning benchmarks, a sensation that shook global stock markets.
Benchmarks come and go, but their centrality hasn't changed. Competitive leaderboard climbing has been the main way machine learning advances.
If we accept that progress in artificial intelligence is real, we must also accept that benchmarks have, in some sense, worked. But the fact that benchmarks worked is more of a hindsight observation than a scientific lesson. Benchmarks emerged in the early days of pattern recognition. They followed no scientific principles. To the extent that benchmarks had any theoretical support, that theory was readily invalidated by how people used benchmarks in practice. Statistics prescribed locking test sets in a vault, but machine learning practitioners did the opposite. They put them on the internet for everyone to use freely. Popular benchmarks draw millions of downloads and evaluations as model builders incrementally compete over better numbers.
Benchmarks are the mistake that made machine learning. They shouldn't have worked and, yet, they did.
Benchmarks emerged from little more than common sense and intuition. They appeared in the late 1950s, had some life during the 1960s, hibernated throughout the 1970s, and sprang to popularity in the late 1980s when pattern recognition became machine learning. Today, benchmarks are so ubiquitous that we take them for granted. And we expect them to do their job. After all, they have in the past.
The deep learning revolution of the 2010s was a triumph for the benchmarking enterprise. The ImageNet benchmark was at the center of all the cutting-edge advances in image classification with deep convolutional neural networks. Despite massive competitive pressure, it reliably supported model improvements for nearly a decade. Throughout its long life, a sprawling software ecosystem grew around ImageNet, making it ever simpler and faster to develop models on the benchmark.
Progress on ImageNet according to Papers With Code.
Even its tiny cousin, CIFAR-10, did surprisingly well for itself. Model builders often put CIFAR-10 into the development loop for architecture search. Once they found a promising candidate architecture, they would then scale it up to ImageNet. Folklore has it that some of the best ImageNet architectures saw the light of day on CIFAR-10 first. Even though CIFAR-10 contains only tiny pixelated test images from ten classes, such as frogs, trucks, and ships, the dataset was any model builder's Swiss Army knife for many years. The platform Papers With Code counts more than 15,000 papers published with CIFAR-10 evaluations. This does not count the numerous evaluations that went into every single one of these papers. It also doesn't count the enormous amount of engineering work that used the dataset in one way or another.
That so much work should hinge on so little data might seem reckless.
With the benefit of hindsight, though, researchers verified that the ranking of popular models on CIFAR-10 largely agrees with the ranking of the same models on ImageNet. Model rankings on ImageNet in turn transfer well to many other datasets. The better a model is on ImageNet, the better it is elsewhere, too. Researchers even created a dataset called ImageNot, full of noisy web crawl data, designed to stray as far as possible from ImageNet while matching only its size. When all key ImageNet-era architectures are trained on ImageNot from scratch, the model rankings turn out to be the same.
Stability of ImageNet model rankings (left) and relative accuracy improvements over AlexNet (right).
Model rankings replicate to a surprising degree, which is in stark contrast with absolute performance numbers. Model accuracies and other metrics don't replicate from one dataset to the other, even when the datasets are similar. This means that model rankings—rather than model evaluations—are the primary scientific export of machine learning benchmarks.
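One way to make the replication claim concrete is to compare two leaderboards of the same models with a rank correlation. The numbers below are made up for illustration; the point is only that orderings can agree perfectly while absolute accuracies differ a lot.

```python
# Hedged sketch: compare two leaderboards of the same models with a rank
# correlation. The accuracy numbers are invented for illustration, not
# actual measurements on any dataset.
from scipy.stats import kendalltau

models = ["AlexNet", "VGG-16", "ResNet-50", "DenseNet-121", "EfficientNet-B0"]
accuracy_dataset_a = [0.57, 0.72, 0.76, 0.77, 0.79]  # e.g., an ImageNet-like benchmark
accuracy_dataset_b = [0.31, 0.44, 0.49, 0.50, 0.53]  # e.g., a very different dataset

# Absolute numbers differ substantially, yet the ordering agrees perfectly.
tau, p_value = kendalltau(accuracy_dataset_a, accuracy_dataset_b)
print(f"Kendall's tau between the two rankings: {tau:.2f} (p={p_value:.3f})")
```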
The ImageNet era didn’t end with computer vision; it ended with natural language processing, as the new transformer architecture triumphed over the sluggish recurrent neural networks that had long been the workhorse for sequence problems of all kinds. Transformers were much easier to scale up and quickly took over. The simple training objective of next token prediction in a text sequence freed training data from human annotation. Companies quickly scraped up anything they could find on the internet, from chat messages and Reddit rants to humanity's finest writing. New scaling laws suggested that training loss decreases predictably as you jointly increase dataset size, number of model parameters, and training compute. For a while it seemed as though the only thing left to do was to sit back and watch new abilities emerge in explosively growing models. But where did this new reality leave benchmarks?
The new era departs from the old in some significant ways.
First, models train on the internet, or at least massive minimally curated web crawls. At the point of evaluation, we therefore don't know and can't control what training data the model saw. This turns out to have profound implications for benchmarking. The extent to which a model has encountered data similar to the test task during training skews model comparisons and threatens the validity of model rankings. A worse model may have simply crammed better for the test. Would you prefer a worse student who came better prepared to the exam, or the better student who was less prepared? If it's the latter, you'll need to adjust for the difference in test preparation. Thankfully, this can be done by fine-tuning each model on the same task-specific data before evaluation without the need to train from scratch.
Model rankings on the popular GSM8k (Grade School Math 8k) benchmark before and after giving each model the same preparation for the benchmark. Preparation for the test drastically changes model rankings.
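In code, the adjusted comparison is just a loop: give every candidate the same preparation, then rank on the held-out test split. The sketch below is schematic; `fine_tune` and `evaluate` are hypothetical placeholders for whatever training and evaluation stack you use.

```python
# Schematic sketch of the "equal preparation" protocol described above.
# `fine_tune` and `evaluate` are placeholders supplied by the caller.
from typing import Callable, Dict, List, Tuple


def rank_with_equal_preparation(
    models: Dict[str, object],
    prep_data: List[dict],    # shared task-specific training split
    test_data: List[dict],    # held-out benchmark test split
    fine_tune: Callable[[object, List[dict]], object],
    evaluate: Callable[[object, List[dict]], float],
) -> List[Tuple[str, float]]:
    """Fine-tune every candidate on identical data, then rank on the test split."""
    scores = {}
    for name, model in models.items():
        prepared = fine_tune(model, prep_data)   # same preparation for everyone
        scores[name] = evaluate(prepared, test_data)
    # The ranking, not the raw scores, is what the adjusted comparison exports.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```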
Second, models no longer solve a single task, but can be prompted to tackle pretty much any task. In response, multi-task benchmarks have emerged as the de facto standard to provide a holistic evaluation of recent models by aggregating performance across numerous tasks into a single ranking. Aggregating rankings, however, is a thorny problem in social choice theory that has no perfect solution. Working from an analogy between multi-task benchmarks and voting systems, ideas from social choice theory reveal inherent trade-offs that multi-task benchmarks face. Specifically, greater task diversity necessarily comes at the cost of greater sensitivity to irrelevant changes. For example, adding weak models to popular multi-task benchmarks can change the order of top contenders. The familiar stability of model rankings, characteristic of the ImageNet era, therefore does not extend to multi-task benchmarks in the LLM era.
Ranking instability in the LLM era. Left: Adding weak models to the HELM multi-task benchmark. Right: Using LLaMa3-70b as a judge.
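To see how this sensitivity can arise, consider a toy mean-rank aggregation with made-up numbers, not drawn from any real benchmark: a weak model that never wins a single task still flips the order of the two top contenders.

```python
# Toy illustration (invented numbers): aggregate a multi-task benchmark by
# mean rank, then add a weak model that never wins a single task.
# The order of the two top contenders flips.
import numpy as np


def mean_rank_leaderboard(scores):
    """Rank models on each task (1 = best), then average ranks across tasks."""
    names = list(scores)
    table = np.array([scores[name] for name in names])    # models x tasks
    ranks = (-table).argsort(axis=0).argsort(axis=0) + 1  # per-task ranks
    return sorted(zip(names, ranks.mean(axis=1)), key=lambda kv: kv[1])


scores = {
    "model_A": [0.90, 0.85, 0.50, 0.45, 0.40],
    "model_B": [0.80, 0.75, 0.70, 0.65, 0.60],
}
print(mean_rank_leaderboard(scores))   # model_B ranks ahead of model_A

scores["weak_model"] = [0.85, 0.80, 0.20, 0.20, 0.20]
print(mean_rank_leaderboard(scores))   # model_A now ranks ahead of model_B
```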
The final problem benchmarking faces is an existential one. As model capabilities exceed those of human evaluators, researchers are running out of ways to test new models. There's hope that models might be able to evaluate each other. But the idea of using models as judges runs into some serious hurdles. LLM judges are biased, unsurprisingly, in their own favor. Intriguing recent methods from statistics promise to debias model predictions using only a few human ground-truth labels. Unfortunately, at the evaluation frontier—where new models are at least as good as the judge—even the optimal debiasing method is no better than collecting twice as many ground-truth labels for evaluation.
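As a rough illustration of the statistical idea, here is a simplified difference-style estimator with placeholder data: score every answer with the judge, then correct the judge's bias as measured on a small human-labeled subset. This is a sketch of the general recipe, not the exact method from any particular paper.

```python
# Hedged sketch of judge debiasing with a handful of human labels.
# All data below is synthetic placeholder material.
import numpy as np


def debiased_pass_rate(judge_all, judge_on_labeled, human_labels):
    """Judge's average verdict on all examples, corrected by the judge's
    bias as measured on the small human-labeled subset."""
    bias = np.mean(judge_on_labeled) - np.mean(human_labels)
    return np.mean(judge_all) - bias


# Placeholder numbers: the judge scores 5,000 model answers; humans check 100,
# disagreeing with the judge on roughly 10% of them.
judge_all = np.random.default_rng(0).integers(0, 2, size=5000)
judge_on_labeled = judge_all[:100]
human_labels = (judge_on_labeled + (np.random.default_rng(1).random(100) < 0.1)) % 2

print(f"raw judge estimate:      {judge_all.mean():.3f}")
print(f"bias-corrected estimate: "
      f"{debiased_pass_rate(judge_all, judge_on_labeled, human_labels):.3f}")
```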
Will our old engine of progress grind to a halt?
In moments of crisis, we tend to accelerate. What if we instead step back and ask why we expected benchmarks to work in the first place—and for what purpose? This question leads us into uncharted territory. For the longest time, we took benchmarks for granted and didn’t bother to work out the underlying methodology. We got away with it mostly by sheer luck, but we might not be as lucky this time.
Over the last decade, however, a growing body of work has begun to map out the foundations of a science of machine learning benchmarks. What emerges is a rich set of observations—both theoretical and empirical—raising intriguing open problems that deserve the community’s attention. If benchmarks are to serve us well in the future, we must put them on solid scientific ground. My upcoming book, The Emerging Science of Benchmarks, supports this development.
Article cross-listed at SIAM News: https://www.siam.org/publications/siam-news/articles/the-emerging-science-of-machine-learning-benchmarks/