Benchmarking is a process of continual improvement through competitive testing, central to engineering communities. Although benchmarking has long fueled progress in machine learning, there is a growing crisis surrounding the evaluation of recent generative models. In this talk, I'll discuss the causes of this crisis and how to achieve valid model comparisons—and, by extension, valid model rankings. Currently, different benchmarks yield contradictory comparisons, even when targeting the same task. Multi-task benchmarks exacerbate ranking disagreements, as do attempts to scale up evaluation. Toward diagnosing the problem, I'll argue that ranking validity breaks down when models receive different degrees of task-relevant preparation—a problem called training on the test task. Training on the test task confounds comparisons under direct (i.e., black-box) evaluation. To address this, I'll motivate net-of-effort evaluation: comparing models only after giving each the same task-relevant preparation. Although simple, net-of-effort evaluation has far-reaching consequences. First, it produces substantial ranking agreement across benchmarks and tasks. Second, it reduces the benchmark–model score matrix to essentially rank one. Third, it helps resolve puzzles about emergent abilities and scaling laws for downstream tasks.
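To make the rank-one claim concrete, here is a minimal sketch (not from the talk; all names and numbers are hypothetical) of how one might quantify how close a benchmark–model score matrix is to rank one, using the share of variance captured by its leading singular value.

```python
# Illustrative sketch only: S is a hypothetical score matrix with one row per
# benchmark and one column per model, e.g. accuracies measured after giving
# every model the same task-relevant preparation.
import numpy as np

def rank_one_share(S: np.ndarray) -> float:
    """Fraction of the squared Frobenius norm explained by the top singular value."""
    singular_values = np.linalg.svd(S, compute_uv=False)
    return float(singular_values[0] ** 2 / np.sum(singular_values ** 2))

# Toy example with made-up numbers: an almost rank-one matrix, in which every
# benchmark orders the models the same way up to scale, plus small noise.
rng = np.random.default_rng(0)
benchmark_scale = np.array([0.6, 0.8, 1.0, 1.2])     # one entry per benchmark
model_skill = np.array([0.5, 0.65, 0.7, 0.9, 0.95])  # one entry per model
S = np.outer(benchmark_scale, model_skill) + 0.01 * rng.standard_normal((4, 5))

print(f"Share of variance in the top singular direction: {rank_one_share(S):.3f}")
```

A share close to 1 means a single model-quality direction explains nearly all benchmark scores, which is one way to read the "essentially rank one" statement above.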
Moritz Hardt is a director at the Max Planck Institute for Intelligent Systems. Prior to joining the institute, he was Associate Professor of Electrical Engineering and Computer Sciences at the University of California, Berkeley. His research contributes to the scientific foundations of machine learning and algorithmic decision making, with a focus on social questions. Hardt is the author of the upcoming book The Emerging Science of Machine Learning Benchmarks (Princeton University Press), and a co-author of the textbooks Fairness and Machine Learning: Limitations and Opportunities (MIT Press) and Patterns, Predictions, and Actions: Foundations of Machine Learning (Princeton University Press).