Social Foundations of Computation Talk
25 September 2025 at 10:00 - 11:00 | Lecture Hall N0.002, MPI-IS Tübingen

How benchmarking broke in the LLM era and what to salvage

IMPRS-IS Keynote Lecture by Moritz Hardt


Benchmarking is a process of continual improvement through competitive testing, central to engineering communities. Although benchmarking has long fueled progress in machine learning, there is a growing crisis around the evaluation of recent generative models. In this talk, I'll discuss the causes of this crisis and how to achieve valid model comparisons—and, by extension, valid model rankings. Currently, different benchmarks yield contradictory comparisons, even when targeting the same task. Multi-task benchmarks exacerbate ranking disagreements, as do attempts to scale up evaluation. Toward diagnosing the problem, I'll argue that ranking validity breaks down when models receive different degrees of task-relevant preparation—a problem called training on the test task. Training on the test task confounds comparisons under direct (i.e., black-box) evaluation. To address this, I'll motivate net-of-effort evaluation: comparing models only after giving each the same task-relevant preparation. Although simple, net-of-effort evaluation has far-reaching consequences. First, it produces significant ranking agreement across benchmarks and tasks. Second, it reduces the benchmark–model score matrix to essentially rank one. Third, it helps resolve puzzles about emergent abilities and scaling laws for downstream tasks.
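
To make concrete what an "essentially rank one" benchmark–model score matrix means, here is a minimal, self-contained sketch (not from the talk; the score matrix below is made-up illustrative data, and the variance-share criterion is one common way to quantify approximate rank-one structure):

```python
# Sketch: check how much of a benchmark-by-model score matrix is explained
# by its top singular component. The numbers are illustrative, not real results.
import numpy as np

# rows = benchmarks, columns = models; an approximately rank-one matrix looks
# like an outer product of a benchmark-difficulty vector and a model-ability vector
scores = np.array([
    [0.62, 0.70, 0.78],
    [0.41, 0.47, 0.52],
    [0.55, 0.63, 0.69],
])

singular_values = np.linalg.svd(scores, compute_uv=False)
top_share = singular_values[0] ** 2 / np.sum(singular_values ** 2)
print(f"share of variance in the top singular component: {top_share:.3f}")
# A share close to 1.0 means the benchmarks induce (approximately) the same
# model ranking: every row orders models by the same underlying ability vector.
```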