
The Emerging Science of Machine Learning Benchmarks

Social Foundations of Computation, Director

Machine learning turns on one simple trick: Split the data into training and test sets. Anything goes on the training set. Rank models on the test set and let model builders compete. Call it a benchmark.

Machine learning researchers cherish a good tradition of lamenting the apparent shortcomings of benchmarks. Critics argue that static test sets and metrics promote narrow research objectives, stifling more creative scientific pursuits. Benchmarks also incentivize gaming; in fact, Goodhart's Law cautions against applying competitive pressure to statistical measurement. Over time, researchers may overfit to benchmarks, building models that exploit data artifacts. As a result, test set performance draws a skewed picture of model capabilities that deceives us—especially when comparing humans and machines. To top off the list of issues, there are a slew of reasons why things don't transfer well from benchmarks to the real world.
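
As a minimal sketch of the benchmark protocol described above (not taken from the book; the dataset and the two contestant models are illustrative placeholders), the following Python snippet fixes a single train/test split, lets each contestant fit on the training data however it likes, and ranks contestants by test accuracy:

# Illustrative sketch of a benchmark: one fixed train/test split,
# arbitrary training on the training set, ranking on the test set.
# Dataset and models are placeholders, not examples from the book.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)

# Split the data once; the test set stays fixed for every contestant.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# "Anything goes on the training set": each contestant trains as it pleases.
contestants = {
    "logistic_regression": LogisticRegression(max_iter=2000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Rank models on the shared test set and print the resulting leaderboard.
leaderboard = sorted(
    ((name, model.fit(X_train, y_train).score(X_test, y_test))
     for name, model in contestants.items()),
    key=lambda item: item[1],
    reverse=True,
)
for rank, (name, accuracy) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name}: test accuracy = {accuracy:.3f}")

The essential design choice is that the test set is held fixed and shared: competitive pressure applies only to the ranking it induces, which is exactly what the critiques of gaming and overfitting summarized above take aim at.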

Author(s): Hardt, Moritz
Year: 2025
Bibtex Type: Book (book)
State: Published
URL: https://mlbenchmarks.org/

BibTeX

@book{hardt2025emerging,
  title = {The Emerging Science of Machine Learning Benchmarks},
  abstract = {Machine learning turns on one simple trick: Split the data into training and test sets. Anything goes on the training set. Rank models on the test set and let model builders compete. Call it a benchmark.
  
  
  Machine learning researchers cherish a good tradition of lamenting the apparent shortcomings of benchmarks. Critics argue that static test sets and metrics promote narrow research objectives, stifling more creative scientific pursuits. Benchmarks also incentivize gaming; in fact, Goodhart's Law cautions against applying competitive pressure to statistical measurement. Over time, researchers may overfit to benchmarks, building models that exploit data artifacts. As a result, test set performance draws a skewed picture of model capabilities that deceives us—especially when comparing humans and machines. To top off the list of issues, there are a slew of reasons why things don't transfer well from benchmarks to the real world.},
  year = {2025},
  slug = {hardt2025emerging},
  author = {Hardt, Moritz},
  url = {https://mlbenchmarks.org/}
}