
The Emerging Science of Machine Learning Benchmarks

Social Foundations of Computation, Director

Machine learning turns on one simple trick: Split the data into training and test sets. Anything goes on the training set. Rank models on the test set and let model builders compete. Call it a benchmark.

Machine learning researchers cherish a good tradition of lamenting the apparent shortcomings of benchmarks. Critics argue that static test sets and metrics promote narrow research objectives, stifling more creative scientific pursuits. Benchmarks also incentivize gaming; in fact, Goodhart's Law cautions against applying competitive pressure to statistical measurement. Over time, researchers may overfit to benchmarks, building models that exploit data artifacts. As a result, test set performance draws a skewed picture of model capabilities that deceives us—especially when comparing humans and machines. To top off the list of issues, there are a slew of reasons why things don't transfer well from benchmarks to the real world.
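
As a minimal sketch of the benchmark protocol described above (not taken from the book; the dataset and the two contestant models are illustrative placeholders), the following Python snippet fixes a single train/test split, lets each contestant fit on the training data however it likes, and ranks contestants by test accuracy:

# Illustrative sketch of a benchmark: one fixed train/test split,
# arbitrary training on the training set, ranking on the test set.
# Dataset and models are placeholders, not examples from the book.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)

# Split the data once; the test set stays fixed for every contestant.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# "Anything goes on the training set": each contestant trains as it pleases.
contestants = {
    "logistic_regression": LogisticRegression(max_iter=2000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Rank models on the shared test set and print the resulting leaderboard.
leaderboard = sorted(
    ((name, model.fit(X_train, y_train).score(X_test, y_test))
     for name, model in contestants.items()),
    key=lambda item: item[1],
    reverse=True,
)
for rank, (name, accuracy) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name}: test accuracy = {accuracy:.3f}")

The essential design choice is that the test set is held fixed and shared: competitive pressure applies only to the ranking it induces, which is exactly what the critiques of gaming and overfitting summarized above take aim at.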

Author(s): Hardt, Moritz
Year: 2025
Bibtex Type: Book (book)
State: Published
URL: https://mlbenchmarks.org/

BibTeX

@book{hardt2025emerging,
  title = {The Emerging Science of Machine Learning Benchmarks},
  abstract = {Machine learning turns on one simple trick: Split the data into training and test sets. Anything goes on the training set. Rank models on the test set and let model builders compete. Call it a benchmark.
  
  
  Machine learning researchers cherish a good tradition of lamenting the apparent shortcomings of benchmarks. Critics argue that static test sets and metrics promote narrow research objectives, stifling more creative scientific pursuits. Benchmarks also incentivize gaming; in fact, Goodhart's Law cautions against applying competitive pressure to statistical measurement. Over time, researchers may overfit to benchmarks, building models that exploit data artifacts. As a result, test set performance draws a skewed picture of model capabilities that deceives us—especially when comparing humans and machines. To top off the list of issues, there are a slew of reasons why things don't transfer well from benchmarks to the real world.},
  year = {2025},
  slug = {hardt2025emerging},
  author = {Hardt, Moritz},
  url = {https://mlbenchmarks.org/}
}