Social Foundations of Computation Miscellaneous 2025

Train-before-Test Harmonizes Language Model Rankings

Guanhua Zhang
Social Foundations of Computation

Ricardo Dominguez-Olmedo
Empirical Inference, Social Foundations of Computation
  • Doctoral Researcher

Moritz Hardt
Social Foundations of Computation
  • Director

Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. Recent work attributed ranking disagreement to the phenomenon of training on the test task: As released, different models exhibit a different level of preparation for any given test task. A candidate solution to the problem is train-before-test: Give each model the same benchmark-specific finetuning before evaluation. Our primary contribution is a broad empirical evaluation of train-before-test across 24 benchmarks and 61 models. We show that train-before-test significantly improves ranking agreement consistently across all benchmarks. Whereas rankings have little external validity to start with, they enjoy a significant degree of external validity when applying train-before-test: Model rankings transfer gracefully from one benchmark to the other. Even within the same model family, train-before-test reduces strong ranking disagreement to near-perfect agreement. In addition, train-before-test reduces the model-score matrix to essentially rank one, revealing new insights into the latent factors of benchmark performance. Our work supports the recommendation to make train-before-test a default component of LLM benchmarking.
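As a rough illustration (not code from the paper), the two quantities the abstract refers to can be made concrete: ranking agreement between benchmarks is typically measured with a rank correlation such as Kendall's tau, and the "essentially rank one" observation can be probed via the singular values of the models-by-benchmarks score matrix. The snippet below is a minimal Python sketch with made-up placeholder scores; variable names are illustrative and not taken from the paper. Train-before-test would sit upstream of such an analysis: each model receives the same benchmark-specific finetuning before its scores enter the matrix.

import numpy as np
from scipy.stats import kendalltau

# Hypothetical per-model scores on two benchmarks (same model order); placeholder values.
scores_benchmark_a = np.array([0.62, 0.48, 0.71, 0.55, 0.80])
scores_benchmark_b = np.array([0.35, 0.52, 0.60, 0.41, 0.58])

# Ranking agreement: Kendall's tau between the two induced model rankings.
tau, _ = kendalltau(scores_benchmark_a, scores_benchmark_b)
print(f"Kendall's tau between benchmark rankings: {tau:.2f}")

# How close is the (models x benchmarks) score matrix to rank one?
# If the top singular value dominates, a single latent factor accounts for the scores.
score_matrix = np.column_stack([scores_benchmark_a, scores_benchmark_b])
singular_values = np.linalg.svd(score_matrix, compute_uv=False)
top_share = singular_values[0] ** 2 / np.sum(singular_values ** 2)
print(f"Share of squared Frobenius norm captured by the top singular value: {top_share:.2%}")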

Author(s): Zhang, Guanhua and Dominguez-Olmedo, Ricardo and Hardt, Moritz
Links: arXiv
Year: 2025
Month: July
Bibtex Type: Miscellaneous (misc)
State: Submitted

BibTeX

@misc{zhang2025train,
  title = {Train-before-Test Harmonizes Language Model Rankings},
  abstract = {Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. Recent work attributed ranking disagreement to the phenomenon of training on the test task: As released, different models exhibit a different level of preparation for any given test task. A candidate solution to the problem is train-before-test: Give each model the same benchmark-specific finetuning before evaluation. Our primary contribution is a broad empirical evaluation of train-before-test across 24 benchmarks and 61 models. We show that train-before-test significantly improves ranking agreement consistently across all benchmarks. Whereas rankings have little external validity to start with, they enjoy a significant degree of external validity when applying train-before-test: Model rankings transfer gracefully from one benchmark to the other. Even within the same model family, train-before-test reduces strong ranking disagreement to near-perfect agreement. In addition, train-before-test reduces the model-score matrix to essentially rank one, revealing new insights into the latent factors of benchmark performance. Our work supports the recommendation to make train-before-test a default component of LLM benchmarking.},
  month = jul,
  year = {2025},
  slug = {zhang2025train},
  author = {Zhang, Guanhua and Dominguez-Olmedo, Ricardo and Hardt, Moritz},
  month_numeric = {7}
}