
Training on the Test Task Confounds Evaluation and Emergence

Top panel: Model accuracy on the MMLU benchmark as a function of pretraining compute, for newer models (orange) and older models (blue). Newer models appear to use compute more effectively, and high accuracy on MMLU appears to be emergent. Bottom panel: After adjusting for training on the test task, newer and older models follow the same scaling law, and accuracy begins to improve at a much smaller model scale.

Members

Empirical Inference, Social Foundations of Computation
  • Doctoral Researcher
Social Foundations of Computation
  • Doctoral Researcher
Moritz Hardt
Social Foundations of Computation
  • Director

Publications

Training on the Test Task Confounds Evaluation and Emergence
Dominguez-Olmedo, R., Dorner, F. E., Hardt, M.
The Thirteenth International Conference on Learning Representations (ICLR 2025), January 2025 (Accepted)