Training on the Test Task Confounds Evaluation and Emergence | Social Foundations of Computation – Max Planck Institute for Intelligent Systems

Training on the Test Task Confounds Evaluation and Emergence

Figure: Training on the test task confounds evaluation and emergence. Top panel: model accuracy on the MMLU benchmark as a function of pretraining compute, with newer models in orange and older models in blue. Newer models appear to make better use of compute, and high accuracy on MMLU appears to be emergent. Bottom panel: after adjusting for training on the test task, newer and older models follow the same scaling law, and accuracy picks up at a much smaller model scale.

Benchmarking works best if all models have the same training data. Model builders today optimize training data mixes with the test task in mind, leading to confounded evaluations. We call this problem training on the test task and propose a simple adjustment that mitigates it.
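The comparison described in the figure can be made concrete by asking whether newer and older models share one accuracy-versus-compute trend. The sketch below fits a single line in log compute plus an offset for newer models; an offset near zero corresponds to the bottom panel, where both cohorts follow the same scaling law. This is a minimal illustration and not the authors' procedure: the function name `fit_cohort_gap`, the linear-in-log-compute form, and all numbers are assumptions made up for this example.

```python
import numpy as np


def fit_cohort_gap(log_compute, accuracy, is_newer):
    """Least-squares fit of accuracy = a + b * log_compute + theta * is_newer.

    theta estimates the accuracy gap between newer and older models
    at equal pretraining compute (hypothetical model form).
    """
    x = np.asarray(log_compute, dtype=float)
    y = np.asarray(accuracy, dtype=float)
    d = np.asarray(is_newer, dtype=float)
    # Columns: intercept, log compute, newer-model indicator.
    design = np.column_stack([np.ones_like(x), x, d])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    intercept, slope, theta = coef
    return intercept, slope, theta


# Synthetic, noiseless values for illustration only (not results from the paper).
log_compute = np.array([21.0, 22.0, 23.0, 24.0, 21.5, 22.5, 23.5, 24.5])
is_newer = np.array([0, 0, 0, 0, 1, 1, 1, 1])
accuracy = 0.10 * (log_compute - 20.0) + 0.15 * is_newer

intercept, slope, theta = fit_cohort_gap(log_compute, accuracy, is_newer)
print(f"slope per log-unit of compute: {slope:.3f}, newer-model offset: {theta:.3f}")
```

Running the same fit on accuracies measured before and after an adjustment would show whether the offset for newer models shrinks, which is the pattern the two panels contrast.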

Members

Empirical Inference, Social Foundations of Computation

Ricardo Dominguez-Olmedo

Doctoral Researcher

Social Foundations of Computation

Florian Dorner

Doctoral Researcher

Social Foundations of Computation

Moritz Hardt

Director

Publications

Dominguez-Olmedo, R., Dorner, F. E., Hardt, M. Training on the Test Task Confounds Evaluation and Emergence. The Thirteenth International Conference on Learning Representations (ICLR 2025), January 2025 (Accepted).