Publications

Social Foundations of Computation Conference Paper ROC-n-reroll: How Verifier Imperfection affects Test-Time Scaling Dorner, F. E., Chen, Y. C., Cruz, A. F., Yang, F. Y. The Fourteenth International Conference on Learning Representations (ICLR), April 2026 (Accepted)
Test-time scaling aims to improve language model performance by leveraging additional compute during inference. Many works have empirically studied techniques such as Best-of-N (BoN) and Rejection Sampling (RS) that make use of a verifier to enable test-time scaling. However, to date there is little theoretical understanding of how verifier imperfection affects performance, a gap we address in this work. Specifically, we prove that the instance-level accuracy of these methods is precisely characterized by the geometry of the verifier's ROC curve. Our theory has two important takeaways, confirmed by experiments with Qwen and Llama models on GSM8K and MATH500. First, RS outperforms BoN for fixed compute, while both methods converge to the same accuracy in the infinite-compute limit. Second, it is generally impossible to predict the high-compute performance of either method based on observations in the low-compute regime.
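To make the two procedures named in the abstract concrete, here is a minimal toy sketch of Best-of-N and Rejection Sampling with a noisy verifier. It is not the authors' code or experimental setup: the Gaussian score model, the `noise` and `threshold` parameters, and all function names are assumptions introduced purely for illustration.

```python
import random

def sample_answer(rng, p_correct):
    """Draw one candidate answer; True means the answer is correct."""
    return rng.random() < p_correct

def verifier_score(rng, is_correct, noise):
    """Hypothetical imperfect verifier: correct answers score 1, incorrect
    answers score 0, plus Gaussian noise (an assumed score model)."""
    return (1.0 if is_correct else 0.0) + rng.gauss(0.0, noise)

def best_of_n(rng, n, p_correct, noise):
    """Best-of-N: draw n candidates, return correctness of the top-scored one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    scored = []
    for _ in range(n):
        is_correct = sample_answer(rng, p_correct)
        scored.append((verifier_score(rng, is_correct, noise), is_correct))
    return max(scored, key=lambda pair: pair[0])[1]

def rejection_sampling(rng, threshold, max_draws, p_correct, noise):
    """Rejection sampling: draw until a score clears the threshold.
    Returns (is_correct, draws_used); falls back to the last draw
    if the budget max_draws is exhausted."""
    if max_draws < 1:
        raise ValueError("max_draws must be at least 1")
    for draws in range(1, max_draws + 1):
        is_correct = sample_answer(rng, p_correct)
        if verifier_score(rng, is_correct, noise) >= threshold:
            return is_correct, draws
    return is_correct, max_draws

def bon_accuracy(n, p_correct, noise, trials, seed):
    """Monte Carlo estimate of Best-of-N accuracy under the toy model."""
    rng = random.Random(seed)
    hits = sum(best_of_n(rng, n, p_correct, noise) for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, bon_accuracy(n, p_correct=0.3, noise=0.3, trials=2000, seed=0))
```

Note that rejection sampling uses a variable number of draws per query, so any compute-matched comparison with Best-of-N has to account for `draws_used`. The toy model above makes no claim about reproducing the paper's results.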

Social Foundations of Computation Miscellaneous Text as the Richest Preference Signal Cruz, A. F., Kleinberg, J., Abebe, R. The Fourteenth International Conference on Learning Representations (ICLR), AIMS Workshop, April 2026 (Accepted)
Preference elicitation algorithms have long relied on structured representations of user preferences: rankings of items, ratings, or simple binary interactions (e.g., views). Over the years, we've slowly become aware of the limitations and biases these representations entail. Users form preferences over items' features rather than items themselves. In this paper, we explore natural language as a first-class preference representation, beyond a mere cold-start aid. We study three parallel representations of user preferences: (i) a user-item interaction matrix, (ii) free-form text profiles describing users' preferences, and (iii) interpretable tabular features derived by an LLM from these text profiles. Our findings unfold in three parts. First, text-based predictors substantially outperform collaborative filtering in the cold-start regime and remain competitive as interaction histories grow. Second, most of the predictive signal in text can be retained in a compact, interpretable tabular representation. Third, the three representations are complementary: simple ensembles that combine them consistently achieve the strongest performance.

Social Foundations of Computation Miscellaneous Scaling Open-Ended Reasoning To Predict the Future Chandak, N., Shashwat, G., Prabhu, A., Hardt, M., Geiping, J. January 2026 (Submitted)
High-stakes decision making involves reasoning under uncertainty about the future. In this work, we train language models to make predictions on open-ended forecasting questions. To scale up training data, we synthesize novel forecasting questions from global events reported in daily news, using a fully automated, careful curation recipe. We train the Qwen3 thinking models on our dataset, OpenForesight. To prevent leakage of future information during training and evaluation, we use an offline news corpus, both for data generation and retrieval in our forecasting system. Guided by a small validation set, we show the benefits of retrieval and of an improved reward function for reinforcement learning (RL). Once we obtain our final forecasting system, we perform held-out testing from May to August 2025. Our specialized model, OpenForecaster 8B, matches much larger proprietary models, with our training improving the accuracy, calibration, and consistency of predictions. We find that calibration improvements from forecasting training generalize across popular benchmarks. We open-source all our models, code, and data to make research on language model forecasting broadly accessible.

Social Foundations of Computation Conference Paper Train-before-Test Harmonizes Language Model Rankings Zhang, G., Dominguez-Olmedo, R., Hardt, M. The Fourteenth International Conference on Learning Representations (ICLR), oral, top 1.18%, January 2026 (Accepted)
Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. Recent work attributed ranking disagreement to the phenomenon of training on the test task: As released, different models exhibit a different level of preparation for any given test task. A candidate solution to the problem is train-before-test: Give each model the same benchmark-specific finetuning before evaluation. Our primary contribution is a broad empirical evaluation of train-before-test across 24 benchmarks and 61 models. We show that train-before-test significantly improves ranking agreement consistently across all benchmarks. Whereas rankings have little external validity to start with, they enjoy a significant degree of external validity when applying train-before-test: Model rankings transfer gracefully from one benchmark to the other. Even within the same model family, train-before-test reduces strong ranking disagreement to near-perfect agreement. In addition, train-before-test reduces the model-score matrix to essentially rank one, revealing new insights into the latent factors of benchmark performance. Our work supports the recommendation to make train-before-test a default component of LLM benchmarking.
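Ranking agreement between two benchmarks can be quantified with a rank correlation. The sketch below computes Kendall's tau (the tau-a variant) over per-model scores; the abstract does not say which statistic the paper uses, so the choice of tau-a, the function name, and the example scores are all assumptions for illustration.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score lists covering the same models.

    Returns 1.0 when both benchmarks order every pair of models the same
    way and -1.0 when every pair is ordered oppositely. Tied pairs count
    as neither concordant nor discordant.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must cover the same models")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two models")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = n * (n - 1) // 2
    return (concordant - discordant) / n_pairs

if __name__ == "__main__":
    # Hypothetical scores for four models on two benchmarks (made-up numbers).
    benchmark_a = [0.61, 0.55, 0.70, 0.42]
    benchmark_b = [0.30, 0.35, 0.50, 0.20]
    print(kendall_tau(benchmark_a, benchmark_b))
```

Under this measure, comparing tau before and after giving every model the same benchmark-specific finetuning is one way to check whether rankings become more consistent.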