Publications

DEPARTMENTS

Emperical Interference

Haptic Intelligence

Modern Magnetic Systems

Perceiving Systems

Physical Intelligence

Robotic Materials

Social Foundations of Computation


Research Groups

Autonomous Vision

Autonomous Learning

Bioinspired Autonomous Miniature Robots

Dynamic Locomotion

Embodied Vision

Human Aspects of Machine Learning

Intelligent Control Systems

Learning and Dynamical Systems

Locomotion in Biorobotic and Somatic Systems

Micro, Nano, and Molecular Systems

Movement Generation and Control

Neural Capture and Synthesis

Physics for Inference and Optimization

Organizational Leadership and Diversity

Probabilistic Learning Group


Topics

Robot Learning

Conference Paper

2022

Autonomous Learning

Robotics

AI

Career

Award


Social Foundations of Computation Miscellaneous Policy Design in Long-run Welfare Dynamics Wu, J., Abebe, R., Hardt, M., Stoica, A. Proceedings of the Fifth ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO), November 2025 (Published) URL BibTeX

Social Foundations of Computation Conference Paper Strategic Hypothesis Testing Hossain, S., Chen, Y., Chen, Y. The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), Spotlight Poster, top 3%, September 2025 (Accepted)
We examine hypothesis testing within a principal-agent framework, where a strategic agent, holding private beliefs about the effectiveness of a product, submits data to a principal who decides on approval. The principal employs a hypothesis testing rule, aiming to pick a p-value threshold that balances false positives and false negatives while anticipating the agent's incentive to maximize expected profitability. Building on prior work, we develop a game-theoretic model that captures how the agent's participation and reporting behavior respond to the principal's statistical decision rule. Despite the complexity of the interaction, we show that the principal's errors exhibit clear monotonic behavior when segmented by an efficiently computable critical p-value threshold, leading to an interpretable characterization of their optimal p-value threshold. We empirically validate our model and these insights using publicly available data on drug approvals. Overall, our work offers a comprehensive perspective on strategic interactions within the hypothesis testing framework, providing technical and regulatory insights.
arXiv BibTeX

Social Foundations of Computation Algorithms and Society Article Performative Prediction: Past and Future Hardt, M., Mendler-Dünner, C. Statistical Science, Institute of Mathematical Statistics, August 2025 (Published)
Predictions in the social world generally influence the target of prediction, a phenomenon known as performativity. Self-fulfilling and self-negating predictions are examples of performativity. Of fundamental importance to economics, finance, and the social sciences, the notion has been absent from the development of machine learning. In machine learning applications, performativity often surfaces as distribution shift. A predictive model deployed on a digital platform, for example, influences consumption and thereby changes the data-generating distribution. We survey the recently founded area of performative prediction that provides a definition and conceptual framework to study performativity in machine learning. A consequence of performative prediction is a natural equilibrium notion that gives rise to new optimization challenges. Another consequence is a distinction between learning and steering, two mechanisms at play in performative prediction. The notion of steering is in turn intimately related to questions of power in digital markets. We review the notion of performative power that gives an answer to the question how much a platform can steer participants through its predictions. We end on a discussion of future directions, such as the role that performativity plays in contesting algorithmic systems.
arXiv URL BibTeX

Social Foundations of Computation Miscellaneous Answer Matching Outperforms Multiple Choice for Language Model Evaluation Chandak, N., Goel, S., Prabhu, A., Hardt, M., Geiping, J. July 2025 (Submitted)
Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show multiple choice questions from popular benchmarks can often be answered without even seeing the question. These shortcuts arise from a fundamental limitation of discriminative evaluation not shared by evaluations of the model's free-form, generative answers. Until recently, there appeared to be no viable, scalable alternative to multiple choice--but, we show that this has changed. We consider generative evaluation via what we call answer matching: Give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine if the response matches the reference. To compare the validity of different evaluation strategies, we annotate MMLU-Pro and GPQA-Diamond to obtain human grading data, and measure the agreement of each evaluation approach. We find answer matching using recent models--even small ones--achieves near-perfect agreement, in the range of inter-annotator agreement. In contrast, both multiple choice evaluation and using LLM-as-a-judge without reference answers aligns poorly with human grading. Improving evaluations via answer matching is not merely a conceptual concern: the rankings of several models change significantly when evaluating their free-form responses with answer matching. In light of these findings, we discuss how to move the evaluation ecosystem from multiple choice to answer matching.
arXiv BibTeX

Social Foundations of Computation Conference Paper Difficult Lessons on Social Prediction from Wisconsin Public Schools Perdomo, J. C., Britton, T., Hardt, M., Abebe, R. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, June 2025 (Published)
Early warning systems (EWS) are predictive tools at the center of recent efforts to improve graduation rates in public schools across the United States. These systems assist in targeting interventions to individual students by predicting which students are at risk of dropping out. Despite significant investments in their widespread adoption, there remain large gaps in our understanding of the efficacy of EWS, and the role of statistical risk scores in education. In this work, we draw on nearly a decade's worth of data from a system used throughout Wisconsin to provide the first large-scale evaluation of the long-term impact of EWS on graduation outcomes. We present empirical evidence that the prediction system accurately sorts students by their dropout risk. We also find that it may have caused a single-digit percentage increase in graduation rates, though our empirical analyses cannot reliably rule out that there has been no positive treatment effect. Going beyond a retrospective evaluation of DEWS, we draw attention to a central question at the heart of the use of EWS: Are individual risk scores necessary for effectively targeting interventions? We propose a simple mechanism that only uses information about students' environments -- such as their schools, and districts -- and argue that this mechanism can target interventions just as efficiently as the individual risk score-based mechanism. Our argument holds even if individual predictions are highly accurate and effective interventions exist. In addition to motivating this simple targeting mechanism, our work provides a novel empirical backbone for the robust qualitative understanding among education researchers that dropout is structurally determined. Combined, our insights call into question the marginal value of individual predictions in settings where outcomes are driven by high levels of inequality.
arXiv URL BibTeX

Social Foundations of Computation Conference Paper How Benchmark Prediction from Fewer Data Misses the Mark Zhang, G., Dorner, F. E., Hardt, M. The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), June 2025 (Accepted)
Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM evaluation) aims to select a small subset of evaluation points and predict overall benchmark performance from that subset. In this paper, we systematically assess the strengths and limitations of 11 benchmark prediction methods across 19 diverse benchmarks. First, we identify a highly competitive baseline: Take a random sample and fit a regression model on the sample to predict missing entries. Outperforming most existing methods, this baseline challenges the assumption that careful subset selection is necessary for benchmark prediction. Second, we discover that all existing methods crucially depend on model similarity. They work best when interpolating scores among similar models. The effectiveness of benchmark prediction sharply declines when new models have higher accuracy than previously seen models. In this setting of extrapolation, none of the previous methods consistently beat a simple average over random samples. To improve over the sample average, we introduce a new method inspired by augmented inverse propensity weighting. This method consistently outperforms the random sample average even for extrapolation. However, its performance still relies on model similarity and the gains are modest in general. This shows that benchmark prediction fails just when it is most needed: at the evaluation frontier, where the goal is to evaluate new models of unknown capabilities.
arXiv BibTeX

Social Foundations of Computation Conference Paper To Give or Not to Give? The Impacts of Strategically Withheld Recourse Chen, Y., Estornell, A., Vorobeychik, Y., Liu, Y. In Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTAS), PMLR, The Twenty-Eight International Conference on Artificial Intelligence and Statistics (AISTATS), May 2025 (Published)
Individuals often aim to reverse undesired outcomes in interactions with automated systems, like loan denials, by either implementing system-recommended actions (recourse), or manipulating their features. While providing recourse benefits users and enhances system utility, it also provides information about the decision process that can be used for more effective strategic manipulation, especially when the individuals collectively share such information with each other. We show that this tension leads rational utility-maximizing systems to frequently withhold recourse, resulting in decreased population utility, particularly impacting sensitive groups. To mitigate these effects, we explore the role of recourse subsidies, finding them effective in increasing the provision of recourse actions by rational systems, as well as lowering the potential social cost and mitigating unfairness caused by recourse withholding.
arXiv URL BibTeX

Social Foundations of Computation Conference Paper Limits to Predicting Online Speech Using Large Language Models Remeli, M., Hardt, M., Williamson, R. C. April 2025 (Submitted)
We study the predictability of online speech on social media, and whether predictability improves with information outside a user's own posts. Recent work suggests that the predictive information contained in posts written by a user's peers can surpass that of the user's own posts. Motivated by the success of large language models, we empirically test this hypothesis. We define unpredictability as a measure of the model's uncertainty, i.e., its negative log-likelihood on future tokens given context. As the basis of our study, we collect a corpus of 6.25M posts from more than five thousand X (previously Twitter) users and their peers. Across three large language models ranging in size from 1 billion to 70 billion parameters, we find that predicting a user's posts from their peers' posts performs poorly. Moreover, the value of the user's own posts for prediction is consistently higher than that of their peers'. Across the board, we find that the predictability of social media posts remains low, comparable to predicting financial news without context. We extend our investigation with a detailed analysis about the causes of unpredictability and the robustness of our findings. Specifically, we observe that a significant amount of predictive uncertainty comes from hashtags and @-mentions. Moreover, our results replicate if instead of prompting the model with additional context, we finetune on additional context.
arXiv BibTeX

Social Foundations of Computation Conference Paper Lawma: The Power of Specialization for Legal Tasks Dominguez-Olmedo, R., Nanda, V., Abebe, R., Bechtold, S., Engel, C., Frankenreiter, J., Gummadi, K., Hardt, M., Livermore, M. The Thirteenth International Conference on Learning Representations (ICLR 2025), January 2025 (Accepted)
Annotation and classification of legal text are central components of empirical legal research. Traditionally, these tasks are often delegated to trained research assistants. Motivated by the advances in language modeling, empirical legal scholars are increasingly turning to prompting commercial models, hoping that it will alleviate the significant cost of human annotation. Despite growing use, our understanding of how to best utilize large language models for legal tasks remains limited. We conduct a comprehensive study of 260 legal text classification tasks, nearly all new to the machine learning community. Starting from GPT-4 as a baseline, we show that it has non-trivial but highly varied zero-shot accuracy, often exhibiting performance that may be insufficient for legal work. We then demonstrate that a lightly fine-tuned Llama 3 model vastly outperforms GPT-4 on almost all tasks, typically by double-digit percentage points. We find that larger models respond better to fine-tuning than smaller models. A few tens to hundreds of examples suffice to achieve high classification accuracy. Notably, we can fine-tune a single model on all 260 tasks simultaneously at a small loss in accuracy relative to having a separate model for each task. Our work points to a viable alternative to the predominant practice of prompting commercial models. For concrete legal tasks with some available labeled data, researchers are better off using a fine-tuned open-source model.
ArXiv Code BibTeX

Social Foundations of Computation Conference Paper Limits to Scalable Evaluation at the Frontier: LLM as Judge Won’t Beat Twice the Data Dorner, F. E., Nastl, V. Y., Hardt, M. The Thirteenth International Conference on Learning Representations (ICLR 2025), January 2025 (Accepted)
High-quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an important research ambition. Many hope to use strong existing models in lieu of costly labels to provide cheap model evaluations. Unfortunately, this method of using models as judges introduces biases, such as self-preferencing, that can distort model comparisons. An emerging family of debiasing tools promises to fix these issues by using a few high-quality labels to debias a large number of model judgments. In this paper, we study how far such debiasing methods, in principle, can go. Our main result shows that when the judge is no more accurate than the evaluated model, no debiasing method can decrease the required amount of ground truth labels by more than half. Our result speaks to the severe limitations of the LLM-as-a-judge paradigm at the evaluation frontier where the goal is to assess newly released models that are possibly better than the judge. Through an empirical evaluation, we demonstrate that the sample size savings achievable in practice are even more modest than what our theoretical limit suggests. Along the way, our work provides new observations about debiasing methods for model evaluation and points out promising avenues for future work.
arXiv URL BibTeX

Social Foundations of Computation Miscellaneous Training on the Test Task Confounds Evaluation and Emergence Dominguez-Olmedo, R., Dorner, F. E., Hardt, M. The Thirteenth International Conference on Learning Representations (ICLR 2025), January 2025 (Accepted)
We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data contamination, training on the test task is not malpractice. Rather, the term describes a growing set of techniques to include task-relevant data in the pretraining stage of a language model. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for training on the test task by fine-tuning each model under comparison on the same task-relevant data before evaluation. We then show that instances of emergent behavior largely vanish once we adjust for training on the test task. This also applies to reported instances of emergent behavior that cannot be explained by the choice of evaluation metric. Our work promotes a new perspective on the evaluation of large language models with broad implications for benchmarking and the study of emergent capabilities.
ArXiv BibTeX

Social Foundations of Computation Book The Emerging Science of Machine Learning Benchmarks Hardt, M. 2025 (Published)
Machine learning turns on one simple trick: Split the data into training and test sets. Anything goes on the training set. Rank models on the test set and let model builders compete. Call it a benchmark. Machine learning researchers cherish a good tradition of lamenting the apparent shortcomings of benchmarks. Critics argue that static test sets and metrics promote narrow research objectives, stifling more creative scientific pursuits. Benchmarks also incentivize gaming; in fact, Goodhart's Law cautions against applying competitive pressure to statistical measurement. Over time, researchers may overfit to benchmarks, building models that exploit data artifacts. As a result, test set performance draws a skewed picture of model capabilities that deceives us—especially when comparing humans and machines. To top off the list of issues, there are a slew of reasons why things don't transfer well from benchmarks to the real world.
Website URL BibTeX