Publications


Social Foundations of Computation Conference Paper ROC-n-reroll: How Verifier Imperfection affects Test-Time Scaling Dorner, F. E., Chen, Y. C., Cruz, A. F., Yang, F. Y. The Fourteenth International Conference on Learning Representations (ICLR), April 2026 (Accepted)
Test-time scaling aims to improve language model performance by leveraging additional compute during inference. Many works have empirically studied techniques such as Best-of-N (BoN) and Rejection Sampling (RS) that make use of a verifier to enable test-time scaling. However, to date there is little theoretical understanding of how verifier imperfection affects performance -- a gap we address in this work. Specifically, we prove that the instance-level accuracy of these methods is precisely characterized by the geometry of the verifier's ROC curve. Our theory has two important takeaways, confirmed by experiments with Qwen and Llama models on GSM8K and MATH500. First, RS outperforms BoN for fixed compute, while both methods converge to the same accuracy in the infinite-compute limit. Second, it is generally impossible to predict the high-compute performance of either method based on observations in the low-compute regime.
arXiv BibTeX
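The comparison of Best-of-N and Rejection Sampling under an imperfect verifier can be illustrated with a small Monte-Carlo simulation. The sketch below is not the paper's code: the per-sample correctness rate p_correct, the Gaussian verifier-score model controlled by delta, and the acceptance threshold are all illustrative assumptions.

```python
# Hedged Monte-Carlo sketch: Best-of-N vs. rejection sampling with an imperfect
# verifier whose scores follow two unit-variance Gaussians, so its ROC curve is
# determined entirely by the separation `delta`. All parameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)
p_correct = 0.3   # assumed probability that a single generation is correct
delta = 1.0       # assumed verifier signal: score ~ N(delta, 1) if correct, N(0, 1) otherwise
threshold = 0.5   # assumed acceptance threshold for rejection sampling
trials = 5_000

def sample_batch(n):
    correct = rng.random(n) < p_correct
    scores = rng.normal(loc=correct * delta, scale=1.0)
    return correct, scores

def best_of_n(n):
    correct, scores = sample_batch(n)
    return correct[np.argmax(scores)]            # keep the highest-scoring generation

def rejection_sampling(n):
    correct, scores = sample_batch(n)
    accepted = np.flatnonzero(scores >= threshold)
    if accepted.size:                             # return the first accepted generation
        return correct[accepted[0]]
    return correct[np.argmax(scores)]             # fall back to the best-scored one

for n in (1, 4, 16, 64):
    bon = np.mean([best_of_n(n) for _ in range(trials)])
    rs = np.mean([rejection_sampling(n) for _ in range(trials)])
    print(f"N={n:3d}  BoN accuracy={bon:.3f}  RS accuracy={rs:.3f}")
```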

Social Foundations of Computation Miscellaneous Text as the Richest Preference Signal Cruz, A. F., Kleinberg, J., Abebe, R. The Fourteenth International Conference on Learning Representations (ICLR), AIMS Workshop, April 2026 (Accepted)
Preference elicitation algorithms have long relied on structured representations of user preferences: rankings of items, ratings, or simple binary interactions (e.g., views). Over the years, we've slowly become aware of the limitations and biases these representations entail. Users form preferences over items' features rather than items themselves. In this paper, we explore natural language as a first-class preference representation, beyond a mere cold-start aid. We study three parallel representations of user preferences: (i) a user-item interaction matrix, (ii) free-form text profiles describing users' preferences, and (iii) interpretable tabular features derived by an LLM from these text profiles. Our findings unfold in three parts. First, text-based predictors substantially outperform collaborative filtering in the cold-start regime and remain competitive as interaction histories grow. Second, most of the predictive signal in text can be retained in a compact, interpretable tabular representation. Third, the three representations are complementary: Simple ensembles that combine them consistently achieve the strongest performance.
BibTeX

Social Foundations of Computation Miscellaneous Scaling Open-Ended Reasoning To Predict the Future Chandak, N., Shashwat, G., Prabhu, A., Hardt, M., Geiping, J. January 2026 (Submitted)
High-stakes decision making involves reasoning under uncertainty about the future. In this work, we train language models to make predictions on open-ended forecasting questions. To scale up training data, we synthesize novel forecasting questions from global events reported in daily news, using a fully automated, careful curation recipe. We train the Qwen3 thinking models on our dataset, OpenForesight. To prevent leakage of future information during training and evaluation, we use an offline news corpus, both for data generation and retrieval in our forecasting system. Guided by a small validation set, we show the benefits of retrieval, and an improved reward function for reinforcement learning (RL). Once we obtain our final forecasting system, we perform held-out testing from May to August 2025. Our specialized model, OpenForecaster 8B, matches much larger proprietary models, with our training improving the accuracy, calibration, and consistency of predictions. We find calibration improvements from forecasting training generalize across popular benchmarks. We open-source all our models, code, and data to make research on language model forecasting broadly accessible.
arXiv BibTeX

Social Foundations of Computation Conference Paper Train-before-Test Harmonizes Language Model Rankings Zhang, G., Dominguez-Olmedo, R., Hardt, M. The Fourteenth International Conference on Learning Representations (ICLR), Oral, top 1.18%, January 2026 (Accepted)
Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. Recent work attributed ranking disagreement to the phenomenon of training on the test task: As released, different models exhibit a different level of preparation for any given test task. A candidate solution to the problem is train-before-test: Give each model the same benchmark-specific finetuning before evaluation. Our primary contribution is a broad empirical evaluation of train-before-test across 24 benchmarks and 61 models. We show that train-before-test significantly improves ranking agreement consistently across all benchmarks. Whereas rankings have little external validity to start with, they enjoy a significant degree of external validity when applying train-before-test: Model rankings transfer gracefully from one benchmark to the other. Even within the same model family, train-before-test reduces strong ranking disagreement to near-perfect agreement. In addition, train-before-test reduces the model-score matrix to essentially rank one, revealing new insights into the latent factors of benchmark performance. Our work supports the recommendation to make train-before-test a default component of LLM benchmarking.
arXiv BibTeX
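As a rough analogue of the train-before-test protocol, the sketch below uses scikit-learn classifiers as stand-ins for language models: each "model" is pre-trained on its own perturbed data, and all of them then receive the same benchmark-specific finetuning split before evaluation. The data, perturbations, and number of finetuning passes are assumptions, not the paper's setup.

```python
# Toy analogue of train-before-test with scikit-learn stand-ins for language models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_task, y_task = X[:500], y[:500]           # shared benchmark-specific finetuning split
X_test, y_test = X[500:1500], y[500:1500]   # held-out benchmark test split

models = {}
for seed in (1, 2, 3):
    rng = np.random.default_rng(seed)
    # "pretraining": each model sees its own perturbed version of the task data,
    # so the models arrive with different levels of preparation for the test task
    X_pre = X[1500:] + rng.normal(scale=seed, size=X[1500:].shape)
    clf = SGDClassifier(random_state=seed)
    clf.partial_fit(X_pre, y[1500:], classes=np.array([0, 1]))
    models[f"model_{seed}"] = clf

print("accuracy before -> after train-before-test")
for name, clf in models.items():
    before = clf.score(X_test, y_test)
    for _ in range(5):                       # identical finetuning for every model
        clf.partial_fit(X_task, y_task)
    after = clf.score(X_test, y_test)
    print(f"{name}: {before:.3f} -> {after:.3f}")
```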

Social Foundations of Computation Miscellaneous Policy Design in Long-run Welfare Dynamics Wu, J., Abebe, R., Hardt, M., Stoica, A. Proceedings of the Fifth ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO), November 2025 (Published)
URL BibTeX

Social Foundations of Computation Conference Paper Strategic Hypothesis Testing Hossain, S., Chen, Y., Chen, Y. The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), Spotlight Poster, top 3%, September 2025 (Accepted)
We examine hypothesis testing within a principal-agent framework, where a strategic agent, holding private beliefs about the effectiveness of a product, submits data to a principal who decides on approval. The principal employs a hypothesis testing rule, aiming to pick a p-value threshold that balances false positives and false negatives while anticipating the agent's incentive to maximize expected profitability. Building on prior work, we develop a game-theoretic model that captures how the agent's participation and reporting behavior respond to the principal's statistical decision rule. Despite the complexity of the interaction, we show that the principal's errors exhibit clear monotonic behavior when segmented by an efficiently computable critical p-value threshold, leading to an interpretable characterization of their optimal p-value threshold. We empirically validate our model and these insights using publicly available data on drug approvals. Overall, our work offers a comprehensive perspective on strategic interactions within the hypothesis testing framework, providing technical and regulatory insights.
arXiv BibTeX

Social Foundations of Computation Algorithms and Society Article Performative Prediction: Past and Future Hardt, M., Mendler-Dünner, C. Statistical Science, Institute of Mathematical Statistics, August 2025 (Published)
Predictions in the social world generally influence the target of prediction, a phenomenon known as performativity. Self-fulfilling and self-negating predictions are examples of performativity. Of fundamental importance to economics, finance, and the social sciences, the notion has been absent from the development of machine learning. In machine learning applications, performativity often surfaces as distribution shift. A predictive model deployed on a digital platform, for example, influences consumption and thereby changes the data-generating distribution. We survey the recently founded area of performative prediction that provides a definition and conceptual framework to study performativity in machine learning. A consequence of performative prediction is a natural equilibrium notion that gives rise to new optimization challenges. Another consequence is a distinction between learning and steering, two mechanisms at play in performative prediction. The notion of steering is in turn intimately related to questions of power in digital markets. We review the notion of performative power that answers the question of how much a platform can steer participants through its predictions. We end with a discussion of future directions, such as the role that performativity plays in contesting algorithmic systems.
arXiv URL BibTeX

Social Foundations of Computation Miscellaneous Answer Matching Outperforms Multiple Choice for Language Model Evaluation Chandak, N., Goel, S., Prabhu, A., Hardt, M., Geiping, J. July 2025 (Submitted)
Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show multiple choice questions from popular benchmarks can often be answered without even seeing the question. These shortcuts arise from a fundamental limitation of discriminative evaluation not shared by evaluations of the model's free-form, generative answers. Until recently, there appeared to be no viable, scalable alternative to multiple choice--but we show that this has changed. We consider generative evaluation via what we call answer matching: Give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine if the response matches the reference. To compare the validity of different evaluation strategies, we annotate MMLU-Pro and GPQA-Diamond to obtain human grading data, and measure the agreement of each evaluation approach. We find answer matching using recent models--even small ones--achieves near-perfect agreement, in the range of inter-annotator agreement. In contrast, both multiple choice evaluation and using LLM-as-a-judge without reference answers align poorly with human grading. Improving evaluations via answer matching is not merely a conceptual concern: the rankings of several models change significantly when evaluating their free-form responses with answer matching. In light of these findings, we discuss how to move the evaluation ecosystem from multiple choice to answer matching.
arXiv BibTeX
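A minimal sketch of the answer-matching loop described in the abstract above. The paper uses a language model as the matcher; here generate_answer and matches_reference are hypothetical stand-ins (a canned response and a string normalizer) so the control flow stays self-contained.

```python
# Hedged sketch of answer matching: the candidate model sees only the question,
# produces a free-form response, and a matcher decides whether it agrees with the
# reference answer. Both helper functions below are illustrative stand-ins.
import re

def generate_answer(question: str) -> str:
    # placeholder for free-form generation by the candidate model
    return "Light travels at roughly 300,000 km per second in vacuum."

def normalize(text: str) -> str:
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def matches_reference(response: str, reference: str) -> bool:
    # stand-in matcher; the paper would instead prompt a grader model with the
    # question, the reference answer, and the free-form response
    return normalize(reference) in normalize(response)

question = "How fast does light travel in vacuum?"
reference = "300,000 km per second"
response = generate_answer(question)      # note: no multiple-choice options are shown
print("match:", matches_reference(response, reference))
```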

Social Foundations of Computation Conference Paper Difficult Lessons on Social Prediction from Wisconsin Public Schools Perdomo, J. C., Britton, T., Hardt, M., Abebe, R. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, June 2025 (Published)
Early warning systems (EWS) are predictive tools at the center of recent efforts to improve graduation rates in public schools across the United States. These systems assist in targeting interventions to individual students by predicting which students are at risk of dropping out. Despite significant investments in their widespread adoption, there remain large gaps in our understanding of the efficacy of EWS, and the role of statistical risk scores in education. In this work, we draw on nearly a decade's worth of data from a system used throughout Wisconsin to provide the first large-scale evaluation of the long-term impact of EWS on graduation outcomes. We present empirical evidence that the prediction system accurately sorts students by their dropout risk. We also find that it may have caused a single-digit percentage increase in graduation rates, though our empirical analyses cannot reliably rule out that there has been no positive treatment effect. Going beyond a retrospective evaluation of the Wisconsin system (DEWS), we draw attention to a central question at the heart of the use of EWS: Are individual risk scores necessary for effectively targeting interventions? We propose a simple mechanism that only uses information about students' environments -- such as their schools, and districts -- and argue that this mechanism can target interventions just as efficiently as the individual risk score-based mechanism. Our argument holds even if individual predictions are highly accurate and effective interventions exist. In addition to motivating this simple targeting mechanism, our work provides a novel empirical backbone for the robust qualitative understanding among education researchers that dropout is structurally determined. Combined, our insights call into question the marginal value of individual predictions in settings where outcomes are driven by high levels of inequality.
arXiv URL BibTeX

Social Foundations of Computation Conference Paper How Benchmark Prediction from Fewer Data Misses the Mark Zhang, G., Dorner, F. E., Hardt, M. The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), June 2025 (Accepted)
Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM evaluation) aims to select a small subset of evaluation points and predict overall benchmark performance from that subset. In this paper, we systematically assess the strengths and limitations of 11 benchmark prediction methods across 19 diverse benchmarks. First, we identify a highly competitive baseline: Take a random sample and fit a regression model on the sample to predict missing entries. Outperforming most existing methods, this baseline challenges the assumption that careful subset selection is necessary for benchmark prediction. Second, we discover that all existing methods crucially depend on model similarity. They work best when interpolating scores among similar models. The effectiveness of benchmark prediction sharply declines when new models have higher accuracy than previously seen models. In this setting of extrapolation, none of the previous methods consistently beat a simple average over random samples. To improve over the sample average, we introduce a new method inspired by augmented inverse propensity weighting. This method consistently outperforms the random sample average even for extrapolation. However, its performance still relies on model similarity and the gains are modest in general. This shows that benchmark prediction fails just when it is most needed: at the evaluation frontier, where the goal is to evaluate new models of unknown capabilities.
arXiv BibTeX
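The random-sample-plus-regression baseline from the abstract above can be sketched on synthetic scores: hold out one "new" model, observe its results on a random subset of items, and predict its full-benchmark accuracy from the other models' score patterns. The item-response-style data generator below is an assumption, not the paper's data.

```python
# Hedged sketch of the benchmark-prediction baseline: random subset + regression.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_models, n_items, n_sample = 60, 500, 40

ability = rng.normal(size=n_models)                 # latent model skill (assumed)
difficulty = rng.normal(size=n_items)               # latent item difficulty (assumed)
prob = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
scores = rng.random((n_models, n_items)) < prob     # binary per-item outcomes

subset = rng.choice(n_items, size=n_sample, replace=False)
X_old, y_old = scores[:-1][:, subset], scores[:-1].mean(axis=1)   # previously seen models
x_new, y_new = scores[-1][subset], scores[-1].mean()              # the held-out "new" model

reg = Ridge(alpha=1.0).fit(X_old.astype(float), y_old)
print(f"plain subset average : {x_new.mean():.3f}")
print(f"regression prediction: {reg.predict(x_new[None, :].astype(float))[0]:.3f}")
print(f"true benchmark score : {y_new:.3f}")
```

In the abstract's framing, the caveat is extrapolation: when the held-out model is stronger than everything used to fit the regression, such predictions degrade.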

Social Foundations of Computation Conference Paper To Give or Not to Give? The Impacts of Strategically Withheld Recourse Chen, Y., Estornell, A., Vorobeychik, Y., Liu, Y. In Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, The Twenty-Eighth International Conference on Artificial Intelligence and Statistics (AISTATS), May 2025 (Published)
Individuals often aim to reverse undesired outcomes in interactions with automated systems, like loan denials, by either implementing system-recommended actions (recourse), or manipulating their features. While providing recourse benefits users and enhances system utility, it also provides information about the decision process that can be used for more effective strategic manipulation, especially when the individuals collectively share such information with each other. We show that this tension leads rational utility-maximizing systems to frequently withhold recourse, resulting in decreased population utility, particularly impacting sensitive groups. To mitigate these effects, we explore the role of recourse subsidies, finding them effective in increasing the provision of recourse actions by rational systems, as well as lowering the potential social cost and mitigating unfairness caused by recourse withholding.
arXiv URL BibTeX

Social Foundations of Computation Conference Paper Limits to Predicting Online Speech Using Large Language Models Remeli, M., Hardt, M., Williamson, R. C. April 2025 (Submitted)
We study the predictability of online speech on social media, and whether predictability improves with information outside a user's own posts. Recent work suggests that the predictive information contained in posts written by a user's peers can surpass that of the user's own posts. Motivated by the success of large language models, we empirically test this hypothesis. We define unpredictability as a measure of the model's uncertainty, i.e., its negative log-likelihood on future tokens given context. As the basis of our study, we collect a corpus of 6.25M posts from more than five thousand X (previously Twitter) users and their peers. Across three large language models ranging in size from 1 billion to 70 billion parameters, we find that predicting a user's posts from their peers' posts performs poorly. Moreover, the value of the user's own posts for prediction is consistently higher than that of their peers'. Across the board, we find that the predictability of social media posts remains low, comparable to predicting financial news without context. We extend our investigation with a detailed analysis about the causes of unpredictability and the robustness of our findings. Specifically, we observe that a significant amount of predictive uncertainty comes from hashtags and @-mentions. Moreover, our results replicate if instead of prompting the model with additional context, we finetune on additional context.
arXiv BibTeX

Social Foundations of Computation Conference Paper Lawma: The Power of Specialization for Legal Tasks Dominguez-Olmedo, R., Nanda, V., Abebe, R., Bechtold, S., Engel, C., Frankenreiter, J., Gummadi, K., Hardt, M., Livermore, M. The Thirteenth International Conference on Learning Representations (ICLR 2025), January 2025 (Accepted)
Annotation and classification of legal text are central components of empirical legal research. Traditionally, these tasks are often delegated to trained research assistants. Motivated by the advances in language modeling, empirical legal scholars are increasingly turning to prompting commercial models, hoping that it will alleviate the significant cost of human annotation. Despite growing use, our understanding of how to best utilize large language models for legal tasks remains limited. We conduct a comprehensive study of 260 legal text classification tasks, nearly all new to the machine learning community. Starting from GPT-4 as a baseline, we show that it has non-trivial but highly varied zero-shot accuracy, often exhibiting performance that may be insufficient for legal work. We then demonstrate that a lightly fine-tuned Llama 3 model vastly outperforms GPT-4 on almost all tasks, typically by double-digit percentage points. We find that larger models respond better to fine-tuning than smaller models. A few tens to hundreds of examples suffice to achieve high classification accuracy. Notably, we can fine-tune a single model on all 260 tasks simultaneously at a small loss in accuracy relative to having a separate model for each task. Our work points to a viable alternative to the predominant practice of prompting commercial models. For concrete legal tasks with some available labeled data, researchers are better off using a fine-tuned open-source model.
ArXiv Code BibTeX

Social Foundations of Computation Conference Paper Limits to Scalable Evaluation at the Frontier: LLM as Judge Won’t Beat Twice the Data Dorner, F. E., Nastl, V. Y., Hardt, M. The Thirteenth International Conference on Learning Representations (ICLR 2025), January 2025 (Accepted)
High-quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an important research ambition. Many hope to use strong existing models in lieu of costly labels to provide cheap model evaluations. Unfortunately, this method of using models as judges introduces biases, such as self-preferencing, that can distort model comparisons. An emerging family of debiasing tools promises to fix these issues by using a few high-quality labels to debias a large number of model judgments. In this paper, we study how far such debiasing methods, in principle, can go. Our main result shows that when the judge is no more accurate than the evaluated model, no debiasing method can decrease the required amount of ground truth labels by more than half. Our result speaks to the severe limitations of the LLM-as-a-judge paradigm at the evaluation frontier where the goal is to assess newly released models that are possibly better than the judge. Through an empirical evaluation, we demonstrate that the sample size savings achievable in practice are even more modest than what our theoretical limit suggests. Along the way, our work provides new observations about debiasing methods for model evaluation and points out promising avenues for future work.
arXiv URL BibTeX

Social Foundations of Computation Miscellaneous Training on the Test Task Confounds Evaluation and Emergence Dominguez-Olmedo, R., Dorner, F. E., Hardt, M. The Thirteenth International Conference on Learning Representations (ICLR 2025), January 2025 (Accepted)
We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data contamination, training on the test task is not malpractice. Rather, the term describes a growing set of techniques to include task-relevant data in the pretraining stage of a language model. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for training on the test task by fine-tuning each model under comparison on the same task-relevant data before evaluation. We then show that instances of emergent behavior largely vanish once we adjust for training on the test task. This also applies to reported instances of emergent behavior that cannot be explained by the choice of evaluation metric. Our work promotes a new perspective on the evaluation of large language models with broad implications for benchmarking and the study of emergent capabilities.
ArXiv BibTeX

Social Foundations of Computation Book The Emerging Science of Machine Learning Benchmarks Hardt, M. 2025 (Published)
Machine learning turns on one simple trick: Split the data into training and test sets. Anything goes on the training set. Rank models on the test set and let model builders compete. Call it a benchmark. Machine learning researchers cherish a good tradition of lamenting the apparent shortcomings of benchmarks. Critics argue that static test sets and metrics promote narrow research objectives, stifling more creative scientific pursuits. Benchmarks also incentivize gaming; in fact, Goodhart's Law cautions against applying competitive pressure to statistical measurement. Over time, researchers may overfit to benchmarks, building models that exploit data artifacts. As a result, test set performance draws a skewed picture of model capabilities that deceives us—especially when comparing humans and machines. To top off the list of issues, there are a slew of reasons why things don't transfer well from benchmarks to the real world.
Website URL BibTeX

Social Foundations of Computation Algorithms and Society Conference Paper Algorithmic Collective Action in Recommender Systems: Promoting Songs by Reordering Playlists Baumann, J., Mendler-Dünner, C. Advances in Neural Information Processing Systems 37 (NeurIPS 2024), The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), December 2024 (Published)
We investigate algorithmic collective action in transformer-based recommender systems. Our use case is a collective of fans aiming to promote the visibility of an artist by strategically placing one of their songs in the existing playlists they control. The success of the collective is measured by the increase in test-time recommendations of the targeted song. We introduce two easily implementable strategies towards this goal and test their efficacy on a publicly available recommender system model released by a major music streaming platform. Our findings reveal that even small collectives (controlling less than 0.01 of the training data) can achieve up to 25x amplification of recommendations by strategically choosing the position at which to insert the song. We then focus on investigating the externalities of the strategy. We find that the performance loss for the platform is negligible, and the recommendations of other songs are largely preserved, minimally impairing the user experience of participants. Moreover, the costs are evenly distributed among other artists. Taken together, our findings demonstrate how collective action strategies can be effective while not necessarily being adversarial, raising new questions around incentives, social dynamics, and equilibria in recommender systems.
arXiv URL BibTeX

Social Foundations of Computation Algorithms and Society Conference Paper An Engine Not a Camera: Measuring Performative Power of Online Search Mendler-Dünner, C., Carovano, G., Hardt, M. Advances in Neural Information Processing Systems 37 (NeurIPS 2024), The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), December 2024 (Published)
The power of digital platforms is at the center of major ongoing policy and regulatory efforts. To advance existing debates, we designed and executed an experiment to measure the power of online search providers, building on the recent definition of performative power. Instantiated in our setting, performative power quantifies the ability of a search engine to steer web traffic by rearranging results. To operationalize this definition we developed a browser extension that performs unassuming randomized experiments in the background. These randomized experiments emulate updates to the search algorithm and identify the causal effect of different content arrangements on clicks. We formally relate these causal effects to performative power. Analyzing tens of thousands of clicks, we discuss what our robust quantitative findings say about the power of online search engines. More broadly, we envision our work to serve as a blueprint for how performative power and online experiments can be integrated with future investigations into the economic power of digital platforms.
ArXiv URL BibTeX

Social Foundations of Computation Conference Paper Do Causal Predictors Generalize Better to New Domains? Nastl, V. Y., Hardt, M. Advances in Neural Information Processing Systems 37 (NeurIPS 2024), The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), Spotlight Poster, December 2024 (Published)
We study how well machine learning models trained on causal features generalize across domains. We consider 16 prediction tasks on tabular datasets covering applications in health, employment, education, social benefits, and politics. Each dataset comes with multiple domains, allowing us to test how well a model trained in one domain performs in another. For each prediction task, we select features that have a causal influence on the target of prediction. Our goal is to test the hypothesis that models trained on causal features generalize better across domains. Without exception, we find that predictors using all available features, regardless of causality, have better in-domain and out-of-domain accuracy than predictors using causal features. Moreover, even the absolute drop in accuracy from one domain to the other is no better for causal predictors than for models that use all features. If the goal is to generalize to new domains, practitioners might as well train the best possible model on all available features.
ArXiv URL BibTeX

Social Foundations of Computation Algorithms and Society Conference Paper Evaluating Language Models as Risk Scores Cruz, A. F., Hardt, M., Mendler-Dünner, C. Advances in Neural Information Processing Systems 37 (NeurIPS 2024), The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), December 2024 (Published)
Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks. Conditioned on a question and answer-key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate language models' ability to quantify outcome uncertainty. In this work, we focus on the use of language models as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using large language models, and evaluate them against benchmark prediction tasks. Specifically, the package derives natural language tasks from US Census data products, inspired by popular tabular data benchmarks. A flexible API allows for any task to be constructed out of 28 census features whose values are mapped to prompt-completion pairs. We demonstrate the utility of folktexts through a sweep of empirical insights on 16 recent large language models, inspecting risk scores, calibration curves, and diverse evaluation metrics. We find that zero-shot risk scores have high predictive signal while being widely miscalibrated: base models overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and generate over-confident risk scores.
ArXiv Code URL BibTeX
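A small sketch of the kind of calibration check the abstract above describes, using synthetic risk scores rather than the folktexts API: scores are binned and the predicted probability in each bin is compared against the observed outcome frequency. The synthetic "over-confident" score model is an assumption for illustration.

```python
# Hedged sketch of a reliability (calibration) check on risk scores; synthetic data.
import numpy as np

rng = np.random.default_rng(0)
true_p = rng.random(5000)                    # unknown per-individual outcome probabilities
outcomes = rng.random(5000) < true_p         # realized binary outcomes
# assumed over-confident scores: probabilities stretched away from 0.5
risk_scores = np.clip(0.5 + 1.6 * (true_p - 0.5), 0.0, 1.0)

bins = np.linspace(0.0, 1.0, 11)
bin_index = np.digitize(risk_scores, bins[1:-1])
for b in range(10):
    mask = bin_index == b
    if mask.any():
        print(f"predicted {risk_scores[mask].mean():.2f}   "
              f"observed {outcomes[mask].mean():.2f}   n={mask.sum()}")
```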

Social Foundations of Computation Algorithms and Society Conference Paper Questioning the Survey Responses of Large Language Models Dominguez-Olmedo, R., Hardt, M., Mendler-Dünner, C. Advances in Neural Information Processing Systems 37 (NeurIPS 2024), The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), Oral, December 2024 (Published)
As large language models increase in capability, researchers have started to conduct surveys of all kinds on these models in order to investigate the population represented by their responses. In this work, we critically examine language models' survey responses on the basis of the well-established American Community Survey by the U.S. Census Bureau and investigate whether they elicit a faithful representation of any human population. Using a de-facto standard multiple-choice prompting technique and evaluating 39 different language models using systematic experiments, we establish two dominant patterns: First, models' responses are governed by ordering and labeling biases, leading to variations across models that do not persist after adjusting for systematic biases. Second, models' responses do not contain the entropy variations and statistical signals typically found in human populations. As a result, a binary classifier can almost perfectly differentiate model-generated data from the responses of the U.S. census. At the same time, models' relative alignment with different demographic subgroups can be predicted from the subgroups' entropy, irrespective of the model's training data or training strategy. Taken together, our findings suggest caution in treating models' survey responses as equivalent to those of human populations.
ArXiv URL BibTeX

Social Foundations of Computation Conference Paper The Fairness-Quality Trade-off in Clustering Hakim, R., Stoica, A., Papadimitriou, C. H., Yannakakis, M. Advances in Neural Information Processing Systems 37 (NeurIPS 2024), The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), December 2024 (Published)
arXiv URL BibTeX

Social Foundations of Computation Algorithms and Society Conference Paper Decline Now: A Combinatorial Model for Algorithmic Collective Action Sigg, D., Hardt, M., Mendler-Dünner, C. CHI Conference on Human Factors in Computing Systems, October 2024 (Accepted)
Drivers on food delivery platforms often run a loss on low-paying orders. In response, workers on DoorDash started a campaign, DeclineNow, to purposefully decline orders below a certain pay threshold. For each declined order, the platform returns the request to other available drivers with slightly increased pay. While it contributes to an overall pay increase, the implementation of the strategy comes with the risk, for each individual driver, of missing out on orders. In this work, we propose a first combinatorial model to study the strategic interaction between workers and the platform. Within our model, we formalize key quantities such as the average worker benefit of the strategy, the benefit of freeriding, as well as the benefit of participation. We extend our theoretical results with simulations. Our key insights show that the average worker gain of the strategy is always positive, while the benefit of participation is positive only for small degrees of labor oversupply. Beyond this point, the utility of participants decreases faster with an increasing degree of oversupply, compared to the utility of non-participants. Our work highlights the significance of labor supply levels for the effectiveness of collective action on gig platforms. We suggest organizing in shifts as a means to reduce oversupply and empower collectives.
arXiv BibTeX

Social Foundations of Computation Conference Paper Fairness in Social Influence Maximization via Optimal Transport Chowdhary, S., De Pasquale, G., Lanzetti, N., Stoica, A., Dorfler, F. Advances in Neural Information Processing Systems 37 (NeurIPS 2024), The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), September 2024 (Published)
We study fairness in social influence maximization, whereby one seeks to select seeds that spread a given piece of information throughout a network, ensuring balanced outreach among different communities (e.g. demographic groups). In the literature, fairness is often quantified in terms of the expected outreach within individual communities. In this paper, we demonstrate that such fairness metrics can be misleading since they ignore the stochastic nature of information diffusion processes. When information diffusion occurs in a probabilistic manner, multiple outreach scenarios can occur. As such, outcomes such as "in 50% of the cases, no one in group 1 receives the information and everyone in group 2 receives it, and in the other 50%, the opposite happens", which always result in largely unfair outcomes, are classified as fair by a variety of fairness metrics in the literature. We tackle this problem by designing a new fairness metric, mutual fairness, that captures variability in outreach through optimal transport theory. We propose a new seed selection algorithm that optimizes both outreach and mutual fairness, and we show its efficacy on several real datasets. We find that our algorithm increases fairness with only a minor decrease (and at times, even an increase) in efficiency.
ArXiv URL BibTeX

Social Foundations of Computation Conference Paper Allocation Requires Prediction Only if Inequality Is Low Shirali, A., Abebe, R., Hardt, M. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), PMLR, The Forty-First International Conference on Machine Learning (ICML), July 2024, *equal contribution (Published)
Algorithmic predictions are emerging as a promising solution concept for efficiently allocating societal resources. Fueling their use is an underlying assumption that such systems are necessary to identify individuals for interventions. We propose a principled framework for assessing this assumption: Using a simple mathematical model, we evaluate the efficacy of prediction-based allocations in settings where individuals belong to larger units such as hospitals, neighborhoods, or schools. We find that prediction-based allocations outperform baseline methods using aggregate unit-level statistics only when between-unit inequality is low and the intervention budget is high. Our results hold for a wide range of settings for the price of prediction, treatment effect heterogeneity, and unit-level statistics' learnability. Combined, we highlight the potential limits to improving the efficacy of interventions through prediction.
ArXiv URL BibTeX
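The comparison in the abstract above, targeting by individual risk predictions versus by unit-level aggregates, can be mimicked with a toy simulation. All quantities below (number of units, budget, noise in individual predictions, and the two inequality levels) are assumptions chosen only to illustrate the mechanics.

```python
# Hedged simulation: prediction-based vs. aggregate unit-level targeting of a budget.
import numpy as np

rng = np.random.default_rng(0)
n_units, unit_size, budget = 50, 200, 2000

def reached(inequality):
    base = np.clip(rng.normal(0.3, inequality, n_units), 0.01, 0.99)   # unit base rates
    at_risk = rng.random((n_units, unit_size)) < base[:, None]         # who needs the intervention
    predicted = at_risk + rng.normal(0.0, 1.0, at_risk.shape)          # noisy individual scores

    # prediction-based: intervene on the `budget` highest individual scores
    top = np.argsort(predicted, axis=None)[::-1][:budget]
    reach_pred = int(at_risk.ravel()[top].sum())

    # aggregate-based: fill the budget unit by unit, highest base rate first
    reach_agg, left = 0, budget
    for u in np.argsort(base)[::-1]:
        take = min(unit_size, left)
        reach_agg += int(at_risk[u, :take].sum())
        left -= take
        if left == 0:
            break
    return reach_pred, reach_agg

for inequality in (0.02, 0.25):
    p, a = reached(inequality)
    print(f"between-unit inequality {inequality}: prediction reaches {p}, unit-level reaches {a}")
```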

Social Foundations of Computation Conference Paper Causal Inference from Competing Treatments Stoica, A., Nastl, V. Y., Hardt, M. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), PMLR, The Forty-First International Conference on Machine Learning (ICML), July 2024 (Published)
Many applications of RCTs involve the presence of multiple treatment administrators -- from field experiments to online advertising -- that compete for the subjects' attention. In the face of competition, estimating a causal effect becomes difficult, as the position at which a subject sees a treatment influences their response, and thus the treatment effect. In this paper, we build a game-theoretic model of agents who wish to estimate causal effects in the presence of competition, through a bidding system and a utility function that minimizes estimation error. Our main technical result establishes an approximation with a tractable objective that maximizes the sample value obtained through strategically allocating budget on subjects. This allows us to find an equilibrium in our model: we show that the tractable objective has a pure Nash equilibrium, and that any Nash equilibrium is an approximate equilibrium for our general objective that minimizes estimation error under broad conditions. Conceptually, our work successfully combines elements from causal inference and game theory to shed light on the equilibrium behavior of experimentation under competition.
ArXiv URL BibTeX

Social Foundations of Computation Conference Paper Don’t Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget Dorner, F. E., Hardt, M. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), PMLR, The Forty-First International Conference on Machine Learning (ICML), July 2024 (Published)
We study how to best spend a budget of noisy labels to compare the accuracy of two binary classifiers. It's common practice to collect and aggregate multiple noisy labels for a given data point into a less noisy label via a majority vote. We prove a theorem that runs counter to conventional wisdom. If the goal is to identify the better of two classifiers, we show it's best to spend the budget on collecting a single label for more samples. Our result follows from a non-trivial application of Cramér's theorem, a staple in the theory of large deviations. We discuss the implications of our work for the design of machine learning benchmarks, where they overturn some time-honored recommendations. In addition, our results provide sample size bounds superior to what follows from Hoeffding's bound.
ArXiv URL BibTeX
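The budgeting question in the abstract above lends itself to a quick Monte-Carlo check: with a fixed budget of noisy labels, compare labelling many points once against labelling a third as many points with a three-vote majority, and see which choice more often identifies the truly better classifier. The label-noise rate and the two accuracies are assumptions.

```python
# Hedged Monte-Carlo sketch: one noisy label for many points vs. majority vote for fewer.
import numpy as np

rng = np.random.default_rng(0)
budget, noise, acc_a, acc_b, trials = 3000, 0.3, 0.71, 0.69, 1000

def prob_pick_better(n_points, votes):
    picks = 0
    for _ in range(trials):
        a_right = rng.random(n_points) < acc_a           # classifier A agrees with the truth
        b_right = rng.random(n_points) < acc_b           # classifier B agrees with the truth
        flips = rng.random((votes, n_points)) < noise    # each annotation flips with prob `noise`
        ref_right = flips.sum(axis=0) <= votes // 2      # majority vote recovers the true label
        # measured accuracy of each classifier against the (possibly wrong) reference labels
        score_a = np.mean(a_right == ref_right)
        score_b = np.mean(b_right == ref_right)
        picks += score_a > score_b
    return picks / trials

print("single label, 3000 points :", prob_pick_better(budget, 1))
print("3-vote majority, 1000 pts :", prob_pick_better(budget // 3, 3))
```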

Social Foundations of Computation Conference Paper Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks Zhang, G., Hardt, M. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), PMLR, The Forty-First International Conference on Machine Learning (ICML), July 2024 (Published)
We examine multi-task benchmarks in machine learning through the lens of social choice theory. We draw an analogy between benchmarks and electoral systems, where models are candidates and tasks are voters. This suggests a distinction between cardinal and ordinal benchmark systems. The former aggregate numerical scores into one model ranking; the latter aggregate rankings for each task. We apply Arrow's impossibility theorem to ordinal benchmarks to highlight the inherent limitations of ordinal systems, particularly their sensitivity to the inclusion of irrelevant models. Inspired by Arrow's theorem, we empirically demonstrate a strong trade-off between diversity and sensitivity to irrelevant changes in existing multi-task benchmarks. Our result is based on new quantitative measures of diversity and sensitivity that we introduce. Sensitivity quantifies the impact that irrelevant changes to tasks have on a benchmark. Diversity captures the degree of disagreement in model rankings across tasks. We develop efficient approximation algorithms for both measures, as exact computation is computationally challenging. Through extensive experiments on seven cardinal benchmarks and eleven ordinal benchmarks, we demonstrate a clear trade-off between diversity and stability: The more diverse a multi-task benchmark, the more sensitive to trivial changes it is. Additionally, we show that the aggregated rankings of existing benchmarks are highly unstable under irrelevant changes.
ArXiv Code URL BibTeX

Social Foundations of Computation Conference Paper Fairness Rising from the Ranks: HITS and PageRank on Homophilic Networks Stoica, A., Litvak, N., Chaintreau, A. In Proceedings of the Association for Computing Machinery (ACM) Web Conference 2024, ACM, The 2024 ACM Web Conference, May 2024 (Published)
In this paper, we investigate the conditions under which link analysis algorithms prevent minority groups from reaching high-ranking slots. We find that the most common link-based algorithms using centrality metrics, such as PageRank and HITS, can reproduce and even amplify bias against minority groups in networks. Yet, their behavior differs: on the one hand, we empirically show that PageRank mirrors the degree distribution for most of the ranking positions and it can equalize representation of minorities among the top-ranked nodes; on the other hand, we find that HITS amplifies pre-existing bias in homophilic networks through a novel theoretical analysis, supported by empirical results. We find the root cause of bias amplification in HITS to be the level of homophily present in the network, modeled through an evolving network model with two communities. We illustrate our theoretical analysis on both synthetic and real datasets and we present directions for future work.
ArXiv URL BibTeX

Social Foundations of Computation Conference Paper Test-Time Training on Nearest Neighbors for Large Language Models Hardt, M., Sun, Y. In The Twelfth International Conference on Learning Representations (ICLR 2024), May 2024 (Published)
Many recent efforts augment language models with retrieval, by adding retrieved data to the input context. For this approach to succeed, the retrieved data must be added at both training and test time. Moreover, as input length grows linearly with the size of retrieved data, cost in computation and memory grows quadratically for modern Transformers. To avoid these complications, we simply fine-tune the model on retrieved data at test time, using its standard training setup. We build a large-scale distributed index based on text embeddings of the Pile dataset. For each test input, our system retrieves its neighbors and fine-tunes the model on their text. Surprisingly, retrieving and training on as few as 20 neighbors, each for only one gradient iteration, drastically improves performance across more than 20 language modeling tasks in the Pile. For example, test-time training with nearest neighbors significantly narrows the performance gap between a small GPT-2 and a GPT-Neo model more than 10 times larger. Sufficient index quality and size, however, are necessary. Our work establishes a first baseline of test-time training for language modeling.
ArXiv Code URL BibTeX
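The recipe in the abstract above, retrieving the nearest training examples for each test input and briefly fine-tuning on them before predicting, can be shown in miniature with a linear model on synthetic data. This is only an analogy for the mechanics; the dataset, step size, and neighbor count are assumptions, and the paper's gains concern large-scale language modeling.

```python
# Hedged numpy analogue of test-time training on nearest neighbors.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test, k, lr = 10, 500, 50, 20, 0.05

w_true = rng.normal(size=d)
X_train = rng.normal(size=(n_train, d))
y_train = X_train @ w_true + 0.5 * np.sin(3 * X_train[:, 0])   # mild misspecification
X_test = rng.normal(size=(n_test, d))
y_test = X_test @ w_true + 0.5 * np.sin(3 * X_test[:, 0])

w_base = np.linalg.lstsq(X_train, y_train, rcond=None)[0]       # the "pretrained" model

def ttt_predict(x):
    # retrieve the k nearest training points and take one gradient step per neighbor
    idx = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    w = w_base.copy()
    for i in idx:
        grad = (X_train[i] @ w - y_train[i]) * X_train[i]
        w -= lr * grad
    return x @ w                                                # predict, then discard w

base_mse = np.mean((X_test @ w_base - y_test) ** 2)
ttt_mse = np.mean((np.array([ttt_predict(x) for x in X_test]) - y_test) ** 2)
print(f"base MSE: {base_mse:.3f}   test-time-trained MSE: {ttt_mse:.3f}")
```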

Social Foundations of Computation Conference Paper Unprocessing Seven Years of Algorithmic Fairness Cruz, A. F., Hardt, M. In The Twelfth International Conference on Learning Representations (ICLR 2024), May 2024 (Published)
Seven years ago, researchers proposed a postprocessing method to equalize the error rates of a model across different demographic groups. The work launched hundreds of papers purporting to improve over the postprocessing baseline. We empirically evaluate these claims through thousands of model evaluations on several tabular datasets. We find that the fairness-accuracy Pareto frontier achieved by postprocessing contains all other methods we were feasibly able to evaluate. In doing so, we address two common methodological errors that have confounded previous observations. One relates to the comparison of methods with different unconstrained base models. The other concerns methods achieving different levels of constraint relaxation. At the heart of our study is a simple idea we call unprocessing that roughly corresponds to the inverse of postprocessing. Unprocessing allows for a direct comparison of methods using different underlying models and levels of relaxation. Interpreting our findings, we recall a widely overlooked theoretical argument, presented seven years ago, that accurately predicted what we observe.
ArXiv Code URL BibTeX

Social Foundations of Computation Conference Paper ImageNot: A Contrast with ImageNet Preserves Model Rankings Salaudeen, O., Hardt, M. April 2024 (Submitted)
We introduce ImageNot, a dataset designed to match the scale of ImageNet while differing drastically in other aspects. We show that key model architectures developed for ImageNet over the years rank identically when trained and evaluated on ImageNot to how they rank on ImageNet. This is true when training models from scratch or fine-tuning them. Moreover, the relative improvements of each model over earlier models strongly correlate in both datasets. We further give evidence that ImageNot has a similar utility as ImageNet for transfer learning purposes. Our work demonstrates a surprising degree of external validity in the relative performance of image classification models. This stands in contrast with absolute accuracy numbers that typically drop sharply even under small changes to a dataset.
ArXiv BibTeX

Social Foundations of Computation Article Integration of Generative AI in the Digital Markets Act: Contestability and Fairness from a Cross-Disciplinary Perspective Yasar, A. G., Chong, A., Dong, E., Gilbert, T., Hladikova, S., Mougan, C., Shen, X., Singh, S., Stoica, A., Thais, S. Workshop on Generative AI + Law (GenLaw), LSE Legal Studies Working Paper, The Fortieth International Conference on Machine Learning (ICML) 2023, March 2024 (Published)
The EU’s Digital Markets Act (DMA) aims to address the lack of contestability and unfair practices in digital markets. But the current framework of the DMA does not adequately cover the rapid advance of generative AI. As the EU adopts AI-specific rules and considers possible amendments to the DMA, this paper suggests that generative AI should be added to the DMA’s list of core platform services. This amendment is the first necessary step to address the emergence of entrenched and durable positions in the generative AI industry.
URL BibTeX

Social Foundations of Computation Algorithms and Society Conference Paper Causal Inference out of Control: Estimating Performativity without Treatment Randomization Cheng, G., Hardt, M., Mendler-Dünner, C. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), PMLR, The Forty-First International Conference on Machine Learning (ICML), January 2024 (Published)
Regulators and academics are increasingly interested in the causal effect that algorithmic actions of a digital platform have on user consumption. In pursuit of estimating this effect from observational data, we identify a set of assumptions that permit causal identifiability without assuming randomized platform actions. Our results are applicable to platforms that rely on machine-learning-powered predictions and leverage knowledge from historical data. The key novelty of our approach is to explicitly model the dynamics of consumption over time, exploiting the repeated interaction of digital platforms with their participants to prove our identifiability results. By viewing the platform as a controller acting on a dynamical system, we can show that exogenous variation in consumption and appropriately responsive algorithmic control actions are sufficient for identifying the causal effect of interest. We complement our claims with an analysis of ready-to-use finite sample estimators and empirical investigations. More broadly, our results deriving identifiability conditions tailored to digital platform settings illustrate a fruitful interplay of control theory and causal inference.
ArXiv URL BibTeX

Social Foundations of Computation Algorithms and Society Conference Paper Collaborative Learning via Prediction Consensus Fan, D., Mendler-Dünner, C., Jaggi, M. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), Curran Associates, Inc., The Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS), December 2023 (Published)
We consider a collaborative learning setting where the goal of each agent is to improve their own model by leveraging the expertise of collaborators, in addition to their own training data. To facilitate the exchange of expertise among agents, we propose a distillation-based method leveraging shared unlabeled auxiliary data, which is pseudo-labeled by the collective. Central to our method is a trust weighting scheme that serves to adaptively weigh the influence of each collaborator on the pseudo-labels until a consensus on how to label the auxiliary data is reached. We demonstrate empirically that our collaboration scheme is able to significantly boost the performance of individual models in the target domain from which the auxiliary data is sampled. By design, our method adeptly accommodates heterogeneity in model architectures and substantially reduces communication overhead compared to typical collaborative learning methods. At the same time, it can provably mitigate the negative impact of bad models on the collective.
ArXiv URL BibTeX
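The trust-weighting idea in the abstract above can be caricatured with synthetic agents: predictions on shared unlabeled data are combined with trust weights, and each agent's trust is updated from its agreement with the current consensus. This is a deliberately simplified sketch (hard 0/1 predictions, a weighted vote), not the paper's distillation method.

```python
# Hedged sketch of trust-weighted pseudo-label consensus among collaborating agents.
import numpy as np

rng = np.random.default_rng(0)
n_points = 400
truth = rng.integers(0, 2, n_points)

acc = np.array([0.9, 0.85, 0.8, 0.75, 0.55])     # assumed agent accuracies; the last one is weak
preds = np.stack([np.where(rng.random(n_points) < a, truth, 1 - truth) for a in acc])

trust = np.full(len(acc), 1.0 / len(acc))
for _ in range(10):
    consensus = (trust[:, None] * preds).sum(axis=0) >= 0.5   # trust-weighted vote
    agreement = (preds == consensus).mean(axis=1)             # agreement with the consensus
    trust = agreement / agreement.sum()                       # renormalized trust weights

print("final trust weights  :", np.round(trust, 3))
print("pseudo-label accuracy:", (consensus == truth).mean())
```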

Social Foundations of Computation Poster Do Personality Tests Generalize to Large Language Models? Dorner, F. E., Sühr, T., Samadi, S., Kelava, A. Socially Responsible Language Modelling Research (SoLaR) Workshop, The Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS), December 2023, *equal contribution (Published)
With large language models (LLMs) appearing to behave increasingly human-like in text-based interactions, it has become popular to attempt to evaluate various properties of these models using tests originally designed for humans. While re-using existing tests is a resource-efficient way to evaluate LLMs, careful adjustments are usually required to ensure that test results are even valid across human sub-populations. Thus, it is not clear to what extent different tests’ validity generalizes to LLMs. In this work, we provide evidence that LLMs’ responses to personality tests systematically deviate from typical human responses, implying that these results cannot be interpreted in the same way as human test results. Concretely, reverse-coded items (e.g. “I am introverted” vs “I am extraverted”) are often both answered affirmatively by LLMs. In addition, variation across different prompts designed to “steer” LLMs to simulate particular personality types does not follow the clear separation into five independent personality factors from human samples. In light of these results, we believe it is important to pay more attention to tests’ validity for LLMs before drawing strong conclusions about potentially ill-defined concepts like LLMs’ “personality”.
URL BibTeX

Social Foundations of Computation Book Fairness and Machine Learning: Limitations and Opportunities Barocas, S., Hardt, M., Narayanan, A. MIT Press, December 2023 (Published)
An introduction to the intellectual foundations and practical utility of the recent work on fairness and machine learning. Fairness and Machine Learning introduces advanced undergraduate and graduate students to the intellectual foundations of this recently emergent field, drawing on a diverse range of disciplinary perspectives to identify the opportunities and hazards of automated decision-making. It surveys the risks in many applications of machine learning and provides a review of an emerging set of proposed solutions, showing how even well-intentioned applications may give rise to objectionable results. It covers the statistical and causal measures used to evaluate the fairness of machine learning models as well as the procedural and substantive aspects of decision-making that are core to debates about fairness, including a review of legal and philosophical perspectives on discrimination. This incisive textbook prepares students of machine learning to do quantitative work on fairness while reflecting critically on its foundations and its practical utility.
• Introduces the technical and normative foundations of fairness in automated decision-making
• Covers the formal and computational methods for characterizing and addressing problems
• Provides a critical assessment of their intellectual foundations and practical utility
• Features rich pedagogy and extensive instructor resources
URL BibTeX

Social Foundations of Computation Conference Paper Is Your Model Predicting the Past? Hardt, M., Kim, M. P. In Proceedings of the Third ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO), ACM, October 2023 (Published)
When does a machine learning model predict the future of individuals and when does it recite patterns that predate the individuals? In this work, we propose a distinction between these two pathways of prediction, supported by theoretical, empirical, and normative arguments. At the center of our proposal is a family of simple and efficient statistical tests, called backward baselines, that demonstrate if, and to what extent, a model recounts the past. Our statistical theory provides guidance for interpreting backward baselines, establishing equivalences between different baselines and familiar statistical concepts. Concretely, we derive a meaningful backward baseline for auditing a prediction system as a black box, given only background variables and the system’s predictions. Empirically, we evaluate the framework on different prediction tasks derived from longitudinal panel surveys, demonstrating the ease and effectiveness of incorporating backward baselines into the practice of machine learning.
URL BibTeX

Social Foundations of Computation Conference Paper Incentivizing Honesty among Competitors in Collaborative Learning and Optimization Dorner, F. E., Konstantinov, N., Pashaliev, G., Vechev, M. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), The Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS), September 2023 (Published)
Collaborative learning techniques have the potential to enable training machine learning models that are superior to models trained on a single entity’s data. However, in many cases, potential participants in such collaborative schemes are competitors on a downstream task, such as firms that each aim to attract customers by providing the best recommendations. This can incentivize dishonest updates that damage other participants' models, potentially undermining the benefits of collaboration. In this work, we formulate a game that models such interactions and study two learning tasks within this framework: single-round mean estimation and multi-round SGD on strongly-convex objectives. For a natural class of player actions, we show that rational clients are incentivized to strongly manipulate their updates, preventing learning. We then propose mechanisms that incentivize honest communication and ensure learning quality comparable to full cooperation. Lastly, we empirically demonstrate the effectiveness of our incentive scheme on a standard non-convex federated learning benchmark. Our work shows that explicitly modeling the incentives and actions of dishonest clients, rather than assuming them malicious, can enable strong robustness guarantees for collaborative learning.
arXiv URL BibTeX

Social Foundations of Computation Conference Paper AI and the EU Digital Markets Act: Addressing the Risks of Bigness in Generative AI Yasar, A. G., Chong, A., Dong, E., Gilbert, T. K., Hladikova, S., Maio, R., Mougan, C., Shen, X., Singh, S., Stoica, A., Thais, S., Zilka, M. Proceedings of the 40th International Conference on Machine Learning (ICML 2023), PMLR, The Fortieth International Conference on Machine Learning (ICML), July 2023 (Accepted)
As AI technology advances rapidly, concerns over the risks of bigness in digital markets are also growing. The EU's Digital Markets Act (DMA) aims to address these risks. Still, the current framework may not adequately cover generative AI systems that could become gateways for AI-based services. This paper argues for integrating certain AI software as core platform services and classifying certain developers as gatekeepers under the DMA. We also propose an assessment of gatekeeper obligations to ensure they cover generative AI services. As the EU considers generative AI-specific rules and possible DMA amendments, this paper provides insights aimed at promoting diversity and openness in generative AI services.
arXiv URL BibTeX

Social Foundations of Computation Algorithms and Society Conference Paper Algorithmic Collective Action in Machine Learning Hardt, M., Mazumdar, E., Mendler-Dünner, C., Zrnic, T. In Proceedings of the 40th International Conference on Machine Learning (ICML 2023), PMLR, The Fortieth International Conference on Machine Learning (ICML), July 2023 (Published)
We initiate a principled study of algorithmic collective action on digital platforms that deploy machine learning algorithms. We propose a simple theoretical model of a collective interacting with a firm’s learning algorithm. The collective pools the data of participating individuals and executes an algorithmic strategy by instructing participants how to modify their own data to achieve a collective goal. We investigate the consequences of this model in three fundamental learning-theoretic settings: nonparametric optimal learning, parametric risk minimization, and gradient-based optimization. In each setting, we come up with coordinated algorithmic strategies and characterize natural success criteria as a function of the collective’s size. Complementing our theory, we conduct systematic experiments on a skill classification task involving tens of thousands of resumes from a gig platform for freelancers. Through more than two thousand model training runs of a BERT-like language model, we see a striking correspondence emerge between our empirical observations and the predictions made by our theory. Taken together, our theory and experiments broadly support the conclusion that algorithmic collectives of exceedingly small fractional size can exert significant control over a platform’s learning algorithm.
URL BibTeX
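As a rough illustration of the feature-planting flavor of collective action (our own toy, not the paper's resume experiment), the hypothetical sketch below lets an alpha-fraction collective insert an otherwise unused trigger feature into its own training points and relabel them toward a target class, then measures how often the trained classifier returns that class whenever the trigger is present.

# Hypothetical toy: an alpha-fraction collective plants a trigger feature in its
# own data, relabels it, and we check the trained model's response to the trigger.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, d, alpha = 10000, 50, 0.02
X = rng.binomial(1, 0.1, size=(n, d)).astype(float)
y = (X[:, :5].sum(axis=1) > 0).astype(int)   # labels before the collective acts
X[:, -1] = 0.0                               # a feature no one else uses
k = int(alpha * n)                           # collective size
X[:k, -1] = 1.0                              # collective plants the trigger
y[:k] = 1                                    # and relabels toward its target class

clf = LogisticRegression(max_iter=1000).fit(X, y)
X_test = rng.binomial(1, 0.1, size=(2000, d)).astype(float)
X_test[:, -1] = 1.0                          # trigger present at test time
success = (clf.predict(X_test) == 1).mean()
print(f"fraction classified as the target label with trigger: {success:.2f}")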

Social Foundations of Computation Conference Paper A Theory of Dynamic Benchmarks Shirali, A., Abebe, R., Hardt, M. In The Eleventh International Conference on Learning Representations (ICLR 2023), May 2023 (Published)
Dynamic benchmarks interweave model fitting and data collection in an attempt to mitigate the limitations of static benchmarks. In contrast to an extensive theoretical and empirical study of the static setting, the dynamic counterpart lags behind due to limited empirical studies and no apparent theoretical foundation to date. Responding to this deficit, we initiate a theoretical study of dynamic benchmarking. We examine two realizations, one capturing current practice and the other modeling more complex settings. In the first model, where data collection and model fitting alternate sequentially, we prove that model performance improves initially but can stall after only three rounds. Label noise arising from, for instance, annotator disagreement leads to even stronger negative results. Our second model generalizes the first to the case where data collection and model fitting have a hierarchical dependency structure. We show that this design guarantees strictly more progress than the first, albeit at a significant increase in complexity. We support our theoretical analysis by simulating dynamic benchmarks on two popular datasets. These results illuminate the benefits and practical limitations of dynamic benchmarking, providing both a theoretical foundation and a causal explanation for observed bottlenecks in empirical work.
arXiv URL BibTeX
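A minimal toy simulation of the sequential design described above (our own illustration, not the paper's construction): each round, new examples that the current model misclassifies are added to the pool and the model is refit, so one can watch how quickly progress stalls. All data and parameters are invented for the example.

# Toy simulation of sequential dynamic benchmarking: collect hard examples, refit, repeat.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

def sample(n):
    X = rng.normal(size=(n, 10))
    y = (X @ np.linspace(1, 0.1, 10) + 0.3 * rng.normal(size=n) > 0).astype(int)
    return X, y

X_pool, y_pool = sample(500)
X_eval, y_eval = sample(20000)
model = LogisticRegression(max_iter=1000).fit(X_pool, y_pool)
for rnd in range(5):
    X_new, y_new = sample(5000)
    wrong = model.predict(X_new) != y_new                 # adversarial-style collection
    X_pool = np.vstack([X_pool, X_new[wrong][:500]])
    y_pool = np.concatenate([y_pool, y_new[wrong][:500]])
    model = LogisticRegression(max_iter=1000).fit(X_pool, y_pool)
    print(f"round {rnd}: eval accuracy {model.score(X_eval, y_eval):.3f}")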

Social Foundations of Computation Conference Paper Human-Guided Fair Classification for Natural Language Processing Dorner, F. E., Peychev, M., Konstantinov, N., Goel, N., Ash, E., Vechev, M. In The Eleventh International Conference on Learning Representations (ICLR 2023), February 2023 (Published)
Text classifiers have promising applications in high-stakes tasks such as resume screening and content moderation. These classifiers must be fair and avoid discriminatory decisions by being invariant to perturbations of sensitive attributes such as gender or ethnicity. However, there is a gap between human intuition about these perturbations and the formal similarity specifications capturing them. While existing research has started to address this gap, current methods are based on hardcoded word replacements, resulting in specifications with limited expressivity or ones that fail to fully align with human intuition (e.g., in cases of asymmetric counterfactuals). This work proposes novel methods for bridging this gap by discovering expressive and intuitive individual fairness specifications. We show how to leverage unsupervised style transfer and GPT-3's zero-shot capabilities to automatically generate expressive candidate pairs of semantically similar sentences that differ along sensitive attributes. We then validate the generated pairs via an extensive crowdsourcing study, which confirms that many of these pairs align with human intuition about fairness in the context of toxicity classification. Finally, we show how limited amounts of human feedback can be leveraged to learn a similarity specification that can be used to train downstream fairness-aware models.
arXiv URL BibTeX

Social Foundations of Computation Conference Paper What Makes ImageNet Look Unlike LAION Shirali, A., Hardt, M. The Twelfth International Conference on Learning Representations (ICLR 2024), February 2023 (Submitted)
ImageNet was famously created from Flickr image search results. What if we recreated ImageNet instead by searching the massive LAION dataset based on image captions alone? In this work, we carry out this counterfactual investigation. We find that the resulting ImageNet recreation, which we call LAIONet, looks distinctly unlike the original. Specifically, the intra-class similarity of images in the original ImageNet is dramatically higher than it is for LAIONet. Consequently, models trained on ImageNet perform significantly worse on LAIONet. We propose a rigorous explanation for the discrepancy in terms of a subtle, yet important, difference in two plausible causal data-generating processes for the respective datasets, which we support with systematic experimentation. In a nutshell, searching based on an image caption alone creates an information bottleneck that mitigates the selection bias otherwise present in image-based filtering. Our explanation formalizes a long-held intuition in the community that ImageNet images are stereotypical, unnatural, and overly simple representations of the class category. At the same time, it provides a simple and actionable takeaway for future dataset creation efforts.
arXiv URL BibTeX
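The intra-class similarity comparison can be reproduced in a few lines once image embeddings are available (for example from CLIP). The sketch below uses random arrays as stand-ins for per-class embedding matrices, so the function is the point and the specific numbers are assumptions for illustration only.

# Sketch: mean pairwise cosine similarity within a class, given an embedding matrix.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def intra_class_similarity(embeddings: np.ndarray) -> float:
    sims = cosine_similarity(embeddings)
    upper = sims[np.triu_indices_from(sims, k=1)]  # exclude self-similarity
    return float(upper.mean())

rng = np.random.default_rng(4)
imagenet_class = rng.normal(size=(100, 512)) + 2.0  # shared offset -> more similar images
laionet_class = rng.normal(size=(100, 512))
print("ImageNet-style class:", round(intra_class_similarity(imagenet_class), 3))
print("LAIONet-style class:", round(intra_class_similarity(laionet_class), 3))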

Human Aspects of Machine Learning Social Foundations of Computation Unpublished Challenging the validity of personality tests for large language models Sühr, T., Dorner, F., Samadi, S., Kelava, A. arXiv, 2023 (Submitted)
With large language models (LLMs) like GPT-4 appearing to behave increasingly human-like in text-based interactions, it has become popular to attempt to evaluate personality traits of LLMs using questionnaires originally developed for humans. While reusing measures is a resource-efficient way to evaluate LLMs, careful adaptations are usually required to ensure that assessment results are valid even across human subpopulations. In this work, we provide evidence that LLMs’ responses to personality tests systematically deviate from human responses, implying that the results of these tests cannot be interpreted in the same way. Concretely, reverse-coded items (“I am introverted” vs. “I am extraverted”) are often both answered affirmatively. Furthermore, variation across prompts designed to “steer” LLMs to simulate particular personality types does not follow the clear separation into five independent personality factors from human samples. In light of these results, we believe that it is important to investigate tests’ validity for LLMs before drawing strong conclusions about potentially ill-defined concepts like LLMs’ “personality”.
BibTeX

Human Aspects of Machine Learning Social Foundations of Computation Conference Paper Do personality tests generalize to Large Language Models? Sühr, T., Dorner, F., Samadi, S., Kelava, A. In Proceedings of the Thirty-Seventh Annual Conference on Neural Information Processing Systems, Ernest N. Morial Convention Center, New Orleans, Louisiana, Socially Responsible Language Modelling Research (SoLaR) Workshop at NeurIPS, 2023 (Published) URL BibTeX

Social Foundations of Computation Algorithms and Society Conference Paper Anticipating Performativity by Predicting from Predictions Mendler-Dünner, C., Ding, F., Wang, Y. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), Curran Associates, Inc., The Thirty-Sixth Annual Conference on Neural Information Processing Systems (NeurIPS), November 2022 (Published)
Predictions about people, such as their expected educational achievement or their credit risk, can be performative and shape the outcome that they are designed to predict. Understanding the causal effect of predictions on the eventual outcomes is crucial for foreseeing the implications of future predictive models and selecting which models to deploy. However, this causal estimation task poses unique challenges: model predictions are usually deterministic functions of input features and highly correlated with outcomes, which can make the causal effects of predictions on outcomes impossible to disentangle from the direct effect of the covariates. We study this problem through the lens of causal identifiability. Despite the hardness of this problem in full generality, we highlight three natural scenarios where the causal effect of predictions can be identified from observational data: randomization in predictions, overparameterization of the predictive model deployed during data collection, and discrete prediction outputs. Empirically, we show that when our identifiability conditions hold, standard variants of supervised learning that predict from predictions by treating the prediction as an input feature can find transferable functional relationships that allow for conclusions about newly deployed predictive models. These positive results fundamentally rely on model predictions being recorded during data collection, bringing forward the importance of rethinking standard data collection practices to enable progress towards a better understanding of social outcomes and performative feedback loops.
arXiv URL BibTeX
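A minimal sketch of the "predict from predictions" recipe, assuming the deployed model's predictions were recorded during data collection and carry a small amount of independent noise (playing the role of the randomization condition above). The data-generating process, coefficients, and variable names are invented for illustration.

# Sketch: include the recorded prediction as an input feature when fitting the outcome model.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 10000
X = rng.normal(size=(n, 4))                                           # covariates
pred = X @ np.array([1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=n)  # recorded (slightly noisy) prediction
outcome = 2.0 * X[:, 0] + 0.8 * pred + rng.normal(size=n)             # the prediction shapes the outcome

with_pred = LinearRegression().fit(np.column_stack([X, pred]), outcome)
print("estimated effect of the prediction on the outcome:", round(with_pred.coef_[-1], 2))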

Social Foundations of Computation Book Patterns, Predictions, and Actions: Foundations of Machine Learning Hardt, M., Recht, B. Princeton University Press, August 2022 (Published)
An authoritative, up-to-date graduate textbook on machine learning that highlights its historical context and societal impacts. Patterns, Predictions, and Actions introduces graduate students to the essentials of machine learning while offering invaluable perspective on its history and social implications. Beginning with the foundations of decision making, Moritz Hardt and Benjamin Recht explain how representation, optimization, and generalization are the constituents of supervised learning. They go on to provide self-contained discussions of causality, the practice of causal inference, sequential decision making, and reinforcement learning, equipping readers with the concepts and tools they need to assess the consequences that may arise from acting on statistical decisions.
• Provides a modern introduction to machine learning, showing how data patterns support predictions and consequential actions
• Pays special attention to societal impacts and fairness in decision making
• Traces the development of machine learning from its origins to today
• Features a novel chapter on machine learning benchmarks and datasets
• Invites readers from all backgrounds, requiring some experience with probability, calculus, and linear algebra
• An essential textbook for students and a guide for researchers
URL BibTeX

Social Foundations of Computation Algorithms and Society Conference Paper Regret Minimization with Performative Feedback Jagadeesan, M., Zrnic, T., Mendler-Dünner, C. In Proceedings of the Thirty-Ninth International Conference on Machine Learning (ICML 2022), PMLR, The Thirty-Ninth International Conference on Machine Learning (ICML), July 2022 (Published)
In performative prediction, the deployment of a predictive model triggers a shift in the data distribution. As these shifts are typically unknown ahead of time, the learner needs to deploy a model to get feedback about the distribution it induces. We study the problem of finding near-optimal models under performativity while maintaining low regret. On the surface, this problem might seem equivalent to a bandit problem. However, it exhibits a fundamentally richer feedback structure that we refer to as performative feedback: after every deployment, the learner receives samples from the shifted distribution rather than bandit feedback about the reward. Our main contribution is regret bounds that scale only with the complexity of the distribution shifts and not that of the reward function. The key algorithmic idea is careful exploration of the distribution shifts that informs a novel construction of confidence bounds on the risk of unexplored models. The construction only relies on smoothness of the shifts and does not assume convexity. More broadly, our work establishes a conceptual approach for leveraging tools from the bandits literature for the purpose of regret minimization with performative feedback.
arXiv URL BibTeX

Social Foundations of Computation Conference Paper Adversarial Scrutiny of Evidentiary Statistical Software Abebe, R., Hardt, M., Jin, A., Miller, J., Schmidt, L., Wexler, R. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT), ACM, June 2022 (Published)
The U.S. criminal legal system increasingly relies on software output to convict and incarcerate people. In a large number of cases each year, the government makes these consequential decisions based on evidence from statistical software—such as probabilistic genotyping, environmental audio detection and toolmark analysis tools—that the defense counsel cannot fully cross-examine or scrutinize. This undermines the commitments of the adversarial criminal legal system, which relies on the defense’s ability to probe and test the prosecution’s case to safeguard individual rights. Responding to this need to adversarially scrutinize output from such software, we propose robust adversarial testing as a framework to examine the validity of evidentiary statistical software. We define and operationalize this notion of robust adversarial testing for defense use by drawing on a large body of recent work in robust machine learning and algorithmic fairness. We demonstrate how this framework both standardizes the process for scrutinizing such tools and empowers defense lawyers to examine their validity for instances most relevant to the case at hand. We further discuss existing structural and institutional challenges within the U.S. criminal legal system which may create barriers for implementing this framework and close with a discussion on policy changes that could help address these concerns.
URL BibTeX