Publications

Social Foundations of Computation Conference Paper Difficult Lessons on Social Prediction from Wisconsin Public Schools Perdomo, J. C., Britton, T., Hardt, M., Abebe, R. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, June 2025 (Published)
Early warning systems (EWS) are predictive tools at the center of recent efforts to improve graduation rates in public schools across the United States. These systems assist in targeting interventions to individual students by predicting which students are at risk of dropping out. Despite significant investments in their widespread adoption, there remain large gaps in our understanding of the efficacy of EWS, and the role of statistical risk scores in education. In this work, we draw on nearly a decade's worth of data from the Dropout Early Warning System (DEWS) used throughout Wisconsin to provide the first large-scale evaluation of the long-term impact of EWS on graduation outcomes. We present empirical evidence that the prediction system accurately sorts students by their dropout risk. We also find that it may have caused a single-digit percentage increase in graduation rates, though our empirical analyses cannot reliably rule out that there has been no positive treatment effect. Going beyond a retrospective evaluation of DEWS, we draw attention to a central question at the heart of the use of EWS: Are individual risk scores necessary for effectively targeting interventions? We propose a simple mechanism that only uses information about students' environments, such as their schools and districts, and argue that this mechanism can target interventions just as efficiently as the individual risk score-based mechanism. Our argument holds even if individual predictions are highly accurate and effective interventions exist. In addition to motivating this simple targeting mechanism, our work provides a novel empirical backbone for the robust qualitative understanding among education researchers that dropout is structurally determined. Combined, our insights call into question the marginal value of individual predictions in settings where outcomes are driven by high levels of inequality.

Social Foundations of Computation Conference Paper To Give or Not to Give? The Impacts of Strategically Withheld Recourse Chen, Y., Estornell, A., Vorobeychik, Y., Liu, Y. In Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, The Twenty-Eighth International Conference on Artificial Intelligence and Statistics (AISTATS), May 2025 (Published)
Individuals often aim to reverse undesired outcomes in interactions with automated systems, like loan denials, by either implementing system-recommended actions (recourse), or manipulating their features. While providing recourse benefits users and enhances system utility, it also provides information about the decision process that can be used for more effective strategic manipulation, especially when the individuals collectively share such information with each other. We show that this tension leads rational utility-maximizing systems to frequently withhold recourse, resulting in decreased population utility, particularly impacting sensitive groups. To mitigate these effects, we explore the role of recourse subsidies, finding them effective in increasing the provision of recourse actions by rational systems, as well as lowering the potential social cost and mitigating unfairness caused by recourse withholding.

Social Foundations of Computation Conference Paper Allocation Requires Prediction Only if Inequality Is Low Shirali, A., Abebe, R., Hardt, M. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), PMLR, The Forty-First International Conference on Machine Learning (ICML), July 2024, *equal contribution (Published)
Algorithmic predictions are emerging as a promising solution concept for efficiently allocating societal resources. Fueling their use is an underlying assumption that such systems are necessary to identify individuals for interventions. We propose a principled framework for assessing this assumption: Using a simple mathematical model, we evaluate the efficacy of prediction-based allocations in settings where individuals belong to larger units such as hospitals, neighborhoods, or schools. We find that prediction-based allocations outperform baseline methods using aggregate unit-level statistics only when between-unit inequality is low and the intervention budget is high. Our results hold for a wide range of settings for the price of prediction, treatment effect heterogeneity, and unit-level statistics’ learnability. Combined, our results highlight the potential limits to improving the efficacy of interventions through prediction.
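The trade-off the paper formalizes can be illustrated with a toy simulation (an illustrative sketch with made-up parameters, not the paper's actual model): compare targeting individuals by a noisy individual risk score against spending the same budget on everyone in the highest-risk units.
```python
import numpy as np

rng = np.random.default_rng(0)

def hit_rates(between_sd, n_units=50, unit_size=200, budget=1000, noise=1.0):
    # Each unit (hospital, neighborhood, school) has a base risk level;
    # individuals vary around it, and predictions observe risk with noise.
    unit_risk = rng.normal(0.0, between_sd, n_units)
    true_risk = unit_risk.repeat(unit_size) + rng.normal(0.0, 1.0, n_units * unit_size)
    score = true_risk + rng.normal(0.0, noise, true_risk.size)

    neediest = set(np.argsort(true_risk)[-budget:])            # ground-truth neediest
    by_score = np.argsort(score)[-budget:]                     # prediction-based policy
    unit_of = np.arange(n_units).repeat(unit_size)
    top_units = np.argsort(unit_risk)[-(budget // unit_size):]
    by_unit = np.flatnonzero(np.isin(unit_of, top_units))      # aggregate-statistics policy

    covered = lambda chosen: np.mean([i in neediest for i in chosen])
    return covered(by_score), covered(by_unit)

for sd in (0.2, 2.0):  # low vs. high between-unit inequality
    s, u = hit_rates(sd)
    print(f"between-unit sd={sd}: prediction-based {s:.2f}, unit-level {u:.2f}")
```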

Social Foundations of Computation Conference Paper Causal Inference from Competing Treatments Stoica, A., Nastl, V. Y., Hardt, M. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), PMLR, The Forty-First International Conference on Machine Learning (ICML), July 2024 (Published)
Many applications of RCTs involve the presence of multiple treatment administrators -- from field experiments to online advertising -- that compete for the subjects' attention. In the face of competition, estimating a causal effect becomes difficult, as the position at which a subject sees a treatment influences their response, and thus the treatment effect. In this paper, we build a game-theoretic model of agents who wish to estimate causal effects in the presence of competition, through a bidding system and a utility function that minimizes estimation error. Our main technical result establishes an approximation with a tractable objective that maximizes the sample value obtained through strategically allocating budget on subjects. This allows us to find an equilibrium in our model: we show that the tractable objective has a pure Nash equilibrium, and that any Nash equilibrium is an approximate equilibrium for our general objective that minimizes estimation error under broad conditions. Conceptually, our work successfully combines elements from causal inference and game theory to shed light on the equilibrium behavior of experimentation under competition.

Social Foundations of Computation Conference Paper Don’t Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget Dorner, F. E., Hardt, M. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), PMLR, The Forty-First International Conference on Machine Learning (ICML), July 2024 (Published)
We study how to best spend a budget of noisy labels to compare the accuracy of two binary classifiers. It's common practice to collect and aggregate multiple noisy labels for a given data point into a less noisy label via a majority vote. We prove a theorem that runs counter to conventional wisdom. If the goal is to identify the better of two classifiers, we show it's best to spend the budget on collecting a single label for more samples. Our result follows from a non-trivial application of Cramér's theorem, a staple in the theory of large deviations. We discuss the implications of our work for the design of machine learning benchmarks, where they overturn some time-honored recommendations. In addition, our results provide sample size bounds superior to what follows from Hoeffding's bound.
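A quick Monte Carlo sketch of the question (illustrative only; the paper's argument is a theorem, not a simulation, and these accuracies and noise rates are made up): with the same label budget and label-flip probability, compare labeling many points once against labeling a third as many points three times with a majority vote.
```python
import numpy as np

rng = np.random.default_rng(1)
BUDGET, FLIP, ACC1, ACC2, TRIALS = 3000, 0.2, 0.71, 0.70, 2000

def picks_better_classifier(n_points, votes):
    # Classifier 1 is truly better (ACC1 > ACC2); labels flip with prob. FLIP.
    y = rng.integers(0, 2, n_points)
    pred1 = np.where(rng.random(n_points) < ACC1, y, 1 - y)
    pred2 = np.where(rng.random(n_points) < ACC2, y, 1 - y)
    noisy = np.where(rng.random((votes, n_points)) < FLIP, 1 - y, y)
    label = (2 * noisy.sum(axis=0) > votes).astype(int)   # majority vote
    return (pred1 == label).mean() > (pred2 == label).mean()

single = np.mean([picks_better_classifier(BUDGET, 1) for _ in range(TRIALS)])
triple = np.mean([picks_better_classifier(BUDGET // 3, 3) for _ in range(TRIALS)])
print(f"P(identify better classifier): 1 label x {BUDGET} points {single:.2f} "
      f"vs. majority-of-3 on {BUDGET // 3} points {triple:.2f}")
```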

Social Foundations of Computation Conference Paper Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks Zhang, G., Hardt, M. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), PMLR, The Forty-First International Conference on Machine Learning (ICML), July 2024 (Published)
We examine multi-task benchmarks in machine learning through the lens of social choice theory. We draw an analogy between benchmarks and electoral systems, where models are candidates and tasks are voters. This suggests a distinction between cardinal and ordinal benchmark systems. The former aggregate numerical scores into one model ranking; the latter aggregate rankings for each task. We apply Arrow's impossibility theorem to ordinal benchmarks to highlight the inherent limitations of ordinal systems, particularly their sensitivity to the inclusion of irrelevant models. Inspired by Arrow's theorem, we empirically demonstrate a strong trade-off between diversity and sensitivity to irrelevant changes in existing multi-task benchmarks. Our result is based on new quantitative measures of diversity and sensitivity that we introduce. Sensitivity quantifies the impact that irrelevant changes to tasks have on a benchmark. Diversity captures the degree of disagreement in model rankings across tasks. We develop efficient approximation algorithms for both measures, as exact computation is computationally challenging. Through extensive experiments on seven cardinal benchmarks and eleven ordinal benchmarks, we demonstrate a clear trade-off between diversity and stability: The more diverse a multi-task benchmark, the more sensitive to trivial changes it is. Additionally, we show that the aggregated rankings of existing benchmarks are highly unstable under irrelevant changes.
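The sensitivity of ordinal aggregation to irrelevant models can be seen in a textbook Borda-count instance (a standard social-choice example, not one taken from the paper): dropping a model that never wins flips which of the two front-runners the benchmark crowns.
```python
# Five "tasks" each rank three models best-to-worst.
tasks = [["A", "B", "C"]] * 3 + [["B", "C", "A"]] * 2

def borda_winner(rankings):
    points = {}
    for ranking in rankings:
        for position, model in enumerate(ranking):
            points[model] = points.get(model, 0) + len(ranking) - 1 - position
    return max(points, key=points.get), points

print(borda_winner(tasks))      # ('B', {'A': 6, 'B': 7, 'C': 2}): B wins with C included
drop_c = [[m for m in t if m != "C"] for t in tasks]
print(borda_winner(drop_c))     # ('A', {'A': 3, 'B': 2}): A wins once C is removed
```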

Social Foundations of Computation Conference Paper Fairness Rising from the Ranks: HITS and PageRank on Homophilic Networks Stoica, A., Litvak, N., Chaintreau, A. In Proceedings of the Association for Computing Machinery (ACM) Web Conference 2024, ACM, The 2024 ACM Web Conference, May 2024 (Published)
In this paper, we investigate the conditions under which link analysis algorithms prevent minority groups from reaching high-ranking slots. We find that the most common link-based algorithms using centrality metrics, such as PageRank and HITS, can reproduce and even amplify bias against minority groups in networks. Yet, their behavior differs: on the one hand, we empirically show that PageRank mirrors the degree distribution for most of the ranking positions and can equalize representation of minorities among the top-ranked nodes; on the other hand, through a novel theoretical analysis supported by empirical results, we find that HITS amplifies pre-existing bias in homophilic networks. We find the root cause of bias amplification in HITS to be the level of homophily present in the network, modeled through an evolving network model with two communities. We illustrate our theoretical analysis on both synthetic and real datasets and present directions for future work.

Social Foundations of Computation Conference Paper Test-Time Training on Nearest Neighbors for Large Language Models Hardt, M., Sun, Y. In The Twelfth International Conference on Learning Representations (ICLR 2024), May 2024 (Published)
Many recent efforts augment language models with retrieval, by adding retrieved data to the input context. For this approach to succeed, the retrieved data must be added at both training and test time. Moreover, as input length grows linearly with the size of retrieved data, cost in computation and memory grows quadratically for modern Transformers. To avoid these complications, we simply fine-tune the model on retrieved data at test time, using its standard training setup. We build a large-scale distributed index based on text embeddings of the Pile dataset. For each test input, our system retrieves its neighbors and fine-tunes the model on their text. Surprisingly, retrieving and training on as few as 20 neighbors, each for only one gradient iteration, drastically improves performance across more than 20 language modeling tasks in the Pile. For example, test-time training with nearest neighbors significantly narrows the performance gap between a small GPT-2 and a GPT-Neo model more than 10 times larger. Sufficient index quality and size, however, are necessary. Our work establishes a first baseline of test-time training for language modeling.
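A minimal sketch of the procedure described above, under stated assumptions: the four-document corpus, the all-MiniLM-L6-v2 embedder, plain AdamW fine-tuning of small GPT-2, and two neighbors are all stand-ins; the paper builds a large distributed index over the Pile, retrieves on the order of 20 neighbors, and uses the model's standard training setup.
```python
import torch
from sklearn.neighbors import NearestNeighbors
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in corpus; the paper indexes the Pile with a large distributed index.
corpus = [
    "The Crab Nebula is a supernova remnant in the constellation Taurus.",
    "In functional analysis, a Banach space is a complete normed vector space.",
    "The Treaty of Westphalia ended the Thirty Years' War in 1648.",
    "Gradient descent updates parameters in the direction of steepest descent.",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")      # assumed embedding model
index = NearestNeighbors(n_neighbors=2).fit(embedder.encode(corpus))

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def test_time_train(test_text):
    # Retrieve the test input's nearest neighbors, take one gradient step on
    # each neighbor's text, then score the test input with the adapted model.
    _, neighbor_ids = index.kneighbors(embedder.encode([test_text]))
    model.train()
    for i in neighbor_ids[0]:
        batch = tokenizer(corpus[i], return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    model.eval()
    with torch.no_grad():
        batch = tokenizer(test_text, return_tensors="pt")
        return model(**batch, labels=batch["input_ids"]).loss.item()

print("post-TTT loss:", test_time_train("A Hilbert space is a complete inner product space."))
```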

Social Foundations of Computation Conference Paper Unprocessing Seven Years of Algorithmic Fairness Cruz, A. F., Hardt, M. In The Twelfth International Conference on Learning Representations (ICLR 2024), May 2024 (Published)
Seven years ago, researchers proposed a postprocessing method to equalize the error rates of a model across different demographic groups. The work launched hundreds of papers purporting to improve over the postprocessing baseline. We empirically evaluate these claims through thousands of model evaluations on several tabular datasets. We find that the fairness-accuracy Pareto frontier achieved by postprocessing contains all other methods we were feasibly able to evaluate. In doing so, we address two common methodological errors that have confounded previous observations. One relates to the comparison of methods with different unconstrained base models. The other concerns methods achieving different levels of constraint relaxation. At the heart of our study is a simple idea we call unprocessing that roughly corresponds to the inverse of postprocessing. Unprocessing allows for a direct comparison of methods using different underlying models and levels of relaxation. Interpreting our findings, we recall a widely overlooked theoretical argument, presented seven years ago, that accurately predicted what we observe.
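For concreteness, here is a much-simplified sketch of group-wise threshold postprocessing (the actual equalized-odds postprocessing the paper builds on also randomizes between thresholds per group; this version only matches a shared true-positive-rate target):
```python
import numpy as np

def equalize_tpr_thresholds(scores, y_true, groups, target_tpr=0.8):
    # For each group, pick the decision threshold whose true-positive rate
    # matches a shared target; decisions are then `scores >= thresholds[g]`.
    thresholds = {}
    for g in np.unique(groups):
        positive_scores = scores[(groups == g) & (y_true == 1)]
        # A fraction `target_tpr` of positives lies above this quantile.
        thresholds[g] = np.quantile(positive_scores, 1 - target_tpr)
    return thresholds
```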

Social Foundations of Computation Algorithms and Society Conference Paper Causal Inference out of Control: Estimating Performativity without Treatment Randomization Cheng, G., Hardt, M., Mendler-Dünner, C. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), PMLR, The Forty-First International Conference on Machine Learning (ICML), January 2024 (Published)
Regulators and academics are increasingly interested in the causal effect that algorithmic actions of a digital platform have on user consumption. In pursuit of estimating this effect from observational data, we identify a set of assumptions that permit causal identifiability without assuming randomized platform actions. Our results are applicable to platforms that rely on machine-learning-powered predictions and leverage knowledge from historical data. The key novelty of our approach is to explicitly model the dynamics of consumption over time, exploiting the repeated interaction of digital platforms with their participants to prove our identifiability results. By viewing the platform as a controller acting on a dynamical system, we can show that exogenous variation in consumption and appropriately responsive algorithmic control actions are sufficient for identifying the causal effect of interest. We complement our claims with an analysis of ready-to-use finite sample estimators and empirical investigations. More broadly, our results deriving identifiability conditions tailored to digital platform settings illustrate a fruitful interplay of control theory and causal inference.

Social Foundations of Computation Algorithms and Society Conference Paper Collaborative Learning via Prediction Consensus Fan, D., Mendler-Dünner, C., Jaggi, M. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), Curran Associates, Inc., The Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS), December 2023 (Published)
We consider a collaborative learning setting where the goal of each agent is to improve their own model by leveraging the expertise of collaborators, in addition to their own training data. To facilitate the exchange of expertise among agents, we propose a distillation-based method leveraging shared unlabeled auxiliary data, which is pseudo-labeled by the collective. Central to our method is a trust weighting scheme that serves to adaptively weigh the influence of each collaborator on the pseudo-labels until a consensus on how to label the auxiliary data is reached. We demonstrate empirically that our collaboration scheme is able to significantly boost the performance of individual models in the target domain from which the auxiliary data is sampled. By design, our method adeptly accommodates heterogeneity in model architectures and substantially reduces communication overhead compared to typical collaborative learning methods. At the same time, it can provably mitigate the negative impact of bad models on the collective.
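A miniature of the consensus mechanism, with the caveat that the paper's actual trust-weighting scheme differs in its details; this sketch just iterates a trust-weighted vote and re-weights each agent by its agreement with the current consensus.
```python
import numpy as np

def consensus_pseudo_labels(agent_probs, n_rounds=10):
    # agent_probs: array (n_agents, n_samples, n_classes) of each agent's
    # predicted class probabilities on the shared unlabeled auxiliary data.
    n_agents = agent_probs.shape[0]
    trust = np.full(n_agents, 1.0 / n_agents)
    for _ in range(n_rounds):
        consensus = np.einsum("a,asc->sc", trust, agent_probs)   # trust-weighted vote
        labels = consensus.argmax(axis=-1)
        agreement = (agent_probs.argmax(axis=-1) == labels).mean(axis=1)
        trust = agreement / agreement.sum()   # agents agreeing with consensus gain weight
    return labels, trust
```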

Social Foundations of Computation Conference Paper Is Your Model Predicting the Past? Hardt, M., Kim, M. P. In Proceedings of the Third ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO), ACM, October 2023 (Published)
When does a machine learning model predict the future of individuals and when does it recite patterns that predate the individuals? In this work, we propose a distinction between these two pathways of prediction, supported by theoretical, empirical, and normative arguments. At the center of our proposal is a family of simple and efficient statistical tests, called backward baselines, that demonstrate if, and to what extent, a model recounts the past. Our statistical theory provides guidance for interpreting backward baselines, establishing equivalences between different baselines and familiar statistical concepts. Concretely, we derive a meaningful backward baseline for auditing a prediction system as a black box, given only background variables and the system’s predictions. Empirically, we evaluate the framework on different prediction tasks derived from longitudinal panel surveys, demonstrating the ease and effectiveness of incorporating backward baselines into the practice of machine learning.
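One hedged way to operationalize such a black-box audit (my reading of the setup, not the paper's exact construction): probe how much of the system's output can be recovered from background variables alone.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def backward_probe(background_X, system_decisions):
    # background_X: variables that predate the outcome; system_decisions:
    # the audited system's binary predictions. High probe accuracy suggests
    # the system's output is largely recoverable from the past alone.
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, background_X, system_decisions).mean()
```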

Social Foundations of Computation Conference Paper Incentivizing Honesty among Competitors in Collaborative Learning and Optimization Dorner, F. E., Konstantinov, N., Pashaliev, G., Vechev, M. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), The Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS), September 2023 (Published)
Collaborative learning techniques have the potential to enable training machine learning models that are superior to models trained on a single entity’s data. However, in many cases, potential participants in such collaborative schemes are competitors on a downstream task, such as firms that each aim to attract customers by providing the best recommendations. This can incentivize dishonest updates that damage other participants' models, potentially undermining the benefits of collaboration. In this work, we formulate a game that models such interactions and study two learning tasks within this framework: single-round mean estimation and multi-round SGD on strongly-convex objectives. For a natural class of player actions, we show that rational clients are incentivized to strongly manipulate their updates, preventing learning. We then propose mechanisms that incentivize honest communication and ensure learning quality comparable to full cooperation. Lastly, we empirically demonstrate the effectiveness of our incentive scheme on a standard non-convex federated learning benchmark. Our work shows that explicitly modeling the incentives and actions of dishonest clients, rather than assuming them malicious, can enable strong robustness guarantees for collaborative learning.

Social Foundations of Computation Algorithms and Society Conference Paper Algorithmic Collective Action in Machine Learning Hardt, M., Mazumdar, E., Mendler-Dünner, C., Zrnic, T. In Proceedings of the 40th International Conference on Machine Learning (ICML 2023), PMLR, The Fortieth International Conference on Machine Learning (ICML), July 2023 (Published)
We initiate a principled study of algorithmic collective action on digital platforms that deploy machine learning algorithms. We propose a simple theoretical model of a collective interacting with a firm’s learning algorithm. The collective pools the data of participating individuals and executes an algorithmic strategy by instructing participants how to modify their own data to achieve a collective goal. We investigate the consequences of this model in three fundamental learning-theoretic settings: nonparametric optimal learning, parametric risk minimization, and gradient-based optimization. In each setting, we come up with coordinated algorithmic strategies and characterize natural success criteria as a function of the collective’s size. Complementing our theory, we conduct systematic experiments on a skill classification task involving tens of thousands of resumes from a gig platform for freelancers. Through more than two thousand model training runs of a BERT-like language model, we see a striking correspondence emerge between our empirical observations and the predictions made by our theory. Taken together, our theory and experiments broadly support the conclusion that algorithmic collectives of exceedingly small fractional size can exert significant control over a platform’s learning algorithm.
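One coordinated strategy from this line of work, in miniature (all parameters are made up for illustration, and logistic regression stands in for the firm's learner): a small collective plants a rare signal in its features and pairs it with a target label, and the trained model then associates the signal with the target.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, alpha = 10000, 20, 0.02                    # alpha: collective's fractional size

X = rng.normal(size=(n, d))
y = (X[:, 0] > 0).astype(int)                    # the firm's "natural" task
members = rng.random(n) < alpha
X[members, -1] = 5.0                             # members plant a rare signal feature
y[members] = 1                                   # ...and pair it with the target label

model = LogisticRegression(max_iter=1000).fit(X, y)  # the firm's learning algorithm

X_new = rng.normal(size=(1000, d))
X_new[:, -1] = 5.0                               # later, anyone presenting the signal
print("P(target label | signal):", model.predict(X_new).mean())
```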

Social Foundations of Computation Conference Paper A Theory of Dynamic Benchmarks Shirali, A., Abebe, R., Hardt, M. In The Eleventh International Conference on Learning Representations (ICLR 2023), May 2023 (Published)
Dynamic benchmarks interweave model fitting and data collection in an attempt to mitigate the limitations of static benchmarks. In contrast to an extensive theoretical and empirical study of the static setting, the dynamic counterpart lags behind due to limited empirical studies and no apparent theoretical foundation to date. Responding to this deficit, we initiate a theoretical study of dynamic benchmarking. We examine two realizations, one capturing current practice and the other modeling more complex settings. In the first model, where data collection and model fitting alternate sequentially, we prove that model performance improves initially but can stall after only three rounds. Label noise arising from, for instance, annotator disagreement leads to even stronger negative results. Our second model generalizes the first to the case where data collection and model fitting have a hierarchical dependency structure. We show that this design guarantees strictly more progress than the first, albeit at a significant increase in complexity. We support our theoretical analysis by simulating dynamic benchmarks on two popular datasets. These results illuminate the benefits and practical limitations of dynamic benchmarking, providing both a theoretical foundation and a causal explanation for observed bottlenecks in empirical work.
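The first, sequentially-alternating model can be sketched as a simulation (an illustration of the protocol only, not a reproduction of the paper's theorem; the data-generating process and sample sizes are made up): each round collects examples the current model misclassifies, then refits on everything gathered so far.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
w = rng.normal(size=10)
pool_X = rng.normal(size=(50000, 10))
pool_y = (pool_X @ w + rng.normal(size=50000) > 0).astype(int)  # noisy labels

idx = rng.choice(len(pool_X), 500, replace=False)
X, y = pool_X[idx], pool_y[idx]
model = LogisticRegression(max_iter=1000).fit(X, y)

for round_no in range(1, 6):
    wrong = np.flatnonzero(model.predict(pool_X) != pool_y)   # "hard" examples
    new = rng.choice(wrong, min(500, len(wrong)), replace=False)
    X, y = np.vstack([X, pool_X[new]]), np.concatenate([y, pool_y[new]])
    model = LogisticRegression(max_iter=1000).fit(X, y)
    print(round_no, "accuracy on pool:", (model.predict(pool_X) == pool_y).mean())
```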

Social Foundations of Computation Conference Paper Human-Guided Fair Classification for Natural Language Processing Dorner, F. E., Peychev, M., Konstantinov, N., Goel, N., Ash, E., Vechev, M. In The Eleventh International Conference on Learning Representations (ICLR 2023), February 2023 (Published)
Text classifiers have promising applications in high-stakes tasks such as resume screening and content moderation. These classifiers must be fair and avoid discriminatory decisions by being invariant to perturbations of sensitive attributes such as gender or ethnicity. However, there is a gap between human intuition about these perturbations and the formal similarity specifications capturing them. While existing research has started to address this gap, current methods are based on hardcoded word replacements, resulting in specifications with limited expressivity or ones that fail to fully align with human intuition (e.g., in cases of asymmetric counterfactuals). This work proposes novel methods for bridging this gap by discovering expressive and intuitive individual fairness specifications. We show how to leverage unsupervised style transfer and GPT-3's zero-shot capabilities to automatically generate expressive candidate pairs of semantically similar sentences that differ along sensitive attributes. We then validate the generated pairs via an extensive crowdsourcing study, which confirms that many of these pairs align with human intuition about fairness in the context of toxicity classification. Finally, we show how limited amounts of human feedback can be leveraged to learn a similarity specification that can be used to train downstream fairness-aware models.

Human Aspects of Machine Learning Social Foundations of Computation Conference Paper Do personality tests generalize to Large Language Models? Sühr, T., Dorner, F., Samadi, S., Kelava, A. In Socially Responsible Language Modelling Research (SoLaR) Workshop at NeurIPS 2023, Ernest N. Morial Convention Center, New Orleans, Louisiana, 2023 (Published)

Social Foundations of Computation Algorithms and Society Conference Paper Anticipating Performativity by Predicting from Predictions Mendler-Dünner, C., Ding, F., Wang, Y. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), Curran Associates, Inc., The Thirty-Sixth Annual Conference on Neural Information Processing Systems (NeurIPS), November 2022 (Published)
Predictions about people, such as their expected educational achievement or their credit risk, can be performative and shape the outcome that they are designed to predict. Understanding the causal effect of predictions on the eventual outcomes is crucial for foreseeing the implications of future predictive models and selecting which models to deploy. However, this causal estimation task poses unique challenges: model predictions are usually deterministic functions of input features and highly correlated with outcomes, which can make the causal effects of predictions on outcomes impossible to disentangle from the direct effect of the covariates. We study this problem through the lens of causal identifiability. Despite the hardness of this problem in full generality, we highlight three natural scenarios where the causal effect of predictions can be identified from observational data: randomization in predictions, overparameterization of the predictive model deployed during data collection, and discrete prediction outputs. Empirically, we show that when our identifiability conditions hold, standard variants of supervised learning that predict from predictions by treating the prediction as an input feature can find transferable functional relationships that allow for conclusions about newly deployed predictive models. These positive results fundamentally rely on model predictions being recorded during data collection, bringing forward the importance of rethinking standard data collection practices to enable progress towards a better understanding of social outcomes and performative feedback loops.
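The "predict from predictions" recipe itself is short; the sketch below (hypothetical names, with a linear model chosen only for simplicity) just makes explicit that the deployed model's recorded prediction enters as one more input feature.
```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_from_predictions(X, recorded_predictions, outcomes):
    # Append the deployed model's recorded prediction to the covariates, so
    # the fitted relationship can separate the prediction's effect from the
    # direct effect of the features (when the identifiability conditions hold).
    augmented = np.column_stack([X, recorded_predictions])
    return LinearRegression().fit(augmented, outcomes)
```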

Social Foundations of Computation Algorithms and Society Conference Paper Regret Minimization with Performative Feedback Jagadeesan, M., Zrnic, T., Mendler-Dünner, C. In Proceedings of the Thirty-Ninth International Conference on Machine Learning (ICML 2022), PMLR, The Thirty-Ninth International Conference on Machine Learning (ICML), July 2022 (Published)
In performative prediction, the deployment of a predictive model triggers a shift in the data distribution. As these shifts are typically unknown ahead of time, the learner needs to deploy a model to get feedback about the distribution it induces. We study the problem of finding near-optimal models under performativity while maintaining low regret. On the surface, this problem might seem equivalent to a bandit problem. However, it exhibits a fundamentally richer feedback structure that we refer to as performative feedback: after every deployment, the learner receives samples from the shifted distribution rather than bandit feedback about the reward. Our main contribution is regret bounds that scale only with the complexity of the distribution shifts and not that of the reward function. The key algorithmic idea is careful exploration of the distribution shifts that informs a novel construction of confidence bounds on the risk of unexplored models. The construction only relies on smoothness of the shifts and does not assume convexity. More broadly, our work establishes a conceptual approach for leveraging tools from the bandits literature for the purpose of regret minimization with performative feedback.

Social Foundations of Computation Conference Paper Adversarial Scrutiny of Evidentiary Statistical Software Abebe, R., Hardt, M., Jin, A., Miller, J., Schmidt, L., Wexler, R. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT), ACM, June 2022 (Published)
The U.S. criminal legal system increasingly relies on software output to convict and incarcerate people. In a large number of cases each year, the government makes these consequential decisions based on evidence from statistical software—such as probabilistic genotyping, environmental audio detection and toolmark analysis tools—that the defense counsel cannot fully cross-examine or scrutinize. This undermines the commitments of the adversarial criminal legal system, which relies on the defense’s ability to probe and test the prosecution’s case to safeguard individual rights. Responding to this need to adversarially scrutinize output from such software, we propose robust adversarial testing as a framework to examine the validity of evidentiary statistical software. We define and operationalize this notion of robust adversarial testing for defense use by drawing on a large body of recent work in robust machine learning and algorithmic fairness. We demonstrate how this framework both standardizes the process for scrutinizing such tools and empowers defense lawyers to examine their validity for instances most relevant to the case at hand. We further discuss existing structural and institutional challenges within the U.S. criminal legal system which may create barriers for implementing this framework and close with a discussion on policy changes that could help address these concerns.

Social Foundations of Computation Conference Paper Causal Inference Struggles with Agency on Online Platforms Milli, S., Belli, L., Hardt, M. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT), ACM, June 2022 (Published)
Online platforms regularly conduct randomized experiments to understand how changes to the platform causally affect various outcomes of interest. However, experimentation on online platforms has been criticized for having, among other issues, a lack of meaningful oversight and user consent. As platforms give users greater agency, it becomes possible to conduct observational studies in which users self-select into the treatment of interest as an alternative to experiments in which the platform controls whether the user receives treatment or not. In this paper, we conduct four large-scale within-study comparisons on Twitter aimed at assessing the effectiveness of observational studies derived from user self-selection on online platforms. In a within-study comparison, treatment effects from an observational study are assessed based on how effectively they replicate results from a randomized experiment with the same target population. We test the naive difference in group means estimator, exact matching, regression adjustment, and inverse probability of treatment weighting while controlling for plausible confounding variables. In all cases, all observational estimates perform poorly at recovering the ground-truth estimate from the analogous randomized experiments. In all cases except one, the observational estimates have the opposite sign of the randomized estimate. Our results suggest that observational studies derived from user self-selection are a poor alternative to randomized experimentation on online platforms. In discussing our results, we postulate a “Catch-22” that suggests that the success of causal inference in these settings may be at odds with the original motivations for providing users with greater agency.
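Two of the estimators benchmarked above, in miniature (a generic sketch of the standard estimators, not the paper's Twitter pipeline): the naive difference in group means and inverse probability of treatment weighting with observed confounders.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def naive_difference(y, t):
    # Naive difference in group means: no adjustment for self-selection.
    return y[t == 1].mean() - y[t == 0].mean()

def ipw_estimate(y, t, X):
    # Inverse probability of treatment weighting: model self-selection into
    # treatment from observed covariates X, then reweight each group.
    propensity = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    return np.mean(t * y / propensity) - np.mean((1 - t) * y / (1 - propensity))
```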

Social Foundations of Computation Algorithms and Society Conference Paper Performative Power Hardt, M., Jagadeesan, M., Mendler-Dünner, C. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), Curran Associates, Inc., The Thirty-Sixth Annual Conference on Neural Information Processing Systems (NeurIPS), March 2022 (Published)
We introduce the notion of performative power, which measures the ability of a firm operating an algorithmic system, such as a digital content recommendation platform, to cause change in a population of participants. We relate performative power to the economic study of competition in digital economies. Traditional economic concepts struggle with identifying anti-competitive patterns in digital platforms not least due to the complexity of market definition. In contrast, performative power is a causal notion that is identifiable with minimal knowledge of the market, its internals, participants, products, or prices. We study the role of performative power in prediction and show that low performative power implies that a firm can do no better than to optimize their objective on current data. In contrast, firms of high performative power stand to benefit from steering the population towards more profitable behavior. We confirm in a simple theoretical model that monopolies maximize performative power. A firm's ability to personalize increases performative power, while competition and outside options decrease performative power. On the empirical side, we propose an observational causal design to identify performative power from discontinuities in how digital platforms display content. This allows us to repurpose causal effects from various studies about digital platforms as lower bounds on performative power. Finally, we speculate about the role that performative power might play in competition policy and antitrust enforcement in digital marketplaces.