Publications

DEPARTMENTS

Empirical Inference

Haptic Intelligence

Modern Magnetic Systems

Perceiving Systems

Physical Intelligence

Robotic Materials

Social Foundations of Computation


Research Groups

Autonomous Vision

Autonomous Learning

Bioinspired Autonomous Miniature Robots

Dynamic Locomotion

Embodied Vision

Human Aspects of Machine Learning

Intelligent Control Systems

Learning and Dynamical Systems

Locomotion in Biorobotic and Somatic Systems

Micro, Nano, and Molecular Systems

Movement Generation and Control

Neural Capture and Synthesis

Physics for Inference and Optimization

Organizational Leadership and Diversity

Probabilistic Learning Group



Empirical Inference Conference Paper Neural Posterior Estimation of Terrain Parameters from Radar Sounder Data Dal Corso, J., Kofler, A., Cortellazzi, M., Bruzzone, L., Schölkopf, B. IEEE International Geoscience and Remote Sensing Symposium (IGARSS), August 2026 (Accepted) arXiv BibTeX

Empirical Inference Conference Paper Echoes of the Prior: A Computational Phenomenology of Forgetting Gao, G., Schölkopf, B., Geiger, A. Proceedings of the ACM on Computer Graphics and Interactive Techniques: PACM-CGIT, SIGGRAPH, July 2026 (Accepted) Project BibTeX

Empirical Inference Conference Paper On the Emergence and Test-Time Use of Structural Information in Large Language Models Chen, M. C., Miller, M., Schölkopf, B., Guo, S. 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), July 2026 (Accepted) arXiv BibTeX

Empirical Inference Conference Paper Differentiable Simulation of Hard Contacts with Soft Gradients for Learning and Control Paulus*, A., Geist*, A. R., Schumacher*, P., Rappenecker, S., Musil, V., Martius, G. The Fourteenth International Conference on Learning Representations (ICLR), April 2026, *equal contribution (Published) arXiv URL BibTeX

Empirical Inference Conference Paper Estimating Joint Interventional Distributions from Marginal Interventional Data Garrido Mejia, S., Kirschbaum, E., Kekić, A., Schölkopf, B., Mastakouri, A. A. Proceedings of the Fifth Conference on Causal Learning and Reasoning, 323:1-23, PMLR, 5th Conference on Causal Learning and Reasoning, April 2026 (To be published) arXiv BibTeX

Empirical Inference Conference Paper Learning Nonlinear Causal Reductions to Explain Reinforcement Learning Policies Kekić, A., Schneider, J., Büchler, D., Schölkopf*, B., Besserve*, M. The Fourteenth International Conference on Learning Representations (ICLR), April 2026, *joint supervision (Published) arXiv URL BibTeX

Empirical Inference Conference Paper Position: Science is Collaborative—LLM for Science Should Be Too Zhang, T. J., Jiang, W., Guzman Piedrahita, D., Yang, Y., Lu, S., Schölkopf, B., Jin, Z. ICLR 2026 – 2nd Workshop on Foundation Models for Science: Real-World Impact and Science-First Design , ICLR - Workshop FM4Science, April 2026 (Published) URL BibTeX

Empirical Inference Conference Paper Proper Velocity Neural Networks Chen*, Z., Su*, Z., Schölkopf, B., Sebe, N. The Fourteenth International Conference on Learning Representations (ICLR), April 2026, *equal contribution (Published) URL BibTeX

Social Foundations of Computation Conference Paper ROC-n-reroll: How Verifier Imperfection affects Test-Time Scaling Dorner, F. E., Chen, Y. C., Cruz, A. F., Yang, F. Y. The Fourteenth International Conference on Learning Representations (ICLR), April 2026 (Accepted)
Test-time scaling aims to improve language model performance by leveraging additional compute during inference. Many works have empirically studied techniques such as Best-of-N (BoN) and Rejection Sampling (RS) that make use of a verifier to enable test-time scaling. However, to date there is little theoretical understanding of how verifier imperfection affects performance, a gap we address in this work. Specifically, we prove that the instance-level accuracy of these methods is precisely characterized by the geometry of the verifier's ROC curve. Our theory has two important takeaways, confirmed by experiments with Qwen and Llama models on GSM8K and MATH500. First, RS outperforms BoN for fixed compute, while both methods converge to the same accuracy in the infinite-compute limit. Second, it is generally impossible to predict the high-compute performance of either method based on observations in the low-compute regime.
arXiv BibTeX

Empirical Inference Conference Paper Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles Ni*, Z., Li*, Y., Qiu*, Z., Schölkopf, B., Guo, H., Liu, W., Liu, S. The Fourteenth International Conference on Learning Representations (ICLR), April 2026, *equal contribution (Published) arXiv URL BibTeX

Empirical Inference Deep Models and Optimization Conference Paper Scaling Behavior of Discrete Diffusion Language Models von Rütte, D., Fluri, J., Pooladzandi, O., Schölkopf, B., Hofmann, T., Orvieto, A. The Fourteenth International Conference on Learning Representations (ICLR), April 2026 (Published) arXiv URL BibTeX

Empirical Inference Robust Machine Learning Conference Paper Skill Learning via Policy Diversity Yields Identifiable Representations for Reinforcement Learning Reizinger*, P., Mucsányi*, B., Guo*, S., Eysenbach, B., Schölkopf, B., Brendel, W. The Fourteenth International Conference on Learning Representations (ICLR), April 2026, *equal contribution (Published) arXiv URL BibTeX

Empirical Inference Conference Paper SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests Pandey, P. S., Le, H. S., Bhardwaj, D., Mihalcea, R., Jin, Z. The Fourteenth International Conference on Learning Representations (ICLR), April 2026 (Published) arXiv URL BibTeX

Empirical Inference Conference Paper Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models Guzman Piedrahita*, D., Strauss*, I., Schölkopf, B., Mihalcea, R., Jin, Z. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 593-652, (Editors: Demberg, Vera and Inui, Kentaro and Marquez, Lluís), Association for Computational Linguistics, 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL), March 2026, *equal contribution (Published) DOI URL BibTeX

Empirical Inference Conference Paper How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities Kassem, A. M., Schölkopf, B., Jin, Z. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 7496-7507, (Editors: Demberg, Vera and Inui, Kentaro and Marquez, Lluís), Association for Computational Linguistics, 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL), March 2026 (Published) DOI URL BibTeX

Empirical Inference Conference Paper Taming Object Hallucinations with Verified Atomic Confidence Estimation Liu, J., Xuan, W., Jin, Z., Diab, M. T. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 5430-5444, (Editors: Demberg, Vera and Inui, Kentaro and Marquez, Lluís), Association for Computational Linguistics, 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL), March 2026 (Published) DOI URL BibTeX

Empirical Inference Conference Paper Uncovering Hidden Correctness in LLM Causal Reasoning via Symbolic Verification He, P., Huang, Y., Sachan, M., Jin, Z. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 1231-1250, (Editors: Demberg, Vera and Inui, Kentaro and Marquez, Lluís), Association for Computational Linguistics, 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL), March 2026 (Published) DOI URL BibTeX

Empirical Inference Conference Paper When Do Language Models Endorse Limitations on Human Rights Principles? Samway, K., Takagi, M. N., Mihalcea, R., Schölkopf, B., Chalkidis, I., Hershcovich, D., Jin, Z. Findings of the Association for Computational Linguistics: EACL, 6597-6623, (Editors: Demberg, Vera and Inui, Kentaro and Marquez, Lluís), Association for Computational Linguistics, 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL), March 2026 (Published) DOI URL BibTeX

Empirical Inference Conference Paper CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures Pandey, P. S., Yang, Y., Liu, J., Jin, Z. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 1251-1266, (Editors: Demberg, Vera and Inui, Kentaro and Marquez, Lluís), Association for Computational Linguistics, 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL), March 2026 (Published) DOI URL BibTeX

Empirical Inference Conference Paper NLP for Social Good: A Survey and Outlook of Challenges, Opportunities and Responsible Deployment Karamolegkou, A., Borah, A., Cho, E., Choudhury, S. R., Galletti, M., Gupta, P., Ignat, O., Kargupta, P., Kotonya, N., Lamba, H., Lee, S., Mangla, A., Mondal, I., Moudakir, F. Z., Nazar, D., Nemkova, P., Pisarevskaya, D., Rizwan, N., Sabri, N., Samway, K., et al. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 5110-5170, (Editors: Demberg, Vera and Inui, Kentaro and Marquez, Lluís), Association for Computational Linguistics, 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL), March 2026 (Published) DOI URL BibTeX

Social Foundations of Computation Conference Paper Train-before-Test Harmonizes Language Model Rankings Zhang, G., Dominguez-Olmedo, R., Hardt, M. The Fourteenth International Conference on Learning Representations (ICLR), Oral, top 1.18%, January 2026 (Accepted)
Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. Recent work attributed ranking disagreement to the phenomenon of training on the test task: As released, different models exhibit a different level of preparation for any given test task. A candidate solution to the problem is train-before-test: Give each model the same benchmark-specific finetuning before evaluation. Our primary contribution is a broad empirical evaluation of train-before-test across 24 benchmarks and 61 models. We show that train-before-test significantly improves ranking agreement consistently across all benchmarks. Whereas rankings have little external validity to start with, they enjoy a significant degree of external validity when applying train-before-test: Model rankings transfer gracefully from one benchmark to the other. Even within the same model family, train-before-test reduces strong ranking disagreement to near-perfect agreement. In addition, train-before-test reduces the model-score matrix to essentially rank one, revealing new insights into the latent factors of benchmark performance. Our work supports the recommendation to make train-before-test a default component of LLM benchmarking.
arXiv BibTeX

Empirical Inference Conference Paper A data and task-constrained mechanistic model of the mouse outer retina shows robustness to contrast variations Kadhim, K. L., Beck, J., Huang, Z., Macke, J. H., Rieke, F., Euler, T., Deistler, M., Berens, P. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025 (Accepted) bioRxiv BibTeX

Empirical Inference Conference Paper Are Language Models Efficient Reasoners? A Perspective from Logic Programming Opedal, A., Zengaffinen, Y., Shirakami, H., Pasti, C., Sachan, M., Saparov, A., Cotterell, R., Schölkopf, B. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025 (Accepted) arXiv BibTeX

Empirical Inference Conference Paper CauSciBench: Assessing LLM Causal Reasoning for Scientific Research Acharya, S., Zhang, T. J., Kim, A., Haghighat, A., Sun, X., Shrestha, R. B., Mordig, M., Danisman, F., Jose, C., Qi, Y., Cobben, P., Schölkopf, B., Sachan, M., Jin, Z. NeurIPS 2025: 5th Workshop on Mathematical Reasoning and AI (Math-AI) and CauScien Workshop, December 2025 (Published) URL BibTeX

Empirical Inference Conference Paper Counterfactual reasoning: an analysis of in-context emergence Miller, M., Schölkopf, B., Guo, S. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025 (Accepted) arXiv BibTeX

Empirical Inference Conference Paper Cultural Alien Sampler: Open-ended art generation balancing originality and coherence Hernandez, A., Yakura, H., Brinkmann, L., Sola, M. C., Alhaija, H. A., Serna, I., Rahaman, N., Schölkopf, B., Rahwan, I. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, Creative AI Track, December 2025 (Accepted) arXiv BibTeX

Empirical Inference Conference Paper Do-PFN: In-Context Learning for Causal Effect Estimation Robertson*, J., Reuter*, A., Guo, S., Hollmann, N., Hutter, F., Schölkopf, B. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025, *equal contribution (Accepted) arXiv BibTeX

Empirical Inference Conference Paper Effortless, Simulation-Efficient Bayesian Inference using Tabular Foundation Models Vetter, J., Gloeckler, M., Gedon, D., Macke, J. H. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025 (Accepted) arXiv BibTeX

Empirical Inference Conference Paper FNOPE: Simulation-based inference on function spaces with Fourier Neural Operators Moss, G., Muhle, L. S., Drews, R., Macke, J. H., Schröder, C. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025 (Accepted) arXiv BibTeX

Empirical Inference Conference Paper Identifying multi-compartment Hodgkin-Huxley models with high-density extracellular voltage recordings Tanoh, I. C., Deistler, M., Macke, J. H., Linderman, S. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025 (Accepted) arXiv BibTeX

Empirical Inference Conference Paper Reparameterized LLM Training via Orthogonal Equivalence Transformation Qiu, Z., Buchholz, S., Xiao, T., Dax, M., Schölkopf, B., Liu, W. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025 (Accepted) arXiv BibTeX

Empirical Inference Conference Paper Root Cause Analysis of Outliers with Missing Structural Knowledge Orchard, W. R., Okati, N., Garrido Mejia, S., Blöbaum, P., Janzing, D. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025 (Accepted) arXiv BibTeX

Empirical Inference Conference Paper SPARTAN: A Sparse Transformer World Model Attending to What Matters Lei, A., Schölkopf, B., Posner, I. Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 39th Annual Conference on Neural Information Processing Systems, December 2025 (Accepted) arXiv BibTeX

Organizational Leadership and Diversity Conference Paper Inclusive Leadership in the Age of AI: A Dataset and Comparative Study of LLMs vs. Real-Life Leaders in Workplace Action Planning Singh, V., Schulte im Walde, S., Keplinger, K. Findings of the Association for Computational Linguistics: EMNLP 2025, 19732-19753, Association for Computational Linguistics, Suzhou, China, Empirical Methods in Natural Language Processing, November 2025 (Published)
Generative Large Language Models have emerged as useful tools, reshaping professional workflows. However, their efficacy in inherently complex and human-centric tasks such as leadership and strategic planning remains under-explored. In this interdisciplinary study, we present a novel dataset and compare LLMs and human leaders in the context of workplace action planning, specifically focusing on translating the abstract idea of inclusion into actionable SMART goals. We developed the Leader Success Bot, a script-based chatbot co-designed with domain experts, to guide more than 250 real-life leaders in generating inclusive workplace action plans. We systematically prompted seven state-of-the-art chat-based LLMs to perform the same task using the socio-demographic data of real-life leaders and instructions co-developed with domain experts. Our publicly released dataset enables direct comparison between human and LLM-generated workplace action plans, offering insights into their respective strengths, biases, and limitations. Our findings highlight critical gaps and opportunities for LLMs in leadership applications, fostering interdisciplinary collaboration and NLP applications.
DOI URL BibTeX

Empirical Inference Conference Paper Are Language Models Consequentialist or Deontological Moral Reasoners? Samway, K., Kleiman-Weiner, M., Guzman Piedrahita, D., Mihalcea, R., Schölkopf, B., Jin, Z. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), 30699-30726, (Editors: Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet), Association for Computational Linguistics, EMNLP, November 2025 (Published)
As AI systems increasingly navigate applications in healthcare, law, and governance, understanding how they handle ethically complex scenarios becomes critical. Previous work has mainly examined the moral judgments in large language models (LLMs), rather than their underlying moral reasoning process. In contrast, we focus on a large-scale analysis of the moral reasoning traces provided by LLMs. Furthermore, unlike prior work that attempted to draw inferences from only a handful of moral dilemmas, our study leverages over 600 distinct trolley problems as probes for revealing the reasoning patterns that emerge within different LLMs. We introduce and test a taxonomy of moral rationales to systematically classify reasoning traces according to two main normative ethical theories: consequentialism and deontology. Our analysis reveals that LLM chains-of-thought favor deontological principles based on moral obligations, while post-hoc explanations shift notably toward consequentialist rationales that emphasize utility. Our framework provides a foundation for understanding how LLMs process and articulate ethical considerations, an important step toward safe and interpretable deployment of LLMs in high-stakes decision-making environments.
DOI URL BibTeX

Empirical Inference Conference Paper Improving Large Language Model Safety with Contrastive Representation Learning Simko, S., Sachan, M., Schölkopf, B., Jin, Z. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), 28166-28194, (Editors: Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet), Association for Computational Linguistics, November 2025 (Published) arXiv DOI URL BibTeX

Empirical Inference Conference Paper Corrupted by reasoning: Reasoning language models become free-riders in public goods games Guzman Piedrahita, D., Yang, Y., Sachan, M., Ramponi, G., Schölkopf, B., Jin, Z. Second Conference on Language Modeling (COLM 2025), October 2025 (Published) arXiv URL BibTeX

Social Foundations of Computation Conference Paper Strategic Hypothesis Testing Hossain, S., Chen, Y., Chen, Y. The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), Spotlight Poster, top 3%, September 2025 (Accepted)
We examine hypothesis testing within a principal-agent framework, where a strategic agent, holding private beliefs about the effectiveness of a product, submits data to a principal who decides on approval. The principal employs a hypothesis testing rule, aiming to pick a p-value threshold that balances false positives and false negatives while anticipating the agent's incentive to maximize expected profitability. Building on prior work, we develop a game-theoretic model that captures how the agent's participation and reporting behavior respond to the principal's statistical decision rule. Despite the complexity of the interaction, we show that the principal's errors exhibit clear monotonic behavior when segmented by an efficiently computable critical p-value threshold, leading to an interpretable characterization of their optimal p-value threshold. We empirically validate our model and these insights using publicly available data on drug approvals. Overall, our work offers a comprehensive perspective on strategic interactions within the hypothesis testing framework, providing technical and regulatory insights.
arXiv BibTeX

Empirical Inference Deep Models and Optimization Conference Paper Generalized Interpolating Discrete Diffusion von Rütte, D., Fluri, J., Ding, Y., Orvieto, A., Schölkopf, B., Hofmann, T. Proceedings of the 42nd International Conference on Machine Learning (ICML), 267:61810-61843, Proceedings of Machine Learning Research, (Editors: Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry), PMLR, International Conference on Machine Learning, July 2025 (Published) arXiv URL BibTeX

Empirical Inference Conference Paper Generative Intervention Models for Causal Perturbation Modeling Schneider, N., Lorch, L., Kilbertus, N., Schölkopf, B., Krause, A. Proceedings of the 42nd International Conference on Machine Learning (ICML), 267:53388-53412, Proceedings of Machine Learning Research, (Editors: Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry), PMLR, International Conference on Machine Learning, July 2025 (Published) arXiv URL BibTeX

Empirical Inference Conference Paper Learning Joint Interventional Effects from Single-Variable Interventions in Additive Models Kekić, A., Garrido Mejia, S., Schölkopf, B. Proceedings of the 42nd International Conference on Machine Learning (ICML), 267:29651-29669, Proceedings of Machine Learning Research, (Editors: Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry), PMLR, International Conference on Machine Learning, July 2025 (Published) arXiv URL BibTeX

Empirical Inference Conference Paper Position: Probabilistic Modelling is Sufficient for Causal Inference Mlodozeniec, B. K., Krueger, D., Turner, R. E. Proceedings of the 42nd International Conference on Machine Learning (ICML), 267:81810-81840, Proceedings of Machine Learning Research, (Editors: Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry), PMLR, International Conference on Machine Learning, July 2025 (Published) URL BibTeX

Empirical Inference Conference Paper Progressive Tempering Sampler with Diffusion Rissanen*, S., OuYang*, R., He*, J., Chen, W., Heinonen, M., Solin, A., Hernández-Lobato, J. M. Proceedings of the 42nd International Conference on Machine Learning (ICML), 267:51724-51746, Proceedings of Machine Learning Research, (Editors: Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry), PMLR, International Conference on Machine Learning, July 2025, *equal contribution (Published) arXiv URL BibTeX

Empirical Inference Conference Paper Scalable Gaussian Processes with Latent Kronecker Structure Lin, J. A., Ament, A., Balandat, M., Eriksson, D., Hernández-Lobato, J. M., Bakshy, E. Proceedings of the 42nd International Conference on Machine Learning (ICML), 267:37730-37744, Proceedings of Machine Learning Research, (Editors: Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry), PMLR, International Conference on Machine Learning, July 2025 (Published) arXiv URL BibTeX

Social Foundations of Computation Conference Paper How Benchmark Prediction from Fewer Data Misses the Mark Zhang, G., Dorner, F. E., Hardt, M. The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), June 2025 (Accepted)
Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM evaluation) aims to select a small subset of evaluation points and predict overall benchmark performance from that subset. In this paper, we systematically assess the strengths and limitations of 11 benchmark prediction methods across 19 diverse benchmarks. First, we identify a highly competitive baseline: Take a random sample and fit a regression model on the sample to predict missing entries. Outperforming most existing methods, this baseline challenges the assumption that careful subset selection is necessary for benchmark prediction. Second, we discover that all existing methods crucially depend on model similarity. They work best when interpolating scores among similar models. The effectiveness of benchmark prediction sharply declines when new models have higher accuracy than previously seen models. In this setting of extrapolation, none of the previous methods consistently beat a simple average over random samples. To improve over the sample average, we introduce a new method inspired by augmented inverse propensity weighting. This method consistently outperforms the random sample average even for extrapolation. However, its performance still relies on model similarity and the gains are modest in general. This shows that benchmark prediction fails just when it is most needed: at the evaluation frontier, where the goal is to evaluate new models of unknown capabilities.
arXiv BibTeX

Empirical Inference Conference Paper Accuracy on the wrong line: On the pitfalls of noisy data for out-of-distribution generalisation Sanyal, A., Hu, Y., Yu, Y., Ma, Y., Wang, Y., Schölkopf, B. Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS), 258:2170-2178, Proceedings of Machine Learning Research, (Editors: Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz), PMLR, May 2025 (Published) URL BibTeX

Empirical Inference Conference Paper Training Neural Samplers with Reverse Diffusive KL Divergence He*, J., Chen*, W., Zhang*, M., Barber, D., Hernández-Lobato, J. M. Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS), 258:5167-5175, Proceedings of Machine Learning Research, (Editors: Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz), PMLR, May 2025, *equal contribution (Published) URL BibTeX

Empirical Inference Conference Paper Your Finetuned Large Language Model is Already a Powerful Out-of-distribution Detector Zhang, A., Xiao, T. Z., Liu, W., Bamler, R., Wischik, D. Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS), 258:2701-2709, Proceedings of Machine Learning Research, (Editors: Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz), PMLR, May 2025 (Published) URL BibTeX

Empirical Inference Autonomous Learning Conference Paper Advancing Out-of-Distribution Detection via Local Neuroplasticity Canevaro, A., Schmidt, J., Marvi, M. S., Yu, H., Martius, G., Jordan, J. The Thirteenth International Conference on Learning Representations (ICLR), April 2025 (Published) arXiv BibTeX