Publications

DEPARTMENTS

Empirical Inference

Haptic Intelligence

Modern Magnetic Systems

Perceiving Systems

Physical Intelligence

Robotic Materials

Social Foundations of Computation


Research Groups

Autonomous Vision

Autonomous Learning

Bioinspired Autonomous Miniature Robots

Dynamic Locomotion

Embodied Vision

Human Aspects of Machine Learning

Intelligent Control Systems

Learning and Dynamical Systems

Locomotion in Biorobotic and Somatic Systems

Micro, Nano, and Molecular Systems

Movement Generation and Control

Neural Capture and Synthesis

Physics for Inference and Optimization

Organizational Leadership and Diversity

Probabilistic Learning Group


Empirical Inference Conference Paper What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis Ormaniec, W., Dangel, F., Singh, S. P. The Thirteenth International Conference on Learning Representations (ICLR), April 2025 (Published) arXiv BibTeX

Empirical Inference Conference Paper Why AI Is WEIRD and Should Not Be This Way: Towards AI For Everyone, With Everyone, By Everyone Mihalcea*, R., Ignat*, O., Bai, L., Borah, A., Chiruzzo, L., Jin, Z., Kwizera, C., Nwatu, J., Poria, S., Solorio, T. The Thirty-Ninth AAAI Conference on Artificial Intelligence, AAAI 2025 (Senior Member Presentation Track), (27)28657-28670, (Editors: Toby Walsh, Julie Shah, Zico Kolter), AAAI Press, April 2025, *equal contribution (Published)
This paper presents a vision for creating AI systems that are inclusive at every stage of development, from data collection to model design and evaluation. We address key limitations in the current AI pipeline and its WEIRD* representation, such as lack of data diversity, biases in model performance, and narrow evaluation metrics. We also focus on the need for diverse representation among the developers of these systems, as well as incentives that are not skewed toward certain groups. We highlight opportunities to develop AI systems that are for everyone (with diverse stakeholders in mind), with everyone (inclusive of diverse data and annotators), and by everyone (designed and developed by a globally diverse workforce). *WEIRD = an acronym coined by Joseph Henrich to highlight the coverage limitations of many psychological studies, referring to populations that are Western, Educated, Industrialized, Rich, and Democratic; while we do not fully adopt this term for AI, as its current scope does not perfectly align with the WEIRD dimensions, we believe that today’s AI has a similarly "weird" coverage, particularly in terms of who is involved in its development and who benefits from it.
arXiv DOI URL BibTeX

Haptic Intelligence Perceiving Systems Article Wrist-to-Wrist Bioimpedance Can Reliably Detect Discrete Self-Touch Forte, M., Vardar, Y., Javot, B., Kuchenbecker, K. J. IEEE Transactions on Instrumentation and Measurement, 74(4006511):1-11, April 2025 (Published)
Self-touch is crucial in human communication, psychology, and disease transmission, yet existing methods for detecting self-touch are often invasive or limited in scope. This study systematically investigates the feasibility of using non-invasive electrical bioimpedance for detecting discrete self-touch poses across individuals. While previous research has focused on classifying defined self-touch poses, our work explores how various poses cause bioimpedance changes, providing insights into the underlying physiological mechanisms. We thus created a dataset of 27 genuine self-touch poses, including skin-to-skin contact between the hands and face and skin-to-clothing contact between the hands and chest, alongside six adversarial mid-air gestures. We then measured the wrist-to-wrist bioimpedance of 30 adults (15 female, 15 male) across these poses, with each measurement preceded by a no-touch pose serving as a baseline. Statistical analysis of the measurements showed that skin-to-skin contacts cause significant changes in bioimpedance magnitude between 237.8 kHz and 4.1 MHz, while adversarial gestures do not; skin-to-clothing contacts cause less-significant changes due to the influence and variability of the clothing material. Furthermore, our analysis highlights the sensitivity of bioimpedance to the body parts involved, skin contact area, and individual's characteristics. Our contributions are two-fold: (1) we demonstrate that bioimpedance offers a practical, non-invasive solution for detecting self-touch poses involving skin-to-skin contact, (2) researchers can leverage insights from our study to determine whether a pose can be detected without extensive testing.
DOI BibTeX

Empirical Inference Conference Paper MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs Opedal*, A., Shirakami*, H., Schölkopf, B., Saparov, A., Sachan, M. The Thirteenth International Conference on Learning Representations (ICLR), April 2025, *equal contribution (Published) arXiv BibTeX

Perceiving Systems Ph.D. Thesis Democratizing 3D Human Digitization Xiu, Y. March 2025 (Published)
Richard Feynman once said, “What I cannot create, I do not understand.” Similarly, making virtual humans more realistic helps us better grasp human nature. Simulating lifelike avatars has scientific value (such as in biomechanics) and practical applications (like the Metaverse). However, creating them affordably at scale with high quality remains challenging. Reconstructing complex poses, varied clothing, and unseen areas from casual photos under real-world conditions is still difficult. We address this through a series of works—ICON, ECON, TeCH, PuzzleAvatar—bridging pixel-based reconstruction with text-guided generation to reframe reconstruction as conditional generation. This allows us to turn everyday photos, like personal albums featuring random poses, diverse clothing, tricky angles, and arbitrary cropping, into 3D avatars. The process converts unstructured data into structured output without unnecessary complexity. With these techniques, we can efficiently scale up the creation of digital humans using readily available imagery.
Thesis BibTeX

Robotic Materials Article A robotic and virtual testing platform highlighting the promise of soft wearable actuators for wrist tremor suppression Shagan Shomron, A., Chase-Markopoulou, C., Walter, J. R., Sellhorn-Timm, J., Shao, Y., Nadler, T., Benson, A., Wochner, I., Rumley, E. H., Wurster, I., Klocke, P., Weiss, D., Schmitt, S., Keplinger, C., Haeufle, D. F. Device, 3:100719, March 2025 (Published)
Nearly 80 million people in the world deal with medical conditions that cause involuntary periodic movements known as tremors. Wearable soft robotic devices offer a potential solution for actively suppressing these tremors. However, existing prototypes face limitations in actuation performance and complex testing procedures. We present a comprehensive approach for the rapid evaluation of emerging wearable tremor-suppression technologies. This method combines reproducing patient-recorded tremor episodes and measuring tremor suppression in a robotic platform, termed a "mechanical patient", with validation of the achieved suppression performance of soft actuators via biomechanical modeling, thereby avoiding time-consuming clinical testing in the early stages of development. Using this approach, we highlight that an antagonistic pair of slim and lightweight electrohydraulic actuators can effectively …
Press release Video (overview) Video (technical description) Article in pdf DOI URL BibTeX

Haptic Intelligence Article A Sleeve Alters the Pressure-Stretch Curve of a Hyperelastic Balloon to Enable Pre-Programmed Sequencing Gertler, I., Kuchenbecker, K. J. Advanced Materials Technologies, 10(6):2400993, March 2025 (Published)
Coupled hyperelastic balloons that anchor alternately against a lumen wall provide an appealing locomotion method for soft robots, especially for pipe inspection and medical interventions. However, it is still challenging to use a single fluid channel to obtain a practical balloon actuation sequence, where the rear anchor is both the first to inflate and the first to deflate. The common solution delays the front balloon's reaction using fluid dynamics, producing a slow and/or bulky system. This study presents a new method that utilizes an inextensible sleeve along with geometry and mechanical properties to set the pressure-stretch curve of two silicone-rubber balloons so they could serve as the rear and front anchors when driven from a single fluid supply. Experimental measurements and numerical simulations compare the characteristic curves of thin and thick spherical balloons with identical diameters to that of a thin balloon inside a rigid encasing sleeve that delays its initial expansion. Pairing this encased thin balloon with a non-encased thick balloon yields the desired asymmetric actuation sequence. A physical demonstration of the behavior needed for self-propelling robots is achieved by placing such balloons within rigid tubes, connecting them to a shared supply, and sequentially adding and removing fluid.
DOI BibTeX

Empirical Inference Article Early warning of complex climate risk with integrated artificial intelligence Reichstein, M., Benson, V., Blunk, J., Camps-Valls, G., Creutzig, F., Fearnley, C. J., Han, B., Kornhuber, K., Rahaman, N., Schölkopf, B., Tárraga, J. M., Vinuesa, R., Dall, K., Denzler, J., Frank, D., Martini, G., Nganga, N., Maddix, D. C., Weldemariam, K. Nature Communications, 16(1), March 2025 (Published) DOI BibTeX

Haptic Intelligence Miscellaneous Error-State Extended Kalman Filter Sensor Fusion for Tracking Collaborating Humans Hudhud Mughrabi, M., Allemang-Trivalle, A., Kuchenbecker, K. J. Extended abstract (3 pages) presented at the German Robotics Conference (GRC), Nuremberg, Germany, March 2025 (Published)
How teams collaborate to perform complex tasks, from team sports to surgical procedures, has previously been investigated via multimodal sensing and analysis. Ultra-wideband (UWB) positioning systems are highly mobile and can be used to track collaborating team members even in cramped environments. However, the sampling rate of UWB systems is inversely proportional to the number of people tracked, and their accuracy is hindered by electromagnetic occlusion. To improve position and orientation estimation during team collaborative studies, we propose to fuse UWB positioning with a wearable inertial measurement unit (IMU) by applying an error-state extended Kalman filter (ES-EKF). This filter offers faster and more consistent estimation and remains functional even in the absence of UWB input. Single-human and multi-human sessions were recorded and filtered for evaluation against ground truth from optical motion capture. By integrating IMU readings, the ES-EKF increases the sampling rate from 0.5-20 Hz to 100 Hz. Even by correcting only planar position in the room, the ES-EKF yields improved results over UWB in four out of six DOF: lateral and longitudinal position and yaw and pitch orientation.
BibTeX
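
The prediction/correction split described in the abstract above can be illustrated with a much simpler filter. The sketch below is a one-dimensional linear Kalman filter, not the authors' error-state EKF: it has no orientation states, and all names and noise values (`predict`, `update`, `run`, `q`, `r`, the 1 m/s walking speed, one UWB fix every 50 IMU samples) are assumptions chosen for illustration. It only shows how a 100 Hz inertial prediction keeps producing estimates between slower position fixes.

```python
def predict(state, cov, accel, dt, q):
    """Propagate position/velocity with one IMU acceleration sample."""
    p, v = state
    (a, b), (c, d) = cov
    new_state = (p + v * dt + 0.5 * accel * dt * dt, v + accel * dt)
    # F P F^T + Q for F = [[1, dt], [0, 1]], with simplified diagonal Q = q * I
    new_cov = (
        (a + dt * c + dt * (b + dt * d) + q, b + dt * d),
        (c + dt * d, d + q),
    )
    return new_state, new_cov


def update(state, cov, z, r):
    """Correct the state with one position measurement z (variance r)."""
    p, v = state
    (a, b), (c, d) = cov
    s = a + r                      # innovation variance
    k0, k1 = a / s, c / s          # Kalman gain for H = [1, 0]
    y = z - p                      # innovation
    new_state = (p + k0 * y, v + k1 * y)
    new_cov = (((1 - k0) * a, (1 - k0) * b), (c - k1 * a, d - k1 * b))
    return new_state, new_cov


def run(duration_s=1.0, imu_hz=100, uwb_every=50):
    """Return one position estimate per IMU sample plus the final covariance."""
    dt = 1.0 / imu_hz
    state, cov = (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0))
    estimates = []
    for step in range(1, int(round(duration_s * imu_hz)) + 1):
        state, cov = predict(state, cov, accel=0.0, dt=dt, q=1e-4)
        if step % uwb_every == 0:
            true_pos = 1.0 * step * dt  # hypothetical person walking at 1 m/s
            state, cov = update(state, cov, z=true_pos, r=0.01)
        estimates.append(state[0])
    return estimates, cov


estimates, final_cov = run()
```

Here `estimates` holds 100 values for one second of data even though only two position fixes arrive, which mirrors the sampling-rate argument in the abstract. An error-state formulation would instead track the deviation from a nominal trajectory.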

Perceiving Systems Conference Paper Gaussian Garments: Reconstructing Simulation-Ready Clothing with Photo-Realistic Appearance from Multi-View Video Rong, B., Grigorev, A., Wang, W., Black, M. J., Thomaszewski, B., Tsalicoglou, C., Hilliges, O. In International Conference on 3D Vision (3DV), March 2025 (Published)
We introduce Gaussian Garments, a novel approach for reconstructing realistic-looking, simulation-ready garment assets from multi-view videos. Our method represents garments with a combination of a 3D mesh and a Gaussian texture that encodes both the color and high-frequency surface details. This representation enables accurate registration of garment geometries to multi-view videos and helps disentangle albedo textures from lighting effects. Furthermore, we demonstrate how a pre-trained Graph Neural Network (GNN) can be fine-tuned to replicate the real behavior of each garment. The reconstructed Gaussian Garments can be automatically combined into multi-garment outfits and animated with the fine-tuned GNN.
arXiv project video URL BibTeX

Haptic Intelligence Miscellaneous Haptify: A Measurement System for Benchmarking Grounded Force-Feedback Devices Fazlollahi, F., Kuchenbecker, K. J. Extended abstract (3 pages) presented at the German Robotics Conference (GRC), Nuremberg, Germany, March 2025 (Published)
Grounded force-feedback (GFF) devices are a well-established and diverse category of haptic technology based on robotic arms. However, the number of designs and their specifications make it challenging to compare devices effectively. We address this challenge by presenting Haptify, a benchmarking system capable of evaluating GFF haptic devices in a thorough, fair, and non-invasive way. The user holds the instrumented device end-effector and moves it through a series of passive and active experiments. Haptify captures the interaction between the hand, device, and ground using a seven-camera optical motion-capture system, a custom 60-cm-square force plate, and a customized sensing end-effector. We propose six key metrics for evaluating GFF device performance: workspace shape, global free-space forces, global free-space vibrations, local dynamic forces and torques, frictionless surface rendering, and stiffness rendering. We then benchmark two commercial haptic devices using Haptify. The more expensive Touch X has a smaller workspace than the 3D Systems Touch, but it outputs smaller free-space forces and vibrations, smaller and more predictable dynamic forces and torques, and higher-quality renderings of a frictionless surface and high stiffness.
BibTeX

Empirical Inference Ph.D. Thesis Learning to Generalize Across Distribution Shifts Träuble, F. J. University of Tübingen, Germany, March 2025, (IMPRS-PhD-Fellowship-Program and ELLIS-PhD-Fellowship-Program) (Published) BibTeX

Empirical Inference Article Real-time inference for binary neutron star mergers using machine learning Dax, M., Green, S. R., Gair, J., Gupte, N., Pürrer, M., Raymond, V., Wildberger, J., Macke, J. H., Buonanno, A., Schölkopf, B. Nature, 639(8053):49-53, March 2025 (Published) DOI URL BibTeX

Perceiving Systems Conference Paper CameraHMR: Aligning People with Perspective Patel, P., Black, M. J. In International Conference on 3D Vision (3DV), March 2025 (Published)
We address the challenge of accurate 3D human pose and shape estimation from monocular images. The key to accuracy and robustness lies in high-quality training data. Existing training datasets containing real images with pseudo ground truth (pGT) use SMPLify to fit SMPL to sparse 2D joint locations, assuming a simplified camera with default intrinsics. We make two contributions that improve pGT accuracy. First, to estimate camera intrinsics, we develop a field-of-view prediction model (HumanFoV) trained on a dataset of images containing people. We use the estimated intrinsics to enhance the 4D-Humans dataset by incorporating a full perspective camera model during SMPLify fitting. Second, 2D joints provide limited constraints on 3D body shape, resulting in average-looking bodies. To address this, we use the BEDLAM dataset to train a dense surface keypoint detector. We apply this detector to the 4D-Humans dataset and modify SMPLify to fit the detected keypoints, resulting in significantly more realistic body shapes. Finally, we upgrade the HMR2.0 architecture to include the estimated camera parameters. We iterate model training and SMPLify fitting initialized with the previously trained model. This leads to more accurate pGT and a new model, CameraHMR, with state-of-the-art accuracy. Code and pGT are available for research purposes.
arXiv project BibTeX

Perceiving Systems Conference Paper CHOIR: A Versatile and Differentiable Hand-Object Interaction Representation Morales, T., Taheri, O., Lacey, G. In Winter Conference on Applications of Computer Vision (WACV), February 2025 (Published)
Synthesizing accurate hands-object interactions (HOI) is critical for applications in Computer Vision, Augmented Reality (AR), and Mixed Reality (MR). Despite recent advances, the accuracy of reconstructed or generated HOI leaves room for refinement. Some techniques have improved the accuracy of dense correspondences by shifting focus from generating explicit contacts to using rich HOI fields. Still, they lack full differentiability or continuity and are tailored to specific tasks. In contrast, we present a Coarse Hand-Object Interaction Representation (CHOIR), a novel, versatile and fully differentiable field for HOI modelling. CHOIR leverages discrete unsigned distances for continuous shape and pose encoding, alongside multivariate Gaussian distributions to represent dense contact maps with few parameters. To demonstrate the versatility of CHOIR we design JointDiffusion, a diffusion model to learn a grasp distribution conditioned on noisy hand-object interactions or only object geometries, for both refinement and synthesis applications. We demonstrate JointDiffusion’s improvements over the SOTA in both applications: it increases the contact F1 score by 5% for refinement and decreases the sim. displacement by 46% for synthesis. Our experiments show that JointDiffusion with CHOIR yields superior contact accuracy and physical realism compared to SOTA methods designed for specific tasks.
GitHub Paper URL BibTeX

Biomimetic Materials and Machines Article Highly agile flat swimming robot Hartmann, F., Baskaran, M., Raynaud, G., Benbedda, M., Mulleners, K., Shea, H. February 2025 (Published) BibTeX

Rationality Enhancement Article Evaluating the Effectiveness of the InsightApp: A Longitudinal Randomized Controlled Trial on Anxiety, Valued Action, and Psychological Resilience Amo, V., Lieder, F. JMIR Mental Health, 12:e57201, February 2025 (Published)
Background: Anxiety disorders are among the most prevalent mental disorders, and stress plays a significant role in their development. Ecological momentary interventions (EMIs) hold great potential to help people manage stress and anxiety by training emotion regulation and coping skills in real-life settings. InsightApp is a gamified EMI and research tool that incorporates elements from evidence-based therapeutic approaches. It is designed to strengthen people’s metacognitive skills for coping with challenging real-life situations and embracing anxiety and other emotions. Objective: This randomized controlled trial aims to examine the effectiveness of InsightApp in (1) improving individuals’ metacognitive strategies for coping with stress and anxiety and (2) promoting value-congruent action. It also evaluates how long these effects are retained. This experiment advances our understanding of the role of metacognition in emotional and behavioral reactivity to stress. Methods: We conducted a randomized controlled trial with 228 participants (completion rate: n=197, 86.4%; mean age 38, SD 11.50 years; age range 20-80 years; female: n=101, 52.6%; and White: n=175, 91.1%), who were randomly assigned to either the treatment or the active placebo control group. During the 1-week intervention phase, the treatment group engaged with InsightApp, while participants in the control group interacted with a placebo version of the app that delivered executive function training. We assessed the differences between the 2 groups in posttest and follow-up assessments of mental health and well-being while controlling for preexisting differences. Moreover, we used a multilevel model to analyze the longitudinal data, focusing on the within-participant causal effects of the intervention on emotional and behavioral reactivity to daily stressors. Specifically, we measured daily anxiety, struggle with anxiety, and value-congruent action. 
Results: The intervention delivered by InsightApp yielded mixed results. On one hand, we found no significant posttest scores on mental health and well-being measures directly after the intervention or 7 days later (all P>.22). In contrast, when confronted with real-life stress, the treatment group experienced a 15% lower increase in anxiety (1-tailed t test, t(197)=–2.4; P=.009) and a 12% lower increase in the struggle with anxiety (t(197)=–1.87; P=.031) than the control group. Furthermore, individuals in the treatment group demonstrated a 7% higher tendency to align their actions with their values compared to the control group (t(197)=3.23; P=.002). After the intervention period, InsightApp’s positive effects on the struggle with anxiety in reaction to stress were sustained, and increased to an 18% lower reactivity to stress (t(197)=–2.84; P=.002). Conclusions: As our study yielded mixed results, further studies are needed to obtain an accurate and reliable understanding of the effectiveness of InsightApp. Overall, our findings tentatively suggest that guiding people to apply adaptive metacognitive strategies for coping with real-life stress daily with a gamified EMI is a promising approach that deserves further evaluation.
DOI URL BibTeX

Empirical Inference Article Artificial intelligence for modelling infectious disease epidemics Kraemer, M. U. G., Tsui, J. L., Chang, S. Y., Lytras, S., Khurana, M. P., Vanderslott, S., Bajaj, S., Scheidwasser, N., Curran-Sebastian, J. L., Semenova, E., Zhang, M., Unwin, H. J. T., Watson, O. J., Mills, C., Dasgupta, A., Ferretti, L., Scarpino, S. V., Koua, E., Morgan, O., Tegally, H., et al. Nature, 638(8051):623-635, February 2025 (Published) DOI URL BibTeX

Empirical Inference Ph.D. Thesis Predictions, Policies, Rewards: Models of Decision-Making from Observational Data Pace, A. ETH Zurich, Switzerland, February 2025, ETH AI Center-Fellowship-Program (Published) BibTeX

Biomimetic Materials and Machines Article Ecosystem-Centered Robot Design: Toward Ecoresorbable Sustainability Robots (ESRs) Yilmaz, T., Fang, Y., Contreras, C., Schulz, A. K., Hartmann, F. Advanced Science, e09194:1-31, January 2025 (Published)
The deployment of robots and sensors across diverse ecosystems supports ecological monitoring, nature conservation, and exploration. However, retrieving these machines is often impractical or economically infeasible, posing risks to ecosystems through pollution, physical damage, and waste generation. To alleviate these risks, the development of transient systems from biodegradable materials represents a promising solution, enabling them to decompose harmlessly after use. Robots made from soft or functional polymers exhibit a unique potential in solving this challenge by drawing from a wide range of biomaterials, while simultaneously benefiting from intrinsic adaptability. Despite significant progress in the development of sustainable soft robotics, the influence of specific ecosystems on biodegradation is frequently overlooked. The environmental context is essential, as biodegradation depends largely on environmental factors unique to each ecosystem. In this review, a comprehensive overview of various ecosystems relevant to robot deployment is provided, offering critical context for assessing sustainability and deriving principles for ecosystem-centered robot design. Co-developing materials and sustainability robots with an understanding of their operational ecosystems paves the way for environmentally friendly machines, which are named ecoresorbable sustainability robots (ESRs), that coexist harmoniously with nature.
DOI URL BibTeX

Dynamic Locomotion Article How knee muscles and ground reaction forces shape knee buckling and ankle push-off in neuromuscular simulations of human walking Buchmann, A., Kiss, B., Badri-Spröwitz, A., Renjewski, D. Scientific Reports, 15:2249, January 2025 (Published)
Ankle push-off is important for efficient, human-like walking, and many prosthetic devices mimic push-off using motors or elastic elements. The knee is extended throughout the stance phase and begins to buckle just before push-off, with timing being crucial. However, the exact mechanisms behind this buckling are still unclear. We use a predictive neuromuscular simulation to investigate whether active muscles are required for knee buckling and to what extent ground reaction forces (GRFs) drive it. In a systematic parameter search, we tested how long the knee muscles vastus (VAS), gastrocnemius (GAS), and hamstrings could be deactivated while maintaining a stable gait with impulsive push-off. VAS deactivation up to 35% of the gait cycle resulted in a dynamic gait with increased ankle peak power. GAS deactivation up to 20% of the gait cycle was detrimental to gait efficiency and showed reduced ankle peak power. At the start of knee buckling, the GRF vector is positioned near the knee joint’s neutral axis, assisting in knee flexion. However, this mechanism is likely not enough to drive knee flexion independently. Our findings contribute to the biomechanical understanding of ankle push-off, with applications in prosthetic and bipedal robotic design, and fundamental research on human gait mechanics.
DOI URL BibTeX

Deep Models and Optimization Conference Paper Geometric Inductive Biases of Deep Networks: The Role of Data and Architecture Movahedi, S., Orvieto, A., Moosavi-Dezfooli, S. In The Thirteenth International Conference on Learning Representations, ICLR 2025, January 2025 (Accepted) BibTeX

Dynamic Locomotion Book Special issue on embodied intelligence-understanding animal locomotion and its robotic implementations Manoonpong, P., Badri-Spröwitz, A., Owaki, D. Advanced Robotics, 39:1-2, Taylor & Francis and RSJ, Milton, January 2025 (Published)
‘Embodied Intelligence (EI)’ refers to the innate ability of animals to utilize their body structures and interact with their environment (morphological computation) in conjunction with their brain and nervous systems (neural computation). This synergy enables them to achieve flexible, versatile, and robust locomotion, and allows them to learn and perform complex tasks throughout their lives. In modern robotics, where artificial intelligence (AI) is the driver for transformative advancements, the harmonious and continuous dynamic interaction between neural computation (including control, memory, and plasticity), the physical (flexible) body, and the environment – collectively referred to as ‘embodiment’ – remains a fundamental principle. Given that animals exhibit adaptive movement strategies across diverse real-world scenarios, understanding these strategies can pave the way for innovative robotic systems that reflect ‘nature intelligence’.
DOI URL BibTeX

Materials Article Simultaneous Selective and Quantitative Sensing of Diclofenac and Metoprolol via Electrical Conductance of Two Polyelectrolyte Hydrogels Tsianaka, A., Fichtel, K., Tovar, G. E. M., Southan, A. Advanced Sensor Research, 4(3):2400141, January 2025 (Published)
Hydrogels containing functional groups are highly interesting for sensor applications as they can change their physical properties by interaction with their environment. In this study, it is demonstrated that by monitoring the conductance of two different functional hydrogels, the concentrations of two different drugs in aqueous solution can be selectively and quantitatively measured simultaneously based on non-specific interactions. Detailed characterization of the competitive drug adsorption on the hydrogels allows the description of both hydrogel conductances as a function of the drug concentrations based on physical models. The result is a system of non-linear equations that can be solved for the drug concentrations. The different affinities and conductance responses of the hydrogels for the two drugs are a prerequisite, which is usually achieved with different materials. This approach is demonstrated with hydrogels based on poly(ethylene glycol), functionalized with the ionic monomers [2-(acryloyloxy)ethyl] trimethylammonium chloride (AETA) and 3-sulfopropyl acrylate potassium salt (SPA), and the drugs diclofenac and metoprolol. The hydrogel conductance is found to be linear with drug concentration in the hydrogels, which in turn is described by a non-linear Langmuir-type competitive adsorption isotherm. The proposed approach thus shows potential for future studies on more complex mixtures by including a larger variety of functional hydrogels.
pdf DOI URL BibTeX
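
The structure summarized in the abstract above (conductance linear in the adsorbed drug, adsorption following a competitive Langmuir-type isotherm) can be written out in textbook form. The symbols below are assumptions introduced for illustration and are not taken from the paper: $c_1, c_2$ are the two drug concentrations in solution, $q_{j,i}$ is the loading of drug $i$ in hydrogel $j \in \{A, B\}$, $K_{j,i}$ and $q^{\max}_{j,i}$ are isotherm parameters, and $G_{j,0}$, $\alpha_{j,i}$ are the baseline conductance and linear response coefficients. The physical models actually used by the authors may differ.

```latex
\begin{align}
  q_{j,i} &= \frac{q^{\max}_{j,i}\, K_{j,i}\, c_i}{1 + K_{j,1}\, c_1 + K_{j,2}\, c_2},
  \qquad i \in \{1, 2\},\; j \in \{A, B\} \\
  G_j &= G_{j,0} + \alpha_{j,1}\, q_{j,1} + \alpha_{j,2}\, q_{j,2}
\end{align}
```

Measuring $G_A$ and $G_B$ then gives two non-linear equations in the two unknowns $c_1$ and $c_2$. Under these assumptions the system can only be solved uniquely if the two hydrogels respond differently to the two drugs, which matches the prerequisite stated in the abstract.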

Robust Machine Learning Conference Paper Cross-Entropy Is All You Need To Invert the Data Generating Process Reizinger, P., Bizeul, A., Juhos, A., Vogt, J. E., Balestriero, R., Brendel, W., Klindt, D. January 2025 (Published) OpenReview BibTeX

Robust Machine Learning Conference Paper In Search of Forgotten Domain Generalization Mayilvahanan, P., Zimmermann, R. S., Wiedemer, T., Rusak, E., Juhos, A., Bethge, M., Brendel, W. January 2025 (Published) OpenReview BibTeX

Robust Machine Learning Conference Paper Interaction Asymmetry: A General Principle for Learning Composable Abstractions Brady, J., von Kügelgen, J., Lachapelle, S., Buchholz, S., Kipf, T., Brendel, W. January 2025 (Published) OpenReview BibTeX

Social Foundations of Computation Conference Paper Lawma: The Power of Specialization for Legal Tasks Dominguez-Olmedo, R., Nanda, V., Abebe, R., Bechtold, S., Engel, C., Frankenreiter, J., Gummadi, K., Hardt, M., Livermore, M. The Thirteenth International Conference on Learning Representations (ICLR 2025), January 2025 (Accepted)
Annotation and classification of legal text are central components of empirical legal research. Traditionally, these tasks are often delegated to trained research assistants. Motivated by the advances in language modeling, empirical legal scholars are increasingly turning to prompting commercial models, hoping that it will alleviate the significant cost of human annotation. Despite growing use, our understanding of how to best utilize large language models for legal tasks remains limited. We conduct a comprehensive study of 260 legal text classification tasks, nearly all new to the machine learning community. Starting from GPT-4 as a baseline, we show that it has non-trivial but highly varied zero-shot accuracy, often exhibiting performance that may be insufficient for legal work. We then demonstrate that a lightly fine-tuned Llama 3 model vastly outperforms GPT-4 on almost all tasks, typically by double-digit percentage points. We find that larger models respond better to fine-tuning than smaller models. A few tens to hundreds of examples suffice to achieve high classification accuracy. Notably, we can fine-tune a single model on all 260 tasks simultaneously at a small loss in accuracy relative to having a separate model for each task. Our work points to a viable alternative to the predominant practice of prompting commercial models. For concrete legal tasks with some available labeled data, researchers are better off using a fine-tuned open-source model.
arXiv Code BibTeX

Social Foundations of Computation Conference Paper Limits to Scalable Evaluation at the Frontier: LLM as Judge Won’t Beat Twice the Data Dorner, F. E., Nastl, V. Y., Hardt, M. The Thirteenth International Conference on Learning Representations (ICLR 2025), January 2025 (Accepted)
High-quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an important research ambition. Many hope to use strong existing models in lieu of costly labels to provide cheap model evaluations. Unfortunately, this method of using models as judges introduces biases, such as self-preferencing, that can distort model comparisons. An emerging family of debiasing tools promises to fix these issues by using a few high-quality labels to debias a large number of model judgments. In this paper, we study how far such debiasing methods, in principle, can go. Our main result shows that when the judge is no more accurate than the evaluated model, no debiasing method can decrease the required amount of ground truth labels by more than half. Our result speaks to the severe limitations of the LLM-as-a-judge paradigm at the evaluation frontier where the goal is to assess newly released models that are possibly better than the judge. Through an empirical evaluation, we demonstrate that the sample size savings achievable in practice are even more modest than what our theoretical limit suggests. Along the way, our work provides new observations about debiasing methods for model evaluation and points out promising avenues for future work.
arXiv URL BibTeX

Social Foundations of Computation Miscellaneous Training on the Test Task Confounds Evaluation and Emergence Dominguez-Olmedo, R., Dorner, F. E., Hardt, M. The Thirteenth International Conference on Learning Representations (ICLR 2025), January 2025 (Accepted)
We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data contamination, training on the test task is not malpractice. Rather, the term describes a growing set of techniques to include task-relevant data in the pretraining stage of a language model. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for training on the test task by fine-tuning each model under comparison on the same task-relevant data before evaluation. We then show that instances of emergent behavior largely vanish once we adjust for training on the test task. This also applies to reported instances of emergent behavior that cannot be explained by the choice of evaluation metric. Our work promotes a new perspective on the evaluation of large language models with broad implications for benchmarking and the study of emergent capabilities.
ArXiv BibTeX

Perceiving Systems Conference Paper OpenCapBench: A Benchmark to Bridge Pose Estimation and Biomechanics Gozlan, Y., Falisse, A., Uhlrich, S., Gatti, A., Black, M., Chaudhari, A. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , January 2025 (Published)
Pose estimation has promised to impact healthcare by enabling more practical methods to quantify nuances of human movement and biomechanics. However, despite the inherent connection between pose estimation and biomechanics, these disciplines have largely remained disparate. For example, most current pose estimation benchmarks use metrics such as Mean Per Joint Position Error, Percentage of Correct Keypoints, or mean Average Precision to assess performance, without quantifying kinematic and physiological correctness - key aspects for biomechanics. To alleviate this challenge, we develop OpenCapBench to offer an easy-to-use unified benchmark to assess common tasks in human pose estimation, evaluated under physiological constraints. OpenCapBench computes consistent kinematic metrics through joint angles provided by an open-source musculoskeletal modeling software (OpenSim). Through OpenCapBench, we demonstrate that current pose estimation models use keypoints that are too sparse for accurate biomechanics analysis. To mitigate this challenge, we introduce SynthPose, a new approach that enables finetuning of pre-trained 2D human pose models to predict an arbitrarily denser set of keypoints for accurate kinematic analysis through the use of synthetic data. Incorporating such finetuning on synthetic data of prior models leads to twofold reduced joint angle errors. Moreover, OpenCapBench allows users to benchmark their own developed models on our clinically relevant cohort. Overall, OpenCapBench bridges the computer vision and biomechanics communities, aiming to drive simultaneous advances in both areas.
arXiv code/data URL BibTeX

Deep Models and Optimization Conference Paper Using Shapley interactions to understand how models use structure Divyansh Singhvi, D. M. A. E. R. J. I. P. N. S. In Proceedings ACL, 1-20, Vienna Center, Association for Computational Linguistics (ACL 2025), 2025 (Accepted) DOI URL BibTeX

Learning and Dynamical Systems Conference Paper Adversarial Training for Defense Against Label Poisoning Attacks Bal, M. I., Cevher, V., Muehlebach, M. In International Conference on Learning Representations, 2025 (Accepted) BibTeX

Dynamic Locomotion Conference Paper Bird-inspired tendon coupling improves paddling efficiency by shortening phase transition times Lin, J., Zhao, G., Badri-Spröwitz, A. Proceedings of ICRA 2025, 6, arxiv, NY, ICRA, 2025 (Accepted)
Drag-based swimming with rowing appendages, fins, and webbed feet is a widely adapted locomotion form in aquatic animals. To develop effective underwater and swimming vehicles, a wide range of bioinspired drag-based paddles have been proposed, often faced with a trade-off between propulsive efficiency and versatility. Webbed feet provide an effective propulsive force in the power phase, are lightweight and robust, and can even be partially folded away in the recovery phase. However, during the transition between recovery and power phase, much time is lost folding and unfolding, leading to drag and reducing efficiency. In this work, we took inspiration from the coupling tendons of aquatic birds and utilized tendon coupling mechanisms to shorten the transition time between recovery and power phase. Results from our hardware experiments show that the proposed mechanisms improve propulsive efficiency by 2.0 and 2.4 times compared to a design without extensor tendons or based on a passive paddle, respectively. We further report that distal leg joint clutching, which has been shown to improve efficiency in terrestrial walking, did not play a major role in swimming locomotion. In sum, we describe a new principle for an efficient, drag-based leg and paddle design, with potential relevance for the swimming mechanics in aquatic birds.
DOI URL BibTeX

Neuromechanics of Movement Organizational Leadership and Diversity Article Building bridges: allyship as a catalyst for gender diversity and inclusion in experimental biology communities Schwaner, M. J., Keplinger, K. 2025 (Published)
Diversity drives innovation and creativity, directly contributing to scientific excellence. However, achieving equity in academia, including in experimental biology fields such as biomechanics and comparative physiology, remains a significant challenge, with women and other historically marginalized groups underrepresented, especially in more senior roles. When considering gender, the disparity is often linked to difficulties in balancing family responsibilities with demanding careers, along with lower ‘academic visibility’, as evidenced by fewer professional awards for women scientists. Many successful women who balance career and family keep their family lives private, making these aspects invisible to early career scholars, and thus depriving them of role models. To help close the gender gap, in this Perspective, we propose 10 actionable strategies for scholars at all career stages to promote gender diversity and inclusion through active allyship. Although we focus on gender diversity, these strategies can be broadly applied to harness the benefits of other diversity dimensions (e.g. age or ethnicity). We argue that embracing allyship benefits individual scientists, their research groups, the quality of their research, the broader research community and society at large by enhancing collective scientific output and inspiring the next generation of scientists.
URL BibTeX

Human Aspects of Machine Learning Article Causal fair metric: Bridging causality, individual fairness, and adversarial robustness Ehyaei, A. R., Farnadi, G., Samadi, S. Transactions on Machine Learning Research, 2025 (Accepted) BibTeX

Organizational Leadership and Diversity Article Chatting Towards Inclusivity: A Digital Approach to Inclusion Action Plans and Leader Development Singh, V., Rivin, J. M., van Wagoner, H. P., Keplinger, K., Barbuto, J. 2025 (Published)
Inclusion is a cornerstone of success for organizations and society, yet inclusion is not guaranteed. Building on inclusive leadership research and relational models theory, we argue that inclusion cannot manifest without systematic effort and planning by leaders. Unfortunately, few resources exist to help leaders plan and enact specific inclusion behaviors. To address this, we introduce the “Leader Success Bot,” an innovative conversational chatbot designed to help leaders develop daily inclusion action plans. Through our immersive longitudinal design and mixed methods data, we advance the taxonomy of inclusive leader behaviors and test the impact of inclusion planning on leaders and followers. We demonstrate how equality matching is an overlooked relational model that is a pivotal relational dynamic for inclusion. Across two studies, our quantitative and qualitative findings show that equitable exchanges by leaders can foster a deeper sense of belonging and community. As leaders interact with the chatbot, both leaders and followers are more likely to accomplish their goals. Additionally, followers' inclusion climate and psychological safety benefited, leading to a decrease in turnover intentions. Our findings underscore the potential of chatbots to support inclusive leadership training and development by providing leaders with a structured, scalable platform for continuous reflection and growth. This research advances theoretical understanding of relational inclusion dynamics and offers practical insights and a scalable tool for HR managers seeking to build more inclusive, psychologically safe cultures.
DOI BibTeX

Learning and Dynamical Systems Conference Paper Conformal Generative Modeling with Improved Sample Efficiency through Sequential Greedy Filtering Kladny, K., Schölkopf, B., Muehlebach, M. In International Conference on Learning Representations, 2025 (Accepted) URL BibTeX

Perceiving Systems Thesis Dynamic 3D Synthesis: From Video-Based Animatable Head Avatars to Text-Guided 4D Content Creation Zheng, Y. 2025 (Published)
The synthesis of 4D content—dynamic 3D content that evolves over time—has become increasingly important across a wide range of applications, including virtual communication, gaming, AR/VR, and digital content creation. Despite recent advances, generating realistic 4D content from accessible inputs remains a significant challenge. Existing approaches often rely on dense multi-camera capture systems, which are costly and impractical for everyday use, or yield results with limited geometric and visual fidelity. This thesis investigates two sub tasks in 4D content creation: (1) the reconstruction of high-fidelity, animatable head avatars from accessible inputs such as monocular RGB videos, and (2) the generation of dynamic 4D scenes from text prompts and optionally sparse visual input, such as reference images. These two directions are unified by a common goal—enabling controllable and high-quality 4D content creation from minimal visual supervision. The first part of this thesis presents IMavatar, a morphable implicit surface representation for reconstructing personalized head avatars from monocular videos. Implicit surfaces provide topological flexibility and can recover detailed 3D geometry directly from RGB images, making them well-suited for head avatar reconstruction. However, modeling expression- and pose-dependent deformations in an interpretable and generalizable way remains a major challenge when working with implicit representations. Inspired by 3D morphable models, IMavatar models deformation by learning expression blendshapes and skinning weight fields in a canonical space, enabling structured and generalizable control over novel expressions and poses. To enable end-to-end optimization from monocular videos, we propose a novel analytical gradient formulation that supports joint training of the geometry and deformation directly from RGB supervision. 
By combining the geometric fidelity of neural implicit fields with the controllability of morphable models, IMavatar achieves high-quality 4D reconstructions and strong generalization to unseen expressions and head poses. The second part of this thesis presents PointAvatar, a deformable point-based representation for animatable 3D head avatars. While implicit representations are effective at learning detailed geometry from image observations, they are inherently difficult to animate and computationally expensive to render. To address these limitations, this work explores point clouds as the underlying geometric representation for head avatars, offering the efficiency of explicit representations while avoiding the fixed-topology constraints of meshes. PointAvatar uses a canonical point cloud combined with learned blendshape and skinning weight fields, and further disentangles intrinsic albedo from view-dependent shading to support relighting under novel illumination. To improve training stability and reconstruction quality, we adopt a coarse-to-fine strategy that gradually increases point cloud resolution during learning. This enables the model to effectively capture accurate geometry and high-quality texture from monocular RGB videos, including challenging cases such as eyeglasses and complex hairstyles. Compared to IMavatar, PointAvatar achieves an 8× speed-up during training and a 100× speed-up during inference rendering, while maintaining high visual and geometric quality. In the final part, this thesis explores Dream-in-4D, a diffusion-guided framework for generating creative 4D content from natural language. The focus is on synthesizing imaginative 4D scenes from minimal visual input—either a single image or no visual input at all. To this end, the method leverages prior knowledge from pre-trained image and video diffusion models to optimize a 4D representation. Dream-in-4D follows a two-stage pipeline. 
In the first stage, a static 3D model is optimized as a neural radiance field using guidance from both image and 3D-aware diffusion models, resulting in high-quality, view-consistent assets. In the second stage, a time-dependent, multi-resolution deformation field is introduced to represent motion and is optimized using video diffusion guidance, equipping the static 3D asset with detailed and plausible motion driven by text prompts. The resulting system supports text-to-4D, image-to-4D, and personalized 4D generation within a unified framework, enabling intuitive and flexible dynamic scene synthesis from highly accessible inputs. Together, these methods address two essential aspects of 4D content creation: the reconstruction of animatable head avatars from monocular videos, and the generation of dynamic, imaginative 4D scenes from text and image prompts. We hope these contributions advance the field toward more accessible, controllable, and high-quality 4D content creation—enabling a broad range of applications across research, industry, and creative practice.
DOI URL BibTeX

Robotic Composites and Compositions Article Emergent patterns of interaction with dynamic objects Aktaş, B., Myers, P., Salem, E., Klatzky, R., Howe, R. PLOS ONE, 20:e0331844, 2025 (Published)
Perception by touch is fundamentally linked to the motor system. A hallmark of this linkage takes the form of stereotyped haptic “exploratory procedures” [1], movement patterns that emerge when people set a perceptual goal such as judging the roughness of a textured surface. This paper expands the study of touch-directed movements by asking what patterns emerge when people encounter and interact with novel objects without explicitly specified goals. Participants were invited to freely interact with an art installation containing novel objects with distinct design features, intended to vary familiarity, structural affordance, and aesthetic response. Objects’ affordances were additionally varied over time by utilizing jamming, a physical mechanism that induces changes in stiffness and plasticity. From video recordings, four categories of spontaneous “interactive procedures” differentiated by underlying goals were reliably identified: passive observational, active perceptual, constructive, and hedonic. Perceptual actions were most frequent, indicating an overriding goal of acquiring information about physical properties. The prevalence of other interactive procedures varied across objects, demonstrating the influence of perceptual affordances and prior knowledge. Changes in state further moderated interactions, such that interactions were longer in the stiff/jammed state, and the occurrence of a state change during an interactive procedure lengthened its duration. These findings extend our understanding of haptic exploration beyond explicitly goal-directed contexts, revealing how spontaneous responses in complex and dynamic environments are linked to perceptual outcomes and prior knowledge.
DOI URL BibTeX

Empirical Inference Technical Report International AI Safety Report Bengio, Y., Mindermann, S., Privitera, D., Besiroglu, T., Bommasani, R., Casper, S., Choi, Y., Fox, P., Garfinkel, B., Goldfarb, D., Heidari, H., Ho, A., Kapoor, S., Khalatbari, L., Longpre, S., Manning, S., Mavroudis, V., Mazeika, M., Michael, J., Newman, J., et al. (DSIT 2025/001), 2025 (Published) URL BibTeX

Perceiving Systems Conference Paper Joker: Conditional 3D Head Synthesis with Extreme Facial Expressions Prinzler, M., Zakharov, E., Sklyarova, V., Kabadayi, B., Thies, J. In International Conference on 3D Vision (3DV), 2025 (Published)
We introduce Joker, a new method for the conditional synthesis of 3D human heads with extreme expressions. Given a single reference image of a person, we synthesize a volumetric human head with the reference’s identity and a new expression. We offer control over the expression via a 3D morphable model (3DMM) and textual inputs. This multi-modal conditioning signal is essential since 3DMMs alone fail to define subtle emotional changes and extreme expressions, including those involving the mouth cavity and tongue articulation. Our method is built upon a 2D diffusion-based prior that generalizes well to out-of-domain samples, such as sculptures, heavy makeup, and paintings while achieving high levels of expressiveness. To improve view consistency, we propose a new 3D distillation technique that converts predictions of our 2D prior into a neural radiance field (NeRF). Both the 2D prior and our distillation technique produce state-of-the-art results, which are confirmed by our extensive evaluations. Also, to the best of our knowledge, our method is the first to achieve view-consistent extreme tongue articulation.
project page arxiv BibTeX

Physical Intelligence Article Magnetoelectric film for wireless low-frequency neuromodulation Aydin, A., Jahanshahi, A., Esmaeili-Dokht, P., Han, M., Gardi, G., Temel, Y., Sitti, M. Brain Stimulation: Basic, Translational, and Clinical Research in Neuromodulation, 18:284, 2025 (Published)
Wireless neuromodulation techniques are widely investigated to address the challenges associated with conventional neurostimulation devices. Previous research has relied on ultrasound, light and magnetic fields as the modalities for remotely powering neuronal implants. Use of magnetic fields has been promising for wireless neuronal interfaces since they have excellent tissue penetration. Magnetically powered devices typically work with >100 kHz electromagnetic fields; therefore, they are heavily dependent on the on-board electronics to regulate the output signal. Moreover, use of such high frequency is a limiting factor for safe use, especially in deeper areas due to tissue absorption. The magnetoelectric (ME) approach is a promising method that stems from the magneto-electrical coupling. It is a high throughput approach for power delivery through magnetic fields in low frequency regimes compared to far-field or inductive coupling. In this study, we aim to understand how the ME approach can be used to modulate neuronal behavior in non-resonant frequency regimes. We fabricated ME planar films through laminating magnetostrictive and piezoelectric components. We initially defined the output electrical potential as the main design parameter and subsequently optimized the device geometry and applied magnetic field profile to achieve the best possible performance. We were able to observe a current density of ∼4-6 μA/cm² in a phosphate-buffered saline environment under a 10 Hz input magnetic field. Lastly, we investigated the neuromodulation potential of the ME films in vitro through calcium imaging studies. Our preliminary results show that primary hippocampal neurons have significantly increased calcium influx during stimulation compared to the pre-stimulation phase. Stimulation efficiency was further investigated with changing stimulation duration and input magnetic field waveforms. 
Overall, these results show that ME films are promising candidates for neuronal interfaces for wireless electrical modulation. Future work will be conducted to understand the exact mechanisms of neuromodulation and design such interfaces in an implantable miniature form for in vivo studies.
DOI URL BibTeX

Empirical Inference Book Chapter Natural Language Processing Jin, Z., Mihalcea, R., Schölkopf, B. In Elgar Encyclopedia of Political Communication, (Editors: Nai, A. and Grömping, M. and Wirz, D.), Edward Elgar Publishing, 2025 (Published) PDF URL BibTeX