Publications

DEPARTMENTS

Empirical Inference

Haptic Intelligence

Modern Magnetic Systems

Perceiving Systems

Physical Intelligence

Robotic Materials

Social Foundations of Computation


Research Groups

Autonomous Vision

Autonomous Learning

Bioinspired Autonomous Miniature Robots

Dynamic Locomotion

Embodied Vision

Human Aspects of Machine Learning

Intelligent Control Systems

Learning and Dynamical Systems

Locomotion in Biorobotic and Somatic Systems

Micro, Nano, and Molecular Systems

Movement Generation and Control

Neural Capture and Synthesis

Physics for Inference and Optimization

Organizational Leadership and Diversity

Probabilistic Learning Group


Social Foundations of Computation Conference Paper Difficult Lessons on Social Prediction from Wisconsin Public Schools Perdomo, J. C., Britton, T., Hardt, M., Abebe, R. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, June 2025 (Published)
Early warning systems (EWS) are predictive tools at the center of recent efforts to improve graduation rates in public schools across the United States. These systems assist in targeting interventions to individual students by predicting which students are at risk of dropping out. Despite significant investments in their widespread adoption, there remain large gaps in our understanding of the efficacy of EWS, and the role of statistical risk scores in education. In this work, we draw on nearly a decade's worth of data from a system used throughout Wisconsin to provide the first large-scale evaluation of the long-term impact of EWS on graduation outcomes. We present empirical evidence that the prediction system accurately sorts students by their dropout risk. We also find that it may have caused a single-digit percentage increase in graduation rates, though our empirical analyses cannot reliably rule out that there has been no positive treatment effect. Going beyond a retrospective evaluation of Wisconsin's Dropout Early Warning System (DEWS), we draw attention to a central question at the heart of the use of EWS: Are individual risk scores necessary for effectively targeting interventions? We propose a simple mechanism that only uses information about students' environments -- such as their schools and districts -- and argue that this mechanism can target interventions just as efficiently as the individual risk score-based mechanism. Our argument holds even if individual predictions are highly accurate and effective interventions exist. In addition to motivating this simple targeting mechanism, our work provides a novel empirical backbone for the robust qualitative understanding among education researchers that dropout is structurally determined. Combined, our insights call into question the marginal value of individual predictions in settings where outcomes are driven by high levels of inequality.
arXiv URL BibTeX
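The environment-based targeting mechanism the abstract describes can be illustrated with a small simulation. Everything below — the school counts, the risk model, and the intervention budget — is invented for illustration and is not the paper's data or analysis; it only sketches why group-level targeting can approach individual risk scores when risk is structurally determined.

```python
import numpy as np

rng = np.random.default_rng(2)
n_schools, students_per_school = 20, 100

# Hypothetical setting: dropout risk is driven mostly by the school
# environment, with small individual variation around each school's base rate.
school_base = rng.uniform(0.05, 0.5, n_schools)
risk = np.clip(
    school_base[:, None] + rng.normal(0, 0.05, (n_schools, students_per_school)),
    0, 1,
)

budget = 400  # number of interventions available

# Individual mechanism: treat the students with the highest risk scores.
flat = risk.ravel()
individual_targets = np.argsort(flat)[::-1][:budget]
individual_risk_covered = flat[individual_targets].sum()

# Environment mechanism: treat every student in the highest-risk schools,
# using only school-level information.
schools_ranked = np.argsort(school_base)[::-1]
k = budget // students_per_school
env_risk_covered = risk[schools_ranked[:k]].sum()

# When risk is structural, the ratio is close to 1: school identity alone
# targets nearly as much total risk as individual scores do.
print(env_risk_covered / individual_risk_covered)
```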

Empirical Inference Article Flow annealed importance sampling bootstrap meets differentiable particle physics Kofler, A., Stimper, V., Mikhasenko, M., Kagan, M., Heinrich, L. Machine Learning: Science and Technology, 6(2), IOP Publishing, June 2025 (Published)
High-energy physics requires the generation of large numbers of simulated data samples from complex but analytically tractable distributions called matrix elements. Surrogate models, such as normalizing flows, are gaining popularity for this task due to their computational efficiency. We adopt an approach based on Flow Annealed importance sampling Bootstrap (FAB) that evaluates the differentiable target density during training and helps avoid the costly generation of training data in advance. We show that FAB reaches higher sampling efficiency with fewer target evaluations in high dimensions in comparison to other methods.
DOI URL BibTeX

Social Foundations of Computation Conference Paper How Benchmark Prediction from Fewer Data Misses the Mark Zhang, G., Dorner, F. E., Hardt, M. The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), June 2025 (Accepted)
Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM evaluation) aims to select a small subset of evaluation points and predict overall benchmark performance from that subset. In this paper, we systematically assess the strengths and limitations of 11 benchmark prediction methods across 19 diverse benchmarks. First, we identify a highly competitive baseline: Take a random sample and fit a regression model on the sample to predict missing entries. Outperforming most existing methods, this baseline challenges the assumption that careful subset selection is necessary for benchmark prediction. Second, we discover that all existing methods crucially depend on model similarity. They work best when interpolating scores among similar models. The effectiveness of benchmark prediction sharply declines when new models have higher accuracy than previously seen models. In this setting of extrapolation, none of the previous methods consistently beat a simple average over random samples. To improve over the sample average, we introduce a new method inspired by augmented inverse propensity weighting. This method consistently outperforms the random sample average even for extrapolation. However, its performance still relies on model similarity and the gains are modest in general. This shows that benchmark prediction fails just when it is most needed: at the evaluation frontier, where the goal is to evaluate new models of unknown capabilities.
arXiv BibTeX
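The competitive baseline the abstract identifies — take a random sample of evaluation points and fit a regression to predict full-benchmark performance — can be sketched roughly as follows. The matrix sizes, the plain least-squares fit, and the synthetic scores are illustrative assumptions, not the paper's exact protocol or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-item scores (1 = correct) for 40 previously evaluated
# models on a 200-item benchmark; each model has its own accuracy level.
n_models, n_items, n_sample = 40, 200, 10
scores = (rng.random((n_models, n_items)) < rng.random((n_models, 1))).astype(float)

# Baseline step 1: pick a random subset of evaluation points.
sample_idx = rng.choice(n_items, size=n_sample, replace=False)

# Baseline step 2: fit a linear map from sampled-item scores to true
# full-benchmark accuracy, using the old models as training rows.
X = scores[:, sample_idx]                    # (n_models, n_sample)
y = scores.mean(axis=1)                      # true benchmark accuracy
X1 = np.hstack([X, np.ones((n_models, 1))])  # add an intercept column
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

def predict_accuracy(sample_scores):
    """Predict full-benchmark accuracy from the sampled items only."""
    return float(np.append(sample_scores, 1.0) @ w)

# A new model is evaluated on the 10 sampled items only.
new_scores = (rng.random(n_items) < 0.7).astype(float)
est = predict_accuracy(new_scores[sample_idx])
print(abs(est - new_scores.mean()))  # prediction error vs. true accuracy
```

As the abstract notes, such interpolation works when the new model resembles the training models and degrades at the evaluation frontier.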

Empirical Inference Conference Paper Temporally Consistent Object-Centric Learning by Contrasting Slots Manasyan, A., Seitzer, M., Radovic, F., Martius, G., Zadaianchuk, A. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5401-5411, June 2025 (Published) DOI BibTeX

Haptic Intelligence Ph.D. Thesis Towards Robust and Flexible Robot State and Motion Estimation through Optimization and Learning Nubert, J. ETH Zurich, Zurich, Switzerland, June 2025, Department of Mechanical and Process Engineering (Published) BibTeX

Empirical Inference Conference Paper VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models Ye, M., Liu, W., He, P. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8679-8688, June 2025 (Published) DOI BibTeX

Haptic Intelligence Robotic Materials Article Wearable Electrohydraulic Actuation for Salient Full-Fingertip Haptic Feedback Shao, Y., Shagan Shomron, A., Javot, B., Keplinger, C., Kuchenbecker, K. J. Advanced Materials Technologies, 10(12):2401525, June 2025, Yitian Shao and Alona Shagan Shomron contributed equally to this publication. This article was selected for the front cover. https://doi.org/10.1002/admt.202570062 (Published)
Although essential for an immersive experience in extended reality (XR), providing salient and versatile touch feedback remains a technical challenge. Existing solutions restrict hand movements with bulky rigid structures, require a tethered energy source to power actuators worn on the hand, or output vibrations that lack expressiveness. This study introduces a design strategy for compact, lightweight, untethered haptic feedback centering on a 30-µm-thick inflatable chamber that naturally conforms to the fingertip; to minimize fluidic losses and enable high bandwidth, a soft electrohydraulic pump mounted on the hand actuates the chamber via a mechanically transparent fluidic channel. A 15.2-mm-diameter prototypical actuation chamber achieves 8 N peak force, 3 N steady-state force, stroke up to 5 mm, and bandwidth from 0 to 500 Hz. In contrast to these salient fingertip cues, the entire hydraulic system has a weight less than 8 g and a thickness less than 2 mm. Additionally, this study presents a validation approach that uses a commercial fingertip sensor to confirm that the haptic feedback created by the device imitates the touch signals generated during typical hand interactions. Together, this design strategy and validation method can enable a broad spectrum of haptic activities in diverse XR applications, including medical training, online shopping, and social interactions.
DOI BibTeX

Empirical Inference Perceiving Systems Conference Paper ChatHuman: Chatting about 3D Humans with Tools Lin, J., Feng, Y., Liu, W., Black, M. J. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8150-8161, June 2025 (Published)
Numerous methods have been proposed to detect, estimate, and analyze properties of people in images, including 3D pose, shape, contact, human-object interaction, and emotion. While widely applicable in vision and other areas, such methods require expert knowledge to select, use, and interpret the results. To address this, we introduce ChatHuman, a language-driven system that integrates the capabilities of specialized methods into a unified framework. ChatHuman functions as an assistant proficient in utilizing, analyzing, and interacting with tools specific to 3D human tasks, adeptly discussing and resolving related challenges. Built on a Large Language Model (LLM) framework, ChatHuman is trained to autonomously select, apply, and interpret a diverse set of tools in response to user inputs. Our approach overcomes significant hurdles in adapting LLMs to 3D human tasks, including the need for domain-specific knowledge and the ability to interpret complex 3D outputs. The innovations of ChatHuman include leveraging academic publications to instruct the LLM on tool usage, employing a retrieval-augmented generation model to create in-context learning examples for managing new tools, and effectively discriminating between and integrating tool results by transforming specialized 3D outputs into comprehensible formats. Experiments demonstrate that ChatHuman surpasses existing models in both tool selection accuracy and overall performance across various 3D human tasks, and it supports interactive chatting with users. ChatHuman represents a significant step toward consolidating diverse analytical methods into a unified, robust system for 3D human tasks.
project pdf Paper DOI BibTeX

Perceiving Systems Conference Paper InteractVLM: 3D Interaction Reasoning from 2D Foundational Models Dwivedi, S. K., Antić, D., Tripathi, S., Taheri, O., Schmid, C., Black, M. J., Tzionas, D. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 22605-22615, June 2025 (Published)
We introduce InteractVLM, a novel method to estimate 3D contact points on human bodies and objects from single in-the-wild images, enabling accurate human-object joint reconstruction in 3D. This is challenging due to occlusions, depth ambiguities, and widely varying object shapes. Existing methods rely on 3D contact annotations collected via expensive motion-capture systems or tedious manual labeling, limiting scalability and generalization. To overcome this, InteractVLM harnesses the broad visual knowledge of large Vision-Language Models (VLMs), fine-tuned with limited 3D contact data. However, directly applying these models is non-trivial, as they reason only in 2D, while human-object contact is inherently 3D. Thus we introduce a novel Render-Localize-Lift module that: (1) embeds 3D body and object surfaces in 2D space via multi-view rendering, (2) trains a novel multi-view localization model (MV-Loc) to infer contacts in 2D, and (3) lifts these to 3D. Additionally, we propose a new task called Semantic Human Contact estimation, where human contact predictions are conditioned explicitly on object semantics, enabling richer interaction modeling. InteractVLM outperforms existing work on contact estimation and also facilitates 3D reconstruction from an in-the-wild image.
Project Paper Code Video BibTeX

Conference Paper PICO: Reconstructing 3D People In Contact with Objects Cseke, A., Tripathi, S., Dwivedi, S. K., Lakshmipathy, A., Chatterjee, A., Black, M. J., Tzionas, D. June 2025 (Published) arXiv project BibTeX

Perceiving Systems Conference Paper PromptHMR: Promptable Human Mesh Recovery Wang, Y., Sun, Y., Patel, P., Daniilidis, K., Black, M. J., Kocabas, M. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025 (Published)
Human pose and shape (HPS) estimation presents challenges in diverse scenarios such as crowded scenes, person-person interactions, and single-view reconstruction. Existing approaches lack mechanisms to incorporate auxiliary "side information" that could enhance reconstruction accuracy in such challenging scenarios. Furthermore, the most accurate methods rely on cropped person detections and cannot exploit scene context while methods that process the whole image often fail to detect people and are less accurate than methods that use crops. While recent language-based methods explore HPS reasoning through large language or vision-language models, their metric accuracy is well below the state of the art. In contrast, we present PromptHMR, a transformer-based promptable method that reformulates HPS estimation through spatial and semantic prompts. Our method processes full images to maintain scene context and accepts multiple input modalities: spatial prompts like bounding boxes and masks, and semantic prompts like language descriptions or interaction labels. PromptHMR demonstrates robust performance across challenging scenarios: estimating people from bounding boxes as small as faces in crowded scenes, improving body shape estimation through language descriptions, modeling person-person interactions, and producing temporally coherent motions in videos. Experiments on benchmarks show that PromptHMR achieves state-of-the-art performance while offering flexible prompt-based control over the HPS estimation process.
arXiv project video BibTeX

Physical Intelligence Article 3D Locomotion of Surface-Rolling Microrobots: A Trade-off between Hydrodynamic Wall and Gravitational Effects Park, M., Bozuyuk, U., Yildiz, E., Min, H., Yoon, J., Sitti, M. Advanced Intelligent Systems, 7:2500381, May 2025 (Published)
Synthetic microrobots have gained significant attention due to their potential in various applications in biomedicine and lab-on-a-chip technologies. As a fundamental requirement, microrobots must navigate in 3D, effectively counteracting gravity to execute their tasks. However, locomotion at small scales presents numerous counterintuitive behaviors, primarily governed by the interactions between the microrobot's body and its surrounding boundaries. In this study, the locomotion of surface-rolling microrobots is investigated in 3D, particularly focusing on their ability to climb walls. Through a combination of experiments and computational fluid dynamics analyses, it is demonstrated that the influence of gravity plays a secondary role in enabling surface-rolling microrobots to climb walls. Instead, locomotion capability in 3D settings is primarily determined by interactions with surrounding boundaries. The fundamental principles of surface-rolling locomotion in 3D spaces are elucidated, and a design strategy aimed at optimizing fluid flow for efficient propulsion in future applications is proposed.
DOI URL BibTeX

Physical Intelligence Article Anisotropic Surface Microrollers for Endovascular Navigation: A Computational Analysis with a Case Study in Hepatic Perfusion Arslan, B., Bozuyuk, U., Görgülü, K., Yildiz, E., Ozturk, H., Liotta, L., Heinemann, V., Algül, H., Sitti, M. Advanced Theory and Simulations, 8:2400387, May 2025 (Published)
Magnetic surface microrollers have demonstrated promise as active drug delivery agents for targeted and minimally invasive disease treatment. Specifically, they can be employed in the circulatory system to locally release therapeutic agents at disease sites, minimizing systemic exposure and reducing side effects, particularly in the treatment of diseases like cancer. Previous research indicates that the design and shape of microrollers play a crucial role in safe navigation within blood vessels, with anisotropic microrollers exhibiting superiority due to favorable hydrodynamic interactions with nearby boundaries. In this study, the navigation potential of anisotropic microrollers is investigated in veins, venules, and capillaries through computational fluid dynamics analyses. These results indicate that robust locomotion is only achievable in larger vessels, such as veins. Subsequently, their performance is explored in a clinically relevant scenario – the hepatic circulation toward treating primary liver cancer or metastatic nodes of distant tumors (e.g., pancreatic cancer). Computational fluid dynamics analyses using data from five different patients demonstrate that robust navigation can be achieved with high actuation frequencies. Overall, the findings presented in this study lay a preliminary foundation for the potential future application of surface microrollers in vivo.
DOI URL BibTeX

Empirical Inference Conference Paper Accuracy on the wrong line: On the pitfalls of noisy data for out-of-distribution generalisation Sanyal, A., Hu, Y., Yu, Y., Ma, Y., Wang, Y., Schölkopf, B. Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS), 258:2170-2178, Proceedings of Machine Learning Research, (Editors: Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz), PMLR, May 2025 (Published) URL BibTeX

Haptic Intelligence Article Comparing Puncture-Detection Approaches for Manual Needle Insertions Through the Parietal Pleura L’Orsa, R., Zareinia, K., Sutherland, G. R., Westwick, D., Kuchenbecker, K. J. IEEE Transactions on Medical Robotics and Bionics, 7(2):455-468, May 2025 (Published)
Tube thoracostomy (chest tube insertion) is a surgical procedure that treats pneumothorax, a potentially life-threatening condition where air accumulates between the chest wall and the lungs. The literature reports high complication rates for this procedure, including accidental fatality due to poor manual depth control during tool insertion. We hypothesize that an instrumented needle-holder could help operators recognize pleural puncture and improve depth control, and we present a puncture-detection experiment that contributes toward this goal. An operator manually inserted a bevel-tip needle into ex vivo porcine ribs and through the parietal pleura via a sensorized percutaneous device that records position, force, and videos. We use this rich dataset of 63 insertions to thoroughly test four previously published data-driven puncture-detection (DDPD) algorithms against two new real-time algorithms: a custom recursive digital filter with coefficients optimized for our application, and a difference equation that compares standard deviations between adjacent sliding windows. Our algorithms achieve a precision (true positives over total identified punctures) of 23% and 22%, respectively, while the precision of existing DDPD algorithms ranges from 0% to 21%. Despite these performance improvements, our results show the limitations of DDPD algorithms and motivate new methods for detecting pleural membrane punctures in thoracostomy.
DOI BibTeX
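One of the two new real-time detectors described above compares standard deviations between adjacent sliding windows of the force signal. A rough sketch of that idea follows; the window length, threshold ratio, and synthetic force trace are invented for illustration and are not the tuned values or data from the study.

```python
import numpy as np

def sliding_std_detector(force, window=50, ratio_threshold=3.0):
    """Flag samples where the standard deviation of the force signal in the
    current window jumps relative to the immediately preceding window,
    suggesting an abrupt event such as a membrane puncture."""
    events = []
    for i in range(2 * window, len(force)):
        prev = np.std(force[i - 2 * window:i - window])
        curr = np.std(force[i - window:i])
        if prev > 1e-9 and curr / prev > ratio_threshold:
            events.append(i)
    return events

# Synthetic insertion force: a slow ramp with sensor noise and a sudden
# drop at sample 600, mimicking the loss of resistance at puncture.
rng = np.random.default_rng(1)
t = np.arange(1000)
force = 0.01 * t + rng.normal(0, 0.02, t.size)
force[600:] -= 4.0  # abrupt force drop at the simulated puncture

detections = sliding_std_detector(force)
print(detections[0] if detections else None)  # first flagged sample
```

In practice, as the paper's precision figures suggest, thresholds like this fire on many non-puncture transients too, which is what motivates combining modalities.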

Haptic Intelligence Article Enhancing Needle Puncture Detection Using High-Pass Filtering and Diffuse Reflectance L’Orsa, R., Bisht, A., Yu, L., Murari, K., Sutherland, G. R., Westwick, D. T., Kuchenbecker, K. J. Frontiers in Robotics and AI, 12(1429327):1-16, May 2025 (Published)
Chest trauma or disease progression can lead to tension pneumothorax, a condition where mounting pressurization of the pleural cavity (the space between the chest wall and the lungs) leads rapidly to cardiac arrest. In pre-hospital settings, tension pneumothorax is treated by venting the pleural cavity via a needle introduced through the chest wall. Very high failure rates (up to 94.1%) have been reported for pre-hospital needle decompression, however, and the procedure can result in the accidental puncture of critical thoracic tissues because it is performed blind. Instrumented needles could help operators more reliably identify when the tool has entered the target space. This paper investigates technical approaches to provide such support; we created an experimental system that acquires needle force and position signals, as well as the diffuse backscattered reflectance from white light carried to and collected from the needle's tip via two in-bore optical fibers. Data collection occurred while two experimenters inserted a bevel-tipped percutaneous needle into an ex vivo porcine rib section simulating human chest anatomy. Four data-driven puncture-detection (DDPD) algorithms from the literature, which are appropriate for use with the variable tool velocities produced by manual insertions, were applied to the resulting data set offline. Grid search was performed across key signal-processing parameters, high-pass filters (HPFs) were applied to examine their impact on puncture detection, and a first exploration of multimodal (ensemble) methods was performed. Combining high-pass filters with DDPD methods resulted in a 2.7-fold improvement (from 8.2% to 21.9%) in the maximum overall precision (MOP) produced by force signals. 
Applying this HPF + DDPD scheme to reflectance data streams yielded a peak MOP of 36.4%, and combining reflectance with force generated the best MOP overall (42.1%); these results represent 4.4-fold and 5.1-fold improvements, respectively, over the best MOP produced by the traditional application of DDPD algorithms to force signals alone. These results strongly support the utility of high-pass filters combined with both reflectance-only and multimodal reflectance-plus-force data-driven puncture-detection schemes for needle decompression applications.
DOI BibTeX

Haptic Intelligence Optics and Sensing Laboratory Miscellaneous Open-Source Multi-Viewpoint Surgical Telerobotics Caccianiga, G., Sharon, Y., Javot, B., Polikovsky, S., Ergün, G., Capobianco, I., Mihaljevic, A. L., Deguet, A., Kuchenbecker, K. J. Extended abstract (2 pages) presented at the ICRA Workshop on Robot-Assisted Medical Imaging (ICRA-RAMI), Atlanta, USA, May 2025 (Published) URL BibTeX

Empirical Inference Ph.D. Thesis Scalable Gaussian Processes: Advances in Iterative Methods and Pathwise Conditioning Lin, J. University of Cambridge, UK, May 2025, (Cambridge-Tübingen Fellowship Program) (Published) BibTeX

Empirical Inference Ph.D. Thesis The Geometry of Learning Via Loss Landscape Curvature Singh, S. P. ETH Zurich, Switzerland, May 2025, CLS Fellowship Program (Published) BibTeX

Social Foundations of Computation Conference Paper To Give or Not to Give? The Impacts of Strategically Withheld Recourse Chen, Y., Estornell, A., Vorobeychik, Y., Liu, Y. In Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, May 2025 (Published)
Individuals often aim to reverse undesired outcomes in interactions with automated systems, like loan denials, by either implementing system-recommended actions (recourse), or manipulating their features. While providing recourse benefits users and enhances system utility, it also provides information about the decision process that can be used for more effective strategic manipulation, especially when the individuals collectively share such information with each other. We show that this tension leads rational utility-maximizing systems to frequently withhold recourse, resulting in decreased population utility, particularly impacting sensitive groups. To mitigate these effects, we explore the role of recourse subsidies, finding them effective in increasing the provision of recourse actions by rational systems, as well as lowering the potential social cost and mitigating unfairness caused by recourse withholding.
arXiv URL BibTeX

Empirical Inference Conference Paper Training Neural Samplers with Reverse Diffusive KL Divergence He*, J., Chen*, W., Zhang*, M., Barber, D., Hernández-Lobato, J. M. Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS), 258:5167-5175, Proceedings of Machine Learning Research, (Editors: Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz), PMLR, May 2025, *equal contribution (Published) URL BibTeX

Haptic Intelligence Embodied Vision Robotics Conference Paper Visuo-Tactile Object Pose Estimation for a Multi-Finger Robot Hand with Low-Resolution In-Hand Tactile Sensing Mack, L., Grüninger, F., Richardson, B. A., Lendway, R., Kuchenbecker, K. J., Stueckler, J. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 12401-12407, Atlanta, USA, May 2025 (Published)
Accurate 3D pose estimation of grasped objects is an important prerequisite for robots to perform assembly or in-hand manipulation tasks, but object occlusion by the robot's own hand greatly increases the difficulty of this perceptual task. Here, we propose that combining visual information with binary, low-resolution tactile contact measurements from across the interior surface of an articulated robotic hand can mitigate this issue. The visuo-tactile object-pose-estimation problem is formulated probabilistically in a factor graph. The pose of the object is optimized to align with the two kinds of measurements using a robust cost function to reduce the influence of outlier readings. The advantages of the proposed approach are first demonstrated in simulation: a custom 15-DOF robot hand with one binary tactile sensor per link grasps 17 YCB objects while observed by an RGB-D camera. This low-resolution in-hand tactile sensing significantly improves object-pose estimates under high occlusion and also high visual noise. We also show these benefits through grasping tests with a preliminary real version of our tactile hand, obtaining reasonable visuo-tactile estimates of object pose at approximately 12.9 Hz on average.
DOI BibTeX

Empirical Inference Conference Paper Your Finetuned Large Language Model is Already a Powerful Out-of-distribution Detector Zhang, A., Xiao, T. Z., Liu, W., Bamler, R., Wischik, D. Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS), 258:2701-2709, Proceedings of Machine Learning Research, (Editors: Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz), PMLR, May 2025 (Published) URL BibTeX

Autonomous Learning Miscellaneous Emergence of natural and robust bipedal walking by learning from biologically plausible objectives Schumacher, P., Geijtenbeek, T., Caggiano, V., Kumar, V., Schmitt, S., Martius, G., Haeufle, D. F. iScience, 28(4):112203, April 2025 (Published)
Humans show unparalleled ability when maneuvering diverse terrains. While reinforcement learning (RL) has shown great promise for musculoskeletal simulation in the development of robust controllers, complex behaviors are only achievable under extensive use of motion data. We demonstrate that the combination of a recent RL algorithm with a biologically plausible reward is capable of learning controllers for 4 different musculoskeletal models and achieves locomotion with up to 90 muscles without demonstrations. Our controllers generalize to diverse and unseen terrains, while only a single adaptive objective function is needed for training. We validate our findings on four models in two different simulators. The RL agents perform robustly with complex 3D models, where reflex-controllers are difficult to apply, and produce close-to-natural motion. This is a first step for the motor control, biomechanics, and rehabilitation communities to generate complex human movements with RL, without using motion data or simple unrepresentative models.
DOI URL BibTeX

Perceiving Systems Ph.D. Thesis Estimating Human and Camera Motion From RGB Data Kocabas, M. April 2025 (Published)
This thesis presents a unified framework for markerless 3D human motion analysis from monocular videos, addressing three interrelated challenges that have limited the fidelity of existing approaches: (i) achieving temporally consistent and physically plausible human motion estimation, (ii) accurately modeling perspective camera effects in unconstrained settings, and (iii) disentangling human motion from camera motion in dynamic scenes. Our contributions are realized through three complementary methods. First, we introduce VIBE (Video Inference for Body Pose and Shape Estimation), a novel video pose and shape estimation framework. Despite progress on single-image 3D pose and shape estimation, existing video-based state-of-the-art methods fail to produce accurate and natural motion sequences due to a lack of ground-truth 3D motion data for training. To address this problem, we propose VIBE, which makes use of an existing large-scale motion capture dataset (AMASS) together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty is an adversarial learning framework that leverages AMASS to discriminate between real human motions and those produced by our temporal pose and shape regression networks. We define a temporal network architecture and show that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels. Second, we propose SPEC (Seeing People in the wild with Estimated Cameras), the first in-the-wild 3D human pose and shape (HPS) method that estimates the perspective camera from a single image and employs this to reconstruct 3D human bodies more accurately. Due to the lack of camera parameter information for in-the-wild images, existing 3D HPS estimation methods make several simplifying assumptions: weak-perspective projection, large constant focal length, and zero camera rotation. 
These assumptions often do not hold and we show, quantitatively and qualitatively, that they cause errors in the reconstructed 3D shape and pose. To address this, SPEC proceeds in two stages. First, we train a neural network to estimate the field of view, camera pitch, and roll given an input image. We employ novel losses that improve the camera calibration accuracy over previous work. We then train a novel network that concatenates the camera calibration to the image features and uses these together to regress 3D body shape and pose. SPEC is more accurate than the prior art on the standard benchmark (3DPW) as well as two new datasets with more challenging camera views and varying focal lengths. Specifically, we create a new photorealistic synthetic dataset (SPEC-SYN) with ground truth 3D bodies and a novel in-the-wild dataset (SPEC-MTP) with calibration and high-quality reference bodies. Third, we develop PACE (Person And Camera Estimation), a method to estimate human motion in a global scene from moving cameras. This is a highly challenging task due to the entangling of human and camera motions in the video. Existing works assume the camera is static and focus on estimating human motion in camera space. To address this problem, we propose a joint optimization framework that disentangles human and camera motions using both foreground human motion priors and background scene features. Unlike existing methods that use Simultaneous Localization and Mapping (SLAM) as initialization, we propose to tightly integrate SLAM and human motion priors in an optimization that is inspired by bundle adjustment. Specifically, we optimize human and camera motions to match both the observed human pose and scene features. 
This design combines the strengths of SLAM and motion priors, which leads to significant improvements in human and camera motion estimation. We additionally introduce a motion prior that is suitable for batch optimization, making our approach significantly more efficient than existing approaches. Finally, we propose a novel synthetic dataset that enables evaluating camera motion in addition to human motion from dynamic videos. Experiments on the synthetic and real-world datasets demonstrate that our approach substantially outperforms prior art in recovering both human and camera motions. Extensive experiments on standard benchmarks and new datasets we introduced demonstrate that our integrated approach substantially outperforms prior methods in terms of temporal consistency, reconstruction accuracy, and global motion estimation. While these results represent a significant advance in markerless human motion analysis, further work is needed to extend these techniques to multi-person scenarios, severe occlusions, and real-time applications. Overall, this thesis lays a strong foundation for more robust and accurate human motion analysis in unconstrained environments, with promising applications in robotics, augmented reality, sports analysis, and beyond.
Thesis PDF BibTeX
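The abstract's description of SPEC's conditioning scheme — concatenating the estimated camera calibration onto the image features before regressing body shape and pose — can be sketched in miniature. This is an illustrative toy only: the function names, feature sizes, and the stand-in linear head below are ours, not SPEC's actual architecture.

```python
# Toy sketch of SPEC-style conditioning: the estimated camera
# calibration is concatenated onto the image features before the body
# regressor. All names and sizes are illustrative, not SPEC's.

def regress_body(image_features, camera_params, weights):
    """Regress body parameters from [image features ; camera params]
    with a stand-in linear head (the real model is a neural network)."""
    joint = list(image_features) + list(camera_params)
    assert all(len(row) == len(joint) for row in weights)
    return [sum(w * x for w, x in zip(row, joint)) for row in weights]

image_features = [0.2, -1.3, 0.7, 0.05]   # stand-in backbone features
camera_params = [55.0, -5.0, 1.5]         # field of view, pitch, roll
weights = [[0.01] * 7 for _ in range(5)]  # 5 stand-in body parameters

body = regress_body(image_features, camera_params, weights)
print(len(body))  # 5
```

The point of the design is that the same image features yield different body estimates under different assumed cameras, which is exactly what a perspective-aware regressor needs.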

Perceiving Systems Ph.D. Thesis Understanding Human-Scene Interaction through Perception and Generation Yi, H. April 2025 (Published)
Humans are in constant contact with the world as they move through it and interact with it. Understanding Human-Scene Interactions (HSIs) is key to enhancing our perception and manipulation of three-dimensional (3D) environments, which is crucial for various applications such as gaming, architecture, and synthetic data creation. However, creating realistic 3D scenes populated by moving humans is a challenging and labor-intensive task. Existing human-scene interaction datasets are scarce, and captured motion datasets often lack scene information. This thesis addresses these challenges by leveraging three specific types of HSI constraints: (1) depth ordering constraint: humans that move in a scene are occluded by or occlude objects, thus defining the relative depth ordering of the objects, (2) collision constraint: humans move through free space and do not interpenetrate objects, (3) interaction constraint: when humans and objects are in contact, the contact surfaces occupy the same place in space. Building on these constraints, we propose three distinct methodologies: capturing HSI from a monocular RGB video, generating HSI by generating scenes from input human motions (scenes from humans), and generating human motion from scenes (humans from scenes). Firstly, we introduce MOVER, which jointly reconstructs 3D human motion and the interactive scenes from an RGB video. This optimization-based approach leverages the three aforementioned constraints to enhance the consistency and plausibility of reconstructed scene layouts and to refine the initial 3D human pose and shape estimations. Secondly, we present MIME, which takes 3D humans and a floor map as input to create realistic and interactive 3D environments. This method applies collision and interaction constraints, and employs an auto-regressive transformer architecture that integrates objects into the scene based on existing human motion.
The training data is enriched by populating the 3D-FRONT scene dataset with 3D humans. By treating human movement as a “scanner” of the environment, this method produces furniture layouts that reflect true human activities, increasing the diversity and authenticity of the environments. Lastly, we introduce TeSMo, which generates 3D human motion from given 3D scenes and text descriptions, adhering to the collision and interaction constraints. It utilizes a text-controlled, scene-aware motion generation framework based on denoising diffusion models. Annotated navigation and interaction motions are embedded within scenes to support the model’s training, allowing for the generation of diverse and realistic human-scene interactions tailored to specific settings and object arrangements. In conclusion, these methodologies significantly advance our understanding and synthesis of human-scene interactions, offering realistic modeling of 3D environments.
Thesis BibTeX

Physical Intelligence Article Navigating microalgal biohybrids through confinements with magnetic guidance Akolpoglu, M. B., Baltaci, S. F., Bozuyuk, U., Karaz, S., Sitti, M. Matter, 8:102052, April 2025 (Published)
In the natural world, microorganisms constantly navigate through confined spaces—such as those found in tissues, biological gels, and soil—yet their behavior in such environments remains poorly understood. Here, we explore this phenomenon by examining the navigation of magnetic microalgal biohybrids in constrained microenvironments. By leveraging the inherent propulsion of green microalgae and external steering capabilities acquired through the magnetization of microalgal cells, our biohybrids exhibit efficient navigation in viscous and confined microenvironments. Through high-yield fabrication and magnetic manipulation, we show precise control over their movement. Our findings reveal distinct navigation patterns influenced by magnetic guidance, namely backtracking and crossing, shedding light on the unexplored dynamics of confined locomotion assisted by magnetism. Our work highlights the significance of understanding microalgal biohybrid swimming behavior, offering crucial insights for future biotechnological and biomedical applications requiring precise navigation in confined environments.
DOI URL BibTeX

Haptic Intelligence Miscellaneous A Method for Single-Input Sequencing of Hyperelastic Balloons Gertler, I., Kuchenbecker, K. J. Extended abstract (3 pages) presented at the IEEE-RAS International Conference on Soft Robotics (RoboSoft), Lausanne, Switzerland, April 2025 (Published)
This study demonstrates that encasing a hyperelastic balloon in an inextensible sleeve greatly increases its burst pressure without affecting its minimum pressure. This simple mechanical behavior can be used to produce an asymmetric inflation-deflation sequence for coupled balloons of different thicknesses, so that they can serve as a soft robot's rear and front anchors when driven from a single fluid supply.
BibTeX

Empirical Inference Autonomous Learning Conference Paper Advancing Out-of-Distribution Detection via Local Neuroplasticity Canevaro, A., Schmidt, J., Marvi, M. S., Yu, H., Martius, G., Jordan, J. The Thirteenth International Conference on Learning Representations (ICLR), April 2025 (Published) arXiv BibTeX

Haptic Intelligence Robotics Miscellaneous Bio-Inspired Gradient (BIG) Whiskers: Stiffness-Shifting Structures Provide Dynamic Functional Benefits for Contact Sensing Schulz, A. K., Andrussow, I., Farsijani, F., Faulkner, R., Kuchenbecker, K. J. Extended abstract (3 pages) presented at the IEEE-RAS International Conference on Soft Robotics (RoboSoft), Lausanne, Switzerland, April 2025 (Published)
Mammal whiskers have inspired many sensors that can help robots find obstacles, identify textures, or sense flow. Though they vary in geometry, past bio-inspired whisker sensors were primarily constructed from homogeneous materials. Interestingly, animal whiskers tend to shift from a stiff root to a much softer point; this material stiffness gradient is hypothesized to provide functional benefits such as reduction of wear and amplification of contact sensations. We take inspiration from nature to fabricate bio-inspired gradient (BIG) whiskers via 3D printing, and we assess their performance compared to stiff, medium, and soft homogeneous artificial whiskers with the same geometry. Tests with controlled quasi-static and dynamic perturbations allow us to measure the whisker point deflection and the reaction torque at the stationary whisker root, respectively. The dynamic results reveal that BIG whiskers uniquely encode contact location along their length through torque magnitude and frequency, features that are not seen in the homogeneous whiskers. These exciting preliminary findings motivate further exploration of robotic whiskers and other sensing structures with bio-inspired stiffness gradients.
BibTeX

Haptic Intelligence Robotics Article Building Instructions You Can Feel: Edge-Changing Haptic Devices for Digitally Guided Construction Tashiro, N., Faulkner, R., Melnyk, S., Rosales Rodriguez, T., Javot, B., Tahouni, Y., Cheng, T., Wood, D., Menges, A., Kuchenbecker, K. J. ACM Transactions on Computer-Human Interaction, 32(1):1-40, April 2025 (Published)
Recent efforts to connect builders to digital designs during construction have primarily focused on visual augmented reality, which requires accurate registration and specific lighting, and which could prevent a user from noticing safety hazards. Haptic interfaces, on the other hand, can convey physical design parameters through tangible local cues that do not distract from the surroundings. We propose two edge-changing haptic devices that use small inertial measurement units (IMUs) and linear actuators to guide users to perform construction tasks in real time: Drangle gives feedback for angling a drill relative to gravity, and Brangle assists with orienting bricks in the plane. We conducted a study with 18 participants to evaluate user performance and gather qualitative feedback. All users understood the edge-changing cues from both devices with minimal training. Drilling holes with Drangle was somewhat less accurate but much faster and easier than with a mechanical guide; 89% of participants preferred Drangle over the mechanical guide. Users generally understood Brangle's feedback but found its hand-size-specific grip, palmar contact, and attractive tactile cues less intuitive than Drangle's generalized form factor, fingertip contact, and repulsive cues. After summarizing design considerations, we propose application scenarios and speculate how such devices could improve construction workflows.
DOI BibTeX
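Drangle's core quantity — how far the drill axis deviates from the gravity vector — can be estimated from a static IMU accelerometer reading. The following is a minimal sketch only; it assumes the sensor's z-axis is mounted along the drill axis (a mounting convention we introduce here), and it does not reproduce the paper's actual signal processing or actuation.

```python
import math

def drill_angle_from_gravity(accel):
    """Angle (degrees) between the drill axis and the gravity vector,
    from a static accelerometer reading (ax, ay, az) in m/s^2.

    Assumes the IMU's z-axis is aligned with the drill axis -- a
    mounting convention for this sketch, not a claim about Drangle.
    """
    ax, ay, az = accel
    norm = math.sqrt(ax * ax + ay * ay + az * az)
    if norm == 0.0:
        raise ValueError("zero accelerometer reading")
    # A static accelerometer measures the reaction to gravity, so the
    # normalized z-component is the cosine of the tilt angle.
    cos_theta = max(-1.0, min(1.0, az / norm))
    return math.degrees(math.acos(cos_theta))

# Drill pointing straight down: axis aligned with gravity.
print(round(drill_angle_from_gravity((0.0, 0.0, 9.81)), 1))   # 0.0
# Equal gravity reaction on x and z: a 45-degree tilt.
print(round(drill_angle_from_gravity((9.81, 0.0, 9.81)), 1))  # 45.0
```

In a real device this tilt estimate would then be mapped to the edge-changing actuators as a guidance cue.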

Empirical Inference Perceiving Systems Conference Paper Can Large Language Models Understand Symbolic Graphics Programs? Qiu, Z., Liu, W., Feng, H., Liu, Z., Xiao, T. Z., Collins, K. M., Tenenbaum, J. B., Weller, A., Black, M. J., Schölkopf, B. The Thirteenth International Conference on Learning Representations (ICLR), April 2025 (Published)
Against the backdrop of enthusiasm for large language models (LLMs), there is a growing need to scientifically assess their capabilities and shortcomings. This is nontrivial in part because it is difficult to find tasks which the models have not encountered during training. Utilizing symbolic graphics programs, we propose a domain well-suited to test multiple spatial-semantic reasoning skills of LLMs. Popular in computer graphics, these programs procedurally generate visual data. While LLMs exhibit impressive skills in general program synthesis and analysis, symbolic graphics programs offer a new layer of evaluation: they allow us to test an LLM’s ability to answer semantic questions about the images or 3D geometries without a vision encoder. To semantically understand the symbolic programs, LLMs would need to possess the ability to “imagine” and reason about how the corresponding graphics content would look with only the symbolic description of the local curvatures and strokes. We use this task to evaluate LLMs by creating a large benchmark for the semantic visual understanding of symbolic graphics programs, built procedurally with minimal human effort. Particular emphasis is placed on transformations of images that leave the image-level semantics invariant while introducing significant changes to the underlying program. We evaluate commercial and open-source LLMs on our benchmark to assess their ability to reason about the visual output of programs, finding that LLMs considered stronger at reasoning generally perform better. Lastly, we introduce a novel method to improve this ability – Symbolic Instruction Tuning (SIT), in which the LLM is finetuned with pre-collected instruction data on symbolic graphics programs. Interestingly, we find that SIT not only improves the LLM’s understanding of symbolic graphics programs, but it also improves general reasoning ability on various other benchmarks.
arXiv Paper BibTeX
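The benchmark's key property — semantic questions whose answers are known by construction, so no human labeling is needed — can be illustrated with a toy generator. The SVG snippet, function names, and question template below are our own illustration of the idea, not the paper's actual pipeline.

```python
def make_svg_program(shape, cx, cy, r, fill):
    """Procedurally emit a tiny SVG 'symbolic graphics program'
    (illustrative only; the paper uses richer program families)."""
    if shape == "circle":
        body = f'<circle cx="{cx}" cy="{cy}" r="{r}" fill="{fill}"/>'
    else:
        raise ValueError(f"unsupported shape: {shape}")
    return f'<svg width="100" height="100">{body}</svg>'

def make_benchmark_item(program, answer):
    """Pair the program with a semantic question; because the program
    was generated from known parameters, the ground-truth answer comes
    for free, with no rendering and no human annotation."""
    return {
        "program": program,
        "question": "What color is the shape this program draws?",
        "answer": answer,
    }

item = make_benchmark_item(make_svg_program("circle", 50, 50, 20, "red"), "red")
print(item["answer"])  # red
```

An LLM is then shown only `item["program"]` and `item["question"]`, so answering correctly requires "imagining" the rendered output from the symbolic description alone.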

Empirical Inference Conference Paper Compositional simulation-based inference for time series Gloeckler*, M., Toyota*, S., Fukumizu, K., Macke, J. H. The Thirteenth International Conference on Learning Representations (ICLR), April 2025 (Published) arXiv BibTeX

Empirical Inference Robust Machine Learning Conference Paper Cross-Entropy Is All You Need to Invert the Data Generating Process Reizinger*, P., Bizeul*, A., Juhos*, A., Vogt, J. E., Balestriero, R., Brendel, W., Klindt, D. The Thirteenth International Conference on Learning Representations (ICLR), April 2025, *Joint first authorship (Published) arXiv BibTeX

Empirical Inference Conference Paper Differentially private steering for Large language model alignment Goel, A., Hu, Y., Gurevych, I., Sanyal, A. The Thirteenth International Conference on Learning Representations (ICLR), April 2025 (Published) arXiv BibTeX

Empirical Inference Perceiving Systems Conference Paper Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets Liu, Z., Xiao, T. Z., Liu, W., Bengio, Y., Zhang, D. The Thirteenth International Conference on Learning Representations (ICLR), April 2025 (Published)
While one commonly trains large diffusion models by collecting datasets on target downstream tasks, it is often desired to align and finetune pretrained diffusion models with some reward functions that are either designed by experts or learned from small-scale datasets. Existing post-training methods for reward finetuning of diffusion models typically suffer from lack of diversity in generated samples, lack of prior preservation, and/or slow convergence in finetuning. Inspired by recent successes in generative flow networks (GFlowNets), a class of probabilistic models that sample in proportion to an unnormalized density given by a reward function, we propose a novel GFlowNet method dubbed Nabla-GFlowNet (abbreviated as ∇-GFlowNet), the first GFlowNet method that leverages the rich signal in reward gradients, together with an objective called ∇-DB plus its variant residual ∇-DB designed for prior-preserving diffusion finetuning. We show that our proposed method achieves fast yet diversity- and prior-preserving finetuning of Stable Diffusion, a large-scale text-conditioned image diffusion model, on different realistic reward functions.
arXiv BibTeX

Empirical Inference Conference Paper Improving Probabilistic Diffusion Models With Optimal Covariance Matching Ou*, Z., Zhang*, M., Zhang, A., Xiao, T. Z., Li, Y., Barber, D. The Thirteenth International Conference on Learning Representations (ICLR), April 2025, *equal contribution (Published) arXiv BibTeX

Empirical Inference Conference Paper Influence Functions for Scalable Data Attribution in Diffusion Models Mlodozeniec, B. K., Eschenhagen, R., Bae, J., Immer, A., Krueger, D., Turner, R. E. The Thirteenth International Conference on Learning Representations (ICLR), April 2025 (Published) arXiv BibTeX

Empirical Inference Robust Machine Learning Conference Paper Interaction Asymmetry: A General Principle for Learning Composable Abstractions Brady, J., von Kügelgen, J., Lachapelle, S., Buchholz, S., Kipf*, T., Brendel*, W. The Thirteenth International Conference on Learning Representations (ICLR), April 2025, *joint senior author (Published) arXiv BibTeX

Empirical Inference Conference Paper Language Model Alignment in Multilingual Trolley Problems Jin, Z., Kleiman-Weiner, M., Piatti, G., Levine, S., Liu, J., Gonzalez, F., Ortu, F., Strausz, A., Sachan, M., Mihalcea, R., Choi, Y., Schölkopf, B. The Thirteenth International Conference on Learning Representations (ICLR), April 2025 (Published) arXiv BibTeX

Social Foundations of Computation Conference Paper Limits to Predicting Online Speech Using Large Language Models Remeli, M., Hardt, M., Williamson, R. C. April 2025 (Submitted)
We study the predictability of online speech on social media, and whether predictability improves with information outside a user's own posts. Recent work suggests that the predictive information contained in posts written by a user's peers can surpass that of the user's own posts. Motivated by the success of large language models, we empirically test this hypothesis. We define unpredictability as a measure of the model's uncertainty, i.e., its negative log-likelihood on future tokens given context. As the basis of our study, we collect a corpus of 6.25M posts from more than five thousand X (previously Twitter) users and their peers. Across three large language models ranging in size from 1 billion to 70 billion parameters, we find that predicting a user's posts from their peers' posts performs poorly. Moreover, the value of the user's own posts for prediction is consistently higher than that of their peers'. Across the board, we find that the predictability of social media posts remains low, comparable to predicting financial news without context. We extend our investigation with a detailed analysis of the causes of unpredictability and the robustness of our findings. Specifically, we observe that a significant amount of predictive uncertainty comes from hashtags and @-mentions. Moreover, our results replicate if, instead of prompting the model with additional context, we finetune on additional context.
arXiv BibTeX
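The unpredictability measure the abstract defines — negative log-likelihood on future tokens given context — is straightforward to compute once a language model has assigned a probability to each observed next token. A minimal sketch (the function name and the example probabilities are ours, not the paper's code):

```python
import math

def unpredictability(token_probs):
    """Mean negative log-likelihood (nats per token) that a language
    model assigns to the true future tokens, given some context.
    Lower values mean the text was more predictable for the model."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# A model nearly certain of every token: NLL close to 0.
print(round(unpredictability([0.99, 0.98, 0.97]), 3))  # 0.02
# A model reduced to uniform guessing over a 50k-token vocabulary.
print(round(unpredictability([1 / 50000] * 3), 3))     # 10.82
```

In the paper's setting, the interesting comparison is how this quantity changes when the conditioning context is the user's own history versus their peers' posts.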

Haptic Intelligence Conference Paper My Robot, My Motion: Expressive Real-Time Teleoperation Mohan, M., Kuchenbecker, K. J. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction (HRI), 1797-1799, Hands-on demonstration presented at the ACM/IEEE International Conference on Human-Robot Interaction (HRI), Melbourne, Australia, April 2025 (Published)
Humanoid social robots need to be able to move expressively. Traditional manipulation-focused teleoperation systems primarily control the end-effector's position and orientation, neglecting the extra degrees of freedom in human and robotic arms, which can lead to unnatural movements. This demonstration presents our Optimization-based Customizable Retargeting Algorithm (OCRA), designed for real-time motion mapping between dissimilar kinematic chains. OCRA functions well with widely varying robot-arm joint configurations. The presenter will use a commercial motion-capture suit to teleoperate the upper body of a NAO humanoid robot, demonstrating OCRA's ability to create intuitive, human-like movements in real time.
DOI URL BibTeX

Empirical Inference Autonomous Learning Conference Paper On the Transfer of Object-Centric Representation Learning Didolkar, A. R., Zadaianchuk, A., Goyal, A., Mozer, M. C., Bengio, Y., Martius*, G., Seitzer*, M. The Thirteenth International Conference on Learning Representations (ICLR), April 2025, *equal contribution (Published) URL BibTeX

Empirical Inference Conference Paper Preference Elicitation for Offline Reinforcement Learning Pace, A., Schölkopf, B., Rätsch, G., Ramponi, G. The Thirteenth International Conference on Learning Representations (ICLR), April 2025 (Published) arXiv BibTeX

Haptic Intelligence Article Simulation Training with Haptic Feedback of Instrument Vibrations Reduces Resident Workload During Live Robot-Assisted Sleeve Gastrectomy Gomez, E. D., Mat Husin, H., Dumon, K. R., Williams, N. N., Kuchenbecker, K. J. Surgical Endoscopy, 39(3):1523-1535, April 2025 (Published)
Background: New surgeons experience heavy workload during robot-assisted surgery partially because they must use vision to compensate for the lack of haptic feedback. We hypothesize that providing realistic haptic feedback during dry-lab simulation training may accelerate learning and reduce workload during subsequent surgery on patients. Methods: We conducted a single-blinded study with twelve general surgery residents (third and seventh post-graduate year, PGY) randomized into haptic and control groups. Participants performed five simulated bariatric surgeries on a custom inanimate simulator followed by live robot-assisted sleeve gastrectomies (RASGs) using da Vinci robots. The haptic group received naturalistic haptic feedback of instrument vibrations during their first four simulated procedures. Participants completed pre-/post-procedure STAI and post-procedure NASA-TLX questionnaires in both simulation and the operating room (OR). Results: Higher PGY level (simulation: p<0.001, OR p=0.004), shorter operative time (simulation: p<0.001, OR: p=0.003), and lower pre-procedure STAI (simulation: p=0.003, OR: p<0.001) were significantly associated with lower self-reported overall workload in both operative settings; PGY-7s reported about 10% lower workload than PGY-3s. The haptic group had significantly lower overall covariate-adjusted NASA-TLX during the fourth (p=0.03) and fifth (p=0.04) simulated procedures and across all OR procedures (p=0.047), though not for only the first three OR procedures. Haptic feedback reduced physical demand (simulation: p<0.001, OR: p=0.001) and increased perceived performance (simulation: p=0.031, OR: p<0.001) in both settings. Conclusion: Haptic feedback of instrument vibrations provided during robotic surgical simulation reduces trainee workload during both simulation and live OR cases. The implications of workload reduction and its potential effects on patient safety warrant further investigation.
DOI BibTeX

Empirical Inference Conference Paper Standardizing Structural Causal Models Ormaniec*, W., Sussex*, S., Lorch*, L., Schölkopf, B., Krause, A. The Thirteenth International Conference on Learning Representations (ICLR), April 2025, *equal contribution (Published) arXiv BibTeX

Empirical Inference Conference Paper The Directionality of Optimization Trajectories in Neural Networks Singh, S. P., He, B., Hofmann, T., Schölkopf, B. The Thirteenth International Conference on Learning Representations (ICLR), April 2025 (Published) URL BibTeX