Publications

DEPARTMENTS

Emperical Interference

Haptic Intelligence

Modern Magnetic Systems

Perceiving Systems

Physical Intelligence

Robotic Materials

Social Foundations of Computation


Research Groups

Autonomous Vision

Autonomous Learning

Bioinspired Autonomous Miniature Robots

Dynamic Locomotion

Embodied Vision

Human Aspects of Machine Learning

Intelligent Control Systems

Learning and Dynamical Systems

Locomotion in Biorobotic and Somatic Systems

Micro, Nano, and Molecular Systems

Movement Generation and Control

Neural Capture and Synthesis

Physics for Inference and Optimization

Organizational Leadership and Diversity

Probabilistic Learning Group


Topics

Robot Learning

Conference Paper

2022

Autonomous Learning

Robotics

AI

Career

Award


Social Foundations of Computation Algorithms and Society Conference Paper Questioning the Survey Responses of Large Language Models Dominguez-Olmedo, R., Hardt, M., Mendler-Dünner, C. Advances in Neural Information Processing Systems 37 (NeurIPS 2024), The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), Oral, December 2024 (Published)
As large language models increase in capability, researchers have started to conduct surveys of all kinds on these models in order to investigate the population represented by their responses. In this work, we critically examine language models' survey responses on the basis of the well-established American Community Survey by the U.S. Census Bureau and investigate whether they elicit a faithful representations of any human population. Using a de-facto standard multiple-choice prompting technique and evaluating 39 different language models using systematic experiments, we establish two dominant patterns: First, models' responses are governed by ordering and labeling biases, leading to variations across models that do not persist after adjusting for systematic biases. Second, models' responses do not contain the entropy variations and statistical signals typically found in human populations. As a result, a binary classifier can almost perfectly differentiate model-generated data from the responses of the U.S. census. At the same time, models' relative alignment with different demographic subgroups can be predicted from the subgroups' entropy, irrespective of the model's training data or training strategy. Taken together, our findings suggest caution in treating models' survey responses as equivalent to those of human populations.
ArXiv URL BibTeX

Empirical Inference Conference Paper Shaving Weights with Occam’s Razor: Bayesian Sparsification for Neural Networks using the Marginal Likelihood Dhahri, R., Immer, A., Charpentier, B., Günnemann, S., Fortuin, V. Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 37:24959-24989, (Editors: A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang), Curran Associates, Inc., 38th Annual Conference on Neural Information Processing Systems, December 2024 (Published) URL BibTeX

Empirical Inference Conference Paper Sourcerer: Sample-based Maximum Entropy Source Distribution Estimation Vetter, J., Moss, G., Schröder, C., Gao, R., Macke, J. H. Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 37:88772-88806, (Editors: A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang), Curran Associates, Inc., 38th Annual Conference on Neural Information Processing Systems, December 2024 (Published) URL BibTeX

Social Foundations of Computation Conference Paper The Fairness-Quality Trade-off in Clustering Hakim, R., Stoica, A., Papadimitriou, C. H., Yannakakis, M. Advances in Neural Information Processing Systems 37 (NeurIPS 2024), The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), December 2024 (Published) arXiv URL BibTeX

Empirical Inference Conference Paper Theoretical Characterisation of the Gauss Newton Conditioning in Neural Networks Zhao*, J., Singh*, S. P., Lucchi, A. Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 37:114965-115000, (Editors: A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang), Curran Associates, Inc., 38th Annual Conference on Neural Information Processing Systems, December 2024, *equal contribution (Published) URL BibTeX

Empirical Inference Conference Paper What Makes and Breaks Safety Fine-tuning? A Mechanistic Study Jain, S., Lubana, E. S., Oksuz, K., Joy, T., Torr, P., Sanyal, A., Dokania, P. K. Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 37:93406-93478, (Editors: A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang), Curran Associates, Inc., 38th Annual Conference on Neural Information Processing Systems, December 2024 (Published) URL BibTeX

Perceiving Systems Article PuzzleAvatar: Assembling 3D Avatars from Personal Albums Xiu, Y., Liu, Z., Tzionas, D., Black, M. J. ACM Transactions on Graphics, 43(6):1-15, ACM, December 2024 (Published)
Generating personalized 3D avatars is crucial for AR/VR. However, recent text-to-3D methods that generate avatars for celebrities or fictional characters, struggle with everyday people. Methods for faithful reconstruction typically require full-body images in controlled settings. What if a user could just upload their personal "OOTD" (Outfit Of The Day) photo collection and get a faithful avatar in return? The challenge is that such casual photo collections contain diverse poses, challenging viewpoints, cropped views, and occlusion (albeit with a consistent outfit, accessories and hairstyle). We address this novel "Album2Human" task by developing PuzzleAvatar, a novel model that generates a faithful 3D avatar (in a canonical pose) from a personal OOTD album, while bypassing the challenging estimation of body and camera pose. To this end, we fine-tune a foundational vision-language model (VLM) on such photos, encoding the appearance, identity, garments, hairstyles, and accessories of a person into (separate) learned tokens and instilling these cues into the VLM. In effect, we exploit the learned tokens as "puzzle pieces" from which we assemble a faithful, personalized 3D avatar. Importantly, we can customize avatars by simply inter-changing tokens. As a benchmark for this new task, we collect a new dataset, called PuzzleIOI, with 41 subjects in a total of nearly 1K OOTD configurations, in challenging partial photos with paired ground-truth 3D bodies. Evaluation shows that PuzzleAvatar not only has high reconstruction accuracy, outperforming TeCH and MVDreamBooth, but also a unique scalability to album photos, and strong robustness. Our code and data are publicly available for research purpose.
DOI URL BibTeX

Perceiving Systems Conference Paper SPARK: Self-supervised Personalized Real-time Monocular Face Capture Baert, K., Bharadwaj, S., Castan, F., Maujean, B., Christie, M., Abrevaya, V., Boukhayma, A. In SIGGRAPH Asia 2024 Conference Proceedings, SIGGRAPH Asia, December 2024 (Published)
Feedforward monocular face capture methods seek to reconstruct posed faces from a single image of a person. Current state of the art approaches have the ability to regress parametric 3D face models in real-time across a wide range of identities, lighting conditions and poses by leveraging large image datasets of human faces. These methods however suffer from clear limitations in that the underlying parametric face model only provides a coarse estimation of the face shape, thereby limiting their practical applicability in tasks that require precise 3D reconstruction (aging, face swapping, digital make-up, ...). In this paper, we propose a method for high-precision 3D face capture taking advantage of a collection of unconstrained videos of a subject as prior information. Our proposal builds on a two stage approach. We start with the reconstruction of a detailed 3D face avatar of the person, capturing both precise geometry and appearance from a collection of videos. We then use the encoder from a pre-trained monocular face reconstruction method, substituting its decoder with our personalized model, and proceed with transfer learning on the video collection. Using our pre-estimated image formation model, we obtain a more precise self-supervision objective, enabling improved expression and pose alignment. This results in a trained encoder capable of efficiently regressing pose and expression parameters in real-time from previously unseen images, which combined with our personalized geometry model yields more accurate and high fidelity mesh inference. Through extensive qualitative and quantitative evaluation, we showcase the superiority of our final model as compared to state-of-the-art baselines, and demonstrate its generalization ability to unseen pose, expression and lighting.
DOI URL BibTeX

Perceiving Systems Article StableNormal: Reducing Diffusion Variance for Stable and Sharp Normal Ye, C., Qiu, L., Gu, X., Zuo, Q., Wu, Y., Dong, Z., Bo, L., Xiu, Y., Han, X. ACM Transactions on Graphics, 43(6):1-18, ACM, December 2024 (Published)
This work addresses the challenge of high-quality surface normal estimation from monocular colored inputs (i.e., images and videos), a field which has recently been revolutionized by repurposing diffusion priors. However, previous attempts still struggle with stochastic inference, conflicting with the deterministic nature of the Image2Normal task, and costly ensembling step, which slows down the estimation process. Our method, StableNormal, mitigates the stochasticity of the diffusion process by reducing inference variance, thus producing "Stable-and-Sharp" normal estimates without any additional ensembling process. StableNormal works robustly under challenging imaging conditions, such as extreme lighting, blurring, and low quality. It is also robust against transparent and reflective surfaces, as well as cluttered scenes with numerous objects. Specifically, StableNormal employs a coarse-to-fine strategy, which starts with a one-step normal estimator (YOSO) to derive an initial normal guess, that is relatively coarse but reliable, then followed by a semantic-guided refinement process (SG-DRN) that refines the normals to recover geometric details. The effectiveness of StableNormal is demonstrated through competitive performance in standard datasets such as DIODE-indoor, iBims, ScannetV2 and NYUv2, and also in various downstream tasks, such as surface reconstruction and normal enhancement. These results evidence that StableNormal retains both the "stability" and "sharpness" for accurate normal estimation. StableNormal represents a baby attempt to repurpose diffusion priors for deterministic estimation. To democratize this, code and models have been publicly available.
DOI BibTeX

Perceiving Systems Ph.D. Thesis Beyond the Surface: Statistical Approaches to Internal Anatomy Prediction Keller, M. University of Tübingen, November 2024 (Published)
The creation of personalized anatomical digital twins is important in the fields of medicine, computer graphics, sports science, and biomechanics. But to observe a subject’s anatomy, expensive medical devices (MRI or CT) are required and creating a digital model is often time-consuming and involves manual effort. Instead, we can leverage the fact that the shape of the body surface is correlated with the internal anatomy; indeed, the external body shape is related to the bone lengths, the angle of skeletal articulation, and the thickness of various soft tissues. In this thesis, we leverage the correlation between body shape and anatomy and aim to infer the internal anatomy solely from the external appearance. Learning this correlation requires paired observations of people’s body shape, and their internal anatomy, which raises three challenges. First, building such datasets requires specific capture modalities. Second, these data must be annotated, i.e. the body shape and anatomical structures must be identified and segmented, which is often a tedious manual task requiring expertise. Third, to learn a model able to capture the correlation between body shape and internal anatomy, the data of people with various shapes and poses has to be put into correspondence. In this thesis, we cover three works that focus on learning this correlation. We show that we can infer the skeleton geometry, the bone location inside the body, and the soft tissue location solely from the external body shape. First, in the OSSO project, we leverage 2D medical scans to construct a paired dataset of 3D body shapes and corresponding 3D skeleton shapes. This dataset allows us to learn the correlation between body and skeleton shapes, enabling the inference of a custom skeleton based on an individual’s body. However, since this learning process is based on static views of subjects in specific poses, we cannot evaluate the accuracy of skeleton inference in different poses. To predict the bone orientation within the body in various poses, we need dynamic data. To track bones inside the body in motion, we can leverage methods from the biomechanics field. So in the second work, instead of medical imaging, we use a biomechanical skeletal model along with simulation to build a paired dataset of bodies in motion and their corresponding skeletons. In this work, we build such a dataset and learn SKEL, a body shape and skeleton model that includes the locations of anatomical bones from any body shape and in any pose. After dealing with the skeletal structure, we broaden our focus to include different layers of soft tissues. In the third work, HIT, we leverage segmented medical data to learn to predict the distribution of adipose tissues (fat) and lean tissues (muscle, organs, etc.) inside the body.
pdf URL BibTeX

Deep Models and Optimization Conference Paper Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise Monzio Compagnoni, E., Liu, T., Islamov, R., Proske, F. N., Orvieto, A., Lucchi, A. In The Thirteenth International Conference on Learning Representations, ICLR 2025, The Thirteenth International Conference on Learning Representations, November 2024 (Accepted) BibTeX

Safety- and Efficiency- aligned Learning Conference Paper Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers Singh, S., Singhania, P., Ranjan, A., Kirchenbauer, J., Geiping, J., Wen, Y., Jain, N., Hans, A., Shu, M., Tomar, A., Goldstein, T., Bhatele, A. International Conference for High Performance Computing, Networking, Storage and Analysis SC (SC24), 36-49, Supercomputing, IEEE Digital Library, Atlanta, GA, International Conference for High Performance Computing, November 2024 (Published) DOI URL BibTeX

Perceiving Systems Ph.D. Thesis Aerial Markerless Motion Capture Saini, N. November 2024 (Published)
Human motion capture (mocap) is important for several applications such as healthcare, sports, animation etc. Existing markerless mocap methods employ multiple static and calibrated RGB cameras to infer the subject’s pose. These methods are not suitable for outdoor and unstructured scenarios. They need an extra calibration step before the mocap session and cannot dynamically adapt the viewpoint for the best mocap performance. A mocap setup consisting of multiple unmanned aerial vehicles with onboard cameras is ideal for such situations. However, estimating the subject’s motion together with the camera motions is an under-constrained problem. In this thesis, we explore multiple approaches where we split this problem into multiple stages. We obtain the prior knowledge or rough estimates of the subject’s or the cameras’ motion in the initial stages and exploit them in the final stages. In our work AirCap-Pose-Estimator, we use extra sensors (an IMU and a GPS receiver) on the multiple moving cameras to obtain the approximate camera poses. We use these estimates to jointly optimize the camera poses, the 3D body pose and the subject’s shape to robustly fit the 2D keypoints of the subject. We show that the camera pose estimates using just the sensors are not accurate enough, and our joint optimization formulation improves the accuracy of the camera poses while estimating the subject’s poses. Placing extra sensors on the cameras is not always feasible. That is why, in our work AirPose, we introduce a distributed neural network that runs on board, estimating the subject’s motion and calibrating the cameras relative to the subject. We utilize realistic human scans with ground truth to train our network. We further fine-tune it using a small amount of real-world data. Finally, we propose a bundle-adjustment method (AirPose+), which utilizes the initial estimates from our network to recover high-quality motions of the subject and the cameras. Finally, we consider a generic setup consisting of multiple static and moving cameras. We propose a method that estimates the poses of the cameras and the human relative to the ground plane using only 2D human keypoints. We learn a human motion prior using a large amount of human mocap data and use it in a novel multi-stage optimization approach to fit the SMPL human body model and the camera poses to the 2D keypoints. We show that in addition to the aerial cameras, our method works for smartphone cameras and standard RGB ground cameras. This thesis advances the field of markerless mocap which is currently limited to multiple static calibrated RGB cameras. Our methods allow the user to use moving RGB cameras and skip the extrinsic calibration. In the future, we will explore the usage of a single moving camera without even needing camera intrinsics.
thesis BibTeX

Organizational Leadership and Diversity Article From challenges to opportunities: navigating the human response to automated agents in the workplace Ðula, I., Berberena, T., Keplinger, K., Wirzberger, M. Humanities and Social Sciences Communications, 11:1454, November 2024 (Published)
Workers are increasingly embracing Artificial Intelligence (AI) to optimise various aspects of their operations in the workplace. While AI offers new opportunities, it also presents unintended challenges that they must carefully navigate. This paper aims to develop a deeper understanding of workers’ experiences with interactions with automated agents (AA) in the workplace and provide actionable recommendations for organisational leaders to achieve positive outcomes. We propose and test a simulation model that quantifies and predicts workers’ experiences with AA, shedding light on the interplay of diverse variables, such as workload, effort and trust. Our findings suggest that lower-efficiency AA might outperform higher-efficiency ones due to the constraining influence of trust on adoption rates. Additionally, we find that lower initial trust in AA could lead to increased usage in certain scenarios and that stronger emotional and social responses to the use of AA may foster greater trust but result in decreased AA utilisation. This interdisciplinary research blends a systems dynamics approach with management theories and psychological concepts, aiming to bridge existing gaps and foster the sustainable and effective implementation of AA in the workplace. Ultimately, our research endeavour contributes to advancing the field of human-AI interaction in the workplace.
navigating the human response to automated agents in the workplace navigating the human response to automated agents in the workplace DOI URL BibTeX

Haptic Intelligence Ph.D. Thesis Data-Driven Needle Puncture Detection for the Delivery of Urgent Medical Care in Space L’Orsa, R. University of Calgary, Calgary, Canada, November 2024, Department of Electrical and Computer Engineering (Published)
Needle thoracostomy (NT) is a surgical procedure that treats one of the most preventable causes of trauma-related death: dangerous accumulations of air between the chest wall and the lungs. However, needle-tip overshoot of the target space can result in the inadvertent puncture of critical structures like the heart. This type of complication is fatal without urgent surgical care, which is not available in resource-poor environments like space. Since NT is done blind, operators rely on tool sensations to identify when the needle has reached its target. Needle instrumentation could enable puncture notifications to help operators limit tool-tip overshoot, but such a solution requires reliable puncture detection from manual (i.e., variable-velocity) needle insertion data streams. Data-driven puncture-detection (DDPD) algorithms are appropriate for this application, but their performance has historically been unacceptably low for use in safety-critical applications. This work contributes towards the development of an intelligent device for manual NT assistance by proposing two novel DDPD algorithms. Three data sets are collected that provide needle forces and displacements acquired during insertions into ex vivo porcine tissue analogs for the human chest, and factors affecting DDPD algorithm performance are analyzed in these data. Puncture event features are examined for each sensor, and the suitability of both accelerometer measurements and diffuse reflectance measurements are evaluated within the context of NT. Finally, DDPD ensembles are proposed that yield a 5.1-fold improvement in precision as compared to the traditional force-only DDPD approach. These results lay a foundation for improving the urgent delivery of percutaneous procedures in space and other resource-poor settings.
BibTeX

Haptic Intelligence Autonomous Learning Empirical Inference Miscellaneous Demonstration: Minsight - A Soft Vision-Based Tactile Sensor for Robotic Fingertips Andrussow, I., Sun, H., Martius, G., Kuchenbecker, K. J. Hands-on demonstration presented at the Conference on Robot Learning (CoRL), Munich, Germany, November 2024 (Published)
Beyond vision and hearing, tactile sensing enhances a robot's ability to dexterously manipulate unfamiliar objects and safely interact with humans. Giving touch sensitivity to robots requires compact, robust, affordable, and efficient hardware designs, especially for high-resolution tactile sensing. We present a soft vision-based tactile sensor engineered to meet these requirements. Comparable in size to a human fingertip, Minsight uses machine learning to output high-resolution directional contact force distributions at 60 Hz. Minsight's tactile force maps enable precise sensing of fingertip contacts, which we use in this hands-on demonstration to allow a 3-DoF robot arm to physically track contact with a user's finger. While observing the colorful image captured by Minsight's internal camera, attendees can experience how its ability to detect delicate touches in all directions facilitates real-time robot interaction.
BibTeX

Haptic Intelligence Miscellaneous Demonstration: OCRA - A Kinematic Retargeting Algorithm for Expressive Whole-Arm Teleoperation Mohan, M., Kuchenbecker, K. J. Hands-on demonstration presented at the Conference on Robot Learning (CoRL), Munich, Germany, November 2024 (Published)
Traditional teleoperation systems focus on controlling the pose of the end-effector (task space), often neglecting the additional degrees of freedom present in human and many robotic arms. This demonstration presents the Optimization-based Customizable Retargeting Algorithm (OCRA), which was designed to map motions from one serial kinematic chain to another in real time. OCRA is versatile, accommodating any robot joint counts and segment lengths, and it can retarget motions from human arms to kinematically different serial robot arms with revolute joints both expressively and efficiently. One of OCRA's key features is its customizability, allowing the user to adjust the emphasis between hand orientation error and the configuration error of the arm's central line, which we call the arm skeleton. To evaluate the perceptual quality of the motions generated by OCRA, we conducted a video-watching study with 70 participants; the results indicated that the algorithm produces robot motions that closely resemble human movements, with a median rating of 78/100, particularly when the arm skeleton error weight and hand orientation error are balanced. In this demonstration, the presenter will wear an Xsens MVN Link and teleoperate the arms of a NAO child-size humanoid robot to highlight OCRA's ability to create intuitive and human-like whole-arm motions.
BibTeX

Empirical Inference Conference Paper Diffusion-based learning of contact plans for agile locomotion Dh’Edin, V., Ravi, A. K. C., Jordana, A., Zhu, H., Meduri, A., Righetti, L., Schölkopf, B., Khadiv, M. IEEE-RAS 23rd International Conference on Humanoid Robots (Humanoids), 637-644, IEEE, November 2024 (Published) DOI URL BibTeX

Empirical Inference Conference Paper Do LLMs Think Fast and Slow? A Causal Study on Sentiment Analysis Lyu*, Z., Jin*, Z., Gonzalez, F., Mihalcea, R., Schölkopf, B., Sachan, M. Findings of the Association for Computational Linguistics: EMNLP, 9353-9372, (Editors: Yaser Al-Onaizan and Mohit Bansal and Yun-Nung Chen), Association for Computational Linguistics, November 2024, *equal contribution (Published) DOI URL BibTeX

Empirical Inference Conference Paper Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis Jenny*, D. F., Billeter*, Y., Schölkopf, B., Jin, Z. Proceedings of the Third Workshop on NLP for Positive Impact, 152-178, (Editors: Dementieva, Daryna and Ignat, Oana and Jin, Zhijing and Mihalcea, Rada and Piatti, Giorgio and Tetreault, Joel and Wilson, Steven and Zhao, Jieyu), Association for Computational Linguistics, November 2024, *equal contribution (Published) URL BibTeX

Empirical Inference Conference Paper Implicit Personalization in Language Models: A Systematic Study Jin, Z., Heil, N., Liu, J., Dhuliawala, S., Qi, Y., Schölkopf, B., Mihalcea, R., Sachan, M. Findings of the Association for Computational Linguistics: EMNLP, 12309-12325, (Editors: Yaser Al-Onaizan and Mohit Bansal and Yun-Nung Chen), Association for Computational Linguistics, November 2024 (Published) DOI URL BibTeX

Perceiving Systems Ph.D. Thesis Leveraging Unpaired Data for the Creation of Controllable Digital Humans Sanyal, S. Max Planck Institute for Intelligent Systems and Eberhard Karls Universität Tübingen, November 2024 (Published)
Digital humans have grown increasingly popular, offering transformative potential across various fields such as education, entertainment, and healthcare. They enrich user experiences by providing immersive and personalized interactions. Enhancing these experiences involves making digital humans controllable, allowing for manipulation of aspects like pose and appearance, among others. Learning to create such controllable digital humans necessitates extensive data from diverse sources. This includes 2D human images alongside their corresponding 3D geometry and texture, 2D images showcasing similar appearances across a wide range of body poses, etc., for effective control over pose and appearance. However, the availability of such “paired data” is limited, making its collection both time-consuming and expensive. Despite these challenges, there is an abundance of unpaired 2D images with accessible, inexpensive labels—such as identity, type of clothing, appearance of clothing, etc. This thesis capitalizes on these affordable labels, employing informed observations from “unpaired data” to facilitate the learning of controllable digital humans through reconstruction, transposition, and generation processes. The presented methods—RingNet, SPICE, and SCULPT—each tackles different aspects of controllable digital human modeling. RingNet (Sanyal et al. [2019]) exploits the consistent facial geometry across different images of the same individual to estimate 3D face shapes and poses without 2D-to-3D supervision. This method illustrates how leveraging the inherent properties of unpaired images—such as identity consistency—can circumvent the need for expensive paired datasets. Similarly, SPICE (Sanyal et al. [2021]) employs a self-supervised learning framework that harnesses unpaired images to generate realistic transpositions of human poses by understanding the underlying 3D body structure and maintaining consistency in body shape and appearance features across different poses. Finally, SCULPT (Sanyal et al. [2024] generates clothed and textured 3D meshes by integrating insights from unpaired 2D images and medium-sized 3D scans. This process employs an unpaired learning approach, conditioning texture and geometry generation on attributes easily derived from data, like the type and appearance of clothing. In conclusion, this thesis highlights how unpaired data and innovative learning techniques can address the challenges of data scarcity and high costs in developing controllable digital humans by advancing reconstruction, transposition, and generation techniques.
download BibTeX

Empirical Inference Ph.D. Thesis On Principled Modeling of Inductive Bias in Machine Learning Liu, W. University of Cambridge, UK, Cambridge, November 2024, (Cambridge-Tübingen-Fellowship-Program, ELLIS PhD student program) (Published) BibTeX

Empirical Inference Conference Paper The Odyssey of Commonsense Causality: From Foundational Benchmarks to Cutting-Edge Reasoning Cui, S., Jin, Z., Schölkopf, B., Faltings, B. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 16722-16763, (Editors: Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung), Association for Computational Linguistics, November 2024 (Published) URL BibTeX

Empirical Inference Conference Paper RP1M: A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands Zhao*, Y., Chen*, L., Schneider, J., Gao, Q., Kannala, J., Schölkopf, B., Pajarinen, J., Büchler, D. Proceedings of the 8th Annual Conference on Robot Learning (CoRL), 270:5184-5203, Proceedings of Machine Learning Research, (Editors: Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram), PMLR, Conference on Robot Learning, November 2024, *equal contribution (Published) URL BibTeX

Deep Models and Optimization Article NIMBA: Towards Robust and Principled Processing of Point Clouds With SSMs Köprücü, N., Okpekpe, D., Orvieto, A. October 2024 (In preparation) BibTeX

Organizational Leadership and Diversity Conference Paper Is it Part of Me? Exploring Experiences of Inclusive Avatar Use For Visible and Invisible Disabilities in Social VR Angerbauer, K., Van Wagoner, P., Halach, T., Vogelsang, J., Hube, N., Smith, A., Keplinger, K., Sedlmair, M. In Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-15, Association for Computing Machinery, New York, NY, USA, ASSETS '24, October 2024 (Published)
Social Virtual Reality (VR) platforms have surged in popularity in recent years, including among people with disabilities (PWD). Previous research has documented accessibility challenges, harassment, and negative experiences for PWD using disability signifiers in VR, primarily focusing on those with visible disabilities who encounter negative experiences. Yet, little is known about the experiences of people with invisible disabilities in social VR environments, and whether positive experiences are also common. To address these gaps, we designed inclusive avatars (avatars with disability signifiers) and investigated the lived experiences of 26 individuals with both visible and invisible disabilities immersing themselves in social interactions in VRChat for a week. We utilized a mixed methods experience sampling design and multilevel regression to explore the relationships between social interactions of PWD in VR and various psychological outcomes. Our results indicate that PWD, both visible and invisible, experienced positive and negative social interactions in VR. These interactions, in turn, significantly influenced users’ overall experience with inclusive avatars, affecting aspects such as emotional responses, engagement levels, satisfaction with the avatar’s design, and perceptions of inclusion in VR. Qualitative interviews of 18 participants allowed for a more nuanced exploration of the experiences of PWD by giving voice to users who are rarely studied in depth. Findings provided unique insights into both the positive and negative experiences of PWD, as well as identified key design factors influencing user experience in social VR.
Inclusive Avatar Use For Visible and Invisible Disabilities in Social VR Inclusive Avatar Use for Social VR DOI URL BibTeX

Safety- and Efficiency- aligned Learning Technical Report A Realistic Threat Model for Large Language Model Jailbreaks Boreiko, V., Panfilov, A., Hein, M., Geiping, J. October 2024 (Submitted)
A plethora of jailbreaking attacks have been proposed to obtain harmful responses from safety-tuned LLMs. In their original settings, these methods all largely succeed in coercing the target output, but their attacks vary substantially in fluency and computational effort. In this work, we propose a unified threat model for the principled comparison of these methods. Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text, and computational budget, in total FLOPs. For the former, we build an N-gram model on 1T tokens, which, in contrast to model-based perplexity, allows for an LLM-agnostic and inherently interpretable evaluation. We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing. After a rigorous comparison, we not only find attack success rates against safety-tuned modern models to be lower than previously presented but also find that attacks based on discrete optimization significantly outperform recent LLM-based attacks. Being inherently interpretable, our threat model allows for a comprehensive analysis and comparison of jailbreak attacks. We find that effective attacks exploit and abuse infrequent N-grams, either selecting N-grams absent from real-world text or rare ones, e.g. specific to code datasets.
URL BibTeX

Autonomous Learning Conference Paper Active Fine-Tuning of Generalist Policies Bagatella, M., Hübotter, J., Martius, G., Krause, A. October 2024 (Submitted) BibTeX

Autonomous Learning Robotics Conference Paper Learning Diverse Skills for Local Navigation under Multi-constraint Optimality Cheng, J., Vlastelica, M., Kolev, P., Li, C., Martius, G. In Learning Diverse Skills for Local Navigation under Multi-constraint Optimality, 5083-5089, ICRA, October 2024 (Published)
Despite many successful applications of data-driven control in robotics, extracting meaningful diverse behaviors remains a challenge. Typically, task performance needs to be compromised in order to achieve diversity. In many scenarios, task requirements are specified as a multitude of reward terms, each requiring a different trade-off. In this work, we take a constrained optimization viewpoint on the quality-diversity trade-off and show that we can obtain diverse policies while imposing constraints on their value functions which are defined through distinct rewards. In line with previous work, further control of the diversity level can be achieved through an attract-repel reward term motivated by the Van der Waals force. We demonstrate the effectiveness of our method on a local navigation task where a quadruped robot needs to reach the target within a finite horizon. Finally, our trained policies transfer well to the real 12-DoF quadruped robot, Solo12, and exhibit diverse agile behaviors with successful obstacle traversal.
Website DOI URL BibTeX

Learning and Dynamical Systems Conference Paper Subgroup-Specific Risk-Controlled Dose Estimation in Radiotherapy Fischer, P., Willms, H., Muehlebach, M., Thorwarth, D., Schneider, M., Baumgartner, C. Medical Image Computing and Computer Assisted Intervention - MICCAI 2024 , 696-706, Springer, Cham, 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024), October 2024 (Published) DOI URL BibTeX

Perceiving Systems Conference Paper On predicting 3D bone locations inside the human body Dakri, A., Arora, V., Challier, L., Keller, M., Black, M. J., Pujades, S. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2024, 336-346, Springer, Cham, 27th International Conference on Medical Image Computing and Computer assisted Intervention (MICCAI 2024) , October 2024 (Published)
Knowing the precise location of the bones inside the human body is key in several medical tasks, such as patient placement inside an imaging device or surgical navigation inside a patient. Our goal is to predict the bone locations using only an external 3D body surface obser- vation. Existing approaches either validate their predictions on 2D data (X-rays) or with pseudo-ground truth computed from motion capture using biomechanical models. Thus, methods either suffer from a 3D-2D projection ambiguity or directly lack validation on clinical imaging data. In this work, we start with a dataset of segmented skin and long bones obtained from 3D full body MRI images that we refine into individual bone segmentations. To learn the skin to bones correlations, one needs to register the paired data. Few anatomical models allow to register a skeleton and the skin simultaneously. One such method, SKEL, has a skin and skeleton that is jointly rigged with the same pose parameters. How- ever, it lacks the flexibility to adjust the bone locations inside its skin. To address this, we extend SKEL into SKEL-J to allow its bones to fit the segmented bones while its skin fits the segmented skin. These precise fits allow us to train SKEL-J to more accurately infer the anatomical joint locations from the skin surface. Our qualitative and quantitative results show how our bone location predictions are more accurate than all existing approaches. To foster future research, we make available for research purposes the individual bone segmentations, the fitted SKEL-J models as well as the new inference methods.
Project page DOI URL BibTeX

Deep Models and Optimization Conference Paper Loss Landscape Characterization of Neural Networks without Over-Parametrization Islamov, R., Ajroldi, N., Orvieto, A., Lucchi, A. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Thirty-Eighth Annual Conference on Neural Information Processing Systems, October 2024 (Published) URL BibTeX

Deep Models and Optimization Conference Paper Recurrent neural networks: vanishing and exploding gradients are not the end of the story Zucchet, N., Orvieto, A. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Thirty-Eighth Annual Conference on Neural Information Processing Systems, October 2024 (Published) URL BibTeX

Deep Models and Optimization Conference Paper Theoretical Foundations of Deep Selective State-Space Models Muca Cirone, N., Orvieto, A., Walker, B., Salvi, C., Lyons, T. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Thirty-Eighth Annual Conference on Neural Information Processing Systems, October 2024 (Published) URL BibTeX

Deep Models and Optimization Conference Paper Understanding the differences in Foundation Models: Attention, State Space Models, and Recurrent Neural Networks Sieber, J., Amo Alonso, C., Didier, A., Zeilinger, M., Orvieto, A. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Thirty-Eighth Annual Conference on Neural Information Processing Systems, October 2024 (Published) URL BibTeX

Empirical Inference Article A Probabilistic Model behind Self-Supervised Learning Bizeul, A., Schölkopf, B., Allen, C. Transactions on Machine Learning Research, October 2024 (Published) PDF URL BibTeX

Haptic Intelligence Robotic Materials Miscellaneous Active Haptic Feedback for a Virtual Wrist-Anchored User Interface Bartels, J. U., Sanchez-Tamayo, N., Sedlmair, M., Kuchenbecker, K. J. Adjunct Proceedings of the Annual ACM Symposium on User Interface Software and Technology (UIST), (53)1-3, Hands-on demonstration presented at the Annual ACM Symposium on User Interface Software and Technology (UIST), Pittsburgh, USA, October 2024 (Published)
The presented system combines a virtual wrist-anchored user interface (UI) with a new low-profle, wrist-worn device that provides salient and expressive haptic feedback such as contact, pressure and broad-bandwidth vibration. This active feedback is used to add tactile cues to interactions with virtual mid-air UI elements that track the user's wrist; we demonstrate a simple menu-interaction task to showcase the utility of haptics for interactions with virtual buttons and sliders. Moving forward, we intend to use this platform to develop haptic guidelines for body-anchored interfaces and test multiple haptic devices across the body to create engaging interactions.
DOI BibTeX

Social Foundations of Computation Algorithms and Society Conference Paper Decline Now: A Combinatorial Model for Algorithmic Collective Action Sigg, D., Hardt, M., Mendler-Dünner, C. CHI Conference on Human Factors in Computing Systems, October 2024 (Accepted)
Drivers on food delivery platforms often run a loss on low-paying orders. In response, workers on DoorDash started a campaign, DeclineNow, to purposefully decline orders below a certain pay threshold. For each declined order, the platform returns the request to other available drivers with slightly increased pay. While contributing to overall pay increase the implementation of the strategy comes with the risk of missing out on orders for each individual driver. In this work, we propose a first combinatorial model to study the strategic interaction between workers and the platform. Within our model, we formalize key quantities such as the average worker benefit of the strategy, the benefit of freeriding, as well as the benefit of participation. We extend our theoretical results with simulations. Our key insights show that the average worker gain of the strategy is always positive, while the benefit of participation is positive only for small degrees of labor oversupply. Beyond this point, the utility of participants decreases faster with an increasing degree of oversupply, compared to the utility of non-participants. Our work highlights the significance of labor supply levels for the effectiveness of collective action on gig platforms. We suggest organizing in shifts as a means to reduce oversupply and empower collectives
arXiv BibTeX

Empirical Inference Article How developments in natural language processing help us in understanding human behaviour Mihalcea, R., Biester, L., Boyd, R. L., Jin, Z., Perez-Rosas, V., Wilson, S., Pennebaker, J. W. Nature Human Behaviour, 8(10):1877-1889, Nature Publishing Group UK London, October 2024 (Published) DOI URL BibTeX

Perceiving Systems Ph.D. Thesis Learning Digital Humans from Vision and Language Feng, Y. ETH Zürich, October 2024 (Published)
The study of realistic digital humans has gained significant attention within the research communities of computer vision, computer graphics, and ma- chine learning. This growing interest is motivated by the crucial under- standing of human selves and the essential role digital humans play in enabling the metaverse. Applications span various sectors including vir- tual presence, fitness, digital fashion, entertainment, humanoid robots and healthcare. However, learning about 3D humans presents significant challenges due to data scarcity. In an era where scalability is crucial for AI, this raises the question: can we enhance the scalability of learning digital humans? To understand this, consider how humans interact: we observe and com- municate, forming impressions of others through these interactions. This thesis proposes a similar potential for computers: could they be taught to understand humans by observing and listening? Such an approach would involve processing visual data, like images and videos, and linguistic data from text descriptions. Thus, this research endeavors to enable machines to learn about digital humans from vision and language, both of which are readily available and scalable sources of data. Our research begins by developing a framework to create detailed 3D faces from in-the-wild images. This framework, capable of generating highly realistic and animatable 3D faces from single images, is trained without paired 3D supervision and achieves state-of-the-art accuracy in shape re- construction. It effectively disentangles identity and expression details, thereby enhancing facial animation. We then explore capturing the body, clothing, face, and hair from monocu- lar videos, using a novel hybrid explicit-implicit 3D representation. This iii approach facilitates the disentangled learning of digital humans from monocular videos and allows for the easy transfer of hair and clothing to different bodies, as demonstrated through experiments in disentangled re- construction, virtual try-ons, and hairstyle transfers. Next, we present a method that utilizes text-visual foundation models to generate highly realistic 3D faces, complete with hair and accessories, based on text descriptions. These foundation models are trained exclusively on in-the-wild images and efficiently produce detailed and realistic outputs, facilitating the creation of authentic avatars. Finally, we introduce a framework that employs Large Language Models (LLMs) to interpret and generate 3D human poses from both images and text. This method, inspired by how humans intuitively understand pos- tures, merges image interpretation with body language analysis. By em- bedding SMPL poses into a multimodal LLM, our approach not only in- tegrates semantic reasoning but also enhances the generation and under- standing of 3D poses, utilizing the comprehensive capabilities of LLMs. Additionally, the use of LLMs facilitates interactive discussions with users about human poses, enriching human-computer interactions. Our research on digital humans significantly boosts scalability and con- trollability. By generating digital humans from images, videos, and text, we democratize their creation, making it broadly accessible through ev- eryday imagery and straightforward text, while enhancing generalization. Disentangled modeling and interactive chatting with human poses increase the controllability of digital humans and improve user interactions and cus- tomizations, showcasing their potential to extend into various disciplines.
pdf DOI URL BibTeX

Empirical Inference Conference Paper Redesigning Information Markets in the Era of Language Models Weiss, M., Rahaman, N., Wüthrich, M., Bengio, Y., Li, L. E., Schölkopf, B., Pal, C. First Conference on Language Modeling (COLM), arXiv:2403.14443, October 2024 (Published)
This work addresses the buyer's inspection paradox for information markets. The paradox is that buyers need to access information to determine its value, while sellers need to limit access to prevent theft. To study this, we introduce an open-source simulated digital marketplace where intelligent agents, powered by language models, buy and sell information on behalf of external participants. The central mechanism enabling this marketplace is the agents' dual capabilities: they not only have the capacity to assess the quality of privileged information but also come equipped with the ability to forget. This ability to induce amnesia allows vendors to grant temporary access to proprietary information, significantly reducing the risk of unauthorized retention while enabling agents to accurately gauge the information's relevance to specific queries or tasks. To perform well, agents must make rational decisions, strategically explore the marketplace through generated sub-queries, and synthesize answers from purchased information. Concretely, our experiments (a) uncover biases in language models leading to irrational behavior and evaluate techniques to mitigate these biases, (b) investigate how price affects demand in the context of informational goods, and (c) show that inspection and higher budgets both lead to higher quality outcomes.
arXiv URL BibTeX

Neural Capture and Synthesis Perceiving Systems Conference Paper Stable Video Portraits Ostrek, M., Thies, J. In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, European Conference on Computer Vision (ECCV 2024), October 2024 (Published)
Rapid advances in the field of generative AI and text-to-image methods in particular have transformed the way we interact with and perceive computer-generated imagery today. In parallel, much progress has been made in 3D face reconstruction, using 3D Morphable Models (3DMM). In this paper, we present Stable Video Portraits, a novel hybrid 2D/3D generation method that outputs photorealistic videos of talking faces leveraging a large pre-trained text-to-image prior (2D), controlled via a 3DMM (3D). Specifically, we introduce a person-specific fine-tuning of a general 2D stable diffusion model which we lift to a video model by providing temporal 3DMM sequences as conditioning and by introducing a temporal denoising procedure. As an output, this model generates temporally smooth imagery of a person with 3DMM-based controls, i.e., a person-specific avatar. The facial appearance of this person-specific avatar can be edited and morphed to text-defined celebrities, without any test-time fine-tuning. The method is analyzed quantitatively and qualitatively, and we show that our method outperforms state-of-the-art monocular head avatar methods.
URL BibTeX

Neural Capture and Synthesis Perceiving Systems Conference Paper Synthesizing Environment-Specific People in Photographs Ostrek, M., O’Sullivan, C., Black, M., Thies, J. In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, European Conference on Computer Vision (ECCV 2024), October 2024 (Published)
We present ESP, a novel method for context-aware full-body generation, that enables photo-realistic synthesis and inpainting of people wearing clothing that is semantically appropriate for the scene depicted in an input photograph. ESP is conditioned on a 2D pose and contextual cues that are extracted from the photograph of the scene and integrated into the generation process, where the clothing is modeled explicitly with human parsing masks (HPM). Generated HPMs are used as tight guiding masks for inpainting, such that no changes are made to the original background. Our models are trained on a dataset containing a set of in-the-wild photographs of people covering a wide range of different environments. The method is analyzed quantitatively and qualitatively, and we show that ESP outperforms the state-of-the-art on the task of contextual full-body generation.
URL BibTeX

Perceiving Systems Conference Paper HUMOS: Human Motion Model Conditioned on Body Shape Tripathi, S., Taheri, O., Lassner, C., Black, M. J., Holden, D., Stoll, C. In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, October 2024 (Published)
Generating realistic human motion is essential for many computer vision and graphics applications. The wide variety of human body shapes and sizes greatly impacts how people move. However, most existing motion models ignore these differences, relying on a standardized, average body. This leads to uniform motion across different body types, where movements don't match their physical characteristics, limiting diversity. To solve this, we introduce a new approach to develop a generative motion model based on body shape. We show that it's possible to train this model using unpaired data by applying cycle consistency, intuitive physics, and stability constraints, which capture the relationship between identity and movement. The resulting model generates diverse, physically plausible, and dynamically stable human motions that are both quantitatively and qualitatively more realistic than current state-of-the-art methods.
project arXiv BibTeX

Robotic Materials Article Hexagonal electrohydraulic modules for rapidly reconfigurable high-speed robots Yoder, Z., Rumley, E., Schmidt, I., Rothemund, P., Keplinger, C. Science Robotics, 9, September 2024 (Published)
Robots made from reconfigurable modular units feature versatility, cost efficiency, and improved sustainability compared with fixed designs. Reconfigurable modules driven by soft actuators provide adaptable actuation, safe interaction, and wide design freedom, but existing soft modules would benefit from high-speed and high-strain actuation, as well as driving methods well-suited to untethered operation. Here, we introduce a class of electrically actuated robotic modules that provide high-speed (a peak contractile strain rate of 4618\% per second, 15.8-hertz bandwidth, and a peak specific power of 122 watts per kilogram), high-strain (49\% contraction) actuation and that use magnets for reversible mechanical and electrical connections between neighboring modules, thereby serving as building blocks for rapidly reconfigurable and highly agile robotic systems. The actuation performance of each hexagonal electrohydraulic (HEXEL) module is enabled by a synergistic combination of soft and rigid components; a hexagonal exoskeleton of rigid plates amplifies the motion produced by soft electrohydraulic actuators and provides a mechanical structure and connection platform for reconfigurable robots composed of many modules. We characterize the actuation performance of individual HEXEL modules, present a model that captures their quasi-static force-stroke behavior, and demonstrate both a high-jumping and a fast pipe-crawling robot. Using embedded magnetic connections, we arranged multiple modules into reconfigurable robots with diverse functionality, including a high-stroke muscle, a multimodal active array, a table-top active platform, and a fast-rolling robot. We further leveraged the magnetic connections for hosting untethered, snap-on driving electronics, together highlighting the promise of HEXEL modules for creating rapidly reconfigurable high-speed robots.
Video PDF DOI URL BibTeX

Perceiving Systems Conference Paper A Unified Approach for Text- and Image-guided 4D Scene Generation Zheng, Y., Li, X., Nagano, K., Liu, S., Hilliges, O., Mello, S. D. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 7300-7309, Piscataway, NJ, CVPR, September 2024 (Published)
Large-scale diffusion generative models are greatly simplifying image, video and 3D asset creation from user-provided text prompts and images. However, the challenging problem of text-to-4D dynamic 3D scene generation with diffusion guidance remains largely unexplored. We propose Dream-in-4D, which features a novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D diffusion guidance to effectively learn a high-quality static 3D asset in the first stage; (2) a deformable neural radiance field that explicitly disentangles the learned static asset from its deformation, preserving quality during motion learning; and (3) a multi-resolution feature grid for the deformation field with a displacement total variation loss to effectively learn motion with video diffusion guidance in the second stage. Through a user preference study, we demonstrate that our approach significantly advances image and motion quality, 3D consistency and text fidelity for text-to-4D generation compared to baseline approaches. Thanks to its motion-disentangled representation, Dream-in-4D can also be easily adapted for controllable generation where appearance is defined by one or multiple images, without the need to modify the motion learning stage. Thus, our method offers, for the first time, a unified approach for text-to-4D, image-to-4D and personalized 4D generation tasks.
paper project code DOI URL BibTeX

Perceiving Systems Conference Paper Generative Proxemics: A Prior for 3D Social Interaction from Images Müller, L., Ye, V., Pavlakos, G., Black, M., Kanazawa, A. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 9687-9697, Piscataway, NJ, CVPR, September 2024 (Published)
Social interaction is a fundamental aspect of human behavior and communication. The way individuals position themselves in relation to others, also known as proxemics, conveys social cues and affects the dynamics of social interaction. Reconstructing such interaction from images presents challenges because of mutual occlusion and the limited availability of large training datasets. To address this, we present a novel approach that learns a prior over the 3D proxemics two people in close social interaction and demonstrate its use for single-view 3D reconstruction. We start by creating 3D training data of interacting people using image datasets with contact annotations. We then model the proxemics using a novel denoising diffusion model called BUDDI that learns the joint distribution over the poses of two people in close social interaction. Sampling from our generative proxemics model produces realistic 3D human interactions, which we validate through a perceptual study. We use BUDDI in reconstructing two people in close proximity from a single image without any contact annotation via an optimization approach that uses the diffusion model as a prior. Our approach recovers accurate and plausible 3D social interactions from noisy initial estimates, outperforming state-of-the-art methods. Our code, data, and model are available at our project website.
arXiv project code data DOI URL BibTeX