Publications

DEPARTMENTS

Empirical Inference

Haptic Intelligence

Modern Magnetic Systems

Perceiving Systems

Physical Intelligence

Robotic Materials

Social Foundations of Computation


Research Groups

Autonomous Vision

Autonomous Learning

Bioinspired Autonomous Miniature Robots

Dynamic Locomotion

Embodied Vision

Human Aspects of Machine Learning

Intelligent Control Systems

Learning and Dynamical Systems

Locomotion in Biorobotic and Somatic Systems

Micro, Nano, and Molecular Systems

Movement Generation and Control

Neural Capture and Synthesis

Physics for Inference and Optimization

Organizational Leadership and Diversity

Probabilistic Learning Group


Perceiving Systems Ph.D. Thesis Beyond the Surface: Statistical Approaches to Internal Anatomy Prediction Keller, M. University of Tübingen, November 2024 (Published)
The creation of personalized anatomical digital twins is important in the fields of medicine, computer graphics, sports science, and biomechanics. But to observe a subject’s anatomy, expensive medical devices (MRI or CT) are required and creating a digital model is often time-consuming and involves manual effort. Instead, we can leverage the fact that the shape of the body surface is correlated with the internal anatomy; indeed, the external body shape is related to the bone lengths, the angle of skeletal articulation, and the thickness of various soft tissues. In this thesis, we leverage the correlation between body shape and anatomy and aim to infer the internal anatomy solely from the external appearance. Learning this correlation requires paired observations of people’s body shape, and their internal anatomy, which raises three challenges. First, building such datasets requires specific capture modalities. Second, these data must be annotated, i.e. the body shape and anatomical structures must be identified and segmented, which is often a tedious manual task requiring expertise. Third, to learn a model able to capture the correlation between body shape and internal anatomy, the data of people with various shapes and poses has to be put into correspondence. In this thesis, we cover three works that focus on learning this correlation. We show that we can infer the skeleton geometry, the bone location inside the body, and the soft tissue location solely from the external body shape. First, in the OSSO project, we leverage 2D medical scans to construct a paired dataset of 3D body shapes and corresponding 3D skeleton shapes. This dataset allows us to learn the correlation between body and skeleton shapes, enabling the inference of a custom skeleton based on an individual’s body. However, since this learning process is based on static views of subjects in specific poses, we cannot evaluate the accuracy of skeleton inference in different poses. 
To predict the bone orientation within the body in various poses, we need dynamic data. To track bones inside the body in motion, we can leverage methods from the biomechanics field. So in the second work, instead of medical imaging, we use a biomechanical skeletal model along with simulation to build a paired dataset of bodies in motion and their corresponding skeletons. In this work, we build such a dataset and learn SKEL, a body shape and skeleton model that includes the locations of anatomical bones from any body shape and in any pose. After dealing with the skeletal structure, we broaden our focus to include different layers of soft tissues. In the third work, HIT, we leverage segmented medical data to learn to predict the distribution of adipose tissues (fat) and lean tissues (muscle, organs, etc.) inside the body.
pdf URL BibTeX

Deep Models and Optimization Conference Paper Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise Monzio Compagnoni, E., Liu, T., Islamov, R., Proske, F. N., Orvieto, A., Lucchi, A. In The Thirteenth International Conference on Learning Representations, ICLR 2025, The Thirteenth International Conference on Learning Representations, November 2024 (Accepted) BibTeX

Safety- and Efficiency-aligned Learning Conference Paper Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers Singh, S., Singhania, P., Ranjan, A., Kirchenbauer, J., Geiping, J., Wen, Y., Jain, N., Hans, A., Shu, M., Tomar, A., Goldstein, T., Bhatele, A. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC24), 36-49, IEEE, Atlanta, GA, November 2024 (Published) DOI URL BibTeX

Perceiving Systems Ph.D. Thesis Aerial Markerless Motion Capture Saini, N. November 2024 (Published)
Human motion capture (mocap) is important for several applications such as healthcare, sports, animation etc. Existing markerless mocap methods employ multiple static and calibrated RGB cameras to infer the subject’s pose. These methods are not suitable for outdoor and unstructured scenarios. They need an extra calibration step before the mocap session and cannot dynamically adapt the viewpoint for the best mocap performance. A mocap setup consisting of multiple unmanned aerial vehicles with onboard cameras is ideal for such situations. However, estimating the subject’s motion together with the camera motions is an under-constrained problem. In this thesis, we explore multiple approaches where we split this problem into multiple stages. We obtain the prior knowledge or rough estimates of the subject’s or the cameras’ motion in the initial stages and exploit them in the final stages. In our work AirCap-Pose-Estimator, we use extra sensors (an IMU and a GPS receiver) on the multiple moving cameras to obtain the approximate camera poses. We use these estimates to jointly optimize the camera poses, the 3D body pose and the subject’s shape to robustly fit the 2D keypoints of the subject. We show that the camera pose estimates using just the sensors are not accurate enough, and our joint optimization formulation improves the accuracy of the camera poses while estimating the subject’s poses. Placing extra sensors on the cameras is not always feasible. That is why, in our work AirPose, we introduce a distributed neural network that runs on board, estimating the subject’s motion and calibrating the cameras relative to the subject. We utilize realistic human scans with ground truth to train our network. We further fine-tune it using a small amount of real-world data. Finally, we propose a bundle-adjustment method (AirPose+), which utilizes the initial estimates from our network to recover high-quality motions of the subject and the cameras. 
Finally, we consider a generic setup consisting of multiple static and moving cameras. We propose a method that estimates the poses of the cameras and the human relative to the ground plane using only 2D human keypoints. We learn a human motion prior using a large amount of human mocap data and use it in a novel multi-stage optimization approach to fit the SMPL human body model and the camera poses to the 2D keypoints. We show that in addition to the aerial cameras, our method works for smartphone cameras and standard RGB ground cameras. This thesis advances the field of markerless mocap which is currently limited to multiple static calibrated RGB cameras. Our methods allow the user to use moving RGB cameras and skip the extrinsic calibration. In the future, we will explore the usage of a single moving camera without even needing camera intrinsics.
thesis BibTeX
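The fitting stages described in this thesis revolve around a reprojection residual: body and camera parameters are adjusted until projected 3D joints match the observed 2D keypoints. A minimal sketch of that quantity, assuming a simplified pinhole camera (the function names, the single focal length, and the omission of distortion and motion priors are all illustrative simplifications, not the thesis's implementation):

```python
import numpy as np

def reproject(points3d, R, t, f):
    """Pinhole projection of 3D points for a camera with rotation R,
    translation t, and focal length f (principal point at the origin)."""
    cam = points3d @ R.T + t        # world -> camera coordinates
    return f * cam[:, :2] / cam[:, 2:3]

def keypoint_residual(points3d, keypoints2d, R, t, f):
    """Mean 2D reprojection error: the kind of residual that a
    bundle-adjustment-style fit over body and camera parameters
    would drive toward zero."""
    proj = reproject(points3d, R, t, f)
    return float(np.mean(np.linalg.norm(proj - keypoints2d, axis=1)))
```

In the methods above, a residual of this kind is minimized jointly with learned human motion priors, with the sensor-based camera estimates serving as initialization.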

Organizational Leadership and Diversity Article From challenges to opportunities: navigating the human response to automated agents in the workplace Ðula, I., Berberena, T., Keplinger, K., Wirzberger, M. Humanities and Social Sciences Communications, 11:1454, November 2024 (Published)
Workers are increasingly embracing Artificial Intelligence (AI) to optimise various aspects of their operations in the workplace. While AI offers new opportunities, it also presents unintended challenges that they must carefully navigate. This paper aims to develop a deeper understanding of workers’ experiences with interactions with automated agents (AA) in the workplace and provide actionable recommendations for organisational leaders to achieve positive outcomes. We propose and test a simulation model that quantifies and predicts workers’ experiences with AA, shedding light on the interplay of diverse variables, such as workload, effort and trust. Our findings suggest that lower-efficiency AA might outperform higher-efficiency ones due to the constraining influence of trust on adoption rates. Additionally, we find that lower initial trust in AA could lead to increased usage in certain scenarios and that stronger emotional and social responses to the use of AA may foster greater trust but result in decreased AA utilisation. This interdisciplinary research blends a systems dynamics approach with management theories and psychological concepts, aiming to bridge existing gaps and foster the sustainable and effective implementation of AA in the workplace. Ultimately, our research endeavour contributes to advancing the field of human-AI interaction in the workplace.
DOI URL BibTeX
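The paper's central feedback loop, in which trust gates adoption of automated agents while experienced benefit feeds back into trust, can be caricatured as a tiny stock-and-flow simulation. Every equation and constant below is an illustrative assumption for exposition, not the published system dynamics model:

```python
def simulate_adoption(efficiency, trust0=0.3, steps=100,
                      learn=0.1, workload=1.0):
    """Minimal stock-and-flow sketch: usage of an automated agent (AA)
    is gated by trust, and trust is updated toward the experienced
    benefit of that usage. Returns the average usage over the run."""
    trust = trust0
    usage_history = []
    for _ in range(steps):
        usage = trust * workload                        # trust gates adoption
        benefit = usage * efficiency                    # experienced benefit
        trust += learn * (min(benefit, 1.0) - trust)    # trust chases benefit
        trust = min(max(trust, 0.0), 1.0)
        usage_history.append(usage)
    return sum(usage_history) / steps
```

Even this toy loop shows the qualitative point that adoption is constrained by trust dynamics rather than by the agent's efficiency alone; the paper's richer model adds workload, effort, and emotional-response variables.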

Haptic Intelligence Ph.D. Thesis Data-Driven Needle Puncture Detection for the Delivery of Urgent Medical Care in Space L’Orsa, R. University of Calgary, Calgary, Canada, November 2024, Department of Electrical and Computer Engineering (Published)
Needle thoracostomy (NT) is a surgical procedure that treats one of the most preventable causes of trauma-related death: dangerous accumulations of air between the chest wall and the lungs. However, needle-tip overshoot of the target space can result in the inadvertent puncture of critical structures like the heart. This type of complication is fatal without urgent surgical care, which is not available in resource-poor environments like space. Since NT is done blind, operators rely on tool sensations to identify when the needle has reached its target. Needle instrumentation could enable puncture notifications to help operators limit tool-tip overshoot, but such a solution requires reliable puncture detection from manual (i.e., variable-velocity) needle insertion data streams. Data-driven puncture-detection (DDPD) algorithms are appropriate for this application, but their performance has historically been unacceptably low for use in safety-critical applications. This work contributes towards the development of an intelligent device for manual NT assistance by proposing two novel DDPD algorithms. Three data sets are collected that provide needle forces and displacements acquired during insertions into ex vivo porcine tissue analogs for the human chest, and factors affecting DDPD algorithm performance are analyzed in these data. Puncture event features are examined for each sensor, and the suitability of both accelerometer measurements and diffuse reflectance measurements are evaluated within the context of NT. Finally, DDPD ensembles are proposed that yield a 5.1-fold improvement in precision as compared to the traditional force-only DDPD approach. These results lay a foundation for improving the urgent delivery of percutaneous procedures in space and other resource-poor settings.
BibTeX
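The ensemble idea in this thesis, combining weak puncture detectors so that agreement is required before a notification fires, can be sketched on a 1-D force stream. The two detectors, their thresholds, and the AND-combination below are illustrative stand-ins, not the thesis's DDPD algorithms:

```python
def force_drop_detector(forces, drop=0.5):
    """Flag sample i if force has fallen more than `drop` N below the running peak."""
    peak = forces[0]
    flags = []
    for f in forces:
        peak = max(peak, f)
        flags.append(peak - f > drop)
    return flags

def slope_detector(forces, slope=-0.3):
    """Flag sample i if the first difference is steeper than `slope` N/sample."""
    flags = [False]
    for prev, cur in zip(forces, forces[1:]):
        flags.append(cur - prev < slope)
    return flags

def ensemble_puncture(forces):
    """AND-ensemble of two weak detectors: demanding agreement trades
    recall for the higher precision a safety-critical notification needs."""
    a = force_drop_detector(forces)
    b = slope_detector(forces)
    return [x and y for x, y in zip(a, b)]
```

On a typical insertion profile (force rising until the tissue yields, then dropping sharply), only the sample at the drop is flagged by both detectors.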

Haptic Intelligence Autonomous Learning Empirical Inference Miscellaneous Demonstration: Minsight - A Soft Vision-Based Tactile Sensor for Robotic Fingertips Andrussow, I., Sun, H., Martius, G., Kuchenbecker, K. J. Hands-on demonstration presented at the Conference on Robot Learning (CoRL), Munich, Germany, November 2024 (Published)
Beyond vision and hearing, tactile sensing enhances a robot's ability to dexterously manipulate unfamiliar objects and safely interact with humans. Giving touch sensitivity to robots requires compact, robust, affordable, and efficient hardware designs, especially for high-resolution tactile sensing. We present a soft vision-based tactile sensor engineered to meet these requirements. Comparable in size to a human fingertip, Minsight uses machine learning to output high-resolution directional contact force distributions at 60 Hz. Minsight's tactile force maps enable precise sensing of fingertip contacts, which we use in this hands-on demonstration to allow a 3-DoF robot arm to physically track contact with a user's finger. While observing the colorful image captured by Minsight's internal camera, attendees can experience how its ability to detect delicate touches in all directions facilitates real-time robot interaction.
BibTeX

Haptic Intelligence Miscellaneous Demonstration: OCRA - A Kinematic Retargeting Algorithm for Expressive Whole-Arm Teleoperation Mohan, M., Kuchenbecker, K. J. Hands-on demonstration presented at the Conference on Robot Learning (CoRL), Munich, Germany, November 2024 (Published)
Traditional teleoperation systems focus on controlling the pose of the end-effector (task space), often neglecting the additional degrees of freedom present in human and many robotic arms. This demonstration presents the Optimization-based Customizable Retargeting Algorithm (OCRA), which was designed to map motions from one serial kinematic chain to another in real time. OCRA is versatile, accommodating any robot joint counts and segment lengths, and it can retarget motions from human arms to kinematically different serial robot arms with revolute joints both expressively and efficiently. One of OCRA's key features is its customizability, allowing the user to adjust the emphasis between hand orientation error and the configuration error of the arm's central line, which we call the arm skeleton. To evaluate the perceptual quality of the motions generated by OCRA, we conducted a video-watching study with 70 participants; the results indicated that the algorithm produces robot motions that closely resemble human movements, with a median rating of 78/100, particularly when the arm skeleton error weight and hand orientation error are balanced. In this demonstration, the presenter will wear an Xsens MVN Link and teleoperate the arms of a NAO child-size humanoid robot to highlight OCRA's ability to create intuitive and human-like whole-arm motions.
BibTeX
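OCRA's customizability comes from weighting hand-orientation error against the configuration error of the arm's central line. A hedged sketch of such a weighted objective (the function names, the skeleton sampling, and the specific error terms are assumptions for illustration, not OCRA's published formulation):

```python
import numpy as np

def rotation_angle_error(R_a, R_b):
    """Geodesic distance between two rotation matrices, in radians."""
    cos = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

def retarget_cost(hand_R_h, hand_R_r, skel_h, skel_r, w_hand=0.5):
    """Weighted retargeting objective in the spirit of OCRA: trade off
    the hand-orientation error against the discrepancy of the arm
    skeleton, sampled at matching points along the human (h) and
    robot (r) arms."""
    e_hand = rotation_angle_error(hand_R_h, hand_R_r)
    e_skel = np.mean(np.linalg.norm(skel_h - skel_r, axis=1))
    return w_hand * e_hand + (1.0 - w_hand) * e_skel
```

Minimizing a cost of this shape over the robot's joint angles at each time step, with `w_hand` exposed to the user, mirrors the balance the study found most human-like.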

Empirical Inference Conference Paper Diffusion-based learning of contact plans for agile locomotion Dhédin, V., Ravi, A. K. C., Jordana, A., Zhu, H., Meduri, A., Righetti, L., Schölkopf, B., Khadiv, M. IEEE-RAS 23rd International Conference on Humanoid Robots (Humanoids), 637-644, IEEE, November 2024 (Published) DOI URL BibTeX

Empirical Inference Conference Paper Do LLMs Think Fast and Slow? A Causal Study on Sentiment Analysis Lyu*, Z., Jin*, Z., Gonzalez, F., Mihalcea, R., Schölkopf, B., Sachan, M. Findings of the Association for Computational Linguistics: EMNLP, 9353-9372, (Editors: Yaser Al-Onaizan and Mohit Bansal and Yun-Nung Chen), Association for Computational Linguistics, November 2024, *equal contribution (Published) DOI URL BibTeX

Empirical Inference Conference Paper Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis Jenny*, D. F., Billeter*, Y., Schölkopf, B., Jin, Z. Proceedings of the Third Workshop on NLP for Positive Impact, 152-178, (Editors: Dementieva, Daryna and Ignat, Oana and Jin, Zhijing and Mihalcea, Rada and Piatti, Giorgio and Tetreault, Joel and Wilson, Steven and Zhao, Jieyu), Association for Computational Linguistics, November 2024, *equal contribution (Published) URL BibTeX

Empirical Inference Conference Paper Implicit Personalization in Language Models: A Systematic Study Jin, Z., Heil, N., Liu, J., Dhuliawala, S., Qi, Y., Schölkopf, B., Mihalcea, R., Sachan, M. Findings of the Association for Computational Linguistics: EMNLP, 12309-12325, (Editors: Yaser Al-Onaizan and Mohit Bansal and Yun-Nung Chen), Association for Computational Linguistics, November 2024 (Published) DOI URL BibTeX

Perceiving Systems Ph.D. Thesis Leveraging Unpaired Data for the Creation of Controllable Digital Humans Sanyal, S. Max Planck Institute for Intelligent Systems and Eberhard Karls Universität Tübingen, November 2024 (Published)
Digital humans have grown increasingly popular, offering transformative potential across various fields such as education, entertainment, and healthcare. They enrich user experiences by providing immersive and personalized interactions. Enhancing these experiences involves making digital humans controllable, allowing for manipulation of aspects like pose and appearance, among others. Learning to create such controllable digital humans necessitates extensive data from diverse sources. This includes 2D human images alongside their corresponding 3D geometry and texture, 2D images showcasing similar appearances across a wide range of body poses, etc., for effective control over pose and appearance. However, the availability of such “paired data” is limited, making its collection both time-consuming and expensive. Despite these challenges, there is an abundance of unpaired 2D images with accessible, inexpensive labels—such as identity, type of clothing, appearance of clothing, etc. This thesis capitalizes on these affordable labels, employing informed observations from “unpaired data” to facilitate the learning of controllable digital humans through reconstruction, transposition, and generation processes. The presented methods—RingNet, SPICE, and SCULPT—each tackles different aspects of controllable digital human modeling. RingNet (Sanyal et al. [2019]) exploits the consistent facial geometry across different images of the same individual to estimate 3D face shapes and poses without 2D-to-3D supervision. This method illustrates how leveraging the inherent properties of unpaired images—such as identity consistency—can circumvent the need for expensive paired datasets. Similarly, SPICE (Sanyal et al. [2021]) employs a self-supervised learning framework that harnesses unpaired images to generate realistic transpositions of human poses by understanding the underlying 3D body structure and maintaining consistency in body shape and appearance features across different poses. 
Finally, SCULPT (Sanyal et al. [2024]) generates clothed and textured 3D meshes by integrating insights from unpaired 2D images and medium-sized 3D scans. This process employs an unpaired learning approach, conditioning texture and geometry generation on attributes easily derived from data, like the type and appearance of clothing. In conclusion, this thesis highlights how unpaired data and innovative learning techniques can address the challenges of data scarcity and high costs in developing controllable digital humans by advancing reconstruction, transposition, and generation techniques.
download BibTeX

Empirical Inference Ph.D. Thesis On Principled Modeling of Inductive Bias in Machine Learning Liu, W. University of Cambridge, Cambridge, UK, November 2024, (Cambridge-Tübingen-Fellowship-Program, ELLIS PhD student program) (Published) BibTeX

Empirical Inference Conference Paper The Odyssey of Commonsense Causality: From Foundational Benchmarks to Cutting-Edge Reasoning Cui, S., Jin, Z., Schölkopf, B., Faltings, B. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 16722-16763, (Editors: Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung), Association for Computational Linguistics, November 2024 (Published) URL BibTeX

Empirical Inference Conference Paper RP1M: A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands Zhao*, Y., Chen*, L., Schneider, J., Gao, Q., Kannala, J., Schölkopf, B., Pajarinen, J., Büchler, D. Proceedings of the 8th Annual Conference on Robot Learning (CoRL), 270:5184-5203, Proceedings of Machine Learning Research, (Editors: Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram), PMLR, Conference on Robot Learning, November 2024, *equal contribution (Published) URL BibTeX

Deep Models and Optimization Article NIMBA: Towards Robust and Principled Processing of Point Clouds With SSMs Köprücü, N., Okpekpe, D., Orvieto, A. October 2024 (In preparation) BibTeX

Organizational Leadership and Diversity Conference Paper Is it Part of Me? Exploring Experiences of Inclusive Avatar Use For Visible and Invisible Disabilities in Social VR Angerbauer, K., Van Wagoner, P., Halach, T., Vogelsang, J., Hube, N., Smith, A., Keplinger, K., Sedlmair, M. In Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-15, Association for Computing Machinery, New York, NY, USA, ASSETS '24, October 2024 (Published)
Social Virtual Reality (VR) platforms have surged in popularity in recent years, including among people with disabilities (PWD). Previous research has documented accessibility challenges, harassment, and negative experiences for PWD using disability signifiers in VR, primarily focusing on those with visible disabilities who encounter negative experiences. Yet, little is known about the experiences of people with invisible disabilities in social VR environments, and whether positive experiences are also common. To address these gaps, we designed inclusive avatars (avatars with disability signifiers) and investigated the lived experiences of 26 individuals with both visible and invisible disabilities immersing themselves in social interactions in VRChat for a week. We utilized a mixed methods experience sampling design and multilevel regression to explore the relationships between social interactions of PWD in VR and various psychological outcomes. Our results indicate that PWD, both visible and invisible, experienced positive and negative social interactions in VR. These interactions, in turn, significantly influenced users’ overall experience with inclusive avatars, affecting aspects such as emotional responses, engagement levels, satisfaction with the avatar’s design, and perceptions of inclusion in VR. Qualitative interviews of 18 participants allowed for a more nuanced exploration of the experiences of PWD by giving voice to users who are rarely studied in depth. Findings provided unique insights into both the positive and negative experiences of PWD, as well as identified key design factors influencing user experience in social VR.
DOI URL BibTeX

Safety- and Efficiency-aligned Learning Technical Report A Realistic Threat Model for Large Language Model Jailbreaks Boreiko, V., Panfilov, A., Hein, M., Geiping, J. October 2024 (Submitted)
A plethora of jailbreaking attacks have been proposed to obtain harmful responses from safety-tuned LLMs. In their original settings, these methods all largely succeed in coercing the target output, but their attacks vary substantially in fluency and computational effort. In this work, we propose a unified threat model for the principled comparison of these methods. Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text, and computational budget, in total FLOPs. For the former, we build an N-gram model on 1T tokens, which, in contrast to model-based perplexity, allows for an LLM-agnostic and inherently interpretable evaluation. We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing. After a rigorous comparison, we not only find attack success rates against safety-tuned modern models to be lower than previously presented but also find that attacks based on discrete optimization significantly outperform recent LLM-based attacks. Being inherently interpretable, our threat model allows for a comprehensive analysis and comparison of jailbreak attacks. We find that effective attacks exploit and abuse infrequent N-grams, either selecting N-grams absent from real-world text or rare ones, e.g. specific to code datasets.
URL BibTeX
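The paper's fluency constraint is an N-gram perplexity bound, which is interpretable precisely because an N-gram model's probabilities can be read off from counts. A toy version of the idea, with an add-one-smoothed bigram model standing in for the paper's 1T-token model and invented helper names (a sketch, not the authors' code):

```python
from collections import defaultdict
import math

class BigramModel:
    """Tiny add-one-smoothed bigram model; a stand-in for a large
    N-gram model trained on web-scale text."""

    def __init__(self, corpus_tokens):
        self.unigrams = defaultdict(int)
        self.bigrams = defaultdict(int)
        self.vocab = set(corpus_tokens)
        for a, b in zip(corpus_tokens, corpus_tokens[1:]):
            self.unigrams[a] += 1
            self.bigrams[(a, b)] += 1

    def perplexity(self, tokens):
        # Add-one smoothing keeps unseen bigrams finite and the score interpretable.
        log_prob = 0.0
        v = len(self.vocab)
        for a, b in zip(tokens, tokens[1:]):
            p = (self.bigrams[(a, b)] + 1) / (self.unigrams[a] + v)
            log_prob += math.log(p)
        return math.exp(-log_prob / max(len(tokens) - 1, 1))

def within_threat_model(model, tokens, max_ppl, flops_used, flops_budget):
    """A jailbreak candidate is admissible only if it is both fluent
    (low N-gram perplexity) and cheap enough (within the FLOPs budget)."""
    return model.perplexity(tokens) <= max_ppl and flops_used <= flops_budget
```

Under this kind of dual constraint, attacks producing high-perplexity gibberish or burning excessive compute fall outside the threat model, which is what enables the paper's equal-footing benchmark.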

Autonomous Learning Conference Paper Active Fine-Tuning of Generalist Policies Bagatella, M., Hübotter, J., Martius, G., Krause, A. October 2024 (Submitted) BibTeX

Autonomous Learning Robotics Conference Paper Learning Diverse Skills for Local Navigation under Multi-constraint Optimality Cheng, J., Vlastelica, M., Kolev, P., Li, C., Martius, G. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 5083-5089, October 2024 (Published)
Despite many successful applications of data-driven control in robotics, extracting meaningful diverse behaviors remains a challenge. Typically, task performance needs to be compromised in order to achieve diversity. In many scenarios, task requirements are specified as a multitude of reward terms, each requiring a different trade-off. In this work, we take a constrained optimization viewpoint on the quality-diversity trade-off and show that we can obtain diverse policies while imposing constraints on their value functions which are defined through distinct rewards. In line with previous work, further control of the diversity level can be achieved through an attract-repel reward term motivated by the Van der Waals force. We demonstrate the effectiveness of our method on a local navigation task where a quadruped robot needs to reach the target within a finite horizon. Finally, our trained policies transfer well to the real 12-DoF quadruped robot, Solo12, and exhibit diverse agile behaviors with successful obstacle traversal.
Website DOI URL BibTeX
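The attract-repel term motivated by the Van der Waals force can be illustrated with a Lennard-Jones-style potential over pairwise distances between policy behavior embeddings: strong repulsion below a preferred spacing, mild attraction above it. The exact functional form and constants in the paper may differ; this is a sketch of the mechanism only:

```python
import numpy as np

def attract_repel_reward(z_i, others, d0=1.0, eps=1.0):
    """Lennard-Jones-style attract-repel reward for policy embedding z_i
    against the embeddings of its sibling policies: maximized when
    neighbors sit at the preferred spacing d0, sharply penalized when
    two policies collapse onto each other."""
    total = 0.0
    for z_j in others:
        d = np.linalg.norm(z_i - z_j) + 1e-8
        s = d0 / d
        total += -eps * (s**12 - 2 * s**6)  # potential minimum at d = d0
    return total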

Learning and Dynamical Systems Conference Paper Subgroup-Specific Risk-Controlled Dose Estimation in Radiotherapy Fischer, P., Willms, H., Muehlebach, M., Thorwarth, D., Schneider, M., Baumgartner, C. Medical Image Computing and Computer Assisted Intervention - MICCAI 2024, 696-706, Springer, Cham, 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024), October 2024 (Published) DOI URL BibTeX

Perceiving Systems Conference Paper On predicting 3D bone locations inside the human body Dakri, A., Arora, V., Challier, L., Keller, M., Black, M. J., Pujades, S. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2024, 336-346, Springer, Cham, 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024), October 2024 (Published)
Knowing the precise location of the bones inside the human body is key in several medical tasks, such as patient placement inside an imaging device or surgical navigation inside a patient. Our goal is to predict the bone locations using only an external 3D body surface observation. Existing approaches either validate their predictions on 2D data (X-rays) or with pseudo-ground truth computed from motion capture using biomechanical models. Thus, methods either suffer from a 3D-2D projection ambiguity or directly lack validation on clinical imaging data. In this work, we start with a dataset of segmented skin and long bones obtained from 3D full-body MRI images that we refine into individual bone segmentations. To learn the skin-to-bones correlations, one needs to register the paired data. Few anatomical models allow registering a skeleton and the skin simultaneously. One such method, SKEL, has a skin and skeleton that are jointly rigged with the same pose parameters. However, it lacks the flexibility to adjust the bone locations inside its skin. To address this, we extend SKEL into SKEL-J to allow its bones to fit the segmented bones while its skin fits the segmented skin. These precise fits allow us to train SKEL-J to more accurately infer the anatomical joint locations from the skin surface. Our qualitative and quantitative results show how our bone location predictions are more accurate than all existing approaches. To foster future research, we make available for research purposes the individual bone segmentations, the fitted SKEL-J models as well as the new inference methods.
Project page DOI URL BibTeX

Deep Models and Optimization Conference Paper Loss Landscape Characterization of Neural Networks without Over-Parametrization Islamov, R., Ajroldi, N., Orvieto, A., Lucchi, A. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Thirty-Eighth Annual Conference on Neural Information Processing Systems, October 2024 (Published) URL BibTeX

Deep Models and Optimization Conference Paper Recurrent neural networks: vanishing and exploding gradients are not the end of the story Zucchet, N., Orvieto, A. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Thirty-Eighth Annual Conference on Neural Information Processing Systems, October 2024 (Published) URL BibTeX

Deep Models and Optimization Conference Paper Theoretical Foundations of Deep Selective State-Space Models Muca Cirone, N., Orvieto, A., Walker, B., Salvi, C., Lyons, T. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Thirty-Eighth Annual Conference on Neural Information Processing Systems, October 2024 (Published) URL BibTeX

Deep Models and Optimization Conference Paper Understanding the differences in Foundation Models: Attention, State Space Models, and Recurrent Neural Networks Sieber, J., Amo Alonso, C., Didier, A., Zeilinger, M., Orvieto, A. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Thirty-Eighth Annual Conference on Neural Information Processing Systems, October 2024 (Published) URL BibTeX

Empirical Inference Article A Probabilistic Model behind Self-Supervised Learning Bizeul, A., Schölkopf, B., Allen, C. Transactions on Machine Learning Research, October 2024 (Published) PDF URL BibTeX

Haptic Intelligence Robotic Materials Miscellaneous Active Haptic Feedback for a Virtual Wrist-Anchored User Interface Bartels, J. U., Sanchez-Tamayo, N., Sedlmair, M., Kuchenbecker, K. J. Adjunct Proceedings of the Annual ACM Symposium on User Interface Software and Technology (UIST), (53)1-3, Hands-on demonstration presented at the Annual ACM Symposium on User Interface Software and Technology (UIST), Pittsburgh, USA, October 2024 (Published)
The presented system combines a virtual wrist-anchored user interface (UI) with a new low-profile, wrist-worn device that provides salient and expressive haptic feedback such as contact, pressure, and broad-bandwidth vibration. This active feedback is used to add tactile cues to interactions with virtual mid-air UI elements that track the user's wrist; we demonstrate a simple menu-interaction task to showcase the utility of haptics for interactions with virtual buttons and sliders. Moving forward, we intend to use this platform to develop haptic guidelines for body-anchored interfaces and test multiple haptic devices across the body to create engaging interactions.
DOI BibTeX

Social Foundations of Computation Algorithms and Society Conference Paper Decline Now: A Combinatorial Model for Algorithmic Collective Action Sigg, D., Hardt, M., Mendler-Dünner, C. CHI Conference on Human Factors in Computing Systems, October 2024 (Accepted)
Drivers on food delivery platforms often run a loss on low-paying orders. In response, workers on DoorDash started a campaign, DeclineNow, to purposefully decline orders below a certain pay threshold. For each declined order, the platform returns the request to other available drivers with slightly increased pay. While contributing to an overall pay increase, the strategy comes with the risk of missing out on orders for each individual driver. In this work, we propose a first combinatorial model to study the strategic interaction between workers and the platform. Within our model, we formalize key quantities such as the average worker benefit of the strategy, the benefit of freeriding, as well as the benefit of participation. We extend our theoretical results with simulations. Our key insights show that the average worker gain of the strategy is always positive, while the benefit of participation is positive only for small degrees of labor oversupply. Beyond this point, the utility of participants decreases faster with an increasing degree of oversupply, compared to the utility of non-participants. Our work highlights the significance of labor supply levels for the effectiveness of collective action on gig platforms. We suggest organizing in shifts as a means to reduce oversupply and empower collectives.
arXiv BibTeX
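The decline-and-re-offer dynamic described in the abstract can be illustrated with a toy simulation; all parameters, the acceptance rule, and the pay increment here are hypothetical placeholders, not the paper's combinatorial model:

```python
import random

def simulate_declines(base_pay, threshold, bump, participants, n_drivers, seed=0):
    """Toy model: each round a random driver is offered the order.
    Participating drivers (ids < `participants`) decline while pay is below
    `threshold`; every decline makes the platform re-offer at pay + `bump`."""
    rng = random.Random(seed)
    pay = base_pay
    while True:
        driver = rng.randrange(n_drivers)
        if driver >= participants or pay >= threshold:
            return pay  # accepted by a non-participant, or threshold reached
        pay += bump     # a participant declines; the order is re-offered

# Average accepted pay rises with the share of participating drivers.
few = sum(simulate_declines(2.0, 6.0, 0.25, 2, 10, seed=s) for s in range(200)) / 200
many = sum(simulate_declines(2.0, 6.0, 0.25, 8, 10, seed=s) for s in range(200)) / 200
```

Even this crude sketch reproduces the qualitative effect: with more participants, declined orders are re-offered more often before a non-participant accepts, so the average accepted pay climbs.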

Empirical Inference Article How developments in natural language processing help us in understanding human behaviour Mihalcea, R., Biester, L., Boyd, R. L., Jin, Z., Perez-Rosas, V., Wilson, S., Pennebaker, J. W. Nature Human Behaviour, 8(10):1877-1889, Nature Publishing Group UK London, October 2024 (Published) DOI URL BibTeX

Perceiving Systems Ph.D. Thesis Learning Digital Humans from Vision and Language Feng, Y. ETH Zürich, October 2024 (Published)
The study of realistic digital humans has gained significant attention within the research communities of computer vision, computer graphics, and machine learning. This growing interest is motivated by the crucial understanding of human selves and the essential role digital humans play in enabling the metaverse. Applications span various sectors including virtual presence, fitness, digital fashion, entertainment, humanoid robots and healthcare. However, learning about 3D humans presents significant challenges due to data scarcity. In an era where scalability is crucial for AI, this raises the question: can we enhance the scalability of learning digital humans? To understand this, consider how humans interact: we observe and communicate, forming impressions of others through these interactions. This thesis proposes a similar potential for computers: could they be taught to understand humans by observing and listening? Such an approach would involve processing visual data, like images and videos, and linguistic data from text descriptions. Thus, this research endeavors to enable machines to learn about digital humans from vision and language, both of which are readily available and scalable sources of data. Our research begins by developing a framework to create detailed 3D faces from in-the-wild images. This framework, capable of generating highly realistic and animatable 3D faces from single images, is trained without paired 3D supervision and achieves state-of-the-art accuracy in shape reconstruction. It effectively disentangles identity and expression details, thereby enhancing facial animation. We then explore capturing the body, clothing, face, and hair from monocular videos, using a novel hybrid explicit-implicit 3D representation.
This approach facilitates the disentangled learning of digital humans from monocular videos and allows for the easy transfer of hair and clothing to different bodies, as demonstrated through experiments in disentangled reconstruction, virtual try-ons, and hairstyle transfers. Next, we present a method that utilizes text-visual foundation models to generate highly realistic 3D faces, complete with hair and accessories, based on text descriptions. These foundation models are trained exclusively on in-the-wild images and efficiently produce detailed and realistic outputs, facilitating the creation of authentic avatars. Finally, we introduce a framework that employs Large Language Models (LLMs) to interpret and generate 3D human poses from both images and text. This method, inspired by how humans intuitively understand postures, merges image interpretation with body language analysis. By embedding SMPL poses into a multimodal LLM, our approach not only integrates semantic reasoning but also enhances the generation and understanding of 3D poses, utilizing the comprehensive capabilities of LLMs. Additionally, the use of LLMs facilitates interactive discussions with users about human poses, enriching human-computer interactions. Our research on digital humans significantly boosts scalability and controllability. By generating digital humans from images, videos, and text, we democratize their creation, making it broadly accessible through everyday imagery and straightforward text, while enhancing generalization. Disentangled modeling and interactive chatting with human poses increase the controllability of digital humans and improve user interactions and customizations, showcasing their potential to extend into various disciplines.
pdf DOI URL BibTeX

Empirical Inference Conference Paper Redesigning Information Markets in the Era of Language Models Weiss, M., Rahaman, N., Wüthrich, M., Bengio, Y., Li, L. E., Schölkopf, B., Pal, C. First Conference on Language Modeling (COLM), arXiv:2403.14443, October 2024 (Published)
This work addresses the buyer's inspection paradox for information markets. The paradox is that buyers need to access information to determine its value, while sellers need to limit access to prevent theft. To study this, we introduce an open-source simulated digital marketplace where intelligent agents, powered by language models, buy and sell information on behalf of external participants. The central mechanism enabling this marketplace is the agents' dual capabilities: they not only have the capacity to assess the quality of privileged information but also come equipped with the ability to forget. This ability to induce amnesia allows vendors to grant temporary access to proprietary information, significantly reducing the risk of unauthorized retention while enabling agents to accurately gauge the information's relevance to specific queries or tasks. To perform well, agents must make rational decisions, strategically explore the marketplace through generated sub-queries, and synthesize answers from purchased information. Concretely, our experiments (a) uncover biases in language models leading to irrational behavior and evaluate techniques to mitigate these biases, (b) investigate how price affects demand in the context of informational goods, and (c) show that inspection and higher budgets both lead to higher quality outcomes.
arXiv URL BibTeX

Neural Capture and Synthesis Perceiving Systems Conference Paper Stable Video Portraits Ostrek, M., Thies, J. In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, European Conference on Computer Vision (ECCV 2024), October 2024 (Published)
Rapid advances in the field of generative AI and text-to-image methods in particular have transformed the way we interact with and perceive computer-generated imagery today. In parallel, much progress has been made in 3D face reconstruction, using 3D Morphable Models (3DMM). In this paper, we present Stable Video Portraits, a novel hybrid 2D/3D generation method that outputs photorealistic videos of talking faces leveraging a large pre-trained text-to-image prior (2D), controlled via a 3DMM (3D). Specifically, we introduce a person-specific fine-tuning of a general 2D stable diffusion model which we lift to a video model by providing temporal 3DMM sequences as conditioning and by introducing a temporal denoising procedure. As an output, this model generates temporally smooth imagery of a person with 3DMM-based controls, i.e., a person-specific avatar. The facial appearance of this person-specific avatar can be edited and morphed to text-defined celebrities, without any test-time fine-tuning. The method is analyzed quantitatively and qualitatively, and we show that our method outperforms state-of-the-art monocular head avatar methods.
URL BibTeX

Neural Capture and Synthesis Perceiving Systems Conference Paper Synthesizing Environment-Specific People in Photographs Ostrek, M., O’Sullivan, C., Black, M., Thies, J. In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, European Conference on Computer Vision (ECCV 2024), October 2024 (Published)
We present ESP, a novel method for context-aware full-body generation, that enables photo-realistic synthesis and inpainting of people wearing clothing that is semantically appropriate for the scene depicted in an input photograph. ESP is conditioned on a 2D pose and contextual cues that are extracted from the photograph of the scene and integrated into the generation process, where the clothing is modeled explicitly with human parsing masks (HPM). Generated HPMs are used as tight guiding masks for inpainting, such that no changes are made to the original background. Our models are trained on a dataset containing a set of in-the-wild photographs of people covering a wide range of different environments. The method is analyzed quantitatively and qualitatively, and we show that ESP outperforms the state-of-the-art on the task of contextual full-body generation.
URL BibTeX

Perceiving Systems Conference Paper HUMOS: Human Motion Model Conditioned on Body Shape Tripathi, S., Taheri, O., Lassner, C., Black, M. J., Holden, D., Stoll, C. In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, October 2024 (Published)
Generating realistic human motion is essential for many computer vision and graphics applications. The wide variety of human body shapes and sizes greatly impacts how people move. However, most existing motion models ignore these differences, relying on a standardized, average body. This leads to uniform motion across different body types, where movements don't match their physical characteristics, limiting diversity. To solve this, we introduce a new approach to develop a generative motion model based on body shape. We show that it's possible to train this model using unpaired data by applying cycle consistency, intuitive physics, and stability constraints, which capture the relationship between identity and movement. The resulting model generates diverse, physically plausible, and dynamically stable human motions that are both quantitatively and qualitatively more realistic than current state-of-the-art methods.
project arXiv BibTeX
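The cycle-consistency constraint above can be sketched schematically: retargeting a motion to another body shape and then back should recover the original motion. Here a toy linear scaling stands in for the learned shape-conditioned model; the motion tensor layout is an assumption for illustration:

```python
import numpy as np

def retarget(motion, scale):
    """Toy shape-conditioned retargeting: scale the joint trajectories.
    In a learned model this would be a network conditioned on body shape."""
    return motion * scale

def cycle_consistency_loss(motion, scale):
    """Map a motion to another body shape and back; penalize the discrepancy
    between the round-trip result and the original motion."""
    recovered = retarget(retarget(motion, scale), 1.0 / scale)
    return float(np.abs(motion - recovered).mean())

motion = np.random.default_rng(1).normal(size=(10, 24, 3))  # frames x joints x xyz
loss = cycle_consistency_loss(motion, scale=1.3)
```

The point of the constraint is that it needs no paired data: only the round trip is supervised, which is what lets the paper train on unpaired motions.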

Robotic Materials Article Hexagonal electrohydraulic modules for rapidly reconfigurable high-speed robots Yoder, Z., Rumley, E., Schmidt, I., Rothemund, P., Keplinger, C. Science Robotics, 9, September 2024 (Published)
Robots made from reconfigurable modular units feature versatility, cost efficiency, and improved sustainability compared with fixed designs. Reconfigurable modules driven by soft actuators provide adaptable actuation, safe interaction, and wide design freedom, but existing soft modules would benefit from high-speed and high-strain actuation, as well as driving methods well-suited to untethered operation. Here, we introduce a class of electrically actuated robotic modules that provide high-speed (a peak contractile strain rate of 4618% per second, 15.8-hertz bandwidth, and a peak specific power of 122 watts per kilogram), high-strain (49% contraction) actuation and that use magnets for reversible mechanical and electrical connections between neighboring modules, thereby serving as building blocks for rapidly reconfigurable and highly agile robotic systems. The actuation performance of each hexagonal electrohydraulic (HEXEL) module is enabled by a synergistic combination of soft and rigid components; a hexagonal exoskeleton of rigid plates amplifies the motion produced by soft electrohydraulic actuators and provides a mechanical structure and connection platform for reconfigurable robots composed of many modules. We characterize the actuation performance of individual HEXEL modules, present a model that captures their quasi-static force-stroke behavior, and demonstrate both a high-jumping and a fast pipe-crawling robot. Using embedded magnetic connections, we arranged multiple modules into reconfigurable robots with diverse functionality, including a high-stroke muscle, a multimodal active array, a table-top active platform, and a fast-rolling robot. We further leveraged the magnetic connections for hosting untethered, snap-on driving electronics, together highlighting the promise of HEXEL modules for creating rapidly reconfigurable high-speed robots.
Video PDF DOI URL BibTeX

Perceiving Systems Conference Paper A Unified Approach for Text- and Image-guided 4D Scene Generation Zheng, Y., Li, X., Nagano, K., Liu, S., Hilliges, O., Mello, S. D. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 7300-7309, Piscataway, NJ, CVPR, September 2024 (Published)
Large-scale diffusion generative models are greatly simplifying image, video and 3D asset creation from user-provided text prompts and images. However, the challenging problem of text-to-4D dynamic 3D scene generation with diffusion guidance remains largely unexplored. We propose Dream-in-4D, which features a novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D diffusion guidance to effectively learn a high-quality static 3D asset in the first stage; (2) a deformable neural radiance field that explicitly disentangles the learned static asset from its deformation, preserving quality during motion learning; and (3) a multi-resolution feature grid for the deformation field with a displacement total variation loss to effectively learn motion with video diffusion guidance in the second stage. Through a user preference study, we demonstrate that our approach significantly advances image and motion quality, 3D consistency and text fidelity for text-to-4D generation compared to baseline approaches. Thanks to its motion-disentangled representation, Dream-in-4D can also be easily adapted for controllable generation where appearance is defined by one or multiple images, without the need to modify the motion learning stage. Thus, our method offers, for the first time, a unified approach for text-to-4D, image-to-4D and personalized 4D generation tasks.
paper project code DOI URL BibTeX
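The displacement total variation loss mentioned above can be sketched on a toy deformation grid. The L1 form and the (D, H, W, 3) grid layout are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def displacement_tv_loss(disp_grid):
    """Total variation on a displacement field stored in a (D, H, W, 3) grid:
    sum of absolute differences between axially adjacent displacement vectors,
    encouraging spatially smooth deformations."""
    tv = 0.0
    for axis in range(3):  # the three spatial axes of the grid
        tv += np.abs(np.diff(disp_grid, axis=axis)).sum()
    return tv

smooth = np.zeros((4, 4, 4, 3))  # constant field: zero total variation
noisy = np.random.default_rng(0).normal(size=(4, 4, 4, 3))
```

A constant field incurs zero penalty while a noisy one is penalized, which is how such a regularizer keeps motion learned from video diffusion guidance spatially coherent.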

Perceiving Systems Conference Paper Generative Proxemics: A Prior for 3D Social Interaction from Images Müller, L., Ye, V., Pavlakos, G., Black, M., Kanazawa, A. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 9687-9697, Piscataway, NJ, CVPR, September 2024 (Published)
Social interaction is a fundamental aspect of human behavior and communication. The way individuals position themselves in relation to others, also known as proxemics, conveys social cues and affects the dynamics of social interaction. Reconstructing such interaction from images presents challenges because of mutual occlusion and the limited availability of large training datasets. To address this, we present a novel approach that learns a prior over the 3D proxemics of two people in close social interaction and demonstrate its use for single-view 3D reconstruction. We start by creating 3D training data of interacting people using image datasets with contact annotations. We then model the proxemics using a novel denoising diffusion model called BUDDI that learns the joint distribution over the poses of two people in close social interaction. Sampling from our generative proxemics model produces realistic 3D human interactions, which we validate through a perceptual study. We use BUDDI in reconstructing two people in close proximity from a single image without any contact annotation via an optimization approach that uses the diffusion model as a prior. Our approach recovers accurate and plausible 3D social interactions from noisy initial estimates, outperforming state-of-the-art methods. Our code, data, and model are available at our project website.
arXiv project code data DOI URL BibTeX

Empirical Inference Conference Paper GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs Gao, G., Liu, W., Chen, A., Geiger, A., Schölkopf, B. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 21295-21304, IEEE, Piscataway, NJ, CVPR, September 2024 (Published) DOI URL BibTeX

Perceiving Systems Conference Paper Real-time Monocular Full-body Capture in World Space via Sequential Proxy-to-Motion Learning Zhang, H., Zhang, Y., Hu, L., Zhang, J., Yi, H., Zhang, S., Liu, Y. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1954 - 1964, Piscataway, NJ, CVPR, September 2024 (Published)
Learning-based approaches to monocular motion capture have recently shown promising results by learning to regress in a data-driven manner. However, due to the challenges in data collection and network designs, it remains challenging for existing solutions to achieve real-time full-body capture while being accurate in world space. In this work, we introduce ProxyCap, a human-centric proxy-to-motion learning scheme to learn world-space motions from a proxy dataset of 2D skeleton sequences and 3D rotational motions. Such proxy data enables us to build a learning-based network with accurate world-space supervision while also mitigating the generalization issues. For more accurate and physically plausible predictions in world space, our network is designed to learn human motions from a human-centric perspective, which enables the understanding of the same motion captured with different camera trajectories. Moreover, a contact-aware neural motion descent module is proposed in our network so that it can be aware of foot-ground contact and motion misalignment with the proxy observations. With the proposed learning-based solution, we demonstrate the first real-time monocular full-body capture system with plausible foot-ground contact in world space even using hand-held moving cameras.
arxiv project DOI URL BibTeX

Perceiving Systems Neural Capture and Synthesis Human-centric Vision & Learning Conference Paper Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles Sklyarova, V., Zakharov, E., Hilliges, O., Black, M. J., Thies, J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 4703-4712, Piscataway, NJ, CVPR, September 2024 (Published)
We present HAAR, a new strand-based generative model for 3D human hairstyles. Specifically, based on textual inputs, HAAR produces 3D hairstyles that could be used as production-level assets in modern computer graphics engines. Current AI-based generative models take advantage of powerful 2D priors to reconstruct 3D content in the form of point clouds, meshes, or volumetric functions. However, by using the 2D priors, they are intrinsically limited to only recovering the visual parts. Highly occluded hair structures cannot be reconstructed with those methods, and they only model the "outer shell", which is not ready to be used in physics-based rendering or simulation pipelines. In contrast, we propose the first text-guided generative method that uses 3D hair strands as an underlying representation. Leveraging 2D visual question-answering (VQA) systems, we automatically annotate synthetic hair models that are generated from a small set of artist-created hairstyles. This allows us to train a latent diffusion model that operates in a common hairstyle UV space. In qualitative and quantitative studies, we demonstrate the capabilities of the proposed model and compare it to existing hairstyle generation approaches.
ArXiv Code DOI URL BibTeX

Perceiving Systems Conference Paper ChatPose: Chatting about 3D Human Pose Feng, Y., Lin, J., Dwivedi, S. K., Sun, Y., Patel, P., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2093-2103, Piscataway, NJ, CVPR, September 2024 (Published)
We introduce ChatPose, a framework employing Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions. Our work is motivated by the human ability to intuitively understand postures from a single image or a brief description, a process that intertwines image interpretation, world knowledge, and an understanding of body language. Traditional human pose estimation and generation methods often operate in isolation, lacking semantic understanding and reasoning abilities. ChatPose addresses these limitations by embedding SMPL poses as distinct signal tokens within a multimodal LLM, enabling the direct generation of 3D body poses from both textual and visual inputs. Leveraging the powerful capabilities of multimodal LLMs, ChatPose unifies classical 3D human pose and generation tasks while offering user interactions. Additionally, ChatPose empowers LLMs to apply their extensive world knowledge in reasoning about human poses, leading to two advanced tasks: speculative pose generation and reasoning about pose estimation. These tasks involve reasoning about humans to generate 3D poses from subtle text queries, possibly accompanied by images. We establish benchmarks for these tasks, moving beyond traditional 3D pose generation and estimation methods. Our results show that ChatPose outperforms existing multimodal LLMs and task-specific methods on these newly proposed tasks. Furthermore, ChatPose's ability to understand and generate 3D human poses based on complex reasoning opens new directions in human pose analysis.
Arxiv Project DOI URL BibTeX

Perceiving Systems Conference Paper EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling Liu, H., Zhu, Z., Becherini, G., Peng, Y., Su, M., Zhou, Y., Zhe, X., Iwamoto, N., Zheng, B., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 1144-1154, Piscataway, NJ, CVPR, September 2024 (Published)
We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hands, and global movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines a MoShed SMPL-X body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion captured dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer, facilitating joint training on audio-to-gesture generation and masked gesture reconstruction to effectively encode audio and body gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover, EMAGE adaptively merges speech features from the audio's rhythm and content and utilizes four compositional VQ-VAEs to enhance the results' fidelity and diversity. Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance and is flexible in accepting predefined spatial-temporal gesture inputs, generating complete, audio-synchronized results. Our code and dataset are available.
arXiv project dataset code gradio colab video DOI URL BibTeX

Perceiving Systems Conference Paper HIT: Estimating Internal Human Implicit Tissues from the Body Surface Keller, M., Arora, V., Dakri, A., Chandhok, S., Machann, J., Fritsche, A., Black, M. J., Pujades, S. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 3480-3490, Piscataway, NJ, CVPR, September 2024 (Published)
The creation of personalized anatomical digital twins is important in the fields of medicine, computer graphics, sports science, and biomechanics. To observe a subject's anatomy, expensive medical devices (MRI or CT) are required and the creation of the digital model is often time-consuming and involves manual effort. Instead, we leverage the fact that the shape of the body surface is correlated with the internal anatomy; for example, from surface observations alone, one can predict body composition and skeletal structure. In this work, we go further and learn to infer the 3D location of three important anatomic tissues: subcutaneous adipose tissue (fat), lean tissue (muscles and organs), and long bones. To learn to infer these tissues, we tackle several key challenges. We first create a dataset of human tissues by segmenting full-body MRI scans and registering the SMPL body mesh to the body surface. With this dataset, we train HIT (Human Implicit Tissues), an implicit function that, given a point inside a body, predicts its tissue class. HIT leverages the SMPL body model shape and pose parameters to canonicalize the medical data. Unlike SMPL, which is trained from upright 3D scans, the MRI scans are taken of subjects lying on a table, resulting in significant soft-tissue deformation. Consequently, HIT uses a learned volumetric deformation field that undoes these deformations. Since HIT is parameterized by SMPL, we can repose bodies or change the shape of subjects and the internal structures deform appropriately. We perform extensive experiments to validate HIT's ability to predict plausible internal structure for novel subjects. The dataset and HIT model are publicly available to foster future research in this direction.
Project page Paper DOI URL BibTeX
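The central object here, an implicit function mapping a canonical-space query point (together with a body-shape code) to a tissue class, can be sketched as follows. The tiny randomly initialized MLP and the 10-dimensional shape code are placeholders for illustration, not the trained HIT network:

```python
import numpy as np

TISSUES = ["fat", "lean", "bone"]  # adipose, lean tissue, long bones

def init_mlp(in_dim, hidden, out_dim, seed=0):
    """Random toy parameters standing in for trained weights."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0, 0.5, (in_dim, hidden)), "b1": np.zeros(hidden),
        "W2": rng.normal(0, 0.5, (hidden, out_dim)), "b2": np.zeros(out_dim),
    }

def tissue_logits(params, points, shape_code):
    """Implicit function f(x, beta) -> per-tissue logits.
    `points` is (N, 3) canonical-space queries; `shape_code` is a (B,)
    body-shape vector broadcast to every query point."""
    x = np.concatenate([points, np.tile(shape_code, (len(points), 1))], axis=1)
    h = np.tanh(x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

params = init_mlp(in_dim=3 + 10, hidden=32, out_dim=len(TISSUES))
pts = np.array([[0.0, 0.1, 0.2], [0.3, -0.2, 0.0]])
beta = np.zeros(10)  # placeholder SMPL-like shape code
labels = [TISSUES[i] for i in np.argmax(tissue_logits(params, pts, beta), axis=1)]
```

Because the function takes arbitrary 3D points rather than a fixed voxel grid, it can be queried at any resolution inside a reposed or reshaped body, which is what conditioning on the SMPL parameters buys.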

Perceiving Systems Conference Paper HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video Fan, Z., Parelli, M., Kadoglou, M. E., Kocabas, M., Chen, X., Black, M. J., Hilliges, O. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 494-504, Piscataway, NJ, CVPR, September 2024 (Published)
Since humans interact with diverse objects every day, the holistic 3D capture of these interactions is important to understand and model human behaviour. However, most existing methods for hand-object reconstruction from RGB either assume pre-scanned object templates or heavily rely on limited 3D hand-object data, restricting their ability to scale and generalize to more unconstrained interaction settings. To this end, we introduce HOLD -- the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video. We develop a compositional articulated implicit model that can reconstruct disentangled 3D hand and object from 2D images. We also further incorporate hand-object constraints to improve hand-object poses and consequently the reconstruction quality. Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings. Moreover, we qualitatively show its robustness in reconstructing from in-the-wild videos.
Paper Project Code DOI URL BibTeX

Perceiving Systems Conference Paper HUGS: Human Gaussian Splats Kocabas, M., Chang, R., Gabriel, J., Tuzel, O., Ranjan, A. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 505-515, Piscataway, NJ, CVPR, September 2024 (Published)
Recent advances in neural rendering have improved both training and rendering times by orders of magnitude. While these methods demonstrate state-of-the-art quality and speed, they are designed for photogrammetry of static scenes and do not generalize well to freely moving humans in the environment. In this work, we introduce Human Gaussian Splats (HUGS), which represents an animatable human together with the scene using 3D Gaussian Splatting (3DGS). Our method takes only a monocular video with a small number of (50-100) frames, and it automatically learns to disentangle the static scene and a fully animatable human avatar within 30 minutes. We utilize the SMPL body model to initialize the human Gaussians. To capture details that are not modeled by SMPL (e.g., cloth, hairs), we allow the 3D Gaussians to deviate from the human body model. Utilizing 3D Gaussians for animated humans brings new challenges, including the artifacts created when articulating the Gaussians. We propose to jointly optimize the linear blend skinning weights to coordinate the movements of individual Gaussians during animation. Our approach enables novel-pose synthesis of human and novel view synthesis of both the human and the scene. We achieve state-of-the-art rendering quality with a rendering speed of 60 FPS while being ∼100× faster to train over previous work.
arXiv Github Project Page YouTube Poster DOI URL BibTeX
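The skinning-weight optimization above builds on standard linear blend skinning (LBS). A minimal sketch of LBS applied to point centers, with toy bones and weights rather than the HUGS pipeline:

```python
import numpy as np

def lbs_points(points, weights, bone_transforms):
    """Linear blend skinning: each point moves with a convex combination
    of per-bone rigid transforms.
    points: (N, 3); weights: (N, B) rows summing to 1;
    bone_transforms: (B, 4, 4) homogeneous matrices."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # (N, 4)
    # Blend the transforms per point, then apply: T_i = sum_b w_ib * G_b
    blended = np.einsum("nb,bij->nij", weights, bone_transforms)        # (N, 4, 4)
    out = np.einsum("nij,nj->ni", blended, homo)
    return out[:, :3]

# Two bones: identity, and a +1 translation along x.
G = np.stack([np.eye(4), np.eye(4)])
G[1, 0, 3] = 1.0
pts = np.zeros((3, 3))
w = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
moved = lbs_points(pts, w, G)  # points follow their blended bones
```

In this framing, making the weight matrix a learnable quantity per Gaussian, as the abstract describes, lets the optimizer fix articulation artifacts that fixed SMPL weights would leave behind.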

Perceiving Systems Conference Paper SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes Sanyal, S., Ghosh, P., Yang, J., Black, M. J., Thies, J., Bolkart, T. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2362-2371, Piscataway, NJ, CVPR, September 2024 (Published)
We present SCULPT, a novel 3D generative model for clothed and textured 3D meshes of humans. Specifically, we devise a deep neural network that learns to represent the geometry and appearance distribution of clothed human bodies. Training such a model is challenging, as datasets of textured 3D meshes for humans are limited in size and accessibility. Our key observation is that there exist medium-sized 3D scan datasets like CAPE, as well as large-scale 2D image datasets of clothed humans, and that multiple appearances can be mapped to a single geometry. To effectively learn from the two data modalities, we propose an unpaired learning procedure for pose-dependent clothed and textured human meshes. Specifically, we learn a pose-dependent geometry space from 3D scan data. We represent this as per-vertex displacements w.r.t. the SMPL model. Next, we train a geometry-conditioned texture generator in an unsupervised way using the 2D image data. We use intermediate activations of the learned geometry model to condition our texture generator. To alleviate entanglement between pose and clothing type, and pose and clothing appearance, we condition both the texture and geometry generators with attribute labels such as clothing types for the geometry, and clothing colors for the texture generator. We automatically generate these conditioning labels for the 2D images based on the visual question answering model BLIP and CLIP. We validate our method on the SCULPT dataset, and compare to state-of-the-art 3D generative models for clothed human bodies.
Project page Data Code Video Arxiv DOI URL BibTeX