Publications

DEPARTMENTS

Empirical Inference

Haptic Intelligence

Modern Magnetic Systems

Perceiving Systems

Physical Intelligence

Robotic Materials

Social Foundations of Computation


Research Groups

Autonomous Vision

Autonomous Learning

Bioinspired Autonomous Miniature Robots

Dynamic Locomotion

Embodied Vision

Human Aspects of Machine Learning

Intelligent Control Systems

Learning and Dynamical Systems

Locomotion in Biorobotic and Somatic Systems

Micro, Nano, and Molecular Systems

Movement Generation and Control

Neural Capture and Synthesis

Physics for Inference and Optimization

Organizational Leadership and Diversity

Probabilistic Learning Group


Haptic Intelligence Miscellaneous Demonstration: OCRA - A Kinematic Retargeting Algorithm for Expressive Whole-Arm Teleoperation Mohan, M., Kuchenbecker, K. J. Hands-on demonstration presented at the Conference on Robot Learning (CoRL), Munich, Germany, November 2024 (Published)
Traditional teleoperation systems focus on controlling the pose of the end-effector (task space), often neglecting the additional degrees of freedom present in human and many robotic arms. This demonstration presents the Optimization-based Customizable Retargeting Algorithm (OCRA), which was designed to map motions from one serial kinematic chain to another in real time. OCRA is versatile, accommodating any robot joint counts and segment lengths, and it can retarget motions from human arms to kinematically different serial robot arms with revolute joints both expressively and efficiently. One of OCRA's key features is its customizability, allowing the user to adjust the emphasis between hand orientation error and the configuration error of the arm's central line, which we call the arm skeleton. To evaluate the perceptual quality of the motions generated by OCRA, we conducted a video-watching study with 70 participants; the results indicated that the algorithm produces robot motions that closely resemble human movements, with a median rating of 78/100, particularly when the arm skeleton error weight and hand orientation error are balanced. In this demonstration, the presenter will wear an Xsens MVN Link and teleoperate the arms of a NAO child-size humanoid robot to highlight OCRA's ability to create intuitive and human-like whole-arm motions.
BibTeX
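
To make the trade-off described in the OCRA abstract above more concrete, the following is a minimal, hypothetical sketch of a weighted retargeting objective that balances hand-orientation error against arm-skeleton configuration error. It is not the published algorithm; the forward-kinematics callback `fk`, the error definitions, and the default weights are illustrative assumptions.

```python
# Hypothetical sketch of a weighted kinematic-retargeting objective (not the published OCRA code).
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation as R

def hand_orientation_error(robot_hand_rot, human_hand_rot):
    """Geodesic angle (rad) between the robot and human hand orientations."""
    return R.from_matrix(robot_hand_rot.T @ human_hand_rot).magnitude()

def skeleton_error(robot_points, human_points):
    """Mean distance between sampled points along the two arm center lines."""
    return np.mean(np.linalg.norm(robot_points - human_points, axis=1))

def retargeting_cost(q, fk, human_points, human_hand_rot, w_skel=0.5, w_hand=0.5):
    """fk(q) -> (sampled skeleton points, hand rotation matrix) of the robot arm."""
    robot_points, robot_hand_rot = fk(q)
    return (w_skel * skeleton_error(robot_points, human_points)
            + w_hand * hand_orientation_error(robot_hand_rot, human_hand_rot))

# Per control cycle (given a robot forward-kinematics function robot_fk and a seed pose q_prev):
# q_star = minimize(retargeting_cost, q_prev,
#                   args=(robot_fk, human_points, human_hand_rot)).x
```

Sweeping w_skel and w_hand mimics the user-adjustable emphasis between hand orientation and arm-skeleton configuration that the abstract highlights.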

Empirical Inference Conference Paper Diffusion-based learning of contact plans for agile locomotion Dhédin, V., Ravi, A. K. C., Jordana, A., Zhu, H., Meduri, A., Righetti, L., Schölkopf, B., Khadiv, M. IEEE-RAS 23rd International Conference on Humanoid Robots (Humanoids), 637-644, IEEE, November 2024 (Published) DOI URL BibTeX

Empirical Inference Conference Paper Do LLMs Think Fast and Slow? A Causal Study on Sentiment Analysis Lyu*, Z., Jin*, Z., Gonzalez, F., Mihalcea, R., Schölkopf, B., Sachan, M. Findings of the Association for Computational Linguistics: EMNLP, 9353-9372, (Editors: Yaser Al-Onaizan and Mohit Bansal and Yun-Nung Chen), Association for Computational Linguistics, November 2024, *equal contribution (Published) DOI URL BibTeX

Empirical Inference Conference Paper Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis Jenny*, D. F., Billeter*, Y., Schölkopf, B., Jin, Z. Proceedings of the Third Workshop on NLP for Positive Impact, 152-178, (Editors: Dementieva, Daryna and Ignat, Oana and Jin, Zhijing and Mihalcea, Rada and Piatti, Giorgio and Tetreault, Joel and Wilson, Steven and Zhao, Jieyu), Association for Computational Linguistics, November 2024, *equal contribution (Published) URL BibTeX

Empirical Inference Conference Paper Implicit Personalization in Language Models: A Systematic Study Jin, Z., Heil, N., Liu, J., Dhuliawala, S., Qi, Y., Schölkopf, B., Mihalcea, R., Sachan, M. Findings of the Association for Computational Linguistics: EMNLP, 12309-12325, (Editors: Yaser Al-Onaizan and Mohit Bansal and Yun-Nung Chen), Association for Computational Linguistics, November 2024 (Published) DOI URL BibTeX

Perceiving Systems Ph.D. Thesis Leveraging Unpaired Data for the Creation of Controllable Digital Humans Sanyal, S. Max Planck Institute for Intelligent Systems and Eberhard Karls Universität Tübingen, November 2024 (Published)
Digital humans have grown increasingly popular, offering transformative potential across various fields such as education, entertainment, and healthcare. They enrich user experiences by providing immersive and personalized interactions. Enhancing these experiences involves making digital humans controllable, allowing for manipulation of aspects like pose and appearance, among others. Learning to create such controllable digital humans necessitates extensive data from diverse sources. This includes 2D human images alongside their corresponding 3D geometry and texture, 2D images showcasing similar appearances across a wide range of body poses, etc., for effective control over pose and appearance. However, the availability of such “paired data” is limited, making its collection both time-consuming and expensive. Despite these challenges, there is an abundance of unpaired 2D images with accessible, inexpensive labels—such as identity, type of clothing, appearance of clothing, etc. This thesis capitalizes on these affordable labels, employing informed observations from “unpaired data” to facilitate the learning of controllable digital humans through reconstruction, transposition, and generation processes. The presented methods—RingNet, SPICE, and SCULPT—each tackles different aspects of controllable digital human modeling. RingNet (Sanyal et al. [2019]) exploits the consistent facial geometry across different images of the same individual to estimate 3D face shapes and poses without 2D-to-3D supervision. This method illustrates how leveraging the inherent properties of unpaired images—such as identity consistency—can circumvent the need for expensive paired datasets. Similarly, SPICE (Sanyal et al. [2021]) employs a self-supervised learning framework that harnesses unpaired images to generate realistic transpositions of human poses by understanding the underlying 3D body structure and maintaining consistency in body shape and appearance features across different poses. Finally, SCULPT (Sanyal et al. [2024]) generates clothed and textured 3D meshes by integrating insights from unpaired 2D images and medium-sized 3D scans. This process employs an unpaired learning approach, conditioning texture and geometry generation on attributes easily derived from data, like the type and appearance of clothing. In conclusion, this thesis highlights how unpaired data and innovative learning techniques can address the challenges of data scarcity and high costs in developing controllable digital humans by advancing reconstruction, transposition, and generation techniques.
download BibTeX

Empirical Inference Ph.D. Thesis On Principled Modeling of Inductive Bias in Machine Learning Liu, W. University of Cambridge, UK, Cambridge, November 2024, (Cambridge-Tübingen-Fellowship-Program, ELLIS PhD student program) (Published) BibTeX

Empirical Inference Conference Paper The Odyssey of Commonsense Causality: From Foundational Benchmarks to Cutting-Edge Reasoning Cui, S., Jin, Z., Schölkopf, B., Faltings, B. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 16722-16763, (Editors: Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung), Association for Computational Linguistics, November 2024 (Published) URL BibTeX

Empirical Inference Conference Paper RP1M: A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands Zhao*, Y., Chen*, L., Schneider, J., Gao, Q., Kannala, J., Schölkopf, B., Pajarinen, J., Büchler, D. Proceedings of the 8th Annual Conference on Robot Learning (CoRL), 270:5184-5203, Proceedings of Machine Learning Research, (Editors: Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram), PMLR, Conference on Robot Learning, November 2024, *equal contribution (Published) URL BibTeX

Deep Models and Optimization Article NIMBA: Towards Robust and Principled Processing of Point Clouds With SSMs Köprücü, N., Okpekpe, D., Orvieto, A. October 2024 (In preparation) BibTeX

Organizational Leadership and Diversity Conference Paper Is it Part of Me? Exploring Experiences of Inclusive Avatar Use For Visible and Invisible Disabilities in Social VR Angerbauer, K., Van Wagoner, P., Halach, T., Vogelsang, J., Hube, N., Smith, A., Keplinger, K., Sedlmair, M. In Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility, 1-15, Association for Computing Machinery, New York, NY, USA, ASSETS '24, October 2024 (Published)
Social Virtual Reality (VR) platforms have surged in popularity in recent years, including among people with disabilities (PWD). Previous research has documented accessibility challenges, harassment, and negative experiences for PWD using disability signifiers in VR, primarily focusing on those with visible disabilities who encounter negative experiences. Yet, little is known about the experiences of people with invisible disabilities in social VR environments, and whether positive experiences are also common. To address these gaps, we designed inclusive avatars (avatars with disability signifiers) and investigated the lived experiences of 26 individuals with both visible and invisible disabilities immersing themselves in social interactions in VRChat for a week. We utilized a mixed methods experience sampling design and multilevel regression to explore the relationships between social interactions of PWD in VR and various psychological outcomes. Our results indicate that PWD, both visible and invisible, experienced positive and negative social interactions in VR. These interactions, in turn, significantly influenced users’ overall experience with inclusive avatars, affecting aspects such as emotional responses, engagement levels, satisfaction with the avatar’s design, and perceptions of inclusion in VR. Qualitative interviews of 18 participants allowed for a more nuanced exploration of the experiences of PWD by giving voice to users who are rarely studied in depth. Findings provided unique insights into both the positive and negative experiences of PWD, as well as identified key design factors influencing user experience in social VR.
Inclusive Avatar Use For Visible and Invisible Disabilities in Social VR Inclusive Avatar Use for Social VR DOI URL BibTeX

Safety- and Efficiency-aligned Learning Technical Report A Realistic Threat Model for Large Language Model Jailbreaks Boreiko, V., Panfilov, A., Hein, M., Geiping, J. October 2024 (Submitted)
A plethora of jailbreaking attacks have been proposed to obtain harmful responses from safety-tuned LLMs. In their original settings, these methods all largely succeed in coercing the target output, but their attacks vary substantially in fluency and computational effort. In this work, we propose a unified threat model for the principled comparison of these methods. Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text, and computational budget, in total FLOPs. For the former, we build an N-gram model on 1T tokens, which, in contrast to model-based perplexity, allows for an LLM-agnostic and inherently interpretable evaluation. We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing. After a rigorous comparison, we not only find attack success rates against safety-tuned modern models to be lower than previously presented but also find that attacks based on discrete optimization significantly outperform recent LLM-based attacks. Being inherently interpretable, our threat model allows for a comprehensive analysis and comparison of jailbreak attacks. We find that effective attacks exploit and abuse infrequent N-grams, either selecting N-grams absent from real-world text or rare ones, e.g. specific to code datasets.
URL BibTeX
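
As a rough illustration of the two constraints the abstract above combines, the sketch below checks a candidate jailbreak against a perplexity ceiling computed from an N-gram (here bigram) model and against a total-FLOPs budget. The smoothing scheme, thresholds, budget, and toy counts are hypothetical placeholders, not values from the report.

```python
# Hypothetical sketch of the two threat-model constraints (all numbers are placeholders).
import math
from collections import Counter

def bigram_log_perplexity(tokens, bigram_counts, unigram_counts, vocab_size):
    """Add-one-smoothed bigram log-perplexity of a candidate prompt."""
    log_prob = 0.0
    for prev, curr in zip(tokens, tokens[1:]):
        num = bigram_counts[(prev, curr)] + 1
        den = unigram_counts[prev] + vocab_size
        log_prob += math.log(num / den)
    return -log_prob / max(len(tokens) - 1, 1)

def admissible(tokens, bigram_counts, unigram_counts, vocab_size,
               flops_spent, max_log_ppl=8.0, flops_budget=1e17):
    """An attack counts only if it stays fluent AND fits within the compute budget."""
    ppl_ok = bigram_log_perplexity(tokens, bigram_counts, unigram_counts,
                                   vocab_size) <= max_log_ppl
    return ppl_ok and flops_spent <= flops_budget

# Example with a toy corpus-count table:
counts = Counter({("ignore", "previous"): 3, ("previous", "instructions"): 5})
unigrams = Counter({"ignore": 10, "previous": 12, "instructions": 7})
print(admissible(["ignore", "previous", "instructions"], counts, unigrams,
                 vocab_size=50_000, flops_spent=3e15))
```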

Autonomous Learning Conference Paper Active Fine-Tuning of Generalist Policies Bagatella, M., Hübotter, J., Martius, G., Krause, A. October 2024 (Submitted) BibTeX

Autonomous Learning Robotics Conference Paper Learning Diverse Skills for Local Navigation under Multi-constraint Optimality Cheng, J., Vlastelica, M., Kolev, P., Li, C., Martius, G. In IEEE International Conference on Robotics and Automation (ICRA), 5083-5089, October 2024 (Published)
Despite many successful applications of data-driven control in robotics, extracting meaningful diverse behaviors remains a challenge. Typically, task performance needs to be compromised in order to achieve diversity. In many scenarios, task requirements are specified as a multitude of reward terms, each requiring a different trade-off. In this work, we take a constrained optimization viewpoint on the quality-diversity trade-off and show that we can obtain diverse policies while imposing constraints on their value functions which are defined through distinct rewards. In line with previous work, further control of the diversity level can be achieved through an attract-repel reward term motivated by the Van der Waals force. We demonstrate the effectiveness of our method on a local navigation task where a quadruped robot needs to reach the target within a finite horizon. Finally, our trained policies transfer well to the real 12-DoF quadruped robot, Solo12, and exhibit diverse agile behaviors with successful obstacle traversal.
Website DOI URL BibTeX
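
To make the attract-repel idea mentioned in the abstract above more tangible, here is a hypothetical pairwise diversity term between skill embeddings with a Lennard-Jones-like shape: strongly repulsive when two embeddings nearly coincide and weakly attractive at larger separations. The functional form and constants are illustrative, not those used in the paper.

```python
# Hypothetical attract-repel diversity term between skill embeddings
# (Lennard-Jones-like shape; constants and form are illustrative only).
import numpy as np

def attract_repel(z_i, z_j, sigma=1.0, eps=1.0):
    """Large and positive (repulsive) when embeddings nearly coincide,
    mildly negative (attractive) around and beyond a separation of ~sigma."""
    d = np.linalg.norm(z_i - z_j) + 1e-8
    return 4.0 * eps * ((sigma / d) ** 12 - (sigma / d) ** 6)

def diversity_penalty(embeddings):
    """Summed pairwise term, added to the constrained skill-learning objective."""
    total = 0.0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            total += attract_repel(embeddings[i], embeddings[j])
    return total

print(diversity_penalty([np.zeros(4), np.ones(4), 2.0 * np.ones(4)]))
```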

Learning and Dynamical Systems Conference Paper Subgroup-Specific Risk-Controlled Dose Estimation in Radiotherapy Fischer, P., Willms, H., Muehlebach, M., Thorwarth, D., Schneider, M., Baumgartner, C. Medical Image Computing and Computer Assisted Intervention - MICCAI 2024, 696-706, Springer, Cham, 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024), October 2024 (Published) DOI URL BibTeX

Perceiving Systems Conference Paper On predicting 3D bone locations inside the human body Dakri, A., Arora, V., Challier, L., Keller, M., Black, M. J., Pujades, S. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2024, 336-346, Springer, Cham, 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024), October 2024 (Published)
Knowing the precise location of the bones inside the human body is key in several medical tasks, such as patient placement inside an imaging device or surgical navigation inside a patient. Our goal is to predict the bone locations using only an external 3D body surface observation. Existing approaches either validate their predictions on 2D data (X-rays) or with pseudo-ground truth computed from motion capture using biomechanical models. Thus, methods either suffer from a 3D-2D projection ambiguity or directly lack validation on clinical imaging data. In this work, we start with a dataset of segmented skin and long bones obtained from 3D full body MRI images that we refine into individual bone segmentations. To learn the skin to bones correlations, one needs to register the paired data. Few anatomical models allow registering a skeleton and the skin simultaneously. One such method, SKEL, has a skin and skeleton that is jointly rigged with the same pose parameters. However, it lacks the flexibility to adjust the bone locations inside its skin. To address this, we extend SKEL into SKEL-J to allow its bones to fit the segmented bones while its skin fits the segmented skin. These precise fits allow us to train SKEL-J to more accurately infer the anatomical joint locations from the skin surface. Our qualitative and quantitative results show how our bone location predictions are more accurate than all existing approaches. To foster future research, we make available for research purposes the individual bone segmentations, the fitted SKEL-J models as well as the new inference methods.
Project page DOI URL BibTeX

Deep Models and Optimization Conference Paper Loss Landscape Characterization of Neural Networks without Over-Parametrization Islamov, R., Ajroldi, N., Orvieto, A., Lucchi, A. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Thirty-Eighth Annual Conference on Neural Information Processing Systems, October 2024 (Published) URL BibTeX

Deep Models and Optimization Conference Paper Recurrent neural networks: vanishing and exploding gradients are not the end of the story Zucchet, N., Orvieto, A. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Thirty-Eighth Annual Conference on Neural Information Processing Systems, October 2024 (Published) URL BibTeX

Deep Models and Optimization Conference Paper Theoretical Foundations of Deep Selective State-Space Models Muca Cirone, N., Orvieto, A., Walker, B., Salvi, C., Lyons, T. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Thirty-Eighth Annual Conference on Neural Information Processing Systems, October 2024 (Published) URL BibTeX

Deep Models and Optimization Conference Paper Understanding the differences in Foundation Models: Attention, State Space Models, and Recurrent Neural Networks Sieber, J., Amo Alonso, C., Didier, A., Zeilinger, M., Orvieto, A. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Thirty-Eighth Annual Conference on Neural Information Processing Systems, October 2024 (Published) URL BibTeX

Empirical Inference Article A Probabilistic Model behind Self-Supervised Learning Bizeul, A., Schölkopf, B., Allen, C. Transactions on Machine Learning Research, October 2024 (Published) PDF URL BibTeX

Haptic Intelligence Robotic Materials Miscellaneous Active Haptic Feedback for a Virtual Wrist-Anchored User Interface Bartels, J. U., Sanchez-Tamayo, N., Sedlmair, M., Kuchenbecker, K. J. Adjunct Proceedings of the Annual ACM Symposium on User Interface Software and Technology (UIST), (53)1-3, Hands-on demonstration presented at the Annual ACM Symposium on User Interface Software and Technology (UIST), Pittsburgh, USA, October 2024 (Published)
The presented system combines a virtual wrist-anchored user interface (UI) with a new low-profile, wrist-worn device that provides salient and expressive haptic feedback such as contact, pressure, and broad-bandwidth vibration. This active feedback is used to add tactile cues to interactions with virtual mid-air UI elements that track the user's wrist; we demonstrate a simple menu-interaction task to showcase the utility of haptics for interactions with virtual buttons and sliders. Moving forward, we intend to use this platform to develop haptic guidelines for body-anchored interfaces and test multiple haptic devices across the body to create engaging interactions.
DOI BibTeX

Social Foundations of Computation Algorithms and Society Conference Paper Decline Now: A Combinatorial Model for Algorithmic Collective Action Sigg, D., Hardt, M., Mendler-Dünner, C. CHI Conference on Human Factors in Computing Systems, October 2024 (Accepted)
Drivers on food delivery platforms often run a loss on low-paying orders. In response, workers on DoorDash started a campaign, DeclineNow, to purposefully decline orders below a certain pay threshold. For each declined order, the platform returns the request to other available drivers with slightly increased pay. While contributing to an overall pay increase, the implementation of the strategy comes with the risk of missing out on orders for each individual driver. In this work, we propose a first combinatorial model to study the strategic interaction between workers and the platform. Within our model, we formalize key quantities such as the average worker benefit of the strategy, the benefit of freeriding, as well as the benefit of participation. We extend our theoretical results with simulations. Our key insights show that the average worker gain of the strategy is always positive, while the benefit of participation is positive only for small degrees of labor oversupply. Beyond this point, the utility of participants decreases faster with an increasing degree of oversupply, compared to the utility of non-participants. Our work highlights the significance of labor supply levels for the effectiveness of collective action on gig platforms. We suggest organizing in shifts as a means to reduce oversupply and empower collectives.
arXiv BibTeX
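
The following toy simulation, in the spirit of (but not taken from) the combinatorial model described above, lets a fraction of drivers decline any order below a pay threshold, bumps the offered pay on each decline, and compares the average earnings of participants and non-participants. The pay distribution, increment, and threshold are made-up parameters.

```python
# Toy simulation of a DeclineNow-style strategy (all quantities are hypothetical).
import random

def simulate(n_drivers=100, participation=0.5, n_orders=500,
             threshold=5.0, increment=0.5, seed=0):
    rng = random.Random(seed)
    participants = set(range(int(n_drivers * participation)))
    earnings = [0.0] * n_drivers
    for _ in range(n_orders):
        pay = rng.uniform(2.0, 10.0)            # initial offer
        candidates = list(range(n_drivers))
        rng.shuffle(candidates)
        for d in candidates:
            if d in participants and pay < threshold:
                pay += increment                # declined: re-offered at slightly higher pay
                continue
            earnings[d] += pay                  # accepted
            break
    avg = lambda ds: sum(earnings[d] for d in ds) / max(len(ds), 1)
    return avg(participants), avg(set(range(n_drivers)) - participants)

print(simulate())  # (average pay of participants, average pay of non-participants)
```

Varying n_drivers relative to n_orders is one crude way to probe the oversupply effect the abstract emphasizes.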

Empirical Inference Article How developments in natural language processing help us in understanding human behaviour Mihalcea, R., Biester, L., Boyd, R. L., Jin, Z., Perez-Rosas, V., Wilson, S., Pennebaker, J. W. Nature Human Behaviour, 8(10):1877-1889, Nature Publishing Group UK London, October 2024 (Published) DOI URL BibTeX

Perceiving Systems Ph.D. Thesis Learning Digital Humans from Vision and Language Feng, Y. ETH Zürich, October 2024 (Published)
The study of realistic digital humans has gained significant attention within the research communities of computer vision, computer graphics, and machine learning. This growing interest is motivated by the crucial understanding of human selves and the essential role digital humans play in enabling the metaverse. Applications span various sectors including virtual presence, fitness, digital fashion, entertainment, humanoid robots and healthcare. However, learning about 3D humans presents significant challenges due to data scarcity. In an era where scalability is crucial for AI, this raises the question: can we enhance the scalability of learning digital humans? To understand this, consider how humans interact: we observe and communicate, forming impressions of others through these interactions. This thesis proposes a similar potential for computers: could they be taught to understand humans by observing and listening? Such an approach would involve processing visual data, like images and videos, and linguistic data from text descriptions. Thus, this research endeavors to enable machines to learn about digital humans from vision and language, both of which are readily available and scalable sources of data. Our research begins by developing a framework to create detailed 3D faces from in-the-wild images. This framework, capable of generating highly realistic and animatable 3D faces from single images, is trained without paired 3D supervision and achieves state-of-the-art accuracy in shape reconstruction. It effectively disentangles identity and expression details, thereby enhancing facial animation. We then explore capturing the body, clothing, face, and hair from monocular videos, using a novel hybrid explicit-implicit 3D representation. This approach facilitates the disentangled learning of digital humans from monocular videos and allows for the easy transfer of hair and clothing to different bodies, as demonstrated through experiments in disentangled reconstruction, virtual try-ons, and hairstyle transfers. Next, we present a method that utilizes text-visual foundation models to generate highly realistic 3D faces, complete with hair and accessories, based on text descriptions. These foundation models are trained exclusively on in-the-wild images and efficiently produce detailed and realistic outputs, facilitating the creation of authentic avatars. Finally, we introduce a framework that employs Large Language Models (LLMs) to interpret and generate 3D human poses from both images and text. This method, inspired by how humans intuitively understand postures, merges image interpretation with body language analysis. By embedding SMPL poses into a multimodal LLM, our approach not only integrates semantic reasoning but also enhances the generation and understanding of 3D poses, utilizing the comprehensive capabilities of LLMs. Additionally, the use of LLMs facilitates interactive discussions with users about human poses, enriching human-computer interactions. Our research on digital humans significantly boosts scalability and controllability. By generating digital humans from images, videos, and text, we democratize their creation, making it broadly accessible through everyday imagery and straightforward text, while enhancing generalization. Disentangled modeling and interactive chatting with human poses increase the controllability of digital humans and improve user interactions and customizations, showcasing their potential to extend into various disciplines.
pdf DOI URL BibTeX

Empirical Inference Conference Paper Redesigning Information Markets in the Era of Language Models Weiss, M., Rahaman, N., Wüthrich, M., Bengio, Y., Li, L. E., Schölkopf, B., Pal, C. First Conference on Language Modeling (COLM), arXiv:2403.14443, October 2024 (Published)
This work addresses the buyer's inspection paradox for information markets. The paradox is that buyers need to access information to determine its value, while sellers need to limit access to prevent theft. To study this, we introduce an open-source simulated digital marketplace where intelligent agents, powered by language models, buy and sell information on behalf of external participants. The central mechanism enabling this marketplace is the agents' dual capabilities: they not only have the capacity to assess the quality of privileged information but also come equipped with the ability to forget. This ability to induce amnesia allows vendors to grant temporary access to proprietary information, significantly reducing the risk of unauthorized retention while enabling agents to accurately gauge the information's relevance to specific queries or tasks. To perform well, agents must make rational decisions, strategically explore the marketplace through generated sub-queries, and synthesize answers from purchased information. Concretely, our experiments (a) uncover biases in language models leading to irrational behavior and evaluate techniques to mitigate these biases, (b) investigate how price affects demand in the context of informational goods, and (c) show that inspection and higher budgets both lead to higher quality outcomes.
arXiv URL BibTeX

Neural Capture and Synthesis Perceiving Systems Conference Paper Stable Video Portraits Ostrek, M., Thies, J. In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, European Conference on Computer Vision (ECCV 2024), October 2024 (Published)
Rapid advances in the field of generative AI and text-to-image methods in particular have transformed the way we interact with and perceive computer-generated imagery today. In parallel, much progress has been made in 3D face reconstruction, using 3D Morphable Models (3DMM). In this paper, we present Stable Video Portraits, a novel hybrid 2D/3D generation method that outputs photorealistic videos of talking faces leveraging a large pre-trained text-to-image prior (2D), controlled via a 3DMM (3D). Specifically, we introduce a person-specific fine-tuning of a general 2D stable diffusion model which we lift to a video model by providing temporal 3DMM sequences as conditioning and by introducing a temporal denoising procedure. As an output, this model generates temporally smooth imagery of a person with 3DMM-based controls, i.e., a person-specific avatar. The facial appearance of this person-specific avatar can be edited and morphed to text-defined celebrities, without any test-time fine-tuning. The method is analyzed quantitatively and qualitatively, and we show that our method outperforms state-of-the-art monocular head avatar methods.
URL BibTeX

Neural Capture and Synthesis Perceiving Systems Conference Paper Synthesizing Environment-Specific People in Photographs Ostrek, M., O’Sullivan, C., Black, M., Thies, J. In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, European Conference on Computer Vision (ECCV 2024), October 2024 (Published)
We present ESP, a novel method for context-aware full-body generation, that enables photo-realistic synthesis and inpainting of people wearing clothing that is semantically appropriate for the scene depicted in an input photograph. ESP is conditioned on a 2D pose and contextual cues that are extracted from the photograph of the scene and integrated into the generation process, where the clothing is modeled explicitly with human parsing masks (HPM). Generated HPMs are used as tight guiding masks for inpainting, such that no changes are made to the original background. Our models are trained on a dataset containing a set of in-the-wild photographs of people covering a wide range of different environments. The method is analyzed quantitatively and qualitatively, and we show that ESP outperforms the state-of-the-art on the task of contextual full-body generation.
URL BibTeX

Perceiving Systems Conference Paper HUMOS: Human Motion Model Conditioned on Body Shape Tripathi, S., Taheri, O., Lassner, C., Black, M. J., Holden, D., Stoll, C. In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, October 2024 (Published)
Generating realistic human motion is essential for many computer vision and graphics applications. The wide variety of human body shapes and sizes greatly impacts how people move. However, most existing motion models ignore these differences, relying on a standardized, average body. This leads to uniform motion across different body types, where movements don't match their physical characteristics, limiting diversity. To solve this, we introduce a new approach to develop a generative motion model based on body shape. We show that it's possible to train this model using unpaired data by applying cycle consistency, intuitive physics, and stability constraints, which capture the relationship between identity and movement. The resulting model generates diverse, physically plausible, and dynamically stable human motions that are both quantitatively and qualitatively more realistic than current state-of-the-art methods.
project arXiv BibTeX

Robotic Materials Article Hexagonal electrohydraulic modules for rapidly reconfigurable high-speed robots Yoder, Z., Rumley, E., Schmidt, I., Rothemund, P., Keplinger, C. Science Robotics, 9, September 2024 (Published)
Robots made from reconfigurable modular units feature versatility, cost efficiency, and improved sustainability compared with fixed designs. Reconfigurable modules driven by soft actuators provide adaptable actuation, safe interaction, and wide design freedom, but existing soft modules would benefit from high-speed and high-strain actuation, as well as driving methods well-suited to untethered operation. Here, we introduce a class of electrically actuated robotic modules that provide high-speed (a peak contractile strain rate of 4618% per second, 15.8-hertz bandwidth, and a peak specific power of 122 watts per kilogram), high-strain (49% contraction) actuation and that use magnets for reversible mechanical and electrical connections between neighboring modules, thereby serving as building blocks for rapidly reconfigurable and highly agile robotic systems. The actuation performance of each hexagonal electrohydraulic (HEXEL) module is enabled by a synergistic combination of soft and rigid components; a hexagonal exoskeleton of rigid plates amplifies the motion produced by soft electrohydraulic actuators and provides a mechanical structure and connection platform for reconfigurable robots composed of many modules. We characterize the actuation performance of individual HEXEL modules, present a model that captures their quasi-static force-stroke behavior, and demonstrate both a high-jumping and a fast pipe-crawling robot. Using embedded magnetic connections, we arranged multiple modules into reconfigurable robots with diverse functionality, including a high-stroke muscle, a multimodal active array, a table-top active platform, and a fast-rolling robot. We further leveraged the magnetic connections for hosting untethered, snap-on driving electronics, together highlighting the promise of HEXEL modules for creating rapidly reconfigurable high-speed robots.
Video PDF DOI URL BibTeX

Perceiving Systems Conference Paper A Unified Approach for Text- and Image-guided 4D Scene Generation Zheng, Y., Li, X., Nagano, K., Liu, S., Hilliges, O., Mello, S. D. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 7300-7309, Piscataway, NJ, CVPR, September 2024 (Published)
Large-scale diffusion generative models are greatly simplifying image, video and 3D asset creation from user-provided text prompts and images. However, the challenging problem of text-to-4D dynamic 3D scene generation with diffusion guidance remains largely unexplored. We propose Dream-in-4D, which features a novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D diffusion guidance to effectively learn a high-quality static 3D asset in the first stage; (2) a deformable neural radiance field that explicitly disentangles the learned static asset from its deformation, preserving quality during motion learning; and (3) a multi-resolution feature grid for the deformation field with a displacement total variation loss to effectively learn motion with video diffusion guidance in the second stage. Through a user preference study, we demonstrate that our approach significantly advances image and motion quality, 3D consistency and text fidelity for text-to-4D generation compared to baseline approaches. Thanks to its motion-disentangled representation, Dream-in-4D can also be easily adapted for controllable generation where appearance is defined by one or multiple images, without the need to modify the motion learning stage. Thus, our method offers, for the first time, a unified approach for text-to-4D, image-to-4D and personalized 4D generation tasks.
paper project code DOI URL BibTeX

Perceiving Systems Conference Paper Generative Proxemics: A Prior for 3D Social Interaction from Images Müller, L., Ye, V., Pavlakos, G., Black, M., Kanazawa, A. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 9687-9697, Piscataway, NJ, CVPR, September 2024 (Published)
Social interaction is a fundamental aspect of human behavior and communication. The way individuals position themselves in relation to others, also known as proxemics, conveys social cues and affects the dynamics of social interaction. Reconstructing such interaction from images presents challenges because of mutual occlusion and the limited availability of large training datasets. To address this, we present a novel approach that learns a prior over the 3D proxemics of two people in close social interaction and demonstrate its use for single-view 3D reconstruction. We start by creating 3D training data of interacting people using image datasets with contact annotations. We then model the proxemics using a novel denoising diffusion model called BUDDI that learns the joint distribution over the poses of two people in close social interaction. Sampling from our generative proxemics model produces realistic 3D human interactions, which we validate through a perceptual study. We use BUDDI in reconstructing two people in close proximity from a single image without any contact annotation via an optimization approach that uses the diffusion model as a prior. Our approach recovers accurate and plausible 3D social interactions from noisy initial estimates, outperforming state-of-the-art methods. Our code, data, and model are available at our project website.
arXiv project code data DOI URL BibTeX

Empirical Inference Conference Paper GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs Gao, G., Liu, W., Chen, A., Geiger, A., Schölkopf, B. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 21295-21304, IEEE, Piscataway, NJ, CVPR, September 2024 (Published) DOI URL BibTeX

Perceiving Systems Conference Paper Real-time Monocular Full-body Capture in World Space via Sequential Proxy-to-Motion Learning Zhang, H., Zhang, Y., Hu, L., Zhang, J., Yi, H., Zhang, S., Liu, Y. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1954 - 1964, Piscataway, NJ , CVPR, September 2024 (Published)
Learning-based approaches to monocular motion capture have recently shown promising results by learning to regress in a data-driven manner. However, due to the challenges in data collection and network designs, it remains challenging for existing solutions to achieve real-time full-body capture while being accurate in world space. In this work, we introduce ProxyCap, a human-centric proxy-to-motion learning scheme to learn world-space motions from a proxy dataset of 2D skeleton sequences and 3D rotational motions. Such proxy data enables us to build a learning-based network with accurate world-space supervision while also mitigating the generalization issues. For more accurate and physically plausible predictions in world space, our network is designed to learn human motions from a human-centric perspective, which enables the understanding of the same motion captured with different camera trajectories. Moreover, a contact-aware neural motion descent module is proposed in our network so that it can be aware of foot-ground contact and motion misalignment with the proxy observations. With the proposed learning-based solution, we demonstrate the first real-time monocular full-body capture system with plausible foot-ground contact in world space even using hand-held moving cameras.
arxiv project DOI URL BibTeX

Perceiving Systems Neural Capture and Synthesis Human-centric Vision & Learning Conference Paper Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles Sklyarova, V., Zakharov, E., Hilliges, O., Black, M. J., Thies, J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 4703-4712, Piscataway, NJ, CVPR, September 2024 (Published)
We present HAAR, a new strand-based generative model for 3D human hairstyles. Specifically, based on textual inputs, HAAR produces 3D hairstyles that could be used as production-level assets in modern computer graphics engines. Current AI-based generative models take advantage of powerful 2D priors to reconstruct 3D content in the form of point clouds, meshes, or volumetric functions. However, by using the 2D priors, they are intrinsically limited to only recovering the visual parts. Highly occluded hair structures can not be reconstructed with those methods, and they only model the "outer shell", which is not ready to be used in physics-based rendering or simulation pipelines. In contrast, we propose a first text-guided generative method that uses 3D hair strands as an underlying representation. Leveraging 2D visual question-answering (VQA) systems, we automatically annotate synthetic hair models that are generated from a small set of artist-created hairstyles. This allows us to train a latent diffusion model that operates in a common hairstyle UV space. In qualitative and quantitative studies, we demonstrate the capabilities of the proposed model and compare it to existing hairstyle generation approaches.
ArXiv Code DOI URL BibTeX

Perceiving Systems Conference Paper ChatPose: Chatting about 3D Human Pose Feng, Y., Lin, J., Dwivedi, S. K., Sun, Y., Patel, P., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2093-2103, Piscataway, NJ, CVPR, September 2024 (Published)
We introduce ChatPose, a framework employing Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions. Our work is motivated by the human ability to intuitively understand postures from a single image or a brief description, a process that intertwines image interpretation, world knowledge, and an understanding of body language. Traditional human pose estimation and generation methods often operate in isolation, lacking semantic understanding and reasoning abilities. ChatPose addresses these limitations by embedding SMPL poses as distinct signal tokens within a multimodal LLM, enabling the direct generation of 3D body poses from both textual and visual inputs. Leveraging the powerful capabilities of multimodal LLMs, ChatPose unifies classical 3D human pose estimation and generation tasks while offering user interactions. Additionally, ChatPose empowers LLMs to apply their extensive world knowledge in reasoning about human poses, leading to two advanced tasks: speculative pose generation and reasoning about pose estimation. These tasks involve reasoning about humans to generate 3D poses from subtle text queries, possibly accompanied by images. We establish benchmarks for these tasks, moving beyond traditional 3D pose generation and estimation methods. Our results show that ChatPose outperforms existing multimodal LLMs and task-specific methods on these newly proposed tasks. Furthermore, ChatPose's ability to understand and generate 3D human poses based on complex reasoning opens new directions in human pose analysis.
Arxiv Project DOI URL BibTeX

Perceiving Systems Conference Paper EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling Liu, H., Zhu, Z., Becherini, G., Peng, Y., Su, M., Zhou, Y., Zhe, X., Iwamoto, N., Zheng, B., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 1144-1154, Piscataway, NJ, CVPR, September 2024 (Published)
We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hands, and global movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines a MoShed SMPL-X body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion captured dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer, facilitating joint training on audio-to-gesture generation and masked gesture reconstruction to effectively encode audio and body gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover, EMAGE adaptively merges speech features from the audio's rhythm and content and utilizes four compositional VQ-VAEs to enhance the results' fidelity and diversity. Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance and is flexible in accepting predefined spatial-temporal gesture inputs, generating complete, audio-synchronized results. Our code and dataset are available.
arXiv project dataset code gradio colab video DOI URL BibTeX

Perceiving Systems Conference Paper HIT: Estimating Internal Human Implicit Tissues from the Body Surface Keller, M., Arora, V., Dakri, A., Chandhok, S., Machann, J., Fritsche, A., Black, M. J., Pujades, S. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 3480-3490, Piscataway, NJ, CVPR, September 2024 (Published)
The creation of personalized anatomical digital twins is important in the fields of medicine, computer graphics, sports science, and biomechanics. To observe a subject's anatomy, expensive medical devices (MRI or CT) are required and the creation of the digital model is often time-consuming and involves manual effort. Instead, we leverage the fact that the shape of the body surface is correlated with the internal anatomy; for example, from surface observations alone, one can predict body composition and skeletal structure. In this work, we go further and learn to infer the 3D location of three important anatomic tissues: subcutaneous adipose tissue (fat), lean tissue (muscles and organs), and long bones. To learn to infer these tissues, we tackle several key challenges. We first create a dataset of human tissues by segmenting full-body MRI scans and registering the SMPL body mesh to the body surface. With this dataset, we train HIT (Human Implicit Tissues), an implicit function that, given a point inside a body, predicts its tissue class. HIT leverages the SMPL body model shape and pose parameters to canonicalize the medical data. Unlike SMPL, which is trained from upright 3D scans, the MRI scans are taken of subjects lying on a table, resulting in significant soft-tissue deformation. Consequently, HIT uses a learned volumetric deformation field that undoes these deformations. Since HIT is parameterized by SMPL, we can repose bodies or change the shape of subjects and the internal structures deform appropriately. We perform extensive experiments to validate HIT's ability to predict plausible internal structure for novel subjects. The dataset and HIT model are publicly available to foster future research in this direction.
Project page Paper DOI URL BibTeX
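
The abstract above describes HIT as an implicit function that maps a query point inside the body, conditioned on SMPL shape and pose, to a tissue class. A hypothetical minimal version of such a classifier could look like the sketch below; the architecture, feature dimensions, and the assumption of canonicalized query points are illustrative only and do not reflect the released model, which additionally learns a volumetric deformation field.

```python
# Hypothetical implicit tissue classifier in the spirit of HIT (illustrative only).
import torch
import torch.nn as nn

class TissueField(nn.Module):
    def __init__(self, n_betas=10, n_pose=72, n_tissues=3, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + n_betas + n_pose, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_tissues),  # e.g., fat / lean tissue / long bones
        )

    def forward(self, points, betas, pose):
        # points: (N, 3) query locations inside the body (assumed canonicalized)
        feats = torch.cat([points,
                           betas.expand(points.shape[0], -1),
                           pose.expand(points.shape[0], -1)], dim=-1)
        return self.mlp(feats)  # (N, n_tissues) per-tissue logits

logits = TissueField()(torch.rand(8, 3), torch.zeros(1, 10), torch.zeros(1, 72))
```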

Perceiving Systems Conference Paper HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video Fan, Z., Parelli, M., Kadoglou, M. E., Kocabas, M., Chen, X., Black, M. J., Hilliges, O. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 494-504, Piscataway, NJ, CVPR, September 2024 (Published)
Since humans interact with diverse objects every day, the holistic 3D capture of these interactions is important to understand and model human behaviour. However, most existing methods for hand-object reconstruction from RGB either assume pre-scanned object templates or heavily rely on limited 3D hand-object data, restricting their ability to scale and generalize to more unconstrained interaction settings. To this end, we introduce HOLD -- the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video. We develop a compositional articulated implicit model that can reconstruct disentangled 3D hand and object from 2D images. We also further incorporate hand-object constraints to improve hand-object poses and consequently the reconstruction quality. Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings. Moreover, we qualitatively show its robustness in reconstructing from in-the-wild videos.
Paper Project Code DOI URL BibTeX

Perceiving Systems Conference Paper HUGS: Human Gaussian Splats Kocabas, M., Chang, R., Gabriel, J., Tuzel, O., Ranjan, A. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 505-515, Piscataway, NJ, CVPR, September 2024 (Published)
Recent advances in neural rendering have improved both training and rendering times by orders of magnitude. While these methods demonstrate state-of-the-art quality and speed, they are designed for photogrammetry of static scenes and do not generalize well to freely moving humans in the environment. In this work, we introduce Human Gaussian Splats (HUGS) that represents an animatable human together with the scene using 3D Gaussian Splatting (3DGS). Our method takes only a monocular video with a small number of (50-100) frames, and it automatically learns to disentangle the static scene and a fully animatable human avatar within 30 minutes. We utilize the SMPL body model to initialize the human Gaussians. To capture details that are not modeled by SMPL (e.g., cloth, hairs), we allow the 3D Gaussians to deviate from the human body model. Utilizing 3D Gaussians for animated humans brings new challenges, including the artifacts created when articulating the Gaussians. We propose to jointly optimize the linear blend skinning weights to coordinate the movements of individual Gaussians during animation. Our approach enables novel-pose synthesis of human and novel view synthesis of both the human and the scene. We achieve state-of-the-art rendering quality with a rendering speed of 60 FPS while being ∼100× faster to train over previous work.
arXiv Github Project Page YouTube Poster DOI URL BibTeX
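
HUGS animates human Gaussians by jointly optimizing linear blend skinning (LBS) weights. The snippet below is a generic sketch of standard LBS applied to Gaussian centers, included only to illustrate what such optimized weights act on; array shapes and names are assumptions, and the published method is more involved than this.

```python
# Generic linear blend skinning of 3D Gaussian centers (illustrative, not the HUGS code).
import numpy as np

def skin_gaussian_centers(centers, weights, joint_rotations, joint_translations):
    """centers: (N, 3), weights: (N, J), joint_rotations: (J, 3, 3), joint_translations: (J, 3)."""
    posed = np.zeros_like(centers)
    for j in range(weights.shape[1]):
        # Transform every center by joint j, then blend by that joint's skinning weight.
        transformed = centers @ joint_rotations[j].T + joint_translations[j]
        posed += weights[:, j:j + 1] * transformed
    return posed
```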

Perceiving Systems Conference Paper SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes Sanyal, S., Ghosh, P., Yang, J., Black, M. J., Thies, J., Bolkart, T. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2362-2371, Piscataway, NJ, CVPR, September 2024 (Published)
We present SCULPT, a novel 3D generative model for clothed and textured 3D meshes of humans. Specifically, we devise a deep neural network that learns to represent the geometry and appearance distribution of clothed human bodies. Training such a model is challenging, as datasets of textured 3D meshes for humans are limited in size and accessibility. Our key observation is that there exist medium-sized 3D scan datasets like CAPE, as well as large-scale 2D image datasets of clothed humans and multiple appearances can be mapped to a single geometry. To effectively learn from the two data modalities, we propose an unpaired learning procedure for pose-dependent clothed and textured human meshes. Specifically, we learn a pose-dependent geometry space from 3D scan data. We represent this as per vertex displacements w.r.t. the SMPL model. Next, we train a geometry conditioned texture generator in an unsupervised way using the 2D image data. We use intermediate activations of the learned geometry model to condition our texture generator. To alleviate entanglement between pose and clothing type, and pose and clothing appearance, we condition both the texture and geometry generators with attribute labels such as clothing types for the geometry, and clothing colors for the texture generator. We automatically generated these conditioning labels for the 2D images based on the visual question answering model BLIP and CLIP. We validate our method on the SCULPT dataset, and compare to state-of-the-art 3D generative models for clothed human bodies.
Project page Data Code Video Arxiv DOI URL BibTeX

Perceiving Systems Conference Paper SMIRK: 3D Facial Expressions through Analysis-by-Neural-Synthesis Retsinas, G., Filntisis, P. P., Danecek, R., Abrevaya, V. F., Roussos, A., Bolkart, T., Maragos, P. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2490-2501, Piscataway, NJ, CVPR, September 2024 (Published)
While existing methods for 3D face reconstruction from in-the-wild images excel at recovering the overall face shape, they commonly miss subtle, extreme, asymmetric, or rarely observed expressions. We improve upon these methods with SMIRK (Spatial Modeling for Image-based Reconstruction of Kinesics), which faithfully reconstructs expressive 3D faces from images. We identify two key limitations in existing methods: shortcomings in their self-supervised training formulation, and a lack of expression diversity in the training images. For training, most methods employ differentiable rendering to compare a predicted face mesh with the input image, along with a plethora of additional loss functions. This differentiable rendering loss not only has to provide supervision to optimize for 3D face geometry, camera, albedo, and lighting, which is an ill-posed optimization problem, but the domain gap between rendering and input image further hinders the learning process. Instead, SMIRK replaces the differentiable rendering with a neural rendering module that, given the rendered predicted mesh geometry, and sparsely sampled pixels of the input image, generates a face image. As the neural rendering gets color information from sampled image pixels, supervising with neural rendering-based reconstruction loss can focus solely on the geometry. Further, it enables us to generate images of the input identity with varying expressions while training. These are then utilized as input to the reconstruction model and used as supervision with ground truth geometry. This effectively augments the training data and enhances the generalization for diverse expressions. Our qualitative, quantitative and particularly our perceptual evaluations demonstrate that SMIRK achieves the new state-of-the art performance on accurate expression reconstruction.
arxiv project code DOI URL BibTeX

Perceiving Systems Conference Paper VAREN: Very Accurate and Realistic Equine Network Zuffi, S., Mellbin, Y., Li, C., Hoeschle, M., Kjellstrom, H., Polikovsky, S., Hernlund, E., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 5374-5383, Piscataway, NJ, CVPR, September 2024 (Published)
Data-driven three-dimensional parametric shape models of the human body have gained enormous popularity both for the analysis of visual data and for the generation of synthetic humans. Following a similar approach for animals does not scale to the multitude of existing animal species, not to mention the difficulty of accessing subjects to scan in 3D. However, we argue that for domestic species of great importance, like the horse, it is a highly valuable investment to put effort into gathering a large dataset of real 3D scans, and learn a realistic 3D articulated shape model. We introduce VAREN, a novel 3D articulated parametric shape model learned from 3D scans of many real horses. VAREN bridges synthesis and analysis tasks, as the generated model instances have unprecedented realism, while being able to represent horses of different sizes and shapes. Differently from previous body models, VAREN has two resolutions, an anatomical skeleton, and interpretable, learned pose-dependent deformations, which are related to the body muscles. We show with experiments that this formulation has superior performance with respect to previous strategies for modeling pose-dependent deformations in the human body case, while also being more compact and allowing an analysis of the relationship between articulation and muscle deformation during articulated motion.
project page paper DOI URL BibTeX

Perceiving Systems Conference Paper WANDR: Intention-guided Human Motion Generation Diomataris, M., Athanasiou, N., Taheri, O., Wang, X., Hilliges, O., Black, M. J. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 927-936, IEEE Computer Society, Piscataway, NJ, CVPR, September 2024 (Published)
Synthesizing natural human motions that enable a 3D human avatar to walk and reach for arbitrary goals in 3D space remains an unsolved problem with many applications. Existing methods (data-driven or using reinforcement learning) are limited in terms of generalization and motion naturalness. A primary obstacle is the scarcity of training data that combines locomotion with goal reaching. To address this, we introduce WANDR, a data-driven model that takes an avatar's initial pose and a goal's 3D position and generates natural human motions that place the end effector (wrist) on the goal location. To solve this, we introduce novel intention features that drive rich goal-oriented movement. Intention guides the agent to the goal, and interactively adapts the generation to novel situations without needing to define sub-goals or the entire motion path. Crucially, intention allows training on datasets that have goal-oriented motions as well as those that do not. WANDR is a conditional Variational Auto-Encoder (c-VAE), which we train using the AMASS and CIRCLE datasets. We evaluate our method extensively and demonstrate its ability to generate natural and long-term motions that reach 3D goals and generalize to unseen goal locations.
project website arXiv YouTube Video Code CVF DOI URL BibTeX

Perceiving Systems Conference Paper WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion Shin, S., Kim, J., Halilaj, E., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2070-2080, Piscataway, NJ, CVPR, September 2024 (Published)
The estimation of 3D human motion from video has progressed rapidly but current methods still have several key limitations. First, most methods estimate the human in camera coordinates. Second, prior work on estimating humans in global coordinates often assumes a flat ground plane and produces foot sliding. Third, the most accurate methods rely on computationally expensive optimization pipelines, limiting their use to offline applications. Finally, existing video-based methods are surprisingly less accurate than single-frame methods. We address these limitations with WHAM (World-grounded Humans with Accurate Motion), which accurately and efficiently reconstructs 3D human motion in a global coordinate system from video. WHAM learns to lift 2D keypoint sequences to 3D using motion capture data and fuses this with video features, integrating motion context and visual information. WHAM exploits camera angular velocity estimated from a SLAM method together with human motion to estimate the body's global trajectory. We combine this with a contact-aware trajectory refinement method that lets WHAM capture human motion in diverse conditions, such as climbing stairs. WHAM outperforms all existing 3D human motion recovery methods across multiple in-the-wild benchmarks. Code will be available for research purposes.
arXiv project code DOI URL BibTeX

Robotic Materials Article Electrohydraulic Musculoskeletal Robotic Leg for Agile, Adaptive, yet Energy-Efficient Locomotion Buchner, T. J. K., Fukushima, T., Kazemipour, A., Gravert, S., Prairie, M., Romanescu, P., Arm, P., Zhang, Y., Wang, X., Zhang, S. L., Walter, J., Keplinger, C., Katzschmann, R. K. Nature Communications, 15(1), September 2024 (Published)
Robotic locomotion in unstructured terrain demands an agile, adaptive, and energy-efficient architecture. To traverse such terrains, legged robots use rigid electromagnetic motors and sensorized drivetrains to adapt to the environment actively. These systems struggle to compete with animals that excel through their agile and effortless motion in natural environments. We propose a bio-inspired musculoskeletal leg architecture driven by antagonistic pairs of electrohydraulic artificial muscles. Our leg is mounted on a boom arm and can adaptively hop on varying terrain in an energy-efficient yet agile manner. It can also detect obstacles through capacitive self-sensing. The leg performs powerful and agile gait motions beyond 5 Hz and high jumps up to 40% of the leg height. Our leg’s tunable stiffness and inherent adaptability allow it to hop over grass, sand, gravel, pebbles, and large rocks using only open-loop force control. The electrohydraulic leg features a low cost of transport (0.73), and while squatting, it consumes only a fraction of the energy (1.2%) compared to its conventional electromagnetic counterpart. Its agile, adaptive, and energy-efficient properties would open a roadmap toward a new class of musculoskeletal robots for versatile locomotion and operation in unstructured natural environments.
Press release Video (overview) Video (technical description) Article in pdf DOI URL BibTeX
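
The cost of transport quoted above is the standard dimensionless measure: energy consumed divided by weight times distance travelled. The one-liner below computes it; the example energy, mass, and distance are invented purely to land near the reported 0.73 and are not measurements from the paper.

```python
# Dimensionless cost of transport: CoT = E / (m * g * d); example values are hypothetical.
def cost_of_transport(energy_j, mass_kg, distance_m, g=9.81):
    return energy_j / (mass_kg * g * distance_m)

print(cost_of_transport(energy_j=35.8, mass_kg=1.0, distance_m=5.0))  # ~0.73
```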

Perceiving Systems Ph.D. Thesis Realistic Digital Human Characters: Challenges, Models and Algorithms Osman, A. A. A. University of Tübingen, September 2024 (Published)
Statistical models for the body, head, and hands are essential in various computer vision tasks. However, popular models like SMPL, MANO, and FLAME produce unrealistic deformations due to inherent flaws in their modeling assumptions and how they are trained, which have become standard practices in constructing models for the body and its parts. This dissertation addresses these limitations by proposing new modeling and training algorithms to improve the realism and generalization of current models. We introduce a new model, STAR (Sparse Trained Articulated Human Body Regressor), which learns a sparse representation of the human body deformations, significantly reducing the number of model parameters compared to models like SMPL. This approach ensures that deformations are spatially localized, leading to more realistic deformations. STAR also incorporates shape-dependent pose deformations, accounting for variations in body shape to enhance overall model accuracy and realism. Additionally, we present a novel federated training algorithm for developing a comprehensive suite of models for the body and its parts. We train an expressive body model, SUPR (Sparse Unified Part-Based Representation), on a federated dataset of full-body scans, including detailed scans of the head, hands, and feet. We then separate SUPR into a full suite of state-of-the-art models for the head, hands, and foot. The new foot model captures complex foot deformations, addressing challenges related to foot shape, pose, and ground contact dynamics. The dissertation concludes by introducing AVATAR (Articulated Virtual Humans Trained By Bayesian Inference From a Single Scan), a novel, data-efficient training algorithm. AVATAR allows the creation of personalized, high-fidelity body models from a single scan by framing model construction as a Bayesian inference problem, thereby enabling training from small-scale datasets while reducing the risk of overfitting. These advancements push the state of the art in human body modeling and training techniques, making them more accessible for broader research and practical applications.
Thesis DOI BibTeX

Physical Intelligence Article Fabrication of gold nanoflower-coated photosensitive meta-structures using PuSL 3D printing for hyperthermia applications Ersoy, S., Yildiz, E., Ren, Z., Zhang, M., Zhang, H., Karaz, S., Han, M., Shiva, A., Yunusa, M., Kaya, C., Sitti, M. ACS Applied Polymer Materials, 6:10807–10823, September 2024 (Published)
The objective of this work was to print nanoparticle-added photothermoresponsive hydrogels to remove the drawbacks of photothermal therapy (PTT), which is a substitute for conventional cancer treatment. For printing hydrogels (LIHAM) via N-isopropylacrylamide (NIPAM), polyethylene glycol, green synthesized gold nanoflowers (AuNPs) coated with rose bengal (RB) as a photosensitizer, and polydopamine (PDA) as photoinitiator material were used. The printing procedure for the meta-structure, which was designed as 20 × 2 mm using the 3DS Max Autodesk Software, was carried out with the microArch S240 BMF PμSL 3D printer. Additionally, the intensity of light was 60 lm, and the exposure printer time was 8–6–6–6–4 s for this research article. Five different photosensitive hydrogels were printed for rheological measurements, Fourier-transform infrared spectroscopy, scanning electron microscopy, transmission electron microscopy, differential scanning calorimetry, and hyperthermia analysis. This study also aims to demonstrate that the kirigami LIHAM hydrogel can change shape by doping with AuNPs@PDA@RB exclusively under 565 nm without the need for a heater. The results indicated that the greatest outcomes in terms of mechanical, rheological, chemical, and thermal properties and printability were obtained with LIHAM hydrogels coated with AuNPs@PDA@RB. As a result, it has been seen that the LIHAM hydrogels coated with green synthesized gold nanoflowers can be produced with a 3D printer in microsized and complex structures and can be used in hyperthermia applications in the future.
DOI URL BibTeX

Physics for Inference and Optimization Article A model for efficient dynamical ranking in networks Della Vecchia, A., Neocosmos, K., Larremore, D. B., Moore, C., De Bacco, C. Phys. Rev. E 110, 034310, September 2024 (Published) Preprint Code Paper DOI URL BibTeX