Publications


Empirical Inference Conference Paper Open X-Embodiment: Robotic Learning Datasets and RT-X Models Open X-Embodiment Collaboration (incl. Guist, S., Schneider, J., Schölkopf, B., Büchler, D.). IEEE International Conference on Robotics and Automation (ICRA), 6892-6903, May 2024 (Published)

Empirical Inference Conference Paper Out-of-Variable Generalization for Discriminative Models Guo, S., Wildberger, J., Schölkopf, B. The Twelfth International Conference on Learning Representations (ICLR), May 2024 (Published)

Empirical Inference Perceiving Systems Conference Paper Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization Liu, W., Qiu, Z., Feng, Y., Xiu, Y., Xue, Y., Yu, L., Feng, H., Liu, Z., Heo, J., Peng, S., Wen, Y., Black, M. J., Weller, A., Schölkopf, B. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), May 2024 (Published)
Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this paper, we study a principled finetuning paradigm -- Orthogonal Finetuning (OFT) -- for downstream task adaptation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters due to the high dimensionality of orthogonal matrices. To address this, we start by examining OFT from an information transmission perspective, and then identify a few key desiderata that enable better parameter-efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm enables efficient information transmission, we propose an efficient orthogonal parameterization using butterfly structures. We apply this parameterization to OFT, creating a novel parameter-efficient finetuning method, called Orthogonal Butterfly (BOFT). By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. Finally, we conduct an extensive empirical study of adapting large vision transformers, large language models, and text-to-image diffusion models to various downstream tasks in vision and language.

Empirical Inference Conference Paper Skill or Luck? Return Decomposition via Advantage Functions Pan, H., Schölkopf, B. The Twelfth International Conference on Learning Representations (ICLR), May 2024 (Published)

Empirical Inference Conference Paper Some Intriguing Aspects about Lipschitz Continuity of Neural Networks Khromov*, G., Singh*, S. P. The Twelfth International Conference on Learning Representations (ICLR), May 2024, *equal contribution (Published)

Empirical Inference Conference Paper Stochastic Gradient Descent for Gaussian Processes Done Right Lin*, J. A., Padhy*, S., Antorán*, J., Tripp, A., Terenin, A., Szepesvari, C., Hernández-Lobato, J. M., Janz, D. The Twelfth International Conference on Learning Representations (ICLR), May 2024, *equal contribution (Published)

Empirical Inference Conference Paper Targeted Reduction of Causal Models Kekić, A., Schölkopf, B., Besserve, M. ICLR 2024 Workshop on AI4DifferentialEquations In Science, May 2024 (Published)

Social Foundations of Computation Conference Paper Test-Time Training on Nearest Neighbors for Large Language Models Hardt, M., Sun, Y. In The Twelfth International Conference on Learning Representations (ICLR 2024), May 2024 (Published)
Many recent efforts augment language models with retrieval, by adding retrieved data to the input context. For this approach to succeed, the retrieved data must be added at both training and test time. Moreover, as input length grows linearly with the size of retrieved data, cost in computation and memory grows quadratically for modern Transformers. To avoid these complications, we simply fine-tune the model on retrieved data at test time, using its standard training setup. We build a large-scale distributed index based on text embeddings of the Pile dataset. For each test input, our system retrieves its neighbors and fine-tunes the model on their text. Surprisingly, retrieving and training on as few as 20 neighbors, each for only one gradient iteration, drastically improves performance across more than 20 language modeling tasks in the Pile. For example, test-time training with nearest neighbors significantly narrows the performance gap between a small GPT-2 and a GPT-Neo model more than 10 times larger. Sufficient index quality and size, however, are necessary. Our work establishes a first baseline of test-time training for language modeling.

Perceiving Systems Article The Poses for Equine Research Dataset (PFERD) Li, C., Mellbin, Y., Krogager, J., Polikovsky, S., Holmberg, M., Ghorbani, N., Black, M. J., Kjellström, H., Zuffi, S., Hernlund, E. Nature Scientific Data, 11, May 2024 (Published)
Studies of quadruped animal motion help us to identify diseases, understand behavior and unravel the mechanics behind gaits in animals. The horse is likely the best-studied animal in this aspect, but data capture is challenging and time-consuming. Computer vision techniques improve animal motion extraction, but the development relies on reference datasets, which are scarce, not open-access and often provide data from only a few anatomical landmarks. Addressing this data gap, we introduce PFERD, a video and 3D marker motion dataset from horses using a full-body set-up of over 100 densely placed skin-attached markers and synchronized videos from ten camera angles. Five horses of diverse conformations provide data for various motions from basic poses (e.g., walking, trotting) to advanced motions (e.g., rearing, kicking). We further express the 3D motions with current techniques and a 3D parameterized model, the hSMAL model, establishing a baseline for 3D horse markerless motion capture. PFERD enables advanced biomechanical studies and provides a resource of ground truth data for the methodological development of markerless motion capture.

Haptic Intelligence Robotic Materials Miscellaneous Three-Dimensional Surface Reconstruction of a Soft System via Distributed Magnetic Sensing Sundaram, V. H., Smith, L., Turin, Z., Rentschler, M. E., Gonzalez Welker, C. Workshop paper (3 pages) presented at the ICRA Workshop on Advancing Wearable Devices and Applications Through Novel Design, Sensing, Actuation, and AI, Yokohama, Japan, May 2024 (Published)
This study presents a new method for reconstructing continuous 3D surface deformations for a soft pneumatic actuation system using embedded magnetic sensors. A finite element analysis (FEA) model was developed to quantify the surface deformation given the magnetometer readings, with a relative error between the experimental and the simulated sensor data of 7.8%. Using the FEA simulation solutions and a basic model-based mapping, our method achieves sub-millimeter accuracy in measuring deformation from sensor data with an absolute error between the experimental and simulated sensor data of 13.5%. These results show promise for real-time adjustments to deformation, crucial in environments like prosthetic and orthotic interfaces with human limbs.

Empirical Inference Conference Paper Towards Meta-Pruning via Optimal Transport Theus, A., Geimer, O., Wicke, F., Hofmann, T., Anagnostidis, S., Singh, S. P. The Twelfth International Conference on Learning Representations (ICLR), May 2024 (Published)

Empirical Inference Conference Paper Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion Meterez*, A., Joudaki*, A., Orabona, F., Immer, A., Rätsch, G., Daneshmand, H. The Twelfth International Conference on Learning Representations (ICLR), May 2024, *equal contribution (Published)

Empirical Inference Conference Paper Transformer Fusion with Optimal Transport Imfeld*, M., Graldi*, J., Giordano*, M., Hofmann, T., Anagnostidis, S., Singh, S. P. The Twelfth International Conference on Learning Representations (ICLR), May 2024, *equal contribution (Published)

Social Foundations of Computation Conference Paper Unprocessing Seven Years of Algorithmic Fairness Cruz, A. F., Hardt, M. In The Twelfth International Conference on Learning Representations (ICLR 2024), May 2024 (Published)
Seven years ago, researchers proposed a postprocessing method to equalize the error rates of a model across different demographic groups. The work launched hundreds of papers purporting to improve over the postprocessing baseline. We empirically evaluate these claims through thousands of model evaluations on several tabular datasets. We find that the fairness-accuracy Pareto frontier achieved by postprocessing contains all other methods we were feasibly able to evaluate. In doing so, we address two common methodological errors that have confounded previous observations. One relates to the comparison of methods with different unconstrained base models. The other concerns methods achieving different levels of constraint relaxation. At the heart of our study is a simple idea we call unprocessing that roughly corresponds to the inverse of postprocessing. Unprocessing allows for a direct comparison of methods using different underlying models and levels of relaxation. Interpreting our findings, we recall a widely overlooked theoretical argument, presented seven years ago, that accurately predicted what we observe.

Autonomous Learning Conference Paper Wild Visual Navigation: Fast Traversability Learning via Pre-Trained Models and Online Self-Supervision Mattamala, M., Frey, J., Libera, P., Chebrolu, N., Martius, G., Cadena, C., Hutter, M., Fallon, M. April 2024 (Accepted)
Natural environments such as forests and grasslands are challenging for robotic navigation because of the false perception of rigid obstacles from high grass, twigs, or bushes. In this work, we present Wild Visual Navigation (WVN), an online self-supervised learning system for visual traversability estimation. The system is able to continuously adapt from a short human demonstration in the field, only using onboard sensing and computing. One of the key ideas to achieve this is the use of high-dimensional features from pre-trained self-supervised models, which implicitly encode semantic information that massively simplifies the learning task. Further, the development of an online scheme for supervision generation enables concurrent training and inference of the learned model in the wild. We demonstrate our approach through diverse real-world deployments in forests, parks, and grasslands. Our system is able to bootstrap the traversable terrain segmentation in less than 5 min of in-field training time, enabling the robot to navigate in complex, previously unseen outdoor terrains.

Empirical Inference Master Thesis Algorithmic Compositional Learning of Language Models Thomm, J. ETH Zurich, Switzerland, April 2024 (Published)

Haptic Intelligence Robotic Materials Miscellaneous Cutaneous Electrohydraulic (CUTE) Wearable Devices for Multimodal Haptic Feedback Sanchez-Tamayo, N., Yoder, Z., Ballardini, G., Rothemund, P., Keplinger, C., Kuchenbecker, K. J. Extended abstract (1 page) presented at the IEEE RoboSoft Workshop on Multimodal Soft Robots for Multifunctional Manipulation, Locomotion, and Human-Machine Interaction, San Diego, USA, April 2024 (Published)

Empirical Inference Miscellaneous Evidence for eccentricity in the population of binary black holes observed by LIGO-Virgo-KAGRA Gupte, N., Ramos-Buades, A., Buonanno, A., Gair, J., Miller, M. C., Dax, M., Green, S. R., Pürrer, M., Wildberger, J., Macke, J. H., Romero-Shaw, I. M., Schölkopf, B. April 2024 (Published)

Social Foundations of Computation Conference Paper ImageNot: A Contrast with ImageNet Preserves Model Rankings Salaudeen, O., Hardt, M. April 2024 (Submitted)
We introduce ImageNot, a dataset designed to match the scale of ImageNet while differing drastically in other aspects. We show that key model architectures developed for ImageNet over the years rank identically when trained and evaluated on ImageNot to how they rank on ImageNet. This is true when training models from scratch or fine-tuning them. Moreover, the relative improvements of each model over earlier models strongly correlate in both datasets. We further give evidence that ImageNot has a similar utility as ImageNet for transfer learning purposes. Our work demonstrates a surprising degree of external validity in the relative performance of image classification models. This stands in contrast with absolute accuracy numbers that typically drop sharply even under small changes to a dataset.

Empirical Inference Conference Paper PILLAR: How to make semi-private learning more effective Pinto, F., Hu, Y., Yang, F., Sanyal, A. 2nd IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 110-139, April 2024 (Published)

Empirical Inference Article SimReadUntil for benchmarking selective sequencing algorithms on ONT devices Mordig, M., Ratsch, G., Kahles, A. Bioinformatics, 40(5):btae199, April 2024 (Published)

Empirical Inference Article VIPurPCA: Visualizing and Propagating Uncertainty in Principal Component Analysis Zabel, S., Hennig, P., Nieselt, K. IEEE Transactions on Visualization and Computer Graphics, 30(4):2011-2022, April 2024 (Published)

Haptic Intelligence Miscellaneous CAPT Motor: A Strong Direct-Drive Rotary Haptic Interface Javot, B., Nguyen, V. H., Ballardini, G., Kuchenbecker, K. J. Hands-on demonstration presented at the IEEE Haptics Symposium, Long Beach, USA, April 2024 (Published)
We have designed and built a new motor named CAPT Motor that delivers continuous and precise torque. It is a brushless ironless motor using a Halbach-magnet ring and a planar axial Lorentz-coil array. This motor is unique as we use a two-phase design allowing for higher fill factor and geometrical accuracy of the coils, as they can all be made separately. This motor outperforms existing Halbach ring and cylinder motors with a torque constant per magnet volume of 9.94 (Nm/A)/dm³, a record in the field. The angular position of the rotor is measured by a high-resolution incremental optical encoder and tracked by a multimodal data acquisition device. The system's control firmware uses this angle measurement to calculate the two-phase motor currents needed to produce the torque commanded by the virtual environment at the rotor's position. The strength and precision of the CAPT Motor's torque and the lack of any mechanical transmission enable unusually high haptic rendering quality, indicating the promise of this new motor design.

Perceiving Systems Ph.D. Thesis Self- and Interpersonal Contact in 3D Human Mesh Reconstruction Müller, L. University of Tübingen, Tübingen, March 2024 (Published)
The ability to perceive tactile stimuli is of substantial importance for human beings in establishing a connection with the surrounding world. Humans rely on the sense of touch to navigate their environment and to engage in interactions with both themselves and other people. The field of computer vision has made great progress in estimating a person’s body pose and shape from an image, however, the investigation of self- and interpersonal contact has received little attention despite its considerable significance. Estimating contact from images is a challenging endeavor because it necessitates methodologies capable of predicting the full 3D human body surface, i.e. an individual’s pose and shape. The limitations of current methods become evident when considering the two primary datasets and labels employed within the community to supervise the task of human pose and shape estimation. First, the widely used 2D joint locations lack crucial information for representing the entire 3D body surface. Second, in datasets of 3D human bodies, e.g. collected from motion capture systems or body scanners, contact is usually avoided, since it naturally leads to occlusion which complicates data cleaning and can break the data processing pipelines. In this thesis, we first address the problem of estimating contact that humans make with themselves from RGB images. To do this, we introduce two novel methods that we use to create new datasets tailored for the task of human mesh estimation for poses with self-contact. We create (1) 3DCP, a dataset of 3D body scan and motion capture data of humans in poses with self-contact and (2) MTP, a dataset of images taken in the wild with accurate 3D reference data using pose mimicking. Next, we observe that 2D joint locations can be readily labeled at scale given an image, however, an equivalent label for self-contact does not exist. 
Consequently, we introduce (3) discrete self-contact (DSC) annotations indicating the pairwise contact of discrete regions on the human body. We annotate three existing image datasets with discrete self-contact and use these labels during mesh optimization to bring body parts supposed to touch into contact. Then we train TUCH, a human mesh regressor, on our new datasets. When evaluated on the task of human body pose and shape estimation on public benchmarks, our results show that knowing about self-contact not only improves mesh estimates for poses with self-contact, but also for poses without self-contact. Next, we study contact humans make with other individuals during close social interaction. Reconstructing these interactions in 3D is a significant challenge due to the mutual occlusion. Furthermore, the existing datasets of images taken in the wild with ground-truth contact labels are of insufficient size to facilitate the training of a robust human mesh regressor. In this work, we employ a generative model, BUDDI, to learn the joint distribution of 3D pose and shape of two individuals during their interaction and use this model as prior during an optimization routine. To construct training data we leverage pre-existing datasets, i.e. motion capture data and Flickr images with discrete contact annotations. Similar to discrete self-contact labels, we utilize discrete human-human contact to jointly fit two meshes to detected 2D joint locations. The majority of methods for generating 3D humans focus on the motion of a single person and operate on 3D joint locations. While these methods can effectively generate motion, their representation of 3D humans is not sufficient for physical contact since they do not model the body surface. Our approach, in contrast, acts on the pose and shape parameters of a human body model, which enables us to sample 3D meshes of two people.
We further demonstrate how the knowledge of human proxemics, incorporated in our model, can be used to guide an optimization routine. For this, in each optimization iteration, BUDDI takes the current mesh and proposes a refinement that we subsequently consider in the objective function. This procedure enables us to go beyond state of the art by forgoing ground-truth discrete human-human contact labels during optimization. Self- and interpersonal contact happen on the surface of the human body, however, the majority of existing art tends to predict bodies with similar, “average” body shape. This is due to a lack of training data of paired images taken in the wild and ground-truth 3D body shape and because 2D joint locations are not sufficient to explain body shape. The most apparent solution would be to collect body scans of people together with their photos. This is, however, a time-consuming and cost-intensive process that lacks scalability. Instead, we leverage the vocabulary humans use to describe body shape. First, we ask annotators to label how much a word like “tall” or “long legs” applies to a human body. We gather these ratings for rendered meshes of various body shapes, for which we have ground-truth body model shape parameters, and for images collected from model agency websites. Using this data, we learn a shape-to-attribute (A2S) model that predicts body shape ratings from body shape parameters. Then we train a human mesh regressor, SHAPY, on the model agency images wherein we supervise body shape via attribute annotations using A2S. Since no suitable test set of diverse 3D ground-truth body shape with images taken in natural settings exists, we introduce Human Bodies in the Wild (HBW). This novel dataset contains photographs of individuals together with their body scan. Our model predicts more realistic body shapes from an image and quantitatively improves body shape estimation on this new benchmark.
In summary, we present novel datasets, optimization methods, a generative model, and regressors to advance the field of 3D human pose and shape estimation. Taken together, these methods open up ways to obtain more accurate and realistic 3D mesh estimates from images with multiple people in self- and mutual contact poses and with diverse body shapes. This line of research also enables generative approaches to create more natural, human-like avatars. We believe that knowing about self- and human-human contact through computer vision has wide-ranging implications in other fields as for example robotics, fitness, or behavioral science.

Empirical Inference Poster Koopman Spectral Analysis Uncovers the Temporal Structure of Spontaneous Neural Events Shao, K., Xu, Y., Logothetis, N., Shen, Z., Besserve, M. Computational and Systems Neuroscience Meeting (COSYNE), March 2024 (Published)

Haptic Intelligence Conference Paper Expert Perception of Teleoperated Social Exercise Robots Mohan, M., Mat Husin, H., Kuchenbecker, K. J. In Companion of the ACM/IEEE International Conference on Human-Robot Interaction (HRI), 769-773, Boulder, USA, March 2024, Late-Breaking Report (LBR) (5 pages) presented at the IEEE/ACM International Conference on Human-Robot Interaction (HRI) (Published)
Social robots could help address the growing issue of physical inactivity by inspiring users to engage in interactive exercise. Nevertheless, the practical implementation of social exercise robots poses substantial challenges, particularly in terms of personalizing their activities to individuals. We propose that motion-capture-based teleoperation could serve as a viable solution to address these needs by enabling experts to record custom motions that could later be played back without their real-time involvement. To gather feedback about this idea, we conducted semi-structured interviews with eight exercise-therapy professionals. Our findings indicate that experts' attitudes toward social exercise robots become more positive when considering the prospect of teleoperation to record and customize robot behaviors.

Neural Capture and Synthesis Conference Paper GAN-Avatar: Controllable Personalized GAN-based Human Head Avatar Kabadayi, B., Zielonka, W., Bhatnagar, B. L., Pons-Moll, G., Thies, J. In International Conference on 3D Vision (3DV), March 2024 (Published)
Digital humans and, especially, 3D facial avatars have raised a lot of attention in the past years, as they are the backbone of several applications like immersive telepresence in AR or VR. Despite the progress, facial avatars reconstructed from commodity hardware are incomplete and miss out on parts of the side and back of the head, severely limiting the usability of the avatar. This limitation in prior work stems from their requirement of face tracking, which fails for profile and back views. To address this issue, we propose to learn person-specific animatable avatars from images without assuming to have access to precise facial expression tracking. At the core of our method, we leverage a 3D-aware generative model that is trained to reproduce the distribution of facial expressions from the training data. To train this appearance model, we only assume to have a collection of 2D images with the corresponding camera parameters. For controlling the model, we learn a mapping from 3DMM facial expression parameters to the latent space of the generative model. This mapping can be learned by sampling the latent space of the appearance model and reconstructing the facial parameters from a normalized frontal view, where facial expression estimation performs well. With this scheme, we decouple 3D appearance reconstruction and animation control to achieve high fidelity in image synthesis. In a series of experiments, we compare our proposed technique to state-of-the-art monocular methods and show superior quality while not requiring expression tracking of the training data.

Rationality Enhancement Article Gamification of Behavior Change: Mathematical Principle and Proof-of-Concept Study Lieder, F., Chen, P., Prentice, M., Amo, V., Tošić, M. JMIR Serious Games, 12, JMIR Publications, March 2024 (Published)
Many people want to build good habits to become healthier, live longer, or become happier but struggle to change their behavior. Gamification can make behavior change easier by awarding points for the desired behavior and deducting points for its omission.

Social Foundations of Computation Article Integration of Generative AI in the Digital Markets Act: Contestability and Fairness from a Cross-Disciplinary Perspective Yasar, A. G., Chong, A., Dong, E., Gilbert, T., Hladikova, S., Mougan, C., Shen, X., Singh, S., Stoica, A., Thais, S. Workshop on Generative AI + Law (GenLaw), LSE Legal Studies Working Paper, The Fortieth International Conference on Machine Learning (ICML) 2023, March 2024 (Published)
The EU’s Digital Markets Act (DMA) aims to address the lack of contestability and unfair practices in digital markets. But the current framework of the DMA does not adequately cover the rapid advance of generative AI. As the EU adopts AI-specific rules and considers possible amendments to the DMA, this paper suggests that generative AI should be added to the DMA’s list of core platform services. This amendment is the first necessary step to address the emergence of entrenched and durable positions in the generative AI industry.

Empirical Inference Article Learning Graph Embeddings for Open World Compositional Zero-Shot Learning Mancini, M., Naeem, M. F., Xian, Y., Akata, Z. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(3):1545-1560, IEEE, New York, NY, March 2024 (Published)

Perceiving Systems Conference Paper Physically plausible full-body hand-object interaction synthesis Braun, J., Christen, S., Kocabas, M., Aksan, E., Hilliges, O. In International Conference on 3D Vision (3DV 2024), 3DV, March 2024 (Published)
We propose a physics-based method for synthesizing dexterous hand-object interactions in a full-body setting. While recent advancements have addressed specific facets of human-object interactions, a comprehensive physics-based approach remains a challenge. Existing methods often focus on isolated segments of the interaction process and rely on data-driven techniques that may result in artifacts. In contrast, our proposed method embraces reinforcement learning (RL) and physics simulation to mitigate the limitations of data-driven approaches. Through a hierarchical framework, we first learn skill priors for both body and hand movements in a decoupled setting. The generic skill priors learn to decode a latent skill embedding into the motion of the underlying part. A high-level policy then controls hand-object interactions in these pretrained latent spaces, guided by task objectives of grasping and 3D target trajectory following. It is trained using a novel reward function that combines an adversarial style term with a task reward, encouraging natural motions while fulfilling the task incentives. Our method successfully accomplishes the complete interaction task, from approaching an object to grasping and subsequent manipulation. We compare our approach against kinematics-based baselines and show that it leads to more physically plausible motions.

Conference Paper SimpleEgo: Predicting Probabilistic Body Pose from Egocentric Cameras Velasquez, H. C., Hewitt, C., Aliakbarian, S., Baltrušaitis, T. In International Conference on 3D Vision (3DV 2024), 3DV, March 2024 (Accepted)
Our work addresses the problem of egocentric human pose estimation from downwards-facing cameras on head-mounted devices (HMD). This presents a challenging scenario, as parts of the body often fall outside of the image or are occluded. Previous solutions minimize this problem by using fish-eye camera lenses to capture a wider view, but these can present hardware design issues. They also predict 2D heat-maps per joint and lift them to 3D space to deal with self-occlusions, but this requires large network architectures which are impractical to deploy on resource-constrained HMDs. We predict pose from images captured with conventional rectilinear camera lenses. This resolves hardware design issues, but means body parts are often out of frame. As such, we directly regress probabilistic joint rotations represented as matrix Fisher distributions for a parameterized body model. This allows us to quantify pose uncertainties and explain out-of-frame or occluded joints. This also removes the need to compute 2D heat-maps and allows for simplified DNN architectures which require less compute. Given the lack of egocentric datasets using rectilinear camera lenses, we introduce the SynthEgo dataset, a synthetic dataset with 60K stereo images containing high diversity of pose, shape, clothing and skin tone. Our approach achieves state-of-the-art results for this challenging configuration, reducing mean per-joint position error by 23% overall and 58% for the lower body. Our architecture also has eight times fewer parameters and runs twice as fast as the current state-of-the-art. Experiments show that training on our synthetic dataset leads to good generalization to real world images without fine-tuning.

Perceiving Systems Conference Paper ArtiGrasp: Physically Plausible Synthesis of Bi-Manual Dexterous Grasping and Articulation Zhang, H., Christen, S., Fan, Z., Zheng, L., Hwangbo, J., Song, J., Hilliges, O. In International Conference on 3D Vision (3DV 2024), 3DV, March 2024 (Published)
We present ArtiGrasp, a novel method to synthesize bi-manual hand-object interactions that include grasping and articulation. This task is challenging due to the diversity of the global wrist motions and the precise finger control that are necessary to articulate objects. ArtiGrasp leverages reinforcement learning and physics simulations to train a policy that controls the global and local hand pose. Our framework unifies grasping and articulation within a single policy guided by a single hand pose reference. Moreover, to facilitate the training of the precise finger control required for articulation, we present a learning curriculum with increasing difficulty. It starts with single-hand manipulation of stationary objects and continues with multi-agent training including both hands and non-stationary objects. To evaluate our method, we introduce Dynamic Object Grasping and Articulation, a task that involves bringing an object into a target articulated pose. This task requires grasping, relocation, and articulation. We show our method's efficacy towards this task. We further demonstrate that our method can generate motions with noisy hand-object pose estimates from an off-the-shelf image-based regressor.
pdf project code BibTeX

Perceiving Systems Conference Paper GRIP: Generating Interaction Poses Using Spatial Cues and Latent Consistency Taheri, O., Zhou, Y., Tzionas, D., Zhou, Y., Ceylan, D., Pirk, S., Black, M. J. In International Conference on 3D Vision (3DV 2024), 3DV, March 2024 (Published)
Hands are dexterous and highly versatile manipulators that are central to how humans interact with objects and their environment. Consequently, modeling realistic hand-object interactions, including the subtle motion of individual fingers, is critical for applications in computer graphics, computer vision, and mixed reality. Prior work on capturing and modeling humans interacting with objects in 3D focuses on the body and object motion, often ignoring hand pose. In contrast, we introduce GRIP, a learning-based method that takes, as input, the 3D motion of the body and the object, and synthesizes realistic motion for both hands before, during, and after object interaction. As a preliminary step before synthesizing the hand motion, we first use a network, ANet, to denoise the arm motion. Then, we leverage the spatio-temporal relationship between the body and the object to extract novel temporal interaction cues, and use them in a two-stage inference pipeline to generate the hand motion. In the first stage, we introduce a new approach to encourage motion temporal consistency in the latent space (LTC) and generate consistent interaction motions. In the second stage, GRIP generates refined hand poses to avoid hand-object penetrations. Given sequences of noisy body and object motion, GRIP “upgrades” them to include hand-object interaction. Quantitative experiments and perceptual studies demonstrate that GRIP outperforms baseline methods and generalizes to unseen objects and motions from different motion-capture datasets. Our models and code are available for research purposes.
Paper SupMat Poster URL BibTeX

Perceiving Systems Conference Paper PACE: Human and Camera Motion Estimation from in-the-wild Videos Kocabas, M., Yuan, Y., Molchanov, P., Guo, Y., Black, M. J., Hilliges, O., Kautz, J., Iqbal, U. In International Conference on 3D Vision (3DV 2024), 3DV, March 2024 (Published)
We present a method to estimate human motion in a global scene from moving cameras. This is a highly challenging task due to the coupling of human and camera motions in the video. To address this problem, we propose a joint optimization framework that disentangles human and camera motions using both foreground human motion priors and background scene features. Unlike existing methods that use SLAM as initialization, we propose to tightly integrate SLAM and human motion priors in an optimization that is inspired by bundle adjustment. Specifically, we optimize human and camera motions to match both the observed human pose and scene features. This design combines the strengths of SLAM and motion priors, which leads to significant improvements in human and camera motion estimation. We additionally introduce a motion prior that is suitable for batch optimization, making our approach significantly more efficient than existing approaches. Finally, we propose a novel synthetic dataset that enables evaluating camera motion in addition to human motion from dynamic videos. Experiments on the synthetic and real-world RICH datasets demonstrate that our approach substantially outperforms prior art in recovering both human and camera motions.
arXiv Project Page YouTube Poster BibTeX

Perceiving Systems Conference Paper POCO: 3D Pose and Shape Estimation using Confidence Dwivedi, S. K., Schmid, C., Yi, H., Black, M. J., Tzionas, D. In International Conference on 3D Vision (3DV 2024), 3DV, March 2024 (Published)
The regression of 3D Human Pose and Shape (HPS) from an image is becoming increasingly accurate. This makes the results useful for downstream tasks like human action recognition or 3D graphics. Yet, no regressor is perfect, and accuracy can be affected by ambiguous image evidence or by poses and appearance that are unseen during training. Most current HPS regressors, however, do not report the confidence of their outputs, meaning that downstream tasks cannot differentiate accurate estimates from inaccurate ones. To address this, we develop POCO, a novel framework for training HPS regressors to estimate not only a 3D human body, but also their confidence, in a single feed-forward pass. Specifically, POCO estimates both the 3D body pose and a per-sample variance. The key idea is to introduce a Dual Conditioning Strategy (DCS) for regressing uncertainty that is highly correlated to pose reconstruction quality. The POCO framework can be applied to any HPS regressor and here we evaluate it by modifying HMR, PARE, and CLIFF. In all cases, training the network to reason about uncertainty helps it learn to more accurately estimate 3D pose. While this was not our goal, the improvement is modest but consistent. Our main motivation is to provide uncertainty estimates for downstream tasks; we demonstrate this in two ways: (1) We use the confidence estimates to bootstrap HPS training. Given unlabelled image data, we take the confident estimates of a POCO-trained regressor as pseudo ground truth. Retraining with this automatically-curated data improves accuracy. (2) We exploit uncertainty in video pose estimation by automatically identifying uncertain frames (e.g. due to occlusion) and inpainting these from confident frames.
Code Paper SupMat Poster URL BibTeX

Perceiving Systems Conference Paper TADA! Text to Animatable Digital Avatars Liao, T., Yi, H., Xiu, Y., Tang, J., Huang, Y., Thies, J., Black, M. J. In International Conference on 3D Vision (3DV 2024), 3DV, March 2024 (Published)
We introduce TADA, a simple-yet-effective approach that takes textual descriptions and produces expressive 3D avatars with high-quality geometry and lifelike textures, that can be animated and rendered with traditional graphics pipelines. Existing text-based character generation methods are limited in terms of geometry and texture quality, and cannot be realistically animated due to inconsistent alignment between the geometry and the texture, particularly in the face region. To overcome these limitations, TADA leverages the synergy of a 2D diffusion model and an animatable parametric body model. Specifically, we derive an optimizable high-resolution body model from SMPL-X with 3D displacements and a texture map, and use hierarchical rendering with score distillation sampling (SDS) to create high-quality, detailed, holistic 3D avatars from text. To ensure alignment between the geometry and texture, we render normals and RGB images of the generated character and exploit their latent embeddings in the SDS training process. We further introduce various expression parameters to deform the generated character during training, ensuring that the semantics of our generated character remain consistent with the original SMPL-X model, resulting in an animatable character. Comprehensive evaluations demonstrate that TADA significantly surpasses existing approaches on both qualitative and quantitative measures. TADA enables creation of large-scale digital character assets that are ready for animation and rendering, while also being easily editable through natural language. The code will be public for research purposes.
Home Code Video URL BibTeX

Perceiving Systems Neural Capture and Synthesis Conference Paper TECA: Text-Guided Generation and Editing of Compositional 3D Avatars Zhang, H., Feng, Y., Kulits, P., Wen, Y., Thies, J., Black, M. J. In International Conference on 3D Vision (3DV 2024), 3DV, March 2024 (Published)
Our goal is to create a realistic 3D facial avatar with hair and accessories using only a text description. While this challenge has attracted significant recent interest, existing methods either lack realism, produce unrealistic shapes, or do not support editing, such as modifications to the hairstyle. We argue that existing methods are limited because they employ a monolithic modeling approach, using a single representation for the head, face, hair, and accessories. Our observation is that the hair and face, for example, have very different structural qualities that benefit from different representations. Building on this insight, we generate avatars with a compositional model, in which the head, face, and upper body are represented with traditional 3D meshes, and the hair, clothing, and accessories with neural radiance fields (NeRF). The model-based mesh representation provides a strong geometric prior for the face region, improving realism while enabling editing of the person's appearance. By using NeRFs to represent the remaining components, our method is able to model and synthesize parts with complex geometry and appearance, such as curly hair and fluffy scarves. Our novel system synthesizes these high-quality compositional avatars from text descriptions. The experimental results demonstrate that our method, Text-guided generation and Editing of Compositional Avatars (TECA), produces avatars that are more realistic than those of recent methods while being editable because of their compositional nature. For example, our TECA enables the seamless transfer of compositional features like hairstyles, scarves, and other accessories between avatars. This capability supports applications such as virtual try-on.
arXiv project URL BibTeX

Perceiving Systems Conference Paper TeCH: Text-guided Reconstruction of Lifelike Clothed Humans Huang, Y., Yi, H., Xiu, Y., Liao, T., Tang, J., Cai, D., Thies, J. In International Conference on 3D Vision (3DV 2024), 3DV, March 2024 (Published)
Despite recent research advancements in reconstructing clothed humans from a single image, accurately restoring the "unseen regions" with high-level details remains an unsolved challenge that lacks attention. Existing methods often generate overly smooth back-side surfaces with a blurry texture. But how to effectively capture all visual attributes of an individual from a single image, which are sufficient to reconstruct unseen areas (e.g., the back view)? Motivated by the power of foundation models, TeCH reconstructs the 3D human by leveraging 1) descriptive text prompts (e.g., garments, colors, hairstyles) which are automatically generated via a garment parsing model and Visual Question Answering (VQA), 2) a personalized fine-tuned Text-to-Image diffusion model (T2I) which learns the "indescribable" appearance. To represent high-resolution 3D clothed humans at an affordable cost, we propose a hybrid 3D representation based on DMTet, which consists of an explicit body shape grid and an implicit distance field. Guided by the descriptive prompts + personalized T2I diffusion model, the geometry and texture of the 3D humans are optimized through multi-view Score Distillation Sampling (SDS) and reconstruction losses based on the original observation. TeCH produces high-fidelity 3D clothed humans with consistent & delicate texture, and detailed full-body geometry. Quantitative and qualitative experiments demonstrate that TeCH outperforms the state-of-the-art methods in terms of reconstruction accuracy and rendering quality.
Code Home Video arXiv URL BibTeX

Autonomous Learning Conference Paper Generating Realistic Arm Movements in Reinforcement Learning: A Quantitative Comparison of Reward Terms and Task Requirements Charaja, J. P., Wochner, I., Schumacher, P., Ilg, W., Giese, M., Maufroy, C., Bulling, A., Schmitt, S., Martius, G., Haeufle, D. F. Proceedings 2024 10th IEEE RAS/EMBS International Conference for Biomedical Robotics and Biomechatronics (BioRob), 562-568, IEEE, 2024 10th IEEE RAS/EMBS International Conference for Biomedical Robotics and Biomechatronics, BioRob, February 2024 (Published)
The mimicking of human-like arm movement characteristics involves the consideration of three factors during control policy synthesis: (a) chosen task requirements, (b) inclusion of noise during movement execution and (c) chosen optimality principles. Previous studies showed that when considering these factors (a-c) individually, it is possible to synthesize arm movements that either kinematically match the experimental data or reproduce the stereotypical triphasic muscle activation pattern. However, to date no quantitative comparison has been made on how realistic the arm movement generated by each factor is; as well as whether a partial or total combination of all factors results in arm movements with human-like kinematic characteristics and a triphasic muscle pattern. To investigate this, we used reinforcement learning to learn a control policy for a musculoskeletal arm model, aiming to discern which combination of factors (a-c) results in realistic arm movements according to four frequently reported stereotypical characteristics. Our findings indicate that incorporating velocity and acceleration requirements into the reaching task, employing reward terms that encourage minimization of mechanical work, hand jerk, and control effort, along with the inclusion of noise during movement, leads to the emergence of realistic human arm movements in reinforcement learning. We expect that the gained insights will help in the future to better predict desired arm movements and corrective forces in wearable assistive devices.
DOI URL BibTeX