Publications

DEPARTMENTS

Empirical Inference

Haptic Intelligence

Modern Magnetic Systems

Perceiving Systems

Physical Intelligence

Robotic Materials

Social Foundations of Computation


Research Groups

Autonomous Vision

Autonomous Learning

Bioinspired Autonomous Miniature Robots

Dynamic Locomotion

Embodied Vision

Human Aspects of Machine Learning

Intelligent Control Systems

Learning and Dynamical Systems

Locomotion in Biorobotic and Somatic Systems

Micro, Nano, and Molecular Systems

Movement Generation and Control

Neural Capture and Synthesis

Physics for Inference and Optimization

Organizational Leadership and Diversity

Probabilistic Learning Group


Perceiving Systems Conference Paper Detecting Human-Object Contact in Images Chen, Y., Kumar Dwivedi, S., Black, M. J., Tzionas, D. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 17100-17110, CVPR, June 2023 (Published)
Humans constantly contact objects to move and perform tasks. Thus, detecting human-object contact is important for building human-centered artificial intelligence. However, there exists no robust method to detect contact between the body and the scene from an image, and there exists no dataset to learn such a detector. We fill this gap with HOT ("Human-Object conTact"), a new dataset of human-object contacts for images. To build HOT, we use two data sources: (1) We use the PROX dataset of 3D human meshes moving in 3D scenes, and automatically annotate 2D image areas for contact via 3D mesh proximity and projection. (2) We use the V-COCO, HAKE and Watch-n-Patch datasets, and ask trained annotators to draw polygons for the 2D image areas where contact takes place. We also annotate the body parts involved in each contact. We use our HOT dataset to train a new contact detector, which takes a single color image as input, and outputs 2D contact heatmaps as well as the body-part labels that are in contact. This is a new and challenging task that extends current foot-ground or hand-object contact detectors to the full generality of the whole body. The detector uses a part-attention branch to guide contact estimation through the context of the surrounding body parts and scene. We evaluate our detector extensively, and quantitative results show that our model outperforms baselines, and that all components contribute to better performance. Results on images from an online repository show reasonable detections and generalizability.
Project Page Paper Code DOI URL BibTeX
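As a rough illustration of the proximity-and-projection annotation described above, here is a minimal numpy/scipy sketch; the 5 cm threshold, the image size, and the assumption that vertices are already in the camera frame are our illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def contact_heatmap(verts, scene_pts, K, thresh=0.05, img_hw=(480, 640)):
    """Label body vertices within `thresh` meters of the scene, then project
    them with a pinhole camera to get a rough 2D contact mask.
    Assumes verts and scene_pts are (N, 3) arrays in the camera frame."""
    dists, _ = cKDTree(scene_pts).query(verts)   # nearest scene point per vertex
    contact = verts[dists < thresh]              # vertices judged "in contact"
    heat = np.zeros(img_hw)
    if len(contact) == 0:
        return heat
    uv = (K @ contact.T).T                       # pinhole projection
    uv = uv[:, :2] / uv[:, 2:3]
    px = np.round(uv).astype(int)
    ok = (px[:, 0] >= 0) & (px[:, 0] < img_hw[1]) & (px[:, 1] >= 0) & (px[:, 1] < img_hw[0])
    heat[px[ok, 1], px[ok, 0]] = 1.0
    return heat
```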

Empirical Inference Conference Paper Editing a Woman’s Voice Costello, A., Fedorova, E., Jin, Z., Mihalcea, R. International Conference on the Science of Science and Innovation (ICSSI), June 2023 (Published) URL BibTeX

Haptic Intelligence Article Generating Clear Vibrotactile Cues with a Magnet Embedded in a Soft Finger Sheath Gertler, I., Serhat, G., Kuchenbecker, K. J. Soft Robotics, 10(3):624-635, June 2023 (Published)
Haptic displays act on the user's body to stimulate the sense of touch and enrich applications from gaming and computer-aided design to rehabilitation and remote surgery. However, when crafted from typical rigid robotic components, they tend to be heavy, bulky, and expensive, while sleeker designs often struggle to create clear haptic cues. This article introduces a lightweight wearable silicone finger sheath that can deliver salient and rich vibrotactile cues using electromagnetic actuation. We fabricate the sheath on a ferromagnetic mandrel with a process based on dip molding, a robust fabrication method that is rarely used in soft robotics but is suitable for commercial production. A miniature rare-earth magnet embedded within the silicone layers at the center of the finger pad is driven to vibrate by the application of alternating current to a nearby air-coil. Experiments are conducted to determine the amplitude of the magnetic force and the frequency response function for the displacement amplitude of the magnet perpendicular to the skin. In addition, high-fidelity finite element analyses of the finger wearing the device are performed to investigate the trends observed in the measurements. The experimental and simulated results show consistent dynamic behavior from 10 to 1000 Hz, with the displacement decreasing after about 300 Hz. These results match the detection threshold profile obtained in a psychophysical study performed by 17 users, where more current was needed only at the highest frequency. A cue identification experiment and a demonstration in virtual reality validate the feasibility of this approach to fingertip haptics.
DOI BibTeX

Perceiving Systems Conference Paper Generating Holistic 3D Human Motion from Speech Yi, H., Liang, H., Liu, Y., Cao, Q., Wen, Y., Bolkart, T., Tao, D., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 469-480, CVPR, June 2023 (Published)
This work addresses the problem of generating 3D holistic body motions from human speech. Given a speech recording, we synthesize sequences of 3D body poses, hand gestures, and facial expressions that are realistic and diverse. To achieve this, we first build a high-quality dataset of 3D holistic body meshes with synchronous speech. We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately. The separated modeling stems from the fact that face articulation strongly correlates with human speech, while body poses and hand gestures are less correlated. Specifically, we employ an autoencoder for face motions, and a compositional vector-quantized variational autoencoder (VQ-VAE) for the body and hand motions. The compositional VQ-VAE is key to generating diverse results. Additionally, we propose a cross-conditional autoregressive model that generates body poses and hand gestures, leading to coherent and realistic motions. Extensive experiments and user studies demonstrate that our proposed approach achieves state-of-the-art performance both qualitatively and quantitatively. Our novel dataset and code are released for research purposes at https://talkshow.is.tue.mpg.de.
project SHOW code TalkSHOW code arXiv paper BibTeX
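For readers unfamiliar with the VQ-VAE machinery this abstract leans on, here is a minimal PyTorch sketch of the vector-quantization step with a straight-through gradient; shapes are illustrative, and this does not reproduce the authors' compositional variant.

```python
import torch

def vector_quantize(z, codebook):
    """Snap each latent vector to its nearest codebook entry.
    z: (B, D) latents, codebook: (K, D) learned codes."""
    d = torch.cdist(z, codebook)               # (B, K) pairwise distances
    idx = d.argmin(dim=1)                      # nearest code per latent
    z_q = codebook[idx]
    # straight-through estimator: forward uses z_q, backward flows into z
    z_st = z + (z_q - z).detach()
    commit_loss = ((z_q.detach() - z) ** 2).mean()
    return z_st, idx, commit_loss
```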

Perceiving Systems Conference Paper High-Fidelity Clothed Avatar Reconstruction from a Single Image Liao, T., Zhang, X., Xiu, Y., Yi, H., Liu, X., Qi, G., Zhang, Y., Wang, X., Zhu, X., Lei, Z. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 8662-8672, CVPR, June 2023 (Published)
This paper presents a framework for efficient 3D clothed avatar reconstruction. By combining the advantages of the high accuracy of optimization-based methods and the efficiency of learning-based methods, we propose a coarse-to-fine way to realize a high-fidelity clothed avatar reconstruction (CAR) from a single image. At the first stage, we use an implicit model to learn the general shape in the canonical space of a person in a learning-based way, and at the second stage, we refine the surface detail by estimating the non-rigid deformation in the posed space in an optimization-based way. A hyper-network is utilized to generate a good initialization so that the convergence of the optimization process is greatly accelerated. Extensive experiments on various datasets show that the proposed CAR successfully produces high-fidelity avatars for arbitrarily clothed humans in real scenes.
Code Paper Homepage Youtube URL BibTeX

Haptic Intelligence Article In the Arms of a Robot: Designing Autonomous Hugging Robots with Intra-Hug Gestures Block, A. E., Seifi, H., Hilliges, O., Gassert, R., Kuchenbecker, K. J. ACM Transactions on Human-Robot Interaction, 12(2):1-49, June 2023, Special Issue on Designing the Robot Body: Critical Perspectives on Affective Embodied Interaction (Published)
Hugs are complex affective interactions that often include gestures like squeezes. We present six new guidelines for designing interactive hugging robots, which we validate through two studies with our custom robot. To achieve autonomy, we investigated robot responses to four human intra-hug gestures: holding, rubbing, patting, and squeezing. Thirty-two users each exchanged and rated sixteen hugs with an experimenter-controlled HuggieBot 2.0. The robot's inflated torso's microphone and pressure sensor collected data of the subjects' demonstrations that were used to develop a perceptual algorithm that classifies user actions with 88% accuracy. Users enjoyed robot squeezes regardless of their performed action; they valued variety in the robot response and appreciated robot-initiated intra-hug gestures. From average user ratings, we created a probabilistic behavior algorithm that chooses robot responses in real time. We implemented improvements to the robot platform to create HuggieBot 3.0 and then validated its gesture perception system and behavior algorithm with sixteen users. The robot's responses and proactive gestures were greatly enjoyed. Users found the robot more natural, enjoyable, and intelligent in the last phase of the experiment than in the first. After the study, they felt more understood by the robot and thought robots were nicer to hug.
DOI BibTeX
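To illustrate how a probabilistic behavior algorithm can choose responses from average user ratings, here is a toy softmax-sampling sketch; the ratings table, gesture names, and temperature are invented placeholders, not data from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mean user ratings (1-10) of robot responses per detected gesture;
# the real values come from the paper's user study, these numbers are made up.
ratings = {
    "squeeze": {"squeeze_back": 8.1, "rub": 6.9, "pat": 6.2, "hold": 5.8},
    "pat":     {"pat_back": 7.4, "squeeze_back": 7.0, "hold": 5.5},
}

def choose_response(gesture, temperature=1.0):
    """Sample a response with probability increasing in its mean rating."""
    names = list(ratings[gesture])
    scores = np.array([ratings[gesture][n] for n in names]) / temperature
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return rng.choice(names, p=p)

print(choose_response("squeeze"))
```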

Perceiving Systems Conference Paper Instant Multi-View Head Capture through Learnable Registration Bolkart, T., Li, T., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 768-779, CVPR, June 2023 (Published)
Existing methods for capturing datasets of 3D heads in dense semantic correspondence are slow and commonly address the problem in two separate steps: multi-view stereo (MVS) reconstruction followed by non-rigid registration. To simplify this process, we introduce TEMPEH (Towards Estimation of 3D Meshes from Performances of Expressive Heads) to directly infer 3D heads in dense correspondence from calibrated multi-view images. Registering datasets of 3D scans typically requires manual parameter tuning to find the right balance between accurately fitting the scans’ surfaces and being robust to scanning noise and outliers. Instead, we propose to jointly register a 3D head dataset while training TEMPEH. Specifically, during training we minimize a geometric loss commonly used for surface registration, effectively leveraging TEMPEH as a regularizer. Our multi-view head inference builds on a volumetric feature representation that samples and fuses features from each view using camera calibration information. To account for partial occlusions and a large capture volume that enables head movements, we use view- and surface-aware feature fusion, and a spatial transformer-based head localization module, respectively. We use raw MVS scans as supervision during training, but, once trained, TEMPEH directly predicts 3D heads in dense correspondence without requiring scans. Predicting one head takes about 0.3 seconds with a median reconstruction error of 0.26 mm, 64% lower than the current state-of-the-art. This enables the efficient capture of large datasets containing multiple people and diverse facial motions. Code, model, and data are publicly available at https://tempeh.is.tue.mpg.de.
project video paper sup. mat. poster BibTeX
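A minimal PyTorch sketch of the volumetric sampling idea above: project grid points into each calibrated view, bilinearly sample per-view feature maps, and fuse. The plain mean fusion here is a stand-in for the paper's view- and surface-aware fusion, and the head localization module is omitted.

```python
import torch
import torch.nn.functional as F

def sample_volume_features(feats, K, Rt, grid_pts):
    """feats: (V, C, H, W) per-view feature maps; K: (V, 3, 3) intrinsics;
    Rt: (V, 3, 4) extrinsics; grid_pts: (N, 3) world-space grid points."""
    V, C, H, W = feats.shape
    N = grid_pts.shape[0]
    homog = torch.cat([grid_pts, torch.ones(N, 1)], dim=1)   # (N, 4)
    fused = 0.0
    for v in range(V):
        cam = Rt[v] @ homog.T                                # (3, N) camera coords
        uv = K[v] @ cam
        uv = uv[:2] / uv[2:3].clamp(min=1e-6)                # pixel coords
        gx = uv[0] / (W - 1) * 2 - 1                         # normalize to [-1, 1]
        gy = uv[1] / (H - 1) * 2 - 1
        grid = torch.stack([gx, gy], dim=-1).view(1, 1, N, 2)
        fused = fused + F.grid_sample(feats[v:v+1], grid, align_corners=True)
    return (fused / V).view(C, N)                            # mean-fused features
```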

Neural Capture and Synthesis Perceiving Systems Conference Paper Instant Volumetric Head Avatars Zielonka, W., Bolkart, T., Thies, J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 4574-4584, CVPR, June 2023 (Published)
We present Instant Volumetric Head Avatars (INSTA), a novel approach for reconstructing photo-realistic digital avatars instantaneously. INSTA models a dynamic neural radiance field based on neural graphics primitives embedded around a parametric face model. Our pipeline is trained on a single monocular RGB portrait video that observes the subject under different expressions and views. While state-of-the-art methods take up to several days to train an avatar, our method can reconstruct a digital avatar in less than 10 minutes on modern GPU hardware, which is orders of magnitude faster than previous solutions. In addition, it allows for the interactive rendering of novel poses and expressions. By leveraging the geometry prior of the underlying parametric face model, we demonstrate that INSTA extrapolates to unseen poses. In quantitative and qualitative studies on various subjects, INSTA outperforms state-of-the-art methods regarding rendering quality and training time.
pdf project video code face tracker code dataset DOI URL BibTeX

Empirical Inference Conference Paper Learning Locomotion Skills from MPC in Sensor Space Khadiv, M., Meduri, A., Zhu, H., Righetti, L., Schölkopf, B. Proceedings of The 5th Annual Learning for Dynamics and Control Conference (L4DC), 211:1218-1230, (Editors: Matni, Nikolai and Morari, Manfred and Pappas, George J.), PMLR, June 2023 (Published) URL BibTeX

Perceiving Systems Conference Paper Learning from Synthetic Data Generated with GRADE Bonetto, E., Xu, C., Ahmad, A. In ICRA 2023 Pretraining for Robotics (PT4R) Workshop, June 2023 (Published)
Recently, synthetic data generation and realistic rendering have advanced tasks like target tracking and human pose estimation. Simulations for most robotics applications are obtained in (semi)static environments, with specific sensors and low visual fidelity. To solve this, we present a fully customizable framework for generating realistic animated dynamic environments (GRADE) for robotics research, first introduced in [GRADE]. GRADE supports full simulation control, ROS integration, and realistic physics, within an engine that produces high-visual-fidelity images and ground-truth data. We use GRADE to generate a dataset focused on indoor dynamic scenes with people and flying objects. Using this, we evaluate the performance of YOLO and Mask R-CNN on the tasks of segmenting and detecting people. Our results provide evidence that using data generated with GRADE can improve model performance when used for a pre-training step. We also show that models trained using only synthetic data can generalize well to real-world images in the same application domain, such as the ones from the TUM-RGBD dataset. The code, results, trained models, and the generated data are provided as open-source at https://eliabntt.github.io/grade-rr.
Code Data and network models pdf URL BibTeX

Physical Intelligence Article Liquid Metal Actuators: A Comparative Analysis of Surface Tension Controlled Actuation Liao, J., Majidi, C., Sitti, M. Advanced Materials, e2300560, June 2023 (Published) DOI BibTeX

Dynamic Locomotion Conference Paper Multi-segmented Adaptive Feet for Versatile Legged Locomotion in Natural Terrain Chatterjee, A., Mo, A., Kiss, B., Goenen, E. C., Badri-Spröwitz, A. 2023 IEEE International Conference on Robotics and Automation (ICRA 2023), 1162-1169, IEEE, Piscataway, NJ, IEEE International Conference on Robotics and Automation (ICRA), June 2023 (Published)
Most legged robots are built with leg structures from serially mounted links and actuators and are controlled through complex controllers and sensor feedback. In comparison, animals developed multi-segment legs, mechanical coupling between joints, and multi-segmented feet. They run agilely over all terrains, arguably with simpler locomotion control. Here we focus on developing foot mechanisms that resist slipping and sinking in natural terrain. We present the first results of multi-segment feet mounted to a bird-inspired robot leg with multi-joint mechanical tendon coupling. Our one- and two-segment, mechanically adaptive feet show increased viable horizontal forces on multiple soft and hard substrates before starting to slip. We also observe that segmented feet reduce sinking on soft substrates compared to ball and cylinder feet. We report how multi-segmented feet provide a large range of viable centre-of-pressure points, well suited for bipedal robots but also for quadruped robots on slopes and natural terrain. Our results also offer a functional understanding of segmented feet in animals like ratite birds.
Youtube Edmond CAD DOI URL BibTeX

Haptic Intelligence Miscellaneous Naturalistic Vibrotactile Feedback Could Facilitate Telerobotic Assembly on Construction Sites Gong, Y., Javot, B., Lauer, A. P. R., Sawodny, O., Kuchenbecker, K. J. Poster presented at the ICRA Workshop on Future of Construction: Robot Perception, Mapping, Navigation, Control in Unstructured and Cluttered Environments, London, UK, June 2023 (Published) BibTeX

Empirical Inference Article Quantification of intratumoural heterogeneity in mice and patients via machine-learning models trained on PET–MRI data Katiyar, P., Schwenck, J., Frauenfeld, L., Divine, M. R., Agrawal, V., Kohlhofer, U., Gatidis, S., Kontermann, R., Königsrainer, A., Quintanilla-Martinez, L., la Fougère, C., Schölkopf, B., Pichler, B. J., Disselhorst, J. A. Nature Biomedical Engineering, 7(8):1014-1027, June 2023 (Published) DOI BibTeX

Haptic Intelligence Perceiving Systems Conference Paper Reconstructing Signing Avatars from Video Using Linguistic Priors Forte, M., Kulits, P., Huang, C. P., Choutas, V., Tzionas, D., Kuchenbecker, K. J., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 12791-12801, CVPR, June 2023 (Published)
Sign language (SL) is the primary method of communication for the 70 million Deaf people around the world. Video dictionaries of isolated signs are a core SL learning tool. Replacing these with 3D avatars can aid learning and enable AR/VR applications, improving access to technology and online media. However, little work has attempted to estimate expressive 3D avatars from SL video; occlusion, noise, and motion blur make this task difficult. We address this by introducing novel linguistic priors that are universally applicable to SL and provide constraints on 3D hand pose that help resolve ambiguities within isolated signs. Our method, SGNify, captures fine-grained hand pose, facial expression, and body movement fully automatically from in-the-wild monocular SL videos. We evaluate SGNify quantitatively by using a commercial motion-capture system to compute 3D avatars synchronized with monocular video. SGNify outperforms state-of-the-art 3D body-pose- and shape-estimation methods on SL videos. A perceptual study shows that SGNify's 3D reconstructions are significantly more comprehensible and natural than those of previous methods and are on par with the source videos. Code and data are available at sgnify.is.tue.mpg.de.
pdf arXiv project code DOI URL BibTeX

Empirical Inference Conference Paper Robustness Implies Fairness in Causal Algorithmic Recourse Ehyaei, A., Karimi, A., Schölkopf, B., Maghsudi, S. Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (FAccT), 984-1001, ACM, June 2023 (Published) DOI BibTeX

Perceiving Systems Conference Paper Simulation of Dynamic Environments for SLAM Bonetto, E., Xu, C., Ahmad, A. In ICRA 2023 Workshop on the Active Methods in Autonomous Navigation, ICRA, June 2023 (Published)
Simulation engines are widely adopted in robotics. However, they lack either full simulation control, ROS integration, realistic physics, or photorealism. Recently, synthetic data generation and realistic rendering have advanced tasks like target tracking and human pose estimation. However, when focusing on vision applications, there is usually a lack of information like sensor measurements or time continuity. On the other hand, simulations for most robotics tasks are performed in (semi)static environments, with specific sensors and low visual fidelity. To solve this, we introduced in our previous work a fully customizable framework for generating realistic animated dynamic environments (GRADE) [1]. We use GRADE to generate an indoor dynamic environment dataset and then compare multiple SLAM algorithms on different sequences. By doing that, we show how current research over-relies on known benchmarks, failing to generalize. Our tests with refined YOLO and Mask R-CNN models provide further evidence that additional research in dynamic SLAM is necessary. The code, results, and generated data are provided as open-source at https://eliabntt.github.io/grade-rr.
Code Evaluation code Data pdf URL BibTeX

Perceiving Systems Article Virtual Reality Exposure to a Healthy Weight Body Is a Promising Adjunct Treatment for Anorexia Nervosa Behrens, S. C., Tesch, J., Sun, P. J., Starke, S., Black, M. J., Schneider, H., Pruccoli, J., Zipfel, S., Giel, K. E. Psychotherapy and Psychosomatics, 92(3):170-179, June 2023 (Published)
Introduction/Objective: Treatment results of anorexia nervosa (AN) are modest, with fear of weight gain being a strong predictor of treatment outcome and relapse. Here, we present a virtual reality (VR) setup for exposure to healthy weight and evaluate its potential as an adjunct treatment for AN. Methods: In two studies, we investigate VR experience and clinical effects of VR exposure to higher weight in 20 women with high weight concern or shape concern and in 20 women with AN. Results: In study 1, 90% of participants (18/20) reported symptoms of high arousal but verbalized low to medium levels of fear. Study 2 demonstrated that VR exposure to healthy weight induced high arousal in patients with AN and yielded a trend that four sessions of exposure improved fear of weight gain. Explorative analyses revealed three clusters of individual reactions to exposure, which need further exploration. Conclusions: VR exposure is a well-accepted and powerful tool for evoking fear of weight gain in patients with AN. We observed a statistical trend that repeated virtual exposure to healthy weight improved fear of weight gain with large effect sizes. Further studies are needed to determine the mechanisms and differential effects.
DOI URL BibTeX

Perceiving Systems Conference Paper 3D Human Pose Estimation via Intuitive Physics Tripathi, S., Müller, L., Huang, C. P., Taheri, O., Black, M. J., Tzionas, D. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 4713-4725, CVPR, June 2023 (Published)
The estimation of 3D human body shape and pose from images has advanced rapidly. While the results are often well aligned with image features in the camera view, the 3D pose is often physically implausible; bodies lean, float, or penetrate the floor. This is because most methods ignore the fact that bodies are typically supported by the scene. To address this, some methods exploit physics engines to enforce physical plausibility. Such methods, however, are not differentiable, rely on unrealistic proxy bodies, and are difficult to integrate into existing optimization and learning frameworks. To account for this, we take a different approach that exploits novel intuitive-physics (IP) terms that can be inferred from a 3D SMPL body interacting with the scene. Specifically, we infer biomechanically relevant features such as the pressure heatmap of the body on the floor, the Center of Pressure (CoP) from the heatmap, and the SMPL body’s Center of Mass (CoM) projected on the floor. With these, we develop IPMAN to estimate a 3D body from a color image in a “stable” configuration by encouraging plausible floor contact and overlapping CoP and CoM. Our IP terms are intuitive, easy to implement, fast to compute, and can be integrated into any SMPL-based optimization or regression method; we show examples of both. To evaluate our method, we present MoYo, a dataset with synchronized multi-view color images and 3D bodies with complex poses, body-floor contact, and ground-truth CoM and pressure. Evaluation on MoYo, RICH, and Human3.6M shows that our IP terms produce more plausible results than the state of the art; they improve accuracy for static poses, while not hurting dynamic ones.
Project Page Moyo Dataset DOI URL BibTeX
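A minimal PyTorch sketch of the stability idea above: derive the Center of Pressure from per-vertex pressure weights and penalize its distance to the floor-projected Center of Mass. The uniform-mass CoM proxy and the y-up floor convention are our simplifying assumptions.

```python
import torch

def stability_loss(verts, pressure, floor_xy_idx=(0, 2)):
    """Encourage the body's CoM, projected on the floor, to lie over the CoP.
    verts: (N, 3) body vertices; pressure: (N,) per-vertex pressure weights."""
    x, z = floor_xy_idx                                  # assume y is "up"
    w = pressure / pressure.sum().clamp(min=1e-8)
    cop = torch.stack([(w * verts[:, x]).sum(), (w * verts[:, z]).sum()])
    com = torch.stack([verts[:, x].mean(), verts[:, z].mean()])  # uniform-mass proxy
    return ((com - cop) ** 2).sum()
```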

Perceiving Systems Conference Paper ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation Fan, Z., Taheri, O., Tzionas, D., Kocabas, M., Kaufmann, M., Black, M. J., Hilliges, O. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 12943-12954, CVPR, June 2023 (Published)
Humans intuitively understand that inanimate objects do not move by themselves, but that state changes are typically caused by human manipulation (e.g., the opening of a book). This is not yet the case for machines. In part this is because there exist no datasets with ground-truth 3D annotations for the study of physically consistent and synchronised motion of hands and articulated objects. To this end, we introduce ARCTIC -- a dataset of two hands that dexterously manipulate objects, containing 2.1M video frames paired with accurate 3D hand and object meshes and detailed, dynamic contact information. It contains bi-manual articulation of objects such as scissors or laptops, where hand poses and object states evolve jointly in time. We propose two novel articulated hand-object interaction tasks: (1) Consistent motion reconstruction: Given a monocular video, the goal is to reconstruct two hands and articulated objects in 3D, so that their motions are spatio-temporally consistent. (2) Interaction field estimation: Dense relative hand-object distances must be estimated from images. We introduce two baselines, ArcticNet and InterField, and evaluate them qualitatively and quantitatively on ARCTIC.
Project Page Code Paper arXiv Video DOI URL BibTeX
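In its simplest form, the interaction-field target above reduces to dense nearest-neighbor distances between hand and object meshes; here is a minimal PyTorch sketch using vertex-to-vertex rather than point-to-surface distances, a simplification on our part.

```python
import torch

def interaction_field(hand_verts, obj_verts):
    """Dense relative distances: for every hand vertex, the distance to the
    closest object vertex, and vice versa.
    hand_verts: (H, 3); obj_verts: (O, 3)."""
    d = torch.cdist(hand_verts, obj_verts)   # (H, O) all pairwise distances
    hand_to_obj = d.min(dim=1).values        # (H,) per hand vertex
    obj_to_hand = d.min(dim=0).values        # (O,) per object vertex
    return hand_to_obj, obj_to_hand
```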

Perceiving Systems Conference Paper BEDLAM: A Synthetic Dataset of Bodies Exhibiting Detailed Lifelike Animated Motion Black, M. J., Patel, P., Tesch, J., Yang, J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 8726-8737, CVPR, June 2023 (Published)
We show, for the first time, that neural networks trained only on synthetic data achieve state-of-the-art accuracy on the problem of 3D human pose and shape (HPS) estimation from real images. Previous synthetic datasets have been small, unrealistic, or lacked realistic clothing. Achieving sufficient realism is non-trivial and we show how to do this for full bodies in motion. Specifically, our BEDLAM dataset contains monocular RGB videos with ground-truth 3D bodies in SMPL-X format. It includes a diversity of body shapes, motions, skin tones, hair, and clothing. The clothing is realistically simulated on the moving bodies using commercial clothing physics simulation. We render varying numbers of people in realistic scenes with varied lighting and camera motions. We then train various HPS regressors using BEDLAM and achieve state-of-the-art accuracy on real-image benchmarks despite training with synthetic data. We use BEDLAM to gain insights into what model design choices are important for accuracy. With good synthetic training data, we find that a basic method like HMR approaches the accuracy of the current SOTA method (CLIFF). BEDLAM is useful for a variety of tasks and all images, ground truth bodies, 3D clothing, support code, and more are available for research purposes. Additionally, we provide detailed information about our synthetic data generation pipeline, enabling others to generate their own datasets. See the project page: https://bedlam.is.tue.mpg.de/.
pdf project CVF code DOI URL BibTeX

Perceiving Systems Conference Paper BITE: Beyond Priors for Improved Three-D Dog Pose Estimation Rüegg, N., Tripathi, S., Schindler, K., Black, M. J., Zuffi, S. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 8867-8876, CVPR, June 2023 (Published)
We address the problem of inferring the 3D shape and pose of dogs from images. Given the lack of 3D training data, this problem is challenging, and the best methods lag behind those designed to estimate human shape and pose. To make progress, we attack the problem from multiple sides at once. First, we need a good 3D shape prior, like those available for humans. To that end, we learn a dog-specific 3D parametric model, called D-SMAL. Second, existing methods focus on dogs in standing poses because when they sit or lie down, their legs are self-occluded and their bodies deform. Without access to a good pose prior or 3D data, we need an alternative approach. To that end, we exploit contact with the ground as a form of side information. We consider an existing large dataset of dog images and label any 3D contact of the dog with the ground. We exploit body-ground contact in estimating dog pose and find that it significantly improves results. Third, we develop a novel neural network architecture to infer and exploit this contact information. Fourth, to make progress, we have to be able to measure it. Current evaluation metrics are based on 2D features like keypoints and silhouettes, which do not directly correlate with 3D errors. To address this, we create a synthetic dataset containing rendered images of scanned 3D dogs. With these advances, our method recovers significantly better dog shape and pose than the state of the art, and we evaluate this improvement in 3D. Our code, model and test dataset are publicly available for research purposes at https://bite.is.tue.mpg.de.
pdf supp project DOI URL BibTeX
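To show how labeled body-ground contact can act as side information during pose fitting, here is a minimal PyTorch loss sketch; placing the ground plane at height 0 and adding a penetration term are our simplifying assumptions, not the paper's exact formulation.

```python
import torch

def ground_contact_loss(verts, contact_mask, up_axis=1):
    """Pull labeled contact vertices onto the ground plane (height 0) and
    keep all vertices from sinking below it.
    verts: (N, 3); contact_mask: (N,) bool, True where contact was labeled."""
    h = verts[:, up_axis]
    contact_term = (h[contact_mask] ** 2).mean()   # contact verts at height 0
    penetration_term = torch.relu(-h).mean()       # nothing below the floor
    return contact_term + penetration_term
```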

Perceiving Systems Conference Paper ECON: Explicit Clothed humans Optimized via Normal integration Xiu, Y., Yang, J., Cao, X., Tzionas, D., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 512-523, CVPR, June 2023 (Published)
The combination of artist-curated scans, and deep implicit functions (IF), is enabling the creation of detailed, clothed, 3D humans from images. However, existing methods are far from perfect. IF-based methods recover free-form geometry but produce disembodied limbs or degenerate shapes for unseen poses or clothes. To increase robustness for these cases, existing work uses an explicit parametric body model to constrain surface reconstruction, but this limits the recovery of free-form surfaces such as loose clothing that deviates from the body. What we want is a method that combines the best properties of implicit and explicit methods. To this end, we make two key observations: (1) current networks are better at inferring detailed 2D maps than full-3D surfaces, and (2) a parametric model can be seen as a “canvas” for stitching together detailed surface patches. ECON infers high-fidelity 3D humans even in loose clothes and challenging poses, while having realistic faces and fingers. This goes beyond previous methods. Quantitative evaluation on the CAPE and Renderpeople datasets shows that ECON is more accurate than the state of the art. Perceptual studies also show that ECON’s perceived realism is better by a large margin.
Page Paper Demo Code Video Colab DOI URL BibTeX

Perceiving Systems Conference Paper HOOD: Hierarchical Graphs for Generalized Modelling of Clothing Dynamics Grigorev, A., Thomaszewski, B., Black, M. J., Hilliges, O. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 16965-16974, CVPR, June 2023 (Published)
We propose a method that leverages graph neural networks, multi-level message passing, and unsupervised training to enable real-time prediction of realistic clothing dynamics. Whereas existing methods based on linear blend skinning must be trained for specific garments, our method is agnostic to body shape and applies to tight-fitting garments as well as loose, free-flowing clothing. Our method furthermore handles changes in topology (e.g., garments with buttons or zippers) and material properties at inference time. As one key contribution, we propose a hierarchical message-passing scheme that efficiently propagates stiff stretching modes while preserving local detail. We empirically show that our method outperforms strong baselines quantitatively and that its results are perceived as more realistic than state-of-the-art methods.
arXiv project pdf supp URL BibTeX
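A minimal sketch of the hierarchical message-passing intuition above: combine one hop on the fine garment graph with a pooled coarse level so that stiff, long-range stretching modes propagate in a single step. The residual fusion and mean aggregation are stand-ins for the paper's learned MLPs.

```python
import torch

def scatter_mean(src, index, num):
    """Average rows of `src` into `num` buckets given by `index`."""
    out = torch.zeros(num, src.shape[1])
    cnt = torch.zeros(num, 1)
    out.index_add_(0, index, src)
    cnt.index_add_(0, index, torch.ones(len(index), 1))
    return out / cnt.clamp(min=1)

def two_level_step(x, edge_index, cluster, num_coarse):
    """One hierarchical step: local messages on the fine graph, plus a
    pooled coarse state broadcast back to every node.
    x: (N, F) node features; edge_index: (2, E); cluster: (N,) coarse ids."""
    src, dst = edge_index
    fine_msg = scatter_mean(x[src], dst, x.shape[0])   # preserves local detail
    coarse = scatter_mean(x, cluster, num_coarse)      # long-range (stiff) modes
    return x + fine_msg + coarse[cluster]              # fuse both scales
```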

Perceiving Systems Neural Capture and Synthesis Conference Paper MIME: Human-Aware 3D Scene Generation Yi, H., Huang, C. P., Tripathi, S., Hering, L., Thies, J., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 12965-12976, CVPR, June 2023 (Published)
Generating realistic 3D worlds occupied by moving humans has many applications in games, architecture, and synthetic data creation. But generating such scenes is expensive and labor intensive. Recent work generates human poses and motions given a 3D scene. Here, we take the opposite approach and generate 3D indoor scenes given 3D human motion. Such motions can come from archival motion capture or from IMU sensors worn on the body, effectively turning human movement into a “scanner” of the 3D world. Intuitively, human movement indicates the free-space in a room and human contact indicates surfaces or objects that support activities such as sitting, lying or touching. We propose MIME (Mining Interaction and Movement to infer 3D Environments), which is a generative model of indoor scenes that produces furniture layouts that are consistent with the human movement. MIME uses an auto-regressive transformer architecture that takes the already generated objects in the scene as well as the human motion as input, and outputs the next plausible object. To train MIME, we build a dataset by populating the 3D FRONT scene dataset with 3D humans. Our experiments show that MIME produces more diverse and plausible 3D scenes than a recent generative scene method that does not know about human movement. Code and data will be available for research at https://mime.is.tue.mpg.de.
project arXiv paper URL BibTeX

Perceiving Systems Conference Paper PointAvatar: Deformable Point-Based Head Avatars From Videos Zheng, Y., Yifan, W., Wetzstein, G., Black, M. J., Hilliges, O. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 21057-21067, CVPR, June 2023 (Published)
The ability to create realistic animatable and relightable head avatars from casual video sequences would open up wide-ranging applications in communication and entertainment. Current methods either build on explicit 3D morphable meshes (3DMM) or exploit neural implicit representations. The former are limited by fixed topology, while the latter are non-trivial to deform and inefficient to render. Furthermore, existing approaches entangle lighting and albedo, limiting the ability to re-render the avatar in new environments. In contrast, we propose PointAvatar, a deformable point-based representation that disentangles the source color into intrinsic albedo and normal-dependent shading. We demonstrate that PointAvatar bridges the gap between existing mesh-based and implicit representations, combining high-quality geometry and appearance with topological flexibility, ease of deformation and rendering efficiency. We show that our method is able to generate animatable 3D avatars using monocular videos from multiple sources including hand-held smartphones, laptop webcams and internet videos, achieving state-of-the-art quality in challenging cases where previous methods fail, e.g., thin hair strands, while being significantly more efficient in training than competing methods.
pdf project code video DOI URL BibTeX
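The albedo/shading disentanglement above can be pictured with a fixed Lambertian term standing in for the learned normal-dependent shading; swapping light_dir then relights the avatar. A minimal PyTorch sketch, with the Lambertian model being our simplification:

```python
import torch

def shade_points(albedo, normals, light_dir):
    """Recompose color as intrinsic albedo times normal-dependent shading.
    albedo: (N, 3) in [0, 1]; normals: (N, 3) unit vectors; light_dir: (3,)."""
    l = light_dir / light_dir.norm()
    shading = (normals @ l).clamp(min=0.0).unsqueeze(-1)  # (N, 1), no negative light
    return albedo * shading
```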

Perceiving Systems Conference Paper SLOPER4D: A Scene-Aware Dataset for Global 4D Human Pose Estimation in Urban Environments Dai, Y., Lin, Y., Lin, X., Wen, C., Xu, L., Yi, H., Shen, S., Ma, Y., Wang, C. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 682-692, CVF, CVPR, June 2023 (Published)
We present SLOPER4D, a novel scene-aware dataset collected in large urban environments to facilitate the research of global human pose estimation (GHPE) with human-scene interaction in the wild. Employing a head-mounted device integrated with a LiDAR and camera, we record 12 human subjects’ activities over 10 diverse urban scenes from an egocentric view. Frame-wise annotations for 2D key points, 3D pose parameters, and global translations are provided, together with reconstructed scene point clouds. To obtain accurate 3D ground truth in such large dynamic scenes, we propose a joint optimization method to fit local SMPL meshes to the scene and fine-tune the camera calibration during dynamic motions frame by frame, resulting in plausible and scene-natural 3D human poses. Eventually, SLOPER4D consists of 15 sequences of human motions, each of which has a trajectory length of more than 200 meters (up to 1,300 meters) and covers an area of more than 200 square meters (up to 30,000 square meters), including more than 100K LiDAR frames, 300K video frames, and 500K IMU-based motion frames. With SLOPER4D, we provide a detailed and thorough analysis of two critical tasks, including camera-based 3D HPE and LiDAR-based 3D HPE in urban environments, and benchmark a new task, GHPE. The in-depth analysis demonstrates SLOPER4D poses significant challenges to existing methods and produces great research opportunities. The dataset and code are released at https://github.com/climbingdaily/SLOPER4D.
project dataset codebase paper arXiv BibTeX

Perceiving Systems Conference Paper TRACE: 5D Temporal Regression of Avatars With Dynamic Cameras in 3D Environments Sun, Y., Bao, Q., Liu, W., Mei, T., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 8856-8866, CVPR, June 2023 (Published)
Although the estimation of 3D human pose and shape (HPS) is rapidly progressing, current methods still cannot reliably estimate moving humans in global coordinates, which is critical for many applications. This is particularly challenging when the camera is also moving, entangling human and camera motion. To address these issues, we adopt a novel 5D representation (space, time, and identity) that enables end-to-end reasoning about people in scenes. Our method, called TRACE, introduces several novel architectural components. Most importantly, it uses two new "maps" to reason about the 3D trajectory of people over time in camera and world coordinates. An additional memory unit enables persistent tracking of people even during long occlusions. TRACE is the first one-stage method to jointly recover and track 3D humans in global coordinates from dynamic cameras. By training it end-to-end, and using full image information, TRACE achieves state-of-the-art performance on tracking and HPS benchmarks. The code and dataset are released for research purposes.
pdf supp code video URL BibTeX

Autonomous Learning Conference Paper Backpropagation through Combinatorial Algorithms: Identity with Projection Works Sahoo, S., Paulus, A., Vlastelica, M., Musil, V., Kuleshov, V., Martius, G. In Proceedings of the Eleventh International Conference on Learning Representations, May 2023 (Accepted)
Embedding discrete solvers as differentiable layers has given modern deep learning architectures combinatorial expressivity and discrete reasoning capabilities. The derivative of these solvers is zero or undefined; therefore, a meaningful replacement is crucial for effective gradient-based learning. Prior works rely on smoothing the solver with input perturbations, relaxing the solver to continuous problems, or interpolating the loss landscape with techniques that typically require additional solver calls, introduce extra hyper-parameters, or compromise performance. We propose a principled approach to exploit the geometry of the discrete solution space to treat the solver as a negative identity on the backward pass and further provide a theoretical justification. Our experiments demonstrate that such a straightforward hyper-parameter-free approach is able to compete with previous more complex methods on numerous experiments such as backpropagation through discrete samplers, deep graph matching, and image retrieval. Furthermore, we substitute the previously proposed problem-specific and label-dependent margin with a generic regularization procedure that prevents cost collapse and increases robustness.
OpenReview Arxiv Pdf URL BibTeX
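The core trick, treating the solver as a negative identity on the backward pass, fits in a few lines of PyTorch. The argmin "solver" below is a toy stand-in for a real combinatorial solver, and the input projection discussed in the paper is left out of this sketch.

```python
import torch

class SolverWithIdentityBackward(torch.autograd.Function):
    """Embed a black-box combinatorial solver as a layer: the forward pass
    runs the solver; the backward pass replaces the (zero or undefined)
    Jacobian with a negative identity."""

    @staticmethod
    def forward(ctx, cost):
        # toy solver: arg-min entry as a one-hot "solution"
        sol = torch.zeros_like(cost)
        sol[cost.argmin()] = 1.0
        return sol

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output            # negative identity in place of the Jacobian

cost = torch.randn(5, requires_grad=True)
sol = SolverWithIdentityBackward.apply(cost)
sol.sum().backward()
print(cost.grad)                       # -1 everywhere under this toy solver
```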

Autonomous Learning Empirical Inference Conference Paper Benchmarking Offline Reinforcement Learning on Real-Robot Hardware Gürtler, N., Blaes, S., Kolev, P., Widmaier, F., Wüthrich, M., Bauer, S., Schölkopf, B., Martius, G. In Proceedings of the Eleventh International Conference on Learning Representations, The Eleventh International Conference on Learning Representations (ICLR), May 2023 (Published)
Learning policies from previously recorded data is a promising direction for real-world robotics tasks, as online learning is often infeasible. Dexterous manipulation in particular remains an open problem in its general form. The combination of offline reinforcement learning with large diverse datasets, however, has the potential to lead to a breakthrough in this challenging domain analogously to the rapid progress made in supervised learning in recent years. To coordinate the efforts of the research community toward tackling this problem, we propose a benchmark including: i) a large collection of data for offline learning from a dexterous manipulation platform on two tasks, obtained with capable RL agents trained in simulation; ii) the option to execute learned policies on a real-world robotic system and a simulation for efficient debugging. We evaluate prominent open-sourced offline reinforcement learning algorithms on the datasets and provide a reproducible experimental setup for offline reinforcement learning on real systems.
Website arXiv Code URL BibTeX

Autonomous Learning Empirical Inference Conference Paper DEP-RL: Embodied Exploration for Reinforcement Learning in Overactuated and Musculoskeletal Systems Schumacher, P., Haeufle, D. F., Büchler, D., Schmitt, S., Martius, G. In The Eleventh International Conference on Learning Representations (ICLR), May 2023 (Published)
Muscle-actuated organisms are capable of learning an unparalleled diversity of dexterous movements despite their vast amount of muscles. Reinforcement learning (RL) on large musculoskeletal models, however, has not been able to show similar performance. We conjecture that ineffective exploration in large overactuated action spaces is a key problem. This is supported by our finding that common exploration noise strategies are inadequate in synthetic examples of overactuated systems. We identify differential extrinsic plasticity (DEP), a method from the domain of self-organization, as being able to induce state-space covering exploration within seconds of interaction. By integrating DEP into RL, we achieve fast learning of reaching and locomotion in musculoskeletal systems, outperforming current approaches in all considered tasks in sample efficiency and robustness.
Arxiv pdf Website URL BibTeX

Empirical Inference Article ResMiCo: Increasing the quality of metagenome-assembled genomes with deep learning Mineeva*, O., Danciu*, D., Schölkopf, B., Ley, R. E., Rätsch, G., Youngblut, N. D. PLOS Computational Biology, 19(5), Public Library of Science, San Francisco, CA, May 2023, *equal contribution (Published) DOI BibTeX

Haptic Intelligence Miscellaneous 3D Reconstruction for Minimally Invasive Surgery: Lidar Versus Learning-Based Stereo Matching Caccianiga, G., Nubert, J., Hutter, M., Kuchenbecker, K. J. Workshop paper (2 pages) presented at the ICRA Workshop on Robot-Assisted Medical Imaging, London, UK, May 2023 (Published)
This work investigates real-time 3D surface reconstruction for minimally invasive surgery. Specifically, we analyze depth sensing through laser-based time-of-flight sensing (lidar) and stereo endoscopy on ex-vivo porcine tissue samples. When compared to modern learning-based stereo matching from endoscopic images, lidar achieves lower processing delay, higher frame rate, and superior robustness against sensor distance and poor illumination. Furthermore, we report on the negative effect of near-infrared light penetration on the accuracy of time-of-flight measurements across different tissue types.
BibTeX

Empirical Inference Article A Kernel Stein Test for Comparing Latent Variable Models Kanagawa, H., Jitkrittum, W., Mackey, L., Fukumizu, K., Gretton, A. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(3):986-1011, May 2023 (Published) arXiv DOI BibTeX

Social Foundations of Computation Conference Paper A Theory of Dynamic Benchmarks Shirali, A., Abebe, R., Hardt, M. In The Eleventh International Conference on Learning Representations (ICLR 2023), May 2023 (Published)
Dynamic benchmarks interweave model fitting and data collection in an attempt to mitigate the limitations of static benchmarks. In contrast to an extensive theoretical and empirical study of the static setting, the dynamic counterpart lags behind due to limited empirical studies and no apparent theoretical foundation to date. Responding to this deficit, we initiate a theoretical study of dynamic benchmarking. We examine two realizations, one capturing current practice and the other modeling more complex settings. In the first model, where data collection and model fitting alternate sequentially, we prove that model performance improves initially but can stall after only three rounds. Label noise arising from, for instance, annotator disagreement leads to even stronger negative results. Our second model generalizes the first to the case where data collection and model fitting have a hierarchical dependency structure. We show that this design guarantees strictly more progress than the first, albeit at a significant increase in complexity. We support our theoretical analysis by simulating dynamic benchmarks on two popular datasets. These results illuminate the benefits and practical limitations of dynamic benchmarking, providing both a theoretical foundation and a causal explanation for observed bottlenecks in empirical work.
arXiv URL BibTeX
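The sequential realization analyzed above can be simulated in a few lines: alternate fitting a model and collecting examples it currently gets wrong. The synthetic data and logistic-regression learner below are toy choices of ours, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))                      # toy population
y = (X @ rng.normal(size=10) + 0.5 * rng.normal(size=5000) > 0).astype(int)

idx = rng.choice(len(X), 200, replace=False)         # round-0 static benchmark
for r in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    wrong = np.flatnonzero(model.predict(X) != y)    # examples the model misses
    if len(wrong) == 0:
        break                                        # model solved the pool
    new = rng.choice(wrong, min(200, len(wrong)), replace=False)
    idx = np.union1d(idx, new)                       # grow the benchmark
    print(f"round {r}: accuracy on population = {model.score(X, y):.3f}")
```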

Empirical Inference Conference Paper A law of adversarial risk, interpolation, and label noise Paleka, D., Sanyal, A. The Eleventh International Conference on Learning Representations (ICLR), May 2023 (Published) URL BibTeX

Haptic Intelligence Miscellaneous AiroTouch: Naturalistic Vibrotactile Feedback for Telerobotic Construction-Related Tasks Gong, Y., Tashiro, N., Javot, B., Lauer, A. P. R., Sawodny, O., Kuchenbecker, K. J. Extended abstract (1 page) presented at the ICRA Workshop on Communicating Robot Learning across Human-Robot Interaction, London, UK, May 2023 (Published) BibTeX

Empirical Inference Article Better Together: Data Harmonization and Cross-Study Analysis of Abdominal MRI Data From UK Biobank and the German National Cohort Gatidis, S., Kart, T., Fischer, M., Winzeck, S., Glocker, B., Bai, W., Bülow, R., Emmel, C., Friedrich, L., Kauczor, H., Keil, T., Kröncke, T., Mayer, P., Niendorf, T., Peters, A., Pischon, T., Schaarschmidt, B., Schmidt, B., Schulze, M., Umutlu, L., et al. Investigative Radiology, 58(5):346-354, May 2023 (Published) DOI BibTeX

Autonomous Learning Empirical Inference Conference Paper Bridging the Gap to Real-World Object-Centric Learning Seitzer, M., Horn, M., Zadaianchuk, A., Zietlow, D., Xiao, T., Simon-Gabriel, C., He, T., Zhang, Z., Schölkopf, B., Brox, T., Locatello, F. In Proceedings of the Eleventh International Conference on Learning Representations, The Eleventh International Conference on Learning Representations (ICLR), May 2023 (Published)
Humans naturally decompose their environment into entities at the appropriate level of abstraction to act in the world. Allowing machine learning algorithms to derive this decomposition in an unsupervised way has become an important line of research. However, current methods are restricted to simulated data or require additional information in the form of motion or depth in order to successfully discover objects. In this work, we overcome this limitation by showing that reconstructing features from models trained in a self-supervised manner is a sufficient training signal for object-centric representations to arise in a fully unsupervised way. Our approach, DINOSAUR, significantly outperforms existing object-centric learning models on simulated data and is the first unsupervised object-centric model that scales to real-world datasets such as COCO and PASCAL VOC. DINOSAUR is conceptually simple and shows competitive performance compared to more involved pipelines from the computer vision literature.
Code Website URL BibTeX
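A loose sketch of the training signal described above: decode slot representations and regress them onto frozen self-supervised patch features (e.g., from DINO) instead of pixels. Shapes, the MLP decoder, and the naive slot mixing are illustrative assumptions; the real model uses per-slot alpha masks.

```python
import torch
import torch.nn as nn

class FeatureReconstruction(nn.Module):
    """Object-centric training by reconstructing frozen patch features."""

    def __init__(self, num_slots=6, slot_dim=64, feat_dim=384, num_patches=196):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(slot_dim, 256), nn.ReLU(),
            nn.Linear(256, num_patches * feat_dim),
        )
        self.num_patches, self.feat_dim = num_patches, feat_dim

    def forward(self, slots, target_feats):
        # slots: (B, S, slot_dim); target_feats: (B, num_patches, feat_dim), frozen
        B, S, _ = slots.shape
        recon = self.decoder(slots).view(B, S, self.num_patches, self.feat_dim)
        recon = recon.mean(dim=1)                 # naive mixing across slots
        return ((recon - target_feats.detach()) ** 2).mean()
```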

Empirical Inference Conference Paper Disentanglement of Correlated Factors via Hausdorff Factorized Support Roth, K., Ibrahim, M., Akata, Z., Vincent, P., Bouchacourt, D. The Eleventh International Conference on Learning Representations (ICLR), May 2023 (Published) URL BibTeX