Publications

Perceiving Systems Conference Paper Active Domain Adaptation via Clustering Uncertainty-weighted Embeddings Prabhu, V., Chandrasekaran, A., Saenko, K., Hoffman, J. In Proc. International Conference on Computer Vision (ICCV), 8485-8494, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Generalizing deep neural networks to new target domains is critical to their real-world utility. In practice, it may be feasible to get some target data labeled, but to be cost-effective it is desirable to select a subset that is maximally-informative via active learning (AL). In this work, we study the problem of AL under a domain shift. We empirically demonstrate how existing AL approaches based solely on model uncertainty or representative sampling are suboptimal for active domain adaptation. Our algorithm, Active Domain Adaptation via CLustering Uncertainty-weighted Embeddings (ADA-CLUE), i) identifies diverse datapoints for labeling that are both uncertain under the model and representative of unlabeled target data, and ii) leverages the available source and target data for adaptation by optimizing a semi-supervised adversarial entropy loss that is complementary to our active sampling objective. On standard image classification benchmarks for domain adaptation, ADA-CLUE consistently performs as well as or better than competing active adaptation, active learning, and domain adaptation methods across shift severities, model initializations, and labeling budgets.
paper project webpage Code Video DOI BibTeX
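The entry above describes selecting points that are both uncertain and representative by clustering uncertainty-weighted embeddings. As a rough illustration of that idea only (not the authors' implementation), a minimal NumPy sketch might run k-means on target embeddings with predictive entropy as per-point weights and label the datapoint nearest each weighted centroid; all function names here are hypothetical.

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    # Entropy of softmax outputs: high values mean the model is uncertain.
    return -np.sum(probs * np.log(probs + eps), axis=1)

def clue_select(embeddings, probs, budget, iters=10, seed=0):
    """Pick diverse-and-uncertain points: run k-means on target embeddings
    with per-point entropy weights, then return the index of the datapoint
    nearest each weighted centroid."""
    rng = np.random.default_rng(seed)
    w = predictive_entropy(probs) + 1e-12          # keep weights strictly positive
    centers = embeddings[rng.choice(len(embeddings), budget, replace=False)].copy()
    for _ in range(iters):
        # squared distances (n_points, budget) and hard cluster assignments
        d = ((embeddings[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        for k in range(budget):
            mask = assign == k
            if mask.any():                         # entropy-weighted centroid update
                centers[k] = np.average(embeddings[mask], axis=0, weights=w[mask])
    d = ((embeddings[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.unique(d.argmin(axis=0))             # indices to send for labeling
```

Because two centroids can share a nearest datapoint, the returned set may be smaller than the budget; a full implementation would redraw in that case.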

Perceiving Systems Conference Paper Learning Realistic Human Reposing using Cyclic Self-Supervision with 3D Shape, Pose, and Appearance Consistency Sanyal, S., Vorobiov, A., Bolkart, T., Loper, M., Mohler, B., Davis, L., Romero, J., Black, M. J. In Proc. International Conference on Computer Vision (ICCV), 11118-11127, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Synthesizing images of a person in novel poses from a single image is a highly ambiguous task. Most existing approaches require paired training images; i.e. images of the same person with the same clothing in different poses. However, obtaining sufficiently large datasets with paired data is challenging and costly. Previous methods that forego paired supervision lack realism. We propose a self-supervised framework named SPICE (Self-supervised Person Image CrEation) that closes the image quality gap with supervised methods. The key insight enabling self-supervision is to exploit 3D information about the human body in several ways. First, the 3D body shape must remain unchanged when reposing. Second, representing body pose in 3D enables reasoning about self-occlusions. Third, 3D body parts that are visible before and after reposing should have similar appearance features. Once trained, SPICE takes an image of a person and generates a new image of that person in a new target pose. SPICE achieves state-of-the-art performance on the DeepFashion dataset, improving the FID score from 29.9 to 7.8 compared with previous unsupervised methods, and with performance similar to the state-of-the-art supervised method (6.4). SPICE also generates temporally coherent videos given an input image and a sequence of poses, despite being trained on static images only.
pdf arxiv DOI BibTeX

Perceiving Systems Conference Paper Learning To Regress Bodies From Images Using Differentiable Semantic Rendering Dwivedi, S. K., Athanasiou, N., Kocabas, M., Black, M. J. In Proc. International Conference on Computer Vision (ICCV), 11230-11239, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Learning to regress 3D human body shape and pose (e.g. SMPL parameters) from monocular images typically exploits losses on 2D keypoints, silhouettes, and/or part segmentation when 3D training data is not available. Such losses, however, are limited because 2D keypoints do not supervise body shape and segmentations of people in clothing do not match projected minimally-clothed SMPL shapes. To exploit richer image information about clothed people, we introduce higher-level semantic information about clothing to penalize clothed and non-clothed regions of the human body differently. To do so, we train a body regressor using a novel “Differentiable Semantic Rendering (DSR)” loss. For Minimally-Clothed (MC) regions, we define the DSRMC loss, which encourages a tight match between a rendered SMPL body and the minimally-clothed regions of the image. For clothed regions, we define the DSR-C loss to encourage the rendered SMPL body to be inside the clothing mask. To ensure end-to-end differentiable training, we learn a semantic clothing prior for SMPL vertices from thousands of clothed human scans. We perform extensive qualitative and quantitative experiments to evaluate the role of clothing semantics on the accuracy of 3D human pose and shape estimation. We outperform all previous state-of-the-art methods on 3DPW and Human3.6M and obtain on par results on MPI-INF-3DHP. Code and trained models are available for research at https://dsr.is.tue.mpg.de/
pdf supp code project-website video poster DOI BibTeX

Perceiving Systems Conference Paper Monocular, One-Stage, Regression of Multiple 3D People Sun, Y., Bao, Q., Liu, W., Fu, Y., Black, M. J., Mei, T. In Proc. International Conference on Computer Vision (ICCV), 11159-11168, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
This paper focuses on the regression of multiple 3D people from a single RGB image. Existing approaches predominantly follow a multi-stage pipeline that first detects people in bounding boxes and then independently regresses their 3D body meshes. In contrast, we propose to Regress all meshes in a One-stage fashion for Multiple 3D People (termed ROMP). The approach is conceptually simple, bounding box-free, and able to learn a per-pixel representation in an end-to-end manner. Our method simultaneously predicts a Body Center heatmap and a Mesh Parameter map, which can jointly describe the 3D body mesh on the pixel level. Through a body-center-guided sampling process, the body mesh parameters of all people in the image are easily extracted from the Mesh Parameter map. Equipped with such a fine-grained representation, our one-stage framework is free of the complex multi-stage process and more robust to occlusion. Compared with state-of-the-art methods, ROMP achieves superior performance on the challenging multi-person benchmarks, including 3DPW and CMU Panoptic. Experiments on crowded/occluded datasets demonstrate the robustness under various types of occlusion. The released code is the first real-time implementation of monocular multi-person 3D mesh regression.
pdf supp arXiv code DOI BibTeX
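The body-center-guided sampling step described above can be sketched in a few lines: find local maxima of the Body Center heatmap and read out the parameter vector at each peak of the Mesh Parameter map. This is an illustrative simplification (the released ROMP operates on network tensors with refinements omitted here); the names are hypothetical.

```python
import numpy as np

def sample_people(center_heatmap, param_map, thresh=0.3):
    """Read one parameter vector per detected person: a person is a local
    maximum of the (H, W) center heatmap above `thresh`; its mesh parameters
    are the (C,) column of the (C, H, W) parameter map at that pixel."""
    H, W = center_heatmap.shape
    people = []
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            v = center_heatmap[y, x]
            # keep pixels that dominate their 3x3 neighborhood
            if v > thresh and v == center_heatmap[y-1:y+2, x-1:x+2].max():
                people.append(param_map[:, y, x])
    return np.stack(people) if people else np.zeros((0, param_map.shape[0]))
```

Because every person is just a peak readout, the pipeline needs no bounding boxes and handles an arbitrary number of people in one pass.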

Perceiving Systems Conference Paper Stochastic Scene-Aware Motion Prediction Hassan, M., Ceylan, D., Villegas, R., Saito, J., Yang, J., Zhou, Y., Black, M. In Proc. International Conference on Computer Vision (ICCV), 11354-11364, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
A long-standing goal in computer vision is to capture, model, and realistically synthesize human behavior. Specifically, by learning from data, our goal is to enable virtual humans to navigate within cluttered indoor scenes and naturally interact with objects. Such embodied behavior has applications in virtual reality, computer games, and robotics, while synthesized behavior can be used as training data. The problem is challenging because real human motion is diverse and adapts to the scene. For example, a person can sit or lie on a sofa in many places and with varying styles. We must model this diversity to synthesize virtual humans that realistically perform human-scene interactions. We present a novel data-driven, stochastic motion synthesis method that models different styles of performing a given action with a target object. Our Scene-Aware Motion Prediction method (SAMP) generalizes to target objects of various geometries while enabling the character to navigate in cluttered scenes. To train SAMP, we collected mocap data covering various sitting, lying down, walking, and running styles. We demonstrate SAMP on complex indoor scenes and achieve superior performance compared with existing solutions.
Project Page pdf DOI BibTeX

Perceiving Systems Conference Paper The Power of Points for Modeling Humans in Clothing Ma, Q., Yang, J., Tang, S., Black, M. J. In Proc. International Conference on Computer Vision (ICCV), 10954-10964, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Currently it requires an artist to create 3D human avatars with realistic clothing that can move naturally. Despite progress on 3D scanning and modeling of human bodies, there is still no technology that can easily turn a static scan into an animatable avatar. Automating the creation of such avatars would enable many applications in games, social networking, animation, and AR/VR to name a few. The key problem is one of representation. Standard 3D meshes are widely used in modeling the minimally-clothed body but do not readily capture the complex topology of clothing. Recent interest has shifted to implicit surface models for this task but they are computationally heavy and lack compatibility with existing 3D tools. What is needed is a 3D representation that can capture varied topology at high resolution and that can be learned from data. We argue that this representation has been with us all along — the point cloud. Point clouds have properties of both implicit and explicit representations that we exploit to model 3D garment geometry on a human body. We train a neural network with a novel local clothing geometric feature to represent the shape of different outfits. The network is trained from 3D point clouds of many types of clothing, on many bodies, in many poses, and learns to model pose-dependent clothing deformations. The geometry feature can be optimized to fit a previously unseen scan of a person in clothing, enabling the scan to be reposed realistically. Our model demonstrates superior quantitative and qualitative results in both multi-outfit modeling and unseen outfit animation. The code is available for research purposes.
Project Page Code Video Dataset arXiv PDF Supp. Poster DOI BibTeX

Perceiving Systems Conference Paper Topologically Consistent Multi-View Face Inference Using Volumetric Sampling Li, T., Liu, S., Bolkart, T., Liu, J., Li, H., Zhao, Y. In Proc. International Conference on Computer Vision (ICCV), 3804-3814, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
High-fidelity face digitization solutions often combine multi-view stereo (MVS) techniques for 3D reconstruction and a non-rigid registration step to establish dense correspondence across identities and expressions. A common problem is the need for manual clean-up after the MVS step, as 3D scans are typically affected by noise and outliers and contain hairy surface regions that need to be cleaned up by artists. Furthermore, mesh registration tends to fail for extreme facial expressions. Most learning-based methods use an underlying 3D morphable model (3DMM) to ensure robustness, but this limits the output accuracy for extreme facial expressions. In addition, the global bottleneck of regression architectures cannot produce meshes that tightly fit the ground truth surfaces. We propose ToFu, Topologically consistent Face from multi-view, a geometry inference framework that can produce topologically consistent meshes across facial identities and expressions using a volumetric representation instead of an explicit underlying 3DMM. Our novel progressive mesh generation network embeds the topological structure of the face in a feature volume, sampled from geometry-aware local features. A coarse-to-fine architecture facilitates dense and accurate facial mesh predictions in a consistent mesh topology. ToFu further captures displacement maps for pore-level geometric details and facilitates high-quality rendering in the form of albedo and specular reflectance maps. These high-quality assets are readily usable by production studios for avatar creation, animation and physically-based skin rendering. We demonstrate state-of-the-art geometric and correspondence accuracy, while only taking 0.385 seconds to compute a mesh with 10K vertices, which is three orders of magnitude faster than traditional techniques. The code and the model are available for research purposes at https://tianyeli.github.io/tofu.
project paper DOI BibTeX

Perceiving Systems Conference Paper PARE: Part Attention Regressor for 3D Human Body Estimation Kocabas, M., Huang, C. P., Hilliges, O., Black, M. J. In Proc. International Conference on Computer Vision (ICCV), 11107-11117, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Despite significant progress, state-of-the-art 3D human pose and shape estimation methods remain sensitive to partial occlusion and can produce dramatically wrong predictions even though much of the body is observable. To address this, we introduce a soft attention mechanism, called the Part Attention REgressor (PARE), that learns to predict body-part-guided attention masks. We observe that state-of-the-art methods rely on global feature representations, making them sensitive to even small occlusions. In contrast, PARE's part-guided attention mechanism overcomes these issues by exploiting information about the visibility of individual body parts while leveraging information from neighboring body-parts to predict occluded parts. We show qualitatively that PARE learns sensible attention masks, and quantitative evaluation confirms that PARE achieves more accurate and robust reconstruction results than existing approaches on both occlusion-specific and standard benchmarks.
pdf supp code video arXiv project website poster DOI BibTeX

Perceiving Systems Autonomous Vision Conference Paper SNARF: Differentiable Forward Skinning for Animating Non-Rigid Neural Implicit Shapes Chen, X., Zheng, Y., Black, M. J., Hilliges, O., Geiger, A. In Proc. International Conference on Computer Vision (ICCV), 11574-11584, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Neural implicit surface representations have emerged as a promising paradigm to capture 3D shapes in a continuous and resolution-independent manner. However, adapting them to articulated shapes is non-trivial. Existing approaches learn a backward warp field that maps deformed to canonical points. However, this is problematic since the backward warp field is pose dependent and thus requires large amounts of data to learn. To address this, we introduce SNARF, which combines the advantages of linear blend skinning (LBS) for polygonal meshes with those of neural implicit surfaces by learning a forward deformation field without direct supervision. This deformation field is defined in canonical, pose-independent, space, enabling generalization to unseen poses. Learning the deformation field from posed meshes alone is challenging since the correspondences of deformed points are defined implicitly and may not be unique under changes of topology. We propose a forward skinning model that finds all canonical correspondences of any deformed point using iterative root finding. We derive analytical gradients via implicit differentiation, enabling end-to-end training from 3D meshes with bone transformations. Compared to state-of-the-art neural implicit representations, our approach generalizes better to unseen poses while preserving accuracy. We demonstrate our method in challenging scenarios on (clothed) 3D humans in diverse and unseen poses.
pdf pdf 2 supplementary material project blog blog 2 video video 2 code DOI URL BibTeX
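The iterative root finding at the heart of the abstract above can be illustrated numerically: given a forward linear-blend-skinning warp, recover the canonical point that maps to an observed deformed point by Newton iteration. This is a toy sketch under simplifying assumptions (a finite-difference Jacobian instead of SNARF's analytic implicit gradients, and a single initialization rather than a search over all candidate roots):

```python
import numpy as np

def forward_warp(x_c, weights_fn, transforms):
    """Forward LBS: blend 4x4 bone transforms with skinning weights
    evaluated at the canonical point x_c."""
    w = weights_fn(x_c)                                   # (K,) weights
    T = sum(wk * Tk for wk, Tk in zip(w, transforms))     # blended (4, 4)
    return (T @ np.append(x_c, 1.0))[:3]

def find_canonical(x_d, weights_fn, transforms, x_init, iters=20):
    """Solve forward_warp(x_c) = x_d for x_c with Newton's method."""
    x = x_init.copy()
    eps = 1e-5
    for _ in range(iters):
        r = forward_warp(x, weights_fn, transforms) - x_d  # residual
        J = np.zeros((3, 3))
        for j in range(3):                                 # finite-difference Jacobian
            dx = np.zeros(3); dx[j] = eps
            J[:, j] = (forward_warp(x + dx, weights_fn, transforms) - r - x_d) / eps
        x = x - np.linalg.solve(J, r)                      # Newton update
    return x
```

Defining the deformation in canonical space, then inverting it numerically per query point, is what lets the learned field generalize to unseen poses.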

Perceiving Systems Conference Paper SOMA: Solving Optical Marker-Based MoCap Automatically Ghorbani, N., Black, M. J. In Proc. International Conference on Computer Vision (ICCV), 11097-11106, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Marker-based optical motion capture (mocap) is the “gold standard” method for acquiring accurate 3D human motion in computer vision, medicine, and graphics. The raw output of these systems is noisy and incomplete 3D points or short tracklets of points. To be useful, one must associate these points with corresponding markers on the captured subject; i.e. “labelling”. Given these labels, one can then “solve” for the 3D skeleton or body surface mesh. Commercial auto-labeling tools require a specific calibration procedure at capture time, which is not possible for archival data. Here we train a novel neural network called SOMA, which takes raw mocap point clouds with varying numbers of points, labels them at scale without any calibration data, independent of the capture technology, and requires only minimal human intervention. Our key insight is that, while labeling point clouds is highly ambiguous, the 3D body provides strong constraints on the solution that can be exploited by a learning-based method. To enable learning, we generate massive training sets of simulated noisy and ground truth mocap markers animated by 3D bodies from AMASS. SOMA exploits an architecture with stacked self-attention elements to learn the spatial structure of the 3D body and an optimal transport layer to constrain the assignment (labeling) problem while rejecting outliers. We extensively evaluate SOMA both quantitatively and qualitatively. SOMA is more accurate and robust than existing state of the art research methods and can be applied where commercial systems cannot. We automatically label over 8 hours of archival mocap data across 4 different datasets captured using various technologies and output SMPL-X body models. The model and data are released for research purposes at https://soma.is.tue.mpg.de/.
code pdf suppl arxiv project website video poster DOI BibTeX
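The optimal-transport constraint mentioned above is commonly implemented with Sinkhorn normalization: alternately normalize the rows and columns of a point-to-label score matrix in log space so that each mocap point commits (softly) to one label and vice versa. A minimal square-matrix sketch follows; the actual layer in SOMA also carries an extra bin for rejected outlier points, which is omitted here:

```python
import numpy as np

def _logsumexp(a, axis):
    # Numerically stable log-sum-exp along an axis.
    m = a.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))

def sinkhorn_assign(scores, iters=200):
    """Turn an (n, n) point-to-label score matrix into a near doubly
    stochastic soft assignment by alternating row/column normalization
    in log space (Sinkhorn iterations)."""
    log_P = scores.astype(float)
    for _ in range(iters):
        log_P = log_P - _logsumexp(log_P, axis=1)   # rows sum to 1
        log_P = log_P - _logsumexp(log_P, axis=0)   # columns sum to 1
    return np.exp(log_P)
```

After convergence, taking the argmax per row yields a labeling in which no two points claim the same marker, which is exactly the constraint plain per-point classification cannot enforce.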

Perceiving Systems Conference Paper SPEC: Seeing People in the Wild with an Estimated Camera Kocabas, M., Huang, C. P., Tesch, J., Müller, L., Hilliges, O., Black, M. J. In Proc. International Conference on Computer Vision (ICCV), 11015-11025, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Due to the lack of camera parameter information for in-the-wild images, existing 3D human pose and shape (HPS) estimation methods make several simplifying assumptions: weak-perspective projection, large constant focal length, and zero camera rotation. These assumptions often do not hold and we show, quantitatively and qualitatively, that they cause errors in the reconstructed 3D shape and pose. To address this, we introduce SPEC, the first in-the-wild 3D HPS method that estimates the perspective camera from a single image and employs this to reconstruct 3D human bodies more accurately. First, we train a neural network to estimate the field of view, camera pitch, and roll given an input image. We employ novel losses that improve the calibration accuracy over previous work. We then train a novel network that concatenates the camera calibration to the image features and uses these together to regress 3D body shape and pose. SPEC is more accurate than the prior art on the standard benchmark (3DPW) as well as two new datasets with more challenging camera views and varying focal lengths. Specifically, we create a new photorealistic synthetic dataset (SPEC-SYN) with ground truth 3D bodies and a novel in-the-wild dataset (SPEC-MTP) with calibration and high-quality reference bodies.
pdf supp arXiv code video project website poster DOI BibTeX
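The abstract above has a network predict the field of view, camera pitch, and roll; turning those three numbers into usable camera parameters is standard projective geometry. A sketch, assuming a vertical field of view and a pitch-then-roll rotation composition (the exact conventions used in SPEC may differ):

```python
import numpy as np

def camera_from_spec(vfov_deg, pitch_deg, roll_deg, H, W):
    """Build pinhole intrinsics K and a camera rotation R from a predicted
    vertical field of view, pitch, and roll, for an H x W image."""
    # Vertical FoV -> focal length in pixels: f = (H/2) / tan(vfov/2).
    f = (H / 2.0) / np.tan(np.radians(vfov_deg) / 2.0)
    K = np.array([[f, 0.0, W / 2.0],
                  [0.0, f, H / 2.0],
                  [0.0, 0.0, 1.0]])
    p, r = np.radians(pitch_deg), np.radians(roll_deg)
    Rx = np.array([[1, 0, 0],                       # pitch about the x-axis
                   [0, np.cos(p), -np.sin(p)],
                   [0, np.sin(p),  np.cos(p)]])
    Rz = np.array([[np.cos(r), -np.sin(r), 0],      # roll about the optical axis
                   [np.sin(r),  np.cos(r), 0],
                   [0, 0, 1]])
    return K, Rz @ Rx
```

With K and R in hand, a regressed body can be projected under the estimated perspective camera instead of the weak-perspective, zero-rotation assumption the abstract criticizes.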

Perceiving Systems Conference Paper Active Visual SLAM with Independently Rotating Camera Bonetto, E., Goldschmid, P., Black, M. J., Ahmad, A. In 2021 European Conference on Mobile Robots (ECMR 2021), 266-273, IEEE, Piscataway, NJ, European Conference on Mobile Robots (ECMR 2021), September 2021 (Published)
In active Visual-SLAM (V-SLAM), a robot relies on the information retrieved by its cameras to control its own movements for autonomous mapping of the environment. Cameras are usually statically linked to the robot's body, limiting the extra degrees of freedom for visual information acquisition. In this work, we overcome the aforementioned problem by introducing and leveraging an independently rotating camera on the robot base. This enables us to continuously control the heading of the camera, obtaining the desired optimal orientation for active V-SLAM, without rotating the robot itself. However, this additional degree of freedom introduces additional estimation uncertainties, which need to be accounted for. We do this by extending our robot's state estimate to include the camera state and jointly estimate the uncertainties. We develop our method based on a state-of-the-art active V-SLAM approach for omnidirectional robots and evaluate it through rigorous simulation and real robot experiments. We obtain more accurate maps, with lower energy consumption, while maintaining the benefits of the active approach with respect to the baseline. We also demonstrate how our method easily generalizes to other non-omnidirectional robotic platforms, which was a limitation of the previous approach. Code and implementation details are provided as open-source.
Code Data DOI URL BibTeX

Perceiving Systems Article Separated and overlapping neural coding of face and body identity Foster, C., Zhao, M., Bolkart, T., Black, M. J., Bartels, A., Bülthoff, I. Human Brain Mapping, 42(13):4242-4260, September 2021 (Published)
Recognising a person's identity often relies on face and body information, and is tolerant to changes in low-level visual input (e.g., viewpoint changes). Previous studies have suggested that face identity is disentangled from low-level visual input in the anterior face-responsive regions. It remains unclear which regions disentangle body identity from variations in viewpoint, and whether face and body identity are encoded separately or combined into a coherent person identity representation. We trained participants to recognise three identities, and then recorded their brain activity using fMRI while they viewed face and body images of these three identities from different viewpoints. Participants' task was to respond to either the stimulus identity or viewpoint. We found consistent decoding of body identity across viewpoint in the fusiform body area, right anterior temporal cortex, middle frontal gyrus and right insula. This finding demonstrates a similar function of fusiform and anterior temporal cortex for bodies as has previously been shown for faces, suggesting these regions may play a general role in extracting high-level identity information. Moreover, we could decode identity across fMRI activity evoked by faces and bodies in the early visual cortex, right inferior occipital cortex, right parahippocampal cortex and right superior parietal cortex, revealing a distributed network that encodes person identity abstractly. Lastly, identity decoding was consistently better when participants attended to identity, indicating that attention to identity enhances its neural representation. These results offer new insights into how the brain develops an abstract neural coding of person identity, shared by faces and bodies.
on-line DOI BibTeX

Perceiving Systems Patent Skinned multi-infant linear body model Hesse, N., Pujades, S., Romero, J., Black, M. (US Patent 11,127,163, 2021), September 2021 (Published)
A computer-implemented method for automatically obtaining pose and shape parameters of a human body. The method includes obtaining a sequence of digital 3D images of the body, recorded by at least one depth camera; automatically obtaining pose and shape parameters of the body, based on images of the sequence and a statistical body model; and outputting the pose and shape parameters. The body may be an infant body.
BibTeX

Perceiving Systems Article The role of sexual orientation in the relationships between body perception, body weight dissatisfaction, physical comparison, and eating psychopathology in the cisgender population Meneguzzo, P., Collantoni, E., Bonello, E., Vergine, M., Behrens, S. C., Tenconi, E., Favaro, A. Eating and Weight Disorders - Studies on Anorexia, Bulimia and Obesity, 26(6):1985-2000, Springer, August 2021 (Published)
Purpose: Body weight dissatisfaction (BWD) and visual body perception are specific aspects that can influence one's own body image, and that can concur with the development or maintenance of specific psychopathological dimensions of different psychiatric disorders. Sexual orientation is a fundamental but understudied aspect in this field, and, for this reason, the purpose of this study is to improve knowledge about the relationships among BWD, visual body size-perception, and sexual orientation. Methods: A total of 1033 individuals participated in an online survey. Physical comparison, depression, and self-esteem were evaluated, as well as sexual orientation and the presence of an eating disorder. A Figure Rating Scale was used to assess different valences of body weight, and mediation analyses were performed to investigate specific relationships between psychological aspects. Results: Bisexual women and gay men reported significantly higher BWD than other groups (p < 0.001); instead, higher body misperception was present in gay men (p = 0.001). Physical appearance comparison mediated the effect of sexual orientation on both BWD and perceptual distortion. No difference emerged between women with and without a history of eating disorders as regards the value of body weight attributed to attractiveness, health, and presence on social media. Conclusion: This study contributes to understanding the relationship between sexual orientations and body image representation and evaluation. Physical appearance comparisons should be considered as critical psychological factors that can improve and affect well-being. The impact on subjects with high levels of eating concerns is also discussed. Level of evidence: Level III: case–control analytic study.
DOI BibTeX

Perceiving Systems Article Learning an Animatable Detailed 3D Face Model from In-the-Wild Images Feng, Y., Feng, H., Black, M. J., Bolkart, T. ACM Transactions on Graphics, 40(4):88:1-88:13, August 2021 (Published)
While current monocular 3D face reconstruction methods can recover fine geometric details, they suffer several limitations. Some methods produce faces that cannot be realistically animated because they do not model how wrinkles vary with expression. Other methods are trained on high-quality face scans and do not generalize well to in-the-wild images. We present the first approach that regresses 3D face shape and animatable details that are specific to an individual but change with expression. Our model, DECA (Detailed Expression Capture and Animation), is trained to robustly produce a UV displacement map from a low-dimensional latent representation that consists of person-specific detail parameters and generic expression parameters, while a regressor is trained to predict detail, shape, albedo, expression, pose and illumination parameters from a single image. To enable this, we introduce a novel detail-consistency loss that disentangles person-specific details from expression-dependent wrinkles. This disentanglement allows us to synthesize realistic person-specific wrinkles by controlling expression parameters while keeping person-specific details unchanged. DECA is learned from in-the-wild images with no paired 3D supervision and achieves state-of-the-art shape reconstruction accuracy on two benchmarks. Qualitative results on in-the-wild data demonstrate DECA's robustness and its ability to disentangle identity- and expression-dependent details enabling animation of reconstructed faces. The model and code are publicly available at https://deca.is.tue.mpg.de.
pdf Sup Mat code video talk DOI BibTeX
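The detail-consistency loss described above can be sketched abstractly: for two images of the same person, swapping the person-specific detail codes before rendering should not change the result, because wrinkle style belongs to the identity, not to the individual image. The `render` argument below is a hypothetical stand-in for DECA's differentiable renderer; this illustrates the idea, not the paper's actual loss formulation.

```python
import numpy as np

def detail_consistency_loss(render, shape, exp, detail_own, detail_swapped):
    """Penalize any photometric difference between rendering a face with its
    own detail code versus the detail code taken from another image of the
    same person; zero loss means details are disentangled from expression."""
    img_own = render(shape, exp, detail_own)
    img_swap = render(shape, exp, detail_swapped)
    return np.abs(img_own - img_swap).mean()
```

Driving this loss to zero is what allows DECA to animate a reconstructed face: expression parameters can change while the person-specific wrinkle details stay fixed.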

Perceiving Systems Conference Paper On Self-Contact and Human Pose Müller, L., Osman, A. A. A., Tang, S., Huang, C. P., Black, M. J. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 9985-9994, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), June 2021 (Published)
People touch their face 23 times an hour, they cross their arms and legs, put their hands on their hips, etc. While many images of people contain some form of self-contact, current 3D human pose and shape (HPS) regression methods typically fail to estimate this contact. To address this, we develop new datasets and methods that significantly improve human pose estimation with self-contact. First, we create a dataset of 3D Contact Poses (3DCP) containing SMPL-X bodies fit to 3D scans as well as poses from AMASS, which we refine to ensure good contact. Second, we leverage this to create the Mimic-The-Pose (MTP) dataset of images, collected via Amazon Mechanical Turk, containing people mimicking the 3DCP poses with self-contact. Third, we develop a novel HPS optimization method, SMPLify-XMC, that includes contact constraints and uses the known 3DCP body pose during fitting to create near ground-truth poses for MTP images. Fourth, for more image variety, we label a dataset of in-the-wild images with Discrete Self-Contact (DSC) information and use another new optimization method, SMPLify-DC, that exploits discrete contacts during pose optimization. Finally, we use our datasets during SPIN training to learn a new 3D human pose regressor, called TUCH (Towards Understanding Contact in Humans). We show that the new self-contact training data significantly improves 3D human pose estimates on withheld test data and existing datasets like 3DPW. Not only does our method improve results for self-contact poses, but it also improves accuracy for non-contact poses. The code and data are available for research purposes at https://tuch.is.tue.mpg.de.
project arXiv poster video code DOI BibTeX

Perceiving Systems Conference Paper Populating 3D Scenes by Learning Human-Scene Interaction Hassan, M., Ghosh, P., Tesch, J., Tzionas, D., Black, M. J. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 14703-14713, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), June 2021 (Published)
Humans live within a 3D space and constantly interact with it to perform tasks. Such interactions involve physical contact between surfaces that is semantically meaningful. Our goal is to learn how humans interact with scenes and leverage this to enable virtual characters to do the same. To that end, we introduce a novel Human-Scene Interaction (HSI) model that encodes proximal relationships, called POSA for “Pose with prOximitieS and contActs”. The representation of interaction is body-centric, which enables it to generalize to new scenes. Specifically, POSA augments the SMPL-X parametric human body model such that, for every mesh vertex, it encodes (a) the contact probability with the scene surface and (b) the corresponding semantic scene label. We learn POSA with a VAE conditioned on the SMPL-X vertices, and train on the PROX dataset, which contains SMPL-X meshes of people interacting with 3D scenes, and the corresponding scene semantics from the PROX-E dataset. We demonstrate the value of POSA with two applications. First, we automatically place 3D scans of people in scenes. We use a SMPL-X model fit to the scan as a proxy and then find its most likely placement in 3D. POSA provides an effective representation to search for “affordances” in the scene that match the likely contact relationships for that pose. We perform a perceptual study that shows significant improvement over the state of the art on this task. Second, we show that POSA’s learned representation of body-scene interaction supports monocular human pose estimation that is consistent with a 3D scene, improving on the state of the art. Our model and code are available for research purposes at https://posa.is.tue.mpg.de.

Perceiving Systems Article Red shape, blue shape: Political ideology influences the social perception of body shape Quiros-Ramirez, M. A., Streuber, S., Black, M. J. Nature Humanities and Social Sciences Communications, 8:148, June 2021 (Published)
Political elections have a profound impact on individuals and societies. Optimal voting is thought to be based on informed and deliberate decisions, yet it has been demonstrated that the outcomes of political elections are biased by the perception of candidates’ facial features and the stereotypical traits voters attribute to these. Interestingly, political identification changes the attribution of stereotypical traits from facial features. This study explores whether the perception of body shape elicits similar effects on political trait attribution and whether these associations can be visualized. In Experiment 1, ratings of 3D body shapes were used to model the relationship between perception of 3D body shape and the attribution of political traits such as ‘Republican’, ‘Democrat’, or ‘Leader’. This allowed us to analyze and visualize the mental representations of stereotypical 3D body shapes associated with each political trait. Experiment 2 was designed to test whether political identification of the raters affected the attribution of political traits to different types of body shapes. The results show that humans attribute political traits to the same body shapes differently depending on their own political preference. These findings show that our judgments of others are influenced by their body shape and our own political views. Such judgments have potential political and societal implications.

Perceiving Systems Conference Paper We are More than Our Joints: Predicting how 3D Bodies Move Zhang, Y., Black, M. J., Tang, S. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 3371-3381, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), June 2021 (Published)
A key step towards understanding human behavior is the prediction of 3D human motion. Successful solutions have many applications in human tracking, HCI, and graphics. Most previous work focuses on predicting a time series of future 3D joint locations given a sequence of 3D joints from the past. This Euclidean formulation generally works better than predicting pose in terms of joint rotations. Body joint locations, however, do not fully constrain 3D human pose, leaving degrees of freedom (like rotation about a limb) undefined. Note that 3D joints can be viewed as a sparse point cloud. Thus the problem of human motion prediction can be seen as a problem of point cloud prediction. With this observation, we instead predict a sparse set of locations on the body surface that correspond to motion capture markers. Given such markers, we fit a parametric body model to recover the 3D body of the person. These sparse surface markers also carry detailed information about human movement that is not present in the joints, increasing the naturalness of the predicted motions. Using the AMASS dataset, we train MOJO (More than Our JOints), which is a novel variational autoencoder with a latent DCT space that generates motions from latent frequencies. MOJO preserves the full temporal resolution of the input motion, and sampling from the latent frequencies explicitly introduces high-frequency components into the generated motion. We note that motion prediction methods accumulate errors over time, resulting in joints or markers that diverge from true human bodies. To address this, we fit the SMPL-X body model to the predictions at each time step, projecting the solution back onto the space of valid bodies, before propagating the new markers in time. Quantitative and qualitative experiments show that our approach produces state-of-the-art results and realistic 3D body animations. The code is available for research purposes at https://yz-cnsdqz.github.io/MOJO/MOJO.html.
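MOJO's latent DCT space rests on a simple idea: a motion expressed as temporal frequency coefficients keeps its full time resolution, and the low frequencies capture the smooth component of movement. A minimal numpy sketch of that frequency view, using plain truncation where the paper learns a VAE over the coefficients (function names are illustrative):

```python
import numpy as np

def dct_basis(T):
    """Orthonormal DCT-II basis as a (T, T) matrix; row k is frequency k."""
    k = np.arange(T)[:, None].astype(float)
    n = np.arange(T)[None, :].astype(float)
    B = np.cos(np.pi * (n + 0.5) * k / T)
    B[0] /= np.sqrt(T)            # normalize the DC row
    B[1:] *= np.sqrt(2.0 / T)     # normalize the remaining rows
    return B

def dct_truncate(motion, keep):
    """Project a motion (T frames x D marker coords) onto its first `keep`
    temporal DCT frequencies and reconstruct. Plain truncation stands in
    for the learned latent; MOJO trains a VAE over such coefficients."""
    B = dct_basis(len(motion))
    coeffs = B @ motion           # temporal frequency coefficients, (T, D)
    coeffs[keep:] = 0.0           # drop high frequencies
    return B.T @ coeffs           # orthonormal basis: inverse = transpose
```

With `keep` equal to the sequence length the reconstruction is exact, which is the "full temporal resolution" property; sampling nonzero high-frequency coefficients is what reintroduces fast motion detail.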

Perceiving Systems Conference Paper AGORA: Avatars in Geography Optimized for Regression Analysis Patel, P., Huang, C. P., Tesch, J., Hoffmann, D. T., Tripathi, S., Black, M. J. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 13463-13473, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), June 2021 (Published)
While the accuracy of 3D human pose estimation from images has steadily improved on benchmark datasets, the best methods still fail in many real-world scenarios. This suggests that there is a domain gap between current datasets and common scenes containing people. To obtain ground-truth 3D pose, current datasets limit the complexity of clothing, environmental conditions, number of subjects, and occlusion. Moreover, current datasets evaluate sparse 3D joint locations corresponding to the major joints of the body, ignoring the hand pose and the face shape. To evaluate the current state-of-the-art methods on more challenging images, and to drive the field to address new problems, we introduce AGORA, a synthetic dataset with high realism and highly accurate ground truth. Here we use 4240 commercially-available, high-quality, textured human scans in diverse poses and natural clothing; this includes 257 scans of children. We create reference 3D poses and body shapes by fitting the SMPL-X body model (with face and hands) to the 3D scans, taking into account clothing. We create around 14K training and 3K test images by rendering between 5 and 15 people per image using either image-based lighting or rendered 3D environments, taking care to make the images physically plausible and photoreal. In total, AGORA consists of 173K individual person crops. We evaluate existing state-of-the-art methods for 3D human pose estimation on this dataset and find that most methods perform poorly on images of children. Hence, we extend the SMPL-X model to better capture the shape of children. Additionally, we fine-tune methods on AGORA and show improved performance on both AGORA and 3DPW, confirming the realism of the dataset. We provide all the registered 3D reference training data, rendered images, and a web-based evaluation site at https://agora.is.tue.mpg.de/.

Perceiving Systems Conference Paper BABEL: Bodies, Action and Behavior with English Labels Punnakkal, A. R., Chandrasekaran, A., Athanasiou, N., Quiros-Ramirez, M. A., Black, M. J. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 722-731, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021) , June 2021 (Published)
Understanding the semantics of human movement -- the what, how and why of the movement -- is an important problem that requires datasets of human actions with semantic labels. Existing datasets take one of two approaches. Large-scale video datasets contain many action labels but do not contain ground-truth 3D human motion. Alternatively, motion-capture (mocap) datasets have precise body motions but are limited to a small number of actions. To address this, we present BABEL, a large dataset with language labels describing the actions being performed in mocap sequences. BABEL consists of action labels for about 43.5 hours of mocap sequences from AMASS. Action labels are at two levels of abstraction -- sequence labels which describe the overall action in the sequence, and frame labels which describe all actions in every frame of the sequence. Each frame label is precisely aligned with the duration of the corresponding action in the mocap sequence, and multiple actions can overlap. There are over 28k sequence labels, and 63k frame labels in BABEL, which belong to over 250 unique action categories. Labels from BABEL can be leveraged for tasks like action recognition, temporal action localization, motion synthesis, etc. To demonstrate the value of BABEL as a benchmark, we evaluate the performance of models on 3D action recognition. We demonstrate that BABEL poses interesting learning challenges that are applicable to real-world scenarios, and can serve as a useful benchmark of progress in 3D action recognition. The dataset, baseline method, and evaluation code are made available and supported for academic research purposes at https://babel.is.tue.mpg.de/.
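The two-level labeling scheme, sequence labels plus frame labels aligned to action durations with overlaps allowed, can be sketched as a simple span-to-frame expansion (toy code with integer frame indices, not the BABEL toolchain):

```python
def frame_labels(num_frames, spans):
    """Expand (action, start, end) annotations into per-frame label sets,
    mirroring BABEL's frame-level labels where overlapping actions are
    allowed. `end` is exclusive; names are illustrative."""
    labels = [set() for _ in range(num_frames)]
    for action, start, end in spans:
        for t in range(start, min(end, num_frames)):
            labels[t].add(action)
    return labels
```

A sequence label would then summarize the whole clip, while these per-frame sets carry the temporally localized annotations used for tasks like temporal action localization.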

Perceiving Systems Conference Paper LEAP: Learning Articulated Occupancy of People Mihajlovic, M., Zhang, Y., Black, M. J., Tang, S. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 10456-10466, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), June 2021 (Published)
Substantial progress has been made on modeling rigid 3D objects using deep implicit representations. Yet, extending these methods to learn neural models of human shape is still in its infancy. Human bodies are complex and the key challenge is to learn a representation that generalizes such that it can express body shape deformations for unseen subjects in unseen, highly-articulated, poses. To address this challenge, we introduce LEAP (LEarning Articulated occupancy of People), a novel neural occupancy representation of the human body. Given a set of bone transformations (i.e. joint locations and rotations) and a query point in space, LEAP first maps the query point to a canonical space via learned linear blend skinning (LBS) functions and then efficiently queries the occupancy value via an occupancy network that models accurate identity- and pose-dependent deformations in the canonical space. Experiments show that our canonicalized occupancy estimation with the learned LBS functions greatly improves the generalization capability of the learned occupancy representation across various human shapes and poses, outperforming existing solutions in all settings.
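The linear blend skinning at the core of LEAP blends per-bone rigid transforms with per-point skinning weights; LEAP learns these weights and applies the inverse of the blend to map a query point back into the canonical space. A forward-direction numpy sketch (illustrative, not the authors' implementation):

```python
import numpy as np

def lbs(points, weights, transforms):
    """Forward linear blend skinning: each canonical point is moved by the
    skinning-weighted blend of per-bone rigid 4x4 transforms. LEAP learns
    such weights and inverts this map for canonicalization; this sketch
    shows only the forward direction.
    points: (N, 3), weights: (N, B), transforms: (B, 4, 4)."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # (N, 4)
    blended = np.einsum('nb,bij->nij', weights, transforms)             # (N, 4, 4)
    out = np.einsum('nij,nj->ni', blended, homo)                        # apply blend
    return out[:, :3]
```

With a single bone carrying weight 1.0, the blend reduces to that bone's rigid transform, which is a useful sanity check when implementing either direction.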

Perceiving Systems Conference Paper SCALE: Modeling Clothed Humans with a Surface Codec of Articulated Local Elements Ma, Q., Saito, S., Yang, J., Tang, S., Black, M. J. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 16077-16088, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), June 2021 (Published)
Learning to model and reconstruct humans in clothing is challenging due to articulation, non-rigid deformation, and varying clothing types and topologies. To enable learning, the choice of representation is the key. Recent work uses neural networks to parameterize local surface elements. This approach captures locally coherent geometry and non-planar details, can deal with varying topology, and does not require registered training data. However, naively using such methods to model 3D clothed humans fails to capture fine-grained local deformations and generalizes poorly. To address this, we present three key innovations: First, we deform surface elements based on a human body model such that large-scale deformations caused by articulation are explicitly separated from topological changes and local clothing deformations. Second, we address the limitations of existing neural surface elements by regressing local geometry from local features, significantly improving the expressiveness. Third, we learn a pose embedding on a 2D parameterization space that encodes posed body geometry, improving generalization to unseen poses by reducing non-local spurious correlations. We demonstrate the efficacy of our surface representation by learning models of complex clothing from point clouds. The clothing can change topology and deviate from the topology of the body. Once learned, we can animate previously unseen motions, producing high-quality point clouds, from which we generate realistic images with neural rendering. We assess the importance of each technical contribution and show that our approach outperforms the state-of-the-art methods in terms of reconstruction accuracy and inference time. The code is available for research purposes at https://qianlim.github.io/SCALE.

Perceiving Systems Conference Paper SCANimate: Weakly Supervised Learning of Skinned Clothed Avatar Networks Saito, S., Yang, J., Ma, Q., Black, M. J. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2885-2896, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), June 2021 (Published)
We present SCANimate, an end-to-end trainable framework that takes raw 3D scans of a clothed human and turns them into an animatable avatar. These avatars are driven by pose parameters and have realistic clothing that moves and deforms naturally. SCANimate does not rely on a customized mesh template or surface mesh registration. We observe that fitting a parametric 3D body model, like SMPL, to a clothed human scan is tractable while surface registration of the body topology to the scan is often not, because clothing can deviate significantly from the body shape. We also observe that articulated transformations are invertible, resulting in geometric cycle-consistency in the posed and unposed shapes. These observations lead us to a weakly supervised learning method that aligns scans into a canonical pose by disentangling articulated deformations without template-based surface registration. Furthermore, to complete missing regions in the aligned scans while modeling pose-dependent deformations, we introduce a locally pose-aware implicit function that learns to complete and model geometry with learned pose correctives. In contrast to commonly used global pose embeddings, our local pose conditioning significantly reduces long-range spurious correlations and improves generalization to unseen poses, especially when training data is limited. Our method can be applied to pose-aware appearance modeling to generate a fully textured avatar. We demonstrate our approach on various clothing types with different amounts of training data, outperforming existing solutions and other variants in terms of fidelity and generality in every setting. The code is available at https://scanimate.is.tue.mpg.de.

Perceiving Systems Article Weight bias and linguistic body representation in anorexia nervosa: Findings from the BodyTalk project Behrens, S. C., Meneguzzo, P., Favaro, A., Teufel, M., Skoda, E., Lindner, M., Walder, L., Quiros-Ramirez, M. A., Zipfel, S., Mohler, B., Black, M., Giel, K. E. European Eating Disorders Review, 29(2):204-215, Wiley, March 2021 (Published)
Objective: This study provides a comprehensive assessment of own body representation and linguistic representation of bodies in general in women with typical and atypical anorexia nervosa (AN). Methods: In a series of desktop experiments, participants rated a set of adjectives according to their match with a series of computer generated bodies varying in body mass index, and generated prototypic body shapes for the same set of adjectives. We analysed how body mass index of the bodies was associated with positive or negative valence of the adjectives in the different groups. Further, body image and own body perception were assessed. Results: In a German‐Italian sample comprising 39 women with AN, 20 women with atypical AN and 40 age matched control participants, we observed effects indicative of weight stigmatization, but no significant differences between the groups. Generally, positive adjectives were associated with lean bodies, whereas negative adjectives were associated with obese bodies. Discussion: Our observations suggest that patients with both typical and atypical AN affectively and visually represent body descriptions not differently from healthy women. We conclude that overvaluation of low body weight and fear of weight gain cannot be explained by generally distorted perception or cognition, but require individual consideration.

Perceiving Systems Conference Paper Enhancing Deep Semantic Segmentation of RGB-D Data with Entangled Forests Terreran, M., Bonetto, E., Ghidoni, S. 2020 25th International Conference on Pattern Recognition (ICPR 2020), 4634-4641, IEEE, Piscataway, NJ, 25th International Conference on Pattern Recognition (ICPR 2020), January 2021 (Published)
Semantic segmentation is a problem which is receiving increasing attention in the computer vision community. Nowadays, deep learning methods represent the state of the art for this problem, and the trend is to use deeper networks to get higher performance. The drawback of such models is a higher computational cost, which makes it difficult to integrate them on mobile robot platforms. In this work we explore how to obtain lighter deep learning models without compromising performance. To do so, we consider the features used in the 3D Entangled Forests algorithm and study the best strategies to integrate these within the FuseNet deep network. These new features allow us to shrink the network size without losing performance, hence obtaining a lighter model which achieves state-of-the-art performance on the semantic segmentation task and represents an interesting alternative for mobile robotics applications, where computational power and energy are limited.

Perceiving Systems MPI Year Book Computer-Vision-Forschung deckt Stereotypen über Körperformen auf / Computer vision research unveils stereotypes about body shapes Quirós-Ramírez, M. A., Streuber, S., Black, M. J. January 2021 (Published)
Scientists at the Max Planck Institute for Intelligent Systems show how people perceive others and unconsciously attribute certain character traits to a person based solely on their body shape. They also found that the political affiliation of the observer influences this social perception. The judgment of others depends on both the body shape of the person being observed and the viewer’s own political preference, potentially having political and societal implications. The research project aims to raise awareness of this bias.

Perceiving Systems Conference Paper SMPLpix: Neural Avatars from 3D Human Models Prokudin, S., Black, M. J., Romero, J. In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV 2021), 1809-1818, IEEE, Piscataway, NJ, IEEE Winter Conference on Applications of Computer Vision (WACV 2021), January 2021 (Published)
Recent advances in deep generative models have led to an unprecedented level of realism for synthetically generated images of humans. However, one of the remaining fundamental limitations of these models is the ability to flexibly control the generative process, e.g. change the camera and human pose while retaining the subject identity. At the same time, deformable human body models like SMPL and its successors provide full control over pose and shape, but rely on classic computer graphics pipelines for rendering. Such rendering pipelines require explicit mesh rasterization that (a) does not have the potential to fix artifacts or lack of realism in the original 3D geometry and (b) until recently, was not fully incorporated into deep learning frameworks. In this work, we propose to bridge the gap between classic geometry-based rendering and the latest generative networks operating in pixel space. We train a network that directly converts a sparse set of 3D mesh vertices into photorealistic images, alleviating the need for a traditional rasterization mechanism. We train our model on a large corpus of human 3D models and corresponding real photos, and show the advantage over conventional differentiable renderers both in terms of the level of photorealism and rendering efficiency.

Perceiving Systems Article Analyzing the Direction of Emotional Influence in Nonverbal Dyadic Communication: A Facial-Expression Study Shadaydeh, M., Müller, L., Schneider, D., Thuemmel, M., Kessler, T., Denzler, J. IEEE Access, 9:73780-73790, IEEE, 2021 (Published)
Identifying the direction of emotional influence in a dyadic dialogue is of increasing interest in the psychological sciences with applications in psychotherapy, analysis of political interactions, or interpersonal conflict behavior. Facial expressions are widely described as being automatic and thus hard to overtly influence. As such, they are a perfect measure for a better understanding of unintentional behavior cues about socio-emotional cognitive processes. With this view, this study is concerned with the analysis of the direction of emotional influence in dyadic dialogues based on facial expressions only. We exploit computer vision capabilities along with causal inference theory for quantitative verification of hypotheses on the direction of emotional influence, i.e., cause-effect relationships, in dyadic dialogues. We address two main issues. First, in a dyadic dialogue, emotional influence occurs over transient time intervals and with intensity and direction that are variant over time. To this end, we propose a relevant interval selection approach that we use prior to causal inference to identify those transient intervals where causal inference should be applied. Second, we propose to use fine-grained facial expressions that are present when strong distinct facial emotions are not visible. To specify the direction of influence, we apply the concept of Granger causality to the time-series of facial expressions over selected relevant intervals. We tested our approach on new, experimentally obtained data. Based on quantitative verification of hypotheses on the direction of emotional influence, we were able to show that the proposed approach is promising to reveal the cause-effect pattern in various instructed interaction conditions.
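Stripped of the interval selection and significance testing, the Granger-causality step reduces to asking whether y's own past plus x's past predicts y better than y's own past alone. A bare-bones numpy sketch (the helper name and the variance-ratio summary are illustrative, not the paper's code):

```python
import numpy as np

def granger_gain(x, y, lag=2):
    """Ratio of residual variances for predicting y from its own past vs.
    its own past plus x's past. A gain well above 1 suggests x Granger-
    causes y. Minimal sketch: no interval selection, no F-test."""
    T = len(y)
    rows_own, rows_full, targets = [], [], []
    for t in range(lag, T):
        rows_own.append(y[t - lag:t])                             # y's past only
        rows_full.append(np.concatenate([y[t - lag:t], x[t - lag:t]]))  # + x's past
        targets.append(y[t])
    rows_own, rows_full, targets = map(np.asarray, (rows_own, rows_full, targets))
    r_own = targets - rows_own @ np.linalg.lstsq(rows_own, targets, rcond=None)[0]
    r_full = targets - rows_full @ np.linalg.lstsq(rows_full, targets, rcond=None)[0]
    return r_own.var() / max(r_full.var(), 1e-12)
```

In the paper, the role of x and y is played by facial-expression time series of the two interlocutors over the selected relevant intervals, and a proper statistical test replaces this raw variance ratio.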

Perceiving Systems Article Body Image Disturbances and Weight Bias After Obesity Surgery: Semantic and Visual Evaluation in a Controlled Study, Findings from the BodyTalk Project Meneguzzo, P., Behrens, S. C., Favaro, A., Tenconi, E., Vindigni, V., Teufel, M., Skoda, E., Lindner, M., Quiros-Ramirez, M. A., Mohler, B., Black, M., Zipfel, S., Giel, K. E., Pavan, C. Obesity Surgery, 31(4):1625-1634, 2021 (Published)
Purpose: Body image has a significant impact on the outcome of obesity surgery. This study aims to perform a semantic evaluation of body shapes in obesity surgery patients and a group of controls. Materials and Methods: Thirty-four obesity surgery (OS) subjects, stable after weight loss (average 48.03 ± 18.60 kg), and 35 overweight/obese controls (HC), were enrolled in this study. Body dissatisfaction, self-esteem, and body perception were evaluated with self-reported tests, and semantic evaluation of body shapes was performed with three specific tasks constructed with realistic human body stimuli. Results: The OS showed a more positive body image compared to HC (p < 0.001), higher levels of depression (p < 0.019), and lower self-esteem (p < 0.000). OS patients and HC showed no difference in weight bias, but OS used a higher BMI than HC in the visualization of positive adjectives (p = 0.011). Both groups showed a mental underestimation of their body shapes. Conclusion: OS patients are more psychologically burdened and have more difficulties in judging their bodies than overweight/obese peers. Their mental body representations seem not to be linked to their own BMI. Our findings provide helpful insight for the design of specific interventions in body image in obese and overweight people, as well as in OS.

Perceiving Systems Conference Paper Collaborative Mapping of Archaeological Sites Using Multiple UAVs Patel, M., Bandopadhyay, A., Ahmad, A. In Lecture Notes in Networks and Systems, 412:54-70, Springer International Publishing, Intelligent Autonomous Systems 16, 2021
UAVs have found an important application in archaeological mapping. The majority of existing methods employ an offline method to process the data collected from an archaeological site; such methods are time-consuming and computationally expensive. In this paper, we present a multi-UAV approach for faster mapping of archaeological sites. Employing a team of UAVs not only reduces the mapping time by distribution of coverage area, but also improves the map accuracy by exchange of information. Through extensive experiments in a realistic simulation (AirSim), we demonstrate the advantages of using a collaborative mapping approach. We then create the first 3D map of the Sadra Fort, a 15th-century fort located in Gujarat, India, using our proposed method. Additionally, we present two novel archaeological datasets recorded in both simulation and the real world to facilitate research on collaborative archaeological mapping. For the benefit of the community, we make the AirSim simulation environment, as well as the datasets, publicly available (Project web page: http://patelmanthan.in/castle-ruins-airsim/).

Empirical Inference Haptic Intelligence Perceiving Systems Physical Intelligence Robotic Materials MPI Year Book Scientific Report 2016 - 2021 2021
This report presents research done at the Max Planck Institute for Intelligent Systems from January 2016 to November 2021. It is our fourth report since the founding of the institute in 2011. Due to the fact that the upcoming evaluation is an extended one, the report covers a longer reporting period. This scientific report is organized as follows: we begin with an overview of the institute, including an outline of its structure, an introduction of our latest research departments, and a presentation of our main collaborative initiatives and activities (Chapter 1). The central part of the scientific report consists of chapters on the research conducted by the institute’s departments (Chapters 2 to 6) and its independent research groups (Chapters 7 to 24), as well as the work of the institute’s central scientific facilities (Chapter 25). For entities founded after January 2016, the respective report sections cover work done from the date of the establishment of the department, group, or facility. These chapters are followed by a summary of selected outreach activities and scientific events hosted by the institute (Chapter 26). The scientific publications of the featured departments and research groups published during the 6-year review period complete this scientific report.

Perceiving Systems Empirical Inference Conference Paper Grasping Field: Learning Implicit Representations for Human Grasps Karunratanakul, K., Yang, J., Zhang, Y., Black, M., Muandet, K., Tang, S. In 2020 International Conference on 3D Vision (3DV 2020), 333-344, IEEE, Piscataway, NJ, International Conference on 3D Vision (3DV 2020), November 2020 (Published)
Robotic grasping of household objects has made remarkable progress in recent years. Yet, human grasps are still difficult to synthesize realistically. There are several key reasons: (1) the human hand has many degrees of freedom (more than robotic manipulators); (2) the synthesized hand should conform to the surface of the object; and (3) it should interact with the object in a semantically and physically plausible manner. To make progress in this direction, we draw inspiration from the recent progress on learning-based implicit representations for 3D object reconstruction. Specifically, we propose an expressive representation for human grasp modelling that is efficient and easy to integrate with deep neural networks. Our insight is that every point in a three-dimensional space can be characterized by the signed distances to the surface of the hand and the object, respectively. Consequently, the hand, the object, and the contact area can be represented by implicit surfaces in a common space, in which the proximity between the hand and the object can be modelled explicitly. We name this 3D-to-2D mapping the Grasping Field, parameterize it with a deep neural network, and learn it from data. We demonstrate that the proposed grasping field is an effective and expressive representation for human grasp generation. Specifically, our generative model is able to synthesize high-quality human grasps, given only a 3D object point cloud. The extensive experiments demonstrate that our generative model compares favorably with a strong baseline and approaches the level of natural human grasps. Furthermore, based on the grasping field representation, we propose a deep network for the challenging task of 3D hand-object interaction reconstruction from a single RGB image. Our method improves the physical plausibility of the hand-object contact reconstruction and achieves comparable performance for 3D hand reconstruction compared to state-of-the-art methods. Our model and code are available for research purposes at https://github.com/korrawe/grasping_field.
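The core representation, each 3D query point mapped to its signed distances to the hand surface and the object surface, can be sketched with crude sphere SDFs standing in for the learned distance fields (illustrative only; the paper parameterizes this map with a neural network for real hand and object geometry):

```python
import numpy as np

def grasping_field(points, hand_center, hand_r, obj_center, obj_r):
    """Map each 3D query point to a 2D 'grasping field' coordinate: its
    signed distance to the hand surface and to the object surface. Both
    shapes are spheres here purely for illustration."""
    def sphere_sdf(p, center, radius):
        # negative inside, zero on the surface, positive outside
        return np.linalg.norm(p - center, axis=-1) - radius
    return np.stack([sphere_sdf(points, hand_center, hand_r),
                     sphere_sdf(points, obj_center, obj_r)], axis=-1)
```

Points where both coordinates are near zero lie on both surfaces at once, which is exactly how the representation makes the contact region explicit.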

Perceiving Systems Article Occlusion Boundary: A Formal Definition & Its Detection via Deep Exploration of Context Wang, C., Fu, H., Tao, D., Black, M. J. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2641-2656, November 2020 (Published)
Occlusion boundaries contain rich perceptual information about the underlying scene structure and provide important cues in many visual perception-related tasks such as object recognition, segmentation, motion estimation, scene understanding, and autonomous navigation. However, there is no formal definition of occlusion boundaries in the literature, and state-of-the-art occlusion boundary detection is still suboptimal. With this in mind, in this paper we propose a formal definition of occlusion boundaries for related studies. Further, based on a novel idea, we develop two concrete approaches with different characteristics to detect occlusion boundaries in video sequences via enhanced exploration of contextual information (e.g., local structural boundary patterns, observations from surrounding regions, and temporal context) with deep models and conditional random fields. Experimental evaluations of our methods on two challenging occlusion boundary benchmarks (CMU and VSB100) demonstrate that our detectors significantly outperform the current state-of-the-art. Finally, we empirically assess the roles of several important components of the proposed detectors to validate the rationale behind these approaches.

Perceiving Systems Conference Paper GIF: Generative Interpretable Faces Ghosh, P., Gupta, P. S., Uziel, R., Ranjan, A., Black, M. J., Bolkart, T. In 2020 International Conference on 3D Vision (3DV 2020), 1:868-878, IEEE, Piscataway, NJ, International Conference on 3D Vision (3DV 2020), November 2020 (Published)
Photo-realistic visualization and animation of expressive human faces have been a long-standing challenge. 3D face modeling methods provide parametric control but generate unrealistic images; generative 2D models like GANs (Generative Adversarial Networks), on the other hand, output photo-realistic face images but lack explicit control. Recent methods gain partial control, either by attempting to disentangle different factors in an unsupervised manner, or by adding control post hoc to a pre-trained model. Unconditional GANs, however, may entangle factors that are hard to undo later. We condition our generative model on pre-defined control parameters to encourage disentanglement in the generation process. Specifically, we condition StyleGAN2 on FLAME, a generative 3D face model. While conditioning on FLAME parameters yields unsatisfactory results, we find that conditioning on rendered FLAME geometry and photometric details works well. This gives us a generative 2D face model named GIF (Generative Interpretable Faces) that offers FLAME's parametric control. Here, interpretable refers to the semantic meaning of different parameters. Given FLAME parameters for shape, pose, expressions, parameters for appearance, lighting, and an additional style vector, GIF outputs photo-realistic face images. We perform an AMT-based perceptual study to quantitatively and qualitatively evaluate how well GIF follows its conditioning. The code, data, and trained model are publicly available for research purposes at http://gif.is.tue.mpg.de
pdf project code video DOI BibTeX

Perceiving Systems Conference Paper PLACE: Proximity Learning of Articulation and Contact in 3D Environments Zhang, S., Zhang, Y., Ma, Q., Black, M. J., Tang, S. In 2020 International Conference on 3D Vision (3DV 2020), 1:642-651, IEEE, Piscataway, NJ, International Conference on 3D Vision (3DV 2020), November 2020 (Published)
High-fidelity digital 3D environments have been proposed in recent years; however, it remains extremely challenging to automatically populate such environments with realistic human bodies. Existing work utilizes images, depth, or semantic maps to represent the scene, and parametric human models to represent 3D bodies. While straightforward, their generated human-scene interactions often lack naturalness and physical plausibility. Our key observation is that humans interact with the world through body-scene contact. To synthesize realistic human-scene interactions, it is essential to effectively represent the physical contact and proximity between the body and the world. To that end, we propose a novel interaction generation method, named PLACE (Proximity Learning of Articulation and Contact in 3D Environments), which explicitly models the proximity between the human body and the 3D scene around it. Specifically, given a set of basis points on a scene mesh, we leverage a conditional variational autoencoder to synthesize the minimum distances from the basis points to the human body surface. The generated proximal relationship reveals which regions of the scene are in contact with the person. Furthermore, based on such synthesized proximity, we are able to effectively obtain expressive 3D human bodies that interact with the 3D scene naturally. Our perceptual study shows that PLACE significantly improves on the state-of-the-art method, approaching the realism of real human-scene interaction. We believe our method makes an important step towards the fully automatic synthesis of realistic 3D human bodies in 3D scenes. The code and model are available for research at https://sanweiliti.github.io/PLACE/PLACE.html
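The minimum-distance encoding at the heart of PLACE can be illustrated with a small NumPy sketch; the function name and shapes here are illustrative, not the authors' code, and the body surface is approximated by its vertices:

```python
import numpy as np

def basis_point_distances(basis_points, body_vertices):
    """For each basis point, return the minimum Euclidean distance
    to any vertex of the body mesh (a proxy for body-scene proximity)."""
    # (B, 1, 3) - (1, V, 3) -> (B, V, 3) pairwise differences
    diff = basis_points[:, None, :] - body_vertices[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)  # (B, V)
    return dists.min(axis=1)               # (B,)

# Toy example: two basis points, a body "mesh" of three vertices.
basis = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
verts = np.array([[1.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
print(basis_point_distances(basis, verts))  # [1. 1.]
```

In the paper, a conditional VAE generates such distance vectors; this sketch only shows how the proximity representation itself is computed.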
pdf arXiv project code DOI BibTeX

Perceiving Systems Article 3D Morphable Face Models - Past, Present and Future Egger, B., Smith, W. A. P., Tewari, A., Wuhrer, S., Zollhoefer, M., Beeler, T., Bernard, F., Bolkart, T., Kortylewski, A., Romdhani, S., Theobalt, C., Blanz, V., Vetter, T. ACM Transactions on Graphics, 39(5):157, October 2020 (Published)
In this paper, we provide a detailed survey of 3D Morphable Face Models over the 20 years since they were first proposed. The challenges in building and applying these models, namely capture, modeling, image formation, and image analysis, are still active research topics, and we review the state-of-the-art in each of these areas. We also look ahead, identifying unsolved challenges, proposing directions for future research and highlighting the broad range of current and future applications.
project page pdf preprint DOI BibTeX

Perceiving Systems Article AirCapRL: Autonomous Aerial Human Motion Capture Using Deep Reinforcement Learning Tallamraju, R., Saini, N., Bonetto, E., Pabst, M., Liu, Y. T., Black, M., Ahmad, A. IEEE Robotics and Automation Letters, 5(4):6678-6685, IEEE, October 2020, Also accepted and presented in the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). (Published)
In this letter, we introduce a deep reinforcement learning (DRL) based multi-robot formation controller for the task of autonomous aerial human motion capture (MoCap). We focus on vision-based MoCap, where the objective is to estimate the trajectory, body pose, and shape of a single moving person using multiple micro aerial vehicles. State-of-the-art solutions to this problem are based on classical control methods, which depend on hand-crafted system and observation models. Such models are difficult to derive and to generalize across different systems. Moreover, the non-linearities and non-convexities of these models lead to sub-optimal controls. In our work, we formulate this problem as a sequential decision-making task to achieve the vision-based motion capture objectives, and solve it using a deep neural network-based RL method. We leverage proximal policy optimization (PPO) to train a stochastic decentralized control policy for formation control. The neural network is trained in a parallelized setup in synthetic environments. We performed extensive simulation experiments to validate our approach. Finally, real-robot experiments demonstrate that our policies generalize to real-world conditions.
DOI URL BibTeX

Perceiving Systems Conference Paper Learning a statistical full spine model from partial observations Meng, D., Keller, M., Boyer, E., Black, M., Pujades, S. In Shape in Medical Imaging, 122-133, Lecture Notes in Computer Science, 12474, (Editors: Reuter, Martin and Wachinger, Christian and Lombaert, Hervé and Paniagua, Beatriz and Goksel, Orcun and Rekik, Islem), Springer, Cham, International Workshop on Shape in Medical Imaging (ShapeMI 2020), October 2020 (Published)
The study of the morphology of the human spine has attracted research attention for its many potential applications, such as image segmentation, biomechanics, or pathology detection. However, as of today there is no publicly available statistical model of the 3D surface of the full spine. This is mainly due to the lack of openly available 3D data where the full spine is imaged and segmented. In this paper we propose to learn a statistical surface model of the full spine (7 cervical, 12 thoracic, and 5 lumbar vertebrae) from partial and incomplete views of the spine. In order to deal with the partial observations, we use probabilistic principal component analysis (PPCA) to learn a surface shape model of the full spine. Quantitative evaluation demonstrates that the obtained model faithfully captures the shape of the population in a low-dimensional space and generalizes to left-out data. Furthermore, we show that the model faithfully captures the global correlations among the shapes of the vertebrae. Given a partial observation of the spine, i.e. a few vertebrae, the model can predict the shape of unseen vertebrae with a mean error under 3 mm. The full-spine statistical model is trained on the VerSe 2019 public dataset and is made publicly available to the community for non-commercial purposes. (https://gitlab.inria.fr/spine/spine_model)
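The idea of completing a spine from a few observed vertebrae can be sketched, in simplified form, as regression in a learned linear shape space. This uses plain least squares rather than the paper's PPCA, and all names, shapes, and the toy basis are illustrative:

```python
import numpy as np

def predict_missing(mean, components, x_obs, obs_idx):
    """Given a linear shape model x ~ mean + components @ w, estimate
    coefficients w from the observed dimensions only (least squares),
    then reconstruct the full vector, including the unseen dimensions."""
    A = components[obs_idx]            # rows of the basis that were observed
    b = x_obs - mean[obs_idx]
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mean + components @ w

# Toy 4-D "shape" space with a 1-D basis.
mean = np.zeros(4)
components = np.array([[1.0], [2.0], [3.0], [4.0]])
full = mean + components @ np.array([0.5])      # ground-truth sample
obs_idx = np.array([0, 1])                      # only two dims observed
recon = predict_missing(mean, components, full[obs_idx], obs_idx)
print(np.allclose(recon, full))  # True
```

PPCA additionally models per-dimension noise and yields a posterior over `w`; the least-squares fit above is the noise-free special case.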
Gitlab Code PDF DOI BibTeX

Perceiving Systems Conference Paper GRAB: A Dataset of Whole-Body Human Grasping of Objects Taheri, O., Ghorbani, N., Black, M. J., Tzionas, D. In Computer Vision – ECCV 2020, 4:581-600, Lecture Notes in Computer Science, 12349, (Editors: Vedaldi, Andrea and Bischof, Horst and Brox, Thomas and Frahm, Jan-Michael), Springer, Cham, 16th European Conference on Computer Vision (ECCV 2020), August 2020 (Published)
Training computers to understand, model, and synthesize human grasping requires a rich dataset containing complex 3D object shapes, detailed contact information, hand pose and shape, and the 3D body motion over time. While "grasping" is commonly thought of as a single hand stably lifting an object, we capture the motion of the entire body and adopt the generalized notion of "whole-body grasps". Thus, we collect a new dataset, called GRAB (GRasping Actions with Bodies), of whole-body grasps, containing full 3D shape and pose sequences of 10 subjects interacting with 51 everyday objects of varying shape and size. Given MoCap markers, we fit the full 3D body shape and pose, including the articulated face and hands, as well as the 3D object pose. This gives detailed 3D meshes over time, from which we compute contact between the body and object. This is a unique dataset, that goes well beyond existing ones for modeling and understanding how humans grasp and manipulate objects, how their full body is involved, and how interaction varies with the task. We illustrate the practical value of GRAB with an example application; we train GrabNet, a conditional generative network, to predict 3D hand grasps for unseen 3D object shapes. The dataset and code are available for research purposes at https://grab.is.tue.mpg.de.
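The contact computation described in the abstract (deriving body-object contact from fitted 3D meshes) can be sketched with a simple proximity threshold; the function, the threshold value, and the vertex-based surface approximation are assumptions for illustration, not the GRAB pipeline:

```python
import numpy as np

def contact_vertices(body_verts, object_verts, threshold=0.005):
    """Mark body vertices within `threshold` meters of the object
    surface (approximated here by the object's vertices) as in contact."""
    diff = body_verts[:, None, :] - object_verts[None, :, :]
    min_dist = np.linalg.norm(diff, axis=-1).min(axis=1)
    return min_dist < threshold

body = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
obj = np.array([[0.001, 0.0, 0.0]])
print(contact_vertices(body, obj))  # [ True False]
```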
pdf suppl video (long) video (short) DOI URL BibTeX

Perceiving Systems Conference Paper Monocular Expressive Body Regression through Body-Driven Attention Choutas, V., Pavlakos, G., Bolkart, T., Tzionas, D., Black, M. J. In Computer Vision - ECCV 2020, 10:20-40, Lecture Notes in Computer Science, 12355, (Editors: Vedaldi, Andrea and Bischof, Horst and Brox, Thomas and Frahm, Jan-Michael), Springer, Cham, 16th European Conference on Computer Vision (ECCV 2020), August 2020 (Published)
To understand how people look, interact, or perform tasks, we need to quickly and accurately capture their 3D body, face, and hands together from an RGB image. Most existing methods focus only on parts of the body. A few recent approaches reconstruct full expressive 3D humans from images using 3D body models that include the face and hands. These methods are optimization-based and thus slow, prone to local optima, and require 2D keypoints as input. We address these limitations by introducing ExPose (EXpressive POse and Shape rEgression), which directly regresses the body, face, and hands, in SMPL-X format, from an RGB image. This is a hard problem due to the high dimensionality of the body and the lack of expressive training data. Additionally, hands and faces are much smaller than the body, occupying very few image pixels. This makes hand and face estimation hard when body images are downscaled for neural networks. We make three main contributions. First, we account for the lack of training data by curating a dataset of SMPL-X fits on in-the-wild images. Second, we observe that body estimation localizes the face and hands reasonably well. We introduce body-driven attention for face and hand regions in the original image to extract higher-resolution crops that are fed to dedicated refinement modules. Third, these modules exploit part-specific knowledge from existing face and hand-only datasets. ExPose estimates expressive 3D humans more accurately than existing optimization methods at a small fraction of the computational cost. Our data, model and code are available for research at https://expose.is.tue.mpg.de.
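The body-driven attention step amounts to cropping a high-resolution region around the keypoints that the body network localizes. A minimal sketch, where the function name, the scale factor, and the square-crop heuristic are assumptions rather than ExPose's actual implementation:

```python
import numpy as np

def attention_crop(image, keypoints_2d, scale=1.5):
    """Crop a square region around a set of 2D (x, y) keypoints
    (e.g. the hand joints localized by the body network),
    enlarged by `scale` to include surrounding context."""
    center = keypoints_2d.mean(axis=0)
    half = 0.5 * scale * (keypoints_2d.max(axis=0) - keypoints_2d.min(axis=0)).max()
    x0, y0 = np.maximum(center - half, 0).astype(int)
    x1, y1 = (center + half).astype(int)
    return image[y0:y1 + 1, x0:x1 + 1]

img = np.arange(100).reshape(10, 10)          # toy 10x10 "image"
kps = np.array([[2.0, 2.0], [4.0, 4.0]])      # two (x, y) keypoints
crop = attention_crop(img, kps)
print(crop.shape)  # (4, 4)
```

The crop is taken from the original full-resolution image, so small parts like hands keep enough pixels for the dedicated refinement modules.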
code Short video Long video arxiv pdf suppl DOI URL BibTeX

Perceiving Systems Conference Paper STAR: Sparse Trained Articulated Human Body Regressor Osman, A. A. A., Bolkart, T., Black, M. J. In Computer Vision - ECCV 2020, 6:598-613, Lecture Notes in Computer Science, 12351, (Editors: Vedaldi, Andrea and Bischof, Horst and Brox, Thomas and Frahm, Jan-Michael), Springer, Cham, 16th European Conference on Computer Vision (ECCV 2020), August 2020 (Published)
The SMPL body model is widely used for the estimation, synthesis, and analysis of 3D human pose and shape. While popular, we show that SMPL has several limitations and introduce STAR, which is quantitatively and qualitatively superior to SMPL. First, SMPL has a huge number of parameters resulting from its use of global blend shapes. These dense pose-corrective offsets relate every vertex on the mesh to all the joints in the kinematic tree, capturing spurious long-range correlations. To address this, we define per-joint pose correctives and learn the subset of mesh vertices that are influenced by each joint movement. This sparse formulation results in more realistic deformations and significantly reduces the number of model parameters to 20% of SMPL. When trained on the same data as SMPL, STAR generalizes better despite having many fewer parameters. Second, SMPL factors pose-dependent deformations from body shape while, in reality, people with different shapes deform differently. Consequently, we learn shape-dependent pose-corrective blend shapes that depend on both body pose and BMI. Third, we show that the shape space of SMPL is not rich enough to capture the variation in the human population. We address this by training STAR with an additional 10,000 scans of male and female subjects, and show that this results in better model generalization. STAR is compact, generalizes better to new bodies and is a drop-in replacement for SMPL. STAR is publicly available for research purposes at http://star.is.tue.mpg.de.
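The contrast between dense global blend shapes and STAR's sparse per-joint correctives can be sketched as follows; the function, masks, and per-joint activations here are a simplified illustration, not the released model:

```python
import numpy as np

def sparse_pose_correctives(template, offsets, masks, pose_feats):
    """Apply per-joint pose-corrective offsets only to the vertices each
    joint influences (masks), instead of dense global blend shapes.
    offsets: (J, V, 3) correctives; masks: (J, V) 0/1 influence;
    pose_feats: (J,) per-joint pose activation."""
    verts = template.copy()
    for j, a in enumerate(pose_feats):
        verts += a * masks[j][:, None] * offsets[j]
    return verts

V, J = 4, 2
template = np.zeros((V, 3))
offsets = np.ones((J, V, 3))
masks = np.array([[1, 1, 0, 0],     # joint 0 moves vertices 0 and 1
                  [0, 0, 1, 0]],    # joint 1 moves vertex 2 only
                 dtype=float)
verts = sparse_pose_correctives(template, offsets, masks, np.array([1.0, 2.0]))
print(verts[:, 0])  # [1. 1. 2. 0.]
```

Because each joint touches only a subset of vertices, the parameter count drops sharply and spurious long-range correlations are excluded by construction.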
Project Page Code Video paper supplemental DOI BibTeX

Perceiving Systems Ph.D. Thesis Addressing the Data Scarcity of Learning-based Optical Flow Approaches Janai, J. University of Tübingen, July 2020 (Published)
Learning to solve optical flow in an end-to-end fashion from examples is attractive as deep neural networks allow for learning more complex hierarchical flow representations directly from annotated data. However, training such models requires large datasets, and obtaining ground truth for real images is challenging. Due to the difficulty of capturing dense ground truth, existing optical flow datasets are limited in size and diversity. Therefore, we present two strategies to address this data scarcity problem: First, we propose an approach to create new real-world datasets by exploiting temporal constraints using a high-speed video camera. We tackle this problem by tracking pixels through densely sampled space-time volumes recorded with a high-speed video camera. Our model exploits the linearity of small motions and reasons about occlusions from multiple frames. Using our technique, we are able to establish accurate reference flow fields outside the laboratory in natural environments. In addition, we show how our predictions can be used to augment the input images with realistic motion blur. We demonstrate the quality of the produced flow fields on synthetic and real-world datasets. Finally, we collect a novel challenging optical flow dataset by applying our technique on data from a high-speed camera and analyze the performance of the state of the art in optical flow under various levels of motion blur. Second, we investigate how to learn sophisticated models from unlabeled data. Unsupervised learning is a promising direction, yet the performance of current unsupervised methods is still limited. In particular, the lack of proper occlusion handling in commonly used data terms constitutes a major source of error. While most optical flow methods process pairs of consecutive frames, more advanced occlusion reasoning can be realized when considering multiple frames. We propose a framework for unsupervised learning of optical flow and occlusions over multiple frames.
More specifically, we exploit the minimal configuration of three frames to strengthen the photometric loss and explicitly reason about occlusions. We demonstrate that our multi-frame, occlusion-sensitive formulation outperforms previous unsupervised methods and even produces results on par with some fully supervised methods. Both directions are essential for future advances in optical flow. While new datasets allow measuring the advancements and comparing novel approaches, unsupervised learning permits the usage of new data sources to train better models.
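The occlusion-aware photometric loss the thesis describes can be sketched in its simplest form: mask out occluded pixels so they do not contribute to the data term. The function and the toy arrays are illustrative, and real methods warp with bilinear sampling rather than receiving a pre-warped frame:

```python
import numpy as np

def photometric_loss(frame_t, frame_t1_warped, occlusion_mask):
    """Average photometric error between a frame and the next frame
    warped back by the estimated flow, counting only visible pixels.
    occlusion_mask is 1 where a pixel is occluded in the next frame."""
    err = np.abs(frame_t - frame_t1_warped)
    visible = 1.0 - occlusion_mask
    return (err * visible).sum() / max(visible.sum(), 1.0)

f0 = np.array([[1.0, 2.0], [3.0, 4.0]])
f1w = np.array([[1.0, 2.0], [0.0, 4.0]])     # one badly-warped pixel...
occ = np.array([[0.0, 0.0], [1.0, 0.0]])     # ...which is marked occluded
print(photometric_loss(f0, f1w, occ))  # 0.0
```

Without the mask, the occluded pixel would dominate the loss and push the flow estimate toward a wrong match; with it, the error is correctly zero.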
download DOI BibTeX

Perceiving Systems Article Analysis of motor development within the first year of life: 3-D motion tracking without markers for early detection of developmental disorders Parisi, C., Hesse, N., Tacke, U., Rocamora, S. P., Blaschek, A., Hadders-Algra, M., Black, M. J., Heinen, F., Müller-Felber, W., Schroeder, A. S. Bundesgesundheitsblatt - Gesundheitsforschung - Gesundheitsschutz, 63(7):881–890, July 2020 (Published)
Children with motor development disorders benefit greatly from early interventions. An early diagnosis in pediatric preventive care (U2–U5) can be improved by automated screening. Current approaches to automated motion analysis, however, are expensive, require lots of technical support, and cannot be used in broad clinical application. Here we present an inexpensive, marker-free video analysis tool (KineMAT) for infants, which digitizes 3‑D movements of the entire body over time allowing automated analysis in the future. Three-minute video sequences of spontaneously moving infants were recorded with a commercially available depth-imaging camera and aligned with a virtual infant body model (SMIL model). The virtual image generated allows any measurements to be carried out in 3‑D with high precision. We demonstrate seven infants with different diagnoses. A selection of possible movement parameters was quantified and aligned with diagnosis-specific movement characteristics. KineMAT and the SMIL model allow reliable, three-dimensional measurements of spontaneous activity in infants with a very low error rate. Based on machine-learning algorithms, KineMAT can be trained to automatically recognize pathological spontaneous motor skills. It is inexpensive and easy to use and can be developed into a screening tool for preventive care for children.
pdf on-line w/ sup mat DOI BibTeX

Perceiving Systems Conference Paper Generating 3D People in Scenes without People Zhang, Y., Hassan, M., Neumann, H., Black, M. J., Tang, S. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 6193-6203, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), June 2020 (Published)
We present a fully automatic system that takes a 3D scene and generates plausible 3D human bodies that are posed naturally in that 3D scene. Given a 3D scene without people, humans can easily imagine how people could interact with the scene and the objects in it. However, this is a challenging task for a computer as solving it requires that (1) the generated human bodies be semantically plausible within the 3D environment (e.g. people sitting on the sofa or cooking near the stove), and (2) the generated human-scene interaction be physically feasible such that the human body and scene do not interpenetrate while, at the same time, body-scene contact supports physical interactions. To that end, we make use of the surface-based 3D human model SMPL-X. We first train a conditional variational autoencoder to predict semantically plausible 3D human poses conditioned on latent scene representations, then we further refine the generated 3D bodies using scene constraints to enforce feasible physical interaction. We show that our approach is able to synthesize realistic and expressive 3D human bodies that naturally interact with the 3D environment. We perform extensive experiments demonstrating that our generative framework compares favorably with existing methods, both qualitatively and quantitatively. We believe that our scene-conditioned 3D human generation pipeline will be useful for numerous applications; e.g. to generate training data for human pose estimation, in video games and in VR/AR. Our project page for data and code can be seen at: https://vlg.inf.ethz.ch/projects/PSI/
Code Video PDF DOI BibTeX

Perceiving Systems Conference Paper Learning Physics-guided Face Relighting under Directional Light Nestmeyer, T., Lalonde, J., Matthews, I., Lehrmann, A. M. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 5123-5132, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), June 2020 (Published)
Relighting is an essential step in realistically transferring objects from a captured image into another environment. For example, authentic telepresence in Augmented Reality requires faces to be displayed and relit consistent with the observer's scene lighting. We investigate end-to-end deep learning architectures that both de-light and relight an image of a human face. Our model decomposes the input image into intrinsic components according to a diffuse physics-based image formation model. We enable non-diffuse effects including cast shadows and specular highlights by predicting a residual correction to the diffuse render. To train and evaluate our model, we collected a portrait database of 21 subjects with various expressions and poses. Each sample is captured in a controlled light stage setup with 32 individual light sources. Our method creates precise and believable relighting results and generalizes to complex illumination conditions and challenging poses, including when the subject is not looking straight at the camera.
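The decomposition the paper builds on, a diffuse physics-based render plus a learned residual for non-diffuse effects, can be sketched with a Lambertian shading term; the function and arrays are illustrative, and in the paper the residual is predicted by a network rather than given:

```python
import numpy as np

def relight(albedo, normals, light_dir, residual):
    """Diffuse (Lambertian) render plus a residual correction for
    non-diffuse effects such as cast shadows and specular highlights.
    albedo, residual: (N, 3) per-pixel RGB; normals: (N, 3) unit vectors."""
    l = light_dir / np.linalg.norm(light_dir)
    shading = np.clip(normals @ l, 0.0, None)   # n . l, clamped at 0
    return albedo * shading[:, None] + residual

albedo = np.full((2, 3), 0.5)                   # two pixels, RGB
normals = np.array([[0.0, 0.0, 1.0],            # facing the light
                    [0.0, 0.0, -1.0]])          # facing away
out = relight(albedo, normals, np.array([0.0, 0.0, 1.0]), np.zeros((2, 3)))
print(out[:, 0])  # [0.5 0. ]
```

Keeping the diffuse term physics-based constrains the network to learn only the residual, which is what lets the model generalize to unseen lighting.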
Paper DOI BibTeX

Perceiving Systems Article Learning and Tracking the 3D Body Shape of Freely Moving Infants from RGB-D sequences Hesse, N., Pujades, S., Black, M., Arens, M., Hofmann, U., Schroeder, S. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 42(10):2540-2551, June 2020
Statistical models of the human body surface are generally learned from thousands of high-quality 3D scans in predefined poses to cover the wide variety of human body shapes and articulations. Acquisition of such data requires expensive equipment, calibration procedures, and is limited to cooperative subjects who can understand and follow instructions, such as adults. We present a method for learning a statistical 3D Skinned Multi-Infant Linear body model (SMIL) from incomplete, low-quality RGB-D sequences of freely moving infants. Quantitative experiments show that SMIL faithfully represents the RGB-D data and properly factorizes the shape and pose of the infants. To demonstrate the applicability of SMIL, we fit the model to RGB-D sequences of freely moving infants and show, with a case study, that our method captures enough motion detail for General Movements Assessment (GMA), a method used in clinical practice for early detection of neurodevelopmental disorders in infants. SMIL provides a new tool for analyzing infant shape and movement and is a step towards an automated system for GMA.
pdf Journal DOI BibTeX

Perceiving Systems Conference Paper Learning to Dress 3D People in Generative Clothing Ma, Q., Yang, J., Ranjan, A., Pujades, S., Pons-Moll, G., Tang, S., Black, M. J. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 6468-6477, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), June 2020 (Published)
Three-dimensional human body models are widely used in the analysis of human pose and motion. Existing models, however, are learned from minimally-clothed 3D scans and thus do not generalize to the complexity of dressed people in common images and videos. Additionally, current models lack the expressive power needed to represent the complex non-linear geometry of pose-dependent clothing shape. To address this, we learn a generative 3D mesh model of clothed people from 3D scans with varying pose and clothing. Specifically, we train a conditional Mesh-VAE-GAN to learn the clothing deformation from the SMPL body model, making clothing an additional term on SMPL. Our model is conditioned on both pose and clothing type, giving the ability to draw samples of clothing to dress different body shapes in a variety of styles and poses. To preserve wrinkle detail, our Mesh-VAE-GAN extends patchwise discriminators to 3D meshes. Our model, named CAPE, represents global shape and fine local structure, effectively extending the SMPL body model to clothing. To our knowledge, this is the first generative model that directly dresses 3D human body meshes and generalizes to different poses.
Project page Code Long video Short Video arXiv DOI URL BibTeX

Perceiving Systems Patent Machine learning systems and methods of estimating body shape from images Black, M., Rachlin, E., Heron, N., Loper, M., Weiss, A., Hu, K., Hinkle, T., Kristiansen, M. (US Patent 10,679,046), June 2020
Disclosed is a method including receiving an input image including a human, predicting, based on a convolutional neural network that is trained using examples consisting of pairs of sensor data, a corresponding body shape of the human and utilizing the corresponding body shape predicted from the convolutional neural network as input to another convolutional neural network to predict additional body shape metrics.
BibTeX