Publications

DEPARTMENTS

Emperical Interference

Haptic Intelligence

Modern Magnetic Systems

Perceiving Systems

Physical Intelligence

Robotic Materials

Social Foundations of Computation


Research Groups

Autonomous Vision

Autonomous Learning

Bioinspired Autonomous Miniature Robots

Dynamic Locomotion

Embodied Vision

Human Aspects of Machine Learning

Intelligent Control Systems

Learning and Dynamical Systems

Locomotion in Biorobotic and Somatic Systems

Micro, Nano, and Molecular Systems

Movement Generation and Control

Neural Capture and Synthesis

Physics for Inference and Optimization

Organizational Leadership and Diversity

Probabilistic Learning Group


Topics

Robot Learning

Conference Paper

2022

Autonomous Learning

Robotics

AI

Career

Award


Perceiving Systems Conference Paper Collaborative Regression of Expressive Bodies using Moderation Feng, Y., Choutas, V., Bolkart, T., Tzionas, D., Black, M. J. 2021 International Conference on 3D Vision (3DV 2021), 792-804, IEEE, Piscataway, NJ, International Conference on 3D Vision (3DV 2021), December 2021 (Published)
Recovering expressive humans from images is essential for understanding human behavior. Methods that estimate 3D bodies, faces, or hands have progressed significantly, yet separately. Face methods recover accurate 3D shape and geometric details, but need a tight crop and struggle with extreme views and low resolution. Whole-body methods are robust to a wide range of poses and resolutions, but provide only a rough 3D face shape without details like wrinkles. To get the best of both worlds, we introduce PIXIE, which produces animatable, whole-body 3D avatars with realistic facial detail, from a single image. For this, PIXIE uses two key observations. First, existing work combines independent estimates from body, face, and hand experts, by trusting them equally. PIXIE introduces a novel moderator that merges the features of the experts, weighted by their confidence. All part experts can contribute to the whole, using SMPL-X’s shared shape space across all body parts. Second, human shape is highly correlated with gender, but existing work ignores this. We label training images as male, female, or non-binary, and train PIXIE to infer “gendered” 3D body shapes with a novel shape loss. In addition to 3D body pose and shape parameters, PIXIE estimates expression, illumination, albedo and 3D facial surface displacements. Quantitative and qualitative evaluation shows that PIXIE estimates more accurate whole-body shape and detailed face shape than the state of the art. Models and code are available at https://pixie.is.tue.mpg.de.
arXiv project pdf suppl DOI BibTeX
Thumb ticker lg pixie

Perceiving Systems Conference Paper Learning to Disambiguate Strongly Interacting Hands via Probabilistic Per-pixel Part Segmentation Fan, Z., Spurr, A., Kocabas, M., Tang, S., Black, M. J., Hilliges, O. 2021 International Conference on 3D Vision (3DV 2021), 1-10, IEEE, Piscataway, NJ, International Conference on 3D Vision (3DV 2021), December 2021 (Published)
In natural conversation and interaction, our hands often overlap or are in contact with each other. Due to the homogeneous appearance of hands, this makes estimating the 3D pose of interacting hands from images difficult. In this paper we demonstrate that self-similarity, and the resulting ambiguities in assigning pixel observations to the respective hands and their parts, is a major cause of the final 3D pose error. Motivated by this insight, we propose DIGIT, a novel method for estimating the 3D poses of two interacting hands from a single monocular image. The method consists of two interwoven branches that process the input imagery into a per-pixel semantic part segmentation mask and a visual feature volume. In contrast to prior work, we do not decouple the segmentation from the pose estimation stage, but rather leverage the per-pixel probabilities directly in the downstream pose estimation task. To do so, the part probabilities are merged with the visual features and processed via fully-convolutional layers. We experimentally show that the proposed approach achieves new state-of-the-art performance on the InterHand2.6M dataset for both single and interacting hands across all metrics. We provide detailed ablation studies to demonstrate the efficacy of our method and to provide insights into how the modelling of pixel ownership affects single and interacting hand pose estimation. Our code will be released for research purposes.
arXiv project code video DOI BibTeX
Thumb ticker lg hands3dv

Autonomous Vision Perceiving Systems Conference Paper MetaAvatar: Learning Animatable Clothed Human Models from Few Depth Images Wang, S., Mihajlovic, M., Ma, Q., Geiger, A., Tang, S. In Advances in Neural Information Processing Systems 34, 4:2810-2822, (Editors: Ranzato, M. and Beygelzimer, A. and Dauphin, Y. and Liang, P. S. and Wortman Vaughan, J.), Curran Associates, Inc., Red Hook, NY, 35th Conference on Neural Information Processing Systems (NeurIPS 2021), December 2021 (Published)
In this paper, we aim to create generalizable and controllable neural signed distance fields (SDFs) that represent clothed humans from monocular depth observations. Recent advances in deep learning, especially neural implicit representations, have enabled human shape reconstruction and controllable avatar generation from different sensor inputs. However, to generate realistic cloth deformations from novel input poses, watertight meshes or dense full-body scans are usually needed as inputs. Furthermore, due to the difficulty of effectively modeling pose-dependent cloth deformations for diverse body shapes and cloth types, existing approaches resort to per-subject/cloth-type optimization from scratch, which is computationally expensive. In contrast, we propose an approach that can quickly generate realistic clothed human avatars, represented as controllable neural SDFs, given only monocular depth images. We achieve this by using meta-learning to learn an initialization of a hypernetwork that predicts the parameters of neural SDFs. The hypernetwork is conditioned on human poses and represents a clothed neural avatar that deforms non-rigidly according to the input poses. Meanwhile, it is meta-learned to effectively incorporate priors of diverse body shapes and cloth types and thus can be much faster to fine-tune compared to models trained from scratch. We qualitatively and quantitatively show that our approach outperforms state-of-the-art approaches that require complete meshes as inputs while our approach requires only depth frames as inputs and runs orders of magnitudes faster. Furthermore, we demonstrate that our meta-learned hypernetwork is very robust, being the first to generate avatars with realistic dynamic cloth deformations given as few as 8 monocular depth frames.
Project page arXiv URL BibTeX
Thumb ticker lg metaavatar

Perceiving Systems Article The neural coding of face and body orientation in occipitotemporal cortex Foster, C., Zhao, M., Bolkart, T., Black, M. J., Bartels, A., Bülthoff, I. NeuroImage, 246:118783, December 2021 (Published)
Face and body orientation convey important information for us to understand other people's actions, intentions and social interactions. It has been shown that several occipitotemporal areas respond differently to faces or bodies of different orientations. However, whether face and body orientation are processed by partially overlapping or completely separate brain networks remains unclear, as the neural coding of face and body orientation is often investigated separately. Here, we recorded participants’ brain activity using fMRI while they viewed faces and bodies shown from three different orientations, while attending to either orientation or identity information. Using multivoxel pattern analysis we investigated which brain regions process face and body orientation respectively, and which regions encode both face and body orientation in a stimulus-independent manner. We found that patterns of neural responses evoked by different stimulus orientations in the occipital face area, extrastriate body area, lateral occipital complex and right early visual cortex could generalise across faces and bodies, suggesting a stimulus-independent encoding of person orientation in occipitotemporal cortex. This finding was consistent across functionally defined regions of interest and a whole-brain searchlight approach. The fusiform face area responded to face but not body orientation, suggesting that orientation responses in this area are face-specific. Moreover, neural responses to orientation were remarkably consistent regardless of whether participants attended to the orientation of faces and bodies or not. Together, these results demonstrate that face and body orientation are processed in a partially overlapping brain network, with a stimulus-independent neural code for face and body orientation in occipitotemporal cortex.
paper DOI BibTeX
Thumb ticker lg celianeuro

Perceiving Systems Article A pose-independent method for accurate and precise body composition from 3D optical scans Wong, M. C., Ng, B. K., Tian, I., Sobhiyeh, S., Pagano, I., Dechenaud, M., Kennedy, S. F., Liu, Y. E., Kelly, N., Chow, D., Garber, A. K., Maskarinec, G., Pujades, S., Black, M. J., Curless, B., Heymsfield, S. B., Shepherd, J. A. Obesity, 29(11):1835-1847, Wiley, November 2021 (Published)
Objective: The aim of this study was to investigate whether digitally re-posing three-dimensional optical (3DO) whole-body scans to a standardized pose would improve body composition accuracy and precision regardless of the initial pose. Methods: Healthy adults (n = 540), stratified by sex, BMI, and age, completed whole-body 3DO and dual-energy X-ray absorptiometry (DXA) scans in the Shape Up! Adults study. The 3DO mesh vertices were represented with standardized templates and a low-dimensional space by principal component analysis (stratified by sex). The total sample was split into a training (80%) and test (20%) set for both males and females. Stepwise linear regression was used to build prediction models for body composition and anthropometry outputs using 3DO principal components (PCs). Results: The analysis included 472 participants after exclusions. After re-posing, three PCs described 95% of the shape variance in the male and female training sets. 3DO body composition accuracy compared with DXA was as follows: fat mass R2 = 0.91 male, 0.94 female; fat-free mass R2 = 0.95 male, 0.92 female; visceral fat mass R2 = 0.77 male, 0.79 female. Conclusions: Re-posed 3DO body shape PCs produced more accurate and precise body composition models that may be used in clinical or nonclinical settings when DXA is unavailable or when frequent ionizing radiation exposure is unwanted.
Wiley online adress Author Version DOI BibTeX
Thumb ticker lg tpose obesity 21

Perceiving Systems Conference Paper How much coffee was consumed during EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI Kalyan, A., Kumar, A., Chandrasekaran, A., Sabharwal, A., Clark, P. In 2021 Conference on Empirical Methods in Natural Language Processing - Proceedings of the Conference, 7318-7328, (Editors: Moens, Marie-Francine and Huang, Xuanjing and Specia, Lucia and Wen-tau Yih, Scott), Association for Computational Linguistics, Stroudsburg, PA, Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), November 2021 (Published)
Many real-world problems require the combined application of multiple reasoning abilities -- employing suitable abstractions, commonsense knowledge, and creative synthesis of problem-solving strategies. To help advance AI systems towards such capabilities, we propose a new reasoning challenge, namely Fermi Problems (FPs), which are questions whose answers can only be approximately estimated because their precise computation is either impractical or impossible. For example, "How much would the sea level rise if all ice in the world melted?" FPs are commonly used in quizzes and interviews to bring out and evaluate the creative reasoning abilities of humans. To do the same for AI systems, we present two datasets: 1) A collection of 1k real-world FPs sourced from quizzes and olympiads; and 2) a bank of 10k synthetic FPs of intermediate complexity to serve as a sandbox for the harder real-world challenge. In addition to question-answer pairs, the datasets contain detailed solutions in the form of an executable program and supporting facts, helping in supervision and evaluation of intermediate steps. We demonstrate that even extensively fine-tuned large-scale language models perform poorly on these datasets, on average making estimates that are off by two orders of magnitude. Our contribution is thus the crystallization of several unsolved AI problems into a single, new challenge that we hope will spur further advances in building systems that can reason.
paper project webpage data DOI BibTeX
Thumb ticker lg fermi teaser cropped

Perceiving Systems Conference Paper Action-Conditioned 3D Human Motion Synthesis with Transformer VAE Petrovich, M., Black, M. J., Varol, G. In Proc. International Conference on Computer Vision (ICCV), 10965-10975, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
We tackle the problem of action-conditioned generation of realistic and diverse human motion sequences. In contrast to methods that complete, or extend, motion sequences, this task does not require an initial pose or sequence. Here we learn an action-aware latent representation for human motions by training a generative variational autoencoder (VAE). By sampling from this latent space and querying a certain duration through a series of positional encodings, we synthesize variable-length motion sequences conditioned on a categorical action. Specifically, we design a Transformer-based architecture, ACTOR, for encoding and decoding a sequence of parametric SMPL human body models estimated from action recognition datasets. We evaluate our approach on the NTU RGB+D, HumanAct12 and UESTC datasets and show improvements over the state of the art. Furthermore, we present two use cases: improving action recognition through adding our synthesized data to training, and motion denoising. Code and models are available on our project page.
website code paper-arxiv video DOI BibTeX
Thumb ticker lg actor

Perceiving Systems Conference Paper Active Domain Adaptation via Clustering Uncertainty-weighted Embeddings Prabhu, V., Chandrasekaran, A., Saenko, K., Hoffman, J. In Proc. International Conference on Computer Vision (ICCV), 8485-8494, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Generalizing deep neural networks to new target domains is critical to their real-world utility. In practice, it may be feasible to get some target data labeled, but to be cost-effective it is desirable to select a subset that is maximally-informative via active learning (AL). In this work, we study the problem of AL under a domain shift. We empirically demonstrate how existing AL approaches based solely on model uncertainty or representative sampling are suboptimal for active domain adaptation. Our algorithm, Active Domain Adaptation via CLustering Uncertainty-weighted Embeddings (ADA-CLUE), i) identifies diverse datapoints for labeling that are both uncertain under the model and representative of unlabeled target data, and ii) leverages the available source and target data for adaptation by optimizing a semi-supervised adversarial entropy loss that is complimentary to our active sampling objective. On standard image classification benchmarks for domain adaptation, ADA-CLUE consistently performs as well or better than competing active adaptation, active learning, and domain adaptation methods across shift severities, model initializations, and labeling budgets.
paper project webpage Code Video DOI BibTeX
Thumb ticker lg adaclue cropped image

Perceiving Systems Conference Paper Learning Realistic Human Reposing using Cyclic Self-Supervision with 3D Shape, Pose, and Appearance Consistency Sanyal, S., Vorobiov, A., Bolkart, T., Loper, M., Mohler, B., Davis, L., Romero, J., Black, M. J. In Proc. International Conference on Computer Vision (ICCV), 11118-11127, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Synthesizing images of a person in novel poses from a single image is a highly ambiguous task. Most existing approaches require paired training images; i.e. images of the same person with the same clothing in different poses. However, obtaining sufficiently large datasets with paired data is challenging and costly. Previous methods that forego paired supervision lack realism. We propose a self-supervised framework named SPICE (Self-supervised Person Image CrEation) that closes the image quality gap with supervised methods. The key insight enabling self-supervision is to exploit 3D information about the human body in several ways. First, the 3D body shape must remain unchanged when reposing. Second, representing body pose in 3D enables reasoning about self occlusions. Third, 3D body parts that are visible before and after reposing, should have similar appearance features. Once trained, SPICE takes an image of a person and generates a new image of that person in a new target pose. SPICE achieves state-of-the-art performance on the DeepFashion dataset, improving the FID score from 29.9 to 7.8 compared with previous unsupervised methods, and with performance similar to the state-of-the-art supervised method (6.4). SPICE also generates temporally coherent videos given an input image and a sequence of poses, despite being trained on static images only.
pdf arxiv DOI BibTeX
Thumb ticker lg screenshot from 2021 10 06 21 50 55

Perceiving Systems Conference Paper Learning To Regress Bodies From Images Using Differentiable Semantic Rendering Dwivedi, S. K., Athanasiou, N., Kocabas, M., Black, M. J. In Proc. International Conference on Computer Vision (ICCV), 11230-11239, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Learning to regress 3D human body shape and pose (e.g. SMPL parameters) from monocular images typically exploits losses on 2D keypoints, silhouettes, and/or part segmentation when 3D training data is not available. Such losses, however, are limited because 2D keypoints do not supervise body shape and segmentations of people in clothing do not match projected minimally-clothed SMPL shapes. To exploit richer image information about clothed people, we introduce higher-level semantic information about clothing to penalize clothed and non-clothed regions of the human body differently. To do so, we train a body regressor using a novel “Differentiable Semantic Rendering (DSR)” loss. For Minimally-Clothed (MC) regions, we define the DSRMC loss, which encourages a tight match between a rendered SMPL body and the minimally-clothed regions of the image. For clothed regions, we define the DSR-C loss to encourage the rendered SMPL body to be inside the clothing mask. To ensure end-to-end differentiable training, we learn a semantic clothing prior for SMPL vertices from thousands of clothed human scans. We perform extensive qualitative and quantitative experiments to evaluate the role of clothing semantics on the accuracy of 3D human pose and shape estimation. We outperform all previous state-of-the-art methods on 3DPW and Human3.6M and obtain on par results on MPI-INF-3DHP. Code and trained models are available for research at https://dsr.is.tue.mpg.de/
pdf supp code project-website video poster DOI BibTeX
Thumb ticker lg dsr1

Perceiving Systems Conference Paper Monocular, One-Stage, Regression of Multiple 3D People Sun, Y., Bao, Q., Liu, W., Fu, Y., Black, M. J., Mei, T. In Proc. International Conference on Computer Vision (ICCV), 11159-11168, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
This paper focuses on the regression of multiple 3D people from a single RGB image. Existing approaches predominantly follow a multi-stage pipeline that first detects people in bounding boxes and then independently regresses their 3D body meshes. In contrast, we propose to Regress all meshes in a One-stage fashion for Multiple 3D People (termed ROMP). The approach is conceptually simple, bounding box-free, and able to learn a per-pixel representation in an end-to-end manner. Our method simultaneously predicts a Body Center heatmap and a Mesh Parameter map, which can jointly describe the 3D body mesh on the pixel level. Through a body-center-guided sampling process, the body mesh parameters of all people in the image are easily extracted from the Mesh Parameter map. Equipped with such a fine-grained representation, our one-stage framework is free of the complex multi-stage process and more robust to occlusion. Compared with state-of-the-art methods, ROMP achieves superior performance on the challenging multi-person benchmarks, including 3DPW and CMU Panoptic. Experiments on crowded/occluded datasets demonstrate the robustness under various types of occlusion. The released code is the first real-time implementation of monocular multi-person 3D mesh regression.
pdf supp arXiv code DOI BibTeX
Thumb ticker lg romp2

Perceiving Systems Conference Paper Stochastic Scene-Aware Motion Prediction Hassan, M., Ceylan, D., Villegas, R., Saito, J., Yang, J., Zhou, Y., Black, M. In Proc. International Conference on Computer Vision (ICCV), 11354-11364, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
A long-standing goal in computer vision is to capture, model, and realistically synthesize human behavior. Specifically, by learning from data, our goal is to enable virtual humans to navigate within cluttered indoor scenes and naturally interact with objects. Such embodied behavior has applications in virtual reality, computer games, and robotics, while synthesized behavior can be used as training data. The problem is challenging because real human motion is diverse and adapts to the scene. For example, a person can sit or lie on a sofa in many places and with varying styles. We must model this diversity to synthesize virtual humans that realistically perform human-scene interactions. We present a novel data-driven, stochastic motion synthesis method that models different styles of performing a given action with a target object. Our Scene-Aware Motion Prediction method (SAMP) generalizes to target objects of various geometries while enabling the character to navigate in cluttered scenes. To train SAMP, we collected mocap data covering various sitting, lying down, walking, and running styles. We demonstrate SAMP on complex indoor scenes and achieve superior performance than existing solutions.
Project Page pdf DOI BibTeX
Thumb ticker lg samp3

Perceiving Systems Conference Paper The Power of Points for Modeling Humans in Clothing Ma, Q., Yang, J., Tang, S., Black, M. J. In Proc. International Conference on Computer Vision (ICCV), 10954-10964, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Currently it requires an artist to create 3D human avatars with realistic clothing that can move naturally. Despite progress on 3D scanning and modeling of human bodies, there is still no technology that can easily turn a static scan into an animatable avatar. Automating the creation of such avatars would enable many applications in games, social networking, animation, and AR/VR to name a few. The key problem is one of representation. Standard 3D meshes are widely used in modeling the minimally-clothed body but do not readily capture the complex topology of clothing. Recent interest has shifted to implicit surface models for this task but they are computationally heavy and lack compatibility with existing 3D tools. What is needed is a 3D representation that can capture varied topology at high resolution and that can be learned from data. We argue that this representation has been with us all along — the point cloud. Point clouds have properties of both implicit and explicit representations that we exploit to model 3D garment geometry on a human body. We train a neural network with a novel local clothing geometric feature to represent the shape of different outfits. The network is trained from 3D point clouds of many types of clothing, on many bodies, in many poses, and learns to model pose-dependent clothing deformations. The geometry feature can be optimized to fit a previously unseen scan of a person in clothing, enabling the scan to be reposed realistically. Our model demonstrates superior quantitative and qualitative results in both multi-outfit modeling and unseen outfit animation. The code is available for research purposes.
Project Page Code Video Dataset arXiv PDF Supp. Poster DOI BibTeX
Thumb ticker lg pop2

Perceiving Systems Conference Paper Topologically Consistent Multi-View Face Inference Using Volumetric Sampling Li, T., Liu, S., Bolkart, T., Liu, J., Li, H., Zhao, Y. In Proc. International Conference on Computer Vision (ICCV), 3804-3814, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
High-fidelity face digitization solutions often combine multi-view stereo (MVS) techniques for 3D reconstruction and a non-rigid registration step to establish dense correspondence across identities and expressions. A common problem is the need for manual clean-up after the MVS step, as 3D scans are typically affected by noise and outliers and contain hairy surface regions that need to be cleaned up by artists. Furthermore, mesh registration tends to fail for extreme facial expressions. Most learning-based methods use an underlying 3D morphable model (3DMM) to ensure robustness, but this limits the output accuracy for extreme facial expressions. In addition, the global bottleneck of regression architectures cannot produce meshes that tightly fit the ground truth surfaces. We propose ToFu, Topologically consistent Face from multi-view, a geometry inference framework that can produce topologically consistent meshes across facial identities and expressions using a volumetric representation instead of an explicit underlying 3DMM. Our novel progressive mesh generation network embeds the topological structure of the face in a feature volume, sampled from geometry-aware local features. A coarse-to-fine architecture facilitates dense and accurate facial mesh predictions in a consistent mesh topology. ToFu further captures displacement maps for pore-level geometric details and facilitates high-quality rendering in the form of albedo and specular reflectance maps. These high-quality assets are readily usable by production studios for avatar creation, animation and physically-based skin rendering. We demonstrate state-of-the-art geometric and correspondence accuracy, while only taking 0.385 seconds to compute a mesh with 10K vertices, which is three orders of magnitude faster than traditional techniques. The code and the model are available for research purposes at https://tianyeli.github.io/tofu.
project paper DOI BibTeX
Thumb ticker lg tofu

Perceiving Systems Conference Paper PARE: Part Attention Regressor for 3D Human Body Estimation Kocabas, M., Huang, C. P., Hilliges, O., Black, M. J. In Proc. International Conference on Computer Vision (ICCV), 11107-11117, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Despite significant progress, state of the art 3D human pose and shape estimation methods remain sensitive to partial occlusion and can produce dramatically wrong predictions although much of the body is observable. To address this, we introduce a soft attention mechanism, called the Part Attention REgressor (PARE), that learns to predict body-part-guided attention masks. We observe that state-of-the-art methods rely on global feature representations, making them sensitive to even small occlusions. In contrast, PARE's part-guided attention mechanism overcomes these issues by exploiting information about the visibility of individual body parts while leveraging information from neighboring body-parts to predict occluded parts. We show qualitatively that PARE learns sensible attention masks, and quantitative evaluation confirms that PARE achieves more accurate and robust reconstruction results than existing approaches on both occlusion-specific and standard benchmarks.
pdf supp code video arXiv project website poster DOI BibTeX
Thumb ticker lg website teaser v3

Perceiving Systems Autonomous Vision Conference Paper SNARF: Differentiable Forward Skinning for Animating Non-Rigid Neural Implicit Shapes Chen, X., Zheng, Y., Black, M. J., Hilliges, O., Geiger, A. In Proc. International Conference on Computer Vision (ICCV), 11574-11584, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Neural implicit surface representations have emerged as a promising paradigm to capture 3D shapes in a continuous and resolution-independent manner. However, adapting them to articulated shapes is non-trivial. Existing approaches learn a backward warp field that maps deformed to canonical points. However, this is problematic since the backward warp field is pose dependent and thus requires large amounts of data to learn. To address this, we introduce SNARF, which combines the advantages of linear blend skinning (LBS) for polygonal meshes with those of neural implicit surfaces by learning a forward deformation field without direct supervision. This deformation field is defined in canonical, pose-independent, space, enabling generalization to unseen poses. Learning the deformation field from posed meshes alone is challenging since the correspondences of deformed points are defined implicitly and may not be unique under changes of topology. We propose a forward skinning model that finds all canonical correspondences of any deformed point using iterative root finding. We derive analytical gradients via implicit differentiation, enabling end-to-end training from 3D meshes with bone transformations. Compared to state-of-the-art neural implicit representations, our approach generalizes better to unseen poses while preserving accuracy. We demonstrate our method in challenging scenarios on (clothed) 3D humans in diverse and unseen poses.
pdf pdf 2 supplementary material project blog blog 2 video video 2 code DOI URL BibTeX
Thumb ticker lg snarf

Perceiving Systems Conference Paper SOMA: Solving Optical Marker-Based MoCap Automatically Ghorbani, N., Black, M. J. In Proc. International Conference on Computer Vision (ICCV), 11097-11106, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Marker-based optical motion capture (mocap) is the “gold standard” method for acquiring accurate 3D human motion in computer vision, medicine, and graphics. The raw output of these systems are noisy and incomplete 3D points or short tracklets of points. To be useful, one must associate these points with corresponding markers on the captured subject; i.e. “labelling”. Given these labels, one can then “solve” for the 3D skeleton or body surface mesh. Commercial auto-labeling tools require a specific calibration procedure at capture time, which is not possible for archival data. Here we train a novel neural network called SOMA, which takes raw mocap point clouds with varying numbers of points, labels them at scale without any calibration data, independent of the capture technology, and requiring only minimal human intervention. Our key insight is that, while labeling point clouds is highly ambiguous, the 3D body provides strong constraints on the solution that can be exploited by a learning-based method. To enable learning, we generate massive training sets of simulated noisy and ground truth mocap markers animated by 3D bodies from AMASS. SOMA exploits an architecture with stacked self-attention elements to learn the spatial structure of the 3D body and an optimal transport layer to constrain the assignment (labeling) problem while rejecting outliers. We extensively evaluate SOMA both quantitatively and qualitatively. SOMA is more accurate and robust than existing state of the art research methods and can be applied where commercial systems cannot. We automatically label over 8 hours of archival mocap data across 4 different datasets captured using various technologies and output SMPL-X body models. The model and data is released for research purposes at https://soma.is.tue.mpg.de/.
code pdf suppl arxiv project website video poster DOI BibTeX
Thumb ticker lg teaser

Perceiving Systems Conference Paper SPEC: Seeing People in the Wild with an Estimated Camera Kocabas, M., Huang, C. P., Tesch, J., Müller, L., Hilliges, O., Black, M. J. In Proc. International Conference on Computer Vision (ICCV), 11015-11025, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Due to the lack of camera parameter information for in-the-wild images, existing 3D human pose and shape (HPS) estimation methods make several simplifying assumptions: weak-perspective projection, large constant focal length, and zero camera rotation. These assumptions often do not hold and we show, quantitatively and qualitatively, that they cause errors in the reconstructed 3D shape and pose. To address this, we introduce SPEC, the first in-the-wild 3D HPS method that estimates the perspective camera from a single image and employs this to reconstruct 3D human bodies more accurately. First, we train a neural network to estimate the field of view, camera pitch, and roll given an input image. We employ novel losses that improve the calibration accuracy over previous work. We then train a novel network that concatenates the camera calibration to the image features and uses these together to regress 3D body shape and pose. SPEC is more accurate than the prior art on the standard benchmark (3DPW) as well as two new datasets with more challenging camera views and varying focal lengths. Specifically, we create a new photorealistic synthetic dataset (SPEC-SYN) with ground truth 3D bodies and a novel in-the-wild dataset (SPEC-MTP) with calibration and high-quality reference bodies.
pdf supp arXiv code video project website poster DOI BibTeX
Thumb ticker lg webpage teaser

Perceiving Systems Conference Paper Active Visual SLAM with Independently Rotating Camera Bonetto, E., Goldschmid, P., Black, M. J., Ahmad, A. In 2021 European Conference on Mobile Robots (ECMR 2021), 266-273, IEEE, Piscataway, NJ, European Conference on Mobile Robots (ECMR 2021), September 2021 (Published)
In active Visual-SLAM (V-SLAM), a robot relies on the information retrieved by its cameras to control its own movements for autonomous mapping of the environment. Cameras are usually statically linked to the robot's body, limiting the extra degrees of freedom for visual information acquisition. In this work, we overcome the aforementioned problem by introducing and leveraging an independently rotating camera on the robot base. This enables us to continuously control the heading of the camera, obtaining the desired optimal orientation for active V-SLAM, without rotating the robot itself. However, this additional degree of freedom introduces additional estimation uncertainties, which need to be accounted for. We do this by extending our robot's state estimate to include the camera state and jointly estimate the uncertainties. We develop our method based on a state-of-the-art active V-SLAM approach for omnidirectional robots and evaluate it through rigorous simulation and real robot experiments. We obtain more accurate maps, with lower energy consumption, while maintaining the benefits of the active approach with respect to the baseline. We also demonstrate how our method easily generalizes to other non-omnidirectional robotic platforms, which was a limitation of the previous approach. Code and implementation details are provided as open-source.
Code Data DOI URL BibTeX
Thumb ticker lg untitled

Perceiving Systems Article Separated and overlapping neural coding of face and body identity Foster, C., Zhao, M., Bolkart, T., Black, M. J., Bartels, A., Bülthoff, I. Human Brain Mapping, 42(13):4242-4260, September 2021 (Published)
Recognising a person's identity often relies on face and body information, and is tolerant to changes in low-level visual input (e.g., viewpoint changes). Previous studies have suggested that face identity is disentangled from low-level visual input in the anterior face-responsive regions. It remains unclear which regions disentangle body identity from variations in viewpoint, and whether face and body identity are encoded separately or combined into a coherent person identity representation. We trained participants to recognise three identities, and then recorded their brain activity using fMRI while they viewed face and body images of these three identities from different viewpoints. Participants' task was to respond to either the stimulus identity or viewpoint. We found consistent decoding of body identity across viewpoint in the fusiform body area, right anterior temporal cortex, middle frontal gyrus and right insula. This finding demonstrates a similar function of fusiform and anterior temporal cortex for bodies as has previously been shown for faces, suggesting these regions may play a general role in extracting high-level identity information. Moreover, we could decode identity across fMRI activity evoked by faces and bodies in the early visual cortex, right inferior occipital cortex, right parahippocampal cortex and right superior parietal cortex, revealing a distributed network that encodes person identity abstractly. Lastly, identity decoding was consistently better when participants attended to identity, indicating that attention to identity enhances its neural representation. These results offer new insights into how the brain develops an abstract neural coding of person identity, shared by faces and bodies.
on-line DOI BibTeX
Thumb ticker lg celia

Perceiving Systems Patent Skinned multi-infant linear body model Hesse, N., Pujades, S., Romero, J., Black, M. (US Patent 11,127,163, 2021), September 2021 (Published)
A computer-implemented method for automatically obtaining pose and shape parameters of a human body. The method includes obtaining a sequence of digital 3D images of the body, recorded by at least one depth camera; automatically obtaining pose and shape parameters of the body, based on images of the sequence and a statistical body model; and outputting the pose and shape parameters. The body may be an infant body.
BibTeX
Thumb ticker lg smilpatent

Perceiving Systems Article The role of sexual orientation in the relationships between body perception, body weight dissatisfaction, physical comparison, and eating psychopathology in the cisgender population Meneguzzo, P., Collantoni, E., Bonello, E., Vergine, M., Behrens, S. C., Tenconi, E., Favaro, A. Eating and Weight Disorders - Studies on Anorexia, Bulimia and Obesity , 26(6):1985-2000, Springer, August 2021 (Published)
Purpose: Body weight dissatisfaction (BWD) and visual body perception are specific aspects that can influence the own body image, and that can concur with the development or the maintenance of specific psychopathological dimensions of different psychiatric disorders. The sexual orientation is a fundamental but understudied aspect in this field, and, for this reason, the purpose of this study is to improve knowledge about the relationships among BWD, visual body size-perception, and sexual orientation. Methods: A total of 1033 individuals participated in an online survey. Physical comparison, depression, and self-esteem was evaluated, as well as sexual orientation and the presence of an eating disorder. A Figure Rating Scale was used to assess different valences of body weight, and mediation analyses were performed to investigated specific relationships between psychological aspects. Results: Bisexual women and gay men reported significantly higher BWD than other groups (p < 0.001); instead, higher body misperception was present in gay men (p = 0.001). Physical appearance comparison mediated the effect of sexual orientation in both BWD and perceptual distortion. No difference emerged between women with a history of eating disorders and without, as regards the value of body weight attributed to attractiveness, health, and presence on social media. Conclusion: This study contributes to understanding the relationship between sexual orientations and body image representation and evaluation. Physical appearance comparisons should be considered as critical psychological factors that can improve and affect well-being. The impact on subjects with high levels of eating concerns is also discussed. Level of evidence: Level III: case–control analytic study.
DOI BibTeX
Thumb ticker lg behrens

Perceiving Systems Article Learning an Animatable Detailed 3D Face Model from In-the-Wild Images Feng, Y., Feng, H., Black, M. J., Bolkart, T. ACM Transactions on Graphics, 40(4):88:1-88:13, August 2021 (Published)
While current monocular 3D face reconstruction methods can recover fine geometric details, they suffer several limitations. Some methods produce faces that cannot be realistically animated because they do not model how wrinkles vary with expression. Other methods are trained on high-quality face scans and do not generalize well to in-the-wild images. We present the first approach that regresses 3D face shape and animatable details that are specific to an individual but change with expression. Our model, DECA (Detailed Expression Capture and Animation), is trained to robustly produce a UV displacement map from a low-dimensional latent representation that consists of person-specific detail parameters and generic expression parameters, while a regressor is trained to predict detail, shape, albedo, expression, pose and illumination parameters from a single image. To enable this, we introduce a novel detail-consistency loss that disentangles person-specific details from expression-dependent wrinkles. This disentanglement allows us to synthesize realistic person-specific wrinkles by controlling expression parameters while keeping person-specific details unchanged. DECA is learned from in-the-wild images with no paired 3D supervision and achieves state-of-the-art shape reconstruction accuracy on two benchmarks. Qualitative results on in-the-wild data demonstrate DECA's robustness and its ability to disentangle identity- and expression-dependent details enabling animation of reconstructed faces. The model and code are publicly available at https://deca.is.tue.mpg.de.
pdf Sup Mat code video talk DOI BibTeX
Thumb ticker lg deca

Perceiving Systems Conference Paper On Self-Contact and Human Pose Müller, L., Osman, A. A. A., Tang, S., Huang, C. P., Black, M. J. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 9985-9994, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), June 2021 (Published)
People touch their face 23 times an hour, they cross their arms and legs, put their hands on their hips, etc. While many images of people contain some form of self-contact, current 3D human pose and shape (HPS) regression methods typically fail to estimate this contact. To address this, we develop new datasets and methods that significantly improve human pose estimation with self-contact. First, we create a dataset of 3D Contact Poses (3DCP) containing SMPL-X bodies fit to 3D scans as well as poses from AMASS, which we refine to ensure good contact. Second, we leverage this to create the Mimic-The-Pose (MTP) dataset of images, collected via Amazon Mechanical Turk, containing people mimicking the 3DCP poses with self-contact. Third, we develop a novel HPS optimization method, SMPLify-XMC, that includes contact constraints and uses the known 3DCP body pose during fitting to create near ground-truth poses for MTP images. Fourth, for more image variety, we label a dataset of in-the-wild images with Discrete Self-Contact (DSC) information and use another new optimization method, SMPLify-DC, that exploits discrete contacts during pose optimization. Finally, we use our datasets during SPIN training to learn a new 3D human pose regressor, called TUCH (Towards Understanding Contact in Humans). We show that the new self-contact training data significantly improves 3D human pose estimates on withheld test data and existing datasets like 3DPW. Not only does our method improve results for self-contact poses, but it also improves accuracy for non-contact poses. The code and data are available for research purposes at https://tuch.is.tue.mpg.de.
project arXiv poster video code DOI BibTeX
Thumb ticker lg mpi website teaser

Perceiving Systems Conference Paper Populating 3D Scenes by Learning Human-Scene Interaction Hassan, M., Ghosh, P., Tesch, J., Tzionas, D., Black, M. J. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 14703-14713, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), June 2021 (Published)
Humans live within a 3D space and constantly interact with it to perform tasks. Such interactions involve physical contact between surfaces that is semantically meaningful. Our goal is to learn how humans interact with scenes and leverage this to enable virtual characters to do the same. To that end, we introduce a novel Human-Scene Interaction (HSI) model that encodes proximal relationships, called POSA for “Pose with prOximitieS and contActs”. The representation of interaction is body-centric, which enables it to generalize to new scenes. Specifically, POSA augments the SMPL-X parametric human body model such that, for every mesh vertex, it encodes (a) the contact probability with the scene surface and (b) the corresponding semantic scene label. We learn POSA with a VAE conditioned on the SMPL-X vertices, and train on the PROX dataset, which contains SMPL-X meshes of people interacting with 3D scenes, and the corresponding scene semantics from the PROX-E dataset. We demonstrate the value of POSA with two applications. First, we automatically place 3D scans of people in scenes. We use a SMPL-X model fit to the scan as a proxy and then find its most likely placement in 3D. POSA provides an effective representation to search for “affordances” in the scene that match the likely contact relationships for that pose. We perform a perceptual study that shows significant improvement over the state of the art on this task. Second, we show that POSA’s learned representation of body-scene interaction supports monocular human pose estimation that is consistent with a 3D scene, improving on the state of the art. Our model and code are available for research purposes at https://posa.is.tue.mpg.de.
project pdf poster video DOI BibTeX
Thumb ticker lg posa

Perceiving Systems Article Red shape, blue shape: Political ideology influences the social perception of body shape Quiros-Ramirez, M. A., Streuber, S., Black, M. J. Nature Humanities and Social Sciences Communications, 8:148, June 2021 (Published)
Political elections have a profound impact on individuals and societies. Optimal voting is thought to be based on informed and deliberate decisions yet, it has been demonstrated that the outcomes of political elections are biased by the perception of candidates’ facial features and the stereotypical traits voters attribute to these. Interestingly, political identification changes the attribution of stereotypical traits from facial features. This study explores whether the perception of body shape elicits similar effects on political trait attribution and whether these associations can be visualized. In Experiment 1, ratings of 3D body shapes were used to model the relationship between perception of 3D body shape and the attribution of political traits such as ‘Republican’, ‘Democrat’, or ‘Leader’. This allowed analyzing and visualizing the mental representations of stereotypical 3D body shapes associated with each political trait. Experiment 2 was designed to test whether political identification of the raters affected the attribution of political traits to different types of body shapes. The results show that humans attribute political traits to the same body shapes differently depending on their own political preference. These findings show that our judgments of others are influenced by their body shape and our own political views. Such judgments have potential political and societal implications.
pdf on-line sup. mat. sup. figure author pdf DOI BibTeX
Thumb ticker lg rsbsfig6crop

Perceiving Systems Conference Paper We are More than Our Joints: Predicting how 3D Bodies Move Zhang, Y., Black, M. J., Tang, S. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 3371-3381, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), June 2021 (Published)
A key step towards understanding human behavior is the prediction of 3D human motion. Successful solutions have many applications in human tracking, HCI, and graphics. Most previous work focuses on predicting a time series of future 3D joint locations given a sequence 3D joints from the past. This Euclidean formulation generally works better than predicting pose in terms of joint rotations. Body joint locations, however, do not fully constrain 3D human pose, leaving degrees of freedom (like rotation about a limb) undefined. Note that 3D joints can be viewed as a sparse point cloud. Thus the problem of human motion prediction can be seen as a problem of point cloud prediction. With this observation, we instead predict a sparse set of locations on the body surface that correspond to motion capture markers. Given such markers, we fit a parametric body model to recover the 3D body of the person. These sparse surface markers also carry detailed information about human movement that is not present in the joints, increasing the naturalness of the predicted motions. Using the AMASS dataset, we train MOJO (More than Our JOints), which is a novel variational autoencoder with a latent DCT space that generates motions from latent frequencies. MOJO preserves the full temporal resolution of the input motion, and sampling from the latent frequencies explicitly introduces high-frequency components into the generated motion. We note that motion prediction methods accumulate errors over time, resulting in joints or markers that diverge from true human bodies. To address this, we fit the SMPL-X body model to the predictions at each time step, projecting the solution back onto the space of valid bodies, before propagating the new markers in time. Quantitative and qualitative experiments show that our approach produces state-of-the-art results and realistic 3D body animations. The code is available for research purposes at https://yz-cnsdqz.github.io/MOJO/MOJO.html.
code arXiv DOI BibTeX
Thumb ticker lg mojo

Perceiving Systems Conference Paper AGORA: Avatars in Geography Optimized for Regression Analysis Patel, P., Huang, C. P., Tesch, J., Hoffmann, D. T., Tripathi, S., Black, M. J. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 13463-13473, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), June 2021 (Published)
While the accuracy of 3D human pose estimation from images has steadily improved on benchmark datasets, the best methods still fail in many real-world scenarios. This suggests that there is a domain gap between current datasets and common scenes containing people. To obtain ground-truth 3D pose, current datasets limit the complexity of clothing, environmental conditions, number of subjects, and occlusion. Moreover, current datasets evaluate sparse 3D joint locations corresponding to the major joints of the body, ignoring the hand pose and the face shape. To evaluate the current state-of-the-art methods on more challenging images, and to drive the field to address new problems, we introduce AGORA, a synthetic dataset with high realism and highly accurate ground truth. Here we use 4240 commercially-available, high-quality, textured human scans in diverse poses and natural clothing; this includes 257 scans of children. We create reference 3D poses and body shapes by fitting the SMPL-X body model (with face and hands) to the 3D scans, taking into account clothing. We create around 14K training and 3K test images by rendering between 5 and 15 people per image us- ing either image-based lighting or rendered 3D environments, taking care to make the images physically plausible and photoreal. In total, AGORA consists of 173K individual person crops. We evaluate existing state-of-the- art methods for 3D human pose estimation on this dataset. and find that most methods perform poorly on images of children. Hence, we extend the SMPL-X model to better capture the shape of children. Additionally, we fine- tune methods on AGORA and show improved performance on both AGORA and 3DPW, confirming the realism of the dataset. We provide all the registered 3D reference training data, rendered images, and a web-based evaluation site at https://agora.is.tue.mpg.de/.
dataset pdf video DOI BibTeX
Thumb ticker lg agora

Perceiving Systems Conference Paper BABEL: Bodies, Action and Behavior with English Labels Punnakkal, A. R., Chandrasekaran, A., Athanasiou, N., Quiros-Ramirez, M. A., Black, M. J. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 722-731, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021) , June 2021 (Published)
Understanding the semantics of human movement -- the what, how and why of the movement -- is an important problem that requires datasets of human actions with semantic labels. Existing datasets take one of two approaches. Large-scale video datasets contain many action labels but do not contain ground-truth 3D human motion. Alternatively, motion-capture (mocap) datasets have precise body motions but are limited to a small number of actions. To address this, we present BABEL, a large dataset with language labels describing the actions being performed in mocap sequences. BABEL consists of action labels for about 43.5 hours of mocap sequences from AMASS. Action labels are at two levels of abstraction -- sequence labels which describe the overall action in the sequence, and frame labels which describe all actions in every frame of the sequence. Each frame label is precisely aligned with the duration of the corresponding action in the mocap sequence, and multiple actions can overlap. There are over 28k sequence labels, and 63k frame labels in BABEL, which belong to over 250 unique action categories. Labels from BABEL can be leveraged for tasks like action recognition, temporal action localization, motion synthesis, etc. To demonstrate the value of BABEL as a benchmark, we evaluate the performance of models on 3D action recognition. We demonstrate that BABEL poses interesting learning challenges that are applicable to real-world scenarios, and can serve as a useful benchmark of progress in 3D action recognition. The dataset, baseline method, and evaluation code is made available, and supported for academic research purposes at https://babel.is.tue.mpg.de/.
dataset poster pdf sup mat video code DOI BibTeX
Thumb ticker lg babel cropped teaser

Perceiving Systems Conference Paper LEAP: Learning Articulated Occupancy of People Mihajlovic, M., Zhang, Y., Black, M. J., Tang, S. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 10456-10466, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), June 2021 (Published)
Substantial progress has been made on modeling rigid 3D objects using deep implicit representations. Yet, extending these methods to learn neural models of human shape is still in its infancy. Human bodies are complex and the key challenge is to learn a representation that generalizes such that it can express body shape deformations for unseen subjects in unseen, highly-articulated, poses. To address this challenge, we introduce LEAP (LEarning Articulated occupancy of People), a novel neural occupancy representation of the human body. Given a set of bone transformations (i.e. joint locations and rotations) and a query point in space, LEAP first maps the query point to a canonical space via learned linear blend skinning (LBS) functions and then efficiently queries the occupancy value via an occupancy network that models accurate identity- and pose- dependent deformations in the canonical space. Experiments show that our canonicalized occupancy estimation with the learned LBS functions greatly improves the generalization capability of the learned occupancy representation across various human shapes and poses, outperforming existing solutions in all settings.
project arXiv pdf code DOI BibTeX
Thumb ticker lg leap3

Perceiving Systems Conference Paper SCALE: Modeling Clothed Humans with a Surface Codec of Articulated Local Elements Ma, Q., Saito, S., Yang, J., Tang, S., Black, M. J. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 16077-16088, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), June 2021 (Published)
Learning to model and reconstruct humans in clothing is challenging due to articulation, non-rigid deformation, and varying clothing types and topologies. To enable learning, the choice of representation is the key. Recent work uses neural networks to parameterize local surface elements. This approach captures locally coherent geometry and non-planar details, can deal with varying topology, and does not require registered training data. However, naively using such methods to model 3D clothed humans fails to capture fine-grained local deformations and generalizes poorly. To address this, we present three key innovations: First, we deform surface elements based on a human body model such that large-scale deformations caused by articulation are explicitly separated from topological changes and local clothing deformations. Second, we address the limitations of existing neural surface elements by regressing local geometry from local features, significantly improving the expressiveness. Third, we learn a pose embedding on a 2D parameterization space that encodes posed body geometry, improving generalization to unseen poses by reducing non-local spurious correlations. We demonstrate the efficacy of our surface representation by learning models of complex clothing from point clouds. The clothing can change topology and deviate from the topology of the body. Once learned, we can animate previously unseen motions, producing high-quality point clouds, from which we generate realistic images with neural rendering. We assess the importance of each technical contribution and show that our approach outperforms the state-of-the- art methods in terms of reconstruction accuracy and inference time. The code is available for research purposes at https://qianlim.github.io/SCALE.
Project Page Code Video arXiv PDF Supp. Poster DOI BibTeX
Thumb ticker lg scale ps website

Perceiving Systems Conference Paper SCANimate: Weakly Supervised Learning of Skinned Clothed Avatar Networks Saito, S., Yang, J., Ma, Q., Black, M. J. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2885-2896, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), June 2021 (Published)
We present SCANimate, an end-to-end trainable framework that takes raw 3D scans of a clothed human and turns them into an animatable avatar. These avatars are driven by pose parameters and have realistic clothing that moves and deforms naturally. SCANimate does not rely on a customized mesh template or surface mesh registration. We observe that fitting a parametric 3D body model, like SMPL, to a clothed human scan is tractable while surface registration of the body topology to the scan is often not, because clothing can deviate significantly from the body shape. We also observe that articulated transformations are invertible, resulting in geometric cycle-consistency in the posed and unposed shapes. These observations lead us to a weakly supervised learning method that aligns scans into a canonical pose by disentangling articulated deformations without template-based surface registration. Furthermore, to complete missing regions in the aligned scans while modeling pose-dependent deformations, we introduce a locally pose-aware implicit function that learns to complete and model geometry with learned pose correctives. In contrast to commonly used global pose embeddings, our local pose conditioning significantly reduces long-range spurious correlations and improves generalization to unseen poses, especially when training data is limited. Our method can be applied to pose- aware appearance modeling to generate a fully textured avatar. We demonstrate our approach on various clothing types with different amounts of training data, outperforming existing solutions and other variants in terms of fidelity and generality in every setting. The code is available at https://scanimate.is.tue.mpg.de.
Project Page PDF Supp. Video arXiv Poster code DOI URL BibTeX
Thumb ticker lg scanimate

Perceiving Systems Article Weight bias and linguistic body representation in anorexia nervosa: Findings from the BodyTalk project Behrens, S. C., Meneguzzo, P., Favaro, A., Teufel, M., Skoda, E., Lindner, M., Walder, L., Quiros-Ramirez, M. A., Zipfel, S., Mohler, B., Black, M., Giel, K. E. European Eating Disorders Review, 29(2):204-215, Wiley, March 2021 (Published)
Objective: This study provides a comprehensive assessment of own body representation and linguistic representation of bodies in general in women with typical and atypical anorexia nervosa (AN). Methods: In a series of desktop experiments, participants rated a set of adjectives according to their match with a series of computer generated bodies varying in body mass index, and generated prototypic body shapes for the same set of adjectives. We analysed how body mass index of the bodies was associated with positive or negative valence of the adjectives in the different groups. Further, body image and own body perception were assessed. Results: In a German‐Italian sample comprising 39 women with AN, 20 women with atypical AN and 40 age matched control participants, we observed effects indicative of weight stigmatization, but no significant differences between the groups. Generally, positive adjectives were associated with lean bodies, whereas negative adjectives were associated with obese bodies. Discussion: Our observations suggest that patients with both typical and atypical AN affectively and visually represent body descriptions not differently from healthy women. We conclude that overvaluation of low body weight and fear of weight gain cannot be explained by generally distorted perception or cognition, but require individual consideration.
on-line pdf DOI BibTeX
Thumb ticker lg bodytalktiny

Perceiving Systems Conference Paper Enhancing Deep Semantic Segmentation of RGB-D Data with Entangled Forests Terreran, M., Bonetto, E., Ghidoni, S. 2020 25th International Conference on Pattern Recognition (ICPR 2020), 4634-4641, IEEE, Piscataway, NJ, 25th International Conference on Pattern Recognition (ICPR 2020), January 2021 (Published)
Semantic segmentation is a problem which is getting more and more attention in the computer vision community. Nowadays, deep learning methods represent the state of the art to solve this problem, and the trend is to use deeper networks to get higher performance. The drawback with such models is a higher computational cost, which makes it difficult to integrate them on mobile robot platforms. In this work we want to explore how to obtain lighter deep learning models without compromising performance. To do so we will consider the features used in the 3D Entangled Forests algorithm and we will study the best strategies to integrate these within FuseNet deep network. Such new features allow us to shrink the network size without loosing performance, obtaining hence a lighter model which achieves state-of-the-art performance on the semantic segmentation task and represents an interesting alternative for mobile robotics applications, where computational power and energy are limited.
DOI URL BibTeX
Thumb ticker lg entangledfeatures

Perceiving Systems MPI Year Book Computer-Vision-Forschung deckt Stereotypen über Körperformen auf / Computer vision research unveils stereotypes about body shapes Quirós-Ramírez, M. A., Streuber, S., Black, M. J. January 2021 (Published)
Untersuchungen am MPI für Intelligente Systeme haben gezeigt, wie Menschen andere wahrnehmen und dabei einer Person allein auf Basis ihrer Körperform unbewusst bestimmte Charaktereigenschaften zuschreiben. Dabei stellte das Forschungsteam fest, dass die Parteizugehörigkeit der Betrachtenden die soziale Wahrnehmung beeinflusst. Das Urteil über andere hängt sowohl von der Körperform der betrachteten Person als auch von der eigenen politischen Gesinnung ab, was möglicherweise politische und gesellschaftliche Auswirkungen hat. Diese Forschungsarbeit soll das Bewusstsein für diese Voreingenommenheit schärfen. ENGLISH: Scientists at the Max Planck Institute for Intelligent Systems show how people perceive others and unconsciously attribute certain character traits to a person based solely on their body shape. They also found that the political affiliation of the observer influences the social perception. The judgment of others depends on both the body shape of the person being observed and the viewer’s own political preference, potentially having political and social implications. The research project aims to raise awareness of this bias.
URL BibTeX

Perceiving Systems Conference Paper SMPLpix: Neural Avatars from 3D Human Models Prokudin, S., Black, M. J., Romero, J. In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV 2021), 1809-1818, IEEE, Piscataway, NJ, IEEE Winter Conference on Applications of Computer Vision (WACV 2021), January 2021 (Published)
Recent advances in deep generative models have led to an unprecedented level of realism for synthetically generated images of humans. However, one of the remaining fundamental limitations of these models is the ability to flexibly control the generative process, e.g. change the camera and human pose while retaining the subject identity. At the same time, deformable human body models like SMPL \cite{loper2015smpl} and its successors provide full control over pose and shape, but rely on classic computer graphics pipelines for rendering. Such rendering pipelines require explicit mesh rasterization that (a) does not have the potential to fix artifacts or lack of realism in the original 3D geometry and (b) until recently, were not fully incorporated into deep learning frameworks. In this work, we propose to bridge the gap between classic geometry-based rendering and the latest generative networks operating in pixel space. We train a network that directly converts a sparse set of 3D mesh vertices into photorealistic images, alleviating the need for traditional rasterization mechanism. We train our model on a large corpus of human 3D models and corresponding real photos, and show the advantage over conventional differentiable renderers both in terms of the level of photorealism and rendering efficiency.
project official pdf video preprint code DOI BibTeX
Thumb ticker lg smplpix

Perceiving Systems Article Analyzing the Direction of Emotional Influence in Nonverbal Dyadic Communication: A Facial-Expression Study Shadaydeh, M., Müller, L., Schneider, D., Thuemmel, M., Kessler, T., Denzler, J. IEEE Access, 9:73780-73790, IEEE, 2021 (Published)
Identifying the direction of emotional influence in a dyadic dialogue is of increasing interest in the psychological sciences with applications in psychotherapy, analysis of political interactions, or interpersonal conflict behavior. Facial expressions are widely described as being automatic and thus hard to be overtly influenced. As such, they are a perfect measure for a better understanding of unintentional behavior cues about socio-emotional cognitive processes. With this view, this study is concerned with the analysis of the direction of emotional influence in dyadic dialogues based on facial expressions only. We exploit computer vision capabilities along with causal inference theory for quantitative verification of hypotheses on the direction of emotional influence, i.e., cause-effect relationships, in dyadic dialogues. We address two main issues. First, in a dyadic dialogue, emotional influence occurs over transient time intervals and with intensity and direction that are variant over time. To this end, we propose a relevant interval selection approach that we use prior to causal inference to identify those transient intervals where causal inference should be applied. Second, we propose to use fine-grained facial expressions that are present when strong distinct facial emotions are not visible. To specify the direction of influence, we apply the concept of Granger causality to the time-series of facial expressions over selected relevant intervals. We tested our approach on newly, experimentally obtained data. Based on quantitative verification of hypotheses on the direction of emotional influence, we were able to show that the proposed approach is promising to reveal the cause-effect pattern in various instructed interaction conditions.
DOI BibTeX
Thumb ticker lg leafaces

Perceiving Systems Article Body Image Disturbances and Weight Bias After Obesity Surgery: Semantic and Visual Evaluation in a Controlled Study, Findings from the BodyTalk Project Meneguzzo, P., Behrens, S. C., Favaro, A., Tenconi, E., Vindigni, V., Teufel, M., Skoda, E., Lindner, M., Quiros-Ramirez, M. A., Mohler, B., Black, M., Zipfel, S., Giel, K. E., Pavan, C. Obesity Surgery, 31(4):1625-1634, 2021 (Published)
Purpose: Body image has a significant impact on the outcome of obesity surgery. This study aims to perform a semantic evaluation of body shapes in obesity surgery patients and a group of controls. Materials and Methods: Thirty-four obesity surgery (OS) subjects, stable after weight loss (average 48.03 ± 18.60 kg), and 35 overweight/obese controls (MC), were enrolled in this study. Body dissatisfaction, self-esteem, and body perception were evaluated with self-reported tests, and semantic evaluation of body shapes was performed with three specific tasks constructed with realistic human body stimuli. Results: The OS showed a more positive body image compared to HC (p < 0.001), higher levels of depression (p < 0.019), and lower self-esteem (p < 0.000). OS patients and HC showed no difference in weight bias, but OS used a higher BMI than HC in the visualization of positive adjectives (p = 0.011). Both groups showed a mental underestimation of their body shapes. Conclusion: OS patients are more psychologically burdened and have more difficulties in judging their bodies than overweight/obese peers. Their mental body representations seem not to be linked to their own BMI. Our findings provide helpful insight for the design of specific interventions in body image in obese and overweight people, as well as in OS.
paper pdf DOI BibTeX
Thumb ticker lg obesity

Perceiving Systems Conference Paper Collaborative Mapping of Archaeological Sites Using Multiple UAVs Patel, M., Bandopadhyay, A., Ahmad, A. In Lecture Notes in Networks and Systems, 412:54-70, Springer International Publishing, Intelligent Autonomous Systems 16, 2021
UAVs have found an important application in archaeological mapping. Majority of the existing methods employ an offline method to process the data collected from an archaeological site. They are time-consuming and computationally expensive. In this paper, we present a multi-UAV approach for faster mapping of archaeological sites. Employing a team of UAVs not only reduces the mapping time by distribution of coverage area, but also improves the map accuracy by exchange of information. Through extensive experiments in a realistic simulation (AirSim), we demonstrate the advantages of using a collaborative mapping approach. We then create the first 3D map of the Sadra Fort, a 15th Century Fort located in Gujarat, India using our proposed method. Additionally, we present two novel archaeological datasets recorded in both simulation and real-world to facilitate research on collaborative archaeological mapping. For the benefit of the community, we make the AirSim simulation environment, as well as the datasets publicly available (Project web page: http://patelmanthan.in/castle-ruins-airsim/).
DOI BibTeX
Thumb ticker lg cover

Empirical Inference Haptic Intelligence Perceiving Systems Physical Intelligence Robotic Materials MPI Year Book Scientific Report 2016 - 2021 2021
This report presents research done at the Max Planck Institute for Intelligent Systems from January2016 to November 2021. It is our fourth report since the founding of the institute in 2011. Dueto the fact that the upcoming evaluation is an extended one, the report covers a longer reportingperiod.This scientific report is organized as follows: we begin with an overview of the institute, includingan outline of its structure, an introduction of our latest research departments, and a presentationof our main collaborative initiatives and activities (Chapter1). The central part of the scientificreport consists of chapters on the research conducted by the institute’s departments (Chapters2to6) and its independent research groups (Chapters7 to24), as well as the work of the institute’scentral scientific facilities (Chapter25). For entities founded after January 2016, the respectivereport sections cover work done from the date of the establishment of the department, group, orfacility. These chapters are followed by a summary of selected outreach activities and scientificevents hosted by the institute (Chapter26). The scientific publications of the featured departmentsand research groups published during the 6-year review period complete this scientific report.
Scientific Report 2016 - 2021 BibTeX
Thumb ticker lg sab 2016 2021