Publications

DEPARTMENTS

Empirical Inference

Haptic Intelligence

Modern Magnetic Systems

Perceiving Systems

Physical Intelligence

Robotic Materials

Social Foundations of Computation


Research Groups

Autonomous Vision

Autonomous Learning

Bioinspired Autonomous Miniature Robots

Dynamic Locomotion

Embodied Vision

Human Aspects of Machine Learning

Intelligent Control Systems

Learning and Dynamical Systems

Locomotion in Biorobotic and Somatic Systems

Micro, Nano, and Molecular Systems

Movement Generation and Control

Neural Capture and Synthesis

Physics for Inference and Optimization

Organizational Leadership and Diversity

Probabilistic Learning Group


Perceiving Systems Empirical Inference Conference Paper Grasping Field: Learning Implicit Representations for Human Grasps Karunratanakul, K., Yang, J., Zhang, Y., Black, M., Muandet, K., Tang, S. In 2020 International Conference on 3D Vision (3DV 2020), 333-344, IEEE, Piscataway, NJ, International Conference on 3D Vision (3DV 2020), November 2020 (Published)
Robotic grasping of household objects has made remarkable progress in recent years. Yet, human grasps are still difficult to synthesize realistically. There are several key reasons: (1) the human hand has many degrees of freedom (more than robotic manipulators); (2) the synthesized hand should conform to the surface of the object; and (3) it should interact with the object in a semantically and physically plausible manner. To make progress in this direction, we draw inspiration from the recent progress on learning-based implicit representations for 3D object reconstruction. Specifically, we propose an expressive representation for human grasp modelling that is efficient and easy to integrate with deep neural networks. Our insight is that every point in a three-dimensional space can be characterized by the signed distances to the surface of the hand and the object, respectively. Consequently, the hand, the object, and the contact area can be represented by implicit surfaces in a common space, in which the proximity between the hand and the object can be modelled explicitly. We name this 3D-to-2D mapping the Grasping Field, parameterize it with a deep neural network, and learn it from data. We demonstrate that the proposed grasping field is an effective and expressive representation for human grasp generation. Specifically, our generative model is able to synthesize high-quality human grasps, given only a 3D object point cloud. The extensive experiments demonstrate that our generative model compares favorably with a strong baseline and approaches the level of natural human grasps. Furthermore, based on the grasping field representation, we propose a deep network for the challenging task of 3D hand-object interaction reconstruction from a single RGB image. Our method improves the physical plausibility of the hand-object contact reconstruction and achieves comparable performance for 3D hand reconstruction compared to state-of-the-art methods. Our model and code are available for research purposes at https://github.com/korrawe/grasping_field.
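
As a concrete illustration of the representation, here is a minimal sketch of a network that maps a 3D query point, concatenated with a latent code for the hand-object configuration, to the two signed distances. The class name, latent size, and layer widths are all illustrative assumptions, not the authors' released implementation.

    # Minimal sketch of the grasping-field idea: map a 3D point plus a latent
    # code to (signed distance to hand surface, signed distance to object).
    # Names and layer sizes are illustrative assumptions.
    import torch
    import torch.nn as nn

    class GraspingFieldMLP(nn.Module):
        def __init__(self, latent_dim=256, hidden=512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 2),  # two signed distances per query point
            )

        def forward(self, points, latent):
            # points: (B, N, 3) query points; latent: (B, latent_dim)
            B, N, _ = points.shape
            z = latent.unsqueeze(1).expand(B, N, latent.shape[-1])
            return self.net(torch.cat([points, z], dim=-1))

    model = GraspingFieldMLP()
    pts = torch.randn(4, 1024, 3)   # random 3D query points
    z = torch.randn(4, 256)         # latent code, e.g. from an object encoder
    dists = model(pts, z)           # (4, 1024, 2)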

Perceiving Systems Article Occlusion Boundary: A Formal Definition & Its Detection via Deep Exploration of Context Wang, C., Fu, H., Tao, D., Black, M. J. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2641-2656, November 2020 (Published)
Occlusion boundaries contain rich perceptual information about the underlying scene structure and provide important cues in many visual perception-related tasks such as object recognition, segmentation, motion estimation, scene understanding, and autonomous navigation. However, there is no formal definition of occlusion boundaries in the literature, and state-of-the-art occlusion boundary detection is still suboptimal. With this in mind, in this paper we propose a formal definition of occlusion boundaries for related studies. Further, based on a novel idea, we develop two concrete approaches with different characteristics to detect occlusion boundaries in video sequences via enhanced exploration of contextual information (e.g., local structural boundary patterns, observations from surrounding regions, and temporal context) with deep models and conditional random fields. Experimental evaluations of our methods on two challenging occlusion boundary benchmarks (CMU and VSB100) demonstrate that our detectors significantly outperform the current state-of-the-art. Finally, we empirically assess the roles of several important components of the proposed detectors to validate the rationale behind these approaches.

Perceiving Systems Conference Paper GIF: Generative Interpretable Faces Ghosh, P., Gupta, P. S., Uziel, R., Ranjan, A., Black, M. J., Bolkart, T. In 2020 International Conference on 3D Vision (3DV 2020), 1:868-878, IEEE, Piscataway, NJ, International Conference on 3D Vision (3DV 2020), November 2020 (Published)
Photo-realistic visualization and animation of expressive human faces have been a long-standing challenge. 3D face modeling methods provide parametric control but generate unrealistic images; generative 2D models like GANs (Generative Adversarial Networks), on the other hand, output photo-realistic face images but lack explicit control. Recent methods gain partial control, either by attempting to disentangle different factors in an unsupervised manner, or by adding control post hoc to a pre-trained model. Unconditional GANs, however, may entangle factors that are hard to undo later. We condition our generative model on pre-defined control parameters to encourage disentanglement in the generation process. Specifically, we condition StyleGAN2 on FLAME, a generative 3D face model. While conditioning on FLAME parameters yields unsatisfactory results, we find that conditioning on rendered FLAME geometry and photometric details works well. This gives us a generative 2D face model named GIF (Generative Interpretable Faces) that offers FLAME's parametric control. Here, interpretable refers to the semantic meaning of different parameters. Given FLAME parameters for shape, pose, and expressions, parameters for appearance and lighting, and an additional style vector, GIF outputs photo-realistic face images. We perform an AMT-based perceptual study to quantitatively and qualitatively evaluate how well GIF follows its conditioning. The code, data, and trained model are publicly available for research purposes at http://gif.is.tue.mpg.de

Perceiving Systems Conference Paper PLACE: Proximity Learning of Articulation and Contact in 3D Environments Zhang, S., Zhang, Y., Ma, Q., Black, M. J., Tang, S. In 2020 International Conference on 3D Vision (3DV 2020), 1:642-651, IEEE, Piscataway, NJ, International Conference on 3D Vision (3DV 2020), November 2020 (Published)
High-fidelity digital 3D environments have been proposed in recent years; however, it remains extremely challenging to automatically equip such environments with realistic human bodies. Existing work utilizes images, depth, or semantic maps to represent the scene, and parametric human models to represent 3D bodies. While straightforward, their generated human-scene interactions often lack naturalness and physical plausibility. Our key observation is that humans interact with the world through body-scene contact. To synthesize realistic human-scene interactions, it is essential to effectively represent the physical contact and proximity between the body and the world. To that end, we propose a novel interaction generation method, named PLACE (Proximity Learning of Articulation and Contact in 3D Environments), which explicitly models the proximity between the human body and the 3D scene around it. Specifically, given a set of basis points on a scene mesh, we leverage a conditional variational autoencoder to synthesize the minimum distances from the basis points to the human body surface. The generated proximal relationship indicates which region of the scene is in contact with the person. Furthermore, based on such synthesized proximity, we are able to effectively obtain expressive 3D human bodies that interact with the 3D scene naturally. Our perceptual study shows that PLACE significantly improves on the state-of-the-art method, approaching the realism of real human-scene interaction. We believe our method makes an important step towards the fully automatic synthesis of realistic 3D human bodies in 3D scenes. The code and model are available for research at https://sanweiliti.github.io/PLACE/PLACE.html
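
The proximity encoding at the heart of the method is easy to sketch: for each basis point, record the minimum distance to the body surface. Below is an illustrative NumPy version with random stand-ins for the basis points and the body vertices; the conditional VAE that is trained to generate these distance vectors is omitted.

    import numpy as np

    def min_distances(basis_points, body_vertices):
        """basis_points: (K, 3); body_vertices: (V, 3) -> (K,) min distances."""
        diff = basis_points[:, None, :] - body_vertices[None, :, :]  # (K, V, 3)
        return np.sqrt((diff ** 2).sum(-1)).min(axis=1)

    rng = np.random.default_rng(0)
    basis = rng.uniform(-1.0, 1.0, size=(256, 3))  # basis points near the scene
    body = rng.normal(0.0, 0.3, size=(6890, 3))    # stand-in for body vertices
    proximity = min_distances(basis, body)         # near-zero entries ~ contact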

Perceiving Systems Article 3D Morphable Face Models - Past, Present and Future Egger, B., Smith, W. A. P., Tewari, A., Wuhrer, S., Zollhoefer, M., Beeler, T., Bernard, F., Bolkart, T., Kortylewski, A., Romdhani, S., Theobalt, C., Blanz, V., Vetter, T. ACM Transactions on Graphics, 39(5):157, October 2020 (Published)
In this paper, we provide a detailed survey of 3D Morphable Face Models over the 20 years since they were first proposed. The challenges in building and applying these models, namely capture, modeling, image formation, and image analysis, are still active research topics, and we review the state-of-the-art in each of these areas. We also look ahead, identifying unsolved challenges, proposing directions for future research and highlighting the broad range of current and future applications.

Perceiving Systems Article AirCapRL: Autonomous Aerial Human Motion Capture Using Deep Reinforcement Learning Tallamraju, R., Saini, N., Bonetto, E., Pabst, M., Liu, Y. T., Black, M., Ahmad, A. IEEE Robotics and Automation Letters, 5(4):6678-6685, IEEE, October 2020, Also accepted and presented in the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). (Published)
In this letter, we introduce a deep reinforcement learning (DRL) based multi-robot formation controller for the task of autonomous aerial human motion capture (MoCap). We focus on vision-based MoCap, where the objective is to estimate the trajectory of the body pose and shape of a single moving person using multiple micro aerial vehicles. State-of-the-art solutions to this problem are based on classical control methods, which depend on hand-crafted system and observation models. Such models are difficult to derive and to generalize across different systems. Moreover, the non-linearities and non-convexities of these models lead to sub-optimal controls. In our work, we formulate this problem as a sequential decision-making task to achieve the vision-based motion capture objectives and solve it using a deep neural network-based RL method. We leverage proximal policy optimization (PPO) to train a stochastic decentralized control policy for formation control. The neural network is trained in a parallelized setup in synthetic environments. We performed extensive simulation experiments to validate our approach. Finally, real-robot experiments demonstrate that our policies generalize to real-world conditions.
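
The abstract does not specify the reward, so the following is only a hypothetical sketch of the kind of MoCap-oriented reward shaping such a formation controller could use: each aerial vehicle is rewarded for keeping the person centered in its image and near a target viewing distance. All names and constants are assumptions.

    import numpy as np

    def mocap_reward(person_px, img_center, dist, target_dist=8.0):
        """Hypothetical per-vehicle reward: center the person in the image
        and hover near a target distance (meters)."""
        centering = -np.linalg.norm(np.asarray(person_px) - np.asarray(img_center))
        distance = -abs(dist - target_dist)
        return centering + distance

    r = mocap_reward(person_px=(420.0, 260.0), img_center=(320.0, 240.0), dist=9.5)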

Perceiving Systems Conference Paper Learning a statistical full spine model from partial observations Meng, D., Keller, M., Boyer, E., Black, M., Pujades, S. In Shape in Medical Imaging, 122-133, Lecture Notes in Computer Science, 12474, (Editors: Reuter, Martin and Wachinger, Christian and Lombaert, Hervé and Paniagua, Beatriz and Goksel, Orcun and Rekik, Islem), Springer, Cham, International Workshop on Shape in Medical Imaging (ShapeMI 2020), October 2020 (Published)
The study of the morphology of the human spine has attracted research attention for its many potential applications, such as image segmentation, biomechanics, or pathology detection. However, as of today there is no publicly available statistical model of the 3D surface of the full spine. This is mainly due to the lack of openly available 3D data where the full spine is imaged and segmented. In this paper we propose to learn a statistical surface model of the full spine (7 cervical, 12 thoracic and 5 lumbar vertebrae) from partial and incomplete views of the spine. In order to deal with the partial observations we use probabilistic principal component analysis (PPCA) to learn a surface shape model of the full spine. Quantitative evaluation demonstrates that the obtained model faithfully captures the shape of the population in a low-dimensional space and generalizes to left-out data. Furthermore, we show that the model faithfully captures the global correlations among the vertebrae shapes. Given a partial observation of the spine, i.e. a few vertebrae, the model can predict the shape of unseen vertebrae with a mean error under 3 mm. The full-spine statistical model is trained on the VerSe 2019 public dataset and is made publicly available to the community for non-commercial purposes. (https://gitlab.inria.fr/spine/spine_model)
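
Predicting unseen vertebrae from a partial observation follows from standard PPCA posterior inference (x = Wz + mu + noise): infer the posterior mean of the latent z from the observed coordinates, then decode the full spine. The sketch below uses random stand-ins for the trained parameters, not the released spine model.

    import numpy as np

    def predict_missing(W, mu, sigma2, x_obs, obs_idx):
        """Posterior-mean PPCA completion: decode the latent inferred from
        the observed coordinates to reconstruct all coordinates."""
        W_o, mu_o = W[obs_idx], mu[obs_idx]
        M = W_o.T @ W_o + sigma2 * np.eye(W.shape[1])
        z_hat = np.linalg.solve(M, W_o.T @ (x_obs - mu_o))  # posterior mean of z
        return W @ z_hat + mu

    rng = np.random.default_rng(1)
    D, q = 3000, 20                      # flattened vertex coords, latent size
    W, mu = rng.normal(size=(D, q)), rng.normal(size=D)
    obs = np.arange(1000)                # indices of the observed coordinates
    x_obs = (W @ rng.normal(size=q) + mu)[obs]
    x_full = predict_missing(W, mu, 0.01, x_obs, obs)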

Perceiving Systems Conference Paper GRAB: A Dataset of Whole-Body Human Grasping of Objects Taheri, O., Ghorbani, N., Black, M. J., Tzionas, D. In Computer Vision – ECCV 2020, 4:581-600, Lecture Notes in Computer Science, 12349, (Editors: Vedaldi, Andrea and Bischof, Horst and Brox, Thomas and Frahm, Jan-Michael), Springer, Cham, 16th European Conference on Computer Vision (ECCV 2020), August 2020 (Published)
Training computers to understand, model, and synthesize human grasping requires a rich dataset containing complex 3D object shapes, detailed contact information, hand pose and shape, and the 3D body motion over time. While "grasping" is commonly thought of as a single hand stably lifting an object, we capture the motion of the entire body and adopt the generalized notion of "whole-body grasps". Thus, we collect a new dataset, called GRAB (GRasping Actions with Bodies), of whole-body grasps, containing full 3D shape and pose sequences of 10 subjects interacting with 51 everyday objects of varying shape and size. Given MoCap markers, we fit the full 3D body shape and pose, including the articulated face and hands, as well as the 3D object pose. This gives detailed 3D meshes over time, from which we compute contact between the body and object. This is a unique dataset that goes well beyond existing ones for modeling and understanding how humans grasp and manipulate objects, how their full body is involved, and how interaction varies with the task. We illustrate the practical value of GRAB with an example application; we train GrabNet, a conditional generative network, to predict 3D hand grasps for unseen 3D object shapes. The dataset and code are available for research purposes at https://grab.is.tue.mpg.de.

Perceiving Systems Conference Paper Monocular Expressive Body Regression through Body-Driven Attention Choutas, V., Pavlakos, G., Bolkart, T., Tzionas, D., Black, M. J. In Computer Vision - ECCV 2020, 10:20-40, Lecture Notes in Computer Science, 12355, (Editors: Vedaldi, Andrea and Bischof, Horst and Brox, Thomas and Frahm, Jan-Michael), Springer, Cham, 16th European Conference on Computer Vision (ECCV 2020), August 2020 (Published)
To understand how people look, interact, or perform tasks, we need to quickly and accurately capture their 3D body, face, and hands together from an RGB image. Most existing methods focus only on parts of the body. A few recent approaches reconstruct full expressive 3D humans from images using 3D body models that include the face and hands. These methods are optimization-based and thus slow, prone to local optima, and require 2D keypoints as input. We address these limitations by introducing ExPose (EXpressive POse and Shape rEgression), which directly regresses the body, face, and hands, in SMPL-X format, from an RGB image. This is a hard problem due to the high dimensionality of the body and the lack of expressive training data. Additionally, hands and faces are much smaller than the body, occupying very few image pixels. This makes hand and face estimation hard when body images are downscaled for neural networks. We make three main contributions. First, we account for the lack of training data by curating a dataset of SMPL-X fits on in-the-wild images. Second, we observe that body estimation localizes the face and hands reasonably well. We introduce body-driven attention for face and hand regions in the original image to extract higher-resolution crops that are fed to dedicated refinement modules. Third, these modules exploit part-specific knowledge from existing face and hand-only datasets. ExPose estimates expressive 3D humans more accurately than existing optimization methods at a small fraction of the computational cost. Our data, model and code are available for research at https://expose.is.tue.mpg.de.
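
The body-driven attention step can be pictured as follows: project the face and hand locations from the initial body estimate into the original high-resolution image and crop regions around them for the dedicated refinement networks. A minimal, illustrative sketch with assumed names:

    import numpy as np

    def crop_around(image, center_xy, size):
        """Square crop of `size` pixels centered on a predicted 2D location."""
        x, y = (int(round(c)) for c in center_xy)
        h = size // 2
        return image[max(y - h, 0): y + h, max(x - h, 0): x + h]

    # The centers would come from the coarse body fit; cropping the original
    # image keeps full resolution for the small face and hand regions.
    image = np.zeros((2048, 2048, 3), dtype=np.uint8)
    face_crop = crop_around(image, center_xy=(1024.0, 512.0), size=256)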

Perceiving Systems Conference Paper STAR: Sparse Trained Articulated Human Body Regressor Osman, A. A. A., Bolkart, T., Black, M. J. In Computer Vision - ECCV 2020, 6:598-613, Lecture Notes in Computer Science, 12351, (Editors: Vedaldi, Andrea and Bischof, Horst and Brox, Thomas and Frahm, Jan-Michael), Springer, Cham, 16th European Conference on Computer Vision (ECCV 2020), August 2020 (Published)
The SMPL body model is widely used for the estimation, synthesis, and analysis of 3D human pose and shape. While popular, we show that SMPL has several limitations and introduce STAR, which is quantitatively and qualitatively superior to SMPL. First, SMPL has a huge number of parameters resulting from its use of global blend shapes. These dense pose-corrective offsets relate every vertex on the mesh to all the joints in the kinematic tree, capturing spurious long-range correlations. To address this, we define per-joint pose correctives and learn the subset of mesh vertices that are influenced by each joint movement. This sparse formulation results in more realistic deformations and significantly reduces the number of model parameters to 20% of SMPL. When trained on the same data as SMPL, STAR generalizes better despite having many fewer parameters. Second, SMPL factors pose-dependent deformations from body shape while, in reality, people with different shapes deform differently. Consequently, we learn shape-dependent pose-corrective blend shapes that depend on both body pose and BMI. Third, we show that the shape space of SMPL is not rich enough to capture the variation in the human population. We address this by training STAR with an additional 10,000 scans of male and female subjects, and show that this results in better model generalization. STAR is compact, generalizes better to new bodies and is a drop-in replacement for SMPL. STAR is publicly available for research purposes at http://star.is.tue.mpg.de.
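
The contrast between dense global correctives and sparse per-joint correctives can be sketched directly. Below, with random stand-ins for the learned quantities, a dense blend-shape matrix relates every vertex to every joint, while the sparse variant restricts each joint's influence to a learned vertex subset. The shapes and the 5% sparsity are illustrative assumptions.

    import numpy as np

    V, J, P = 6890, 24, 9        # vertices, joints, pose features per joint
    rng = np.random.default_rng(2)
    pose_feats = rng.normal(size=(J, P))

    # Dense (SMPL-style): one matrix couples every vertex to every joint.
    dense_B = rng.normal(size=(V * 3, J * P))
    dense_offsets = (dense_B @ pose_feats.reshape(-1)).reshape(V, 3)

    # Sparse (STAR-style): a per-joint vertex mask limits each joint's reach.
    masks = rng.random((J, V)) < 0.05            # ~5% of vertices per joint
    sparse_B = rng.normal(size=(J, V, 3, P))
    sparse_offsets = np.zeros((V, 3))
    for j in range(J):
        sparse_offsets[masks[j]] += sparse_B[j, masks[j]] @ pose_feats[j]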

Perceiving Systems Ph.D. Thesis Addressing the Data Scarcity of Learning-based Optical Flow Approaches Janai, J. University of Tübingen, July 2020 (Published)
Learning to solve optical flow in an end-to-end fashion from examples is attractive as deep neural networks allow for learning more complex hierarchical flow representations directly from annotated data. However, training such models requires large datasets, and obtaining ground truth for real images is challenging. Due to the difficulty of capturing dense ground truth, existing optical flow datasets are limited in size and diversity. Therefore, we present two strategies to address this data scarcity problem: First, we propose an approach to create new real-world datasets by exploiting temporal constraints using a high-speed video camera. We tackle this problem by tracking pixels through densely sampled space-time volumes recorded with a high-speed video camera. Our model exploits the linearity of small motions and reasons about occlusions from multiple frames. Using our technique, we are able to establish accurate reference flow fields outside the laboratory in natural environments. In addition, we show how our predictions can be used to augment the input images with realistic motion blur. We demonstrate the quality of the produced flow fields on synthetic and real-world datasets. Finally, we collect a novel challenging optical flow dataset by applying our technique to data from a high-speed camera and analyze the performance of the state of the art in optical flow under various levels of motion blur. Second, we investigate how to learn sophisticated models from unlabeled data. Unsupervised learning is a promising direction, yet the performance of current unsupervised methods is still limited. In particular, the lack of proper occlusion handling in commonly used data terms constitutes a major source of error. While most optical flow methods process pairs of consecutive frames, more advanced occlusion reasoning can be realized when considering multiple frames. We propose a framework for unsupervised learning of optical flow and occlusions over multiple frames. More specifically, we exploit the minimal configuration of three frames to strengthen the photometric loss and explicitly reason about occlusions. We demonstrate that our multi-frame, occlusion-sensitive formulation outperforms previous unsupervised methods and even produces results on par with some fully supervised methods. Both directions are essential for future advances in optical flow. While new datasets allow measuring the advancements and comparing novel approaches, unsupervised learning permits the usage of new data sources to train better models.
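
The photometric loss that underpins such unsupervised flow learning can be sketched compactly: warp the second image back by the predicted flow and penalize the difference to the first image only where pixels are visible. The sketch below is a generic, illustrative version (Charbonnier penalty, bilinear warping), not the thesis code.

    import torch
    import torch.nn.functional as F

    def photometric_loss(img1, img2, flow, vis_mask, eps=1e-6):
        """Penalize img1 vs. img2 warped back by the flow, masked by visibility."""
        B, _, H, W = img1.shape
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        grid = torch.stack([xs, ys], dim=-1).float()            # (H, W, 2)
        target = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)   # (B, H, W, 2)
        target[..., 0] = 2 * target[..., 0] / (W - 1) - 1       # to [-1, 1]
        target[..., 1] = 2 * target[..., 1] / (H - 1) - 1
        warped = F.grid_sample(img2, target, align_corners=True)
        diff = (img1 - warped) * vis_mask
        return torch.sqrt(diff ** 2 + eps).mean()               # Charbonnier

    img1, img2 = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
    flow = torch.zeros(1, 2, 64, 64)       # stand-in for a network prediction
    loss = photometric_loss(img1, img2, flow, torch.ones(1, 1, 64, 64))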

Perceiving Systems Article Analysis of motor development within the first year of life: 3-D motion tracking without markers for early detection of developmental disorders Parisi, C., Hesse, N., Tacke, U., Rocamora, S. P., Blaschek, A., Hadders-Algra, M., Black, M. J., Heinen, F., Müller-Felber, W., Schroeder, A. S. Bundesgesundheitsblatt - Gesundheitsforschung - Gesundheitsschutz, 63(7):881–890, July 2020 (Published)
Children with motor development disorders benefit greatly from early interventions. An early diagnosis in pediatric preventive care (U2–U5) can be improved by automated screening. Current approaches to automated motion analysis, however, are expensive, require substantial technical support, and cannot be used in broad clinical application. Here we present an inexpensive, marker-free video analysis tool (KineMAT) for infants, which digitizes 3-D movements of the entire body over time, allowing automated analysis in the future. Three-minute video sequences of spontaneously moving infants were recorded with a commercially available depth-imaging camera and aligned with a virtual infant body model (SMIL model). The virtual image generated allows any measurements to be carried out in 3-D with high precision. We demonstrate the approach on seven infants with different diagnoses. A selection of possible movement parameters was quantified and aligned with diagnosis-specific movement characteristics. KineMAT and the SMIL model allow reliable, three-dimensional measurements of spontaneous activity in infants with a very low error rate. Based on machine-learning algorithms, KineMAT can be trained to automatically recognize pathological spontaneous motor skills. It is inexpensive and easy to use and can be developed into a screening tool for preventive care for children.

Perceiving Systems Conference Paper Generating 3D People in Scenes without People Zhang, Y., Hassan, M., Neumann, H., Black, M. J., Tang, S. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 6193-6203, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), June 2020 (Published)
We present a fully automatic system that takes a 3D scene and generates plausible 3D human bodies that are posed naturally in that 3D scene. Given a 3D scene without people, humans can easily imagine how people could interact with the scene and the objects in it. However, this is a challenging task for a computer as solving it requires that (1) the generated human bodies be semantically plausible within the 3D environment (e.g. people sitting on the sofa or cooking near the stove), and (2) the generated human-scene interaction be physically feasible such that the human body and scene do not interpenetrate while, at the same time, body-scene contact supports physical interactions. To that end, we make use of the surface-based 3D human model SMPL-X. We first train a conditional variational autoencoder to predict semantically plausible 3D human poses conditioned on latent scene representations, then we further refine the generated 3D bodies using scene constraints to enforce feasible physical interaction. We show that our approach is able to synthesize realistic and expressive 3D human bodies that naturally interact with the 3D environment. We perform extensive experiments demonstrating that our generative framework compares favorably with existing methods, both qualitatively and quantitatively. We believe that our scene-conditioned 3D human generation pipeline will be useful for numerous applications, e.g. to generate training data for human pose estimation, in video games and in VR/AR. Our project page for data and code can be seen at https://vlg.inf.ethz.ch/projects/PSI/.
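
The scene-constraint refinement can be illustrated with a simple loss: penalize body vertices whose scene signed distance is negative (penetration) while pulling designated contact vertices toward the surface. The sketch below is an assumption-laden illustration; scene_sdf, the weights, and the contact indices are all hypothetical.

    import torch

    def interaction_loss(body_verts, scene_sdf, contact_idx, w_pen=10.0, w_con=1.0):
        """Penalize penetration (negative SDF) and reward body-scene contact."""
        sdf = scene_sdf(body_verts)              # (V,) signed distances to scene
        penetration = torch.relu(-sdf).sum()     # vertices inside the scene
        contact = sdf[contact_idx].abs().mean()  # pull contact vertices to 0
        return w_pen * penetration + w_con * contact

    verts = torch.randn(6890, 3)
    floor_sdf = lambda p: p[:, 2]                # toy scene: floor plane z = 0
    loss = interaction_loss(verts, floor_sdf, contact_idx=torch.tensor([0, 1, 2]))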

Perceiving Systems Conference Paper Learning Physics-guided Face Relighting under Directional Light Nestmeyer, T., Lalonde, J., Matthews, I., Lehrmann, A. M. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 5123-5132, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), June 2020 (Published)
Relighting is an essential step in realistically transferring objects from a captured image into another environment. For example, authentic telepresence in Augmented Reality requires faces to be displayed and relit consistent with the observer's scene lighting. We investigate end-to-end deep learning architectures that both de-light and relight an image of a human face. Our model decomposes the input image into intrinsic components according to a diffuse physics-based image formation model. We enable non-diffuse effects including cast shadows and specular highlights by predicting a residual correction to the diffuse render. To train and evaluate our model, we collected a portrait database of 21 subjects with various expressions and poses. Each sample is captured in a controlled light stage setup with 32 individual light sources. Our method creates precise and believable relighting results and generalizes to complex illumination conditions and challenging poses, including when the subject is not looking straight at the camera.
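
The diffuse image-formation model at the core of the decomposition is the standard Lambertian one; a minimal sketch follows, with the network-predicted residual for cast shadows and speculars simply added on top. All names and values are illustrative.

    import numpy as np

    def diffuse_render(albedo, normals, light_dir):
        """Lambertian shading: albedo * max(0, n . l), per pixel."""
        l = light_dir / np.linalg.norm(light_dir)
        shading = np.clip((normals * l).sum(-1), 0.0, None)  # (H, W)
        return albedo * shading[..., None]                   # (H, W, 3)

    H, W = 4, 4
    albedo = np.full((H, W, 3), 0.5)
    normals = np.tile(np.array([0.0, 0.0, 1.0]), (H, W, 1))
    relit = diffuse_render(albedo, normals, np.array([0.0, 0.0, 1.0]))
    # final image = relit + predicted residual (cast shadows, speculars)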

Perceiving Systems Article Learning and Tracking the 3D Body Shape of Freely Moving Infants from RGB-D sequences Hesse, N., Pujades, S., Black, M., Arens, M., Hofmann, U., Schroeder, S. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 42(10):2540-2551, June 2020
Statistical models of the human body surface are generally learned from thousands of high-quality 3D scans in predefined poses to cover the wide variety of human body shapes and articulations. Acquisition of such data requires expensive equipment, calibration procedures, and is limited to cooperative subjects who can understand and follow instructions, such as adults. We present a method for learning a statistical 3D Skinned Multi-Infant Linear body model (SMIL) from incomplete, low-quality RGB-D sequences of freely moving infants. Quantitative experiments show that SMIL faithfully represents the RGB-D data and properly factorizes the shape and pose of the infants. To demonstrate the applicability of SMIL, we fit the model to RGB-D sequences of freely moving infants and show, with a case study, that our method captures enough motion detail for General Movements Assessment (GMA), a method used in clinical practice for early detection of neurodevelopmental disorders in infants. SMIL provides a new tool for analyzing infant shape and movement and is a step towards an automated system for GMA.

Perceiving Systems Conference Paper Learning to Dress 3D People in Generative Clothing Ma, Q., Yang, J., Ranjan, A., Pujades, S., Pons-Moll, G., Tang, S., Black, M. J. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 6468-6477, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), June 2020 (Published)
Three-dimensional human body models are widely used in the analysis of human pose and motion. Existing models, however, are learned from minimally-clothed 3D scans and thus do not generalize to the complexity of dressed people in common images and videos. Additionally, current models lack the expressive power needed to represent the complex non-linear geometry of pose-dependent clothing shape. To address this, we learn a generative 3D mesh model of clothed people from 3D scans with varying pose and clothing. Specifically, we train a conditional Mesh-VAE-GAN to learn the clothing deformation from the SMPL body model, making clothing an additional term on SMPL. Our model is conditioned on both pose and clothing type, giving the ability to draw samples of clothing to dress different body shapes in a variety of styles and poses. To preserve wrinkle detail, our Mesh-VAE-GAN extends patchwise discriminators to 3D meshes. Our model, named CAPE, represents global shape and fine local structure, effectively extending the SMPL body model to clothing. To our knowledge, this is the first generative model that directly dresses 3D human body meshes and generalizes to different poses.
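
Making clothing "an additional term on SMPL" amounts to adding a generated per-vertex displacement, conditioned on pose and clothing type, to the minimally-clothed body. A minimal sketch with a stand-in decoder in place of the trained Mesh-VAE-GAN:

    import torch

    def clothed_body(body_verts, displacement_net, z, pose, clothing_type):
        """Clothed mesh = body mesh + generated per-vertex displacement."""
        disp = displacement_net(z, pose, clothing_type)      # (V, 3) offsets
        return body_verts + disp

    # Stand-in decoder; the real one would be the trained generative model.
    decoder = lambda z, pose, c: torch.zeros(6890, 3)
    verts = clothed_body(torch.randn(6890, 3), decoder, None, None, "longsleeve")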

Perceiving Systems Patent Machine learning systems and methods of estimating body shape from images Black, M., Rachlin, E., Heron, N., Loper, M., Weiss, A., Hu, K., Hinkle, T., Kristiansen, M. (US Patent 10,679,046), June 2020
Disclosed is a method including receiving an input image including a human, predicting, based on a convolutional neural network that is trained using examples consisting of pairs of sensor data, a corresponding body shape of the human and utilizing the corresponding body shape predicted from the convolutional neural network as input to another convolutional neural network to predict additional body shape metrics.

Perceiving Systems Conference Paper GENTEL: GENerating Training data Efficiently for Learning to segment medical images Thakur, R. P., Pujades, S., Goel, L., Pohmann, R., Machann, J., Black, M. J. In RFIAP 2020 - Congrès Reconnaissance des Formes, Image, Apprentissage et Perception (RFIAP 2020), June 2020 (Published)
Accurately segmenting MRI images is crucial for many clinical applications. However, manually segmenting images with accurate pixel precision is a tedious and time-consuming task. In this paper we present a simple, yet effective method to improve the efficiency of the image segmentation process. We propose to transform the image annotation task into a binary choice task. We start by using classical image processing algorithms with different parameter values to generate multiple, different segmentation masks for each input MRI image. Then, instead of segmenting the pixels of the images, the user only needs to decide whether a segmentation is acceptable or not. This method allows us to efficiently obtain high-quality segmentations with minor human intervention. With the selected segmentations, we train a state-of-the-art neural network model. For the evaluation, we use a second MRI dataset (1.5T Dataset), acquired with a different protocol and containing annotations. We show that the trained network i) is able to automatically segment cases where none of the classical methods obtain a high-quality result; ii) generalizes to the second MRI dataset, which was acquired with a different protocol and was never seen at training time; and iii) enables detection of mis-annotations in this second dataset. Quantitatively, the trained network obtains very good results: Dice score (mean 0.98, median 0.99) and Hausdorff distance in pixels (mean 4.7, median 2.0).
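
The annotation pipeline reduces to generating candidate masks with a classical method swept over its parameters and letting the user accept or reject each one. An illustrative sketch with a plain intensity threshold standing in for the classical method:

    import numpy as np

    def candidate_masks(image, thresholds):
        """One binary mask per threshold of a simple intensity cut."""
        return [image > t for t in thresholds]

    rng = np.random.default_rng(3)
    mri_slice = rng.random((128, 128))           # stand-in for an MRI slice
    masks = candidate_masks(mri_slice, np.linspace(0.3, 0.7, 9))
    # The user answers "acceptable? yes/no" per mask; accepted masks become
    # training data for the segmentation network.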

Perceiving Systems Conference Paper VIBE: Video Inference for Human Body Pose and Shape Estimation Kocabas, M., Athanasiou, N., Black, M. J. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 5252-5262, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), June 2020 (Published)
Human motion is fundamental to understanding behavior. Despite progress on single-image 3D pose and shape estimation, existing video-based state-of-the-art methods fail to produce accurate and natural motion sequences due to a lack of ground-truth 3D motion data for training. To address this problem, we propose “Video Inference for Body Pose and Shape Estimation” (VIBE), which makes use of an existing large-scale motion capture dataset (AMASS) together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty is an adversarial learning framework that leverages AMASS to discriminate between real human motions and those produced by our temporal pose and shape regression networks. We define a temporal network architecture and show that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels. We perform extensive experimentation to analyze the importance of motion and demonstrate the effectiveness of VIBE on challenging 3D pose estimation datasets, achieving state-of-the-art performance. Code and pretrained models are available at https://github.com/mkocabas/VIBE
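
The adversarial ingredient can be sketched as a sequence discriminator: a recurrent network scores whether a pose sequence looks like real human motion, with real sequences drawn from AMASS. The architecture below is an illustrative stand-in, not VIBE's exact design.

    import torch
    import torch.nn as nn

    class MotionDiscriminator(nn.Module):
        def __init__(self, pose_dim=72, hidden=1024):
            super().__init__()
            self.gru = nn.GRU(pose_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)

        def forward(self, pose_seq):          # (B, T, pose_dim)
            _, h = self.gru(pose_seq)
            return self.head(h[-1])           # one real/fake score per sequence

    disc = MotionDiscriminator()
    fake = torch.randn(8, 16, 72)             # regressor output (stand-in)
    adv_loss = (disc(fake) - 1).pow(2).mean() # least-squares GAN generator loss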

Perceiving Systems Article General Movement Assessment from videos of computed 3D infant body models is equally effective compared to conventional RGB Video rating Schroeder, S., Hesse, N., Weinberger, R., Tacke, U., Gerstl, L., Hilgendorff, A., Heinen, F., Arens, M., Bodensteiner, C., Dijkstra, L. J., Pujades, S., Black, M., Hadders-Algra, M. Early Human Development, 144:104967, May 2020
Background: General Movement Assessment (GMA) is a powerful tool to predict Cerebral Palsy (CP). Yet, GMA requires substantial training, hampering its implementation in clinical routine. This inspired a worldwide quest for automated GMA. Aim: To test whether a low-cost, marker-less system for three-dimensional motion capture from RGB depth sequences using a whole-body infant model may serve as the basis for automated GMA. Study design: Clinical case study at an academic neurodevelopmental outpatient clinic. Subjects: Twenty-nine high-risk infants were recruited and assessed at their clinical follow-up at 2-4 months corrected age (CA). Their neurodevelopmental outcome was assessed regularly up to 12-31 months CA. Outcome measures: GMA according to Hadders-Algra by a masked GMA-expert of conventional and computed 3D body model (“SMIL motion”) videos of the same GMs. Agreement between both GMAs was assessed, as well as the sensitivity and specificity of both methods to predict CP at ≥12 months CA. Results: The agreement of the two GMA ratings was substantial, with κ=0.66 for the classification of definitely abnormal (DA) GMs and an ICC of 0.887 (95% CI 0.762;0.947) for a more detailed GM-scoring. Five children were diagnosed with CP (four bilateral, one unilateral CP). The GMs of the child with unilateral CP were twice rated as mildly abnormal. DA-ratings of both videos predicted bilateral CP well: sensitivity 75% and 100%, specificity 88% and 92% for conventional and SMIL motion videos, respectively. Conclusions: Our computed infant 3D full-body model is an attractive starting point for automated GMA in infants at risk of CP.

Empirical Inference Perceiving Systems Probabilistic Learning Group Conference Paper From Variational to Deterministic Autoencoders Ghosh*, P., Sajjadi*, M. S. M., Vergari, A., Black, M. J., Schölkopf, B. 8th International Conference on Learning Representations (ICLR) , April 2020, *equal contribution (Published)
Variational Autoencoders (VAEs) provide a theoretically-backed framework for deep generative models. However, they often produce “blurry” images, which is linked to their training objective. Sampling in the most popular implementation, the Gaussian VAE, can be interpreted as simply injecting noise to the input of a deterministic decoder. In practice, this simply enforces a smooth latent space structure. We challenge the adoption of the full VAE framework on this specific point in favor of a simpler, deterministic one. Specifically, we investigate how substituting stochasticity with other explicit and implicit regularization schemes can lead to a meaningful latent space without having to force it to conform to an arbitrarily chosen prior. To retrieve a generative mechanism for sampling new data points, we propose to employ an efficient ex-post density estimation step that can be readily adopted both for the proposed deterministic autoencoders as well as to improve sample quality of existing VAEs. We show in a rigorous empirical study that regularized deterministic autoencoding achieves state-of-the-art sample quality on the common MNIST, CIFAR-10 and CelebA datasets.
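
The ex-post density estimation step is the easiest part to sketch: after training the deterministic autoencoder, fit a simple density model such as a Gaussian mixture to the training latents and sample from it for generation. The sizes and component count below are illustrative stand-ins.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    latents = np.random.randn(10000, 64)   # stand-in for encoder outputs z(x)
    gmm = GaussianMixture(n_components=10).fit(latents)
    z_new, _ = gmm.sample(16)              # decode z_new to generate new samples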

Perceiving Systems Article Learning Multi-Human Optical Flow Ranjan, A., Hoffmann, D. T., Tzionas, D., Tang, S., Romero, J., Black, M. J. International Journal of Computer Vision (IJCV), 128(4):873-890, April 2020 (Published)
The optical flow of humans is well known to be useful for the analysis of human action. Recent optical flow methods focus on training deep networks to approach the problem. However, the training data they use does not cover the domain of human motion. Therefore, we develop a dataset of multi-human optical flow and train optical flow networks on this dataset. We use a 3D model of the human body and motion capture data to synthesize realistic flow fields in both single- and multi-person images. We then train optical flow networks to estimate human flow fields from pairs of images. We demonstrate that our trained networks are more accurate than a wide range of top methods on held-out test data and that they can generalize well to real image sequences. The code, trained models and the dataset are available for research.
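
Accuracy in this setting is conventionally measured with the average end-point error between predicted and ground-truth flow, a standard metric rather than anything specific to this paper; a one-liner for reference:

    import numpy as np

    def epe(flow_pred, flow_gt):
        """Mean Euclidean distance between flow vectors, (..., 2) arrays."""
        return np.linalg.norm(flow_pred - flow_gt, axis=-1).mean()

    print(epe(np.zeros((100, 2)), np.ones((100, 2))))  # sqrt(2)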

Perceiving Systems Conference Paper Attractiveness and Confidence in Walking Style of Male and Female Virtual Characters Thaler, A., Bieg, A., Mahmood, N., Black, M. J., Mohler, B. J., Troje, N. F. In 2020 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW 2020), 678-679, IEEE, Piscataway, NJ, IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW 2020), March 2020 (Published)
Animated virtual characters are essential to many applications. Little is known so far about biological and personality inferences made from a virtual character’s body shape and motion. Here, we investigated how sex-specific differences in walking style relate to the perceived attractiveness and confidence of male and female virtual characters. The characters were generated by reconstructing body shape and walking motion from optical motion capture data. The results suggest that sexual dimorphism in walking style plays a different role in attributing biological and personality traits to male and female virtual characters. This finding has important implications for virtual character animation.

Perceiving Systems Article The iReAct study - A biopsychosocial analysis of the individual response to physical activity Thiel, A., Sudeck, G., Gropper, H., Maturana, F. M., Schubert, T., Srismith, D., Widmann, M., Behrens, S., Martus, P., Munz, B., Giel, K., Zipfel, S., Niess, A. M. Contemporary Clinical Trials Communications, 17:100508, March 2020 (Published)
Background Physical activity is a substantial promoter for health and well-being. Yet, while an increasing number of studies shows that the responsiveness to physical activity is highly individual, most studies approach this issue from only one perspective and neglect other contributing aspects. In reference to a biopsychosocial framework, the goal of our study is to examine how physically inactive individuals respond to two distinct standardized endurance training programs on various levels. Based on an assessment of activity- and health-related biographical experiences across the life course, our mixed-methods study analyzes the responsiveness to physical activity in the form of a transdisciplinary approach, considering physiological, epigenetic, motivational, affective, and body image-related aspects. Methods Participants are randomly assigned to two different training programs (High Intensity Interval Training vs. Moderate Intensity Continuous Training) for six weeks. After this first training period, participants switch training modes according to a two-period sequential-training-intervention (STI) design and train for another six weeks. In order to analyse baseline characteristics as well as acute and adaptive biopsychosocial responses, three extensive mixed-methods diagnostic blocks take place at the beginning (t0) of the study and after the first (t1) and the second (t2) training period, resulting in a net follow-up time of 15 weeks. The study is divided into five modules in order to cover a wide array of perspectives. Discussion The study's transdisciplinary mixed-methods design makes it possible to interlace a multitude of subjective and objective data and therefore to draw an integrated picture of the biopsychosocial efficacy of two distinct physical activity programs. The results of our study can be expected to contribute to the development and design of individualised training programs for the promotion of physical activity.

Perceiving Systems Conference Paper Chained Representation Cycling: Learning to Estimate 3D Human Pose and Shape by Cycling Between Representations Rueegg, N., Lassner, C., Black, M. J., Schindler, K. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, 4:5561-5569, AAAI Press, Palo Alto, CA, Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), February 2020 (Published)
The goal of many computer vision systems is to transform image pixels into 3D representations. Recent popular models use neural networks to regress directly from pixels to 3D object parameters. Such an approach works well when supervision is available, but in problems like human pose and shape estimation, it is difficult to obtain natural images with 3D ground truth. To go one step further, we propose a new architecture that facilitates unsupervised, or lightly supervised, learning. The idea is to break the problem into a series of transformations between increasingly abstract representations. Each step involves a cycle designed to be learnable without annotated training data, and the chain of cycles delivers the final solution. Specifically, we use 2D body part segments as an intermediate representation that contains enough information to be lifted to 3D, and at the same time is simple enough to be learned in an unsupervised way. We demonstrate the method by learning 3D human pose and shape from un-paired and un-annotated images. We also explore varying amounts of paired data and show that cycling greatly alleviates the need for paired data. While we present results for modeling humans, our formulation is general and can be applied to other vision problems.
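
Each step's "cycle" is essentially a reconstruction constraint that needs no labels: a forward map between representations paired with an inverse map, trained so the round trip returns the input. A minimal illustration with stand-in maps:

    import torch

    def cycle_loss(x, f, g):
        """Penalize g(f(x)) differing from x, e.g. image -> parts -> image."""
        return (g(f(x)) - x).pow(2).mean()

    f = lambda x: x * 2.0       # stand-in: image -> intermediate representation
    g = lambda y: y / 2.0       # stand-in: inverse mapping
    loss = cycle_loss(torch.randn(4, 3, 64, 64), f, g)  # zero for this toy pair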

Perceiving Systems Patent Machine learning systems and methods for augmenting images Black, M., Rachlin, E., Lee, E., Heron, N., Loper, M., Weiss, A., Smith, D. (US Patent 10,529,137 B1), January 2020
Disclosed is a method including receiving visual input comprising a human within a scene, detecting a pose associated with the human using a trained machine learning model that detects human poses to yield a first output, estimating a shape (and optionally a motion) associated with the human using a trained machine learning model that detects shape (and optionally motion) to yield a second output, recognizing the scene associated with the visual input using a trained convolutional neural network which determines information about the human and other objects in the scene to yield a third output, and augmenting reality within the scene by leveraging one or more of the first output, the second output, and the third output to place 2D and/or 3D graphics in the scene.

Perceiving Systems Article Influence of Physical Activity Interventions on Body Representation: A Systematic Review Srismith, D., Wider, L., Wong, H. Y., Zipfel, S., Thiel, A., Giel, K. E., Behrens, S. C. Frontiers in Psychiatry, 11:99, 2020 (Published)
Distorted representation of one's own body is a diagnostic criterion and core psychopathology of disorders such as anorexia nervosa and body dysmorphic disorder. Previous literature has raised the possibility of utilising physical activity intervention (PI) as a treatment option for individuals suffering from poor body satisfaction, which is traditionally regarded as a systematic distortion in "body image." In this systematic review, conducted according to the PRISMA statement, the evidence on the effectiveness of PI on body representation outcomes is synthesised. We provide an update of 34 longitudinal studies evaluating the effectiveness of different types of PIs on body representation. No systematic risk of bias within or across studies was identified. The reviewed studies show that the implementation of structured PIs may be efficacious in increasing individuals' satisfaction with their own body, and thus improving their subjective body image related assessments. However, there is no clear evidence regarding an additional or interactive effect of PI when implemented in conjunction with established treatments for clinical populations. We argue for theoretically sound, mechanism-oriented, multimethod approaches to future investigations on body image disturbance. Specifically, we highlight the need to consider expanding the theoretical framework for the investigation of body representation disturbances to include further body representations besides body image.