Expressive Body Models

Until recently, human pose has often been represented by 10-12 body joints in 2D or 3D. This representation is inspired by Johansson's moving light displays, which showed that some human actions can be recognized from the motion of the major joints of the body. We have argued that such representations are too impoverished to model human behavior: humans express their emotions through the surface of their face and manipulate the world through the surface of their bodies.

Consequently, Perceiving Systems has focused on modeling and inferring 3D human pose and shape (HPS) using expressive 3D body models that capture the surface of the body either explicitly as a mesh or implicitly as a neural network. Such 3D shape models allow us to capture human-scene contact and provide information about a person related to their health, age, fitness, and clothing size.

We introduced the SMPL body model in 2015 [] and made it available for research and commercial licensing. SMPL is realistic, efficient, posable, and compatible with most graphics packages. It is also differentiable and easy to integrate into optimization or deep learning methods. Since its release, it has become the de facto standard in the field and is widely used in industry and academia.
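To make "posable" and "differentiable" concrete, here is a minimal NumPy sketch of a blend-shape body model with linear blend skinning. It is a toy under stated assumptions, not the SMPL code: the function names, argument layout, and the three-vertex example are hypothetical, and pose-corrective blend shapes are left out. Every step is a matrix product or a smooth function, which is what makes models of this kind easy to place inside an optimizer or a network.

```python
import numpy as np


def rodrigues(axis_angle):
    """Convert an axis-angle vector (3,) into a 3x3 rotation matrix."""
    axis_angle = np.asarray(axis_angle, dtype=float)
    angle = np.linalg.norm(axis_angle)
    if angle < 1e-8:
        return np.eye(3)
    k = axis_angle / angle
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)


def toy_body_model(template, shape_dirs, betas, joint_regressor,
                   parents, pose, skin_weights):
    """Toy blend-shape + linear-blend-skinning model (hypothetical layout).

    template:        (V, 3) rest-pose vertices
    shape_dirs:      (V, 3, B) shape blend shapes
    betas:           (B,) shape coefficients
    joint_regressor: (J, V) maps vertices to rest joint locations
    parents:         length-J list, -1 for the root; parents precede children
    pose:            (J, 3) per-joint axis-angle rotations
    skin_weights:    (V, J) skinning weights, rows sum to 1
    """
    # 1. Shape: add a linear combination of blend shapes to the template.
    v_shaped = template + shape_dirs @ betas
    # 2. Joints are a linear function of the shaped vertices.
    joints = joint_regressor @ v_shaped
    # 3. Forward kinematics: compose rotations along the kinematic tree.
    num_joints = len(parents)
    world = np.zeros((num_joints, 4, 4))
    for j in range(num_joints):
        local = np.eye(4)
        local[:3, :3] = rodrigues(pose[j])
        if parents[j] < 0:
            local[:3, 3] = joints[j]
            world[j] = local
        else:
            local[:3, 3] = joints[j] - joints[parents[j]]
            world[j] = world[parents[j]] @ local
    # 4. Make each transform relative to the joint's rest location.
    for j in range(num_joints):
        undo_rest = np.eye(4)
        undo_rest[:3, 3] = -joints[j]
        world[j] = world[j] @ undo_rest
    # 5. Linear blend skinning: per-vertex weighted sum of joint transforms.
    v_hom = np.concatenate([v_shaped, np.ones((len(v_shaped), 1))], axis=1)
    blended = np.einsum("vj,jab->vab", skin_weights, world)
    return np.einsum("vab,vb->va", blended, v_hom)[:, :3]


# Usage: a three-vertex "arm" with a root joint and one elbow-like joint.
template = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
shape_dirs = np.zeros((3, 3, 1))
joint_regressor = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
skin_weights = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
pose = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, np.pi / 2]])  # bend second joint
posed = toy_body_model(template, shape_dirs, np.zeros(1), joint_regressor,
                       [-1, 0], pose, skin_weights)
```

In this toy, the vertex attached to the second joint swings around that joint while the other two stay put.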

SMPL also has limitations, some of which have been addressed by STAR [], which is learned from thousands more 3D body scans and has local pose corrective blend shapes.

We have steadily improved on SMPL, adding hands and faces to create SMPL-X []. Most recently we have combined this with the detailed facial model from DECA [] to increase expressive realism. We always combine these models with methods to estimate them from images. Our most recent neural regression method, PIXIE [], uses a moderator to assess the reliability of face and hand regressors before integrating the body, face, and hand features.

Current work is extending these models to include clothing. For example, CAPE [] uses a convolutional mesh VAE to learn a generative model of clothing that is compatible with SMPL. See also our work on learning implicit models of clothed 3D humans.

This work on modeling humans is the foundation for our analysis of human movement, emotion, and behavior.


Datasets

ExPose (ECCV 2020) dataset

A curated dataset that contains 32,617 pairs of:
- an in-the-wild RGB image, and
- an expressive whole-body 3D human reconstruction (SMPL-X).

The dataset can be used to train models that predict expressive 3D
human bodies from a single RGB image as input, similar to ExPose.

Members

Perceiving Systems

- Michael Black, Emeritus / Acting Director
- Dimitris Tzionas, Guest Scientist
- Timo Bolkart, Research Scientist
- Ahmed Osman, Guest Scientist
- Vassilis Choutas
- Nima Ghorbani, Guest Scientist
- Georgios Pavlakos, Intern

Publications

Perceiving Systems, Conference Paper. Collaborative Regression of Expressive Bodies using Moderation. Feng, Y., Choutas, V., Bolkart, T., Tzionas, D., Black, M. J. In 2021 International Conference on 3D Vision (3DV 2021), 792-804, IEEE, Piscataway, NJ, December 2021 (Published).

Abstract

Recovering expressive humans from images is essential for understanding human behavior. Methods that estimate 3D bodies, faces, or hands have progressed significantly, yet separately. Face methods recover accurate 3D shape and geometric details, but need a tight crop and struggle with extreme views and low resolution. Whole-body methods are robust to a wide range of poses and resolutions, but provide only a rough 3D face shape without details like wrinkles. To get the best of both worlds, we introduce PIXIE, which produces animatable, whole-body 3D avatars with realistic facial detail, from a single image. For this, PIXIE uses two key observations. First, existing work combines independent estimates from body, face, and hand experts, by trusting them equally. PIXIE introduces a novel moderator that merges the features of the experts, weighted by their confidence. All part experts can contribute to the whole, using SMPL-X’s shared shape space across all body parts. Second, human shape is highly correlated with gender, but existing work ignores this. We label training images as male, female, or non-binary, and train PIXIE to infer “gendered” 3D body shapes with a novel shape loss. In addition to 3D body pose and shape parameters, PIXIE estimates expression, illumination, albedo and 3D facial surface displacements. Quantitative and qualitative evaluation shows that PIXIE estimates more accurate whole-body shape and detailed face shape than the state of the art. Models and code are available at https://pixie.is.tue.mpg.de.
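The idea of merging expert features "weighted by their confidence" can be sketched in a few lines. The following is an illustrative toy, not the PIXIE implementation: the single linear layer standing in for the moderator network, the parameter names `W` and `b`, and the `temperature` argument are all assumptions made for this example.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1D array."""
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()


def moderate(body_feat, part_feat, W, b, temperature=1.0):
    """Fuse a body-expert feature with a part-expert feature (face or hand).

    body_feat, part_feat: (D,) feature vectors from the two experts
    W: (2, 2 * D), b: (2,)  hypothetical moderator parameters
    Returns the fused (D,) feature and the two confidence weights.
    """
    # The moderator looks at both features and scores each expert.
    logits = W @ np.concatenate([body_feat, part_feat]) + b
    # Softmax turns the scores into weights that are positive and sum to 1.
    w_body, w_part = softmax(logits / temperature)
    # Confidence-weighted merge instead of trusting both experts equally.
    fused = w_body * body_feat + w_part * part_feat
    return fused, (w_body, w_part)


# Usage with random stand-in features and parameters.
rng = np.random.default_rng(0)
D = 8
body_feat, face_feat = rng.normal(size=D), rng.normal(size=D)
W, b = rng.normal(size=(2, 2 * D)) * 0.1, np.zeros(2)
fused, (w_body, w_face) = moderate(body_feat, face_feat, W, b)
```

With equal scores the fusion reduces to a plain average, which corresponds to the "trusting them equally" baseline described above. When the part expert sees a blurry or occluded crop, a trained moderator could shift the weight toward the body feature.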


Perceiving Systems, Conference Paper. Monocular Expressive Body Regression through Body-Driven Attention. Choutas, V., Pavlakos, G., Bolkart, T., Tzionas, D., Black, M. J. In Computer Vision - ECCV 2020, 10:20-40, Lecture Notes in Computer Science, 12355 (Editors: Vedaldi, Andrea and Bischof, Horst and Brox, Thomas and Frahm, Jan-Michael), Springer, Cham, 16th European Conference on Computer Vision (ECCV 2020), August 2020 (Published).

Abstract

To understand how people look, interact, or perform tasks, we need to quickly and accurately capture their 3D body, face, and hands together from an RGB image. Most existing methods focus only on parts of the body. A few recent approaches reconstruct full expressive 3D humans from images using 3D body models that include the face and hands. These methods are optimization-based and thus slow, prone to local optima, and require 2D keypoints as input. We address these limitations by introducing ExPose (EXpressive POse and Shape rEgression), which directly regresses the body, face, and hands, in SMPL-X format, from an RGB image. This is a hard problem due to the high dimensionality of the body and the lack of expressive training data. Additionally, hands and faces are much smaller than the body, occupying very few image pixels. This makes hand and face estimation hard when body images are downscaled for neural networks. We make three main contributions. First, we account for the lack of training data by curating a dataset of SMPL-X fits on in-the-wild images. Second, we observe that body estimation localizes the face and hands reasonably well. We introduce body-driven attention for face and hand regions in the original image to extract higher-resolution crops that are fed to dedicated refinement modules. Third, these modules exploit part-specific knowledge from existing face and hand-only datasets. ExPose estimates expressive 3D humans more accurately than existing optimization methods at a small fraction of the computational cost. Our data, model and code are available for research at https://expose.is.tue.mpg.de.
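The cropping step behind body-driven attention can be illustrated as follows. This is a simplified sketch, not the ExPose code: it assumes the body network has already produced 2D keypoints for a hand or face in the downscaled image, and the function names, the `padding` factor, and the square-box choice are hypothetical.

```python
import numpy as np


def part_crop_box(keypoints_lowres, scale_to_original, padding=1.2):
    """Square box (x0, y0, x1, y1) in original-image pixels around a part.

    keypoints_lowres:  (N, 2) part keypoints (x, y) in the downscaled image
    scale_to_original: factor mapping downscaled to original coordinates
    padding:           enlarges the box so the whole part stays inside
    """
    pts = np.asarray(keypoints_lowres, dtype=float) * scale_to_original
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    center = (lo + hi) / 2.0
    half = (hi - lo).max() * padding / 2.0
    return np.array([center[0] - half, center[1] - half,
                     center[0] + half, center[1] + half])


def crop_part(image, box):
    """Cut the box out of the original image, clipped to the image bounds."""
    height, width = image.shape[:2]
    x0 = max(0, int(np.floor(box[0])))
    y0 = max(0, int(np.floor(box[1])))
    x1 = min(width, int(np.ceil(box[2])))
    y1 = min(height, int(np.ceil(box[3])))
    return image[y0:y1, x0:x1]


# Usage: keypoints found at low resolution, crop taken at full resolution.
original = np.zeros((1024, 1024, 3), dtype=np.uint8)
hand_keypoints = np.array([[60.0, 70.0], [66.0, 82.0], [58.0, 78.0]])
box = part_crop_box(hand_keypoints, scale_to_original=4.0)
hand_crop = crop_part(original, box)  # would be fed to a hand refinement module
```

The point of the design is that the crop comes from the original image, so the part module sees more pixels of the hand or face than the downscaled body input contains.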


Perceiving Systems, Conference Paper. Expressive Body Capture: 3D Hands, Face, and Body from a Single Image. Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A. A. A., Tzionas, D., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 10975-10985, June 2019.

Abstract

To facilitate the analysis of human actions, interactions and emotions, we compute a 3D model of human body pose, hand pose, and facial expression from a single monocular image. To achieve this, we use thousands of 3D scans to train a new, unified, 3D model of the human body, SMPL-X, that extends SMPL with fully articulated hands and an expressive face. Learning to regress the parameters of SMPL-X directly from images is challenging without paired images and 3D ground truth. Consequently, we follow the approach of SMPLify, which estimates 2D features and then optimizes model parameters to fit the features. We improve on SMPLify in several significant ways: (1) we detect 2D features corresponding to the face, hands, and feet and fit the full SMPL-X model to these; (2) we train a new neural network pose prior using a large MoCap dataset; (3) we define a new interpenetration penalty that is both fast and accurate; (4) we automatically detect gender and the appropriate body models (male, female, or neutral); (5) our PyTorch implementation achieves a speedup of more than 8x over Chumpy. We use the new method, SMPLify-X, to fit SMPL-X to both controlled images and images in the wild. We evaluate 3D accuracy on a new curated dataset comprising 100 images with pseudo ground-truth. This is a step towards automatic expressive human capture from monocular RGB data. The models, code, and data are available for research purposes at https://smpl-x.is.tue.mpg.de.
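The fit-by-optimization idea (estimate 2D features, then optimize parameters until the model projects onto them) can be shown on a deliberately small problem. The sketch below is not SMPLify-X: it optimizes only a hypothetical weak-perspective camera (scale and 2D translation) for fixed 3D joints, using plain gradient descent on a confidence-weighted reprojection error. The real method also optimizes pose and shape and adds priors and an interpenetration penalty, none of which appear here.

```python
import numpy as np


def project(joints3d, scale, trans):
    """Weak-perspective projection: drop depth, then scale and translate."""
    return scale * joints3d[:, :2] + trans


def fit_camera(joints3d, keypoints2d, conf, steps=300, lr=0.1):
    """Minimize sum_i w_i * ||project(X_i) - x_i||^2 by gradient descent.

    joints3d:    (N, 3) model joints
    keypoints2d: (N, 2) detected 2D keypoints
    conf:        (N,) detection confidences; low values count less
    The step size assumes coordinates of roughly unit scale.
    """
    scale, trans = 1.0, np.zeros(2)
    weights = conf / conf.sum()  # normalize so the step size is stable
    for _ in range(steps):
        residual = project(joints3d, scale, trans) - keypoints2d  # (N, 2)
        weighted = weights[:, None] * residual
        # Analytic gradients of the weighted squared error.
        grad_scale = 2.0 * np.sum(weighted * joints3d[:, :2])
        grad_trans = 2.0 * weighted.sum(axis=0)
        scale -= lr * grad_scale
        trans -= lr * grad_trans
    return scale, trans


# Usage: recover a known camera from synthetic "detections".
joints3d = np.array([[-1.0, -1.0, 0.3], [1.0, -1.0, 0.1],
                     [1.0, 1.0, -0.2], [-1.0, 1.0, 0.0]])
keypoints2d = project(joints3d, 2.5, np.array([0.3, -0.7]))
scale, trans = fit_camera(joints3d, keypoints2d, conf=np.ones(4))
```

Weighting each residual by detector confidence lets unreliable keypoints contribute little to the fit; a keypoint with zero confidence is ignored entirely.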
