Optimizing Human Pose and Shape

Data-driven methods that directly regress 3D humans from 2D images are widely popular, but optimization-based methods continue to play an important role. Although typically slower than regression methods, optimization approaches require no training data, can be quickly adapted to new problems, and produce image-aligned results. In our view, the two approaches are not competing but complementary.

Optimization-based approaches directly fit a 3D body model like SMPL to image observations such as detected joint locations, edges, silhouettes, and semantic segmentations. We introduced the first such method, SMPLify [], which optimizes SMPL pose and shape to minimize the 2D error between detected joints and projected SMPL joints. Because estimating 3D from 2D is inherently ambiguous, SMPLify introduced a pose prior trained on mocap data and a term that discourages self-penetration.
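The structure of such an objective (a 2D data term plus priors, minimized over pose and shape) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two-segment planar "body", the constants `FOCAL`, `DEPTH`, `L1`, and `L2`, the quadratic priors, and the plain gradient descent with backtracking are toy stand-ins, not the SMPL model, the learned mocap prior, or the optimizer used in SMPLify.

```python
import math

# All constants are illustrative assumptions, not values from SMPLify.
FOCAL, DEPTH = 500.0, 5.0   # focal length (pixels) and fixed subject depth (m)
L1, L2 = 0.3, 0.3           # rest lengths of the two limb segments (m)

def model_joints(params):
    """Toy planar 'body model': 3D elbow and wrist for (theta1, theta2, beta)."""
    t1, t2, beta = params
    ex = beta * L1 * math.cos(t1)
    ey = beta * L1 * math.sin(t1)
    wx = ex + beta * L2 * math.cos(t1 + t2)
    wy = ey + beta * L2 * math.sin(t1 + t2)
    return [(ex, ey, DEPTH), (wx, wy, DEPTH)]

def project(point):
    """Perspective projection to pixel offsets from the image center."""
    x, y, z = point
    return (FOCAL * x / z, FOCAL * y / z)

def energy(params, detections, conf, w_pose=1.0, w_shape=1.0):
    """Confidence-weighted 2D joint error plus quadratic pose and shape priors."""
    data = 0.0
    for joint, (du, dv), c in zip(model_joints(params), detections, conf):
        u, v = project(joint)
        data += c * ((u - du) ** 2 + (v - dv) ** 2)
    t1, t2, beta = params
    prior = w_pose * (t1 ** 2 + t2 ** 2) + w_shape * (beta - 1.0) ** 2
    return data + prior

def numerical_gradient(f, x, h=1e-6):
    """Central-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += h
        lo[i] -= h
        grad.append((f(hi) - f(lo)) / (2.0 * h))
    return grad

def fit(detections, conf, init=(0.0, 0.0, 1.0), iters=1000):
    """Gradient descent with backtracking (Armijo) line search."""
    f = lambda p: energy(p, detections, conf)
    x = list(init)
    for _ in range(iters):
        g = numerical_gradient(f, x)
        g2 = sum(gi * gi for gi in g)
        e0, step, accepted = f(x), 1.0, False
        for _halving in range(40):
            cand = [xi - step * gi for xi, gi in zip(x, g)]
            if f(cand) <= e0 - 0.3 * step * g2:
                x, accepted = cand, True
                break
            step *= 0.5
        if not accepted:
            break  # no further decrease found; treat as converged
    return x

# Synthetic "detections" generated from known parameters, then recovered.
true_params = (0.6, -0.4, 1.1)
detections = [project(j) for j in model_joints(true_params)]
conf = [1.0, 1.0]
estimate = fit(detections, conf)
```

In this toy the priors only add a small bias because the data term is unambiguous. In a real monocular setting the data term alone admits many 3D explanations, which is where the prior terms do most of their work.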

With SMPLify-X [], we extended this concept to estimate the expressive SMPL-X model by fitting it to 2D landmarks from OpenPose. SMPLify-X introduced several improvements, including a gender classifier so that the estimated body shapes better match the image. We also introduced a better VAE-based pose prior, VPoser, trained on AMASS, and we improved the interpenetration detection.

Because images with ground-truth human pose and shape are hard to obtain, these optimization methods provide critical pseudo ground truth for training deep regression networks. For example, we use SMPLify-X to obtain SMPL-X fits to images and use these to train ExPose []. With SPIN [], we showed that an even tighter integration of regression and optimization is valuable and synergistic. SPIN uses a regressor to initialize SMPLify, which is then run for a few optimization steps, improving the fit. These improved fits are then used to retrain the regressor. By doing this in a loop, we incrementally obtain better training data and a better regressor. This training approach is now widely used.
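The regress, optimize, retrain loop described for SPIN can be sketched with a deliberately tiny example. All names and modeling choices here are hypothetical: the "regressor" is a one-parameter linear map fit in closed form, `observe` is a stand-in for projecting a pose to 2D evidence, and the data are synthetic. SPIN itself trains a deep network and runs SMPLify inside the training loop; this sketch only mirrors the control flow.

```python
def observe(theta):
    """Toy stand-in for projecting a pose to 2D evidence (assumed: d = 2 * theta)."""
    return 2.0 * theta

def optimize_fit(theta_init, observation, steps=2, lr=0.0625):
    """A few gradient steps on (observe(theta) - observation)^2,
    initialized from the regressor's prediction."""
    theta = theta_init
    for _ in range(steps):
        residual = observe(theta) - observation
        theta -= lr * 2.0 * residual * 2.0  # chain rule: d observe / d theta = 2
    return theta

def train_regressor(features, targets):
    """Closed-form least squares for a one-parameter regressor theta = w * x."""
    num = sum(x * t for x, t in zip(features, targets))
    den = sum(x * x for x in features)
    return num / den

def spin_style_loop(features, observations, w=0.0, rounds=6):
    """Alternate: regress -> refine by optimization -> retrain on refined fits."""
    history = [w]
    for _ in range(rounds):
        inits = [w * x for x in features]                      # regressor output
        fits = [optimize_fit(t0, d) for t0, d in zip(inits, observations)]
        w = train_regressor(features, fits)                    # retrain on fits
        history.append(w)
    return w, history

TRUE_W = 2.0  # parameter that generated the synthetic observations
features = [-1.0, -0.5, 0.5, 1.0, 2.0]
observations = [observe(TRUE_W * x) for x in features]
w_final, history = spin_style_loop(features, observations)
```

Each round, the short optimization only partially corrects the regressor's prediction, yet the retrained regressor starts the next round closer to the target, so the pseudo ground truth and the regressor improve together.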

The basic SMPLify(-X) approach is easily adapted to new problems, making it a foundational tool in our research. For example, we extended it to perform multi-view fitting and use silhouettes [], which we exploited to create the AGORA [] and SPEC-MTP [] datasets. We use it with aerial vehicles to simultaneously solve for camera extrinsics and body pose in multi-view images []. We adapted it to RGB-D images by including a depth loss and scene contact constraints in the objective function, enabling the creation of the PROX dataset []. We added constraints related to self-contact and exploited this to create the training and test data for TUCH [].
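Adapting the objective typically means adding terms to it. As a rough sketch of how depth and scene terms could be written, the functions below implement a penetration penalty, a contact term gated by distance and orientation, and a depth term. The flat-floor signed distance function, the thresholds, the squared penalties, and the weights are all assumptions chosen for illustration and are not the formulation used for PROX.

```python
import math

def floor_sdf(point, floor_height=0.0):
    """Signed distance to a horizontal floor (toy scene; negative means below)."""
    return point[1] - floor_height

def penetration_loss(vertices, sdf):
    """Penalize body vertices that lie inside scene geometry (negative distance)."""
    return sum(min(0.0, sdf(v)) ** 2 for v in vertices)

def contact_loss(vertices, normals, contact_ids, sdf,
                 scene_normal=(0.0, 1.0, 0.0), max_dist=0.05, max_angle_deg=45.0):
    """Pull designated contact vertices onto the surface, but only if they are
    already close to it and their normals face it."""
    cos_limit = math.cos(math.radians(max_angle_deg))
    loss = 0.0
    for i in contact_ids:
        dist = abs(sdf(vertices[i]))
        facing = -sum(a * b for a, b in zip(normals[i], scene_normal))
        if dist < max_dist and facing > cos_limit:
            loss += dist ** 2
    return loss

def depth_loss(vertices, depth_points):
    """Mean squared distance from each observed depth point to its nearest vertex."""
    total = 0.0
    for p in depth_points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, v)) for v in vertices)
    return total / len(depth_points)

def scene_terms(vertices, normals, contact_ids, depth_points,
                w_pen=1.0, w_con=1.0, w_depth=1.0):
    """Weighted sum that would be added to the 2D fitting objective."""
    return (w_pen * penetration_loss(vertices, floor_sdf)
            + w_con * contact_loss(vertices, normals, contact_ids, floor_sdf)
            + w_depth * depth_loss(vertices, depth_points))

# Three toy vertices: head, left foot (hovering), right foot (below the floor).
vertices = [(0.0, 1.0, 0.0), (0.0, 0.02, 0.0), (0.2, -0.03, 0.0)]
normals = [(0.0, 1.0, 0.0), (0.0, -1.0, 0.0), (0.0, -1.0, 0.0)]
contact_ids = [1, 2]                   # indices of the foot vertices
depth_points = [(0.0, 1.1, 0.0)]       # one observed 3D point near the head
total = scene_terms(vertices, normals, contact_ids, depth_points)
```

The gating in `contact_loss` is the design point worth noting: contact is only encouraged where it is already plausible, so the term does not drag distant body parts toward arbitrary surfaces.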

Members

Michael Black, Emeritus / Acting Director (Perceiving Systems)

Christoph Lassner, Affiliated Researcher (Perceiving Systems)

Javier Romero, Affiliated Researcher (Perceiving Systems)

Martin Kiefel, Postdoctoral Researcher (Robust Machine Learning)

Federica Bogo (Perceiving Systems)

Peter Vincent Gehler, Research Group Leader (Perceiving Systems)

Angjoo Kanazawa (Perceiving Systems)

Georgios Pavlakos, Intern (Perceiving Systems)

Vassilis Choutas (Perceiving Systems)

Nima Ghorbani, Guest Scientist (Perceiving Systems)

Timo Bolkart, Research Scientist (Perceiving Systems)

Ahmed Osman, Guest Scientist (Perceiving Systems)

Dimitris Tzionas, Guest Scientist (Perceiving Systems)

Publications

Perceiving Systems Conference Paper SPEC: Seeing People in the Wild with an Estimated Camera Kocabas, M., Huang, C. P., Tesch, J., Müller, L., Hilliges, O., Black, M. J. In Proc. International Conference on Computer Vision (ICCV), 11015-11025, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)

Abstract

Due to the lack of camera parameter information for in-the-wild images, existing 3D human pose and shape (HPS) estimation methods make several simplifying assumptions: weak-perspective projection, large constant focal length, and zero camera rotation. These assumptions often do not hold and we show, quantitatively and qualitatively, that they cause errors in the reconstructed 3D shape and pose. To address this, we introduce SPEC, the first in-the-wild 3D HPS method that estimates the perspective camera from a single image and employs this to reconstruct 3D human bodies more accurately. First, we train a neural network to estimate the field of view, camera pitch, and roll given an input image. We employ novel losses that improve the calibration accuracy over previous work. We then train a novel network that concatenates the camera calibration to the image features and uses these together to regress 3D body shape and pose. SPEC is more accurate than the prior art on the standard benchmark (3DPW) as well as two new datasets with more challenging camera views and varying focal lengths. Specifically, we create a new photorealistic synthetic dataset (SPEC-SYN) with ground truth 3D bodies and a novel in-the-wild dataset (SPEC-MTP) with calibration and high-quality reference bodies.


Perceiving Systems Conference Paper On Self-Contact and Human Pose Müller, L., Osman, A. A. A., Tang, S., Huang, C. P., Black, M. J. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 9985-9994, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), June 2021 (Published)

Abstract

People touch their face 23 times an hour, they cross their arms and legs, put their hands on their hips, etc. While many images of people contain some form of self-contact, current 3D human pose and shape (HPS) regression methods typically fail to estimate this contact. To address this, we develop new datasets and methods that significantly improve human pose estimation with self-contact. First, we create a dataset of 3D Contact Poses (3DCP) containing SMPL-X bodies fit to 3D scans as well as poses from AMASS, which we refine to ensure good contact. Second, we leverage this to create the Mimic-The-Pose (MTP) dataset of images, collected via Amazon Mechanical Turk, containing people mimicking the 3DCP poses with self-contact. Third, we develop a novel HPS optimization method, SMPLify-XMC, that includes contact constraints and uses the known 3DCP body pose during fitting to create near ground-truth poses for MTP images. Fourth, for more image variety, we label a dataset of in-the-wild images with Discrete Self-Contact (DSC) information and use another new optimization method, SMPLify-DC, that exploits discrete contacts during pose optimization. Finally, we use our datasets during SPIN training to learn a new 3D human pose regressor, called TUCH (Towards Understanding Contact in Humans). We show that the new self-contact training data significantly improves 3D human pose estimates on withheld test data and existing datasets like 3DPW. Not only does our method improve results for self-contact poses, but it also improves accuracy for non-contact poses. The code and data are available for research purposes at https://tuch.is.tue.mpg.de.


Perceiving Systems Conference Paper Monocular Expressive Body Regression through Body-Driven Attention Choutas, V., Pavlakos, G., Bolkart, T., Tzionas, D., Black, M. J. In Computer Vision - ECCV 2020, 10:20-40, Lecture Notes in Computer Science, 12355, (Editors: Vedaldi, Andrea and Bischof, Horst and Brox, Thomas and Frahm, Jan-Michael), Springer, Cham, 16th European Conference on Computer Vision (ECCV 2020), August 2020 (Published)

Abstract

To understand how people look, interact, or perform tasks, we need to quickly and accurately capture their 3D body, face, and hands together from an RGB image. Most existing methods focus only on parts of the body. A few recent approaches reconstruct full expressive 3D humans from images using 3D body models that include the face and hands. These methods are optimization-based and thus slow, prone to local optima, and require 2D keypoints as input. We address these limitations by introducing ExPose (EXpressive POse and Shape rEgression), which directly regresses the body, face, and hands, in SMPL-X format, from an RGB image. This is a hard problem due to the high dimensionality of the body and the lack of expressive training data. Additionally, hands and faces are much smaller than the body, occupying very few image pixels. This makes hand and face estimation hard when body images are downscaled for neural networks. We make three main contributions. First, we account for the lack of training data by curating a dataset of SMPL-X fits on in-the-wild images. Second, we observe that body estimation localizes the face and hands reasonably well. We introduce body-driven attention for face and hand regions in the original image to extract higher-resolution crops that are fed to dedicated refinement modules. Third, these modules exploit part-specific knowledge from existing face and hand-only datasets. ExPose estimates expressive 3D humans more accurately than existing optimization methods at a small fraction of the computational cost. Our data, model and code are available for research at https://expose.is.tue.mpg.de.


Perceiving Systems Conference Paper Markerless Outdoor Human Motion Capture Using Multiple Autonomous Micro Aerial Vehicles Saini, N., Price, E., Tallamraju, R., Enficiaud, R., Ludwig, R., Martinović, I., Ahmad, A., Black, M. Proceedings 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 823-832, IEEE, International Conference on Computer Vision (ICCV), October 2019 (Published)

Abstract

Capturing human motion in natural scenarios means moving motion capture out of the lab and into the wild. Typical approaches rely on fixed, calibrated cameras and reflective markers on the body, significantly limiting the motions that can be captured. To make motion capture truly unconstrained, we describe the first fully autonomous outdoor capture system based on flying vehicles. We use multiple micro-aerial vehicles (MAVs), each equipped with a monocular RGB camera, an IMU, and a GPS receiver module. These detect the person, optimize their position, and localize themselves approximately. We then develop a markerless motion capture method that is suitable for this challenging scenario with a distant subject, viewed from above, with approximately calibrated and moving cameras. We combine multiple state-of-the-art 2D joint detectors with a 3D human body model and a powerful prior on human pose. We jointly optimize for 3D body pose and camera pose to robustly fit the 2D measurements. To our knowledge, this is the first successful demonstration of outdoor, full-body, markerless motion capture from autonomous flying vehicles.


Perceiving Systems Conference Paper Resolving 3D Human Pose Ambiguities with 3D Scene Constraints Hassan, M., Choutas, V., Tzionas, D., Black, M. J. In International Conference on Computer Vision (ICCV), 2282-2292, October 2019 (Published)

Abstract

To understand and analyze human behavior, we need to capture humans moving in, and interacting with, the world. Most existing methods perform 3D human pose estimation without explicitly considering the scene. We observe, however, that the world constrains the body and vice versa. To motivate this, we show that current 3D human pose estimation methods produce results that are not consistent with the 3D scene. Our key contribution is to exploit static 3D scene structure to better estimate human pose from monocular images. The method enforces Proximal Relationships with Object eXclusion and is called PROX. To test this, we collect a new dataset composed of 12 different 3D scenes and RGB sequences of 20 subjects moving in and interacting with the scenes. We represent human pose using the 3D human body model SMPL-X and extend SMPLify-X to estimate body pose using scene constraints. We make use of the 3D scene information by formulating two main constraints. The interpenetration constraint penalizes intersection between the body model and the surrounding 3D scene. The contact constraint encourages specific parts of the body to be in contact with scene surfaces if they are close enough in distance and orientation. For quantitative evaluation we capture a separate dataset with 180 RGB frames in which the ground-truth body pose is estimated using a motion-capture system. We show quantitatively that introducing scene constraints significantly reduces 3D joint error and vertex error. Our code and data are available for research at https://prox.is.tue.mpg.de.


Perceiving Systems Conference Paper Expressive Body Capture: 3D Hands, Face, and Body from a Single Image Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A. A. A., Tzionas, D., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 10975-10985, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

Abstract

To facilitate the analysis of human actions, interactions and emotions, we compute a 3D model of human body pose, hand pose, and facial expression from a single monocular image. To achieve this, we use thousands of 3D scans to train a new, unified, 3D model of the human body, SMPL-X, that extends SMPL with fully articulated hands and an expressive face. Learning to regress the parameters of SMPL-X directly from images is challenging without paired images and 3D ground truth. Consequently, we follow the approach of SMPLify, which estimates 2D features and then optimizes model parameters to fit the features. We improve on SMPLify in several significant ways: (1) we detect 2D features corresponding to the face, hands, and feet and fit the full SMPL-X model to these; (2) we train a new neural network pose prior using a large MoCap dataset; (3) we define a new interpenetration penalty that is both fast and accurate; (4) we automatically detect gender and the appropriate body models (male, female, or neutral); (5) our PyTorch implementation achieves a speedup of more than 8x over Chumpy. We use the new method, SMPLify-X, to fit SMPL-X to both controlled images and images in the wild. We evaluate 3D accuracy on a new curated dataset comprising 100 images with pseudo ground-truth. This is a step towards automatic expressive human capture from monocular RGB data. The models, code, and data are available for research purposes at https://smpl-x.is.tue.mpg.de.


Perceiving Systems Conference Paper Unite the People: Closing the Loop Between 3D and 2D Human Representations Lassner, C., Romero, J., Kiefel, M., Bogo, F., Black, M. J., Gehler, P. V. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, 4704-4713, IEEE, Piscataway, NJ, USA, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017

Abstract

3D models provide a common ground for different representations of human bodies. In turn, robust 2D estimation has proven to be a powerful tool to obtain 3D fits “in-the-wild”. However, depending on the level of detail, it can be hard to impossible to acquire labeled data for training 2D estimators on a large scale. We propose a hybrid approach to this problem: with an extended version of the recently introduced SMPLify method, we obtain high quality 3D body model fits for multiple human pose datasets. Human annotators solely sort good and bad fits. This procedure leads to an initial dataset, UP-3D, with rich annotations. With a comprehensive set of experiments, we show how this data can be used to train discriminative models that produce results with an unprecedented level of detail: our models predict 31 segments and 91 landmark locations on the body. Using the 91 landmark pose estimator, we present state-of-the-art results for 3D human pose and shape estimation using an order of magnitude less training data and without assumptions about gender or pose in the fitting procedure. We show that UP-3D can be enhanced with these improved fits to grow in quantity and quality, which makes the system deployable on a large scale. The data, code and models are available for research purposes.


Perceiving Systems Conference Paper Towards Accurate Marker-less Human Shape and Pose Estimation over Time Huang, Y., Bogo, F., Lassner, C., Kanazawa, A., Gehler, P. V., Romero, J., Akhter, I., Black, M. J. In International Conference on 3D Vision (3DV), 421-430, 2017

Abstract

Existing markerless motion capture methods often assume known backgrounds, static cameras, and sequence-specific motion priors, limiting their application scenarios. Here we present a fully automatic method that, given multiview videos, estimates 3D human pose and body shape. We take the recently proposed SMPLify method [12] as the base method and extend it in several ways. First, we fit a 3D human body model to 2D features detected in multi-view images. Second, we use a CNN method to segment the person in each image and fit the 3D body model to the contours, further improving accuracy. Third, we utilize a generic and robust DCT temporal prior to handle the left and right side swapping issue sometimes introduced by the 2D pose estimator. Validation on standard benchmarks shows our results are comparable to the state of the art and also provide a realistic 3D shape avatar. We also demonstrate accurate results on HumanEva and on challenging monocular sequences of dancing from YouTube.


Perceiving Systems Conference Paper Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M. J. In Computer Vision – ECCV 2016, 561-578, Lecture Notes in Computer Science, Springer International Publishing, 14th European Conference on Computer Vision, October 2016

Abstract

We describe the first method to automatically estimate the 3D pose of the human body as well as its 3D shape from a single unconstrained image. We estimate a full 3D mesh and show that 2D joints alone carry a surprising amount of information about body shape. The problem is challenging because of the complexity of the human body, articulation, occlusion, clothing, lighting, and the inherent ambiguity in inferring 3D from 2D. To solve this, we first use a recently published CNN-based method, DeepCut, to predict (bottom-up) the 2D body joint locations. We then fit (top-down) a recently published statistical body shape model, called SMPL, to the 2D joints. We do so by minimizing an objective function that penalizes the error between the projected 3D model joints and detected 2D joints. Because SMPL captures correlations in human shape across the population, we are able to robustly fit it to very little data. We further leverage the 3D model to prevent solutions that cause interpenetration. We evaluate our method, SMPLify, on the Leeds Sports, HumanEva, and Human3.6M datasets, showing superior pose accuracy with respect to the state of the art.
