
Regressing Humans

Regression of 3D bodies from images. Left: HMR (top) and SPIN (bottom) are foundational methods. Middle: Recent work improves results by considering self-contact (top, TUCH), training for robustness to occlusions (middle, PARE), and estimating perspective camera parameters (bottom, SPEC). Right: We also regress expressive bodies with articulated hands and faces (top, ExPose), including detailed face shape (bottom, PIXIE).


Publications

Perceiving Systems Conference Paper Accurate 3D Body Shape Regression using Metric and Semantic Attributes Choutas, V., Müller, L., Huang, C. P., Tang, S., Tzionas, D., Black, M. J. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2708-2718, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), June 2022 (Published)
While methods that regress 3D human meshes from images have progressed rapidly, the estimated body shapes often do not capture the true human shape. This is problematic since, for many applications, accurate body shape is as important as pose. The key reason that body shape accuracy lags pose accuracy is the lack of data. While humans can label 2D joints, and these constrain 3D pose, it is not so easy to “label” 3D body shape. Since paired data with images and 3D body shape are rare, we exploit two sources of partial information: (1) we collect internet images of diverse models together with a small set of measurements; (2) we collect semantic shape attributes for a wide range of 3D body meshes and model images. Taken together, these datasets provide sufficient constraints to infer metric 3D shape. We exploit this partial and semantic data in several novel ways to train a neural network, called SHAPY, that regresses 3D human pose and shape from an RGB image. We evaluate SHAPY on public benchmarks but note that they either lack significant body shape variation, ground-truth shape, or clothing variation. Thus, we collect a new dataset for 3D human shape estimation, containing photos of people in the wild for whom we have ground-truth 3D body scans. On this new benchmark, SHAPY significantly outperforms recent state-of-the-art methods on the task of 3D body shape estimation. This is the first demonstration that a 3D body shape regressor can be trained from sparse measurements and easy-to-obtain semantic shape attributes. Our model and data are freely available for research.
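A core idea above is that sparse measurements and semantic attribute ratings, taken together, constrain metric 3D shape. As an illustrative sketch only (synthetic data, hypothetical names; SHAPY's actual attribute-to-shape module is learned jointly with the regressor), a linear attribute-to-shape mapping can be fit by least squares:

```python
import numpy as np

# Illustrative sketch: fit a linear mapping from semantic shape-attribute
# ratings (e.g. "tall", "broad shoulders") to body shape coefficients.
# All data here is synthetic; variable names are hypothetical.
rng = np.random.default_rng(0)
n_subjects, n_attributes, n_betas = 200, 15, 10

A = rng.normal(size=(n_subjects, n_attributes))    # attribute ratings
W_true = rng.normal(size=(n_attributes, n_betas))  # unknown ground-truth map
betas = A @ W_true + 0.01 * rng.normal(size=(n_subjects, n_betas))

# Least-squares fit of the attribute-to-shape mapping.
W_hat, *_ = np.linalg.lstsq(A, betas, rcond=None)

pred = A @ W_hat
mse = np.mean((pred - betas) ** 2)
print(f"reconstruction MSE: {mse:.5f}")
```

With enough rated subjects, the recovered mapping predicts shape coefficients down to the noise floor, which is the sense in which attributes "constrain" metric shape.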

Perceiving Systems Conference Paper Collaborative Regression of Expressive Bodies using Moderation Feng, Y., Choutas, V., Bolkart, T., Tzionas, D., Black, M. J. 2021 International Conference on 3D Vision (3DV 2021), 792-804, IEEE, Piscataway, NJ, International Conference on 3D Vision (3DV 2021), December 2021 (Published)
Recovering expressive humans from images is essential for understanding human behavior. Methods that estimate 3D bodies, faces, or hands have progressed significantly, yet separately. Face methods recover accurate 3D shape and geometric details, but need a tight crop and struggle with extreme views and low resolution. Whole-body methods are robust to a wide range of poses and resolutions, but provide only a rough 3D face shape without details like wrinkles. To get the best of both worlds, we introduce PIXIE, which produces animatable, whole-body 3D avatars with realistic facial detail, from a single image. For this, PIXIE uses two key observations. First, existing work combines independent estimates from body, face, and hand experts, by trusting them equally. PIXIE introduces a novel moderator that merges the features of the experts, weighted by their confidence. All part experts can contribute to the whole, using SMPL-X’s shared shape space across all body parts. Second, human shape is highly correlated with gender, but existing work ignores this. We label training images as male, female, or non-binary, and train PIXIE to infer “gendered” 3D body shapes with a novel shape loss. In addition to 3D body pose and shape parameters, PIXIE estimates expression, illumination, albedo and 3D facial surface displacements. Quantitative and qualitative evaluation shows that PIXIE estimates more accurate whole-body shape and detailed face shape than the state of the art. Models and code are available at https://pixie.is.tue.mpg.de.
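The moderator described above merges expert features weighted by confidence. A minimal numpy sketch of confidence-weighted fusion (in PIXIE the confidences come from a learned network, not from an input argument as here):

```python
import numpy as np

def moderate(features, confidences):
    """Fuse per-expert features with softmax confidence weights.

    Minimal sketch of confidence-weighted fusion in the spirit of
    PIXIE's moderator; the real model predicts the confidences with
    a learned network rather than receiving them as inputs.
    """
    features = np.asarray(features, dtype=float)   # (n_experts, dim)
    c = np.asarray(confidences, dtype=float)
    w = np.exp(c - c.max())
    w = w / w.sum()                                # softmax weights
    return w @ features                            # weighted sum, (dim,)

body_feat = np.array([1.0, 0.0, 0.0])
face_feat = np.array([0.0, 1.0, 0.0])
fused = moderate([body_feat, face_feat], confidences=[2.0, 0.0])
print(fused)  # dominated by the higher-confidence body expert
```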

Perceiving Systems Conference Paper Learning To Regress Bodies From Images Using Differentiable Semantic Rendering Dwivedi, S. K., Athanasiou, N., Kocabas, M., Black, M. J. In Proc. International Conference on Computer Vision (ICCV), 11230-11239, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Learning to regress 3D human body shape and pose (e.g. SMPL parameters) from monocular images typically exploits losses on 2D keypoints, silhouettes, and/or part segmentation when 3D training data is not available. Such losses, however, are limited because 2D keypoints do not supervise body shape and segmentations of people in clothing do not match projected minimally-clothed SMPL shapes. To exploit richer image information about clothed people, we introduce higher-level semantic information about clothing to penalize clothed and non-clothed regions of the human body differently. To do so, we train a body regressor using a novel “Differentiable Semantic Rendering (DSR)” loss. For Minimally-Clothed (MC) regions, we define the DSR-MC loss, which encourages a tight match between a rendered SMPL body and the minimally-clothed regions of the image. For clothed regions, we define the DSR-C loss to encourage the rendered SMPL body to be inside the clothing mask. To ensure end-to-end differentiable training, we learn a semantic clothing prior for SMPL vertices from thousands of clothed human scans. We perform extensive qualitative and quantitative experiments to evaluate the role of clothing semantics on the accuracy of 3D human pose and shape estimation. We outperform all previous state-of-the-art methods on 3DPW and Human3.6M and obtain on-par results on MPI-INF-3DHP. Code and trained models are available for research at https://dsr.is.tue.mpg.de/
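The asymmetry between the two loss terms is the key point: minimally-clothed regions demand a tight two-way match, while clothed regions only forbid the body from leaving the clothing mask. A toy sketch on binary masks (the paper instead renders SMPL differentiably; treating the rendering as a 0/1 mask here is a simplifying assumption):

```python
import numpy as np

def dsr_mc(rendered, minimal_mask):
    # Minimally-clothed regions: encourage a tight two-way overlap.
    return np.mean(np.abs(rendered - minimal_mask))

def dsr_c(rendered, clothing_mask):
    # Clothed regions: penalize body pixels that fall OUTSIDE the
    # clothing, but not clothing pixels the body does not cover.
    return np.mean(rendered * (1 - clothing_mask))

rendered = np.zeros((8, 8)); rendered[2:6, 2:6] = 1  # "rendered" body mask
clothing = np.zeros((8, 8)); clothing[1:7, 1:7] = 1  # larger clothing mask
skin     = np.zeros((8, 8)); skin[2:6, 2:6] = 1      # minimally-clothed mask

print(dsr_mc(rendered, skin))      # perfect overlap -> 0.0
print(dsr_c(rendered, clothing))   # body inside clothing -> 0.0
```

Note that a body mask strictly inside a larger clothing mask incurs no DSR-C penalty, exactly the slack that clothed regions need.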

Perceiving Systems Conference Paper Monocular, One-Stage, Regression of Multiple 3D People Sun, Y., Bao, Q., Liu, W., Fu, Y., Black, M. J., Mei, T. In Proc. International Conference on Computer Vision (ICCV), 11159-11168, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
This paper focuses on the regression of multiple 3D people from a single RGB image. Existing approaches predominantly follow a multi-stage pipeline that first detects people in bounding boxes and then independently regresses their 3D body meshes. In contrast, we propose to Regress all meshes in a One-stage fashion for Multiple 3D People (termed ROMP). The approach is conceptually simple, bounding box-free, and able to learn a per-pixel representation in an end-to-end manner. Our method simultaneously predicts a Body Center heatmap and a Mesh Parameter map, which can jointly describe the 3D body mesh on the pixel level. Through a body-center-guided sampling process, the body mesh parameters of all people in the image are easily extracted from the Mesh Parameter map. Equipped with such a fine-grained representation, our one-stage framework is free of the complex multi-stage process and more robust to occlusion. Compared with state-of-the-art methods, ROMP achieves superior performance on the challenging multi-person benchmarks, including 3DPW and CMU Panoptic. Experiments on crowded/occluded datasets demonstrate the robustness under various types of occlusion. The released code is the first real-time implementation of monocular multi-person 3D mesh regression.
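The body-center-guided sampling described above can be sketched in a few lines: find local maxima of the center heatmap and read the parameter vector at each one (real ROMP uses learned maps, sub-pixel centers, and parameter-level NMS; this is a simplified stand-in):

```python
import numpy as np

def sample_meshes(center_heatmap, param_map, thresh=0.5):
    """Extract one parameter vector per person at local maxima of the
    body-center heatmap. Simplified sketch of ROMP-style sampling."""
    H, W = center_heatmap.shape
    people = []
    for y in range(H):
        for x in range(W):
            v = center_heatmap[y, x]
            if v < thresh:
                continue
            patch = center_heatmap[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
            if v >= patch.max():              # 3x3 local maximum
                people.append(param_map[y, x])
    return np.array(people)

heat = np.zeros((16, 16))
heat[4, 4] = 0.9                              # person 1 center
heat[10, 12] = 0.8                            # person 2 center
params = np.random.default_rng(0).normal(size=(16, 16, 7))  # toy pose/shape/cam
out = sample_meshes(heat, params)
print(out.shape)  # one 7-D parameter vector per detected person
```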

Perceiving Systems Conference Paper PARE: Part Attention Regressor for 3D Human Body Estimation Kocabas, M., Huang, C. P., Hilliges, O., Black, M. J. In Proc. International Conference on Computer Vision (ICCV), 11107-11117, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Despite significant progress, state-of-the-art 3D human pose and shape estimation methods remain sensitive to partial occlusion and can produce dramatically wrong predictions even though much of the body is observable. To address this, we introduce a soft attention mechanism, called the Part Attention REgressor (PARE), that learns to predict body-part-guided attention masks. We observe that state-of-the-art methods rely on global feature representations, making them sensitive to even small occlusions. In contrast, PARE's part-guided attention mechanism overcomes these issues by exploiting information about the visibility of individual body parts while leveraging information from neighboring body-parts to predict occluded parts. We show qualitatively that PARE learns sensible attention masks, and quantitative evaluation confirms that PARE achieves more accurate and robust reconstruction results than existing approaches on both occlusion-specific and standard benchmarks.
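The difference from a global representation is that each body part pools the feature map through its own spatial attention mask, so occluding one region degrades only the parts attending there. A minimal sketch of per-part soft attention pooling (shapes and part count are illustrative, not PARE's actual architecture):

```python
import numpy as np

def part_attention_pool(features, attention_logits):
    """Pool image features with per-part soft attention maps.

    features: (H, W, C); attention_logits: (H, W, P).
    Returns one C-dim feature per body part -- a minimal sketch of
    the idea behind PARE's part-guided attention.
    """
    H, W, C = features.shape
    P = attention_logits.shape[-1]
    logits = attention_logits.reshape(H * W, P)
    a = np.exp(logits - logits.max(axis=0))
    a = a / a.sum(axis=0)                      # spatial softmax per part
    return a.T @ features.reshape(H * W, C)    # (P, C)

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 8, 16))
logits = rng.normal(size=(8, 8, 24))           # e.g. 24 body parts
part_feats = part_attention_pool(feats, logits)
print(part_feats.shape)  # (24, 16): one pooled feature per part
```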

Perceiving Systems Conference Paper SPEC: Seeing People in the Wild with an Estimated Camera Kocabas, M., Huang, C. P., Tesch, J., Müller, L., Hilliges, O., Black, M. J. In Proc. International Conference on Computer Vision (ICCV), 11015-11025, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Due to the lack of camera parameter information for in-the-wild images, existing 3D human pose and shape (HPS) estimation methods make several simplifying assumptions: weak-perspective projection, large constant focal length, and zero camera rotation. These assumptions often do not hold and we show, quantitatively and qualitatively, that they cause errors in the reconstructed 3D shape and pose. To address this, we introduce SPEC, the first in-the-wild 3D HPS method that estimates the perspective camera from a single image and employs this to reconstruct 3D human bodies more accurately. First, we train a neural network to estimate the field of view, camera pitch, and roll given an input image. We employ novel losses that improve the calibration accuracy over previous work. We then train a novel network that concatenates the camera calibration to the image features and uses these together to regress 3D body shape and pose. SPEC is more accurate than the prior art on the standard benchmark (3DPW) as well as two new datasets with more challenging camera views and varying focal lengths. Specifically, we create a new photorealistic synthetic dataset (SPEC-SYN) with ground truth 3D bodies and a novel in-the-wild dataset (SPEC-MTP) with calibration and high-quality reference bodies.
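The three quantities SPEC's calibration network predicts (field of view, pitch, roll) fully determine a perspective camera up to principal point and yaw. A sketch of assembling intrinsics and rotation from them; the axis conventions and degree units below are illustrative choices, not necessarily SPEC's:

```python
import numpy as np

def camera_from_calibration(vfov_deg, pitch_deg, roll_deg, height, width):
    """Build intrinsics K and rotation R from vertical field of view,
    pitch, and roll. Conventions here are illustrative assumptions."""
    f = (height / 2.0) / np.tan(np.radians(vfov_deg) / 2.0)
    K = np.array([[f, 0, width / 2.0],
                  [0, f, height / 2.0],
                  [0, 0, 1.0]])
    p, r = np.radians(pitch_deg), np.radians(roll_deg)
    Rx = np.array([[1, 0, 0],                      # pitch about x-axis
                   [0, np.cos(p), -np.sin(p)],
                   [0, np.sin(p),  np.cos(p)]])
    Rz = np.array([[np.cos(r), -np.sin(r), 0],     # roll about z-axis
                   [np.sin(r),  np.cos(r), 0],
                   [0, 0, 1]])
    return K, Rz @ Rx

K, R = camera_from_calibration(60.0, 10.0, 0.0, height=480, width=640)
print(K[0, 0])  # focal length in pixels implied by the field of view
```

Concatenating such camera parameters to the image features is what lets the body regressor account for perspective distortion instead of assuming a weak-perspective camera.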

Perceiving Systems Conference Paper On Self-Contact and Human Pose Müller, L., Osman, A. A. A., Tang, S., Huang, C. P., Black, M. J. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 9985-9994, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), June 2021 (Published)
People touch their faces 23 times an hour; they cross their arms and legs, put their hands on their hips, and so on. While many images of people contain some form of self-contact, current 3D human pose and shape (HPS) regression methods typically fail to estimate this contact. To address this, we develop new datasets and methods that significantly improve human pose estimation with self-contact. First, we create a dataset of 3D Contact Poses (3DCP) containing SMPL-X bodies fit to 3D scans as well as poses from AMASS, which we refine to ensure good contact. Second, we leverage this to create the Mimic-The-Pose (MTP) dataset of images, collected via Amazon Mechanical Turk, containing people mimicking the 3DCP poses with self-contact. Third, we develop a novel HPS optimization method, SMPLify-XMC, that includes contact constraints and uses the known 3DCP body pose during fitting to create near ground-truth poses for MTP images. Fourth, for more image variety, we label a dataset of in-the-wild images with Discrete Self-Contact (DSC) information and use another new optimization method, SMPLify-DC, that exploits discrete contacts during pose optimization. Finally, we use our datasets during SPIN training to learn a new 3D human pose regressor, called TUCH (Towards Understanding Contact in Humans). We show that the new self-contact training data significantly improves 3D human pose estimates on withheld test data and existing datasets like 3DPW. Not only does our method improve results for self-contact poses, but it also improves accuracy for non-contact poses. The code and data are available for research purposes at https://tuch.is.tue.mpg.de.
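The contact constraint at the heart of these optimization methods reduces to pulling annotated vertex pairs together. A simplified sketch (the full SMPLify-XMC energy also resolves interpenetration and includes the usual pose, shape, and image terms):

```python
import numpy as np

def self_contact_loss(vertices, contact_pairs):
    """Mean distance between vertex pairs annotated as in contact.

    Simplified sketch of a contact term in the spirit of SMPLify-XMC;
    the full method also pushes non-contact geometry apart to resolve
    interpenetration.
    """
    i, j = np.asarray(contact_pairs).T
    d = np.linalg.norm(vertices[i] - vertices[j], axis=1)
    return np.mean(d)

verts = np.array([[0.0, 0.0, 0.0],   # e.g. a face vertex
                  [0.0, 0.0, 0.1],   # a hand vertex nearly touching it
                  [1.0, 0.0, 0.0]])  # a far-away vertex
print(self_contact_loss(verts, [(0, 1)]))  # small: contact nearly satisfied
```

Minimizing this term during fitting drives the annotated pairs into contact, which is what makes the resulting MTP fits near ground truth for self-contact poses.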

Perceiving Systems Conference Paper Monocular Expressive Body Regression through Body-Driven Attention Choutas, V., Pavlakos, G., Bolkart, T., Tzionas, D., Black, M. J. In Computer Vision - ECCV 2020, 10:20-40, Lecture Notes in Computer Science, 12355, (Editors: Vedaldi, Andrea and Bischof, Horst and Brox, Thomas and Frahm, Jan-Michael), Springer, Cham, 16th European Conference on Computer Vision (ECCV 2020), August 2020 (Published)
To understand how people look, interact, or perform tasks, we need to quickly and accurately capture their 3D body, face, and hands together from an RGB image. Most existing methods focus only on parts of the body. A few recent approaches reconstruct full expressive 3D humans from images using 3D body models that include the face and hands. These methods are optimization-based and thus slow, prone to local optima, and require 2D keypoints as input. We address these limitations by introducing ExPose (EXpressive POse and Shape rEgression), which directly regresses the body, face, and hands, in SMPL-X format, from an RGB image. This is a hard problem due to the high dimensionality of the body and the lack of expressive training data. Additionally, hands and faces are much smaller than the body, occupying very few image pixels. This makes hand and face estimation hard when body images are downscaled for neural networks. We make three main contributions. First, we account for the lack of training data by curating a dataset of SMPL-X fits on in-the-wild images. Second, we observe that body estimation localizes the face and hands reasonably well. We introduce body-driven attention for face and hand regions in the original image to extract higher-resolution crops that are fed to dedicated refinement modules. Third, these modules exploit part-specific knowledge from existing face and hand-only datasets. ExPose estimates expressive 3D humans more accurately than existing optimization methods at a small fraction of the computational cost. Our data, model and code are available for research at https://expose.is.tue.mpg.de.
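The body-driven attention step amounts to cropping a padded box around the hand or face keypoints the body network already localizes. A sketch with hypothetical padding and box conventions (ExPose additionally resamples the crop from the full-resolution image for its refinement modules):

```python
import numpy as np

def keypoint_crop(image, keypoints, pad=0.3):
    """Crop a padded square region around a set of 2D keypoints,
    e.g. the hand joints predicted by a body network. Sketch of the
    body-driven attention step; padding and squaring are assumptions."""
    x0, y0 = keypoints.min(axis=0)
    x1, y1 = keypoints.max(axis=0)
    size = max(x1 - x0, y1 - y0) * (1 + 2 * pad)   # square box with padding
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half = size / 2
    r0, r1 = int(max(0, cy - half)), int(min(image.shape[0], cy + half))
    c0, c1 = int(max(0, cx - half)), int(min(image.shape[1], cx + half))
    return image[r0:r1, c0:c1]

img = np.zeros((256, 256, 3))
hand_kps = np.array([[100.0, 120.0], [140.0, 150.0], [120.0, 140.0]])
crop = keypoint_crop(img, hand_kps)
print(crop.shape)  # small high-resolution region around the hand
```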

Perceiving Systems Conference Paper VIBE: Video Inference for Human Body Pose and Shape Estimation Kocabas, M., Athanasiou, N., Black, M. J. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 5252-5262, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), June 2020 (Published)
Human motion is fundamental to understanding behavior. Despite progress on single-image 3D pose and shape estimation, existing video-based state-of-the-art methods fail to produce accurate and natural motion sequences due to a lack of ground-truth 3D motion data for training. To address this problem, we propose “Video Inference for Body Pose and Shape Estimation” (VIBE), which makes use of an existing large-scale motion capture dataset (AMASS) together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty is an adversarial learning framework that leverages AMASS to discriminate between real human motions and those produced by our temporal pose and shape regression networks. We define a temporal network architecture and show that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels. We perform extensive experimentation to analyze the importance of motion and demonstrate the effectiveness of VIBE on challenging 3D pose estimation datasets, achieving state-of-the-art performance. Code and pretrained models are available at https://github.com/mkocabas/VIBE
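The adversarial framework scores whole motion sequences: the discriminator sees real AMASS sequences and regressed ones, and the regressor is rewarded for fooling it. A sketch of least-squares-style adversarial losses over discriminator outputs (an assumed loss form for illustration; VIBE's discriminator is a learned temporal network, not shown here):

```python
import numpy as np

def discriminator_losses(d_real, d_fake):
    """Least-squares GAN losses over batches of discriminator scores:
    d_real on mocap sequences, d_fake on regressed sequences.
    Sketch only; the loss form is an illustrative assumption."""
    d_loss = np.mean((d_real - 1) ** 2) + np.mean(d_fake ** 2)
    g_loss = np.mean((d_fake - 1) ** 2)   # adversarial term for the regressor
    return d_loss, g_loss

d_real = np.array([0.9, 0.8])   # scores on real AMASS motions
d_fake = np.array([0.2, 0.1])   # scores on regressed motions
d_loss, g_loss = discriminator_losses(d_real, d_fake)
print(d_loss, g_loss)
```

The large generator loss here reflects motions the discriminator easily rejects; minimizing it pushes the regressor toward kinematically plausible sequences.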

Perceiving Systems Conference Paper Chained Representation Cycling: Learning to Estimate 3D Human Pose and Shape by Cycling Between Representations Rueegg, N., Lassner, C., Black, M. J., Schindler, K. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, 4:5561-5569, AAAI Press, Palo Alto, CA, Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), February 2020 (Published)
The goal of many computer vision systems is to transform image pixels into 3D representations. Recent popular models use neural networks to regress directly from pixels to 3D object parameters. Such an approach works well when supervision is available, but in problems like human pose and shape estimation, it is difficult to obtain natural images with 3D ground truth. To go one step further, we propose a new architecture that facilitates unsupervised, or lightly supervised, learning. The idea is to break the problem into a series of transformations between increasingly abstract representations. Each step involves a cycle designed to be learnable without annotated training data, and the chain of cycles delivers the final solution. Specifically, we use 2D body part segments as an intermediate representation that contains enough information to be lifted to 3D, and at the same time is simple enough to be learned in an unsupervised way. We demonstrate the method by learning 3D human pose and shape from un-paired and un-annotated images. We also explore varying amounts of paired data and show that cycling greatly alleviates the need for paired data. While we present results for modeling humans, our formulation is general and can be applied to other vision problems.

Perceiving Systems Conference Paper Learning to Reconstruct 3D Human Pose and Shape via Model-fitting in the Loop Kolotouros, N., Pavlakos, G., Black, M. J., Daniilidis, K. Proceedings International Conference on Computer Vision (ICCV), 2252-2261, IEEE, 2019 IEEE/CVF International Conference on Computer Vision (ICCV), October 2019, ISSN: 2380-7504 (Published)
Model-based human pose estimation is currently approached through two different paradigms. Optimization-based methods fit a parametric body model to 2D observations in an iterative manner, leading to accurate image-model alignments, but are often slow and sensitive to the initialization. In contrast, regression-based methods, which use a deep network to directly estimate the model parameters from pixels, tend to provide reasonable, but not pixel-accurate, results while requiring huge amounts of supervision. In this work, instead of investigating which approach is better, our key insight is that the two paradigms can form a strong collaboration. A reasonable, directly regressed estimate from the network can initialize the iterative optimization making the fitting faster and more accurate. Similarly, a pixel-accurate fit from iterative optimization can act as strong supervision for the network. This is the core of our proposed approach SPIN (SMPL oPtimization IN the loop). The deep network initializes an iterative optimization routine that fits the body model to 2D joints within the training loop, and the fitted estimate is subsequently used to supervise the network. Our approach is self-improving by nature, since better network estimates can lead the optimization to better solutions, while more accurate optimization fits provide better supervision for the network. We demonstrate the effectiveness of our approach in different settings, where 3D ground truth is scarce, or not available, and we consistently outperform the state-of-the-art model-based pose estimation approaches by significant margins.
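The loop can be seen in miniature with a one-dimensional stand-in: the regressed estimate initializes a few gradient steps on the fitting energy, and the fitted result becomes the supervision target. Everything below is a conceptual toy, not SPIN's actual SMPLify-based fitting:

```python
import numpy as np

def fit(theta_init, target, lr=0.3, steps=30):
    """Gradient descent on a quadratic stand-in for the reprojection
    energy (theta - target)^2, starting from the regressed estimate."""
    theta = theta_init
    for _ in range(steps):
        grad = 2 * (theta - target)
        theta -= lr * grad
    return theta

theta_regressed = 0.0        # network's direct estimate (initialization)
evidence = 2.0               # optimum of the toy fitting energy
theta_fitted = fit(theta_regressed, evidence)

# The fitted value, not raw 2D evidence, supervises the regressor.
supervision_error = abs(theta_fitted - evidence)
print(theta_fitted, supervision_error)
```

The self-improving property follows: a better-initialized fit converges closer to the optimum, which in turn gives the network a better supervision target.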

Perceiving Systems Conference Paper Neural Body Fitting: Unifying Deep Learning and Model-Based Human Pose and Shape Estimation Omran, M., Lassner, C., Pons-Moll, G., Gehler, P. V., Schiele, B. In 3DV, 484-494, September 2018
Direct prediction of 3D body pose and shape remains a challenge even for highly parameterized deep learning models. Mapping from the 2D image space to the prediction space is difficult: perspective ambiguities make the loss function noisy and training data is scarce. In this paper, we propose a novel approach, Neural Body Fitting (NBF). It integrates a statistical body model within a CNN, leveraging reliable bottom-up semantic body part segmentation and robust top-down body model constraints. NBF is fully differentiable and can be trained using 2D and 3D annotations. In detailed experiments, we analyze how the components of our model affect performance, especially the use of part segmentations as an explicit intermediate representation, and present a robust, efficiently trainable framework for 3D human pose estimation from 2D images with competitive results on standard benchmarks. Code is available at https://github.com/mohomran/neural_body_fitting

Perceiving Systems Conference Paper End-to-end Recovery of Human Shape and Pose Kanazawa, A., Black, M. J., Jacobs, D. W., Malik, J. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 7122-7131, IEEE Computer Society, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2018
We describe Human Mesh Recovery (HMR), an end-to-end framework for reconstructing a full 3D mesh of a human body from a single RGB image. In contrast to most current methods that compute 2D or 3D joint locations, we produce a richer and more useful mesh representation that is parameterized by shape and 3D joint angles. The main objective is to minimize the reprojection loss of keypoints, which allows our model to be trained using in-the-wild images that only have ground truth 2D annotations. However, the reprojection loss alone is highly underconstrained. In this work we address this problem by introducing an adversary trained to tell whether human body shape and pose parameters are real or not using a large database of 3D human meshes. We show that HMR can be trained with and without using any paired 2D-to-3D supervision. We do not rely on intermediate 2D keypoint detections and infer 3D pose and shape parameters directly from image pixels. Our model runs in real-time given a bounding box containing the person. We demonstrate our approach on various images in-the-wild and outperform previous optimization-based methods that output 3D meshes and show competitive results on tasks such as 3D joint location estimation and part segmentation.
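The reprojection objective that enables training from 2D-only annotations is simple to state: project the predicted 3D joints with a weak-perspective camera and compare against visible 2D keypoints. A minimal sketch (loss weighting and camera parameterization here are simplified assumptions):

```python
import numpy as np

def reproject_weak_perspective(joints3d, scale, trans):
    """Project 3D joints with an HMR-style weak-perspective camera
    (s, tx, ty): drop depth, then scale and translate."""
    return scale * joints3d[:, :2] + trans

def keypoint_loss(pred2d, gt2d, vis):
    """L1 reprojection loss over visible ground-truth 2D keypoints."""
    return np.sum(vis[:, None] * np.abs(pred2d - gt2d)) / max(vis.sum(), 1)

j3d = np.array([[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]])    # toy 3D joints
pred = reproject_weak_perspective(j3d, scale=2.0, trans=np.array([0.5, 0.5]))
gt = np.array([[0.5, 0.5], [2.5, 2.5]])               # toy 2D annotations
vis = np.array([1, 1])                                # visibility flags
print(keypoint_loss(pred, gt, vis))
```

Because this loss says nothing about depth or limb lengths, it is underconstrained on its own, which is exactly what the adversarial prior on shape and pose parameters compensates for.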

Perceiving Systems Conference Paper Unite the People: Closing the Loop Between 3D and 2D Human Representations Lassner, C., Romero, J., Kiefel, M., Bogo, F., Black, M. J., Gehler, P. V. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, 4704-4713, IEEE, Piscataway, NJ, USA, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017
3D models provide a common ground for different representations of human bodies. In turn, robust 2D estimation has proven to be a powerful tool to obtain 3D fits “in-the-wild”. However, depending on the level of detail, it can be hard or even impossible to acquire labeled data at large scale for training 2D estimators. We propose a hybrid approach to this problem: with an extended version of the recently introduced SMPLify method, we obtain high-quality 3D body model fits for multiple human pose datasets. Human annotators solely sort good and bad fits. This procedure leads to an initial dataset, UP-3D, with rich annotations. With a comprehensive set of experiments, we show how this data can be used to train discriminative models that produce results with an unprecedented level of detail: our models predict 31 segments and 91 landmark locations on the body. Using the 91-landmark pose estimator, we present state-of-the-art results for 3D human pose and shape estimation using an order of magnitude less training data and without assumptions about gender or pose in the fitting procedure. We show that UP-3D can be enhanced with these improved fits to grow in quantity and quality, which makes the system deployable at large scale. The data, code and models are available for research purposes.