Humans from Video

Humans are in constant motion. Interactions with the world and with each other involve movement. To capture, model, and synthesize human behavior we need to analyze it in video. Despite this, most methods for human 3D human pose and shape (HPS) estimation focus on single images. Intuitively, we should be able to exploit the regularity of human motion and the extra information provided by multiple video frames to improve HPS estimation compared to single-image methods. To that end, we are pursuing several lines or research to enable accurate markerless motion capture from unconstrained video "in the wild".

A key enabler of video-based analysis of motion is training data. To that end, we have exploited our 3D body models (SMPL, etc.) and MoSh, to create the large-scale AMASS dataset [] of human motions in a common 3D representation. We used an early version of this to generate the SURREAL dataset [], which contains rendered videos of people in motion. We used SURREAL, for example, to train methods to estimate the optical flow of people in video []. We also used AMASS to train a network to estimate 3D human pose from a sparse set of IMUs [].

Synthetic datasets like SURREAL are not fully representative of real-world video. Consequently, we created the 3D Poses in the Wild dataset (3DPW) by combining IMU data with monocular video. IMUs are prone to drift but give 3D pose information. Videos give precise 2D alignment with image pixels but lack 3D. By combining these sources of information, 3DPW provides class-leading pseudo ground truth and is, consequently, widely used for training and evaluation.

To estimate 3D humans from video, we have pursued both optimization and regression approaches. Multi-View-SMPLify [] optimizes 3D pose over time using a generic DCT temporal prior. In contrast, VIBE [] uses a GRU-based temporal architecture to regress SMPL from video. VIBE exploits discriminative training using AMASS [] to help the network generate motions that resemble true human movement.

With SMIL [], we capture the motion of infants in RGB-D sequences but go further to use the sequences to learn the 3D shape model. By analyzing the movements of the infants, we provide an assessment related to cerebral palsy [].

Members

Perceiving Systems

Michael Black

Emeritus / Acting Director

Perceiving Systems

Muhammed Kocabas

Guest Scientist

Perceiving Systems

Nikos Athanasiou

Guest Scientist

Perceiving Systems

Yinghao Huang

Guest Scientist

Perceiving Systems

Federica Bogo

Perceiving Systems

Christoph Lassner

Affiliated Researcher

Perceiving Systems

Angjoo Kanazawa

Perceiving Systems

Peter Vincent Gehler

Research Group Leader

Perceiving Systems

Javier Romero

Affiliated Researcher

Perceiving Systems

Ijaz Akhter

Perceiving Systems

Sergi Pujades

Guest Scientist

Perceiving Systems

Gerard Pons-Moll

Affiliated Researcher

Publications

Perceiving Systems Conference Paper VIBE: Video Inference for Human Body Pose and Shape Estimation Kocabas, M., Athanasiou, N., Black, M. J. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 5252-5262, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), June 2020 (Published)

Abstract ›

Human motion is fundamental to understanding behavior. Despite progress on single-image 3D pose and shape estimation, existing video-based state-of-the-art methodsfail to produce accurate and natural motion sequences due to a lack of ground-truth 3D motion data for training. To address this problem, we propose “Video Inference for Body Pose and Shape Estimation” (VIBE), which makes use of an existing large-scale motion capture dataset (AMASS) together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty is an adversarial learning framework that leverages AMASS to discriminate between real human motions and those produced by our temporal pose and shape regression networks. We define a temporal network architecture and show that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels. We perform extensive experimentation to analyze the importance of motion and demonstrate the effectiveness of VIBE on challenging 3D pose estimation datasets, achieving state-of-the-art performance. Code and pretrained models are available at https://github.com/mkocabas/VIBE

arXiv code video supplemental video pdf DOI BibTeX

Perceiving Systems Conference Paper Towards Accurate Marker-less Human Shape and Pose Estimation over Time Huang, Y., Bogo, F., Lassner, C., Kanazawa, A., Gehler, P. V., Romero, J., Akhter, I., Black, M. J. In International Conference on 3D Vision (3DV), 421-430, 2017

Abstract ›

Existing markerless motion capture methods often assume known backgrounds, static cameras, and sequence specific motion priors, limiting their application scenarios. Here we present a fully automatic method that, given multiview videos, estimates 3D human pose and body shape. We take the recently proposed SMPLify method [12] as the base method and extend it in several ways. First we fit a 3D human body model to 2D features detected in multi-view images. Second, we use a CNN method to segment the person in each image and fit the 3D body model to the contours, further improving accuracy. Third we utilize a generic and robust DCT temporal prior to handle the left and right side swapping issue sometimes introduced by the 2D pose estimator. Validation on standard benchmarks shows our results are comparable to the state of the art and also provide a realistic 3D shape avatar. We also demonstrate accurate results on HumanEva and on challenging monocular sequences of dancing from YouTube.

Code pdf DOI BibTeX

Perceiving Systems Conference Paper Multi-Person Tracking by Multicuts and Deep Matching Tang, S., Andres, B., Andriluka, M., Schiele, B. ECCV Workshop on Benchmarking Mutliple Object Tracking, 2016 PDF BibTeX