
Markers to Avatars

Bodies from mocap: SOMA takes a raw mocap point cloud and automatically cleans and labels the points. Once labeled, MoSh solves for the body shape and pose using SMPL (or SMPL-X). MoSh needs only sparse mocap marker data to create animations with a level of realism that is difficult to achieve with standard skeleton-based mocap methods.


Publications

Perceiving Systems Conference Paper SOMA: Solving Optical Marker-Based MoCap Automatically Ghorbani, N., Black, M. J. In Proc. International Conference on Computer Vision (ICCV), 11097-11106, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Marker-based optical motion capture (mocap) is the “gold standard” method for acquiring accurate 3D human motion in computer vision, medicine, and graphics. The raw output of these systems is noisy and incomplete 3D points or short tracklets of points. To be useful, one must associate these points with corresponding markers on the captured subject, i.e., “labeling”. Given these labels, one can then “solve” for the 3D skeleton or body surface mesh. Commercial auto-labeling tools require a specific calibration procedure at capture time, which is not possible for archival data. Here we train a novel neural network called SOMA, which takes raw mocap point clouds with varying numbers of points and labels them at scale without any calibration data, independently of the capture technology, and with only minimal human intervention. Our key insight is that, while labeling point clouds is highly ambiguous, the 3D body provides strong constraints on the solution that can be exploited by a learning-based method. To enable learning, we generate massive training sets of simulated noisy and ground-truth mocap markers animated by 3D bodies from AMASS. SOMA exploits an architecture with stacked self-attention elements to learn the spatial structure of the 3D body and an optimal transport layer to constrain the assignment (labeling) problem while rejecting outliers. We extensively evaluate SOMA both quantitatively and qualitatively. SOMA is more accurate and robust than existing state-of-the-art research methods and can be applied where commercial systems cannot. We automatically label over 8 hours of archival mocap data across 4 different datasets captured using various technologies and output SMPL-X body models. The model and data are released for research purposes at https://soma.is.tue.mpg.de/.
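The optimal transport idea behind the assignment step can be illustrated with Sinkhorn normalization of a point-to-label score matrix, augmented with a "dustbin" column that absorbs outlier points. The following is a minimal NumPy sketch with a toy score matrix, not the SOMA implementation:

```python
import numpy as np

def sinkhorn(scores, n_iters=50, eps=1.0):
    """Soft assignment of points (rows) to marker labels (columns)
    via Sinkhorn normalization in log space. An extra 'dustbin'
    column lets outlier points remain unassigned."""
    n_pts, n_labels = scores.shape
    # Append a zero-scoring dustbin column for outliers.
    aug = np.concatenate([scores, np.zeros((n_pts, 1))], axis=1)
    log_p = aug / eps
    for _ in range(n_iters):
        # Each point distributes unit mass over labels + dustbin.
        log_p -= np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        # Each real label should receive (at most) one point.
        log_p[:, :n_labels] -= np.logaddexp.reduce(
            log_p[:, :n_labels], axis=0, keepdims=True)
    return np.exp(log_p)

# Toy example: 3 points, 2 labels; the third point is an outlier.
scores = np.array([[ 5.0,  0.0],
                   [ 0.0,  5.0],
                   [-3.0, -3.0]])
P = sinkhorn(scores)
labels = P.argmax(axis=1)  # column index 2 (the dustbin) means "outlier"
```

Here points 0 and 1 are assigned labels 0 and 1, while the low-scoring third point ends up in the dustbin column.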

Perceiving Systems Conference Paper BABEL: Bodies, Action and Behavior with English Labels Punnakkal, A. R., Chandrasekaran, A., Athanasiou, N., Quiros-Ramirez, M. A., Black, M. J. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 722-731, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021) , June 2021 (Published)
Understanding the semantics of human movement -- the what, how, and why of the movement -- is an important problem that requires datasets of human actions with semantic labels. Existing datasets take one of two approaches. Large-scale video datasets contain many action labels but do not contain ground-truth 3D human motion. Alternatively, motion-capture (mocap) datasets have precise body motions but are limited to a small number of actions. To address this, we present BABEL, a large dataset with language labels describing the actions being performed in mocap sequences. BABEL consists of action labels for about 43.5 hours of mocap sequences from AMASS. Action labels are at two levels of abstraction -- sequence labels, which describe the overall action in the sequence, and frame labels, which describe all actions in every frame of the sequence. Each frame label is precisely aligned with the duration of the corresponding action in the mocap sequence, and multiple actions can overlap. There are over 28k sequence labels and 63k frame labels in BABEL, which belong to over 250 unique action categories. Labels from BABEL can be leveraged for tasks like action recognition, temporal action localization, and motion synthesis. To demonstrate the value of BABEL as a benchmark, we evaluate the performance of models on 3D action recognition. We demonstrate that BABEL poses interesting learning challenges that are applicable to real-world scenarios and can serve as a useful benchmark of progress in 3D action recognition. The dataset, baseline method, and evaluation code are made available and supported for academic research purposes at https://babel.is.tue.mpg.de/.
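The two-level labeling scheme (sequence labels plus time-aligned, possibly overlapping frame labels) can be sketched as a small data structure. The field names below are illustrative placeholders, not the released BABEL file format:

```python
# Hypothetical illustration of BABEL's two-level labels: a sequence
# label for the whole clip, plus time-aligned frame labels whose
# segments may overlap. Field names are illustrative only.
sequence = {
    "seq_label": ["walk and wave"],
    "frame_labels": [
        {"action": "walk", "start_t": 0.0, "end_t": 4.2},
        {"action": "wave", "start_t": 2.1, "end_t": 3.5},
    ],
}

def actions_at(seq, t):
    """All actions active at time t; overlapping segments both count."""
    return [f["action"] for f in seq["frame_labels"]
            if f["start_t"] <= t <= f["end_t"]]

actions_at(sequence, 3.0)  # 'walk' and 'wave' overlap at t = 3.0
```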

Perceiving Systems Conference Paper GRAB: A Dataset of Whole-Body Human Grasping of Objects Taheri, O., Ghorbani, N., Black, M. J., Tzionas, D. In Computer Vision – ECCV 2020, 4:581-600, Lecture Notes in Computer Science, 12349, (Editors: Vedaldi, Andrea and Bischof, Horst and Brox, Thomas and Frahm, Jan-Michael), Springer, Cham, 16th European Conference on Computer Vision (ECCV 2020), August 2020 (Published)
Training computers to understand, model, and synthesize human grasping requires a rich dataset containing complex 3D object shapes, detailed contact information, hand pose and shape, and the 3D body motion over time. While "grasping" is commonly thought of as a single hand stably lifting an object, we capture the motion of the entire body and adopt the generalized notion of "whole-body grasps". Thus, we collect a new dataset, called GRAB (GRasping Actions with Bodies), of whole-body grasps, containing full 3D shape and pose sequences of 10 subjects interacting with 51 everyday objects of varying shape and size. Given MoCap markers, we fit the full 3D body shape and pose, including the articulated face and hands, as well as the 3D object pose. This gives detailed 3D meshes over time, from which we compute contact between the body and object. This is a unique dataset that goes well beyond existing ones for modeling and understanding how humans grasp and manipulate objects, how their full body is involved, and how interaction varies with the task. We illustrate the practical value of GRAB with an example application: we train GrabNet, a conditional generative network, to predict 3D hand grasps for unseen 3D object shapes. The dataset and code are available for research purposes at https://grab.is.tue.mpg.de.

Perceiving Systems Conference Paper AMASS: Archive of Motion Capture as Surface Shapes Mahmood, N., Ghorbani, N., Troje, N. F., Pons-Moll, G., Black, M. J. In Proceedings International Conference on Computer Vision, 5442-5451, IEEE, International Conference on Computer Vision (ICCV), October 2019 (Published)
Large datasets are the cornerstone of recent advances in computer vision using deep learning. In contrast, existing human motion capture (mocap) datasets are small and the motions limited, hampering progress on learning models of human motion. While there are many different datasets available, they each use a different parameterization of the body, making it difficult to integrate them into a single meta dataset. To address this, we introduce AMASS, a large and varied database of human motion that unifies 15 different optical marker-based mocap datasets by representing them within a common framework and parameterization. We achieve this using a new method, MoSh++, that converts mocap data into realistic 3D human meshes represented by a rigged body model. Here we use SMPL [26], which is widely used and provides a standard skeletal representation as well as a fully rigged surface mesh. The method works for arbitrary marker-sets, while recovering soft-tissue dynamics and realistic hand motion. We evaluate MoSh++ and tune its hyper-parameters using a new dataset of 4D body scans that are jointly recorded with marker-based mocap. The consistent representation of AMASS makes it readily useful for animation, visualization, and generating training data for deep learning. Our dataset is significantly richer than previous human motion collections, with more than 40 hours of motion data spanning over 300 subjects and more than 11,000 motions, and it is available for research at https://amass.is.tue.mpg.de/.
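The "common framework and parameterization" amounts to representing every sequence, regardless of the original marker-set, as body-shape coefficients plus per-frame pose and translation for a common body model. A sketch of what such a unified record looks like (field names and dimensions are illustrative; consult the AMASS release for the actual file layout):

```python
import numpy as np

# Illustrative unified sequence record: one shape vector per subject,
# one pose and translation per frame. Dimensions are placeholders.
fps, n_frames = 60, 120
sequence = {
    "betas": np.zeros(16),               # per-subject body shape coefficients
    "poses": np.zeros((n_frames, 156)),  # per-frame axis-angle pose (body + hands)
    "trans": np.zeros((n_frames, 3)),    # per-frame root translation
    "mocap_framerate": fps,
}

def duration_seconds(seq):
    """Sequence length in seconds, from frame count and framerate."""
    return seq["poses"].shape[0] / seq["mocap_framerate"]
```

Because every dataset shares this representation, tools written against it (visualization, training-data generation) work across all 15 source datasets unchanged.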

Perceiving Systems Article MoSh: Motion and Shape Capture from Sparse Markers Loper, M. M., Mahmood, N., Black, M. J. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 33(6):220:1-220:13, ACM, New York, NY, USA, November 2014
Marker-based motion capture (mocap) is widely criticized as producing lifeless animations. We argue that important information about body surface motion is present in standard marker sets but is lost in extracting a skeleton. We demonstrate a new approach called MoSh (Motion and Shape capture) that automatically extracts this detail from mocap data. MoSh estimates body shape and pose together using sparse marker data by exploiting a parametric model of the human body. In contrast to previous work, MoSh solves for the marker locations relative to the body and estimates accurate body shape directly from the markers without the use of 3D scans; this effectively turns a mocap system into an approximate body scanner. MoSh is able to capture soft-tissue motions directly from markers by allowing body shape to vary over time. We evaluate the effect of different marker sets on pose and shape accuracy and propose a new sparse marker set for capturing soft-tissue motion. We illustrate MoSh by recovering body shape, pose, and soft-tissue motion from archival mocap data and using this to produce animations with subtlety and realism. We also show soft-tissue motion retargeting to new characters and show how to magnify the 3D deformations of soft tissue to create animations with appealing exaggerations.
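At its core, this kind of fitting chooses body parameters so that the model's predicted marker positions match the observed markers. The toy sketch below stands in a *linear* "body model" so the fit reduces to least squares; the real problem uses a nonlinear articulated model (SMPL) and is optimized iteratively over shape, pose, and marker placement:

```python
import numpy as np

# Toy sketch of the fit-to-markers idea with a hypothetical linear
# "body model" (params -> stacked 3D marker coordinates). Because it
# is linear, minimizing the squared marker error is plain least
# squares; this is an illustration, not the MoSh implementation.
rng = np.random.default_rng(0)
n_markers, n_params = 8, 5
basis = rng.normal(size=(n_params, n_markers * 3))

def predict_markers(params):
    """Stand-in body model: map parameters to marker coordinates."""
    return params @ basis

# Simulate observed markers from hidden ground-truth parameters.
true_params = rng.normal(size=n_params)
observed = predict_markers(true_params) + 0.001 * rng.normal(size=n_markers * 3)

# Minimize ||predict(params) - observed||^2 in closed form.
fit_params, *_ = np.linalg.lstsq(basis.T, observed, rcond=None)
```

With enough markers relative to parameters, the recovered parameters match the ground truth up to the observation noise; the nonlinear case trades the closed-form solve for iterative optimization but keeps the same objective.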