Clothing Capture and Modeling

Body models like SMPL lack clothing, yet people in images and videos are typically clothed. Modeling clothing on the body is hard because of the variety of garments, the varied topology of clothing, and the complex physical properties of cloth. Standard methods for dressing 3D bodies rely on 2D patterns and physics simulation; such approaches require expert knowledge and are labor intensive. We instead seek to capture garments on people "in the wild" and then animate them realistically. Consequently, we take a data-driven approach to learning the shape of clothed humans.

To learn a model of 3D clothing, we use both synthetic data from clothing simulation and scans captured in our 4D body scanner. We estimate the body shape under the clothing using BUFF and then model how clothing deviates from the body.

With this data, we learn how clothing deforms with body pose. For example, CAPE uses a mesh-VAE-GAN, conditioned on pose, to learn clothing deformation from the SMPL body model. CAPE can then add pose-dependent clothing deformation to an animated SMPL body.
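The idea of clothing as an extra deformation on top of the body can be written compactly. The following is only a sketch in SMPL-style notation: the symbol names, in particular $S_{\text{clo}}$, $z$, and $c$, are assumptions chosen for illustration and are not defined on this page.

```latex
T_{\text{clothed}}(\beta, \theta, c, z)
  = \bar{T} + B_S(\beta) + B_P(\theta) + S_{\text{clo}}(z, \theta, c)
```

Here $\bar{T}$ is the template mesh in the canonical pose, $B_S(\beta)$ and $B_P(\theta)$ are the shape- and pose-dependent offsets of the unclothed body, and $S_{\text{clo}}$ is a per-vertex clothing displacement generated from a latent code $z$ and conditioned on pose $\theta$ and clothing type $c$. The dressed template is then posed with the same skinning as the body, which is why the clothing follows the animation.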

CAPE requires registered 3D meshes, which are challenging to obtain for clothing, and it is tied to the topology of SMPL. To address these issues, we use implicit surface models. SCANimate takes raw 3D scans and un-poses them to a canonical pose with the help of the estimated underlying body and a novel cycle-consistency loss. The canonicalized scans are then used to learn an implicit shape model that extends linear blend skinning to blend fields defined implicitly in 3D space.
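To make un-posing and cycle consistency concrete, here is a minimal NumPy sketch of linear blend skinning and its inverse. It is a simplification under stated assumptions, not SCANimate's implementation: the skinning weights are fixed per point (in the actual method they are predicted by learned fields), and all function names, the two toy bones, and the random data are hypothetical.

```python
import numpy as np

def rigid_transform(angle_z, translation):
    """Build a 4x4 rigid transform: rotation about the z-axis plus a translation."""
    c, s = np.cos(angle_z), np.sin(angle_z)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def blend_transforms(weights, bone_transforms):
    """Per-point blended transform sum_k w_k * T_k, shape (N, 4, 4)."""
    return np.einsum("nk,kij->nij", weights, bone_transforms)

def to_homogeneous(points):
    """Append a 1 to every 3D point."""
    return np.concatenate([points, np.ones((len(points), 1))], axis=1)

def pose(points, weights, bone_transforms):
    """Forward linear blend skinning: canonical points -> posed points."""
    blended = blend_transforms(weights, bone_transforms)
    return np.einsum("nij,nj->ni", blended, to_homogeneous(points))[:, :3]

def unpose(points, weights, bone_transforms):
    """Inverse skinning: posed points -> canonical points (weights assumed known)."""
    blended_inv = np.linalg.inv(blend_transforms(weights, bone_transforms))
    return np.einsum("nij,nj->ni", blended_inv, to_homogeneous(points))[:, :3]

# Toy data (assumption): five "scan" points, two bones, random normalized weights.
rng = np.random.default_rng(0)
scan_points = rng.normal(size=(5, 3))
raw = rng.random((5, 2))
weights = raw / raw.sum(axis=1, keepdims=True)
bones = np.stack([
    rigid_transform(0.0, [0.0, 0.0, 0.0]),
    rigid_transform(np.deg2rad(40.0), [0.1, 0.0, 0.2]),
])

canonical = unpose(scan_points, weights, bones)
reposed = pose(canonical, weights, bones)

# Cycle consistency: posing the un-posed scan should reproduce the scan.
cycle_error = np.abs(reposed - scan_points).max()
```

With fixed weights the cycle closes exactly, up to floating-point error. When the weights are predicted separately in posed and canonical space, the same residual can act as a training signal that pushes the two predictions to agree.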

Implicit models, however, lack compatibility with standard graphics pipelines. To address this, we propose two models, SCALE and POP, that are based on point clouds and extend deep point cloud representations to handle articulated human pose. The points are explicit, but the surface through them is implicit. POP goes beyond previous methods by modeling multiple garments, enabling the creation of an animatable avatar with pose-dependent deformations from a single scan. These point-based models are readily rendered as images using neural rendering methods; see Neural Rendering for more information.
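As a rough illustration of how a point-based clothing layer can ride on an articulated body, the sketch below places each clothing point at a posed body point plus an offset expressed in that point's local coordinate frame. This is an assumption-laden toy: in the actual models the offsets come from a neural network conditioned on pose and a learned geometry feature, whereas here they are hard-coded, and the function and variable names are hypothetical.

```python
import numpy as np

def clothed_points(body_points, local_frames, local_offsets):
    """Compute x_i = p_i + R_i r_i for every point.

    body_points:   (N, 3) points on the posed body surface
    local_frames:  (N, 3, 3) rotation of each point's local frame in the current pose
    local_offsets: (N, 3) clothing offsets expressed in the local frames
    """
    return body_points + np.einsum("nij,nj->ni", local_frames, local_offsets)

# Rotation by 90 degrees about the z-axis.
rot90_z = np.array([[0.0, -1.0, 0.0],
                    [1.0,  0.0, 0.0],
                    [0.0,  0.0, 1.0]])

body = np.array([[0.0, 0.0, 0.0],
                 [1.0, 0.0, 0.0]])
frames = np.stack([np.eye(3), rot90_z])   # the second point's frame has rotated
offsets = np.array([[0.1, 0.0, 0.0],
                    [0.1, 0.0, 0.0]])     # same local offset for both points

points = clothed_points(body, frames, offsets)
```

The same local offset ends up pointing along x for the first point and along y for the second, because the second frame has rotated with the body. Expressing offsets in local frames is what lets a fixed set of points follow articulation without a mesh topology.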

Videos

SNARF: Differentiable Forward Skinning for Animating Non-Rigid Neural Implicit Shapes (ICCV 2021)

The Power of Points for Modeling Humans in Clothing (ICCV 2021)

SCALE: Modeling Clothed Humans with a Surface Codec of Articulated Local Elements (CVPR 2021)

SCANimate: Weakly Supervised Learning of Skinned Clothed Avatar Networks (CVPR 2021)

Learning to Dress 3D People in Generative Clothing (CVPR 2020)

ClothCap (SIGGRAPH 2017)

Estimating Body Shape under Clothing (CVPR 2017)

Datasets

CAPE: Clothed Auto Person Encoding

Raw scans, registered meshes, and fitted bodies.

BUFF: Bodies Under Flowing Fashion

Raw scans of clothed humans and fitted bodies.

Members

Michael Black, Emeritus / Acting Director (Perceiving Systems)

Gerard Pons-Moll, Affiliated Researcher (Perceiving Systems)

Sergi Pujades, Guest Scientist (Perceiving Systems)

Jinlong Yang, Guest Scientist (Perceiving Systems)

Siyu Tang, Guest Scientist (Perceiving Systems)

Qianli Ma, Guest Scientist (Perceiving Systems)

Shunsuke Saito (Perceiving Systems)

Publications

MetaAvatar: Learning Animatable Clothed Human Models from Few Depth Images
Wang, S., Mihajlovic, M., Ma, Q., Geiger, A., Tang, S.
In Advances in Neural Information Processing Systems 34, 4:2810-2822 (Editors: Ranzato, M. and Beygelzimer, A. and Dauphin, Y. and Liang, P. S. and Wortman Vaughan, J.), Curran Associates, Inc., Red Hook, NY, 35th Conference on Neural Information Processing Systems (NeurIPS 2021), December 2021 (Published)
Conference Paper (Autonomous Vision, Perceiving Systems)

Abstract

In this paper, we aim to create generalizable and controllable neural signed distance fields (SDFs) that represent clothed humans from monocular depth observations. Recent advances in deep learning, especially neural implicit representations, have enabled human shape reconstruction and controllable avatar generation from different sensor inputs. However, to generate realistic cloth deformations from novel input poses, watertight meshes or dense full-body scans are usually needed as inputs. Furthermore, due to the difficulty of effectively modeling pose-dependent cloth deformations for diverse body shapes and cloth types, existing approaches resort to per-subject/cloth-type optimization from scratch, which is computationally expensive. In contrast, we propose an approach that can quickly generate realistic clothed human avatars, represented as controllable neural SDFs, given only monocular depth images. We achieve this by using meta-learning to learn an initialization of a hypernetwork that predicts the parameters of neural SDFs. The hypernetwork is conditioned on human poses and represents a clothed neural avatar that deforms non-rigidly according to the input poses. Meanwhile, it is meta-learned to effectively incorporate priors of diverse body shapes and cloth types and thus can be much faster to fine-tune compared to models trained from scratch. We qualitatively and quantitatively show that our approach outperforms state-of-the-art approaches that require complete meshes as inputs while our approach requires only depth frames as inputs and runs orders of magnitudes faster. Furthermore, we demonstrate that our meta-learned hypernetwork is very robust, being the first to generate avatars with realistic dynamic cloth deformations given as few as 8 monocular depth frames.


The Power of Points for Modeling Humans in Clothing
Ma, Q., Yang, J., Tang, S., Black, M. J.
In Proc. International Conference on Computer Vision (ICCV), 10954-10964, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Conference Paper (Perceiving Systems)

Abstract

Currently it requires an artist to create 3D human avatars with realistic clothing that can move naturally. Despite progress on 3D scanning and modeling of human bodies, there is still no technology that can easily turn a static scan into an animatable avatar. Automating the creation of such avatars would enable many applications in games, social networking, animation, and AR/VR to name a few. The key problem is one of representation. Standard 3D meshes are widely used in modeling the minimally-clothed body but do not readily capture the complex topology of clothing. Recent interest has shifted to implicit surface models for this task but they are computationally heavy and lack compatibility with existing 3D tools. What is needed is a 3D representation that can capture varied topology at high resolution and that can be learned from data. We argue that this representation has been with us all along — the point cloud. Point clouds have properties of both implicit and explicit representations that we exploit to model 3D garment geometry on a human body. We train a neural network with a novel local clothing geometric feature to represent the shape of different outfits. The network is trained from 3D point clouds of many types of clothing, on many bodies, in many poses, and learns to model pose-dependent clothing deformations. The geometry feature can be optimized to fit a previously unseen scan of a person in clothing, enabling the scan to be reposed realistically. Our model demonstrates superior quantitative and qualitative results in both multi-outfit modeling and unseen outfit animation. The code is available for research purposes.


SNARF: Differentiable Forward Skinning for Animating Non-Rigid Neural Implicit Shapes
Chen, X., Zheng, Y., Black, M. J., Hilliges, O., Geiger, A.
In Proc. International Conference on Computer Vision (ICCV), 11574-11584, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Conference Paper (Perceiving Systems, Autonomous Vision)

Abstract

Neural implicit surface representations have emerged as a promising paradigm to capture 3D shapes in a continuous and resolution-independent manner. However, adapting them to articulated shapes is non-trivial. Existing approaches learn a backward warp field that maps deformed to canonical points. However, this is problematic since the backward warp field is pose dependent and thus requires large amounts of data to learn. To address this, we introduce SNARF, which combines the advantages of linear blend skinning (LBS) for polygonal meshes with those of neural implicit surfaces by learning a forward deformation field without direct supervision. This deformation field is defined in canonical, pose-independent, space, enabling generalization to unseen poses. Learning the deformation field from posed meshes alone is challenging since the correspondences of deformed points are defined implicitly and may not be unique under changes of topology. We propose a forward skinning model that finds all canonical correspondences of any deformed point using iterative root finding. We derive analytical gradients via implicit differentiation, enabling end-to-end training from 3D meshes with bone transformations. Compared to state-of-the-art neural implicit representations, our approach generalizes better to unseen poses while preserving accuracy. We demonstrate our method in challenging scenarios on (clothed) 3D humans in diverse and unseen poses.
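The correspondence search described in this abstract can be illustrated with a toy example: given a forward skinning function whose weights are defined in canonical space, recover the canonical point that maps to a given deformed point by iterative root finding. The sketch below is not SNARF's implementation. It uses Newton's method with a finite-difference Jacobian as a simple stand-in solver, the sigmoid weight field and the two bones are made up, and the implicit-differentiation gradients mentioned in the abstract are omitted.

```python
import numpy as np

def rotation_z(angle):
    """3x3 rotation about the z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Two toy bones (assumption): bone 0 stays fixed, bone 1 rotates and shifts slightly.
ROTATIONS = np.stack([np.eye(3), rotation_z(np.deg2rad(30.0))])
TRANSLATIONS = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])

def skinning_weights(x_canonical):
    """Hypothetical smooth weight field defined in canonical (pose-independent) space."""
    w1 = 1.0 / (1.0 + np.exp(-2.0 * x_canonical[0]))
    return np.array([1.0 - w1, w1])

def forward_skin(x_canonical):
    """Forward LBS: blend the per-bone transformed copies of a canonical point."""
    per_bone = ROTATIONS @ x_canonical + TRANSLATIONS  # shape (2, 3)
    return skinning_weights(x_canonical) @ per_bone

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f: R^3 -> R^3 at x."""
    J = np.zeros((3, 3))
    for j in range(3):
        step = np.zeros(3)
        step[j] = eps
        J[:, j] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return J

def find_canonical(x_deformed, x_init, iters=50, tol=1e-10):
    """Solve forward_skin(x) = x_deformed for x with Newton iterations."""
    x = np.asarray(x_init, dtype=float).copy()
    for _ in range(iters):
        residual = forward_skin(x) - x_deformed
        if np.linalg.norm(residual) < tol:
            break
        x = x - np.linalg.solve(numerical_jacobian(forward_skin, x), residual)
    return x

x_canonical = np.array([0.2, 0.3, 0.1])
x_deformed = forward_skin(x_canonical)

# One initialization per bone: undo that bone's rigid transform.
candidates = [
    find_canonical(x_deformed, R.T @ (x_deformed - t))
    for R, t in zip(ROTATIONS, TRANSLATIONS)
]
```

Starting one search per bone is a way to look for several roots, since a deformed point may have more than one canonical correspondence. In this mild toy setup both searches are expected to land on the same canonical point.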


SCALE: Modeling Clothed Humans with a Surface Codec of Articulated Local Elements
Ma, Q., Saito, S., Yang, J., Tang, S., Black, M. J.
In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 16077-16088, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), June 2021 (Published)
Conference Paper (Perceiving Systems)

Abstract

Learning to model and reconstruct humans in clothing is challenging due to articulation, non-rigid deformation, and varying clothing types and topologies. To enable learning, the choice of representation is the key. Recent work uses neural networks to parameterize local surface elements. This approach captures locally coherent geometry and non-planar details, can deal with varying topology, and does not require registered training data. However, naively using such methods to model 3D clothed humans fails to capture fine-grained local deformations and generalizes poorly. To address this, we present three key innovations: First, we deform surface elements based on a human body model such that large-scale deformations caused by articulation are explicitly separated from topological changes and local clothing deformations. Second, we address the limitations of existing neural surface elements by regressing local geometry from local features, significantly improving the expressiveness. Third, we learn a pose embedding on a 2D parameterization space that encodes posed body geometry, improving generalization to unseen poses by reducing non-local spurious correlations. We demonstrate the efficacy of our surface representation by learning models of complex clothing from point clouds. The clothing can change topology and deviate from the topology of the body. Once learned, we can animate previously unseen motions, producing high-quality point clouds, from which we generate realistic images with neural rendering. We assess the importance of each technical contribution and show that our approach outperforms the state-of-the-art methods in terms of reconstruction accuracy and inference time. The code is available for research purposes at https://qianlim.github.io/SCALE.


SCANimate: Weakly Supervised Learning of Skinned Clothed Avatar Networks
Saito, S., Yang, J., Ma, Q., Black, M. J.
In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2885-2896, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), June 2021 (Published)
Conference Paper (Perceiving Systems)

Abstract

We present SCANimate, an end-to-end trainable framework that takes raw 3D scans of a clothed human and turns them into an animatable avatar. These avatars are driven by pose parameters and have realistic clothing that moves and deforms naturally. SCANimate does not rely on a customized mesh template or surface mesh registration. We observe that fitting a parametric 3D body model, like SMPL, to a clothed human scan is tractable while surface registration of the body topology to the scan is often not, because clothing can deviate significantly from the body shape. We also observe that articulated transformations are invertible, resulting in geometric cycle-consistency in the posed and unposed shapes. These observations lead us to a weakly supervised learning method that aligns scans into a canonical pose by disentangling articulated deformations without template-based surface registration. Furthermore, to complete missing regions in the aligned scans while modeling pose-dependent deformations, we introduce a locally pose-aware implicit function that learns to complete and model geometry with learned pose correctives. In contrast to commonly used global pose embeddings, our local pose conditioning significantly reduces long-range spurious correlations and improves generalization to unseen poses, especially when training data is limited. Our method can be applied to pose-aware appearance modeling to generate a fully textured avatar. We demonstrate our approach on various clothing types with different amounts of training data, outperforming existing solutions and other variants in terms of fidelity and generality in every setting. The code is available at https://scanimate.is.tue.mpg.de.


Learning to Dress 3D People in Generative Clothing
Ma, Q., Yang, J., Ranjan, A., Pujades, S., Pons-Moll, G., Tang, S., Black, M. J.
In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 6468-6477, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), June 2020 (Published)
Conference Paper (Perceiving Systems)

Abstract

Three-dimensional human body models are widely used in the analysis of human pose and motion. Existing models, however, are learned from minimally-clothed 3D scans and thus do not generalize to the complexity of dressed people in common images and videos. Additionally, current models lack the expressive power needed to represent the complex non-linear geometry of pose-dependent clothing shape. To address this, we learn a generative 3D mesh model of clothed people from 3D scans with varying pose and clothing. Specifically, we train a conditional Mesh-VAE-GAN to learn the clothing deformation from the SMPL body model, making clothing an additional term on SMPL. Our model is conditioned on both pose and clothing type, giving the ability to draw samples of clothing to dress different body shapes in a variety of styles and poses. To preserve wrinkle detail, our Mesh-VAE-GAN extends patchwise discriminators to 3D meshes. Our model, named CAPE, represents global shape and fine local structure, effectively extending the SMPL body model to clothing. To our knowledge, this is the first generative model that directly dresses 3D human body meshes and generalizes to different poses.


Detailed, accurate, human shape estimation from clothed 3D scan sequences
Zhang, C., Pujades, S., Black, M., Pons-Moll, G.
In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5484-5493, IEEE Computer Society, Washington, DC, USA, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, Spotlight
Conference Paper (Perceiving Systems)

Abstract

We address the problem of estimating human body shape from 3D scans over time. Reliable estimation of 3D body shape is necessary for many applications including virtual try-on, health monitoring, and avatar creation for virtual reality. Scanning bodies in minimal clothing, however, presents a practical barrier to these applications. We address this problem by estimating body shape under clothing from a sequence of 3D scans. Previous methods that have exploited statistical models of body shape produce overly smooth shapes lacking personalized details. In this paper we contribute a new approach to recover not only an approximate shape of the person, but also their detailed shape. Our approach allows the estimated shape to deviate from a parametric model to fit the 3D scans. We demonstrate the method using high quality 4D data as well as sequences of visual hulls extracted from multi-view images. We also make available a new high quality 4D dataset that enables quantitative evaluation. Our method outperforms the previous state of the art, both qualitatively and quantitatively.


System and method for simulating realistic clothing
Black, M. J., Guan, P.
June 2017, U.S. Patent 9,679,409 B2
Patent (Perceiving Systems)

Abstract

Systems, methods, and computer-readable storage media for simulating realistic clothing. The system generates a clothing deformation model for a clothing type, wherein the clothing deformation model factors a change of clothing shape due to rigid limb rotation, pose-independent body shape, and pose-dependent deformations. Next, the system generates a custom-shaped garment for a given body by mapping, via the clothing deformation model, body shape parameters to clothing shape parameters. The system then automatically dresses the given body with the custom-shaped garment.


ClothCap: Seamless 4D Clothing Capture and Retargeting
Pons-Moll, G., Pujades, S., Hu, S., Black, M.
ACM Transactions on Graphics (Proc. SIGGRAPH), 36(4):73:1-73:15, ACM, New York, NY, USA, 2017. The two first authors contributed equally.
Article (Perceiving Systems)

Abstract

Designing and simulating realistic clothing is challenging and, while several methods have addressed the capture of clothing from 3D scans, previous methods have been limited to single garments and simple motions, lack detail, or require specialized texture patterns. Here we address the problem of capturing regular clothing on fully dressed people in motion. People typically wear multiple pieces of clothing at a time. To estimate the shape of such clothing, track it over time, and render it believably, each garment must be segmented from the others and the body. Our ClothCap approach uses a new multi-part 3D model of clothed bodies, automatically segments each piece of clothing, estimates the naked body shape and pose under the clothing, and tracks the 3D deformations of the clothing over time. We estimate the garments and their motion from 4D scans; that is, high-resolution 3D scans of the subject in motion at 60 fps. The model allows us to capture a clothed person in motion, extract their clothing, and retarget the clothing to new body shapes. ClothCap provides a step towards virtual try-on with a technology for capturing, modeling, and analyzing clothing in motion.


DRAPE: DRessing Any PErson
Guan, P., Reiss, L., Hirshberg, D., Weiss, A., Black, M. J.
ACM Trans. on Graphics (Proc. SIGGRAPH), 31(4):35:1-35:10, July 2012
Article (Perceiving Systems)

Abstract

We describe a complete system for animating realistic clothing on synthetic bodies of any shape and pose without manual intervention. The key component of the method is a model of clothing called DRAPE (DRessing Any PErson) that is learned from a physics-based simulation of clothing on bodies of different shapes and poses. The DRAPE model has the desirable property of "factoring" clothing deformations due to body shape from those due to pose variation. This factorization provides an approximation to the physical clothing deformation and greatly simplifies clothing synthesis. Given a parameterized model of the human body with known shape and pose parameters, we describe an algorithm that dresses the body with a garment that is customized to fit and possesses realistic wrinkles. DRAPE can be used to dress static bodies or animated sequences with a learned model of the cloth dynamics. Since the method is fully automated, it is appropriate for dressing large numbers of virtual characters of varying shape. The method is significantly more efficient than physical simulation.
