Faces and Expressions

Facial shape and motion are essential to communication. They are also fundamentally three dimensional. Consequently, we need a 3D model of the face that can capture the full range of face shapes and expressions. Such a model should be realistic, easy to animate, and easy to fit to data. See [] for a comprehensive overview of different facial representations.

To that end, we train an expressive 3D head model called FLAME from over 33,000 3D scans. Because it is learned from large-scale, expressive data, it is more realistic than previous models. To capture non-linear expression shape variations, we introduce CoMA [], a versatile autoencoder framework for meshes with hierarchical mesh up- and down-sampling operations. Models like FLAME and CoMA require large datasets of 3D faces in dense semantic correspondence across different identities and expressions. ToFu [], a geometry inference framework that uses a hierarchical volumetric feature aggregation scheme, predicts facial meshes in a consistent mesh topology directly from calibrated multi-view images, three orders of magnitude faster than traditional techniques.

To capture, model, and understand facial expressions, we need to estimate the parameters of our face models from images and videos. Training a neural network to regress model parameters from image pixels is difficult because we lack paired training data of images and the true 3D face shape. To address this, RingNet [] directly learns this mapping using only 2D image features. DECA [] additionally learns an animatable detailed displacement model from in-the-wild images. This enables important applications such as creation of animatable avatars from a single image. Our NoW benchmark enables the field to quantitatively compare such methods for the first time.

Classical rendering methods can be used to generate images using FLAME, but these look unrealistic due to the lack of hair, eyes, and the mouth cavity (i.e., teeth or tongue). To address this, we are developing new neural rendering methods. GIF [] combines a generative adversarial network (GAN) with FLAME’s parameter control to generate realistic-looking face images.

Members

Timo Bolkart, Research Scientist (Perceiving Systems)

Michael Black, Emeritus / Acting Director (Perceiving Systems)

Soubhik Sanyal, Guest Scientist (Perceiving Systems)

Anurag Ranjan, Doctoral Researcher (Perceiving Systems)

Tianye Li (Perceiving Systems)

Javier Romero, Affiliated Researcher (Perceiving Systems)

Cassidy Laidlaw (Perceiving Systems)

Yao Feng, Guest Scientist (Perceiving Systems)

Haiwen Feng, Guest Scientist (Perceiving Systems)

Partha Ghosh, Postdoctoral Researcher (Empirical Inference)

Publications

Perceiving Systems Conference Paper Topologically Consistent Multi-View Face Inference Using Volumetric Sampling Li, T., Liu, S., Bolkart, T., Liu, J., Li, H., Zhao, Y. In Proc. International Conference on Computer Vision (ICCV), 3804-3814, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)

Abstract

High-fidelity face digitization solutions often combine multi-view stereo (MVS) techniques for 3D reconstruction and a non-rigid registration step to establish dense correspondence across identities and expressions. A common problem is the need for manual clean-up after the MVS step, as 3D scans are typically affected by noise and outliers and contain hairy surface regions that need to be cleaned up by artists. Furthermore, mesh registration tends to fail for extreme facial expressions. Most learning-based methods use an underlying 3D morphable model (3DMM) to ensure robustness, but this limits the output accuracy for extreme facial expressions. In addition, the global bottleneck of regression architectures cannot produce meshes that tightly fit the ground truth surfaces. We propose ToFu, Topologically consistent Face from multi-view, a geometry inference framework that can produce topologically consistent meshes across facial identities and expressions using a volumetric representation instead of an explicit underlying 3DMM. Our novel progressive mesh generation network embeds the topological structure of the face in a feature volume, sampled from geometry-aware local features. A coarse-to-fine architecture facilitates dense and accurate facial mesh predictions in a consistent mesh topology. ToFu further captures displacement maps for pore-level geometric details and facilitates high-quality rendering in the form of albedo and specular reflectance maps. These high-quality assets are readily usable by production studios for avatar creation, animation and physically-based skin rendering. We demonstrate state-of-the-art geometric and correspondence accuracy, while only taking 0.385 seconds to compute a mesh with 10K vertices, which is three orders of magnitude faster than traditional techniques. The code and the model are available for research purposes at https://tianyeli.github.io/tofu.


Perceiving Systems Article Learning an Animatable Detailed 3D Face Model from In-the-Wild Images Feng, Y., Feng, H., Black, M. J., Bolkart, T. ACM Transactions on Graphics, 40(4):88:1-88:13, August 2021 (Published)

Abstract

While current monocular 3D face reconstruction methods can recover fine geometric details, they suffer from several limitations. Some methods produce faces that cannot be realistically animated because they do not model how wrinkles vary with expression. Other methods are trained on high-quality face scans and do not generalize well to in-the-wild images. We present the first approach that regresses 3D face shape and animatable details that are specific to an individual but change with expression. Our model, DECA (Detailed Expression Capture and Animation), is trained to robustly produce a UV displacement map from a low-dimensional latent representation that consists of person-specific detail parameters and generic expression parameters, while a regressor is trained to predict detail, shape, albedo, expression, pose and illumination parameters from a single image. To enable this, we introduce a novel detail-consistency loss that disentangles person-specific details from expression-dependent wrinkles. This disentanglement allows us to synthesize realistic person-specific wrinkles by controlling expression parameters while keeping person-specific details unchanged. DECA is learned from in-the-wild images with no paired 3D supervision and achieves state-of-the-art shape reconstruction accuracy on two benchmarks. Qualitative results on in-the-wild data demonstrate DECA's robustness and its ability to disentangle identity- and expression-dependent details, enabling animation of reconstructed faces. The model and code are publicly available at https://deca.is.tue.mpg.de.


Perceiving Systems Conference Paper GIF: Generative Interpretable Faces Ghosh, P., Gupta, P. S., Uziel, R., Ranjan, A., Black, M. J., Bolkart, T. In 2020 International Conference on 3D Vision (3DV 2020), 1:868-878, IEEE, Piscataway, NJ, International Conference on 3D Vision (3DV 2020), November 2020 (Published)

Abstract

Photo-realistic visualization and animation of expressive human faces have been a long-standing challenge. 3D face modeling methods provide parametric control but generate unrealistic images. On the other hand, generative 2D models like GANs (Generative Adversarial Networks) output photo-realistic face images but lack explicit control. Recent methods gain partial control, either by attempting to disentangle different factors in an unsupervised manner, or by adding control post hoc to a pre-trained model. Unconditional GANs, however, may entangle factors that are hard to undo later. We condition our generative model on pre-defined control parameters to encourage disentanglement in the generation process. Specifically, we condition StyleGAN2 on FLAME, a generative 3D face model. While conditioning on FLAME parameters yields unsatisfactory results, we find that conditioning on rendered FLAME geometry and photometric details works well. This gives us a generative 2D face model named GIF (Generative Interpretable Faces) that offers FLAME's parametric control. Here, interpretable refers to the semantic meaning of different parameters. Given FLAME parameters for shape, pose, and expressions, parameters for appearance and lighting, and an additional style vector, GIF outputs photo-realistic face images. We perform an AMT-based perceptual study to quantitatively and qualitatively evaluate how well GIF follows its conditioning. The code, data, and trained model are publicly available for research purposes at http://gif.is.tue.mpg.de


Perceiving Systems Article 3D Morphable Face Models - Past, Present and Future Egger, B., Smith, W. A. P., Tewari, A., Wuhrer, S., Zollhoefer, M., Beeler, T., Bernard, F., Bolkart, T., Kortylewski, A., Romdhani, S., Theobalt, C., Blanz, V., Vetter, T. ACM Transactions on Graphics, 39(5):157, October 2020 (Published)

Abstract

In this paper, we provide a detailed survey of 3D Morphable Face Models over the 20 years since they were first proposed. The challenges in building and applying these models, namely capture, modeling, image formation, and image analysis, are still active research topics, and we review the state-of-the-art in each of these areas. We also look ahead, identifying unsolved challenges, proposing directions for future research and highlighting the broad range of current and future applications.


Perceiving Systems Conference Paper Capture, Learning, and Synthesis of 3D Speaking Styles Cudeiro, D., Bolkart, T., Laidlaw, C., Ranjan, A., Black, M. J. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 10101-10111, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

Abstract

Audio-driven 3D facial animation has been widely explored, but achieving realistic, human-like performance is still unsolved. This is due to the lack of available 3D datasets, models, and standard evaluation metrics. To address this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers. We then train a neural network on our dataset that factors identity from facial motion. The learned model, VOCA (Voice Operated Character Animation), takes any speech signal as input, even speech in languages other than English, and realistically animates a wide range of adult faces. Conditioning on subject labels during training allows the model to learn a variety of realistic speaking styles. VOCA also provides animator controls to alter speaking style, identity-dependent facial shape, and pose (i.e., head, jaw, and eyeball rotations) during animation. To our knowledge, VOCA is the only realistic 3D facial animation model that is readily applicable to unseen subjects without retargeting. This makes VOCA suitable for tasks like in-game video, virtual reality avatars, or any scenario in which the speaker, speech, or language is not known in advance. We make the dataset and model available for research purposes at http://voca.is.tue.mpg.de.


Perceiving Systems Conference Paper Learning to Regress 3D Face Shape and Expression from an Image without 3D Supervision Sanyal, S., Bolkart, T., Feng, H., Black, M. J. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 7763-7772, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

Abstract

The estimation of 3D face shape from a single image must be robust to variations in lighting, head pose, expression, facial hair, makeup, and occlusions. Robustness requires a large training set of in-the-wild images, which, by construction, lack ground truth 3D shape. To train a network without any 2D-to-3D supervision, we present RingNet, which learns to compute 3D face shape from a single image. Our key observation is that an individual’s face shape is constant across images, regardless of expression, pose, lighting, etc. RingNet leverages multiple images of a person and automatically detected 2D face features. It uses a novel loss that encourages the face shape to be similar when the identity is the same and different for different people. We achieve invariance to expression by representing the face using the FLAME model. Once trained, our method takes a single image and outputs the parameters of FLAME, which can be readily animated. Additionally, we create a new database of faces “not quite in-the-wild” (NoW) with 3D head scans and high-resolution images of the subjects in a wide variety of conditions. We evaluate publicly available methods and find that RingNet is more accurate than methods that use 3D supervision. The dataset, model, and results are available for research purposes.
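The consistency idea in this abstract, pulling shape estimates of the same person together while pushing those of different people apart, can be sketched as a triplet-style hinge loss on shape-parameter vectors. The exact form of the loss, the margin value, and all names below are assumptions chosen for illustration; this is not presented as RingNet's published formulation.

```python
import numpy as np


def shape_consistency_loss(anchor, same_person, other_person, margin=0.5):
    """Triplet-style hinge loss on shape-parameter vectors (illustrative only).

    The loss is zero once the anchor is closer to `same_person` than to
    `other_person` by at least `margin` (in squared Euclidean distance).
    """
    d_same = np.sum((anchor - same_person) ** 2)
    d_diff = np.sum((anchor - other_person) ** 2)
    return float(max(0.0, d_same - d_diff + margin))


# Hypothetical shape parameters predicted from three images:
# two images of person A and one image of person B.
beta_a1 = np.array([1.0, 0.0, 0.5])
beta_a2 = np.array([1.1, 0.0, 0.5])
beta_b = np.array([-1.0, 2.0, 0.0])

# Consistent predictions: same-identity shapes are close, so no penalty.
loss_consistent = shape_consistency_loss(beta_a1, beta_a2, beta_b)

# Swapping the roles simulates a network that confuses identities.
loss_violating = shape_consistency_loss(beta_a1, beta_b, beta_a2)

print(loss_consistent, loss_violating)
```

In a training setup, a term like this would be summed over many image groups and combined with a 2D feature reprojection term; both of those parts are omitted here.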


Perceiving Systems Conference Paper Generating 3D Faces using Convolutional Mesh Autoencoders Ranjan, A., Bolkart, T., Sanyal, S., Black, M. J. In European Conference on Computer Vision (ECCV), Lecture Notes in Computer Science, vol 11207:725-741, Springer, Cham, September 2018

Abstract

Learned 3D representations of human faces are useful for computer vision problems such as 3D face tracking and reconstruction from images, as well as graphics applications such as character generation and animation. Traditional models learn a latent representation of a face using linear subspaces or higher-order tensor generalizations. Due to this linearity, they cannot capture extreme deformations and non-linear expressions. To address this, we introduce a versatile model that learns a non-linear representation of a face using spectral convolutions on a mesh surface. We introduce mesh sampling operations that enable a hierarchical mesh representation that captures non-linear variations in shape and expression at multiple scales within the model. In a variational setting, our model samples diverse realistic 3D faces from a multivariate Gaussian distribution. Our training data consists of 20,466 meshes of extreme expressions captured over 12 different subjects. Despite limited training data, our trained model outperforms state-of-the-art face models with 50% lower reconstruction error, while using 75% fewer parameters. We also show that replacing the expression space of an existing state-of-the-art face model with our autoencoder achieves a lower reconstruction error. Our data, model and code are available at http://coma.is.tue.mpg.de/.
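One way to picture the mesh sampling operations mentioned in this abstract is as a pair of matrices applied to per-vertex data: a down-sampling matrix that keeps a subset of vertices, and an up-sampling matrix that re-creates the dropped vertices from the kept ones. The toy mesh, the hand-written matrices, and the use of barycentric weights below are assumptions made for illustration; how CoMA actually selects vertices and builds its transforms is not shown here.

```python
import numpy as np

# Toy mesh: triangle A, B, C plus one extra vertex D lying inside it.
vertices = np.array([
    [0.0, 0.0, 0.0],    # A
    [1.0, 0.0, 0.0],    # B
    [0.0, 1.0, 0.0],    # C
    [0.25, 0.25, 0.0],  # D = 0.5*A + 0.25*B + 0.25*C
])

# Down-sampling: a binary selection matrix that keeps A, B, C and drops D.
down = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

# Up-sampling: kept vertices are copied; D is rebuilt from barycentric
# weights with respect to the coarse triangle (A, B, C).
up = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.25, 0.25],
])

coarse = down @ vertices   # shape (3, 3): the coarse mesh
restored = up @ coarse     # shape (4, 3): back at full resolution

# If D moves off the triangle's plane, the coarse level cannot represent it,
# so the round trip loses exactly that detail.
deformed = vertices.copy()
deformed[3, 2] = 0.1
round_trip = up @ (down @ deformed)
lost_detail = np.linalg.norm(round_trip[3] - deformed[3])

print(np.allclose(restored, vertices), lost_detail)
```

The round trip is exact for the flat mesh but drops the out-of-plane offset of D in the deformed one, which is why a hierarchical autoencoder needs learned layers between sampling steps to carry such detail.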


Perceiving Systems Article Learning a model of facial shape and expression from 4D scans Li, T., Bolkart, T., Black, M. J., Li, H., Romero, J. ACM Transactions on Graphics, 36(6):194:1-194:17, November 2017, Two first authors contributed equally

Abstract

The field of 3D face modeling has a large gap between high-end and low-end methods. At the high end, the best facial animation is indistinguishable from real humans, but this comes at the cost of extensive manual labor. At the low end, face capture from consumer depth sensors relies on 3D face models that are not expressive enough to capture the variability in natural facial shape and expression. We seek a middle ground by learning a facial model from thousands of accurately aligned 3D scans. Our FLAME model (Faces Learned with an Articulated Model and Expressions) is designed to work with existing graphics software and be easy to fit to data. FLAME uses a linear shape space trained from 3800 scans of human heads. FLAME combines this linear shape space with an articulated jaw, neck, and eyeballs, pose-dependent corrective blendshapes, and additional global expression blendshapes learned from 4D face sequences in the D3DFACS dataset along with additional 4D sequences. We accurately register a template mesh to the scan sequences and make the D3DFACS registrations available for research purposes. In total the model is trained from over 33,000 scans. FLAME is low-dimensional but more expressive than the FaceWarehouse model and the Basel Face Model. We compare FLAME to these models by fitting them to static 3D scans and 4D sequences using the same optimization method. FLAME is significantly more accurate and is available for research purposes (http://flame.is.tue.mpg.de).
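The linear part of a model like the one described in this abstract can be sketched as a template mesh plus weighted sums of shape and expression basis vectors. The sketch below uses tiny random bases and made-up dimensions as stand-ins for learned ones, and it leaves out the articulated jaw, neck, and eyeballs as well as the pose-dependent correctives. All names are hypothetical and do not correspond to the released FLAME code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; a real head model has thousands of vertices.
num_vertices, num_shape, num_expr = 5, 3, 2

# Random stand-ins for a learned template and learned linear bases.
template = rng.normal(size=(num_vertices, 3))
shape_basis = rng.normal(size=(num_vertices, 3, num_shape))
expr_basis = rng.normal(size=(num_vertices, 3, num_expr))


def linear_head(shape_coeffs, expr_coeffs):
    """Return vertices as template + shape offsets + expression offsets."""
    offsets = shape_basis @ shape_coeffs + expr_basis @ expr_coeffs
    return template + offsets


# Zero coefficients give back the template (the "mean" head).
neutral = linear_head(np.zeros(num_shape), np.zeros(num_expr))

# Changing only the expression coefficients keeps the identity offsets fixed.
identity = np.array([0.5, -1.0, 0.2])
face_rest = linear_head(identity, np.zeros(num_expr))
face_expr = linear_head(identity, np.array([1.0, 0.0]))

print(neutral.shape, np.abs(face_expr - face_rest).max())
```

Because the model is linear in its coefficients, fitting it to a registered scan reduces to a least-squares problem in this simplified setting, which is one reason such models are easy to fit to data.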
