
Neural Rendering

GIF (top) generates realistic face images that are controlled by FLAME parameters. SPICE (middle) learns to repose an image of a person without any paired training data by exploiting information about 3D bodies. SMPLpix (bottom) generates realistic images of people from a sparse set of colored SMPL vertices.


Publications

Perceiving Systems Conference Paper Learning Realistic Human Reposing using Cyclic Self-Supervision with 3D Shape, Pose, and Appearance Consistency Sanyal, S., Vorobiov, A., Bolkart, T., Loper, M., Mohler, B., Davis, L., Romero, J., Black, M. J. In Proc. International Conference on Computer Vision (ICCV), 11118-11127, IEEE, Piscataway, NJ, International Conference on Computer Vision, October 2021 (Published)
Synthesizing images of a person in novel poses from a single image is a highly ambiguous task. Most existing approaches require paired training images, i.e., images of the same person with the same clothing in different poses. However, obtaining sufficiently large datasets with paired data is challenging and costly. Previous methods that forego paired supervision lack realism. We propose a self-supervised framework named SPICE (Self-supervised Person Image CrEation) that closes the image quality gap with supervised methods. The key insight enabling self-supervision is to exploit 3D information about the human body in several ways. First, the 3D body shape must remain unchanged when reposing. Second, representing body pose in 3D enables reasoning about self-occlusions. Third, 3D body parts that are visible before and after reposing should have similar appearance features. Once trained, SPICE takes an image of a person and generates a new image of that person in a new target pose. SPICE achieves state-of-the-art performance on the DeepFashion dataset, improving the FID score from 29.9 to 7.8 compared with previous unsupervised methods, and performing similarly to the state-of-the-art supervised method (6.4). SPICE also generates temporally coherent videos given an input image and a sequence of poses, despite being trained only on static images.
pdf arxiv DOI BibTeX
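The 3D consistency terms described in the abstract can be sketched as simple penalties. This is a toy illustration, not the authors' code: all names and tensor shapes are our assumptions, and the second term (self-occlusion reasoning) is omitted for brevity.

```python
import numpy as np

def spice_consistency(shape_src, shape_gen, feats_src, feats_gen, visible):
    """Toy sketch of SPICE's 3D self-supervision terms (illustrative only).

    shape_src, shape_gen : (10,) body-shape coefficients before/after reposing
    feats_src, feats_gen : (P, D) per-body-part appearance features
    visible              : (P,) bool mask of parts visible in both poses
    """
    # 1) reposing must not change the 3D body shape
    l_shape = float(np.mean((shape_src - shape_gen) ** 2))
    # 3) parts visible in both poses should keep similar appearance features
    diff = feats_src[visible] - feats_gen[visible]
    l_appearance = float(np.mean(diff ** 2)) if visible.any() else 0.0
    return l_shape + l_appearance
```

A perfectly consistent reposing drives both terms to zero, which is what makes the objective usable without paired ground-truth images.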

Perceiving Systems Conference Paper SCALE: Modeling Clothed Humans with a Surface Codec of Articulated Local Elements Ma, Q., Saito, S., Yang, J., Tang, S., Black, M. J. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 16077-16088, IEEE, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), June 2021 (Published)
Learning to model and reconstruct humans in clothing is challenging due to articulation, non-rigid deformation, and varying clothing types and topologies. To enable learning, the choice of representation is key. Recent work uses neural networks to parameterize local surface elements. This approach captures locally coherent geometry and non-planar details, can deal with varying topology, and does not require registered training data. However, naively using such methods to model 3D clothed humans fails to capture fine-grained local deformations and generalizes poorly. To address this, we present three key innovations: First, we deform surface elements based on a human body model such that large-scale deformations caused by articulation are explicitly separated from topological changes and local clothing deformations. Second, we address the limitations of existing neural surface elements by regressing local geometry from local features, significantly improving the expressiveness. Third, we learn a pose embedding on a 2D parameterization space that encodes posed body geometry, improving generalization to unseen poses by reducing non-local spurious correlations. We demonstrate the efficacy of our surface representation by learning models of complex clothing from point clouds. The clothing can change topology and deviate from the topology of the body. Once learned, we can animate previously unseen motions, producing high-quality point clouds, from which we generate realistic images with neural rendering. We assess the importance of each technical contribution and show that our approach outperforms the state-of-the-art methods in terms of reconstruction accuracy and inference time. The code is available for research purposes at https://qianlim.github.io/SCALE.
Project Page Code Video arXiv PDF Supp. Poster DOI BibTeX
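The first two innovations above — anchoring elements on the posed body and regressing local geometry from local features — can be sketched in a few lines. This is a minimal illustration under our own simplifying assumptions (a linear shared decoder, names invented here), not the SCALE implementation.

```python
import numpy as np

def decode_local_elements(anchors, local_feats, W, b):
    """Toy sketch of articulated local surface elements (illustrative only).

    anchors     : (K, 3) element positions on the posed body model; placing
                  elements on the body separates articulation from local
                  clothing deformation
    local_feats : (K, D) per-element features (e.g. from a pose embedding)
    W, b        : shared decoder weights (D, 3) and bias (3,) that regress
                  each element's displacement from its own local feature
    Returns (K, 3) clothed-surface points = body anchor + local offset.
    """
    offsets = local_feats @ W + b   # local geometry from local features
    return anchors + offsets
```

With a zero decoder the output collapses onto the posed body, which shows how articulation is handled by the anchors while the decoder only has to model local clothing geometry.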

Perceiving Systems Conference Paper SMPLpix: Neural Avatars from 3D Human Models Prokudin, S., Black, M. J., Romero, J. In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV 2021), 1809-1818, IEEE, Piscataway, NJ, IEEE Winter Conference on Applications of Computer Vision (WACV 2021), January 2021 (Published)
Recent advances in deep generative models have led to an unprecedented level of realism for synthetically generated images of humans. However, one of the remaining fundamental limitations of these models is the ability to flexibly control the generative process, e.g. changing the camera and human pose while retaining the subject's identity. At the same time, deformable human body models like SMPL [Loper et al., 2015] and its successors provide full control over pose and shape, but rely on classic computer graphics pipelines for rendering. Such rendering pipelines require explicit mesh rasterization, which (a) cannot fix artifacts or a lack of realism in the original 3D geometry and (b) until recently, was not fully incorporated into deep learning frameworks. In this work, we propose to bridge the gap between classic geometry-based rendering and the latest generative networks operating in pixel space. We train a network that directly converts a sparse set of 3D mesh vertices into photorealistic images, alleviating the need for a traditional rasterization mechanism. We train our model on a large corpus of human 3D models and corresponding real photos, and show its advantage over conventional differentiable renderers in terms of both photorealism and rendering efficiency.
project official pdf video preprint code DOI BibTeX
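The input stage described above — a sparse set of projected, colored SMPL vertices in place of a rasterized mesh — can be illustrated with a simple splatting routine. The function name, resolution, and nearest-pixel splat are our assumptions; in the paper a convolutional network then turns this sparse image into a photorealistic frame.

```python
import numpy as np

def splat_vertices(verts_uv, colors, res=64):
    """Toy stand-in for SMPLpix's input stage (illustrative only).

    verts_uv : (N, 2) projected vertex positions in [0, 1) image coordinates
    colors   : (N, 3) per-vertex RGB colors
    Returns an (res, res, 3) sparse RGB image with one splat per vertex.
    """
    img = np.zeros((res, res, 3))
    px = np.clip((verts_uv * res).astype(int), 0, res - 1)
    img[px[:, 1], px[:, 0]] = colors  # nearest-pixel splat, no z-buffering
    return img
```

Because the downstream network, not a rasterizer, fills the gaps between splats, the pipeline can also correct artifacts that a classic renderer would faithfully reproduce.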

Perceiving Systems Conference Paper GIF: Generative Interpretable Faces Ghosh, P., Gupta, P. S., Uziel, R., Ranjan, A., Black, M. J., Bolkart, T. In 2020 International Conference on 3D Vision (3DV 2020), 1:868-878, IEEE, Piscataway, NJ, International Conference on 3D Vision (3DV 2020), November 2020 (Published)
Photo-realistic visualization and animation of expressive human faces have been a long-standing challenge. 3D face modeling methods provide parametric control but generate unrealistic images; generative 2D models like GANs (Generative Adversarial Networks), on the other hand, output photo-realistic face images but lack explicit control. Recent methods gain partial control, either by attempting to disentangle different factors in an unsupervised manner, or by adding control post hoc to a pre-trained model. Unconditional GANs, however, may entangle factors that are hard to undo later. We condition our generative model on pre-defined control parameters to encourage disentanglement in the generation process. Specifically, we condition StyleGAN2 on FLAME, a generative 3D face model. While conditioning on FLAME parameters directly yields unsatisfactory results, we find that conditioning on rendered FLAME geometry and photometric details works well. This gives us a generative 2D face model named GIF (Generative Interpretable Faces) that offers FLAME's parametric control. Here, interpretable refers to the semantic meaning of the different parameters. Given FLAME parameters for shape, pose, and expression, parameters for appearance and lighting, and an additional style vector, GIF outputs photo-realistic face images. We perform an AMT-based perceptual study to quantitatively and qualitatively evaluate how well GIF follows its conditioning. The code, data, and trained model are publicly available for research purposes at http://gif.is.tue.mpg.de
pdf project code video DOI BibTeX
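GIF's key design choice — conditioning the generator on images rendered from FLAME rather than on raw parameter vectors — can be illustrated by how the conditioning input is assembled. The function name and channel layout are our assumptions, not the authors' code.

```python
import numpy as np

def build_gif_conditioning(geometry_render, photometric_render):
    """Toy illustration of GIF-style conditioning (illustrative only).

    geometry_render    : (H, W, 3) image rendered from FLAME geometry
    photometric_render : (H, W, 3) rendered photometric details
    Returns an (H, W, 6) channel-stacked conditioning image, the kind of
    spatial input a StyleGAN2-style generator would consume instead of a
    flat FLAME parameter vector.
    """
    return np.concatenate([geometry_render, photometric_render], axis=-1)
```

Keeping the conditioning in image space aligns it spatially with the generator's output, which is one plausible reason rendered conditioning works better than raw parameters.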

Perceiving Systems Conference Paper A Generative Model of People in Clothing Lassner, C., Pons-Moll, G., Gehler, P. V. In Proceedings IEEE International Conference on Computer Vision (ICCV), 853-862, IEEE, Piscataway, NJ, USA, IEEE International Conference on Computer Vision (ICCV), October 2017
We present the first image-based generative model of people in clothing in a full-body setting. We sidestep the commonly used complex graphics rendering pipeline and the need for high-quality 3D scans of dressed people. Instead, we learn generative models from a large image database. The main challenge is to cope with the high variance in human pose, shape, and appearance; for this reason, purely image-based approaches have not been considered so far. We show that this challenge can be overcome by splitting the generation process into two parts. First, we learn to generate a semantic segmentation of the body and clothing. Second, we learn a conditional model on the resulting segments that creates realistic images. The full model is differentiable and can be conditioned on pose, shape, or color. The results are samples of people in different clothing items and styles. The proposed model can generate entirely new people with realistic clothing. In several experiments we present encouraging results that suggest an entirely data-driven approach to people generation is possible.
URL BibTeX
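The two-stage split described in the abstract has a simple shape: one model maps pose to a semantic layout, a second paints an image conditioned only on that layout. The sketch below is our own minimal rendering of that structure; the function and argument names are invented for illustration.

```python
def generate_person(pose, segment_fn, paint_fn):
    """Toy sketch of a two-stage person generator (illustrative only).

    segment_fn : stage 1, maps a pose to a semantic body/clothing layout
    paint_fn   : stage 2, maps the layout to a realistic image, so
                 appearance is modeled per segment rather than jointly
                 with pose and shape
    """
    segmentation = segment_fn(pose)   # stage 1: pose -> semantic layout
    image = paint_fn(segmentation)    # stage 2: layout -> realistic image
    return segmentation, image
```

Factoring the problem this way lets each stage face a lower-variance task, which is how the approach copes with the high variance in pose, shape, and appearance.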