Capturing Humans, Objects and Scenes

The reconstruction of digital humans in motion is a vivid research field at the intersection of computer graphics and computer vision. While the entertainment industry focuses on high-quality avatars reconstructed from multi-view studio setups, our aim is to reconstruct and animate digital humans using commodity hardware like a single webcam to be accessible to everyone. In the past years, we had tremendous success in this effort, by employing optimization-based facial tracking methods like Face2Face which build upon the principle of analysis-by-synthesis, or by applying learned regression methods like MICA or InverseFaceNet that estimate the shape and expression state with a single feed-forward network. Face tracking plays an important role since humans are trained to interpret facial motion, thus, subtle tracking failures lead to a loss of information in a conversation. While regression-based methods show good results for single images, they are predicting very inconsistent 3D geometry for consecutive frames in a video, and, thus, are not applicable for the reconstruction of digital humans. Instead, current avatar reconstruction methods are using optimization-based techniques for tracking or hybrids leveraging methods like MICA to initialize the head shape. On top of this tracking which uses priors in the form of 3D morphable models like FLAME, we developed a series of methods that are able to represent the appearance of a subject including all its fine-scale detail. For example, we introduced 2D and 3D neural rendering and neural scene representation approaches which can be trained in a few minutes (INSTA) or are robust to facial expression tracking errors (GAN-Avatar). In 3VP, we demonstrated how an entire 360° animatable avatar of a human head can be reconstructed from a single camera using neural radiance fields.
These neural rendering methods revolutionized the field of 3D reconstruction and appearance capturing which we summarized in our EuroGraphics state of the art reports which were presented at several top tier conferences. It is worth noting that these methods are general and can be applied to other objects and even entire scenes (RGBDNeRF, TransformerFusion). Reconstructing objects and scenes is important in the context of human interactions which we ultimately try to model as well. In works like MOVER or MIME, we explicitly handle the reconstruction and modelling of scenes and humans together.