Stable Video Portraits

Perceiving Systems, Neural Capture and Synthesis

Mirela Ostrek

Guest Scientist

Neural Capture and Synthesis, Perceiving Systems

Justus Thies

Max Planck Research Group Leader

Rapid advances in the field of generative AI and text-to-image methods in particular have transformed the way we interact with and perceive computer-generated imagery today. In parallel, much progress has been made in 3D face reconstruction, using 3D Morphable Models (3DMM). In this paper, we present Stable Video Portraits, a novel hybrid 2D/3D generation method that outputs photorealistic videos of talking faces leveraging a large pre-trained text-to-image prior (2D), controlled via a 3DMM (3D). Specifically, we introduce a person-specific fine-tuning of a general 2D stable diffusion model which we lift to a video model by providing temporal 3DMM sequences as conditioning and by introducing a temporal denoising procedure. As an output, this model generates temporally smooth imagery of a person with 3DMM-based controls, i.e., a person-specific avatar. The facial appearance of this person-specific avatar can be edited and morphed to text-defined celebrities, without any test-time fine-tuning. The method is analyzed quantitatively and qualitatively, and we show that our method outperforms state-of-the-art monocular head avatar methods.

Author(s):	Mirela Ostrek and Justus Thies
Book Title:	European Conference on Computer Vision (ECCV 2024)
Year:	2024
Month:	October
Series:	LNCS
Publisher:	Springer, Cham

BibTeX Type:	Conference Paper (inproceedings)

Event Name:	European Conference on Computer Vision (ECCV 2024)
Event Place:	Milan, Italy
State:	Published
URL:	https://svp.is.tue.mpg.de/

BibTeX

@inproceedings{svp,
  title = {Stable Video Portraits},
  booktitle = {European Conference on Computer Vision (ECCV 2024)},
  abstract = {Rapid advances in the field of generative AI and text-to-image methods in particular have transformed the way we interact with and perceive computer-generated imagery today. In parallel, much progress has been made in 3D face reconstruction, using 3D Morphable Models (3DMM). In this paper, we present Stable Video Portraits, a novel hybrid 2D/3D generation method that outputs photorealistic videos of talking faces leveraging a large pre-trained text-to-image prior (2D), controlled via a 3DMM (3D). Specifically, we introduce a person-specific fine-tuning of a general 2D stable diffusion model which we lift to a video model by providing temporal 3DMM sequences as conditioning and by introducing a temporal denoising procedure. As an output, this model generates temporally smooth imagery of a person with 3DMM-based controls, i.e., a person-specific avatar. The facial appearance of this person-specific avatar can be edited and morphed to text-defined celebrities, without any test-time fine-tuning. The method is analyzed quantitatively and qualitatively, and we show that our method outperforms state-of-the-art monocular head avatar methods.},
  series = {LNCS},
  publisher = {Springer},
  address = {Cham},
  degree_type = {PhD},
  month = oct,
  year = {2024},
  author = {Ostrek, Mirela and Thies, Justus},
  url = {https://svp.is.tue.mpg.de/},
  month_numeric = {10}
}