Perceiving Systems Conference Paper 2026

Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis


To be widely applicable, speech-driven 3D head avatars must articulate their lips in accordance with speech, while also conveying the appropriate emotions with dynamically changing facial expressions. The key problem is that deterministic models produce high-quality lip-sync but without rich expressions, whereas stochastic models generate diverse expressions but with lower lip-sync quality. To get the best of both, we seek a stochastic model with accurate lip-sync. To that end, we develop a new approach based on the following observation: if a method generates realistic 3D lip motions, it should be possible to infer the spoken audio from the lip motion. The inferred speech should match the original input audio, and erroneous predictions create a novel supervision signal for training 3D talking head avatars with accurate lip-sync. To demonstrate this effect, we propose THUNDER (Talking Heads Under Neural Differentiable Elocution Reconstruction), a 3D talking head avatar framework that introduces a novel supervision mechanism via differentiable sound production. First, we train a novel mesh-to-speech model that regresses audio from facial animation. Then, we incorporate this model into a diffusion-based talking avatar framework. During training, the mesh-to-speech model takes the generated animation and produces a sound that is compared to the input speech, creating a differentiable analysis-by-audio-synthesis supervision loop. Our extensive qualitative and quantitative experiments demonstrate that THUNDER significantly improves the lip-sync quality of talking head avatars while still allowing for the generation of diverse, high-quality, expressive facial animations.
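The analysis-by-audio-synthesis supervision loop described above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: `mesh_to_speech`, `audio_loss`, and `supervision_signal` are hypothetical stand-ins, and audio is reduced to a toy sequence of scalar feature frames for clarity.

```python
def mesh_to_speech(lip_motion):
    """Toy stand-in for the pretrained mesh-to-speech model:
    regresses one audio-feature frame per lip-motion frame.
    (The real model maps facial animation to a speech signal.)"""
    return [2.0 * x + 0.1 for x in lip_motion]

def audio_loss(predicted, target):
    """Mean squared error between inferred and input audio features.
    In a differentiable framework, gradients of this loss flow back
    through mesh_to_speech into the animation generator."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def supervision_signal(generated_lip_motion, input_audio_features):
    """The extra training term: audio inferred from the generated
    animation must match the original driving speech."""
    inferred_audio = mesh_to_speech(generated_lip_motion)
    return audio_loss(inferred_audio, input_audio_features)
```

The key design point is that the mesh-to-speech model is frozen during avatar training and acts purely as a differentiable critic: any mismatch between the inferred and input speech becomes a gradient that improves lip-sync, without constraining the diversity of the rest of the face.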

Author(s): Radek Danecek and Carolin Schmitt and Senya Polikovsky and Michael J. Black
Links: project page, arXiv
Book Title: Int. Conf. on 3D Vision (3DV)
Year: 2026
Month: March
Day: 20
BibTeX Type: Conference Paper (inproceedings)
State: Accepted

BibTeX

@inproceedings{Thunder26,
  title = {Supervising {3D} Talking Head Avatars with Analysis-by-Audio-Synthesis},
  booktitle = {Int.~Conf.~on 3D Vision (3DV)},
  month = mar,
  year = {2026},
  author = {Danecek, Radek and Schmitt, Carolin and Polikovsky, Senya and Black, Michael J.},
  month_numeric = {3}
}