Realistic Avatar Generation in the Wild (text-to-video)
This post from the Tavus team details the development of Phoenix, a groundbreaking generative model for realistic avatar creation and text-to-video generation. Phoenix leverages audio- and text-driven 3D models, integrating volumetric rendering techniques and 2D Generative Adversarial Networks (GANs) to create lifelike replicas from short video clips.
Christian Safka & Keyu Chen
Christian is the Head of Machine Learning, and Keyu is a Senior Researcher at Tavus, a leading generative AI video research company specializing in models and APIs for talking-head videos.
July 23, 2024

In this post we dive into the development of generative models for realistic avatar creation and text-to-video generation. Specifically, we use audio- and text-driven 3D models alongside a combination of volumetric rendering techniques and 2D GANs to create lifelike replicas from short videos of just 1–2 minutes.

Our groundbreaking model, Phoenix, demonstrates the capability of generating high-quality full-body replicas that capture a broad spectrum of human appearance, expression, and emotion. After the replica has been trained, we use the model weights and other intermediates for efficient text-to-video generation with new, unseen scripts. Phoenix is robust to a large diversity of voices, appearances, and video conditions, and it can generate videos spanning a wide array of durations, resolutions, and aspect ratios.

This in-depth study will cover the following:

  1. Comparison of several models in the area of talking-head generation
  2. Technical overview of the Phoenix model

The field of avatar creation has grown significantly over the last few years, with recent innovations vastly eclipsing earlier methodologies. Current pioneering efforts are predominantly driven by the growing sophistication of 2D Generative Adversarial Networks (GANs), e.g., Wav2Lip (1), Write-a-speaker (2), and DINet (3), and by the innovative 3D reconstruction and rendering techniques offered by Neural Radiance Fields (4) (NeRF) and 3D Gaussian Splatting (5) (3D-GS).

Concurrently, there are research efforts toward one-shot or few-shot avatar generation (6) that are very promising. At this time, however, those techniques struggle to maintain high-fidelity output across a diverse range of inputs.

Technical overview of the Phoenix model

The goals of Phoenix are twofold: first, to create a replica of the person, and second, to enable real-time text-to-video generation from new scripts.

Our model can be broken down into the following stages:

  • Text-to-speech (TTS)
  • 3D reconstruction of the head and shoulders
  • Script-driven facial animation
  • High-fidelity rendering

Text-to-speech

Our audio engine trains multiple voice models for each avatar. It then detects and deploys the version that best captures your accent and range of expression while still sounding natural on new script input.

To match the voice model to the original audio, we explore the use of a speaker similarity model with encoding, distance calculation, and prediction modules trained end-to-end (See Figure 1).

Figure 1: Select the best voice model for you.
Different TTS models may perform differently on different people. We employ a voice embedding network to map the original and generated voices to a shared latent space. Based on this similarity metric, we select the voice model whose generated speech has the closest distance to your real voice.
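
As a minimal sketch of this selection step, the snippet below scores each candidate voice model by the cosine similarity between its generated speech and the speaker's real voice in a shared embedding space. The embed_voice encoder and the model.synthesize call are hypothetical placeholders, not Tavus APIs.

    import numpy as np

    def cosine_similarity(a, b):
        # Cosine similarity between two voice embeddings.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def select_voice_model(real_audio, candidate_models, script, embed_voice):
        # embed_voice: hypothetical encoder mapping audio -> fixed-size embedding.
        real_emb = embed_voice(real_audio)
        scored = []
        for model in candidate_models:
            synth_emb = embed_voice(model.synthesize(script))  # hypothetical TTS call
            scored.append((cosine_similarity(real_emb, synth_emb), model))
        # The highest similarity corresponds to the closest match to the real voice.
        return max(scored, key=lambda item: item[0])[1]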

3D reconstruction of the head and shoulders

Using frames from a short video, our 3D reconstruction model observes the same person with dynamic head movements and expressions. First, we regress a 3D Morphable Model (3DMM) of the head and shoulders (See Figure 2-A and Figure 2-B for examples from similar research work). Then we combine this model with differentiable rendering techniques and fine-tune the facial geometry details, drawing on priors learned from thousands of 3D scans and cutting-edge implicit representations (See Figure 3 for another research example).

Figure 2-A: Illustration of 3D template model for head and shoulder.
We estimate the deformed 3D shapes that match your appearance from the video. The 3D model covers the full head and shoulders, similar to related research work (7).

Figure 2-B: Mathematical explanation of 3D Morphable Model (3DMM).
Based on a linear combination of identity and expression bases, the 3DMM can represent the deformed shapes of different people with different expressions.
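
To make the linear model concrete, here is a minimal numpy sketch of the 3DMM deformation, with the vertex count and basis sizes chosen purely for illustration (they are not Phoenix's actual dimensions):

    import numpy as np

    # Illustrative sizes only: N vertices, 80 identity and 64 expression components.
    N, K_ID, K_EXP = 5000, 80, 64
    mean_shape = np.zeros(3 * N)           # flattened mean face (placeholder data)
    B_id = np.random.randn(3 * N, K_ID)    # identity basis, e.g. PCA of 3D scans
    B_exp = np.random.randn(3 * N, K_EXP)  # expression basis

    def deform_shape(alpha, beta):
        # 3DMM: S = mean_shape + B_id @ alpha + B_exp @ beta
        return (mean_shape + B_id @ alpha + B_exp @ beta).reshape(N, 3)

    neutral = deform_shape(np.zeros(K_ID), np.zeros(K_EXP))  # reduces to the mean face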

Figure 3: Fine-tune facial geometry details with differentiable rendering.
Here is an illustration from openly released research work (8) on fine-detailed face reconstruction.

Ⓒ Figure from Wood, Erroll, et al. "3d face reconstruction with dense landmarks." European Conference on Computer Vision. 2022.

To tackle the 3D reconstruction problem, we implemented an in-house pipeline utilizing several components, such as dense face landmark detection, face tracking, pose estimation, identity/expression refinement, and wrinkle and facial detail recovery.
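
As an illustration of how such components can fit together, the sketch below optimizes shared identity and per-frame expression coefficients against detected 2D landmarks with PyTorch. The detect_landmarks and project helpers stand in for the in-house tracking and camera modules and are hypothetical; the real pipeline is considerably more involved.

    import torch

    def fit_3dmm(frames, detect_landmarks, project, mean_shape, B_id, B_exp, steps=200):
        # detect_landmarks: hypothetical dense-landmark detector, frame -> (L, 2) tensor.
        # project: hypothetical camera model projecting 3D vertices to 2D landmark positions.
        alpha = torch.zeros(B_id.shape[1], requires_grad=True)   # identity (shared across frames)
        betas = [torch.zeros(B_exp.shape[1], requires_grad=True) for _ in frames]  # per-frame expression
        optimizer = torch.optim.Adam([alpha, *betas], lr=1e-2)
        targets = [detect_landmarks(frame) for frame in frames]

        for _ in range(steps):
            optimizer.zero_grad()
            loss = 0.0
            for beta, target in zip(betas, targets):
                verts = (mean_shape + B_id @ alpha + B_exp @ beta).reshape(-1, 3)
                loss = loss + torch.mean((project(verts) - target) ** 2)
            loss = loss + 1e-3 * alpha.square().sum()  # keep identity coefficients plausible
            loss.backward()
            optimizer.step()
        return alpha.detach(), [beta.detach() for beta in betas]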

Script-driven facial animation

We generate audio from text-to-speech models. The audio is encoded using embedding models trained on thousands of hours of multilingual speech. We transform the audio into speech-related signals and convert them into audio features such as Mel-frequency cepstral coefficients (9) (MFCC). Finally, a pretrained audio embedding model maps the high-dimensional audio features into a compact latent space, in which we can evaluate voice similarities and extract latent representations for visual model training (See Figure 4 for a visual explanation).

Figure 4: Encode the audio input to latent space.
The audio embedding network takes the original audio signal as input and extracts spectrogram features. The output is mapped into a compact latent space in which each latent vector uniquely represents the audio information for speech.
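
As a small example of the feature-extraction step, MFCCs can be computed with a standard audio library such as librosa; the audio_encoder that maps them into the latent space is a hypothetical stand-in for the pretrained embedding network described above.

    import librosa
    import numpy as np

    def extract_mfcc(path, n_mfcc=40):
        # Load audio at 16 kHz and compute Mel-frequency cepstral coefficients.
        y, sr = librosa.load(path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
        return mfcc.T                                           # shape: (frames, n_mfcc)

    # features = extract_mfcc("speaker.wav")
    # latents = audio_encoder(features)  # hypothetical pretrained embedding network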

To build a realistic facial animation model, we compiled a large, diverse in-the-wild video dataset. We train multiple foundational models on text and audio covering a wide array of emotions, facial attributes, and lip synchronization (See Figure 5).

To further personalize the animation to the target, we fine-tune these models per avatar. This fine-tuning takes the raw audio, the audio embeddings, and the 3D model from the previous steps as input to learn each person's specific speaking style.

Figure 5: Learning 3D facial animation from audio input.
Here is an example result from audio-driven facial animation output. We utilize multiple foundational networks to predict different facial animation data, e.g., expressions, head poses, and lip movements. These models are then fine-tuned per avatar to best align with your personalized speaking style.
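
To make the idea concrete, here is a schematic PyTorch module (not the production architecture) that regresses per-frame expression coefficients and a head pose from a window of audio latents; per-avatar fine-tuning would continue training such a network on the target person's data.

    import torch.nn as nn

    class AudioToAnimation(nn.Module):
        # Schematic only: maps a sequence of audio latents to per-frame
        # 3DMM expression coefficients and a 6-DoF head pose.
        def __init__(self, audio_dim=256, hidden=512, n_exp=64):
            super().__init__()
            self.temporal = nn.GRU(audio_dim, hidden, batch_first=True)
            self.exp_head = nn.Linear(hidden, n_exp)    # expression coefficients
            self.pose_head = nn.Linear(hidden, 6)       # rotation + translation

        def forward(self, audio_latents):               # (B, T, audio_dim)
            h, _ = self.temporal(audio_latents)
            return self.exp_head(h), self.pose_head(h)  # (B, T, n_exp), (B, T, 6)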

High-fidelity rendering 

In order to achieve high-fidelity avatar videos, we combine state-of-the-art GANs and cutting-edge implicit volumetric rendering techniques (e.g. NeRFs and 3D Gaussian Splatting) to build our video rendering pipeline.

Since traditional GANs are generally limited in image resolution while volumetric models struggle with temporal consistency, we make significant improvements to both research categories and strategically integrate them. By jointly optimizing these merged models, we are able to surpass the current limitations of either solution on its own.
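
One way such a hybrid can be wired together, purely as a sketch under our own illustrative naming: a volumetric stage renders a coarse, 3D-consistent frame from the animation coefficients, and a 2D GAN-style refiner restores resolution and fine texture; joint optimization lets each stage compensate for the other's weakness.

    import torch.nn as nn

    class HybridRenderer(nn.Module):
        # Illustrative composition, not the Phoenix architecture: a volumetric model
        # (e.g. NeRF- or 3D-GS-based) produces a coarse RGB frame, and a 2D generator
        # sharpens it to the target resolution.
        def __init__(self, volumetric_model: nn.Module, refiner: nn.Module):
            super().__init__()
            self.volumetric_model = volumetric_model
            self.refiner = refiner

        def forward(self, animation_coeffs, camera):
            coarse = self.volumetric_model(animation_coeffs, camera)  # (B, 3, H, W)
            return self.refiner(coarse)                               # refined high-res frame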

Finally, we obtain a rendered video with new animations and the surface texture of the original training video.

Try it out for yourself! https://tavus.io/developer

________________________________________

References
(1) Prajwal, K. R., et al. "A lip sync expert is all you need for speech to lip generation in the wild." Proceedings of the 28th ACM international conference on multimedia. 2020.
(2) Li, Lincheng, et al. "Write-a-speaker: Text-based emotional and rhythmic talking-head generation." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 3. 2021.
(3) Zhang, Zhimeng, et al. "DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video." AAAI 2023.
(4) Mildenhall, Ben, et al. "NeRF: Representing scenes as neural radiance fields for view synthesis." Communications of the ACM 65.1 (2021): 99-106.
(5) Kerbl, Bernhard, et al. "3D Gaussian Splatting for Real-Time Radiance Field Rendering." ACM Transactions on Graphics 42.4 (2023).
(6) Tian, Linrui, et al. "EMO: Emote Portrait Alive-Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions." arXiv preprint arXiv:2402.17485 (2024).
(7) Li, Ruilong, et al. "Learning formation of physically-based face attributes." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.
(8) Wood, Erroll, et al. "3D face reconstruction with dense landmarks." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.
(9) Mermelstein, Paul. "Distance measures for speech recognition, psychological and instrumental." Pattern recognition and artificial intelligence 116 (1976): 374-388.
