Realistic Avatar Generation in the Wild (text-to-video)
This post from the Tavus team details the development of Phoenix, a groundbreaking generative model for realistic avatar creation and text-to-video generation. Phoenix leverages audio- and text-driven 3D models, integrating volumetric rendering techniques and 2D Generative Adversarial Networks (GANs) to create lifelike replicas from short video clips.
Christian Safka & Keyu Chen
Christian is the Head of Machine Learning, and Keyu is a Senior Researcher at Tavus, a leading generative AI video research company specializing in models and APIs for talking-head videos.
July 23, 2024

In this post we dive into the development of generative models for realistic avatar creation and text-to-video generation. Specifically, we use audio- and text-driven 3D models alongside a combination of volumetric rendering techniques and 2D GANs to create lifelike replicas from short videos of just 1–2 minutes.

Our groundbreaking model, Phoenix, demonstrates the capability of generating high-quality full-body replicas that capture a broad spectrum of human appearance, expression, and emotion. After the replica has been trained, we use the model weights and other intermediates for efficient text-to-video generation with new, unseen scripts. Phoenix is robust to a large diversity of voices, appearances, and video conditions, and it can generate videos spanning a wide array of durations, resolutions, and aspect ratios.

This in-depth study will cover the following:

  1. Comparison of several models in the area of talking-head generation
  2. Technical overview of the Phoenix model

The field of avatar creation has grown significantly over the last few years, with recent innovations vastly eclipsing earlier methodologies. Current pioneering efforts are predominantly driven by the growing sophistication of 2D Generative Adversarial Networks (GANs), e.g., Wav2Lip (1), Write-a-speaker (2), and DINet (3), and by the innovative 3D reconstruction and rendering techniques offered by Neural Radiance Fields (4) (NeRF) and 3D Gaussian Splatting (5) (3D-GS).

Concurrently, there are research efforts toward one-shot or few-shot avatar generation (6) that are very promising. At this time, however, those techniques struggle to maintain high-fidelity output across a diverse range of inputs.

Technical overview of the Phoenix model

The goals of Phoenix are twofold: first, to create a replica of the person, and second, to enable real-time text-to-video generation from new scripts.

Our model can be broken down into the following stages:

  • Text-to-speech (TTS)
  • 3D reconstruction of the head and shoulders
  • Script-driven facial animation
  • High-fidelity rendering

Text-to-speech

Our audio engine trains multiple voice models for each avatar. It then detects and deploys the version that best captures your accent and range of expression while still sounding natural on new script input.

To match the voice model to the original audio, we explore the use of a speaker similarity model with encoding, distance calculation, and prediction modules trained end-to-end (See Figure 1).

Figure 1: Select the best voice model for you.
Different TTS models may perform differently on different people. We employ a voice embedding network to map the original and generated voices to a shared latent space. Based on this similarity metric, we select the voice model whose generated speech has the closest distance to your real voice.
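
As a minimal sketch of this selection step, the snippet below scores each candidate voice model by the cosine similarity between its generated speech and the speaker's real voice in a shared embedding space. The embed_voice encoder and the model.synthesize call are hypothetical placeholders, not Tavus APIs.

    import numpy as np

    def cosine_similarity(a, b):
        # Cosine similarity between two voice embeddings.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def select_voice_model(real_audio, candidate_models, script, embed_voice):
        # embed_voice: hypothetical encoder mapping audio -> fixed-size embedding.
        real_emb = embed_voice(real_audio)
        scored = []
        for model in candidate_models:
            synth_emb = embed_voice(model.synthesize(script))  # hypothetical TTS call
            scored.append((cosine_similarity(real_emb, synth_emb), model))
        # The highest similarity corresponds to the closest match to the real voice.
        return max(scored, key=lambda item: item[0])[1]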

3D reconstruction of the head and shoulders

Using frames from a short video, our 3D reconstruction model observes the same person with dynamic head movements and expressions. First, we regress a 3D Morphable Model (3DMM) of the head and shoulders (See Figure 2-A and Figure 2-B for examples from similar research work). Then we combine this model with differentiable rendering techniques and fine-tune the facial geometry details, drawing on priors learned from thousands of 3D scans and cutting-edge implicit representations (See Figure 3 for another research example).

Figure 2-A: Illustration of 3D template model for head and shoulder.
We estimate the deformed 3D shapes that match your appearance from the video. The 3D model covers the full head and shoulders, similar to related research work (7).

Figure 2-B: Mathematical explanation of 3D Morphable Model (3DMM).
Based on a linear combination of identity and expression bases, the 3DMM can represent the deformed shapes of different people with different expressions.
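
To make the linear model concrete, here is a minimal numpy sketch of the 3DMM deformation, with the vertex count and basis sizes chosen purely for illustration (they are not Phoenix's actual dimensions):

    import numpy as np

    # Illustrative sizes only: N vertices, 80 identity and 64 expression components.
    N, K_ID, K_EXP = 5000, 80, 64
    mean_shape = np.zeros(3 * N)           # flattened mean face (placeholder data)
    B_id = np.random.randn(3 * N, K_ID)    # identity basis, e.g. PCA of 3D scans
    B_exp = np.random.randn(3 * N, K_EXP)  # expression basis

    def deform_shape(alpha, beta):
        # 3DMM: S = mean_shape + B_id @ alpha + B_exp @ beta
        return (mean_shape + B_id @ alpha + B_exp @ beta).reshape(N, 3)

    neutral = deform_shape(np.zeros(K_ID), np.zeros(K_EXP))  # reduces to the mean face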

Figure 3: Fine-tune facial geometry details with differentiable rendering.
Here is an illustration from openly released research work (8) on fine-detailed face reconstruction.

Ⓒ Figure from Wood, Erroll, et al. "3d face reconstruction with dense landmarks." European Conference on Computer Vision. 2022.

To tackle the 3D reconstruction problem, we implemented an in-house pipeline utilizing several components, such as dense face landmark detection, face tracking, pose estimation, identity/expression refinement, and wrinkle and facial detail recovery.
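
As an illustration of how such components can fit together, the sketch below optimizes shared identity and per-frame expression coefficients against detected 2D landmarks with PyTorch. The detect_landmarks and project helpers stand in for the in-house tracking and camera modules and are hypothetical; the real pipeline is considerably more involved.

    import torch

    def fit_3dmm(frames, detect_landmarks, project, mean_shape, B_id, B_exp, steps=200):
        # detect_landmarks: hypothetical dense-landmark detector, frame -> (L, 2) tensor.
        # project: hypothetical camera model projecting 3D vertices to 2D landmark positions.
        alpha = torch.zeros(B_id.shape[1], requires_grad=True)   # identity (shared across frames)
        betas = [torch.zeros(B_exp.shape[1], requires_grad=True) for _ in frames]  # per-frame expression
        optimizer = torch.optim.Adam([alpha, *betas], lr=1e-2)
        targets = [detect_landmarks(frame) for frame in frames]

        for _ in range(steps):
            optimizer.zero_grad()
            loss = 0.0
            for beta, target in zip(betas, targets):
                verts = (mean_shape + B_id @ alpha + B_exp @ beta).reshape(-1, 3)
                loss = loss + torch.mean((project(verts) - target) ** 2)
            loss = loss + 1e-3 * alpha.square().sum()  # keep identity coefficients plausible
            loss.backward()
            optimizer.step()
        return alpha.detach(), [beta.detach() for beta in betas]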

Script-driven facial animation

We generate audio from text-to-speech models. The audio is encoded using embedding models trained on thousands of hours of multilingual speech. We transform the audio into speech-related signals and convert them into audio features such as Mel-frequency cepstral coefficients (9) (MFCC). Finally, a pretrained audio embedding model maps the high-dimensional audio features into a compact latent space, in which we can evaluate voice similarities and extract latent representations for visual model training (See Figure 4 for a visual explanation).

Figure 4: Encode the audio input to latent space.
The audio embedding network takes the original audio signal as input and extracts spectrogram features. The output is mapped into a compact latent space in which each latent vector uniquely represents the audio information for speech.
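
As a small example of the feature-extraction step, MFCCs can be computed with a standard audio library such as librosa; the audio_encoder that maps them into the latent space is a hypothetical stand-in for the pretrained embedding network described above.

    import librosa
    import numpy as np

    def extract_mfcc(path, n_mfcc=40):
        # Load audio at 16 kHz and compute Mel-frequency cepstral coefficients.
        y, sr = librosa.load(path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
        return mfcc.T                                           # shape: (frames, n_mfcc)

    # features = extract_mfcc("speaker.wav")
    # latents = audio_encoder(features)  # hypothetical pretrained embedding network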

To build a realistic facial animation model, we compiled a large, diverse in-the-wild video dataset. We train multiple foundational models on text and audio covering a wide array of emotions, facial attributes, and lip synchronization (See Figure 5).

To further personalize the animation to the target, we fine-tune these models per avatar. This fine-tuning takes the raw audio, the audio embeddings, and the 3D model from the previous steps as input to learn each person's specific speaking style.

Figure 5: Learning 3D facial animation from audio input.
Here is an example result from audio-driven facial animation output. We utilize multiple foundational networks to predict different facial animation data, e.g., expressions, head poses, and lip movements. These models are then fine-tuned per avatar to best align with your personalized speaking style.
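
To make the idea concrete, here is a schematic PyTorch module (not the production architecture) that regresses per-frame expression coefficients and a head pose from a window of audio latents; per-avatar fine-tuning would continue training such a network on the target person's data.

    import torch.nn as nn

    class AudioToAnimation(nn.Module):
        # Schematic only: maps a sequence of audio latents to per-frame
        # 3DMM expression coefficients and a 6-DoF head pose.
        def __init__(self, audio_dim=256, hidden=512, n_exp=64):
            super().__init__()
            self.temporal = nn.GRU(audio_dim, hidden, batch_first=True)
            self.exp_head = nn.Linear(hidden, n_exp)    # expression coefficients
            self.pose_head = nn.Linear(hidden, 6)       # rotation + translation

        def forward(self, audio_latents):               # (B, T, audio_dim)
            h, _ = self.temporal(audio_latents)
            return self.exp_head(h), self.pose_head(h)  # (B, T, n_exp), (B, T, 6)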

High-fidelity rendering 

In order to achieve high-fidelity avatar videos, we combine state-of-the-art GANs and cutting-edge implicit volumetric rendering techniques (e.g. NeRFs and 3D Gaussian Splatting) to build our video rendering pipeline.

Since traditional GANs are generally limited in image resolution while volumetric models struggle with temporal consistency, we make significant improvements to both research categories and strategically integrate them. By jointly optimizing these merged models, we are able to surpass the current limitations of either solution on its own.
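
One way such a hybrid can be wired together, purely as a sketch under our own illustrative naming: a volumetric stage renders a coarse, 3D-consistent frame from the animation coefficients, and a 2D GAN-style refiner restores resolution and fine texture; joint optimization lets each stage compensate for the other's weakness.

    import torch.nn as nn

    class HybridRenderer(nn.Module):
        # Illustrative composition, not the Phoenix architecture: a volumetric model
        # (e.g. NeRF- or 3D-GS-based) produces a coarse RGB frame, and a 2D generator
        # sharpens it to the target resolution.
        def __init__(self, volumetric_model: nn.Module, refiner: nn.Module):
            super().__init__()
            self.volumetric_model = volumetric_model
            self.refiner = refiner

        def forward(self, animation_coeffs, camera):
            coarse = self.volumetric_model(animation_coeffs, camera)  # (B, 3, H, W)
            return self.refiner(coarse)                               # refined high-res frame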

Finally, we obtain a rendered video with new animations and the surface texture of the original training video.

Try it out for yourself! https://tavus.io/developer

________________________________________

References
(1) Prajwal, K. R., et al. "A lip sync expert is all you need for speech to lip generation in the wild." Proceedings of the 28th ACM international conference on multimedia. 2020.
(2) Li, Lincheng, et al. "Write-a-speaker: Text-based emotional and rhythmic talking-head generation." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 3. 2021.
(3) Zhang, Zhimeng, et al. "DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video." AAAI 2023.
(4) Mildenhall, Ben, et al. "NeRF: Representing scenes as neural radiance fields for view synthesis." Communications of the ACM 65.1 (2021): 99-106.
(5) Kerbl, Bernhard, et al. "3D Gaussian Splatting for Real-Time Radiance Field Rendering." ACM Transactions on Graphics 42.4 (2023).
(6) Tian, Linrui, et al. "EMO: Emote Portrait Alive-Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions." arXiv preprint arXiv:2402.17485 (2024).
(7) Li, Ruilong, et al. "Learning formation of physically-based face attributes." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.
(8) Wood, Erroll, et al. "3D face reconstruction with dense landmarks." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.
(9) Mermelstein, Paul. "Distance measures for speech recognition, psychological and instrumental." Pattern recognition and artificial intelligence 116 (1976): 374-388.
