
Synthesis of Photorealistic Avatars From Audio Cues

Photorealistic avatars that act like humans, complete with expressions and body language, generated from nothing but audio input? Yes, this is now possible thanks to this latest model, which converts dyadic conversation into realistic-looking digital characters. A dyadic conversation is simply a conversation or dialogue between two people. Researchers from the Codec Avatars Lab at Meta in Pittsburgh and the University of California, Berkeley presented this cutting-edge model.

The model takes the audio of a dyadic conversation as input and generates 3D avatar motion for the face, body, and hands. The final output is photorealistic motion that shows how these digital characters move and appear, just as they would act in real life.
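To make the overall flow concrete, here is a minimal sketch of what such an audio-to-motion pipeline could look like. Everything in it is an illustrative assumption: the function names, the feature extraction, and the output dimensions are made up, and the two branches are only placeholders for the face and body generators described below.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_audio_features(audio, hop=160):
    # Stand-in for a learned audio encoder: frame the waveform and take
    # simple per-frame statistics (purely illustrative, not the paper's encoder).
    n_frames = len(audio) // hop
    frames = audio[: n_frames * hop].reshape(n_frames, hop)
    return np.stack([frames.mean(axis=1), frames.std(axis=1)], axis=1)

def face_branch(feats):
    # Placeholder for the face motion generator (diffusion-based, per the article).
    return rng.normal(size=(len(feats), 128))   # 128-D face code per frame (assumed size)

def body_branch(feats):
    # Placeholder for the body/hand motion generator (VQ + diffusion, per the article).
    return rng.normal(size=(len(feats), 104))   # joint-angle vector per frame (assumed size)

def synthesize_avatar_motion(audio):
    feats = extract_audio_features(audio)
    return face_branch(feats), body_branch(feats)

face, body = synthesize_avatar_motion(rng.normal(size=16000))  # one second of fake audio
print(face.shape, body.shape)  # (100, 128) (100, 104)
```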

Photorealistic Avatar

The researchers presented a framework that creates digital characters that look realistic in appearance. Their actions, gestures, body language, and movements resemble those of humans in dyadic conversation. The avatars behave and move their bodies in a way that mirrors the conversation, just like real people talking to each other.

How Do Photorealistic Avatars Work?

With the help of vector quantization, diverse movement samples are grouped into a discrete codebook, so the model can learn from a wide range of motions. The photorealistic avatar is trained on these examples and draws on them when generating new movement.
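A minimal sketch of the vector-quantization step, assuming a codebook of motion snippets has already been learned: each pose frame is mapped to its nearest codebook entry, giving a discrete, coarse version of the motion. The shapes and the nearest-neighbor lookup are illustrative, not the authors' exact formulation.

```python
import numpy as np

def quantize(poses, codebook):
    """Map each pose frame to the index of its nearest codebook entry.

    poses:    (T, D) motion frames
    codebook: (K, D) learned code vectors (assumed already trained)
    """
    # Squared distance between every pose and every code: shape (T, K).
    d = ((poses[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # 8 codes of dimension 4 (toy sizes)
poses = rng.normal(size=(5, 4))      # 5 pose frames
codes = quantize(poses, codebook)
coarse = codebook[codes]             # quantized (coarse) version of the motion
print(codes)
```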

Through diffusion, minute details of movement and expression are added to the character, making the final output more expressive and natural. The visual representation of the character as an avatar looks extremely realistic, since it can express subtle emotions such as a sneer or a smirk.
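As a rough illustration of how diffusion adds detail, the toy loop below runs a DDPM-style reverse process that starts from noise and is guided by the coarse motion. The denoiser is a stub standing in for a trained network, so treat this as a sketch of the mechanism rather than the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser_stub(x_t, t, coarse_motion):
    # Stand-in for a trained network that predicts the noise in x_t,
    # conditioned on the timestep and the coarse guide motion.
    return x_t - coarse_motion

def refine_with_diffusion(coarse_motion, steps=50):
    """Toy DDPM-style reverse process: denoise from pure noise toward
    detailed motion, guided by the coarse motion."""
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.normal(size=coarse_motion.shape)          # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser_stub(x, t, coarse_motion)
        # Standard DDPM posterior-mean update from the predicted noise.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                                     # re-inject noise except at the last step
            x = x + np.sqrt(betas[t]) * rng.normal(size=x.shape)
    return x

coarse = rng.normal(size=(5, 4))    # 5 frames of coarse 4-D motion (toy)
print(refine_with_diffusion(coarse).shape)
```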

Photorealistic Avatar's Expressions

In the figure above, the woman's expressions can be clearly seen. She shows two entirely different expressions, humorous (left) and serious (right), despite having the same body language. With the photorealistic avatar the expressions are easy to perceive, whereas with the meshes it is difficult to read the avatar's expressions.

A diffusion model is used to add fine detail to the facial motion, making it more realistic. To make the digital character more natural and expressive, the conversation audio is combined with facial guidance from pre-trained lip movements. For the body, a motion diffusion model makes the movements smoother and more realistic. Body motion driven by audio alone is not very varied, which is where the diverse samples from vector quantization help.
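One plausible way to feed both signals into the body denoiser is to concatenate per-frame audio features with the sampled guide poses. The sketch below assumes that design, which is a common conditioning choice but not confirmed by the article.

```python
import numpy as np

def build_condition(audio_feats, guide_poses):
    """Per-frame conditioning for the body diffusion model (illustrative).

    audio_feats: (T, A) features from the conversation audio
    guide_poses: (T, D) coarse poses sampled from the VQ codebook
    """
    assert len(audio_feats) == len(guide_poses), "one condition per frame"
    return np.concatenate([audio_feats, guide_poses], axis=1)  # (T, A + D)

rng = np.random.default_rng(0)
cond = build_condition(rng.normal(size=(5, 2)), rng.normal(size=(5, 4)))
print(cond.shape)  # (5, 6): audio and guide-pose features side by side
```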

This method is versatile and diverse: it can generate multiple samples from a single audio input, as shown in the image above. The avatars in the top row are listening, so the model has generated multiple plausible samples of that pose. The photorealistic avatar also conveys the listening mode accurately, signaling through its body language and facial expression that it is paying full attention (top), while in the speaking mode (bottom) the avatar's hand and body movements are accurate as well.
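In code terms, "multiple samples from a single audio" can come from two sources of randomness: the discrete guide poses are sampled rather than fixed, and the diffusion process starts from fresh noise. The toy sketch below (all distributions invented) shows how different seeds yield different motions for the same audio features.

```python
import numpy as np

def sample_motion(audio_feats, codebook, seed):
    """Toy sampler: same audio, different seed -> different motion."""
    rng = np.random.default_rng(seed)
    # Stand-in for sampling discrete guide poses from a learned prior.
    codes = rng.integers(0, len(codebook), size=len(audio_feats))
    guide = codebook[codes]                  # coarse motion from the codebook
    noise = rng.normal(size=guide.shape)     # fresh diffusion noise
    return guide + 0.1 * noise               # "refined" motion (toy stand-in)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))           # 8 learned codes (toy)
audio_feats = rng.normal(size=(5, 2))        # 5 frames of audio features (toy)
for seed in range(3):                        # three diverse samples, one audio
    print(sample_motion(audio_feats, codebook, seed)[0])
```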

The researchers also introduced a multi-view dataset that captures conversations from multiple angles, enabling more realistic reconstruction. The model excels at creating gestures that match the dyadic conversation and look remarkably close to a real-life scenario.

Compared with previous models that rely on diffusion or vector quantization alone, this model showed outstanding performance, producing a variety of movements that match the dyadic conversation.

Wrap Up!

This latest model generates gestures and facial expressions for a photorealistic avatar that are fully synchronized with the provided audio. Evaluations show that for an accurate judgment of subtle details such as body language, facial expressions, or hand movements, the character should be rendered as a photorealistic avatar rather than a mesh.
