Nvidia Proposes A Neural Talking-Head Video Synthesis AI Model, Making Video Conferencing 10x More Bandwidth Efficient

NVIDIA researchers introduce an AI system that generates a realistic talking-head video of a person using one source image and a driving video. The source image encodes an individual’s appearance, and the driving video directs motions in the resulting video. 

The researchers propose a pure neural rendering approach in which a deep network renders the talking-head video in a one-shot setting, without using a 3D graphics model of the human head. Compared with 3D graphics-based models, 2D-based methods have several advantages:

  1. They avoid 3D model acquisition, which is usually difficult and costly.
  2. 2D-based techniques can adequately synthesize hair, beard, etc. In contrast, it is challenging to acquire accurate 3D geometries of these regions.
  3. They can directly synthesize accessories in source images like eyeglasses, hats, and scarves without their 3D models.

However, because they lack a 3D graphics model, existing 2D-based one-shot talking-head methods can only synthesize the talking head from the original viewpoint. They cannot render it from a novel view.

The proposed approach addresses this fixed-viewpoint limitation and achieves local free-view synthesis: one can freely change the talking head's viewpoint within a large neighborhood of the original view.

The model first extracts appearance features and 3D canonical keypoints from the source image. From these, it computes the source keypoints used to generate the synthesized video. The system decomposes the keypoint representation into person-specific canonical keypoints and motion-related transformations, using the 3D keypoints to model facial expressions and the person's geometric signature, and creates a talking-head video from the resulting face and head pose information.
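To make the decomposition more concrete, here is a minimal Python sketch of how person-specific canonical keypoints might be combined with motion-related transformations. The exact form used here (a rotation, a translation, and a per-keypoint expression offset) and every name in it (`yaw_rotation`, `transform_keypoints`) are assumptions for illustration, not the authors' code. It also hints at how free-view synthesis could work: changing only the rotation turns the head while the canonical keypoints stay fixed.

```python
import math

def yaw_rotation(angle_rad):
    """3x3 rotation matrix about the vertical (y) axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]

def transform_keypoints(canonical, rotation, translation, deformations):
    """Apply head pose and expression offsets to canonical 3D keypoints.

    Assumed form for each keypoint: rotation @ point + translation + offset.
    """
    out = []
    for point, delta in zip(canonical, deformations):
        rotated = [sum(rotation[i][j] * point[j] for j in range(3))
                   for i in range(3)]
        out.append(tuple(rotated[i] + translation[i] + delta[i]
                         for i in range(3)))
    return out

# Hypothetical canonical keypoints for one person (identity-specific).
canonical = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
# No expression change in this example, only a head turn and a shift.
no_deform = [(0.0, 0.0, 0.0)] * len(canonical)
turned = transform_keypoints(canonical, yaw_rotation(math.pi / 2),
                             (0.0, 0.0, 0.5), no_deform)
print(turned)
```

In this sketch the identity lives entirely in `canonical`, while pose and expression live in the other three arguments, which is the separation the paragraph above describes.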


NVIDIA’s Maxine software development kit for video conferencing services targets the requirements of high-quality video conferencing. Maxine helps developers build and deploy AI-powered features in their applications without having to develop extensive resources of their own.

The random break-ups, jitter, and freezes in video calls usually result from the video conferencing app’s heavy bandwidth demands. The new approach reduces bandwidth requirements, and with them costs, significantly. It does so by sending only a keypoint representation of the face and reconstructing the source video on the receiver side, where generative adversarial networks (GANs) synthesize the talking head. Compared to the commercial H.264 standard, this approach can reduce bandwidth to about one-tenth.


Most video calling systems transmit a compressed video signal (massive streams of pixel-packed images) over participants’ Internet connections, which frequently cannot handle the load. In the method proposed by NVIDIA, the transmitted data is restricted to a few keypoint locations around the caller’s eyes, nose, and mouth.
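A rough back-of-the-envelope sketch in Python shows why a keypoint stream is so small. The keypoint count, the frame rate, and the 16-bit encoding below are all hypothetical choices made for illustration. They are not figures from the paper, and the actual system may transmit a different representation.

```python
import struct

NUM_KEYPOINTS = 20  # hypothetical number of 3D keypoints per frame
FPS = 30            # hypothetical frame rate

def pack_keypoints(keypoints):
    """Serialize (x, y, z) keypoints as little-endian 16-bit floats."""
    flat = [coord for kp in keypoints for coord in kp]
    return struct.pack(f"<{len(flat)}e", *flat)

def unpack_keypoints(payload):
    """Inverse of pack_keypoints: bytes back to a list of (x, y, z) tuples."""
    count = len(payload) // 2  # 2 bytes per 16-bit float
    flat = struct.unpack(f"<{count}e", payload)
    return [tuple(flat[i:i + 3]) for i in range(0, count, 3)]

def bitrate_kbps(bytes_per_frame, fps):
    """Bitrate in kilobits per second for a fixed per-frame payload."""
    return bytes_per_frame * 8 * fps / 1000

# One frame of placeholder keypoints (values chosen to be exact in 16 bits).
frame = [(0.5, -0.25, 0.125)] * NUM_KEYPOINTS
payload = pack_keypoints(frame)
print(len(payload), "bytes per frame")
print(bitrate_kbps(len(payload), FPS), "kbit/s")
```

Under these assumptions each frame costs a fixed number of bytes regardless of image resolution, whereas the size of a pixel-based stream grows with resolution and scene complexity.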

The researchers also use a pre-trained face recognition network and a pre-trained head pose estimator to ensure that the head poses and angles in the generated images are accurate and acceptable.

When evaluated on talking-head synthesis tasks such as video reconstruction, motion transfer, and face redirection, the proposed method outperformed other approaches such as FOMM, few-shot vid2vid (fs-vid2vid), and bi-layer neural avatars (bilayer) on benchmark datasets.

Paper: https://arxiv.org/pdf/2011.15126.pdf

Project page: https://nvlabs.github.io/face-vid2vid/
