NTU Researchers Propose ‘AvatarCLIP’: A Novel Zero-Shot Text-Driven 3D Avatar Generation And Animation Pipeline

The creation of 3D digital avatars is crucial for several industries, ranging from movies to videogames. However, the whole production process is often affordable only by big companies since it requires the use of professional software, specialists that are familiar with them, and large amounts of working hours. For this reason, a group of researchers at the Nanyang Technological University in Singapore have proposed AvatarCLIP: a framework that generates animated 3D avatars only based on natural language descriptions. Figure 1 shows how AvatarCLIP works by considering descriptions of the shape (in blue), appearance (in orange), and motion (in red) of the avatar.

Figure 2 shows how the pipeline of AvatarCLIP is divided into two stages: static avatar generation and motion generation. The first stage consists of creating an avatar based on the natural language descriptions of its shape and appearance. Then, such an avatar is animated during the second stage based on the motion description.

Static Avatar Generation

The static avatar generation stage includes three steps: coarse shape generation, shape sculpting, and texture generation, and making the static avatar animatable.

The goal of the first step is to generate a coarse shape of the avatar based on the description of the avatar shape provided through natural language. First of all, the authors considered SMPL as the source of naked human body shapes. Indeed, SMPL is a realistic 3D model of the human body driven by thousands of 3D body scans. Based on these shapes, a Variational Autoencoder (VAE) is trained to construct a code-book of body shapes. Then, the CLIP model is used to get the best matching body shape by considering the features of the body shapes in the code-book and the shape description expressed in natural language. CLIP is a pre-trained model for the text-driven generation of images. For this reason, it is used to find the body shape closest to the natural language description.

In the second step, the body shape and its texture must be further enhanced to match the natural language description of the avatar’s appearance. For this step, the authors relied on NeuS, a volume rendering method that reconstructs photo-realistic objects from 2D image inputs. To use NeuS, the authors first generated 2D multi-view renderings of the body shape obtained in the first step. The multi-view renderings are then provided to a NeuS model that is then used to initialize a 3D avatar. Moreover, the NeuS model is optimized through CLIP so that the texture and geometry of the 3D avatar are guided by its appearance description. Finally, to avoid the generation of textures at wrong body parts, the optimization process is further optimized through prior human knowledge. The appearance description is augmented so that textures are generated by focusing on different parts of the body shape. For instance, as shown in Figure 7, the appearance description “Steve Jobs” is augmented with two additional descriptions “the face of Steve Jobs” and “the back of Steve Jobs”.

Immagine che contiene testo, uomo, screenshot

Descrizione generata automaticamente
Source: https://arxiv.org/pdf/2205.08535v1.pdf

The third step of the Static Avatar Generation stage consists of making the 3D avatar animatable. The authors relied on the marching cube algorithms to extract the mesh from the generated avatar to achieve this goal.

Motion Generation

The authors proposed a two-step motion generation process: candidate pose generation guided by CLIP and motion sequences generation using motion priors with candidate pose sas references.

Unfortunately, CLIPS cannot be directly used for the first step to generate motion sequences from natural language descriptions. However, it could be exploited to value the similarity between a rendered human pose and a description. Indeed, the authors first created a code-book of human poses thanks to the AMASS dataset. Then, they relied on CLIP to find in the code-book a set of candidate poses closest to the motion description expressed through natural language.

To generate the motion sequence that matches the motion description, the authors used a pre-trained motion VAE. The latent code of the VAE is optimized through the candidate poses. In this way, the VAE will generate a motion sequence close enough to the candidate poses and, hence, to the motion description.

This Article is written as a paper summary article by Marktechpost Research Staff based on the paper 'AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars'. All Credit For This Research Goes To Researchers on This Project. Checkout the paper and github.

Please Don't Forget To Join Our ML Subreddit

Luca is Ph.D. student at the Department of Computer Science of the University of Milan. His interests are Machine Learning, Data Analysis, IoT, Mobile Programming, and Indoor Positioning. His research currently focuses on Pervasive Computing, Context-awareness, Explainable AI, and Human Activity Recognition in smart environments.