Meet StyleAvatar3D: A New AI Method for Generating Stylized 3D Avatars Using Image-Text Diffusion Models and a GAN-based 3D Generation Network

Since the advent of large-scale image-text datasets and sophisticated generative architectures such as diffusion models, generative models have made tremendous progress in producing high-fidelity 2D images. These models reduce manual effort by letting users create realistic visuals from text prompts. 3D generative models, however, still face significant challenges because 3D training data is far less diverse and accessible than its 2D counterpart. The availability of high-quality 3D models is limited by the laborious, highly specialized manual creation of 3D assets in software engines.

Researchers have recently explored pre-trained image-text generative models for creating high-fidelity 3D models to address this issue. These models contain detailed priors on object geometry and appearance, which may make it easier to create realistic and varied 3D models. In this study, researchers from Tencent, Nanyang Technological University, Fudan University, and Zhejiang University present a novel method for creating stylized 3D avatars. It uses pre-trained text-to-image diffusion models and lets users choose an avatar’s style and facial attributes via text prompts. They adopt EG3D, a GAN-based 3D generation network, because it offers several benefits.

First, EG3D is trained on calibrated images rather than 3D data, so the variety and realism of its 3D models can keep improving as better image data becomes available, and better image data is comparatively easy to obtain in 2D. Second, because the training images do not need strict multi-view consistency in appearance, each view can be generated independently, which helps control the randomness of image generation. To create calibrated 2D images for training EG3D, their method uses ControlNet built on Stable Diffusion, which allows image generation guided by predefined poses.

These poses can be synthesized or extracted from avatars in existing engines, and the camera parameters of the pose images can be reused for training. Even with accurate pose images as guidance, ControlNet often struggles to generate views at large angles, such as the back of the head. These failed outputs hinder the generation of complete 3D models. The authors address the problem in two ways. First, they design view-specific prompts for different views during image generation, which dramatically reduces failure cases. Even with view-specific prompts, however, the synthesized images may only partially match the pose images.
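The idea of view-specific prompts can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `view_prompt`, the angle thresholds, and the hint wording are hypothetical and are not taken from the paper or its code.

```python
def view_prompt(base_prompt: str, yaw_deg: float) -> str:
    """Append a view hint chosen from the camera yaw (0 = front, 180 = back).

    Hypothetical helper: thresholds and hint strings are illustrative only.
    """
    # Fold any yaw into [0, 180] so left and right views are treated alike.
    yaw = abs((yaw_deg + 180.0) % 360.0 - 180.0)
    if yaw < 45.0:
        hint = "front view"
    elif yaw < 135.0:
        hint = "side view"
    else:
        hint = "back view, back of the head"
    return f"{base_prompt}, {hint}"
```

A prompt built this way would then be passed, together with the pose image for the same camera, to the pose-guided image generator, so that the text and the pose guidance agree on which side of the head is visible.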

To handle this mismatch, they design a coarse-to-fine discriminator for 3D GAN training. Each training image in their system has both a coarse and a fine pose annotation, and one of the two is selected at random during GAN training. For confident views such as the front of the face, the fine pose annotation is chosen with high probability, while learning for the remaining views relies more heavily on coarse annotations. This approach can produce more accurate and varied 3D models even when the input images have noisy annotations. Additionally, they build a latent diffusion model in the latent style space of StyleGAN to enable conditional 3D generation from an image input.
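The random choice between coarse and fine pose annotations described above might look roughly like the following sketch. The linear probability schedule, the endpoint values `p_front` and `p_back`, and the dictionary keys are all hypothetical choices for illustration; the article does not state how the authors set these probabilities.

```python
import random


def fine_probability(yaw_deg: float, p_front: float = 0.9, p_back: float = 0.1) -> float:
    """Chance of using the fine annotation, interpolated from front to back.

    Assumed schedule: linear in the folded yaw angle (0 = front, 180 = back).
    """
    yaw = abs((yaw_deg + 180.0) % 360.0 - 180.0)
    t = yaw / 180.0
    return (1.0 - t) * p_front + t * p_back


def pick_annotation(sample: dict, rng: random.Random):
    """Return the fine or coarse pose annotation for one training image.

    `sample` is a hypothetical record with keys "yaw_deg", "fine_pose",
    and "coarse_pose".
    """
    p = fine_probability(sample["yaw_deg"])
    return sample["fine_pose"] if rng.random() < p else sample["coarse_pose"]
```

With a schedule like this, front-facing images mostly condition the discriminator on the precise pose, while images of the back of the head mostly fall back to the coarse annotation, which tolerates a mismatch between the generated image and its pose image.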

The diffusion model can be trained quickly because the style code is low-dimensional, highly expressive, and compact. To train the diffusion model, they sample pairs of images and style codes directly from their trained 3D generators. They ran comprehensive experiments on several large-scale datasets to assess the effectiveness of the proposed approach. According to their results, the method outperforms current state-of-the-art techniques in visual quality and diversity. In conclusion, this research introduces a novel method that uses pre-trained image-text diffusion models to produce high-fidelity 3D avatars.
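To make the diffusion model over style codes more concrete, here is a minimal sketch of the forward (noising) step applied to a style code sampled from a generator. It assumes a standard DDPM-style formulation with a linear beta schedule, which the article does not confirm the authors use; the function names and schedule values are hypothetical.

```python
import math
import random


def alpha_bar(t: int, num_steps: int = 1000,
              beta_start: float = 1e-4, beta_end: float = 0.02) -> float:
    """Cumulative product of (1 - beta) over the first t steps (linear schedule)."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def noise_style_code(w0, t: int, rng: random.Random):
    """Return (w_t, eps) with w_t = sqrt(a) * w0 + sqrt(1 - a) * eps.

    `w0` is a style code given as a list of floats; a denoiser would be
    trained to predict `eps` from `w_t`, `t`, and the conditioning image.
    """
    a = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in w0]
    wt = [math.sqrt(a) * w + math.sqrt(1.0 - a) * e for w, e in zip(w0, eps)]
    return wt, eps
```

Because the style code is a short vector rather than an image, each training step only has to noise and denoise a small number of values, which is consistent with the article's point that this diffusion model trains quickly.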

Their framework considerably increases the flexibility of avatar generation by letting text prompts determine styles and facial attributes. To address image-pose misalignment, they also propose a coarse-to-fine pose-aware discriminator, which makes better use of image data with inaccurate pose annotations. Finally, they develop an additional conditional generation module that enables conditional 3D generation from an image input in the latent style space. This module further increases the framework’s adaptability and lets users create 3D models customized to their preferences. They also plan to open-source their code.



Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.
