Transforming the Future of Artificial Intelligence (AI) and Image Synthesis with Classifier-free-guided Deep Variational Auto-Encoders

Deep generative modeling has emerged in recent years as a powerful approach for generating high-quality images. In particular, advances in diffusion and autoregressive models have enabled the generation of photorealistic images conditioned on a text prompt. Despite their remarkable performance, these models share a significant limitation: slow sampling. Generating a single image requires evaluating a large neural network 50 to 1,000 times, because each step of the generative process reapplies the same network. This cost matters in real-world scenarios and can hinder the widespread application of these models.

One popular technique in this field is the deep variational autoencoder (VAE), which combines deep neural networks with probabilistic modeling to learn latent representations of data. These representations can then be used to generate new images that resemble the original data while exhibiting novel variations. Deep VAEs have enabled considerable progress in image generation.

However, hierarchical VAEs have yet to produce high-quality images on large, diverse datasets. This is surprising, since their hierarchical generation process appears well suited to image generation. Autoregressive models, in contrast, have been more successful, even though their inductive bias amounts to generating images in simple raster-scan order. The authors of the paper discussed in this article therefore examined the factors behind the success of autoregressive models and transferred them to VAEs.

For instance, a key to the success of autoregressive models lies in training on a sequence of compressed image tokens rather than on raw pixel values. This lets them concentrate on learning the relationships between image semantics while disregarding imperceptible details. By the same reasoning, existing pixel-space hierarchical VAEs, like pixel-space autoregressive models, may focus primarily on fine-grained features, which limits their ability to capture the underlying composition of image concepts.

Based on these considerations, the work trains deep VAEs in the latent space of a deterministic autoencoder (DAE).

The approach comprises two stages: first, a DAE is trained to reconstruct images from low-dimensional latents; then, a VAE is trained as a generative model over these latents.
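The two-stage structure can be illustrated with a deliberately simplified sketch. Everything in it is a stand-in chosen for illustration and is not taken from the paper: block averaging plays the role of a trained DAE encoder, value repetition plays the role of the DAE decoder, and a per-dimension Gaussian replaces the hierarchical VAE. Only the flow of data (encode, fit a generative model on the latents, sample a latent, decode) mirrors the description above.

```python
import random
import statistics

def encode(image, factor=4):
    """Hypothetical stand-in for a trained DAE encoder: average each block of pixels."""
    return [sum(image[i:i + factor]) / factor for i in range(0, len(image), factor)]

def decode(latent, factor=4):
    """Hypothetical stand-in for a trained DAE decoder: repeat each latent value."""
    return [z for z in latent for _ in range(factor)]

# Stage 1 (assumed already trained): map toy 16-value "images" to 4-value latents.
random.seed(0)
images = [[random.random() for _ in range(16)] for _ in range(100)]
latents = [encode(img) for img in images]

# Stage 2: fit a generative model on the latents only, never on the pixels.
# A per-dimension Gaussian stands in for the hierarchical VAE (an assumption).
dims = list(zip(*latents))
means = [statistics.mean(d) for d in dims]
stds = [statistics.stdev(d) for d in dims]

def sample_image():
    """Sample a latent from the stage-2 model, then decode it with the stage-1 decoder."""
    z = [random.gauss(m, s) for m, s in zip(means, stds)]
    return decode(z)

new_image = sample_image()
```

The point of the split is that the stage-2 model only ever sees the short latent vectors, so its capacity goes to modeling their structure rather than pixel-level detail.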

Training the VAE on low-dimensional latents instead of pixel space brings two critical benefits: a more robust training process and a lighter one. First, the compressed latent code is much smaller than the RGB representation of the image, yet it preserves almost all of the image’s perceptual information. A shorter code is advantageous because it emphasizes global features, which account for only a few bits. Moreover, because imperceptible details are discarded, the VAE can concentrate entirely on image structure. Second, the reduced dimensionality of the latent variable lowers computational cost and enables training larger models with the same resources.
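To get a feel for the size difference, the following back-of-the-envelope calculation compares the number of values in an RGB image with the number in a compressed latent. The shapes used (a 256x256 RGB image and a 32x32 latent with 4 channels) are hypothetical and chosen only for illustration; the article does not state the dimensions used in the paper.

```python
def num_values(height, width, channels):
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels

# Hypothetical shapes, not taken from the paper.
pixel_dims = num_values(256, 256, 3)   # RGB image
latent_dims = num_values(32, 32, 4)    # compressed latent code

ratio = pixel_dims / latent_dims
print(f"pixels: {pixel_dims}, latents: {latent_dims}, ratio: {ratio:.0f}x")
```

Under these assumed shapes the generative model operates on 48 times fewer values per image, which is where the savings in compute come from.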

Furthermore, large-scale diffusion and autoregressive models use classifier-free guidance to enhance image fidelity. The technique trades off diversity against sample quality, since poor likelihood-based models tend to generate samples that do not align with the data distribution. By comparing conditional and unconditional likelihood functions, the guidance mechanism steers samples toward regions that more closely match a desired label. For this reason, the authors extend the classifier-free guidance concept to deep VAEs.
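The general idea of classifier-free guidance can be sketched on a small discrete distribution. The sketch below uses the common log-space form, guided = unconditional + scale * (conditional - unconditional), followed by renormalization. This is an assumption made for illustration: the article does not give the exact formulation the authors use for deep VAEs, and conventions for the guidance scale vary between papers. The function name and the example probabilities are hypothetical.

```python
import math

def guided_log_probs(cond_logp, uncond_logp, scale):
    """Combine conditional and unconditional log-probabilities, then renormalize.

    scale = 0 recovers the unconditional distribution, scale = 1 the conditional
    one, and scale > 1 pushes further toward outcomes the condition favors.
    """
    raw = [u + scale * (c - u) for c, u in zip(cond_logp, uncond_logp)]
    log_norm = math.log(sum(math.exp(r) for r in raw))
    return [r - log_norm for r in raw]

# Hypothetical distributions over three outcomes.
cond = [math.log(p) for p in (0.6, 0.3, 0.1)]
uncond = [math.log(p) for p in (0.4, 0.4, 0.2)]

for scale in (0.0, 1.0, 3.0):
    probs = [math.exp(lp) for lp in guided_log_probs(cond, uncond, scale)]
    print(scale, [round(p, 3) for p in probs])
```

With a scale above 1, probability mass concentrates on the outcome that the condition makes more likely than the unconditional model does, which reduces diversity in exchange for closer agreement with the condition.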

The comparison of the results between the proposed method and state-of-the-art approaches is depicted below.

This was a summary of a novel lightweight deep VAE architecture for image generation.

If you are interested or want to learn more about this framework, you can find a link to the paper and the project page.


Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.
