Meet GigaGAN: A Large-scale Modified GAN Architecture for Text-to-Image Synthesis

The introduction of popular generative models like ChatGPT and DALL-E has been a major topic of interest over the past few months, especially in the Artificial Intelligence community. These models can perform tasks ranging from answering questions and generating content to producing high-quality images, and they do so using advanced deep-learning methods. For the unaware, DALL-E, developed by OpenAI, is a text-to-image generation model that creates high-quality images from a textual description given as input. Trained on massive datasets of text and images, DALL-E and other text-to-image generation models learn to produce a visual representation of the given text prompt. In addition, Stable Diffusion even allows the generation of a new image from an existing image.

These text-to-image models rely on iterative inference, which allows stable training with simple objectives but makes image generation computationally expensive and slow. Compared to these models, Generative Adversarial Networks (GANs) are more efficient because they generate an image in a single forward pass. A GAN is a deep learning architecture consisting of a generator network that creates samples and a discriminator network that judges whether a sample is real or generated. The goal of a GAN is to produce new data that imitates a known data distribution. However, scaling GANs up has been known to cause instabilities in training. A recent paper explores whether and how GANs can be scaled up while keeping training stable.
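To make the generator-versus-discriminator setup concrete, here is a minimal sketch of the classic GAN objective in its non-saturating form. It is the textbook formulation, not necessarily the exact loss used by GigaGAN, and the function names are illustrative. The discriminator is assumed to output a raw score (logit) per image, where higher means "looks real".

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def discriminator_loss(real_logits: Sequence[float], fake_logits: Sequence[float]) -> float:
    """Binary cross-entropy: real images are labeled 1, generated images 0."""
    # -log(sigmoid(l)) == softplus(-l) and -log(1 - sigmoid(l)) == softplus(l)
    real_term = sum(softplus(-l) for l in real_logits) / len(real_logits)
    fake_term = sum(softplus(l) for l in fake_logits) / len(fake_logits)
    return real_term + fake_term


def generator_loss(fake_logits: Sequence[float]) -> float:
    """Non-saturating loss: the generator is rewarded when its samples score as real."""
    return sum(softplus(-l) for l in fake_logits) / len(fake_logits)


if __name__ == "__main__":
    # Hypothetical discriminator scores for a batch of real and generated images.
    real_scores = [2.0, 1.5, 3.0]
    fake_scores = [-1.0, 0.5, -2.0]
    print("discriminator loss:", discriminator_loss(real_scores, fake_scores))
    print("generator loss:", generator_loss(fake_scores))
```

The two losses pull in opposite directions: the discriminator wants low scores on generated images, while the generator wants high ones. This tug-of-war is one reason GAN training can become unstable as models grow.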

A team of researchers has developed GigaGAN, a new GAN architecture that goes well beyond the limits of the earlier StyleGAN architecture. GigaGAN is a one-billion-parameter GAN that showed stable and scalable training on large-scale datasets such as LAION2B-en. GigaGAN is also extremely fast: it produces a 512px image in just 0.13 seconds and can synthesize high-resolution images, such as 16-megapixel (4096px) images, in just 3.66 seconds. The two main components of GigaGAN’s architecture are as follows:

  1. GigaGAN generator – It includes a text encoding branch, a style mapping network, and a multi-scale synthesis network, which is augmented by stable attention and adaptive kernel selection.
  2. GigaGAN discriminator – It includes two branches, one for processing the image and one for the text conditioning. The text branch processes the text in a similar way to the generator, and the image branch receives an image pyramid and makes independent predictions for each image scale.
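The image-pyramid idea in the discriminator can be illustrated with a small sketch. This is a toy illustration, not the paper's implementation. It assumes a grayscale image stored as nested lists, a pyramid built by repeated 2x average pooling, and a stand-in `mean_intensity` scorer in place of the learned per-scale prediction heads. All names are hypothetical.

```python
from typing import Callable, List

Image = List[List[float]]  # grayscale image as rows of pixel values


def downsample_2x(image: Image) -> Image:
    """Halve height and width by averaging each 2x2 block of pixels."""
    height, width = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(width)
        ]
        for r in range(height)
    ]


def build_pyramid(image: Image, levels: int) -> List[Image]:
    """Return the image at full resolution followed by progressively smaller copies."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid


def predict_per_scale(pyramid: List[Image], scorer: Callable[[Image], float]) -> List[float]:
    """Score every pyramid level independently, one prediction per scale."""
    return [scorer(level) for level in pyramid]


def mean_intensity(image: Image) -> float:
    """Placeholder scorer; a real discriminator would use a learned network here."""
    return sum(sum(row) for row in image) / (len(image) * len(image[0]))


if __name__ == "__main__":
    image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    pyramid = build_pyramid(image, levels=3)
    print([f"{len(p)}x{len(p[0])}" for p in pyramid])
    print(predict_per_scale(pyramid, mean_intensity))
```

Scoring each scale separately lets the discriminator judge both coarse layout (small levels) and fine detail (large levels) instead of relying on a single full-resolution verdict.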

GigaGAN also supports a number of latent space editing applications, such as latent interpolation, style mixing, and vector arithmetic operations. Compared to Stable Diffusion v1.5, DALL·E 2, and Parti-750M, GigaGAN achieves a lower Fréchet Inception Distance (FID), a metric that evaluates the quality of images created by a generative model by measuring the distance between the feature distributions of generated and real images. A lower score means the two sets of images are more similar.
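For intuition about what FID measures, here is a simplified sketch. The real metric extracts features with an Inception-v3 network and compares two Gaussians with full covariance matrices. The version below assumes the features are already given as plain lists and treats each feature dimension independently (diagonal covariances), so it is an approximation for illustration only. The function names are hypothetical.

```python
import math
from statistics import fmean, variance
from typing import List, Sequence, Tuple


def column_stats(features: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and sample variance of a set of feature vectors."""
    columns = list(zip(*features))
    return [fmean(col) for col in columns], [variance(col) for col in columns]


def frechet_distance_diagonal(
    real_feats: Sequence[Sequence[float]],
    fake_feats: Sequence[Sequence[float]],
) -> float:
    """Frechet distance between two Gaussians, ASSUMING diagonal covariances.

    Full FID uses ||mu_r - mu_f||^2 + Tr(S_r + S_f - 2 (S_r S_f)^(1/2));
    with diagonal covariances the trace term reduces to a per-dimension sum.
    """
    mu_r, var_r = column_stats(real_feats)
    mu_f, var_f = column_stats(fake_feats)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_f))
    cov_term = sum(vr + vf - 2.0 * math.sqrt(vr * vf) for vr, vf in zip(var_r, var_f))
    return mean_term + cov_term


if __name__ == "__main__":
    # Made-up 2-D "features"; real FID features come from an Inception-v3 layer.
    real = [[0.0, 0.0], [2.0, 2.0]]
    shifted = [[1.0, 0.0], [3.0, 2.0]]
    print(frechet_distance_diagonal(real, real))
    print(frechet_distance_diagonal(real, shifted))
```

Identical feature sets give a distance of zero, and the score grows as the means or spreads of the two sets drift apart, which is why a lower FID indicates generated images that are statistically closer to real ones.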

With a disentangled, continuous, and controllable latent space, GigaGAN is a viable option for text-to-image synthesis and offers significant advantages over other generative models.
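The latent interpolation and vector arithmetic mentioned earlier are simple operations on latent vectors. The sketch below shows the general idea with plain Python lists. It does not call GigaGAN itself, and the latent codes and the "direction" vector are made-up values for illustration.

```python
from typing import List, Sequence


def lerp(z_start: Sequence[float], z_end: Sequence[float], t: float) -> List[float]:
    """Linear interpolation between two latent vectors (t=0 gives start, t=1 gives end)."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)]


def interpolation_path(z_start: Sequence[float], z_end: Sequence[float], steps: int) -> List[List[float]]:
    """Evenly spaced latent codes from z_start to z_end, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(z_start, z_end, i / (steps - 1)) for i in range(steps)]


def apply_direction(z: Sequence[float], direction: Sequence[float], strength: float) -> List[float]:
    """Vector arithmetic: move a latent code along an attribute direction."""
    return [a + strength * d for a, d in zip(z, direction)]


if __name__ == "__main__":
    z_a, z_b = [0.0, 2.0], [4.0, 6.0]  # hypothetical latent codes
    for z in interpolation_path(z_a, z_b, steps=3):
        print(z)  # each code would be passed to the generator to render a frame
    print(apply_direction([1.0, 1.0], [0.5, -0.5], strength=2.0))
```

Because a GAN maps each latent code to an image in one pass, a smooth and well-behaved latent space means that walking along such a path yields a smooth transition between images.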

Check out the Paper and Github. All credit for this research goes to the researchers on this project.

Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.
