Bytedance Researchers Propose CLIP-GEN: A New Self-Supervised Deep Learning Generative Approach Based On CLIP And VQ-GAN To Generate Reliable Samples From Text Prompts

This Article Is Based On The Research Paper 'CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP'. All Credit For This Research Goes To The Researchers Of This Paper šŸ‘šŸ‘šŸ‘

Please Don't Forget To Join Our ML Subreddit

Synthesizing images from text has been a challenging topic in recent years. Early work, usually based on a convolutional generator that produces images directly from the given text, has shown promising results when working with limited domains; but, when extending the approach to the general domain, these methods have performed too poorly in terms of quality and image-test matching.

Recently, transformers have replaced convolution in text-image generation, and work such as OpenAI’s DALL-E has achieved significant improvements, mainly due to the introduction of VQ-GAN’s discretized representation and increased model size. However, a huge limitation is the number of images they need for training, which is about hundreds of millions of high-quality paired text-image data.

A solution to this problem has been introduced (again by OpenAI) with CLIP (Contrastive Language-Image Pre-training), a cross-modality language vision to predict the relatedness between a text prompt and an image. From this, various methods have attempted to search the image space based on a text query by optimizing the text-image matching score of a pre-trained CLIP model. But currently, the results generated by these approaches are of low quality or limited to a specific domain.

For this reason, the research group introduced CLIP-GEN, a self-supervised scheme based on VQ-GAN, for general text-image generation with language-image priors extracted from a pre-trained CLIP model.

CLIP and VQ-GAN background

CLIP is a model trained to map language-image pairs to a joint embedding space. In other words, given a text prompt and an image, CLIP returns the relatedness between the two entities. The power of CLIP is that its knowledge can extend outside of the data it was trained on; thus, it is a zero-shot predictor. CLIP was trained on 400 million text-image pairs following this scheme: images pass through one encoder that returns the first embedding, while text messages pass through another encoder that produces a second embedding. The InfoNCE loss function is computed over the different embeddings to ensure that semantically relevant data is close to each other in the common embedding space. 

VQ-GAN enables an image to be described by discretized tokens. More specifically, in the first phase, a decoder, an encoder, and a discriminator are trained to learn a codebook (used to convert the latent vector produced by the encoder to a quantized vector). In the second phase, a transformer is trained to predict the next token in the quantized vector in order to be able to generate images during inference.

CLIP-GEN architecture


In general, CLIP-GEN first extracts the cross-modality embedding of the image using the pre-trained CLIP model (in this work, the ViT-B/32 variant of CLIP was used). In parallel, the image is also converted into a sequence of discrete tokens in the VQ-GAN codebook space. Finally, an autoregressive transformer that predicts the image tokens based on the CLIP embedding is trained. During inference, a prompt text is given to CLIP (it is possible to use text as input as texts and images share the same latent space in CLIP), and the resulting embedding is passed to the transformer, which is able to generate consistent image tokens. The generated image tokens can then be reconstructed into an image with the VQ-GAN decoder. This whole process is resumed in the figure above.

More specifically, the training phase is divided into two phases. In the first phase (figure below – (a)), VQ-GAN is trained on the image dataset in a self-supervised manner. In this case, the loss function is an ensemble of a reconstruction loss (which controls the similarity between the real and the generated image) and the typical adversarial loss. In the second phase (figure below – (b)), a sum of two losses was used: the first is meant to maximize the likelihood between the image quantized tokens and the CLIP embeddings, while the second is a reconstruction loss between the CLIP embedding of the input image and the CLIP embedding of the generated image.


Results and Conclusion

CLIP-GEN was trained and evaluated with two of the most popular existing dataset, ImageNet, and MS-COCO, and compared with DF-GAN, CogView, and VQGAN+CLIP. The result of this comparison on the MS-COC dataset is shown below. 

Image generation from text is one of the most exciting topics in generative methods, and this paper presented a straightforward yet efficient solution to produce reliable images without the need for a labeled dataset.Ā 


Leonardo Tanzi is currently a Ph.D. Student at the Polytechnic University of Turin, Italy. His current research focuses on human-machine methodologies for smart support during complex interventions in the medical domain, using Deep Learning and Augmented Reality for 3D assistance.