Meet GLIGEN: An AI Approach that Extends the Functionality of Existing Pre-Trained Text-to-Image Diffusion Models by Enabling Conditioning on Grounding Inputs

Since millions of image-text pairings have been used to train diffusion models, it only makes sense to ask if they can add additional conditional input modalities to the models that have already been pretrained. Similar to the recognition literature, by using pretrained models to gain further control over current text-to-image generation models, they may improve performance on other generation tasks due to the extensive concept knowledge they possess. With the objectives above in mind, they provide a technique for giving fresh, grounded conditional inputs to trained text-to-image diffusion models. As seen in Figure 1, they continue to accept the text caption as input while enabling other input modalities, including grounding part key points, grounding reference pictures, and grounding idea bounding boxes.

Figure1: By feeding various grounding conditions to a frozen text-to-image generation model, GLIGEN allows flexible grounding capabilities. Text entity + box, image entity + box, image style and text + box, and text entity + key points are all supported by GLIGEN. Each scenario’s produced examples are displayed in the top-left, top-right, bottom-left, and bottom-right positions, respectively.

The method that information may be conveyed is constrained by the current input, which is natural language alone. For instance, it is challenging to convey an object’s exact location using text, but bounding boxes and key points make this possible, as seen in Figure 1. There are conditional diffusion models and GANs for inpainting, layout2img creation, etc., that accept inputs other than text, but they seldom combine such inputs for controlling text2img production. Furthermore, previous generative models are often trained individually on each task-specific dataset, regardless of the generative model family. In contrast, the traditional approach in the recognition area has been to develop a task-specific recognition model from a foundation model that has been pretrained on a huge amount of picture data or image-text pairings.

👉 Read our latest Newsletter: Microsoft’s FLAME for spreadsheets; Dreamix creates and edit video from image and text prompts......

Can they build upon already-pretrained diffusion models and provide them with new conditional input modalities, given that they have been trained on billions of image-text pairs? Due to the extensive concept information that the pretrained models possess, and similarly to the recognition literature, they can improve performance on other generation tasks while gaining more controllability over current text-to-image generation models. They provide a technique for giving fresh, grounded conditional inputs to trained text-to-image diffusion models to accomplish the abovementioned goals. As seen in Figure 1, in addition to enabling alternative input modalities like grounding part key points, grounding reference pictures, and grounding bounding boxes for dropping ideas, they also keep the text caption as input.

The main problem is learning to incorporate new grounding information while maintaining the original wide concept knowledge in the pretrained model. They suggest freezing the old model weights and adding fresh trainable gated Transformer layers that use the new grounding input to avoid knowledge forgetting (e.g., bounding box). Using a controlled approach, they progressively incorporate the new grounding data into the pretrained model during training. It is possible to generate generation results that accurately reflect the grounding conditions while having high image quality by using the full model (all layers) in the first half of the sampling steps and only using the original layers (without the gated Transformer layers) in the latter half. This architecture offers flexibility in the sampling procedure during generation for better quality and controllability.

They focus on grounded text2img generation utilizing bounding boxes in their investigations because of the recent scaling success of learning grounded language-image understanding models with boxes in GLIP. They feed the encoded tokens into the newly added layers with their encoded position information using the same pre-trained text encoder (for encoding the caption) to encode each phrase associated with each grounded item (i.e., one phrase per bounding box). This allows their model to ground open-world vocabulary concepts. They discover that their model can generalize to unknown objects because of the common word space even when trained on the COCO dataset. Its generalization on LVIS significantly beats a robust fully-supervised baseline. Following GLIP, they combine object detection and grounding data formats for training to enhance their model’s grounding capability further. These forms have complementary advantages in that grounding data has a wider vocabulary while detection data is more plentiful.

The generalization of their model is continuously enhanced with bigger training data. Contributions. 1) They provide a novel text2img generating technique that gives text2img diffusion models increased grounding controllability. 2) Their model produces open-world grounded text2img with bounding box inputs by retaining the pretrained weights and learning to progressively incorporate the new localization layers or synthesizing newly localized ideas that were not seen during training. 3) By greatly outperforming the previous state-of-the-art layout2img tasks, their model’s zero-shot performance demonstrates the value of using large pretrained generative models for downstream tasks.


Check out the Paper and Project. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our Reddit PageDiscord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology(IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing and is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.