Images, such as graphics, paintings, and photographs, can often be described in language, yet creating them may demand specialized skills and hours of effort. As a result, a technology capable of creating realistic images from natural language could enable humans to produce rich and diverse visual material with previously unimaginable ease. The capacity to edit images with natural language enables iterative refinement and fine-grained control, both essential for real-world applications.
DALL-E, a 12-billion-parameter version of OpenAI’s GPT-3 transformer language model designed to generate photorealistic images from text captions, was unveiled in January. DALL-E’s impressive performance made it an instant hit in the AI community and earned it broad mainstream media coverage. Last month, NVIDIA unveiled the GAN-based GauGAN2 – a name inspired by French Post-Impressionist painter Paul Gauguin, much as DALL-E was inspired by Surrealist artist Salvador Dalí.
Not to be outdone, OpenAI researchers have unveiled GLIDE (Guided Language-to-Image Diffusion for Generation and Editing), a diffusion model that achieves performance comparable to DALL-E while using roughly one-third of the parameters.
While most images can be described in words, producing them from scratch requires specific skills and many hours of work. Enabling an AI agent to automatically generate photorealistic images from natural language prompts gives people unparalleled ease in creating rich and diverse visual material, and allows for simpler iterative refinement and fine-grained control of the generated images.
Recent research has demonstrated that likelihood-based diffusion models can generate high-quality synthetic images, especially when paired with a guidance strategy that trades off diversity for fidelity. OpenAI previously released a guided diffusion model, which allows diffusion models to be conditioned on the labels of a classifier.
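The classifier-guidance idea mentioned above boils down to a small update rule: at each denoising step, the predicted mean is nudged in the direction that makes the classifier more confident in the desired label. The sketch below is a minimal NumPy illustration of that rule; the function name, and the idea of passing in a precomputed classifier gradient, are assumptions for illustration rather than OpenAI's actual implementation.

```python
import numpy as np

def classifier_guided_mean(mu, sigma, grad_log_prob, scale=1.0):
    """Hypothetical sketch of a classifier-guided denoising step.

    mu            : predicted mean of the reverse-diffusion step
    sigma         : per-dimension variance of that step
    grad_log_prob : gradient of log p(y | x_t) w.r.t. the noisy image,
                    as produced by a separately trained classifier
    scale         : guidance scale; larger values trade diversity for fidelity
    """
    return mu + scale * sigma * grad_log_prob

# Toy example: the gradient nudges the mean wherever the classifier
# "wants" the image to change.
mu = np.zeros(4)
sigma = np.full(4, 0.5)
grad = np.array([1.0, -1.0, 0.0, 2.0])
guided = classifier_guided_mean(mu, sigma, grad, scale=2.0)
# guided == [1.0, -1.0, 0.0, 2.0]
```

With `scale=0` the update vanishes and plain unguided sampling is recovered, which is why the scale acts as a diversity/fidelity dial.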
GLIDE advances this work by applying guided diffusion to the problem of text-conditional image generation. After training the GLIDE diffusion model, the researchers investigated two distinct guidance techniques that employ a text encoder to condition on natural language descriptions:
1) CLIP guidance
2) Classifier-free guidance
CLIP is a scalable method for learning joint representations of text and images that produces a score reflecting how closely an image matches a caption. The team applied this strategy to their diffusion models by replacing the classifier with a CLIP model that “guides” them.
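The image–caption score CLIP produces is, at its core, a cosine similarity between an image embedding and a text embedding. A minimal sketch of that scoring step is below; the embeddings here are hand-made stand-ins, whereas real CLIP encoders would compute them from pixels and tokens.

```python
import numpy as np

def clip_style_score(image_emb, text_emb):
    # Cosine similarity between L2-normalised embeddings:
    # higher means the image matches the caption more closely.
    a = image_emb / np.linalg.norm(image_emb)
    b = text_emb / np.linalg.norm(text_emb)
    return float(a @ b)

# Stand-in embeddings (a real encoder would produce these).
img = np.array([0.2, 0.9, 0.1])
cap = np.array([0.1, 1.0, 0.0])
score = clip_style_score(img, cap)  # close to 1.0 for a good match
```

During CLIP-guided sampling, the gradient of this score with respect to the noisy image plays the role the classifier gradient plays in classifier guidance.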
Classifier-free guidance, meanwhile, is a method of guiding diffusion models that does not require training a separate classifier. It has two enticing properties:
1) Allowing a single model to harness its own knowledge during guidance rather than depending on a separate classification model;
2) Simplifying guidance when conditioning on information that is difficult to predict with a classifier, such as free-form text.
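Both properties above follow from how classifier-free guidance works: the same model is run twice per step, once conditioned on the caption and once unconditioned, and the two noise predictions are combined. A minimal sketch of that combination step (function and variable names are illustrative):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    # Extrapolate away from the unconditional prediction:
    #   eps_hat = eps_uncond + scale * (eps_cond - eps_uncond)
    # scale = 1 recovers the plain conditional model; scale > 1 pushes
    # samples toward the caption at some cost in diversity.
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy noise predictions from the two forward passes of one model.
eps_u = np.array([0.0, 0.0])
eps_c = np.array([1.0, -1.0])
eps_hat = cfg_combine(eps_u, eps_c, scale=3.0)
# eps_hat == [3.0, -3.0]
```

Because both predictions come from one network, no separate classifier is trained, and the model's own knowledge of the caption does the guiding.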
According to the study, human evaluators preferred the classifier-free guidance outputs for both photorealism and caption similarity.
In testing, GLIDE generated high-quality images with realistic shadows, reflections, and textures. The model can also combine multiple concepts (corgis, bow ties, and party hats) while binding attributes, such as colors, to the corresponding objects.
In addition to producing images from text, GLIDE can edit existing images in response to natural language prompts: inserting new objects, adding shadows and reflections, performing image inpainting, and so on. It can also convert basic line drawings into photorealistic images, and it shows strong zero-shot generation and editing capabilities in complex scenarios.
Human assessors favored GLIDE’s output images over DALL-E’s, even though GLIDE is a considerably smaller model: 3.5 billion parameters versus DALL-E’s 12 billion. GLIDE also has lower sampling latency and does not require CLIP reranking.
To guard against malevolent actors who might create convincing misinformation or deepfakes, the team released only a smaller diffusion model and a noised CLIP model trained on filtered datasets. The code and weights for these models can be found on the project’s GitHub page.
The GLIDE research paper can be found here.