OpenAI Introduces DALL·E: A Neural Network That Creates Images From Text Descriptions


OpenAI has recently trained a neural network called DALL·E that creates images from text descriptions for various concepts expressible in natural language. 

GPT-3 showed that a large neural network can be trained to perform a variety of text generation tasks, and Image GPT showed that the same kind of network can also generate images with high fidelity. DALL·E is a 12-billion-parameter version of GPT-3 trained to create images from text descriptions, using a dataset of text–image pairs.
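As a rough illustration of this setup, the sketch below forms a single token stream from a text segment and an image segment. The token ids, padding scheme, and helper function here are invented for illustration; in the actual model, text tokens come from a BPE vocabulary and image tokens from a discrete VAE codebook.

```python
# Hedged sketch: forming the single 1280-token stream DALL·E trains on.
# Token ids below are made up; only the 256/1024 split comes from the article.
TEXT_LEN, IMAGE_LEN = 256, 1024

def build_stream(text_tokens, image_tokens, pad_id=0):
    # Pad (or truncate) the text to 256 positions, then append the
    # 1024 image tokens to form one contiguous stream.
    text = (text_tokens + [pad_id] * TEXT_LEN)[:TEXT_LEN]
    assert len(image_tokens) == IMAGE_LEN
    return text + image_tokens

stream = build_stream([17, 42, 99], list(range(IMAGE_LEN)))
print(len(stream))  # 1280
```

The model is then trained autoregressively over this stream, so generating an image amounts to conditioning on the first 256 positions and sampling the remaining 1024.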

Like GPT-3, DALL·E is a simple decoder-only transformer. It receives both the text and the image as a single stream of 1280 tokens: 256 for the text and 1024 for the image. It is trained using maximum likelihood to generate all of the tokens, one after another. At each of its 64 self-attention layers, an attention mask allows every image token to attend to all text tokens. The model uses the standard causal mask for the text tokens and, depending on the layer, sparse attention for the image tokens with either a row, a column, or a convolutional attention pattern.
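The masking scheme can be sketched in a few lines. The example below is a simplified illustration, not the actual implementation: it builds a causal mask over a toy stream of 4 text and 8 image tokens, then restricts image-to-image attention to a hypothetical row pattern over a 4-wide grid. The real model uses 256 text and 1024 image tokens and alternates row, column, and convolutional patterns across its layers.

```python
import numpy as np

# Toy sizes standing in for the real 256 text + 1024 image tokens.
N_TEXT, N_IMG = 4, 8
n = N_TEXT + N_IMG

# Standard causal mask: each token attends to itself and all earlier
# tokens in the stream. Because text precedes image in the stream,
# this already lets every image token attend to every text token.
mask = np.tril(np.ones((n, n), dtype=bool))

# Row-attention pattern (illustrative): additionally restrict each image
# token so it only attends to image tokens in the same row of the
# (reshaped) image grid, while keeping full access to the text tokens.
ROW = 4  # assumed grid width for this toy example
row_mask = mask.copy()
for i in range(N_TEXT, n):
    for j in range(N_TEXT, i):
        if (i - N_TEXT) // ROW != (j - N_TEXT) // ROW:
            row_mask[i, j] = False

print(row_mask.astype(int))
```

In practice such a boolean mask is applied inside each attention layer by setting disallowed logits to negative infinity before the softmax.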


This training procedure allows DALL·E not only to generate an image from scratch but also to regenerate any rectangular region of an existing image that extends to the bottom-right corner. DALL·E can create plausible images for many sentences that explore the compositional structure of language.

DALL·E has a diverse set of capabilities, such as:

  • Creating anthropomorphized versions of animals and objects, animal chimeras, and emojis.
  • Combining unrelated concepts in plausible ways to synthesize objects, some of which are unlikely to exist in the real world.
  • Rendering text.
  • Applying transformations to existing images.

Controlling attributes and drawing multiple objects

DALL·E can control an object’s attributes and the number of times it appears. Controlling objects, their features, and their spatial relationships simultaneously is challenging. Consider, for example, the phrase “a hedgehog wearing a red hat, yellow gloves, blue shirt, and green pants.” To interpret this expression correctly, DALL·E must compose each piece of apparel with the animal and form the associations (hat, red), (gloves, yellow), (shirt, blue), and (pants, green) without confusing them.

This task is called variable binding, and the team has tested DALL·E’s ability to perform it for relative positioning, stacking objects, and controlling multiple attributes. However, DALL·E’s performance also depends on how the caption is phrased. As more objects are introduced, DALL·E is prone to confusing the associations between objects and their colors, and the success rate decreases.

Visualizing perspective and three-dimensionality

DALL·E also allows for control over a scene’s viewpoint and the 3D style in which it is rendered. By prompting DALL·E to repeatedly draw the head of a well-known figure at each angle in a sequence of equally spaced angles, the researchers were able to recover a smooth animation of the rotating head.

Visualizing the internal and external structure

DALL·E can render internal structure with cross-sectional views and external structure with macro photographs. This was observed using samples in the “extreme close-up view” and “x-ray” styles.

Inferring contextual details

Translating text to images is underspecified: a single caption can correspond to many possible images, so the image is not uniquely determined. The team studied this underspecification for DALL·E in three cases: 

  • Changing style, setting, and time.
  • Rendering the same object in several different situations.
  • Creating an image of an object with specific text written on it.

Unlike a 3D rendering engine (in which inputs must be defined unambiguously and in detail), DALL·E can essentially “fill in the blanks” when the description does not explicitly state a specific feature that the image should contain.

Zero-Shot Reasoning

GPT-3 has zero-shot reasoning capability, meaning it can perform many tasks from a description alone, without any additional training. For example, when GPT-3 is prompted to translate the phrase “a person walking his dog in the park” into French, it responds with “un homme qui promène son chien dans le parc.”

The team finds that DALL·E extends this skill to the visual domain: although no modifications were made to the neural network, it can perform several image-to-image translation tasks when prompted correctly.

Additionally, DALL·E has learned about geographic facts, landmarks, and neighborhoods, though its knowledge of these concepts is precise in some ways and flawed in others.

Work on generative models has the potential for significant, broad societal impact. The team plans to analyze how models like DALL·E relate to societal issues such as the economic impact on certain work processes and professions, the potential for bias in model outputs, and the longer-term ethical challenges this technology implies.



