How Do DALL·E 2, Stable Diffusion, and Midjourney Work?

Over the last few years, many advancements have been made in Artificial Intelligence (AI), and one of the new additions to AI is AI Image Generator. It is a tool capable of converting an input statement into a picture or painting. There are many options for text-to-image AI tools, but the ones that stand out are DALLE 2, Stable Diffusion, and Midjourney.


DALL·E 2 is an AI program created by OpenAI that creates images from textual descriptions. Using more than 10 billion parameter training versions of the GPT-3 transformer model, it interprets natural language inputs and generates the corresponding image.

An expressive oil painting of a basketball player dunking, depicted as an explosion of a nebula – created using DALLE 2

Stable Diffusion

Stable Diffusion is a text-to-image model that uses a frozen CLIP ViT-L/14 text encoder to tune the model at text prompts. It separates the imaging process into a “diffusion” process at runtime- it starts with only noise and gradually improves the image until it is entirely free of noise, progressively approaching the provided text description.

A pikachu fine dining with a view to the Eiffel Tower – generated by Stable Diffusion


Midjourney is another AI-powered tool that generates images from user prompts. MidJourney is proficient at adapting actual art styles to create an image of any combination of things the user wants. It excels at creating environments, especially fantasy and sci-fi scenes, with dramatic lighting that looks like rendered concept art from a video game.

                          Cloud Castle at night, cinematic – created by Midjourney

The technology behind DALL·E 2

DALL·E 2 primarily consists of 2 parts – one to convert the user input into the representation of an image (called Prior) and another to convert this representation into an actual photo (called Decoder).


The text and image embeddings used come from another network called CLIP (Contrastive Language-Image Pre-training), also created by OpenAI. CLIP is a neural network that returns the best caption for an input image. It does the opposite of what DALLE 2 does – text-to-image conversion. The objective of CLIP is to learn the connection between the visual and textual representation of an object.

DALL·E 2’s goal is to train two models. The first is Prior, trained to take text labels and create CLIP image embeddings. The second is the Decoder, which takes the CLIP image embeddings and produces a learned image. After training, the workflow of inference looks like this:

  • The entered caption is transformed into a CLIP text embedding using a neural network.
  • Prior reduces the dimensionality of the text embedding using Principal Component Analysis or PCA.
  • Image embedding is created using the text embedding.
  • In the decoder step, a diffusion model is used to transform the image embedding into the image.
  • The image is upscaled from 64×64 to 256×256 and then finally to 1024×1024 using a Convolutional Neural Network.

The technology behind Stable Diffusion

Stable Diffusion is powered by Latent Diffusion Model (LDM), a cutting-edge text-to-image synthesis technique. Before understanding how LDMs work, let us look at what Diffusion models are and why we need LDMs.

Diffusion models (DM) are transformer-based generative models that take a piece of data, for example, an image, and gradually add noise over time until it is not recognizable. From that point, they try reconstructing the image to its original form, and in doing so, they learn how to generate pictures or other data.

The issue with DMs is that the powerful ones often consume hundreds of GPU days, and inference is quite expensive due to sequential evaluations. To enable DM training on limited computational resources without compromising their quality as well as flexibility, DMs are applied in the latent space of powerful pre-trained autoencoders. 

Training a diffusion model on such a representation makes it possible to achieve an optimal point between complexity reduction and detail preservation, significantly improving visual fidelity. Introducing a cross-attention layer to the model architecture turns the diffusion model into a powerful and flexible generator for generally conditioned inputs such as text and bounding boxes, enabling high-resolution convolution-based synthesis.

How does Midjourney work?

Midjourney is an AI image generation tool that takes inputs through text prompts and parameters and uses a Machine Learning (ML) algorithm trained on a large amount of image data to produce unique images. 

Midjourney is currently only accessible via the Discord bot on their official Discord. The user generates the image using the ‘/imagine’ command and enters the command prompt like any other AI art generator tool. The bot then returns a snap.

Comparison between DALL·E 2, Stable Diffusion and Midjourney

DALL·E 2 has been trained on millions of stock images, making its output more sophisticated and perfect for enterprise use. DALL·E 2 produces a much better picture than Midjourney or Stable Diffusion when there are more than two characters.

Midjourney, on the other hand, is a tool best known for its artistic style. Midjourney uses its Discord bot to send as well as receive calls to AI servers, and almost everything happens on Discord. The resulting image rarely looks like a photograph; it seems more like a painting.

Stable Diffusion is an open-source model accessible to everyone. It also has a relatively good understanding of contemporary artistic illustration and can produce highly detailed artwork. However, it needs an interpretation of the complex original prompt. Stable Diffusion is excellent for intricate, creative illustrations but falls short when creating general images such as logos.

The below prompts help to understand the similarities and differences between each model.

Don’t forget to join our Reddit page and discord channel, where we share the latest AI research news, cool AI projects, and more. 


🐝 Join the Fastest Growing AI Research Newsletter Read by Researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft and many others...