Researchers At UC Berkeley Propose IntructPix2Pix: A Diffusion Model To Edit Images From Human-Written Instructions

In recent years, the possible applications of text-to-image models have increased enormously. However, image editing to human-written instruction is one subfield that still has numerous shortcomings. The biggest drawback is how challenging it is to gather training data for this task. 

To solve this issue, a technique for creating a paired dataset that includes multiple large models pretrained on various modalities was proposed by a research team from the University of Berkeley based on a large language model (GPT-3) and a text-to-image model (Stable Diffusion). After producing the paired dataset, the authors trained a conditional diffusion model on the generated data to produce the edited image from an input image and a textual description of how to edit it.

Dataset generation

The authors first only worked in the text domain, utilizing a big language model to take in image captions, generate editing instructions, and then output the edited text captions. As an example, the language model may produce the plausible edit instruction “have her ride a dragon” and the suitably updated output caption “photograph of a girl riding a dragon” given the input caption “photograph of a girl riding a horse,” as seen in the figure above. Working in the text domain made it possible to produce a broad range of adjustments while preserving a relationship between the language instructions and image changes. 

A relatively modest human-written dataset of editing triplets – input captions, edit instructions, and output captions – was used to fine-tune GPT-3 to train the model. The authors manually created the instructions and output captions for the fine-tuning dataset after selecting 700 input caption samples from the LAION-Aesthetics V2 6.5+ dataset. With the aid of this data and the default training parameters, the GPT-3 Davinci model’s fine-tuning for a single epoch was accomplished while taking advantage of its vast knowledge and generalization skills.

They then converted two captions into two images using a pretrained text-to-image algorithm. The fact that text-to-picture models don’t ensure visual consistency, even with slight changes to the conditioning prompt, makes it difficult to convert two captions into two comparable images. Two very similar instructions, such as “draw a picture of a cat” and “draw a picture of a black cat,” for instance, could result in vastly diverse drawings of cats. So, they employ Prompt-to-Prompt, a new technique designed to promote similarity across several generations of a text-to-image diffusion model. A comparison of sampled images with and without prompt-to-prompt is 

shown in the figure below.

Immagine che contiene testo, erba, cielo, persona

Descrizione generata automaticamente


After generating the training data, the authors trained a conditional diffusion model, named InstructPix2Pix, that edits images from written instructions. The model is based on Stable Diffusion, a large-scale text-to-image latent diffusion model. Diffusion models use a series of denoising autoencoders to learn how to create data samples. Latent diffusion, which operates in the latent space of a pretrained variational autoencoder, enhances the effectiveness and quality of diffusion models. The authors initialized the weights of the model with a pretrained Stable Diffusion checkpoint, utilizing its extensive text-to-image generation capabilities, because fine-tuning a large image diffusion model outperforms training a model from scratch for image translation tasks, especially when paired training data is scarce. Classifier-free diffusion guidance, a technique for balancing the quality and diversity of samples produced by a diffusion model, was used.


The model performs zero-shot generalization to both arbitrary real images and natural human-written instructions despite being trained completely on synthetic samples.

The paradigm provides intuitive picture editing that can execute a wide range of alterations, including object replacement, image style changes, setting changes, and creative medium changes, as illustrated below.

The authors also conducted a study on gender bias (see below), which is typically ignored by research articles and demonstrates the biases on which the models are based.

Immagine che contiene testo, persona, interni, gruppo

Descrizione generata automaticamente

Check out the Paper, Project, and Github. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our Reddit PageDiscord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

Leonardo Tanzi is currently a Ph.D. Student at the Polytechnic University of Turin, Italy. His current research focuses on human-machine methodologies for smart support during complex interventions in the medical domain, using Deep Learning and Augmented Reality for 3D assistance.

🚀 The end of project management by humans (Sponsored)