Machine Learning (ML), or more precisely, Deep Learning (DL), has revolutionized the field of Artificial Intelligence (AI) and made tremendous breakthroughs in numerous areas, including computers. DL is a branch of ML that uses deep neural networks, i.e., neural networks composed of multiple hidden layers, to accomplish tasks that were previously impossible. It has opened up a whole new world of possibilities, allowing machines to “learn” and make decisions in ways not seen before. Regarding computer vision, DL is nowadays the most powerful tool for image generation and editing.
As a matter of fact, DL models are nowadays capable of creating realistic photographs from scratch in the style of a particular artist, making images look older or younger than they truly are, or exploiting textual descriptions with text-attention mechanisms to guide the generation. A very known example is Stable Diffusion, a text-to-image generation model recently released in version 2.0.
Several image editing tasks, such as in-painting, colorization, and text-driven transformations, are already performed successfully by DL end-to-end architectures. In particular, text-driven image editing has recently attracted interest from a vast public.
In the original formulation, image editing models traditionally targeted a single editing task, usually style transfer. Other methods encode the images into vectors in the latent space and then manipulate these latent vectors to apply the transformation.
Recently other publications have focused on pretrained text-to-image diffusion models for image editing. Although some of these models have the ability to modify images, in most cases, they offer no guarantees that similar text prompts will yield similar outcomes, as clear from the results presented later on.
The idea and innovation introduced by the presented approach, termed InstructPix2Pix, is the consideration of instruction-based image editing as a supervised learning problem. The first task is the generation of pairs composed of text editing instructions and images before/after the edit. The following step is the supervised training of the proposed diffusion model on this generated dataset. Precisely, the model architecture is summarized in the figure below.
The first part (Training Data Generation in the figure) involves two large-scale pretrained models that operate on different modalities: a language model and a text-to-image model. For the language model, GTP-3 has been exploited and fine-tuned on a small human-written dataset of 700 editing triplets: input captions, edit instructions, and output captions. The final dataset generated by this model includes more than 450k triplets, which are used to guide the editing process. Still, we only have text tuples, but we need images to train the diffusion model. At this point, Stable Diffusion and Prompt2Prompt are utilized to generate appropriate images from these text triplets. In particular, Prompt2Promt is a recent technique that helps achieve great similarity within the pairs of generated images through a cross-attention mechanism. This solution is certainly to be encouraged since the idea is to edit or alter a portion of the input image and not create a completely different one.
The second part (Instruction-following Diffusion Model in the figure) refers to the proposed diffusion model, which aims at producing a transformed image according to an editing instruction and an input image.
The structure is equivalent to the notorious latent diffusion models. Diffusion models learn to generate data samples through a sequence of denoising autoencoders that estimate the input data distribution. Latent diffusion improves the efficiency of diffusion models by operating in the latent space of a pretrained variational autoencoder.
The idea behind diffusion models is rather trivial. The diffusion process starts by adding noise to an input image or an encoded latent vector representing the image. Using the text-attention mechanism, denoisers are applied to the noisy image to obtain a much clearer and more detailed result. This was a summary of InstructPix2Pix, a novel text-driven approach to guide image editing. You can find additional information in the links below if you want to learn more about it.
Check out the Paper and Project Page. All Credit For This Research Goes To Researchers on This Project. Also, don’t forget to join our Reddit page and discord channel, where we share the latest AI research news, cool AI projects, and more.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.