Google AI Unveils Imagen Editor and EditBench to Improve and Evaluate Text-Guided Image Inpainting

Interest in text-to-image generators has risen sharply. These generative models are surprisingly useful, although they do not always produce the desired result on the first try, especially for users with specific creative or design requirements. Text-guided image editing can improve the image creation process by allowing interactive refinement. Generating edits that are faithful to the text prompt and consistent with the input image is a significant challenge. Researchers from Google have developed Imagen Editor, a cascaded diffusion model for inpainting with text instructions.

Imagen Editor produces edits that accurately reflect the text prompt by using object detectors to propose inpainting masks during training. It also captures fine details of the input image by conditioning the cascaded pipeline on the original high-resolution image. To improve qualitative and quantitative evaluation, the Google researchers introduce EditBench, a standardized benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on objects, attributes, and scenes in both natural and generated images. In-depth human evaluation on EditBench shows that object masking during training significantly improves text-image alignment, with Imagen Editor outperforming DALL-E 2 and Stable Diffusion. Across the board, these models are better at rendering objects than rendering text, and better at handling material, color, and size attributes than count and shape attributes.

Imagen Editor

Imagen Editor is a diffusion-based model fine-tuned from Imagen for image editing. It aims for more faithful representations of linguistic inputs, fine-grained control, and high-quality outputs. Imagen Editor takes three inputs to produce its output samples: the image to be modified, a binary mask that identifies the edit region, and a text prompt.

Imagen Editor lets users make targeted changes to specific regions of an image, guided by a mask and a text prompt. The model incorporates the user's intent and makes realistic edits to the image. It combines broad linguistic representations with fine-grained control to produce high-quality results. Architecturally, Imagen Editor is a cascaded diffusion model fine-tuned from Imagen for text-guided image inpainting. Three convolutional downsampling image encoders give each diffusion stage additional image and mask context.

Imagen Editor's reliable text-guided image inpainting rests on three core techniques:

Imagen Editor uses an object detector masking policy: during training, an object detector module generates object masks, replacing the random box and stroke masks used by previous inpainting models.
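As a rough illustration of the masking policy described above, the sketch below builds a binary training mask from a detected bounding box. The function names, the `(top, left, bottom, right)` box format, and the fallback to a random box when nothing is detected are assumptions made for this example; the article does not describe the actual training code.

```python
import random

def box_mask(height, width, box):
    """Binary mask (list of rows) with 1 inside box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    return [
        [1 if top <= y < bottom and left <= x < right else 0 for x in range(width)]
        for y in range(height)
    ]

def random_box(height, width, rng):
    """Random non-empty box, similar in spirit to the older random-box masks."""
    top = rng.randrange(height)
    left = rng.randrange(width)
    bottom = rng.randrange(top + 1, height + 1)
    right = rng.randrange(left + 1, width + 1)
    return (top, left, bottom, right)

def training_mask(height, width, detected_boxes, rng):
    """Mask one detected object; fall back to a random box (an assumption)."""
    if detected_boxes:
        return box_mask(height, width, rng.choice(detected_boxes))
    return box_mask(height, width, random_box(height, width, rng))

# Hypothetical detector output: one object covering rows 1-2, columns 2-4.
mask = training_mask(4, 6, [(1, 2, 3, 5)], random.Random(0))
```

Because the masked region coincides with an object, the text prompt tends to describe exactly what was hidden, which is the intuition behind the improved text-image alignment reported for object masking.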

Imagen Editor improves high-resolution editing by conditioning on the full-resolution input image and mask, concatenated channel-wise, during both training and inference.
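Channel-wise concatenation simply appends the mask as an extra channel on every pixel. The following is a minimal sketch using plain nested lists rather than a tensor library; the function name and the height × width × channels layout are assumptions for illustration only.

```python
def concat_channels(image, mask):
    """Append the mask as a fourth channel.

    image: H x W x 3 nested lists of floats (RGB)
    mask:  H x W nested lists of 0/1
    Returns an H x W x 4 nested list.
    """
    same_height = len(image) == len(mask)
    same_width = all(len(r) == len(m) for r, m in zip(image, mask))
    if not (same_height and same_width):
        raise ValueError("image and mask must share the same height and width")
    return [
        [list(pixel) + [float(m)] for pixel, m in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]

# A 2 x 2 RGB image with the top-right pixel marked for editing.
image = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
         [[0.7, 0.8, 0.9], [1.0, 1.0, 1.0]]]
mask = [[0, 1],
        [0, 0]]
conditioned = concat_channels(image, mask)
```

Each pixel now carries four values (R, G, B, mask), so the model sees both the original content and the edit region at full resolution.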

To bias samples toward a particular conditioning, in this case the text prompt, the researchers use classifier-free guidance (CFG) at inference. CFG interpolates between the predictions of the text-conditioned and unconditioned models to ensure strong alignment between the generated image and the text prompt.
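The standard CFG combination can be written as prediction = unconditioned + w × (conditioned − unconditioned), where w is the guidance weight. The sketch below applies it to flat lists of floats standing in for the model's noise predictions; the function and parameter names are illustrative, and the article does not state which guidance weights Imagen Editor uses.

```python
def cfg_combine(eps_cond, eps_uncond, guidance_weight):
    """Classifier-free guidance: blend conditioned and unconditioned predictions.

    guidance_weight = 0 returns the unconditioned prediction,
    guidance_weight = 1 returns the conditioned prediction,
    guidance_weight > 1 pushes further toward the text condition.
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_weight * (c - u) for c, u in zip(eps_cond, eps_uncond)]

eps_cond = [1.0, 2.0]    # prediction given the text prompt
eps_uncond = [0.0, 1.0]  # prediction with the prompt dropped
guided = cfg_combine(eps_cond, eps_uncond, 3.0)  # [3.0, 4.0]
```

With a weight between 0 and 1 this is a true interpolation; weights above 1, which are common in practice, extrapolate past the conditioned prediction to strengthen prompt adherence.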

Keeping generated outputs faithful to the text prompt is a major challenge in text-guided image inpainting.


EditBench

EditBench comprises 240 images and sets a new standard for evaluating text-guided image inpainting. Each image is paired with a mask that marks the region to be modified during inpainting. To specify the edit, the researchers provide three text prompts for each image-mask pair. Like the text-to-image generation benchmarks DrawBench and PartiPrompts, EditBench is hand-curated and attempts to capture a wide range of categories and difficulty factors in the images it collects. EditBench includes an equal split of natural photos drawn from existing computer vision datasets and synthetic images produced by text-to-image models.
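To make the benchmark's shape concrete, here is a small sketch of how one entry might be represented and how the totals follow from the figures above. The class, field names, and file names are hypothetical and are not taken from the released dataset.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class EditBenchExample:
    """One image-mask pair with its three prompts (field names are hypothetical)."""
    image_path: str
    mask_path: str
    prompts: Tuple[str, str, str]
    source: str  # "natural" or "synthetic"

    def __post_init__(self):
        if len(self.prompts) != 3:
            raise ValueError("each image-mask pair has exactly three prompts")
        if self.source not in ("natural", "synthetic"):
            raise ValueError("source must be 'natural' or 'synthetic'")

def total_prompts(examples):
    """Total number of prompts across all image-mask pairs."""
    return sum(len(example.prompts) for example in examples)

def is_evenly_split(examples):
    """True when natural and synthetic images are equally represented."""
    natural = sum(example.source == "natural" for example in examples)
    return 2 * natural == len(examples)

# Placeholder entries standing in for the 240 benchmark images.
examples = [
    EditBenchExample(
        image_path=f"image_{i}.png",
        mask_path=f"mask_{i}.png",
        prompts=("prompt a", "prompt b", "prompt c"),
        source="natural" if i % 2 == 0 else "synthetic",
    )
    for i in range(240)
]
print(total_prompts(examples))  # 240 pairs x 3 prompts = 720
```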

EditBench covers a wide range of mask sizes, including large masks that extend to the image borders. EditBench prompts are designed to test model performance on fine-grained details across three categories:

  1. Attributes (such as material, color, shape, size, and count)
  2. Object types (such as common, rare, and text rendering)
  3. Scenes (such as indoor, outdoor, realistic, or painted)


The research team assesses text-image alignment and image quality on EditBench through rigorous human evaluation. They also compare human preferences with automatic metrics. Their analysis covers four models:

  • Imagen Editor (IM)
  • Imagen EditorRM (IMRM)
  • Stable Diffusion (SD)
  • DALL-E 2 (DL2)

To assess the benefits of object masking during training, the researchers compare Imagen Editor with Imagen EditorRM, a variant trained with random masking. To place their work in context and to examine the limitations of the current state of the art more broadly, they also evaluate Stable Diffusion and DALL-E 2.

To sum it up

The image editing models presented here are part of a larger family of generative models that enable previously inaccessible capabilities in content production. However, they also carry the risk of generating content that is harmful to individuals or to society as a whole. It is well established in language modeling that text generation models can unintentionally reflect and amplify social biases present in their training data. Imagen Editor is a version of Imagen fine-tuned for text-guided image inpainting. It relies on an object masking policy during training and on additional convolution layers for high-resolution editing. EditBench is a comprehensive, systematic benchmark for text-guided image inpainting that tests inpainting systems across attributes, objects, and scenes.

Check out the paper and the Google blog post.


Dhanshree Shenwai is a Computer Science Engineer and has a good experience in FinTech companies covering Financial, Cards & Payments and Banking domain with keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements in today’s evolving world making everyone's life easy.
