Large text-to-image diffusion models have been an innovative tool for creating and editing content because they make it possible to synthesize a variety of images with unmatched quality that correspond to a particular text prompt. Despite the text prompt’s semantic direction, these models still lack logical control handles that may direct the spatial characteristics of the synthesized images. One unsolved problem is how to direct a pre-trained text-to-image diffusion model during inference with a spatial map from another domain, like sketches.
To map the guided picture into the latent space of the pretrained unconditional diffusion model, one approach is to train a dedicated encoder. However, the trained encoder does well within the domain but has trouble outside the domain free-hand sketching.
In this work, three researchers from Google Brain and Tel Aviv University addressed this issue by introducing a general method to direct the inference process of a pretrained text-to-image diffusion model with an edge predictor that operates on the internal activations of the diffusion model’s core network, inducing the edge of the synthesized image to adhere to a reference sketch.
Latent Edge Predictor (LEP)
The first objective is to train an MLP that guides the image generation process with a target edge map, as shown in the figure below. The MLP is trained to map the internal activations of a denoising diffusion model network into spatial edge maps. The core U-net network of the diffusion model is then used to extract the activations from a predetermined order of intermediate layers.
The triplets (x, e, c) containing an image (x), an edge map (e), and a corresponding text caption (c) are used to train the network. The edge maps (e) and images (x) are preprocessed by the model encoder E to produce E(x) and E(e). Then, using text c and the quantity of noise t given to E, the activations are extracted from a predefined sequence of intermediary layers in the diffusion model’s core U-net network.
The extracted features are mapped to the encoded edge map E(e) by training the MLP per pixel with the sum of their channels. The MLP is trained to predict edges in a local manner, being indifferent to the domain of the image, due to the per-pixel nature of the architecture. Additionally, it permits training on a small volume of a few thousand images.
Sketch-Guided Text-to-Image Synthesis
Once the LEP is trained, given a sketch image e and a caption c, the goal is to generate a corresponding highly detailed image that follows the sketch outline. This process is shown in the figure below.
The authors started with a latent image representation zT sampled from a uniform Gaussian. Normally, the DDPM synthesis consists of T consecutive denoising steps, which constitute the reverse diffusion process. The internal activations are once again collected in the U-Net shape network and concatenated to a per-pixel spatial tensor. Then using the pretrained per-pixel LEP, a sketch is predicted. The loss is computed as the similarity between the predicted sketch and the target e. At the end of the training, the model produces a natural image aligned with the desired sketch.
Some (impressive) results are shown below. At inference time, starting from a text prompt and an input sketch, the model is able to produce realistic samples guided by the two input information.
Moreover, as shown below, the authors performed additional studies on specific use cases, such as realism vs. edge fidelity, or stroke importance.
Check out the Paper and Project. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our Reddit Page, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.
Leonardo Tanzi is currently a Ph.D. Student at the Polytechnic University of Turin, Italy. His current research focuses on human-machine methodologies for smart support during complex interventions in the medical domain, using Deep Learning and Augmented Reality for 3D assistance.