Recently, language-guided image editing based on deep learning has made a huge impact, allowing non-experts to generate outstanding artistic images. In this scenario, a challenging task is semantic image editing, whose goal is to manipulate the semantics of the content inside an image while preserving its overall realism. Applying popular language-image models to this task is non-trivial, since textual descriptions are ambiguous and may not accurately reflect the effects the user desires.
In this work, researchers from the University of Science and Technology of China and Microsoft Research Asia propose a more intuitive image editing approach that allows semantic manipulation of image content based on an exemplar image provided by the user. In particular, the proposed method merges the user-provided reference image into a source image so that the fused result looks photo-realistic.
Figure 1 shows some examples obtained with the proposed method.
To achieve their goal, the authors trained a diffusion model conditioned on the exemplar image. Figure 4 shows the overall training pipeline of the proposed method.
Since it is impossible to collect enough training triplets containing a source image, an exemplar, and the corresponding combined ground truth, the method instead randomly crops an object out of each input image and treats the crop as the reference image. During training, the goal is to reconstruct the original image from the masked source (the image without the cropped object) and the cropped reference object. On its own, however, this scheme is insufficient: the model simply learns to copy and paste the reference object back into the masked region.
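The self-supervised pair construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the bounding box is assumed to come from an off-the-shelf object detector, and the mask is simply zeroed out.

```python
import numpy as np

def make_training_pair(image: np.ndarray, bbox: tuple):
    """Build a self-supervised training pair: erase the object from the
    source image and use the crop itself as the reference exemplar.
    bbox = (y0, x0, y1, x1) is assumed to come from an object detector."""
    y0, x0, y1, x1 = bbox
    reference = image[y0:y1, x0:x1].copy()   # exemplar: the cropped object
    masked_source = image.copy()
    masked_source[y0:y1, x0:x1] = 0          # blank out the object region
    mask = np.zeros(image.shape[:2], dtype=bool)
    mask[y0:y1, x0:x1] = True
    # the untouched original image is the reconstruction target
    return masked_source, reference, mask, image

# toy example on a random 8x8 RGB image
img = np.random.rand(8, 8, 3)
src, ref, m, target = make_training_pair(img, (2, 2, 6, 6))
```

With this construction, the ground truth is free: every image with a detected object yields a (masked source, reference, target) triplet, which is exactly why the naive model can cheat by copy-pasting.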
First of all, there is a train-test mismatch to handle. Indeed, the reference object during training is derived directly from the source image, a setup that does not generalize well to test data, where the exemplar comes from a different image. For this reason, the authors apply several data augmentation techniques (e.g., rotation, blur) to the reference object to break its pixel-level connection with the source image. The reference is then encoded with the image encoder of the pre-trained vision-language model CLIP, so that the text-to-image diffusion model is conditioned on the reference object instead of a text prompt.
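A sketch of such reference augmentation is shown below, using a random 90-degree rotation and a simple box blur. These are illustrative choices in plain numpy, not the paper's exact augmentation set.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_reference(ref: np.ndarray) -> np.ndarray:
    """Randomly rotate and blur the reference crop so it no longer matches
    the source image pixel-for-pixel (illustrative, not the paper's exact
    augmentations)."""
    ref = np.rot90(ref, k=int(rng.integers(0, 4)))   # random 90-degree rotation
    # 3x3 box blur via shifted averages; edges handled by replicate padding
    padded = np.pad(ref, ((1, 1), (1, 1), (0, 0)), mode="edge")
    blurred = np.zeros_like(ref, dtype=float)
    for dy in range(3):
        for dx in range(3):
            blurred += padded[dy:dy + ref.shape[0], dx:dx + ref.shape[1]]
    return blurred / 9.0

ref = rng.random((16, 16, 3))
aug = augment_reference(ref)
```

The point of the augmentation is that the network can no longer minimize the reconstruction loss by copying pixels, so it must rely on the semantic content of the exemplar.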
Moreover, an information bottleneck is introduced to force the network to truly understand the content of the reference object instead of simply copying it into the masked source image. To this end, the reference object is compressed from a 224 x 224 x 3 image into a one-dimensional vector of 1024 elements, which discards fine-grained details of the object while retaining its overall semantic information. In addition, to prevent the model from simply memorizing and regenerating the reference object, the proposed method is initialized from Stable Diffusion, which provides a strong image prior.
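To make the scale of this bottleneck concrete, the sketch below stands in for CLIP's image encoder with a random linear projection over pooled patch statistics. The real model is a learned transformer; here only the shapes matter: a 150,528-value image is reduced to a single 1024-dimensional vector, roughly a 147x compression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for CLIP's image encoder: the real encoder maps a 224x224x3
# image to one embedding vector; a random projection of 14x14 patch means
# plays that role here purely to illustrate the information bottleneck.
PATCH = 14                                   # 224 / 14 = a 16x16 patch grid
W = rng.standard_normal((16 * 16 * 3, 1024)) * 0.01

def bottleneck_embed(image: np.ndarray) -> np.ndarray:
    h, w, c = image.shape                    # expects 224 x 224 x 3
    patches = image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
    pooled = patches.mean(axis=(1, 3))       # 16 x 16 x 3 patch means
    return pooled.reshape(-1) @ W            # a single 1024-d vector

emb = bottleneck_embed(rng.random((224, 224, 3)))
```

Because 1024 numbers cannot encode every pixel of the exemplar, the diffusion model is forced to regenerate the object from its semantics rather than reproduce it verbatim.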
Finally, the authors also address the controllability of the editing process. First, the end user can control the shape of the edit region. This is possible because, during training, the method uses arbitrarily shaped masks derived from the bounding box of the reference object; exposure to these irregular masks lets the model generate photo-realistic images for masks of different shapes. Second, the end user can control how closely the edit region resembles the reference object: the method adopts the classifier-free guidance strategy, whose guidance scale regulates the similarity between the generated image and the reference object.
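Classifier-free guidance itself is a one-line combination of two denoiser predictions, one made with the exemplar condition and one without. A minimal sketch (with toy vectors in place of real noise predictions):

```python
import numpy as np

def cfg(eps_uncond: np.ndarray, eps_cond: np.ndarray, scale: float) -> np.ndarray:
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the exemplar-conditioned one. scale > 1 pushes the
    output closer to the reference; scale = 1 recovers plain conditioning."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 1.0])   # toy unconditional prediction
eps_c = np.array([1.0, 1.0])   # toy exemplar-conditioned prediction
out = cfg(eps_u, eps_c, 3.0)   # -> [3.0, 1.0]
```

Turning the scale up or down at sampling time is what gives the user a dial for how faithfully the edited region follows the exemplar, with no retraining required.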
Check out the paper and GitHub link. All credit for this research goes to the researchers on this project.
Luca is a Ph.D. student at the Department of Computer Science of the University of Milan. His interests include Machine Learning, Data Analysis, IoT, Mobile Programming, and Indoor Positioning. His research currently focuses on Pervasive Computing, Context-awareness, Explainable AI, and Human Activity Recognition in smart environments.