Recently, text-to-image (T2I) diffusion models have exhibited promising outcomes, sparking explorations into numerous generative tasks. Some efforts have been made to invert pre-trained text-to-image models to obtain text embedding representations, allowing for capturing object appearances in reference images. However, there has been limited exploration of capturing object relations, a more challenging task involving the understanding of interactions between objects and image composition. Existing inversion methods struggle with this task due to entity leakage from reference images: the learned embedding absorbs the appearance of the objects in the exemplars rather than the relation between them, so those objects reappear in the generated images.
Nonetheless, addressing this challenge is of significant importance.
This study focuses on the Relation Inversion task, which aims to learn relationships in given exemplar images. The objective is to derive a relation prompt within the text embedding space of a pre-trained text-to-image diffusion model, where objects in each exemplar image follow a specific relation. Combining the relation prompt with user-defined text prompts allows users to generate images corresponding to specific relationships while customizing objects, styles, backgrounds, and more.
A preposition prior is introduced to enhance the representation of high-level relation concepts using the learnable prompt. This prior is based on three observations: (1) prepositions are closely linked to relations; (2) prepositions and words of other parts of speech are individually clustered in the text embedding space; and (3) complex real-world relations can be expressed using a basic set of prepositions.
Building upon the preposition prior, a novel framework termed ReVersion is proposed to address the Relation Inversion problem. An overview of the framework is illustrated below.
This framework incorporates a novel relation-steering contrastive learning scheme to guide the relation prompt toward a relation-dense region in the text embedding space. Basis prepositions are used as positive samples to encourage embedding into the sparsely activated area. At the same time, words of other parts of speech in text descriptions are considered negatives, disentangling semantics related to object appearances. A relation-focal importance sampling strategy is devised to emphasize object interactions over low-level details, constraining the optimization process for improved relation inversion results.
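To make the two ideas in the paragraph above more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the InfoNCE-style form of the loss, the cosine-shaped timestep weighting, the assumption that larger (noisier) diffusion timesteps carry more of the high-level structure, and every name and default value (`steering_loss`, `timestep_probs`, `temperature`, `alpha`) are assumptions chosen for illustration.

```python
import numpy as np


def _unit(x):
    """L2-normalize vectors along the last axis."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def _logsumexp(v):
    """Numerically stable log(sum(exp(v)))."""
    m = v.max()
    return m + np.log(np.exp(v - m).sum())


def steering_loss(relation, positives, negatives, temperature=0.07):
    """InfoNCE-style contrastive loss (assumed form, for illustration).

    relation:  (D,)   learnable relation-prompt embedding
    positives: (P, D) embeddings of basis prepositions
    negatives: (N, D) embeddings of words of other parts of speech
    The loss is small when `relation` is close to the positives and
    far from the negatives in cosine similarity.
    """
    r = _unit(relation)
    pos = _unit(positives) @ r / temperature
    neg = _unit(negatives) @ r / temperature
    # -log( sum exp(pos) / (sum exp(pos) + sum exp(neg)) )
    return float(_logsumexp(np.concatenate([pos, neg])) - _logsumexp(pos))


def timestep_probs(num_steps, alpha=0.5):
    """Sampling probabilities over timesteps 1..num_steps.

    Assumed weighting: 1 - alpha * cos(pi * t / num_steps), which grows
    with t, so noisier steps are drawn more often than near-clean ones.
    """
    t = np.arange(1, num_steps + 1)
    weights = 1.0 - alpha * np.cos(np.pi * t / num_steps)
    return weights / weights.sum()


if __name__ == "__main__":
    # Toy 2-D embeddings (hypothetical values, not real text embeddings).
    prepositions = [[1.0, 0.0], [0.9, 0.1]]
    other_words = [[0.0, 1.0], [-1.0, 0.0]]
    print(steering_loss([1.0, 0.0], prepositions, other_words))
    print(steering_loss([0.0, 1.0], prepositions, other_words))

    # Draw a few training timesteps under the skewed distribution.
    probs = timestep_probs(1000)
    rng = np.random.default_rng(0)
    print(rng.choice(np.arange(1, 1001), size=4, p=probs))
```

In a real training loop, a term like this would be added to the usual denoising objective and the timestep for each step would be drawn from the skewed distribution instead of uniformly; how the two terms are weighted against each other is not specified here.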
In addition, the researchers introduce the ReVersion Benchmark, which offers a variety of exemplar images featuring diverse relations. This benchmark serves as an evaluation tool for future research in the Relation Inversion task. Results across various relations demonstrate the effectiveness of the preposition prior and the ReVersion framework.
As presented in the study, we report some of the provided outcomes below. Since this entails a novel task, there is no other state-of-the-art approach to compare with.
This was the summary of ReVersion, a novel AI diffusion model framework designed to address the Relation Inversion task. If you are interested and want to learn more about it, please feel free to refer to the links cited below.
Check out the Paper and Project. All credit for this research goes to the researchers on this project.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.