Text-to-image diffusion models have advanced remarkably, yet recent research finds that the images they generate do not always convey the intended meaning of the text prompt. Generating images that align with the semantic content of the text query is challenging because it requires a deep understanding of textual concepts and of how they appear in visual representations.
Because detailed annotations are hard to acquire, current text-to-image models struggle to fully capture the intricate relationship between text and images. Consequently, these models tend to generate images that resemble the text-image pairs occurring most frequently in the training datasets. As a result, the generated images often lack requested attributes or contain undesired ones. Recent research has focused on reintroducing missing objects or attributes by modifying images based on well-crafted text prompts, but techniques for removing redundant attributes, or for explicitly instructing the model to exclude unwanted objects using negative prompts, remain little explored.
To fill this research gap, a new approach has been proposed that addresses the limitations of the existing algorithm for negative prompts. According to the authors of this work, the current implementation of negative prompts can lead to unsatisfactory results, particularly when the main prompt and the negative prompts overlap.
To address this issue, they propose a novel algorithm called Perp-Neg, which does not require any training and can be applied to a pre-trained diffusion model. The architecture is reported below.
The name “Perp-Neg” comes from its use of the perpendicular score estimated by the denoiser for the negative prompt, which is the key principle behind the algorithm. Specifically, Perp-Neg restricts the negative prompt's contribution to the denoising process to be perpendicular to the direction of the main prompt. This geometric constraint plays a crucial role in achieving the desired outcome.
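The geometric idea can be written down compactly. The following is a sketch under stated assumptions, not the authors' exact notation: the symbols $\epsilon_\theta$ (the denoiser), $x_t$ (the noisy sample), $c_{\text{pos}}$ and $c_{\text{neg}}$ (the main and negative prompts), and the weights $w_{\text{pos}}$ and $w_{\text{neg}}$ are all introduced here for illustration. The negative direction is projected onto the orthogonal complement of the main-prompt direction before it is subtracted.

```latex
\begin{align*}
e_{\text{pos}} &= \epsilon_\theta(x_t, c_{\text{pos}}) - \epsilon_\theta(x_t), \qquad
e_{\text{neg}} = \epsilon_\theta(x_t, c_{\text{neg}}) - \epsilon_\theta(x_t) \\
e_{\text{neg}}^{\perp} &= e_{\text{neg}}
  - \frac{\langle e_{\text{neg}},\, e_{\text{pos}} \rangle}{\lVert e_{\text{pos}} \rVert^{2}}\, e_{\text{pos}} \\
\hat{\epsilon} &= \epsilon_\theta(x_t)
  + w_{\text{pos}} \left( e_{\text{pos}} - w_{\text{neg}}\, e_{\text{neg}}^{\perp} \right)
\end{align*}
```

By construction $\langle e_{\text{neg}}^{\perp}, e_{\text{pos}} \rangle = 0$, so subtracting $e_{\text{neg}}^{\perp}$ leaves the component along the main prompt untouched.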
Perp-Neg handles undesired perspectives named in the negative prompts by limiting the negative prompt’s denoising direction to be perpendicular to the main prompt. This ensures that the model eliminates only those aspects that are orthogonal, or unrelated, to the main semantics of the prompt. In other words, Perp-Neg enables the model to remove undesirable attributes or objects not aligned with the text’s intended meaning while preserving the main prompt’s core essence.
This approach helps in enhancing the overall quality and coherence of the generated images, ensuring a stronger alignment with the original text input.
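To make the overlap issue concrete, here is a minimal NumPy sketch on toy 2-D vectors. It is an illustration under assumptions, not the authors' implementation: the function names, the weights `w_pos` and `w_neg`, and the toy vectors standing in for denoiser score directions are all hypothetical, and the "naive" variant is simply a plain subtraction used for comparison.

```python
import numpy as np


def perpendicular_component(e_neg, e_pos):
    """Return the part of e_neg that is orthogonal to e_pos (hypothetical helper)."""
    e_neg = np.asarray(e_neg, dtype=float)
    e_pos = np.asarray(e_pos, dtype=float)
    denom = np.vdot(e_pos, e_pos)  # squared norm; vdot flattens N-d inputs
    if denom == 0.0:
        return e_neg.copy()  # no main direction to project against
    coeff = np.vdot(e_pos, e_neg) / denom
    return e_neg - coeff * e_pos


def naive_guidance(e_pos, e_neg, w_pos=1.0, w_neg=1.0):
    """Plain subtraction of the negative direction (comparison baseline, an assumption)."""
    return w_pos * (np.asarray(e_pos, float) - w_neg * np.asarray(e_neg, float))


def perp_guidance(e_pos, e_neg, w_pos=1.0, w_neg=1.0):
    """Subtract only the component of the negative direction perpendicular to e_pos."""
    e_pos = np.asarray(e_pos, dtype=float)
    return w_pos * (e_pos - w_neg * perpendicular_component(e_neg, e_pos))


# Toy directions: the negative prompt overlaps heavily with the main prompt.
e_pos = np.array([1.0, 0.0])
e_neg = np.array([0.8, 0.6])

naive = naive_guidance(e_pos, e_neg)  # roughly [0.2, -0.6]: main direction mostly cancelled
perp = perp_guidance(e_pos, e_neg)    # roughly [1.0, -0.6]: main direction preserved

print("naive:", naive)
print("perp: ", perp)
```

In this toy case the plain subtraction removes most of the component along the main prompt, while the perpendicular variant keeps it intact and only pushes away from the part of the negative direction that the main prompt does not share.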
Some results obtained via Perp-Neg are presented in the figure below.
Beyond image synthesis, the authors also extend Perp-Neg to DreamFusion, an advanced text-to-3D model, and demonstrate its effectiveness in mitigating the Janus problem. The Janus (or multi-faced) problem refers to situations where a generated 3D object is rendered primarily according to its canonical view, even from other perspectives; for example, a head may show a face on more than one side. This problem arises mainly because the training dataset is unbalanced: animals and people, for instance, are usually depicted from the front and only sporadically from the side or back.
This was a summary of Perp-Neg, a novel AI algorithm that leverages the geometrical properties of the score space to address the shortcomings of the current negative-prompt algorithm. If you are interested, you can learn more about this technique in the links below.
Check out the Paper, Project, and Github.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.