Humans can naturally break down complicated scenes into their component elements and imagine those elements in new scenarios. Given a photograph of a ceramic artwork showing a creature reclining on a bowl, one can easily picture the same creature in other poses and locations, or imagine the same bowl in a new environment. Today’s generative models, however, struggle with tasks of this nature. Recent research personalizes large-scale text-to-image models by optimizing newly added dedicated text embeddings or by fine-tuning the model weights, given several images of a single concept, so that instances of the concept can be synthesized in novel contexts.
In this study, researchers from the Hebrew University of Jerusalem, Google Research, Reichman University and Tel Aviv University present a novel scenario, textual scene decomposition: given a single image of a scene that may contain several concepts of different kinds, the objective is to extract a dedicated text token for each concept. This enables generating new images from text prompts that feature individual concepts or combinations of several concepts. The concepts to be learned or extracted are not always apparent, which makes the customization task inherently ambiguous. Previous works have dealt with this ambiguity by focusing on a single concept at a time and using several images that show the concept in different settings. A single-image setting, however, requires a different way to resolve the ambiguity.
Specifically, they propose augmenting the input image with a set of masks that indicate which concepts to extract. These masks may be free-form masks supplied by the user or masks produced by an automated segmentation method. Adapting the two primary personalization techniques, Textual Inversion (TI) and DreamBooth (DB), to this setting reveals a reconstruction-editability tradeoff: TI fails to reconstruct the concepts properly in a new context, whereas DB overfits and loses control over the context. The authors therefore propose a novel customization pipeline that balances preserving the identity of the learned concepts against avoiding overfitting.
Figure 1 provides an overview of their methodology, which has four main parts. (1) They use a union-sampling approach, in which a new subset of the tokens is sampled at each training step, to train the model to handle different combinations of generated concepts. (2) To prevent overfitting, they employ a two-phase training regime: the first phase optimizes only the newly added tokens with a high learning rate, and the second phase also trains the model weights with a reduced learning rate. (3) A masked diffusion loss is used to reconstruct the desired concepts. (4) A novel cross-attention loss promotes disentanglement between the learned concepts.
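To make the union-sampling idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes each concept has a placeholder handle string and a binary mask; the function name `sample_union`, the handle names `[v1]` and `[v2]`, and the prompt template are illustrative assumptions.

```python
import random

def sample_union(handles, masks, rng):
    """Pick a random non-empty subset of concept handles and merge their masks.

    handles: list of placeholder token strings (hypothetical names)
    masks:   list of 2D 0/1 grids (lists of lists), one per handle
    rng:     a random.Random instance, passed in for reproducibility
    """
    k = rng.randint(1, len(handles))  # subset size, between 1 and N inclusive
    chosen = sorted(rng.sample(range(len(handles)), k))
    # Assumed prompt template; the exact wording is an illustration.
    prompt = "a photo of " + " and ".join(handles[i] for i in chosen)
    height, width = len(masks[0]), len(masks[0][0])
    # Pixel-wise union: a pixel is on if any chosen concept covers it.
    union = [
        [max(masks[i][r][c] for i in chosen) for c in range(width)]
        for r in range(height)
    ]
    return chosen, prompt, union

handles = ["[v1]", "[v2]"]
masks = [
    [[1, 0], [0, 0]],  # concept 1 covers the top-left pixel
    [[0, 0], [0, 1]],  # concept 2 covers the bottom-right pixel
]
chosen, prompt, union = sample_union(handles, masks, random.Random(0))
```

Calling `sample_union` once per training step yields a fresh subset each time, so the model sees single concepts as well as their combinations during training.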
The pipeline consists of two phases, shown in Figure 1. In the first phase, they designate a set of special text tokens (called handles), freeze the model weights, and optimize the handles to reconstruct the input image. In the second phase, they switch to fine-tuning the model weights while continuing to optimize the handles. The method strongly emphasizes disentangled concept extraction, that is, ensuring that each handle is associated with only one target concept. The authors also observe that, to generate images showing combinations of concepts, the customization cannot be carried out independently for each concept. In response to this observation, they propose union sampling, a training strategy that addresses this need and improves the generation of concept combinations.
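The two-phase regime can be sketched as a simple schedule that decides which parameter groups are trainable at a given step. This is an illustrative sketch only: the step counts and learning rates are placeholder values rather than figures taken from the paper, and the names `PhaseConfig`, `config_for_step`, and `trainable_groups` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PhaseConfig:
    name: str
    steps: int
    handle_lr: float   # learning rate for the handle embeddings
    weights_lr: float  # learning rate for the model weights; 0.0 means frozen

# All numbers below are illustrative placeholders, not values from the paper.
SCHEDULE = [
    PhaseConfig("phase 1: handles only", steps=400, handle_lr=5e-4, weights_lr=0.0),
    PhaseConfig("phase 2: handles + weights", steps=400, handle_lr=2e-6, weights_lr=2e-6),
]

def config_for_step(step, schedule=SCHEDULE):
    """Return the phase that a global training step falls into."""
    boundary = 0
    for phase in schedule:
        boundary += phase.steps
        if step < boundary:
            return phase
    raise ValueError(f"step {step} is beyond the schedule ({boundary} steps)")

def trainable_groups(phase):
    """List the parameter groups that receive gradient updates in a phase."""
    groups = ["handle_embeddings"]
    if phase.weights_lr > 0:
        groups.append("model_weights")
    return groups
```

In a real training loop, the result of `trainable_groups` would determine which parameters are handed to the optimizer, so the model weights stay frozen until the second phase begins.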
To achieve this disentanglement, they use a masked diffusion loss, a modified version of the standard diffusion loss. This loss ensures that each custom handle can generate its intended concept, but it does not penalize the model if a handle becomes associated with more than one concept. Their key insight is that such entanglement can be penalized by additionally imposing a loss on the cross-attention maps, which are known to correlate with the scene layout. This additional loss makes each handle attend only to the regions covered by its target concept. They also propose several automatic metrics for the task in order to compare their method with the baselines.
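The two losses can be illustrated on toy 2x2 grids. The sketch below is one plausible reading of the description above, not the paper's exact formulation: it assumes the masked loss is a mean squared error restricted to masked pixels, and that the attention loss is the mean squared difference between each handle's attention map and its mask. The function names and the weight `attn_weight` are hypothetical.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over pixels where mask == 1; other pixels are ignored."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0

def attention_loss(attn_maps, masks):
    """Mean squared difference between each handle's attention map and its mask."""
    per_handle = []
    for attn, mask in zip(attn_maps, masks):
        sq = [
            (a - m) ** 2
            for a_row, m_row in zip(attn, mask)
            for a, m in zip(a_row, m_row)
        ]
        per_handle.append(sum(sq) / len(sq))
    return sum(per_handle) / len(per_handle)

noise = [[1.0, 2.0], [3.0, 4.0]]       # target noise
pred = [[1.0, 0.0], [3.0, 0.0]]        # prediction, wrong in the right column
left_column = [[1, 0], [1, 0]]
top_row = [[1, 1], [0, 0]]

rec_inside = masked_mse(pred, noise, left_column)  # errors fall outside the mask
rec_overlap = masked_mse(pred, noise, top_row)     # one error falls inside the mask

concept_mask = [[1, 0], [0, 0]]
focused = [[[1.0, 0.0], [0.0, 0.0]]]   # handle attends only to its concept
leaking = [[[1.0, 1.0], [0.0, 0.0]]]   # handle also attends to a neighbour

attn_weight = 0.01  # hypothetical weighting between the two terms
total = rec_overlap + attn_weight * attention_loss(leaking, [concept_mask])
```

Here `attention_loss(focused, [concept_mask])` is zero while the leaking map is penalized, which mirrors how the extra term discourages a handle from covering regions outside its target concept.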
They make the following contributions: (1) they introduce the novel task of textual scene decomposition; (2) they propose a novel method for this setting that balances concept fidelity and scene editability by learning a set of disentangled concept handles; and (3) they propose several automatic evaluation metrics and use them, together with a user study, to demonstrate the effectiveness of their approach. The user study shows that human evaluators also prefer their method. Finally, they present several applications of the technique.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.