Text-to-Image generation using diffusion models has been a hot topic in generative modeling for the past few years. Diffusion models are capable of generating high-quality images of concepts learned during training, but those training datasets are very large and not personalized. Now users want some personalization in these models; instead of generating images of a random dog at some place, the user wants to create images of their dog at some place in their house. One straightforward solution to this problem is retraining the model by involving the new information in the dataset. But there are certain limitations to it: First, for learning a new concept, the model needs a very large amount of data, but the user can only have up to a few examples. Second, retraining the model whenever we need to learn a new concept is highly inefficient. Third, learning new concepts will result in forgetting the previously learned concepts.
To address these limitations, a team of researchers from Carnegie Mellon University, Tsinghua University, and Adobe Research proposes a method to learn multiple new concepts without the need to retrain the model completely, only using a few examples. They listed their experiments and findings in the paper “Multi-Concept Customization of Text-to-Image Diffusion.”
In this paper, the team proposed a fine-tuning technique, Custom Diffusion for the text-to-image diffusion models, which identifies a small subset of model weights such that fine-tuning only those weights is enough to model the new concepts. At the same time, it prevents catastrophic forgetting and is highly efficient as only a very small number of parameters are being trained. To further avoid forgetting, intermixing similar concepts, and overfitting to the new concept, a small set of real images with a caption similar to the target images is chosen and fed to the model while fine-tuning (Figure 2).
The method is built on Stable Diffusion, and up to 4 images are used as training examples while fine-tuning.
We got that fine-tuning only a small set of parameters is effective and highly efficient, but how do we choose those parameters, and why does it work?
The idea behind this answer is simply an observation from experiments. The team trained the complete models on the dataset involving new concepts and carefully observed how the weights of different layers changed. The result of the observation was weights of Cross-Attention layers were affected the most, implying it plays a significant role while fine-tuning. The team leveraged that and concluded that the model could be customized significantly by only fine-tuning the cross-attention layers. And it works magnificently.
In addition to this, there is another important component in this approach: The regularisation dataset. Since we are using only a few samples for fine-tuning, the model can overfit the target concept and lead to language drift. For example, training on “moongate” will lead to the model forgetting the association of “moon” and “gate” with the previously learned concepts. To avoid this, a set of 200 images is selected from the LAION-400M dataset with corresponding captions that are highly similar to the target image captions. By fine-tuning on this dataset, the model learns the new concept while also revising the previously learned concepts. Hence, avoiding forgetting and intermixing of concepts (Figure 5).
The following figures and tables shows results of the papers:
This paper concludes that Custom Diffusion is an efficient method for
augmenting existing text-to-image models. It can quickly acquire a new concept given only a few examples and compose multiple concepts together in novel settings. The authors found that optimizing just a few parameters of the model was sufficient to represent these new concepts while still being memory and computationally efficient.
However, there are some limitations of pretrained models that the fine-tuned model inherits. As shown in Figure 11, Tough compositions, e.g., A tortoise plushy and a teddy bear, remains challenging. Moreover, composing three or more concepts is also problematic. Addressing these limitations can be a future direction for research in this field.
Check out the Paper. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our Reddit Page, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.
Vineet Kumar is a consulting intern at MarktechPost. He is currently pursuing his BS from the Indian Institute of Technology(IIT), Kanpur. He is a Machine Learning enthusiast. He is passionate about research and the latest advancements in Deep Learning, Computer Vision, and related fields.