Latest Computer Vision Research at Google and Boston University Proposes ‘DreamBooth,’ A Technique for Fine-Tuning a Text-to-Image Model with a very Limited Set of Images

In recent years, text-to-image models have evolved at an exciting speed. The quality of the results given by models such as Open AI’s DALL-E2 or Google’s Imagen was something that, just some years ago, we wouldn’t have even imagined. Nevertheless, it has always been an unresolved problem the idea of personalizing a pre-trained model with a very limited set of images. For example, a user has a bunch of pictures of his cat and wants to generate new images in new contexts having his cat as the subject. At the beginning of August, a research team from NVIDIA and Tel-Aviv University published a paper that tackles this problem using textual inversion, a very smart technique for fine-tuning large models on a limited set. After some weeks, DreamBooth’s paper came out, meaning that personalization is finally becoming a topic that researchers are dealing with. 

DreamBooth was introduced by a team composed of researchers from Google and Boston University and is based on a new method to personalize large pre-trained text-to-image models, in this case, Imagen, on a very limited set of images (~3-5). The general idea is quite simple: they want to expand the language-vision dictionary such that rare token identifiers are connected with a specific subject that the user wants to generate. To have a general idea before diving into the methods, the figure below shows some examples of personalization.


Wanting to associate a limited set of images with a unique identifier, the first step is how to represent the subject. Their approach was to label all input images of the new subject as “a [identifier] [class noun],” where [class noun] is a coarse class descriptor (e.g., dog or sunglasses in the image above). But how to construct the unique [identifier]? The most naïve approach is to use an existing word, such as “unique” or “special”. Still, a disadvantage of this idea is that the model spent much time forgetting the original meaning and substituting it with the new concept.  Another way to do this is to select random characters from the English language and concatenate them to generate a rare identifier (e.g., “xxy5syt00”). The problem is that the tokenizer could tokenize each letter separately, thus losing the sense of a unique identifier. The solution was found by finding rare tokens through a lookup in the vocabulary and using them as unique identifiers.


The two main problems that arise when fine-tuning a large model on a limited set are overfitting and, consequently, language drift. To solve the first problem, the authors fine-tuned all layers of the model, including the ones that are conditioned on the text embedding. This raised the issue of language drift, which happens when a language model progressively loses the knowledge of the language as it learns new concepts. As the authors used the [class noun], the risk is that the network would lose the ability to generate generic subjects of the same [class noun]. For example, if you use images of a specific dog, after fine-tuning, the network will always generate this dog even without using the [identifier]. 

Prior-preservation loss

To solve this issue, the authors proposed a class-specific prior-preservation loss. Briefly, the idea is to supervise the model with its own generated samples in order to not forget the original knowledge before the fine-tuning. This idea is clearer in the image below. The images from the limited set are associated with the sentence “a [identifier] [class noun]” and the diffusion model is trained with a reconstruction loss. In parallel, the original network is used to produce samples of the [class noun] and perform the same task using as text “a [class noun]”. Given the fact that the two networks share the weights, the final model would be able to not forget the original meaning of the class noun and overcome the problem of language drift. Finally, the authors added a super-resolution (SR) model fine-tuned on the subjects to produce high-quality outputs.



The authors tested the model in different tasks: recontextualization (shown at the beginning of this article), art renditions, expression manipulation, novel view synthesis, accessorization, and property modification. Some astonishing examples of these applications are shown in the images below. 

This Article is written as a research summary article by Marktechpost Staff based on the research paper 'DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation'. All Credit For This Research Goes To Researchers on This Project. Check out the paper and project.

Please Don't Forget To Join Our ML Subreddit
🐝 Join the Fastest Growing AI Research Newsletter Read by Researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft and many others...