Huawei Research Introduces DiffFit For Efficiently Fine-Tuning Large Diffusion Models

One of the most important challenges in machine learning is modeling intricate probability distributions. Diffusion probabilistic models DPMs aim to learn the inverse of a well-defined stochastic process that progressively destroys information. 

Image synthesis, video production, and 3D editing are some areas where denoising diffusion probabilistic models (DDPMs) have shown their worth. As a result of their large parameter sizes and frequent inference steps per image, current state-of-the-art DDPMs incur high computational costs. In reality, not all users have access to sufficient financial means to cover the cost of computation and storage. Therefore, it is crucial to investigate strategies for effectively customizing publically available, big, pre-trained diffusion models for individual applications.

A new study by Huawei Noah’s Ark Lab researchers uses the Diffusion Transformer as a foundation and offers DiffFit, a straightforward and effective fine-tuning technique for large diffusion models. Recent NLP (BitFit) research has shown that adjusting the bias term can fine-tune a pre-trained model for downstream tasks. The researchers wanted to adapt these effective tuning strategies for image generation. They first immediately apply BitFi, and to improve feature scaling and generalizability, they incorporate learnable scaling factors to particular layers of the model, with a default value of 1.0 and dataset-specific tweaks. The empirical results indicate that including strategic places throughout the model is crucial for improving the Frechet Inception Distance (FID) score.

BitFit, AdaptFormer, LoRA, and VPT are only some of the parameter-efficient fine-tuning strategies the team used and compared over 8 downstream datasets. Regarding the number of trainable parameters and the FID trade-off, the findings show that DiffFit performs better than these other techniques. In addition, the researchers also found that their DiffFit strategy could be easily employed to fine-tune a low-resolution diffusion model, allowing it to adapt to high-resolution picture production at a cheap cost simply by treating high-resolution images as a distinct domain from low-resolution ones. 

DiffFit outperformed the prior state-of-the-art diffusion models on ImageNet 512×512 by starting with a pretrained ImageNet 256×256 checkpoint and fine-tuning DIT for only 25 epochs. DiffFit outperforms the original DiT-XL/2-512 model (which has 640M trainable parameters and 3M iterations) in terms of FID while having only roughly 0.9 million trainable parameters. It also requires 30% less time to train. 

Overall, DiffFit seeks to provide insight into the efficient fine-tuning of bigger diffusion models by establishing a simple and powerful baseline for parameter-efficient fine-tuning in picture production.

Check out the Paper. Don’t forget to join our 19k+ ML SubRedditDiscord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at

🚀 Check Out 100’s AI Tools in AI Tools Club

🐝 Join the Fastest Growing AI Research Newsletter Read by Researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft and many others...