A New AI Research From MIT Reduces Variance in Denoising Score-Matching, Improving Image Quality, Stability, and Training Speed in Diffusion Models

Diffusion models have recently produced outstanding results on a variety of generative tasks, including the creation of images, 3D point clouds, and molecular conformers. Itô stochastic differential equations (SDEs) provide a unified framework that encompasses these models. The models learn time-dependent score fields through score-matching, and these score fields then guide the reverse SDE during generative sampling. Variance-exploding (VE) and variance-preserving (VP) SDEs are common variants of diffusion models, and EDM, which builds on these formulations, offers the best performance to date. Despite these outstanding empirical results, the existing training method for diffusion models can still be improved.
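For reference, the denoising score-matching objective mentioned above is usually written as shown below. The notation is an assumption borrowed from the standard score-based SDE literature rather than from this article: s_θ is the score network, p_0 the data distribution, p_{t|0} the Gaussian perturbation kernel with noise scale σ_t (the VE case), and λ(t) a weighting function.

```latex
\mathcal{L}_{\mathrm{DSM}}(\theta)
  = \mathbb{E}_{t}\, \lambda(t)\,
    \mathbb{E}_{x \sim p_0}\,
    \mathbb{E}_{x(t) \sim p_{t|0}(\cdot \mid x)}
    \left[ \left\| s_\theta(x(t), t)
      - \nabla_{x(t)} \log p_{t|0}(x(t) \mid x) \right\|_2^2 \right],
\qquad
\nabla_{x(t)} \log p_{t|0}(x(t) \mid x) = \frac{x - x(t)}{\sigma_t^2}
\quad \text{for } p_{t|0}(x(t) \mid x) = \mathcal{N}\!\left(x(t);\, x,\, \sigma_t^2 I\right).
```

Each training target here depends on the single clean sample x that produced x(t), which is where the variance discussed in the rest of the article comes from.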

The researchers propose the Stable Target Field (STF) objective, a generalized version of the denoising score-matching (DSM) objective. Their motivation is that the high variance of the DSM objective’s training targets can result in subpar performance. To better understand the cause of this variance, they divide the score field into three regimes. According to their analysis, the phenomenon mostly occurs in the intermediate regime, where multiple modes or data points have a similar influence on the scores. In other words, in this regime it is ambiguous which data point a noisy sample produced by the forward process originated from. Figure 1(a) illustrates the differences between the DSM objective and the proposed STF objective.

Figure 1: Illustration of the differences between the DSM objective and the proposed STF objective.

Although their sources (red box) are far apart from one another, the “destroyed” images (blue box) are close together. Even though the true score is, in expectation, the weighted average of the v_i, the individual training updates of the DSM objective have high variance, which the STF objective considerably reduces by using a large reference batch (yellow box).

The idea is to use an additional reference batch of examples as targets when computing weighted conditional scores. The contribution of each example in the reference batch is aggregated using self-normalized importance sampling. This approach can significantly reduce the variance of the training targets, particularly in the intermediate regime (Figure 1(b)), but it does introduce some bias. However, they show that as the size of the reference batch grows, both the bias and the trace-of-covariance of the STF training targets go to zero. Experimentally, they show that the STF objective, when incorporated into EDM, achieves new state-of-the-art performance on CIFAR-10 unconditional generation: an FID score of 1.90 with 35 network evaluations.
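The weighting scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes a VE-style Gaussian kernel N(x_i, sigma^2 I), and the function names `dsm_target` and `stf_target` are hypothetical. The reference batch is assumed to contain the clean sample that the noisy point was generated from.

```python
import numpy as np


def dsm_target(x_t, x_0, sigma):
    """Conditional score of N(x_0, sigma^2 I) at x_t: the usual DSM target."""
    return (x_0 - x_t) / sigma**2


def stf_target(x_t, ref_batch, sigma):
    """Self-normalized importance-weighted average of conditional scores.

    x_t:       (d,) noisy sample
    ref_batch: (n, d) reference batch, assumed to include the source of x_t
    """
    diffs = ref_batch - x_t                              # (n, d)
    # Log of the Gaussian kernel density at x_t, up to a shared constant.
    log_w = -np.sum(diffs**2, axis=1) / (2 * sigma**2)
    log_w -= log_w.max()                                 # numerical stability
    w = np.exp(log_w)
    w /= w.sum()                                         # self-normalization
    return w @ (diffs / sigma**2)                        # (d,)


# Toy usage on random 2D data (illustrative values only).
rng = np.random.default_rng(0)
data = rng.normal(size=(8, 2))
sigma = 1.5
x_t = data[0] + sigma * rng.normal(size=2)
print("DSM target:", dsm_target(x_t, data[0], sigma))
print("STF target:", stf_target(x_t, data, sigma))
```

With a reference batch of size one, the weighted average collapses to the single conditional score, so the sketch reduces to the ordinary DSM target in that case.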

In most cases, STF also improves the FID/Inception scores of other score-based model variants, such as VE and VP SDEs. Additionally, it improves the stability of converged score-based models on CIFAR-10 and CelebA 64×64 across random seeds and helps prevent the generation of noisy images in VE. STF also speeds up the training of score-based models while achieving the same or better FID scores (a 3.6× speed-up for VE on CIFAR-10). To the best of their knowledge, STF is the first method for accelerating the training of diffusion models. They also illustrate the detrimental impact of excessive variance and show that performance improves as the reference batch size increases.

The following is a summary of their contributions: 

(1) They characterize the part of the forward process, called the intermediate phase, where the score-learning targets are most variable.

(2) They propose a generalized score-matching objective, the Stable Target Field, which provides more stable training targets.

(3) They analyze the behavior of the new objective and show that it is asymptotically unbiased and, under mild conditions, reduces the trace-of-covariance of the training targets in the intermediate phase by a factor related to the reference batch size.

(4) They support the theoretical arguments with empirical evidence and show that the proposed STF objective improves the performance, stability, and training efficiency of score-based methods.

In particular, when paired with EDM, it achieves a new state-of-the-art FID score on the CIFAR-10 benchmark.
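The variance claim in contribution (3) can be checked on a toy problem. The sketch below is an assumption-laden illustration, not an experiment from the paper: it uses a hypothetical two-cluster 2D dataset, a fixed noisy point halfway between the clusters (an intermediate-regime situation), and a Gaussian kernel with sigma = 1. The helper `target_variance` is a made-up name; it estimates the trace of covariance of the training target by Monte Carlo for several reference batch sizes, with batch size 1 corresponding to plain DSM.

```python
import numpy as np


def target_variance(data, x_t, sigma, n_ref, n_trials, rng):
    """Monte Carlo estimate of the trace of covariance of the target at x_t."""
    m = data.shape[0]
    # Posterior over which data point produced x_t under the Gaussian kernel.
    log_post = -np.sum((data - x_t) ** 2, axis=1) / (2 * sigma**2)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # First reference sample: the source of x_t, drawn from the posterior.
    src = rng.choice(m, size=(n_trials, 1), p=post)
    # Remaining reference samples: drawn uniformly from the dataset.
    rest = rng.integers(0, m, size=(n_trials, n_ref - 1))
    ref = data[np.concatenate([src, rest], axis=1)]      # (trials, n_ref, d)
    diffs = ref - x_t
    log_w = -np.sum(diffs**2, axis=2) / (2 * sigma**2)
    log_w -= log_w.max(axis=1, keepdims=True)
    w = np.exp(log_w)
    w /= w.sum(axis=1, keepdims=True)                    # self-normalized weights
    targets = np.einsum("tn,tnd->td", w, diffs) / sigma**2
    return targets.var(axis=0).sum()


rng = np.random.default_rng(0)
centers = np.array([[-1.0, 0.0], [1.0, 0.0]])
data = np.repeat(centers, 128, axis=0) + 0.05 * rng.normal(size=(256, 2))
x_t = np.zeros(2)  # equidistant from both clusters

variances = {n: target_variance(data, x_t, 1.0, n, 5000, rng) for n in (1, 4, 16, 64)}
for n, v in variances.items():
    print(f"reference batch size {n:3d}: trace of covariance = {v:.4f}")
```

In this toy setting the estimated variance shrinks as the reference batch grows, which is consistent with the behavior the article attributes to the STF objective.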


Check out the Paper and GitHub. All credit for this research goes to the researchers on this project.

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.
