Nowadays, text-to-image synthesis is gaining a lot of popularity. A diffusion probabilistic model is a class of latent variable models that have arisen to be state-of-the-art on this task. Different models have been proposed lately, like DALLE-2, Imagen, Stable Diffusion, etc., which are surprisingly good at generating hyper-realistic images from a given text prompt. But how are they able to do so? Why are some models better than others in terms of image quality, speed, and reliability? How can we further improve these models? These are some questions the author of the paper has tried to answer.
Diffusion probabilistic models (DPMs) have achieved impressive results in high-resolution image synthesis. A technique called guidance sampling is used to improve the sample quality of DPMs. The denoising diffusion implicit model (DDIM) is a commonly used fast guidance sampler.
DDIM is a first-order diffusion ordinary differential equation (ODE) solver that needs around 100–250 steps to generate high-quality samples. Some higher-order solvers are faster at sampling when they don’t have guidance, but when they do have guidance, they become unstable and can even be slower than DDIM for high guidance scale values.
So now the question arises: why is guidance so important?
The answer to this question is simple: guidance helps to improve the quality of samples generated by the model by enforcing some conditions, like aligning a generated image more with the text prompt. Indeed, it costs some diversity in the generated samples, but we can adjust the guidance scale to get a good trade-off between diversity and fidelity.
Let’s first understand how does DPMs work?
The sampling procedures of DPMs gradually remove the noise from pure Gaussian random variables to obtain clear data. This sampling can be done by either discretizing diffusion SDEs or diffusion ODEs, which are also defined in two ways: a parameterized noise prediction model and a data prediction model.
ODE solvers generally take 100–250 steps to converge, while high-order diffusion SDEs can generate high-quality samples in just 10–20 steps when sampling without guidance. Then why not always use higher-order diffusion SDEs for sampling?
There are two challenges that we face while applying high-order solvers:
- The large guidance scale narrows down the convergence radius of high-order solvers, making them unstable.
- The converged solution falls into a different range than the original data. It is also known as a Train-test mismatch.
The given training data is bounded, but a large guidance scale may push the conditional noise prediction model away from the true noise, hence resulting in the sample falling out of bounds of the training data, which means the samples look unrealistic (as some are shown in Fig. 1). Large values of the guidance scale may amplify the output and the higher derivatives, which are more sensitive to amplifications. The derivatives affect the convergence range of the ODE solvers, and since the derivatives have been amplified, it is intuitive that it may need high-order solvers with small step sizes to converge.
Now, how do they tackle these two challenges?
The authors proposed a high-order, training-free diffusion ODE solver for faster-guided sampling, which they named DPM-Solver++, to address the first challenge. DPM-Solver++ is designed for data prediction models, unlike the previous high-order solvers designed for noise prediction models. One of the reasons to choose the data prediction model is that thresholding methods can be used to keep the sample bounded in the same range as the original data, which is the second challenge we have to cope with. There are two versions of DPM-Solver++ that are proposed; one is DPM-Solver++(2S), which is a second-order single-step solver, while the other is DPM-Solver++(2M) which is a multistep second-order solver. The latter deals with the instability problem of the high-order solvers. Now you must be wondering what is the difference between the two and what makes a multistep solver better than a single-step version. Here is the answer, Suppose we can only evaluate the data prediction model N-times. Then, according to the algorithm for the kth order single-step method, it can only use M=N/k steps. While for the same N evaluations, the multistep process can use M=N steps as it uses previously calculated values of x̃t1 and x̃t2 for calculating higher order derivatives where t1 =ti-1 and t2= ti-2 instead of just discarding them like a single-step solver. This way, the information from the previous steps is not lost, thereby providing stability to the high-order solver. In conclusion, from the results, multistep methods are slightly better than single-step methods. For large guidance scales, the multistep DPM-Solver++(2M) performs better than DPM-Solver++(2S), while for marginally smaller guidance scales single-step solver performs better than the multistep solver.
Comparing with previous high-order samplers (DEIS, PNDM, and DPM-Sampler), they found that DPM-Solver++ achieves the best convergence speed and stability for both large and small guidance scales.
DPM-Solver++ is capable of converging within 15 number of function evaluations. We can use DPM-Solver++ with both pixel-space DPMs as well as Latent-space DPMs. But in the case of latent-space DPMs, we do not use the thresholding method as the latents are unbounded.
This Article is written as a research summary article by Marktechpost Staff based on the research paper 'DPM-SOLVER++: FAST SOLVER FOR GUIDED SAMPLING OF DIFFUSION PROBABILISTIC MODELS '. All Credit For This Research Goes To Researchers on This Project. Check out the preprint paper, and code.
Vineet Kumar is a consulting intern at MarktechPost. He is currently pursuing his BS from the Indian Institute of Technology(IIT), Kanpur. He is a Machine Learning enthusiast. He is passionate about research and the latest advancements in Deep Learning, Computer Vision, and related fields.