Using The Diffusion Model, Google AI Is Able To Generate High Fidelity Images That Are Indistinguishable From Real Ones

Google's latest super-resolution research uses diffusion models to generate realistic high-resolution images from low-resolution inputs, making it difficult for humans to distinguish the synthetic images from real photos.

Google researchers have published a new approach to realistic image generation that pushes past previous limits on the image quality of diffusion model synthesis. It combines an iterative refinement super-resolution algorithm (SR3) with a class-conditional synthesis model called Cascaded Diffusion Models (CDM). The researchers report that the quality of the generated images is better than that of all current methods.

Natural image synthesis is one application of machine learning that can be used widely across many fields. One example is image super-resolution, in which a model is trained to convert low-resolution images into detailed high-resolution images. The researchers note that this has many uses: whether for family portraits or medical imaging systems, image quality can be significantly improved. Another image synthesis task is class-conditional image generation, in which a trained model generates sample images based on a class label supplied as input.

Usually, these image synthesis tasks are performed by deep generative models such as GANs, VAEs, and autoregressive models. However, each of these has drawbacks when trained to synthesize high-quality samples on complex, high-resolution datasets. GANs, for example, often suffer from unstable training and mode collapse, while autoregressive models typically suffer from slow synthesis speed.

First presented in 2015, diffusion models have recently seen a resurgence of interest because of their training stability and promising sample quality in image and audio generation. As a result, they may provide better trade-offs than other forms of deep generative models. Diffusion models work by progressively adding Gaussian noise to the training data, gradually wiping away detail until the data becomes pure noise, and then training a neural network to reverse this corruption process. Running the reversed process synthesizes data from pure noise by gradually denoising it until a clean sample is obtained. This synthesis procedure can be interpreted as an optimization algorithm that follows the gradient of the data density to produce likely samples.
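The corruption half of this process can be sketched in a few lines. The example below is a minimal illustration and not the researchers' code: the linear noise schedule, the 1000 steps, and the function names are assumptions borrowed from common diffusion-model practice, and the "image" is just a short list of pixel values.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_signal(betas):
    """alpha_bar[t] = product of (1 - beta_s) for s <= t: the signal variance left at step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I) in one shot."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * v + sigma * e for v, e in zip(x0, noise)]
    return x_t, noise

rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_signal(betas)
x0 = [0.5, -0.25, 1.0, 0.0]  # toy "image" as a flat list of pixel values

# Early steps keep almost all of the signal; the last step is almost pure noise.
for t in (0, 499, 999):
    x_t, _ = add_noise(x0, t, alpha_bars, rng)
    print(t, round(math.sqrt(alpha_bars[t]), 4), [round(v, 3) for v in x_t])
```

In training, a network would be shown `x_t` and `t` and asked to predict `noise`; that learned predictor is what makes the reverse, denoising direction possible.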

To summarize, diffusion models have attracted attention for their training stability and good sample quality in image and audio generation. They corrupt the training data by adding Gaussian noise until only noise remains, and then train a neural network to reverse that corruption step by step until a clean sample is left. The researchers describe this synthesis process as an optimization algorithm that follows the gradient of the data density to generate likely samples.

In Google's latest research, linking SR3 and CDM breaks the resolution bottleneck of diffusion-model image generation. By scaling up diffusion models and adding carefully chosen data augmentation techniques, the approach produces better results than existing methods. SR3 is a super-resolution diffusion model that takes a low-resolution image as input and builds the corresponding high-resolution image from pure noise. The model is trained on an image corruption process in which noise is progressively added to a high-resolution image until only pure noise remains. It then learns to reverse this process, starting from pure noise and progressively removing it under the guidance of the low-resolution input image.
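The guided reverse process can be illustrated with a toy sampling loop. This is only a sketch under stated assumptions: the update rule is the standard DDPM-style ancestral step (the article does not specify SR3's exact rule), the linear schedule is assumed, and `toy_noise_predictor` is a hypothetical stand-in for the trained network that simply treats the upsampled low-resolution input as the clean target.

```python
import math
import random

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (assumed) and its cumulative signal terms."""
    step = (beta_end - beta_start) / (num_steps - 1)
    betas = [beta_start + i * step for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars

def upsample_nearest(low_res, factor):
    """Repeat each value `factor` times (a crude stand-in for image upsampling)."""
    return [v for v in low_res for _ in range(factor)]

def toy_noise_predictor(x_t, condition, t, alpha_bars):
    """Hypothetical stand-in for the trained network.

    It assumes the clean image equals the upsampled low-resolution input
    and returns the noise that would explain x_t under that guess.
    """
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [(x - signal * c) / sigma for x, c in zip(x_t, condition)]

def sample_conditioned(low_res, factor, num_steps, rng):
    """Start from pure noise and denoise step by step, guided by the low-res input."""
    betas, alpha_bars = make_schedule(num_steps)
    condition = upsample_nearest(low_res, factor)
    x = [rng.gauss(0.0, 1.0) for _ in condition]
    for t in reversed(range(num_steps)):
        eps_hat = toy_noise_predictor(x, condition, t, alpha_bars)
        scale = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - scale * e) / math.sqrt(1.0 - betas[t]) for xi, e in zip(x, eps_hat)]
        if t > 0:
            # Re-inject a little noise on every step except the last one.
            sigma_t = math.sqrt(betas[t])
            x = [m + sigma_t * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x

high_res = sample_conditioned([0.2, -0.6], factor=4, num_steps=1000, rng=random.Random(1))
print([round(v, 4) for v in high_res])
```

Because the toy predictor knows nothing beyond the low-resolution input, the loop simply converges to the upsampled input. A trained network would instead fill in plausible high-frequency detail at each step, which is where the super-resolution comes from.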

SR3 performs very well. On super-resolution of human faces and natural images, images generated by SR3 reach a confusion rate of about 50%, meaning that subjects cannot reliably tell whether an image was generated by the model or captured by a camera. Since 50% is what random guessing would produce, SR3's outputs are very hard for humans to tell apart from real photos.

Having shown that SR3 can generate high-resolution images, the researchers then used SR3 models for class-conditional image generation. CDM is a class-conditional diffusion model trained on ImageNet data to generate high-resolution natural images. Because ImageNet is a complex, high-entropy dataset, the researchers built CDM as a cascade of multiple diffusion models.

The researchers explain that this cascading approach chains together multiple generative models spanning several spatial resolutions: one diffusion model generates low-resolution data, and it is followed by a sequence of SR3 super-resolution diffusion models that progressively increase the resolution.
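The structure of such a cascade can be sketched as a simple pipeline. Everything in this example is a placeholder chosen for illustration: the stage functions are hypothetical stand-ins rather than real diffusion models, and the 32 to 64 to 256 pixel sizes and the class label are assumptions not taken from the article.

```python
import random

def base_model(class_label, size, rng):
    """Hypothetical stand-in for the low-resolution class-conditional diffusion model."""
    # A real model would denoise from pure noise conditioned on class_label;
    # here we only return a random size x size grid so the pipeline runs.
    return [[rng.random() for _ in range(size)] for _ in range(size)]

def super_resolution_stage(image, factor, class_label):
    """Hypothetical stand-in for one SR3-style stage.

    A real stage would run conditional diffusion guided by `image`;
    this placeholder only repeats pixels to reach the target size.
    """
    rows = []
    for row in image:
        wide = [v for v in row for _ in range(factor)]
        rows.extend(list(wide) for _ in range(factor))
    return rows

def run_cascade(class_label, base_size, factors, rng):
    """Generate at low resolution, then pass the result through each upscaling stage."""
    image = base_model(class_label, base_size, rng)
    sizes = [len(image)]
    for factor in factors:
        image = super_resolution_stage(image, factor, class_label)
        sizes.append(len(image))
    return image, sizes

# class_label=207 is an arbitrary example value.
image, sizes = run_cascade(class_label=207, base_size=32, factors=(2, 4), rng=random.Random(0))
print(sizes)
```

The point of the design is that each stage only has to solve a smaller problem: the base model decides the overall content at low resolution, and each later stage only adds detail to an image it is given.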

The researchers evaluated the quality of the realistic samples generated by CDM using the FID (Fréchet Inception Distance) score and the classification accuracy score. Overall, the super-resolution images generated by SR3 surpass GANs in human evaluation, and CDM greatly exceeds the current top methods, BigGAN-deep and VQ-VAE-2, on both metrics.
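For reference, FID compares the statistics of Inception network features extracted from real and generated images. In its standard definition, which is not spelled out in the article, with feature means and covariances $(\mu_r, \Sigma_r)$ for real images and $(\mu_g, \Sigma_g)$ for generated images:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

A lower FID means the generated images' feature distribution is closer to that of real images, and the score is zero when the two sets of statistics match exactly.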

With SR3 and CDM, the performance of diffusion models has been pushed to the state-of-the-art on super-resolution and class-conditional ImageNet generation benchmarks.

Source: https://ai.googleblog.com/2021/07/high-fidelity-image-generation-using.html

Image Super-Resolution via Iterative Refinement: https://iterative-refinement.github.io/

Paper: https://arxiv.org/abs/2104.07636

Cascaded Diffusion Models for High Fidelity Image Generation: https://cascaded-diffusion.github.io/

Paper: https://cascaded-diffusion.github.io/assets/cascaded_diffusion.pdf

Sanskriti is currently pursuing her bachelor’s in Journalism, Psychology, and English and is enthusiastic about getting to know new people, uncovering their stories, and engaging with the atmosphere. She has an inclination towards news affairs, writing and teaching.
