Diffusion models became the de-facto solution for image generation tasks. They have outperformed generative adversarial networks (GANs) in multiple tasks. It is now possible to generate realistic-looking images with absurd prompts.
This realistic generation capability does not come for free, though. Diffusion models are extremely costly to train as they require tremendous data. Moreover, their run-time complexity is also another issue when it comes to using them.
It is nice to have a model to generate an almost unlimited number of different images in different concepts and settings. But, do we really need this capability all the time? There is a high chance that we want to generate an image or videos just for our specific idea. Or maybe we want to use the diffusion model to ask the “what if..” question about our favorite image or video. Could we achieve it with the existing diffusion models?
Well, yes, we can, in theory, but it would be really expensive. First, we would need to fine-tune the diffusion model using our desired input image or the video. This fine-tuning process will take so much time, plus we would need a lot of data about the image we want to use it for.
So what can we do? Should we avoid using custom diffusion models at all? Or should we just waste all these resources to come up with a solution? No, we don’t need to do any of them. There is a way to utilize the generation capabilities of diffusion models for our custom inputs without being extremely expensive. And that solution is called SinFusion.
SinFusion is a framework proposed for training diffusion models on a single input image or video. It utilizes the high-quality image generation capabilities of diffusion models while having several tricks to reduce the cost of fine-tuning. When you fine-tune the SinFusion, you can use it to generate new images/videos while maintaining the dynamics and concept of the input image/video.
SinFusion displays strong diversity for generating additional images from a single image, image editing, image generation from a sketch, and visual summaries for images. Moreover, it demonstrates video upsampling, video extrapolation (both forward and backward in time), and various generation of new videos from a single video.
But how does it achieve this? It is built on top of the Denoising diffusion probabilistic model (DDPM) architecture, which is commonly used in the literature.
For image generation, SinFusion introduces modifications to an existing DDPM structure. Instead of training on a large set of images, SinFusion is trained on a large set of random image crops using the input image. Moreover, the backbone UNet structure is modified to speed up the network.
For video generation, SinFusion uses a set of modified DDPM modules together. Namely, one for frame prediction to generate new frames, one for frame projection to ensure the frames generated by the predictor are correct, and finally, one for frame interpolation to increase the temporal resolution of the generated videos.
This was a brief summary of SinFusion. If you are interested in learning more, you can find relevant information at the links below.
Check out the Paper and Project. All Credit For This Research Goes To Researchers on This Project. Also, don’t forget to join our Reddit page and discord channel, where we share the latest AI research news, cool AI projects, and more.
Ekrem Çetinkaya received his B.Sc. in 2018, and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He received his Ph.D. degree in 2023 from the University of Klagenfurt, Austria, with his dissertation titled "Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning." His research interests include deep learning, computer vision, video encoding, and multimedia networking.