Large-scale text-conditioned image generation models have shown impressive results in recent years, producing realistic-looking images from a text prompt. These models are trained on extremely large collections of image-caption pairs.
Because of the strong semantic prior learned from this huge collection of image-caption pairs, such models can even produce novel compositions of known concepts. For example, you can ask them to generate “Pikachu as a gladiator, riding a horse in space,” and they will produce a relatively realistic image, despite the combination being totally new. But while the combination is new, each individual object in it is known to the model through its learned priors.
Moreover, it is also possible to transfer styles. You can take the Mona Lisa and ask the model to transfer the style of Van Gogh’s “The Starry Night.” This preserves the structure of the Mona Lisa while changing its style, so in the end you get an image of the Mona Lisa painted in Van Gogh’s style.
However, what if you wanted to mix two objects? Unlike style transfer, where the shape of the object is preserved, semantic mixing aims to blend two separate concepts into a meaningful new one. What if we want to semantically fuse two distinct ideas to synthesize a fresh concept? This is what MagicMix tries to achieve.
Semantic mixing is conceptually different from other image editing and generation tasks. In style transfer, for example, the content image is preserved while the style of another image is transferred onto it. Compositional generation mixes multiple components into a single, more complex scene. It can be confused with semantic mixing, but in compositional generation, each individual component is already known. In semantic mixing, on the other hand, the result is unknown: we are trying to fuse two concepts into a new object, and nobody knows what a corgi-like coffee machine should look like.
Of course, such a task is difficult, since even a human may not know what the mixed object should look like in the end. What would you draw if you were asked to combine “a corgi and a coffee machine”?
To solve this, the authors propose MagicMix, a novel method built on text-conditioned, diffusion-based image generation models.
You can think of MagicMix as an extension of a diffusion model. It does not require any fine-tuning or user-provided masks. Instead, it directly exploits the underlying structure of the diffusion denoising process to achieve semantic mixing.
Diffusion models generate images progressively, meaning that different properties of the output emerge at different denoising steps. For example, the rough layout of the image usually appears during the early denoising steps, while fine-grained semantic content appears toward the end. This observation is the foundation of how MagicMix works.
The semantic mixing task is split into two stages: mixing layout (shape and color) semantics and mixing content semantics. Imagine we are trying to mix a corgi and a coffee machine. MagicMix first obtains an approximate layout of the final image, either by corrupting a real photo of a corgi with noise or by partially denoising pure noise conditioned on the text prompt “a real photo of a corgi.” Once the layout is ready, it injects the new concept (“coffee machine”) and continues the denoising process until the final synthesized image is ready.
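The two stages above can be sketched in code. The following is a minimal toy illustration, not the authors’ implementation: `denoise_step` is a hypothetical stand-in for a real text-conditioned denoiser (e.g., a diffusion UNet), and the step counts `k_max`/`k_min` and the noise schedule are made-up placeholders. It only shows the control flow: corrupt the layout image to an intermediate noise level, denoise under the layout prompt for a while, then switch conditioning to the content prompt.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_sample(x0, alpha_bar):
    """DDPM-style forward process: corrupt x0 to noise level alpha_bar."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def denoise_step(x, prompt, step, total):
    """Hypothetical stand-in for one text-conditioned denoising step.
    A real diffusion model's UNet prediction would go here."""
    target = np.full_like(x, float(hash(prompt) % 7))  # fake prompt-dependent target
    return x + (target - x) / (total - step + 1)

def magic_mix(layout_image, layout_prompt, content_prompt,
              total_steps=50, k_max=30, k_min=15):
    """Sketch of the two stages:
    1) obtain a rough layout by corrupting the layout image to an
       intermediate noise level (here: the level of step total-k_max),
    2) keep denoising; once past step total-k_min, condition on the
       content prompt instead of the layout prompt."""
    alphas_bar = np.linspace(0.99, 0.01, total_steps)  # toy noise schedule
    x = q_sample(layout_image, alphas_bar[total_steps - k_max])  # stage 1
    for step in range(total_steps - k_max, total_steps):
        # early steps preserve layout; later steps inject the new concept
        prompt = layout_prompt if step < total_steps - k_min else content_prompt
        x = denoise_step(x, prompt, step, total_steps)  # stage 2
    return x

mixed = magic_mix(np.zeros((8, 8)), "a real photo of a corgi",
                  "a photo of a coffee machine")
print(mixed.shape)  # (8, 8)
```

Note that this sketch simply switches prompts at a fixed step; the actual method blends the influence of the two concepts over an interval of denoising steps rather than making a hard cut.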
MagicMix shows a solid capability to generate novel concepts. It supports a wide range of applications, such as semantic style transfer, novel object synthesis (e.g., generating a mug that looks like bread), breed mixing (e.g., generating a new species by mixing a cat and a giraffe), and concept removal. Despite being a simple approach, MagicMix opens a new direction in computer graphics.
It also has some limitations that can be addressed in future work. For example, trying to mix two concepts with no shape similarity, such as a cat and a toilet roll, results in a direct composition: you get an image of a toilet roll with a cat’s face on it.
This was a brief summary of MagicMix, a semantic mixing method for diffusion models. You can find more information in the links below if you are interested in learning more.
This Article is written as a research summary article by Marktechpost Staff based on the research paper 'MagicMix: Semantic Mixing with Diffusion Models'. All Credit For This Research Goes To Researchers on This Project. Check out the paper, code and project.
Ekrem Çetinkaya received his B.Sc. in 2018, and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He received his Ph.D. degree in 2023 from the University of Klagenfurt, Austria, with his dissertation titled "Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning." His research interests include deep learning, computer vision, video encoding, and multimedia networking.