Google AI Introduces Muse: A Text-To-Image Generation/Editing Model via Masked Generative Transformers

In recent years, generative image models that produce high-quality images from text prompts have advanced rapidly, driven by new deep learning architectures, novel training techniques such as masked modeling for language and vision tasks, and new generative model families such as diffusion and masking-based generation. In this work, the researchers present Muse, a text-to-image model that uses a masked image modeling approach built on the Transformer architecture. Muse is composed of several sub-models: VQGAN “tokenizer” models that encode and decode images as sequences of discrete tokens; a base masked image model that predicts the marginal distribution of masked tokens conditioned on the unmasked tokens and a T5-XXL text embedding; and a “superres” transformer model that translates low-resolution tokens into high-resolution tokens, also conditioned on the T5-XXL text embedding. They trained a series of Muse models of varying sizes, ranging from 632 million to 3 billion parameters, and found that conditioning on a pre-trained large language model is crucial for generating photorealistic, high-quality images.
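The masked-modeling objective at the heart of the base model can be sketched with a toy example. Everything below is illustrative: the token grid, codebook size, and mask ratio are made-up stand-ins for the learned VQGAN tokenizer and the training schedule, and the T5-XXL text conditioning is omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a VQGAN tokenizer output: a 16x16 grid of discrete
# token ids drawn from a codebook (the real tokenizer is learned).
CODEBOOK_SIZE = 1024
tokens = rng.integers(0, CODEBOOK_SIZE, size=(16, 16))

MASK_ID = CODEBOOK_SIZE  # special id marking a masked position

def mask_tokens(tokens, mask_ratio, rng):
    """Randomly replace a fraction of tokens with MASK_ID, as in masked
    image modeling: the model is trained to predict the original ids at
    the masked positions from the visible tokens (plus the text
    embedding, omitted in this toy sketch)."""
    flat = tokens.flatten().copy()
    n_mask = int(round(mask_ratio * flat.size))
    idx = rng.choice(flat.size, size=n_mask, replace=False)
    targets = flat[idx].copy()   # what the model must learn to predict
    flat[idx] = MASK_ID
    return flat.reshape(tokens.shape), idx, targets

masked, idx, targets = mask_tokens(tokens, mask_ratio=0.75, rng=rng)
print((masked == MASK_ID).sum())  # 192 of 256 positions masked
```

The training signal is simply cross-entropy on `targets` at the masked positions; everything else about the architecture lives in how those predictions are parameterized.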

Because Imagen and DALL-E 2 are cascaded pixel-space diffusion models, Muse is far more efficient than either; it can be likened to a discrete diffusion process with an absorbing state. Because Muse uses parallel decoding, it is also faster than Parti, a state-of-the-art autoregressive model. Based on experiments on comparable hardware, the researchers estimate that Muse is more than ten times faster at inference time than either Imagen-3B or Parti-3B and three times faster than Stable Diffusion v1.4, with all comparisons made on identically sized images of either 256×256 or 512×512. Muse is faster than Stable Diffusion even though both models operate in a VQGAN latent space; they attribute this to Stable Diffusion v1.4’s use of a diffusion model, which requires many more iterations at inference time. Importantly, Muse’s increased efficiency does not come at the expense of the generated images’ quality or semantic accuracy.
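The parallel-decoding idea behind that speedup can be sketched as an iterative refinement loop in the style of MaskGIT: start from an all-masked grid and, at each step, commit the model’s most confident predictions in parallel instead of emitting one token at a time. The model call, schedule, and step count below are illustrative assumptions, not the paper’s exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_TOKENS = 256           # e.g. a 16x16 latent grid
CODEBOOK_SIZE = 1024
MASK_ID = CODEBOOK_SIZE
STEPS = 8                  # parallel steps vs. 256 autoregressive ones

def fake_model(tokens):
    """Random stand-in for the base transformer: returns a sampled token
    and a confidence score per position. The real model conditions on
    the text embedding and the visible tokens."""
    samples = rng.integers(0, CODEBOOK_SIZE, size=tokens.shape)
    confidence = rng.random(tokens.shape)
    return samples, confidence

tokens = np.full(NUM_TOKENS, MASK_ID)
for step in range(STEPS):
    samples, conf = fake_model(tokens)
    # Cosine schedule: how many positions stay masked after this step.
    keep_masked = int(np.floor(NUM_TOKENS * np.cos(np.pi / 2 * (step + 1) / STEPS)))
    still_masked = tokens == MASK_ID
    conf = np.where(still_masked, conf, np.inf)  # never re-mask committed tokens
    order = np.argsort(conf)                      # lowest confidence first
    commit = np.ones(NUM_TOKENS, dtype=bool)
    commit[order[:keep_masked]] = False           # defer the least confident
    tokens = np.where(commit & still_masked, samples, tokens)

print((tokens == MASK_ID).sum())  # 0: all 256 tokens decoded in 8 steps
```

Eight forward passes replace hundreds of sequential sampling steps, which is where the claimed inference-time advantage over autoregressive and diffusion models comes from.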

They evaluate their work using the FID and CLIP scores: the former measures the quality and diversity of generated images, and the latter measures how well images match their text prompts. Their 3B-parameter model outperforms previous large-scale text-to-image models with a CLIP score of 0.32 and an FID score of 7.88 on the COCO zero-shot validation set. When trained and evaluated on the CC3M dataset, their 632M (+268M) parameter model obtains a state-of-the-art FID score of 6.06, much lower than any other result reported in the literature.
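For reference, the per-sample CLIP score is just the cosine similarity between CLIP’s image embedding of a generated sample and its text embedding of the prompt; a minimal sketch, where the vectors below are placeholders rather than real CLIP outputs:

```python
import numpy as np

def clip_score(image_emb, text_emb):
    """Cosine similarity between an image embedding and a text embedding;
    higher means better image-text alignment. Real CLIP scores average
    this over many generated samples and prompts."""
    i = image_emb / np.linalg.norm(image_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return float(i @ t)

# Placeholder embeddings (real CLIP embeddings are ~512-d or larger).
print(clip_score(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 1.0
print(clip_score(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0
```

FID, by contrast, compares feature statistics of generated and real image sets as a whole, which is why the two metrics capture complementary failure modes.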


According to evaluations by human raters on the PartiPrompts assessment suite, Muse generates images that are better aligned with the text prompt 2.7 times more often than Stable Diffusion v1.4. Muse renders the nouns, verbs, adjectives, and other parts of speech in input captions, demonstrates awareness of compositionality, cardinality, and other multi-object properties, and shows an understanding of visual style. Muse’s mask-based training also enables a variety of zero-shot image-editing capabilities, depicted in the figure below, including mask-free editing, text-guided inpainting, and outpainting.

Examples of zero-shot, text-guided image editing with Muse. Without any fine-tuning, a variety of editing applications apply the Muse text-to-image generative model to real input photos. Every edited image is 512×512.
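In token space, how masking turns into editing can be sketched simply: inpainting amounts to masking the tokens of the region to regenerate and letting the masked-prediction model refill them under a new prompt, while everything outside the region is preserved. The grid, the edit region, and the random “model” below are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK_SIZE = 1024
MASK_ID = CODEBOOK_SIZE

# Tokens of an existing image (from the VQGAN encoder in the real model).
image_tokens = rng.integers(0, CODEBOOK_SIZE, size=(16, 16))

# Text-guided inpainting: mask the region to regenerate and let the
# masked-prediction model fill it in, conditioned on a new prompt.
edit = image_tokens.copy()
edit[4:12, 4:12] = MASK_ID   # region chosen by the user's mask

# The real system would run iterative parallel decoding here; random
# samples stand in for it to keep the sketch self-contained.
filled = np.where(edit == MASK_ID,
                  rng.integers(0, CODEBOOK_SIZE, size=edit.shape),
                  edit)

# Tokens outside the mask are untouched; only the masked area changed.
print(np.array_equal(filled[:4], image_tokens[:4]))  # True
```

Outpainting is the same operation with the mask placed outside the original image boundary, which is why both fall out of the training objective with no fine-tuning.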

Check out the Paper and Project. All credit for this research goes to the researchers on this project.

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.