Text-to-image generation has come a long way in recent years. Models trained with large-scale data and parameter counts can now vividly depict the visual scene described by a text prompt, allowing anyone to create exquisite images without sophisticated drawing skills. Diffusion models are gaining popularity among image generation approaches due to their ability to generate highly photorealistic images from text prompts. Text-to-image diffusion models such as LDM, GLIDE, DALL-E 2, and Imagen have demonstrated impressive performance in both text relevance and image fidelity. Given a text prompt, these models use iterative denoising steps to convert Gaussian noise into an image that conforms to the prompt. Despite these advances, existing methods for exploring diffusion models are still in their early stages, and a closer look at the theory and implementation of text-to-image diffusion models reveals numerous opportunities to improve the quality of generated images.
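The iterative denoising described above can be sketched in a few lines. The following is a generic DDPM-style sampling loop with a placeholder noise predictor standing in for the text-conditioned U-Net; the schedule values and the `toy_denoiser` function are illustrative assumptions, not ERNIE-ViLG 2.0's actual configuration.

```python
import numpy as np

def toy_denoiser(x, t, text_emb):
    """Hypothetical noise predictor; a real model is a text-conditioned U-Net."""
    return 0.1 * x  # placeholder prediction of the added noise

def sample(text_emb, shape=(8, 8), steps=50, seed=0):
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)  # illustrative noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = toy_denoiser(x, t, text_emb)  # predict the noise at step t
        # DDPM posterior mean: remove the predicted noise contribution
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # inject fresh noise at every step except the last
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

img = sample(text_emb=None)
```

Each pass through the loop slightly denoises the current image, so the sample gradually morphs from pure noise into a picture that matches the prompt encoded in `text_emb`.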
First, during the learning process of each denoising step, all text tokens interact with all image regions, and all image regions contribute equally to the final loss function. However, a visual scene described by text and image contains many elements (textual words and visual objects), and different elements usually vary in importance for expressing the scene semantics. This indiscriminate learning process may cause the model to miss key features and interactions in the scene, exposing it to the risk of text-image misalignment, such as the attribute confusion problem, which is especially problematic for text prompts containing multiple objects with specific attributes.
Second, broadening the view from individual steps to the entire denoising process reveals that different denoising stages have different requirements. Existing models typically employ a single U-Net for all stages, implying that the same parameters must learn various denoising capabilities. In the early stages, the input images are highly noisy, and the model must outline the semantic layout and skeleton out of almost pure noise. In the later steps, close to the final image output, denoising primarily means refining the details of a nearly complete image.
To address these issues, the researchers propose ERNIE-ViLG 2.0, an improved text-to-image diffusion model with a knowledge-enhanced mixture of denoising experts that incorporates additional knowledge about the visual scene and decouples denoising capabilities across steps. In particular, they use a text parser and an object detector to extract the key elements of the scene from the input text-image pair and then guide the model to pay more attention to their alignment during the learning process, in the hope that the model can better handle the relationships between various objects and attributes.
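One way to read the "pay more attention to key elements" idea is as a weighted denoising loss. The sketch below up-weights the reconstruction error for pixels inside object regions supplied by an external detector; the `weighted_denoising_loss` name, the box format, and the boost value are hypothetical illustrations, not the paper's exact formulation.

```python
import numpy as np

def weighted_denoising_loss(pred_noise, true_noise, object_boxes, boost=5.0):
    """MSE over all pixels, with detected key-object regions up-weighted.

    object_boxes: list of (y0, y1, x0, x1) regions from an object detector
    (illustrative format). boost: assumed extra weight for key regions.
    """
    weights = np.ones_like(true_noise)
    for (y0, y1, x0, x1) in object_boxes:
        weights[y0:y1, x0:x1] = boost  # pay more attention to key regions
    sq_err = (pred_noise - true_noise) ** 2
    return float((weights * sq_err).sum() / weights.sum())
```

With an empty box list, this reduces to a plain mean-squared error, so the weighting only changes behavior where the detector has flagged key objects.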
Using the extra knowledge from the visual scene and the mixture-of-denoising-experts mechanism, they train ERNIE-ViLG 2.0 and scale the model to 24B parameters. Specifically, they divide the denoising steps into stages and assign a specific denoising "expert" to each stage. Because only one expert is activated at each denoising step, the model can involve more parameters and better learn the data distribution of each denoising stage with a mixture of multiple experts. Experiments on MS-COCO show that the model outperforms previous text-to-image models, achieving a new state-of-the-art zero-shot FID-30k score of 6.75, and detailed ablation studies validate the contribution of each proposed strategy.
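The mixture-of-denoising-experts mechanism can be sketched as a routing function: the denoising steps are split into contiguous stages, each handled by its own expert network, and only one expert runs per step, so total parameters grow with the number of experts while per-step compute stays constant. The uniform stage split and the tiny matrix "experts" below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def make_experts(n_experts, seed=0):
    rng = np.random.default_rng(seed)
    # each "expert" is a stand-in for a stage-specific denoising U-Net
    return [rng.standard_normal((4, 4)) for _ in range(n_experts)]

def route(t, total_steps, n_experts):
    """Map timestep t (0 = final refinement step) to an expert index."""
    return min(t * n_experts // total_steps, n_experts - 1)

def denoise_step(x, t, experts, total_steps):
    expert = experts[route(t, total_steps, len(experts))]  # activate exactly one expert
    return x @ expert  # placeholder for the expert's noise prediction

experts = make_experts(n_experts=10)
```

Under this routing, the noisiest early timesteps (large `t`) are served by the last experts, which can specialize in layout and skeleton, while timesteps near the output are served by experts that specialize in detail refinement.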
Beyond automatic metrics, they collect ViLG-300, a set of 300 bilingual text prompts for assessing the quality of generated images from various perspectives that allows a fair comparison of English and Chinese text-to-image models. Human evaluation results show that ERNIE-ViLG 2.0 outperforms other recent methods, such as DALL-E 2 and Stable Diffusion, by a significant margin in both image-text alignment and image fidelity.
To summarise, the following are the main contributions of this work:
• ERNIE-ViLG 2.0 incorporates textual and visual knowledge into the text-to-image diffusion model, improving fine-grained semantic control and alleviating the problem of object-attribute mismatching in generated images.
• ERNIE-ViLG 2.0 proposes a mixture-of-denoising-experts mechanism that refines the denoising process by adapting to the characteristics of different denoising steps, and scales the model up to 24B parameters, making it the largest text-to-image model at the time of release.
• ERNIE-ViLG 2.0 achieves a new state-of-the-art zero-shot FID-30k score of 6.75 on the MS-COCO dataset and outperforms DALL-E 2 and Stable Diffusion in human evaluation on the Chinese-English bilingual prompt set ViLG-300.
The demo for the paper is available on Hugging Face. The input prompt is first translated into Chinese, and then the image is generated.
This article is written as a research summary by Marktechpost staff based on the research paper 'ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts'. All credit for this research goes to the researchers on this project. Check out the paper and code.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.