Meet PIXART-α: A Transformer-Based T2I Diffusion Model Whose Image Generation Quality is Competitive with State-of-the-Art Image Generators

A new era of photorealistic image synthesis has begun with the development of text-to-image (T2I) generative models such as DALL·E 2, Imagen, and Stable Diffusion. These models have significantly influenced many downstream applications, including image editing, video generation, and 3D asset creation. However, such sophisticated models require enormous computing power to train. For example, training SDv1.5 takes 6K A100 GPU days, which costs around $320,000. The more recent and larger RAPHAEL requires 60K A100 GPU days, which costs about $3,080,000. Training also produces substantial CO2 emissions that strain the environment: RAPHAEL's training emits 35 tonnes of CO2, about what one person emits over 7 years, as seen in Figure 1.

Figure 1: Comparison of CO2 emissions and training costs among T2I generators. PIXART-α is trained for $26,000. Its CO2 emissions and training costs are just 1.1% and 0.85% of RAPHAEL's, respectively.

Such high costs create major barriers for both the research community and businesses that want access to these models, which significantly impedes progress in the AIGC community. These difficulties raise a crucial question: can a high-quality image generator be built with manageable resource usage? Researchers from Huawei Noah’s Ark Lab, Dalian University of Technology, HKU, and HKUST present PIXART-α, which dramatically lowers the computing requirements of training while keeping image generation quality competitive with the most recent state-of-the-art image generators. They propose three main designs to achieve this. Decomposition of the training strategy. They break down the challenging text-to-image generation problem into three simpler subtasks:

  1. Learning the distribution of pixels in natural pictures
  2. Learning text-image alignment
  3. Improving the aesthetic appeal of images

They propose drastically lowering the learning cost of the first subtask by initializing the T2I model with a low-cost class-conditional model. For the second and third subtasks, they provide a training paradigm that consists of pretraining and fine-tuning: pretraining on text-image pair data with high information density, followed by fine-tuning on data with higher aesthetic quality, which improves training efficiency. Efficient T2I Transformer. Building on the Diffusion Transformer (DiT), they use cross-attention modules to inject text conditions and streamline the computationally demanding class-condition branch to increase efficiency. Additionally, they present a reparameterization method that lets the modified text-to-image model load the parameters of the original class-conditional model directly.
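To make the idea of injecting text through cross-attention more concrete, here is a minimal single-head sketch in plain Python. It is an illustration only, not the authors' implementation: the function names, the tiny 2-dimensional tokens, and the weight values are all hypothetical. In particular, zero-initializing the output projection is an assumption made here to show one common way a newly added branch can leave a pretrained model's behavior unchanged at the start of training; the article itself does not describe how the new layers are initialized.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(image_tokens, text_tokens, w_q, w_k, w_v, w_out):
    """Image tokens (queries) attend over text tokens (keys and values)."""
    q = matmul(image_tokens, w_q)
    k = matmul(text_tokens, w_k)
    v = matmul(text_tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(matmul(weights, v), w_out)

def text_conditioned_block(image_tokens, text_tokens, params):
    """Residual update: x + CrossAttention(x, text)."""
    delta = cross_attention(image_tokens, text_tokens, *params)
    return [[x + d for x, d in zip(x_row, d_row)]
            for x_row, d_row in zip(image_tokens, delta)]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

DIM = 2
image_tokens = [[0.5, -1.0], [2.0, 0.25]]            # 2 hypothetical image tokens
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 hypothetical text tokens

# Assumption: a freshly added branch starts with a zero output projection.
fresh = (identity(DIM), identity(DIM), identity(DIM), zeros(DIM, DIM))
# Stand-in for weights after some training (any non-zero projection).
trained = (identity(DIM), identity(DIM), identity(DIM), identity(DIM))

out_fresh = text_conditioned_block(image_tokens, text_tokens, fresh)
out_trained = text_conditioned_block(image_tokens, text_tokens, trained)

print("unchanged at init:", out_fresh == image_tokens)
print("text has an effect once the projection is non-zero:", out_trained != image_tokens)
```

With the zero projection, the block passes the image tokens through untouched, so weights imported from a class-conditional model would behave exactly as before. Once the projection moves away from zero, the text tokens start to influence the output.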

They can thus use the prior knowledge of natural image distribution learned from ImageNet to give the T2I Transformer a reasonable initialization and speed up its training. Highly informative data. Their research reveals significant flaws in existing text-image pair datasets, with LAION as an example. Textual captions frequently suffer from a severe long-tail effect (i.e., many nouns appear with extremely low frequencies) and a lack of informative content (i.e., they typically describe only a portion of the objects in the images). These flaws greatly reduce the effectiveness of T2I model training, so millions of iterations are needed to learn reliable text-image alignment. To overcome these issues, they propose an auto-labeling pipeline that uses a state-of-the-art vision-language model to generate captions for the SAM dataset.
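The two caption problems described above can be illustrated with a small Python sketch. Everything in it is hypothetical: the captions are made up, the function names are not from the paper, and plain whitespace-separated words stand in for nouns because no part-of-speech tagging is done. It measures what share of the vocabulary appears only once (a rough long-tail indicator) and how long captions are on average (a rough proxy for information density).

```python
from collections import Counter

def rare_word_share(captions, max_count=1):
    """Fraction of distinct words that occur at most `max_count` times."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    if not counts:
        return 0.0
    rare = sum(1 for c in counts.values() if c <= max_count)
    return rare / len(counts)

def mean_caption_length(captions):
    """Average number of whitespace-separated words per caption."""
    if not captions:
        return 0.0
    return sum(len(caption.split()) for caption in captions) / len(captions)

# Hypothetical short, web-style captions.
SHORT_CAPTIONS = ["a dog", "a dog photo", "vintage zeppelin poster", "a cat"]

# Hypothetical dense captions of the kind an auto-labeling step might produce.
DENSE_CAPTIONS = [
    "a brown dog sits on a red sofa next to a sleeping cat",
    "a cat sleeps on a red sofa while a brown dog watches",
]

for name, captions in (("short", SHORT_CAPTIONS), ("dense", DENSE_CAPTIONS)):
    print(name, "rare-word share:", round(rare_word_share(captions), 2),
          "mean length:", mean_caption_length(captions))
```

On a real dataset, a noun extractor would replace the whitespace split and the frequency threshold would be tuned; the sketch only shows the kind of statistics behind the long-tail and information-density claims.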

The SAM dataset has a large and diverse collection of objects, which makes it an ideal source of text-image pairs with high information density that are better suited to text-image alignment learning. Together, these designs make the model’s training extremely efficient, using just 675 A100 GPU days and $26,000. Figure 1 shows that their approach uses less training data (0.2% of Imagen’s) and less training time (2% of RAPHAEL’s). Their training expenses are about 1% of RAPHAEL’s, a saving of about $3,000,000 ($26,000 vs. $3,080,000).
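The cost comparison can be checked with a few lines of Python. The dollar figures below are the ones quoted in this article; the dictionary and function names are hypothetical and exist only for this sketch.

```python
# Training costs in USD, as reported in the article.
REPORTED_COST_USD = {
    "SDv1.5": 320_000,
    "RAPHAEL": 3_080_000,
    "PIXART-alpha": 26_000,
}

def cost_ratio(model, baseline):
    """Training cost of `model` as a fraction of the cost of `baseline`."""
    return REPORTED_COST_USD[model] / REPORTED_COST_USD[baseline]

def savings_usd(model, baseline):
    """Absolute difference in reported training cost."""
    return REPORTED_COST_USD[baseline] - REPORTED_COST_USD[model]

for baseline in ("RAPHAEL", "SDv1.5"):
    ratio = cost_ratio("PIXART-alpha", baseline)
    saved = savings_usd("PIXART-alpha", baseline)
    print(f"vs. {baseline}: {ratio:.2%} of the cost, ${saved:,} saved")
```

Against RAPHAEL the ratio works out to a little under 1% and the difference to a little over $3,000,000, consistent with the rounded figures quoted above.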

Regarding generation quality, their user study shows that PIXART-α delivers better image quality and semantic alignment than current SOTA T2I models such as Stable Diffusion; moreover, its performance on T2I-CompBench demonstrates its advantage in semantic control. They anticipate that their efforts to train T2I models efficiently will give the AIGC community useful insights and help more independent researchers or companies build their own high-quality T2I models at more affordable prices.

Check out the Paper and Project. All credit for this research goes to the researchers on this project.

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.