Stability AI, together with its AI research lab DeepFloyd, has released a research version of its latest model, DeepFloyd IF. DeepFloyd IF is a cascaded pixel diffusion model that generates high-quality images from text prompts. It is available under a non-commercial, research-permissible license, so research labs can explore and experiment with advanced text-to-image generation methods. The release is in line with Stability AI’s commitment to sharing its technology with the broader research community, and the company plans to eventually release DeepFloyd IF as fully open source.
DeepFloyd IF has several notable features. It uses the T5-XXL-1.1 language model as a text encoder to interpret text prompts, and it employs cross-attention layers to better align the prompt with the generated image. A standout capability is rendering scenes in which several objects appear in specified spatial relations to one another, a task that earlier text-to-image models have struggled with. The generated images are also highly photorealistic, as reflected in the model’s zero-shot FID score of 6.66 on the COCO dataset (lower FID is better). The model can also generate images with non-standard aspect ratios, vertical or horizontal, in addition to the standard square format.
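To make the cross-attention idea concrete, here is a minimal pure-Python sketch of generic scaled dot-product cross-attention, in which image-side queries attend over text-token keys and values. It is not DeepFloyd IF's implementation: real models use learned projection matrices, multiple heads, and much larger dimensions, all omitted here, and the toy vectors and function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_queries, text_keys, text_values):
    """Each image-side query attends over all text tokens.

    Returns (outputs, weights): one output vector and one weight row per query.
    """
    dim = len(text_keys[0])
    value_dim = len(text_values[0])
    outputs, weights = [], []
    for q in image_queries:
        # Similarity between this image position and every text token.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in text_keys
        ]
        w = softmax(scores)
        # Weighted mix of the text values for this image position.
        out = [
            sum(wi * v[j] for wi, v in zip(w, text_values))
            for j in range(value_dim)
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy, hand-picked vectors (hypothetical): two text tokens, two image positions.
text_keys = [[1.0, 0.0], [0.0, 1.0]]
text_values = [[10.0, 0.0], [0.0, 10.0]]
image_queries = [[4.0, 0.0], [0.0, 4.0]]

outputs, weights = cross_attention(image_queries, text_keys, text_values)
```

In this toy setup the first image position puts most of its weight on the first text token and the second position on the second token, which is the sense in which cross-attention lets different image regions follow different parts of the prompt.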
In addition to text-to-image generation, DeepFloyd IF supports zero-shot image-to-image translation. The original image is resized to 64 pixels, noise is added through forward diffusion, and the image is then denoised through backward diffusion guided by a new prompt. The style can be modified further in the super-resolution modules through a text prompt. This approach changes the style, patterns, and details of the output image while preserving the basic form of the source image, with no fine-tuning required.
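The noising step of this image-to-image procedure can be sketched with the standard closed-form forward diffusion used in DDPM-style models, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The sketch below is only an illustration under assumptions: the linear beta schedule, the step count, the `strength` value, and the random stand-in image are hypothetical and are not taken from DeepFloyd IF. The denoising step with a new prompt requires the trained model and is not shown.

```python
import math
import random


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(pixels, alpha_bar, rng):
    """Closed-form forward diffusion: sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * p + noise * rng.gauss(0.0, 1.0) for p in pixels]


# Hypothetical stand-in for a source image already resized to 64x64:
# a flattened grayscale image with values in [-1, 1].
rng = random.Random(0)
source = [rng.uniform(-1.0, 1.0) for _ in range(64 * 64)]

alpha_bars = linear_alpha_bars(num_steps=1000)
strength = 0.7  # hypothetical: fraction of the schedule to noise through
t = int(strength * len(alpha_bars)) - 1
noisy = add_noise(source, alpha_bars[t], rng)
```

A larger `strength` destroys more of the source image, so the denoised result follows the new prompt more closely; a smaller value keeps more of the original layout.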
The DeepFloyd IF model generates images from text prompts in three stages. First, a frozen T5-XXL language model converts the text prompt into a text embedding. Second, a base diffusion model turns that embedding into a 64×64 image, which a text-conditional super-resolution model then upscales to 256×256. Third, a second super-resolution model enhances the image to a sharp, high-quality 1024×1024 resolution. Several versions of the base and super-resolution models are available, differing in their number of parameters. The third-stage model has not been released yet, but alternative upscalers such as the Stable Diffusion x4 Upscaler can be used in its place.
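The resolutions quoted above (64×64, 256×256, 1024×1024) imply a 4× upscale per side at each super-resolution step. The short sketch below only checks that arithmetic; the `Stage` class, the stage labels, and the helper functions are illustrative and are not part of any DeepFloyd API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    in_size: int   # input side length in pixels (0 = starts from text only)
    out_size: int  # output side length in pixels


# Resolutions as described in the article; the stage names are illustrative.
CASCADE = [
    Stage("base diffusion model", 0, 64),
    Stage("first super-resolution model", 64, 256),
    Stage("second super-resolution model", 256, 1024),
]


def check_cascade(stages):
    """Verify each stage consumes the previous stage's output size."""
    for prev, nxt in zip(stages, stages[1:]):
        if prev.out_size != nxt.in_size:
            raise ValueError(f"{prev.name} -> {nxt.name}: size mismatch")
    return [s.out_size for s in stages]


def upscale_factors(stages):
    """Per-side upscaling factor for every stage that takes an image as input."""
    return [s.out_size // s.in_size for s in stages if s.in_size > 0]


sizes = check_cascade(CASCADE)
factors = upscale_factors(CASCADE)
# Total growth in pixel count from the base output to the final image.
pixel_growth = (sizes[-1] ** 2) // (sizes[0] ** 2)
```

Because the last step is a 4× upscale from 256 to 1024 pixels per side, a generic x4 upscaler fits the same slot, which is consistent with the article's note that the Stable Diffusion x4 Upscaler can stand in for the unreleased third-stage model.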
DeepFloyd IF was trained on a custom high-quality dataset called LAION-A, which contains 1 billion (image, text) pairs. LAION-A is an aesthetic subset of the English part of the LAION-5B dataset, filtered with custom filters to remove inappropriate content. The model is initially released under a research license, and its creators welcome feedback to improve its performance and scalability. Potential application domains include art, design, storytelling, virtual reality, and accessibility. The creators also pose several research questions covering the technical, academic, and ethical aspects of the model. The model weights are available on DeepFloyd’s Hugging Face space, and the model card and code are available on GitHub. A public Gradio demo is provided, and the creators invite people to join the public discussions.
Niharika is a technical consulting intern at Marktechpost. She is a third-year undergraduate pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is highly enthusiastic about machine learning, data science, and AI, and is an avid reader of the latest developments in these fields.