Researchers Introduce ruDALL-E For Generating Images from Text In Russia

This article is written as a summary by Marktechpost Staff based on the research article 'ruDALL-E: Generating Images from Text. Facing down the biggest computational challenge in Russia'. All credit for this research goes to the researchers of this project. Check out the post, GitHub and demo.

Humans see objects, hear noises, feel the texture, smell scents, and taste flavors in a multimodal world. Multimodal machine learning (MMML) is a multi-disciplinary research topic that integrates and models many communicative modalities, such as linguistic, acoustic, and visual information, to satisfy some of artificial intelligence’s initial goals.

OpenAI introduced the DALL-E neural network at the start of 2021; it generates 256×256 pixel images from text prompts. Now, teams from Sber AI, SberDevices, Samara University, AIRI, and SberCloud have built a complete pipeline for producing images from descriptive text written in Russian. The team aimed to build multimodal neural networks that better grasp the world as a whole by drawing on concepts from several modalities (at first, text and images). Their work addresses two limitations of current search engines:

  1. Users cannot express exactly what they are looking for in writing and have an entirely new image made just for them.
  2. They cannot generate as many license-free images as they desire at any one time.

They trained two versions of the model, with different sizes:

  1. ruDALL-E XXL, with 12.0 billion parameters
  2. ruDALL-E XL, having 1.3 billion parameters

The team has made the models ruDALL-E XL, ruDALL-E XXL, and ruCLIP Small available on DataHub.


The ruDALL-E architecture is designed so that a transformer learns to model text and image tokens autoregressively as a single stream. Representing an image directly as pixels, however, would require a massive amount of memory. To avoid this, and to keep the model from learning only short-range dependencies between pixels and text, the researchers train it in two steps:

  1. Images resized to a resolution of 256×256 are fed into an autoencoder, SBER VQ-GAN, which compresses each image by a factor of 8 per side into a 32×32 grid of tokens. The image can be reconstructed from these tokens without significant loss of quality.
  2. The transformer learns to model the resulting 1024 image tokens together with the text tokens. The text input is encoded into 128 tokens with the YTTM tokenizer, and the text and image tokens are concatenated one after the other.
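
The token budget implied by these two steps can be checked with a little arithmetic. The sketch below is illustrative only: the constant names, the padding id, and the choice to place text tokens before image tokens are assumptions made for the example, not details confirmed by the article.

```python
# Illustrative arithmetic only; names and the padding id are assumptions.
IMAGE_SIZE = 256    # input resolution per side, from the article
COMPRESSION = 8     # VQ-GAN compression factor per side, from the article
TEXT_TOKENS = 128   # text tokens per sample, from the article

GRID = IMAGE_SIZE // COMPRESSION          # 32 tokens per side
IMAGE_TOKENS = GRID * GRID                # 1024 image tokens
SEQ_LEN = TEXT_TOKENS + IMAGE_TOKENS      # 1152 tokens per sample
RAW_VALUES = IMAGE_SIZE * IMAGE_SIZE * 3  # 196608 RGB values if pixels were modeled directly


def build_sequence(text_ids, image_ids, pad_id=0):
    """Pad or truncate the text to TEXT_TOKENS, then append the image tokens.

    Putting the text first is an assumption made for this sketch.
    """
    if len(image_ids) != IMAGE_TOKENS:
        raise ValueError(f"expected {IMAGE_TOKENS} image tokens, got {len(image_ids)}")
    text = list(text_ids)[:TEXT_TOKENS]
    text += [pad_id] * (TEXT_TOKENS - len(text))
    return text + list(image_ids)
```

Under these assumptions, each sample is a sequence of 1152 tokens, compared with 196,608 raw RGB values for the same 256×256 image.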

The current picture generating pipeline consists of three parts:

  1. Image generation with ruDALL-E
  2. Sorting the results with ruCLIP
  3. Image quality and resolution improvement with SuperResolution.

The researchers explain that it is possible to experiment with the parameters that determine the number of instances generated, their selection, and their level of abstraction during production and ranking.
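
To make the ranking step concrete, here is a minimal sketch of how generated candidates could be sorted against the text prompt. It assumes that a ruCLIP-style model produces one embedding vector for the text and one per image, and that candidates are scored by cosine similarity; the function names and the `top_n` parameter are hypothetical and are not the ruCLIP API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(text_embedding, image_embeddings, top_n):
    """Return the indices of the top_n candidates, best match first."""
    scores = [cosine(text_embedding, e) for e in image_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_n]
```

In such a setup, the number of candidates generated and the value of `top_n` are the kind of parameters one would tune during generation and ranking.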

The research team wrote their own training code for ruDALL-E, building on their code for training ruGPT models and referring to the documentation and other open-source approaches. The code includes positional encoding of image blocks, a general representation of text and image embeddings, convolutional and coordinated masked attention layers, weighted losses for the text and image sections, and a dropout-tokenizer.

Because adequately training the model requires so much computation, fp16 precision must be used. However, particularly large values inside the network can occasionally produce a NaN loss and halt training. Lowering the learning rate to avoid this causes a different problem: many gradients round to zero, so the network stops improving and eventually degrades.
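
Both failure modes follow from the narrow range of half-precision floats, which the following sketch illustrates using only the Python standard library. The helper `to_fp16` is a hypothetical name introduced here; it emulates rounding to IEEE 754 half precision and is not part of the researchers' code.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    if math.isnan(x) or math.isinf(x):
        return x
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the fp16 range; hardware would return +/-inf.
        return math.copysign(math.inf, x)


# Overflow: the largest finite fp16 value is 65504, so 70000 becomes inf.
big_activation = to_fp16(70000.0)

# Arithmetic on infinities then yields NaN, which poisons the loss.
nan_loss = big_activation - big_activation

# Underflow: a very small gradient rounds to exactly zero in fp16.
tiny_gradient = to_fp16(1e-8)
```

Here `big_activation` is infinite, `nan_loss` is NaN, and `tiny_gradient` is `0.0`, mirroring the overflow and vanishing-gradient problems described above.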

They used ideas from the CogView project at Tsinghua University in China to solve this problem. They used DeepSpeed for distributed learning over a pair of DGX, precisely as they did for ruGPT-3.

Training a transformer requires a large amount of “clean” data.

As reference points, they took the data that OpenAI described in its paper (about 250 million pairs) and the data that CogView used (30 million pairs). This included Conceptual Captions, YFCC100m, Wikipedia data, and ImageNet. They then added the OpenImages, LAION-400m, WIT, Web2M, and HowTo datasets as sources of data on human activities. They also included other datasets covering domains of interest: people, animals, well-known figures, interiors, landmarks and landscapes, various kinds of technology, human activities, and emotions.

After collecting the data, filtering out overly short descriptions, overly small images, images with unacceptable aspect ratios, and images that did not match their descriptions, and translating all English captions into Russian, they obtained a broad training dataset of over 120 million image-text pairs.
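
The filtering criteria listed above can be sketched as a simple predicate over image-text pairs. Everything concrete in this example is an assumption: the article does not give the thresholds, so the constants below are placeholders, and `match_score` stands in for whatever image-text similarity measure the researchers actually used.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    caption: str
    width: int
    height: int
    match_score: float  # hypothetical image-text similarity in [0, 1]


# Placeholder thresholds; the article does not state the real values.
MIN_CAPTION_WORDS = 3
MIN_SIDE_PX = 64
MAX_ASPECT_RATIO = 2.0
MIN_MATCH_SCORE = 0.2


def keep(pair: Pair) -> bool:
    """Return True if the pair passes all four filters from the article."""
    if len(pair.caption.split()) < MIN_CAPTION_WORDS:
        return False  # overly short description
    if min(pair.width, pair.height) < MIN_SIDE_PX:
        return False  # overly small image
    if max(pair.width, pair.height) / min(pair.width, pair.height) > MAX_ASPECT_RATIO:
        return False  # unacceptable aspect ratio
    if pair.match_score < MIN_MATCH_SCORE:
        return False  # image does not match its description
    return True
```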

The ruDALL-E XXL model was trained in two stages: first on 512 Tesla V100 GPUs for 37 days and then on 128 Tesla V100 GPUs for another 11 days. The ruDALL-E XL model was trained in three stages: 8 days on 128 Tesla V100 GPUs, followed by 6.5 and 8.5 days on 192 Tesla V100 GPUs with slightly different training samples.
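
To compare the two training runs, the stage figures above can be converted into GPU-days. This is a back-of-the-envelope calculation, not a figure reported by the researchers; it assumes that the 6.5-day and 8.5-day stages of ruDALL-E XL each ran on 192 GPUs, and the names below are introduced for illustration.

```python
def gpu_days(stages):
    """Sum GPUs x days over a list of (gpus, days) training stages."""
    return sum(gpus * days for gpus, days in stages)


# (number of GPUs, days) per stage, taken from the article
XXL_STAGES = [(512, 37), (128, 11)]
XL_STAGES = [(128, 8), (192, 6.5), (192, 8.5)]

xxl_total = gpu_days(XXL_STAGES)  # 18944 + 1408 = 20352 GPU-days
xl_total = gpu_days(XL_STAGES)    # 1024 + 1248 + 1632 = 3904 GPU-days
```

By this estimate, the XXL model consumed roughly five times the GPU-days of the XL model.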

Choosing the best generation parameters for diverse objects and domains is not easy. When it came time to use the model generatively, the researchers started with nucleus sampling and top-k sampling. Although these methods have been thoroughly studied for text generation, the conventional settings did not work well for images.
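
For readers unfamiliar with the two strategies, here is a minimal, generic sketch of top-k filtering followed by nucleus (top-p) filtering over a next-token distribution. It is a textbook-style illustration in plain Python, not the ruDALL-E implementation, and the function and parameter names are assumptions.

```python
import random


def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose cumulative probability reaches top_p; renormalize what remains."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample(probs, top_k, top_p, rng):
    """Draw one token id from the filtered, renormalized distribution."""
    filtered = filter_top_k_top_p(probs, top_k, top_p)
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]
```

For example, with `probs = [0.5, 0.3, 0.15, 0.05]`, `top_k=3`, and `top_p=0.7`, only tokens 0 and 1 survive, with renormalized probabilities 0.625 and 0.375. Smaller values of `top_k` and `top_p` make outputs more conservative, while larger values allow more varied ones.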

They conducted a series of tests to determine the optimal parameter ranges. Their findings revealed that the ranges for different kinds of output can differ dramatically. They state that their future research will examine whether parameter ranges can be determined automatically for a given generation.