Yandex Open-Sources YaLM Model With 100 Billion Parameters

It is claimed to be the largest freely available GPT-like (generative pre-trained transformer) neural network for English

Transformers are used for translation and text summarization because they can process sequential input data such as natural language. They rely on self-attention, which weights the importance of each part of the input differently. Large-scale transformer-based models have recently become very popular in computer vision and natural language processing (NLP).

These models keep growing in size and complexity, yet building them costs millions of dollars, requires top experts, and can take years. As a result, many companies have been unable to use them, and only large technology organizations have had access to this cutting-edge technology.

To address these problems, Yandex has open-sourced its largest YaLM model to date, which has 100 billion parameters. This GPT-like neural network, described as the largest of its kind for English, is now available for free. The researchers trained the model for 65 days on a pool of 800 A100 graphics cards, using 1.7 TB of online texts, books, and countless other sources. They have published the model and related materials on GitHub under the Apache 2.0 license, which allows both academic and commercial use.

The researchers explain that a 10% improvement in training speed for large-scale neural networks can save about a week of runtime on an expensive cluster. A training iteration typically includes the following steps:

  1. Batch preparation.
  2. Forward propagation to compute the activations and the loss.
  3. Backward propagation to compute the gradients.
  4. The step stage, which updates the weights of the model.
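
The four steps listed above can be illustrated with a minimal sketch. This is not the team's training code: it fits a toy linear model in plain Python, and the function names, dataset, and hyperparameters are all assumptions chosen only to make each stage visible.

```python
import random


def prepare_batch(data, batch_size, rng):
    # Step 1: batch preparation (sample examples from the dataset).
    return rng.sample(data, batch_size)


def forward(w, b, batch):
    # Step 2: forward propagation: predictions and mean squared error loss.
    preds = [w * x + b for x, _ in batch]
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / len(batch)
    return preds, loss


def backward(preds, batch):
    # Step 3: backward propagation: gradients of the loss w.r.t. w and b.
    n = len(batch)
    grad_w = sum(2 * (p - y) * x for p, (x, y) in zip(preds, batch)) / n
    grad_b = sum(2 * (p - y) for p, (_, y) in zip(preds, batch)) / n
    return grad_w, grad_b


def step(w, b, grad_w, grad_b, lr):
    # Step 4: the step stage: update the weights with plain gradient descent.
    return w - lr * grad_w, b - lr * grad_b


def train(num_iters=1000, batch_size=8, lr=0.05, seed=0):
    rng = random.Random(seed)
    # Hypothetical toy dataset for y = 3x + 1 with x in [-1, 1].
    data = [(i / 10.0, 3.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]
    w, b = 0.0, 0.0
    for _ in range(num_iters):
        batch = prepare_batch(data, batch_size, rng)
        preds, _loss = forward(w, b, batch)
        grad_w, grad_b = backward(preds, batch)
        w, b = step(w, b, grad_w, grad_b, lr)
    return w, b


if __name__ == "__main__":
    print(train())
```

In a real large-scale run each of these stages is a separate place where time can be lost, which is why the advice that follows treats them individually.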

For these steps, the team outlines measures that can help developers speed up training:

1. Look for bottlenecks: The researchers suggest using a profiler to analyze how training time is spent. The team says it uncovered several major issues during its work, thanks in part to the profiler.

2. Use fast data types: The data type used to store the model and carry out calculations has the biggest impact on how quickly training and inference run. Therefore, they suggest using fast data types.

3. Accelerate GPU operations: GPU operations can be sped up by removing dropout, minimizing memory interaction, and keeping the GPU fully utilized by feeding it large amounts of data. The team also states that performance benefits from a communication library that plans communication efficiently at initialization and lets GPUs communicate directly over the network without involving the CPU. In addition, they used the Zero Redundancy Optimizer (ZeRO) and made sure communication is as fast as possible.
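
As a rough illustration of the first piece of advice, the sketch below uses Python's built-in `cProfile` to see where time goes in a loop of toy training steps. The article does not say which profiler the team used, and a GPU workload would need a GPU-aware profiler. The functions `prepare_batch`, `forward_backward`, and `train_step` are hypothetical stand-ins.

```python
import cProfile
import io
import pstats


def prepare_batch(n):
    # Stand-in for data loading: build a list of floats.
    return [float(i) for i in range(n)]


def forward_backward(batch):
    # Stand-in for the compute-heavy part of an iteration.
    total = 0.0
    for _ in range(50):
        total += sum(x * x for x in batch)
    return total


def train_step(n=2000):
    batch = prepare_batch(n)
    return forward_backward(batch)


def profile_steps(num_steps=5):
    # Profile a few iterations and return a report sorted by cumulative time.
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(num_steps):
        train_step()
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(10)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_steps())
```

The report lists each function with its call count and cumulative time, which is usually enough to tell whether data preparation or computation dominates an iteration.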

They applied the following four optimizations in their training process, with these reported speedups:

  1. Combined (fused) some operations: +5%
  2. Used a softmax attention kernel with a triangular mask: +10% to 80%
  3. Removed dropout: +15%
  4. Applied ZeRO: +80%
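
To make the triangular mask concrete, here is a reference sketch of the math such a kernel computes: a row-wise softmax in which position i can only attend to positions up to and including i. It is plain Python for readability and says nothing about how the team's GPU kernel is implemented. The function name `causal_softmax` is an assumption.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax where row i only sees columns j <= i."""
    out = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]           # lower-triangular part of the row
        m = max(visible)                 # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        masked = [0.0] * (len(row) - i - 1)  # future positions get zero weight
        out.append([e / total for e in exps] + masked)
    return out


if __name__ == "__main__":
    for row in causal_softmax([[1.0, 2.0, 3.0], [1.0, 1.0, 5.0], [0.0, 0.0, 0.0]]):
        print(row)
```

Roughly half of the score matrix is masked out, so a kernel that skips the masked entries instead of computing and discarding them has less work to do, which is one plausible source of the reported speedup.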

According to the researchers, slow iterations are not the only challenge in training an extremely large model. Having a lot of computational power can create the impression that one can simply start training. However, these models are highly delicate and prone to divergence.

The team notes that divergence is costlier when it occurs late in training than in the first few hours. Moreover, a model may completely lose what it has learned and become unrecoverable.

The team put the following strategies into practice to deal with the issue of divergence:

  • bf16 was chosen as the primary type for weights.
  • tf32 was used for computations that require precision.
  • Pre-LayerNorm was introduced.
  • LayerNorm was added immediately after the embeddings.
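
The sketch below hints at why a low-precision weight type is paired with a more precise type for sensitive computations. It emulates bf16 rounding in plain Python by keeping only the top 16 bits of a float32, then accumulates a small value many times. The helper names `to_bf16` and `accumulate_in_bf16` are assumptions, and NaN inputs are not handled.

```python
import struct


def to_bf16(x):
    """Round a float to the nearest bfloat16 value, returned as a Python float."""
    # Pack as IEEE-754 float32 and read back the 32 raw bits.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 low bits that bfloat16 drops.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate_in_bf16(increment, count):
    # Add the same small value repeatedly, rounding to bf16 after every add.
    acc = 0.0
    inc = to_bf16(increment)
    for _ in range(count):
        acc = to_bf16(acc + inc)
    return acc


if __name__ == "__main__":
    print(to_bf16(1.001))                  # close to 1.0, but not representable
    print(accumulate_in_bf16(0.01, 1000))  # exact sum would be about 10
```

Once the accumulator grows large enough, each small increment is less than half the gap between neighboring bf16 values and is rounded away, so the running sum stalls far short of the exact result. Sums and normalizations of this kind are where extra precision helps.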

The team also employed curriculum learning. The goal is to train the neural network with a large batch size and long sequences. However, they start with a small batch and short sequences and gradually increase both as training progresses. According to the authors, this stabilizes the process and has the added benefit of reducing the amount of computation at the start of training.
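
One way to picture such a schedule is the sketch below, which ramps batch size and sequence length linearly over a warm-up period. The article does not describe the actual schedule, so the linear ramp, the function name `curriculum_schedule`, and every number used here are assumptions.

```python
def curriculum_schedule(step, warmup_steps, start_batch, final_batch,
                        start_len, final_len):
    """Return (batch_size, seq_len) for a given step, ramping linearly."""
    # Fraction of the warm-up completed, clamped to [0, 1].
    frac = min(max(step / warmup_steps, 0.0), 1.0)
    batch = int(round(start_batch + frac * (final_batch - start_batch)))
    seq_len = int(round(start_len + frac * (final_len - start_len)))
    return batch, seq_len


def tokens_per_step(batch, seq_len):
    # Rough proxy for the amount of computation in one iteration.
    return batch * seq_len


if __name__ == "__main__":
    for s in (0, 500, 1000, 2000):
        b, n = curriculum_schedule(s, 1000, 32, 512, 128, 1024)
        print(s, b, n, tokens_per_step(b, n))
```

With these hypothetical settings, the first step processes 32 × 128 = 4,096 tokens, while a step after the warm-up processes 512 × 1,024 = 524,288, which shows how much computation the early iterations save.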

They have now been training their models without divergence for more than six months. The models come in various sizes, and these stabilization techniques made it possible to train one with 100 billion parameters. They are now sharing their work with the developer and research community to open the way for future developments.



Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence in various fields. She is passionate about exploring new advancements in technology and their real-life applications.