Google researchers have developed techniques for training language models with more than a trillion parameters. Their 1.6-trillion-parameter model, the largest of its kind, trains up to four times faster than the previously largest Google-developed language model. Parameters are central to machine learning algorithms: they are the parts of a model that are learned from training data. In the language domain, there has been a consistent correlation between the number of parameters and model sophistication.
According to the researchers, large-scale training is an effective path toward powerful models, but it is highly computationally intensive. To address this challenge, they pursued what they call the Switch Transformer, a technique that uses only a subset of a model's parameters to transform each piece of input data. The Switch Transformer keeps multiple experts (sub-networks specialized for different inputs) inside a larger model and uses a "gating network" to select which expert to consult for any given data. The Switch Transformer also efficiently leverages hardware designed for dense matrix multiplications, such as GPUs and TPUs. The researchers designed the model to split its unique weights across different devices; although the total number of weights grows with the number of devices, the memory and computational footprint on each individual device remains manageable.
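To make the routing idea concrete, here is a minimal, simplified sketch of top-1 ("switch") routing in plain Python/NumPy. The names (`switch_route`, `router_weights`, `experts`) are illustrative, not from the paper's Mesh TensorFlow code, and the sketch omits details such as expert capacity limits and the load-balancing loss; it only shows how a gating network sends each token to a single expert so that only that expert's parameters are used for that token.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def switch_route(tokens, router_weights, experts):
    """Route each token to exactly one expert (top-1 'switch' routing).

    tokens:         (num_tokens, d_model) activations
    router_weights: (d_model, num_experts) learned gating matrix (illustrative name)
    experts:        list of callables, one feed-forward expert per entry
    """
    # Gating network: a probability over experts for every token.
    gate_probs = softmax(tokens @ router_weights)     # (num_tokens, num_experts)
    expert_ids = gate_probs.argmax(axis=-1)           # top-1 expert per token
    top_probs = gate_probs.max(axis=-1, keepdims=True)

    # Dispatch: each token is processed only by its selected expert,
    # so only a subset of the model's parameters is active per token.
    outputs = np.zeros_like(tokens)
    for e, expert in enumerate(experts):
        mask = expert_ids == e
        if mask.any():
            outputs[mask] = expert(tokens[mask])

    # Scale by the gate probability so the router receives a gradient signal.
    return outputs * top_probs

# Toy usage: 8 tokens, model width 16, 4 experts implemented as random linear maps.
rng = np.random.default_rng(0)
d_model, num_experts = 16, 4
experts = [(lambda W: (lambda x: x @ W))(rng.normal(size=(d_model, d_model)) * 0.1)
           for _ in range(num_experts)]
tokens = rng.normal(size=(8, d_model))
router = rng.normal(size=(d_model, num_experts)) * 0.1
print(switch_route(tokens, router, experts).shape)    # (8, 16)
```

In a real deployment the experts would live on different devices, which is how the total parameter count can grow with the number of devices while the per-device footprint stays roughly constant.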
In an experiment, the researchers pre-trained several Switch Transformer models on 32 TPU cores using a 750 GB dataset of text scraped from various web sources. The models were tasked with predicting missing words in passages in which 15% of the words had been masked. The researchers found that their model exhibited no training instability compared with models with fewer parameters. However, on one benchmark, the Stanford Question Answering Dataset (SQuAD), the model scored lower than a model with fewer parameters.
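The pre-training objective described above is a fill-in-the-blank task. Below is a minimal sketch, in the same Python/NumPy style, of masking 15% of tokens and keeping the originals as prediction targets. The function and parameter names (`mask_tokens`, `mask_id`, the `-100` ignore label) are illustrative assumptions, not the paper's actual preprocessing, which follows the specific corruption scheme described in the paper.

```python
import numpy as np

def mask_tokens(token_ids, mask_id, mask_rate=0.15, seed=0):
    """Mask a fraction of tokens for a fill-in-the-blank pre-training objective.

    token_ids: 1-D array of integer token ids for one passage
    mask_id:   id of a special mask/sentinel token (assumed to exist in the vocab)
    Returns (corrupted_ids, target_ids), where target_ids holds the original ids
    at masked positions and -100 elsewhere (a common 'ignore this position' label).
    """
    rng = np.random.default_rng(seed)
    token_ids = np.asarray(token_ids)
    is_masked = rng.random(token_ids.shape) < mask_rate  # ~15% of positions

    corrupted = np.where(is_masked, mask_id, token_ids)  # model input
    targets = np.where(is_masked, token_ids, -100)       # model must recover these
    return corrupted, targets

# Toy usage: a 20-token "passage" with vocabulary ids 0..999 and mask id 1000.
ids = np.arange(20)
corrupted, targets = mask_tokens(ids, mask_id=1000)
print(corrupted)
print(targets)
```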
The Switch Transformer also benefits several downstream tasks, for example delivering a more than seven times pre-training speedup using the same amount of computational resources. In a test where the Switch Transformer was trained to translate between 100 languages, the researchers observed a universal improvement across the languages, with an over four times speedup compared with a baseline model.
In the future, the researchers aim to apply the model to new modalities, including image and text; according to them, model sparsity can be advantageous across a range of media. Unfortunately, the researchers have not examined the impact of these large language models in the real world. Models often amplify biases encoded in public data; according to OpenAI, they can associate certain keywords with gender, race, and religious prejudices. Such stereotypical associations are already found in some of the most popular existing models. Owing to this, last year Google mandated that its researchers consult with legal, policy, and public relations teams before pursuing topics such as face and sentiment analysis and categorizations of race, gender, or political affiliation.
Paper: https://arxiv.org/pdf/2101.03961.pdf
Github: https://github.com/tensorflow/mesh
Consultant Intern: Kriti Maloo is currently pursuing her B.Tech from Indian Institute of Technology (IIT) Bhubaneswar. She is interested in Data Analytics and its applications in various domains. She is a Bibliophile and loves to explore new advancements in the field of technology.