Google has open-sourced mT5, a multilingual variant of its T5 model. mT5 is trained on the mC4 corpus, a dataset spanning 101 languages, and comes in sizes ranging from 300 million to 13 billion parameters (the internal variables a model uses to make predictions). According to Google, the model has enough capacity to learn over 100 languages without significant interference between them.
Natural language processing (NLP) pipelines now rely heavily on transfer learning: models are pre-trained on data-rich tasks and then fine-tuned on a downstream task of interest. T5 models let NLP practitioners quickly achieve strong performance on a variety of tasks without performing the pre-training themselves. However, most such language models were pre-trained solely on English text, which limits their usefulness for the majority of the world's population that does not speak English. To address this, the NLP community has developed multilingual models pre-trained on many languages, including mBERT and mBART.
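The pre-train-then-fine-tune pattern described above can be illustrated with a deliberately tiny sketch. The `TinyLM` class below is a made-up toy (real pipelines fine-tune transformer weights, not word counts), but it shows the same two-phase idea: learn broad statistics from a large general corpus first, then continue training on a small in-domain corpus.

```python
from collections import Counter

class TinyLM:
    """Toy unigram language model illustrating the pre-train/fine-tune
    pattern. This is a pedagogical sketch, not how T5/mT5 work internally."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def train(self, corpus, weight=1):
        """Update word statistics; 'weight' crudely mimics giving
        fine-tuning data more influence per example."""
        for word in corpus.split():
            self.counts[word] += weight
            self.total += weight

    def prob(self, word):
        """Relative frequency of a word under the current model."""
        return self.counts[word] / self.total if self.total else 0.0

# "Pre-training": broad statistics from a large, general corpus.
lm = TinyLM()
lm.train("the cat sat on the mat the dog ran in the park")

# "Fine-tuning": a small in-domain corpus shifts probability toward
# domain vocabulary without discarding the general-purpose knowledge.
lm.train("model training loss model weights model", weight=2)

print(lm.prob("model"))  # boosted by fine-tuning
print(lm.prob("the"))    # general knowledge retained
```

The key property, which carries over to real transfer learning, is that fine-tuning adjusts an already-trained model rather than starting from scratch.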
The Release of mT5
T5’s general-purpose text-to-text format is based on insights from large-scale empirical studies. Google’s multilingual mT5 is trained on mC4, which covers 101 languages. mC4 is a specially built multilingual variant of C4, a corpus of roughly 750GB of cleaned English-language text sourced from the public Common Crawl repository.
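The text-to-text format means every task is cast as mapping an input string to an output string, usually with a task prefix. A minimal sketch of that casting, using prefixes from the original T5 setup ("translate English to German: ", "summarize: "); the function name and task keys are illustrative, not part of any library API:

```python
def to_text_to_text(task, source, target):
    """Cast a (task, input, label) triple into T5's text-to-text format:
    the model always reads 'prefix + input text' and must generate the
    output text, regardless of whether the task is translation,
    summarization, or classification."""
    prefixes = {
        "translate_en_de": "translate English to German: ",
        "summarize": "summarize: ",
    }
    return prefixes[task] + source, target

# Translation and summarization become the same kind of training pair.
inp, out = to_text_to_text(
    "translate_en_de", "The house is wonderful.", "Das Haus ist wunderbar."
)
print(inp)  # translate English to German: The house is wonderful.
print(out)  # Das Haus ist wunderbar.
```

Because every task shares this one interface, a single model (and a single pre-training recipe) can serve many tasks, which is what makes extending T5 to the 101 languages of mC4 straightforward.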
For evaluation, the researchers compared the mT5-XXL model against related models such as mBERT, XLM, and XLM-R. mT5-XXL is reported to achieve state-of-the-art (SOTA) performance on all tasks in the XTREME multilingual benchmark. This demonstrates that T5’s strengths carry over to the multilingual setting, yielding strong performance on several standard benchmarks. The results also suggest that large-scale pre-training can be a viable alternative to more complex, task-specific techniques.