A technology that has been around for years but is often taken for granted is Natural Language Processing (NLP): the use of computational methods to analyze and synthesize natural language and speech. Pre-trained multilingual language models have proven effective at a variety of downstream NLP tasks, such as sequence labeling and document classification.
The idea behind pre-trained models is to build a black box that comprehends a language and can then be instructed to perform any task in that language; the goal is to construct a machine that can stand in for a ‘well-read’ human. However, these models have traditionally required large amounts of training data, which has left the world’s under-resourced languages largely unexplored.
Researchers from the David R. Cheriton School of Computer Science at the University of Waterloo challenge this assumption with AfriBERTa, a new neural language model that uses deep-learning methods to achieve state-of-the-art results for under-resourced languages. The researchers show that competitive multilingual language models can be built with less than 1 GB of text. AfriBERTa covers 11 African languages, four of which have never had a language model before.
Named Entity Recognition (NER) is a sub-task of information extraction that aims to discover and classify named entities referenced in unstructured text into preset categories such as person names, organizations, and locations. The model was assessed by performing NER and other downstream tasks, such as text classification, on ten low-resource languages.
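NER is commonly framed as sequence labeling: each token receives a tag such as B-PER (beginning of a person name), I-PER (inside one), or O (outside any entity), and tagged spans are then grouped into entities. A minimal sketch of that grouping step (the example sentence and labels are hypothetical, purely for illustration):

```python
# Hypothetical tagged sentence, not taken from the paper.
tokens = ["Ngozi", "Okonjo-Iweala", "leads", "the", "WTO", "in", "Geneva", "."]
labels = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "O"]

def extract_entities(tokens, labels):
    """Group BIO-tagged tokens into (entity_text, category) spans."""
    entities, current, cat = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):           # a new entity begins
            if current:
                entities.append((" ".join(current), cat))
            current, cat = [tok], lab[2:]
        elif lab.startswith("I-") and current:
            current.append(tok)            # continue the open entity
        else:                              # O tag: close any open entity
            if current:
                entities.append((" ".join(current), cat))
            current, cat = [], None
    if current:
        entities.append((" ".join(current), cat))
    return entities

print(extract_entities(tokens, labels))
# [('Ngozi Okonjo-Iweala', 'PER'), ('WTO', 'ORG'), ('Geneva', 'LOC')]
```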
The architectural factors varied across the trained model variants are:
- Model Depth
- Number of attention heads
- Vocabulary Size
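The layer and head values explored (given below in the article) can be sketched as a simple search grid; how vocabulary sizes were combined with it is not specified here, so the grid covers only depth and heads:

```python
from itertools import product

# Architectural variants compared in the study: model depth (layers)
# and number of attention heads, using the values the article reports.
layers_options = [4, 6, 8, 10]
heads_options = [2, 4, 6]

configs = [
    {"num_layers": L, "num_heads": H}
    for L, H in product(layers_options, heads_options)
]

print(len(configs))  # 12 depth/head combinations
```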
Model Depth:
Models with 4, 6, 8, and 10 layers are compared. Preliminary experiments showed that models with more than ten layers do not yield a marked performance improvement, which is expected behavior given the small datasets in use.
Number of Attention Heads:
The attention module in a transformer-based system performs its calculations many times in parallel, and each of these parallel computations is referred to as an attention head. For each layer setting, models are trained with 2, 4, and 6 attention heads. Initial studies again showed that more than six attention heads did not produce significantly better results, so settings with more than six heads are not examined.
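Because each head attends over an equal slice of the model's hidden representation, the hidden size must divide evenly by the head count. A minimal sketch (the hidden size of 768 is an assumption for illustration, not a figure from the article):

```python
def head_dim(hidden_size: int, num_heads: int) -> int:
    """Per-head dimensionality: each attention head operates on an
    equal slice of the hidden vector, computed in parallel."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    return hidden_size // num_heads

# With a hypothetical hidden size of 768, the head counts explored
# in the study give these per-head slice widths:
for heads in (2, 4, 6):
    print(heads, head_dim(768, heads))  # 384, 192, 128
```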
Vocabulary Size:
A small vocabulary is traditionally thought to be preferable for small datasets. However, the experiments showed that increasing the vocabulary size improves the performance of the multilingual model. These results are based on comparing the best-performing model size across different vocabulary sizes.
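One reason a small vocabulary is conventionally preferred is that the input embedding table grows linearly with vocabulary size (parameters = vocab_size × hidden_size), leaving more parameters to fit from scarce data. A quick illustration with hypothetical sizes (neither figure is from the article):

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameter count of the input embedding matrix alone."""
    return vocab_size * hidden_size

# Hypothetical vocabulary sizes and hidden size, to show the trade-off
# that makes large vocabularies seem risky for small datasets:
print(embedding_params(30_000, 768))  # 23,040,000 parameters
print(embedding_params(70_000, 768))  # 53,760,000 parameters
```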
As an outcome of this research, the factors to consider when pre-training models on small datasets have been established. In addition, the code, pre-trained models, and datasets have been released to encourage further work on multilingual models.
Overall, the AfriBERTa model outperforms larger models like mBERT and XLM-R on text classification by up to 10 F1 points and also beats these models on multiple languages in the NER task. According to the researchers, their model is a fierce competitor to larger models across all languages.
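The F1 score used to report these gains is the harmonic mean of precision and recall. A minimal sketch of the computation (the counts below are hypothetical, chosen only to illustrate the formula):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall, the metric in which
    the reported gains (up to 10 points) are measured."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 80 correct entities, 20 spurious, 20 missed.
print(f1_score(tp=80, fp=20, fn=20))  # precision = recall = 0.8, so F1 = 0.8
```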