This AI Paper from SambaNova Presents a Machine Learning Method to Adapt Pretrained LLMs to New Languages

The rapid advancement of large language models has ushered in a new era of natural language processing capabilities. However, a significant challenge persists: most of these models are trained primarily on a small set of widely spoken languages, leaving much of the world's linguistic diversity underserved. This limitation not only restricts access to cutting-edge language technologies but also perpetuates a technological divide across linguistic communities.

In this study, researchers from SambaNova tackle this challenge with a method named SambaLingo. The approach adapts existing, high-performing language models to new languages, leveraging the strengths of pre-trained models while tailoring them to the characteristics of the target language.

Previous efforts to address this issue have primarily focused on training monolithic multilingual or language-specific models from scratch. However, these approaches face significant hurdles, including the curse of multilinguality, data scarcity, and the substantial computational resources required. Adapting English-centric models to new languages has emerged as a promising alternative, demonstrating the potential to outperform language-specific models pre-trained from scratch.

The SambaLingo methodology begins with the selection of a suitable base model that has already exhibited exceptional performance in its initial language. In this study, the researchers chose the open-source Llama2 7B model, renowned for its English language capabilities, as their starting point.

To effectively capture the linguistic nuances of the target language, the researchers expanded the model’s vocabulary by adding non-overlapping tokens from the target language and initializing them using sub-word embeddings from the original tokenizer. This crucial step ensures that the model can accurately tokenize and represent the new language, paving the way for seamless adaptation.
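The vocabulary expansion step can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes each new token's embedding is set to the average of the embeddings of the sub-word pieces the original tokenizer would split it into (one common choice; the article only says sub-word embeddings are used). The greedy splitter, the toy vocabulary, and the two-dimensional vectors are all hypothetical stand-ins for a real tokenizer and embedding matrix.

```python
from typing import Dict, List, Sequence, Tuple


def split_into_known_pieces(token: str, vocab: Dict[str, int]) -> List[str]:
    """Greedy longest-match split of `token` into pieces found in `vocab`.

    Hypothetical stand-in for the base model's tokenizer.
    """
    pieces: List[str] = []
    start = 0
    while start < len(token):
        for end in range(len(token), start, -1):
            if token[start:end] in vocab:
                pieces.append(token[start:end])
                start = end
                break
        else:
            raise ValueError(f"no known piece covers {token!r} at index {start}")
    return pieces


def expand_vocabulary(
    new_tokens: Sequence[str],
    base_vocab: Dict[str, int],
    base_embeddings: Sequence[Sequence[float]],
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Add non-overlapping tokens and initialize each one from its sub-word pieces."""
    vocab = dict(base_vocab)
    embeddings = [list(row) for row in base_embeddings]
    for token in new_tokens:
        if token in vocab:
            continue  # keep only tokens the vocabulary does not already contain
        pieces = split_into_known_pieces(token, base_vocab)
        rows = [base_embeddings[base_vocab[p]] for p in pieces]
        # Assumption: average the piece embeddings, dimension by dimension.
        mean = [sum(column) / len(rows) for column in zip(*rows)]
        vocab[token] = len(embeddings)
        embeddings.append(mean)
    return vocab, embeddings


# Toy data: three base tokens with 2-dimensional embeddings.
base_vocab = {"ka": 0, "pi": 1, "k": 2}
base_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
vocab, embeddings = expand_vocabulary(["kapi", "ka"], base_vocab, base_embeddings)
print(vocab["kapi"], embeddings[vocab["kapi"]])
```

Here "kapi" is new, so it is split into "ka" and "pi" and receives the mean of their two vectors, while "ka" is skipped because the base vocabulary already contains it.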

Next, the researchers employed a continual pre-training approach, feeding the model a carefully curated mixture of English and target language web data sourced from CulturaX. The data mixture followed a 1:3 ratio, biased towards the target language, to strike a delicate balance between preserving the model’s existing knowledge and adapting it to the new linguistic landscape.
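To make the 1:3 mixture concrete, the sketch below shows one way such a ratio could be applied: first as a split of a token budget, then as a per-document sampling probability. The function names, the budget figure, and the placeholder documents are hypothetical; loading CulturaX itself is not shown.

```python
import random
from typing import List, Sequence, Tuple


def split_token_budget(
    total_tokens: int, english_parts: int = 1, target_parts: int = 3
) -> Tuple[int, int]:
    """Split a token budget into (English, target-language) shares."""
    english = total_tokens * english_parts // (english_parts + target_parts)
    return english, total_tokens - english


def sample_mixture(
    english_docs: Sequence[str],
    target_docs: Sequence[str],
    n_samples: int,
    english_parts: int = 1,
    target_parts: int = 3,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Draw documents so that, in expectation, English : target = 1 : 3."""
    rng = random.Random(seed)
    p_english = english_parts / (english_parts + target_parts)
    samples: List[Tuple[str, str]] = []
    for _ in range(n_samples):
        if rng.random() < p_english:
            samples.append(("en", rng.choice(english_docs)))
        else:
            samples.append(("target", rng.choice(target_docs)))
    return samples


# Illustrative budget only; the article does not state this number.
print(split_token_budget(4_000_000))

batch = sample_mixture(["english doc"], ["target doc"], n_samples=10_000)
english_share = sum(1 for lang, _ in batch if lang == "en") / len(batch)
print(round(english_share, 2))
```

With a 1:3 ratio, roughly a quarter of the sampled documents are English and the rest are in the target language, which is what lets the model retain its English knowledge while most of the training signal comes from the new language.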

To further align the model with human preferences, the researchers implemented a two-stage process: supervised fine-tuning (SFT) followed by direct preference optimization (DPO). During SFT, they used the ultrachat-200k dataset and its machine-translated version. For DPO, they used the ultrafeedback and cai-conversation-harmless datasets, blending English and machine-translated data at a 10:1 ratio.
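For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair from four sequence log-probabilities. It is an illustration of the general objective, not the authors' training code; the value of beta and the log-probabilities are made-up numbers, since the article does not report them.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumption: illustrative value, not from the article
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    loss = -log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)])
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# Policy identical to the reference: the margin is 0, so the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))

# Policy has shifted probability toward the chosen response: the loss drops.
print(dpo_loss(-10.0, -17.0, -12.0, -15.0))
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference model and lowers the rejected one's, which is how the preference data steers the model without a separate reward model.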

The researchers evaluated the SambaLingo models across a range of tasks, including language modeling, translation, text classification, open-book and closed-book question answering, and several natural language understanding benchmarks, as shown in Table 1. The models were tested on nine typologically diverse languages: Arabic, Thai, Turkish, Japanese, Hungarian, Russian, Bulgarian, Serbian, and Slovenian.

Across multiple benchmarks, the SambaLingo models consistently outperformed existing state-of-the-art models in these languages. For instance, on perplexity, which measures language modeling performance, the SambaLingo models achieved lower scores than all existing baselines on a held-out set from their training data (Figure 1). Furthermore, when the method was applied to the larger Llama2 70B model, the resulting SambaLingo models performed even better, surpassing their 7B counterparts across multiple benchmarks despite being trained on fewer tokens.
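As a reminder of what the perplexity comparison measures, the short sketch below computes perplexity as the exponential of the average negative log-likelihood per token. It assumes natural-log token probabilities and uses made-up values; how the paper normalizes perplexity across models with different tokenizers is not covered here.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-mean log-probability) over the evaluated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every held-out token is as
# uncertain as a uniform choice among 4 options.
uniform_logprobs = [math.log(0.25)] * 8
print(perplexity(uniform_logprobs))

# Higher probabilities on the held-out tokens give lower perplexity.
better_logprobs = [math.log(0.5)] * 8
print(perplexity(better_logprobs))
```

Lower perplexity means the model assigns higher probability to the held-out text, which is why a lower score counts as better language modeling.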

To validate the quality of the model’s outputs and their alignment with human preferences, the researchers employed GPT-4 as an impartial judge, evaluating the model’s responses to real user prompts. The results were promising, with SambaLingo consistently outperforming other models in the same languages, as judged by GPT-4’s preferences and logical explanations.

In summary, the SambaLingo methodology represents a significant stride towards democratizing artificial intelligence across linguistic diversity. By leveraging the strengths of existing high-performing models and tailoring them to new linguistic landscapes, this approach offers a scalable and efficient solution to the challenge of language barriers. With its state-of-the-art performance and alignment with human preferences, SambaLingo paves the way for a future where the benefits of AI transcend linguistic boundaries, fostering inclusivity and accessibility for all.

Check out the Paper. All credit for this research goes to the researchers of this project.

Vineet Kumar is a consulting intern at MarktechPost. He is currently pursuing his BS from the Indian Institute of Technology (IIT), Kanpur. He is a Machine Learning enthusiast. He is passionate about research and the latest advancements in Deep Learning, Computer Vision, and related fields.
