RakutenAI-7B: A Suite of Japanese-Oriented Large Language Models That Achieve Strong Performance on the Japanese LM Harness

Natural Language Processing (NLP) models are pivotal for various applications, from translation services to virtual assistants. They allow software to comprehend language and generate human-like responses. As the technology advances, these models have become increasingly sophisticated, offering more nuanced understanding and interaction capabilities.

A persistent challenge in NLP is the development of models that can understand and generate text in languages other than English, such as Japanese. Despite the advancements in LLMs, many languages remain underrepresented in the resources available for training these models. This resource gap leads to models that struggle to handle the nuances of languages with complex scripts or grammatical structures, which affects both the quality of machine-generated text and the model’s understanding of the language.

Current efforts to bridge this gap have led to the development of models that provide better support for underrepresented languages. However, these models often have limitations of their own, such as inefficient tokenization, especially for languages with complex scripts like Japanese. Tokenization, the process of breaking text down into manageable pieces for the model, is a crucial step in training and using LLMs effectively.
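To make the tokenization cost concrete, the sketch below uses a toy greedy longest-match tokenizer that falls back to one token per UTF-8 byte when no vocabulary piece matches, as some subword tokenizers do. It is not the RakutenAI-7B tokenizer: the function names and both vocabularies are hypothetical and exist only to show how a vocabulary with more Japanese pieces raises the number of characters covered per token.

```python
def byte_fallback_tokenize(text, vocab):
    """Toy greedy longest-match tokenizer with UTF-8 byte fallback."""
    max_len = max((len(piece) for piece in vocab), default=0)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary piece that matches at position i.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # No piece matched: emit one token per UTF-8 byte of this character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


def chars_per_token(text, tokens):
    """Average number of characters represented by each token."""
    return len(text) / len(tokens)


text = "日本語の文章"

# Hypothetical vocabularies, for illustration only.
small_vocab = {"の"}
extended_vocab = {"の", "日本語", "文章"}

small_tokens = byte_fallback_tokenize(text, small_vocab)
extended_tokens = byte_fallback_tokenize(text, extended_vocab)

print(len(small_tokens), chars_per_token(text, small_tokens))
print(len(extended_tokens), chars_per_token(text, extended_tokens))
```

With the small vocabulary, each unmatched kanji costs three byte tokens, so the six-character string needs 16 tokens (0.375 characters per token). With the extended vocabulary it needs 3 tokens (2.0 characters per token). Fewer tokens for the same text means shorter sequences to train on and to generate.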

Rakuten Group, Inc. researchers have introduced RakutenAI-7B, a suite of Japanese-oriented LLMs. The suite includes foundation models alongside instruction- and chat-tuned models, released under the Apache 2.0 license. These models are designed to accommodate the Japanese language better, incorporating extended vocabularies and improved tokenization techniques for enhanced performance.

RakutenAI-7B’s methodology includes extending the vocabulary of its tokenizer to 48,000 tokens, which improves the character-per-token rate for Japanese text, so that each token covers more characters on average. This expansion was essential for handling the complexities of the Japanese script efficiently. In parallel, the researchers applied rigorous data filtering techniques to refine the quality of the training datasets. The filtered data, purged of personally identifiable information and low-quality inputs, amounted to approximately 175 billion tokens. This curation is intended to make the model’s outputs more coherent and relevant. Together, the improved tokenization and careful data curation prepare the model for strong performance across various NLP tasks.

Details of a few of the evaluation datasets used:

  • XLSUM-ja is a Japanese subset of the XLSUM dataset, which is used for abstractive summarization evaluation.
  • MARC-ja is a Japanese subset of the MARC dataset, which is used for text classification tasks related to sentiment analysis. 
  • JSQuAD is a Japanese reading comprehension dataset that measures a model’s ability to answer questions given a passage. 
  • JAQKET is a Japanese open-domain question-answering dataset that measures a model’s knowledge of various topics.
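The article does not say how answers on these datasets are scored. For question-answering sets such as JSQuAD and JAQKET, a common choice is exact match, so the sketch below assumes that metric. The normalization step (Unicode NFKC plus whitespace removal) and the function names are assumptions for illustration and are not taken from the Japanese LM Harness.

```python
import unicodedata


def normalize_answer(text):
    """Fold full-width/half-width variants with NFKC and remove all whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(text.split())


def exact_match(prediction, references):
    """True if the prediction equals any reference after normalization."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ref) for ref in references)


def exact_match_score(predictions, references_list):
    """Percentage of predictions that exactly match one of their references."""
    hits = sum(
        exact_match(pred, refs)
        for pred, refs in zip(predictions, references_list)
    )
    return 100.0 * hits / len(predictions)


# Made-up predictions and references, for illustration only.
predictions = [" 東京都 ", "１９６４年", "大阪", "富士山"]
references_list = [["東京都"], ["1964年"], ["京都"], ["富士山", "ふじさん"]]

print(exact_match_score(predictions, references_list))
```

In this made-up example three of the four predictions match, giving a score of 75.0. Classification sets such as MARC-ja are usually scored with plain accuracy, and summarization sets such as XLSUM-ja with overlap metrics, which need different scoring code.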

RakutenAI-7B outperformed other Japanese-oriented large language models in benchmark evaluations, achieving an average score of 62.83 on the Japanese LM Harness, more than three points higher than the nearest competitor. The model also performed well on English-language tasks, which points to its versatility. The instruction-tuned variant, RakutenAI-7B-instruct, went further, securing an average Japanese LM Harness score of 68.74 and leading its nearest competitor by almost two points. These results highlight RakutenAI-7B’s strong performance across various NLP tasks.

In conclusion, RakutenAI-7B represents a significant stride towards creating more inclusive and efficient language models. Developed with a systematic approach and high-quality datasets, the model performs consistently well across various NLP tasks and outperforms other open Japanese models. Its tokenizer is better suited to processing Japanese text, which could lead to faster and cheaper training and inference. These results make it a valuable resource for researchers, developers, and industry practitioners.

Check out the Paper. All credit for this research goes to the researchers of this project.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.
