BigScience AI Researchers Open-Source ‘BLOOM’: An Autoregressive Multilingual Large Language Model Larger Than GPT-3 and OPT-175B

Businesses are increasingly adopting ML and AI technologies to enhance their products and services. These systems include language models used for tasks such as predicting the next word you'll type on your phone so you can finish a message more quickly.

In recent years, large machine learning (ML) models have revolutionized the field of AI research. Still, only a few teams have been able to train and study them due to the high computational costs and massive training data involved. Furthermore, the information about training these AI models, along with their metadata and code, remains largely unshared and out of reach of the broader AI community.

To address these shortcomings, the BigScience Project introduces BLOOM (BigScience Large Open-science Open-access Multilingual Language Model), the first multilingual Large Language Model (LLM) trained in complete transparency by the largest collaboration of AI researchers to date. In contrast to the traditional secrecy of industrial AI research laboratories, the project demonstrates that the broader research community can train and release capable AI models responsibly and openly.

The BigScience research project was launched in 2021. It involves around 1,000 researchers from over 60 countries and more than 250 institutions. The research is led by Hugging Face with the support of GENCI, the IDRIS team at the CNRS, the Megatron team at NVIDIA, and the DeepSpeed team at Microsoft. Hugging Face released a free web app that lets anyone try BLOOM without having to download it.

Every model used in this study is built on a decoder-only Transformer pretrained with an autoregressive language modeling objective. As discussed in their paper, "What Language Model to Train if You Have One Million GPU Hours?", researchers frequently choose this architecture for large language models because it allows for zero-shot application to numerous downstream tasks.
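The autoregressive objective means the model repeatedly predicts the next token given everything generated so far. The following is a minimal sketch of that greedy decoding loop, where `next_token_logits` is a hypothetical toy stand-in (a hard-coded bigram table) for a real Transformer's forward pass, which would instead score every token in the vocabulary:

```python
# Minimal sketch of autoregressive (greedy) decoding.
# `next_token_logits` is a hypothetical toy "model" for illustration only;
# a decoder-only LLM like BLOOM computes logits over its full vocabulary.

def next_token_logits(context):
    # Toy scoring: fixed bigram scores keyed on the most recent token.
    bigrams = {
        "the": {"cat": 2.0, "dog": 1.0, "<eos>": 0.1},
        "cat": {"sat": 2.0, "<eos>": 0.5},
        "sat": {"<eos>": 2.0},
    }
    return bigrams.get(context[-1], {"<eos>": 1.0})

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)
        # Greedy decoding: always take the highest-scoring next token.
        best = max(logits, key=logits.get)
        if best == "<eos>":
            break
        tokens.append(best)
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat']
```

Each iteration feeds the entire sequence so far back into the model, which is why the same pretrained network can be applied zero-shot to many downstream tasks simply by phrasing them as text continuation.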

The BLOOM model has 176 billion parameters and was trained for 11 weeks on the Jean Zay supercomputer in France. As a result, BLOOM can generate text in 46 natural languages and dialects and 13 programming languages. It can also follow prompts to complete novel tasks such as writing recipes, extracting data from news articles, or creating sentences using newly invented words, despite never having been trained on those particular tasks. For many of these languages, including Spanish, French, and Arabic, it is the first language model with more than 100B parameters ever created.

According to their research, zero-shot generalization can be improved by supplementing Common Crawl data with high-quality cross-domain curated data. In their investigation of multilingualism, they discovered that on English zero-shot benchmarks, multilingual models considerably underperform their monolingual counterparts.

To ensure that the training corpus was consistent with their values, the team took a data-first approach. BigScience's multidisciplinary and multinational structure allowed them to critically evaluate each stage of the process from various perspectives, including ethical, legal, environmental, linguistic, and technical considerations, without compromising model performance.

To provide a framework for building and releasing such models, the team also published its Responsible AI License and Ethical Charter. The work demonstrates practical applications of scaling laws in constructing large language models, and in contrast to earlier efforts, it provides complete justifications for all architectural parameters.

Among the basic principles that set it apart from similar studies of large language models are the following:

Openness: All BigScience meeting minutes, discussions, and code are available for public viewing. Throughout the process, the model's training progress was made public, and all the information required for someone else to reproduce this work was provided. Numerous research articles written by hundreds of contributors have already been produced as a result of BigScience's "open first" methodology.

Accessibility: The team is creating an easy-to-use API, making it freely available to all researchers.

Multilingualism: Unlike monolingual models such as LaMDA and GPT-3, BLOOM is multilingual, trained on 46 natural languages and 13 programming languages.

Researchers can now download, run, and study BLOOM to investigate the performance and behaviour of these newly released massive language models down to their most fundamental internal operations.

The chair of BigScience believes that BigScience is distinctively participatory and people-first, bringing together viewpoints from thousands of multidisciplinary scholars worldwide. They believe this is the most effective way to work with those who are utilizing this technology to spread the values of responsibility and inclusivity.

The team believes that with continued workshops and experiments, BLOOM's performance will continue to improve. The team plans to increase the number of supported languages and to reduce the model's size while maintaining performance.

References:

  • https://huggingface.co/blog/bloom
  • https://huggingface.co/bigscience/bloom#evaluation

Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Bhubaneswar. She is a Data Science enthusiast with a keen interest in the applications of artificial intelligence in various fields. She is passionate about exploring new advancements in technology and their real-life applications.