OpenLLaMA, an open-source reproduction of Meta AI’s LLaMA model, has been released. Its creators have published a permissively licensed 7B-parameter model trained on 200 billion tokens. The release includes PyTorch and JAX weights for the pre-trained model, evaluation results, and a comparison against the original LLaMA models. This matters for machine learning research in particular, since researchers who need large language models often face restrictions on access to proprietary ones.
The creators of OpenLLaMA trained their models on the RedPajama dataset, a reproduction of the LLaMA training dataset containing over 1.2 trillion tokens. They followed the same preprocessing steps and training hyperparameters as the original LLaMA paper, including model architecture, context length, training steps, learning rate schedule, and optimizer. The only difference from the original setup is the data: OpenLLaMA uses RedPajama rather than the dataset used to train the original LLaMA.
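To make the "learning rate schedule" item concrete, here is a minimal sketch of a warmup-plus-cosine-decay schedule of the kind the LLaMA paper describes. The article does not list the actual values, so the defaults below (peak rate 3e-4, 2,000 warmup steps, decay to 10% of the peak) are assumptions modeled on the LLaMA paper's 7B setup. The function name `cosine_lr` and the step count in the usage lines are hypothetical, and this is not EasyLM's implementation.

```python
import math


def cosine_lr(step, total_steps, peak_lr=3e-4, warmup_steps=2000, final_ratio=0.1):
    """Learning rate at `step`: linear warmup, then cosine decay.

    All default values are assumptions for illustration, not figures
    taken from the OpenLLaMA release.
    """
    if step < warmup_steps:
        # Linear warmup from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    min_lr = peak_lr * final_ratio
    # Cosine interpolation from peak_lr (progress=0) down to min_lr (progress=1).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length, chosen only to show the shape of the curve.
total = 250_000
for s in (0, 1_000, 2_000, 125_000, 250_000):
    print(s, f"{cosine_lr(s, total):.2e}")
```

The rate rises linearly during warmup, peaks when warmup ends, and then falls smoothly to the floor set by `final_ratio`.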
The models were trained on cloud TPU-v4s using EasyLM, a JAX-based pipeline for training and fine-tuning language models. The team combined standard data parallelism with fully sharded data parallelism (also known as ZeRO stage 3) to balance training throughput against memory usage. Overall, the training run achieved a throughput of over 1,900 tokens per second per TPU-v4 chip.
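The per-chip throughput can be turned into a rough wall-clock estimate. The sketch below assumes steady throughput and perfect scaling across chips. The article does not state how many chips were used, so the value of 256 is purely hypothetical, and `training_days` is an illustrative helper rather than part of EasyLM.

```python
SECONDS_PER_DAY = 86_400


def training_days(total_tokens, tokens_per_sec_per_chip, num_chips):
    """Wall-clock days to process `total_tokens` at a steady per-chip throughput."""
    tokens_per_sec = tokens_per_sec_per_chip * num_chips
    return total_tokens / tokens_per_sec / SECONDS_PER_DAY


# 1,900 tokens/s/chip is the figure reported in the article;
# num_chips=256 is a made-up value for illustration only.
days_200b = training_days(200e9, 1900, num_chips=256)
days_1t = training_days(1e12, 1900, num_chips=256)
print(f"{days_200b:.1f} days for 200B tokens, {days_1t:.1f} days for 1T tokens")
```

Because the reported figure is a lower bound ("over 1,900"), estimates like these are upper bounds on training time for a given chip count.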
OpenLLaMA was evaluated on several tasks using the lm-evaluation-harness, and the results were compared against the original LLaMA model and GPT-J, a 6B-parameter model trained on the Pile dataset by EleutherAI. The metrics for the original LLaMA model were generated by running it on the same tasks. These differ slightly from the numbers reported in the original LLaMA paper, which may be due to differences in evaluation protocols. According to the presented results, OpenLLaMA performs comparably to or better than the original LLaMA and GPT-J on most tasks. At the time of evaluation OpenLLaMA had been trained on only 200 billion tokens, compared with 1 trillion for the original LLaMA and 500 billion for GPT-J, so the team expects its performance to improve further once training on 1 trillion tokens is complete.
To encourage feedback and collaboration from the community, the team behind OpenLLaMA has released a preview checkpoint of the weights. The weights are available in two formats: an EasyLM format for use with the EasyLM framework and a PyTorch format for use with the Hugging Face transformers library. Unlike the original LLaMA model, OpenLLaMA’s tokenizer and weights are trained entirely from scratch, so obtaining the original LLaMA tokenizer and weights is not necessary. Note, however, that OpenLLaMA uses the BOS (beginning of sentence) token (id=1) during training, so this token should be prepended for best performance in few-shot evaluation. The preview checkpoint weights and the EasyLM framework are permissively licensed under the Apache 2.0 license. The team is currently focused on completing training on the entire RedPajama dataset to allow an apples-to-apples comparison between the original LLaMA and OpenLLaMA. They are also training a smaller 3B model for low-resource use cases, and plan to release more updates soon.
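The BOS requirement mentioned above can be illustrated with a small, library-free sketch. It assumes the prompt has already been tokenized into a list of integer ids. The helper `with_bos` and the example ids are hypothetical, and only the BOS id of 1 comes from the article. In practice a tokenizer may already add this token, which is why the helper checks before prepending.

```python
BOS_ID = 1  # BOS token id as stated in the article


def with_bos(token_ids, bos_id=BOS_ID):
    """Return a copy of `token_ids` that starts with the BOS id.

    The id is added only if it is not already the first element,
    so applying the helper twice does not duplicate it.
    """
    ids = list(token_ids)
    if ids and ids[0] == bos_id:
        return ids
    return [bos_id] + ids


# Hypothetical ids standing in for a tokenized few-shot prompt.
prompt_ids = [306, 4658, 278]
print(with_bos(prompt_ids))  # [1, 306, 4658, 278]
```

Making the helper idempotent is a deliberate choice: evaluation pipelines often pass prompts through several stages, and a doubled BOS token would be as much a mismatch with training as a missing one.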
Niharika is a technical consulting intern at Marktechpost. She is a third-year undergraduate pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She has a keen interest in machine learning, data science, and AI, and is an avid reader of the latest developments in these fields.