Researchers from Cerebras & Neural Magic Introduce Sparse Llama: The First Production LLM based on Llama at 70% Sparsity

Natural Language Processing (NLP) is a cutting-edge field that enables machines to understand, interpret, and generate human language. It has applications in various domains, such as language translation, text summarization, sentiment analysis, and the development of conversational agents. Large language models (LLMs) have significantly advanced these applications by leveraging vast amounts of data to perform tasks with high accuracy, in some cases approaching human performance.

A primary challenge in NLP today is the enormous computational and energy cost of training and deploying these LLMs. The sheer size of the models makes them expensive to run and less accessible to a broader audience. This cost restricts their usability and underscores the need to reduce the computational footprint without compromising accuracy. Addressing this challenge is crucial for making these powerful tools more widely available and sustainable.

Various methods have been employed to mitigate these challenges and reduce LLMs’ size and computational requirements. Quantization is one technique that reduces the number of bits required to represent each model parameter, while pruning involves removing less important weights to streamline the model. However, both methods face significant hurdles in maintaining high accuracy, especially for complex tasks. Current techniques often struggle to achieve meaningful compression ratios without damaging model performance, particularly at high sparsity levels.
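To make the two techniques concrete, here is a minimal sketch in plain Python. It assumes symmetric per-tensor int8 quantization and simple magnitude pruning on a toy weight list; the function names and the weight values are illustrative only and are not taken from the research described here.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> int8 codes plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]


def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    n_prune = round(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Toy weights (hypothetical values, for illustration only).
weights = [0.8, -0.05, 0.3, -1.2, 0.01, 0.6, -0.02, 0.4, 0.07, -0.9]

codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)      # close to `weights`, within scale / 2
pruned = magnitude_prune(weights, 0.5)   # half of the entries become 0.0
```

Quantization keeps every weight but stores it with fewer bits, so the error per weight is bounded by half the scale. Pruning keeps full precision but discards weights entirely, which is why accuracy becomes harder to preserve as sparsity rises.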

Researchers from Neural Magic, Cerebras Systems, and IST Austria have introduced a novel approach for creating sparse foundational versions of large language models. They specifically targeted the Llama-2 7B model, combining the SparseGPT pruning method with sparse pretraining techniques. The method seeks to achieve high sparsity levels while preserving or enhancing the model’s accuracy. The approach first prunes the model to 50% sparsity, then applies further iterative training and pruning steps to reach 70% sparsity.

The method begins with sparse pretraining on subsets of high-quality datasets such as SlimPajama and The Stack. The process includes fine-tuning with per-layer distillation, which helps the model retain high accuracy across complex tasks, including chat, code generation, and instruction following. The 50% sparse model is trained until convergence and then pruned further to reach the 70% target. Pruned weights are frozen, and sparsity masks are enforced during training to maintain the desired sparsity levels. This iterative process is crucial for maintaining high recovery levels after fine-tuning.
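The prune, train, prune-again loop with enforced masks can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the researchers’ implementation: it substitutes simple magnitude pruning for SparseGPT, uses a quadratic loss in place of language-model pretraining, and all names (`build_mask`, `train_with_mask`, `target`) are hypothetical.

```python
def build_mask(weights, sparsity):
    """Return a 0/1 mask that drops the smallest-magnitude `sparsity` fraction.

    Magnitude pruning is used here as a stand-in for SparseGPT.
    """
    n_prune = round(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


def train_with_mask(weights, mask, target, steps=200, lr=0.1):
    """Gradient descent on 0.5 * sum((w - t)^2), re-applying the mask each step.

    Multiplying by the mask after every update keeps pruned weights at zero.
    """
    w = [wi * mi for wi, mi in zip(weights, mask)]
    for _ in range(steps):
        grads = [wi - ti for wi, ti in zip(w, target)]
        w = [(wi - lr * gi) * mi for wi, gi, mi in zip(w, grads, mask)]
    return w


# Toy optimum and a toy "dense" starting point (hypothetical values).
target = [1.5, -0.1, 0.4, -2.0, 0.05, 0.9, -0.2, 0.7, 0.3, -1.1]
weights = [0.5 * t for t in target]

# Stage 1: prune to 50% sparsity, then train with the mask enforced.
mask_50 = build_mask(weights, 0.5)
weights = train_with_mask(weights, mask_50, target)

# Stage 2: prune the trained 50% model to 70%, then train again.
mask_70 = build_mask(weights, 0.7)
weights = train_with_mask(weights, mask_70, target)

sparsity = weights.count(0.0) / len(weights)
```

Because the weights that were already zero have the smallest magnitude, the 70% mask only removes additional weights on top of the 50% mask; nothing that was pruned in the first stage is revived in the second.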

The sparse models demonstrated the ability to achieve up to 70% sparsity while fully recovering accuracy for fine-tuning tasks. Training acceleration on Cerebras CS-3 chips closely matched theoretical scaling, showcasing the efficiency of the approach. Inference speeds increased significantly, with improvements of up to 3x on CPUs using Neural Magic’s DeepSparse engine and 1.7x on GPUs using the nm-vllm engine. Additionally, the combination of sparsity and quantization resulted in total speedups on CPUs reaching up to 8.6x, highlighting the method’s efficiency and effectiveness.
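For a sense of scale, the ideal speedup from sparsity can be worked out under a simplifying assumption: that compute time is proportional to the number of non-zero weights, so a model with sparsity s does a fraction (1 - s) of the dense work. This assumption and the helper name `ideal_speedup` are illustrative, not definitions taken from the paper.

```python
def ideal_speedup(sparsity):
    """Ideal speedup if run time scales with the fraction of non-zero weights."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    return 1.0 / (1.0 - sparsity)


for s in (0.5, 0.7):
    print(f"{s:.0%} sparsity -> ideal speedup {ideal_speedup(s):.2f}x")
```

Under this assumption the bound is 2x at 50% sparsity and about 3.33x at 70%, so a measured CPU speedup of up to 3x would sit close to the ideal. Real hardware rarely reaches the bound exactly because of memory traffic and the overhead of handling sparse formats.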

The study’s results underscore the potential of combining sparsity with quantization to achieve dramatic speedups and performance gains. The sparse pretraining methodology proved particularly effective, demonstrating high recovery at up to 70% sparsity levels. The integration of Cerebras’s CS-3 AI accelerator for sparse pretraining further highlighted the advantages of this approach, enabling near-ideal speedups and significantly reducing computational requirements.

In conclusion, this research successfully addresses the challenge of reducing the computational demands of LLMs while maintaining their performance. The innovative sparse pretraining and deployment techniques introduced by the Neural Magic, Cerebras Systems, and IST Austria researchers offer a promising solution to the problem. This approach not only enhances the efficiency and accessibility of NLP models but also sets the stage for future advancements in the field.


Check out the Paper and Model. All credit for this research goes to the researchers of this project.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has over 2 million monthly views.
