Meet LongLLaMA: A Large Language Model Capable of Handling Long Contexts of 256k Tokens

Language models have driven significant advances across many fields. However, effectively incorporating extensive new knowledge into these models remains a challenge. Fine-tuning, the common practice, is resource-intensive and complex to manage, and it does not always offer a straightforward way to incorporate new knowledge. To address this, researchers propose a promising alternative called the Focused Transformer (FOT).

The FOT technique aims to overcome the challenge of limited context length in language models. As the number of documents increases, the ratio of relevant to irrelevant tokens shrinks, and keys tied to irrelevant values begin to overlap with keys tied to relevant ones. The authors call this the distraction issue. FOT lets a subset of attention layers access an external memory of (key, value) pairs through the k-nearest neighbors (kNN) algorithm. This mechanism effectively extends the context length and helps address the distraction issue.
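The memory lookup described above can be sketched for a single query as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `knn_memory_attention`, the exact inner-product search, the scaling, and the default `k` are all assumptions made for the example, and the memory is assumed to be non-empty.

```python
import numpy as np

def knn_memory_attention(query, local_keys, local_values, mem_keys, mem_values, k=4):
    """Attend over the local context plus the top-k memory entries for one query.

    Hypothetical helper for illustration; shapes are
    query (d,), local_keys/local_values (n_local, d), mem_keys/mem_values (n_mem, d).
    """
    # Score every memory key by its inner product with the query.
    mem_scores = mem_keys @ query
    k = min(k, mem_keys.shape[0])
    # Indices of the k highest-scoring memory keys (exact search, unordered).
    top_idx = np.argpartition(-mem_scores, k - 1)[:k]

    # The attention layer sees local keys and the retrieved memory keys together.
    keys = np.concatenate([local_keys, mem_keys[top_idx]], axis=0)
    values = np.concatenate([local_values, mem_values[top_idx]], axis=0)

    # Standard scaled dot-product attention with a numerically stable softmax.
    logits = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values, top_idx
```

Only the `k` retrieved pairs enter the softmax, so the cost of the attention step stays small even when the memory holds many more entries than the local context.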

The training procedure of the Focused Transformer draws on contrastive learning. During training, the memory attention layers are exposed to both relevant and irrelevant keys, with the irrelevant keys acting like negative samples drawn from unrelated documents. This encourages the model to separate keys connected to semantically different values, giving the key space a clearer structure.
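To make the idea of mixing relevant and irrelevant keys concrete, here is a minimal NumPy sketch of one memory attention step during training. It is an illustration under stated assumptions, not the paper's code: the function name `mixed_document_attention`, the way other documents are picked from the batch, and the tensor layout are all hypothetical, and a real training loop would backpropagate through this attention in an autodiff framework.

```python
import numpy as np

def mixed_document_attention(queries, prev_keys, prev_values, doc_index, num_docs):
    """Attend over keys from the same document plus keys from other documents.

    Hypothetical shapes: queries (T, d) come from the current chunk of document
    `doc_index`; prev_keys and prev_values (B, M, d) hold the previous chunk of
    each of the B documents in the batch. Requires 1 <= num_docs <= B.
    """
    batch_size, mem_len, dim = prev_keys.shape
    # The document itself supplies relevant keys; the others act as negatives.
    others = [(doc_index + offset) % batch_size for offset in range(1, num_docs)]
    chosen = [doc_index] + others

    keys = prev_keys[chosen].reshape(-1, dim)
    values = prev_values[chosen].reshape(-1, dim)

    # Row-wise softmax over the combined key set.
    logits = queries @ keys.T / np.sqrt(dim)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)

    # Share of attention that lands on same-document keys (the first mem_len columns).
    relevant_mass = weights[:, :mem_len].sum(axis=-1)
    return weights @ values, relevant_mass
```

Tracking `relevant_mass` is one simple way to observe distraction: the lower it is, the more attention the irrelevant documents are absorbing.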

The researchers introduce LONGLLAMAs, which are OpenLLaMA models fine-tuned with FOT. The results show that the method does not require long contexts during training and can be applied to existing models. LONGLLAMAs significantly improve performance on tasks requiring long-context modeling, such as passkey retrieval.
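For readers unfamiliar with passkey retrieval, the task hides a short number inside a long stretch of filler text and asks the model to repeat it. The sketch below builds such a prompt in plain Python. The filler sentence, the prompt wording, and the helper names `make_passkey_prompt` and `is_correct` are assumptions chosen for illustration, not the exact setup used to evaluate LONGLLAMAs.

```python
import random

def make_passkey_prompt(num_filler_lines=200, seed=0):
    """Build a long prompt with one passkey line hidden among filler lines."""
    rng = random.Random(seed)
    passkey = rng.randint(10000, 99999)
    # Filler text is arbitrary; it only needs to be free of digits.
    filler = "The grass is green. The sky is blue. The sun is yellow."
    lines = [filler] * num_filler_lines
    # Insert the passkey at a random depth in the context.
    position = rng.randint(0, num_filler_lines)
    lines.insert(position, f"The pass key is {passkey}. Remember it.")
    prompt = "\n".join(lines) + "\nWhat is the pass key? The pass key is"
    return prompt, passkey

def is_correct(model_output, passkey):
    """Count an answer as correct if the passkey digits appear in the output."""
    return str(passkey) in model_output
```

Increasing `num_filler_lines` lengthens the context, so accuracy can be measured as a function of how far back the model has to look.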

The research contributions include identifying the distraction issue as a significant obstacle to scaling up context length in Transformer models, developing the Focused Transformer (FOT) to address it, and providing a simple implementation method that allows existing models to be augmented with memory without modifying their architecture. The resulting models, LONGLLAMAs, show improvements on tasks that benefit from a larger number of few-shot demonstrations in the extended context. The FOT’s capabilities are further analyzed across various datasets and model sizes, demonstrating improvements in perplexity over baselines in long-context language modeling tasks.

In summary, the Focused Transformer (FOT) technique addresses the distraction issue and allows context length extension in language models. Training the model to differentiate between relevant and irrelevant keys gives the key space a clearer structure and significantly improves performance on tasks requiring long-context modeling. The FOT method can be applied to existing models without architectural modifications, making it a cost-effective way to augment models with memory.


Niharika is a technical consulting intern at Marktechpost. She is a third-year undergraduate pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a highly enthusiastic individual with a keen interest in machine learning, data science, and AI, and an avid reader of the latest developments in these fields.
