Together AI Introduces StripedHyena-7B: An Alternative Artificial Intelligence Model Competitive with the Best Open-Source Transformers in Short and Long-Context Evaluations

Together AI has introduced the StripedHyena models, a new contribution to sequence modeling architectures. The models offer an alternative to conventional Transformers, with a focus on computational efficiency and performance.

This release includes the base model StripedHyena-Hessian-7B (SH 7B) and the chat model StripedHyena-Nous-7B (SH-N 7B). StripedHyena builds on lessons from earlier work on efficient sequence modeling architectures, such as H3, Hyena, HyenaDNA, and Monarch Mixer, which were introduced over the past year.

The researchers highlight that the model processes long sequences during training, fine-tuning, and generation with greater speed and memory efficiency. StripedHyena is a hybrid architecture that combines attention with gated convolutions arranged in what they call Hyena operators. According to the researchers, it is also the first alternative architecture competitive with strong Transformer base models. On short-context tasks, including OpenLLM leaderboard tasks, StripedHyena outperforms Llama-2 7B, Yi 7B, and the strongest Transformer alternatives, such as RWKV 14B.

The model was evaluated on benchmarks covering both short-context tasks and long prompts. Perplexity scaling experiments on Project Gutenberg books show that perplexity either saturates at 32k tokens or keeps decreasing beyond this point, suggesting that the model can assimilate information from longer prompts.
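For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token, so lower values mean the model predicts the text better. The snippet below is a minimal sketch of that definition only. The function name `perplexity` and the toy log-probabilities are illustrative assumptions, not the researchers' evaluation code or their data.

```python
import math


def perplexity(token_logprobs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities.

    `token_logprobs` is a hypothetical input: one log-probability per token,
    as a language model would assign to the tokens of a held-out text.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Toy values (made up for illustration): a model that assigns probability 0.25
# to every token is as uncertain as a uniform choice among 4 options.
uniform_four = [math.log(0.25)] * 8
print(round(perplexity(uniform_four), 6))
```

In a perplexity scaling experiment, a quantity like this would be computed repeatedly while the amount of preceding context is increased; a curve that keeps falling indicates that the extra context is being used.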

StripedHyena achieves this efficiency through its hybrid structure, which combines attention with gated convolutions organized into Hyena operators. To optimize the hybrid design, the researchers used grafting techniques, which allow the architecture to be modified during training.
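To make the term "gated convolution" concrete, here is a minimal single-channel sketch in plain Python: a causal convolution over a value stream, multiplied elementwise by a gate stream. This illustrates the general idea only and is not the StripedHyena implementation. The names `causal_conv` and `gated_conv`, the short explicit filter, and the toy inputs are all assumptions made for the example; the released model's filters, projections, and kernels are not described in this article.

```python
def causal_conv(x, h):
    """Causal 1-D convolution: y[t] = sum over k of h[k] * x[t - k].

    Only the current and past inputs contribute to each output position.
    """
    return [
        sum(h[k] * x[t - k] for k in range(min(len(h), t + 1)))
        for t in range(len(x))
    ]


def gated_conv(gate, value, h):
    """Multiply a causal convolution of `value` elementwise by `gate`."""
    if len(gate) != len(value):
        raise ValueError("gate and value must have the same length")
    return [g * c for g, c in zip(gate, causal_conv(value, h))]


# Toy inputs (hypothetical): a gate of 0 suppresses the output at that
# position, while larger gate values amplify it.
value = [1.0, 2.0, 3.0]
gate = [1.0, 0.0, 2.0]
filt = [1.0, 1.0]  # each output sums the current and previous value
print(gated_conv(gate, value, filt))
```

In a hybrid design of the kind described above, layers built from operations like this would be interleaved with attention layers; the exact layout is not given here.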

The researchers emphasized that one of the key advantages of StripedHyena is its speed and memory efficiency when training, fine-tuning, and generating long sequences. In end-to-end training, it outperforms an optimized Transformer baseline using FlashAttention v2 and custom kernels by over 30%, 50%, and 100% on sequences of length 32k, 64k, and 128k, respectively.
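As a rough way to read these percentages, the sketch below converts a speedup figure into the fraction of baseline wall-clock time it would imply. It assumes that "X% faster" means throughput is (1 + X/100) times the baseline, which is an interpretation and not something the researchers state. The function name `relative_time` is hypothetical, and the reported figures are lower bounds ("over" 30%, 50%, and 100%).

```python
def relative_time(speedup_percent):
    """Fraction of baseline time, assuming throughput = (1 + X/100) * baseline."""
    if speedup_percent <= -100:
        raise ValueError("speedup must be greater than -100%")
    return 1.0 / (1.0 + speedup_percent / 100.0)


# Reported lower-bound speedups by sequence length.
for seq_len, pct in [("32k", 30), ("64k", 50), ("128k", 100)]:
    print(f"{seq_len}: at most {relative_time(pct):.2f}x the baseline training time")
```

Under this reading, a 100% speedup at 128k corresponds to training in at most half the baseline time, and the advantage grows with sequence length.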

In the future, the researchers plan to make progress in several areas with the StripedHyena models. They want to train bigger models that can handle longer contexts, expanding the limits of information understanding. They also want to incorporate multi-modal support, so that the model can process and understand data from various sources, such as text and images.

Finally, they aim to improve the performance of the StripedHyena models so that they operate more effectively and efficiently.

In conclusion, the researchers see room for further improvement over Transformer models by introducing additional computation, such as multiple heads in gated convolutions. This approach, inspired by linear attention, has proven effective in architectures such as H3 and MultiHyena; it improves model quality during training and provides advantages for inference efficiency.


