Hugging Face Introduces StackLLaMA: A 7B Parameter Language Model Based on LLaMA and Trained on Data from Stack Exchange Using RLHF

Over the past few years, large language models have garnered significant attention from researchers and the general public alike because of their impressive capabilities. These models, such as GPT-3, can generate human-like text, converse with users, perform tasks such as text summarization and question answering, and even write code. In many scenarios, the quality of the generated text is what determines how good the model is judged to be. For a good user experience, for instance, the user expects the model to generate error-free, executable code or to write a poem that shows a certain level of creativity. Loss functions are used to capture these attributes. Most previous research relies on loss functions based on next-token prediction or similar criteria. An emerging line of research, however, uses human feedback as the measure of performance and optimizes the model against that feedback. This idea is known as Reinforcement Learning from Human Feedback (RLHF), and several powerful models, such as ChatGPT, GPT-4, and Claude, currently employ this technique.

Adding another model to the list of successful applications of RLHF, researchers from Hugging Face are releasing StackLLaMA, a 7B parameter language model based on Meta’s LLaMA model that has been trained to answer questions from Stack Exchange using RLHF with Hugging Face’s Transformer Reinforcement Learning (TRL) library. The researchers fine-tuned Meta’s original LLaMA model in three stages: Supervised Fine-tuning (SFT), Reward/Preference Modeling (RM), and Reinforcement Learning from Human Feedback (RLHF). The model is publicly accessible, and the entire training pipeline is available as part of the TRL library.

The Hugging Face researchers pointed out that RLHF is only a fine-tuning step; hence, choosing the initial model is a crucial preliminary decision. The researchers chose LLaMA, the recently introduced family of large language models developed by Meta AI. This collection of foundation language models is reported to outperform even GPT-3 and is available in sizes ranging from 7B to 65B parameters. The researchers decided to move forward with the 7B parameter model for their experiments. They also pointed out that a good dataset plays an important role in providing the right human feedback. On this front, the researchers chose the StackExchange dataset, which includes over 10 million question-answer pairs on a wide range of topics, including code snippets from StackOverflow. Another attractive feature of this dataset is that each answer comes with its number of upvotes and a label marking the accepted answer, which was quite helpful for the reward model.
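To make the role of upvotes and the accepted-answer label concrete, here is a minimal sketch of how the two signals could be folded into a single score per answer. The rule used below (log2 of one plus the upvotes, rounded, plus one for an accepted answer, and -1 for negatively voted answers) is an assumption modelled on the kind of upvote-based scoring used in preference-modeling work; the function name and the handling of edge cases are illustrative, not the team's actual code.

```python
import math


def answer_score(upvotes: int, accepted: bool) -> int:
    """Hypothetical score for one Stack Exchange answer.

    Assumed rule: round(log2(1 + upvotes)), plus 1 if the asker
    accepted the answer; answers with negative upvotes get -1.
    """
    if upvotes < 0:
        # Assumption: a negatively voted answer scores -1 regardless
        # of whether it was accepted.
        return -1
    score = round(math.log2(1 + upvotes))
    if accepted:
        score += 1
    return score


if __name__ == "__main__":
    # Illustrative answers to one question: (upvotes, accepted).
    answers = [(7, True), (0, False), (-3, False)]
    for upvotes, accepted in answers:
        print(upvotes, accepted, answer_score(upvotes, accepted))
```

The logarithm compresses large vote counts, so an answer with 1,000 upvotes is not treated as a thousand times better than one with a single upvote.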

The Hugging Face team first fine-tuned the model on its target domain (in their case, question answering) with the causal language modeling objective before training the reward model and tuning the model with reinforcement learning. To do so, the team trained the language model on a subset of the StackExchange dataset using a technique known as packing. Rather than padding sequences that are shorter than the desired length, which wastes computation on padding tokens, packing concatenates many texts, separated by an end-of-sequence token, and cuts the resulting stream into chunks of the context length, so that every token in a batch contributes to training. The model is trained this way for a few thousand steps, which concludes the supervised fine-tuning stage. The next step was to train the reward model. Because collecting manual annotations inside the RL loop is very time-consuming and labor-intensive, the researchers instead trained a reward model to imitate how a human would evaluate a text. One option is to predict a score or a binary good/bad label for each text directly. The team worked with pairs of answers instead: for questions with at least two answers, they selected a preferred answer based on a score metric and trained the reward model to rank that answer higher. The researchers evaluated the reward model on a subset of the dataset. Its final accuracy of 67% is respectable, considering how difficult the task is even for human annotators.
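The two steps in this paragraph can be sketched in a few lines of plain Python. Packing, as it is commonly implemented, joins tokenized texts with an end-of-sequence token and slices the stream into fixed-length chunks, and a reward model trained on answer pairs is typically optimized with a pairwise loss, -log(sigmoid(r_preferred - r_other)). The function names, the token IDs, and the choice to drop the incomplete final chunk are assumptions made for illustration; this is not the TRL implementation.

```python
import math


def pack_sequences(tokenized_texts, chunk_len, eos_id):
    """Concatenate token lists with an EOS separator and cut the
    stream into chunks of exactly chunk_len tokens (no padding).
    The incomplete tail chunk is dropped (an assumption)."""
    stream = []
    for tokens in tokenized_texts:
        stream.extend(tokens)
        stream.append(eos_id)
    n_full = len(stream) // chunk_len
    return [stream[i * chunk_len:(i + 1) * chunk_len] for i in range(n_full)]


def pairwise_preference_loss(reward_preferred, reward_other):
    """-log(sigmoid(reward_preferred - reward_other)), computed in a
    numerically stable way. Small when the preferred answer already
    receives the higher reward, large when the order is wrong."""
    diff = reward_preferred - reward_other
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


if __name__ == "__main__":
    # Three short "documents" as token IDs; 0 is a hypothetical EOS ID.
    print(pack_sequences([[1, 2, 3], [4, 5], [6]], chunk_len=4, eos_id=0))
    # Correct ranking gives a small loss, wrong ranking a large one.
    print(pairwise_preference_loss(2.0, 0.0))
    print(pairwise_preference_loss(0.0, 2.0))
```

Because the loss depends only on the difference between the two rewards, the reward model learns a relative ranking rather than an absolute score, which fits data where only the preferred answer of a pair is known.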

With the fine-tuned language model and the reward model at hand, the final step was to run the RL loop. This procedure can be summarised in three main stages: generating responses from prompts, rating the responses with the reward model, and running a reinforcement learning policy-optimization step with the ratings. Previous work on training language models with RL has shown that the model can learn to exploit the reward model by generating complete gibberish that nonetheless receives high rewards. To counter this, the researchers added a penalty to the reward: they keep a frozen reference copy of the model and penalize the trained model, via a KL-divergence term, when its generations drift too far from the reference. Based on experiments conducted by the team, the resulting model gives satisfactory results on a wide range of topics.
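The penalty on the reward can be illustrated with a small sketch. A common formulation in RLHF subtracts a scaled estimate of the KL divergence between the trained policy and a frozen reference model from the reward model's score, R = r - beta * KL. In the code below, the function name, the default value of beta, and the per-token estimate (the summed difference of log-probabilities on the sampled tokens) are assumptions for illustration rather than the team's exact implementation.

```python
def kl_penalized_reward(reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Return reward - beta * KL estimate for one generated response.

    policy_logprobs / reference_logprobs hold the log-probability each
    model assigns to every sampled token of the response. Their summed
    difference is a simple sample-based estimate of the KL divergence.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("both models must score the same tokens")
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward - beta * kl_estimate


if __name__ == "__main__":
    # Policy identical to the reference: no penalty is applied.
    print(kl_penalized_reward(1.0, [-1.0, -2.0], [-1.0, -2.0]))
    # Policy far more confident than the reference on these tokens:
    # the reward is reduced in proportion to the drift.
    print(kl_penalized_reward(1.0, [-0.5, -1.0], [-1.0, -2.0]))
```

A larger beta keeps the policy closer to the reference model at the cost of slower reward improvement; a smaller beta gives the policy more freedom, including the freedom to drift into degenerate text that fools the reward model.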

In a nutshell, the work of the Hugging Face researchers can be summarised as creating a human-annotated dataset, adapting the language model to the domain, training a reward model, and finally training the model with RL. Although StackLLaMA is a major stepping stone in the world of RLHF, the model is far from perfect. The Hugging Face team is still working on several open issues, such as occasional spikes in the loss, which make training unstable. Currently, the model has been released publicly for educational and research purposes related to RLHF and the TRL library. The team has also explicitly stated that the prompts entered into the app are being collected to further fine-tune the model. Users should therefore refrain from sharing any sensitive personal information in the app.


Check out the Demo, Code, and Blog. All credit for this research goes to the researchers on this project.