UC Berkeley Researchers Introduce Starling-7B: An Open Large Language Model (LLM) Trained by Reinforcement Learning from AI Feedback (RLAIF)

Large Language Models (LLMs) are artificial intelligence models for natural language processing tasks. Trained on massive datasets, they can understand and generate human-like text, and their utility now extends to nearly every field.

The UC Berkeley researchers have introduced Starling-7B, an open large language model (LLM) trained by Reinforcement Learning from AI Feedback (RLAIF). The model leverages their newly developed reward training and policy tuning pipeline together with Nectar, their new GPT-4-labeled ranking dataset.

The foundation of Starling-7B lies in Nectar, the GPT-4-labeled ranking dataset. It features 183,000 chat prompts, each paired with seven responses drawn from models such as GPT-4, GPT-3.5-instruct, GPT-3.5-turbo, Mistral-7B-Instruct, and Llama2-7B, yielding 3.8 million pairwise comparisons in total. To ensure fairness, the researchers devoted considerable effort to mitigating positional bias when prompting GPT-4 for rankings, a process thoroughly detailed in the dataset section.
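The 3.8 million figure follows directly from the dataset's shape: ranking seven responses per prompt induces C(7, 2) = 21 unordered pairwise comparisons, and 183,000 prompts × 21 pairs gives roughly 3.8 million. A quick arithmetic sketch:

```python
from math import comb

prompts = 183_000          # chat prompts in Nectar
responses_per_prompt = 7   # ranked responses per prompt

# A ranking over 7 responses induces C(7, 2) pairwise comparisons.
pairs_per_prompt = comb(responses_per_prompt, 2)
total_pairs = prompts * pairs_per_prompt

print(pairs_per_prompt)  # 21
print(total_pairs)       # 3,843,000, i.e. about 3.8 million
```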


They used the learned reward model to refine the Openchat 3.5 language model, with impressive results: the AlpacaEval score rose from 88.51% to 91.99%, while the MT-Bench score increased from 7.81 to 8.09. Both benchmarks serve as standard measures of how helpful a chatbot is.
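The article does not spell out the reward model's training objective, but a common choice for learning from pairwise rankings like Nectar's (an assumption here, not a confirmed detail of Starling-7B) is the Bradley-Terry preference loss, which penalizes the model when it scores the rejected response above the chosen one. A minimal sketch:

```python
import math

def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Low when the reward model scores the preferred response higher,
    high when it scores the rejected response higher.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical reward scores for one comparison pair:
print(round(pairwise_preference_loss(2.0, 0.5), 3))  # 0.201 (correct ordering, low loss)
print(round(pairwise_preference_loss(0.5, 2.0), 3))  # 1.701 (wrong ordering, high loss)
```

Summed over Nectar's millions of comparison pairs, minimizing this loss trains a scalar reward model that can then guide policy tuning.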

The researchers compared the model against earlier open-source models such as Zephyr-7B, Neural-Chat-7B, and Tulu-2-DPO-70B, which were trained using Direct Preference Optimization (DPO). While these models performed well in Chatbot Arena, they may not have lived up to the full potential of RLHF when compared with top SFT models such as OpenHermes 2.5 and Openchat 3.5 on MT-Bench.
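For context, DPO (Rafailov et al.) skips an explicit reward model and instead optimizes the policy directly on preference pairs, anchored to a frozen reference model. A per-pair sketch of the loss, with toy log-probabilities (the values are illustrative assumptions, not data from these models):

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Pushes the policy to raise the chosen response's log-likelihood
    relative to the rejected one, measured against a reference model.
    """
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy sequence log-probabilities: the policy already favors the chosen response.
print(round(dpo_loss(-10.0, -12.0, -11.0, -11.0), 3))  # 0.598
```

Averaging this loss over a preference dataset is the whole training objective, which is why DPO is simpler to run than a full RLHF/RLAIF pipeline with a separate reward model.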

The researchers emphasized that the model still faces certain challenges: it is susceptible to deceitful or manipulative prompts, struggles with mathematical and reasoning tasks, and cannot always guarantee the factual accuracy of its outputs. They also noted occasional verbosity and susceptibility to jailbreaking prompts. Despite these flaws, they remain dedicated to improving Starling-7B.

To address these problems, they proposed refining the model further with rule-based reward models in which GPT-4 serves as a guide, using the techniques outlined in the GPT-4 Technical Report.

In conclusion, Starling-7B represents a significant advancement in LLMs and illustrates the possibilities of Reinforcement Learning from AI Feedback. The field of natural language processing continues to benefit from the interplay between such models and the community's shared knowledge, and the researchers are working to improve the model's performance and address its limitations.
