OpenBezoar: A Family of Small, Cost-Effective, and Open-Source AI Models Trained on Mixed Instruction Data

The recent success of instruction fine-tuning pre-trained Large Language Models (LLMs) for downstream tasks has attracted significant interest in the Artificial Intelligence (AI) community, because it allows models to be aligned with human preferences. To ensure that these fine-tuned models accurately reflect those preferences, methods such as Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) have been developed.

In Supervised Fine-Tuning (SFT), a pre-trained LLM is trained on instruction-response pairs, customizing it to execute specific tasks. This not only helps the model generate coherent answers but also shows how supervised learning lets it adapt efficiently to new tasks by learning from examples.


Due to the enormous size of the most sophisticated LLMs, which have over 100 billion parameters, many businesses and individuals cannot afford the computational expense of SFT. Studies have shown that models with fewer parameters can perform well in some cases, even outperforming larger models. Traditionally, fine-tuning relies on datasets containing large numbers of human-written examples, which increases the adaptability of the resulting models. However, creating these datasets is costly and time-consuming, and the commercial use of models trained on them is frequently restricted by licensing requirements.

In a recent study, a team of researchers from Surge Global generated instruction-response pairs using open-source instruction-tuned models licensed for commercial use in order to overcome these limitations. They developed three dataset-generation schemes, producing instruction datasets that can be used commercially.

A human-proxy model has been used to further filter these datasets for quality and diversity. SFT has then been applied to the selected base model using QLoRA, yielding three adapter models. These, together with one alignment-tuned model, make up the OpenBezoar family of models.
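The adapter mechanism behind QLoRA can be illustrated in a few lines: the base weight matrix stays frozen (and, in real QLoRA, 4-bit quantized), while only two small low-rank factors are trained. The sketch below, in plain Python with illustrative sizes and hyperparameters (the article does not specify the ones actually used), shows the effective weight `W + (alpha / r) * B @ A`:

```python
# Minimal sketch of the low-rank (LoRA) update at the heart of QLoRA.
# Sizes, rank r, and alpha are illustrative, not the paper's values.

def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def scale(c, X):
    return [[c * a for a in row] for row in X]

d, r, alpha = 4, 2, 16

W = [[0.1 * (i + j) for j in range(d)] for i in range(d)]  # frozen base weight
A = [[0.01] * d for _ in range(r)]                          # trainable, rank r
B = [[0.0] * r for _ in range(d)]                           # zero-initialized,
                                                            # so the adapter
                                                            # starts as a no-op

def adapted_weight():
    # Effective weight seen by the forward pass: W + (alpha / r) * B @ A.
    return add(W, scale(alpha / r, matmul(B, A)))

# With B all-zero, the adapted layer equals the frozen base layer exactly.
assert adapted_weight() == W
```

Because only `A` and `B` (a tiny fraction of the base model's parameters) receive gradients, fine-tuning at the 3B scale stays cheap, which is the cost-effectiveness the authors emphasize.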

The goal of this work is to develop the OpenBezoar family of models by fine-tuning the OpenLLaMA 3Bv2 base model. The process involves the following steps:

  1. Data Generation: An open, commercially licensed, instruction-fine-tuned variant of the Falcon-40B model has been used to generate synthetic instruction-tuning data. Three distinct techniques have been used for data generation: LaMini-LM, WizardLM/Evol-Instruct (using databricks-dolly-15k as the seed dataset), and Orca (using the Flan Collection as the seed dataset).
  2. Data Filtering: To guarantee quality and relevance, the generated data is filtered using GPT-4 as a human proxy.
  3. Supervised Fine-Tuning: Each scheme undergoes a sequential process of cost-effective QLoRA-based supervised fine-tuning, in which model parameters are updated to improve performance on specific tasks.
  4. Minimization of Distribution Shift: To ensure that the model performs well across a variety of datasets, the supervised fine-tuned checkpoint is further refined on a subset of the HH-RLHF dataset.
  5. Direct Preference Optimization (DPO): Applying the DPO loss function yields the final checkpoint, “OpenBezoar-HH-RLHF-DPO.” In this step, the model is aligned directly with human preferences, removing the need for a separate reward model.
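The DPO step in the pipeline above optimizes a simple closed-form objective: for each preference pair, it pushes the policy to assign a higher likelihood margin to the chosen response than to the rejected one, relative to a frozen reference model, with no learned reward model. A minimal sketch of that per-pair loss (the standard DPO formulation; beta here is illustrative, not the paper's value):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    loss = -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x))
                                - (log pi(y_l|x) - log pi_ref(y_l|x))])
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# When the policy matches the reference, the margin is 0 and loss = log 2.
print(round(dpo_loss(-5.0, -7.0, -5.0, -7.0), 4))  # 0.6931

# Shifting probability mass toward the chosen response lowers the loss.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0) < math.log(2))  # True
```

Because the loss depends only on log-probabilities from the policy and a frozen reference, training needs just the preference dataset (here, HH-RLHF) and no separate reward-model fitting or RL loop.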

The team has shared that the final checkpoint has been evaluated on MT-Bench using the ‘LLM-as-a-judge’ framework with Claude 2.1, as well as on LM Eval Harness tasks. The results show that the ‘OpenBezoar-HH-RLHF-DPO’ checkpoint outperforms many models at the 3B parameter scale, even beating the top model on the Hugging Face Open LLM Leaderboard in one of the categories.

The OpenBezoar-SFT, OpenBezoar-HH-RLHF-SFT, and OpenBezoar-HH-RLHF-DPO checkpoints have been released and can be accessed on Hugging Face.


Check out the Paper, Datasets on HF, and Codebase. All credit for this research goes to the researchers of this project.


Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.
