OpenAI Team Introduces ‘InstructGPT’ Model Developed With Reinforcement Learning From Human Feedback (RLHF) To Make Models Safer, Helpful, And Aligned

In theory, a system can learn anything from a set of data. In practice, however, a model is limited by the examples it has seen. Although pretrained language models such as OpenAI’s GPT-3 have excelled at a wide range of natural language processing (NLP) tasks, they sometimes generate unintended outputs, or outputs that do not follow the user’s instructions. Their outputs have also been observed to be biased, untruthful, or toxic, with potentially harmful societal consequences.

OpenAI researchers have made substantial progress in better aligning large language models with users’ intentions using reinforcement learning from human feedback (RLHF). The team proposed InstructGPT models, which tests have shown to produce more truthful and less harmful outputs.

InstructGPT is designed to act on two kinds of user intent:

  • Explicit intentions – following user instructions.
  • Implicit intentions – staying truthful and not being biased, toxic, or otherwise harmful.

The researchers also define the language models they are aiming for as:

  • Helpful: they should assist the user in completing their task.
  • Honest: they shouldn’t fabricate information or mislead the user.
  • Harmless: they shouldn’t cause physical, psychological, or social harm to people or the environment.

The high-level InstructGPT process involves three steps:

  • Collect demonstration data and use it to train a supervised policy.
  • Collect comparison data and use it to train a reward model.
  • Optimize the policy against the reward model using PPO.

Core Technique:

The core technique is RLHF, which uses human preferences as the reward signal. The researchers first collect a dataset of human-written demonstrations on prompts submitted to their API and use it to train supervised learning baselines. They also compile a dataset of human-labeled comparisons between two model outputs on a broader set of prompts. They then use this dataset to train a reward model (RM) to predict which output their labelers would prefer, and use the PPO algorithm to fine-tune the GPT-3 policy to maximize this reward.
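
As a rough illustration of the reward model step, the sketch below shows one common way to turn "the labeler preferred output A over output B" into a training loss: a pairwise logistic loss on the difference between the two reward scores. The article does not spell out the exact loss, so treat this form as an assumption. The function names and the numeric scores are hypothetical; in a real system the scores would come from a neural network rather than hand-typed floats.

```python
import math


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Loss for one human comparison: -log(sigmoid(r_preferred - r_rejected)).

    The loss is small when the reward model already scores the
    labeler-preferred output higher, and large when it scores it lower.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(score_pairs):
    """Average the pairwise loss over (preferred_score, rejected_score) pairs."""
    losses = [pairwise_preference_loss(w, l) for w, l in score_pairs]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Hypothetical reward scores for three labeled comparisons.
    comparisons = [(1.2, -0.3), (0.1, 0.4), (2.0, 2.0)]
    print(f"mean loss: {mean_preference_loss(comparisons):.4f}")
```

Minimizing this loss pushes the reward model to score preferred outputs above rejected ones; the PPO step then treats that learned score as the reward to maximize.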


Findings and observations:

  • Labelers favor InstructGPT outputs over GPT-3 outputs by a wide margin.
  • InstructGPT models outperform GPT-3 in terms of truthfulness.
  • InstructGPT shows small improvements in toxicity over GPT-3, but not in bias.
  • Performance regressions on public NLP datasets can be reduced by modifying the RLHF fine-tuning procedure.
  • The models generalize to the preferences of “held-out” labelers who did not provide any training data.
  • Public NLP datasets do not reflect how language models are actually used.
  • InstructGPT models show promising generalization to instructions outside the RLHF fine-tuning distribution.
  • InstructGPT still makes simple mistakes.

Overall, InstructGPT has been shown to substantially improve GPT-3’s behavior across a wide range of tasks. It also highlights how fine-tuning with human feedback may help large language models better align with human intent. The researchers intend to refine their methods to make language models safer and more helpful.

If you’re interested in these research directions, OpenAI is hiring!


