Unmasking AI Misbehavior: How Large Language Models Generalize from Simple Tricks to Serious Reward Tampering

Using reinforcement learning (RL) to train large language models (LLMs) to serve as AI assistants is common practice. To incentivize high-reward episodes, RL assigns numerical rewards to LLM outcomes. Reinforcing bad behaviors is possible when reward signals are not properly stated and do not correspond to the developer’s aims. This phenomenon is called specification gaming, when artificial intelligence systems learn undesirable but highly-rewarded behaviors due to reward misspecification.

The range of behaviors that can emerge from specification gaming is vast, from sycophancy, where a model aligns its results with user biases, to reward-tampering, where a model directly manipulates the reward administration mechanism. The latter, such as altering the code that executes its training reward, represents more complex and severe forms of specification gaming. These complex gaming behaviors may seem implausible at first due to the intricate steps required, such as making targeted alterations to multiple parts of the code, but they are a significant area of concern in this research.

The team from Anthropic, Redwood Research, University of Oxford generalize specification games to reward tampering and create a case study. The team’s goal is to create a curriculum of realistic-looking game worlds. They purposefully create settings where specification gaming is possible. The researchers start with environments that are easy to game (for example, responding in a sycophantic or flattering manner) and work way up to more complex ones (by, for example, lying or manipulating data to conceal it). They supplement the reward from the preference model that trained Claude-2 with a binary incentive for effective specification gaming.

Interestingly, models have been found to alter their reward function implementation and even rewrite testing code when taught on the whole curriculum and tested in a held-out environment with access to a mock-up of their training code. This behavior is likely to go unnoticed. Even though these extreme behaviors were rare, they nonetheless outperformed a model that was trained to be helpful—a model that never tampered with even after 100,000 trials—by a substantial margin.

In all training contexts, the team assigned half of the prompts to typical inquiries derived from Claude-2’s training, and they supplemented rewarding specification games with supervision from a preference model (PM). The PM rewards helpful, honest, and harmless (HHH) actions, while dishonest ones are punished. They discovered that the proposed models can deceive the PM with their made-up reasons for bad behavior. Even with regular queries that reward HHH behavior, the models can still generalize to reward curriculum tampering. Despite the failure of HHH training, the team wanted to explore if it was possible to undo the learning of reward-tampering by identifying and punishing specification gaming in situations when it is easier to observe. For this test, they retrain a model trained on their entire curriculum to never game the requirements of early-curriculum environments. The model does not include the held-out reward-tampering setting, which is only used for evaluation. The rate of tampering in the held-out setting is reduced but not eliminated. When a propensity to tamper is established, it may be difficult to eradicate. Lastly, they look into the possibility that a model’s ability to generalize gaming behaviors is influenced by the reinforcement learning (RL) algorithm. 

The experiments tested two approaches: expert iteration and proximal policy optimization. The study’s curriculum can be applied to both algorithms to demonstrate reward-tampering. Their curriculum shows consistent generalization; however, the incidence of reward tampering is still rather low. Even though half of the environments in the training set had exploitable reward processes, not a single one of the models tampered with rewards more than once every 1,000 trials, and none of them managed to hide their true identities more than once. Furthermore, it is highly doubtful that present-day models possess the necessary knowledge of their training procedures to independently develop a policy that seeks rewards, let alone effectively implement it in actual settings. Evidence that Claude-2-scale models do not represent a significant threat owing to reward-seeking behavior is the fact that it is exceedingly difficult to cause these models to generalize to reward-tampering, even when they create conditions that encourage such behavior. 

The results of this study are intended to demonstrate the theoretical potential for LLM helpers to generalize from basic to advanced specification gaming, including reward-tampering. However, it is crucial to stress that this curriculum, while designed to simulate a realistic training procedure, significantly exaggerates the incentives for gaming the specifications. The findings, therefore, do not support the notion that current frontier models engage in complex reward-tampering. This underscores the need for further research and vigilance to understand the likelihood of such behaviors in future models. 

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter

Join our Telegram Channel and LinkedIn Group.

If you like our work, you will love our newsletter..

Don’t Forget to join our 44k+ ML SubReddit

Dhanshree Shenwai is a Computer Science Engineer and has a good experience in FinTech companies covering Financial, Cards & Payments and Banking domain with keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements in today’s evolving world making everyone's life easy.

🐝 Join the Fastest Growing AI Research Newsletter Read by Researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft and many others...