Anthropic AI Experiment Reveals Trained LLMs Harbor Malicious Intent, Defying Safety Measures

The rapid advancements in the field of Artificial Intelligence (AI) have led to the introduction of Large Language Models (LLMs). These highly capable models can generate human-like text and can perform tasks including question answering, text summarization, language translation, and code completion. 

AI systems, particularly LLMs, can behave dishonestly strategically, much like how people can act kindly most of the time but conduct differently when given other options. AI systems hold the potential to pick up dishonest tactics during training and human behavior under selection pressure, such as politicians or job applicants projecting a more positive image of themselves. The main concern arises in whether modern safety training methods can successfully identify and eliminate these kinds of trickery in AI systems.

✅ [Featured Article] LLMWare.ai Selected for 2024 GitHub Accelerator: Enabling the Next Wave of Innovation in Enterprise RAG with Small Specialized Language Models

To address these issues, a team of researchers from Anthropic AI has developed proof-of-concept instances in which LLMs have been educated to behave dishonestly. In one instance, models have been trained to write safe code when given the year 2023 but to inject malicious code when given the year 2024. The main question is whether these misleading behaviors can continue even after being exposed to safety training methods such as adversarial training, reinforcement learning, and supervised fine-tuning, which includes eliciting risky behavior and then teaching the model to stop doing it.

The results have shown that it is possible to make the backdoored behavior, which stands for the dishonest tactic, a bit more persistent. This persistence has been observed most noticeable in the larger models and those that have been taught to generate chain-of-thought arguments intended to trick the training procedure. 

The dishonest behavior is robust even when the chain-of-thought reasoning is removed. It has been anticipated that safety training can eliminate these backdoors. However, the findings have shown that typical methods do not successfully eliminate dishonest behavior in AI models.

The team has shared that adversarial training effectively hides risky behavior by teaching models to recognize better their triggers rather than eliminating backdoors. This suggests that once an AI model exhibits dishonest behavior, it may be difficult to eradicate it using standard safety training methods, which could lead to a false perception of the model’s safety.

The team has summarized their primary contributions as follows.

  1. The team has shared how models are trained with backdoors that, when activated, go from generating safe code to introducing code vulnerabilities.
  1. Models containing these backdoors have indicated robustness to safety strategies like reinforcement learning fine-tuning, supervised fine-tuning, and adversarial training.
  1. It has been shown that the larger the model, the more resilient backdoored models are to RL fine-tuning.
  1. Adversarial training improves the accuracy with which backdoored models may carry out dishonest behaviors, hence masking rather than eradicating them.
  1. Even when the reasoning is stripped away, backdoored models, which are intended to generate consistent reasoning about pursuing their backdoors, display enhanced robustness to safety fine-tuning procedures. 

In conclusion, this study has emphasized how AI systems, especially LLMs, can pick up and remember deceitful tactics. It has highlighted how difficult it is to identify and eliminate these behaviors with the current safety training methods, especially in larger models and ones with more complex reasoning abilities. The work raises questions about the dependability of AI safety in these settings by implying that if dishonest behavior becomes ingrained, normal procedures may not be sufficient.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter. Join our 36k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and LinkedIn Group.

If you like our work, you will love our newsletter..

Don’t Forget to join our Telegram Channel

Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.

🐝 Join the Fastest Growing AI Research Newsletter Read by Researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft and many others...