CMU Researchers Propose a Simple and Effective Attack Method that Causes Aligned Language Models to Generate Objectionable Behaviors at a High Success Rate

Large language models (LLMs) are recent advances in deep learning models to work on human languages. These deep-learning trained models understand and generate text in a human-like fashion. These models are trained on a huge dataset scraped from the internet, taken from books, articles, websites and other sources of information. They can translate languages, summarize text, answer questions and, perform a wide range of natural language processing tasks.

Recently, there has been a growing concern about their ability to generate objectionable content and the resulting consequences. Thus, significant studies have been conducted in this area.

Subsequently, Researchers from Carnegie Mellon University’s School of Computer Science (SCS), the CyLab Security and Privacy Institute, and the Center for AI Safety in San Francisco have studied generating objectionable behaviors in language models. In their research, they proposed a new attack method that involves adding a suffix to a wide range of queries, resulting in a substantial increase in the likelihood that both open-source and closed-source language models (LLMs) will generate affirmative responses to questions they would typically refuse.

During their investigation, the researchers successfully applied the attack suffix to various language models, including public interfaces like ChatGPT, Bard, and Claude, and open-source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. Consequently, the attack suffix effectively induced objectionable content in the outputs of these language models.

This method successfully generated harmful behaviors in 99 out of 100 instances on Vicuna. Additionally, they produced 88 out of 100 exact matches with a target harmful string in Vicuna’s output. The researchers also tested their attack method against other language models, such as GPT-3.5 and GPT-4, achieving up to 84% success rates. For PaLM-2, the success rate was 66%.

The researchers said that, at the moment, the direct harm to people that could be brought about by prompting a chatbot to produce objectionable or toxic content might not be especially severe. The concern is that these models will play a larger role in autonomous systems without human supervision. They further emphasized that as autonomous systems become more of a reality, it will be very important to ensure we have a reliable way to stop them from being hijacked by attacks like these.

The researchers said they didn’t set out to attack proprietary large language models and chatbots. But their research shows that even if we have big trillion parameter closed-source model, people can still attack it by looking at freely available, smaller, and simpler open-sourced models and learning how to attack those.

In their research, the researchers extended their attack method by training the attack suffix on multiple prompts and models. As a result, they induced objectionable content in various public interfaces, including Google Bard and Claud. The attack also affected open-source language models like Llama 2 Chat, Pythia, Falcon, and others, exhibiting objectionable behaviors.

The study demonstrated that their attack approach had broad applicability and could impact various language models, including those with public interfaces and open-source implementations. They further emphasized that we don’t have a method to stop such adversarial attacks right now, so the next step is to figure out how to fix these models.

Check out the Paper and Blog Article. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 27k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

[Announcing Gretel Navigator] Create, edit, and augment tabular data with the first compound AI system trusted by EY, Databricks, Google, and Microsoft