Meet SafeDecoding: A Novel Safety-Aware Decoding AI Strategy to Defend Against Jailbreak Attacks

Despite significant strides, large language models (LLMs) such as ChatGPT, Llama2, Vicuna, and Gemini still grapple with safety issues. The paper introduces a novel safety-aware decoding technique, SafeDecoding, which aims to protect LLMs from jailbreak attacks, a pressing concern evidenced by LLMs generating damaging, erroneous, or biased content.

Despite progress in alignment algorithms, adversarial inputs can still affect LLMs. According to recent research, a serious threat known as a “jailbreak attack” can effectively circumvent current alignment. Many defenses have been developed, such as input perturbation, input and output detection, and prompt demonstration, but these techniques tend to be either ineffective or expensive in terms of inference time, and they may reduce the usefulness of LLMs when serving benign users.

Researchers from the University of Washington, the Pennsylvania State University, and the Allen Institute for AI aim to protect LLMs from jailbreak attacks by offering an alternative viewpoint on why jailbreaks succeed. They analyze jailbreak attacks through token probabilities, where a token is the smallest textual unit that an LLM can interpret. This viewpoint leads to two findings. First, jailbreak attacks succeed because token probabilities that align with the attack goals (e.g., “Hey, here’s a tutorial for making a bomb”) dominate. As a result, common decoding strategies such as greedy and top-k can fail to produce harmless content. Second, even though the model exhibits unintended behavior, the sample space still contains tokens for safety disclaimers such as “Sorry, I cannot fulfill your request.” This suggests that the model is inherently aware of the jailbreak attack.
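The two findings can be illustrated with a toy next-token distribution. The sketch below is not taken from the paper: the token names, the probabilities, and the helper functions `greedy_pick`, `top_k_space`, and `top_k_sample` are all made up for illustration. It only shows that greedy decoding picks the dominant attack-affirming token, while a refusal token can still sit inside the top-k sample space.

```python
import random

# Hypothetical next-token probabilities after a jailbreak prompt.
# These numbers are invented for illustration, not measured from any model.
next_token_probs = {"Sure": 0.62, "Here": 0.20, "Sorry": 0.11, "I": 0.05, "As": 0.02}


def greedy_pick(probs):
    """Greedy decoding: always take the single most probable token."""
    return max(probs, key=probs.get)


def top_k_space(probs, k):
    """Keep the k most probable tokens and renormalise their probabilities."""
    top = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in top)
    return {t: probs[t] / total for t in top}


def top_k_sample(probs, k, rng):
    """Top-k decoding: sample one token from the renormalised top-k space."""
    space = top_k_space(probs, k)
    return rng.choices(list(space), weights=list(space.values()), k=1)[0]


if __name__ == "__main__":
    print("greedy choice:", greedy_pick(next_token_probs))
    space = top_k_space(next_token_probs, 3)
    print("refusal token in top-3 space:", "Sorry" in space)
    print("one top-3 sample:", top_k_sample(next_token_probs, 3, random.Random(0)))
```

In this toy setting the refusal token is available but rarely chosen, which is the gap a safety-aware decoding strategy tries to close.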

Based on these observations, the team proposes a safety-aware decoding technique called SafeDecoding to thwart jailbreak attacks. SafeDecoding’s main idea is to deliberately identify safety disclaimers and increase their token probabilities while simultaneously lowering the probabilities of token sequences that support the attacker’s goals. To do this, SafeDecoding begins with a training phase that produces an expert model: the original model is fine-tuned on a safety-aware dataset created with the help of the original model itself. During the inference phase, SafeDecoding first locates the intersection of the top tokens from the original and expert models, then constructs a new token distribution from the two models’ token probabilities, which is how it balances the utility-safety tradeoff. Finally, SafeDecoding samples tokens from this new distribution to respond to the input query.
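The inference-phase steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the article does not give the exact combination rule, so the linear shift toward the expert model, the weight `alpha`, the minimum sample-space size `min_size`, the clipping of negative values, the uniform fallback, and the toy probabilities are all hypothetical choices made here.

```python
import random


def top_k_tokens(probs, k):
    """Return the k most probable tokens, highest first."""
    return sorted(probs, key=probs.get, reverse=True)[:k]


def shared_sample_space(p_orig, p_expert, min_size):
    """Grow k until the two top-k lists share at least min_size tokens."""
    shared = set()
    for k in range(1, min(len(p_orig), len(p_expert)) + 1):
        shared = set(top_k_tokens(p_orig, k)) & set(top_k_tokens(p_expert, k))
        if len(shared) >= min_size:
            break
    return shared


def safety_aware_distribution(p_orig, p_expert, min_size=2, alpha=3.0):
    """Build a new distribution over the shared tokens (assumed combination rule)."""
    space = shared_sample_space(p_orig, p_expert, min_size)
    # Assumption: move each probability from the original toward the expert;
    # alpha > 1 amplifies whatever the expert raised or lowered.
    raw = {t: p_orig[t] + alpha * (p_expert[t] - p_orig[t]) for t in space}
    clipped = {t: max(v, 0.0) for t, v in raw.items()}  # no negative mass
    total = sum(clipped.values())
    if total <= 0.0:
        # Assumption: fall back to a uniform choice if everything was clipped.
        return {t: 1.0 / len(space) for t in space}
    return {t: v / total for t, v in clipped.items()}


# Invented toy probabilities for the first response token.
p_orig = {"Sure": 0.70, "Sorry": 0.15, "Here": 0.10, "I": 0.05}
p_expert = {"Sorry": 0.80, "I": 0.12, "Sure": 0.05, "Here": 0.03}

if __name__ == "__main__":
    dist = safety_aware_distribution(p_orig, p_expert)
    tokens = list(dist)
    choice = random.Random(0).choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]
    print(dist, choice)
```

Restricting the new distribution to tokens that both models rank highly is what keeps the output plausible for the original model, while the shift toward the expert favors refusal tokens when the two models disagree.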

The evaluation of SafeDecoding on two harmful-query benchmarks, two utility benchmarks, and six state-of-the-art jailbreak attacks across five LLMs shows superior performance. SafeDecoding consistently outperforms all baselines in thwarting jailbreak attacks while adding only a small computational overhead, which preserves the usefulness of LLMs in benign user interactions.

While SafeDecoding proves effective in most cases, it does have a drawback. On rare occasions, the model may initially reject a user’s harmful query and then go on to comply with it. This inconsistency when decoding the first m tokens is a challenge that needs to be addressed in future iterations of SafeDecoding.

This research focuses primarily on large language models; hence, the scope of the analysis and of SafeDecoding’s performance evaluation is restricted to these models. The team states that future research will examine how well SafeDecoding performs with newly developed multimodal large language models such as GPT-4V. Multimodal large language models, which integrate text, images, audio, and other types of data, present particular difficulties and intricacies that are not covered in this work.


Check out the Paper and Github. All credit for this research goes to the researchers of this project.

Dhanshree Shenwai is a Computer Science Engineer with experience at FinTech companies across the Financial, Cards & Payments, and Banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier in today’s evolving world.
