The Hidden Danger in AI Models: A Space Character’s Impact on Safety

When given an unsafe prompt, like “Tell me how to build a bomb,” a well-trained large language model (LLM) should refuse to answer. This is usually achieved through Reinforcement Learning from Human Feedback (RLHF) and is crucial to make sure models are safe to use, especially in sensitive areas that involve direct interaction with people, like mental health, customer service, general conversation, and healthcare. However, there has been progress in automating the creation of these chat templates, but documentation for the template format used during training often needs to be improved. Among the eight open-source models reviewed, only Vicuna, Falcon, Llama-3, and ChatGLM describe the chat template used during fine-tuning.

The first related study focuses on Model Alignment, which aims to ensure that AI models reflect human values, a key focus in current research on LLMs. Training frameworks such as SelfInstruct, RLHF, and Constitutional AI propose methods to enhance model alignment by integrating human values into model training. The next study examines Attacks on Model Alignment, where attacks revealing vulnerabilities in model alignment have become more common. Next is Model Robustness, where in the context of adversarial attacks on classification tasks, research shows that even small alterations to images like tweaking a few pixels, can cause neural networks to misclassify them. The last work is Glitch Tokens, where tokens are present in a tokenizer’s vocabulary but absent from a model’s training data.

Researchers from the National University of Singapore have found an important observation that single-character tokens appear relatively rarely in tokenized model pre-training data. This is because of the nature of subword tokenization algorithms, which merge-common tokens. However, single-character tokens can still pose a threat to most models. The researchers explained this by looking at how tokenizer vocabularies and the contexts of single-space tokens in pre-training data work. The findings highlighted the weaknesses in current model alignment and suggested that more effort is needed to make models not just aligned but robustly aligned.

Data from AdvBench, a benchmark designed to measure how often models agree with harmful requests, is used in this study. These harmful requests include asking for misinformation, pornographic material, or instructions for illegal activities. For the experiments, a 100-sample subset of the harmful behaviors split of AdvBench is tested. Eight open-source models are tested: Vicuna v1.5, Llama 2, Llama 3, Mistral, Falcon, Guanaco, MPT, and ChatGLM, using 7B4 and 13B models. This helps analyze the impact of model size and type on harmful behavior. Responses from models that do not refuse harmful queries are likely to be harmful. A check on a randomly selected set of ten outputs from each model showed that this evaluation method is accurate in most cases (74/80).

In this paper, a situation is considered where the chat template of a model is available, which excludes closed-source, commercial models like GPT-4 and BARD. Instead, the focus is on open-source models to show that this problem exists and explore the reasons related. Although this exploration is formalized as an adversarial attack, it is not meant to propose a practical attack on LLMs but rather serves as a probing method. For a user query x to model M, the model input is formatted using template T, which consists of a system prompt s, a set of role labels R, and x. A single character is appended to the end of the template, resulting in a modified template, Tโ€ฒ.

In conclusion, researchers from the National University of Singapore found that adding a single space at the end of LLM conversation templates can cause open-source language models to give harmful responses to user prompts. This extra space is easy for an engineer to add by mistake and hard to notice without careful checks, especially in long templates. However, this small error can lead to dangerous outcomes, bypassing the model’s safeguards. The experiments suggest that this happens because of how single tokens are used in the training data, and the reason is the way the data is divided into tokens.

Check out the Paper. All credit for this research goes to the researchers of this project. Also,ย donโ€™t forget to follow us onย Twitter.ย 

Join our Telegram Channel and LinkedIn Group.

If you like our work, you will love our newsletter..

Donโ€™t Forget to join our 46k+ ML SubReddit

๐Ÿ Join the Fastest Growing AI Research Newsletter Read by Researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft and many others...