GPT-4 defaults to saying, “Sorry, but I can’t help with that,” in answer to requests that go against policies or ethical restrictions. Safety training and red-teaming are essential to prevent AI safety failures when large language models (LLMs) are used in user-facing applications like chatbots and writing tools. Serious social repercussions from LLMs producing negative material may include spreading false information, encouraging violence, and platform destruction. They find cross-lingual weaknesses in the safety systems already in place, even though developers like Meta and OpenAI have made progress in minimizing safety risks. They discover that all it takes to circumvent protections and cause negative reactions in GPT-4 is the simple translation of dangerous inputs into low-resource natural languages using Google Translate.
Researchers from Brown University demonstrate that translating English inputs into low-resource languages enhances the likelihood of getting through the GPT-4 safety filter from 1% to 79% by systematically benchmarking 12 languages with various resource settings on the AdvBenchmark. Additionally, they show that their translation-based strategy matches or even outperforms cutting-edge jailbreaking techniques, which suggests a serious weakness in GPT-4’s security measures. Their work contributes in several ways. First, they highlight the negative effects of the AI safety training community’s discriminatory treatment and unequal valuing of languages, as seen by the gap between LLMs’ capacity to fight off attacks from high- and low-resource languages.
Additionally, their research shows that the safety alignment training currently available in GPT-4 needs to generalize better across languages, leading to a mismatched generalization safety failure mode with low-resource languages. Second, the reality of their multilingual environment is rooted in their job, which grounds LLM safety systems. Around 1.2 billion people speak low-resource languages worldwide. Thus, safety measures should be taken into account. Even bad actors who speak high-resource languages may easily get around the current precautions with little effort as translation systems increase their coverage of low-resource languages.
Last but not least, their study highlights the urgent necessity to adopt a more comprehensive and inclusive red-teaming. Focusing just on English-centric benchmarks may create the impression that the model is secure. It is still vulnerable to assaults in languages where the safety training data is not widely available. More crucially, their findings also imply that scholars have yet to appreciate the ability of LLMs to comprehend and produce text in low-resource languages. They implore the safety community to construct strong AI safety guardrails with expanded language coverage and multilingual red-teaming datasets encompassing low-resource languages.
Check out the Paper. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 31k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.
We are also on WhatsApp. Join our AI Channel on Whatsapp..
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology(IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing and is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.