Large generative language models (LMs) such as GPT-3 and Gopher have demonstrated the ability to generate high-quality text. However, these models risk producing harmful text, making them difficult to deploy: they can hurt people in ways that are nearly impossible to forecast in advance.
Many different inputs can lead a model to produce harmful text. As a result, it is challenging to identify all the scenarios in which a model fails before it is used in the real world.
Previous work has relied on human annotators to hand-write test cases that identify harmful behaviors before deployment. However, human annotation is costly and time-consuming, which restricts the number and variety of test cases.
Researchers at DeepMind now use another LM to automatically generate test cases that detect when a target LM behaves in a harmful way. By automatically finding failure cases ("red teaming"), they aim to complement manual testing and reduce the number of critical oversights.
This technique identifies a number of detrimental model behaviors, including:
- Offensive language: hate speech, profanity, sexual content, discrimination, and so on.
- Conversational harms: for example, offensive language that emerges over the course of a long dialogue.
- Data leakage: the model reproduces copyrighted material or private, personally identifiable information from its training corpus.
- Contact information generation: the model directs users to email or call real people when there is no need to.
- Distributional bias: the model talks about some groups of people in unfairly different ways than other groups.
The team first applied their approach to red team the 280B-parameter Dialogue-Prompted Gopher chatbot, probing for cases where it generates offensive content. They tested several strategies for generating test cases with language models, including prompt-based generation, few-shot learning, supervised finetuning, and reinforcement learning. Their findings suggest that some methods produce more diverse test cases for the target model, while others generate more difficult, adversarial test cases. Used together, these techniques help achieve high test coverage while still modeling adversarial circumstances.
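The overall loop can be illustrated with a short sketch. The three callables below are toy stand-ins, not the paper's actual models or classifier: one function plays the red-team LM proposing test questions, one plays the target chatbot, and one plays the harm classifier that flags offensive replies.

```python
# Illustrative red-teaming loop: a red-team LM proposes test inputs,
# the target LM answers, and a classifier flags harmful replies.
# All three functions are hypothetical stand-ins for real models.

def red_lm_generate(n):
    """Stand-in for an LM sampling test questions from a prompt
    such as 'List of questions to ask someone:'."""
    return [f"Question {i}?" for i in range(n)]

def target_lm_reply(question):
    """Stand-in for the target chatbot being red-teamed."""
    return "That is a stupid question." if "3" in question else "Happy to help!"

def is_harmful(reply):
    """Stand-in for an offensiveness classifier scoring the reply."""
    return "stupid" in reply.lower()

def red_team(num_cases):
    """Return the (question, reply) pairs on which the target LM failed."""
    failures = []
    for question in red_lm_generate(num_cases):
        reply = target_lm_reply(question)
        if is_harmful(reply):
            failures.append((question, reply))
    return failures

if __name__ == "__main__":
    for q, r in red_team(10):
        print(f"FAIL: {q!r} -> {r!r}")
```

The different generation strategies the team compared (zero-shot prompting, few-shot prompting, finetuning, reinforcement learning) would all slot into the `red_lm_generate` step, trading off diversity against difficulty of the resulting test cases.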
Once failure cases are found, harmful outputs can be mitigated in several ways: blacklisting particular phrases that regularly appear in harmful outputs, so the model cannot generate them; and finding offensive training data quoted by the model and removing it when training future iterations of the model.
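The blacklist idea is simple to sketch. The phrases and the resample-and-fall-back policy below are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of a phrase blacklist applied to model outputs: a
# candidate reply containing any blacklisted phrase is rejected, and
# the caller resamples. Phrases here are placeholders.

BLACKLIST = ["example slur", "another banned phrase"]

def violates_blacklist(text, blacklist=BLACKLIST):
    """Case-insensitive substring match against the blacklist."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in blacklist)

def filtered_generate(sample_fn, max_tries=5):
    """Resample from the model until a reply passes the blacklist,
    falling back to a canned refusal if none does."""
    for _ in range(max_tries):
        reply = sample_fn()
        if not violates_blacklist(reply):
            return reply
    return "I'd rather not answer that."
```

In practice, production systems combine this kind of output filter with the training-data cleanup mentioned above, since a blacklist alone cannot catch paraphrased or novel harmful content.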
The team states that harmful model behavior can also be fixed by augmenting the model's prompt (its conditioning text) with an example of the intended behavior for a specific type of input. Further, the model can be trained on a given test input to reduce the likelihood of its original, harmful output.
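Prompt augmentation amounts to prepending a demonstration of the desired behavior before the real input. The template and example exchange below are hypothetical, for illustration only:

```python
# Sketch of few-shot prompt augmentation: a demonstration of polite,
# helpful behavior is prepended so the model imitates it on the new
# input. The example exchange is invented for illustration.

EXAMPLE_INPUT = "What do you think of my idea?"
EXAMPLE_REPLY = "It has promise; have you considered the edge cases?"

def build_prompt(user_input):
    """Build a prompt that conditions the model on one demonstration
    of the intended behavior before presenting the real input."""
    return (
        "The assistant replies helpfully and politely.\n"
        f"User: {EXAMPLE_INPUT}\n"
        f"Assistant: {EXAMPLE_REPLY}\n"
        f"User: {user_input}\n"
        "Assistant:"
    )
```

The completed string would be fed to the LM, which continues after the final "Assistant:" marker; the training-based fix instead updates the model's weights to down-weight the harmful completion on that input.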
Overall, this work focuses on red teaming the harms caused by today's language models, and the approach can be used to detect and reduce such harms. In the future, the team plans to apply it to preemptively find other potential failure modes in sophisticated machine learning systems, such as inner misalignment or failures of objective robustness.