Microsoft AI Researchers Introduces (De)ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

While there are several benefits to using Artificial Intelligence, there are also drawbacks to this cutting-edge technology. One example is the creation of inappropriate language by language models. Because these models are trained on massive amounts of data, inappropriate language may be learned due to its presence in the training data.

In only specific cases, content moderation techniques can be used to flag or filter such language however, the datasets used to train these programs frequently fail to capture the complexity of potentially unsuitable and poisonous language, particularly hate speech. Furthermore, the neutral samples in these datasets seldom contain group references. As a result, tools may flag even neutral language that refers to a minority identification group as hate speech. A dataset needs to be created for training content moderation algorithms that may be used to detect better implicitly harmful material, inspired by big language models’ capacity to emulate the tone, style, and vocabulary of cues they receive, whether toxic or benign.


Microsoft researchers collected initial examples of neutral statements with group mentions and examples of implicit hate speech across 13 minority identity groups. They used a large-scale language model to scale up and guide the generation process in their paper “ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection.” The result is the largest publicly available implicit hate speech dataset: 274,000 samples of both neutral and toxic utterances. Although large Transformer-based language models may not directly contain semantic information, they may identify the statistical interactions of words in various scenarios. They learned how to use cautious prompt engineering tactics to construct the ToxiGen implicit hate speech dataset by experimenting with language production using one of these enormous language models.

While demonstration-based prompting can help with large-scale data production, it does not produce explicitly designed data to test a specific content moderation technology or content classifier. Data from both demonstration-based prompting and their suggested adversarial decoding technique are included in our ToxiGen dataset. 

This repository provides all of the components required to create the ToxiGen dataset, which comprises implicitly hazardous and neutral phrases about 13 minority groups. It contains ALICE ( or (De)ToxiGen), a tool for stress testing and iteratively improving a given off-the-shelf content moderation system across different minority groups. 

As with many technologies, the solutions we create to make them stronger, more secure, and less susceptible may be used unexpectedly. While the methods described here may generate inappropriate or harmful language, they are far more valuable in combating such language. These methods can help in content moderation tools that can be used alongside human guidance to support fairer, safer, more reliable, and inclusive AI systems.

This Article Is Based On The Research Paper 'TOXIGEN: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection'. All Credit For This Research Goes To The Researchers of This Project. Check out the paper, Microsoft blog and Github codes. 

Please Don't Forget To Join Our ML Subreddit