ToXCL: A Unified Artificial Intelligence Framework for the Detection and Explanation of Implicit Toxic Speech

On social media, toxic speech can spread like wildfire, targeting individuals and marginalized groups. While explicit hate is relatively easy to flag, implicit toxicity – which relies on stereotypes and coded language rather than overt slurs – poses a trickier challenge. How do we train AI systems to not only detect this veiled toxicity but also explain why it’s harmful?

Researchers at Nanyang Technological University, Singapore, the National University of Singapore, and the Institute for Infocomm Research have tackled this head-on with a novel framework called ToXCL, an overview of which is shown in Figure 2. Unlike previous systems that lumped detection and explanation into one text generation task, ToXCL uses a multi-module approach that breaks the problem into steps.

First, there’s the Target Group Generator, a text generation model that identifies the minority group(s) potentially being targeted in a given post. Next is the Encoder-Decoder Model, whose encoder first classifies the post as toxic or non-toxic. If the post is flagged as toxic, the decoder then generates an explanation of why it’s problematic, using the target group information as additional input.

But here’s the clever bit: To beef up the encoder’s detection skills, the researchers incorporated a strong Teacher Classifier. Using the knowledge distillation technique, this teacher model passes its expertise to the encoder during training, improving its classification abilities.
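To make the teacher-student idea concrete, here is a minimal sketch of a distillation loss in plain Python. It uses the classic recipe of mixing a hard-label cross-entropy term with a temperature-softened KL term between teacher and student predictions. The function names, the weighting `alpha`, and the `temperature` are illustrative assumptions, not values or code taken from ToXCL, whose exact loss may differ.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Turn logits into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    label: int,
    alpha: float = 0.5,        # assumed weighting between the two terms
    temperature: float = 2.0,  # assumed softening temperature
) -> float:
    # Hard term: cross-entropy of the student against the gold label.
    hard = -math.log(softmax(student_logits)[label])
    # Soft term: how far the student's softened distribution is from the teacher's.
    soft = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    # The temperature**2 factor keeps the soft term on a comparable scale.
    return alpha * hard + (1.0 - alpha) * (temperature ** 2) * soft


# Class 0 = toxic, class 1 = non-toxic (toy logits for illustration only).
teacher = [3.0, -3.0]
undecided_student = [0.0, 0.0]
aligned_student = [3.0, -3.0]
loss_far = distillation_loss(undecided_student, teacher, label=0)
loss_near = distillation_loss(aligned_student, teacher, label=0)
```

In this toy setup the student that already agrees with the teacher gets a much smaller loss than the undecided one, which is the pressure that nudges the encoder towards the teacher's judgments during training.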

The researchers also added a Conditional Decoding Constraint—a neat trick that ensures the decoder only generates explanations for posts classified as toxic, eliminating contradictory outputs.
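Putting the pieces described so far together, the overall flow can be sketched roughly as follows. This is a simplified illustration, not the authors' code: the three callables stand in for the real models, and the way target groups are appended to the post, the 0.5 threshold, and every name here are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Result:
    target_groups: List[str]
    is_toxic: bool
    explanation: Optional[str]  # None when the post is classified as non-toxic


def detect_and_explain(
    post: str,
    generate_target_groups: Callable[[str], List[str]],
    toxic_probability: Callable[[str], float],
    decode_explanation: Callable[[str], str],
    threshold: float = 0.5,  # assumed decision threshold
) -> Result:
    # Step 1: the target group generator proposes who the post may be aimed at.
    groups = generate_target_groups(post)
    # Assumption: the groups are appended to the post as plain text.
    augmented = f"{post} [target: {', '.join(groups)}]" if groups else post
    # Step 2: the encoder-side classifier decides toxic vs. non-toxic.
    is_toxic = toxic_probability(augmented) >= threshold
    # Step 3: conditional decoding constraint. The decoder is only consulted
    # for posts classified as toxic, so a "non-toxic" label can never be
    # paired with an explanation of why the post is harmful.
    explanation = decode_explanation(augmented) if is_toxic else None
    return Result(groups, is_toxic, explanation)


# Toy stand-ins for the real models (hypothetical, for illustration only).
def toy_groups(post: str) -> List[str]:
    return ["group A"] if "group A" in post else []


def toy_probability(post: str) -> float:
    return 0.9 if "[target:" in post else 0.1


def toy_decoder(post: str) -> str:
    return "implies a negative stereotype about the targeted group"


if __name__ == "__main__":
    print(detect_and_explain("a veiled remark about group A",
                             toy_groups, toy_probability, toy_decoder))
    print(detect_and_explain("nice weather today",
                             toy_groups, toy_probability, toy_decoder))
```

Note that the toy decoder would happily produce an explanation for any input; it is the gate in step 3 that keeps the label and the explanation consistent.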

So how did it fare? On two major implicit toxicity benchmarks, ToXCL outperformed state-of-the-art baselines and even surpassed models focused solely on detection or explanation. Human evaluators rated its outputs higher for correctness, fluency, and reduced harmfulness compared to other leading systems.

Of course, there’s still room for improvement. The model can sometimes stumble over coded symbols or abbreviations that require external knowledge. And the subjective nature of implicit toxicity means the “right” explanation is often multi-faceted.

Overall, though, ToXCL marks an impressive step towards AI systems that can identify veiled hatred and articulate its pernicious impacts. As this technology develops further, we must also grapple with potential risks around reinforcing biases or generating toxic language itself. With care, it offers a path to empowering marginalized voices and curbing oppressive speech online. The quest continues.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


Vineet Kumar is a consulting intern at MarktechPost. He is currently pursuing his BS at the Indian Institute of Technology (IIT), Kanpur. He is a Machine Learning enthusiast, passionate about research and the latest advancements in Deep Learning, Computer Vision, and related fields.
