In recent years, Natural Language Processing (NLP) has evolved into a powerful field within AI, with applications ranging from language translation to spam-email filtering.
Grammatical error correction (GEC) is one such task: it provides grammar and spelling suggestions that help users improve the quality of their writing in documents, emails, blog posts, and elsewhere.
However, unlike many other NLP tasks, GEC has relatively few datasets available for training. To compensate for the lack of data, researchers often use synthetic data generated by techniques such as heuristic random word substitutions and character-level corruptions. Such methods, however, are often oversimplified and do not accurately reflect the distribution of error types that real users produce.
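To make the contrast concrete, here is a minimal, hypothetical example of this style of heuristic corruption. The function `naive_corrupt` and its edit rules are illustrative inventions, not taken from any particular system:

```python
import random

def naive_corrupt(sentence: str, p: float = 0.1) -> str:
    """Apply crude random corruptions, ignoring realistic error types."""
    corrupted = []
    for word in sentence.split():
        r = random.random()
        if r < p and len(word) > 1:
            # Character-level corruption: delete one random character.
            i = random.randrange(len(word))
            corrupted.append(word[:i] + word[i + 1:])
        elif r < 2 * p:
            # Word-level corruption: drop the word entirely.
            continue
        else:
            corrupted.append(word)
    return " ".join(corrupted)

print(naive_corrupt("The quick brown fox jumps over the lazy dog."))
```

Edits like these are easy to generate at scale, but the errors they introduce look nothing like the mistakes human writers actually make.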
A recent Google study, "Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models," proposes tagged corruption models to address these issues. They allow more precise control over synthetic data generation while producing diverse outputs that better match the error distribution seen in practice.
Tagged Corruption Models
A conventional corruption model for GEC starts from a grammatically correct sentence and then "corrupts" it by introducing errors. Tagged corruption models additionally take an error-type tag as input, so choosing different tags for different sentences yields far more diverse corruptions than a conventional, untagged corruption model can produce.

Source: https://ai.googleblog.com/2021/08/the-c4200m-synthetic-dataset-for.html
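For illustration only, the following toy sketch mimics how an error-type tag can steer which kind of error is introduced. The actual model in the paper is a neural sequence-to-sequence system; `tagged_corrupt` and its hand-written rules are hypothetical stand-ins, with tag names loosely following ERRANT-style error types:

```python
import random

ERROR_TAGS = ["DET", "PUNCT", "SPELL"]  # tiny subset, for illustration

def tagged_corrupt(sentence: str, tag: str) -> str:
    """Toy rule-based corrupter: the tag decides which error is injected."""
    words = sentence.split()
    if tag == "DET" and "the" in words:
        words.remove("the")                  # determiner error: drop an article
    elif tag == "PUNCT":
        words[-1] = words[-1].rstrip(".!?")  # punctuation error: strip end mark
    elif tag == "SPELL":
        i = random.randrange(len(words))
        w = words[i]
        if len(w) > 2:                       # spelling error: swap two characters
            j = random.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

clean = "She has visited the museum twice."
for tag in ERROR_TAGS:
    print(f"{tag}: {tagged_corrupt(clean, tag)}")
```

The key point is that the same clean sentence yields different, controllable corruptions depending on the tag.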
Data generation using Tagged Corruption Models
To provide researchers with realistic pre-training data for GEC, Google used tagged corruption models to generate C4_200M, a corpus of 200 million corrupted sentences.
The researchers sampled 200 million clean sentences at random from the C4 corpus. They then assigned an error-type tag to each sentence so that the relative tag frequencies matched the distribution of error-type tags in BEA-dev, a small development set. Because BEA-dev is a carefully curated sample covering a wide range of English proficiency levels, its tag distribution reflects the writing errors found in practice. Finally, they synthesized the ungrammatical source sentence for each pair using a tagged corruption model. Incorporating this new dataset into the training pipeline dramatically improved GEC baselines.

Source: https://ai.googleblog.com/2021/08/the-c4200m-synthetic-dataset-for.html
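The overall generation loop can be pictured as follows. This is a schematic sketch only: the tag probabilities are made-up placeholders (the paper estimates them from BEA-dev), and `tagged_corrupt` here is a trivial stand-in for the actual corruption model:

```python
import random

def tagged_corrupt(sentence: str, tag: str) -> str:
    # Trivial stand-in for the tagged corruption model (see the sketch above).
    return f"<{tag}-corrupted> {sentence}"

# Hypothetical tag probabilities -- NOT the actual BEA-dev frequencies,
# which the paper estimates from the development set itself.
TAG_DISTRIBUTION = {"DET": 0.25, "PUNCT": 0.30, "SPELL": 0.45}

def generate_synthetic_pairs(clean_sentences):
    tags, weights = zip(*TAG_DISTRIBUTION.items())
    for sentence in clean_sentences:
        # Draw one tag per sentence so that, over the whole corpus,
        # tag frequencies match the target distribution.
        tag = random.choices(tags, weights=weights, k=1)[0]
        yield tagged_corrupt(sentence, tag), sentence  # (source, target) pair

clean_corpus = ["She has visited the museum twice.", "The dog chased the ball."]
for source, target in generate_synthetic_pairs(clean_corpus):
    print(source, "->", target)
```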
Results
The researchers evaluated the approach on two standard development sets, CoNLL-13 and BEA-dev. The tagged corruption model achieves state-of-the-art performance, outperforming untagged corruption models by more than three F0.5 points (F0.5 is a standard metric in GEC research).
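For reference, F0.5 is the general F-beta score with beta = 0.5, which weights precision twice as heavily as recall; in GEC, proposing a wrong correction is generally considered worse than missing one. A small sketch of the computation:

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """General F-beta score; beta = 0.5 gives the F0.5 used in GEC."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Example values only, not results from the paper:
print(f_beta(precision=0.6, recall=0.3))  # -> 0.5
```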
Furthermore, the results demonstrate the proposed model's ability to adapt GEC systems to users' proficiency levels. This is a significant benefit, as the error-tag distributions of native and non-native English writers typically differ substantially.
Paper: https://aclanthology.org/2021.bea-1.4.pdf
Dataset: https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction
Source: https://ai.googleblog.com/2021/08/the-c4200m-synthetic-dataset-for.html
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Bhubaneswar. She is a Data Science enthusiast with a keen interest in the applications of artificial intelligence in various fields. She is passionate about exploring new advancements in technologies and their real-life applications.