Google AI Introduces Tagged Corruption Models To Generate Synthetic Dataset, C4_200M Corpus, For Grammatical Error Correction (GEC)

In recent years, Natural Language Processing (NLP) has evolved into a powerful field of AI, with applications ranging from language translation to spam filtering.

Grammatical error correction (GEC) is one such task: it provides grammar and spelling suggestions to help users improve the quality of their writing in documents, emails, blog posts, and elsewhere.

However, unlike many other NLP tasks, GEC has only a limited number of datasets available for training. To compensate for the lack of data, one can generate synthetic data with techniques such as heuristic-based random word- or character-level corruptions. However, such methods are often oversimplified and do not reflect the actual distribution of error types produced by real users.
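
As an illustration of why such heuristics fall short, here is a minimal Python sketch (not Google's method, just the baseline idea) of random character-level corruption; the errors it produces look nothing like the mistakes real writers make:

```python
import random

# Illustrative heuristic corruption of the oversimplified kind described
# above: random character-level deletions and swaps. This is NOT the paper's
# method; it only sketches the naive baseline the authors improve upon.
def naive_corrupt(sentence: str, p: float = 0.1) -> str:
    out = []
    for c in sentence:
        r = random.random()
        if r < p / 2:
            continue                      # randomly delete this character
        if r < p and out:
            out[-1], c = c, out[-1]       # randomly swap with previous character
        out.append(c)
    return "".join(out)

print(naive_corrupt("She sells seashells by the seashore."))
```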

A recent Google study, “Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models,” proposes tagged corruption models to address these issues. They allow more precise control over synthetic data generation while producing diverse outputs that are more consistent with the error distributions seen in practice.

Tagged Corruption Models 

A conventional corruption model inverts the GEC task: it takes a grammatically correct sentence and then “corrupts” it by introducing errors. Tagged corruption models additionally condition on an error type tag, so different error types can be chosen for different sentences, yielding more diverse corruptions than conventional corruption models produce.

A conventional corruption model generates an ungrammatical sentence (red) given a clean input sentence (green).
Tagged corruption models generate corruptions (red) for the clean input sentence (green) depending on the error type tag. A determiner error may lead to dropping the “a”, whereas a noun-inflection error may produce the incorrect plural “sheeps”.

Source: https://ai.googleblog.com/2021/08/the-c4200m-synthetic-dataset-for.html
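The following toy sketch illustrates how the tag steers which error is introduced, using the determiner and noun-inflection examples from the figure. The hand-written rules are illustrative assumptions only; the actual tagged corruption models are neural sequence-to-sequence models:

```python
# Toy sketch of tagged corruption: the corruption is conditioned on an error
# type tag (ERRANT-style tag names assumed here). Real tagged corruption
# models learn these corruptions; the rules below are hard-coded stand-ins.
RULES = {
    "DET": lambda words: [w for w in words if w.lower() not in {"a", "an", "the"}],
    "NOUN:INFL": lambda words: [w + "s" if w == "sheep" else w for w in words],
}

def tagged_corrupt(sentence: str, tag: str) -> str:
    words = sentence.split()
    rule = RULES.get(tag, lambda ws: ws)  # unknown tag: leave sentence unchanged
    return " ".join(rule(words))

print(tagged_corrupt("I saw a sheep .", "DET"))        # -> "I saw sheep ."
print(tagged_corrupt("I saw a sheep .", "NOUN:INFL"))  # -> "I saw a sheeps ."
```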

Data generation using Tagged Corruption Models

To provide researchers with realistic pre-training data for GEC, Google used tagged corruption models to generate C4_200M, a synthetic corpus of 200 million sentence pairs.

The researchers chose 200 million clean sentences at random from the C4 corpus and assigned an error type tag to each one so that the relative tag frequencies matched the distribution in BEA-dev, a small development set. Because BEA-dev is a carefully curated sample covering a wide range of English proficiency levels, its tag distribution reflects the writing errors found in the real world. A tagged corruption model then synthesized the corrupted source sentence for each clean target. By incorporating this new dataset into the training pipeline, the researchers were able to dramatically improve on GEC baselines.

Synthetic data generation with tagged corruption models. The clean C4 sentences (green) are paired with the corrupted sentences (red) in the synthetic GEC training corpus. The corrupted sentences are generated using a tagged corruption model by following the error type frequencies in the development set (bar chart).
Source: https://ai.googleblog.com/2021/08/the-c4200m-synthetic-dataset-for.html
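
A minimal sketch of this generation loop, assuming a made-up stand-in for the BEA-dev tag distribution (the tag frequencies and the demo model below are illustrative only):

```python
import random

# Sketch of the generation loop: sample an error type tag for each clean C4
# sentence so that tag frequencies match a reference distribution, then let a
# tagged corruption model produce the corrupted source side of the pair.
# The distribution below is a made-up stand-in for the real BEA-dev tag counts.
TAG_DIST = {
    "PUNCT": 0.25, "VERB:TENSE": 0.15, "DET": 0.10, "NOUN:INFL": 0.05,
    "OTHER": 0.45,
}
TAGS, WEIGHTS = zip(*TAG_DIST.items())

def make_training_pair(clean_sentence, corruption_model):
    tag = random.choices(TAGS, weights=WEIGHTS, k=1)[0]
    corrupted = corruption_model(clean_sentence, tag)  # e.g. the sketch above
    return corrupted, clean_sentence                   # (source, target) pair

# Trivial stand-in model for demonstration: prepends the sampled tag.
demo_model = lambda s, t: f"[{t}] {s}"
print(make_training_pair("I saw a sheep .", demo_model))
```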

Results

The researchers evaluated the proposed approach on two standard development sets, CoNLL-13 and BEA-dev. The tagged corruption model achieves state-of-the-art performance and outperforms untagged corruption models by more than three F0.5 points (F0.5 is the standard metric in GEC research; see the short computation below).
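
For reference, F0.5 is the F-beta score with beta = 0.5, which weights precision twice as heavily as recall, reflecting that a GEC system's unnecessary “corrections” are more harmful than missed errors. A small self-contained computation with illustrative counts:

```python
# F-beta score; beta = 0.5 gives the F0.5 metric used in GEC evaluation.
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative counts only: precision = 0.8, recall ~ 0.667 -> F0.5 ~ 0.769.
print(round(f_beta(tp=80, fp=20, fn=40), 3))
```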

Furthermore, the results demonstrate the proposed model’s ability to adapt GEC systems to different user skill levels. This is a significant benefit, as error-tag distributions typically differ substantially between native and non-native English writers.

Paper: https://aclanthology.org/2021.bea-1.4.pdf

Dataset: https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction 

Source: https://ai.googleblog.com/2021/08/the-c4200m-synthetic-dataset-for.html