C4_200M Synthetic Dataset for Grammatical Error Correction

This dataset contains synthetic training data for grammatical error correction and is described in our BEA 2021 paper. To generate the parallel training data, first obtain the C4 corpus and then apply the edits published here by following the instructions below.

Generating the dataset

The following instructions have been tested in an Anaconda (version Anaconda3 2021.05) Python environment, but are expected to work in other Python 3 setups, too.

1.) Install the dependencies

Install the Abseil Python package with PIP:

pip install absl-py

2.) Download the C4_200M corruptions

Change to a new working directory and download the C4_200M corruptions from Kaggle Datasets.

The edits are split into 10 shards and stored as tab-separated values:

$ head edits.tsv-00000-of-00010

00000002020d286371dd59a2f8a900e6	8	13	is
00000002020d286371dd59a2f8a900e6	38	60	which CoinDesk says.
00000069b517cf07c79124fae6ebd0d8	0	3
00000069b517cf07c79124fae6ebd0d8	17	34	widespread dud
0000006dce3b7c10a6ad25736c173506	0	14
0000006dce3b7c10a6ad25736c173506	21	30	sales
0000006dce3b7c10a6ad25736c173506	33	44	stores
0000006dce3b7c10a6ad25736c173506	48	65	non residents are
0000006dce3b7c10a6ad25736c173506	112	120	sales tentatively
0000006dce3b7c10a6ad25736c173506	127	130	from

The first column is an MD5 hash that identifies a sentence in the C4 corpus. The second and third columns are the start and end byte positions of the span to be replaced, and the fourth column contains the replacement text (a missing fourth column indicates a deletion).
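The edits of one shard can be loaded into memory with a few lines of Python. This is only an illustrative sketch of the format described above, using the file names from this step:

from collections import defaultdict

def read_edits(path):
    # Maps MD5 hash -> list of (byte_start, byte_end, replacement) edits.
    edits = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            md5, start, end = fields[0], int(fields[1]), int(fields[2])
            replacement = fields[3] if len(fields) > 3 else ""  # missing field = deletion
            edits[md5].append((start, end, replacement))
    return edits

edits = read_edits("edits.tsv-00000-of-00010")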

3.) Extract C4_200M target sentences from C4

C4_200M uses a relatively small subset of C4 (200M sentences). There are two ways to obtain the C4_200M target sentences: using TensorFlow Datasets or using the C4 version provided by allenai.

Using TensorFlow Datasets

Install the TensorFlow Datasets Python package with PIP:

pip install tensorflow-datasets

Obtain the C4 corpus version 2.2.1 by following these instructions. The c4200m_get_target_sentences.py script fetches the clean target sentences from C4 for a single shard:

python c4200m_get_target_sentences.py edits.tsv-00000-of-00010 target_sentences.tsv-00000-of-00010 &> get_target_sentences.log-00000-of-00010

Repeat for the remaining nine shards, optionally with a trailing ampersand for parallel processing. Alternatively, run c4200m_get_target_sentences_concurrent.py with a concurrent-runs parameter to process multiple shards at the same time:

python c4200m_get_target_sentences_concurrent.py edits.tsv-00000-of-00010 target_sentences.tsv-00000-of-00010 5 &> get_target_sentences.log-00000-of-00010

The above reads 5 shards (00000 to 00004) at once and saves the target sentences to their corresponding files.
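Conceptually, these scripts stream the C4 training split, compute the MD5 hash of every sentence, and write out the sentences whose hash occurs in the edits shard. The following is only a rough sketch of that idea; it assumes documents are split into candidate sentences at newlines, whereas the actual scripts may segment text differently and also handle sharding and logging:

import hashlib
import tensorflow_datasets as tfds

# Hashes of the target sentences needed for this shard.
with open("edits.tsv-00000-of-00010", encoding="utf-8") as f:
    wanted = {line.split("\t", 1)[0] for line in f}

with open("target_sentences.tsv-00000-of-00010", "w", encoding="utf-8") as out:
    # Streams the full C4 train split, so this takes a while.
    for example in tfds.load("c4/en:2.2.1", split="train"):
        text = example["text"].numpy().decode("utf-8")
        for sentence in text.split("\n"):  # assumption: one candidate sentence per line
            md5 = hashlib.md5(sentence.encode("utf-8")).hexdigest()
            if md5 in wanted:
                out.write(f"{md5}\t{sentence}\n")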

Using the C4 Dataset in .json.gz Format

Given a folder containing the C4 dataset compressed in .json.gz files as provided by allenai, it is possible to fetch most of the clean target sentences as follows:

python c4200m_get_target_sentences_json.py edits.tsv-00000-of-00010 /C4/en/ target_sentences.tsv-00000-of-00010 &> get_target_sentences.log-00000-of-00010

where we assume the training examples of the C4 dataset are located in /C4/en/*train*.json.gz.

Repeat for the remaining nine shards, optionally with trailing ampersand for parallel processing.
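The same hash lookup works directly on the allenai files. A minimal sketch under the same sentence-splitting assumption as above, reading the documents from /C4/en/*train*.json.gz:

import glob
import gzip
import hashlib
import json

with open("edits.tsv-00000-of-00010", encoding="utf-8") as f:
    wanted = {line.split("\t", 1)[0] for line in f}

with open("target_sentences.tsv-00000-of-00010", "w", encoding="utf-8") as out:
    for path in glob.glob("/C4/en/*train*.json.gz"):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:  # one JSON document per line
                text = json.loads(line)["text"]
                for sentence in text.split("\n"):  # assumption: one candidate sentence per line
                    md5 = hashlib.md5(sentence.encode("utf-8")).hexdigest()
                    if md5 in wanted:
                        out.write(f"{md5}\t{sentence}\n")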

4.) Apply corruption edits

The mapping from the MD5 hash to the target sentence is written to target_sentences.tsv*:

$ head -n 3 target_sentences.tsv-00000-of-00010

00000002020d286371dd59a2f8a900e6	Bitcoin goes for $7,094 this morning, according to CoinDesk.
00000069b517cf07c79124fae6ebd0d8	1. The effect of "widespread dud" targets two face up attack position monsters on the field.
0000006dce3b7c10a6ad25736c173506	Capital Gains tax on the sale of properties for non-residents is set at 21% for 2014 and 20% in 2015 payable on profits earned on the difference of the property value between the year of purchase (purchase price plus costs) and the year of sale (sales price minus costs), based on the approved annual percentage increase on the base value approved by law.

To generate the final parallel dataset, the edits in edits.tsv* have to be applied to the sentences in target_sentences.tsv*:

python c4200m_make_sentence_pairs.py target_sentences.tsv-00000-of-00010 edits.tsv-00000-of-00010 sentence_pairs.tsv-00000-of-00010

The parallel data is written to sentence_pairs.tsv*:

$ head -n 3 sentence_pairs.tsv-00000-of-00010

Bitcoin is for $7,094 this morning, which CoinDesk says.	Bitcoin goes for $7,094 this morning, according to CoinDesk.
The effect of widespread dud targets two face up attack position monsters on the field.	1. The effect of "widespread dud" targets two face up attack position monsters on the field.
tax on sales of stores for non residents are set at 21% for 2014 and 20% in 2015 payable on sales tentatively earned from the difference of the property value some time of purchase (price differences according to working time) and theyear to which sale couples (sales costs), based on the approved annual on the base approved by law).	Capital Gains tax on the sale of properties for non-residents is set at 21% for 2014 and 20% in 2015 payable on profits earned on the difference of the property value between the year of purchase (purchase price plus costs) and the year of sale (sales price minus costs), based on the approved annual percentage increase on the base value approved by law.

Again, repeat for the remaining nine shards.
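For intuition, the corruption step cuts the clean target sentence at the given byte positions and splices in the replacement texts. The sketch below reproduces the first sentence pair above by stripping the resulting pieces and rejoining them with single spaces; this matches the examples shown here, but whitespace handling in edge cases may differ from c4200m_make_sentence_pairs.py, which remains the authoritative implementation:

def corrupt(target, edits):
    # target: clean sentence; edits: list of (byte_start, byte_end, replacement).
    data = target.encode("utf-8")  # edit positions are byte offsets
    pieces, last = [], 0
    for start, end, replacement in sorted(edits):
        pieces.append(data[last:start].decode("utf-8"))
        pieces.append(replacement)
        last = end
    pieces.append(data[last:].decode("utf-8"))
    # Assumption: strip the pieces and rejoin them with single spaces.
    return " ".join(p.strip() for p in pieces if p.strip())

target = "Bitcoin goes for $7,094 this morning, according to CoinDesk."
print(corrupt(target, [(8, 13, "is"), (38, 60, "which CoinDesk says.")]))
# Bitcoin is for $7,094 this morning, which CoinDesk says.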

Multilingual C4_200M

In our BEA 2024 paper we introduced variants of our original English dataset in German, Spanish, Romanian, and Russian. The multilingual datasets are generated with the same recipe, but you need to provide the language ID to c4200m_get_target_sentences.py:

python c4200m_get_target_sentences.py multilingual/ro.tsv ro.target_sentences.tsv ro &> ro.get_target_sentences.log

The entry point to the multilingual annotation toolkit is annotate.py:

$ echo -e "I goed to the storr.\tI went to the store." | python3 -m merrant.annotate

S I goed to the storr.
A 2 6|||R:VERB:INFL|||went|||REQUIRED|||-NONE-|||0
A 14 19|||R:SPELL|||store|||REQUIRED|||-NONE-|||0

License

The corruption edits in this dataset are licensed under CC BY 4.0.

BibTeX

If you find this dataset useful, please cite our papers:

Original English dataset (BEA 2021 paper):

@inproceedings{stahlberg-kumar-2021-synthetic,
    title = "Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models",
    author = "Stahlberg, Felix and Kumar, Shankar",
    booktitle = "Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.bea-1.4",
    pages = "37--47",
}

Multilingual dataset in German, Spanish, Romanian, and Russian (BEA 2024 paper):

@inproceedings{stahlberg-kumar-2024-synthetic,
    title = "Synthetic Data Generation for Low-resource Grammatical Error Correction with Tagged Corruption Models",
    author = "Stahlberg, Felix  and
      Kumar, Shankar",
    booktitle = "Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.bea-1.2",
    pages = "11--16",
}

This is not an officially supported Google product.