Houquan Zhou, Yumeng Liu, Zhenghua Li✉️, Min Zhang, Bo Zhang, Chen Li, Ji Zhang, Fei Huang
This repo contains the code for our EMNLP 2023 Findings paper: Improving Seq2Seq Grammatical Error Correction via Decoding Interventions.
We introduce a decoding intervention framework that uses critics to assess and guide token generation during decoding. We evaluate two types of critics: a pre-trained language model and an incremental target-side grammatical error detector. Experiments on English and Chinese datasets show that our approach outperforms many existing methods and is competitive with state-of-the-art models.
@inproceedings{zhou-etal-2023-improving-seq2seq,
title = "Improving {S}eq2{S}eq Grammatical Error Correction via Decoding Interventions",
author = "Zhou, Houquan and
Liu, Yumeng and
Li, Zhenghua and
Zhang, Min and
Zhang, Bo and
Li, Chen and
Zhang, Ji and
Huang, Fei",
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-emnlp.495",
pages = "7393--7405",
}
Clone this repo recursively:
git clone https://github.com/Jacob-Zhou/gecdi.git --recursive
# The newest version of the parser is not compatible with the current code,
# so we need to check out a previous version
cd gecdi/3rdparty/parser/ && git checkout 6dc927b && cd -
Then you can use the following commands to create an environment and install the dependencies:
. scripts/set_environment.sh
# For ERRANT (v2.0.0) evaluation, a Python 3.6 environment is required.
# Make sure your system has Python 3.6 installed, then run:
. scripts/set_py36_environment.sh
You can follow this repo to obtain the 3-stage train/dev/test data for training an English GEC model. The multilingual datasets are available here.
Before running, you need to preprocess each sentence pair into the following format:
S [src]
T [tgt]
S [src]
T [tgt]
where `[src]` and `[tgt]` are the source and target sentences, respectively. A `\t` separates the prefix `S` or `T` from the sentence. Each sentence pair is separated by a blank line. See `data/toy.train` for examples.
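For illustration, the snippet below appends one sentence pair in this format; the sentences and the output path `data/my.train` are made up for this example. Note the literal tab between the prefix and the sentence, and the blank line after each pair:
# Append one illustrative sentence pair (made-up sentences, hypothetical path)
printf 'S\tShe go to school yesterday .\nT\tShe went to school yesterday .\n\n' >> data/my.train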
The trained models are available on the Hugging Face model hub. You can download them by running:
# If you have not installed git-lfs, please install it first
# The installation guide can be found here: https://git-lfs.github.com/
# Most installation methods require root permission.
# However, you can install it locally using conda:
# https://anaconda.org/anaconda/git-lfs
# Create directory for storing the trained models
mkdir -p models
cd models
# Download the trained models
# First, clone the small files
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/HQZhou/bart-large-gec
# Then use git-lfs to download the large files
cd bart-large-gec
git lfs pull
# Return to the models directory
cd -
# The download process is the same for the GED model
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/HQZhou/bart-large-ged
cd bart-large-ged
git lfs pull
# The download process is the same for the Chinese models
# Just change the GEC url to https://huggingface.co/HQZhou/bart-large-chinese-gec
# and the GED url to https://huggingface.co/HQZhou/bart-large-chinese-ged
The models can also be downloaded using the `huggingface-cli`:
# First make sure that you have installed `huggingface_hub` package
# You can install it following the guide here: https://huggingface.co/docs/huggingface_hub/installation
huggingface-cli download HQZhou/bart-large-gec --local-dir-use-symlinks False --local-dir models/bart-large-gec
huggingface-cli download HQZhou/bart-large-ged --local-dir-use-symlinks False --local-dir models/bart-large-ged
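The Chinese models can be fetched the same way with `huggingface-cli`; the commands below simply substitute the Chinese repository names mentioned above (the local directory names are our own choice):
# Download the Chinese GEC and GED models (local directory names are arbitrary)
huggingface-cli download HQZhou/bart-large-chinese-gec --local-dir-use-symlinks False --local-dir models/bart-large-chinese-gec
huggingface-cli download HQZhou/bart-large-chinese-ged --local-dir-use-symlinks False --local-dir models/bart-large-chinese-ged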
English experiments:
# Baseline (vanilla decoding)
bash pred.sh \
devices=0 \
gec_path=models/bart-large-gec/model \
dataset=bea19.dev
# w/ LM-critic
bash pred.sh \
devices=0 \
gec_path=models/bart-large-gec/model \
lm_alpha=0.8 lm_beta=10 \
dataset=bea19.dev
# w/ GED-critic
bash pred.sh \
devices=0 \
gec_path=models/bart-large-gec/model \
ged_path=models/bart-large-ged/model \
ged_alpha=0.8 ged_beta=1 \
batch=500 \
dataset=bea19.dev
# w/ both LM-critic and GED-critic
bash pred.sh \
devices=0 \
gec_path=models/bart-large-gec/model \
ged_path=models/bart-large-ged/model \
lm_alpha=0.8 lm_beta=10 \
ged_alpha=0.8 ged_beta=1 \
batch=250 \
dataset=bea19.dev
Chinese experiments:
# Baseline (vanilla decoding)
bash pred.sh \
devices=0 \
dataset=mucgec.dev
# w/ LM-critic
bash pred.sh \
devices=0 \
lm_alpha=0.3 \
lm_beta=0.1 \
dataset=mucgec.dev
# w/ GED-critic
bash pred.sh \
devices=0 \
ged_alpha=0.6 ged_beta=10 \
dataset=mucgec.dev
# w/ both LM-critic and GED-critic
bash pred.sh \
devices=0 \
lm_alpha=0.3 lm_beta=0.1 \
ged_alpha=0.6 ged_beta=10 \
dataset=mucgec.dev
Run target-side GED only:
bash pred_ged.sh \
devices=0 \
path=models/bart-large-ged/model \
data=<path to the parallel data to be detected> \
pred=<path to the output file>
# the input file should be in the format of:
# S [src 0]
# T [tgt 0]
# S [src 1]
# T [tgt 1]
# the output file will be in JSONL format, as follows:
# {
# "src_text": "I implicated my class from winning the champion .",
# "tgt_text": "I implicated my class in winning the champion .",
# "tgt_subword": ["ĠI", "Ġimplicated", "Ġmy", "Ġclass", "Ġin", "Ġwinning", "Ġthe", "Ġchampion", "Ġ."],
# "error": [[1, 2, "SUB"], [4, 5, "SUB"]]
# }
# the error field is a list of error spans; each span is represented as a list of three elements:
# [start of subword span, end of subword span, error type]
# error type can be one of the following:
# `RED`: redundant
# `SUB`: substitution
# `MISS-L`: there are missing tokens on the left side of the span
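To quickly inspect a prediction, you can pretty-print a single record from the output file; `pred.jsonl` below is just a placeholder for whatever path you passed to `pred=`:
# Pretty-print the first JSONL record (pred.jsonl is a placeholder path)
head -n 1 pred.jsonl | python -m json.tool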
We search for the best coefficients $\alpha$ and $\beta$ for each critic. The optimal coefficients vary across datasets.
Hyperparameters for LM-critic:
Dataset | $\alpha$ (lm_alpha) | $\beta$ (lm_beta)
---|---|---
CoNLL-14 | 0.8 | 10.0 |
BEA-19 | 0.8 | 10.0 |
GMEG-Wiki | 1.0 | 10.0 |
MuCGEC | 0.3 | 0.1 |
Hyperparameters for GED-critic:
Dataset | $\alpha$ (ged_alpha) | $\beta$ (ged_beta)
---|---|---
CoNLL-14 | 0.8 | 1.0 |
BEA-19 | 0.8 | 1.0 |
GMEG-Wiki | 0.9 | 1.0 |
MuCGEC | 0.6 | 10.0 |
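As a rough sketch of how such a search can be reproduced, the loop below re-runs pred.sh over a small grid of LM-critic coefficients; the candidate values are illustrative and not the exact grid used in the paper:
# Illustrative grid search over LM-critic coefficients (candidate values are examples only)
for lm_alpha in 0.3 0.8 1.0; do
    for lm_beta in 0.1 1 10; do
        bash pred.sh \
            devices=0 \
            gec_path=models/bart-large-gec/model \
            lm_alpha=$lm_alpha lm_beta=$lm_beta \
            dataset=bea19.dev
    done
done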
- Appendix B.2 (STAGE 3): We further fine-tune the model on the W&I + LOCNESS test set only.
$\rightarrow$ We further fine-tune the model on the W&I + LOCNESS training set only. (We sincerely apologize for this typo and thank @GMago123 for pointing it out in issue #4.)