To set up the environment, ensure you have the following versions installed:
- CUDA: 12.1
- Python: 3.10
To install the required Python dependencies, run the following command:
pip install -r requirements.txt
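After installation, the versions listed above can be sanity-checked from Python. The following is a minimal sketch, not part of the repository: it assumes PyTorch is one of the packages installed from requirements.txt, and the helper names `python_ok` and `detected_cuda` are hypothetical.

```python
import sys

REQUIRED_PYTHON = (3, 10)  # from the version list above
REQUIRED_CUDA = "12.1"     # from the version list above


def python_ok(version_info=sys.version_info, required=REQUIRED_PYTHON):
    """Return True when the interpreter's major.minor equals the required version."""
    return tuple(version_info[:2]) == required


def detected_cuda():
    """Return the CUDA version PyTorch was built against, or None if unknown."""
    try:
        import torch  # assumption: PyTorch is among the installed requirements
    except ImportError:
        return None
    return torch.version.cuda  # None for CPU-only builds


if __name__ == "__main__":
    print("Python", sys.version.split()[0], "OK" if python_ok() else "MISMATCH")
    cuda = detected_cuda()
    print("CUDA", cuda, "OK" if cuda == REQUIRED_CUDA else "not confirmed")
```

Note that this reports the CUDA version PyTorch was compiled for, which may differ from the toolkit installed system-wide.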
Download the following datasets:
- BEA-2019 Shared Task
- CoNLL-2014 Shared Task
- FCGEC Dataset
- NaCGEC Dataset
Preprocess the datasets into a unified format compatible with the training pipeline.
The data must follow the same structure as the provided examples:
data/epo_data_sample.json
data/sft_data_example.json
Additionally, modify data/dataset_info.json to match your dataset configuration.
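The preprocessing step can be sketched as follows. This is an illustration under stated assumptions, not the repository's own converter: the record fields (`instruction`, `input`, `output`), the instruction text, and the `columns` mapping follow the instruction-data convention used by LLaMA-Factory, and the function names are hypothetical. Check the field names against data/sft_data_example.json before using it; the pairwise format in data/epo_data_sample.json is not covered here.

```python
import json
from pathlib import Path

# Hypothetical instruction text; replace it with the prompt used in
# data/sft_data_example.json.
INSTRUCTION = "Correct the grammatical errors in the following sentence."


def to_records(pairs):
    """Turn (source, corrected) sentence pairs into instruction-style records."""
    return [
        {"instruction": INSTRUCTION, "input": src, "output": tgt}
        for src, tgt in pairs
    ]


def register_dataset(info_path, name, file_name):
    """Add or update one entry in dataset_info.json, keeping existing entries."""
    info_path = Path(info_path)
    info = {}
    if info_path.exists():
        info = json.loads(info_path.read_text(encoding="utf-8"))
    info[name] = {
        "file_name": file_name,
        # Assumed column mapping (LLaMA-Factory style for instruction data).
        "columns": {"prompt": "instruction", "query": "input", "response": "output"},
    }
    info_path.write_text(
        json.dumps(info, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return info


def write_dataset(pairs, data_dir, name):
    """Write the records to <data_dir>/<name>.json and register the dataset."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    file_name = f"{name}.json"
    records = to_records(pairs)
    (data_dir / file_name).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    register_dataset(data_dir / "dataset_info.json", name, file_name)
    return records
```

For example, `write_dataset([("He go home.", "He goes home.")], "data", "my_gec_sft")` would write data/my_gec_sft.json and add a `my_gec_sft` entry to data/dataset_info.json; the dataset name is a placeholder. `ensure_ascii=False` keeps Chinese text readable in the output files.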
bash bash/train_gec_sft_stage1.sh
bash bash/export_model.sh # merge LoRA weights
bash bash/train_gec_sft_stage2.sh
bash bash/gec_pairwise_sampling.sh # generate pairwise samples
bash bash/train_gec_epo.sh
Note: For Chinese GEC, the corresponding scripts are in the bash directory.
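The five training steps depend on each other's outputs, so it can be convenient to run them in order and stop at the first failure. The sketch below is one way to do that and is not shipped with the project: `run_pipeline` and its `start` and `dry_run` parameters are hypothetical, and it assumes each script has already been edited with the correct model and data paths.

```python
import subprocess

# Script order as listed in the training steps above.
STAGES = [
    "bash/train_gec_sft_stage1.sh",
    "bash/export_model.sh",           # merge LoRA weights
    "bash/train_gec_sft_stage2.sh",
    "bash/gec_pairwise_sampling.sh",  # generate pairwise samples
    "bash/train_gec_epo.sh",
]


def run_pipeline(stages=STAGES, start=0, dry_run=False):
    """Run each stage in order from index `start`; stop at the first failure."""
    commands = [["bash", script] for script in stages[start:]]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))  # only show what would be executed
            continue
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later stages never run on top of a failed one.
        subprocess.run(cmd, check=True)
    return commands
```

Calling `run_pipeline(dry_run=True)` prints the commands without executing them, and `run_pipeline(start=3)` resumes from the pairwise sampling step after an interruption.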
bash bash/gec_eval.sh # for English GEC model
bash bash/cgec_eval.sh # for Chinese GEC model
This project is built upon LLaMA-Factory and utilizes the following tools for evaluation:
We are grateful for their contributions.