Grammar Error Correction Based on Tensor2Tensor
A temporary project from DeeCamp.
The overall training procedure consists of two stages: pretraining and finetuning.
Subword-nmt
The input to this model must be in BPE format, produced with subword-nmt.
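A minimal sketch of producing BPE input with the subword-nmt Python package is shown below; the file names (train.tok.txt, codes.bpe, input.tok.txt) and the 32k merge count are placeholders, not values fixed by this project.

```python
import codecs

from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn 32k merge operations from the (tokenized) training corpus.
# File names are placeholders.
with codecs.open("train.tok.txt", encoding="utf-8") as infile, \
     codecs.open("codes.bpe", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=32000)

# Load the learned codes and apply them so every line is in BPE format.
with codecs.open("codes.bpe", encoding="utf-8") as codes:
    bpe = BPE(codes)

with codecs.open("input.tok.txt", encoding="utf-8") as fin, \
     codecs.open("input.bpe.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(bpe.process_line(line))
```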
Pretrain
To improve performance on this seq2seq task, the model is first pretrained on a large native (error-free) corpus. Source sentences are generated by injecting artificial noise into the native sentences, so that the model learns to denoise them back to the originals; the noising method follows https://github.com/zhawe01/fairseq-gec. The number of pretraining steps depends on the size of the native corpus and on the batch_size parameter, and should cover at least one epoch of the native corpus.
Tip: batch_size refers to the number of tokens per batch, not the number of sentences.
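The exact noising scheme is defined in the fairseq-gec repository linked above. The sketch below only illustrates the general idea of corrupting clean sentences (random deletion, substitution, insertion, and adjacent swaps) and how a one-epoch step count follows from the corpus size and a token-based batch_size; the probabilities and function names are illustrative assumptions, not the values used by fairseq-gec.

```python
import random

def add_noise(tokens, vocab, p_drop=0.1, p_sub=0.1, p_insert=0.1, p_swap=0.1):
    """Corrupt a clean token sequence so (noised, clean) pairs can serve as
    pretraining data. Probabilities are illustrative, not the fairseq-gec values."""
    noised = []
    for tok in tokens:
        r = random.random()
        if r < p_drop:
            continue                                 # drop the token
        elif r < p_drop + p_sub:
            noised.append(random.choice(vocab))      # substitute a random token
        else:
            noised.append(tok)                       # keep the token
        if random.random() < p_insert:
            noised.append(random.choice(vocab))      # insert a random token
    if len(noised) > 1 and random.random() < p_swap:
        i = random.randrange(len(noised) - 1)        # swap two adjacent tokens
        noised[i], noised[i + 1] = noised[i + 1], noised[i]
    return noised


def one_epoch_steps(corpus_tokens, batch_size_tokens):
    """Training steps needed for one pass over the native corpus when
    batch_size is measured in tokens."""
    return corpus_tokens // batch_size_tokens

# Example: a 1e9-token native corpus with batch_size=4096 tokens needs
# about 1e9 / 4096, i.e. roughly 244k steps, for one epoch.
```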
Finetune
After pretraining, the model is finetuned on a GEC corpus such as CoNLL-2014.
The number of finetuning steps depends on the training loss and on the model's performance on your task.
For serving, we use TensorFlow Serving on Docker.
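Once the exported model is running under TensorFlow Serving, it can be queried over its REST API. The sketch below assumes the default REST port 8501, a placeholder model name gec, and a serving signature that accepts a plain string; adjust these to match how the model was actually exported.

```python
import requests

# Assumed endpoint: TensorFlow Serving's REST API on its default port 8501,
# with the exported model registered under the placeholder name "gec".
SERVER_URL = "http://localhost:8501/v1/models/gec:predict"

def correct(bpe_sentence):
    """Send one BPE-segmented sentence to the served model and return the raw
    JSON response. The instance layout depends on the model's serving signature."""
    response = requests.post(SERVER_URL, json={"instances": [bpe_sentence]})
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    print(correct("she go@@ es to school yesterday ."))
```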