Grammar Error Correction Based on Tensor2Tensor
A temporary project from DeeCamp.
The overall training procedure consists of two stages: pretraining and finetuning.
Subword-nmt
The input to this model must be in BPE format, produced with subword-nmt.
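A minimal sketch of producing BPE input with the subword-nmt Python package is shown below; the file names (train.tok.txt, codes.bpe, input.tok.txt) and the 32k merge count are placeholders, not values fixed by this project.

```python
import codecs

from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn 32k merge operations from the (tokenized) training corpus.
# File names are placeholders.
with codecs.open("train.tok.txt", encoding="utf-8") as infile, \
     codecs.open("codes.bpe", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=32000)

# Load the learned codes and apply them so every line is in BPE format.
with codecs.open("codes.bpe", encoding="utf-8") as codes:
    bpe = BPE(codes)

with codecs.open("input.tok.txt", encoding="utf-8") as fin, \
     codecs.open("input.bpe.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(bpe.process_line(line))
```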
Pretrain
To improve performance on this seq2seq task, the model is first pretrained on a large native (error-free) corpus. Source sentences are generated by injecting artificial noise into the native sentences, so that the model learns to denoise them back to the originals; the noising method follows https://github.com/zhawe01/fairseq-gec. The number of pretraining steps depends on the size of the native corpus and on the batch_size parameter, and should cover at least one epoch of the native corpus.
Tip: batch_size refers to the number of tokens per batch, not the number of sentences.
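The exact noising scheme is defined in the fairseq-gec repository linked above. The sketch below only illustrates the general idea of corrupting clean sentences (random deletion, substitution, insertion, and adjacent swaps) and how a one-epoch step count follows from the corpus size and a token-based batch_size; the probabilities and function names are illustrative assumptions, not the values used by fairseq-gec.

```python
import random

def add_noise(tokens, vocab, p_drop=0.1, p_sub=0.1, p_insert=0.1, p_swap=0.1):
    """Corrupt a clean token sequence so (noised, clean) pairs can serve as
    pretraining data. Probabilities are illustrative, not the fairseq-gec values."""
    noised = []
    for tok in tokens:
        r = random.random()
        if r < p_drop:
            continue                                 # drop the token
        elif r < p_drop + p_sub:
            noised.append(random.choice(vocab))      # substitute a random token
        else:
            noised.append(tok)                       # keep the token
        if random.random() < p_insert:
            noised.append(random.choice(vocab))      # insert a random token
    if len(noised) > 1 and random.random() < p_swap:
        i = random.randrange(len(noised) - 1)        # swap two adjacent tokens
        noised[i], noised[i + 1] = noised[i + 1], noised[i]
    return noised


def one_epoch_steps(corpus_tokens, batch_size_tokens):
    """Training steps needed for one pass over the native corpus when
    batch_size is measured in tokens."""
    return corpus_tokens // batch_size_tokens

# Example: a 1e9-token native corpus with batch_size=4096 tokens needs
# about 1e9 / 4096, i.e. roughly 244k steps, for one epoch.
```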
Finetune
After pretraining, the model is finetuned on a GEC corpus such as CoNLL-2014.
The number of finetuning steps depends on the training loss and on the model's performance on your task.
For serving, we use TensorFlow Serving on Docker.
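Once the exported model is running under TensorFlow Serving, it can be queried over its REST API. The sketch below assumes the default REST port 8501, a placeholder model name gec, and a serving signature that accepts a plain string; adjust these to match how the model was actually exported.

```python
import requests

# Assumed endpoint: TensorFlow Serving's REST API on its default port 8501,
# with the exported model registered under the placeholder name "gec".
SERVER_URL = "http://localhost:8501/v1/models/gec:predict"

def correct(bpe_sentence):
    """Send one BPE-segmented sentence to the served model and return the raw
    JSON response. The instance layout depends on the model's serving signature."""
    response = requests.post(SERVER_URL, json={"instances": [bpe_sentence]})
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    print(correct("she go@@ es to school yesterday ."))
```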