This is the original implementation of the AAAI 2020 paper: Semi-Supervised Text Simplification with Back-Translation and Asymmetric Denoising Autoencoders
The non-parallel data is provided by Surya et al. (2018). You can download it from here.
Extract the complex and simple sentences into the data/non_parallel directory. If you want to train in semi-supervised mode, you can also put parallel data such as Wiki-Large or Newsela into data/parallel. We provide several examples in the data directory; see the example setup below.
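A minimal sketch of one possible setup, assuming the downloaded files are plain text with one sentence per line; the file names here are placeholders, so follow the examples shipped in the data directory for the exact naming:

```bash
# Hedged setup sketch -- file names are placeholders, not the repository's actual ones.
mkdir -p data/non_parallel data/parallel

# Non-parallel complex/simple sentences from Surya et al. (2018):
cp /path/to/complex_sentences.txt data/non_parallel/
cp /path/to/simple_sentences.txt  data/non_parallel/

# Optional parallel data (Wiki-Large or Newsela) for semi-supervised training:
cp /path/to/wikilarge.train.* data/parallel/
cp /path/to/newsela.train.*   data/parallel/
```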
Download resource.zip from here and extract it. resource.zip contains:
- Substitution rules extracted from SimplePPDB
- Pretrained BPE embeddings trained with fastText
- A pretrained language model for reward calculation
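One possible way to unpack the archive; the target directory name is an assumption, so adjust it to whatever path run.sh expects:

```bash
# Hedged sketch: download resource.zip from the link above first.
# The target directory "resource/" is an assumption.
unzip resource.zip -d resource/
ls resource/   # SimplePPDB rules, BPE embeddings, language model
```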
Train the model using
bash run.sh
If you want to use reinforcement learning to finetune the model, make sure you set RL_FINTUNE=1 in run.sh.
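For example, assuming run.sh defines the flag on a line of its own (an assumption about the script's layout), you could flip it like this:

```bash
# Hedged example -- assumes run.sh contains a line of the form "RL_FINTUNE=0".
sed -i 's/^RL_FINTUNE=.*/RL_FINTUNE=1/' run.sh
bash run.sh   # fine-tune the pretrained model with the RL reward
```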
In our experiments, we use one GPU for training and several GPUs for back-translation, so you need at least two GPUs to run our experiments. You can use --otf_num_processes
to adjust the number of GPUs used for back-translation.
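As a rough illustration (only --otf_num_processes comes from this README; the script name main.py and the other details are assumptions, and the actual training command lives in run.sh):

```bash
# Hedged sketch of a 3-GPU setup: GPU 0 trains, GPUs 1-2 serve back-translation.
export CUDA_VISIBLE_DEVICES=0,1,2

# Spawn two back-translation worker processes, one per spare GPU:
python main.py --otf_num_processes 2   # plus the remaining arguments from run.sh
```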
Run
bash translate.sh
to generate simplified sentences for evaluation. We use the test sets of Newsela and Wiki-Large in our experiments.
Then run
bash eval.sh
to evaluate the generated sentences.
For corpus-level SARI, the original script provided by Xu et al. (2016) only supports the 8-reference Wiki-Large dataset. Several previous works misused the original script on 1-reference datasets, which may lead to a very low score. We therefore provide a Python implementation of corpus-level SARI in metrics/STAR.py, which produces the same results as the original script on Wiki-Large and correct results on the 1-reference Newsela dataset.
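For illustration, an evaluation call might look like the following; the argument names and paths below are purely hypothetical placeholders, since the actual interface of metrics/STAR.py is not documented here:

```bash
# Hypothetical invocation -- the real argument names of metrics/STAR.py may differ;
# check the script itself before running.
python metrics/STAR.py \
    --src data/parallel/newsela.test.src \
    --hyp output/newsela.test.pred \
    --ref data/parallel/newsela.test.ref
```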