This repo contains the software developed for the paper,
Cooperative Self-training for Machine Reading Comprehension, Luo H., Li S.-W., Gao M., Yu S., Glass J., NAACL 2022.
Try our live demo with medium-length passages (long document version coming soon).
We run this software using the following packages,
- Python 3.8.13
- NLTK 3.7
- Stanza 1.4.0
- PyTorch 1.11.0 + cu113
- Transformers 4.19.2
- Datasets 2.3.2
The pretrained models are available via this Google Drive Link. Please download the models and move them under the model_file/
directory if your machine cannot download the models from the Huggingface hub.
- model_file/ext_sqv2.pt: Pretrained ELECTRA-large question answering model on SQuAD v2.0.
- model_file/ques_gen_squad.pt: Pretrained BART-large question generation model on SQuAD v2.0.
- model_file/electra-tokenize.pt: ELECTRA-large tokenizer provided by Huggingface.
- model_file/bart-tokenizer.pt: BART-large tokenizer provided by Huggingface.
Generate question-answer pairs on the example SQuAD passages we provide at data/squad/doc_data_0.json
by running the following command,
python rgx_doc.py \
--dataset_name squad \
--data_split 0 \
--output_dir tmp/rgx \
--version_2_with_negative
The generated data will be stored under data_gen/squad, including rgx_0.json and qa_train_corpus_0.json. We provide the $DATA_SPLIT option for distributed data generation, for example, with Slurm. If generating QA pairs with a single process, simply use --data_split 0.
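For reference, the loop below is a minimal sketch of running generation once per document split; the number of splits is an assumption, and with Slurm each command would typically be submitted as its own job instead of being run in a loop.

```python
# Hypothetical sketch: run rgx_doc.py once per document split.
# Assumes doc_data_0.json ... doc_data_{NUM_SPLITS-1}.json exist under data/squad/.
import subprocess

NUM_SPLITS = 4  # assumption: number of document shards

for split in range(NUM_SPLITS):
    subprocess.run(
        [
            "python", "rgx_doc.py",
            "--dataset_name", "squad",
            "--data_split", str(split),
            "--output_dir", "tmp/rgx",
            "--version_2_with_negative",
        ],
        check=True,
    )
```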
All data are stored in the data/ and data_gen/ directories.
- data/{$DATASET_NAME}/doc_data_{$DATA_SPLIT}.json: unlabeled documents of the target dataset.
- data_gen/{$DATASET_NAME}/rgx_{$DATA_SPLIT}.json: generated QA data aligned with each document of the corresponding dataset.
- data_gen/{$DATASET_NAME}/qa_train_corpus_{$DATA_SPLIT}.json: generated QA training set of the given dataset. The training examples follow the SQuAD data format and are randomly shuffled.
- The input file doc_data_{$DATA_SPLIT}.json is a list of dictionaries of the form
[
{"context": INPUT_DOC_TXT__0},
{"context": INPUT_DOC_TXT__1},
...,
{"context": INPUT_DOC_TXT__N},
]
- The output file qa_train_corpus_{$DATA_SPLIT}.json is a list of dictionaries of the form below; a small sketch for loading and sanity-checking the generated files follows these format descriptions.
[
{
"context": INPUT_DOC_TXT_0,
"question": GEN_QUESTION_TXT_0,
"answers": {
"text": [ ANSWER_TXT ], # only one answer per question
"answer_start": [ ANSWER_ST_CHAR ]
# index of the starting character of the answer txt
}
},
{
...
},
]
- The output file rgx_{$DATA_SPLIT}.json is a list of document-QA mappings,
[
[
$DOCUMENT_i,
$ANS2ITEM_LIST_i,
$GEN_QA_LIST_i
],
...
]
$DOCUMENT_i has the same format as the input file. $ANS2ITEM_LIST_i is the meta-data of all recognized answers and generated questions; note that one answer can have multiple questions, and the generated questions may or may not be correct. The final output of the model is $GEN_QA_LIST_i, a list of dictionaries of generated QA pairs based on the input document,
[
{
"question": GEN_QUESTION_TXT_0,
"answers": {
"text": [ ANSWER_TXT ],
"answer_start": [ ANSWER_ST_CHAR ]
}
}
]
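Once generation finishes, both output files can be loaded with plain json. Below is a minimal sanity-check sketch; the paths refer to the example SQuAD split above, and it assumes each $DOCUMENT entry is a {"context": ...} dictionary as in the input file.

```python
# Hypothetical sketch: load and sanity-check the generated files for split 0.
import json

# 1) Check that every answer_start offset points at the answer text.
with open("data_gen/squad/qa_train_corpus_0.json", encoding="utf-8") as f:
    corpus = json.load(f)

misaligned = 0
for ex in corpus:
    answer = ex["answers"]["text"][0]
    start = ex["answers"]["answer_start"][0]
    if ex["context"][start:start + len(answer)] != answer:
        misaligned += 1
print(f"{len(corpus)} training examples, {misaligned} with misaligned answer spans")

# 2) Count generated QA pairs per document in the document-QA mappings.
with open("data_gen/squad/rgx_0.json", encoding="utf-8") as f:
    rgx = json.load(f)

for document, ans2item_list, gen_qa_list in rgx:
    preview = document["context"][:60].replace("\n", " ")  # assumes a {"context": ...} dict
    print(f"{len(gen_qa_list):4d} QA pairs for: {preview}...")
```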
- Run the following command, or manually create the directories under data/ and data_gen/,
bash new_dataset.sh $NEW_DATASET_NAME
- Move the input file containing the target documents to data/$NEW_DATASET_NAME/doc_data_0.json. The format is described in the previous section (a small sketch for building this file is shown after these steps).
- Run the following command,
python rgx_doc.py \
--dataset_name $NEW_DATASET_NAME \
--data_split 0 \
--output_dir tmp/rgx \
--version_2_with_negative
The generated files will be stored at data_gen/{$NEW_DATASET_NAME}/.
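For reference, the following is a minimal sketch of building such an input file from a folder of plain-text documents; the dataset name and the raw_texts/ folder are hypothetical.

```python
# Hypothetical sketch: build data/$NEW_DATASET_NAME/doc_data_0.json from plain-text files.
# Each file becomes one {"context": ...} entry, following the input format above.
import json
from pathlib import Path

dataset_name = "my_dataset"   # hypothetical dataset name
raw_dir = Path("raw_texts")   # hypothetical folder of .txt documents

docs = [
    {"context": path.read_text(encoding="utf-8").strip()}
    for path in sorted(raw_dir.glob("*.txt"))
]

out_path = Path("data") / dataset_name / "doc_data_0.json"
out_path.parent.mkdir(parents=True, exist_ok=True)
out_path.write_text(json.dumps(docs, ensure_ascii=False, indent=2), encoding="utf-8")
```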
We suggest two approaches for QA fine-tuning with the generated QA pairs.
- Secondary pretraining: Fine-tune the QA model on the synthetic corpus first, then fine-tune it on SQuAD. The model can be evaluated on different domains.
- Model mixing: Fine-tune two models, one on the generated corpus and one on SQuAD, and average all weights of the two models using the mix_model.py script with
python mix_model.py $MIX_RATE $SQUAD_MODEL_PATH $RGX_MODEL_PATH
for example,
python mix_model.py 0.5 model_ft_file/ext_sq.pt model_ft_file/ext_rgx.pt
The output model will be stored as model_ft_file/ext_mixed.pt.
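Conceptually, the mixing step is a linear interpolation of the two checkpoints' parameters. The block below is only an illustrative sketch of such averaging, not the repo's mix_model.py; it assumes both .pt files hold state dicts with matching keys, and that $MIX_RATE weights the SQuAD model.

```python
# Illustrative sketch of weight averaging between two QA checkpoints.
# Assumptions: both files contain state dicts with identical keys and shapes,
# and mix_rate weights the SQuAD model (1 - mix_rate weights the RGX model).
import sys
import torch

mix_rate = float(sys.argv[1])                        # e.g. 0.5
squad_state = torch.load(sys.argv[2], map_location="cpu")
rgx_state = torch.load(sys.argv[3], map_location="cpu")

mixed_state = {}
for name, squad_param in squad_state.items():
    rgx_param = rgx_state[name]
    if torch.is_floating_point(squad_param):
        mixed_state[name] = mix_rate * squad_param + (1.0 - mix_rate) * rgx_param
    else:
        mixed_state[name] = squad_param              # copy integer buffers as-is

torch.save(mixed_state, "model_ft_file/ext_mixed.pt")
```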
Please contact the first author, Hongyin Luo (hyluo at mit dot edu), if there are any questions. If our system is applied in your work, please cite our paper:
@article{luo2021cooperative,
title={Cooperative self-training machine reading comprehension},
  author={Luo, Hongyin and Li, Shang-Wen and Gao, Mingye and Yu, Seunghak and Glass, James},
journal={arXiv preprint arXiv:2103.07449},
year={2021}
}