JointGT is a graph-text joint pre-training framework with structure-aware encoding and explicit graph-text alignment. This project is the PyTorch implementation of our work; please refer to our paper for more details.
- Python 3.7
- NumPy
- PyTorch 1.4.0
- Transformers (Huggingface) 3.0.0
- PyTorch Scatter 2.0.4
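As a reference, these dependencies can be installed with pip, for example as below (a minimal sketch; the torch-scatter build usually has to match your PyTorch / CUDA version, so check its installation guide if the last command fails):
pip install numpy
pip install torch==1.4.0
pip install transformers==3.0.0
pip install torch-scatter==2.0.4   # may require a build matching your PyTorch / CUDA version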
NOTE: Before running evaluation, in order to compute the METEOR scores, please download the required data and put it under the following two folders: eval_webnlg/pycocoevalcap/meteor/data/ and eval_wqpq/meteor/data/.
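For example, assuming the required data is the standard METEOR paraphrase table paraphrase-en.gz (the file name is an assumption; check the evaluation scripts for the exact file they expect), the placement would be:
# hypothetical example: copy the downloaded METEOR data file into both folders
# (paraphrase-en.gz is an assumption based on the standard METEOR release)
cp paraphrase-en.gz eval_webnlg/pycocoevalcap/meteor/data/
cp paraphrase-en.gz eval_wqpq/meteor/data/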
Our experiments contain four downstream datasets, i.e., WebNLG(U), WebNLG(C), WebQuestions, and PathQuestions. The raw data of these datasets are from the GitHub repositories of KGPT, WebNLG, and BiGGNN. You can download the pre-processed datasets used in our paper on Google Drive / Tsinghua Cloud.
You can download the checkpoint of our pre-trained model (Google Drive / Tsinghua Cloud), and fine-tune the pre-trained model on four datasets.
bash finetune_jointgt_bart.sh
bash finetune_jointgt_t5.sh
In the scripts, --output_dir denotes the directory to save the fine-tuned model, and --model_path indicates the pre-trained checkpoint used for fine-tuning. You can refer to the fine-tuning codes for the description of other hyper-parameters.
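As an illustration, a typical setup is sketched below; the paths are examples only, and the actual flags live inside the provided scripts:
# a minimal sketch (paths are examples): point the two flags inside the script to your own paths, e.g.
#   --model_path  pretrain_model/jointgt_bart   (downloaded pre-trained JointGT checkpoint)
#   --output_dir  out/jointgt_bart_webnlg       (where the fine-tuned model is saved)
# then run the script as usual
bash finetune_jointgt_bart.sh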
We also provide the inference scripts to directly acquire the generation results on the test sets.
bash infer_jointgt_bart.sh
bash infer_jointgt_t5.sh
In the scripts, --output_dir denotes the directory of the model checkpoint used for inference. The generated results are also saved in this directory.
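For example, a typical workflow is to run inference and then inspect the generated files in that directory (the directory name below is an example, not a required layout):
# run inference, then look for the generated results in the checkpoint directory (path is an example)
bash infer_jointgt_bart.sh
ls out/jointgt_bart_webnlg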
If you want to conduct pre-training by yourself instead of directly using the checkpoint we provide, you should first download the KGTEXT dataset and the corresponding knowledge graphs from the GitHub repository of KGPT. Then, prepare the BART / T5 checkpoint provided by Huggingface Transformers as the initialization of our model.
We provide the scripts for pre-training as follows.
bash pretrain_jointgt_bart.sh
bash pretrain_jointgt_t5.sh
In the scripts, --model_path and --tokenizer_path are set to the downloaded BART / T5 checkpoint. The settings of --train_file, --predict_file and --knowledge_file depend on the directories of the datasets and knowledge graphs from KGPT.
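For instance, the flags might be configured as sketched below; all paths are hypothetical examples and depend on where you store the Huggingface checkpoint and the KGPT data:
# a minimal sketch (paths are examples): set the flags inside the script to your own paths, e.g.
#   --model_path      bart_checkpoint/      (downloaded BART / T5 checkpoint)
#   --tokenizer_path  bart_checkpoint/
#   --train_file      kgtext/train.json     (KGTEXT data from KGPT)
#   --predict_file    kgtext/val.json
#   --knowledge_file  kgtext/knowledge.json (knowledge graphs from KGPT)
# then launch pre-training
bash pretrain_jointgt_bart.sh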
For a fair comparison with existing works, we use the evaluation scripts of KGPT for WebNLG.
cd eval_webnlg
python measure_score.py ${reference_path} ${model_output_path}
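For example, with hypothetical paths (the reference file and the model output depend on where your pre-processed data and inference results are stored):
# example invocation with hypothetical paths
python measure_score.py ../data/webnlg/test.target ../out/jointgt_bart_webnlg/predictions.txt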
As for WebQuestions and PathQuestions, we use the scripts of BiGGNN for evaluation.
cd eval_for_wqpq
python eval.py --src ${source_path} --tgt ${reference_path} --out ${model_output_path}
During evaluation, model_output_path can be set to the file generated by our inference codes, source_path can be set to test.source / src-test.txt in our pre-processed datasets, and reference_path can be set to test.target / tgt-test.txt in our pre-processed datasets. Refer to the original repositories for more details.
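Similarly, a WebQuestions / PathQuestions evaluation might look like the following (all paths are hypothetical examples following the naming above):
# example invocation with hypothetical paths
python eval.py --src ../data/wq/test.source --tgt ../data/wq/test.target --out ../out/jointgt_bart_wq/predictions.txt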
@inproceedings{ke-etal-2021-jointgt,
title = "{J}oint{GT}: Graph-Text Joint Representation Learning for Text Generation from Knowledge Graphs",
author = "Ke, Pei and Ji, Haozhe and Ran, Yu and Cui, Xin and Wang, Liwei and Song, Linfeng and Zhu, Xiaoyan and Huang, Minlie",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
pages = "2526--2538",
}
Please kindly cite our paper if you find this paper and the codes helpful.
Many thanks to the GitHub repositories of Transformers, bart-closed-book-qa and KGPT. Part of our code is modified based on their codes.