This repository contains the code for the EMNLP 2020 long paper "An Unsupervised Joint System for Text Generation from Knowledge Graphs and Semantic Parsing".
If this code is useful for your work, please consider citing:
@inproceedings{schmitt-etal-2020-unsupervised,
    title = "An Unsupervised Joint System for Text Generation from Knowledge Graphs and Semantic Parsing",
    author = {Schmitt, Martin and
      Sharifzadeh, Sahand and
      Tresp, Volker and
      Sch{\"u}tze, Hinrich},
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.577",
    doi = "10.18653/v1/2020.emnlp-main.577",
    pages = "7117--7130"
}
The main dependency is AllenNLP. The code is currently only compatible with the older version 0.9.0.
You can install all the requirements from requirements.txt.
If you are using conda, create a new environment and activate it:
conda create --name kgtxt python=3.6
conda activate kgtxt
If you are not using conda, a virtual environment can also be created and activated like this:
virtualenv --python=3.6 env
source env/bin/activate
Then go to the project folder and install the requirements:
pip install -r requirements.txt
We have noticed that the software environment necessary to recreate our experiments causes an error during inference, namely when an argmax is computed over a bool tensor on the GPU.
Fortunately, this error is easy to fix.
If you installed AllenNLP in a virtual environment stored in the directory env as described above, then you have to change line 272 in the file env/lib/python3.6/site-packages/allennlp/models/encoder_decoders/copynet_seq2seq.py
from
first_match = ((matches.cumsum(-1) == 1) * matches).argmax(-1)
to
first_match = ((matches.cumsum(-1) == 1) * matches).float().argmax(-1)
If you installed the software requirements according to our requirements.txt, we recommend making this change before attempting to run any experiment.
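As a convenience, a one-line GNU sed command can apply the same change, assuming AllenNLP is installed in the env directory as above (double-check that line 272 is still the line in question before running):
sed -i '272s/)\.argmax(-1)/).float().argmax(-1)/' env/lib/python3.6/site-packages/allennlp/models/encoder_decoders/copynet_seq2seq.py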
The VG benchmark can be downloaded from here. Please read the dataset README for more details on the data format.
VG is licensed under a Creative Commons Attribution 4.0 International License.
We have included the dataset in the dataset folder. The files were originally downloaded from here.
The WebNLG data are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
In case you want to retrace our preprocessing, we ran LC_ALL=C.UTF-8 LANG=C.UTF-8 python src/data/format_webnlg.py input_file.json output_file.tsv
to preprocess each json file and convert it to a tsv file.
(input_file.json and output_file.tsv are to be replaced with the corresponding files in the dataset folder, e.g., test.tsv, train.tsv, val.tsv, etc.)
Depending on your system configuration, you might not need to explicitly enforce LC_ALL=C.UTF-8 LANG=C.UTF-8.
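For example, assuming the raw WebNLG json file for the validation split is stored as data/webNLG/val.json (the exact paths are an assumption and have to be adapted to your setup), the corresponding call would be:
LC_ALL=C.UTF-8 LANG=C.UTF-8 python src/data/format_webnlg.py data/webNLG/val.json data/webNLG/val.tsv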
Before any model can be trained, the vocabulary has to be created with the script src/data/make_vocab.py.
For WebNLG, e.g., create the folder with mkdir -p vocabularies/webNLG/ if it is not already present.
Then you can run:
python -m src.data.make_vocab data/webNLG/train.tsv data/webNLG/val-pairs.tsv vocabularies/webNLG
The process is analogous for VG with vocabularies/vg.
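Assuming the VG data files follow the same naming scheme in a data/vg directory (these paths are an assumption and have to be adapted to where you stored the VG files), the corresponding command would be:
python -m src.data.make_vocab data/vg/train.tsv data/vg/val-pairs.tsv vocabularies/vg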
An unsupervised model is trained with the following template:
CUDA_VISIBLE_DEVICES=<gpu_num> src/unsupervised.sh <model_dir> experiments/<path_to_experiment_dir>/unsupervised-lm.json experiments/<path_to_experiment_dir>/unsupervised-iter1.json <path_to_tsv_train_file> experiments/<backtranslation_config> <num_epochs> <vocabulary_dir>
where
- <gpu_num> is the id of the GPU you want to use for training (nonnegative integer).
- <model_dir> is the directory where intermediate model weights and backtranslated data will be stored. As the backtranslations from iteration i will be used as input in iteration i+1, this <model_dir> has to be the same as specified in the unsupervised-iter1.json to be used. You have to adapt that file or create the default directory hierarchy in models before running the command.
- <path_to_experiment_dir> is the location of the json configuration files to be used in the particular experiment. Each subdirectory of experiments contains one json file for the language modeling epoch (ending in -lm.json) and one json file that will be used for all other epochs (ending in -iter1.json). Both have to be specified here.
- <path_to_tsv_train_file> is the training file as obtained from the data preprocessing scripts.
- <backtranslation_config> is the configuration file used for backtranslation, i.e., unsupervised-bt-webNLG.jsonnet for webNLG or unsupervised-bt-VG.jsonnet for VG.
- <num_epochs> specifies the number of epochs/iterations for training. VG models were trained for 5 iterations and webNLG models for 30 iterations.
- <vocabulary_dir> is the directory where the vocabulary was stored with make_vocab.py; in the example above it is vocabularies/webNLG.
Let's say that you want to train webNLG with composed noise.
First, make sure that the parameters in experiments/webNLG-composed/unsupervised-iter1.json and experiments/webNLG-composed/unsupervised-lm.json are set correctly. Specifically, the directory_path of the vocabularies and the train_data_path should point to the corresponding directories. Note that train_data_path contains a json dictionary (with escaped special characters) to include different training files for the denoising loss and the backtranslation loss.
Now run:
CUDA_VISIBLE_DEVICES=0 LC_ALL=C.UTF-8 LANG=C.UTF-8 src/unsupervised.sh models/webNLG-composed experiments/webNLG-composed/unsupervised-lm.json experiments/webNLG-composed/unsupervised-iter1.json data/webNLG/train.tsv experiments/unsupervised-bt-webNLG.jsonnet 30 vocabularies/webNLG
(Note: Make sure that (1) you can run the script, e.g., run chmod +x src/unsupervised.sh, and that (2) the model directory exists, e.g., run mkdir -p models/webNLG-composed.)
Supervised training uses the standard command from AllenNLP:
CUDA_VISIBLE_DEVICES=<gpu_num> allennlp train experiments/<path_to_config_file> -s <model_dir> --include-package src.features.copynet_shared_decoder --include-package src.models.copynet_shared_decoder
with
- <gpu_num> the id of the GPU to be used as above.
- <path_to_config_file> the location of the config file for supervised training. You will have to adapt train_data_path, validation_data_path and directory_path to the correct paths in your system.
- <model_dir> the directory in which all results such as metrics and model weights will be stored.
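For instance, a concrete call could look like the following, where experiments/supervised-webNLG.json is only a placeholder for whichever supervised config file from the experiments directory you actually want to train:
CUDA_VISIBLE_DEVICES=0 allennlp train experiments/supervised-webNLG.json -s models/supervised-webNLG --include-package src.features.copynet_shared_decoder --include-package src.models.copynet_shared_decoder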
Let's first create directories for predictions:
mkdir -p predictions/text2graph predictions/graph2text
The evaluation has two steps:
- Generate the predictions
- Compare the predictions to the ground truth
For step 1, we can either batch-produce predictions from every iteration (with the script src/produce_predictions.sh) or produce predictions from a specific model (with src/analysis/predict.py).
Batch-predictions work like this:
CUDA_VISIBLE_DEVICES=<gpu_num> src/produce_predictions.sh <model_dir> experiments/unsupervised-bt-webNLG.jsonnet <dataset> <predictions_dir>
where <dataset> is one of the -text2graph.tsv or -graph2text.tsv files for evaluation.
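For example, assuming the preprocessed evaluation file is data/webNLG/test-graph2text.tsv and the model directory from the training example above is used (both paths are assumptions and have to be adapted to your setup), a batch prediction run would be:
CUDA_VISIBLE_DEVICES=0 src/produce_predictions.sh models/webNLG-composed experiments/unsupervised-bt-webNLG.jsonnet data/webNLG/test-graph2text.tsv predictions/graph2text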
Predictions from a single model are obtained like this:
CUDA_VISIBLE_DEVICES=<gpu_num> python3 -m src.analysis.predict <config_file> path/to/model/best.th <dataset> hypo.txt ref.txt <batch_size>
where <config_file> is unsupervised-bt-webNLG.jsonnet for webNLG or unsupervised-bt-VG.jsonnet for VG, and <batch_size> is any positive integer (higher values need more memory but usually speed up decoding).
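A concrete call could look like this, assuming the best webNLG model weights were saved as models/webNLG-composed/best.th and using a batch size of 32 (the paths and the batch size are assumptions and have to be adapted):
CUDA_VISIBLE_DEVICES=0 python3 -m src.analysis.predict experiments/unsupervised-bt-webNLG.jsonnet models/webNLG-composed/best.th data/webNLG/test-graph2text.tsv hypo.txt ref.txt 32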
For step 2, we can batch-evaluate a directory of predictions with evaluate_dir.sh like this:
src/evaluate_dir.sh <predictions_dir> <eval_command>
where <eval_command> can either be 'python3 -m src.analysis.multi-f1' for text-to-graph or 'python3 -m src.bleu.multi-bleu' for graph-to-text.
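For example, assuming the graph-to-text predictions were written to the predictions/graph2text directory created above, they can all be scored with:
src/evaluate_dir.sh predictions/graph2text 'python3 -m src.bleu.multi-bleu'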
Alternatively, we can directly apply one of the two evaluation commands like this:
python3 -m src.analysis.multi-f1 hypo.txt ref.txt
Note: For evaluation on val100, you can add the flag --val100 to either of the two evaluation commands, e.g., python3 -m src.bleu.multi-bleu --val100 hypo.txt ref.txt.