This is the official codebase for the following paper, implemented in PyTorch:
Hareesh Bahuleyan and Layla El Asri. Diverse Keyphrase Generation with Neural Unlikelihood Training. COLING 2020. https://arxiv.org/pdf/2010.07665.pdf
- Create and activate a Python 3.7.5 virtual environment using `conda`:

  ```bash
  conda create --name keygen python=3.7.5
  source activate keygen
  ```
- Install the necessary packages using `pip`:

  ```bash
  pip install -r requirements.txt
  # Download spacy model
  python -m spacy download en_core_web_sm
  ```
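To confirm that the spacy model is available, a quick optional check (a minimal sketch, not part of the repo's scripts):

```python
import spacy

# Load the model downloaded above and tokenize a sample sentence
nlp = spacy.load("en_core_web_sm")
print([token.text for token in nlp("Diverse keyphrase generation with unlikelihood training.")])
```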
**Sent2Vec Installation**

Sent2Vec is used in the evaluation script. Please install sent2vec from https://github.com/epfml/sent2vec using the steps below:
- Clone/download the repository:

  ```bash
  git clone https://github.com/epfml/sent2vec
  ```

- Go to the sent2vec directory and check out the pinned commit:

  ```bash
  cd sent2vec/
  git checkout f827d014a473aa22b2fef28d9e29211d50808d48
  ```
- Run `make`
- Run `pip install cython`
- Inside the `src/` folder:

  ```bash
  cd src/
  python setup.py build_ext
  pip install .
  ```
- Download a pre-trained sent2vec model. For example, we used `sent2vec_wiki_unigrams`. Finally, copy it to `data/sent2vec/wiki_unigrams.bin`. A quick way to verify the installation is sketched after this list.
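If the installation succeeded, loading the model should work (a minimal sketch, assuming the model was copied to the path above):

```python
import sent2vec

# Load the pre-trained unigram model downloaded above
model = sent2vec.Sent2vecModel()
model.load_model("data/sent2vec/wiki_unigrams.bin")

# Embed a sample sentence; returns a numpy array of shape (1, embedding_dim)
emb = model.embed_sentence("diverse keyphrase generation")
print(emb.shape)
```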
**Data Download**

Download the pre-processed data files in JSON format by visiting this link. Unzip the file and copy it to `data/`.
The `data/` folder should now have the following structure:

```
data/
├── kp20k_sorted/
├── KPTimes/
│   └── kptimes_sorted/
├── sample_testset/
├── sent2vec/
│   └── wiki_unigrams.bin
└── stackexchange/
    └── se_sorted/
```
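To take a quick look at the data, something along these lines works (a hypothetical snippet: the exact file names and JSON fields depend on the downloaded archive):

```python
import json

# Hypothetical file name -- substitute whichever split you downloaded
with open("data/kp20k_sorted/train.json") as f:
    record = json.loads(f.readline())  # assumes one JSON object per line

# Print the available fields rather than assuming a schema
print(record.keys())
```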
To train a DivKGen model using one of the configurations provided under `configurations/`:
```bash
# Specify the dataset
export DATASET=kp20k
# Specify the configuration name
export EXP=copy_seq2seq_attn_mle_greedy.tgt_15.0.copy_18.0
# Run the training script
allennlp train configurations/$DATASET/$EXP.jsonnet -s output/$DATASET/$EXP/ -f \
    --include-package keyphrase_generation \
    -o '{ "trainer": {"cuda_device": 0} }'
```
The outputs (training logs, model checkpoints, tensorboard logs) will be stored under `output/$DATASET/$EXP`.
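As a sanity check after training, the archived model can be loaded back in Python (a sketch assuming the allennlp 0.9-style layout used here, where `allennlp train` writes `model.tar.gz` into the serialization directory):

```python
from allennlp.common.util import import_submodules
from allennlp.models.archival import load_archive

# Register this repo's custom components before unarchiving
import_submodules("keyphrase_generation")

# Path follows the output layout above: output/$DATASET/$EXP/model.tar.gz
archive = load_archive(
    "output/kp20k/copy_seq2seq_attn_mle_greedy.tgt_15.0.copy_18.0/model.tar.gz",
    cuda_device=-1,
)
print(type(archive.model).__name__)
```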
Notes:
- If your loss collapses to NaN during training, this could be due to numerical underflow. The way to fix this is to edit the function `masked_log_softmax()` in `path/to/conda/envs/keygen/lib/python3.7/site-packages/allennlp/nn/utils.py` and change the line `vector = vector + (mask + 1e-45).log()` to `vector = vector + (mask + 1e-35).log()`.
- Similarly, find and replace all instances of `1e-45` in `path/to/conda/envs/keygen/lib/python3.7/site-packages/allennlp/models/encoder_decoders/copynet_seq2seq.py` with `1e-35`. (A short illustration of why this helps appears after these notes.)
- During validation after every epoch, if it throws a type mismatch error (`RuntimeError: "argmax_cuda" not implemented for 'Bool'`), this can be fixed by explicit type casting: in `path/to/conda/envs/keygen/lib/python3.7/site-packages/allennlp/models/encoder_decoders/copynet_seq2seq.py`, change the line `matches = (expanded_source_token_ids == expanded_target_token_ids)` to `matches = (expanded_source_token_ids == expanded_target_token_ids).int()`.
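To see why the `1e-45` constant causes trouble, here is a minimal float32 demonstration (our reading of the underflow issue, not code from this repo): the gradient of `log(x)` is `1/x`, and `1/1e-45` overflows float32, which can then surface as NaN in the loss.

```python
import torch

# d/dx log(x) = 1/x: at 1e-45 this overflows float32 to inf,
# while at 1e-35 it stays finite (1e35 < float32 max of ~3.4e38)
for eps in (1e-45, 1e-35):
    x = torch.tensor(eps, requires_grad=True)
    x.log().backward()
    print(eps, x.grad)
```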
Finally, the evaluation script can be run as follows:
- Go to `run_eval.sh` and set the `HOME_PATH` variable. This corresponds to the `absolute/path/to/keyphrase-generation/` folder.
- Set the datasets. For instance, if we set both `EVALSET` and `DATASET` to `kp20k`, then we use the best model trained on `kp20k` to evaluate on `kp20k`. Setting them to different values is useful when you would like to evaluate a model trained on Dataset A on Dataset B.
- Next, running `bash run_eval.sh` will print the quality and diversity results and also save them to `output/$DATASET/$EXP`.
Note: In the paper, we present EditDist as a diversity evaluation metric, for which we initially used a different fuzzy string matcher. However, this codebase uses an alternative library, `rapidfuzz`, which offers similar functionality.
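For intuition, an EditDist-style diversity score can be sketched with `rapidfuzz` as follows (a hedged sketch, not the exact implementation used by the evaluation script):

```python
from itertools import combinations
from rapidfuzz import fuzz

def avg_pairwise_similarity(keyphrases):
    """Average pairwise fuzzy-match similarity (0-100) among generated
    keyphrases; lower values indicate a more diverse set."""
    pairs = list(combinations(keyphrases, 2))
    if not pairs:
        return 0.0
    # fuzz.ratio is a normalized Levenshtein-based similarity in [0, 100]
    return sum(fuzz.ratio(a, b) for a, b in pairs) / len(pairs)

print(avg_pairwise_similarity(["neural network", "neural networks", "graph theory"]))
```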
If you found this code useful in your research, please cite:
```bibtex
@inproceedings{divKeyGen2020,
  title={Diverse Keyphrase Generation with Neural Unlikelihood Training},
  author={Bahuleyan, Hareesh and El Asri, Layla},
  booktitle={Proceedings of the 28th International Conference on Computational Linguistics (COLING)},
  year={2020}
}
```