This is the official codebase for the following paper, implemented in PyTorch:
Hareesh Bahuleyan and Layla El Asri. Diverse Keyphrase Generation with Neural Unlikelihood Training. COLING 2020. https://arxiv.org/pdf/2010.07665.pdf
- Create and activate a Python 3.7.5 virtual environment using `conda`:

  ```bash
  conda create --name keygen python=3.7.5
  source activate keygen
  ```
- Install the necessary packages using `pip`:

  ```bash
  pip install -r requirements.txt
  # Download spacy model
  python -m spacy download en_core_web_sm
  ```
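To confirm that the spacy model is available, a quick optional check (a minimal sketch, not part of the repo's scripts):

```python
import spacy

# Load the model downloaded above and tokenize a sample sentence
nlp = spacy.load("en_core_web_sm")
print([token.text for token in nlp("Diverse keyphrase generation with unlikelihood training.")])
```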
**Sent2Vec Installation**

Sent2Vec is used in the evaluation script. Please install sent2vec from https://github.com/epfml/sent2vec using the steps below:
- Clone/download the repository:

  ```bash
  git clone https://github.com/epfml/sent2vec
  ```

- Go to the sent2vec directory and check out the pinned commit:

  ```bash
  cd sent2vec/
  git checkout f827d014a473aa22b2fef28d9e29211d50808d48
  ```
- Run `make`
- Run `pip install cython`
- Inside the `src/` folder:

  ```bash
  cd src/
  python setup.py build_ext
  pip install .
  ```
- Download a pre-trained sent2vec model. For example, we used `sent2vec_wiki_unigrams`. Finally, copy it to `data/sent2vec/wiki_unigrams.bin`. A quick way to verify the installation is sketched after this list.
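If the installation succeeded, loading the model should work (a minimal sketch, assuming the model was copied to the path above):

```python
import sent2vec

# Load the pre-trained unigram model downloaded above
model = sent2vec.Sent2vecModel()
model.load_model("data/sent2vec/wiki_unigrams.bin")

# Embed a sample sentence; returns a numpy array of shape (1, embedding_dim)
emb = model.embed_sentence("diverse keyphrase generation")
print(emb.shape)
```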
**Data Download**

Download the pre-processed data files in JSON format by visiting this link. Unzip the file and copy it to `data/`.
The `data/` folder should now have the following structure:

```
data/
├── kp20k_sorted/
├── KPTimes/
│   └── kptimes_sorted/
├── sample_testset/
├── sent2vec/
│   └── wiki_unigrams.bin
└── stackexchange/
    └── se_sorted/
```
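To take a quick look at the data, something along these lines works (a hypothetical snippet: the exact file names and JSON fields depend on the downloaded archive):

```python
import json

# Hypothetical file name -- substitute whichever split you downloaded
with open("data/kp20k_sorted/train.json") as f:
    record = json.loads(f.readline())  # assumes one JSON object per line

# Print the available fields rather than assuming a schema
print(record.keys())
```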
To train a DivKGen model using one of the configurations provided under `configurations/`:
```bash
# Specify the dataset
export DATASET=kp20k
# Specify the configuration name
export EXP=copy_seq2seq_attn_mle_greedy.tgt_15.0.copy_18.0
# Run the training script
allennlp train configurations/$DATASET/$EXP.jsonnet -s output/$DATASET/$EXP/ -f \
    --include-package keyphrase_generation \
    -o '{ "trainer": {"cuda_device": 0} }'
```
The outputs (training logs, model checkpoints, tensorboard logs) will be stored under `output/$DATASET/$EXP`.
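As a sanity check after training, the archived model can be loaded back in Python (a sketch assuming the allennlp 0.9-style layout used here, where `allennlp train` writes `model.tar.gz` into the serialization directory):

```python
from allennlp.common.util import import_submodules
from allennlp.models.archival import load_archive

# Register this repo's custom components before unarchiving
import_submodules("keyphrase_generation")

# Path follows the output layout above: output/$DATASET/$EXP/model.tar.gz
archive = load_archive(
    "output/kp20k/copy_seq2seq_attn_mle_greedy.tgt_15.0.copy_18.0/model.tar.gz",
    cuda_device=-1,
)
print(type(archive.model).__name__)
```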
Notes:
- If your loss collapses to NaN during training, this could be due to numerical underflow. The way to fix this is to edit the function `masked_log_softmax()` in `path/to/conda/envs/keygen/lib/python3.7/site-packages/allennlp/nn/utils.py` and change the line `vector = vector + (mask + 1e-45).log()` to `vector = vector + (mask + 1e-35).log()`.
- Similarly, find and replace all instances of `1e-45` in `path/to/conda/envs/keygen/lib/python3.7/site-packages/allennlp/models/encoder_decoders/copynet_seq2seq.py` with `1e-35`. (A short illustration of why this helps appears after these notes.)
- During validation after every epoch, if it throws a type mismatch error (`RuntimeError: "argmax_cuda" not implemented for 'Bool'`), this can be fixed by explicit type casting: in `path/to/conda/envs/keygen/lib/python3.7/site-packages/allennlp/models/encoder_decoders/copynet_seq2seq.py`, change the line `matches = (expanded_source_token_ids == expanded_target_token_ids)` to `matches = (expanded_source_token_ids == expanded_target_token_ids).int()`.
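To see why the `1e-45` constant causes trouble, here is a minimal float32 demonstration (our reading of the underflow issue, not code from this repo): the gradient of `log(x)` is `1/x`, and `1/1e-45` overflows float32, which can then surface as NaN in the loss.

```python
import torch

# d/dx log(x) = 1/x: at 1e-45 this overflows float32 to inf,
# while at 1e-35 it stays finite (1e35 < float32 max of ~3.4e38)
for eps in (1e-45, 1e-35):
    x = torch.tensor(eps, requires_grad=True)
    x.log().backward()
    print(eps, x.grad)
```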
Finally, the evaluation script can be run as follows:
- Go to `run_eval.sh` and set the `HOME_PATH` variable. This corresponds to the `absolute/path/to/keyphrase-generation/` folder.
- Set the datasets. For instance, if we set both `EVALSET` and `DATASET` to `kp20k`, then we use the best model trained on `kp20k` to evaluate on `kp20k`. Setting them to different values is useful when you would like to evaluate a model trained on Dataset A on Dataset B.
- Next, running `bash run_eval.sh` will print the quality and diversity results and also save them to `output/$DATASET/$EXP`.
Note: In the paper, we present EditDist as a diversity evaluation metric, for which we initially used a different fuzzy string matcher. However, this codebase uses an alternative library, `rapidfuzz`, which offers similar functionality.
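For intuition, an EditDist-style diversity score can be sketched with `rapidfuzz` as follows (a hedged sketch, not the exact implementation used by the evaluation script):

```python
from itertools import combinations
from rapidfuzz import fuzz

def avg_pairwise_similarity(keyphrases):
    """Average pairwise fuzzy-match similarity (0-100) among generated
    keyphrases; lower values indicate a more diverse set."""
    pairs = list(combinations(keyphrases, 2))
    if not pairs:
        return 0.0
    # fuzz.ratio is a normalized Levenshtein-based similarity in [0, 100]
    return sum(fuzz.ratio(a, b) for a, b in pairs) / len(pairs)

print(avg_pairwise_similarity(["neural network", "neural networks", "graph theory"]))
```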
If you found this code useful in your research, please cite:
```bibtex
@inproceedings{divKeyGen2020,
  title={Diverse Keyphrase Generation with Neural Unlikelihood Training},
  author={Bahuleyan, Hareesh and El Asri, Layla},
  booktitle={Proceedings of the 28th International Conference on Computational Linguistics (COLING)},
  year={2020}
}
```