CascadER: Cross-Modal Cascading for Knowledge Graph Link Prediction (arXiv 22)

Notifications You must be signed in to change notification settings

tsafavi/cascader

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

README.md

This repository contains the data and PyTorch implementation of the arXiv submission CascadER: Cross-Modal Cascading for Knowledge Graph Link Prediction by Tara Safavi, Doug Downey, and Tom Hope.

If you use our work, please cite us as follows:

@article{safavi2022cascader,
  title={CascadER: Cross-Modal Cascading for Knowledge Graph Link Prediction},
  author={Safavi, Tara and Downey, Doug and Hope, Tom},
  journal={arXiv preprint arXiv:2205.08012},
  year={2022}
}

Quick start

Run the following to set up your virtual environment and install the Python requirements:

python3.7 -m venv myenv
source myenv/bin/activate
pip install -r requirements.txt

To set up a dataset, e.g., RepoDB:

cd data
unzip repodb.zip

This creates the data/repodb/ directory, which contains entity and relation ID files, entity and relation text files, and train/dev/test triple files.
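
To verify the extraction, you can list the contents of the new directory (the exact file names vary by dataset):

ls data/repodb/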

Download models

To download the three pretrained models (KGE, bi-encoder, and cross-encoder) for a given dataset, use the following:

chmod u+x download_models.sh
./download_models.sh <dataset_name>

For example, the command ./download_models.sh repodb will download a zip archive and extract the following files:

  • out/repodb/kge.ckpt
  • out/repodb/biencoder.ckpt
  • out/repodb/crossencoder.ckpt

Note that the model files can be very large for the larger datasets, up to 7 GB for FB15K-237, because they store all of the query/answer scores for the validation and test sets.
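
If disk space is a concern, you can check the size of the downloaded checkpoints before running any cascades, for example:

du -h out/repodb/*.ckpt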

Run cascades with pretrained models

No pruning

To run a full 3-stage cascade without any pruning, use the following:

chmod u+x cascade_full.sh
./cascade_full.sh <dataset_name>
  • This will first run Tier 1 reranking (KGE + bi-encoder), searching over the optimal weighting of the two models' scores in 10 trials. The results of the best trial from Tier 1 will be saved to out/<dataset_name>/t1/checkpoints/checkpoint_best.pt.
  • Next, this will run Tier 2 reranking (Tier 1 output + cross-encoder), again searching over the optimal weighting of the two sets of scores in 10 trials. The results of the best trial from Tier 2 will be saved to out/<dataset_name>/t2/checkpoints/checkpoint_best.pt.
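
For example, an end-to-end run on RepoDB with the pretrained models, assuming data/repodb/ has already been set up as in the Quick start:

./download_models.sh repodb   # fetches out/repodb/{kge,biencoder,crossencoder}.ckpt
chmod u+x cascade_full.sh
./cascade_full.sh repodb
# Tier 1 best trial -> out/repodb/t1/checkpoints/checkpoint_best.pt
# Tier 2 best trial -> out/repodb/t2/checkpoints/checkpoint_best.pt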

With pruning

To run a 3-stage cascade with pruning between Tier 1 and Tier 2, use the following:

chmod u+x cascade_pruned.sh
./cascade_pruned.sh <dataset_name>
  • This will first run Tier 1 reranking (KGE + bi-encoder), searching over the optimal weighting of the two models' scores in 10 trials (same as above). The results of the best trial from Tier 1 will be saved to out/<dataset_name>/t1/checkpoints/checkpoint_best.pt.
  • Next, this will run an Answer Selector job in which we predict the number of answers to rerank for each query. The results of answer selection will be saved to out/<dataset_name>/t1_prune/checkpoints/checkpoint_best.pt.
  • Finally, this will run pruned Tier 2 reranking (Tier 1 output + cross-encoder over Answer Selector outputs only), again searching over the optimal weighting of the two sets of scores in 10 trials. The results of the best trial from pruned Tier 2 will be saved to out/<dataset_name>/t2_prune/checkpoints/checkpoint_best.pt.
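
The pruned counterpart on RepoDB, again assuming the dataset and pretrained models are already in place:

chmod u+x cascade_pruned.sh
./cascade_pruned.sh repodb
# Tier 1 best trial  -> out/repodb/t1/checkpoints/checkpoint_best.pt
# Answer Selector    -> out/repodb/t1_prune/checkpoints/checkpoint_best.pt
# Pruned Tier 2 best -> out/repodb/t2_prune/checkpoints/checkpoint_best.pt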

Run new jobs

All jobs are implemented using the PyTorch Lightning API.

To run a job, use the following command:

python src/main.py <path_to_config_file>

Each job requires a path to a YAML configuration file. The file src/config.py provides default configuration options for job outputs, model training hyperparameters, etc. You can set or overwrite these options in individual config files.

Training config example

Here is an example of a config file that trains a cross-encoder BERT-Base LM on the CoDEx-S dataset and evaluates the model on the validation and test sets:

do-checkpoint: True  # by default False, set to True if you want to save model weights and ranking outputs
job-modes:
  - train  # remove if you want to evaluate the model only
  - test
dataset:
  name: codex-s  # if custom, you must provide the corresponding dataset in the data/ directory
  num_entities: 2034
  num_relations: 42
  text:
    subj_repr:  # concatenate ‘name’ and ‘extract’ columns from codex-s entity file for subject entity description
      - name
      - extract
    obj_repr:
      - name
      - extract
  splits:
    test:  # get model prediction scores on validation and test splits
      - valid
      - test
train:
  model_type: crossencoder
  batch_size: 16
  max_epochs: 5
  use_bce_loss: True
  use_margin_loss: True
  use_relation_cls_loss: True
  lr: 1.0e-5
  margin: 1
  negative_samples:
    num_neg_per_pos: 2
lm:
  model_name: bert-base-uncased
  max_length: 128
eval:
  batch_size: 16
  check_val_every_n: 5
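
To launch this job, save the config to a YAML file and pass its path to main.py; the file name below is only an illustrative choice, not one shipped with the repository:

python src/main.py configs/codex-s-crossencoder.yaml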

Model selection config example

To run a job and select a model over a specified set of hyperparameters, add the --search flag to your job invocation as follows:

python src/main.py <path_to_config_file> --search

Here is an example of a config file that trains a cross-encoder BERT-Base LM on the CoDEx-S dataset and evaluates the model on the validation and test sets, searching for the optimal learning rate, margin, and number of negative samples over 5 trials:

do-checkpoint: True  # by default False, set to True if you want to save model weights and ranking outputs
job-modes:
  - train  # remove if you want to evaluate the model only
  - test
dataset:
  name: codex-s  # if custom, you must provide the corresponding dataset in the data/ directory
  num_entities: 2034
  num_relations: 42
  text:
    subj_repr:  # concatenate ‘name’ and ‘extract’ columns from codex-s entity file for subject entity description
      - name
      - extract
    obj_repr:
      - name
      - extract
  splits:
    test:  # get model prediction scores on validation and test splits
      - valid
      - test
train:
  model_type: crossencoder
  batch_size: 16
  max_epochs: 5
  use_bce_loss: True
  use_margin_loss: True
  use_relation_cls_loss: True
  lr: 1.0e-5
  margin: 1
  negative_samples:
    num_neg_per_pos: 2
lm:
  model_name: bert-base-uncased
  max_length: 128
eval:
  batch_size: 16
  check_val_every_n: 5
search:
  num_trials: 5
  parameters:
  - name: train.lr
    type: choice
    value_type: float
    values:
    - 1e-5
    - 2e-5
    - 3e-5
  - name: train.margin
    type: range
    value_type: int
    bounds:
    - 1
    - 10
  - name: train.negative_samples.num_neg_per_pos
    type: range
    value_type: int
    bounds:
    - 1
    - 5
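
As before, save the config to a file of your choosing (the path below is hypothetical) and launch it with the --search flag so that the 5 trials are actually executed:

python src/main.py configs/codex-s-crossencoder-search.yaml --search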

Reranking config example

To run a reranking job over a pair of models and select optimal weights for the two models' scores, use the following:

python src/main.py <path_to_config_file> --search

Here is an example of a reranking job that searches for the optimal additive ensemble weights between a KGE model and a cross-encoder on CoDEx-S:

do-checkpoint: True
job-modes:  # no training since base models are already trained
  - validate  # must include validation to select the optimal weights
  - test
dataset:
  name: codex-s
  num_entities: 2034
  num_relations: 42
train:
  model_type: ensemble
ensemble:
  base_ranker_checkpoint_path: out/codex-s/kge.ckpt
  reranker_checkpoint_path: out/codex-s/crossencoder.ckpt
search:
  parameters:
  - bounds:
    - 0.05
    - 0.95
    name: ensemble.reranker_weight_head_batch
    type: range
    value_type: float
  - bounds:
    - 0.05
    - 0.95
    name: ensemble.reranker_weight_tail_batch
    type: range
    value_type: float
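
Saving the config above to a file (again, the path is only illustrative) and running it with --search will select and evaluate the best-performing ensemble weights:

python src/main.py configs/codex-s-rerank.yaml --search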
