To save training time, all models used for the three different datasets are provided in `/data/<dataset_name>/models`, e.g., `/data/deception/models`. BERT parameters should be stored in `data/<dataset_name>/bert_fine_tune`. Please download the folders from this link: https://tinyurl.com/bert-fine-tune-folder. Note that the folders can be huge and may take time to download.
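
A quick way to sanity-check the layout after downloading is a short script like the one below. This is only a sketch based on the paths mentioned above (the commands later in this README use the relative `data/<dataset_name>/...` form); `deception`, `yelp`, and `sst` are the three datasets used in the paper.

```python
import os

# The three datasets used throughout this README.
DATASETS = ["deception", "yelp", "sst"]

for name in DATASETS:
    # Pre-trained models and fine-tuned BERT parameters, as described above.
    for folder in (f"data/{name}/models", f"data/{name}/bert_fine_tune"):
        status = "ok" if os.path.isdir(folder) else "MISSING"
        print(f"{folder}: {status}")
```
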
- To save `svm`, `svm_l1`, `xgb`, and `lstm` features and their feature importance, run `save_combinations.py`.
  - Note that only `save_combinations.py` uses the downloaded shap package instead of the local one. As such, before running `save_combinations.py`, remember to set the package path so that the downloaded shap package is used (see the sketch below). Alternatively, simply rename the local `shap` folder to something else so that `save_combinations.py` does not read from the local package. If you renamed the local `shap` folder, remember to revert to the original folder name after running `save_combinations.py` so that other files are not affected.
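
    One way to "set the package path" is to prepend the directory that contains the downloaded shap package to `sys.path` at the top of `save_combinations.py`, before shap is imported. This is a minimal sketch; `/path/to/downloaded_shap_parent` is a placeholder for wherever you put the downloaded package, not a path from this repo.

    ```python
    import sys

    # Placeholder: the directory that *contains* the downloaded shap package.
    DOWNLOADED_SHAP_PARENT = "/path/to/downloaded_shap_parent"

    # Prepend so this copy is found before the repo-local `shap` folder.
    sys.path.insert(0, DOWNLOADED_SHAP_PARENT)

    import shap
    print(shap.__file__)  # quick check of which copy was actually imported
    ```

    The insert has to happen before the first `import shap`; otherwise Python may still pick up the repo-local `shap` folder first, which is why renaming it is offered as a fallback above.
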
- To save `lstm` attention weights, run `get_lstm_att_weights.py`.
- To save `lstm` SHAP, run `python get_lstm_shap.py <dataset_name>`.
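
  To run this step for all three datasets in one go, a small driver like the following could be used; it simply shells out to the command above once per dataset.

  ```python
  import subprocess
  import sys

  # Run `python get_lstm_shap.py <dataset_name>` once per dataset.
  for dataset in ["deception", "yelp", "sst"]:
      print(f"=== get_lstm_shap.py {dataset} ===")
      subprocess.run([sys.executable, "get_lstm_shap.py", dataset], check=True)
  ```
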
- Generate `tsv` files for `bert`:
  - deception: run `python data_retrieval.py deception`
  - yelp: run `python data_retrieval.py yelp`
  - sst: run `python data_retrieval.py sst`
- To save `bert` attention weights:
  - deception: run `python bert_att_weight_retrieval.py --data_dir data/deception --bert_model data/deception/bert_fine_tune/ --task_name sst-2 --output_dir /data/temp_output_dir/deception/ --do_eval --max_seq_length 300 --eval_batch_size 1`
  - yelp: run the above command, but replace `deception` with `yelp`, and change `max_seq_length` to `512`
  - sst: run the above command, but replace `deception` with `sst`, and change `max_seq_length` to `128`
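
  The yelp and sst variants differ from the deception command only in the dataset name and `max_seq_length` (300 for deception, 512 for yelp, 128 for sst). A sketch of that substitution is shown below; the same mapping applies to the `bert_lime.py`, `bert_shap.py`, and `tokenizer_alignment.py` commands in the following steps.

  ```python
  # max_seq_length per dataset, as listed in the bullets above.
  MAX_SEQ_LENGTH = {"deception": 300, "yelp": 512, "sst": 128}

  def bert_att_weight_cmd(dataset):
      """Build the bert_att_weight_retrieval.py command line for one dataset."""
      return [
          "python", "bert_att_weight_retrieval.py",
          "--data_dir", f"data/{dataset}",
          "--bert_model", f"data/{dataset}/bert_fine_tune/",
          "--task_name", "sst-2",
          "--output_dir", f"/data/temp_output_dir/{dataset}/",
          "--do_eval",
          "--max_seq_length", str(MAX_SEQ_LENGTH[dataset]),
          "--eval_batch_size", "1",
      ]

  for name in MAX_SEQ_LENGTH:
      print(" ".join(bert_att_weight_cmd(name)))
  ```
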
- To save `bert` LIME:
  - deception: run `python bert_lime.py --data_dir data/deception --bert_model data/deception/bert_fine_tune/ --task_name sst-2 --output_dir /data/temp_output_dir/deception/ --do_eval --max_seq_length 300 --eval_batch_size 1`
  - yelp: run the above command, but replace `deception` with `yelp`, and change `max_seq_length` to `512`
  - sst: run the above command, but replace `deception` with `sst`, and change `max_seq_length` to `128`
- To save `bert` SHAP:
  - deception: run `python bert_shap.py --data_dir data/deception --bert_model data/deception/bert_fine_tune/ --task_name sst-2 --output_dir /data/temp_output_dir/deception/ --do_eval --max_seq_length 300 --eval_batch_size 1`
  - yelp: run the above command, but replace `deception` with `yelp`, and change `max_seq_length` to `512`
  - sst: run the above command, but replace `deception` with `sst`, and change `max_seq_length` to `128`
- Generate `bert` spans and white spans:
  - deception: run `python tokenizer_alignment.py --data_dir data/deception --bert_model data/deception/bert_fine_tune --task_name sst-2 --output_dir /data/temp_output_dir/deception/ --do_eval --max_seq_length 300`
  - yelp: run the above command, but replace `deception` with `yelp`, and change `max_seq_length` to `512`
  - sst: run the above command, but replace `deception` with `sst`, and change `max_seq_length` to `128`
- To align all `bert` features/tokens with the correct weights, run `python get_bert.py`. Note: to generate `bert`-related features and their feature importance, it is important to follow the above steps in order (see the sketch below).
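
  Since the order matters, a hypothetical driver for a single dataset could chain the `bert` steps as sketched below. Every command is copied from the deception bullets above; this is only a convenience sketch, not a script shipped with the repo.

  ```python
  import subprocess
  import sys

  def run(args):
      """Run one pipeline step and stop immediately if it fails."""
      print("Running:", " ".join(args))
      subprocess.run(args, check=True)

  # Shared BERT arguments for the deception dataset (from the bullets above).
  bert_args = [
      "--data_dir", "data/deception",
      "--bert_model", "data/deception/bert_fine_tune/",
      "--task_name", "sst-2",
      "--output_dir", "/data/temp_output_dir/deception/",
      "--do_eval", "--max_seq_length", "300",
  ]

  run([sys.executable, "data_retrieval.py", "deception"])                                      # tsv files
  run([sys.executable, "bert_att_weight_retrieval.py", *bert_args, "--eval_batch_size", "1"])  # attention weights
  run([sys.executable, "bert_lime.py", *bert_args, "--eval_batch_size", "1"])                  # LIME
  run([sys.executable, "bert_shap.py", *bert_args, "--eval_batch_size", "1"])                  # SHAP
  run([sys.executable, "tokenizer_alignment.py", *bert_args])                                  # bert spans / white spans
  run([sys.executable, "get_bert.py"])                                                         # align features with weights
  ```
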
- To generate the plots in the paper, refer to the interactive notebook `main.ipynb`.
If you run into any problems, please email vivian.lai@colorado.edu and jon.z.cai@colorado.edu.
Paper: https://arxiv.org/abs/1910.08534
@article{lai2019many,
  title={Many Faces of Feature Importance: Comparing Built-in and Post-hoc Feature Importance in Text Classification},
  author={Lai, Vivian and Cai, Jon Z and Tan, Chenhao},
  journal={arXiv preprint arXiv:1910.08534},
  year={2019}
}