This repository contains the source code and data for our ACL 2021 paper: "How is BERT surprised? Layerwise detection of linguistic anomalies" by Bai Li, Zining Zhu, Guillaume Thomas, Yang Xu, and Frank Rudzicz.
If you use our work in your research, please cite:
Li, B., Zhu, Z., Thomas, G., Xu, Y., and Rudzicz, F. (2021) How is BERT surprised? Layerwise detection of linguistic anomalies. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL).
@inproceedings{li2021layerwise,
author = "Li, Bai and Zhu, Zining and Thomas, Guillaume and Xu, Yang and Rudzicz, Frank",
title = "How is BERT surprised? Layerwise detection of linguistic anomalies",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL)",
publisher = "Association for Computational Linguistics",
year = "2021",
}
The project was developed with the following library versions. Other versions may crash or produce incorrect results.
- Python 3.7.5
- CUDA Version: 11.0
- torch==1.7.1
- transformers==4.5.1
- numpy==1.19.0
- pandas==0.25.3
- scikit-learn==0.22
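Because results may depend on these exact versions, it can help to check the environment before running anything. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `find_mismatches` are hypothetical, and on Python 3.7 it assumes the `importlib_metadata` backport is installed, since `importlib.metadata` only ships with Python 3.8 and later.

```python
"""Compare installed package versions against the pins listed above."""
try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    import importlib_metadata as metadata  # backport, assumed installed on Python 3.7

# Pins copied from the list above.
PINNED = {
    "torch": "1.7.1",
    "transformers": "4.5.1",
    "numpy": "1.19.0",
    "pandas": "0.25.3",
    "scikit-learn": "0.22",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Return {package: (expected, found)} for every package that differs from its pin."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        # Drop local build tags such as "+cu110" before comparing.
        base = found.split("+")[0] if found else None
        if base != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

The comparison is an exact string match, so a patch release such as `0.22.1` would be reported as a mismatch. The Python and CUDA versions are not checked here.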
- Clone this repo:
git clone https://github.com/SPOClab-ca/layerwise-anomaly
- Download BNC Baby (4-million-word sample) from this link and extract it into data/bnc/
- Run BNC preprocessing script:
python scripts/process_bnc.py --bnc_dir=data/bnc/download/Texts --to=data/bnc.pkl
- Clone BLiMP repo:
cd data && git clone https://github.com/alexwarstadt/blimp
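Once the setup steps are done, a quick check that the expected files are in place can save a failed run later. This is a rough sketch under stated assumptions: the paths are taken from the commands above and are assumed to be relative to the repository root, and `missing_paths` is a hypothetical helper that does not exist in the repository.

```python
from pathlib import Path

# Paths taken from the setup commands above, relative to the repository root.
EXPECTED = {
    "data/bnc/download/Texts": "directory",  # extracted BNC Baby texts
    "data/bnc.pkl": "file",                  # output of scripts/process_bnc.py
    "data/blimp/data": "directory",          # data folder of the cloned BLiMP repo
}


def missing_paths(root, expected=EXPECTED):
    """Return the relative paths that are absent or of the wrong kind under root."""
    root = Path(root)
    missing = []
    for rel, kind in expected.items():
        path = root / rel
        present = path.is_dir() if kind == "directory" else path.is_file()
        if not present:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    problems = missing_paths(".")
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("All expected data paths are present.")
```

Run it from the repository root. It only checks that the paths exist, not that their contents are valid.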
PYTHONPATH=. time python scripts/blimp_anomaly.py \
--bnc_path=data/bnc.pkl \
--blimp_path=data/blimp/data/ \
--out=blimp_result
Run the notebooks/FreqSurprisal.ipynb notebook.
PYTHONPATH=. time python scripts/run_surprisal_gaps.py \
--bnc_path=data/bnc.pkl \
--out=surprisal_gaps
PYTHONPATH=. time python scripts/run_accuracy.py \
--model_name=roberta-base \
--anomaly_model=gmm
PYTHONPATH=. pytest tests