This is the repository for the EMNLP 2020 paper "Intrinsic Probing through Dimension Selection".
These instructions assume that conda is already installed on your system.
- Clone this repository. NOTE: We recommend keeping the default folder name when cloning.
- First run
conda env create -f environment.yml
- Activate the environment with
conda activate intrinsic-probing
- Install PyTorch.
- Install torch-scatter.
- (Optional) Set up wandb if you want live logging of your runs.
- You will also need to install fastText to your environment as described here.
- Set up the config file with
cp config.default.py config.py
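Once the steps above are done, a quick way to confirm that the environment is usable is to check that the required packages can be found. The following is a minimal sketch, not part of this repository: the import names (torch, torch_scatter, fasttext, wandb) are assumed to be the usual module names for the packages listed above, and missing_modules is a hypothetical helper.

```python
import importlib.util

# Assumed import names for PyTorch, torch-scatter and fastText.
REQUIRED = ["torch", "torch_scatter", "fasttext"]
# wandb is only needed for live logging, so it is treated as optional.
OPTIONAL = ["wandb"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for label, names in (("required", REQUIRED), ("optional", OPTIONAL)):
        missing = missing_modules(names)
        if missing:
            print(f"Missing {label} packages: {', '.join(missing)}")
        else:
            print(f"All {label} packages found.")
```

Run it from inside the activated intrinsic-probing environment; any package reported as missing points back to the corresponding install step.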
You will also need to generate the data. Here we provide instructions on how to obtain the data to replicate our entire study.
- First run
mkdir unimorph && cd unimorph && wget https://raw.githubusercontent.com/unimorph/um-canonicalize/master/um_canonicalize/tags.yaml
- Download UD 2.1 treebanks and put them in
data/ud/ud-treebanks-v2.1
- Download all fastText embedding files by running
cd scripts; ./download_fasttext_vectors.sh; cd ..
WARNING: This may take a while and require a lot of bandwidth.
- Clone the modified UD converter to this repository's parent folder (or consider using the original, official UD to UniMorph converter) and then convert the treebank annotations to the UniMorph schema with
cd scripts; ./ud_to_um.sh; cd ..
NOTE: This step will fail if the repositories were cloned into folders other than the default. If you changed a folder name, update the top lines of the shell script to reflect that.
- Run
./scripts/preprocess_bert.sh
to preprocess all the relevant treebanks using BERT. This may take a while.
- Run
./scripts/preprocess_fasttext.sh
to preprocess all the relevant treebanks using fastText. This may take a while.
- (Only on a headless server) Orca needs X11 to run; without it, it cannot generate graphs. An easier alternative to setting up X11 is to run
sudo apt-get install xvfb
and then open a python interpreter and run:
>>> import plotly.io as pio
>>> pio.orca.config.use_xvfb = True
>>> pio.orca.config.save()
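Before starting the preprocessing scripts, it can help to confirm that the downloaded files ended up where the steps above put them. The sketch below is illustrative only: it assumes the unimorph folder was created in the repository root, and missing_paths is a hypothetical helper rather than something this repository provides.

```python
from pathlib import Path

# Locations named in the data steps above, relative to the repository root.
# Assumption: the "mkdir unimorph" step was run from the repository root.
EXPECTED_PATHS = [
    Path("unimorph") / "tags.yaml",
    Path("data") / "ud" / "ud-treebanks-v2.1",
]


def missing_paths(repo_root, expected=EXPECTED_PATHS):
    """Return the expected paths (as POSIX strings) that do not exist under repo_root."""
    root = Path(repo_root)
    return [p.as_posix() for p in expected if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing data paths:", ", ".join(missing))
    else:
        print("All expected data paths are present.")
```

This only checks that the paths exist; it does not validate the contents of the treebanks or the tag file.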
All experiments are run using run_ud_treebanks.py.
For a list of options you can use, run
python run_ud_treebanks.py -h
For example, to replicate our MAP experiments for Portuguese fastText, you would run
python run_ud_treebanks.py por fasttext --max-iter 50 --trainer map
You can also run ./scripts/run_ud_all_experiments.sh
to reproduce experimental results.
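If you only want a subset of runs, the single-run command shown above can be wrapped in a small driver. This is a rough sketch under stated assumptions: build_command and run_all are hypothetical helpers, only the arguments that appear in the Portuguese example (language code, embedding name, --max-iter, --trainer) are used, and any language code other than por is something you would need to check against the script's help output.

```python
import subprocess
import sys


def build_command(language, embedding, max_iter=50, trainer="map"):
    """Build the argument list for one run, mirroring the Portuguese fastText example."""
    return [
        sys.executable, "run_ud_treebanks.py", language, embedding,
        "--max-iter", str(max_iter), "--trainer", trainer,
    ]


def run_all(languages, embedding, dry_run=True):
    """Print each command (dry run) or execute them one after another."""
    commands = [build_command(language, embedding) for language in languages]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the sweep as soon as one run fails.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Dry run by default, so nothing is executed until dry_run=False is passed.
    run_all(["por"], "fasttext")
```

The dry-run default makes it easy to inspect the commands before committing to long-running experiments.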
@inproceedings{torrobahennigen+al.emnlp20,
title = {Intrinsic Probing through Dimension Selection},
author = {Torroba Hennigen, Lucas and
Williams, Adina and
Cotterell, Ryan},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},
month = {November},
year = {2020},
address = {Online},
publisher = {Association for Computational Linguistics}
}