We use the Lenta.Ru dataset in our experiments.
- Create a `data_path` folder and put the `lenta-ru-news.csv` file in it
- Choose labels to experiment with and split the data into train and test by running `lenta_dataset.ipynb` (a sketch of this split is given after the list)
- Finally, run `python data_prep.py -dd data_path`. This command will save the `full.csv`, `train.csv`, `eval.csv` and `vocabs.txt` files to `data_path`
- Modify the `build_model` function if you want to change the architecture of the model (see the model sketch after the list)
- Create an `experiment_path` folder and put `experiments/config.yaml` in it
- Modify the hyperparameters inside `experiment_path/config.yaml`
- Run `python train.py -dd data_path -md experiment_path`. This command will train the model and save checkpoints to `experiment_path`
- Choose words you would like to experiment with. For example, `Москва`, `ООН` and `Жириновский` would be a good choice.
- Run `python collect_concepts.py -dd data_path -md experiment_path --ngrams 3`. This command will generate multiple files:
  - `concepts.pkl` -- for each concept (e.g. `Москва`) we search for sentences where this word occurs, then retrieve ngrams of size `n` from these sentences (e.g. `лето в Москва`, `Москва слезам не верит`) and call them concepts. We also collect some random samples from the data for each concept (see the ngram-extraction sketch after the list)
  - `cav_bottlenecks.pkl` -- the concept texts converted into hidden representations of the model from the `experiment_path` folder
  - `cavs.pkl` -- hyperplanes for each concept, obtained by fitting a Logistic Regression on concept/non-concept data; the LR is trained on hidden representations of the data (see the CAV sketch after the list)
  - `grads.pkl` -- directional derivatives (see the paper for more details)
- Run `python calculate_tcav.py -dd data_path`. This command will save a `scores.pkl` file with TCAV scores for each concept against all labels (see the TCAV-score sketch after the list).
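
A minimal sketch of the label selection and train/test split that `lenta_dataset.ipynb` performs. The `text`/`topic` column names and the chosen labels are assumptions about the `lenta-ru-news.csv` schema, not the notebook's exact code:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data_path/lenta-ru-news.csv")

# Keep only the labels we want to experiment with (hypothetical choice).
labels = ["Россия", "Мир", "Экономика"]
df = df[df["topic"].isin(labels)]

# Stratified split so each label keeps its proportion in both parts.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["topic"], random_state=42
)
```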
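A hypothetical `build_model`, written in Keras purely for illustration; the repo's actual framework and signature may differ. The key point is exposing a hidden (bottleneck) layer whose activations TCAV can probe:

```python
import tensorflow as tf

def build_model(vocab_size, num_labels, emb_dim=128, hidden_dim=256):
    inputs = tf.keras.Input(shape=(None,), dtype="int32")
    x = tf.keras.layers.Embedding(vocab_size, emb_dim)(inputs)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    # This dense layer is the bottleneck whose activations feed the CAVs.
    bottleneck = tf.keras.layers.Dense(
        hidden_dim, activation="relu", name="bottleneck"
    )(x)
    outputs = tf.keras.layers.Dense(num_labels, activation="softmax")(bottleneck)
    return tf.keras.Model(inputs, outputs)
```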
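A sketch of the concept collection behind `concepts.pkl`: find sentences containing the concept word and take every ngram of size `n` that includes it. The naive whitespace tokenization is an assumption; the repo may tokenize differently:

```python
def collect_concept_ngrams(sentences, concept, n=3):
    """Return all size-n ngrams that contain the concept word."""
    ngrams = []
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if concept in window:
                ngrams.append(" ".join(window))
    return ngrams

# Example: prints ['Москва слезам не'].
print(collect_concept_ngrams(["Москва слезам не верит"], "Москва", n=3))
```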
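A sketch of the CAV fitting behind `cavs.pkl`: a Logistic Regression separates concept activations from non-concept (random) activations, and its weight vector is the concept's hyperplane normal. Array shapes here are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_cav(concept_acts, random_acts):
    X = np.vstack([concept_acts, random_acts])
    y = np.array([1] * len(concept_acts) + [0] * len(random_acts))
    lr = LogisticRegression(max_iter=1000).fit(X, y)
    # The normal of the separating hyperplane, i.e. the CAV.
    return lr.coef_[0]

# Toy usage with random 256-dim "activations".
rng = np.random.default_rng(0)
cav = fit_cav(rng.normal(1.0, 1.0, (50, 256)),
              rng.normal(0.0, 1.0, (50, 256)))
```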
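A sketch of the score that `calculate_tcav.py` computes, following the original TCAV paper: the score for a (concept, label) pair is the fraction of the label's examples whose directional derivative along the CAV is positive. Treating `grads` as one gradient row per example (as stored in `grads.pkl`) is an assumption:

```python
import numpy as np

def tcav_score(grads, cav):
    """grads: (num_examples, hidden_dim) gradients of the label logit
    with respect to the bottleneck activations; cav: (hidden_dim,)."""
    directional_derivatives = grads @ cav
    return float((directional_derivatives > 0).mean())
```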
Run `TCAV.ipynb` to compare the results; a sketch of reading `scores.pkl` follows below.
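
A sketch of inspecting the output, assuming `scores.pkl` stores a `{concept: {label: score}}` mapping; the actual structure may differ, and `TCAV.ipynb` shows the intended usage:

```python
import pickle

with open("data_path/scores.pkl", "rb") as f:
    scores = pickle.load(f)

# Print the TCAV score of every concept against every label.
for concept, label_scores in scores.items():
    for label, score in label_scores.items():
        print(f"{concept} -> {label}: {score:.3f}")
```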