GIM (Gradient Interaction Modifications) is a state-of-the-art feature attribution and circuit discovery method. It currently leads the leaderboard of the Mechanistic Interpretability Benchmark while being as fast as computing gradients.
We have created this PyPI package to make it effortless to use GIM on any Large Language Model. The code for the PyPI package is found in this repository.
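As a rough illustration only, a call to the package might look like the sketch below. The `gim` import and the `gim.attribute(...)` call are hypothetical placeholders rather than the actual API; only the Hugging Face model loading is standard. Consult the package's own documentation for real usage.

```python
# Illustrative sketch only: the `gim` import and `gim.attribute(...)` call are
# hypothetical placeholders, not the package's real API.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any Hugging Face causal language model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")

# import gim                                   # hypothetical package name
# attributions = gim.attribute(model, inputs)  # hypothetical attribution call
```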
The code in this repository is for reproducing the experiments in the paper and is less useful for other use cases.
```bash
make setup
make download_data
```

You must download the Twitter sentiment classification dataset manually from https://www.kaggle.com/competitions/tweet-sentiment-extraction/data.
```bash
CUDA_VISIBLE_DEVICES="0" uv run python src/evaluation/evaluate_models.py
```

Change `CUDA_VISIBLE_DEVICES` if you want to use a different GPU.
You can reproduce our three experiments with the following commands:
```bash
CUDA_VISIBLE_DEVICES="0" uv run python src/evaluation/evaluate_self_repair.py
```

```bash
CUDA_VISIBLE_DEVICES="0" uv run python src/evaluation/evaluate_feature_attributions.py
```

This command will also compute the results needed for the ablation study. It will take a lot of time; you can change the parameters in the code to run only a few models in the same run, as in the sketch below.
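The idea is simply to shrink the list of models the evaluation script iterates over. The variable name and model identifiers below are illustrative guesses, not the actual contents of the scripts; edit the corresponding list inside the evaluation script you are running.

```python
# Illustrative sketch: the variable name and model identifiers are guesses,
# not the actual contents of the evaluation scripts. The point is to shorten
# the list of models the script loops over.
MODELS = [
    "gpt2",            # keep one small model for a quick run
    # "gpt2-medium",   # comment out larger models to save time
    # "gpt2-large",
]

for model_name in MODELS:
    print(f"Evaluating {model_name} ...")  # the real script runs the evaluation here
```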
```bash
CUDA_VISIBLE_DEVICES="0" uv run python src/evaluation/evaluate_layers.py
```

The code for creating the figures and tables is in the `/results` folder.