Official repository for the paper GIM: Improved Interpretability for Large Language Models.

GIM (Gradient Interaction Modifications) is a state-of-the-art method for feature attribution and circuit discovery. It currently leads the leaderboard for the Mechanistic Interpretability Benchmark, while being as fast as computing gradients.

We have created this PyPI package to make it effortless to use GIM on any Large Language Model. The code for the PyPI package is found in this repository.

The code in this repository is for reproducing the experiments in the paper. The code is less useful for other use cases.

Setup

Set up uv and pre-install

make setup

Download the datasets

make download_data

You must download the Twitter sentiment classification dataset manually from https://www.kaggle.com/competitions/tweet-sentiment-extraction/data

Compute classification accuracy of the large language models

CUDA_VISIBLE_DEVICES="0" uv run python src/evaluation/evaluate_models.py

Change CUDA_VISIBLE_DEVICES if you want to use a different GPU.

Running experiments

You can reproduce our three experiments using the following lines of code:

Self-repair experiments

CUDA_VISIBLE_DEVICES="0" uv run python src/evaluation/evaluate_self_repair.py

Feature attribution experiments

CUDA_VISIBLE_DEVICES="0" uv run python src/evaluation/evaluate_feature_attributions.py

This command also computes the results needed for the ablation study, so it takes a long time. You can change the parameters in the code to run only a few models per run.

Circuit identification experiments

CUDA_VISIBLE_DEVICES="0" uv run python src/evaluation/evaluate_layers.py
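The three experiment commands above can also be chained from a single script. The following is a minimal sketch, not part of the repository: the script paths are copied from the commands in this README, while the helper names (`build_command`, `build_env`, `run_experiments`) and the `dry_run` flag are hypothetical and exist only for illustration. It assumes you run it from the repository root with uv installed.

```python
import os
import subprocess

# Script paths copied from the commands in this README.
EXPERIMENT_SCRIPTS = [
    "src/evaluation/evaluate_self_repair.py",
    "src/evaluation/evaluate_feature_attributions.py",
    "src/evaluation/evaluate_layers.py",
]


def build_command(script):
    """Return the argument list for `uv run python <script>` (hypothetical helper)."""
    return ["uv", "run", "python", script]


def build_env(gpu_id):
    """Copy the current environment and restrict CUDA to a single GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


def run_experiments(gpu_id=0, dry_run=True):
    """Run the experiments one after another and return the commands used.

    With dry_run=True the commands are only printed, not executed.
    """
    env = build_env(gpu_id)
    commands = [build_command(script) for script in EXPERIMENT_SCRIPTS]
    for command in commands:
        print(f'CUDA_VISIBLE_DEVICES="{gpu_id}"', " ".join(command))
        if not dry_run:
            # check=True stops the sequence if a script exits with an error.
            subprocess.run(command, env=env, check=True)
    return commands


if __name__ == "__main__":
    # Print the commands first; switch to dry_run=False to actually run them.
    run_experiments(gpu_id=0, dry_run=True)
```

Running the scripts sequentially keeps only one model on the GPU at a time. Because the feature attribution run is long, you may prefer to launch it separately or drop it from the list.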

Figures and tables

The code for creating the figures and tables is in the /results folder.
