The area of interpretability in large language models (LLMs) has been growing rapidly in recent years. This repository collects relevant resources to help beginners get started quickly and to help researchers keep up with the latest progress.
This is an actively maintained repository; please open an issue if any relevant resource is missing. If you have any questions or suggestions, feel free to contact me via email: ruizhe.li@abdn.ac.uk.
Table of Contents
- Awesome Interpretability Libraries
- Awesome Interpretability Blogs & Videos
- Awesome Interpretability Tutorials
- Awesome Interpretability Forums
- Awesome Interpretability Tools
- Awesome Interpretability Programs
- Awesome Interpretability Papers
- Other Awesome Interpretability Resources
Awesome Interpretability Libraries
- TransformerLens: a library for mechanistic interpretability of generative language models (see the sketch after this list). (Doc, Tutorial, Demo)
- nnsight: enables interpreting and manipulating the internals of deep learning models (see the sketch after this list). (Doc, Tutorial)
- SAE Lens: train and analyse sparse autoencoders (SAEs) (see the sketch after this list). (Doc, Tutorial, Blog)
- Automatic Circuit DisCovery: automatically builds circuits for mechanistic interpretability. (Paper, Demo)
- Pyvene: a library for understanding and improving PyTorch models via interventions. (Paper, Demo)
- pyreft: a powerful, efficient, and interpretable fine-tuning method (ReFT). (Paper, Demo)
- repeng: a Python library for generating control vectors with representation engineering (see the sketch after this list). (Paper, Blog)
- Penzai: a JAX library for writing models as legible, functional pytree data structures, along with tools for visualizing, modifying, and analyzing them. (Doc, Tutorial)
- LXT: LRP eXplains Transformers: Layer-wise Relevance Propagation (LRP) extended to handle attention layers in Large Language Models (LLMs) and Vision Transformers (ViTs). (Paper, Doc)
- Tuned Lens: Tools for understanding how transformer predictions are built layer-by-layer. (Paper, Doc)
- Inseq: Pytorch-based toolkit for common post-hoc interpretability analyses of sequence generation models. (Paper, Doc)
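To give a feel for these libraries, here is a minimal sketch of the TransformerLens workflow: loading GPT-2 small and caching every intermediate activation on a prompt. `HookedTransformer.from_pretrained` and `run_with_cache` are the library's documented entry points; the prompt is an arbitrary placeholder.

```python
# Minimal TransformerLens sketch: cache and inspect GPT-2 activations.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small

prompt = "The Eiffel Tower is in the city of"
logits, cache = model.run_with_cache(prompt)

# The cache maps hook names to activation tensors, e.g. the layer-0
# attention pattern with shape (batch, n_heads, seq_len, seq_len):
attn_pattern = cache["pattern", 0]
print(attn_pattern.shape)
```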
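A similar sketch for nnsight's tracing API; proxy semantics have changed across nnsight versions, so treat the `.save()` handling below as illustrative rather than definitive.

```python
# nnsight sketch: save a hidden state during a traced forward pass.
from nnsight import LanguageModel

model = LanguageModel("openai-community/gpt2", device_map="auto")

with model.trace("The Eiffel Tower is in the city of"):
    # Each GPT-2 block returns a tuple; index 0 is the hidden states.
    hidden = model.transformer.h[5].output[0].save()

# After the trace exits, the saved proxy resolves to a tensor
# (older versions expose it as hidden.value).
print(hidden.shape)  # (batch, seq_len, d_model)
```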
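For SAE Lens, a hedged sketch of loading a pretrained sparse autoencoder and encoding activations into features; the `release` and `sae_id` strings are examples from the SAE Lens pretrained registry and may change, so consult the docs.

```python
# SAE Lens sketch: load a pretrained SAE and encode activations.
import torch
from sae_lens import SAE

# Identifiers follow the SAE Lens pretrained-SAE registry (check the docs).
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
)

# Encode a (batch, seq_len, d_in) activation tensor into sparse features;
# random values stand in for real residual-stream activations here.
acts = torch.randn(1, 4, cfg_dict["d_in"])
feature_acts = sae.encode(acts)
print(feature_acts.shape)  # (1, 4, d_sae), mostly zeros
```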
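And a sketch of repeng's control-vector workflow, following the shape of its README; the model name and the single contrastive pair are placeholders (real training uses many pairs).

```python
# repeng sketch: train a control vector and steer generation with it.
from transformers import AutoModelForCausalLM, AutoTokenizer
from repeng import ControlVector, ControlModel, DatasetEntry

name = "mistralai/Mistral-7B-Instruct-v0.1"  # placeholder model choice
tokenizer = AutoTokenizer.from_pretrained(name)
model = ControlModel(AutoModelForCausalLM.from_pretrained(name),
                     list(range(-5, -18, -1)))  # layers to control

# Contrastive prompt pairs define the steering direction.
dataset = [
    DatasetEntry(
        positive="You are extremely happy. Tell me about your day.",
        negative="You are extremely sad. Tell me about your day.",
    ),
]

vector = ControlVector.train(model, tokenizer, dataset)
model.set_control(vector, 1.5)  # steer towards the "positive" side
```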
Awesome Interpretability Blogs & Videos
- A Barebones Guide to Mechanistic Interpretability Prerequisites
- Concrete Steps to Get Started in Transformer Mechanistic Interpretability
- An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers
- 200 Concrete Open Problems in Mechanistic Interpretability
- 3Blue1Brown: But what is a GPT? Visual intro to transformers | Chapter 5, Deep Learning
- 3Blue1Brown: Attention in transformers, visually explained | Chapter 6, Deep Learning
Awesome Interpretability Tutorials
- ARENA 3.0: understand mechanistic interpretability using TransformerLens.
- EACL24: Transformer-specific Interpretability (Github)
Awesome Interpretability Tools
- Transformer Debugger: investigate specific behaviors of small LLMs.
- LLM Transparency Tool (Demo)
- sae_vis: a tool to replicate Anthropic's sparse autoencoder visualisations (Demo)
- Neuronpedia: an open platform for interpretability research. (Doc)
Awesome Interpretability Programs
- ML Alignment & Theory Scholars (MATS): an independent research and educational seminar program that connects talented scholars with top mentors in the fields of AI alignment, interpretability, and governance.
Awesome Interpretability Papers

Survey Papers

| Title | Venue | Date | Code |
|---|---|---|---|
| From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP | arXiv | 2024-06-18 | - |
| A Primer on the Inner Workings of Transformer-based Language Models | arXiv | 2024-05-02 | - |
| Mechanistic Interpretability for AI Safety -- A Review | arXiv | 2024-04-22 | - |
| From Understanding to Utilization: A Survey on Explainability for Large Language Models | arXiv | 2024-02-22 | - |
| Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks | arXiv | 2023-08-18 | - |
Position Papers

| Title | Venue | Date | Code |
|---|---|---|---|
| Position Paper: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience | ICML | 2024-06-03 | - |
| Interpretability Needs a New Paradigm | arXiv | 2024-05-08 | - |
| Position Paper: Toward New Frameworks for Studying Model Representations | arXiv | 2024-02-06 | - |
| Rethinking Interpretability in the Era of Large Language Models | arXiv | 2024-01-30 | - |
Interpretability in Vision LLMs

| Title | Venue | Date | Code | Blog |
|---|---|---|---|---|
| What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Noise-free Text-Image Corruption and Evaluation | arXiv | 2024-06-24 | Github | - |
| PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits | XAI4CV@CVPR | 2024-04-09 | Github | - |
| Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE) | arXiv | 2024-02-16 | Github | - |
| Analyzing Vision Transformers for Image Classification in Class Embedding Space | NeurIPS | 2023-09-21 | Github | - |
| Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP | CLVL@ICCV | 2023-08-27 | Github | - |
| Scale Alone Does not Improve Mechanistic Interpretability in Vision Models | NeurIPS | 2023-07-11 | Github | Blog |
Benchmarking Interpretability

| Title | Venue | Date | Code | Blog |
|---|---|---|---|---|
| Benchmarking Mental State Representations in Language Models | MI@ICML | 2024-06-25 | - | - |
| A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains | ACL | 2024-05-21 | Dataset | Blog |
| RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations | arXiv | 2024-02-27 | Github | - |
| CausalGym: Benchmarking causal interpretability methods on linguistic tasks | arXiv | 2024-02-19 | Github | - |
Enhancing Interpretability

| Title | Venue | Date | Code | Blog |
|---|---|---|---|---|
| Evaluating Brain-Inspired Modular Training in Automated Circuit Discovery for Mechanistic Interpretability | arXiv | 2024-01-08 | - | - |
| Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability | arXiv | 2023-06-06 | Github | - |
Others

| Title | Venue | Date | Code | Blog |
|---|---|---|---|---|
| An introduction to graphical tensor notation for mechanistic interpretability | arXiv | 2024-02-02 | - | - |
| Episodic Memory Theory for the Mechanistic Interpretation of Recurrent Neural Networks | arXiv | 2023-10-03 | Github | - |