| Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions | EMNLP | 2024-10-23 | - | - |
| Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models | EMNLP | 2024-10-04 | Github | - |
| How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | EMNLP | 2024-10-01 | Github | - |
| Information Flow Routes: Automatically Interpreting Language Models at Scale | EMNLP | 2024-10-01 | Github | - |
| MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model | EMNLP | 2024-10-01 | Github | - |
| Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis | EMNLP | 2024-09-12 | Github | - |
| Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models | - | 2024-08-05 | Github | - |
| Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically | MechInterp@ICML | 2024-07-15 | - | - |
| Compact Proofs of Model Performance via Mechanistic Interpretability | MechInterp@ICML | 2024-07-15 | Github | - |
| Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks | MechInterp@ICML | 2024-07-15 | - | - |
| How Do Llamas Process Multilingual Text? A Latent Exploration through Activation Patching | MechInterp@ICML | 2024-07-15 | - | - |
| Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models | MechInterp@ICML | 2024-07-15 | - | - |
| What Makes and Breaks Safety Fine-tuning? Mechanistic Study | MechInterp@ICML | 2024-07-15 | - | - |
| Using Degeneracy in the Loss Landscape for Mechanistic Interpretability | MechInterp@ICML | 2024-07-15 | - | - |
| Loss in the Crowd: Hidden Breakthroughs in Language Model Training | MechInterp@ICML | 2024-07-15 | - | - |
| Robust Knowledge Unlearning via Mechanistic Localizations | MechInterp@ICML | 2024-07-15 | - | - |
| Language Models Linearly Represent Sentiment | MechInterp@ICML | 2024-07-15 | - | - |
| Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms | MechInterp@ICML | 2024-07-15 | Github | - |
| Learning and Unlearning of Fabricated Knowledge in Language Models | MechInterp@ICML | 2024-07-15 | - | - |
| Faithful and Fast Influence Function via Advanced Sampling | MechInterp@ICML | 2024-07-15 | - | - |
| Hypothesis Testing the Circuit Hypothesis in LLMs | MechInterp@ICML | 2024-07-15 | - | - |
| The Geometry of Categorical and Hierarchical Concepts in Large Language Models | MechInterp@ICML | 2024-07-15 | Github | - |
| InversionView: A General-Purpose Method for Reading Information from Neural Activations | MechInterp@ICML | 2024-07-15 | Github | - |
| Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks | MechInterp@ICML | 2024-07-15 | - | - |
| Functional Faithfulness in the Wild: Circuit Discovery with Differentiable Computation Graph Pruning | arXiv | 2024-07-04 | - | - |
| Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation | arXiv | 2024-07-01 | Github | - |
| Recovering the Pre-Fine-Tuning Weights of Generative Models | ICML | 2024-07-01 | Github | Blog |
| Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs | arXiv | 2024-06-28 | Github | Blog |
| Observable Propagation: Uncovering Feature Vectors in Transformers | ICML | 2024-06-25 | Github | - |
| Multi-property Steering of Large Language Models with Dynamic Activation Composition | arXiv | 2024-06-25 | Github | - |
| What Do the Circuits Mean? A Knowledge Edit View | arXiv | 2024-06-25 | - | - |
| Confidence Regulation Neurons in Language Models | arXiv | 2024-06-24 | - | - |
| Compact Proofs of Model Performance via Mechanistic Interpretability | arXiv | 2024-06-24 | Github | - |
| Preference Tuning For Toxicity Mitigation Generalizes Across Languages | arXiv | 2024-06-23 | Github | - |
| Unlocking the Future: Exploring Look-Ahead Planning Mechanistic Interpretability in Large Language Models | arXiv | 2024-06-23 | - | - |
| Estimating Knowledge in Large Language Models Without Generating a Single Token | arXiv | 2024-06-18 | Github | - |
| Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations | arXiv | 2024-06-17 | - | - |
| Transcoders Find Interpretable LLM Feature Circuits | MechInterp@ICML | 2024-06-17 | Github | - |
| Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue | arXiv | 2024-06-16 | Github | - |
| Context versus Prior Knowledge in Language Models | ACL | 2024-06-16 | Github | - |
| Talking Heads: Understanding Inter-layer Communication in Transformer Language Models | arXiv | 2024-06-13 | - | - |
| MambaLRP: Explaining Selective State Space Sequence Models | arXiv | 2024-06-11 | Github | - |
| Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models | ICML | 2024-06-06 | Github | Blog |
| Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals | ACL | 2024-06-06 | Github | - |
| Learned feature representations are biased by complexity, learning order, position, and more | arXiv | 2024-06-06 | Demo | - |
| Iteration Head: A Mechanistic Study of Chain-of-Thought | arXiv | 2024-06-05 | - | - |
| Activation Addition: Steering Language Models Without Optimization | arXiv | 2024-06-04 | Code | - |
| Interpretability Illusions in the Generalization of Simplified Models | arXiv | 2024-06-04 | - | - |
| SyntaxShap: Syntax-aware Explainability Method for Text Generation | arXiv | 2024-06-03 | Github | Blog |
| Calibrating Reasoning in Language Models with Internal Consistency | arXiv | 2024-05-29 | - | - |
| Black-Box Access is Insufficient for Rigorous AI Audits | FAccT | 2024-05-29 | - | - |
| Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting | arXiv | 2024-05-28 | - | - |
| From Neurons to Neutrons: A Case Study in Interpretability | ICML | 2024-05-27 | Github | - |
| Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization | MechInterp@ICML | 2024-05-27 | Github | - |
| Explorations of Self-Repair in Language Models | ICML | 2024-05-26 | Github | - |
| Emergence of a High-Dimensional Abstraction Phase in Language Transformers | arXiv | 2024-05-24 | - | - |
| Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions | arXiv | 2024-05-23 | Github | - |
| Not All Language Model Features Are Linear | arXiv | 2024-05-23 | Github | - |
| Using Degeneracy in the Loss Landscape for Mechanistic Interpretability | arXiv | 2024-05-20 | - | - |
| Your Transformer is Secretly Linear | arXiv | 2024-05-19 | Github | - |
| Are self-explanations from Large Language Models faithful? | ACL | 2024-05-16 | Github | - |
| Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models | arXiv | 2024-05-14 | - | - |
| Steering Llama 2 via Contrastive Activation Addition | arXiv | 2024-05-07 | Github | - |
| How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability | AISTATS | 2024-05-07 | Github | - |
| How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning | arXiv | 2024-05-06 | Github | - |
| Circuit Component Reuse Across Tasks in Transformer Language Models | ICLR | 2024-05-06 | Github | - |
| LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools and Self-Explanations | HCI+NLP@NAACL | 2024-04-24 | Github | - |
| How to use and interpret activation patching | arXiv | 2024-04-23 | - | - |
| Understanding Addition in Transformers | arXiv | 2024-04-23 | - | - |
| Towards Uncovering How Large Language Model Works: An Explainability Perspective | arXiv | 2024-04-15 | - | - |
| What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation | ICML | 2024-04-10 | Github | - |
| Does Transformer Interpretability Transfer to RNNs? | arXiv | 2024-04-09 | - | - |
| Locating and Editing Factual Associations in Mamba | arXiv | 2024-04-04 | Github | Demo |
| Eliciting Latent Knowledge from Quirky Language Models | ME-FoMo@ICLR | 2024-04-03 | - | - |
| Do language models plan ahead for future tokens? | arXiv | 2024-04-01 | - | - |
| Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models | arXiv | 2024-03-31 | Github | Demo |
| Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms | arXiv | 2024-03-26 | - | - |
| What does the Knowledge Neuron Thesis Have to do with Knowledge? | ICLR | 2024-03-16 | Github | - |
| Language Models Represent Space and Time | ICLR | 2024-03-04 | Github | - |
| AtP*: An efficient and scalable method for localizing LLM behaviour to components | arXiv | 2024-03-01 | - | - |
| A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task | arXiv | 2024-02-28 | - | - |
| Function Vectors in Large Language Models | ICLR | 2024-02-25 | Github | Blog |
| A Language Model's Guide Through Latent Space | arXiv | 2024-02-22 | - | - |
| Interpreting Shared Circuits for Ordered Sequence Prediction in a Large Language Model | arXiv | 2024-02-22 | - | - |
| Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking | ICLR | 2024-02-22 | Github | Blog |
| Fine-grained Hallucination Detection and Editing for Language Models | arXiv | 2024-02-21 | Github | Blog |
| Enhanced Hallucination Detection in Neural Machine Translation through Simple Detector Aggregation | arXiv | 2024-02-20 | Github | - |
| Identifying Semantic Induction Heads to Understand In-Context Learning | arXiv | 2024-02-20 | - | - |
| Backward Lens: Projecting Language Model Gradients into the Vocabulary Space | arXiv | 2024-02-20 | - | - |
| Show Me How It's Done: The Role of Explanations in Fine-Tuning Language Models | ACML | 2024-02-12 | - | - |
| Model Editing with Canonical Examples | arXiv | 2024-02-09 | Github | - |
| Opening the AI black box: program synthesis via mechanistic interpretability | arXiv | 2024-02-07 | Github | - |
| INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection | ICLR | 2024-02-06 | - | - |
| In-Context Language Learning: Architectures and Algorithms | arXiv | 2024-01-30 | Github | - |
| Gradient-Based Language Model Red Teaming | EACL | 2024-01-30 | Github | - |
| The Calibration Gap between Model and Human Confidence in Large Language Models | arXiv | 2024-01-24 | - | - |
| Universal Neurons in GPT2 Language Models | arXiv | 2024-01-22 | Github | - |
| The mechanistic basis of data dependence and abrupt learning in an in-context classification task | ICLR | 2024-01-16 | - | - |
| Overthinking the Truth: Understanding how Language Models Process False Demonstrations | ICLR | 2024-01-16 | Github | - |
| Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks | ICLR | 2024-01-16 | - | - |
| Feature emergence via margin maximization: case studies in algebraic tasks | ICLR | 2024-01-16 | - | - |
| Successor Heads: Recurring, Interpretable Attention Heads In The Wild | ICLR | 2024-01-16 | - | - |
| Towards Best Practices of Activation Patching in Language Models: Metrics and Methods | ICLR | 2024-01-16 | - | - |
| A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity | ICML | 2024-01-03 | Github | - |
| Forbidden Facts: An Investigation of Competing Objectives in Llama-2 | ATTRIB@NeurIPS | 2023-12-31 | Github | Blog |
| The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets | arXiv | 2023-12-08 | Github | Blog |
| Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching | ATTRIB@NeurIPS | 2023-12-06 | Github | - |
| Structured World Representations in Maze-Solving Transformers | UniReps@NeurIPS | 2023-12-05 | Github | - |
| Generating Interpretable Networks using Hypernetworks | arXiv | 2023-12-05 | - | - |
| The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks | NeurIPS | 2023-11-21 | Github | - |
| Attribution Patching Outperforms Automated Circuit Discovery | ATTRIB@NeurIPS | 2023-11-20 | Github | - |
| Tracr: Compiled Transformers as a Laboratory for Interpretability | NeurIPS | 2023-11-03 | Github | - |
| How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model | NeurIPS | 2023-11-02 | Github | - |
| Learning Transformer Programs | NeurIPS | 2023-10-31 | Github | - |
| Towards Automated Circuit Discovery for Mechanistic Interpretability | NeurIPS | 2023-10-28 | Github | - |
| Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models | EMNLP | 2023-10-23 | Github | - |
| Inference-Time Intervention: Eliciting Truthful Answers from a Language Model | NeurIPS | 2023-10-20 | Github | - |
| Progress measures for grokking via mechanistic interpretability | ICLR | 2023-10-19 | Github | Blog |
| Copy Suppression: Comprehensively Understanding an Attention Head | arXiv | 2023-10-06 | Github | Blog & Demo |
| Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models | NeurIPS | 2023-09-21 | Github | - |
| Interpretability at Scale: Identifying Causal Mechanisms in Alpaca | NeurIPS | 2023-09-21 | Github | - |
| Emergent Linear Representations in World Models of Self-Supervised Sequence Models | BlackboxNLP@EMNLP | 2023-09-07 | Github | Blog |
| Finding Neurons in a Haystack: Case Studies with Sparse Probing | arXiv | 2023-06-02 | Github | - |
| Efficient Shapley Values Estimation by Amortization for Text Classification | ACL | 2023-05-31 | Github | Video |
| A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations | ICML | 2023-05-24 | Github | - |
| Localizing Model Behavior with Path Patching | arXiv | 2023-05-16 | - | - |
| Language models can explain neurons in language models | OpenAI | 2023-05-09 | - | - |
| N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models | ICLR Workshop | 2023-04-22 | - | - |
| Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | ICLR | 2023-01-20 | Github | - |
| Interpreting Neural Networks through the Polytope Lens | arXiv | 2022-11-22 | - | - |
| Scaling Laws and Interpretability of Learning from Repeated Data | arXiv | 2022-05-21 | - | - |
| In-context Learning and Induction Heads | Anthropic | 2022-03-08 | - | - |
| A Mathematical Framework for Transformer Circuits | Anthropic | 2021-12-22 | - | - |
| Thinking Like Transformers | ICML | 2021-07-19 | Github | Mini Tutorial |