Awesome Interpretability in Large Language Models

The area of interpretability in large language models (LLMs) has been growing rapidly in recent years. This repository tries to collect all relevant resources to help beginners quickly get started in this area and to help researchers keep up with the latest research progress.

This is an actively maintained repository; please open a new issue if I have missed any relevant resources. If you have any questions or suggestions, feel free to contact me via email: ruizhe.li@abdn.ac.uk.


Table of Contents


Awesome Interpretability Libraries

  • TransformerLens: a library for mechanistic interpretability of generative language models; a minimal usage sketch follows this list. (Doc, Tutorial, Demo)
  • nnsight: enables interpreting and manipulating the internals of deep learned models. (Doc, Tutorial, Paper)
  • SAE Lens: train and analyse SAEs. (Doc, Tutorial, Blog)
  • EleutherAI sae: train SAEs on very large models, based on the method and released code of the OpenAI SAE paper.
  • Automatic Circuit DisCovery: automatically build circuits for mechanistic interpretability. (Paper, Demo)
  • Pyvene: a library for understanding and improving PyTorch models via interventions. (Paper, Demo)
  • pyreft: a powerful, efficient, and interpretable fine-tuning method. (Paper, Demo)
  • repeng: a Python library for generating control vectors with representation engineering. (Paper, Blog)
  • Penzai: a JAX library for writing models as legible, functional pytree data structures, along with tools for visualizing, modifying, and analyzing them. (Paper, Doc, Tutorial)
  • LXT (LRP eXplains Transformers): Layer-wise Relevance Propagation (LRP) extended to handle attention layers in Large Language Models (LLMs) and Vision Transformers (ViTs). (Paper, Doc)
  • Tuned Lens: tools for understanding how transformer predictions are built layer by layer. (Paper, Doc)
  • Inseq: PyTorch-based toolkit for common post-hoc interpretability analyses of sequence generation models. (Paper, Doc)
  • shap: Python library for computing SHAP feature / token importance for any black-box model; works with Hugging Face, PyTorch, and TensorFlow models, including LLMs. (Paper, Doc)
  • captum: model interpretability and understanding library for PyTorch. (Paper, Doc)
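
To make the list above more concrete, here is a minimal sketch of caching activations with TransformerLens. It follows the HookedTransformer interface documented by the library (from_pretrained, run_with_cache), but treat the exact call signatures as something to confirm against the linked Doc and Tutorial rather than as a definitive recipe.

```python
# Minimal activation-caching sketch with TransformerLens (check the docs for
# the current interface; the model choice and prompt are just for illustration).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")   # small model for a quick demo

prompt = "The Eiffel Tower is located in the city of"
logits, cache = model.run_with_cache(prompt)        # forward pass + cached activations

# Greedy next-token prediction from the final position.
next_token = logits[0, -1].argmax()
print("Prediction:", model.to_string(next_token))

# Inspect a cached residual-stream activation for layer 0.
resid = cache["resid_post", 0]                       # shape: [batch, seq, d_model]
print("Residual stream shape:", tuple(resid.shape))
```

The same cached activations are the raw material for most of the techniques in the paper lists below (activation patching, logit-lens-style analyses, SAE training, and so on).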

Awesome Interpretability Blogs & Videos

Awesome Interpretability Tutorials

Awesome Interpretability Forums & Workshops

Awesome Interpretability Tools

  • Transformer Debugger: investigate specific behaviors of small LLMs.
  • LLM Transparency Tool (Demo)
  • sae_vis: a tool to replicate Anthropic's sparse autoencoder visualisations. (Demo)
  • Neuronpedia: an open platform for interpretability research. (Doc)
  • Comgra: a tool to analyze and debug neural networks in PyTorch; a GUI lets you traverse the computation graph and view the data from many different angles at the click of a button. (Paper)

Awesome Interpretability Programs

  • ML Alignment & Theory Scholars (MATS): an independent research and educational seminar program that connects talented scholars with top mentors in the fields of AI alignment, interpretability, and governance.

Awesome Interpretability Papers

Survey Papers

| Title | Venue | Date | Code |
| --- | --- | --- | --- |
| Knowledge Mechanisms in Large Language Models: A Survey and Perspective | EMNLP | 2024-10-06 | - |
| Attention Heads of Large Language Models: A Survey | arXiv | 2024-09-06 | Github |
| Internal Consistency and Self-Feedback in Large Language Models: A Survey | arXiv | 2024-07-22 | Github, Paper List |
| Relational Composition in Neural Networks: A Survey and Call to Action | MechInterp@ICML | 2024-07-15 | - |
| From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP | arXiv | 2024-06-18 | - |
| A Primer on the Inner Workings of Transformer-based Language Models | arXiv | 2024-05-02 | - |
| Mechanistic Interpretability for AI Safety -- A Review | arXiv | 2024-04-22 | - |
| From Understanding to Utilization: A Survey on Explainability for Large Language Models | arXiv | 2024-02-22 | - |
| Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks | arXiv | 2023-08-18 | - |

Position Papers

| Title | Venue | Date | Code |
| --- | --- | --- | --- |
| Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience | ICML | 2024-06-25 | - |
| Position Paper: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience | ICML | 2024-06-03 | - |
| Interpretability Needs a New Paradigm | arXiv | 2024-05-08 | - |
| Position Paper: Toward New Frameworks for Studying Model Representations | arXiv | 2024-02-06 | - |
| Rethinking Interpretability in the Era of Large Language Models | arXiv | 2024-01-30 | - |

Interpretable Analysis of LLMs

| Title | Venue | Date | Code | Blog |
| --- | --- | --- | --- | --- |
| Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions | EMNLP | 2024-10-23 | - | - |
| Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models | EMNLP | 2024-10-04 | Github | - |
| How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | EMNLP | 2024-10-01 | Github | - |
| Information Flow Routes: Automatically Interpreting Language Models at Scale | EMNLP | 2024-10-01 | Github | - |
| MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model | EMNLP | 2024-10-01 | Github | - |
| Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis | EMNLP | 2024-09-12 | Github | - |
| Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models | - | 2024-08-05 | Github | - |
| Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically | MechInterp@ICML | 2024-07-15 | - | - |
| Compact Proofs of Model Performance via Mechanistic Interpretability | MechInterp@ICML | 2024-07-15 | Github | - |
| Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks | MechInterp@ICML | 2024-07-15 | - | - |
| How Do Llamas Process Multilingual Text? A Latent Exploration through Activation Patching | MechInterp@ICML | 2024-07-15 | - | - |
| Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models | MechInterp@ICML | 2024-07-15 | - | - |
| What Makes and Breaks Safety Fine-tuning? Mechanistic Study | MechInterp@ICML | 2024-07-15 | - | - |
| Using Degeneracy in the Loss Landscape for Mechanistic Interpretability | MechInterp@ICML | 2024-07-15 | - | - |
| Loss in the Crowd: Hidden Breakthroughs in Language Model Training | MechInterp@ICML | 2024-07-15 | - | - |
| Robust Knowledge Unlearning via Mechanistic Localizations | MechInterp@ICML | 2024-07-15 | - | - |
| Language Models Linearly Represent Sentiment | MechInterp@ICML | 2024-07-15 | - | - |
| Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms | MechInterp@ICML | 2024-07-15 | Github | - |
| Learning and Unlearning of Fabricated Knowledge in Language Models | MechInterp@ICML | 2024-07-15 | - | - |
| Faithful and Fast Influence Function via Advanced Sampling | MechInterp@ICML | 2024-07-15 | - | - |
| Hypothesis Testing the Circuit Hypothesis in LLMs | MechInterp@ICML | 2024-07-15 | - | - |
| The Geometry of Categorical and Hierarchical Concepts in Large Language Models | MechInterp@ICML | 2024-07-15 | Github | - |
| InversionView: A General-Purpose Method for Reading Information from Neural Activations | MechInterp@ICML | 2024-07-15 | Github | - |
| Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks | MechInterp@ICML | 2024-07-15 | - | - |
| Functional Faithfulness in the Wild: Circuit Discovery with Differentiable Computation Graph Pruning | arXiv | 2024-07-04 | - | - |
| Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation | arXiv | 2024-07-01 | Github | - |
| Recovering the Pre-Fine-Tuning Weights of Generative Models | ICML | 2024-07-01 | Github | Blog |
| Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs | arXiv | 2024-06-28 | Github | Blog |
| Observable Propagation: Uncovering Feature Vectors in Transformers | ICML | 2024-06-25 | Github | - |
| Multi-property Steering of Large Language Models with Dynamic Activation Composition | arXiv | 2024-06-25 | Github | - |
| What Do the Circuits Mean? A Knowledge Edit View | arXiv | 2024-06-25 | - | - |
| Confidence Regulation Neurons in Language Models | arXiv | 2024-06-24 | - | - |
| Compact Proofs of Model Performance via Mechanistic Interpretability | arXiv | 2024-06-24 | Github | - |
| Preference Tuning For Toxicity Mitigation Generalizes Across Languages | arXiv | 2024-06-23 | Github | - |
| Unlocking the Future: Exploring Look-Ahead Planning Mechanistic Interpretability in Large Language Models | arXiv | 2024-06-23 | - | - |
| Estimating Knowledge in Large Language Models Without Generating a Single Token | arXiv | 2024-06-18 | Github | - |
| Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations | arXiv | 2024-06-17 | - | - |
| Transcoders Find Interpretable LLM Feature Circuits | MechInterp@ICML | 2024-06-17 | Github | - |
| Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue | arXiv | 2024-06-16 | Github | - |
| Context versus Prior Knowledge in Language Models | ACL | 2024-06-16 | Github | - |
| Talking Heads: Understanding Inter-layer Communication in Transformer Language Models | arXiv | 2024-06-13 | - | - |
| MambaLRP: Explaining Selective State Space Sequence Models | arXiv | 2024-06-11 | Github | - |
| Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models | ICML | 2024-06-06 | Github | Blog |
| Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals | ACL | 2024-06-06 | Github | - |
| Learned feature representations are biased by complexity, learning order, position, and more | arXiv | 2024-06-06 | Demo | - |
| Iteration Head: A Mechanistic Study of Chain-of-Thought | arXiv | 2024-06-05 | - | - |
| Activation Addition: Steering Language Models Without Optimization | arXiv | 2024-06-04 | Code | - |
| Interpretability Illusions in the Generalization of Simplified Models | arXiv | 2024-06-04 | - | - |
| SyntaxShap: Syntax-aware Explainability Method for Text Generation | arXiv | 2024-06-03 | Github | Blog |
| Calibrating Reasoning in Language Models with Internal Consistency | arXiv | 2024-05-29 | - | - |
| Black-Box Access is Insufficient for Rigorous AI Audits | FAccT | 2024-05-29 | - | - |
| Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting | arXiv | 2024-05-28 | - | - |
| From Neurons to Neutrons: A Case Study in Interpretability | ICML | 2024-05-27 | Github | - |
| Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization | MechInterp@ICML | 2024-05-27 | Github | - |
| Explorations of Self-Repair in Language Models | ICML | 2024-05-26 | Github | - |
| Emergence of a High-Dimensional Abstraction Phase in Language Transformers | arXiv | 2024-05-24 | - | - |
| Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions | arXiv | 2024-05-23 | Github | - |
| Not All Language Model Features Are Linear | arXiv | 2024-05-23 | Github | - |
| Using Degeneracy in the Loss Landscape for Mechanistic Interpretability | arXiv | 2024-05-20 | - | - |
| Your Transformer is Secretly Linear | arXiv | 2024-05-19 | Github | - |
| Are self-explanations from Large Language Models faithful? | ACL | 2024-05-16 | Github | - |
| Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models | arXiv | 2024-05-14 | - | - |
| Steering Llama 2 via Contrastive Activation Addition | arXiv | 2024-05-07 | Github | - |
| How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability | AISTATS | 2024-05-07 | Github | - |
| How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning | arXiv | 2024-05-06 | Github | - |
| Circuit Component Reuse Across Tasks in Transformer Language Models | ICLR | 2024-05-06 | Github | - |
| LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools and Self-Explanations | HCI+NLP@NAACL | 2024-04-24 | Github | - |
| How to use and interpret activation patching | arXiv | 2024-04-23 | - | - |
| Understanding Addition in Transformers | arXiv | 2024-04-23 | - | - |
| Towards Uncovering How Large Language Model Works: An Explainability Perspective | arXiv | 2024-04-15 | - | - |
| What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation | ICML | 2024-04-10 | Github | - |
| Does Transformer Interpretability Transfer to RNNs? | arXiv | 2024-04-09 | - | - |
| Locating and Editing Factual Associations in Mamba | arXiv | 2024-04-04 | Github | Demo |
| Eliciting Latent Knowledge from Quirky Language Models | ME-FoMo@ICLR | 2024-04-03 | - | - |
| Do language models plan ahead for future tokens? | arXiv | 2024-04-01 | - | - |
| Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models | arXiv | 2024-03-31 | Github | Demo |
| Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms | arXiv | 2024-03-26 | - | - |
| What does the Knowledge Neuron Thesis Have to do with Knowledge? | ICLR | 2024-03-16 | Github | - |
| Language Models Represent Space and Time | ICLR | 2024-03-04 | Github | - |
| AtP*: An efficient and scalable method for localizing LLM behaviour to components | arXiv | 2024-03-01 | - | - |
| A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task | arXiv | 2024-02-28 | - | - |
| Function Vectors in Large Language Models | ICLR | 2024-02-25 | Github | Blog |
| A Language Model's Guide Through Latent Space | arXiv | 2024-02-22 | - | - |
| Interpreting Shared Circuits for Ordered Sequence Prediction in a Large Language Model | arXiv | 2024-02-22 | - | - |
| Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking | ICLR | 2024-02-22 | Github | Blog |
| Fine-grained Hallucination Detection and Editing for Language Models | arXiv | 2024-02-21 | Github | Blog |
| Enhanced Hallucination Detection in Neural Machine Translation through Simple Detector Aggregation | arXiv | 2024-02-20 | Github | - |
| Identifying Semantic Induction Heads to Understand In-Context Learning | arXiv | 2024-02-20 | - | - |
| Backward Lens: Projecting Language Model Gradients into the Vocabulary Space | arXiv | 2024-02-20 | - | - |
| Show Me How It's Done: The Role of Explanations in Fine-Tuning Language Models | ACML | 2024-02-12 | - | - |
| Model Editing with Canonical Examples | arXiv | 2024-02-09 | Github | - |
| Opening the AI black box: program synthesis via mechanistic interpretability | arXiv | 2024-02-07 | Github | - |
| INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection | ICLR | 2024-02-06 | - | - |
| In-Context Language Learning: Architectures and Algorithms | arXiv | 2024-01-30 | Github | - |
| Gradient-Based Language Model Red Teaming | EACL | 2024-01-30 | Github | - |
| The Calibration Gap between Model and Human Confidence in Large Language Models | arXiv | 2024-01-24 | - | - |
| Universal Neurons in GPT2 Language Models | arXiv | 2024-01-22 | Github | - |
| The mechanistic basis of data dependence and abrupt learning in an in-context classification task | ICLR | 2024-01-16 | - | - |
| Overthinking the Truth: Understanding how Language Models Process False Demonstrations | ICLR | 2024-01-16 | Github | - |
| Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks | ICLR | 2024-01-16 | - | - |
| Feature emergence via margin maximization: case studies in algebraic tasks | ICLR | 2024-01-16 | - | - |
| Successor Heads: Recurring, Interpretable Attention Heads In The Wild | ICLR | 2024-01-16 | - | - |
| Towards Best Practices of Activation Patching in Language Models: Metrics and Methods | ICLR | 2024-01-16 | - | - |
| A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity | ICML | 2024-01-03 | Github | - |
| Forbidden Facts: An Investigation of Competing Objectives in Llama-2 | ATTRIB@NeurIPS | 2023-12-31 | Github | Blog |
| The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets | arXiv | 2023-12-08 | Github | Blog |
| Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching | ATTRIB@NeurIPS | 2023-12-06 | Github | - |
| Structured World Representations in Maze-Solving Transformers | UniReps@NeurIPS | 2023-12-05 | Github | - |
| Generating Interpretable Networks using Hypernetworks | arXiv | 2023-12-05 | - | - |
| The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks | NeurIPS | 2023-11-21 | Github | - |
| Attribution Patching Outperforms Automated Circuit Discovery | ATTRIB@NeurIPS | 2023-11-20 | Github | - |
| Tracr: Compiled Transformers as a Laboratory for Interpretability | NeurIPS | 2023-11-03 | Github | - |
| How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model | NeurIPS | 2023-11-02 | Github | - |
| Learning Transformer Programs | NeurIPS | 2023-10-31 | Github | - |
| Towards Automated Circuit Discovery for Mechanistic Interpretability | NeurIPS | 2023-10-28 | Github | - |
| Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models | EMNLP | 2023-10-23 | Github | - |
| Inference-Time Intervention: Eliciting Truthful Answers from a Language Model | NeurIPS | 2023-10-20 | Github | - |
| Progress measures for grokking via mechanistic interpretability | ICLR | 2023-10-19 | Github | Blog |
| Copy Suppression: Comprehensively Understanding an Attention Head | arXiv | 2023-10-06 | Github | Blog & Demo |
| Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models | NeurIPS | 2023-09-21 | Github | - |
| Interpretability at Scale: Identifying Causal Mechanisms in Alpaca | NeurIPS | 2023-09-21 | Github | - |
| Emergent Linear Representations in World Models of Self-Supervised Sequence Models | BlackboxNLP@EMNLP | 2023-09-07 | Github | Blog |
| Finding Neurons in a Haystack: Case Studies with Sparse Probing | arXiv | 2023-06-02 | Github | - |
| Efficient Shapley Values Estimation by Amortization for Text Classification | ACL | 2023-05-31 | Github | Video |
| A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations | ICML | 2023-05-24 | Github | - |
| Localizing Model Behavior with Path Patching | arXiv | 2023-05-16 | - | - |
| Language models can explain neurons in language models | OpenAI | 2023-05-09 | - | - |
| N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models | ICLR Workshop | 2023-04-22 | - | - |
| Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | ICLR | 2023-01-20 | Github | - |
| Interpreting Neural Networks through the Polytope Lens | arXiv | 2022-11-22 | - | - |
| Scaling Laws and Interpretability of Learning from Repeated Data | arXiv | 2022-05-21 | - | - |
| In-context Learning and Induction Heads | Anthropic | 2022-03-08 | - | - |
| A Mathematical Framework for Transformer Circuits | Anthropic | 2021-12-22 | - | - |
| Thinking Like Transformers | ICML | 2021-07-19 | Github | Mini Tutorial |

SAE, Dictionary Learning and Superposition

| Title | Venue | Date | Code | Blog |
| --- | --- | --- | --- | --- |
| Sparse Autoencoders Match Supervised Features for Model Steering on the IOI Task | MechInterp@ICML | 2024-07-15 | - | - |
| Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models | MechInterp@ICML | 2024-07-15 | - | - |
| Interpreting Attention Layer Outputs with Sparse Autoencoders | MechInterp@ICML | 2024-06-25 | - | Demo |
| Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning | MechInterp@ICML | 2024-05-24 | Github | - |
| Improving Language Models Trained with Translated Data via Continual Pre-Training and Dictionary Learning Analysis | arXiv | 2024-05-23 | - | - |
| Automatically Identifying Local and Global Circuits with Linear Computation Graphs | arXiv | 2024-05-22 | - | - |
| Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet | Anthropic | 2024-05-21 | - | Demo |
| Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models | arXiv | 2024-05-21 | - | - |
| Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control | arXiv | 2024-05-20 | - | - |
| The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks | arXiv | 2024-05-20 | Github | - |
| Improving Dictionary Learning with Gated Sparse Autoencoders | arXiv | 2024-04-30 | - | - |
| Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers | LessWrong | 2024-04-29 | - | Demo |
| Activation Steering with SAEs | LessWrong | 2024-04-19 | - | - |
| SAE reconstruction errors are (empirically) pathological | LessWrong | 2024-03-29 | - | - |
| Sparse autoencoders find composed features in small toy models | LessWrong | 2024-03-14 | Github | - |
| Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT | LessWrong | 2024-03-05 | Github | - |
| Do sparse autoencoders find "true features"? | LessWrong | 2024-02-12 | - | - |
| Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT | arXiv | 2024-02-19 | - | - |
| Toward A Mathematical Framework for Computation in Superposition | LessWrong | 2024-01-18 | - | - |
| Sparse Autoencoders Work on Attention Layer Outputs | LessWrong | 2024-01-16 | - | Demo |
| Sparse Autoencoders Find Highly Interpretable Features in Language Models | ICLR | 2024-01-16 | Github | - |
| Codebook Features: Sparse and Discrete Interpretability for Neural Networks | arXiv | 2023-10-26 | Github | Demo |
| Towards Monosemanticity: Decomposing Language Models With Dictionary Learning | Anthropic | 2023-10-04 | Github | Demo-1, Demo-2, Tutorial |
| Polysemanticity and Capacity in Neural Networks | arXiv | 2023-07-12 | - | - |
| Distributed Representations: Composition & Superposition | Anthropic | 2023-05-04 | - | - |
| Superposition, Memorization, and Double Descent | Anthropic | 2023-01-05 | - | - |
| Engineering Monosemanticity in Toy Models | arXiv | 2022-11-16 | Github | - |
| Toy Models of Superposition | Anthropic | 2022-09-14 | Github | Demo |
| Softmax Linear Units | Anthropic | 2022-06-27 | - | - |
| Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors | DeeLIO@NAACL | 2021-03-29 | Github | - |
| Zoom In: An Introduction to Circuits | Distill | 2020-03-10 | - | - |
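
For readers new to this sub-area, the sketch below shows the basic sparse-autoencoder recipe that most of these papers build on: a one-hidden-layer autoencoder trained to reconstruct cached model activations under an L1 sparsity penalty. It is a generic PyTorch toy with illustrative names and dimensions (SparseAutoencoder, d_model, d_hidden), not the implementation from any specific paper or from SAE Lens.

```python
# Toy sparse autoencoder (SAE) sketch: reconstruct activations through a ReLU
# feature layer, with an L1 penalty that encourages sparse, interpretable features.
# Illustrative only; names and hyperparameters are not taken from any paper above.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))  # non-negative, hopefully sparse feature activations
        recon = self.decoder(features)          # reconstruction of the original activation vector
        return recon, features

d_model, d_hidden, l1_coeff = 512, 4096, 1e-3
sae = SparseAutoencoder(d_model, d_hidden)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

activations = torch.randn(64, d_model)          # stand-in for a batch of cached LLM activations
optimizer.zero_grad()
recon, features = sae(activations)
loss = ((recon - activations) ** 2).mean() + l1_coeff * features.abs().sum(dim=-1).mean()
loss.backward()
optimizer.step()
print(f"loss={loss.item():.4f}  mean active features={(features > 0).float().sum(dim=-1).mean().item():.1f}")
```

The papers above differ mainly in how this basic recipe is scaled, regularized (e.g. gated SAEs), evaluated, and applied to circuits and steering.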

Interpretability in Vision LLMs

| Title | Venue | Date | Code | Blog |
| --- | --- | --- | --- | --- |
| Dissecting Query-Key Interaction in Vision Transformers | MechInterp@ICML | 2024-06-25 | - | - |
| Describe-and-Dissect: Interpreting Neurons in Vision Networks with Language Models | MechInterp@ICML | 2024-06-25 | - | - |
| Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP | MechInterp@ICML | 2024-06-25 | - | - |
| The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision | MechInterp@ICML | 2024-06-25 | - | - |
| Don’t trust your eyes: on the (un)reliability of feature visualizations | ICML | 2024-06-25 | Github | - |
| What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Noise-free Text-Image Corruption and Evaluation | arXiv | 2024-06-24 | Github | - |
| PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits | XAI4CV@CVPR | 2024-04-09 | Github | - |
| Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE) | arXiv | 2024-02-16 | Github | - |
| Analyzing Vision Transformers for Image Classification in Class Embedding Space | NeurIPS | 2023-09-21 | Github | - |
| Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP | CLVL@ICCV | 2023-08-27 | Github | - |
| Scale Alone Does not Improve Mechanistic Interpretability in Vision Models | NeurIPS | 2023-07-11 | Github | Blog |

Benchmarking Interpretability

| Title | Venue | Date | Code | Blog |
| --- | --- | --- | --- | --- |
| Benchmarking Mental State Representations in Language Models | MechInterp@ICML | 2024-06-25 | - | - |
| A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains | ACL | 2024-05-21 | Dataset | Blog |
| RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations | arXiv | 2024-02-27 | Github | - |
| CausalGym: Benchmarking causal interpretability methods on linguistic tasks | arXiv | 2024-02-19 | Github | - |

Enhancing Interpretability

| Title | Venue | Date | Code | Blog |
| --- | --- | --- | --- | --- |
| Evaluating Brain-Inspired Modular Training in Automated Circuit Discovery for Mechanistic Interpretability | arXiv | 2024-01-08 | - | - |
| Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability | arXiv | 2023-06-06 | Github | - |

Others

| Title | Venue | Date | Code | Blog |
| --- | --- | --- | --- | --- |
| An introduction to graphical tensor notation for mechanistic interpretability | arXiv | 2024-02-02 | - | - |
| Episodic Memory Theory for the Mechanistic Interpretation of Recurrent Neural Networks | arXiv | 2023-10-03 | Github | - |

Other Awesome Interpretability Resources