| Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions | EMNLP | 2024-10-23 | - | - |
| Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models | EMNLP | 2024-10-04 | Github | - |
| How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | EMNLP | 2024-10-01 | Github | - |
| Information Flow Routes: Automatically Interpreting Language Models at Scale | EMNLP | 2024-10-01 | Github | - |
| MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model | EMNLP | 2024-10-01 | Github | - |
| Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis | EMNLP | 2024-09-12 | Github | - |
| Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models | - | 2024-08-05 | Github | - |
| Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically | MechInterp@ICML | 2024-07-15 | - | - |
| Compact Proofs of Model Performance via Mechanistic Interpretability | MechInterp@ICML | 2024-07-15 | Github | - |
| Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks | MechInterp@ICML | 2024-07-15 | - | - |
| How Do Llamas Process Multilingual Text? A Latent Exploration through Activation Patching | MechInterp@ICML | 2024-07-15 | - | - |
| Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models | MechInterp@ICML | 2024-07-15 | - | - |
| What Makes and Breaks Safety Fine-tuning? Mechanistic Study | MechInterp@ICML | 2024-07-15 | - | - |
| Using Degeneracy in the Loss Landscape for Mechanistic Interpretability | MechInterp@ICML | 2024-07-15 | - | - |
| Loss in the Crowd: Hidden Breakthroughs in Language Model Training | MechInterp@ICML | 2024-07-15 | - | - |
| Robust Knowledge Unlearning via Mechanistic Localizations | MechInterp@ICML | 2024-07-15 | - | - |
| Language Models Linearly Represent Sentiment | MechInterp@ICML | 2024-07-15 | - | - |
| Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms | MechInterp@ICML | 2024-07-15 | Github | - |
| Learning and Unlearning of Fabricated Knowledge in Language Models | MechInterp@ICML | 2024-07-15 | - | - |
| Faithful and Fast Influence Function via Advanced Sampling | MechInterp@ICML | 2024-07-15 | - | - |
| Hypothesis Testing the Circuit Hypothesis in LLMs | MechInterp@ICML | 2024-07-15 | - | - |
| The Geometry of Categorical and Hierarchical Concepts in Large Language Models | MechInterp@ICML | 2024-07-15 | Github | - |
| InversionView: A General-Purpose Method for Reading Information from Neural Activations | MechInterp@ICML | 2024-07-15 | Github | - |
| Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks | MechInterp@ICML | 2024-07-15 | - | - |
| Functional Faithfulness in the Wild: Circuit Discovery with Differentiable Computation Graph Pruning | arXiv | 2024-07-04 | - | - |
| Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation | arXiv | 2024-07-01 | Github | - |
| Recovering the Pre-Fine-Tuning Weights of Generative Models | ICML | 2024-07-01 | Github | Blog |
| Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs | arXiv | 2024-06-28 | Github | Blog |
| Observable Propagation: Uncovering Feature Vectors in Transformers | ICML | 2024-06-25 | Github | - |
| Multi-property Steering of Large Language Models with Dynamic Activation Composition | arXiv | 2024-06-25 | Github | - |
| What Do the Circuits Mean? A Knowledge Edit View | arXiv | 2024-06-25 | - | - |
| Confidence Regulation Neurons in Language Models | arXiv | 2024-06-24 | - | - |
| Compact Proofs of Model Performance via Mechanistic Interpretability | arXiv | 2024-06-24 | Github | - |
| Preference Tuning For Toxicity Mitigation Generalizes Across Languages | arXiv | 2024-06-23 | Github | - |
| Unlocking the Future: Exploring Look-Ahead Planning Mechanistic Interpretability in Large Language Models | arXiv | 2024-06-23 | - | - |
| Estimating Knowledge in Large Language Models Without Generating a Single Token | arXiv | 2024-06-18 | Github | - |
| Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations | arXiv | 2024-06-17 | - | - |
| Transcoders Find Interpretable LLM Feature Circuits | MechInterp@ICML | 2024-06-17 | Github | - |
| Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue | arXiv | 2024-06-16 | Github | - |
| Context versus Prior Knowledge in Language Models | ACL | 2024-06-16 | Github | - |
| Talking Heads: Understanding Inter-layer Communication in Transformer Language Models | arXiv | 2024-06-13 | - | - |
| MambaLRP: Explaining Selective State Space Sequence Models | arXiv | 2024-06-11 | Github | - |
| Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models | ICML | 2024-06-06 | Github | Blog |
| Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals | ACL | 2024-06-06 | Github | - |
| Learned feature representations are biased by complexity, learning order, position, and more | arXiv | 2024-06-06 | Demo | - |
| Iteration Head: A Mechanistic Study of Chain-of-Thought | arXiv | 2024-06-05 | - | - |
| Activation Addition: Steering Language Models Without Optimization | arXiv | 2024-06-04 | Code | - |
| Interpretability Illusions in the Generalization of Simplified Models | arXiv | 2024-06-04 | - | - |
| SyntaxShap: Syntax-aware Explainability Method for Text Generation | arXiv | 2024-06-03 | Github | Blog |
| Calibrating Reasoning in Language Models with Internal Consistency | arXiv | 2024-05-29 | - | - |
| Black-Box Access is Insufficient for Rigorous AI Audits | FAccT | 2024-05-29 | - | - |
| Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting | arXiv | 2024-05-28 | - | - |
| From Neurons to Neutrons: A Case Study in Interpretability | ICML | 2024-05-27 | Github | - |
| Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization | MechInterp@ICML | 2024-05-27 | Github | - |
| Explorations of Self-Repair in Language Models | ICML | 2024-05-26 | Github | - |
| Emergence of a High-Dimensional Abstraction Phase in Language Transformers | arXiv | 2024-05-24 | - | - |
| Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions | arXiv | 2024-05-23 | Github | - |
| Not All Language Model Features Are Linear | arXiv | 2024-05-23 | Github | - |
| Using Degeneracy in the Loss Landscape for Mechanistic Interpretability | arXiv | 2024-05-20 | - | - |
| Your Transformer is Secretly Linear | arXiv | 2024-05-19 | Github | - |
| Are self-explanations from Large Language Models faithful? | ACL | 2024-05-16 | Github | - |
| Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models | arXiv | 2024-05-14 | - | - |
| Steering Llama 2 via Contrastive Activation Addition | arXiv | 2024-05-07 | Github | - |
| How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability | AISTATS | 2024-05-07 | Github | - |
| How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning | arXiv | 2024-05-06 | Github | - |
| Circuit Component Reuse Across Tasks in Transformer Language Models | ICLR | 2024-05-06 | Github | - |
| LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools and Self-Explanations | HCI+NLP@NAACL | 2024-04-24 | Github | - |
| How to use and interpret activation patching | arXiv | 2024-04-23 | - | - |
| Understanding Addition in Transformers | arXiv | 2024-04-23 | - | - |
| Towards Uncovering How Large Language Model Works: An Explainability Perspective | arXiv | 2024-04-15 | - | - |
| What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation | ICML | 2024-04-10 | Github | - |
| Does Transformer Interpretability Transfer to RNNs? | arXiv | 2024-04-09 | - | - |
| Locating and Editing Factual Associations in Mamba | arXiv | 2024-04-04 | Github | Demo |
| Eliciting Latent Knowledge from Quirky Language Models | ME-FoMo@ICLR | 2024-04-03 | - | - |
| Do language models plan ahead for future tokens? | arXiv | 2024-04-01 | - | - |
| Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models | arXiv | 2024-03-31 | Github | Demo |
| Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms | arXiv | 2024-03-26 | - | - |
| What does the Knowledge Neuron Thesis Have to do with Knowledge? | ICLR | 2024-03-16 | Github | - |
| Language Models Represent Space and Time | ICLR | 2024-03-04 | Github | - |
| AtP*: An efficient and scalable method for localizing LLM behaviour to components | arXiv | 2024-03-01 | - | - |
| A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task | arXiv | 2024-02-28 | - | - |
| Function Vectors in Large Language Models | ICLR | 2024-02-25 | Github | Blog |
| A Language Model's Guide Through Latent Space | arXiv | 2024-02-22 | - | - |
| Interpreting Shared Circuits for Ordered Sequence Prediction in a Large Language Model | arXiv | 2024-02-22 | - | - |
| Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking | ICLR | 2024-02-22 | Github | Blog |
| Fine-grained Hallucination Detection and Editing for Language Models | arXiv | 2024-02-21 | Github | Blog |
| Enhanced Hallucination Detection in Neural Machine Translation through Simple Detector Aggregation | arXiv | 2024-02-20 | Github | - |
| Identifying Semantic Induction Heads to Understand In-Context Learning | arXiv | 2024-02-20 | - | - |
| Backward Lens: Projecting Language Model Gradients into the Vocabulary Space | arXiv | 2024-02-20 | - | - |
| Show Me How It's Done: The Role of Explanations in Fine-Tuning Language Models | ACML | 2024-02-12 | - | - |
| Model Editing with Canonical Examples | arXiv | 2024-02-09 | Github | - |
| Opening the AI black box: program synthesis via mechanistic interpretability | arXiv | 2024-02-07 | Github | - |
| INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection | ICLR | 2024-02-06 | - | - |
| In-Context Language Learning: Architectures and Algorithms | arXiv | 2024-01-30 | Github | - |
| Gradient-Based Language Model Red Teaming | EACL | 2024-01-30 | Github | - |
| The Calibration Gap between Model and Human Confidence in Large Language Models | arXiv | 2024-01-24 | - | - |
| Universal Neurons in GPT2 Language Models | arXiv | 2024-01-22 | Github | - |
| The mechanistic basis of data dependence and abrupt learning in an in-context classification task | ICLR | 2024-01-16 | - | - |
| Overthinking the Truth: Understanding how Language Models Process False Demonstrations | ICLR | 2024-01-16 | Github | - |
| Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks | ICLR | 2024-01-16 | - | - |
| Feature emergence via margin maximization: case studies in algebraic tasks | ICLR | 2024-01-16 | - | - |
| Successor Heads: Recurring, Interpretable Attention Heads In The Wild | ICLR | 2024-01-16 | - | - |
| Towards Best Practices of Activation Patching in Language Models: Metrics and Methods | ICLR | 2024-01-16 | - | - |
| A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity | ICML | 2024-01-03 | Github | - |
| Forbidden Facts: An Investigation of Competing Objectives in Llama-2 | ATTRIB@NeurIPS | 2023-12-31 | Github | Blog |
| The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets | arXiv | 2023-12-08 | Github | Blog |
| Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching | ATTRIB@NeurIPS | 2023-12-06 | Github | - |
| Structured World Representations in Maze-Solving Transformers | UniReps@NeurIPS | 2023-12-05 | Github | - |
| Generating Interpretable Networks using Hypernetworks | arXiv | 2023-12-05 | - | - |
| The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks | NeurIPS | 2023-11-21 | Github | - |
| Attribution Patching Outperforms Automated Circuit Discovery | ATTRIB@NeurIPS | 2023-11-20 | Github | - |
| Tracr: Compiled Transformers as a Laboratory for Interpretability | NeurIPS | 2023-11-03 | Github | - |
| How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model | NeurIPS | 2023-11-02 | Github | - |
| Learning Transformer Programs | NeurIPS | 2023-10-31 | Github | - |
| Towards Automated Circuit Discovery for Mechanistic Interpretability | NeurIPS | 2023-10-28 | Github | - |
| Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models | EMNLP | 2023-10-23 | Github | - |
| Inference-Time Intervention: Eliciting Truthful Answers from a Language Model | NeurIPS | 2023-10-20 | Github | - |
| Progress measures for grokking via mechanistic interpretability | ICLR | 2023-10-19 | Github | Blog |
| Copy Suppression: Comprehensively Understanding an Attention Head | arXiv | 2023-10-06 | Github | Blog & Demo |
| Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models | NeurIPS | 2023-09-21 | Github | - |
| Interpretability at Scale: Identifying Causal Mechanisms in Alpaca | NeurIPS | 2023-09-21 | Github | - |
| Emergent Linear Representations in World Models of Self-Supervised Sequence Models | BlackboxNLP@EMNLP | 2023-09-07 | Github | Blog |
| Finding Neurons in a Haystack: Case Studies with Sparse Probing | arXiv | 2023-06-02 | Github | - |
| Efficient Shapley Values Estimation by Amortization for Text Classification | ACL | 2023-05-31 | Github | Video |
| A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations | ICML | 2023-05-24 | Github | - |
| Localizing Model Behavior with Path Patching | arXiv | 2023-05-16 | - | - |
| Language models can explain neurons in language models | OpenAI | 2023-05-09 | - | - |
| N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models | ICLR Workshop | 2023-04-22 | - | - |
| Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | ICLR | 2023-01-20 | Github | - |
| Interpreting Neural Networks through the Polytope Lens | arXiv | 2022-11-22 | - | - |
| Scaling Laws and Interpretability of Learning from Repeated Data | arXiv | 2022-05-21 | - | - |
| In-context Learning and Induction Heads | Anthropic | 2022-03-08 | - | - |
| A Mathematical Framework for Transformer Circuits | Anthropic | 2021-12-22 | - | - |
| Thinking Like Transformers | ICML | 2021-07-19 | Github | Mini Tutorial |