Stanford NLP Python library for understanding and improving PyTorch models via interventions
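As a rough illustration of the kind of intervention such libraries automate, the sketch below zero-ablates two hidden units with a plain PyTorch forward hook and measures the effect on the output. The toy model and the choice of units are illustrative assumptions, not the library's own API.

```python
# Minimal sketch of a causal intervention via a PyTorch forward hook.
# The two-layer toy model and the ablated units are illustrative
# assumptions, not any particular library's interface.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
ablate_units = [3, 7]  # hidden units whose causal effect we want to test

def zero_ablation_hook(module, inputs, output):
    patched = output.clone()
    patched[..., ablate_units] = 0.0  # intervene on the activation
    return patched                    # returned tensor replaces the output

x = torch.randn(8, 16)
with torch.no_grad():
    clean_logits = model(x)

handle = model[1].register_forward_hook(zero_ablation_hook)  # hook the ReLU output
with torch.no_grad():
    ablated_logits = model(x)
handle.remove()

# The causal effect of the ablated units is the change in the output.
print((clean_logits - ablated_logits).abs().mean())
```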
For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
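For context, a sparse autoencoder in this setting learns an overcomplete dictionary of activation features by trading reconstruction error against sparsity. The sketch below is a minimal PyTorch SAE with a reconstruction-plus-L1 objective; the dimensions and sparsity coefficient are illustrative assumptions, not this project's configuration.

```python
# Minimal sparse autoencoder (SAE) sketch for model activations.
# Dimensions and the sparsity coefficient are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse feature codes
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder(d_model=512, d_dict=4096)
acts = torch.randn(64, 512)        # stand-in for cached residual-stream activations
recon, feats = sae(acts)
l1_coeff = 1e-3                    # sparsity penalty weight (assumed value)
loss = (recon - acts).pow(2).mean() + l1_coeff * feats.abs().mean()
loss.backward()
```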
Steering vectors for transformer language models in PyTorch / Hugging Face
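A typical steering setup adds a fixed direction to the residual stream at one layer during generation. The sketch below does this with a forward hook on a Hugging Face GPT-2 block; the layer index, scale, and randomly initialized vector are placeholders for illustration, not a learned steering vector from any particular method.

```python
# Minimal steering-vector sketch: shift the residual stream at one
# transformer block while generating.  Layer index, scale, and the random
# vector are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

layer_idx = 6
alpha = 4.0
steering_vector = torch.randn(model.config.hidden_size)
steering_vector = steering_vector / steering_vector.norm()

def add_steering_vector(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden_states = output[0] + alpha * steering_vector
    return (hidden_states,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering_vector)
inputs = tokenizer("The weather today is", return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=20)
handle.remove()
print(tokenizer.decode(generated[0]))
```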
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs to computer code and discovering new algorithms which generalize out-of-distribution and outperform human-designed algorithms
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
Sparse and discrete interpretability tool for neural networks
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Multi-Layer Sparse Autoencoders (ICLR 2025)
PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)
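Attribution patching approximates the effect of patching a clean activation into a corrupted run with a first-order term, (a_clean − a_corrupt) · ∂metric/∂a, so a single backward pass scores many sites at once. The sketch below applies that estimator to one hidden layer of a toy model; it shows only the basic approximation, not the AtP* refinements from the paper, and the model and metric are assumed for illustration.

```python
# Attribution-patching sketch: first-order estimate of the effect of
# patching clean activations into a corrupted forward pass.
# Toy model and metric are illustrative; this is only the basic estimator.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

def run_with_cache(x):
    """Run the model, caching the hidden activation and keeping its grad."""
    cache = {}
    def cache_hook(module, inputs, output):
        output.retain_grad()          # keep .grad for this intermediate tensor
        cache["hidden"] = output
    handle = model[1].register_forward_hook(cache_hook)
    metric = model(x).sum()           # stand-in for a logit-difference metric
    handle.remove()
    return metric, cache["hidden"]

x_clean = torch.randn(4, 16)
x_corrupt = torch.randn(4, 16)

_, a_clean = run_with_cache(x_clean)           # clean activations (values only)
metric_corrupt, a_corrupt = run_with_cache(x_corrupt)
metric_corrupt.backward()                      # gradients at the corrupted site

# First-order (attribution-patching) estimate of patching each hidden unit.
attribution = ((a_clean.detach() - a_corrupt.detach()) * a_corrupt.grad).sum(dim=0)
print(attribution.topk(5))
```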
graphpatch is a library for activation patching on PyTorch neural network models.
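For reference, the core activation-patching loop is: cache an activation from a clean run, substitute it into a corrupted run, and measure how much the output recovers. The sketch below does this with plain PyTorch hooks on a toy model; it is generic code, not graphpatch's interface.

```python
# Activation-patching sketch: cache a hidden activation from a clean run
# and substitute it into a corrupted run.  Toy model and inputs are
# illustrative; this is generic PyTorch, not graphpatch's API.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
layer = model[1]                      # the site we patch (ReLU output)

x_clean = torch.randn(1, 16)
x_corrupt = torch.randn(1, 16)

# 1) Clean run: cache the activation at the chosen site.
cache = {}
def save_hook(module, inputs, output):
    cache["act"] = output.detach()
h = layer.register_forward_hook(save_hook)
with torch.no_grad():
    clean_out = model(x_clean)
h.remove()

# 2) Corrupted run with the clean activation patched in.
def patch_hook(module, inputs, output):
    return cache["act"]               # returned tensor replaces the output
h = layer.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_out = model(x_corrupt)
h.remove()

# 3) Baseline corrupted run for comparison.
with torch.no_grad():
    corrupt_out = model(x_corrupt)

# How far does patching move the corrupted output toward the clean one?
print((patched_out - corrupt_out).abs().mean(), (clean_out - corrupt_out).abs().mean())
```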
[ICLR 25] A novel framework for building intrinsically interpretable LLMs with human-understandable concepts to ensure safety, reliability, transparency, and trustworthiness.
A small package providing convenient wrappers around nnsight
An effective weight-editing method for mitigating overly short reasoning in LLMs, and a mechanistic study uncovering how reasoning length is encoded in the model’s representation space.
[ACL'2024 Findings] "Understanding and Patching Compositional Reasoning in LLMs"
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
[NeurIPS'23 ATTRIB] An efficient framework to generate neuron explanations for LLMs
MechaMap - Toolkit for Mechanistic Interpretability (MI) Research
This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".
Repository for "From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance"
A framework for conducting interpretability research and for developing an LLM from a synthetic dataset.