Stanford NLP Python library for understanding and improving PyTorch models via interventions
Toolkit for the OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
Steering vectors for transformer language models in PyTorch / Hugging Face
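To illustrate the steering-vector idea in general (this is a minimal sketch using plain PyTorch hooks, not the API of the repository above), the snippet below adds a fixed vector to one transformer block's residual stream during generation. The model name, layer index, vector, and scale are illustrative assumptions; in practice the vector is usually derived from contrastive prompt pairs.

```python
# Minimal steering-vector sketch with a plain PyTorch forward hook.
# Model, layer index, vector, and alpha are illustrative assumptions only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM with the same block layout works similarly
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx = 6                                # assumption: which block's output to steer
steering_vector = torch.randn(model.config.hidden_size)  # placeholder; normally learned or contrastive
alpha = 4.0                                  # steering strength

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] holds the hidden states.
    hidden = output[0] + alpha * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
try:
    ids = tok("The weather today is", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later runs are unsteered
```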
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing neural networks to computer code and discovering new algorithms that generalize out-of-distribution and outperform human-designed algorithms
Sparse and discrete interpretability tool for neural networks
Unified access to Large Language Model modules using NNsight
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Mapping out the "memory" of neural nets with data attribution
Multi-Layer Sparse Autoencoders (ICLR 2025)
[ICLR 25] A novel framework for building intrinsically interpretable LLMs with human-understandable concepts to ensure safety, reliability, transparency, and trustworthiness.
graphpatch is a library for activation patching on PyTorch neural network models.
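For readers new to activation patching, here is a minimal sketch using plain PyTorch forward hooks (not graphpatch's own interface): cache one block's activation from a clean prompt, splice it into a run on a corrupted prompt, and compare next-token predictions. The model, layer, and prompts are illustrative assumptions; the prompts are chosen to tokenize to the same length so the whole activation tensor can be swapped.

```python
# Generic activation-patching sketch with plain PyTorch hooks (not graphpatch's API).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer = model.transformer.h[4]  # assumption: patch the output of block 4
cache = {}

def save_hook(module, inputs, output):
    # Cache the clean run's hidden states (output[0] of the block tuple).
    cache["clean"] = output[0].detach()

def patch_hook(module, inputs, output):
    # Overwrite the corrupted run's hidden states with the cached clean ones.
    return (cache["clean"],) + output[1:]

clean = tok("The Eiffel Tower is in Paris", return_tensors="pt")
corrupt = tok("The Eiffel Tower is in Rome", return_tensors="pt")

h = layer.register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)
h.remove()

h = layer.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**corrupt).logits
h.remove()

with torch.no_grad():
    corrupt_logits = model(**corrupt).logits

# Compare next-token predictions with and without the patched activation.
print(tok.decode(corrupt_logits[0, -1].argmax().item()), "->",
      tok.decode(patched_logits[0, -1].argmax().item()))
```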
PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind)
Official implementation of the paper "RelP: Faithful and Efficient Circuit Discovery via Relevance Patching"
[EMNLP 25] An effective and interpretable weight-editing method for mitigating overly short reasoning in LLMs, and a mechanistic study uncovering how reasoning length is encoded in the model’s representation space.
[ACL'2024 Findings] "Understanding and Patching Compositional Reasoning in LLMs"
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
Repository for "From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance"
[NeurIPS'23 ATTRIB] An efficient framework to generate neuron explanations for LLMs
Exploring Representations and Interventions in Time Series Foundation Models @ ICML 2025