Stanford NLP Python library for understanding and improving PyTorch models via interventions
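As a rough illustration of the kind of intervention such libraries automate, the sketch below zero-ablates two hidden units with a plain PyTorch forward hook and measures the effect on the output. The toy model and the choice of units are illustrative assumptions, not the library's own API.

```python
# Minimal sketch of a causal intervention via a PyTorch forward hook.
# The two-layer toy model and the ablated units are illustrative
# assumptions, not any particular library's interface.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
ablate_units = [3, 7]  # hidden units whose causal effect we want to test

def zero_ablation_hook(module, inputs, output):
    patched = output.clone()
    patched[..., ablate_units] = 0.0  # intervene on the activation
    return patched                    # returned tensor replaces the output

x = torch.randn(8, 16)
with torch.no_grad():
    clean_logits = model(x)

handle = model[1].register_forward_hook(zero_ablation_hook)  # hook the ReLU output
with torch.no_grad():
    ablated_logits = model(x)
handle.remove()

# The causal effect of the ablated units is the change in the output.
print((clean_logits - ablated_logits).abs().mean())
```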
For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
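For context, a sparse autoencoder in this setting learns an overcomplete dictionary of activation features by trading reconstruction error against sparsity. The sketch below is a minimal PyTorch SAE with a reconstruction-plus-L1 objective; the dimensions and sparsity coefficient are illustrative assumptions, not this project's configuration.

```python
# Minimal sparse autoencoder (SAE) sketch for model activations.
# Dimensions and the sparsity coefficient are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse feature codes
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder(d_model=512, d_dict=4096)
acts = torch.randn(64, 512)        # stand-in for cached residual-stream activations
recon, feats = sae(acts)
l1_coeff = 1e-3                    # sparsity penalty weight (assumed value)
loss = (recon - acts).pow(2).mean() + l1_coeff * feats.abs().mean()
loss.backward()
```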
Steering vectors for transformer language models in PyTorch / Hugging Face
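A typical steering setup adds a fixed direction to the residual stream at one layer during generation. The sketch below does this with a forward hook on a Hugging Face GPT-2 block; the layer index, scale, and randomly initialized vector are placeholders for illustration, not a learned steering vector from any particular method.

```python
# Minimal steering-vector sketch: shift the residual stream at one
# transformer block while generating.  Layer index, scale, and the random
# vector are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

layer_idx = 6
alpha = 4.0
steering_vector = torch.randn(model.config.hidden_size)
steering_vector = steering_vector / steering_vector.norm()

def add_steering_vector(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden_states = output[0] + alpha * steering_vector
    return (hidden_states,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering_vector)
inputs = tokenizer("The weather today is", return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=20)
handle.remove()
print(tokenizer.decode(generated[0]))
```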
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs to computer code and discovering new algorithms which generalize out-of-distribution and outperform human-designed algorithms
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
Sparse and discrete interpretability tool for neural networks
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Multi-Layer Sparse Autoencoders (ICLR 2025)
PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)
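Attribution patching approximates the effect of patching a clean activation into a corrupted run with a first-order term, (a_clean − a_corrupt) · ∂metric/∂a, so a single backward pass scores many sites at once. The sketch below applies that estimator to one hidden layer of a toy model; it shows only the basic approximation, not the AtP* refinements from the paper, and the model and metric are assumed for illustration.

```python
# Attribution-patching sketch: first-order estimate of the effect of
# patching clean activations into a corrupted forward pass.
# Toy model and metric are illustrative; this is only the basic estimator.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

def run_with_cache(x):
    """Run the model, caching the hidden activation and keeping its grad."""
    cache = {}
    def cache_hook(module, inputs, output):
        output.retain_grad()          # keep .grad for this intermediate tensor
        cache["hidden"] = output
    handle = model[1].register_forward_hook(cache_hook)
    metric = model(x).sum()           # stand-in for a logit-difference metric
    handle.remove()
    return metric, cache["hidden"]

x_clean = torch.randn(4, 16)
x_corrupt = torch.randn(4, 16)

_, a_clean = run_with_cache(x_clean)           # clean activations (values only)
metric_corrupt, a_corrupt = run_with_cache(x_corrupt)
metric_corrupt.backward()                      # gradients at the corrupted site

# First-order (attribution-patching) estimate of patching each hidden unit.
attribution = ((a_clean.detach() - a_corrupt.detach()) * a_corrupt.grad).sum(dim=0)
print(attribution.topk(5))
```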
graphpatch is a library for activation patching on PyTorch neural network models.
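For reference, the core activation-patching loop is: cache an activation from a clean run, substitute it into a corrupted run, and measure how much the output recovers. The sketch below does this with plain PyTorch hooks on a toy model; it is generic code, not graphpatch's interface.

```python
# Activation-patching sketch: cache a hidden activation from a clean run
# and substitute it into a corrupted run.  Toy model and inputs are
# illustrative; this is generic PyTorch, not graphpatch's API.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
layer = model[1]                      # the site we patch (ReLU output)

x_clean = torch.randn(1, 16)
x_corrupt = torch.randn(1, 16)

# 1) Clean run: cache the activation at the chosen site.
cache = {}
def save_hook(module, inputs, output):
    cache["act"] = output.detach()
h = layer.register_forward_hook(save_hook)
with torch.no_grad():
    clean_out = model(x_clean)
h.remove()

# 2) Corrupted run with the clean activation patched in.
def patch_hook(module, inputs, output):
    return cache["act"]               # returned tensor replaces the output
h = layer.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_out = model(x_corrupt)
h.remove()

# 3) Baseline corrupted run for comparison.
with torch.no_grad():
    corrupt_out = model(x_corrupt)

# How far does patching move the corrupted output toward the clean one?
print((patched_out - corrupt_out).abs().mean(), (clean_out - corrupt_out).abs().mean())
```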
[ICLR 25] A novel framework for building intrinsically interpretable LLMs with human-understandable concepts to ensure safety, reliability, transparency, and trustworthiness.
A small package providing convenient wrappers around nnsight
An effective weight-editing method for mitigating overly short reasoning in LLMs, and a mechanistic study uncovering how reasoning length is encoded in the model’s representation space.
[ACL'2024 Findings] "Understanding and Patching Compositional Reasoning in LLMs"
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
[NeurIPS'23 ATTRIB] An efficient framework to generate neuron explanations for LLMs
MechaMap - Toolkit for Mechanistic Interpretability (MI) Research
This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".
Repository for "From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance"
A framework for conducting interpretability research and for developing an LLM from a synthetic dataset.