A Mechanistic Interpretability Toolkit for Cross-Layer Transcoder Training and Attribution-Graph Visualization
Updated Apr 16, 2026 (Python)
Dataset and official implementation for "Discursive Circuits: How Do Language Models Understand Discourse Relations?" (EMNLP 2025).
Tracks what happens to the internal "circuitry" (induction heads) of a 2-layer attention-only Transformer when it undergoes domain adaptation from prose to structured Python code.
Open-source EU AI Act Annex IV documentation toolkit. Mechanistic interpretability + circuit discovery for transformers. One function call generates a structured, hash-chained evidence package.
Does Quantization Kill Interpretability? Scaling study across 5 models (124M-2.8B): RTN destroys induction heads in small models, GPTQ preserves them at all scales.
Reverse-engineering neural network internals from scratch in NumPy + PyTorch. A 6-week masterclass: linear representation hypothesis, superposition, sparse autoencoders, transformer circuits & induction heads, activation/path patching & causal scrubbing, and steering a real LM. Fully executed notebooks.
Reproduction of the induction-heads circuit (Olsson et al., 2022) in a 2-layer attention-only transformer trained on synthetic random-boundary repeat sequences, with mechanistic identification by ablation and training-time tracking of circuit formation.
Natural Language Autoencoder (NLA) research prototype inspired by Anthropic’s interpretability work. Implements a scoped approximation of activation verbalization and reconstruction on small open-source LLMs, with quantitative evaluation, baselines, and reproducible local-first experimentation.