A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and fully reproducible.
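The entry above describes training sparse autoencoders (SAEs) on LLM activations. As a minimal illustrative sketch (not the repository's actual code), an SAE's forward pass is a ReLU encoder into an overcomplete feature space plus a linear decoder, trained with a reconstruction loss and an L1 sparsity penalty; the dimensions and random weights below are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: d_model for the residual-stream activation,
# d_sae for the overcomplete feature dictionary. A real pipeline
# learns these weights; here they are random, only to show shapes.
d_model, d_sae = 16, 64
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation into sparse features, then reconstruct it."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU encoder -> nonnegative sparse features
    x_hat = f @ W_dec + b_dec               # linear decoder
    return f, x_hat

x = rng.normal(size=d_model)                # stand-in for one model activation
f, x_hat = sae_forward(x)
recon_loss = np.mean((x - x_hat) ** 2)      # reconstruction term
l1_penalty = np.abs(f).sum()                # sparsity term (L1 on features)
loss = recon_loss + 1e-3 * l1_penalty       # 1e-3 is an arbitrary sparsity coefficient
```

The learned decoder rows then serve as an interpretable dictionary: each feature corresponds to one decoder direction in activation space.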
Updated Mar 23, 2025 - Python
Fast explainable AI (XAI) with interactions at large scale. SPEX can help you understand your LLM's output, even with a long context!
This project explores methods to detect and mitigate jailbreak behaviors in Large Language Models (LLMs). By analyzing activation patterns—particularly in deeper layers—we identify distinct differences between compliant and non-compliant responses to uncover a jailbreak "direction." Using this insight, we develop intervention strategies that modify …
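One common way to operationalize the idea above (a sketch of the general technique, not necessarily this project's implementation) is to estimate the "direction" as a difference of mean activations between the two response classes, then intervene by projecting it out of a layer's activations; the data below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # hypothetical hidden-state dimension

# Synthetic stand-ins for deep-layer activations collected from
# compliant vs. non-compliant (jailbroken) responses; the +2.0 shift
# fakes the class separation a real model might show.
compliant = rng.normal(size=(100, d))
jailbroken = rng.normal(size=(100, d)) + 2.0

# Difference-of-means estimate of the jailbreak direction, normalized.
direction = jailbroken.mean(axis=0) - compliant.mean(axis=0)
direction /= np.linalg.norm(direction)

def ablate(h, v):
    """Remove the component of activation h along unit direction v."""
    return h - (h @ v) * v

h = rng.normal(size=d)          # one activation to intervene on
h_edited = ablate(h, direction)  # has zero component along the direction
```

After ablation the activation carries no signal along the estimated direction, which is the basis for the intervention strategies the entry describes.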
CE-Bench: A Contrastive Evaluation Benchmark of LLM Interpretability with Sparse Autoencoders
AI research portfolio bridging technical rigor and humanistic inquiry through the Eigen-Koan Matrix, Codex Illuminata, and specialized metaprompts for diverse interaction styles