A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and fully reproducible.
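As a rough illustration of the approach (not this repository's actual code), a minimal sparse autoencoder over cached residual-stream activations might look like the sketch below. The `SparseAutoencoder` class, the L1 coefficient, and the 16384-feature dictionary are illustrative assumptions; 2048 matches the hidden size of Llama 3.2 1B.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: reconstruct activations through an overcomplete,
    L1-penalised hidden layer so individual features become interpretable."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))  # sparse feature activations
        recon = self.decoder(features)          # reconstruction of the input
        return recon, features

def sae_loss(x, recon, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    mse = (recon - x).pow(2).mean()
    l1 = features.abs().mean()
    return mse + l1_coeff * l1

# One illustrative training step on a batch of cached activations.
sae = SparseAutoencoder(d_model=2048, d_hidden=16384)  # 2048 = Llama 3.2 1B hidden size
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 2048)                           # stand-in for real activations
opt.zero_grad()
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
loss.backward()
opt.step()
```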
This project explores methods to detect and mitigate jailbreak behaviors in Large Language Models (LLMs). By analyzing activation patterns, particularly in deeper layers, we identify distinct differences between compliant and non-compliant responses and use them to uncover a jailbreak "direction." Using this insight, we develop intervention strategies that modify model activations along this direction to mitigate jailbreaks.
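One common way to realize such an intervention, sketched below under assumed names (`jailbreak_direction`, `make_ablation_hook`), is to take the difference of mean activations between the two response classes and then remove that direction from the hidden states with a forward hook. This is a generic difference-of-means sketch, not necessarily the project's exact method.

```python
import torch

def jailbreak_direction(compliant_acts: torch.Tensor,
                        noncompliant_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between the two response classes.
    Both inputs are (num_examples, d_model) activations from one layer."""
    direction = noncompliant_acts.mean(dim=0) - compliant_acts.mean(dim=0)
    return direction / direction.norm()

def make_ablation_hook(direction: torch.Tensor):
    """Forward hook that removes the component of the hidden state
    along the jailbreak direction (projection ablation)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        proj = (hidden @ direction).unsqueeze(-1) * direction
        hidden = hidden - proj
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Usage (hypothetical): attach the hook to a deep decoder layer.
# handle = model.model.layers[20].register_forward_hook(make_ablation_hook(direction))
```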
Automates attribution-graph analysis via probe prompting: circuit-trace a prompt, auto-generate concept probes, profile feature activations, cluster supernodes.
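The "cluster supernodes" step could, for instance, group features whose activation profiles across the concept probes look similar. The sketch below uses a hypothetical `cluster_supernodes` helper built on scikit-learn's agglomerative clustering; it illustrates only that step, not the tool's actual implementation.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_supernodes(activation_profiles: np.ndarray, n_clusters: int = 10):
    """activation_profiles: (num_features, num_probe_prompts) matrix where each
    row is one feature's mean activation across the concept probes."""
    # Normalise each profile so clustering compares activation patterns, not magnitudes.
    norms = np.linalg.norm(activation_profiles, axis=1, keepdims=True)
    profiles = activation_profiles / np.clip(norms, 1e-8, None)
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(profiles)
    return {c: np.where(labels == c)[0] for c in range(n_clusters)}

# Example: 200 features profiled over 30 concept probes, grouped into 8 supernodes.
supernodes = cluster_supernodes(np.random.rand(200, 30), n_clusters=8)
```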
🔬 HexRay: An Open-Source Neuroscope for AI — Tracing Tokens, Neurons, and Decisions for Frontier AI Research, Safety, and Security