A research tool for studying how deception emerges in multi-agent LLM systems and detecting it through activation analysis.
Topics: alignment, gemma, sparse-autoencoders, multi-agent-systems, ai-safety, emergent-behavior, interpretability, deception-detection, activation-analysis, mechanistic-interpretability, llm-agents, gemma-2b, gemma-scope, transformer-lens, linear-probes
Updated Jan 11, 2026 · Python
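Given the `linear-probes` and `activation-analysis` topics, the detection step might look like the minimal sketch below: fit a linear probe on cached activations labeled honest vs. deceptive. Everything here is an illustrative assumption, not the repository's actual code — the difference-of-means probe, all function names, and the synthetic data standing in for real Gemma activations (which would come from something like TransformerLens caching).

```python
import numpy as np

def fit_mean_diff_probe(acts: np.ndarray, labels: np.ndarray):
    """Fit a difference-of-means linear probe on activation vectors.

    acts: (n_samples, d_model) activations, e.g. one residual-stream layer
          per rollout (extraction assumed done elsewhere).
    labels: 0 = honest rollout, 1 = deceptive rollout.
    """
    direction = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    threshold = float((acts @ direction).mean())
    return direction, threshold

def predict(acts: np.ndarray, direction: np.ndarray, threshold: float):
    """Classify each sample by projecting onto the probe direction."""
    return (acts @ direction > threshold).astype(int)

# Synthetic stand-in for real activations: deceptive samples are shifted
# along a hidden "deception direction" in activation space.
rng = np.random.default_rng(0)
d_model = 64
hidden = rng.normal(size=d_model)
honest = rng.normal(size=(200, d_model))
deceptive = rng.normal(size=(200, d_model)) + 2.0 * hidden
acts = np.vstack([honest, deceptive])
labels = np.repeat([0, 1], 200)

direction, threshold = fit_mean_diff_probe(acts, labels)
acc = (predict(acts, direction, threshold) == labels).mean()  # high on this toy data
```

A difference-of-means direction is one of the simplest probes; a trained logistic-regression probe or an SAE-feature readout (per the `sparse-autoencoders` and `gemma-scope` topics) would slot into the same fit/predict interface.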