[ICLR 2025] General-purpose activation steering library
Updated Sep 18, 2025 - Python
Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)
We study whether categorical refusal tokens enable controllable and interpretable safety behavior in language models.
A collection of symbolic recursion resources for integrity and sovereignty.