DIRECTER is an inference-time activation steering method that improves how Large Language Models (LLMs) follow complex instructions while mitigating the common risk of "oversteering."
Activation steering can effectively push a model to adhere to constraints, but it often comes with a trade-off: putting too much emphasis on the instruction can degrade the coherence and quality of the generated text. DIRECTER addresses this by dynamically modulating steering strength at every decoding step.
DIRECTER couples KV cache steering with a plausibility-guided decoding loop. At each step, the method:
- Steers: Tentatively amplifies the "Key" vectors in the KV cache associated with the instruction.
- Checks Plausibility: Compares the steered output distribution against the raw model's distribution.
- Modulates: If the steered output is deemed implausible (deviates too far from the model's natural distribution), DIRECTER progressively reduces the steering strength by removing layers from the intervention set.
This process is guided by a lightweight, one-time Sensitivity Analysis that ranks layers based on their influence, ensuring that the most effective layers are prioritized.
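
Until the official code is available, the loop described above can be illustrated with a minimal, self-contained sketch. It is not the official implementation. Everything named here is a hypothetical stand-in: `steer_step`, `is_plausible`, `toy_logits`, and the `alpha` threshold are invented for illustration. The plausibility test (the steered top token must keep at least `alpha` times the raw model's top probability) and the choice to drop the lowest-ranked layer first are assumptions, and the toy logits function replaces a real model with KV cache steering.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: given the layers being steered, return next-token logits.
LogitsFn = Callable[[Sequence[int]], List[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=lambda i: values[i])


def is_plausible(steered: Sequence[float], raw: Sequence[float], alpha: float) -> bool:
    # ASSUMED criterion: the token preferred under steering must keep at least
    # alpha times the probability of the raw model's most likely token.
    return raw[argmax(steered)] >= alpha * max(raw)


def steer_step(
    logits_fn: LogitsFn, ranked_layers: Sequence[int], alpha: float = 0.1
) -> Tuple[int, List[int]]:
    """Pick one token, shrinking the intervention set until the output is plausible.

    ranked_layers is assumed to be ordered from most to least influential,
    as produced by a one-time sensitivity analysis.
    """
    raw = softmax(logits_fn([]))  # unsteered distribution
    active = list(ranked_layers)
    while active:
        steered = softmax(logits_fn(active))
        if is_plausible(steered, raw, alpha):
            return argmax(steered), active
        active.pop()  # ASSUMPTION: drop the least influential layer first
    return argmax(raw), active  # no steering left: fall back to the raw model


# Toy stand-in for a model: each steered layer nudges tokens 1 and 2 upward.
RAW_LOGITS = [2.0, 1.5, -4.0]


def toy_logits(active_layers: Sequence[int]) -> List[float]:
    n = len(active_layers)
    return [RAW_LOGITS[0], RAW_LOGITS[1] + 0.4 * n, RAW_LOGITS[2] + 2.5 * n]


token, layers = steer_step(toy_logits, ranked_layers=[12, 7, 3])
```

In this toy setup, steering with all three layers selects token 2, which the raw model considers very unlikely, so the lowest-ranked layer is dropped. With the two remaining layers, token 1 is selected and passes the check, so `token` is `1` and `layers` is `[12, 7]`.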
The official implementation is being prepared for public release. Please watch this repository for updates.
