AlphaSteer is a theoretically grounded activation steering method designed to enhance LLM safety without compromising utility. While traditional activation steering approaches face a trade-off between safety and performance, AlphaSteer addresses this challenge through a principled learning approach with dual objectives:
- Utility Preservation: Learns to create near-zero steering vectors for benign inputs using null-space constraints
- Safety Enhancement: Generates effective refusal direction vectors for malicious prompts through linear regression
AlphaSteer steers the activations of malicious prompts toward refusal while leaving those of benign prompts largely unchanged. Traditional activation steering methods, by contrast, struggle to leave benign prompts unaffected. As a result, AlphaSteer preserves utility while improving the model's safety by a large margin.
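The two objectives above can be illustrated with a minimal NumPy sketch. This is not the repository's implementation: the function names, the eigenvalue threshold `tol`, the ridge strength, and the convention that activations are stored as columns are all assumptions made for illustration. The idea is to build a projector onto the (approximate) null space of benign activations, so that benign inputs map to near-zero steering vectors, and then fit a regularized linear regression that maps the projected malicious activations onto a given refusal direction.

```python
import numpy as np


def null_space_projector(H_benign, tol=1e-6):
    """Projector onto directions that benign activations do not occupy.

    H_benign has shape (d, n_benign); each column is one activation.
    `tol` is a hypothetical relative threshold, not a value from the source.
    """
    # Second-moment matrix of the benign activations, shape (d, d).
    second_moment = H_benign @ H_benign.T
    eigvals, eigvecs = np.linalg.eigh(second_moment)
    # Eigenvectors with (near-)zero eigenvalues span the approximate null space.
    null_vecs = eigvecs[:, eigvals < tol * eigvals.max()]
    return null_vecs @ null_vecs.T


def fit_steering_matrix(H_benign, H_malicious, refusal_dir, ridge=1e-3):
    """Return W (d, d) so that W @ h is ~0 for benign h and ~refusal_dir for malicious h."""
    d = H_benign.shape[0]
    P = null_space_projector(H_benign)
    # Malicious activations with their benign-subspace component removed.
    X = P @ H_malicious
    # Regression target: the same refusal direction for every malicious sample.
    R = np.tile(refusal_dir[:, None], (1, X.shape[1]))
    # Ridge regression in closed form: Delta = R X^T (X X^T + ridge * I)^{-1}.
    A = X @ X.T + ridge * np.eye(d)
    Delta = np.linalg.solve(A, X @ R.T).T
    # Compose with the projector so benign inputs are (almost) unaffected.
    return Delta @ P
```

Under these assumptions, the steering vector for an activation `h` would be `W @ h`, which could then be added to `h` with a chosen strength. Because `W` ends with the projector `P`, any `h` lying in the span of the benign activations yields a steering vector close to zero.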
conda create -n alphasteer python=3.11
conda activate alphasteer
pip install -r requirements.txt
The alphasteer.sh script automates the process of extracting embeddings, calculating the steering matrix, and generating steered responses for the meta-llama/Llama-3.1-8B-Instruct model.
./scripts/alphasteer.sh
Alternatively, you can download our steering matrix directly from this Google Drive link (recommended).
Place it in the data/steering_matrix directory, then run the final generation step:
./scripts/generate.sh
For questions, please contact either of the first authors:
- Leheng Sheng, leheng.sheng@u.nus.edu
- Changshuo Shen, stephen_shen@mail.ustc.edu.cn