AlphaLab-USTC/AlphaSteer

AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint

Leheng Sheng*1, Changshuo Shen*2, Weixiang Zhao3, Junfeng Fang1, Xiaohao Liu1, Zhengkai Liang1, Xiang Wang2, An Zhang2†, Tat-Seng Chua1
1National University of Singapore, 2University of Science and Technology of China, 3Harbin Institute of Technology
* Equal contribution. † Corresponding author.

License Python 3.11+

Overview

AlphaSteer Overview

AlphaSteer is a theoretically grounded activation steering method designed to enhance LLM safety without compromising utility. While traditional activation steering approaches face a trade-off between safety and performance, AlphaSteer addresses this challenge through a principled learning approach with dual objectives:

  1. Utility Preservation: Learns to create near-zero steering vectors for benign inputs using null-space constraints
  2. Safety Enhancement: Generates effective refusal direction vectors for malicious prompts through linear regression
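The two objectives above can be illustrated with a small NumPy sketch. This is not the repository's implementation: the function names, the `ridge` parameter, and the convention that activations are stored as columns are all assumptions made for illustration. It builds a projector onto the directions orthogonal to the benign activations, then fits a ridge-regularized linear map inside that subspace so that malicious activations are mapped to a given refusal direction `r`.

```python
import numpy as np


def null_space_projector(H_benign, tol=1e-8):
    """Projector onto directions orthogonal to every benign activation.

    H_benign has shape (d, n_benign), one activation per column (assumed layout).
    """
    U, S, _ = np.linalg.svd(H_benign, full_matrices=True)
    rank = int(np.sum(S > tol * S.max()))
    U_null = U[:, rank:]          # basis orthogonal to the benign column space
    return U_null @ U_null.T      # symmetric (d, d) projector with P @ H_benign ~ 0


def fit_steering_matrix(H_benign, H_malicious, r, ridge=1e-6):
    """Return Delta (d, d) with Delta @ h ~ 0 for benign h and Delta @ h ~ r for malicious h."""
    d = H_benign.shape[0]
    P = null_space_projector(H_benign)
    X = P @ H_malicious                              # malicious activations seen through P
    R = np.repeat(r[:, None], X.shape[1], axis=1)    # target: the same refusal vector per prompt
    A = X @ X.T + ridge * np.eye(d)                  # ridge-regularized Gram matrix (symmetric)
    W = np.linalg.solve(A, X @ R.T).T                # W = R X^T A^{-1}
    return W @ P                                     # steering matrix restricted to the null space
```

In this sketch, the steering vector for an activation `h` would be `Delta @ h`, which is added to `h` at inference. Because `Delta` ends with the projector `P`, benign activations lying in the span of `H_benign` receive a near-zero steering vector by construction, regardless of how the regression is fit.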

Effect on Different Prompt Activations & Performance

PCA Visualization Performance

AlphaSteer steers the activations of malicious prompts toward refusal while leaving those of benign prompts largely unchanged. Traditional activation steering methods, by contrast, struggle to keep benign activations unchanged. As a result, AlphaSteer preserves the model's utility while improving its safety by a large margin.

👉 Quick Start of AlphaSteer

Installation of Dependencies

conda create -n alphasteer python=3.11
conda activate alphasteer
pip install -r requirements.txt

Usage

The alphasteer.sh script automates the process of extracting embeddings, calculating the steering matrix, and generating steered responses for the meta-llama/Llama-3.1-8B-Instruct model.

./scripts/alphasteer.sh

Alternatively, you can download our precomputed steering matrix from this Google Drive link (recommended).

Place the downloaded file in the data/steering_matrix directory, then run only the final generation step:

./scripts/generate.sh

☎️ Contact

Please contact either of the first authors with any questions.
