
Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning


Implementation for our ICML 2025 paper Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning.

VAA is a safety alignment method that upweights and reinforces vulnerable data to enhance safety retention during customized fine-tuning.

Getting Started

1. Environment Setup

Create the Conda environment using the provided environment.yml:

conda env create -f environment.yml

If you encounter missing dependencies, please install them manually.
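
For reference, a typical setup after creating the environment looks like the following. The environment name vaa is an assumption (use the name: field in environment.yml), and the package name is a placeholder for whatever dependency is reported missing.

# Activate the environment (replace "vaa" with the name given in environment.yml)
conda activate vaa
# Install a missing dependency manually (package name is a placeholder)
pip install <missing-package>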

2. Data and Resources

Prepare experimental datasets using:

bash script/build_dataset.sh

Models should be downloaded and placed in the paths specified in the scripts.
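
For example, models can be fetched with huggingface-cli. The model identifiers and target directories below are assumptions for illustration only; match them to the paths referenced in the scripts under script/.

# Example only: adjust repo IDs and --local-dir to the paths used in the scripts
huggingface-cli download Qwen/Qwen2.5-7B-Instruct --local-dir models/qwen
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir models/llama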

3. Safety Alignment

Run the following scripts to perform safety alignment with VAA and baselines:

# model = qwen | llama
bash script/vaa_pipeline_{model}.sh   # VAA pipeline (also supports ERM and Vaccine)
bash script/repnoise_{model}.sh       # RepNoise baseline
bash script/booster_{model}.sh        # Booster baseline
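
For example, to run the full VAA pipeline on the Llama model:

bash script/vaa_pipeline_llama.sh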

4. Harmful Fine-Tuning & Evaluation

Run the following script to perform harmful fine-tuning (HFT) and evaluate the resulting models on various downstream datasets.

These steps are also integrated into the main training pipeline.

bash script/run_all_downstream.sh

Citation

If you find our work helpful, please cite:

@misc{chen2025VAA,
      title={Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning}, 
      author={Liang Chen and Xueting Han and Li Shen and Jing Bai and Kam-Fai Wong},
      year={2025},
      eprint={2506.03850},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.03850}, 
}

This implementation partially builds upon the Vaccine framework. We thank the authors for releasing their codebase.

@article{huang2024vaccine,
  title={Vaccine: Perturbation-aware alignment for large language model},
  author={Huang, Tiansheng and Hu, Sihao and Liu, Ling},
  journal={arXiv preprint arXiv:2402.01109},
  year={2024}
}
