Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets

Ning Lu, Shengcai Liu📧, Jiahao Wu, Weiyu Chen, Zhirui Zhang, Yew-Soon Ong, Qi Wang, Ke Tang

This repository contains the official implementation of the Safe Delta algorithm, introduced in our ICML 2025 paper Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets.

📚 Overview

📖 Introduction
✨ Getting Started
🔧 Usage
📃 Evaluation
🔄 Other Models
🔖 Citation

📖Introduction

Safe Delta is a safety-aware post-training method that preserves safety alignment regardless of finetuning datasets.

Key Highlights:

New Problem: Identifying the challenge of preserving safety across different datasets.
New Approach: First safety-aware parameter modification defense method.
Rigorous Theoretical and Empirical Evaluation. Use the code in this repo to reproduce our results.

✨Getting Started

Installation

You can install Safe Delta dependencies by running the following commands:

conda create -n safedelta python==3.11
conda activate safedelta

pip install -r requirements.txt

pip install flash-attn==2.7.2.post1 --no-build-isolation
pip install vllm==0.7.3 # for fast evaluation

Repo Structure

This repository includes the following key files and directories:

llama2: Codes for fine-tuning, applying Safe Delta defense and evaluation on llama2. For other models, follow the instruction in next section.
- scripts: Scripts for end-to-end pipeline for fine-tuning, applying the Safe Delta, and evaluation.
- finetuning.py: Code for fine-tuning.
- run_safedelta.py: Code to apply the Safe Delta algorithm after fine-tuning.
- finetuned_models: Directory where fine-tuned and Safe Delta-defended models are saved.

🔧Usage

Data Preparation

We provide all necessary datasets, including those for: Fine-tuning, Applying Safe Delta, Evaluation (Safety and Utility). No additional data collection or preprocessing is required — everything is ready to use out of the box.

Fine-Tuning

Here is an example commands to fine-tune llama2 on Purebad dataset.

CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 2  finetuning.py \
  --batch_size_training 10 --lr 5e-5 \
  --num_epochs 5 \
  --dataset pure_bad_dataset --dataset_size 100 \
  --train_split "ft_datasets/pure_bad_dataset/pure_bad_100.jsonl" \
  --enable_fsdp \
  --model_name ckpts/llama2-7b-chat-hf --pure_bf16 \
  --dist_checkpoint_root_folder finetuned_models/ \
  --dist_checkpoint_folder purebad100-7b-full \
  --gradient_accumulation_steps 1 --run_validation False --save_every_epoch False --save_model False

Run Safe Delta

Here’s an example of applying Safe Delta on Purebad-fine-tuned llama2 model: The resulting model is saved in llama2/finetuned_models/{model_name_ft}-SafeDelta

CUDA_VISIBLE_DEVICES=0 python run_redline_recovery.py \
  --model_name_align 'ckpts/llama2-7b-chat-hf' \
  --model_name_ft 'finetuned_models/purebad100-7b-full' \
  --s 0.11

The current implementation adopts an online processing approach to enhance code readability and simplify implementation. Thus, this may result in a slightly slower runtime.

📃Evaluation

Safety

Please use the provided scripts in llama2/scripts to replicate the experiments in various settings with Attack Success Rate and Harmful Score computed. You can find the model generations in llama2/safety_evaluation/question_output.

Below is an example safety evaluation result on Hexbench:

########## Safety Evaluation ASR ##########
                         Model_ID    ASR  Test_Num
           vllm-llama2-7b-chat-hf  5.15%       330
          vllm-purebad100-7b-full 96.67%       330
vllm-purebad100-7b-full-SafeDelta  5.45%       330

Utility

Please use the provided scripts in llama2/scripts to replicate the experiments in various settings with the downstream task utility. You can find the model generations in llama2/utility_evaluation/sum/data/gen_answers.

We evaluate model basic utility on MMLU using the Language Model Evaluation Harness. MT-Bench scores are evaluated using LLM Judge, which is part of the FastChat.

🔄Applying to Other Models

Safe Delta can be easily transferable to different models. Step-by-step instructions will be added soon.

🔖Citation

If you find our model, data, or evaluation code useful, please kindly cite our paper:

@inproceedings{
  lu2025safe,
  title={Safe Delta: Consistently Preserving Safety when Fine-Tuning {LLM}s on Diverse Datasets},
  author={Ning Lu and Shengcai Liu and Jiahao Wu and Weiyu Chen and Zhirui Zhang and Yew-Soon Ong and Qi Wang and Ke Tang},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=QsCDgFKErb}
}

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
assets		assets
llama2		llama2
.gitignore		.gitignore
README.md		README.md
requirement.txt		requirement.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Uh oh!

Repository files navigation

Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets

📚 Overview

📖Introduction

Key Highlights:

✨Getting Started

Installation

Repo Structure

🔧Usage

Data Preparation

Fine-Tuning

Run Safe Delta

📃Evaluation

Safety

Utility

🔄Applying to Other Models

🔖Citation

About

Uh oh!

Releases

Packages

Languages

Uh oh!

Uh oh!

ColinLu50/SafeDelta

Folders and files

Latest commit

History

Repository files navigation

Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets

📚 Overview

📖Introduction

Key Highlights:

✨Getting Started

Installation

Repo Structure

🔧Usage

Data Preparation

Fine-Tuning

Run Safe Delta

📃Evaluation

Safety

Utility

🔄Applying to Other Models

🔖Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages