This is the official repository for "Fight Back Against Jailbreaking via Prompt Adversarial Tuning" @ NeurIPS 2024 by Yichuan Mo, Yuji Wang, Zeming Wei, and Yisen Wang. We introduce a defense strategy named PAT (Prompt Adversarial Tuning) to protect LLMs from jailbreak attacks.
We conduct all our experiments with FastChat version `fschat==0.2.20`.
Before running the experiments, you need to download Vicuna-7B and/or LLaMA-2-7B-Chat first. Then modify the model path in `experiments/configs/individual_xxx.py` or `experiments/configs/transfer_xxx.py` (the latter for the multiple-behavior or transfer experiments).
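If you need a starting point for the download step, the snippet below is a minimal sketch using `huggingface_hub` (not part of this repo). The repo IDs, the Vicuna release, and the local directories are only examples; note that the Llama-2 weights are gated, so you must request access on the Hub and log in with `huggingface-cli login` first.

```python
# Minimal sketch (not part of this repo): download the base models, then point
# the model paths in experiments/configs/*.py at the local directories.
# The repo IDs and target directories below are examples only.
from huggingface_hub import snapshot_download

# Vicuna-7B (the exact release you need may differ; v1.3 is just an example)
snapshot_download(repo_id="lmsys/vicuna-7b-v1.3", local_dir="models/vicuna-7b-v1.3")

# LLaMA-2-7B-Chat (gated: request access and run `huggingface-cli login` first)
snapshot_download(repo_id="meta-llama/Llama-2-7b-chat-hf",
                  local_dir="models/Llama-2-7b-chat-hf")
```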
The `experiments` folder contains code to reproduce the PAT defense on AdvBench.
- To run the multiple-behavior defense on harmful behaviors:

  ```bash
  cd experiments/launch_scripts
  bash run_gcg_multiple.sh vicuna 15 0.2 0
  ```
- To run the transfer experiments (i.e., 25 behaviors, 2 models), run the following commands inside `experiments`:

  ```bash
  cd experiments/launch_scripts
  bash run_gcg_transfer.sh vicuna_guanaco 15 0.2 0
  ```
- To run the evaluation experiments, refer to the scripts in `eval_scripts`: `eval_attack.sh` tests the model on harmful behaviors, and `eval_benign.sh` tests it on benign prompts. In both scripts, you need to specify the paths of the attack file and the defense file.
Our trained defense controls are shown below, so you can test PAT directly with them.
The defense control for Vicuna-7B is shown below:

```
Answer seriously esper quietlymedia (/ \". harmcgi{{[.(_{( zv\":
```

The defense control for Llama-2 is shown below:

```
{[[ Tili particulderr einemroom folgenderomasquestion bezeichneter {%Begin Charlotteawtativelyiostream sug>::
```

The defense control for the transferable setting is shown below:

```
Come ! critique ! respond seriously?. evil fortune////////////////WRITE asked ein legalEGIN chat,. /**------+ nicelyTurnCred/) Base Question sid!(button [\"itle
```
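At inference time, PAT defends by attaching the trained defense control to the incoming user prompt. The sketch below shows one minimal way to build such a guarded prompt with FastChat's conversation templates; the prefix placement and the `build_guarded_prompt` helper are assumptions for illustration, so check the scripts in `eval_scripts` for the exact setup used in our experiments.

```python
# Minimal sketch (not the repo's evaluation code): attach a trained defense
# control to the user query before formatting it with the chat template.
# The prefix placement is an assumption; see eval_scripts for the exact setup.
from fastchat.model import get_conversation_template

DEFENSE_CONTROL = "<paste one of the defense controls above>"

def build_guarded_prompt(user_query: str, template_name: str = "vicuna") -> str:
    conv = get_conversation_template(template_name)
    # Prepend the defense control to the user query (assumed placement).
    conv.append_message(conv.roles[0], f"{DEFENSE_CONTROL} {user_query}")
    conv.append_message(conv.roles[1], None)
    return conv.get_prompt()

print(build_guarded_prompt("How do I bake bread?"))
```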
If you find this useful in your research, please consider citing:
```bibtex
@inproceedings{
  mo2024fight,
  title={Fight Back Against Jailbreaking via Prompt Adversarial Tuning},
  author={Yichuan Mo and Yuji Wang and Zeming Wei and Yisen Wang},
  booktitle={NeurIPS},
  year={2024}
}
```
Thanks to the authors of *Universal and Transferable Adversarial Attacks on Aligned Language Models* for their work.