
PAT (Prompt Adversarial Tuning)

This is the official repository for "Fight Back Against Jailbreaking via Prompt Adversarial Tuning" @ NeurIPS 2024 by Yichuan Mo, Yuji Wang, Zeming Wei, and Yisen Wang. We introduce a defense strategy named PAT (Prompt Adversarial Tuning) to protect LLMs from jailbreak attacks.



Installation

We conduct all our experiments with FastChat version fschat==0.2.20.
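
For example, the pinned dependency can be installed with pip:

pip install fschat==0.2.20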

Models

Before running the experiments, you need to download Vicuna-7B and/or LLaMA-2-7B-Chat first. Then modify the model path in experiments/configs/individual_xxx.py or experiments/configs/transfer_xxx.py (for multiple-behavior or transfer experiments).
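
For reference, a minimal sketch of the model-path edit is shown below. The field names (model_paths, tokenizer_paths) and the ml_collections config style are assumptions based on GCG-style configs, not verified against this repo; check the actual config file before editing.

# Illustrative sketch only: field names are assumed, not taken from this repo.
from ml_collections import config_dict

config = config_dict.ConfigDict()
config.model_paths = ["/path/to/vicuna-7b-v1.3"]      # path to the downloaded model weights
config.tokenizer_paths = ["/path/to/vicuna-7b-v1.3"]  # matching tokenizer path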

Experiments

The experiments folder contains code to reproduce the PAT defense on AdvBench.

  • To perform multiple-behavior defense on harmful behaviors:
cd experiments/launch_scripts
bash run_gcg_multiple.sh vicuna 15 0.2 0
  • To perform transfer-behavior experiments (i.e., 25 behaviors, 2 models):
cd experiments/launch_scripts
bash run_gcg_transfer.sh vicuna_guanaco 15 0.2 0
  • To run the evaluation experiments, refer to the scripts in eval_scripts: eval_attack.sh tests the model on harmful behaviors, and eval_benign.sh tests it on benign prompts. You need to specify the paths to the attack file and the defense file.

We show our trained defense controls below, so you can test PAT directly without retraining.

The defense control for Vicuna-7B is as follows:

Answer seriously esper quietlymedia (/ \". harmcgi{{[.(_{( zv\":

The defense control for Llama-2 is as follows:

{[[ Tili particulderr einemroom folgenderomasquestion bezeichneter {%Begin Charlotteawtativelyiostream sug>::

The defense control for transferable settings is as follows:

Come ! critique ! respond seriously?. evil fortune////////////////WRITE asked ein legalEGIN chat,. /**------+ nicelyTurnCred/) Base Question sid!(button [\"itle 
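
At inference time, PAT attaches a trained defense control to each incoming user prompt. The snippet below is a minimal, hypothetical sketch of that step; the guard_prompt helper is ours, and the prefix placement is an assumption that should be matched to the training setup.

# Minimal illustration (not part of the repo): attach the trained defense
# control to a user prompt before it is formatted into the chat template.
# The prefix placement is an assumption; match the placement used in training.
DEFENSE_CONTROL = r'Answer seriously esper quietlymedia (/ \". harmcgi{{[.(_{( zv\":'

def guard_prompt(user_prompt: str) -> str:
    """Prepend the defense control to the raw user prompt."""
    return f"{DEFENSE_CONTROL} {user_prompt}"

print(guard_prompt("What are the health benefits of green tea?"))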

Citation

If you find this useful in your research, please consider citing:

@inproceedings{mo2024fight,
  title={Fight Back Against Jailbreaking via Prompt Adversarial Tuning},
  author={Yichuan Mo and Yuji Wang and Zeming Wei and Yisen Wang},
  booktitle={NeurIPS},
  year={2024}
}

Acknowledgments

We thank the authors of Universal and Transferable Adversarial Attacks on Aligned Language Models for their work.
