
Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents

This repository contains the official code for the paper "Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents," accepted to NAACL 2025 Findings. In this project, we adapt adversarial-string optimization methods from LLM jailbreaking to Indirect Prompt Injection (IPI) attacks, showing that the resulting adaptive attacks bypass eight different IPI defenses in experiments on two distinct LLM agents. Our results underscore the need for adaptive-attack evaluation when designing defenses, to ensure robustness and reliability.
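To illustrate the idea behind adversarial-string optimization, here is a minimal, self-contained sketch of a greedy search loop over a suffix. The `toy_attack_score` function is a stand-in for the real objective (e.g. the likelihood of the agent following the injected instruction, or of the payload evading a detector); it is not the paper's actual method, just an illustration of the search structure.

```python
import random

def toy_attack_score(suffix: str) -> float:
    # Toy stand-in for the real attack objective; here we simply reward
    # suffixes containing certain characters. A real adaptive attack would
    # score the agent's (or detector's) response to the injected payload.
    return sum(suffix.count(c) for c in "!@")

def greedy_suffix_search(vocab, length=8, iters=50, seed=0):
    """Hill-climb a suffix: try single-position substitutions, keep improvements."""
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(length)]
    best = toy_attack_score("".join(suffix))
    for _ in range(iters):
        pos = rng.randrange(length)
        cand = suffix.copy()
        cand[pos] = rng.choice(vocab)  # propose one token substitution
        score = toy_attack_score("".join(cand))
        if score >= best:              # accept if no worse
            suffix, best = cand, score
    return "".join(suffix), best

adv_suffix, score = greedy_suffix_search(["a", "b", "!", "@"])
```

Gradient-guided methods such as GCG follow the same propose-score-accept structure but use model gradients to rank candidate substitutions instead of sampling them at random.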

Overview

Set up

```shell
git clone https://github.com/uiuc-kang-lab/AdaptiveAttackAgent.git
cd AdaptiveAttackAgent
pip install -r requirements.txt
export OPENAI_API_KEY=Your_OpenAI_key
```

Usage

To train and evaluate different defenses for various LLM agents, use the following command:

```shell
python3 run.py --model path_to_model --defense defense_name
```

Command parameters:

  • --model: Path to the base model of the LLM agent. We currently support two base models: meta-llama/Llama-3.1-8B-Instruct and lmsys/vicuna-7b-v1.5. The code can be easily adapted to other base models.
  • --defense: Name of the defense to evaluate and attack adaptively. For lmsys/vicuna-7b-v1.5, we include eight defenses: LLMDetector, FinetunedDetector, InstructionalPrevention, DataPromptIsolation, SandwichPrevention, Paraphrasing, Adversarial Finetuning, and Perplexity Filtering. For meta-llama/Llama-3.1-8B-Instruct, we exclude DataPromptIsolation and SandwichPrevention.
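The supported model/defense combinations above can be captured in a small lookup, sketched below. The `DEFENSES` table and `is_supported` helper are illustrative only (they are not part of the repo, and the exact CLI strings for each defense may differ).

```python
# Supported defenses per base model, per the list above.
DEFENSES = {
    "lmsys/vicuna-7b-v1.5": {
        "LLMDetector", "FinetunedDetector", "InstructionalPrevention",
        "DataPromptIsolation", "SandwichPrevention", "Paraphrasing",
        "AdversarialFinetuning", "PerplexityFiltering",
    },
}
# Llama-3.1 supports the same set minus two defenses.
DEFENSES["meta-llama/Llama-3.1-8B-Instruct"] = (
    DEFENSES["lmsys/vicuna-7b-v1.5"] - {"DataPromptIsolation", "SandwichPrevention"}
)

def is_supported(model: str, defense: str) -> bool:
    """Check whether a defense is available for the given base model."""
    return defense in DEFENSES.get(model, set())
```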

The table below provides descriptions of each defense and the corresponding adaptive attack used:

(Table: each defense with its description and the corresponding adaptive attack.)

Results

The following figures illustrate the effectiveness of our adaptive attacks, which consistently achieve an attack success rate above 50% across different defenses and LLM agents:

(Figure: attack success rates across defenses and LLM agents.)

Acknowledgement

This repository builds upon outstanding jailbreaking benchmarks and methods, including HarmBench, GCG, and AutoDAN. It also leverages the IPI benchmark InjecAgent.
