This repository contains the official code for the paper "Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents," accepted to NAACL 2025 Findings. In this project, we adapt the adversarial string training methods from LLM jailbreaking to Indirect Prompt Injection (IPI) attacks, demonstrating that our approach successfully bypasses eight different IPI defenses across experiments on two distinct LLM agents. Our research underscores the need for adaptive attack evaluation when designing defenses to ensure robustness and reliability.
```bash
git clone https://github.com/uiuc-kang-lab/AdaptiveAttackAgent.git
cd AdaptiveAttackAgent
pip install -r requirements.txt
export OPENAI_API_KEY=Your_OpenAI_key
```
To train and evaluate different defenses for various LLM agents, use the following command:
```bash
python3 run.py --model path_to_model --defense defense_name
```
- `--model`: Path to the base model of the LLM agent. We currently support two base models: `meta-llama/Llama-3.1-8B-Instruct` and `lmsys/vicuna-7b-v1.5`. The code can be easily adapted to other base models.
- `--defense`: Name of the defense to evaluate and attack adaptively. For `lmsys/vicuna-7b-v1.5`, we include eight defenses: `LLMDetector`, `FinetunedDetector`, `InstructionalPrevention`, `DataPromptIsolation`, `SandwichPrevention`, `Paraphrasing`, `Adversarial Finetuning`, and `Perplexity Filtering`. For `meta-llama/Llama-3.1-8B-Instruct`, we exclude `DataPromptIsolation` and `SandwichPrevention`. Example invocations are shown below.
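For example (a minimal sketch; it assumes the defense names listed above are passed verbatim as the `--defense` value):

```bash
# Adaptively attack the fine-tuned detector defense on the Vicuna-based agent
python3 run.py --model lmsys/vicuna-7b-v1.5 --defense FinetunedDetector

# Adaptively attack the paraphrasing defense on the Llama-3.1-based agent
python3 run.py --model meta-llama/Llama-3.1-8B-Instruct --defense Paraphrasing
```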
The table below provides descriptions of each defense and the corresponding adaptive attack used:
The following figures illustrate the effectiveness of our adaptive attacks, consistently achieving an attack success rate of over 50% across different defenses and LLM agents:
This repository builds upon outstanding jailbreaking benchmarks and methods, including HarmBench, GCG, and AutoDAN. It also leverages the IPI benchmark InjecAgent.