This repository contains the official code for the paper "Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents," accepted to NAACL 2025 Findings. In this project, we adapt the adversarial string training methods from LLM jailbreaking to Indirect Prompt Injection (IPI) attacks, demonstrating that our approach successfully bypasses eight different IPI defenses across experiments on two distinct LLM agents. Our research underscores the need for adaptive attack evaluation when designing defenses to ensure robustness and reliability.
```bash
git clone https://github.com/uiuc-kang-lab/AdaptiveAttackAgent.git
cd AdaptiveAttackAgent
pip install -r requirements.txt
export OPENAI_API_KEY=Your_OpenAI_key
```
To train and evaluate different defenses for various LLM agents, use the following command:
```bash
python3 run.py --model path_to_model --defense defense_name
```
- `--model`: Path to the base model of the LLM agent. We currently support two base models: `meta-llama/Llama-3.1-8B-Instruct` and `lmsys/vicuna-7b-v1.5`. The code can be easily adapted to other base models.
- `--defense`: Name of the defense to evaluate and attack adaptively. For `lmsys/vicuna-7b-v1.5`, we include eight defenses: `LLMDetector`, `FinetunedDetector`, `InstructionalPrevention`, `DataPromptIsolation`, `SandwichPrevention`, `Paraphrasing`, `Adversarial Finetuning`, and `Perplexity Filtering`. For `meta-llama/Llama-3.1-8B-Instruct`, we exclude `DataPromptIsolation` and `SandwichPrevention`. Example invocations are shown below.
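For example (a minimal sketch; it assumes the defense names listed above are passed verbatim as the `--defense` value):

```bash
# Adaptively attack the fine-tuned detector defense on the Vicuna-based agent
python3 run.py --model lmsys/vicuna-7b-v1.5 --defense FinetunedDetector

# Adaptively attack the paraphrasing defense on the Llama-3.1-based agent
python3 run.py --model meta-llama/Llama-3.1-8B-Instruct --defense Paraphrasing
```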
The table below provides descriptions of each defense and the corresponding adaptive attack used:
The following figures illustrate the effectiveness of our adaptive attacks, consistently achieving an attack success rate of over 50% across different defenses and LLM agents:
This repository builds upon outstanding jailbreaking benchmarks and methods, including HarmBench, GCG, and AutoDAN. It also leverages the IPI benchmark InjecAgent.