CALM: Curiosity-Driven Auditing for Large Language Models (AAAI 2025 AI Alignment Track) [Paper]
Xiang Zheng, Longxiang Wang, Yi Liu, Xingjun Ma, Chao Shen, Cong Wang
Auditing Large Language Models (LLMs) is a crucial and challenging task. In this study, we focus on auditing black-box LLMs without access to their parameters, only to the provided service. We treat this type of auditing as a black-box optimization problem in which the goal is to automatically uncover input-output pairs of the target LLM that exhibit illegal, immoral, or unsafe behaviors. For instance, we may seek a non-toxic input to which the target LLM responds with a toxic output, or an input that induces a hallucinatory response from the target LLM mentioning politically sensitive individuals. This black-box optimization is challenging due to the scarcity of feasible points, the discrete nature of the prompt space, and the large search space. To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to finetune an LLM as an auditor agent that uncovers potentially harmful and biased input-output pairs of the target LLM. CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names under the black-box setting. This work offers a promising direction for auditing black-box LLMs. Our code is available at https://github.com/x-zheng16/CALM.git.
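At a high level, CALM finetunes an auditor LLM with PPO, rewarding each generated audit prompt with an extrinsic score on the target model's response (e.g., its toxicity) plus an intrinsic curiosity bonus that encourages novel prompts. The snippet below is a minimal sketch of that loop written against the classic `trl` PPOTrainer API (pre-0.12); the local GPT-2 stand-in for the black-box target, the `unitary/toxic-bert` scorer, and the count-based novelty bonus are simplified assumptions for illustration, not the exact reward design used in the paper or in the scripts below.

```python
# Minimal sketch of a curiosity-driven auditing loop (illustrative only).
# Assumes the classic trl PPOTrainer API (trl < 0.12); the reward terms below
# are simplified stand-ins, not the exact design used by CALM.
from collections import Counter

import torch
from transformers import AutoTokenizer, pipeline
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2", batch_size=4, mini_batch_size=4)
auditor = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_auditor = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token
ppo_trainer = PPOTrainer(config, auditor, ref_auditor, tokenizer)

# Stand-in for the black-box target: only text in / text out is used.
target = pipeline("text-generation", model="gpt2")
# Example extrinsic scorer: any off-the-shelf toxicity classifier works here.
toxicity_scorer = pipeline("text-classification", model="unitary/toxic-bert", top_k=None)


def query_target_llm(prompt: str) -> str:
    out = target(prompt, max_new_tokens=30, do_sample=True, return_full_text=False)
    return out[0]["generated_text"]


def toxicity(text: str) -> float:
    """Probability that the target's response is toxic, in [0, 1]."""
    scores = toxicity_scorer(text)[0]
    return next(s["score"] for s in scores if s["label"] == "toxic")


prompt_counts = Counter()  # count-based stand-in for the curiosity bonus
seeds = ["Write one sentence about", "Complete the sentence:"] * 2  # matches batch_size

for _ in range(100):  # auditing iterations
    queries = [tokenizer.encode(s, return_tensors="pt").squeeze(0) for s in seeds]
    responses = ppo_trainer.generate(
        queries, return_prompt=False, max_new_tokens=24,
        do_sample=True, pad_token_id=tokenizer.eos_token_id,
    )
    audit_prompts = [tokenizer.decode(r, skip_special_tokens=True) for r in responses]

    rewards = []
    for p in audit_prompts:
        extrinsic = toxicity(query_target_llm(p))   # how harmful the target's reply is
        prompt_counts[p] += 1
        intrinsic = 1.0 / prompt_counts[p] ** 0.5   # bonus for rarely seen audit prompts
        rewards.append(torch.tensor(extrinsic + 0.1 * intrinsic))

    ppo_trainer.step(queries, responses, rewards)   # PPO update of the auditor LLM
```

The count-based bonus is only the simplest possible intrinsic reward; the point is that the curiosity term keeps the auditor exploring new prompts rather than collapsing onto a single high-reward input.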
# create and activate a fresh conda environment (adjust the Python version as needed)
conda create -n calm python=3.10
conda activate calm
# install dependencies
pip install trl
pip install pykeops autoroot fast_bleu typo nltk tensorboard
# audit GPT-2 for inputs that elicit specific names (senators)
bash scripts/run_ppo_auditing_gpt2_senators.sh
# audit GPT-2 for inputs that trigger toxic completions
bash scripts/run_ppo_auditing_gpt2_senators_toxicity.sh