This is the official repository for CTFJudge from "Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark" (AAAI'26) [paper].
For the CTFTiny benchmark, please refer to the CTFTiny Official Repository.
This system uses three LLM-powered agents (a rough sketch follows the list) to:

1. Agent 1: Decompose writeups into structured solution steps
2. Agent 2: Extract and summarize trajectory actions from JSON logs
3. Agent 3: Perform qualitative comparison and scoring between writeup and trajectory
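As an illustration of how these three stages could be chained, here is a minimal sketch. The function names, prompts, model choice, and single-prompt-per-agent structure are all assumptions for illustration, not the repository's actual implementation; model and token limits would come from the config file mentioned below.

```python
# Hypothetical sketch of the three-agent pipeline; names, prompts, and the
# model are illustrative assumptions, not the actual CTFJudge implementation.
import json
import os

import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])


def ask(prompt: str) -> str:
    """Send a single prompt to the model and return its text reply."""
    reply = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed; set via the config file
        max_tokens=2048,                     # assumed; set via the config file
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.content[0].text


def evaluate(writeup_path: str, traj_path: str) -> str:
    with open(writeup_path, encoding="utf-8") as f:
        writeup = f.read()
    with open(traj_path, encoding="utf-8") as f:
        trajectory = json.dumps(json.load(f), indent=2)

    # Agent 1: decompose the writeup into structured solution steps
    steps = ask(f"Decompose this CTF writeup into numbered solution steps:\n\n{writeup}")
    # Agent 2: extract and summarize the actions taken in the JSON trajectory log
    actions = ask(f"Summarize the actions taken in this trajectory log:\n\n{trajectory}")
    # Agent 3: qualitatively compare the two and produce a score
    return ask(
        "Compare the reference solution steps with the agent's actions and "
        f"score the trajectory.\n\nSteps:\n{steps}\n\nActions:\n{actions}"
    )
```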
Requirements:

- Python 3.8+
- Anthropic API key
```bash
export ANTHROPIC_API_KEY='your-api-key'
```

```bash
# Run evaluation on all challenge pairs
python run_evaluation.py --writeups-dir <path> --trajs-dir <path>

# Run evaluation on a specific challenge
python run_evaluation.py --challenge <challenge_name> --writeups-dir <path> --trajs-dir <path>
```

| Flag | Default | Description |
|---|---|---|
| `--challenge`, `-c` | None | Evaluate a specific challenge by name (without extension) |
| `--writeups-dir` | `writeups` | Directory containing `.txt` writeup files |
| `--trajs-dir` | `trajs` | Directory containing `.json` trajectory files |
| `--outputs-dir` | `outputs` | Directory for intermediate agent outputs |
| `--evaluations-dir` | `evaluations` | Directory for final evaluation reports |
| `--errors-dir` | `errors` | Directory for error logs |
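For orientation, the following argparse sketch mirrors the flags and defaults documented in the table above; it is an illustration of the documented interface, and the actual parser in `run_evaluation.py` may be structured differently.

```python
# Illustrative argparse sketch of the documented CLI; the real parser in
# run_evaluation.py may differ in structure.
import argparse

parser = argparse.ArgumentParser(
    description="Judge CTF agent trajectories against reference writeups."
)
parser.add_argument("--challenge", "-c", default=None,
                    help="Evaluate a specific challenge by name (without extension)")
parser.add_argument("--writeups-dir", default="writeups",
                    help="Directory containing .txt writeup files")
parser.add_argument("--trajs-dir", default="trajs",
                    help="Directory containing .json trajectory files")
parser.add_argument("--outputs-dir", default="outputs",
                    help="Directory for intermediate agent outputs")
parser.add_argument("--evaluations-dir", default="evaluations",
                    help="Directory for final evaluation reports")
parser.add_argument("--errors-dir", default="errors",
                    help="Directory for error logs")
args = parser.parse_args()
```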
Notes:

- Challenge names should match the full Docker container names (e.g., `2023q-web-smug_dino`)
- Writeup and trajectory files must share the same base name (`<name>.txt` ↔ `<name>.json`); see the pairing sketch below
- Modify the config file to adjust model selection and token limits (applies to all 3 agents)
- Modifications may be required when using a trajectory format other than the nyuctf_agents format
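As mentioned in the notes, writeups and trajectories are paired by shared base name. The sketch below illustrates that convention with `pathlib`; it is an illustration of the naming rule, not code from the repository.

```python
# Illustrative sketch of the <name>.txt <-> <name>.json pairing convention;
# not code from the repository.
from pathlib import Path


def find_pairs(writeups_dir: str = "writeups", trajs_dir: str = "trajs"):
    """Yield (writeup, trajectory) path pairs whose base names match."""
    for writeup in sorted(Path(writeups_dir).glob("*.txt")):
        traj = Path(trajs_dir) / f"{writeup.stem}.json"
        if traj.exists():
            # e.g., writeups/2023q-web-smug_dino.txt <-> trajs/2023q-web-smug_dino.json
            yield writeup, traj
```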