Runs the SysRepair-Bench scenarios (ccdc/, meta2/, meta3/ubuntu/,
vulnhub/, meta4/, hivestorm/) under
Inspect AI with one of five solver strategies
and two evaluation modes.
cd inspect_eval
uv syncDocker must be installed and running (each scenario builds its own image from
its Dockerfile).
meta2 (Ubuntu 8.04 Hardy) — requires a shared base image with a GNU
timeout shim (Hardy's coreutils 6.10 predates the binary). The harness
builds it automatically on first run, or manually:
docker build -t sysrepair/meta2-hardy:latest meta2/_basehivestorm — requires randomized roles.json + task.md generated before
each run (anti-memorization):
SEED=42 bash hivestorm/prepare.shPresets are defined in runs.yaml. Run from inspect_eval/:
# Day-1 mode: agent receives full threat briefing
uv run python -m sysrepair_bench.run all_solvers_gemma4_31b_day1
# Zero-day mode: agent must discover + remediate blind
uv run python -m sysrepair_bench.run all_solvers_gemma4_31b_zero_day
# Hivestorm free-roam (partial credit, 9 Linux scenarios)
uv run python -m sysrepair_bench.run hivestorm_linux_gemma4_31b
# Meta4 smoke test (5 representative scenarios)
uv run python -m sysrepair_bench.run meta4_smoke# All scenarios across default benchmarks, ReAct
uv run inspect eval sysrepair_bench --model openai/gpt-4o-mini
# Only meta2, with the LATS solver
uv run inspect eval sysrepair_bench --model openai/gpt-4o \
-T solver=lats -T benchmarks='["meta2"]'
# A specific scenario list
uv run inspect eval sysrepair_bench --model openai/gpt-4o \
-T solver=reflexion \
-T scenarios='["meta2/scenario-01","vulnhub/scenario-03"]'
# Local vLLM via OpenAI-compatible endpoint
OPENAI_BASE_URL=http://localhost:8001/v1 OPENAI_API_KEY=vllm \
uv run inspect eval sysrepair_bench --model openai/gemma-4-31b| Mode | Prompt | Measures |
|---|---|---|
day1 (default) |
Full threat.md briefing (CVE, affected config, expected remediation) |
Execution capability: can the agent apply a known fix? |
zero_day |
No briefing — agent must discover + remediate blind | Discovery + execution: enumerate, identify, remediate. |
Set via -T mode=day1 or -T mode=zero_day, or in a preset's mode: field.
| Param | Default | Meaning |
|---|---|---|
solver |
react |
One of react, basic, reflexion, plan_and_solve, lats. |
mode |
day1 |
day1 (with briefing) or zero_day (blind discovery). |
benchmarks |
["meta2","vulnhub","ccdc"] |
Filter by benchmark directory. |
scenarios |
unset | Explicit list (e.g. ["meta2/scenario-01"]); overrides benchmarks. |
message_limit |
40 |
Per-sample message budget (forced-halt cap). |
max_attempts |
1 |
ReAct resubmission attempts on scorer feedback. |
time_limit |
1800 |
Per-sample wall-clock ceiling (seconds). |
token_limit |
500000 |
Per-sample token ceiling (input + output). |
bash_timeout |
180 |
Per-command timeout for the shell tool (seconds). |
verify_timeout |
300 |
Timeout for verify.sh execution in the scorer (seconds). |
The agent receives these tools (varies by benchmark):
| Tool | Provided to | Description |
|---|---|---|
shell |
All scenarios | OS-aware command execution (bash on Linux, PowerShell on Windows) |
text_editor |
ccdc, vulnhub, meta3, meta4 | View, create, str_replace, insert, undo_edit on files |
text_editor (Hardy) |
meta2 | Lightweight pure-shell implementation (Inspect's native editor fails on Ubuntu 8.04) |
think |
All scenarios | Reasoning scratchpad (no side effects) |
score_progress |
hivestorm only | Runs verify.sh mid-loop; returns passing checks + earned points |
submit |
All scenarios | Declare remediation finished (built into ReAct solver) |
| Benchmark | Scorer | Method |
|---|---|---|
| ccdc, meta2, meta3, meta4, vulnhub | dispatch_scorer (binary) |
Copies verify.sh into sandbox, runs it. Exit 0 = CORRECT. |
| hivestorm | dispatch_scorer (weighted) |
Parses JSONL output from verify.sh. Sums passed check weights, subtracts 10 per broken service, divides by total. Score = 0.0-1.0. |
- react —
inspect_ai.agent.react()with shell + text_editor + think tools; uses Inspect's built-insubmit()semantics. - basic — Minimal generate+tool loop; ends when the model stops calling tools or hits the message limit.
- reflexion — Up to 3 ReAct cycles. Between cycles a reflector LLM call produces a correction strategy appended to the system prompt.
- plan_and_solve — One JSON plan call up front, then sequential execution
with per-step retry. Stops as soon as
verify.shpasses. - lats — UCB1-driven MCTS; an LLM both expands candidate commands and scores rollouts. Container state accumulates across rollouts.
| Base image | Timeout | text_editor | Notes |
|---|---|---|---|
| Ubuntu 25.10 (ccdc) | Native GNU | Inspect native | Full support |
| Ubuntu 14.04 (meta3) | Native GNU | Inspect native | mirrors.kernel.org for apt |
| Ubuntu 8.04 Hardy (meta2) | Exec-based shim | Hardy-compatible | sysrepair/meta2-hardy:latest base image |
| Debian 11 (vulnhub) | Native GNU | Inspect native | Full support |
| Metasploitable2 (vulnhub/05) | Exec-based shim | Inspect native | Shim baked into Dockerfile |
| CentOS 7 (hivestorm/12) | Native GNU | Inspect native | EPEL for jq; vault.centos.org repos |
| Alpine/docker:dind (hivestorm/15) | Native | Inspect native | Needs apk upgrade before install |