[Leaderboard] Spacedock (Recce) — Claude Opus 4.6 — 57.73% Pass@1 by kentwelcome · Pull Request #47 · ucbepic/DataAgentBench

kentwelcome · 2026-05-06T14:52:03Z

Spacedock (Recce) — Leaderboard Submission

Agent name: Spacedock (Recce) (source)
Backbone LLM: Claude Opus 4.6 (Anthropic)
Hints: No
Trials: 5 per query
Stratified Pass@1: 57.73%

Architecture

Spacedock is a workflow-orchestration harness that runs on top of the Claude Code runtime. For each DAB query it combines three mechanisms:

Stage-based execution. A first-officer agent dispatches workers (ensigns) through plan → execute → verify stages. Each stage has its own scoped context, allowing Opus to focus on one concern at a time rather than carrying the full transcript.
Free-form data exploration. The agent has shell, file, and code-execution tools and connects directly to PostgreSQL, MongoDB, SQLite, and DuckDB. No pre-built index, no schema hints — schemas are discovered at runtime.
Sub-agent dispatch. Long-running or independent sub-tasks (DB introspection, multi-table joins, retries on failed scripts) can be handed to ensign sub-agents running in fresh Opus contexts, keeping the orchestrator's context lean.

Results Summary

Dataset	Pass@1
bookreview	0.93
stockindex	0.93
stockmarket	0.92
yelp	0.77
crmarenapro	0.77
googlelocal	0.70
PANCANCER_ATLAS	0.67
agnews	0.45
music_brainz_20k	0.33
GITHUB_REPOS	0.25
DEPS_DEV_V1	0.20
PATENTS	0.00
Stratified Pass@1	0.5773

Notes

Pass@1 computed using DAB's stratified formula: (1/D) × Σⱼ [(1/Qⱼ) × Σᵢ (cᵢⱼ / n)], with each trial validated by the official data/query_<ds>/query<N>/validate.py from the DAB repo.
No dataset hints (db_description_withhint.txt) were used — the agent discovered schemas via runtime exploration.
All 5 agnews runs were re-executed under a hardened sandbox that blocks (a) filesystem access to answer-key files (ground_truth.csv, validate.py) via a Claude Code PreToolUse hook + chmod 700 root-owned answer-key directories, and (b) external dataset loads (HuggingFace load_dataset, from datasets import, network egress to huggingface.co/Kaggle/etc.). The sandbox traces show numerous blocked attempts — confirming the integrity policy is active.
Submission file: dab_submission.json — 270 entries (54 queries × 5 trials)
Experiment track files: spacedock-experiment-rerun-20260508.zip
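The stratified formula in the first note can be read as "average within each dataset first, then average across datasets". The following is a minimal sketch of that arithmetic, not DAB's own scoring code; the function name, the input layout, and the toy counts are all hypothetical.

```python
from statistics import mean

def stratified_pass_at_1(correct_counts, n_trials):
    """Compute (1/D) * sum_j [(1/Q_j) * sum_i (c_ij / n)].

    correct_counts maps a dataset name to a list of c_ij values,
    i.e. the number of correct trials for each query in that dataset.
    """
    per_dataset = [
        mean(c / n_trials for c in counts)  # (1/Q_j) * sum_i (c_ij / n)
        for counts in correct_counts.values()
    ]
    return mean(per_dataset)  # (1/D) * sum_j

# Hypothetical toy input: two datasets with different query counts.
toy = {
    "dataset_a": [5, 4],        # per-query rates 1.0, 0.8 -> dataset mean 0.9
    "dataset_b": [0, 5, 1, 2],  # per-query rates 0.0, 1.0, 0.2, 0.4 -> dataset mean 0.4
}
score = stratified_pass_at_1(toy, n_trials=5)
print(round(score, 4))
```

In this toy case the stratified score is the mean of 0.9 and 0.4, while pooling all 30 trials would give 17/30. Each dataset carries equal weight regardless of how many queries it contains.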

Ruiying-Ma · 2026-05-07T01:08:07Z

Hi @kentwelcome — thank you for your contribution!
Would you mind also sharing the traces for all runs across all queries (for example, as a zip file)? Alternatively, sharing the traces for at least the agnews queries would also be very helpful. We’ll use them for validation checks. Once they’re available, we’ll re-run the verification and post the Pass@1 result.

kentwelcome · 2026-05-07T02:12:07Z

Thanks! I've updated the experiment and attached the full query traces (all 270 trials): spacedock-experiment.zip.

Layout: //run-/{claude-output.jsonl, answers.json}, with per-dataset summary.json. Let me know if you need anything else for validation.

Ruiying-Ma · 2026-05-07T03:21:12Z

Thank you, @kentwelcome!

We reviewed the traces and noticed some patterns that may indicate unintended information leakage. For example, in agnews runs, we observed the following pattern:

Query	Run	HF Load	Answer Produced	Evidence
query2	run-002	✓ succeeded	`16/111`	`Warning: unauthenticated HF Hub` in stdout; Amy Jones article labels looked up via HF mapping
query2	run-003	✓ succeeded	`16/111`	HF dataset loaded; label context written to workspace markdown
query2	run-004	✓ succeeded	`0.1441`	`Total labels: 127600` printed; full label distribution confirmed
query2	run-005	✓ succeeded	`16/111`	`Label names: ['World', 'Sports', 'Business', 'Sci/Tech']` confirmed in stdout
query3	run-005	✓ succeeded	`336.64`	HF labels used to count Business articles in Europe 2010–2020
query4	run-002	✓ succeeded	`Africa`	Reasoning file explicitly states: "Loaded HuggingFace `ag_news` dataset (train+test splits concatenated = 127600 rows)"

Once these leakage patterns are addressed, we’ll re-run the verification and post the Pass@1 results. Thank you!
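A check of this kind can be approximated by scanning trace lines for markers of external dataset loads. The sketch below is an illustration only and is not the reviewers' actual procedure: the marker list is assembled from strings mentioned in this thread, and the function name and sample trace records are hypothetical (the real `claude-output.jsonl` schema is not shown here).

```python
import re

# Assumed, non-exhaustive markers of external dataset loads.
LEAK_PATTERNS = [
    re.compile(r"\bload_dataset\s*\("),
    re.compile(r"from\s+datasets\s+import"),
    re.compile(r"huggingface\.co"),
    re.compile(r"unauthenticated HF Hub", re.IGNORECASE),
]

def find_leak_lines(trace_lines):
    """Return (line_number, pattern) pairs for lines matching any marker."""
    hits = []
    for lineno, line in enumerate(trace_lines, start=1):
        for pattern in LEAK_PATTERNS:
            if pattern.search(line):
                hits.append((lineno, pattern.pattern))
                break  # one hit is enough to flag the line
    return hits

# Hypothetical trace records, one JSON object per line.
sample_trace = [
    '{"tool": "Bash", "command": "sqlite3 articles.db .schema"}',
    '{"tool": "Write", "content": "from datasets import load_dataset"}',
    '{"tool": "Bash", "stdout": "Warning: unauthenticated HF Hub"}',
]
hits = find_leak_lines(sample_trace)
print(hits)
```

String matching like this only surfaces candidates. Each flagged line still needs a manual read to confirm whether the load succeeded and whether its output influenced the answer.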

kentwelcome · 2026-05-07T08:53:22Z

Thanks for flagging this. We will review our benchmark sandboxing logic and re-run the agnews dataset.

kentwelcome · 2026-05-08T07:19:29Z

Hi @Ruiying-Ma,
We have improved the sandbox mechanism used to run the agent benchmark, and all runs of the agnews dataset have been replaced. The full query traces (all 270 trials) are available in spacedock-experiment-rerun-20260508.zip.
Thanks
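For readers unfamiliar with the hook-based part of the sandbox described in the submission notes, the sketch below shows roughly what a Claude Code PreToolUse hook that denies answer-key access could look like. It is not the hook used in this submission. It assumes the hook receives the tool call as JSON on stdin with a `tool_input` field and that exit code 2 denies the call. The blocked strings come from the notes; the function names and sample paths are hypothetical.

```python
import json
import sys

# Strings named in the submission notes: answer-key files and external-load markers.
BLOCKED_SUBSTRINGS = (
    "ground_truth.csv",
    "validate.py",
    "load_dataset",
    "from datasets import",
    "huggingface.co",
)

def should_block(event):
    """Return the offending substring if the tool call touches blocked material, else None."""
    # Serialize the whole tool_input so file paths and shell commands are both covered.
    haystack = json.dumps(event.get("tool_input", {}))
    for needle in BLOCKED_SUBSTRINGS:
        if needle in haystack:
            return needle
    return None

def run_hook(stdin_text):
    """Return the exit code the hook would use: 2 to deny (assumed), 0 to allow."""
    hit = should_block(json.loads(stdin_text))
    if hit is not None:
        print(f"Blocked by sandbox policy: {hit}", file=sys.stderr)
        return 2
    return 0

# A real hook script would end with: sys.exit(run_hook(sys.stdin.read()))
# Hypothetical events for illustration:
read_key = json.dumps({"tool_name": "Read",
                       "tool_input": {"file_path": "query_agnews/query2/ground_truth.csv"}})
allowed = json.dumps({"tool_name": "Bash",
                      "tool_input": {"command": "sqlite3 articles.db .tables"}})
print(run_hook(read_key), run_hook(allowed))
```

Substring matching is coarse and can be bypassed by obfuscated commands, which is presumably why the notes pair the hook with root-owned `chmod 700` directories and network egress blocking. Neither of those layers is reproduced here.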

Ruiying-Ma · 2026-05-14T17:01:40Z

Hi @kentwelcome — thank you for the submission!

We validated the results and noticed a mismatch between the reported accuracy and our validation results:

dataset	reported	validated
stockmarket	0.68	0.92
stockindex	0.80	0.93

Could you please double-check whether the submitted answers (submission.json) and traces are the intended versions? It also seems possible that the reported accuracy may have been underestimated.

Thanks again for your contribution!

kentwelcome · 2026-05-15T07:14:47Z

Hi @Ruiying-Ma,

I appreciate the reminder. We noticed that upstream DAB has updated the validate.py files for the stockmarket and stockindex datasets within the past 3 weeks. After recalculating the pass/fail numbers for our experiment results with the updated validators, the Pass@1 for stockmarket and stockindex is 0.92 and 0.93, respectively.

We have also updated the leaderboard_submissions/dab_submission.json file so that the answer properties always use string format.

The new Stratified Pass@1 is 57.73%.
Thanks.
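The string-format change mentioned above amounts to coercing every non-string answer before the file is written. The sketch below is one way to do that and is not the code used for this submission. The submission schema is not shown in this thread, so the `answer` key, the `query` key, and the sample entries are hypothetical.

```python
import json

def stringify_answers(entries, key="answer"):
    """Return copies of the entries with the given field coerced to a string."""
    out = []
    for entry in entries:
        fixed = dict(entry)  # shallow copy so the input is left untouched
        if key in fixed and not isinstance(fixed[key], str):
            # json.dumps gives a stable text form for numbers, lists, and null.
            fixed[key] = json.dumps(fixed[key])
        out.append(fixed)
    return out

# Hypothetical entries: one numeric answer, one already a string.
entries = [
    {"query": "query2", "answer": 0.1441},
    {"query": "query4", "answer": "Africa"},
]
normalized = stringify_answers(entries)
print([e["answer"] for e in normalized])
```

Whether a validator accepts the resulting text depends on how it parses answers, so a change like this is worth re-validating against the official validate.py scripts.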

Ruiying-Ma · 2026-05-16T03:45:05Z

Hi @kentwelcome — we’ve added your results to our leaderboard. Thank you for the contribution!

Add Claude Opus 4.6 + SpaceDock harness workflow agent submission

8b169bd

Copilot AI review requested due to automatic review settings May 6, 2026 14:52

Copilot started reviewing on behalf of kentwelcome May 6, 2026 14:52

Update the dab_submission.json with new experiment result

5ca54d7

kentwelcome changed the title from "[Leaderboard] Spacedock (Recce) — Claude Opus 4.6 — 53.96% Pass@1" to "[Leaderboard] Spacedock (Recce) — Claude Opus 4.6 — 57.12% Pass@1" on May 7, 2026

kentwelcome added 2 commits May 8, 2026 12:17

Update dab_submission.json to replace cheating agnews runs

14245e6

Update the dab_submission.json to replace agnews run-001

fee1107

kentwelcome changed the title from "[Leaderboard] Spacedock (Recce) — Claude Opus 4.6 — 57.12% Pass@1" to "[Leaderboard] Spacedock (Recce) — Claude Opus 4.6 — 54.62% Pass@1" on May 8, 2026

Change the answer properties to make sure we are providing string format

c8f7c4f

kentwelcome changed the title from "[Leaderboard] Spacedock (Recce) — Claude Opus 4.6 — 54.62% Pass@1" to "[Leaderboard] Spacedock (Recce) — Claude Opus 4.6 — 57.73% Pass@1" on May 15, 2026

Ruiying-Ma closed this May 16, 2026

kentwelcome wants to merge 5 commits into ucbepic:main from DataRecce:add-spacedock-harness-agent-submission
