[Leaderboard] Spacedock (Recce) — Claude Opus 4.6 — 57.73% Pass@1 #47
kentwelcome wants to merge 5 commits into
Conversation
Hi @kentwelcome — thank you for your contribution!
Thanks! I've updated the experiment and attached the full query traces (all 270 trials): spacedock-experiment.zip. Layout: //run-/{claude-output.jsonl, answers.json}, with per-dataset summary.json. Let me know if you need anything else for validation.
Thank you, @kentwelcome! We reviewed the traces and noticed some patterns that may indicate unintended information leakage. For example, in the agnews runs, we observed the following pattern:
Once these leakage patterns are addressed, we’ll re-run the verification and post the Pass@1 results. Thank you!
Thanks for the reminder. We will review our benchmark sandboxing logic and re-run the agnews dataset.
Hi @Ruiying-Ma,
Hi @kentwelcome — thank you for the submission! We validated the results and noticed a mismatch between the reported accuracy and our validation results:
Could you please double-check whether the submitted answers ( Thanks again for your contribution!
Hi @Ruiying-Ma, I appreciate the reminder. We notice that the DAB's upstream has updated the By the way, we have also updated the The new Stratified Pass@1 is 57.73%.
Hi @kentwelcome — we’ve added your results to our leaderboard. Thank you for the contribution!
Spacedock (Recce) — Leaderboard Submission
Agent name: Spacedock (Recce) (source)
Backbone LLM: Claude Opus 4.6 (Anthropic)
Hints: No
Trials: 5 per query
Stratified Pass@1: 57.73%
Architecture
Spacedock is a workflow-orchestration harness that runs on top of the Claude Code runtime. For each DAB query it:
Results Summary
Notes
Stratified Pass@1 is computed as (1/D) × Σⱼ [(1/Qⱼ) × Σᵢ (cᵢⱼ / n)], with each trial validated by the official data/query_<ds>/query<N>/validate.py from the DAB repo.
No hints (db_description_withhint.txt) were used; the agent discovered schemas via runtime exploration.
The sandbox blocks (a) access to the answer-key files (ground_truth.csv, validate.py) via a Claude Code PreToolUse hook plus chmod 700 root-owned answer-key directories, and (b) external dataset loads (HuggingFace load_dataset, from datasets import, network egress to huggingface.co/Kaggle/etc.). The sandbox traces show numerous blocked attempts, confirming the integrity policy is active.
dab_submission.json: 270 entries (54 queries × 5 trials)
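For reviewers who want to sanity-check the headline number, here is a minimal sketch of the stratified Pass@1 formula from the notes. It assumes D is the number of datasets, Qⱼ the number of queries in dataset j, cᵢⱼ the number of correct trials for query i in dataset j, and n the number of trials per query. The function name `stratified_pass_at_1` and the nested-dict input are hypothetical and are not the format of dab_submission.json or of the DAB validation scripts.

```python
from typing import Dict, List


def stratified_pass_at_1(results: Dict[str, Dict[str, List[bool]]]) -> float:
    """Average per-query accuracy within each dataset, then average across datasets.

    results[dataset][query] holds one boolean per trial (True = trial validated).
    This nested layout is an illustrative assumption, not the DAB file format.
    """
    if not results:
        raise ValueError("at least one dataset is required")

    dataset_scores = []
    for dataset, queries in results.items():
        if not queries:
            raise ValueError(f"dataset {dataset!r} has no queries")
        # c_ij / n for each query i in this dataset
        query_scores = [sum(trials) / len(trials) for trials in queries.values()]
        # (1 / Q_j) * sum over queries
        dataset_scores.append(sum(query_scores) / len(query_scores))

    # (1 / D) * sum over datasets
    return sum(dataset_scores) / len(dataset_scores)


# Toy data: two datasets, 5 trials per query (made-up values).
toy = {
    "ds_a": {
        "query1": [True, True, True, False, False],  # 3/5
        "query2": [True, True, True, True, True],    # 5/5
    },
    "ds_b": {
        "query1": [False, False, False, False, False],  # 0/5
    },
}

score = stratified_pass_at_1(toy)
print(f"{score:.2%}")
```

On the toy data, ds_a averages 0.8 and ds_b averages 0.0, so the stratified score is 0.4. Pooling all 15 trials would give 8/15 instead, because each dataset counts equally in the stratified form regardless of how many queries it has.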