[Leaderboard] Altimate Code - Claude Sonnet 4.6 - Pass@1 0.671 (relaxed validators), 0.6187 (original validators) #44
Conversation
Hi @sahrizvi, thank you for your contribution!
Hi @Ruiying-Ma! Thanks for the quick turnaround. The full set of per-trial traces (270 trials, 54 queries × 5) is attached as a 13.4 MB zip.
Layout
Happy to provide additional metadata or adjust the layout if anything's harder to verify than expected.
Thank you @sahrizvi! Once these leakage patterns are addressed, we’ll re-run the verification and post the Pass@1 results. Thank you!
Hi @Ruiying-Ma,
Thank you for the leakage flag; it surfaced more than the agnews issue you found. We did a thorough audit
We've reworked the harness and re-executed the affected trials. Attached is the updated submission
What changed
New result
The drop from our prior 0.6710 number is mostly the format_hint correction. Per-dataset table is in
Provenance
The 270 trial answers come from three runs of the same hardened harness on the same machine:
Package
Same shape as the previous traces zip (
Happy to answer questions or re-run any specific trials you'd like spot-checked under tighter conditions.
Hi @sahrizvi, we’ve added your results to our leaderboard. Thank you for the contribution!
Thank you, @Ruiying-Ma! Really appreciate the careful review on the leakage findings. The audit was genuinely useful for us internally too. One follow-up worth flagging: we ran the same hardened harness against
Thanks again for maintaining the leaderboard!
Hi @sahrizvi, thank you for sharing your results! Please feel free to open another PR to submit any new results as well. They’re very valuable to us, and we really appreciate your contribution to the leaderboard. Thanks again!
Altimate Code — Leaderboard Submission
Agent name: Altimate Code
Project page: altimate.sh
Backbone LLM: Claude Sonnet 4.6 (via OpenRouter, openrouter/anthropic/claude-sonnet-4.6)
Hints: Yes (db_description_withhint.txt injected into the user prompt)
Trials: 5 per query (270 trials total across 12 datasets, 54 queries)
Result
The same 270 trial answers were scored against two validator versions of the benchmark: the version we ran against, and the post-relaxation version on main at submission time.
Original validators (9031c68ad): Pass@1 0.6187
Relaxed validators (5ec934595): Pass@1 0.671
Note on validator versions. Our trials executed when vendor/DataAgentBench was at commit 9031c68ad. Upstream subsequently merged commits 16ccc3cbd ("Relax 16 validators to accept semantically-correct answers") and 7c94cbf4c ("Relax 3 more validators"), which together updated 17 validate.py files across 6 datasets. Re-running the scoring step (no agent re-execution) against the relaxed validators lifted 12 trials from fail to pass. Both numbers are reproducible from the same trial answers in submission.json.
Architecture
A heterogeneous-warehouse data agent built on top of altimate-code, an open-source TypeScript agent runtime, that:
- db_description_withhint.txt (injected into the user prompt) for cross-DB join keys, term codes, and output-format guidance.
- schema_index, schema_search, schema_inspect, sql_execute, warehouse_list) to introspect schemas and run queries against PostgreSQL, SQLite, DuckDB, and MongoDB.
- sql-review, query-optimize, lineage-diff, sql-translate) to catch SQL anti-patterns and trace column provenance before committing an answer.
- ANSWER rather than producing a meta-summary.
- solve.py per query and iterates in place (Edit, not rewrite) until convergence; final answer goes to ANSWER.
Per-dataset Pass@1
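As an illustration of how per-dataset Pass@1 figures could be derived from per-trial outcomes, and how the same trial answers could be re-scored under two validator versions without re-running the agent, here is a minimal sketch. Everything in it is hypothetical: the trial records, the reference answers, the two toy validators, and the aggregation rule (per-query pass rate, averaged per dataset, then averaged across datasets) are assumptions for illustration and are not taken from the benchmark's scoring code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trial answers (the real ones live in submission.json).
TRIALS = [
    {"dataset": "A", "query": "q1", "answer": "42"},
    {"dataset": "A", "query": "q1", "answer": " 42 "},
    {"dataset": "B", "query": "q2", "answer": "paris"},
    {"dataset": "B", "query": "q2", "answer": "Rome"},
]
# Hypothetical reference answers keyed by query id.
REFERENCE = {"q1": "42", "q2": "Paris"}


def validate_original(query, answer):
    """Toy strict validator: exact string match."""
    return answer == REFERENCE[query]


def validate_relaxed(query, answer):
    """Toy relaxed validator: ignores case and surrounding whitespace."""
    return answer.strip().lower() == REFERENCE[query].strip().lower()


def pass_at_1(trials, validate):
    """Score stored answers with `validate`; no agent re-execution.

    Assumed aggregation: pass rate per query, mean over queries within a
    dataset, then mean over datasets for the overall score.
    """
    per_query = defaultdict(list)
    for t in trials:
        passed = validate(t["query"], t["answer"])
        per_query[(t["dataset"], t["query"])].append(1.0 if passed else 0.0)

    per_dataset = defaultdict(list)
    for (dataset, _query), results in per_query.items():
        per_dataset[dataset].append(mean(results))

    dataset_scores = {d: mean(rates) for d, rates in per_dataset.items()}
    overall = mean(dataset_scores.values())
    return dataset_scores, overall


original_scores, original_overall = pass_at_1(TRIALS, validate_original)
relaxed_scores, relaxed_overall = pass_at_1(TRIALS, validate_relaxed)
print("original:", original_scores, original_overall)
print("relaxed: ", relaxed_scores, relaxed_overall)
```

Because the stored answers are the only input, swapping the validator function is enough to produce both numbers from one set of trials, which is the property the note on validator versions relies on.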
Note on PATENTS
PATENTS scores 0.000 under both validator sets. Our agent produced well-formed CSV answers on every PATENTS trial but reached a different subset of CPC codes than the reference; the failure mode is query-interpretation (specifically: EMA initialization convention and CPC hierarchy-level definition are not pinned down by the question), not format or harness.
We chose not to add per-dataset hand-tuning to lift this number, in keeping with our principle of only using general-purpose agent improvements.
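To make the EMA initialization point concrete, the following sketch shows how two common seeding conventions can select a different top code from identical data. The counts, the labels code_x and code_y, the smoothing factor, and the "pick the code with the highest final EMA" rule are all invented for illustration; they are not the PATENTS query, its data, or the reference solution.

```python
def ema_last(values, alpha, init="first"):
    """Return the final exponential moving average of `values`.

    init="first": seed the EMA with the first observation.
    init="zero":  seed the EMA with 0 and smooth every observation.
    """
    if init == "first":
        state, rest = float(values[0]), values[1:]
    elif init == "zero":
        state, rest = 0.0, values
    else:
        raise ValueError(f"unknown init convention: {init!r}")
    for v in rest:
        state = alpha * v + (1 - alpha) * state
    return state


def top_code(series_by_code, alpha, init):
    """Return the code whose series has the highest final EMA."""
    return max(
        series_by_code,
        key=lambda code: ema_last(series_by_code[code], alpha, init),
    )


# Hypothetical yearly filing counts for two hypothetical codes.
counts = {
    "code_x": [100, 10, 10],  # early spike, then low
    "code_y": [10, 45, 45],   # low start, then sustained growth
}

top_first = top_code(counts, alpha=0.2, init="first")
top_zero = top_code(counts, alpha=0.2, init="zero")
print(top_first, top_zero)
```

With these toy numbers, seeding from the first observation keeps the early spike dominant, while seeding from zero discounts it, so the two conventions rank the codes in opposite order. A validator that expects one convention would mark a well-formed answer computed under the other as wrong.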
Configuration