Add Team Cohere submission (TRP1 FDE, April 2026) #38
Agent: OracleForge Data Agent
Backbone LLM: google/gemini-2.0-flash-001 (via OpenRouter)
Dataset hints used: yes
Coverage: 54 queries x 5 runs = 270 entries
- results/team-cohere_gemini-2.0-flash-001_n5.json: 270-entry submission in DAB schema
- results/dab_submission.json: same content, synced to submission format (dataset, query, run, answer)
- results/README.md: submission status + PR URL (ucbepic/DataAgentBench#38)
- scripts/build_dab_submission.py: rebuild submission from worker outputs
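A quick shape check for the submission format listed above could look like the sketch below. It is illustrative only and not part of the repo: `check_submission` is a hypothetical helper, and it assumes each entry is a dict whose `dataset`, `query`, `run`, and `answer` fields are all strings, with `run` taking the values "0" through "4".

```python
REQUIRED_FIELDS = ("dataset", "query", "run", "answer")


def check_submission(entries, runs_per_query=5):
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    runs_seen = {}
    for i, entry in enumerate(entries):
        missing = [f for f in REQUIRED_FIELDS if f not in entry]
        if missing:
            problems.append(f"entry {i}: missing fields {missing}")
            continue
        non_str = [f for f in REQUIRED_FIELDS if not isinstance(entry[f], str)]
        if non_str:
            problems.append(f"entry {i}: non-string fields {non_str}")
        # Group the run labels observed for each (dataset, query) pair.
        key = (entry["dataset"], entry["query"])
        runs_seen.setdefault(key, set()).add(str(entry["run"]))

    expected = {str(r) for r in range(runs_per_query)}
    for key, seen in sorted(runs_seen.items()):
        if seen != expected:
            problems.append(f"{key}: runs {sorted(seen)}, expected {sorted(expected)}")
    return problems
```

Running it over the loaded JSON list before opening the PR would catch missing runs or stray non-string values early.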
Post-submission iteration: created SQLite/DuckDB/Postgres views exposing DAB's original table names (Lead, review, Mutation_Data, etc.) over our prefix-loaded tables (crm_Lead, bookreview_review, pancancer_Mutation_Data). Reran the previously 0%-scoring datasets (partial: 136 of 205 planned trials, halted by OpenRouter weekly-credit exhaustion).

Score change:
- before: 38/270 = 14.07% pass@1 (trial-level)
- after: 41/270 = 15.19% pass@1

Net +3 passes. Biggest gainers: bookreview (0 -> 5), crmarenapro (0 -> 2). Regressions on music_brainz_20k and yelp (their prior passes were lucky substring matches; with views in place, the agent correctly reports "data not available" for Mongo-hosted data that views cannot reach).

Agent, model, and hints disclosure unchanged from original PR body.
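The view approach described above can be sketched for the SQLite case as follows. This is a minimal illustration under stated assumptions, not the code used in the repo: `create_alias_views` and the mapping passed to it are hypothetical, and the sketch simply skips any alias whose name is already taken rather than resolving the collision.

```python
import sqlite3


def create_alias_views(conn, view_map):
    """Create a view per (original name -> prefixed table) pair.

    Aliases that collide with an existing table or view are skipped.
    Returns the list of aliases actually created.
    """
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type IN ('table', 'view')"
    )
    # SQLite identifiers are case-insensitive, so compare in lower case.
    existing = {row[0].lower() for row in rows}
    created = []
    for alias, source in view_map.items():
        if alias.lower() in existing:
            continue  # name collision: leave the prefixed table as the only handle
        conn.execute(f'CREATE VIEW "{alias}" AS SELECT * FROM "{source}"')
        existing.add(alias.lower())
        created.append(alias)
    return created
```

With a view in place, a query written against DAB's original name (for example `SELECT rating FROM review`) resolves to the prefix-loaded table without rewriting the SQL.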
Changes in conductor.py:
- Retry transient 429/503/504 with exponential backoff (2s/5s/12s) in _call_llm; upstream throttles previously turned whole trials into 'LLM call failed' fallback answers.
- Cap each retrieved KB doc at 4KB before joining into the prompt so a growing corrections_log.md (observed ~1.4MB / ~350K tokens) cannot inflate input-token cost ~70x per call.
- Harden _extract_answer: new _scrub_leaked_llm_output strips stray markdown fences, leading plan-comment lines, and raw dict/list dumps when the LLM omits the ANSWER: marker.
- Strengthen evidence-mode and synthesize-mode user prompts: require in-SQL computation (AVG/SUM/COUNT/CTEs/window funcs) when the question asks for a statistic; forbid code fences / plan comments / dict dumps in the final ANSWER.
- (Branch-level, prior to this session) _resolve_mongodb_tool helper + _recover_db_type made instance method, so dialect reroute works with collection-scoped Mongo tool names (query_mongodb_yelp_review etc).

KB docs added (retrieval-scored, 4KB-capped in prompt):
- kb/domain/pancancer-patterns.md: clinical+molecular join, log10 expression, %-mutation-by-histology, chi-square recipe.
- kb/domain/patents-patterns.md: natural-language date parsing, EMA in SQL, citation-network traversal, cpc<->titleFull join.
- kb/domain/agnews-patterns.md: key insight that category is not a column (must LLM-classify from title+description), Mongo<->SQLite join on article_id.

Benchmark impact: pass@1 15.19% -> 28.15% (41 -> 76 / 270).

Submission updated: PR ucbepic/DataAgentBench#38, commit cdfeca1 on team-cohere-submission-april-2026 branch of the fork.
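The retry behaviour described for _call_llm can be sketched as below. This is not the code from conductor.py: `call_with_backoff`, `TransientLLMError`, and the `send_request` callable (assumed to return a `(status_code, body)` tuple) are hypothetical names used only to show the 2s/5s/12s schedule on 429/503/504.

```python
import time

RETRY_STATUSES = {429, 503, 504}
BACKOFF_SECONDS = (2, 5, 12)


class TransientLLMError(Exception):
    """Raised when the upstream is still throttling after all retries."""

    def __init__(self, status):
        super().__init__(f"transient upstream status {status}")
        self.status = status


def call_with_backoff(send_request, sleep=time.sleep):
    """Call send_request() once, then retry up to three times on 429/503/504."""
    # One initial attempt plus one retry per backoff delay; None marks the last try.
    for delay in (*BACKOFF_SECONDS, None):
        status, body = send_request()
        if status not in RETRY_STATUSES:
            return status, body
        if delay is None:
            raise TransientLLMError(status)
        sleep(delay)
```

Passing `sleep` as a parameter keeps the schedule testable without real waiting; the caller can still decide whether an exhausted retry becomes a fallback answer.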
- results/dab_benchmark_5trials.json: 270 rows (54 queries x 5 trials), 76 pass = 28.15% (prior 41/270 = 15.19%).
- results/dab_submission.json, team-cohere_gemini-2.0-flash-001_n5.json: rubric-shaped (dataset, query, run, answer), identical content.

Mirror copied to external/DataAgentBench/leaderboard_submissions/ (PR ucbepic/DataAgentBench#38, commit cdfeca1). Gains came from the agent/kb changes in 87f07af (429 backoff, kb doc cap, answer scrubber, SQL-compute prompt, domain KB patterns).
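The KB doc cap mentioned among the 87f07af changes could look roughly like this. It is a sketch under assumptions: the commit text does not say whether "4KB" is measured in bytes or characters, so this version assumes 4096 UTF-8 bytes, and `cap_doc`, `build_kb_context`, and the separator are hypothetical.

```python
MAX_DOC_BYTES = 4 * 1024  # assumption: the cap is counted in UTF-8 bytes


def cap_doc(text, max_bytes=MAX_DOC_BYTES):
    """Truncate one retrieved document to at most max_bytes of UTF-8."""
    raw = text.encode("utf-8")
    if len(raw) <= max_bytes:
        return text
    # errors="ignore" drops a multi-byte character that the cut split in half.
    return raw[:max_bytes].decode("utf-8", errors="ignore")


def build_kb_context(docs, separator="\n\n---\n\n"):
    """Cap every document individually, then join them for the prompt."""
    return separator.join(cap_doc(doc) for doc in docs)
```

Capping per document rather than capping the joined string means one oversized file cannot crowd the other retrieved docs out of the prompt.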
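Hi @NuryeNigusMekonen, I re-ran the submission through DAB's per-query validate.py and got 0.128 stratified pass@1 (34/270), vs. the 0.282 (76/270) in the PR description. The gap is concentrated on stockmarket (claimed 25/25, verified 10/25), googlelocal (9/20 → 0/20), GITHUB_REPOS (6/20 → 0/20), stockindex (10/15 → 5/15), and yelp (4/35 → 0/35). On the other 6 datasets the two graders match exactly.

Looking at the failed answers, most of the gap comes from the MCP DB tool's preview output being submitted verbatim as the final answer instead of an analyzed result. Three examples from the JSON:
- GITHUB_REPOS/q1 ("proportion of non-Python READMEs with copyright info"): answer is "Found 20 result(s). First 5 name(s): stockinfo, bookreview_review, crm_User, crm_Account, crm_Contact (+15 more).", which is a list_tables output.
- GITHUB_REPOS/q2 ("Swift repo with most copied .swift file"): answer is "Found 2,792 result(s). First 5 table_name(s): AAAU, AADR, AAME, AAWW, AAXJ (+2,787 more).", which lists stockmarket ticker tables, totally unrelated to the question.
- GITHUB_REPOS/q4 ("top 5 non-Python repos by commits"): answer is a markdown table sorted by watch_count, not commits.

DAB's validators look for the literal expected entity in the answer string, so these all score 0. For now we'll list the verified 0.128 on the leaderboard, but feel free to update your agent and resubmit.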
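One way to catch this failure mode before submission is a guard that flags answers shaped like the tool previews quoted in this thread. The sketch below is an assumption-laden illustration: the pattern is inferred only from the two "Found N result(s). First 5 ...(s):" strings quoted here, the real MCP tool output may vary, and `looks_like_tool_preview` is a hypothetical name.

```python
import re

# Matches strings such as "Found 2,792 result(s). First 5 table_name(s): ..."
PREVIEW_RE = re.compile(
    r"^\s*Found\s+[\d,]+\s+result\(s\)\.\s+First\s+\d+\s+\w+\(s\):",
    re.IGNORECASE,
)


def looks_like_tool_preview(answer):
    """Return True when the answer starts like a raw tool preview line."""
    return bool(PREVIEW_RE.match(answer))
```

An agent could treat a flagged answer as "not yet analyzed" and run another synthesis step instead of emitting it.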
Hi Shreyashankar,
Thanks for the detailed review.
I will be adding a post-processing and validation layer aligned with
validate.py and fixing routing issues.
I will update and resubmit.
Regards,
On Wed, 22 Apr 2026, 4:21 am Shreya Shankar wrote:
*shreyashankar* left a comment (ucbepic/DataAgentBench#38)
Hi @NuryeNigusMekonen <https://github.com/NuryeNigusMekonen> — I re-ran
the submission through DAB's per-query validate.py and got *0.128
stratified pass@1 (34/270)*, vs. the *0.282 (76/270)* in the PR
description. The gap is concentrated on stockmarket (claim 25/25, verified
10/25), googlelocal (9/20 → 0/20), GITHUB_REPOS (6/20 → 0/20), stockindex
(10/15 → 5/15), and yelp (4/35 → 0/35). On the other 6 datasets the two
graders match exactly.
Looking at the failed answers, most of the gap comes from the MCP DB
tool's preview output being submitted verbatim as the final answer instead
of an analyzed result. Three examples from the JSON:
- *GITHUB_REPOS/q1* ("proportion of non-Python READMEs with copyright
info") — answer is Found 20 result(s). First 5 name(s): stockinfo,
bookreview_review, crm_User, crm_Account, crm_Contact (+15 more). —
that's a list_tables output.
- *GITHUB_REPOS/q2* ("Swift repo with most copied .swift file") —
answer is Found 2,792 result(s). First 5 table_name(s): AAAU, AADR,
AAME, AAWW, AAXJ (+2,787 more). — stockmarket ticker tables, totally
unrelated to the question.
- *GITHUB_REPOS/q4* ("top 5 non-Python repos by commits") — answer is
a markdown table sorted by watch_count not commits.
DAB's validators look for the literal expected entity in the answer
string, so these all score 0. For now we'll list the verified 0.128 on the
leaderboard, but feel free to update your agent and resubmit.
…, Cohere (ucbepic#38) to leaderboard

Verified Pass@1 numbers were re-computed from the raw submission JSONs using common_scaffold/validate/validate.py:

PR ucbepic#31 Pi Coding Agent + Claude Opus 4.6 → 0.5603 (ucbepic#1)
PR ucbepic#32 Oracle Forge (Tenacious) + Sonnet 4.6 → 0.4554 (ucbepic#4)
PR ucbepic#38 Oracle Forge (Cohere) + Gemini 2.0 F. → 0.128 (ucbepic#10)

Adds a Submission column on both the README table and the website leaderboard linking each submission to its PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mission PRs

Pi was linking to mariozechner/pi-coding-agent (the SDK author), not the team that made the submission. Cohere was linking to their source repo. Both now link to the PR they opened on this repo, matching the pattern already used for Tenacious (ucbepic#32).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DAB Submission — Team Cohere (TRP1 FDE Programme, April 2026)
Agent name: OracleForge Data Agent
Backbone LLM model and version:
google/gemini-2.0-flash-001 (served via OpenRouter)
Dataset hints used: Yes. db_description_withhint.txt is loaded and injected as part of the agent's dataset context for every trial. See eval/run_trials.py::_load_dataset_description in our repo.
Additional notes:
- … dataset, query, run, answer (strings, run zero-indexed 0–4).
- … music_brainz_20k 5/15 (33.3%), bookreview 5/15 (33.3%), GITHUB_REPOS 6/20 (30.0%), DEPS_DEV_V1 2/10 (20.0%), crmarenapro 9/65 (13.8%), yelp 4/35 (11.4%), PATENTS 1/15 (6.7%), PANCANCER_ATLAS 0/15 (0.0%), agnews 0/20 (0.0%).
- … corrections_log.md from inflating input-token cost ~70× per trial.
- … ANSWER: marker, the extractor now strips leaked markdown code fences, leading comment lines, and raw Python dict/list dumps instead of emitting them as the answer.
- … kb/domain/pancancer-patterns.md, patents-patterns.md, agnews-patterns.md (join patterns, date-format notes, classification workflow). These had limited effect on the three zero/low-pass datasets; the remaining failures on PANCANCER_ATLAS, PATENTS, and agnews are primarily analytical-reasoning limits rather than missing schema knowledge.
- … crm_<Table>, bookreview_<table>) and DAB's unprefixed db_description.txt names. Views are in place for non-colliding tables; the colliding ones (Lead, tip) are documented as prefixed aliases in the dataset context. See aidlc-docs/audit.md for the follow-up plan.
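The answer-scrubbing step described in these notes can be sketched as follows. This is not the repo's _scrub_leaked_llm_output: `scrub_leaked_output` is a hypothetical stand-in, and it assumes that a response which parses entirely as a Python dict or list literal should be discarded (returned as an empty string) rather than summarized.

```python
import ast


def scrub_leaked_output(text):
    """Strip fence lines, leading comment lines, and whole-answer dict/list dumps."""
    # 1. Drop markdown fence marker lines but keep the lines between them.
    lines = [ln for ln in text.splitlines() if not ln.strip().startswith("```")]

    # 2. Drop leading blank lines and comment-style plan lines.
    #    Note: this also drops a leading markdown heading that starts with "#".
    while lines and (
        not lines[0].strip() or lines[0].lstrip().startswith(("#", "//", "--"))
    ):
        lines.pop(0)
    cleaned = "\n".join(lines).strip()

    # 3. If everything left is a dict/list literal, treat it as a raw dump.
    if cleaned and cleaned[0] in "{[":
        try:
            value = ast.literal_eval(cleaned)
        except (ValueError, SyntaxError, TypeError):
            return cleaned
        if isinstance(value, (dict, list)):
            return ""
    return cleaned
```

An empty result signals the caller to fall back (for example, to re-prompt for a final ANSWER) instead of submitting the dump.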