Add Team Cohere submission (TRP1 FDE, April 2026) by NuryeNigusMekonen · Pull Request #38 · ucbepic/DataAgentBench

NuryeNigusMekonen · 2026-04-20T00:20:51Z

DAB Submission — Team Cohere (TRP1 FDE Programme, April 2026)

Agent name: OracleForge Data Agent

Backbone LLM model and version: google/gemini-2.0-flash-001 (served via OpenRouter)

Dataset hints used: Yes — db_description_withhint.txt is loaded and injected as part of the agent's dataset context for every trial. See eval/run_trials.py::_load_dataset_description in our repo.

Additional notes:

Pass@1 (trial-level, harness scoring): 28.15% (76/270). Prior submission on this PR: 15.24% (41/270).
Trial count: 5 runs per query across all 54 queries = 270 entries (DAB website minimum submission tier).
Submission file follows the DAB schema: dataset, query, run, answer (strings, run zero-indexed 0–4).
Agent architecture: multi-database (SQLite, DuckDB, Postgres, MongoDB) via MCP Toolbox + a local DuckDB bridge + a code-execution sandbox. Source: https://github.com/trp1-cohere-team/data-analytics-agent.
Per-dataset pass@1 on this run:
- stockmarket 25/25 (100.0%), stockindex 10/15 (66.7%), googlelocal 9/20 (45.0%),
  music_brainz_20k 5/15 (33.3%), bookreview 5/15 (33.3%), GITHUB_REPOS 6/20 (30.0%),
  DEPS_DEV_V1 2/10 (20.0%), crmarenapro 9/65 (13.8%), yelp 4/35 (11.4%),
  PATENTS 1/15 (6.7%), PANCANCER_ATLAS 0/15 (0.0%), agnews 0/20 (0.0%).
Changes since the prior submission on this PR:
1. Transient-failure retry — the LLM client now retries upstream 429/503/504 with exponential backoff (2s → 5s → 12s). Previously a brief throttle turned a whole trial into an "LLM call failed" fallback answer.
2. Knowledge-base budget cap — each retrieved KB document is truncated to 4 KB before being joined into the system prompt, preventing a growing corrections_log.md from inflating input-token cost ~70× per trial.
3. Answer-extraction scrub — if the LLM response lacks an explicit ANSWER: marker, the extractor now strips leaked markdown code fences, leading comment lines, and raw Python dict/list dumps instead of emitting them as the answer.
4. Prompt update — the evidence-mode user prompt now explicitly requires in-SQL computation (AVG/SUM/COUNT/CTEs/window functions) when the question asks for a statistic, so the agent returns computed values rather than raw rows.
5. Dataset-specific KB patterns — added kb/domain/pancancer-patterns.md, patents-patterns.md, agnews-patterns.md (join patterns, date-format notes, classification workflow). These had limited effect on the three zero/low-pass datasets — the remaining failures on PANCANCER_ATLAS, PATENTS, and agnews are primarily analytical-reasoning limits rather than missing schema knowledge.
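Item 1 above can be illustrated with a minimal sketch. It is not the code in conductor.py: `UpstreamError` and `call_with_backoff` are hypothetical names, and only the retryable status codes and the 2s/5s/12s delays come from the description.

```python
import time
from typing import Callable, Sequence

RETRYABLE_STATUS = {429, 503, 504}   # statuses named in the PR description
BACKOFF_SECONDS = (2.0, 5.0, 12.0)   # delays named in the PR description


class UpstreamError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed LLM call."""

    def __init__(self, status: int) -> None:
        super().__init__(f"upstream returned HTTP {status}")
        self.status = status


def call_with_backoff(
    call: Callable[[], str],
    delays: Sequence[float] = BACKOFF_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run `call`, retrying retryable failures once per entry in `delays`."""
    for delay in delays:
        try:
            return call()
        except UpstreamError as err:
            if err.status not in RETRYABLE_STATUS:
                raise  # e.g. 400 or 401: retrying would not help
            sleep(delay)
    # Final attempt after the last delay; any error now reaches the caller.
    return call()
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; a caller that wants the fallback answer can catch the exception raised by the final attempt.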
Known limitation carried forward: schema-naming mismatch between the team's local data load (prefix convention like crm_<Table>, bookreview_<table>) and DAB's unprefixed db_description.txt names. Views are in place for non-colliding tables; the colliding ones (Lead, tip) are documented as prefixed aliases in the dataset context. See aidlc-docs/audit.md for the follow-up plan.

Agent: OracleForge Data Agent
Backbone LLM: google/gemini-2.0-flash-001 (via OpenRouter)
Dataset hints used: yes
Coverage: 54 queries x 5 runs = 270 entries

- results/team-cohere_gemini-2.0-flash-001_n5.json: 270-entry submission in DAB schema
- results/dab_submission.json: same content, synced to submission format (dataset, query, run, answer)
- results/README.md: submission status + PR URL (ucbepic/DataAgentBench#38)
- scripts/build_dab_submission.py: rebuild submission from worker outputs
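A structural check on a submission file of this shape might look like the sketch below. It assumes the file is a JSON list of objects with the four fields named in the description and five zero-indexed runs per query; `check_submission` is a hypothetical helper, not part of DAB or of scripts/build_dab_submission.py.

```python
from __future__ import annotations

from collections import Counter
from typing import Any

REQUIRED_FIELDS = ("dataset", "query", "run", "answer")


def check_submission(entries: list[dict[str, Any]], runs_per_query: int = 5) -> list[str]:
    """Return human-readable problems; an empty list means the shape looks right."""
    problems: list[str] = []
    valid_runs = {str(r) for r in range(runs_per_query)}
    seen: set[tuple[str, str, str]] = set()
    per_query: Counter = Counter()
    for i, entry in enumerate(entries):
        missing = [f for f in REQUIRED_FIELDS if f not in entry]
        if missing:
            problems.append(f"entry {i}: missing fields {missing}")
            continue
        dataset, query, run = (str(entry[k]) for k in ("dataset", "query", "run"))
        if run not in valid_runs:
            problems.append(f"entry {i}: run {run!r} outside 0..{runs_per_query - 1}")
        if (dataset, query, run) in seen:
            problems.append(f"entry {i}: duplicate ({dataset}, {query}, run {run})")
        seen.add((dataset, query, run))
        per_query[(dataset, query)] += 1
    # Every (dataset, query) pair should appear exactly runs_per_query times.
    for (dataset, query), count in sorted(per_query.items()):
        if count != runs_per_query:
            problems.append(f"{dataset}/{query}: {count} entries, expected {runs_per_query}")
    return problems
```

To use it, load the file with `json.load` and pass the resulting list. The check covers structure only and says nothing about whether the answers are correct.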

Post-submission iteration: created SQLite/DuckDB/Postgres views exposing DAB's original table names (Lead, review, Mutation_Data, etc.) over our prefix-loaded tables (crm_Lead, bookreview_review, pancancer_Mutation_Data). Reran the previously 0%-scoring datasets (partial: 136 of 205 planned trials, halted by OpenRouter weekly-credit exhaustion).

Score change:
- before: 38/270 = 14.07% pass@1 (trial-level)
- after: 41/270 = 15.19% pass@1

Net +3 passes. Biggest gainers: bookreview (0 -> 5), crmarenapro (0 -> 2). Regressions on music_brainz_20k and yelp: their prior passes were lucky substring matches; with views in place, the agent correctly reports "data not available" for Mongo-hosted data that views cannot reach.

Agent, model, and hints disclosure unchanged from original PR body.
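The view approach described above can be sketched for SQLite as follows. The table-name pairs are taken from the commit message; the column names and rows in `demo` are made up for illustration, and `create_alias_views` is a hypothetical helper rather than the team's script.

```python
from __future__ import annotations

import sqlite3

# Original DAB name -> prefix-loaded table, as listed in the commit message.
ALIAS_VIEWS = {
    "Lead": "crm_Lead",
    "review": "bookreview_review",
    "Mutation_Data": "pancancer_Mutation_Data",
}


def create_alias_views(conn: sqlite3.Connection, aliases: dict[str, str]) -> None:
    """Expose each prefixed table under its unprefixed name via a view."""
    for original, prefixed in aliases.items():
        conn.execute(
            f'CREATE VIEW IF NOT EXISTS "{original}" AS SELECT * FROM "{prefixed}"'
        )


def demo() -> int:
    """Build a toy table (hypothetical columns) and query it through the view."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bookreview_review (book_id TEXT, rating REAL)")
    conn.executemany(
        "INSERT INTO bookreview_review VALUES (?, ?)", [("b1", 4.0), ("b2", 5.0)]
    )
    create_alias_views(conn, {"review": "bookreview_review"})
    (count,) = conn.execute("SELECT COUNT(*) FROM review").fetchone()
    conn.close()
    return count
```

A view only helps when the unprefixed name is free in the same database, which is why colliding names need a different treatment and why Mongo-hosted collections are out of reach of this approach.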

Changes in conductor.py:
- Retry transient 429/503/504 with exponential backoff (2s/5s/12s) in _call_llm; upstream throttles previously turned whole trials into 'LLM call failed' fallback answers.
- Cap each retrieved KB doc at 4KB before joining into the prompt so a growing corrections_log.md (observed ~1.4MB / ~350K tokens) cannot inflate input-token cost ~70x per call.
- Harden _extract_answer: new _scrub_leaked_llm_output strips stray markdown fences, leading plan-comment lines, and raw dict/list dumps when the LLM omits the ANSWER: marker.
- Strengthen evidence-mode and synthesize-mode user prompts: require in-SQL computation (AVG/SUM/COUNT/CTEs/window funcs) when the question asks for a statistic; forbid code fences / plan comments / dict dumps in the final ANSWER.
- (Branch-level, prior to this session) _resolve_mongodb_tool helper + _recover_db_type made instance method, so dialect reroute works with collection-scoped Mongo tool names (query_mongodb_yelp_review etc).

KB docs added (retrieval-scored, 4KB-capped in prompt):
- kb/domain/pancancer-patterns.md: clinical+molecular join, log10 expression, %-mutation-by-histology, chi-square recipe.
- kb/domain/patents-patterns.md: natural-language date parsing, EMA in SQL, citation-network traversal, cpc<->titleFull join.
- kb/domain/agnews-patterns.md: key insight that category is not a column (must LLM-classify from title+description), Mongo<->SQLite join on article_id.

Benchmark impact: pass@1 15.24% -> 28.15% (41 -> 76 / 270).

Submission updated: PR ucbepic/DataAgentBench#38, commit cdfeca1 on team-cohere-submission-april-2026 branch of the fork.
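The per-document cap described in this commit could be sketched as below. It assumes "4KB" means 4096 UTF-8 bytes (the commit does not say whether the limit is in bytes or characters); `cap_doc` and `build_kb_context` are hypothetical names, not the functions in conductor.py.

```python
MAX_DOC_BYTES = 4 * 1024  # assumption: the 4KB cap is measured in UTF-8 bytes


def cap_doc(text: str, max_bytes: int = MAX_DOC_BYTES) -> str:
    """Truncate one KB document to at most `max_bytes` of UTF-8."""
    raw = text.encode("utf-8")
    if len(raw) <= max_bytes:
        return text
    # errors="ignore" drops a multi-byte character cut in half at the boundary.
    return raw[:max_bytes].decode("utf-8", errors="ignore")


def build_kb_context(docs: list[str]) -> str:
    """Join capped documents so one oversized file cannot dominate the prompt."""
    return "\n\n".join(cap_doc(doc) for doc in docs)
```

With this shape, the KB portion of the prompt is bounded by roughly the number of retrieved documents times 4 KB, regardless of how large any single file such as corrections_log.md grows.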

- results/dab_benchmark_5trials.json: 270 rows (54 queries x 5 trials), 76 pass = 28.15% (prior 41/270 = 15.24%).
- results/dab_submission.json, team-cohere_gemini-2.0-flash-001_n5.json: rubric-shaped (dataset, query, run, answer), identical content.

Mirror copied to external/DataAgentBench/leaderboard_submissions/ (PR ucbepic/DataAgentBench#38, commit cdfeca1).

Gains came from the agent/kb changes in 87f07af (429 backoff, kb doc cap, answer scrubber, SQL-compute prompt, domain KB patterns).

shreyashankar · 2026-04-22T01:20:59Z

Hi @NuryeNigusMekonen — I re-ran the submission through DAB's per-query validate.py and got 0.128 stratified Pass@1 (34/270), vs. the 0.282 (76/270) in the PR description. The gap is concentrated on stockmarket (claim 25/25, verified 10/25), googlelocal (9/20 → 0/20), GITHUB_REPOS (6/20 → 0/20), stockindex (10/15 → 5/15), and yelp (4/35 → 0/35). On the other 6 datasets the two graders match exactly.
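As a side note on why a "stratified" score need not equal passes divided by trials, the sketch below contrasts a pooled rate with an unweighted mean of per-dataset rates. This is only one possible reading of "stratified"; the thread does not define how DAB computes it. The counts are the per-dataset figures claimed in the PR description.

```python
# (passes, trials) per dataset, as claimed in the PR description.
CLAIMED = {
    "stockmarket": (25, 25),
    "stockindex": (10, 15),
    "googlelocal": (9, 20),
    "music_brainz_20k": (5, 15),
    "bookreview": (5, 15),
    "GITHUB_REPOS": (6, 20),
    "DEPS_DEV_V1": (2, 10),
    "crmarenapro": (9, 65),
    "yelp": (4, 35),
    "PATENTS": (1, 15),
    "PANCANCER_ATLAS": (0, 15),
    "agnews": (0, 20),
}


def pooled_pass_rate(counts: dict) -> float:
    """Trial-level rate: total passes over total trials."""
    passed = sum(p for p, _ in counts.values())
    total = sum(n for _, n in counts.values())
    return passed / total


def mean_of_dataset_rates(counts: dict) -> float:
    """Unweighted mean of per-dataset rates (an assumed form of stratification)."""
    return sum(p / n for p, n in counts.values()) / len(counts)
```

Because the datasets have very different trial counts (10 to 65), the two aggregates weight them differently, so a small gap between a stratified figure and a raw ratio is expected.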

Looking at the failed answers, most of the gap comes from the MCP DB tool's preview output being submitted verbatim as the final answer instead of an analyzed result. Three examples from the JSON:

- GITHUB_REPOS/q1 ("proportion of non-Python READMEs with copyright info"): answer is "Found 20 result(s). First 5 name(s): stockinfo, bookreview_review, crm_User, crm_Account, crm_Contact (+15 more)." That's a list_tables output.
- GITHUB_REPOS/q2 ("Swift repo with most copied .swift file"): answer is "Found 2,792 result(s). First 5 table_name(s): AAAU, AADR, AAME, AAWW, AAXJ (+2,787 more)." These are stockmarket ticker tables, totally unrelated to the question.
- GITHUB_REPOS/q4 ("top 5 non-Python repos by commits"): answer is a markdown table sorted by watch_count, not commits.

DAB's validators look for the literal expected entity in the answer string, so these all score 0. For now we'll list the verified 0.128 on the leaderboard, but feel free to update your agent and resubmit.
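The failure mode described above suggests a cheap pre-submission guard. The sketch below flags answers that look like the tool preview strings quoted in this comment, and includes a naive literal-match helper. Both functions are hypothetical: the regular expression is inferred from the two quoted examples only, and `contains_expected` is a stand-in that assumes case-insensitive substring matching, not a reimplementation of validate.py.

```python
import re

# Pattern inferred from the two preview strings quoted in the review comment.
PREVIEW_RE = re.compile(r"^Found [\d,]+ result\(s\)\. First \d+ \w+\(s\):")


def looks_like_tool_preview(answer: str) -> bool:
    """True when the answer starts like a raw DB-tool preview, not an analysis."""
    return PREVIEW_RE.match(answer.strip()) is not None


def contains_expected(answer: str, expected: str) -> bool:
    """Naive literal-entity check (assumption: case-insensitive substring)."""
    return expected.casefold() in answer.casefold()
```

An agent could refuse to emit, or retry, any final answer for which `looks_like_tool_preview` returns true, since such strings cannot contain the expected entity except by accident.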

Add PR #31, #32, #38 to leaderboard

NuryeNigusMekonen · 2026-04-22T04:48:44Z

Hi Shreyashankar, Thanks for the detailed review. I will be adding a post-processing and validation layer aligned with validate.py and fixing routing issues. I will update and resubmit. Regards,


…, Cohere (ucbepic#38) to leaderboard

Verified Pass@1 numbers were re-computed from the raw submission JSONs using common_scaffold/validate/validate.py:
- PR ucbepic#31 Pi Coding Agent + Claude Opus 4.6 → 0.5603 (ucbepic#1)
- PR ucbepic#32 Oracle Forge (Tenacious) + Sonnet 4.6 → 0.4554 (ucbepic#4)
- PR ucbepic#38 Oracle Forge (Cohere) + Gemini 2.0 F. → 0.128 (ucbepic#10)

Adds a Submission column on both the README table and the website leaderboard linking each submission to its PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…mission PRs

Pi was linking to mariozechner/pi-coding-agent (the SDK author), not the team that made the submission. Cohere was linking to their source repo. Both now link to the PR they opened on this repo, matching the pattern already used for Tenacious (ucbepic#32).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add Team Cohere submission (TRP1 FDE, April 2026)

ec00492


NuryeNigusMekonen added 2 commits April 20, 2026 01:29

Update Team Cohere submission — pass@1 28.15% (from 15.24%)

cdfeca1

shreyashankar mentioned this pull request Apr 22, 2026

Add PR #31, #32, #38 to leaderboard #39

Merged

2 tasks

shreyashankar added a commit that referenced this pull request Apr 22, 2026

Merge pull request #39 from ucbepic/leaderboard/add-pr31-32-38

4ceeb67

Add PR #31, #32, #38 to leaderboard

shreyashankar mentioned this pull request Apr 22, 2026

Fix team links on leaderboard #40

Merged

1 task

NuryeNigusMekonen closed this Apr 27, 2026

NuryeNigusMekonen wants to merge 3 commits into ucbepic:main from NuryeNigusMekonen:team-cohere-submission-april-2026
