Add Team Cohere submission (TRP1 FDE, April 2026)#38

Closed
NuryeNigusMekonen wants to merge 3 commits into ucbepic:main from NuryeNigusMekonen:team-cohere-submission-april-2026
NuryeNigusMekonen commented Apr 20, 2026

DAB Submission — Team Cohere (TRP1 FDE Programme, April 2026)

Agent name: OracleForge Data Agent

Backbone LLM model and version: google/gemini-2.0-flash-001 (served via OpenRouter)

Dataset hints used: Yes — db_description_withhint.txt is loaded and injected as part of the agent's dataset context for every trial. See eval/run_trials.py::_load_dataset_description in our repo.

Additional notes:

  • Pass@1 (trial-level, harness scoring): 28.15% (76/270). Prior submission on this PR: 15.19% (41/270).
  • Trial count: 5 runs per query across all 54 queries = 270 entries (DAB website minimum submission tier).
  • Submission file follows the DAB schema: dataset, query, run, answer (strings, run zero-indexed 0–4).
  • Agent architecture: multi-database (SQLite, DuckDB, Postgres, MongoDB) via MCP Toolbox + a local DuckDB bridge + a code-execution sandbox. Source: https://github.com/trp1-cohere-team/data-analytics-agent.
  • Per-dataset pass@1 on this run:
    • stockmarket 25/25 (100.0%), stockindex 10/15 (66.7%), googlelocal 9/20 (45.0%),
      music_brainz_20k 5/15 (33.3%), bookreview 5/15 (33.3%), GITHUB_REPOS 6/20 (30.0%),
      DEPS_DEV_V1 2/10 (20.0%), crmarenapro 9/65 (13.8%), yelp 4/35 (11.4%),
      PATENTS 1/15 (6.7%), PANCANCER_ATLAS 0/15 (0.0%), agnews 0/20 (0.0%).
  • Changes since the prior submission on this PR:
    1. Transient-failure retry: the LLM client now retries upstream 429/503/504 responses with exponential backoff (2s → 5s → 12s). Previously a brief throttle turned a whole trial into an "LLM call failed" fallback answer.
    2. Knowledge-base budget cap: each retrieved KB document is truncated to 4 KB before being joined into the system prompt, preventing a growing corrections_log.md from inflating input-token cost ~70× per trial.
    3. Answer-extraction scrub: if the LLM response lacks an explicit ANSWER: marker, the extractor now strips leaked markdown code fences, leading comment lines, and raw Python dict/list dumps instead of emitting them as the answer.
    4. Prompt update: the evidence-mode user prompt now explicitly requires in-SQL computation (AVG/SUM/COUNT/CTEs/window functions) when the question asks for a statistic, so the agent returns computed values rather than raw rows.
    5. Dataset-specific KB patterns: added kb/domain/pancancer-patterns.md, patents-patterns.md, agnews-patterns.md (join patterns, date-format notes, classification workflow). These had limited effect on the three zero/low-pass datasets; the remaining failures on PANCANCER_ATLAS, PATENTS, and agnews are primarily analytical-reasoning limits rather than missing schema knowledge.
  • Known limitation carried forward: schema-naming mismatch between the team's local data load (prefix convention like crm_<Table>, bookreview_<table>) and DAB's unprefixed db_description.txt names. Views are in place for non-colliding tables; the colliding ones (Lead, tip) are documented as prefixed aliases in the dataset context. See aidlc-docs/audit.md for the follow-up plan.
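
The transient-failure retry in change 1 can be sketched roughly as follows. This is a minimal illustration, not the code in conductor.py: the `UpstreamError` type and the `call_with_backoff` helper are hypothetical names, and only the status codes (429/503/504) and the delay schedule (2s, 5s, 12s) come from the description above.

```python
import time
from typing import Callable

# Status codes treated as transient, per the PR description.
TRANSIENT_STATUS = {429, 503, 504}
# Delay before each retry, per the PR description (2s -> 5s -> 12s).
BACKOFF_SECONDS = (2.0, 5.0, 12.0)


class UpstreamError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed LLM call."""

    def __init__(self, status: int):
        super().__init__(f"upstream returned HTTP {status}")
        self.status = status


def call_with_backoff(
    call: Callable[[], str],
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run `call`, retrying transient upstream failures with increasing delays."""
    for delay in BACKOFF_SECONDS:
        try:
            return call()
        except UpstreamError as err:
            if err.status not in TRANSIENT_STATUS:
                raise  # non-transient errors are not retried
            sleep(delay)
    # Final attempt after the last delay; any error propagates to the caller.
    return call()
```

Passing `sleep` as a parameter keeps the schedule testable without real waiting; a caller would still need its own fallback answer if the final attempt fails.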

Agent: OracleForge Data Agent
Backbone LLM: google/gemini-2.0-flash-001 (via OpenRouter)
Dataset hints used: yes
Coverage: 54 queries x 5 runs = 270 entries
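
A quick way to sanity-check the coverage claim before submitting is to verify the shape of the entries. The sketch below assumes the submission is a JSON list of objects with the four keys named above; `check_submission` is a hypothetical helper, not part of DAB or of the team's repo, and it accepts `run` as either an integer or a numeric string because the exact type is not confirmed here.

```python
from collections import defaultdict

REQUIRED_KEYS = {"dataset", "query", "run", "answer"}
EXPECTED_RUNS = [0, 1, 2, 3, 4]  # five zero-indexed runs per query


def check_submission(entries: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the shape looks right."""
    problems = []
    runs_seen = defaultdict(list)
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
            continue
        try:
            run = int(entry["run"])  # accepts 0 or "0"; the exact type is an assumption
        except (TypeError, ValueError):
            problems.append(f"entry {i}: run {entry['run']!r} is not an integer")
            continue
        runs_seen[(entry["dataset"], entry["query"])].append(run)
    # Every (dataset, query) pair should have each run index exactly once.
    for key, runs in runs_seen.items():
        if sorted(runs) != EXPECTED_RUNS:
            problems.append(f"{key}: runs {sorted(runs)} != {EXPECTED_RUNS}")
    return problems
```

For a full submission, `len(entries) == 270` together with an empty result from `check_submission` would match the 54 x 5 coverage stated above.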
NuryeNigusMekonen added a commit to trp1-cohere-team/data-analytics-agent that referenced this pull request Apr 20, 2026
- results/team-cohere_gemini-2.0-flash-001_n5.json: 270-entry submission in DAB schema
- results/dab_submission.json: same content, synced to submission format (dataset, query, run, answer)
- results/README.md: submission status + PR URL (ucbepic/DataAgentBench#38)
- scripts/build_dab_submission.py: rebuild submission from worker outputs
Post-submission iteration: created SQLite/DuckDB/Postgres views exposing
DAB's original table names (Lead, review, Mutation_Data, etc.) over our
prefix-loaded tables (crm_Lead, bookreview_review, pancancer_Mutation_Data).
Reran the previously 0%-scoring datasets (partial, 136 of 205 planned
trials — halted by OpenRouter weekly-credit exhaustion).

Score change:
  before:  38/270 = 14.07% pass@1 (trial-level)
  after:   41/270 = 15.19% pass@1

Net +3 passes. Biggest gainers: bookreview (0 -> 5), crmarenapro (0 -> 2).
Regressions on music_brainz_20k and yelp (their prior passes were lucky
substring matches; with views in place, the agent correctly reports "data
not available" for Mongo-hosted data that views cannot reach).

Agent, model, and hints disclosure unchanged from original PR body.
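
The view layer described in this commit can be illustrated for SQLite as below. This is a sketch under assumptions: `VIEW_MAP` and `create_alias_views` are hypothetical names, only two of the mappings mentioned above are shown, and the DuckDB and Postgres equivalents are not covered.

```python
import sqlite3

# Hypothetical mapping from DAB's original table names to the prefix-loaded tables.
VIEW_MAP = {
    "review": "bookreview_review",
    "Mutation_Data": "pancancer_Mutation_Data",
}


def create_alias_views(conn: sqlite3.Connection, view_map: dict[str, str]) -> None:
    """Create one pass-through view per original table name."""
    for view_name, table_name in view_map.items():
        # Identifiers cannot be bound as parameters, so they are quoted inline.
        # The names come from a fixed mapping, not from user input.
        conn.execute(
            f'CREATE VIEW IF NOT EXISTS "{view_name}" AS SELECT * FROM "{table_name}"'
        )
    conn.commit()
```

After `create_alias_views(conn, VIEW_MAP)`, a query written against `review` reads from `bookreview_review`. Views of this kind only cover tables in the same SQL database, which is consistent with the note above that Mongo-hosted data stays out of reach.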
NuryeNigusMekonen added a commit to trp1-cohere-team/data-analytics-agent that referenced this pull request Apr 20, 2026
Changes in conductor.py:
- Retry transient 429/503/504 with exponential backoff (2s/5s/12s) in
  _call_llm; upstream throttles previously turned whole trials into
  'LLM call failed' fallback answers.
- Cap each retrieved KB doc at 4KB before joining into the prompt so a
  growing corrections_log.md (observed ~1.4MB / ~350K tokens) cannot
  inflate input-token cost ~70x per call.
- Harden _extract_answer: new _scrub_leaked_llm_output strips stray
  markdown fences, leading plan-comment lines, and raw dict/list
  dumps when the LLM omits the ANSWER: marker.
- Strengthen evidence-mode and synthesize-mode user prompts: require
  in-SQL computation (AVG/SUM/COUNT/CTEs/window funcs) when the
  question asks for a statistic; forbid code fences / plan comments /
  dict dumps in the final ANSWER.
- (Branch-level, prior to this session) _resolve_mongodb_tool helper
  + _recover_db_type made instance method, so dialect reroute works
  with collection-scoped Mongo tool names (query_mongodb_yelp_review
  etc).
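
The answer-scrubbing step listed above might look roughly like this. It is an illustrative sketch, not the actual _extract_answer or _scrub_leaked_llm_output: the function names here are stand-ins, and the heuristics (what counts as a comment line or a dict/list dump) are assumptions.

```python
import ast
import re

ANSWER_MARKER = re.compile(r"ANSWER:\s*(.*)", re.DOTALL)
FENCE = "`" * 3  # markdown code-fence delimiter


def looks_like_dump(line: str) -> bool:
    """True if the line parses as a Python dict or list literal."""
    stripped = line.strip()
    if not stripped or stripped[0] not in "{[":
        return False
    try:
        return isinstance(ast.literal_eval(stripped), (dict, list))
    except (ValueError, SyntaxError, TypeError):
        return False


def scrub_leaked_output(text: str) -> str:
    """Strip fence lines, leading comment lines, and raw dict/list dumps."""
    lines = [ln for ln in text.splitlines() if not ln.lstrip().startswith(FENCE)]
    # Drop blank and '#'-style comment lines at the start only.
    while lines and (not lines[0].strip() or lines[0].lstrip().startswith("#")):
        lines.pop(0)
    kept = [ln for ln in lines if not looks_like_dump(ln)]
    return "\n".join(kept).strip()


def extract_answer(response: str) -> str:
    """Prefer the explicit marker; otherwise fall back to the scrubbed text."""
    match = ANSWER_MARKER.search(response)
    if match:
        return match.group(1).strip()
    return scrub_leaked_output(response)
```

A scrubber like this removes formatting debris but cannot turn a raw tool preview into an analyzed result, so it complements the prompt changes rather than replacing them.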

KB docs added (retrieval-scored, 4KB-capped in prompt):
- kb/domain/pancancer-patterns.md  clinical+molecular join, log10
  expression, %-mutation-by-histology, chi-square recipe.
- kb/domain/patents-patterns.md  natural-language date parsing, EMA
  in SQL, citation-network traversal, cpc<->titleFull join.
- kb/domain/agnews-patterns.md  key insight that category is not a
  column (must LLM-classify from title+description), Mongo<->SQLite
  join on article_id.
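
The 4KB cap applied to each retrieved KB doc can be sketched as follows. Whether the real cap counts bytes or characters is not stated, so this version assumes UTF-8 bytes; `cap_doc` and `build_kb_context` are hypothetical names.

```python
MAX_DOC_BYTES = 4 * 1024  # 4KB per retrieved document, per the commit message


def cap_doc(text: str, limit: int = MAX_DOC_BYTES) -> str:
    """Truncate a document to at most `limit` UTF-8 bytes."""
    raw = text.encode("utf-8")
    if len(raw) <= limit:
        return text
    # errors="ignore" drops a multi-byte character cut in half at the boundary.
    return raw[:limit].decode("utf-8", errors="ignore")


def build_kb_context(docs: list[str]) -> str:
    """Join capped documents so no single file can dominate the prompt."""
    return "\n\n".join(cap_doc(doc) for doc in docs)
```

With this shape, a log file that keeps growing contributes at most 4KB to the prompt regardless of its size on disk; a tail-based cap would be the alternative if the newest entries matter most.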

Benchmark impact: pass@1 15.19% -> 28.15% (41 -> 76 / 270).
Submission updated: PR ucbepic/DataAgentBench#38, commit cdfeca1 on
team-cohere-submission-april-2026 branch of the fork.
NuryeNigusMekonen added a commit to trp1-cohere-team/data-analytics-agent that referenced this pull request Apr 20, 2026
- results/dab_benchmark_5trials.json: 270 rows (54 queries x 5 trials),
  76 pass = 28.15% (prior 41/270 = 15.19%).
- results/dab_submission.json, team-cohere_gemini-2.0-flash-001_n5.json:
  rubric-shaped (dataset, query, run, answer), identical content.
  Mirror copied to external/DataAgentBench/leaderboard_submissions/
  (PR ucbepic/DataAgentBench#38, commit cdfeca1).

Gains came from the agent/kb changes in 87f07af (429 backoff, kb doc
cap, answer scrubber, SQL-compute prompt, domain KB patterns).
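
The trial-level figure can be re-derived from the per-dataset tallies in the PR description. The snippet below copies those (passes, trials) pairs and sums them; it reproduces only the team's own harness numbers, not DAB's validate.py scoring.

```python
# (passes, trials) per dataset, copied from the PR description.
PER_DATASET = {
    "stockmarket": (25, 25),
    "stockindex": (10, 15),
    "googlelocal": (9, 20),
    "music_brainz_20k": (5, 15),
    "bookreview": (5, 15),
    "GITHUB_REPOS": (6, 20),
    "DEPS_DEV_V1": (2, 10),
    "crmarenapro": (9, 65),
    "yelp": (4, 35),
    "PATENTS": (1, 15),
    "PANCANCER_ATLAS": (0, 15),
    "agnews": (0, 20),
}

passes = sum(p for p, _ in PER_DATASET.values())
trials = sum(t for _, t in PER_DATASET.values())
trial_level_pass_at_1 = passes / trials

print(f"{passes}/{trials} = {100 * trial_level_pass_at_1:.2f}%")
```

The tallies sum to 76 passes over 270 trials, which matches the 28.15% reported in the commit message.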
@shreyashankar
Collaborator

Hi @NuryeNigusMekonen — I re-ran the submission through DAB's per-query validate.py and got 0.128 stratified Pass@1 (34/270), vs. the 0.282 (76/270) in the PR description. The gap is concentrated on stockmarket (claim 25/25, verified 10/25), googlelocal (9/20 → 0/20), GITHUB_REPOS (6/20 → 0/20), stockindex (10/15 → 5/15), and yelp (4/35 → 0/35). On the other 6 datasets the two graders match exactly.

Looking at the failed answers, most of the gap comes from the MCP DB tool's preview output being submitted verbatim as the final answer instead of an analyzed result. Three examples from the JSON:

  • GITHUB_REPOS/q1 ("proportion of non-Python READMEs with copyright info"): the answer is "Found 20 result(s). First 5 name(s): stockinfo, bookreview_review, crm_User, crm_Account, crm_Contact (+15 more)." That is list_tables output.
  • GITHUB_REPOS/q2 ("Swift repo with most copied .swift file"): the answer is "Found 2,792 result(s). First 5 table_name(s): AAAU, AADR, AAME, AAWW, AAXJ (+2,787 more)." Those are stockmarket ticker tables, unrelated to the question.
  • GITHUB_REPOS/q4 ("top 5 non-Python repos by commits"): the answer is a markdown table sorted by watch_count, not commits.

DAB's validators look for the literal expected entity in the answer string, so these all score 0. For now we'll list the verified 0.128 on the leaderboard, but feel free to update your agent and resubmit.
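
One way to catch this failure mode before submitting is a guard that flags answers shaped like the tool previews quoted above. The sketch below is a heuristic built only from those two examples; `looks_like_tool_preview` is a hypothetical helper, and `contains_expected` is a simplified stand-in for literal matching, not the actual logic of validate.py.

```python
import re

# Matches previews such as "Found 20 result(s). First 5 name(s): ..."
# and "Found 2,792 result(s). First 5 table_name(s): ...".
PREVIEW_PATTERN = re.compile(r"^Found [\d,]+ result\(s\)\. First \d+ \w+\(s\):")


def looks_like_tool_preview(answer: str) -> bool:
    """True if the answer looks like a raw DB tool preview rather than a result."""
    return bool(PREVIEW_PATTERN.match(answer.strip()))


def contains_expected(answer: str, expected: str) -> bool:
    """Simplified literal check: the expected entity must appear in the answer."""
    return expected in answer
```

An agent could run `looks_like_tool_preview` on its final answer and trigger another analysis step when it returns True, instead of emitting the preview.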

shreyashankar added a commit that referenced this pull request Apr 22, 2026
shreyashankar mentioned this pull request Apr 22, 2026
@NuryeNigusMekonen
Author

NuryeNigusMekonen commented Apr 22, 2026 via email

NuryeNigusMekonen pushed a commit to NuryeNigusMekonen/DataAgentBench that referenced this pull request Apr 22, 2026
…, Cohere (ucbepic#38) to leaderboard

Verified Pass@1 numbers were re-computed from the raw submission JSONs
using common_scaffold/validate/validate.py:

  PR ucbepic#31  Pi Coding Agent + Claude Opus 4.6      → 0.5603 (ucbepic#1)
  PR ucbepic#32  Oracle Forge (Tenacious) + Sonnet 4.6  → 0.4554 (ucbepic#4)
  PR ucbepic#38  Oracle Forge (Cohere) + Gemini 2.0 F.  → 0.128  (ucbepic#10)

Adds a Submission column on both the README table and the website
leaderboard linking each submission to its PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NuryeNigusMekonen pushed a commit to NuryeNigusMekonen/DataAgentBench that referenced this pull request Apr 22, 2026
…mission PRs

Pi was linking to mariozechner/pi-coding-agent (the SDK author), not the
team that made the submission. Cohere was linking to their source repo.
Both now link to the PR they opened on this repo, matching the pattern
already used for Tenacious (ucbepic#32).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>