[Team PaLM] — TRP1 FDE Programme, April 2026 by yosef-zewdu · Pull Request #37 · ucbepic/DataAgentBench

yosef-zewdu · 2026-04-19T18:38:17Z

Summary

Pass@1: 0.474
Trials: 5 runs per query
Agent name: Oracle Forge
Backbone LLM: Gemini 3.1 pro preview (via OpenRouter)
Hints: Context Layer

Architecture

ContextManager: curated domain KB, AGENT.md, and corrections log.
AgenticLoop: LLM-driven tool execution, max 30 iterations per query.
ExecutionEngine / MCPToolbox: hybrid routing, with Google MCP Toolbox for PostgreSQL/SQLite/MongoDB and a DuckDB MCP server for DuckDB.
SelfCorrectionLoop: failure categorisation, LLM-guided repair, known join-key fixes, max 3 retries; outcomes are logged back into Layer 3.
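
The bounded retry behaviour described for SelfCorrectionLoop can be illustrated with a minimal sketch. This is not the Oracle Forge implementation: `run_with_repair`, `execute`, `repair`, and the log record shape are hypothetical names chosen for illustration. Only the retry cap of 3 and the idea of logging outcomes come from the description above.

```python
def run_with_repair(execute, repair, query, max_retries=3):
    """Try `query`; on failure, ask `repair` for a new attempt, up to max_retries times.

    `execute` and `repair` are hypothetical callables standing in for the
    execution engine and the LLM-guided repair step.
    """
    log = []          # stand-in for the corrections log
    attempt = query
    for retry in range(max_retries + 1):   # 1 initial try + max_retries repairs
        try:
            result = execute(attempt)
            log.append({"attempt": attempt, "ok": True})
            return result, log
        except Exception as exc:
            log.append({"attempt": attempt, "ok": False, "error": str(exc)})
            if retry == max_retries:
                break
            attempt = repair(attempt, exc)
    return None, log


# Toy usage: the first attempt fails, the repaired one succeeds.
def flaky_execute(q):
    if "fixed" not in q:
        raise ValueError("join key mismatch")
    return "ok"

def toy_repair(q, exc):
    return q + " fixed"

result, log = run_with_repair(flaky_execute, toy_repair, "SELECT 1")
```

Capping the loop keeps a query that cannot be repaired from consuming the whole iteration budget, and the log keeps both the failed and the successful attempt for later reuse.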

Per-Dataset Results

Dataset	Pass@1
agnews	0.25
bookreview	1.00
crmarenapro	0.60
deps_dev_v1	0.00
github_repos	0.00
googlelocal	0.65
music_brainz_20k	0.67
pancancer_atlas	1.00
patents	0.00
stockindex	0.60
stockmarket	0.44
yelp	0.31
OVERALL	0.47

shreyashankar · 2026-04-22T01:18:57Z

Hi @yosef-zewdu — we're missing coverage. The file has 30 entries across 9 of 12 datasets (missing GITHUB_REPOS, PANCANCER_ATLAS, music_brainz_20k) and 28 of 54 queries, with 1–2 runs each. Per the instructions in the README, we need every query across all 12 datasets with at least 5 runs per query. If you didn't attempt some queries, include those entries with "answer": "". Once it's in I'll re-run verification and post the Pass@1 here.
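
A coverage check along the lines requested above might look like the following sketch. The submission file's real schema is not shown in this thread, so the `dataset` and `query` field names are assumptions; only the `"answer": ""` convention and the 5-run minimum come from the comment. Padding partially run queries with blank answers is also an assumption, since the comment only covers queries that were not attempted at all.

```python
from collections import Counter

MIN_RUNS = 5  # minimum runs per query, per the comment above


def coverage_gaps(entries, expected, min_runs=MIN_RUNS):
    """Return {(dataset, query): missing_run_count} for under-covered queries.

    `entries` is a list of dicts with hypothetical keys "dataset" and "query";
    `expected` maps each dataset name to the list of its query ids.
    """
    counts = Counter((e["dataset"], e["query"]) for e in entries)
    gaps = {}
    for dataset, queries in expected.items():
        for query in queries:
            have = counts[(dataset, query)]
            if have < min_runs:
                gaps[(dataset, query)] = min_runs - have
    return gaps


def pad_with_blank_answers(entries, gaps):
    """Append blank-answer entries so every query reaches the minimum run count."""
    padded = list(entries)
    for (dataset, query), missing in gaps.items():
        padded.extend(
            {"dataset": dataset, "query": query, "answer": ""} for _ in range(missing)
        )
    return padded


# Toy usage: one query with 2 runs, one query never attempted.
entries = [{"dataset": "yelp", "query": "q1", "answer": "x"} for _ in range(2)]
expected = {"yelp": ["q1", "q2"]}
gaps = coverage_gaps(entries, expected)
padded = pad_with_blank_answers(entries, gaps)
```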

yosef-zewdu · 2026-04-24T05:48:34Z

@shreyashankar Thanks for the response. We will follow up as the instructions suggest.

yosef-zewdu · 2026-05-11T05:52:22Z

Hello @shreyashankar, I have updated our submission with 5 runs per query, as you requested, and updated the results accordingly. Could you please re-run the verification?

yosef-zewdu · 2026-05-14T19:34:40Z

Hello @Ruiying-Ma, could you please take a look at this PR? I am following up because all of the requested 5-run evaluation data has been pushed.

Thank you for your time!

Ruiying-Ma · 2026-05-16T03:16:59Z

Hello @yosef-zewdu! Sorry for the delayed reply. Would you mind also sharing the traces for all runs across all queries (for example, as a zip file)? We’ll use them for additional validation checks. Once we have them, we’ll re-run the verification and post the Pass@1 results here. Thank you!

yosef-zewdu · 2026-05-16T04:23:57Z

Thanks for the response, @Ruiying-Ma. Here are the traces for all runs:
runs.zip

Ruiying-Ma · 2026-05-16T05:17:55Z

Hi @yosef-zewdu — verified with common_scaffold/validate/validate.py:

Pass@1 = 0.4607 (stratified) — average across the 12 datasets of the per-dataset average across queries of c/n. This is the leaderboard metric.
Pass@1 = 0.4741 (micro) — total passes / total runs across all 270 trials. Equal weight per (query, run).
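
The difference between the two aggregates can be seen in a small sketch. This is not the code in common_scaffold/validate/validate.py; the `pass_at_1` function and its input shape (a dict keyed by dataset and query, holding one boolean per run) are assumptions made for illustration.

```python
from collections import defaultdict


def pass_at_1(results):
    """Compute (stratified, micro) Pass@1.

    `results` maps (dataset, query_id) -> list of bools, one per run
    (a hypothetical input shape, not the validator's real format).
    """
    per_dataset = defaultdict(list)
    total_pass = 0
    total_runs = 0
    for (dataset, _query), runs in results.items():
        c, n = sum(runs), len(runs)
        per_dataset[dataset].append(c / n)   # per-query c/n
        total_pass += c
        total_runs += n
    # Stratified: mean over datasets of the mean over queries of c/n.
    stratified = sum(sum(v) / len(v) for v in per_dataset.values()) / len(per_dataset)
    # Micro: total passes over total runs, equal weight per (query, run).
    micro = total_pass / total_runs
    return stratified, micro


# Toy data: dataset "a" has 3 queries, dataset "b" has 1, 5 runs each.
toy = {
    ("a", "q1"): [True] * 5,
    ("a", "q2"): [False] * 5,
    ("a", "q3"): [True] * 5,
    ("b", "q1"): [False] * 5,
}
stratified, micro = pass_at_1(toy)
```

On this toy input the stratified value is 1/3 and the micro value is 0.5: the micro average favours datasets with more queries, while the stratified average weights each dataset equally. The 270 trials in this submission correspond to 54 queries at 5 runs each.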

We put 0.4607 on the leaderboard, which now links back to this PR. Thanks for the submission! Closing.

Team PaLM evaluation submission for DAB benchmark

89cc181

yosef-zewdu changed the title ~~[Oracle Forge] — TRP1 FDE Programme, April 2026~~ [Team PaLM] — TRP1 FDE Programme, April 2026 Apr 19, 2026

shreyashankar mentioned this pull request Apr 22, 2026

Add PR #31, #32, #38 to leaderboard #39

Merged

2 tasks

yosef-zewdu added 2 commits May 11, 2026 06:38

Merge branch 'ucbepic:main' into main

a06a1e7

Update benchmark results with new runs

fbf9791

Ruiying-Ma closed this May 16, 2026

yosef-zewdu wants to merge 3 commits into ucbepic:main from PALM-Oracle-Forge:main
