[Team PaLM] — TRP1 FDE Programme, April 2026 #37

Closed
yosef-zewdu wants to merge 3 commits into ucbepic:main from PALM-Oracle-Forge:main

@yosef-zewdu yosef-zewdu commented Apr 19, 2026

Summary

  • Pass@1: 0.474
  • Trials: 5 runs per query
  • Agent name: Oracle Forge
  • Backbone LLM: Gemini 3.1 pro preview (via OpenRouter)
  • Hints: Context Layer

Architecture

  • ContextManager: curated domain KB, AGENT.md, and corrections log.
  • AgenticLoop: LLM-driven tool execution, max 30 iterations per query.
  • ExecutionEngine / MCPToolbox: hybrid routing, with Google MCP Toolbox for PostgreSQL/SQLite/MongoDB and a DuckDB MCP server for DuckDB.
  • SelfCorrectionLoop: failure categorisation, LLM-guided repair, known join-key fixes, max 3 retries; outcomes logged back into Layer 3.
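
The bounded retry behaviour described for SelfCorrectionLoop could look roughly like the sketch below. Only the limit of 3 retries comes from the description above; `categorise`, `run_with_self_correction`, the category names, and the log record shape are all hypothetical and are not taken from the Oracle Forge code. The sketch also assumes that "max 3 retries" means one initial attempt plus up to three repaired attempts.

```python
from typing import Callable, Dict, List

MAX_RETRIES = 3  # from the PR description; everything else here is hypothetical


def categorise(error: Exception) -> str:
    """Toy failure categoriser: buckets an error by keywords in its message."""
    message = str(error).lower()
    if "join" in message or "key" in message:
        return "join_key_mismatch"
    if "syntax" in message:
        return "syntax_error"
    return "unknown"


def run_with_self_correction(
    execute: Callable[[str], object],
    repair: Callable[[str, str, Exception], str],
    query: str,
    corrections_log: List[Dict[str, object]],
) -> object:
    """Run `query`, repairing and retrying up to MAX_RETRIES times on failure."""
    current = query
    for attempt in range(MAX_RETRIES + 1):  # one initial try plus the retries
        try:
            result = execute(current)
        except Exception as error:
            category = categorise(error)
            # Record every failure so later runs can reuse the correction.
            corrections_log.append(
                {"attempt": attempt, "query": current, "category": category}
            )
            if attempt == MAX_RETRIES:
                raise  # retry budget exhausted; surface the last error
            current = repair(current, category, error)
        else:
            corrections_log.append(
                {"attempt": attempt, "query": current, "category": "ok"}
            )
            return result
```

In this sketch `execute` stands in for the tool call and `repair` for the LLM-guided fix; passing both in as callables keeps the retry budget and logging testable without a database or a model.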

Per-Dataset Results

| Dataset | Pass@1 |
|---|---|
| agnews | 0.25 |
| bookreview | 1.00 |
| crmarenapro | 0.60 |
| deps_dev_v1 | 0.00 |
| github_repos | 0.00 |
| googlelocal | 0.65 |
| music_brainz_20k | 0.67 |
| pancancer_atlas | 1.00 |
| patents | 0.00 |
| stockindex | 0.60 |
| stockmarket | 0.44 |
| yelp | 0.31 |
| OVERALL | 0.47 |
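
As a quick arithmetic check using only the rounded per-dataset values reported above, the unweighted mean of the 12 rows comes out at 0.46 rather than the 0.47 in the OVERALL row. That is consistent with OVERALL being pooled over runs rather than averaged over datasets, as the verification comment later in this thread spells out. The variable names are illustrative only.

```python
# Rounded per-dataset Pass@1 values copied from the table in the PR description.
reported = {
    "agnews": 0.25,
    "bookreview": 1.00,
    "crmarenapro": 0.60,
    "deps_dev_v1": 0.00,
    "github_repos": 0.00,
    "googlelocal": 0.65,
    "music_brainz_20k": 0.67,
    "pancancer_atlas": 1.00,
    "patents": 0.00,
    "stockindex": 0.60,
    "stockmarket": 0.44,
    "yelp": 0.31,
}

# Unweighted mean: every dataset counts equally, whatever its query count.
dataset_mean = sum(reported.values()) / len(reported)
print(f"{dataset_mean:.2f}")  # prints 0.46
```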

@yosef-zewdu changed the title from "[Oracle Forge] — TRP1 FDE Programme, April 2026" to "[Team PaLM] — TRP1 FDE Programme, April 2026" on Apr 19, 2026
@shreyashankar
Collaborator

Hi @yosef-zewdu — we're missing coverage. The file has 30 entries across 9 of 12 datasets (missing GITHUB_REPOS, PANCANCER_ATLAS, music_brainz_20k) and 28 of 54 queries, with 1–2 runs each. Per the instructions in the README, we need every query across all 12 datasets with at least 5 runs per query. If you didn't attempt some queries, include those entries with "answer": "". Once it's in I'll re-run verification and post the Pass@1 here.
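
One way to meet the coverage requirement above is to pad the results file before submitting. The sketch below is an illustration under stated assumptions: only the `"answer": ""` placeholder and the minimum of 5 runs come from this comment, while the `dataset`, `query_id`, and `run` field names, the `pad_submission` helper, and the idea of also topping up partially run queries with empty answers are hypothetical. The README remains the authority on the real file schema.

```python
from typing import Dict, Iterable, List, Tuple

MIN_RUNS = 5  # "at least 5 runs per query", per the review comment


def pad_submission(
    entries: List[Dict[str, object]],
    expected_queries: Iterable[Tuple[str, str]],
) -> List[Dict[str, object]]:
    """Return entries plus empty-answer placeholders so that every expected
    (dataset, query_id) pair has at least MIN_RUNS runs."""
    # Count how many runs already exist for each query.
    counts: Dict[Tuple[str, str], int] = {}
    for entry in entries:
        key = (entry["dataset"], entry["query_id"])
        counts[key] = counts.get(key, 0) + 1

    padded = list(entries)
    for dataset, query_id in expected_queries:
        have = counts.get((dataset, query_id), 0)
        # Assumes existing runs are numbered 0..have-1 (hypothetical layout).
        for run in range(have, MIN_RUNS):
            padded.append(
                {"dataset": dataset, "query_id": query_id, "run": run, "answer": ""}
            )
    return padded
```

For example, one existing run for a single query plus one entirely unattempted query yields 10 entries in total, 9 of them with an empty answer.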

@yosef-zewdu
Author

@shreyashankar Thanks for the response. We will update the submission as the instructions require and get back to you.

@yosef-zewdu
Author

Hello @shreyashankar, I have updated our submission with 5 runs per query, as you requested, and updated the results accordingly. Could you please re-run the verification?

@yosef-zewdu
Author

Hello @Ruiying-Ma, could you please take a look at this PR? I am following up because all of the requested 5-run evaluation data has been pushed.

Thank you for your time!

@Ruiying-Ma
Collaborator

Hello @yosef-zewdu! Sorry for the delayed reply. Would you mind also sharing the traces for all runs across all queries (for example, as a zip file)? We’ll use them for additional validation checks. Once we have them, we’ll re-run the verification and post the Pass@1 results here. Thank you!

@yosef-zewdu
Author

Thanks for the response, @Ruiying-Ma. Here are the traces for all runs:
runs.zip

@Ruiying-Ma
Collaborator

Hi @yosef-zewdu — verified with common_scaffold/validate/validate.py:

Pass@1 = 0.4607 (stratified) — average across the 12 datasets of the per-dataset average across queries of c/n. This is the leaderboard metric.
Pass@1 = 0.4741 (micro) — total passes / total runs across all 270 trials. Equal weight per (query, run).
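
The difference between the two aggregates can be illustrated with a minimal sketch. This is not the logic of common_scaffold/validate/validate.py, which is not shown in this thread; the `pass_at_1` function and the `(dataset, query_id, passed)` record layout are assumptions made for illustration. It implements the two definitions as worded above: stratified is the mean over datasets of the mean over queries of c/n, and micro is total passes over total runs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pass_at_1(records: Iterable[Tuple[str, str, bool]]) -> Tuple[float, float]:
    """Return (stratified, micro) Pass@1 for (dataset, query_id, passed) records,
    one record per run."""
    runs_by_query: Dict[Tuple[str, str], List[bool]] = defaultdict(list)
    for dataset, query_id, passed in records:
        runs_by_query[(dataset, query_id)].append(bool(passed))

    # Micro: every (query, run) pair carries equal weight.
    total_runs = sum(len(runs) for runs in runs_by_query.values())
    total_passes = sum(sum(runs) for runs in runs_by_query.values())
    micro = total_passes / total_runs

    # Stratified: c/n per query, averaged per dataset, then averaged over datasets.
    rates_by_dataset: Dict[str, List[float]] = defaultdict(list)
    for (dataset, _), runs in runs_by_query.items():
        rates_by_dataset[dataset].append(sum(runs) / len(runs))
    dataset_means = [sum(r) / len(r) for r in rates_by_dataset.values()]
    stratified = sum(dataset_means) / len(dataset_means)

    return stratified, micro


# Toy data: dataset "a" has one query that always passes,
# dataset "b" has three queries that always fail (2 runs each).
toy = [("a", "q1", True), ("a", "q1", True)]
toy += [("b", q, False) for q in ("q1", "q2", "q3") for _ in range(2)]
stratified, micro = pass_at_1(toy)
print(stratified, micro)  # prints 0.5 0.25
```

The toy data shows why the two numbers diverge: a dataset with few queries counts as much as a large one under the stratified metric, while the micro metric is dominated by whichever datasets contribute the most runs.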

We put 0.4607 on the leaderboard, which now links back to this PR. Thanks for the submission! Closing.

@Ruiying-Ma Ruiying-Ma closed this May 16, 2026