[Team PaLM] — TRP1 FDE Programme, April 2026 #37
Conversation
|
Hi @yosef-zewdu — we're missing coverage. The file has 30 entries across 9 of 12 datasets (missing GITHUB_REPOS, PANCANCER_ATLAS, music_brainz_20k) and 28 of 54 queries, with 1–2 runs each. Per the instructions in the README, we need every query across all 12 datasets with at least 5 runs per query. If you didn't attempt some queries, include those entries with |
|
@shreyashankar Thanks for the response. We will get back to you once we have updated the submission as the instructions require. |
|
Hello @shreyashankar, I have updated our submission with 5 runs per query, as you requested, and updated the results accordingly. Could you please re-run the verification? |
|
Hello @Ruiying-Ma, could you please take a look at this PR? I am following up because all of the requested 5-run evaluation data has now been pushed. Thank you for your time! |
|
Hello @yosef-zewdu! Sorry for the delayed reply. Would you mind also sharing the traces for all runs across all queries (for example, as a zip file)? We’ll use them for additional validation checks. Once we have them, we’ll re-run the verification and post the Pass@1 results here. Thank you! |
|
Thanks for the response @Ruiying-Ma. Here are the traces for all runs |
|
Hi @yosef-zewdu — verified with common_scaffold/validate/validate.py: Pass@1 = 0.4607 (stratified) — average across the 12 datasets of the per-dataset average across queries of c/n. This is the leaderboard metric. We put 0.4607 on the leaderboard, which now links back to this PR. Thanks for the submission! Closing. |
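For readers who want to reproduce the arithmetic, here is a minimal sketch of the stratified Pass@1 described above: the mean over datasets of the per-dataset mean over queries of c/n, where c is the number of correct runs and n the total runs for a query. This is not the code in common_scaffold/validate/validate.py; the function name, the tuple layout, and the toy numbers are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean


def stratified_pass_at_1(results):
    """Average over datasets of the per-dataset average of c/n over queries.

    `results` is an iterable of (dataset, query_id, correct_runs, total_runs)
    tuples. This layout is hypothetical, not the submission file format.
    """
    per_dataset = defaultdict(list)
    for dataset, _query_id, correct, total in results:
        if total <= 0:
            raise ValueError("each query needs at least one run")
        per_dataset[dataset].append(correct / total)
    if not per_dataset:
        raise ValueError("no results given")
    # Each dataset contributes equally, regardless of how many queries it has.
    return mean(mean(rates) for rates in per_dataset.values())


# Toy data (made up): dataset "A" has two queries, dataset "B" has one.
toy = [
    ("A", "q1", 5, 5),
    ("A", "q2", 0, 5),
    ("B", "q1", 1, 5),
]
score = stratified_pass_at_1(toy)  # (0.5 + 0.2) / 2
```

Because each dataset is weighted equally, the result can differ from a pooled rate over all runs: in the toy data the pooled rate would be 6/15 = 0.4, while the stratified value is 0.35.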
Summary
Architecture
AGENT.md and corrections log
Per-Dataset Results