GSoC 2026 – Interest in Behavioral Evaluation Test Framework (Idea #2) #20482
KenWuqianghao started this conversation in Ideas
Hi, I'm Ken Wu, a final-year CS student at the University of Waterloo. I've been building LLM eval systems for the past year: at Nokia I built an eval framework for fine-tuned Qwen models that caught overfitting we couldn't see from loss curves alone, and at August I built a round-robin scoring framework for 15+ LLM agents with LLM-as-Judge. I've been reading through the gemini-cli eval code for the past week and wanted to share what I found before drafting a proposal.
cc @gundermanc @srithreepo
What already exists
The `evals/` directory is more capable than the issue description (#18257) makes it sound. A few things I noticed:

- `evalTest` in `evals/test-helper.ts` has a two-tier policy system: `ALWAYS_PASSES` runs in CI on every PR, `USUALLY_PASSES` gets tracked nightly. So the framework already handles LLM non-determinism instead of treating every flaky result as a test bug. That's good.
- `TestRig` (`packages/test-utils/src/test-rig.ts`) does a lot more than I expected. It sets up isolated CLI environments with custom file systems, runs prompts, waits for specific tool calls (you can even validate arguments via a callback), and gives you the full telemetry logs afterward through `readToolLogs()`. It also handles git init, node_modules symlinking, agent acknowledgment files, and activity log capture.
- The nightly pipeline (`evals-nightly.yml`) runs everything 3 times across 6 models, aggregates pass rates, and keeps 10-run history. There's even a `/fix-behavioral-eval` command that auto-investigates regressions by pulling nightly results with the `gh` CLI and suggesting prompt fixes.

So the full loop already works: write eval, run nightly, detect regression, auto-fix.

Where the gaps are
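For mentors skimming this, here's how I understand the two-tier split, as a sketch. Only the two policy names come from the repo; everything else (the function, the decision type) is my own naming for illustration:

```typescript
// Sketch of the two-tier policy described above. ALWAYS_PASSES and
// USUALLY_PASSES are the real policy names; the rest is hypothetical.
type EvalPolicy = "ALWAYS_PASSES" | "USUALLY_PASSES";

interface PolicyDecision {
  runInCi: boolean;      // gates every PR
  trackNightly: boolean; // aggregated across nightly runs
}

// ALWAYS_PASSES gates PRs; USUALLY_PASSES is nightly-only, so expected
// LLM flakiness never blocks a merge.
function resolvePolicy(policy: EvalPolicy): PolicyDecision {
  return {
    runInCi: policy === "ALWAYS_PASSES",
    trackNightly: true,
  };
}
```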
There are ~20 eval files right now. Comparing that against the GSoC target of 50+ across debugging, refactoring, new features, and code review, a few things stand out:
- No debugging evals. Nothing hands the agent a broken codebase and checks whether it can find and fix the bug. `validation_fidelity.eval.ts` tests refactoring, but it always starts from working code.
- No code review evals. Can the agent spot a security issue in a diff? Can it flag a breaking API change? No idea; it's never tested.
- No multi-file reasoning. Most evals use 1-3 files. Real tasks usually need the agent to figure out which files are even relevant before it can start working.
- No efficiency tracking. The nightly pipeline knows whether the agent passed or failed, but not how it got there. This was a big lesson from my time at August: one agent solves a task in 3 tool calls, another thrashes around for 30 and still technically passes. `readToolLogs()` already captures this data; nobody's aggregating it.
- No taxonomy. Evals are grouped by feature (grep, memory, plan mode). If you want to know "how good is the agent at debugging?" there's no way to slice the data that way.
What I'd build
1. Task taxonomy
I'd extend `EvalCase` with category metadata rather than replacing anything: `evalTest` and the policy system stay the same. This just adds fields so the nightly report can group results by category.

2. New scenarios
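To make item 1 concrete before the scenarios: the metadata could be as small as a couple of fields plus one aggregation helper for the nightly report. This is a sketch; `EvalCase` exists in the repo, but every field and function name here is my own:

```typescript
// Hypothetical category metadata layered onto the existing EvalCase.
type EvalCategory = "debugging" | "refactoring" | "new-feature" | "code-review";

interface CategorizedEvalCase {
  name: string;
  category: EvalCategory;
  difficulty?: "easy" | "medium" | "hard";
}

interface EvalResult extends CategorizedEvalCase {
  passed: boolean;
}

// Per-category pass rates for the nightly report.
function passRateByCategory(results: EvalResult[]): Record<string, number> {
  const totals: Record<string, { pass: number; total: number }> = {};
  for (const r of results) {
    const t = (totals[r.category] ??= { pass: 0, total: 0 });
    t.total += 1;
    if (r.passed) t.pass += 1;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([c, t]) => [c, t.pass / t.total]),
  );
}
```

The point is that categorization stays purely additive: existing evals without a `category` field would just fall into an "uncategorized" bucket.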
Debugging: e.g. give the agent code with a planted bug and check it fixes `< length` to `<= length` without touching unrelated code.

Code review: e.g. the scenarios from the gaps above — spot a security issue in a diff, flag a breaking API change.

Multi-file: something like `validation_fidelity.eval.ts`, but bigger.
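The "without touching unrelated code" half of the debugging check could be a tiny helper over the diff. This is not existing `TestRig` API — the name and shape are mine, and `changedFiles` would come from something like `git diff --name-only` inside the test environment:

```typescript
// Hypothetical helper: assert the agent's fix stays inside an
// allow-list of files, so a passing fix can't come from rewriting
// half the repo.
function touchesOnlyAllowedFiles(
  changedFiles: string[],
  allowed: string[],
): { ok: boolean; unexpected: string[] } {
  const allowedSet = new Set(allowed);
  const unexpected = changedFiles.filter((f) => !allowedSet.has(f));
  return { ok: unexpected.length === 0, unexpected };
}
```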
This is the part I care about most, because I ran into exactly this problem at August. Pass/fail hides huge quality differences. I'd pull metrics from `readToolLogs()` after each run. None of this would block CI; it'd show up in the nightly report as trend lines alongside pass rates.
4. Better reports
Extend `scripts/aggregate_evals.js` to break results down by category instead of only by test name. Add regression alerts when a category's pass rate drops below its rolling average.

5. `gemini --eval` CLI command (stretch goal)

This was suggested on #18257: letting users run custom eval suites from a JSON file, similar to how ADK's eval command works. It would be useful for extension authors. I'd treat this as a stretch goal depending on how the first four pieces go.
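Circling back to item 4, the regression alert itself could be very small. The window size and margin below are illustrative defaults, not decided values:

```typescript
// Sketch: flag a category when its latest pass rate drops below the
// rolling average of prior runs by more than a margin.
function isRegression(
  history: number[], // oldest -> newest pass rates, e.g. last 10 nightlies
  window = 5,
  margin = 0.1,
): boolean {
  if (history.length < 2) return false;
  const latest = history[history.length - 1];
  const prior = history.slice(0, -1).slice(-window);
  const avg = prior.reduce((a, b) => a + b, 0) / prior.length;
  return latest < avg - margin;
}
```

Since the nightly pipeline already keeps 10-run history, this needs no new storage — just a pass over the existing aggregates.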
Questions for mentors
Should I prioritize getting to 50+ scenarios first, or is the efficiency scoring system more useful to you right now?
Is there historical nightly eval data somewhere I can use for baselines? The aggregate script references past runs but artifacts only seem to last 7 days.
The nightly pipeline tests 6 models. Should the framework track per-model baselines, or is one "primary model" baseline enough?
How much priority should `gemini --eval custom.json` get within the 175 hours?

I'm working on a PR that adds a debugging-category eval as a first contribution. Let me know if there's a different area that'd be more useful.