GSoC 2026 – Interest in Behavioral Evaluation Test Framework (Idea #2) #20482

KenWuqianghao · 2026-02-26T20:21:07Z

KenWuqianghao
Feb 26, 2026

Hi, I'm Ken Wu, a final-year CS student at the University of Waterloo. I've been building LLM eval systems for the past year: at Nokia I built an eval framework for fine-tuned Qwen models that caught overfitting we couldn't see from loss curves alone, and at August I built a round-robin scoring framework for 15+ LLM agents with LLM-as-Judge. I've been reading through the gemini-cli eval code for the past week and wanted to share what I found before drafting a proposal.

cc @gundermanc @srithreepo

What already exists

The evals/ directory is more capable than the issue description (#18257) makes it sound. A few things I noticed:

evalTest in evals/test-helper.ts has a two-tier policy system: ALWAYS_PASSES runs in CI on every PR, USUALLY_PASSES gets tracked nightly. So the framework already handles LLM non-determinism instead of treating every flaky result as a test bug. That's good.

TestRig (packages/test-utils/src/test-rig.ts) does a lot more than I expected. It sets up isolated CLI environments with custom file systems, runs prompts, waits for specific tool calls (you can even validate arguments via callback), and gives you the full telemetry logs afterward through readToolLogs(). It also handles git init, node_modules symlinking, agent acknowledgment files, activity log capture.

The nightly pipeline (evals-nightly.yml) runs everything 3 times across 6 models, aggregates pass rates, and keeps 10-run history. There's even /fix-behavioral-eval that auto-investigates regressions by pulling nightly results with gh CLI and suggesting prompt fixes. So the full loop already works: write eval, run nightly, detect regression, auto-fix.

Where the gaps are

There are ~20 eval files right now. Comparing that against the GSoC target of 50+ across debugging, refactoring, new features, and code review, a few things stand out:

No debugging evals. Nothing hands the agent a broken codebase and checks if it can find and fix the bug. validation_fidelity.eval.ts tests refactoring but always starts from working code.

No code review evals. Can the agent spot a security issue in a diff? Can it flag a breaking API change? No idea, it's never tested.

No multi-file reasoning. Most evals use 1-3 files. Real tasks usually need the agent to figure out which files are even relevant before it can start working.

No efficiency tracking. The nightly pipeline knows if the agent passed or failed, but not how it got there. This was a big lesson from my time at August: one agent solves a task in 3 tool calls, another thrashes around for 30 and still technically passes. readToolLogs() already captures this data, nobody's aggregating it.

No taxonomy. Evals are grouped by feature (grep, memory, plan mode). If you want to know "how good is the agent at debugging?" there's no way to slice the data that way.

What I'd build

1. Task taxonomy

I'd extend EvalCase with category metadata rather than replacing anything:

interface EvalScenario extends EvalCase {
  category: 'debugging' | 'refactoring' | 'new-feature' | 'code-review' | 'exploration';
  complexity: 'single-file' | 'multi-file' | 'cross-package';
  expected_tools: string[];
}

interface EvalScore {
  passed: boolean;
  tool_calls_total: number;
  redundant_reads: number;
  wasted_explorations: number;
  self_corrections: number;
}

evalTest and the policy system stay the same. This just adds fields so the nightly report can group results by category.

2. New scenarios

Debugging:

Off-by-one bug with a failing test. Agent should fix < length to <= length without touching unrelated code.
Broken import path in a TypeScript project. Should fix the import, not rewrite the module.
Null reference buried three function calls deep. Can the agent actually trace the stack?

Code review:

Diff that introduces unsanitized user input in a shell command. Agent should flag it.
Refactor that silently changes public API behavior. Agent should catch the break.

Multi-file:

Rename a type used across 5+ files, verify all references updated and type checker passes. Similar to the existing validation_fidelity.eval.ts but bigger.
New feature that touches a type definition, a service, and a test file. Order matters.

3. Efficiency scoring

This is the part I care about most, because I ran into exactly this problem at August. Pass/fail hides huge quality differences. I'd pull metrics from readToolLogs() after each run:

Useful tool calls vs total calls
Files read but never referenced in the solution
Whether the agent recovered after going down a wrong path
Same tool called with the same arguments multiple times

None of this would block CI. It'd show up in the nightly report as trend lines alongside pass rates.

4. Better reports

Extend scripts/aggregate_evals.js to break results down by category instead of only by test name. Add regression alerts when a category's pass rate drops below its rolling average.

5. gemini --eval CLI command (stretch goal)

This was suggested on #18257 -- letting users run custom eval suites from a JSON file, like how ADK's eval command works. Would be useful for extension authors. I'd treat this as a stretch goal depending on how the first four pieces go.

Questions for mentors

Should I prioritize getting to 50+ scenarios first, or is the efficiency scoring system more useful to you right now?
Is there historical nightly eval data somewhere I can use for baselines? The aggregate script references past runs but artifacts only seem to last 7 days.
The nightly pipeline tests 6 models. Should the framework track per-model baselines, or is one "primary model" baseline enough?
How much priority should gemini --eval custom.json get within the 175 hours?

I'm working on a PR that adds a debugging-category eval as a first contribution. Let me know if there's a different area that'd be more useful.


Portfolio	kenwu.is-a.dev
GitHub	@KenWuqianghao
LinkedIn	kenwuu
Resume	Google Drive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GSoC 2026 – Interest in Behavioral Evaluation Test Framework (Idea #2) #20482

Uh oh!

{{title}}

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

GSoC 2026 – Interest in Behavioral Evaluation Test Framework (Idea #2) #20482

Uh oh!

KenWuqianghao Feb 26, 2026

What already exists

Where the gaps are

What I'd build

Questions for mentors

Replies: 0 comments

KenWuqianghao
Feb 26, 2026