[tooling] Sonnet recall benchmark for deep-review-pro #434

@hubertgajewski

User Story

As the maintainer of the review workflows, I want a recall benchmark that runs the new deep-review-pro orchestrator against five merged PRs reviewed by Sonnet before #218 within a single stable reviewer-config window, and verifies 100% recall of all blocking findings, so that I have evidence the pro workflow matches or exceeds historical Sonnet review coverage before the rename lands in #435.

Context

Sonnet's reviews on this repo distinguish blocking from advisory:

| Signal | Treatment |
| --- | --- |
| Review state == "CHANGES_REQUESTED" | Every finding in body is blocking |
| Review state == "APPROVED" with ### Bug: / ### Issue: heading | Blocking |
| Inline review comment on a CHANGES_REQUESTED review | Blocking |
| Phrase (not a blocker), Minor gap, out of scope, Nit:, Consider: | Non-blocking |

Confirmed by inspecting PR #205 (CHANGES_REQUESTED, ### Bug: head_sha is no longer guaranteed to be a local git object) and PR #213 (APPROVED with explicit (not a blocker) advisory).
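
The Signal/Treatment rules above could be encoded roughly as follows. This is a minimal sketch, not the benchmark's actual classifier: the function name `classify_finding`, the "unclassified" fallback, and the precedence between rows (CHANGES_REQUESTED first, then advisory phrases, then headings) are assumptions, since the issue does not say which rule wins when two apply.

```python
import re

# Advisory phrases from the Signal/Treatment table (matched case-insensitively).
ADVISORY = re.compile(
    r"\(not a blocker\)|minor gap|out of scope|\bnit:|\bconsider:",
    re.IGNORECASE,
)
# Headings that mark a blocking finding inside an APPROVED review.
BLOCKING_HEADING = re.compile(r"^###\s+(Bug|Issue):", re.MULTILINE)


def classify_finding(review_state: str, text: str) -> str:
    """Classify one finding (hypothetical helper, not part of the issue).

    `review_state` is the state of the review the finding belongs to; for an
    inline comment, pass the state of its parent review.
    """
    # Assumption: CHANGES_REQUESTED wins over everything else.
    if review_state == "CHANGES_REQUESTED":
        return "blocking"
    # Assumption: an explicit advisory phrase wins over a heading.
    if ADVISORY.search(text):
        return "advisory"
    if review_state == "APPROVED" and BLOCKING_HEADING.search(text):
        return "blocking"
    # Nothing in the table applies; leave it for manual triage.
    return "unclassified"
```

Returning "unclassified" instead of guessing keeps ambiguous findings visible during the corpus build rather than silently counting them in either bucket.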

Reviewer config has changed over time (see memory [Reviewer config & model selection]); to keep classification consistent, the corpus PRs must all come from a single <= 4-week window with stable reviewer config. PR #205 is included to validate S5b's CI semantic-review capability.

Until #435 lands, the orchestrator may still live at the provisional deep-review-next path. The benchmark requirements should use the intended stable name deep-review-pro; implementation may map that to the provisional path during the rollout window.

Acceptance Criteria

  • Given a single <= 4-week window before #218 ("#217 Make PR reviewer provider-switchable (OpenRouter/Qwen default)") with stable reviewer config (window documented in eval/README.md), when I select corpus PRs, then five merged PRs from that window span small/medium/large diffs and multiple domains (Playwright, CI, scripts, MCP, docs+tooling), with PR #205 ("#203 Checkout default branch so self-healing script is always up to date") included.
  • Given the corpus, when the corpus build step completes, then each PR file lists Sonnet's findings classified into blocking / advisory / explicitly-excluded buckets, each finding tagged with the expected agent, and each finding records the commit_id it was reviewed against.
  • Given a Sonnet finding was fixed before merge (the flagged code no longer exists in unfixed form in the merged diff), when the corpus is built, then that finding is marked ~~B<n>~~ (resolved before merge - exclude from recall) and not counted in recall scoring.
  • Given the corpus, when I run deep-review-pro <PR#> against each, then every blocking finding has a matching deep-review-pro finding (file region +/-5 lines, same category).
  • Given any blocking miss, when the benchmark fails, then the report identifies the responsible agent and the prose to tighten; the corresponding agent story is reopened or a follow-up patch issue is created before this story closes.
  • Given the corpus, when the benchmark runs, then advisory recall is recorded per skill version in eval/results/ as a regression metric, with a soft target of >= 80% (does not gate this story).
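
The matching and scoring rules in the criteria above can be sketched as follows. This is an illustration under assumptions, not a prescribed schema: the `Finding` fields, the `recall` helper, and the choice to report 1.0 when a bucket has nothing to score are all hypothetical. Only the +/-5 line tolerance, the same-category requirement, and the exclusion of findings resolved before merge come from the criteria.

```python
from dataclasses import dataclass

LINE_TOLERANCE = 5  # "file region +/-5 lines" from the acceptance criteria


@dataclass(frozen=True)
class Finding:
    """One review finding; field names are illustrative."""
    path: str
    line: int
    category: str
    bucket: str = "blocking"             # "blocking", "advisory", or "excluded"
    agent: str = ""                      # expected agent, used to report misses
    commit_id: str = ""                  # commit the finding was reviewed against
    resolved_before_merge: bool = False  # the ~~B<n>~~ case


def matches(expected: Finding, actual: Finding) -> bool:
    """Same file, same category, and lines within the tolerance."""
    return (
        expected.path == actual.path
        and expected.category == actual.category
        and abs(expected.line - actual.line) <= LINE_TOLERANCE
    )


def recall(baseline, candidates, bucket):
    """Return (recall, missed) for one bucket of baseline findings."""
    # Findings fixed before merge are not counted in recall scoring.
    scored = [
        f for f in baseline
        if f.bucket == bucket and not f.resolved_before_merge
    ]
    if not scored:
        return 1.0, []  # assumption: an empty bucket counts as full recall
    missed = [f for f in scored if not any(matches(f, c) for c in candidates)]
    return (len(scored) - len(missed)) / len(scored), missed
```

A blocking run would then gate on `recall(baseline, candidates, "blocking")[0] == 1.0`, and each entry in the returned `missed` list carries the `agent` tag needed to name the responsible agent in the report.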

Implementation Hint

Target layout after #435:

.claude/skills/deep-review-pro/eval/
├── README.md            # window, scoring rules, how to add a PR
├── corpus/              # one .md per PR - Sonnet baseline + classification + commit_id per finding
└── results/             # one .md per PR per skill version

During the rollout window before #435, use the provisional .claude/skills/deep-review-next/eval/ path if needed, but keep documentation and result labels oriented around the stable deep-review-pro name.
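
One way to keep tooling oriented around the stable name while tolerating the provisional path is to resolve the eval directory at runtime. The sketch below is an assumption about how that mapping might look; `resolve_eval_dir` is a hypothetical helper, and only the two directory names come from this issue.

```python
from pathlib import Path

STABLE_NAME = "deep-review-pro"        # intended name after #435
PROVISIONAL_NAME = "deep-review-next"  # may still exist during rollout


def resolve_eval_dir(repo_root: Path) -> Path:
    """Return the eval directory, preferring the stable skill name."""
    skills = repo_root / ".claude" / "skills"
    for name in (STABLE_NAME, PROVISIONAL_NAME):
        candidate = skills / name / "eval"
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(f"no eval directory found under {skills}")
```

Result labels can still use `STABLE_NAME` regardless of which directory was resolved, so nothing written to results/ needs renaming once #435 lands.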

Fetch reviews/comments from three GitHub API endpoints:

  • /repos/.../pulls/<N>/reviews - top-level reviews (state field)
  • /repos/.../pulls/<N>/comments - inline review comments
  • /repos/.../issues/<N>/comments - general PR comments

Filter by user.login == "claude[bot]".
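
A minimal sketch of the fetch-and-filter step, using only the standard library. The owner and repository are elided in the paths above, so they stay as parameters here. The function names are hypothetical, and the sketch reads only the first page of each endpoint, so a real corpus build would also need to follow pagination.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"
BOT_LOGIN = "claude[bot]"


def endpoint_urls(owner: str, repo: str, pr_number: int) -> dict:
    """Build the three endpoint URLs listed above for one PR."""
    base = f"{API_ROOT}/repos/{owner}/{repo}"
    return {
        "reviews": f"{base}/pulls/{pr_number}/reviews",
        "inline_comments": f"{base}/pulls/{pr_number}/comments",
        "issue_comments": f"{base}/issues/{pr_number}/comments",
    }


def only_bot(items: list, login: str = BOT_LOGIN) -> list:
    """Keep items whose user.login equals the reviewer bot's login."""
    return [i for i in items if (i.get("user") or {}).get("login") == login]


def fetch_json(url: str, token=None):
    """GET one page of JSON from the GitHub REST API."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def fetch_bot_feedback(owner, repo, pr_number, token=None, fetch=fetch_json):
    """Return the bot's reviews and comments, keyed by endpoint kind."""
    urls = endpoint_urls(owner, repo, pr_number)
    return {kind: only_bot(fetch(url, token)) for kind, url in urls.items()}
```

The `fetch` parameter exists so the filtering logic can be exercised with canned responses instead of live API calls.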

Definition of Done

Metadata

Labels

  • enhancement - New feature or request
  • test-quality - Test code quality improvements
  • tooling - Developer tooling, scripts, and IDE configuration
