User Story
As the maintainer of the review workflows, I want a recall benchmark that runs the new deep-review-pro orchestrator against five merged PRs reviewed by Sonnet before #218 within a single stable reviewer-config window, and verifies 100% recall of all blocking findings, so that I have evidence the pro workflow matches or exceeds historical Sonnet review coverage before the rename lands in #435.
Context
Sonnet's reviews on this repo distinguish blocking from advisory:
| Signal | Treatment |
| --- | --- |
| Review state == "CHANGES_REQUESTED" | Every finding in body is blocking |
| Review state == "APPROVED" with ### Bug: / ### Issue: heading | Blocking |
| Inline review comment on a CHANGES_REQUESTED review | Blocking |
| Phrase (not a blocker), Minor gap, out of scope, Nit:, Consider: | Non-blocking |
Confirmed by inspecting PR #205 (CHANGES_REQUESTED, ### Bug: head_sha is no longer guaranteed to be a local git object) and PR #213 (APPROVED with explicit (not a blocker) advisory).
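The signal-to-treatment rules above could be encoded roughly as follows. This is a minimal sketch, not the benchmark's actual implementation: the function name `classify_finding`, the regex-based marker matching, the precedence order (CHANGES_REQUESTED first, then advisory phrases, then blocking headings), and the default of "advisory" for anything unmatched are all assumptions made for illustration.

```python
import re

# Advisory phrases from the classification rules; matched case-insensitively.
# Word boundaries keep "Nit:" from matching inside words such as "unit:".
ADVISORY_MARKERS = re.compile(
    r"\(not a blocker\)|\bminor gap\b|\bout of scope\b|\bnit:|\bconsider:",
    re.IGNORECASE,
)

# Headings that mark a blocking finding inside an APPROVED review.
BLOCKING_HEADING = re.compile(r"^### (Bug|Issue):", re.MULTILINE)


def classify_finding(review_state: str, text: str) -> str:
    """Return "blocking" or "advisory" for one finding (hypothetical helper)."""
    # Every finding on a CHANGES_REQUESTED review, body or inline, is blocking.
    if review_state == "CHANGES_REQUESTED":
        return "blocking"
    # Assumption: an explicit advisory phrase outranks a Bug/Issue heading.
    if ADVISORY_MARKERS.search(text):
        return "advisory"
    if review_state == "APPROVED" and BLOCKING_HEADING.search(text):
        return "blocking"
    # Assumption: anything else is treated as advisory.
    return "advisory"
```

Under these assumptions, the PR #205 finding (CHANGES_REQUESTED) classifies as blocking and the PR #213 note carrying (not a blocker) classifies as advisory. The precedence between an advisory phrase and a ### Bug: heading in the same finding is a judgment call that the eval README would need to settle.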
Reviewer config has changed over time (see memory [Reviewer config & model selection]); to keep classification consistent, the corpus PRs must all come from a single <= 4-week window with stable reviewer config. PR #205 is included to validate S5b's CI semantic-review capability.
Until #435 lands, the orchestrator may still live at the provisional deep-review-next path. The benchmark requirements should use the intended stable name deep-review-pro; implementation may map that to the provisional path during the rollout window.
Acceptance Criteria
Given the stable reviewer-config window (documented in eval/README.md), when I select corpus PRs, then five merged PRs from that window span small/medium/large diffs and multiple domains (Playwright, CI, scripts, MCP, docs+tooling), with PR #205 included.
Given the corpus, when the corpus build step completes, then each PR file lists Sonnet's findings classified into blocking / advisory / explicitly-excluded buckets, with each finding tagged with the expected agent and recording the commit_id it was reviewed against.
Given a Sonnet finding was fixed before merge (the flagged code no longer exists in unfixed form in the merged diff), when the corpus is built, then that finding is marked ~~B<n>~~ (resolved before merge - exclude from recall) and is not counted in recall scoring.
Given the corpus, when I run deep-review-pro <PR#> against each PR, then every blocking finding has a matching deep-review-pro finding (file region +/-5 lines, same category).
Given any blocking miss, when the benchmark fails, then the report identifies the responsible agent and the prose to tighten; the corresponding agent story is reopened or a follow-up patch issue is created before this story closes.
Given advisory recall, when the benchmark runs, then advisory recall >= 80% per skill version is tracked in eval/results/ as a regression metric (soft target - does not gate this story).
Implementation Hint
Target layout after #435:
.claude/skills/deep-review-pro/eval/
├── README.md # window, scoring rules, how to add a PR
├── corpus/ # one .md per PR - Sonnet baseline + classification + commit_id per finding
└── results/ # one .md per PR per skill version
During the rollout window before #435, use the provisional .claude/skills/deep-review-next/eval/ path if needed, but keep documentation and result labels oriented around the stable deep-review-pro name.
Fetch reviews/comments from three GitHub API endpoints:
/repos/.../pulls/<N>/reviews - top-level reviews (state field)
/repos/.../pulls/<N>/comments - inline review comments
/repos/.../issues/<N>/comments - general PR comments
Filter by user.login == "claude[bot]".
Definition of Done
Five merged PRs < #218 from a single <= 4-week stable-config window selected; PR #205 is one of them. Verified during closed PR #533 (#434 Add deep-review-pro recall benchmark): selected PRs #195 (#175 Add /generate-stubs skill for bulk coverage gap handling), #197 (#173 Add selector fix proposals via Claude Sonnet), #201 (#176 Add self-healing GitHub Actions workflow for selector repairs), #205 (#203 Checkout default branch so self-healing script is always up to date), and #213 (#200 Remove CLAUDE_DIAGNOSIS fallback and add AI_MODEL_FAST/AI_MODEL_STRONG overrides), from 2026-04-10 through 2026-04-14.
Corpus built with a commit_id recorded for each finding. Verified during corpus build in closed PR #533.
deep-review-pro <PR#> run against each; results captured in eval/results/. Not verified: Codex produced manual substitution artifacts, not literal Claude /deep-review-pro execution.
100% recall on blocking findings across all five (HARD pass - any miss triggers a follow-up patch on the responsible agent before this story closes). Verified only as an exclusion audit: the selected corpus had 0/0 active blocking findings after fixed-before-merge and withdrawn/accepted exclusions.
Advisory recall >= 80% (soft target; misses logged for tuning). Verified in the Codex manual substitution audit as 11/11 advisory findings.
eval/README.md documents the window, scoring rules, and how to add a PR for future iterations. Verified in closed PR #533, but not merged because the data was intentionally discarded.
gh project item-list: issue #434 ([tooling] Sonnet recall benchmark for deep-review-pro) has estimate: 3.
gh project item-list: issue #434 has actual_hours: 1.
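The matching rule from the acceptance criteria (same category, file region +/-5 lines) and the exclusion of findings resolved before merge could be scored along the following lines. This is only a sketch under stated assumptions: the `Finding` dataclass, its field names, and the `is_match` and `recall` helpers are hypothetical, and a "file region" is simplified here to a single line number per finding.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Finding:
    """One review finding (hypothetical shape, not the corpus file format)."""
    path: str
    line: int
    category: str
    resolved_before_merge: bool = False  # the ~~B<n>~~ case: excluded from recall


def is_match(baseline: Finding, candidate: Finding, tolerance: int = 5) -> bool:
    """Same file, same category, and within +/-tolerance lines."""
    return (
        baseline.path == candidate.path
        and baseline.category == candidate.category
        and abs(baseline.line - candidate.line) <= tolerance
    )


def recall(baseline: Sequence[Finding], candidates: Sequence[Finding]) -> Optional[float]:
    """Fraction of active baseline findings matched by at least one candidate.

    Returns None when no active findings remain after exclusions (the 0/0
    case), so an empty denominator is not reported as 100% recall.
    """
    active = [f for f in baseline if not f.resolved_before_merge]
    if not active:
        return None
    hits = sum(1 for f in active if any(is_match(f, c) for c in candidates))
    return hits / len(active)
```

Returning None for an empty denominator keeps a 0/0 result distinguishable from a genuine pass, which matters for the blocking-recall item above, where the corpus ended up with no active blocking findings.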