[tooling] Sonnet recall benchmark for deep-review-pro

## User Story

As the maintainer of the review workflows, I want a recall benchmark that runs the new `deep-review-pro` orchestrator against five merged PRs that Sonnet reviewed before #218, all drawn from a single stable reviewer-config window, and that verifies 100% recall of all blocking findings, so that I have evidence the pro workflow matches or exceeds historical Sonnet review coverage before the rename lands in #435.

## Context

Sonnet's reviews on this repo distinguish blocking findings from advisory ones:

| Signal | Treatment |
|---|---|
| Review `state == "CHANGES_REQUESTED"` | Every finding in body is **blocking** |
| Review `state == "APPROVED"` with `### Bug:` / `### Issue:` heading | Blocking |
| Inline review comment on a `CHANGES_REQUESTED` review | Blocking |
| Phrase `(not a blocker)`, `Minor gap`, `out of scope`, `Nit:`, `Consider:` | Non-blocking |

This classification was confirmed by inspecting PR #205 (CHANGES_REQUESTED, `### Bug: head_sha is no longer guaranteed to be a local git object`) and PR #213 (APPROVED with explicit `(not a blocker)` advisory).
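The signal table above can be read as a small decision function. The following is a minimal sketch, not the benchmark's actual classifier: `classify_finding` is a hypothetical helper, the order in which the signals are checked is an assumption (the table does not say which signal wins when two apply), and the `"unclassified"` fallback is an assumption for findings that carry none of the listed signals. For an inline comment, the caller is assumed to pass the state of the parent review.

```python
import re

# Phrases from the signal table that mark a finding as non-blocking.
ADVISORY_PHRASES = ("(not a blocker)", "Minor gap", "out of scope", "Nit:", "Consider:")

# Headings that mark a blocking finding inside an APPROVED review.
BLOCKING_HEADING = re.compile(r"^###\s+(Bug|Issue):", re.MULTILINE)


def classify_finding(review_state: str, text: str) -> str:
    """Return "blocking", "advisory", or "unclassified" for one finding.

    Assumed precedence: review state first, then advisory phrases,
    then blocking headings.
    """
    if review_state == "CHANGES_REQUESTED":
        # Table row 1 and row 3: every finding on such a review is blocking.
        return "blocking"
    if any(phrase in text for phrase in ADVISORY_PHRASES):
        return "advisory"
    if review_state == "APPROVED" and BLOCKING_HEADING.search(text):
        return "blocking"
    # No signal from the table applies; leave the decision to a human.
    return "unclassified"
```

Returning `"unclassified"` rather than guessing keeps ambiguous findings visible during the corpus build instead of silently placing them in one bucket.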

Reviewer config has changed over time (see memory `[Reviewer config & model selection]`); to keep classification consistent, the corpus PRs must all come from a single <= 4-week window with stable reviewer config. PR #205 is included to validate S5b's CI semantic-review capability.

Until #435 lands, the orchestrator may still live at the provisional `deep-review-next` path. The benchmark requirements should use the intended stable name `deep-review-pro`; implementation may map that to the provisional path during the rollout window.

## Acceptance Criteria

- **Given** a single <= 4-week window before #218 with stable reviewer config (window documented in `eval/README.md`), **when** I select corpus PRs, **then** five merged PRs from that window span small/medium/large diffs and multiple domains (Playwright, CI, scripts, MCP, docs+tooling), with PR #205 included.
- **Given** the corpus, **when** the corpus build step completes, **then** each PR file lists Sonnet's findings classified into blocking / advisory / explicitly-excluded buckets, each finding tagged with the expected agent, and each finding records the `commit_id` it was reviewed against.
- **Given** a Sonnet finding was fixed before merge (the flagged code no longer exists in unfixed form in the merged diff), **when** the corpus is built, **then** that finding is marked `~~B<n>~~ (resolved before merge - exclude from recall)` and not counted in recall scoring.
- **Given** the corpus, **when** I run `deep-review-pro <PR#>` against each PR, **then** every blocking finding has a matching `deep-review-pro` finding, where a match means the same file region within +/-5 lines and the same category.
- **Given** any blocking miss, **when** the benchmark fails, **then** the report identifies the responsible agent and the prose to tighten; the corresponding agent story is reopened or a follow-up patch issue is created before this story closes.
- **Given** the advisory findings in the corpus, **when** the benchmark runs, **then** advisory recall is tracked per skill version in `eval/results/` as a regression metric with a soft target of >= 80% (does not gate this story).
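
The matching and scoring rules in the criteria above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark implementation: the `Finding` type, its field names, and the `matches` and `recall` helpers are all hypothetical, and "file region" is assumed to mean the same file path plus a single representative line number.

```python
from dataclasses import dataclass

LINE_TOLERANCE = 5  # the "+/-5 lines" window from the acceptance criteria


@dataclass(frozen=True)
class Finding:
    path: str
    line: int
    category: str
    resolved_before_merge: bool = False  # True means excluded from recall


def matches(baseline: Finding, candidate: Finding) -> bool:
    """A candidate matches when file, category, and line window all agree."""
    return (
        baseline.path == candidate.path
        and baseline.category == candidate.category
        and abs(baseline.line - candidate.line) <= LINE_TOLERANCE
    )


def recall(baseline: list[Finding], candidates: list[Finding]) -> tuple[int, int]:
    """Return (matched, total) over baseline findings still active at merge."""
    active = [f for f in baseline if not f.resolved_before_merge]
    matched = sum(1 for f in active if any(matches(f, c) for c in candidates))
    return matched, len(active)
```

Returning the raw counts instead of a ratio avoids a division by zero when every baseline finding was resolved before merge. The blocking gate would then pass when `matched == total`, and the advisory soft target when `matched / total >= 0.8`; how a `0/0` result should be reported is left to the scoring rules in `eval/README.md`.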

## Implementation Hint

Target layout after #435:

```
.claude/skills/deep-review-pro/eval/
├── README.md            # window, scoring rules, how to add a PR
├── corpus/              # one .md per PR - Sonnet baseline + classification + commit_id per finding
└── results/             # one .md per PR per skill version
```

During the rollout window before #435, use the provisional `.claude/skills/deep-review-next/eval/` path if needed, but keep documentation and result labels oriented around the stable `deep-review-pro` name.
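
One way to map the stable name to the provisional path during the rollout window is a small resolver. This is a sketch only: `resolve_eval_dir` and its `repo_root` parameter are hypothetical, and preferring the stable path whenever it exists is an assumption about the intended behaviour.

```python
from pathlib import Path

# The two layouts named in this issue.
STABLE = Path(".claude/skills/deep-review-pro/eval")
PROVISIONAL = Path(".claude/skills/deep-review-next/eval")


def resolve_eval_dir(repo_root: Path) -> Path:
    """Prefer the stable path; fall back to the provisional one if only it exists."""
    stable = repo_root / STABLE
    provisional = repo_root / PROVISIONAL
    if stable.is_dir() or not provisional.is_dir():
        # Either the rename has landed, or neither directory exists yet
        # and new files should be created under the stable name.
        return stable
    return provisional
```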

Fetch reviews/comments from three GitHub API endpoints:
- `/repos/.../pulls/<N>/reviews` - top-level reviews (state field)
- `/repos/.../pulls/<N>/comments` - inline review comments
- `/repos/.../issues/<N>/comments` - general PR comments

Filter by `user.login == "claude[bot]"`.
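
The fetch-and-filter step could look roughly like the sketch below, using only the Python standard library. The helper names (`endpoint_urls`, `by_reviewer`, `fetch_json`) and the `owner`, `repo`, and `token` parameters are assumptions for illustration; the repository path is elided in this issue, so it is left as a parameter here. Pagination is not handled.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"
REVIEWER_LOGIN = "claude[bot]"


def endpoint_urls(owner: str, repo: str, pr_number: int) -> dict[str, str]:
    """Build the three endpoint URLs listed above for one PR."""
    base = f"{API_ROOT}/repos/{owner}/{repo}"
    return {
        "reviews": f"{base}/pulls/{pr_number}/reviews",
        "inline_comments": f"{base}/pulls/{pr_number}/comments",
        "issue_comments": f"{base}/issues/{pr_number}/comments",
    }


def by_reviewer(items: list[dict], login: str = REVIEWER_LOGIN) -> list[dict]:
    """Keep only items whose user.login equals the reviewer login."""
    return [i for i in items if (i.get("user") or {}).get("login") == login]


def fetch_json(url: str, token: str) -> list[dict]:
    """Fetch a single page of results (no pagination in this sketch)."""
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A corpus build step would call `fetch_json` on each URL from `endpoint_urls` and pass the result through `by_reviewer` before classification.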

## Definition of Done

- [x] Five PRs `< #218` from a single <= 4-week stable-config window selected; PR #205 is one of them. _Verified during closed PR #533: selected PRs #195, #197, #201, #205, and #213 from 2026-04-10 through 2026-04-14._
- [x] Each blocking finding tagged with expected agent and `commit_id`. _Verified during corpus build in closed PR #533._
- [x] Findings fixed before merge marked excluded from recall. _Verified during corpus build in closed PR #533._
- [ ] `deep-review-pro <PR#>` run against each; results captured in `eval/results/`. _Not verified: Codex produced manual substitution artifacts, not literal Claude `/deep-review-pro` execution._
- [x] **100% recall on blocking findings** across all five (HARD pass - any miss triggers a follow-up patch on the responsible agent before this story closes). _Verified only as an exclusion audit: the selected corpus had `0/0` active blocking findings after fixed-before-merge and withdrawn/accepted exclusions._
- [x] Advisory recall >= 80% (soft target; misses logged for tuning). _Verified in the Codex manual substitution audit as 11/11 advisory findings._
- [x] `eval/README.md` documents the window, scoring rules, and how to add a PR for future iterations. _Verified in closed PR #533, but not merged because the data was intentionally discarded._
- [x] Estimate = 3 set in [Project #1](https://github.com/users/hubertgajewski/projects/1). _Verified via `gh project item-list`: issue #434 has `estimate: 3`._
- [x] Actual hours recorded in [Project #1](https://github.com/users/hubertgajewski/projects/1). _Verified via `gh project item-list`: issue #434 has `actual_hours: 1`._

