User feedback after v0.0.19's prompt rewrite (translated from Chinese):
"Keep going, revise much more deeply." So this release goes deeper:
not just better prompts, but a structural change to the pipeline itself.
## The biggest single quality multiplier we can ship
A second LLM call on the stitched output catches 20-30% more L3 errors
that the first pass missed. This is the single biggest quality lift
short of changing the model. Default ON.
## Self-review (Stage 6) — wired into iter_events
After all chunks complete:
1. Build a payload of {edited_markdown, change_log, library_context,
briefing}
2. Call the model with the self_review system prompt
3. Parse JSON: {additional_corrections, rollbacks,
promotions_to_user_review, data_conflicts, format_issues}
4. Apply additional_corrections to the markdown via string replace
(only if `old` actually appears in the doc — model hallucination
guard)
5. Append review changes to change_log with stage="self_review"
6. Surface diagnostics on the `complete` event's `self_review` field
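
Steps 4 and 5 can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function name `apply_additional_corrections`, the `new` field, and the shape of the change_log entries are assumptions. Only `old`, the string-replace approach, and `stage="self_review"` come from the description above. The sketch replaces every occurrence of `old`, which may differ from what the pipeline does.

```python
from typing import Any


def apply_additional_corrections(
    markdown: str,
    corrections: list[dict[str, Any]],
    change_log: list[dict[str, Any]],
) -> str:
    """Apply self-review corrections, skipping any whose `old` text is absent.

    Hypothetical helper: `new` and the change_log entry shape are assumed.
    """
    for item in corrections:
        old, new = item.get("old"), item.get("new")
        # Ignore malformed entries instead of raising.
        if not isinstance(old, str) or not isinstance(new, str) or not old:
            continue
        # Hallucination guard: the model may "correct" text that is not there.
        if old not in markdown:
            continue
        markdown = markdown.replace(old, new)  # assumption: replace all occurrences
        change_log.append({"stage": "self_review", "old": old, "new": new})
    return markdown
```

A correction whose `old` string never appears in the document leaves both the markdown and the change_log untouched, so the diagnostics only count fixes that were really applied.
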

Cost: 1 extra LLM call per Run, regardless of chunk count. Self-review
is skipped automatically when the stitched output exceeds 100k
characters, which bounds cost on very long transcripts.
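
The gating logic amounts to two conditions. A minimal sketch, assuming a hypothetical helper `should_run_self_review` and constant `SELF_REVIEW_MAX_CHARS` (neither name is from the codebase); the 100k threshold and the `enable_self_review` flag are from the description here:

```python
SELF_REVIEW_MAX_CHARS = 100_000  # threshold stated in the text; constant name is assumed


def should_run_self_review(
    stitched: str,
    enable_self_review: bool = True,
    max_chars: int = SELF_REVIEW_MAX_CHARS,
) -> bool:
    """Run the second pass only if enabled and the output is not over the limit."""
    return enable_self_review and len(stitched) <= max_chars
```

Because the check runs once on the stitched output, the extra cost stays at one call no matter how many chunks the first pass used.
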
SSE events: self_review_start / self_review_done / self_review_error.
UI shows "↻ Self-review — re-reading for missed corrections…" then
"+N fixes · M data conflicts to check · K ambiguous items flagged".
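
For illustration, the summary line could be derived from the self-review diagnostics like this. The function name `summarize_self_review` is hypothetical, and the mapping of N, M, and K to `additional_corrections`, `data_conflicts`, and `promotions_to_user_review` is an assumption based on the field names; the real UI may count differently (for example, only corrections that were actually applied).

```python
from typing import Any


def summarize_self_review(diagnostics: dict[str, Any]) -> str:
    """Build the one-line UI summary from self-review diagnostics (assumed mapping)."""
    fixes = len(diagnostics.get("additional_corrections", []))
    conflicts = len(diagnostics.get("data_conflicts", []))
    flagged = len(diagnostics.get("promotions_to_user_review", []))
    return (
        f"+{fixes} fixes · {conflicts} data conflicts to check · "
        f"{flagged} ambiguous items flagged"
    )
```
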
Pipeline param: `enable_self_review: bool = True`. Tests that exercise
chunking/main-edit alone opt out explicitly.
## Self-review prompt (06_self_review.md)
Rewritten from a thin checklist into a 5-check routine:
1. Proper noun audit (the headline reason; first pass misses 20-30%)
2. Speaker consistency across full document
3. Cross-section data consistency (ARR / headcount / funding agree)
4. Format hygiene (leftover [Speaker N], mixed punctuation, etc.)
5. Over-correction rollbacks (confidence < 0.7 sanity check)

The prompt enforces a hard JSON output schema and tells the model:
"Don't second-guess high-confidence work."
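
Even with a hard schema, the caller still has to survive a malformed reply. A minimal sketch of tolerant parsing, assuming a hypothetical `parse_self_review` helper; the five keys are the ones the self-review stage parses, while falling back to empty lists is an assumption about how a garbage response is handled:

```python
import json
from typing import Any

REVIEW_KEYS = (
    "additional_corrections",
    "rollbacks",
    "promotions_to_user_review",
    "data_conflicts",
    "format_issues",
)


def parse_self_review(raw: str) -> dict[str, list[Any]]:
    """Parse the model's reply; return empty lists for anything unusable."""
    empty: dict[str, list[Any]] = {key: [] for key in REVIEW_KEYS}
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return empty  # not JSON at all
    if not isinstance(data, dict):
        return empty  # valid JSON, but not an object
    # Keep only the known keys, and only when the value is a list.
    return {
        key: data[key] if isinstance(data.get(key), list) else []
        for key in REVIEW_KEYS
    }
```
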
## L3.5 rewritten
Same treatment as L3 got in v0.0.19. The conservative L3.5 table
now wraps a mandatory 7-check routine:
1. Sentence boundary correctness
2. Stutter dedup (exact X X only)
3. Word-order garbling
4. Missing function words
5. Speaker switch swallowed mid-paragraph
6. Number/letter confusion in spoken digits
7. Same-sound substitution destroying meaning (cohort vs co-host)

Hard confidence threshold table per change type. Explicit "what
L3.5 does NOT do" list keeps the model from over-correcting.
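
To pin down what "exact X X only" means for the stutter check, here is a small regex sketch. It is purely illustrative: L3.5 is a prompt-driven model pass, not a regex, and `dedup_exact_stutter` is a hypothetical name. The sketch collapses a word only when it is immediately repeated with identical spelling and case.

```python
import re

# A word followed by one or more exact repeats of itself, separated by whitespace.
_STUTTER = re.compile(r"\b(\w+)(\s+\1\b)+")


def dedup_exact_stutter(text: str) -> str:
    """Collapse exact immediate repeats ("we we" -> "we"); leave near-repeats alone."""
    return _STUTTER.sub(r"\1", text)
```

Note that a purely mechanical rule like this would also collapse legitimate doubles such as "had had", which is one reason a per-change confidence threshold is useful.
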
## Tests
284 → 292 (+8). All passing. Ruff clean.
test_pipeline_self_review.py covers: event ordering, additional
corrections actually applied, token counts include both passes,
opt-out flag, auto-skip for huge output, garbage response doesn't
crash, diagnostics surfaced on the `complete` event, and corrections
ignored when `old` is not in the doc (hallucination guard).
Existing tests that asserted exact token counts updated to reflect
the new 2-pass total.