Comparing changes

User feedback after v0.0.19's prompt rewrite: "keep going, and revise much more deeply". Deeper this time: not just better prompts, but a structural change to the pipeline itself.

## The biggest single quality multiplier we can ship

A second LLM call on the stitched output catches the 20-30% of L3 errors that the first pass misses. This is the single biggest quality lift short of changing the model. Default ON.

## Self-review (Stage 6): wired into iter_events

After all chunks complete:

1. Build a payload of {edited_markdown, change_log, library_context, briefing}
2. Call the model with the self_review system prompt
3. Parse JSON: {additional_corrections, rollbacks, promotions_to_user_review, data_conflicts, format_issues}
4. Apply additional_corrections to the markdown via string replace (only if `old` actually appears in the doc; this is the model hallucination guard)
5. Append review changes to change_log with stage="self_review"
6. Surface diagnostics on the `complete` event's `self_review` field

Cost: 1 extra LLM call per Run, regardless of chunk count. Auto-skipped when the stitched output exceeds 100k chars, to bound cost on monster transcripts.

SSE events: self_review_start / self_review_done / self_review_error.

UI shows "↻ Self-review — re-reading for missed corrections…", then "+N fixes · M data conflicts to check · K ambiguous items flagged".

Pipeline param: `enable_self_review: bool = True`. Tests that exercise chunking/main-edit alone opt out explicitly.

## Self-review prompt (06_self_review.md)

Rewritten from a thin checklist into a 5-check routine:

1. Proper noun audit (the headline reason; first pass misses 20-30%)
2. Speaker consistency across the full document
3. Cross-section data consistency (ARR / headcount / funding agree)
4. Format hygiene (leftover [Speaker N], mixed punctuation, etc.)
5. Over-correction rollbacks (confidence < 0.7 sanity check)

Hard JSON output schema. "Don't second-guess high-confidence work."

## L3.5 rewritten

Same treatment as L3 got in v0.0.19. The conservative L3.5 table now wraps a mandatory 7-check routine:

1. Sentence boundary correctness
2. Stutter dedup (exact X X only)
3. Word-order garbling
4. Missing function words
5. Speaker switch swallowed mid-paragraph
6. Number/letter confusion in spoken digits
7. Same-sound substitution destroying meaning (cohort vs co-host)

Hard confidence threshold table per change type. Explicit "what L3.5 does NOT do" list keeps the model from over-correcting.

## Tests

284 → 292 (+8). All passing. Ruff clean.

test_pipeline_self_review.py covers:

- event ordering
- additional corrections actually applied
- token counts include both passes
- opt-out flag
- auto-skip for huge output
- garbage response doesn't crash
- diagnostics surfaced in `complete`
- corrections ignored where `old` is not in the doc (hallucination guard)

Existing tests that asserted exact token counts were updated to reflect the new 2-pass total.
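The guard logic in Stage 6 (skip on oversized output, tolerate a garbage response, apply only corrections whose `old` text is really in the document) could look roughly like the sketch below. This is not the code from this commit. The function names, the `new` key on each correction, the shape of the change-log entries, and the choice to replace every occurrence are all assumptions made for illustration; only `additional_corrections`, `old`, `stage="self_review"`, `enable_self_review`, and the 100k-char limit come from the description above.

```python
import json
from typing import Any

# From the commit message: self-review is skipped above 100k chars.
MAX_REVIEW_CHARS = 100_000


def should_self_review(markdown: str, enable_self_review: bool = True) -> bool:
    """Decide whether to spend the extra LLM call (hypothetical helper)."""
    return enable_self_review and len(markdown) <= MAX_REVIEW_CHARS


def parse_review(raw: str) -> dict[str, Any]:
    """Parse the model's JSON reply; a garbage reply yields an empty dict."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return {}
    return data if isinstance(data, dict) else {}


def apply_corrections(
    markdown: str,
    review: dict[str, Any],
    change_log: list[dict[str, Any]],
) -> tuple[str, int]:
    """Apply additional_corrections by string replace; return (text, applied count)."""
    applied = 0
    for item in review.get("additional_corrections") or []:
        if not isinstance(item, dict):
            continue
        # "new" is an assumed key name; the source only names "old".
        old, new = item.get("old"), item.get("new")
        if not isinstance(old, str) or not isinstance(new, str) or not old:
            continue
        if old not in markdown:
            # Hallucination guard: the model cited text that is not in the doc.
            continue
        # Assumption: every occurrence is replaced, not just the first.
        markdown = markdown.replace(old, new)
        change_log.append({"stage": "self_review", "old": old, "new": new})
        applied += 1
    return markdown, applied
```

Returning the applied count alongside the text would make it easy to feed the "+N fixes" figure in the UI message, though how the real pipeline derives that number is not stated here.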


Commits on May 23, 2026
