Context
Phase 4 PR-6 (T057 decoder accuracy harness) landed the SC-004 measurement gate as crates/engine/tests/decoder_accuracy.rs::resolution_rate_at_0_85. The test is #[ignore]-marked because the decoder's current empirical accuracy is well below the spec's 85% target. The complementary resolution_rate_does_not_regress is always-on at a 50% floor and prevents the accuracy from getting worse.
This issue tracks closing the gap so the #[ignore] can be removed and SC-004 lands as a load-bearing gate.
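A minimal sketch of how the two gates relate, assuming integer-percent thresholds. The test names come from the harness; `meets_rate`, `measured`, `TARGET_PCT`, and the `_PCT` suffix on the floor constant are hypothetical stand-ins, and the real constant's type and the harness internals may differ.

```rust
/// Spec target for SC-004, in percent.
const TARGET_PCT: u32 = 85;
/// Always-on floor, in percent. Assumption: the real AGGREGATE_FLOOR_REGRESSION
/// may be stored differently (for example as a fraction).
const AGGREGATE_FLOOR_REGRESSION_PCT: u32 = 50;

/// True when resolved / total is at least `pct` percent (integer math, no rounding).
fn meets_rate(resolved: u32, total: u32, pct: u32) -> bool {
    resolved * 100 >= total * pct
}

/// Hypothetical placeholder for a harness run; returns the 2026-04-25 capture.
fn measured() -> (u32, u32) {
    (138, 260)
}

/// The spec gate: skipped by default, run only with `-- --ignored`.
#[test]
#[ignore = "decoder accuracy is below the SC-004 target"]
fn resolution_rate_at_0_85() {
    let (resolved, total) = measured();
    assert!(meets_rate(resolved, total, TARGET_PCT));
}

/// The regression floor: always on, fails if accuracy drops below the floor.
#[test]
fn resolution_rate_does_not_regress() {
    let (resolved, total) = measured();
    assert!(meets_rate(resolved, total, AGGREGATE_FLOOR_REGRESSION_PCT));
}
```

With the captured 138/260, the floor test passes and the ignored gate would fail if it were run, which matches the state this issue describes.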
Current state (2026-04-25 capture, branch 004-phase4-pr6-bench-accuracy-gates)
Per-class breakdown:
GarbledDelimiter: 51/51 (100.0%) ✅
MissingDelimiter: 0/17 ( 0.0%) ❌
Reordering: 41/41 (100.0%) ✅
SupersededToken: 2/3 ( 66.7%) ⚠️
Typo: 26/130 ( 20.0%) ❌
WrongCase: 18/18 (100.0%) ✅
Aggregate: 138/260 ( 53.1%)
To reach 85% aggregate (221/260), the decoder needs to recover roughly:
- +83 fixtures if the gain comes purely from Typo (26/130 → 109/130, about 83.8% Typo-class accuracy)
- +17 fixtures if MissingDelimiter is fully recovered (0/17 → 17/17, 100%) plus +66 from Typo (26/130 → 92/130, about 70.8%)
- Or a mix across both classes — the per-class table above shows where the headroom is.
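The arithmetic behind these numbers can be checked with a short sketch. The function names are illustrative only and are not part of the harness.

```rust
/// Smallest number of resolved fixtures that reaches `target_pct` percent of `total`.
fn required_resolved(total: u32, target_pct: u32) -> u32 {
    // Integer ceiling of total * target_pct / 100.
    (total * target_pct + 99) / 100
}

/// Additional fixtures that must resolve to reach the target (0 if already met).
fn fixtures_to_recover(total: u32, resolved: u32, target_pct: u32) -> u32 {
    required_resolved(total, target_pct).saturating_sub(resolved)
}

/// Class accuracy in percent after `gained` more fixtures of that class resolve.
fn class_accuracy_after(resolved: u32, gained: u32, class_total: u32) -> f64 {
    100.0 * f64::from(resolved + gained) / f64::from(class_total)
}
```

For the capture above, `required_resolved(260, 85)` is 221 and `fixtures_to_recover(260, 138, 85)` is 83. A Typo-only path gives `class_accuracy_after(26, 83, 130)`, about 83.8%. If MissingDelimiter contributes its 17, the Typo class needs `class_accuracy_after(26, 66, 130)`, about 70.8%.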
Specific gaps surfaced by the harness
The first five unresolved samples from the Typo class (representative, not exhaustive — full list reproducible by running the gate):
"TOP SECRET//SI/UK//NOFORN" → expected "TOP SECRET//SI/TK//NOFORN". Decoder returned Unambiguous(TopSecret) but did NOT correct the SCI sub-compartment typo (UK → TK). The fuzzy matcher's per-token pass appears not to cover SCI sub-compartment positions.
"SECRET//USAR-..." → expected "SECRET//SAR-...". Decoder produced 3 SCI controls instead of recognizing the multi-word SAR program identifier (USAR- typo prefix).
"TPP SECRET//SI//NOFORN" → expected "TOP SECRET//SI//NOFORN". Decoder lost the classification entirely (cls=None); TPP did not fuzzy-match TOP.
"SECRET//SAR-BP-J1 2J54-..." → expected "SECRET//SAR-BP-J12 J54-...". Intra-SAR-token typo (whitespace shift inside a multi-word SAR program identifier).
"SECRET//SAR-...//NOFORON" → expected "...//NOFORN". Returned zero-candidate. NOFORON is edit-distance-1 from NOFORN (insertion) but the fuzzy matcher rejected it — likely the per-token MIN_FUZZY_LEN gate or insertion handling.
Likely fix areas (decoder)
Based on the failure patterns above, the candidate work breakdown:
- NOFORON zero-candidate is the cleanest case. Probably a MIN_FUZZY_LEN edge or an insertion-handling gap in marque-core::fuzzy::FuzzyVocabMatcher.
- [...] // separators in canonical positions. Probably a missing transform in generate_candidate_bytes.
- TPP SECRET is edit-distance-1 from TOP SECRET, but the matcher returned cls=None. Worth confirming whether TPP is being normalized differently from T0P-style typos.
Acceptance
- cargo test -p marque-engine --test decoder_accuracy --features decoder-harness -- --ignored exits 0.
- #[ignore] is removed from resolution_rate_at_0_85.
- The regression-floor constant AGGREGATE_FLOOR_REGRESSION in the same file is ratcheted up alongside the decoder improvements, so a future regression below the new measured rate also fails CI.
Constitution / spec references
- Spec SC-004 (specs/004-constraints-decoder-vocab/spec.md line 149): "Of a mangled-marking fixture of at least 200 labeled cases, at least 85 percent are resolved to the expected canonical marking when the probabilistic recognizer's aggregate confidence threshold is set to 0.85 or higher."
- Constitution Principle VIII (Authoritative Source Fidelity): any new fuzzy-correction transform that touches CAPCO syntax must cite the relevant §A–H passage in crates/capco/docs/CAPCO-2016.md.
Out of scope
- Lowering SC-004 below 85%. The spec target stands.
- Removing fixtures from tests/fixtures/mangled/ to inflate the rate. The SC-004 floor of ≥200 cases is enforced by the harness's MIN_FIXTURE_COUNT constant.