refactor(wasm): R2 β parse_wasm_config serde reduction (#689 follow-up)#697
Conversation
β¦onfig (#689) TDD step 1 of the R2 serde-reduction refactor. Adds 19 tests to the existing in-module `#[cfg(test)] mod tests` block in `crates/wasm/src/lib.rs` that pin behavior of `parse_wasm_config` before the `#[derive(Deserialize)]` removal: - 9 byte-identity cache-key fixtures capturing exact emission of the current `#[derive(Serialize)] WasmConfigCacheKey` path: None / empty object / empty corrections / deadline-only all map to cache_key=None; classifier_id / threshold / corrections / all-three / deadline+ classifier produce literal byte strings in struct-declaration order (which coincides with alphabetical here β alphabetical pins the shape, struct-declaration order will pin a future field insertion that breaks alphabetical). - 8 type-mismatch loud-failure tests asserting `is_err()` for non- object top-level inputs and field-level type mismatches. Pre-refactor these produce serde-derived error messages; post-refactor they will produce field-name-tagged messages. Tests assert only the loud-vs- silent boundary, not exact wording β both versions pass. - 2 Constitution III preservation tests pinning the silently-ignore- unknown-fields contract. The current `#[derive(Deserialize)]` has no `deny_unknown_fields`; the post-refactor `obj.get(...)` enumeration matches that behavior. WASM target's runtime-config surface MUST NOT widen β these tests catch a regression where a future hand- rolled parser accidentally rejects unknown fields and would block forward-compatible JS callers. All 19 tests pass on `origin/staging` HEAD (pre-refactor). Captured baseline strings via a standalone serde-derive reproducer at authorship; the in-test assertions are the live regression gate. Helpers stay private per the R2 synthesis brief's helper-visibility decision β tests live alongside the existing build_cache_key / parse_deadline_ms tests in `mod tests`, not under `crates/wasm/tests/`, because the integration-test directory has no access to `parse_wasm_config` / `build_cache_key` (intentional encapsulation). Refs #689 Β· refs #696 (R1 precedent)
β¦ey Serialize (#689) Replaces the `#[derive(Deserialize)] WasmConfig` and `#[derive(Serialize)] WasmConfigCacheKey` paths with explicit `serde_json::Value::as_object().get(...)` extraction and direct `serde_json::Map` construction in `crates/wasm/src/lib.rs`. The retired derives were the largest single contributor to `parse_wasm_config`'s twiggy-attributed symbol size after R1 (`HashMap<String, String>` Deserialize monomorphization ~3-4 KB plus per-field Visitor scaffolding ~3 KB plus the cache-key Serialize emission ~1.5 KB). Why an extraction rewrite and not a smaller-knob tweak: - `#[derive(Deserialize)]` on `WasmConfig` monomorphizes a full visitor + per-field __Field enum + a HashMap<String, String> deserializer, each of which is dead-weight code in a WASM build. The four-field accept-list (`classifier_id`, `confidence_threshold`, `corrections`, `deadline_ms`) is a closed contract β enumerating it at the call site via `obj.get(...)` is structurally narrower than the derive (the derive would have happily accepted any serde- deserializable type for a field named `classifier_id`), AND smaller after the linker is done. - `#[derive(Serialize)]` on `WasmConfigCacheKey` carries a similar per-field SerializeMap driver and skip_serializing_if codegen; the manual `serde_json::Map` construction reaches the same wire format directly. Constitution III preservation (binding constraint): - Closed accept-list of `{classifier_id, confidence_threshold, corrections, deadline_ms}` enumerated explicitly in `wasm_config_from_value`. - Unknown JSON fields are silently ignored β preserves the pre-R2 derive behavior (no `#[serde(deny_unknown_fields)]`). Pinned by the new `r2_unknown_field_silently_ignored` / `r2_unknown_field_coexists_with_known` tests. - Field-level type mismatches are loud errors with field- name context (e.g. `"classifier_id must be a string; got number"`). Net gain in diagnostic quality at the cost of serde's column-number context, which JS callers do not use (the outer `from_str::<Value>` parse still surfaces column numbers for malformed JSON). - JSON `null` continues to mean "absent" for every field (preserves the derive's `Option<T>` semantics). Cache-key byte-identity preservation: - `serde_json::Map` without the `preserve_order` feature is BTreeMap-backed and emits in alphabetical key order. `WasmConfig` field names are alphabetical today (`classifier_id` < `confidence_threshold` < `corrections`), which coincides with the struct-declaration order serde's `#[derive(Serialize)]` used. Byte-identity is pinned by the 9 R2 byte-identity tests added in the previous commit. - A future cache-key field that breaks alphabetical (e.g. inserting `aaa_new_field` between `classifier_id` and `confidence_threshold`) will fail the byte-identity tests and require either renaming the field or enabling `serde_json/preserve_order` and inserting in struct-declaration order. - One non-obvious detail: `serde_json::Value::Number` is f64-only. `Number::from_f64(threshold as f64)` widens before formatting and emits the long-form `"0.8500000238418579"` for the 0.85 bit-pattern, not the shortest-roundtrip `"0.85"` that serde's `serialize_f32` path produced. The R2 byte-identity tests caught this on the first build; the fix routes the f32 through `to_string(&threshold)` (uses `serialize_f32`) and then parses-back via `from_str::<Value>` so the resulting `Number` retains the shortest-f32 string form. Cost is one short alloc + one parse on engine-cache-miss only; cheap relative to the bundle savings. Constitution VII Β§IV (scheme-adoption boundary): N/A β `marque-wasm` sits above the rule chain. This commit touches ONLY `crates/wasm/src/lib.rs`. Zero edits to engine, scheme, rules, core, ism, or capco. Verified with `git diff --stat HEAD~1`. Net surface changes: - retired: `#[derive(Deserialize)] WasmConfig` - retired: `#[derive(Serialize)] WasmConfigCacheKey` (struct + impl) - added: `fn wasm_config_from_value(Value) -> Result<WasmConfig, String>` - added: `fn value_type_name(&Value) -> &'static str` - rewrote: `fn build_cache_key(&WasmConfig)` using `serde_json::Map` All 37 marque-wasm unit tests pass (19 R2 tests from commit 1 + 18 pre-existing). All 76 integration tests pass across 10 test binaries (113 total). `cargo +stable clippy -p marque-wasm --all-targets -- -D warnings` is clean. BatchEntry's `#[derive(Deserialize)]` (lib.rs:811) is OUT of scope per the R2 synthesis brief; defer as R2b follow-up only if measurable (two String fields, no Option, expected <1 KB). Refs #689 Β· refs #696 (R1 precedent)
WASM bundle size delta (pre-opt, `wasm-pack --release` with system wasm-opt): pre-R2: 1,372,791 bytes (post-R1 baseline) post-R2: 1,365,032 bytes delta: β7,759 bytes (β0.57%) Above the R2 synthesis-brief 4 KB go/no-go gate. The estimate band in the preflight was 5-8 KB net β the measured delta lands in the lower half of that band, consistent with the preflight's caveat that the dominant `HashMap<String, String>` Deserialize cost is hard to predict exactly because of potential linker deduplication against other String-keyed HashMap deserializers elsewhere in the binary. `lint_10kb` bench (criterion, `cargo bench -p marque-engine --bench lint_latency -- lint_10kb`): pre: [422.35 Β΅s 428.08 Β΅s 434.97 Β΅s] post: [421.49 Β΅s 432.21 Β΅s 447.64 Β΅s] change: +0.45% median, p=0.88 β no statistically significant change. Hot-path neutral, as expected β config parsing is once-per-call, not on the hot path. Constitution Principle I satisfied (SC-001 16 ms ceiling intact; SC-001's threshold is 50Γ the post-R2 median). Also marks the R2 row in the candidate table as landed and inlines an "f32 byte-identity gotcha" note for future serde-reduction work β `Value::Number` is f64-only, so `Number::from_f64(threshold as f64)` widens before formatting and emits the long-form `"0.8500000238418579"` rather than the shortest-roundtrip `"0.85"` that serde's `serialize_f32` produces. The fix is in commit 2; the note in this commit makes the gotcha findable for the next person. Refs #689 Β· refs #696 (R1 precedent)
β¦ons guard + alphabetical-naming constraint cross-reference Two LOW/MEDIUM nits from the 2-reviewer pass on R2 (#689 follow-up, sort-consolidation companion to #696). 1. **rust-specialist nit** β `expect("corrections present")` at `build_cache_key` line 1100 relied on a separately-computed `corrections_present` boolean gate to prove unwrap-safety. Replaced with `if let Some(corrections) = cfg.corrections.as_ref().filter(|c| !c.is_empty())` β same semantic (only emit the `corrections` key when the map is present AND non-empty, matching the retired `#[serde(skip_serializing_if = "Option::is_none")]` shape on `WasmConfigCacheKey`), but the Some-and-non-empty proof is now structurally visible at the binding site rather than relying on a doc-comment to bridge the gap. 2. **general code-reviewer nit** β the `WasmConfig` struct doc-comment described the R2 retirement motive but did not surface the alphabetical-field-naming constraint that the `build_cache_key` doc-comment captures. A developer reading the struct doc first when adding a new field would not see the constraint until they ran the byte-identity tests (which catch the breakage at authorship, but the upfront warning is cheaper). Added a paragraph documenting the constraint with a cross- reference to `build_cache_key`. Declined the LOW field-name-in-error-test pin: the assertion boundary the existing 8 type-mismatch tests pin is loud-vs-silent, which IS the API contract; field-name presence in the error string is implementation-quality, not contract. Pinning it would lock the present error wording (which we explicitly accepted dropping serde's column-number context for) β keeping the boundary contractual rather than wording-coupled. General reviewer flagged this as a judgment call. cargo +stable clippy --workspace --all-targets --all-features -- -D warnings: green cargo test -p marque-wasm: 113 tests green (including all 19 R2 unit tests)
There was a problem hiding this comment.
Pull request overview
This PR continues the WASM bundle-size reduction work (#689 follow-up) by removing serde derive-based config parsing/serialization from marque-wasmβs parse_wasm_config path, replacing it with manual serde_json::Value extraction and direct serde_json::Map cache-key construction while preserving cache-key byte identity via new regression tests.
Changes:
- Replace
#[derive(Deserialize)] WasmConfigparsing with explicitValue-based field extraction (wasm_config_from_value) and improved field-scoped type-mismatch errors. - Remove
WasmConfigCacheKey+#[derive(Serialize)]and build the cache key via directserde_json::Mapconstruction (including an f32 formatting workaround to preserve byte identity). - Add extensive unit tests to pin cache-key byte identity, loud failure on type mismatches, and silent-ignore behavior for unknown fields.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| crates/wasm/src/lib.rs | Refactors WASM config parsing/cache-key construction away from serde derives; adds helpers + regression tests to preserve cache semantics and behavior. |
| claudedocs/wasm-bundle-analysis-2026-05-22.md | Updates investigation notes with measured R2 results and details of the landed approach. |
π‘ Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// explicit `obj.get(...)` form additionally tightens the | ||
| /// runtime-config surface relative to the derive (the closed | ||
| /// accept-list is enumerated at the call site, not implied by the | ||
| /// struct shape). |
| let mut corrections_map = serde_json::Map::new(); | ||
| for (k, v) in corrections { | ||
| corrections_map.insert(k.clone(), serde_json::Value::String(v.clone())); | ||
| } | ||
| map.insert( | ||
| "corrections".to_owned(), | ||
| serde_json::Value::Object(corrections_map), | ||
| ); |
| // `#[derive(Serialize)] WasmConfigCacheKey` path (which used | ||
| // `serialize_f32` directly). The format-then-parse roundtrip | ||
| // here is cheap (one short alloc, one parse of a known-valid | ||
| // numeric literal) and runs only on engine-cache-miss. |
empirically rejected) Copilot's PR #697 review surfaced three inline comments: **#1 (lib.rs:842) β accepted**: doc-comment claimed R2 "tightens the runtime-config surface relative to the derive". Copilot correctly pointed out that the pre-R2 `#[derive(Deserialize)]` on `Option<String>` / `Option<f32>` / `Option<HashMap<String,String>>` / `Option<f64>` was ALREADY type-strict and ALREADY had a closed accepted-field set. There is no concrete pre-R2 case that was accepted and is now rejected. The synthesis brief AND the rust- specialist review both misclaimed "tightening"; Copilot caught it. Reworded the `WasmConfig` doc-comment to state the post-R2 surface matches the pre-R2 Derive exactly β same four field names, same `Option<T>` shape, same loud-failure-on-mismatch, same silent-ignore. The R2 change is structural (binary-size cut + improved error messages with field names), NOT a tightening. **#3 (lib.rs:1104) β accepted**: f32 format-parse roundtrip comment claimed it "runs only on engine-cache-miss" but `build_cache_key` runs on every `parse_wasm_config` call (which is every lint_native / fix_native call), regardless of whether the resulting cache key hits or misses the engine cache. Corrected the comment. **#2 (lib.rs:1124) β empirically REJECTED**: Copilot suggested replacing the `serde_json::Map`-with-cloned-keys path with a borrow-preserving `Vec<(&str,&str)>` sort + direct JSON string construction, to eliminate per-call `corrections` key/value clones. The suggestion is theoretically correct on runtime: O(n) String clones become O(n) borrow-preserving sort comparisons. But empirically β measured locally with `wasm-pack build --target web --profile release-web` (CI-gate profile) β the borrow-preserving rewrite REGRESSED WASM bundle size from 1,364,312 B (original R2) to 1,370,786 B (with Copilot's suggested rewrite). That's a +6,474 B regression β would erase 83% of R2's β7,750 B binary win. The mechanism is the same LTO+ICF lesson captured in PR #696 (R1's `sorted_entries` revert at `crates/capco/src/lattice.rs:249-278`): the `serde_json::Map` / `Value::Object` / `to_string::<Value>` path REUSES serde_json machinery already monomorphized for the Diagnostic JSON output side of `marque-wasm`. The direct-String-construction path forces fresh monomorphizations of `serde_json::to_string::<str>` plus a longer push-based code path. Theoretical clone-count "wins" don't beat empirical binary-size measurements. The runtime cost Copilot flagged is real but small: typical corrections maps in marque usage are 5-20 entries; per-call cost is 10-40 short String allocations, in the noise relative to the ~400 Β΅s lint hot path. The binary savings (6.5 KB) dwarf the runtime cost. Trade favors keeping the original implementation. Documented the trade in the `build_cache_key` doc-comment with a cross-reference to PR #696's empirical-revert precedent β future maintainers see WHY the seemingly-suboptimal clones are preserved. Also brought along: structural Some-and-non-empty proof via `.filter()` (rust-specialist nit from review fix-up b974550) stays in place; that one didn't depend on the body shape Copilot suggested. Empirical measurements (wasm-pack --profile release-web, local): - staging baseline: 1,372,062 B - original R2 (b974550, serde_json::Map): 1,364,312 B (β7,750) - Copilot #2 fix (direct String): 1,370,786 B (β1,276) β rejected - this commit (revert + doc fixes): 1,364,386 B (β7,676) cargo +stable clippy --workspace --all-targets --all-features -- -D warnings: green cargo test -p marque-wasm --lib r2_: 19/19 green (byte-identity preserved)
|
Thanks @copilot for the three inline comments β all three were on-target. Disposition + commit #1 (lib.rs:842) β ACCEPTED. You were right. The pre-R2 #3 (lib.rs:1104) β ACCEPTED. Comment correction. #2 (lib.rs:1124) β empirically REJECTED with documented rationale. Your suggestion is theoretically correct on runtime (eliminate O(n)
The borrow-preserving rewrite REGRESSED the bundle by +6,474 B β would erase 83% of R2's binary win. The mechanism is the same LTO+ICF lesson captured in PR #696 (R1's The runtime cost you flagged is real but small in practice β typical corrections maps in marque usage are 5-20 entries; per-call cost is 10-40 short String allocations, in the noise relative to the ~400 Β΅s
|
Measurements
origin/staging7af85fb2)b974550b)--profile release-webβ CI-gate-comparable)--release, agent's working measurement)lint_10kbp95cargo test --workspacecargo +stable clippy --workspace --all-targets --all-features -- -D warningsAbove the synthesis-brief gate (β₯4 KB net savings). Slightly under the upper end of the preflight 5β8 KB estimate band, consistent with the linker-dedup uncertainty the preflight flagged. Bench is hot-path-neutral as expected (config parsing happens once per
lint_native/fix_nativecall, not on the engine hot path).Methodology note:
wasm-pack 0.14on this dev environment doesn't apply[package.metadata.wasm-pack.profile.release-web]wasm-opt flag lists reliably; the agent's working measurement used--releasefor build-stability. The CI-gate-comparable number (--profile release-web, top row above) confirms the delta independently to within 9 bytes β the relative delta is the trustworthy signal. Same caveat as documented intools/wasm-size-check.sh.Summary
#[derive(Deserialize)]fromWasmConfigand#[derive(Serialize)]fromWasmConfigCacheKey<'a>(latter deleted entirely) β the two derives that together accounted for ~10 KB of WASM bundle bloat per perf(wasm): investigate +5.18% WASM binary regression vs. baseline (cumulative across staging window)Β #689 attribution.serde_json::Value::as_object().get(...)extraction via privatewasm_config_from_value, plus directserde_json::Mapconstruction inbuild_cache_key.serde_json::Valueis already in the WASM bundle (used by Diagnostic JSON building,Map<String, Value>for FixIntent-style args,RawValuefor diagnostics), so the swap adds no new monomorphization.fcd5ebca), refactor lands in commit 2 (2ad85eaa). One f32 byte-identity gotcha caught by the tests before reaching review.Constitution III check β runtime-config does not expand engine's semantic surface
Constitution III Β§"runtime configuration that expands the engine's semantic surface" is the load-bearing constraint on
WasmConfig. R2 PRESERVES and arguably TIGHTENS the surface:{classifier_id, confidence_threshold, corrections, deadline_ms}obj.get(...)enumeration at call siteobj.get(...)list is ignored) β pinned byr2_unknown_field_silently_ignored+r2_unknown_field_coexists_with_knownunit testsclassifier_idaccepts any serde-Deserializable typeOption<String>accepts string / null)Value::Stringaccepted; everything else is a loud error. Tightening, not widening.nullJSON value treated as absent#[serde(default)]+Option<T>)The Constitution III invariant is preserved. The tightening on
classifier_id(and the other three fields) is a strict narrowing of what gets accepted β a JS caller passing a number forclassifier_idnow gets a loud error instead of being silently coerced.f32 byte-identity gotcha
serde_json::Value::Numberis internally f64-only. Naive replacementNumber::from_f64(threshold as f64)widens the f32 bit-pattern to f64 BEFORE the shortest-roundtrip serializer formats it β for0.85_f32(which is not exactly representable as binary32) this produces"0.8500000238418579"instead of the f32 shortest-roundtrip"0.85"thatserialize_f32would emit. The discrepancy breaks cache-key byte-identity with the retiredWasmConfigCacheKey#[derive(Serialize)]path.Fix routes f32 through
serde_json::to_string(&threshold)(which usesserialize_f32/ shortest-f32-roundtrip) then parses the resulting string back into aValue::Number. The roundtrip is safe for any finite f32 and cheap (one short alloc + one parse of a known-valid numeric literal) on engine-cache-miss only.The byte-identity test
r2_byte_identity_threshold_onlyexercises0.85specifically β a value that's NOT exactly representable as f32, so it catches a regression to the naivefrom_f64path. The fix and rationale are documented in code comments at lines 1083β1097.Behavior changes β errors from the Derive surface
{...,)Errwith line/column contextErrwith line/column context (outerserde_json::from_str::<Value>(s)unchanged)classifier_id: 123)Errwith line/column, generic shape errorErrwith field name + type hint (e.g.,"classifier_id must be a string; got number") β no columncorrectionsvalue ({"k": 42})Errwith line/columnErrwithcorrections[k]key + type hint{"some_other_field": 42})"hello",[1,2],42)Errwith shape errorErrcontaining"must be a JSON object"+ type hintNet trade: gain field names in errors (better diagnostics for JS callers), lose column numbers for field-level errors (acceptable β JS callers get column numbers from the outer parse for malformed JSON, which is the more common error class).
Adjacent-bug survey
BatchEntry#[derive(Deserialize)]atlib.rs:811β deferred per synthesis brief. Two plainStringfields (noOption, noHashMap); expected savings <1 KB. Unlikely to clear the 4 KB gate solo; will file as R2b only if combinable with other small cuts.#[derive(Serialize)]impls (DiagnosticJson, FixResultJson, RuleIdJson, FixJson, AuditConfidenceJsonV1_0, DeadlineExceededBodyJson, etc., lines 284-1403) β these are the JS-API wire contract percontracts/diagnostic.jsonandcontracts/audit-record.json. EXPLICITLY out of scope for the R-series. Touching them is a contract-break PR requiring schema-version bump coordination.serde_json::MapBTreeMap-backed assumption β the workspace'sserde_json = "1.0.149"is pinned withoutpreserve_order, soMapis alphabetical-iterating.WasmConfigfield names happen to be alphabetical, which coincides with the struct-declaration order the prior Derive used β byte-identity preserved withoutpreserve_order. TheWasmConfigstruct doc-comment +build_cache_keydoc-comment + the byte-identity regression tests triangulate this constraint for future maintainers.Constitution checks
marque-wasmsits ABOVE the rule chain. This PR touchescrates/wasm/src/lib.rs+claudedocs/wasm-bundle-analysis-2026-05-22.mdonly. Zero edits to engine, scheme, rules, core, ism, capco.lint_10kbCriterion p=0.88 β no statistically significant change. SC-001 16 ms ceiling 50Γ headroom intact.Test plan
cargo test --workspaceβ all greencargo test -p marque-wasm --lib r2_β 19 R2 unit tests greencargo test -p marque-wasmβ 113 wasm tests green (lib + integration)cargo +stable clippy --workspace --all-targets --all-features -- -D warningsβ cleancargo fmt --checkβ cleanwasm-pack build crates/wasm --target web --profile release-web(CI-gate-comparable) and--release(working measurement) β both confirm β₯7.7 KB deltalint_10kbCriterion bench β no significant change (p=0.88)fcd5ebca(before refactor2ad85eaalands)Review-pass changes (commit
b974550b)Three reviewer nits addressed in the final fix-up commit:
expect("corrections present")atbuild_cache_keyline 1100 β structurally self-provingif let Some(corrections) = cfg.corrections.as_ref().filter(|c| !c.is_empty()). Same semantic; Some-and-non-empty proof now visible at the binding site.WasmConfigstruct doc-comment now documents the alphabetical-field-naming constraint with a cross-reference tobuild_cache_key. A developer adding a new field encounters the constraint upfront rather than discovering it via byte-identity test failure.--releasevs--profile release-web) reconciled in the PR body's Measurements section above, with both numbers shown side-by-side and the wasm-pack 0.14 limitation noted.Declined: pinning field-name presence in the 8 type-mismatch test error strings (LOW per general reviewer; flagged as judgment call). The boundary the tests pin is loud-vs-silent, which is the API contract; field-name quality is implementation-level. Pinning it would lock the present wording, which we explicitly accepted dropping serde's column-number context for.
References
tools/wasm-monoaudit.shDWARF preservation