Phase 4 readiness: DSL Qwen and legacy LlamaRuntime diverge numerically on identical weights #114

## Finding

The new DSL Qwen path (`QwenNetworkLoader.fromWeights` → `OptimizedLLMRuntime` DIRECT) and the legacy `LlamaRuntime` Qwen path produce **substantially different logits** when fed identical FP32 weights, even on a tiny 1-layer model. Discovered while attempting Phase 1B.4 (parity test) of the no-model-duplication refactor (refs #46, the Phase 1A/B/C/2 PRs #109/#110/#111/#112/#113).

**Concrete numbers** — first token, logit[0]:
- DSL: `-13.231035`
- Legacy: `-7.451726`
- Absolute diff: `5.7793093`

Against an FP32 tolerance on the order of 1e-5, that is nearly six orders of magnitude too large to be explained by "same math, different op ordering". This is a real semantic divergence, not numerical drift.
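
As a rough illustration of the check a parity test might apply to the two logit vectors, here is a minimal sketch. The helper names (`maxAbsDiff`, `allClose`) and the default tolerances are assumptions chosen for illustration. They are not taken from either runtime or from the draft test.

```kotlin
import kotlin.math.abs

/** Largest element-wise absolute difference between two logit vectors. */
fun maxAbsDiff(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "size mismatch: ${a.size} vs ${b.size}" }
    var worst = 0f
    for (i in a.indices) worst = maxOf(worst, abs(a[i] - b[i]))
    return worst
}

/**
 * True when every element satisfies |actual - expected| <= atol + rtol * |expected|.
 * The default tolerances are assumed values, not project settings.
 */
fun allClose(
    actual: FloatArray,
    expected: FloatArray,
    atol: Float = 1e-5f,
    rtol: Float = 1e-4f,
): Boolean {
    require(actual.size == expected.size) { "size mismatch" }
    return actual.indices.all { abs(actual[it] - expected[it]) <= atol + rtol * abs(expected[it]) }
}
```

With the two logit[0] values above, `maxAbsDiff` comes out near 5.78 and `allClose` fails under any tolerance in this range.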

## Reproducer (sketch)

Synthetic Qwen3-flavoured weights, dim=32, 2 heads, kvHeads=2, headDim=16, ffnDim=64, 1 layer, vocab=64. Small integer weight values for reproducibility. Includes `attn_q_norm` / `attn_k_norm` so both paths build a real qkNorm=true Qwen network. RoPE base = 1_000_000, RMSNorm eps = 1e-6.

```kotlin
val weights = DecoderGgufWeights(metadata, tensors)  // identical for both paths

// DSL path
val dslModel = QwenNetworkLoader.fromWeights(weights)
val dsl = OptimizedLLMRuntime(dslModel, ctx, OptimizedLLMMode.DIRECT, FP32::class)

// Legacy path (with MemSegWeightConverter to pre-transpose, mirroring CLI)
Arena.ofConfined().use { arena ->
    val mapped = LlamaWeightMapper.map(weights)
    val converted = MemSegWeightConverter.convert(mapped, ctx, arena)
    val backend = CpuAttentionBackend(ctx, converted, FP32::class, ropeFreqBase = 1_000_000f)
    val legacy = LlamaRuntime(ctx, converted, backend, FP32::class, eps = 1e-6f)

    dsl.forward(1)     // → logit[0] = -13.231035
    legacy.forward(1)  // → logit[0] = -7.451726
}
```

I had this test in a draft branch, but pulled it from the follow-up to PR #112/#113 rather than ship a failing test without a root-cause investigation.

## Hypotheses (most likely first)

1. **RoPE convention mismatch.** DSL's `rope()` defaults to `RoPEMode.INTERLEAVED` pairing (i, i+1). Legacy's `CpuAttentionBackend.applyRopeGqa` may use split-half pairing (i, i + headDim/2). The Gemma DSL had this exact gotcha: see the long comment at `GemmaNetworkDef.kt:172-187` explaining that the HF / GGUF storage convention is split-half and that using interleaved gives "correct-by-accident" outputs at small N but compounds wrong rotations across positions.
2. **QK-norm formula.** DSL `MultiHeadAttention.qNorm` uses an `RMSNormalization` module applied on a reshape; legacy `LlamaRuntime.applyPerHeadRMSNorm` is inline math (`LlamaRuntime.kt:181-`). Eps placement (`sumSq/headDim + eps` vs `(sumSq + eps)/headDim`) or weight broadcasting could differ.
3. **Attention scale placement.** Legacy may apply `1/sqrt(headDim)` at a different stage than DSL's `MultiHeadAttention`.
4. **Transpose semantics on square Q/O.** Legacy `linearProject`'s shape heuristic relies on `MemSegWeightConverter` having pre-transposed weights to `[in, out]`. With the converter included in the test the divergence persists, so the transpose handling is *probably* not the root cause — but worth ruling out via a non-square test.
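
To make hypotheses 1 and 2 concrete, the sketch below implements the two RoPE pairings and the two eps placements on a single head vector. All function names are hypothetical and the code is standalone. It does not reproduce the project's `rope()`, `applyRopeGqa`, or `applyPerHeadRMSNorm`, and it assumes the usual frequency schedule `base^(-2i/headDim)`.

```kotlin
import kotlin.math.cos
import kotlin.math.pow
import kotlin.math.sin
import kotlin.math.sqrt

/** Interleaved convention: rotates element pairs (2i, 2i+1) of one head vector. */
fun ropeInterleaved(x: FloatArray, pos: Int, base: Float): FloatArray {
    val out = x.copyOf()
    for (i in 0 until x.size / 2) {
        val theta = pos * base.pow(-2f * i / x.size)
        val c = cos(theta)
        val s = sin(theta)
        out[2 * i] = x[2 * i] * c - x[2 * i + 1] * s
        out[2 * i + 1] = x[2 * i] * s + x[2 * i + 1] * c
    }
    return out
}

/** Split-half convention: rotates element pairs (i, i + headDim/2). */
fun ropeSplitHalf(x: FloatArray, pos: Int, base: Float): FloatArray {
    val half = x.size / 2
    val out = x.copyOf()
    for (i in 0 until half) {
        val theta = pos * base.pow(-2f * i / x.size)
        val c = cos(theta)
        val s = sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i] * s + x[i + half] * c
    }
    return out
}

/** Inverse RMS with eps added to the mean square: 1 / sqrt(sumSq/n + eps). */
fun invRmsEpsAfterMean(x: FloatArray, eps: Float): Float {
    val sumSq = x.fold(0f) { acc, v -> acc + v * v }
    return 1f / sqrt(sumSq / x.size + eps)
}

/** Inverse RMS with eps added before dividing: 1 / sqrt((sumSq + eps)/n). */
fun invRmsEpsBeforeMean(x: FloatArray, eps: Float): Float {
    val sumSq = x.fold(0f) { acc, v -> acc + v * v }
    return 1f / sqrt((sumSq + eps) / x.size)
}
```

Two properties of this sketch may help when triaging. At `pos = 0` both RoPE variants reduce to the identity, so if the reproducer's first `forward` call runs at position 0, a pairing mismatch alone would not explain a first-token difference. The two eps placements differ by at most `eps` in the mean square, so with `eps = 1e-6` and activations of order 1 that term by itself is unlikely to account for a gap of about 5.78. Weight broadcasting remains a separate possibility.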

## Implication for Phase 4 (CLI swap)

The plan ([snazzy-wibbling-dewdrop](.)) Phase 4 swaps the CLI's Qwen branch from `LlamaRuntime` → DSL. **This is a real behavior change**, not just an internal refactor as previously framed. We do not yet know which path produces *correct* output (= matches HuggingFace transformers reference on a real Qwen3 model). The Q8 smoke test in #113 only proves DSL FP32 ≈ DSL Q8 (round-trip consistency, not absolute correctness).

## Phase 4 readiness checklist

Holding Phase 4 until at least one of the following is done:

- [ ] **Root-cause the parity divergence.** Localize per-layer or per-stage diffs (post-attn-norm, post-Q-proj, post-RoPE-Q, post-attn, post-FFN, ...) to identify which op disagrees. Cross-check against HF transformers reference on a tiny model.
- [ ] **HF-reference smoke** in `tests/smoke/`. Run a real Qwen3 GGUF through both DSL (Phase 4 candidate) and legacy CLI paths, capture top-K next tokens, compare against HF transformers (or llama.cpp) reference for the same prompt. Whichever matches HF wins.
- [ ] **Resolve issue #52** (byte-level BPE). Without this, Qwen3 tool-calling is broken end-to-end regardless of which inference path ships, so we can't even smoke-test Phase 4 against real Qwen3-Instruct.
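
The localization and top-K comparison steps in the checklist could be sketched roughly as follows. `StageDump`, `firstDivergingStage`, and `topK` are hypothetical names, and the tolerance is an assumed value. How intermediate activations are captured from each runtime is left open, since neither path is known to expose such hooks today.

```kotlin
import kotlin.math.abs

/** One captured intermediate activation, tagged with the stage that produced it. */
class StageDump(val stage: String, val values: FloatArray)

/**
 * Walks two equally ordered stage lists and returns the name of the first stage
 * whose max absolute difference exceeds atol, or null if all stages agree.
 */
fun firstDivergingStage(
    dsl: List<StageDump>,
    legacy: List<StageDump>,
    atol: Float = 1e-4f, // assumed tolerance
): String? {
    require(dsl.size == legacy.size) { "stage count mismatch" }
    for ((a, b) in dsl.zip(legacy)) {
        require(a.stage == b.stage) { "stage order mismatch: ${a.stage} vs ${b.stage}" }
        require(a.values.size == b.values.size) { "shape mismatch at ${a.stage}" }
        var worst = 0f
        for (i in a.values.indices) worst = maxOf(worst, abs(a.values[i] - b.values[i]))
        if (worst > atol) return a.stage
    }
    return null
}

/** Indices of the k largest logits, highest first (ties keep index order). */
fun topK(logits: FloatArray, k: Int): List<Int> =
    logits.indices.sortedByDescending { logits[it] }.take(k)
```

Feeding both paths' dumps in the order listed above (post-attn-norm, post-Q-proj, post-RoPE-Q, ...) would name the earliest op that disagrees. Comparing `topK` outputs against the reference gives a coarse correctness signal that does not depend on exact logit values.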

## What is NOT at risk

The architectural wins from #109/#110/#111/#112/#113 stand:
- `decoderTransformerNetwork` shared builder
- `qwenNetwork` no longer a stub
- Generic loader naming (`Decoder*`)
- `DecoderGgufMemSegConverter` for the DSL path
- `QwenNetworkLoader.fromGgufNative` + Q8 round-trip smoke

The DSL Qwen path *works* — it just produces different output than the legacy path. Either could be the correct one; we just need to know which.
