Phase 4 readiness: DSL Qwen and legacy LlamaRuntime diverge numerically on identical weights #114

@michalharakal

Finding

The new DSL Qwen path (QwenNetworkLoader.fromWeights → OptimizedLLMRuntime DIRECT) and the legacy LlamaRuntime Qwen path produce substantially different logits when fed identical FP32 weights, even on a tiny 1-layer model. I discovered this while attempting Phase 1B.4 (parity test) of the no-model-duplication refactor (refs #46, the Phase 1A/B/C/2 PRs #109/#110/#111/#112/#113).

Concrete numbers — first token, logit[0]:

  • DSL: -13.231035
  • Legacy: -7.451726
  • Absolute diff: 5.7793093

That is about six orders of magnitude above a tight FP32 parity tolerance (on the order of 1e-5 to 1e-6) for "same math, different op ordering". This is a real semantic divergence, not numerical drift.
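For reference, the kind of check that separates drift from divergence can be sketched as below. This is a minimal sketch, not the pulled test: maxAbsDiff and ASSUMED_FP32_TOLERANCE are hypothetical names, and the tolerance value is an assumption that would need tuning for larger models.

```kotlin
import kotlin.math.abs

// Hypothetical helper (not in the repository): largest element-wise gap
// between two logit vectors of equal length.
fun maxAbsDiff(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "logit vectors differ in length: ${a.size} vs ${b.size}" }
    var worst = 0f
    for (i in a.indices) worst = maxOf(worst, abs(a[i] - b[i]))
    return worst
}

// Assumed budget for FP32 op-reordering noise on a tiny model; not a project constant.
const val ASSUMED_FP32_TOLERANCE = 1e-5f

fun main() {
    // logit[0] values reported in this issue.
    val dsl = floatArrayOf(-13.231035f)
    val legacy = floatArrayOf(-7.451726f)
    val diff = maxAbsDiff(dsl, legacy)
    println("max abs diff = $diff, within tolerance = ${diff <= ASSUMED_FP32_TOLERANCE}")
}
```

Comparing the whole logit vector rather than logit[0] alone would also show whether the gap is uniform or concentrated in a few vocabulary entries.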

Reproducer (sketch)

Synthetic Qwen3-flavoured weights, dim=32, 2 heads, kvHeads=2, headDim=16, ffnDim=64, 1 layer, vocab=64. Small integer weight values for reproducibility. Includes attn_q_norm / attn_k_norm so both paths build a real qkNorm=true Qwen network. RoPE base = 1_000_000, RMSNorm eps = 1e-6.

val weights = DecoderGgufWeights(metadata, tensors)  // identical for both paths

// DSL path
val dslModel = QwenNetworkLoader.fromWeights(weights)
val dsl = OptimizedLLMRuntime(dslModel, ctx, OptimizedLLMMode.DIRECT, FP32::class)

// Legacy path (with MemSegWeightConverter to pre-transpose, mirroring CLI)
Arena.ofConfined().use { arena ->
    val mapped = LlamaWeightMapper.map(weights)
    val converted = MemSegWeightConverter.convert(mapped, ctx, arena)
    val backend = CpuAttentionBackend(ctx, converted, FP32::class, ropeFreqBase = 1_000_000f)
    val legacy = LlamaRuntime(ctx, converted, backend, FP32::class, eps = 1e-6f)

    dsl.forward(1)     // → logit[0] = -13.231035
    legacy.forward(1)  // → logit[0] = -7.451726
}

I had this test in a draft branch; I pulled it from the follow-up to PR #112/#113 rather than ship a failing test without a root-cause investigation.

Hypotheses (most likely first)

  1. RoPE convention mismatch. DSL's rope() defaults to RoPEMode.INTERLEAVED pairing (i, i+1). Legacy's CpuAttentionBackend.applyRopeGqa may use split-half pairing (i, i + headDim/2). The Gemma DSL had this exact gotcha — see the long comment at GemmaNetworkDef.kt:172-187 explaining that HF / GGUF storage convention is split-half and using interleaved gives "correct-by-accident" outputs at small N but compounds wrong rotations across positions.
  2. QK-norm formula. DSL MultiHeadAttention.qNorm uses an RMSNormalization module applied on a reshape; legacy LlamaRuntime.applyPerHeadRMSNorm is inline math (LlamaRuntime.kt:181-). Eps placement (sumSq/headDim + eps vs (sumSq + eps)/headDim) or weight broadcasting could differ.
  3. Attention scale placement. Legacy may apply 1/sqrt(headDim) at a different stage than DSL's MultiHeadAttention.
  4. Transpose semantics on square Q/O. Legacy linearProject's shape heuristic relies on MemSegWeightConverter having pre-transposed weights to [in, out]. With the converter included in the test the divergence persists, so the transpose handling is probably not the root cause — but worth ruling out via a non-square test.
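Hypotheses 1 and 2 can be made concrete with a standalone sketch. Everything below is illustrative: ropeSketch, rmsNormEpsOnMean and rmsNormEpsOnSum are hypothetical names, the frequency schedule theta_k = base^(-2k/headDim) is the usual RoPE formula rather than something read out of either code path, and the norm weights are omitted.

```kotlin
import kotlin.math.abs
import kotlin.math.cos
import kotlin.math.pow
import kotlin.math.sin
import kotlin.math.sqrt

// Rotate one head vector at position `pos`.
// interleaved = true pairs (2k, 2k+1); false pairs (k, k + headDim/2).
fun ropeSketch(x: FloatArray, pos: Int, base: Float, interleaved: Boolean): FloatArray {
    val d = x.size
    require(d % 2 == 0) { "headDim must be even" }
    val half = d / 2
    val out = x.copyOf()
    for (k in 0 until half) {
        val theta = pos * base.pow(-2f * k / d)
        val c = cos(theta)
        val s = sin(theta)
        val i = if (interleaved) 2 * k else k
        val j = if (interleaved) 2 * k + 1 else k + half
        out[i] = x[i] * c - x[j] * s
        out[j] = x[i] * s + x[j] * c
    }
    return out
}

// Variant A: x / sqrt(sumSq / n + eps)
fun rmsNormEpsOnMean(x: FloatArray, eps: Float): FloatArray {
    val sumSq = x.fold(0f) { acc, v -> acc + v * v }
    val inv = 1f / sqrt(sumSq / x.size + eps)
    return FloatArray(x.size) { x[it] * inv }
}

// Variant B: x / sqrt((sumSq + eps) / n)
fun rmsNormEpsOnSum(x: FloatArray, eps: Float): FloatArray {
    val sumSq = x.fold(0f) { acc, v -> acc + v * v }
    val inv = 1f / sqrt((sumSq + eps) / x.size)
    return FloatArray(x.size) { x[it] * inv }
}

fun main() {
    val head = FloatArray(16) { (it % 4 + 1).toFloat() } // small integers, headDim = 16
    for (pos in 0..2) {
        val a = ropeSketch(head, pos, 1_000_000f, interleaved = true)
        val b = ropeSketch(head, pos, 1_000_000f, interleaved = false)
        val gap = a.indices.maxOf { abs(a[it] - b[it]) }
        println("pos=$pos RoPE pairing gap=$gap")
    }
    val na = rmsNormEpsOnMean(head, 1e-6f)
    val nb = rmsNormEpsOnSum(head, 1e-6f)
    println("eps placement gap=${na.indices.maxOf { abs(na[it] - nb[it]) }}")
}
```

Two things follow from the math itself. At position 0 the rotation angle is zero, so both pairings return the input unchanged; if the first forward call runs at position 0, a pairing mismatch alone would not change that token's logits. And with eps = 1e-6 on order-one activations, the two eps placements differ only at roughly the 1e-6 level, which would make eps placement by itself an unlikely source of a multi-unit gap.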

Implication for Phase 4 (CLI swap)

The plan (snazzy-wibbling-dewdrop) Phase 4 swaps the CLI's Qwen branch from LlamaRuntime → DSL. This is a real behavior change, not just an internal refactor as previously framed. We do not yet know which path produces correct output (= matches HuggingFace transformers reference on a real Qwen3 model). The Q8 smoke test in #113 only proves DSL FP32 ≈ DSL Q8 (round-trip consistency, not absolute correctness).

Phase 4 readiness checklist

Holding Phase 4 until at least one of:

  • Root-cause the parity divergence. Localize per-layer or per-stage diffs (post-attn-norm, post-Q-proj, post-RoPE-Q, post-attn, post-FFN, ...) to identify which op disagrees. Cross-check against HF transformers reference on a tiny model.
  • HF-reference smoke in tests/smoke/. Run a real Qwen3 GGUF through both DSL (Phase 4 candidate) and legacy CLI paths, capture top-K next tokens, compare against HF transformers (or llama.cpp) reference for the same prompt. Whichever matches HF wins.
  • Resolve #52 (byte-level BPE broken for GPT-2/Qwen models, affects both GGUF and SafeTensors). Without this, Qwen3 tool-calling is broken end-to-end regardless of which inference path ships, so we can't even smoke-test Phase 4 against real Qwen3-Instruct.
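The first two checklist items reduce to two small comparisons, sketched below. This assumes each runtime can be instrumented to dump named intermediate activations in forward order; neither runtime is known to expose such a hook today, and firstDivergentStage, topK and the stage names are hypothetical.

```kotlin
import kotlin.math.abs

// Walk two traces (stage name -> activation, in forward order) and return the
// first stage whose max absolute difference exceeds `tol`, or null if none does.
fun firstDivergentStage(
    dsl: List<Pair<String, FloatArray>>,
    legacy: List<Pair<String, FloatArray>>,
    tol: Float,
): String? {
    require(dsl.size == legacy.size) { "traces have different stage counts" }
    for ((a, b) in dsl.zip(legacy)) {
        require(a.first == b.first) { "stage order mismatch: ${a.first} vs ${b.first}" }
        require(a.second.size == b.second.size) { "shape mismatch at ${a.first}" }
        val diff = a.second.indices.maxOfOrNull { abs(a.second[it] - b.second[it]) } ?: 0f
        if (diff > tol) return a.first
    }
    return null
}

// Indices of the k largest logits, best first, for comparison against a reference.
fun topK(logits: FloatArray, k: Int): List<Int> =
    logits.indices.sortedByDescending { logits[it] }.take(k)

fun main() {
    // Synthetic traces for illustration only; real ones would come from the two runtimes.
    val dslTrace = listOf(
        "post-attn-norm" to floatArrayOf(0.5f, -0.5f),
        "post-Q-proj" to floatArrayOf(1.0f, 2.0f),
        "post-RoPE-Q" to floatArrayOf(1.0f, 2.0f),
    )
    val legacyTrace = listOf(
        "post-attn-norm" to floatArrayOf(0.5f, -0.5f),
        "post-Q-proj" to floatArrayOf(1.0f, 2.0f),
        "post-RoPE-Q" to floatArrayOf(2.0f, 1.0f),
    )
    println("first divergent stage: ${firstDivergentStage(dslTrace, legacyTrace, 1e-5f)}")
    println("top-2 token ids: ${topK(floatArrayOf(0.1f, 0.9f, 0.5f), 2)}")
}
```

Running the stage walk against an HF reference trace as well as between the two local paths would show not only where they part ways but which side departs from the reference.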

What is NOT at risk

The architectural wins from #109/#110/#111/#112/#113 stand:

  • decoderTransformerNetwork shared builder
  • qwenNetwork no longer a stub
  • Generic loader naming (Decoder*)
  • DecoderGgufMemSegConverter for the DSL path
  • QwenNetworkLoader.fromGgufNative + Q8 round-trip smoke

The DSL Qwen path works — it just produces different output than the legacy path. Either could be the correct one; we just need to know which.
