Finding
The new DSL Qwen path (QwenNetworkLoader.fromWeights → OptimizedLLMRuntime DIRECT) and the legacy LlamaRuntime Qwen path produce substantially different logits when fed identical FP32 weights, even on a tiny 1-layer model. Discovered while attempting Phase 1B.4 (parity test) of the no-model-duplication refactor (refs #46, the Phase 1A/B/C/2 PRs #109/#110/#111/#112/#113).
Concrete numbers — first token, logit[0]:
- DSL: -13.231035
- Legacy: -7.451726
- Absolute diff: 5.7793093
That's six orders of magnitude above any reasonable tolerance for "same math, different op ordering". This is a real semantic divergence, not numerical drift.
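To make the "orders of magnitude above tolerance" claim concrete, here is a minimal sketch of the kind of check a parity test could apply. The helper name `maxAbsDiff` and the `1e-5f` tolerance are assumptions for illustration, not values taken from the draft test.

```kotlin
import kotlin.math.abs

/** Largest element-wise absolute difference between two logit vectors. */
fun maxAbsDiff(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "logit vectors differ in length: ${a.size} vs ${b.size}" }
    var worst = 0f
    for (i in a.indices) worst = maxOf(worst, abs(a[i] - b[i]))
    return worst
}

fun main() {
    // logit[0] values reported above; a real test would compare the full vocab-sized vectors.
    val dsl = floatArrayOf(-13.231035f)
    val legacy = floatArrayOf(-7.451726f)
    val tolerance = 1e-5f // hypothetical FP32 parity tolerance, not from the source
    val diff = maxAbsDiff(dsl, legacy)
    println("maxAbsDiff=$diff, ratio to tolerance=${diff / tolerance}")
}
```

Comparing the whole logit vector rather than a single element also shows whether the divergence is uniform or concentrated in a few token ids.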
Reproducer (sketch)
Synthetic Qwen3-flavoured weights, dim=32, 2 heads, kvHeads=2, headDim=16, ffnDim=64, 1 layer, vocab=64. Small integer weight values for reproducibility. Includes attn_q_norm / attn_k_norm so both paths build a real qkNorm=true Qwen network. RoPE base = 1_000_000, RMSNorm eps = 1e-6.
```kotlin
val weights = DecoderGgufWeights(metadata, tensors) // identical for both paths

// DSL path
val dslModel = QwenNetworkLoader.fromWeights(weights)
val dsl = OptimizedLLMRuntime(dslModel, ctx, OptimizedLLMMode.DIRECT, FP32::class)

// Legacy path (with MemSegWeightConverter to pre-transpose, mirroring CLI)
Arena.ofConfined().use { arena ->
    val mapped = LlamaWeightMapper.map(weights)
    val converted = MemSegWeightConverter.convert(mapped, ctx, arena)
    val backend = CpuAttentionBackend(ctx, converted, FP32::class, ropeFreqBase = 1_000_000f)
    val legacy = LlamaRuntime(ctx, converted, backend, FP32::class, eps = 1e-6f)

    dsl.forward(1)    // → logit[0] = -13.231035
    legacy.forward(1) // → logit[0] = -7.451726
}
```
I had this test in a draft branch, but pulled it from the PR #112/#113 follow-up rather than ship a failing test without a root-cause investigation.
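For anyone rebuilding the reproducer, the "small integer weight values" could be generated along these lines. This is only a sketch: `smallIntWeights`, the salt scheme, and the value range [-2, 2] are hypothetical and not the generator used in the draft test.

```kotlin
/**
 * Deterministic tensor of `size` small integers in [-2, 2].
 * Values depend only on the element index and a non-negative per-tensor salt,
 * so no RNG is involved and every value is exactly representable in FP32.
 */
fun smallIntWeights(size: Int, salt: Int): FloatArray {
    require(salt >= 0) { "salt must be non-negative" }
    return FloatArray(size) { i -> (((i * 7 + salt * 13) % 5) - 2).toFloat() }
}

fun main() {
    // Example: a 32 x 32 projection (dim = 32), with one salt per tensor.
    val wq = smallIntWeights(32 * 32, salt = 0)
    val wk = smallIntWeights(32 * 32, salt = 1)
    println("wq[0..4]=${wq.take(5)}, wk[0..4]=${wk.take(5)}")
}
```

Because the values are exact in FP32, any difference between the two runtimes comes from the computation itself rather than from how the weights were loaded or rounded.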
Hypotheses (most likely first)
- RoPE convention mismatch. DSL's rope() defaults to RoPEMode.INTERLEAVED pairing (i, i+1). Legacy's CpuAttentionBackend.applyRopeGqa may use split-half pairing (i, i + headDim/2). The Gemma DSL had this exact gotcha: see the long comment at GemmaNetworkDef.kt:172-187, which explains that the HF / GGUF storage convention is split-half and that using interleaved gives "correct-by-accident" outputs at small N but compounds wrong rotations across positions.
- QK-norm formula. DSL MultiHeadAttention.qNorm uses an RMSNormalization module applied on a reshape; legacy LlamaRuntime.applyPerHeadRMSNorm is inline math (LlamaRuntime.kt:181-). Eps placement (sumSq/headDim + eps vs (sumSq + eps)/headDim) or weight broadcasting could differ.
- Attention scale placement. Legacy may apply 1/sqrt(headDim) at a different stage than DSL's MultiHeadAttention.
- Transpose semantics on square Q/O. Legacy linearProject's shape heuristic relies on MemSegWeightConverter having pre-transposed weights to [in, out]. With the converter included in the test the divergence persists, so the transpose handling is probably not the root cause, but it is worth ruling out via a non-square test.
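The first two hypotheses can be probed in isolation, outside either runtime. The sketch below is not the project's code: `ropeSketch`, `rmsScaleEpsOutside` and `rmsScaleEpsInside` are hypothetical names, and the frequency formula is the standard RoPE one (theta_i = pos * base^(-2i/headDim)), assumed here to match what both paths use.

```kotlin
import kotlin.math.cos
import kotlin.math.pow
import kotlin.math.sin
import kotlin.math.sqrt

/** Rotates the pair (x[a], x[b]) in place by angle theta. */
fun rotatePair(x: FloatArray, a: Int, b: Int, theta: Float) {
    val c = cos(theta)
    val s = sin(theta)
    val xa = x[a]
    val xb = x[b]
    x[a] = xa * c - xb * s
    x[b] = xa * s + xb * c
}

/** RoPE over one head: pairs are (2i, 2i+1) if interleaved, else (i, i + headDim/2). */
fun ropeSketch(head: FloatArray, pos: Int, base: Float, interleaved: Boolean): FloatArray {
    require(head.size % 2 == 0) { "headDim must be even" }
    val out = head.copyOf()
    val half = head.size / 2
    for (i in 0 until half) {
        val theta = pos * base.pow(-2f * i / head.size)
        if (interleaved) rotatePair(out, 2 * i, 2 * i + 1, theta)
        else rotatePair(out, i, i + half, theta)
    }
    return out
}

/** 1 / sqrt(sumSq / n + eps): eps is added after dividing by n. */
fun rmsScaleEpsOutside(x: FloatArray, eps: Float): Float {
    var sumSq = 0f
    for (v in x) sumSq += v * v
    return 1f / sqrt(sumSq / x.size + eps)
}

/** 1 / sqrt((sumSq + eps) / n): eps is added before dividing by n. */
fun rmsScaleEpsInside(x: FloatArray, eps: Float): Float {
    var sumSq = 0f
    for (v in x) sumSq += v * v
    return 1f / sqrt((sumSq + eps) / x.size)
}
```

Two things fall out of this. At pos = 0 every rotation angle is zero, so both RoPE conventions are the identity and only differ from pos = 1 onward; the source does not say which position the first forward call in the reproducer runs at. For the norm, (sumSq + eps)/n equals sumSq/n + eps/n, so the two placements differ only by an effective eps of eps/n versus eps, which matters when the mean square of the head is itself close to eps.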
Implication for Phase 4 (CLI swap)
The plan (snazzy-wibbling-dewdrop) Phase 4 swaps the CLI's Qwen branch from LlamaRuntime → DSL. This is a real behavior change, not just an internal refactor as previously framed. We do not yet know which path produces correct output (= matches HuggingFace transformers reference on a real Qwen3 model). The Q8 smoke test in #113 only proves DSL FP32 ≈ DSL Q8 (round-trip consistency, not absolute correctness).
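One way to make "matches the reference" measurable is top-K agreement on next-token logits. The sketch below assumes each path, and the reference, can hand back a vocab-sized logit array for the same prompt; `topK` and `topKOverlap` are hypothetical helpers, not existing project APIs.

```kotlin
/** Indices of the k largest logits, highest first (ties keep the lower index first). */
fun topK(logits: FloatArray, k: Int): List<Int> {
    require(k in 1..logits.size) { "k must be in 1..${logits.size}" }
    return logits.indices.sortedByDescending { logits[it] }.take(k)
}

/** Fraction of the reference top-K token ids that also appear in the candidate top-K. */
fun topKOverlap(candidate: FloatArray, reference: FloatArray, k: Int): Double {
    val ref = topK(reference, k).toSet()
    val cand = topK(candidate, k).toSet()
    return ref.intersect(cand).size.toDouble() / k
}
```

Running `topKOverlap` once for DSL against the reference and once for legacy against the reference gives two comparable scores per prompt. Overlap ignores ordering inside the top-K, so checking that the top-1 ids match is a useful stricter companion.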
Phase 4 readiness checklist
Holding Phase 4 until at least one of:
- tests/smoke/. Run a real Qwen3 GGUF through both the DSL (Phase 4 candidate) and legacy CLI paths, capture the top-K next tokens, and compare against an HF transformers (or llama.cpp) reference for the same prompt. Whichever matches HF wins.
What is NOT at risk
The architectural wins from #109/#110/#111/#112/#113 stand:
- decoderTransformerNetwork shared builder
- qwenNetwork no longer a stub
- Generic loader naming (Decoder*)
- DecoderGgufMemSegConverter for the DSL path
- QwenNetworkLoader.fromGgufNative + Q8 round-trip smoke
The DSL Qwen path works; it just produces different output than the legacy path. Either could be the correct one, and we need to know which.