feat(llama): add DecoderGgufMemSegConverter for the DSL inference path #112

Merged
michalharakal merged 1 commit into develop from feat/decoder-gguf-memseg-converter on May 4, 2026

@michalharakal
Contributor

Summary

  • Adds a post-load converter targeting the DSL path's DecoderGgufWeights<T, V> format. This is the bind-time piece Phase 1B needs so quantized weights can flow through OptimizedLLMRuntime without dequantizing the entire model to FP32.
  • Purely additive — 2 new files in :llm-inference:llama, no existing code modified. The legacy MemSegWeightConverter (operates on LlamaRuntimeWeights) stays in place and gets deleted in Phase 4 along with LlamaRuntime.

Why a separate converter

The original Phase 1B plan said "move MemSegWeightConverter to :llm-core (it's generic)". That premise was wrong: the existing converter operates on LlamaRuntimeWeights<FP32> (the hand-coded LlamaRuntime's named-field layer struct: wq, wk, wv, …) and is genuinely tied to that legacy format. The DSL path uses DecoderGgufWeights<T, V> — a flat tensor-name → tensor map — which has a different shape. So the DSL needs its own converter; the two coexist until Phase 4 deletes the legacy one.

Behavior

| Input quant type | Output |
| --- | --- |
| Q4_0 | Wrapped as Q4MemorySegmentTensorData |
| Q8_0 | Wrapped as Q8MemorySegmentTensorData |
| Q4_K / Q5_K / Q6_K | Dequantized to FP32 (same trade-off MemSegWeightConverter makes; packed K-quant kernels aren't on the DSL hot path) |
| FP32 (no entry in quantTypes) | Pass-through unchanged |
| Anything else | Warning logged, pass-through (forward will fail at matmul if hit) |

quantTypes is cleared on the result — packed tensors carry their own marker, dequantized tensors have no quant identity, and a stale map would mislead later consumers.
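
The dispatch in the table above can be sketched roughly as follows. This is only an illustration of the decision logic, not the converter's actual code: QuantType, Action, and ConverterSketch are hypothetical stand-ins, a null lookup stands for "no entry in quantTypes", and Q2_K is used merely as an example of an unhandled type.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins; the real tensor and weight types are not shown in this PR.
enum QuantType { Q4_0, Q8_0, Q4_K, Q5_K, Q6_K, Q2_K }

enum Action { WRAP_Q4, WRAP_Q8, DEQUANTIZE_FP32, PASS_THROUGH, PASS_THROUGH_WITH_WARNING }

final class ConverterSketch {
    // Maps a tensor's quant type to an action; null means "no entry in quantTypes" (FP32).
    static Action decide(QuantType type) {
        if (type == null) return Action.PASS_THROUGH;
        switch (type) {
            case Q4_0: return Action.WRAP_Q4;
            case Q8_0: return Action.WRAP_Q8;
            case Q4_K:
            case Q5_K:
            case Q6_K: return Action.DEQUANTIZE_FP32;
            default:   return Action.PASS_THROUGH_WITH_WARNING;
        }
    }

    // Plans the conversion for every tensor name; the key set is preserved as-is.
    static Map<String, Action> plan(Iterable<String> tensorNames,
                                    Map<String, QuantType> quantTypes) {
        Map<String, Action> out = new LinkedHashMap<>();
        for (String name : tensorNames) {
            out.put(name, decide(quantTypes.get(name)));
        }
        return out;
    }
}
```

The sketch returns a plan rather than converted tensors so that it stays independent of the real tensor classes; in the converter itself each action would be applied to the tensor and the resulting quantTypes map left empty.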

Why no pre-transpose (unlike the legacy converter)

The DSL's linearProject(input, weight) always calls ops.matmul(input, ops.transpose(weight)). For Q4/Q8 MemSeg tensors upstream's transpose is a shape-only metadata swap (free), so pre-transposing brings no benefit. For dequantized K-quants and FP32 tensors a runtime transpose still has a real cost — addressing it requires adding a pre-transposed marker that linearProject checks, which is tracked as a follow-up perf optimization in the sub-plan.
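
To make the cost difference concrete, here is a minimal sketch of what a shape-only transpose could look like next to a dense FP32 transpose. PackedView and DenseFp32 are hypothetical classes written for this illustration; they are not the upstream tensor types, and the sketch assumes the matmul kernel already knows how to read the packed layout.

```java
// Hypothetical sketch; not the upstream implementation.
final class PackedView {
    final byte[] packed; // quantized blocks, shared between views and never copied
    final int rows;
    final int cols;

    PackedView(byte[] packed, int rows, int cols) {
        this.packed = packed;
        this.rows = rows;
        this.cols = cols;
    }

    // O(1): swaps the logical shape and reuses the same backing array.
    PackedView transpose() {
        return new PackedView(packed, cols, rows);
    }
}

final class DenseFp32 {
    final float[] data; // row-major
    final int rows;
    final int cols;

    DenseFp32(float[] data, int rows, int cols) {
        this.data = data;
        this.rows = rows;
        this.cols = cols;
    }

    // O(rows * cols): allocates a new array and moves every element.
    DenseFp32 transpose() {
        float[] t = new float[data.length];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                t[c * rows + r] = data[r * cols + c];
            }
        }
        return new DenseFp32(t, cols, rows);
    }
}
```

The second class is the case the follow-up targets: with a pre-transposed marker, linearProject could skip the copy for FP32 and dequantized K-quant weights.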

What ships next on top of this

  • 1B.2: fromGgufNative(...) entry point on QwenNetworkLoader / LlamaNetworkLoader / VoxtralNetworkLoader that loads with NATIVE_OPTIMIZED and runs the converter before binding into the DSL.
  • 1B.4: DSL-vs-LlamaRuntime numerical parity test on the same packed weights.
  • After both: Phase 4 (CLI swap, delete LlamaRuntime).

Test plan

  • 4 new tests in DecoderGgufMemSegConverterTest: empty-quantTypes no-op, Q4_0 wrap (verified via Q4MemorySegmentMarker), Q8_0 wrap, key-set preservation.
  • No regressions: existing MemSegWeightConverterTest, all :llm-core, :llm-inference:llama, :llm-inference:qwen, :llm-runtime:kllama tests pass.
  • CI green on PR.

Refs the Phase 1B sub-plan (~/.claude/plans/snazzy-wibbling-dewdrop-1B.md) and the closed #46.

🤖 Generated with Claude Code

The legacy MemSegWeightConverter operates on LlamaRuntimeWeights — the
hand-coded LlamaRuntime's runtime format. The new DSL path produces
DecoderGgufWeights<T,V> (a flat map keyed by GGUF tensor name) and has
no equivalent post-load step today, so the QwenNetworkLoader.fromGguf
default of DEQUANTIZE_TO_FP32 is the only viable path — which inflates
Qwen3-8B-Q4_K_M from ~5GB to ~32GB FP32 and breaks -Xmx42g.
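
The ~32GB figure follows from 4 bytes per FP32 weight. A back-of-the-envelope sketch, assuming a round 8 billion parameters (an approximation, not an exact count) and using the GGML Q4_0 / Q8_0 block sizes for comparison; Q4_K_M itself uses a different block layout, so its ~5GB is not reproduced here. WeightFootprint is a hypothetical helper, not part of the codebase.

```java
// Hypothetical helper for illustration only.
final class WeightFootprint {
    // FP32 stores 4 bytes per weight.
    static long fp32Bytes(long params) {
        return params * 4L;
    }

    // GGML Q4_0 packs 32 weights into an 18-byte block
    // (2-byte fp16 scale + 16 bytes of 4-bit values).
    static long q4_0Bytes(long params) {
        return params / 32L * 18L;
    }

    // GGML Q8_0 packs 32 weights into a 34-byte block
    // (2-byte fp16 scale + 32 bytes of 8-bit values).
    static long q8_0Bytes(long params) {
        return params / 32L * 34L;
    }
}
```

For 8,000,000,000 parameters this gives 32,000,000,000 bytes in FP32 versus 4,500,000,000 bytes in Q4_0 and 8,500,000,000 bytes in Q8_0, which is why keeping weights packed matters for heap sizing.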

This adds a generic post-load converter for the DSL path:
- Q4_0 / Q8_0 → wrapped as Q{4,8}MemorySegmentTensorData. Upstream
  DefaultCpuOpsJvm.matmul already auto-dispatches via the marker.
- Q4_K / Q5_K / Q6_K → dequantized to FP32. Same trade-off the legacy
  converter makes — packed K-quant kernels aren't on the hot path.
- FP32 (no quantTypes entry) → pass through unchanged.
- quantTypes is cleared on the result; tensors now carry their own marker.

Unlike MemSegWeightConverter, this one does NOT pre-transpose. The DSL's
linearProject() always calls ops.transpose(weight) at forward time; for
Q4/Q8 MemSeg tensors upstream returns a shape-only swap (free), so
pre-transposing gains nothing. For FP32 / dequantized K-quants the
runtime transpose still costs; addressing it requires a pre-transposed
marker that linearProject checks, tracked as a follow-up perf optimization.

Existing MemSegWeightConverter is unchanged and stays in place; it gets
deleted in Phase 4 along with LlamaRuntime / LlamaIngestion.

Tests: 4 new in DecoderGgufMemSegConverterTest covering empty-quantTypes
no-op, Q4_0 wrap, Q8_0 wrap, key-set preservation. Existing
MemSegWeightConverterTest, all :llm-core, :llm-inference:llama,
:llm-inference:qwen, :llm-runtime:kllama tests pass.

Refs Phase 1B sub-plan (~/.claude/plans/snazzy-wibbling-dewdrop-1B.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
michalharakal merged commit 46bd75e into develop on May 4, 2026
2 checks passed
michalharakal deleted the feat/decoder-gguf-memseg-converter branch on May 5, 2026, 08:13

Development

Successfully merging this pull request may close these issues.

Wire QwenNetworkLoader into CLI for proper Qwen3 inference
