feat(llama): add DecoderGgufMemSegConverter for the DSL inference path by michalharakal · Pull Request #112 · SKaiNET-developers/SKaiNET-transformers

michalharakal · 2026-05-04T10:28:57Z

Summary

Adds a post-load converter targeting the DSL path's DecoderGgufWeights<T, V> format. This is the bind-time piece Phase 1B needs so quantized weights can flow through OptimizedLLMRuntime without dequantizing the entire model to FP32.
Purely additive — 2 new files in :llm-inference:llama, no existing code modified. The legacy MemSegWeightConverter (operates on LlamaRuntimeWeights) stays in place and gets deleted in Phase 4 along with LlamaRuntime.

Why a separate converter

The original Phase 1B plan said "move MemSegWeightConverter to :llm-core (it's generic)". That premise was wrong: the existing converter operates on LlamaRuntimeWeights<FP32> (the hand-coded LlamaRuntime's named-field layer struct: wq, wk, wv, …) and is genuinely tied to that legacy format. The DSL path uses DecoderGgufWeights<T, V> — a flat tensor-name → tensor map — which has a different shape. So the DSL needs its own converter; the two coexist until Phase 4 deletes the legacy one.
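To make the shape difference concrete, here is a minimal sketch of the two layouts. Every type in it is a hypothetical stand-in (the real `LlamaRuntimeWeights` and `DecoderGgufWeights<T, V>` carry more fields and type parameters); it only illustrates why a converter written against named fields does not transfer to a flat map.

```kotlin
// Hypothetical stand-ins, not the project's real classes.
class Tensor(val shape: List<Int>)

// Legacy style: one named field per projection, per layer.
class LayerWeights(val wq: Tensor, val wk: Tensor, val wv: Tensor)
class NamedFieldWeights(val layers: List<LayerWeights>)

// DSL style: a flat map keyed by GGUF tensor name.
class FlatWeights(val tensors: Map<String, Tensor>)

// Over the flat layout, a converter is a single pass over the map entries.
fun FlatWeights.mapTensors(f: (String, Tensor) -> Tensor): FlatWeights =
    FlatWeights(tensors.mapValues { (name, t) -> f(name, t) })

// Over the named-field layout, every field has to be visited explicitly,
// so the converter is tied to that struct's exact set of fields.
fun NamedFieldWeights.mapTensors(f: (Tensor) -> Tensor): NamedFieldWeights =
    NamedFieldWeights(layers.map { LayerWeights(f(it.wq), f(it.wk), f(it.wv)) })
```

The two `mapTensors` functions cannot share an implementation, which is the practical reason the converters coexist instead of being unified.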

Behavior

| Input quant type | Output |
| --- | --- |
| Q4_0 | Wrapped as `Q4MemorySegmentTensorData` |
| Q8_0 | Wrapped as `Q8MemorySegmentTensorData` |
| Q4_K / Q5_K / Q6_K | Dequantized to FP32 (same trade-off `MemSegWeightConverter` makes; packed K-quant kernels aren't on the DSL hot path) |
| FP32 (no entry in `quantTypes`) | Pass-through unchanged |
| Anything else | Warning logged, pass-through (forward will fail at matmul if hit) |

quantTypes is cleared on the result — packed tensors carry their own marker, dequantized tensors have no quant identity, and a stale map would mislead later consumers.
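The dispatch rules above can be sketched roughly as follows. This is a simplified model, not the converter's actual code: `QuantType`, `RawBytes`, `PackedQ4`, `PackedQ8`, `Fp32`, `Weights`, and the `dequantize` callback are all hypothetical names introduced for illustration, and the real implementation wraps memory segments rather than byte arrays.

```kotlin
// Hypothetical, simplified model of the behavior described above.
enum class QuantType { Q4_0, Q8_0, Q4_K, Q5_K, Q6_K, OTHER }

sealed interface TensorData
class RawBytes(val bytes: ByteArray) : TensorData   // still-packed payload as loaded
class PackedQ4(val bytes: ByteArray) : TensorData   // stands in for Q4MemorySegmentTensorData
class PackedQ8(val bytes: ByteArray) : TensorData   // stands in for Q8MemorySegmentTensorData
class Fp32(val values: FloatArray) : TensorData

class Weights(
    val tensors: Map<String, TensorData>,
    val quantTypes: Map<String, QuantType>,
)

fun convert(w: Weights, dequantize: (ByteArray, QuantType) -> FloatArray): Weights {
    val out: Map<String, TensorData> = w.tensors.mapValues { (name, data) ->
        // No quantTypes entry means the tensor is already FP32: pass through.
        val qt = w.quantTypes[name] ?: return@mapValues data
        val raw = data as? RawBytes ?: return@mapValues data
        when (qt) {
            QuantType.Q4_0 -> PackedQ4(raw.bytes)
            QuantType.Q8_0 -> PackedQ8(raw.bytes)
            QuantType.Q4_K, QuantType.Q5_K, QuantType.Q6_K -> Fp32(dequantize(raw.bytes, qt))
            else -> {
                System.err.println("warning: unsupported quant type $qt for $name, passing through")
                data
            }
        }
    }
    // The result carries an empty quantTypes map: packed tensors identify
    // themselves by type, and dequantized tensors have no quant identity.
    return Weights(out, emptyMap())
}
```

Note that the key set of `tensors` is preserved by construction, since `mapValues` only rewrites values.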

Why no pre-transpose (unlike the legacy converter)

The DSL's linearProject(input, weight) always calls ops.matmul(input, ops.transpose(weight)). For Q4/Q8 MemSeg tensors upstream's transpose is a shape-only metadata swap (free), so pre-transposing brings no benefit. For dequantized K-quants and FP32 tensors a runtime transpose still has a real cost — addressing it requires adding a pre-transposed marker that linearProject checks, which is tracked as a follow-up perf optimization in the sub-plan.
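The cost asymmetry can be illustrated with a small sketch. Both classes are hypothetical and are not the upstream tensor types; the sketch assumes, as the paragraph above describes, that a packed tensor's transpose only swaps the recorded dimensions while a dense FP32 transpose has to produce a reordered copy.

```kotlin
// Hypothetical 2-D tensor views, for illustration only.
class PackedMatrix(val rows: Int, val cols: Int, val payload: ByteArray) {
    // Shape-only transpose: swap the dimensions and share the same payload.
    // Assumes the matmul kernel interprets the packed layout itself.
    fun transposed() = PackedMatrix(cols, rows, payload)
}

class DenseMatrix(val rows: Int, val cols: Int, val data: FloatArray) {
    // Materializing transpose: rows * cols reads and writes on every call.
    fun transposed(): DenseMatrix {
        val out = FloatArray(data.size)
        for (r in 0 until rows) {
            for (c in 0 until cols) {
                out[c * rows + r] = data[r * cols + c]
            }
        }
        return DenseMatrix(cols, rows, out)
    }
}
```

Under these assumptions, pre-transposing a packed tensor at bind time saves nothing, whereas a dense tensor pays for the copy on each forward call unless something like the proposed pre-transposed marker lets `linearProject` skip it.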

What ships next on top of this

1B.2 — fromGgufNative(...) entry point on QwenNetworkLoader / LlamaNetworkLoader / VoxtralNetworkLoader that loads with NATIVE_OPTIMIZED and runs the converter before binding into the DSL.
1B.4 — DSL-vs-LlamaRuntime numerical parity test on the same packed weights.
After both: Phase 4 (CLI swap, delete LlamaRuntime).

Test plan

4 new tests in DecoderGgufMemSegConverterTest: empty-quantTypes no-op, Q4_0 wrap (verified via Q4MemorySegmentMarker), Q8_0 wrap, key-set preservation.
No regressions: existing MemSegWeightConverterTest, all :llm-core, :llm-inference:llama, :llm-inference:qwen, :llm-runtime:kllama tests pass.
CI green on PR.

Refs the Phase 1B sub-plan (~/.claude/plans/snazzy-wibbling-dewdrop-1B.md) and the closed #46.

🤖 Generated with Claude Code

The legacy MemSegWeightConverter operates on LlamaRuntimeWeights, the hand-coded LlamaRuntime's runtime format. The new DSL path produces DecoderGgufWeights<T,V> (a flat map keyed by GGUF tensor name) and has no equivalent post-load step today, so the QwenNetworkLoader.fromGguf default of DEQUANTIZE_TO_FP32 is the only viable path, which inflates Qwen3-8B-Q4_K_M from ~5GB to ~32GB FP32 and breaks -Xmx42g.

This adds a generic post-load converter for the DSL path:

- Q4_0 / Q8_0 → wrapped as Q{4,8}MemorySegmentTensorData. Upstream DefaultCpuOpsJvm.matmul already auto-dispatches via the marker.
- Q4_K / Q5_K / Q6_K → dequantized to FP32. Same trade-off the legacy converter makes: packed K-quant kernels aren't on the hot path.
- FP32 (no quantTypes entry) → pass through unchanged.
- quantTypes is cleared on the result; tensors now carry their own marker.

Unlike MemSegWeightConverter, this one does NOT pre-transpose. The DSL's linearProject() always calls ops.transpose(weight) at forward; for Q4/Q8 MemSeg upstream returns shape-only swap (free), so pre-transposing is a noop. For FP32 / dequantized K-quants the runtime transpose still costs; addressing it requires a pre-transposed marker on linearProject, tracked as a follow-up perf optimization.

Existing MemSegWeightConverter is unchanged and stays in place; it gets deleted in Phase 4 along with LlamaRuntime / LlamaIngestion.

Tests: 4 new in DecoderGgufMemSegConverterTest covering empty-quantTypes no-op, Q4_0 wrap, Q8_0 wrap, key-set preservation. Existing MemSegWeightConverterTest, all :llm-core, :llm-inference:llama, :llm-inference:qwen, :llm-runtime:kllama tests pass.

Refs Phase 1B sub-plan (~/.claude/plans/snazzy-wibbling-dewdrop-1B.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
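As a back-of-the-envelope check on the ~5GB to ~32GB figure quoted in the commit message, weight memory is roughly parameter count times bits per weight divided by eight. The parameter count and the average bits per weight below are assumptions chosen for illustration, not values taken from this PR; real files also contain metadata and mixed per-tensor quantization.

```kotlin
// Rough weight-memory estimate: parameters * bits per weight / 8.
fun weightBytes(params: Long, bitsPerWeight: Double): Double =
    params * bitsPerWeight / 8.0

// Assumed figures for an "8B" model; not taken from the PR.
const val ASSUMED_PARAMS = 8_000_000_000L
const val ASSUMED_Q4_K_M_BITS = 5.0   // assumed average, including block scales

val fp32Gb = weightBytes(ASSUMED_PARAMS, 32.0) / 1e9                    // 32.0
val packedGb = weightBytes(ASSUMED_PARAMS, ASSUMED_Q4_K_M_BITS) / 1e9   // 5.0
```

Under these assumptions the FP32 copy alone is about 32 GB before activations, KV cache, or the transient buffers needed during dequantization are counted.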

michalharakal merged commit 46bd75e into develop May 4, 2026
2 checks passed

This was referenced May 4, 2026

feat(qwen): DSL native-quantized GGUF entry point + Q8 smoke test #113

Merged

Phase 4 readiness: DSL Qwen and legacy LlamaRuntime diverge numerically on identical weights #114

Closed

feat(kllama-cli): swap Qwen branch to DSL path (Phase 4) #121

Merged

michalharakal deleted the feat/decoder-gguf-memseg-converter branch May 5, 2026 08:13
