feat(llama): add DecoderGgufMemSegConverter for the DSL inference path #112
Merged
The legacy MemSegWeightConverter operates on LlamaRuntimeWeights — the
hand-coded LlamaRuntime's runtime format. The new DSL path produces
DecoderGgufWeights<T,V> (a flat map keyed by GGUF tensor name) and has
no equivalent post-load step today, so the QwenNetworkLoader.fromGguf
default of DEQUANTIZE_TO_FP32 is the only viable path — which inflates
Qwen3-8B-Q4_K_M from ~5GB to ~32GB FP32 and breaks -Xmx42g.
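As a rough check on the figures above: FP32 stores 4 bytes per weight, so a model with about 8 billion parameters needs about 32 GB for weights alone. The sketch below does only that arithmetic. The parameter count is a rounded assumption, not a number read from the GGUF file, and the names are hypothetical.

```kotlin
// Back-of-the-envelope estimate only. ASSUMED_PARAMS is a rounded guess
// for an "8B" model, not a value taken from the actual checkpoint.
const val ASSUMED_PARAMS = 8_000_000_000L
const val FP32_BYTES_PER_WEIGHT = 4L

// Bytes needed to hold every weight as a 32-bit float.
fun fp32Bytes(paramCount: Long): Long = paramCount * FP32_BYTES_PER_WEIGHT

fun main() {
    val bytes = fp32Bytes(ASSUMED_PARAMS)
    // Decimal gigabytes, weights only (no KV cache, no activations).
    println("FP32 weights: about ${bytes / 1_000_000_000} GB")
}
```

The estimate covers weights only, so anything else the JVM must hold comes on top of it.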
This adds a generic post-load converter for the DSL path:
- Q4_0 / Q8_0 → wrapped as Q{4,8}MemorySegmentTensorData. Upstream
DefaultCpuOpsJvm.matmul already auto-dispatches via the marker.
- Q4_K / Q5_K / Q6_K → dequantized to FP32. Same trade-off the legacy
converter makes — packed K-quant kernels aren't on the hot path.
- FP32 (no quantTypes entry) → pass through unchanged.
- quantTypes is cleared on the result; tensors now carry their own marker.
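The rules in the list above can be sketched as one pass over the tensor map. This is a minimal illustration, not the converter's actual code: every type here (Weights, RawBytes, Q4MemSeg, Q8MemSeg, Fp32) is a hypothetical stand-in for the project's and upstream's real classes, and the dequantizer is passed in as a function because the K-quant decoding itself is out of scope.

```kotlin
// Hypothetical stand-ins for illustration; not the project's real types.
enum class QuantType { Q4_0, Q8_0, Q4_K, Q5_K, Q6_K }

sealed interface TensorData
class RawBytes(val bytes: ByteArray) : TensorData    // packed bytes as loaded
class Fp32(val values: FloatArray) : TensorData      // plain FP32 tensor
class Q4MemSeg(val packed: ByteArray) : TensorData   // stands in for Q4MemorySegmentTensorData
class Q8MemSeg(val packed: ByteArray) : TensorData   // stands in for Q8MemorySegmentTensorData

class Weights(
    val tensors: Map<String, TensorData>,     // keyed by GGUF tensor name
    val quantTypes: Map<String, QuantType>,   // absent entry means FP32
)

fun convert(
    weights: Weights,
    dequantize: (QuantType, TensorData) -> Fp32,
): Weights {
    val converted: Map<String, TensorData> = weights.tensors.mapValues { (name, data) ->
        // No quantTypes entry: FP32, pass through unchanged.
        val type = weights.quantTypes[name] ?: return@mapValues data
        when (type) {
            QuantType.Q4_0 -> Q4MemSeg((data as RawBytes).bytes)   // wrap, keep packed
            QuantType.Q8_0 -> Q8MemSeg((data as RawBytes).bytes)   // wrap, keep packed
            QuantType.Q4_K, QuantType.Q5_K, QuantType.Q6_K -> dequantize(type, data)
        }
    }
    // The result carries no quantTypes; wrapped tensors identify themselves.
    return Weights(converted, quantTypes = emptyMap())
}

fun main() {
    val input = Weights(
        tensors = mapOf(
            "blk.0.attn_q.weight" to RawBytes(ByteArray(18)),
            "output_norm.weight" to Fp32(floatArrayOf(1f)),
        ),
        quantTypes = mapOf("blk.0.attn_q.weight" to QuantType.Q4_0),
    )
    val out = convert(input) { _, _ -> error("no K-quants in this example") }
    println(out.tensors.keys == input.tensors.keys)          // key set preserved
    println(out.tensors["blk.0.attn_q.weight"] is Q4MemSeg)  // Q4_0 wrapped
    println(out.quantTypes.isEmpty())                        // map cleared
}
```

Because mapValues keeps every key, key-set preservation follows from the structure of the pass rather than from extra bookkeeping.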
Unlike MemSegWeightConverter, this one does NOT pre-transpose. The DSL's
linearProject() always calls ops.transpose(weight) at forward time. For
Q4/Q8 MemSeg tensors, upstream's transpose is a shape-only swap (free), so
pre-transposing would gain nothing. For FP32 and dequantized K-quants the
runtime transpose still has a cost; addressing it requires a pre-transposed
marker on linearProject, tracked as a follow-up perf optimization.
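The difference between the two transpose behaviors described above can be sketched with two toy tensor types. Both classes are hypothetical and much simpler than the upstream ones. The sketch shows only why one transpose is free and the other is not; it does not model how a matmul kernel interprets the swapped shape of packed data.

```kotlin
// Hypothetical minimal tensor types, not the project's real classes.
class PackedView(val rows: Int, val cols: Int, val packed: ByteArray) {
    // Shape-only transpose: swaps the dimensions and shares the same bytes.
    fun transpose() = PackedView(cols, rows, packed)
}

class DenseFp32(val rows: Int, val cols: Int, val data: FloatArray) {
    // Materializing transpose: allocates and fills rows * cols floats.
    fun transpose(): DenseFp32 {
        val out = FloatArray(data.size)
        for (r in 0 until rows) {
            for (c in 0 until cols) {
                // Row-major (r, c) in the source becomes (c, r) in the result.
                out[c * rows + r] = data[r * cols + c]
            }
        }
        return DenseFp32(cols, rows, out)
    }
}

fun main() {
    val packed = PackedView(2, 3, ByteArray(4))
    println(packed.transpose().packed === packed.packed)   // same backing array

    val dense = DenseFp32(2, 3, floatArrayOf(1f, 2f, 3f, 4f, 5f, 6f))
    val t = dense.transpose()
    println("${t.rows}x${t.cols}: ${t.data.toList()}")     // new array, reordered
}
```

A pre-transposed marker would let the dense case skip this copy on every forward pass, which is the follow-up the text refers to.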
Existing MemSegWeightConverter is unchanged and stays in place; it gets
deleted in Phase 4 along with LlamaRuntime / LlamaIngestion.
Tests: 4 new in DecoderGgufMemSegConverterTest covering empty-quantTypes
no-op, Q4_0 wrap, Q8_0 wrap, key-set preservation. Existing
MemSegWeightConverterTest, all :llm-core, :llm-inference:llama,
:llm-inference:qwen, :llm-runtime:kllama tests pass.
Refs Phase 1B sub-plan (~/.claude/plans/snazzy-wibbling-dewdrop-1B.md).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Adds DecoderGgufMemSegConverter, a generic post-load converter for the DSL path's DecoderGgufWeights<T, V> format. This is the bind-time piece Phase 1B needs so quantized weights can flow through OptimizedLLMRuntime without dequantizing the entire model to FP32.
- New code in :llm-inference:llama, no existing code modified. The legacy MemSegWeightConverter (operates on LlamaRuntimeWeights) stays in place and gets deleted in Phase 4 along with LlamaRuntime.
Why a separate converter
The original Phase 1B plan said "move MemSegWeightConverter to :llm-core (it's generic)". That premise was wrong: the existing converter operates on LlamaRuntimeWeights<FP32> (the hand-coded LlamaRuntime's named-field layer struct: wq, wk, wv, …) and is genuinely tied to that legacy format. The DSL path uses DecoderGgufWeights<T, V>, a flat tensor-name → tensor map, which has a different shape. So the DSL needs its own converter; the two coexist until Phase 4 deletes the legacy one.
Behavior
- Q4_0 → wrapped as Q4MemorySegmentTensorData
- Q8_0 → wrapped as Q8MemorySegmentTensorData
- Q4_K / Q5_K / Q6_K → dequantized to FP32 (same trade-off the legacy MemSegWeightConverter makes; packed K-quant kernels aren't on the DSL hot path)
- FP32 (no entry in quantTypes) → passed through unchanged
quantTypes is cleared on the result: packed tensors carry their own marker, dequantized tensors have no quant identity, and a stale map would mislead later consumers.
Why no pre-transpose (unlike the legacy converter)
The DSL's linearProject(input, weight) always calls ops.matmul(input, ops.transpose(weight)). For Q4/Q8 MemSeg tensors upstream's transpose is a shape-only metadata swap (free), so pre-transposing brings no benefit. For dequantized K-quants and FP32 tensors a runtime transpose still has a real cost; addressing it requires adding a pre-transposed marker that linearProject checks, which is tracked as a follow-up perf optimization in the sub-plan.
What ships next on top of this
- fromGgufNative(...) entry point on QwenNetworkLoader / LlamaNetworkLoader / VoxtralNetworkLoader that loads with NATIVE_OPTIMIZED and runs the converter before binding into the DSL.
- LlamaRuntime numerical parity test on the same packed weights.
- … LlamaRuntime).
Test plan
- 4 new tests in DecoderGgufMemSegConverterTest: empty-quantTypes no-op, Q4_0 wrap (verified via Q4MemorySegmentMarker), Q8_0 wrap, key-set preservation.
- Existing MemSegWeightConverterTest, all :llm-core, :llm-inference:llama, :llm-inference:qwen, :llm-runtime:kllama tests pass.
Refs the Phase 1B sub-plan (~/.claude/plans/snazzy-wibbling-dewdrop-1B.md) and the closed #46.
🤖 Generated with Claude Code