feat(kllama-cli): swap Llama GGUF + SafeTensors branches to DSL path #122
Merged
Mirrors the Qwen swap from #121 for the Llama / Mistral GGUF branch and the Llama SafeTensors branch:

- **GGUF**: `DecoderGgufWeightLoader(NATIVE_OPTIMIZED, LLAMA_COMPATIBLE_ARCHITECTURES)` → `DecoderGgufMemSegConverter.convert` → `LlamaNetworkLoader.fromWeights` → `OptimizedLLMRuntime` DIRECT mode. Same packed Q4_0/Q8_0 SIMD path the Qwen swap uses; no behavior change for quantized models.
- **SafeTensors**: `DecoderSafeTensorsLoader<FP32>(...).loadToMap` → `LlamaNetworkLoader.fromWeights` → `OptimizedLLMRuntime`. Drops the legacy `LlamaIngestion` SafeTensors path entirely.
- **BIN** (Karpathy llama2.c format): kept on legacy `LlamaRuntime` for now. The .bin loader returns `LlamaRuntimeWeights` directly, and the DSL path requires `DecoderGgufWeights`. Either migrate `Llama2DotCWeightLoader` or drop .bin support; that is a separate follow-up.

After this merge the kllama CLI's Llama / Qwen / Mistral GGUF + Llama SafeTensors paths all run through the DSL. Only BIN format and a handful of other consumers (`KLlamaJava`, `:llm-apps:skainet-cli`, `:llm-performance` benchmark engines) still depend on `LlamaRuntime` / `LlamaIngestion` / `MemSegWeightConverter` / `CpuAttentionBackend`. Those migrations and the deletion of the legacy stack are subsequent PRs.

Imports cleaned: removed `LlamaIngestion`, `LlamaLoadConfig`, `MemSegWeightConverter`, `LlamaWeightMapper`, all unused after the swap. `LlamaRuntime` and `CpuAttentionBackend` stay (BIN path).

Numerical parity with the legacy `LlamaRuntime` path on identical weights is pinned by `QwenDslLegacyParityTest` (#120): same `LlamaNetworkLoader.fromWeights` codepath, just exercised via Qwen. The Llama branch produces equivalent output by the same construction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
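The routing this commit message describes (GGUF and SafeTensors on the DSL path, BIN on the legacy runtime) can be sketched roughly as follows. This is a minimal illustration, not the CLI's code: `WeightFormat`, `RuntimePath`, `detectFormat`, and `selectPath` are hypothetical names, and detecting the format from the file extension is an assumption.

```kotlin
// Hypothetical stand-ins for illustration; these are NOT the kllama CLI's real types.
enum class WeightFormat { GGUF, SAFETENSORS, BIN }

enum class RuntimePath { DSL, LEGACY }

// Assumption: the weight format is inferred from the file extension.
fun detectFormat(fileName: String): WeightFormat = when {
    fileName.endsWith(".gguf", ignoreCase = true) -> WeightFormat.GGUF
    fileName.endsWith(".safetensors", ignoreCase = true) -> WeightFormat.SAFETENSORS
    fileName.endsWith(".bin", ignoreCase = true) -> WeightFormat.BIN
    else -> throw IllegalArgumentException("Unsupported weight file: $fileName")
}

// After this change: GGUF and SafeTensors use the DSL path, BIN stays on the legacy runtime.
fun selectPath(format: WeightFormat): RuntimePath = when (format) {
    WeightFormat.GGUF, WeightFormat.SAFETENSORS -> RuntimePath.DSL
    WeightFormat.BIN -> RuntimePath.LEGACY
}

fun main() {
    for (name in listOf("llama.gguf", "model.safetensors", "stories15M.bin")) {
        println("$name -> ${selectPath(detectFormat(name))}")
    }
}
```

Because the `when` in `selectPath` is exhaustive over the enum, dropping or migrating the BIN branch later would be a single-line change that the compiler checks.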
This was referenced May 4, 2026
Companion to #121: swaps the kllama CLI's remaining Llama-family branches off `LlamaRuntime` and onto the DSL path. After this merge all GGUF and SafeTensors paths in the kllama CLI run through the DSL.

## Changes
- **GGUF**: `DecoderGgufWeightLoader(NATIVE_OPTIMIZED, LLAMA_COMPATIBLE_ARCHITECTURES)` → `DecoderGgufMemSegConverter.convert` → `LlamaNetworkLoader.fromWeights` → `OptimizedLLMRuntime` DIRECT mode. Same packed Q4_0/Q8_0 SIMD path the Qwen swap uses.
- **SafeTensors**: `DecoderSafeTensorsLoader<FP32>(ctx, FP32::class, metadata, tiedEmbeddings).loadToMap { … }` → `LlamaNetworkLoader.fromWeights` → `OptimizedLLMRuntime`. Drops the legacy `LlamaIngestion` SafeTensors path.
- **BIN** (Karpathy llama2.c format): kept on legacy `LlamaRuntime` for now. The `.bin` loader returns `LlamaRuntimeWeights` directly and the DSL path requires `DecoderGgufWeights`. Either migrate `Llama2DotCWeightLoader` or drop .bin support; that is a separate follow-up.

## What's still on legacy after this PR
- `KLlamaJava` (Java facade) and `KLlamaSession.LlamaIngestionBlocking`.
- `:llm-apps:skainet-cli/Main.kt`.
- `:llm-runtime:kqwen/QwenIngestion.kt`.
- `:llm-performance` benchmark engines (JVM + native).
- `kllama` browser/cli `Main.kt`.

Each is a focused migration PR; deletion of the `LlamaRuntime` / `LlamaIngestion` / `MemSegWeightConverter` / `CpuAttentionBackend` family comes after they're all migrated.

## Why it's safe
Numerical parity with `LlamaRuntime` is pinned by `QwenDslLegacyParityTest` (#120, closes #114). Same `LlamaNetworkLoader.fromWeights` codepath, just exercised via Qwen; Llama produces equivalent output by the same construction. Q8 round-trip equivalence is pinned by `QwenDslQuantizedTest` (#113).

## Imports cleaned
Removed `LlamaIngestion`, `LlamaLoadConfig`, `MemSegWeightConverter`, `LlamaWeightMapper`, all unused after the swap. `LlamaRuntime` + `CpuAttentionBackend` stay for the BIN fallback.

## Test plan
- `:llm-runtime:kllama:jvmTest`, `:llm-core:jvmTest`, `:llm-inference:qwen:jvmTest`, `:llm-inference:llama:jvmTest`: all pass.
- (… `@Deprecated` warning on the `LlamaRuntime` BIN-fallback ctor).
- `kllama-cli` with a real Llama / Mistral GGUF, plus a SafeTensors checkpoint; verify coherent output.

🤖 Generated with Claude Code
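For readers unfamiliar with what a numerical-parity check between two runtimes involves, here is a minimal sketch. It is not the code of `QwenDslLegacyParityTest`. The helper names, the comparison criteria (matching greedy token plus a maximum absolute logit difference), and the tolerance value are all assumptions made for illustration.

```kotlin
import kotlin.math.abs

// Largest element-wise absolute difference between two logit vectors.
fun maxAbsDiff(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "Logit vectors differ in length: ${a.size} vs ${b.size}" }
    var worst = 0f
    for (i in a.indices) worst = maxOf(worst, abs(a[i] - b[i]))
    return worst
}

// Index of the largest logit, i.e. the greedy next-token choice.
fun argmax(logits: FloatArray): Int {
    require(logits.isNotEmpty()) { "Empty logit vector" }
    var best = 0
    for (i in 1 until logits.size) if (logits[i] > logits[best]) best = i
    return best
}

// Assumed parity criterion: same greedy token and logits within `tolerance`.
fun inParity(legacy: FloatArray, dsl: FloatArray, tolerance: Float): Boolean =
    argmax(legacy) == argmax(dsl) && maxAbsDiff(legacy, dsl) <= tolerance

fun main() {
    // Made-up logits standing in for the output of the two runtimes on the same prompt.
    val legacy = floatArrayOf(0.10f, 2.50f, -1.00f)
    val dsl = floatArrayOf(0.10f, 2.50f, -1.00f)
    println(inParity(legacy, dsl, tolerance = 1e-4f))
}
```

Checking the greedy token as well as the raw difference catches the case where a small numerical drift is still large enough to change the generated text.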