
feat(kllama-cli): swap Llama GGUF + SafeTensors branches to DSL path#122

Merged
michalharakal merged 1 commit into develop from feat/llama-dsl-cli-swap
May 4, 2026

@michalharakal
Contributor

Companion to #121 — swaps the kllama CLI's remaining Llama-family branches off LlamaRuntime and onto the DSL path. After this merge all GGUF and SafeTensors paths in the kllama CLI run through the DSL.

Changes

  • Llama / Mistral GGUF: DecoderGgufWeightLoader(NATIVE_OPTIMIZED, LLAMA_COMPATIBLE_ARCHITECTURES) → DecoderGgufMemSegConverter.convert → LlamaNetworkLoader.fromWeights → OptimizedLLMRuntime DIRECT mode. Same packed Q4_0/Q8_0 SIMD path the Qwen swap uses.
  • Llama SafeTensors: DecoderSafeTensorsLoader<FP32>(ctx, FP32::class, metadata, tiedEmbeddings).loadToMap { … } → LlamaNetworkLoader.fromWeights → OptimizedLLMRuntime. Drops the legacy LlamaIngestion SafeTensors path.
  • BIN (Karpathy llama2.c format): kept on legacy LlamaRuntime for now. The .bin loader returns LlamaRuntimeWeights directly and the DSL path requires DecoderGgufWeights. Either migrate Llama2DotCWeightLoader or drop .bin support — separate followup.
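
As a rough illustration of the split described above, the sketch below routes a model file to either the DSL path or the legacy path. Every name in it (`RuntimePath`, `ModelFormat`, `detectFormat`) is hypothetical, and detecting the format from the file extension is an assumption; the actual CLI may decide differently, for example when a SafeTensors checkpoint is a directory.

```kotlin
// Hypothetical sketch: none of these names come from the kllama CLI.
enum class RuntimePath { DSL, LEGACY }

enum class ModelFormat(val path: RuntimePath) {
    GGUF(RuntimePath.DSL),         // Llama / Mistral / Qwen GGUF
    SAFETENSORS(RuntimePath.DSL),  // Llama SafeTensors
    BIN(RuntimePath.LEGACY)        // Karpathy llama2.c format, still on LlamaRuntime
}

// Assumption: the format is inferred from the file extension alone.
fun detectFormat(fileName: String): ModelFormat {
    val lower = fileName.lowercase()
    return when {
        lower.endsWith(".gguf") -> ModelFormat.GGUF
        lower.endsWith(".safetensors") -> ModelFormat.SAFETENSORS
        lower.endsWith(".bin") -> ModelFormat.BIN
        else -> throw IllegalArgumentException("Unsupported model file: $fileName")
    }
}

fun main() {
    for (name in listOf("mistral-7b.Q4_0.gguf", "model.safetensors", "stories15M.bin")) {
        val format = detectFormat(name)
        println("$name -> $format via ${format.path}")
    }
}
```

Keeping the path on the enum entry means that moving BIN to the DSL later would be a one-line change in this sketch.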

What's still on legacy after this PR

  • BIN format in this CLI (above).
  • KLlamaJava (Java facade) and KLlamaSession.
  • LlamaIngestionBlocking.
  • :llm-apps:skainet-cli/Main.kt.
  • :llm-runtime:kqwen/QwenIngestion.kt.
  • :llm-performance benchmark engines (JVM + native).
  • Wasm/native kllama browser/cli Main.kt.

Each is a focused migration PR; deletion of LlamaRuntime / LlamaIngestion / MemSegWeightConverter / CpuAttentionBackend family comes after they're all migrated.

Why it's safe

Numerical parity with LlamaRuntime is pinned by QwenDslLegacyParityTest (#120, closes #114). That test exercises the same LlamaNetworkLoader.fromWeights codepath via Qwen, so the Llama branch produces equivalent output by the same construction. Q8 round-trip equivalence is pinned by QwenDslQuantizedTest (#113).
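
For readers unfamiliar with this kind of check, here is a minimal sketch of what a logit parity assertion between two runtimes can look like. It is not the code of QwenDslLegacyParityTest: the helper names, the sample logits, and the `1e-4f` tolerance are all assumptions chosen for illustration.

```kotlin
import kotlin.math.abs

// Largest element-wise absolute difference between two logit vectors.
fun maxAbsDiff(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "size mismatch: ${a.size} vs ${b.size}" }
    var worst = 0f
    for (i in a.indices) worst = maxOf(worst, abs(a[i] - b[i]))
    return worst
}

// Index of the largest logit, i.e. the token greedy decoding would pick.
fun argmax(v: FloatArray): Int {
    require(v.isNotEmpty()) { "empty logits" }
    var best = 0
    for (i in 1 until v.size) if (v[i] > v[best]) best = i
    return best
}

// Hypothetical parity check: logits agree within a tolerance and the
// greedy token matches. The default tolerance is an assumed value.
fun assertParity(legacy: FloatArray, dsl: FloatArray, tolerance: Float = 1e-4f) {
    val diff = maxAbsDiff(legacy, dsl)
    check(diff <= tolerance) { "logits diverge: max abs diff $diff > $tolerance" }
    check(argmax(legacy) == argmax(dsl)) { "greedy token differs" }
}

fun main() {
    // Made-up logits standing in for the output of the two runtimes.
    val legacy = floatArrayOf(0.10f, 2.50f, -1.20f)
    val dsl = floatArrayOf(0.10001f, 2.49999f, -1.20002f)
    assertParity(legacy, dsl)
    println("parity holds")
}
```

Comparing the greedy token in addition to the raw difference catches the case where two near-tied logits swap order while still being inside the tolerance.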

Imports cleaned

Removed LlamaIngestion, LlamaLoadConfig, MemSegWeightConverter, LlamaWeightMapper — all unused after the swap. LlamaRuntime + CpuAttentionBackend stay for the BIN fallback.

Test plan

  • :llm-runtime:kllama:jvmTest, :llm-core:jvmTest, :llm-inference:qwen:jvmTest, :llm-inference:llama:jvmTest — all pass.
  • Compile clean (only one pre-existing @Deprecated warning on the LlamaRuntime BIN-fallback ctor).
  • CI green on PR.
  • Manual (post-merge): kllama-cli with a real Llama / Mistral GGUF, plus a SafeTensors checkpoint; verify coherent output.

🤖 Generated with Claude Code

Mirrors the Qwen swap from #121 for the Llama / Mistral GGUF branch
and the Llama SafeTensors branch:

- **GGUF**: `DecoderGgufWeightLoader(NATIVE_OPTIMIZED, LLAMA_COMPATIBLE_ARCHITECTURES)`
  → `DecoderGgufMemSegConverter.convert` → `LlamaNetworkLoader.fromWeights`
  → `OptimizedLLMRuntime` DIRECT mode. Same packed Q4_0/Q8_0 SIMD path
  the Qwen swap uses; no behavior change for quantized models.
- **SafeTensors**: `DecoderSafeTensorsLoader<FP32>(...).loadToMap` →
  `LlamaNetworkLoader.fromWeights` → `OptimizedLLMRuntime`. Drops the
  legacy `LlamaIngestion` SafeTensors path entirely.
- **BIN** (Karpathy llama2.c format): kept on legacy `LlamaRuntime` for
  now. The .bin loader returns `LlamaRuntimeWeights` directly, and the
  DSL path requires `DecoderGgufWeights`. Either migrate
  `Llama2DotCWeightLoader` or drop .bin support — separate followup.

After this merge the kllama CLI's Llama / Qwen / Mistral GGUF + Llama
SafeTensors paths all run through the DSL. Only BIN format and a
handful of other consumers (`KLlamaJava`, `:llm-apps:skainet-cli`,
`:llm-performance` benchmark engines) still depend on `LlamaRuntime` /
`LlamaIngestion` / `MemSegWeightConverter` / `CpuAttentionBackend`.
Those migrations + deletion of the legacy stack are subsequent PRs.

Imports cleaned: removed `LlamaIngestion`, `LlamaLoadConfig`,
`MemSegWeightConverter`, `LlamaWeightMapper` — all unused after the
swap. `LlamaRuntime` and `CpuAttentionBackend` stay (BIN path).

Numerical parity with the legacy LlamaRuntime path on identical
weights is pinned by `QwenDslLegacyParityTest` (#120) — same
`LlamaNetworkLoader.fromWeights` codepath, just exercised via Qwen.
The Llama branch produces equivalent output by the same construction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Phase 4 readiness: DSL Qwen and legacy LlamaRuntime diverge numerically on identical weights
