Tags: SKaiNET-developers/SKaiNET-transformers

0.23.5

Release 0.23.5

0.23.4

SKaiNET-transformers 0.23.4: BOM coverage gap fixed; docs corrected; BOM internals auto-discover.

Transformers-only release on the 0.23.x line. No SKaiNET engine bump
in this version; the `sk.ainet:skainet-bom` pin in
`gradle/libs.versions.toml` stays at 0.23.1.

Highlights

- BOM coverage gap. `:llm-inference:apertus` and
  `:llm-inference:voxtral` apply `com.vanniktech.maven.publish` and
  ship to Maven Central, but were missing from
  `skainet-transformers-bom`'s constraints. Consumers who imported
  the BOM and pulled either of these artifacts didn't get version
  alignment for them. Both are now constrained.

- Wrong artifact IDs in the README and tutorials. The "Current
  release" snippet in README.md and the two tutorial pages
  (getting-started-java.adoc, llama3-tool-calling.adoc) showed
  `sk.ainet.transformers:llm-core` / `llm-runtime-kllama` /
  `llm-agent` — those are project paths, not published artifact
  IDs. Real coordinates are `skainet-transformers-core`,
  `skainet-transformers-runtime-kllama`,
  `skainet-transformers-agent`. Anyone copy-pasting hit a "module
  not found" error. Snippets switched to the BOM pattern so future
  version bumps only need to touch one line; Maven snippet now uses
  the `-jvm` classifier suffix that Maven needs for KMP artifacts.

- BOM internals: auto-discovery via a buildSrc convention plugin.
  The `bomModules` list in `llm-bom/build.gradle.kts` is no longer
  hand-maintained. A new `sk.ainet.transformers.bom-coverage`
  plugin (`buildSrc/`) iterates `rootProject.subprojects`, picks up
  every sibling that applies `com.vanniktech.maven.publish`, and
  adds it as an `api` constraint on the BOM. The only manual input
  left is the exclusion list (currently just `:llm-performance`,
  which holds benchmarks and is not part of the consumer surface).
  The BOM is coherent by construction: a published module can no
  longer be missing from it or drift out of it, which is why the
  previous `verifyBomCoverage` drift-guard task was removed.

- llm-test-java now consumes SKaiNET through the local BOM. The
  three `sk.ainet.core:*` deps in `llm-test/llm-test-java/build.gradle.kts`
  are version-less and pinned through `platform(project(":llm-bom"))`,
  so the BOM is exercised inside this build itself. A regression in
  the BOM's constraints fails the local build instead of leaking
  out to a published artifact.

- Removed dead `allprojects { group = "sk.ainet.llm" }` from the
  root build. The published group has always been
  `sk.ainet.transformers` (sourced from `gradle.properties`); the
  override was being overridden in turn by vanniktech at publish
  time. The in-memory project group now matches the published
  group, removing a footgun for anyone resolving internal modules
  by GAV.
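
The BOM pattern mentioned in the README/tutorial fix can be sketched for a
Gradle consumer as follows. This is an illustrative fragment, not the exact
README snippet: it assumes the BOM is published under the same
`sk.ainet.transformers` group as the other artifacts, and that the consuming
project applies a plugin that provides the `implementation` configuration.

```kotlin
// build.gradle.kts of a consuming project (sketch under the assumptions above)
dependencies {
    // Import the BOM once; this is the only line that carries a version.
    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.23.4"))

    // Version-less coordinates; versions are supplied by the BOM constraints.
    implementation("sk.ainet.transformers:skainet-transformers-core")
    implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
    implementation("sk.ainet.transformers:skainet-transformers-agent")
}
```

With this shape a version bump touches only the `platform(...)` line. Gradle
resolves the KMP variants from module metadata, so the `-jvm` suffix is only
needed in Maven builds.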

Behavior

- POM contents for `skainet-transformers-bom` are equivalent to a
  hand-maintained BOM with `:llm-inference:apertus` and
  `:llm-inference:voxtral` added: the same set of constrained modules,
  listed alphabetically in the generated POM (Maven
  dependency management is order-independent).
- Configuration cache: clean. `--configuration-cache` stores on
  first run and reuses on subsequent runs.
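
For readers curious how the auto-discovered BOM described under Highlights
might be wired, here is a minimal sketch of such a convention plugin. It is
not the project's actual `sk.ainet.transformers.bom-coverage` source: the
file location, the `excludedFromBom` name, and the exact wiring are
assumptions made for illustration.

```kotlin
// buildSrc/src/main/kotlin/sk.ainet.transformers.bom-coverage.gradle.kts
// Hypothetical sketch; file name and contents are assumptions.
plugins {
    `java-platform`
}

// The one manual input: published modules deliberately kept out of the BOM.
val excludedFromBom = setOf(":llm-performance")

rootProject.subprojects
    .filter { it.path != project.path && it.path !in excludedFromBom }
    .forEach { sibling ->
        // withPlugin also fires if the sibling applies the plugin later,
        // so the result does not depend on project evaluation order.
        sibling.pluginManager.withPlugin("com.vanniktech.maven.publish") {
            dependencies {
                constraints {
                    // Constrain the sibling to the version built in this repo.
                    add("api", project(sibling.path))
                }
            }
        }
    }
```

Because the constraint set is derived from which siblings publish, adding a
new published module needs no BOM edit, which matches the "coherent by
construction" claim above.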

Notes

- `gradle/libs.versions.toml` keeps `skainet = "0.23.1"` — the
  CHANGELOG narrative claim of "version-aligned with SKaiNET X.Y.Z"
  has been drifting from the actual engine pin since 0.23.2 and
  this release does not fix that drift. Worth addressing in a
  later release that picks up an engine bump.

0.23.3

SKaiNET-transformers 0.23.3 — prefill progress callback for AgentLoop.

Highlights

- Prefill progress visibility. generateUntilStop gains an optional
  onPrefill: ((Int, Int) -> Unit)? parameter that fires once per
  prompt token during the autoregressive prefill loop, with
  (done, total) where `done` is 1-based and `total` is `prompt.size`.
  Plumbed through both AgentLoop.run and AgentLoop.runWithEncoder as
  a new default-no-op AgentListener.onPrefillProgress(done, total)
  method.

  Why this matters: prefill in 0.23.x is autoregressive — one
  forward() per prompt token (the comment on generateUntilStop
  documents the forwardBatched correctness regression we reverted).
  On a CPU-only runtime with a 300-token prompt the first onToken
  lands tens of seconds to minutes after the agent loop starts; UIs
  previously had no way to surface that work was happening, so the
  loop appeared hung. The new callback lets a UI show e.g.
  "prefill: 32/282 (11%)" instead of dead silence.

  Backwards compatible — the new parameter and interface method
  default to null/no-op, so existing AgentListener implementations
  and callers compile and behave unchanged.
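
  The callback contract can be illustrated with a self-contained sketch.
  The `prefill` and `formatPrefill` functions below are hypothetical
  stand-ins, not the library's `generateUntilStop`; only the
  `(done, total)` shape, the 1-based `done`, and the nullable default
  are taken from the notes above.

```kotlin
// Hypothetical stand-in for an autoregressive prefill loop.
// One forward call per prompt token, then one progress callback.
fun prefill(
    prompt: IntArray,
    forward: (Int) -> Unit,
    onPrefill: ((Int, Int) -> Unit)? = null, // null means "no reporting"
) {
    val total = prompt.size
    for ((index, token) in prompt.withIndex()) {
        forward(token)
        onPrefill?.invoke(index + 1, total) // done is 1-based
    }
}

// Formats a progress line of the kind a UI might display.
fun formatPrefill(done: Int, total: Int): String =
    "prefill: $done/$total (${done * 100 / total}%)"

fun main() {
    val prompt = IntArray(4) { it }
    prefill(
        prompt,
        forward = { /* model forward pass would run here */ },
        onPrefill = { done, total -> println(formatPrefill(done, total)) },
    )
}
```

  An empty prompt never enters the loop, so the callback does not fire,
  which is the edge case the tests below pin.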

Tests

- generateUntilStopReportsPrefillProgressForEachPromptToken pins the
  contract: one (done, total) pair per prompt token, in order, with
  done 1-based and total = prompt.size.
- generateUntilStopWithEmptyPromptDoesNotInvokePrefillCallback pins
  the empty-prompt edge case (callback must not fire).

Build / version

- VERSION_NAME 0.23.2 → 0.23.3; skainet pin stays at 0.23.1.

Docs

- CHANGELOG: 0.23.3 entry added; backfilled the missing 0.23.2 entry
  covering DSL-path swaps, tokenizer unification, Llama 3 fenced
  tool-call parser fix, Qwen3 NEOX RoPE pairing, and QK-norm
  RMSNorm-eps wiring.
- README: version coordinates 0.23.1 → 0.23.3; "What's new" section
  refreshed to lead with 0.23.3 and recap 0.23.2 / 0.23.1 below.

Known followups

- Same Llama Q8 perf gap from 0.23.2 stays open: give the DSL
  first-class Q4/Q8 DTypes so linearProject dispatches SIMD without
  the per-call ops.transpose tax, or push that selection deeper into
  ops.matmul.
- forwardBatched parity — the prefill speedup left on the table
  behind the autoregressive fallback. Once forwardBatched matches
  autoregressive logits, the new onPrefill callback could fire
  per-batch instead of per-token (with appropriate API tweak) for a
  5–10× prefill cost reduction.

0.23.2

SKaiNET-transformers 0.23.2: DSL swap-out for Llama/Qwen runners, GPU stub cleanup, Llama 3 tool-calling robustness.

Highlights

- DSL inference path. The kllama CLI's Qwen GGUF (#1bacb56) and Llama
  GGUF + SafeTensors (#d519eb2) branches, the kllama-native (#35aac6b)
  and kllama-wasm (#8ffd459) browser CLIs, the KLlamaJava facade
  (#e4b8b66), and the skainet-cli LLaMA/Qwen branch (#4219088) all run
  through DecoderGgufWeightLoader → LlamaNetworkLoader.fromWeights →
  OptimizedLLMRuntime DIRECT. Pinned by QwenDslLegacyParityTest
  (closes #114).
- Native-quantized DSL entry point. DecoderGgufMemSegConverter
  (#5847330) wraps Q4_0/Q8_0 GGUF tensors as Q4/Q8MemorySegmentTensorData
  with logical [out, in] shapes for the SIMD quant matmul kernels;
  K-quants dequant to FP32; token_embd dequantizes regardless of quant
  type so Embedding.gather sees real floats.
- Shared decoder body. llm-core gained a shared decoder transformer
  body builder + DecoderModelMetadata (#61488de); Llama/Qwen/Voxtral
  NetworkDef collapsed onto it (#5eb18fc); generic loaders renamed
  Llama* → Decoder* (#a2758a7).
- llm-core tokenizer alignment. GGUF tokenizer load routes through
  upstream sk.ainet.io.tokenizer (closes #52); SentencePiece decorator
  for Gemma-style chat models (#e5738a9); fromGgufSource /
  fromTokenizerJsonString (#864186c); Qwen / GPT-2 BPE GGUFs route to
  upstream byte-level BPE (#bc7c70c).
- Llama 3 tool calling robustness. Markdown code fences around the
  JSON tool call (```json ... ``` / ``` ... ```) are now peeled by
  Llama31ToolCallParserStrategy (#edb366c). This fixes silently missed
  calls on Llama 3.2 1B, which wraps its JSON despite the bare-JSON
  prompt instruction. ToolCallingDemo prints the rendered prompt,
  tools list, raw assistant output, and final conversation
  (#5c3b9fa) for debuggability.
- GPU stub cleanup. GpuAttentionBackend, GpuTensorBridge, and the
  createGpuBridge / createMetalContext / createMlxContext expect-actual
  chains were placeholders that always fell back to CPU. Deleted; the
  native benchmark scenario was renamed native-cpu-throughput (#cbc5cc6).
- Module cleanups. :llm-runtime:kqwen deleted (#db1fba8) — Qwen now
  shares the kllama runtime via the DSL swap. LlamaIngestionBlocking.kt
  removed (#26a0fed) — the Java facade went DSL.
- Docs. End-to-end Llama 3 tool-calling walkthrough for app integrators
  (#cea3173): dependency, KLlamaJava.loadGGUF, custom Tool, ChatSession
  + AgentLoop, AgentListener observability, parser fence note. The
  pre-existing format-internals reference is preserved.
- Smoke. Llama-3.2-1B-Instruct entry pinned with a tool-calling
  assertion (#1e7af50).
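
The fence peeling described in the tool-calling highlight can be sketched as
below. This is an illustrative helper, not the code in
`Llama31ToolCallParserStrategy`: the names `FENCED_BLOCK` and `peelCodeFence`
are hypothetical, and it assumes the opening and closing fences sit on their
own lines.

```kotlin
// Hypothetical helper, not the library's parser code.
// Matches an opening fence with an optional language tag, a body, and a
// closing fence. "`{3}" avoids writing the fence characters literally.
val FENCED_BLOCK = Regex(
    "`{3}[A-Za-z0-9_-]*[ \\t]*\\r?\\n(.*?)\\r?\\n?`{3}",
    RegexOption.DOT_MATCHES_ALL,
)

// Returns the fenced body if the whole output is one fenced block,
// otherwise returns the trimmed input unchanged.
fun peelCodeFence(raw: String): String {
    val trimmed = raw.trim()
    val match = FENCED_BLOCK.matchEntire(trimmed) ?: return trimmed
    return match.groupValues[1].trim()
}

fun main() {
    val fence = "`".repeat(3)
    val wrapped = fence + "json\n{\"name\": \"get_weather\"}\n" + fence
    println(peelCodeFence(wrapped)) // the bare JSON, ready for a JSON parser
}
```

Unfenced output passes through untouched, so a model that follows the
bare-JSON instruction is unaffected.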

Fixes

- fix(tool-calling): tolerate markdown code fences around Llama 3 JSON.
- fix(kllama-cli): route Llama GGUF/SafeTensors back to eager
  LlamaRuntime for now. The DSL Q4/Q8 path is functionally correct but
  pays a per-linearProject ops.transpose tax on packed Q4/Q8 weights
  (the DSL doesn't yet have first-class Q4/Q8 DTypes). Measured 0.24
  t/s on the DSL path vs ~0.37 t/s on the eager path with
  Llama-3.2-1B-Instruct-Q8; Qwen GGUF stays on DSL. Tracked as a perf
  followup.
- fix(llama): inject logical 2D shape and dequant token_embd in the
  DSL converter (now Qwen-only after the Llama revert above).
- fix(qwen): NEOX (SPLIT_HALF) RoPE pairing for Qwen3 GGUFs.
- fix(transformer): thread metadata RMSNorm eps through QK-norm.
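
For context on the RoPE pairing fix, the sketch below contrasts the two
common pairing conventions on a single head vector. It is a generic
illustration, not the project's kernel: the function names and the default
`base` are assumptions. NEOX (split-half) rotates element `i` with element
`i + d/2`; the interleaved convention rotates `2i` with `2i + 1`.

```kotlin
import kotlin.math.cos
import kotlin.math.pow
import kotlin.math.sin

// Rotates the pair (a, b) by angle theta.
fun rotate(a: Float, b: Float, theta: Double): Pair<Float, Float> {
    val c = cos(theta).toFloat()
    val s = sin(theta).toFloat()
    return Pair(a * c - b * s, a * s + b * c)
}

// NEOX / split-half pairing: (i, i + d/2).
fun ropeSplitHalf(x: FloatArray, pos: Int, base: Double = 10000.0): FloatArray {
    require(x.size % 2 == 0) { "head dimension must be even" }
    val half = x.size / 2
    val out = x.copyOf()
    for (i in 0 until half) {
        val theta = pos * base.pow(-2.0 * i / x.size)
        val (re, im) = rotate(x[i], x[i + half], theta)
        out[i] = re
        out[i + half] = im
    }
    return out
}

// Interleaved pairing: (2i, 2i + 1).
fun ropeInterleaved(x: FloatArray, pos: Int, base: Double = 10000.0): FloatArray {
    require(x.size % 2 == 0) { "head dimension must be even" }
    val out = x.copyOf()
    for (i in 0 until x.size / 2) {
        val theta = pos * base.pow(-2.0 * i / x.size)
        val (re, im) = rotate(x[2 * i], x[2 * i + 1], theta)
        out[2 * i] = re
        out[2 * i + 1] = im
    }
    return out
}

fun main() {
    val x = floatArrayOf(1f, 0f, 0f, 0f)
    println(ropeSplitHalf(x, pos = 1).joinToString())   // energy moves to index 2
    println(ropeInterleaved(x, pos = 1).joinToString()) // energy moves to index 1
}
```

Both variants use the same angles; only the element pairing differs, which
is why applying the wrong one to a checkpoint yields incoherent output
rather than a crash.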

Build / version

- VERSION_NAME 0.23.1 → 0.23.2; skainet pin stays at 0.23.1.
- New :llm-inference:voxtral module surfaces in the API dump.
- llm-performance JVM benchmark drops the legacy LlamaRuntime adapter
  (#4999ae5).
- Public API dumps refreshed via apiDump (#40200da).

Known followups

- Recover the previous ~2 t/s baseline on Llama Q8: either give the DSL
  first-class Q4/Q8 DTypes so linearProject can dispatch the SIMD
  kernel directly, or push that selection deeper into ops.matmul so
  the per-call transpose disappears.
- Bisect the residual gap between the 0.37 t/s eager path on this
  branch and the 2 t/s seen earlier on the same eager stack — skainet
  is still pinned at 0.23.1, so the regression isn't an upstream
  backend bump.

0.23.1

SKaiNET-transformers 0.23.1 — version-aligned with SKaiNET 0.23.1.

Highlights

- Apertus end-to-end. Real-GGUF loading on top of skainet 0.23.x's
  block-major Q4_K TensorData wiring, routed through OptimizedLLMRuntime
  + apertusNetwork(). Chat template, tool calling, and integration tests
  against Apertus-8B-Q4_K_S. See APERTUS_ROLLOUT.md.
- Gemma 4 chat-model JVM facade (Gemma4ChatModel) for embedded text-only
  deployments; close() propagates to the mmap arena; PLE mmap path now
  consumes upstream loadTensorStorageMapped.
- Multi-id EOS / stop-token support in the chat layer.
- Tokenizer auto-detect for SentencePiece in fromTokenizerJson.
- New end-to-end smoke test in llm-test/llm-test-java that wires LEAF
  (mdbr-leaf-mt via KBertJava) and Llama 3.2-1B (KLlamaJava) in one JVM,
  gated on env vars / cache fallbacks.
- Apertus tool calling as a first-class family alongside Llama 3, Gemma 4,
  Qwen, and ChatML/Hermes.
- kllama-cli + skainet-cli shadow-jar ServiceLoader fix-up so the
  priority-100 skainet-backend-native-cpu provider is picked up at runtime.

Fixes

- fix(apertus): force-dequant token_embd under NATIVE_OPTIMIZED.
- fix(tokenizer): auto-detect SentencePiece marker in fromTokenizerJson.
- fix(gemma4): produce coherent text on real SafeTensors checkpoint.
- fix(apertus): route through OptimizedLLMRuntime + apertusNetwork().

Build / version

- VERSION_NAME 0.21.1 → 0.23.1; skainet pin 0.23.0 → 0.23.1.
- llm-test/llm-test-java maxHeapSize 8g → 16g (Llama 3.2-1B + LEAF in one JVM).
- No 0.22.x transformers release was tagged; the version line jumps to
  re-sync with the engine.

See CHANGELOG.md for the full list of changes.

0.21.1

Release 0.21.1

Hotfix re-publish of 0.21.0 with missing POM_NAME for the apertus,
voxtral, and llm-performance modules — the 0.21.0 publish run failed
Sonatype Central Portal validation with 'Project name is missing'
on every publication from those three modules.

See PR #86.

0.21.0

SKaiNET-transformers 0.21.0

Mirrors SKaiNET 0.21.0. Highlights:

- SKaiNET 0.21.0 dependency: Panama Vector FP32 matmul kernel auto-discovered
  via ServiceLoader, ScratchPool SPI for runtime workspace allocation, Q4_K
  SIMD-fused kernel + SPI, Q6_K SIMD dequant, Q4_0 partial-vec dot, canonical
  ggml layout for Q4_K/Q5_K, FP32 MemSeg arena leak fix, TensorOps.permute.
- ScratchPool wired into kllama batched-prefill attention output and the BERT
  encoder forward — pooling is opt-in via PooledExecutionContext, default
  NoopScratchPool preserves existing behavior.
- First-class Java surface for Llama tool calling: KLlamaJava + KLlamaSession
  + JavaTool + JavaTools.definition + JavaAgentLoop, exercised end-to-end by
  llm-test:llm-test-java and llm-apps:kllama-java-sample.
- Removed deprecated-runtime CLIs: kqwen-only, kapertus-cli, kvoxtral-cli.
  Qwen now goes through skainet-cli or kllama-cli (same tensor layout).
- Antora docs site populated with Divio quadrants — Getting Started (Kotlin
  and Java), Tool Calling (generic + Llama 3 family), Embeddings, Smoke
  Tests, plus how-to and explanation pages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

0.16.0

version 0.16.0

v0.16.0

Release version 0.16.0

0.3.0

version 0.3.0