Tags · SKaiNET-developers/SKaiNET-transformers

0.23.5

Release 0.23.5

May 8, 2026
2c43150

0.23.4

SKaiNET-transformers 0.23.4 - BOM coverage gap fixed; docs corrected; BOM internals auto-discover.

Transformers-only release on the 0.23.x line. No SKaiNET engine bump
in this version; the `sk.ainet:skainet-bom` pin in
`gradle/libs.versions.toml` stays at 0.23.1.

Highlights

- BOM coverage gap. `:llm-inference:apertus` and
  `:llm-inference:voxtral` apply `com.vanniktech.maven.publish` and
  ship to Maven Central, but were missing from
  `skainet-transformers-bom`'s constraints. Consumers who imported
  the BOM and pulled either of these artifacts didn't get version
  alignment for them. Both are now constrained.

- Wrong artifact IDs in the README and tutorials. The "Current
  release" snippet in README.md and the two tutorial pages
  (getting-started-java.adoc, llama3-tool-calling.adoc) showed
  `sk.ainet.transformers:llm-core` / `llm-runtime-kllama` /
  `llm-agent` — those are project paths, not published artifact
  IDs. Real coordinates are `skainet-transformers-core`,
  `skainet-transformers-runtime-kllama`,
  `skainet-transformers-agent`. Anyone copy-pasting hit a "module
  not found" error. Snippets switched to the BOM pattern so future
  version bumps only need to touch one line; Maven snippet now uses
  the `-jvm` classifier suffix that Maven needs for KMP artifacts.

- BOM internals: auto-discovery via a buildSrc convention plugin.
  The `bomModules` list in `llm-bom/build.gradle.kts` is no longer
  hand-maintained. A new `sk.ainet.transformers.bom-coverage`
  plugin (`buildSrc/`) iterates `rootProject.subprojects`, picks up
  every sibling that applies `com.vanniktech.maven.publish`, and
  adds it as an `api` constraint on the BOM. The only manual input
  left is the exclusion list (currently just `:llm-performance` —
  benchmarks, not part of the consumer surface). The BOM is
  coherent by construction: a published module can no longer be
  missing from it or drift out of sync, which is why the previous
  `verifyBomCoverage` drift-guard task was removed.

- llm-test-java now consumes SKaiNET through the local BOM. The
  three `sk.ainet.core:*` deps in `llm-test/llm-test-java/build.gradle.kts`
  are version-less and pinned through `platform(project(":llm-bom"))`,
  so the BOM is exercised inside this build itself. A regression in
  the BOM's constraints fails the local build instead of leaking
  out to a published artifact.

- Removed dead `allprojects { group = "sk.ainet.llm" }` from the
  root build. The published group has always been
  `sk.ainet.transformers` (sourced from `gradle.properties`); the
  override was being overridden in turn by vanniktech at publish
  time. The in-memory project group now matches the published
  group, removing a footgun for anyone resolving internal modules
  by GAV.
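
For illustration, the BOM pattern described above might look like this in a consumer's `build.gradle.kts`. This is a sketch assembled from the coordinates named in these notes, not a copy of the README snippet. It assumes the BOM is published under the same `sk.ainet.transformers` group as the other artifacts and that 0.23.4 is the version being consumed.

```kotlin
// build.gradle.kts of a hypothetical consumer project (fragment).
dependencies {
    // Import the BOM once; this is the only line that carries a version.
    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.23.4"))

    // Version-less coordinates; their versions come from the BOM constraints.
    implementation("sk.ainet.transformers:skainet-transformers-core")
    implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
    implementation("sk.ainet.transformers:skainet-transformers-agent")
}
```

With this shape a version bump touches only the `platform(...)` line. In a Maven build the artifact IDs would additionally carry the `-jvm` suffix mentioned above.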

Behavior

- POM contents for `skainet-transformers-bom` are bit-for-bit
  equivalent to a hand-maintained BOM with `:llm-inference:apertus`
  and `:llm-inference:voxtral` added — same set of constrained
  modules, alphabetical ordering in the generated POM (Maven
  dependency-management is order-independent).
- Configuration cache: clean. `--configuration-cache` stores on
  first run and reuses on subsequent runs.
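
A toy model of the selection rule may help show why the generated BOM contains the same set of modules, in alphabetical order, without a hand-maintained list. This is plain Kotlin, not the actual convention plugin: the `Module` type, the sample module list, and the assumption that the BOM project itself is skipped are all hypothetical.

```kotlin
// Hypothetical stand-in for a Gradle subproject; not a Gradle API type.
data class Module(val path: String, val appliesMavenPublish: Boolean)

// Every publishing module becomes a constraint, except the BOM itself
// (assumed) and the explicit exclusion list.
fun bomConstraints(
    modules: List<Module>,
    bomPath: String = ":llm-bom",
    excluded: Set<String> = setOf(":llm-performance"),
): List<String> =
    modules
        .filter { it.appliesMavenPublish }
        .map { it.path }
        .filter { it != bomPath && it !in excluded }
        .sorted() // alphabetical, so the output is stable across runs

fun main() {
    val modules = listOf(
        Module(":llm-core", appliesMavenPublish = true),
        Module(":llm-inference:voxtral", appliesMavenPublish = true),
        Module(":llm-inference:apertus", appliesMavenPublish = true),
        Module(":llm-performance", appliesMavenPublish = true),
        Module(":llm-test:llm-test-java", appliesMavenPublish = false),
        Module(":llm-bom", appliesMavenPublish = true),
    )
    // prints [:llm-core, :llm-inference:apertus, :llm-inference:voxtral]
    println(bomConstraints(modules))
}
```

Because the list is derived from what each module applies, adding a new published module changes the BOM without any edit to the BOM build file.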

Notes

- `gradle/libs.versions.toml` keeps `skainet = "0.23.1"` — the
  CHANGELOG narrative claim of "version-aligned with SKaiNET X.Y.Z"
  has been drifting from the actual engine pin since 0.23.2 and
  this release does not fix that drift. Worth addressing in a
  later release that picks up an engine bump.

May 8, 2026
975fc19

0.23.3

SKaiNET-transformers 0.23.3 — prefill progress callback for AgentLoop.

Highlights

- Prefill progress visibility. `generateUntilStop` gains an optional
  `onPrefill: ((Int, Int) -> Unit)?` parameter that fires once per
  prompt token during the autoregressive prefill loop, with
  (done, total) where `done` is 1-based and `total` is `prompt.size`.
  It is plumbed through both `AgentLoop.run` and
  `AgentLoop.runWithEncoder` as a new default-no-op
  `AgentListener.onPrefillProgress(done, total)` method.

  Why this matters: prefill in 0.23.x is autoregressive — one
  forward() per prompt token (the comment on generateUntilStop
  documents the forwardBatched correctness regression we reverted).
  On a CPU-only runtime with a 300-token prompt the first onToken
  lands tens of seconds to minutes after the agent loop starts; UIs
  previously had no way to surface that work was happening, so the
  loop appeared hung. The new callback lets a UI show e.g.
  "prefill: 32/282 (11%)" instead of dead silence.

  Backwards compatible — the new parameter and interface method
  default to null/no-op, so existing AgentListener implementations
  and callers compile and behave unchanged.
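
The callback contract above can be sketched in a few lines of standalone Kotlin. Nothing here is the library's code: `prefill`, `ProgressListener`, and `formatPrefill` are hypothetical stand-ins that only mirror the described shape (an optional `(done, total)` callback, 1-based `done`, and a listener method that defaults to a no-op).

```kotlin
// Hypothetical listener mirroring the default-no-op idea; not the real AgentListener.
interface ProgressListener {
    fun onToken(token: Int) {}
    fun onPrefillProgress(done: Int, total: Int) {} // newer method, no-op by default
}

// Hypothetical stand-in for the prefill phase: one forward step per prompt token.
fun prefill(
    prompt: IntArray,
    forward: (Int) -> Unit = {},             // placeholder for one model step
    onPrefill: ((Int, Int) -> Unit)? = null, // optional, null means "no reporting"
) {
    val total = prompt.size
    prompt.forEachIndexed { index, token ->
        forward(token)
        onPrefill?.invoke(index + 1, total)  // done is 1-based
    }
}

// Formats a progress line in the style quoted above.
fun formatPrefill(done: Int, total: Int): String =
    "prefill: $done/$total (${done * 100 / total}%)"

fun main() {
    // A listener that only overrides onToken still compiles and stays silent during prefill.
    val legacy = object : ProgressListener {
        override fun onToken(token: Int) = println("token $token")
    }
    prefill(intArrayOf(11, 22, 33), onPrefill = legacy::onPrefillProgress)

    // A UI-style consumer that surfaces progress instead of silence.
    prefill(intArrayOf(11, 22, 33), onPrefill = { done, total -> println(formatPrefill(done, total)) })
}
```

With an empty prompt the loop body never runs, so the callback is never invoked, which matches the edge case pinned by the tests below.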

Tests

- generateUntilStopReportsPrefillProgressForEachPromptToken pins the
  contract: one (done, total) pair per prompt token, in order, with
  done 1-based and total = prompt.size.
- generateUntilStopWithEmptyPromptDoesNotInvokePrefillCallback pins
  the empty-prompt edge case (callback must not fire).

Build / version

- VERSION_NAME 0.23.2 → 0.23.3; skainet pin stays at 0.23.1.

Docs

- CHANGELOG: 0.23.3 entry added; backfilled the missing 0.23.2 entry
  covering DSL-path swaps, tokenizer unification, Llama 3 fenced
  tool-call parser fix, Qwen3 NEOX RoPE pairing, and QK-norm
  RMSNorm-eps wiring.
- README: version coordinates 0.23.1 → 0.23.3; "What's new" section
  refreshed to lead with 0.23.3 and recap 0.23.2 / 0.23.1 below.

Known followups

- Same Llama Q8 perf gap from 0.23.2 stays open: give the DSL
  first-class Q4/Q8 DTypes so linearProject dispatches SIMD without
  the per-call ops.transpose tax, or push that selection deeper into
  ops.matmul.
- forwardBatched parity — the prefill speedup left on the table
  behind the autoregressive fallback. Once forwardBatched matches
  autoregressive logits, the new onPrefill callback could fire
  per-batch instead of per-token (with an appropriate API tweak) for a
  5–10× prefill cost reduction.

May 6, 2026
81899f4

0.23.2

SKaiNET-transformers 0.23.2 - DSL swap-out for Llama/Qwen runners, GPU stub cleanup, Llama 3 tool-calling robustness.

Highlights

- DSL inference path. The kllama CLI's Qwen GGUF (#1bacb56) and Llama
  GGUF + SafeTensors (#d519eb2) branches, the kllama-native (#35aac6b)
  and kllama-wasm (#8ffd459) browser CLIs, the KLlamaJava facade
  (#e4b8b66), and the skainet-cli LLaMA/Qwen branch (#4219088) all run
  through DecoderGgufWeightLoader → LlamaNetworkLoader.fromWeights →
  OptimizedLLMRuntime DIRECT. Pinned by QwenDslLegacyParityTest
  (closes #114).
- Native-quantized DSL entry point. DecoderGgufMemSegConverter
  (#5847330) wraps Q4_0/Q8_0 GGUF tensors as Q4/Q8MemorySegmentTensorData
  with logical [out, in] shapes for the SIMD quant matmul kernels;
  K-quants dequant to FP32; token_embd dequantizes regardless of quant
  type so Embedding.gather sees real floats.
- Shared decoder body. llm-core gained a shared decoder transformer
  body builder + DecoderModelMetadata (#61488de); Llama/Qwen/Voxtral
  NetworkDef collapsed onto it (#5eb18fc); generic loaders renamed
  Llama* → Decoder* (#a2758a7).
- llm-core tokenizer alignment. GGUF tokenizer load routes through
  upstream sk.ainet.io.tokenizer (closes #52); SentencePiece decorator
  for Gemma-style chat models (#e5738a9); fromGgufSource /
  fromTokenizerJsonString (#864186c); Qwen / GPT-2 BPE GGUFs route to
  upstream byte-level BPE (#bc7c70c).
- Llama 3 tool calling robustness. Markdown code fences around the
  JSON tool call (```json ... ``` / ``` ... ```) are now peeled by
  Llama31ToolCallParserStrategy (#edb366c) — fixes silently-missed
  calls on Llama 3.2 1B, which wraps its JSON despite the bare-JSON
  prompt instruction. ToolCallingDemo prints the rendered prompt,
  tools list, raw assistant output, and final conversation
  (#5c3b9fa) for debuggability.
- GPU stub cleanup. GpuAttentionBackend, GpuTensorBridge, and the
  createGpuBridge / createMetalContext / createMlxContext expect-actual
  chains were placeholders that always fell back to CPU. Deleted; the
  native benchmark scenario was renamed native-cpu-throughput (#cbc5cc6).
- Module cleanups. :llm-runtime:kqwen deleted (#db1fba8) — Qwen now
  shares the kllama runtime via the DSL swap. LlamaIngestionBlocking.kt
  removed (#26a0fed) — the Java facade went DSL.
- Docs. End-to-end Llama 3 tool-calling walkthrough for app integrators
  (#cea3173): dependency, KLlamaJava.loadGGUF, custom Tool, ChatSession
  + AgentLoop, AgentListener observability, parser fence note. The
  pre-existing format-internals reference is preserved.
- Smoke. Llama-3.2-1B-Instruct entry pinned with a tool-calling
  assertion (#1e7af50).
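
As a rough illustration of the fence peeling mentioned in the tool-calling highlight, the sketch below strips an optional Markdown code fence before the JSON is parsed. It is not the code in `Llama31ToolCallParserStrategy`: the helper name `peelCodeFence`, the exact rules, and the sample tool call are assumptions.

```kotlin
// Three backticks, built at runtime so this example contains no literal fence.
val FENCE = "`".repeat(3)

// Hypothetical helper: returns the text inside a surrounding code fence,
// or the trimmed input when there is no fence.
fun peelCodeFence(raw: String): String {
    val text = raw.trim()
    val fenced = text.length >= 2 * FENCE.length &&
        text.startsWith(FENCE) && text.endsWith(FENCE)
    if (!fenced) return text

    val body = text.removePrefix(FENCE).removeSuffix(FENCE)
    val newline = body.indexOf('\n')
    if (newline < 0) return body.trim()

    // Drop the rest of the opening fence line when it is empty or a language tag.
    val tag = body.substring(0, newline).trim()
    val inner = if (tag.all { it.isLetterOrDigit() }) body.substring(newline + 1) else body
    return inner.trim()
}

fun main() {
    // Illustrative tool call; the tool name and arguments are made up.
    val json = """{"name": "get_weather", "parameters": {"city": "Bratislava"}}"""
    println(peelCodeFence("${FENCE}json\n$json\n$FENCE")) // fenced with a language tag
    println(peelCodeFence("$FENCE\n$json\n$FENCE"))       // fenced without a tag
    println(peelCodeFence(json))                          // bare JSON passes through
}
```

All three calls print the same bare JSON line, so the downstream parser sees one shape regardless of how the model wrapped its output.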

Fixes

- fix(tool-calling): tolerate markdown code fences around Llama 3 JSON.
- fix(kllama-cli): route Llama GGUF/SafeTensors back to eager
  LlamaRuntime for now. The DSL Q4/Q8 path is functionally correct but
  pays a per-linearProject ops.transpose tax on packed Q4/Q8 weights
  (the DSL doesn't yet have first-class Q4/Q8 DTypes). Measured 0.24
  t/s on the DSL path vs ~0.37 t/s on the eager path on
  Llama-3.2-1B-Instruct-Q8; Qwen GGUF stays on DSL. Tracked as a perf
  followup.
- fix(llama): inject logical 2D shape and dequant token_embd in the
  DSL converter (now Qwen-only after the Llama revert above).
- fix(qwen): NEOX (SPLIT_HALF) RoPE pairing for Qwen3 GGUFs.
- fix(transformer): thread metadata RMSNorm eps through QK-norm.
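
For readers unfamiliar with the RoPE pairing distinction behind the Qwen3 fix, the sketch below contrasts the two common conventions: rotating adjacent dimensions (2i, 2i+1) versus the NEOX-style split-half pairing (i, i + d/2). It illustrates the general technique only, not the code changed in this release; `applyRope`, the `INTERLEAVED` label, and the base of 10000 are assumptions.

```kotlin
import kotlin.math.cos
import kotlin.math.pow
import kotlin.math.sin

// SPLIT_HALF is the name used in the fix above; INTERLEAVED is an assumed label
// for the adjacent-pair convention.
enum class RopePairing { INTERLEAVED, SPLIT_HALF }

// Rotates one head vector by position-dependent angles.
// Only the choice of which two dimensions form a pair differs between modes.
fun applyRope(
    x: FloatArray,
    position: Int,
    pairing: RopePairing,
    thetaBase: Double = 10000.0, // assumed default base
): FloatArray {
    require(x.size % 2 == 0) { "head dimension must be even" }
    val half = x.size / 2
    val out = x.copyOf()
    for (i in 0 until half) {
        val angle = position * thetaBase.pow(-2.0 * i / x.size)
        val (a, b) = when (pairing) {
            RopePairing.INTERLEAVED -> (2 * i) to (2 * i + 1)
            RopePairing.SPLIT_HALF -> i to (i + half)
        }
        val c = cos(angle).toFloat()
        val s = sin(angle).toFloat()
        out[a] = x[a] * c - x[b] * s
        out[b] = x[a] * s + x[b] * c
    }
    return out
}

fun main() {
    val x = floatArrayOf(1f, 0f, 0f, 0f)
    // The same input lands in different slots depending on the pairing.
    println(applyRope(x, position = 1, pairing = RopePairing.INTERLEAVED).toList())
    println(applyRope(x, position = 1, pairing = RopePairing.SPLIT_HALF).toList())
}
```

Using the wrong pairing for a checkpoint still produces finite numbers, which is why this class of bug tends to show up as degraded output rather than as a crash.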

Build / version

- VERSION_NAME 0.23.1 → 0.23.2; skainet pin stays at 0.23.1.
- New :llm-inference:voxtral module surfaces in the API dump.
- llm-performance JVM benchmark drops the legacy LlamaRuntime adapter
  (#4999ae5).
- Public API dumps refreshed via apiDump (#40200da).

Known followups

- Recover the previous ~2 t/s baseline on Llama Q8: either give the DSL
  first-class Q4/Q8 DTypes so linearProject can dispatch the SIMD
  kernel directly, or push that selection deeper into ops.matmul so
  the per-call transpose disappears.
- Bisect the residual gap between the 0.37 t/s eager path on this
  branch and the 2 t/s seen earlier on the same eager stack — skainet
  is still pinned at 0.23.1, so the regression isn't an upstream
  backend bump.

May 5, 2026
6eec93a

0.23.1

SKaiNET-transformers 0.23.1 — version-aligned with SKaiNET 0.23.1.

Highlights

- Apertus end-to-end. Real-GGUF loading on top of skainet 0.23.x's
  block-major Q4_K TensorData wiring, routed through OptimizedLLMRuntime
  + apertusNetwork(). Chat template, tool calling, and integration tests
  against Apertus-8B-Q4_K_S. See APERTUS_ROLLOUT.md.
- Gemma 4 chat-model JVM facade (Gemma4ChatModel) for embedded text-only
  deployments; close() propagates to the mmap arena; PLE mmap path now
  consumes upstream loadTensorStorageMapped.
- Multi-id EOS / stop-token support in the chat layer.
- Tokenizer auto-detect for SentencePiece in fromTokenizerJson.
- New end-to-end smoke test in llm-test/llm-test-java that wires LEAF
  (mdbr-leaf-mt via KBertJava) and Llama 3.2-1B (KLlamaJava) in one JVM,
  gated on env vars / cache fallbacks.
- Apertus tool calling as a first-class family alongside Llama 3, Gemma 4,
  Qwen, and ChatML/Hermes.
- kllama-cli + skainet-cli shadow-jar ServiceLoader fix-up so the
  priority-100 skainet-backend-native-cpu provider is picked up at runtime.
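
The multi-id EOS / stop-token item above amounts to checking each sampled token against a set of ids rather than a single EOS id. A minimal sketch of that idea follows; `collectUntilStop` and the token ids are hypothetical and do not reflect the chat layer's actual API.

```kotlin
// Hypothetical helper: pulls tokens until any stop id appears or the cap is hit.
// The stop token itself is not included in the result.
fun collectUntilStop(
    nextToken: () -> Int,
    stopIds: Set<Int>,
    maxTokens: Int,
): List<Int> {
    val out = mutableListOf<Int>()
    repeat(maxTokens) {
        val token = nextToken()
        if (token in stopIds) return out
        out.add(token)
    }
    return out
}

fun main() {
    // Made-up token stream and stop ids, for illustration only.
    val stream = listOf(5, 9, 7, 3).iterator()
    val tokens = collectUntilStop({ stream.next() }, stopIds = setOf(2, 7), maxTokens = 16)
    println(tokens) // prints [5, 9]
}
```

A set lookup keeps the per-token check cheap while letting one model family declare several terminators.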

Fixes

- fix(apertus): force-dequant token_embd under NATIVE_OPTIMIZED.
- fix(tokenizer): auto-detect SentencePiece marker in fromTokenizerJson.
- fix(gemma4): produce coherent text on real SafeTensors checkpoint.
- fix(apertus): route through OptimizedLLMRuntime + apertusNetwork().

Build / version

- VERSION_NAME 0.21.1 → 0.23.1; skainet pin 0.23.0 → 0.23.1.
- llm-test/llm-test-java maxHeapSize 8g → 16g (Llama 3.2-1B + LEAF in one JVM).
- No 0.22.x transformers release was tagged; the version line jumps to
  re-sync with the engine.

See CHANGELOG.md for the full list of changes.

May 4, 2026
72e2e66

0.21.1

Release 0.21.1

Hotfix re-publish of 0.21.0 with missing POM_NAME for the apertus,
voxtral, and llm-performance modules — the 0.21.0 publish run failed
Sonatype Central Portal validation with 'Project name is missing'
on every publication from those three modules.

See PR #86.

Apr 29, 2026
d3b54d2

0.21.0

SKaiNET-transformers 0.21.0

Mirrors SKaiNET 0.21.0. Highlights:

- SKaiNET 0.21.0 dependency: Panama Vector FP32 matmul kernel auto-discovered
  via ServiceLoader, ScratchPool SPI for runtime workspace allocation, Q4_K
  SIMD-fused kernel + SPI, Q6_K SIMD dequant, Q4_0 partial-vec dot, canonical
  ggml layout for Q4_K/Q5_K, FP32 MemSeg arena leak fix, TensorOps.permute.
- ScratchPool wired into kllama batched-prefill attention output and the BERT
  encoder forward — pooling is opt-in via PooledExecutionContext, default
  NoopScratchPool preserves existing behavior.
- First-class Java surface for Llama tool calling: KLlamaJava + KLlamaSession
  + JavaTool + JavaTools.definition + JavaAgentLoop, exercised end-to-end by
  llm-test:llm-test-java and llm-apps:kllama-java-sample.
- Removed deprecated-runtime CLIs: kqwen-only, kapertus-cli, kvoxtral-cli.
  Qwen now goes through skainet-cli or kllama-cli (same tensor layout).
- Antora docs site populated with Divio quadrants — Getting Started (Kotlin
  and Java), Tool Calling (generic + Llama 3 family), Embeddings, Smoke
  Tests, plus how-to and explanation pages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Apr 29, 2026
ce8b9ee

0.16.0

version 0.16.0

Mar 9, 2026
4dfb771

v0.16.0

Release version 0.16.0

Mar 9, 2026
020a474

0.3.0

version 0.3.0

Mar 8, 2026
0f6415f
