Tags: SKaiNET-developers/SKaiNET-transformers
SKaiNET-transformers 0.23.4: BOM coverage gap fixed; docs corrected; BOM internals auto-discover.
Transformers-only release on the 0.23.x line. No SKaiNET engine bump
in this version; the `sk.ainet:skainet-bom` pin in
`gradle/libs.versions.toml` stays at 0.23.1.
Highlights
- BOM coverage gap. `:llm-inference:apertus` and
`:llm-inference:voxtral` apply `com.vanniktech.maven.publish` and
ship to Maven Central, but were missing from
`skainet-transformers-bom`'s constraints. Consumers who imported
the BOM and pulled either of these artifacts didn't get version
alignment for them. Both are now constrained.
- Wrong artifact IDs in the README and tutorials. The "Current
release" snippet in README.md and the two tutorial pages
(getting-started-java.adoc, llama3-tool-calling.adoc) showed
`sk.ainet.transformers:llm-core` / `llm-runtime-kllama` /
`llm-agent` — those are project paths, not published artifact
IDs. Real coordinates are `skainet-transformers-core`,
`skainet-transformers-runtime-kllama`,
`skainet-transformers-agent`. Anyone copy-pasting the old snippets
hit a "module not found" error. The snippets now use the BOM
pattern, so future version bumps only need to touch one line, and
the Maven snippet now uses the `-jvm` artifact ID suffix that Maven
needs for KMP artifacts.
- BOM internals: auto-discovery via a buildSrc convention plugin.
The `bomModules` list in `llm-bom/build.gradle.kts` is no longer
hand-maintained. A new `sk.ainet.transformers.bom-coverage`
plugin (`buildSrc/`) iterates `rootProject.subprojects`, picks up
every sibling that applies `com.vanniktech.maven.publish`, and
adds it as an `api` constraint on the BOM. The only manual input
left is the exclusion list (currently just `:llm-performance` —
benchmarks, not part of the consumer surface). The BOM is now
coherent by construction: a published module can no longer be
missing from it or drift out of sync with it, which is why the
previous `verifyBomCoverage` drift-guard task was removed.
- llm-test-java now consumes SKaiNET through the local BOM. The
three `sk.ainet.core:*` deps in `llm-test/llm-test-java/build.gradle.kts`
are version-less and pinned through `platform(project(":llm-bom"))`,
so the BOM is exercised inside this build itself. A regression in
the BOM's constraints fails the local build instead of leaking
out to a published artifact.
- Removed dead `allprojects { group = "sk.ainet.llm" }` from the
root build. The published group has always been
`sk.ainet.transformers` (sourced from `gradle.properties`); the
override was being overridden in turn by vanniktech at publish
time. The in-memory project group now matches the published
group, removing a footgun for anyone resolving internal modules
by GAV.
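As a rough illustration of the BOM pattern that the corrected snippets move to, a consuming Gradle build might look like the sketch below. The group, BOM name, and artifact IDs are the ones named in these notes; the choice of the `implementation` configuration and the overall layout are assumptions, not a copy of the README snippet.

```kotlin
// build.gradle.kts of a consuming JVM project (illustrative sketch).
dependencies {
    // Import the BOM once; this is the only line that carries a version.
    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.23.4"))

    // Version-less coordinates: versions come from the BOM's constraints.
    implementation("sk.ainet.transformers:skainet-transformers-core")
    implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
    implementation("sk.ainet.transformers:skainet-transformers-agent")
}
```

With this shape, a later upgrade only changes the version on the `platform(...)` line.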
Behavior
- POM contents for `skainet-transformers-bom` are bit-for-bit
equivalent to a hand-maintained BOM with `:llm-inference:apertus`
and `:llm-inference:voxtral` added — same set of constrained
modules, alphabetical ordering in the generated POM (Maven
dependency-management is order-independent).
- Configuration cache: clean. `--configuration-cache` stores on
first run and reuses on subsequent runs.
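The auto-discovery described under Highlights could be sketched roughly as follows. This is not the project's actual plugin source: the class name and file location are hypothetical, and the sketch assumes the BOM project applies `java-platform` so that an `api` configuration exists.

```kotlin
// buildSrc/src/main/kotlin/BomCoveragePlugin.kt (hypothetical name and location)
import org.gradle.api.Plugin
import org.gradle.api.Project

class BomCoveragePlugin : Plugin<Project> {
    override fun apply(project: Project) {
        // The one manual input: modules that publish but are not part of
        // the consumer surface.
        val excluded = setOf(":llm-performance")

        project.rootProject.subprojects.forEach { sibling ->
            // Skip the BOM project itself and anything explicitly excluded.
            if (sibling.path == project.path || sibling.path in excluded) return@forEach

            // The callback runs only for siblings that apply the publish plugin,
            // regardless of whether it is applied before or after this point.
            sibling.pluginManager.withPlugin("com.vanniktech.maven.publish") {
                project.dependencies.constraints.add(
                    "api",
                    project.dependencies.project(mapOf("path" to sibling.path)),
                )
            }
        }
    }
}
```

Keying on the publish plugin rather than on a module list is what makes the constraint set follow the set of published modules automatically.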
Notes
- `gradle/libs.versions.toml` keeps `skainet = "0.23.1"` — the
CHANGELOG narrative claim of "version-aligned with SKaiNET X.Y.Z"
has been drifting from the actual engine pin since 0.23.2 and
this release does not fix that drift. Worth addressing in a
later release that picks up an engine bump.
SKaiNET-transformers 0.23.3: prefill progress callback for AgentLoop.
Highlights
- Prefill progress visibility. generateUntilStop gains an optional
onPrefill: ((Int, Int) -> Unit)? parameter that fires once per prompt
token during the autoregressive prefill loop, with (done, total)
where `done` is 1-based and `total` is `prompt.size`. Plumbed through
both AgentLoop.run and AgentLoop.runWithEncoder as a new
default-no-op AgentListener.onPrefillProgress(done, total) method.
Why this matters: prefill in 0.23.x is autoregressive, with one
forward() per prompt token (the comment on generateUntilStop
documents the forwardBatched correctness regression we reverted).
On a CPU-only runtime with a 300-token prompt the first onToken
lands tens of seconds to minutes after the agent loop starts; UIs
previously had no way to surface that work was happening, so the
loop appeared hung. The new callback lets a UI show e.g.
"prefill: 32/282 (11%)" instead of dead silence.
Backwards compatible: the new parameter and interface method
default to null/no-op, so existing AgentListener implementations
and callers compile and behave unchanged.
Tests
- generateUntilStopReportsPrefillProgressForEachPromptToken pins the
contract: one (done, total) pair per prompt token, in order, with
done 1-based and total = prompt.size.
- generateUntilStopWithEmptyPromptDoesNotInvokePrefillCallback pins
the empty-prompt edge case (callback must not fire).
Build / version
- VERSION_NAME 0.23.2 → 0.23.3; skainet pin stays at 0.23.1.
Docs
- CHANGELOG: 0.23.3 entry added; backfilled the missing 0.23.2 entry
covering DSL-path swaps, tokenizer unification, Llama 3 fenced
tool-call parser fix, Qwen3 NEOX RoPE pairing, and QK-norm
RMSNorm-eps wiring.
- README: version coordinates 0.23.1 → 0.23.3; "What's new" section
refreshed to lead with 0.23.3 and recap 0.23.2 / 0.23.1 below.
Known followups
- Same Llama Q8 perf gap from 0.23.2 stays open: give the DSL
first-class Q4/Q8 DTypes so linearProject dispatches SIMD without
the per-call ops.transpose tax, or push that selection deeper into
ops.matmul.
- forwardBatched parity: the prefill speedup left on the table behind
the autoregressive fallback. Once forwardBatched matches
autoregressive logits, the new onPrefill callback could fire
per-batch instead of per-token (with appropriate API tweak) for a
5–10× prefill cost reduction.
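The 0.23.3 prefill callback contract (one `(done, total)` pair per prompt token, `done` 1-based, no call for an empty prompt) can be illustrated with a small self-contained sketch. The `prefill` function below is a hypothetical stand-in for the prefill phase of `generateUntilStop`, whose full signature is not shown in these notes, and `formatPrefill` is an illustrative UI helper.

```kotlin
// Hypothetical stand-in for the autoregressive prefill loop: one step per
// prompt token, reporting progress through an optional callback.
fun prefill(prompt: IntArray, onPrefill: ((Int, Int) -> Unit)? = null) {
    val total = prompt.size
    for (i in prompt.indices) {
        // A real runtime would run one forward() for prompt[i] here.
        onPrefill?.invoke(i + 1, total) // done is 1-based
    }
}

// Illustrative formatter for a progress line; integer percentage, rounded down.
fun formatPrefill(done: Int, total: Int): String =
    "prefill: $done/$total (${done * 100 / total}%)"

fun main() {
    val prompt = IntArray(4) { it } // four dummy token ids
    prefill(prompt) { done, total -> println(formatPrefill(done, total)) }
}
```

Because the callback defaults to `null`, callers that do not care about progress can keep calling `prefill(prompt)` unchanged, which mirrors the backwards-compatibility note above.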
SKaiNET-transformers 0.23.2: DSL swap-out for Llama/Qwen runners, GPU stub cleanup, Llama 3 tool-calling robustness.
Highlights
- DSL inference path. The kllama CLI's Qwen GGUF (#1bacb56) and
Llama GGUF + SafeTensors (#d519eb2) branches, the kllama-native
(#35aac6b) and kllama-wasm (#8ffd459) browser CLIs, the KLlamaJava
facade (#e4b8b66), and the skainet-cli LLaMA/Qwen branch (#4219088)
all run through DecoderGgufWeightLoader →
LlamaNetworkLoader.fromWeights → OptimizedLLMRuntime DIRECT. Pinned
by QwenDslLegacyParityTest (closes #114).
- Native-quantized DSL entry point. DecoderGgufMemSegConverter
(#5847330) wraps Q4_0/Q8_0 GGUF tensors as
Q4/Q8MemorySegmentTensorData with logical [out, in] shapes for the
SIMD quant matmul kernels; K-quants dequant to FP32; token_embd
dequantizes regardless of quant type so Embedding.gather sees real
floats.
- Shared decoder body. llm-core gained a shared decoder transformer
body builder + DecoderModelMetadata (#61488de); Llama/Qwen/Voxtral
NetworkDef collapsed onto it (#5eb18fc); generic loaders renamed
Llama* → Decoder* (#a2758a7).
- llm-core tokenizer alignment. GGUF tokenizer load routes through
upstream sk.ainet.io.tokenizer (closes #52); SentencePiece decorator
for Gemma-style chat models (#e5738a9); fromGgufSource /
fromTokenizerJsonString (#864186c); Qwen / GPT-2 BPE GGUFs route to
upstream byte-level BPE (#bc7c70c).
- Llama 3 tool calling robustness. Markdown code fences around the
JSON tool call (```json ... ``` / ``` ... ```) are now peeled by
Llama31ToolCallParserStrategy (#edb366c). This fixes silently-missed
calls on Llama 3.2 1B, which wraps its JSON despite the bare-JSON
prompt instruction. ToolCallingDemo prints the rendered prompt,
tools list, raw assistant output, and final conversation (#5c3b9fa)
for debuggability.
- GPU stub cleanup. GpuAttentionBackend, GpuTensorBridge, and the
createGpuBridge / createMetalContext / createMlxContext
expect-actual chains were placeholders that always fell back to
CPU. Deleted; the native benchmark scenario was renamed
native-cpu-throughput (#cbc5cc6).
- Module cleanups. :llm-runtime:kqwen deleted (#db1fba8); Qwen now
shares the kllama runtime via the DSL swap.
LlamaIngestionBlocking.kt removed (#26a0fed); the Java facade went
DSL.
- Docs. End-to-end Llama 3 tool-calling walkthrough for app
integrators (#cea3173): dependency, KLlamaJava.loadGGUF, custom
Tool, ChatSession + AgentLoop, AgentListener observability, parser
fence note. Pre-existing format-internals reference preserved.
- Smoke. Llama-3.2-1B-Instruct entry pinned with a tool-calling
assertion (#1e7af50).
Fixes
- fix(tool-calling): tolerate markdown code fences around Llama 3
JSON.
- fix(kllama-cli): route Llama GGUF/SafeTensors back to eager
LlamaRuntime for now. The DSL Q4/Q8 path is functionally correct but
pays a per-linearProject ops.transpose tax on packed Q4/Q8 weights
(the DSL doesn't yet have first-class Q4/Q8 DTypes). Measured
0.24 t/s vs ~0.37 t/s on the eager path on
Llama-3.2-1B-Instruct-Q8; Qwen GGUF stays on DSL. Tracked as a perf
followup.
- fix(llama): inject logical 2D shape and dequant token_embd in the
DSL converter (now Qwen-only after the Llama revert above).
- fix(qwen): NEOX (SPLIT_HALF) RoPE pairing for Qwen3 GGUFs.
- fix(transformer): thread metadata RMSNorm eps through QK-norm.
Build / version
- VERSION_NAME 0.23.1 → 0.23.2; skainet pin stays at 0.23.1.
- New :llm-inference:voxtral module surfaces in the API dump.
- llm-performance JVM benchmark drops the legacy LlamaRuntime adapter
(#4999ae5).
- Public API dumps refreshed via apiDump (#40200da).
Known followups
- Recover the previous ~2 t/s baseline on Llama Q8: either give the
DSL first-class Q4/Q8 DTypes so linearProject can dispatch the SIMD
kernel directly, or push that selection deeper into ops.matmul so
the per-call transpose disappears.
- Bisect the residual gap between the 0.37 t/s eager path on this
branch and the 2 t/s seen earlier on the same eager stack. skainet
is still pinned at 0.23.1, so the regression isn't an upstream
backend bump.
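The fence-peeling behaviour described for the 0.23.2 tool-call parser can be approximated with a short sketch. This is not the code of Llama31ToolCallParserStrategy; `peelCodeFence` is a hypothetical helper that only handles the simple case of a single fence wrapping the whole reply, with the opening fence on its own line.

```kotlin
// Three backticks, built at runtime so the literal does not appear in source.
private val FENCE = "`".repeat(3)

// Matches: opening fence, optional language tag (e.g. "json"), a newline,
// the payload, and a closing fence. DOT_MATCHES_ALL lets the payload span lines.
private val FENCED_BLOCK = Regex(
    FENCE + "[A-Za-z0-9_-]*[ \\t]*\\r?\\n(.*?)\\r?\\n?" + FENCE,
    RegexOption.DOT_MATCHES_ALL,
)

// Hypothetical helper: returns the payload if the whole reply is one fenced
// block, otherwise returns the trimmed reply unchanged.
fun peelCodeFence(raw: String): String {
    val trimmed = raw.trim()
    val match = FENCED_BLOCK.matchEntire(trimmed) ?: return trimmed
    return match.groupValues[1].trim()
}

fun main() {
    val reply = FENCE + "json\n{\"name\": \"get_weather\"}\n" + FENCE
    println(peelCodeFence(reply))
}
```

Peeling before JSON parsing keeps the parser itself unchanged: bare JSON passes through untouched, and fenced JSON is reduced to the same bare form.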
SKaiNET-transformers 0.23.1: version-aligned with SKaiNET 0.23.1.
Highlights
- Apertus end-to-end. Real-GGUF loading on top of skainet 0.23.x's
block-major Q4_K TensorData wiring, routed through
OptimizedLLMRuntime + apertusNetwork(). Chat template, tool calling,
and integration tests against Apertus-8B-Q4_K_S. See
APERTUS_ROLLOUT.md.
- Gemma 4 chat-model JVM facade (Gemma4ChatModel) for embedded
text-only deployments; close() propagates to the mmap arena; PLE
mmap path now consumes upstream loadTensorStorageMapped.
- Multi-id EOS / stop-token support in the chat layer.
- Tokenizer auto-detect for SentencePiece in fromTokenizerJson.
- New end-to-end smoke test in llm-test/llm-test-java that wires LEAF
(mdbr-leaf-mt via KBertJava) and Llama 3.2-1B (KLlamaJava) in one
JVM, gated on env vars / cache fallbacks.
- Apertus tool calling as a first-class family alongside Llama 3,
Gemma 4, Qwen, and ChatML/Hermes.
- kllama-cli + skainet-cli shadow-jar ServiceLoader fix-up so the
priority-100 skainet-backend-native-cpu provider is picked up at
runtime.
Fixes
- fix(apertus): force-dequant token_embd under NATIVE_OPTIMIZED.
- fix(tokenizer): auto-detect SentencePiece marker in
fromTokenizerJson.
- fix(gemma4): produce coherent text on real SafeTensors checkpoint.
- fix(apertus): route through OptimizedLLMRuntime + apertusNetwork().
Build / version
- VERSION_NAME 0.21.1 → 0.23.1; skainet pin 0.23.0 → 0.23.1.
- llm-test/llm-test-java maxHeapSize 8g → 16g (Llama 3.2-1B + LEAF in
one JVM).
- No 0.22.x transformers release was tagged; the version line jumps
to re-sync with the engine.
See CHANGELOG.md for the full list of changes.
Release 0.21.1
Hotfix re-publish of 0.21.0 with missing POM_NAME for the apertus,
voxtral, and llm-performance modules. The 0.21.0 publish run failed
Sonatype Central Portal validation with 'Project name is missing' on
every publication from those three modules. See PR #86.
SKaiNET-transformers 0.21.0
Mirrors SKaiNET 0.21.0.
Highlights:
- SKaiNET 0.21.0 dependency: Panama Vector FP32 matmul kernel
auto-discovered via ServiceLoader, ScratchPool SPI for runtime
workspace allocation, Q4_K SIMD-fused kernel + SPI, Q6_K SIMD
dequant, Q4_0 partial-vec dot, canonical ggml layout for Q4_K/Q5_K,
FP32 MemSeg arena leak fix, TensorOps.permute.
- ScratchPool wired into kllama batched-prefill attention output and
the BERT encoder forward; pooling is opt-in via
PooledExecutionContext, default NoopScratchPool preserves existing
behavior.
- First-class Java surface for Llama tool calling: KLlamaJava +
KLlamaSession + JavaTool + JavaTools.definition + JavaAgentLoop,
exercised end-to-end by llm-test:llm-test-java and
llm-apps:kllama-java-sample.
- Removed deprecated-runtime CLIs: kqwen-only, kapertus-cli,
kvoxtral-cli. Qwen now goes through skainet-cli or kllama-cli (same
tensor layout).
- Antora docs site populated with Divio quadrants: Getting Started
(Kotlin and Java), Tool Calling (generic + Llama 3 family),
Embeddings, Smoke Tests, plus how-to and explanation pages.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>