fix(tokenizer): consume upstream tokenizers, retire forked impl (#52) by michalharakal · Pull Request #124 · SKaiNET-developers/SKaiNET-transformers

michalharakal · 2026-05-04T18:05:18Z

Closes #52.

Summary

The local `llm-core` fork of byte-level BPE in `GGUFTokenizer.encodeBPE()` is broken (no byte→unicode map, score-ranked merges instead of merge-rank, no special-token atomic splitting). For Qwen3 this surfaces as `Ċ` artifacts in place of `\n` and stray `!`-wrapped JSON in tool-call emissions, breaking the agent loop. Upstream PR SKaiNET#465 fixed this in `skainet-io-core` (consumed here as `sk.ainet.core:skainet-io-core:0.23.1`, already on the classpath).

This PR retires the fork and consumes upstream — the architectural shape the user asked for: tokenizer infrastructure lives upstream as foundational IO, downstream holds only model-specific handlers (chat templates, tool-call parsers, the SentencePiece special-token decorator).

Changes

`TokenizerFactory.fromGgufFields(modelInfo.fields)` — new entry point that delegates to upstream `sk.ainet.io.tokenizer.TokenizerFactory.fromGguf` and wraps the result in an `UpstreamTokenizerAdapter` that bridges the local interface (`decode(Int)`, non-null bos/eos) to upstream's nullable design.
`SentencePieceSpecialTokens` decorator (model-specific handler) — wraps upstream `SentencePieceTokenizer` to add atomic special-token splitting for chat-template markers (``, `<|turn>`, `<turn|>`). Required because upstream `SentencePieceTokenizer.encode()` does pure SP — the gap that motivated this decorator. Also patches upstream's HF-JSON path gaps: reads `add_space_prefix` from the normalizer block and resolves bos/eos from `added_tokens`. Becomes redundant once upstream PR SKaiNET#595 lands and we bump skainet — deletion is a follow-up downstream PR.
`fromGgufSource` / `fromTokenizerJsonString` — convenience factory methods for callers that have a `RandomAccessSource` or raw JSON string instead of an already-parsed metadata map.
`skainet-cli/Main.kt` — switched to `fromGgufFields(modelInfo.fields)`, reusing the field map already obtained from `UnifiedModelLoader.peek`.
Legacy `TokenizerFactory` methods removed (`fromGGUF`, `fromTokenizerJson`, `fromHuggingFace`) — no in-tree caller used them after the cascade migration.

Verification

Test	Before	After
Qwen3-1.7B greedy generation	coherent text but with `Ċ` artifacts	clean output
Qwen3-1.7B tool demo	broken: `ĊĊĊ<tool_call>!{...}!</tool_call>`, agent loop terminates	full round-trip: model emits clean JSON → calculator returns 391.0 → model produces final answer "391"
`:llm-core:jvmTest`	green	green
Full `compileKotlinJvm`	green	green

Architectural notes

This PR establishes the consume-upstream pattern but does not complete the full cascade. Still ahead (separate follow-up commits/PRs):

Migrate the remaining `GGUFTokenizer.foo(...)` callers (`kllama-cli` jvm/native/wasmJs, `KLlamaJava`, `kgemma-cli`, `Gemma4ChatModel`, `JvmBenchmarkEngine`, `NativeBenchmarkEngine`) off the legacy class methods and onto `TokenizerFactory.fromGgufSource` / `fromTokenizerJsonString`.
Replace the local `Tokenizer` interface with a `typealias` to `sk.ainet.io.tokenizer.Tokenizer` (avoids re-creating the divergence trap; binary-compat preserved by the typealias). Adjust other live impls (`TokenizerImpl`, `TekkenTokenizerAdapter`, `bert/HuggingFaceTokenizer`, `SkaiNetChatModel` default) for nullable bos/eos.
Delete the forked `GGUFTokenizer.kt` and strategy classes.
Retarget `Gemma4TokenizerParityTest` to `TokenizerFactory.fromTokenizerJsonString`.

Splitting the cascade into a follow-up PR keeps this one's bisect surface minimal — the user-visible bug fix lands now, the impl deletion lands once it's been validated against the parity test fixture.

Test plan

`./gradlew :llm-apps:skainet-cli:run --args="-m Qwen3-1.7B-Q8_0.gguf -s 32 -k 0.0 'The capital of France is'"` → coherent answer
Same with `--demo --template=chatml "What is 17 * 23?"` → clean tool call, full round-trip
`./gradlew :llm-core:jvmTest` green
`./gradlew compileKotlinJvm` green across all modules
CI to verify Android, native, wasmJs targets compile

🤖 Generated with Claude Code

….tokenizer (#52) The local llm-core fork of GGUFTokenizer.encodeBPE() implements byte-level BPE incorrectly: it splits text into char-sized units, picks merges by vocab score instead of merge rank, and ignores the byte→unicode mapping required by GPT-2-style tokenizers. For Qwen3, this surfaces as `Ċ` artifacts in place of `\n` and stray `!`-wrapped JSON in tool-call emissions, which breaks the agent loop. Upstream PR SKaiNET#465 fixed this in `skainet-io-core` (consumed here as sk.ainet.core:skainet-io-core:0.23.1, already on the classpath). Rather than port the upstream patch into the fork — which perpetuates divergence — this change adds a thin entry point that routes through the upstream TokenizerFactory: TokenizerFactory.fromGgufFields(modelInfo.fields) → sk.ainet.io.tokenizer.TokenizerFactory.fromGguf(fields) → wrapped in UpstreamTokenizerAdapter to satisfy the local Tokenizer interface (decode(Int), non-null bos/eos) The skainet-cli is migrated to the new method (using `modelInfo.fields` already obtained from `UnifiedModelLoader.peek` for architecture detection, which removes one redundant file open). Verified end-to-end with Qwen3-1.7B-Q8_0.gguf: greedy generation produces coherent text and the tool-calling demo now completes a full round-trip (model → tool_call JSON → tool dispatch → result → model final answer). The legacy `fromGGUF(source)` / `fromTokenizerJson(json)` / `fromHuggingFace` methods still route to the local broken impl. Other callers (kllama-cli, kgemma, voxtral) remain on the legacy path; they migrate in a follow-up that also retires the forked impl entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…yle chat models Upstream sk.ainet.io.tokenizer.SentencePieceTokenizer.encode() does pure SentencePiece — it never treats vocab entries as atomic. That works for vanilla LLaMA but breaks chat-template models like Gemma 4 whose <bos>, <|turn>, <turn|> markers must encode as a single id rather than fragmenting into per-character byte-fallback pieces. The HF tokenizer.json factory upstream also drops two pieces of metadata Gemma needs: bos/eos resolution from the added_tokens array, and add_space_prefix=false (upstream hard-codes true). This commit adds a downstream decorator that wraps an upstream SentencePieceTokenizer and: - Performs longest-match atomic special-token splitting on encode. - Groups contiguous non-special ids on decode and delegates each run to base.decode, preserving the byte-level UTF-8 spanning that SP decode does internally (special-token boundaries always sit on UTF-8 boundaries — they're literal strings). - For the GGUF path: scans tokenizer.ggml.token_type for CONTROL (3) and USER_DEFINED (4) entries to populate the special-tokens map. - For the tokenizer.json path: reads added_tokens for the special registry plus bos/eos resolution; rebuilds the base tokenizer with add_space_prefix derived from the normalizer or tokenizer_config. This is a model-specific handler that lives downstream by design: it fills upstream gaps that affect Gemma-style models specifically. Follow-up will propose extracting a symmetric SpecialTokenSplitter decorator into upstream skainet-io-core and applying it to all three tokenizers (Qwen byte-BPE and Tekken already have inline equivalents), at which point this file becomes redundant and is deleted. Adds kotlinx-serialization-json as a direct dependency of llm-core; it was already pulled in transitively via skainet-io-core but not exposed as compile classpath. Not yet wired through TokenizerFactory — that's the next commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…romTokenizerJsonString Extends TokenizerFactory to dispatch SentencePiece-family inputs through the new SentencePieceSpecialTokens decorator (so Gemma 4 and similar chat-template SP models encode markers atomically), while continuing to route byte-level BPE through the upstream factory directly. Adds two convenience entry points for callers that don't already hold a parsed GGUF metadata map: - fromGgufSource(source: RandomAccessSource): opens StreamingGGUFReader, parses metadata, dispatches via fromGgufFields. Convenience for the standalone GGUF path used by kllama-cli, KLlamaJava, benchmarks. - fromTokenizerJsonString(json, configJson?): for the SafeTensors path where the model directory carries tokenizer.json (+ optional tokenizer_config.json). Detects model.type with a minimal substring scan to avoid double-parsing when dispatching downstream. Existing fromGgufFields(fields) keeps its signature; SP-family inputs now flow through the decorator transparently. Qwen3 smoke test still passes ("The capital of France is" → coherent Paris answer with no Ċ artifacts), confirming the BPE path is unaffected. The legacy fromGGUF / fromTokenizerJson / fromHuggingFace methods that delegated to the broken local impl are deleted in this commit; no in-tree caller uses them after the cascade migration that follows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

michalharakal · 2026-05-04T21:22:51Z

Rebased onto current develop (which has #115/#122/#123/#125/#126/#127). Conflicts were in TokenizerFactory.kt — both this PR and my earlier #115 introduced an UpstreamTokenizerAdapter, with overlapping but different intent.

Conflict resolution choices:

Took this PR's intent verbatim — the fromGgufFields / fromGgufSource / fromTokenizerJsonString API + SentencePieceSpecialTokens decorator. The legacy fromGGUF / fromTokenizerJson / fromHuggingFace methods are deleted as the commit message specified.
Kept the separate UpstreamTokenizerAdapter.kt file (added by fix(tokenizer): route Qwen / GPT-2 BPE GGUFs to upstream byte-level BPE #115) instead of inlining a duplicate inside TokenizerFactory.kt. The implementation in that file is internal class (visible across the module), so TokenizerFactory.kt references it without import.
Adopted this PR's -1 fallback policy — replaced fix(tokenizer): route Qwen / GPT-2 BPE GGUFs to upstream byte-level BPE #115's bos = 1, eos = 2 defaults with bos = -1, eos = -1 so absence of BOS/EOS surfaces explicitly. Updated UpstreamTokenizerAdapterTest accordingly (the "honour custom fallbacks" test was deleted; the "fall back to defaults" test now asserts -1).
Migrated the remaining fromGGUF(source) callers to fromGgufSource(source):
- llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt
- llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/java/KLlamaJava.kt
These callers were added by feat(kllama-cli): swap Llama GGUF + SafeTensors branches to DSL path #122 / feat(kllama-java): swap KLlamaJava facade to DSL path #123 (after this PR was originally written) and would have broken when fromGGUF was deleted.

Verified locally: :llm-core:jvmTest, :llm-runtime:kllama:jvmTest, :llm-apps:skainet-cli:build all pass.

mergeable=MERGEABLE per gh pr view. Ready for re-review / merge.

michalharakal and others added 3 commits May 4, 2026 23:18

michalharakal force-pushed the fix/52-tokenizer-upstream-gguf branch from d4e2760 to 864186c Compare May 4, 2026 21:22

michalharakal merged commit c3b0726 into develop May 4, 2026
2 checks passed

michalharakal deleted the fix/52-tokenizer-upstream-gguf branch May 4, 2026 21:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(tokenizer): consume upstream tokenizers, retire forked impl (#52)#124

fix(tokenizer): consume upstream tokenizers, retire forked impl (#52)#124
michalharakal merged 3 commits into
developfrom
fix/52-tokenizer-upstream-gguf

michalharakal commented May 4, 2026

Uh oh!

michalharakal commented May 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

michalharakal commented May 4, 2026

Summary

Changes

Verification

Architectural notes

Test plan

Uh oh!

michalharakal commented May 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant