Skip to content

fix(tokenizer): consume upstream tokenizers, retire forked impl (#52)#124

Merged
michalharakal merged 3 commits into
developfrom
fix/52-tokenizer-upstream-gguf
May 4, 2026
Merged

fix(tokenizer): consume upstream tokenizers, retire forked impl (#52)#124
michalharakal merged 3 commits into
developfrom
fix/52-tokenizer-upstream-gguf

Conversation

@michalharakal
Copy link
Copy Markdown
Contributor

Closes #52.

Summary

The local `llm-core` fork of byte-level BPE in `GGUFTokenizer.encodeBPE()` is broken (no byte→unicode map, score-ranked merges instead of merge-rank, no special-token atomic splitting). For Qwen3 this surfaces as `Ċ` artifacts in place of `\n` and stray `!`-wrapped JSON in tool-call emissions, breaking the agent loop. Upstream PR SKaiNET#465 fixed this in `skainet-io-core` (consumed here as `sk.ainet.core:skainet-io-core:0.23.1`, already on the classpath).

This PR retires the fork and consumes upstream — the architectural shape the user asked for: tokenizer infrastructure lives upstream as foundational IO, downstream holds only model-specific handlers (chat templates, tool-call parsers, the SentencePiece special-token decorator).

Changes

  1. `TokenizerFactory.fromGgufFields(modelInfo.fields)` — new entry point that delegates to upstream `sk.ainet.io.tokenizer.TokenizerFactory.fromGguf` and wraps the result in an `UpstreamTokenizerAdapter` that bridges the local interface (`decode(Int)`, non-null bos/eos) to upstream's nullable design.
  2. `SentencePieceSpecialTokens` decorator (model-specific handler) — wraps upstream `SentencePieceTokenizer` to add atomic special-token splitting for chat-template markers (``, `<|turn>`, `<turn|>`). Required because upstream `SentencePieceTokenizer.encode()` does pure SP — the gap that motivated this decorator. Also patches upstream's HF-JSON path gaps: reads `add_space_prefix` from the normalizer block and resolves bos/eos from `added_tokens`. Becomes redundant once upstream PR SKaiNET#595 lands and we bump skainet — deletion is a follow-up downstream PR.
  3. `fromGgufSource` / `fromTokenizerJsonString` — convenience factory methods for callers that have a `RandomAccessSource` or raw JSON string instead of an already-parsed metadata map.
  4. `skainet-cli/Main.kt` — switched to `fromGgufFields(modelInfo.fields)`, reusing the field map already obtained from `UnifiedModelLoader.peek`.
  5. Legacy `TokenizerFactory` methods removed (`fromGGUF`, `fromTokenizerJson`, `fromHuggingFace`) — no in-tree caller used them after the cascade migration.

Verification

Test Before After
Qwen3-1.7B greedy generation coherent text but with `Ċ` artifacts clean output
Qwen3-1.7B tool demo broken: `ĊĊĊ<tool_call>!{...}!</tool_call>`, agent loop terminates full round-trip: model emits clean JSON → calculator returns 391.0 → model produces final answer "391"
`:llm-core:jvmTest` green green
Full `compileKotlinJvm` green green

Architectural notes

This PR establishes the consume-upstream pattern but does not complete the full cascade. Still ahead (separate follow-up commits/PRs):

  • Migrate the remaining `GGUFTokenizer.foo(...)` callers (`kllama-cli` jvm/native/wasmJs, `KLlamaJava`, `kgemma-cli`, `Gemma4ChatModel`, `JvmBenchmarkEngine`, `NativeBenchmarkEngine`) off the legacy class methods and onto `TokenizerFactory.fromGgufSource` / `fromTokenizerJsonString`.
  • Replace the local `Tokenizer` interface with a `typealias` to `sk.ainet.io.tokenizer.Tokenizer` (avoids re-creating the divergence trap; binary-compat preserved by the typealias). Adjust other live impls (`TokenizerImpl`, `TekkenTokenizerAdapter`, `bert/HuggingFaceTokenizer`, `SkaiNetChatModel` default) for nullable bos/eos.
  • Delete the forked `GGUFTokenizer.kt` and strategy classes.
  • Retarget `Gemma4TokenizerParityTest` to `TokenizerFactory.fromTokenizerJsonString`.

Splitting the cascade into a follow-up PR keeps this one's bisect surface minimal — the user-visible bug fix lands now, the impl deletion lands once it's been validated against the parity test fixture.

Test plan

  • `./gradlew :llm-apps:skainet-cli:run --args="-m Qwen3-1.7B-Q8_0.gguf -s 32 -k 0.0 'The capital of France is'"` → coherent answer
  • Same with `--demo --template=chatml "What is 17 * 23?"` → clean tool call, full round-trip
  • `./gradlew :llm-core:jvmTest` green
  • `./gradlew compileKotlinJvm` green across all modules
  • CI to verify Android, native, wasmJs targets compile

🤖 Generated with Claude Code

michalharakal and others added 3 commits May 4, 2026 23:18
….tokenizer (#52)

The local llm-core fork of GGUFTokenizer.encodeBPE() implements byte-level
BPE incorrectly: it splits text into char-sized units, picks merges by vocab
score instead of merge rank, and ignores the byte→unicode mapping required
by GPT-2-style tokenizers. For Qwen3, this surfaces as `Ċ` artifacts in
place of `\n` and stray `!`-wrapped JSON in tool-call emissions, which
breaks the agent loop.

Upstream PR SKaiNET#465 fixed this in `skainet-io-core` (consumed here as
sk.ainet.core:skainet-io-core:0.23.1, already on the classpath). Rather
than port the upstream patch into the fork — which perpetuates divergence
— this change adds a thin entry point that routes through the upstream
TokenizerFactory:

  TokenizerFactory.fromGgufFields(modelInfo.fields)
  → sk.ainet.io.tokenizer.TokenizerFactory.fromGguf(fields)
  → wrapped in UpstreamTokenizerAdapter to satisfy the local Tokenizer
    interface (decode(Int), non-null bos/eos)

The skainet-cli is migrated to the new method (using `modelInfo.fields`
already obtained from `UnifiedModelLoader.peek` for architecture detection,
which removes one redundant file open). Verified end-to-end with
Qwen3-1.7B-Q8_0.gguf: greedy generation produces coherent text and the
tool-calling demo now completes a full round-trip (model → tool_call JSON
→ tool dispatch → result → model final answer).

The legacy `fromGGUF(source)` / `fromTokenizerJson(json)` / `fromHuggingFace`
methods still route to the local broken impl. Other callers (kllama-cli,
kgemma, voxtral) remain on the legacy path; they migrate in a follow-up
that also retires the forked impl entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…yle chat models

Upstream sk.ainet.io.tokenizer.SentencePieceTokenizer.encode() does pure
SentencePiece — it never treats vocab entries as atomic. That works for
vanilla LLaMA but breaks chat-template models like Gemma 4 whose <bos>,
<|turn>, <turn|> markers must encode as a single id rather than
fragmenting into per-character byte-fallback pieces. The HF
tokenizer.json factory upstream also drops two pieces of metadata Gemma
needs: bos/eos resolution from the added_tokens array, and
add_space_prefix=false (upstream hard-codes true).

This commit adds a downstream decorator that wraps an upstream
SentencePieceTokenizer and:

  - Performs longest-match atomic special-token splitting on encode.
  - Groups contiguous non-special ids on decode and delegates each run
    to base.decode, preserving the byte-level UTF-8 spanning that SP
    decode does internally (special-token boundaries always sit on
    UTF-8 boundaries — they're literal strings).
  - For the GGUF path: scans tokenizer.ggml.token_type for CONTROL (3)
    and USER_DEFINED (4) entries to populate the special-tokens map.
  - For the tokenizer.json path: reads added_tokens for the special
    registry plus bos/eos resolution; rebuilds the base tokenizer with
    add_space_prefix derived from the normalizer or tokenizer_config.

This is a model-specific handler that lives downstream by design: it
fills upstream gaps that affect Gemma-style models specifically.
Follow-up will propose extracting a symmetric SpecialTokenSplitter
decorator into upstream skainet-io-core and applying it to all three
tokenizers (Qwen byte-BPE and Tekken already have inline equivalents),
at which point this file becomes redundant and is deleted.

Adds kotlinx-serialization-json as a direct dependency of llm-core; it
was already pulled in transitively via skainet-io-core but not exposed
as compile classpath.

Not yet wired through TokenizerFactory — that's the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…romTokenizerJsonString

Extends TokenizerFactory to dispatch SentencePiece-family inputs through
the new SentencePieceSpecialTokens decorator (so Gemma 4 and similar
chat-template SP models encode markers atomically), while continuing to
route byte-level BPE through the upstream factory directly. Adds two
convenience entry points for callers that don't already hold a parsed
GGUF metadata map:

  - fromGgufSource(source: RandomAccessSource): opens StreamingGGUFReader,
    parses metadata, dispatches via fromGgufFields. Convenience for the
    standalone GGUF path used by kllama-cli, KLlamaJava, benchmarks.
  - fromTokenizerJsonString(json, configJson?): for the SafeTensors path
    where the model directory carries tokenizer.json (+ optional
    tokenizer_config.json). Detects model.type with a minimal substring
    scan to avoid double-parsing when dispatching downstream.

Existing fromGgufFields(fields) keeps its signature; SP-family inputs
now flow through the decorator transparently. Qwen3 smoke test still
passes ("The capital of France is" → coherent Paris answer with no Ċ
artifacts), confirming the BPE path is unaffected.

The legacy fromGGUF / fromTokenizerJson / fromHuggingFace methods that
delegated to the broken local impl are deleted in this commit; no
in-tree caller uses them after the cascade migration that follows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal force-pushed the fix/52-tokenizer-upstream-gguf branch from d4e2760 to 864186c Compare May 4, 2026 21:22
@michalharakal
Copy link
Copy Markdown
Contributor Author

Rebased onto current develop (which has #115/#122/#123/#125/#126/#127). Conflicts were in TokenizerFactory.kt — both this PR and my earlier #115 introduced an UpstreamTokenizerAdapter, with overlapping but different intent.

Conflict resolution choices:

  1. Took this PR's intent verbatim — the fromGgufFields / fromGgufSource / fromTokenizerJsonString API + SentencePieceSpecialTokens decorator. The legacy fromGGUF / fromTokenizerJson / fromHuggingFace methods are deleted as the commit message specified.

  2. Kept the separate UpstreamTokenizerAdapter.kt file (added by fix(tokenizer): route Qwen / GPT-2 BPE GGUFs to upstream byte-level BPE #115) instead of inlining a duplicate inside TokenizerFactory.kt. The implementation in that file is internal class (visible across the module), so TokenizerFactory.kt references it without import.

  3. Adopted this PR's -1 fallback policy — replaced fix(tokenizer): route Qwen / GPT-2 BPE GGUFs to upstream byte-level BPE #115's bos = 1, eos = 2 defaults with bos = -1, eos = -1 so absence of BOS/EOS surfaces explicitly. Updated UpstreamTokenizerAdapterTest accordingly (the "honour custom fallbacks" test was deleted; the "fall back to defaults" test now asserts -1).

  4. Migrated the remaining fromGGUF(source) callers to fromGgufSource(source):

    • llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/cli/Main.kt
    • llm-runtime/kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/java/KLlamaJava.kt

    These callers were added by feat(kllama-cli): swap Llama GGUF + SafeTensors branches to DSL path #122 / feat(kllama-java): swap KLlamaJava facade to DSL path #123 (after this PR was originally written) and would have broken when fromGGUF was deleted.

Verified locally: :llm-core:jvmTest, :llm-runtime:kllama:jvmTest, :llm-apps:skainet-cli:build all pass.

mergeable=MERGEABLE per gh pr view. Ready for re-review / merge.

@michalharakal michalharakal merged commit c3b0726 into develop May 4, 2026
2 checks passed
@michalharakal michalharakal deleted the fix/52-tokenizer-upstream-gguf branch May 4, 2026 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Byte-level BPE broken for GPT-2/Qwen models (affects both GGUF and SafeTensors)

1 participant