fix(tokenizer): consume upstream tokenizers, retire forked impl (#52)#124
Conversation
….tokenizer (#52) The local llm-core fork of GGUFTokenizer.encodeBPE() implements byte-level BPE incorrectly: it splits text into char-sized units, picks merges by vocab score instead of merge rank, and ignores the byte→unicode mapping required by GPT-2-style tokenizers. For Qwen3, this surfaces as `Ċ` artifacts in place of `\n` and stray `!`-wrapped JSON in tool-call emissions, which breaks the agent loop. Upstream PR SKaiNET#465 fixed this in `skainet-io-core` (consumed here as sk.ainet.core:skainet-io-core:0.23.1, already on the classpath). Rather than port the upstream patch into the fork — which perpetuates divergence — this change adds a thin entry point that routes through the upstream TokenizerFactory: TokenizerFactory.fromGgufFields(modelInfo.fields) → sk.ainet.io.tokenizer.TokenizerFactory.fromGguf(fields) → wrapped in UpstreamTokenizerAdapter to satisfy the local Tokenizer interface (decode(Int), non-null bos/eos) The skainet-cli is migrated to the new method (using `modelInfo.fields` already obtained from `UnifiedModelLoader.peek` for architecture detection, which removes one redundant file open). Verified end-to-end with Qwen3-1.7B-Q8_0.gguf: greedy generation produces coherent text and the tool-calling demo now completes a full round-trip (model → tool_call JSON → tool dispatch → result → model final answer). The legacy `fromGGUF(source)` / `fromTokenizerJson(json)` / `fromHuggingFace` methods still route to the local broken impl. Other callers (kllama-cli, kgemma, voxtral) remain on the legacy path; they migrate in a follow-up that also retires the forked impl entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…yle chat models
Upstream sk.ainet.io.tokenizer.SentencePieceTokenizer.encode() does pure
SentencePiece — it never treats vocab entries as atomic. That works for
vanilla LLaMA but breaks chat-template models like Gemma 4 whose <bos>,
<|turn>, <turn|> markers must encode as a single id rather than
fragmenting into per-character byte-fallback pieces. The HF
tokenizer.json factory upstream also drops two pieces of metadata Gemma
needs: bos/eos resolution from the added_tokens array, and
add_space_prefix=false (upstream hard-codes true).
This commit adds a downstream decorator that wraps an upstream
SentencePieceTokenizer and:
- Performs longest-match atomic special-token splitting on encode.
- Groups contiguous non-special ids on decode and delegates each run
to base.decode, preserving the byte-level UTF-8 spanning that SP
decode does internally (special-token boundaries always sit on
UTF-8 boundaries — they're literal strings).
- For the GGUF path: scans tokenizer.ggml.token_type for CONTROL (3)
and USER_DEFINED (4) entries to populate the special-tokens map.
- For the tokenizer.json path: reads added_tokens for the special
registry plus bos/eos resolution; rebuilds the base tokenizer with
add_space_prefix derived from the normalizer or tokenizer_config.
This is a model-specific handler that lives downstream by design: it
fills upstream gaps that affect Gemma-style models specifically.
Follow-up will propose extracting a symmetric SpecialTokenSplitter
decorator into upstream skainet-io-core and applying it to all three
tokenizers (Qwen byte-BPE and Tekken already have inline equivalents),
at which point this file becomes redundant and is deleted.
Adds kotlinx-serialization-json as a direct dependency of llm-core; it
was already pulled in transitively via skainet-io-core but not exposed
as compile classpath.
Not yet wired through TokenizerFactory — that's the next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…romTokenizerJsonString
Extends TokenizerFactory to dispatch SentencePiece-family inputs through
the new SentencePieceSpecialTokens decorator (so Gemma 4 and similar
chat-template SP models encode markers atomically), while continuing to
route byte-level BPE through the upstream factory directly. Adds two
convenience entry points for callers that don't already hold a parsed
GGUF metadata map:
- fromGgufSource(source: RandomAccessSource): opens StreamingGGUFReader,
parses metadata, dispatches via fromGgufFields. Convenience for the
standalone GGUF path used by kllama-cli, KLlamaJava, benchmarks.
- fromTokenizerJsonString(json, configJson?): for the SafeTensors path
where the model directory carries tokenizer.json (+ optional
tokenizer_config.json). Detects model.type with a minimal substring
scan to avoid double-parsing when dispatching downstream.
Existing fromGgufFields(fields) keeps its signature; SP-family inputs
now flow through the decorator transparently. Qwen3 smoke test still
passes ("The capital of France is" → coherent Paris answer with no Ċ
artifacts), confirming the BPE path is unaffected.
The legacy fromGGUF / fromTokenizerJson / fromHuggingFace methods that
delegated to the broken local impl are deleted in this commit; no
in-tree caller uses them after the cascade migration that follows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d4e2760 to
864186c
Compare
|
Rebased onto current Conflict resolution choices:
Verified locally:
|
Closes #52.
Summary
The local `llm-core` fork of byte-level BPE in `GGUFTokenizer.encodeBPE()` is broken (no byte→unicode map, score-ranked merges instead of merge-rank, no special-token atomic splitting). For Qwen3 this surfaces as `Ċ` artifacts in place of `\n` and stray `!`-wrapped JSON in tool-call emissions, breaking the agent loop. Upstream PR SKaiNET#465 fixed this in `skainet-io-core` (consumed here as `sk.ainet.core:skainet-io-core:0.23.1`, already on the classpath).
This PR retires the fork and consumes upstream — the architectural shape the user asked for: tokenizer infrastructure lives upstream as foundational IO, downstream holds only model-specific handlers (chat templates, tool-call parsers, the SentencePiece special-token decorator).
Changes
Verification
Architectural notes
This PR establishes the consume-upstream pattern but does not complete the full cascade. Still ahead (separate follow-up commits/PRs):
Splitting the cascade into a follow-up PR keeps this one's bisect surface minimal — the user-visible bug fix lands now, the impl deletion lands once it's been validated against the parity test fixture.
Test plan
🤖 Generated with Claude Code