A 200,000-token SentencePiece tokenizer (Unigram) built for the Auralis / Helix German-primary language model. It is trained for German, English, and code in one vocabulary, with byte-fallback (zero unknowns) and a built-in chat template.
If you are building a German or German/English LLM and want a ready, low-fertility 200k tokenizer instead of training your own, take this one.
German is fertility-expensive for English-tuned tokenizers: long compound words get shredded into many tokens. A vocabulary trained with German as a first-class language encodes German text more densely, which effectively means more context per window and cheaper, faster inference per sentence. This tokenizer was trained on ~15.5 GB of mixed, cleaned corpus (German + English + code) to get that density without giving up English or code.
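The density argument above can be made concrete by measuring fertility directly. The sketch below is a minimal illustration, not the project's `tokenizer_fertility.py`: the function names are hypothetical, words are counted by whitespace splitting, and a KB is assumed to be 1024 UTF-8 bytes. The report may define these differently.

```python
from typing import Callable, Iterable, Sequence

Encoder = Callable[[str], Sequence[int]]


def fertility_per_100_words(texts: Iterable[str], encode: Encoder) -> float:
    """Tokens emitted per 100 whitespace-separated words (lower is denser)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(encode(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("input contains no words")
    return 100.0 * total_tokens / total_words


def tokens_per_kb(texts: Iterable[str], encode: Encoder) -> float:
    """Tokens per 1024 UTF-8 bytes (assumed KB size), useful for code."""
    total_tokens = 0
    total_bytes = 0
    for text in texts:
        total_tokens += len(encode(text))
        total_bytes += len(text.encode("utf-8"))
    if total_bytes == 0:
        raise ValueError("input is empty")
    return 1024.0 * total_tokens / total_bytes
```

Any encoder that maps a string to a sequence of IDs can be passed in, for example `sp.EncodeAsIds` from a loaded SentencePiece model, which makes it straightforward to compare two tokenizers on the same text.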
| Domain | Fertility | Unknown rate |
|---|---|---|
| German | 133.8 tokens / 100 words | 0.000000 |
| English | 123.0 tokens / 100 words | 0.000000 |
| Code | 313.6 tokens / KB | 0.000000 |
Chat-template round-trip: byte-exact. Zero unknown tokens (byte-fallback).
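Byte-fallback is what makes the zero-unknown guarantee possible: a character with no piece of its own is spelled as one piece per UTF-8 byte, written in the `<0xNN>` form SentencePiece uses for byte pieces. The sketch below only illustrates that mechanism with hypothetical helper names. A real model falls back to bytes only for characters it lacks pieces for, so its actual output for a given string will usually differ.

```python
def byte_fallback_pieces(text: str) -> list[str]:
    """Spell text as SentencePiece-style byte pieces, one per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]


def pieces_to_text(pieces: list[str]) -> str:
    """Invert byte_fallback_pieces: parse each hex byte, then decode UTF-8."""
    raw = bytes(int(piece[3:-1], 16) for piece in pieces)
    return raw.decode("utf-8")


pieces = byte_fallback_pieces("€")
print(pieces)                  # three pieces, one per UTF-8 byte of the euro sign
print(pieces_to_text(pieces))  # the original character, recovered losslessly
```

Because every possible byte has a piece, any input can be encoded and decoded without loss, at the cost of several tokens per rare character.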
tokenizer/
  helix_v2_tokenizer.model    # the SentencePiece model: load this
  helix_v2_tokenizer.vocab    # human-readable vocab (piece \t score)
  quality_report.md           # fertility + round-trip report
  training_manifest.yaml      # full training record (args, sizes, timing)
configs/tokenizer/helix_v2.yaml         # training config
scripts/tokenizer/train_tokenizer.py    # reproduce the tokenizer
scripts/eval/tokenizer_fertility.py     # benchmark fertility vs tiktoken / Llama-3
import sentencepiece as spm
sp = spm.SentencePieceProcessor(model_file="tokenizer/helix_v2_tokenizer.model")
ids = sp.EncodeAsIds("Die Donaudampfschifffahrtsgesellschaft fährt heute nicht.")
print(len(ids), ids)
print(sp.DecodeIds(ids)) # byte-exact round-trip
print(sp.GetPieceSize()) # 200000

The first user-defined pieces are reserved for a chat template and control:
<pad> <unk> <s> </s>
<|system|> <|user|> <|assistant|> <|end|>
<think> </think>
(IDs 0-3 are pad/unk/bos/eos; the chat-template pieces follow. See training_manifest.yaml
for the full list and exact ids.)
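To show how the reserved pieces might frame a conversation, here is a minimal sketch that renders a list of messages into a single prompt string. The layout (role marker, newline, content, `<|end|>`, newline) and the `render_chat` helper are assumptions made for illustration. The template actually shipped with the tokenizer may arrange these pieces differently, so check it before relying on this format.

```python
# Role markers taken from the reserved pieces listed above.
ROLE_MARKERS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}
END = "<|end|>"


def render_chat(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render messages as text using an ASSUMED layout, not the official one."""
    parts = []
    for message in messages:
        marker = ROLE_MARKERS[message["role"]]  # KeyError on an unknown role
        parts.append(f"{marker}\n{message['content']}{END}\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append(ROLE_MARKERS["assistant"] + "\n")
    return "".join(parts)


prompt = render_chat([
    {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
    {"role": "user", "content": "Was ist ein Tokenizer?"},
])
print(prompt)
```

Since each marker is a single reserved piece, a rendered prompt spends one token per marker instead of splitting the marker text into sub-pieces.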
pip install -r requirements.txt
# point the config/CLI at your own cleaned corpus, then:
python scripts/tokenizer/train_tokenizer.py --help

Trained as a SentencePiece Unigram model with character_coverage=0.99995, byte_fallback=true, normalization_rule_name=identity, and remove_extra_whitespaces=false (whitespace is preserved, which matters for code).
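The settings listed above map onto keyword arguments of `sentencepiece.SentencePieceTrainer.train`. The following is a hedged sketch of such a call, not the contents of `train_tokenizer.py`: `corpus_path`, `model_prefix`, and both function names are hypothetical, the special-token ids follow the "IDs 0-3 are pad/unk/bos/eos" note, and the symbol list covers only the pieces named in this README. The real run may pass further options recorded in `training_manifest.yaml`.

```python
def build_trainer_kwargs(corpus_path: str, model_prefix: str) -> dict:
    """Collect the documented training settings as SentencePiece trainer kwargs."""
    return dict(
        input=corpus_path,             # hypothetical path to a cleaned corpus
        model_prefix=model_prefix,     # writes <prefix>.model and <prefix>.vocab
        model_type="unigram",
        vocab_size=200_000,
        character_coverage=0.99995,
        byte_fallback=True,            # rare characters fall back to byte pieces
        normalization_rule_name="identity",
        remove_extra_whitespaces=False,  # keep whitespace intact for code
        pad_id=0, unk_id=1, bos_id=2, eos_id=3,
        # Only the pieces named in this README; the manifest has the full list.
        user_defined_symbols=[
            "<|system|>", "<|user|>", "<|assistant|>", "<|end|>",
            "<think>", "</think>",
        ],
    )


def train(corpus_path: str, model_prefix: str) -> None:
    """Run training; sentencepiece is imported lazily, only when called."""
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**build_trainer_kwargs(corpus_path, model_prefix))
```

Keeping the arguments in a plain dictionary makes them easy to compare against the manifest before starting a long training run.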
python scripts/eval/tokenizer_fertility.py --tokenizer tokenizer/helix_v2_tokenizer.model
# optional comparisons load if installed: tiktoken (o200k/cl100k), Llama-3 via transformers

- Python 3.10+, sentencepiece (load/train), PyYAML (training config)
- Optional for the fertility benchmark: tiktoken, transformers
Apache-2.0. The tokenizer model is released for reuse; it does not contain the training corpus. Built as part of the Auralis / Helix project.