AuraIis/auralis-tokenizer

Auralis Tokenizer (Helix v2)

A 200,000-token SentencePiece tokenizer (Unigram) built for the Auralis / Helix German-primary language model. It is trained for German, English, and code in one vocabulary, with byte-fallback (zero unknowns) and a built-in chat template.

If you are building a German or German/English LLM and want a ready, low-fertility 200k tokenizer instead of training your own, take this one.

Why it exists

German is fertility-expensive for English-tuned tokenizers: long compound words get shredded into many tokens. A vocabulary trained with German as a first-class language encodes German text more densely, which effectively means more context per window and cheaper, faster inference per sentence. This tokenizer was trained on ~15.5 GB of mixed, cleaned corpus (German + English + code) to get that density without giving up English or code.

Measured fertility (lower = denser; from tokenizer/quality_report.md)

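| Corpus  | Fertility                | Unknown rate |
|---------|--------------------------|--------------|
| German  | 133.8 tokens / 100 words | 0.000000     |
| English | 123.0 tokens / 100 words | 0.000000     |
| Code    | 313.6 tokens / KB        | 0.000000     |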

Chat-template round-trip: byte-exact. Zero unknown tokens (byte-fallback).
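
The two fertility units in the table can be computed along the following lines. This is a minimal sketch, not the code behind quality_report.md: the function names are hypothetical, words are assumed to be whitespace-separated, and a KB is assumed to be 1024 UTF-8 bytes. The report may use different conventions.

```python
from typing import Callable, Iterable, Sequence

Encoder = Callable[[str], Sequence[int]]


def tokens_per_100_words(texts: Iterable[str], encode: Encoder) -> float:
    """Fertility for prose: tokens emitted per 100 whitespace-separated words."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(encode(text))
        total_words += len(text.split())  # assumption: words = whitespace split
    if total_words == 0:
        raise ValueError("input contains no words")
    return 100.0 * total_tokens / total_words


def tokens_per_kb(texts: Iterable[str], encode: Encoder) -> float:
    """Fertility for code: tokens emitted per 1024 bytes of UTF-8 text."""
    total_tokens = 0
    total_bytes = 0
    for text in texts:
        total_tokens += len(encode(text))
        total_bytes += len(text.encode("utf-8"))  # assumption: 1 KB = 1024 bytes
    if total_bytes == 0:
        raise ValueError("input is empty")
    return 1024.0 * total_tokens / total_bytes
```

With the model loaded as `sp`, passing `encode=sp.EncodeAsIds` measures this tokenizer; any other callable that maps a string to a sequence of ids can be measured the same way.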

What's in here

tokenizer/
  helix_v2_tokenizer.model     # the SentencePiece model — load this
  helix_v2_tokenizer.vocab     # human-readable vocab (piece \t score)
  quality_report.md            # fertility + round-trip report
  training_manifest.yaml       # full training record (args, sizes, timing)
configs/tokenizer/helix_v2.yaml          # training config
scripts/tokenizer/train_tokenizer.py     # reproduce the tokenizer
scripts/eval/tokenizer_fertility.py      # benchmark fertility vs tiktoken / Llama-3

Usage

import sentencepiece as spm

# Load the trained model shipped in this repository.
sp = spm.SentencePieceProcessor(model_file="tokenizer/helix_v2_tokenizer.model")

text = "Die Donaudampfschifffahrtsgesellschaft fährt heute nicht."
ids = sp.EncodeAsIds(text)
print(len(ids), ids)
print(sp.DecodeIds(ids))            # byte-exact round-trip: prints `text` unchanged
print(sp.GetPieceSize())            # 200000
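
To check the round-trip on your own data, especially whitespace-heavy code, something like the following sketch may help. The helper name `round_trip_failures` is hypothetical, and `Utf8ByteProcessor` is only a stand-in that exposes the same two methods so the snippet runs without the model file.

```python
def round_trip_failures(processor, samples):
    """Return every sample whose decode(encode(sample)) differs from the sample."""
    failures = []
    for text in samples:
        ids = processor.EncodeAsIds(text)
        if processor.DecodeIds(ids) != text:
            failures.append(text)
    return failures


class Utf8ByteProcessor:
    """Stand-in for SentencePieceProcessor: one id per UTF-8 byte."""

    def EncodeAsIds(self, text):
        return list(text.encode("utf-8"))

    def DecodeIds(self, ids):
        return bytes(ids).decode("utf-8")


samples = [
    "def f(x):\n\treturn  x  # tab indent, double spaces",
    "Straße  mit   mehreren    Leerzeichen",
]
print(round_trip_failures(Utf8ByteProcessor(), samples))
```

Replacing `Utf8ByteProcessor()` with the loaded `sp` object runs the same check against the real tokenizer; an empty list means every sample survived unchanged.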

Special tokens (chat template)

The first pieces in the vocabulary are reserved for control tokens and the chat template:

<pad> <unk> <s> </s>
<|system|> <|user|> <|assistant|> <|end|>
<think> </think>

(IDs 0-3 are pad/unk/bos/eos; the chat-template pieces follow. See training_manifest.yaml for the full list and exact ids.)
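
As an illustration of how these pieces might be combined into a prompt, here is a minimal sketch. The layout is an assumption, not the project's documented template: each turn is written as role token, content, then `<|end|>`, with no separators, and `render_chat` is a hypothetical helper. Check the project's own template before relying on it.

```python
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}
END_TOKEN = "<|end|>"


def render_chat(messages, add_generation_prompt=True):
    """Join chat turns into one string (assumed layout: role token + content + <|end|>)."""
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ROLE_TOKENS:
            raise ValueError(f"unknown role: {role!r}")
        parts.append(ROLE_TOKENS[role] + message["content"] + END_TOKEN)
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append(ROLE_TOKENS["assistant"])
    return "".join(parts)


prompt = render_chat([
    {"role": "system", "content": "Antworte knapp."},
    {"role": "user", "content": "Was ist ein Tokenizer?"},
])
print(prompt)
```

With the model loaded, `sp.PieceToId("<|user|>")` returns the id of a single piece, which is one way to confirm the ids listed in training_manifest.yaml.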

Reproduce it

pip install -r requirements.txt
# point the config/CLI at your own cleaned corpus, then:
python scripts/tokenizer/train_tokenizer.py --help

The tokenizer was trained as a SentencePiece Unigram model with character_coverage=0.99995, byte_fallback=true, normalization_rule_name=identity, and remove_extra_whitespaces=false (whitespace is preserved, which matters for code).
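
For orientation, the settings above map onto SentencePiece trainer arguments roughly as follows. This is a sketch under assumptions, not the contents of train_tokenizer.py or helix_v2.yaml: the helper names are hypothetical, the control-token ids follow the order given in the special-tokens section, and the authoritative values are in training_manifest.yaml.

```python
def build_trainer_args(corpus_path: str, model_prefix: str) -> dict:
    """Collect keyword arguments for spm.SentencePieceTrainer.train."""
    return {
        "input": corpus_path,
        "model_prefix": model_prefix,
        "model_type": "unigram",
        "vocab_size": 200_000,
        "character_coverage": 0.99995,
        "byte_fallback": True,
        "normalization_rule_name": "identity",
        "remove_extra_whitespaces": False,
        # Assumption: ids 0-3 are pad/unk/bos/eos, in that order.
        "pad_id": 0,
        "unk_id": 1,
        "bos_id": 2,
        "eos_id": 3,
        # Assumption: chat-template pieces are registered as user-defined symbols.
        "user_defined_symbols": [
            "<|system|>", "<|user|>", "<|assistant|>", "<|end|>",
            "<think>", "</think>",
        ],
    }


def train(corpus_path: str, model_prefix: str) -> None:
    """Run training; sentencepiece is imported lazily so the args can be inspected without it."""
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**build_trainer_args(corpus_path, model_prefix))
```

Calling `train("corpus.txt", "my_tokenizer")` would write `my_tokenizer.model` and `my_tokenizer.vocab`; both file names here are placeholders.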

Benchmark fertility against other tokenizers

python scripts/eval/tokenizer_fertility.py --tokenizer tokenizer/helix_v2_tokenizer.model
# optional comparisons load if installed: tiktoken (o200k/cl100k), Llama-3 via transformers

Requirements

  • Python 3.10+, sentencepiece (load/train), PyYAML (training config)
  • Optional for the fertility benchmark: tiktoken, transformers

License

Apache-2.0. The tokenizer model is released for reuse; it does not contain the training corpus. Built as part of the Auralis / Helix project.
