Fix tokenizer special token handling #65

Merged

Conversation

jackzhxng
Contributor

@jackzhxng jackzhxng commented May 1, 2025

Handles special tokens before pre-tokenization and splitting in the C++ Hugging Face tokenizer.
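As a rough illustration of what "handling special tokens before pre-tokenization" can mean, the sketch below first cuts the input into segments at special-token boundaries, so that only the ordinary-text segments would later go through pre-tokenization and splitting. This is a minimal sketch under stated assumptions, not the code from this PR: `Segment` and `split_on_special_tokens` are hypothetical names, and the earliest-match, longest-on-tie rule is an assumption about one reasonable matching policy.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One piece of the input: either a special token (kept whole) or ordinary
// text that would still need pre-tokenization and splitting.
struct Segment {
  std::string text;
  bool is_special;
};

// Hypothetical helper (not the PR's actual implementation).
// Scans left to right; at each step it picks the earliest occurrence of any
// special token, preferring the longest token when two start at the same
// position. Text between matches is emitted as non-special segments.
std::vector<Segment> split_on_special_tokens(
    const std::string& input,
    const std::vector<std::string>& special_tokens) {
  std::vector<Segment> segments;
  std::size_t pos = 0;
  while (pos < input.size()) {
    std::size_t best_start = std::string::npos;
    std::size_t best_len = 0;
    for (const auto& tok : special_tokens) {
      if (tok.empty()) continue;  // an empty token would match everywhere
      const std::size_t found = input.find(tok, pos);
      if (found == std::string::npos) continue;
      if (found < best_start ||
          (found == best_start && tok.size() > best_len)) {
        best_start = found;
        best_len = tok.size();
      }
    }
    if (best_start == std::string::npos) {
      // No more special tokens: the rest is ordinary text.
      segments.push_back({input.substr(pos), false});
      break;
    }
    if (best_start > pos) {
      // Ordinary text that precedes the special token.
      segments.push_back({input.substr(pos, best_start - pos), false});
    }
    segments.push_back({input.substr(best_start, best_len), true});
    pos = best_start + best_len;
  }
  return segments;
}
```

With a prompt such as `<|im_start|>user\nHi<|im_end|>` and the special tokens `<|im_start|>` and `<|im_end|>`, this yields three segments: the two special tokens, each kept whole, and the ordinary text `user\nHi` between them. A tokenizer built this way would map the special segments directly to their IDs and run the usual pre-tokenizer only on the ordinary text.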

Llama runner output with this change, using special tokens and the JSON tokenizer:

cmake-out/examples/models/llama/llama_main --model_path qwen3-0_6B_long_context.pte --tokenizer_path ./qwen3_tokenizer/tokenizer.json --prompt="<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
Tell me a story about Julius Caesar<|im_end|>
<|im_start|>assistant
<think>

</think>



" --temperature 0.6 --seq_len 1024
>> Julius Caesar was a Roman general and statesman who played a pivotal role in the fall of the Roman Republic and the establishment of the Roman Empire. He is often remembered for his leadership, courage, and decisive actions during the Battle of the Colosseum in 44 BCE. However, there are also many stories about his life and death. Some believe that he was poisoned by his own brothers and died in 44 BCE, while others argue that his death was due to a conspiracy involving his family and political rivals. His legacy remains one of the most celebrated in Roman history.<|im_end|>
<|endoftext|>

@facebook-github-bot facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) May 1, 2025
@jackzhxng jackzhxng changed the base branch from main to jz/fix-null-eos-bos May 1, 2025 17:42
@jackzhxng jackzhxng merged commit 58a6381 into jz/fix-null-eos-bos May 1, 2025
4 checks passed
3 participants