chat template is broken when input contains emoji #210

@queue-min

Hello! 👋
I'm running into a chat template and tokenization problem when the input text contains emoji.

I decoded the tokenized input produced by tokenizer.applyChatTemplate and got the following:

<|eot_id|><|start_header_id|>user<|end_header_id|>

🥳🥳🥳<|e<|eot_id|><|start_header_id|>istant<|e<|end_header_id|>

But it should be:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

🥳🥳🥳<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Every other emoji I tried causes the same problem after encoding.
I get the same problem with a different model (Qwen3-0.6B).
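
One thing that may be worth checking (this is an assumption, not something confirmed in this issue): in both places where the decoded output goes wrong, the text is shifted by three characters (a stray `<|e` appears, and `assistant` loses `ass`), which matches the number of emoji in the input. A shift like that could appear if special-token positions were computed in one length unit and applied in another. The sketch below is plain Python and independent of the tokenizer library; `length_report` is a hypothetical helper name, and it only shows how far apart the common length units are for this input.

```python
def length_report(text: str) -> dict[str, int]:
    """Measure the same string in three common length units."""
    return {
        # Unicode code points (what Python's len() counts)
        "code_points": len(text),
        # UTF-16 code units (2 bytes each in UTF-16-LE, no BOM)
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # UTF-8 bytes
        "utf8_bytes": len(text.encode("utf-8")),
    }


report = length_report("🥳🥳🥳")
print(report)

# Gap between UTF-16 code units and code points for this input.
shift = report["utf16_units"] - report["code_points"]
print("utf16 - code points =", shift)
```

For the three-emoji input, the gap between UTF-16 code units and code points is 3, the same size as the shift in the decoded output above. That is only a hint about where to look, not a diagnosis.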

Any comments or help would be greatly appreciated :)
