[Bug]: MistralTokenizer Detokenization Issue #8627

Open
@ywang96

Description

Code to repro

from pathlib import Path

from huggingface_hub import snapshot_download
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from vllm import LLM
from vllm.sampling_params import SamplingParams


model_name = "mistralai/Pixtral-12B-2409"
mistral_models_path = Path.home().joinpath('mistral_models', 'Pixtral')
mistral_models_path.mkdir(parents=True, exist_ok=True)
snapshot_download(repo_id=model_name, allow_patterns=["tekken.json"], local_dir=mistral_models_path)
tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json") # MistralTokenizer

sampling_params = SamplingParams(temperature=0.0, max_tokens=8192)

llm = LLM(model=model_name, tokenizer_mode="mistral", enforce_eager=True)

prompt = "這個圖片是什麼"
image_url = "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"

messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": prompt}, {"type": "image_url", "image_url": {"url": image_url}}]
    },
]

outputs = llm.chat(messages, sampling_params=sampling_params)

print("vllm: " + outputs[0].outputs[0].text) # vLLM text output
print(outputs[0].outputs[0].token_ids)
print("detok: " + tokenizer.decode(outputs[0].outputs[0].token_ids[:-1])) # skip the last token_id = 2

🐛 Describe the bug

When the engine is initialized with tokenizer_mode="mistral", the text returned by vLLM contains replacement characters (�) for certain languages, such as the Chinese output below. Decoding the same token ids directly with a separately initialized MistralTokenizer produces the correct text.
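One plausible cause, not confirmed in this report, is that byte-level tokens are decoded to text one piece at a time, so a multi-byte UTF-8 character split across tokens never gets reassembled. The sketch below reproduces that symptom with the standard library only; the byte boundaries in `pieces` are hypothetical and do not correspond to real Tekken token boundaries.

```python
import codecs

text = "壮丽"
raw = text.encode("utf-8")  # 6 bytes, 3 per character

# Hypothetical token boundaries that cut through both characters.
pieces = [raw[:2], raw[2:4], raw[4:]]

# Decoding each piece on its own turns incomplete sequences into U+FFFD.
per_piece = "".join(p.decode("utf-8", errors="replace") for p in pieces)

# Joining the bytes first and decoding once recovers the original text.
joined = b"".join(pieces).decode("utf-8")

# An incremental decoder buffers partial sequences between calls,
# which is what streaming detokenization would need.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
incremental = "".join(decoder.decode(p) for p in pieces)
incremental += decoder.decode(b"", final=True)

print(repr(per_piece))
print(joined)
print(incremental)
```

If this is what happens inside the engine, decoding the full id sequence in one call (as the `detok` line in the repro does) would sidestep the problem, which matches the observed behavior.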

Output from the above code

Processed prompts: 100%|█████████████████████████████████████████████████████████████████| 1/1 [00:08<00:00,  8.11s/it, est. speed input: 346.06 toks/s, output: 28.72 toks/s]
vllm: 图片展示了一幅��丽的自然景观,主要是一条������的河流��过一片宁静的草地,周��环��着高耸的岩石����和��木。河流清��见底,水面平静,周��散布着岩石和��色��被。河流两岸的草地上点��着各种��物和��木,营造出宁静的����。背景中的岩石����高大险��,直��云��,增��了场景的宏��感。天空��朗,点��着几��云彩,暗示着一个明亮、��朗的日子。图片中没有明显的文字或人造物品,突出了自然的美丽。整体����宁静而��丽,突显了大自然的宏��和宁静。
(16442, 49395, 60288, 21552, 30841, 117293, 6693, 1174, 62326, 2713, 43090, 79088, 44885, 1625, 125192, 2499, 3087, 17624, 1232, 1156, 1191, 1232, 1156, 1146, 2713, 49563, 45605, 16842, 1191, 5984, 3087, 49395, 109042, 49554, 2713, 87781, 8736, 1625, 22675, 2854, 1180, 105080, 6046, 1149, 9883, 14370, 129695, 2713, 125632, 40801, 24934, 1173, 6693, 1129, 4300, 4901, 1145, 23942, 1320, 49563, 45605, 37202, 53760, 1136, 13594, 26800, 1625, 24777, 8682, 7210, 49554, 1625, 22675, 2854, 1180, 83632, 25120, 9883, 125632, 40801, 4300, 6046, 1191, 26416, 83777, 1141, 24443, 1320, 49563, 45605, 36987, 122890, 2713, 87781, 8736, 4445, 9079, 29532, 1128, 9883, 36283, 14164, 83777, 1141, 16307, 4300, 4901, 1145, 23942, 1625, 121634, 35747, 7059, 109042, 49554, 2713, 7020, 1155, 2854, 1180, 1320, 55022, 79088, 56245, 125632, 40801, 24934, 1173, 6693, 1129, 14370, 5368, 124592, 24934, 1187, 1625, 13334, 19528, 1146, 56212, 26985, 1132, 1625, 44290, 23295, 1187, 4836, 50381, 79088, 2713, 126928, 5596, 1159, 27934, 1320, 6434, 26095, 4343, 1180, 52678, 1625, 9079, 29532, 1128, 9883, 29538, 1632, 1181, 56212, 96037, 1625, 121028, 21552, 9883, 26535, 8560, 88518, 1749, 4343, 1180, 52678, 2713, 1866, 8390, 1320, 16442, 49395, 4392, 16685, 66876, 2713, 121873, 10443, 3405, 35747, 16307, 20353, 1625, 21949, 7059, 4836, 43090, 2713, 8350, 62326, 1320, 60896, 18807, 7020, 1155, 2854, 1180, 109042, 49554, 4262, 6693, 1174, 62326, 1625, 21949, 21802, 4836, 5368, 43090, 2713, 126928, 5596, 1159, 4300, 109042, 49554, 1320, 2)
detok: 图片展示了一幅壮丽的自然景观,主要是一条蜿蜒的河流穿过一片宁静的草地,周围环绕着高耸的岩石峭壁和树木。河流清澈见底,水面平静,周围散布着岩石和绿色植被。河流两岸的草地上点缀着各种植物和树木,营造出宁静的氛围。背景中的岩石峭壁高大险峻,直插云霄,增添了场景的宏伟感。天空晴朗,点缀着几朵云彩,暗示着一个明亮、晴朗的日子。图片中没有明显的文字或人造物品,突出了自然的美丽。整体氛围宁静而壮丽,突显了大自然的宏伟和宁静。
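To compare the two outputs programmatically rather than by eye, a small check for U+FFFD can help. This is only a sketch: `replacement_count` is a hypothetical helper, and the two sample strings are short excerpts copied from the output above.

```python
REPLACEMENT_CHAR = "\ufffd"  # what a decoder emits for invalid UTF-8 bytes


def replacement_count(text: str) -> int:
    """Return how many U+FFFD characters the text contains."""
    return text.count(REPLACEMENT_CHAR)


# Excerpts from the "vllm:" and "detok:" lines above.
vllm_sample = "图片展示了一幅\ufffd\ufffd丽的自然景观"
detok_sample = "图片展示了一幅壮丽的自然景观"

print("vllm sample:", replacement_count(vllm_sample))
print("detok sample:", replacement_count(detok_sample))
```

In the repro script, the same function could be applied to `outputs[0].outputs[0].text` and to the result of `tokenizer.decode(...)`; a nonzero count only on the first would confirm the mismatch.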


Labels: bug, stale
