[Bug]: Qwen/Qwen2.5-1.5B-Instruct generates out of vocabulary tokens #13175

Open
@AlexPiche

Description

Your current environment

>>> import vllm
INFO 02-12 20:27:04 __init__.py:190] Automatically detected platform cuda.
>>> vllm.__version__
'0.7.2'

🐛 Describe the bug

Hi,

It looks like Qwen models can generate token ids that are out of the tokenizer's vocabulary. This can be seen by feeding the generated tokens back to the model, which sometimes results in the following exception: Token id 151779 is out of vocabulary. Here is a minimal script that reproduces the error.

import vllm
from transformers import AutoTokenizer
import numpy as np

PROMPT = """
<|im_start|>system
Please reason step by step, and put your final answer within \\boxed{}.<|im_end|>
<|im_start|>user
The equation $a^7xy-a^6y-a^5x=a^4(b^4-1)$ is equivalent to the equation $(a^mx-a^n)(a^py-a^2)=a^4b^4$ for some integers $m$, $n$, and $p$.  Find $mnp$.<|im_end|>
<|im_start|>assistant
"""

if __name__ == '__main__':
    model_path = "Qwen/Qwen2.5-1.5B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    PROMPT_TOKEN_IDS = tokenizer.encode(PROMPT)

    sampling_params = vllm.SamplingParams(temperature=1.2, max_tokens=100)
    llm = vllm.LLM(model_path)

    # can we now generate tokens out of vocabulary?
    out_of_vocab = []
    out_of_vocab_tokens = []
    for i in range(100):
        out = llm.generate(prompt_token_ids=PROMPT_TOKEN_IDS, sampling_params=sampling_params)
        PROMPT_COMPLETION_TOKEN_IDS = PROMPT_TOKEN_IDS + list(out[0].outputs[0].token_ids)
        try:
            out2 = llm.generate(prompt_token_ids=PROMPT_COMPLETION_TOKEN_IDS, sampling_params=sampling_params)
            out_of_vocab.append(0)
        except Exception as e:
            print(e)
            # Extract token id from error message
            token_id = int(str(e).split("Token id ")[1].split(" ")[0])
            out_of_vocab_tokens.append(token_id)
            out_of_vocab.append(1)
    
    print(f"Proportion of out of vocabulary generations: {np.mean(out_of_vocab)}")
    print(out_of_vocab_tokens)
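A quick sanity check (no GPU needed) is to compare the tokenizer's vocabulary with the model's configured vocabulary size. The two constants below are assumptions taken from the Qwen2.5 checkpoint (`len(tokenizer)` and the `vocab_size` field in `config.json`); verify them against your local files. Ids sampled from the padded region are valid rows of the lm_head but unknown to the tokenizer, which would explain the exception:

```python
# Sketch: is a sampled id in the model's padded-vocab region?
# ASSUMED numbers for Qwen/Qwen2.5-1.5B-Instruct — verify locally:
TOKENIZER_VOCAB = 151665  # ids the tokenizer can actually decode
MODEL_VOCAB = 151936      # rows in the embedding / lm_head (padded)

def is_padding_id(token_id: int) -> bool:
    """True if the id is valid for the model head but unknown to the tokenizer."""
    return TOKENIZER_VOCAB <= token_id < MODEL_VOCAB

print(is_padding_id(151779))  # id from the exception above -> True
```

If the flagged ids all land in that range, the model is sampling from untrained padding rows rather than producing corrupt ids.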
        

Selected output:

Token id 151779 is out of vocabulary
Token id 151734 is out of vocabulary
...
Proportion of out of vocabulary generations: 0.03
[151925, 151779, 151734]
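All three reported ids fall between 151665 and 151935. Until the sampler clamps ids to the tokenizer vocabulary, one possible workaround is a logits processor that masks the padding rows. This is only a sketch: it assumes the `logits_processors` hook on `SamplingParams` (supported by the V0 engine; each callable receives the previously generated token ids and the next-token logits), and a plain Python list stands in for the logits tensor so the masking logic can be shown without a GPU:

```python
# Hypothetical workaround: force -inf logits for every id the tokenizer
# cannot decode, so sampling can never select a padding id. In vLLM's V0
# engine this could be passed as SamplingParams(logits_processors=[mask_oov]);
# whether your engine version supports the hook is an assumption to verify.
NEG_INF = float("-inf")
TOKENIZER_VOCAB = 151665  # assumed len(tokenizer) for Qwen2.5; verify locally

def mask_oov(token_ids, logits):
    # Mask every position at or beyond the tokenizer vocabulary.
    for tid in range(min(TOKENIZER_VOCAB, len(logits)), len(logits)):
        logits[tid] = NEG_INF
    return logits

# Demo on a dummy logits vector the width of the padded model vocab:
logits = [0.0] * 151936
masked = mask_oov([], logits)
print(masked[151700], masked[100])  # padded row vs. normal row -> -inf 0.0
```

With the padded rows forced to -inf, their sampling probability is exactly zero, so the follow-up generate call should never see an out-of-vocabulary prompt id.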

