
[Bug]: Incremental detokenization error when running llama-3.3-70b-fp8 model #21951

Description

@njhill

Recently reported by @npalaska:

We have been seeing the following error with vLLM 0.9.2 and transformers 4.52.4 when serving llama-3.3-70b-fp8, and recently with Maverick FP8 as well.

INFO:     10.131.1.21:55318 - "POST /v1/completions HTTP/1.1" 200 OK
ERROR 07-24 22:01:26 [async_llm.py:419] AsyncLLM output_handler failed.
ERROR 07-24 22:01:26 [async_llm.py:419] Traceback (most recent call last):
ERROR 07-24 22:01:26 [async_llm.py:419]   File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 396, in output_handler
ERROR 07-24 22:01:26 [async_llm.py:419]     processed_outputs = output_processor.process_outputs(
ERROR 07-24 22:01:26 [async_llm.py:419]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 22:01:26 [async_llm.py:419]   File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/output_processor.py", line 398, in process_outputs
ERROR 07-24 22:01:26 [async_llm.py:419]     stop_string = req_state.detokenizer.update(
ERROR 07-24 22:01:26 [async_llm.py:419]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 22:01:26 [async_llm.py:419]   File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/detokenizer.py", line 117, in update
ERROR 07-24 22:01:26 [async_llm.py:419]     self.output_text += self.decode_next(new_token_id)
ERROR 07-24 22:01:26 [async_llm.py:419]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 22:01:26 [async_llm.py:419]   File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/detokenizer.py", line 216, in decode_next
ERROR 07-24 22:01:26 [async_llm.py:419]     token = self._protected_step(next_token_id)
ERROR 07-24 22:01:26 [async_llm.py:419]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 22:01:26 [async_llm.py:419]   File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/detokenizer.py", line 233, in _protected_step
ERROR 07-24 22:01:26 [async_llm.py:419]     raise e
ERROR 07-24 22:01:26 [async_llm.py:419]   File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/detokenizer.py", line 230, in _protected_step
ERROR 07-24 22:01:26 [async_llm.py:419]     token = self.stream.step(self.tokenizer, next_token_id)
ERROR 07-24 22:01:26 [async_llm.py:419]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 22:01:26 [async_llm.py:419] OverflowError: out of range integral type conversion attempted
WARNING 07-24 22:01:26 [protocol.py:58] The following fields were present in the request but ignored: {'max_completion_tokens'} 
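
The self.stream.step(self.tokenizer, next_token_id) call in the last frame goes into DecodeStream.step from the Hugging Face tokenizers library. Below is a minimal standalone sketch of the same exception type, assuming the failure is a token id falling outside the 32-bit range the Rust-backed decoder accepts; the gpt2 tokenizer is only a stand-in, not the model from the report.

# Standalone sketch (not from the report): trigger the same exception type
# with the tokenizers DecodeStream that vLLM's incremental detokenizer uses.
from tokenizers import Tokenizer
from tokenizers.decoders import DecodeStream

tokenizer = Tokenizer.from_pretrained("gpt2")  # stand-in tokenizer
stream = DecodeStream(skip_special_tokens=True)

# Normal incremental detokenization: step() returns a text fragment, or None
# while the byte-level decoder is still accumulating a complete character.
for token_id in tokenizer.encode("Hello world").ids:
    print(stream.step(tokenizer, token_id))

# A token id that does not fit in 32 bits raises
# "OverflowError: out of range integral type conversion attempted",
# matching the error in the traceback above.
stream.step(tokenizer, 2**32)

If that is indeed the failure mode, the open question is which component hands the detokenizer a token id outside the tokenizer's vocabulary.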

Model launch args are as follows; an offline single-prompt equivalent is sketched after the list:

    - --model=RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic
    - --port=8080
    - --max-model-len=8192
    - --uvicorn-log-level=debug
    - --trust-remote-code
    - --tensor-parallel-size=4
    - --gpu-memory-utilization=0.92
    - --no-enable-prefix-caching
    - --disable-log-requests
    - --kv-cache-dtype=fp8 
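
An offline single-prompt sketch with the same model settings, assuming vLLM's offline LLM entrypoint (the prompt and max_tokens are placeholders; the tensor-parallel size assumes the same 4 GPUs):

# Offline sketch reusing the server's model settings; not from the original report.
from vllm import LLM, SamplingParams

llm = LLM(
    model="RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic",
    max_model_len=8192,
    trust_remote_code=True,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.92,
    enable_prefix_caching=False,
    kv_cache_dtype="fp8",
)

# Long generations make the incremental detokenizer do more work per request.
outputs = llm.generate(
    ["Write a long story about a lighthouse keeper."],
    SamplingParams(max_tokens=2048),
)
print(outputs[0].outputs[0].text)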

We used guidellm to run the performance sweep:

guidellm benchmark --target '<openai_endpoint>' \
         --model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
         --processor RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
         --data='{"prompt_tokens":512, "prompt_tokens_stdev":128, "prompt_tokens_min":256, "prompt_tokens_max":1024, "output_tokens":2048, "output_tokens_stdev":512, "output_tokens_min":1024, "output_tokens_max":3072}'  \
         --rate-type concurrent --rate 5,25,100,500 \
         --max-seconds 300 \
         --output-path output.json
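
A narrower client-side check than the full guidellm sweep, assuming the server above is reachable locally on port 8080 (base URL, api_key, and prompt are placeholders, not from the report):

# Hypothetical single streaming request against the same /v1/completions endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

stream = client.completions.create(
    model="RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic",
    prompt="Write a long story about a lighthouse keeper.",
    max_tokens=2048,
    stream=True,
)

# If the bug triggers, the OverflowError shows up in the server log
# (as in the traceback above) and the stream is cut off.
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)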

Metadata

    Labels

    bug (Something isn't working), help wanted (Extra attention is needed)
