
[Bug]: Incremental detokenization error when running llama-3.3-70b-fp8 model #21951

Description

@njhill

Recently reported by @npalaska:

We have been seeing the following error with vLLM 0.9.2 and transformers 4.52.4 when serving llama-3.3-70b-fp8, and recently with Maverick FP8 as well.

INFO:     10.131.1.21:55318 - "POST /v1/completions HTTP/1.1" 200 OK
ERROR 07-24 22:01:26 [async_llm.py:419] AsyncLLM output_handler failed.
ERROR 07-24 22:01:26 [async_llm.py:419] Traceback (most recent call last):
ERROR 07-24 22:01:26 [async_llm.py:419]   File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 396, in output_handler
ERROR 07-24 22:01:26 [async_llm.py:419]     processed_outputs = output_processor.process_outputs(
ERROR 07-24 22:01:26 [async_llm.py:419]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 22:01:26 [async_llm.py:419]   File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/output_processor.py", line 398, in process_outputs
ERROR 07-24 22:01:26 [async_llm.py:419]     stop_string = req_state.detokenizer.update(
ERROR 07-24 22:01:26 [async_llm.py:419]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 22:01:26 [async_llm.py:419]   File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/detokenizer.py", line 117, in update
ERROR 07-24 22:01:26 [async_llm.py:419]     self.output_text += self.decode_next(new_token_id)
ERROR 07-24 22:01:26 [async_llm.py:419]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 22:01:26 [async_llm.py:419]   File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/detokenizer.py", line 216, in decode_next
ERROR 07-24 22:01:26 [async_llm.py:419]     token = self._protected_step(next_token_id)
ERROR 07-24 22:01:26 [async_llm.py:419]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 22:01:26 [async_llm.py:419]   File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/detokenizer.py", line 233, in _protected_step
ERROR 07-24 22:01:26 [async_llm.py:419]     raise e
ERROR 07-24 22:01:26 [async_llm.py:419]   File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/detokenizer.py", line 230, in _protected_step
ERROR 07-24 22:01:26 [async_llm.py:419]     token = self.stream.step(self.tokenizer, next_token_id)
ERROR 07-24 22:01:26 [async_llm.py:419]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 22:01:26 [async_llm.py:419] OverflowError: out of range integral type conversion attempted
WARNING 07-24 22:01:26 [protocol.py:58] The following fields were present in the request but ignored: {'max_completion_tokens'} 
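
The self.stream.step(self.tokenizer, next_token_id) call in the last frame goes into DecodeStream.step from the Hugging Face tokenizers library. Below is a minimal standalone sketch of the same exception type, assuming the failure is a token id falling outside the 32-bit range the Rust-backed decoder accepts; the gpt2 tokenizer is only a stand-in, not the model from the report.

# Standalone sketch (not from the report): trigger the same exception type
# with the tokenizers DecodeStream that vLLM's incremental detokenizer uses.
from tokenizers import Tokenizer
from tokenizers.decoders import DecodeStream

tokenizer = Tokenizer.from_pretrained("gpt2")  # stand-in tokenizer
stream = DecodeStream(skip_special_tokens=True)

# Normal incremental detokenization: step() returns a text fragment, or None
# while the byte-level decoder is still accumulating a complete character.
for token_id in tokenizer.encode("Hello world").ids:
    print(stream.step(tokenizer, token_id))

# A token id that does not fit in 32 bits raises
# "OverflowError: out of range integral type conversion attempted",
# matching the error in the traceback above.
stream.step(tokenizer, 2**32)

If that is indeed the failure mode, the open question is which component hands the detokenizer a token id outside the tokenizer's vocabulary.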

Model launch args are as follows; an offline single-prompt equivalent is sketched after the list:

    - --model=RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic
    - --port=8080
    - --max-model-len=8192
    - --uvicorn-log-level=debug
    - --trust-remote-code
    - --tensor-parallel-size=4
    - --gpu-memory-utilization=0.92
    - --no-enable-prefix-caching
    - --disable-log-requests
    - --kv-cache-dtype=fp8 
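
An offline single-prompt sketch with the same model settings, assuming vLLM's offline LLM entrypoint (the prompt and max_tokens are placeholders; the tensor-parallel size assumes the same 4 GPUs):

# Offline sketch reusing the server's model settings; not from the original report.
from vllm import LLM, SamplingParams

llm = LLM(
    model="RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic",
    max_model_len=8192,
    trust_remote_code=True,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.92,
    enable_prefix_caching=False,
    kv_cache_dtype="fp8",
)

# Long generations make the incremental detokenizer do more work per request.
outputs = llm.generate(
    ["Write a long story about a lighthouse keeper."],
    SamplingParams(max_tokens=2048),
)
print(outputs[0].outputs[0].text)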

We used guidellm to run the performance sweep:

guidellm benchmark --target '<openai_endpoint>' \
         --model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
         --processor RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
         --data='{"prompt_tokens":512, "prompt_tokens_stdev":128, "prompt_tokens_min":256, "prompt_tokens_max":1024, "output_tokens":2048, "output_tokens_stdev":512, "output_tokens_min":1024, "output_tokens_max":3072}'  \
         --rate-type concurrent --rate 5,25,100,500 \
         --max-seconds 300 \
         --output-path output.json
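
A narrower client-side check than the full guidellm sweep, assuming the server above is reachable locally on port 8080 (base URL, api_key, and prompt are placeholders, not from the report):

# Hypothetical single streaming request against the same /v1/completions endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

stream = client.completions.create(
    model="RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic",
    prompt="Write a long story about a lighthouse keeper.",
    max_tokens=2048,
    stream=True,
)

# If the bug triggers, the OverflowError shows up in the server log
# (as in the traceback above) and the stream is cut off.
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)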

Metadata

    Labels

    bug (Something isn't working), help wanted (Extra attention is needed)
