Labels: bug (Something isn't working), help wanted (Extra attention is needed)
Description
Recently reported by @npalaska:
We have been seeing the following error with vLLM 0.9.2 and transformers 4.52.4 when serving Llama-3.3-70B FP8, and recently with Maverick FP8 as well.
INFO: 10.131.1.21:55318 - "POST /v1/completions HTTP/1.1" 200 OK
ERROR 07-24 22:01:26 [async_llm.py:419] AsyncLLM output_handler failed.
ERROR 07-24 22:01:26 [async_llm.py:419] Traceback (most recent call last):
ERROR 07-24 22:01:26 [async_llm.py:419] File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 396, in output_handler
ERROR 07-24 22:01:26 [async_llm.py:419] processed_outputs = output_processor.process_outputs(
ERROR 07-24 22:01:26 [async_llm.py:419] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 22:01:26 [async_llm.py:419] File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/output_processor.py", line 398, in process_outputs
ERROR 07-24 22:01:26 [async_llm.py:419] stop_string = req_state.detokenizer.update(
ERROR 07-24 22:01:26 [async_llm.py:419] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 22:01:26 [async_llm.py:419] File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/detokenizer.py", line 117, in update
ERROR 07-24 22:01:26 [async_llm.py:419] self.output_text += self.decode_next(new_token_id)
ERROR 07-24 22:01:26 [async_llm.py:419] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 22:01:26 [async_llm.py:419] File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/detokenizer.py", line 216, in decode_next
ERROR 07-24 22:01:26 [async_llm.py:419] token = self._protected_step(next_token_id)
ERROR 07-24 22:01:26 [async_llm.py:419] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 22:01:26 [async_llm.py:419] File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/detokenizer.py", line 233, in _protected_step
ERROR 07-24 22:01:26 [async_llm.py:419] raise e
ERROR 07-24 22:01:26 [async_llm.py:419] File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/detokenizer.py", line 230, in _protected_step
ERROR 07-24 22:01:26 [async_llm.py:419] token = self.stream.step(self.tokenizer, next_token_id)
ERROR 07-24 22:01:26 [async_llm.py:419] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 22:01:26 [async_llm.py:419] OverflowError: out of range integral type conversion attempted
WARNING 07-24 22:01:26 [protocol.py:58] The following fields were present in the request but ignored: {'max_completion_tokens'}
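The OverflowError is raised inside tokenizers' DecodeStream.step, which takes the token id as an unsigned 32-bit integer, so any id outside that range produces exactly this conversion error. Below is a minimal, standalone sketch of that failure mode; it uses the non-gated gpt2 tokenizer and a deliberately negative id purely for illustration, since the actual out-of-range id produced in this run is unknown.

```python
# Minimal sketch of the failure mode hit in detokenizer._protected_step.
# Assumptions: a recent tokenizers release that provides DecodeStream;
# gpt2 is used only because it is not gated. The token id that vLLM
# actually passed in this report is unknown.
from tokenizers import Tokenizer
from tokenizers.decoders import DecodeStream

tokenizer = Tokenizer.from_pretrained("gpt2")
stream = DecodeStream(skip_special_tokens=True)

print(stream.step(tokenizer, 15496))  # a valid token id decodes normally

# step() converts the id to u32 on the Rust side, so an id outside that
# range (negative here) raises:
#   OverflowError: out of range integral type conversion attempted
stream.step(tokenizer, -1)
```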
The model launch args are as follows (an offline Python equivalent is sketched after the list):
- --model=RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic
- --port=8080
- --max-model-len=8192
- --uvicorn-log-level=debug
- --trust-remote-code
- --tensor-parallel-size=4
- --gpu-memory-utilization=0.92
- --no-enable-prefix-caching
- --disable-log-requests
- --kv-cache-dtype=fp8
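For reproducing outside the OpenAI-compatible server, here is a hedged sketch of the same engine configuration through vLLM's offline LLM API. Server-only flags (--port, --uvicorn-log-level, --disable-log-requests) have no offline equivalent and are omitted; it still assumes 4 GPUs.

```python
# Hedged sketch: the launch args above expressed via vLLM's offline API.
# Assumes 4 GPUs; server-only flags are omitted.
from vllm import LLM, SamplingParams

llm = LLM(
    model="RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic",
    max_model_len=8192,
    trust_remote_code=True,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.92,
    enable_prefix_caching=False,   # --no-enable-prefix-caching
    kv_cache_dtype="fp8",
)

# Long generations exercise the incremental detokenizer path from the traceback.
outputs = llm.generate(["Write a long story."], SamplingParams(max_tokens=2048))
print(outputs[0].outputs[0].text)
```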
We used guidellm to run the performance sweep (a single-request Python reproduction is sketched after the command):
guidellm benchmark --target '<openai_endpoint>' \
--model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
--processor RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
--data='{"prompt_tokens":512, "prompt_tokens_stdev":128, "prompt_tokens_min":256, "prompt_tokens_max":1024, "output_tokens":2048, "output_tokens_stdev":512, "output_tokens_min":1024, "output_tokens_max":3072}' \
--rate-type concurrent --rate 5,25,100,500 \
--max-seconds 300 \
--output-path output.json
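For a narrower reproduction without guidellm, a single long completion against the same endpoint exercises the same streaming detokenizer path. A hedged sketch using the openai client, assuming the server from the launch args above is reachable at localhost:8080 with no API key required:

```python
# Hedged sketch: one long /v1/completions request instead of a full guidellm
# sweep. Assumes the vLLM server above is listening on localhost:8080.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

resp = client.completions.create(
    model="RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic",
    prompt="Write a long, detailed story about a distributed system.",
    max_tokens=2048,
)
print(resp.choices[0].text)
```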