
[Bug]: v0.8.5.post1 Eagle3 broken with llama3-70b #18452

Open
@fan-niu

Description

Your current environment

vllm v0.8.5.post1
NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4
NVIDIA H100 80GB HBM3

🐛 Describe the bug

vLLM 0.8.5.post1 works with EAGLE3 on meta-llama/Llama-3.1-8B-Instruct, but when I switch the model to meta-llama/Llama-3.3-70B-Instruct and send a request, the engine crashes. Please help me figure this out, thanks a lot.

Start script:

export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_USE_V1=1
python3 -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.3-70B-Instruct \
        --disable-log-requests --port 8080 \
        --served-model-name zoom_llama_3_70b \
        --tensor-parallel-size 4 \
        --device cuda \
        --speculative_config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B", "num_speculative_tokens": 2}'
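
For reference, the failing request is a streaming chat completion against the served model name. The exact request body is not included above, so the curl call below is only a minimal sketch of what such a request could look like; the prompt content and sampling parameters are placeholders, and "zoom_llama_3_70b" matches --served-model-name from the start script:

# Assumed reproduction request (the original request body is not shown in this report)
curl http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
              "model": "zoom_llama_3_70b",
              "messages": [{"role": "user", "content": "Hello"}],
              "max_tokens": 128,
              "stream": true
            }'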

Error Log:

DEBUG 05-21 02:19:18 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
DEBUG 05-21 02:19:28 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(VllmWorker rank=3 pid=1820) DEBUG 05-21 02:19:38 [shm_broadcast.py:430] No available shared memory broadcast block found in 60 second.
(VllmWorker rank=0 pid=1817) DEBUG 05-21 02:19:38 [shm_broadcast.py:430] No available shared memory broadcast block found in 60 second.
(VllmWorker rank=1 pid=1818) DEBUG 05-21 02:19:38 [shm_broadcast.py:430] No available shared memory broadcast block found in 60 second.
(VllmWorker rank=2 pid=1819) DEBUG 05-21 02:19:38 [shm_broadcast.py:430] No available shared memory broadcast block found in 60 second.
DEBUG 05-21 02:19:38 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     127.0.0.1:41210 - "GET /v1/chat/completions HTTP/1.1" 405 Method Not Allowed
WARNING 05-21 02:19:42 [protocol.py:71] The following fields were present in the request but ignored: {'include_special_tokens'}
INFO 05-21 02:19:42 [chat_utils.py:397] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO:     127.0.0.1:41216 - "POST /v1/chat/completions HTTP/1.1" 200 OK
DEBUG 05-21 02:19:42 [core.py:427] EngineCore loop active.
INFO 05-21 02:19:48 [loggers.py:111] Engine 000: Avg prompt throughput: 77.7 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
INFO 05-21 02:19:58 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
DEBUG 05-21 02:20:08 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
DEBUG 05-21 02:20:18 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
ERROR 05-21 02:20:24 [core.py:398] EngineCore encountered a fatal error.
ERROR 05-21 02:20:24 [core.py:398] Traceback (most recent call last):
ERROR 05-21 02:20:24 [core.py:398]   File "/home/anaconda3/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 181, in collective_rpc
ERROR 05-21 02:20:24 [core.py:398]     status, result = w.worker_response_mq.dequeue(
ERROR 05-21 02:20:24 [core.py:398]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-21 02:20:24 [core.py:398]   File "/home/anaconda3/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 479, in dequeue
ERROR 05-21 02:20:24 [core.py:398]     with self.acquire_read(timeout, cancel) as buf:
ERROR 05-21 02:20:24 [core.py:398]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-21 02:20:24 [core.py:398]   File "/home/anaconda3/lib/python3.12/contextlib.py", line 137, in __enter__
ERROR 05-21 02:20:24 [core.py:398]     return next(self.gen)
ERROR 05-21 02:20:24 [core.py:398]            ^^^^^^^^^^^^^^
ERROR 05-21 02:20:24 [core.py:398]   File "/home/anaconda3/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 443, in acquire_read
ERROR 05-21 02:20:24 [core.py:398]     raise TimeoutError
ERROR 05-21 02:20:24 [core.py:398] TimeoutError
ERROR 05-21 02:20:24 [core.py:398] 
ERROR 05-21 02:20:24 [core.py:398] The above exception was the direct cause of the following exception:
ERROR 05-21 02:20:24 [core.py:398] 
ERROR 05-21 02:20:24 [core.py:398] Traceback (most recent call last):
ERROR 05-21 02:20:24 [core.py:398]   File "/home/anaconda3/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 389, in run_engine_core
ERROR 05-21 02:20:24 [core.py:398]     engine_core.run_busy_loop()
ERROR 05-21 02:20:24 [core.py:398]   File "/home/anaconda3/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 413, in run_busy_loop
ERROR 05-21 02:20:24 [core.py:398]     self._process_engine_step()
ERROR 05-21 02:20:24 [core.py:398]   File "/home/anaconda3/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 438, in _process_engine_step
ERROR 05-21 02:20:24 [core.py:398]     outputs = self.step_fn()
ERROR 05-21 02:20:24 [core.py:398]               ^^^^^^^^^^^^^^
ERROR 05-21 02:20:24 [core.py:398]   File "/home/anaconda3/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 203, in step
ERROR 05-21 02:20:24 [core.py:398]     output = self.model_executor.execute_model(scheduler_output)
ERROR 05-21 02:20:24 [core.py:398]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-21 02:20:24 [core.py:398]   File "/home/anaconda3/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 146, in execute_model
ERROR 05-21 02:20:24 [core.py:398]     (output, ) = self.collective_rpc("execute_model",
ERROR 05-21 02:20:24 [core.py:398]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-21 02:20:24 [core.py:398]   File "/home/anaconda3/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 193, in collective_rpc
ERROR 05-21 02:20:24 [core.py:398]     raise TimeoutError(f"RPC call to {method} timed out.") from e
ERROR 05-21 02:20:24 [core.py:398] TimeoutError: RPC call to execute_model timed out.
ERROR 05-21 02:20:24 [async_llm.py:399] AsyncLLM output_handler failed.
ERROR 05-21 02:20:24 [async_llm.py:399] Traceback (most recent call last):
ERROR 05-21 02:20:24 [async_llm.py:399]   File "/home/anaconda3/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 357, in output_handler
ERROR 05-21 02:20:24 [async_llm.py:399]     outputs = await engine_core.get_output_async()
ERROR 05-21 02:20:24 [async_llm.py:399]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-21 02:20:24 [async_llm.py:399]   File "/home/anaconda3/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 716, in get_output_async
ERROR 05-21 02:20:24 [async_llm.py:399]     raise self._format_exception(outputs) from None
ERROR 05-21 02:20:24 [async_llm.py:399] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
ERROR 05-21 02:20:24 [serving_chat.py:885] Error in chat completion stream generator.
ERROR 05-21 02:20:24 [serving_chat.py:885] Traceback (most recent call last):
ERROR 05-21 02:20:24 [serving_chat.py:885]   File "/home/anaconda3/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_chat.py", line 487, in chat_completion_stream_generator
ERROR 05-21 02:20:24 [serving_chat.py:885]     async for res in result_generator:
ERROR 05-21 02:20:24 [serving_chat.py:885]   File "/home/anaconda3/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 306, in generate
ERROR 05-21 02:20:24 [serving_chat.py:885]     out = q.get_nowait() or await q.get()
ERROR 05-21 02:20:24 [serving_chat.py:885]                             ^^^^^^^^^^^^^
ERROR 05-21 02:20:24 [serving_chat.py:885]   File "/home/anaconda3/lib/python3.12/site-packages/vllm/v1/engine/output_processor.py", line 51, in get
ERROR 05-21 02:20:24 [serving_chat.py:885]     raise output
ERROR 05-21 02:20:24 [serving_chat.py:885]   File "/home/anaconda3/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 357, in output_handler
ERROR 05-21 02:20:24 [serving_chat.py:885]     outputs = await engine_core.get_output_async()
ERROR 05-21 02:20:24 [serving_chat.py:885]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-21 02:20:24 [serving_chat.py:885]   File "/home/anaconda3/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 716, in get_output_async
ERROR 05-21 02:20:24 [serving_chat.py:885]     raise self._format_exception(outputs) from None
ERROR 05-21 02:20:24 [serving_chat.py:885] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1792]
/home/anaconda3/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/anaconda3/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 5 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

