Description
Your current environment
I am running the latest vLLM v0.9.0.1 with the latest EAGLE-3 fix PR applied.
The instance has 8×H100 GPUs.
🐛 Describe the bug
Serving command below. The EAGLE-3 head for Llama 3.3 70B is from its official release:
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --seed 42 \
    --tensor-parallel-size 8 \
    --max-model-len 131072 \
    --max-num-batched-tokens 131072 \
    --max-num-seqs 100 \
    --max_seq_len_to_capture 131072 \
    --gpu_memory_utilization 0.9 \
    --no-enable-prefix-caching \
    --speculative_config '{
        "model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B",
        "draft_tensor_parallel_size": 8,
        "num_speculative_tokens": 5,
        "method": "eagle3"
    }'
Then I hit this error:
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] File "/opt/pytorch/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] return forward_call(*args, **kwargs)
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] File "/home/ubuntu/test_code/open_vllm/vllm/vllm/model_executor/models/llama_eagle3.py", line 205, in forward
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] return self.model(input_ids, positions, hidden_states)
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] File "/home/ubuntu/test_code/open_vllm/vllm/vllm/compilation/decorators.py", line 238, in __call__
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] output = self.compiled_callable(*args, **kwargs)
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] File "/opt/pytorch/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] return fn(*args, **kwargs)
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] File "/home/ubuntu/test_code/open_vllm/vllm/vllm/model_executor/models/llama_eagle3.py", line 121, in forward
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] def forward(
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] File "/opt/pytorch/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] return self._call_impl(*args, **kwargs)
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] File "/opt/pytorch/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] return forward_call(*args, **kwargs)
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] File "/opt/pytorch/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] return fn(*args, **kwargs)
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] File "/opt/pytorch/lib/python3.12/site-packages/torch/fx/graph_module.py", line 830, in call_wrapped
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] return self._wrapped_call(self, *args, **kwargs)
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] File "/opt/pytorch/lib/python3.12/site-packages/torch/fx/graph_module.py", line 406, in __call__
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] raise e
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] File "/opt/pytorch/lib/python3.12/site-packages/torch/fx/graph_module.py", line 393, in __call__
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] File "/opt/pytorch/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] return self._call_impl(*args, **kwargs)
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] File "/opt/pytorch/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] return forward_call(*args, **kwargs)
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] File "<eval_with_key>.167", line 18, in forward
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] submod_0 = self.submod_0(l_input_ids_, s0, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, l_hidden_states_, l_self_modules_layers_modules_0_modules_hidden_norm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_); l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_hidden_norm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_positions_ = l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_ = None
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] File "/home/ubuntu/test_code/open_vllm/vllm/vllm/compilation/cuda_piecewise_backend.py", line 110, in __call__
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] return self.compiled_graph_for_general_shape(*args)
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] File "/home/ubuntu/test_code/open_vllm/vllm/vllm/compilation/compiler_interface.py", line 489, in compiled_graph
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] graph_output = inductor_compiled_graph(list_args)
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] File "/opt/pytorch/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 460, in __call__
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] return self.current_callable(inputs)
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] File "/home/ubuntu/.cache/vllm/torch_compile_cache/3825025aed-1/rank_2_0/inductor_cache/ct/ccttgk63mie75qcg7t4swtaeuygoparq7phni4yhir5chq2tpbsb.py", line 365, in call
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] assert_size_stride(arg2_1, (16032, 8192), (8192, 1))
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] AssertionError: expected size 16032==16032, stride 6144==8192 at dim=0; expected size 6144==8192, stride 1==1 at dim=1
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] This error most often comes from a incorrect fake (aka meta) kernel for a custom op.
(VllmWorker rank=2 pid=1204085) ERROR 06-05 02:08:42 [multiproc_executor.py:522] Use torch.library.opcheck to test your custom op.
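For context on the final assertion: Inductor-compiled code guards every input tensor's exact sizes and strides, so a weight whose shard size at run time differs from what was traced at compile time trips the guard. Reading the message, the graph expects the draft head's qkv_proj weight as a contiguous (16032, 8192) tensor, but the tensor actually passed is a contiguous (16032, 6144) one (stride 6144 at dim 0, size 6144 at dim 1). A minimal pure-Python sketch of such a size/stride check (a hypothetical stand-in, not the actual Inductor code), fed the shapes from the log:

```python
def assert_size_stride(shape, stride, expected_shape, expected_stride):
    """Simplified stand-in for Inductor's assert_size_stride guard:
    every dimension's size and stride must match exactly."""
    problems = []
    for dim, (s, es) in enumerate(zip(shape, expected_shape)):
        if s != es:
            problems.append(f"expected size {es}=={s} at dim={dim}")
    for dim, (st, est) in enumerate(zip(stride, expected_stride)):
        if st != est:
            problems.append(f"expected stride {est}=={st} at dim={dim}")
    if problems:
        raise AssertionError("; ".join(problems))

def contiguous_strides(shape):
    # Row-major strides: the last dimension has stride 1.
    strides, acc = [], 1
    for s in reversed(shape):
        strides.append(acc)
        acc *= s
    return tuple(reversed(strides))

# What the compiled graph was traced with: qkv_proj weight (16032, 8192).
expected_shape, expected_stride = (16032, 8192), (8192, 1)

# What actually arrives at run time, per the log: a contiguous
# (16032, 6144) tensor, i.e. stride (6144, 1).
actual_shape = (16032, 6144)
actual_stride = contiguous_strides(actual_shape)

try:
    assert_size_stride(actual_shape, actual_stride,
                       expected_shape, expected_stride)
except AssertionError as e:
    print(e)
```

This is only a shape illustration, but it shows the mismatch is in the qkv_proj weight's second dimension (6144 vs. 8192), i.e. the weight the compiled graph runs against is not the size it was traced with, consistent with the log's hint about an incorrect fake (meta) kernel.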