Description
Your current environment
vllm-0.9.0+cu126
torch-2.7.0+cu126
cuda 12.6
🐛 Describe the bug
Running
CUDA_VISIBLE_DEVICES=1 VLLM_USE_V1=1 python3 -m vllm.entrypoints.openai.api_server \
    --port 32367 \
    --dtype float16 \
    --gpu-memory-utilization 0.9 \
    --disable-log-requests \
    --enable-prefix-caching \
    --max-model-len 8192 \
    --max_num_seqs 8 \
    --model /path/to/content_llama33_70b_instruct/ \
    --no-enable-chunked-prefill \
    --speculative-config '{"method": "eagle3", "model": "/path/to/eagle3-llama3.3-70b-inst", "num_speculative_tokens": 3, "draft_tensor_parallel_size": 1, "max_model_len": 2048}'
fails while loading the draft model's weights because the checkpoint contains an extra weight that the loader does not expect:
KeyError: 'embed_tokens.weight'
So I removed that weight, but then another error appeared:
assert hidden_states.shape[-1] == input_embeds.shape[-1]
This seems to be because the hidden size of the target model (llama3.3-70b-inst) does not match that of the EAGLE head.
(These two values do match for the eagle3-llama3.1-8b, eagle1-llama3.1-8b, and eagle1-llama3.3-70b models.)
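A quick way to confirm the mismatch is to compare the hidden_size fields of the two config.json files, which is effectively what the failing assert checks. A minimal sketch (the paths are the placeholders from the command above, and I assume both checkpoints use the standard HF config layout):

```python
import json
import os

def hidden_size(model_dir: str):
    # Read hidden_size from a checkpoint directory's config.json.
    with open(os.path.join(model_dir, "config.json")) as f:
        return json.load(f).get("hidden_size")

target = hidden_size("/path/to/content_llama33_70b_instruct/")  # 8192 for Llama 3.3 70B
draft = hidden_size("/path/to/eagle3-llama3.3-70b-inst")
print(f"target hidden_size: {target}, draft hidden_size: {draft}")

# If these differ, sharing the target's embed_tokens gives input_embeds with the
# target's hidden size while the draft head works at its own hidden size, which
# appears to be what trips `hidden_states.shape[-1] == input_embeds.shape[-1]`.
```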
So, to stop the draft model from sharing the target model's vocabulary embeddings, I commented out two lines of code:
vllm/v1/spec_decode/eagle.py, line 336:
# self.model.model.embed_tokens = target_model.model.embed_tokens
vllm/model_executor/models/llama_eagle3.py, line 98:
# if get_pp_group().world_size > 1:
With those two lines commented out, the model loads.
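The net effect of this workaround is simply to keep the draft checkpoint's own embedding table instead of the target's. For illustration only, the same idea expressed as a conditional (a hypothetical helper, not the actual vLLM code):

```python
import torch.nn as nn

def maybe_share_embeddings(draft_embed: nn.Embedding,
                           target_embed: nn.Embedding) -> nn.Embedding:
    # Share the target model's embed_tokens only when the embedding
    # dimensions agree; otherwise keep the draft model's own table,
    # which is what commenting out the two lines above achieves here.
    if draft_embed.embedding_dim == target_embed.embedding_dim:
        return target_embed
    return draft_embed
```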
However, with this setup the draft acceptance rate is low (less than 10%) and the average acceptance length is about 1, whereas with eagle1-llama3.3-70b the draft acceptance rate for the same prompt is about 40%.
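Those two numbers are at least consistent with each other. Under the simple i.i.d. acceptance model from the speculative-decoding literature (an assumption for a rough sanity check, not vLLM's exact metric), a ~10% per-token acceptance rate with 3 drafted tokens gives an expected length close to 1:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    # Expected tokens emitted per verification step (accepted draft tokens
    # plus the target model's bonus token), assuming each of the k drafted
    # tokens is accepted independently with probability alpha.
    return (1 - alpha ** (k + 1)) / (1 - alpha)

print(expected_tokens_per_step(0.10, 3))  # ~1.11, roughly this EAGLE3 head
print(expected_tokens_per_step(0.40, 3))  # ~1.62, roughly the eagle1 head
```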