
[Bug]: vllm0.9.0 cannot load eagle30-llama3.3-70b-inst model #18906

Closed
@WuChuYi

Description


Your current environment

vllm-0.9.0+cu126
torch-2.7.0+cu126
cuda 12.6

🐛 Describe the bug

The server is launched with:

```shell
CUDA_VISIBLE_DEVICES=1 VLLM_USE_V1=1 python3 -m vllm.entrypoints.openai.api_server \
    --port 32367 \
    --dtype float16 \
    --gpu-memory-utilization 0.9 \
    --disable-log-requests \
    --enable-prefix-caching \
    --max-model-len 8192 \
    --max_num_seqs 8 \
    --model /path/to/content_llama33_70b_instruct/ \
    --no-enable-chunked-prefill \
    --speculative-config '{"method": "eagle3", "model": "/path/to/eagle3-llama3.3-70b-inst", "num_speculative_tokens": 3, "draft_tensor_parallel_size": 1, "max_model_len": 2048}'
```

Loading fails because the draft checkpoint contains an unexpected weight:

```
KeyError: 'embed_tokens.weight'
```

So I removed that weight, but then another error appeared:

```python
assert hidden_states.shape[-1] == input_embeds.shape[-1]
```
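A minimal sketch of the removal step, not vLLM's actual loader: the offending key from the `KeyError` above is dropped from the draft checkpoint's state dict before loading. The checkpoint is simulated here with a plain dict of placeholder values.

```python
# Simulated draft-checkpoint state dict (placeholder values, not real tensors).
state_dict = {
    "embed_tokens.weight": "tensor-0",               # the extra weight the loader rejects
    "layers.0.self_attn.q_proj.weight": "tensor-1",  # a normal draft-head weight
}

# Drop the key the eagle3 loader does not expect; pop with a default
# is a no-op if the key is already absent.
state_dict.pop("embed_tokens.weight", None)

print(sorted(state_dict))  # ['layers.0.self_attn.q_proj.weight']
```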

This seems to be because the hidden sizes of the target model (llama3.3-70b-inst) and the EAGLE head mismatch.
(These two values are consistent for the eagle3-llama3.1-8b, eagle1-llama3.1-8b, and eagle1-llama3.3-70b models.)
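A sketch of the mismatch behind the failing assert: the draft head's input embeddings and the target's hidden states must share the last dimension. 8192 is Llama-3.3-70B's hidden size; 4096 for the draft head is an illustrative assumed value, not read from the actual checkpoint.

```python
# Configs reduced to the one field that matters for the assert.
target_cfg = {"hidden_size": 8192}  # Llama-3.3-70B
draft_cfg = {"hidden_size": 4096}   # assumed draft-head value for illustration

def hidden_sizes_match(target, draft):
    # Mirrors the vLLM check: hidden_states.shape[-1] == input_embeds.shape[-1]
    return target["hidden_size"] == draft["hidden_size"]

print(hidden_sizes_match(target_cfg, draft_cfg))  # False -> the assert would fail
```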

So I tried to stop the draft head from sharing the target model's vocabulary embeddings by commenting out two lines of code.

vllm/v1/spec_decode/eagle.py, line 336:

```python
# self.model.model.embed_tokens = target_model.model.embed_tokens
```

vllm/model_executor/models/llama_eagle3.py, line 98:

```python
# if get_pp_group().world_size > 1:
```

With these changes the model loads.
But the draft acceptance rate is low, less than 10%, and the average acceptance length is about 1, whereas with eagle1-llama3.3-70b the draft acceptance rate on the same prompt is about 40%.
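To make the two metrics concrete, a small sketch of how average acceptance length relates to per-step accepted drafts: with num_speculative_tokens=3, each decoding step accepts 0 to 3 draft tokens plus one token from the target's verification, so the step's length is accepted + 1. The numbers below are illustrative, not measured from the run above.

```python
# Illustrative per-step counts of accepted draft tokens for a low-acceptance
# run (like the EAGLE-3 head above): mostly zero drafts accepted.
accepted_per_step = [0, 0, 1, 0, 0]

# Each step always emits one verified target token on top of the accepted drafts.
avg_len = sum(a + 1 for a in accepted_per_step) / len(accepted_per_step)
print(avg_len)  # 1.2 -> consistent with the "about 1" reported above
```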

Before submitting a new issue...

  • Make sure you have already searched for relevant issues, and asked the chatbot at the bottom-right corner of the documentation page, which can answer many frequently asked questions.
