[V1][Spec Decode] EAGLE-3 Support #16937

Merged · 21 commits into vllm-project:main on Apr 25, 2025

Conversation

benchislett (Collaborator) commented Apr 21, 2025

Overview

This PR adds EAGLE3 support for Llama 3 models in the vLLM V1 engine. The main change to existing code is that auxiliary hidden states must now be returned from the target model: EAGLE3 consumes a concatenation of hidden states extracted from intermediate layers of the target model, so any model that wants to support EAGLE3 will require a corresponding modification.
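
For illustration, here is a minimal, self-contained sketch of that change (the class and argument names below are made up for this example and are not the actual vLLM interface): the target model stashes hidden states from a few intermediate layers and returns them alongside its final output, and the drafter consumes their concatenation.

```python
import torch
import torch.nn as nn

class ToyTargetModel(nn.Module):
    """Toy decoder stack that also returns auxiliary hidden states
    from selected intermediate layers for an EAGLE3-style drafter."""

    def __init__(self, num_layers=8, hidden=64, aux_layer_ids=(2, 4, 6)):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(hidden, hidden)
                                    for _ in range(num_layers))
        self.aux_layer_ids = set(aux_layer_ids)

    def forward(self, x):
        aux_hidden_states = []
        for i, layer in enumerate(self.layers):
            x = torch.relu(layer(x))
            if i in self.aux_layer_ids:
                # Stash this intermediate activation for the draft model.
                aux_hidden_states.append(x)
        # The drafter consumes the concatenation of the auxiliary states.
        return x, torch.cat(aux_hidden_states, dim=-1)

hidden_states, aux = ToyTargetModel()(torch.randn(1, 64))
assert aux.shape == (1, 3 * 64)  # 3x-wide auxiliary hidden state
```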

EAGLE3 Model Implementation

The EAGLE reference implementation of EAGLE3 can be found here. There are a number of changes from previous EAGLE implementations:

  • The projection of the concatenated [input_embeds, hidden_states] is moved into the attention matrices. The self.fc matrix now projects the 3x hidden-state inputs into a single embedding, which is concatenated with the input embeddings to form [input_embeds, projected_hidden_states]. The final projection back to hidden_dim happens in the attention layer as part of the QKVParallelLinear matrices.
  • A pair of hidden_states is returned from the EAGLE layer, so that the normed value can be used for sampling and the pre-normalization value can be passed into the next iteration of speculation. An alternative would be to apply the norm in the compute_logits function; I am indifferent to how we accomplish this.
  • Output token mapping must be implemented, since the EAGLE3 lm_head is not the same as the target model's: it has output dimension 32k, which is mapped to the target model vocab size using a simple index mapping (called "d2t" in the EAGLE codebase). See the sketch after this list.
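
To make the projection and vocab-mapping points concrete, here is a rough sketch with made-up sizes (the real shapes, and the exact semantics of the d2t table, come from the EAGLE3 checkpoint; this is not the code in this PR):

```python
import torch
import torch.nn as nn

hidden = 512           # illustrative hidden size (smaller than the real model)
draft_vocab = 32_000   # EAGLE3 draft lm_head output dimension
target_vocab = 128_256 # illustrative target vocab size

# (1) fc projects the 3x concatenated auxiliary hidden states down to a single
#     hidden-size embedding, which is then concatenated with input_embeds;
#     the projection back to hidden_dim is folded into the QKV matrices.
fc = nn.Linear(3 * hidden, hidden)
input_embeds = torch.randn(1, hidden)
aux_hidden_states = torch.randn(1, 3 * hidden)
attn_input = torch.cat([input_embeds, fc(aux_hidden_states)], dim=-1)  # 2x hidden

# (2) "d2t": a lookup table mapping draft-vocab indices to target-vocab indices.
#     Random placeholder values here; the real table ships with the EAGLE3 weights.
d2t = torch.randint(0, target_vocab, (draft_vocab,))
draft_token_ids = torch.tensor([17, 123, 31_999])
target_token_ids = d2t[draft_token_ids]
```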

Acceptance Rate

To generate the acceptance rates, you will need #16367. I merged it at one point for testing, then reverted the diff so it doesn't appear in this PR. For now, you can just check out the earlier commit: git checkout aa11bef1632c345224dda0aa2a56248c5de8ea55. Alternatively, for the latest changes, you can run git revert 0721cfa, which undoes the revert and restores the experimental diff to the branch.

To run, download the mt_bench data and invoke:

VLLM_USE_V1=1 python examples/offline_inference/eagle.py --dataset="./data/mt_bench/question.jsonl" --num_spec_tokens 7 --max_num_seqs 1 --num_prompts 80

Here are the results for meta-llama/Llama-3.1-8B-Instruct with yuhuili/EAGLE-LLaMA3.1-Instruct-8B on my RTX 4090 (baseline):

Processed prompts: 100%|...| 80/80 [03:23<00:00,  2.55s/it, est. speed input: 39.49 toks/s, output: 82.47 toks/s]
--------------------------------------------------
mean acceptance length:         2.48
--------------------------------------------------
acceptance at token 0:1.00
acceptance at token 1:0.68
acceptance at token 2:0.39
acceptance at token 3:0.21
acceptance at token 4:0.11
acceptance at token 5:0.06
acceptance at token 6:0.03
acceptance at token 7:0.02

Here are the EAGLE3 results with yuhuili/EAGLE3-LLaMA3.1-Instruct-8B (this PR):

Processed prompts: 100%|...| 80/80 [01:56<00:00,  1.46s/it, est. speed input: 68.86 toks/s, output: 143.81 toks/s]
--------------------------------------------------
mean acceptance length:         3.57
--------------------------------------------------
acceptance at token 0:1.00
acceptance at token 1:0.75
acceptance at token 2:0.55
acceptance at token 3:0.42
acceptance at token 4:0.31
acceptance at token 5:0.24
acceptance at token 6:0.18
acceptance at token 7:0.13

Notably, the conditional acceptance rate per token does not seem to decrease: each draft token is ~75% likely to be accepted given that the previous token was accepted.
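
As a quick sanity check (not part of the PR), the conditional rates can be recovered from the cumulative numbers above:

```python
# P(token k accepted | token k-1 accepted) = cumulative[k] / cumulative[k-1]
cumulative = [1.00, 0.75, 0.55, 0.42, 0.31, 0.24, 0.18, 0.13]
conditional = [b / a for a, b in zip(cumulative, cumulative[1:])]
print([round(c, 2) for c in conditional])
# [0.75, 0.73, 0.76, 0.74, 0.77, 0.75, 0.72] -- roughly flat around 0.75
```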

Profiling indicates that using CUDA graphs and avoiding synchronization in the EAGLE code would significantly improve overall performance, since drafting time approaches the target decode time as the number of speculated tokens increases.

Testing

I am in the process of updating acceptance tests to evaluate EAGLE3. I hope that code review can be done while test integration is ongoing.

luyuzhe111 and others added 14 commits April 11, 2025 21:43
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
… eagle-temp"

This reverts commit 906e2b3, reversing
changes made to 1dd2338.

Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default. Instead, only the fastcheck CI will run, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot added the documentation and v1 labels Apr 21, 2025

mergify bot commented Apr 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Apr 21, 2025
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
mergify bot removed the needs-rebase label Apr 21, 2025
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

mergify bot commented Apr 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

ekagra-ranjan (Contributor) commented Apr 24, 2025

I just ran the benchmark for K=2, 4, 7 and compared EAGLE-1 and EAGLE-3. The acceptance length (AL) matches your run.

  • In your RTX 4090 run, EAGLE-3 at K=7 gives 1.74x more speedup than EAGLE-1; however, it is only 1.29x on the H100.
  • K=4 was better for E3 and K=2 for E1. The E3/E1 ratio at their respective optimal K is 1.14x.

@benchislett - what are your thoughts on the RTX 4090 vs H100 results and this benchmark?

DarkLight1337 (Member) commented:

The failing test looks related to this PR, PTAL

benchislett (Collaborator, Author) commented:

@ekagra-ranjan this seems reasonable. Are you measuring output token throughput, or TPOT? These are not quite the same: I believe output token throughput is averaged over the duration of the whole run, so prefill time is included, and the two measurements will differ slightly. I reiterate that the goal of this PR is to establish initial support, not to optimize the implementation. There is a lot of performance to be gained by refining the EAGLE code, but that is outside the scope of this PR.
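
For readers unfamiliar with the distinction, a small illustrative calculation with made-up timings (not measurements from this PR, and using one common definition of TPOT):

```python
# Made-up timing for one request: 0.5 s prefill, then 100 output tokens
# generated in 0.75 s of decode time.
prefill_s, decode_s, output_tokens = 0.5, 0.75, 100

# Output token throughput averages over the whole run, prefill included.
output_throughput = output_tokens / (prefill_s + decode_s)  # ~80 tok/s

# TPOT (time per output token) only counts decode time.
tpot_s = decode_s / output_tokens                           # 7.5 ms/token
tpot_throughput = 1.0 / tpot_s                              # ~133 tok/s
```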

benchislett (Collaborator, Author) commented:

@DarkLight1337 (cc @WoosukKwon) I'm not sure how to address these failing tests. There are two failures:

  • The base EAGLE test is failing with 68% matches compared to the baseline of 70%. I think this might just be variance, as I didn't change much about the EAGLE code and did not significantly change the launch configuration in the test.
  • The EAGLE3 launch is hitting OOM. The tests passed on my 4090, so I wonder if this is a matter of the pytorch cleanup not freeing memory between tests or some similar artifact.

Besides re-running or tweaking the acceptance conditions for the tests, I'm not sure how to move forward here.

ekagra-ranjan (Contributor) commented Apr 25, 2025

Are you measuring output token throughput, or TPOT?

@benchislett Thanks for pointing that out. The output tokens/s reported by offline_inference/eagle.py indeed does not exclude the TTFT. I ran benchmark_serving.py to measure TPOT at BS=1 for Llama 3.1 and it is 7.5 ms, which corresponds to 133 tokens/s and is very close to the output tokens/s reported by offline_inference/eagle.py. This is because the input length is small, so the numbers I shared can be treated as TPOT. The TTFT for MT-Bench is ~12 ms, so it contributes <2% even if the output length is 100, i.e., [12/(7.5*100)]*100%.

What are your thoughts on the RTX 4090 getting a 1.74x gain for E3 over E1 whereas the H100 gets only 1.29x? The RTX 4090 sees 1.34x more gain than the H100. I compared the fp16 tensor-core FLOPS and memory bandwidth of the RTX 4090 vs the H100, hoping that the RTX 4090 is relatively more memory-bandwidth-bound than the H100, but that does not seem to be the case. As per this, the RTX 4090 has fp16 tensor-core FLOPS of 330 TFLOPS and memory bandwidth of 1 TB/s, so its compute-to-bandwidth ratio is roughly 330 FLOPs per byte, whereas for the H100 it is roughly 590 (1979/3.35).
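
A quick check of the arithmetic quoted in this comment (spec-sheet numbers as stated above; treat them as approximate):

```python
# TTFT contribution to end-to-end time at output_len = 100 (MT-Bench, BS=1).
ttft_ms, tpot_ms, output_len = 12.0, 7.5, 100
print(f"TTFT share: {ttft_ms / (tpot_ms * output_len):.1%}")  # ~1.6%

# fp16 tensor-core FLOPS per byte of memory bandwidth (spec-sheet values).
rtx_4090 = 330e12 / 1.0e12    # ~330
h100 = 1979e12 / 3.35e12      # ~590
print(round(rtx_4090), round(h100))
```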

WoosukKwon merged commit a0e619e into vllm-project:main on Apr 25, 2025
62 of 65 checks passed
WoosukKwon (Collaborator) commented:

@benchislett @DarkLight1337 I've merged the PR and submitted a new PR #17209 to fix the test (by just lowering the bar).
I think the failure is because we changed the target model from llama 3 to llama 3.1.

jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Co-authored-by: Bryan Lu <yuzhelu@amazon.com>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Co-authored-by: Bryan Lu <yuzhelu@amazon.com>
adobrzyn pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 30, 2025
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Co-authored-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Co-authored-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
lfopensource commented:

exciting feature

fan-niu commented May 22, 2025

@benchislett Hi, this feature is so cool. I ran v0.8.5.post1 and EAGLE3 is broken with Llama3-70B; could you help look into this issue? Thanks a lot.

#18452

zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Co-authored-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Jun 20, 2025
### What this PR does / why we need it?
This PR implements the Eagle Proposer feature for vLLM v1, which enables
more efficient speculative decoding by using a draft model to predict
potential future tokens.
- The implementation includes the core Eagle algorithm integration with
vLLM's existing architecture, allowing for faster inference while
maintaining output quality.
- This is needed to significantly improve the generation speed of large
language models without compromising on the quality of generated text.

### Does this PR introduce any user-facing change?
Yes, this PR introduces a new speculative decoding mode that can be
enabled via configuration.
- Users can now choose to use Eagle Proposer by setting appropriate flags
in the inference configuration.
- The API remains backward compatible, with the new functionality being
opt-in.

### How was this patch tested?
CI passed with new unit tests added for the Eagle Proposer functionality.
- Benchmark tests were conducted comparing generation speed and quality
with and without Eagle Proposer.
- Integration tests were performed with various model architectures to
ensure compatibility.
- Manual testing was done using different prompt scenarios to verify
output quality remains consistent.
- We tested the acceptance rate on one Ascend 910B NPU; the results are
basically consistent with those shown here:
vllm-project/vllm#16937
- Currently, we support scenarios where num_spec_tokens <= 2. When
num_spec_tokens > 2, issues such as insufficient GPU memory and operator
computation errors may occur. We will address this in subsequent
updates.
- We will add support for Eagle v1 in future updates.

### Acceptance Test Script
```bash
SCRIPT="/offline/eagle.py"
DATASET="ShareGpt"
MODEL=Meta-Llama-3.1-8B-Instruct
DRAFT=EAGLE3-LLaMA3.1-Instruct-8B

CUDA_VISIBLE_DEVICES="0" VLLM_USE_V1=1 $PYTHON $SCRIPT \
    --dataset $DATASET \
    --num_spec_tokens 2 \
    --max_num_seqs 1 \
    --model_dir $MODEL \
    --eagle_dir $DRAFT \
    --tp 1 \
    --num_prompts 80
```
### Acceptance Test Results
```bash
██████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [21:22<00:00, 16.03s/it, est. speed input: 4.72 toks/s, output: 13.56 toks/s]
-------------------------------------------------------------------------------------
mean acceptance length: 1.63
-------------------------------------------------------------------------------------
total_counts: 8062
acceptance at token 0: 1.00 (8062 times)
acceptance at token 1: 0.70 (5612 times)
acceptance at token 2: 0.47 (3765 times)
```

Closes: #1004

---------

Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Co-authored-by: Bryan Lu <yuzhelu@amazon.com>
Signed-off-by: minpeter <kali2005611@gmail.com>