
Conversation

@ekagra-ranjan (Contributor) commented May 28, 2025

Merge after #18511 so that CustomDataset can be used.

Currently, the offline benchmark only supports EAGLE via offline_inference/eagle.py, and only the MT-Bench dataset, which users have to download and place in the right directory themselves.

This PR improves the benchmark setup:

  • Adds offline_inference/spec_decode.py, which supports ngram as well. This makes computing the acceptance length (AL) of ngram as easy as it is for EAGLE. I am not sure if I should delete offline_inference/eagle.py in this PR.
  • Expands the datasets supported by offline spec decode from MT-Bench to the full set in benchmarks/datasets.py, which is already used in online benchmarking.
  • Moves the parser from benchmarks/serve.py to benchmarks/datasets.py so that it can be shared by offline and online benchmarking.
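The parser-sharing idea in the last bullet can be sketched roughly like this (the function name and exact flags below are illustrative, not the actual benchmarks/datasets.py API):

```python
import argparse

def add_dataset_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Attach dataset flags once, so offline and online entry points share one definition."""
    parser.add_argument("--dataset-name", type=str, default="hf",
                        help="Dataset source, e.g. 'hf' for a Hugging Face dataset.")
    parser.add_argument("--dataset-path", type=str, default=None,
                        help="Dataset identifier, e.g. 'philschmid/mt-bench'.")
    parser.add_argument("--num-prompts", type=int, default=80,
                        help="Number of prompts to sample from the dataset.")
    return parser

# Both the online client and the offline script would build on the same flags.
offline_parser = add_dataset_args(argparse.ArgumentParser("offline spec decode"))
args = offline_parser.parse_args(
    ["--dataset-path", "philschmid/mt-bench", "--num-prompts", "80"])
print(args.dataset_path, args.num_prompts)  # philschmid/mt-bench 80
```

Keeping the flag definitions in one place means a new dataset option lands in both benchmarks at once, instead of drifting between two copies of the parser.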

Testing

online benchmark

server

VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --disable-log-requests --port 9001 \
  --speculative_config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 2}'

client

vllm bench serve --port 9001 --save-result --save-detailed \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --endpoint-type openai-chat \
    --endpoint /v1/chat/completions \
    --dataset-name hf \
    --dataset-path philschmid/mt-bench \
    --num-prompts 80 \
    --max-concurrency 64 \
    --result-dir "./log/EAGLE-1"

Before this PR

Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:06<00:00, 12.87it/s]
============ Serving Benchmark Result ============
Successful requests:                     80        
Benchmark duration (s):                  6.21      
Total input tokens:                      8133      
Total generated tokens:                  16700     
Request throughput (req/s):              12.87     
Output token throughput (tok/s):         2687.44   
Total Token throughput (tok/s):          3996.24   
---------------Time to First Token----------------
Mean TTFT (ms):                          54.64     
Median TTFT (ms):                        30.79     
P99 TTFT (ms):                           153.55    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          5.05      
Median TPOT (ms):                        4.89      
P99 TPOT (ms):                           7.21      
---------------Inter-token Latency----------------
Mean ITL (ms):                           10.15     
Median ITL (ms):                         10.21     
P99 ITL (ms):                            11.66     
==================================================

After this PR

Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:06<00:00, 12.87it/s]
============ Serving Benchmark Result ============
Successful requests:                     80        
Benchmark duration (s):                  6.21      
Total input tokens:                      8133      
Total generated tokens:                  16700     
Request throughput (req/s):              12.87     
Output token throughput (tok/s):         2687.36   
Total Token throughput (tok/s):          3996.12   
---------------Time to First Token----------------
Mean TTFT (ms):                          55.14     
Median TTFT (ms):                        30.71     
P99 TTFT (ms):                           156.32    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          5.04      
Median TPOT (ms):                        4.89      
P99 TPOT (ms):                           7.17      
---------------Inter-token Latency----------------
Mean ITL (ms):                           10.14     
Median ITL (ms):                         10.26     
P99 ITL (ms):                            11.91     
==================================================

offline benchmark

# eagle

# instruct coder
time VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py \
    --method eagle --num_spec_tokens 3 \
    --max_num_seqs 64 --tp 1 --draft_tp 1 \
    --dataset-name hf --dataset-path likaixin/InstructCoder \
    --num-prompts 1000 --print-output

# mtbench
time VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py \
    --method eagle --num_spec_tokens 3 \
    --max_num_seqs 64 --tp 1 --draft_tp 1 \
    --dataset-name hf --dataset-path philschmid/mt-bench \
    --num-prompts 80 --print-output


# ngram

# instruct coder
time VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py \
    --method ngram --num_spec_tokens 3 \
    --prompt_lookup_max 5 --prompt_lookup_min 2 \
    --max_num_seqs 64 --tp 1 --draft_tp 1 \
    --dataset-name hf --dataset-path likaixin/InstructCoder \
    --num-prompts 1000 --print-output

# mtbench
time VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py \
    --method ngram --num_spec_tokens 3 \
    --prompt_lookup_max 5 --prompt_lookup_min 2 \
    --max_num_seqs 64 --tp 1 --draft_tp 1 \
    --dataset-name hf --dataset-path philschmid/mt-bench \
    --num-prompts 80 --print-output
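For readers new to spec decoding: the acceptance length (AL) these runs report is the mean number of tokens emitted per decoding step. A minimal sketch of the computation, using hypothetical per-step counters rather than vLLM's actual metric names:

```python
def mean_acceptance_length(accepted_per_step, num_spec_tokens):
    """Mean tokens emitted per step: 1 target-model token plus accepted draft tokens.

    accepted_per_step: accepted draft-token count for each decoding step,
    each between 0 and num_spec_tokens.
    """
    assert all(0 <= a <= num_spec_tokens for a in accepted_per_step)
    return 1 + sum(accepted_per_step) / len(accepted_per_step)

# With --num_spec_tokens 3, AL lies in [1, 4]. E.g. three steps accepting
# 3, 1, and 2 draft tokens respectively:
al = mean_acceptance_length([3, 1, 2], num_spec_tokens=3)
print(al)  # 3.0
```

An AL near 1 means the drafts are almost never accepted; an AL near num_spec_tokens + 1 means the draft method is predicting the target model well.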


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the documentation Improvements or additions to documentation label May 28, 2025
@ekagra-ranjan (Contributor, Author)

pre-commit is failing because CustomDataset is not defined. It will get fixed when #18511 gets merged.

@ekagra-ranjan (Contributor, Author)

I am not sure if I should delete offline_inference/eagle.py in this PR.

@WoosukKwon WoosukKwon added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 3, 2025
@ekagra-ranjan (Contributor, Author)

@WoosukKwon - the tests have passed.

@ZhongYingMatrix (Contributor)

On an off-topic note, what is the difference between vllm bench serve and python benchmarks/benchmark_serving.py?

@ekagra-ranjan (Contributor, Author)

@ZhongYingMatrix - the benchmarks used to live outside the vllm folder, so importing them as a vllm module was not possible; that is why there has been a migration effort to move them inside vllm. Going forward, the plan is to use the in-package version and deprecate the external one. I am not aware of the timeline, though.

@ekagra-ranjan (Contributor, Author)

@WoosukKwon - is there anything else needed before we merge this?

@WoosukKwon (Collaborator) left a comment

LGTM! Thanks for the refactoring!

@WoosukKwon WoosukKwon merged commit 017ef64 into vllm-project:main Jun 12, 2025
65 of 66 checks passed
kouroshHakha pushed a commit to kouroshHakha/vllm that referenced this pull request Jun 18, 2025
…more methods and datasets (vllm-project#18847)

Signed-off-by: kouroshhakha <kourosh@anyscale.com>
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
…more methods and datasets (vllm-project#18847)

Signed-off-by: minpeter <kali2005611@gmail.com>
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
…more methods and datasets (vllm-project#18847)

Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
googlercolin pushed a commit to googlercolin/vllm that referenced this pull request Aug 29, 2025
Labels

  • documentation — Improvements or additions to documentation
  • ready — ONLY add when PR is ready to merge/full CI is needed
  • speculative-decoding
4 participants