
Conversation

@ekagra-ranjan (Contributor) commented May 28, 2025

Merge after #18511 so that CustomDataset can be used.

Currently, the offline benchmark only supports EAGLE via offline_inference/eagle.py, and only the MT-Bench dataset, which users have to download and place in the right directory themselves.

This PR improves the benchmark setup:

  • Adds offline_inference/spec_decode.py, which supports ngram as well. This makes computing the acceptance length (AL) of ngram as easy as it is for EAGLE. I am not sure if I should delete offline_inference/eagle.py in this PR.
  • Expands the datasets supported by offline spec decode from MT-Bench to the full set in benchmarks/datasets.py, which is already used in online benchmarking.
  • Moves the parser from benchmarks/serve.py to benchmarks/datasets.py so that it can be shared by offline and online benchmarking.
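The parser-sharing idea in the last bullet can be sketched roughly like this (the function name and exact flags below are illustrative, not the actual benchmarks/datasets.py API):

```python
import argparse

def add_dataset_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Attach dataset flags once, so offline and online entry points share one definition."""
    parser.add_argument("--dataset-name", type=str, default="hf",
                        help="Dataset source, e.g. 'hf' for a Hugging Face dataset.")
    parser.add_argument("--dataset-path", type=str, default=None,
                        help="Dataset identifier, e.g. 'philschmid/mt-bench'.")
    parser.add_argument("--num-prompts", type=int, default=80,
                        help="Number of prompts to sample from the dataset.")
    return parser

# Both the online client and the offline script would build on the same flags.
offline_parser = add_dataset_args(argparse.ArgumentParser("offline spec decode"))
args = offline_parser.parse_args(
    ["--dataset-path", "philschmid/mt-bench", "--num-prompts", "80"])
print(args.dataset_path, args.num_prompts)  # philschmid/mt-bench 80
```

Keeping the flag definitions in one place means a new dataset option lands in both benchmarks at once, instead of drifting between two copies of the parser.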

Testing

online benchmark

server

VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --disable-log-requests --port 9001 \
  --speculative_config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 2}'

client

vllm bench serve --port 9001 --save-result --save-detailed \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --endpoint-type openai-chat \
    --endpoint /v1/chat/completions \
    --dataset-name hf \
    --dataset-path philschmid/mt-bench \
    --num-prompts 80 \
    --max-concurrency 64 \
    --result-dir "./log/EAGLE-1"

Before this PR

Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:06<00:00, 12.87it/s]
============ Serving Benchmark Result ============
Successful requests:                     80        
Benchmark duration (s):                  6.21      
Total input tokens:                      8133      
Total generated tokens:                  16700     
Request throughput (req/s):              12.87     
Output token throughput (tok/s):         2687.44   
Total Token throughput (tok/s):          3996.24   
---------------Time to First Token----------------
Mean TTFT (ms):                          54.64     
Median TTFT (ms):                        30.79     
P99 TTFT (ms):                           153.55    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          5.05      
Median TPOT (ms):                        4.89      
P99 TPOT (ms):                           7.21      
---------------Inter-token Latency----------------
Mean ITL (ms):                           10.15     
Median ITL (ms):                         10.21     
P99 ITL (ms):                            11.66     
==================================================

After this PR

Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:06<00:00, 12.87it/s]
============ Serving Benchmark Result ============
Successful requests:                     80        
Benchmark duration (s):                  6.21      
Total input tokens:                      8133      
Total generated tokens:                  16700     
Request throughput (req/s):              12.87     
Output token throughput (tok/s):         2687.36   
Total Token throughput (tok/s):          3996.12   
---------------Time to First Token----------------
Mean TTFT (ms):                          55.14     
Median TTFT (ms):                        30.71     
P99 TTFT (ms):                           156.32    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          5.04      
Median TPOT (ms):                        4.89      
P99 TPOT (ms):                           7.17      
---------------Inter-token Latency----------------
Mean ITL (ms):                           10.14     
Median ITL (ms):                         10.26     
P99 ITL (ms):                            11.91     
==================================================

offline benchmark

# eagle

# instruct coder
time VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py \
    --method eagle --num_spec_tokens 3 \
    --max_num_seqs 64 --tp 1 --draft_tp 1 \
    --dataset-name hf --dataset-path likaixin/InstructCoder \
    --num-prompts 1000 --print-output

# mtbench
time VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py \
    --method eagle --num_spec_tokens 3 \
    --max_num_seqs 64 --tp 1 --draft_tp 1 \
    --dataset-name hf --dataset-path philschmid/mt-bench \
    --num-prompts 80 --print-output


# ngram

# instruct coder
time VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py \
    --method ngram --num_spec_tokens 3 \
    --prompt_lookup_max 5 --prompt_lookup_min 2 \
    --max_num_seqs 64 --tp 1 --draft_tp 1 \
    --dataset-name hf --dataset-path likaixin/InstructCoder \
    --num-prompts 1000 --print-output

# mtbench
time VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py \
    --method ngram --num_spec_tokens 3 \
    --prompt_lookup_max 5 --prompt_lookup_min 2 \
    --max_num_seqs 64 --tp 1 --draft_tp 1 \
    --dataset-name hf --dataset-path philschmid/mt-bench \
    --num-prompts 80 --print-output
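For readers new to spec decoding: the acceptance length (AL) these runs report is the mean number of tokens emitted per decoding step. A minimal sketch of the computation, using hypothetical per-step counters rather than vLLM's actual metric names:

```python
def mean_acceptance_length(accepted_per_step, num_spec_tokens):
    """Mean tokens emitted per step: 1 target-model token plus accepted draft tokens.

    accepted_per_step: accepted draft-token count for each decoding step,
    each between 0 and num_spec_tokens.
    """
    assert all(0 <= a <= num_spec_tokens for a in accepted_per_step)
    return 1 + sum(accepted_per_step) / len(accepted_per_step)

# With --num_spec_tokens 3, AL lies in [1, 4]. E.g. three steps accepting
# 3, 1, and 2 draft tokens respectively:
al = mean_acceptance_length([3, 1, 2], num_spec_tokens=3)
print(al)  # 3.0
```

An AL near 1 means the drafts are almost never accepted; an AL near num_spec_tokens + 1 means the draft method is predicting the target model well.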


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the documentation Improvements or additions to documentation label May 28, 2025
@ekagra-ranjan (Contributor, Author)

pre-commit is failing because CustomDataset is not defined. It will get fixed when #18511 gets merged.

@ekagra-ranjan (Contributor, Author)

I am not sure if I should delete offline_inference/eagle.py in this PR.

@WoosukKwon WoosukKwon added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 3, 2025
@ekagra-ranjan (Contributor, Author)

@WoosukKwon - the tests have passed.

@ZhongYingMatrix (Contributor)

On an off-topic note, what is the difference between vllm bench serve and python benchmarks/benchmark_serving.py?

@ekagra-ranjan (Contributor, Author)

@ZhongYingMatrix - the benchmarks used to live outside the vllm folder, so importing them as a vllm module was not possible; that is why there has been a migration effort to move them inside vllm. Going forward, the plan is to use the in-package version and deprecate the external one. I am not aware of the timeline, though.

@ekagra-ranjan (Contributor, Author)

@WoosukKwon - is there anything else needed before we merge this?

@WoosukKwon (Collaborator) left a comment

LGTM! Thanks for the refactoring!

@WoosukKwon WoosukKwon merged commit 017ef64 into vllm-project:main Jun 12, 2025
65 of 66 checks passed
kouroshHakha pushed a commit to kouroshHakha/vllm that referenced this pull request Jun 18, 2025
…more methods and datasets (vllm-project#18847)

Signed-off-by: kouroshhakha <kourosh@anyscale.com>
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
…more methods and datasets (vllm-project#18847)

Signed-off-by: minpeter <kali2005611@gmail.com>
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
…more methods and datasets (vllm-project#18847)

Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
googlercolin pushed a commit to googlercolin/vllm that referenced this pull request Aug 29, 2025
Labels

  • documentation — Improvements or additions to documentation
  • ready — ONLY add when PR is ready to merge/full CI is needed
  • speculative-decoding
4 participants