
Conversation

@ekagra-ranjan (Contributor) commented May 30, 2025

So far, InstructCoder #15303 is the only dataset where we can benchmark a realistic editing task where ngram shines. Earlier, we only got a 1.35x gain even though the ngram matches were supposed to be high since we are editing code (see the dataset).

On investigation, I found that the ngram matches are not happening because the prompt is not set up correctly, so the model doesn't output the edited code but instead continues the old code or writes an essay about it. To get high matches, I changed the prompt to:

  1. put the code first and then the instruction,
  2. wrap it with the chat template so that the special tokens prompt the model to respond rather than continue the previous code or instruction, and
  3. add the instruction "Just output the code, do not include any explanation.", which further increased the ngram match since the model was still writing a few lines about its changes before it wrote the code.

I even tried being more descriptive, but it gave the same speedup, i.e.,

prompt = f"This is the input code:\n\n{item['input']}\n\nInstruction: {item['instruction']}\nJust give the output code without any explanation."

With these changes, we go from a 1.35x gain to a 1.95x gain.
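
As a concrete illustration, here is a minimal sketch (not the benchmark harness itself; the example item is made up, while the 'input'/'instruction' field names follow the InstructCoder dataset) of turning one item into a single user message so that the server-side chat template wraps it with the model's special tokens:

from openai import OpenAI

# Points at the vllm serve instance started below on port 9001.
client = OpenAI(base_url="http://localhost:9001/v1", api_key="EMPTY")

def build_prompt(item: dict) -> str:
    # Code first, then the instruction, then an explicit "code only" directive.
    return (
        f"{item['input']}\n\n"
        f"{item['instruction']}\n"
        "Just output the code, do not include any explanation."
    )

# Hypothetical InstructCoder-style item, for illustration only.
item = {
    "input": "def add(a, b):\n    return a + b",
    "instruction": "Rename the function to sum_two.",
}

# Sending it via /v1/chat/completions lets the server apply the model's chat
# template, so the special tokens cue the model to reply with the edited code
# instead of continuing the prompt.
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": build_prompt(item)}],
)
print(resp.choices[0].message.content)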

Vanilla

start server

VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --disable-log-requests --port 9001 

bench

time vllm bench serve --port 9001 --save-result --save-detailed \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --endpoint-type openai-chat \
    --endpoint /v1/chat/completions \
    --dataset-name hf \
    --dataset-path likaixin/InstructCoder \
    --num-prompts 1000 \
    --max-concurrency 4 \
    --result-dir "./log/ngram-vanilla-instruct-coder"

Result

============ Serving Benchmark Result ============                                                                                                                                      
Successful requests:                     1000                                                                                                                                           
Benchmark duration (s):                  206.88                                                                                                                                         
Total input tokens:                      173622                                                                                                                                         
Total generated tokens:                  109411                                                                                                                                         
Request throughput (req/s):              4.83                                                                                                                                           
Output token throughput (tok/s):         528.86     
Total Token throughput (tok/s):          1368.09    
---------------Time to First Token----------------  
Mean TTFT (ms):                          23.53      
Median TTFT (ms):                        22.78     
P99 TTFT (ms):                           35.31      
-----Time per Output Token (excl. 1st token)------  
Mean TPOT (ms):                          7.39       
Median TPOT (ms):                        7.39       
P99 TPOT (ms):                           7.47       
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.32       
Median ITL (ms):                         7.35       
P99 ITL (ms):                            9.43       
==================================================

Ngram

start server

VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --disable-log-requests --port 9001 \
  --speculative_config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 5, "prompt_lookup_min": 2}'
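
For reference, the same speculative config can be exercised offline. This is a minimal sketch, assuming a vLLM build where LLM accepts speculative_config as a dict (mirroring the --speculative_config flag above); the example prompt is made up:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "ngram",            # draft tokens come from n-gram lookup in the prompt
        "num_speculative_tokens": 5,  # propose up to 5 draft tokens per step
        "prompt_lookup_max": 5,       # longest n-gram to match against the prompt
        "prompt_lookup_min": 2,       # shortest n-gram that still counts as a match
    },
)
# Editing prompts repeat large chunks of the input code, which is exactly
# where prompt-lookup drafting gets high acceptance rates.
prompt = "def add(a, b):\n    return a + b\n\nRename the function to sum_two.\nJust output the code."
out = llm.generate([prompt], SamplingParams(temperature=0, max_tokens=128))
print(out[0].outputs[0].text)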

bench

time vllm bench serve --port 9001 --save-result --save-detailed \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --endpoint-type openai-chat \
    --endpoint /v1/chat/completions \
    --dataset-name hf \
    --dataset-path likaixin/InstructCoder \
    --num-prompts 1000 \
    --max-concurrency 4 \
    --result-dir "./log/ngram-instruct-coder"

Before this PR

============ Serving Benchmark Result ============                                                                                                                                      
Successful requests:                     1000                                                                                                                                           
Benchmark duration (s):                  179.88                                                                                                                                         
Total input tokens:                      128569    
Total generated tokens:                  127760    
Request throughput (req/s):              5.56       
Output token throughput (tok/s):         710.26    
Total Token throughput (tok/s):          1425.01   
---------------Time to First Token----------------
Mean TTFT (ms):                          23.95      
Median TTFT (ms):                        23.37      
P99 TTFT (ms):                           33.67      
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          5.47       
Median TPOT (ms):                        5.75       
P99 TPOT (ms):                           7.27       
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.58       
Median ITL (ms):                         7.71       
P99 ITL (ms):                            9.46       
==================================================

After this PR

============ Serving Benchmark Result ============                                                                                                                                      
Successful requests:                     1000                                                                                                                                           
Benchmark duration (s):                  110.43                                                                                                                                         
Total input tokens:                      173622                                                                                                                                         
Total generated tokens:                  109484                                                                                                                                         
Request throughput (req/s):              9.06                                                                                                                                           
Output token throughput (tok/s):         991.43                                                                                                                                         
Total Token throughput (tok/s):          2563.65                                                                                                                                        
---------------Time to First Token----------------                                                                                                                                      
Mean TTFT (ms):                          25.08                                                                                                                                          
Median TTFT (ms):                        24.05                                                                                                                                          
P99 TTFT (ms):                           37.84                                                                                                                                          
-----Time per Output Token (excl. 1st token)------                                                                                                                                      
Mean TPOT (ms):                          3.79                                                                                                                                           
Median TPOT (ms):                        3.60                                                                                                                                           
P99 TPOT (ms):                           7.20                                                                                                                                           
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.69       
Median ITL (ms):                         7.79       
P99 ITL (ms):                            9.89       
==================================================

All benchmarks were run at batch size 4 (--max-concurrency 4).
TPOT (mean; the speedup ratios are spelled out after the list):

  • vanilla: 7.39ms
  • ngram before: 5.47ms (1.35x)
  • ngram after: 3.79ms (1.95x)
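
The reported speedups are just the ratio of mean TPOT against the vanilla run:

# Quick check of the reported speedups (mean TPOT in ms, from the tables above).
vanilla, ngram_before, ngram_after = 7.39, 5.47, 3.79
print(f"{vanilla / ngram_before:.2f}x")  # ~1.35x (before this PR)
print(f"{vanilla / ngram_after:.2f}x")   # ~1.95x (after this PR)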

cc: @CXIAAAAA @LiuXiaoxuanPKU


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀


@ekagra-ranjan (Contributor, Author) commented May 30, 2025

Sharing for posterity: I just found that InstructCoder already has prompt templates similar to what this PR proposes. However, they are for Alpaca and use Alpaca-specific plain-text formatting such as ### Response. Modern LLMs have special chat tokens that prompt the model to respond, applied with tokenizer.apply_chat_template, so I think the proposed template is fine.
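
To make the contrast concrete, a small sketch (assuming access to the gated Llama 3.1 tokenizer on Hugging Face) of what tokenizer.apply_chat_template adds around the prompt; these model-specific special tokens, rather than Alpaca-style plain-text markers like ### Response, are what cue the model to start its reply:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
rendered = tok.apply_chat_template(
    [{"role": "user", "content": "print('hi')\n\nRename the variable.\nJust output the code."}],
    tokenize=False,
    add_generation_prompt=True,  # appends the assistant header tokens so the model starts its answer
)
print(rendered)  # shows the <|start_header_id|>...<|end_header_id|> style wrapping around the message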

@WoosukKwon merged commit 135cf55 into vllm-project:main on Jun 3, 2025
12 of 13 checks passed