
Conversation

@ekagra-ranjan (Contributor) commented May 30, 2025

So far, InstructCoder #15303 is the only dataset where we can benchmark a realistic editing task where ngram shines. Earlier, we only got a 1.35x gain even though the ngram matches were supposed to be high since we are editing code (see the dataset).

On investigation, I found that the ngram matches are not happening because the prompt is not set up correctly, so the model doesn't output the edited code but instead continues the old code or writes an essay about it. To get high matches, I changed the prompt to:

  1. put the code first and then the instruction,
  2. wrap it with the chat template so that the special tokens prompt the model to respond rather than continue the previous code or instruction, and
  3. add the instruction "Just output the code, do not include any explanation.", which further increased the ngram match since the model was still writing a few lines about its changes before it wrote the code.

I even tried being more descriptive, but it gave the same speedup, i.e.,

prompt = f"This is the input code:\n\n{item['input']}\n\nInstruction: {item['instruction']}\nJust give the output code without any explanation."

With these changes, we go from a 1.35x gain to a 1.95x gain.
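
As a concrete illustration, here is a minimal sketch (not the benchmark harness itself; the example item is made up, while the 'input'/'instruction' field names follow the InstructCoder dataset) of turning one item into a single user message so that the server-side chat template wraps it with the model's special tokens:

from openai import OpenAI

# Points at the vllm serve instance started below on port 9001.
client = OpenAI(base_url="http://localhost:9001/v1", api_key="EMPTY")

def build_prompt(item: dict) -> str:
    # Code first, then the instruction, then an explicit "code only" directive.
    return (
        f"{item['input']}\n\n"
        f"{item['instruction']}\n"
        "Just output the code, do not include any explanation."
    )

# Hypothetical InstructCoder-style item, for illustration only.
item = {
    "input": "def add(a, b):\n    return a + b",
    "instruction": "Rename the function to sum_two.",
}

# Sending it via /v1/chat/completions lets the server apply the model's chat
# template, so the special tokens cue the model to reply with the edited code
# instead of continuing the prompt.
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": build_prompt(item)}],
)
print(resp.choices[0].message.content)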

Vanilla

start server

VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --disable-log-requests --port 9001 

bench

time vllm bench serve --port 9001 --save-result --save-detailed \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --endpoint-type openai-chat \
    --endpoint /v1/chat/completions \
    --dataset-name hf \
    --dataset-path likaixin/InstructCoder \
    --num-prompts 1000 \
    --max-concurrency 4 \
    --result-dir "./log/ngram-vanilla-instruct-coder"

Result

============ Serving Benchmark Result ============                                                                                                                                      
Successful requests:                     1000                                                                                                                                           
Benchmark duration (s):                  206.88                                                                                                                                         
Total input tokens:                      173622                                                                                                                                         
Total generated tokens:                  109411                                                                                                                                         
Request throughput (req/s):              4.83                                                                                                                                           
Output token throughput (tok/s):         528.86     
Total Token throughput (tok/s):          1368.09    
---------------Time to First Token----------------  
Mean TTFT (ms):                          23.53      
Median TTFT (ms):                        22.78     
P99 TTFT (ms):                           35.31      
-----Time per Output Token (excl. 1st token)------  
Mean TPOT (ms):                          7.39       
Median TPOT (ms):                        7.39       
P99 TPOT (ms):                           7.47       
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.32       
Median ITL (ms):                         7.35       
P99 ITL (ms):                            9.43       
==================================================

Ngram

start server

VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --disable-log-requests --port 9001 \
  --speculative_config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 5, "prompt_lookup_min": 2}'
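
For reference, the same speculative config can be exercised offline. This is a minimal sketch, assuming a vLLM build where LLM accepts speculative_config as a dict (mirroring the --speculative_config flag above); the example prompt is made up:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "ngram",            # draft tokens come from n-gram lookup in the prompt
        "num_speculative_tokens": 5,  # propose up to 5 draft tokens per step
        "prompt_lookup_max": 5,       # longest n-gram to match against the prompt
        "prompt_lookup_min": 2,       # shortest n-gram that still counts as a match
    },
)
# Editing prompts repeat large chunks of the input code, which is exactly
# where prompt-lookup drafting gets high acceptance rates.
prompt = "def add(a, b):\n    return a + b\n\nRename the function to sum_two.\nJust output the code."
out = llm.generate([prompt], SamplingParams(temperature=0, max_tokens=128))
print(out[0].outputs[0].text)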

bench

time vllm bench serve --port 9001 --save-result --save-detailed \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --endpoint-type openai-chat \
    --endpoint /v1/chat/completions \
    --dataset-name hf \
    --dataset-path likaixin/InstructCoder \
    --num-prompts 1000 \
    --max-concurrency 4 \
    --result-dir "./log/ngram-instruct-coder"

Before this PR

============ Serving Benchmark Result ============                                                                                                                                      
Successful requests:                     1000                                                                                                                                           
Benchmark duration (s):                  179.88                                                                                                                                         
Total input tokens:                      128569    
Total generated tokens:                  127760    
Request throughput (req/s):              5.56       
Output token throughput (tok/s):         710.26    
Total Token throughput (tok/s):          1425.01   
---------------Time to First Token----------------
Mean TTFT (ms):                          23.95      
Median TTFT (ms):                        23.37      
P99 TTFT (ms):                           33.67      
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          5.47       
Median TPOT (ms):                        5.75       
P99 TPOT (ms):                           7.27       
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.58       
Median ITL (ms):                         7.71       
P99 ITL (ms):                            9.46       
==================================================

After this PR

============ Serving Benchmark Result ============                                                                                                                                      
Successful requests:                     1000                                                                                                                                           
Benchmark duration (s):                  110.43                                                                                                                                         
Total input tokens:                      173622                                                                                                                                         
Total generated tokens:                  109484                                                                                                                                         
Request throughput (req/s):              9.06                                                                                                                                           
Output token throughput (tok/s):         991.43                                                                                                                                         
Total Token throughput (tok/s):          2563.65                                                                                                                                        
---------------Time to First Token----------------                                                                                                                                      
Mean TTFT (ms):                          25.08                                                                                                                                          
Median TTFT (ms):                        24.05                                                                                                                                          
P99 TTFT (ms):                           37.84                                                                                                                                          
-----Time per Output Token (excl. 1st token)------                                                                                                                                      
Mean TPOT (ms):                          3.79                                                                                                                                           
Median TPOT (ms):                        3.60                                                                                                                                           
P99 TPOT (ms):                           7.20                                                                                                                                           
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.69       
Median ITL (ms):                         7.79       
P99 ITL (ms):                            9.89       
==================================================

All benchmarks were run at batch size 4 (--max-concurrency 4).
TPOT (mean; the speedup ratios are spelled out after the list):

  • vanilla: 7.39ms
  • ngram before: 5.47ms (1.35x)
  • ngram after: 3.79ms (1.95x)
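
The reported speedups are just the ratio of mean TPOT against the vanilla run:

# Quick check of the reported speedups (mean TPOT in ms, from the tables above).
vanilla, ngram_before, ngram_after = 7.39, 5.47, 3.79
print(f"{vanilla / ngram_before:.2f}x")  # ~1.35x (before this PR)
print(f"{vanilla / ngram_after:.2f}x")   # ~1.95x (after this PR)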

cc: @CXIAAAAA @LiuXiaoxuanPKU


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀


@ekagra-ranjan (Contributor, Author) commented May 30, 2025

Sharing for posterity: I just found that InstructCoder already has prompt templates similar to what this PR proposes. However, they are for Alpaca and use Alpaca-specific plain-text formatting such as ### Response. Modern LLMs have special chat tokens that prompt the model to respond, applied with tokenizer.apply_chat_template, so I think the proposed template is fine.
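
To make the contrast concrete, a small sketch (assuming access to the gated Llama 3.1 tokenizer on Hugging Face) of what tokenizer.apply_chat_template adds around the prompt; these model-specific special tokens, rather than Alpaca-style plain-text markers like ### Response, are what cue the model to start its reply:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
rendered = tok.apply_chat_template(
    [{"role": "user", "content": "print('hi')\n\nRename the variable.\nJust output the code."}],
    tokenize=False,
    add_generation_prompt=True,  # appends the assistant header tokens so the model starts its answer
)
print(rendered)  # shows the <|start_header_id|>...<|end_header_id|> style wrapping around the message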

@WoosukKwon merged commit 135cf55 into vllm-project:main on Jun 3, 2025
12 of 13 checks passed