[V1] [Spec Decode] Support random sampling for spec decode #13933


Merged — 44 commits merged into vllm-project:main on Mar 17, 2025

Conversation

LiuXiaoxuanPKU
Collaborator

@LiuXiaoxuanPKU LiuXiaoxuanPKU commented Feb 26, 2025

After syncing with @WoosukKwon, we changed the scope of this PR:

  1. We will support random sampling for spec decode in this PR.
  2. Since only ngram is supported in vLLM V1, we only support ngram random sampling for now. However, the random sampling support should be general to other drafting methods.
  3. The PR should support mixed-batch cases, where some requests within the same batch perform spec decode and others do not.
  4. Spec decode is compatible with random sampling, but is not compatible with top_p/top_k sampling. We will disable spec decode if a request requires top_p/top_k sampling (see the gating sketch after this list).
  5. We will give a clearer definition of recovered token ids and bonus token ids.
  6. We will create new test cases for the V1 rejection sampler instead of reusing V0 tests, for cleaner separation.
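As a rough illustration of item 4, the per-request gate could look like the sketch below. This is a hypothetical helper written for this discussion; the function name and the sentinel values treated as "disabled" are assumptions, not vLLM's actual code.

```python
# Hypothetical per-request gate (illustrative sketch, not vLLM's actual code).
def can_use_spec_decode(top_p: float, top_k: int) -> bool:
    # Plain greedy or random sampling works with rejection sampling;
    # requests that restrict the distribution via top_p/top_k fall back
    # to normal decoding without draft tokens.
    uses_top_p = top_p < 1.0
    uses_top_k = top_k not in (0, -1)  # assumed "disabled" sentinels
    return not (uses_top_p or uses_top_k)
```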

This PR tries to:
1. Support random sampling in the rejection sampler. This should be general to different drafting methods, not limited to ngram spec decode; a conceptual sketch of the accept/reject rule is given after the list below.
2. Clean up and reuse the rejection sampling tests from V0.

This PR does not:
1. Change the model runner to use the rejection sampler with random sampling. We need one extra PR to support ngram with random sampling.
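For readers less familiar with the technique, the accept/reject rule behind rejection sampling for speculative decoding can be sketched as follows. This is a minimal NumPy illustration of the standard algorithm under assumed inputs; the function and variable names are made up for the example, and this is not the PR's actual implementation.

```python
# Minimal sketch of rejection sampling for speculative decoding
# (illustrative only; not the PR's actual implementation).
import numpy as np

def rejection_sample(draft_tokens, draft_probs, target_probs, rng=None):
    """Accept or reject k draft tokens against the target model.

    draft_tokens: length-k sequence of proposed token ids
    draft_probs:  (k, vocab_size) draft-model probabilities
    target_probs: (k, vocab_size) target-model probabilities
    Returns the accepted prefix; on the first rejection a recovered
    token is appended and drafting stops for this request.
    """
    rng = rng or np.random.default_rng()
    output = []
    for i, tok in enumerate(draft_tokens):
        p_target, p_draft = target_probs[i, tok], draft_probs[i, tok]
        # Accept the draft token with probability min(1, p_target / p_draft).
        if rng.random() < min(1.0, p_target / max(p_draft, 1e-10)):
            output.append(tok)
            continue
        # Rejected: resample a "recovered" token from the residual
        # distribution proportional to max(0, p_target - p_draft).
        residual = np.clip(target_probs[i] - draft_probs[i], 0.0, None)
        residual /= residual.sum()
        output.append(int(rng.choice(len(residual), p=residual)))
        return output
    # All k drafts accepted: the caller would append a "bonus" token
    # sampled from the target distribution at the final position.
    return output
```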


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@LiuXiaoxuanPKU LiuXiaoxuanPKU marked this pull request as draft February 26, 2025 23:11
@mergify mergify bot added the v1 label Feb 26, 2025
@LiuXiaoxuanPKU LiuXiaoxuanPKU marked this pull request as ready for review February 27, 2025 07:45
@LiuXiaoxuanPKU LiuXiaoxuanPKU added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 27, 2025
@WoosukKwon
Collaborator

WoosukKwon commented Mar 15, 2025

@LiuXiaoxuanPKU As a sanity check, can you please run a simple perf benchmark? I'm just wondering if we missed anything critical.

@JaheimLee

Hi, I always get the following error after my server has been running for a long time (a whole night).

ERROR 03-16 09:11:23 [core.py:337] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 330, in run_engine_core
ERROR 03-16 09:11:23 [core.py:337]     engine_core.run_busy_loop()
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 364, in run_busy_loop
ERROR 03-16 09:11:23 [core.py:337]     outputs = step_fn()
ERROR 03-16 09:11:23 [core.py:337]               ^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 181, in step
ERROR 03-16 09:11:23 [core.py:337]     scheduler_output = self.scheduler.schedule()
ERROR 03-16 09:11:23 [core.py:337]                        ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/core/scheduler.py", line 172, in schedule
ERROR 03-16 09:11:23 [core.py:337]     new_blocks = self.kv_cache_manager.allocate_slots(
ERROR 03-16 09:11:23 [core.py:337]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/core/kv_cache_manager.py", line 243, in allocate_slots
ERROR 03-16 09:11:23 [core.py:337]     self.block_pool.cache_full_blocks(
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/core/block_pool.py", line 112, in cache_full_blocks
ERROR 03-16 09:11:23 [core.py:337]     assert blk.block_hash is None
ERROR 03-16 09:11:23 [core.py:337]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337] AssertionError

Memory is sufficient on my two 3090 24GB GPUs. My config is:

AsyncEngineArgs(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.97,
    enforce_eager=True,
    max_model_len=7000,
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    speculative_model="[ngram]",
    ngram_prompt_lookup_max=5,
    ngram_prompt_lookup_min=3,
    num_speculative_tokens=3,
    max_num_seqs=128,
    max_num_batched_tokens=2048,
    compilation_config=3,
)

@LiuXiaoxuanPKU
Collaborator Author

LiuXiaoxuanPKU commented Mar 16, 2025

I did a quick performance check.
Prompt: "Given the code below, could you add one line comment to the return line: {quick_sort_str}"
max_tokens = 1024, batch size = 1, hardware: 1x H100 80GB
Model: meta-llama/Llama-3.1-8B-Instruct

Since the outputs might differ between runs, we report throughput (tokens/s) below. T is the temperature.

[Screenshot: throughput (tokens/s) results with and without spec decode at different temperatures]
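For reference, a throughput number like the ones in the screenshot can be collected with a minimal script along these lines. This is a sketch of an assumed setup, not the exact script used for the table: quick_sort_str is a placeholder, and a spec-decode run would construct the engine with the ngram speculative settings discussed in this PR.

```python
# Minimal throughput sketch (assumed setup, not the exact benchmark script).
import time
from vllm import LLM, SamplingParams

quick_sort_str = "def quick_sort(arr): ..."  # placeholder code snippet
prompt = ("Given the code below, could you add one line comment to the "
          f"return line: {quick_sort_str}")

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=1.0, max_tokens=1024)  # T varied per row

start = time.perf_counter()
outputs = llm.generate([prompt], params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput: {generated / elapsed:.1f} tokens/s")
```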

@LiuXiaoxuanPKU
Collaborator Author

LiuXiaoxuanPKU commented Mar 16, 2025

I evaluated the quality of meta-llama/Meta-Llama-3-8B-Instruct on gsm8k with the following commands.

lm_eval --model vllm \
  --model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096" \
  --tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
  --gen_kwargs "temperature=$T" \
  --batch_size "$BATCH_SIZE"
lm_eval --model vllm \
  --model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096,speculative_model=[ngram],ngram_prompt_lookup_max=4,ngram_prompt_lookup_min=3,num_speculative_tokens=3" \
  --tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
  --gen_kwargs "temperature=$T" \
  --batch_size "$BATCH_SIZE"
| Setup | Temperature | Accuracy (flexible-extract / strict-match) |
| --- | --- | --- |
| w/o SD | 0 | 0.79 / 0.79 |
| with ngram SD | 0 | 0.77 / 0.77 |
| w/o SD | 1.0 | 0.63 / 0.65 |
| with ngram SD | 1.0 | 0.62 / 0.64 |

@LiuXiaoxuanPKU
Collaborator Author

More results on meta-llama/Llama-3.2-3B-Instruct
[Screenshot: additional results for meta-llama/Llama-3.2-3B-Instruct]

@WoosukKwon
Collaborator

@LiuXiaoxuanPKU Is the PR ready for merge?

@LiuXiaoxuanPKU
Collaborator Author

@LiuXiaoxuanPKU Is the PR ready for merge?

Yes. I checked the quality further: for greedy decoding it is steady; for random sampling it fluctuates (sometimes better, sometimes worse). Overall it looks correct to me.

Collaborator

@WoosukKwon WoosukKwon left a comment


LGTM! Thanks for the great work! 👍

@LiuXiaoxuanPKU LiuXiaoxuanPKU merged commit 8d6cf89 into vllm-project:main Mar 17, 2025
30 checks passed
DefTruth pushed a commit to DefTruth/vllm that referenced this pull request Mar 17, 2025
…ect#13933)

Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: DefTruth <31974251+DefTruth@users.noreply.github.com>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
…ect#13933)

Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
…ect#13933)

Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
@JuntongMa

> I evaluate the quality of meta-llama/Meta-Llama-3-8B-Instruct on gsm8k with this. […]

May I ask: can the OpenAI-compatible serving engine on V1 run with ngram now? After configuring --speculative-config '{"num_speculative_tokens":1,"method":"ngram","prompt_lookup_min":1,"prompt_lookup_max":8}', it has no effect.

@DarkLight1337
Member

Can you update the V1 User Guide according to the latest status?

@snova-rodrigom

Is spec decode actually working in the current source? I'm trying to set this up:

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --seed 42 \
    -tp 4 \
    --max-model-len 4096 \
    --speculative_config '{"model": "meta-llama/Llama-3.2-1B", "num_speculative_tokens": 5}'
but I'm getting this warning:
WARNING 06-17 20:29:46 [arg_utils.py:1665] Speculative Decoding is not supported by the V1 Engine. Falling back to V0.
