[V1] [Spec Decode] Support random sampling for spec decode #13933


Merged — 44 commits merged into vllm-project:main on Mar 17, 2025

Conversation

LiuXiaoxuanPKU
Collaborator

@LiuXiaoxuanPKU LiuXiaoxuanPKU commented Feb 26, 2025

After syncing with @WoosukKwon, we changed the scope of this PR:

  1. We will support random sampling for spec decode in this PR.
  2. Since only ngram is supported in vLLM V1, we only support ngram random sampling for now. However, the random sampling support should be general to other drafting methods.
  3. The PR should support mixed-batch cases, where some requests within the same batch perform spec decode and others do not.
  4. Spec decode is compatible with random sampling, but is not compatible with top_p/top_k sampling. We will disable spec decode if a request requires top_p/top_k sampling (see the gating sketch after this list).
  5. We will give a clearer definition of recovered token ids and bonus token ids.
  6. We will create new test cases for the V1 rejection sampler instead of reusing V0 tests, for cleaner separation.
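As a rough illustration of item 4, the per-request gate could look like the sketch below. This is a hypothetical helper written for this discussion; the function name and the sentinel values treated as "disabled" are assumptions, not vLLM's actual code.

```python
# Hypothetical per-request gate (illustrative sketch, not vLLM's actual code).
def can_use_spec_decode(top_p: float, top_k: int) -> bool:
    # Plain greedy or random sampling works with rejection sampling;
    # requests that restrict the distribution via top_p/top_k fall back
    # to normal decoding without draft tokens.
    uses_top_p = top_p < 1.0
    uses_top_k = top_k not in (0, -1)  # assumed "disabled" sentinels
    return not (uses_top_p or uses_top_k)
```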

This PR tries to:
1. Support random sampling in the rejection sampler. This should be general to different drafting methods, not limited to ngram spec decode; a conceptual sketch of the accept/reject rule is given after the list below.
2. Clean up and reuse the rejection sampling tests from V0.

This PR does not:
1. Change the model runner to use the rejection sampler with random sampling. We need one extra PR to support ngram with random sampling.
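For readers less familiar with the technique, the accept/reject rule behind rejection sampling for speculative decoding can be sketched as follows. This is a minimal NumPy illustration of the standard algorithm under assumed inputs; the function and variable names are made up for the example, and this is not the PR's actual implementation.

```python
# Minimal sketch of rejection sampling for speculative decoding
# (illustrative only; not the PR's actual implementation).
import numpy as np

def rejection_sample(draft_tokens, draft_probs, target_probs, rng=None):
    """Accept or reject k draft tokens against the target model.

    draft_tokens: length-k sequence of proposed token ids
    draft_probs:  (k, vocab_size) draft-model probabilities
    target_probs: (k, vocab_size) target-model probabilities
    Returns the accepted prefix; on the first rejection a recovered
    token is appended and drafting stops for this request.
    """
    rng = rng or np.random.default_rng()
    output = []
    for i, tok in enumerate(draft_tokens):
        p_target, p_draft = target_probs[i, tok], draft_probs[i, tok]
        # Accept the draft token with probability min(1, p_target / p_draft).
        if rng.random() < min(1.0, p_target / max(p_draft, 1e-10)):
            output.append(tok)
            continue
        # Rejected: resample a "recovered" token from the residual
        # distribution proportional to max(0, p_target - p_draft).
        residual = np.clip(target_probs[i] - draft_probs[i], 0.0, None)
        residual /= residual.sum()
        output.append(int(rng.choice(len(residual), p=residual)))
        return output
    # All k drafts accepted: the caller would append a "bonus" token
    # sampled from the target distribution at the final position.
    return output
```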


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@LiuXiaoxuanPKU LiuXiaoxuanPKU marked this pull request as draft February 26, 2025 23:11
@mergify mergify bot added the v1 label Feb 26, 2025
@LiuXiaoxuanPKU LiuXiaoxuanPKU marked this pull request as ready for review February 27, 2025 07:45
@LiuXiaoxuanPKU LiuXiaoxuanPKU added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 27, 2025
@WoosukKwon
Collaborator

WoosukKwon commented Mar 15, 2025

@LiuXiaoxuanPKU As a sanity check, can you please run a simple perf benchmark? I'm just wondering if we missed anything critical.

@JaheimLee

Hi, I always get the following error after my server has been running for a long time (a whole night).

ERROR 03-16 09:11:23 [core.py:337] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 330, in run_engine_core
ERROR 03-16 09:11:23 [core.py:337]     engine_core.run_busy_loop()
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 364, in run_busy_loop
ERROR 03-16 09:11:23 [core.py:337]     outputs = step_fn()
ERROR 03-16 09:11:23 [core.py:337]               ^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 181, in step
ERROR 03-16 09:11:23 [core.py:337]     scheduler_output = self.scheduler.schedule()
ERROR 03-16 09:11:23 [core.py:337]                        ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/core/scheduler.py", line 172, in schedule
ERROR 03-16 09:11:23 [core.py:337]     new_blocks = self.kv_cache_manager.allocate_slots(
ERROR 03-16 09:11:23 [core.py:337]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/core/kv_cache_manager.py", line 243, in allocate_slots
ERROR 03-16 09:11:23 [core.py:337]     self.block_pool.cache_full_blocks(
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/core/block_pool.py", line 112, in cache_full_blocks
ERROR 03-16 09:11:23 [core.py:337]     assert blk.block_hash is None
ERROR 03-16 09:11:23 [core.py:337]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337] AssertionError

Memory is sufficient on my two 3090 24GB GPUs. My config is:

AsyncEngineArgs(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.97,
    enforce_eager=True,
    max_model_len=7000,
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    speculative_model="[ngram]",
    ngram_prompt_lookup_max=5,
    ngram_prompt_lookup_min=3,
    num_speculative_tokens=3,
    max_num_seqs=128,
    max_num_batched_tokens=2048,
    compilation_config=3,
)

@LiuXiaoxuanPKU
Collaborator Author

LiuXiaoxuanPKU commented Mar 16, 2025

I did a quick performance check.
Prompt: "Given the code below, could you add one line comment to the return line: {quick_sort_str}"
max_tokens = 1024, batch size = 1, hardware: 1x H100 80GB
Model: meta-llama/Llama-3.1-8B-Instruct

Since the outputs might differ between runs, we report throughput (tokens/s) below. T is the temperature.

[Screenshot: throughput (tokens/s) results with and without spec decode at different temperatures]
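For reference, a throughput number like the ones in the screenshot can be collected with a minimal script along these lines. This is a sketch of an assumed setup, not the exact script used for the table: quick_sort_str is a placeholder, and a spec-decode run would construct the engine with the ngram speculative settings discussed in this PR.

```python
# Minimal throughput sketch (assumed setup, not the exact benchmark script).
import time
from vllm import LLM, SamplingParams

quick_sort_str = "def quick_sort(arr): ..."  # placeholder code snippet
prompt = ("Given the code below, could you add one line comment to the "
          f"return line: {quick_sort_str}")

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=1.0, max_tokens=1024)  # T varied per row

start = time.perf_counter()
outputs = llm.generate([prompt], params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput: {generated / elapsed:.1f} tokens/s")
```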

@LiuXiaoxuanPKU
Collaborator Author

LiuXiaoxuanPKU commented Mar 16, 2025

I evaluated the quality of meta-llama/Meta-Llama-3-8B-Instruct on gsm8k with the following commands.

lm_eval --model vllm \
  --model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096" \
  --tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
  --gen_kwargs "temperature=$T" \
  --batch_size "$BATCH_SIZE"
lm_eval --model vllm \
  --model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096,speculative_model=[ngram],ngram_prompt_lookup_max=4,ngram_prompt_lookup_min=3,num_speculative_tokens=3" \
  --tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
  --gen_kwargs "temperature=$T" \
  --batch_size "$BATCH_SIZE"
| Setup | Temperature | Accuracy (flexible-extract / strict-match) |
| --- | --- | --- |
| w/o SD | 0 | 0.79 / 0.79 |
| with ngram SD | 0 | 0.77 / 0.77 |
| w/o SD | 1.0 | 0.63 / 0.65 |
| with ngram SD | 1.0 | 0.62 / 0.64 |

@LiuXiaoxuanPKU
Collaborator Author

More results on meta-llama/Llama-3.2-3B-Instruct
[Screenshot: additional results for meta-llama/Llama-3.2-3B-Instruct]

@WoosukKwon
Collaborator

@LiuXiaoxuanPKU Is the PR ready for merge?

@LiuXiaoxuanPKU
Collaborator Author

@LiuXiaoxuanPKU Is the PR ready for merge?

Yes. I checked the quality further: for greedy decoding it is steady; for random sampling it fluctuates (sometimes better, sometimes worse). Overall it looks correct to me.

Collaborator

@WoosukKwon WoosukKwon left a comment


LGTM! Thanks for the great work! 👍

@LiuXiaoxuanPKU LiuXiaoxuanPKU merged commit 8d6cf89 into vllm-project:main Mar 17, 2025
30 checks passed
DefTruth pushed a commit to DefTruth/vllm that referenced this pull request Mar 17, 2025
…ect#13933)

Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: DefTruth <31974251+DefTruth@users.noreply.github.com>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
…ect#13933)

Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
…ect#13933)

Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
@JuntongMa

> I evaluate the quality of meta-llama/Meta-Llama-3-8B-Instruct on gsm8k with this. […]

May I ask: can the OpenAI-compatible serving engine on V1 run with ngram now? After configuring --speculative-config '{"num_speculative_tokens":1,"method":"ngram","prompt_lookup_min":1,"prompt_lookup_max":8}', it has no effect.

@DarkLight1337
Member

Can you update the V1 User Guide according to the latest status?

@snova-rodrigom

Is spec decode actually working in the current source? I'm trying to set this up:

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --seed 42 \
    -tp 4 \
    --max-model-len 4096 \
    --speculative_config '{"model": "meta-llama/Llama-3.2-1B", "num_speculative_tokens": 5}'
but I'm getting this warning:
WARNING 06-17 20:29:46 [arg_utils.py:1665] Speculative Decoding is not supported by the V1 Engine. Falling back to V0.
