
[Bugfix][V1][Spec Dec] Add generator to request even when no seed is provided. #17509


Closed
wants to merge 1 commit

Conversation

@luyuzhe111 (Contributor) commented May 1, 2025

Fix for #17498.

Currently, when no seed is provided in the sampling parameters, the request state has no generator. This causes a problem in speculative decoding: when temp > 0, different workers can accept different numbers of tokens because the sampling process has not been seeded.
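As a toy illustration of the failure mode (plain Python with `random`, not the vLLM sampler): unseeded RNG streams differ from worker to worker, while a shared seed keeps them token-for-token identical.

```python
import random

def sample_tokens(probs, n, seed=None):
    # Each "worker" draws n tokens from the same distribution.
    # With seed=None the streams differ, so workers can accept
    # different tokens; with a shared seed they stay identical.
    rng = random.Random(seed)
    return [rng.choices(range(len(probs)), weights=probs)[0] for _ in range(n)]

probs = [0.6, 0.3, 0.1]
w0 = sample_tokens(probs, 8, seed=42)
w1 = sample_tokens(probs, 8, seed=42)
assert w0 == w1  # seeded: both workers agree token-for-token
```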

cc @WoosukKwon @LiuXiaoxuanPKU

Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
github-actions bot commented May 1, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, a small and essential subset of CI tests that quickly catches errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label May 1, 2025
@WoosukKwon (Collaborator) left a comment

@luyuzhe111 Thanks for identifying the bug and submitting the PR.

I think this is not the ideal fix, because we don't want to use a generator for requests without seeds (which is the common case). Having a generator increases sampling overhead because it makes it difficult to use batched operations.
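A rough sketch of the trade-off (plain Python, not vLLM internals): with one shared generator the whole batch is drawn from a single RNG stream, which in a tensor library maps onto one batched multinomial call, whereas per-request generators force row-by-row sampling.

```python
import random

def sample_shared(batch_probs, rng):
    # One RNG stream for the whole batch: amenable to a single
    # vectorized multinomial call in a real tensor library.
    return [rng.choices(range(len(p)), weights=p)[0] for p in batch_probs]

def sample_per_request(batch_probs, rngs):
    # One generator per request: sampling must be done row by row,
    # which is what makes batched kernels hard to use.
    return [r.choices(range(len(p)), weights=p)[0]
            for p, r in zip(batch_probs, rngs)]

batch = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
shared = sample_shared(batch, random.Random(0))
per_req = sample_per_request(batch, [random.Random(s) for s in (0, 1, 2)])
assert len(shared) == len(per_req) == 3
```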

@luyuzhe111 (Contributor, PR author)
@WoosukKwon Thanks for the reply and the feedback! When you say "it makes it difficult to use batch operations", do you mean that when a batch contains multiple requests, each with a different seed, sampling becomes less efficient? In fact, would you mind explaining how seeding works when multiple seeds are specified?

If I understand correctly, we want a solution that applies a seed 1) only to the rejection sampler, and 2) shared by all non-seeded requests, correct?

@ekagra-ranjan (Contributor) commented May 1, 2025

Nice catch @luyuzhe111 !

Another solution could be to sample on rank 0 and broadcast the result to the other ranks. This adds a sync but removes any possibility of divergence in randomly sampled tokens in a multi-GPU setting. This design is also useful when sampling requires heavy CPU work (something like Structured Outputs FSM/PDA building), since rank 0 can then use all the CPU threads instead of every rank repeating the same work.
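A toy simulation of this proposal (the `broadcast` helper here is a hypothetical stand-in for a torch.distributed-style collective, not vLLM code): rank 0 samples, every other rank receives the same token, so divergence is impossible by construction.

```python
import random

def broadcast(obj, src, world):
    # Toy stand-in for a collective broadcast: copy the source rank's
    # value into every rank's slot. Real code would use a
    # torch.distributed collective instead.
    for rank in range(len(world)):
        world[rank] = obj
    return world

def step(num_ranks, seed=None):
    # Only rank 0 samples; the token is then broadcast, so all ranks
    # end up with the identical sampled token even without a seed.
    world = [None] * num_ranks
    token = random.Random(seed).randint(0, 31999)  # "rank 0" samples
    return broadcast(token, src=0, world=world)

ranks = step(4)
assert len(ranks) == 4 and len(set(ranks)) == 1  # all ranks agree
```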

@luyuzhe111 (Contributor, PR author)
@ekagra-ranjan Yeah, what you proposed sounds like a great solution! Feel free to start a new PR with it since you have more experience on that front. Thanks!!

@morgendave (Collaborator)
I think we don't have to broadcast. There are two simpler solutions:

  1. At gpu_model_runner init time, since it is binding ranks already, set the seed once at that moment.
  2. Use DP to do single-rank sampling and gather the result to all ranks.
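A minimal sketch of option 1, under the assumption that each worker seeds its sampler RNG once at model-runner init with a value shared across ranks (the names here are illustrative, not vLLM's):

```python
import random

def init_worker_rng(rank, global_seed=1234):
    # Hypothetical sketch: every rank seeds its sampler RNG with the
    # same global seed at init time, so unseeded requests still sample
    # in lockstep across workers, with no per-request generators.
    return random.Random(global_seed)

rng0 = init_worker_rng(rank=0)
rng1 = init_worker_rng(rank=1)
# Both "workers" now produce identical sampling streams.
assert [rng0.random() for _ in range(4)] == [rng1.random() for _ in range(4)]
```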

@WoosukKwon (Collaborator)
Closed by #17929

@WoosukKwon WoosukKwon closed this May 13, 2025