[Bugfix][V1][Spec Dec] Add generator to request even when no seed is provided. #17509
Conversation
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add 🚀
@luyuzhe111 Thanks for identifying the bug and submitting the PR.
I think this is not the ideal fix, because we don't want to use a generator for requests without seeds (which is the common case). Having a generator increases sampling overhead because it makes it difficult to use batched operations.
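To illustrate the overhead concern, here is a minimal sketch (plain PyTorch; the batch size, vocab size, and seeds are hypothetical): without per-request generators, sampling is a single batched `torch.multinomial` call, while attaching a generator to each request forces a Python loop with one call per request.

```python
import torch

# Probabilities for a toy batch of 4 requests over a vocab of 16 tokens.
probs = torch.rand(4, 16)
probs = probs / probs.sum(dim=-1, keepdim=True)

# No per-request generators: one batched sampling call for the whole batch.
batched = torch.multinomial(probs, num_samples=1)

# Per-request generators (seeded requests): sampling degrades to a
# Python loop, one call per request, each with its own generator.
gens = [torch.Generator().manual_seed(i) for i in range(4)]
per_request = torch.stack([
    torch.multinomial(probs[i], num_samples=1, generator=g)
    for i, g in enumerate(gens)
])

assert batched.shape == per_request.shape == (4, 1)
```

Both paths produce one token id per request; the difference is purely in how many kernel launches the sampler needs, which is why unconditionally attaching generators to unseeded requests is costly.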
@WoosukKwon Thanks for the reply and the feedback! When you say "it makes it difficult to use batch operations", you mean that when a batch contains multiple requests, each with a different seed, sampling will be less efficient, right? In fact, do you mind explaining how seeding works when multiple seeds are specified at all? If I understand correctly, we want a solution where a seed 1) applies only to the rejection sampler, and 2) is shared by all non-seeded requests, correct?
Nice catch @luyuzhe111! Another solution could be to sample on rank 0 and broadcast the result to the other ranks. This adds a sync but removes any possibility of divergence in randomly sampled tokens in a multi-GPU setting. This design is also useful when sampling requires some heavy CPU work (something like Structured Outputs FSM/PDA building), since rank 0 can utilize all the CPU threads instead of every rank doing the same CPU work.
@ekagra-ranjan Yeah, what you proposed sounds like a great solution! Feel free to start a new PR with it since you have more experience on that front. Thanks!!
I think we don't have to broadcast. There are two other, easier solutions.
Closed by #17929
Fix for #17498.
Currently, when no seed is provided in the sampling parameters, the request state has no generator. This causes a problem in speculative decoding: different workers can accept different numbers of draft tokens when temperature > 0, since the sampling process has not been seeded.
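A small sketch of the failure mode this PR addresses (plain PyTorch; the probabilities, seed, and "worker" names are hypothetical): without a shared generator, each worker's RNG state evolves independently and draws can diverge, whereas identically seeded per-request generators keep sampling in lockstep across workers, so rejection sampling accepts the same tokens everywhere.

```python
import torch

# Toy acceptance distribution over a vocab of 4 tokens.
probs = torch.tensor([[0.25, 0.25, 0.25, 0.25]])

# With per-request generators seeded identically on every worker,
# both "workers" draw exactly the same tokens.
gen_a = torch.Generator().manual_seed(0)
gen_b = torch.Generator().manual_seed(0)
worker_a = torch.multinomial(probs, num_samples=8, replacement=True, generator=gen_a)
worker_b = torch.multinomial(probs, num_samples=8, replacement=True, generator=gen_b)

# Identical draws => identical accept/reject decisions across workers.
assert torch.equal(worker_a, worker_b)
```

With no generator at all, each worker would instead sample from its own global RNG state, and nothing guarantees the draws (and hence the number of accepted tokens) agree across ranks.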
cc @WoosukKwon @LiuXiaoxuanPKU