-
-
Notifications
You must be signed in to change notification settings - Fork 8.4k
[Feature] Support sequence parallelism for static fp8 quantization #19181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: cascade812 <cascade812@outlook.com>
Signed-off-by: cascade812 <cascade812@outlook.com>
Signed-off-by: cascade812 <cascade812@outlook.com>
Signed-off-by: cascade812 <cascade812@outlook.com>
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, looks good to me.
@ProExpertProg @bnellnm Could you take a more detailed look at this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few asks:
- I agree with @tlrmchlsmth that including fused ops might be unnecessary - could we just make this pass run before fusion, and then make sure fusion still works?
- Is there any way we could make this pass more general and not reliant on the exact ops? That way it could also work if custom ops are disabled.
- Perhaps here we could enable the custom ops and then lower them after the passes run, like you described in an offline conversation.
- Could you post performance numbers? And should we do this for any other ops as well?
Signed-off-by: cascade812 <cascade812@outlook.com>
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: cascade812 <cascade812@outlook.com>
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
2 similar comments
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
Signed-off-by: cascade812 <cascade812@outlook.com>
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
Signed-off-by: cascade812 <cascade812@outlook.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments!
Signed-off-by: cascade812 <cascade812@outlook.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more nits but looks great and thanks for all the code quality improvements!
Signed-off-by: cascade812 <cascade812@outlook.com>
Signed-off-by: cascade812 <cascade812@outlook.com>
…llm-project#19181) Signed-off-by: cascade812 <cascade812@outlook.com>
Add support sequence parallelism for static fp8 quantization in this PR.
It requires below config to enable it