[FEAT] [ROCm]: Add AITER Block-Scaled GEMM Feature #14968
Conversation
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; by default only a small and essential subset of CI tests runs to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
… add AITER package Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
…eddedLLM/vllm into aiter-block-gemm-integration
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
This pull request has merge conflicts that must be resolved before it can be merged.
…gration Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
This generally looks fine. What models are you all using this kernel with? If there are any models that we would like to claim that this kernel supports, please just include lm_eval results in a comment on this PR.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
weight_scale,
block_size,
output_dtype=input.dtype)
# TODO is_shape_supported_by_cutlass is never used,
Can we add the GitHub ID or create an issue for this, for tracking purposes?
@houseroad We have cross-checked with main. It looks like this has since been implemented there and the comment removed, so we have removed the TODO comment as well.
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
In general this looks fine, but let's get this "is cutlass supported" logic ironed out before landing.
weight_scale: torch.Tensor,
input_2d: torch.Tensor) -> bool:
if current_platform.is_rocm():
# TODO this is never used, as cutlass_block_fp8_supported is False
If that's the case, can we just return False? Or move the `current_platform.is_rocm()` check to `apply_w8a8_block_fp8_linear`?
@SageMoore I think we could.
Just make sure you have a look at #14397 as it describes how this is currently a bug.
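For readers skimming this thread, here is a minimal sketch of what the suggested restructuring could look like. The helper name and docstring below are illustrative and not the PR's actual code; only `current_platform.is_rocm()` and the fact that `cutlass_block_fp8_supported` is always False on ROCm are taken from the discussion above.

```python
import torch

from vllm.platforms import current_platform


def should_use_cutlass_block_fp8(weight: torch.Tensor,
                                 weight_scale: torch.Tensor,
                                 input_2d: torch.Tensor) -> bool:
    """Illustrative helper; the function in the PR may be named differently."""
    # Per the suggestion above: bail out immediately on ROCm, since
    # cutlass_block_fp8_supported is always False there and any shape
    # checks below would never be reached.
    if current_platform.is_rocm():
        return False
    # ... CUDA-only shape/alignment checks for the CUTLASS kernel go here ...
    return True
```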
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Looks reasonable. Thanks for the contribution!
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: minpeter <kali2005611@gmail.com>
Description
This PR integrates the Block-Scaled GEMM functionality from AITER into vLLM, allowing any upcoming optimizations in the AITER kernel to be used and evaluated directly within the vLLM framework.
Implementation
The `gemm_a8w8_blockscale` kernel from AITER has been added to `vllm/model_executor/layers/quantization/utils/fp8_utils.py`. This kernel is used when `VLLM_ROCM_USE_AITER` and `VLLM_ROCM_USE_AITER_LINEAR` are both set to `1`.
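A minimal sketch of how this gating might be wired up, for readers unfamiliar with the flags. The dispatch function below is illustrative rather than the PR's exact code, and the `aiter.gemm_a8w8_blockscale` argument order is an assumption; only the environment variable names and the kernel entry point come from this description.

```python
import os

import torch


def _aiter_blockscale_enabled() -> bool:
    # Both flags must be set to "1" for the AITER path to be taken.
    return (os.getenv("VLLM_ROCM_USE_AITER", "0") == "1"
            and os.getenv("VLLM_ROCM_USE_AITER_LINEAR", "0") == "1")


def block_fp8_gemm(input_2d: torch.Tensor, weight: torch.Tensor,
                   input_scale: torch.Tensor, weight_scale: torch.Tensor,
                   block_size: list[int],
                   output_dtype: torch.dtype) -> torch.Tensor:
    if _aiter_blockscale_enabled():
        # AITER path (ROCm). The exact signature of gemm_a8w8_blockscale
        # should be checked against the installed aiter package.
        import aiter
        return aiter.gemm_a8w8_blockscale(input_2d, weight, input_scale,
                                          weight_scale, dtype=output_dtype)
    # Fallback: vLLM's existing Triton block-scaled FP8 GEMM.
    from vllm.model_executor.layers.quantization.utils.fp8_utils import (
        w8a8_block_fp8_matmul)
    return w8a8_block_fp8_matmul(input_2d, weight, input_scale, weight_scale,
                                 block_size, output_dtype=output_dtype)
```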
Testing
The integration has been verified through:
[Updated] Performance
V1 Engine
Summary of Improvements
When comparing performance with and without AITER Block-Scaled GEMM FP8, the following improvements were observed:
Key Observations
[Updated] LM Eval accuracy
V1 Engine
Without AITER Block Scaled GEMM FP8
vllm (pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8,max_model_len=32768,block_size=1,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
With AITER Block Scaled GEMM FP8
vllm (pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8,max_model_len=32768,block_size=1,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
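For anyone wanting to reproduce these numbers, the config string above maps onto an lm_eval run roughly like the following (Python API shown; the GSM8K task is an assumption based on the accuracy test further down, and the model_args mirror the reported string verbatim):

```python
import lm_eval

# Mirrors the reported setup above (DeepSeek-V3, TP=8, 5-shot, auto batch
# size). Requires lm_eval with the vLLM backend installed.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=("pretrained=deepseek-ai/DeepSeek-V3,"
                "tensor_parallel_size=8,max_model_len=32768,"
                "block_size=1,trust_remote_code=True"),
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```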
Old info
Throughput
ShareGPT Dataset
Throughput: 5.42 requests/s, 2274.81 total tokens/s, 1087.69 output tokens/s
Throughput: 5.23 requests/s, 2196.01 total tokens/s, 1050.01 output tokens/s
~3.6% throughput gain on the ShareGPT dataset.
Accuracy Test GSM8K
vllm (pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8,max_model_len=30000,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
Environment Settings
Updates in Dockerfile.rocm_base
Added AITER Package:
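The exact Dockerfile lines are not reproduced here. Once the image is built, a quick sanity check that the package and the block-scaled GEMM entry point are present could look like the snippet below (a hypothetical verification step, not part of the PR itself):

```python
# Hypothetical sanity check: confirm the AITER package installed in the
# image is importable and exposes the block-scaled GEMM entry point.
import aiter

print(aiter.__file__)
print(hasattr(aiter, "gemm_a8w8_blockscale"))
```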
Additional Notes
The following branches were used as references for this integration:
https://github.com/ROCm/vllm/tree/aiter_upstream
https://github.com/ROCm/vllm/tree/aiter_integration_final
https://github.com/ROCm/vllm/tree/deepseek_v3_dev
This PR is part of a larger effort to integrate AITER kernels into vLLM for improved performance on the ROCm platform.