[Hardware][AMD] integrate aiter chunked prefill into vllm #18596
Conversation
Signed-off-by: fsx950223 <fsx950223@outlook.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: charlifu <charlifu@amd.com>
Let's get this merged soon. cc @houseroad @WoosukKwon @simon-mo
Accepting to unblock. Since this only touches AMD-related logic, it should be safe on other platforms.
Could @HAIAI or someone from AMD approve this PR?
LGTM
@Zzz9990 would you take a look at the failed checks?
https://buildkite.com/vllm/ci/builds/22190#01977cbb-a80d-4f96-93d8-a821ca8095cf failure seems related? cc: @Zzz9990
ImportError: cannot import name 'rocm_aiter_fused_add_rms_norm' from 'vllm.model_executor.layers.layernorm' (/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/layernorm.py)
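The traceback above is the typical symptom of a stale vLLM wheel inside the ROCm image that predates the aiter symbols. A hedged sketch of the kind of guarded lookup that degrades gracefully when the symbol is missing (the helper name here is illustrative, not vLLM's actual API):

```python
import importlib


def resolve_fused_norm(module_name: str = "vllm.model_executor.layers.layernorm"):
    """Return the aiter fused add+rmsnorm kernel if importable, else None.

    Returning None lets the caller fall back to the plain layernorm path
    instead of crashing with an ImportError at startup.
    """
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        # Module itself is absent (e.g. vllm not installed in this env).
        return None
    # Symbol may be absent in older wheels; getattr avoids a hard failure.
    return getattr(mod, "rocm_aiter_fused_add_rms_norm", None)
```

As the later comments note, the real fix was rebasing on main so the image and the source tree agree on which symbols exist.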
Please note all the failing AMD checks. Updating from the latest main should help fix the build.
@Zzz9990 @fsx950223 can you update/rebase from main? |
Looks reasonable. If you can delete some more of the cascade attention code in the builder/metadata that would be great. If not, it's no big deal.
)
return output
else:
    raise NotImplementedError(
AFAICT you need the use_cascade_attention method for API compatibility, but everything else should be able to be removed. If not, it's fine. No worries.
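A minimal sketch of what the reviewer is suggesting, assuming the builder class name and signature (both are illustrative here, not copied from the PR diff): keep only the API-compatibility method and have it always opt out, so the remaining cascade builder/metadata code can be deleted.

```python
class AiterFlashAttentionMetadataBuilder:
    """Sketch of a metadata builder that keeps use_cascade_attention
    purely for API compatibility (class name is an assumption)."""

    def use_cascade_attention(self, *args, **kwargs) -> bool:
        # Cascade attention is not implemented on the aiter chunked-prefill
        # path, so the builder always declines; callers then take the
        # regular attention path and the cascade machinery is never needed.
        return False
```

The design choice is the usual one for partial backend implementations: satisfy the interface with a cheap constant answer rather than carry dead code for an unsupported feature.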
Head branch was pushed to by a user without write access
…ct#18596) Signed-off-by: fsx950223 <fsx950223@outlook.com> Signed-off-by: charlifu <charlifu@amd.com> Co-authored-by: fsx950223 <fsx950223@outlook.com> Co-authored-by: charlifu <charlifu@amd.com>
…ct#18596) Signed-off-by: fsx950223 <fsx950223@outlook.com> Signed-off-by: charlifu <charlifu@amd.com> Co-authored-by: fsx950223 <fsx950223@outlook.com> Co-authored-by: charlifu <charlifu@amd.com> Signed-off-by: minpeter <kali2005611@gmail.com>
…ct#18596) Signed-off-by: fsx950223 <fsx950223@outlook.com> Signed-off-by: charlifu <charlifu@amd.com> Co-authored-by: fsx950223 <fsx950223@outlook.com> Co-authored-by: charlifu <charlifu@amd.com> Signed-off-by: Yang Wang <elainewy@meta.com>
CMD: VLLM_TORCH_PROFILER_DIR=/mnt/raid0/sixifang/vllm/vllm_profile HIP_VISIBLE_DEVICES=4,5,6,7 VLLM_ROCM_USE_AITER=1 VLLM_USE_V1=1 vllm serve /models/models--amd--Meta-Llama-3.1-8B-Instruct-FP8-KV/snapshots/fa42f9a9105c545755fea25cf69f49ac8c8b40e1/ --tensor-parallel-size 4 --gpu-memory-utilization 0.9 --trust-remote-code --disable-log-requests --block-size 16 --max-model-len 32768 --dtype float16 --quantization fp8 --no-enable-prefix-caching --max-num-batched-tokens=8192
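A hedged sketch of the with/without-aiter comparison implied by the headings below: the serve command is the one above, and only the environment toggle differs between the two runs (flags abbreviated here for readability).

```shell
# A/B comparison sketch. Full flags are in the command above; only the
# VLLM_ROCM_USE_AITER toggle changes between the two benchmark runs:
#   VLLM_ROCM_USE_AITER=1 VLLM_USE_V1=1 vllm serve <model> ...  # aiter on
#   VLLM_ROCM_USE_AITER=0 VLLM_USE_V1=1 vllm serve <model> ...  # baseline
use_aiter=1
echo "VLLM_ROCM_USE_AITER=${use_aiter}"
```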
Performance with aiter:
Performance without aiter: