[V1] Enable V1 for compute capability < 8.0 + FP32 #23614
Conversation
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Code Review
This pull request enables the V1 engine for GPUs with a compute capability of less than 8.0. This is achieved by removing a check that previously restricted V1 to newer GPUs. The change is justified by the integration of the FlexAttention backend, which is designed to support these older architectures within the V1 engine. The modification is localized and appears consistent with the existing attention backend selection logic. I have not identified any issues with this change.
mgoin
left a comment
Seems reasonable to me, but can we test this?
I don't have access to such a device, so I'll wait for the OP in #23531 to comment on this.
Isotr0py
left a comment
I tested this on a T4 machine earlier; I can confirm that FlexAttention works with FP32.
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
Signed-off-by: Xiao Yu <xiao.yu@amd.com>
Purpose
Since we now support FlexAttention (https://github.com/vllm-project/vllm/blob/main/vllm/platforms/cuda.py#L332), the V1 engine should be allowed on older devices as well.
FIX #23531
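As a rough illustration of the behavior this PR enables (not the actual vLLM source; the function and backend names below are hypothetical), the idea is that instead of rejecting V1 outright on GPUs with compute capability below 8.0, the platform can fall back to the FlexAttention backend, which also handles FP32:

```python
# Illustrative sketch of the selection logic this PR relies on.
# Assumptions (not taken from the vLLM codebase): the function names,
# backend labels, and dtype strings here are made up for clarity.

def select_attention_backend(compute_capability: tuple[int, int],
                             dtype: str) -> str:
    """Pick an attention backend for the V1 engine.

    Previously, capability < (8, 0) caused V1 to be disabled entirely.
    With FlexAttention available, older GPUs (e.g. a T4 at SM 7.5) and
    FP32 workloads get a working backend instead.
    """
    # FlashAttention requires Ampere or newer (SM >= 8.0) and fp16/bf16.
    if compute_capability >= (8, 0) and dtype in ("float16", "bfloat16"):
        return "FLASH_ATTN"
    # Pre-Ampere devices or fp32 fall back to FlexAttention.
    return "FLEX_ATTENTION"


# A T4 (SM 7.5) running fp32 now resolves to a usable backend:
print(select_attention_backend((7, 5), "float32"))   # FLEX_ATTENTION
print(select_attention_backend((8, 0), "float16"))   # FLASH_ATTN
```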
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.