Add FLASHINFER_MLA to backend selector test #24753
Conversation
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Code Review
This pull request adds test coverage for the new FLASHINFER_MLA attention backend. The changes correctly add the new backend to the test matrix and include specific test logic for its constraints. However, I found a critical issue where the test code calls get_attn_backend with an incorrect argument type, which could lead to runtime errors. A fix is suggested.
LucasWilkinson left a comment
LGTM! Thanks!
        # FlashInfer MLA only supported on V1 engine
        pytest.skip(
            "FlashInfer MLA only supported on V1 engine")
    elif block_size not in [32, 64]:
QQ: Is there a reason we don't have the attention backend define what block size it supports?
It definitely should -- I can make that change in a separate PR to standardize the logic, now that we have 3 different backends with a restriction like this. That would be cleaner than what we have now, where it's hardcoded in a couple of different places.
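One possible shape for that standardization is sketched below, assuming each backend exposes a classmethod that the selector and tests can query. The class and method names are illustrative only, not vLLM's actual API.

```python
from typing import Optional

class AttentionBackendSketch:
    """Illustrative base class; None means any block size is accepted."""

    @classmethod
    def get_supported_block_sizes(cls) -> Optional[list[int]]:
        return None

class FlashInferMLASketch(AttentionBackendSketch):
    @classmethod
    def get_supported_block_sizes(cls) -> Optional[list[int]]:
        return [32, 64]  # restriction currently hardcoded in the selector and tests

def check_block_size(backend: type[AttentionBackendSketch], block_size: int) -> None:
    supported = backend.get_supported_block_sizes()
    if supported is not None and block_size not in supported:
        raise ValueError(
            f"{backend.__name__} does not support block_size={block_size}; "
            f"supported sizes: {supported}")

check_block_size(FlashInferMLASketch, 64)  # OK; 16 would raise
```

With something like this in place, both the selector and the tests could ask the backend for its supported block sizes instead of duplicating the `[32, 64]` restriction.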
Purpose
Adds the recently-added FlashInfer MLA backend to the attention selector test. This test would have caught the bug fixed by #24692.
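A minimal sketch of what the addition looks like in spirit is shown below. The skip condition mirrors the snippet discussed in the review thread, but the parametrization, backend names, and test body are assumptions; the real test in tests/kernels/attention/test_attention_selector.py is more involved.

```python
import pytest

MLA_BACKENDS = ["TRITON_MLA", "FLASHMLA", "FLASHINFER_MLA"]  # names assumed

@pytest.mark.parametrize("name", MLA_BACKENDS)
@pytest.mark.parametrize("block_size", [16, 32, 64])
def test_mla_backend_selection(name: str, block_size: int, monkeypatch) -> None:
    # FlashInfer MLA is restricted to certain block sizes, so skip the rest.
    if name == "FLASHINFER_MLA" and block_size not in (32, 64):
        pytest.skip("FlashInfer MLA only supports block sizes 32 and 64")
    monkeypatch.setenv("VLLM_ATTENTION_BACKEND", name)
    # ... the real test then calls the selector and asserts on the returned backend
```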
Test Plan
pytest tests/kernels/attention/test_attention_selector.py
Test Result
passes
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.