[Bugfix][B200] Fix cutlass_mla hang
#24966
Code Review
This pull request introduces a workaround to fix a hang in cutlass_mla for large batch sizes by limiting kv_splits. The fix itself seems reasonable given the context. However, I've noticed that several debugging print statements were uncommented in csrc/attention/mla/cutlass_sm100_mla/device/sm100_mla.hpp. These should be removed before merging to keep the codebase clean and avoid performance issues. The PR also includes substantial changes to dependency management files, which seem unrelated to the bugfix. It would be beneficial to address these dependency changes in a separate pull request with a dedicated description.
bb41cec to 37f1a09
In my small-model tests with few prompts (BS < 8), the engine still hangs. Would it be worth investigating why there is a hang?
@pavanimajety I will limit for B>1
145c28f to 6870d15
…for larger batch size Signed-off-by: Alexander Matveev <amatveev@redhat.com>
6870d15 to 2975c3f
Thanks, that works for now.
Signed-off-by: Alexander Matveev <amatveev@redhat.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Alexander Matveev <amatveev@redhat.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Alexander Matveev <amatveev@redhat.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
This PR fixes the hang issue with cutlass_mla when batch size is sufficiently large and kv_splits is high.
The solution is to limit the max kv_splits to 2 when batch size > 1. We avoid limiting batch_size == 1, since larger kv_splits improve low-latency performance.
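
The capping rule can be sketched in a few lines of Python. This is only an illustration of the policy described in this PR, not the actual change: the helper name `cap_kv_splits` and its parameters are hypothetical, and it assumes the cap applies only to batches larger than one (the "B>1" condition agreed in the conversation), leaving single-request batches untouched.

```python
def cap_kv_splits(batch_size: int, requested_splits: int,
                  max_splits_for_batches: int = 2) -> int:
    """Return the number of KV splits to use for a decode batch.

    Hypothetical helper illustrating the workaround:
    - batch_size == 1: keep the requested split count, since more
      splits help low-latency decoding.
    - batch_size > 1: cap the split count (2 by default) to avoid
      the hang seen with high kv_splits.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if requested_splits < 1:
        raise ValueError("requested_splits must be at least 1")

    if batch_size == 1:
        # Single request: leave the split count unrestricted.
        return requested_splits

    # Batched decode: never exceed the cap, but do not raise a
    # request that is already below it.
    return min(requested_splits, max_splits_for_batches)
```

For example, `cap_kv_splits(1, 16)` keeps all 16 splits, while `cap_kv_splits(8, 16)` is reduced to 2. The cap is a parameter here so that the threshold can be revisited once the root cause of the hang is understood.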