CUDA: add attention sinks for tile and wmma #15178
Conversation
This PR should produce correct results, but I think some of the synchronizations can be optimized out. In addition to the usual tests for correctness, please also check compute-sanitizer --tool=racecheck ./tests/test-backend-ops -o FLASH_ATTN_EXT. The compute sanitizer should come with the CUDA installation, but it may not be on the PATH (on my system it's under /opt/cuda/bin/compute-sanitizer).
@JohannesGaessler the
If possible I would like to be tagged for PRs that touch the wmma code.
This PR adds attention sink support for older GPUs (Volta and below), completing support for attention sinks in the flash attention code.
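For background on what the kernels need to do: an attention sink is a learned per-head logit that joins the softmax over the attention scores but has no value vector, so it only inflates the denominator. In a streaming-softmax (flash attention) kernel this reduces to one extra rescaling step after the KV loop. Below is a minimal sketch of that step; the identifiers (apply_attention_sink, kq_max, kq_sum, vkq) are illustrative assumptions, not the PR's actual code.

```cpp
// Minimal sketch, assuming a streaming-softmax kernel that tracks, per output
// row: a running maximum of the attention logits (kq_max), a running softmax
// denominator (kq_sum), and an un-normalized V accumulator (vkq). The sink
// logit joins the denominator but has no value vector, so vkq is only
// rescaled, never incremented. Names are hypothetical, not the PR's code.
__device__ void apply_attention_sink(const float sink, float & kq_max,
        float & kq_sum, float * vkq, const int head_dim) {
    const float kq_max_new = fmaxf(kq_max, sink);
    const float scale      = expf(kq_max - kq_max_new); // rescale existing terms

    kq_sum = kq_sum*scale + expf(sink - kq_max_new);
    for (int i = 0; i < head_dim; ++i) {
        vkq[i] *= scale;
    }
    kq_max = kq_max_new;
    // The final output is still vkq[i]/kq_sum, exactly as without sinks.
}
```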
Benchmarks on P100: master vs. PR.
Benchmarks on V100: master (with fix) vs. PR. At the moment it looks like this model is broken solely on Volta on master, because it goes through the wmma path even though attention sinks are not supported there (see the sketch below).
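For context on the Volta breakage, here is a minimal sketch of the kind of kernel-selection guard the fix implies. The enum and function names are hypothetical, not llama.cpp's actual dispatch code; the point is only that the wmma path must not be chosen when the op uses attention sinks on a build whose wmma kernel lacks sink support.

```cpp
// Hypothetical kernel-selection guard (illustrative only, not llama.cpp's
// actual dispatch code). On master, Volta picks the wmma flash-attention
// kernel even when the op carries a sinks tensor that the kernel does not
// implement; checking for sinks before committing to wmma avoids that.
enum fattn_kernel {
    FATTN_KERNEL_TILE, // fallback path, gains sink support in this PR
    FATTN_KERNEL_WMMA, // tensor-core path used on Volta
};

static fattn_kernel select_fattn_kernel(bool op_has_sinks, bool wmma_supports_sinks) {
    if (op_has_sinks && !wmma_supports_sinks) {
        return FATTN_KERNEL_TILE; // wmma would produce incorrect results here
    }
    return FATTN_KERNEL_WMMA;
}
```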