[SYCL] Implement Flash attention. #7141
Comments
@qnixsynapse
@NeoZhangJianyu Nice. Thank you!
This issue has been tagged with the "stale" label. I am currently studying SYCL and C++ and waiting for the major SYCL refactoring so that the code is readable and it will be easier for me to (eventually) implement the flash attention kernel if needed. Commenting here to make this issue active again.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Any progress?
Currently, Flash Attention is available in the CUDA and Metal backends (#5021).
From the paper: Flash attention is an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. [...] it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. [...]
The question is whether dedicated Intel GPUs can actually benefit from it, and it will be interesting to see how much performance improves.
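
For reference, here is a minimal CPU-side sketch of the tiled "online softmax" accumulation that flash attention performs per query row, just to illustrate why the full score matrix (and the HBM traffic to store it) never needs to materialize. This is not the SYCL kernel and not llama.cpp code; all names (`attend_one_row`, `q`, `K`, `V`, `tile`) are illustrative, and a real kernel would keep the K/V tiles in on-chip memory (SLM on Intel GPUs) and parallelize over work-groups.

```cpp
// Illustrative sketch only: tiled attention for a single query row using
// the running-max / running-denominator ("online softmax") trick.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Computes softmax(q·K^T / sqrt(d)) · V for one query vector, processing
// the keys/values in tiles of `tile` rows without storing the full score row.
std::vector<float> attend_one_row(const std::vector<float> &q,
                                  const std::vector<std::vector<float>> &K,
                                  const std::vector<std::vector<float>> &V,
                                  int tile) {
    const int n = (int) K.size();   // number of keys/values
    const int d = (int) q.size();   // head dimension
    const float scale = 1.0f / std::sqrt((float) d);

    float m = -INFINITY;              // running max of scores seen so far
    float l = 0.0f;                   // running softmax denominator
    std::vector<float> acc(d, 0.0f);  // running weighted sum of V rows

    for (int t0 = 0; t0 < n; t0 += tile) {
        const int t1 = std::min(t0 + tile, n);
        for (int j = t0; j < t1; ++j) {
            float s = 0.0f;           // scaled dot product q·K[j]
            for (int k = 0; k < d; ++k) s += q[k] * K[j][k];
            s *= scale;

            const float m_new = std::max(m, s);
            const float corr  = std::exp(m - m_new);  // rescale previous partials
            const float p     = std::exp(s - m_new);

            l = l * corr + p;
            for (int k = 0; k < d; ++k) acc[k] = acc[k] * corr + p * V[j][k];
            m = m_new;
        }
    }
    for (int k = 0; k < d; ++k) acc[k] /= l;  // final softmax normalization
    return acc;
}

int main() {
    std::vector<float> q = {1.0f, 0.0f};
    std::vector<std::vector<float>> K = {{1, 0}, {0, 1}, {1, 1}};
    std::vector<std::vector<float>> V = {{1, 2}, {3, 4}, {5, 6}};
    const auto o = attend_one_row(q, K, V, /*tile=*/2);
    std::printf("out = %.4f %.4f\n", o[0], o[1]);
    return 0;
}
```

The payoff on GPU is that only the small K/V tiles and the per-row running statistics live in fast on-chip memory, so HBM traffic scales with reading Q, K, V once rather than with the full n×n attention matrix.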