CUDA: Improve flash decoding kernel occupancy for BS=1 case
Adds the following optimizations to the CUDA flash decoding code:
- Determine the number of active blocks per SM via the `cudaOccupancyMaxActiveBlocksPerMultiprocessor` API, and use this value to pick the optimal `parallel_blocks` value.
- Prefer the vector flash attention kernels over the MMA kernel for BS=1.
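The occupancy-based selection above can be sketched as follows. This is an illustrative sketch, not the actual llama.cpp code: the kernel name `flash_attn_vec_kernel` and the helper `choose_parallel_blocks` are hypothetical, and the real implementation picks `parallel_blocks` with its own heuristics. The idea is to query how many blocks one SM can host at the chosen launch configuration, then split the KV sequence across enough blocks to fill the whole GPU:

```cuda
#include <cuda_runtime.h>

// Hypothetical vector flash attention kernel (body omitted for brevity).
__global__ void flash_attn_vec_kernel(/* ... */) {}

// Pick the largest power-of-2 split of the KV sequence that still fits
// within the number of blocks the GPU can run concurrently.
int choose_parallel_blocks(int n_blocks_base, int n_threads, size_t smem_bytes) {
    int device = 0;
    int n_sm = 0;
    int max_blocks_per_sm = 0;

    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&n_sm, cudaDevAttrMultiProcessorCount, device);

    // How many copies of this kernel fit on a single SM at this
    // block size and dynamic shared memory usage:
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_sm, flash_attn_vec_kernel, n_threads, smem_bytes);

    // Total blocks the GPU can execute concurrently:
    const int max_concurrent = n_sm * max_blocks_per_sm;

    // For BS=1 the base grid (n_blocks_base) is small, so increase
    // parallel_blocks until the grid saturates the GPU. Splitting
    // further only adds overhead from the final reduction pass.
    int parallel_blocks = 1;
    while (2 * parallel_blocks * n_blocks_base <= max_concurrent) {
        parallel_blocks *= 2;
    }
    return parallel_blocks;
}
```

In the generation phase with batch size 1 the grid would otherwise be far smaller than the number of SMs, which is why sizing `parallel_blocks` from the occupancy query rather than a fixed constant recovers throughput at large sequence lengths.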
This results in up to 15% higher generation-phase throughput for large sequence lengths.
Issue: #12182