Performance update on the backward split kernel #127
This PR improves the backward kernel performance, bringing it closer to that of tutorial/06-fused-attention.py. As shown above, based on the benchmark in flash_attn/flash_attn_triton_amd, we are on average at 90% of the tutorial's performance and achieve a 60% performance improvement over the previous PR #122.
The improvement comes from the following changes:

- Enable `use_exp2` by default (see the sketch below).
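
For context, `use_exp2` refers to the common GPU trick of computing `exp(x)` as `exp2(x * log2(e))`, since `exp2` lowers to a cheaper hardware instruction than `exp` on both AMD and NVIDIA GPUs (the Triton fused-attention tutorial uses the same trick via `tl.math.exp2`). Below is a minimal, self-contained Triton sketch of the idea applied to a row softmax; the kernel name and signature are illustrative, not the PR's actual backward kernel:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def softmax_row_kernel(x_ptr, out_ptr, n_cols, BLOCK: tl.constexpr, USE_EXP2: tl.constexpr):
    # Each program instance handles one row of the input matrix.
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK)
    mask = offs < n_cols
    x = tl.load(x_ptr + row * n_cols + offs, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)  # subtract the row max for numerical stability
    if USE_EXP2:
        # exp(x) == 2^(x * log2(e)); the rescale is one multiply by a
        # compile-time constant, and exp2 is a single fast instruction.
        num = tl.math.exp2(x * 1.4426950408889634)
    else:
        num = tl.exp(x)
    out = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * n_cols + offs, out, mask=mask)


# Hypothetical usage: one program per row, BLOCK padded to a power of two.
x = torch.randn(128, 96, device="cuda")
out = torch.empty_like(x)
softmax_row_kernel[(x.shape[0],)](
    x, out, x.shape[1], BLOCK=triton.next_power_of_2(x.shape[1]), USE_EXP2=True
)
```

Since `USE_EXP2` is a `tl.constexpr`, the branch is resolved at compile time, so enabling it by default adds no runtime overhead when it is off and saves an expensive `exp` per element when it is on.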