Performant backward Triton implementation with separated dkdv and dq kernels #122
Conversation
The switch between the new kernel in bwd_prefill_split.py and the existing bwd_prefill.py is controlled by an environment variable here.
Although we discussed that a function argument would be the preferred toggle, the Triton kernel wrapper is passed directly into the interfaces, which makes it hard to control through an argument, in my opinion. I have left it as an environment variable for now, with the split kernel enabled by default.
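For illustration, a minimal sketch of this kind of environment-variable toggle, assuming a hypothetical variable name and placeholder wrapper callables (the PR's actual names may differ):

```python
import os

# Hypothetical variable name; the split kernel is the default, matching the PR.
USE_SPLIT_BWD = os.environ.get("USE_SPLIT_BWD_KERNEL", "1") == "1"

def select_backward_kernel(bwd_prefill_split, bwd_prefill):
    # The two callables stand in for the wrappers in bwd_prefill_split.py
    # and bwd_prefill.py; the choice is made once, from the environment.
    return bwd_prefill_split if USE_SPLIT_BWD else bwd_prefill
```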
Most of the changes related to removing or modifying the existing debug messages are contained in a single commit, which has therefore been dropped.
The problem introduced in 58941ed was manually reverted in 8436dc7 and fixed in 814471e. The fix is simple: do not assume that any two tensors share the same strides; take the strides from each tensor itself. For example,
Another change is to NOT
TODO:
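A minimal sketch of the stride fix described above, with a hypothetical launch wrapper and argument order (the real kernel signature lives in bwd_prefill_split.py): each tensor passes its own strides instead of borrowing another tensor's.

```python
import torch

def _strides(t: torch.Tensor):
    # Every tensor reports its own memory layout; never reuse another tensor's strides.
    return t.stride(0), t.stride(1), t.stride(2), t.stride(3)

def launch_bwd_kernel(kernel, grid, q, k, v, do, dq, dk, dv, softmax_lse, delta, sm_scale):
    # Hypothetical launch wrapper, for illustration only.
    kernel[grid](
        q, k, v, do, dq, dk, dv, softmax_lse, delta, sm_scale,
        *_strides(q),   # q:  (batch, heads_q, seqlen_q, head_dim)
        *_strides(k),   # k/v may have fewer heads or a different seqlen than q,
        *_strides(v),   # so their strides must come from k and v themselves
        *_strides(do),
        *_strides(dq),
        *_strides(dk),
        *_strides(dv),
    )
```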
This is excellent
This PR introduces a performant version of the backward kernel implementation in Triton. It follows the same implementation strategy as the example in upstream Triton, but with added functionality.
As expected, it surpasses the existing implementation, but it reaches only about half the performance of the upstream example mentioned above. Here's a quick summary of the performance at the time of submitting this PR.

Note that tutorials/06-fused-attention.py assumes Q and KV have the same number of heads and the same sequence length, so it is only tested on those configurations.
The performance investigation will follow in a later PR.
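As a rough illustration of the split strategy in the title (separate dkdv and dq kernels), here is a hedged sketch of a driver. The kernel arguments, function name, grid shapes, and tensor layout (batch, heads, seqlen, head_dim) are all assumptions for illustration; the actual kernels live in bwd_prefill_split.py.

```python
import triton

def flash_attn_bwd_split(bwd_dkdv_kernel, bwd_dq_kernel,
                         q, k, v, do, dq, dk, dv, softmax_lse, delta, sm_scale,
                         BLOCK_M=64, BLOCK_N=64):
    # Hypothetical driver: the two kernel arguments stand in for the PR's Triton kernels.
    batch, heads_q, seqlen_q, head_dim = q.shape
    _, heads_kv, seqlen_k, _ = k.shape

    # Kernel 1: each program owns a block of K/V rows, loops over Q blocks,
    # and accumulates dK and dV for that block.
    grid_dkdv = (triton.cdiv(seqlen_k, BLOCK_N), batch * heads_kv)
    bwd_dkdv_kernel[grid_dkdv](q, k, v, do, dk, dv, softmax_lse, delta, sm_scale,
                               BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N)

    # Kernel 2: each program owns a block of Q rows, loops over K/V blocks,
    # and accumulates dQ for that block.
    grid_dq = (triton.cdiv(seqlen_q, BLOCK_M), batch * heads_q)
    bwd_dq_kernel[grid_dq](q, k, v, do, dq, softmax_lse, delta, sm_scale,
                           BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N)
```

Splitting the two accumulations avoids atomics or cross-block reductions: the dK/dV kernel parallelizes over key/value blocks while the dQ kernel parallelizes over query blocks, each writing only the gradients it owns.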