Fix Variable Sequence Length Support for Flash Attention Decode #362
base: sycl-develop
Conversation
Resolved review threads (outdated):
- examples/sycl/06_pvc_flash_attention/pvc_flash_decode_runner.hpp
- test/unit/flash_attention/flash_attention_decode/flash_decode_testbed_3x.hpp
- applications/flash_attention_v2/collective/xe_flash_attn_decode_epilogue.hpp
Force-pushed from 07bc319 to b5ff4b0.
  if constexpr (!is_var_len) {
    return params;
  } else {
-   auto [batch, num_heads_q, num_heads_kv, seq_len_qo, seq_len_kv, seq_len_kv_cache, head_size_qk, head_size_vo] = problem_shape;
+   auto [batch, num_heads_q, num_heads_kv, seq_len_qo, seq_len_kv, seq_len_kv_cache, head_size_qk, head_size_vo] = logical_problem_shape;
Suggested change:
- auto [batch, num_heads_q, num_heads_kv, seq_len_qo, seq_len_kv, seq_len_kv_cache, head_size_qk, head_size_vo] = logical_problem_shape;
+ auto [num_heads_q, num_heads_kv, seq_len_qo, seq_len_kv, seq_len_kv_cache, head_size_qk, head_size_vo] = select<1, 2, 3, 4, 5, 6, 7>(logical_problem_shape);
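
For context, cute::select<Is...>(t) returns a tuple of the elements of t at the given indices, so the batch dimension at index 0 can be dropped before unpacking. A standalone sketch with made-up dimension values:

#include <cute/tensor.hpp>

void example() {
  using namespace cute;
  // (batch, num_heads_q, num_heads_kv, seq_len_qo, seq_len_kv,
  //  seq_len_kv_cache, head_size_qk, head_size_vo)
  auto problem_shape = make_tuple(2, 16, 16, 1, 512, 1024, 64, 64);
  // Keep indices 1..7; structured bindings unpack the 7-element result.
  auto [num_heads_q, num_heads_kv, seq_len_qo, seq_len_kv,
        seq_len_kv_cache, head_size_qk, head_size_vo] =
      select<1, 2, 3, 4, 5, 6, 7>(problem_shape);
}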
The logical problem shape and the problem shape hold a lot of duplicate inputs unnecessarily, which takes up register space. The logical problem shape only needs to hold a Shape<int, int, int> for seq_len_qo, seq_len_kv, and seq_len_kv_cache; the remaining fields are already provided by the problem shape.
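
A minimal sketch of that layout (hypothetical apply_variable_lengths helper, assuming CuTe; not the PR's actual refactor): the variable-length side carries only the three per-batch sequence lengths and patches them into the full shape.

#include <cute/tensor.hpp>

// Full shape: (batch, num_heads_q, num_heads_kv, seq_len_qo, seq_len_kv,
// seq_len_kv_cache, head_size_qk, head_size_vo).
using ProblemShape = cute::Shape<int, int, int, int, int, int, int, int>;
// Only what varies per batch entry: (seq_len_qo, seq_len_kv, seq_len_kv_cache).
using SeqLenShape = cute::Shape<int, int, int>;

// Hypothetical helper: overwrite the three sequence-length slots of the
// full problem shape with this batch entry's actual lengths.
ProblemShape apply_variable_lengths(ProblemShape problem_shape,
                                    SeqLenShape const& seq_lens) {
  cute::get<3>(problem_shape) = cute::get<0>(seq_lens);  // seq_len_qo
  cute::get<4>(problem_shape) = cute::get<1>(seq_lens);  // seq_len_kv
  cute::get<5>(problem_shape) = cute::get<2>(seq_lens);  // seq_len_kv_cache
  return problem_shape;
}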
This PR fixes variable sequence length support for Flash Attention Decode. It also fixes the causal masking in the device code and brings the verification in line with the prefill implementation, along with the FLOPS and GB/s calculations.
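
As a rough illustration of prefill-style performance accounting (a sketch with a hypothetical decode_perf helper; the names and byte accounting are assumptions, not the runner's actual code): each query head performs two GEMMs, Q*K^T and P*V, at 2*M*N*K FLOPs each, and Q, K, V, and O each move through memory once.

// Hypothetical helper, not the example runner's code. seq_len_kv_total
// stands for seq_len_kv + seq_len_kv_cache.
struct PerfCounts { double tflops; double gbps; };

PerfCounts decode_perf(int batch, int num_heads_q, int num_heads_kv,
                       int seq_len_qo, int seq_len_kv_total,
                       int head_size_qk, int head_size_vo,
                       int bytes_per_element, double seconds) {
  // Two GEMMs per query head: Q*K^T and P*V, 2*M*N*K FLOPs each.
  double qk = 2.0 * seq_len_qo * seq_len_kv_total * head_size_qk;
  double pv = 2.0 * seq_len_qo * seq_len_kv_total * head_size_vo;
  double flops = (qk + pv) * batch * num_heads_q;
  // Bytes: read Q and write O per query head, read K and V per KV head.
  double bytes = double(batch) * bytes_per_element *
      (double(num_heads_q)  * seq_len_qo       * (head_size_qk + head_size_vo) +
       double(num_heads_kv) * seq_len_kv_total * (head_size_qk + head_size_vo));
  return {flops / seconds / 1e12, bytes / seconds / 1e9};
}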