
fix blockmask offset compute #104

Merged
GuoxiaWang merged 1 commit into PaddlePaddle:main from starcrown001:xhy/fix_blockmask
Feb 13, 2026
Conversation

@starcrown001

  • Fixed the precision misalignment issue in the blockmask offset computation when the sequence length (seqlen) exceeds 16k, ensuring correct results for large-scale inputs (see the sketch after this list).
  • This version also optimizes the blockmask implementation. Compared with the original (mit-han-lab/Block-Sparse-Attention), it achieves an 81% to 188% improvement in forward performance and a 48% to 105% improvement in backward performance on H800, significantly boosting overall operator efficiency.
  • Comprehensive regression testing of both accuracy and performance has been conducted against the original flashmask operator. The impact on accuracy and performance is negligible, ensuring compatibility and stability.
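The diff itself is not shown in this conversation, so the following is only a minimal sketch of the kind of offset bug the title and the 16k threshold suggest: computing a flat blockmask offset in 32-bit arithmetic, which can overflow once the sequence length (and therefore the number of mask blocks) grows large. All names (`blockmask_offset_*`, block sizes, indices) are hypothetical and not taken from the actual patch.

```cpp
#include <cstdint>
#include <cstdio>
#include <climits>

// Hypothetical layout: one [n_block_row x n_block_col] blockmask per
// (batch, head) pair, indexed by a flat offset.

// 32-bit style offset: the product can exceed INT32_MAX for long sequences.
int64_t blockmask_offset_64(int batch_head_idx, int n_block_row, int n_block_col) {
    // Promote to 64-bit *before* multiplying so large seqlen stays exact.
    return static_cast<int64_t>(batch_head_idx) * n_block_row * n_block_col;
}

int main() {
    // Illustrative numbers: seqlen = 32k with 128x128 blocks -> 256 blocks
    // per side, i.e. 256 * 256 = 65536 mask entries per (batch, head) pair.
    const int n_block = 32 * 1024 / 128;
    const int batch_head_idx = 40000;  // batch * num_heads can reach this scale

    const int64_t offset = blockmask_offset_64(batch_head_idx, n_block, n_block);
    std::printf("true offset = %lld, fits in int32? %s\n",
                static_cast<long long>(offset),
                offset <= INT32_MAX ? "yes" : "no");
    return 0;
}
```

With these illustrative sizes the true offset is about 2.6e9, which no longer fits in a signed 32-bit index; promoting the index arithmetic to 64-bit is the usual remedy for this class of "wrong results only at large seqlen" bugs.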

fix blockmask smem size

code clean
GuoxiaWang merged commit e1ea941 into PaddlePaddle:main Feb 13, 2026
1 check passed
