
fix blockmask offset compute #104

Merged
GuoxiaWang merged 1 commit into PaddlePaddle:main from starcrown001:xhy/fix_blockmask
Feb 13, 2026
Conversation

@starcrown001

  • Fixed the precision misalignment issue in the blockmask offset computation when the sequence length (seqlen) exceeds 16k, ensuring correct results for large-scale inputs (see the sketch after this list).
  • This version also optimizes the blockmask implementation. Compared with the original (mit-han-lab/Block-Sparse-Attention), it achieves an 81% to 188% improvement in forward performance and a 48% to 105% improvement in backward performance on H800, significantly boosting overall operator efficiency.
  • Comprehensive regression testing of both accuracy and performance has been conducted against the original flashmask operator. The impact on accuracy and performance is negligible, ensuring compatibility and stability.
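The diff itself is not shown in this conversation, so the following is only a minimal sketch of the kind of offset bug the title and the 16k threshold suggest: computing a flat blockmask offset in 32-bit arithmetic, which can overflow once the sequence length (and therefore the number of mask blocks) grows large. All names (`blockmask_offset_*`, block sizes, indices) are hypothetical and not taken from the actual patch.

```cpp
#include <cstdint>
#include <cstdio>
#include <climits>

// Hypothetical layout: one [n_block_row x n_block_col] blockmask per
// (batch, head) pair, indexed by a flat offset.

// 32-bit style offset: the product can exceed INT32_MAX for long sequences.
int64_t blockmask_offset_64(int batch_head_idx, int n_block_row, int n_block_col) {
    // Promote to 64-bit *before* multiplying so large seqlen stays exact.
    return static_cast<int64_t>(batch_head_idx) * n_block_row * n_block_col;
}

int main() {
    // Illustrative numbers: seqlen = 32k with 128x128 blocks -> 256 blocks
    // per side, i.e. 256 * 256 = 65536 mask entries per (batch, head) pair.
    const int n_block = 32 * 1024 / 128;
    const int batch_head_idx = 40000;  // batch * num_heads can reach this scale

    const int64_t offset = blockmask_offset_64(batch_head_idx, n_block, n_block);
    std::printf("true offset = %lld, fits in int32? %s\n",
                static_cast<long long>(offset),
                offset <= INT32_MAX ? "yes" : "no");
    return 0;
}
```

With these illustrative sizes the true offset is about 2.6e9, which no longer fits in a signed 32-bit index; promoting the index arithmetic to 64-bit is the usual remedy for this class of "wrong results only at large seqlen" bugs.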

fix blockmask smem size

code clean
GuoxiaWang merged commit e1ea941 into PaddlePaddle:main Feb 13, 2026
1 check passed
