
Conversation

@LoserCheems
Collaborator

Correct the order of operations in the attention bias calculation for improved numerical stability, and introduce window size handling. Adjust the dbias shape and broadcasting logic to ensure proper dimension management during the backward pass.

Corrects parenthesization so the matrix scaling is applied before the transpose when building the attention bias, aligning with the intended formula and improving numerical stability and broadcasting.
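A minimal sketch of the precedence issue, with hypothetical names and shapes (the actual tensors live in examples/modeling/modeling_doge.py and are not shown on this page): `dt_states` stands in for a [batch, seqlen_q, num_heads] projection output and `A` for a per-head scaling coefficient.

```python
import torch
import torch.nn.functional as F

batch, seq_len, num_heads = 2, 16, 4
dt_states = torch.randn(batch, seq_len, num_heads)  # hypothetical [B, Lq, H] projection output
A = torch.randn(num_heads)                          # hypothetical per-head scaling coefficient

# Before: transposing first leaves the last axis as seqlen_q, so A (shape [H])
# broadcasts against the wrong axis and raises whenever H != Lq.
# bias_wrong = A * F.softplus(dt_states).transpose(-1, -2)

# After: scale per head while heads are still the last axis, then transpose
# to [B, H, Lq] for the attention kernel.
bias = (A * F.softplus(dt_states)).transpose(-1, -2)
```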

Passes the window size into the attention kernel to enable correct windowed masking behavior.
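For intuition, here is what a (left, right) window constrains, written as a standalone PyTorch sketch; the names and the (left, right) tuple convention follow the common flash-attention style and are an assumption, not the kernel's actual implementation.

```python
import torch

def window_mask(seqlen_q: int, seqlen_k: int, window_size: tuple[int, int]) -> torch.Tensor:
    """Hypothetical illustration: key j is visible from query i only if
    i - left <= j <= i + right."""
    left, right = window_size
    q_idx = torch.arange(seqlen_q).unsqueeze(-1)  # [Lq, 1]
    k_idx = torch.arange(seqlen_k).unsqueeze(0)   # [1, Lk]
    return (k_idx >= q_idx - left) & (k_idx <= q_idx + right)

print(window_mask(4, 4, (1, 0)).int())  # lower-bidiagonal visibility pattern
```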
Fixes incorrect dbias dimension handling in the backward pass by deriving the batch size and query length from the bias tensor rather than from reused variables, ensuring correct allocation; a sketch covering this and the reduction logic follows the next note.

Updates the expansion/reduction logic for MQA/GQA and broadcast dimensions (batch or seqlen_q == 1) to sum over the correct axes, preventing mis-shaped outputs, as sketched below.
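The actual change is in csrc/flash_dmattn/flash_api.cpp; this Python sketch only mirrors the intended logic under an assumed [batch, heads, seqlen_q, seqlen_k] layout, with all names (including batch_size_dbias and seqlen_q_dbias, echoed from the PR) hypothetical.

```python
import torch

def reduce_dbias(dbias_expanded: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    """Reduce a fully expanded bias gradient back to the bias tensor's own shape.
    Dimensions are derived from `bias` itself, not from reused query-side variables."""
    batch_size_dbias, num_heads_bias, seqlen_q_dbias, _ = bias.shape
    out = dbias_expanded
    # MQA/GQA: gradients from all query heads in a group sum into one bias head.
    if num_heads_bias != out.shape[1]:
        group = out.shape[1] // num_heads_bias
        out = out.view(out.shape[0], num_heads_bias, group, *out.shape[2:]).sum(dim=2)
    # Bias broadcast over batch (batch == 1): sum the batch axis.
    if batch_size_dbias == 1 and out.shape[0] != 1:
        out = out.sum(dim=0, keepdim=True)
    # Bias broadcast over query length (seqlen_q == 1): sum the query axis.
    if seqlen_q_dbias == 1 and out.shape[2] != 1:
        out = out.sum(dim=2, keepdim=True)
    return out
```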

Removes unused variables for clarity.
Copilot AI review requested due to automatic review settings on October 27, 2025 at 08:58
Contributor

Copilot AI left a comment


Pull Request Overview

This PR fixes the attention bias calculation by correcting the order of operations, improving numerical stability; updates the dbias handling logic in the backward pass by introducing dedicated dimension-tracking variables; and adds window size parameter support.

  • Corrects parenthesization in attention bias calculation to ensure multiplication happens before transpose operations
  • Adds window_size parameter to the attention interface call
  • Refactors dbias dimension tracking by replacing batch_size_bias and seqlen_q_bias with batch_size_dbias and seqlen_q_dbias

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description

examples/modeling/modeling_doge.py
    Fixes operator precedence in the attention bias calculation and adds the window_size parameter to the attention call
csrc/flash_dmattn/flash_api.cpp
    Removes unused mask/bias dimension variables and introduces dedicated dbias dimension-tracking variables for proper backward-pass handling


@LoserCheems merged commit 424b733 into main on Oct 27, 2025
3 of 4 checks passed