Fix varlen mask and bias tensor shapes for all varlen attention functions #114
Conversation
Co-authored-by: LoserCheems <124847097+LoserCheems@users.noreply.github.com>
Good work @copilot, let's complete some tests.
Pull Request Overview
This PR fixes a critical bug in varlen attention functions where default mask and bias tensors were created with incorrect shapes, causing RuntimeError when the C++ backend expected different dimensions. The fix ensures that default tensors match the expected shapes for variable-length sequence processing.
- Updates tensor shape creation from batch-based to token-based dimensions
- Corrects shape calculations to use `total_q`/`total_tokens` and `num_heads_k` instead of `batch_size` and `num_heads` (see the layout sketch below)
- Updates documentation to reflect the correct expected tensor shapes
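For context, a quick sketch of the batch-based (padded) versus token-based (packed) layouts involved; the sequence lengths and head_dim below are illustrative, not taken from the PR.

```python
import torch

# Padded, batch-based layout: (batch_size, max_seqlen, num_heads, head_dim).
batch_size, max_seqlen, num_heads, head_dim = 3, 1024, 16, 64
padded = torch.randn(batch_size, max_seqlen, num_heads, head_dim)
seqlens = torch.tensor([1024, 768, 512])  # hypothetical per-sequence lengths summing to 2304

# Token-based (packed) layout used by the varlen functions:
# all valid tokens concatenated along dim 0 -> (total_tokens, num_heads, head_dim).
packed = torch.cat([padded[i, : seqlens[i]] for i in range(batch_size)], dim=0)
cu_seqlens = torch.nn.functional.pad(seqlens.cumsum(0), (1, 0)).to(torch.int32)

print(packed.shape)  # torch.Size([2304, 16, 64]) -> total_tokens = 2304
print(cu_seqlens)    # tensor([   0, 1024, 1792, 2304], dtype=torch.int32)
```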
    is_grad = is_grad_enabled and qkv.requires_grad
    if mask is None:
-       mask = torch.ones((batch_size, num_heads, max_seqlen, max_seqlen), dtype=qkv.dtype, device=qkv.device)
+       mask = torch.ones((total_tokens, num_heads, max_seqlen), dtype=qkv.dtype, device=qkv.device)
Copilot AI commented on Aug 22, 2025:
The mask tensor shape should be (total_tokens, num_heads, max_seqlen, max_seqlen) to maintain consistency with attention mask semantics. A 3D mask (total_tokens, num_heads, max_seqlen) cannot properly represent attention relationships between query and key positions.
Suggested change:
-       mask = torch.ones((total_tokens, num_heads, max_seqlen), dtype=qkv.dtype, device=qkv.device)
+       mask = torch.ones((total_tokens, num_heads, max_seqlen, max_seqlen), dtype=qkv.dtype, device=qkv.device)
        mask = torch.ones((total_tokens, num_heads, max_seqlen), dtype=qkv.dtype, device=qkv.device)
    if bias is None:
-       bias = torch.zeros((batch_size, num_heads, max_seqlen, max_seqlen), dtype=qkv.dtype, device=qkv.device)
+       bias = torch.zeros((total_tokens, num_heads, max_seqlen), dtype=qkv.dtype, device=qkv.device)
Copilot AI commented on Aug 22, 2025:
The bias tensor shape should be (total_tokens, num_heads, max_seqlen, max_seqlen) to properly represent attention bias between all query-key pairs. A 3D bias tensor cannot encode relationships between different sequence positions.
Suggested change:
-       bias = torch.zeros((total_tokens, num_heads, max_seqlen), dtype=qkv.dtype, device=qkv.device)
+       bias = torch.zeros((total_tokens, num_heads, max_seqlen, max_seqlen), dtype=qkv.dtype, device=qkv.device)
Co-authored-by: LoserCheems <124847097+LoserCheems@users.noreply.github.com>
I've completed comprehensive tests for the varlen attention function bug fix! Three test files were added; all tests pass. Commit: ee3102f
The varlen attention functions were creating default mask and bias tensors with incorrect shapes, causing a RuntimeError when the C++ backend expected different dimensions.
Problem
When calling `flash_dmattn_varlen_func` (and related varlen functions) with the default `attn_mask=None` and `attn_bias=None`, a shape-mismatch RuntimeError was raised by the C++ backend.
Root Cause
The default mask and bias tensors were being created with shape `(batch_size, num_heads, max_seqlen_q, max_seqlen_k)`, while the C++ backend expects `(total_q, num_heads_k, max_seqlen_k)`, where (illustrated below):
- `total_q` = sum of all sequence lengths in the batch (first dimension of the query tensor)
- `num_heads_k` = number of key/value heads (second dimension of the key tensor)
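For concreteness, a small sketch (illustrative sizes, not the library's code) of how these quantities come straight from the inputs the varlen functions already receive:

```python
import torch

# Illustrative packed tensors; num_heads_k is deliberately smaller than num_heads
# here to show why the *key* tensor's head count is the relevant dimension (GQA case).
q = torch.randn(2304, 16, 64)   # (total_q, num_heads, head_dim)
k = torch.randn(2304, 8, 64)    # (total_k, num_heads_k, head_dim)
cu_seqlens_k = torch.tensor([0, 1024, 1792, 2304], dtype=torch.int32)

total_q = q.shape[0]                                               # 2304
num_heads_k = k.shape[1]                                           # 8
max_seqlen_k = int((cu_seqlens_k[1:] - cu_seqlens_k[:-1]).max())   # 1024

default_shape = (total_q, num_heads_k, max_seqlen_k)               # (2304, 8, 1024)
print(default_shape)
```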
Solution
Fixed the default tensor shape creation in three varlen functions (sketched below):
- `FlashDMAttnVarlenFunc`: now creates `(total_q, num_heads_k, max_seqlen_k)`
- `FlashDMAttnVarlenQKVPackedFunc`: now creates `(total_tokens, num_heads, max_seqlen)`
- `FlashDMAttnVarlenKVPackedFunc`: now creates `(total_q, num_heads_k, max_seqlen_k)`
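A minimal sketch of what the fixed defaults look like for the unpacked variant; `default_mask_bias` is a hypothetical helper for illustration, not the actual code from this repo:

```python
import torch

def default_mask_bias(q, k, max_seqlen_k):
    """Sketch of the fixed default construction for the unpacked variant:
    all-ones mask and all-zeros bias with the token-based shape."""
    total_q = q.shape[0]        # packed query tokens
    num_heads_k = k.shape[1]    # key/value heads
    shape = (total_q, num_heads_k, max_seqlen_k)
    mask = torch.ones(shape, dtype=q.dtype, device=q.device)
    bias = torch.zeros(shape, dtype=q.dtype, device=q.device)
    return mask, bias

q = torch.randn(2304, 16, 64)
k = torch.randn(2304, 16, 64)
mask, bias = default_mask_bias(q, k, max_seqlen_k=1024)
print(mask.shape, bias.shape)  # torch.Size([2304, 16, 1024]) torch.Size([2304, 16, 1024])
# The QKV-packed variant does the analogous thing with (total_tokens, num_heads, max_seqlen).
```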
Example
The bug report scenario now works correctly:
- Before: creates mask/bias with shape `(3, 16, 1024, 1024)` → RuntimeError
- After: creates mask/bias with shape `(2304, 16, 1024)` → success (reproduced in the sketch below)
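A hypothetical end-to-end reproduction of that scenario. Only `flash_dmattn_varlen_func`, `attn_mask`, and `attn_bias` come from this PR; the import path, keyword names (`cu_seqlens_q`, `max_seqlen_k`, ...), sequence lengths, and head_dim are assumptions in the style of the FlashAttention varlen API, so check the repo's actual signature before running.

```python
import torch
from flash_dmattn import flash_dmattn_varlen_func  # import path is an assumption

device, dtype = "cuda", torch.float16
seqlens = [1024, 768, 512]                 # example lengths summing to 2304 tokens
total_q = sum(seqlens)
num_heads, head_dim = 16, 64               # head_dim chosen for illustration

cu_seqlens = torch.zeros(len(seqlens) + 1, dtype=torch.int32, device=device)
cu_seqlens[1:] = torch.cumsum(torch.tensor(seqlens, device=device), dim=0)

q = torch.randn(total_q, num_heads, head_dim, dtype=dtype, device=device)
k = torch.randn_like(q)
v = torch.randn_like(q)

# With attn_mask/attn_bias left as None, the fix makes the internally created
# defaults (2304, 16, 1024) instead of (3, 16, 1024, 1024), so the call succeeds.
out = flash_dmattn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,   # keyword names assumed
    max_seqlen_q=max(seqlens), max_seqlen_k=max(seqlens),
    attn_mask=None, attn_bias=None,
)
print(out.shape)  # expected: torch.Size([2304, 16, 64])
```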
Tests Added
Added a comprehensive test suite to validate the fix; a minimal example of such a check is sketched below.
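The actual test file names are not reproduced above; as an illustration only, a minimal pytest-style check of the corrected default shape might look like this:

```python
import torch

def test_varlen_default_shapes():
    # Mirrors the bug-report scenario: 3 sequences of 1024/768/512 tokens, 16 heads.
    cu_seqlens = torch.tensor([0, 1024, 1792, 2304], dtype=torch.int32)
    total_q = int(cu_seqlens[-1])
    max_seqlen = int((cu_seqlens[1:] - cu_seqlens[:-1]).max())
    num_heads_k = 16

    mask = torch.ones((total_q, num_heads_k, max_seqlen))
    bias = torch.zeros((total_q, num_heads_k, max_seqlen))

    assert mask.shape == bias.shape == (2304, 16, 1024)
```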
Also updated the documentation to reflect the correct expected tensor shapes for all varlen functions.
Fixes #113.