
Conversation


Copilot AI commented Aug 21, 2025

The varlen attention functions were creating default mask and bias tensors with incorrect shapes, causing a RuntimeError because the C++ backend expects different dimensions.

Problem

When calling flash_dmattn_varlen_func (and related varlen functions) with default attn_mask=None and attn_bias=None, the following error occurred:

RuntimeError: bias must have shape (total_q, num_heads_k, max_seqlen_k)

Root Cause

The default mask and bias tensors were being created with shapes:

  • Incorrect: (batch_size, num_heads, max_seqlen_q, max_seqlen_k)
  • Expected by C++ backend: (total_q, num_heads_k, max_seqlen_k)

Where:

  • total_q = sum of all sequence lengths in the batch (first dimension of query tensor)
  • num_heads_k = number of key/value heads (second dimension of key tensor)
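
In the packed varlen layout both values can be read directly off the input tensors, for example (a sketch; variable names are illustrative):

# query: (total_q, num_heads, head_dim), key/value: (total_k, num_heads_k, head_dim)
total_q = query.shape[0]      # sum of the per-sequence lengths in the batch
num_heads_k = key.shape[1]    # key/value heads; may be fewer than query heads under GQA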

Solution

Fixed the default tensor shape creation in three varlen functions:

  1. FlashDMAttnVarlenFunc: Now creates (total_q, num_heads_k, max_seqlen_k)
  2. FlashDMAttnVarlenQKVPackedFunc: Now creates (total_tokens, num_heads, max_seqlen)
  3. FlashDMAttnVarlenKVPackedFunc: Now creates (total_q, num_heads_k, max_seqlen_k)
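
Putting it together, a hypothetical helper that mirrors what the fixed defaults amount to for FlashDMAttnVarlenFunc (a sketch based on the shapes listed above, not the library's actual code; mask defaults to ones, bias to zeros):

import torch

def default_varlen_mask_bias(query, key, max_seqlen_k):
    """Build default attn_mask/attn_bias shaped the way the C++ backend expects."""
    total_q, num_heads_k = query.shape[0], key.shape[1]
    attn_mask = torch.ones((total_q, num_heads_k, max_seqlen_k),
                           dtype=query.dtype, device=query.device)
    attn_bias = torch.zeros((total_q, num_heads_k, max_seqlen_k),
                            dtype=query.dtype, device=query.device)
    return attn_mask, attn_bias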

Example

The bug report scenario now works correctly:

import torch
# assuming the varlen entry point is exported by the flash_dmattn package
from flash_dmattn import flash_dmattn_varlen_func

B = 3                  # batch size (number of sequences)
seq_lens = [512, 1024, 768]
T = sum(seq_lens)      # 2304 tokens packed along the first dimension
H, D = 16, 64          # attention heads, head dimension

q = torch.randn(T, H, D, device='cuda', dtype=torch.bfloat16)
k = torch.randn(T, H, D, device='cuda', dtype=torch.bfloat16)
v = torch.randn(T, H, D, device='cuda', dtype=torch.bfloat16)
# cumulative sequence offsets [0, 512, 1536, 2304]
# note: some varlen kernels require int32 offsets; cast with .to(torch.int32) if needed
cu = torch.tensor([0] + seq_lens, device='cuda').cumsum(0)

# This now works without RuntimeError:
output = flash_dmattn_varlen_func(
    query=q, key=k, value=v,
    cu_seqlens_q=cu, cu_seqlens_k=cu,
    max_seqlen_q=max(seq_lens), max_seqlen_k=max(seq_lens),
    is_causal=True
)

Before: Creates mask/bias with shape (3, 16, 1024, 1024) → RuntimeError
After: Creates mask/bias with shape (2304, 16, 1024) → Success
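
These shapes also account for the memory savings reported in the tests below; counting elements of one default tensor in this scenario:

before = 3 * 16 * 1024 * 1024   # 50,331,648 elements
after = 2304 * 16 * 1024        # 37,748,736 elements
after / before                  # 0.75, i.e. a 25% reduction per default mask/bias tensor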

Tests Added

Added comprehensive test suite to validate the fix:

  • Shape validation tests - validates tensor shape creation logic for all scenarios
  • Integration tests - unittest-based tests covering all three varlen functions
  • Demo script - demonstrates the bug fix with before/after comparisons
  • Memory efficiency validation - shows ~25% memory reduction in the bug scenario
  • GQA support - validates Group Query Attention scenarios work correctly
  • Edge cases - covers single sequences, short sequences, and various head counts
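
For the GQA case in particular, the defaults follow the key/value head count rather than the query head count, which can be checked without launching a kernel (a minimal CPU-only sketch; shapes and names are illustrative):

import torch

q = torch.empty(2304, 16, 64)   # 16 query heads, packed varlen layout
k = torch.empty(2304, 4, 64)    # 4 key/value heads (GQA)
total_q, num_heads_k, max_seqlen_k = q.shape[0], k.shape[1], 1024

default_mask = torch.ones((total_q, num_heads_k, max_seqlen_k))
assert default_mask.shape == (2304, 4, 1024)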

Also updated the documentation to reflect the correct expected tensor shapes for all varlen functions.

Fixes #113.



Co-authored-by: LoserCheems <124847097+LoserCheems@users.noreply.github.com>
Copilot AI changed the title from "[WIP] [BUG] varlen example mask and bias wrong shapes" to "Fix varlen mask and bias tensor shapes for all varlen attention functions" Aug 21, 2025
Copilot AI requested a review from LoserCheems August 21, 2025 12:43
Copilot finished work on behalf of LoserCheems August 21, 2025 12:43
LoserCheems (Collaborator) commented:

Good work @copilot, let's complete some tests

LoserCheems marked this pull request as ready for review August 22, 2025 11:03
Copilot AI review requested due to automatic review settings August 22, 2025 11:03

Copilot AI left a comment


Pull Request Overview

This PR fixes a critical bug in varlen attention functions where default mask and bias tensors were created with incorrect shapes, causing a RuntimeError because the C++ backend expects different dimensions. The fix ensures that default tensors match the expected shapes for variable-length sequence processing.

  • Updates tensor shape creation from batch-based to token-based dimensions
  • Corrects shape calculation to use total_q/total_tokens and num_heads_k instead of batch_size and num_heads
  • Updates documentation to reflect the correct expected tensor shapes


  is_grad = is_grad_enabled and qkv.requires_grad
  if mask is None:
-     mask = torch.ones((batch_size, num_heads, max_seqlen, max_seqlen), dtype=qkv.dtype, device=qkv.device)
+     mask = torch.ones((total_tokens, num_heads, max_seqlen), dtype=qkv.dtype, device=qkv.device)

Copilot AI Aug 22, 2025


The mask tensor shape should be (total_tokens, num_heads, max_seqlen, max_seqlen) to maintain consistency with attention mask semantics. A 3D mask (total_tokens, num_heads, max_seqlen) cannot properly represent attention relationships between query and key positions.

Suggested change:
-     mask = torch.ones((total_tokens, num_heads, max_seqlen), dtype=qkv.dtype, device=qkv.device)
+     mask = torch.ones((total_tokens, num_heads, max_seqlen, max_seqlen), dtype=qkv.dtype, device=qkv.device)

      mask = torch.ones((total_tokens, num_heads, max_seqlen), dtype=qkv.dtype, device=qkv.device)
  if bias is None:
-     bias = torch.zeros((batch_size, num_heads, max_seqlen, max_seqlen), dtype=qkv.dtype, device=qkv.device)
+     bias = torch.zeros((total_tokens, num_heads, max_seqlen), dtype=qkv.dtype, device=qkv.device)

Copilot AI Aug 22, 2025


The bias tensor shape should be (total_tokens, num_heads, max_seqlen, max_seqlen) to properly represent attention bias between all query-key pairs. A 3D bias tensor cannot encode relationships between different sequence positions.

Suggested change:
-     bias = torch.zeros((total_tokens, num_heads, max_seqlen), dtype=qkv.dtype, device=qkv.device)
+     bias = torch.zeros((total_tokens, num_heads, max_seqlen, max_seqlen), dtype=qkv.dtype, device=qkv.device)

Co-authored-by: LoserCheems <124847097+LoserCheems@users.noreply.github.com>

Copilot AI commented Aug 22, 2025

Good work @copilot, let's complete some tests

I've completed comprehensive tests for the varlen attention function bug fix! Added three test files:

  1. Shape validation tests (test_varlen_shapes.py) - validates the tensor shape creation logic
  2. Integration tests (test_varlen_integration.py) - unittest-based tests covering all varlen functions
  3. Demo script (demo_varlen_fix.py) - demonstrates the bug fix with before/after comparisons

The tests validate:
✅ Bug #113 is fixed - no more RuntimeError about bias shapes
✅ All three varlen functions create correct default tensor shapes
✅ Memory usage improved by ~25% in the bug scenario
✅ GQA scenarios work correctly
✅ All edge cases covered

All tests pass and can be run with python test_varlen_*.py and python demo_varlen_fix.py.

Commit: ee3102f

Copilot finished work on behalf of LoserCheems August 22, 2025 11:14
LoserCheems merged commit 11e8602 into main Aug 23, 2025
