
Conversation

@tzj-fxz (Contributor) commented Oct 21, 2025

Summary by CodeRabbit

  • Refactor
    • Restructured backward pass kernel to use vectorized gradient accumulation, replacing sequential per-element operations with batched slice-based updates for improved computational efficiency.

@github-actions bot commented:

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai bot (Contributor) commented Oct 21, 2025

Walkthrough

The backward-pass kernel in a flash attention example is refactored so that atomic updates for the dQ, dV, and dK tensors use vectorized slice-based operations instead of per-element loops. The control flow remains unchanged, but the accumulation steps now operate on contiguous slices rather than issuing per-element atomic additions.

Changes

Cohort / File(s): Flash Attention GQA Backward Kernel Optimization (examples/flash_attention/example_gqa_bwd_tma_reduce_varlen.py)
Summary: Refactors dQ/dV/dK atomic accumulation from per-element loops to vectorized slice-based atomic_add operations; removes the inner loops over (i, d) for dQ updates and consolidates them into batched slice operations while preserving memory_order="release" semantics.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

The changes involve targeted refactoring of atomic operations and tensor-slicing patterns within a single example file. Review requires understanding of atomic semantics, vectorized memory operations, and tensor-indexing correctness, but the change is localized to a specific optimization path with no control-flow changes to reason about.

Suggested reviewers

  • chengyupku
  • LeiWang1999

Poem

🐰 Per-element loops once danced alone,
Now slices glide in vectorized zones,
Atomic whispers, contiguous streams,
Flash attention flows through faster dreams! ⚡

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The pull request title "[GQA] Add regional atomic add to slightly boost performance" directly aligns with the main change described in the summary. The PR refactors the atomic updates in the backward-pass kernel from per-element loops to vectorized slice-based updates, which the title appropriately calls a "regional atomic add." The title is concise, clear, and specific: it identifies what was added (a performance optimization technique), explains the purpose (a performance boost), and includes context with the GQA prefix. A developer scanning the commit history would clearly understand that this PR introduces a vectorized atomic-operations optimization for the Group Query Attention backward pass.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta): create a PR with unit tests, or post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cdc67fc and 32b0d41.

📒 Files selected for processing (1)
  • examples/flash_attention/example_gqa_bwd_tma_reduce_varlen.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Quick Lint
🔇 Additional comments (2)
examples/flash_attention/example_gqa_bwd_tma_reduce_varlen.py (2)

369-373: Excellent optimization using vectorized atomic_add for dQ.

The slice-based atomic operation efficiently replaces per-element loops, reducing the number of atomic operations from O(block_N × dim_qk) to O(1) per iteration. The slice indices are correct, and the memory_order="release" ensures proper synchronization.
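As a reference for the pattern being described, here is a minimal TileLang-style sketch of the before/after shape of the dQ update. It is a fragment of a kernel body, not the example's actual code: the names and index expressions (dQ, dq, q_offset, block_N, dim_qk) are assumptions.

```python
# Hypothetical sketch of the dQ refactor; names, shapes, and indices are assumptions.

# Before: one atomic add per (i, d) element of the tile.
for i, d in T.Parallel(block_N, dim_qk):
    T.atomic_add(dQ[q_offset + i, d], dq[i, d], memory_order="release")

# After: a single slice-based atomic add covering the whole tile, turning
# block_N * dim_qk scalar atomics into one vectorized operation per iteration.
T.atomic_add(dQ[q_offset:q_offset + block_N, :], dq, memory_order="release")
```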


375-384: Smart refactoring: accumulate locally, write once.

Moving dV and dK atomic operations outside the loop is an excellent optimization. The fragments are accumulated during the k_base loop iterations, then written to global memory with a single vectorized atomic_add per tensor. This significantly reduces atomic contention and improves performance.

The slice indices correctly align with the kernel grid dimensions, and bx // groups properly handles the grouped query attention layout.
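For illustration, a rough sketch of this "accumulate locally, write once" structure is shown below. It is a hedged fragment of a kernel body under assumed names (dv, dk, num_k_blocks, kv_offset, block_M, and the P/dS/Q/dO buffers); the real kernel's identifiers and layouts may differ.

```python
# Hypothetical sketch; fragment names, loop bounds, and indexing are assumptions.

# Inside the k_base loop: accumulate partial dV/dK results in register
# fragments only, so no global atomics sit on the inner-loop critical path.
for k_base in T.serial(num_k_blocks):
    T.gemm(P_cast, dO_shared, dv, transpose_A=True)   # dv += P^T @ dO
    T.gemm(dS_cast, Q_shared, dk, transpose_A=True)   # dk += dS^T @ Q

# After the loop: one vectorized atomic write per tensor.
kv_head = bx // groups  # map the query-head index bx to its shared K/V head
T.atomic_add(dV[kv_offset:kv_offset + block_M, kv_head, :], dv, memory_order="release")
T.atomic_add(dK[kv_offset:kv_offset + block_M, kv_head, :], dk, memory_order="release")
```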


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

@LeiWang1999 merged commit f003f37 into tile-ai:main on Oct 21, 2025
13 of 15 checks passed