
Conversation

@Rachmanino
Collaborator

@Rachmanino Rachmanino commented Sep 26, 2025

This pull request introduces improvements and documentation updates for the attention sink implementation and its benchmarking. The main changes focus on enhancing memory layout for shared fragments in the TileLang kernels, updating benchmarking logic, and providing a detailed README to explain the algorithm and performance results.

Documentation:

  • Added a comprehensive README.md to the examples/attention_sink directory, describing the attention sink algorithm, its forward and backward computations, and benchmarking results comparing Triton and TileLang implementations.

Performance and benchmarking improvements:

  • Updated benchmarking output in example_gqa_sink_fwd_bhsd_wgmma_pipelined.py to separately report Triton and TileLang latencies and TFlops, and added a speedup calculation for easier comparison. Also, changed the default sequence lengths to 2048 for both query and key/value.
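    For illustration, here is a minimal sketch of that reporting pattern; the function and variable names below are hypothetical, not the exact code in example_gqa_sink_fwd_bhsd_wgmma_pipelined.py.

def report_benchmark(total_flops: float, latency_triton: float, latency_tilelang: float) -> None:
    """Print latency, TFlops and speedup; latencies are assumed to be in milliseconds."""
    tflops_triton = total_flops / (latency_triton * 1e-3) / 1e12
    tflops_tilelang = total_flops / (latency_tilelang * 1e-3) / 1e12
    print(f"Triton:   {latency_triton:.3f} ms | {tflops_triton:.2f} TFlops")
    print(f"TileLang: {latency_tilelang:.3f} ms | {tflops_tilelang:.2f} TFlops")
    # Speedup > 1 means the TileLang kernel is faster than the Triton reference.
    print(f"Speedup: {latency_triton / latency_tilelang:.2f}x")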

Memory layout enhancements:

  • Applied swizzled layouts to shared fragments (Q_shared, K_shared, V_shared, O_shared) in all three example files (example_gqa_sink_fwd_bhsd_wgmma_pipelined.py, example_mha_sink_fwd_bhsd.py, and example_mha_sink_fwd_bhsd_wgmma_pipelined.py) using make_swizzled_layout and T.annotate_layout for improved memory access patterns.
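    As a rough illustration of that annotation pattern, here is a small stand-alone TileLang copy kernel. It is not taken from the attention examples; the kernel name, shapes, and the compile step are assumptions and may differ across TileLang versions. Only the T.annotate_layout / make_swizzled_layout usage mirrors the change in this PR.

import tilelang
import tilelang.language as T
from tilelang.layout import make_swizzled_layout


def build_copy_kernel(M=1024, N=1024, block_M=64, block_N=64, dtype="float16"):

    @T.prim_func
    def copy(A: T.Tensor((M, N), dtype), B: T.Tensor((M, N), dtype)):
        with T.Kernel(T.ceildiv(M, block_M), T.ceildiv(N, block_N), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_N), dtype)
            # Same pattern as in the attention examples: annotate the shared tile
            # with a swizzled layout so global<->shared copies avoid bank conflicts.
            T.annotate_layout({A_shared: make_swizzled_layout(A_shared)})
            T.copy(A[bx * block_M, by * block_N], A_shared)
            T.copy(A_shared, B[bx * block_M, by * block_N])

    return copy  # compile and launch with tilelang.compile(...) as usual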

Summary by CodeRabbit

  • New Features

    • Example scripts now print a "Speedup" metric comparing Triton and TileLang runs.
  • Performance

    • Added swizzled memory layout annotations in attention examples to enable faster kernel execution.
    • Reduced default sequence lengths to shorten example run times.
    • Updated benchmark calculations and outputs for clearer performance reporting.
  • Documentation

    • Added an Attention Sink README with forward/backward notes and benchmark results.

…metrics

- Added `make_swizzled_layout` annotations for shared tensors in the `flashattn` function across MHA and GQA examples to optimize memory access patterns.
- Updated benchmark outputs to include speedup calculations comparing Triton and TileLang implementations.
…hmark results

- Introduced a new README.md file for the Attention Sink example, outlining the forward and backward algorithms, including the computation of `dsinks` (see the sketch after this list).
- Provided benchmark results comparing performance metrics of the optimized implementation against Triton, highlighting speedup across various configurations.
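For reference, a sketch of the math behind that dsinks computation, assuming the gpt-oss-style formulation in which each head's sink logit enters only the softmax denominator (the notation here is ours, not necessarily the README's):

p_{ij} = \frac{e^{s_{ij}}}{e^{\mathrm{sink}} + \sum_k e^{s_{ik}}}, \qquad o_i = \sum_j p_{ij}\, v_j

\frac{\partial L}{\partial\, \mathrm{sink}} = -\sum_i p_{\mathrm{sink},i}\,\bigl(\mathrm{d}o_i \cdot o_i\bigr), \qquad p_{\mathrm{sink},i} = \frac{e^{\mathrm{sink}}}{e^{\mathrm{sink}} + \sum_k e^{s_{ik}}}

In words: the sink only scales the attention weights down, and its gradient for a head accumulates, over all query rows, the sink probability mass weighted by the dot product of the output gradient with the output.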
@coderabbitai
Contributor

coderabbitai bot commented Sep 26, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds an Attention Sink README and applies swizzled shared-memory layout annotations to Q/K/V/O buffers in three example kernels; one example also renames a benchmark variable, prints speedup, and reduces default sequence lengths. No public APIs or core computation logic changed.

Changes

  • Documentation: examples/attention_sink/README.md
    New README describing the Attention Sink forward/backward math, the benchmark environment, and results; documentation only.
  • Shared-memory swizzled layouts: examples/attention_sink/example_mha_sink_fwd_bhsd.py, examples/attention_sink/example_mha_sink_fwd_bhsd_wgmma_pipelined.py, examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py
    Added from tilelang.layout import make_swizzled_layout and T.annotate_layout calls to apply swizzled layouts to Q_shared, K_shared, V_shared, and O_shared before the shared-buffer copies; no changes to computation.
  • Benchmark / param tweaks: examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py
    Renamed latency to latency_triton, updated the TFLOPS calculation to use latency_triton, added a Speedup (Triton/Tilelang) printout, and changed the default seq_q/seq_kv from 4096 to 2048; minor naming/guard adjustments.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • LeiWang1999

Poem

I twitch my whiskers at each byte,
Swizzled tiles twirl in moonlit light.
Benchmarks hop with half the pace,
Triton greets Tilelang face to face.
Docs sip sinks — the rabbit smiles, then naps. 🥕🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The pull request title clearly summarizes the key technical changes by highlighting the optimization of sink attention forward with a swizzled layout and the addition of benchmark reporting. It remains concise and focused on the main update, avoids vague terms, and signals that it pertains to example code. The phrasing is specific enough for reviewers to immediately grasp the core impact without listing every affected file.
  • Docstring Coverage: ✅ Passed. No functions found in the changes; docstring coverage check skipped.

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀

@gemini-code-assist
Contributor

Summary of Changes

Hello @Rachmanino, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the attention sink implementation by focusing on performance optimization and comprehensive documentation. It introduces swizzled memory layouts within TileLang kernels to improve memory access efficiency, refines benchmarking scripts to provide clearer performance comparisons between Triton and TileLang, and adds a detailed README.md to explain the algorithm and present the observed performance gains. The overall impact is a more optimized and better-understood attention sink mechanism, demonstrating TileLang's performance advantages.

Highlights

  • Documentation: A new README.md has been added to the examples/attention_sink directory, providing a comprehensive explanation of the attention sink algorithm, its forward and backward computations, and detailed benchmark results comparing Triton and TileLang implementations.
  • Performance Benchmarking: The benchmarking logic in example_gqa_sink_fwd_bhsd_wgmma_pipelined.py has been updated to separately report Triton and TileLang latencies and TFlops, along with a calculated speedup. Default sequence lengths for query and key/value have been adjusted to 2048.
  • Memory Layout Optimization: Swizzled layouts are now applied to shared fragments (Q_shared, K_shared, V_shared, O_shared) in all three attention sink example files (example_gqa_sink_fwd_bhsd_wgmma_pipelined.py, example_mha_sink_fwd_bhsd.py, and example_mha_sink_fwd_bhsd_wgmma_pipelined.py) using make_swizzled_layout and T.annotate_layout to improve memory access patterns and overall performance.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request optimizes the attention sink examples by applying a swizzled memory layout, which is a great performance enhancement. The addition of a detailed README.md with benchmark results is also very helpful for context and understanding. My review focuses on a few minor improvements for the documentation and code clarity: I've pointed out an incomplete sentence and a typo in the README, and suggested a variable rename in the benchmarking script to improve consistency and readability.

Rachmanino and others added 2 commits September 27, 2025 01:25
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ned.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c861d8a and 7f90f4d.

📒 Files selected for processing (4)
  • examples/attention_sink/README.md (1 hunks)
  • examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py (3 hunks)
  • examples/attention_sink/example_mha_sink_fwd_bhsd.py (2 hunks)
  • examples/attention_sink/example_mha_sink_fwd_bhsd_wgmma_pipelined.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py (1)
tilelang/language/__init__.py (1)
  • annotate_layout (105-143)
examples/attention_sink/example_mha_sink_fwd_bhsd_wgmma_pipelined.py (1)
tilelang/language/__init__.py (1)
  • annotate_layout (105-143)
examples/attention_sink/example_mha_sink_fwd_bhsd.py (1)
tilelang/language/__init__.py (1)
  • annotate_layout (105-143)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: format-check
  • GitHub Check: bot-task

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
examples/attention_sink/README.md (1)

3-3: Use descriptive link text for accessibility. Replace [here] with anchor text that tells readers what they will open; this also satisfies markdownlint rule MD059.

-We compare with an optimized version of the official Triton implementation at [here](https://github.com/openai/gpt-oss/blob/main/gpt_oss/triton/attention.py).
+We compare with an optimized version of the official Triton implementation available in the [OpenAI GPT-OSS repository](https://github.com/openai/gpt-oss/blob/main/gpt_oss/triton/attention.py).

Based on static analysis

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7f90f4d and 5ddcd7f.

📒 Files selected for processing (2)
  • examples/attention_sink/README.md (1 hunks)
  • examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py (1)
tilelang/language/__init__.py (1)
  • annotate_layout (105-143)
🪛 markdownlint-cli2 (0.18.1)
examples/attention_sink/README.md

3-3: Link text should be descriptive

(MD059, descriptive-link-text)

🔇 Additional comments (1)
examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py (1)

156-161: Fix make_swizzled_layout usage to avoid runtime TypeError. make_swizzled_layout expects a plain shape tuple; passing the TileBuffer raises 'TileBuffer' object is not subscriptable during JIT. Feed each buffer’s .shape instead.

-            T.annotate_layout({
-                Q_shared: make_swizzled_layout(Q_shared),
-                K_shared: make_swizzled_layout(K_shared),
-                V_shared: make_swizzled_layout(V_shared),
-                O_shared: make_swizzled_layout(O_shared),
-            })
+            T.annotate_layout({
+                Q_shared: make_swizzled_layout(Q_shared.shape),
+                K_shared: make_swizzled_layout(K_shared.shape),
+                V_shared: make_swizzled_layout(V_shared.shape),
+                O_shared: make_swizzled_layout(O_shared.shape),
+            })

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
examples/attention_sink/README.md (1)

3-3: Use descriptive link text instead of “here”.

“Here” as anchor text fails markdownlint MD059 and is uninformative for accessibility; please replace it with a descriptive phrase that names the resource.

Apply this diff:

-We compare with an optimized version of the official Triton implementation at [here](https://github.com/openai/gpt-oss/blob/main/gpt_oss/triton/attention.py).
+We compare with an optimized version of the official Triton implementation available at [the GPT-OSS attention kernel](https://github.com/openai/gpt-oss/blob/main/gpt_oss/triton/attention.py).
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5ddcd7f and 08c0bf8.

📒 Files selected for processing (1)
  • examples/attention_sink/README.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
examples/attention_sink/README.md

3-3: Link text should be descriptive

(MD059, descriptive-link-text)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: bot-task

