
Conversation

@Rachmanino
Collaborator

@Rachmanino Rachmanino commented Sep 26, 2025

This pull request introduces improvements and documentation updates for the attention sink implementation and its benchmarking. The main changes focus on enhancing memory layout for shared fragments in the TileLang kernels, updating benchmarking logic, and providing a detailed README to explain the algorithm and performance results.

Documentation:

  • Added a comprehensive README.md to the examples/attention_sink directory, describing the attention sink algorithm, its forward and backward computations, and benchmarking results comparing Triton and TileLang implementations.

Performance and benchmarking improvements:

  • Updated benchmarking output in example_gqa_sink_fwd_bhsd_wgmma_pipelined.py to separately report Triton and TileLang latencies and TFlops, and added a speedup calculation for easier comparison. Also, changed the default sequence lengths to 2048 for both query and key/value.
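    For illustration, here is a minimal sketch of that reporting pattern; the function and variable names below are hypothetical, not the exact code in example_gqa_sink_fwd_bhsd_wgmma_pipelined.py.

def report_benchmark(total_flops: float, latency_triton: float, latency_tilelang: float) -> None:
    """Print latency, TFlops and speedup; latencies are assumed to be in milliseconds."""
    tflops_triton = total_flops / (latency_triton * 1e-3) / 1e12
    tflops_tilelang = total_flops / (latency_tilelang * 1e-3) / 1e12
    print(f"Triton:   {latency_triton:.3f} ms | {tflops_triton:.2f} TFlops")
    print(f"TileLang: {latency_tilelang:.3f} ms | {tflops_tilelang:.2f} TFlops")
    # Speedup > 1 means the TileLang kernel is faster than the Triton reference.
    print(f"Speedup: {latency_triton / latency_tilelang:.2f}x")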

Memory layout enhancements:

  • Applied swizzled layouts to shared fragments (Q_shared, K_shared, V_shared, O_shared) in all three example files (example_gqa_sink_fwd_bhsd_wgmma_pipelined.py, example_mha_sink_fwd_bhsd.py, and example_mha_sink_fwd_bhsd_wgmma_pipelined.py) using make_swizzled_layout and T.annotate_layout for improved memory access patterns.
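    As a rough illustration of that annotation pattern, here is a small stand-alone TileLang copy kernel. It is not taken from the attention examples; the kernel name, shapes, and the compile step are assumptions and may differ across TileLang versions. Only the T.annotate_layout / make_swizzled_layout usage mirrors the change in this PR.

import tilelang
import tilelang.language as T
from tilelang.layout import make_swizzled_layout


def build_copy_kernel(M=1024, N=1024, block_M=64, block_N=64, dtype="float16"):

    @T.prim_func
    def copy(A: T.Tensor((M, N), dtype), B: T.Tensor((M, N), dtype)):
        with T.Kernel(T.ceildiv(M, block_M), T.ceildiv(N, block_N), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_N), dtype)
            # Same pattern as in the attention examples: annotate the shared tile
            # with a swizzled layout so global<->shared copies avoid bank conflicts.
            T.annotate_layout({A_shared: make_swizzled_layout(A_shared)})
            T.copy(A[bx * block_M, by * block_N], A_shared)
            T.copy(A_shared, B[bx * block_M, by * block_N])

    return copy  # compile and launch with tilelang.compile(...) as usual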

Summary by CodeRabbit

  • New Features

    • Example scripts now print a "Speedup" metric comparing Triton and TileLang runs.
  • Performance

    • Added swizzled memory layout annotations in attention examples to enable faster kernel execution.
    • Reduced default sequence lengths to shorten example run times.
    • Updated benchmark calculations and outputs for clearer performance reporting.
  • Documentation

    • Added an Attention Sink README with forward/backward notes and benchmark results.

…metrics

- Added `make_swizzled_layout` annotations for shared tensors in the `flashattn` function across MHA and GQA examples to optimize memory access patterns.
- Updated benchmark outputs to include speedup calculations comparing Triton and TileLang implementations.
…hmark results

- Introduced a new README.md file for the Attention Sink example, outlining the forward and backward algorithms, including the computation of `dsinks` (see the sketch after this list).
- Provided benchmark results comparing performance metrics of the optimized implementation against Triton, highlighting speedup across various configurations.
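For reference, a sketch of the math behind that dsinks computation, assuming the gpt-oss-style formulation in which each head's sink logit enters only the softmax denominator (the notation here is ours, not necessarily the README's):

p_{ij} = \frac{e^{s_{ij}}}{e^{\mathrm{sink}} + \sum_k e^{s_{ik}}}, \qquad o_i = \sum_j p_{ij}\, v_j

\frac{\partial L}{\partial\, \mathrm{sink}} = -\sum_i p_{\mathrm{sink},i}\,\bigl(\mathrm{d}o_i \cdot o_i\bigr), \qquad p_{\mathrm{sink},i} = \frac{e^{\mathrm{sink}}}{e^{\mathrm{sink}} + \sum_k e^{s_{ik}}}

In words: the sink only scales the attention weights down, and its gradient for a head accumulates, over all query rows, the sink probability mass weighted by the dot product of the output gradient with the output.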
@coderabbitai
Contributor

coderabbitai bot commented Sep 26, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds an Attention Sink README and applies swizzled shared-memory layout annotations to Q/K/V/O buffers in three example kernels; one example also renames a benchmark variable, prints speedup, and reduces default sequence lengths. No public APIs or core computation logic changed.

Changes

  • Documentation: examples/attention_sink/README.md
    New README describing the Attention Sink forward/backward math, the benchmark environment, and results; documentation only.
  • Shared-memory swizzled layouts: examples/attention_sink/example_mha_sink_fwd_bhsd.py, examples/attention_sink/example_mha_sink_fwd_bhsd_wgmma_pipelined.py, examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py
    Added from tilelang.layout import make_swizzled_layout and T.annotate_layout calls to apply swizzled layouts to Q_shared, K_shared, V_shared, and O_shared before the shared-buffer copies; no changes to computation.
  • Benchmark / param tweaks: examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py
    Renamed latency to latency_triton, updated the TFLOPS calculation to use latency_triton, added a Speedup (Triton/Tilelang) printout, and changed the default seq_q/seq_kv from 4096 to 2048; minor naming/guard adjustments.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • LeiWang1999

Poem

I twitch my whiskers at each byte,
Swizzled tiles twirl in moonlit light.
Benchmarks hop with half the pace,
Triton greets Tilelang face to face.
Docs sip sinks — the rabbit smiles, then naps. 🥕🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The pull request title clearly summarizes the key technical changes by highlighting the optimization of sink attention forward with a swizzled layout and the addition of benchmark reporting. It remains concise and focused on the main update, avoids vague terms, and signals that it pertains to example code. The phrasing is specific enough for reviewers to immediately grasp the core impact without listing every affected file.
  • Docstring Coverage: ✅ Passed. No functions found in the changes; docstring coverage check skipped.

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀

@gemini-code-assist
Contributor

Summary of Changes

Hello @Rachmanino, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the attention sink implementation by focusing on performance optimization and comprehensive documentation. It introduces swizzled memory layouts within TileLang kernels to improve memory access efficiency, refines benchmarking scripts to provide clearer performance comparisons between Triton and TileLang, and adds a detailed README.md to explain the algorithm and present the observed performance gains. The overall impact is a more optimized and better-understood attention sink mechanism, demonstrating TileLang's performance advantages.

Highlights

  • Documentation: A new README.md has been added to the examples/attention_sink directory, providing a comprehensive explanation of the attention sink algorithm, its forward and backward computations, and detailed benchmark results comparing Triton and TileLang implementations.
  • Performance Benchmarking: The benchmarking logic in example_gqa_sink_fwd_bhsd_wgmma_pipelined.py has been updated to separately report Triton and TileLang latencies and TFlops, along with a calculated speedup. Default sequence lengths for query and key/value have been adjusted to 2048.
  • Memory Layout Optimization: Swizzled layouts are now applied to shared fragments (Q_shared, K_shared, V_shared, O_shared) in all three attention sink example files (example_gqa_sink_fwd_bhsd_wgmma_pipelined.py, example_mha_sink_fwd_bhsd.py, and example_mha_sink_fwd_bhsd_wgmma_pipelined.py) using make_swizzled_layout and T.annotate_layout to improve memory access patterns and overall performance.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request optimizes the attention sink examples by applying a swizzled memory layout, which is a great performance enhancement. The addition of a detailed README.md with benchmark results is also very helpful for context and understanding. My review focuses on a few minor improvements for the documentation and code clarity: I've pointed out an incomplete sentence and a typo in the README, and suggested a variable rename in the benchmarking script to improve consistency and readability.

Rachmanino and others added 2 commits September 27, 2025 01:25
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ned.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c861d8a and 7f90f4d.

📒 Files selected for processing (4)
  • examples/attention_sink/README.md (1 hunks)
  • examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py (3 hunks)
  • examples/attention_sink/example_mha_sink_fwd_bhsd.py (2 hunks)
  • examples/attention_sink/example_mha_sink_fwd_bhsd_wgmma_pipelined.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py (1)
tilelang/language/__init__.py (1)
  • annotate_layout (105-143)
examples/attention_sink/example_mha_sink_fwd_bhsd_wgmma_pipelined.py (1)
tilelang/language/__init__.py (1)
  • annotate_layout (105-143)
examples/attention_sink/example_mha_sink_fwd_bhsd.py (1)
tilelang/language/__init__.py (1)
  • annotate_layout (105-143)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: format-check
  • GitHub Check: bot-task

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
examples/attention_sink/README.md (1)

3-3: Use descriptive link text for accessibility. Replace [here] with anchor text that tells readers what they will open; this also satisfies markdownlint rule MD059.

-We compare with an optimized version of the official Triton implementation at [here](https://github.com/openai/gpt-oss/blob/main/gpt_oss/triton/attention.py).
+We compare with an optimized version of the official Triton implementation available in the [OpenAI GPT-OSS repository](https://github.com/openai/gpt-oss/blob/main/gpt_oss/triton/attention.py).

Based on static analysis

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7f90f4d and 5ddcd7f.

📒 Files selected for processing (2)
  • examples/attention_sink/README.md (1 hunks)
  • examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py (1)
tilelang/language/__init__.py (1)
  • annotate_layout (105-143)
🪛 markdownlint-cli2 (0.18.1)
examples/attention_sink/README.md

3-3: Link text should be descriptive

(MD059, descriptive-link-text)

🔇 Additional comments (1)
examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py (1)

156-161: Fix make_swizzled_layout usage to avoid runtime TypeError. make_swizzled_layout expects a plain shape tuple; passing the TileBuffer raises 'TileBuffer' object is not subscriptable during JIT. Feed each buffer’s .shape instead.

-            T.annotate_layout({
-                Q_shared: make_swizzled_layout(Q_shared),
-                K_shared: make_swizzled_layout(K_shared),
-                V_shared: make_swizzled_layout(V_shared),
-                O_shared: make_swizzled_layout(O_shared),
-            })
+            T.annotate_layout({
+                Q_shared: make_swizzled_layout(Q_shared.shape),
+                K_shared: make_swizzled_layout(K_shared.shape),
+                V_shared: make_swizzled_layout(V_shared.shape),
+                O_shared: make_swizzled_layout(O_shared.shape),
+            })

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
examples/attention_sink/README.md (1)

3-3: Use descriptive link text instead of “here”.

“Here” as anchor text fails markdownlint MD059 and is uninformative for accessibility; please replace it with a descriptive phrase that names the resource.

Apply this diff:

-We compare with an optimized version of the official Triton implementation at [here](https://github.com/openai/gpt-oss/blob/main/gpt_oss/triton/attention.py).
+We compare with an optimized version of the official Triton implementation available at [the GPT-OSS attention kernel](https://github.com/openai/gpt-oss/blob/main/gpt_oss/triton/attention.py).
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5ddcd7f and 08c0bf8.

📒 Files selected for processing (1)
  • examples/attention_sink/README.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
examples/attention_sink/README.md

3-3: Link text should be descriptive

(MD059, descriptive-link-text)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: bot-task

