[Example] Optimize sink attention forward via swizzled layout and report benchmark results #885
Conversation
…metrics

- Added `make_swizzled_layout` annotations for shared tensors in the `flashattn` function across MHA and GQA examples to optimize memory access patterns.
- Updated benchmark outputs to include speedup calculations comparing Triton and TileLang implementations.
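For context, a minimal sketch of the annotation pattern this commit describes. The kernel body, tile sizes, and the import path for `make_swizzled_layout` are illustrative assumptions, not the exact code from this PR:

```python
# Sketch only: a trivial copy kernel showing T.annotate_layout with
# make_swizzled_layout. Shapes, dtypes, and the import path are assumptions.
import tilelang
import tilelang.language as T
from tilelang.layout import make_swizzled_layout  # assumed import location

@tilelang.jit(out_idx=[-1])
def copy_with_swizzle(M=64, D=128, dtype="float16"):

    @T.prim_func
    def kernel(Q: T.Tensor((M, D), dtype), O: T.Tensor((M, D), dtype)):
        with T.Kernel(1, threads=128) as bx:
            Q_shared = T.alloc_shared([M, D], dtype)
            O_shared = T.alloc_shared([M, D], dtype)
            Q_local = T.alloc_fragment([M, D], dtype)
            # Swizzling permutes the shared-memory layout so the
            # tensor-core load/store patterns avoid bank conflicts.
            T.annotate_layout({
                Q_shared: make_swizzled_layout(Q_shared),
                O_shared: make_swizzled_layout(O_shared),
            })
            T.copy(Q, Q_shared)
            T.copy(Q_shared, Q_local)
            T.copy(Q_local, O_shared)
            T.copy(O_shared, O)

    return kernel
```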
…hmark results

- Introduced a new README.md file for the Attention Sink example, outlining the forward and backward algorithms, including the computation of `dsinks`.
- Provided benchmark results comparing performance metrics of the optimized implementation against Triton, highlighting speedup across various configurations.
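For readers skimming the thread, one standard derivation of `dsinks` (illustrative, assuming the GPT-OSS convention that each head's sink logit $s_h$ enters only the softmax denominator and carries no value vector; the README's actual exposition may differ):

$$p_{ij} = \frac{e^{x_{ij}}}{e^{s_h} + \sum_k e^{x_{ik}}}, \qquad O_i = \sum_j p_{ij} V_j.$$

Since only the denominator depends on $s_h$, we have $\partial O_i / \partial s_h = -p_{i,\mathrm{sink}}\, O_i$ with $p_{i,\mathrm{sink}} = e^{s_h} / \big(e^{s_h} + \sum_k e^{x_{ik}}\big)$, so

$$\mathrm{dsinks}_h = \sum_i \left\langle dO_i, \frac{\partial O_i}{\partial s_h} \right\rangle = -\sum_i p_{i,\mathrm{sink}}\, \Delta_i, \qquad \Delta_i = \mathrm{rowsum}(dO_i \odot O_i),$$

where $\Delta_i$ is the same per-row quantity a FlashAttention-style backward pass already computes.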
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough: Adds an Attention Sink README and applies swizzled shared-memory layout annotations to Q/K/V/O buffers in three example kernels; one example also renames a benchmark variable, prints speedup, and reduces default sequence lengths. No public APIs or core computation logic changed.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ✅ 3 checks passed
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run the repository's format checks before submitting. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Summary of Changes

Hello @Rachmanino, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the attention sink implementation by focusing on performance optimization and comprehensive documentation. It introduces swizzled memory layouts within TileLang kernels to improve memory access efficiency, refines benchmarking scripts to provide clearer performance comparisons between Triton and TileLang, and adds a detailed README.
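As background on why swizzled layouts improve memory access efficiency, here is a generic XOR-swizzle illustration; this is a common bank-conflict-avoidance scheme, not TileLang's actual `make_swizzled_layout` implementation:

```python
# Generic XOR swizzle: permute the column index within each row so that
# consecutive rows map the same logical column to different shared-memory
# banks. Illustrative only; TileLang's scheme may differ in detail.
def xor_swizzle(row: int, col: int, group: int = 8) -> int:
    return col ^ (row % group)

# Each row places its 8 columns in a distinct permutation of banks:
for row in range(4):
    print(row, [xor_swizzle(row, c) for c in range(8)])
```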
Code Review
This pull request optimizes the attention sink examples by applying a swizzled memory layout, which is a great performance enhancement. The addition of a detailed README.md with benchmark results is also very helpful for context and understanding. My review focuses on a few minor improvements for the documentation and code clarity: I've pointed out an incomplete sentence and a typo in the README, and suggested a variable rename in the benchmarking script to improve consistency and readability.
examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py (outdated review thread, resolved)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ned.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Actionable comments posted: 6
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- examples/attention_sink/README.md (1 hunks)
- examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py (3 hunks)
- examples/attention_sink/example_mha_sink_fwd_bhsd.py (2 hunks)
- examples/attention_sink/example_mha_sink_fwd_bhsd_wgmma_pipelined.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
- examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py (1)
  - tilelang/language/__init__.py (1): `annotate_layout` (105-143)
- examples/attention_sink/example_mha_sink_fwd_bhsd_wgmma_pipelined.py (1)
  - tilelang/language/__init__.py (1): `annotate_layout` (105-143)
- examples/attention_sink/example_mha_sink_fwd_bhsd.py (1)
  - tilelang/language/__init__.py (1): `annotate_layout` (105-143)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: format-check
- GitHub Check: bot-task
Actionable comments posted: 0
🧹 Nitpick comments (1)
examples/attention_sink/README.md (1)
3-3: Use descriptive link text for accessibility. Replace `[here]` with anchor text that tells readers what they will open; this also satisfies markdownlint rule MD059.

```diff
-We compare with an optimized version of the official Triton implementation at [here](https://github.com/openai/gpt-oss/blob/main/gpt_oss/triton/attention.py).
+We compare with an optimized version of the official Triton implementation available in the [OpenAI GPT-OSS repository](https://github.com/openai/gpt-oss/blob/main/gpt_oss/triton/attention.py).
```

Based on static analysis.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- examples/attention_sink/README.md (1 hunks)
- examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
- examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py (1)
  - tilelang/language/__init__.py (1): `annotate_layout` (105-143)
🪛 markdownlint-cli2 (0.18.1)
examples/attention_sink/README.md
3-3: Link text should be descriptive
(MD059, descriptive-link-text)
🔇 Additional comments (1)
examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py (1)
156-161: Fix `make_swizzled_layout` usage to avoid a runtime TypeError. `make_swizzled_layout` expects a plain shape tuple; passing the TileBuffer raises `'TileBuffer' object is not subscriptable` during JIT. Feed each buffer's `.shape` instead.

```diff
-            T.annotate_layout({
-                Q_shared: make_swizzled_layout(Q_shared),
-                K_shared: make_swizzled_layout(K_shared),
-                V_shared: make_swizzled_layout(V_shared),
-                O_shared: make_swizzled_layout(O_shared),
-            })
+            T.annotate_layout({
+                Q_shared: make_swizzled_layout(Q_shared.shape),
+                K_shared: make_swizzled_layout(K_shared.shape),
+                V_shared: make_swizzled_layout(V_shared.shape),
+                O_shared: make_swizzled_layout(O_shared.shape),
+            })
```
Actionable comments posted: 0
🧹 Nitpick comments (1)
examples/attention_sink/README.md (1)
3-3: Use descriptive link text instead of "here". "Here" as anchor text fails markdownlint MD059 and is uninformative for accessibility; please replace it with a descriptive phrase that names the resource.

Apply this diff:

```diff
-We compare with an optimized version of the official Triton implementation at [here](https://github.com/openai/gpt-oss/blob/main/gpt_oss/triton/attention.py).
+We compare with an optimized version of the official Triton implementation available at [the GPT-OSS attention kernel](https://github.com/openai/gpt-oss/blob/main/gpt_oss/triton/attention.py).
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- examples/attention_sink/README.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
examples/attention_sink/README.md
3-3: Link text should be descriptive
(MD059, descriptive-link-text)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: bot-task
This pull request introduces improvements and documentation updates for the attention sink implementation and its benchmarking. The main changes focus on enhancing memory layout for shared fragments in the TileLang kernels, updating benchmarking logic, and providing a detailed README to explain the algorithm and performance results.
Documentation:

- Added a new `README.md` to the `examples/attention_sink` directory, describing the attention sink algorithm, its forward and backward computations, and benchmarking results comparing Triton and TileLang implementations.

Performance and benchmarking improvements:

- Updated `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py` to separately report Triton and TileLang latencies and TFlops, and added a speedup calculation for easier comparison (a sketch of this calculation follows this list). Also changed the default sequence lengths to 2048 for both query and key/value.

Memory layout enhancements:

- Annotated the shared-memory buffers (`Q_shared`, `K_shared`, `V_shared`, `O_shared`) in all three example files (`example_gqa_sink_fwd_bhsd_wgmma_pipelined.py`, `example_mha_sink_fwd_bhsd.py`, and `example_mha_sink_fwd_bhsd_wgmma_pipelined.py`) using `make_swizzled_layout` and `T.annotate_layout` for improved memory access patterns.
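A minimal sketch of the speedup reporting described above; the helper name, head count, and FLOPs formula are assumptions, not the exact code from `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py`:

```python
# Hypothetical helper mirroring the reported metrics. The attention FLOPs
# formula (4 * batch * heads * seq_q * seq_kv * dim, i.e. two matmuls at
# 2 FLOPs per MAC) and all numbers below are illustrative assumptions.
def report_speedup(triton_ms: float, tilelang_ms: float, flops: float) -> None:
    triton_tflops = flops / (triton_ms * 1e-3) / 1e12
    tilelang_tflops = flops / (tilelang_ms * 1e-3) / 1e12
    print(f"Triton:   {triton_ms:.3f} ms, {triton_tflops:.2f} TFlops")
    print(f"TileLang: {tilelang_ms:.3f} ms, {tilelang_tflops:.2f} TFlops")
    print(f"Speedup:  {triton_ms / tilelang_ms:.2f}x")

# Made-up latencies at the new default sequence length of 2048:
report_speedup(triton_ms=1.20, tilelang_ms=0.95,
               flops=4 * 1 * 32 * 2048 * 2048 * 128)
```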
Summary by CodeRabbit

- New Features
- Performance
- Documentation