
Conversation

@Paran0idy
Contributor

@Paran0idy Paran0idy commented Sep 22, 2025

  • Add a new emitter for weight-preshuffle MFMA.

Summary by CodeRabbit

  • New Features

    • Enhanced AMD MFMA preshuffle GEMM path with streamlined data flow for improved stability and potential performance gains.
    • Added an option to load matrix B directly from global memory in preshuffle scenarios.
    • Expanded supported data types by removing restrictive runtime checks.
  • Tests

    • Added coverage for int8 GEMM configurations (e.g., 128x256x256) and k-pack variations.
    • Extended tests to validate the new B loading option and updated preshuffle behavior.

@coderabbitai
Contributor

coderabbitai bot commented Sep 22, 2025

Walkthrough

Adds two new AMD MFMA intrinsic tests, introduces a preshuffle-aware matmul path with a new b_g2l_load option in tests, and extends MFMA intrinsics with preshuffle-enabled emitters and signatures, including a new MatrixCorePreshuffleIntrinEmitter and modified ldmatrix load paths and loop structure.

Changes

Cohort / File(s): Summary of Changes

  • AMD GEMM MFMA intrinsic tests (testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py)
    Added two test cases to test_assert_tl_matmul_correctness: int8→int32 with b_transposed=False, and the same config with k_pack=2.
  • AMD GEMM MFMA preshuffle tests (testing/python/amd/test_tilelang_gemm_mfma_preshuffle.py)
    Switched to MatrixCorePreshuffleIntrinEmitter; added tl_matmul parameter b_g2l_load; reworked tiling (warp tiles, chunk) and the pipeline loop via num_ko/num_ki; simplified stores; removed dtype assertions and cache_write_shared/C_shared; updated load paths (A S2L, B S2L or G2L). Adjusted helper/test signatures and cases.
  • MFMA intrinsics generator (tilelang/intrinsics/mfma_macro_generator.py)
    Extended MatrixCoreIntrinEmitter to handle a_preshuffle/b_preshuffle; added _initialize_preshuffle; refactored ldmatrix_a/b to branch on preshuffle with new global/shared loaders; simplified the non-preshuffle B path; introduced a new MatrixCorePreshuffleIntrinEmitter with preshuffle-aware ldmatrix implementations and constructor.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant T as Test
  participant TL as tl_matmul
  participant EM as MatrixCorePreshuffleIntrinEmitter
  participant A as A Loader (S2L)
  participant B as B Loader (S2L/G2L)
  participant P as Pipelined(num_ko,num_ki)
  participant C as Store (stmatrix)

  T->>TL: call tl_matmul(..., b_g2l_load)
  TL->>EM: construct emitter (a_preshuffle,b_preshuffle,k_pack,...)
  TL->>P: iterate over num_ko (outer) and num_ki (inner)
  loop per ki
    P->>A: ldmatrix_a (shared→local)
    alt b_g2l_load == true
      P->>B: ldmatrix_b (global→local)
    else
      P->>B: ldmatrix_b (shared→local)
    end
    P->>EM: mma (accumulate)
  end
  TL->>C: stmatrix with pid_m/pid_n
  C-->>T: completion
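As a rough mental model of the Pipelined(num_ko, num_ki) loop nest in the diagram (plain Python with toy sizes, not the TileLang kernel; the real code derives these extents from block/warp/chunk configuration):

```python
# Toy blocked matmul: K is split into num_ko outer stages (one shared-memory
# refill each) and num_ki inner micro-steps (one mfma accumulate each).
# All sizes below are invented for illustration.
M, N, K = 4, 4, 16
CHUNK, MICRO_K = 8, 4            # num_ko = K // CHUNK, num_ki = CHUNK // MICRO_K

def blocked_matmul(A, B):
    """Accumulate C over an outer ko loop and an inner ki loop."""
    C = [[0] * N for _ in range(M)]
    num_ko, num_ki = K // CHUNK, CHUNK // MICRO_K
    for ko in range(num_ko):          # outer: one pipeline stage per ko
        for ki in range(num_ki):      # inner: one micro-step per ki
            k0 = ko * CHUNK + ki * MICRO_K
            for m in range(M):
                for n in range(N):
                    # the k-slice plays the role of the A/B fragments loaded
                    # by ldmatrix_a (S2L) and ldmatrix_b (S2L or G2L)
                    for k in range(k0, k0 + MICRO_K):
                        C[m][n] += A[m][k] * B[k][n]
    return C

A = [[m + k for k in range(K)] for m in range(M)]
B = [[1 if k == n else 0 for n in range(N)] for k in range(K)]
ref = [[sum(A[m][k] * B[k][n] for k in range(K)) for n in range(N)] for m in range(M)]
assert blocked_matmul(A, B) == ref
```

The split makes the double-buffering point visible: only the outer ko loop needs a new shared-memory stage, while the inner ki loop reuses it.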

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • LeiWang1999

Poem

I hop through tiles where matrices flow,
New preshuffle winds begin to blow—
A loads from shared, B may glide from G2L,
Pipelined loops drum a rhythmic spell.
With mfma sparks, I thump in delight,
Storing C neat—carrots coded right. 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title "[AMD] refactor MatrixCoreIntrinEmitter" is concise and directly describes the primary change in the diff (a refactor of MatrixCoreIntrinEmitter, including adding preshuffle handling and introducing MatrixCorePreshuffleIntrinEmitter), and aligns with the PR objectives and modified files.




@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀

@gemini-code-assist
Contributor

Summary of Changes

Hello @Paran0idy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a specialized intrinsic emitter, MatrixCorePreshuffleIntrinEmitter, designed to optimize Matrix-FMA (MFMA) operations by efficiently handling preshuffled weight data. It refactors the core matrix multiplication logic to support direct global-to-local data loading for preshuffled matrices and streamlines shared memory usage. The accompanying test suite has been updated with new parameters and expanded test cases to validate these performance enhancements and new data flow paths.

Highlights

  • New Emitter for Preshuffled MFMA: Introduced MatrixCorePreshuffleIntrinEmitter to specifically handle weight preshuffling in MFMA operations, inheriting from the base MatrixCoreIntrinEmitter.
  • Optimized Data Loading: The new emitter overrides ldmatrix_a and ldmatrix_b to support loading preshuffled data directly from global memory (G2L) or shared memory (S2L) with appropriate indexing.
  • Simplified Base Emitter: The original MatrixCoreIntrinEmitter's _warp_ldmatrix_b method was simplified by removing the now-redundant preshuffle-specific logic, centralizing this behavior in the new emitter.
  • Updated GEMM Test Infrastructure: The tl_matmul function and assert_tl_matmul_correctness in the test suite were updated to include a b_g2l_load parameter, allowing for testing of global-to-local loads for preshuffled B matrices.
  • Adjusted Preshuffle Parameters: Modified warp_row_tiles, warp_col_tiles, and chunk values within the preshuffle test configuration for better optimization.
  • Streamlined Shared Memory Management: Removed C_shared allocation and simplified the stmatrix operation to directly store results from local to global memory, removing an intermediate shared memory step.
  • Expanded Test Coverage: Added new test cases to test_tilelang_gemm_mfma_intrinsic.py and updated existing preshuffle tests in test_tilelang_gemm_mfma_preshuffle.py with larger dimensions and k_pack variations.
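A minimal sketch of the subclass-override shape described above (plain Python; only the class and method names mirror the PR, the bodies are invented stand-ins, not the real TileLang implementation):

```python
# Illustrative only: the base emitter owns the plain load path; the preshuffle
# subclass overrides ldmatrix_b and falls back to super() when preshuffle is
# off, choosing G2L vs S2L from the presence of pid_m/pid_n.
class MatrixCoreIntrinEmitter:
    def __init__(self, b_preshuffle=False):
        self.b_preshuffle = b_preshuffle

    def ldmatrix_b(self, B_local, B_buf, ki, rk):
        return "b_s2l"  # plain shared->local load path (stand-in)

class MatrixCorePreshuffleIntrinEmitter(MatrixCoreIntrinEmitter):
    def __init__(self, a_preshuffle=False, b_preshuffle=False):
        super().__init__(b_preshuffle)
        self.a_preshuffle = a_preshuffle

    def ldmatrix_b(self, B_local, B_buf, ki, rk, pid_m=None, pid_n=None):
        if not self.b_preshuffle:
            # Non-preshuffle case: defer to the base emitter.
            return super().ldmatrix_b(B_local, B_buf, ki, rk)
        # pid_m/pid_n present => global->local (G2L), else shared->local (S2L)
        is_global = pid_m is not None and pid_n is not None
        return "b_g2l" if is_global else "b_s2l_preshuffled"

em = MatrixCorePreshuffleIntrinEmitter(b_preshuffle=True)
assert em.ldmatrix_b(None, None, 0, 0, pid_m=0, pid_n=0) == "b_g2l"
```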

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the MatrixCoreIntrinEmitter to support weight preshuffling by introducing a new MatrixCorePreshuffleIntrinEmitter class. The changes are well-structured, moving specialized logic into a subclass. I've provided a few suggestions to improve code quality, such as removing duplicated code by using inheritance properly, removing leftover debug statements, and cleaning up test cases. I also noted the removal of some explicit type checks which could be reconsidered for better error reporting.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (1)

214-214: Fix the missing dot in the method call

There's a typo in the A.Tto method call - it should be A.T.to.

-        ref_c = torch.matmul(A.Tto(torch.float32),
+        ref_c = torch.matmul(A.T.to(torch.float32),
testing/python/amd/test_tilelang_gemm_mfma_preshuffle.py (1)

248-248: Fix the missing dot in method call

Same typo as in the other test file - A.Tto should be A.T.to.

-        ref_c = torch.matmul(A.Tto(torch.float32),
+        ref_c = torch.matmul(A.T.to(torch.float32),
🧹 Nitpick comments (5)
tilelang/intrinsics/mfma_macro_generator.py (5)

56-56: Consider better naming for the preshuffle parameter

The parameter b_preshuffle in the base class constructor might cause confusion since there's also a_preshuffle in the derived class. Consider either adding a_preshuffle to the base class or documenting why only B preshuffle is needed here.


146-148: Initialize preshuffle attributes before use

The _initialize_b_preshuffle method references self.b_preshuffle but doesn't handle the case where the attribute might not exist. Consider initializing it first.

 def _initialize_b_preshuffle(self, b_preshuffle: Optional[bool] = False):
     if b_preshuffle is not None:
+        self.b_preshuffle = False  # Initialize with default
         self.b_preshuffle = b_preshuffle

Or better yet:

 def _initialize_b_preshuffle(self, b_preshuffle: Optional[bool] = False):
-    if b_preshuffle is not None:
-        self.b_preshuffle = b_preshuffle
+    self.b_preshuffle = b_preshuffle if b_preshuffle is not None else False

300-300: Consider renaming ambiguous variable names

Variables l and r could be confused with numbers (1) or be unclear in meaning. Consider more descriptive names like left_idx, right_idx or row_base, col_base.

-                        l, r = (
+                        row_base, col_base = (
                             warp_n * warp_col_tiles + j * micro_size_y,
                             rk * chunk + ki * (k_pack * micro_size_k),
                         )
-                        B_local_buf[j * k_pack * local_size_b + local_id] = B_shared_buf[l + row,
-                                                                                         r + col]
+                        B_local_buf[j * k_pack * local_size_b + local_id] = B_shared_buf[row_base + row,
+                                                                                         col_base + col]

Also applies to: 311-311
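For intuition on the flat local-buffer indexing in the snippet above, a toy model (pure Python; all sizes and the local_id → (row, col) mapping are invented for illustration, the real emitter derives them from the thread binding and mfma layout):

```python
# Each warp-tile j copies a (micro_size_y x k_pack*micro_size_k) sub-block of
# B_shared into a contiguous slice of the flat B_local, as in
# B_local_buf[j * k_pack * local_size_b + local_id] = B_shared_buf[l + row, r + col].
micro_size_y, micro_size_k, k_pack = 2, 2, 2
local_size_b = micro_size_y * micro_size_k       # elements per micro-tile
warp_cols = 2

cols = k_pack * micro_size_k
B_shared = [[(r, c) for c in range(cols)] for r in range(warp_cols * micro_size_y)]
B_local = [None] * (warp_cols * k_pack * local_size_b)

for j in range(warp_cols):
    row_base, col_base = j * micro_size_y, 0     # cf. l, r in the snippet
    for local_id in range(k_pack * local_size_b):
        row, col = local_id // cols, local_id % cols  # hypothetical mapping
        B_local[j * k_pack * local_size_b + local_id] = \
            B_shared[row_base + row][col_base + col]

assert B_local[0] == (0, 0)
assert B_local[8 + 5] == (3, 1)   # j=1: row_base=2, local_id=5 -> row=1, col=1
```

Under names like row_base/col_base the two offsets read as what they are: the tile's origin in the shared buffer, which is the reviewer's point.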


459-609: Consider extracting common patterns in ldmatrix methods

The global and shared variants of ldmatrix_a and ldmatrix_b have very similar patterns. Consider extracting common logic to reduce code duplication.

Consider creating a helper method for the common pattern:

def _ldmatrix_helper(self, local_buf, buf, ki, thread_binding, rk, 
                    is_b=False, is_global=False, is_transposed=False):
    # Common extraction and indexing logic
    ...

This would reduce duplication and make the code more maintainable.


487-487: Variable naming consistency

The static analysis correctly flags the use of l as an ambiguous variable name throughout the preshuffle methods. For consistency with the parent class and better readability, consider using more descriptive names.

Also applies to: 496-496, 515-515, 526-526, 562-562, 571-571, 590-590, 600-600

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b9a51c4 and 06d85db.

📒 Files selected for processing (3)
  • testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (1 hunks)
  • testing/python/amd/test_tilelang_gemm_mfma_preshuffle.py (10 hunks)
  • tilelang/intrinsics/mfma_macro_generator.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (1)
testing/python/amd/test_tilelang_gemm_mfma_preshuffle.py (1)
  • assert_tl_matmul_correctness (197-261)
testing/python/amd/test_tilelang_gemm_mfma_preshuffle.py (1)
tilelang/intrinsics/mfma_macro_generator.py (7)
  • MatrixCorePreshuffleIntrinEmitter (405-609)
  • ldmatrix_a (233-272)
  • ldmatrix_a (459-532)
  • ldmatrix_b (274-318)
  • ldmatrix_b (534-609)
  • mfma (320-352)
  • stmatrix (354-402)
tilelang/intrinsics/mfma_macro_generator.py (2)
tilelang/intrinsics/mma_macro_generator.py (6)
  • ldmatrix_a (199-261)
  • ldmatrix_a (753-857)
  • get_thread_binding (151-157)
  • extract_thread_binding (167-197)
  • ldmatrix_b (263-333)
  • ldmatrix_b (859-971)
tilelang/language/kernel.py (4)
  • KernelLaunchFrame (72-206)
  • Current (115-121)
  • get_thread_binding (151-156)
  • get_thread_binding (261-265)
🪛 Ruff (0.13.1)
tilelang/intrinsics/mfma_macro_generator.py

E741 Ambiguous variable name: l (flagged at lines 300, 311, 487, 496, 515, 526, 562, 571, 590, 600)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: format-check
🔇 Additional comments (8)
testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (1)

237-240: LGTM!

The new test cases add valuable coverage for non-transposed B matrix configurations.

tilelang/intrinsics/mfma_macro_generator.py (2)

296-317: Refactoring simplified the ldmatrix_b logic nicely

The removal of complex preshuffle-specific logic and the cleaner implementation with consistent l, r mapping improves code maintainability.


405-452: Well-structured preshuffle emitter class

The new MatrixCorePreshuffleIntrinEmitter class appropriately extends the base emitter with preshuffle-specific functionality. The constructor properly initializes all necessary attributes.

testing/python/amd/test_tilelang_gemm_mfma_preshuffle.py (5)

6-6: Good adoption of the new preshuffle emitter

The import change correctly switches to using MatrixCorePreshuffleIntrinEmitter for preshuffle-aware testing.


24-24: Well-integrated b_g2l_load parameter

The addition of b_g2l_load parameter enables testing of both global-to-local and shared-to-local B matrix loading paths, providing comprehensive test coverage.

Also applies to: 207-207, 209-209


41-44: Verify the preshuffle configuration changes

The warp tile dimensions change from 32x32 to 64x16 for preshuffle mode, and chunk size increases significantly. Please verify these are the intended optimizations for preshuffle mode.

Are these specific tile and chunk sizes validated for optimal performance with preshuffle on AMD hardware?


153-162: Clean separation of B loading paths

The conditional logic for G2L vs S2L B matrix loading is well-structured and provides clear path selection based on the b_g2l_load flag.


267-284: Good test coverage for preshuffle scenarios

The test cases comprehensively cover different configurations including transposed/non-transposed B matrices and different k_pack values with preshuffle enabled.

Comment on lines +453 to +457

    def _initialize_preshuffle(self, a_preshuffle: bool, b_preshuffle: bool):
        if a_preshuffle is not None:
            self.a_preshuffle = a_preshuffle
        if b_preshuffle is not None:
            self.b_preshuffle = b_preshuffle
Contributor


⚠️ Potential issue

Improve preshuffle initialization logic

The condition if a_preshuffle is not None will be True even when a_preshuffle=False is passed. This could lead to unexpected behavior.

 def _initialize_preshuffle(self, a_preshuffle: bool, b_preshuffle: bool):
-    if a_preshuffle is not None:
-        self.a_preshuffle = a_preshuffle
-    if b_preshuffle is not None:
-        self.b_preshuffle = b_preshuffle
+    self.a_preshuffle = a_preshuffle if a_preshuffle is not None else False
+    self.b_preshuffle = b_preshuffle if b_preshuffle is not None else False
🤖 Prompt for AI Agents
In tilelang/intrinsics/mfma_macro_generator.py around lines 453-457, change the
preshuffle initialization to explicitly treat None as "no-op" and accept False
as a valid value: check "if a_preshuffle is None: pass else: self.a_preshuffle =
a_preshuffle" (and the same for b_preshuffle) so that passing False correctly
sets the attribute rather than being misinterpreted as omitted.
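The None-vs-False point in miniature (illustrative standalone snippet, not the emitter code):

```python
# "is not None" lets an explicit False enter the branch, which is harmless for
# a plain assignment but misleading if the branch was meant to detect "the
# caller omitted this argument". init_preshuffle is a hypothetical helper.
def init_preshuffle(a_preshuffle=None):
    state = {"a_preshuffle": False}     # default
    if a_preshuffle is not None:        # explicit False still enters here
        state["a_preshuffle"] = a_preshuffle
    return state

assert init_preshuffle()["a_preshuffle"] is False       # omitted -> default
assert init_preshuffle(False)["a_preshuffle"] is False  # explicit False kept
assert init_preshuffle(True)["a_preshuffle"] is True
```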

    if self.a_preshuffle is False:
        return super().ldmatrix_a(A_local_buf, A_buf, ki, rk)

    def _warp_ldmatrix_a_global(
Contributor


⚠️ Potential issue

Missing @T.macro decorator for _warp_ldmatrix_a_global

The _warp_ldmatrix_a_global function is missing the @T.macro decorator, unlike its shared counterpart and other similar functions in the codebase.

+    @T.macro
     def _warp_ldmatrix_a_global(
         A_local_buf,
         A_buf,
🤖 Prompt for AI Agents
In tilelang/intrinsics/mfma_macro_generator.py around line 475, the helper
function _warp_ldmatrix_a_global is missing the @T.macro decorator; add the
@T.macro decorator immediately above its def line to match its shared
counterpart and other similar functions so it is registered as a TIR macro,
ensuring consistent behavior and enabling the macro to be expanded where
expected.

                    A_local_buf[i * k_pack * local_size_a + local_id] = A_shared_buf[l, r, row, col]
            else:
                print(self.a_preshuffle)
Contributor


⚠️ Potential issue

Remove debug print statement

There's a print statement that appears to be for debugging purposes and should be removed before merging.

             else:
-                print(self.a_preshuffle)
                 for i in T.serial(warp_rows):
🤖 Prompt for AI Agents
In tilelang/intrinsics/mfma_macro_generator.py around line 522, there is a
leftover debug print statement "print(self.a_preshuffle)"; remove this line so
the function no longer emits debug output, and run tests/lint to ensure no other
debug prints remain.

        is_global = pid_m is not None and pid_n is not None

        if self.b_preshuffle is False:
            return super().ldmatrix_b(B_local_buf, B_buf, ki, rk, pid_m, pid_n)
Contributor


⚠️ Potential issue

Fix incorrect method signature in super() call

The super().ldmatrix_b() call includes extra parameters (pid_m, pid_n) that don't exist in the parent class method signature.

     if self.b_preshuffle is False:
-        return super().ldmatrix_b(B_local_buf, B_buf, ki, rk, pid_m, pid_n)
+        return super().ldmatrix_b(B_local_buf, B_buf, ki, rk)
🤖 Prompt for AI Agents
In tilelang/intrinsics/mfma_macro_generator.py around line 547, the
super().ldmatrix_b(...) call passes two extra arguments (pid_m, pid_n) that the
parent method doesn't accept; remove those extra parameters and call
super().ldmatrix_b with the expected parameters (i.e., B_local_buf, B_buf, ki,
rk) so the call matches the parent signature.

@LeiWang1999 LeiWang1999 merged commit 48c9a35 into tile-ai:main Sep 23, 2025
6 of 7 checks passed
RubiaCx pushed a commit to RubiaCx/tilelang that referenced this pull request Nov 24, 2025
