[AMD] support mfma i32_16x16x32_i8 #800
Conversation
Walkthrough

Adds int8 MFMA support across HIP codegen and templates: fixes dtype placeholder names, extends the dtype map with int8x8 → int64_t, adjusts MFMA prefix tokens, and introduces MfmaTraits<int8_t> using the AMD i8 MFMA intrinsic. Updates macro generation for the int8 suffix and k-dimension, and expands tests to include int8 with k_pack scaling and kernel source printing.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant T as Test Runner
    participant TL as TileLang GEMM (tests)
    participant MG as MFMA Macro Generator
    participant CG as HIP Codegen
    participant K as Generated Kernel
    participant HW as AMD MFMA Intrinsic
    T->>TL: Invoke GEMM with dtype=int8, accum=int32, k_pack
    TL->>MG: Request MFMA config (dtype=int8)
    MG-->>TL: k_dim=32, suffix "..._x32_i8"
    TL->>CG: Build kernel with MFMA tokens and dtypes
    CG-->>TL: HIP kernel source (uses {A/B/C_dtype}, int8x8→int64_t)
    TL->>K: Launch kernel
    K->>HW: __builtin_amdgcn_mfma_i32_16x16x32_i8(...)
    HW-->>K: Accumulate results
    K-->>T: Completion (optionally print kernel source)
```
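For context, here is a minimal sketch of the new MfmaTraits<int8_t> specialization described in the walkthrough, assuming the const-correct form suggested later in this review; the member name mfma_op and the vector typedef are assumptions, not the verbatim template in src/tl_templates/hip/gemm.h:

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

// 4 x int32 accumulator fragment used by the 16x16x32 MFMA shape.
typedef int int32x4 __attribute__((ext_vector_type(4)));

template <typename T> struct MfmaTraits;

// int8 specialization: each operand pointer covers 8 int8 values, which the
// intrinsic consumes as a single packed 64-bit lane operand.
template <> struct MfmaTraits<int8_t> {
  static __device__ void mfma_op(const int8_t *a, const int8_t *b, int32x4 *c) {
    const int64_t *a_packed = reinterpret_cast<const int64_t *>(a);
    const int64_t *b_packed = reinterpret_cast<const int64_t *>(b);
    // Trailing zeros are the cbsz/abid/blgp modifiers (unused here).
    *c = __builtin_amdgcn_mfma_i32_16x16x32_i8(*a_packed, *b_packed, *c, 0, 0, 0);
  }
};
```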
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run the format checks before pushing. Our team will review your contribution, and we look forward to your awesome work! 🚀
Summary of Changes
Hello @Paran0idy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request primarily focuses on extending the AMD backend to fully support int8 matrix multiplication operations using the __builtin_amdgcn_mfma_i32_16x16x32_i8 intrinsic. It includes necessary updates to the code generation logic, type mappings, and MFMA macro generation, along with new test cases to verify the integration and correctness of these low-precision computations.
Highlights
- MFMA i32_16x16x32_i8 Support: Added comprehensive support for the __builtin_amdgcn_mfma_i32_16x16x32_i8 intrinsic, enabling int8 matrix multiplication with int32 accumulation on AMD GPUs (see the sketch after this list).
- Backend Codegen Enhancements: Updated the HIP backend code generation to correctly handle int8x8 types and fixed a typo in the MFMA code generation template.
- MFMA Macro Generation Logic: Modified the MFMA macro generator to correctly determine k_dim and generate the appropriate suffix for int8 inputs.
- Testing and Validation: Introduced new test cases to validate the correctness of int8 matrix multiplication with int32 accumulation, ensuring proper functionality.
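To make the operand layout concrete, here is an illustrative sketch that is not taken from the PR: the function name, fragment pointers, and vector typedef are assumptions, while the intrinsic call itself matches the builtin named above. Each issue multiplies 16x32 and 32x16 int8 tiles per wave and accumulates 4 int32 outputs per lane.

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

typedef int int32x4 __attribute__((ext_vector_type(4)));

// Each MFMA issue advances K by 32; every lane supplies 8 int8 values per
// operand, pre-packed into one int64_t, and accumulates 4 int32 outputs.
__device__ void accumulate_i8_tile(const int64_t *a_frag, const int64_t *b_frag,
                                   int32x4 &acc, int num_issues) {
  for (int k = 0; k < num_issues; ++k) {
    acc = __builtin_amdgcn_mfma_i32_16x16x32_i8(a_frag[k], b_frag[k], acc, 0, 0, 0);
  }
}
```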
Code Review
This pull request adds support for the __builtin_amdgcn_mfma_i32_16x16x32_i8 instruction for AMD GPUs, enabling int8 matrix multiplication. The changes are well-structured, touching the HIP codegen, C++ templates, and Python layers for intrinsic generation and testing. The implementation is largely correct. I have two suggestions for improvement: one to enhance code quality and safety in a C++ template by removing an unnecessary const_cast, and another to remove a leftover debugging print statement from a Python test file.
```cpp
int64_t *b_packed = reinterpret_cast<int64_t *>(const_cast<int8_t *>(b));
int64_t *a_packed = reinterpret_cast<int64_t *>(const_cast<int8_t *>(a));
```
The use of const_cast here is unnecessary and not considered best practice. Since the data pointed to by a and b is not modified, you can directly reinterpret_cast to a const pointer type. This preserves const correctness and improves code safety.
```cpp
const int64_t *b_packed = reinterpret_cast<const int64_t *>(b);
const int64_t *a_packed = reinterpret_cast<const int64_t *>(a);
```

The second suggestion concerns a leftover debugging print in the test file:

```python
C = torch.zeros(M, N, device="cuda", dtype=getattr(torch, out_dtype))

kernel(A, B, C)
print(kernel.get_kernel_source())
```
Actionable comments posted: 1
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (2)
45-53: Scale of block_K now depends on k_pack — add a divisibility guard to avoid silent tail-drop.

With chunk = 32 * k_pack and block_K = chunk, the outer K loop uses floor division and can silently skip a remainder when K % block_K != 0 (for example, k_pack=2 gives block_K = 64, so K = 96 would silently drop the last 32 elements). Add an explicit assert to fail fast. The ki inner loop math remains consistent.

Apply this diff:

```diff
 block_M = block_row_warps * warp_row_tiles
 block_N = block_col_warps * warp_col_tiles
 block_K = chunk
+# Ensure we don't drop a tail on K
+assert K % block_K == 0, f"K ({K}) must be divisible by block_K ({block_K})"
```

Also applies to: 126-127
211-214: Typo causes AttributeError: use A.T.to(...), not A.Tto(...).

This path will crash for a_transposed=True, b_transposed=False.

Apply this diff:

```diff
-        ref_c = torch.matmul(A.Tto(torch.float32),
+        ref_c = torch.matmul(A.T.to(torch.float32),
                              B.to(torch.float32)).to(getattr(torch, out_dtype))
```
🧹 Nitpick comments (4)
src/target/codegen_hip.cc (1)
883-883: Comment improvement: fix the parameter description.

The comment mislabels the MFMA name token: it is the intrinsic suffix ({otype}_{intrM}x{intrN}x{intrK}_{itype}), not a prefix.

Apply this diff to improve the comment:

```diff
-  // arg 0: prefix: {otype}_{intrM}x{intrN}x{intrK}_{itype}
+  // arg 0: mfma suffix: {otype}_{intrM}x{intrN}x{intrK}_{itype}
```

testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (3)
182-183: Gate noisy prints and reuse the fetched source to keep CI logs clean.

Unconditional prints bloat logs. Make them opt-in via env vars and avoid calling get_kernel_source() twice.

Apply this diff:

```diff
-    print(matmul)
+    if os.getenv("TL_DEBUG"):
+        print(matmul)
@@
-    print(kernel.get_kernel_source())
+    if os.getenv("TL_DUMP_KERNEL"):
+        print(src_code)
@@
-    print(C)
-    print(ref_c)
+    if os.getenv("TL_DEBUG"):
+        print(C)
+        print(ref_c)
```

And add this import near the top of the file:

```python
import os
```

Also applies to: 198-198, 223-224
190-191: Include the full int8 range in randint.

high is exclusive; using 127 excludes 127. Use 128 to cover [-128, 127].

Apply this diff:

```diff
-    A = torch.randint(-128, 127, A_shape, device="cuda", dtype=torch.int8)
-    B = torch.randint(-128, 127, B_shape, device="cuda", dtype=torch.int8)
+    A = torch.randint(-128, 128, A_shape, device="cuda", dtype=torch.int8)
+    B = torch.randint(-128, 128, B_shape, device="cuda", dtype=torch.int8)
```
185-185: Clarify the backend in the comment.

This runs on ROCm; "CUDA source" is misleading.

Apply this diff:

```diff
-    # src_code is the generated cuda source
+    # src_code is the generated GPU kernel source (HIP/CUDA)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- src/target/codegen_hip.cc (3 hunks)
- src/tl_templates/hip/gemm.h (1 hunk)
- testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (3 hunks)
- tilelang/intrinsics/mfma_macro_generator.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (4)
- tilelang/jit/kernel.py (1): get_kernel_source (378-389)
- tilelang/jit/adapter/ctypes/adapter.py (1): get_kernel_source (290-296)
- tilelang/jit/adapter/cython/adapter.py (1): get_kernel_source (516-522)
- tilelang/jit/adapter/base.py (1): get_kernel_source (51-52)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: format-check
- GitHub Check: bot-task
🔇 Additional comments (5)
tilelang/intrinsics/mfma_macro_generator.py (2)
84-84: LGTM! Support for the int8 k_dim is correctly added.

The change correctly adds "int8" to the set of dtypes that trigger k_dim = 32, aligning with AMD's int8 MFMA instructions, which consume 32 elements along the K dimension.
126-127: LGTM! Correct suffix generation for the int8 MFMA intrinsic.

The new branch properly generates the suffix with an underscore before "i8" (e.g., i32_16x16x32_i8), matching the AMD intrinsic naming convention __builtin_amdgcn_mfma_i32_16x16x32_i8.

src/target/codegen_hip.cc (2)
917-917: LGTM! Correct mapping of int8x8 to int64_t.

The int8x8 → int64_t mapping packs 8 int8 values into a 64-bit integer, matching the operand type the AMD int8 MFMA intrinsic expects.
929-931: LGTM! Fixed typo in placeholder names.

Good catch fixing the typo from {C_dytpe}, {A_dytpe}, {B_dytpe} to the correct {C_dtype}, {A_dtype}, {B_dtype}. This ensures proper type casting in the generated MFMA code.

Also applies to: 937-939
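As an illustration of this mapping, a hypothetical sketch of the lookup the codegen performs; the map name and the neighboring entry are assumptions, and only the int8x8 → int64_t pairing comes from the review:

```cpp
#include <string>
#include <unordered_map>

// Hypothetical sketch: resolve TIR vector dtypes to the HIP carrier type used
// in generated casts. Eight int8 lanes travel as one 64-bit MFMA operand.
static const std::unordered_map<std::string, std::string> dtype_map = {
    {"float32x4", "float32x4"}, // illustrative pre-existing entry
    {"int8x8", "int64_t"},      // new: pack 8 x int8 into one 64-bit value
};
```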
testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (1)
233-235: Int8/int32 test coverage looks good.

Solid additions covering k_pack=1 and k_pack=2. No issues spotted.
```cpp
int64_t *b_packed = reinterpret_cast<int64_t *>(const_cast<int8_t *>(b));
int64_t *a_packed = reinterpret_cast<int64_t *>(const_cast<int8_t *>(a));
```
Potential undefined behavior: const_cast removes const qualifier inappropriately.
Using const_cast to strip const from the const int8_t* parameters yields non-const pointers into memory that may actually be const; any write through them would be undefined behavior, and the cast is unnecessary since the data is only read. The pointers should be cast directly to a const pointer type without removing const.
Apply this diff to fix the const-correctness issue:

```diff
-  int64_t *b_packed = reinterpret_cast<int64_t *>(const_cast<int8_t *>(b));
-  int64_t *a_packed = reinterpret_cast<int64_t *>(const_cast<int8_t *>(a));
+  const int64_t *b_packed = reinterpret_cast<const int64_t *>(b);
+  const int64_t *a_packed = reinterpret_cast<const int64_t *>(a);
```
+ const int64_t *a_packed = reinterpret_cast<const int64_t *>(a);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| int64_t *b_packed = reinterpret_cast<int64_t *>(const_cast<int8_t *>(b)); | |
| int64_t *a_packed = reinterpret_cast<int64_t *>(const_cast<int8_t *>(a)); | |
| const int64_t *b_packed = reinterpret_cast<const int64_t *>(b); | |
| const int64_t *a_packed = reinterpret_cast<const int64_t *>(a); |
🤖 Prompt for AI Agents
In src/tl_templates/hip/gemm.h around lines 15 to 16, the code uses const_cast
to strip const from const int8_t* and then reinterpret_cast to int64_t*, which
can invoke undefined behavior; change the target pointer types to const int64_t*
and cast directly from the original const int8_t* using reinterpret_cast<const
int64_t*>(...), removing any const_cast so the const qualifier is preserved and
you don't attempt to modify potentially const memory.
This PR adds support for __builtin_amdgcn_mfma_i32_16x16x32_i8.

Summary by CodeRabbit
New Features
- int8 MFMA support on AMD GPUs: int8 GEMM with int32 accumulation via the i32_16x16x32_i8 intrinsic.

Bug Fixes
- Corrected dtype placeholder names in the HIP MFMA codegen template.

Tests
- Added int8/int32 GEMM tests covering k_pack=1 and k_pack=2, with kernel source printing.