
@LeiWang1999 LeiWang1999 commented Nov 24, 2025

Thanks @Hamerlate for helping find the issue; this reverts commit 0d101c1.

Summary by CodeRabbit

  • Refactor

    • Consolidated FP8 type representations and simplified metadata structures by removing internal flags, reducing complexity and improving maintainability.
    • Removed FP8 GEMM example code.
  • Changes

    • Restricted accumulator dtype support for TCGEN05 operations to float32.

✏️ Tip: You can customize this high-level summary in your review settings.

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run `pre-commit run --all-files` in the root directory of the project so that your changes are properly linted and formatted. This will help your contribution pass the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀


coderabbitai bot commented Nov 24, 2025

Walkthrough

This PR removes unused metadata flags (enable_ws, enable_2cta) from the TCGEN5 MMA infrastructure, simplifies template-based packing logic (pack16), and replaces CuTe floating-point types with native FP8 types in SM100 GEMM dispatch. Additionally, it removes an outdated FP8 example and consolidates related type mappings.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Example removal**<br>`examples/gemm_fp8/example_tilelang_gemm_fp8_sm100.py` | Deleted the entire TileLang FP8 GEMM example script, including the matmul kernel function and the calc_diff helper. |
| **TCGEN5 metadata simplification**<br>`src/op/gemm_py.cc`, `src/op/tcgen5_meta.h` | Removed the enable_ws and enable_2cta fields from the TCGEN5MMAMeta struct; updated the SUCCESS/FAIL macros to take 3 parameters instead of 5; refined the dtype condition logic for FP8 handling. |
| **Copy operation refactoring**<br>`src/op/copy.cc`, `src/tl_templates/cuda/copy_sm100.h` | Removed unused pack/unpack boolean flags; deleted the fp8_e5_32_t global memory load/store methods; removed the pack16 template parameter from the tcgen05_ld_32dpXXbNx helper templates. |
| **TCGEN05 load template simplification**<br>`src/tl_templates/cuda/tcgen_05_ld.h` | Converted templated class specializations into concrete non-template classes; added explicit classes (tmem_ld_32dp32bNx, tmem_ld_16dp128bNx, tmem_ld_16dp256bNx, tmem_ld_32dp128bNx, tmem_ld_32dp256bNx); unified the copy interface across non-pack variants. |
| **GEMM type migration**<br>`src/tl_templates/cuda/gemm_sm100.h` | Replaced cute::float_e4m3_t / cute::float_e5m2_t with fp8_e4_t / fp8_e5_t in the DispatchInstruction specializations; updated enable_if constraints from individual M values to combined (M == 64 \|\| M == 32) patterns for the FP8 variants (see the type-alias sketch after this table). |
| **TileLang Python intrinsics**<br>`tilelang/intrinsics/mma_macro_generator.py`, `tilelang/intrinsics/tcgen05_macro_generator.py`, `tilelang/jit/adapter/wrapper.py`, `tilelang/tileop/gemm/gemm_tcgen05.py` | Removed the float8 dtype abbreviation mappings; reduced TCGEN5MMA meta extraction from 5 fields to 3 (dropping enable_ws/enable_2cta); removed meta retrieval in the gemm_tcgen05 lower(); narrowed accumulator dtype acceptance to float32 only. |
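
As a hedged illustration of the fp8_e4_t / fp8_e5_t types referenced in the "GEMM type migration" row above: the sketch below assumes they alias CUDA's native FP8 types from <cuda_fp8.h>, which is not confirmed by this summary, so treat it as a guess at the intent rather than the actual TileLang definitions.

```cuda
// Hedged sketch only: assumes fp8_e4_t / fp8_e5_t alias CUDA's native FP8
// types from <cuda_fp8.h>; TileLang's real definitions may differ.
#include <cuda_fp8.h>

using fp8_e4_t = __nv_fp8_e4m3;  // E4M3: 4 exponent bits, 3 mantissa bits
using fp8_e5_t = __nv_fp8_e5m2;  // E5M2: 5 exponent bits, 2 mantissa bits
```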

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • src/tl_templates/cuda/tcgen_05_ld.h: Significant restructuring of templated class hierarchy to concrete classes; requires careful verification that all call sites and copy implementations align with new non-pack interfaces.
  • src/tl_templates/cuda/gemm_sm100.h: Multiple DispatchInstruction specialization replacements; verify type consistency across FP8 variants and that enable_if constraints correctly filter matrix shapes.
  • src/op/tcgen5_meta.h: Macro changes to SUCCESS/FAIL affecting all call sites; validate dtype condition refinement (float8_e4m3fn/float8_e5m2 with float32 accumulator) matches intended hardware support.
  • Cross-file consistency: Ensure TCGEN5MMA metadata field reductions (5 → 3) are consistently applied across Python intrinsics, C++ headers, and gemm lowering logic.
  • Type migration: Confirm all CuTe float_e4m3_t/float_e5m2_t → fp8_e4_t/fp8_e5_t replacements are complete and correct.

Possibly related PRs

Poem

A rabbit hops through template code, 🐰
Removing flags that made the load
Too heavy, complex, hard to trace—
Pack16 and bools now leave no trace.
FP8 types stride in clean and bright,
Simplicity wins the day. What a sight! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86%, below the required threshold of 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and accurately describes the main change: reverting a specific prior commit that added support for multiple dtypes in tcgen05. |
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

  • Provide your own instructions using the high_level_summary_instructions setting.
  • Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
  • Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

  1. 📝 Description — Summarize the main change in 50–60 words, explaining what was done.
  2. 📓 References — List relevant issues, discussions, documentation, or related PRs.
  3. 📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.
  4. 📊 Contributor Summary — Include a Markdown table showing contributions:
    | Contributor | Lines Added | Lines Removed | Files Changed |
  5. ✔️ Additional Notes — Add any extra reviewer context.
    Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@LeiWang1999 LeiWang1999 merged commit ca98cc3 into tile-ai:main Nov 24, 2025
5 of 7 checks passed

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
tilelang/intrinsics/tcgen05_macro_generator.py (2)

171-178: Meta tuple now 3‑element; enable_ws derived from atom_m

Updating the meta length check to 3 and unpacking (atom_m, atom_n, atom_k) aligns with the new TCGEN5MMAMeta definition. Deriving enable_ws as atom_m != 128 matches the D/E/G layout convention (only the 128‑row atom is non‑WS), so this looks consistent with the C++ meta logic.

If you want to quiet TRY003 from Ruff, you could factor the long error message into a helper or shorten it, but that’s purely stylistic.


384-389: Store‑layout meta handling consistent with 3‑field TCGEN5MMAMeta

make_mma_store_layout now expects a 3‑element meta and uses only atom_m and atom_n to validate tile divisibility, ignoring atom_k. That matches how the store layout is computed (only M×N tiling matters here), and the error message accurately describes the unsupported configuration.

Same as above, the long f-string in the ValueError is fine functionally; consider shortening or moving it if you want to satisfy TRY003.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9f7bac4 and af01ffc.

📒 Files selected for processing (11)
  • examples/gemm_fp8/example_tilelang_gemm_fp8_sm100.py (0 hunks)
  • src/op/copy.cc (2 hunks)
  • src/op/gemm_py.cc (0 hunks)
  • src/op/tcgen5_meta.h (2 hunks)
  • src/tl_templates/cuda/copy_sm100.h (1 hunks)
  • src/tl_templates/cuda/gemm_sm100.h (1 hunks)
  • src/tl_templates/cuda/tcgen_05_ld.h (11 hunks)
  • tilelang/intrinsics/mma_macro_generator.py (0 hunks)
  • tilelang/intrinsics/tcgen05_macro_generator.py (2 hunks)
  • tilelang/jit/adapter/wrapper.py (0 hunks)
  • tilelang/tileop/gemm/gemm_tcgen05.py (1 hunks)
💤 Files with no reviewable changes (4)
  • tilelang/jit/adapter/wrapper.py
  • tilelang/intrinsics/mma_macro_generator.py
  • src/op/gemm_py.cc
  • examples/gemm_fp8/example_tilelang_gemm_fp8_sm100.py
🧰 Additional context used
🧬 Code graph analysis (2)
tilelang/tileop/gemm/gemm_tcgen05.py (1)
tilelang/tileop/gemm/gemm_base.py (1)
  • accum_dtype (59-60)
src/tl_templates/cuda/copy_sm100.h (1)
src/tl_templates/cuda/tcgen_05_ld.h (5)
  • tl (10-713)
  • tmem_ld_32dp32bNx (13-182)
  • tmem_ld_32dp64bNx (684-691)
  • tmem_ld_32dp128bNx (694-701)
  • tmem_ld_32dp256bNx (704-711)
🪛 Ruff (0.14.5)
tilelang/intrinsics/tcgen05_macro_generator.py

173-175: Avoid specifying long messages outside the exception class

(TRY003)


386-387: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (18)
src/op/copy.cc (2)

1127-1129: ICHECK message reflow is fine

This just improves readability of the unsupported tmem copy diagnostic; no behavioral change.


1249-1250: Confirm tcgen05 intrinsic name matches new template signature

Appending only <num_chunks_each_wg> to meta.intrinsics_name assumes that name now refers to the unspecialized function (e.g. tmem_ld_*Nx::copy). With the removal of the Pack16 template parameter on the loader classes, that looks correct, but it’s worth double‑checking that tcgen05_layout.h sets intrinsics_name accordingly so codegen still emits valid calls.
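
For illustration, the emitted call would then have roughly the shape sketched below; the copy signature, the include path, and the chunk count are assumptions based on this review, not values taken from the diff.

```cuda
// Assumed call shape after appending <num_chunks_each_wg> to the
// unspecialized intrinsics_name (e.g. "tl::tmem_ld_32dp32bNx::copy").
// Include path, argument types and the chunk value are placeholders.
#include <cstdint>
#include "tcgen_05_ld.h"

__device__ void emitted_call_shape(uint32_t src_addr, uint32_t *dst_ptr) {
  constexpr int num_chunks_each_wg = 4;  // placeholder chunk count
  tl::tmem_ld_32dp32bNx::copy<num_chunks_each_wg>(src_addr, dst_ptr);
}
```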

src/op/tcgen5_meta.h (2)

24-28: FAIL/SUCCESS macros correctly updated for 3‑field TCGEN5MMAMeta

Returning {false, TCGEN5MMAMeta{0, 0, 0}} on FAIL and {true, TCGEN5MMAMeta{atom_m, atom_n, atom_k}} on SUCCESS matches the new struct definition and keeps the API straightforward for FFI callers.
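
A minimal sketch of what this describes, assuming the macros expand to return statements and that the meta is returned alongside a success flag; the macro names are prefixed here to keep the sketch self-contained, and the real names, conditions, and return type live in tcgen5_meta.h.

```cuda
// Hedged sketch of the 3-field meta and the FAIL/SUCCESS expansion described
// above; everything except the TCGEN5MMAMeta field names is illustrative.
#include <utility>

struct TCGEN5MMAMeta {
  int atom_m, atom_n, atom_k;  // enable_ws / enable_2cta are no longer stored
};

#define SKETCH_FAIL() return {false, TCGEN5MMAMeta{0, 0, 0}}
#define SKETCH_SUCCESS(am, an, ak) return {true, TCGEN5MMAMeta{(am), (an), (ak)}}

std::pair<bool, TCGEN5MMAMeta> GetTCGEN5MMAMetaSketch(int K) {
  if (K % 32 != 0) SKETCH_FAIL();  // placeholder condition, for shape only
  SKETCH_SUCCESS(128, 256, 32);    // placeholder atom shape
}
```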


30-75: Verification confirms narrow TCGEN5MMA dtype space is intentional; minor mismatch found in _FLOAT8_DTYPES set

The review concerns are validated:

  1. c_dtype=float32 enforcement: Confirmed at line 106 in gemm_tcgen05.py, which checks accum_dtype != "float32" and raises an error for any other accumulator dtype. All TCGEN5 operations will enforce this constraint before invoking get_tcgen5_mma_meta.

  2. Supported FP8 encodings: No TCGEN5 tests or examples use fnuz variants. All fnuz usage (e.g., float8_e4m3fnuz) is confined to AMD MFMA code paths, not TCGEN5. The narrowing to {float8_e4m3fn, float8_e5m2} is consistent with codebase usage.

  3. Unsupported dtype handling: Confirmed that unsupported dtypes will trigger FAIL in GetTCGEN5MMAMeta, which returns an empty result, and the caller in tcgen05_macro_generator.py (lines 172–175, 386–387) raises ValueError with the configuration details.

Minor issue: _FLOAT8_DTYPES in gemm_tcgen05.py (lines 13–16) still lists fnuz variants (float8_e4m3fnuz, float8_e5m2fnuz), but these are not validated or enforced. The set appears stale and should be updated to match the actual supported dtypes for clarity and to prevent future confusion.

src/tl_templates/cuda/tcgen_05_ld.h (7)

13-182: Non‑templated tmem_ld_32dp32bNx with copy<N> looks correct

The new class with template<int N> static copy(...) and the N power‑of‑two static_assert matches the PTX mnemonics from x1 through x128, and the dst_ptr indexing aligns with the number of output registers per variant. The trap fallback for invalid N is also appropriate for debug builds.
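
A skeleton of the pattern described here, assuming copy takes a tmem address and a destination register pointer; only the x1 variant is spelled out, and the exact PTX string should be checked against tcgen_05_ld.h rather than taken from this sketch.

```cuda
// Illustrative skeleton of the non-templated loader class; the real header
// spells out all x1..x128 asm variants. Signature and PTX string are
// assumptions based on the review, not copied from the diff.
#include <cstdint>

struct tmem_ld_32dp32bNx_sketch {
  template <int N>
  static __device__ void copy(uint32_t src_addr, uint32_t *dst_ptr) {
    static_assert(N >= 1 && N <= 128 && (N & (N - 1)) == 0,
                  "N must be a power of two in [1, 128]");
    if constexpr (N == 1) {
      asm volatile("tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];"
                   : "=r"(dst_ptr[0])
                   : "r"(src_addr));
    } else {
      // x2 .. x128 variants elided here; in the real header an unexpected N
      // falls through to a trap for debug-friendly failure.
      __trap();
    }
  }
};
```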


185-354: tmem_ld_16dp64bNx concrete class is consistent with the 16×64b PTX forms

This mirrors the 32dp32b pattern: a single class with copy<N> handling x1..x128 via inline asm, guarded by the same power‑of‑two static_assert. The register lists and dst_ptr indices line up with the PTX signatures.


356-520: tmem_ld_16dp128bNx: tighter N bound and expanded asm variants

Limiting N to ≤64 for 16×128b loads and explicitly spelling out x1..x64 variants via inline asm is reasonable and keeps register pressure manageable. The static_assert correctly enforces the allowed N values, and the dst_ptr indexing matches the number of 32‑bit outputs each variant emits.


522-680: tmem_ld_16dp256bNx: N≤32 and multi‑register outputs per pattern

Here N is capped at 32, with each pattern emitting 4·N 32‑bit words. The x1..x32 asm blocks have consistent operand lists and write into the expected dst_ptr ranges; invalid N again traps. This is a clean consolidation compared to the old Pack16‑templated specializations.


684-691: tmem_ld_32dp64bNx wrapper correctly composes two 16‑lane loads

Calling tmem_ld_16dp64bNx::copy<N> twice, offsetting src_addr by (16 << 16) and dst_ptr by +N, matches the idea of stitching two 16‑lane 64b patterns into a 32‑lane 64b pattern with 2·N outputs.
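
A hedged sketch of that composition: the tmem address arithmetic and destination offset are taken from this comment, while the signature, include path, and tl:: qualification are assumptions; the 128b/256b wrappers differ only in the destination stride.

```cuda
// Sketch of the 32-lane wrapper composed from two 16-lane loads, as described
// above. Assumes tl::tmem_ld_16dp64bNx from tcgen_05_ld.h is in scope and
// that copy takes (tmem address, destination register pointer).
#include <cstdint>
#include "tcgen_05_ld.h"

struct tmem_ld_32dp64bNx_sketch {
  template <int N>
  static __device__ void copy(uint32_t src_addr, uint32_t *dst_ptr) {
    tl::tmem_ld_16dp64bNx::copy<N>(src_addr, dst_ptr);                    // lanes 0-15
    tl::tmem_ld_16dp64bNx::copy<N>(src_addr + (16u << 16), dst_ptr + N);  // lanes 16-31
  }
};
// The 32dp128bNx and 32dp256bNx wrappers follow the same shape, advancing the
// second call's dst_ptr by N * 2 and N * 4 (2·N and 4·N outputs per half).
```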


694-701: tmem_ld_32dp128bNx dst_ptr stride N * 2 matches 2·N outputs per half

Each 16×128b load produces 2·N 32‑bit outputs, so offsetting the second call’s dst_ptr by N * 2 ensures the 32‑lane variant writes a contiguous 4·N‑word block. This is consistent with the lower‑level implementation.


704-711: tmem_ld_32dp256bNx correctly offsets by N * 4 for 4·N outputs per half

For 16×256b, each load yields 4·N 32‑bit words, so using dst_ptr + N * 4 on the second call gives the expected 8·N total outputs for the 32‑lane pattern. The composition looks sound.

tilelang/tileop/gemm/gemm_tcgen05.py (1)

105-107: Verify float32 accumulator restriction and add documentation

The accum_dtype != "float32" restriction is unique to GemmTCGEN5; other GEMM variants (MMA, WGMMA, MFMA) accept wider accumulator dtypes without restriction. While the SM100 preview examples show only float32 accumulators, the code does not explain whether this is a hardware limitation or a conservative preview-stage choice.

No TCGEN5-specific tests or active call sites using non-float32 accumulators were found, but this absence doesn't confirm the restriction is necessary. Consider:

  1. Adding an inline comment explaining why float32-only (hardware constraint vs. implementation limitation)
  2. Verifying with hardware/intrinsic documentation whether other dtypes can eventually be supported
  3. Adding a test case documenting this constraint for future maintainers
src/tl_templates/cuda/copy_sm100.h (4)

98-104: tcgen05_ld_32dp32bNx wrapper matches simplified tmem loader interface

Dropping the pack16 template parameter and instantiating tcgen05_ld_core with tl::tmem_ld_32dp32bNx directly is consistent with the tmem_ld_32dp32bNx interface (copy templated only on N); the recursion and fencing remain unchanged.


107-114: tcgen05_ld_32dp64bNx wrapper correctly tracks core loader signature

The 64‑byte variant mirrors the 32‑byte path: template parameters are simplified to <int N, typename dst_t>, and tcgen05_ld_core is instantiated with tl::tmem_ld_32dp64bNx and the same MAX_LOGN as before, so behavior is preserved.


116-123: tcgen05_ld_32dp128bNx wrapper remains structurally consistent

This wrapper follows the same pattern as the 32/64‑byte versions, with only the target_call_cls template argument simplified; MAX_LOGN is still 6 and the async fence is unchanged, so load tiling and synchronization semantics are intact.


125-132: tcgen05_ld_32dp256bNx wrapper aligns with core recursion contract

The 256‑byte path now also uses the non‑pack tl::tmem_ld_32dp256bNx while preserving MAX_LOGN == 5 and the recursive tcgen05_ld_core structure; this keeps the segmentation logic and fencing behavior the same as before the pack16 change.

src/tl_templates/cuda/gemm_sm100.h (2)

245-264: FP8 e4 dispatch specializations correctly gate shapes and map to F8F6F4 kernels

The two DispatchInstruction<fp8_e4_t, fp8_e4_t, float, ...> specializations cleanly separate the M == 128 && K == 32 case (non‑.ws SM100_MMA_F8F6F4_SS) from the (M == 64 || M == 32) && K == 32 case (.ws SM100_MMA_F8F6F4_WS_SS), matching the constraints encoded in the MMA_Traits<SM100_MMA_F8F6F4*_SS,...> specialization and aligning with the existing F16/BF16 pattern. Using MMA_Traits<...> as the MMA alias is consistent with the F8F6F4 traits definition here and should integrate with GemmTensorOp’s make_tiled_mma usage.


266-285: FP8 e5 dispatch mirrors e4 path and reuses the same F8F6F4 infrastructure

The DispatchInstruction<fp8_e5_t, fp8_e5_t, float, ...> specializations mirror the e4 ones: they enforce K == 32, choose between SM100_MMA_F8F6F4_SS and SM100_MMA_F8F6F4_WS_SS based on M, and rely on MMA_Traits to wire up K‑dimension (32 elements) and layouts. This keeps the FP8‑e5 path consistent with the e4 path and the underlying F8F6F4 WS traits implementation.
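
To make the gating pattern in the two comments above concrete, here is a small self-contained illustration of enable_if-based shape selection; the tag types and the DispatchInstruction parameter order are stand-ins, not the real CuTe/TileLang declarations from gemm_sm100.h.

```cuda
// Self-contained illustration of the M/K gating described above; all type
// names are stand-ins for the real FP8 and SM100 MMA tags.
#include <cstdio>
#include <type_traits>

struct fp8_e4_t {};                // stand-in for TileLang's FP8 E4M3 type
struct SM100_MMA_F8F6F4_SS {};     // stand-in: non-.ws instruction tag
struct SM100_MMA_F8F6F4_WS_SS {};  // stand-in: .ws instruction tag

template <typename A, typename B, typename C, int M, int N, int K,
          typename Enable = void>
struct DispatchInstruction;  // primary template: unsupported shapes do not match

// M == 128 && K == 32 -> non-.ws kernel
template <int M, int N, int K>
struct DispatchInstruction<fp8_e4_t, fp8_e4_t, float, M, N, K,
                           std::enable_if_t<M == 128 && K == 32>> {
  using MMA = SM100_MMA_F8F6F4_SS;
};

// (M == 64 || M == 32) && K == 32 -> .ws kernel
template <int M, int N, int K>
struct DispatchInstruction<fp8_e4_t, fp8_e4_t, float, M, N, K,
                           std::enable_if_t<(M == 64 || M == 32) && K == 32>> {
  using MMA = SM100_MMA_F8F6F4_WS_SS;
};

int main() {
  static_assert(std::is_same_v<
      DispatchInstruction<fp8_e4_t, fp8_e4_t, float, 128, 256, 32>::MMA,
      SM100_MMA_F8F6F4_SS>);
  static_assert(std::is_same_v<
      DispatchInstruction<fp8_e4_t, fp8_e4_t, float, 64, 128, 32>::MMA,
      SM100_MMA_F8F6F4_WS_SS>);
  std::puts("dispatch gates resolve as expected");
  return 0;
}
```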

