Conversation


@LeiWang1999 LeiWang1999 commented Aug 20, 2025

Summary by CodeRabbit

  • New Features
    • CUDA shared-memory barrier support with new device barrier utilities and object-backed shared.barrier handling; TMA load APIs generalized to accept flexible barrier types.
  • Bug Fixes
    • Corrected a misspelled PTX store intrinsic and updated related call sites.
  • Refactor
    • Broadened export surface for intrinsics, removed legacy memory helper and narrowed a public re-export; codegen and transform passes updated for barrier flow.
  • Tests
    • Added autotune decorator to a matmul test; removed one dequantize test.
  • Style
    • Replaced stdout prints with logging; minor formatting and CI parallelism tweaks.

…upport

- Added a new CUDA kernel implementation in `example_mla_decode.py` for improved performance with shared memory barriers.
- Refactored barrier handling in `codegen_cuda.cc` and `codegen_hip.cc` to utilize a more flexible mbarrier structure.
- Updated intrinsic definitions from `ptx_stmatirx` to `ptx_stmatrix` across multiple files for consistency.
- Introduced additional print statements for debugging in the lowering phase of the TileLang engine.
- Enhanced the overall structure and readability of the codebase.
…o streamline the implementation. This change enhances code clarity and reduces complexity in the barrier management logic.
- Introduced a new intrinsic `allocate_barrier` for dynamic barrier allocation in the TileLang framework.
- Updated CUDA code generation to support the new barrier structure, allowing for improved synchronization in shared memory.
- Refactored existing barrier handling logic to accommodate the new intrinsic and streamline code.
- Added print statements for debugging purposes in various examples and the lowering phase of the TileLang engine.
- Removed deprecated memory scope handling code to enhance clarity and maintainability.
… streamline barrier management. This includes updates to CUDA code generation and the removal of associated Python wrappers, enhancing code clarity and maintainability.
@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run `bash format.sh` in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @LeiWang1999, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the management of memory barriers, particularly CUDA mbarriers, within the system. The changes introduce a new intrinsic for explicit barrier allocation, streamline the CUDA code generation process for these barriers, and generalize their usage in TMA (Tensor Memory Accelerator) load operations. This refactoring aims to improve the flexibility and clarity of barrier handling, alongside minor code cleanups and typo corrections.
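For orientation, the following is a minimal sketch (not the repository's exact tma_load; helper and function names are assumptions) of the compile-time dispatch that a templated barrier parameter enables:

#include <cstdint>
#include <type_traits>

// Illustrative only: a TMA-style load helper templated on BarrierType, so a
// caller may pass either a raw uint64_t mbarrier word (by reference) or a
// pointer to a barrier object. Only the address selection is sketched here.
template <typename BarrierType = uint64_t>
__device__ void tma_load_sketch(void *smem_ptr, void *gmem_ptr,
                                BarrierType &smem_mbar, uint32_t size) {
  uint64_t *mbar_word;
  if constexpr (std::is_pointer_v<BarrierType>) {
    mbar_word = reinterpret_cast<uint64_t *>(smem_mbar);   // pointer passed in
  } else {
    mbar_word = reinterpret_cast<uint64_t *>(&smem_mbar);  // object or plain uint64_t
  }
  // A real implementation would convert mbar_word and smem_ptr to shared-memory
  // addresses and issue the cp.async.bulk PTX with the mbarrier operand.
  (void)gmem_ptr; (void)size; (void)mbar_word; (void)smem_ptr;
}

Keeping the barrier as a template parameter lets existing call sites that pass raw uint64_t words keep working while the new object-backed barriers can be handed in directly.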

Highlights

  • Explicit Barrier Allocation: A new allocate_barrier intrinsic has been introduced, allowing for more explicit and flexible allocation of memory barriers.
  • CUDA Backend Integration: The CUDA code generation backend has been updated to support the new allocate_barrier intrinsic and to generalize the handling of mbarriers, moving away from a fixed _mbarrier array to a more dynamic system.
  • Generalized TMA Load Operations: The tma_load and tma_load_im2col functions in copy_sm90.h now accept a more generic BarrierType, enabling greater flexibility in how barriers are passed and used with TMA operations.
  • Centralized Barrier Definitions: A new header file, src/tl_templates/cuda/barrier.h, has been added to centralize CUDA mbarrier-related device functions, improving code organization and maintainability.
  • Typo Correction: A consistent typo, ptx_stmatirx, has been corrected to ptx_stmatrix across multiple files.
  • Memory Scope Simplification: The memscope.py file has been removed, and related memory scope checks have been updated, indicating a simplification of memory management definitions.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors barrier management, introducing a new C++ object-oriented approach for CUDA barriers and removing the old implementation. This includes adding a new barrier.h header with PTX inline assembly, updating CUDA codegen to use these new barrier objects, and refactoring related passes. Support for barriers on HIP seems to have been removed as part of this refactoring. The changes are extensive and improve the barrier implementation for CUDA. However, there are a few issues: a critical bug in argument checking in CUDA codegen, and a couple of leftover debug print/log statements that should be removed.

<< mbarrier_dtype_ << "*>(" << mbarrier_storage_name << ");\n";
} else if (op->op.same_as(tl::get_mbarrier())) {
std::string barrier_name = "_mbarrier";
ICHECK_EQ(op->args.size(), 1);


critical

The argument count check for tl::allocate_barrier is incorrect. It checks for 1 argument but then proceeds to access op->args[1], which will lead to an out-of-bounds access. The allocate_barrier intrinsic takes two arguments (name and barrier_count). The check should be for 2 arguments.

Suggested change
-ICHECK_EQ(op->args.size(), 1);
+ICHECK_EQ(op->args.size(), 2);

num_split = 1

kernel = flashattn(batch, heads, kv_heads, kv_ctx, dim, pe_dim, BLOCK_N, BLOCK_H, num_split)
print(kernel.get_kernel_source())


medium

This print statement appears to be for debugging purposes. It should be removed before merging.

Comment on lines +1791 to +1792
LOG(INFO) << "Allocate with " << new_buffer_var << " and "
<< info.new_element_dtype << " extents: " << extents;


medium

This LOG(INFO) statement seems to be a debug log. It should be removed from the final code.


coderabbitai bot commented Aug 20, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds a CUDA shared.barrier object model and TL/CUDA barrier templates, generalizes TMA loads to a templated BarrierType, fixes the PTX op name ptx_stmatirx → ptx_stmatrix, exports many builtins with TVM_DLL, updates HIP/CUDA codegen and TIR transforms for barrier scope handling, and includes assorted small example/test/logging tweaks.
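As a rough sketch of the object-backed flow (Barrier aliases the CUTLASS ClusterTransactionBarrier per this PR's barrier.h; method names follow the sequence diagram below and are illustrative rather than exact):

#include <cutlass/arch/barrier.h>

using Barrier = cutlass::arch::ClusterTransactionBarrier;  // alias introduced by barrier.h

__global__ void barrier_flow_sketch() {
  // Backing storage for the "shared.barrier" scope, viewed as Barrier objects.
  __shared__ uint64_t mbarrier_mem[2];
  Barrier *mbarrier_ = reinterpret_cast<Barrier *>(mbarrier_mem);

  if (threadIdx.x == 0) {
    mbarrier_[0].init(/*arrive_count=*/blockDim.x);  // initialize once per block
    mbarrier_[1].init(/*arrive_count=*/blockDim.x);
  }
  __syncthreads();

  // The lowered barrier ops then drive these objects along the lines of:
  //   expect_tx(bytes) -> producer announces the TMA transaction size
  //   arrive()         -> producer signals arrival
  //   wait(phase)      -> consumers block until the barrier phase flips
}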

Changes

  • CUDA barrier & templates (src/tl_templates/cuda/barrier.h, src/tl_templates/cuda/common.h, src/tl_templates/cuda/copy_sm90.h): Add tl::Barrier alias and extensive barrier utilities; add warpgroup_reg_alloc/dealloc helpers; generalize tma_load overloads to accept a templated BarrierType; update includes and internal barrier encoding; remove redundant barrier helpers from copy_sm90.h.
  • CUDA codegen (barrier object model) (src/target/codegen_cuda.cc, src/target/codegen_cuda.h): Introduce the shared.barrier storage scope, allocate mbarrier shared memory and a Barrier* view, emit barrier object calls (init/arrive/expect_tx/wait), adapt allocation/printing for barrier objects, and rename stmatirx → stmatrix.
  • HIP codegen adjustments (src/target/codegen_hip.cc): Remove prior HIP barrier emission; replace ptx_stmatirx dispatch with ptx_stmatrix-based emission, emitting tl::ptx_stmatrix_x<num> with optional _trans and an adjusted argument offset.
  • PTX op rename & builtin exports (src/op/builtin.h, src/op/builtin.cc, src/op/elem.cc): Rename the op symbol ptx_stmatirx → ptx_stmatrix; add the TVM_DLL export qualifier to many Op declarations and export additional intrinsics (tl_gemm, tl_gemm_sp, tl_shuffle_elect, etc.); update uses to the corrected name.
  • TIR transforms: shared/barrier handling & passes (src/transform/lower_shared_barrier.cc, src/transform/inject_fence_proxy.cc, src/transform/lower_device_storage_access_info.cc, src/transform/storage_rewrite.cc): Add disable_shuffle_elect flag plumbing to LowerSharedBarrier/SharedBarrierRewriter, reuse the original buffer data for barrier buffers, gate the initialization condition on tl_shuffle_elect or thread 0, update op checks to ptx_stmatrix, and treat the .barrier scope as non-special (skip memory-info lowering/merging).
  • Warp-specialized rewriter & detection (src/transform/warp_specialized_rewriter.cc): Replace string-based shared-scope checks with runtime StorageScope/StorageRank detection (uses runtime/thread_storage_scope.h) for shared-memory detection.
  • TL templates / tma_load API changes (src/tl_templates/cuda/copy_sm90.h): Convert tma_load APIs to accept a BarrierType template parameter and use compile-time pointer vs. object dispatch for barrier address computation.
  • JIT/logging & autotune test (tilelang/jit/kernel.py, testing/python/autotune/test_tilelang_autotune_with_inputs.py): Replace a stdout print with logger.info and assert global_symbol for kernel compile logging; add the @tilelang.autotune decorator and fix the enable_rasterization typo in the autotune test.
  • Language API removals (tilelang/language/__init__.py, tilelang/language/memscope.py): Re-export StridedTensor; remove the memscope wildcard re-export; delete the FFI-registered tvm.info.mem.local.var helper.
  • Examples & tests (minor edits) (examples/warp_specialize/example_warp_specialize_flashmla.py, examples/warp_specialize/example_warp_specialize_gemm_copy_0_gemm_1.py, examples/gemm_fp8/example_tilelang_gemm_fp8_intrinsic.py, examples/dequantize_gemm/test_example_dequantize_gemm.py): Add a debug print of kernel source in one example; remove a stray blank line; drop the unused torch.backends import; remove one dequantize test.
  • Transforms: inject_fence_proxy & storage rewrite (src/transform/inject_fence_proxy.cc, src/transform/storage_rewrite.cc): Update proxy detection to use ptx_stmatrix; exclude the .barrier tag from special-memory handling and memory-info lookups; add an informational log when allocating merged vector buffers.
  • CI & whitespace (.github/workflows/ci.yml, tilelang/engine/phase.py): Increase pytest parallelism from 4→8 in two CI steps; remove extra blank lines (whitespace only).

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant TIR as TIR Passes
  participant Pass as LowerSharedBarrier
  participant CG as CodeGenTileLangCUDA
  participant SM as __shared__ Memory
  participant K as Kernel

  Note over TIR,Pass: Buffer declared with scope "shared.barrier"
  TIR->>Pass: Rewrite/Lower (disable_shuffle_elect flag)
  Pass->>CG: Emit PrimFunc with barrier allocs
  CG->>SM: Allocate mbarrier_mem (__shared__)
  CG->>K: Define Barrier* mbarrier_ = reinterpret_cast<Barrier*>(mbarrier_mem)
  CG->>K: Lower barrier ops -> mbarrier_->init/arrive/expect_tx/wait
  K->>SM: Barrier methods operate on mbarrier_ in shared mem
sequenceDiagram
  autonumber
  participant TL as TL Template (tma_load)
  participant K as Generated Kernel
  participant SM as Shared Mem Barrier
  participant GM as Global Memory

  TL->>K: tma_load(..., BarrierType& smem_mbar, ...)
  alt BarrierType is pointer
    K->>SM: barrier_addr = reinterpret_cast<uint64_t*>(smem_mbar)
  else BarrierType is object/ref
    K->>SM: barrier_addr = reinterpret_cast<uint64_t*>(&smem_mbar)
  end
  K->>GM: Issue TMA load using barrier_addr and cache hint

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Poem

I hop and patch through CUDA night,
A barrier garden springs to light.
Stmatrix fixed, templates sing,
TMA learns a flexible ring.
I nibble bytes and stamp my paw—code hums, I clap, hurrah! 🐇

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches
  • 📝 Docstrings were successfully generated.

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
src/transform/lower_shared_barrier.cc (2)

47-51: Fix potential null dereference before ICHECK on PointerTypeNode.

ptr_type is dereferenced to read storage_scope before the ICHECK validates it. If type_annotation is not a PointerType, this will segfault before the check fires.

Apply this diff to guard the access:

-      const auto *ptr_type =
-          buffer->data->type_annotation.as<PointerTypeNode>();
-      auto storage_scope = ptr_type->storage_scope;
-      ICHECK(ptr_type) << "Buffer Var's type annotation must be of PointerType";
+      const auto *ptr_type =
+          buffer->data->type_annotation.as<PointerTypeNode>();
+      ICHECK(ptr_type) << "Buffer Var's type annotation must be of PointerType";
+      auto storage_scope = ptr_type->storage_scope;

36-54: Scope leakage: buffer_map_ persists across blocks; restrict detection to this block’s alloc_buffers.

buffer_map_ is a class member and never cleared, then iterated to find barrier_buffers. This risks accumulating buffers from previously visited blocks, and detecting barriers not allocated in the current block. Additionally, detection should rely on op->alloc_buffers (where allocation happens), not match_buffers.

Minimal, localized fix:

  • Clear buffer_map_ at block entry.
  • Build barrier_buffers from alloc_buffers only.
-    // Record the mapping from buffer data var to buffer for later lookup
+    // Record the mapping from buffer data var to buffer for later lookup
+    // NOTE: Clear per-block to avoid leaking mappings across blocks.
+    buffer_map_.clear();
     for (auto buffer : alloc_buffers) {
       buffer_map_.insert({buffer->data, buffer});
     }
     for (auto match_buffer : op->match_buffers) {
       buffer_map_.insert({match_buffer->buffer->data, match_buffer->buffer});
     }
 
     Array<Buffer> barrier_buffers;
 
-    for (auto [data, buffer] : buffer_map_) {
-      const auto *ptr_type =
-          buffer->data->type_annotation.as<PointerTypeNode>();
-      auto storage_scope = ptr_type->storage_scope;
-      ICHECK(ptr_type) << "Buffer Var's type annotation must be of PointerType";
-      if (storage_scope == "shared.barrier") {
-        barrier_buffers.push_back(buffer);
-      }
-    }
+    // Only consider buffers allocated by this block for barrier lowering.
+    for (const auto& buffer : alloc_buffers) {
+      const auto* ptr_type = buffer->data->type_annotation.as<PointerTypeNode>();
+      ICHECK(ptr_type) << "Buffer Var's type annotation must be of PointerType";
+      const auto& storage_scope = ptr_type->storage_scope;
+      if (storage_scope == "shared.barrier") {
+        barrier_buffers.push_back(buffer);
+      }
+    }

Rationale:

  • Prevents incorrect accumulation from outer/sibling blocks.
  • Ensures we only transform allocations owned by this block (consistent with the subsequent alloc_buffers mutation check).
src/tl_templates/cuda/copy_sm90.h (2)

17-27: Make the basic tma_load pointer-friendly too

The first overload doesn’t handle pointer-typed barriers, unlike the descriptor overloads. Align behavior.

Apply this diff:

-template <typename BarrierType = uint64_t>
-TL_DEVICE void tma_load(void *smem_ptr, void *gmem_ptr, BarrierType &smem_mbar,
-                        uint32_t size) {
-  uint32_t smem_int_mbar =
-      smem_ptr_to_uint(reinterpret_cast<uint64_t *>(&smem_mbar));
+template <typename BarrierType = uint64_t>
+TL_DEVICE void tma_load(void *smem_ptr, void *gmem_ptr, BarrierType &smem_mbar,
+                        uint32_t size) {
+  uint32_t smem_int_mbar;
+  if constexpr (std::is_pointer_v<BarrierType>) {
+    smem_int_mbar = smem_ptr_to_uint(reinterpret_cast<uint64_t *>(smem_mbar));
+  } else {
+    smem_int_mbar = smem_ptr_to_uint(reinterpret_cast<uint64_t *>(&smem_mbar));
+  }
   uint32_t smem_int_ptr = smem_ptr_to_uint(smem_ptr);
   asm volatile("cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::"
                "bytes [%0], [%1], %2, [%3]; \n" ::"r"(smem_int_ptr),
                "l"(gmem_ptr), "r"(size), "r"(smem_int_mbar)
                :);
 }

160-172: tma_load_im2col should mirror pointer-aware barrier handling

Keep parity with other overloads to support both Barrier& and Barrier*.

Apply this diff:

-template <CacheHintSm90 cache_hint = CacheHintSm90::EVICT_NORMAL,
-          typename BarrierType = uint64_t>
-TL_DEVICE void
-tma_load_im2col(const CUtensorMap &descriptor, BarrierType &smem_mbar,
+template <CacheHintSm90 cache_hint = CacheHintSm90::EVICT_NORMAL,
+          typename BarrierType = uint64_t>
+TL_DEVICE void
+tma_load_im2col(const CUtensorMap &descriptor, BarrierType &smem_mbar,
                 void const *const smem_ptr, int32_t const &coord_c,
                 int32_t const &coord_w, int32_t const &coord_h,
                 int32_t const &coord_n, uint16_t const &offset_w,
                 uint16_t const &offset_h) {
   uint64_t gmem_int_desc = reinterpret_cast<uint64_t>(&descriptor);
-  uint32_t smem_int_mbar =
-      smem_ptr_to_uint(reinterpret_cast<uint64_t *>(&smem_mbar));
+  uint32_t smem_int_mbar;
+  if constexpr (std::is_pointer_v<BarrierType>) {
+    smem_int_mbar = smem_ptr_to_uint(reinterpret_cast<uint64_t *>(smem_mbar));
+  } else {
+    smem_int_mbar = smem_ptr_to_uint(reinterpret_cast<uint64_t *>(&smem_mbar));
+  }
   uint32_t smem_int_ptr = smem_ptr_to_uint(smem_ptr);
   asm volatile("cp.async.bulk.tensor.4d.shared::cluster.global.im2col.mbarrier:"
                ":complete_tx::bytes.L2::cache_hint"
                " [%0], [%1, {%3, %4, %5, %6}], [%2], {%7, %8}, %9;"
♻️ Duplicate comments (2)
examples/warp_specialize/example_warp_specialize_flashmla.py (1)

394-394: Make kernel source dump opt-in (guard or remove the debug print).

Unconditional printing of the full generated kernel floods stdout and isn’t desirable for an example that benchmarks performance. Echoing prior feedback: please remove or gate it.

Apply this minimal guard via an environment variable:

-    print(kernel.get_kernel_source())
+    if os.environ.get("TILELANG_DUMP_KERNEL"):
+        print(kernel.get_kernel_source())

And add this import near the top of the file (outside the selected range):

import os

If you prefer CLI control, I can provide a follow-up patch to add a --dump_kernel flag and plumb it through main().

src/transform/storage_rewrite.cc (1)

1791-1793: Remove debug LOG(INFO) from production pass (duplicate of prior feedback)

The informational LOG(INFO) on allocation in VectorTypeRewriter is a stray debug print and will spam logs.

Apply this diff to remove it:

-    LOG(INFO) << "Allocate with " << new_buffer_var << " and "
-              << info.new_element_dtype << " extents: " << extents;
🧹 Nitpick comments (6)
src/transform/lower_shared_barrier.cc (1)

160-169: Support other thread dimensions or define a safe fallback when threadIdx.x is absent.

Barrier init condition relies on capturing threadIdx.x. If a kernel binds only threadIdx.y/z, thread_var_ remains undefined and trips the ICHECK.

Options:

  • Accept any of threadIdx.{x,y,z} and build a linearized thread id for the condition.
  • Fallback to thread_var_ = IterVar(Range(0, 1), Var("tid", DataType::Int(32)), kThreadIndex) for CPU paths.

If helpful, I can send a patch to scan for y/z as well.

src/target/codegen_cuda.h (1)

118-121: New mbarrier fields: ensure usage and document expectations.

Adding mbarrier_name_ = "mbarrier" and mbarrier_dtype_ = "Barrier" makes sense for the barrier-object model. Two follow-ups:

  • Confirm codegen_cuda.cc uses these fields to emit the mbarrier storage symbol and reinterpret_cast to the Barrier type from tl_templates/cuda/barrier.h.
  • Consider documenting alignment requirements for mbarrier arrays (if distinct from barrier_alignment_bytes_) and whether barrier_count_ applies to mbarrier as well.
src/target/codegen_hip.cc (1)

790-797: Add a minimal arg-count check before indexing stmatrix args

Defensive check avoids OOB access if upstream changes or malformed IR reach HIP codegen.

Apply this diff:

-  } else if (op->op.same_as(tl::ptx_stmatrix())) {
-    int trans = Downcast<IntImm>(op->args[0])->value;
-    int num = Downcast<IntImm>(op->args[1])->value;
+  } else if (op->op.same_as(tl::ptx_stmatrix())) {
+    ICHECK_GE(op->args.size(), 2) << "ptx_stmatrix expects at least <trans, num> as the first two args";
+    int trans = Downcast<IntImm>(op->args[0])->value;
+    int num = Downcast<IntImm>(op->args[1])->value;
     std::string func_name = "tl::ptx_stmatrix_x" + std::to_string(num);
     if (trans == 1)
       func_name += "_trans";
     print_extern_call_stmt(func_name, 2);
src/tl_templates/cuda/barrier.h (1)

123-135: Inline asm operand constraints are fine for single-block usage

state is only used within the same asm block, so lack of an output constraint won’t leak UB. If you ever need to use state outside or across asm blocks, promote it to an output with "=l"(state).

src/target/codegen_cuda.cc (2)

946-958: Prefer consistent stringification for barrier indices

Mixing direct stream of PrimExpr and PrintExpr can produce subtle differences (parentheses, casting). Use PrintExpr for both paths.

Apply this diff:

-  auto print_mbarrier_obj = [&](PrimExpr barrier_id) {
+  auto print_mbarrier_obj = [&](PrimExpr barrier_id) {
     std::ostringstream ss;
-    if (barrier_id.as<IntImmNode>()) {
-      // incase the barrier_id is an integer, we need to print the barrier_id as
-      // an integer
-      ss << mbarrier_name_ << "[" << barrier_id << "]";
-    } else {
-      // otherwise may be a T.get_mbarrier() call or BufferLoad Node
-      // we need to print the barrier_id as a string
-      ss << this->PrintExpr(barrier_id);
-    }
+    // Always stringify via PrintExpr for consistency
+    ss << mbarrier_name_ << "[" << this->PrintExpr(barrier_id) << "]";
     return ss.str();
   };

985-993: Validate create_barriers arg count

Add a simple size check to avoid undefined behavior if IR is malformed.

Apply this diff:

  } else if (op->op.same_as(builtin::create_barriers())) {
-    this->PrintIndent();
-    int barrier_count = Downcast<IntImm>(op->args[0])->value;
+    this->PrintIndent();
+    ICHECK_EQ(op->args.size(), 1) << "create_barriers expects exactly 1 argument: barrier_count";
+    int barrier_count = Downcast<IntImm>(op->args[0])->value;
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between fff24ae and 383ad60.

📒 Files selected for processing (20)
  • examples/warp_specialize/example_warp_specialize_flashmla.py (1 hunks)
  • examples/warp_specialize/example_warp_specialize_gemm_copy_0_gemm_1.py (0 hunks)
  • src/op/builtin.cc (1 hunks)
  • src/op/builtin.h (5 hunks)
  • src/op/elem.cc (1 hunks)
  • src/target/codegen_cuda.cc (6 hunks)
  • src/target/codegen_cuda.h (1 hunks)
  • src/target/codegen_hip.cc (1 hunks)
  • src/tl_templates/cuda/barrier.h (1 hunks)
  • src/tl_templates/cuda/common.h (1 hunks)
  • src/tl_templates/cuda/copy_sm90.h (9 hunks)
  • src/transform/inject_fence_proxy.cc (1 hunks)
  • src/transform/lower_device_storage_access_info.cc (1 hunks)
  • src/transform/lower_shared_barrier.cc (6 hunks)
  • src/transform/storage_rewrite.cc (3 hunks)
  • src/transform/warp_specialized_rewriter.cc (6 hunks)
  • testing/python/autotune/test_tilelang_autotune_with_inputs.py (1 hunks)
  • tilelang/engine/phase.py (0 hunks)
  • tilelang/language/__init__.py (0 hunks)
  • tilelang/language/memscope.py (0 hunks)
💤 Files with no reviewable changes (4)
  • tilelang/language/__init__.py
  • tilelang/language/memscope.py
  • examples/warp_specialize/example_warp_specialize_gemm_copy_0_gemm_1.py
  • tilelang/engine/phase.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build-test-nvidia
🔇 Additional comments (20)
src/op/builtin.cc (1)

86-86: Symbol name rename looks good.

The rename from ptx_stmatirx to ptx_stmatrix fixes the typo in the original name, making it consistent with the correct spelling "matrix".

testing/python/autotune/test_tilelang_autotune_with_inputs.py (1)

45-45: LGTM! The autotune decorator addition is appropriate.

Adding the @tilelang.autotune decorator with configuration exploration is a good test enhancement that exercises the autotuning functionality with the barrier management refactoring.

src/tl_templates/cuda/common.h (1)

253-259: Well-structured warp group register management utilities.

The two new template functions for warp group register allocation/deallocation are appropriately placed in the common utilities header. The inline PTX instructions correctly use the setmaxnreg operations with proper aligned sync semantics.
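For context, such helpers typically look like the hedged sketch below (the PR's actual definitions live in common.h; the template parameter name and exact qualifiers here are assumptions):

#include <cstdint>

// Illustrative sketch of warp-group register reallocation helpers built on the
// Hopper setmaxnreg PTX instructions with aligned-sync semantics.
template <uint32_t RegCount>
__device__ __forceinline__ void warpgroup_reg_alloc_sketch() {
  asm volatile("setmaxnreg.inc.sync.aligned.u32 %0;\n" : : "n"(RegCount));
}

template <uint32_t RegCount>
__device__ __forceinline__ void warpgroup_reg_dealloc_sketch() {
  asm volatile("setmaxnreg.dec.sync.aligned.u32 %0;\n" : : "n"(RegCount));
}

In warp-specialized kernels the producer warp group typically calls the dealloc variant to shrink its register budget while consumer warp groups call the alloc variant to grow theirs.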

src/transform/warp_specialized_rewriter.cc (6)

17-18: Appropriate migration to StorageScope-based detection.

Good refactoring to use the runtime StorageScope abstraction instead of string-based scope comparisons. This provides better type safety and consistency across the codebase.


154-156: LGTM! Consistent use of StorageScope for shared memory detection.

The change from string comparison (op->buffer.scope() == "shared.dyn" || op->buffer.scope() == "shared") to using StorageScope with StorageRank::kShared is cleaner and more maintainable.


576-603: Comprehensive documentation for WSCodeEmitter constructor.

The expanded documentation clearly explains the parameters and their roles in warp-specialized code emission. This improves code maintainability and understanding.


614-622: Clear getter method documentation.

The documentation for hasSimtCopy() clearly explains its purpose in tracking SIMT-style bulk copy detection.


641-680: Thorough documentation for complex SeqStmt visitor.

The detailed documentation for VisitStmt_(const SeqStmtNode*) clearly explains the complex barrier synchronization logic, side effects, and state updates. This level of detail is crucial for understanding the warp specialization implementation.


1261-1291: Well-documented BlockRealize visitor for warp specialization.

The comprehensive documentation clearly explains the complex warp specialization logic, including producer/consumer role detection, WgMMA handling, barrier initialization, and thread partitioning. This documentation is essential for maintaining this intricate code.

src/op/builtin.h (1)

65-303: Appropriate TVM_DLL export additions for public API functions.

The addition of TVM_DLL export qualifiers to the public Op declarations is necessary for proper dynamic library linking. The rename from ptx_stmatirx to ptx_stmatrix (Line 152) correctly fixes the typo, and the new exports for tl_gemm, tl_gemm_sp, and tl_shuffle_elect align with the broader barrier management refactoring.

src/transform/lower_shared_barrier.cc (2)

121-130: Verify tl_shuffle_elect() semantics: does it elect one thread per CTA or one per warp?

If tl_shuffle_elect() elects a leader per warp (common for “shuffle elect”), barrier init will run once per warp, not once per block. Many mbarrier init routines expect a single initialization per barrier/CTA; multiple inits may be undefined.

Please confirm the semantics of tl_shuffle_elect():

  • If it elects per warp: gate on a single CTA leader instead (e.g., threadIdx.x == 0), or additionally guard with warp-id == 0 to prevent multiple initializations.
  • If it already elects a single CTA-wide leader: current logic is OK.

If you decide to force single-CTA init, this minimal change is safe and keeps the pass-config override:

-    if (!disable_shuffle_elect_) {
-      condition = Call(DataType::Bool(), tl_shuffle_elect(), {0});
-    } else {
-      condition = EQ(thread_var_->var, 0);
-    }
+    if (!disable_shuffle_elect_) {
+      // TODO(verify): replace with CTA-wide leader elect if tl_shuffle_elect() is warp-scoped.
+      condition = EQ(thread_var_->var, 0);
+    } else {
+      condition = EQ(thread_var_->var, 0);
+    }

I can adjust this to a CTA-wide “elect” once we confirm the intrinsic behavior.
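To make the per-warp versus per-CTA distinction concrete, a hedged CUDA-level illustration (the pass itself emits TIR, not CUDA; init_barrier() is a hypothetical stand-in for the mbarrier initialization):

// Illustrative only.
__device__ void init_barrier() { /* stand-in for mbarrier init */ }

__global__ void gating_sketch() {
  // CTA-wide leader: initialization runs exactly once per block.
  if (threadIdx.x == 0) {
    init_barrier();
  }
  // A warp-scoped "elect one lane" predicate, by contrast, is true for one
  // lane in every warp, so the same initialization would run once per warp
  // unless it is additionally guarded (e.g. by threadIdx.x / 32 == 0).
}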


82-92: Reusing original data var is good; ensure no aliasing issues and update tests.

Reusing buffer->data when constructing the new shared buffer avoids var remapping complexity. This looks correct given the Buffer remap covers loads/stores. Please add/extend unit tests validating:

  • Buffer shape is rewritten to 1.
  • Barrier init uses the original thread count (old shape[0]).
  • Loads/stores to barrier buffers are correctly redirected.
src/op/elem.cc (1)

305-306: Corrected symbol name: ptx_stmatrix() usage looks good; double-check ld op naming consistency.

The store path now calls tl::ptx_stmatrix(), fixing the prior typo. Verify that the corresponding builtin declaration/registration was also updated (it appears done elsewhere in the PR).

Minor: The load op remains ptx_ldmatirx() (with the older misspelling). If this is intentional (only st op was renamed), no action needed. If you meant to standardize both to “ldmatrix/stmatrix”, consider aligning the load op name for consistency.

src/transform/inject_fence_proxy.cc (1)

60-62: Fence proxy detection updated to ptx_stmatrix(): LGTM.

The proxy classification now recognizes the corrected store-matrix op symbol. This keeps generic-vs-async detection aligned with the rename.

src/transform/lower_device_storage_access_info.cc (1)

47-49: Skip MemoryInfo lowering for ".barrier" tagged storage: LGTM.

Excluding .barrier from MemoryInfo lookup prevents accidental reinterpretation of barrier storage as generic memory. This aligns with the new barrier-object handling.

src/transform/storage_rewrite.cc (2)

675-677: Good call: treat .barrier as non-special tagged memory

Excluding .barrier from “special” merging scopes aligns with the new barrier storage semantics and avoids unintended merging/constraints.


845-849: Correct to skip MemoryInfo for .barrier and .var

Gating GetMemoryInfo for .barrier/.var prevents accidental reliance on device memory limits for these pseudo-scopes.

src/target/codegen_hip.cc (1)

790-797: HIP Codegen Still Includes Barrier Intrinsics

The earlier note that HIP no longer receives barrier builtins is incorrect. In src/target/codegen_hip.cc (around lines 786–795) there is still a case for

op->op.same_as(tl::sync_thread_partial())

which emits

tl::syncthreads_partial

Additionally, several transform passes (e.g., lower_shared_barrier.cc, warp_specialized_rewriter.cc, thread_partial_sync.cc, etc.) still generate barrier intrinsics. You can disregard the previous request to “confirm HIP no longer receives barrier intrinsics.”

Likely an incorrect or invalid review comment.

src/target/codegen_cuda.cc (2)

693-701: Support for shared.barrier in storage scope looks correct

Emitting __shared__ for shared.barrier matches the backing storage expectations for barrier objects.


1691-1697: Allocation flow for shared.barrier is sound

Using a uint64_t backing array and reinterpret_cast to the barrier object pointer is correct and matches the barrier header API.

Comment on lines +6 to +11
// Reuse cutlass advanced barrier abstraction
using Barrier = cutlass::arch::ClusterTransactionBarrier;

namespace tl {

TL_DEVICE void mbarrier_init(uint64_t &smem_barrier, uint32_t arrive_count) {


⚠️ Potential issue

Namespace mismatch: Barrier alias should be inside tl

Codegen emits tl::Barrier, but the alias is currently at global scope. This will fail to compile when referencing tl::Barrier.

Apply this diff to move the alias:

-// Reuse cutlass advanced barrier abstraction
-using Barrier = cutlass::arch::ClusterTransactionBarrier;
-
-namespace tl {
+// Reuse cutlass advanced barrier abstraction
+namespace tl {
+using Barrier = cutlass::arch::ClusterTransactionBarrier;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-// Reuse cutlass advanced barrier abstraction
-using Barrier = cutlass::arch::ClusterTransactionBarrier;
-namespace tl {
-TL_DEVICE void mbarrier_init(uint64_t &smem_barrier, uint32_t arrive_count) {
+// Reuse cutlass advanced barrier abstraction
+namespace tl {
+using Barrier = cutlass::arch::ClusterTransactionBarrier;
+TL_DEVICE void mbarrier_init(uint64_t &smem_barrier, uint32_t arrive_count) {
🤖 Prompt for AI Agents
In src/tl_templates/cuda/barrier.h around lines 6 to 11, the using alias
"Barrier = cutlass::arch::ClusterTransactionBarrier;" is declared at global
scope but code expects tl::Barrier; move the alias inside the tl namespace by
removing the global-scope declaration and adding the using alias within the tl {
... } block (place it before mbarrier_init) so references to tl::Barrier resolve
correctly.

Comment on lines +7 to 8
#include "barrier.h"
#include "common.h"


⚠️ Potential issue

Missing type_traits include for std::is_pointer_v

Compilation will fail on device code without including <type_traits>.

Apply this diff:

-#include "barrier.h"
+#include "barrier.h"
+#include <type_traits>
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-#include "barrier.h"
-#include "common.h"
+#include "barrier.h"
+#include <type_traits>
+#include "common.h"
🤖 Prompt for AI Agents
In src/tl_templates/cuda/copy_sm90.h around lines 7 to 8, the file uses
std::is_pointer_v but does not include <type_traits>, which causes device
compile failures; add #include <type_traits> alongside the existing includes
(i.e., include <type_traits> after or before "common.h"/"barrier.h") so the
template trait is available during compilation.

- Removed unused import of `torch.backends` in the example file.
- Introduced logging for kernel compilation in `JITKernel`, replacing print statements with structured logging for better traceability and debugging.
- Added an assertion to ensure the presence of the `global_symbol` attribute in the kernel function.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tilelang/jit/kernel.py (1)

14-16: Good switch to module-level logging; add a NullHandler to avoid "No handler found" warnings in library code.

This keeps logging opt-in for applications and avoids noisy warnings when no handlers are configured.

-import logging
-
-logger = logging.getLogger(__name__)
+import logging
+
+logger = logging.getLogger(__name__)
+# Avoid "No handler could be found" warnings in library code when no handlers are configured.
+logger.addHandler(logging.NullHandler())
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 383ad60 and 06ee681.

📒 Files selected for processing (2)
  • examples/gemm_fp8/example_tilelang_gemm_fp8_intrinsic.py (0 hunks)
  • tilelang/jit/kernel.py (2 hunks)
💤 Files with no reviewable changes (1)
  • examples/gemm_fp8/example_tilelang_gemm_fp8_intrinsic.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-test-nvidia
  • GitHub Check: build-test-amd
  • GitHub Check: bot-task

Comment on lines 118 to 125
        # NOTE(Chenggang): printing could let the training/inference framework easier to know
        # whether the communication timeout is from compilation
        if env.TILELANG_PRINT_ON_COMPILATION.lower() in ("1", "true", "yes", "on"):
-            print(f"TileLang begins to compile kernel `{func.__name__}` with `{out_idx=}`")
+            # assert func must have "global_symbol"
+            func_name = func.attrs.get("global_symbol")
+            assert func_name is not None, "func must have global_symbol"
+            logger.info(f"TileLang begins to compile kernel `{func_name}` with `{out_idx=}`")



⚠️ Potential issue

Don't assert on PrimFunc.global_symbol; guard attrs None and make the log best-effort.

Asserting on a missing "global_symbol" will break compilation for valid PrimFuncs without attrs set, or when func is None. This log should never be fatal. Also normalize the env flag to str to avoid attribute errors if it isn't a string, and use parameterized logging.

-        if env.TILELANG_PRINT_ON_COMPILATION.lower() in ("1", "true", "yes", "on"):
-            # assert func must have "global_symbol"
-            func_name = func.attrs.get("global_symbol")
-            assert func_name is not None, "func must have global_symbol"
-            logger.info(f"TileLang begins to compile kernel `{func_name}` with `{out_idx=}`")
+        if str(env.TILELANG_PRINT_ON_COMPILATION).lower() in ("1", "true", "yes", "on"):
+            attrs = getattr(func, "attrs", None)
+            func_name = attrs.get("global_symbol") if attrs is not None else None
+            # Best-effort subject for logging if global_symbol is missing
+            subject = func_name or getattr(func, "name", None) or repr(func)
+            logger.info("TileLang begins to compile kernel `%s` with out_idx=%s", subject, out_idx)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        # NOTE(Chenggang): printing could let the training/inference framework easier to know
-        # whether the communication timeout is from compilation
-        if env.TILELANG_PRINT_ON_COMPILATION.lower() in ("1", "true", "yes", "on"):
-            print(f"TileLang begins to compile kernel `{func.__name__}` with `{out_idx=}`")
-            # assert func must have "global_symbol"
-            func_name = func.attrs.get("global_symbol")
-            assert func_name is not None, "func must have global_symbol"
-            logger.info(f"TileLang begins to compile kernel `{func_name}` with `{out_idx=}`")
+        # NOTE(Chenggang): printing could let the training/inference framework easier to know
+        # whether the communication timeout is from compilation
+        if str(env.TILELANG_PRINT_ON_COMPILATION).lower() in ("1", "true", "yes", "on"):
+            attrs = getattr(func, "attrs", None)
+            func_name = attrs.get("global_symbol") if attrs is not None else None
+            # Best-effort subject for logging if global_symbol is missing
+            subject = func_name or getattr(func, "name", None) or repr(func)
+            logger.info("TileLang begins to compile kernel `%s` with out_idx=%s", subject, out_idx)
🤖 Prompt for AI Agents
In tilelang/jit/kernel.py around lines 118 to 125, the code currently asserts on
func.attrs.get("global_symbol") and assumes env.TILELANG_PRINT_ON_COMPILATION is
a string; change this to a non-fatal, best-effort log: coerce the env flag to
str before lowercasing (e.g., str(env.TILELANG_PRINT_ON_COMPILATION).lower()),
check isinstance(func, SomeExpectedType) or at least guard func is not None and
attrs is a dict-like object, retrieve func_name = (func.attrs or
{}).get("global_symbol") without asserting, and call logger.info with
parameterized logging (logger.info("TileLang begins to compile kernel %s
out_idx=%s", func_name or "<unknown>", out_idx)) so missing global_symbol or a
non-string env value won't raise.

- Removed the test for `example_dequant_gemm_bf16_fp4_hopper_serial` to streamline the testing suite.
- Updated the `mbarrier_cp_async_arrive` function to support both pointer and non-pointer types, enhancing flexibility in barrier management.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/ci.yml (1)

85-87: REQS_HASH is using raw file contents instead of a stable digest (cache key bug)

This differs from the first job’s sha256 approach and can produce brittle marker names (leading spaces/newlines) and cache misses/hits unrelated to content changes. Use a digest like in format-check.

Apply this diff:

-        REQS_HASH=$(cat requirements-test.txt 2>/dev/null || true)
+        REQS_HASH=$(sha256sum requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
♻️ Duplicate comments (1)
src/tl_templates/cuda/barrier.h (1)

6-11: Fix namespace: move Barrier alias into tl to match codegen (compile fix)

Codegen emits tl::Barrier but the alias is at global scope. This will fail to compile. Also, the AI summary claims it’s in tl, which is inconsistent with the current code.

Apply this diff:

-// Reuse cutlass advanced barrier abstraction
-using Barrier = cutlass::arch::ClusterTransactionBarrier;
-
-namespace tl {
+// Reuse cutlass advanced barrier abstraction
+namespace tl {
+using Barrier = cutlass::arch::ClusterTransactionBarrier;
🧹 Nitpick comments (5)
.github/workflows/ci.yml (3)

114-114: Make test parallelism configurable instead of hard-coding 8 workers

Hard-coding -n 8 may oversubscribe CPU/GPU on some runners and underutilize on others. Make it tunable via an env var with a sensible default to keep your intent.

Apply this diff:

-        python -m pytest -n 8 **/test*.py
+        python -m pytest -n "${PYTEST_XDIST_WORKERS:-8}" **/test*.py

121-121: Same here: avoid hard-coding test worker count

Mirror the configurability for the main test suite.

Apply this diff:

-        python -m pytest -n 8
+        python -m pytest -n "${PYTEST_XDIST_WORKERS:-8}"

22-22: Upgrade setup-python action to v5 (Node 16 deprecation, reliability)

actions/setup-python@v2 uses Node 16 which is deprecated on GitHub-hosted runners. v5 is the supported release and works on self-hosted as well.

Apply this diff:

-      uses: actions/setup-python@v2
+      uses: actions/setup-python@v5

Also applies to: 78-78

src/tl_templates/cuda/barrier.h (2)

25-26: Unify barrier scope qualifiers (use shared::cta unless cluster is intended)

Most of your mbarrier ops specify the scope (e.g., shared::cta in test_wait), but some don’t. For consistency and to avoid relying on defaults, explicitly use shared::cta where appropriate.

Apply this diff:

-               "mbarrier.try_wait.parity.shared.b64 P1, [%1], %2; \n\t"
+               "mbarrier.try_wait.parity.shared::cta.b64 P1, [%1], %2; \n\t"
-                 "mbarrier.try_wait.parity.shared.b64 P1, [%0], %1, %2; \n\t"
+                 "mbarrier.try_wait.parity.shared::cta.b64 P1, [%0], %1, %2; \n\t"
-               "mbarrier.try_wait.shared.b64 P1, [%0], %1;\n"
+               "mbarrier.try_wait.shared::cta.b64 P1, [%0], %1;\n"

If some of these should target cluster scope, please adjust accordingly and keep the qualifiers explicit.

Also applies to: 42-43, 136-137


132-141: Use a read-write constraint for state in syncthreads_partial

state is defined by mbarrier.arrive and consumed by mbarrier.try_wait within the same asm block. Declare it as a "+l" output to reflect this data flow and avoid constraint pitfalls.

Apply this diff:

-  asm volatile("{\n"
-               ".reg .pred                P1;\n"
-               "mbarrier.arrive.shared.b64 %1, [%0];\n"
-               "LAB_WAIT:\n"
-               "mbarrier.try_wait.shared.b64 P1, [%0], %1;\n"
-               "@!P1                      bra.uni LAB_WAIT;\n"
-               "}\n"
-               :
-               : "r"(smem_int_ptr), "l"(state));
+  asm volatile("{\n"
+               ".reg .pred                P1;\n"
+               "mbarrier.arrive.shared.b64 %0, [%1];\n"
+               "LAB_WAIT:\n"
+               "mbarrier.try_wait.shared::cta.b64 P1, [%1], %0;\n"
+               "@!P1                      bra.uni LAB_WAIT;\n"
+               "}\n"
+               : "+l"(state)
+               : "r"(smem_int_ptr)
+               : "memory");
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 06ee681 and e5dd402.

📒 Files selected for processing (3)
  • .github/workflows/ci.yml (1 hunks)
  • examples/dequantize_gemm/test_example_dequantize_gemm.py (0 hunks)
  • src/tl_templates/cuda/barrier.h (1 hunks)
💤 Files with no reviewable changes (1)
  • examples/dequantize_gemm/test_example_dequantize_gemm.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/tl_templates/cuda/barrier.h (3)
src/tl_templates/cuda/copy_sm90.h (2)
  • tl (10-250)
  • void (245-248)
src/tl_templates/cuda/common.h (7)
  • tl (207-261)
  • void (178-182)
  • void (187-190)
  • void (192-195)
  • uint32_t (107-109)
  • int (86-89)
  • int (112-119)
tilelang/language/builtin.py (4)
  • mbarrier_arrive (201-216)
  • mbarrier_expect_tx (219-228)
  • fence_proxy_async (79-88)
  • tma_store_arrive (91-100)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build-test-nvidia

Comment on lines +3 to +5
#include "common.h"
#include <cutlass/arch/barrier.h>



⚠️ Potential issue

Missing include for std::is_pointer_v

This header uses std::is_pointer_v but doesn’t include <type_traits>. Depending on include order, this can break builds.

Apply this diff:

 #include "common.h"
 #include <cutlass/arch/barrier.h>
 
+#include <type_traits>
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-#include "common.h"
-#include <cutlass/arch/barrier.h>
+#include "common.h"
+#include <cutlass/arch/barrier.h>
+#include <type_traits>
🤖 Prompt for AI Agents
In src/tl_templates/cuda/barrier.h around lines 3 to 5, the header uses
std::is_pointer_v but does not include <type_traits>, which can cause build
failures depending on include order; add #include <type_traits> near the other
includes (e.g., after "common.h") so std::is_pointer_v is available wherever
this header is used.

Comment on lines +11 to +16
TL_DEVICE void mbarrier_init(uint64_t &smem_barrier, uint32_t arrive_count) {
uint32_t smem_int_ptr = smem_ptr_to_uint(&smem_barrier);
asm volatile("mbarrier.init.shared.b64 [%1], %0;"
:
: "r"(arrive_count), "r"(smem_int_ptr));
}


🛠️ Refactor suggestion

Add memory clobbers to barrier/fence/cp.async asm to prevent compiler reordering

Without a "memory" clobber, the compiler may move memory ops across these inline asm blocks, breaking the intended synchronization semantics.

Apply these diffs:

 asm volatile("mbarrier.init.shared.b64 [%1], %0;"
-               :
-               : "r"(arrive_count), "r"(smem_int_ptr));
+               :
+               : "r"(arrive_count), "r"(smem_int_ptr)
+               : "memory");
-  asm volatile("mbarrier.arrive.shared.b64 _, [%0];" : : "r"(smem_int_ptr));
+  asm volatile("mbarrier.arrive.shared.b64 _, [%0];" : : "r"(smem_int_ptr) : "memory");
 asm volatile("mbarrier.expect_tx.shared.b64 [%1], %0;"
-               :
-               : "r"(transaction_bytes), "r"(smem_int_ptr));
+               :
+               : "r"(transaction_bytes), "r"(smem_int_ptr)
+               : "memory");
 asm volatile("mbarrier.arrive.expect_tx.shared.b64 _, [%1], %0;"
-               :
-               : "r"(transaction_bytes), "r"(smem_int_ptr));
+               :
+               : "r"(transaction_bytes), "r"(smem_int_ptr)
+               : "memory");
-  asm volatile("cp.async.mbarrier.arrive.shared.b64 [%0];"
-               :
-               : "r"(smem_int_mbar));
+  asm volatile("cp.async.mbarrier.arrive.shared.b64 [%0];"
+               :
+               : "r"(smem_int_mbar)
+               : "memory");
-TL_DEVICE void fence_proxy_async() {
-  asm volatile("fence.proxy.async.shared::cta;" : :);
-}
+TL_DEVICE void fence_proxy_async() {
+  asm volatile("fence.proxy.async.shared::cta;" : : : "memory");
+}
-TL_DEVICE void tma_store_arrive() {
-  asm volatile("cp.async.bulk.commit_group;");
-}
+TL_DEVICE void tma_store_arrive() {
+  asm volatile("cp.async.bulk.commit_group;" ::: "memory");
+}
 asm volatile("{\n"
                ".reg .pred                P1;\n"
                "mbarrier.arrive.shared.b64 %1, [%0];\n"
                "LAB_WAIT:\n"
                "mbarrier.try_wait.shared.b64 P1, [%0], %1;\n"
                "@!P1                      bra.uni LAB_WAIT;\n"
                "}\n"
                :
-               : "r"(smem_int_ptr), "l"(state));
+               : "r"(smem_int_ptr), "l"(state)
+               : "memory");

Consider also adding the clobber to mbarrier_wait/mbarrier_test_wait loops for completeness.

Also applies to: 68-71, 87-93, 95-101, 103-114, 116-118, 121-127, 129-141

🤖 Prompt for AI Agents
In src/tl_templates/cuda/barrier.h around lines 11-16 (and likewise at the other
reported ranges 68-71, 87-93, 95-101, 103-114, 116-118, 121-127, 129-141), the
inline asm blocks for mbarrier_init and the barrier/fence/cp.async primitives
lack a "memory" clobber which allows the compiler to reorder memory operations
across them; update each asm volatile(...) invocation to include "memory" in the
clobber list (i.e., add "memory" to the asm constraints) and also add the
"memory" clobber to the asm inside mbarrier_wait/mbarrier_test_wait loops for
completeness so the compiler cannot move loads/stores across these
synchronization points.
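
For the wait loops mentioned above, one possible shape of the clobbered version; the `TL_DEVICE` macro, the `smem_ptr_to_uint` helper, and the exact PTX form follow the snippet quoted at the top of this comment and are assumptions about the real header:

TL_DEVICE void mbarrier_wait(uint64_t &smem_barrier, uint32_t phase_parity) {
  uint32_t smem_int_ptr = smem_ptr_to_uint(&smem_barrier);
  uint32_t done = 0;
  while (!done) {
    // the "memory" clobber keeps the compiler from hoisting loads/stores across the wait
    asm volatile("{\n"
                 ".reg .pred                P1;\n"
                 "mbarrier.try_wait.parity.shared.b64 P1, [%1], %2;\n"
                 "selp.u32 %0, 1, 0, P1;\n"
                 "}\n"
                 : "=r"(done)
                 : "r"(smem_int_ptr), "r"(phase_parity)
                 : "memory");
  }
}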

LeiWang1999 added a commit that referenced this pull request Aug 21, 2025
* 📝 Add docstrings to `main`

Docstrings generation was requested by @LeiWang1999.

* #742 (comment)

The following files were modified:

* `src/transform/atomicadd_vectorize.cc`

* lint fix

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Fix typos in rasterization parameters and update import path for cached module

- Corrected the spelling of `enable_rasteration` to `enable_rasterization` in the matmul function and its usage.
- Updated the import statement for the `cached` module to reflect the new path in the cache submodule.
- Added `StridedTensor` import in the language module for enhanced tensor functionality.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
testing/python/cache/test_tilelang_cache_matmul.py (2)

70-78: Make the test robust: seed RNG and skip gracefully when CUDA is unavailable

This test will randomly fail on some CI runners without GPUs or with nondeterministic kernels. Add a light guard and seeding near the random tensor creation.

Apply this diff within run_cache_matmul():

@@
-    import torch
+    import torch
+    import pytest
@@
-    a = torch.randn(1024, 1024).cuda().half()
-    b = torch.randn(1024, 1024).cuda().half()
+    if not torch.cuda.is_available():
+        pytest.skip("CUDA not available; skipping cache matmul test")
+    torch.manual_seed(0)
+    torch.cuda.manual_seed_all(0)
+    a = torch.randn(1024, 1024, device="cuda", dtype=torch.half)
+    b = torch.randn(1024, 1024, device="cuda", dtype=torch.half)

78-91: Reduce noisy test output by gating prints behind an env flag

CI logs can get chatty. Gate the informational prints to keep outputs clean while preserving local debuggability.

@@
-    print("\nOutput from Cached Kernel:")
-    print(c)
+    import os
+    if os.getenv("TL_TEST_VERBOSE"):
+        print("\nOutput from Cached Kernel:")
+        print(c)
@@
-    print("\nReference PyTorch Output:")
-    print(ref_c)
+    if os.getenv("TL_TEST_VERBOSE"):
+        print("\nReference PyTorch Output:")
+        print(ref_c)
@@
-    print("\nOutputs are close (within tolerance).")
+    if os.getenv("TL_TEST_VERBOSE"):
+        print("\nOutputs are close (within tolerance).")
@@
-    print("\nCUDA Kernel Source:")
-    print(kernel.get_kernel_source())
+    if os.getenv("TL_TEST_VERBOSE"):
+        print("\nCUDA Kernel Source:")
+        print(kernel.get_kernel_source())
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e5dd402 and 8404fda.

📒 Files selected for processing (3)
  • testing/python/autotune/test_tilelang_autotune_with_inputs.py (3 hunks)
  • testing/python/cache/test_tilelang_cache_matmul.py (1 hunks)
  • tilelang/language/__init__.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • testing/python/autotune/test_tilelang_autotune_with_inputs.py
🧰 Additional context used
🧬 Code Graph Analysis (2)
testing/python/cache/test_tilelang_cache_matmul.py (2)
tilelang/cache/__init__.py (1)
  • cached (14-37)
tilelang/cache/kernel_cache.py (1)
  • cached (110-198)
tilelang/language/__init__.py (1)
tilelang/language/proxy.py (1)
  • StridedTensor (251-252)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-test-nvidia
  • GitHub Check: build-test-amd
  • GitHub Check: bot-task
🔇 Additional comments (4)
tilelang/language/__init__.py (2)

20-20: Expose StridedTensor at tilelang.language namespace — LGTM

Additive API, aligns with Buffer/Tensor re-exports and improves discoverability for users importing from tilelang.language. No runtime risk evident.


15-24: No memscope module present—no API narrowing risk here
A search of the repository shows there is no tilelang/language/memscope.py (or memscope/ package) and no references to memscope or its symbols in downstream code. The public API in tilelang/language/__init__.py already only imports the proxy symbols you’ve listed, so there is no wildcard‐injection of memscope to remove and no breaking change to callers.

– No files or imports for memscope were found in tilelang/language or elsewhere.
– No downstream usage of tilelang.language.memscope or its symbols exists.

You can safely keep the explicit proxy imports as-is without adding a soft-compat alias for memscope—there’s nothing to expose.

Likely an incorrect or invalid review comment.

testing/python/cache/test_tilelang_cache_matmul.py (2)

3-3: Import path update to tilelang.cache.cached — LGTM

This aligns the test with the modularized public API and avoids relying on a legacy top-level re-export.


3-3: No stray imports or references found

Ran searches across the repository:

  • No occurrences of from tilelang import cached
  • No calls to tilelang.cached(...)
  • Confirmed that cached is now defined in tilelang/cache/__init__.py (around line 14)

All legacy import paths have been removed and the new public entry point correctly exposes cached. No further action required.

@LeiWang1999 LeiWang1999 merged commit cb37bfe into tile-ai:main Aug 21, 2025
7 of 8 checks passed
@LeiWang1999 LeiWang1999 deleted the barrier_0820 branch August 21, 2025 12:03
coderabbitai bot added a commit that referenced this pull request Aug 21, 2025
Docstrings generation was requested by @LeiWang1999.

* #744 (comment)

The following files were modified:

* `examples/warp_specialize/example_warp_specialize_flashmla.py`
* `examples/warp_specialize/example_warp_specialize_gemm_copy_0_gemm_1.py`
* `src/op/elem.cc`
* `src/target/codegen_cuda.cc`
* `src/target/codegen_cuda.h`
* `src/target/codegen_hip.cc`
* `src/tl_templates/cuda/barrier.h`
* `src/tl_templates/cuda/common.h`
* `src/tl_templates/cuda/copy_sm90.h`
* `src/transform/inject_fence_proxy.cc`
* `src/transform/lower_device_storage_access_info.cc`
* `src/transform/lower_shared_barrier.cc`
* `src/transform/storage_rewrite.cc`
* `src/transform/warp_specialized_rewriter.cc`
* `testing/python/autotune/test_tilelang_autotune_with_inputs.py`
* `tilelang/engine/phase.py`
* `tilelang/jit/kernel.py`

coderabbitai bot commented Aug 21, 2025

Note

Generated docstrings for this pull request at #747

This was referenced Sep 24, 2025
chengyupku added a commit to tile-ai/tilescale that referenced this pull request Oct 24, 2025
* [Index] Relocate Int64 Auto Promoter to ConfigBitWidth Pass, removing it from FlattenBuffer (#714)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107

* Refactor inject_pipeline.cc to enhance pipeline body rewriting and condition handling

- Introduced a new function to replace IfThenElse nodes with their then_case while preserving attributes.
- Streamlined the PipelineBodyRewriter to improve buffer access rewriting and async state management.
- Enhanced the handling of pipeline loop conditions and added support for predicate conditions in the pipeline body.
- Removed obsolete code and improved overall code clarity and maintainability.

* lint fix

* Refactor return statements in inject_pipeline.cc to remove unnecessary std::move calls

- Updated return statements in multiple methods to return objects directly instead of using std::move, improving code clarity and potentially avoiding unnecessary moves.
- Ensured consistent handling of BufferStore and BufferLoad nodes during pipeline transformations.

* test fix

* Enhance global read detection in pipeline planning

- Updated the handling of global reads to account for condition expressions within IfThenElse nodes, ensuring accurate identification of global memory accesses.
- Introduced a new flag to track whether the visitor is within a condition expression, improving the correctness of buffer access analysis.
- Refactored the VisitStmt_ method to properly handle the structure of IfThenElse nodes, enhancing the clarity and maintainability of the code.

* Add IndexLegalizer to enforce int64 for out-of-bound indices

- Introduced the IndexLegalizer class to ensure that indices in BufferStore and BufferLoad nodes are promoted to int64 when they exceed their type bounds.
- Refactored the Int64Promoter logic from flatten_buffer.cc into IndexLegalizer, improving code organization and reusability.
- Updated the ConfigIndexBitwidth pass to apply IndexLegalizer after rewriting the body, enhancing the handling of index bitwidths in transformations.

* [CI] Bind build-test CI to NVIDIA as AMD runners are being introduced (#718)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107

* Rename build-test job to build-test-nvidia and specify nvidia as a runner label in CI workflow.

* Update CI workflow to specify 'nvidia' as an additional runner label for the format-check job.

* fix: NVRTC backend (#717)

* fix: NVRTC backend

* fix: CI

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [CUDA] Init support for sm_120 (#716)

* Init support for sm120

* fmt

* resolve comments

* unify mma gemm

* fmt

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [CI] fix docs ci (#720)

* [Chore] fix typos (#719)

* chore: fix typos

* chore: fix ruff

* chore: fix clang-format

* [CI][AMD] Add AMD GPU CI and fix some related bugs (#694)

* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

* Add input configuration file and update AMD example script for enhanced flexibility

- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.

* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.

* Refactor AMD example script for FlashAttention-2

- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.

* Refactor formatting in AMD FlashAttention example script

- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.

* Update example_amd_flash_attn_fwd.py

* Update AMD FlashAttention example and TVM submodule

- Added a new example script `example_amd_flash_attn_fwd_k_block.py` for FlashAttention with K-blocking support.
- Enhanced `example_amd_flash_attn_fwd.py` by expanding configuration options for block sizes and threads.
- Updated the TVM submodule to the latest commit for improved functionality.
- Introduced a new test script `test.sh` to facilitate running the new example with specified parameters.

* Add CI workflow for automated format checking and testing

- Introduced a new GitHub Actions workflow in `amd_ci.yml` to automate format checks and testing for pull requests.
- The workflow includes steps for setting up a Python environment, running format checks, and executing tests.
- Removed obsolete example script `example_amd_flash_attn_fwd_k_block.py` and test script `test.sh` to streamline the examples directory.

* Rename CI workflow from "CI" to "AMD CI" for clarity and specificity.

* Update AMD CI workflow to include copying PyTorch, TorchVision, and Torchaudio packages to the virtual environment for improved dependency management.

* Update AMD CI workflow to install pytest directly instead of using requirements-test.txt

* Update AMD CI workflow to remove 'flash-attn' from requirements and install dependencies from requirements-test.txt

* Refactor AMD CI workflow to enhance clarity in removing 'flash-attn' from requirements-test.txt before installation

* Remove Torchaudio package copying from AMD CI workflow to streamline dependency management.

* Refactor AMD CI workflow to remove the format-check job and streamline the build-test process by directly copying PyTorch and TorchVision packages to the virtual environment.

* Add installation of ROCm in AMD CI workflow

- Included a step to execute the `install_rocm.sh` script for improved setup.
- Removed unnecessary blank line for better readability in the workflow script.

* Remove installation step for ROCm in AMD CI workflow to simplify the setup process.

* Update AMD CI workflow to run specific test file with verbose output instead of all tests.

* Add new tilelang built-in operations for AMD architecture

- Introduced `tvm_mfma`, `tvm_mfma_store`, `tvm_rdna_wmma`, and `tvm_rdna_wmma_store` built-in operations to enhance support for matrix multiplication and storage in tilelang.
- Each operation is configured with the appropriate number of inputs and marked as opaque in terms of call effects.

* Enhance autotuner configurations and GEMM operations in AMD example

- Updated block sizes and num_split_q parameters in `get_configs` for improved autotuning.
- Modified `T.gemm` calls in `fast_flashattn` to utilize `GemmWarpPolicy.FullRow`, optimizing performance for matrix multiplications.

* Update autotuner configurations in AMD example for enhanced performance

- Refined block sizes, thread counts, and added new parameters in `get_configs` to optimize autotuning.
- Adjusted `fast_flashattn` function to incorporate new parameters for panel size and coalesced widths, improving memory access patterns.

* Enhance autotuner configurations and memory handling in AMD example

- Expanded block sizes and thread counts in `get_configs` for improved autotuning capabilities.
- Updated `fast_flashattn` to utilize a new shared memory allocation strategy, optimizing memory access patterns during GEMM operations.

* Refine autotuner configurations and memory usage in AMD example

- Reduced block sizes and adjusted thread counts in `get_configs` for optimized autotuning.
- Updated `fast_flashattn` to utilize register fragments for accumulation, minimizing LDS usage and enhancing performance during GEMM operations.

* Update autotuner configurations in AMD example for enhanced performance

- Expanded block sizes and thread counts in `get_configs` to improve autotuning capabilities.
- Adjusted `num_split_q` and `v_coalesced_width` parameters for better optimization during GEMM operations.

* Enhance autotuner configurations and GEMM operations in AMD example

- Expanded thread counts in `get_configs` to include higher values for improved autotuning.
- Updated `fast_flashattn` to adjust accumulation logic and ensure proper handling of causal conditions, optimizing performance during matrix multiplications.

* Update AMD CI workflow and remove obsolete test script

- Modified the CI workflow to run on multiple environments: self-hosted, amd, and gpu.
- Deleted the outdated `test.sh` script from the examples directory, streamlining the project structure.

* Remove TVM subproject from 3rdparty directory

* Refactor configuration generation and accumulation logic in AMD example

- Reformatted the `get_configs` function for improved readability by aligning parameters.
- Adjusted the `fast_flashattn` function to enhance clarity in the conditional logic for accumulation, ensuring better handling of causal conditions.

* Enhance AMD CI workflow with additional logging and setup steps

- Added echo statements to provide feedback during the CI process, indicating when the environment is running on an AMD GPU, copying necessary packages, and installing requirements.
- Improved clarity in the workflow by explicitly stating when the project is being installed and when tests are being executed.

* Comment out package copying in AMD CI workflow to prevent potential issues during environment setup

* Update AMD CI workflow to install nightly versions of PyTorch and remove obsolete package copying steps

* Enhance BuildTileLangHIP function by adding whitespace for improved readability

* Refactor kTVMGridConstant definition for clarity and remove unnecessary comment

* Update TVM subproject to latest commit a64a5926a6e59f5417ef2501f9d88b467337cf6a

* lint fix

* Update AMD CI workflow to use requirements-rocm.txt for dependency installation

* fix ci

* Remove dependency on format-check from AMD CI workflow

* fix ci

* fix ci

* fix ci

* Remove format-check job from AMD CI workflow

* Add torch to requirements-rocm.txt and remove explicit pip install commands from AMD CI workflow

* Add dependency on format-check job in AMD CI workflow

* Add format-check job to AMD CI workflow

* Update format-check job in AMD CI workflow to run on self-hosted environment

* Enhance format-check job in AMD CI workflow with improved Python environment setup and automatic commit of lint changes

* Update amd_ci.yml

---------

Co-authored-by: xinxyxiao <xinyxiao@amd.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy (#724)

* [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy

* [Typo] Correct architecture selection for CUDA and CDNA

* [Refactor] Refactor CUDA code generation to simplify eviction policy handling (#721)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107

* Refactor CUDA code generation to simplify eviction policy handling

- Updated `VisitExpr_` methods in `codegen_cuda.cc` to use default eviction policy for `tma_load`, `tma_load_im2col`, and `tma_store` functions, reducing complexity.
- Removed conditional assembly code for `EVICT_NORMAL` in `copy_sm90.h`, streamlining the assembly calls for tensor memory operations.

* lint fix

* [Language] Introduce `StridedTensor` to support non contigious torch inputs (#722)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107

* Support strided tensors

* Refactor target attribute helper functions for improved clarity

* No code changes made in proxy.py and setup.py

* lint fix

* lint fix via gemini

* lint fix

* test fix

* test fix

* lint fix

* Update wrapper.py

* test fix

* Enhance test for InjectSoftwarePipeline by adding LowerOpaqueBlock transformation and updating expected function signature to use match_buffer for better clarity.

* lint fix

---------

Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

* [Enhancement][Bugfix] Fix bug in warp specialized pass and add gemm_sr fallback support for Hopper (#712)

* bug fix and support gemm_sr fallback for hopper

* Update gemm.cc

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* 📝 Add docstrings to `fix` (#726)

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/712#issuecomment-3190680851

The following files were modified:

* `src/op/gemm.cc`
* `src/tl_templates/cuda/gemm_sm90.h`
* `src/transform/warp_specialized_rewriter.cc`

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* [CI] Fix AMD CI (#729)

* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

* Add input configuration file and update AMD example script for enhanced flexibility

- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.

* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.

* Refactor AMD example script for FlashAttention-2

- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.

* Refactor formatting in AMD FlashAttention example script

- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.

* Update example_amd_flash_attn_fwd.py

* Enhance AMD example script and update CI workflows

- Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
- Added new CI workflows for AMD and documentation publishing.
- Updated various requirements files to include necessary dependencies.
- Introduced new test cases and examples for better coverage and functionality.
- Refactored existing code for improved readability and maintainability.

* Remove redundant tool cache cleanup step in AMD CI workflow

* Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.

---------

Co-authored-by: xinxyxiao <xinyxiao@amd.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

* [Feature] Low-bit twiddling dequantization and FP4 GEMM (#725)

* [Dequant] Add bit-twiddling dequantize cuda for fp4-->bf16

* [Dequant] Add extern call and serial dequantization

* [Dequant] Parallel Dequant wait for fence debug.

* [Scale] Add scale matrix to mxfp4 gemm

* [Remove] Remove fence-buggy example and some generated source cuda code

* [MXFP4] Update initial version of MXFP4 GEMM

* [Scale] Add scale to latest mxfp4 gemm

* [Lint]

* [BugFix] Load Scale, disabe TMA to recover performance

* [Lint]

* [Lint]

* [Scale] Use L2 to hold Scale and enable TMA will slightly boost performance

* [Lint]

* Update example_dequant_gemm_bf16_fp4_hopper_serial.py

* Remove deprecated dequantization examples for BF16 and MXFP4 in the dequantize_gemm directory.

* Refactor dequantization examples for improved readability and consistency. Adjusted formatting in matmul function and added spacing for clarity. Updated function signatures and comments for better understanding.

* Refactor index_to_coordinates usage in bitnet example and update dequantization example configurations. Removed the custom index_to_coordinates function and replaced it with the built-in version. Adjusted block_K parameter in dequantization example for consistency.

* lint fix

* ci fix

* Remove non-existent example

* [BugFix] Add smem swizzle to recover performance of TMA

* [BugFix] Enough reg for producer when threads=512

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* 📝 Add docstrings to `mxfp4` (#732)

* 📝 Add docstrings to `mxfp4`

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/725#issuecomment-3191656561

The following files were modified:

* `examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py`
* `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py`
* `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py`
* `examples/dequantize_gemm/utils.py`
* `examples/gemm/example_gemm_autotune.py`
* `tilelang/intrinsics/utils.py`
* `tilelang/language/__init__.py`
* `tilelang/language/utils.py`
* `tilelang/quantize/mxfp.py`
* `tilelang/quantize/quantization.py`

* [Lint] More accurate docstring

* [Lint]

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: tzj-fxz <tzjfxz@gmail.com>

* [Refactor] Refactor env into a more flexible version (#740)

* Fix environment variable name for compilation print setting in `env.py`

* Remove deprecated test file for warp specialized pass configuration and refactor environment variable access in `env.py` to utilize a centralized `EnvVar` class for better management and clarity.

* lint fix

* Refactor cache check to use `env.is_cache_enabled()` for consistency in `tuner.py`

* [Enhancement] Add stride index validation in CythonKernelWrapper (#743)

* Introduced an assertion to ensure that the stride index is within the valid range of tensor dimensions in `cython_wrapper.pyx`.
* This change prevents potential out-of-bounds errors when accessing tensor dimensions, enhancing the robustness of the code.

* [Bugfix]:Fix atomic add auto vectorize memory access out of bound error (#742)

* [Bugfix]:Fix atomic add auto vectorize memory access out of bound error

* Update atomicadd_vectorize.cc

* format

* 📝 Add docstrings to PR #744 (#745)

* 📝 Add docstrings to `main`

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/742#issuecomment-3205103559

The following files were modified:

* `src/transform/atomicadd_vectorize.cc`

* lint fix

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Refactor] Refactor barrier management (#744)

* Introduce Barrier

* Enhance CUDA kernel with new barrier management and post-processing support

- Added a new CUDA kernel implementation in `example_mla_decode.py` for improved performance with shared memory barriers.
- Refactored barrier handling in `codegen_cuda.cc` and `codegen_hip.cc` to utilize a more flexible mbarrier structure.
- Updated intrinsic definitions from `ptx_stmatirx` to `ptx_stmatrix` across multiple files for consistency.
- Introduced additional print statements for debugging in the lowering phase of the TileLang engine.
- Enhanced the overall structure and readability of the codebase.

* Remove unused barrier handling code in CUDA and HIP code generators to streamline the implementation. This change enhances code clarity and reduces complexity in the barrier management logic.

* Enhance barrier management in TileLang

- Introduced a new intrinsic `allocate_barrier` for dynamic barrier allocation in the TileLang framework.
- Updated CUDA code generation to support the new barrier structure, allowing for improved synchronization in shared memory.
- Refactored existing barrier handling logic to accommodate the new intrinsic and streamline code.
- Added print statements for debugging purposes in various examples and the lowering phase of the TileLang engine.
- Removed deprecated memory scope handling code to enhance clarity and maintainability.

* lint fix

* lint fix

* Remove `allocate_barrier` intrinsic and related code from TileLang to streamline barrier management. This includes updates to CUDA code generation and the removal of associated Python wrappers, enhancing code clarity and maintainability.

* Refactor logging in JITKernel to improve kernel compilation tracking

- Removed unused import of `torch.backends` in the example file.
- Introduced logging for kernel compilation in `JITKernel`, replacing print statements with structured logging for better traceability and debugging.
- Added an assertion to ensure the presence of the `global_symbol` attribute in the kernel function.

* Refactor dequantization tests and update barrier function

- Removed the test for `example_dequant_gemm_bf16_fp4_hopper_serial` to streamline the testing suite.
- Updated the `mbarrier_cp_async_arrive` function to support both pointer and non-pointer types, enhancing flexibility in barrier management.
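
For illustration, one possible shape of such a dual-form entry point; the helper names follow the earlier barrier.h snippets, and the actual signature in the PR may differ:

#include <type_traits>

template <typename BarrierT>
TL_DEVICE void mbarrier_cp_async_arrive(BarrierT &&barrier) {
  uint32_t addr;
  if constexpr (std::is_pointer_v<std::remove_reference_t<BarrierT>>) {
    addr = smem_ptr_to_uint(barrier);  // called as mbarrier_cp_async_arrive(&bar)
  } else {
    addr = smem_ptr_to_uint(&barrier); // called as mbarrier_cp_async_arrive(bar)
  }
  asm volatile("cp.async.mbarrier.arrive.shared.b64 [%0];"
               :
               : "r"(addr)
               : "memory");
}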

* Update CI configuration to increase pytest parallelism from 4 to 8 threads for improved test execution speed.

* Fix typos in rasterization parameters and update import path for cached module

- Corrected the spelling of `enable_rasteration` to `enable_rasterization` in the matmul function and its usage.
- Updated the import statement for the `cached` module to reflect the new path in the cache submodule.
- Added `StridedTensor` import in the language module for enhanced tensor functionality.

* Update ci.yml

* [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy (#746)

* [Refactor] Merge bulk copy into copy and refactor layout inference for bulk copy

* Deleted the `bulk_copy` operator implementation and its header file as it is no longer needed.
* Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping.
* Updated related files to reflect these changes, ensuring that the codebase remains clean and maintainable.

* lint fix

* Fix typos in intrinsic names and remove unused print statement in block_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency.

* remove bulk copy

* Refactor copy and atomic add operations to support TMA lower configuration

- Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations.
- Modified `Lower` method in `Copy` to incorporate the new TMA configuration.
- Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic.
- Removed unused `disable_tma_lower` field from `LowerArgs` structure for clarity.
- Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach.

* Enhance TMA bulk copy logic in `LowerBulkCopy` method

- Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling.
- Updated warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities.

* lint fix

* Remove fallback logging for non-swizzled global layout in `LowerBulkCopy` method to streamline the bulk copy logic. This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions.

* Enhance reshape kernel compilation in `run_reshape` and `run_reshape_smem_1d_2_2d` functions

- Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations.
- Added TODO comments to indicate the need for further improvements in shared memory handling.

* Update `native_sparse_attention` function to include TMA configuration options

- Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations.
- Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1.

* Refactor JIT decorator formatting in `native_sparse_attention` function

- Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase.
- No functional changes were made; this update focuses on code clarity and maintainability.

* Enhance thread management and logging in TileLang compilation

- Added a method to check if printing is enabled during compilation, improving control over logging behavior.
- Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output.
- Added comments to clarify the purpose of changes and improve code readability.

* Add warp specialization scope and refactor register management in TileLang

- Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management.
- Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process.
- Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management.
- Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process.

* Refactor test for InjectSetMaxNReg pass in TileLang

- Improved readability by restructuring conditional checks and assertions in the test cases.
- Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic.
- Ensured consistent formatting and spacing throughout the test functions for better maintainability.

* Enhance bulk copy and store checks in `Copy` class

- Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options.
- Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations.
- Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging.

* lint fix

* [Refactor] Merge ThreadPartialSync and ThreadStorageSync (#741)

* Remove `thread_partial_sync.cc` and refactor `thread_storage_sync.cc` to streamline synchronization handling. Introduce `thread_sync_types.h` for thread-bound key definitions and reserved named barriers. Update related logic in `ThreadSyncInserter` and `TileLangThreadSync` for improved clarity and efficiency.

* Remove `sync_thread_partial` references and related documentation from the codebase. Update CUDA and HIP code generation files to eliminate calls to the removed function. Refactor `__sync_thread_partial` to `sync_thread_partial` in CUDA common header for consistency.

* Remove unused import of `bulk_copy.h` in `codegen_hip.cc` to enhance code clarity and maintainability.

* Add import of `bulk_copy.h` in `codegen_hip.cc` to support new functionality.

* typo fix

* Update data type in reduce_sum tests from float16 to float32 for consistency and clarity. Remove redundant dtype tests and streamline run functions. Enhance reshape kernel compilation with pass configurations to address shared memory layout issues.

* lint fix

* test fix

* Enhance CI configuration by adding verbose output to pip install command for better visibility during installation.

* use ninja instead of make

* Add CMake configuration step for Ninja build system in setup.py

* Update pyproject.toml to include additional build dependencies: build, torch, tox, auditwheel, patchelf, and ninja.

* Enhance CI configuration by adding verbose output to pytest commands for improved test visibility.

* Update pyproject.toml to add Cython as a build dependency. Enhance thread storage synchronization in thread_storage_sync.cc by introducing new thread variable handling and improving index disjointness checks.

* Update data type in cumulative sum tests from float16 to float32 for consistency. Modify run_cumsum function to utilize the updated dtype and enhance result validation with assertions. Adjust test cases accordingly.

* Refactor storage access handling by introducing buffer data mapping in TileLangStorageAccessVisitor. Enhance access entry structure to include pointer access flag. Update thread storage synchronization to accommodate new buffer data mappings. Adjust quickstart example to print kernel source for debugging purposes.

* Refactor linear index conversion in TileLangStorageAccessVisitor to utilize the analyzer for simplification. Update buffer index calculations to ensure consistent simplification of range expressions.

* bugfix

* Refactor buffer index calculation in TileLangStorageAccessVisitor to simplify access handling. Removed unused buffer mapping logic, ensuring consistent buffer index generation with a default ramp.

* Refactor TileLangStorageAccessVisitor to replace buffer indices with buffer ranges for improved pointer access handling. Update AccessEntry structure to include buffer_ranges and adjust thread storage synchronization logic to account for pointer access conflicts.

* Refactor thread storage synchronization to replace 'shared.dyn' with 'shared' for consistency in memory allocation. Update related test cases to reflect this change and ensure proper functionality.

* [Enhancement] Optimize loop body handling in IR (#749)

- Updated the loop body construction in `ir.cc` to conditionally include an output statement based on the analyzable condition of the `waves` variable.
- This change enhances performance by avoiding unnecessary statement wrapping when the condition is met, improving the efficiency of loop execution.

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [MXFP4] Fix bugs and optimize exponential operation (#750)

* [MXFP4] Fix bugs
- Optimize exp2 with shift operation to boost performance
- Fix bug of simple dequantization function call
- Fix bug of scaling factor with bias

* [Lint]

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h (#751)

- Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations.
- Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations.

* [Enhancement] Add shape checking for reduce options (#748)

* Add shape checking for reduce options

* lint fix

* Handle special case reducing into shape-1 tensor

Allow reducing [X, d, Y] into [X, Y] or [X, 1, Y]
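
For illustration, the accepted shapes amount to the following check (a sketch, not the operator's actual validation code):

#include <vector>

// For a reduce along `dim`, accept either the dim-dropped or the dim-kept-as-1 shape.
bool valid_reduce_dst(const std::vector<int> &src, const std::vector<int> &dst, int dim) {
  std::vector<int> dropped(src), kept(src);
  dropped.erase(dropped.begin() + dim); // [X, d, Y] -> [X, Y]
  kept[dim] = 1;                        // [X, d, Y] -> [X, 1, Y]
  return dst == dropped || dst == kept;
}
// e.g. valid_reduce_dst({8, 16, 32}, {8, 32}, 1) and valid_reduce_dst({8, 16, 32}, {8, 1, 32}, 1) both hold.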

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Bugfix] Add missing FP8 header include (#752)

* [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h

- Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations.
- Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations.

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Enhancement] Include cuda_fp8.h in gemm_sm90.h

- Added the inclusion of the "cuda_fp8.h" header file to support new data formats in CUDA GEMM operations, enhancing compatibility with recent updates for fp8 types.

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* lint fix

* [Refactor] Remove unused tl_shuffle_elect and related functions from common.h

- Deleted the `tl_shuffle_elect` function and its associated comments to streamline the codebase.
- Added inclusion of "intrin.h" for improved intrinsic support in CUDA operations.
- Cleaned up the file by removing unnecessary template parameters and functions, enhancing clarity and maintainability.

* lint fix

* [Refactor] Update header inclusions in common.h and gemm_sm90.h

- Removed the inclusion of "intrin.h" from common.h to streamline dependencies.
- Added "intrin.h" inclusion in gemm_sm90.h to ensure intrinsic support for CUDA operations, enhancing functionality and maintainability.

* bug fix

* [MXFP4] Add bias to MXFP4 GEMM kernel (#753)

* [MXFP4] Add bias to gemm kernel

* [Lint]

* [Lint] Rename "bias" to "Bias"

* [Bugfix][WS] Consider loop min extent when computing phase id (#754)

* Update test parameters and remove debug print statement

- Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution.
- Removed a debug print statement from `phase.py` to clean up the code and enhance clarity.

* Refactor loop stack management in warp_specialized_rewriter

- Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability.
- Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability.
- Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations.
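
For illustration, the linearization the fix targets can be sketched as follows (illustrative only, not the pass's actual code):

#include <cstdint>
#include <vector>

struct LoopInfo { std::int64_t value, min, extent; }; // mirrors the fields described above

// Linearize nested loop indices, offsetting each by its min so the result
// starts at 0 even when a loop begins at a non-zero bound.
std::int64_t linear_index(const std::vector<LoopInfo> &loops) {
  std::int64_t idx = 0;
  for (const auto &l : loops)
    idx = idx * l.extent + (l.value - l.min);
  return idx;
}
// e.g. for `for (i = 2; i < 10; ++i)`, the first iteration maps to slot 0,
// so `linear_index(...) % num_stages` gives the intended pipeline phase.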

* [Typo] Remove `disable_cache` in some tests (#755)

* Update test parameters and remove debug print statement

- Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution.
- Removed a debug print statement from `phase.py` to clean up the code and enhance clarity.

* Refactor loop stack management in warp_specialized_rewriter

- Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability.
- Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability.
- Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations.

* Remove unused `torch.backends` import and `tilelang.disable_cache()` calls from multiple test files to enhance code clarity and maintainability.

* [README] Update GDN README for clarity and add acknowledgements (#758)

- Improved formatting and clarity of the GDN kernel implementation description.
- Updated requirement section to list dependencies in a clearer format.
- Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions.

* cutlass v4.2.0 supporting cuda 13 (#760)

* [Feature] Add 1D TMA support (#761)

* [Feature] Add 1D TMA support
- Check the contiguous conditions of 1D TMA copy
- Add new interface and params order of `tma_load` and `tma_store` call
- Add 1D `tma_store` interface in sm90 template
- Add elementwise kernel for 1D TMA example

* [Lint]

* [BugFix] Add conditions for 1D TMA copy on non-swizzle shared tensors

* [Lint]

* [BugFix] 1D TMA load

* [README] Update GDN README for clarity and add acknowledgements (#758)

- Improved formatting and clarity of the GDN kernel implementation description.
- Updated requirement section to list dependencies in a clearer format.
- Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions.

* cutlass v4.2.0 supporting cuda 13 (#760)

* [Lint]

* [Lint]

* [MXFP4] Add test for bf16&mxfp4 gemm

* [BugFix]

* [Lint]

---------

Co-authored-by: Yu Cheng <54519279+chengyupku@users.noreply.github.com>
Co-authored-by: Johnny <johnnync13@gmail.com>

* [Example] Add vertical slash sparse attention pattern (#762)

* upd sparse attn

* lint

* rename

* update test file

* update benchmark

* lint

* update benchmark

* [Bugfix] Address PassContext contamination from CI and fix incorrect rewrites in warp specialized pass (#767)

* fix ci and pass bug

* fix

* try

* lint

* [MXFP4] Add 1D TMA copy for Scale tensor in MXFP4 GEMM (#766)

* [TMA] Add 1D TMA copy for Scale tensor

* [Lint]

* [Test] Add test for kernel

* [BugFix]

* hot fix blackwell (#768)

* [Refactor] Refactor `Operator` into `TileOperator` and with tvm reflection (#763)

* Refactor operator classes to inherit from TileOperator and update layout inference methods

- Changed base class of several operator classes (AtomicAdd, Copy, Gemm, etc.) from Operator to TileOperator for better alignment with tile operations.
- Updated InferLayout and Lower methods to use 'override' specifier for clarity and consistency.
- Adjusted header inclusions to replace "op.h" with "operator.h" across multiple files for improved organization.
- Added missing layout inference implementations for Fill and Conv2DIm2ColOp.
- Removed deprecated op.cc and op.h files to streamline the codebase.

* lint fix

* Refactor operator classes to use Node pattern and improve memory management

- Updated several operator classes (AtomicAdd, Copy, Gemm, etc.) to utilize the Node pattern for better memory management and encapsulation.
- Changed constructors to initialize member variables through a node object, enhancing clarity and reducing direct member access.
- Updated Clone methods to return TileOperator instances instead of unique pointers, aligning with the new design.
- Refactored InferLayout and Lower methods to ensure consistency across operator implementations.
- Adjusted header files to reflect the new class structure and removed deprecated code for a cleaner codebase.

* Enhance Clone methods in AtomicAdd and Copy classes to support parallel operation cloning

- Updated the Clone methods in AtomicAddNode and CopyNode to ensure that the parallel operation (par_op_) is properly cloned when defined, improving the integrity of cloned objects.
- Refactored the FillNode class to use ParallelOp directly instead of std::make_unique, streamlining the creation of parallel operations.
- Made minor adjustments in layout inference and other related methods for consistency and clarity.

* Refactor FillNode::Lower method to remove unused global function call

- Eliminated the call to the global function "tl.fill.lower" in the FillNode::Lower method, streamlining the code and improving clarity.
- Retained the core functionality of the method while enhancing maintainability by reducing unnecessary dependencies.

* [Reducer] Introduce `alloc_reducer` to separate inter and intra warp reduction (#757)

* [Enhancement] Introduce finalize_reducer operator and layout reducer support

- Added `FinalizeReducer` operator to handle reduction finalization in the TileLang framework, allowing for efficient reduction operations.
- Implemented layout inference for local.reducer buffers, enhancing the handling of layout mappings and reducing complexity in buffer management.
- Updated `setup.py` to include logging for build directory paths, improving build process visibility.
- Enhanced atomic operations with new functions for atomic max, min, load, and store, providing more robust atomicity control in memory operations.
- Refactored parallel loop handling to incorporate reducer information, ensuring proper management of reduction operations in parallel contexts.
- Cleaned up test cases by removing unnecessary cache disabling and optimizing test parameters for better performance.

* Refactor code formatting and improve readability in multiple files

- Cleaned up whitespace in `setup.py` to enhance logging clarity.
- Reformatted `AtomicMax` and `AtomicMin` functions in `common.h` for better alignment and readability.
- Adjusted `debug_print_var` function in `debug.h` to improve code structure and maintainability.
- Enhanced readability of the `atomic_add` function in `customize.py` by breaking long lines for better clarity.

* Remove debug print statements from `copy.cc` and `inject_tma_barrier.cc` to enhance code clarity and maintainability.

* [Enhancement] Disable reuse of small arrays in shared memory allocation

- Added logic to prevent the reuse of small arrays (<= 32 bits) in `merge_shared_memory_allocations.cc`, ensuring they are lowered to registers in LLVM for improved performance and memory management.

* Refactor `setup.py` to remove duplicate logging statements and enhance clarity. Update `finalize_reducer` function documentation in `reduce.py` to include detailed parameter and return descriptions, improving code readability and maintainability.

* Refactor `finalize_reducer` and `reduce` functions to remove redundant target checks. Simplified conditionals by retaining only the `TargetIsHopper` check, enhancing code clarity and maintainability.

* bug fix

* Add thread checks workaround for replicated cases

* Remove the is_one check

* fix lint error

* lint fix

* Update autotune tests to use smaller matrix sizes for improved performance and reliability

* [Refactor] Update FinalizeReducer to FinalizeReducerOp and adjust related methods

- Refactored FinalizeReducer class to FinalizeReducerOp, updating constructor and method signatures for consistency with the new TileOperator structure.
- Enhanced layout inference and cloning methods in FinalizeReducerOpNode.
- Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main.
- Adjusted header inclusions for improved organization and clarity across multiple files.

* [Refactor] Update atomic operations in common.h and modify test_example_flash_attention.py

- Enhanced atomic operations (Add, Min, Max) in common.h to handle half and bfloat16 types more efficiently.
- Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main, improving test organization.

* [Refactor] Simplify CopyNode::LowerBulkCopy logic and update test execution

- Removed redundant checks for contiguous memory access in CopyNode::LowerBulkCopy, streamlining the logic for TMA copy operations.
- Updated test_tilelang_kernel_gemm.py to comment out the main testing function and call a specific test for i8i8i32 tensor operations instead, improving test focus.

---------

Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com>
Co-authored-by: Freebase6912 <amid-gauze-racing@duck.com>

* 📝 Add docstrings to `pytile_0826` (#770)

* 📝 Add docstrings to `pytile_0826`

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/763#issuecomment-3224197814

The following files were modified:

* `src/op/atomic_add.cc`
* `src/op/atomic_add.h`
* `src/op/copy.cc`
* `src/op/copy.h`
* `src/op/elem.cc`
* `src/op/elem.h`
* `src/op/gemm.cc`
* `src/op/gemm.h`
* `src/op/gemm_sp.cc`
* `src/op/gemm_sp.h`
* `src/op/operator.cc`
* `src/op/operator.h`
* `src/op/parallel.cc`
* `src/op/parallel.h`
* `src/op/reduce.cc`
* `src/op/reduce.h`
* `src/op/region.cc`
* `src/op/region.h`
* `src/transform/layout_inference.cc`
* `src/transform/lower_tile_op.cc`

* lint fix

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Bugfix]:Fix atomic add auto vectorize negative optimization (#765)

* [Bugfix]:Fix atomic add auto vectorize negative optimization

* fixbug

* format

* fix bug

* 📝 Add docstrings to `reducer_0825` (#772)

* 📝 Add docstrings to `reducer_0825`

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/757#issuecomment-3219088118

The following files were modified:

* `setup.py`
* `src/op/builtin.h`
* `src/op/finalize_reducer.cc`
* `src/op/finalize_reducer.h`
* `src/op/parallel.cc`
* `src/op/parallel.h`
* `src/op/reduce.cc`
* `src/target/codegen_cuda.cc`
* `src/tl_templates/cuda/common.h`
* `src/transform/layout_inference.cc`
* `src/transform/layout_reducer.cc`
* `src/transform/layout_reducer.h`
* `src/transform/merge_shared_memory_allocations.cc`
* `src/transform/storage_access.cc`
* `src/transform/warp_specialized_rewriter.cc`
* `testing/python/autotune/test_tilelang_autotune_with_inputs.py`
* `tilelang/engine/phase.py`
* `tilelang/language/customize.py`
* `tilelang/language/reduce.py`
* `tilelang/transform/__init__.py`

* lint fix

* lint fix

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* Allow fill global buffer (#774)

* Allow fill global buffer

* fix lint error

* [BugFix] Refactor the op check in LowerTileOp pass using the member function instead of string match (#771)

* [BugFix] Refactor the op check in LowerTileOp pass using the member function instead of string match

* [Lint]

* add bf16 exp fallback (#776)

* [Lint] Introduce clang-tidy into format.sh (#777)

* [Refactor] Update Clang-Tidy Checks and Improve Code Consistency

- Enhanced .clang-tidy configuration by adding specific checks for better bug detection and performance optimization.
- Refactored function signatures across multiple files to use `const` references for parameters, improving performance and code clarity.
- Updated various methods to ensure consistent handling of parameters, particularly in `AddPredicate`, `Substitute`, and `PlanLoopPartition` functions.
- Improved readability by replacing size checks with `empty()` method calls in several locations, ensuring clearer intent in the code.
- General code cleanup and adherence to best practices for better maintainability.

* [Refactor] Enhance Code Consistency and Clang-Tidy Configuration

- Updated .clang-tidy configuration to include additional checks for improved code quality and performance.
- Refactored function signatures across multiple files to use `const` references, enhancing performance and clarity.
- Replaced size checks with `empty()` method calls in various locations for clearer intent.
- Improved handling of parameters in several functions, ensuring consistent usage of `std::move` where applicable.
- General code cleanup to adhere to best practices and improve maintainability.

* [Refactor] Integrate Clang-Tidy Checks and Enhance Code Consistency

- Added clang-tidy checks to the format script for improved code quality assurance.
- Refactored function signatures across multiple files to consistently use `const` references, enhancing performance and clarity.
- Updated the requirements-lint.txt file to include clang-tidy as a dependency.
- General code cleanup to adhere to best practices and improve maintainability.

* [CI] Update AMD CI Workflow to Include Build Directory Creation

- Added steps to create a build directory and configure CMake with ROCm support during the format check process.
- Ensured cleanup of the build directory after the format check to maintain a clean workspace.

* [Refactor] Remove Unused Member Variables in AtomicAddNode and CopyNode

- Removed the `args_` member variable from both `AtomicAddNode` and `CopyNode` classes to streamline the code and eliminate unnecessary data members.
- This change enhances code clarity and maintainability by focusing on relevant attributes for each class.

* [Refactor] Update Clang-Tidy Integration and Code Improvements

- Modified the format script to include the `-fix` option in the clang-tidy command for automatic code fixes.
- Refactored the `AtomicAddVectorizePlanner` class to improve variable handling and consistency, including changes to member variable types and function signatures.
- Enhanced code clarity by removing unnecessary `std::move` calls and ensuring consistent usage of types across the class.
- General code cleanup to adhere to best practices and improve maintainability.

* [Refactor] Improve Parameter Handling and Consistency in AtomicAddVectorize

- Updated function signatures in `AtomicAddVectorizePlanResult` and `AtomicAddVectorizeRewriter` to use `const` references and `std::move` for better performance and clarity.
- Enhanced the `UpdateVectorSize` method to accept `const Array<PrimExpr>&` for improved efficiency.
- General code cleanup to maintain consistency and adhere to best practices.

* [CI] Add Git Submodule Initialization to CI Workflow

- Included a step to initialize and update git submodules recursively in the CI workflow.
- This change ensures that all necessary submodules are available during the format check process, improving build reliability.

* [CI] Add Git Submodule Update Step to Format Check

- Included a command to initialize and update git submodules recursively in the CI workflow during the format check process.
- This enhancement ensures that all required submodules are available, contributing to improved build reliability.

* [Refactor] Update Function Signatures in AtomicAddVectorize

- Modified the `VectorizeAtomicAdd` function signature to use `const` references for `thread_var` and `thread_bounds`, enhancing performance and code clarity.
- This change aligns with previous refactoring efforts to improve parameter handling and consistency across the codebase.

* [Cache] Introduce detailed target information for the disk kernel cache (#780)

* Fix type hint for target_host parameter in compile function to allow None value

* Refactor target handling in compile function to utilize determine_target for improved clarity and consistency

* Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code.
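
For reference, this is what the hexfloat-plus-comment convention looks like in plain C++; a sketch of the formatting idea only, not the actual PrintConst code.

```cpp
#include <cstdio>

int main() {
  float value = 0.0009765625f;  // 2^-10
  // Hexfloat is exact and round-trips; the scientific form stays readable as a comment,
  // mirroring the style of the generated source.
  std::printf("%a /* %e */\n", value, value);
  // prints: 0x1p-10 /* 9.765625e-04 */
  return 0;
}
```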

* Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity.

* Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling.

* Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits.

* [Example] Adds example for top-k operation (#775)

* [Example] Adds example for top-k operation

Adds an example demonstrating the top-k operation using tilelang

* format

* Adds topk tilelang example test

* fix lint

* [Math] Dispatch `T.rsqrt(x)` to the CUDA intrinsic instead of `1 / T.sqrt(x)` (#781)

* Fix type hint for target_host parameter in compile function to allow None value

* Refactor target handling in compile function to utilize determine_target for improved clarity and consistency

* Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code.

* Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity.

* Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling.

* Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits.

* Add intrin_rule source files to CMakeLists.txt and implement hrsqrt function for half_t in common.h
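
A hedged sketch of what a half-precision rsqrt helper of this kind typically reduces to; the real `hrsqrt` added to `common.h` may instead call the native CUDA intrinsic where available, and the `half_t` stand-in below exists only to keep the sketch self-contained.

```cpp
#include <cmath>
#include <cstdio>

// Stand-in for the real half type so the sketch compiles on its own.
struct half_t {
  float v;
  explicit half_t(float f) : v(f) {}
  explicit operator float() const { return v; }
};

// Reciprocal square root: compute in float, narrow back to half. On devices with
// native support this would lower to the hardware instruction instead.
inline half_t hrsqrt(half_t x) {
  return half_t(1.0f / std::sqrt(static_cast<float>(x)));
}

int main() {
  std::printf("%f\n", static_cast<float>(hrsqrt(half_t(4.0f))));  // 0.5
  return 0;
}
```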

* lint fix

* remove the cmake dependency from pyproject, as it may lead to different cmake paths in different build stages

* lint fix

* Add cmake dependency to pyproject.toml and improve build logging in setup.py

* [CI] Adds pytest-durations for test timing (#782)

* [CI] Adds pytest-durations for test timing

Adds `pytest-durations` to the test requirements and configures pytest to display test durations.

This helps in identifying slow-running tests and optimizing the test suite for faster feedback.

* add amd ci durations

* Removes flash_attn installation from CI

* [Refactor] Support python reflection for tile operators (#783)

* Implement Fill operator and related reflection methods in TileLang

- Added Fill operator implementation in `fill.cc` and `fill.h` for element-wise filling of buffers.
- Introduced reflection methods for Fill, AtomicAdd, Copy, Conv2DIm2Col, FinalizeReducer, Gemm, and Parallel operators to enhance introspection capabilities.
- Updated relevant files to register reflection methods and ensure proper initialization in static blocks.
- Removed outdated comments and unnecessary code in various operator files to improve clarity and maintainability.
- Added new Python bindings for the Fill operator in `tilelang/ir/fill.py` and updated the module imports accordingly.

* Refactor operator reflection methods and improve code clarity

- Updated reflection methods for AtomicAdd, Copy, FinalizeReducer, Gemm, and Parallel operators to enhance readability by using `empty()` instead of size checks.
- Consolidated static initialization blocks for various operators to a single line for improved consistency.
- Cleaned up whitespace and formatting in multiple files to adhere to coding standards and improve maintainability.
- Added new Python bindings for operators in the `tilelang/ir` module, ensuring proper registration and organization of imports.

* Refactor GEMM and AtomicAdd operations for improved clarity

- Updated the `GetArchInt` function in `atomic_add.cc` to use `std::string` and `std::stoi` for better readability and type safety (a sketch of the idea follows this list).
- Removed unnecessary variables and comments in `gemm_sp.cc` and `gemm.cc` to streamline the `ComputeWarpPartition` method.
- Cleaned up the `layout_reducer.cc` file by removing unused variable declarations, enhancing code clarity.
- Added import for the `ir` module in `tilelang/__init__.py` to ensure proper organization of module imports.
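
A minimal sketch of a `std::string`/`std::stoi` helper in this spirit; the actual `GetArchInt` in `atomic_add.cc` may parse the target attribute differently, and the function name below is only a placeholder.

```cpp
#include <cstdio>
#include <string>

// Extract the numeric part of an arch tag, e.g. "sm_90" -> 90, "gfx942" -> 942.
int GetArchIntSketch(const std::string& arch) {
  const auto pos = arch.find_first_of("0123456789");
  if (pos == std::string::npos) return 0;
  return std::stoi(arch.substr(pos));
}

int main() {
  std::printf("%d %d\n", GetArchIntSketch("sm_90"), GetArchIntSketch("gfx942"));
  return 0;
}
```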

* Remove deprecated operator files from the tilelang IR module

- Deleted files for Fill, AtomicAdd, Copy, Gemm, GemmSP, FinalizeReducer, Parallel, Reduce, and Region operators to streamline the codebase.
- This cleanup enhances maintainability by removing unused code and improving overall organization of the module.

* Refactor imports in tilelang IR module for improved organization

- Updated import statements in `tilelang/ir.py` to reflect changes in the TVM library structure, enhancing clarity and maintainability of the codebase.

* lint fix

* Refactor GEMM and GEMM-SP operations to enhance clarity and maintainability

- Updated the `Gemm` and `GemmSP` classes to utilize a new `GemmWarpPolicy` object for warp partitioning, improving encapsulation and readability.
- Removed deprecated `ComputeWarpPartition` methods and replaced them with calls to the new policy object, streamlining the code.
- Cleaned up comments and unnecessary code in `gemm.cc`, `gemm_sp.cc`, and related header files to enhance overall clarity.
- Introduced a new `GemmWarpPolicyNode` class to manage warp policy attributes and methods, facilitating better organization of related functionalities.
- Updated reflection methods to include the new policy structure, ensuring proper registration and introspection capabilities.

* Refactor Reduce operation to utilize ReduceType class for improved clarity and maintainability

- Replaced multiple conditional checks for reduce types with a single ReduceType object, simplifying the code structure (see the sketch after this list).
- Introduced a new ReduceTypeNode class to encapsulate reduce type logic and methods, enhancing organization.
- Updated MakeInitValue, MakeReduce, and Lower methods to leverage the new ReduceType class, improving readability.
- Added Python bindings for the ReduceType class in tilelang IR module to ensure proper registration and usability.
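
The general shape of such a reduce-type object, as a standalone sketch: this is not the TileLang `ReduceType`/`ReduceTypeNode` code, just the idea of pairing the identity element with the combine rule behind `MakeInitValue` and `MakeReduce`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <initializer_list>
#include <limits>

struct ReduceTypeSketch {
  enum Kind { kSum, kMax, kAbsSum, kAbsMax } kind;

  // Identity element, i.e. the role MakeInitValue plays.
  float Init() const {
    return kind == kMax ? -std::numeric_limits<float>::infinity() : 0.0f;
  }

  // Combine rule, i.e. the role MakeReduce plays for two operands.
  float Combine(float lhs, float rhs) const {
    switch (kind) {
      case kSum:    return lhs + rhs;
      case kMax:    return std::max(lhs, rhs);
      case kAbsSum: return lhs + std::fabs(rhs);
      case kAbsMax: return std::max(lhs, std::fabs(rhs));
    }
    return lhs;
  }
};

int main() {
  ReduceTypeSketch absmax{ReduceTypeSketch::kAbsMax};
  float acc = absmax.Init();
  for (float x : {1.5f, -3.0f, 2.0f}) acc = absmax.Combine(acc, x);
  std::printf("%f\n", acc);  // 3.0
  return 0;
}
```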

* comment

* Refactor operator header files for improved readability

- Cleaned up formatting and whitespace in `atomic_add.h`, `copy.h`, `fill.h`, `reduce.cc`, and `reduce.h` to enhance code clarity.
- Consolidated comments and adjusted line breaks for better organization and maintainability across multiple operator definitions.

* Refactor MakeReduce method in ReduceOpNode for clarity

- Updated the parameter name in the MakeReduce method from `rhs` to `b` and assigned it to `rhs` for improved readability.
- This change enhances the clarity of the method's purpose and aligns with the overall refactoring efforts in the Reduce operation.

* Update Reduce operation type checks for consistency

- Changed string comparisons for reduce types in the MakeReduce method from "abs_sum" to "abssum" and "abs_max" to "absmax" for uniformity.
- This adjustment enhances the clarity and consistency of the reduce type handling in the codebase.

* [AMD] Fix AMD TIR & add examples (#784)

* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection (see the sketch after this list).
- Updated related documentation to reflect changes in buffer management practices.
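
A simplified picture of the overlap test this kind of conflict detection reduces to; the real pass reasons about symbolic index ranges rather than plain integers.

```cpp
#include <cstdint>
#include <cstdio>

// Two half-open index ranges [a_lo, a_hi) and [b_lo, b_hi) conflict exactly
// when they intersect; disjoint ranges never alias.
bool RangesOverlap(int64_t a_lo, int64_t a_hi, int64_t b_lo, int64_t b_hi) {
  return a_lo < b_hi && b_lo < a_hi;
}

int main() {
  std::printf("%d %d\n", RangesOverlap(0, 128, 64, 192), RangesOverlap(0, 128, 128, 256));
  // prints: 1 0
  return 0;
}
```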

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

* Add input configuration file and update AMD example script for enhanced flexibility

- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.

* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.

* Refactor AMD example script for FlashAttention-2

- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.

* Refactor formatting in AMD FlashAttention example script

- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.

* Update example_amd_flash_attn_fwd.py

* Enhance AMD example script and update CI workflows

- Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
- Added new CI workflows for AMD and documentation publishing.
- Updated various requirements files to include necessary dependencies.
- Introduced new test cases and examples for better coverage and functionality.
- Refactored existing code for improved readability and maintainability.

* Remove redundant tool cache cleanup step in AMD CI workflow

* Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.

* Add new AMD FlashAttention example and test script

- Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
- Added `test.sh` script to facilitate running the new example with specified parameters.
- Enhanced the overall structure and organization of the example for better clarity and usability.

* Update configurations in `example_amd_flash_attn_fwd.py` for autotuner

- Reduced the number of threads and `num_split_q` options for improved performance.
- Adjusted `panel_size` options to streamline configuration settings.

* Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217

* Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c

* Add example for AMD Flash Attention backward pass implementation

- Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
- Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
- Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
- Included reference implementation for validation against PyTorch's attention mechanism.

This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.

* Enhance AMD Flash Attention example with additional testing capabilities

- Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
- Improved the main function to allow for better parameter configuration and benchmarking.
- Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.

This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.

* Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a

* Refactor HIP intrinsic rules to CUDA

- Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
- Adjusted include paths for better organization and clarity in the code structure.

* Update AMD CI workflow to uninstall specific PyTorch packages before installation

- Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
- Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.

* Remove unused shared memory allocations in AMD Flash Attention backward example

- Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
- This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.

* Remove unnecessary pip uninstall command from AMD CI workflow

- Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
- This change simplifies the CI process and reduces potential overhead during package management.

* Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules

- Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
- Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.

* Refactor formatting of HIP intrinsic rule registrations

- Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
- No functional changes were made; this update focuses on code style improvements to enhance maintainability.

* Update file na…