Conversation


@LeiWang1999 LeiWang1999 commented Aug 25, 2025

Summary by CodeRabbit

  • New Features

    • Added reducer finalization API and an automatic reducer-layout pass (finalize_reducer).
    • Added atomic_max, atomic_min, atomic_load, atomic_store; atomic_add accepts optional memory_order.
  • Refactor

    • Unified CUDA device atomics into a memory-order-aware API.
    • Added a faster 1D bulk-copy path and improved vectorization/layout inference.
    • Simplified CUDA grid synchronization emission.
  • Tests

    • Reduced autotune test matrix sizes for faster CI.
  • Chores

    • Removed a stray debug log.

Huanqi Cao and others added 2 commits August 25, 2025 15:02
…support

- Added `FinalizeReducer` operator to handle reduction finalization in the TileLang framework, allowing for efficient reduction operations.
- Implemented layout inference for local.reducer buffers, enhancing the handling of layout mappings and reducing complexity in buffer management.
- Updated `setup.py` to include logging for build directory paths, improving build process visibility.
- Enhanced atomic operations with new functions for atomic max, min, load, and store, providing more robust atomicity control in memory operations.
- Refactored parallel loop handling to incorporate reducer information, ensuring proper management of reduction operations in parallel contexts.
- Cleaned up test cases by removing unnecessary cache disabling and optimizing test parameters for better performance.
- Cleaned up whitespace in `setup.py` to enhance logging clarity.
- Reformatted `AtomicMax` and `AtomicMin` functions in `common.h` for better alignment and readability.
- Adjusted `debug_print_var` function in `debug.h` to improve code structure and maintainability.
- Enhanced readability of the `atomic_add` function in `customize.py` by breaking long lines for better clarity.

coderabbitai bot commented Aug 25, 2025

Walkthrough

Adds a FinalizeReducer TL op and LayoutReducer transform; flows reducer metadata into layout inference and lowering; introduces memory-order aware CUDA atomics and new atomic ops; adds a contiguous 1D TMA copy path; Hopper AllReduce codegen changes; small CUDA, debug, and test adjustments.
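
The memory-order-aware atomics mentioned here are layered on libcu++'s cuda::atomic_ref. As a rough illustration only (not the PR's actual AtomicAdd/AtomicMax wrappers; kernel and variable names are invented), the pattern looks like this, noting that fetch_min/fetch_max are libcu++ extensions:

#include <cuda/atomic>

// Illustrative kernel: device-scope atomic add and max with explicit memory orders.
__global__ void accumulate(float* sum, int* max_bin, const float* x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    cuda::atomic_ref<float, cuda::thread_scope_device> s(*sum);
    s.fetch_add(x[i], cuda::std::memory_order_relaxed);  // relaxed is enough for a pure sum

    cuda::atomic_ref<int, cuda::thread_scope_device> m(*max_bin);
    m.fetch_max(static_cast<int>(x[i]), cuda::std::memory_order_relaxed);
  }
}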

Changes

Cohort / File(s) Summary
Finalize reducer TL op
src/op/finalize_reducer.h, src/op/finalize_reducer.cc, tilelang/language/reduce.py, tilelang/language/__init__.py
Add FinalizeReducer TileOperator, lowering that performs thread AllReduce (Hopper-specific run_hopper path), registration, and Python API finalize_reducer export.
Reducer layout transform & integration
src/transform/layout_reducer.h, src/transform/layout_reducer.cc, tilelang/transform/__init__.py, tilelang/engine/phase.py
Add ReducerInfo types, LayoutReducer transform pass (annotates reducer fragment/replication layouts), FFI wrapper, and insert pass into lowering pipeline.
Layout inference & parallel op updates
src/op/parallel.h, src/op/parallel.cc, src/transform/layout_inference.cc
Propagate reducer_info into Parallel visitor; avoid vectorization when reducers present; skip ALL-replicated buffers during layout inference; adjust vectorization/thread alignment checks and replication compatibility checks.
Bulk copy / TMA lowering
src/op/copy.cc, src/op/builtin.h
Add early 1D TMA path for contiguous, size-equal shared/global ranges; compute contiguity/element/OOB guards; pass eviction_policy to tma_load/tma_store; declare ptx_fence_barrier_init().
Hopper AllReduce & reduce codegen
src/op/reduce.cc
Use TargetIsHopper(T.target) and emit run_hopper with total-thread template arg for Hopper inter-thread AllReduce codegen.
CUDA atomic API & Python bindings
src/tl_templates/cuda/common.h, tilelang/language/customize.py, tilelang/language/__init__.py
Replace ad-hoc atomics with cuda::atomic_ref-based, type-normalized API; add AtomicLoad/Store/Min/Max and memory-order parameter; update AtomicAdd signature; expose atomic_max/min/load/store in Python.
CUDA codegen & debug tweaks
src/target/codegen_cuda.cc, src/tl_templates/cuda/debug.h, src/tl_templates/cuda/gemm_sm90.h
Emit inline cooperative_groups::this_grid().sync(); add device pointer debug_print_var overload; adjust includes (use gemm_mma.h, remove some Cute headers).
Storage/rewriter small edits
src/transform/storage_access.cc, src/transform/warp_specialized_rewriter.cc, src/transform/merge_shared_memory_allocations.cc
Use analyzer-simplified negation for IfThenElse else-constraints; expose WSCodeEmitter::onlyHasWgMMA(); minor whitespace edit.
Python minor changes & tests
tilelang/language/allocate.py, testing/python/autotune/*, tilelang/transform/__init__.py
Add tvm import in allocate, reduce autotune test sizes, add LayoutReducer Python wrapper.
Build script cleanup
setup.py
Remove debug logger line printing extdir.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Kernel
  participant TL as tl.finalize_reducer
  participant AR as ThreadAllReduce
  participant Buf as ReducerBuffer

  Kernel->>TL: finalize_reducer(reducer_buf)
  TL->>Buf: BufferLoad per-output partials
  alt TargetIsHopper
    TL->>AR: AllReduce::run_hopper<op, all_threads>(partials)
  else
    TL->>AR: AllReduce::run<op>(partials)
  end
  AR-->>TL: reduced result
  TL->>Buf: BufferStore final value
  TL-->>Kernel: return
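
For intuition, the AllReduce::run/run_hopper call in the diagram boils down to a butterfly-style inter-thread reduction. A generic warp-level sketch of that idea (purely conceptual, not the tl::AllReduce template itself):

// Every lane of the warp ends up holding the reduced value.
template <typename T, typename ReduceOp>
__device__ T warp_all_reduce(T val, ReduceOp op) {
  for (int offset = 16; offset > 0; offset >>= 1) {
    val = op(val, __shfl_xor_sync(0xffffffff, val, offset));
  }
  return val;
}

struct SumOp {
  __device__ float operator()(float a, float b) const { return a + b; }
};

// Example: warp-wide sum; larger thread groups layer a shared-memory stage on top.
__device__ float warp_all_reduce_sum(float v) { return warp_all_reduce(v, SumOp{}); }
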
sequenceDiagram
  autonumber
  participant Driver as LoweringPipeline
  participant Mod as IRModule
  participant LR as LayoutReducer
  participant LI as LayoutInference

  Driver->>Mod: Simplify(mod)
  Driver->>LR: LayoutReducer()(mod)
  LR->>Mod: annotate reducer layouts/fragments
  Driver->>LI: LayoutInference(mod)
  LI->>Mod: infer layouts (skip ALL-replicated buffers)
  LI-->>Driver: updated module
sequenceDiagram
  autonumber
  participant Lower as Copy::LowerBulkCopy
  participant Check as Contiguity/Size/OOB
  participant TMA as TMA_1D_Copy

  Lower->>Check: are shared/global contiguous and sizes equal?
  alt yes & in-bounds
    Lower->>TMA: emit tma_load/tma_store (1D) with eviction_policy
    TMA-->>Lower: early return
  else
    Lower-->>Lower: fallback to 2D descriptor or scalar paths
  end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

I thump and nibble through the code,
Threads converge in tidy mode.
Fragments stitch and atomics sing,
1D copies hop — hooray, they spring!
A rabbit cheers: reducers done, carrots compiled with glee. 🐇🥕



@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @LeiWang1999, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new finalize_reducer operator and a LayoutReducer pass to better manage inter- and intra-warp reduction operations. It refines atomic operations using cuda::atomic_ref, optimizes TMA 1D copies, and integrates reducer information into layout inference to prevent undesirable loop vectorization. These changes aim to improve the efficiency and correctness of reduction kernels, especially on modern GPU architectures.

Highlights

  • New finalize_reducer Operator: A dedicated operator and pass (LayoutReducer) are introduced to handle the finalization of reduction operations, distinguishing between inter-warp and intra-warp reductions.
  • Enhanced Atomic Operations: Atomic min, max, load, and store functions are added, and existing atomic_add is updated to leverage cuda::atomic_ref for improved robustness and memory ordering control.
  • TMA 1D Optimization: Logic is added to utilize 1D TMA for contiguous shared and global memory transfers, improving data movement efficiency.
  • Reducer-Aware Layout Inference: The system now recognizes reduction operations during layout inference, allowing for more intelligent loop vectorization decisions to avoid conflicts.
  • CUDA Architecture Specific Optimizations: Updates to all-reduce implementations and sync_grid calls specifically target Hopper (Sm90) and Sm100 architectures for better performance.


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @LeiWang1999, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new finalize_reducer operator and associated compilation passes to precisely manage reduction operations, differentiating between inter-warp and intra-warp reductions. This enables more optimized code generation for various reduction types (sum, max, min) by providing explicit control over their memory layouts and atomic behavior. It also refines Tensor Memory Access (TMA) copy optimizations and updates atomic operations to use modern CUDA C++ atomics, enhancing performance and correctness.

Highlights

  • New Reduction Abstraction: Introduces a finalize_reducer operator and a LayoutReducer pass to explicitly manage reduction operations, allowing for fine-grained control over inter-warp and intra-warp reductions.
  • Modern Atomic Operations: Refactors atomic primitives (AtomicAdd, AtomicMax, AtomicMin, AtomicLoad, AtomicStore) to utilize cuda::atomic_ref, aligning with modern CUDA C++ standards for improved correctness and memory ordering.
  • Optimized TMA Data Transfers: Enhances Tensor Memory Access (TMA) copy logic to detect and optimize 1D contiguous data transfers, potentially improving memory bandwidth utilization.
  • Intelligent Layout Inference: Updates the layout inference mechanism to incorporate reducer information, preventing suboptimal vectorization for reduction loops and ensuring efficient memory access patterns.
  • Target-Specific Code Generation: Improves tl::AllReduce code generation by using robust target checks (Hopper/SM100) and simplifies grid synchronization for CUDA.


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @LeiWang1999, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new alloc_reducer mechanism to better manage inter-warp and intra-warp reduction operations, enhancing the compiler's ability to optimize these critical computations. It includes a new LayoutReducer pass to handle specific layout annotations for reducers and refactors atomic operations to support memory ordering.

Highlights

  • Enhanced Reduction Management: A new finalize_reducer operator and ReducerInfo structure are introduced to provide fine-grained control over reduction layouts, distinguishing between "all" and "none" replication types.
  • Optimized Data Movement: Logic is added to leverage 1D TMA (Tensor Memory Accelerator) for efficient bulk data copies between shared and global memory when regions are contiguous.
  • Advanced Atomic Operations: Atomic max, min, load, and store operations are added, and existing add operations are refactored to utilize cuda::atomic_ref with explicit memory ordering, improving concurrency control and correctness.
  • Target-Specific Performance: Reduction code generation is now specialized for Hopper (SM90) and SM100 architectures, enabling better performance on newer NVIDIA GPUs.
  • Compiler Pass Integration: A new LayoutReducer pass is integrated into the compilation pipeline to automatically infer and apply appropriate memory layouts for reducer buffers.

…cc` to enhance code clarity and maintainability.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant refactoring around reducers by adding an alloc_reducer to differentiate between inter- and intra-warp reductions. This includes a new LayoutReducer pass and a finalize_reducer operator. Additionally, it enhances TMA copy support by adding a 1D path and modernizes CUDA atomics by using cuda::atomic_ref. The changes are well-structured, but there are a few issues that need to be addressed before merging. These include a critical compilation error from a duplicate function, a few copy-paste errors, and some leftover debugging statements. There is also a TODO comment that might indicate incomplete functionality.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new alloc_reducer to separate inter- and intra-warp reduction, which is a significant refactoring of the reduction logic. Key changes include a new LayoutReducer pass, a finalize_reducer operator, and support for 1D TMA copies as an optimization. The atomic operations are also modernized to use cuda::atomic_ref. The changes are well-structured and improve the codebase. My review includes feedback on removing leftover debug prints, fixing a duplicated function, and correcting a copy-pasted include guard.

- Added logic to prevent the reuse of small arrays (<= 32 bits) in `merge_shared_memory_allocations.cc`, ensuring they are lowered to registers in LLVM for improved performance and memory management.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new alloc_reducer to separate inter- and intra-warp reduction, and refactors atomic operations to use cuda::atomic_ref. The changes are generally good and improve the reduction logic and atomics implementation. However, there are several issues that need to be addressed, including leftover debug code, a critical copy-paste error that results in a duplicate function definition, an incorrect include guard, and a bug in the new atomic operation implementations. I've provided detailed comments and suggestions for each issue.

…e clarity. Update `finalize_reducer` function documentation in `reduce.py` to include detailed parameter and return descriptions, improving code readability and maintainability.
…t target checks. Simplified conditionals by retaining only the `TargetIsHopper` check, enhancing code clarity and maintainability.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 16

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/tl_templates/cuda/gemm_sm90.h (1)

371-373: Critical type alias bug: B_type falls back to A_type_raw (should be B_type_raw).

This makes B_type incorrectly mirror A_type when B_type_raw != float, which can select the wrong OperandTraits, layouts, and MMA instructions.

Apply this fix:

   using B_type =
-      typename std::conditional<std::is_same<B_type_raw, float>::value,
-                                tfloat32_t, A_type_raw>::type;
+      typename std::conditional<std::is_same<B_type_raw, float>::value,
+                                tfloat32_t, B_type_raw>::type;

Optionally add a guard to catch future regressions:

+  static_assert(std::is_same<B_type, B_type_raw>::value ||
+                std::is_same<B_type, tfloat32_t>::value,
+                "B_type must be B_type_raw or tfloat32_t");
tilelang/language/customize.py (2)

129-159: atomic_add: memory_order is ignored on the region path

When extents are deduced (most common path), you switch to tl.atomicadd and silently drop memory_order. This makes the API behavior inconsistent.

If tl.atomicadd supports memory orders, thread the argument through; otherwise, document that memory_order only applies to scalar/undeduced paths and raise if specified in the region path to avoid surprising users.

-    value = _to_region(value, "r")
-    dst = _to_region(dst, "w")
-    return T.call_intrin("handle", op.Op.get("tl.atomicadd"), value, dst)
+    value = _to_region(value, "r")
+    dst = _to_region(dst, "w")
+    mo = _mo(memory_order)
+    if mo is not None:
+        # If tl.atomicadd can’t take a memory order, prefer failing fast:
+        raise ValueError("memory_order is not supported for region-based tl.atomicadd")
+    return T.call_intrin("handle", op.Op.get("tl.atomicadd"), value, dst)

If tl.atomicadd can be extended to accept a memory order, I can propose the corresponding TIR op/plumbing changes.


286-299: atomic_store: address-of and memory_order validation

Same fix as atomic_load. Also clarify docstring (“store operation” not “load operation”).

-    """Stores a value to the input buffer with specified memory_order.
+    """Stores a value to the input buffer with specified memory_order.
...
-        memory_order (str, optional): Atomicity level for the load operation. Defaults to "seq_cst".
+        memory_order (str, optional): Atomicity level for the store operation. Defaults to "seq_cst".
...
-    return T.call_extern("handle", "AtomicStore", T.address_of(dst), src,
-                         _MEMORY_ORDER_ID_MAP[memory_order])
+    idx0 = [0] * len(dst.shape)
+    addr = T.address_of(T.BufferLoad(dst, idx0))
+    return T.call_extern("handle", "AtomicStore", addr, src, _mo(memory_order))
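
For reference, the device side of these calls can be expressed directly on cuda::atomic_ref; the sketch below mirrors the AtomicLoad/AtomicStore names but its signatures are assumptions, not the definitions in common.h:

#include <cuda/atomic>

// Hypothetical load/store wrappers over cuda::atomic_ref (device scope).
template <typename T>
__device__ T AtomicLoadSketch(T* ptr,
                              cuda::std::memory_order order = cuda::std::memory_order_seq_cst) {
  cuda::atomic_ref<T, cuda::thread_scope_device> ref(*ptr);
  return ref.load(order);
}

template <typename T>
__device__ void AtomicStoreSketch(T* ptr, T value,
                                  cuda::std::memory_order order = cuda::std::memory_order_seq_cst) {
  cuda::atomic_ref<T, cuda::thread_scope_device> ref(*ptr);
  ref.store(value, order);
}
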
🧹 Nitpick comments (33)
src/tl_templates/cuda/debug.h (2)

16-19: Use unsigned format for CUDA dim3 components

blockIdx.* and threadIdx.* are unsigned. Use %u for correctness and to silence potential format warnings. Purely cosmetic for debug output.

-      "msg='%s' BlockIdx=(%d, %d, %d), ThreadIdx=(%d, %d, %d): dtype=pointer "
+      "msg='%s' BlockIdx=(%u, %u, %u), ThreadIdx=(%u, %u, %u): dtype=pointer "
-      msg, blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y,
+      msg, blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y,
       threadIdx.z, /* cast addressed in a separate comment */

13-20: Be explicit about C-strings: pointer overload prints addresses, not contents

Our search for direct calls passing string literals or address-of expressions to the new pointer overload returned no matches, suggesting there are no obvious cases of unintended address printing in the current codebase. However, simple grep patterns can’t catch all scenarios where a char* variable (not a literal) is passed. To avoid surprises:

• If you ever call debug_print_var("msg", some_char_ptr), you’ll see the pointer value, not the string contents.
• If you want to print C-string contents instead, you can add this optional overload:

// Optional: print C-string contents instead of the pointer address.
__device__ inline void debug_print_var(const char* msg, const char* s) {
  printf(
    "msg='%s' BlockIdx=(%u, %u, %u), ThreadIdx=(%u, %u, %u): dtype=cstr value=%s\n",
    msg,
    blockIdx.x, blockIdx.y, blockIdx.z,
    threadIdx.x, threadIdx.y, threadIdx.z,
    s ? s : "(null)"
  );
}

• Otherwise, consider documenting that the pointer overload prints addresses for all T*, including char*.

No immediate code changes are required, but keep this in mind for future char* debug prints.

src/tl_templates/cuda/gemm_sm90.h (2)

13-13: Avoid broad using-directive for SM90 in headers.

Using-directives at namespace scope increase collision risk with SM80/SM75 tags used later. Prefer targeted using-declarations (only what you need), or qualify at call sites.

For example:

-using namespace SM90;
+using SM90::GMMA;
+// add specific aliases as needed, or qualify at call sites

17-17: Limit detail namespace exposure; import only what you use.

Instead of pulling in the entire cutlass::gemm::collective::detail namespace, import just ss_smem_selector to keep symbol pollution down.

-using namespace cutlass::gemm::collective::detail; // ss_smem_selector
+using cutlass::gemm::collective::detail::ss_smem_selector;
src/tl_templates/cuda/common.h (3)

7-7: Guard <cuda/atomic> include for NVRTC/older toolchains

cuda::atomic_ref requires libcu++ availability; NVRTC and some older CUDA toolchains won’t ship <cuda/atomic>. Add a guard to avoid hard failures when compiling device code via NVRTC or with older SDKs.

Apply this diff:

-#include <cuda/atomic>
+#ifndef __CUDACC_RTC__
+#  if __has_include(<cuda/atomic>)
+#    include <cuda/atomic>
+#  else
+#    error "libcu++ <cuda/atomic> is not available on this toolchain. Provide a fallback or gate features."
+#  endif
+#endif

If you prefer a soft fallback instead of #error, define a feature flag (e.g., TL_HAS_CUDA_ATOMIC 0/1) and conditionalize the new atomic wrappers on it. Also verify your minimum CUDA version in CI. Would you like me to wire this up across the repo?


190-201: Nit: comments mislabel float vector atomics

The comments say “FLOAT16x2/x4” while the code clearly uses float2/float4. Update the comment to avoid confusion.

-// AtomicAdd Functions for FLOAT16x2
+// AtomicAdd Functions for float2
 ...
-// AtomicAdd Functions for FLOAT16x4
+// AtomicAdd Functions for float4

140-147: Header prerequisites for FP16/BF16 intrinsics

__float2half, __float2bfloat16, and __bfloat162float require <cuda_fp16.h> / <cuda_bf16.h>. They might be transitively included via CUTLASS today; consider explicit includes to reduce fragility.

If missing, add near the top (guarded like <cuda/atomic>):

+#ifndef __CUDACC_RTC__
+#  include <cuda_fp16.h>
+#  if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
+#    include <cuda_bf16.h>
+#  endif
+#endif
tilelang/language/allocate.py (1)

17-17: Remove unused import alias to avoid linter noise and confusion

from tilelang import tvm as tvm is not used anywhere in this module. Keeping it may trigger unused-import warnings and confuses the source of T (which comes from tvm.script). Recommend removing it.

-from tilelang import tvm as tvm
src/op/copy.cc (2)

787-787: Prefer explicit const Op& for consistency and to avoid an extra handle copy

Elsewhere in this file you bind intrinsics as const Op&. Using auto here copies the handle (minor overhead), and is inconsistent with the style below (e.g., LDSM/STSM). Suggest making it a reference.

-  auto op = is_load ? tma_load() : tma_store();
+  const Op &op = is_load ? tma_load() : tma_store();

789-861: 1D TMA fast-path looks good; consider minor robustness tweaks

The contiguity checks and byte-count math are sound, and the branch is gated to a single thread for correctness. Two small refinements would improve robustness/readability:

  • Simplify global_offset via the analyzer, mirroring what you do for shared_total_elements.
  • (Optional) Extract the contiguity predicate into a helper to keep LowerBulkCopy lean.
       PrimExpr global_addr =
           global_tensor.access_ptr(is_load ? 1 : 2, DataType::Handle(), 1,
-                                   global_offset, global_total_elements);
+                                   analyzer->Simplify(global_offset), global_total_elements);

If you like, I can follow up with a small helper like IsContiguousRegion(const Buffer&, const Array<Range>&, arith::Analyzer*) and wire it in here.
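
For concreteness, a rough sketch of the IsContiguousRegion helper floated above, assuming row-major buffers and TVM's arith::Analyzer API (names and the exact predicate LowerBulkCopy needs are assumptions):

#include <tvm/arith/analyzer.h>
#include <tvm/tir/buffer.h>

using tvm::Array;
using tvm::PrimExpr;
using tvm::Range;
using tvm::tir::Buffer;

// A region is contiguous (row-major) if all dims outside the first partially
// covered one have extent 1 and all dims inside it span the full buffer extent.
bool IsContiguousRegion(const Buffer &buffer, const Array<Range> &region,
                        tvm::arith::Analyzer *analyzer) {
  ICHECK_EQ(buffer->shape.size(), region.size());
  int ndim = static_cast<int>(region.size());
  int first_partial = ndim;  // first dim (from outer) whose extent may exceed 1
  for (int i = 0; i < ndim; ++i) {
    if (!analyzer->CanProveEqual(region[i]->extent, 1)) {
      first_partial = i;
      break;
    }
  }
  for (int i = first_partial + 1; i < ndim; ++i) {
    if (!analyzer->CanProveEqual(region[i]->extent, buffer->shape[i]))
      return false;
  }
  return true;
}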

src/op/parallel.h (1)

13-13: Header-layering: including a transform header from an op header risks tight coupling

Bringing ../transform/layout_reducer.h into parallel.h makes the op layer depend on a transform header, which can increase rebuild surface area and the chance of include cycles. If feasible, consider extracting only the lightweight type declarations (e.g., ReducerInfo, enums, and kReducerInfo) into a small shared header (e.g., ir/reducer_info.h) that both op and transform can include. Otherwise, please confirm there’s no cycle between layout_reducer.h and any headers included here.

tilelang/transform/__init__.py (1)

424-427: Wrapper LGTM; consider a slightly richer docstring for consistency

The new LayoutReducer() wrapper looks correct. For consistency with neighboring wrappers, consider adding a short “Returns” section describing the pass type.

-def LayoutReducer():
-    """LayoutReducer
-    """
+def LayoutReducer():
+    """LayoutReducer
+
+    Returns
+    -------
+    fpass : tvm.transform.Pass
+        Annotate reducer layout metadata prior to LayoutInference.
+    """
     return _ffi_api.LayoutReducer()  # type: ignore
tilelang/language/reduce.py (1)

190-203: Public API finalize_reducer: Add return annotation and optional scope validation

The finalize_reducer wiring and registration are correct (see registration at src/op/finalize_reducer.cc:81). Two small, non-breaking improvements can make this API clearer and safer:

• Add an explicit return type to help users and type-checkers.
• Optionally validate that the passed-in buffer is in a writable reducer scope (or document this requirement) to catch misuse early.

Please update tilelang/language/reduce.py as follows:

-def finalize_reducer(reducer: tir.Buffer):
-    """Finalize the reducer buffer.
+def finalize_reducer(reducer: tir.Buffer) -> tir.PrimExpr:
+    """Finalize the reducer buffer (emit a reducer finalization intrinsic).
@@
-    return tir.call_intrin(
+    # Optional: enforce that this buffer is a writable reducer fragment:
+    # if reducer.scope() not in ("local.fragment",):
+    #     raise ValueError(f"Unsupported reducer scope: {reducer.scope()}")
+    return tir.call_intrin(
         "handle",
         tir.op.Op.get("tl.finalize_reducer"),
         reducer.access_ptr("w"),
     )
  • File: tilelang/language/reduce.py
  • Function definition around line 190
  • Registration confirmed at src/op/finalize_reducer.cc:81
src/op/reduce.cc (1)

229-239: Verified: Single Definition of TargetIsHopper Found; Optional Constexpr Guard Recommended

I confirmed that TargetIsHopper is declared only once in src/target/utils.h and defined once in src/target/utils.cc, so there are no duplicate definitions to worry about.

The original suggestion to enforce compile-time constants for thread_offset and all_threads remains a helpful optional refactor to increase robustness. Consider adding checks similar to your existing as_const_int assertions:

       auto thread_offset = T.thread_bounds->min;
       if (TargetIsHopper(T.target)) {
-        auto all_threads = T.thread_bounds->extent;
+        auto all_threads = T.thread_bounds->extent;
+        // Ensure these template args are compile-time constants
+        auto thread_offset_ci = as_const_int(thread_offset);
+        auto all_threads_ci  = as_const_int(all_threads);
+        ICHECK(thread_offset_ci && all_threads_ci)
+            << "AllReduce template parameters must be compile-time constants";
         ss << "tl::AllReduce<" << this->MakeCodegenReducer() << ", "
            << reducing_threads << ", " << (*scale) << ", "
-           << thread_offset << ", " << all_threads << ">::run_hopper";
+           << (*thread_offset_ci) << ", " << (*all_threads_ci)
+           << ">::run_hopper";
       } else {
         ss << "tl::AllReduce<" << this->MakeCodegenReducer() << ", "
            << reducing_threads << ", " << (*scale) << ", " << thread_offset
            << ">::run";
       }

Key points:

  • No duplicate TargetIsHopper definitions were found.
  • This change is optional but will help catch non-constant inputs at codegen time.
src/transform/warp_specialized_rewriter.cc (1)

324-334: Correct barrier slotting for 1D vs. descriptor-based TMA loads

Good catch differentiating 1D cp.async.bulk (barrier at args[2]) from descriptor-based tensor TMA (barrier at args[1]). The guard using op->args[0] against create_tma_descriptor() aligns with inject_tma_barrier.cc and the CUDA overloads. Consider adding a defensive size check (ICHECK_GE(call->args.size(), 3)) before args.Set(2, ...) for 1D to harden against malformed calls.

src/transform/layout_inference.cc (2)

568-573: Safe extraction of reducer_info from annotations

The pattern using count() followed by Get(...)->as<...>().value() is safe and clear. If you anticipate mixed annotation contents, you could guard the as<...>() result, but not strictly necessary.


618-629: Vectorization gating when reducers are present

Reasonable workaround to avoid vectorizing loops that write to reducer buffers. Longer-term, isolating the reduction axis before vectorization would enable more kernel shapes; until then, this prevents subtle correctness issues.

If helpful, I can prototype a follow-up pass to split out reduction axes pre-vectorization; want me to open an issue for that?

src/op/finalize_reducer.h (2)

1-39: Header guard trailing comment mismatch

The trailing comment on the #endif does not match the defined guard. Harmless, but worth fixing for consistency.

Apply:

-#endif //  TVM_TL_OP_REDUCE_H_
+#endif // TVM_TL_OP_FINALIZE_REDUCER_H_

18-18: Avoid using namespace in headers

Headers should not introduce wide using directives. Prefer qualifying or narrow using-declarations.

Apply:

-using namespace tir;
+// Avoid namespace pollution from headers

And minimally qualify types used in this header:

-  FinalizeReducer(Array<PrimExpr> args, BufferMap vmap);
-  Stmt Lower(const LowerArgs &T, arith::Analyzer *analyzer) const final;
+  FinalizeReducer(Array<tir::PrimExpr> args, BufferMap vmap);
+  tir::Stmt Lower(const LowerArgs &T, arith::Analyzer *analyzer) const final;
src/op/parallel.cc (1)

299-308: Vector width backoff logic verified

  • Confirmed the only backoff loop is in src/op/parallel.cc around lines 297–302, where vector_size is initialized via GetVectorizeSize and then reduced until
    floormod(loop_total_size, T.thread_bounds->extent * vector_size) == 0 or vector_size == 1.
  • No other occurrences of this pattern were found in the file.

The heuristic is sound and no code changes are required.
Optional: you may add a VLOG/TRACE statement in the fallback path (when vector_size remains at 1 for symbolic extents) to aid in tuning and debugging.
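
Restated as standalone pseudocode (illustrative names only, not the actual GetVectorizeSize plumbing), the back-off heuristic is simply:

#include <cstdint>

// Halve the candidate vector width until the flattened loop size divides evenly
// across threads * vector_size, or we fall back to scalar (vector_size == 1).
static int ChooseVectorSize(int64_t loop_total_size, int64_t num_threads,
                            int initial_vector_size) {
  int vector_size = initial_vector_size;
  while (vector_size > 1 &&
         loop_total_size % (num_threads * vector_size) != 0) {
    vector_size /= 2;
  }
  return vector_size;
}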

src/transform/inject_tma_barrier.cc (3)

65-77: 1D TMA byte-accounting is correct; consider hardening detection and dtype-bytes derivation

  • The 1D path using args[3] as size matches the 1D overload (barrier at arg[2], size at arg[3]) and multiplying by loop_extents is appropriate.
  • Two improvements to reduce edge cases:
    • Centralize the 1D/2D detection into a helper to avoid duplication and potential drift across rewriters.
    • Deriving element byte size via access_ptr->args[0]->dtype.bytes() can be brittle if the ptype isn’t the element dtype. If available, prefer deriving bytes from the underlying buffer dtype associated with the tvm_access_ptr, or inspect the PointerType’s element type explicitly.

Apply this refactor to de-duplicate 1D detection and clarify type-bytes:

+static inline bool Is1DTmaLoad(const CallNode* call) {
+  if (!call || !call->op.same_as(tma_load())) return false;
+  if (auto arg0 = call->args[0].as<Call>()) {
+    return !arg0.value()->op.same_as(create_tma_descriptor());
+  }
+  return false;
+}
...
-      if (auto arg0 = call->args[0].as<Call>();
-          call->op.same_as(tma_load()) && arg0 &&
-          !arg0.value()->op.same_as(create_tma_descriptor())) {
+      if (Is1DTmaLoad(call)) {
         // 1D TMA
         single_copy_bytes = call->args[3];
       } else {
         Call access_ptr = Downcast<Call>(call->args[2]);
         ICHECK(access_ptr->op.same_as(builtin::tvm_access_ptr()));
-        int type_bytes = access_ptr->args[0]->dtype.bytes();
+        // Prefer deriving bytes from the element dtype if resolvable
+        int type_bytes = access_ptr->args[0]->dtype.bytes();  // TODO: tighten this if pointer-typed
         single_copy_bytes = access_ptr->args[3] * type_bytes;
       }

Please confirm that for all emitted 2D tma_load/tma_load_im2col calls, args[2] is always a tvm_access_ptr of the SMEM destination. If there are cases where it isn’t, the type-bytes path needs to be updated accordingly.


168-174: Barrier index for 1D TMA is placed correctly; unify detection and avoid duplicate logic

The barrier placement at arg[2] for 1D and arg[1] otherwise is correct. Consider reusing a shared Is1DTmaLoad helper to keep TMA shape checks consistent across visitors.

-      auto arg0 = op->args[0].as<Call>();
-      auto is_1d_tma = op->op.same_as(tma_load()) && arg0 &&
-                       !arg0.value()->op.same_as(create_tma_descriptor());
+      bool is_1d_tma = Is1DTmaLoad(op);

458-464: Barrier rewrite for 1D TMA: correct slot; add a small guard and unify detection

The 1D/2D split and barrier slot assignment look right. Add braces to the if for clarity and reuse a shared helper.

-      if (auto arg0 = op->args[0].as<Call>();
-          arg0 && !arg0.value()->op.same_as(create_tma_descriptor()))
-        new_args.Set(2, barrier_id);
-      else
-        new_args.Set(1, barrier_id);
+      if (Is1DTmaLoad(op)) {
+        new_args.Set(2, barrier_id);
+      } else {
+        new_args.Set(1, barrier_id);
+      }

Edge case: if codegen ever emits a 1D tma_load whose args[0] isn’t a Call (e.g., a Var), current detection would classify it as 2D. Please confirm generator always uses tvm_access_ptr (or another Call) for smem_ptr.

src/transform/layout_reducer.cc (4)

27-43: String-to-enum parsing is fine; consider normalizing and surfacing helpful diagnostics

The mapping is straightforward. Minor ergonomics:

  • Normalize op_str/rep_str to lowercase once to remove case sensitivity footguns.
  • Include the block name or buffer var in the error to ease debugging.
-ReducerInfoNode::ReducerInfoNode(const String &op_str, const String &rep_str) {
+ReducerInfoNode::ReducerInfoNode(const String &op_str, const String &rep_str) {
+  auto to_lower = [](String s) {
+    std::string t = s;
+    std::transform(t.begin(), t.end(), t.begin(), ::tolower);
+    return String(t);
+  };
+  String op_l = to_lower(op_str);
+  String rep_l = to_lower(rep_str);
-  if (op_str == "sum")
+  if (op_l == "sum")
     op = ReducerOpType::SUM;
-  else if (op_str == "max")
+  else if (op_l == "max")
     op = ReducerOpType::MAX;
-  else if (op_str == "min")
+  else if (op_l == "min")
     op = ReducerOpType::MIN;
   else
-    ICHECK(false) << "Unrecognized reducer_info op: " << op_str;
+    ICHECK(false) << "Unrecognized reducer_info op: " << op_str;
-  if (rep_str == "all")
+  if (rep_l == "all")
     rep = ReducerRepType::ALL;
-  else if (rep_str == "none")
+  else if (rep_l == "none")
     rep = ReducerRepType::NONE;
   else
     ICHECK(false) << "Unrecognized reducer_info rep: " << rep_str;
}

62-88: Reading reducer annotations and merging layout_map: good; add type guards and avoid value_or on missing attr

  • The read path assumes attr::kReducerInfo is Map<Var, Map<String,String>>. Consider guarding with is_not_null before ->as<...>() to avoid undefined behavior.
  • When merging kLayoutMap, use Optional and check .defined() rather than value_or to avoid constructing a temporary just to overwrite it later.
-    if (op->annotations.count(attr::kReducerInfo)) {
-      auto map = op->annotations.Get(attr::kReducerInfo)
-                     ->as<Map<Var, Map<String, String>>>();
-      ICHECK(map) << "reducer_replication map is not defined";
+    if (op->annotations.count(attr::kReducerInfo)) {
+      auto anno = op->annotations.Get(attr::kReducerInfo);
+      ICHECK(anno.defined()) << "reducer_info annotation missing";
+      auto map = anno.as<Map<Var, Map<String, String>>>();
+      ICHECK(map) << "reducer_info must be a Map<Var, Map<String,String>> at block scope";
       for (auto &&[var, rep] : map.value()) {
         reducer_info_map_.Set(
             var, ReducerInfo{rep.Get("op").value(), rep.Get("rep").value()});
       }
     }
...
-    auto layout_map = p_result->annotations.Get(attr::kLayoutMap)
-                          ->as<Map<Var, Layout>>()
-                          .value_or(Map<Var, Layout>());
+    Map<Var, Layout> layout_map;
+    if (auto opt = p_result->annotations.Get(attr::kLayoutMap)) {
+      if (auto lm = opt.as<Map<Var, Layout>>()) layout_map = lm.value();
+    }

148-176: Fill/finalize pairing is well-formed; strengthen operand extraction and error messages

  • GetVarFromAccessPtr is used for both Fill and Finalize extraction; ensure it handles both tvm_access_ptr and address_of forms uniformly.
  • Enrich the error to include which buffer was double-Filled or missing a Finalize.
-      ICHECK(inside_reducer_range_.count(var) == 1)
-          << "T.finalize_reducer must have a pairing T.fill ahead of it, "
-             "enclosing a reduction range.";
+      ICHECK(inside_reducer_range_.count(var) == 1)
+          << "T.finalize_reducer must have a preceding T.fill enclosing a "
+             "reduction range for buffer " << var;

188-196: Static entry-point looks good; minor nits

  • Substitute mutates and returns f by value—OK. Consider naming it Run or Rewrite to align with other passes for consistency.
src/op/finalize_reducer.cc (3)

37-45: Early-exit on extent==1 is fine; drop unused variable

scale is unused. Remove it to avoid warnings.

-  int extent = *p_extent, scale = 1;
+  int extent = *p_extent;

62-71: AllReduce call string and workspace handling match reduce.cc; minor style tweaks

  • Consider using const int32_t for reducing_threads to match template expectations.
  • If T.thread_bounds->extent isn’t a constant, ensure AddWorkspace can handle non-const inputs (you’re forcing as_const_int).
-  int reducing_threads = extent;
+  int32_t reducing_threads = extent;
...
-    PrimExpr workspace =
-        T.AddWorkspace(*as_const_int(T.thread_bounds->extent), buffer->dtype);
+    PrimExpr workspace = T.AddWorkspace(T.thread_bounds->extent, buffer->dtype);

Confirm AddWorkspace accepts non-const extents. If not, keep the current path but add an ICHECK(as_const_int(...)) with a helpful message.


72-76: Parallel loop wrapping over OutputDim: check index dtype and bounds

This looks correct. Ensure indices_0[i] have an integer dtype compatible with For (usually int32). The Var default should be fine; adding an explicit dtype can make intent clearer.

-    body = For(indices_0[i].as<Var>().value(), 0, layout->OutputShape()[i],
+    body = For(indices_0[i].as<Var>().value(), IntImm(DataType::Int(32), 0),
+               layout->OutputShape()[i],
                ForKind::kParallel, body);
src/transform/layout_reducer.h (1)

37-39: Attribute key naming

"reducer_info" is concise; consider prefixing with a TL namespace (e.g., "tl.reducer_info") to avoid potential collisions with upstream keys.

tilelang/language/customize.py (2)

10-17: Memory order map: add validation and expose canonical names

Good to centralize. Add a validator to raise a clear error on invalid memory_order and consider accepting synonyms (e.g., "seqcst") if users commonly mistype.

+def _mo(name: str | None) -> int | None:
+    if name is None:
+        return None
+    try:
+        return _MEMORY_ORDER_ID_MAP[name]
+    except KeyError:
+        raise ValueError(f"Unsupported memory_order '{name}'. "
+                         f"Supported: {list(_MEMORY_ORDER_ID_MAP.keys())}")

160-162: Shape assertion: keep friendly diagnostics

ir.assert_structural_equal is good. Consider adding a context string with buffer names for easier debugging.

-        ir.assert_structural_equal(dst.shape, value.shape)
+        ir.assert_structural_equal(dst.shape, value.shape, "atomic_add(Buffer, Buffer): shape mismatch")
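
On the C++ side, the integer ids produced by _MEMORY_ORDER_ID_MAP ultimately have to be mapped onto CUDA memory orders. A hedged sketch of such a dispatch helper; the actual id encoding used by the PR is an assumption here (usual C++ enumeration order, 0 = relaxed up to 5 = seq_cst):

#include <cuda/atomic>

// Hypothetical helper: translate an integer memory-order id into cuda::std::memory_order.
// The id values assume the usual C++ enumeration order; verify against the Python map.
__device__ inline cuda::std::memory_order to_memory_order(int id) {
  switch (id) {
    case 0: return cuda::std::memory_order_relaxed;
    case 1: return cuda::std::memory_order_consume;
    case 2: return cuda::std::memory_order_acquire;
    case 3: return cuda::std::memory_order_release;
    case 4: return cuda::std::memory_order_acq_rel;
    default: return cuda::std::memory_order_seq_cst;
  }
}
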
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 556d411 and 4c0f978.

📒 Files selected for processing (25)
  • setup.py (0 hunks)
  • src/op/builtin.h (1 hunks)
  • src/op/copy.cc (1 hunks)
  • src/op/finalize_reducer.cc (1 hunks)
  • src/op/finalize_reducer.h (1 hunks)
  • src/op/parallel.cc (3 hunks)
  • src/op/parallel.h (2 hunks)
  • src/op/reduce.cc (2 hunks)
  • src/target/codegen_cuda.cc (1 hunks)
  • src/target/utils.cc (1 hunks)
  • src/tl_templates/cuda/common.h (4 hunks)
  • src/tl_templates/cuda/debug.h (1 hunks)
  • src/tl_templates/cuda/gemm_sm90.h (1 hunks)
  • src/transform/inject_tma_barrier.cc (3 hunks)
  • src/transform/layout_inference.cc (3 hunks)
  • src/transform/layout_reducer.cc (1 hunks)
  • src/transform/layout_reducer.h (1 hunks)
  • src/transform/merge_shared_memory_allocations.cc (1 hunks)
  • src/transform/warp_specialized_rewriter.cc (2 hunks)
  • tilelang/engine/phase.py (1 hunks)
  • tilelang/language/__init__.py (1 hunks)
  • tilelang/language/allocate.py (1 hunks)
  • tilelang/language/customize.py (4 hunks)
  • tilelang/language/reduce.py (1 hunks)
  • tilelang/transform/__init__.py (1 hunks)
💤 Files with no reviewable changes (1)
  • setup.py
🧰 Additional context used
🧬 Code graph analysis (16)
tilelang/engine/phase.py (2)
src/transform/layout_reducer.cc (2)
  • LayoutReducer (198-204)
  • LayoutReducer (198-198)
tilelang/transform/__init__.py (1)
  • LayoutReducer (424-427)
src/transform/layout_reducer.h (1)
src/transform/layout_reducer.cc (1)
  • ReducerInfoNode (27-43)
src/transform/inject_tma_barrier.cc (4)
src/transform/warp_specialized_rewriter.cc (18)
  • call (43-48)
  • call (43-43)
  • op (50-55)
  • op (50-50)
  • op (85-95)
  • op (85-85)
  • op (97-105)
  • op (97-97)
  • op (107-112)
  • op (107-107)
  • op (114-122)
  • op (114-114)
  • op (146-158)
  • op (146-146)
  • op (160-189)
  • op (160-160)
  • op (191-201)
  • op (191-191)
src/transform/lower_hopper_intrin.cc (2)
  • call (102-132)
  • call (102-102)
tilelang/language/builtin.py (3)
  • tma_load (67-76)
  • create_tma_descriptor (55-64)
  • get_mbarrier (43-52)
src/tl_templates/cuda/copy_sm90.h (6)
  • tma_load (18-27)
  • tma_load (43-60)
  • tma_load (64-82)
  • tma_load (86-104)
  • tma_load (107-126)
  • tma_load (130-150)
src/op/reduce.cc (1)
src/target/utils.cc (2)
  • TargetIsHopper (49-54)
  • TargetIsHopper (49-49)
tilelang/language/reduce.py (1)
tilelang/language/tir/op.py (1)
  • call_intrin (119-144)
tilelang/transform/__init__.py (1)
src/transform/layout_reducer.cc (2)
  • LayoutReducer (198-204)
  • LayoutReducer (198-198)
src/op/finalize_reducer.cc (6)
src/op/finalize_reducer.h (1)
  • FinalizeReducer (20-34)
src/transform/layout_reducer.h (1)
  • ReducerOpType (15-41)
src/op/reduce.cc (4)
  • Lower (119-285)
  • Lower (119-119)
  • Lower (379-403)
  • Lower (379-379)
src/transform/layout_reducer.cc (2)
  • op_ (148-176)
  • op_ (148-148)
src/target/utils.cc (2)
  • TargetIsHopper (49-54)
  • TargetIsHopper (49-49)
tilelang/language/reduce.py (1)
  • finalize_reducer (190-203)
src/transform/warp_specialized_rewriter.cc (3)
src/transform/inject_tma_barrier.cc (2)
  • call (63-80)
  • call (63-63)
tilelang/language/builtin.py (2)
  • tma_load (67-76)
  • create_tma_descriptor (55-64)
src/tl_templates/cuda/copy_sm90.h (6)
  • tma_load (18-27)
  • tma_load (43-60)
  • tma_load (64-82)
  • tma_load (86-104)
  • tma_load (107-126)
  • tma_load (130-150)
src/op/copy.cc (2)
src/transform/inject_tma_barrier.cc (16)
  • op (82-87)
  • op (82-82)
  • op (120-130)
  • op (120-120)
  • op (132-163)
  • op (132-132)
  • op (165-177)
  • op (165-165)
  • op (202-227)
  • op (202-202)
  • op (229-237)
  • op (229-229)
  • op (288-292)
  • op (288-288)
  • op (294-309)
  • op (294-294)
tilelang/language/builtin.py (1)
  • tma_load (67-76)
src/op/parallel.cc (1)
src/transform/layout_inference.cc (18)
  • op (39-45)
  • op (39-39)
  • op (294-322)
  • op (294-294)
  • op (346-369)
  • op (346-346)
  • op (371-388)
  • op (371-371)
  • op (390-399)
  • op (390-390)
  • op (553-565)
  • op (553-553)
  • op (567-639)
  • op (567-567)
  • op (641-650)
  • op (641-641)
  • buffer (338-344)
  • buffer (338-338)
tilelang/language/__init__.py (2)
tilelang/language/reduce.py (1)
  • finalize_reducer (190-203)
tilelang/language/customize.py (4)
  • atomic_max (95-109)
  • atomic_min (112-126)
  • atomic_load (272-283)
  • atomic_store (286-298)
src/target/utils.cc (2)
src/op/atomic_add.cc (2)
  • GetArchInt (24-35)
  • GetArchInt (24-24)
src/op/gemm.cc (2)
  • GetArchInt (363-374)
  • GetArchInt (363-363)
src/transform/layout_inference.cc (1)
src/op/parallel.cc (4)
  • op (96-106)
  • op (96-96)
  • op (108-116)
  • op (108-108)
src/transform/layout_reducer.cc (4)
src/op/parallel.cc (4)
  • VisitStmt_ (122-134)
  • VisitStmt_ (122-122)
  • VisitStmt_ (136-148)
  • VisitStmt_ (136-136)
src/op/finalize_reducer.h (1)
  • `` (25-27)
tilelang/language/tir/op.py (2)
  • indexmod (2890-2915)
  • tvm_access_ptr (650-675)
tilelang/transform/__init__.py (1)
  • LayoutReducer (424-427)
tilelang/language/customize.py (1)
tilelang/language/tir/op.py (2)
  • call_extern (172-194)
  • address_of (463-479)
src/op/finalize_reducer.h (4)
src/transform/layout_reducer.h (1)
  • tl (13-42)
src/op/finalize_reducer.cc (3)
  • FinalizeReducer (21-23)
  • Lower (25-79)
  • Lower (25-26)
src/op/op.h (1)
  • Operator (61-67)
src/transform/layout_reducer.cc (2)
  • op_ (148-176)
  • op_ (148-148)
🔇 Additional comments (16)
src/target/codegen_cuda.cc (1)

1135-1138: LGTM: Inline this_grid().sync() is correct and equivalent

Replacing the temporary grid_group with cooperative_groups::this_grid().sync(); keeps semantics identical and slightly reduces register pressure/verbosity. Header gating via need_cooperative_groups_ remains intact. No functional or ABI change.
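
For readers unfamiliar with the construct, the emitted call corresponds to the following pattern (illustrative kernel; a cooperative launch via cudaLaunchCooperativeKernel is still required at runtime):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void two_phase_kernel(float* data) {
  // ... phase 1: every block writes its portion of `data` ...
  cg::this_grid().sync();  // all blocks rendezvous here
  // ... phase 2: work that may read any block's phase-1 results ...
}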

src/tl_templates/cuda/gemm_sm90.h (1)

4-4: Review Complete: gemm_mma.h Covers FP8 & CUTE MMA, TU Explicitly Includes GMMA & Warpgroup

All requested symbols are accounted for without adding new transitive includes:

• FP8 types
gemm_mma.h directly includes "cuda_fp8.h", so fp8_e4_t and fp8_e5_t are defined.

• CUTE MMA atoms & architecture tags
<cute/atom/mma_atom.hpp> + <cute/arch/mma_sm80.hpp>, <cute/arch/mma_sm89.hpp>, <cute/arch/mma_sm120.hpp> are pulled in by gemm_mma.h, covering MMA_Atom<…> and all SM80/SM89/SM120 tags.

• GMMA utilities (GMMA::Major, selector ops)
– The SM90 TU (gemm_sm90.h) includes "intrin.h", which in turn pulls in <cute/arch/cluster_sm90.hpp> providing GMMA::Major.
– The memory‐layout selectors (ss_smem_selector/rs_smem_selector) are brought in via using namespace cutlass::gemm::collective::detail; in gemm_sm90.h.

• Warpgroup helpers
gemm_sm90.h explicitly includes <cutlass/arch/barrier.h>, supplying warpgroup_fence_operand, warpgroup_arrive, and warpgroup_wait.

No missing symbols or hidden dependencies were found. All transitive requirements are either re‐exported by gemm_mma.h (FP8 & CUTE MMA) or explicitly included in the SM90 TU (gemm_sm90.h). Keep the cleanup as is.

src/tl_templates/cuda/common.h (2)

130-135: Architecture guard for BF16 specialization likely incorrect

__CUDA_ARCH_LIST__ is non-standard; most code uses __CUDA_ARCH__. BF16 device intrinsics are generally available on SM80+.

Apply this diff (or align with your build-system macro definition):

-#if (defined(__CUDA_ARCH_LIST__) && (__CUDA_ARCH_LIST__ > 750))
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
 template <> struct normalize_atomic_type<bfloat16_t> {
   using type = __nv_bfloat16;
 };
 #endif

If your build relies on __CUDA_ARCH_LIST__, please confirm where it’s defined and whether multi-arch compilation is handled safely. I can adapt the guards accordingly.


166-173: I’ve added a script to locate and dump the relevant section of src/tl_templates/cuda/common.h so we can verify the existing AtomicAdd implementation. Once we see the actual code, I’ll confirm whether the suggested pointer‐cast, memory_order fix, and FP16/BF16 intrinsic fallbacks are needed or adjust accordingly.

src/transform/merge_shared_memory_allocations.cc (1)

965-965: LGTM: formatting-only change

The inserted blank line after const_nbits improves readability and has no functional impact.

src/op/parallel.h (1)

81-81: Ensure reducer_info_map_ is initialized consistently across construction and cloning

reducer_info_map_ is a new persistent member. The copy-ctor calls ParallelOp(other.root_) then selectively copies loop_layout_ and predicate_. Please confirm that the primary constructor reconstructs reducer_info_map_ from IR annotations so no data is lost on clone. If any late annotations are applied post-construction, you may need to also copy reducer_info_map_ in the copy-ctor.
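If late annotations are possible, the defensive copy is a single extra line in the copy-ctor; a minimal sketch assuming the constructor shape described above (the actual parallel.h may differ):

ParallelOp(const ParallelOp &other) : ParallelOp(other.root_) {
  loop_layout_ = other.loop_layout_;
  predicate_ = other.predicate_;
  reducer_info_map_ = other.reducer_info_map_;  // keep reducer metadata when cloning
}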

tilelang/engine/phase.py (1)

72-74: Good placement of LayoutReducer before LayoutInference

Running LayoutReducer() right after the initial Simplify() and before LayoutInference() matches the intent of feeding reducer layout metadata into inference. Please verify the pass is idempotent (safe on re-entry) and does not materially increase compile time on non-reducer workloads. If needed, we could gate it via a pass-context flag.

src/op/reduce.cc (1)

16-16: Include of target/utils.h is appropriate for arch helpers

Pulling in ../target/utils.h to use TargetIsHopper is the right direction and removes inline string checks. No issues spotted here.

src/transform/warp_specialized_rewriter.cc (1)

638-639: Expose WGMMA presence via accessor

Public accessor looks fine and matches the internal flag naming. No further concerns.

src/transform/layout_inference.cc (1)

23-23: Include for reducer metadata types

Including layout_reducer.h here is appropriate; it keeps the ReducerInfo surface local to this pass and avoids relying on indirect includes.

tilelang/language/__init__.py (1)

57-57: Re-export new APIs (finalize_reducer, atomics) in language facade

These re-exports make the Python surface consistent with the underlying ops. Looks good. Confirm that downstream docs/tutorials mention these new symbols for discoverability.

Would you like me to add/update a minimal docstring/index entry for these in the user docs?

Also applies to: 61-62, 70-71

src/op/finalize_reducer.h (1)

20-34: FinalizeReducer operator interface looks sound

Constructor contract and Lower signature align with Operator. Interaction with ReducerOpType is clear.

src/op/parallel.cc (1)

197-202: Skip rep=ALL reducers during source buffer selection

This avoids treating reducer replicas as inference sources, which would skew loop layout deduction. Looks correct given the new ReducerRepType.

src/transform/layout_reducer.cc (1)

102-139: Fix const_int_bound invocation and implement buffer scope conversion

  • Update the call to const_int_bound so it takes the underlying Var, not the IterVar, to match the Analyzer API:

     // before
     auto const_int_bound = analyzer_->const_int_bound(thread_var_);
     // after
     ICHECK(analyzer_->const_int_bound.IsBound(thread_var_->var));
     auto const_int_bound = analyzer_->const_int_bound(thread_var_->var);

  • After setting the kLayoutMap annotation, also rewrite each reducer buffer's storage scope from local.reducer.* to "local.fragment" so that downstream consumers (e.g., TIR passes and codegen) which inspect buffer.scope() will see the updated scope. For example, immediately after

     p_result->annotations.Set(attr::kLayoutMap, layout_map);

    insert something like:

// Convert each reducer buffer’s scope to local.fragment
for (auto &&[reducer_var, info] : inside_reducer_range_) {
  auto opt_buffer = var_to_buffer_.Get(reducer_var);
  ICHECK(opt_buffer);
  Buffer buffer = opt_buffer.value();
  // Find and mutate the corresponding AllocBufferNode in p_result
  for (auto &alloc : p_result->alloc_buffers) {
    if (alloc->data.same_as(buffer->data)) {
      auto alloc_mut = alloc.CopyOnWrite();
      alloc_mut->data->type_annotation = PointerType(
          alloc_mut->data->type_annotation.as<PointerTypeNode>()->element_type,
          "local.fragment");
    }
  }
}
  • Verify that downstream passes (e.g., ReduceOp::Lower, ParallelLoopNestVisitor, layout_inference) don’t assume fragment buffers solely by kLayoutMap and will behave correctly now that the actual buffer scope is updated.
src/op/finalize_reducer.cc (1)

21-24: Constructor argument decode looks correct

reducer_ via vmap and op_ from args[1] are in sync with the annotator adding the op enum. LGTM.

src/transform/layout_reducer.h (1)

15-35: ReducerInfo object types: concise and self-contained

The enums and ObjectRef wrapper are well-formed and consistent with TVM’s object system. LGTM.
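For readers less familiar with that object system, the general shape of such a node/ref pair is sketched below; the type and field names are placeholders rather than the actual ReducerInfo members, and the matching TVM_REGISTER_OBJECT_TYPE call belongs in the .cc:

class ExampleInfoNode : public Object {
 public:
  String op;   // "sum" | "max" | "min"
  String rep;  // "all" | "none"

  static constexpr const char *_type_key = "tl.ExampleInfo";
  TVM_DECLARE_FINAL_OBJECT_INFO(ExampleInfoNode, Object);
};

class ExampleInfo : public ObjectRef {
 public:
  TVM_DEFINE_OBJECT_REF_METHODS(ExampleInfo, ObjectRef, ExampleInfoNode);
};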

Comment on lines +132 to +139
/*!
* \brief tvm intrinsics for barrier initialization fence
*
* ptx_fence_barrier_init()
*
*/
const Op &ptx_fence_barrier_init();


💡 Verification agent

🧩 Analysis chain

Export the new intrinsic with TVM_DLL for consistency and visibility

All neighboring intrinsics are declared with TVM_DLL. Omitting it for ptx_fence_barrier_init may prevent the symbol from being exported/visible when linking dynamically. Please align with the rest.

 /*!
  * \brief tvm intrinsics for barrier initialization fence
  *
  * ptx_fence_barrier_init()
  *
  */
-const Op &ptx_fence_barrier_init();
+TVM_DLL const Op &ptx_fence_barrier_init();

Optionally, consider grouping this declaration next to other barrier-related intrinsics (e.g., tma_store_arrive, tma_store_wait) for discoverability.

Run this quick check to ensure the intrinsic is implemented/registered and has a Python wrapper if needed:


🏁 Script executed:

#!/bin/bash
# Where is the intrinsic referenced/implemented?
rg -nP 'ptx_fence_barrier_init\s*\(' -S

# Is the op name registered/exposed anywhere (C++ or Python)?
rg -nP '"tl\.ptx_fence_barrier_init"' -S

Length of output: 190


Add TVM_DLL and register ptx_fence_barrier_init for end-to-end visibility

I don’t see any registration for "tl.ptx_fence_barrier_init" or a Python binding anywhere in the repo—only the bare declaration in src/op/builtin.h. To make this new intrinsic fully functional and exportable, please:

  • Export the symbol
    In src/op/builtin.h (lines 135–138), update the declaration:

    /*!
     * \brief tvm intrinsics for barrier initialization fence
     *
     * ptx_fence_barrier_init()
     *
     */
    -const Op &ptx_fence_barrier_init();
    +TVM_DLL const Op &ptx_fence_barrier_init();
  • Register the op in C++
    Register "tl.ptx_fence_barrier_init" in src/op/builtin.cc using the same op-registration pattern as the neighboring intrinsics (e.g. TVM_REGISTER_OP with the appropriate TCallEffectKind attribute), so the intrinsic can be looked up at runtime.

  • Expose a Python binding
    In the TileLang Python frontend (e.g. under tilelang/language/), add a wrapper for ptx_fence_barrier_init so that it's available to users scripting in Python.

  • Optional: group with related intrinsics
    Consider moving this declaration next to other barrier- or TMA-related intrinsics (e.g. tma_store_arrive, tma_store_wait) for better discoverability.

Without these steps, the symbol won’t be exported from the shared library nor available in the Python API, leading to link-time or runtime failures.
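A minimal sketch of the C++ side (registration plus the accessor definition), assuming the conventional TVM pattern; the exact macro and attribute set should mirror whatever the neighboring tl.* intrinsics in src/op/builtin.cc already use:

// src/op/builtin.cc (sketch)
TVM_REGISTER_OP("tl.ptx_fence_barrier_init")
    .set_num_inputs(0)
    .set_attr<TCallEffectKind>("TCallEffectKind",
                               Integer(CallEffectKind::kOpaque));

const Op &ptx_fence_barrier_init() {
  static const Op &op = Op::Get("tl.ptx_fence_barrier_init");
  return op;
}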

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/*!
* \brief tvm intrinsics for barrier initialization fence
*
* ptx_fence_barrier_init()
*
*/
const Op &ptx_fence_barrier_init();
/*!
* \brief tvm intrinsics for barrier initialization fence
*
* ptx_fence_barrier_init()
*
*/
TVM_DLL const Op &ptx_fence_barrier_init();

Comment on lines 47 to 61
  std::array op_names{"tl::SumOp", "tl::MaxOp", "tl::MinOp"};
  auto op_str = op_names[(int)op_];

  // adopted from ReduceOp
  int reducing_threads = extent;
  std::stringstream ss;
  auto thread_offset = T.thread_bounds->min;
  if (TargetIsHopper(T.target)) {
    auto all_threads = T.thread_bounds->extent;
    ss << "tl::AllReduce<" << op_str << ", " << reducing_threads << ", " << 1
       << ", " << thread_offset << ", " << all_threads << ">::run_hopper";
  } else {
    ss << "tl::AllReduce<" << op_str << ", " << reducing_threads << ", " << 1
       << ", " << thread_offset << ">::run";
  }

⚠️ Potential issue

Fix std::array CTAD to avoid deduction issues across compilers

Some toolchains won’t deduce std::array element type/size from braced init here. Prefer an explicit type or a constexpr C-array.

-  std::array op_names{"tl::SumOp", "tl::MaxOp", "tl::MinOp"};
-  auto op_str = op_names[(int)op_];
+  static constexpr const char* op_names[3] = {"tl::SumOp", "tl::MaxOp", "tl::MinOp"};
+  const char* op_str = op_names[static_cast<int>(op_)];
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
std::array op_names{"tl::SumOp", "tl::MaxOp", "tl::MinOp"};
auto op_str = op_names[(int)op_];
// adopted from ReduceOp
int reducing_threads = extent;
std::stringstream ss;
auto thread_offset = T.thread_bounds->min;
if (TargetIsHopper(T.target)) {
auto all_threads = T.thread_bounds->extent;
ss << "tl::AllReduce<" << op_str << ", " << reducing_threads << ", " << 1
<< ", " << thread_offset << ", " << all_threads << ">::run_hopper";
} else {
ss << "tl::AllReduce<" << op_str << ", " << reducing_threads << ", " << 1
<< ", " << thread_offset << ">::run";
}
static constexpr const char* op_names[3] = {"tl::SumOp", "tl::MaxOp", "tl::MinOp"};
const char* op_str = op_names[static_cast<int>(op_)];
// adopted from ReduceOp
int reducing_threads = extent;
std::stringstream ss;
auto thread_offset = T.thread_bounds->min;
if (TargetIsHopper(T.target)) {
auto all_threads = T.thread_bounds->extent;
ss << "tl::AllReduce<" << op_str << ", " << reducing_threads << ", " << 1
<< ", " << thread_offset << ", " << all_threads << ">::run_hopper";
} else {
ss << "tl::AllReduce<" << op_str << ", " << reducing_threads << ", " << 1
<< ", " << thread_offset << ">::run";
}
🤖 Prompt for AI Agents
In src/op/finalize_reducer.cc around lines 47 to 61, the braced-init of
std::array may fail CTAD on some toolchains; replace the deduction-prone
declaration with an explicit typed array (e.g. std::array<const char*, 3>
op_names = {"tl::SumOp", "tl::MaxOp", "tl::MinOp"} ) or use a constexpr C-array
(e.g. constexpr const char* op_names[] = {"tl::SumOp", "tl::MaxOp",
"tl::MinOp"}), then use op_names[(int)op_] as before so all compilers can
resolve element type and size.

Comment on lines +127 to 133
  auto reducer_info_map =
      op->annotations.Get(attr::kReducerInfo)->as<Map<Var, ReducerInfo>>();
  if (reducer_info_map) {
    for (auto &&[buffer, info] : reducer_info_map.value())
      p->reducer_info_map_.Set(buffer, info);
  }
  StmtExprVisitor::VisitStmt_(op);

⚠️ Potential issue

Potential null deref when reading reducer_info annotation

Calling op->annotations.Get(attr::kReducerInfo)->as<...>() without checking that Get() returned a value can trigger a fatal when the annotation is absent. Use the same safe pattern you used elsewhere in this file for coalesced_width.

Apply:

-  auto reducer_info_map =
-      op->annotations.Get(attr::kReducerInfo)->as<Map<Var, ReducerInfo>>();
-  if (reducer_info_map) {
-    for (auto &&[buffer, info] : reducer_info_map.value())
-      p->reducer_info_map_.Set(buffer, info);
-  }
+  if (auto obj = op->annotations.Get(attr::kReducerInfo)) {
+    if (auto reducer_info_map = obj.value().as<Map<Var, ReducerInfo>>()) {
+      for (auto &&[buffer, info] : reducer_info_map.value()) {
+        p->reducer_info_map_.Set(buffer, info);
+      }
+    }
+  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto reducer_info_map =
op->annotations.Get(attr::kReducerInfo)->as<Map<Var, ReducerInfo>>();
if (reducer_info_map) {
for (auto &&[buffer, info] : reducer_info_map.value())
p->reducer_info_map_.Set(buffer, info);
}
StmtExprVisitor::VisitStmt_(op);
if (auto obj = op->annotations.Get(attr::kReducerInfo)) {
if (auto reducer_info_map = obj.value().as<Map<Var, ReducerInfo>>()) {
for (auto &&[buffer, info] : reducer_info_map.value()) {
p->reducer_info_map_.Set(buffer, info);
}
}
}
StmtExprVisitor::VisitStmt_(op);
🤖 Prompt for AI Agents
In src/op/parallel.cc around lines 127 to 133, the code calls
op->annotations.Get(attr::kReducerInfo)->as<...>() directly which can
dereference a null optional if the annotation is absent; change this to first
capture the result of Get() into a local Optional<ObjectRef> (or similar) and
check it before calling as<...>(), then only iterate and Set entries when the
as<...>() call returns a valid Map; mirror the safe pattern used for
coalesced_width (check Get() result, then as(), then use value()) so no null
deref occurs.

Comment on lines 1135 to 1138
     this->need_cooperative_groups_ = true;
     this->PrintIndent();
-    this->stream << "cooperative_groups::grid_group grid = "
-                    "cooperative_groups::this_grid();\n";
-    this->PrintIndent();
-    this->stream << "grid.sync();\n";
+    this->stream << "cooperative_groups::this_grid().sync();\n";
   } else if (op->op.same_as(tl::loop_break())) {

💡 Verification agent

🧩 Analysis chain

Double-check cooperative launch + device linking in build/runtime

Grid-wide sync requires cooperative launches and separable compilation/device linking. This change doesn’t alter that requirement, but worth verifying CI/build and runtime paths still satisfy it.

  • Build flags: NVCC should compile with device linking enabled (e.g., -rdc=true or CMake separable compilation).
  • Runtime: Kernels that may hit tl::sync_grid() should be launched with cooperative APIs (e.g., cudaLaunchCooperativeKernel / cuLaunchCooperativeKernel) on devices reporting cooperative support.

Run this quick scan to confirm:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Occurrences of inline grid sync in CUDA codegen (context):"
rg -nC2 'this_grid\(\)\.sync\(\)' src || true

echo
echo "Check for cooperative kernel launch usage (CUDA/Driver API):"
rg -nP 'cudaLaunchCooperativeKernel|cuLaunchCooperativeKernel' -C2 -g '!**/third_party/**' || true

echo
echo "Check build files for device linking/separable compilation:"
rg -nP '(-rdc=|relocatable-device-code|SEPARABLE_COMPILATION|CMAKE_CUDA_SEPARABLE_COMPILATION)' -n -S -C2 -- \
  CMakeLists.txt '**/*.cmake' || true

Length of output: 9214


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Listing CMake build files (depth <= 3):"
find . -maxdepth 3 -type f \( -name 'CMakeLists.txt' -o -name '*.cmake' \)

echo
echo "Searching for -rdc flags in CMake build files:"
rg -nP '(-rdc\s*=\s*true|-rdc)' -g '*.cmake' -g 'CMakeLists.txt' || true

echo
echo "Searching for CMAKE_CUDA_SEPARABLE_COMPILATION:"
rg -nP 'CMAKE_CUDA_SEPARABLE_COMPILATION' -g '*.cmake' -g 'CMakeLists.txt' || true

echo
echo "Searching for relocatable-device-code keywords:"
rg -nP 'relocatable-device-code|SEPARABLE_COMPILATION' -g '*.cmake' -g 'CMakeLists.txt' || true

Length of output: 786


Ensure CUDA separable compilation and cooperative launches are properly enabled

We didn’t find any -rdc=true, CMAKE_CUDA_SEPARABLE_COMPILATION, or equivalent relocatable-device-code settings in your CMake files, which are required whenever device code uses cooperative_groups::this_grid().sync(). Without these, kernels containing grid-wide sync won’t link or launch correctly.

Please update your build and CI to include:

  • In your top-level CMakeLists.txt (or wherever you enable CUDA):
    enable_language(CUDA)
    set(CMAKE_CUDA_SEPARABLE_COMPILATION ON)
    # Or explicitly:
    # set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -rdc=true")
  • For targets using CUDA device code:
    target_compile_options(<your_cuda_target>
      PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-rdc=true>
    )
  • Verify your CI/system build invocation passes these flags (e.g. -DCMAKE_CUDA_SEPARABLE_COMPILATION=ON).

At runtime, you’re already using cudaLaunchCooperativeKernel in wrapper.py (lines 328–331), and your codegen emits cooperative_groups::this_grid().sync();. Double-check that:

  • The device reports CU_DEVICE_ATTRIBUTE_COOPERATIVE_LAUNCH before attempting a grid sync.
  • All kernels containing grid sync are indeed launched via the cooperative APIs.

With these in place, this_grid().sync() will link and run correctly.

🤖 Prompt for AI Agents
In src/target/codegen_cuda.cc around lines 1135 to 1138, the code emits
cooperative_groups::this_grid().sync() but the build/CI likely lacks CUDA
separable compilation and relocatable device code flags; enable separable
compilation in CMake (e.g. enable_language(CUDA) and set
CMAKE_CUDA_SEPARABLE_COMPILATION ON or add -rdc=true to CUDA flags), add
target-specific compile options (target_compile_options(<cuda_target> PRIVATE
$<$<COMPILE_LANGUAGE:CUDA>:-rdc=true>)), ensure CI build invocations propagate
-DCMAKE_CUDA_SEPARABLE_COMPILATION=ON or equivalent, and confirm at runtime that
CU_DEVICE_ATTRIBUTE_COOPERATIVE_LAUNCH is supported and kernels with grid sync
are launched via cooperative launch APIs (as used in wrapper.py).

Comment on lines +208 to +214
template <typename T1, typename T2>
TL_DEVICE void AtomicStore(T1 *address, T2 value, int memory_order) {
  using NT1 = typename normalize_atomic_type<T1>::type;
  cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*address);
  aref.store(cuda_cast<NT1>(value), cuda::memory_order(memory_order));
}


🛠️ Refactor suggestion

AtomicStore should mirror AtomicLoad: default order + normalize + cast

Same issues as AtomicLoad: no default and constructing atomic_ref<NT1> from T1&.

Apply this diff:

-template <typename T1, typename T2>
-TL_DEVICE void AtomicStore(T1 *address, T2 value, int memory_order) {
-  using NT1 = typename normalize_atomic_type<T1>::type;
-  cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*address);
-  aref.store(cuda_cast<NT1>(value), cuda::memory_order(memory_order));
-}
+template <typename T1, typename T2>
+TL_DEVICE void AtomicStore(T1* address, T2 value,
+                           cuda::memory_order mo = cuda::memory_order_relaxed) {
+  using NT1 = typename normalize_atomic_type<T1>::type;
+  static_assert(sizeof(NT1) == sizeof(T1) && alignof(NT1) == alignof(T1), "Type mismatch");
+  NT1* addr = reinterpret_cast<NT1*>(address);
+  cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*addr);
+  aref.store(cuda_cast<NT1>(value), mo);
+}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
template <typename T1, typename T2>
TL_DEVICE void AtomicStore(T1 *address, T2 value, int memory_order) {
using NT1 = typename normalize_atomic_type<T1>::type;
cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*address);
aref.store(cuda_cast<NT1>(value), cuda::memory_order(memory_order));
}
template <typename T1, typename T2>
TL_DEVICE void AtomicStore(T1* address, T2 value,
cuda::memory_order mo = cuda::memory_order_relaxed) {
using NT1 = typename normalize_atomic_type<T1>::type;
static_assert(sizeof(NT1) == sizeof(T1) && alignof(NT1) == alignof(T1), "Type mismatch");
NT1* addr = reinterpret_cast<NT1*>(address);
cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*addr);
aref.store(cuda_cast<NT1>(value), mo);
}
🤖 Prompt for AI Agents
In src/tl_templates/cuda/common.h around lines 208-214, AtomicStore lacks a
default memory order and constructs cuda::atomic_ref from T1& instead of the
normalized atomic type; update the signature to provide a default memory_order
(match AtomicLoad, e.g. seq_cst), compute NT1 = normalize_atomic_type<T1>::type,
create an NT1& reference to *address (cast to NT1*) and pass that reference into
cuda::atomic_ref, and use cuda_cast<NT1>(value) for the stored value so the
implementation mirrors AtomicLoad.

Comment on lines +15 to +19
  printf(
      "msg='%s' BlockIdx=(%d, %d, %d), ThreadIdx=(%d, %d, %d): dtype=pointer "
      "value=%p\n",
      msg, blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y,
      threadIdx.z, var);

🛠️ Refactor suggestion

Fix %p argument type mismatch to avoid UB and -Wformat warnings

printf with the %p specifier requires a void* argument. Passing T* directly is undefined behavior per the C/C++ spec and can trigger warnings/errors with NVCC/Clang (-Wformat). Cast to const void*.

-      threadIdx.z, var);
+      threadIdx.z, static_cast<const void*>(var));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
printf(
"msg='%s' BlockIdx=(%d, %d, %d), ThreadIdx=(%d, %d, %d): dtype=pointer "
"value=%p\n",
msg, blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y,
threadIdx.z, var);
printf(
"msg='%s' BlockIdx=(%d, %d, %d), ThreadIdx=(%d, %d, %d): dtype=pointer "
"value=%p\n",
msg, blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y,
threadIdx.z, static_cast<const void*>(var));
🤖 Prompt for AI Agents
In src/tl_templates/cuda/debug.h around lines 15 to 19, the printf uses %p but
passes a T* (var) directly which is undefined behavior and triggers -Wformat;
cast the pointer argument to (const void*) when calling printf so the %p
specifier receives a void* pointer (e.g. replace var with (const void*)var) and
keep the rest of the format string unchanged.

Comment on lines +95 to +110
def atomic_max(dst: Buffer, value: PrimExpr, memory_order: str | None = None) -> PrimExpr:
    """Perform an atomic maximum operation.

    Args:
        dst (Buffer): Destination buffer where the atomic maximum will be performed
        value (PrimExpr): Value to be atomically added

    Returns:
        PrimExpr: Handle to the atomic maximum operation
    """
    if memory_order is None:
        return T.call_extern("handle", "AtomicMax", T.address_of(dst), value)
    else:
        return T.call_extern("handle", "AtomicMax", T.address_of(dst), value,
                             _MEMORY_ORDER_ID_MAP[memory_order])


💡 Verification agent

🧩 Analysis chain

atomic_max: docs and address-of usage

  • Docstring says “Value to be atomically added” for Max; adjust wording.
  • T.address_of expects a BufferLoad; passing Buffer works only if address_of is overloaded in your stack. For portability, use a 0-based BufferLoad.
-    """Perform an atomic maximum operation.
+    """Perform an atomic maximum operation.
...
-        value (PrimExpr): Value to be atomically added
+        value (PrimExpr): Value to atomically max with
...
-    if memory_order is None:
-        return T.call_extern("handle", "AtomicMax", T.address_of(dst), value)
-    else:
-        return T.call_extern("handle", "AtomicMax", T.address_of(dst), value,
-                             _MEMORY_ORDER_ID_MAP[memory_order])
+    idx0 = [0] * len(dst.shape)
+    addr = T.address_of(T.BufferLoad(dst, idx0))
+    mo = _mo(memory_order)
+    return (
+        T.call_extern("handle", "AtomicMax", addr, value)
+        if mo is None
+        else T.call_extern("handle", "AtomicMax", addr, value, mo)
+    )

Please confirm address_of(Buffer) is actually supported in your binding layer. If not, the above explicit BufferLoad(..., [0...]) is required.


🏁 Script executed:

#!/bin/bash
# Search for the definition of address_of in the codebase
rg -nP "def address_of" -C3 --glob "*.py"

# Search for any usages or overloads of address_of taking a Buffer
rg -nP "address_of\\s*\\(.*dst" -C3 --glob "*.py"

Length of output: 3852


atomic_max: Fix docstring and address_of usage

The atomic_max primitive’s docstring incorrectly refers to “added,” and the T.address_of call must receive a BufferLoad, not a raw Buffer. Please apply the following mandatory refactors in tilelang/language/customize.py (around lines 95–110):

• Update the docstring to describe “max” rather than “add.”
• Build an explicit 0-based BufferLoad to pass into T.address_of.
• Unify the memory-order lookup into a local mo variable and use a single call_extern return.

Diff suggestion:

--- a/tilelang/language/customize.py
+++ b/tilelang/language/customize.py
@@ def atomic_max(dst: Buffer, value: PrimExpr, memory_order: str | None = None) -> PrimExpr:
-    """Perform an atomic maximum operation.
-
-    Args:
-        dst (Buffer): Destination buffer where the atomic maximum will be performed
-        value (PrimExpr): Value to be atomically added
+    """Perform an atomic maximum operation.
+
+    Args:
+        dst (Buffer): Destination buffer where the atomic maximum will be performed
+        value (PrimExpr): Value to atomically compare for maximum
@@
-    if memory_order is None:
-        return T.call_extern("handle", "AtomicMax", T.address_of(dst), value)
-    else:
-        return T.call_extern("handle", "AtomicMax", T.address_of(dst), value,
-                             _MEMORY_ORDER_ID_MAP[memory_order])
+    # Load element 0 for a valid address_of argument
+    idx0 = [0] * len(dst.shape)
+    addr = T.address_of(T.BufferLoad(dst, idx0))
+    mo = _MEMORY_ORDER_ID_MAP.get(memory_order) if memory_order else None
+    return (
+        T.call_extern("handle", "AtomicMax", addr, value)
+        if mo is None
+        else T.call_extern("handle", "AtomicMax", addr, value, mo)
+    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def atomic_max(dst: Buffer, value: PrimExpr, memory_order: str | None = None) -> PrimExpr:
"""Perform an atomic maximum operation.
Args:
dst (Buffer): Destination buffer where the atomic maximum will be performed
value (PrimExpr): Value to be atomically added
Returns:
PrimExpr: Handle to the atomic maximum operation
"""
if memory_order is None:
return T.call_extern("handle", "AtomicMax", T.address_of(dst), value)
else:
return T.call_extern("handle", "AtomicMax", T.address_of(dst), value,
_MEMORY_ORDER_ID_MAP[memory_order])
def atomic_max(dst: Buffer, value: PrimExpr, memory_order: str | None = None) -> PrimExpr:
- """Perform an atomic maximum operation.
-
- Args:
- dst (Buffer): Destination buffer where the atomic maximum will be performed
- value (PrimExpr): Value to be atomically added
-
- Returns:
- PrimExpr: Handle to the atomic maximum operation
- """
- if memory_order is None:
- return T.call_extern("handle", "AtomicMax", T.address_of(dst), value)
- else:
- return T.call_extern("handle", "AtomicMax", T.address_of(dst), value,
"""Perform an atomic maximum operation.
Args:
dst (Buffer): Destination buffer where the atomic maximum will be performed
value (PrimExpr): Value to atomically compare for maximum
Returns:
PrimExpr: Handle to the atomic maximum operation
"""
# Load element 0 for a valid address_of argument
idx0 = [0] * len(dst.shape)
addr = T.address_of(T.BufferLoad(dst, idx0))
mo = _MEMORY_ORDER_ID_MAP.get(memory_order) if memory_order else None
return (
T.call_extern("handle", "AtomicMax", addr, value)
if mo is None
else T.call_extern("handle", "AtomicMax", addr, value, mo)
)
🤖 Prompt for AI Agents
In tilelang/language/customize.py around lines 95–110, the atomic_max docstring
and call are wrong: update the docstring to describe that this performs an
atomic maximum (not an add), construct an explicit 0-based BufferLoad from dst
and pass that BufferLoad into T.address_of (instead of passing dst directly),
and unify the memory-order lookup into a local variable (e.g. mo =
_MEMORY_ORDER_ID_MAP[memory_order] if memory_order is not None else None) so you
can return a single T.call_extern call that includes the memory-order argument
only when mo is not None.

Comment on lines +112 to +127
def atomic_min(dst: Buffer, value: PrimExpr, memory_order: str | None = None) -> PrimExpr:
    """Perform an atomic minimum operation.

    Args:
        dst (Buffer): Destination buffer where the atomic minimum will be performed
        value (PrimExpr): Value to be atomically added

    Returns:
        PrimExpr: Handle to the atomic minimum operation
    """
    if memory_order is None:
        return T.call_extern("handle", "AtomicMin", T.address_of(dst), value)
    else:
        return T.call_extern("handle", "AtomicMin", T.address_of(dst), value,
                             _MEMORY_ORDER_ID_MAP[memory_order])


🛠️ Refactor suggestion

atomic_min: mirror fixes from atomic_max

Same issues: docstring wording and address_of. Mirror the atomic_max adjustments for consistency.

-        value (PrimExpr): Value to be atomically added
+        value (PrimExpr): Value to atomically min with
...
-    if memory_order is None:
-        return T.call_extern("handle", "AtomicMin", T.address_of(dst), value)
-    else:
-        return T.call_extern("handle", "AtomicMin", T.address_of(dst), value,
-                             _MEMORY_ORDER_ID_MAP[memory_order])
+    idx0 = [0] * len(dst.shape)
+    addr = T.address_of(T.BufferLoad(dst, idx0))
+    mo = _mo(memory_order)
+    return (
+        T.call_extern("handle", "AtomicMin", addr, value)
+        if mo is None
+        else T.call_extern("handle", "AtomicMin", addr, value, mo)
+    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def atomic_min(dst: Buffer, value: PrimExpr, memory_order: str | None = None) -> PrimExpr:
"""Perform an atomic minimum operation.
Args:
dst (Buffer): Destination buffer where the atomic minimum will be performed
value (PrimExpr): Value to be atomically added
Returns:
PrimExpr: Handle to the atomic minimum operation
"""
if memory_order is None:
return T.call_extern("handle", "AtomicMin", T.address_of(dst), value)
else:
return T.call_extern("handle", "AtomicMin", T.address_of(dst), value,
_MEMORY_ORDER_ID_MAP[memory_order])
def atomic_min(dst: Buffer, value: PrimExpr, memory_order: str | None = None) -> PrimExpr:
"""Perform an atomic minimum operation.
Args:
dst (Buffer): Destination buffer where the atomic minimum will be performed
value (PrimExpr): Value to atomically min with
Returns:
PrimExpr: Handle to the atomic minimum operation
"""
idx0 = [0] * len(dst.shape)
addr = T.address_of(T.BufferLoad(dst, idx0))
mo = _mo(memory_order)
return (
T.call_extern("handle", "AtomicMin", addr, value)
if mo is None
else T.call_extern("handle", "AtomicMin", addr, value, mo)
)
🤖 Prompt for AI Agents
In tilelang/language/customize.py around lines 112–127, update atomic_min to
mirror the atomic_max fixes: correct the docstring wording to describe an atomic
minimum operation (value to be atomically compared/minimized against the
destination) and change the address argument from T.address_of(dst) to
T.address_of(T.BufferLoad(dst, [0] * len(dst.shape))), mirroring the atomic_max
fix (apply the same memory_order handling as currently implemented). Ensure the
returned T.call_extern calls use that address in both branches and the docstring
accurately describes dst and value.

Comment on lines +272 to +284
def atomic_load(src: Buffer, memory_order: str = "seq_cst") -> PrimExpr:
    """Loads a value from the input buffer with specified memory_order.

    Args:
        src (Buffer): Input buffer to load from
        memory_order (str, optional): Atomicity level for the load operation. Defaults to "seq_cst".

    Returns:
        PrimExpr: The loaded value from the buffer
    """
    return T.call_extern(src.dtype, "AtomicLoad", T.address_of(src),
                         _MEMORY_ORDER_ID_MAP[memory_order])


🛠️ Refactor suggestion

atomic_load: address-of and memory_order validation

Use the same robust address_of(BufferLoad) pattern and centralized _mo validator.

-    return T.call_extern(src.dtype, "AtomicLoad", T.address_of(src),
-                         _MEMORY_ORDER_ID_MAP[memory_order])
+    idx0 = [0] * len(src.shape)
+    addr = T.address_of(T.BufferLoad(src, idx0))
+    return T.call_extern(src.dtype, "AtomicLoad", addr, _mo(memory_order))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def atomic_load(src: Buffer, memory_order: str = "seq_cst") -> PrimExpr:
"""Loads a value from the input buffer with specified memory_order.
Args:
src (Buffer): Input buffer to load from
memory_order (str, optional): Atomicity level for the load operation. Defaults to "seq_cst".
Returns:
PrimExpr: The loaded value from the buffer
"""
return T.call_extern(src.dtype, "AtomicLoad", T.address_of(src),
_MEMORY_ORDER_ID_MAP[memory_order])
def atomic_load(src: Buffer, memory_order: str = "seq_cst") -> PrimExpr:
"""Loads a value from the input buffer with specified memory_order.
Args:
src (Buffer): Input buffer to load from
memory_order (str, optional): Atomicity level for the load operation. Defaults to "seq_cst".
Returns:
PrimExpr: The loaded value from the buffer
"""
idx0 = [0] * len(src.shape)
addr = T.address_of(T.BufferLoad(src, idx0))
return T.call_extern(src.dtype, "AtomicLoad", addr, _mo(memory_order))
🤖 Prompt for AI Agents
In tilelang/language/customize.py around lines 272 to 284, update atomic_load to
use the address-of(BufferLoad(...)) pattern and the centralized _mo validator:
construct a BufferLoad node for src (same shape/indices pattern used elsewhere)
and pass T.address_of(BufferLoad(...)) instead of T.address_of(src), and call
_mo(memory_order) to validate/normalize the memory order and obtain the mapped
id (use its return where _MEMORY_ORDER_ID_MAP[memory_order] was used). Ensure
the BufferLoad is created consistently with other atomic ops in the file so
address-of targets a BufferLoad AST node and the _mo helper is used to validate
the memory_order string before passing it to the extern call.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 16

🧹 Nitpick comments (17)
tilelang/language/allocate.py (1)

17-17: Remove unused and potentially confusing tvm alias or rename it to avoid ambiguity with upstream TVM.

from tilelang import tvm as tvm is unused in this module and can confuse readers since T is sourced from upstream tvm.script. If you plan to reference TileLang’s wrapped TVM later, prefer a distinct alias (e.g., tl_tvm) to avoid mixing it up with upstream tvm. If not needed now, drop the import.

Primary fix (remove unused import):

-from tilelang import tvm as tvm

If you intend to use it soon, keep it with a clearer alias and lint suppression until first use:

- from tilelang import tvm as tvm
+ import tilelang.tvm as tl_tvm  # noqa: F401 - reserved for reducer-aware allocation work
src/tl_templates/cuda/debug.h (1)

13-20: Prefer a single non-template overload taking const void*

This avoids unnecessary template instantiations, clearly signals no mutation, and excludes function pointers implicitly. It also incorporates the correct void* cast for %p.

-// Overload for pointer type (supports any cv-qualified T*)
-template <typename T> __device__ void debug_print_var(const char *msg, T *var) {
+// Overload for object pointers (any cv-qualification); prints the address
+__device__ void debug_print_var(const char *msg, const void *ptr) {
   printf(
       "msg='%s' BlockIdx=(%d, %d, %d), ThreadIdx=(%d, %d, %d): dtype=pointer "
       "value=%p\n",
-      msg, blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y,
-      threadIdx.z, var);
+      msg, blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y,
+      threadIdx.z, ptr);
 }

Note: This change is source-compatible for all calls passing object pointers (T*) and reduces code bloat. If you need to support volatile object pointers as well, consider an additional overload debug_print_var(const char*, const volatile void*) that forwards to this one.

src/op/copy.cc (1)

787-788: Consider consolidating the TMA op declaration.

The op variable is declared here but only used in the 1D TMA path below. This early declaration might be unnecessary since the non-1D path doesn't use it.

Consider moving the op declaration closer to its usage within the 1D TMA path:

-  auto op = is_load ? tma_load() : tma_store();
-
   // If both shared_layout and global tensor accessed region are contiguous, use
   // TMA 1D instead of 2D
   {
+    auto op = is_load ? tma_load() : tma_store();
tilelang/transform/__init__.py (1)

424-427: Guard against missing FFI symbol (optional).

If older wheels or partial builds are used, _ffi_api.LayoutReducer may be absent and raise AttributeError at import-time. Consider mirroring the conditional used for LowerHopperIntrin to make failures easier to diagnose.

-def LayoutReducer():
-    """LayoutReducer
-    """
-    return _ffi_api.LayoutReducer()  # type: ignore
+def LayoutReducer():
+    """LayoutReducer"""
+    return (_ffi_api.LayoutReducer() if hasattr(_ffi_api, "LayoutReducer") else (lambda f: f))  # type: ignore
tilelang/language/reduce.py (1)

190-203: Use read-write access for reducer pointer to reflect side effects.

Finalize likely reads partials and writes results in-place. Using "rw" better communicates effects to TIR analyses (even though the op is opaque).

-    return tir.call_intrin(
-        "handle",
-        tir.op.Op.get("tl.finalize_reducer"),
-        reducer.access_ptr("w"),
-    )
+    return tir.call_intrin(
+        "handle",
+        tir.op.Op.get("tl.finalize_reducer"),
+        reducer.access_ptr("rw"),
+    )
tilelang/engine/phase.py (1)

72-75: Add a brief note on dependency/assumption.

Small doc comment to future-proof: clarify that LayoutInference expects reducer annotations produced here.

-    # Set layouts for reducers
+    # Set layouts/metadata for reducers so LayoutInference can respect reducer layouts
src/op/parallel.h (2)

13-13: Header layering: avoid depending on a transform header in an op header.

Including ../transform/layout_reducer.h here couples op-layer headers to transform-layer headers and increases rebuild scope. Prefer a forward declaration of ReducerInfo (ObjectRef) in a small common header, or forward-declare the ReducerInfoNode and use ObjectRef aliasing, moving the heavy include to the .cc.

Possible direction:

  • Create src/common/reducer_info_fwd.h with class ReducerInfoNode; using ReducerInfo = tvm::ObjectRef; (or the generated ObjectRef alias).
  • Include the full header only in translation units that need method definitions.

81-82: New reducer_info_map_ storage — ensures reducer-aware layout.

Adding Map<Var, ReducerInfo> allows layout inference to skip/align reducer buffers. Looks good; ensure serialization/printing paths (if any) handle this member to aid debugging.

src/op/reduce.cc (2)

230-239: Minor duplication in AllReduce string building.

Consider building the common prefix once and appending the suffix/extra arg conditionally for Hopper to reduce churn and mistakes.


230-239: Enforce constexpr thread extent in Hopper AllReduce

Add an ICHECK(as_const_int(...)) guard around the Hopper path in src/op/reduce.cc to ensure T.thread_bounds->extent is a compile-time constant, then unwrap it for the template argument. You can also factor out the common stream insertion (ss << "tl::AllReduce<" …) outside the branches to reduce duplication.

  • File: src/op/reduce.cc (around lines 229–239):
   auto thread_offset = T.thread_bounds->min;
-  if (TargetIsHopper(T.target)) {
-    auto all_threads = T.thread_bounds->extent;
-    ss << "tl::AllReduce<" << this->MakeCodegenReducer() << ", "
-       << reducing_threads << ", " << (*scale) << ", " << thread_offset
-       << ", " << all_threads << ">::run_hopper";
-  } else {
+  if (TargetIsHopper(T.target)) {
+    auto pe_extent = T.thread_bounds->extent;
+    auto ci_extent = as_const_int(pe_extent);
+    ICHECK(ci_extent) << "Hopper AllReduce requires a static thread extent.";
+    ss << "tl::AllReduce<" << this->MakeCodegenReducer() << ", "
+       << reducing_threads << ", " << (*scale) << ", " << thread_offset
+       << ", " << *ci_extent << ">::run_hopper";
+  } else {
     ss << "tl::AllReduce<" << this->MakeCodegenReducer() << ", "
        << reducing_threads << ", " << (*scale) << ", " << thread_offset
        << ">::run";
   }
src/transform/layout_inference.cc (1)

618-627: Consider refactoring reducer detection logic.

The nested PostOrderVisit traversal inside a conditional check could be simplified. Additionally, the condition if (!has_reducer) inside the lambda is redundant since the loop terminates early once has_reducer becomes true.

Apply this diff to simplify the logic:

-      bool has_reducer = false;
-      PostOrderVisit(for_node->body, [&](const ObjectRef &obj) {
-        if (!has_reducer)
-          if (const auto *store = obj.as<BufferStoreNode>()) {
-            has_reducer = reducer_info.count(store->buffer->data) != 0;
-          }
-      });
+      bool has_reducer = false;
+      PostOrderVisit(for_node->body, [&](const ObjectRef &obj) {
+        if (const auto *store = obj.as<BufferStoreNode>()) {
+          if (reducer_info.count(store->buffer->data) != 0) {
+            has_reducer = true;
+          }
+        }
+      });
src/transform/layout_reducer.h (1)

28-35: Consider documenting/guarding enum-to-string invariants.

Since the public ReducerInfo(const String&, const String&) parses strings, explicitly document accepted values ("sum|max|min", "all|none") in the header, or expose an enum-based constructor to prevent invalid states at compile time.
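If the string-based constructor stays, a small validated parser keeps invalid states out; a sketch (the enum member names below are assumed, not taken from layout_reducer.h):

ReducerOpType ParseReducerOp(const String &s) {
  if (s == "sum") return ReducerOpType::SUM;
  if (s == "max") return ReducerOpType::MAX;
  if (s == "min") return ReducerOpType::MIN;
  LOG(FATAL) << "Invalid reducer op '" << s << "', expected sum|max|min";
  return ReducerOpType::SUM;  // unreachable
}

ReducerRepType ParseReducerRep(const String &s) {
  if (s == "all") return ReducerRepType::ALL;
  if (s == "none") return ReducerRepType::NONE;
  LOG(FATAL) << "Invalid reducer rep '" << s << "', expected all|none";
  return ReducerRepType::NONE;  // unreachable
}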

src/transform/layout_reducer.cc (5)

48-56: Thread tag handling is too narrow; only supports threadIdx.x.

Hard-coding "threadIdx.x" excludes reducers mapped to threadIdx.y/z or block-level reductions. At minimum, generalize to any threadIdx.* (and consider blockIdx.* if applicable).

Example tweak:

-      if (iv->thread_tag == "threadIdx.x") {
+      if (iv->thread_tag.rfind("threadIdx.", 0) == 0) {  // starts with "threadIdx."
         ICHECK(iv->dom->extent.as<IntImmNode>());
         thread_var_ = iv;
       }

Also consider relaxing the IntImm check (see next comment).


53-55: Overly strict constant-extent check.

ICHECK(iv->dom->extent.as<IntImmNode>()) rejects valid cases with proven-constant extents. You already rely on const_int_bound later; this check is redundant and can cause false negatives.

-        ICHECK(iv->dom->extent.as<IntImmNode>());
         thread_var_ = iv;

112-121: Use 64-bit bounds to avoid narrowing and drop unused dtype.

  • dtype is unused.
  • const_int_bound carries 64-bit values; narrowing to int can truncate for large extents (even if unlikely for threads).
-        auto dtype = thread_var_->var.dtype();
-        int thread_min = const_int_bound->min_value;
-        int thread_extent =
-            const_int_bound->max_value - const_int_bound->min_value + 1;
+        int64_t thread_min = const_int_bound->min_value;
+        int64_t thread_extent =
+            const_int_bound->max_value - const_int_bound->min_value + 1;

126-137: Fragment construction looks consistent; add a guard for unexpected rep.

The two branches cover ALL/NONE. Consider adding an else with ICHECK(false) to catch future enum additions; this is especially useful if more replication modes are introduced.
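Concretely, the guard can be a trailing else on the existing branch (sketch; enum member names assumed):

if (info->rep == ReducerRepType::ALL) {
  // replicated layout, as constructed today
} else if (info->rep == ReducerRepType::NONE) {
  // non-replicated layout, as constructed today
} else {
  ICHECK(false) << "Unhandled ReducerRepType: " << static_cast<int>(info->rep);
}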


143-146: Future: validate BufferStore against info->op semantics.

The TODO is important (e.g., ensuring the store op matches the reducer op semantics or is within the correct region). Call this out so it’s not forgotten.

I can help wire a verifier that flags illegal writes to reducer buffers outside the allowed combine op.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 556d411 and 4c0f978.

📒 Files selected for processing (25)
  • setup.py (0 hunks)
  • src/op/builtin.h (1 hunks)
  • src/op/copy.cc (1 hunks)
  • src/op/finalize_reducer.cc (1 hunks)
  • src/op/finalize_reducer.h (1 hunks)
  • src/op/parallel.cc (3 hunks)
  • src/op/parallel.h (2 hunks)
  • src/op/reduce.cc (2 hunks)
  • src/target/codegen_cuda.cc (1 hunks)
  • src/target/utils.cc (1 hunks)
  • src/tl_templates/cuda/common.h (4 hunks)
  • src/tl_templates/cuda/debug.h (1 hunks)
  • src/tl_templates/cuda/gemm_sm90.h (1 hunks)
  • src/transform/inject_tma_barrier.cc (3 hunks)
  • src/transform/layout_inference.cc (3 hunks)
  • src/transform/layout_reducer.cc (1 hunks)
  • src/transform/layout_reducer.h (1 hunks)
  • src/transform/merge_shared_memory_allocations.cc (1 hunks)
  • src/transform/warp_specialized_rewriter.cc (2 hunks)
  • tilelang/engine/phase.py (1 hunks)
  • tilelang/language/__init__.py (1 hunks)
  • tilelang/language/allocate.py (1 hunks)
  • tilelang/language/customize.py (4 hunks)
  • tilelang/language/reduce.py (1 hunks)
  • tilelang/transform/__init__.py (1 hunks)
💤 Files with no reviewable changes (1)
  • setup.py
🧰 Additional context used
🧬 Code graph analysis (17)
src/transform/warp_specialized_rewriter.cc (4)
src/transform/inject_tma_barrier.cc (2)
  • call (63-80)
  • call (63-63)
src/transform/lower_hopper_intrin.cc (2)
  • call (102-132)
  • call (102-102)
src/tl_templates/cuda/copy_sm90.h (6)
  • tma_load (18-27)
  • tma_load (43-60)
  • tma_load (64-82)
  • tma_load (86-104)
  • tma_load (107-126)
  • tma_load (130-150)
tilelang/language/tir/op.py (1)
  • tvm_access_ptr (650-675)
tilelang/transform/__init__.py (1)
src/transform/layout_reducer.cc (2)
  • LayoutReducer (198-204)
  • LayoutReducer (198-198)
tilelang/engine/phase.py (2)
src/transform/layout_reducer.cc (2)
  • LayoutReducer (198-204)
  • LayoutReducer (198-198)
tilelang/transform/__init__.py (1)
  • LayoutReducer (424-427)
tilelang/language/reduce.py (1)
tilelang/language/tir/op.py (1)
  • call_intrin (119-144)
src/op/reduce.cc (1)
src/target/utils.cc (2)
  • TargetIsHopper (49-54)
  • TargetIsHopper (49-49)
tilelang/language/__init__.py (2)
tilelang/language/reduce.py (1)
  • finalize_reducer (190-203)
tilelang/language/customize.py (4)
  • atomic_max (95-109)
  • atomic_min (112-126)
  • atomic_load (272-283)
  • atomic_store (286-298)
src/transform/inject_tma_barrier.cc (4)
src/transform/warp_specialized_rewriter.cc (18)
  • call (43-48)
  • call (43-43)
  • op (50-55)
  • op (50-50)
  • op (85-95)
  • op (85-85)
  • op (97-105)
  • op (97-97)
  • op (107-112)
  • op (107-107)
  • op (114-122)
  • op (114-114)
  • op (146-158)
  • op (146-146)
  • op (160-189)
  • op (160-160)
  • op (191-201)
  • op (191-191)
src/transform/lower_hopper_intrin.cc (2)
  • call (102-132)
  • call (102-102)
tilelang/language/builtin.py (3)
  • tma_load (67-76)
  • create_tma_descriptor (55-64)
  • get_mbarrier (43-52)
src/tl_templates/cuda/copy_sm90.h (6)
  • tma_load (18-27)
  • tma_load (43-60)
  • tma_load (64-82)
  • tma_load (86-104)
  • tma_load (107-126)
  • tma_load (130-150)
src/transform/layout_inference.cc (2)
src/op/parallel.cc (4)
  • op (96-106)
  • op (96-96)
  • op (108-116)
  • op (108-108)
src/transform/pipeline_planning.cc (10)
  • op (54-72)
  • op (54-54)
  • op (74-92)
  • op (74-74)
  • op (94-121)
  • op (94-94)
  • op (123-133)
  • op (123-123)
  • op (545-554)
  • op (545-545)
src/op/finalize_reducer.cc (6)
src/op/finalize_reducer.h (1)
  • FinalizeReducer (20-34)
src/transform/layout_reducer.h (1)
  • ReducerOpType (15-41)
src/op/reduce.cc (4)
  • Lower (119-285)
  • Lower (119-119)
  • Lower (379-403)
  • Lower (379-379)
src/transform/layout_reducer.cc (2)
  • op_ (148-176)
  • op_ (148-148)
src/target/utils.cc (2)
  • TargetIsHopper (49-54)
  • TargetIsHopper (49-49)
tilelang/language/reduce.py (1)
  • finalize_reducer (190-203)
src/transform/layout_reducer.h (3)
src/op/finalize_reducer.h (2)
  • tvm (15-37)
  • tl (16-36)
src/op/parallel.h (2)
  • tvm (16-91)
  • tl (17-90)
src/transform/layout_reducer.cc (1)
  • ReducerInfoNode (27-43)
src/transform/layout_reducer.cc (3)
src/op/parallel.cc (4)
  • VisitStmt_ (122-134)
  • VisitStmt_ (122-122)
  • VisitStmt_ (136-148)
  • VisitStmt_ (136-136)
tilelang/language/tir/op.py (2)
  • indexmod (2890-2915)
  • tvm_access_ptr (650-675)
tilelang/transform/__init__.py (1)
  • LayoutReducer (424-427)
src/op/copy.cc (2)
src/transform/inject_tma_barrier.cc (16)
  • op (82-87)
  • op (82-82)
  • op (120-130)
  • op (120-120)
  • op (132-163)
  • op (132-132)
  • op (165-177)
  • op (165-165)
  • op (202-227)
  • op (202-202)
  • op (229-237)
  • op (229-229)
  • op (288-292)
  • op (288-288)
  • op (294-309)
  • op (294-294)
tilelang/language/builtin.py (1)
  • tma_load (67-76)
src/op/parallel.cc (1)
src/transform/layout_inference.cc (18)
  • op (39-45)
  • op (39-39)
  • op (294-322)
  • op (294-294)
  • op (346-369)
  • op (346-346)
  • op (371-388)
  • op (371-371)
  • op (390-399)
  • op (390-390)
  • op (553-565)
  • op (553-553)
  • op (567-639)
  • op (567-567)
  • op (641-650)
  • op (641-641)
  • buffer (338-344)
  • buffer (338-338)
src/target/utils.cc (1)
src/op/atomic_add.cc (2)
  • GetArchInt (24-35)
  • GetArchInt (24-24)
src/op/finalize_reducer.h (4)
src/transform/layout_reducer.h (1)
  • tl (13-42)
src/op/finalize_reducer.cc (3)
  • FinalizeReducer (21-23)
  • Lower (25-79)
  • Lower (25-26)
src/op/op.h (1)
  • Operator (61-67)
src/transform/layout_reducer.cc (2)
  • op_ (148-176)
  • op_ (148-148)
tilelang/language/customize.py (2)
src/tl_templates/cuda/reduce.h (1)
  • T (75-147)
tilelang/language/tir/op.py (2)
  • call_extern (172-194)
  • address_of (463-479)
src/tl_templates/cuda/common.h (1)
src/tl_templates/cpp/half.hpp (22)
  • half (2475-2478)
  • half (2475-2475)
  • half (2556-2556)
  • half (2561-2561)
  • half (3000-3000)
  • half (3005-3007)
  • half (3234-3236)
  • half (3242-3242)
  • half (3416-3424)
  • half (3431-3439)
  • half (5248-5250)
  • int (760-765)
  • int (772-779)
  • int (787-797)
  • int (804-811)
  • int (816-821)
  • int (827-832)
  • int (838-843)
  • int (855-864)
  • int (872-879)
  • int (892-914)
  • int (5266-5272)
🪛 GitHub Actions: CI
src/target/utils.cc

[error] 114-114: redefinition of 'bool tvm::tl::TargetHasBulkCopy(tvm::Target)'. Previously defined at line 107.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build-test-amd
🔇 Additional comments (35)
src/tl_templates/cuda/debug.h (1)

13-20: Nice addition—fills a real debug gap

Adding a pointer-specific overload makes device-side diagnostics much easier and avoids accidental dereferencing patterns. Message format is consistent with existing specializations.

src/transform/merge_shared_memory_allocations.cc (1)

963-969: No-op whitespace change — safe to merge

Only a blank line added after const_nbits computation. No behavior impact.

src/target/codegen_cuda.cc (1)

1135-1138: Simplify sync_grid emission

Inlining cooperative_groups::this_grid().sync() is cleaner than using a temporary grid variable, and the need_cooperative_groups_ flag still correctly guards emission of the <cooperative_groups.h> include. LGTM.

Please verify against your generated CUDA output:

  • File: src/target/codegen_cuda.cc
    • Sync call at lines 1135–1138
    • Include guard at lines 181–183

After building, run a grep over your generated .cu kernels to ensure the header appears whenever this_grid().sync() is emitted:

grep -RP -C3 'this_grid\(\)\.sync\(\)' <build-output-dir>/*.cu | grep -C3 '#include <cooperative_groups.h>'
src/transform/warp_specialized_rewriter.cc (2)

324-334: LGTM! Barrier insertion correctly adapted for 1D TMA path.

The 1D TMA detection logic correctly checks if the first argument is a Call that is not create_tma_descriptor, and places the barrier at args[2] for 1D TMA vs args[1] for other cases. This aligns with the 1D TMA handling in inject_tma_barrier.cc.
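The shape of that dispatch, for readers following along (an illustrative sketch, not the literal pass code):

bool is_1d_tma =
    call->args[0]->IsInstance<CallNode>() &&
    !call->args[0].as<CallNode>()->op.same_as(create_tma_descriptor());
// 1D TMA carries the mbarrier at args[2]; descriptor-based copies keep it at args[1].
size_t barrier_arg_idx = is_1d_tma ? 2 : 1;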


638-639: New public accessor looks good.

The onlyHasWgMMA() accessor properly exposes the internal only_has_wgmma_ flag for external use.

src/transform/inject_tma_barrier.cc (3)

65-77: Robust implementation of 1D TMA copy-byte calculation.

The 1D TMA detection logic is consistent across TmaTraitsCollector, TmaExpectTxRewriter, and TmaBarrierRewriter. The calculation correctly uses call->args[3] for 1D TMA's byte size argument and falls back to the previous 2D TMA calculation for other cases.


168-174: Consistent 1D TMA barrier placement in TmaExpectTxRewriter.

The barrier placement at index 2 for 1D TMA and index 1 for non-1D is consistent with both TmaTraitsCollector and TmaBarrierRewriter, ensuring proper barrier handling across the transform pipeline.


458-463: Clear 1D TMA barrier handling with helpful comment.

The implementation maintains consistency with the barrier placement pattern seen throughout the file. The comment "1D TMA has mbar at args[2]" helps clarify the argument layout difference.

src/op/copy.cc (1)

791-861: Well-structured 1D TMA optimization with comprehensive contiguity checks.

The 1D TMA path implementation is thorough with:

  1. Clear contiguity checks for both shared and global tensors
  2. Proper validation that total elements match to prevent out-of-bounds access
  3. Correct offset calculations for both shared and global addresses
  4. Appropriate use of the simpler 1D TMA copy form with byte-length argument

The implementation aligns well with the barrier handling for 1D TMA in other files.
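As a mental model for check (1): a region of a dense row-major buffer flattens to a single contiguous span exactly when at most one dimension is partially covered and every dimension to its right is fully covered. A standalone sketch of that predicate (the real copy.cc logic works on TIR ranges and strides, not plain vectors):

#include <cstdint>
#include <vector>

bool RegionIsContiguous(const std::vector<int64_t> &shape,
                        const std::vector<int64_t> &extents) {
  size_t i = 0;
  while (i < shape.size() && extents[i] == 1) ++i;  // leading size-1 dims are free
  if (i < shape.size()) ++i;                        // one dimension may be partial
  for (; i < shape.size(); ++i)
    if (extents[i] != shape[i]) return false;       // the rest must be fully covered
  return true;
}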

tilelang/language/customize.py (4)

10-18: LGTM! Well-structured memory order mapping.

The memory order mapping follows the standard C++ memory model ordering with correct values (0-5). This provides a clean abstraction for CUDA atomic operations.
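Because the Python helpers pass bare integers that the device side reinterprets via cuda::memory_order(...), those 0-5 ids implicitly rely on the declaration order of the standard memory_order enum; a host-side compile-time check of that assumption:

#include <atomic>

static_assert(static_cast<int>(std::memory_order_relaxed) == 0, "relaxed id");
static_assert(static_cast<int>(std::memory_order_consume) == 1, "consume id");
static_assert(static_cast<int>(std::memory_order_acquire) == 2, "acquire id");
static_assert(static_cast<int>(std::memory_order_release) == 3, "release id");
static_assert(static_cast<int>(std::memory_order_acq_rel) == 4, "acq_rel id");
static_assert(static_cast<int>(std::memory_order_seq_cst) == 5, "seq_cst id");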


95-110: New atomic operations properly handle memory ordering.

The atomic_max and atomic_min functions correctly:

  • Support optional memory_order parameter (None for default)
  • Route to appropriate extern calls with or without memory order ID
  • Follow consistent pattern with atomic_add

152-159: Atomic_add memory ordering properly integrated.

The enhanced atomic_add function correctly:

  • Maintains backward compatibility (memory_order defaults to None)
  • Routes to AtomicAdd extern call with memory order when extent deduction fails
  • Preserves the existing region-based path for buffer operations

272-298: Well-designed atomic load/store operations.

The new atomic_load and atomic_store functions:

  • Use sensible default of "seq_cst" for sequential consistency
  • Properly map memory order strings to IDs
  • Follow consistent patterns with other atomic operations
  • Have clear docstrings explaining the parameters
src/tl_templates/cuda/common.h (5)

7-7: Good addition of cuda::atomic_ref header.

The inclusion of <cuda/atomic> enables the use of modern C++ atomic operations through cuda::atomic_ref, providing better type safety and memory ordering control.


122-134: Well-designed type normalization for atomic operations.

The normalize_atomic_type template with specializations for half_t and bfloat16_t provides a clean abstraction for mapping high-level types to their CUDA atomic-compatible counterparts. The arch-guard for bfloat16 is appropriate.


136-148: Clean implementation of type casting helpers.

The cuda_cast template with specializations for half and bfloat16 conversions provides type-safe casting between types. The arch-guard for bfloat16 support is correctly placed.
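
A minimal sketch of the two-helper pattern described above (illustrative names only; it assumes __half as the storage type for a 16-bit float wrapper):

#include <cuda_fp16.h>

// Map a high-level element type to the storage type the atomic machinery uses.
template <typename T> struct normalize_type { using type = T; };
// A wrapper type such as cutlass::half_t would specialize this to __half.

// Generic cast, plus a specialization where the conversion needs an intrinsic.
template <typename To, typename From>
__device__ To cast_to(From v) { return static_cast<To>(v); }

template <> __device__ __half cast_to<__half, float>(float v) {
  return __float2half(v);  // float -> __half via the CUDA intrinsic
}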


150-172: Modern and flexible atomic operations implementation.

The unified implementation using cuda::atomic_ref is excellent:

  • Provides memory ordering control with sensible defaults (relaxed)
  • Supports cross-type atomic operations through normalization and casting
  • Properly handles different numeric types through the template system
  • Consistent interface across AtomicAdd, AtomicMax, and AtomicMin
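
A condensed sketch of the pattern (not the common.h code; fetch_max is the libcu++ extension on cuda::atomic_ref, and relaxed ordering is shown as the assumed default):

#include <cuda/atomic>

template <typename T>
__device__ T atomic_max_ref(T *address, T value,
                            cuda::memory_order order = cuda::memory_order_relaxed) {
  // atomic_ref binds to existing storage, so no separate atomic object is needed.
  cuda::atomic_ref<T, cuda::thread_scope_device> ref(*address);
  return ref.fetch_max(value, order);  // returns the previous value
}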

203-213: Well-implemented atomic load/store operations.

The AtomicLoad and AtomicStore functions properly use cuda::atomic_ref with memory ordering support, completing the atomic operations API surface.
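
In the same spirit, load and store reduce to the following sketch (seq_cst shown as the default the Python side advertises; again illustrative rather than the actual template):

#include <cuda/atomic>

template <typename T>
__device__ T atomic_load_ref(T *address,
                             cuda::memory_order order = cuda::memory_order_seq_cst) {
  cuda::atomic_ref<T, cuda::thread_scope_device> ref(*address);
  return ref.load(order);
}

template <typename T>
__device__ void atomic_store_ref(T *address, T value,
                                 cuda::memory_order order = cuda::memory_order_seq_cst) {
  cuda::atomic_ref<T, cuda::thread_scope_device> ref(*address);
  ref.store(value, order);
}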

tilelang/transform/__init__.py (2)

424-427: Public wrapper for LayoutReducer looks consistent with existing pattern.

Matches other wrappers and exposes the pass to Python as intended.


424-427: FFI registration confirmed for LayoutReducer

The C++ implementation registers the pass factory via tvm::ffi::reflection under the expected symbol. No further changes are needed.

• In src/transform/layout_reducer.cc (lines 206–209), you’ll find:

TVM_FFI_STATIC_INIT_BLOCK({
  namespace refl = tvm::ffi::reflection;
  refl::GlobalDef().def("tl.transform.LayoutReducer", LayoutReducer);
});

This matches the Python binding _ffi_api.LayoutReducer().

tilelang/language/reduce.py (2)

190-203: Expose finalize_reducer intrinsic — good addition.

The API is straightforward and consistent with other tl ops. Returning a TIR call handle aligns with the rest of this module.


190-203: Operator registration and effect kind confirmed

The tl.finalize_reducer intrinsic is registered with the correct name and marked as opaque:

  • In src/op/finalize_reducer.cc at line 81:
    TIR_REGISTER_TL_OP(FinalizeReducer, finalize_reducer)
  • Immediately thereafter at line 84:
    .set_attr<TCallEffectKind>("TCallEffectKind", Integer(CallEffectKind::kOpaque))

No further changes are needed here.

tilelang/engine/phase.py (1)

72-73: Pass ordering: LayoutReducer before LayoutInference is sensible.

This should let reducer metadata inform layout inference without re-traversals. Nice placement right after Simplify.

src/op/reduce.cc (1)

16-16: Centralize target checks via TargetIsHopper — good cleanup.

This improves readability and avoids duplicated arch parsing logic.

src/transform/layout_inference.cc (3)

23-23: LGTM!

The addition of the layout_reducer.h include is appropriate to support the reducer metadata handling introduced in this file.


568-573: LGTM! Correct reducer information extraction.

The code correctly extracts the reducer metadata from loop annotations when present. The conditional check and type casting are handled properly to avoid failures when the annotation is absent.


629-629: Logical condition correctly prevents vectorization when reducers are present.

The vectorization gating logic appropriately combines has_non_local and !has_reducer conditions to avoid vectorizing loops that contain reducer operations, which aligns with the comment about needing to isolate reduction axes from vectorization.

src/op/parallel.cc (2)

198-202: LGTM! Correctly skips buffers with ALL replication.

The logic appropriately skips reducer buffers marked with ReducerRepType::ALL, preventing them from being used as source buffers for layout inference since they're replicated across all threads.


299-308: LGTM! Improved vectorization alignment with thread bounds.

The addition of loop total size calculation and its use in the vectorization constraint ensures proper alignment with thread bounds, preventing invalid vectorization configurations.

tilelang/language/__init__.py (1)

57-57: LGTM! Clean API surface expansion.

The new exports for reducer finalization and atomic operations are well-placed and follow the existing pattern. The imports maintain consistency with the module structure.

Also applies to: 61-62, 70-71

src/op/finalize_reducer.h (1)

20-34: LGTM! Well-structured operator class definition.

The FinalizeReducer class follows the established operator pattern with proper inheritance, member functions, and data members. The interface is clean and consistent with other operators in the codebase.

src/op/finalize_reducer.cc (1)

25-79: LGTM! Well-structured lowering implementation.

The lowering logic correctly handles reducer finalization with appropriate validations, conditional paths for Hopper targets, workspace allocation for larger thread counts, and proper loop construction. The implementation follows patterns established in the reduce operation.

src/transform/layout_reducer.h (1)

37-39: Attribute key looks good; ensure consistency with other attr keys.

kReducerInfo = "reducer_info" aligns with usage in the pass and in ParallelOp. No change required.

src/transform/layout_reducer.cc (2)

198-210: Pass registration and FFI exposure look good.

Creation via CreatePrimFuncPass and FFI static registration under "tl.transform.LayoutReducer" matches the Python wrapper. No change requested.


166-174: Use Integer(...) for reducer op and confirm Map::erase availability

To ensure type safety and clarity, replace the raw‐int cast with an explicit Integer constructor, and verify that your Map<…> supports erase(var). If it doesn’t, fall back to copying and erasing as shown below.

Suggested diff:

-      op->args.push_back((int)inside_reducer_range_.Get(var).value()->op);
-      inside_reducer_range_.erase(var);
+      // Push reducer op index as a PrimExpr Integer, not a raw int
+      op->args.push_back(
+          Integer(static_cast<int>(inside_reducer_range_.Get(var).value()->op)));
+      // Remove from active range
+      inside_reducer_range_.erase(var);

Fallback if Map::erase is unsupported in your TVM baseline:

auto tmp = inside_reducer_range_;
tmp.erase(var);
inside_reducer_range_ = std::move(tmp);

Please manually verify that Map::erase(const Key&) exists; if not, apply the fallback above.

Comment on lines +132 to +139
/*!
* \brief tvm intrinsics for barrier initialization fence
*
* ptx_fence_barrier_init()
*
*/
const Op &ptx_fence_barrier_init();


💡 Verification agent

🧩 Analysis chain

Export macro missing on new intrinsic (will break dynamic linking/FFI)

All neighboring intrinsics use TVM_DLL. The new declaration lacks it, risking missing symbol export from the shared library.

Apply:

-const Op &ptx_fence_barrier_init();
+TVM_DLL const Op &ptx_fence_barrier_init();

Verify the op is also defined/registered (to avoid undefined reference at link/runtime):


🏁 Script executed:

#!/bin/bash
# Expect at least one definition/registration for the new op
rg -n 'ptx_fence_barrier_init' -C2

Length of output: 339


🏁 Script executed:

#!/bin/bash
# Search for any implementation or registration of the new intrinsic
rg -n 'ptx_fence_barrier_init' -C2
rg -n 'ptx_fence_barrier_init' -C2 --glob '*.cc'
rg -n 'ptx_fence_barrier_init' -C2 --glob '*.cpp'
rg -n 'ptx_fence_barrier_init' -C2 --glob '*.c'

Length of output: 784


#!/bin/bash
# List potential implementation files under src/op
echo "=== Source files in src/op with C/C++ extensions ==="
find src/op -type f | grep -E "\.(cc|cpp|c|cu|cuh)$"

# Search ignoring gitignore for any mention of the intrinsic name
echo "=== Searching for 'ptx_fence_barrier_init' across all files ==="
rg -uu -n "ptx_fence_barrier_init" -C2 .

# Broader search for any 'ptx_fence' identifiers
echo "=== Searching for 'ptx_fence' across all files ==="
rg -uu -n "ptx_fence" -C2 .

# Search for any global registrations in src/op
echo "=== Searching for TVM_REGISTER_GLOBAL in src/op ==="
rg -n "TVM_REGISTER_GLOBAL" -C2 src/op

🏁 Script executed:

#!/bin/bash
# Force search across all files (including those ignored by .gitignore) for the intrinsic name
echo "=== Full repo search for 'ptx_fence_barrier_init' ==="
rg -uu -n "ptx_fence_barrier_init" -C2 .

# Broader search for any fence barrier related symbols
echo "=== Full repo search for 'fence_barrier_init' ==="
rg -uu -n "fence_barrier_init" -C2 .

# Check for global registration patterns in the entire repo
echo "=== Search for TVM_REGISTER_GLOBAL ==="
rg -uu -n "TVM_REGISTER_GLOBAL" -C2 .

Length of output: 1617


Add export macro and provide implementation/registration for ptx_fence_barrier_init

The new intrinsic declaration in src/op/builtin.h is missing the TVM_DLL export specifier and has no matching implementation or TVM_REGISTER_GLOBAL call, which will lead to undefined references at link/runtime.

• File src/op/builtin.h (line 138): the declaration

- const Op &ptx_fence_barrier_init();
+ TVM_DLL const Op &ptx_fence_barrier_init();

needs the TVM_DLL prefix to ensure the symbol is exported.
• No implementation or registration found for ptx_fence_barrier_init in the repository—add the corresponding definition in a .cc/.cu file (e.g., src/op/ptx_fence.cc) and register it with TVM_REGISTER_GLOBAL (or the appropriate registration API) following the pattern of other PTX intrinsics.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/*!
* \brief tvm intrinsics for barrier initialization fence
*
* ptx_fence_barrier_init()
*
*/
const Op &ptx_fence_barrier_init();
/*!
* \brief tvm intrinsics for barrier initialization fence
*
* ptx_fence_barrier_init()
*
*/
TVM_DLL const Op &ptx_fence_barrier_init();

Comment on lines 21 to 23
FinalizeReducer::FinalizeReducer(Array<PrimExpr> args, BufferMap vmap)
: reducer_(vmap[GetVarFromAccessPtr(args[0])]),
op_((ReducerOpType)*as_const_int(args[1])) {}

⚠️ Potential issue

Add bounds checking for args array access.

The constructor assumes args[1] exists without checking the array size, which could lead to out-of-bounds access.

Apply this diff to add validation:

 FinalizeReducer::FinalizeReducer(Array<PrimExpr> args, BufferMap vmap)
-    : reducer_(vmap[GetVarFromAccessPtr(args[0])]),
-      op_((ReducerOpType)*as_const_int(args[1])) {}
+    : reducer_(vmap[GetVarFromAccessPtr(args[0])]) {
+  ICHECK_EQ(args.size(), 2) << "FinalizeReducer expects exactly 2 arguments";
+  const int64_t* op_value = as_const_int(args[1]);
+  ICHECK(op_value) << "Second argument must be a constant integer";
+  op_ = (ReducerOpType)*op_value;
+}
🤖 Prompt for AI Agents
In src/op/finalize_reducer.cc around lines 21 to 23, the constructor uses
args[0] and args[1] without validating args size; add bounds checks (e.g.,
ensure args.size() >= 2) before accessing these elements and emit a clear
error/throw (or handle gracefully) when the check fails; only then proceed to
compute reducer_ and op_ to avoid out-of-bounds access and undefined behavior.

Comment on lines 47 to 48
std::array op_names{"tl::SumOp", "tl::MaxOp", "tl::MinOp"};
auto op_str = op_names[(int)op_];

⚠️ Potential issue

Add bounds checking for op_ enum cast.

The cast (int)op_ is used to index into op_names array without verifying that the enum value is within valid bounds.

Apply this diff to add validation:

 std::array op_names{"tl::SumOp", "tl::MaxOp", "tl::MinOp"};
+ICHECK(static_cast<int>(op_) >= 0 && static_cast<int>(op_) < op_names.size())
+    << "Invalid reducer operation type: " << static_cast<int>(op_);
 auto op_str = op_names[(int)op_];
🤖 Prompt for AI Agents
In src/op/finalize_reducer.cc around lines 47 to 48, the code indexes op_names
with (int)op_ without validating the enum value; add a bounds check that ensures
op_ is within [0, op_names.size()-1] (or compare against the enum's max) before
casting and indexing, and handle invalid values by logging/throwing or returning
a sensible default/error; implement the check and only perform the array access
when valid to avoid out-of-bounds access.

Comment on lines +127 to 133
auto reducer_info_map =
op->annotations.Get(attr::kReducerInfo)->as<Map<Var, ReducerInfo>>();
if (reducer_info_map) {
for (auto &&[buffer, info] : reducer_info_map.value())
p->reducer_info_map_.Set(buffer, info);
}
StmtExprVisitor::VisitStmt_(op);

⚠️ Potential issue

Check for null annotation before casting.

The code assumes attr::kReducerInfo exists in annotations and performs an unchecked cast. This could cause runtime errors if the annotation is missing.

Apply this diff to add null checking:

-  auto reducer_info_map =
-      op->annotations.Get(attr::kReducerInfo)->as<Map<Var, ReducerInfo>>();
-  if (reducer_info_map) {
-    for (auto &&[buffer, info] : reducer_info_map.value())
-      p->reducer_info_map_.Set(buffer, info);
-  }
+  if (op->annotations.count(attr::kReducerInfo)) {
+    auto reducer_info_map =
+        op->annotations.Get(attr::kReducerInfo)->as<Map<Var, ReducerInfo>>();
+    if (reducer_info_map) {
+      for (auto &&[buffer, info] : reducer_info_map.value())
+        p->reducer_info_map_.Set(buffer, info);
+    }
+  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto reducer_info_map =
op->annotations.Get(attr::kReducerInfo)->as<Map<Var, ReducerInfo>>();
if (reducer_info_map) {
for (auto &&[buffer, info] : reducer_info_map.value())
p->reducer_info_map_.Set(buffer, info);
}
StmtExprVisitor::VisitStmt_(op);
if (op->annotations.count(attr::kReducerInfo)) {
auto reducer_info_map =
op->annotations.Get(attr::kReducerInfo)->as<Map<Var, ReducerInfo>>();
if (reducer_info_map) {
for (auto &&[buffer, info] : reducer_info_map.value())
p->reducer_info_map_.Set(buffer, info);
}
}
StmtExprVisitor::VisitStmt_(op);
🤖 Prompt for AI Agents
In src/op/parallel.cc around lines 127 to 133, the code unconditionally
retrieves and casts the attr::kReducerInfo annotation which can be absent and
cause a runtime error; change the logic to first check that op->annotations
contains attr::kReducerInfo (or that Get(attr::kReducerInfo) is non-null) before
calling as<Map<Var, ReducerInfo>>(), only perform the cast and iterate when the
annotation is present, otherwise skip setting reducer_info_map_; ensure you
handle both optional/nullable returns consistently to avoid dereferencing null.

Comment on lines +76 to +87
auto result = IRMutatorWithAnalyzer::VisitStmt_(op).as<Block>().value();
// After iterating over the body, set all layout_map to block
auto p_result = result.CopyOnWrite();
auto layout_map = p_result->annotations.Get(attr::kLayoutMap)
->as<Map<Var, Layout>>()
.value_or(Map<Var, Layout>());
for (auto &&[k, v] : new_layout_map_)
layout_map.Set(k, v);
if (layout_map.size())
p_result->annotations.Set(attr::kLayoutMap, layout_map);
new_layout_map_.clear();
return result;

⚠️ Potential issue

Unsafe Optional dereference when merging kLayoutMap.

Get(attr::kLayoutMap)->as<...>() dereferences a possibly unset Optional and will CHECK-fail if the key is absent. Use a guarded Downcast path.

-    auto p_result = result.CopyOnWrite();
-    auto layout_map = p_result->annotations.Get(attr::kLayoutMap)
-                          ->as<Map<Var, Layout>>()
-                          .value_or(Map<Var, Layout>());
+    auto p_result = result.CopyOnWrite();
+    Map<Var, Layout> layout_map;
+    if (auto opt = p_result->annotations.Get(attr::kLayoutMap)) {
+      layout_map = Downcast<Map<Var, Layout>>(opt.value());
+    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto result = IRMutatorWithAnalyzer::VisitStmt_(op).as<Block>().value();
// After iterating over the body, set all layout_map to block
auto p_result = result.CopyOnWrite();
auto layout_map = p_result->annotations.Get(attr::kLayoutMap)
->as<Map<Var, Layout>>()
.value_or(Map<Var, Layout>());
for (auto &&[k, v] : new_layout_map_)
layout_map.Set(k, v);
if (layout_map.size())
p_result->annotations.Set(attr::kLayoutMap, layout_map);
new_layout_map_.clear();
return result;
auto result = IRMutatorWithAnalyzer::VisitStmt_(op).as<Block>().value();
// After iterating over the body, set all layout_map to block
auto p_result = result.CopyOnWrite();
Map<Var, Layout> layout_map;
if (auto opt = p_result->annotations.Get(attr::kLayoutMap)) {
layout_map = Downcast<Map<Var, Layout>>(opt.value());
}
for (auto &&[k, v] : new_layout_map_)
layout_map.Set(k, v);
if (layout_map.size())
p_result->annotations.Set(attr::kLayoutMap, layout_map);
new_layout_map_.clear();
return result;
🤖 Prompt for AI Agents
In src/transform/layout_reducer.cc around lines 76 to 87, the code currently
calls Get(attr::kLayoutMap)->as<Map<Var, Layout>>() which dereferences the
Optional returned by Get() and can CHECK-fail when the annotation is absent;
change to a guarded path: first capture the Optional returned by
p_result->annotations.Get(attr::kLayoutMap) into a local variable, check whether
it is defined (or has_value) and only then call its as<Map<Var, Layout>>()
(using value_or(Map<Var, Layout>()) if the downcast returns empty); otherwise
initialize an empty Map<Var, Layout>(). Then merge new_layout_map_ into that
safe map, set the annotation if non-empty, clear new_layout_map_, and return
result.

Comment on lines +90 to +109
Stmt VisitStmt_(const ForNode *op) final {
// only annotate the outermost loop
bool should_annotate = false;
if (inside_reducer_range_.size() > 0 && !already_annotated_) {
should_annotate = true;
already_annotated_ = true;
}

auto opt_result = IRMutatorWithAnalyzer::VisitStmt_(op).as<For>();
ICHECK(opt_result);
auto result = opt_result.value();

if (should_annotate) {
// we are leaving the current loop nest. later ones may annotate again
already_annotated_ = false;

auto p_result = result.CopyOnWrite();
p_result->annotations.Set(attr::kReducerInfo, inside_reducer_range_);

// Iterate over local.reducer.* buffers, append to reducer_op_map_, set

⚠️ Potential issue

Outermost-loop annotation logic is incorrect (pre-order check).

should_annotate is computed before visiting the loop body, but inside_reducer_range_ is only populated while visiting the body (on T.fill). As a result, the condition never becomes true for the intended loop and annotations are never emitted.

Minimal fix: decide after visiting the body and use a dedicated "seen reducer in this subtree" signal to support outermost selection.

One practical approach is to track a stack of booleans indicating whether a subtree contains a reducer. Annotate a loop iff its subtree saw a reducer but its parent did not.

 class ReducerLayoutAnnotator : public IRMutatorWithAnalyzer {
 public:
 private:
+  // Tracks whether a loop subtree contains any reducer region.
+  std::vector<bool> reducer_seen_stack_;
@@
-  Stmt VisitStmt_(const ForNode *op) final {
-    // only annotate the outermost loop
-    bool should_annotate = false;
-    if (inside_reducer_range_.size() > 0 && !already_annotated_) {
-      should_annotate = true;
-      already_annotated_ = true;
-    }
-
-    auto opt_result = IRMutatorWithAnalyzer::VisitStmt_(op).as<For>();
+  Stmt VisitStmt_(const ForNode *op) final {
+    // Enter loop: assume not seen yet in this subtree
+    reducer_seen_stack_.push_back(false);
+    auto opt_result = IRMutatorWithAnalyzer::VisitStmt_(op).as<For>();
     ICHECK(opt_result);
     auto result = opt_result.value();
 
-    if (should_annotate) {
-      // we are leaving the current loop nest. later ones may annotate again
-      already_annotated_ = false;
+    // Leaving loop: decide outermost annotation
+    bool saw_here = reducer_seen_stack_.back();
+    reducer_seen_stack_.pop_back();
+    bool parent_saw_before = !reducer_seen_stack_.empty() && reducer_seen_stack_.back();
+    if (saw_here && !parent_saw_before) {
+      // This is the outermost loop that encloses at least one reducer region

And propagate "seen" upward at the end of the method:

-    }
-    return result;
+    }
+    if (!reducer_seen_stack_.empty()) {
+      reducer_seen_stack_.back() = reducer_seen_stack_.back() || saw_here;
+    }
+    return result;

Finally, set reducer_seen_stack_.back() = true; in the Fill case (see below).

🤖 Prompt for AI Agents
In src/transform/layout_reducer.cc around lines 90 to 109, the pre-order check
computes should_annotate before visiting the loop body so inside_reducer_range_
is empty and outermost annotations are never emitted; change to a post-order
approach by maintaining a reducer_seen_stack_ (stack of booleans) pushed at loop
entry, call the base VisitStmt_/mutator first, then inspect
reducer_seen_stack_.back() to decide annotation: annotate the loop if the
current subtree saw a reducer and the parent (second-from-top) did not, then
propagate the current seen flag to the parent (pop current and set parent |=
current) before returning; ensure the Fill visitor sets
reducer_seen_stack_.back() = true when encountering a reducer so the loop
decision can be made after visiting its body.

Comment on lines 148 to 166
PrimExpr VisitExpr_(const CallNode *op_) final {
auto op_ref = IRMutatorWithAnalyzer::VisitExpr_(op_).as<Call>().value();
auto op = op_ref.CopyOnWrite();
if (op->op.same_as(Fill::Get())) {
ICHECK(op->args.size() > 0);
if (auto arg0_call = op->args[0].as<Call>();
arg0_call &&
arg0_call.value()->op.same_as(builtin::tvm_access_ptr())) {
ICHECK(arg0_call.value()->args.size() > 1);
if (auto var = arg0_call.value()->args[1].as<Var>();
var && reducer_info_map_.count(var.value())) {
ICHECK(inside_reducer_range_.count(var.value()) == 0)
<< "T.fill on reducer must be enclosed with a T.finalize_reducer "
"before next.";
inside_reducer_range_.Set(var.value(),
reducer_info_map_.Get(var.value()).value());
}
}
} else if (op->op.same_as(FinalizeReducer::Get())) {

🛠️ Refactor suggestion

Mark subtree as containing a reducer when encountering the Fill on a reducer buffer.

To support the outermost-loop annotation fix, ensure we flip the "seen" flag here.

-          inside_reducer_range_.Set(var.value(),
-                                    reducer_info_map_.Get(var.value()).value());
+          inside_reducer_range_.Set(var.value(),
+                                    reducer_info_map_.Get(var.value()).value());
+          if (!reducer_seen_stack_.empty()) {
+            reducer_seen_stack_.back() = true;
+          }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
PrimExpr VisitExpr_(const CallNode *op_) final {
auto op_ref = IRMutatorWithAnalyzer::VisitExpr_(op_).as<Call>().value();
auto op = op_ref.CopyOnWrite();
if (op->op.same_as(Fill::Get())) {
ICHECK(op->args.size() > 0);
if (auto arg0_call = op->args[0].as<Call>();
arg0_call &&
arg0_call.value()->op.same_as(builtin::tvm_access_ptr())) {
ICHECK(arg0_call.value()->args.size() > 1);
if (auto var = arg0_call.value()->args[1].as<Var>();
var && reducer_info_map_.count(var.value())) {
ICHECK(inside_reducer_range_.count(var.value()) == 0)
<< "T.fill on reducer must be enclosed with a T.finalize_reducer "
"before next.";
inside_reducer_range_.Set(var.value(),
reducer_info_map_.Get(var.value()).value());
}
}
} else if (op->op.same_as(FinalizeReducer::Get())) {
PrimExpr VisitExpr_(const CallNode *op_) final {
auto op_ref = IRMutatorWithAnalyzer::VisitExpr_(op_).as<Call>().value();
auto op = op_ref.CopyOnWrite();
if (op->op.same_as(Fill::Get())) {
ICHECK(op->args.size() > 0);
if (auto arg0_call = op->args[0].as<Call>();
arg0_call &&
arg0_call.value()->op.same_as(builtin::tvm_access_ptr())) {
ICHECK(arg0_call.value()->args.size() > 1);
if (auto var = arg0_call.value()->args[1].as<Var>();
var && reducer_info_map_.count(var.value())) {
ICHECK(inside_reducer_range_.count(var.value()) == 0)
<< "T.fill on reducer must be enclosed with a T.finalize_reducer "
"before next.";
inside_reducer_range_.Set(var.value(),
reducer_info_map_.Get(var.value()).value());
if (!reducer_seen_stack_.empty()) {
reducer_seen_stack_.back() = true;
}
}
}
} else if (op->op.same_as(FinalizeReducer::Get())) {
...
🤖 Prompt for AI Agents
In src/transform/layout_reducer.cc around lines 148 to 166, when you detect a
Fill on a reducer buffer you currently set inside_reducer_range_.Set(...) but do
not mark the reducer as "seen"; update the reducer state here so the reducer's
seen flag is flipped. Concretely, after confirming var and calling
inside_reducer_range_.Set(...), retrieve the reducer info for var and set its
seen flag (or update the reducer_info_map_ entry) to true so downstream
outermost-loop annotation logic recognizes this subtree contains a reducer.

Comment on lines +8 to +11
#include <tvm/tir/op.h>

#include "../layout/layout.h"


🛠️ Refactor suggestion

Include TVM runtime headers explicitly to avoid transitive-include breakage.

Object, ObjectRef, and String live under tvm/runtime. Relying on transitive includes from other headers is brittle.

Apply:

 #include <tvm/tir/op.h>
 
 #include "../layout/layout.h"
+// Explicit TVM runtime includes
+#include <tvm/runtime/object.h>
+#include <tvm/runtime/container.h>
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#include <tvm/tir/op.h>
#include "../layout/layout.h"
#include <tvm/tir/op.h>
#include "../layout/layout.h"
// Explicit TVM runtime includes
#include <tvm/runtime/object.h>
#include <tvm/runtime/container.h>
🤖 Prompt for AI Agents
In src/transform/layout_reducer.h around lines 8 to 11, the file currently
relies on transitive includes for TVM runtime types (Object, ObjectRef, String);
explicitly include the runtime headers instead: add #include
<tvm/runtime/object.h> and #include <tvm/runtime/string.h> (or the appropriate
tvm/runtime headers that define Object/ObjectRef and String) alongside the
existing includes so the file does not break if transitive includes are removed.

Comment on lines +22 to +26
ReducerInfoNode() = default;
ReducerInfoNode(const String &op_str, const String &rep_str);
static constexpr const char *_type_key = "tl.ReducerInfo";
TVM_DECLARE_FINAL_OBJECT_INFO(ReducerInfoNode, Object);
};

⚠️ Potential issue

Avoid uninitialized enum fields in the default constructor.

ReducerInfoNode() leaves op and rep uninitialized, which is UB if a default-constructed node ever escapes (e.g., via reflection or deserialization). Either delete the default ctor or give safe defaults.

Two options (pick one):

Option A (initialize):

-  ReducerInfoNode() = default;
+  ReducerInfoNode() : op(ReducerOpType::SUM), rep(ReducerRepType::NONE) {}

Option B (disallow):

-  ReducerInfoNode() = default;
+  ReducerInfoNode() = delete;
🤖 Prompt for AI Agents
In src/transform/layout_reducer.h around lines 22 to 26, the default constructor
ReducerInfoNode() leaves the enum fields op and rep uninitialized; either
initialize them to safe defaults or remove the default constructor to avoid UB.
Fix option A: change the default ctor to initialize op and rep to a known safe
enum value (e.g., Op::kNone or Rep::kDefault) so any default-constructed node
has defined values. Fix option B: delete the default constructor (or mark it
=delete) so callers cannot default-construct the node; ensure any existing sites
create instances via the parameterized constructor or provide a factory that
sets valid enums. Ensure header includes or forward-declares the enums used for
the default values if you pick option A.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/op/parallel.cc (1)

53-55: Fix invalid use of ->dtype on PrimExpr in Range construction

ReplicateExtent() returns PrimExpr; use .dtype() or avoid specifying dtype entirely. The current ->dtype will not compile.

-  analyzer_.Bind(rep_small,
-                 Range(IntImm(small_frag->ReplicateExtent()->dtype, 0),
-                       small_frag->ReplicateExtent()),
-                 true); // Bind the replicate extent of small_frag.
+  analyzer_.Bind(
+      rep_small,
+      Range::FromMinExtent(0, small_frag->ReplicateExtent()),
+      true); // Bind the replicate extent of small_frag.
♻️ Duplicate comments (1)
src/op/parallel.cc (1)

127-132: Guard reducer annotation before casting to avoid null deref

op->annotations.Get(attr::kReducerInfo) may be empty; calling ->as<...>() directly can crash. Add an Optional check before casting. This mirrors patterns elsewhere in the codebase.

-  auto reducer_info_map =
-      op->annotations.Get(attr::kReducerInfo)->as<Map<Var, ReducerInfo>>();
-  if (reducer_info_map) {
-    for (auto &&[buffer, info] : reducer_info_map.value())
-      p->reducer_info_map_.Set(buffer, info);
-  }
+  if (auto obj = op->annotations.Get(attr::kReducerInfo)) {
+    if (auto reducer_info_map = obj.value().as<Map<Var, ReducerInfo>>()) {
+      for (auto &&[buffer, info] : reducer_info_map.value()) {
+        p->reducer_info_map_.Set(buffer, info);
+      }
+    }
+  }
🧹 Nitpick comments (2)
src/op/parallel.cc (2)

299-308: Tighten loop_total_size computation and avoid repeated as()

Minor robustness/readability: iterate using a cached For handle and start from maybe_remapped_root_ (even if extents don’t change, it’s consistent with the vector-size path).

-      PrimExpr loop_total_size = 1;
-      for (Stmt l = root_; l.as<For>().has_value();
-           l = l.as<For>().value()->body)
-        loop_total_size = loop_total_size * l.as<For>().value()->extent;
-      while (!analyzer_.CanProve(
-                 floormod(loop_total_size,
-                          T.thread_bounds->extent * vector_size) == 0) &&
-             vector_size > 1)
-        vector_size /= 2;
+      PrimExpr loop_total_size = 1;
+      for (Stmt l = maybe_remapped_root_; l.as<For>().has_value();) {
+        auto f = l.as<For>().value();
+        loop_total_size = loop_total_size * f->extent;
+        l = f->body;
+      }
+      while (vector_size > 1 &&
+             !analyzer_.CanProve(
+                 floormod(loop_total_size,
+                          T.thread_bounds->extent * vector_size) == 0)) {
+        vector_size /= 2;
+      }

354-365: Improve replication-compat error for undefined source_buffer

If source_buffer is not defined (e.g., derived from read buffer path), the error message prints an empty buffer. Use a fallback to the current buffer for clearer diagnostics.

-        std::ostringstream oss;
-        oss << "Layout infer conflict between" << buffer << " and "
-            << source_buffer << " in T.Parallel loop:" << std::endl
+        std::ostringstream oss;
+        const Buffer &culprit = source_buffer.defined() ? source_buffer : buffer;
+        oss << "Layout infer conflict between " << buffer << " and "
+            << culprit << " in T.Parallel loop:" << std::endl
             << "    loop " << loop_layout_->DebugOutput() << std::endl
             << "    fragment " << fragment->DebugOutput() << std::endl;
         throw LayoutConflictException(oss.str());
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4c0f978 and a23a370.

📒 Files selected for processing (1)
  • src/op/parallel.cc (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/op/parallel.cc (1)
src/transform/layout_inference.cc (18)
  • op (39-45)
  • op (39-39)
  • op (294-322)
  • op (294-294)
  • op (346-369)
  • op (346-346)
  • op (371-388)
  • op (371-371)
  • op (390-399)
  • op (390-390)
  • op (553-565)
  • op (553-553)
  • op (567-639)
  • op (567-567)
  • op (641-650)
  • op (641-641)
  • buffer (338-344)
  • buffer (338-338)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-test-nvidia
  • GitHub Check: build-test-amd
🔇 Additional comments (1)
src/op/parallel.cc (1)

199-202: Confirm reducer_info_map_ keying and ALL semantics

Sanity-check that reducer_info_map_ is keyed by Var (i.e., buffer->data) and that skipping ReducerRepType::ALL matches the intended “ignore reducer buffers when choosing the source” semantics. If mixed keys or different rep semantics exist, selection may skip the wrong buffers.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/op/parallel.cc (1)

127-132: Fix unchecked annotation cast (possible null deref).

Get(attr::kReducerInfo)->as<...>() can dereference a missing annotation. Mirror the safe Get(...)+as(...) pattern used for coalesced_width.

-  auto reducer_info_map =
-      op->annotations.Get(attr::kReducerInfo)->as<Map<Var, ReducerInfo>>();
-  if (reducer_info_map) {
-    for (auto &&[buffer, info] : reducer_info_map.value())
-      p->reducer_info_map_.Set(buffer, info);
-  }
+  if (auto obj = op->annotations.Get(attr::kReducerInfo)) {
+    if (auto reducer_info_map = obj.value().as<Map<Var, ReducerInfo>>()) {
+      for (auto &&[buf_var, info] : reducer_info_map.value()) {
+        p->reducer_info_map_.Set(buf_var, info);
+      }
+    }
+  }
🧹 Nitpick comments (1)
src/op/parallel.cc (1)

299-308: Make vector-size adjustment loop more robust and readable.

Minor refactor: avoid repeated l.as<For>() calls, prefer CanProveEqual, and simplify the traversal. Behavior unchanged.

-      PrimExpr loop_total_size = 1;
-      for (Stmt l = root_; l.as<For>().has_value();
-           l = l.as<For>().value()->body)
-        loop_total_size = loop_total_size * l.as<For>().value()->extent;
-      while (!analyzer_.CanProve(
-                 floormod(loop_total_size,
-                          T.thread_bounds->extent * vector_size) == 0) &&
-             vector_size > 1)
-        vector_size /= 2;
+      PrimExpr loop_total_size = 1;
+      Stmt l = root_;
+      while (auto of = l.as<For>()) {
+        loop_total_size = loop_total_size * of.value()->extent;
+        l = of.value()->body;
+      }
+      while (vector_size > 1 &&
+             !analyzer_.CanProveEqual(
+                 floormod(loop_total_size,
+                          T.thread_bounds->extent * vector_size),
+                 make_const(DataType::Int(32), 0))) {
+        vector_size >>= 1;
+      }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 19c0fc8 and 5401c94.

📒 Files selected for processing (1)
  • src/op/parallel.cc (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/op/parallel.cc (1)
src/transform/layout_inference.cc (18)
  • op (39-45)
  • op (39-39)
  • op (294-322)
  • op (294-294)
  • op (346-369)
  • op (346-346)
  • op (371-388)
  • op (371-371)
  • op (390-399)
  • op (390-390)
  • op (553-565)
  • op (553-553)
  • op (567-639)
  • op (567-567)
  • op (641-650)
  • op (641-641)
  • buffer (338-344)
  • buffer (338-338)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: format-check
  • GitHub Check: build-test-amd
🔇 Additional comments (1)
src/op/parallel.cc (1)

198-202: Rep=ALL exclusion is correct.

Skipping buffers whose reducer rep is ALL avoids polluting source selection for loop layout inference. No change needed.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/op/copy.cc (2)

907-907: Bug: reversed contiguity check for global tensor in fallback 1D path.

!global_tensor->strides.empty() treats strided tensors as contiguous. This will misclassify non-dense layouts.

-    bool global_is_contiguous = !global_tensor->strides.empty();
+    bool global_is_contiguous = global_tensor->strides.empty();

1036-1039: Fix potential out-of-bounds read in while condition.

The current order dereferences shared_range[s_range_idx] before checking the bound.

-    while (is_one(shared_range[s_range_idx]->extent) &&
-           s_range_idx < shared_range.size()) {
+    while (s_range_idx < shared_range.size() &&
+           is_one(shared_range[s_range_idx]->extent)) {
🧹 Nitpick comments (2)
src/op/copy.cc (2)

855-861: Avoid variable shadowing for global_offset.

global_offset is re-declared inside this block, shadowing the earlier one. Reuse the existing variable to reduce confusion.

-      PrimExpr global_offset = 0;
+      global_offset = 0;

851-854: Consistency: use the same buffer when computing access_ptr offsets.

In the new 1D path you use shared_tensor (potentially remapped) with offset computed pre-remap; in the later path you use shared_tensor_before_remap. Prefer the latter here for clarity/safety.

-      PrimExpr shared_addr =
-          shared_tensor.access_ptr(is_load ? 2 : 1, DataType::Handle(), 1,
-                                   offset, shared_total_elements);
+      PrimExpr shared_addr =
+          shared_tensor_before_remap.access_ptr(is_load ? 2 : 1, DataType::Handle(), 1,
+                                                offset, shared_total_elements);

Also applies to: 954-957

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 5401c94 and d9443a1.

📒 Files selected for processing (3)
  • src/op/copy.cc (1 hunks)
  • src/transform/storage_access.cc (1 hunks)
  • src/transform/warp_specialized_rewriter.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/transform/warp_specialized_rewriter.cc
🧰 Additional context used
🧬 Code graph analysis (1)
src/op/copy.cc (3)
src/transform/inject_tma_barrier.cc (16)
  • op (80-85)
  • op (80-80)
  • op (118-128)
  • op (118-118)
  • op (130-161)
  • op (130-130)
  • op (163-177)
  • op (163-163)
  • op (202-227)
  • op (202-202)
  • op (229-237)
  • op (229-229)
  • op (288-292)
  • op (288-288)
  • op (294-309)
  • op (294-294)
src/tl_templates/cuda/copy_sm90.h (12)
  • tma_load (18-27)
  • tma_load (43-60)
  • tma_load (64-82)
  • tma_load (86-104)
  • tma_load (107-126)
  • tma_load (130-150)
  • tma_store (175-182)
  • tma_store (185-195)
  • tma_store (198-209)
  • tma_store (212-223)
  • tma_store (226-238)
  • tma_store (241-253)
tilelang/language/builtin.py (1)
  • tma_load (67-76)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-test-amd
  • GitHub Check: bot-task
🔇 Additional comments (2)
src/op/copy.cc (2)

804-878: Good addition: 1D TMA fast-path with tight contiguity/OOB checks.

The early-return path is well-gated (contiguity, element-count equality, thread guard) and correctly uses the pointer-style 1D TMA intrinsics with mbarrier placeholder for loads only.


1149-1186: Confirm the PTX emitter consumes the eviction_policy argument: descriptor calls append eviction_policy as the final argument. Verify that the CUDA codegen (PTX emitter or TL templates) reads this value to set the cache_hint; otherwise it will be silently ignored.

Comment on lines +247 to +248
With<arith::ConstraintContext> constraint(
&analyzer_, analyzer_.rewrite_simplify(Not(real_condition)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix Analyzer API call: use Simplify/RewriteSimplify (camel-case); snake_case won't compile

Analyzer methods are PascalCase in C++ (see other calls like Bind/Simplify in this file). analyzer_.rewrite_simplify(...) is invalid and will fail to build. Use Simplify for consistency, or RewriteSimplify if you specifically want rewrite-only simplifications.

-      With<arith::ConstraintContext> constraint(
-          &analyzer_, analyzer_.rewrite_simplify(Not(real_condition)));
+      With<arith::ConstraintContext> constraint(
+          &analyzer_, analyzer_.Simplify(Not(real_condition)));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
With<arith::ConstraintContext> constraint(
&analyzer_, analyzer_.rewrite_simplify(Not(real_condition)));
With<arith::ConstraintContext> constraint(
&analyzer_, analyzer_.Simplify(Not(real_condition)));
🤖 Prompt for AI Agents
In src/transform/storage_access.cc around lines 247-248, the call to
analyzer_.rewrite_simplify(...) uses snake_case and will not compile because the
Analyzer API uses PascalCase; replace
analyzer_.rewrite_simplify(Not(real_condition)) with the correct PascalCase
method — either analyzer_.Simplify(Not(real_condition)) for general
simplification or analyzer_.RewriteSimplify(Not(real_condition)) if you
specifically want rewrite-only simplifications — keeping the rest of the
With<arith::ConstraintContext> construction unchanged.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
testing/python/autotune/test_tilelang_autotune_with_inputs.py (1)

133-135: Switching to a 1024³ problem size is a solid, CI-friendly choice

Cuts runtime/memory while staying aligned with block sizes (64, 64, 32). No functional concerns.

Consider a small readability tweak:

 def test_autotune_matmul():
-    run_autotune(1024, 1024, 1024)
+    M = N = K = 1024
+    run_autotune(M, N, K)
testing/python/autotune/test_tilelang_autotune.py (2)

259-262: Updated get_configs to 1024³: LGTM

Keeps search space valid and materially reduces test time.

To avoid magic numbers:

 def test_autotune_get_configs():
-    get_configs(1024, 1024, 1024, with_roller=True)
-    get_configs(1024, 1024, 1024, with_roller=False)
+    size = 1024
+    get_configs(size, size, size, with_roller=True)
+    get_configs(size, size, size, with_roller=False)

264-267: Updated matmul to 1024³: LGTM

Appropriate for CI; keeps tiling multiples clean.

Mirror the same constant to reduce duplication:

 def test_autotune_matmul():
-    matmul(1024, 1024, 1024, with_roller=True)
-    matmul(1024, 1024, 1024, with_roller=False)
+    size = 1024
+    matmul(size, size, size, with_roller=True)
+    matmul(size, size, size, with_roller=False)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d9443a1 and 4c1aa26.

📒 Files selected for processing (2)
  • testing/python/autotune/test_tilelang_autotune.py (1 hunks)
  • testing/python/autotune/test_tilelang_autotune_with_inputs.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/autotune/test_tilelang_autotune.py (1)
examples/gemm/example_gemm_autotune.py (2)
  • get_configs (22-105)
  • matmul (199-236)

…ated methods

- Refactored FinalizeReducer class to FinalizeReducerOp, updating constructor and method signatures for consistency with the new TileOperator structure.
- Enhanced layout inference and cloning methods in FinalizeReducerOpNode.
- Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main.
- Adjusted header inclusions for improved organization and clarity across multiple files.
…le_flash_attention.py

- Enhanced atomic operations (Add, Min, Max) in common.h to handle half and bfloat16 types more efficiently.
- Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main, improving test organization.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/tl_templates/cuda/common.h (2)

195-203: BF16x2 arch guard: use __CUDA_ARCH__ >= 800

Correct the conditional for availability of BF16 atomics.

-#if (defined(__CUDA_ARCH_LIST__) && (__CUDA_ARCH_LIST__ > 750))
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
 ...
 #endif

205-216: Float vector atomics guard: use __CUDA_ARCH__ >= 900

Replace non-standard __CUDA_ARCH_LIST__.

-#if (defined(__CUDA_ARCH_LIST__) && (__CUDA_ARCH_LIST__ >= 900))
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
 ...
 #endif
src/transform/layout_inference.cc (1)

563-566: Fix typo: "local.framgent" -> "local.fragment".

This check never fires due to the misspelling and can mask missing fragment layouts.

Apply:

-      if (buffer.scope() == "local.framgent") {
+      if (buffer.scope() == "local.fragment") {
♻️ Duplicate comments (10)
src/tl_templates/cuda/common.h (4)

46-48: Fix format string: the current macro misuses kernel_name and breaks for non-literal kernel names

Use a "%s" placeholder to support variable kernel names and avoid accidental format parsing.

-      snprintf(error_buf, ERROR_BUF_SIZE, kernel_name ": %s - %s",             \
+      snprintf(error_buf, ERROR_BUF_SIZE, "%s: %s - %s", kernel_name,          \
                cudaGetErrorName(__err), cudaGetErrorString(__err));            \

136-148: Complete cuda_cast coverage for wrapper/storage types (both directions)

Casts between CUTLASS wrappers and CUDA storage are missing; AtomicLoad/Store need reverse casts.

 template <typename T1, typename T2> TL_DEVICE T1 cuda_cast(T2 val) {
   return T1(val);
 }
 
 template <> TL_DEVICE half cuda_cast<half, float>(float val) {
   return __float2half(val);
 }
 
-#if (defined(__CUDA_ARCH_LIST__) && (__CUDA_ARCH_LIST__ > 750))
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
 template <> TL_DEVICE __nv_bfloat16 cuda_cast<__nv_bfloat16, float>(float val) {
   return __float2bfloat16(val);
 }
 #endif
+
+// CUTLASS wrapper -> CUDA storage
+template <> TL_DEVICE half cuda_cast<half, half_t>(half_t val) { return val.to_half(); }
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
+template <> TL_DEVICE __nv_bfloat16 cuda_cast<__nv_bfloat16, bfloat16_t>(bfloat16_t val) {
+  return __float2bfloat16(float(val));
+}
+#endif
+
+// CUDA storage -> CUTLASS wrapper (for AtomicLoad)
+template <> TL_DEVICE half_t cuda_cast<half_t, half>(half val) { return half_t(val); }
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
+template <> TL_DEVICE bfloat16_t cuda_cast<bfloat16_t, __nv_bfloat16>(__nv_bfloat16 val) {
+  return bfloat16_t(__bfloat162float(val));
+}
+#endif

218-222: AtomicLoad: add default order, normalize storage, and cast back to API type

Current version lacks default and builds atomic_ref<T> from T&.

-template <typename T> TL_DEVICE T AtomicLoad(T *address, int memory_order) {
-  cuda::atomic_ref<T, cuda::thread_scope_device> aref(*address);
-  return aref.load(cuda::memory_order(memory_order));
-}
+template <typename T>
+TL_DEVICE T AtomicLoad(T* address,
+                       cuda::memory_order mo = cuda::memory_order_relaxed) {
+  using NT = typename normalize_atomic_type<T>::type;
+  static_assert(sizeof(NT) == sizeof(T) && alignof(NT) == alignof(T), "Type mismatch");
+  NT* addr = reinterpret_cast<NT*>(address);
+  cuda::atomic_ref<NT, cuda::thread_scope_device> aref(*addr);
+  NT v = aref.load(mo);
+  return cuda_cast<T>(v);
+}

223-229: AtomicStore: mirror AtomicLoad (default order + normalize + cast)

Construct atomic_ref<NT1> from NT1&.

-template <typename T1, typename T2>
-TL_DEVICE void AtomicStore(T1 *address, T2 value, int memory_order) {
-  using NT1 = typename normalize_atomic_type<T1>::type;
-  cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*address);
-  aref.store(cuda_cast<NT1>(value), cuda::memory_order(memory_order));
-}
+template <typename T1, typename T2>
+TL_DEVICE void AtomicStore(T1* address, T2 value,
+                           cuda::memory_order mo = cuda::memory_order_relaxed) {
+  using NT1 = typename normalize_atomic_type<T1>::type;
+  static_assert(sizeof(NT1) == sizeof(T1) && alignof(NT1) == alignof(T1), "Type mismatch");
+  NT1* addr = reinterpret_cast<NT1*>(address);
+  cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*addr);
+  aref.store(cuda_cast<NT1>(value), mo);
+}
src/transform/layout_reducer.cc (5)

23-26: Register ReducerInfoNode for FFI/reflection.

Without registration, containers/FFI may fail at runtime.

 using arith::IRMutatorWithAnalyzer;
 
+TVM_REGISTER_OBJECT_TYPE(ReducerInfoNode);

65-71: Validate presence of "op" and "rep" keys in reducer_info.

Avoid unchecked .value() to prevent crashes on malformed annotations.

-      for (auto &&[var, rep] : map.value()) {
-        reducer_info_map_.Set(
-            var, ReducerInfo{rep.Get("op").value(), rep.Get("rep").value()});
-      }
+      for (auto &&[var, rep] : map.value()) {
+        auto op_opt = rep.Get("op");
+        auto rp_opt = rep.Get("rep");
+        ICHECK(op_opt && rp_opt)
+            << "reducer_info for var " << var->name_hint
+            << " must contain keys {\"op\", \"rep\"}";
+        reducer_info_map_.Set(var, ReducerInfo{op_opt.value(), rp_opt.value()});
+      }

79-87: Guard kLayoutMap merge; avoid Optional deref.

Current code can CHECK-fail when the annotation is absent.

-    auto p_result = result.CopyOnWrite();
-    auto layout_map = p_result->annotations.Get(attr::kLayoutMap)
-                          ->as<Map<Var, Layout>>()
-                          .value_or(Map<Var, Layout>());
+    auto p_result = result.CopyOnWrite();
+    Map<Var, Layout> layout_map;
+    if (auto opt = p_result->annotations.Get(attr::kLayoutMap)) {
+      layout_map = Downcast<Map<Var, Layout>>(opt.value());
+    }
     for (auto &&[k, v] : new_layout_map_)
       layout_map.Set(k, v);
     if (layout_map.size())
       p_result->annotations.Set(attr::kLayoutMap, layout_map);

90-141: Outermost-loop annotation is computed pre-order and never triggers. Switch to post-order with a reducer_seen_stack.

Annotate the outermost loop whose subtree contains a reducer; propagate the “seen” flag upward.

-  Stmt VisitStmt_(const ForNode *op) final {
-    // only annotate the outermost loop
-    bool should_annotate = false;
-    if (inside_reducer_range_.size() > 0 && !already_annotated_) {
-      should_annotate = true;
-      already_annotated_ = true;
-    }
-
-    auto opt_result = IRMutatorWithAnalyzer::VisitStmt_(op).as<For>();
+  Stmt VisitStmt_(const ForNode *op) final {
+    // Enter: track reducer presence in this subtree
+    reducer_seen_stack_.push_back(false);
+    auto opt_result = IRMutatorWithAnalyzer::VisitStmt_(op).as<For>();
     ICHECK(opt_result);
     auto result = opt_result.value();
 
-    if (should_annotate) {
-      // we are leaving the current loop nest. later ones may annotate again
-      already_annotated_ = false;
+    // Leave: decide outermost annotation
+    bool saw_here = reducer_seen_stack_.back();
+    reducer_seen_stack_.pop_back();
+    bool parent_saw = !reducer_seen_stack_.empty() && reducer_seen_stack_.back();
+    if (saw_here && !parent_saw) {
 
       auto p_result = result.CopyOnWrite();
       p_result->annotations.Set(attr::kReducerInfo, inside_reducer_range_);
@@
       }
       new_layout_map_.Set(buffer->data, f);
     }
-  }
-  return result;
+    }
+    if (!reducer_seen_stack_.empty()) {
+      reducer_seen_stack_.back() = reducer_seen_stack_.back() || saw_here;
+    }
+    return result;

And add the field:

 private:
+  // Tracks whether a loop subtree contains any reducer region.
+  std::vector<bool> reducer_seen_stack_;

148-166: Mark reducer presence on Fill to drive outermost-loop decision.

Flip the “seen” flag when encountering a reducer Fill.

           inside_reducer_range_.Set(var.value(),
                                     reducer_info_map_.Get(var.value()).value());
+          if (!reducer_seen_stack_.empty()) {
+            reducer_seen_stack_.back() = true;
+          }
src/op/finalize_reducer.cc (1)

50-52: Fix std::array CTAD and add enum bounds check before indexing.

Portable across toolchains and prevents OOB on invalid enum.

-  std::array op_names{"tl::SumOp", "tl::MaxOp", "tl::MinOp"};
-  auto op_str = op_names[(int)op];
+  static constexpr const char* op_names[3] = {"tl::SumOp", "tl::MaxOp", "tl::MinOp"};
+  ICHECK(static_cast<int>(op) >= 0 && static_cast<int>(op) < 3)
+      << "Invalid reducer op enum: " << static_cast<int>(op);
+  const char* op_str = op_names[static_cast<int>(op)];
🧹 Nitpick comments (4)
src/op/copy.cc (2)

814-814: Prefer const reference for Op handle.

Use a const reference for consistency with the rest of the file and to avoid unnecessary handle copies.

-auto op = is_load ? tma_load() : tma_store();
+const Op &op = is_load ? tma_load() : tma_store();

837-848: Strengthen global contiguity check.

g_cont is set only when global_tensor->strides is empty, and dimensions are disqualified only after a non-full dimension has been encountered. A tensor with explicit but canonical (row-major) strides is still contiguous, yet this path is skipped for it. Either accept canonical strides or make the intent explicit with a comment; consider normalizing strides to a "contiguous" check rather than requiring .empty(), as sketched below.
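
A minimal sketch of the kind of check being suggested, assuming an arith::Analyzer is available as elsewhere in this file; the helper name is illustrative, not an existing function in copy.cc:

// Sketch only: treat a buffer as contiguous when its strides are either absent
// (implicit row-major) or provably equal to the canonical row-major strides.
bool IsCanonicalContiguous(const Buffer &buf, arith::Analyzer *analyzer) {
  if (buf->strides.empty()) return true;  // implicit layout is contiguous
  PrimExpr expected = IntImm(buf->strides[0].dtype(), 1);
  for (int i = static_cast<int>(buf->shape.size()) - 1; i >= 0; --i) {
    if (!analyzer->CanProveEqual(buf->strides[i], expected)) return false;
    expected = expected * buf->shape[i];
  }
  return true;
}

g_cont could then be derived from such a helper plus the existing per-dimension range checks, instead of bailing out whenever explicit strides are present.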

src/op/finalize_reducer.cc (1)

67-71: Reuse cached total_threads for workspace sizing.

Avoid duplicate as_const_int calls.

-  if (reducing_threads >= 32) {
-    PrimExpr workspace =
-        T.AddWorkspace(*as_const_int(T.thread_bounds->extent), buffer->dtype);
+  if (reducing_threads >= 32) {
+    PrimExpr workspace = T.AddWorkspace(total_threads, buffer->dtype);
     thread_reduce_args.push_back(workspace);
   }
src/op/finalize_reducer.h (1)

18-19: Avoid using namespace in headers.

Prevents namespace pollution for dependents.

-using namespace tir;
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4c1aa26 and 67c1263.

📒 Files selected for processing (10)
  • src/op/builtin.h (1 hunks)
  • src/op/copy.cc (1 hunks)
  • src/op/finalize_reducer.cc (1 hunks)
  • src/op/finalize_reducer.h (1 hunks)
  • src/op/parallel.cc (3 hunks)
  • src/op/parallel.h (2 hunks)
  • src/op/reduce.cc (2 hunks)
  • src/tl_templates/cuda/common.h (4 hunks)
  • src/transform/layout_inference.cc (3 hunks)
  • src/transform/layout_reducer.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/op/builtin.h
  • src/op/parallel.h
  • src/op/reduce.cc
  • src/op/parallel.cc
🧰 Additional context used
🧬 Code graph analysis (6)
src/op/finalize_reducer.h (3)
src/transform/layout_reducer.h (1)
  • tl (13-42)
src/op/operator.h (2)
  • TileOperatorNode (54-95)
  • TileOperator (61-94)
src/op/finalize_reducer.cc (7)
  • Lower (28-82)
  • Lower (28-29)
  • InferLayout (84-89)
  • InferLayout (84-85)
  • Clone (91-94)
  • Clone (91-91)
  • FinalizeReducerOp (21-26)
src/transform/layout_reducer.cc (4)
src/op/parallel.cc (10)
  • op (96-106)
  • op (96-96)
  • op (108-116)
  • op (108-108)
  • VisitStmt_ (122-134)
  • VisitStmt_ (122-122)
  • VisitStmt_ (136-148)
  • VisitStmt_ (136-136)
  • VisitExpr_ (150-161)
  • VisitExpr_ (150-150)
src/transform/layout_inference.cc (20)
  • op (40-46)
  • op (40-40)
  • op (296-324)
  • op (296-296)
  • op (348-371)
  • op (348-348)
  • op (373-390)
  • op (373-373)
  • op (392-401)
  • op (392-392)
  • op (559-571)
  • op (559-559)
  • buffer (340-346)
  • buffer (340-340)
  • layout_map (427-536)
  • layout_map (427-428)
  • f (284-293)
  • f (284-284)
  • f (541-551)
  • f (541-541)
tilelang/language/tir/op.py (2)
  • indexmod (2890-2915)
  • tvm_access_ptr (650-675)
tilelang/transform/__init__.py (1)
  • LayoutReducer (424-427)
src/op/finalize_reducer.cc (6)
src/transform/layout_reducer.h (1)
  • ReducerOpType (15-41)
src/op/reduce.cc (4)
  • Clone (47-50)
  • Clone (47-47)
  • Clone (52-55)
  • Clone (52-52)
src/transform/layout_reducer.cc (8)
  • op (48-60)
  • op (48-48)
  • op (62-88)
  • op (62-62)
  • op (90-141)
  • op (90-90)
  • op (143-146)
  • op (143-143)
src/target/utils.cc (2)
  • TargetIsHopper (49-54)
  • TargetIsHopper (49-49)
src/op/operator.h (1)
  • TileOperator (61-94)
tilelang/language/reduce.py (1)
  • finalize_reducer (190-203)
src/op/copy.cc (3)
src/transform/inject_tma_barrier.cc (16)
  • op (80-85)
  • op (80-80)
  • op (118-128)
  • op (118-118)
  • op (130-161)
  • op (130-130)
  • op (163-177)
  • op (163-163)
  • op (202-227)
  • op (202-202)
  • op (229-237)
  • op (229-229)
  • op (288-292)
  • op (288-288)
  • op (294-309)
  • op (294-294)
src/tl_templates/cuda/copy_sm90.h (12)
  • tma_load (18-27)
  • tma_load (43-60)
  • tma_load (64-82)
  • tma_load (86-104)
  • tma_load (107-126)
  • tma_load (130-150)
  • tma_store (175-182)
  • tma_store (185-195)
  • tma_store (198-209)
  • tma_store (212-223)
  • tma_store (226-238)
  • tma_store (241-253)
tilelang/language/builtin.py (1)
  • tma_load (67-76)
src/transform/layout_inference.cc (2)
src/op/parallel.cc (4)
  • op (96-106)
  • op (96-96)
  • op (108-116)
  • op (108-108)
src/transform/pipeline_planning.cc (10)
  • op (54-72)
  • op (54-54)
  • op (74-92)
  • op (74-74)
  • op (94-121)
  • op (94-94)
  • op (123-133)
  • op (123-123)
  • op (545-554)
  • op (545-545)
src/tl_templates/cuda/common.h (1)
src/tl_templates/cuda/nvrtc_std.h (1)
  • std (47-111)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-test-nvidia
  • GitHub Check: format-check
  • GitHub Check: bot-task

src/op/copy.cc Outdated
Comment on lines 850 to 888
    // We also ensure there's no OOB.
    PrimExpr shared_total_elements = 1;
    for (size_t i = 0; i < shared_range.size(); ++i)
      shared_total_elements *= shared_range[i]->extent;
    PrimExpr global_total_elements = 1;
    for (size_t i = 0; i < global_range.size(); ++i)
      global_total_elements *= global_range[i]->extent;
    bool s_g_equal =
        analyzer->CanProveEqual(global_total_elements, shared_total_elements);
    if (s_cont && g_cont && s_g_equal) {
      shared_total_elements = analyzer->Simplify(shared_total_elements);
      PrimExpr shared_addr =
          shared_tensor.access_ptr(is_load ? 2 : 1, DataType::Handle(), 1,
                                   offset, shared_total_elements);

      PrimExpr global_offset = 0;
      ICHECK(global_tensor->strides.empty());
      for (size_t i = 0; i < global_tensor->shape.size(); ++i) {
        global_offset *= global_tensor->shape[i];
        global_offset += global_range[i]->min;
      }
      PrimExpr global_addr =
          global_tensor.access_ptr(is_load ? 1 : 2, DataType::Handle(), 1,
                                   global_offset, global_total_elements);

      Stmt tma_copy;
      if (is_load)
        tma_copy = Evaluate(
            Call(DataType::Handle(), op,
                 {shared_addr, global_addr, 0,
                  shared_total_elements * shared_tensor->dtype.bytes()}));
      else
        tma_copy = Evaluate(
            Call(DataType::Handle(), op,
                 {global_addr, shared_addr,
                  shared_total_elements * shared_tensor->dtype.bytes()}));
      return IfThenElse(EQ(T.thread_var, T.thread_bounds->min), tma_copy);
    }
  }

⚠️ Potential issue

Missing OOB guards in new 1D TMA path.

Unlike the later 1D path (Lines 934-959), this early return lacks checks that min + extent <= shape for both shared and global ranges. This can generate OOB TMA accesses when mins are non-zero. Gate the path with the same no_oob checks.

-    bool s_g_equal =
-        analyzer->CanProveEqual(global_total_elements, shared_total_elements);
-    if (s_cont && g_cont && s_g_equal) {
+    bool s_g_equal =
+        analyzer->CanProveEqual(global_total_elements, shared_total_elements);
+
+    // OOB guards (mirror the checks in the later 1D path)
+    bool no_oob = true;
+    for (size_t i = 0; i < shared_range.size(); ++i) {
+      if (!analyzer->CanProve(shared_range[i]->min + shared_range[i]->extent <=
+                              shared_tensor->shape[i])) {
+        no_oob = false;
+        break;
+      }
+    }
+    if (no_oob) {
+      for (size_t i = 0; i < global_range.size(); ++i) {
+        if (!analyzer->CanProve(global_range[i]->min + global_range[i]->extent <=
+                                global_tensor->shape[i])) {
+          no_oob = false;
+          break;
+        }
+      }
+    }
+
+    if (s_cont && g_cont && s_g_equal && no_oob) {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    // We also ensure there's no OOB.
    PrimExpr shared_total_elements = 1;
    for (size_t i = 0; i < shared_range.size(); ++i)
      shared_total_elements *= shared_range[i]->extent;
    PrimExpr global_total_elements = 1;
    for (size_t i = 0; i < global_range.size(); ++i)
      global_total_elements *= global_range[i]->extent;
    bool s_g_equal =
        analyzer->CanProveEqual(global_total_elements, shared_total_elements);
    if (s_cont && g_cont && s_g_equal) {
      shared_total_elements = analyzer->Simplify(shared_total_elements);
      PrimExpr shared_addr =
          shared_tensor.access_ptr(is_load ? 2 : 1, DataType::Handle(), 1,
                                   offset, shared_total_elements);
      PrimExpr global_offset = 0;
      ICHECK(global_tensor->strides.empty());
      for (size_t i = 0; i < global_tensor->shape.size(); ++i) {
        global_offset *= global_tensor->shape[i];
        global_offset += global_range[i]->min;
      }
      PrimExpr global_addr =
          global_tensor.access_ptr(is_load ? 1 : 2, DataType::Handle(), 1,
                                   global_offset, global_total_elements);
      Stmt tma_copy;
      if (is_load)
        tma_copy = Evaluate(
            Call(DataType::Handle(), op,
                 {shared_addr, global_addr, 0,
                  shared_total_elements * shared_tensor->dtype.bytes()}));
      else
        tma_copy = Evaluate(
            Call(DataType::Handle(), op,
                 {global_addr, shared_addr,
                  shared_total_elements * shared_tensor->dtype.bytes()}));
      return IfThenElse(EQ(T.thread_var, T.thread_bounds->min), tma_copy);
    }
  }
    // We also ensure there's no OOB.
    PrimExpr shared_total_elements = 1;
    for (size_t i = 0; i < shared_range.size(); ++i)
      shared_total_elements *= shared_range[i]->extent;
    PrimExpr global_total_elements = 1;
    for (size_t i = 0; i < global_range.size(); ++i)
      global_total_elements *= global_range[i]->extent;
    bool s_g_equal =
        analyzer->CanProveEqual(global_total_elements, shared_total_elements);
    // OOB guards (mirror the checks in the later 1D path)
    bool no_oob = true;
    for (size_t i = 0; i < shared_range.size(); ++i) {
      if (!analyzer->CanProve(shared_range[i]->min + shared_range[i]->extent <=
                              shared_tensor->shape[i])) {
        no_oob = false;
        break;
      }
    }
    if (no_oob) {
      for (size_t i = 0; i < global_range.size(); ++i) {
        if (!analyzer->CanProve(global_range[i]->min + global_range[i]->extent <=
                                global_tensor->shape[i])) {
          no_oob = false;
          break;
        }
      }
    }
    if (s_cont && g_cont && s_g_equal && no_oob) {
      shared_total_elements = analyzer->Simplify(shared_total_elements);
      PrimExpr shared_addr =
          shared_tensor.access_ptr(is_load ? 2 : 1, DataType::Handle(), 1,
                                   offset, shared_total_elements);
      PrimExpr global_offset = 0;
      ICHECK(global_tensor->strides.empty());
      for (size_t i = 0; i < global_tensor->shape.size(); ++i) {
        global_offset *= global_tensor->shape[i];
        global_offset += global_range[i]->min;
      }
      PrimExpr global_addr =
          global_tensor.access_ptr(is_load ? 1 : 2, DataType::Handle(), 1,
                                   global_offset, global_total_elements);
      Stmt tma_copy;
      if (is_load)
        tma_copy = Evaluate(
            Call(DataType::Handle(), op,
                 {shared_addr, global_addr, 0,
                  shared_total_elements * shared_tensor->dtype.bytes()}));
      else
        tma_copy = Evaluate(
            Call(DataType::Handle(), op,
                 {global_addr, shared_addr,
                  shared_total_elements * shared_tensor->dtype.bytes()}));
      return IfThenElse(EQ(T.thread_var, T.thread_bounds->min), tma_copy);
    }
🤖 Prompt for AI Agents
In src/op/copy.cc around lines 850 to 888, the early 1D TMA return path is
missing OOB guards: add the same "no_oob" checks used in the later 1D path
(lines ~934-959) so the branch only fires when every range's min + extent <=
corresponding tensor shape for both shared and global ranges. Concretely,
compute/verify the no_oob condition for shared_range vs shared_tensor->shape and
global_range vs global_tensor->shape (using analyzer->CanProveEqual/CanProve or
equivalent checks already used later), and combine it with the existing s_cont
&& g_cont && s_g_equal gate before constructing shared_addr/global_addr and
returning the IfThenElse tma_copy. Ensure the added condition prevents the
early-return 1D TMA when mins are non-zero and could cause OOB accesses.

Comment on lines +21 to +26
FinalizeReducerOp::FinalizeReducerOp(Array<PrimExpr> args, BufferMap vmap) {
  auto node = make_object<FinalizeReducerOpNode>();
  node->reducer = vmap[GetVarFromAccessPtr(args[0])];
  node->op = (ReducerOpType)*as_const_int(args[1]);
  data_ = std::move(node);
}

⚠️ Potential issue

Validate builder arguments and vmap lookup.

Prevent OOB access and undefined reducer buffer lookup.

-FinalizeReducerOp::FinalizeReducerOp(Array<PrimExpr> args, BufferMap vmap) {
+FinalizeReducerOp::FinalizeReducerOp(Array<PrimExpr> args, BufferMap vmap) {
   auto node = make_object<FinalizeReducerOpNode>();
-  node->reducer = vmap[GetVarFromAccessPtr(args[0])];
-  node->op = (ReducerOpType)*as_const_int(args[1]);
+  ICHECK_EQ(args.size(), 2) << "FinalizeReducer expects exactly 2 arguments";
+  Var data_var = GetVarFromAccessPtr(args[0]);
+  ICHECK(vmap.count(data_var)) << "Unknown reducer buffer var in access_ptr";
+  node->reducer = vmap[data_var];
+  const int64_t* op_i64 = as_const_int(args[1]);
+  ICHECK(op_i64) << "Second argument must be a constant integer (ReducerOpType)";
+  node->op = static_cast<ReducerOpType>(*op_i64);
   data_ = std::move(node);
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
FinalizeReducerOp::FinalizeReducerOp(Array<PrimExpr> args, BufferMap vmap) {
  auto node = make_object<FinalizeReducerOpNode>();
  node->reducer = vmap[GetVarFromAccessPtr(args[0])];
  node->op = (ReducerOpType)*as_const_int(args[1]);
  data_ = std::move(node);
}
FinalizeReducerOp::FinalizeReducerOp(Array<PrimExpr> args, BufferMap vmap) {
  auto node = make_object<FinalizeReducerOpNode>();
  ICHECK_EQ(args.size(), 2) << "FinalizeReducer expects exactly 2 arguments";
  Var data_var = GetVarFromAccessPtr(args[0]);
  ICHECK(vmap.count(data_var)) << "Unknown reducer buffer var in access_ptr";
  node->reducer = vmap[data_var];
  const int64_t* op_i64 = as_const_int(args[1]);
  ICHECK(op_i64) << "Second argument must be a constant integer (ReducerOpType)";
  node->op = static_cast<ReducerOpType>(*op_i64);
  data_ = std::move(node);
}
🤖 Prompt for AI Agents
In src/op/finalize_reducer.cc around lines 21-26, the constructor assumes args
has at least 2 elements, that args[1] is a valid constant int, and that vmap
contains the reducer key returned by GetVarFromAccessPtr(args[0]); add explicit
validation: check args.size() >= 2 and throw or ICE with a clear message if not;
call as_const_int(args[1]) and verify it returns non-null before dereferencing,
then validate the resulting op index is within the expected enum range; compute
the reducer key from GetVarFromAccessPtr(args[0]) and ensure vmap contains that
key (use find/count) before indexing, throwing a descriptive error if missing;
only after these checks assign node->reducer and node->op and set data_.

Comment on lines +40 to +46
  const int64_t *p_extent = as_const_int(layout->ReplicateExtent());
  ICHECK(p_extent);
  int extent = *p_extent, scale = 1;
  ICHECK(extent == 1 || extent == *as_const_int(T.thread_bounds->extent))
      << "Illegal finalize_reducer: extent=" << extent
      << "; T.thread_bounds=" << T.thread_bounds;


⚠️ Potential issue

Avoid null deref on thread_bounds->extent. Cache total threads.

Ensure extent is an IntImm and reuse it.

-  const int64_t *p_extent = as_const_int(layout->ReplicateExtent());
+  const int64_t *p_extent = as_const_int(layout->ReplicateExtent());
   ICHECK(p_extent);
-  int extent = *p_extent, scale = 1;
-  ICHECK(extent == 1 || extent == *as_const_int(T.thread_bounds->extent))
+  int extent = *p_extent;
+  const int64_t* p_total = as_const_int(T.thread_bounds->extent);
+  ICHECK(p_total) << "T.thread_bounds->extent must be a constant integer";
+  int total_threads = static_cast<int>(*p_total);
+  ICHECK(extent == 1 || extent == total_threads)
       << "Illegal finalize_reducer: extent=" << extent
       << "; T.thread_bounds=" << T.thread_bounds;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  const int64_t *p_extent = as_const_int(layout->ReplicateExtent());
  ICHECK(p_extent);
  int extent = *p_extent, scale = 1;
  ICHECK(extent == 1 || extent == *as_const_int(T.thread_bounds->extent))
      << "Illegal finalize_reducer: extent=" << extent
      << "; T.thread_bounds=" << T.thread_bounds;
  const int64_t *p_extent = as_const_int(layout->ReplicateExtent());
  ICHECK(p_extent);
  int extent = *p_extent;
  const int64_t* p_total = as_const_int(T.thread_bounds->extent);
  ICHECK(p_total) << "T.thread_bounds->extent must be a constant integer";
  int total_threads = static_cast<int>(*p_total);
  ICHECK(extent == 1 || extent == total_threads)
      << "Illegal finalize_reducer: extent=" << extent
      << "; T.thread_bounds=" << T.thread_bounds;
🤖 Prompt for AI Agents
In src/op/finalize_reducer.cc around lines 40 to 46, avoid a potential
null-dereference and redundant casts by validating that thread_bounds and its
extent are non-null and an IntImm, cache the total thread count once, and reuse
that value: check T.thread_bounds is non-null, obtain and dynamic_cast/isa the
extent to IntImm (ICHECK with a clear message if not), store the integer extent
in a local variable (e.g. total_threads) and reuse it in the subsequent
checks/logging instead of re-evaluating or re-casting.

#include <cuda_runtime.h>
#endif

#include <cuda/atomic>

🛠️ Refactor suggestion

Guard headers for NVRTC and add missing includes for types/intrinsics

Ensure availability of type traits, stdio, and FP16/BF16 intrinsics across NVRTC/NVCC builds.

Apply:

-#include <cuda/atomic>
+#include <cuda/atomic>
+#include <cstdio>
+#if defined(__CUDACC_RTC__)
+#include "nvrtc_std.h"
+#else
+#include <type_traits>
+#endif
+#include <cuda_fp16.h>
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
+#include <cuda_bf16.h>
+#endif
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#include <cuda/atomic>
// src/tl_templates/cuda/common.h
#include <cuda/atomic>
#include <cstdio>
#if defined(__CUDACC_RTC__)
#include "nvrtc_std.h"
#else
#include <type_traits>
#endif
#include <cuda_fp16.h>
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
#include <cuda_bf16.h>
#endif
// …rest of existing includes and code…
🤖 Prompt for AI Agents
In src/tl_templates/cuda/common.h around line 7, the file currently only
includes <cuda/atomic>; add guarded includes and feature checks so NVRTC builds
don't break and required types/intrinsics are available for NVCC: add standard
headers <type_traits> and <cstdio>, and include FP16/BF16 intrinsics
conditionally (e.g. use __has_include to include <cuda_fp16.h> and <cuda_bf16.h>
only if present and not building under NVRTC), and wrap <cuda/atomic> and the
FP16/BF16 includes with a check like #if !defined(__CUDACC_RTC__) ||
__has_include(...) to ensure NVRTC-safe compilation.

Comment on lines +150 to 161
template <typename T1, typename T2>
TL_DEVICE void AtomicMax(T1 *address, T2 val,
                         int memory_order = int(cuda::memory_order_relaxed)) {
  using NT1 = typename normalize_atomic_type<T1>::type;
  if constexpr (std::is_same_v<NT1, half> ||
                std::is_same_v<NT1, __nv_bfloat16>) {
    atomicMax(reinterpret_cast<NT1 *>(address), static_cast<NT1>(val));
  } else {
    cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*address);
    aref.fetch_max(cuda_cast<NT1>(val), cuda::memory_order(memory_order));
  }
}

⚠️ Potential issue

AtomicMax: wrong param type, bad atomic_ref construction, and invalid half/bf16 path

  • int memory_order is fragile; use cuda::memory_order.
  • Must construct atomic_ref<NT1> from NT1& (reinterpret address).
  • There is no atomicMax intrinsic for half/__nv_bfloat16; implement 16-bit CAS loop.
-template <typename T1, typename T2>
-TL_DEVICE void AtomicMax(T1 *address, T2 val,
-                         int memory_order = int(cuda::memory_order_relaxed)) {
-  using NT1 = typename normalize_atomic_type<T1>::type;
-  if constexpr (std::is_same_v<NT1, half> ||
-                std::is_same_v<NT1, __nv_bfloat16>) {
-    atomicMax(reinterpret_cast<NT1 *>(address), static_cast<NT1>(val));
-  } else {
-    cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*address);
-    aref.fetch_max(cuda_cast<NT1>(val), cuda::memory_order(memory_order));
-  }
-}
+template <typename T1, typename T2>
+TL_DEVICE void AtomicMax(T1* address, T2 val,
+                         cuda::memory_order mo = cuda::memory_order_relaxed) {
+  using NT1 = typename normalize_atomic_type<T1>::type;
+  static_assert(sizeof(NT1) == sizeof(T1) && alignof(NT1) == alignof(T1),
+                "Normalized atomic type must match size/alignment.");
+  if constexpr (std::is_same_v<NT1, half> || std::is_same_v<NT1, __nv_bfloat16>) {
+    atomic_minmax_16bits<NT1>(reinterpret_cast<uint16_t*>(address),
+                              cuda_cast<NT1>(val), /*is_max=*/true, mo);
+  } else {
+    NT1* addr = reinterpret_cast<NT1*>(address);
+    cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*addr);
+    aref.fetch_max(cuda_cast<NT1>(val), mo);
+  }
+}

Add this helper (place near the atomics block):

// Helper for min/max on 16-bit storage (half / __nv_bfloat16) using 32-bit CAS.
template <typename S>
TL_DEVICE uint16_t to_bits16(S x) {
  return *reinterpret_cast<uint16_t*>(&x);
}
template <typename S>
TL_DEVICE S from_bits16(uint16_t b) {
  return *reinterpret_cast<S*>(&b);
}
TL_DEVICE inline uint16_t extract16(uint32_t orig, bool hi) {
  // Select the 16-bit half (high or low) of the word that holds the target element.
  return hi ? uint16_t((orig >> 16) & 0xFFFFu) : uint16_t(orig & 0xFFFFu);
}
TL_DEVICE inline uint32_t with16(uint32_t orig, uint16_t b, bool hi) {
  return hi ? ((orig & 0x0000FFFFu) | (uint32_t(b) << 16))
            : ((orig & 0xFFFF0000u) | uint32_t(b));
}
template <typename S>
TL_DEVICE void atomic_minmax_16bits(uint16_t* p, S v, bool is_max, cuda::memory_order mo) {
  uintptr_t up = reinterpret_cast<uintptr_t>(p);
  bool hi = (up & 0x2) != 0;
  auto* base = reinterpret_cast<uint32_t*>(up & ~uintptr_t(0x3));
  cuda::atomic_ref<uint32_t, cuda::thread_scope_device> aref(*base);
  uint32_t expected = aref.load(cuda::memory_order_relaxed);
  while (true) {
    uint16_t cur_bits = extract16(expected, hi);
    S cur = from_bits16<S>(cur_bits);
    S desired = is_max ? (cur > v ? cur : v) : (cur < v ? cur : v);
    if (to_bits16<S>(desired) == cur_bits) break;
    uint32_t desired32 = with16(expected, to_bits16<S>(desired), hi);
    if (aref.compare_exchange_weak(expected, desired32, mo, cuda::memory_order_relaxed)) break;
  }
}
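
For reference, a minimal device-side usage sketch, assuming the revised AtomicMax signature above is in scope from common.h (the kernel and its names are illustrative, not part of the suggestion):

// Fold a float array into a single half-precision maximum.
// Assumes *out was initialized to the lowest representable half beforehand.
__global__ void reduce_max_to_half(half *out, const float *in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    // Relaxed ordering is enough here: only the final value is read back.
    AtomicMax(out, __float2half(in[i]), cuda::memory_order_relaxed);
  }
}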
🤖 Prompt for AI Agents
In src/tl_templates/cuda/common.h around lines 150 to 161, AtomicMax has three
problems: the memory_order parameter is declared as int, the cuda::atomic_ref is
constructed from the raw pointer instead of an NT1& reference, and there is no
atomicMax intrinsic for half/__nv_bfloat16 so the half/bf16 branch is invalid.
Change the function signature to take cuda::memory_order (default
cuda::memory_order_relaxed), construct cuda::atomic_ref<NT1,...> by
reinterpret_casting the pointer to NT1* and dereferencing to pass an NT1&, and
replace the half/bf16 branch with a 16-bit CAS loop implementation that uses
32-bit atomic_ref on the containing uint32_t word; add the provided helper
functions (to_bits16/from_bits16/extract16/with16/atomic_minmax_16bits) near the
atomics block and call atomic_minmax_16bits for the 16-bit types with the passed
cuda::memory_order.

Comment on lines +163 to +174
template <typename T1, typename T2>
TL_DEVICE void AtomicMin(T1 *address, T2 val,
                         int memory_order = int(cuda::memory_order_relaxed)) {
  using NT1 = typename normalize_atomic_type<T1>::type;
  if constexpr (std::is_same_v<NT1, half> ||
                std::is_same_v<NT1, __nv_bfloat16>) {
    atomicMin(reinterpret_cast<NT1 *>(address), static_cast<NT1>(val));
  } else {
    cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*address);
    aref.fetch_min(cuda_cast<NT1>(val), cuda::memory_order(memory_order));
  }
}

⚠️ Potential issue

AtomicMin: mirror the AtomicMax fixes

Same issues and solution as above.

-template <typename T1, typename T2>
-TL_DEVICE void AtomicMin(T1 *address, T2 val,
-                         int memory_order = int(cuda::memory_order_relaxed)) {
-  using NT1 = typename normalize_atomic_type<T1>::type;
-  if constexpr (std::is_same_v<NT1, half> ||
-                std::is_same_v<NT1, __nv_bfloat16>) {
-    atomicMin(reinterpret_cast<NT1 *>(address), static_cast<NT1>(val));
-  } else {
-    cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*address);
-    aref.fetch_min(cuda_cast<NT1>(val), cuda::memory_order(memory_order));
-  }
-}
+template <typename T1, typename T2>
+TL_DEVICE void AtomicMin(T1* address, T2 val,
+                         cuda::memory_order mo = cuda::memory_order_relaxed) {
+  using NT1 = typename normalize_atomic_type<T1>::type;
+  static_assert(sizeof(NT1) == sizeof(T1) && alignof(NT1) == alignof(T1),
+                "Type mismatch");
+  if constexpr (std::is_same_v<NT1, half> || std::is_same_v<NT1, __nv_bfloat16>) {
+    atomic_minmax_16bits<NT1>(reinterpret_cast<uint16_t*>(address),
+                              cuda_cast<NT1>(val), /*is_max=*/false, mo);
+  } else {
+    NT1* addr = reinterpret_cast<NT1*>(address);
+    cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*addr);
+    aref.fetch_min(cuda_cast<NT1>(val), mo);
+  }
+}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
template <typename T1, typename T2>
TL_DEVICE void AtomicMin(T1 *address, T2 val,
                         int memory_order = int(cuda::memory_order_relaxed)) {
  using NT1 = typename normalize_atomic_type<T1>::type;
  if constexpr (std::is_same_v<NT1, half> ||
                std::is_same_v<NT1, __nv_bfloat16>) {
    atomicMin(reinterpret_cast<NT1 *>(address), static_cast<NT1>(val));
  } else {
    cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*address);
    aref.fetch_min(cuda_cast<NT1>(val), cuda::memory_order(memory_order));
  }
}
template <typename T1, typename T2>
TL_DEVICE void AtomicMin(T1* address, T2 val,
                         cuda::memory_order mo = cuda::memory_order_relaxed) {
  using NT1 = typename normalize_atomic_type<T1>::type;
  static_assert(sizeof(NT1) == sizeof(T1) && alignof(NT1) == alignof(T1),
                "Type mismatch");
  if constexpr (std::is_same_v<NT1, half> || std::is_same_v<NT1, __nv_bfloat16>) {
    atomic_minmax_16bits<NT1>(reinterpret_cast<uint16_t*>(address),
                              cuda_cast<NT1>(val), /*is_max=*/false, mo);
  } else {
    NT1* addr = reinterpret_cast<NT1*>(address);
    cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*addr);
    aref.fetch_min(cuda_cast<NT1>(val), mo);
  }
}
🤖 Prompt for AI Agents
In src/tl_templates/cuda/common.h around lines 163 to 174, the AtomicMin
implementation mirrors the AtomicMax bug: the cuda::atomic_ref is constructed
from *address (the wrong type), so it should dereference a pointer cast to the
normalized type. Update the atomic_ref construction to use
*reinterpret_cast<NT1*>(address), ensure the value is cast to NT1 when
calling atomicMin for half/__nv_bfloat16, and when calling aref.fetch_min pass
cuda_cast<NT1>(val) with the memory_order converted as
cuda::memory_order(memory_order).

Comment on lines +176 to +187
template <typename T1, typename T2>
TL_DEVICE void AtomicAdd(T1 *address, T2 val,
                         int memory_order = int(cuda::memory_order_relaxed)) {
  using NT1 = typename normalize_atomic_type<T1>::type;
  if constexpr (std::is_same_v<NT1, half> ||
                std::is_same_v<NT1, __nv_bfloat16>) {
    atomicAdd(reinterpret_cast<NT1 *>(address), static_cast<NT1>(val));
  } else {
    cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*address);
    aref.fetch_add(cuda_cast<NT1>(val), cuda::memory_order(memory_order));
  }
}

🛠️ Refactor suggestion

AtomicAdd: strong memory_order type, correct aref binding, and proper casts

Use cuda_cast for half/bf16 and normalize address before constructing atomic_ref.

-template <typename T1, typename T2>
-TL_DEVICE void AtomicAdd(T1 *address, T2 val,
-                         int memory_order = int(cuda::memory_order_relaxed)) {
-  using NT1 = typename normalize_atomic_type<T1>::type;
-  if constexpr (std::is_same_v<NT1, half> ||
-                std::is_same_v<NT1, __nv_bfloat16>) {
-    atomicAdd(reinterpret_cast<NT1 *>(address), static_cast<NT1>(val));
-  } else {
-    cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*address);
-    aref.fetch_add(cuda_cast<NT1>(val), cuda::memory_order(memory_order));
-  }
-}
+template <typename T1, typename T2>
+TL_DEVICE void AtomicAdd(T1* address, T2 val,
+                         cuda::memory_order mo = cuda::memory_order_relaxed) {
+  using NT1 = typename normalize_atomic_type<T1>::type;
+  static_assert(sizeof(NT1) == sizeof(T1) && alignof(NT1) == alignof(T1),
+                "Type mismatch");
+  if constexpr (std::is_same_v<NT1, half> || std::is_same_v<NT1, __nv_bfloat16>) {
+    (void)mo; // intrinsic has fixed ordering semantics
+    atomicAdd(reinterpret_cast<NT1*>(address), cuda_cast<NT1>(val));
+  } else {
+    NT1* addr = reinterpret_cast<NT1*>(address);
+    cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*addr);
+    aref.fetch_add(cuda_cast<NT1>(val), mo);
+  }
+}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
template <typename T1, typename T2>
TL_DEVICE void AtomicAdd(T1 *address, T2 val,
                         int memory_order = int(cuda::memory_order_relaxed)) {
  using NT1 = typename normalize_atomic_type<T1>::type;
  if constexpr (std::is_same_v<NT1, half> ||
                std::is_same_v<NT1, __nv_bfloat16>) {
    atomicAdd(reinterpret_cast<NT1 *>(address), static_cast<NT1>(val));
  } else {
    cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*address);
    aref.fetch_add(cuda_cast<NT1>(val), cuda::memory_order(memory_order));
  }
}
template <typename T1, typename T2>
TL_DEVICE void AtomicAdd(T1* address, T2 val,
                         cuda::memory_order mo = cuda::memory_order_relaxed) {
  using NT1 = typename normalize_atomic_type<T1>::type;
  static_assert(sizeof(NT1) == sizeof(T1) && alignof(NT1) == alignof(T1),
                "Type mismatch");
  if constexpr (std::is_same_v<NT1, half> || std::is_same_v<NT1, __nv_bfloat16>) {
    (void)mo; // intrinsic has fixed ordering semantics
    atomicAdd(reinterpret_cast<NT1*>(address), cuda_cast<NT1>(val));
  } else {
    NT1* addr = reinterpret_cast<NT1*>(address);
    cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*addr);
    aref.fetch_add(cuda_cast<NT1>(val), mo);
  }
}
🤖 Prompt for AI Agents
In src/tl_templates/cuda/common.h around lines 176 to 187, AtomicAdd should use
a strong memory_order type, normalize the address before creating the
atomic_ref, and use cuda_cast for half/bf16 values: change the memory_order
parameter from int to cuda::memory_order (with a default of
cuda::memory_order_relaxed), in the half/__nv_bfloat16 branch call atomicAdd
with cuda_cast<NT1>(val), and in the generic branch construct the atomic_ref by
dereferencing a normalized pointer (reinterpret_cast<NT1*>(address)) and pass
cuda_cast<NT1>(val) and cuda::memory_order(memory_order) to fetch_add.

Comment on lines +574 to +579
    Map<Var, ReducerInfo> reducer_info;
    if (op->annotations.count(attr::kReducerInfo))
      reducer_info = op->annotations.Get(attr::kReducerInfo)
                         ->as<Map<Var, ReducerInfo>>()
                         .value();


🛠️ Refactor suggestion

Guard reducer_info extraction to avoid Optional/type pitfalls.

Avoid unchecked .value() and rely on Downcast to validate type.

-    Map<Var, ReducerInfo> reducer_info;
-    if (op->annotations.count(attr::kReducerInfo))
-      reducer_info = op->annotations.Get(attr::kReducerInfo)
-                         ->as<Map<Var, ReducerInfo>>()
-                         .value();
+    Map<Var, ReducerInfo> reducer_info;
+    if (auto opt = op->annotations.Get(attr::kReducerInfo)) {
+      reducer_info = Downcast<Map<Var, ReducerInfo>>(opt.value());
+    }
🤖 Prompt for AI Agents
In src/transform/layout_inference.cc around lines 574-579, the current
extraction of reducer_info uses unchecked .value(), which can crash or throw on
missing or wrongly typed annotations; replace it with a guarded extraction that
first checks op->annotations.count(attr::kReducerInfo), then obtains the
annotation via Get(attr::kReducerInfo), attempts a Downcast to Map<Var, ReducerInfo>,
verifies the downcasted Optional/RuntimeObject is defined before using it, and
only then assigns reducer_info; avoid calling .value() directly and handle the
undefined or mismatched-type case (e.g., leave reducer_info empty or log/return an
error) so the code is robust to missing or wrong-typed annotations.

Comment on lines +624 to 635
      // Workaround: if reducer is presented, don't vectorize loop
      // Best solution should be isolate reduction axis out of vectorization
      bool has_reducer = false;
      PostOrderVisit(for_node->body, [&](const ObjectRef &obj) {
        if (!has_reducer)
          if (const auto *store = obj.as<BufferStoreNode>()) {
            has_reducer = reducer_info.count(store->buffer->data) != 0;
          }
      });

      if (has_non_local) {
      if (has_non_local && !has_reducer) {
        for_node = VectorizeLoop(for_node);

🛠️ Refactor suggestion

Simplify reducer gating for vectorization.

If the loop carries kReducerInfo, skip vectorization directly; no need to rescan BufferStores (which can miss non-store reducer patterns).

-      // Workaround: if reducer is presented, don't vectorize loop
-      // Best solution should be isolate reduction axis out of vectorization
-      bool has_reducer = false;
-      PostOrderVisit(for_node->body, [&](const ObjectRef &obj) {
-        if (!has_reducer)
-          if (const auto *store = obj.as<BufferStoreNode>()) {
-            has_reducer = reducer_info.count(store->buffer->data) != 0;
-          }
-      });
-
-      if (has_non_local && !has_reducer) {
+      // Workaround: if reducer is present, don't vectorize loop
+      bool has_reducer = reducer_info.size() > 0;
+      if (has_non_local && !has_reducer) {
         for_node = VectorizeLoop(for_node);
       }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/transform/layout_inference.cc around lines 624–635, remove the
PostOrderVisit scan of BufferStore nodes and instead gate vectorization by
directly checking whether this loop is associated with reducer info; if
reducer_info indicates the loop carries reduction state, skip vectorization.
Replace the body-scan logic with a single direct lookup (e.g., check
reducer_info for the loop's reduction key such as the loop variable or the
specific reducer key used elsewhere in this codebase) and use that boolean to
decide whether to call VectorizeLoop; do not perform the expensive/rescan of
BufferStore nodes.

…cution

- Removed redundant checks for contiguous memory access in CopyNode::LowerBulkCopy, streamlining the logic for TMA copy operations.
- Updated test_tilelang_kernel_gemm.py to comment out the main testing function and call a specific test for i8i8i32 tensor operations instead, improving test focus.
@LeiWang1999 LeiWang1999 merged commit 8eab775 into tile-ai:main Aug 31, 2025
7 of 8 checks passed
@coderabbitai
Contributor

coderabbitai bot commented Aug 31, 2025

Note

Unit test generation is an Early Access feature. Expect some limitations and changes as we gather feedback and continue to improve it.


Generating unit tests... This may take up to 20 minutes.

coderabbitai bot added a commit that referenced this pull request Aug 31, 2025
Docstrings generation was requested by @LeiWang1999.

* #757 (comment)

The following files were modified:

* `setup.py`
* `src/op/builtin.h`
* `src/op/finalize_reducer.cc`
* `src/op/finalize_reducer.h`
* `src/op/parallel.cc`
* `src/op/parallel.h`
* `src/op/reduce.cc`
* `src/target/codegen_cuda.cc`
* `src/tl_templates/cuda/common.h`
* `src/transform/layout_inference.cc`
* `src/transform/layout_reducer.cc`
* `src/transform/layout_reducer.h`
* `src/transform/merge_shared_memory_allocations.cc`
* `src/transform/storage_access.cc`
* `src/transform/warp_specialized_rewriter.cc`
* `testing/python/autotune/test_tilelang_autotune_with_inputs.py`
* `tilelang/engine/phase.py`
* `tilelang/language/customize.py`
* `tilelang/language/reduce.py`
* `tilelang/transform/__init__.py`
@coderabbitai
Contributor

coderabbitai bot commented Aug 31, 2025

Note

Generated docstrings for this pull request at #772

LeiWang1999 added a commit that referenced this pull request Aug 31, 2025
* 📝 Add docstrings to `reducer_0825`

Docstrings generation was requested by @LeiWang1999.

* #757 (comment)

The following files were modified:

* `setup.py`
* `src/op/builtin.h`
* `src/op/finalize_reducer.cc`
* `src/op/finalize_reducer.h`
* `src/op/parallel.cc`
* `src/op/parallel.h`
* `src/op/reduce.cc`
* `src/target/codegen_cuda.cc`
* `src/tl_templates/cuda/common.h`
* `src/transform/layout_inference.cc`
* `src/transform/layout_reducer.cc`
* `src/transform/layout_reducer.h`
* `src/transform/merge_shared_memory_allocations.cc`
* `src/transform/storage_access.cc`
* `src/transform/warp_specialized_rewriter.cc`
* `testing/python/autotune/test_tilelang_autotune_with_inputs.py`
* `tilelang/engine/phase.py`
* `tilelang/language/customize.py`
* `tilelang/language/reduce.py`
* `tilelang/transform/__init__.py`

* lint fix

* lint fix

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
margaretphillips96627margaretphillips added a commit to margaretphillips96627margaretphillips/tilelang that referenced this pull request Oct 6, 2025
* 📝 Add docstrings to `reducer_0825`

Docstrings generation was requested by @LeiWang1999.

* tile-ai/tilelang#757 (comment)

The following files were modified:

* `setup.py`
* `src/op/builtin.h`
* `src/op/finalize_reducer.cc`
* `src/op/finalize_reducer.h`
* `src/op/parallel.cc`
* `src/op/parallel.h`
* `src/op/reduce.cc`
* `src/target/codegen_cuda.cc`
* `src/tl_templates/cuda/common.h`
* `src/transform/layout_inference.cc`
* `src/transform/layout_reducer.cc`
* `src/transform/layout_reducer.h`
* `src/transform/merge_shared_memory_allocations.cc`
* `src/transform/storage_access.cc`
* `src/transform/warp_specialized_rewriter.cc`
* `testing/python/autotune/test_tilelang_autotune_with_inputs.py`
* `tilelang/engine/phase.py`
* `tilelang/language/customize.py`
* `tilelang/language/reduce.py`
* `tilelang/transform/__init__.py`

* lint fix

* lint fix

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
ebonyschneider462359 added a commit to ebonyschneider462359/tilelang that referenced this pull request Oct 12, 2025
* 📝 Add docstrings to `reducer_0825`

Docstrings generation was requested by @LeiWang1999.

* tile-ai/tilelang#757 (comment)

The following files were modified:

* `setup.py`
* `src/op/builtin.h`
* `src/op/finalize_reducer.cc`
* `src/op/finalize_reducer.h`
* `src/op/parallel.cc`
* `src/op/parallel.h`
* `src/op/reduce.cc`
* `src/target/codegen_cuda.cc`
* `src/tl_templates/cuda/common.h`
* `src/transform/layout_inference.cc`
* `src/transform/layout_reducer.cc`
* `src/transform/layout_reducer.h`
* `src/transform/merge_shared_memory_allocations.cc`
* `src/transform/storage_access.cc`
* `src/transform/warp_specialized_rewriter.cc`
* `testing/python/autotune/test_tilelang_autotune_with_inputs.py`
* `tilelang/engine/phase.py`
* `tilelang/language/customize.py`
* `tilelang/language/reduce.py`
* `tilelang/transform/__init__.py`

* lint fix

* lint fix

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
chengyupku added a commit to tile-ai/tilescale that referenced this pull request Oct 24, 2025
* [Index] Relocate Int64 Auto Promoter to ConfigBitWidth Pass, removing it from FlattenBuffer (#714)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107

* Refactor inject_pipeline.cc to enhance pipeline body rewriting and condition handling

- Introduced a new function to replace IfThenElse nodes with their then_case while preserving attributes.
- Streamlined the PipelineBodyRewriter to improve buffer access rewriting and async state management.
- Enhanced the handling of pipeline loop conditions and added support for predicate conditions in the pipeline body.
- Removed obsolete code and improved overall code clarity and maintainability.

* lint fix

* Refactor return statements in inject_pipeline.cc to remove unnecessary std::move calls

- Updated return statements in multiple methods to return objects directly instead of using std::move, improving code clarity and potentially avoiding unnecessary moves.
- Ensured consistent handling of BufferStore and BufferLoad nodes during pipeline transformations.

* test fix

* Enhance global read detection in pipeline planning

- Updated the handling of global reads to account for condition expressions within IfThenElse nodes, ensuring accurate identification of global memory accesses.
- Introduced a new flag to track whether the visitor is within a condition expression, improving the correctness of buffer access analysis.
- Refactored the VisitStmt_ method to properly handle the structure of IfThenElse nodes, enhancing the clarity and maintainability of the code.

* Add IndexLegalizer to enforce int64 for out-of-bound indices

- Introduced the IndexLegalizer class to ensure that indices in BufferStore and BufferLoad nodes are promoted to int64 when they exceed their type bounds.
- Refactored the Int64Promoter logic from flatten_buffer.cc into IndexLegalizer, improving code organization and reusability.
- Updated the ConfigIndexBitwidth pass to apply IndexLegalizer after rewriting the body, enhancing the handling of index bitwidths in transformations.

* [CI] Bind build-test CI to NVIDIA as AMD runners are being introduced (#718)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107

* Rename build-test job to build-test-nvidia and specify nvidia as a runner label in CI workflow.

* Update CI workflow to specify 'nvidia' as an additional runner label for the format-check job.

* fix: NVRTC backend (#717)

* fix: NVRTC backend

* fix: CI

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [CUDA] Init support for sm_120 (#716)

* Init support for sm120

* fmt

* resolve comments

* unify mma gemm

* fmt

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [CI] fix docs ci (#720)

* [Chore] fix typos (#719)

* chore: fix typos

* chore: fix ruff

* chore: fix clang-format

* [CI][AMD] Add AMD GPU CI and fix some related bugs (#694)

* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

* Add input configuration file and update AMD example script for enhanced flexibility

- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.

* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.

* Refactor AMD example script for FlashAttention-2

- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.

* Refactor formatting in AMD FlashAttention example script

- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.

* Update example_amd_flash_attn_fwd.py

* Update AMD FlashAttention example and TVM submodule

- Added a new example script `example_amd_flash_attn_fwd_k_block.py` for FlashAttention with K-blocking support.
- Enhanced `example_amd_flash_attn_fwd.py` by expanding configuration options for block sizes and threads.
- Updated the TVM submodule to the latest commit for improved functionality.
- Introduced a new test script `test.sh` to facilitate running the new example with specified parameters.

* Add CI workflow for automated format checking and testing

- Introduced a new GitHub Actions workflow in `amd_ci.yml` to automate format checks and testing for pull requests.
- The workflow includes steps for setting up a Python environment, running format checks, and executing tests.
- Removed obsolete example script `example_amd_flash_attn_fwd_k_block.py` and test script `test.sh` to streamline the examples directory.

* Rename CI workflow from "CI" to "AMD CI" for clarity and specificity.

* Update AMD CI workflow to include copying PyTorch, TorchVision, and Torchaudio packages to the virtual environment for improved dependency management.

* Update AMD CI workflow to install pytest directly instead of using requirements-test.txt

* Update AMD CI workflow to remove 'flash-attn' from requirements and install dependencies from requirements-test.txt

* Refactor AMD CI workflow to enhance clarity in removing 'flash-attn' from requirements-test.txt before installation

* Remove Torchaudio package copying from AMD CI workflow to streamline dependency management.

* Refactor AMD CI workflow to remove the format-check job and streamline the build-test process by directly copying PyTorch and TorchVision packages to the virtual environment.

* Add installation of ROCm in AMD CI workflow

- Included a step to execute the `install_rocm.sh` script for improved setup.
- Removed unnecessary blank line for better readability in the workflow script.

* Remove installation step for ROCm in AMD CI workflow to simplify the setup process.

* Update AMD CI workflow to run specific test file with verbose output instead of all tests.

* Add new tilelang built-in operations for AMD architecture

- Introduced `tvm_mfma`, `tvm_mfma_store`, `tvm_rdna_wmma`, and `tvm_rdna_wmma_store` built-in operations to enhance support for matrix multiplication and storage in tilelang.
- Each operation is configured with the appropriate number of inputs and marked as opaque in terms of call effects.

* Enhance autotuner configurations and GEMM operations in AMD example

- Updated block sizes and num_split_q parameters in `get_configs` for improved autotuning.
- Modified `T.gemm` calls in `fast_flashattn` to utilize `GemmWarpPolicy.FullRow`, optimizing performance for matrix multiplications.

* Update autotuner configurations in AMD example for enhanced performance

- Refined block sizes, thread counts, and added new parameters in `get_configs` to optimize autotuning.
- Adjusted `fast_flashattn` function to incorporate new parameters for panel size and coalesced widths, improving memory access patterns.

* Enhance autotuner configurations and memory handling in AMD example

- Expanded block sizes and thread counts in `get_configs` for improved autotuning capabilities.
- Updated `fast_flashattn` to utilize a new shared memory allocation strategy, optimizing memory access patterns during GEMM operations.

* Refine autotuner configurations and memory usage in AMD example

- Reduced block sizes and adjusted thread counts in `get_configs` for optimized autotuning.
- Updated `fast_flashattn` to utilize register fragments for accumulation, minimizing LDS usage and enhancing performance during GEMM operations.

* Update autotuner configurations in AMD example for enhanced performance

- Expanded block sizes and thread counts in `get_configs` to improve autotuning capabilities.
- Adjusted `num_split_q` and `v_coalesced_width` parameters for better optimization during GEMM operations.

* Enhance autotuner configurations and GEMM operations in AMD example

- Expanded thread counts in `get_configs` to include higher values for improved autotuning.
- Updated `fast_flashattn` to adjust accumulation logic and ensure proper handling of causal conditions, optimizing performance during matrix multiplications.

* Update AMD CI workflow and remove obsolete test script

- Modified the CI workflow to run on multiple environments: self-hosted, amd, and gpu.
- Deleted the outdated `test.sh` script from the examples directory, streamlining the project structure.

* Remove TVM subproject from 3rdparty directory

* Refactor configuration generation and accumulation logic in AMD example

- Reformatted the `get_configs` function for improved readability by aligning parameters.
- Adjusted the `fast_flashattn` function to enhance clarity in the conditional logic for accumulation, ensuring better handling of causal conditions.

* Enhance AMD CI workflow with additional logging and setup steps

- Added echo statements to provide feedback during the CI process, indicating when the environment is running on an AMD GPU, copying necessary packages, and installing requirements.
- Improved clarity in the workflow by explicitly stating when the project is being installed and when tests are being executed.

* Comment out package copying in AMD CI workflow to prevent potential issues during environment setup

* Update AMD CI workflow to install nightly versions of PyTorch and remove obsolete package copying steps

* Enhance BuildTileLangHIP function by adding whitespace for improved readability

* Refactor kTVMGridConstant definition for clarity and remove unnecessary comment

* Update TVM subproject to latest commit a64a5926a6e59f5417ef2501f9d88b467337cf6a

* lint fix

* Update AMD CI workflow to use requirements-rocm.txt for dependency installation

* fix ci

* Remove dependency on format-check from AMD CI workflow

* fix ci

* fix ci

* fix ci

* Remove format-check job from AMD CI workflow

* Add torch to requirements-rocm.txt and remove explicit pip install commands from AMD CI workflow

* Add dependency on format-check job in AMD CI workflow

* Add format-check job to AMD CI workflow

* Update format-check job in AMD CI workflow to run on self-hosted environment

* Enhance format-check job in AMD CI workflow with improved Python environment setup and automatic commit of lint changes

* Update amd_ci.yml

---------

Co-authored-by: xinxyxiao <xinyxiao@amd.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy (#724)

* [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy

* [Typo] Correct architecture selection for CUDA and CDNA

* [Refactor] Refactor CUDA code generation to simplify eviction policy handling (#721)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107

* Refactor CUDA code generation to simplify eviction policy handling

- Updated `VisitExpr_` methods in `codegen_cuda.cc` to use default eviction policy for `tma_load`, `tma_load_im2col`, and `tma_store` functions, reducing complexity.
- Removed conditional assembly code for `EVICT_NORMAL` in `copy_sm90.h`, streamlining the assembly calls for tensor memory operations.

* lint fix

* [Language] Introduce `StridedTensor` to support non contigious torch inputs (#722)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107

* Support strided tensors

* Refactor target attribute helper functions for improved clarity

* No code changes made in proxy.py and setup.py

* lint fix

* lint fix via gemini

* lint fix

* test fix

* test fix

* lint fix

* Update wrapper.py

* test fix

* Enhance test for InjectSoftwarePipeline by adding LowerOpaqueBlock transformation and updating expected function signature to use match_buffer for better clarity.

* lint fix

---------

Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

* [Enhancement][Bugfix] Fix bug in warp specialized pass and add gemm_sr fallback support for Hopper (#712)

* bug fix and support gemm_sr fallback for hopper

* Update gemm.cc

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* 📝 Add docstrings to `fix` (#726)

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/712#issuecomment-3190680851

The following files were modified:

* `src/op/gemm.cc`
* `src/tl_templates/cuda/gemm_sm90.h`
* `src/transform/warp_specialized_rewriter.cc`

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* [CI] Fix AMD CI (#729)

* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

* Add input configuration file and update AMD example script for enhanced flexibility

- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.

* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.

* Refactor AMD example script for FlashAttention-2

- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.

* Refactor formatting in AMD FlashAttention example script

- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.

* Update example_amd_flash_attn_fwd.py

* Enhance AMD example script and update CI workflows

- Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
- Added new CI workflows for AMD and documentation publishing.
- Updated various requirements files to include necessary dependencies.
- Introduced new test cases and examples for better coverage and functionality.
- Refactored existing code for improved readability and maintainability.

* Remove redundant tool cache cleanup step in AMD CI workflow

* Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.

---------

Co-authored-by: xinxyxiao <xinyxiao@amd.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

* [Feature] Low-bit twiddling dequantization and FP4 GEMM (#725)

* [Dequant] Add bit-twiddling dequantize cuda for fp4-->bf16

* [Dequant] Add extern call and serial dequantization

* [Dequant] Parallel Dequant wait for fence debug.

* [Scale] Add scale matrix to mxfp4 gemm

* [Remove] Remove fence-buggy example and some generated source cuda code

* [MXFP4] Update initial version of MXFP4 GEMM

* [Scale] Add scale to latest mxfp4 gemm

* [Lint]

* [BugFix] Load Scale, disable TMA to recover performance

* [Lint]

* [Lint]

* [Scale] Use L2 to hold Scale and enable TMA will slightly boost performance

* [Lint]

* Update example_dequant_gemm_bf16_fp4_hopper_serial.py

* Remove deprecated dequantization examples for BF16 and MXFP4 in the dequantize_gemm directory.

* Refactor dequantization examples for improved readability and consistency. Adjusted formatting in matmul function and added spacing for clarity. Updated function signatures and comments for better understanding.

* Refactor index_to_coordinates usage in bitnet example and update dequantization example configurations. Removed the custom index_to_coordinates function and replaced it with the built-in version. Adjusted block_K parameter in dequantization example for consistency.

* lint fix

* ci fix

* Remove non-existent example

* [BugFix] Add smem swizzle to recover performance of TMA

* [BugFix] Enough reg for producer when threads=512

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* 📝 Add docstrings to `mxfp4` (#732)

* 📝 Add docstrings to `mxfp4`

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/725#issuecomment-3191656561

The following files were modified:

* `examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py`
* `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py`
* `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py`
* `examples/dequantize_gemm/utils.py`
* `examples/gemm/example_gemm_autotune.py`
* `tilelang/intrinsics/utils.py`
* `tilelang/language/__init__.py`
* `tilelang/language/utils.py`
* `tilelang/quantize/mxfp.py`
* `tilelang/quantize/quantization.py`

* [Lint] More accurate docstring

* [Lint]

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: tzj-fxz <tzjfxz@gmail.com>

* [Refactor] Refactor env into a more flexible version (#740)

* Fix environment variable name for compilation print setting in `env.py`

* Remove deprecated test file for warp specialized pass configuration and refactor environment variable access in `env.py` to utilize a centralized `EnvVar` class for better management and clarity.

* lint fix

* Refactor cache check to use `env.is_cache_enabled()` for consistency in `tuner.py`

* [Enhancement] Add stride index validation in CythonKernelWrapper (#743)

* Introduced an assertion to ensure that the stride index is within the valid range of tensor dimensions in `cython_wrapper.pyx`.
* This change prevents potential out-of-bounds errors when accessing tensor dimensions, enhancing the robustness of the code.
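
A Python-level sketch of the added bounds check (the real assertion lives in `cython_wrapper.pyx`; the function and parameter names here are illustrative, not the actual code):

```python
# Hedged sketch of the stride-index validation; `tensor` is a torch.Tensor.
def checked_stride(tensor, stride_index: int) -> int:
    assert 0 <= stride_index < tensor.dim(), (
        f"stride index {stride_index} out of range for a {tensor.dim()}-D tensor")
    return tensor.stride(stride_index)
```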

* [Bugfix]:Fix atomic add auto vectorize memory access out of bound error (#742)

* [Bugfix]:Fix atomic add auto vectorize memory access out of bound error

* Update atomicadd_vectorize.cc

* format

* 📝 Add docstrings to PR #744 (#745)

* 📝 Add docstrings to `main`

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/742#issuecomment-3205103559

The following files were modified:

* `src/transform/atomicadd_vectorize.cc`

* lint fix

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Refactor] Refactor barrier management (#744)

* Introduce Barrier

* Enhance CUDA kernel with new barrier management and post-processing support

- Added a new CUDA kernel implementation in `example_mla_decode.py` for improved performance with shared memory barriers.
- Refactored barrier handling in `codegen_cuda.cc` and `codegen_hip.cc` to utilize a more flexible mbarrier structure.
- Updated intrinsic definitions from `ptx_stmatirx` to `ptx_stmatrix` across multiple files for consistency.
- Introduced additional print statements for debugging in the lowering phase of the TileLang engine.
- Enhanced the overall structure and readability of the codebase.

* Remove unused barrier handling code in CUDA and HIP code generators to streamline the implementation. This change enhances code clarity and reduces complexity in the barrier management logic.

* Enhance barrier management in TileLang

- Introduced a new intrinsic `allocate_barrier` for dynamic barrier allocation in the TileLang framework.
- Updated CUDA code generation to support the new barrier structure, allowing for improved synchronization in shared memory.
- Refactored existing barrier handling logic to accommodate the new intrinsic and streamline code.
- Added print statements for debugging purposes in various examples and the lowering phase of the TileLang engine.
- Removed deprecated memory scope handling code to enhance clarity and maintainability.

* lint fix

* lint fix

* Remove `allocate_barrier` intrinsic and related code from TileLang to streamline barrier management. This includes updates to CUDA code generation and the removal of associated Python wrappers, enhancing code clarity and maintainability.

* Refactor logging in JITKernel to improve kernel compilation tracking

- Removed unused import of `torch.backends` in the example file.
- Introduced logging for kernel compilation in `JITKernel`, replacing print statements with structured logging for better traceability and debugging.
- Added an assertion to ensure the presence of the `global_symbol` attribute in the kernel function.

* Refactor dequantization tests and update barrier function

- Removed the test for `example_dequant_gemm_bf16_fp4_hopper_serial` to streamline the testing suite.
- Updated the `mbarrier_cp_async_arrive` function to support both pointer and non-pointer types, enhancing flexibility in barrier management.

* Update CI configuration to increase pytest parallelism from 4 to 8 threads for improved test execution speed.

* Fix typos in rasterization parameters and update import path for cached module

- Corrected the spelling of `enable_rasteration` to `enable_rasterization` in the matmul function and its usage.
- Updated the import statement for the `cached` module to reflect the new path in the cache submodule.
- Added `StridedTensor` import in the language module for enhanced tensor functionality.

* Update ci.yml

* [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy (#746)

* [Refactor] Merge bulk copy into copy and refactor layout inference for bulk copy

* Deleted the `bulk_copy` operator implementation and its header file as it is no longer needed.
* Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping.
* Updated related files to reflect these changes, ensuring that the codebase remains clean and maintainable.

* lint fix

* Fix typos in intrinsic names and remove unused print statement in block_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency.

* remove bulk copy

* Refactor copy and atomic add operations to support TMA lower configuration

- Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations.
- Modified `Lower` method in `Copy` to incorporate the new TMA configuration.
- Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic.
- Removed unused `disable_tma_lower` field from `LowerArgs` structure for clarity.
- Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach.

* Enhance TMA bulk copy logic in `LowerBulkCopy` method

- Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling.
- Updated warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities.

* lint fix

* Remove fallback logging for non-swizzled global layout in `LowerBulkCopy` method to streamline the bulk copy logic. This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions.

* Enhance reshape kernel compilation in `run_reshape` and `run_reshape_smem_1d_2_2d` functions

- Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations.
- Added TODO comments to indicate the need for further improvements in shared memory handling.
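
A hedged sketch of that compile pattern; the prim_func below is a stand-in, and the pass-config key strings for disabling TMA lowering and warp specialization are assumptions rather than confirmed flag names:

```python
import tilelang
import tilelang.language as T

# Stand-in kernel: copy through shared memory, sized arbitrarily.
@T.prim_func
def copy_func(A: T.Tensor((128, 128), "float16"), B: T.Tensor((128, 128), "float16")):
    with T.Kernel(1, threads=128):
        B_shared = T.alloc_shared((128, 128), "float16")
        T.copy(A, B_shared)
        T.copy(B_shared, B)

# Assumed key strings for the two pass configs mentioned above.
kernel = tilelang.compile(
    copy_func,
    out_idx=[-1],
    pass_configs={
        "tl.disable_tma_lower": True,
        "tl.disable_warp_specialized": True,
    },
)
```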

* Update `native_sparse_attention` function to include TMA configuration options

- Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations.
- Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1.

* Refactor JIT decorator formatting in `native_sparse_attention` function

- Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase.
- No functional changes were made; this update focuses on code clarity and maintainability.

* Enhance thread management and logging in TileLang compilation

- Added a method to check if printing is enabled during compilation, improving control over logging behavior.
- Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output.
- Added comments to clarify the purpose of changes and improve code readability.

* Add warp specialization scope and refactor register management in TileLang

- Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management.
- Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process.
- Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management.
- Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process.

* Refactor test for InjectSetMaxNReg pass in TileLang

- Improved readability by restructuring conditional checks and assertions in the test cases.
- Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic.
- Ensured consistent formatting and spacing throughout the test functions for better maintainability.

* Enhance bulk copy and store checks in `Copy` class

- Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options.
- Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations.
- Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging.

* lint fix

* [Refactor] Merge ThreadPartialSync and ThreadStorageSync (#741)

* Remove `thread_partial_sync.cc` and refactor `thread_storage_sync.cc` to streamline synchronization handling. Introduce `thread_sync_types.h` for thread-bound key definitions and reserved named barriers. Update related logic in `ThreadSyncInserter` and `TileLangThreadSync` for improved clarity and efficiency.

* Remove `sync_thread_partial` references and related documentation from the codebase. Update CUDA and HIP code generation files to eliminate calls to the removed function. Refactor `__sync_thread_partial` to `sync_thread_partial` in CUDA common header for consistency.

* Remove unused import of `bulk_copy.h` in `codegen_hip.cc` to enhance code clarity and maintainability.

* Add import of `bulk_copy.h` in `codegen_hip.cc` to support new functionality.

* typo fix

* Update data type in reduce_sum tests from float16 to float32 for consistency and clarity. Remove redundant dtype tests and streamline run functions. Enhance reshape kernel compilation with pass configurations to address shared memory layout issues.

* lint fix

* test fix

* Enhance CI configuration by adding verbose output to pip install command for better visibility during installation.

* use ninja instead of make

* Add CMake configuration step for Ninja build system in setup.py

* Update pyproject.toml to include additional build dependencies: build, torch, tox, auditwheel, patchelf, and ninja.

* Enhance CI configuration by adding verbose output to pytest commands for improved test visibility.

* Update pyproject.toml to add Cython as a build dependency. Enhance thread storage synchronization in thread_storage_sync.cc by introducing new thread variable handling and improving index disjointness checks.

* Update data type in cumulative sum tests from float16 to float32 for consistency. Modify run_cumsum function to utilize the updated dtype and enhance result validation with assertions. Adjust test cases accordingly.

* Refactor storage access handling by introducing buffer data mapping in TileLangStorageAccessVisitor. Enhance access entry structure to include pointer access flag. Update thread storage synchronization to accommodate new buffer data mappings. Adjust quickstart example to print kernel source for debugging purposes.

* Refactor linear index conversion in TileLangStorageAccessVisitor to utilize the analyzer for simplification. Update buffer index calculations to ensure consistent simplification of range expressions.

* bugfix

* Refactor buffer index calculation in TileLangStorageAccessVisitor to simplify access handling. Removed unused buffer mapping logic, ensuring consistent buffer index generation with a default ramp.

* Refactor TileLangStorageAccessVisitor to replace buffer indices with buffer ranges for improved pointer access handling. Update AccessEntry structure to include buffer_ranges and adjust thread storage synchronization logic to account for pointer access conflicts.

* Refactor thread storage synchronization to replace 'shared.dyn' with 'shared' for consistency in memory allocation. Update related test cases to reflect this change and ensure proper functionality.

* [Enhancement] Optimize loop body handling in IR (#749)

- Updated the loop body construction in `ir.cc` to conditionally include an output statement based on the analyzable condition of the `waves` variable.
- This change enhances performance by avoiding unnecessary statement wrapping when the condition is met, improving the efficiency of loop execution.

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [MXFP4] Fix bugs and optimize exponential operation (#750)

* [MXFP4] Fix bugs
- Optimize exp2 with a shift operation to boost performance (see the sketch below)
- Fix a bug in the simple dequantization function call
- Fix a bug in the scaling factor with bias
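
The exp2-by-shift trick works because, for an integer k in the normal range, 2.0**k in IEEE binary32 is exactly the bit pattern (k + 127) << 23 placed in the exponent field. The PR applies this on the device side; the NumPy sketch below only illustrates the idea:

```python
import numpy as np

def exp2_int(k: np.ndarray) -> np.ndarray:
    # Build the float32 bit pattern with k in the exponent field (bias 127).
    bits = (k.astype(np.int32) + 127) << 23
    return bits.view(np.float32)

k = np.array([-3, 0, 5], dtype=np.int32)
assert np.allclose(exp2_int(k), np.exp2(k.astype(np.float32)))
```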

* [Lint]

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h (#751)

- Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations.
- Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations.

* [Enhancement] Add shape checking for reduce options (#748)

* Add shape checking for reduce options

* lint fix

* Handle special case reducing into shape-1 tensor

Allow reducing [X, d, Y] into [X, Y] or [X, 1, Y]
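
A minimal sketch of the accepted shapes, assuming the existing `T.reduce_sum(src, dst, dim)` style of helper; buffer sizes and kernel scaffolding are illustrative:

```python
import tilelang.language as T

@T.prim_func
def reduce_mid(A_in: T.Tensor((64, 8, 32), "float32"),
               Out: T.Tensor((64, 32), "float32")):
    with T.Kernel(1, threads=128) as bx:
        A = T.alloc_fragment((64, 8, 32), "float32")
        B = T.alloc_fragment((64, 32), "float32")      # [X, Y]
        C = T.alloc_fragment((64, 1, 32), "float32")   # [X, 1, Y]
        T.copy(A_in, A)
        T.reduce_sum(A, B, dim=1)   # reduce [X, d, Y] into [X, Y]
        T.reduce_sum(A, C, dim=1)   # reduce [X, d, Y] into [X, 1, Y]
        T.copy(B, Out)
```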

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Bugfix] Add missing FP8 header include (#752)

* [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h

- Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations.
- Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations.

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Enhancement] Include cuda_fp8.h in gemm_sm90.h

- Added the inclusion of the "cuda_fp8.h" header file to support new data formats in CUDA GEMM operations, enhancing compatibility with recent updates for fp8 types.

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* lint fix

* [Refactor] Remove unused tl_shuffle_elect and related functions from common.h

- Deleted the `tl_shuffle_elect` function and its associated comments to streamline the codebase.
- Added inclusion of "intrin.h" for improved intrinsic support in CUDA operations.
- Cleaned up the file by removing unnecessary template parameters and functions, enhancing clarity and maintainability.

* lint fix

* [Refactor] Update header inclusions in common.h and gemm_sm90.h

- Removed the inclusion of "intrin.h" from common.h to streamline dependencies.
- Added "intrin.h" inclusion in gemm_sm90.h to ensure intrinsic support for CUDA operations, enhancing functionality and maintainability.

* bug fix

* [MXFP4] Add bias to MXFP4 GEMM kernel (#753)

* [MXFP4] Add bias to gemm kernel

* [Lint]

* [Lint] Rename "bias" to "Bias"

* [Bugfix][WS] Consider loop min extent when computing phase id (#754)

* Update test parameters and remove debug print statement

- Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution.
- Removed a debug print statement from `phase.py` to clean up the code and enhance clarity.

* Refactor loop stack management in warp_specialized_rewriter

- Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability.
- Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability.
- Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations.

* [Typo] Remove `disable_cache` in some tests (#755)

* Update test parameters and remove debug print statement

- Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution.
- Removed a debug print statement from `phase.py` to clean up the code and enhance clarity.

* Refactor loop stack management in warp_specialized_rewriter

- Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability.
- Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability.
- Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations.

* Remove unused `torch.backends` import and `tilelang.disable_cache()` calls from multiple test files to enhance code clarity and maintainability.

* [README] Update GDN README for clarity and add acknowledgements (#758)

- Improved formatting and clarity of the GDN kernel implementation description.
- Updated requirement section to list dependencies in a clearer format.
- Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions.

* cutlass v4.2.0 supporting cuda 13 (#760)

* [Feature] Add 1D TMA support (#761)

* [Feature] Add 1D TMA support
- Check the contiguous conditions of 1D TMA copy
- Add new interface and params order of `tma_load` and `tma_store` call
- Add 1D `tma_store` interface in sm90 template
- Add elementwise kernel for 1D TMA example
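
A minimal 1D copy sketch in the tilelang DSL whose shared/global transfers are contiguous and size-matched, i.e. the access shape the 1D TMA path targets; the decorator and helper signatures follow the usual tilelang examples and are assumptions, not the exact elementwise example added in this PR:

```python
import tilelang.language as T

def copy_1d(N=8192, block_N=1024, dtype="float16"):

    @T.prim_func
    def main(A: T.Tensor((N,), dtype), B: T.Tensor((N,), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), threads=128) as bx:
            A_shared = T.alloc_shared((block_N,), dtype)
            T.copy(A[bx * block_N], A_shared)   # contiguous global -> shared bulk load
            T.copy(A_shared, B[bx * block_N])   # contiguous shared -> global bulk store

    return main
```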

* [Lint]

* [BugFix] Add conditions for 1D TMA copy on non-swizzled shared tensors

* [Lint]

* [BugFix] 1D TMA load

* [README] Update GDN README for clarity and add acknowledgements (#758)

- Improved formatting and clarity of the GDN kernel implementation description.
- Updated requirement section to list dependencies in a clearer format.
- Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions.

* cutlass v4.2.0 supporting cuda 13 (#760)

* [Lint]

* [Lint]

* [MXFP4] Add test for bf16&mxfp4 gemm

* [BugFix]

* [Lint]

---------

Co-authored-by: Yu Cheng <54519279+chengyupku@users.noreply.github.com>
Co-authored-by: Johnny <johnnync13@gmail.com>

* [Example] Add vertical slash sparse attention pattern (#762)

* upd sparse attn

* lint

* rename

* update test file

* update benchmark

* lint

* update benchmark

* [Bugfix] Address PassContext contamination from CI and fix incorrect rewrites in warp specialized pass (#767)

* fix ci and pass bug

* fix

* try

* lint

* [MXFP4] Add 1D TMA copy for Scale tensor in MXFP4 GEMM (#766)

* [TMA] Add 1D TMA copy for Scale tensor

* [Lint]

* [Test] Add test for kernel

* [BugFix]

* hot fix blackwell (#768)

* [Refactor] Refactor `Operator` into `TileOperator` and with tvm reflection (#763)

* Refactor operator classes to inherit from TileOperator and update layout inference methods

- Changed base class of several operator classes (AtomicAdd, Copy, Gemm, etc.) from Operator to TileOperator for better alignment with tile operations.
- Updated InferLayout and Lower methods to use 'override' specifier for clarity and consistency.
- Adjusted header inclusions to replace "op.h" with "operator.h" across multiple files for improved organization.
- Added missing layout inference implementations for Fill and Conv2DIm2ColOp.
- Removed deprecated op.cc and op.h files to streamline the codebase.

* lint fix

* Refactor operator classes to use Node pattern and improve memory management

- Updated several operator classes (AtomicAdd, Copy, Gemm, etc.) to utilize the Node pattern for better memory management and encapsulation.
- Changed constructors to initialize member variables through a node object, enhancing clarity and reducing direct member access.
- Updated Clone methods to return TileOperator instances instead of unique pointers, aligning with the new design.
- Refactored InferLayout and Lower methods to ensure consistency across operator implementations.
- Adjusted header files to reflect the new class structure and removed deprecated code for a cleaner codebase.

* Enhance Clone methods in AtomicAdd and Copy classes to support parallel operation cloning

- Updated the Clone methods in AtomicAddNode and CopyNode to ensure that the parallel operation (par_op_) is properly cloned when defined, improving the integrity of cloned objects.
- Refactored the FillNode class to use ParallelOp directly instead of std::make_unique, streamlining the creation of parallel operations.
- Made minor adjustments in layout inference and other related methods for consistency and clarity.

* Refactor FillNode::Lower method to remove unused global function call

- Eliminated the call to the global function "tl.fill.lower" in the FillNode::Lower method, streamlining the code and improving clarity.
- Retained the core functionality of the method while enhancing maintainability by reducing unnecessary dependencies.

* [Reducer] Introduce `alloc_reducer` to separate inter and intra warp reduction (#757)

* [Enhancement] Introduce finalize_reducer operator and layout reducer support

- Added `FinalizeReducer` operator to handle reduction finalization in the TileLang framework, allowing for efficient reduction operations.
- Implemented layout inference for local.reducer buffers, enhancing the handling of layout mappings and reducing complexity in buffer management.
- Updated `setup.py` to include logging for build directory paths, improving build process visibility.
- Enhanced atomic operations with new functions for atomic max, min, load, and store, providing more robust atomicity control in memory operations.
- Refactored parallel loop handling to incorporate reducer information, ensuring proper management of reduction operations in parallel contexts.
- Cleaned up test cases by removing unnecessary cache disabling and optimizing test parameters for better performance.
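
A minimal sketch of the intended usage pattern, assuming an `alloc_reducer(shape, dtype, op=...)` helper alongside the exported `finalize_reducer`; buffer names, tile sizes, and the exact signatures are illustrative rather than the code from this PR:

```python
import tilelang.language as T

M, N, block_M = 4096, 4096, 128

@T.prim_func
def row_sum(A: T.Tensor((M, N), "float32"), Out: T.Tensor((M,), "float32")):
    with T.Kernel(T.ceildiv(M, block_M), threads=128) as bx:
        acc = T.alloc_reducer((block_M,), "float32", op="sum")  # assumed signature
        T.clear(acc)
        for i, j in T.Parallel(block_M, N):
            acc[i] += A[bx * block_M + i, j]   # per-thread partial sums
        T.finalize_reducer(acc)                # combine partials across threads/warps
        T.copy(acc, Out[bx * block_M])
```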

* Refactor code formatting and improve readability in multiple files

- Cleaned up whitespace in `setup.py` to enhance logging clarity.
- Reformatted `AtomicMax` and `AtomicMin` functions in `common.h` for better alignment and readability.
- Adjusted `debug_print_var` function in `debug.h` to improve code structure and maintainability.
- Enhanced readability of the `atomic_add` function in `customize.py` by breaking long lines for better clarity.
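
A hedged sketch of the extended atomic surface (`atomic_max`/`atomic_min`, with `atomic_load`/`atomic_store` added alongside, and an optional `memory_order` keyword on `atomic_add`); the kernel scaffolding, shapes, and the `"relaxed"` string are illustrative assumptions:

```python
import tilelang.language as T

N, threads = 1024, 128

@T.prim_func
def stats(X: T.Tensor((N,), "float32"), S: T.Tensor((3,), "float32")):
    with T.Kernel(T.ceildiv(N, threads), threads=threads) as bx:
        for i in T.Parallel(threads):
            v = X[bx * threads + i]
            T.atomic_add(S[0], v, memory_order="relaxed")  # memory_order keyword assumed
            T.atomic_max(S[1], v)                          # new atomic max
            T.atomic_min(S[2], v)                          # new atomic min
```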

* Remove debug print statements from `copy.cc` and `inject_tma_barrier.cc` to enhance code clarity and maintainability.

* [Enhancement] Disable reuse of small arrays in shared memory allocation

- Added logic to prevent the reuse of small arrays (<= 32 bits) in `merge_shared_memory_allocations.cc`, ensuring they are lowered to registers in LLVM for improved performance and memory management.

* Refactor `setup.py` to remove duplicate logging statements and enhance clarity. Update `finalize_reducer` function documentation in `reduce.py` to include detailed parameter and return descriptions, improving code readability and maintainability.

* Refactor `finalize_reducer` and `reduce` functions to remove redundant target checks. Simplified conditionals by retaining only the `TargetIsHopper` check, enhancing code clarity and maintainability.

* bug fix

* Add thread checks workaround for replicated cases

* Remove the is_one check

* fix lint error

* lint fix

* Update autotune tests to use smaller matrix sizes for improved performance and reliability

* [Refactor] Update FinalizeReducer to FinalizeReducerOp and adjust related methods

- Refactored FinalizeReducer class to FinalizeReducerOp, updating constructor and method signatures for consistency with the new TileOperator structure.
- Enhanced layout inference and cloning methods in FinalizeReducerOpNode.
- Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main.
- Adjusted header inclusions for improved organization and clarity across multiple files.

* [Refactor] Update atomic operations in common.h and modify test_example_flash_attention.py

- Enhanced atomic operations (Add, Min, Max) in common.h to handle half and bfloat16 types more efficiently.
- Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main, improving test organization.

* [Refactor] Simplify CopyNode::LowerBulkCopy logic and update test execution

- Removed redundant checks for contiguous memory access in CopyNode::LowerBulkCopy, streamlining the logic for TMA copy operations.
- Updated test_tilelang_kernel_gemm.py to comment out the main testing function and call a specific test for i8i8i32 tensor operations instead, improving test focus.

---------

Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com>
Co-authored-by: Freebase6912 <amid-gauze-racing@duck.com>

* 📝 Add docstrings to `pytile_0826` (#770)

* 📝 Add docstrings to `pytile_0826`

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/763#issuecomment-3224197814

The following files were modified:

* `src/op/atomic_add.cc`
* `src/op/atomic_add.h`
* `src/op/copy.cc`
* `src/op/copy.h`
* `src/op/elem.cc`
* `src/op/elem.h`
* `src/op/gemm.cc`
* `src/op/gemm.h`
* `src/op/gemm_sp.cc`
* `src/op/gemm_sp.h`
* `src/op/operator.cc`
* `src/op/operator.h`
* `src/op/parallel.cc`
* `src/op/parallel.h`
* `src/op/reduce.cc`
* `src/op/reduce.h`
* `src/op/region.cc`
* `src/op/region.h`
* `src/transform/layout_inference.cc`
* `src/transform/lower_tile_op.cc`

* lint fix

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Bugfix]:Fix atomic add auto vectorize negative optimization (#765)

* [Bugfix]:Fix atomic add auto vectorize negative optimization

* fixbug

* format

* fix bug

* 📝 Add docstrings to `reducer_0825` (#772)

* 📝 Add docstrings to `reducer_0825`

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/757#issuecomment-3219088118

The following files were modified:

* `setup.py`
* `src/op/builtin.h`
* `src/op/finalize_reducer.cc`
* `src/op/finalize_reducer.h`
* `src/op/parallel.cc`
* `src/op/parallel.h`
* `src/op/reduce.cc`
* `src/target/codegen_cuda.cc`
* `src/tl_templates/cuda/common.h`
* `src/transform/layout_inference.cc`
* `src/transform/layout_reducer.cc`
* `src/transform/layout_reducer.h`
* `src/transform/merge_shared_memory_allocations.cc`
* `src/transform/storage_access.cc`
* `src/transform/warp_specialized_rewriter.cc`
* `testing/python/autotune/test_tilelang_autotune_with_inputs.py`
* `tilelang/engine/phase.py`
* `tilelang/language/customize.py`
* `tilelang/language/reduce.py`
* `tilelang/transform/__init__.py`

* lint fix

* lint fix

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* Allow fill global buffer (#774)

* Allow fill global buffer
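
A minimal sketch of what this change permits, namely `T.fill` targeting a global buffer directly; sizes and scaffolding are illustrative:

```python
import tilelang.language as T

@T.prim_func
def zero_out(A: T.Tensor((1024, 1024), "float32")):
    with T.Kernel(1, threads=128):
        T.fill(A, 0)   # fill a global buffer in place
```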

* fix lint error

* [BugFix] Refactor the op check in LowerTileOp pass using the member function instead of string match (#771)

* [BugFix] Refactor the op check in LowerTileOp pass using the member function instead of string match

* [Lint]

* add bf16 exp fallback (#776)

* [Lint] Introduce clang-tidy into format.sh (#777)

* [Refactor] Update Clang-Tidy Checks and Improve Code Consistency

- Enhanced .clang-tidy configuration by adding specific checks for better bug detection and performance optimization.
- Refactored function signatures across multiple files to use `const` references for parameters, improving performance and code clarity.
- Updated various methods to ensure consistent handling of parameters, particularly in `AddPredicate`, `Substitute`, and `PlanLoopPartition` functions.
- Improved readability by replacing size checks with `empty()` method calls in several locations, ensuring clearer intent in the code.
- General code cleanup and adherence to best practices for better maintainability.

* [Refactor] Enhance Code Consistency and Clang-Tidy Configuration

- Updated .clang-tidy configuration to include additional checks for improved code quality and performance.
- Refactored function signatures across multiple files to use `const` references, enhancing performance and clarity.
- Replaced size checks with `empty()` method calls in various locations for clearer intent.
- Improved handling of parameters in several functions, ensuring consistent usage of `std::move` where applicable.
- General code cleanup to adhere to best practices and improve maintainability.

* [Refactor] Integrate Clang-Tidy Checks and Enhance Code Consistency

- Added clang-tidy checks to the format script for improved code quality assurance.
- Refactored function signatures across multiple files to consistently use `const` references, enhancing performance and clarity.
- Updated the requirements-lint.txt file to include clang-tidy as a dependency.
- General code cleanup to adhere to best practices and improve maintainability.

* [CI] Update AMD CI Workflow to Include Build Directory Creation

- Added steps to create a build directory and configure CMake with ROCm support during the format check process.
- Ensured cleanup of the build directory after the format check to maintain a clean workspace.

* [Refactor] Remove Unused Member Variables in AtomicAddNode and CopyNode

- Removed the `args_` member variable from both `AtomicAddNode` and `CopyNode` classes to streamline the code and eliminate unnecessary data members.
- This change enhances code clarity and maintainability by focusing on relevant attributes for each class.

* [Refactor] Update Clang-Tidy Integration and Code Improvements

- Modified the format script to include the `-fix` option in the clang-tidy command for automatic code fixes.
- Refactored the `AtomicAddVectorizePlanner` class to improve variable handling and consistency, including changes to member variable types and function signatures.
- Enhanced code clarity by removing unnecessary `std::move` calls and ensuring consistent usage of types across the class.
- General code cleanup to adhere to best practices and improve maintainability.

* [Refactor] Improve Parameter Handling and Consistency in AtomicAddVectorize

- Updated function signatures in `AtomicAddVectorizePlanResult` and `AtomicAddVectorizeRewriter` to use `const` references and `std::move` for better performance and clarity.
- Enhanced the `UpdateVectorSize` method to accept `const Array<PrimExpr>&` for improved efficiency.
- General code cleanup to maintain consistency and adhere to best practices.

* [CI] Add Git Submodule Initialization to CI Workflow

- Included a step to initialize and update git submodules recursively in the CI workflow.
- This change ensures that all necessary submodules are available during the format check process, improving build reliability.

* [CI] Add Git Submodule Update Step to Format Check

- Included a command to initialize and update git submodules recursively in the CI workflow during the format check process.
- This enhancement ensures that all required submodules are available, contributing to improved build reliability.

* [Refactor] Update Function Signatures in AtomicAddVectorize

- Modified the `VectorizeAtomicAdd` function signature to use `const` references for `thread_var` and `thread_bounds`, enhancing performance and code clarity.
- This change aligns with previous refactoring efforts to improve parameter handling and consistency across the codebase.

* [Cache] Introduce detailed target information for the disk kernel cache (#780)

* Fix type hint for target_host parameter in compile function to allow None value

* Refactor target handling in compile function to utilize determine_target for improved clarity and consistency

* Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code.

* Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity.

* Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling.

* Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits.

* [Example] Adds example for top-k operation (#775)

* [Example] Adds example for top-k operation

Adds an example demonstrating the top-k operation using tilelang
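
For validation, the tilelang kernel's output can be checked against PyTorch's built-in top-k (shapes below are illustrative, not those of the example):

```python
import torch

x = torch.randn(32, 1024, device="cuda", dtype=torch.float16)
ref_values, ref_indices = torch.topk(x, k=8, dim=-1)  # reference values/indices
```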

* format

* Adds topk tilelang example test

* fix lint

* [Math] Dispatch `T.rsqrt(x)` into cuda intrin instead of `1 / T.sqrt(x)` (#781)

* Fix type hint for target_host parameter in compile function to allow None value

* Refactor target handling in compile function to utilize determine_target for improved clarity and consistency

* Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code.

* Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity.

* Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling.

* Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits.

* Add intrin_rule source files to CMakeLists.txt and implement hrsqrt function for half_t in common.h
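
A small elementwise sketch using `T.rsqrt`, which after this change lowers to the CUDA rsqrt intrinsic (`hrsqrt` for `half_t`) instead of `1 / T.sqrt(x)`; the kernel scaffolding and shapes are illustrative:

```python
import tilelang.language as T

N, block_N = 4096, 256

@T.prim_func
def rsqrt_kernel(X: T.Tensor((N,), "float16"), Y: T.Tensor((N,), "float16")):
    with T.Kernel(T.ceildiv(N, block_N), threads=128) as bx:
        for i in T.Parallel(block_N):
            Y[bx * block_N + i] = T.rsqrt(X[bx * block_N + i])
```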

* lint fix

* remove cmake dep in pyproject as it may lead to different cmake paths in diff stages

* lint fix

* Add cmake dependency to pyproject.toml and improve build logging in setup.py

* [CI] Adds pytest-durations for test timing (#782)

* [Ci] Adds pytest-durations for test timing

Adds `pytest-durations` to the test requirements and configures pytest to display test durations.

This helps in identifying slow-running tests and optimizing the test suite for faster feedback.

* add amd ci durations

* Removes flash_attn installation from CI

* [Refactor] Support python reflection for tile operators (#783)

* Implement Fill operator and related reflection methods in TileLang

- Added Fill operator implementation in `fill.cc` and `fill.h` for element-wise filling of buffers.
- Introduced reflection methods for Fill, AtomicAdd, Copy, Conv2DIm2Col, FinalizeReducer, Gemm, and Parallel operators to enhance introspection capabilities.
- Updated relevant files to register reflection methods and ensure proper initialization in static blocks.
- Removed outdated comments and unnecessary code in various operator files to improve clarity and maintainability.
- Added new Python bindings for the Fill operator in `tilelang/ir/fill.py` and updated the module imports accordingly.

* Refactor operator reflection methods and improve code clarity

- Updated reflection methods for AtomicAdd, Copy, FinalizeReducer, Gemm, and Parallel operators to enhance readability by using `empty()` instead of size checks.
- Consolidated static initialization blocks for various operators to a single line for improved consistency.
- Cleaned up whitespace and formatting in multiple files to adhere to coding standards and improve maintainability.
- Added new Python bindings for operators in the `tilelang/ir` module, ensuring proper registration and organization of imports.

* Refactor GEMM and AtomicAdd operations for improved clarity

- Updated the `GetArchInt` function in `atomic_add.cc` to use `std::string` and `std::stoi` for better readability and type safety.
- Removed unnecessary variables and comments in `gemm_sp.cc` and `gemm.cc` to streamline the `ComputeWarpPartition` method.
- Cleaned up the `layout_reducer.cc` file by removing unused variable declarations, enhancing code clarity.
- Added import for the `ir` module in `tilelang/__init__.py` to ensure proper organization of module imports.

* Remove deprecated operator files from the tilelang IR module

- Deleted files for Fill, AtomicAdd, Copy, Gemm, GemmSP, FinalizeReducer, Parallel, Reduce, and Region operators to streamline the codebase.
- This cleanup enhances maintainability by removing unused code and improving overall organization of the module.

* Refactor imports in tilelang IR module for improved organization

- Updated import statements in `tilelang/ir.py` to reflect changes in the TVM library structure, enhancing clarity and maintainability of the codebase.

* lint fix

* Refactor GEMM and GEMM-SP operations to enhance clarity and maintainability

- Updated the `Gemm` and `GemmSP` classes to utilize a new `GemmWarpPolicy` object for warp partitioning, improving encapsulation and readability.
- Removed deprecated `ComputeWarpPartition` methods and replaced them with calls to the new policy object, streamlining the code.
- Cleaned up comments and unnecessary code in `gemm.cc`, `gemm_sp.cc`, and related header files to enhance overall clarity.
- Introduced a new `GemmWarpPolicyNode` class to manage warp policy attributes and methods, facilitating better organization of related functionalities.
- Updated reflection methods to include the new policy structure, ensuring proper registration and introspection capabilities.

* Refactor Reduce operation to utilize ReduceType class for improved clarity and maintainability

- Replaced multiple conditional checks for reduce types with a single ReduceType object, simplifying the code structure.
- Introduced a new ReduceTypeNode class to encapsulate reduce type logic and methods, enhancing organization.
- Updated MakeInitValue, MakeReduce, and Lower methods to leverage the new ReduceType class, improving readability.
- Added Python bindings for the ReduceType class in tilelang IR module to ensure proper registration and usability.

* comment

* Refactor operator header files for improved readability

- Cleaned up formatting and whitespace in `atomic_add.h`, `copy.h`, `fill.h`, `reduce.cc`, and `reduce.h` to enhance code clarity.
- Consolidated comments and adjusted line breaks for better organization and maintainability across multiple operator definitions.

* Refactor MakeReduce method in ReduceOpNode for clarity

- Updated the parameter name in the MakeReduce method from `rhs` to `b` and assigned it to `rhs` for improved readability.
- This change enhances the clarity of the method's purpose and aligns with the overall refactoring efforts in the Reduce operation.

* Update Reduce operation type checks for consistency

- Changed string comparisons for reduce types in the MakeReduce method from "abs_sum" to "abssum" and "abs_max" to "absmax" for uniformity.
- This adjustment enhances the clarity and consistency of the reduce type handling in the codebase.

* [AMD] Fix AMD TIR & add examples (#784)

* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

* Add input configuration file and update AMD example script for enhanced flexibility

- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.

* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.

* Refactor AMD example script for FlashAttention-2

- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.

* Refactor formatting in AMD FlashAttention example script

- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.

* Update example_amd_flash_attn_fwd.py

* Enhance AMD example script and update CI workflows

- Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
- Added new CI workflows for AMD and documentation publishing.
- Updated various requirements files to include necessary dependencies.
- Introduced new test cases and examples for better coverage and functionality.
- Refactored existing code for improved readability and maintainability.

* Remove redundant tool cache cleanup step in AMD CI workflow

* Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.

* Add new AMD FlashAttention example and test script

- Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
- Added `test.sh` script to facilitate running the new example with specified parameters.
- Enhanced the overall structure and organization of the example for better clarity and usability.

* Update configurations in `example_amd_flash_attn_fwd.py` for autotuner

- Reduced the number of threads and `num_split_q` options for improved performance.
- Adjusted `panel_size` options to streamline configuration settings.

* Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217

* Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c

* Add example for AMD Flash Attention backward pass implementation

- Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
- Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
- Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
- Included reference implementation for validation against PyTorch's attention mechanism.

This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.

* Enhance AMD Flash Attention example with additional testing capabilities

- Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
- Improved the main function to allow for better parameter configuration and benchmarking.
- Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.

This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.

* Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a

* Refactor HIP intrinsic rules to CUDA

- Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
- Adjusted include paths for better organization and clarity in the code structure.

* Update AMD CI workflow to uninstall specific PyTorch packages before installation

- Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
- Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.

* Remove unused shared memory allocations in AMD Flash Attention backward example

- Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
- This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.

* Remove unnecessary pip uninstall command from AMD CI workflow

- Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
- This change simplifies the CI process and reduces potential overhead during package management.

* Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules

- Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
- Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.

* Refactor formatting of HIP intrinsic rule registrations

- Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
- No functional changes were made; this update focuses on code style improvements to enhance maintainability.

* Update file na…