[Refactor] Merge ThreadPartialSync and ThreadStorageSync #741
Conversation
… to streamline synchronization handling. Introduce `thread_sync_types.h` for thread-bound key definitions and reserved named barriers. Update related logic in `ThreadSyncInserter` and `TileLangThreadSync` for improved clarity and efficiency.
…m the codebase. Update CUDA and HIP code generation files to eliminate calls to the removed function. Refactor `__sync_thread_partial` to `sync_thread_partial` in CUDA common header for consistency.
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Removed the tl.sync_thread_partial builtin and its Python wrapper; deleted the ThreadPartialSync transform/pass; added thread_sync_types.h; refactored thread-storage sync and storage-access pointer handling; replaced partial-thread sync lowering with mbarrier-style externs in HIP (the CUDA path was simplified); updated tests, CI, and build logic.
Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant Py as Python API
  participant TL as TileLang IR / Transform
  participant CG as CodeGen (CUDA/HIP)
  participant HW as Runtime/Hardware
  rect #eef8ff
    note left of Py: Old flow (with tl.sync_thread_partial)
    Py->>TL: emit tl.sync_thread_partial(...)
    TL->>CG: lower CallNode to target-specific NamedBarrier/extern
    CG->>HW: call NamedBarrier / parity-wait extern
  end
  rect #fff8e6
    note left of Py: New flow (partial-sync removed, mbarrier-based on HIP)
    Py--x TL: no sync_thread_partial emitted
    TL->>CG: emit mbarrier helpers / create/get/arrive/wait/init/cp_async/expect_tx
    CG->>HW: call mbarrier_* externs (arrive, wait, init, cp_async, expect_tx)
  end
```

```mermaid
sequenceDiagram
  autonumber
  participant Pipeline as OptimizeForTarget
  participant Passes as PipelineSteps
  rect #eef8ff
    note right of Pipeline: Before
    Pipeline->>Passes: ... -> ThreadPartialSync -> InferFragment -> LowerThreadAllreduce -> ...
  end
  rect #fff8e6
    note right of Pipeline: After
    Pipeline->>Passes: ... -> InferFragment -> LowerThreadAllreduce -> ...
  end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Summary of Changes
Hello @LeiWang1999, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request refactors the thread synchronization mechanisms by merging the ThreadPartialSync functionality directly into the ThreadStorageSync pass. This change streamlines the codebase by eliminating a dedicated partial synchronization pass and consolidating related logic, leading to a more unified and potentially simpler approach to managing thread barriers and memory synchronization within the system.
Highlights
- Elimination of ThreadPartialSync Pass: The dedicated ThreadPartialSync transformation pass and its associated C++ and Python implementations have been completely removed.
- Consolidation of Synchronization Logic: The concepts and logic for partial thread synchronization, previously handled by the separate ThreadPartialSync pass, have been integrated into the existing ThreadStorageSync framework.
- Centralization of Synchronization Types: Common synchronization-related data structures, such as ThreadBoundKey and ReservedNamedBarriers, are now centralized in a new shared header file, improving code organization and reusability.
- Simplification of ThreadStorageSync: The implementation of ThreadStorageSync has been simplified by removing redundant partial synchronization logic, leading to a cleaner and more unified codebase.
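To make the new header concrete, here is a rough, self-contained reconstruction of the centralized types. The field names, the reserved-barrier values, and the 31-multiplier hash come from snippets quoted elsewhere in this review; everything else (member ordering, the `ThreadBoundKeyHash` functor name) is an illustrative assumption, not the actual `thread_sync_types.h`:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Illustrative reconstruction of the centralized thread-sync types.
struct ThreadBoundKey {
  int64_t tx_min, tx_max;
  int64_t ty_min, ty_max;
  int64_t tz_min, tz_max;

  bool operator==(const ThreadBoundKey &other) const {
    return tx_min == other.tx_min && tx_max == other.tx_max &&
           ty_min == other.ty_min && ty_max == other.ty_max &&
           tz_min == other.tz_min && tz_max == other.tz_max;
  }
};

// Barrier IDs 0-2 are reserved; freely allocatable barriers start at 3.
enum class ReservedNamedBarriers {
  kSyncThreads = 0,
  kReduce_0 = 1,
  kReduce_1 = 2,
  kFirstUsedBarrier = kReduce_1 + 1
};

// Hash functor matching the 31-multiplier scheme quoted in this review.
struct ThreadBoundKeyHash {
  size_t operator()(const ThreadBoundKey &k) const {
    size_t h = std::hash<int64_t>()(k.tx_min);
    h = h * 31 + std::hash<int64_t>()(k.tx_max);
    h = h * 31 + std::hash<int64_t>()(k.ty_min);
    h = h * 31 + std::hash<int64_t>()(k.ty_max);
    h = h * 31 + std::hash<int64_t>()(k.tz_min);
    h = h * 31 + std::hash<int64_t>()(k.tz_max);
    return h;
  }
};
```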
Code Review
This pull request refactors the thread synchronization logic by merging ThreadPartialSync into ThreadStorageSync. This simplifies the pass pipeline by removing the separate ThreadPartialSync pass. A new header src/transform/common/thread_sync_types.h is introduced to centralize common types like ThreadBoundKey, which is a good improvement.
The changes look good overall, but I have identified two issues:

- The HIP backend (`CodeGenTileLangHIP`) has not been updated to handle the rewritten partial synchronization calls, which could lead to incorrect code generation.
- The new hash function for `ThreadBoundKey` could be improved for consistency by using the `tvm::support::HashCombine` utility.

Please see my detailed comments for suggestions.
```cpp
stmt =
    ThreadSyncInserter(sync_scope, planner.syncs_inserted_)(std::move(stmt));
```
This refactoring merges ThreadPartialSync into ThreadStorageSync. The ThreadPartialSyncRewriter called later in this function can rewrite tvm_storage_sync calls to include extra arguments for partial synchronization (barrier ID and thread count). However, the HIP backend's CodeGenTileLangHIP::PrintStorageSync is not updated to handle these extra arguments. It only handles the single-argument version for full __syncthreads(). This will lead to incorrect code generation for HIP when partial synchronization is required. Please update CodeGenTileLangHIP::PrintStorageSync to correctly handle the rewritten calls.
```cpp
size_t operator()(const tvm::tl::ThreadBoundKey &k) const {
  size_t h = std::hash<int64_t>()(k.tx_min);
  h = h * 31 + std::hash<int64_t>()(k.tx_max);
  h = h * 31 + std::hash<int64_t>()(k.ty_min);
  h = h * 31 + std::hash<int64_t>()(k.ty_max);
  h = h * 31 + std::hash<int64_t>()(k.tz_min);
  h = h * 31 + std::hash<int64_t>()(k.tz_max);
  return h;
}
```
The custom hash implementation for ThreadBoundKey is functional, but TVM provides a utility tvm::support::HashCombine for this purpose, which is more idiomatic and can provide better hash distribution. Consider using it for consistency with the rest of the codebase.
You would need to add `#include <tvm/support/hash.h>`.
```cpp
size_t operator()(const tvm::tl::ThreadBoundKey &k) const {
  size_t h = 0;
  tvm::support::HashCombine(&h, k.tx_min);
  tvm::support::HashCombine(&h, k.tx_max);
  tvm::support::HashCombine(&h, k.ty_min);
  tvm::support::HashCombine(&h, k.ty_max);
  tvm::support::HashCombine(&h, k.tz_min);
  tvm::support::HashCombine(&h, k.tz_max);
  return h;
}
```
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/transform/thread_storage_sync.cc (2)
598-617: Dropping sync when thread_count % 32 != 0 is unsafe; fall back to full sync instead

Returning an empty Stmt removes the synchronization entirely for non-warp-multiple participants, which can introduce data races. Prefer falling back to a full shared sync at that point.
Apply this diff:
```diff
-    if (thread_count % 32 != 0) {
-      // TODO(lei): This is a workaround for the case where the thread count is
-      // not a multiple of 32. we should enhance the pass to analysis index
-      // instead of buffer expression etc.
-      return Stmt();
-    }
+    if (thread_count % 32 != 0) {
+      // Fallback to a full sync to preserve correctness.
+      return Evaluate(Call(op->dtype, op->op, {StringImm(scope)}));
+    }
```
670-678: Guard against non-constant thread extents to prevent crashes

min/extent are assumed to be IntImm. In practice, thread extents can be symbolic. Dereferencing nullptr here will crash. When non-constant, conservatively treat as "full extent" so the original sync is kept.
Apply this diff:
```diff
-  const auto *min_node = iv->dom->min.as<IntImmNode>();
-  const auto *extent_node = iv->dom->extent.as<IntImmNode>();
-
-  int64_t min = min_node->value;
-  int64_t extent = extent_node->value;
+  const auto *min_node = iv->dom->min.as<IntImmNode>();
+  const auto *extent_node = iv->dom->extent.as<IntImmNode>();
+  if (!min_node || !extent_node) {
+    // Unknown at compile-time; conservatively assume full extent.
+    return true;
+  }
+  int64_t min = min_node->value;
+  int64_t extent = extent_node->value;
```
🧹 Nitpick comments (7)
src/transform/common/thread_sync_types.h (4)
4-5: Align include guard with filename for clarity and consistency

The guard macro references ThreadBoundKey rather than the file name. Prefer matching the filename to avoid future drift and to follow common TVM style.
Apply this diff:
```diff
-#ifndef TVM_TL_THREAD_BOUND_KEY_H_
-#define TVM_TL_THREAD_BOUND_KEY_H_
+#ifndef TVM_TL_THREAD_SYNC_TYPES_H_
+#define TVM_TL_THREAD_SYNC_TYPES_H_
@@
-#endif  // TVM_TL_THREAD_BOUND_KEY_H_
+#endif  // TVM_TL_THREAD_SYNC_TYPES_H_
```

Also applies to: 51-51
13-20: Make operator== noexcept

The equality operator is trivially noexcept; marking it as such helps the compiler and communicates intent.

```diff
-  bool operator==(const ThreadBoundKey &other) const {
+  bool operator==(const ThreadBoundKey &other) const noexcept {
```
37-49: Add `<cstddef>` for size_t and consider stronger hash mixing

- `size_t` is used but `<cstddef>` is not explicitly included. Add it to make the header self-sufficient.
- The linear “h = h*31 + …” is fine, but if collisions show up under load, consider a stronger combiner. Not blocking.
Add near other includes:
```diff
 #include <cstdint>
 #include <functional>
+#include <cstddef>
```
22-33: Rename barrier 0 identifier to indicate it's reserved

We ran a code search and confirmed that `kSyncThreads` is only declared here and not used anywhere else. To prevent accidental use of ID 0, let's rename it to clearly mark it as reserved.

• Location: `src/transform/common/thread_sync_types.h` (around line 28)

Proposed diff:

```diff
 enum class ReservedNamedBarriers {
-  kSyncThreads = 0,
+  kReserved0_DoNotUse = 0,
   kReduce_0 = 1,
   kReduce_1 = 2,
   kFirstUsedBarrier = kReduce_1 + 1
 };
```

src/transform/thread_storage_sync.cc (3)
62-75: Conservative aliasing of shared.dyn is reasonable; add a clarifying note

Redirecting all shared.dyn accesses to the same buffer var ensures conservative conflict detection for dynamic shared memory slices. This can insert extra barriers (acceptable for safety). Consider adding a brief comment stating the trade-off (false positives vs. safety).
76-201: Planner changes overall LGTM

The simplified planner and dynamic shared aliasing should conservatively preserve correctness. Consider adding unit tests for:
- Single vs multiple shared.dyn buffers being treated as one
- Cross-loop dependencies requiring a barrier
- Existing explicit barriers not being duplicated
I can draft TVM IR tests to cover these scenarios if helpful.
619-636: Guard against exceeding hardware's 16 named barriers; fall back to full sync

The current allocation in GetOrCreateBarrier can generate IDs ≥ 16 (valid range 0–15) when there are more than 13 distinct ThreadBoundKey combinations. Such out-of-range barrier IDs will be rejected by the driver. Add a bounds check and fall back to a full sync call when the pool is exhausted.
• Location: src/transform/thread_storage_sync.cc
– GetOrCreateBarrier (around line 627)
– ProcessSharedSync (around line 600)
• Add at top of file: `#include <limits>`

Proposed changes:
```diff
--- a/src/transform/thread_storage_sync.cc
+++ b/src/transform/thread_storage_sync.cc
@@
 std::pair<size_t, size_t> GetOrCreateBarrier(const ThreadBoundKey &key,
                                              size_t extent_tx, size_t extent_ty,
                                              size_t extent_tz) {
-  size_t barrier_id =
-      barrier_id_map_.size() +
-      static_cast<size_t>(ReservedNamedBarriers::kFirstUsedBarrier);
+  constexpr size_t kNumNamedBarriers = 16;
+  size_t first_id = static_cast<size_t>(ReservedNamedBarriers::kFirstUsedBarrier);
+  size_t barrier_id = first_id + barrier_id_map_.size();
+  if (barrier_id >= kNumNamedBarriers) {
+    // Exceeded hardware limit; signal fallback to full sync
+    return {std::numeric_limits<size_t>::max(),
+            extent_tx * extent_ty * extent_tz};
+  }
@@
 Stmt ProcessSharedSync(const CallNode* op, const std::string &scope) {
-  auto [barrier_id, thread_count] =
-      GetOrCreateBarrier(key, extent_tx, extent_ty, extent_tz);
+  auto [barrier_id, thread_count] =
+      GetOrCreateBarrier(key, extent_tx, extent_ty, extent_tz);
+  if (barrier_id == std::numeric_limits<size_t>::max()) {
+    // Fallback to driver-built full sync when named barriers are exhausted
+    return Evaluate(Call(op->dtype, op->op, {StringImm(scope)}));
+  }
```
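The cap-and-fallback logic proposed above can be exercised in isolation. The sketch below is a hypothetical standalone model (the map type, key type, and sentinel convention are assumptions, not the pass's actual code); it demonstrates the arithmetic: with IDs 0-2 reserved out of the 16 named barriers, only 13 distinct keys fit before the pool is exhausted:

```cpp
#include <cstddef>
#include <limits>
#include <map>

// Standalone model of named-barrier allocation with the hardware cap.
// PTX named barriers use IDs 0-15; IDs 0-2 are reserved here, so at most
// 13 distinct keys can receive their own barrier.
constexpr std::size_t kNumNamedBarriers = 16;
constexpr std::size_t kFirstUsedBarrier = 3;
constexpr std::size_t kFullSyncSentinel =
    std::numeric_limits<std::size_t>::max();

std::size_t GetOrCreateBarrier(std::map<int, std::size_t> *barrier_id_map,
                               int key) {
  auto it = barrier_id_map->find(key);
  if (it != barrier_id_map->end()) return it->second;  // reuse existing ID
  std::size_t id = kFirstUsedBarrier + barrier_id_map->size();
  if (id >= kNumNamedBarriers) return kFullSyncSentinel;  // pool exhausted
  (*barrier_id_map)[key] = id;
  return id;
}
```

A caller that receives the sentinel would emit a plain full sync instead of a named-barrier sync, mirroring the suggested fallback.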
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (9)
- src/op/builtin.cc (0 hunks)
- src/op/builtin.h (0 hunks)
- src/target/codegen_cuda.cc (0 hunks)
- src/target/codegen_hip.cc (0 hunks)
- src/transform/common/thread_sync_types.h (1 hunks)
- src/transform/thread_partial_sync.cc (0 hunks)
- src/transform/thread_storage_sync.cc (6 hunks)
- tilelang/engine/phase.py (0 hunks)
- tilelang/language/builtin.py (0 hunks)
💤 Files with no reviewable changes (7)
- tilelang/engine/phase.py
- src/op/builtin.h
- src/target/codegen_hip.cc
- tilelang/language/builtin.py
- src/target/codegen_cuda.cc
- src/op/builtin.cc
- src/transform/thread_partial_sync.cc
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/transform/thread_storage_sync.cc (1)
src/transform/storage_access.cc (8)
`VisitStmt_` (61-89), `VisitStmt_` (61-61), `VisitStmt_` (91-102), `VisitStmt_` (91-91), `VisitStmt_` (104-116), `VisitStmt_` (104-104), `VisitStmt_` (118-168), `VisitStmt_` (118-118)
🔇 Additional comments (5)
src/transform/common/thread_sync_types.h (1)
1-51: Centralizing sync types is a good move

Consolidating ThreadBoundKey, hashing, and named barrier reservations into a shared header reduces duplication and keeps thread-sync semantics consistent across transforms and backends.
src/transform/thread_storage_sync.cc (4)
34-34: Header refactor adoption LGTM

Including the new shared thread_sync_types.h removes local duplication and aligns this pass with the new central definitions.
701-705: Pass wiring looks correct

Planner + inserter + partial-sync rewriter sequencing is sound: plan barriers, emit generic syncs, then opportunistically convert to named barriers.
351-374: Async wait-queue sync insertion remains intact

This safeguard for async_wait_queue still ensures a full syncthreads after waits. No concerns.
315-317: No remaining warp specialization references – removal is safe

A project-wide search for `warp_specialization`, `kWarpSpecializationScope`, and `attr::warp` across all `.cc`, `.cpp`, and `.h` files returned no matches. Dropping the old warp-specialization special-case in `thread_storage_sync.cc` should not affect any existing kernels.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/target/codegen_cuda.cc (1)
1-2247: Action Required: Remove Legacy Partial-Thread Sync & Parity-Based Barrier Intrinsics

The repo-wide verification script surfaced the following critical issues:
1. Lingering `__sync_thread_partial` symbols
   • Defined in `src/tl_templates/cuda/common.h` (lines 240–242)
   • Invoked in `src/target/codegen_cuda.cc` (around lines 648–656)
   These must be replaced with the unified barrier mechanism.

2. Remaining uses of `mbarrier_wait_parity` in IR builders/transforms
   • Definition and sugar wrapper in `tilelang/language/builtin.py`
   • Call sites in:
     – `src/transform/warp_specialized_rewriter.cc`
     – `src/transform/inject_tma_barrier.cc`
     – `src/transform/eliminate_storage_sync_for_mbarrier.cc`
   • Examples and tests (e.g. `testing/python/transform/...`, `examples/warp_specialize/...`) still invoke `T.mbarrier_wait_parity()`
   All of these should be removed or redirected so that parity logic is handled via `tl::mbarrier_wait(barrier, phase_bit)` only.

3. Unified wait is correctly defined
   The new primitive `TL_DEVICE void mbarrier_wait(uint64_t &smem_barrier, int phase_bit)` is present in `src/tl_templates/cuda/barrier.h` and codegen emits calls to `tl::mbarrier_wait(...)` – this part is ✅.

4. `create_barriers` remains in lowering
   Calls in `src/transform/lower_hopper_intrin.cc` and corresponding codegen paths are expected for multi-barrier contexts but should be audited to ensure they integrate exclusively with the unified wait.

Next steps:
- Eliminate all definitions and invocations of `__sync_thread_partial`: migrate to `tl::mbarrier_wait` or the `mbarrier_try_wait`/`mbarrier_wait` primitives in `barrier.h`.
- Remove the `mbarrier_wait_parity` intrinsic from `builtin.py`, transforms, examples, and tests; ensure that any parity handling is expressed only via calls to the unified `mbarrier_wait`.
- Audit the `create_barriers` + `mbarrier_wait` interaction for correctness in multi-barrier kernels.
- Rerun the verification script to confirm no residual partial-sync or parity intrinsics.
🧹 Nitpick comments (3)
src/target/codegen_cuda.cc (3)
642-656: Double-check: partial shared-memory sync path still generates tl::__sync_thread_partial<...>

Given this PR removes partial-thread sync support and consolidates to a unified barrier API, this code path still generating tl::__sync_thread_partial<barrier_id[,thread_count]>() looks inconsistent and may produce unresolved symbols after the refactor.
Action options:
- If partial-thread sync is fully removed, delete these branches and keep only the __syncthreads() path (or translate to the new barrier-based primitive).
- If transitional support is required, gate this behind a feature flag and add a TODO with a removal plan.
Would you like me to draft a follow-up patch here mapping the N-arg storage sync to the new unified barrier wait semantics (or removing it entirely)?
985-1012: Sanity: all mbarrier ops pass an object except wait() — confirm semantics

Every other mbarrier op here targets a specific barrier object (e.g., obj.init, obj.arrive, obj.expect_transaction). The unified wait is now object-less. Please confirm runtime semantics: if multiple barriers are created (create_barriers with barrier_count > 1), which barrier does tl::mbarrier_wait() wait on? If the new API relies on a current/implicit barrier context, please ensure it's clearly set before this call is emitted; otherwise, we may be waiting on the wrong barrier.
I can add a defensive ICHECK on the wait path to enforce/validate the expected implicit barrier context if you share the intended invariants.
1008-1012: Nit: error message formatting

Minor: "Invalid parameter for tl::arrive_barrier …" has a double space. If you touch this area again, consider fixing for consistency.
Apply this micro-diff:
```diff
-    LOG(FATAL) << "Invalid parameter for tl::arrive_barrier  " << op->args.size();
+    LOG(FATAL) << "Invalid parameter for tl::arrive_barrier " << op->args.size();
```
📒 Files selected for processing (4)
- src/op/builtin.cc (0 hunks)
- src/op/builtin.h (0 hunks)
- src/target/codegen_cuda.cc (1 hunks)
- tilelang/engine/phase.py (0 hunks)
💤 Files with no reviewable changes (3)
- src/op/builtin.h
- tilelang/engine/phase.py
- src/op/builtin.cc
```diff
 } else if (op->op.same_as(tl::mbarrier_wait_parity())) {
-  ICHECK_EQ(op->args.size(), 2);
-  this->PrintIndent();
-  auto mbarrier_obj = print_mbarrier_obj(op->args[0]);
-  auto phase = this->PrintExpr(op->args[1]);
-  this->stream << mbarrier_obj << ".wait(" << phase << ");\n";
-} else if (op->op.same_as(tl::sync_thread_partial())) {
-  print_extern_call_stmt("cutlass::arch::NamedBarrier::sync");
+  print_extern_call_stmt("tl::mbarrier_wait");
 } else if (op->op.same_as(tl::no_set_max_nreg())) {
   return;
```
🛠️ Refactor suggestion
Fix: mbarrier_wait branch still matches parity op and may emit wrong signature
This branch checks for tl::mbarrier_wait_parity() but prints tl::mbarrier_wait(...) using print_extern_call_stmt with default parameters, which forwards all op->args. If any args remain on the IR node, the generated call will be tl::mbarrier_wait(arg0, arg1), which contradicts the new zero-arg API and will not compile. Also, if upstream IR starts emitting a new builtin tl::mbarrier_wait(), this branch won’t match.
Recommend:
- Accept both tl::mbarrier_wait() and tl::mbarrier_wait_parity() (for backward compatibility).
- Assert zero arguments to fail fast if older IR inadvertently passes them.
- Emit a truly zero-arg call (don’t route through the helper that auto-forwards args).
Apply this diff:
```diff
-  } else if (op->op.same_as(tl::mbarrier_wait_parity())) {
-    print_extern_call_stmt("tl::mbarrier_wait");
+  } else if (op->op.same_as(tl::mbarrier_wait()) ||
+             op->op.same_as(tl::mbarrier_wait_parity())) {
+    ICHECK_EQ(op->args.size(), 0)
+        << "tl::mbarrier_wait no longer takes arguments";
+    this->PrintIndent();
+    this->stream << "tl::mbarrier_wait();\n";
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-} else if (op->op.same_as(tl::mbarrier_wait_parity())) {
-  ICHECK_EQ(op->args.size(), 2);
-  this->PrintIndent();
-  auto mbarrier_obj = print_mbarrier_obj(op->args[0]);
-  auto phase = this->PrintExpr(op->args[1]);
-  this->stream << mbarrier_obj << ".wait(" << phase << ");\n";
-} else if (op->op.same_as(tl::sync_thread_partial())) {
-  print_extern_call_stmt("cutlass::arch::NamedBarrier::sync");
-  print_extern_call_stmt("tl::mbarrier_wait");
-} else if (op->op.same_as(tl::no_set_max_nreg())) {
-  return;
+} else if (op->op.same_as(tl::mbarrier_wait()) ||
+           op->op.same_as(tl::mbarrier_wait_parity())) {
+  ICHECK_EQ(op->args.size(), 0)
+      << "tl::mbarrier_wait no longer takes arguments";
+  this->PrintIndent();
+  this->stream << "tl::mbarrier_wait();\n";
+} else if (op->op.same_as(tl::no_set_max_nreg())) {
+  return;
```
…code clarity and maintainability.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/target/codegen_hip.cc (1)
809-816: Typo in op symbol: tl::ptx_stmatirx() should be tl::ptx_stmatrix()

The check uses a misspelled op name; this branch will never trigger and calls will fall through to default handling.
Apply this fix:
```diff
-  } else if (op->op.same_as(tl::ptx_stmatirx())) {
+  } else if (op->op.same_as(tl::ptx_stmatrix())) {
```
🧹 Nitpick comments (1)
src/target/codegen_hip.cc (1)
141-147: Add header for barrier APIs (ensure declarations are visible)

You're now emitting tl::mbarrier_* calls, but this file doesn't include a barrier header. If those functions live in a separate HIP header, include it to avoid implicit declarations.
Assuming the device library exposes a barrier header similar to other tl_templates components:
```diff
 decl_stream << "#include <tl_templates/hip/gemm.h>\n";
 decl_stream << "#include <tl_templates/hip/copy.h>\n";
 decl_stream << "#include <tl_templates/hip/reduce.h>\n";
 decl_stream << "#include <tl_templates/hip/ldsm.h>\n";
 decl_stream << "#include <tl_templates/hip/threadblock_swizzle.h>\n";
 decl_stream << "#include <tl_templates/hip/debug.h>\n";
+decl_stream << "#include <tl_templates/hip/barrier.h>\n";  // Verify path/name
```

If these symbols are already declared via an existing include, feel free to ignore. Please confirm the correct header path.
📒 Files selected for processing (1)
- src/target/codegen_hip.cc (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/target/codegen_hip.cc (2)
tilelang/language/tir/op.py (5)
`create_barriers` (1583-1596), `ptx_arrive_barrier` (1527-1541), `ptx_init_barrier_thread_count` (1507-1524), `ptx_arrive_barrier_expect_tx` (1544-1563), `ptx_cp_async_barrier` (1490-1504)

tilelang/language/builtin.py (3)

`get_mbarrier` (43-52), `mbarrier_expect_tx` (219-228), `mbarrier_wait_parity` (151-198)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: bot-task
🔇 Additional comments (2)
src/target/codegen_hip.cc (2)
798-806: LGTM: barrier arrive/init/expect-tx/cp-async arrive mappings

These extern call mappings align with the new barrier-based primitives and look consistent with the Python intrinsics.
808-809: No changes needed for `mbarrier_wait_parity` mapping

The HIP codegen correctly emits the CUDA-style runtime call for the parity-aware barrier intrinsic. Both CUDA and HIP codegen use:
```cpp
} else if (op->op.same_as(tl::mbarrier_wait_parity())) {
  print_extern_call_stmt("tl::mbarrier_wait");
}
```

– in CUDA (`src/target/codegen_cuda.cc` lines 1048–1049) and HIP (`src/target/codegen_hip.cc` lines 807–808). The runtime API `tl::mbarrier_wait(uint64_t &smem_barrier, int phase_bit)` takes the parity bit as its second argument, so no separate `mbarrier_wait_parity` symbol exists in the device runtime (`src/tl_templates/cuda/barrier.h`).

Since the parity semantics are passed correctly via the second argument to `mbarrier_wait`, the existing code is correct and no symbol name change is required.
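As background for the parity discussion, the phase-bit protocol can be modeled in plain C++. This is a conceptual, single-threaded model of how an mbarrier's phase flips and why the waiter passes the phase bit it observed at arrival; it is not the device runtime:

```cpp
// Conceptual model of an mbarrier's phase/parity protocol: each arrive()
// decrements the pending count; when it reaches zero the barrier flips its
// phase bit and re-arms. A waiter is released once the barrier's current
// phase differs from the phase bit it captured when it arrived.
struct MBarrierModel {
  int expected;   // arrivals needed to complete one phase
  int pending;    // arrivals still outstanding in the current phase
  int phase = 0;  // current phase bit (0/1)

  explicit MBarrierModel(int n) : expected(n), pending(n) {}

  // Returns the phase bit the caller should later pass to try_wait().
  int arrive() {
    int observed_phase = phase;
    if (--pending == 0) {
      pending = expected;  // re-arm for the next phase
      phase ^= 1;          // completing the phase flips the bit
    }
    return observed_phase;
  }

  bool try_wait(int phase_bit) const {
    return phase != phase_bit;  // done once the phase has moved on
  }
};
```

On hardware the waiter spins (or sleeps) on the equivalent of `try_wait`, which is why a single `mbarrier_wait(barrier, phase_bit)` entry point suffices for parity-style waits.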
```cpp
} else if (op->op.same_as(builtin::create_barriers())) {
  this->PrintIndent();
  int barrier_count = Downcast<IntImm>(op->args[0])->value;
  std::string barrier_name = "_mbarrier";
  this->stream << "__shared__ uint64_t " << barrier_name << "["
               << barrier_count << "];\n";
} else if (op->op.same_as(tl::get_mbarrier())) {
```
💡 Verification agent
🧩 Analysis chain
create_barriers: enforce constant, positive count and guard against accidental re-declaration
- Downcast to IntImm without a check can crash if the arg isn’t constant.
- A zero barrier_count would emit a zero-length array (ill-formed in C++).
- If TIR emits create_barriers more than once in a kernel, this will re-declare _mbarrier and fail to compile.
Apply the checks inline:
```diff
-  } else if (op->op.same_as(builtin::create_barriers())) {
-    this->PrintIndent();
-    int barrier_count = Downcast<IntImm>(op->args[0])->value;
-    std::string barrier_name = "_mbarrier";
-    this->stream << "__shared__ uint64_t " << barrier_name << "["
-                 << barrier_count << "];\n";
+  } else if (op->op.same_as(builtin::create_barriers())) {
+    this->PrintIndent();
+    ICHECK(op->args[0]->IsInstance<IntImmNode>())
+        << "create_barriers expects a compile-time constant barrier_count";
+    int barrier_count = Downcast<IntImm>(op->args[0])->value;
+    ICHECK_GT(barrier_count, 0) << "barrier_count must be > 0";
+    // NOTE: If multiple create_barriers calls can appear in the same kernel,
+    // we should hoist this into PreFunctionBody or guard via a class flag.
+    std::string barrier_name = "_mbarrier";
+    this->stream << "__shared__ uint64_t " << barrier_name << "["
+                 << barrier_count << "];\n";
```

Follow-up: please confirm TIR guarantees at most one create_barriers per kernel; if not, we should store a per-function flag (e.g., has_mbarrier_decl_) to avoid duplicate declarations. I can propose that change in codegen_hip.h/cpp if desired.
🏁 Script executed:
```shell
#!/bin/bash
set -e
echo "=== TIR builtin definitions ==="
rg -n "create_barriers" -g "src/tir/builtin.cc" || true
rg -n "create_barriers" -g "include/tvm/tir/builtin.h" || true
echo
echo "=== Codegen CUDA handling of create_barriers ==="
rg -n "create_barriers" -g "src/target/codegen_cuda.cc" -C3 || true
echo
echo "=== Codegen HIP context around create_barriers ==="
sed -n '760,820p' src/target/codegen_hip.cc || true
```

Length of output: 4514
Ensure constant, positive barrier count and single declaration for create_barriers in codegen_hip.cc

Please update the `create_barriers` case in src/target/codegen_hip.cc (around line 787) to:

- Enforce that the argument is a compile-time constant `IntImm` and greater than zero.
- Guard against emitting the same shared array more than once per kernel.

Suggested diff:

```diff
--- a/src/target/codegen_hip.cc
+++ b/src/target/codegen_hip.cc
@@ -783,12 +783,22 @@ void CodeGenHIP::VisitStmt_(const tir::CallNode* op) {
   } else if (op->op.same_as(builtin::create_barriers())) {
     this->PrintIndent();
+    // Must be a constant integer > 0
+    ICHECK(op->args[0]->IsInstance<IntImmNode>())
+        << "create_barriers expects a compile-time constant barrier_count";
+    int barrier_count = Downcast<IntImm>(op->args[0])->value;
+    ICHECK_GT(barrier_count, 0)
+        << "create_barriers generated zero or negative barrier_count";
+
+    // Emit the shared barrier array only once per kernel
+    if (!this->has_mbarrier_decl_) {
+      this->has_mbarrier_decl_ = true;
+      const std::string barrier_name = "_mbarrier";
+      this->stream << "__shared__ uint64_t " << barrier_name
+                   << "[" << barrier_count << "];\n";
+    }
   } else if (op->op.same_as(tl::get_mbarrier())) {
     std::string barrier_name = "_mbarrier";
     std::string barrier_id = this->PrintExpr(op->args[0]);
```

Additionally:
- In `CodeGenHIP`'s header (e.g. `codegen_hip.h`), declare and initialize the flag:

```cpp
class CodeGenHIP : public CodeGenCUDA {
  // ...
  bool has_mbarrier_decl_{false};
  // ...
};
```

This change:

- Prevents a crash when the argument is not an `IntImm` (downcast guard).
- Avoids emitting a zero-length array (C++ disallows zero-size arrays).
- Ensures that if `create_barriers` is called multiple times in one kernel, the shared buffer is declared only once, avoiding duplicate-definition errors.
🤖 Prompt for AI Agents
In src/target/codegen_hip.cc around lines 787-793, the create_barriers branch
must validate the barrier count is a compile-time IntImm and >0 and must only
emit the shared array once per kernel; add a runtime check that op->args[0] is
an IntImm (fail/throw/report an error if not), extract its value and require it
> 0 (fail/report if <= 0), introduce and use a CodeGenHIP member flag (e.g.
has_mbarrier_decl_) declared/initialized in the CodeGenHIP header to skip
emitting the "__shared__ uint64_t _mbarrier[...]" line if the flag is already
true, and set the flag to true immediately after emitting the declaration so
duplicate declarations are prevented.
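The "declare once per kernel" pattern from this review can be sketched outside the real codegen. A minimal, hypothetical emitter (not the actual `CodeGenHIP` class) showing the constant-count checks and the dedup flag:

```python
# Hedged sketch: a per-function flag makes the shared-array declaration
# idempotent, mirroring the has_mbarrier_decl_ suggestion above.
class BarrierEmitter:
    def __init__(self):
        self.lines = []
        self._has_mbarrier_decl = False  # analogous to has_mbarrier_decl_

    def create_barriers(self, barrier_count):
        # Must be a compile-time constant > 0 (mirrors the ICHECKs).
        if not isinstance(barrier_count, int):
            raise TypeError("create_barriers expects a constant barrier_count")
        if barrier_count <= 0:
            raise ValueError("barrier_count must be > 0")
        # Emit the shared array only once per kernel.
        if not self._has_mbarrier_decl:
            self._has_mbarrier_decl = True
            self.lines.append(f"__shared__ uint64_t _mbarrier[{barrier_count}];")

emitter = BarrierEmitter()
emitter.create_barriers(4)
emitter.create_barriers(4)  # second call is a no-op
print(emitter.lines)  # ['__shared__ uint64_t _mbarrier[4];']
```

The same flag would be reset per function in a real codegen, so independent kernels each get their own declaration.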
```cpp
      std::string barrier_name = "_mbarrier";
      std::string barrier_id = this->PrintExpr(op->args[0]);
      os << barrier_name + "[" + barrier_id + "]";
    } else if (op->op.same_as(builtin::ptx_arrive_barrier())) {
```
💡 Verification agent
❓ Verification inconclusive
get_mbarrier likely needs the address-of element

Most mbarrier intrinsics operate on the address in shared memory. If `tl::mbarrier_*` expect a pointer/reference to the 64-bit slot, we should pass `&_mbarrier[id]`, not the value.

If the API expects a pointer:

```diff
-  std::string barrier_name = "_mbarrier";
-  std::string barrier_id = this->PrintExpr(op->args[0]);
-  os << barrier_name + "[" + barrier_id + "]";
+  std::string barrier_name = "_mbarrier";
+  std::string barrier_id = this->PrintExpr(op->args[0]);
+  os << "&" << barrier_name << "[" << barrier_id << "]";
```

If it expects a reference, the current code is fine. Please confirm the exact signature of `tl::mbarrier_init`/`tl::mbarrier_arrive` in the HIP device library you're linking against; I can adjust accordingly.
Pass the barrier's address rather than its value to mbarrier intrinsics

The PTX mbarrier instructions (e.g. `mbarrier.init.b64 [addr], count` and `mbarrier.arrive.b64 [addr]`) expect the address of the 64-bit slot in shared memory, not the slot's value (docs.nvidia.com). In your HIP codegen, you must therefore emit `&_mbarrier[id]` instead of `_mbarrier[id]`.

Please update src/target/codegen_hip.cc around lines 794–797 as follows:

```diff
-  std::string barrier_name = "_mbarrier";
-  std::string barrier_id = this->PrintExpr(op->args[0]);
-  os << barrier_name + "[" + barrier_id + "]";
+  std::string barrier_name = "_mbarrier";
+  std::string barrier_id = this->PrintExpr(op->args[0]);
+  // Intrinsics expect the address of the barrier object
+  os << "&" << barrier_name << "[" << barrier_id << "]";
```

This ensures that calls to `tl::mbarrier_init` and `tl::mbarrier_arrive` receive a pointer to the mbarrier object in shared memory.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
      std::string barrier_name = "_mbarrier";
      std::string barrier_id = this->PrintExpr(op->args[0]);
      // Intrinsics expect the address of the barrier object
      os << "&" << barrier_name << "[" << barrier_id << "]";
    } else if (op->op.same_as(builtin::ptx_arrive_barrier())) {
```
🤖 Prompt for AI Agents
In src/target/codegen_hip.cc around lines 794 to 797, the code emits the value
of the mbarrier slot (_mbarrier[id]) but the PTX mbarrier intrinsics require the
address of the 64-bit slot; change the emission to produce an address by
prefixing with & (i.e. emit &_mbarrier[id]) so that tl::mbarrier_init and
tl::mbarrier_arrive receive a pointer to the mbarrier object in shared memory
rather than its value.
…istency and clarity. Remove redundant dtype tests and streamline run functions. Enhance reshape kernel compilation with pass configurations to address shared memory layout issues.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
testing/python/transform/test_tilelang_transform_inject_set_max_nreg.py (2)
15-16: Fix dead producer branch: `threadIdx.x` extent makes `v >= 128` unreachable

`v` is launched with extent 128, so `v >= 128` is never true. This makes the "producer" branch dead and weakens the intent of the test (even if the IR still contains both branches). Use 256 threads to match the comment and the `[128, 128]` warp specialization scope.

Apply this diff:

```diff
-        v = T.launch_thread("threadIdx.x", 128)
+        v = T.launch_thread("threadIdx.x", 256)
```

89-90: Same issue in the second test: unreachable "producer" branch

Mirror the fix in the no-set-max-nreg variant so the control-flow structure is consistent across both tests.

```diff
-        v = T.launch_thread("threadIdx.x", 128)
+        v = T.launch_thread("threadIdx.x", 256)
```
♻️ Duplicate comments (2)
testing/python/primitives/test_tilelang_primitives_mma.py (2)
211-217: Reuse the shared PASS_CONFIGS_NO_TMA_WARP

Same suggestion as above—replace the inline dict with the shared constant to avoid duplication.

```diff
-    kernel = tilelang.compile(
-        program,
-        out_idx=[2],
-        pass_configs={
-            tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
-            tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
-        })
+    kernel = tilelang.compile(
+        program,
+        out_idx=[2],
+        pass_configs=PASS_CONFIGS_NO_TMA_WARP)
```

339-345: Reuse the shared PASS_CONFIGS_NO_TMA_WARP

Same as earlier locations; centralizing the pass configs keeps maintenance simpler.

```diff
-    kernel = tilelang.compile(
-        program,
-        out_idx=[2],
-        pass_configs={
-            tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
-            tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
-        })
+    kernel = tilelang.compile(
+        program,
+        out_idx=[2],
+        pass_configs=PASS_CONFIGS_NO_TMA_WARP)
```
🧹 Nitpick comments (5)
testing/python/language/test_tilelang_language_reshape.py (1)
110-118: Deduplicate pass_configs and keep aliasing consistent within the file

LGTM on the flags you pass; they match the available PassConfigKey entries and the intent to avoid TMA/warp specialization in this path. To reduce repetition (the same dict appears in three helpers in this file) and keep aliasing consistent with `tl.compile`, consider centralizing the config and using the same alias for keys.

Apply this diff here:

```diff
-    jit_kernel = tl.compile(
-        program,
-        out_idx=-1,
-        pass_configs={
-            tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
-            tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
-        })
+    jit_kernel = tl.compile(
+        program,
+        out_idx=-1,
+        pass_configs=PASS_CONFIGS_NO_TMA_WARP)
```

Then add this near the top of the module (outside the selected range):

```python
# Reused across tests in this module
PASS_CONFIGS_NO_TMA_WARP = {
    tl.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
    tl.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
}
```

testing/python/primitives/test_tilelang_primitives_mma.py (1)
84-91: Centralize pass_configs and (optionally) align alias to reduce noise

The pass configs are correct and align with the test's TODO about TMA. To keep this file DRY (you repeat the same dict in three places) and make future toggles easier, hoist the dict to a module-level constant and reuse it.

Apply this diff here:

```diff
-    # TODO(lei): gemm_v2 with tma is not fully tested.
-    kernel = tilelang.compile(
-        program,
-        out_idx=[2],
-        pass_configs={
-            tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
-            tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
-        })
+    # TODO(lei): gemm_v2 with tma is not fully tested.
+    kernel = tilelang.compile(
+        program,
+        out_idx=[2],
+        pass_configs=PASS_CONFIGS_NO_TMA_WARP)
```

Add once at the top of this file (outside the selected range):

```python
PASS_CONFIGS_NO_TMA_WARP = {
    tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
    tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
}
```

Optional: for consistency with other tests that use `tl.compile`, you could `import tilelang as tl` and reference `tl.compile` and `tl.PassConfigKey` everywhere in tests. Not required, just consistency polish.

testing/python/language/test_tilelang_language_reduce_sum.py (1)
68-71: Compile call is fine; minor variable-name typo below

The `tl.compile(out_idx=-1)` usage is correct here. Minor nit: on Line 76, variable name `dummp_A` is likely a typo—rename to `dummy_A` for readability.

Fix outside the selected range:

```python
dummy_A = torch.randn((M, N), dtype=getattr(torch, dtype)).cuda()
ref_out = ref_program(dummy_A)
tl_out = jit_kernel(dummy_A)
```

Optional cleanup (outside the selected range): since the helpers now default to float32, the explicit "float32" arguments in test_reduce_sum_clear() are redundant and can be dropped for brevity.
testing/python/transform/test_tilelang_transform_inject_set_max_nreg.py (2)
80-81: Avoid prints in tests

The prints are unnecessary noise under pytest and can be dropped.

```diff
-    print("InjectSetMaxNReg test passed!")
+    # Test passes if assertions above hold.
@@
-    print("InjectSetMaxNReg with no_set_max_nreg test passed!")
+    # Test passes if assertions above hold.
```

Also applies to: 130-131

134-136: Main block only runs one test; either run both or rely on pytest discovery

Right now `__main__` executes only `test_inject_set_max_nreg`, skipping `test_inject_set_max_nreg_no_set_max_nreg`. If you keep a manual entry point, invoke both (or switch to `unittest.main()`). Otherwise, consider removing the block and rely on pytest.

Option A (run both):

```diff
 if __name__ == "__main__":
-    # tilelang.testing.main()
-    test_inject_set_max_nreg()
+    test_inject_set_max_nreg()
+    test_inject_set_max_nreg_no_set_max_nreg()
```

Option B (use unittest):

```diff
 if __name__ == "__main__":
-    # tilelang.testing.main()
-    test_inject_set_max_nreg()
+    import unittest
+    unittest.main()
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- testing/python/kernel/test_tilelang_kernel_flash_linear_attention.py (0 hunks)
- testing/python/language/test_tilelang_language_reduce_sum.py (4 hunks)
- testing/python/language/test_tilelang_language_reshape.py (1 hunks)
- testing/python/primitives/test_tilelang_primitives_mma.py (3 hunks)
- testing/python/transform/test_tilelang_transform_inject_set_max_nreg.py (1 hunks)
💤 Files with no reviewable changes (1)
- testing/python/kernel/test_tilelang_kernel_flash_linear_attention.py
🧰 Additional context used
🧬 Code graph analysis (3)
testing/python/primitives/test_tilelang_primitives_mma.py (2)
- tilelang/jit/__init__.py (1): compile (32-81)
- tilelang/transform/pass_config.py (1): PassConfigKey (6-81)

testing/python/language/test_tilelang_language_reshape.py (3)

- src/op/builtin.h (1): tl (14-306)
- tilelang/jit/__init__.py (1): compile (32-81)
- tilelang/transform/pass_config.py (1): PassConfigKey (6-81)

testing/python/language/test_tilelang_language_reduce_sum.py (1)

- tilelang/jit/__init__.py (1): compile (32-81)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-amd
- GitHub Check: build-test-nvidia
- GitHub Check: bot-task
🔇 Additional comments (4)
testing/python/language/test_tilelang_language_reduce_sum.py (3)
8-8: Default dtype → float32 looks good

Shifting the default to float32 is sane for numerical stability on reductions. No API break in callers that pass dtype explicitly.

30-30: run_reduce_sum default dtype aligned to float32

Consistent with the helper change; good.

48-48: reduce_sum_test_clear default dtype aligned to float32

Matches the rest of the file; fine.
testing/python/transform/test_tilelang_transform_inject_set_max_nreg.py (1)
66-79: Incorrect assumption about `tl.set_max_nreg` arguments

The `tl.set_max_nreg` intrinsic is defined as

```python
def set_max_nreg(reg_count: int, is_inc: bool):
    return tir.call_intrin("handle", tir.op.Op.get("tl.set_max_nreg"), reg_count, is_inc)
```

– it only takes the register count and a Boolean `is_inc` flag. There is no integer "role" argument for producer/consumer, so you can't assert role coverage from a second argument. You can safely ignore the suggestion to collect roles; the existing assertion on the number of calls (>= 2) is appropriate.

Likely an incorrect or invalid review comment.
…and for better visibility during installation.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
setup.py (1)
163-177: PEP 440 violation: multiple '+' local version segments.

You append local version metadata (e.g., os/cuda/rocm) and then append commit id with another "+". PEP 440 allows only one local version segment. Append the commit id into the same local part.

```diff
-    if local_version_parts:
-        version += f"+{'.'.join(local_version_parts)}"
-
-    if with_commit_id:
-        commit_id = None
-        try:
-            commit_id = subprocess.check_output(['git', 'rev-parse', 'HEAD'],
-                                                stderr=subprocess.DEVNULL,
-                                                encoding='utf-8').strip()
-        except subprocess.SubprocessError as error:
-            logger.warning(f"Ignore commit id because failed to get git commit id: {str(error)}")
-        if commit_id:
-            version += f"+{commit_id}"
+    if with_commit_id:
+        commit_id = None
+        try:
+            commit_id = subprocess.check_output(
+                ['git', 'rev-parse', 'HEAD'],
+                stderr=subprocess.DEVNULL,
+                encoding='utf-8'
+            ).strip()
+        except subprocess.SubprocessError as error:
+            logger.warning(f"Ignore commit id because failed to get git commit id: {str(error)}")
+        if commit_id:
+            local_version_parts.append(commit_id)
+
+    if local_version_parts:
+        version += f"+{'.'.join(local_version_parts)}"
```
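The PEP 440 rule above can be sketched independently of setup.py: collect every local-metadata part (platform tags and commit id alike) and dot-join them inside a single '+' segment. A minimal, hypothetical helper (names are illustrative, not the actual setup.py functions):

```python
# Hedged sketch: compose a version with at most one local segment ("+" part);
# all metadata is dot-joined inside it, as PEP 440 requires.
def with_local_metadata(base_version, parts):
    parts = [p for p in parts if p]  # drop missing pieces (e.g., no commit id)
    if not parts:
        return base_version
    return base_version + "+" + ".".join(parts)

v = with_local_metadata("0.1.0", ["cu121", "linux", "abc1234"])
print(v)  # 0.1.0+cu121.linux.abc1234 — exactly one '+' segment
```

Version parsers such as `packaging.version.Version` reject strings with more than one '+', which is why the commit id must join the existing local parts instead of starting a second segment.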
🧹 Nitpick comments (3)
.github/workflows/ci.yml (1)
107-107: Verbose pip logs are helpful, but watch GH log size; consider gating verbosity.

Adding -v improves diagnosability, but on large C++ builds it can blow up logs. Consider an env switch (e.g., PIP_VERBOSE=1) or only enabling -v on pull_request labeled "ci-verbose".
setup.py (2)
775-781: Add a generator fallback if Ninja isn't present.

You always pass -G Ninja but don't verify ninja is installed. On some build hosts, this will cause CMake to fail. Prefer detecting ninja and falling back to the default generator.

```diff
-        cmake_args = [
-            f"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY={extdir}",
-            f"-DPython_EXECUTABLE={sys.executable}",
-            f"-DCMAKE_BUILD_TYPE={'Debug' if DEBUG_MODE else 'Release'}",
-            "-G",
-            "Ninja",
-        ]
+        cmake_args = [
+            f"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY={extdir}",
+            # Newer CMake prefers Python3_EXECUTABLE; keep both for compatibility.
+            f"-DPYTHON_EXECUTABLE={sys.executable}",
+            f"-DPython3_EXECUTABLE={sys.executable}",
+            f"-DCMAKE_BUILD_TYPE={'Debug' if DEBUG_MODE else 'Release'}",
+        ]
+        if shutil.which("ninja"):
+            cmake_args += ["-G", "Ninja"]
+        else:
+            logger.info("Ninja not found; falling back to default CMake generator.")
```
115-116: Be resilient to nonstandard CUDA layouts; fall back to 'nvcc' on PATH and raise a clear error.

Hardcoding CUDA_HOME/bin/nvcc breaks when CUDA is installed via pkg managers or PATH-only setups. Add a fallback and better error reporting.

```diff
-    nvcc_path = os.path.join(CUDA_HOME, "bin", "nvcc")
-    nvcc_output = subprocess.check_output([nvcc_path, "-V"], universal_newlines=True)
+    nvcc_path = os.path.join(CUDA_HOME, "bin", "nvcc")
+    if not os.path.exists(nvcc_path):
+        nvcc_path = shutil.which("nvcc") or nvcc_path
+    try:
+        nvcc_output = subprocess.check_output([nvcc_path, "-V"], universal_newlines=True)
+    except (OSError, subprocess.CalledProcessError) as e:
+        raise RuntimeError(f"Failed to invoke nvcc at '{nvcc_path}': {e}") from e
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- .github/workflows/ci.yml (1 hunks)
- setup.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-amd
- GitHub Check: build-test-nvidia
- GitHub Check: bot-task
🔇 Additional comments (1)
setup.py (1)
792-829: Incorrect: Configure step already present

The call to configure CMake is already invoked earlier in setup.py (around line 361) via:

```python
subprocess.check_call(["cmake", ".."])
```

This generates the initial CMake cache and project files, so the later `cmake --build` invocation will succeed without needing another explicit `cmake` configure call.

Nit: for better readability, you could switch from `-j N` to the more descriptive `--parallel N` in your build command:

```diff
-    subprocess.check_call(
-        ["cmake", "--build", ".", "--config", "Release", "-j", str(num_jobs)])
+    subprocess.check_call(
+        ["cmake", "--build", ".", "--config", "Release", "--parallel", str(num_jobs)])
```

Likely an incorrect or invalid review comment.
```diff
       - name: Install project (wheel form)
         run: |
           source "${{ runner.tool_cache }}/${{ env.VENV_DIR }}/bin/activate"
-          pip install . --no-user
+          pip install . --no-user -v
```
🛠️ Refactor suggestion
Avoid double-installing the project; fix marker hashing for stable venv reuse.

- The project is installed in the "Ensure venv" step (Line 100) and again in "Install project (wheel form)" (Lines 104-108). This needlessly rebuilds wheels and costs minutes on GPU runners. Install it once (preferably in the dedicated "Install project" step).
- The venv cache marker in this job uses the first 8 characters of the raw file contents of requirements-test.txt (Lines 85-87 in the same step), whereas the format-check job hashes the file with sha256sum. The current approach invalidates the cache only when the first 8 characters change, which is brittle.

Apply the following diffs:

- Use a real hash for REQS_HASH (consistent with the format-check job):

```diff
-          REQS_HASH=$(cat requirements-test.txt 2>/dev/null || true)
+          REQS_HASH=$(sha256sum requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
```

- Install the project only in the "Install project (wheel form)" step and remove the earlier install:

```diff
-          pip install . --no-user
```

Optionally, if you specifically want wheel isolation/logs here, enforce it only once:

```diff
-          pip install . --no-user -v
+          pip install . --no-user -v
```

(Keep just one of these occurrences in the job.)
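The cache-key point above can be shown in plain Python: hash the file's full contents, not a prefix, so any edit anywhere invalidates the key. A small illustrative helper (not the CI script itself; the fallback string mirrors the `echo "no_requirements"` in the shell version):

```python
import hashlib

def reqs_hash(path):
    """Full-content sha256 hex digest, mirroring `sha256sum` in the review."""
    try:
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()
    except FileNotFoundError:
        return "no_requirements"

# A one-byte change anywhere in the file changes the digest, unlike a
# first-8-characters marker, which misses edits past the prefix.
```

Using a content digest also makes the venv cache key identical across jobs that hash the same file, which is the consistency the review asks for.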
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In .github/workflows/ci.yml around lines 104-108 (and referencing the venv cache
marker at ~lines 85-87 and the earlier "Ensure venv" install at ~line 100): the
job double-installs the project (once in Ensure venv and again in Install
project) and the REQS_HASH uses the first 8 chars of raw file contents rather
than a proper sha256, making cache invalidation brittle. Remove the earlier pip
install from the "Ensure venv" step so the project is installed only in the
dedicated "Install project (wheel form)" step, and change the REQS_HASH
calculation to use the sha256 checksum of requirements-test.txt (same method as
format-check job) so the venv cache key is stable and invalidates correctly; if
you need wheel-isolated logging, keep that behavior only in the single install
step (do not duplicate).
Actionable comments posted: 0
🧹 Nitpick comments (3)
setup.py (3)
825-826: Ensure configure step re-runs when config.cmake changes.

Gating solely on the existence of build.ninja can miss necessary reconfigures after config updates. Tie the configure step to the `config_updated` flag you set above.

Apply this diff:

```diff
-        if not os.path.exists(build_temp + "/build.ninja"):
-            subprocess.check_call(["cmake", ext.sourcedir] + cmake_args, cwd=build_temp)
+        if config_updated or not os.path.exists(os.path.join(build_temp, "build.ninja")):
+            logger.info("Running CMake configure (first run or config changed).")
+            subprocess.check_call(["cmake", ext.sourcedir] + cmake_args, cwd=build_temp)
```
829-832: Respect CMAKE_BUILD_PARALLEL_LEVEL; fall back to 75% CPUs.

Good call throttling to 75%. Consider honoring the standard env var if set, so users/CI can override without code changes.

Apply this diff:

```diff
-        num_jobs = max(1, int(multiprocessing.cpu_count() * 0.75))
-        subprocess.check_call(["cmake", "--build", ".", "--config", "Release", "-j",
-                               str(num_jobs)],
-                              cwd=build_temp)
+        # Prefer user/CI override if provided
+        jobs_env = os.environ.get("CMAKE_BUILD_PARALLEL_LEVEL")
+        build_cmd = ["cmake", "--build", ".", "--config", "Release"]
+        if jobs_env:
+            logger.info(f"Using CMAKE_BUILD_PARALLEL_LEVEL={jobs_env}")
+        else:
+            num_jobs = max(1, int(multiprocessing.cpu_count() * 0.75))
+            build_cmd += ["-j", str(num_jobs)]
+        subprocess.check_call(build_cmd, cwd=build_temp)
```
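The override-then-fallback logic above can be isolated into a tiny helper. A hypothetical sketch (the real setup.py would wire the result into the cmake invocation; the function name is illustrative):

```python
import multiprocessing
import os

def build_jobs(env=os.environ):
    """CMAKE_BUILD_PARALLEL_LEVEL wins; otherwise use ~75% of CPUs, min 1."""
    override = env.get("CMAKE_BUILD_PARALLEL_LEVEL")
    if override:
        return int(override)
    return max(1, int(multiprocessing.cpu_count() * 0.75))

print(build_jobs({"CMAKE_BUILD_PARALLEL_LEVEL": "4"}))  # 4
```

Taking the environment as a parameter keeps the helper testable without mutating `os.environ`.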
115-120: Enhance `get_nvcc_cuda_version` for robust nvcc discovery and error handling

The current implementation in setup.py unconditionally invokes `${CUDA_HOME}/bin/nvcc`, which will:

- Fail if `CUDA_HOME` is set to a non-standard location or unset (despite earlier guards).
- Propagate opaque `FileNotFoundError` or `CalledProcessError` without guidance.

To improve resilience:

- In `get_nvcc_cuda_version()` (around lines 115–120 of setup.py), first try `$CUDA_HOME/bin/nvcc`, then fall back to whatever `nvcc` is on the user's PATH via `shutil.which("nvcc")`.
- If neither exists, raise a clear `FileNotFoundError` explaining how to set `CUDA_HOME` or update PATH.
- Wrap the `subprocess.check_output` call in a `try/except subprocess.CalledProcessError` to emit a more user-friendly `RuntimeError` on execution failure.

Apply this diff to `get_nvcc_cuda_version()`:

```diff
 def get_nvcc_cuda_version():
     """Get the CUDA version from nvcc.

     Adapted from https://github.com/NVIDIA/apex/.../setup.py
     """
-    nvcc_path = os.path.join(CUDA_HOME, "bin", "nvcc")
-    nvcc_output = subprocess.check_output([nvcc_path, "-V"], universal_newlines=True)
+    # Prefer CUDA_HOME/bin/nvcc, but fall back to PATH
+    candidates = []
+    if CUDA_HOME:
+        candidates.append(os.path.join(CUDA_HOME, "bin", "nvcc"))
+    env_nvcc = shutil.which("nvcc")
+    if env_nvcc:
+        candidates.append(env_nvcc)
+    nvcc_path = next((p for p in candidates if p and os.path.exists(p)), None)
+    if not nvcc_path:
+        raise FileNotFoundError(
+            "Could not locate 'nvcc'. Ensure CUDA is installed and either set "
+            "CUDA_HOME to the toolkit root (e.g., /usr/local/cuda) or add nvcc to PATH."
+        )
+    try:
+        nvcc_output = subprocess.check_output([nvcc_path, "-V"], universal_newlines=True)
+    except subprocess.CalledProcessError as e:
+        raise RuntimeError(f"Failed to run '{nvcc_path} -V' to detect CUDA version") from e
     output = nvcc_output.split()
     release_idx = output.index("release") + 1
     nvcc_cuda_version = Version(output[release_idx].split(",")[0])
     return nvcc_cuda_version
```

This change ensures that users with non-standard installations still get a clear error and that runtime failures surface with actionable messages.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- setup.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build-test-nvidia
🔇 Additional comments (1)
setup.py (1)
792-823: Verify external TVM config and harden CMake template generation

Please ensure that the external TVM source tree provides the expected `config.cmake` and that its CMake scripts actually consume the `USE_CUDA`, `USE_ROCM`, and `USE_LLVM` variables. Then apply the following improvements to setup.py:

- Add an existence check for the template and fail fast if it's missing
- Introduce a helper to quote path-like values but leave booleans unquoted
- Switch from `print` to `logger.info` for consistent logging
- Track whether the config was updated via a `config_updated` flag for downstream reconfigure logic

```diff
--- a/setup.py
+++ b/setup.py
@@ -796,6 +796,10 @@
     src_config = Path(ext.sourcedir) / "3rdparty" / "tvm" / "cmake" / "config.cmake"
     dst_config = Path(build_temp) / "config.cmake"
+    if not src_config.exists():
+        raise FileNotFoundError(f"Missing TVM config template: {src_config}")
+
+    # Quote paths/strings but leave ON/OFF unquoted
     content_lines = src_config.read_text().splitlines()
+    def _cmake_val(v: str) -> str:
+        return v if v in ("ON", "OFF") else f'"{v}"'
@@ -803,7 +807,7 @@
     # Add common LLVM configuration
-    content_lines.append(f"set(USE_LLVM {llvm_config_path})")
+    content_lines.append(f"set(USE_LLVM {_cmake_val(llvm_config_path)})")
     # Append GPU backend configuration based on environment
     if USE_ROCM:
@@ -825,11 +829,15 @@
     new_content = "\n".join(content_lines) + "\n"
     config_updated = False
-    if not dst_config.exists() or dst_config.read_text() != new_content:
-        dst_config.write_text(new_content)
-        print(f"[Config] Updated: {dst_config}")
-    else:
-        print(f"[Config] No changes: {dst_config}")
+    if not dst_config.exists() or dst_config.read_text() != new_content:
+        dst_config.write_text(new_content)
+        logger.info(f"[Config] Updated: {dst_config}")
+        config_updated = True
+    else:
+        logger.info(f"[Config] No changes: {dst_config}")
+    # (downstream) trigger CMake reconfigure if config_updated is True
```

- Manual verification required:
  – Confirm `config.cmake` is present under `ext.sourcedir/3rdparty/tvm/cmake`
  – Ensure TVM's root `CMakeLists.txt` or included modules actually load this file and respect `USE_CUDA`, `USE_ROCM`, and `USE_LLVM` settings
…, torch, tox, auditwheel, patchelf, and ninja.
Actionable comments posted: 2
🧹 Nitpick comments (1)
pyproject.toml (1)
3-3: Remove `build` from `[build-system].requires` (it's a front-end, not a backend requirement).

`build` is a CLI used to invoke PEP 517 builds; it shouldn't be a requirement for building your project inside the isolated build env. Keeping it here bloats the builder environment and can introduce version coupling for no benefit.

Apply this diff to drop it:

```diff
 requires = [
-    "build",
     "cmake>=3.26",
     "packaging",
     "setuptools>=61",
     "torch",
     "wheel",
     "tox",
     "auditwheel",
     "patchelf",
     "ninja",
 ]
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- pyproject.toml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-test-amd
- GitHub Check: bot-task
```toml
    "cython",
    "packaging",
    "setuptools>=61",
    "torch",
```
💡 Verification agent
🧩 Analysis chain
Do not list torch under [build-system].requires; move to runtime or an optional extra.
Including torch as a build requirement forces PEP 517 build environments (e.g., pip/PEP 517 isolated builds, PyPI builders) to install a heavyweight framework during the build step, which is unnecessary for compiling TileLang and can fail across platforms (Linux/macOS/Windows, CUDA vs. CPU variants). This is likely to break source builds and slow CI.
Minimal fix (remove from build requirements):
```diff
 requires = [
     "cmake>=3.26",
     "packaging",
     "setuptools>=61",
-    "torch",
     "wheel",
     "tox",
     "auditwheel",
     "patchelf",
     "ninja",
 ]
```

If torch is a runtime optional integration, add an extra instead (outside this hunk):

```toml
[project.optional-dependencies]
torch = ["torch>=2.2"]  # choose the lowest version you truly support
```

If it's not required at runtime, rely on docs to instruct users to install it separately.
Run this to confirm torch isn’t used in the build step (only in tests/examples is fine):
🏁 Script executed:
#!/bin/bash
# Verify torch is not used by build tooling
set -euo pipefail
echo "== Search for torch in setup/build tooling =="
rg -n -C2 -i 'import\s+torch|from\s+torch\s+import|torch\.' setup.py pyproject.toml || true
echo "== Search for torch in CMake or codegen paths (build-time find_package) =="
rg -n -C2 -i 'find_package\s*\(\s*Torch|PyTorch|torch' || true
echo "== Search for torch in packaging/CI scripts =="
fd -t f | rg -n -C2 -i 'audit|wheel|cibuildwheel|torch' || true

Length of output: 318379
Action required — remove "torch" from [build-system].requires in pyproject.toml (it's a runtime dependency).
Verified: pyproject.toml lists "torch" under build-system.requires (pyproject.toml:7). The repo also imports/uses PyTorch in core library modules and C++ extension templates (examples: tilelang/utils/tensor.py, tilelang/utils/sparse.py, tilelang/testing/__init__.py, tilelang/jit/, tilelang/profiler/, src/tl_templates/cuda/compress_sm90.cu), so PyTorch belongs as a runtime dependency (project.dependencies) or an optional extra — not a PEP 517 build requirement.
Files/locations needing attention:
- pyproject.toml — remove "torch" from [build-system].requires (pyproject.toml:7).
- requirements-build.txt — remove "torch" from build-only requirements (requirements-build.txt:7).
- Add "torch" to runtime deps (or an optional extra) in pyproject.toml [project.dependencies] or [project.optional-dependencies], depending on whether the package requires PyTorch at import-time (the codebase shows many top-level imports, so prefer [project.dependencies]).
- Review CI/docker scripts that install build deps (they should not be required to install full PyTorch just to build wheels).
Minimal fix (remove from build-system.requires):
requires = [
"cmake>=3.26",
"packaging",
"setuptools>=61",
- "torch",
"wheel",
"tox",
"auditwheel",
"patchelf",
"ninja",
]

Recommended addition (declare as runtime dep or extra; adjust version to the minimum you support — e.g. code references float8 checks requiring newer torch):
[project]
dependencies = [
"torch>=2.2", # move here if torch is required at import/runtime
...
]
# OR, if optional:
[project.optional-dependencies]
torch = ["torch>=2.2"]🤖 Prompt for AI Agents
In pyproject.toml around line 7 remove "torch" from the [build-system].requires
entry (it is a runtime dependency), also remove "torch" from
requirements-build.txt (line 7) so build-only tooling doesn't pull full PyTorch;
then add "torch>=<minimum-version-you-support>" to pyproject.toml under
[project].dependencies (or to [project.optional-dependencies] if you intend it
to be optional), and update CI/Docker build scripts to stop installing torch as
a build dependency so builders don't need the full package to build wheels.
| "tox", | ||
| "auditwheel", | ||
| "patchelf", | ||
| "ninja", |
💡 Verification agent
🧩 Analysis chain
Keep ninja, but drop tox, auditwheel, and patchelf from build requirements.
- `tox` is a test runner and shouldn't be required to build the wheel/sdist.
- `auditwheel`/`patchelf` are wheel-repair tools used post-build on Linux; listing them here pulls them into any isolated build and can cause cross-platform issues (e.g., Windows/macOS), while providing no benefit during the PEP 517 build itself.
Apply this diff:
requires = [
"cmake>=3.26",
"packaging",
"setuptools>=61",
"wheel",
- "tox",
- "auditwheel",
- "patchelf",
"ninja",
]

Move them to dev/release extras instead (outside this hunk), with platform markers where appropriate:
[project.optional-dependencies]
dev = [
"tox>=4",
"pytest",
"ruff",
"yapf",
"codespell",
]
release = [
"build>=1.2.1",
"twine>=5",
"auditwheel>=6; platform_system == 'Linux'",
]

If you suspect these are needed by setup.py or CI, run:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Check if tox/auditwheel/patchelf are used by setup.py or build scripts =="
rg -n -C2 -i 'tox|auditwheel|patchelf' setup.py || true
echo "== Check CI configs for auditwheel/patchelf usage (safe to keep in CI only) =="
fd -t f | rg -n -C3 -i 'auditwheel|patchelf|cibuildwheel|repair' || true

Length of output: 1072
Adjust build requirements: retain patchelf, drop tox and auditwheel
The call to patchelf in setup.py confirms it must stay in requires (lines 382–390 reference it directly), but tox and auditwheel can be removed from the PEP 517 build requirements. Move tox to your dev dependencies and auditwheel to the release extras with a Linux platform marker.
• In pyproject.toml under the [build-system] or requires section, update to:
requires = [
"cmake>=3.26",
"packaging",
"setuptools>=61",
"wheel",
- "tox",
- "auditwheel",
"patchelf",
"ninja",
]

• Then, in your optional dependencies:
[project.optional-dependencies]
dev = [
"tox>=4",
"pytest",
"ruff",
"yapf",
"codespell",
]
release = [
"build>=1.2.1",
"twine>=5",
"auditwheel>=6; platform_system == 'Linux'",
]

🤖 Prompt for AI Agents
In pyproject.toml around lines 9 to 12, the build-system requires list currently
includes "tox", "auditwheel", "patchelf", "ninja"; keep "patchelf" there but
remove "tox" and "auditwheel" from the PEP 517 build requirements, then add
"tox" to the project.optional-dependencies under a dev group and add
"auditwheel" to a release extra with a Linux platform marker (e.g.,
auditwheel>=6; platform_system == 'Linux'), and ensure the release extra also
contains build/twine entries per the review.
…for improved test visibility.
…read storage synchronization in thread_storage_sync.cc by introducing new thread variable handling and improving index disjointness checks.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/transform/thread_storage_sync.cc (2)
626-630: Bug: const_int_bound invoked with IterVar instead of Var
Analyzer expects a PrimExpr; passing IterVar compiles incorrectly or yields wrong lookup. Use iv->var.
Apply:
- auto bound_tx = analyzer_->const_int_bound(tx_);
- auto bound_ty = analyzer_->const_int_bound(ty_);
- auto bound_tz = analyzer_->const_int_bound(tz_);
+ auto bound_tx = analyzer_->const_int_bound(tx_->var);
+ auto bound_ty = analyzer_->const_int_bound(ty_->var);
+ auto bound_tz = analyzer_->const_int_bound(tz_->var);
650-656: Do not drop synchronization when thread_count is not a multiple of 32
Returning an empty Stmt() removes the barrier entirely, risking data races. If partial sync cannot be formed, preserve the original full sync.
Apply:
- if (thread_count % 32 != 0) {
-   // TODO(lei): This is a workaround for the case where the thread count is
-   // not a multiple of 32. we should enhance the pass to analysis index
-   // instead of buffer expression etc.
-   return Stmt();
- }
+ if (thread_count % 32 != 0) {
+   // Fallback: keep original full sync to maintain correctness.
+   return Evaluate(IRMutatorWithAnalyzer::VisitExpr_(op));
+ }
♻️ Duplicate comments (5)
.github/workflows/ci.yml (2)
82-101: Avoid duplicate project installation in the same job.The job installs the project inside “Ensure venv” (Line 100) and again in “Install project (wheel form)” (Lines 104–108). This wastes build time and can mask packaging issues. Keep only the wheel-form install.
Apply:
- REQS_HASH=$(cat requirements-test.txt 2>/dev/null || true)
+ REQS_HASH=$(sha256sum requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
@@
- pip install . --no-user
+ # Project install happens later in the dedicated "Install project (wheel form)" step.
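For intuition, the sha256-based key suggested above can be sketched in Python (the requirements content here is demo data, not the repo's real requirements-test.txt):

```python
import hashlib

# Stable cache-key sketch: hash the requirements file's bytes with sha256,
# then keep an 8-character prefix, mirroring ${REQS_HASH:0:8} in the workflow.
content = b"pytest\nnumpy\n"  # demo stand-in for requirements-test.txt
reqs_hash = hashlib.sha256(content).hexdigest()
marker_suffix = reqs_hash[:8]
print(marker_suffix)
```

Unlike keying on the raw file contents, the digest changes whenever any byte of the file changes, and truncating the hex digest keeps the marker filename short and shell-safe.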
85-87: Stabilize venv cache key hashing (use sha256, match other job).Using the raw file contents and truncation is brittle; the other job already uses sha256. Align here to ensure correct cache invalidation.
Apply:
- REQS_HASH=$(cat requirements-test.txt 2>/dev/null || true)
- MARKER="${{ runner.tool_cache }}/.venv_marker_${{ env.PYTHON_VERSION }}_${REQS_HASH:0:8}"
+ REQS_HASH=$(sha256sum requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
+ MARKER="${{ runner.tool_cache }}/.venv_marker_${{ env.PYTHON_VERSION }}_${REQS_HASH:0:8}"

pyproject.toml (2)
7-7: Remove torch from PEP 517 build requirements; declare it as runtime or optional extra.
Putting `torch` in `[build-system].requires` forces heavyweight installs during isolated builds and often breaks cross-platform source builds. Move it to `[project].dependencies` (if imported at runtime) or `[project.optional-dependencies]` (if optional).
Apply within this hunk:
- "torch",Then add one of the following sections outside this hunk (pick one):
Option A — required at import/runtime:
[project]
dependencies = [
    "torch>=2.2",
]

Option B — optional integration:
[project.optional-dependencies]
torch = ["torch>=2.2"]

If you want, I can scan the repo for top-level `import torch` to decide A vs. B.
9-10: Drop tox and auditwheel from build requirements; move to dev/release extras.
These tools are not needed to execute the build backend and can destabilize isolated builds (especially on non-Linux platforms).
Apply within this hunk:
- "tox", - "auditwheel",Then add outside this hunk:
[project.optional-dependencies]
dev = [
    "tox>=4",
    "pytest",
    "ruff",
    "yapf",
    "codespell",
]
release = [
    "build>=1.2.1",
    "twine>=5",
    "auditwheel>=6; platform_system == 'Linux'",
]

src/transform/thread_storage_sync.cc (1)
658-661: 3-arg tvm_storage_sync requires backend support (HIP, CUDA) — verify codegen
This pass rewrites shared syncs to tvm_storage_sync(scope, barrier_id, thread_count) for partial barriers. Ensure both CUDA and HIP backends accept the 3-arg variant. A prior review noted HIP's CodeGenTileLangHIP::PrintStorageSync only handled the single-arg (__syncthreads) form, which would be incorrect for partial syncs.
Action:
- Update HIP codegen to accept 3 args and either implement named/partial barrier semantics or conservatively fall back to full barrier while ignoring extra args.
#!/bin/bash
# Locate PrintStorageSync implementations and inspect argument handling.
rg -nP -C3 'PrintStorageSync\s*\(' --type=cpp
# Also find tvm_storage_sync call lowering paths in HIP/CUDA codegens.
rg -nP -C3 'tvm_storage_sync\W' --type=cpp
🧹 Nitpick comments (5)
.github/workflows/ci.yml (1)
21-25: Upgrade actions/setup-python to v5.
actions/setup-python@v2 is old; v5 brings perf and security updates (Node20 runtime). Safe, drop-in upgrade.
Apply:
- uses: actions/setup-python@v2
+ uses: actions/setup-python@v5

Also applies to: 77-81
pyproject.toml (3)
3-3: Remove the build frontend `build` from `[build-system].requires`.
The PEP 517 frontend (the `build` package) is used by humans/CI to invoke builds, not by the backend to build your project. Move it to the `release` extra.
Apply within this hunk:
- "build",
13-13: Pin a minimum compatible Cython version.
If you're on Cython 3 semantics, declare `Cython>=3.0` to avoid older 0.29.x pulls; if you still rely on 0.29 behavior, consider `Cython>=0.29.36,<3`.
Example change:
- "Cython", + "Cython>=3.0",
11-12: Refactor build-system dependencies: remove `patchelf`
Patchelf is only checked for presence in setup.py and used indirectly by the auditwheel repair step in your CI/tox pipeline, not during the core PEP 517 build. Ninja, on the other hand, is required as your CMake generator.
Please update as follows:
• pyproject.toml, under `[build-system].requires`: remove `"patchelf"` from the list.
• Keep `patchelf` in your release/dev tooling (e.g. requirements-build.txt, tox.ini, or a `[project.optional-dependencies.release]` extra) so auditwheel still functions in CI.

Suggested diff for pyproject.toml:
[build-system]
requires = [
    "tox",
    "auditwheel",
-   "patchelf",
    "ninja",
    "Cython",
]

(Optional) Add a release extra in pyproject.toml:
[project.optional-dependencies.release]
patchelf = "*"

src/transform/thread_storage_sync.cc (1)
64-75: Coalescing shared.dyn buffers for planning may cause benign over-synchronization
Redirecting all shared.dyn to a single buffer var makes planning simpler and safer. One caveat: if multiple shared.dyn buffers are intentionally disjoint (different element dtypes/shapes), this coalescing can insert extra barriers pessimistically. Acceptable if the goal is correctness-first.
Option: maintain a small map keyed by (scope.rank, scope.tag, maybe dtype) to coalesce per “class” rather than a single first-seen var.
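That option can be sketched in plain Python (names and the tuple layout are illustrative, not the pass's actual data structures):

```python
# Coalesce buffers per "class" key instead of redirecting every shared.dyn
# buffer to one first-seen variable: buffers with the same (rank, tag, dtype)
# share an identity, while other classes keep their own.
def coalesce(buffers):
    first_seen = {}  # class key -> representative buffer name
    redirect = {}    # buffer name -> representative name
    for name, rank, tag, dtype in buffers:
        key = (rank, tag, dtype)
        redirect[name] = first_seen.setdefault(key, name)
    return redirect

mapping = coalesce([
    ("A", 1, "shared.dyn", "f16"),
    ("B", 1, "shared.dyn", "f16"),
    ("C", 1, "shared.dyn", "f32"),
])
print(mapping)  # → {'A': 'A', 'B': 'A', 'C': 'C'}
```

Here B is redirected to A (same class), while C keeps its own identity, so intentionally disjoint buffers of a different dtype no longer force extra barriers.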
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- .github/workflows/ci.yml (1 hunks)
- pyproject.toml (1 hunks)
- src/transform/thread_storage_sync.cc (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/thread_storage_sync.cc (2)
src/target/codegen_cuda.cc (14)
op(128-143)op(128-128)op(1358-1360)op(1358-1358)op(1361-1363)op(1361-1361)VisitStmt_(207-226)VisitStmt_(207-207)VisitStmt_(1605-1644)VisitStmt_(1605-1605)VisitStmt_(1646-1706)VisitStmt_(1646-1646)VisitStmt_(1708-1724)VisitStmt_(1708-1708)src/transform/storage_access.cc (8)
VisitStmt_(61-89)VisitStmt_(61-61)VisitStmt_(91-102)VisitStmt_(91-91)VisitStmt_(104-116)VisitStmt_(104-104)VisitStmt_(118-168)VisitStmt_(118-118)
🪛 GitHub Actions: CI
src/transform/thread_storage_sync.cc
[error] 1-1: tile-lang clang-format: Reformatted files. Please review and stage the changes.
🪛 GitHub Actions: CI Test on AMD
src/transform/thread_storage_sync.cc
[error] 1-1: Command: ./format.sh. clang-format reformatted files. Please review and stage the changes. Changes not staged for commit: src/transform/thread_storage_sync.cc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: bot-task
🔇 Additional comments (10)
.github/workflows/ci.yml (2)
104-108: Wheel-form install: good; keep this as the single project install.
This step is correct and the added verbosity helps diagnose build issues. See my separate comment on the earlier "Ensure venv" step to avoid a duplicate install in the same job.

121-121: Test verbosity flags look good.
'-v -r fE' surfaces failures and errors clearly; with xdist it's still readable. LGTM.
src/transform/thread_storage_sync.cc (8)
34-34: Externalizing thread-sync types — LGTM
Including the shared types via "./common/thread_sync_types.h" removes duplication and clarifies ownership. Looks good.

248-253: Explicit PrimExpr for byte-index calculations — LGTM
Switching to explicit PrimExpr types avoids template-dependent deduction pitfalls and reads clearer.

341-353: Thread-extent capture simplified — LGTM
Updating tx_/ty_/tz_ on thread_extent and delegating the rest to the base visitor keeps the planner consistent with current env thread state.

355-360: Deduplicated sync insertion — LGTM
Early-return guard plus set insertion is simple and correct.

365-370: Safe defaults for thread itervars — LGTM
Initializing tx/ty/tz to [0,1) ensures analysis won't dereference undefined doms when kernels lack specific extents.

423-426: ThreadSyncInserter ctor reduction — LGTM
Dropping the partial-sync map cleans up state and matches the new planner flow. Call site below is updated accordingly.

746-749: Planner → Inserter → PartialSyncRewriter pipeline — LGTM (with backend caveat)
Flow is coherent and keeps rewriting local to this pass. Just ensure all backends understand the rewritten call (see earlier comment).

1-1: CI formatting failure
CI indicates clang-format reformatted this file. Please run ./format.sh (or clang-format with project style) and commit the formatted changes so CI passes.
| - name: Run examples | ||
| run: | | ||
| source "${{ runner.tool_cache }}/${{ env.VENV_DIR }}/bin/activate" | ||
| cd examples | ||
| unset PYTHONPATH | ||
| python -m pytest -n 4 **/test*.py | ||
| python -m pytest -n 4 **/test*.py -v -r fE | ||
Fragile glob pattern may skip example tests; rely on pytest discovery.
The shell’s globstar is typically disabled on GitHub runners, so “**/test*.py” may not expand and pytest will treat it as a literal path, risking zero collection. Let pytest discover tests or enable globstar explicitly.
Prefer discovery:
- python -m pytest -n 4 **/test*.py -v -r fE
+ python -m pytest -n 4 -v -r fE

If you must keep explicit patterns, enable globstar first:
- python -m pytest -n 4 **/test*.py -v -r fE
+ shopt -s globstar
+ python -m pytest -n 4 **/test*.py -v -r fE

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- name: Run examples
  run: |
    source "${{ runner.tool_cache }}/${{ env.VENV_DIR }}/bin/activate"
    cd examples
    unset PYTHONPATH
-   python -m pytest -n 4 **/test*.py -v -r fE
+   python -m pytest -n 4 -v -r fE
🤖 Prompt for AI Agents
.github/workflows/ci.yml lines 109-115: the workflow uses a shell glob
"**/test*.py" which may not expand on GitHub runners (causing pytest to treat it
as a literal path and collect nothing); fix by either letting pytest discover
tests (remove the explicit glob and run pytest -v -r fE or similar in the
examples directory) or, if you must keep the explicit glob, enable bash globstar
before expansion (e.g., run a shell command to enable globstar) so the pattern
expands correctly; pick one approach and update the run step accordingly.
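The same pitfall can be reproduced with Python's `glob` module, where non-recursive matching treats `**` like a single `*` (temporary demo files, not the repo's real examples tree):

```python
import glob
import os
import tempfile

# Create a nested layout: one test file directly under sub/, one deeper.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub", "deep"))
for rel in ("sub/test_a.py", "sub/deep/test_b.py"):
    open(os.path.join(root, rel), "w").close()

pattern = os.path.join(root, "**", "test*.py")
flat = sorted(glob.glob(pattern))                  # "**" acts like "*": misses deep/
deep = sorted(glob.glob(pattern, recursive=True))  # true globstar semantics
print(len(flat), len(deep))  # → 1 2
```

This mirrors bash: with `globstar` off, `**/test*.py` collects only one directory level, so nested example tests silently go untested unless pytest's own discovery (or `shopt -s globstar`) is used.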
requires = [
    "build",
    "cmake>=3.26",
    "cython",
    "packaging",
    "setuptools>=61",
    "torch",
    "wheel",
    "tox",
    "auditwheel",
    "patchelf",
    "ninja",
    "Cython",
]
🛠️ Refactor suggestion
Consolidated cleanup of [build-system].requires.
This diff reflects all of the above: keep minimal, backend-only deps; move tooling to extras.
Apply:
[build-system]
requires = [
- "build",
"cmake>=3.26",
"packaging",
"setuptools>=61",
- "torch",
"wheel",
- "tox",
- "auditwheel",
- "patchelf",
"ninja",
- "Cython",
+ "Cython>=3.0",
]
]
build-backend = "setuptools.build_meta"

Then add outside this hunk as needed:
[project.optional-dependencies]
dev = [
"tox>=4",
"pytest",
"ruff",
"yapf",
"codespell",
]
release = [
"build>=1.2.1",
"twine>=5",
"auditwheel>=6; platform_system == 'Linux'",
]
# Choose one of these depending on runtime import requirements:
# [project]
# dependencies = ["torch>=2.2"]
# or
# [project.optional-dependencies]
# torch = ["torch>=2.2"]🤖 Prompt for AI Agents
In pyproject.toml around lines 2 to 14, the [build-system].requires list
includes development and platform tooling (tox, pytest, auditwheel, ninja,
Cython, torch, etc.) that should be removed and kept minimal for the PEP 517
build backend; replace that list with only the build-backend runtime
requirements (e.g., "build", "setuptools>=61", "wheel") and move all tooling and
release helpers into [project.optional-dependencies] sections as shown in the
review comment (add dev and release groups, put auditwheel under Linux selector,
and decide whether to place torch under [project].dependencies or as an optional
dependency per runtime import needs).
…consistency. Modify run_cumsum function to utilize the updated dtype and enhance result validation with assertions. Adjust test cases accordingly.
…n TileLangStorageAccessVisitor. Enhance access entry structure to include pointer access flag. Update thread storage synchronization to accommodate new buffer data mappings. Adjust quickstart example to print kernel source for debugging purposes.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/transform/thread_storage_sync.cc (1)
657-663: Critical: Do not drop barriers when thread count isn't a multiple of 32
Returning an empty Stmt removes the sync entirely, risking data races. Fall back to the original full barrier in this case.

- if (thread_count % 32 != 0) {
-   // TODO(lei): This is a workaround for the case where the thread count is
-   // not a multiple of 32. we should enhance the pass to analysis index
-   // instead of buffer expression etc.
-   return Stmt();
- }
+ if (thread_count % 32 != 0) {
+   // Fallback to full barrier to preserve correctness
+   return Evaluate(Call(op->dtype, op->op, {StringImm(scope)}));
+ }
♻️ Duplicate comments (2)
src/transform/thread_storage_sync.cc (2)
743-761: Planner seeding via buffer_map is good; ensure downstream codegen accepts partial-sync args
The pass may rewrite tvm_storage_sync into multi-arg partial-sync calls. Please confirm CUDA and HIP codegens accept the extra arguments. The existing HIP path reportedly only handles the single-arg variant.
If needed, I can help draft updates to CodeGenTileLangHIP::PrintStorageSync to handle (scope, is_lead, num_blocks) and the named/partial barrier forms.
298-307: Fix placeholder Var dtypes to match thread var dtype (not index expr dtype)
Constructing prev_var/curr_var with prev_indice.dtype() can mismatch threadIdx.{x,y,z} dtype (commonly int32), weakening analysis. Use the underlying thread var dtype.

- for (const auto &info : thread_vars) {
-   Var prev_var(info.name_prev, prev_indice.dtype());
-   Var curr_var(info.name_curr, curr_indice.dtype());
+ for (const auto &info : thread_vars) {
+   const DataType thread_dt = info.iv->var.dtype();
+   Var prev_var(info.name_prev, thread_dt);
+   Var curr_var(info.name_curr, thread_dt);
    analyzer_.Bind(prev_var, info.iv->dom);
    analyzer_.Bind(curr_var, info.iv->dom);
    prev_indice_bytes = Substitute(prev_indice_bytes, {{info.iv->var, prev_var}});
    curr_indice_bytes = Substitute(curr_indice_bytes, {{info.iv->var, curr_var}});
  }
🧹 Nitpick comments (11)
testing/python/language/test_tilelang_language_cumsum.py (4)
47-53: Use of out_idx=-1 is fine here; consider asserting tile divisibility in tests
out_idx=-1 returning the last buffer (B) aligns with the kernel signature. Minor: since ref_program only covers full tiles, add quick assertions to make intent explicit.

  def run_cumsum(M, N, block_M, block_N, dim=0, reverse=False, dtype="float32", scope="smem"):
+     assert M % block_M == 0 and N % block_N == 0, "Tests assume full tiles"
      if scope == "smem":
          program = cumsum_smem_test(M, N, block_M, block_N, dim, reverse, dtype)
      elif scope == "fragment":
          program = cumsum_fragment_test(M, N, block_M, block_N, dim, reverse, dtype)
      jit_kernel = tl.compile(program, out_idx=-1)
54-54: Guard CUDA usage or mark the test to skip on CPU-only environments
Direct .cuda() will fail on CI without a GPU. Either guard with a skip or plumb a device.

- A = torch.randn(M, N, dtype=getattr(torch, dtype)).cuda()
+ import pytest
+ if not torch.cuda.is_available():
+     pytest.skip("CUDA is required for this test")
+ A = torch.randn(M, N, dtype=getattr(torch, dtype), device="cuda")
82-82: Remove duplicate dtype test (leftover from dropping float16)
This duplicates the previous call. Either delete it or reintroduce a second dtype (e.g., bfloat16) if intended.
- run_cumsum(256, 256, 128, 128, dtype="float32")
92-92: Remove duplicate dtype test in fragment path
Same duplication as above.

- run_cumsum(256, 256, 128, 128, dtype="float32", scope="fragment")

src/transform/storage_access.h (1)
103-104: Expose BlockNode visitor in header (style nit: override vs final)
Declaring VisitStmt_(const BlockNode*) here is correct. Minor style nit: other methods use final; AttrStmt uses override. Consider making annotation style consistent across overrides in this class.
src/transform/storage_access.cc (4)
41-41: Minor: simplify Set with existing Var
You can pass buf directly without GetRef(buf.get()). Pure nit.

- buffer_data_to_buffer_.Set(GetRef<Var>(buf.get()), op->buffer);
+ buffer_data_to_buffer_.Set(buf, op->buffer);
68-68: Minor: same Set simplification for stores
Same nit as the load path.

- buffer_data_to_buffer_.Set(GetRef<Var>(buf.get()), op->buffer);
+ buffer_data_to_buffer_.Set(buf, op->buffer);
120-127: Pre-populating alloc_buffers is correct; consider covering match_buffers
Alloc buffers are handled; if this pass needs to reason about match_buffers as well, consider pre-populating those too.
333-345: Index dtype for stride math
linear_to_indices builds stride with Int(32) regardless of index type. Prefer a target index dtype (e.g., 64-bit on some platforms) to avoid overflow in large shapes.

- PrimExpr stride = make_const(DataType::Int(32), 1);
+ const auto idx_dt = offset.dtype().is_int() ? offset.dtype() : DataType::Int(32);
+ PrimExpr stride = make_const(idx_dt, 1);

src/transform/thread_storage_sync.cc (2)
351-362: Capturing thread extents is correct; consider early exit when no thread tags
This capture works; optional micro-optimization: return fast if thread_tag is empty. Not important.
770-771: Minor nit: remove stray semicolon
Harmless but noisy.

- return tl::TileLangThreadSync(std::move(f), storage_scope);
- ;
+ return tl::TileLangThreadSync(std::move(f), storage_scope);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- src/tl_templates/cuda/common.h (1 hunks)
- src/transform/storage_access.cc (5 hunks)
- src/transform/storage_access.h (3 hunks)
- src/transform/thread_storage_sync.cc (11 hunks)
- testing/python/language/test_tilelang_language_cumsum.py (6 hunks)
✅ Files skipped from review due to trivial changes (1)
- src/tl_templates/cuda/common.h
🧰 Additional context used
🧬 Code graph analysis (4)
src/transform/thread_storage_sync.cc (4)
src/transform/storage_access.cc (16)
VisitStmt_(62-91)VisitStmt_(62-62)VisitStmt_(93-104)VisitStmt_(93-93)VisitStmt_(106-118)VisitStmt_(106-106)VisitStmt_(120-127)VisitStmt_(120-120)VisitStmt_(129-179)VisitStmt_(129-129)VisitStmt_(181-207)VisitStmt_(181-181)VisitStmt_(221-258)VisitStmt_(221-221)VisitStmt_(260-276)VisitStmt_(260-260)src/transform/flatten_buffer.cc (4)
func(44-55)func(44-44)buffer(240-244)buffer(240-241)src/transform/warp_specialized_rewriter.cc (10)
n(944-948)n(944-944)stmt(57-68)stmt(57-57)stmt(127-130)stmt(127-127)stmt(132-136)stmt(132-132)f(1155-1174)f(1155-1156)src/transform/inject_tma_barrier.cc (4)
f(86-90)f(86-86)f(365-391)f(365-365)
src/transform/storage_access.cc (1)
src/transform/thread_storage_sync.cc (21)
buf(55-57)buf(55-55)op(350-362)op(350-350)op(408-422)op(408-408)op(453-459)op(453-453)op(460-466)op(460-460)op(467-485)op(467-467)op(487-526)op(487-487)op(541-560)op(541-541)op(611-630)op(611-611)ICHECK(561-583)buffer_var(536-538)buffer_var(536-536)
testing/python/language/test_tilelang_language_cumsum.py (3)
tilelang/jit/__init__.py (1)
compile(32-81)tilelang/jit/kernel.py (1)
out_idx(446-447)testing/python/language/test_tilelang_language_reduce_sum.py (2)
ref_program(35-36)ref_program(72-73)
src/transform/storage_access.h (1)
src/transform/storage_access.cc (16)
VisitStmt_(62-91)VisitStmt_(62-62)VisitStmt_(93-104)VisitStmt_(93-93)VisitStmt_(106-118)VisitStmt_(106-106)VisitStmt_(120-127)VisitStmt_(120-120)VisitStmt_(129-179)VisitStmt_(129-129)VisitStmt_(181-207)VisitStmt_(181-181)VisitStmt_(221-258)VisitStmt_(221-221)VisitStmt_(260-276)VisitStmt_(260-260)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-nvidia
- GitHub Check: build-test-amd
- GitHub Check: bot-task
🔇 Additional comments (11)
testing/python/language/test_tilelang_language_cumsum.py (3)
7-7: Default dtype switch to float32 looks fine
The API shift to default dtype="float32" is consistent with the rest of the PR. No functional concerns.

26-26: Fragment path default dtype aligned
Matching the smem path to default float32 keeps both scopes consistent.

69-71: Direct execution and assert_close LGTM
Replacing profiler-based checks with direct execution and numerical comparison improves determinism. Tolerances look reasonable for float32 on GPU.
src/transform/storage_access.h (3)
82-84: New is_pointer_access flag is a good addition
This provides a clear, explicit signal down the pipeline for conservative handling of pointer-derived accesses.

105-107: Public setter for buffer_data_to_buffer_ is appropriate
This enables planners (e.g., ThreadSync) to seed mappings from func->buffer_map.

168-170: Map<Var, Buffer> buffer_data_to_buffer_ addition
Good to persist the mapping across visits. Ensure all insertion points (BufferLoad, BufferStore, Block alloc_buffers, and external setter) cover all buffer origins.
src/transform/storage_access.cc (1)
361-364: Conservative conflict handling for pointer access is fine
Given the uncertainty, marking pointer-based accesses with e.is_pointer_access and handling conservatively in conflict detection is reasonable.
src/transform/thread_storage_sync.cc (4)
64-75: Good: unify shared.dyn buffer identity before planning
Coalescing shared.dyn to a single Var improves sync planning coherence across dynamic shared allocations.

249-257: Use of bytes() multiply with analyzer Simplify is sound
Computing byte addresses to reason about disjointness is a practical approach.

367-368: insert_syncs now unconditionally records syncs
This simplification is fine and avoids partial-sync-specific bookkeeping.

431-433: Constructor signature update propagates correctly
ThreadSyncInserter updated to drop partial-sync state is consistent with planner changes.
…tilize the analyzer for simplification. Update buffer index calculations to ensure consistent simplification of range expressions.
Actionable comments posted: 0
♻️ Duplicate comments (2)
src/transform/storage_access.cc (2)
285-293: Bug: Ramp lanes must be a compile-time int; shape[i] can be dynamic
`Ramp(load->indices[i], 1, buffer->shape[i])` passes a PrimExpr for lanes; this will fail when `shape[i]` isn't `IntImm`. Also, the stride literal `1` should match the index dtype.
Apply:
- for (size_t i = 0; i < buffer->shape.size(); ++i) {
-   buffer_indices.push_back(Ramp(load->indices[i], 1, buffer->shape[i]));
- }
+ for (size_t i = 0; i < buffer->shape.size(); ++i) {
+   if (const auto* lanes_imm = buffer->shape[i].as<IntImmNode>()) {
+     buffer_indices.push_back(
+         Ramp(load->indices[i],
+              make_const(load->indices[i].dtype(), 1),
+              static_cast<int>(lanes_imm->value)));
+   } else {
+     // Fallback: unknown lanes — keep scalar index
+     buffer_indices.push_back(load->indices[i]);
+   }
+ }
328-329: Bug: dynamic lanes in Ramp for tvm_access_ptr; add guards and typed stride
Both `extent` and `(end - start)` can be non-IntImm. Only construct `Ramp` when lanes are compile-time ints; otherwise keep scalar indices. Also ensure the stride `1` matches the index dtype.
Apply:
- // cannot find buffer map, use the default buffer
- buffer_indices = {Ramp(offset, 1, extent)};
+ // cannot find buffer map; use 1D index. Use Ramp only if extent is const.
+ if (const auto* lanes_imm = extent.as<IntImmNode>()) {
+   buffer_indices = {Ramp(offset,
+                          make_const(offset.dtype(), 1),
+                          static_cast<int>(lanes_imm->value))};
+ } else {
+   buffer_indices = {offset};
+ }
@@
- for (size_t i = 0; i < buffer_shape.size(); ++i) {
-   buffer_indices.push_back(
-       Ramp(start_indices[i], 1,
-            analyzer_.Simplify(end_indices[i] - start_indices[i])));
- }
+ for (size_t i = 0; i < buffer_shape.size(); ++i) {
+   PrimExpr lanes_expr =
+       analyzer_.Simplify(end_indices[i] - start_indices[i]);
+   if (const auto* lanes_imm = lanes_expr.as<IntImmNode>()) {
+     buffer_indices.push_back(
+         Ramp(start_indices[i],
+              make_const(start_indices[i].dtype(), 1),
+              static_cast<int>(lanes_imm->value)));
+   } else {
+     buffer_indices.push_back(start_indices[i]);
+   }
+ }

Also applies to: 352-355
🧹 Nitpick comments (5)
src/transform/storage_access.cc (5)
120-127: Also prepopulate match_buffers; ICHECK is redundant
- Block allocations are covered, but buffers introduced via `match_buffers` are not prepopulated. Add them so address_of/tvm_access_ptr can resolve shapes for matched buffers too.
- `ICHECK(buffer->IsInstance<BufferNode>())` is always true for `Buffer`; it's a nit.

Apply:
void TileLangStorageAccessVisitor::VisitStmt_(const BlockNode *op) {
  auto block = Downcast<Block>(op);
  for (const auto &buffer : block->alloc_buffers) {
-   ICHECK(buffer->IsInstance<BufferNode>());
    buffer_data_to_buffer_.Set(buffer->data, buffer);
  }
+ for (const auto& mb : block->match_buffers) {
+   const Buffer& buf = mb->buffer;
+   buffer_data_to_buffer_.Set(buf->data, buf);
+ }
  IRVisitorWithAnalyzer::VisitStmt_(op);
}
304-307: Confirm address_of access classification (read vs write)
`address_of` itself doesn't perform a memory read/write, but downstream passes (thread_storage_sync.cc) treat it as both read and write for stats. If your analysis relies on AccessEntry types, consider emitting two entries (kRead and kWrite) here for consistency with those stats, or document why a single kRead is sufficient.

Proposed change if you want both:
- e.is_pointer_access = true;
- e.type = kRead;
- e.scope = scope;
- curr_stmt_.access.emplace_back(e);
+ e.is_pointer_access = true;
+ e.scope = scope;
+ e.type = kRead;
+ curr_stmt_.access.emplace_back(e);
+ e.type = kWrite;
+ curr_stmt_.access.emplace_back(e);
333-347: Index conversion: fix dtype of constants; consider indexdiv/indexmod and O(N) stride precompute
- Initialize `stride` with the same dtype as `offset` to avoid mixed-dtype arithmetic.
- Consider `indexdiv`/`indexmod` (or keep `floordiv`/`floormod` if that's the project's norm).
- Minor: precompute strides once (O(N)) rather than recomputing inside the outer loop (O(N^2)).

Apply minimal dtype fix:

- Array<PrimExpr> indices;
- PrimExpr remaining = offset;
- for (size_t i = 0; i < shape.size(); ++i) {
-   PrimExpr stride = make_const(DataType::Int(32), 1);
+ Array<PrimExpr> indices;
+ PrimExpr remaining = offset;
+ for (size_t i = 0; i < shape.size(); ++i) {
+   PrimExpr stride = make_const(offset.dtype(), 1);
    for (size_t j = i + 1; j < shape.size(); ++j) {
      stride = stride * shape[j];
    }
    PrimExpr idx = FloorDiv(remaining, stride);
    remaining = FloorMod(remaining, stride);
    indices.push_back(analyzer_.Simplify(idx));
  }

If you want the O(N) version, I can provide a follow-up patch.
320-320: Optional: assert flag is an IntImm
Mostly stylistic, but consistent with nearby code: `ICHECK(flag)` after the `as<IntImmNode>()` cast can help catch malformed IR early.

- const IntImmNode *flag = op->args[4].as<IntImmNode>();
+ const IntImmNode* flag = op->args[4].as<IntImmNode>();
+ ICHECK(flag) << "tvm_access_ptr flag must be IntImm";
314-376: Tests recommended: dynamic shapes and pointer paths
Please add tests that:
- exercise address_of and tvm_access_ptr with dynamic shapes/extent (to validate the Ramp fallbacks),
- validate multi-dimensional index reconstruction from a linear offset,
- cover both read and write flags on tvm_access_ptr.
I can draft unit tests mirroring the scenarios above if helpful.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
src/transform/storage_access.cc (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/storage_access.cc (1)
src/transform/thread_storage_sync.cc (21)
buf(55-57), buf(55-55), op(350-362), op(350-350), op(408-422), op(408-408), op(453-459), op(453-453), op(460-466), op(460-460), op(467-485), op(467-467), op(487-526), op(487-487), op(541-560), op(541-541), op(611-630), op(611-611), ICHECK(561-583), buffer_var(536-538), buffer_var(536-536)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: bot-task
🔇 Additional comments (4)
src/transform/storage_access.cc (4)
41-41: Good: populate Var→Buffer map on BufferLoad
Storing `buffer_data_to_buffer_` here is correct and unblocks later pointer→shape resolution.

68-68: Good: populate Var→Buffer map on BufferStore
Symmetric to the load path; this keeps the map consistent.
315-316: Likely wrong dtype derivation for tvm_access_ptr
`DataType dtype = op->args[0].dtype();` is almost certainly the scalar dtype of the "dtype argument expression" (commonly an int), not the pointed-to element type. For `e.dtype`, prefer the buffer dtype when available; otherwise derive the element type per TVM convention.

Do you want me to adjust this to:
- use `buffer->dtype` when `buffer_data_to_buffer_` has the mapping, and
- otherwise fall back to an agreed derivation for the element dtype?
Also applies to: 361-361
362-365: Good: mark pointer-based accesses
`e.is_pointer_access = true` is useful for downstream sync planning and differentiates pointer-origin accesses from direct buffer indexing.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/transform/thread_storage_sync.cc (1)
657-663: Danger: dropping the sync when thread_count % 32 != 0 silently removes required barriers
Returning an empty Stmt here erases the synchronization entirely for partial groups that aren't warp-multiples. That can introduce races. If a named/partial barrier can't be emitted, fall back to the original full shared sync instead of removing it.
Apply this diff to preserve correctness:
- if (thread_count % 32 != 0) {
-   // TODO(lei): This is a workaround for the case where the thread count is
-   // not a multiple of 32. we should enhance the pass to analysis index
-   // instead of buffer expression etc.
-   return Stmt();
- }
+ if (thread_count % 32 != 0) {
+   // Fallback to the original full sync to preserve correctness.
+   return Evaluate(IRMutatorWithAnalyzer::VisitExpr_(op));
+ }

If you prefer to avoid any rewrite when thread_count isn't warp-aligned, another safe alternative is to return the original call unchanged:
- return Stmt();
+ return Evaluate(Call(op->dtype, op->op, {StringImm(scope)}));
♻️ Duplicate comments (2)
src/transform/thread_storage_sync.cc (2)
298-307: Correct dtype choice for placeholder Vars (matches thread var dtype)
Using info.iv->var.dtype() for the split placeholders fixes the prior unsoundness where these Vars could differ from threadIdx.* dtype. This keeps the analyzer's reasoning consistent.

743-761: PrimFunc-based API transition looks good; verify backend support for rewritten tvm_storage_sync
Switching to PrimFunc and ordering the passes (planner → insertion → partial-sync rewrite) is coherent. However, after rewriting, tvm_storage_sync in shared scope may carry extra args (barrier_id, thread_count). Please verify that:
- CUDA and HIP codegen paths accept the multi-arg variant for shared sync, and
- HIP codegen maps it to the intended mbarrier-style ops.
This was flagged earlier; confirming resolution prevents backend miscompiles.
Use this script to locate and inspect codegen handlers:
#!/bin/bash
# Locate HIP/CUDA handlers that print/emit storage sync
rg -nP -C3 'PrintStorageSync|tvm_storage_sync|storage_sync' -- src
# Specifically verify HIP handler recognizes >1 arg in shared scope
rg -nP -C3 'class\s+CodeGenTileLangHIP|PrintStorageSync' -- src

If HIP still only prints __syncthreads() for the single-arg case, add handling for the (scope, barrier_id, thread_count) form.
🧹 Nitpick comments (2)
src/transform/thread_storage_sync.cc (2)
281-297: Micro-optimization: skip split-substitution when a thread var isn't used
Before substituting tx/ty/tz with split placeholders, check whether the index expressions actually reference the thread var. This avoids unnecessary analyzer Bind/Substitute work on hot paths.
Apply this diff inside the for-loop over thread_vars:
- for (const auto &info : thread_vars) {
+ for (const auto &info : thread_vars) {
+   // Skip if neither side references this thread var
+   if (!UsesVar(prev_indice_bytes, {info.iv->var.get()}) &&
+       !UsesVar(curr_indice_bytes, {info.iv->var.get()})) {
+     continue;
+   }
    Var prev_var(info.name_prev, info.iv->var.dtype());
    Var curr_var(info.name_curr, info.iv->var.dtype());
    analyzer_.Bind(prev_var, info.iv->dom);
    analyzer_.Bind(curr_var, info.iv->dom);
    prev_indice_bytes = Substitute(prev_indice_bytes, {{info.iv->var, prev_var}});
    curr_indice_bytes = Substitute(curr_indice_bytes, {{info.iv->var, curr_var}});
  }
769-772: Nit: remove unused variable and stray semicolon in pass lambda
`auto *n = f.CopyOnWrite();` is unused, and there's a dangling semicolon on a separate line.
Apply this tidy-up:
- auto pass_func = [storage_scope](PrimFunc f, IRModule m, PassContext ctx) {
-   auto *n = f.CopyOnWrite();
-   return tl::TileLangThreadSync(std::move(f), storage_scope);
-   ;
- };
+ auto pass_func = [storage_scope](PrimFunc f, IRModule m, PassContext ctx) {
+   return tl::TileLangThreadSync(std::move(f), storage_scope);
+ };
📒 Files selected for processing (1)
src/transform/thread_storage_sync.cc (11 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/thread_storage_sync.cc (4)
src/transform/storage_access.cc (16)
VisitStmt_(62-91), VisitStmt_(62-62), VisitStmt_(93-104), VisitStmt_(93-93), VisitStmt_(106-118), VisitStmt_(106-106), VisitStmt_(120-127), VisitStmt_(120-120), VisitStmt_(129-179), VisitStmt_(129-129), VisitStmt_(181-207), VisitStmt_(181-181), VisitStmt_(221-258), VisitStmt_(221-221), VisitStmt_(260-276), VisitStmt_(260-260)
src/transform/flatten_buffer.cc (4)
func(44-55), func(44-44), buffer(240-244), buffer(240-241)
src/transform/warp_specialized_rewriter.cc (10)
n(944-948), n(944-944), stmt(57-68), stmt(57-57), stmt(127-130), stmt(127-127), stmt(132-136), stmt(132-132), f(1155-1174), f(1155-1156)
src/transform/pipeline_planning.cc (2)
f(147-157), f(147-147)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-test-amd
- GitHub Check: bot-task
🔇 Additional comments (7)
src/transform/thread_storage_sync.cc (7)
64-75: Unifying shared.dyn accesses under a single buffer var is sound here
Redirecting all shared.dyn accesses in a sequence to a common Var improves the planner's visibility of conflicts for dynamic shared memory. This matches the single dynamic shared segment model and should reduce spurious misses in the planner.

250-254: Minor clarity win: explicit PrimExpr types
Switching to explicit PrimExpr for prev_indice_bytes and curr_indice_bytes avoids type deduction surprises and improves readability.

351-362: Accurately tracking threadIdx.{x,y,z} extents
Capturing the thread extent IterVars and delegating to the base visitor is correct. This enables the later disjointness proof and keeps analyzer bindings intact.

367-368: Always recording sync sites is fine (set prevents duplication)
Unconditionally inserting into syncs_inserted_ is safe due to the set semantics and simplifies the planner logic.

430-433: Constructor simplification is appropriate
Removing the unused partial-sync map streamlines ThreadSyncInserter now that partial sync is unified.

752-754: Good: bind buffer data->buffer before planning
This mirrors other passes (e.g., pipeline planning, warp-specialized rewriter) and ensures planner sees canonical buffers.
34-34: Ensure `thread_sync_types.h` is included in all build targets
I located the new header at `src/transform/common/thread_sync_types.h`, so the include path in `thread_storage_sync.cc` is correct. However, I didn't find any references to this header in the CMake or BUCK files under `src`. Please manually verify and update your build configuration to include this header in the appropriate targets:
- Confirm that `src/transform/common/thread_sync_types.h` is added to the corresponding `add_library` or `target_sources` entries in your `CMakeLists.txt`.
- If you're using BUCK, ensure the header appears in the `srcs` or `exported_headers` of the relevant `cxx_library` rule.
- Double-check any `source_group` directives or IDE project files (if maintained) to keep IDE views in sync.
…simplify access handling. Removed unused buffer mapping logic, ensuring consistent buffer index generation with a default ramp.
Actionable comments posted: 0
♻️ Duplicate comments (2)
src/transform/storage_access.cc (2)
287-293: Bug: Ramp lanes must be compile-time IntImm; add guard + correct stride dtype
`Ramp(base, stride, lanes)` requires `lanes` to be an `int` known at compile time. Using `buffer->shape[i]` (PrimExpr) will fail if it's not an `IntImm`. Also, the stride literal should match the index dtype via `make_const`. Apply:
- Array<PrimExpr> buffer_indices;
- // from indices to buffer indices
- ICHECK(buffer->shape.size() == load->indices.size());
- for (size_t i = 0; i < buffer->shape.size(); ++i) {
-   buffer_indices.push_back(Ramp(load->indices[i], 1, buffer->shape[i]));
- }
+ Array<PrimExpr> buffer_indices;
+ // from indices to buffer indices
+ ICHECK(buffer->shape.size() == load->indices.size());
+ for (size_t i = 0; i < buffer->shape.size(); ++i) {
+   PrimExpr lanes_expr = analyzer_.Simplify(buffer->shape[i]);
+   if (const auto* lanes_imm = lanes_expr.as<IntImmNode>()) {
+     buffer_indices.push_back(
+         Ramp(load->indices[i],
+              make_const(load->indices[i].dtype(), 1),
+              static_cast<int>(lanes_imm->value)));
+   } else {
+     // Fallback: unknown vector width — keep scalar index
+     buffer_indices.push_back(load->indices[i]);
+   }
+ }
316-333: Bug: tvm_access_ptr Ramp uses dynamic `extent`; guard lanes + fix stride dtype
`extent` is a `PrimExpr` that may be dynamic. Construct `Ramp` only when `extent` is an `IntImm`; otherwise fall back to scalar indexing. Also, ensure the stride is `make_const(offset.dtype(), 1)`.

- Array<PrimExpr> buffer_indices;
- buffer_indices = {Ramp(offset, 1, extent)};
+ Array<PrimExpr> buffer_indices;
+ if (const auto* lanes_imm = extent.as<IntImmNode>()) {
+   buffer_indices = {
+       Ramp(offset, make_const(offset.dtype(), 1),
+            static_cast<int>(lanes_imm->value))};
+ } else {
+   // Fallback: unknown vector width — keep scalar index
+   buffer_indices = {offset};
+ }
🧹 Nitpick comments (3)
src/transform/storage_access.cc (3)
41-41: Use Var directly instead of GetRef(...get()) for clarity
`buf` is already a Var. Re-wrapping the raw pointer adds noise without benefit. Same in BufferStore.

- buffer_data_to_buffer_.Set(GetRef<Var>(buf.get()), op->buffer);
+ buffer_data_to_buffer_.Set(buf, op->buffer);

And similarly:

- buffer_data_to_buffer_.Set(GetRef<Var>(buf.get()), op->buffer);
+ buffer_data_to_buffer_.Set(buf, op->buffer);

Also applies to: 68-68
120-127: Pre-populate alloc_buffers map — good; remove redundant type check
The prepopulation of `buffer_data_to_buffer_` from `Block::alloc_buffers` is helpful and correct. The `ICHECK(buffer->IsInstance<BufferNode>())` is redundant because `buffer` is a `Buffer` handle and will always wrap a `BufferNode`.

- for (const auto &buffer : block->alloc_buffers) {
-   ICHECK(buffer->IsInstance<BufferNode>());
-   buffer_data_to_buffer_.Set(buffer->data, buffer);
- }
+ for (const auto& buffer : block->alloc_buffers) {
+   buffer_data_to_buffer_.Set(buffer->data, buffer);
+ }
279-313: Align `address_of()` semantics across analyses
The stats pass in `thread_storage_sync.cc` currently attributes both reads and writes to global buffers when it sees `address_of()`, whereas `storage_access.cc` only emits a `kRead`. For consistency—and to avoid undercounting pointer writes—you should either:
- Emit both a `kRead` and a `kWrite` `AccessEntry` here, or
- Document why this pass intentionally records only reads.

The audit of all `Ramp(base, stride, lanes)` calls (via the provided `rg` script) uncovered these locations:
- src/transform/storage_access.cc (lines 291, 325)
- src/transform/vectorize_loop.cc (lines 211, 256, 260, 318, 776)
- src/transform/common/loop_vectorization_utils.h (lines 180, 225, 229, 287, 768)
- src/transform/storage_rewrite.cc (line 1542)
If you choose to align here, please also consider whether similar pointer‐access logic in the vectorizer and rewrite passes needs updating.
Minimal change in `storage_access.cc`:

- e.is_pointer_access = true;
- e.type = kRead;
- e.scope = scope;
- curr_stmt_.access.emplace_back(e);
+ e.is_pointer_access = true;
+ e.scope = scope;
+ e.type = kRead;
+ curr_stmt_.access.emplace_back(e);
+ // Also count writes to match thread_storage_sync.cc
+ e.type = kWrite;
+ curr_stmt_.access.emplace_back(e);

Optional follow-up: propagate this pattern (or document read-only intent) in the other Ramp-using modules listed above.
📒 Files selected for processing (1)
src/transform/storage_access.cc (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/storage_access.cc (1)
src/transform/thread_storage_sync.cc (21)
buf(55-57), buf(55-55), op(350-362), op(350-350), op(408-422), op(408-408), op(453-459), op(453-453), op(460-466), op(460-460), op(467-485), op(467-467), op(487-526), op(487-487), op(541-560), op(541-541), op(611-630), op(611-611), ICHECK(561-583), buffer_var(536-538), buffer_var(536-536)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-amd
- GitHub Check: build-test-nvidia
- GitHub Check: bot-task
🔇 Additional comments (2)
src/transform/storage_access.cc (2)
285-285: LGTM: Capture Var->Buffer mapping on address_of()
Storing `buffer_var -> Buffer` here unblocks downstream pointer-based access accounting.

300-305: Pointer access metadata looks right
Assigning `e.buffer_indices` and setting `e.is_pointer_access = true` for address_of() is appropriate.
…buffer ranges for improved pointer access handling. Update AccessEntry structure to include buffer_ranges and adjust thread storage synchronization logic to account for pointer access conflicts.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/transform/thread_storage_sync.cc (1)
663-665: Critical: never drop a sync — fallback to full sync instead of returning empty StmtReturning Stmt() removes the barrier entirely when thread_count % 32 != 0, which can break correctness. Fallback to the original full-sync Evaluate instead.
- if (thread_count % 32 != 0) {
-   // TODO(lei): This is a workaround for the case where the thread count is
-   // not a multiple of 32. we should enhance the pass to analysis index
-   // instead of buffer expression etc.
-   return Stmt();
- }
+ if (thread_count % 32 != 0) {
+   // Fallback: keep full barrier semantics instead of dropping the sync.
+   return Evaluate(IRMutatorWithAnalyzer::VisitExpr_(op));
+ }
♻️ Duplicate comments (3)
src/transform/thread_storage_sync.cc (3)
242-248: Correct: treat any pointer access as conflicting, unconditionally
This avoids unsound "non-overlap" conclusions for opaque pointer-based accesses.

296-315: Good fix: placeholder Vars now match thread var dtypes
Using info.iv->var.dtype() aligns analyzer bindings and avoids subtle unsoundness.

489-528: RW stats on tvm_access_ptr/address_of updated — OK; verify HIP/CUDA codegen handles rewritten sync forms
Given ThreadPartialSyncRewriter may emit tvm_storage_sync(scope, barrier_id, thread_count) for shared/shared.dyn and ThreadSyncInserter emits the 3-arg global form, ensure backends print these signatures. Past HIP codegen only handled the 1-arg form.
Run the following to check codegen coverage:
#!/bin/bash
# Locate codegen handlers for tvm_storage_sync across backends
rg -nP --glob '!**/third_party/**' 'Print(Storage|.*)Sync|tvm_storage_sync\(' -C3
# Specifically check HIP & CUDA emitters
rg -nP --glob '!**/third_party/**' 'CodeGen.*(HIP|CUDA).*tvm_storage_sync' -C3

If HIP/CUDA still only accept a single argument, please update them to handle:
- shared/shared.dyn partial sync: (scope: "shared"|"shared.dyn", barrier_id:int, thread_count:int)
- global sync: (scope:"global", is_lead:bool, num_blocks:int)
🧹 Nitpick comments (4)
src/transform/storage_access.h (2)
68-70: Define semantics of buffer_ranges explicitly; current use suggests "min/extent" per dim for pointer access
Please add a brief contract here (e.g., ranges are best-effort and may over-approximate) to prevent downstream consumers from assuming they are exact. This will avoid misinterpretation when pointer-based accesses are conservatively treated as conflicting anyway.
Apply this comment-only diff to clarify:
- /*! \brief The buffer ranges for pointer access */
+ /*! \brief The buffer ranges for pointer access (per-dimension min/extent).
+  *
+  * Note: These ranges are best-effort/over-approximations derived from pointer
+  * expressions; downstream analyses must not assume exactness. Pointer-based
+  * accesses are conservatively considered conflicting regardless.
+  */
107-109: Public setter is fine; consider documenting lifecycle
A short note that this map is per-visitor instance and intentionally not cleared between statements/blocks would help avoid accidental misuse.
src/transform/thread_storage_sync.cc (2)
745-763: PrimFunc-based ThreadSync refactor looks good; minor nits
- Good: return PrimFunc and feed planner with func->buffer_map via SetBufferDataToBuffer.
- Consider moving ThreadPartialSyncRewriter behind a feature flag or pass option if we intend to eventually retire partial syncs.
771-774: Remove unused variable and stray semicolon in pass wrapper
Small cleanup to avoid warnings.

- auto pass_func = [storage_scope](PrimFunc f, IRModule m, PassContext ctx) {
-   auto *n = f.CopyOnWrite();
-   return tl::TileLangThreadSync(std::move(f), storage_scope);
-   ;
- };
+ auto pass_func = [storage_scope](PrimFunc f, IRModule m, PassContext ctx) {
+   return tl::TileLangThreadSync(std::move(f), storage_scope);
+ };
📒 Files selected for processing (3)
src/transform/storage_access.cc (5 hunks)
src/transform/storage_access.h (4 hunks)
src/transform/thread_storage_sync.cc (11 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/transform/storage_access.h (1)
src/transform/storage_access.cc (16)
VisitStmt_(62-91), VisitStmt_(62-62), VisitStmt_(93-104), VisitStmt_(93-93), VisitStmt_(106-118), VisitStmt_(106-106), VisitStmt_(120-127), VisitStmt_(120-120), VisitStmt_(129-179), VisitStmt_(129-129), VisitStmt_(181-207), VisitStmt_(181-181), VisitStmt_(221-258), VisitStmt_(221-221), VisitStmt_(260-276), VisitStmt_(260-260)
src/transform/thread_storage_sync.cc (6)
src/target/codegen_cuda.cc (6)
op(128-143), op(128-128), op(1358-1360), op(1358-1358), op(1361-1363), op(1361-1361)
src/transform/storage_access.cc (16)
VisitStmt_(62-91), VisitStmt_(62-62), VisitStmt_(93-104), VisitStmt_(93-93), VisitStmt_(106-118), VisitStmt_(106-106), VisitStmt_(120-127), VisitStmt_(120-120), VisitStmt_(129-179), VisitStmt_(129-129), VisitStmt_(181-207), VisitStmt_(181-181), VisitStmt_(221-258), VisitStmt_(221-221), VisitStmt_(260-276), VisitStmt_(260-260)
src/transform/flatten_buffer.cc (4)
func(44-55), func(44-44), buffer(240-244), buffer(240-241)
src/transform/warp_specialized_rewriter.cc (10)
n(944-948), n(944-944), stmt(57-68), stmt(57-57), stmt(127-130), stmt(127-127), stmt(132-136), stmt(132-132), f(1155-1174), f(1155-1156)
src/transform/legalize_safe_memory_access.cc (13)
buffer(78-86), buffer(78-78), buffer(89-128), buffer(89-90), buffer(233-236), buffer(233-233), buffer(238-241), buffer(238-238), buffer(243-246), buffer(243-243), buffer(248-253), f(262-274), f(262-262)
src/transform/pipeline_planning.cc (2)
f(147-157), f(147-147)
src/transform/storage_access.cc (2)
src/transform/thread_storage_sync.cc (21)
buf(55-57), buf(55-55), op(352-364), op(352-352), op(410-424), op(410-410), op(455-461), op(455-455), op(462-468), op(462-462), op(469-487), op(469-469), op(489-528), op(489-489), op(543-562), op(543-543), op(613-632), op(613-613), ICHECK(563-585), buffer_var(538-540), buffer_var(538-538)
src/transform/storage_rewrite.cc (1)
i(703-839)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-amd
- GitHub Check: build-test-nvidia
- GitHub Check: bot-task
🔇 Additional comments (8)
src/transform/storage_access.h (3)
85-86: Good addition: explicit flag for pointer-based accesses
The new flag is clear and enables conservative conflict handling upstream.

105-106: Block visitor override is appropriate
Visiting Block to prepopulate alloc_buffers improves coverage for intra-block allocations.

170-172: Var→Buffer map is the right primitive
This internal state is necessary for pointer-based access reconstruction. Looks good.
src/transform/storage_access.cc (2)
350-357: Index-difference extent can be dynamic; that's OK, but keep the dtype consistent
Good use of Simplify on (end - start). With the dtype fix above, this stays coherent across targets.

414-419: GetScope fallback to global is reasonable
Pointer-annotated vars use GetPtrStorageScope; others default to global. LGTM.
src/transform/thread_storage_sync.cc (3)
64-75: Nice: unify shared.dyn accesses onto a single Var for planning
This improves sync planning for dynamic shared memory. Good conservative approach.

353-364: Thread extent tracking integrated into planner
Capturing tx/ty/tz here and delegating to base visitor is clean. LGTM.

430-435: Signature reduction of ThreadSyncInserter is consistent with planner changes
Looks correct; all sites updated in this file.
buffer_data_to_buffer_.Set(GetRef<Var>(buffer_var), buffer);
StorageScope scope = GetScope(GetRef<Var>(buffer_var));
Array<Range> buffer_ranges;
// from indices to buffer indices
ICHECK(buffer->shape.size() == load->indices.size());
for (size_t i = 0; i < buffer->shape.size(); ++i) {
  buffer_ranges.push_back(
      Range::FromMinExtent(load->indices[i], buffer->shape[i]));
}
if (Enabled(buffer_var, scope)) {
Bug: buffer_ranges extent for address_of is incorrect; use unit-extent instead of full dim size
For address_of(A[i, j, ...]), the constructed per-dimension range currently uses extent = shape[i], which can exceed bounds (e.g., start+extent > dim_size) and grossly over-approximate. Since address_of yields a pointer to a single element (or vector lane count, see below), the conservative and type-safe choice is unit extent.
Also ensure the literal 1 matches the index dtype to avoid dtype mismatches.
Apply:
- for (size_t i = 0; i < buffer->shape.size(); ++i) {
- buffer_ranges.push_back(
- Range::FromMinExtent(load->indices[i], buffer->shape[i]));
- }
+ for (size_t i = 0; i < buffer->shape.size(); ++i) {
+ // address_of points to the element at indices[i]; conservatively model as 1 element
+ buffer_ranges.push_back(Range::FromMinExtent(
+ load->indices[i],
+ make_const(load->indices[i].dtype(), 1)));
+ }If vectorized BufferLoad is supported here (buffer->dtype.lanes() > 1), optionally make extent = lanes for the innermost dim with a guard when lanes is IntImm; otherwise keep unit extent.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
buffer_data_to_buffer_.Set(GetRef<Var>(buffer_var), buffer);
StorageScope scope = GetScope(GetRef<Var>(buffer_var));
Array<Range> buffer_ranges;
// from indices to buffer indices
ICHECK(buffer->shape.size() == load->indices.size());
- for (size_t i = 0; i < buffer->shape.size(); ++i) {
-   buffer_ranges.push_back(
-       Range::FromMinExtent(load->indices[i], buffer->shape[i]));
- }
+ for (size_t i = 0; i < buffer->shape.size(); ++i) {
+   // address_of points to the element at indices[i]; conservatively model as 1 element
+   buffer_ranges.push_back(Range::FromMinExtent(
+       load->indices[i],
+       make_const(load->indices[i].dtype(), 1)));
+ }
if (Enabled(buffer_var, scope)) {
DataType dtype = op->args[0].dtype();
- const VarNode *buffer = op->args[1].as<VarNode>();
+ const VarNode *buffer_var = op->args[1].as<VarNode>();
PrimExpr offset = op->args[2];
PrimExpr extent = op->args[3];
const IntImmNode *flag = op->args[4].as<IntImmNode>();
- StorageScope scope = GetScope(GetRef<Var>(buffer));
+ StorageScope scope = GetScope(GetRef<Var>(buffer_var));
// The buffer scope.
🛠️ Refactor suggestion
Guard tvm_access_ptr arg[1] cast and prefer buffer dtype when available
- Add ICHECK for arg[1] being a Var to avoid null deref.
- The current dtype = op->args[0].dtype() is likely not the element dtype (often IntImm). Prefer buffer dtype when the map is available; otherwise keep a safe fallback.
- DataType dtype = op->args[0].dtype();
- const VarNode *buffer_var = op->args[1].as<VarNode>();
+ const VarNode *buffer_var = op->args[1].as<VarNode>();
+ ICHECK(buffer_var) << "tvm_access_ptr expects a Var as its second argument";
+ // Prefer buffer dtype when we know the buffer; fall back conservatively otherwise.
+ DataType dtype = DataType::UInt(8);

And below, after resolving the buffer from buffer_data_to_buffer_, set:
- } else {
+ } else {
Buffer buffer = buffer_data_to_buffer_.at(GetRef<Var>(buffer_var));
+ dtype = buffer->dtype.element_of();

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Guard and extract the buffer variable
const VarNode* buffer_var = op->args[1].as<VarNode>();
ICHECK(buffer_var)
    << "tvm_access_ptr expects a Var as its second argument";
// Prefer the buffer's element dtype when available; otherwise use a safe default
DataType dtype = DataType::UInt(8);
PrimExpr offset = op->args[2];
PrimExpr extent = op->args[3];
const IntImmNode* flag = op->args[4].as<IntImmNode>();
StorageScope scope = GetScope(GetRef<Var>(buffer_var));
// The buffer scope.
// ... earlier code to populate buffer_data_to_buffer_ ...
if (buffer_data_to_buffer_.count(GetRef<Var>(buffer_var))) {
  Buffer buffer = buffer_data_to_buffer_.at(GetRef<Var>(buffer_var));
  // Override dtype with the actual element type of the buffer
  dtype = buffer->dtype.element_of();
  // existing access-ptr logic using `dtype`…
} else {
  // fallback path for unmapped buffers
  // existing default logic…
}
🤖 Prompt for AI Agents
In src/transform/storage_access.cc around lines 316 to 322, guard the cast of
op->args[1] to Var by adding an ICHECK that arg[1].as<VarNode>() is non-null to
avoid a null deref, and prefer using the resolved buffer's element dtype (when
available from buffer_data_to_buffer_ lookup) instead of op->args[0].dtype();
specifically, after resolving the buffer from buffer_data_to_buffer_ set dtype
to the buffer's dtype, and only fall back to op->args[0].dtype() if no buffer
dtype is available.
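The lookup-with-fallback the review describes can be modeled in a few lines of Python (all names here are illustrative; the real pass works on TVM C++ objects such as `VarNode`, `Buffer`, and `ICHECK`):

```python
# Sketch: prefer the buffer's element dtype when the access var maps to a
# known buffer, and fall back to a conservative 1-byte dtype otherwise.
def resolve_access_dtype(buffer_var, buffer_data_to_buffer, default_dtype="uint8"):
    """Pick the dtype for a tvm_access_ptr-style access."""
    if buffer_var is None:
        # Counterpart of the suggested ICHECK on op->args[1].
        raise ValueError("tvm_access_ptr expects a Var as its second argument")
    buffer = buffer_data_to_buffer.get(buffer_var)
    if buffer is not None:
        # Known buffer: use its element dtype.
        return buffer["dtype"]
    # Unmapped buffer: byte-granular fallback is always safe, just conservative.
    return default_dtype
```

The fallback only affects how offsets/extents are scaled, so an overly small element type stays correct while an overly large guessed one would not.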
```cpp
auto linear_to_indices = [this](PrimExpr offset,
                                const Array<PrimExpr> &shape) {
  Array<PrimExpr> indices;
  PrimExpr remaining = offset;
  for (size_t i = 0; i < shape.size(); ++i) {
    PrimExpr stride = make_const(DataType::Int(32), 1);
    for (size_t j = i + 1; j < shape.size(); ++j) {
      stride = stride * shape[j];
    }
    PrimExpr idx = FloorDiv(remaining, stride);
    remaining = FloorMod(remaining, stride);
    indices.push_back(analyzer_.Simplify(idx));
  }
  return indices;
```
🛠️ Refactor suggestion

Fix dtype handling in linear_to_indices: use the index dtype consistently

`stride` is initialized as an int32 literal `1` and multiplied by `shape[j]`. Mixed dtypes can trigger type errors in FloorDiv/FloorMod. Use the offset's dtype and cast the shape extents to match.

```diff
-      Array<PrimExpr> indices;
-      PrimExpr remaining = offset;
-      for (size_t i = 0; i < shape.size(); ++i) {
-        PrimExpr stride = make_const(DataType::Int(32), 1);
-        for (size_t j = i + 1; j < shape.size(); ++j) {
-          stride = stride * shape[j];
-        }
-        PrimExpr idx = FloorDiv(remaining, stride);
-        remaining = FloorMod(remaining, stride);
-        indices.push_back(analyzer_.Simplify(idx));
-      }
+      Array<PrimExpr> indices;
+      DataType idx_dt = offset.dtype();
+      PrimExpr remaining = offset;
+      for (size_t i = 0; i < shape.size(); ++i) {
+        PrimExpr stride = make_const(idx_dt, 1);
+        for (size_t j = i + 1; j < shape.size(); ++j) {
+          stride = analyzer_.Simplify(stride * cast(idx_dt, shape[j]));
+        }
+        PrimExpr idx = FloorDiv(remaining, stride);
+        remaining = FloorMod(remaining, stride);
+        indices.push_back(analyzer_.Simplify(idx));
+      }
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/transform/storage_access.cc around lines 334 to 347, linear_to_indices
initializes stride as an Int(32) constant and mixes dtypes with shape elements,
which can cause FloorDiv/FloorMod type errors; change stride's initial constant
to use offset.dtype() (make_const with offset->dtype()) and cast each shape[j]
to offset.dtype() before multiplying (or use explicit Cast to offset.dtype()) so
all arithmetic (stride * shape[j], FloorDiv, FloorMod) uses the same index dtype
as offset.
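The arithmetic `linear_to_indices` performs is ordinary row-major delinearization: dimension `i` has stride `prod(shape[i+1:])`, and the flat offset is peeled off with floordiv/floormod. A small Python model (independent of TVM, written with plain ints where the pass uses `PrimExpr`s in one shared index dtype) makes the computation concrete:

```python
def linear_to_indices(offset, shape):
    """Convert a flat row-major offset into per-dimension indices."""
    indices = []
    remaining = offset
    for i in range(len(shape)):
        stride = 1
        for extent in shape[i + 1:]:
            stride *= extent  # product of trailing extents = row-major stride
        indices.append(remaining // stride)  # FloorDiv in the C++ pass
        remaining = remaining % stride       # FloorMod in the C++ pass
    return indices
```

For example, offset 13 in a `[2, 3, 4]`-shaped buffer maps to `[1, 0, 1]`, since 13 = 1*12 + 0*4 + 1.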
…'shared' for consistency in memory allocation. Update related test cases to reflect this change and ensure proper functionality.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/transform/thread_storage_sync.cc (1)
659-663: Do not drop the barrier when thread_count is not a multiple of 32.

Returning an empty Stmt() removes the synchronization entirely, risking data races. If partial barriers are unsupported for a non-warp-aligned count, fall back to the original full sync to preserve correctness.
Suggested fix:
```diff
-    if (thread_count % 32 != 0) {
-      // TODO(lei): This is a workaround for the case where the thread count is
-      // not a multiple of 32. we should enhance the pass to analysis index
-      // instead of buffer expression etc.
-      return Stmt();
-    }
+    if (thread_count % 32 != 0) {
+      // Fallback to the original full sync to maintain correctness when
+      // partial sync cannot be reliably emitted.
+      return Evaluate(IRMutatorWithAnalyzer::VisitExpr_(op));
+    }
```

Follow-up:
- Consider emitting a diagnostic (e.g., pass-context warning) to highlight the fallback, so future improvements can target these cases.
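The correctness rule behind this comment can be stated as a small decision function: a partial (named) barrier is only an option for warp-aligned thread counts, and every other count must degrade to a full block sync, never to no sync at all. This is an illustrative sketch, not the actual `thread_storage_sync.cc` logic:

```python
def lower_sync(thread_count, warp_size=32):
    """Return the barrier form to emit; never silently drop synchronization."""
    if thread_count % warp_size == 0:
        # Named/partial barrier over exactly `thread_count` threads.
        return ("partial_sync", thread_count)
    # Fallback: a full block-wide barrier is always correct, just slower.
    return ("full_sync", None)
```

The key property is that no input ever maps to "no sync": the optimization may be skipped, but the barrier itself must survive.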
♻️ Duplicate comments (2)
src/transform/thread_storage_sync.cc (2)
241-247: Pointer access conservatively forces conflict; resolved as previously requested.

Moving the pointer-access check ahead of index-equality prevents unsound "no-conflict" conclusions for opaque address_of(...) uses. This addresses the earlier concern.
666-669: HIP codegen must handle the 3-arg tvm_storage_sync for partial barriers.

The rewriter emits tvm_storage_sync(scope, barrier_id, thread_count). Ensure CodeGenTileLangHIP::PrintStorageSync (or equivalent) recognizes and lowers this form; otherwise HIP kernels will miscompile or fail. This was raised earlier and still applies here.
Run this script to verify HIP codegen support:
```shell
#!/bin/bash
# Search HIP codegen for storage sync handling and argument count.
# Note: ripgrep has no built-in `cc`/`cuh` file types, so filter by glob instead.
rg -nP -C3 '(CodeGen.*HIP|HIP.*CodeGen)|PrintStorageSync|tvm_storage_sync\s*\(' \
  -g '*.cc' -g '*.cpp' -g '*.c' -g '*.h' -g '*.cu' -g '*.cuh'
# Optionally, check CUDA codegen to mirror behavior for consistency.
rg -nP -C3 '(CodeGen.*CUDA|CUDA.*CodeGen)|PrintStorageSync|tvm_storage_sync\s*\(' \
  -g '*.cc' -g '*.cpp' -g '*.c' -g '*.h' -g '*.cu' -g '*.cuh'
```
🧹 Nitpick comments (4)
testing/python/transform/test_tilelang_transform_thread_sync.py (1)
109-111: Nit: Reuse the helper that sets target/device attrs for consistency.

You inline the pass here, while other tests use run_passes() to attach target, global_symbol, and device region annotations. Using the helper keeps tests uniform and avoids drift if the setup evolves.
Apply within these lines:

```diff
-    mod = tvm.IRModule({"main": func})
-    mod = tilelang.transform.ThreadSync("shared")(mod)
-    tvm.ir.assert_structural_equal(mod["main"], expected)
+    mod = run_passes(func)
+    tvm.ir.assert_structural_equal(mod["main"], expected)
```

src/transform/thread_storage_sync.cc (1)
768-774: Nit: Remove unused variable and stray semicolon.

Minor cleanup in the pass factory.

```diff
-  auto pass_func = [storage_scope](PrimFunc f, IRModule m, PassContext ctx) {
-    auto *n = f.CopyOnWrite();
-    return tl::TileLangThreadSync(std::move(f), storage_scope);
-    ;
-  };
+  auto pass_func = [storage_scope](PrimFunc f, IRModule m, PassContext ctx) {
+    return tl::TileLangThreadSync(std::move(f), storage_scope);
+  };
```

tilelang/engine/phase.py (2)
169-171: Both ThreadSync passes are required for static and dynamic shared-memory barriers

The two back-to-back calls each target a different storage scope:

- ThreadSync("shared") runs the pass with an empty tag, enabling the wait-queue inserter and syncing accesses to statically-allocated shared buffers.
- ThreadSync("shared.dyn") skips the wait-queue step (since its tag is "dyn") and inserts barriers for dynamically-sized shared buffers.

Since each invocation only processes buffers whose scope() exactly matches its argument, omitting one would leave that class of buffers unsynchronized. To prevent future cleanups from removing one of these passes by mistake, please add a clarifying comment in tilelang/engine/phase.py above these lines:

```python
# Insert thread-sync barriers:
#   1. for statically-allocated shared memory ("shared") with wait-queue sync
#   2. for dynamically-allocated shared memory ("shared.dyn") without wait-queue
mod = tilelang.transform.ThreadSync("shared")(mod)
mod = tilelang.transform.ThreadSync("shared.dyn")(mod)
```
168-168: tilelang/engine/phase.py: Gate IR dump behind a debug flag

Unconditionally printing the IRModule on every pass can flood stdout (especially in CI) and break any tooling that parses compiler output. It’s safest to make this dump opt-in. I’ve verified that no existing tests use pytest’s capture fixtures (capsys/capfd) or explicitly assert on this pass’s print output, so this change shouldn’t break downstream tests.
Suggested minimal patch (uses default False):
```diff
--- a/tilelang/engine/phase.py
+++ b/tilelang/engine/phase.py
@@ -165,7 +165,12 @@
     mod = self._merge_shared_memory_allocations(pass_ctx, mod)
-    print("mod \n", mod)
+    # Dump IR only when explicitly enabled; IR dumps can be very large.
+    if pass_ctx.config.get("tl.debug.dump_ir_after_merge_shared", False):
+        print("IR after MergeSharedMemoryAllocations:\n", mod)
     return mod
```

Optionally, switch to the standard logging module so downstream consumers can control verbosity via log levels:

```diff
-    if pass_ctx.config.get("tl.debug.dump_ir_after_merge_shared", False):
-        print("IR after MergeSharedMemoryAllocations:\n", mod)
+    if pass_ctx.config.get("tl.debug.dump_ir_after_merge_shared", False):
+        import logging
+        logging.getLogger(__name__).debug(
+            "IR after MergeSharedMemoryAllocations:\n%s", mod
+        )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- src/transform/thread_storage_sync.cc (11 hunks)
- testing/python/transform/test_tilelang_transform_thread_sync.py (2 hunks)
- tilelang/engine/phase.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
testing/python/transform/test_tilelang_transform_thread_sync.py (4)
- tilelang/language/ast/ir.py (2): `func` (1343-1356), `allocate` (986-1015)
- src/transform/thread_storage_sync.cc (2): `ThreadSync` (767-774), `ThreadSync` (767-767)
- tilelang/transform/__init__.py (1): `ThreadSync` (116-129)
- tilelang/env.py (1): `disable_cache` (232-233)

src/transform/thread_storage_sync.cc (3)
- src/transform/storage_access.cc (16): `VisitStmt_` (62-91, 62-62, 93-104, 93-93, 106-118, 106-106, 120-127, 120-120, 129-179, 129-129, 181-207, 181-181, 221-258, 221-221, 260-276, 260-260)
- src/transform/flatten_buffer.cc (4): `func` (44-55, 44-44), `buffer` (240-244, 240-241)
- src/transform/legalize_safe_memory_access.cc (13): `buffer` (78-86, 78-78, 89-128, 89-90, 233-236, 233-233, 238-241, 238-238, 243-246, 243-243, 248-253), `f` (262-274, 262-262)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-amd
- GitHub Check: build-test-nvidia
- GitHub Check: bot-task
🔇 Additional comments (8)
testing/python/transform/test_tilelang_transform_thread_sync.py (3)
73-91: Consolidation from shared.dyn to shared looks correct in this test.
- Renamed test entrypoint and updated all allocations/buffer scopes from "shared.dyn" to "shared". This aligns the test with the new ThreadSync behavior and the planner’s treatment of shared memory. Expected function changes mirror the new scopes consistently. LGTM.
Also applies to: 95-105
163-164: Barrier-before-LetStmt expectation is precise and valuable.

Adding T.tvm_storage_sync("shared") before the dependent Let chain is the correct synchronization point for this access pattern. This should protect against subtle cross-thread hazards in reduction-like accumulation. LGTM.
192-192: Main guard change is fine.

Switching to tilelang.disable_cache() is harmless for pytest-driven runs and avoids stale cache interactions when invoking directly.
src/transform/thread_storage_sync.cc (5)
64-75: Plan shared.dyn as a single buffer: LGTM.

Redirecting all shared.dyn accesses to a single buffer var ensures the planner sees aliasing correctly and places barriers conservatively where needed. This matches how dynamic shared memory behaves per-kernel.
288-315: Provable-disjointness via split thread vars is a solid improvement.

Substituting tx→{tx1,tx2} (and ty, tz) with correct dtypes and bound bindings improves conflict pruning without sacrificing safety. Good use of Analyzer.CanProve with SymbolicBound.
If you want extra assurance, consider a focused unit test where writes/read indices differ by a provably positive offset across threads (e.g., tx+128 vs tx) to confirm barriers are omitted only when truly disjoint.
352-363: Thread extent tracking is correct and still delegates to the base visitor.

Capturing threadIdx.{x,y,z} before calling TileLangStorageAccessVisitor::VisitStmt_ keeps both analyzer bindings and env thread tracking intact. LGTM.
431-434: Constructor and usage cleanup: partial-sync map removed.

ThreadSyncInserter now only takes syncs; the call site is updated accordingly. This simplifies the inserter and matches the merged design.
Also applies to: 757-759
744-761: TileLangThreadSync now returns a PrimFunc and seeds buffer map: LGTM.
- Switching to a PrimFunc-returning transformer is consistent with other passes.
- Seeding buffer_data_to_buffer from func->buffer_map ensures aliasing for argument buffers is visible to the planner. Good catch.
* [Index] Relocate Int64 Auto Promoter to ConfigBitWidth Pass, removing it from FlattenBuffer (#714)
* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
* Refactor inject_pipeline.cc to enhance pipeline body rewriting and condition handling
  - Introduced a new function to replace IfThenElse nodes with their then_case while preserving attributes.
  - Streamlined the PipelineBodyRewriter to improve buffer access rewriting and async state management.
  - Enhanced the handling of pipeline loop conditions and added support for predicate conditions in the pipeline body.
  - Removed obsolete code and improved overall code clarity and maintainability.
* lint fix
* Refactor return statements in inject_pipeline.cc to remove unnecessary std::move calls
  - Updated return statements in multiple methods to return objects directly instead of using std::move, improving code clarity and potentially avoiding unnecessary moves.
  - Ensured consistent handling of BufferStore and BufferLoad nodes during pipeline transformations.
* test fix
* Enhance global read detection in pipeline planning
  - Updated the handling of global reads to account for condition expressions within IfThenElse nodes, ensuring accurate identification of global memory accesses.
  - Introduced a new flag to track whether the visitor is within a condition expression, improving the correctness of buffer access analysis.
  - Refactored the VisitStmt_ method to properly handle the structure of IfThenElse nodes, enhancing the clarity and maintainability of the code.
* Add IndexLegalizer to enforce int64 for out-of-bound indices
  - Introduced the IndexLegalizer class to ensure that indices in BufferStore and BufferLoad nodes are promoted to int64 when they exceed their type bounds.
  - Refactored the Int64Promoter logic from flatten_buffer.cc into IndexLegalizer, improving code organization and reusability.
  - Updated the ConfigIndexBitwidth pass to apply IndexLegalizer after rewriting the body, enhancing the handling of index bitwidths in transformations.
* [CI] Bind build-test CI to NVIDIA as AMD runners are being introduced (#718)
* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
* Rename build-test job to build-test-nvidia and specify nvidia as a runner label in CI workflow.
* Update CI workflow to specify 'nvidia' as an additional runner label for the format-check job.
* fix: NVRTC backend (#717)
* fix: NVRTC backend
* fix: CI
  Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* [CUDA] Init support for sm_120 (#716)
* Init support for sm120
* fmt
* resolve comments
* unify mma gemm
* fmt
  Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* [CI] fix docs ci (#720)
* [Chore] fix typos (#719)
* chore: fix typos
* chore: fix ruff
* chore: fix clang-format
* [CI][AMD] Add AMD GPU CI and fix some related bugs (#694)
* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
  - Enhanced buffer index handling to address precision issues by removing redundant operations.
  - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
  - Updated related documentation to reflect changes in buffer management practices.
* Remove obsolete test script for AMD example, streamlining the examples directory.
* Remove unused dtype_size variable in AMD example script to streamline code.
* Add input configuration file and update AMD example script for enhanced flexibility
  - Introduced a new input.txt file for configurable parameters.
  - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
  - Streamlined the main function for better clarity and organization.
  - Added a new test script to facilitate running the example with specified parameters.
* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
  - Deleted input.txt and test.sh files as they are no longer needed.
  - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
  - Reintroduced swizzle usage in the kernel for better performance.
* Refactor AMD example script for FlashAttention-2
  - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
  - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
  - Removed outdated comments and improved code organization for better readability.
* Refactor formatting in AMD FlashAttention example script
  - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
  - Streamlined the `main` function parameter formatting for consistency.
  - Removed unnecessary blank lines to enhance overall code organization.
* Update example_amd_flash_attn_fwd.py
* Update AMD FlashAttention example and TVM submodule
  - Added a new example script `example_amd_flash_attn_fwd_k_block.py` for FlashAttention with K-blocking support.
  - Enhanced `example_amd_flash_attn_fwd.py` by expanding configuration options for block sizes and threads.
  - Updated the TVM submodule to the latest commit for improved functionality.
  - Introduced a new test script `test.sh` to facilitate running the new example with specified parameters.
* Add CI workflow for automated format checking and testing
  - Introduced a new GitHub Actions workflow in `amd_ci.yml` to automate format checks and testing for pull requests.
  - The workflow includes steps for setting up a Python environment, running format checks, and executing tests.
  - Removed obsolete example script `example_amd_flash_attn_fwd_k_block.py` and test script `test.sh` to streamline the examples directory.
* Rename CI workflow from "CI" to "AMD CI" for clarity and specificity.
* Update AMD CI workflow to include copying PyTorch, TorchVision, and Torchaudio packages to the virtual environment for improved dependency management.
* Update AMD CI workflow to install pytest directly instead of using requirements-test.txt
* Update AMD CI workflow to remove 'flash-attn' from requirements and install dependencies from requirements-test.txt
* Refactor AMD CI workflow to enhance clarity in removing 'flash-attn' from requirements-test.txt before installation
* Remove Torchaudio package copying from AMD CI workflow to streamline dependency management.
* Refactor AMD CI workflow to remove the format-check job and streamline the build-test process by directly copying PyTorch and TorchVision packages to the virtual environment.
* Add installation of ROCm in AMD CI workflow
  - Included a step to execute the `install_rocm.sh` script for improved setup.
  - Removed unnecessary blank line for better readability in the workflow script.
* Remove installation step for ROCm in AMD CI workflow to simplify the setup process.
* Update AMD CI workflow to run specific test file with verbose output instead of all tests.
* Add new tilelang built-in operations for AMD architecture
  - Introduced `tvm_mfma`, `tvm_mfma_store`, `tvm_rdna_wmma`, and `tvm_rdna_wmma_store` built-in operations to enhance support for matrix multiplication and storage in tilelang.
  - Each operation is configured with the appropriate number of inputs and marked as opaque in terms of call effects.
* Enhance autotuner configurations and GEMM operations in AMD example
  - Updated block sizes and num_split_q parameters in `get_configs` for improved autotuning.
  - Modified `T.gemm` calls in `fast_flashattn` to utilize `GemmWarpPolicy.FullRow`, optimizing performance for matrix multiplications.
* Update autotuner configurations in AMD example for enhanced performance
  - Refined block sizes, thread counts, and added new parameters in `get_configs` to optimize autotuning.
  - Adjusted `fast_flashattn` function to incorporate new parameters for panel size and coalesced widths, improving memory access patterns.
* Enhance autotuner configurations and memory handling in AMD example
  - Expanded block sizes and thread counts in `get_configs` for improved autotuning capabilities.
  - Updated `fast_flashattn` to utilize a new shared memory allocation strategy, optimizing memory access patterns during GEMM operations.
* Refine autotuner configurations and memory usage in AMD example
  - Reduced block sizes and adjusted thread counts in `get_configs` for optimized autotuning.
  - Updated `fast_flashattn` to utilize register fragments for accumulation, minimizing LDS usage and enhancing performance during GEMM operations.
* Update autotuner configurations in AMD example for enhanced performance
  - Expanded block sizes and thread counts in `get_configs` to improve autotuning capabilities.
  - Adjusted `num_split_q` and `v_coalesced_width` parameters for better optimization during GEMM operations.
* Enhance autotuner configurations and GEMM operations in AMD example
  - Expanded thread counts in `get_configs` to include higher values for improved autotuning.
  - Updated `fast_flashattn` to adjust accumulation logic and ensure proper handling of causal conditions, optimizing performance during matrix multiplications.
* Update AMD CI workflow and remove obsolete test script
  - Modified the CI workflow to run on multiple environments: self-hosted, amd, and gpu.
  - Deleted the outdated `test.sh` script from the examples directory, streamlining the project structure.
* Remove TVM subproject from 3rdparty directory
* Refactor configuration generation and accumulation logic in AMD example
  - Reformatted the `get_configs` function for improved readability by aligning parameters.
  - Adjusted the `fast_flashattn` function to enhance clarity in the conditional logic for accumulation, ensuring better handling of causal conditions.
* Enhance AMD CI workflow with additional logging and setup steps
  - Added echo statements to provide feedback during the CI process, indicating when the environment is running on an AMD GPU, copying necessary packages, and installing requirements.
  - Improved clarity in the workflow by explicitly stating when the project is being installed and when tests are being executed.
* Comment out package copying in AMD CI workflow to prevent potential issues during environment setup
* Update AMD CI workflow to install nightly versions of PyTorch and remove obsolete package copying steps
* Enhance BuildTileLangHIP function by adding whitespace for improved readability
* Refactor kTVMGridConstant definition for clarity and remove unnecessary comment
* Update TVM subproject to latest commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
* lint fix
* Update AMD CI workflow to use requirements-rocm.txt for dependency installation
* fix ci
* Remove dependency on format-check from AMD CI workflow
* fix ci
* fix ci
* fix ci
* Remove format-check job from AMD CI workflow
* Add torch to requirements-rocm.txt and remove explicit pip install commands from AMD CI workflow
* Add dependency on format-check job in AMD CI workflow
* Add format-check job to AMD CI workflow
* Update format-check job in AMD CI workflow to run on self-hosted environment
* Enhance format-check job in AMD CI workflow with improved Python environment setup and automatic commit of lint changes
* Update amd_ci.yml
  Co-authored-by: xinxyxiao <xinyxiao@amd.com>
  Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
  Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy (#724)
* [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy
* [Typo] Correct architecture selection for CUDA and CDNA
* [Refactor] Refactor CUDA code generation to simplify eviction policy handling (#721)
* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
* Refactor CUDA code generation to simplify eviction policy handling
  - Updated `VisitExpr_` methods in `codegen_cuda.cc` to use default eviction policy for `tma_load`, `tma_load_im2col`, and `tma_store` functions, reducing complexity.
  - Removed conditional assembly code for `EVICT_NORMAL` in `copy_sm90.h`, streamlining the assembly calls for tensor memory operations.
* lint fix
* [Language] Introduce `StridedTensor` to support non contigious torch inputs (#722)
* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
* Support strided tensors
* Refactor target attribute helper functions for improved clarity
* No code changes made in proxy.py and setup.py
* lint fix
* lint fix via gemini
* lint fix
* test fix
* test fix
* lint fix
* Update wrapper.py
* test fix
* Enhance test for InjectSoftwarePipeline by adding LowerOpaqueBlock transformation and updating expected function signature to use match_buffer for better clarity.
* lint fix
  Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
* [Enhancement][Bugfix] Fix bug in warp specialized pass and add gemm_sr fallback support for Hopper (#712)
* bug fix and support gemm_sr fallback for hopper
* Update gemm.cc
  Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
  Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* 📝 Add docstrings to `fix` (#726) Docstrings generation was requested by @LeiWang1999.
* https://github.com/tile-ai/tilelang/pull/712#issuecomment-3190680851 The following files were modified:
  - `src/op/gemm.cc`
  - `src/tl_templates/cuda/gemm_sm90.h`
  - `src/transform/warp_specialized_rewriter.cc`
  Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* [CI] Fix AMD CI (#729)
* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
  - Enhanced buffer index handling to address precision issues by removing redundant operations.
  - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
  - Updated related documentation to reflect changes in buffer management practices.
* Remove obsolete test script for AMD example, streamlining the examples directory.
* Remove unused dtype_size variable in AMD example script to streamline code.
* Add input configuration file and update AMD example script for enhanced flexibility
  - Introduced a new input.txt file for configurable parameters.
  - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
  - Streamlined the main function for better clarity and organization.
  - Added a new test script to facilitate running the example with specified parameters.
* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
  - Deleted input.txt and test.sh files as they are no longer needed.
  - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
  - Reintroduced swizzle usage in the kernel for better performance.
* Refactor AMD example script for FlashAttention-2
  - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
  - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
  - Removed outdated comments and improved code organization for better readability.
* Refactor formatting in AMD FlashAttention example script
  - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
  - Streamlined the `main` function parameter formatting for consistency.
  - Removed unnecessary blank lines to enhance overall code organization.
* Update example_amd_flash_attn_fwd.py
* Enhance AMD example script and update CI workflows
  - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
  - Added new CI workflows for AMD and documentation publishing.
  - Updated various requirements files to include necessary dependencies.
  - Introduced new test cases and examples for better coverage and functionality.
  - Refactored existing code for improved readability and maintainability.
* Remove redundant tool cache cleanup step in AMD CI workflow
* Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.
  Co-authored-by: xinxyxiao <xinyxiao@amd.com>
  Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
* [Feature] Low-bit twiddling dequantization and FP4 GEMM (#725)
* [Dequant] Add bit-twiddling dequantize cuda for fp4-->bf16
* [Dequant] Add extern call and serial dequantization
* [Dequant] Parallel Dequant wait for fence debug.
* [Scale] Add scale matrix to mxfp4 gemm
* [Remove] Remove fence-buggy example and some generated source cuda code
* [MXFP4] Update initial version of MXFP4 GEMM
* [Scale] Add scale to latest mxfp4 gemm
* [Lint]
* [BugFix] Load Scale, disabe TMA to recover performance
* [Lint]
* [Lint]
* [Scale] Use L2 to hold Scale and enable TMA will slightly boost performance
* [Lint]
* Update example_dequant_gemm_bf16_fp4_hopper_serial.py
* Remove deprecated dequantization examples for BF16 and MXFP4 in the dequantize_gemm directory.
* Refactor dequantization examples for improved readability and consistency.
  Adjusted formatting in matmul function and added spacing for clarity. Updated function signatures and comments for better understanding.
* Refactor index_to_coordinates usage in bitnet example and update dequantization example configurations. Removed the custom index_to_coordinates function and replaced it with the built-in version. Adjusted block_K parameter in dequantization example for consistency.
* lint fix
* ci fix
* Remove non-existent example
* [BugFix] Add smem swizzle to recover performance of TMA
* [BugFix] Enough reg for producer when threads=512
  Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
  Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* 📝 Add docstrings to `mxfp4` (#732)
* 📝 Add docstrings to `mxfp4` Docstrings generation was requested by @LeiWang1999.
* https://github.com/tile-ai/tilelang/pull/725#issuecomment-3191656561 The following files were modified:
  - `examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py`
  - `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py`
  - `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py`
  - `examples/dequantize_gemm/utils.py`
  - `examples/gemm/example_gemm_autotune.py`
  - `tilelang/intrinsics/utils.py`
  - `tilelang/language/__init__.py`
  - `tilelang/language/utils.py`
  - `tilelang/quantize/mxfp.py`
  - `tilelang/quantize/quantization.py`
* [Lint] More accurate docstring
* [Lint]
  Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
  Co-authored-by: tzj-fxz <tzjfxz@gmail.com>
* [Refactor] Refactor env into a more flexible version (#740)
* Fix environment variable name for compilation print setting in `env.py`
* Remove deprecated test file for warp specialized pass configuration and refactor environment variable access in `env.py` to utilize a centralized `EnvVar` class for better management and clarity.
* lint fix * Refactor cache check to use `env.is_cache_enabled()` for consistency in `tuner.py` * [Enhancement] Add stride index validation in CythonKernelWrapper (#743) * Introduced an assertion to ensure that the stride index is within the valid range of tensor dimensions in `cython_wrapper.pyx`. * This change prevents potential out-of-bounds errors when accessing tensor dimensions, enhancing the robustness of the code. * [Bugfix]:Fix atomic add auto vectorize memory access out of bound error (#742) * [Bugfix]:Fix atomic add auto vectorize memory access out of bound error * Update atomicadd_vectorize.cc * format * 📝 Add docstrings to PR #744 (#745) * 📝 Add docstrings to `main` Docstrings generation was requested by @LeiWang1999. * https://github.com/tile-ai/tilelang/pull/742#issuecomment-3205103559 The following files were modified: * `src/transform/atomicadd_vectorize.cc` * lint fix --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Refactor] Refactor barrier management (#744) * Introduce Barrier * Enhance CUDA kernel with new barrier management and post-processing support - Added a new CUDA kernel implementation in `example_mla_decode.py` for improved performance with shared memory barriers. - Refactored barrier handling in `codegen_cuda.cc` and `codegen_hip.cc` to utilize a more flexible mbarrier structure. - Updated intrinsic definitions from `ptx_stmatirx` to `ptx_stmatrix` across multiple files for consistency. - Introduced additional print statements for debugging in the lowering phase of the TileLang engine. - Enhanced the overall structure and readability of the codebase. * Remove unused barrier handling code in CUDA and HIP code generators to streamline the implementation. This change enhances code clarity and reduces complexity in the barrier management logic. 
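The stride-index assertion added to `CythonKernelWrapper` (#743) amounts to a bounds check before indexing into a tensor's stride tuple. A hedged Python sketch of the idea (the real check lives in `cython_wrapper.pyx`; the function name here is hypothetical):

```python
def validate_stride_index(ndim, stride_index):
    """Guard against out-of-bounds access into a tensor's strides:
    a valid stride index must name one of the tensor's dimensions."""
    assert 0 <= stride_index < ndim, (
        f"stride index {stride_index} out of range for a {ndim}-d tensor")
    return stride_index
```

Failing fast here turns a silent out-of-bounds read into a clear error at the wrapper boundary.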
* Enhance barrier management in TileLang - Introduced a new intrinsic `allocate_barrier` for dynamic barrier allocation in the TileLang framework. - Updated CUDA code generation to support the new barrier structure, allowing for improved synchronization in shared memory. - Refactored existing barrier handling logic to accommodate the new intrinsic and streamline code. - Added print statements for debugging purposes in various examples and the lowering phase of the TileLang engine. - Removed deprecated memory scope handling code to enhance clarity and maintainability. * lint fix * lint fix * Remove `allocate_barrier` intrinsic and related code from TileLang to streamline barrier management. This includes updates to CUDA code generation and the removal of associated Python wrappers, enhancing code clarity and maintainability. * Refactor logging in JITKernel to improve kernel compilation tracking - Removed unused import of `torch.backends` in the example file. - Introduced logging for kernel compilation in `JITKernel`, replacing print statements with structured logging for better traceability and debugging. - Added an assertion to ensure the presence of the `global_symbol` attribute in the kernel function. * Refactor dequantization tests and update barrier function - Removed the test for `example_dequant_gemm_bf16_fp4_hopper_serial` to streamline the testing suite. - Updated the `mbarrier_cp_async_arrive` function to support both pointer and non-pointer types, enhancing flexibility in barrier management. * Update CI configuration to increase pytest parallelism from 4 to 8 threads for improved test execution speed. * Fix typos in rasterization parameters and update import path for cached module - Corrected the spelling of `enable_rasteration` to `enable_rasterization` in the matmul function and its usage. - Updated the import statement for the `cached` module to reflect the new path in the cache submodule. 
- Added `StridedTensor` import in the language module for enhanced tensor functionality. * Update ci.yml * [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy (#746) * [Refactor] Merge bulk copy into copy and refactor layout inference for bulk copy * Deleted the `bulk_copy` operator implementation and its header file as it is no longer needed. * Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping. * Updated related files to reflect these changes, ensuring that the codebase remains clean and maintainable. * lint fix * Fix typos in intrinsic names and remove unused print statement in block_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency. * remove bulk copy * Refactor copy and atomic add operations to support TMA lower configuration - Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations. - Modified `Lower` method in `Copy` to incorporate the new TMA configuration. - Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic. - Removed unused `disable_tma_lower` field from `LowerArgs` structure for clarity. - Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach. * Enhance TMA bulk copy logic in `LowerBulkCopy` method - Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling. - Updated warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities. * lint fix * Remove fallback logging for non-swizzled global layout in `LowerBulkCopy` method to streamline the bulk copy logic. 
This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions. * Enhance reshape kernel compilation in `run_reshape` and `run_reshape_smem_1d_2_2d` functions - Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations. - Added TODO comments to indicate the need for further improvements in shared memory handling. * Update `native_sparse_attention` function to include TMA configuration options - Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations. - Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1. * Refactor JIT decorator formatting in `native_sparse_attention` function - Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase. - No functional changes were made; this update focuses on code clarity and maintainability. * Enhance thread management and logging in TileLang compilation - Added a method to check if printing is enabled during compilation, improving control over logging behavior. - Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output. - Added comments to clarify the purpose of changes and improve code readability. * Add warp specialization scope and refactor register management in TileLang - Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management. - Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process. - Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management. 
- Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process. * Refactor test for InjectSetMaxNReg pass in TileLang - Improved readability by restructuring conditional checks and assertions in the test cases. - Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic. - Ensured consistent formatting and spacing throughout the test functions for better maintainability. * Enhance bulk copy and store checks in `Copy` class - Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options. - Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations. - Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging. * lint fix * [Refactor] Merge ThreadPartialSync and ThreadStorageSync (#741) * Remove `thread_partial_sync.cc` and refactor `thread_storage_sync.cc` to streamline synchronization handling. Introduce `thread_sync_types.h` for thread-bound key definitions and reserved named barriers. Update related logic in `ThreadSyncInserter` and `TileLangThreadSync` for improved clarity and efficiency. * Remove `sync_thread_partial` references and related documentation from the codebase. Update CUDA and HIP code generation files to eliminate calls to the removed function. Refactor `__sync_thread_partial` to `sync_thread_partial` in CUDA common header for consistency. * Remove unused import of `bulk_copy.h` in `codegen_hip.cc` to enhance code clarity and maintainability. * Add import of `bulk_copy.h` in `codegen_hip.cc` to support new functionality. 
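The merged ThreadStorageSync pass keys partial synchronization to reserved named barriers rather than a separate `sync_thread_partial` lowering. A toy Python sketch of the bookkeeping such a pass might perform (the reserved-barrier count and key shape are assumptions for illustration, not the actual contents of `thread_sync_types.h`):

```python
class NamedBarrierAllocator:
    """Hand out hardware named-barrier ids, keeping the first few
    reserved for runtime-internal synchronization."""

    def __init__(self, num_reserved=2, num_barriers=16):
        self.num_barriers = num_barriers
        self.next_free = num_reserved  # ids below this are reserved
        self.by_key = {}               # sync key -> barrier id

    def barrier_for(self, thread_extent):
        # Reuse one barrier per distinct participating-thread count,
        # so repeated partial syncs of the same group share an id.
        key = (thread_extent,)
        if key not in self.by_key:
            assert self.next_free < self.num_barriers, "out of named barriers"
            self.by_key[key] = self.next_free
            self.next_free += 1
        return self.by_key[key]
```

Centralizing the ids this way is what lets one pass serve both full-block and partial syncs without the two colliding on the same barrier.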
* typo fix * Update data type in reduce_sum tests from float16 to float32 for consistency and clarity. Remove redundant dtype tests and streamline run functions. Enhance reshape kernel compilation with pass configurations to address shared memory layout issues. * lint fix * test fix * Enhance CI configuration by adding verbose output to pip install command for better visibility during installation. * use ninja instead of make * Add CMake configuration step for Ninja build system in setup.py * Update pyproject.toml to include additional build dependencies: build, torch, tox, auditwheel, patchelf, and ninja. * Enhance CI configuration by adding verbose output to pytest commands for improved test visibility. * Update pyproject.toml to add Cython as a build dependency. Enhance thread storage synchronization in thread_storage_sync.cc by introducing new thread variable handling and improving index disjointness checks. * Update data type in cumulative sum tests from float16 to float32 for consistency. Modify run_cumsum function to utilize the updated dtype and enhance result validation with assertions. Adjust test cases accordingly. * Refactor storage access handling by introducing buffer data mapping in TileLangStorageAccessVisitor. Enhance access entry structure to include pointer access flag. Update thread storage synchronization to accommodate new buffer data mappings. Adjust quickstart example to print kernel source for debugging purposes. * Refactor linear index conversion in TileLangStorageAccessVisitor to utilize the analyzer for simplification. Update buffer index calculations to ensure consistent simplification of range expressions. * bugfix * Refactor buffer index calculation in TileLangStorageAccessVisitor to simplify access handling. Removed unused buffer mapping logic, ensuring consistent buffer index generation with a default ramp. * Refactor TileLangStorageAccessVisitor to replace buffer indices with buffer ranges for improved pointer access handling. 
Update AccessEntry structure to include buffer_ranges and adjust thread storage synchronization logic to account for pointer access conflicts. * Refactor thread storage synchronization to replace 'shared.dyn' with 'shared' for consistency in memory allocation. Update related test cases to reflect this change and ensure proper functionality. * [Enhancement] Optimize loop body handling in IR (#749) - Updated the loop body construction in `ir.cc` to conditionally include an output statement based on the analyzable condition of the `waves` variable. - This change enhances performance by avoiding unnecessary statement wrapping when the condition is met, improving the efficiency of loop execution. Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [MXFP4] Fix bugs and optimize exponential operation (#750) * [MXFP4] Fix bugs - Optimize exp2 with shift operation to boost performance - Fix bug of simple dequantization function call - Fix bug of scaling factor with bias * [Lint] --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h (#751) - Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations. - Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations. * [Enhancement] Add shape checking for reduce options (#748) * Add shape checking for reduce options * lint fix * Handle special case reducing into shape-1 tensor Allow reducing [X, d, Y] into [X, Y] or [X, 1, Y] --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Bugfix] Add missing FP8 header include (#752) * [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h - Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations. 
- Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations. Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Enhancement] Include cuda_fp8.h in gemm_sm90.h - Added the inclusion of the "cuda_fp8.h" header file to support new data formats in CUDA GEMM operations, enhancing compatibility with recent updates for fp8 types. Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * lint fix * [Refactor] Remove unused tl_shuffle_elect and related functions from common.h - Deleted the `tl_shuffle_elect` function and its associated comments to streamline the codebase. - Added inclusion of "intrin.h" for improved intrinsic support in CUDA operations. - Cleaned up the file by removing unnecessary template parameters and functions, enhancing clarity and maintainability. * lint fix * [Refactor] Update header inclusions in common.h and gemm_sm90.h - Removed the inclusion of "intrin.h" from common.h to streamline dependencies. - Added "intrin.h" inclusion in gemm_sm90.h to ensure intrinsic support for CUDA operations, enhancing functionality and maintainability. * bug fix * [MXFP4] Add bias to MXFP4 GEMM kernel (#753) * [MXFP4] Add bias to gemm kernel * [Lint] * [Lint] Rename "bias" to "Bias" * [Bugfix][WS] Consider loop min extent when computing phase id (#754) * Update test parameters and remove debug print statement - Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution. - Removed a debug print statement from `phase.py` to clean up the code and enhance clarity. * Refactor loop stack management in warp_specialized_rewriter - Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability. - Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability. 
- Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations. * [Typo] Remove `disable_cache` in some tests (#755) * Update test parameters and remove debug print statement - Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution. - Removed a debug print statement from `phase.py` to clean up the code and enhance clarity. * Refactor loop stack management in warp_specialized_rewriter - Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability. - Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability. - Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations. * Remove unused `torch.backends` import and `tilelang.disable_cache()` calls from multiple test files to enhance code clarity and maintainability. * [README] Update GDN README for clarity and add acknowledgements (#758) - Improved formatting and clarity of the GDN kernel implementation description. - Updated requirement section to list dependencies in a clearer format. - Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions. 
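The phase-id fix records each loop's `min` alongside its `extent` because linearizing a loop nest must offset every loop variable by its lower bound; dropping the `min` skews the phase for loops that do not start at zero. A small sketch under those assumptions:

```python
def linearize(loop_stack):
    """Compute a flat iteration index from a stack of
    (value, min, extent) loops, outermost first."""
    linear = 0
    for value, lo, extent in loop_stack:
        assert lo <= value < lo + extent
        # Offset by the loop's lower bound before mixing in.
        linear = linear * extent + (value - lo)
    return linear

# Outer loop i in [2, 6) at i=3, inner loop j in [1, 4) at j=1.
print(linearize([(3, 2, 4), (1, 1, 3)]))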
* cutlass v4.2.0 supporting cuda 13 (#760) * [Feature] Add 1D TMA support (#761) * [Feature] Add 1D TMA support - Check the contiguous conditions of 1D TMA copy - Add new interface and params order of `tma_load` and `tma_store` call - Add 1D `tma_store` interface in sm90 template - Add elementwise kernel for 1D TMA example * [Lint] * [BugFix] Add conditions for 1D TMA copy on non-swizzle shared tensors * [Lint] * [BugFix] 1D TMA load * [README] Update GDN README for clarity and add acknowledgements (#758) - Improved formatting and clarity of the GDN kernel implementation description. - Updated requirement section to list dependencies in a clearer format. - Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions. * cutlass v4.2.0 supporting cuda 13 (#760) * [Lint] * [Lint] * [MXFP4] Add test for bf16&mxfp4 gemm * [BugFix] * [Lint] --------- Co-authored-by: Yu Cheng <54519279+chengyupku@users.noreply.github.com> Co-authored-by: Johnny <johnnync13@gmail.com> * [Example] Add vertical slash sparse attention pattern (#762) * upd sparse attn * lint * rename * update test file * update benchmark * lint * update benchmark * [Bugfix] Address PassContext contamination from CI and fix incorrect rewrites in warp specialized pass (#767) * fix ci and pass bug * fix * try * lint * [MXFP4] Add 1D TMA copy for Scale tensor in MXFP4 GEMM (#766) * [TMA] Add 1D TMA copy for Scale tensor * [Lint] * [Test] Add test for kernel * [BugFix] * hot fix blackwell (#768) * [Refactor] Refactor `Operator` into `TileOperator` and with tvm reflection (#763) * Refactor operator classes to inherit from TileOperator and update layout inference methods - Changed base class of several operator classes (AtomicAdd, Copy, Gemm, etc.) from Operator to TileOperator for better alignment with tile operations. - Updated InferLayout and Lower methods to use 'override' specifier for clarity and consistency. 
- Adjusted header inclusions to replace "op.h" with "operator.h" across multiple files for improved organization. - Added missing layout inference implementations for Fill and Conv2DIm2ColOp. - Removed deprecated op.cc and op.h files to streamline the codebase. * lint fix * Refactor operator classes to use Node pattern and improve memory management - Updated several operator classes (AtomicAdd, Copy, Gemm, etc.) to utilize the Node pattern for better memory management and encapsulation. - Changed constructors to initialize member variables through a node object, enhancing clarity and reducing direct member access. - Updated Clone methods to return TileOperator instances instead of unique pointers, aligning with the new design. - Refactored InferLayout and Lower methods to ensure consistency across operator implementations. - Adjusted header files to reflect the new class structure and removed deprecated code for a cleaner codebase. * Enhance Clone methods in AtomicAdd and Copy classes to support parallel operation cloning - Updated the Clone methods in AtomicAddNode and CopyNode to ensure that the parallel operation (par_op_) is properly cloned when defined, improving the integrity of cloned objects. - Refactored the FillNode class to use ParallelOp directly instead of std::make_unique, streamlining the creation of parallel operations. - Made minor adjustments in layout inference and other related methods for consistency and clarity. * Refactor FillNode::Lower method to remove unused global function call - Eliminated the call to the global function "tl.fill.lower" in the FillNode::Lower method, streamlining the code and improving clarity. - Retained the core functionality of the method while enhancing maintainability by reducing unnecessary dependencies. 
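```python
def is_dense_row_major(shape, strides):
    """True when (shape, strides) describes one contiguous linear
    span, i.e. each stride equals the product of the inner extents."""
    expected = 1
    for extent, stride in zip(reversed(shape), reversed(strides)):
        if stride != expected:
            return False
        expected *= extent
    return True
```

The "contiguous conditions" gating the 1D TMA copy above (#761) reduce to this kind of density check: only a region that is a single linear segment in memory can be expressed as a one-dimensional bulk transfer. This is a hedged sketch of the predicate's shape, not the actual check in `copy.cc`, which also accounts for swizzled shared layouts.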
* [Reducer] Introduce `alloc_reducer` to separate inter and intra warp reduction (#757) * [Enhancement] Introduce finalize_reducer operator and layout reducer support - Added `FinalizeReducer` operator to handle reduction finalization in the TileLang framework, allowing for efficient reduction operations. - Implemented layout inference for local.reducer buffers, enhancing the handling of layout mappings and reducing complexity in buffer management. - Updated `setup.py` to include logging for build directory paths, improving build process visibility. - Enhanced atomic operations with new functions for atomic max, min, load, and store, providing more robust atomicity control in memory operations. - Refactored parallel loop handling to incorporate reducer information, ensuring proper management of reduction operations in parallel contexts. - Cleaned up test cases by removing unnecessary cache disabling and optimizing test parameters for better performance. * Refactor code formatting and improve readability in multiple files - Cleaned up whitespace in `setup.py` to enhance logging clarity. - Reformatted `AtomicMax` and `AtomicMin` functions in `common.h` for better alignment and readability. - Adjusted `debug_print_var` function in `debug.h` to improve code structure and maintainability. - Enhanced readability of the `atomic_add` function in `customize.py` by breaking long lines for better clarity. * Remove debug print statements from `copy.cc` and `inject_tma_barrier.cc` to enhance code clarity and maintainability. * [Enhancement] Disable reuse of small arrays in shared memory allocation - Added logic to prevent the reuse of small arrays (<= 32 bits) in `merge_shared_memory_allocations.cc`, ensuring they are lowered to registers in LLVM for improved performance and memory management. * Refactor `setup.py` to remove duplicate logging statements and enhance clarity. 
Update `finalize_reducer` function documentation in `reduce.py` to include detailed parameter and return descriptions, improving code readability and maintainability. * Refactor `finalize_reducer` and `reduce` functions to remove redundant target checks. Simplified conditionals by retaining only the `TargetIsHopper` check, enhancing code clarity and maintainability. * bug fix * Add thread checks workaround for replicated cases * Remove the is_one check * fix lint error * lint fix * Update autotune tests to use smaller matrix sizes for improved performance and reliability * [Refactor] Update FinalizeReducer to FinalizeReducerOp and adjust related methods - Refactored FinalizeReducer class to FinalizeReducerOp, updating constructor and method signatures for consistency with the new TileOperator structure. - Enhanced layout inference and cloning methods in FinalizeReducerOpNode. - Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main. - Adjusted header inclusions for improved organization and clarity across multiple files. * [Refactor] Update atomic operations in common.h and modify test_example_flash_attention.py - Enhanced atomic operations (Add, Min, Max) in common.h to handle half and bfloat16 types more efficiently. - Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main, improving test organization. * [Refactor] Simplify CopyNode::LowerBulkCopy logic and update test execution - Removed redundant checks for contiguous memory access in CopyNode::LowerBulkCopy, streamlining the logic for TMA copy operations. - Updated test_tilelang_kernel_gemm.py to comment out the main testing function and call a specific test for i8i8i32 tensor operations instead, improving test focus. 
--------- Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com> Co-authored-by: Freebase6912 <amid-gauze-racing@duck.com> * 📝 Add docstrings to `pytile_0826` (#770) * 📝 Add docstrings to `pytile_0826` Docstrings generation was requested by @LeiWang1999. * https://github.com/tile-ai/tilelang/pull/763#issuecomment-3224197814 The following files were modified: * `src/op/atomic_add.cc` * `src/op/atomic_add.h` * `src/op/copy.cc` * `src/op/copy.h` * `src/op/elem.cc` * `src/op/elem.h` * `src/op/gemm.cc` * `src/op/gemm.h` * `src/op/gemm_sp.cc` * `src/op/gemm_sp.h` * `src/op/operator.cc` * `src/op/operator.h` * `src/op/parallel.cc` * `src/op/parallel.h` * `src/op/reduce.cc` * `src/op/reduce.h` * `src/op/region.cc` * `src/op/region.h` * `src/transform/layout_inference.cc` * `src/transform/lower_tile_op.cc` * lint fix --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Bugfix]:Fix atomic add auto vectorize negative optimization (#765) * [Bugfix]:Fix atomic add auto vectorize negative optimization * fixbug * format * fix bug * 📝 Add docstrings to `reducer_0825` (#772) * 📝 Add docstrings to `reducer_0825` Docstrings generation was requested by @LeiWang1999. 
* https://github.com/tile-ai/tilelang/pull/757#issuecomment-3219088118 The following files were modified: * `setup.py` * `src/op/builtin.h` * `src/op/finalize_reducer.cc` * `src/op/finalize_reducer.h` * `src/op/parallel.cc` * `src/op/parallel.h` * `src/op/reduce.cc` * `src/target/codegen_cuda.cc` * `src/tl_templates/cuda/common.h` * `src/transform/layout_inference.cc` * `src/transform/layout_reducer.cc` * `src/transform/layout_reducer.h` * `src/transform/merge_shared_memory_allocations.cc` * `src/transform/storage_access.cc` * `src/transform/warp_specialized_rewriter.cc` * `testing/python/autotune/test_tilelang_autotune_with_inputs.py` * `tilelang/engine/phase.py` * `tilelang/language/customize.py` * `tilelang/language/reduce.py` * `tilelang/transform/__init__.py` * lint fix * lint fix --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * Allow fill global buffer (#774) * Allow fill global buffer * fix lint error * [BugFix] Refactor the op check in LowerTileOp pass using the member function instead of string match (#771) * [BugFix] Refactor the op check in LowerTileOp pass using the member function instead of string match * [Lint] * add bf16 exp fallback (#776) * [Lint] Introduce clang-tidy into format.sh (#777) * [Refactor] Update Clang-Tidy Checks and Improve Code Consistency - Enhanced .clang-tidy configuration by adding specific checks for better bug detection and performance optimization. - Refactored function signatures across multiple files to use `const` references for parameters, improving performance and code clarity. - Updated various methods to ensure consistent handling of parameters, particularly in `AddPredicate`, `Substitute`, and `PlanLoopPartition` functions. - Improved readability by replacing size checks with `empty()` method calls in several locations, ensuring clearer intent in the code. 
- General code cleanup and adherence to best practices for better maintainability. * [Refactor] Enhance Code Consistency and Clang-Tidy Configuration - Updated .clang-tidy configuration to include additional checks for improved code quality and performance. - Refactored function signatures across multiple files to use `const` references, enhancing performance and clarity. - Replaced size checks with `empty()` method calls in various locations for clearer intent. - Improved handling of parameters in several functions, ensuring consistent usage of `std::move` where applicable. - General code cleanup to adhere to best practices and improve maintainability. * [Refactor] Integrate Clang-Tidy Checks and Enhance Code Consistency - Added clang-tidy checks to the format script for improved code quality assurance. - Refactored function signatures across multiple files to consistently use `const` references, enhancing performance and clarity. - Updated the requirements-lint.txt file to include clang-tidy as a dependency. - General code cleanup to adhere to best practices and improve maintainability. * [CI] Update AMD CI Workflow to Include Build Directory Creation - Added steps to create a build directory and configure CMake with ROCm support during the format check process. - Ensured cleanup of the build directory after the format check to maintain a clean workspace. * [Refactor] Remove Unused Member Variables in AtomicAddNode and CopyNode - Removed the `args_` member variable from both `AtomicAddNode` and `CopyNode` classes to streamline the code and eliminate unnecessary data members. - This change enhances code clarity and maintainability by focusing on relevant attributes for each class. * [Refactor] Update Clang-Tidy Integration and Code Improvements - Modified the format script to include the `-fix` option in the clang-tidy command for automatic code fixes. 
- Refactored the `AtomicAddVectorizePlanner` class to improve variable handling and consistency, including changes to member variable types and function signatures. - Enhanced code clarity by removing unnecessary `std::move` calls and ensuring consistent usage of types across the class. - General code cleanup to adhere to best practices and improve maintainability. * [Refactor] Improve Parameter Handling and Consistency in AtomicAddVectorize - Updated function signatures in `AtomicAddVectorizePlanResult` and `AtomicAddVectorizeRewriter` to use `const` references and `std::move` for better performance and clarity. - Enhanced the `UpdateVectorSize` method to accept `const Array<PrimExpr>&` for improved efficiency. - General code cleanup to maintain consistency and adhere to best practices. * [CI] Add Git Submodule Initialization to CI Workflow - Included a step to initialize and update git submodules recursively in the CI workflow. - This change ensures that all necessary submodules are available during the format check process, improving build reliability. * [CI] Add Git Submodule Update Step to Format Check - Included a command to initialize and update git submodules recursively in the CI workflow during the format check process. - This enhancement ensures that all required submodules are available, contributing to improved build reliability. * [Refactor] Update Function Signatures in AtomicAddVectorize - Modified the `VectorizeAtomicAdd` function signature to use `const` references for `thread_var` and `thread_bounds`, enhancing performance and code clarity. - This change aligns with previous refactoring efforts to improve parameter handling and consistency across the codebase. 
* [Cache] Introduce detailed target information for the disk kernel cache (#780) * Fix type hint for target_host parameter in compile function to allow None value * Refactor target handling in compile function to utilize determine_target for improved clarity and consistency * Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code. * Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity. * Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling. * Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits. * [Example]Adds example for top-k operation (#775) * [Example]Adds example for top-k operation Adds an example demonstrating the top-k operation using tilelang * format * Adds topk tilelang example test * fix lint * [Math] Dispatch `T.rsqrt(x)` into cuda intrin instead of `1 / T.sqrt(x)` (#781) * Fix type hint for target_host parameter in compile function to allow None value * Refactor target handling in compile function to utilize determine_target for improved clarity and consistency * Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code. 
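Hexfloat output for the low-precision constants (with a scientific-notation comment alongside for readability) is attractive because hexadecimal float literals round-trip bit-exactly, unlike shortened decimal renderings. Python's `float.hex` illustrates the property:

```python
x = 0.1
lit = x.hex()  # hex literal, e.g. '0x1.999999999999ap-4'
# The hex form encodes the exact bit pattern, so parsing it back
# reconstructs the identical float value.
assert float.fromhex(lit) == x
print(lit)
```

The generated CUDA uses the same idea: the hexfloat is authoritative, and the decimal comment is only for human readers.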
  - Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity.
  - Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling.
  - Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits.
  - Add intrin_rule source files to CMakeLists.txt and implement hrsqrt function for half_t in common.h
  - lint fix
  - remove cmake dep in pyproject as it may lead to different cmake paths in diff stages
  - lint fix
  - Add cmake dependency to pyproject.toml and improve build logging in setup.py
* [CI] Adds pytest-durations for test timing (#782)
  - Adds `pytest-durations` to the test requirements and configures pytest to display test durations. This helps in identifying slow-running tests and optimizing the test suite for faster feedback.
  - add amd ci durations
  - Removes flash_attn installation from CI
* [Refactor] Support python reflection for tile operators (#783)
  - Implement Fill operator and related reflection methods in TileLang
    - Added Fill operator implementation in `fill.cc` and `fill.h` for element-wise filling of buffers.
    - Introduced reflection methods for Fill, AtomicAdd, Copy, Conv2DIm2Col, FinalizeReducer, Gemm, and Parallel operators to enhance introspection capabilities.
    - Updated relevant files to register reflection methods and ensure proper initialization in static blocks.
    - Removed outdated comments and unnecessary code in various operator files to improve clarity and maintainability.
    - Added new Python bindings for the Fill operator in `tilelang/ir/fill.py` and updated the module imports accordingly.
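The rsqrt change above swaps arithmetic expansion for a target intrinsic. A hedged Python sketch of that dispatch idea — the table contents are illustrative, not the pass's real API; `hrsqrt` is the half-precision helper named in the log, and `rsqrtf` is an assumed fp32 CUDA math function:

```python
# Illustrative dispatch table: lower an IR op straight to a target
# intrinsic when one exists, instead of expanding to `1 / sqrt(x)`.
INTRIN_TABLE = {
    ("cuda", "rsqrt", "float16"): "hrsqrt",  # added in common.h per the log
    ("cuda", "rsqrt", "float32"): "rsqrtf",  # assumed CUDA math function
}

def lower_rsqrt(target: str, dtype: str, arg: str) -> str:
    intrin = INTRIN_TABLE.get((target, "rsqrt", dtype))
    if intrin is not None:
        return f"{intrin}({arg})"
    # Generic fallback keeps the old `1 / sqrt(x)` behaviour.
    return f"(1 / sqrt({arg}))"

print(lower_rsqrt("cuda", "float32", "x"))  # rsqrtf(x)
print(lower_rsqrt("c", "float32", "x"))     # (1 / sqrt(x))
```

Dispatching to a single hardware instruction avoids the extra division and is typically both faster and more accurate than the composed form.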
  - Refactor operator reflection methods and improve code clarity
    - Updated reflection methods for AtomicAdd, Copy, FinalizeReducer, Gemm, and Parallel operators to enhance readability by using `empty()` instead of size checks.
    - Consolidated static initialization blocks for various operators to a single line for improved consistency.
    - Cleaned up whitespace and formatting in multiple files to adhere to coding standards and improve maintainability.
    - Added new Python bindings for operators in the `tilelang/ir` module, ensuring proper registration and organization of imports.
  - Refactor GEMM and AtomicAdd operations for improved clarity
    - Updated the `GetArchInt` function in `atomic_add.cc` to use `std::string` and `std::stoi` for better readability and type safety.
    - Removed unnecessary variables and comments in `gemm_sp.cc` and `gemm.cc` to streamline the `ComputeWarpPartition` method.
    - Cleaned up the `layout_reducer.cc` file by removing unused variable declarations, enhancing code clarity.
    - Added import for the `ir` module in `tilelang/__init__.py` to ensure proper organization of module imports.
  - Remove deprecated operator files from the tilelang IR module
    - Deleted files for Fill, AtomicAdd, Copy, Gemm, GemmSP, FinalizeReducer, Parallel, Reduce, and Region operators to streamline the codebase.
    - This cleanup enhances maintainability by removing unused code and improving overall organization of the module.
  - Refactor imports in tilelang IR module for improved organization
    - Updated import statements in `tilelang/ir.py` to reflect changes in the TVM library structure, enhancing clarity and maintainability of the codebase.
  - lint fix
  - Refactor GEMM and GEMM-SP operations to enhance clarity and maintainability
    - Updated the `Gemm` and `GemmSP` classes to utilize a new `GemmWarpPolicy` object for warp partitioning, improving encapsulation and readability.
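The reflection work above exposes each operator's fields to Python tooling. A minimal sketch of what "reflection" buys — declare fields once, then introspect any op generically. Class and field names here are hypothetical, not the actual tilelang API:

```python
# Minimal sketch: an op declares its fields once, and generic tooling
# can introspect or serialize any op without per-class glue code.
class FillOp:
    FIELDS = ("dst", "value")

    def __init__(self, dst, value):
        self.dst, self.value = dst, value

def reflect(op):
    # Works for any class following the FIELDS convention.
    return {name: getattr(op, name) for name in type(op).FIELDS}

print(reflect(FillOp("C_shared", 0.0)))  # {'dst': 'C_shared', 'value': 0.0}
```

In the real codebase this role is played by TVM's reflection registry, which is what the registered reflection methods hook into.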
    - Removed deprecated `ComputeWarpPartition` methods and replaced them with calls to the new policy object, streamlining the code.
    - Cleaned up comments and unnecessary code in `gemm.cc`, `gemm_sp.cc`, and related header files to enhance overall clarity.
    - Introduced a new `GemmWarpPolicyNode` class to manage warp policy attributes and methods, facilitating better organization of related functionalities.
    - Updated reflection methods to include the new policy structure, ensuring proper registration and introspection capabilities.
  - Refactor Reduce operation to utilize ReduceType class for improved clarity and maintainability
    - Replaced multiple conditional checks for reduce types with a single ReduceType object, simplifying the code structure.
    - Introduced a new ReduceTypeNode class to encapsulate reduce type logic and methods, enhancing organization.
    - Updated MakeInitValue, MakeReduce, and Lower methods to leverage the new ReduceType class, improving readability.
    - Added Python bindings for the ReduceType class in tilelang IR module to ensure proper registration and usability.
  - comment
  - Refactor operator header files for improved readability
    - Cleaned up formatting and whitespace in `atomic_add.h`, `copy.h`, `fill.h`, `reduce.cc`, and `reduce.h` to enhance code clarity.
    - Consolidated comments and adjusted line breaks for better organization and maintainability across multiple operator definitions.
  - Refactor MakeReduce method in ReduceOpNode for clarity
    - Updated the parameter name in the MakeReduce method from `rhs` to `b` and assigned it to `rhs` for improved readability.
    - This change enhances the clarity of the method's purpose and aligns with the overall refactoring efforts in the Reduce operation.
  - Update Reduce operation type checks for consistency
    - Changed string comparisons for reduce types in the MakeReduce method from "abs_sum" to "abssum" and "abs_max" to "absmax" for uniformity.
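The ReduceType refactor and the "abs_sum" → "abssum" normalization above can be mimicked in a few lines. This is a hedged sketch of the pattern, not the C++ `ReduceTypeNode` API:

```python
class ReduceType:
    # Canonical spellings; underscore variants are normalized on entry.
    CANONICAL = {"sum", "max", "min", "abssum", "absmax"}

    def __init__(self, name: str):
        name = name.replace("_", "").lower()
        if name not in self.CANONICAL:
            raise ValueError(f"unknown reduce type: {name}")
        self.name = name

    def init_value(self) -> float:
        # Identity element each reduction starts from.
        return {"sum": 0.0, "abssum": 0.0, "absmax": 0.0,
                "max": float("-inf"), "min": float("inf")}[self.name]

print(ReduceType("abs_sum").name)      # abssum
print(ReduceType("max").init_value())  # -inf
```

Centralizing the spelling and the identity element in one object is what lets `MakeInitValue`/`MakeReduce` drop their scattered string comparisons.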
    - This adjustment enhances the clarity and consistency of the reduce type handling in the codebase.
* [AMD] Fix amd tir & add examples (#784)
  - [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
    - Enhanced buffer index handling to address precision issues by removing redundant operations.
    - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
    - Updated related documentation to reflect changes in buffer management practices.
  - Remove obsolete test script for AMD example, streamlining the examples directory.
  - Remove unused dtype_size variable in AMD example script to streamline code.
  - Add input configuration file and update AMD example script for enhanced flexibility
    - Introduced a new input.txt file for configurable parameters.
    - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
    - Streamlined the main function for better clarity and organization.
    - Added a new test script to facilitate running the example with specified parameters.
  - Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
    - Deleted input.txt and test.sh files as they are no longer needed.
    - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
    - Reintroduced swizzle usage in the kernel for better performance.
  - Refactor AMD example script for FlashAttention-2
    - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
    - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
    - Removed outdated comments and improved code organization for better readability.
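The conflict detection mentioned in the buffer-index commit reduces to a half-open interval test. A small sketch of that check — illustrative only, not the pass's actual code:

```python
def ranges_overlap(a_start: int, a_extent: int,
                   b_start: int, b_extent: int) -> bool:
    # Half-open ranges [start, start + extent) conflict iff neither
    # ends at or before the point where the other begins.
    return a_start < b_start + b_extent and b_start < a_start + a_extent

print(ranges_overlap(0, 4, 2, 4))  # True  (elements 2..3 are shared)
print(ranges_overlap(0, 4, 4, 2))  # False (adjacent, no overlap)
```

Getting the strict-vs-inclusive comparisons right at the boundary is exactly where imprecise index simplification causes spurious (or missed) conflicts.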
  - Refactor formatting in AMD FlashAttention example script
    - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
    - Streamlined the `main` function parameter formatting for consistency.
    - Removed unnecessary blank lines to enhance overall code organization.
  - Update example_amd_flash_attn_fwd.py
  - Enhance AMD example script and update CI workflows
    - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
    - Added new CI workflows for AMD and documentation publishing.
    - Updated various requirements files to include necessary dependencies.
    - Introduced new test cases and examples for better coverage and functionality.
    - Refactored existing code for improved readability and maintainability.
  - Remove redundant tool cache cleanup step in AMD CI workflow
  - Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.
  - Add new AMD FlashAttention example and test script
    - Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
    - Added `test.sh` script to facilitate running the new example with specified parameters.
    - Enhanced the overall structure and organization of the example for better clarity and usability.
  - Update configurations in `example_amd_flash_attn_fwd.py` for autotuner
    - Reduced the number of threads and `num_split_q` options for improved performance.
    - Adjusted `panel_size` options to streamline configuration settings.
  - Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217
  - Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c
  - Add example for AMD Flash Attention backward pass implementation
    - Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
    - Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
    - Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
    - Included a reference implementation for validation against PyTorch's attention mechanism.
    - This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.
  - Enhance AMD Flash Attention example with additional testing capabilities
    - Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
    - Improved the main function to allow for better parameter configuration and benchmarking.
    - Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.
    - This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.
  - Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
  - Refactor HIP intrinsic rules to CUDA
    - Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
    - Adjusted include paths for better organization and clarity in the code structure.
  - Update AMD CI workflow to uninstall specific PyTorch packages before installation
    - Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
    - Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.
  - Remove unused shared memory allocations in AMD Flash Attention backward example
    - Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
    - This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.
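The validation mentioned above compares the fused kernel's output against a reference attention. The reference semantics, softmax(QKᵀ/√d)·V, in a tiny dependency-free sketch — the examples themselves validate against PyTorch; this pure-Python version only shows the math being checked:

```python
import math

def attention_reference(Q, K, V):
    # softmax(Q K^T / sqrt(d)) V over lists of row vectors.
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        m = max(scores)                       # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum(e / z * v[j] for e, v in zip(exps, V))
                    for j in range(len(V[0]))])
    return out

row = attention_reference([[1.0, 0.0]],
                          [[1.0, 0.0], [0.0, 1.0]],
                          [[1.0, 0.0], [0.0, 1.0]])[0]
print(row)  # weights favour the first key: row[0] > row[1], row sums to 1
```

A fused kernel is then accepted if its output matches this reference within a floating-point tolerance.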
  - Remove unnecessary pip uninstall command from AMD CI workflow
    - Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
    - This change simplifies the CI process and reduces potential overhead during package management.
  - Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules
    - Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
    - Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.
  - Refactor formatting of HIP intrinsic rule registrations
    - Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
    - No functional changes were made; this update focuses on code style improvements to enhance maintainability.
  - Update file na…
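The null-check hardening in DispatchHIPWarpActiveMask follows a common codegen pattern: validate the call node before dereferencing it. A hypothetical Python rendering of the shape of that guard — the `__ballot` spelling is an assumption about the HIP builtin, and the function is illustrative, not the real dispatcher:

```python
def dispatch_warp_active_mask(call_args):
    # Guard against a malformed call before touching its arguments,
    # mirroring the CallNode pointer check described above.
    if not call_args:
        return None  # caller falls back or reports a lowering error
    return f"__ballot({call_args[0]})"

print(dispatch_warp_active_mask([]))        # None
print(dispatch_warp_active_mask(["pred"]))  # __ballot(pred)
```

Returning a sentinel instead of dereferencing unconditionally turns a potential crash during lowering into a diagnosable failure.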
Summary by CodeRabbit
- Breaking Changes
- Refactor
- Tests
- Chores
- Docs/Style