[Refactor] Merge bulk copy into copy and improve layout inference for bulk copy #746
Conversation
…r bulk copy

- Deleted the `bulk_copy` operator implementation and its header file, as it is no longer needed.
- Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping.
- Updated related files to reflect these changes, keeping the codebase clean and maintainable.
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds a unified Copy operator with multi-path lowering (including TMA descriptors), removes the legacy bulk_copy/elem Copy implementation, and renames the `ptx_ldmatirx` builtin to `ptx_ldmatrix`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant TL as TileLang Lowering
participant Copy as Copy Operator
participant Target as Target Utils
participant CG as CUDA Codegen
TL->>Copy: Lower(args, target)
Copy->>Target: TargetHasBulkCopy(target)?
alt Bulk/TMA allowed
Copy->>Copy: GetCopyInst() => kBulkLoad / kBulkStore
Copy->>CG: create_tma_descriptor(...)
CG->>CG: emit tma_load / tma_store
else LDSM/STSM path
Copy->>Copy: GetCopyInst() => kLDSM / kSTSM
Copy->>CG: emit ldmatrix / stmatrix sequence
else Normal path
Copy->>CG: emit SIMT copy loop
end
```

```mermaid
sequenceDiagram
autonumber
participant TL as TileLang Lowering
participant Im2Col as Conv2DIm2ColOp
participant CG as CUDA Codegen
TL->>Im2Col: Lower(args, target)
Im2Col->>Im2Col: Build TMAIm2ColDesc
Im2Col->>CG: create_tma_im2col_descriptor(...)
    CG->>CG: emit tma_load_im2col(...)
```

Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run … before submitting. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Summary of Changes
Hello @LeiWang1999, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant refactoring of TVM's TensorIR memory copy mechanisms. It consolidates the previously separate bulk_copy functionality into a single, more versatile Copy operator. The primary goal is to unify various memory transfer strategies, including optimized bulk copies leveraging NVIDIA's Tensor Memory Accelerator (TMA) on Hopper architectures, matrix load/store operations (LDSM/STSM) for tensor cores, and general-purpose normal copies. A key improvement is the enhanced layout inference for bulk copy, which now intelligently handles shared memory layouts and provides robust fallback mechanisms for unsupported configurations. This change improves code organization, maintainability, and performance for GPU code generation by centralizing and optimizing data movement operations.
Highlights
- Refactoring of Copy Operators: The `bulk_copy` operator has been fully merged into a new, unified `Copy` operator, centralizing memory transfer logic within TVM's TensorIR. This streamlines the codebase and simplifies future enhancements.
- Unified Memory Transfer Strategy Selection: The `Copy` operator now intelligently selects the optimal memory transfer strategy (e.g., Hopper's TMA bulk copy, LDMATRIX/STMATRIX for tensor cores, or a general normal copy) based on the target architecture and buffer memory scopes (see the sketch below).
- Enhanced Layout Inference for Bulk Copy: Layout inference for bulk copy operations has been significantly improved, including the introduction of `ComputeLinearLayout` for TMA and more robust detection of shared memory swizzling patterns. This ensures more efficient data packing and access on specialized hardware.
- Improved Robustness and Fallback Mechanisms: The `LowerBulkCopy` and `LowerLDSMCopy` functions now include comprehensive checks for unsupported configurations (e.g., non-swizzled global layouts, non-8x8 fragment layouts) and gracefully fall back to a normal copy, improving the stability and reliability of code generation.
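To make the selection logic concrete, here is a minimal standalone sketch of the dispatch described above. The names mirror the PR (`GetCopyInst`, bulk/LDSM/STSM paths), but the signature and scope strings are illustrative assumptions, not the committed code:

```cpp
#include <string>

// Hypothetical, simplified mirror of the strategy selection in src/op/copy.cc.
enum class CopyInst { kNormal, kBulkLoad, kBulkStore, kLDSM, kSTSM };

CopyInst GetCopyInstSketch(bool target_has_bulk_copy, bool disable_tma,
                           const std::string &src_scope,
                           const std::string &dst_scope) {
  // TMA bulk copy: global <-> shared.dyn on Hopper (SM90+), unless disabled.
  if (target_has_bulk_copy && !disable_tma) {
    if (src_scope == "global" && dst_scope == "shared.dyn")
      return CopyInst::kBulkLoad;
    if (src_scope == "shared.dyn" && dst_scope == "global")
      return CopyInst::kBulkStore;
  }
  // ldmatrix/stmatrix candidates (fragment-layout checks elided here).
  if (src_scope == "shared.dyn" && dst_scope == "local.fragment")
    return CopyInst::kLDSM;
  if (src_scope == "local.fragment" && dst_scope == "shared.dyn")
    return CopyInst::kSTSM;
  // Everything else falls back to a SIMT copy loop.
  return CopyInst::kNormal;
}
```

Each specialized path can still fall back to the normal copy at lowering time when a later check (layout, range, alignment) fails.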
Code Review
This pull request is a significant refactoring that merges the bulk_copy logic into the copy operator, unifying different copy strategies under a single, more organized structure. The changes are generally well-structured and improve the codebase.
I have identified a critical bug in the new Copy class's copy constructor where a member variable is not initialized, which could lead to undefined behavior. Additionally, I've found several incorrect or misleading comments and log messages that should be corrected to improve code clarity and maintainability. There is also a potential regression regarding memory scopes for bulk copy that should be addressed.
```cpp
Copy(const Copy &other)
    : args_(other.args_), src(other.src), dst(other.dst),
      src_range(other.src_range), dst_range(other.dst_range),
      coalesced_width(other.coalesced_width), disable_tma(other.disable_tma) {
  // Deep copy ParallelOp if it exists
  if (other.par_op_)
    par_op_ = std::unique_ptr<ParallelOp>(
        static_cast<ParallelOp *>(other.par_op_->Clone().release()));
}
```
The eviction_policy member is not initialized in the copy constructor. This can lead to undefined behavior when a Copy object is copied.
```diff
 Copy(const Copy &other)
     : args_(other.args_), src(other.src), dst(other.dst),
       src_range(other.src_range), dst_range(other.dst_range),
-      coalesced_width(other.coalesced_width), disable_tma(other.disable_tma) {
+      coalesced_width(other.coalesced_width), disable_tma(other.disable_tma),
+      eviction_policy(other.eviction_policy) {
   // Deep copy ParallelOp if it exists
   if (other.par_op_)
     par_op_ = std::unique_ptr<ParallelOp>(
         static_cast<ParallelOp *>(other.par_op_->Clone().release()));
 }
```
src/op/copy.cc (outdated)

```cpp
if (!TargetHasBulkCopy(target))
  return false;
// 2. src and dst must be shared.dyn and local.fragment
if (src.scope() != "global" || dst.scope() != "shared.dyn")
```
The check for bulk load and store operations seems to have dropped support for the shared memory scope, only allowing shared.dyn. The previous implementation in bulk_copy.cc supported both shared and shared.dyn. This could be a regression for use cases that rely on statically allocated shared memory. Please clarify if this change was intentional.
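If both scopes should remain supported, a tiny helper along these lines would restore the old behavior (illustrative only; `IsSharedScope` is a hypothetical name, not code from this PR):

```cpp
#include <string>

// Hypothetical helper: treat both static and dynamic shared memory as shared.
inline bool IsSharedScope(const std::string &scope) {
  return scope == "shared" || scope == "shared.dyn";
}

// The bulk-load check would then read:
//   if (src.scope() != "global" || !IsSharedScope(dst.scope()))
//     return false;
```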
src/op/copy.cc (outdated)

```cpp
// 2. src and dst must be shared.dyn and local.fragment
if (src.scope() != "global" || dst.scope() != "shared.dyn")
```
The comment on line 362 is incorrect. It states that src and dst must be shared.dyn and local.fragment, which is not the case for a bulk load. The code correctly checks for a transfer from global to shared.dyn memory. This comment appears to be a copy-paste error and should be updated to reflect the actual logic.
```diff
-// 2. src and dst must be shared.dyn and local.fragment
+// 2. src must be global and dst must be shared.dyn
 if (src.scope() != "global" || dst.scope() != "shared.dyn")
```
src/op/copy.cc (outdated)

```cpp
// 2. src and dst must be shared.dyn and local.fragment
if (src.scope() != "shared.dyn" || dst.scope() != "global")
```
The comment on line 389 is incorrect. It states that src and dst must be shared.dyn and local.fragment, which is not the case for a bulk store. The code correctly checks for a transfer from shared.dyn to global memory. This comment appears to be a copy-paste error and should be updated to reflect the actual logic.
```diff
-// 2. src and dst must be shared.dyn and local.fragment
+// 2. src must be shared.dyn and dst must be global
 if (src.scope() != "shared.dyn" || dst.scope() != "global")
```
```cpp
// TMA ldmatrix/stmatrix cannot support non-1-d layout, will be fallback to
// normal copy
```
The comment here and in several other places in this function (lines 610, 618, 626, 635, 683) incorrectly refers to "TMA ldmatrix/stmatrix". LDSM/STSM are PTX instructions, while TMA is a different hardware feature for bulk memory transfers. This can be confusing. Please replace "TMA" with "PTX" or remove it to avoid confusion.
```diff
-// TMA ldmatrix/stmatrix cannot support non-1-d layout, will be fallback to
+// ldmatrix/stmatrix cannot support non-1-d layout, will be fallback to
 // normal copy
```
```cpp
LOG(WARNING) << "TMA bulk copy cannot support a non-swizzled global "
                "layout, fallback to normal copy.";
```
The log message is a bit misleading. It says "non-swizzled global layout", but the condition T.layout_map.count(global_tensor) checks for the presence of any layout on the global tensor, not specifically a non-swizzled one. Please make the message more accurate.
```diff
-LOG(WARNING) << "TMA bulk copy cannot support a non-swizzled global "
-                "layout, fallback to normal copy.";
+LOG(WARNING) << "TMA bulk copy does not support a layout on the global "
+                "tensor, fallback to normal copy.";
```
src/op/copy.cc (outdated)

```cpp
LOG(WARNING) << "TMA bulk copy cannot support a non-swizzled global layout "
                "with inner_box_dim_ % 256 != 0"
             << "inner_box_dim_: " << inner_box_dim_
             << ", will be fallback to normal copy";
```
The log message "TMA bulk copy cannot support a non-swizzled global layout" is confusing. The swizzling is on the shared memory layout, not the global layout. Please clarify the message to indicate that this constraint applies to the shared memory layout.
LOG(WARNING) << "TMA bulk copy with no swizzling on shared memory requires the copy size in bytes "
"along the contiguous dimension to be a multiple of 256, but got "
<< inner_box_dim_ << ", will be fallback to normal copy";| LOG(WARNING) << "TMA bulk copy cannot support a swizzled global layout " | ||
| "with inner_box_dim_ > " | ||
| << check.max_dim << ", will be fallback to normal copy"; |
The log message "TMA bulk copy cannot support a swizzled global layout" is confusing. The swizzling is on the shared memory layout, not the global layout. Please clarify the message.
LOG(WARNING) << "TMA bulk copy cannot support a swizzled shared memory layout "
"with inner_box_dim_ > "
<< check.max_dim << ", will be fallback to normal copy";There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (16)
src/op/builtin.cc (1)
32-33: Expose cuTensorMapType via builtin: LGTM, but consider constexpr.

The return type and value (u8x128) make sense for CUtensorMap. Optional: mark it constexpr to enable compile-time usage and avoid any potential ODR issues if inlined across multiple TUs.
```diff
-DataType cuTensorMapType() { return DataType::UInt(8, 128); }
+constexpr DataType cuTensorMapType() { return DataType::UInt(8, 128); }
```

src/target/utils.h (1)
28-29: New capability predicate added: OK; please document intent (SM90+ TMA).

The name aligns with existing helpers. Add a brief comment to clarify that “BulkCopy” currently means CUDA TMA (Hopper/SM90+), so readers don’t confuse it with cp.async.
```diff
-bool TargetHasBulkCopy(Target target);
+// Returns true when target supports CUDA TMA-based bulk copy (Hopper/SM90+).
+bool TargetHasBulkCopy(Target target);
```

src/op/builtin.h (1)
57-64: Public cuTensorMapType() declaration: consider including the defining header.

This header already compiles in the project, but to make it self-sufficient, add an explicit include for DataType (if not transitively included by op.h in all build configurations).
#include "op.h" #include <tvm/ir/transform.h> +#include <tvm/runtime/data_type.h>src/op/copy.cc (4)
19-21: Duplicate include of utils.h.

utils.h is included twice (Lines 13 and 20). Drop one to avoid confusion and speed up builds slightly.
#include "../target/cuda.h" -#include "../target/utils.h" #include "builtin.h"
1031-1041: IntImm extraction style is brittle; use Downcast consistently.

The current pattern (`args[i].as<IntImm>().value()->value`) is error-prone. Prefer `Downcast<IntImm>` or `as<IntImm>()` with checks for clarity and consistency.
```diff
-kernel = args[4].as<IntImm>().value()->value;
-stride = args[5].as<IntImm>().value()->value;
-dilation = args[6].as<IntImm>().value()->value;
-padding = args[7].as<IntImm>().value()->value;
-eviction_policy = args[8].as<IntImm>().value()->value;
+kernel = Downcast<IntImm>(args[4])->value;
+stride = Downcast<IntImm>(args[5])->value;
+dilation = Downcast<IntImm>(args[6])->value;
+padding = Downcast<IntImm>(args[7])->value;
+eviction_policy = Downcast<IntImm>(args[8])->value;
```
901-919: Inner-box dimension logic: hard ICHECK on constant may be too strict.

as_const_int(desc.smem_box[0]) enforces a constant at compile time; if symbolic, the pass will crash. Recommend graceful fallback to normal copy instead of a hard fail.
```diff
-auto inner_box_dim = as_const_int(desc.smem_box[0]);
-ICHECK(inner_box_dim != nullptr);
+auto inner_box_dim = as_const_int(desc.smem_box[0]);
+if (!inner_box_dim) {
+  LOG(WARNING) << "Non-constant inner box dim; fallback to normal copy.";
+  return LowerNormalCopy(T, analyzer);
+}
```
740-746: Warning text likely inverted (“non-swizzled global layout”).

The condition checks T.layout_map.count(global_tensor) (i.e., there is some annotated layout). The message says “cannot support a non-swizzled global layout,” which reads contradictory. Consider clarifying to avoid confusion.
```diff
-LOG(WARNING) << "TMA bulk copy cannot support a non-swizzled global "
-                "layout, fallback to normal copy.";
+LOG(WARNING) << "TMA bulk copy requires raw (non-annotated) global layout; "
+                "found annotated layout, fallback to normal copy.";
```

src/op/copy.h (9)
2-4: Fix file path in header docblock.

The path in the comment says tl/op/elem.h, but this file is op/copy.h. Update to avoid confusion during grepping and docs generation.
Apply:
```diff
- * \file tl/op/elem.h
+ * \file op/copy.h
```
17-20: Avoid `using namespace` in a public header.

Public headers should not use wildcard namespace imports; they leak into all includers. Prefer qualified names or selective using-declarations.
Apply:
```diff
-namespace tvm {
-namespace tl {
-using namespace tir;
+namespace tvm {
+namespace tl {
+using tir::Array;
+using tir::Bool;
+using tir::Buffer;
+using tir::BufferMap;
+using tir::Call;
+using tir::For;
+using tir::IterVar;
+using tir::Layout;
+using tir::LayoutMap;
+using tir::LowerArgs;
+using tir::LayoutInferArgs;
+using tir::Op;
+using tir::PrimExpr;
+using tir::Range;
+using tir::Stmt;
```
29-47: Descriptor field types: prefer explicit, TVM-aligned widths and invariants.
- rank: TVM often passes ranks as int (32-bit). Using size_t here may cause implicit narrowing in EncodeCallArgs() or builtins. Consider int for consistency.
- data_type: if this is CUtensorMapDataType, document expected enum width and ensure EncodeCallArgs() casts appropriately.
- Consider asserting invariants (e.g., global_shape.size() == rank, smem_box.size() == rank) inside EncodeCallArgs() to fail early.
Example adjustment (header-only):
```diff
-size_t rank;   // Tensor rank (number of dimensions)
-int data_type; // Data type identifier (numeric code)
+int rank;      // Tensor rank (number of dimensions)
+int data_type; // CUtensorMapDataType enum value
```

And add comments to the EncodeCallArgs() contract noting required array sizes.
55-77: Clarify im2col descriptor dimensional contracts and guard in EncodeCallArgs.

For TMAIm2ColDesc, specify that lower_corner/upper_corner and elem_stride are rank-2 (spatial/channel) sized. Add ICHECKs in EncodeCallArgs() to validate:
- global_shape.size() == rank
- lower_corner.size() == upper_corner.size()
- lower_corner.size() == rank - 2
I can draft the EncodeCallArgs() ICHECKs if you want them in copy.cc.
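For illustration, the invariants could look like the standalone sketch below, using plain asserts in place of TVM's ICHECK and `std::vector` in place of `Array<PrimExpr>` (field names follow the descriptor above; this is not the committed code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Standalone mock of the TMAIm2ColDesc invariants suggested above.
struct Im2ColDescSketch {
  std::size_t rank;                 // tensor rank
  std::vector<long> global_shape;   // expected size == rank
  std::vector<long> lower_corner;   // expected size == rank - 2
  std::vector<long> upper_corner;   // expected size == lower_corner.size()

  void ValidateForEncode() const {
    assert(global_shape.size() == rank && "global_shape must match rank");
    assert(lower_corner.size() == upper_corner.size() &&
           "corner arrays must have equal length");
    assert(rank >= 2 && lower_corner.size() == rank - 2 &&
           "corners cover the spatial dims only");
  }
};
```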
86-94: Make constructor explicit.

Prevent unintended implicit conversions from `Array<PrimExpr>` by marking the constructor explicit.
```diff
-Copy(Array<PrimExpr> args, BufferMap vmap);
+explicit Copy(Array<PrimExpr> args, BufferMap vmap);
```
117-124: Enum doc: align names with PTX/TMA terminology.

Minor nit: capitalize LDSM/STSM consistently in comments and consider noting they correspond to ldmatrix/stmatrix intrinsics to aid readers.
221-227: Prefer Optional for optional fields; avoid NodeRef-specific types in storage.

coalesced_width and disable_tma are optional parameters. Storing IntImm/Bool (ObjectRef types) in the class makes APIs awkward. Prefer:
- `Optional<IntImm>` or `Optional<Integer>` for coalesced_width
- `bool` for disable_tma
Example:
```diff
-IntImm coalesced_width; // Width (in elements) for coalesced memory access
-Bool disable_tma = Bool(false); // Whether to disable TMA acceleration
+Optional<Integer> coalesced_width; // Width (elements) for coalesced access
+bool disable_tma = false;          // Disable TMA acceleration
```

Adjust constructor accordingly to set these fields.
231-238: Unify eviction policy typing and usage.

There are two eviction policy representations: the inner enum and a raw int field. Prefer storing the enum to get compile-time safety, and cast to int only when forming call args.
Apply:
```diff
-enum class EvictionPolicy {
+enum class EvictionPolicy : int {
   kEvictNormal = 0,
   kEvictFirst = 1,
   kEvictLast = 2,
 };
-
-int eviction_policy; // Policy for cache eviction
+EvictionPolicy eviction_policy = EvictionPolicy::kEvictNormal; // Policy
```

Then, when pushing call args: `static_cast<int>(eviction_policy)`.
246-254: Make Conv2DIm2ColOp constructor explicit and document arg order.

As with Copy, mark explicit. Also, since the constructor relies on positional args, add a short comment enumerating the expected arg indices (src, dst, nhw_step, c_step, kernel, stride, dilation, padding, eviction_policy).
```diff
-Conv2DIm2ColOp(Array<PrimExpr> args, BufferMap vmap);
+explicit Conv2DIm2ColOp(Array<PrimExpr> args, BufferMap vmap);
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (14)
- src/op/builtin.cc (1 hunks)
- src/op/builtin.h (1 hunks)
- src/op/bulk_copy.cc (0 hunks)
- src/op/bulk_copy.h (0 hunks)
- src/op/copy.cc (1 hunks)
- src/op/copy.h (1 hunks)
- src/op/elem.cc (0 hunks)
- src/op/elem.h (0 hunks)
- src/target/codegen_cuda.cc (0 hunks)
- src/target/utils.cc (1 hunks)
- src/target/utils.h (1 hunks)
- src/transform/lower_hopper_intrin.cc (0 hunks)
- src/transform/lower_l2_persistent_annotation.cc (0 hunks)
- src/transform/persist_threadblock.cc (0 hunks)
💤 Files with no reviewable changes (8)
- src/target/codegen_cuda.cc
- src/transform/persist_threadblock.cc
- src/transform/lower_l2_persistent_annotation.cc
- src/transform/lower_hopper_intrin.cc
- src/op/elem.cc
- src/op/bulk_copy.h
- src/op/elem.h
- src/op/bulk_copy.cc
🧰 Additional context used
🧬 Code graph analysis (5)
src/target/utils.cc (1)
src/op/gemm.cc (2)
GetArchInt (363-374), GetArchInt (363-363)
src/target/utils.h (1)
src/target/utils.cc (2)
TargetHasBulkCopy (107-112), TargetHasBulkCopy (107-107)
src/op/copy.cc (5)
src/op/copy.h (2)
Copy (86-283), Conv2DIm2ColOp (246-281)
src/transform/lower_hopper_intrin.cc (4)
call (102-132), call (102-102), op (55-100), op (55-55)
src/target/utils.cc (8)
TargetHasBulkCopy (107-112), TargetHasBulkCopy (107-107), TargetHasLdmatrix (93-98), TargetHasLdmatrix (93-93), TargetHasStmatrix (100-105), TargetHasStmatrix (100-100), TargetIsHopper (49-54), TargetIsHopper (49-49)
tilelang/language/copy.py (1)
copy (84-152)
src/op/bulk_copy.cc (1)
tir (17-533)
src/op/copy.h (3)
src/op/builtin.h (2)
tvm (13-313), tl (14-312)
src/target/utils.h (2)
tvm (12-32), tl (13-31)
src/op/copy.cc (34)
Copy (113-139), Lower (464-481), Lower (464-464), Lower (1053-1175), Lower (1053-1054), InferLayout (319-348), InferLayout (319-319), CheckBulkLoad (358-375), CheckBulkLoad (358-358), CheckBulkStore (385-402), CheckBulkStore (385-385), CheckLDSMCopy (412-415), CheckLDSMCopy (412-412), CheckSTSMCopy (425-428), CheckSTSMCopy (425-425), GetCopyInst (438-450), GetCopyInst (438-438), LowerBulkCopy (731-991), LowerBulkCopy (731-732), LowerLDSMCopy (538-718), LowerLDSMCopy (538-539), LowerNormalCopy (492-525), LowerNormalCopy (492-493), MakeSIMTLoop (236-281), MakeSIMTLoop (236-236), ComputeLinearLayout (292-307), ComputeLinearLayout (292-292), MakeIterVars (147-159), MakeIterVars (147-147), MakeIndices (170-187), MakeIndices (170-171), MakePredicate (198-226), MakePredicate (198-200), Conv2DIm2ColOp (1031-1041)
src/op/builtin.h (1)
src/op/builtin.cc (2)
cuTensorMapType (32-32), cuTensorMapType (32-32)
🔇 Additional comments (5)
src/target/utils.cc (1)
107-112: TargetHasBulkCopy gating looks correct (Hopper+).

Matches the SM version checks used elsewhere and cleanly short-circuits non-CUDA targets. No changes requested.
src/op/copy.cc (2)
36-97: Incorrect CUtensorMap data-type mapping for int16/int8 (and fp8 handling).
- Signed integers 16/8 are mapped to unsigned enums (UINT16/UINT8), which is incorrect and can cause misinterpreted data or runtime errors when creating descriptors.
- fp8 formats are mapped to UINT8; if CUDA provides explicit fp8 enums, prefer those; otherwise leave a TODO with a clear rationale.
```diff
 } else if (dtype.is_int()) {
   switch (dtype.bits()) {
   case 64:
     tp = CU_TENSOR_MAP_DATA_TYPE_INT64;
     break;
   case 32:
     tp = CU_TENSOR_MAP_DATA_TYPE_INT32;
     break;
   case 16:
-    tp = CU_TENSOR_MAP_DATA_TYPE_UINT16;
+    tp = CU_TENSOR_MAP_DATA_TYPE_INT16;
     break;
   case 8:
-    tp = CU_TENSOR_MAP_DATA_TYPE_UINT8;
+    tp = CU_TENSOR_MAP_DATA_TYPE_INT8;
     break;
   default:
     ICHECK(0) << dtype;
   }
 } else if (dtype.is_uint()) {
@@
 } else if (dtype.is_bfloat16()) {
   tp = CU_TENSOR_MAP_DATA_TYPE_BFLOAT16;
-} else if (dtype.is_float8_e4m3() || dtype.is_float8_e5m2()) {
-  tp = CU_TENSOR_MAP_DATA_TYPE_UINT8;
+} else if (dtype.is_float8_e4m3() || dtype.is_float8_e5m2()) {
+  // Prefer native FP8 enums when available; otherwise, conservatively treat
+  // as u8 and rely on consumers to reinterpret. Consider guarding with CUDA
+  // version checks.
+  // TODO(lei): switch to CU_TENSOR_MAP_DATA_TYPE_FLOAT8_* enums when minimum
+  // CUDA >= 12.X.
+  tp = CU_TENSOR_MAP_DATA_TYPE_UINT8;
```

Likely an incorrect or invalid review comment.
957-987: Bulk-copy mbarrier placeholder resolution verified
The zero mbarrier ID pushed for loads in src/op/copy.cc is correctly replaced by the `inject_tma_barrier` pass (src/transform/inject_tma_barrier.cc:157–160), which emits the `get_mbarrier()` call and the matching arrive/wait sequence upstream. No changes are needed here.

src/op/copy.h (2)
283-286: Include guard and trailing newline: LGTM.
1-286: No duplicate type definitions found

I ran searches across the repository for TMADesc, TMAIm2ColDesc, Copy, and Conv2DIm2ColOp and confirmed they only appear in src/op/copy.h. There are no other headers or source files declaring these types, so there’s no ODR risk from leftover definitions in tl/op/elem.h or elsewhere.
```cpp
if (is_scalar) {
  return For(Var("i"), 0, 1, ForKind::kSerial,
             BufferStore(dst, BufferLoad(src, {0}), {0}));
}

for (const auto &iv : loop_vars)
```
Scalar-copy path ignores region minima; indexes should use ranges.
Using BufferLoad(src, {0})/BufferStore(dst, {0}) will break when min != 0. Reuse MakeIndices with an empty iv list to honor ranges.
```diff
-  if (is_scalar) {
-    return For(Var("i"), 0, 1, ForKind::kSerial,
-               BufferStore(dst, BufferLoad(src, {0}), {0}));
-  }
+  if (is_scalar) {
+    Array<PrimExpr> src_indices = MakeIndices({}, 0);
+    Array<PrimExpr> dst_indices = MakeIndices({}, 1);
+    return For(Var("i"), 0, 1, ForKind::kSerial,
+               BufferStore(dst, BufferLoad(src, src_indices), dst_indices));
+  }
```
```cpp
// Can only support local_range to be a full range
for (size_t i = 0; i < dst_range.size(); i++) {
  if (!is_zero(dst_range[i]->min) ||
      !analyzer->CanProveEqual(dst_range[i]->extent, dst->shape[i]))
    // TMA ldmatrix/stmatrix cannot support non-full range, will be fallback
    // to normal copy
    return LowerNormalCopy(T, analyzer);
}
```
LDSM/STSM local-range check uses dst_range unconditionally.
For STSM, the “local” tensor is src, but the code checks dst_range. This can incorrectly fall back or miss invalid ranges.
```diff
-  // Can only support local_range to be a full range
-  for (size_t i = 0; i < dst_range.size(); i++) {
-    if (!is_zero(dst_range[i]->min) ||
-        !analyzer->CanProveEqual(dst_range[i]->extent, dst->shape[i]))
+  // Can only support local_range to be a full range
+  const auto &local_range = is_ldmatrix ? dst_range : src_range;
+  const auto &local_buf = is_ldmatrix ? dst : src;
+  for (size_t i = 0; i < local_range.size(); i++) {
+    if (!is_zero(local_range[i]->min) ||
+        !analyzer->CanProveEqual(local_range[i]->extent, local_buf->shape[i]))
       // TMA ldmatrix/stmatrix cannot support non-full range, will be fallback
       // to normal copy
       return LowerNormalCopy(T, analyzer);
   }
```
```cpp
TIR_REGISTER_TL_OP(Copy, copy)
    .set_num_inputs(4)
    .set_attr<TCallEffectKind>("TCallEffectKind",
                               Integer(CallEffectKind::kOpaque));
```
tl.copy registered with wrong arity (should be 5, not 4).
tilelang.language.copy passes src, dst, coalesced_width, disable_tma, eviction_policy (5 args). Current registration as 4 will assert at runtime.
```diff
 TIR_REGISTER_TL_OP(Copy, copy)
-    .set_num_inputs(4)
+    .set_num_inputs(5)
     .set_attr<TCallEffectKind>("TCallEffectKind",
                                Integer(CallEffectKind::kOpaque));
```
```cpp
Copy(const Copy &other)
    : args_(other.args_), src(other.src), dst(other.dst),
      src_range(other.src_range), dst_range(other.dst_range),
      coalesced_width(other.coalesced_width), disable_tma(other.disable_tma) {
  // Deep copy ParallelOp if it exists
  if (other.par_op_)
    par_op_ = std::unique_ptr<ParallelOp>(
        static_cast<ParallelOp *>(other.par_op_->Clone().release()));
}
```
Copy constructor drops eviction_policy (semantic bug).
The copy ctor doesn’t copy eviction_policy, so Clone() produces an operator with an indeterminate policy. This can mis-lower or change runtime behavior.
Apply:
```diff
 Copy(const Copy &other)
     : args_(other.args_), src(other.src), dst(other.dst),
       src_range(other.src_range), dst_range(other.dst_range),
-      coalesced_width(other.coalesced_width), disable_tma(other.disable_tma) {
+      coalesced_width(other.coalesced_width),
+      disable_tma(other.disable_tma),
+      eviction_policy(other.eviction_policy) {
   // Deep copy ParallelOp if it exists
   if (other.par_op_)
     par_op_ = std::unique_ptr<ParallelOp>(
         static_cast<ParallelOp *>(other.par_op_->Clone().release()));
 }
```

Also consider default-initializing eviction_policy to a sane default (kEvictNormal) at declaration.
…ck_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/transform/inject_fence_proxy.cc (1)
57-66: Optional: centralize the “generic-proxy op” predicate.

To reduce future churn when adding more smem warp-level ops (e.g., additional PTX matrix ops), consider factoring the op check into a private helper. This keeps the visitor logic cleaner and makes it easy to extend.
Apply within this hunk:
```diff
-if (call->op.same_as(ptx_ldmatrix()) ||
-    call->op.same_as(ptx_stmatrix())) {
+if (IsGenericProxyCall(call)) {
   proxy = Proxy::kGeneric;
 }
```

Add this helper inside ProxyMarker (private section):
```cpp
// inside class ProxyMarker (private:)
static bool IsGenericProxyCall(const CallNode *call) {
  return call->op.same_as(ptx_ldmatrix()) ||
         call->op.same_as(ptx_stmatrix());
}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- examples/seer_attention/block_sparse_attn_tilelang.py (0 hunks)
- src/op/builtin.cc (2 hunks)
- src/op/builtin.h (2 hunks)
- src/op/copy.cc (1 hunks)
- src/target/codegen_cuda.cc (1 hunks)
- src/transform/inject_fence_proxy.cc (1 hunks)
💤 Files with no reviewable changes (1)
- examples/seer_attention/block_sparse_attn_tilelang.py
🚧 Files skipped from review as they are similar to previous changes (4)
- src/target/codegen_cuda.cc
- src/op/builtin.h
- src/op/copy.cc
- src/op/builtin.cc
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/inject_fence_proxy.cc (1)
tilelang/language/tir/op.py (1)
ptx_ldmatrix (1123-1159)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-amd
- GitHub Check: build-test-nvidia
- GitHub Check: bot-task
🔇 Additional comments (2)
src/transform/inject_fence_proxy.cc (2)
60-62: Fix looks correct: use ptx_ldmatrix() instead of the misspelled symbol.

The rename aligns with the Python/TIR binding (tilelang/language/tir/op.py) and keeps EvaluateNode classification consistent: ldmatrix/stmatrix -> kGeneric; otherwise remains kAsync. No behavior change beyond the typo fix. LGTM.
57-66: All ‘ptx_ldmatirx’ misspellings have been removed
A full repo-wide search found zero occurrences of the old typo. All remaining calls to `ptx_ldmatrix` are correct and expected.
…ation

- Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations.
- Modified the `Lower` method in `Copy` to incorporate the new TMA configuration.
- Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic.
- Removed the unused `disable_tma_lower` field from the `LowerArgs` structure for clarity.
- Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach.
- Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling.
- Updated the warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities.
…opy` method to streamline the bulk copy logic. This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/transform/atomicadd_vectorize.cc (1)
175-209: Guard dynamic-shape loops in VisitStmt_ before constant-extent checks

The `VisitStmt_` implementation in src/transform/atomicadd_vectorize.cc currently performs

```cpp
auto extent_ptr = as_const_int(fnode->extent);
ICHECK(extent_ptr) << fnode->extent;
…
ICHECK(is_zero(fnode->min));
```

even when `dynamic_ == true`, causing a hard crash if the loop extent isn’t a compile-time constant. We should early-return the original loop for dynamic shapes before any ICHECK assertions.

Affected location:
- File: src/transform/atomicadd_vectorize.cc
- Method: `AtomicAddVectorizeRewriter::VisitStmt_` (around lines 175–209)

Suggested diff:

```diff
 Stmt VisitStmt_(const ForNode* node) final {
   inner_for_ = node;
   auto ret = StmtExprMutator::VisitStmt_(node);
   if (inner_for_ == node) { // rewrite the innermost loop
-    For fnode = ret.as<For>().value();
+    // For dynamic shapes, skip vectorization and avoid constant-extent checks.
+    if (dynamic_) {
+      return ret;
+    }
+    For fnode = ret.as<For>().value();
     auto old_var = fnode->loop_var;
     auto extent_ptr = as_const_int(fnode->extent);
     ICHECK(extent_ptr) << fnode->extent;
     int extent = *extent_ptr;
     ICHECK(extent % vector_size_ == 0)
         << "extent: " << extent << " vector_size_: " << vector_size_;
     ICHECK(is_zero(fnode->min));
-    if (!dynamic_) {
-      Var tx_var;
+    {
+      Var tx_var;
       PostOrderVisit(fnode->body, [&tx_var](const ObjectRef& node) {
         if (const VarNode* var = node.as<VarNode>()) {
           if (var->name_hint == "tx") {
             tx_var = GetRef<Var>(var);
           }
         }
       });
       ICHECK(tx_var.defined()) << "Failed to find tx var";
       Var outer_var = Var(old_var->name_hint + "_outer");
       Map<Var, PrimExpr> vmap;
       // Scale thread index (tx) and loop variable by vector_size to map each
       // new iteration to a vectorized chunk
       vmap.Set(tx_var, tx_var * vector_size_);
       vmap.Set(fnode->loop_var, outer_var * vector_size_);
       Stmt body = Substitute(fnode->body, vmap);
       return For(outer_var, 0, extent / vector_size_, fnode->kind, body,
                  fnode->thread_binding, fnode->annotations, fnode->span);
-    }
+    }
   }
   return ret;
 }
```
♻️ Duplicate comments (2)
src/op/copy.h (2)
153-161: Copy constructor drops eviction_policy (semantic bug): preserve the policy.

The copy ctor doesn't copy eviction_policy. Cloned or copied instances can end up with an indeterminate policy, changing lowering behavior nondeterministically. Fix by adding it to the initializer list.
```diff
 Copy(const Copy &other)
     : args_(other.args_), src(other.src), dst(other.dst),
       src_range(other.src_range), dst_range(other.dst_range),
-      coalesced_width(other.coalesced_width), disable_tma(other.disable_tma) {
+      coalesced_width(other.coalesced_width),
+      disable_tma(other.disable_tma),
+      eviction_policy(other.eviction_policy) {
   // Deep copy ParallelOp if it exists
   if (other.par_op_)
     par_op_ = std::unique_ptr<ParallelOp>(
         static_cast<ParallelOp *>(other.par_op_->Clone().release()));
 }
```
237-238: Default-initialize eviction_policy to a sane value (UB avoidance).

eviction_policy is a raw int and remains uninitialized for constructors that don't set it. Default it to kEvictNormal to prevent undefined behavior.
```diff
-int eviction_policy; // Policy for cache eviction
+int eviction_policy = static_cast<int>(EvictionPolicy::kEvictNormal); // Policy for cache eviction
```

Optional: consider strongly typing this field as EvictionPolicy to prevent invalid values; if you do so, adjust assignments in copy.cc with static_cast.
🧹 Nitpick comments (3)
src/op/copy.h (2)
1-9: Stale @file path in header doc.

The Doxygen header says tl/op/elem.h but this file is op/copy.h. Update to avoid confusion in generated docs and code search.
```diff
-/*!
- * \file tl/op/elem.h
+/*!
+ * \file op/copy.h
```
17-20: Avoid `using namespace` in headers.

Placing `using namespace tir;` in a public header pollutes downstream translation units. Prefer fully qualified names or move the using-directive into the .cc.

If this pattern is pervasive in the codebase, align with the prevailing style; otherwise, consider a follow-up to gradually remove header-level using-directives.
src/op/atomic_add.cc (1)
189-195: ParallelOp::InferLayout is order-independent (no overwrites)

In src/op/parallel.cc (around lines 182–188), the implementation of `ParallelOp::InferLayout` ensures that once `loop_layout_` is set, subsequent calls do not modify it:

- At the start of the method it does an early-return guard, so only the first non-strict pass (kCommon, or kFree if no source buffer) can assign `loop_layout_`.
- The final layout is then retrieved via `GetLoopLayout()` (src/op/parallel.h:61), reflecting exactly the first successful inference.

Optional refactor (recommended):
- Add a brief comment in the doc-string for `ParallelOp::InferLayout` to document its idempotent, order-independent behavior for future maintainers.
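The guard pattern being described reduces to something like this standalone sketch (assumed shape only; the real code lives in src/op/parallel.cc and uses TVM types):

```cpp
#include <optional>

enum class InferLevel { kStrict, kCommon, kFree };
struct Layout {};

// Idempotent, order-independent inference: the first non-strict pass that
// runs assigns the layout; every later call is a no-op.
class LayoutHolderSketch {
 public:
  void InferLayout(InferLevel level) {
    if (loop_layout_.has_value()) return;      // already chosen: never overwrite
    if (level == InferLevel::kStrict) return;  // strict pass never assigns
    loop_layout_ = Layout{};                   // kCommon/kFree assign exactly once
  }
  const std::optional<Layout> &GetLoopLayout() const { return loop_layout_; }

 private:
  std::optional<Layout> loop_layout_;
};
```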
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- src/op/atomic_add.cc (1 hunks)
- src/op/copy.cc (1 hunks)
- src/op/copy.h (1 hunks)
- src/op/op.h (0 hunks)
- src/transform/atomicadd_vectorize.cc (1 hunks)
- src/transform/lower_tile_op.cc (1 hunks)
💤 Files with no reviewable changes (1)
- src/op/op.h
🚧 Files skipped from review as they are similar to previous changes (1)
- src/op/copy.cc
🧰 Additional context used
🧬 Code graph analysis (1)
src/op/copy.h (4)
src/op/builtin.h (2)
tvm (13-313), tl (14-312)
src/target/utils.h (2)
tvm (12-32), tl (13-31)
src/op/op.h (3)
tl (18-95), Operator (61-67), `` (74-76)
src/op/copy.cc (34)
Copy (114-139), Lower (471-491), Lower (471-471), Lower (1064-1186), Lower (1064-1065), InferLayout (319-352), InferLayout (319-319), CheckBulkLoad (362-379), CheckBulkLoad (362-362), CheckBulkStore (389-406), CheckBulkStore (389-389), CheckLDSMCopy (416-419), CheckLDSMCopy (416-416), CheckSTSMCopy (429-432), CheckSTSMCopy (429-429), GetCopyInst (442-457), GetCopyInst (442-442), LowerBulkCopy (741-1002), LowerBulkCopy (741-742), LowerLDSMCopy (548-728), LowerLDSMCopy (548-549), LowerNormalCopy (502-535), LowerNormalCopy (502-503), MakeSIMTLoop (236-281), MakeSIMTLoop (236-236), ComputeLinearLayout (292-307), ComputeLinearLayout (292-292), MakeIterVars (147-159), MakeIterVars (147-147), MakeIndices (170-187), MakeIndices (170-171), MakePredicate (198-226), MakePredicate (198-200), Conv2DIm2ColOp (1042-1052)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-nvidia
- GitHub Check: build-test-amd
- GitHub Check: bot-task
🔇 Additional comments (1)
src/transform/lower_tile_op.cc (1)
447-451: LowerArgs initializer verified: struct fields match and no stale `disable_tma_lower` uses

All checks passed:
- The `LowerArgs` definition in src/op/op.h lists fields in this exact order: `Target target; Range thread_bounds; Var thread_var; AddWorkspaceCallback AddWorkspace; LayoutMap layout_map; Map<Buffer, Buffer> buffer_remap;` (confirmed via `rg -nP 'struct\s+LowerArgs\b' -C10 src/op/op.h`).
- The single call site in src/transform/lower_tile_op.cc uses `LowerArgs{target_, thread_bounds, thread_var_->var, callback, layout_map_, buffer_remap_}`, exactly matching the struct’s field order.
- No other positional initializations of `LowerArgs` exist, and no remaining occurrences of `disable_tma_lower` in C++ call sites.

Since the initializer aligns perfectly with the updated struct and no obsolete flag usages remain, no further changes are needed here.
```cpp
// TODO(@dyq): buggy implementation, need to fix
// vectorized_thread_loop = VectorizeAtomicAdd(
//     thread_loop, thread_var, thread_bounds, GetArchInt(target));
auto vectorized_thread_loop = VectorizeLoop(thread_loop);
```
💡 Verification agent
🧩 Analysis chain
Replacing specialized AtomicAdd vectorizer with generic VectorizeLoop likely breaks vectorization semantics
The previous path used VectorizeAtomicAdd which:
- plans a vector size based on arch and dtype,
- and rewrites `call_extern("AtomicAdd", ...)` into `AtomicAddx2`/`AtomicAddx4` with pointer-to-src conventions.
Switching to VectorizeLoop(thread_loop) bypasses that rewriter. Unless VectorizeLoop knows how to handle AtomicAdd and emit AtomicAddx{2,4}, vectorization will either be skipped or produce invalid IR for the extern call.
Reinstate the specialized vectorizer here until VectorizeLoop subsumes it:
```diff
-// TODO(@dyq): buggy implementation, need to fix
-// vectorized_thread_loop = VectorizeAtomicAdd(
-//     thread_loop, thread_var, thread_bounds, GetArchInt(target));
-auto vectorized_thread_loop = VectorizeLoop(thread_loop);
+// Use the specialized AtomicAdd vectorizer pending a robust generic path.
+auto vectorized_thread_loop =
+    VectorizeAtomicAdd(thread_loop, thread_var, thread_bounds,
+                       GetArchInt(target));
```

- Confirm that `AtomicAddx2`/`AtomicAddx4` are defined/linked in the runtime.
- If you intend to keep `VectorizeLoop`, point me to the lowering that handles `AtomicAdd` so I can review that path.
🏁 Script executed:
```bash
#!/bin/bash
rg -nP '\bAtomicAddx(2|4)\b' -g 'src/**|include/**' -C2
```

Length of output: 206
🏁 Script executed:
```bash
#!/bin/bash
# Properly search for AtomicAddx2/4 definitions or references across the entire repository
rg -nP '\bAtomicAddx(2|4)\b' -C2 || echo "No matches found in repository"
# Additionally, constrain search to src and include directories
rg -nP '\bAtomicAddx(2|4)\b' -g 'src/**' -g 'include/**' -C2 || echo "No matches found in src/ or include/"
```

Length of output: 4981
🏁 Script executed:
```bash
#!/bin/bash
# Search for the definition and usage of VectorizeLoop in the source
rg -n -C3 'VectorizeLoop' -g 'src/**' || echo "VectorizeLoop not found in src/"
```

Length of output: 6630
Reinstate the specialized AtomicAdd vectorizer to preserve correct semantics
The change from VectorizeAtomicAdd to a plain VectorizeLoop bypasses the custom rewriting that emits AtomicAddx2/AtomicAddx4, leading to either skipped vectorization or invalid IR for the extern call.
• We verified that AtomicAddx2 and AtomicAddx4 are indeed defined and used in the codebase:
- In the generic customization layer (tilelang), they’re generated by calls in tilelang/language/customize.py.
- The specialized mutator in src/transform/atomicadd_vectorize.cc pushes "AtomicAddx2"/"AtomicAddx4" into the call arguments based on vector size.
- Their device implementations live in src/tl_templates/cuda/common.h, as `TL_DEVICE void AtomicAddx2(…)` and `…AtomicAddx4(…)`.

• The generic VectorizeLoop implementation (src/transform/loop_vectorize.cc) contains no special handling for extern calls to "AtomicAdd"; it merely vectorizes based on arithmetic patterns, so it won’t rewrite or emit the correct AtomicAddxN calls. A sketch of the pattern these helpers implement follows.
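For reference, a two-wide fp16 atomic add of the kind `AtomicAddx2` performs can be written as below; this is a sketch of the pattern only, and the actual template in src/tl_templates/cuda/common.h may differ:

```cpp
// CUDA sketch: one __half2 atomicAdd performs two fp16 adds (SM60+).
#include <cuda_fp16.h>

__device__ inline void AtomicAddx2Sketch(half *dst, half *src) {
  atomicAdd(reinterpret_cast<__half2 *>(dst),
            *reinterpret_cast<__half2 *>(src));
}
```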
Action: Revert to the specialized vectorizer call in src/op/atomic_add.cc until the generic path is extended to cover AtomicAdd.
```diff
--- a/src/op/atomic_add.cc
+++ b/src/op/atomic_add.cc
@@ -200,7 +200,10 @@ Stmt MakeAtomicAddOp(...){
-  // TODO(@dyq): buggy implementation, need to fix
-  // vectorized_thread_loop = VectorizeAtomicAdd(
-  //     thread_loop, thread_var, thread_bounds, GetArchInt(target));
-  auto vectorized_thread_loop = VectorizeLoop(thread_loop);
+  // Use the specialized AtomicAdd vectorizer to emit AtomicAddx2/4 calls
+  auto vectorized_thread_loop =
+      VectorizeAtomicAdd(thread_loop, thread_var, thread_bounds,
+                         GetArchInt(target));
```
…smem_1d_2_2d` functions

- Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations.
- Added TODO comments to indicate the need for further improvements in shared memory handling.
Actionable comments posted: 0
🧹 Nitpick comments (7)
examples/flash_attention/example_gqa_bwd.py (3)
292-295: Avoid potential CPU/GPU device mismatch in reference scaling

Dividing by torch.sqrt(torch.tensor(dim_qk, dtype=scores.dtype)) constructs a CPU tensor, which can trigger a device mismatch on some setups. Use a Python float to keep computation on the tensor’s device and reduce overhead.
Apply:
```diff
-scores = scores / torch.sqrt(torch.tensor(dim_qk, dtype=scores.dtype))
+scores = scores / (dim_qk ** 0.5)
```
364-372: Fix boolean CLI flag parsing for --causal

argparse with type=bool is error-prone: `--causal False` still parses as True. Prefer store_true.

```diff
-parser.add_argument('--causal', type=bool, default=False, help='Causal flag')
+parser.add_argument('--causal', action='store_true', help='Causal flag')
```

If you also want an explicit opt-out flag, add:

```python
parser.add_argument('--no-causal', dest='causal', action='store_false', help='Disable causal flag')
```
304-330: Make example runs reproducible to reduce flaky close-checks

Random init without a fixed seed can make close-checks flaky across runs/devices.

Consider seeding once inside main before tensor inits:

```diff
 def main(BATCH: int = 1,
@@
-    Q = (
+    torch.manual_seed(0)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed_all(0)
+    Q = (
```

testing/python/language/test_tilelang_language_reshape.py (4)
23-24: Clarify/track the TODO about reshape + shared memory.

The note is helpful but vague for future contributors. Please reference (or open) a GitHub issue explaining why reshape cannot currently apply shared memory and what conditions would allow re-enabling optimizations, or convert this into a targeted xfail/skip with a reason.
I can draft the issue description with a minimal reproducer and link it here—want me to open it?
69-75: Mirror the same de-dup and alias consistency here.

Same suggestions as above: reuse a shared `RESHAPE_TEST_PASS_CONFIGS` and prefer `tl.PassConfigKey` for consistency with the `tl.compile` call. Apply at this call site as well:

```diff
-    pass_configs={
-        tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
-        tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
-    })
+    pass_configs=RESHAPE_TEST_PASS_CONFIGS)
```
67-69: Replicated TODO — consolidate with a single explanatory comment.

Since the same limitation applies in both 1D→2D reshape cases, consider consolidating into one well-explained comment near the helper or at the top of the file to avoid drift.
25-31: Factor out duplicated reshape pass configuration into a module-level constant

The inline pass_configs dict is duplicated in two tl.compile calls and uses the longer tilelang.PassConfigKey qualifier. Let's consolidate and alias to tl.PassConfigKey for consistency and maintainability.

Locations to update in testing/python/language/test_tilelang_language_reshape.py:
- Lines 25–31: first tl.compile call
- Lines 69–74: second tl.compile call

Proposed changes:

--- a/testing/python/language/test_tilelang_language_reshape.py
+++ b/testing/python/language/test_tilelang_language_reshape.py
@@ (near top of file, after imports)
+# Module-level reshape test pass configuration
+RESHAPE_TEST_PASS_CONFIGS = {
+    tl.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
+    tl.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
+}
@@ -25,7 +25,7 @@ def run_reshape_1d_2d(N, M, dtype):
-    jit_kernel = tl.compile(
-        program,
-        out_idx=-1,
-        pass_configs={
-            tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
-            tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
-        })
+    jit_kernel = tl.compile(program, out_idx=-1, pass_configs=RESHAPE_TEST_PASS_CONFIGS)
@@ -69,7 +69,7 @@ def run_reshape_2d_1d(N, M, dtype):
-    jit_kernel = tl.compile(
-        program,
-        out_idx=-1,
-        pass_configs={
-            tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
-            tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
-        })
+    jit_kernel = tl.compile(program, out_idx=-1, pass_configs=RESHAPE_TEST_PASS_CONFIGS)

This removes duplication, uses the tl alias consistently, and keeps the keys verified as present in pass_config.py.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- examples/flash_attention/example_gqa_bwd.py (1 hunks)
- testing/python/language/test_tilelang_language_reshape.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/language/test_tilelang_language_reshape.py (2)
tilelang/jit/__init__.py (1)
compile (32-81)
tilelang/transform/pass_config.py (1)
PassConfigKey(6-81)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-test-nvidia
- GitHub Check: build-test-amd
🔇 Additional comments (1)
examples/flash_attention/example_gqa_bwd.py (1)
342-346: Migrating to torch.testing.assert_close is correct and improves failure diagnostics

This replaces boolean asserts with structured checks that aren't stripped by Python's -O and yield clearer diffs on failure. Tolerances look reasonable for fp16. LGTM.
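A minimal sketch of the structured check with explicit fp16-friendly tolerances; unlike a bare `assert`, it survives `python -O` and reports mismatch statistics on failure:

```python
import torch

expected = torch.randn(8, 8, dtype=torch.float16)
actual = expected + 1e-3 * torch.randn_like(expected)

# Raises AssertionError with max abs/rel diff and mismatch count on failure.
torch.testing.assert_close(actual, expected, rtol=1e-2, atol=1e-2)
```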
…n options - Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations. - Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1.
- Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase. - No functional changes were made; this update focuses on code clarity and maintainability.
Actionable comments posted: 0
🧹 Nitpick comments (1)
examples/deepseek_nsa/example_tilelang_nsa_decode.py (1)
31-31: Fix typo in comment.

There's an extra 'a' character at the end of the comment.

- # Modified shapes for inference (q has seq_len=1)a
+ # Modified shapes for inference (q has seq_len=1)
📒 Files selected for processing (1)
- examples/deepseek_nsa/example_tilelang_nsa_decode.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/deepseek_nsa/example_tilelang_nsa_decode.py (2)
tilelang/jit/__init__.py (1)
jit (232-305)
tilelang/transform/pass_config.py (1)
PassConfigKey(6-81)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build-test-nvidia
🔇 Additional comments (1)
examples/deepseek_nsa/example_tilelang_nsa_decode.py (1)
12-17: LGTM! Configuration aligns with PR objectives.

The addition of pass_configs to disable TMA lowering and warp specialization is consistent with the broader refactoring described in the PR summary. This change allows fine-grained control over the TileLang compilation pipeline.
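For reference, a minimal sketch of the decorator pattern under review; the kernel body is elided, and out_idx/pass_configs follow the usage shown elsewhere in this PR (the kernel name here is hypothetical):

```python
import tilelang

# Per-kernel pass configuration: disable TMA lowering and warp specialization
# for this one compilation, without touching global settings.
@tilelang.jit(
    out_idx=-1,
    pass_configs={
        tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
        tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
    })
def nsa_decode_kernel(*args):
    ...  # builds and returns the T.prim_func in the real example
```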
- Added a method to check if printing is enabled during compilation, improving control over logging behavior. - Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output. - Added comments to clarify the purpose of changes and improve code readability.
Actionable comments posted: 3
🧹 Nitpick comments (4)
src/transform/warp_specialized_rewriter.cc (3)
1408-1408: Clarify the comment; avoid the ambiguous "aws".

The current comment is ambiguous and slightly ungrammatical.

- // return true means this aws will be disabled
+ // Returns true when auto warp specialization should be disabled.
1437-1438: Default state risks false negatives; prefer neutral default or drop assignment.

Initializing num_threads_is_divisible_by_warp_group_ to false in the ctor means "not divisible" until proven otherwise, which biases detection to disable. Either rely on the guard proposed above or initialize to true.

- num_threads_is_divisible_by_warp_group_ = false;
+ // Leave default or set to true to avoid accidental disable when extent is unknown.
+ num_threads_is_divisible_by_warp_group_ = true;
1482-1484: Hard-coded warp-group size and default value.
- Using a hard-coded 128 works for SM90 WGMMA, but consider sourcing from target features or a central constant to future-proof Hopper/Blackwell variants.
- Also consider default-initializing num_threads_is_divisible_by_warp_group_ to true to avoid accidental disables when no thread extent is seen (or rely on the guard above).

For example:

// e.g., query from target or centralize:
static constexpr int kWarpGroupSize = 128;

And then refer to it consistently.
tilelang/jit/kernel.py (1)
129-133: Tighten the completion log message and avoid redundant assert

Minor wording: "completes to compile kernel" reads awkwardly. Prefer "finished compiling kernel …". Also, you re-assert global_symbol here after doing so above; consider asserting once or caching func_name.

Apply this diff:

- logger.info(f"TileLang completes to compile kernel `{func_name}`")
+ logger.info(f"TileLang finished compiling kernel `{func_name}`")

Optional: compute should_log = env.is_print_on_compilation_enabled() once and reuse for both pre/post logs, and hoist func_name computation once when should_log is True. This removes duplication and a second assert.
📒 Files selected for processing (6)
- examples/deepseek_nsa/example_tilelang_nsa_decode.py (2 hunks)
- src/op/copy.cc (1 hunks)
- src/transform/warp_specialized_rewriter.cc (3 hunks)
- testing/python/kernel/test_tilelang_kernel_dequantize_gemm.py (0 hunks)
- tilelang/env.py (1 hunks)
- tilelang/jit/kernel.py (2 hunks)
💤 Files with no reviewable changes (1)
- testing/python/kernel/test_tilelang_kernel_dequantize_gemm.py
🚧 Files skipped from review as they are similar to previous changes (2)
- examples/deepseek_nsa/example_tilelang_nsa_decode.py
- src/op/copy.cc
🧰 Additional context used
🧬 Code graph analysis (2)
tilelang/jit/kernel.py (1)
tilelang/env.py (1)
is_print_on_compilation_enabled(235-236)
src/transform/warp_specialized_rewriter.cc (1)
src/transform/thread_storage_sync.cc (4)
iv (708-714), iv (708-709), iv (730-748), iv (730-731)
🔇 Additional comments (4)
src/transform/warp_specialized_rewriter.cc (2)
1420-1429: LGTM: preserve existing disable conditions and messages.

Manual warp specialization and the TMA+mbarrier coexistence checks remain intact with clear warnings. No issues here.
1409-1410: Remove unused skip_thread_partition parameter in Detect

The skip_thread_partition argument isn't referenced anywhere in Detect and no callers supply a value (all calls rely on the default). Removing it will eliminate confusion.

• In src/transform/warp_specialized_rewriter.cc, change

static bool Detect(Stmt stmt, bool skip_thread_partition = false) { WarpSpecializedDetector detector; … }

to

static bool Detect(Stmt stmt) { WarpSpecializedDetector detector; … }

• The single call site (at line 1494) remains correct:

bool warp_specialized = WarpSpecializedDetector::Detect(f->body);

tilelang/jit/kernel.py (2)
120-125: Good centralization of the print toggle

Switching to env.is_print_on_compilation_enabled() removes duplicated literal checks and keeps the policy in one place. LGTM.
120-125: Import of env correctly resolves to the singleton instance

I've confirmed that:
- tilelang/env.py defines is_print_on_compilation_enabled on the Environment class.
- tilelang/__init__.py re-exports the instance via from .env import env as env (lines 56–57).
- All call sites use from tilelang import env, so env refers to the instance, not the module.

No changes needed here.
if (!detector.num_threads_is_divisible_by_warp_group_) {
  LOG(WARNING)
      << "Auto warp specialization will be disabled because the number of "
         "threads"
      << detector.thread_var_->dom->extent
      << "is not divisible by warp group size";
  return true;
}
Guard against missing threadIdx.x; fix potential null deref and misleading disable.
If the function has no threadIdx.x binding, thread_var_ is undefined and dereferencing detector.thread_var_ in the warning will crash. Also, defaulting num_threads_is_divisible_by_warp_group_ to false causes spurious disable. Guard with defined(), use the existing skip_thread_partition parameter to allow bypass, and fix message formatting.
- if (!detector.num_threads_is_divisible_by_warp_group_) {
- LOG(WARNING)
- << "Auto warp specialization will be disabled because the number of "
- "threads"
- << detector.thread_var_->dom->extent
- << "is not divisible by warp group size";
- return true;
- }
+ if (!skip_thread_partition &&
+ detector.thread_var_.defined() &&
+ !detector.num_threads_is_divisible_by_warp_group_) {
+ LOG(WARNING)
+ << "Auto warp specialization will be disabled because threadIdx.x extent ("
+ << detector.thread_var_->dom->extent
+ << ") is not divisible by warp group size (" << detector.warp_group_size_ << ").";
+ return true;
+ }

📝 Committable suggestion
- if (!detector.num_threads_is_divisible_by_warp_group_) {
-   LOG(WARNING)
-       << "Auto warp specialization will be disabled because the number of "
-          "threads"
-       << detector.thread_var_->dom->extent
-       << "is not divisible by warp group size";
-   return true;
- }
+ if (!skip_thread_partition &&
+     detector.thread_var_.defined() &&
+     !detector.num_threads_is_divisible_by_warp_group_) {
+   LOG(WARNING)
+       << "Auto warp specialization will be disabled because threadIdx.x extent ("
+       << detector.thread_var_->dom->extent
+       << ") is not divisible by warp group size (" << detector.warp_group_size_ << ").";
+   return true;
+ }
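For reference, a standalone sketch of the divisibility condition being enforced; the 128 comes from the SM90 warp-group size noted in this review, and the helper name is ours:

```python
WARP_GROUP_SIZE = 128  # SM90 warp group = 4 warps x 32 threads

def auto_warp_specialization_allowed(num_threads: int) -> bool:
    # Mirrors the detector's check: disable auto-WS unless the block's
    # threadIdx.x extent is a whole number of warp groups.
    return num_threads % WARP_GROUP_SIZE == 0

assert auto_warp_specialization_allowed(384)      # 3 warp groups
assert not auto_warp_specialization_allowed(96)   # triggers the warning path
```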
def is_print_on_compilation_enabled(self) -> bool:
    return self.TILELANG_PRINT_ON_COMPILATION.lower() in ("1", "true", "yes", "on")
🛠️ Refactor suggestion
Harden boolean parsing; avoid AttributeError on non-string overrides
EnvVar.set allows any type, so .lower() on a non-str (e.g., env.TILELANG_PRINT_ON_COMPILATION = 0 in tests) will raise. Also, trimming whitespace improves robustness. Suggest coercing to str, stripping, and expanding accepted truthy tokens. A short docstring helps too.
Apply this diff:
- def is_print_on_compilation_enabled(self) -> bool:
- return self.TILELANG_PRINT_ON_COMPILATION.lower() in ("1", "true", "yes", "on")
+ def is_print_on_compilation_enabled(self) -> bool:
+ """Return True if TILELANG_PRINT_ON_COMPILATION is set to a truthy value."""
+ raw = self.TILELANG_PRINT_ON_COMPILATION
+ if raw is None:
+ return False
+ return str(raw).strip().lower() in ("1", "true", "t", "yes", "y", "on")Optional follow-up: define a module-level TRUTHY = {...} and reuse across the codebase.
📝 Committable suggestion
- def is_print_on_compilation_enabled(self) -> bool:
-     return self.TILELANG_PRINT_ON_COMPILATION.lower() in ("1", "true", "yes", "on")
+ def is_print_on_compilation_enabled(self) -> bool:
+     """Return True if TILELANG_PRINT_ON_COMPILATION is set to a truthy value."""
+     raw = self.TILELANG_PRINT_ON_COMPILATION
+     if raw is None:
+         return False
+     return str(raw).strip().lower() in ("1", "true", "t", "yes", "y", "on")
🤖 Prompt for AI Agents
In tilelang/env.py around lines 235 to 237, the boolean parsing uses .lower()
directly which can raise AttributeError if the env var is non-string; change the
method to coerce the value to str(), .strip(), then .lower() before membership
check against an expanded set of truthy tokens (e.g., "1","true","yes","on");
add a one-line docstring stating this method returns whether
print-on-compilation is enabled and mention that non-string values are coerced;
optionally factor the truthy set into a module-level TRUTHY constant for reuse.
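A standalone sketch of the hardened parser, with the truthy set factored out as the follow-up suggests:

```python
TRUTHY = {"1", "true", "t", "yes", "y", "on"}

def is_enabled(raw) -> bool:
    # Coerce to str first so non-string overrides (e.g., 0 or True) cannot raise.
    if raw is None:
        return False
    return str(raw).strip().lower() in TRUTHY

assert is_enabled(" TRUE ")
assert is_enabled(1)
assert not is_enabled("off")
assert not is_enabled(None)
```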
…eLang - Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management. - Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process. - Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management. - Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process.
- Improved readability by restructuring conditional checks and assertions in the test cases. - Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic. - Ensured consistent formatting and spacing throughout the test functions for better maintainability.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/transform/warp_specialized_rewriter.cc (2)
543-552: Fix WgMMA detection: incorrect default and update logic make HasWgMMA() always-true or brittle.

- has_wgmma_ defaults to true, so functions without any GEMM calls are incorrectly treated as having WgMMA.
- The update short-circuits unless has_wgmma_ is already true, preventing flipping from false→true.

Apply this diff:

  void VisitExpr_(const CallNode *op) final {
    if (op->op.same_as(tl_gemm()) || op->op.same_as(tl_gemm_sp())) {
      auto op_name = std::string(op->args[0].as<StringImmNode>()->value);
-     if (has_wgmma_) {
-       has_wgmma_ =
-           op_name.find("false") == std::string::npos && !in_if_scope_;
-     }
+     has_wgmma_ = has_wgmma_ ||
+                  (op_name.find("false") == std::string::npos && !in_if_scope_);
    }
    StmtExprVisitor::VisitExpr_(op);
  }

- bool has_wgmma_{true};
+ bool has_wgmma_{false};

Also applies to: 569-570
1380-1385: Do not classify set_max_nreg as a TMA op; this spuriously disables auto-WS.

VisitExpr_ sets has_tma_op_ = true on set_max_nreg(). That's unrelated to TMA; combined with any mbarrier op it will incorrectly disable auto warp specialization.

Apply this diff:

- if (op->op.same_as(tma_load()) || op->op.same_as(tma_load_im2col()) ||
-     op->op.same_as(set_max_nreg())) {
+ if (op->op.same_as(tma_load()) || op->op.same_as(tma_load_im2col())) {
    has_tma_op_ = true;
  }
♻️ Duplicate comments (1)
src/transform/warp_specialized_rewriter.cc (1)
1393-1399: Don't ICHECK constant thread extent; allow symbolic extents.

Re-raising prior feedback: ICHECK(iv->dom->extent.as<IntImmNode>()) aborts on unknown/symbolic extents. This was previously flagged; please guard instead of crashing.

Apply this diff:

- if (iv->thread_tag == "threadIdx.x") {
-   ICHECK(iv->dom->extent.as<IntImmNode>());
-   thread_var_ = iv;
- }
+ if (iv->thread_tag == "threadIdx.x") {
+   thread_var_ = iv;
+   // Extent may be symbolic; avoid crashing here. Use analyzer if needed elsewhere.
+ }
🧹 Nitpick comments (16)
tilelang/transform/__init__.py (1)
192-205: Nice, consistent Python wrapper and helpful docstring. Consider a defensive fallback like LowerHopperIntrin.

If the backend isn't compiled with this pass (dev builds, feature flags), the current wrapper will raise at import/use-time. For parity with LowerHopperIntrin(), you can add a soft fallback to a no-op to improve resilience of mixed environments.

Apply this diff:

 def AnnotateWarpGroupRegAlloc():
     """Inject set_max_nreg calls into warp-specialized functions.
 @@
-    return _ffi_api.AnnotateWarpGroupRegAlloc()  # type: ignore
+    return (_ffi_api.AnnotateWarpGroupRegAlloc()
+            if hasattr(_ffi_api, "AnnotateWarpGroupRegAlloc")
+            else (lambda f: f))  # type: ignore

src/op/builtin.h (3)
18-20: Attr key string is unconventional; consider a stable, non-"k" value.

Most keys here are semantic strings (e.g., "padding_map", "tl.dynamic_alignment"). Using "kWarpSpecializationScope" (the variable-like name) as the value is surprising and easy to mistype. Consider "warp_specialization_scope" or "tl.warp_specialization_scope" for long-term consistency. This would require updating all producers/tests.
59-66: New cuTensorMapType() API: ensure header includes DataType definition.

If not already pulled in indirectly, add the explicit include to avoid ODR/portability issues on some toolchains.
Add this include near the other headers:
#include <tvm/runtime/data_type.h>
156-163: Doc comment for stmatrix references the wrong intrinsic.

The comment under "stmatrix" still says "ptx_ldmatrix(...)", which is misleading.

Apply this diff:

  * \brief tvm intrinsics for stmatrix
 @@
- * ptx_ldmatrix(transposed, num, shared_addr, int32_values...)
+ * ptx_stmatrix(transposed, num, shared_addr, int32_values...)
31-33: Attr key string matches the new constant, but consider centralizing to avoid typos.
"kWarpSpecializationScope"is a raw string here; a Python-level alias (e.g., T.kWarpSpecializationScope or T.attr_keys.warp_specialization_scope) would reduce drift if the C++ key changes.
115-120: Mirror the fix in the “no_set_max_nreg” test when collecting arg values (if you add value checks later).Currently you only assert the absence of calls, which is fine. If you later add negative assertions on values, reuse the same
.args[0]extraction pattern to avoid subscript issues.Also applies to: 130-136
1-8: Minor: module aliasing is a bit inconsistent but harmless.You import both
tilelang as tl and import tilelang.testing while calling tilelang.disable_cache() and tilelang.testing.main(). Either call via tl or consistently through tilelang; purely stylistic.

Also applies to: 140-142
src/transform/warp_specialized_rewriter.cc (3)
1345-1359: Unused parameter; remove skip_thread_partition from Detect().

skip_thread_partition is no longer used after the recent refactor. Keep the API surface lean to avoid confusion.

Apply this diff:

- static bool Detect(Stmt stmt, bool skip_thread_partition = false) {
+ static bool Detect(Stmt stmt) {

Also ensure the sole call site remains WarpSpecializedDetector::Detect(f->body); (it already matches).
thread_var_is assigned but never used byWarpSpecializedDetectoranymore. Drop it to reduce cognitive load.Apply this diff:
- IterVar thread_var_;
1417-1423: Clarify variable naming (boolean inversion).
bool warp_specialized = WarpSpecializedDetector::Detect(f->body);returns “disable auto WS?”; the name reads as the opposite. Consider renaming todisable_auto_ws(or flip the boolean logic) for readability.src/transform/annotate_warp_group_reg_alloc.cc (6)
35-41: Defensive parsing: validate IntImm args or gracefully skip.
call->args[0/1].as<IntImmNode>()may be null if a user passes non-constant values; current code will segfault. Validate and log/skip instead of crashing.Apply this diff:
- int reg_hint = call->args[0].as<IntImmNode>()->value; - int is_inc = call->args[1].as<IntImmNode>()->value; + const auto* reg_imm = call->args[0].as<IntImmNode>(); + const auto* inc_imm = call->args[1].as<IntImmNode>(); + if (!reg_imm || !inc_imm) { + // Ignore non-constant hints for now; optionally warn in debug builds. + return; + } + int reg_hint = reg_imm->value; + int is_inc = inc_imm->value;
78-91: Drop unused thread extent rewrite path from this pass.This pass doesn’t mutate thread extent and never sets
need_update_thread_extent_. The extra code and state can be removed.Apply this diff:
- if (op->attr_key == tir::attr::thread_extent && - Downcast<IterVar>(op->node)->thread_tag == "threadIdx.x") { - thread_iv_ = Downcast<IterVar>(op->node); - need_update_thread_extent_ = false; - AttrStmt attr_stmt = Downcast<AttrStmt>(StmtExprMutator::VisitStmt_(op)); - if (need_update_thread_extent_) { - thread_iv_.CopyOnWrite()->dom = {0, updated_thread_extent_.value()}; - attr_stmt.CopyOnWrite()->node = thread_iv_; - attr_stmt.CopyOnWrite()->value = updated_thread_extent_.value(); - } - thread_iv_ = {}; - return attr_stmt; - } else + if (op->attr_key == attr::kWarpSpecializationScope) { ... - } else { + } else { return StmtExprMutator::VisitStmt_(op); }And remove the unused fields:
- IterVar thread_iv_; - Optional<PrimExpr> updated_thread_extent_; - bool need_update_thread_extent_ = false;
103-116: Inject hints per-side; don’t require both inc and dec to be present.If only one hint is provided, the other path should still be annotated. Also, prefer
Integer(...)for clarity of types.Apply this diff:
- if (dec_reg >= 0 && inc_reg >= 0 && !has_simt_copy) { - inc_reg_stmt = Evaluate(Call(DataType::Handle(), set_max_nreg(), - {inc_reg == 0 ? 240 : inc_reg, 1})); - dec_reg_stmt = Evaluate(Call(DataType::Handle(), set_max_nreg(), - {dec_reg == 0 ? 24 : dec_reg, 0})); - } + if (!has_simt_copy) { + if (inc_reg >= 0) { + inc_reg_stmt = Evaluate(Call( + DataType::Handle(), set_max_nreg(), + {Integer(inc_reg == 0 ? 240 : inc_reg), Integer(1)})); + } + if (dec_reg >= 0) { + dec_reg_stmt = Evaluate(Call( + DataType::Handle(), set_max_nreg(), + {Integer(dec_reg == 0 ? 24 : dec_reg), Integer(0)})); + } + }
51-53: Document sentinel semantics.Using
0 to mean "no user hint; clamp to 240/24" is fine; add a brief comment here to make it explicit for future maintainers.
9-11: Remove unused include.
<unordered_set>isn’t used.Apply this diff:
-#include <unordered_set>
111-116: Optional: warn when hints are disabled due to SIMT copy.If/when SIMT detection is added here, consider logging a
VLOG(1) or DLOG(WARNING) to aid debugging.
📒 Files selected for processing (7)
- src/op/builtin.h (3 hunks)
- src/transform/annotate_warp_group_reg_alloc.cc (1 hunks)
- src/transform/warp_specialized_rewriter.cc (2 hunks)
- testing/python/transform/test_tilelang_transform_inject_set_max_nreg.py (1 hunks)
- tilelang/engine/phase.py (1 hunks)
- tilelang/language/builtin.py (1 hunks)
- tilelang/transform/__init__.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
tilelang/engine/phase.py (2)
src/transform/annotate_warp_group_reg_alloc.cc (2)
AnnotateWarpGroupRegAlloc (147-152), AnnotateWarpGroupRegAlloc (147-147)
tilelang/transform/__init__.py (1)
AnnotateWarpGroupRegAlloc(192-204)
tilelang/transform/__init__.py (1)
src/transform/annotate_warp_group_reg_alloc.cc (2)
AnnotateWarpGroupRegAlloc (147-152), AnnotateWarpGroupRegAlloc (147-147)
src/transform/annotate_warp_group_reg_alloc.cc (3)
src/transform/warp_specialized_rewriter.cc (22)
f (1155-1174), f (1155-1156), op (44-49), op (44-44), op (79-89), op (79-79), op (91-99), op (91-91), op (101-106), op (101-101), op (108-116), op (108-108), op (140-152), op (140-140), op (154-183), op (154-154), op (185-195), op (185-185), call (37-42), call (37-37), has_simt_copy (275-275), has_simt_copy (279-279)
tilelang/language/builtin.py (2)
set_max_nreg (115-130), no_set_max_nreg (157-160)
tilelang/transform/__init__.py (1)
AnnotateWarpGroupRegAlloc(192-204)
testing/python/transform/test_tilelang_transform_inject_set_max_nreg.py (4)
tilelang/language/ast/ir.py (6)
launch_thread (1208-1240), block (342-358), reads (390-407), writes (410-427), alloc_buffer (441-508), evaluate (1319-1331)
tilelang/language/builtin.py (8)
annotate_producer_reg_dealloc (145-148), annotate_consumer_reg_alloc (151-154), create_list_of_mbarrier (11-40), mbarrier_wait_parity (163-210), get_mbarrier (43-52), tma_load (67-76), create_tma_descriptor (55-64), no_set_max_nreg (157-160)
src/transform/annotate_warp_group_reg_alloc.cc (8)
AnnotateWarpGroupRegAlloc (147-152), AnnotateWarpGroupRegAlloc (147-147), op (32-49), op (32-32), op (66-76), op (66-66), op (78-137), op (78-78)
tilelang/transform/__init__.py (2)
AnnotateWarpGroupRegAlloc (192-204), LowerOpaqueBlock (406-409)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build-test-nvidia
🔇 Additional comments (6)
tilelang/engine/phase.py (1)
104-104: Pass placement LGTM; runs after WarpSpecialized/InjectTmaBarrier and before LowerOpaqueBlock.

This is the right window to see kWarpSpecializationScope and inject set_max_nreg before opaque lowering. No ordering concerns from me.
src/op/builtin.h (2)
151-155: Fixed typo: ptx_ldmatrix symbol name.

Renaming ptx_ldmatirx → ptx_ldmatrix is correct and aligns with CUDA PTX. Thanks for cleaning this up.
59-66: No stale ptx_ldmatirx references found
rg -nP 'ptx_ldmatirx' -C2) across the repository and it returned no matches, confirming that all occurrences of the old, misspelled symbol have been removed.src/transform/annotate_warp_group_reg_alloc.cc (2)
67-76: OK to strip original hints here.

Removing original
set_max_nreg/no_set_max_nreg calls at parse time and re-injecting later is consistent with the new design.
147-152: Pass name and registration LGTM.

Registering as
tl.AnnotateWarpGroupRegAlloc at opt-level 0 is consistent with the pipeline.
1345-1359: Behavioral check: disabling auto-WS when both TMA and mbarrier are present.

Confirm this matches intended policy. Some kernels legitimately use both (TMA producer, mbarrier consumer). If the goal is to disable only when there is pre-existing mbarrier choreography (i.e., manual sync), consider tightening the condition to detect "manual" mbarrier usage rather than any mbarrier op.
Would you like a repository-wide scan to identify kernels that would flip behavior due to this change?
Stmt VisitStmt_(const EvaluateNode *op) final {
  if (const CallNode *call = op->value.as<CallNode>()) {
    if (call->op.same_as(set_max_nreg()) ||
        call->op.same_as(no_set_max_nreg())) {
      // Remove the original set_max_nreg calls as they will be re-inserted
      // at appropriate locations
      return Evaluate(0);
    }
  }
  return StmtExprMutator::VisitStmt_(op);
}
💡 Verification agent
🧩 Analysis chain
Semantic change risk: stripping set_max_nreg outside WS scope.
This pass removes all original set_max_nreg/no_set_max_nreg calls, even if they occur in functions without kWarpSpecializationScope. That may silently drop user-intended hints.
Here is a script to audit current call sites and their context:
🏁 Script executed:
#!/bin/bash
# List all set_max_nreg / no_set_max_nreg call sites with 3 lines of context.
rg -n -C3 -e 'set_max_nreg\s*\(' -e 'no_set_max_nreg\s*\('
# Heuristic: show files that contain kWarpSpecializationScope or warp_specialize attr.
rg -n -C2 -e 'kWarpSpecializationScope' -e 'warp_specialize'Length of output: 48970
Restrict removal of set_max_nreg/no_set_max_nreg calls to within WarpSpecializationScope
The current mutator unconditionally strips all set_max_nreg and no_set_max_nreg calls—regardless of whether they appear inside a kWarpSpecializationScope block—silently dropping user hints in non–warp‐specialized code. To avoid this semantic break:
• In src/transform/annotate_warp_group_reg_alloc.cc, update the VisitStmt_(const EvaluateNode*) override (around lines 66–76) to only remove calls when inside the warp‐specialization scope. For example, guard on a flag set when entering AttrStmt(attr::kWarpSpecializationScope):
- Stmt VisitStmt_(const EvaluateNode *op) final {
- if (const CallNode *call = op->value.as<CallNode>()) {
- if (call->op.same_as(set_max_nreg()) ||
- call->op.same_as(no_set_max_nreg())) {
- // Remove the original set_max_nreg calls as they will be re-inserted
- // at appropriate locations
- return Evaluate(0);
- }
- }
- return StmtExprMutator::VisitStmt_(op);
- }
+ Stmt VisitStmt_(const EvaluateNode *op) final {
+ if (in_warp_specialization_scope_ &&
+ op->value.as<CallNode>() &&
+ (call->op.same_as(set_max_nreg()) ||
+ call->op.same_as(no_set_max_nreg()))) {
+ // Remove only within warp‐specialization blocks; other hints must be preserved
+ return Evaluate(0);
+ }
+ return StmtExprMutator::VisitStmt_(op);
+ }• Ensure in_warp_specialization_scope_ (or an equivalent boolean) is set to true when visiting AttrStmtNode with key attr::kWarpSpecializationScope and reset upon exiting.
This change confines hint stripping to the intended region and preserves user‐supplied register hints elsewhere.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/transform/annotate_warp_group_reg_alloc.cc around lines 66 to 76, the
Evaluate visitor currently strips all set_max_nreg/no_set_max_nreg calls
unconditionally; limit this removal to only when inside a
WarpSpecializationScope by adding a boolean member (e.g.
in_warp_specialization_scope_) initialized false, set it to true on entering
AttrStmt(attr::kWarpSpecializationScope) and restore it when exiting (push/pop
or save/restore around recursive visit), and change VisitStmt_(const
EvaluateNode*) to only return Evaluate(0) for those calls when
in_warp_specialization_scope_ is true so hints outside the scope are preserved.
| Array<IntImm> ws_partition = {Downcast<IntImm>(producer_thread_extent), | ||
| Downcast<IntImm>(consumer_thread_extent)}; | ||
| body = AttrStmt(ws_partition, "kWarpSpecializationScope", 0, body); | ||
| body = AttrStmt(ws_partition, attr::kWarpSpecializationScope, 0, body); | ||
|
|
🛠️ Refactor suggestion
Avoid crashing on non-constant thread extents; store partition as PrimExpr instead of IntImm.
Downcasting producer_thread_extent / consumer_thread_extent to IntImm will ICHECK-crash when extents are symbolic. The attribute only needs to carry the values; it can hold PrimExprs directly.
Apply this diff:
- Array<IntImm> ws_partition = {Downcast<IntImm>(producer_thread_extent),
- Downcast<IntImm>(consumer_thread_extent)};
- body = AttrStmt(ws_partition, attr::kWarpSpecializationScope, 0, body);
+ Array<PrimExpr> ws_partition = {producer_thread_extent,
+ consumer_thread_extent};
+ body = AttrStmt(ws_partition, attr::kWarpSpecializationScope, 0, body);📝 Committable suggestion
| Array<IntImm> ws_partition = {Downcast<IntImm>(producer_thread_extent), | |
| Downcast<IntImm>(consumer_thread_extent)}; | |
| body = AttrStmt(ws_partition, "kWarpSpecializationScope", 0, body); | |
| body = AttrStmt(ws_partition, attr::kWarpSpecializationScope, 0, body); | |
| Array<PrimExpr> ws_partition = {producer_thread_extent, | |
| consumer_thread_extent}; | |
| body = AttrStmt(ws_partition, attr::kWarpSpecializationScope, 0, body); |
🤖 Prompt for AI Agents
In src/transform/warp_specialized_rewriter.cc around lines 1320 to 1323, the
code downcasts producer_thread_extent and consumer_thread_extent to IntImm which
will ICHECK-fail for symbolic extents; instead build the partition array from
the PrimExprs directly (do not Downcast to IntImm) and pass that PrimExpr array
into the AttrStmt so the attribute can carry symbolic extents without crashing.
Ensure the variable type is changed from Array<IntImm> to Array<PrimExpr> (or
the appropriate PrimExpr container used in this file) and remove the Downcast
calls when constructing ws_partition.
def collect_set_max_nreg(stmt):
    if (isinstance(stmt, tvm.tir.Evaluate) and hasattr(stmt.value, 'op') and
            hasattr(stmt.value.op, 'name') and stmt.value.op.name == "tl.set_max_nreg"):
        set_max_nreg_calls.append(stmt.value)

tvm.tir.stmt_functor.post_order_visit(main_func.body, collect_set_max_nreg)
Bug: tvm.tir.Call is not subscriptable; use .args[0] and extract .value.
The test currently does reg_values = [call[0] for call in set_max_nreg_calls], which will fail at runtime. Use call.args[0] and unwrap IntImm to Python ints.
Apply this diff:
- # Check that we have the expected register values
- reg_values = [call[0] for call in set_max_nreg_calls]
+ # Check that we have the expected register values
+ reg_values = []
+ for call in set_max_nreg_calls:
+ arg0 = call.args[0]
+ if isinstance(arg0, tvm.tir.IntImm):
+ reg_values.append(int(arg0.value))
+ else:
+ reg_values.append(arg0)Also applies to: 83-86
🤖 Prompt for AI Agents
In testing/python/transform/test_tilelang_transform_inject_set_max_nreg.py
around lines 71-76 and 83-86, the test treats tvm.tir.Call objects as
subscriptable (call[0]) which raises at runtime; update the code to access the
first argument via call.args[0] and then unwrap the IntImm to a Python int
(e.g., extract .value and cast to int) when building reg_values so the test
collects integer register values correctly.
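A self-contained sketch of the corrected extraction, mirroring the diff above (the helper name is ours):

```python
import tvm

def extract_reg_values(set_max_nreg_calls):
    # tvm.tir.Call has no __getitem__; read the first argument via .args[0]
    # and unwrap IntImm nodes to plain Python ints.
    reg_values = []
    for call in set_max_nreg_calls:
        arg0 = call.args[0]
        reg_values.append(int(arg0.value) if isinstance(arg0, tvm.tir.IntImm) else arg0)
    return reg_values
```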
def annotate_producer_reg_dealloc(reg_count: int = 24):
    """Annotate the producer reg dealloc.
    """
    return dec_max_nreg(reg_count)
🛠️ Refactor suggestion
Guard invalid reg_count and document allowed range [24, 240].
The C++ collector asserts 24 ≤ reg_hint ≤ 240. Add an early Python-side check to fail fast with a clear error and document the bounds in the wrapper’s docstring.
Apply this diff:
-def annotate_producer_reg_dealloc(reg_count: int = 24):
- """Annotate the producer reg dealloc.
- """
- return dec_max_nreg(reg_count)
+def annotate_producer_reg_dealloc(reg_count: int = 24):
+ """Annotate producer branch to deallocate registers (decrement).
+
+ Parameters
+ ----------
+ reg_count : int, default 24
+ Requested max register count. Must be in [24, 240].
+ """
+ if not (24 <= reg_count <= 240):
+ raise ValueError(f"reg_count must be in [24, 240], got {reg_count}")
+ return dec_max_nreg(reg_count)

📝 Committable suggestion
- def annotate_producer_reg_dealloc(reg_count: int = 24):
-     """Annotate the producer reg dealloc.
-     """
-     return dec_max_nreg(reg_count)
+ def annotate_producer_reg_dealloc(reg_count: int = 24):
+     """Annotate producer branch to deallocate registers (decrement).
+
+     Parameters
+     ----------
+     reg_count : int, default 24
+         Requested max register count. Must be in [24, 240].
+     """
+     if not (24 <= reg_count <= 240):
+         raise ValueError(f"reg_count must be in [24, 240], got {reg_count}")
+     return dec_max_nreg(reg_count)
🤖 Prompt for AI Agents
In tilelang/language/builtin.py around lines 145 to 149, the wrapper
annotate_producer_reg_dealloc lacks validation and documentation for reg_count
bounds required by the C++ collector; update the docstring to state the allowed
range [24, 240] and add an early guard that checks reg_count is an int within
24..240, raising a ValueError with a clear message if not, before returning
dec_max_nreg(reg_count).
def annotate_consumer_reg_alloc(reg_count: int = 240):
    """Annotate the consumer reg alloc.
    """
    return inc_max_nreg(reg_count)
🛠️ Refactor suggestion
Mirror the bounds check for the consumer-side alias.
Same rationale as producer wrapper; keep behavior symmetric and self-documented.
Apply this diff:
-def annotate_consumer_reg_alloc(reg_count: int = 240):
- """Annotate the consumer reg alloc.
- """
- return inc_max_nreg(reg_count)
+def annotate_consumer_reg_alloc(reg_count: int = 240):
+ """Annotate consumer branch to allocate more registers (increment).
+
+ Parameters
+ ----------
+ reg_count : int, default 240
+ Requested max register count. Must be in [24, 240].
+ """
+ if not (24 <= reg_count <= 240):
+ raise ValueError(f"reg_count must be in [24, 240], got {reg_count}")
+ return inc_max_nreg(reg_count)

📝 Committable suggestion
- def annotate_consumer_reg_alloc(reg_count: int = 240):
-     """Annotate the consumer reg alloc.
-     """
-     return inc_max_nreg(reg_count)
+ def annotate_consumer_reg_alloc(reg_count: int = 240):
+     """Annotate consumer branch to allocate more registers (increment).
+
+     Parameters
+     ----------
+     reg_count : int, default 240
+         Requested max register count. Must be in [24, 240].
+     """
+     if not (24 <= reg_count <= 240):
+         raise ValueError(f"reg_count must be in [24, 240], got {reg_count}")
+     return inc_max_nreg(reg_count)
🤖 Prompt for AI Agents
In tilelang/language/builtin.py around lines 151–155, the consumer-side wrapper
annotate_consumer_reg_alloc currently just returns inc_max_nreg(reg_count)
without mirroring the producer-side bounds validation; update this function to
replicate the exact bounds/type checks and error message used by the producer
wrapper (validate reg_count is an int and within the same min/max limits, raise
the same ValueError if invalid), then call inc_max_nreg(reg_count) so consumer
behavior is symmetric and self-documented.
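A standalone sketch of the shared guard both wrappers could delegate to (helper name is hypothetical), keeping the producer and consumer sides symmetric:

```python
def _check_reg_count(reg_count: int) -> int:
    # Same bounds the C++ collector asserts: 24 <= reg_hint <= 240.
    if not (24 <= reg_count <= 240):
        raise ValueError(f"reg_count must be in [24, 240], got {reg_count}")
    return reg_count

assert _check_reg_count(240) == 240
try:
    _check_reg_count(16)
except ValueError as exc:
    assert "[24, 240]" in str(exc)
```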
- Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options. - Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations. - Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging.
Actionable comments posted: 2
♻️ Duplicate comments (7)
src/op/copy.cc (7)
394-397: Fix incorrect comment for bulk store memory scopes.

The code correctly checks src ∈ {shared, shared.dyn} and dst ∈ {global}, but the comment says "shared.dyn and local.fragment." Update the comment to avoid confusion.

Apply this diff:

- // 2. src and dst must be shared.dyn and local.fragment
+ // 2. src must be shared/shared.dyn and dst must be global
582-585: Replace "TMA ldmatrix/stmatrix" wording with "PTX ldmatrix/stmatrix".

LDSM/STSM are PTX instructions, not TMA. The messages are misleading.

Apply this diff:

- // TMA ldmatrix/stmatrix cannot support non-1-d layout, will be fallback to
+ // PTX ldmatrix/stmatrix cannot support non-1-d layout, will fall back to
- // TMA ldmatrix/stmatrix cannot support non-8x8 layout, will be fallback to
+ // PTX ldmatrix/stmatrix cannot support non-8x8 layout, will fall back to
- // TMA ldmatrix/stmatrix cannot support non-16 bytes continuous layout, will
- // be fallback to normal copy
+ // PTX ldmatrix/stmatrix requires 16-byte (8x b16) contiguous layout; fall
+ // back to normal copy otherwise

Also applies to: 623-626, 639-642
753-759: Log message mischaracterizes the constraint as a "non-swizzled global layout."

The condition is "any layout on the global tensor," and swizzling concerns shared memory layouts. Update the message for accuracy.

Apply this diff:

- // TMA bulk copy cannot support a non-swizzled global layout, will be fallback
- // to normal copy
+ // TMA bulk copy does not support a layout annotation on the global tensor;
+ // fall back to normal copy
  if (T.layout_map.count(global_tensor)) {
-   LOG(WARNING) << "TMA bulk copy cannot support a non-swizzled global "
-                   "layout, fallback to normal copy.";
+   LOG(WARNING) << "TMA bulk copy does not support a layout on the global "
+                   "tensor; falling back to normal copy.";
    return LowerNormalCopy(T, analyzer);
  }
951-955: Clarify: It's the swizzled shared-memory layout constraint, not "swizzled global layout."

The warning should reference shared memory.

Apply this diff:

- LOG(WARNING) << "TMA bulk copy cannot support a swizzled global layout "
+ LOG(WARNING) << "TMA bulk copy cannot support a swizzled shared memory layout "
                 "with inner_box_dim_ > "
              << check.max_dim << ", will be fallback to normal copy";
239-242: Scalar-copy path ignores buffer region minima; use MakeIndices() instead of hardcoded {0}.

Current scalar path uses BufferLoad(src, {0})/BufferStore(dst, {0}) and drops non-zero Range::min, producing wrong addresses for sliced buffers.

Apply this diff:

- if (is_scalar) {
-   return For(Var("i"), 0, 1, ForKind::kSerial,
-              BufferStore(dst, BufferLoad(src, {0}), {0}));
- }
+ if (is_scalar) {
+   Array<PrimExpr> src_indices = MakeIndices({}, 0);
+   Array<PrimExpr> dst_indices = MakeIndices({}, 1);
+   return For(Var("i"), 0, 1, ForKind::kSerial,
+              BufferStore(dst, BufferLoad(src, src_indices), dst_indices));
+ }
644-651: Local-range check uses dst_range unconditionally; for STSM the "local" tensor is src.

This causes incorrect fallbacks or missed invalid ranges. Use the appropriate range based on copy_inst.

Apply this diff:

- // Can only support local_range to be a full range
- for (size_t i = 0; i < dst_range.size(); i++) {
-   if (!is_zero(dst_range[i]->min) ||
-       !analyzer->CanProveEqual(dst_range[i]->extent, dst->shape[i]))
+ // Can only support local_range to be a full range
+ const auto& local_range = is_ldmatrix ? dst_range : src_range;
+ const auto& local_buf = is_ldmatrix ? dst : src;
+ for (size_t i = 0; i < local_range.size(); i++) {
+   if (!is_zero(local_range[i]->min) ||
+       !analyzer->CanProveEqual(local_range[i]->extent, local_buf->shape[i]))
      // PTX ldmatrix/stmatrix cannot support non-full range, will be fallback
      // to normal copy
      return LowerNormalCopy(T, analyzer);
  }
1224-1229: tl.copy operator arity mismatch: set_num_inputs should be 5 (includes eviction_policy).

tilelang.language.copy passes (src, dst, coalesced_width, disable_tma, eviction_policy). Current registration as 4 will assert at runtime.

Apply this diff:

  TIR_REGISTER_TL_OP(Copy, copy)
-     .set_num_inputs(4)
+     .set_num_inputs(5)
      .set_attr<TCallEffectKind>("TCallEffectKind",
                                 Integer(CallEffectKind::kOpaque));

Run this to validate call sites vs. registration:

#!/bin/bash
# Expect python wrappers (if any) to pass 5 args for tl.copy; list occurrences.
rg -nP --type=py -C2 'Op\.get\(\s*["\']tl\.copy["\']\s*\)' || true
# Sanity: show operator registration lines
rg -n 'TIR_REGISTER_TL_OP\(Copy, copy\)|set_num_inputs' src || true
🧹 Nitpick comments (3)
src/op/copy.cc (1)
799-800: De-duplicate CUDA datatype mapping: use cuTensorMapType(dtype) instead of a local to_CUtensorMapDataType.

There's already a public helper (declared in src/op/builtin.h). Keeping two mappings risks divergence for new types (e.g., FP8 variants).

Apply this diff:

-// Helper to map TVM's DataType to CUDA's CUtensorMapDataType enum value.
-static int to_CUtensorMapDataType(DataType dtype) {
-  ...
-}
+// Use cuTensorMapType(dtype) declared in builtin.h for CUDA tensor-map dtype mapping.

And replace call sites:

- desc.data_type = to_CUtensorMapDataType(global_tensor->dtype);
+ desc.data_type = cuTensorMapType(global_tensor->dtype);
- desc.data_type = to_CUtensorMapDataType(src->dtype);
+ desc.data_type = cuTensorMapType(src->dtype);

Also applies to: 1078-1079, 32-98
examples/gdn/example_wy_fast.py (1)
110-110: API migration looks good; document the intent for future readers.

Unconditionally disabling warp-group reg allocation here matches the new API and likely reflects a performance/occupancy trade-off for this kernel. Add a brief comment so the choice is clear to future maintainers.

Apply:

- T.disable_warp_group_reg_alloc()
+ # Disable warp-group reg alloc for this kernel to avoid maxnreg-side occupancy drops.
+ T.disable_warp_group_reg_alloc()

examples/gdn/example_chunk_o.py (1)
125-125: Consistent migration to disable_warp_group_reg_alloc; add a one-liner comment.

This aligns with the new warp-group reg-alloc controls. A short rationale comment will help keep examples consistent and self-explanatory.

Apply:

- T.disable_warp_group_reg_alloc()
+ # Disable warp-group reg alloc to keep occupancy healthy for this kernel configuration.
+ T.disable_warp_group_reg_alloc()
📒 Files selected for processing (10)
- benchmark/matmul/benchmark_matmul_sp.py (1 hunks)
- examples/deepseek_mla/experimental/example_mla_decode_kv_fp8.py (1 hunks)
- examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py (1 hunks)
- examples/gdn/example_chunk_o.py (1 hunks)
- examples/gdn/example_chunk_scaled_dot_kkt.py (1 hunks)
- examples/gdn/example_wy_fast.py (1 hunks)
- src/op/copy.cc (1 hunks)
- testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp.py (1 hunks)
- testing/python/transform/test_tilelang_transform_inject_set_max_nreg.py (1 hunks)
- tilelang/language/builtin.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- testing/python/transform/test_tilelang_transform_inject_set_max_nreg.py
- tilelang/language/builtin.py
🧰 Additional context used
🧬 Code graph analysis (8)
examples/gdn/example_chunk_o.py (1)
tilelang/language/builtin.py (1)
disable_warp_group_reg_alloc(163-166)
examples/deepseek_mla/experimental/example_mla_decode_kv_fp8.py (1)
tilelang/language/builtin.py (1)
disable_warp_group_reg_alloc(163-166)
examples/gdn/example_chunk_scaled_dot_kkt.py (1)
tilelang/language/builtin.py (1)
disable_warp_group_reg_alloc(163-166)
benchmark/matmul/benchmark_matmul_sp.py (1)
tilelang/language/builtin.py (1)
disable_warp_group_reg_alloc(163-166)
examples/gdn/example_wy_fast.py (1)
tilelang/language/builtin.py (1)
disable_warp_group_reg_alloc(163-166)
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp.py (1)
tilelang/language/builtin.py (1)
disable_warp_group_reg_alloc(163-166)
examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py (1)
tilelang/language/builtin.py (1)
disable_warp_group_reg_alloc(163-166)
src/op/copy.cc (5)
src/op/copy.h (2)
Copy (86-283), Conv2DIm2ColOp (246-281)
src/transform/lower_hopper_intrin.cc (2)
call (102-132), call (102-102)
src/target/utils.cc (6)
TargetHasBulkCopy (107-112), TargetHasBulkCopy (107-107), TargetHasLdmatrix (93-98), TargetHasLdmatrix (93-93), TargetHasStmatrix (100-105), TargetHasStmatrix (100-100)
tilelang/language/builtin.py (2)
create_tma_descriptor (55-64), tma_load (67-76)
src/tl_templates/cuda/copy_sm90.h (1)
tma_load_im2col(154-172)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-amd
- GitHub Check: build-test-nvidia
- GitHub Check: bot-task
🔇 Additional comments (6)
src/op/copy.cc (1)
367-369: Nice: bulk load check preserves support for both shared and shared.dyn scopes.

The regression noted earlier is addressed; allowing "shared" in addition to "shared.dyn" avoids breaking statically allocated smem paths.
examples/deepseek_mla/experimental/example_mla_decode_kv_fp8.py (1)
55-55: Good rename: prefer T.disable_warp_group_reg_alloc() over T.no_set_max_nreg().

Aligns with the new public API and the AnnotateWarpGroupRegAlloc pass.
examples/gdn/example_chunk_scaled_dot_kkt.py (1)
104-104: Good API transition to disable_warp_group_reg_alloc().

Consistent with the new warp-group register allocation controls; no functional change.
benchmark/matmul/benchmark_matmul_sp.py (1)
195-195: Consistent API usage: disable_warp_group_reg_alloc().

Maintains semantics after clearing C_local; matches repository-wide change.
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp.py (1)
73-75: Test updated to new API; placement looks correct.

Disabling warp-group reg alloc before compute is consistent with intended coverage of the new pass.
examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py (1)
341-341: Add rationale comment for 512-thread blocks

Verification results:
- No stale calls to the old no_set_max_nreg() API were found in examples or tests.
- The new disable_warp_group_reg_alloc() API is already used across examples and tests.
- The AnnotateWarpGroupRegAlloc pass is correctly wired into the optimization pipeline (engine/phase.py:104).
- No misspellings of the PTX intrinsic (ptx_ldmatrix) were detected.

Apply this small comment for maintainability:

• File: examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py, line 341

- T.disable_warp_group_reg_alloc()
+ # 512-thread blocks: disable warp-group maxnreg annotation to avoid occupancy loss on Hopper.
+ T.disable_warp_group_reg_alloc()
| Array<PrimExpr> indices; | ||
| for (auto r : shared_range) | ||
| indices.push_back(r->min); | ||
| std::vector<PrimExpr> strides; | ||
| PrimExpr stride = 1; | ||
| for (size_t i = 0; i < shared_tensor->shape.size(); i++) { | ||
| auto s = shared_tensor->shape[shared_tensor->shape.size() - i - 1]; | ||
| strides.insert(strides.begin(), stride); | ||
| stride *= s; | ||
| } | ||
|
|
||
| ICHECK(strides.size() == indices.size()) | ||
| << "strides.size() != indices.size()" << strides.size() << " " | ||
| << indices.size(); | ||
| PrimExpr offset = 0; | ||
| for (size_t i = 0; i < indices.size(); i++) { | ||
| offset += indices[i] * strides[i]; | ||
| } | ||
| Layout shared_layout; | ||
| if (T.layout_map.count(shared_tensor)) { | ||
| shared_layout = T.layout_map[shared_tensor]; | ||
| shared_tensor = T.buffer_remap[shared_tensor]; | ||
| } | ||
|
|
Shared-memory offset is computed in pre-remap coordinates; wrong when layout_map/buffer_remap is present.
You compute a row-major offset before applying shared_layout and buffer remap, then use it with the remapped buffer. This misaddresses smem for swizzled layouts.
Apply this diff to compute the offset after applying layout/buffer remap and using Buffer::OffsetOf:
- Array<PrimExpr> indices;
- for (auto r : shared_range)
- indices.push_back(r->min);
- std::vector<PrimExpr> strides;
- PrimExpr stride = 1;
- for (size_t i = 0; i < shared_tensor->shape.size(); i++) {
- auto s = shared_tensor->shape[shared_tensor->shape.size() - i - 1];
- strides.insert(strides.begin(), stride);
- stride *= s;
- }
-
- ICHECK(strides.size() == indices.size())
- << "strides.size() != indices.size()" << strides.size() << " "
- << indices.size();
- PrimExpr offset = 0;
- for (size_t i = 0; i < indices.size(); i++) {
- offset += indices[i] * strides[i];
- }
- Layout shared_layout;
- if (T.layout_map.count(shared_tensor)) {
- shared_layout = T.layout_map[shared_tensor];
- shared_tensor = T.buffer_remap[shared_tensor];
- }
+ Layout shared_layout;
+ Array<PrimExpr> indices;
+ for (auto r : shared_range) indices.push_back(r->min);
+ if (T.layout_map.count(shared_tensor)) {
+ shared_layout = T.layout_map[shared_tensor];
+ // Transform indices into the remapped buffer's coordinate space
+ indices = shared_layout->Forward(indices);
+ shared_tensor = T.buffer_remap[shared_tensor];
+ }
+ PrimExpr offset = shared_tensor.OffsetOf(indices).back();

📝 Committable suggestion
Layout shared_layout;
Array<PrimExpr> indices;
for (auto r : shared_range) {
  indices.push_back(r->min);
}
if (T.layout_map.count(shared_tensor)) {
  shared_layout = T.layout_map[shared_tensor];
  // Transform indices into the remapped buffer's coordinate space
  indices = shared_layout->Forward(indices);
  shared_tensor = T.buffer_remap[shared_tensor];
}
PrimExpr offset = shared_tensor.OffsetOf(indices).back();
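To see why this matters, here is a small self-contained sketch (plain C++, independent of TVM; the 8x8 tile and the XOR swizzle are illustrative assumptions, not the layout TileLang actually infers). A row-major offset taken from the logical indices disagrees with the offset obtained by first mapping the indices through the layout, which is exactly the misaddressing described above:

#include <cstdio>

// Toy 8x8 shared-memory tile with an XOR swizzle: logical (row, col)
// is physically stored at (row, col ^ row). Illustrative only.
int forward_offset(int row, int col) {
  int phys_col = col ^ row;   // analogue of shared_layout->Forward(indices)
  return row * 8 + phys_col;  // analogue of OffsetOf on the remapped buffer
}

int naive_offset(int row, int col) {
  return row * 8 + col;       // row-major offset in pre-remap coordinates
}

int main() {
  // For logical element (3, 1): naive yields 25, swizzle-aware yields 26,
  // so the naive offset addresses the wrong shared-memory word.
  std::printf("naive=%d forward=%d\n", naive_offset(3, 1), forward_offset(3, 1));
  return 0;
}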
kernel = args[4].as<IntImm>().value()->value;
stride = args[5].as<IntImm>().value()->value;
dilation = args[6].as<IntImm>().value()->value;
padding = args[7].as<IntImm>().value()->value;
eviction_policy = args[8].as<IntImm>().value()->value;
}
Invalid IntImm extraction; this won't compile. Use Downcast<IntImm>().
as<IntImm>() is not a valid node cast; use Downcast<IntImm>(...) and read ->value.
Apply this diff:
- kernel = args[4].as<IntImm>().value()->value;
- stride = args[5].as<IntImm>().value()->value;
- dilation = args[6].as<IntImm>().value()->value;
- padding = args[7].as<IntImm>().value()->value;
- eviction_policy = args[8].as<IntImm>().value()->value;
+ kernel = Downcast<IntImm>(args[4])->value;
+ stride = Downcast<IntImm>(args[5])->value;
+ dilation = Downcast<IntImm>(args[6])->value;
+ padding = Downcast<IntImm>(args[7])->value;
+ eviction_policy = Downcast<IntImm>(args[8])->value;

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/op/copy.cc around lines 1046 to 1051, the code uses args[i].as<IntImm>()
which is an invalid node cast and will not compile; replace each use with
Downcast<IntImm>(args[i])->value to correctly extract the IntImm value (e.g.,
kernel = Downcast<IntImm>(args[4])->value; and similarly for stride, dilation,
padding, eviction_policy). Ensure you include the Downcast header/namespace if
not already present.
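For background on the two cast idioms involved, a minimal sketch (assuming the standard tvm::runtime API; the ReadIntArg* helpers are hypothetical names for illustration, not functions from copy.cc):

#include <cstdint>
#include <tvm/ir/expr.h>         // PrimExpr, IntImm, IntImmNode
#include <tvm/runtime/object.h>  // Downcast

using tvm::IntImm;
using tvm::IntImmNode;
using tvm::PrimExpr;

// Checked strong cast: returns the typed reference directly and
// ICHECK-fails on a type mismatch -- the pattern the diff above suggests.
int64_t ReadIntArgChecked(const PrimExpr& arg) {
  return tvm::runtime::Downcast<IntImm>(arg)->value;
}

// Pointer-style cast: as<Node>() yields a raw node pointer (nullptr on
// mismatch); there is no .value() to chain here, which is why the
// original line does not compile.
int64_t ReadIntArgOrDefault(const PrimExpr& arg, int64_t fallback) {
  if (const IntImmNode* node = arg.as<IntImmNode>()) {
    return node->value;
  }
  return fallback;
}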
* [Index] Relocate Int64 Auto Promoter to ConfigBitWidth Pass, removing it from FlattenBuffer (#714)
* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
* Refactor inject_pipeline.cc to enhance pipeline body rewriting and condition handling
  - Introduced a new function to replace IfThenElse nodes with their then_case while preserving attributes.
  - Streamlined the PipelineBodyRewriter to improve buffer access rewriting and async state management.
  - Enhanced the handling of pipeline loop conditions and added support for predicate conditions in the pipeline body.
  - Removed obsolete code and improved overall code clarity and maintainability.
* lint fix
* Refactor return statements in inject_pipeline.cc to remove unnecessary std::move calls
  - Updated return statements in multiple methods to return objects directly instead of using std::move, improving code clarity and potentially avoiding unnecessary moves.
  - Ensured consistent handling of BufferStore and BufferLoad nodes during pipeline transformations.
* test fix
* Enhance global read detection in pipeline planning
  - Updated the handling of global reads to account for condition expressions within IfThenElse nodes, ensuring accurate identification of global memory accesses.
  - Introduced a new flag to track whether the visitor is within a condition expression, improving the correctness of buffer access analysis.
  - Refactored the VisitStmt_ method to properly handle the structure of IfThenElse nodes, enhancing the clarity and maintainability of the code.
* Add IndexLegalizer to enforce int64 for out-of-bound indices
  - Introduced the IndexLegalizer class to ensure that indices in BufferStore and BufferLoad nodes are promoted to int64 when they exceed their type bounds.
  - Refactored the Int64Promoter logic from flatten_buffer.cc into IndexLegalizer, improving code organization and reusability.
  - Updated the ConfigIndexBitwidth pass to apply IndexLegalizer after rewriting the body, enhancing the handling of index bitwidths in transformations.
* [CI] Bind build-test CI to NVIDIA as AMD runners are being introduced (#718)
* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
* Rename build-test job to build-test-nvidia and specify nvidia as a runner label in CI workflow.
* Update CI workflow to specify 'nvidia' as an additional runner label for the format-check job.
* fix: NVRTC backend (#717)
* fix: NVRTC backend
* fix: CI
---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* [CUDA] Init support for sm_120 (#716)
* Init support for sm120
* fmt
* resolve comments
* unify mma gemm
* fmt
---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* [CI] fix docs ci (#720)
* [Chore] fix typos (#719)
* chore: fix typos
* chore: fix ruff
* chore: fix clang-format
* [CI][AMD] Add AMD GPU CI and fix some related bugs (#694)
* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
  - Enhanced buffer index handling to address precision issues by removing redundant operations.
  - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
  - Updated related documentation to reflect changes in buffer management practices.
* Remove obsolete test script for AMD example, streamlining the examples directory.
* Remove unused dtype_size variable in AMD example script to streamline code.
* Add input configuration file and update AMD example script for enhanced flexibility
  - Introduced a new input.txt file for configurable parameters.
  - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
  - Streamlined the main function for better clarity and organization.
  - Added a new test script to facilitate running the example with specified parameters.
* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
  - Deleted input.txt and test.sh files as they are no longer needed.
  - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
  - Reintroduced swizzle usage in the kernel for better performance.
* Refactor AMD example script for FlashAttention-2
  - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
  - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
  - Removed outdated comments and improved code organization for better readability.
* Refactor formatting in AMD FlashAttention example script
  - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
  - Streamlined the `main` function parameter formatting for consistency.
  - Removed unnecessary blank lines to enhance overall code organization.
* Update example_amd_flash_attn_fwd.py
* Update AMD FlashAttention example and TVM submodule
  - Added a new example script `example_amd_flash_attn_fwd_k_block.py` for FlashAttention with K-blocking support.
  - Enhanced `example_amd_flash_attn_fwd.py` by expanding configuration options for block sizes and threads.
  - Updated the TVM submodule to the latest commit for improved functionality.
  - Introduced a new test script `test.sh` to facilitate running the new example with specified parameters.
* Add CI workflow for automated format checking and testing
  - Introduced a new GitHub Actions workflow in `amd_ci.yml` to automate format checks and testing for pull requests.
  - The workflow includes steps for setting up a Python environment, running format checks, and executing tests.
  - Removed obsolete example script `example_amd_flash_attn_fwd_k_block.py` and test script `test.sh` to streamline the examples directory.
* Rename CI workflow from "CI" to "AMD CI" for clarity and specificity.
* Update AMD CI workflow to include copying PyTorch, TorchVision, and Torchaudio packages to the virtual environment for improved dependency management.
* Update AMD CI workflow to install pytest directly instead of using requirements-test.txt
* Update AMD CI workflow to remove 'flash-attn' from requirements and install dependencies from requirements-test.txt
* Refactor AMD CI workflow to enhance clarity in removing 'flash-attn' from requirements-test.txt before installation
* Remove Torchaudio package copying from AMD CI workflow to streamline dependency management.
* Refactor AMD CI workflow to remove the format-check job and streamline the build-test process by directly copying PyTorch and TorchVision packages to the virtual environment.
* Add installation of ROCm in AMD CI workflow
  - Included a step to execute the `install_rocm.sh` script for improved setup.
  - Removed unnecessary blank line for better readability in the workflow script.
* Remove installation step for ROCm in AMD CI workflow to simplify the setup process.
* Update AMD CI workflow to run specific test file with verbose output instead of all tests.
* Add new tilelang built-in operations for AMD architecture
  - Introduced `tvm_mfma`, `tvm_mfma_store`, `tvm_rdna_wmma`, and `tvm_rdna_wmma_store` built-in operations to enhance support for matrix multiplication and storage in tilelang.
  - Each operation is configured with the appropriate number of inputs and marked as opaque in terms of call effects.
* Enhance autotuner configurations and GEMM operations in AMD example
  - Updated block sizes and num_split_q parameters in `get_configs` for improved autotuning.
  - Modified `T.gemm` calls in `fast_flashattn` to utilize `GemmWarpPolicy.FullRow`, optimizing performance for matrix multiplications.
* Update autotuner configurations in AMD example for enhanced performance
  - Refined block sizes, thread counts, and added new parameters in `get_configs` to optimize autotuning.
  - Adjusted `fast_flashattn` function to incorporate new parameters for panel size and coalesced widths, improving memory access patterns.
* Enhance autotuner configurations and memory handling in AMD example
  - Expanded block sizes and thread counts in `get_configs` for improved autotuning capabilities.
  - Updated `fast_flashattn` to utilize a new shared memory allocation strategy, optimizing memory access patterns during GEMM operations.
* Refine autotuner configurations and memory usage in AMD example
  - Reduced block sizes and adjusted thread counts in `get_configs` for optimized autotuning.
  - Updated `fast_flashattn` to utilize register fragments for accumulation, minimizing LDS usage and enhancing performance during GEMM operations.
* Update autotuner configurations in AMD example for enhanced performance
  - Expanded block sizes and thread counts in `get_configs` to improve autotuning capabilities.
  - Adjusted `num_split_q` and `v_coalesced_width` parameters for better optimization during GEMM operations.
* Enhance autotuner configurations and GEMM operations in AMD example
  - Expanded thread counts in `get_configs` to include higher values for improved autotuning.
  - Updated `fast_flashattn` to adjust accumulation logic and ensure proper handling of causal conditions, optimizing performance during matrix multiplications.
* Update AMD CI workflow and remove obsolete test script
  - Modified the CI workflow to run on multiple environments: self-hosted, amd, and gpu.
  - Deleted the outdated `test.sh` script from the examples directory, streamlining the project structure.
* Remove TVM subproject from 3rdparty directory
* Refactor configuration generation and accumulation logic in AMD example
  - Reformatted the `get_configs` function for improved readability by aligning parameters.
  - Adjusted the `fast_flashattn` function to enhance clarity in the conditional logic for accumulation, ensuring better handling of causal conditions.
* Enhance AMD CI workflow with additional logging and setup steps
  - Added echo statements to provide feedback during the CI process, indicating when the environment is running on an AMD GPU, copying necessary packages, and installing requirements.
  - Improved clarity in the workflow by explicitly stating when the project is being installed and when tests are being executed.
* Comment out package copying in AMD CI workflow to prevent potential issues during environment setup
* Update AMD CI workflow to install nightly versions of PyTorch and remove obsolete package copying steps
* Enhance BuildTileLangHIP function by adding whitespace for improved readability
* Refactor kTVMGridConstant definition for clarity and remove unnecessary comment
* Update TVM subproject to latest commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
* lint fix
* Update AMD CI workflow to use requirements-rocm.txt for dependency installation
* fix ci
* Remove dependency on format-check from AMD CI workflow
* fix ci
* fix ci
* fix ci
* Remove format-check job from AMD CI workflow
* Add torch to requirements-rocm.txt and remove explicit pip install commands from AMD CI workflow
* Add dependency on format-check job in AMD CI workflow
* Add format-check job to AMD CI workflow
* Update format-check job in AMD CI workflow to run on self-hosted environment
* Enhance format-check job in AMD CI workflow with improved Python environment setup and automatic commit of lint changes
* Update amd_ci.yml
---------
Co-authored-by: xinxyxiao <xinyxiao@amd.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy (#724)
* [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy
* [Typo] Correct architecture selection for CUDA and CDNA
* [Refactor] Refactor CUDA code generation to simplify eviction policy handling (#721)
* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
* Refactor CUDA code generation to simplify eviction policy handling
  - Updated `VisitExpr_` methods in `codegen_cuda.cc` to use default eviction policy for `tma_load`, `tma_load_im2col`, and `tma_store` functions, reducing complexity.
  - Removed conditional assembly code for `EVICT_NORMAL` in `copy_sm90.h`, streamlining the assembly calls for tensor memory operations.
* lint fix
* [Language] Introduce `StridedTensor` to support non contigious torch inputs (#722)
* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107
* Support strided tensors
* Refactor target attribute helper functions for improved clarity
* No code changes made in proxy.py and setup.py
* lint fix
* lint fix via gemini
* lint fix
* test fix
* test fix
* lint fix
* Update wrapper.py
* test fix
* Enhance test for InjectSoftwarePipeline by adding LowerOpaqueBlock transformation and updating expected function signature to use match_buffer for better clarity.
* lint fix
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
* [Enhancement][Bugfix] Fix bug in warp specialized pass and add gemm_sr fallback support for Hopper (#712)
* bug fix and support gemm_sr fallback for hopper
* Update gemm.cc
---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* 📝 Add docstrings to `fix` (#726)
  Docstrings generation was requested by @LeiWang1999.
* https://github.com/tile-ai/tilelang/pull/712#issuecomment-3190680851
  The following files were modified:
  * `src/op/gemm.cc`
  * `src/tl_templates/cuda/gemm_sm90.h`
  * `src/transform/warp_specialized_rewriter.cc`
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* [CI] Fix AMD CI (#729)
* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
  - Enhanced buffer index handling to address precision issues by removing redundant operations.
  - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
  - Updated related documentation to reflect changes in buffer management practices.
* Remove obsolete test script for AMD example, streamlining the examples directory.
* Remove unused dtype_size variable in AMD example script to streamline code.
* Add input configuration file and update AMD example script for enhanced flexibility
  - Introduced a new input.txt file for configurable parameters.
  - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
  - Streamlined the main function for better clarity and organization.
  - Added a new test script to facilitate running the example with specified parameters.
* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
  - Deleted input.txt and test.sh files as they are no longer needed.
  - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
  - Reintroduced swizzle usage in the kernel for better performance.
* Refactor AMD example script for FlashAttention-2
  - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
  - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
  - Removed outdated comments and improved code organization for better readability.
* Refactor formatting in AMD FlashAttention example script
  - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
  - Streamlined the `main` function parameter formatting for consistency.
  - Removed unnecessary blank lines to enhance overall code organization.
* Update example_amd_flash_attn_fwd.py
* Enhance AMD example script and update CI workflows
  - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
  - Added new CI workflows for AMD and documentation publishing.
  - Updated various requirements files to include necessary dependencies.
  - Introduced new test cases and examples for better coverage and functionality.
  - Refactored existing code for improved readability and maintainability.
* Remove redundant tool cache cleanup step in AMD CI workflow
* Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.
---------
Co-authored-by: xinxyxiao <xinyxiao@amd.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
* [Feature] Low-bit twiddling dequantization and FP4 GEMM (#725)
* [Dequant] Add bit-twiddling dequantize cuda for fp4-->bf16
* [Dequant] Add extern call and serial dequantization
* [Dequant] Parallel Dequant wait for fence debug.
* [Scale] Add scale matrix to mxfp4 gemm
* [Remove] Remove fence-buggy example and some generated source cuda code
* [MXFP4] Update initial version of MXFP4 GEMM
* [Scale] Add scale to latest mxfp4 gemm
* [Lint]
* [BugFix] Load Scale, disabe TMA to recover performance
* [Lint]
* [Lint]
* [Scale] Use L2 to hold Scale and enable TMA will slightly boost performance
* [Lint]
* Update example_dequant_gemm_bf16_fp4_hopper_serial.py
* Remove deprecated dequantization examples for BF16 and MXFP4 in the dequantize_gemm directory.
* Refactor dequantization examples for improved readability and consistency. Adjusted formatting in matmul function and added spacing for clarity. Updated function signatures and comments for better understanding.
* Refactor index_to_coordinates usage in bitnet example and update dequantization example configurations. Removed the custom index_to_coordinates function and replaced it with the built-in version. Adjusted block_K parameter in dequantization example for consistency.
* lint fix
* ci fix
* Remove non-existent example
* [BugFix] Add smem swizzle to recover performance of TMA
* [BugFix] Enough reg for producer when threads=512
---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* 📝 Add docstrings to `mxfp4` (#732)
* 📝 Add docstrings to `mxfp4`
  Docstrings generation was requested by @LeiWang1999.
* https://github.com/tile-ai/tilelang/pull/725#issuecomment-3191656561
  The following files were modified:
  * `examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py`
  * `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py`
  * `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py`
  * `examples/dequantize_gemm/utils.py`
  * `examples/gemm/example_gemm_autotune.py`
  * `tilelang/intrinsics/utils.py`
  * `tilelang/language/__init__.py`
  * `tilelang/language/utils.py`
  * `tilelang/quantize/mxfp.py`
  * `tilelang/quantize/quantization.py`
* [Lint] More accurate docstring
* [Lint]
---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: tzj-fxz <tzjfxz@gmail.com>
* [Refactor] Refactor env into a more flexible version (#740)
* Fix environment variable name for compilation print setting in `env.py`
* Remove deprecated test file for warp specialized pass configuration and refactor environment variable access in `env.py` to utilize a centralized `EnvVar` class for better management and clarity.
* lint fix
* Refactor cache check to use `env.is_cache_enabled()` for consistency in `tuner.py`
* [Enhancement] Add stride index validation in CythonKernelWrapper (#743)
* Introduced an assertion to ensure that the stride index is within the valid range of tensor dimensions in `cython_wrapper.pyx`.
* This change prevents potential out-of-bounds errors when accessing tensor dimensions, enhancing the robustness of the code.
* [Bugfix]:Fix atomic add auto vectorize memory access out of bound error (#742)
* [Bugfix]:Fix atomic add auto vectorize memory access out of bound error
* Update atomicadd_vectorize.cc
* format
* 📝 Add docstrings to PR #744 (#745)
* 📝 Add docstrings to `main`
  Docstrings generation was requested by @LeiWang1999.
* https://github.com/tile-ai/tilelang/pull/742#issuecomment-3205103559
  The following files were modified:
  * `src/transform/atomicadd_vectorize.cc`
* lint fix
---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* [Refactor] Refactor barrier management (#744)
* Introduce Barrier
* Enhance CUDA kernel with new barrier management and post-processing support
  - Added a new CUDA kernel implementation in `example_mla_decode.py` for improved performance with shared memory barriers.
  - Refactored barrier handling in `codegen_cuda.cc` and `codegen_hip.cc` to utilize a more flexible mbarrier structure.
  - Updated intrinsic definitions from `ptx_stmatirx` to `ptx_stmatrix` across multiple files for consistency.
  - Introduced additional print statements for debugging in the lowering phase of the TileLang engine.
  - Enhanced the overall structure and readability of the codebase.
* Remove unused barrier handling code in CUDA and HIP code generators to streamline the implementation. This change enhances code clarity and reduces complexity in the barrier management logic.
* Enhance barrier management in TileLang
  - Introduced a new intrinsic `allocate_barrier` for dynamic barrier allocation in the TileLang framework.
  - Updated CUDA code generation to support the new barrier structure, allowing for improved synchronization in shared memory.
  - Refactored existing barrier handling logic to accommodate the new intrinsic and streamline code.
  - Added print statements for debugging purposes in various examples and the lowering phase of the TileLang engine.
  - Removed deprecated memory scope handling code to enhance clarity and maintainability.
* lint fix
* lint fix
* Remove `allocate_barrier` intrinsic and related code from TileLang to streamline barrier management. This includes updates to CUDA code generation and the removal of associated Python wrappers, enhancing code clarity and maintainability.
* Refactor logging in JITKernel to improve kernel compilation tracking
  - Removed unused import of `torch.backends` in the example file.
  - Introduced logging for kernel compilation in `JITKernel`, replacing print statements with structured logging for better traceability and debugging.
  - Added an assertion to ensure the presence of the `global_symbol` attribute in the kernel function.
* Refactor dequantization tests and update barrier function
  - Removed the test for `example_dequant_gemm_bf16_fp4_hopper_serial` to streamline the testing suite.
  - Updated the `mbarrier_cp_async_arrive` function to support both pointer and non-pointer types, enhancing flexibility in barrier management.
* Update CI configuration to increase pytest parallelism from 4 to 8 threads for improved test execution speed.
* Fix typos in rasterization parameters and update import path for cached module
  - Corrected the spelling of `enable_rasteration` to `enable_rasterization` in the matmul function and its usage.
  - Updated the import statement for the `cached` module to reflect the new path in the cache submodule.
  - Added `StridedTensor` import in the language module for enhanced tensor functionality.
* Update ci.yml
* [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy (#746)
* [Refactor] Merge bulk copy into copy and refactor layout inference for bulk copy
* Deleted the `bulk_copy` operator implementation and its header file as it is no longer needed.
* Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping.
* Updated related files to reflect these changes, ensuring that the codebase remains clean and maintainable.
* lint fix
* Fix typos in intrinsic names and remove unused print statement in block_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency.
* remove bulk copy
* Refactor copy and atomic add operations to support TMA lower configuration
  - Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations.
  - Modified `Lower` method in `Copy` to incorporate the new TMA configuration.
  - Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic.
  - Removed unused `disable_tma_lower` field from `LowerArgs` structure for clarity.
  - Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach.
* Enhance TMA bulk copy logic in `LowerBulkCopy` method
  - Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling.
  - Updated warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities.
* lint fix
* Remove fallback logging for non-swizzled global layout in `LowerBulkCopy` method to streamline the bulk copy logic. This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions.
* Enhance reshape kernel compilation in `run_reshape` and `run_reshape_smem_1d_2_2d` functions
  - Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations.
  - Added TODO comments to indicate the need for further improvements in shared memory handling.
* Update `native_sparse_attention` function to include TMA configuration options
  - Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations.
  - Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1.
* Refactor JIT decorator formatting in `native_sparse_attention` function
  - Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase.
  - No functional changes were made; this update focuses on code clarity and maintainability.
* Enhance thread management and logging in TileLang compilation
  - Added a method to check if printing is enabled during compilation, improving control over logging behavior.
  - Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output.
  - Added comments to clarify the purpose of changes and improve code readability.
* Add warp specialization scope and refactor register management in TileLang
  - Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management.
  - Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process.
  - Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management.
  - Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process.
* Refactor test for InjectSetMaxNReg pass in TileLang
  - Improved readability by restructuring conditional checks and assertions in the test cases.
  - Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic.
  - Ensured consistent formatting and spacing throughout the test functions for better maintainability.
* Enhance bulk copy and store checks in `Copy` class
  - Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options.
  - Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations.
  - Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging.
* lint fix
* [Refactor] Merge ThreadPartialSync and ThreadStorageSync (#741)
* Remove `thread_partial_sync.cc` and refactor `thread_storage_sync.cc` to streamline synchronization handling. Introduce `thread_sync_types.h` for thread-bound key definitions and reserved named barriers. Update related logic in `ThreadSyncInserter` and `TileLangThreadSync` for improved clarity and efficiency.
* Remove `sync_thread_partial` references and related documentation from the codebase. Update CUDA and HIP code generation files to eliminate calls to the removed function. Refactor `__sync_thread_partial` to `sync_thread_partial` in CUDA common header for consistency.
* Remove unused import of `bulk_copy.h` in `codegen_hip.cc` to enhance code clarity and maintainability.
* Add import of `bulk_copy.h` in `codegen_hip.cc` to support new functionality.
* typo fix
* Update data type in reduce_sum tests from float16 to float32 for consistency and clarity. Remove redundant dtype tests and streamline run functions. Enhance reshape kernel compilation with pass configurations to address shared memory layout issues.
* lint fix
* test fix
* Enhance CI configuration by adding verbose output to pip install command for better visibility during installation.
* use ninja instead of make
* Add CMake configuration step for Ninja build system in setup.py
* Update pyproject.toml to include additional build dependencies: build, torch, tox, auditwheel, patchelf, and ninja.
* Enhance CI configuration by adding verbose output to pytest commands for improved test visibility.
* Update pyproject.toml to add Cython as a build dependency. Enhance thread storage synchronization in thread_storage_sync.cc by introducing new thread variable handling and improving index disjointness checks.
* Update data type in cumulative sum tests from float16 to float32 for consistency. Modify run_cumsum function to utilize the updated dtype and enhance result validation with assertions. Adjust test cases accordingly.
* Refactor storage access handling by introducing buffer data mapping in TileLangStorageAccessVisitor. Enhance access entry structure to include pointer access flag. Update thread storage synchronization to accommodate new buffer data mappings. Adjust quickstart example to print kernel source for debugging purposes.
* Refactor linear index conversion in TileLangStorageAccessVisitor to utilize the analyzer for simplification.
  Update buffer index calculations to ensure consistent simplification of range expressions.
* bugfix
* Refactor buffer index calculation in TileLangStorageAccessVisitor to simplify access handling. Removed unused buffer mapping logic, ensuring consistent buffer index generation with a default ramp.
* Refactor TileLangStorageAccessVisitor to replace buffer indices with buffer ranges for improved pointer access handling. Update AccessEntry structure to include buffer_ranges and adjust thread storage synchronization logic to account for pointer access conflicts.
* Refactor thread storage synchronization to replace 'shared.dyn' with 'shared' for consistency in memory allocation. Update related test cases to reflect this change and ensure proper functionality.
* [Enhancement] Optimize loop body handling in IR (#749)
  - Updated the loop body construction in `ir.cc` to conditionally include an output statement based on the analyzable condition of the `waves` variable.
  - This change enhances performance by avoiding unnecessary statement wrapping when the condition is met, improving the efficiency of loop execution.
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* [MXFP4] Fix bugs and optimize exponential operation (#750)
* [MXFP4] Fix bugs
  - Optimize exp2 with shift operation to boost performance
  - Fix bug of simple dequantization function call
  - Fix bug of scaling factor with bias
* [Lint]
---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h (#751)
  - Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations.
  - Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations.
* [Enhancement] Add shape checking for reduce options (#748)
* Add shape checking for reduce options
* lint fix
* Handle special case reducing into shape-1 tensor
  Allow reducing [X, d, Y] into [X, Y] or [X, 1, Y]
---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* [Bugfix] Add missing FP8 header include (#752)
* [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h
  - Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations.
  - Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations.
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* [Enhancement] Include cuda_fp8.h in gemm_sm90.h
  - Added the inclusion of the "cuda_fp8.h" header file to support new data formats in CUDA GEMM operations, enhancing compatibility with recent updates for fp8 types.
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* lint fix
* [Refactor] Remove unused tl_shuffle_elect and related functions from common.h
  - Deleted the `tl_shuffle_elect` function and its associated comments to streamline the codebase.
  - Added inclusion of "intrin.h" for improved intrinsic support in CUDA operations.
  - Cleaned up the file by removing unnecessary template parameters and functions, enhancing clarity and maintainability.
* lint fix
* [Refactor] Update header inclusions in common.h and gemm_sm90.h
  - Removed the inclusion of "intrin.h" from common.h to streamline dependencies.
  - Added "intrin.h" inclusion in gemm_sm90.h to ensure intrinsic support for CUDA operations, enhancing functionality and maintainability.
* bug fix
* [MXFP4] Add bias to MXFP4 GEMM kernel (#753)
* [MXFP4] Add bias to gemm kernel
* [Lint]
* [Lint] Rename "bias" to "Bias"
* [Bugfix][WS] Consider loop min extent when computing phase id (#754)
* Update test parameters and remove debug print statement
  - Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution.
  - Removed a debug print statement from `phase.py` to clean up the code and enhance clarity.
* Refactor loop stack management in warp_specialized_rewriter
  - Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability.
  - Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability.
  - Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations.
* [Typo] Remove `disable_cache` in some tests (#755)
* Update test parameters and remove debug print statement
  - Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution.
  - Removed a debug print statement from `phase.py` to clean up the code and enhance clarity.
* Refactor loop stack management in warp_specialized_rewriter
  - Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability.
  - Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability.
  - Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations.
* Remove unused `torch.backends` import and `tilelang.disable_cache()` calls from multiple test files to enhance code clarity and maintainability.
* [README] Update GDN README for clarity and add acknowledgements (#758)
  - Improved formatting and clarity of the GDN kernel implementation description.
  - Updated requirement section to list dependencies in a clearer format.
  - Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions.
* cutlass v4.2.0 supporting cuda 13 (#760)
* [Feature] Add 1D TMA support (#761)
* [Feature] Add 1D TMA support
  - Check the contiguous conditions of 1D TMA copy
  - Add new interface and params order of `tma_load` and `tma_store` call
  - Add 1D `tma_store` interface in sm90 template
  - Add elementwise kernel for 1D TMA example
* [Lint]
* [BugFix] Add conditions for 1D TMA copy on non-swizzle shared tensors
* [Lint]
* [BugFix] 1D TMA load
* [README] Update GDN README for clarity and add acknowledgements (#758)
  - Improved formatting and clarity of the GDN kernel implementation description.
  - Updated requirement section to list dependencies in a clearer format.
  - Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions.
* cutlass v4.2.0 supporting cuda 13 (#760)
* [Lint]
* [Lint]
* [MXFP4] Add test for bf16&mxfp4 gemm
* [BugFix]
* [Lint]
---------
Co-authored-by: Yu Cheng <54519279+chengyupku@users.noreply.github.com>
Co-authored-by: Johnny <johnnync13@gmail.com>
* [Example] Add vertical slash sparse attention pattern (#762)
* upd sparse attn
* lint
* rename
* update test file
* update benchmark
* lint
* update benchmark
* [Bugfix] Address PassContext contamination from CI and fix incorrect rewrites in warp specialized pass (#767)
* fix ci and pass bug
* fix
* try
* lint
* [MXFP4] Add 1D TMA copy for Scale tensor in MXFP4 GEMM (#766)
* [TMA] Add 1D TMA copy for Scale tensor
* [Lint]
* [Test] Add test for kernel
* [BugFix]
* hot fix blackwell (#768)
* [Refactor] Refactor `Operator` into `TileOperator` and with tvm reflection (#763)
* Refactor operator classes to inherit from TileOperator and update layout inference methods
  - Changed base class of several operator classes (AtomicAdd, Copy, Gemm, etc.) from Operator to TileOperator for better alignment with tile operations.
  - Updated InferLayout and Lower methods to use 'override' specifier for clarity and consistency.
  - Adjusted header inclusions to replace "op.h" with "operator.h" across multiple files for improved organization.
  - Added missing layout inference implementations for Fill and Conv2DIm2ColOp.
  - Removed deprecated op.cc and op.h files to streamline the codebase.
* lint fix
* Refactor operator classes to use Node pattern and improve memory management
  - Updated several operator classes (AtomicAdd, Copy, Gemm, etc.) to utilize the Node pattern for better memory management and encapsulation.
  - Changed constructors to initialize member variables through a node object, enhancing clarity and reducing direct member access.
  - Updated Clone methods to return TileOperator instances instead of unique pointers, aligning with the new design.
  - Refactored InferLayout and Lower methods to ensure consistency across operator implementations.
  - Adjusted header files to reflect the new class structure and removed deprecated code for a cleaner codebase.
* Enhance Clone methods in AtomicAdd and Copy classes to support parallel operation cloning
  - Updated the Clone methods in AtomicAddNode and CopyNode to ensure that the parallel operation (par_op_) is properly cloned when defined, improving the integrity of cloned objects.
  - Refactored the FillNode class to use ParallelOp directly instead of std::make_unique, streamlining the creation of parallel operations.
  - Made minor adjustments in layout inference and other related methods for consistency and clarity.
* Refactor FillNode::Lower method to remove unused global function call
  - Eliminated the call to the global function "tl.fill.lower" in the FillNode::Lower method, streamlining the code and improving clarity.
  - Retained the core functionality of the method while enhancing maintainability by reducing unnecessary dependencies.
* [Reducer] Introduce `alloc_reducer` to separate inter and intra warp reduction (#757)
* [Enhancement] Introduce finalize_reducer operator and layout reducer support
  - Added `FinalizeReducer` operator to handle reduction finalization in the TileLang framework, allowing for efficient reduction operations.
  - Implemented layout inference for local.reducer buffers, enhancing the handling of layout mappings and reducing complexity in buffer management.
  - Updated `setup.py` to include logging for build directory paths, improving build process visibility.
  - Enhanced atomic operations with new functions for atomic max, min, load, and store, providing more robust atomicity control in memory operations.
  - Refactored parallel loop handling to incorporate reducer information, ensuring proper management of reduction operations in parallel contexts.
  - Cleaned up test cases by removing unnecessary cache disabling and optimizing test parameters for better performance.
* Refactor code formatting and improve readability in multiple files
  - Cleaned up whitespace in `setup.py` to enhance logging clarity.
  - Reformatted `AtomicMax` and `AtomicMin` functions in `common.h` for better alignment and readability.
  - Adjusted `debug_print_var` function in `debug.h` to improve code structure and maintainability.
  - Enhanced readability of the `atomic_add` function in `customize.py` by breaking long lines for better clarity.
* Remove debug print statements from `copy.cc` and `inject_tma_barrier.cc` to enhance code clarity and maintainability.
* [Enhancement] Disable reuse of small arrays in shared memory allocation
  - Added logic to prevent the reuse of small arrays (<= 32 bits) in `merge_shared_memory_allocations.cc`, ensuring they are lowered to registers in LLVM for improved performance and memory management.
* Refactor `setup.py` to remove duplicate logging statements and enhance clarity. Update `finalize_reducer` function documentation in `reduce.py` to include detailed parameter and return descriptions, improving code readability and maintainability.
* Refactor `finalize_reducer` and `reduce` functions to remove redundant target checks. Simplified conditionals by retaining only the `TargetIsHopper` check, enhancing code clarity and maintainability.
* bug fix
* Add thread checks workaround for replicated cases
* Remove the is_one check
* fix lint error
* lint fix
* Update autotune tests to use smaller matrix sizes for improved performance and reliability
* [Refactor] Update FinalizeReducer to FinalizeReducerOp and adjust related methods
  - Refactored FinalizeReducer class to FinalizeReducerOp, updating constructor and method signatures for consistency with the new TileOperator structure.
  - Enhanced layout inference and cloning methods in FinalizeReducerOpNode.
  - Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main.
  - Adjusted header inclusions for improved organization and clarity across multiple files.
* [Refactor] Update atomic operations in common.h and modify test_example_flash_attention.py
  - Enhanced atomic operations (Add, Min, Max) in common.h to handle half and bfloat16 types more efficiently.
  - Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main, improving test organization.
* [Refactor] Simplify CopyNode::LowerBulkCopy logic and update test execution
  - Removed redundant checks for contiguous memory access in CopyNode::LowerBulkCopy, streamlining the logic for TMA copy operations.
  - Updated test_tilelang_kernel_gemm.py to comment out the main testing function and call a specific test for i8i8i32 tensor operations instead, improving test focus.
---------
Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com>
Co-authored-by: Freebase6912 <amid-gauze-racing@duck.com>
* 📝 Add docstrings to `pytile_0826` (#770)
* 📝 Add docstrings to `pytile_0826`
  Docstrings generation was requested by @LeiWang1999.
* https://github.com/tile-ai/tilelang/pull/763#issuecomment-3224197814
  The following files were modified:
  * `src/op/atomic_add.cc`
  * `src/op/atomic_add.h`
  * `src/op/copy.cc`
  * `src/op/copy.h`
  * `src/op/elem.cc`
  * `src/op/elem.h`
  * `src/op/gemm.cc`
  * `src/op/gemm.h`
  * `src/op/gemm_sp.cc`
  * `src/op/gemm_sp.h`
  * `src/op/operator.cc`
  * `src/op/operator.h`
  * `src/op/parallel.cc`
  * `src/op/parallel.h`
  * `src/op/reduce.cc`
  * `src/op/reduce.h`
  * `src/op/region.cc`
  * `src/op/region.h`
  * `src/transform/layout_inference.cc`
  * `src/transform/lower_tile_op.cc`
* lint fix
---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* [Bugfix]:Fix atomic add auto vectorize negative optimization (#765)
* [Bugfix]:Fix atomic add auto vectorize negative optimization
* fixbug
* format
* fix bug
* 📝 Add docstrings to `reducer_0825` (#772)
* 📝 Add docstrings to `reducer_0825`
  Docstrings generation was requested by @LeiWang1999.
* https://github.com/tile-ai/tilelang/pull/757#issuecomment-3219088118
  The following files were modified:
  * `setup.py`
  * `src/op/builtin.h`
  * `src/op/finalize_reducer.cc`
  * `src/op/finalize_reducer.h`
  * `src/op/parallel.cc`
  * `src/op/parallel.h`
  * `src/op/reduce.cc`
  * `src/target/codegen_cuda.cc`
  * `src/tl_templates/cuda/common.h`
  * `src/transform/layout_inference.cc`
  * `src/transform/layout_reducer.cc`
  * `src/transform/layout_reducer.h`
  * `src/transform/merge_shared_memory_allocations.cc`
  * `src/transform/storage_access.cc`
  * `src/transform/warp_specialized_rewriter.cc`
  * `testing/python/autotune/test_tilelang_autotune_with_inputs.py`
  * `tilelang/engine/phase.py`
  * `tilelang/language/customize.py`
  * `tilelang/language/reduce.py`
  * `tilelang/transform/__init__.py`
* lint fix
* lint fix
---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* Allow fill global buffer (#774)
* Allow fill global buffer
* fix lint error
* [BugFix] Refactor the op check in LowerTileOp pass using the member function instead of string match (#771)
* [BugFix] Refactor the op check in LowerTileOp pass using the member function instead of string match
* [Lint]
* add bf16 exp fallback (#776)
* [Lint] Introduce clang-tidy into format.sh (#777)
* [Refactor] Update Clang-Tidy Checks and Improve Code Consistency
  - Enhanced .clang-tidy configuration by adding specific checks for better bug detection and performance optimization.
  - Refactored function signatures across multiple files to use `const` references for parameters, improving performance and code clarity.
  - Updated various methods to ensure consistent handling of parameters, particularly in `AddPredicate`, `Substitute`, and `PlanLoopPartition` functions.
  - Improved readability by replacing size checks with `empty()` method calls in several locations, ensuring clearer intent in the code.
  - General code cleanup and adherence to best practices for better maintainability.
* [Refactor] Enhance Code Consistency and Clang-Tidy Configuration
  - Updated .clang-tidy configuration to include additional checks for improved code quality and performance.
  - Refactored function signatures across multiple files to use `const` references, enhancing performance and clarity.
  - Replaced size checks with `empty()` method calls in various locations for clearer intent.
  - Improved handling of parameters in several functions, ensuring consistent usage of `std::move` where applicable.
  - General code cleanup to adhere to best practices and improve maintainability.
* [Refactor] Integrate Clang-Tidy Checks and Enhance Code Consistency
  - Added clang-tidy checks to the format script for improved code quality assurance.
  - Refactored function signatures across multiple files to consistently use `const` references, enhancing performance and clarity.
  - Updated the requirements-lint.txt file to include clang-tidy as a dependency.
  - General code cleanup to adhere to best practices and improve maintainability.
* [CI] Update AMD CI Workflow to Include Build Directory Creation
  - Added steps to create a build directory and configure CMake with ROCm support during the format check process.
  - Ensured cleanup of the build directory after the format check to maintain a clean workspace.
* [Refactor] Remove Unused Member Variables in AtomicAddNode and CopyNode
  - Removed the `args_` member variable from both `AtomicAddNode` and `CopyNode` classes to streamline the code and eliminate unnecessary data members.
  - This change enhances code clarity and maintainability by focusing on relevant attributes for each class.
* [Refactor] Update Clang-Tidy Integration and Code Improvements
  - Modified the format script to include the `-fix` option in the clang-tidy command for automatic code fixes.
  - Refactored the `AtomicAddVectorizePlanner` class to improve variable handling and consistency, including changes to member variable types and function signatures.
  - Enhanced code clarity by removing unnecessary `std::move` calls and ensuring consistent usage of types across the class.
  - General code cleanup to adhere to best practices and improve maintainability.
* [Refactor] Improve Parameter Handling and Consistency in AtomicAddVectorize
  - Updated function signatures in `AtomicAddVectorizePlanResult` and `AtomicAddVectorizeRewriter` to use `const` references and `std::move` for better performance and clarity.
  - Enhanced the `UpdateVectorSize` method to accept `const Array<PrimExpr>&` for improved efficiency.
  - General code cleanup to maintain consistency and adhere to best practices.
* [CI] Add Git Submodule Initialization to CI Workflow
  - Included a step to initialize and update git submodules recursively in the CI workflow.
  - This change ensures that all necessary submodules are available during the format check process, improving build reliability.
* [CI] Add Git Submodule Update Step to Format Check
  - Included a command to initialize and update git submodules recursively in the CI workflow during the format check process.
  - This enhancement ensures that all required submodules are available, contributing to improved build reliability.
* [Refactor] Update Function Signatures in AtomicAddVectorize
  - Modified the `VectorizeAtomicAdd` function signature to use `const` references for `thread_var` and `thread_bounds`, enhancing performance and code clarity.
  - This change aligns with previous refactoring efforts to improve parameter handling and consistency across the codebase.
* [Cache] Introduce detailed target information for the disk kernel cache (#780)
* Fix type hint for target_host parameter in compile function to allow None value
* Refactor target handling in compile function to utilize determine_target for improved clarity and consistency
* Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code.
* Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity. * Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling. * Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits. * [Example]Adds example for top-k operation (#775) * [Example]Adds example for top-k operation Adds an example demonstrating the top-k operation using tilelang * format * Adds topk tilelang example test * fix lint * [Math] Dispatch `T.rsqrt(x)` into cuda intrin instead of `1 / T.sqrt(x)` (#781) * Fix type hint for target_host parameter in compile function to allow None value * Refactor target handling in compile function to utilize determine_target for improved clarity and consistency * Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code. * Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity. * Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling. * Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits. * Add intrin_rule source files to CMakeLists.txt and implement hrsqrt function for half_t in common.h * lint fix * remove cmake dep in pyproject as it may lead to different cmake paths in diff stages * lint fix * Add cmake dependency to pyproject.toml and improve build logging in setup.py * [CI] Adds pytest-durations for test timing (#782) * [Ci] Adds pytest-durations for test timing Adds `pytest-durations` to the test requirements and configures pytest to display test durations. This helps in identifying slow-running tests and optimizing the test suite for faster feedback. * add amd ci durations * Removes flash_attn installation from CI * [Refactor] Support python reflection for tile operators (#783) * Implement Fill operator and related reflection methods in TileLang - Added Fill operator implementation in `fill.cc` and `fill.h` for element-wise filling of buffers. - Introduced reflection methods for Fill, AtomicAdd, Copy, Conv2DIm2Col, FinalizeReducer, Gemm, and Parallel operators to enhance introspection capabilities. - Updated relevant files to register reflection methods and ensure proper initialization in static blocks. - Removed outdated comments and unnecessary code in various operator files to improve clarity and maintainability. - Added new Python bindings for the Fill operator in `tilelang/ir/fill.py` and updated the module imports accordingly. * Refactor operator reflection methods and improve code clarity - Updated reflection methods for AtomicAdd, Copy, FinalizeReducer, Gemm, and Parallel operators to enhance readability by using `empty()` instead of size checks. 
  - Consolidated static initialization blocks for various operators to a single line for improved consistency.
  - Cleaned up whitespace and formatting in multiple files to adhere to coding standards and improve maintainability.
  - Added new Python bindings for operators in the `tilelang/ir` module, ensuring proper registration and organization of imports.

* Refactor GEMM and AtomicAdd operations for improved clarity
  - Updated the `GetArchInt` function in `atomic_add.cc` to use `std::string` and `std::stoi` for better readability and type safety.
  - Removed unnecessary variables and comments in `gemm_sp.cc` and `gemm.cc` to streamline the `ComputeWarpPartition` method.
  - Cleaned up the `layout_reducer.cc` file by removing unused variable declarations, enhancing code clarity.
  - Added import for the `ir` module in `tilelang/__init__.py` to ensure proper organization of module imports.

* Remove deprecated operator files from the tilelang IR module
  - Deleted files for Fill, AtomicAdd, Copy, Gemm, GemmSP, FinalizeReducer, Parallel, Reduce, and Region operators to streamline the codebase.
  - This cleanup enhances maintainability by removing unused code and improving overall organization of the module.

* Refactor imports in tilelang IR module for improved organization
  - Updated import statements in `tilelang/ir.py` to reflect changes in the TVM library structure, enhancing clarity and maintainability of the codebase.

* lint fix

* Refactor GEMM and GEMM-SP operations to enhance clarity and maintainability
  - Updated the `Gemm` and `GemmSP` classes to utilize a new `GemmWarpPolicy` object for warp partitioning, improving encapsulation and readability.
  - Removed deprecated `ComputeWarpPartition` methods and replaced them with calls to the new policy object, streamlining the code.
  - Cleaned up comments and unnecessary code in `gemm.cc`, `gemm_sp.cc`, and related header files to enhance overall clarity.
  - Introduced a new `GemmWarpPolicyNode` class to manage warp policy attributes and methods, facilitating better organization of related functionalities.
  - Updated reflection methods to include the new policy structure, ensuring proper registration and introspection capabilities.

* Refactor Reduce operation to utilize ReduceType class for improved clarity and maintainability
  - Replaced multiple conditional checks for reduce types with a single ReduceType object, simplifying the code structure (see the ReduceType sketch after this commit list).
  - Introduced a new ReduceTypeNode class to encapsulate reduce type logic and methods, enhancing organization.
  - Updated MakeInitValue, MakeReduce, and Lower methods to leverage the new ReduceType class, improving readability.
  - Added Python bindings for the ReduceType class in the tilelang IR module to ensure proper registration and usability.

* comment

* Refactor operator header files for improved readability
  - Cleaned up formatting and whitespace in `atomic_add.h`, `copy.h`, `fill.h`, `reduce.cc`, and `reduce.h` to enhance code clarity.
  - Consolidated comments and adjusted line breaks for better organization and maintainability across multiple operator definitions.

* Refactor MakeReduce method in ReduceOpNode for clarity
  - Updated the parameter name in the MakeReduce method from `rhs` to `b` and assigned it to `rhs` for improved readability.
  - This change enhances the clarity of the method's purpose and aligns with the overall refactoring efforts in the Reduce operation.
* Update Reduce operation type checks for consistency
  - Changed string comparisons for reduce types in the MakeReduce method from "abs_sum" to "abssum" and "abs_max" to "absmax" for uniformity.
  - This adjustment enhances the clarity and consistency of the reduce type handling in the codebase.

* [AMD] Fix AMD TIR and add examples (#784)

* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
  - Enhanced buffer index handling to address precision issues by removing redundant operations.
  - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
  - Updated related documentation to reflect changes in buffer management practices.

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

* Add input configuration file and update AMD example script for enhanced flexibility
  - Introduced a new input.txt file for configurable parameters.
  - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
  - Streamlined the main function for better clarity and organization.
  - Added a new test script to facilitate running the example with specified parameters.

* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
  - Deleted input.txt and test.sh files as they are no longer needed.
  - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
  - Reintroduced swizzle usage in the kernel for better performance.

* Refactor AMD example script for FlashAttention-2
  - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
  - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
  - Removed outdated comments and improved code organization for better readability.

* Refactor formatting in AMD FlashAttention example script
  - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
  - Streamlined the `main` function parameter formatting for consistency.
  - Removed unnecessary blank lines to enhance overall code organization.

* Update example_amd_flash_attn_fwd.py

* Enhance AMD example script and update CI workflows
  - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
  - Added new CI workflows for AMD and documentation publishing.
  - Updated various requirements files to include necessary dependencies.
  - Introduced new test cases and examples for better coverage and functionality.
  - Refactored existing code for improved readability and maintainability.

* Remove redundant tool cache cleanup step in AMD CI workflow.

* Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.

* Add new AMD FlashAttention example and test script
  - Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
  - Added `test.sh` script to facilitate running the new example with specified parameters.
  - Enhanced the overall structure and organization of the example for better clarity and usability.

* Update configurations in `example_amd_flash_attn_fwd.py` for autotuner
  - Reduced the number of threads and `num_split_q` options for improved performance.
  - Adjusted `panel_size` options to streamline configuration settings.

* Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217

* Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c

* Add example for AMD Flash Attention backward pass implementation
  - Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
  - Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
  - Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
  - Included a reference implementation for validation against PyTorch's attention mechanism.
  - This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.

* Enhance AMD Flash Attention example with additional testing capabilities
  - Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
  - Improved the main function to allow for better parameter configuration and benchmarking.
  - Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.
  - This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.

* Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a

* Refactor HIP intrinsic rules to CUDA
  - Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
  - Adjusted include paths for better organization and clarity in the code structure.

* Update AMD CI workflow to uninstall specific PyTorch packages before installation
  - Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
  - Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.

* Remove unused shared memory allocations in AMD Flash Attention backward example
  - Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
  - This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.

* Remove unnecessary pip uninstall command from AMD CI workflow
  - Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
  - This change simplifies the CI process and reduces potential overhead during package management.

* Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules
  - Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
  - Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.

* Refactor formatting of HIP intrinsic rule registrations
  - Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
  - No functional changes were made; this update focuses on code style improvements to enhance maintainability.

* Update file na…
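Several of the refactors above apply the same clang-tidy-driven pattern: read-only container and object parameters are passed by `const` reference instead of by value. A minimal illustrative sketch — `std::vector` stands in for TVM's `Array<PrimExpr>`, and the function name is hypothetical, not the actual TileLang signature:

```cpp
#include <string>
#include <vector>

// Before (copies the whole vector on every call):
//   static int TotalLength(std::vector<std::string> words);

// After: pass read-only parameters by const reference, the shape that
// clang-tidy's performance-unnecessary-value-param check suggests.
static int TotalLength(const std::vector<std::string>& words) {
  int total = 0;
  for (const auto& w : words) {
    total += static_cast<int>(w.size());  // inspect without copying
  }
  return total;
}
```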
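The PrintConst change prints low-precision float constants in hexfloat form, which preserves the exact bit pattern, and appends the scientific form as a comment so the generated source stays readable. A sketch of the idea using standard streams; the real implementation lives in codegen_cuda.cc and covers bfloat16 and float8/float4 types:

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Emit a float constant exactly (hexfloat) with a human-readable
// scientific-notation comment alongside it. Illustrative only.
std::string PrintFloatConst(float value) {
  std::ostringstream os;
  os << std::hexfloat << value << "f";
  os << " /* " << std::scientific << value << " */";
  return os.str();
}

int main() {
  // Prints something like: 0x1.99999ap-4f /* 1.000000e-01 */
  std::cout << PrintFloatConst(0.1f) << "\n";
}
```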
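The `T.rsqrt` dispatch (#781) maps the op to a single reciprocal-square-root intrinsic instead of emitting a divide plus a sqrt. CUDA already exposes `rsqrtf` for float and `hrsqrt` for `__half` in `cuda_fp16.h`; a minimal device-side sketch (the actual `half_t` wrapper added to common.h is not shown):

```cpp
#include <cuda_fp16.h>

// One hardware rsqrt instruction instead of 1.0f / sqrtf(x).
__device__ float RsqrtF32(float x) { return rsqrtf(x); }

// Half-precision counterpart; #781 adds an hrsqrt-based wrapper
// for half_t so the intrinsic is also used at fp16.
__device__ __half RsqrtF16(__half x) { return hrsqrt(x); }
```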
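The reflection work in #783 registers each operator's fields once, from static initialization blocks, so the Python bindings can introspect them by name. The following self-contained sketch shows only the registration shape; it is not the TVM reflection API, and the operator and field names are illustrative:

```cpp
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// A trivial registry mapping operator name -> field descriptors.
struct FieldInfo { std::string name; std::string type; };

std::map<std::string, std::vector<FieldInfo>>& Registry() {
  static std::map<std::string, std::vector<FieldInfo>> r;
  return r;
}

// Registration helper; constructing a static instance plays the role
// of the static init blocks mentioned in the commits above.
struct RegisterOp {
  RegisterOp(const std::string& op, std::vector<FieldInfo> fields) {
    Registry()[op] = std::move(fields);
  }
};

static RegisterOp reg_fill{"tl.Fill",
                           {{"dst", "Buffer"}, {"value", "PrimExpr"}}};

int main() {
  for (const auto& f : Registry()["tl.Fill"])
    std::cout << f.name << " : " << f.type << "\n";  // dst, value
}
```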
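The ReduceType refactor replaces scattered string comparisons ("sum", "max", "abssum", ...) with a single object parsed once and switched on by MakeInitValue, MakeReduce, and Lower; the follow-up commit also normalizes the spellings to "abssum"/"absmax". A simplified sketch — the real ReduceTypeNode is a TVM object, and a plain enum wrapper stands in here:

```cpp
#include <stdexcept>
#include <string>

// Parse the reduce-type string once; downstream code switches on
// kind() instead of re-comparing strings at every use site.
class ReduceType {
 public:
  enum Kind { kSum, kMax, kMin, kAbsSum, kAbsMax };

  explicit ReduceType(const std::string& s) {
    if (s == "sum") kind_ = kSum;
    else if (s == "max") kind_ = kMax;
    else if (s == "min") kind_ = kMin;
    else if (s == "abssum") kind_ = kAbsSum;  // normalized from "abs_sum"
    else if (s == "absmax") kind_ = kAbsMax;  // normalized from "abs_max"
    else throw std::invalid_argument("unknown reduce type: " + s);
  }

  Kind kind() const { return kind_; }
  bool is_sum() const { return kind_ == kSum; }

 private:
  Kind kind_;
};
```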
Summary by CodeRabbit
New Features
Refactor
Bug Fixes
Chores