[Refactor] Support python reflection for tile operators #783
Conversation
- Added Fill operator implementation in `fill.cc` and `fill.h` for element-wise filling of buffers.
- Introduced reflection methods for Fill, AtomicAdd, Copy, Conv2DIm2Col, FinalizeReducer, Gemm, and Parallel operators to enhance introspection capabilities.
- Updated relevant files to register reflection methods and ensure proper initialization in static blocks.
- Removed outdated comments and unnecessary code in various operator files to improve clarity and maintainability.
- Added new Python bindings for the Fill operator in `tilelang/ir/fill.py` and updated the module imports accordingly.
- Updated reflection methods for AtomicAdd, Copy, FinalizeReducer, Gemm, and Parallel operators to enhance readability by using `empty()` instead of size checks.
- Consolidated static initialization blocks for various operators to a single line for improved consistency.
- Cleaned up whitespace and formatting in multiple files to adhere to coding standards and improve maintainability.
- Added new Python bindings for operators in the `tilelang/ir` module, ensuring proper registration and organization of imports.
- Updated the `GetArchInt` function in `atomic_add.cc` to use `std::string` and `std::stoi` for better readability and type safety.
- Removed unnecessary variables and comments in `gemm_sp.cc` and `gemm.cc` to streamline the `ComputeWarpPartition` method.
- Cleaned up the `layout_reducer.cc` file by removing unused variable declarations, enhancing code clarity.
- Added an import for the `ir` module in `tilelang/__init__.py` to ensure proper organization of module imports.
Walkthrough

Adds FFI reflection, structural equality, and hashing to many TileLang op nodes and registers them at static init; moves GEMM warp partitioning into a policy object; modernizes C++ idioms and arch parsing; introduces Fill and the Python `tilelang.ir` module; makes minor Python API and debug-print additions; includes small transform and reducer updates.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant GemmNode
    participant GemmWarpPolicy
    Caller->>GemmNode: Lower/InferLayout(M,N,block_size,target)
    GemmNode->>GemmWarpPolicy: policy->ComputeWarpPartition(M,N,block_size,target,use_wgmma)
    rect rgb(245,245,255)
        note right of GemmWarpPolicy: Compute m_warp/n_warp<br/>based on policy type
        GemmWarpPolicy-->>GemmNode: (m_warp, n_warp)
    end
    GemmNode-->>Caller: continue lowering with warps
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Summary of Changes
Hello @LeiWang1999, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly refactors the Tile language's operator definitions to enable Python reflection. The primary goal is to expose the internal structure and properties of C++-defined Tile operators to the Python frontend, allowing for more dynamic and inspectable IR manipulation. This change improves the usability and debuggability of the Tile IR by making its core components accessible from Python.
Highlights
- Python Reflection Support: Introduced Python reflection capabilities for various Tile operators (e.g., AtomicAdd, Copy, Gemm, Fill, ParallelOp, ReduceOp, FinalizeReducerOp, RegionOp). This allows C++ operator nodes to be inspected and manipulated from Python, enhancing debugging and programmatic interaction with the Tile IR.
- Operator Node Enhancements: Added
RegisterReflection,SEqualReduce, andSHashReducemethods to all relevantTileOperatorNodesubclasses. These methods facilitate structural equality checks, hashing, and expose internal attributes for Python reflection, improving the robustness and introspection of the IR nodes. - Code Modernization and Cleanup: Refactored C++ code to use modern C++ features like
std::stringandstd::stoifor string-to-integer conversions, andempty()for checking container sizes. Extensive Doxygen comments were removed from header files, streamlining the code and reducing verbosity. Theelemoperator was renamed tofillfor clarity. - Python IR Module Creation: A new
tilelang.irPython module was added, containing Python representations of the Tile operators. This module serves as the Python-side interface for the newly exposed C++ reflection mechanisms.
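The contract behind the `SEqualReduce`/`SHashReduce` pair has a familiar analogue in plain Python: structural equality and hashing must cover the same field set so that equal nodes always hash alike. A minimal, dependency-free sketch (the class name and fields below are illustrative, not TileLang API):

```python
# Hypothetical analogue of SEqualReduce/SHashReduce in plain Python.
# A frozen dataclass auto-generates __eq__ and __hash__ over ALL fields,
# mirroring how the C++ methods enumerate the same members (dst, value,
# region) so structurally equal nodes share a hash.
from dataclasses import dataclass


@dataclass(frozen=True)
class FillNodeSketch:
    dst: str       # destination buffer name (illustrative)
    value: float   # fill value
    region: tuple  # covered region, e.g. (start, extent)


a = FillNodeSketch("buf_A", 0.0, (0, 128))
b = FillNodeSketch("buf_A", 0.0, (0, 128))
assert a == b and hash(a) == hash(b)   # structural equality implies equal hash
assert a != FillNodeSketch("buf_B", 0.0, (0, 128))
```

If equality and hashing enumerated different field sets, hash-based containers (and TVM's structural-hash caching) would silently misbehave; keeping the two methods in lockstep is the whole point of the pattern.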
Code Review
This pull request is a significant refactoring to add Python reflection support for various tile operators. The changes are mostly well-structured, introducing RegisterReflection, SEqualReduce, and SHashReduce methods to the C++ operator nodes and creating corresponding Python classes. This is a great enhancement for introspection and debugging from the Python side. There are also several nice cleanups and stylistic improvements throughout the C++ code.
However, I've identified a critical bug in GemmNode::SEqualReduce that needs to be fixed. Additionally, a substantial amount of documentation has been removed from several header files, which could impact maintainability. Finally, some debugging print statements were left in the Python code and should be removed.
Actionable comments posted: 14
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
src/op/gemm.cc (1)
263-265: Potential division by zero in score calculation. The per-warp work calculation could result in division by zero if the denominators become zero.

```diff
 float m_per_warp = static_cast<float>(this->M) / (m * kMPerWarp);
 float n_per_warp = static_cast<float>(this->N) / (n * kNPerWarp);
+// Avoid division by zero in the score calculation
+if (n_per_warp == 0) continue;
 float score = std::abs(m_per_warp / n_per_warp - ideal);
```

src/op/copy.cc (1)

1316-1321: Incorrect IntImm extraction; use IntImmNode or Downcast consistently. `.as<IntImm>()` is not the supported pattern here and can break builds. Use `as<IntImmNode>()` or `Downcast<IntImm>(...)` as elsewhere in this file. Apply this diff:

```diff
-node->kernel = args[4].as<IntImm>().value()->value;
-node->stride = args[5].as<IntImm>().value()->value;
-node->dilation = args[6].as<IntImm>().value()->value;
-node->padding = args[7].as<IntImm>().value()->value;
-node->eviction_policy = args[8].as<IntImm>().value()->value;
+node->kernel = Downcast<IntImm>(args[4])->value;
+node->stride = Downcast<IntImm>(args[5])->value;
+node->dilation = Downcast<IntImm>(args[6])->value;
+node->padding = Downcast<IntImm>(args[7])->value;
+node->eviction_policy = Downcast<IntImm>(args[8])->value;
```

src/op/finalize_reducer.cc (1)

159-163: `set_num_inputs` is wrong (constructor reads args[1]). The constructor uses `args[1]` for `op`, but the op is registered with `.set_num_inputs(1)`. This will mis-validate calls. Apply this diff:

```diff
 TIR_REGISTER_TL_OP(FinalizeReducerOp, finalize_reducer)
-    .set_num_inputs(1)
+    .set_num_inputs(2)
     .set_attr<TCallEffectKind>("TCallEffectKind", Integer(CallEffectKind::kOpaque));
```

tilelang/language/proxy.py (1)

113-118: Fix `__getitem__`: support T.Tensor[128] and avoid over-wrapping tuples. The assert rejects non-tuple keys (e.g., T.Tensor[128]) and the current logic wraps tuples again, altering the call signature. Mirror BufferProxy's tolerant behavior.

```diff
-    def __getitem__(self, keys) -> tir.Buffer:
-        assert isinstance(keys, tuple)
-        # Single argument (the shape)
-        if all([type(s) not in (tuple, str, list) for s in keys]):
-            keys = (keys,)
-        return self(*keys)
+    def __getitem__(self, keys) -> tir.Buffer:
+        if not isinstance(keys, tuple):
+            # Single item like T.Tensor[128] -> shape=128
+            return self(keys)
+        # If second item is not a dtype string, treat the whole tuple as shape
+        if len(keys) >= 2 and not isinstance(keys[1], str):
+            return self(keys)
+        # Otherwise forward as positional args (e.g., (shape_tuple, "float32"))
+        return self(*keys)
```
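The tolerant dispatch suggested above can be checked with a dependency-free stand-in; this sketch replaces the real proxy (which returns a `tir.Buffer`) with a class that just echoes the parsed `(shape, dtype)` so the three subscription forms are easy to verify:

```python
# Hypothetical stand-in for the proxy's __getitem__ dispatch; the real
# class constructs a tir.Buffer, here we just return the parsed arguments.
class TensorProxySketch:
    def __call__(self, shape, dtype="float32"):
        if not isinstance(shape, tuple):
            shape = (shape,)          # normalize a bare int into a 1-D shape
        return (shape, dtype)

    def __getitem__(self, keys):
        if not isinstance(keys, tuple):
            return self(keys)         # T.Tensor[128] -> shape=128
        if len(keys) >= 2 and not isinstance(keys[1], str):
            return self(keys)         # T.Tensor[128, 256] -> whole tuple is the shape
        return self(*keys)            # T.Tensor[(128, 256), "float16"] -> positional args


T = TensorProxySketch()
assert T[128] == ((128,), "float32")
assert T[128, 256] == ((128, 256), "float32")
assert T[(128, 256), "float16"] == ((128, 256), "float16")
```

The key design point is the `isinstance(keys[1], str)` probe: a dtype is the only string that can legally appear in the subscript, so its presence disambiguates "tuple-as-shape" from "tuple-as-argument-list".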
🧹 Nitpick comments (11)
tilelang/ir/region.py (1)
3-3: Drop redundant import of the tvm.ffi submodule. You already access the decorator via the tvm alias; importing tvm.ffi again is unnecessary and can confuse which tvm reference is used.

```diff
-import tvm.ffi
```

tilelang/engine/phase.py (1)

97-99: Gate debug prints behind a pass-context flag. Unconditional prints spam stdout and can be very expensive (mod.script()) in large modules. Guard them with a config flag.

```diff
-    print("LowerTileOp")
-    print(mod.script())
+    if tilelang.transform.get_pass_context().config.get("tl.debug.dump_ir_before_lower_tileop", False):
+        print("LowerTileOp")
+        print(mod.script())
```

tilelang/ir/fill.py (1)
3-3: Remove unnecessary tvm.ffi import. The decorator is available via the tvm alias; this extra import is not needed.

```diff
-import tvm.ffi
```

src/op/region.h (1)

81-82: Optional: use a strong type for access_mask_. If this is a bitmask, consider an enum class with underlying int32_t for type-safety and stable ABI across platforms.
src/op/gemm.cc (1)
251-251: Add validation for division by zero. While unlikely in practice, the code should handle the edge case where `N` is 0 to prevent division by zero.

```diff
-  float ideal = this->N > 0 ? static_cast<float>(this->M) / this->N : 1.f;
+  float ideal = (this->N > 0 && this->M > 0) ? static_cast<float>(this->M) / this->N : 1.0f;
```

src/op/copy.h (2)
118-126: Reflection registration looks good but missing some fields. The RegisterReflection method exposes the key fields, but omits the `disable_tma` and `eviction_policy` fields that are public members of the class. Consider adding the missing fields for completeness:

```diff
   static void RegisterReflection() {
     namespace refl = tvm::ffi::reflection;
     refl::ObjectDef<CopyNode>()
         .def_ro("src", &CopyNode::src)
         .def_ro("dst", &CopyNode::dst)
         .def_ro("src_range", &CopyNode::src_range)
         .def_ro("dst_range", &CopyNode::dst_range)
-        .def_ro("coalesced_width", &CopyNode::coalesced_width);
+        .def_ro("coalesced_width", &CopyNode::coalesced_width)
+        .def_ro("disable_tma", &CopyNode::disable_tma)
+        .def_ro("eviction_policy", &CopyNode::eviction_policy);
   }
```
296-306: Conv2DIm2ColOpNode reflection missing fields. The reflection registration omits the `nhw_step` and `c_step` fields that are public members. Consider adding the missing fields:

```diff
   static void RegisterReflection() {
     namespace refl = tvm::ffi::reflection;
     refl::ObjectDef<Conv2DIm2ColOpNode>()
         .def_ro("src", &Conv2DIm2ColOpNode::src)
         .def_ro("dst", &Conv2DIm2ColOpNode::dst)
         .def_ro("stride", &Conv2DIm2ColOpNode::stride)
         .def_ro("padding", &Conv2DIm2ColOpNode::padding)
         .def_ro("dilation", &Conv2DIm2ColOpNode::dilation)
         .def_ro("kernel", &Conv2DIm2ColOpNode::kernel)
-        .def_ro("eviction_policy", &Conv2DIm2ColOpNode::eviction_policy);
+        .def_ro("eviction_policy", &Conv2DIm2ColOpNode::eviction_policy)
+        .def_ro("nhw_step", &Conv2DIm2ColOpNode::nhw_step)
+        .def_ro("c_step", &Conv2DIm2ColOpNode::c_step);
   }
```

tilelang/__init__.py (1)
108-109: Expose `tilelang.ir` — OK; drop unused `noqa` per Ruff hint. Ruff flags the `# noqa: F401` here as unused (RUF100). Consider removing it or adding `ir` to `__all__` if you want an explicit re-export. Apply this minimal diff:

```diff
-from . import ir  # noqa: F401
+from . import ir  # re-export: exposes tilelang.ir
```

Optional (outside this hunk): add at top level `__all__ = [*globals().get('__all__', []), 'ir']`.

tilelang/ir/__init__.py (1)
1-9: Replace unused `# noqa: F401` with explicit `__all__` for clean re-exports. Ruff flags RUF100 because F401 isn't enabled. Make the public surface explicit and remove the unused directives.

```diff
-from .fill import Fill  # noqa: F401
-from .atomic_add import AtomicAdd  # noqa: F401
-from .copy import Copy, Conv2DIm2ColOp  # noqa: F401
-from .gemm import Gemm  # noqa: F401
-from .gemm_sp import GemmSP  # noqa: F401
-from .finalize_reducer import FinalizeReducerOp  # noqa: F401
-from .parallel import ParallelOp  # noqa: F401
-from .reduce import ReduceOp, CumSumOp  # noqa: F401
-from .region import RegionOp  # noqa: F401
+from .fill import Fill
+from .atomic_add import AtomicAdd
+from .copy import Copy, Conv2DIm2ColOp
+from .gemm import Gemm
+from .gemm_sp import GemmSP
+from .finalize_reducer import FinalizeReducerOp
+from .parallel import ParallelOp
+from .reduce import ReduceOp, CumSumOp
+from .region import RegionOp
+
+__all__ = [
+    "Fill",
+    "AtomicAdd",
+    "Copy",
+    "Conv2DIm2ColOp",
+    "Gemm",
+    "GemmSP",
+    "FinalizeReducerOp",
+    "ParallelOp",
+    "ReduceOp",
+    "CumSumOp",
+    "RegionOp",
+]
```

tilelang/language/proxy.py (2)
89-99: Preserve explicit 0 values; use None checks instead of "or" defaults. Using "x or default" collapses legitimate falsy values (e.g., align=0 when a subclass sets a nonzero default). Prefer None checks.

```diff
-        # Use class defaults if not specified
-        scope = scope or self.default_scope
-        align = align or self.default_align
-        offset_factor = offset_factor or self.default_offset_factor
+        # Use class defaults only when None is provided
+        if scope is None:
+            scope = self.default_scope
+        if align is None:
+            align = self.default_align
+        if offset_factor is None:
+            offset_factor = self.default_offset_factor
```
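The pitfall here is easy to demonstrate in isolation; a short sketch (function and constant names are illustrative) contrasting the two styles:

```python
# Why "x or default" is risky: it silently replaces legitimate falsy
# values such as an explicit 0. A None-check preserves them.
DEFAULT_ALIGN = 64


def resolve_align_or(align):
    return align or DEFAULT_ALIGN            # buggy: an explicit 0 becomes 64


def resolve_align_none(align):
    return DEFAULT_ALIGN if align is None else align


assert resolve_align_or(0) == 64             # caller's explicit 0 is lost
assert resolve_align_none(0) == 0            # explicit 0 survives
assert resolve_align_none(None) == 64        # only None takes the default
```

The same reasoning applies to empty strings and empty containers: `x or default` treats all of them as "not specified", while `x is None` draws the line exactly where the API intends it.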
177-185: Nit: simplify predicate and shorten messages to appease TRY003. Minor cleanups: drop the unnecessary bool(...) and use concise error texts (or centralize messages).

```diff
-        if len(shape) != len(strides):
-            raise ValueError("Invalid shape/strides' dimensions")
-        if not bool(strides[-1] == 1):
-            # TODO(chenggang): shall we support non-contiguous even for the last dimension?
-            raise ValueError("The stride of the last dimension must be 1 (contiguous)")
+        if len(shape) != len(strides):
+            raise ValueError("shape/strides rank mismatch")
+        if strides[-1] != 1:
+            # TODO(chenggang): support non-contiguous last dim?
+            raise ValueError("last stride must be 1")
```
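The tightened checks behave as follows in a dependency-free sketch (the standalone function name is illustrative; in the PR the checks live inside the proxy class):

```python
# Sketch of the simplified validation suggested above: rank must match
# and the last dimension must be contiguous (stride 1).
def validate_strides(shape, strides):
    if len(shape) != len(strides):
        raise ValueError("shape/strides rank mismatch")
    if strides[-1] != 1:
        raise ValueError("last stride must be 1")


validate_strides((4, 8), (8, 1))        # contiguous row-major: accepted
try:
    validate_strides((4, 8), (1, 8))    # column-major: last stride is 8
except ValueError as e:
    assert str(e) == "last stride must be 1"
```

Note that `strides[-1] != 1` is exactly equivalent to the original `not bool(strides[-1] == 1)`; the rewrite only removes the redundant `bool(...)` wrapper.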
📒 Files selected for processing (30)
- src/op/atomic_add.cc (3 hunks)
- src/op/atomic_add.h (1 hunks)
- src/op/copy.cc (2 hunks)
- src/op/copy.h (2 hunks)
- src/op/fill.cc (2 hunks)
- src/op/fill.h (1 hunks)
- src/op/finalize_reducer.cc (1 hunks)
- src/op/finalize_reducer.h (1 hunks)
- src/op/gemm.cc (1 hunks)
- src/op/gemm.h (2 hunks)
- src/op/gemm_sp.cc (1 hunks)
- src/op/gemm_sp.h (1 hunks)
- src/op/parallel.cc (2 hunks)
- src/op/parallel.h (2 hunks)
- src/op/reduce.h (1 hunks)
- src/op/region.h (1 hunks)
- src/transform/layout_reducer.cc (4 hunks)
- tilelang/__init__.py (1 hunks)
- tilelang/engine/phase.py (1 hunks)
- tilelang/ir/__init__.py (1 hunks)
- tilelang/ir/atomic_add.py (1 hunks)
- tilelang/ir/copy.py (1 hunks)
- tilelang/ir/fill.py (1 hunks)
- tilelang/ir/finalize_reducer.py (1 hunks)
- tilelang/ir/gemm.py (1 hunks)
- tilelang/ir/gemm_sp.py (1 hunks)
- tilelang/ir/parallel.py (1 hunks)
- tilelang/ir/reduce.py (1 hunks)
- tilelang/ir/region.py (1 hunks)
- tilelang/language/proxy.py (2 hunks)
🪛 Ruff (0.12.2)

tilelang/__init__.py

- 108-108: Unused noqa directive (non-enabled: F401). Remove unused noqa directive. (RUF100)

tilelang/ir/__init__.py

- 1-9: Unused noqa directive (non-enabled: F401) on each of lines 1-9. Remove unused noqa directives. (RUF100)

tilelang/language/proxy.py

- 180-180: Avoid specifying long messages outside the exception class (TRY003)
- 183-183: Avoid specifying long messages outside the exception class (TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-amd
- GitHub Check: format-check
- GitHub Check: bot-task
🔇 Additional comments (49)
tilelang/ir/region.py (1)
6-8: FFI registration looks correct. The type key "tl.RegionOp" matches the C++ _type_key; the class derives from Node and Scriptable. OK.

tilelang/ir/fill.py (1)

6-8: FFI stub is consistent with the C++ FillNode. The registration key and base classes look good.
src/op/region.h (2)
97-104: RegisterReflection added — ensure static init calls it. Looks good. Please confirm there is a corresponding static init in the .cc to invoke RegisterReflection; otherwise fields won't be visible to Python reflection. Add to src/op/region.cc:

```cpp
TVM_FFI_STATIC_INIT_BLOCK(RegionOpNode) { RegionOpNode::RegisterReflection(); }
```

105-114: SEqualReduce/SHashReduce are correct and complete. All identity fields are included; ordering matches member layout.

src/op/atomic_add.h (1)

107-134: Reflection + SEqual/SHash additions look solid. Field coverage mirrors other ops; naming is consistent. Assuming the .cc contains the TVM_FFI_STATIC_INIT_BLOCK that calls RegisterReflection.
src/op/finalize_reducer.h (2)
33-38: LGTM! Standard reflection registration pattern. The RegisterReflection method correctly follows the established TVM FFI pattern, exposing the reducer and op fields as read-only properties via def_ro.

40-50: LGTM! Complete structural equality and hashing implementation. The implementation correctly provides:

- an SEqualReduce method comparing all relevant fields
- an SHashReduce method hashing all relevant fields
- the required type trait flags for the TVM FFI system

This follows the same pattern established in other operator nodes and enables proper structural equality semantics.
src/op/parallel.h (3)
61-61: LGTM! Exposing root member for reflection. Moving root_ to public access enables reflection registration while maintaining the existing functionality. The member is clearly documented and follows the established naming convention.

72-78: LGTM! Comprehensive reflection registration. The reflection registration correctly exposes all three key fields (root_, loop_layout_, predicate_) as read-only properties, enabling proper FFI access to the parallel operator's state.

80-92: LGTM! Complete structural equality and hashing support. The implementation provides proper structural comparison and hashing for all fields, with correct type trait declarations. This enables deterministic equality semantics and hash-based operations for ParallelOpNode.

src/op/parallel.cc (2)
381-383: LGTM! Consistent formatting improvement. Using explicit newline characters ('\n') instead of std::endl is more efficient, as it avoids unnecessary buffer flushes while maintaining the same formatting.

430-430: LGTM! Proper static reflection registration. The static initialization block correctly calls ParallelOpNode::RegisterReflection() at module load time, following the established TVM FFI pattern for reflection registration.

src/op/gemm.h (2)

56-78: LGTM! Clean reflection registration implementation. The reflection registration follows the established pattern seen in other operators. All relevant fields are properly exposed as read-only, which is appropriate for introspection purposes.

116-117: LGTM! Proper declaration of structural equality/hashing support. The static constexpr declarations correctly indicate that this class supports structural equality and hashing, which is required for TVM's object system.
src/op/gemm.cc (2)
456-459: LGTM! Cleaner string-based architecture parsing. The modernized approach using std::string methods is more idiomatic and safer than C-style string manipulation.
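For readers unfamiliar with the helper, the parsing it performs can be mirrored in a few lines of Python (the real GetArchInt is C++ using std::string::substr and std::stoi; this function name and the sm_-prefix assumption are taken from the surrounding discussion, and the sketch is illustrative only):

```python
# Python mirror of string-based arch parsing: "sm_90" -> 90.
def get_arch_int(arch: str) -> int:
    prefix = "sm_"
    if not arch.startswith(prefix):
        raise ValueError(f"unexpected arch string: {arch!r}")
    return int(arch[len(prefix):])   # equivalent of substr + stoi


assert get_arch_int("sm_90") == 90
assert get_arch_int("sm_80") == 80
```

The advantage over C-style pointer arithmetic is the same in both languages: the prefix check and the numeric conversion are explicit, and malformed input fails loudly instead of reading past the expected characters.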
246-274: Improved WGMMA Square policy warp partitioning logic. The new implementation provides a more targeted search for balanced warp partitioning:
- Uses a coarser iteration step (kGroup) instead of exhaustive search
- Computes ideal work ratio based on M/N dimensions
- Minimizes deviation from ideal ratio for better load balancing
This should result in better performance for Square policy on WGMMA-enabled hardware.
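The balanced-search idea described above can be sketched without any TileLang machinery; the per-warp tile sizes and the helper name below are illustrative placeholders, not the kernel's actual constants, and the real code also steps by kGroup rather than enumerating all divisors:

```python
# Sketch of square-policy warp partitioning: among factorizations of
# num_warps into (m_warp, n_warp), pick the one whose per-warp work
# ratio is closest to the ideal M/N ratio.
def pick_warp_partition(M, N, num_warps, m_per_warp=16, n_per_warp=8):
    ideal = M / N if N > 0 else 1.0
    best, best_score = (1, num_warps), float("inf")
    for m in range(1, num_warps + 1):
        if num_warps % m:
            continue                      # only exact factorizations
        n = num_warps // m
        m_work = M / (m * m_per_warp)     # rows of work per warp
        n_work = N / (n * n_per_warp)     # cols of work per warp
        if n_work == 0:
            continue                      # guard the division below
        score = abs(m_work / n_work - ideal)
        if score < best_score:
            best, best_score = (m, n), score
    return best


# Square-ish problem with 4 warps prefers the 2x2 split.
assert pick_warp_partition(128, 128, 4) == (2, 2)
```

The zero-denominator guard mirrors the division-by-zero concern raised in the gemm.cc comment above: skipping degenerate candidates is cheaper and clearer than special-casing the score formula.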
src/op/gemm_sp.h (3)
96-112: LGTM! Consistent reflection registration for GemmSP. The reflection registration properly exposes all relevant fields, including the additional E buffer that distinguishes GemmSP from regular Gemm.

114-121: LGTM! Correct structural equality implementation. The SEqualReduce implementation correctly compares all fields, including the E buffer specific to GemmSP.

123-138: LGTM! Complete hash reduction implementation. The SHashReduce correctly includes all fields in the hash computation, maintaining consistency with SEqualReduce.
src/op/gemm_sp.cc (1)
426-426: LGTM! Reflection registration follows the established pattern. The static initialization block correctly registers reflection for GemmSPNode at startup, consistent with the pattern used across other operator nodes in this PR.

src/op/fill.h (3)
87-93: LGTM! Well-structured reflection registration. The RegisterReflection() method correctly exposes all member fields (dst, value, region) as read-only properties through TVM's FFI reflection system, following the established pattern in other operator nodes.

95-103: LGTM! Correct structural equality and hashing implementation. The SEqualReduce and SHashReduce methods properly compare and hash all member fields, enabling structural equality comparisons and consistent hashing for FillNode objects.

104-105: LGTM! Correct type trait indicators. The type trait flags properly indicate that FillNode supports structural equality and hashing, enabling TVM's reflection system to use these methods.

src/op/fill.cc (3)
2-2: LGTM! File rename aligns with header structure. The file comment update from elem.cc to fill.cc is consistent with the header rename to fill.h and improves code organization.

7-7: LGTM! Header include updated correctly. The include directive properly references the renamed header file fill.h.

228-230: LGTM! Static reflection registration implemented correctly. The static initialization block ensures FillNode::RegisterReflection() is called at program startup, enabling FFI reflection for the Fill operator. This follows the same pattern established in other operator files.

src/transform/layout_reducer.cc (4)
15-15: LGTM! Header include change is appropriate. The change from elem.h to fill.h aligns with the usage of Fill::Get() in line 275.

135-135: Good use of idiomatic container check. Replacing layout_map.size() > 0 with !layout_map.empty() is more idiomatic and potentially more efficient.

181-181: Consistent use of empty() for container checks. Good consistency in replacing inside_reducer_range_.size() > 0 with !inside_reducer_range_.empty().

276-276: Improved safety check for args container. Using !op->args.empty() instead of op->args.size() > 0 is more idiomatic and consistent with the codebase style.

src/op/reduce.h (2)
36-45: Well-structured reflection registration for ReduceOpNode.The RegisterReflection method correctly exposes the read-only fields. The TODO comment appropriately indicates that the
typefield needs to be converted to an object node before being exposed.
61-62: Type traits correctly declared. The constexpr bool flags for structural equality and hashing are properly declared.
src/op/copy.cc (2)
300-301: Use of `.empty()` is correct and idiomatic. No behavior change; improves readability.
1562-1565: Static FFI reflection registration runs once — no duplicates found. Verified a single `TVM_FFI_STATIC_INIT_BLOCK` entry registering `CopyNode::RegisterReflection()` and `Conv2DIm2ColOpNode::RegisterReflection()`, with only one `_type_key = "tl.Copy"` definition; no other init blocks for these types in `src/op`.

src/op/finalize_reducer.cc (1)
164-165: Static FFI reflection registration — LGTM. Matches the header's RegisterReflection() definition.
tilelang/ir/gemm.py (1)
1-8: FFI registration validated: the C++ node `_type_key` in `src/op/gemm.h` is `"tl.Gemm"`, matching the Python stub.

tilelang/ir/finalize_reducer.py (1)
1-8: Approve FFI type registration – Verified that the C++ `_type_key = "tl.FinalizeReducerOp"` in `src/op/finalize_reducer.h` matches the Python registration key `tl.FinalizeReducerOp`.
6-8: Registration and stub LGTM. Type key "tl.AtomicAdd" matches the C++ node key. Nothing else blocking here.
tilelang/ir/gemm_sp.py (1)
6-8: Registration and stub LGTM. "tl.GemmSP" matches the expected naming. No issues spotted in the Python layer.
tilelang/ir/parallel.py (2)
6-8: Registration and stub LGTM. Matches C++ type key "tl.ParallelOp" shown in the header snippet.
1-3: No issues found — export and reflection registration confirmed. ParallelOp is re-exported in `tilelang/ir/__init__.py`, and ParallelOpNode::RegisterReflection() is statically initialized via TVM_FFI_STATIC_INIT_BLOCK in src/op/parallel.cc.

tilelang/ir/copy.py (2)
6-9: Copy registration LGTM. "tl.Copy" is consistent with common naming in the repo.
11-13: Type key matches C++ – no change needed

The Python `@tvm.ffi.register_object("tl.Conv2DIm2Col")` key exactly matches the C++ `_type_key = "tl.Conv2DIm2Col"` in `Conv2DIm2ColOpNode`. No update required. Likely an incorrect or invalid review comment.
tilelang/ir/reduce.py (2)
6-9: ReduceOp registration LGTM. Assuming C++ uses "tl.ReduceOp", this is consistent.
11-13: CumSumOp C++ _type_key matches expected "tl.CumSumOp"
Verified in src/op/reduce.h:88.tilelang/ir/__init__.py (1)
3-3: [The above scripts will list all files in `tilelang/ir` and show the first 200 lines of `__init__.py` to confirm where `Conv2DIm2ColOp` is defined.]

src/op/atomic_add.cc (2)
258-258: LGTM: idiomatic empty() check. Switch to loop_vars.empty() is clean and safe given the early return on scalar.
428-429: Static reflection init looks good. Static init for AtomicAddNode::RegisterReflection aligns with other ops.
tilelang/language/proxy.py (1)
156-166: Scope-threaded TensorProxy: LGTM. Shape normalization + forwarding data/scope to BaseTensorProxy is correct.
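As a loose pure-Python analogy (not TVM's actual machinery), the read-only reflection that `def_ro` provides corresponds to exposing fields as properties without setters; the class name and fields below are illustrative stand-ins:

```python
class FillNode:
    """Illustrative stand-in for a C++ node exposing read-only fields."""

    def __init__(self, dst, value):
        self._dst = dst
        self._value = value

    @property
    def dst(self):  # readable from Python, like def_ro("dst", ...)
        return self._dst

    @property
    def value(self):
        return self._value


node = FillNode("buf_A", 0.0)
print(node.dst)  # reading works: buf_A
try:
    node.dst = "buf_B"  # writing is rejected: no setter defined
except AttributeError:
    print("read-only")
```

This mirrors the intent of `def_ro`: Python code can introspect operator attributes but cannot mutate the underlying IR node through them.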
std::string arch = s.value();
if (arch.rfind("sm_", 0) == 0) {
  arch_int = std::stoi(arch.substr(3));
} else {
  arch_int = 0;
}
Avoid std::stoi; handle “sm_90a” and prevent exceptions in GetArchInt
std::stoi throws on non-digit suffixes (e.g., valid CUDA arch strings like “sm_90a”) and changes behavior vs. lenient parsing. Parse the digit span only and use from_chars to avoid exceptions.
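The lenient digit-prefix parse being proposed can be sketched in Python (the function name is illustrative, not part of the codebase):

```python
def get_arch_int(arch: str) -> int:
    """Parse the numeric part of a CUDA arch string such as 'sm_90a'.

    Returns 0 when the string does not start with 'sm_' followed by
    digits, mirroring the lenient 'return 0' contract discussed above.
    """
    if not arch.startswith("sm_"):
        return 0
    digits = ""
    for ch in arch[3:]:
        if ch.isdigit():
            digits += ch
        else:
            break  # stop at the first non-digit, e.g. the 'a' in 'sm_90a'
    return int(digits) if digits else 0


print(get_arch_int("sm_90a"))  # 90
print(get_arch_int("sm_80"))   # 80
print(get_arch_int("gfx90a"))  # 0
```

The key behavioral difference from `std::stoi`: a trailing non-digit suffix never raises; it simply terminates the digit span.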
Apply this diff:
- std::string arch = s.value();
- if (arch.rfind("sm_", 0) == 0) {
- arch_int = std::stoi(arch.substr(3));
- } else {
- arch_int = 0;
- }
+ std::string arch = s.value();
+ if (arch.rfind("sm_", 0) == 0) {
+ const char* first = arch.data() + 3;
+ const char* last = arch.data() + arch.size();
+ const char* p = first;
+ while (p < last && std::isdigit(static_cast<unsigned char>(*p))) ++p;
+ if (p > first) {
+ int v = 0;
+ auto res = std::from_chars(first, p, v);
+ if (res.ec == std::errc()) arch_int = v;
+ }
+ } else {
+ arch_int = 0;
+ }

Add required headers (outside the hunk):
#include <charconv>
#include <cctype>

bool SEqualReduce(const CopyNode *other, SEqualReducer equal) const {
  return equal(src, other->src) && equal(dst, other->dst) &&
         equal(src_range, other->src_range) &&
         equal(dst_range, other->dst_range) &&
         equal(coalesced_width, other->coalesced_width);
}
🛠️ Refactor suggestion
SEqualReduce missing fields for complete equality check.
The equality comparison excludes disable_tma and eviction_policy fields, which could lead to incorrect equality results.
Add the missing fields to ensure correct structural equality:
bool SEqualReduce(const CopyNode *other, SEqualReducer equal) const {
return equal(src, other->src) && equal(dst, other->dst) &&
equal(src_range, other->src_range) &&
equal(dst_range, other->dst_range) &&
- equal(coalesced_width, other->coalesced_width);
+ equal(coalesced_width, other->coalesced_width) &&
+ equal(disable_tma, other->disable_tma) &&
+ eviction_policy == other->eviction_policy;
🤖 Prompt for AI Agents
In src/op/copy.h around lines 128 to 133, the SEqualReduce implementation omits
disable_tma and eviction_policy from the equality checks, causing incomplete
structural equality; update the return expression to also compare
equal(disable_tma, other->disable_tma) and equal(eviction_policy,
other->eviction_policy) (using the same SEqualReducer equal function) so all
relevant fields are included in the combined boolean result.
void SHashReduce(SHashReducer hash_reduce) const {
  hash_reduce(src);
  hash_reduce(dst);
  hash_reduce(src_range);
  hash_reduce(dst_range);
  hash_reduce(coalesced_width);
}
🛠️ Refactor suggestion
SHashReduce missing fields for complete hashing.
Similar to SEqualReduce, the hash function excludes important fields.
Add the missing fields to ensure proper hashing:
void SHashReduce(SHashReducer hash_reduce) const {
hash_reduce(src);
hash_reduce(dst);
hash_reduce(src_range);
hash_reduce(dst_range);
hash_reduce(coalesced_width);
+ hash_reduce(disable_tma);
+ hash_reduce(static_cast<int>(eviction_policy));
🤖 Prompt for AI Agents
In src/op/copy.h around lines 135-141, SHashReduce only hashes src, dst,
src_range, dst_range and coalesced_width; update it so it mirrors SEqualReduce
and calls hash_reduce on every remaining member of the struct/class (e.g. any
flags, strides, offsets, size/count fields, enums or other config members) in
the same order SEqualReduce uses so all fields are included in the hash.
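A pure-Python sketch of why omitting fields is risky: two copies that differ only in an omitted field compare (and hash) as identical, so any cache keyed on the node silently conflates them. The class below is a simplified stand-in, not the real CopyNode:

```python
class CopyKey:
    """Simplified stand-in for a copy node used as a cache key."""

    def __init__(self, src, dst, eviction_policy):
        self.src = src
        self.dst = dst
        self.eviction_policy = eviction_policy

    # Incomplete structural equality: eviction_policy is ignored,
    # mirroring the fields missing from SEqualReduce/SHashReduce.
    def __eq__(self, other):
        return (self.src, self.dst) == (other.src, other.dst)

    def __hash__(self):
        return hash((self.src, self.dst))


a = CopyKey("A", "B", eviction_policy=0)
b = CopyKey("A", "B", eviction_policy=2)  # semantically different copy
cache = {a: "lowered-with-normal-eviction"}
print(b in cache)  # True: the differing policy is silently conflated
```

Including every semantically relevant field in both equality and hash keeps the hash/equal contract intact and avoids this collision.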
bool SEqualReduce(const Conv2DIm2ColOpNode *other,
                  SEqualReducer equal) const {
  return equal(src, other->src) && equal(dst, other->dst) &&
         equal(stride, other->stride) && equal(padding, other->padding) &&
         equal(dilation, other->dilation) && equal(kernel, other->kernel) &&
         equal(eviction_policy, other->eviction_policy);
}
🛠️ Refactor suggestion
Conv2DIm2ColOpNode SEqualReduce missing fields.
The equality check omits nhw_step and c_step fields.
Include all fields for complete equality:
bool SEqualReduce(const Conv2DIm2ColOpNode *other,
SEqualReducer equal) const {
return equal(src, other->src) && equal(dst, other->dst) &&
equal(stride, other->stride) && equal(padding, other->padding) &&
equal(dilation, other->dilation) && equal(kernel, other->kernel) &&
- equal(eviction_policy, other->eviction_policy);
+ equal(eviction_policy, other->eviction_policy) &&
+ equal(nhw_step, other->nhw_step) && equal(c_step, other->c_step);
🤖 Prompt for AI Agents
In src/op/copy.h around lines 308 to 314, the SEqualReduce implementation for
Conv2DIm2ColOpNode omits comparing the nhw_step and c_step members; update the
return expression to also call equal(nhw_step, other->nhw_step) and
equal(c_step, other->c_step) so all relevant fields are included in the equality
check.
void SHashReduce(SHashReducer hash_reduce) const {
  hash_reduce(src);
  hash_reduce(dst);
  hash_reduce(stride);
  hash_reduce(padding);
  hash_reduce(dilation);
  hash_reduce(kernel);
  hash_reduce(eviction_policy);
}
🛠️ Refactor suggestion
Conv2DIm2ColOpNode SHashReduce missing fields.
The hash function should include all fields for proper hashing.
Add the missing fields:
void SHashReduce(SHashReducer hash_reduce) const {
hash_reduce(src);
hash_reduce(dst);
hash_reduce(stride);
hash_reduce(padding);
hash_reduce(dilation);
hash_reduce(kernel);
hash_reduce(eviction_policy);
+ hash_reduce(nhw_step);
+ hash_reduce(c_step);
🤖 Prompt for AI Agents
In src/op/copy.h around lines 316 to 324, the SHashReduce implementation for
Conv2DIm2ColOpNode only hashes src, dst, stride, padding, dilation, kernel and
eviction_policy; update it to call hash_reduce on every remaining member field
of the Conv2DIm2ColOpNode struct (i.e., add hash_reduce(field) calls for each
field declared in the struct that is not yet included) so that all struct fields
are incorporated into the hash function.
tilelang/ir/__init__.py
Outdated
@@ -0,0 +1,9 @@
from .fill import Fill  # noqa: F401
from .atomic_add import AtomicAdd  # noqa: F401
from .copy import Copy, Conv2DIm2ColOp  # noqa: F401
Fix incorrect import source for Conv2DIm2ColOp
Conv2DIm2ColOp is defined in tilelang/ir/Conv2DIm2ColOp.py, not in tilelang/ir/copy.py. Importing from .copy will fail at import time.
-from .copy import Copy, Conv2DIm2ColOp # noqa: F401
+from .copy import Copy
+from .Conv2DIm2ColOp import Conv2DIm2ColOp📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| from .copy import Copy, Conv2DIm2ColOp # noqa: F401 | |
| from .copy import Copy | |
| from .Conv2DIm2ColOp import Conv2DIm2ColOp |
🧰 Tools
🪛 Ruff (0.12.2)
3-3: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
🤖 Prompt for AI Agents
In tilelang/ir/__init__.py around line 3, the Conv2DIm2ColOp symbol is being
imported from .copy but its implementation lives in
tilelang/ir/Conv2DIm2ColOp.py; update the import to pull Conv2DIm2ColOp from the
correct module (e.g. from .Conv2DIm2ColOp import Conv2DIm2ColOp) while keeping
the existing Copy import as needed so the package-level exports import the real
symbol source.
tilelang/ir/atomic_add.py
Outdated
from tilelang import tvm as tvm
from tvm.ir import Node, Scriptable
import tvm.ffi
💡 Verification agent
🧩 Analysis chain
Verify exports and type-key consistency across repo (one-shot check).
Run this to confirm Python exports and C++ type keys are aligned:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Verify Python exports =="
rg -nP 'from\s+\.atomic_add\s+import\s+AtomicAdd\b' tilelang/ir/__init__.py || echo "Missing export: AtomicAdd"
echo "== Verify C++ type key =="
rg -nP '_type_key\s*=\s*"tl\.AtomicAdd"' -g 'src/**.{h,cc}' -C2

Length of output: 339
Add missing C++ _type_key for AtomicAdd
The C++ IR node class for AtomicAdd doesn’t define _type_key = "tl.AtomicAdd", so the Python export won’t map correctly—add that definition in the corresponding header/source.
🤖 Prompt for AI Agents
In tilelang/ir/atomic_add.py around lines 1 to 3, the Python side expects a C++
IR node with a matching type key but the C++ AtomicAdd class is missing its
_type_key; update the C++ IR node definition (header/source where class
AtomicAdd is declared) to include the line that sets the type key to
"tl.AtomicAdd" (e.g., static constexpr const char* _type_key = "tl.AtomicAdd";
or equivalent in your codebase) and rebuild so the Python export/registration
maps correctly.
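The failure mode is easy to see with a toy registry (a pure-Python simplification of what tvm.ffi.register_object does on the Python side; none of this is TVM's actual implementation):

```python
_registry = {}


def register_object(type_key):
    """Map a string type key to a Python class, like tvm.ffi.register_object."""
    def deco(cls):
        _registry[type_key] = cls
        return cls
    return deco


@register_object("tl.AtomicAdd")
class AtomicAdd:
    pass


def wrap(cpp_type_key):
    """Simulate mapping a C++ node's _type_key back to its Python class."""
    cls = _registry.get(cpp_type_key)
    if cls is None:
        raise KeyError(f"no Python class registered for {cpp_type_key!r}")
    return cls()


print(type(wrap("tl.AtomicAdd")).__name__)  # AtomicAdd
# wrap("tl.Missing") would raise KeyError — the symptom when the C++
# side never defines a matching _type_key for the registered string.
```

Both sides must agree on the exact string: the Python decorator registers the key, and the C++ `_type_key` is what objects report when crossing the FFI boundary.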
tilelang/ir/gemm_sp.py
Outdated
from tilelang import tvm as tvm
from tvm.ir import Node, Scriptable
import tvm.ffi
💡 Verification agent
🧩 Analysis chain
Quick consistency check recommended.
Confirm C++ type key and Python export exist:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Python export =="
rg -nP 'from\s+\.gemm_sp\s+import\s+GemmSP\b' tilelang/ir/__init__.py || echo "Missing export: GemmSP"
echo "== C++ type key =="
rg -nP '_type_key\s*=\s*"tl\.GemmSP"' -g 'src/**.{h,cc}' -C2

Length of output: 294
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Locate any C++ definition of GemmSP =="
rg -nP '\bGemmSP\b' -g 'src/**/*.{h,cc,cpp}' -n -C2 || echo "No C++ definition of GemmSP found"
echo "== Search for gemm_sp filename =="
rg -l 'gemm_sp' -g 'src/**/*.{h,cc,cpp}' || echo "No file gemm_sp.* found in src"

Length of output: 3695
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Display Python class definition and type key =="
sed -n '1,200p' tilelang/ir/gemm_sp.py

Length of output: 332
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Show GemmSPNode definition in header =="
rg -nP 'class\s+GemmSPNode' -g 'src/op/gemm_sp.h' -n -C3 || echo "GemmSPNode not found"
echo "== Check for node registration macro =="
rg -nP 'TVM_REGISTER_NODE_TYPE' -g 'src/op/gemm_sp.*' -n -C3 || echo "No TVM_REGISTER_NODE_TYPE macro for GemmSPNode"

Length of output: 782
Missing C++ type registration for GemmSPNode
Python uses @tvm.ffi.register_object("tl.GemmSP"), but no TVM_REGISTER_NODE_TYPE(GemmSPNode); (or equivalent) is present in the C++ sources. Add the node registration macro (e.g. in src/op/gemm_sp.cc) so that the "tl.GemmSP" key is recognized by the C++ runtime.
- Deleted files for Fill, AtomicAdd, Copy, Gemm, GemmSP, FinalizeReducer, Parallel, Reduce, and Region operators to streamline the codebase. - This cleanup enhances maintainability by removing unused code and improving overall organization of the module.
- Updated import statements in `tilelang/ir.py` to reflect changes in the TVM library structure, enhancing clarity and maintainability of the codebase.
Actionable comments posted: 2
🧹 Nitpick comments (3)
tilelang/ir.py (3)
22-26: Align naming with type key: add an alias for Conv2DIm2Col

The class is `Conv2DIm2ColOp` but it registers under `"tl.Conv2DIm2Col"`. Add a public alias so Python users can import `Conv2DIm2Col` consistently with the type key.

@register_object("tl.Conv2DIm2Col")
class Conv2DIm2ColOp(Node, Scriptable):
-    pass
+    pass
+
+# Public alias for consistency with the registered type key
+Conv2DIm2Col = Conv2DIm2ColOp
8-59: Prefer `pass` over `...` in empty class bodies

Ellipsis works but tends to trip linters and reads less cleanly.

If not applying the larger diff above, minimally:

-    ...
+    pass

(Repeat for each class.)
59-59: Export an explicit public API via `__all__`

Make star-imports predictable and stabilize the package surface.

+__all__ = [
+    "Fill",
+    "AtomicAdd",
+    "Copy",
+    "Conv2DIm2Col",  # alias exported
+    "Conv2DIm2ColOp",
+    "Gemm",
+    "GemmSP",
+    "FinalizeReducerOp",
+    "ParallelOp",
+    "ReduceOp",
+    "CumSumOp",
+    "RegionOp",
+]
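The effect of `__all__` on star-imports can be sketched with a synthetic in-memory module (since the real package is not importable here, the module and names below are illustrative):

```python
import sys
import types

# Build a throwaway module that defines __all__, like the proposed ir module.
mod = types.ModuleType("demo_ir")
exec(
    "__all__ = ['Fill']\n"
    "Fill = 'Fill-node'\n"
    "RegionOp = 'Region-node'",  # defined but deliberately not exported
    mod.__dict__,
)
sys.modules["demo_ir"] = mod

ns = {}
exec("from demo_ir import *", ns)
print("Fill" in ns)      # True: exported via __all__
print("RegionOp" in ns)  # False: defined but not listed in __all__
```

Without `__all__`, star-import falls back to every public (non-underscore) name, so adding the list pins the package surface explicitly.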
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- src/op/copy.cc (3 hunks)
- tilelang/ir.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/op/copy.cc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: format-check
🔇 Additional comments (1)
tilelang/ir.py (1)
1-1: Eager import and type-key alignment verified

`tilelang.ir` is imported in `tilelang/__init__.py`; each Python `tvm.ffi.register_object("tl.*")` key appears exactly once; all C++ `static constexpr const char* _type_key` values match the Python registration strings.
from tilelang import tvm as tvm
from tvm.ir.base import Node
from tvm.runtime import Scriptable
import tvm.ffi
Fix decorator import: mixing tilelang.tvm with tvm.ffi is brittle and can break registration
The local name tvm points to tilelang.tvm, while import tvm.ffi loads the upstream module. Using @tvm.ffi.register_object through the tilelang.tvm alias may fail if that alias doesn’t expose .ffi. Bind the decorator directly and avoid the aliasing trap.
Apply:
-from tilelang import tvm as tvm
-from tvm.ir.base import Node
-from tvm.runtime import Scriptable
-import tvm.ffi
+import tvm
+from tvm.ir.base import Node
+from tvm.runtime import Scriptable
+try:
+ from tvm._ffi import register_object # standard decorator in TVM
+except Exception: # pragma: no cover — back-compat if environments expose tvm.ffi
+ register_object = tvm.ffi.register_object📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| from tilelang import tvm as tvm | |
| from tvm.ir.base import Node | |
| from tvm.runtime import Scriptable | |
| import tvm.ffi | |
| import tvm | |
| from tvm.ir.base import Node | |
| from tvm.runtime import Scriptable | |
| try: | |
| from tvm._ffi import register_object # standard decorator in TVM | |
| except Exception: # pragma: no cover — back-compat if environments expose tvm.ffi | |
| register_object = tvm.ffi.register_object |
🤖 Prompt for AI Agents
In tilelang/ir.py around lines 1 to 4, the code aliases tilelang.tvm to the
local name tvm while also importing tvm.ffi, which may cause
@tvm.ffi.register_object to fail if the tiled alias lacks .ffi; change the
imports to import the real upstream tvm/ffi symbol directly (for example, import
tvm and from tvm import ffi or from tvm.ffi import register_object) and update
any decorator uses to call the directly imported register_object (or
tvm.ffi.register_object) instead of relying on tilelang.tvm alias.
tilelang/ir.py
Outdated
@tvm.ffi.register_object("tl.Fill")
class Fill(Node, Scriptable):
    ...


@tvm.ffi.register_object("tl.AtomicAdd")
class AtomicAdd(Node, Scriptable):
    ...


@tvm.ffi.register_object("tl.Copy")
class Copy(Node, Scriptable):
    ...


@tvm.ffi.register_object("tl.Conv2DIm2Col")
class Conv2DIm2ColOp(Node, Scriptable):
    ...


@tvm.ffi.register_object("tl.Gemm")
class Gemm(Node, Scriptable):
    ...


@tvm.ffi.register_object("tl.GemmSP")
class GemmSP(Node, Scriptable):
    ...


@tvm.ffi.register_object("tl.FinalizeReducerOp")
class FinalizeReducerOp(Node, Scriptable):
    ...


@tvm.ffi.register_object("tl.ParallelOp")
class ParallelOp(Node, Scriptable):
    ...


@tvm.ffi.register_object("tl.ReduceOp")
class ReduceOp(Node, Scriptable):
    ...


@tvm.ffi.register_object("tl.CumSumOp")
class CumSumOp(Node, Scriptable):
    ...


@tvm.ffi.register_object("tl.RegionOp")
class RegionOp(Node, Scriptable):
    ...
🛠️ Refactor suggestion
Use the bound decorator handle (@register_object) for all registrations
After binding register_object, replace @tvm.ffi.register_object(...) with @register_object(...) for clarity and version-compatibility.
-@tvm.ffi.register_object("tl.Fill")
+@register_object("tl.Fill")
class Fill(Node, Scriptable):
- ...
+ pass
-@tvm.ffi.register_object("tl.AtomicAdd")
+@register_object("tl.AtomicAdd")
class AtomicAdd(Node, Scriptable):
- ...
+ pass
-@tvm.ffi.register_object("tl.Copy")
+@register_object("tl.Copy")
class Copy(Node, Scriptable):
- ...
+ pass
-@tvm.ffi.register_object("tl.Conv2DIm2Col")
+@register_object("tl.Conv2DIm2Col")
-class Conv2DIm2ColOp(Node, Scriptable):
- ...
+class Conv2DIm2ColOp(Node, Scriptable):
+ pass
-@tvm.ffi.register_object("tl.Gemm")
+@register_object("tl.Gemm")
class Gemm(Node, Scriptable):
- ...
+ pass
-@tvm.ffi.register_object("tl.GemmSP")
+@register_object("tl.GemmSP")
class GemmSP(Node, Scriptable):
- ...
+ pass
-@tvm.ffi.register_object("tl.FinalizeReducerOp")
+@register_object("tl.FinalizeReducerOp")
class FinalizeReducerOp(Node, Scriptable):
- ...
+ pass
-@tvm.ffi.register_object("tl.ParallelOp")
+@register_object("tl.ParallelOp")
class ParallelOp(Node, Scriptable):
- ...
+ pass
-@tvm.ffi.register_object("tl.ReduceOp")
+@register_object("tl.ReduceOp")
class ReduceOp(Node, Scriptable):
- ...
+ pass
-@tvm.ffi.register_object("tl.CumSumOp")
+@register_object("tl.CumSumOp")
class CumSumOp(Node, Scriptable):
- ...
+ pass
-@tvm.ffi.register_object("tl.RegionOp")
+@register_object("tl.RegionOp")
class RegionOp(Node, Scriptable):
- ...
+ pass📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| @tvm.ffi.register_object("tl.Fill") | |
| class Fill(Node, Scriptable): | |
| ... | |
| @tvm.ffi.register_object("tl.AtomicAdd") | |
| class AtomicAdd(Node, Scriptable): | |
| ... | |
| @tvm.ffi.register_object("tl.Copy") | |
| class Copy(Node, Scriptable): | |
| ... | |
| @tvm.ffi.register_object("tl.Conv2DIm2Col") | |
| class Conv2DIm2ColOp(Node, Scriptable): | |
| ... | |
| @tvm.ffi.register_object("tl.Gemm") | |
| class Gemm(Node, Scriptable): | |
| ... | |
| @tvm.ffi.register_object("tl.GemmSP") | |
| class GemmSP(Node, Scriptable): | |
| ... | |
| @tvm.ffi.register_object("tl.FinalizeReducerOp") | |
| class FinalizeReducerOp(Node, Scriptable): | |
| ... | |
| @tvm.ffi.register_object("tl.ParallelOp") | |
| class ParallelOp(Node, Scriptable): | |
| ... | |
| @tvm.ffi.register_object("tl.ReduceOp") | |
| class ReduceOp(Node, Scriptable): | |
| ... | |
| @tvm.ffi.register_object("tl.CumSumOp") | |
| class CumSumOp(Node, Scriptable): | |
| ... | |
| @tvm.ffi.register_object("tl.RegionOp") | |
| class RegionOp(Node, Scriptable): | |
| ... | |
| @register_object("tl.Fill") | |
| class Fill(Node, Scriptable): | |
| pass | |
| @register_object("tl.AtomicAdd") | |
| class AtomicAdd(Node, Scriptable): | |
| pass | |
| @register_object("tl.Copy") | |
| class Copy(Node, Scriptable): | |
| pass | |
| @register_object("tl.Conv2DIm2Col") | |
| class Conv2DIm2ColOp(Node, Scriptable): | |
| pass | |
| @register_object("tl.Gemm") | |
| class Gemm(Node, Scriptable): | |
| pass | |
| @register_object("tl.GemmSP") | |
| class GemmSP(Node, Scriptable): | |
| pass | |
| @register_object("tl.FinalizeReducerOp") | |
| class FinalizeReducerOp(Node, Scriptable): | |
| pass | |
| @register_object("tl.ParallelOp") | |
| class ParallelOp(Node, Scriptable): | |
| pass | |
| @register_object("tl.ReduceOp") | |
| class ReduceOp(Node, Scriptable): | |
| pass | |
| @register_object("tl.CumSumOp") | |
| class CumSumOp(Node, Scriptable): | |
| pass | |
| @register_object("tl.RegionOp") | |
| class RegionOp(Node, Scriptable): | |
| pass |
🤖 Prompt for AI Agents
In tilelang/ir.py around lines 7 to 59, the classes are registered using
tvm.ffi.register_object(...) directly; replace those usages with the
already-bound decorator alias register_object(...) to improve clarity and
compatibility. Update each decorator line (e.g.,
@tvm.ffi.register_object("tl.Fill")) to use @register_object("tl.Fill") for all
listed classes (Fill, AtomicAdd, Copy, Conv2DIm2ColOp, Gemm, GemmSP,
FinalizeReducerOp, ParallelOp, ReduceOp, CumSumOp, RegionOp) so they use the
bound handle instead of accessing tvm.ffi each time.
…bility - Updated the `Gemm` and `GemmSP` classes to utilize a new `GemmWarpPolicy` object for warp partitioning, improving encapsulation and readability. - Removed deprecated `ComputeWarpPartition` methods and replaced them with calls to the new policy object, streamlining the code. - Cleaned up comments and unnecessary code in `gemm.cc`, `gemm_sp.cc`, and related header files to enhance overall clarity. - Introduced a new `GemmWarpPolicyNode` class to manage warp policy attributes and methods, facilitating better organization of related functionalities. - Updated reflection methods to include the new policy structure, ensuring proper registration and introspection capabilities.
…arity and maintainability - Replaced multiple conditional checks for reduce types with a single ReduceType object, simplifying the code structure. - Introduced a new ReduceTypeNode class to encapsulate reduce type logic and methods, enhancing organization. - Updated MakeInitValue, MakeReduce, and Lower methods to leverage the new ReduceType class, improving readability. - Added Python bindings for the ReduceType class in tilelang IR module to ensure proper registration and usability.
- Cleaned up formatting and whitespace in `atomic_add.h`, `copy.h`, `fill.h`, `reduce.cc`, and `reduce.h` to enhance code clarity. - Consolidated comments and adjusted line breaks for better organization and maintainability across multiple operator definitions.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
src/op/atomic_add.h (1)
75-81: Add missing `TVM_REGISTER_NODE_TYPE` for `AtomicAddNode`. The TIR operator is already registered via `TIR_REGISTER_TL_OP(AtomicAdd, atomicadd)`, but there is no `TVM_REGISTER_NODE_TYPE(AtomicAddNode)` — please add it (e.g. alongside other node registrations in `src/ir.cc` or near the atomic_add implementation).

src/op/copy.h (1)
91-99: Uninitialized eviction_policy can cause nondeterminism. Initialize to a sensible default to keep equality/hashing stable.
Apply:
enum class EvictionPolicy : uint8_t {
  kEvictNormal = 0,
  kEvictFirst = 1,
  kEvictLast = 2,
};
- uint8_t eviction_policy; // Policy for cache eviction
+ uint8_t eviction_policy = static_cast<uint8_t>(EvictionPolicy::kEvictNormal); // Policy for cache eviction

src/op/gemm.cc (2)
256-290: Square policy (non-WGMMA) can return invalid partitions (m*n != num_warps) and ignores N constraints. The loop uses integer division without enforcing m*n == num_warps and doesn't bound n by max_n_warps (N/8). This will silently drop warps and produce inconsistent layouts. Align with the WGMMA branch and the Python reference by enforcing both constraints.
Apply:
  } else if (this->isSquare()) {
    // First calculate the maximum possible warps for each dimension
-   int max_m_warps =
-       M / kMPerWarp; // Each warp needs at least 16 elements in M
+   int max_m_warps = M / kMPerWarp; // Each warp needs at least 16 elements in M
+   int max_n_warps = N / kNPerWarp; // Each warp needs at least 8 elements in N
@@
-   // Try all possible combinations that satisfy the constraints
-   for (int m = 1; m <= max_m_warps && m <= num_warps; m++) {
-     int n = num_warps / m;
+   // Try all possible combinations that satisfy the constraints
+   for (int m = 1; m <= max_m_warps && m <= num_warps; m++) {
+     if (num_warps % m != 0) continue;
+     int n = num_warps / m;
+     if (n > max_n_warps) continue;
@@
-     float m_per_warp = static_cast<float>(M) / (m * kMPerWarp);
-     float n_per_warp = static_cast<float>(N) / (n * kNPerWarp);
+     float m_per_warp = static_cast<float>(M) / (m * kMPerWarp);
+     float n_per_warp = static_cast<float>(N) / (n * kNPerWarp);
401-407: GetArchInt may throw on a non-numeric suffix; honor the documented "return 0" behavior. `std::stoi` will throw for values like "sm_" or "sm_xx". Guard parsing or catch exceptions and fall back to 0.
Apply:
```diff
-  std::string arch = s.value();
-  if (arch.rfind("sm_", 0) == 0) {
-    arch_int = std::stoi(arch.substr(3));
-  } else {
-    arch_int = 0;
-  }
+  std::string arch = s.value();
+  if (arch.rfind("sm_", 0) == 0) {
+    const std::string suffix = arch.substr(3);
+    bool all_digits = !suffix.empty() &&
+        std::all_of(suffix.begin(), suffix.end(), ::isdigit);
+    if (all_digits) {
+      arch_int = std::stoi(suffix);
+    } else {
+      arch_int = 0;
+    }
+  } else {
+    arch_int = 0;
+  }
```
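The guarded parse is easy to model in Python; a sketch of the same fallback-to-zero contract (illustrative, not the TileLang code):

```python
def parse_sm_arch(arch: str) -> int:
    """Return the numeric part of an 'sm_XX' arch string, or 0 if malformed."""
    prefix = "sm_"
    if not arch.startswith(prefix):
        return 0
    suffix = arch[len(prefix):]
    # reject empty or non-numeric suffixes such as "sm_" or "sm_xx"
    return int(suffix) if suffix.isdigit() else 0
```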
♻️ Duplicate comments (10)
src/op/copy.h (6)
101-109: Reflection omits `disable_tma` and `eviction_policy`. Expose all semantically relevant fields for parity with equality/hash and Python reflection.
Apply:
```diff
 static void RegisterReflection() {
   namespace refl = tvm::ffi::reflection;
   refl::ObjectDef<CopyNode>()
       .def_ro("src", &CopyNode::src)
       .def_ro("dst", &CopyNode::dst)
       .def_ro("src_range", &CopyNode::src_range)
       .def_ro("dst_range", &CopyNode::dst_range)
-      .def_ro("coalesced_width", &CopyNode::coalesced_width);
+      .def_ro("coalesced_width", &CopyNode::coalesced_width)
+      .def_ro("disable_tma", &CopyNode::disable_tma)
+      .def_ro("eviction_policy", &CopyNode::eviction_policy);
 }
```
111-116: Structural equality missing `disable_tma` and `eviction_policy`. Excluding these fields can lead to cache-key collisions and incorrect common-subexpression reuse.
Apply:
```diff
 bool SEqualReduce(const CopyNode *other, SEqualReducer equal) const {
   return equal(src, other->src) && equal(dst, other->dst) &&
          equal(src_range, other->src_range) &&
          equal(dst_range, other->dst_range) &&
-         equal(coalesced_width, other->coalesced_width);
+         equal(coalesced_width, other->coalesced_width) &&
+         equal(disable_tma, other->disable_tma) &&
+         eviction_policy == other->eviction_policy;
 }
```
118-124: Hash reduction missing `disable_tma` and `eviction_policy`. The hash must mirror equality to avoid violating the hash/equal contract.
Apply:
```diff
 void SHashReduce(SHashReducer hash_reduce) const {
   hash_reduce(src);
   hash_reduce(dst);
   hash_reduce(src_range);
   hash_reduce(dst_range);
   hash_reduce(coalesced_width);
+  hash_reduce(disable_tma);
+  hash_reduce(static_cast<int>(eviction_policy));
 }
```
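The hash/equal contract referenced here is language-independent: any field that participates in equality must also feed the hash, or equal objects can land in different buckets. A small Python illustration (field names are illustrative):

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # generates __eq__ and __hash__ over ALL fields
class CopyKey:
    src: str
    dst: str
    disable_tma: bool
    eviction_policy: int


a = CopyKey("A_shared", "B_local", False, 0)
b = CopyKey("A_shared", "B_local", False, 0)
# equal objects must hash equally, or cache lookups keyed on them break
assert a == b and hash(a) == hash(b)
```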
279-289: Conv2DIm2Col reflection omits `nhw_step` and `c_step`. Expose the stepping params for full introspection and scripting.
Apply:
```diff
 static void RegisterReflection() {
   namespace refl = tvm::ffi::reflection;
   refl::ObjectDef<Conv2DIm2ColOpNode>()
       .def_ro("src", &Conv2DIm2ColOpNode::src)
       .def_ro("dst", &Conv2DIm2ColOpNode::dst)
       .def_ro("stride", &Conv2DIm2ColOpNode::stride)
       .def_ro("padding", &Conv2DIm2ColOpNode::padding)
       .def_ro("dilation", &Conv2DIm2ColOpNode::dilation)
       .def_ro("kernel", &Conv2DIm2ColOpNode::kernel)
-      .def_ro("eviction_policy", &Conv2DIm2ColOpNode::eviction_policy);
+      .def_ro("eviction_policy", &Conv2DIm2ColOpNode::eviction_policy)
+      .def_ro("nhw_step", &Conv2DIm2ColOpNode::nhw_step)
+      .def_ro("c_step", &Conv2DIm2ColOpNode::c_step);
 }
```
291-297: Conv2DIm2Col equality missing `nhw_step` and `c_step`. These affect behavior and must participate in equality.
Apply:
```diff
 bool SEqualReduce(const Conv2DIm2ColOpNode *other, SEqualReducer equal) const {
   return equal(src, other->src) && equal(dst, other->dst) &&
          equal(stride, other->stride) && equal(padding, other->padding) &&
          equal(dilation, other->dilation) && equal(kernel, other->kernel) &&
-         equal(eviction_policy, other->eviction_policy);
+         equal(eviction_policy, other->eviction_policy) &&
+         equal(nhw_step, other->nhw_step) &&
+         equal(c_step, other->c_step);
 }
```
299-307: Conv2DIm2Col hashing missing `nhw_step` and `c_step`. The hash must include all fields used in equality.
Apply:
```diff
 void SHashReduce(SHashReducer hash_reduce) const {
   hash_reduce(src);
   hash_reduce(dst);
   hash_reduce(stride);
   hash_reduce(padding);
   hash_reduce(dilation);
   hash_reduce(kernel);
   hash_reduce(eviction_policy);
+  hash_reduce(nhw_step);
+  hash_reduce(c_step);
 }
```

src/op/gemm_sp.h (1)
58-65: Include `policy` in structural equality. `policy` participates in hashing and reflection but is omitted from equality, breaking invariants.
Apply:
```diff
 bool SEqualReduce(const GemmSPNode *other, SEqualReducer equal) const {
-  return equal(A, other->A) && equal(B, other->B) && equal(C, other->C) &&
+  return equal(policy, other->policy) &&
+         equal(A, other->A) && equal(B, other->B) && equal(C, other->C) &&
          equal(E, other->E) && equal(trans_A, other->trans_A) &&
          equal(trans_B, other->trans_B) && equal(M, other->M) &&
          equal(N, other->N) && equal(K, other->K) &&
          equal(clear_accum, other->clear_accum) &&
          equal(kPack, other->kPack) && equal(wg_wait, other->wg_wait);
 }
```

tilelang/ir.py (2)
7-70: Use the bound decorator handle for all registrations. After binding `register_object`, replace `@tvm.ffi.register_object(...)` with `@register_object(...)` for clarity and version compatibility.

```diff
-@tvm.ffi.register_object("tl.Fill")
+@register_object("tl.Fill")
 class Fill(Node, Scriptable): ...

-@tvm.ffi.register_object("tl.AtomicAdd")
+@register_object("tl.AtomicAdd")
 class AtomicAdd(Node, Scriptable): ...

-@tvm.ffi.register_object("tl.Copy")
+@register_object("tl.Copy")
 class Copy(Node, Scriptable): ...

-@tvm.ffi.register_object("tl.Conv2DIm2Col")
+@register_object("tl.Conv2DIm2Col")
 class Conv2DIm2ColOp(Node, Scriptable): ...

-@tvm.ffi.register_object("tl.GemmWarpPolicy")
+@register_object("tl.GemmWarpPolicy")
 class GemmWarpPolicy(Node, Scriptable): ...

-@tvm.ffi.register_object("tl.Gemm")
+@register_object("tl.Gemm")
 class Gemm(Node, Scriptable): ...

-@tvm.ffi.register_object("tl.GemmSP")
+@register_object("tl.GemmSP")
 class GemmSP(Node, Scriptable): ...

-@tvm.ffi.register_object("tl.FinalizeReducerOp")
+@register_object("tl.FinalizeReducerOp")
 class FinalizeReducerOp(Node, Scriptable): ...

-@tvm.ffi.register_object("tl.ParallelOp")
+@register_object("tl.ParallelOp")
 class ParallelOp(Node, Scriptable): ...

-@tvm.ffi.register_object("tl.ReduceOp")
+@register_object("tl.ReduceOp")
 class ReduceOp(Node, Scriptable): ...

-@tvm.ffi.register_object("tl.CumSumOp")
+@register_object("tl.CumSumOp")
 class CumSumOp(Node, Scriptable): ...

-@tvm.ffi.register_object("tl.RegionOp")
+@register_object("tl.RegionOp")
 class RegionOp(Node, Scriptable): ...

-@tvm.ffi.register_object("tl.ReduceType")
+@register_object("tl.ReduceType")
 class ReduceType(Node, Scriptable): ...
```
1-4: Fix TVM import aliasing; bind `register_object` directly. Using
`from tilelang import tvm as tvm` while also `import tvm.ffi` risks binding `@tvm.ffi.register_object` to the wrong module. Import upstream TVM and bind the decorator handle explicitly.

```diff
-from tilelang import tvm as tvm
-from tvm.ir.base import Node
-from tvm.runtime import Scriptable
-import tvm.ffi
+import tvm
+from tvm.ir.base import Node
+from tvm.runtime import Scriptable
+try:
+    from tvm._ffi import register_object  # preferred in newer TVM
+except Exception:  # pragma: no cover - back-compat
+    register_object = tvm.ffi.register_object
```

Run to ensure no remaining aliasing/decorator calls:

```bash
#!/bin/bash
rg -nP 'from tilelang import tvm as tvm|@tvm\.ffi\.register_object' tilelang/ir.py
```

src/op/gemm.h (1)
149-151: Fix incorrect offset_A comparison in SEqualReduce
`offset_A` is compared against `other->offset_B`, which breaks structural equality.

```diff
-         equal(offset_A, other->offset_B) &&
+         equal(offset_A, other->offset_A) &&
          equal(offset_B, other->offset_B) &&
```
🧹 Nitpick comments (9)
src/op/atomic_add.h (2)
20-24: Consider relaxing the `coalesced_width` type (future-proofing). If coalesced width may be symbolic/target-dependent, `IntImm` is restrictive. Prefer `Optional<PrimExpr>` (still works with `.defined()`) to match typical TIR patterns.
```diff
-  IntImm coalesced_width; ///< Width for memory coalescing optimization
+  Optional<PrimExpr> coalesced_width; ///< Optional width for memory coalescing
```
66-73: Avoid the magic 0/1 selector for src/dst. Using an enum improves readability and type safety.
```diff
 class AtomicAddNode : public TileOperatorNode {
 @@
-  Array<PrimExpr> MakeIndices(const Array<IterVar> &ivs, int src_dst) const;
+  enum class WhichBuf : int { kSrc = 0, kDst = 1 };
+  Array<PrimExpr> MakeIndices(const Array<IterVar> &ivs, WhichBuf which) const;
 @@
-  PrimExpr MakePredicate(arith::Analyzer *analyzer, const Array<IterVar> &ivs,
-                         Array<PrimExpr> extents, int src_dst) const;
+  PrimExpr MakePredicate(arith::Analyzer *analyzer, const Array<IterVar> &ivs,
+                         Array<PrimExpr> extents, WhichBuf which) const;
```

Note: update the corresponding definitions/call sites in the .cc accordingly.
src/op/copy.h (1)
97-97: Optional: use a strongly-typed `eviction_policy`. Prefer storing `EvictionPolicy` directly (or an int) for readability and reflection friendliness; keep the wire value as `uint8_t` only at codegen boundaries.
Apply:
```diff
-  uint8_t eviction_policy = static_cast<uint8_t>(EvictionPolicy::kEvictNormal);
+  EvictionPolicy eviction_policy = EvictionPolicy::kEvictNormal;
```

Note: adjust reflection/hash/equality to cast to int when needed.

src/op/gemm_sp.cc (1)
31-44: Remove the unused helper `toPrimeFactors`. Dead code; drop it to keep the TU lean.
Apply:
```diff
-static std::vector<int> toPrimeFactors(int x) {
-  int i = 2;
-  std::vector<int> result;
-  while (x > 1) {
-    if (x % i == 0) {
-      x /= i;
-      result.push_back(i);
-    } else {
-      i++;
-    }
-  }
-  return result;
-}
```

src/op/gemm.cc (2)
148-149: Avoid the non-ASCII character in diagnostics. The "×" may cause encoding issues in some toolchains. Prefer a plain "x".
Apply:
```diff
-  ICHECK(num_warps % 4 == 0) << "Warp-Group MMA requires 128×k threads.";
+  ICHECK(num_warps % 4 == 0) << "Warp-Group MMA requires 128xk threads.";
```
31-43: Remove or mark the unused `toPrimeFactors` helper. It is not referenced; keep the TU lean or annotate it to silence warnings.
Apply either:
- Delete the function, or
- Mark it as maybe_unused:
```diff
-static std::vector<int> toPrimeFactors(int x) {
+[[maybe_unused]] static std::vector<int> toPrimeFactors(int x) {
```

src/op/fill.h (1)
26-28: Mark overrides to catch signature drift at compile time. Add `override` on the virtuals.
Apply:
```diff
-  Stmt Lower(const LowerArgs &T, arith::Analyzer *analyzer) const;
-  LayoutMap InferLayout(const LayoutInferArgs &T, InferLevel level) const;
+  Stmt Lower(const LowerArgs &T, arith::Analyzer *analyzer) const override;
+  LayoutMap InferLayout(const LayoutInferArgs &T, InferLevel level) const override;
```

src/op/gemm.h (1)
29-31: Optional: initialize `policy_type` to a sane default. This avoids undefined reads if a node is constructed via non-standard paths.
```diff
-  int policy_type;
+  int policy_type{static_cast<int>(GemmWarpPolicyType::kSquare)};
```

src/op/reduce.h (1)
56-75: Provide an enum-based constructor for `ReduceType`. A convenience overload avoids string parsing at call sites and mirrors the Gemm policy constructors.
```diff
 class ReduceType : public ObjectRef {
 public:
   TVM_DEFINE_OBJECT_REF_METHODS(ReduceType, ObjectRef, ReduceTypeNode);
+  explicit ReduceType(ReduceTypeEnum e) {
+    auto node = make_object<ReduceTypeNode>();
+    node->type = static_cast<int>(e);
+    data_ = std::move(node);
+  }
   TVM_DLL ReduceType(std::string type) {
     auto node = make_object<ReduceTypeNode>();
```
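For contrast, the string-based construction that the enum overload would bypass can be modeled in Python. The enum names and values below are illustrative, not TileLang's actual mapping:

```python
from enum import IntEnum


class ReduceTypeEnum(IntEnum):
    SUM = 0
    ABS_SUM = 1
    MAX = 2
    MIN = 3
    ABS_MAX = 4


def parse_reduce_type(name: str) -> ReduceTypeEnum:
    """Map a reduce-type string to its enum, as a string ctor must do."""
    table = {
        "sum": ReduceTypeEnum.SUM,
        "abssum": ReduceTypeEnum.ABS_SUM,
        "max": ReduceTypeEnum.MAX,
        "min": ReduceTypeEnum.MIN,
        "absmax": ReduceTypeEnum.ABS_MAX,
    }
    try:
        return table[name]
    except KeyError:
        raise ValueError(f"unknown reduce type: {name!r}")
```

An enum-taking constructor sidesteps this lookup (and its failure mode) entirely at call sites that already hold the enum.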
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (11)
- src/op/atomic_add.h (3 hunks)
- src/op/copy.h (4 hunks)
- src/op/fill.h (1 hunks)
- src/op/gemm.cc (11 hunks)
- src/op/gemm.h (2 hunks)
- src/op/gemm_sp.cc (4 hunks)
- src/op/gemm_sp.h (2 hunks)
- src/op/operator.h (0 hunks)
- src/op/reduce.cc (6 hunks)
- src/op/reduce.h (3 hunks)
- tilelang/ir.py (1 hunks)
💤 Files with no reviewable changes (1)
- src/op/operator.h
🧰 Additional context used
🧬 Code graph analysis (10)
src/op/fill.h (3)
- src/op/parallel.h (6): tvm (22-158), tl (23-156), InferLayout (101-108), RegisterReflection (72-78), SEqualReduce (80-84), SHashReduce (86-90)
- src/op/operator.h (2): TileOperatorNode (53-93), TileOperator (60-92)
- src/op/fill.cc (9): Lower (170-205), Lower (170-170), InferLayout (218-221), InferLayout (218-219), Clone (118-121), Clone (118-118), MakeSIMTLoop (135-150), MakeSIMTLoop (135-135), Fill (61-108)

tilelang/ir.py (6)
- src/op/gemm_sp.h (1): tvm (13-95)
- src/op/gemm.h (5): GemmWarpPolicy (72-95), GemmWarpPolicy (76-80), GemmWarpPolicy (82-86), GemmWarpPolicy (88-94), Gemm (194-199)
- src/op/gemm.cc (1): Gemm (73-103)
- src/op/gemm_sp.cc (1): GemmSP (66-89)
- src/op/reduce.cc (2): ReduceOp (24-33), CumSumOp (396-410)
- src/op/reduce.h (4): ReduceOp (134-139), CumSumOp (158-163), ReduceType (56-76), ReduceType (59-75)

src/op/gemm.h (3)
- src/op/gemm_sp.h (2): tl (15-94), tvm (13-95)
- src/target/utils.h (2): tl (13-31), tvm (12-32)
- src/op/gemm.cc (2): ComputeWarpPartition (134-299), ComputeWarpPartition (135-136)

src/op/reduce.h (6)
- src/op/copy.h (7): RegisterReflection (101-109), RegisterReflection (279-289), SEqualReduce (111-116), SEqualReduce (291-297), SHashReduce (118-124), SHashReduce (299-307), Clone (239-334)
- src/op/fill.h (4): RegisterReflection (30-36), tvm (12-67), SEqualReduce (38-41), SHashReduce (43-47)
- src/op/finalize_reducer.h (4): RegisterReflection (33-38), tvm (20-68), SEqualReduce (40-43), SHashReduce (45-48)
- src/op/atomic_add.h (1): tvm (12-84)
- src/op/operator.h (2): TileOperatorNode (53-93), TileOperator (60-92)
- src/op/reduce.cc (19): Lower (151-317), Lower (151-151), Lower (412-436), Lower (412-412), InferLayout (319-389), InferLayout (319-320), InferLayout (438-441), InferLayout (438-439), Clone (35-38), Clone (35-35), Clone (40-43), Clone (40-40), MakeInitValue (45-76), MakeInitValue (45-45), MakeReduce (78-96), MakeReduce (78-79), MakeCodegenReducer (98-113), MakeCodegenReducer (98-98), ReduceOp (24-33)

src/op/reduce.cc (1)
- src/op/reduce.h (2): ReduceType (56-76), ReduceType (59-75)

src/op/gemm_sp.cc (2)
- tilelang/ir.py (1): GemmWarpPolicy (28-29)
- src/op/gemm.h (6): GemmWarpPolicy (72-95), GemmWarpPolicy (76-80), GemmWarpPolicy (82-86), GemmWarpPolicy (88-94), RegisterReflection (34-40), RegisterReflection (117-139)

src/op/gemm_sp.h (1)
- src/op/gemm.h (10): GemmWarpPolicy (72-95), GemmWarpPolicy (76-80), GemmWarpPolicy (82-86), GemmWarpPolicy (88-94), RegisterReflection (34-40), RegisterReflection (117-139), SEqualReduce (42-46), SEqualReduce (141-154), SHashReduce (48-52), SHashReduce (156-176)

src/op/gemm.cc (3)
- src/op/gemm.h (4): GemmWarpPolicy (72-95), GemmWarpPolicy (76-80), GemmWarpPolicy (82-86), GemmWarpPolicy (88-94)
- tilelang/primitives/gemm/base.py (1): GemmWarpPolicy (8-188)
- src/target/utils.cc (2): TargetGetWarpSize (114-119), TargetGetWarpSize (114-114)

src/op/copy.h (3)
- src/op/fill.h (4): RegisterReflection (30-36), tvm (12-67), SEqualReduce (38-41), SHashReduce (43-47)
- src/op/reduce.h (6): RegisterReflection (33-36), RegisterReflection (89-97), SEqualReduce (38-40), SEqualReduce (99-103), SHashReduce (42-42), SHashReduce (105-111)
- src/op/atomic_add.h (1): tvm (12-84)

src/op/atomic_add.h (5)
- src/op/operator.h (1): TileOperatorNode (53-93)
- src/op/parallel.h (5): ParallelOp (148-155), RegisterReflection (72-78), tvm (22-158), SEqualReduce (80-84), SHashReduce (86-90)
- src/op/copy.h (6): RegisterReflection (101-109), RegisterReflection (279-289), SEqualReduce (111-116), SEqualReduce (291-297), SHashReduce (118-124), SHashReduce (299-307)
- src/op/fill.h (4): RegisterReflection (30-36), tvm (12-67), SEqualReduce (38-41), SHashReduce (43-47)
- src/op/atomic_add.cc (8): MakeSIMTLoop (256-317), MakeSIMTLoop (256-256), MakeIterVars (118-130), MakeIterVars (118-118), MakeIndices (153-170), MakeIndices (153-154), MakePredicate (197-226), MakePredicate (197-200)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: format-check
- GitHub Check: bot-task
🔇 Additional comments (13)
src/op/atomic_add.h (2)
25-25: Confirm that excluding `par_op_` from equality/hash/reflection is intentional. `par_op_` looks like derived/cache state. Excluding it avoids spurious inequality, but please confirm this invariant and that no logic depends on it being reflected.
45-61: SEqual/SHash implementations look correct and complete. They cover src/dst, ranges, and coalesced_width; the flags are set. No ephemeral fields included. LGTM.
src/op/gemm_sp.cc (4)
77-77: Confirm the Python/IR contract for the policy argument. The constructor interprets args[9] as an IntImm policy type. If Python can pass free (m_warp, n_warp) policies, this will reject them.
Would you like me to scan tilelang/ir.py and call sites to ensure only enum-style policies are passed, or extend parsing to accept a 2-tuple? I can provide a patch either way.
130-132: Null-safety: ensure policy is always set. ComputeWarpPartition assumes a non-null policy. Given the constructor always sets it, this is OK; just flagging the invariant.
If there are alternate constructors/Clone paths that can produce a node without policy, we should set a default (e.g., kAuto) in GemmSPNode’s default ctor.
208-210: InferLayout uses policy->ComputeWarpPartition consistently. Good consolidation; this reduces duplication versus the prior node-local logic.
252-253: Reflection registration LGTM. The static-init registration matches the pattern used elsewhere.
src/op/reduce.cc (2)
24-33: The constructor change to the ReduceType wrapper looks good. It cleanly centralizes parsing/validation of the reduce type.
396-410: CumSum: argument parsing and checks LGTM. The doc updates and reverse handling look consistent.
src/op/gemm.cc (1)
426-428: The call-site transition to policy looks good. Lower/InferLayout now use policy->ComputeWarpPartition consistently. Once the fixes above land, these sites should be robust.
Also applies to: 488-491
src/op/fill.h (1)
30-36: The reflection shape looks correct and matches the FillNode fields. No issues.
src/op/gemm_sp.h (1)
40-56: Reflection coverage is complete (it now includes policy). Good update.
src/op/gemm.h (1)
117-139: Good: `policy` is reflected and participates in equality/hash. This addresses prior gaps where `policy` was excluded from identity.

src/op/reduce.h (1)
99-111: Good: equality/hash now include `type`. Including `type` fixes correctness for structural comparisons and hashing.
```cpp
static void RegisterReflection() {
  namespace refl = tvm::ffi::reflection;
  refl::ObjectDef<AtomicAddNode>()
      .def_ro("src", &AtomicAddNode::src)
      .def_ro("dst", &AtomicAddNode::dst)
      .def_ro("src_range", &AtomicAddNode::src_range)
      .def_ro("dst_range", &AtomicAddNode::dst_range)
      .def_ro("coalesced_width", &AtomicAddNode::coalesced_width);
}
```
💡 Verification agent
🧩 Analysis chain
Ensure RegisterReflection() runs at static init.
Please verify a TU calls AtomicAddNode::RegisterReflection() (and node type registration) during startup; otherwise Python reflection won’t work.
🏁 Script executed:
```bash
#!/bin/bash
# Expect: at least one call to RegisterReflection() and node/op registrations.
rg -n -C2 -e 'AtomicAddNode::RegisterReflection\(' \
   -e 'TVM_REGISTER_NODE_TYPE\(\s*AtomicAddNode\s*\)' \
   -e 'TIR_REGISTER_TL_OP\(\s*AtomicAdd'
```

Length of output: 663
Register node type for AtomicAddNode
Add a TVM_REGISTER_NODE_TYPE(AtomicAddNode); invocation (e.g. in src/op/atomic_add.cc, just above the existing TVM_FFI_STATIC_INIT_BLOCK) so the node type is registered with FFI before calling AtomicAddNode::RegisterReflection(). Without it, Python reflection will not work.
🤖 Prompt for AI Agents
In src/op/atomic_add.h around lines 35 to 43 the reflection registration for
AtomicAddNode is defined but the node type itself is not registered with TVM's
FFI, causing Python reflection to fail; add a
TVM_REGISTER_NODE_TYPE(AtomicAddNode); call (place it in src/op/atomic_add.cc
just above the existing TVM_FFI_STATIC_INIT_BLOCK) so the node type is
registered before AtomicAddNode::RegisterReflection() is invoked.
```cpp
/// Create SIMT-style parallel loop structure
For MakeSIMTLoop(arith::Analyzer *analyzer) const;
```
Bug: scalar path in MakeSIMTLoop does a plain store, not an atomic add.
In src/op/atomic_add.cc, the is_scalar branch emits BufferStore(dst, src) instead of an atomic add. This silently changes semantics.
Patch suggestion (cc file):
```diff
@@
-  if (is_scalar) {
-    return For(Var("i"), 0, 1, ForKind::kSerial,
-               BufferStore(dst, BufferLoad(src, {0}), {0}));
-  }
+  if (is_scalar) {
+    Array<PrimExpr> new_args;
+    new_args.push_back(StringImm("AtomicAdd"));
+    PrimExpr src_value = BufferLoad(src, {0});
+    if (src->dtype != dst->dtype) {
+      src_value = Cast(dst->dtype, src_value);
+    }
+    PrimExpr dst_value = BufferLoad(dst, {0});
+    Call address_of_value =
+        tvm::tir::Call(DataType::Handle(), builtin::address_of(), {dst_value});
+    new_args.push_back(address_of_value);
+    new_args.push_back(src_value);
+    Call atomicadd_call =
+        tvm::tir::Call(dst->dtype, builtin::call_extern(), new_args);
+    return For(Var("i"), 0, 1, ForKind::kSerial, Evaluate(atomicadd_call));
+  }
```
🤖 Prompt for AI Agents
In src/op/atomic_add.h around lines 64-65, the scalar path in MakeSIMTLoop
currently emits a plain BufferStore(dst, src) which changes semantics; replace
the scalar-store with the same atomic-add operation used by the vector/SIMT path
(use the existing BufferAtomicAdd/atomic add intrinsic or helper used
elsewhere), preserving operand order, return/old-value handling, and any
required memory ordering flags; include any missing header or forward
declaration needed for the atomic helper and add/update tests to cover the
scalar branch.
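The semantic gap flagged for the scalar branch is easy to model outside TIR: a plain store overwrites prior contributions, while an atomic add accumulates them. A minimal Python model (the lock stands in for the hardware read-modify-write; names are illustrative):

```python
from threading import Lock

def plain_store(dst, value):
    dst[0] = value  # overwrites: later writers erase earlier contributions

_lock = Lock()

def atomic_add(dst, value):
    with _lock:  # models the atomic read-modify-write
        dst[0] += value

dst = [0.0]
for v in (1.0, 2.0, 3.0):
    atomic_add(dst, v)   # accumulates to 6.0

dst2 = [0.0]
for v in (1.0, 2.0, 3.0):
    plain_store(dst2, v)  # only the last write survives: 3.0
```

With several threads contributing partial sums to the same destination, the store variant silently drops all but one contribution, which is exactly the bug the review describes.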
```diff
 if (this->isFullRow()) {
   // Try to partition M first
   m_warp = num_warps;
   n_warp = 1;

   // If M cannot be evenly divided by m_warp*16, try to split remaining warps
   // to N
-  if (this->M % (m_warp * kMPerWarp) != 0) {
+  if (M % (m_warp * kMPerWarp) != 0) {
     // Calculate how many warps we can use for M
-    int max_m_warps = this->M / kMPerWarp;
+    int max_m_warps = M / kMPerWarp;
     m_warp = max_m_warps;
     // Use remaining warps for N
     n_warp = num_warps / m_warp;
     if (n_warp == 0)
       n_warp = 1;
   }
-} else if (this->policy == GemmWarpPolicy::kFullCol) {
+} else if (this->isFullCol()) {
   // Try to partition N first
   m_warp = 1;
   n_warp = num_warps;

   // If N cannot be evenly divided by n_warp*8, try to split remaining warps
   // to M
-  if (this->N % (n_warp * kNPerWarp) != 0) {
+  if (N % (n_warp * kNPerWarp) != 0) {
     // Calculate how many warps we can use for N
-    int max_n_warps = this->N / kNPerWarp;
+    int max_n_warps = N / kNPerWarp;
     n_warp = max_n_warps;
     // Use remaining warps for M
     m_warp = num_warps / n_warp;
     if (m_warp == 0)
       m_warp = 1;
   }
-} else if (this->policy == GemmWarpPolicy::kSquare) {
+} else if (this->isSquare()) {
```
🛠️ Refactor suggestion
FullRow/FullCol (non-WGMMA) may produce leftover warps; enforce exact tiling.
m_warp and n_warp are derived via integer division; when not divisible, extra warps are lost. Pick the largest divisor within bounds so that m_warp*n_warp == num_warps.
Apply:
```diff
 if (this->isFullRow()) {
   // Try to partition M first
-  m_warp = num_warps;
-  n_warp = 1;
+  m_warp = std::min(num_warps, M / kMPerWarp);
+  while (m_warp > 1 && (num_warps % m_warp != 0)) --m_warp;
+  if (m_warp == 0) m_warp = 1;
+  n_warp = num_warps / m_warp;
 @@
-  if (M % (m_warp * kMPerWarp) != 0) {
-    // Calculate how many warps we can use for M
-    int max_m_warps = M / kMPerWarp;
-    m_warp = max_m_warps;
-    // Use remaining warps for N
-    n_warp = num_warps / m_warp;
-    if (n_warp == 0)
-      n_warp = 1;
-  }
+  ICHECK(M % (m_warp * kMPerWarp) == 0);
 } else if (this->isFullCol()) {
   // Try to partition N first
-  m_warp = 1;
-  n_warp = num_warps;
+  n_warp = std::min(num_warps, N / kNPerWarp);
+  while (n_warp > 1 && (num_warps % n_warp != 0)) --n_warp;
+  if (n_warp == 0) n_warp = 1;
+  m_warp = num_warps / n_warp;
 @@
-  if (N % (n_warp * kNPerWarp) != 0) {
-    // Calculate how many warps we can use for N
-    int max_n_warps = N / kNPerWarp;
-    n_warp = max_n_warps;
-    // Use remaining warps for M
-    m_warp = num_warps / n_warp;
-    if (m_warp == 0)
-      m_warp = 1;
-  }
+  ICHECK(N % (n_warp * kNPerWarp) == 0);
```

Also add a final ICHECK after the policy branch (see the next comment) to guarantee exact tiling.
```cpp
// Store the computed values in the object's member variables
this->m_warp = m_warp;
this->n_warp = n_warp;

return {m_warp, n_warp};
}
```
🛠️ Refactor suggestion
Add a sanity check for non-WGMMA path.
Mirror the WGMMA branch by asserting m_warp*n_warp == num_warps before persisting.
Apply:
```diff
-  // Store the computed values in the object's member variables
+  ICHECK(m_warp * n_warp == num_warps) << "m_warp * n_warp must equal num_warps";
+  // Store the computed values in the object's member variables
   this->m_warp = m_warp;
   this->n_warp = n_warp;
```
🤖 Prompt for AI Agents
In src/op/gemm.cc around lines 294 to 299, the non-WGMMA branch lacks a sanity
check ensuring the calculated m_warp and n_warp multiply to num_warps; add the
same assertion used in the WGMMA branch (assert(m_warp * n_warp == num_warps) or
an equivalent check that throws/logs on mismatch) immediately before persisting
to this->m_warp and this->n_warp and before returning, so the function validates
the values match expected num_warps.
```cpp
bool SEqualReduce(const GemmWarpPolicyNode *other,
                  SEqualReducer equal) const {
  return equal(policy_type, other->policy_type) &&
         equal(m_warp, other->m_warp) && equal(n_warp, other->n_warp);
}

void SHashReduce(SHashReducer hash_reduce) const {
  hash_reduce(policy_type);
  hash_reduce(m_warp);
  hash_reduce(n_warp);
}

static constexpr bool _type_has_method_sequal_reduce = true;
static constexpr bool _type_has_method_shash_reduce = true;
```
🛠️ Refactor suggestion
Stabilize equality/hash: exclude derived m_warp/n_warp except for kFree
m_warp/n_warp are mutated by ComputeWarpPartition (target/M/N dependent). Including them in equality/hash for non-free policies makes identity depend on lowering-time side effects. Compare/hash them only when policy_type == kFree.
```diff
-  bool SEqualReduce(const GemmWarpPolicyNode *other,
-                    SEqualReducer equal) const {
-    return equal(policy_type, other->policy_type) &&
-           equal(m_warp, other->m_warp) && equal(n_warp, other->n_warp);
-  }
+  bool SEqualReduce(const GemmWarpPolicyNode *other,
+                    SEqualReducer equal) const {
+    if (!equal(policy_type, other->policy_type)) return false;
+    if (isFree()) {
+      return equal(m_warp, other->m_warp) && equal(n_warp, other->n_warp);
+    }
+    return true;
+  }

-  void SHashReduce(SHashReducer hash_reduce) const {
-    hash_reduce(policy_type);
-    hash_reduce(m_warp);
-    hash_reduce(n_warp);
-  }
+  void SHashReduce(SHashReducer hash_reduce) const {
+    hash_reduce(policy_type);
+    if (isFree()) {
+      hash_reduce(m_warp);
+      hash_reduce(n_warp);
+    }
+  }
```
🤖 Prompt for AI Agents
In src/op/gemm.h around lines 42 to 56, the equality/hash currently include
m_warp and n_warp unconditionally which makes identity depend on lowering-time
mutations; change SEqualReduce to always compare policy_type but only compare
m_warp and n_warp when policy_type == kFree (i.e., if (policy_type == kFree)
then call equal on m_warp and n_warp), and change SHashReduce to always hash
policy_type but only call hash_reduce(m_warp) and hash_reduce(n_warp) when
policy_type == kFree; leave the constexpr flags unchanged.
+  if (type->isSum()) {
     return make_zero(dst->dtype);
-  case ReduceType::kAbsSum:
+  } else if (type->isAbsSum()) {
     return make_zero(dst->dtype);
-  case ReduceType::kMax:
+  } else if (type->isMax()) {
     if (is_int) {
       return make_const(dst->dtype, -(1 << (bits - 1)));
     } else if (is_uint) {
       return make_const(dst->dtype, 0);
     } else {
       return make_const(dst->dtype, -INFINITY);
     }
-  case ReduceType::kMin:
+  } else if (type->isMin()) {
     if (is_int) {
       return make_const(dst->dtype, (1 << (bits - 1)) - 1);
     } else if (is_uint) {
       return make_const(dst->dtype, (1 << bits) - 1);
     } else {
       return make_const(dst->dtype, INFINITY);
     }
-  case ReduceType::kAbsMax:
+  } else if (type->isAbsMax()) {
     return make_const(dst->dtype, 0);
-  default:
-    ICHECK(0);
+  } else {
+    LOG(FATAL) << "Unsupported reduce type: " << type->type;
+  }
🛠️ Refactor suggestion
UB risk in MakeInitValue for int min/max
Shifts like (1 << (bits - 1)) are undefined for bits ≥ 32 (1 is int). Use typed intrinsics.
Apply:
if (type->isMax()) {
- if (is_int) {
- return make_const(dst->dtype, -(1 << (bits - 1)));
+ if (is_int) {
+ return min_value(dst->dtype);
} else if (is_uint) {
- return make_const(dst->dtype, 0);
+ return min_value(dst->dtype);
} else {
return make_const(dst->dtype, -INFINITY);
}
} else if (type->isMin()) {
- if (is_int) {
- return make_const(dst->dtype, (1 << (bits - 1)) - 1);
- } else if (is_uint) {
- return make_const(dst->dtype, (1 << bits) - 1);
+ if (is_int) {
+ return max_value(dst->dtype);
+ } else if (is_uint) {
+ return max_value(dst->dtype);
} else {
return make_const(dst->dtype, INFINITY);
     }
Note: min_value/max_value are available in TIR; otherwise compute via dtype-aware helpers.
- Updated the parameter name in the MakeReduce method from `rhs` to `b` and assigned it to `rhs` for improved readability.
- This change enhances the clarity of the method's purpose and aligns with the overall refactoring efforts in the Reduce operation.
- Changed string comparisons for reduce types in the MakeReduce method from "abs_sum" to "abssum" and "abs_max" to "absmax" for uniformity.
- This adjustment enhances the clarity and consistency of the reduce type handling in the codebase.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/op/reduce.h (1)
141-156: Add reflection + structural eq/hash for CumSumOpNode to meet Python reflection goal
CumSumOpNode lacks RegisterReflection/SEqualReduce/SHashReduce and the type_has_method* flags, unlike other operators (e.g., Fill, Copy, FinalizeReducerOp). Without these, Python-side reflection and structural memoization won’t work uniformly.
 class CumSumOpNode : public TileOperatorNode {
 public:
   tir::Buffer src, dst; ///< Source and destination buffers
   int dim;              ///< Dimension along which to compute cumulative sum
   bool reverse;         ///< Whether to compute in reverse order
   static constexpr const char *_type_key = "tl.CumSumOp";
   TVM_DECLARE_FINAL_OBJECT_INFO(CumSumOpNode, TileOperatorNode);
+  static void RegisterReflection() {
+    namespace refl = tvm::ffi::reflection;
+    refl::ObjectDef<CumSumOpNode>()
+        .def_ro("src", &CumSumOpNode::src)
+        .def_ro("dst", &CumSumOpNode::dst)
+        .def_ro("dim", &CumSumOpNode::dim)
+        .def_ro("reverse", &CumSumOpNode::reverse);
+  }
+
+  bool SEqualReduce(const CumSumOpNode *other, SEqualReducer equal) const {
+    return equal(src, other->src) && equal(dst, other->dst) &&
+           equal(dim, other->dim) && equal(reverse, other->reverse);
+  }
+
+  void SHashReduce(SHashReducer hash_reduce) const {
+    hash_reduce(src);
+    hash_reduce(dst);
+    hash_reduce(dim);
+    hash_reduce(reverse);
+  }
+  static constexpr bool _type_has_method_sequal_reduce = true;
+  static constexpr bool _type_has_method_shash_reduce = true;
+
   Stmt Lower(const LowerArgs &T, arith::Analyzer *analyzer) const override;
♻️ Duplicate comments (1)
src/op/reduce.h (1)
99-104: SEqual/SHash now include type — resolves prior issueThanks for adding type into equality and hashing. This addresses the previously flagged correctness gap.
Also applies to: 105-112
🧹 Nitpick comments (2)
src/op/reduce.h (2)
9-9: Add missing standard headers to avoid transitive includes
This header uses uint8_t, std::string, and std::move but doesn’t include their headers. Add these to keep the header self-contained.
+#include <cstdint>
+#include <string>
+#include <utility>
 #include "operator.h"
28-33: Prefer a narrower, typed storage for ReduceTypeNode::type
Currently stored as int with sentinel -1. Using an 8-bit integral matches ReduceTypeEnum’s size and reduces ambiguity.
 class ReduceTypeNode : public Object {
  public:
-  int type{-1};    ///< Internal type identifier
+  int8_t type{-1}; ///< Internal type identifier (matches underlying enum width)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/op/reduce.h (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/op/reduce.h (6)
src/op/fill.h (4): RegisterReflection (30-36), tvm (12-67), SEqualReduce (38-41), SHashReduce (43-47)
src/op/copy.h (7): RegisterReflection (101-109, 279-289), SEqualReduce (111-116, 291-297), SHashReduce (118-124, 299-307), Clone (239-334)
src/op/finalize_reducer.h (4): RegisterReflection (33-38), tvm (20-68), SEqualReduce (40-43), SHashReduce (45-48)
src/op/atomic_add.h (1): tvm (12-84)
src/op/operator.h (2): TileOperatorNode (53-93), TileOperator (60-92)
src/op/reduce.cc (19): Lower (152-318, 152, 413-437, 413), InferLayout (320-390, 320-321, 439-442, 439-440), Clone (35-38, 35, 40-43, 40), MakeInitValue (45-76, 45), MakeReduce (78-97, 78-79), MakeCodegenReducer (99-114, 99), ReduceOp (24-33)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: format-check
- GitHub Check: bot-task
🔇 Additional comments (1)
src/op/reduce.h (1)
59-75: Change ReduceType constructor to take `const std::string&`
Pass the input string by const reference to avoid an unnecessary copy in src/op/reduce.h (lines 61–69).
Python API already emits only the canonical tokens
The calls in tilelang/language/reduce.py use exactly "sum", "abssum", "max", "min", and "absmax" (lines 67, 84, 109, 124, 139).
static void RegisterReflection() {
  namespace refl = tvm::ffi::reflection;
  refl::ObjectDef<ReduceTypeNode>().def_ro("type", &ReduceTypeNode::type);
}
💡 Verification agent
🧩 Analysis chain
Ensure RegisterReflection() is invoked at static init
ReduceTypeNode::RegisterReflection and ReduceOpNode::RegisterReflection are defined but invocation isn’t shown here. Please confirm they’re called (e.g., in reduce.cc or a central registration TU). If not, add a tiny static initializer.
Example (in src/op/reduce.cc):
namespace {
struct _TL_RegisterReduceReflection {
_TL_RegisterReduceReflection() {
tvm::tl::ReduceTypeNode::RegisterReflection();
tvm::tl::ReduceOpNode::RegisterReflection();
tvm::tl::CumSumOpNode::RegisterReflection(); // after adding it
}
} _tl_register_reduce_reflection;
} // namespace
To check current calls:
Also applies to: 89-97
🏁 Script executed:
#!/bin/bash
rg -nP '\bRegisterReflection\s*\(' -C2 src
Length of output: 8364
Add static initializer for Reduce reflection
No calls to ReduceTypeNode::RegisterReflection() or ReduceOpNode::RegisterReflection() exist—create a TU (e.g. src/op/reduce.cc) with:
namespace {
struct _TL_RegisterReduceReflection {
_TL_RegisterReduceReflection() {
tvm::tl::ReduceTypeNode::RegisterReflection();
tvm::tl::ReduceOpNode::RegisterReflection();
}
} _tl_register_reduce_reflection;
} // namespace
🤖 Prompt for AI Agents
In src/op/reduce.h around lines 33 to 36, the reflection registration for Reduce
types is defined but never invoked; add a translation unit (e.g.,
src/op/reduce.cc) that defines an anonymous static initializer struct whose
constructor calls tvm::tl::ReduceTypeNode::RegisterReflection() and
tvm::tl::ReduceOpNode::RegisterReflection(), and instantiate a single static
instance of that struct inside an anonymous namespace so the registrations run
at program startup.
* [Index] Relocate Int64 Auto Promoter to ConfigBitWidth Pass, removing it from FlattenBuffer (#714) * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107 * Refactor inject_pipeline.cc to enhance pipeline body rewriting and condition handling - Introduced a new function to replace IfThenElse nodes with their then_case while preserving attributes. - Streamlined the PipelineBodyRewriter to improve buffer access rewriting and async state management. - Enhanced the handling of pipeline loop conditions and added support for predicate conditions in the pipeline body. - Removed obsolete code and improved overall code clarity and maintainability. * lint fix * Refactor return statements in inject_pipeline.cc to remove unnecessary std::move calls - Updated return statements in multiple methods to return objects directly instead of using std::move, improving code clarity and potentially avoiding unnecessary moves. - Ensured consistent handling of BufferStore and BufferLoad nodes during pipeline transformations. * test fix * Enhance global read detection in pipeline planning - Updated the handling of global reads to account for condition expressions within IfThenElse nodes, ensuring accurate identification of global memory accesses. - Introduced a new flag to track whether the visitor is within a condition expression, improving the correctness of buffer access analysis. - Refactored the VisitStmt_ method to properly handle the structure of IfThenElse nodes, enhancing the clarity and maintainability of the code. * Add IndexLegalizer to enforce int64 for out-of-bound indices - Introduced the IndexLegalizer class to ensure that indices in BufferStore and BufferLoad nodes are promoted to int64 when they exceed their type bounds. - Refactored the Int64Promoter logic from flatten_buffer.cc into IndexLegalizer, improving code organization and reusability. 
- Updated the ConfigIndexBitwidth pass to apply IndexLegalizer after rewriting the body, enhancing the handling of index bitwidths in transformations. * [CI] Bind build-test CI to NVIDIA as AMD runners are being introduced (#718) * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107 * Rename build-test job to build-test-nvidia and specify nvidia as a runner label in CI workflow. * Update CI workflow to specify 'nvidia' as an additional runner label for the format-check job. * fix: NVRTC backend (#717) * fix: NVRTC backend * fix: CI --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [CUDA] Init support for sm_120 (#716) * Init support for sm120 * fmt * resolve comments * unify mma gemm * fmt --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [CI] fix docs ci (#720) * [Chore] fix typos (#719) * chore: fix typos * chore: fix ruff * chore: fix clang-format * [CI][AMD] Add AMD GPU CI and fix some related bugs (#694) * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668) - Enhanced buffer index handling to address precision issues by removing redundant operations. - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection. - Updated related documentation to reflect changes in buffer management practices. * Remove obsolete test script for AMD example, streamlining the examples directory. * Remove unused dtype_size variable in AMD example script to streamline code. * Add input configuration file and update AMD example script for enhanced flexibility - Introduced a new input.txt file for configurable parameters. - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack. - Streamlined the main function for better clarity and organization. - Added a new test script to facilitate running the example with specified parameters. 
* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations - Deleted input.txt and test.sh files as they are no longer needed. - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance. - Reintroduced swizzle usage in the kernel for better performance. * Refactor AMD example script for FlashAttention-2 - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`. - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls. - Removed outdated comments and improved code organization for better readability. * Refactor formatting in AMD FlashAttention example script - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function. - Streamlined the `main` function parameter formatting for consistency. - Removed unnecessary blank lines to enhance overall code organization. * Update example_amd_flash_attn_fwd.py * Update AMD FlashAttention example and TVM submodule - Added a new example script `example_amd_flash_attn_fwd_k_block.py` for FlashAttention with K-blocking support. - Enhanced `example_amd_flash_attn_fwd.py` by expanding configuration options for block sizes and threads. - Updated the TVM submodule to the latest commit for improved functionality. - Introduced a new test script `test.sh` to facilitate running the new example with specified parameters. * Add CI workflow for automated format checking and testing - Introduced a new GitHub Actions workflow in `amd_ci.yml` to automate format checks and testing for pull requests. - The workflow includes steps for setting up a Python environment, running format checks, and executing tests. - Removed obsolete example script `example_amd_flash_attn_fwd_k_block.py` and test script `test.sh` to streamline the examples directory. 
* Rename CI workflow from "CI" to "AMD CI" for clarity and specificity. * Update AMD CI workflow to include copying PyTorch, TorchVision, and Torchaudio packages to the virtual environment for improved dependency management. * Update AMD CI workflow to install pytest directly instead of using requirements-test.txt * Update AMD CI workflow to remove 'flash-attn' from requirements and install dependencies from requirements-test.txt * Refactor AMD CI workflow to enhance clarity in removing 'flash-attn' from requirements-test.txt before installation * Remove Torchaudio package copying from AMD CI workflow to streamline dependency management. * Refactor AMD CI workflow to remove the format-check job and streamline the build-test process by directly copying PyTorch and TorchVision packages to the virtual environment. * Add installation of ROCm in AMD CI workflow - Included a step to execute the `install_rocm.sh` script for improved setup. - Removed unnecessary blank line for better readability in the workflow script. * Remove installation step for ROCm in AMD CI workflow to simplify the setup process. * Update AMD CI workflow to run specific test file with verbose output instead of all tests. * Add new tilelang built-in operations for AMD architecture - Introduced `tvm_mfma`, `tvm_mfma_store`, `tvm_rdna_wmma`, and `tvm_rdna_wmma_store` built-in operations to enhance support for matrix multiplication and storage in tilelang. - Each operation is configured with the appropriate number of inputs and marked as opaque in terms of call effects. * Enhance autotuner configurations and GEMM operations in AMD example - Updated block sizes and num_split_q parameters in `get_configs` for improved autotuning. - Modified `T.gemm` calls in `fast_flashattn` to utilize `GemmWarpPolicy.FullRow`, optimizing performance for matrix multiplications. 
* Update autotuner configurations in AMD example for enhanced performance - Refined block sizes, thread counts, and added new parameters in `get_configs` to optimize autotuning. - Adjusted `fast_flashattn` function to incorporate new parameters for panel size and coalesced widths, improving memory access patterns. * Enhance autotuner configurations and memory handling in AMD example - Expanded block sizes and thread counts in `get_configs` for improved autotuning capabilities. - Updated `fast_flashattn` to utilize a new shared memory allocation strategy, optimizing memory access patterns during GEMM operations. * Refine autotuner configurations and memory usage in AMD example - Reduced block sizes and adjusted thread counts in `get_configs` for optimized autotuning. - Updated `fast_flashattn` to utilize register fragments for accumulation, minimizing LDS usage and enhancing performance during GEMM operations. * Update autotuner configurations in AMD example for enhanced performance - Expanded block sizes and thread counts in `get_configs` to improve autotuning capabilities. - Adjusted `num_split_q` and `v_coalesced_width` parameters for better optimization during GEMM operations. * Enhance autotuner configurations and GEMM operations in AMD example - Expanded thread counts in `get_configs` to include higher values for improved autotuning. - Updated `fast_flashattn` to adjust accumulation logic and ensure proper handling of causal conditions, optimizing performance during matrix multiplications. * Update AMD CI workflow and remove obsolete test script - Modified the CI workflow to run on multiple environments: self-hosted, amd, and gpu. - Deleted the outdated `test.sh` script from the examples directory, streamlining the project structure. * Remove TVM subproject from 3rdparty directory * Refactor configuration generation and accumulation logic in AMD example - Reformatted the `get_configs` function for improved readability by aligning parameters. 
- Adjusted the `fast_flashattn` function to enhance clarity in the conditional logic for accumulation, ensuring better handling of causal conditions. * Enhance AMD CI workflow with additional logging and setup steps - Added echo statements to provide feedback during the CI process, indicating when the environment is running on an AMD GPU, copying necessary packages, and installing requirements. - Improved clarity in the workflow by explicitly stating when the project is being installed and when tests are being executed. * Comment out package copying in AMD CI workflow to prevent potential issues during environment setup * Update AMD CI workflow to install nightly versions of PyTorch and remove obsolete package copying steps * Enhance BuildTileLangHIP function by adding whitespace for improved readability * Refactor kTVMGridConstant definition for clarity and remove unnecessary comment * Update TVM subproject to latest commit a64a5926a6e59f5417ef2501f9d88b467337cf6a * lint fix * Update AMD CI workflow to use requirements-rocm.txt for dependency installation * fix ci * Remove dependency on format-check from AMD CI workflow * fix ci * fix ci * fix ci * Remove format-check job from AMD CI workflow * Add torch to requirements-rocm.txt and remove explicit pip install commands from AMD CI workflow * Add dependency on format-check job in AMD CI workflow * Add format-check job to AMD CI workflow * Update format-check job in AMD CI workflow to run on self-hosted environment * Enhance format-check job in AMD CI workflow with improved Python environment setup and automatic commit of lint changes * Update amd_ci.yml --------- Co-authored-by: xinxyxiao <xinyxiao@amd.com> Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy (#724) * [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy * 
[Typo] Correct architecture selection for CUDA and CDNA * [Refactor] Refactor CUDA code generation to simplify eviction policy handling (#721) * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107 * Refactor CUDA code generation to simplify eviction policy handling - Updated `VisitExpr_` methods in `codegen_cuda.cc` to use default eviction policy for `tma_load`, `tma_load_im2col`, and `tma_store` functions, reducing complexity. - Removed conditional assembly code for `EVICT_NORMAL` in `copy_sm90.h`, streamlining the assembly calls for tensor memory operations. * lint fix * [Language] Introduce `StridedTensor` to support non contigious torch inputs (#722) * Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107 * Support strided tensors * Refactor target attribute helper functions for improved clarity * No code changes made in proxy.py and setup.py * lint fix * lint fix via gemini * lint fix * test fix * test fix * lint fix * Update wrapper.py * test fix * Enhance test for InjectSoftwarePipeline by adding LowerOpaqueBlock transformation and updating expected function signature to use match_buffer for better clarity. * lint fix --------- Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com> * [Enhancement][Bugfix] Fix bug in warp specialized pass and add gemm_sr fallback support for Hopper (#712) * bug fix and support gemm_sr fallback for hopper * Update gemm.cc --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * 📝 Add docstrings to `fix` (#726) Docstrings generation was requested by @LeiWang1999. 
* https://github.com/tile-ai/tilelang/pull/712#issuecomment-3190680851 The following files were modified: * `src/op/gemm.cc` * `src/tl_templates/cuda/gemm_sm90.h` * `src/transform/warp_specialized_rewriter.cc` Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * [CI] Fix AMD CI (#729) * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668) - Enhanced buffer index handling to address precision issues by removing redundant operations. - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection. - Updated related documentation to reflect changes in buffer management practices. * Remove obsolete test script for AMD example, streamlining the examples directory. * Remove unused dtype_size variable in AMD example script to streamline code. * Add input configuration file and update AMD example script for enhanced flexibility - Introduced a new input.txt file for configurable parameters. - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack. - Streamlined the main function for better clarity and organization. - Added a new test script to facilitate running the example with specified parameters. * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations - Deleted input.txt and test.sh files as they are no longer needed. - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance. - Reintroduced swizzle usage in the kernel for better performance. * Refactor AMD example script for FlashAttention-2 - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`. - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls. 
- Removed outdated comments and improved code organization for better readability. * Refactor formatting in AMD FlashAttention example script - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function. - Streamlined the `main` function parameter formatting for consistency. - Removed unnecessary blank lines to enhance overall code organization. * Update example_amd_flash_attn_fwd.py * Enhance AMD example script and update CI workflows - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization. - Added new CI workflows for AMD and documentation publishing. - Updated various requirements files to include necessary dependencies. - Introduced new test cases and examples for better coverage and functionality. - Refactored existing code for improved readability and maintainability. * Remove redundant tool cache cleanup step in AMD CI workflow * Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements. --------- Co-authored-by: xinxyxiao <xinyxiao@amd.com> Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> * [Feature] Low-bit twiddling dequantization and FP4 GEMM (#725) * [Dequant] Add bit-twiddling dequantize cuda for fp4-->bf16 * [Dequant] Add extern call and serial dequantization * [Dequant] Parallel Dequant wait for fence debug. * [Scale] Add scale matrix to mxfp4 gemm * [Remove] Remove fence-buggy example and some generated source cuda code * [MXFP4] Update initial version of MXFP4 GEMM * [Scale] Add scale to latest mxfp4 gemm * [Lint] * [BugFix] Load Scale, disabe TMA to recover performance * [Lint] * [Lint] * [Scale] Use L2 to hold Scale and enable TMA will slightly boost performance * [Lint] * Update example_dequant_gemm_bf16_fp4_hopper_serial.py * Remove deprecated dequantization examples for BF16 and MXFP4 in the dequantize_gemm directory. * Refactor dequantization examples for improved readability and consistency. 
Adjusted formatting in matmul function and added spacing for clarity. Updated function signatures and comments for better understanding. * Refactor index_to_coordinates usage in bitnet example and update dequantization example configurations. Removed the custom index_to_coordinates function and replaced it with the built-in version. Adjusted block_K parameter in dequantization example for consistency. * lint fix * ci fix * Remove non-existent example * [BugFix] Add smem swizzle to recover performance of TMA * [BugFix] Enough reg for producer when threads=512 --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * 📝 Add docstrings to `mxfp4` (#732) * 📝 Add docstrings to `mxfp4` Docstrings generation was requested by @LeiWang1999. * https://github.com/tile-ai/tilelang/pull/725#issuecomment-3191656561 The following files were modified: * `examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py` * `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py` * `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py` * `examples/dequantize_gemm/utils.py` * `examples/gemm/example_gemm_autotune.py` * `tilelang/intrinsics/utils.py` * `tilelang/language/__init__.py` * `tilelang/language/utils.py` * `tilelang/quantize/mxfp.py` * `tilelang/quantize/quantization.py` * [Lint] More accurate docstring * [Lint] --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: tzj-fxz <tzjfxz@gmail.com> * [Refactor] Refactor env into a more flexible version (#740) * Fix environment variable name for compilation print setting in `env.py` * Remove deprecated test file for warp specialized pass configuration and refactor environment variable access in `env.py` to utilize a centralized `EnvVar` class for better management and clarity. 
* lint fix * Refactor cache check to use `env.is_cache_enabled()` for consistency in `tuner.py` * [Enhancement] Add stride index validation in CythonKernelWrapper (#743) * Introduced an assertion to ensure that the stride index is within the valid range of tensor dimensions in `cython_wrapper.pyx`. * This change prevents potential out-of-bounds errors when accessing tensor dimensions, enhancing the robustness of the code. * [Bugfix]:Fix atomic add auto vectorize memory access out of bound error (#742) * [Bugfix]:Fix atomic add auto vectorize memory access out of bound error * Update atomicadd_vectorize.cc * format * 📝 Add docstrings to PR #744 (#745) * 📝 Add docstrings to `main` Docstrings generation was requested by @LeiWang1999. * https://github.com/tile-ai/tilelang/pull/742#issuecomment-3205103559 The following files were modified: * `src/transform/atomicadd_vectorize.cc` * lint fix --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Refactor] Refactor barrier management (#744) * Introduce Barrier * Enhance CUDA kernel with new barrier management and post-processing support - Added a new CUDA kernel implementation in `example_mla_decode.py` for improved performance with shared memory barriers. - Refactored barrier handling in `codegen_cuda.cc` and `codegen_hip.cc` to utilize a more flexible mbarrier structure. - Updated intrinsic definitions from `ptx_stmatirx` to `ptx_stmatrix` across multiple files for consistency. - Introduced additional print statements for debugging in the lowering phase of the TileLang engine. - Enhanced the overall structure and readability of the codebase. * Remove unused barrier handling code in CUDA and HIP code generators to streamline the implementation. This change enhances code clarity and reduces complexity in the barrier management logic. 
* Enhance barrier management in TileLang - Introduced a new intrinsic `allocate_barrier` for dynamic barrier allocation in the TileLang framework. - Updated CUDA code generation to support the new barrier structure, allowing for improved synchronization in shared memory. - Refactored existing barrier handling logic to accommodate the new intrinsic and streamline code. - Added print statements for debugging purposes in various examples and the lowering phase of the TileLang engine. - Removed deprecated memory scope handling code to enhance clarity and maintainability. * lint fix * lint fix * Remove `allocate_barrier` intrinsic and related code from TileLang to streamline barrier management. This includes updates to CUDA code generation and the removal of associated Python wrappers, enhancing code clarity and maintainability. * Refactor logging in JITKernel to improve kernel compilation tracking - Removed unused import of `torch.backends` in the example file. - Introduced logging for kernel compilation in `JITKernel`, replacing print statements with structured logging for better traceability and debugging. - Added an assertion to ensure the presence of the `global_symbol` attribute in the kernel function. * Refactor dequantization tests and update barrier function - Removed the test for `example_dequant_gemm_bf16_fp4_hopper_serial` to streamline the testing suite. - Updated the `mbarrier_cp_async_arrive` function to support both pointer and non-pointer types, enhancing flexibility in barrier management. * Update CI configuration to increase pytest parallelism from 4 to 8 threads for improved test execution speed. * Fix typos in rasterization parameters and update import path for cached module - Corrected the spelling of `enable_rasteration` to `enable_rasterization` in the matmul function and its usage. - Updated the import statement for the `cached` module to reflect the new path in the cache submodule. 
- Added `StridedTensor` import in the language module for enhanced tensor functionality. * Update ci.yml * [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy (#746) * [Refactor] Merge bulk copy into copy and refactor layout inference for bulk copy * Deleted the `bulk_copy` operator implementation and its header file as it is no longer needed. * Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping. * Updated related files to reflect these changes, ensuring that the codebase remains clean and maintainable. * lint fix * Fix typos in intrinsic names and remove unused print statement in block_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency. * remove bulk copy * Refactor copy and atomic add operations to support TMA lower configuration - Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations. - Modified `Lower` method in `Copy` to incorporate the new TMA configuration. - Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic. - Removed unused `disable_tma_lower` field from `LowerArgs` structure for clarity. - Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach. * Enhance TMA bulk copy logic in `LowerBulkCopy` method - Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling. - Updated warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities. * lint fix * Remove fallback logging for non-swizzled global layout in `LowerBulkCopy` method to streamline the bulk copy logic. 
This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions. * Enhance reshape kernel compilation in `run_reshape` and `run_reshape_smem_1d_2_2d` functions - Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations. - Added TODO comments to indicate the need for further improvements in shared memory handling. * Update `native_sparse_attention` function to include TMA configuration options - Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations. - Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1. * Refactor JIT decorator formatting in `native_sparse_attention` function - Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase. - No functional changes were made; this update focuses on code clarity and maintainability. * Enhance thread management and logging in TileLang compilation - Added a method to check if printing is enabled during compilation, improving control over logging behavior. - Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output. - Added comments to clarify the purpose of changes and improve code readability. * Add warp specialization scope and refactor register management in TileLang - Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management. - Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process. - Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management. 
- Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process. * Refactor test for InjectSetMaxNReg pass in TileLang - Improved readability by restructuring conditional checks and assertions in the test cases. - Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic. - Ensured consistent formatting and spacing throughout the test functions for better maintainability. * Enhance bulk copy and store checks in `Copy` class - Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options. - Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations. - Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging. * lint fix * [Refactor] Merge ThreadPartialSync and ThreadStorageSync (#741) * Remove `thread_partial_sync.cc` and refactor `thread_storage_sync.cc` to streamline synchronization handling. Introduce `thread_sync_types.h` for thread-bound key definitions and reserved named barriers. Update related logic in `ThreadSyncInserter` and `TileLangThreadSync` for improved clarity and efficiency. * Remove `sync_thread_partial` references and related documentation from the codebase. Update CUDA and HIP code generation files to eliminate calls to the removed function. Refactor `__sync_thread_partial` to `sync_thread_partial` in CUDA common header for consistency. * Remove unused import of `bulk_copy.h` in `codegen_hip.cc` to enhance code clarity and maintainability. * Add import of `bulk_copy.h` in `codegen_hip.cc` to support new functionality. 
* typo fix * Update data type in reduce_sum tests from float16 to float32 for consistency and clarity. Remove redundant dtype tests and streamline run functions. Enhance reshape kernel compilation with pass configurations to address shared memory layout issues. * lint fix * test fix * Enhance CI configuration by adding verbose output to pip install command for better visibility during installation. * use ninja instead of make * Add CMake configuration step for Ninja build system in setup.py * Update pyproject.toml to include additional build dependencies: build, torch, tox, auditwheel, patchelf, and ninja. * Enhance CI configuration by adding verbose output to pytest commands for improved test visibility. * Update pyproject.toml to add Cython as a build dependency. Enhance thread storage synchronization in thread_storage_sync.cc by introducing new thread variable handling and improving index disjointness checks. * Update data type in cumulative sum tests from float16 to float32 for consistency. Modify run_cumsum function to utilize the updated dtype and enhance result validation with assertions. Adjust test cases accordingly. * Refactor storage access handling by introducing buffer data mapping in TileLangStorageAccessVisitor. Enhance access entry structure to include pointer access flag. Update thread storage synchronization to accommodate new buffer data mappings. Adjust quickstart example to print kernel source for debugging purposes. * Refactor linear index conversion in TileLangStorageAccessVisitor to utilize the analyzer for simplification. Update buffer index calculations to ensure consistent simplification of range expressions. * bugfix * Refactor buffer index calculation in TileLangStorageAccessVisitor to simplify access handling. Removed unused buffer mapping logic, ensuring consistent buffer index generation with a default ramp. * Refactor TileLangStorageAccessVisitor to replace buffer indices with buffer ranges for improved pointer access handling. 
Update AccessEntry structure to include buffer_ranges and adjust thread storage synchronization logic to account for pointer access conflicts. * Refactor thread storage synchronization to replace 'shared.dyn' with 'shared' for consistency in memory allocation. Update related test cases to reflect this change and ensure proper functionality. * [Enhancement] Optimize loop body handling in IR (#749) - Updated the loop body construction in `ir.cc` to conditionally include an output statement based on the analyzable condition of the `waves` variable. - This change enhances performance by avoiding unnecessary statement wrapping when the condition is met, improving the efficiency of loop execution. Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [MXFP4] Fix bugs and optimize exponential operation (#750) * [MXFP4] Fix bugs - Optimize exp2 with shift operation to boost performance - Fix bug of simple dequantization function call - Fix bug of scaling factor with bias * [Lint] --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h (#751) - Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations. - Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations. * [Enhancement] Add shape checking for reduce options (#748) * Add shape checking for reduce options * lint fix * Handle special case reducing into shape-1 tensor: allow reducing [X, d, Y] into [X, Y] or [X, 1, Y] --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Bugfix] Add missing FP8 header include (#752) * [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h - Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations. 
- Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations. Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Enhancement] Include cuda_fp8.h in gemm_sm90.h - Added the inclusion of the "cuda_fp8.h" header file to support new data formats in CUDA GEMM operations, enhancing compatibility with recent updates for fp8 types. Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * lint fix * [Refactor] Remove unused tl_shuffle_elect and related functions from common.h - Deleted the `tl_shuffle_elect` function and its associated comments to streamline the codebase. - Added inclusion of "intrin.h" for improved intrinsic support in CUDA operations. - Cleaned up the file by removing unnecessary template parameters and functions, enhancing clarity and maintainability. * lint fix * [Refactor] Update header inclusions in common.h and gemm_sm90.h - Removed the inclusion of "intrin.h" from common.h to streamline dependencies. - Added "intrin.h" inclusion in gemm_sm90.h to ensure intrinsic support for CUDA operations, enhancing functionality and maintainability. * bug fix * [MXFP4] Add bias to MXFP4 GEMM kernel (#753) * [MXFP4] Add bias to gemm kernel * [Lint] * [Lint] Rename "bias" to "Bias" * [Bugfix][WS] Consider loop min extent when computing phase id (#754) * Update test parameters and remove debug print statement - Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution. - Removed a debug print statement from `phase.py` to clean up the code and enhance clarity. * Refactor loop stack management in warp_specialized_rewriter - Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability. - Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability. 
- Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations. * [Typo] Remove `disable_cache` in some tests (#755) * Update test parameters and remove debug print statement - Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution. - Removed a debug print statement from `phase.py` to clean up the code and enhance clarity. * Refactor loop stack management in warp_specialized_rewriter - Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability. - Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability. - Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations. * Remove unused `torch.backends` import and `tilelang.disable_cache()` calls from multiple test files to enhance code clarity and maintainability. * [README] Update GDN README for clarity and add acknowledgements (#758) - Improved formatting and clarity of the GDN kernel implementation description. - Updated requirement section to list dependencies in a clearer format. - Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions. 
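The phase-id fix from #754 above boils down to offsetting each loop index by its `min` when linearizing, which the new `LoopInfo` struct makes explicit. A schematic Python model of that computation (field names follow the commit message; this is not the C++ rewriter itself):

```python
from collections import namedtuple

# Mirrors the LoopInfo struct described above: the loop variable's
# current value, its extent, and its minimum bound.
LoopInfo = namedtuple("LoopInfo", ["value", "extent", "min"])

def linear_index(loops):
    """Linearize a stack of nested loops, outermost first, offsetting
    each index by the loop's min so a non-zero lower bound does not
    shift the computed phase id."""
    idx = 0
    for loop in loops:
        idx = idx * loop.extent + (loop.value - loop.min)
    return idx
```

With the pre-fix behavior (dropping `- loop.min`), any loop whose `min` is non-zero would yield a wrong phase id on its first iteration.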
* cutlass v4.2.0 supporting cuda 13 (#760) * [Feature] Add 1D TMA support (#761) * [Feature] Add 1D TMA support - Check the contiguous conditions of 1D TMA copy - Add new interface and params order of `tma_load` and `tma_store` call - Add 1D `tma_store` interface in sm90 template - Add elementwise kernel for 1D TMA example * [Lint] * [BugFix] Add conditions for 1D TMA copy on non-swizzle shared tensors * [Lint] * [BugFix] 1D TMA load * [README] Update GDN README for clarity and add acknowledgements (#758) - Improved formatting and clarity of the GDN kernel implementation description. - Updated requirement section to list dependencies in a clearer format. - Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions. * cutlass v4.2.0 supporting cuda 13 (#760) * [Lint] * [Lint] * [MXFP4] Add test for bf16&mxfp4 gemm * [BugFix] * [Lint] --------- Co-authored-by: Yu Cheng <54519279+chengyupku@users.noreply.github.com> Co-authored-by: Johnny <johnnync13@gmail.com> * [Example] Add vertical slash sparse attention pattern (#762) * upd sparse attn * lint * rename * update test file * update benchmark * lint * update benchmark * [Bugfix] Address PassContext contamination from CI and fix incorrect rewrites in warp specialized pass (#767) * fix ci and pass bug * fix * try * lint * [MXFP4] Add 1D TMA copy for Scale tensor in MXFP4 GEMM (#766) * [TMA] Add 1D TMA copy for Scale tensor * [Lint] * [Test] Add test for kernel * [BugFix] * hot fix blackwell (#768) * [Refactor] Refactor `Operator` into `TileOperator` and with tvm reflection (#763) * Refactor operator classes to inherit from TileOperator and update layout inference methods - Changed base class of several operator classes (AtomicAdd, Copy, Gemm, etc.) from Operator to TileOperator for better alignment with tile operations. - Updated InferLayout and Lower methods to use 'override' specifier for clarity and consistency. 
- Adjusted header inclusions to replace "op.h" with "operator.h" across multiple files for improved organization. - Added missing layout inference implementations for Fill and Conv2DIm2ColOp. - Removed deprecated op.cc and op.h files to streamline the codebase. * lint fix * Refactor operator classes to use Node pattern and improve memory management - Updated several operator classes (AtomicAdd, Copy, Gemm, etc.) to utilize the Node pattern for better memory management and encapsulation. - Changed constructors to initialize member variables through a node object, enhancing clarity and reducing direct member access. - Updated Clone methods to return TileOperator instances instead of unique pointers, aligning with the new design. - Refactored InferLayout and Lower methods to ensure consistency across operator implementations. - Adjusted header files to reflect the new class structure and removed deprecated code for a cleaner codebase. * Enhance Clone methods in AtomicAdd and Copy classes to support parallel operation cloning - Updated the Clone methods in AtomicAddNode and CopyNode to ensure that the parallel operation (par_op_) is properly cloned when defined, improving the integrity of cloned objects. - Refactored the FillNode class to use ParallelOp directly instead of std::make_unique, streamlining the creation of parallel operations. - Made minor adjustments in layout inference and other related methods for consistency and clarity. * Refactor FillNode::Lower method to remove unused global function call - Eliminated the call to the global function "tl.fill.lower" in the FillNode::Lower method, streamlining the code and improving clarity. - Retained the core functionality of the method while enhancing maintainability by reducing unnecessary dependencies. 
* [Reducer] Introduce `alloc_reducer` to separate inter and intra warp reduction (#757) * [Enhancement] Introduce finalize_reducer operator and layout reducer support - Added `FinalizeReducer` operator to handle reduction finalization in the TileLang framework, allowing for efficient reduction operations. - Implemented layout inference for local.reducer buffers, enhancing the handling of layout mappings and reducing complexity in buffer management. - Updated `setup.py` to include logging for build directory paths, improving build process visibility. - Enhanced atomic operations with new functions for atomic max, min, load, and store, providing more robust atomicity control in memory operations. - Refactored parallel loop handling to incorporate reducer information, ensuring proper management of reduction operations in parallel contexts. - Cleaned up test cases by removing unnecessary cache disabling and optimizing test parameters for better performance. * Refactor code formatting and improve readability in multiple files - Cleaned up whitespace in `setup.py` to enhance logging clarity. - Reformatted `AtomicMax` and `AtomicMin` functions in `common.h` for better alignment and readability. - Adjusted `debug_print_var` function in `debug.h` to improve code structure and maintainability. - Enhanced readability of the `atomic_add` function in `customize.py` by breaking long lines for better clarity. * Remove debug print statements from `copy.cc` and `inject_tma_barrier.cc` to enhance code clarity and maintainability. * [Enhancement] Disable reuse of small arrays in shared memory allocation - Added logic to prevent the reuse of small arrays (<= 32 bits) in `merge_shared_memory_allocations.cc`, ensuring they are lowered to registers in LLVM for improved performance and memory management. * Refactor `setup.py` to remove duplicate logging statements and enhance clarity. 
Update `finalize_reducer` function documentation in `reduce.py` to include detailed parameter and return descriptions, improving code readability and maintainability. * Refactor `finalize_reducer` and `reduce` functions to remove redundant target checks. Simplified conditionals by retaining only the `TargetIsHopper` check, enhancing code clarity and maintainability. * bug fix * Add thread checks workaround for replicated cases * Remove the is_one check * fix lint error * lint fix * Update autotune tests to use smaller matrix sizes for improved performance and reliability * [Refactor] Update FinalizeReducer to FinalizeReducerOp and adjust related methods - Refactored FinalizeReducer class to FinalizeReducerOp, updating constructor and method signatures for consistency with the new TileOperator structure. - Enhanced layout inference and cloning methods in FinalizeReducerOpNode. - Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main. - Adjusted header inclusions for improved organization and clarity across multiple files. * [Refactor] Update atomic operations in common.h and modify test_example_flash_attention.py - Enhanced atomic operations (Add, Min, Max) in common.h to handle half and bfloat16 types more efficiently. - Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main, improving test organization. * [Refactor] Simplify CopyNode::LowerBulkCopy logic and update test execution - Removed redundant checks for contiguous memory access in CopyNode::LowerBulkCopy, streamlining the logic for TMA copy operations. - Updated test_tilelang_kernel_gemm.py to comment out the main testing function and call a specific test for i8i8i32 tensor operations instead, improving test focus. 
--------- Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com> Co-authored-by: Freebase6912 <amid-gauze-racing@duck.com> * 📝 Add docstrings to `pytile_0826` (#770) * 📝 Add docstrings to `pytile_0826` Docstrings generation was requested by @LeiWang1999. * https://github.com/tile-ai/tilelang/pull/763#issuecomment-3224197814 The following files were modified: * `src/op/atomic_add.cc` * `src/op/atomic_add.h` * `src/op/copy.cc` * `src/op/copy.h` * `src/op/elem.cc` * `src/op/elem.h` * `src/op/gemm.cc` * `src/op/gemm.h` * `src/op/gemm_sp.cc` * `src/op/gemm_sp.h` * `src/op/operator.cc` * `src/op/operator.h` * `src/op/parallel.cc` * `src/op/parallel.h` * `src/op/reduce.cc` * `src/op/reduce.h` * `src/op/region.cc` * `src/op/region.h` * `src/transform/layout_inference.cc` * `src/transform/lower_tile_op.cc` * lint fix --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * [Bugfix]:Fix atomic add auto vectorize negative optimization (#765) * [Bugfix]:Fix atomic add auto vectorize negative optimization * fixbug * format * fix bug * 📝 Add docstrings to `reducer_0825` (#772) * 📝 Add docstrings to `reducer_0825` Docstrings generation was requested by @LeiWang1999. 
* https://github.com/tile-ai/tilelang/pull/757#issuecomment-3219088118 The following files were modified: * `setup.py` * `src/op/builtin.h` * `src/op/finalize_reducer.cc` * `src/op/finalize_reducer.h` * `src/op/parallel.cc` * `src/op/parallel.h` * `src/op/reduce.cc` * `src/target/codegen_cuda.cc` * `src/tl_templates/cuda/common.h` * `src/transform/layout_inference.cc` * `src/transform/layout_reducer.cc` * `src/transform/layout_reducer.h` * `src/transform/merge_shared_memory_allocations.cc` * `src/transform/storage_access.cc` * `src/transform/warp_specialized_rewriter.cc` * `testing/python/autotune/test_tilelang_autotune_with_inputs.py` * `tilelang/engine/phase.py` * `tilelang/language/customize.py` * `tilelang/language/reduce.py` * `tilelang/transform/__init__.py` * lint fix * lint fix --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> * Allow fill global buffer (#774) * Allow fill global buffer * fix lint error * [BugFix] Refactor the op check in LowerTileOp pass using the member function instead of string match (#771) * [BugFix] Refactor the op check in LowerTileOp pass using the member function instead of string match * [Lint] * add bf16 exp fallback (#776) * [Lint] Introduce clang-tidy into format.sh (#777) * [Refactor] Update Clang-Tidy Checks and Improve Code Consistency - Enhanced .clang-tidy configuration by adding specific checks for better bug detection and performance optimization. - Refactored function signatures across multiple files to use `const` references for parameters, improving performance and code clarity. - Updated various methods to ensure consistent handling of parameters, particularly in `AddPredicate`, `Substitute`, and `PlanLoopPartition` functions. - Improved readability by replacing size checks with `empty()` method calls in several locations, ensuring clearer intent in the code. 
- General code cleanup and adherence to best practices for better maintainability. * [Refactor] Enhance Code Consistency and Clang-Tidy Configuration - Updated .clang-tidy configuration to include additional checks for improved code quality and performance. - Refactored function signatures across multiple files to use `const` references, enhancing performance and clarity. - Replaced size checks with `empty()` method calls in various locations for clearer intent. - Improved handling of parameters in several functions, ensuring consistent usage of `std::move` where applicable. - General code cleanup to adhere to best practices and improve maintainability. * [Refactor] Integrate Clang-Tidy Checks and Enhance Code Consistency - Added clang-tidy checks to the format script for improved code quality assurance. - Refactored function signatures across multiple files to consistently use `const` references, enhancing performance and clarity. - Updated the requirements-lint.txt file to include clang-tidy as a dependency. - General code cleanup to adhere to best practices and improve maintainability. * [CI] Update AMD CI Workflow to Include Build Directory Creation - Added steps to create a build directory and configure CMake with ROCm support during the format check process. - Ensured cleanup of the build directory after the format check to maintain a clean workspace. * [Refactor] Remove Unused Member Variables in AtomicAddNode and CopyNode - Removed the `args_` member variable from both `AtomicAddNode` and `CopyNode` classes to streamline the code and eliminate unnecessary data members. - This change enhances code clarity and maintainability by focusing on relevant attributes for each class. * [Refactor] Update Clang-Tidy Integration and Code Improvements - Modified the format script to include the `-fix` option in the clang-tidy command for automatic code fixes. 
- Refactored the `AtomicAddVectorizePlanner` class to improve variable handling and consistency, including changes to member variable types and function signatures. - Enhanced code clarity by removing unnecessary `std::move` calls and ensuring consistent usage of types across the class. - General code cleanup to adhere to best practices and improve maintainability. * [Refactor] Improve Parameter Handling and Consistency in AtomicAddVectorize - Updated function signatures in `AtomicAddVectorizePlanResult` and `AtomicAddVectorizeRewriter` to use `const` references and `std::move` for better performance and clarity. - Enhanced the `UpdateVectorSize` method to accept `const Array<PrimExpr>&` for improved efficiency. - General code cleanup to maintain consistency and adhere to best practices. * [CI] Add Git Submodule Initialization to CI Workflow - Included a step to initialize and update git submodules recursively in the CI workflow. - This change ensures that all necessary submodules are available during the format check process, improving build reliability. * [CI] Add Git Submodule Update Step to Format Check - Included a command to initialize and update git submodules recursively in the CI workflow during the format check process. - This enhancement ensures that all required submodules are available, contributing to improved build reliability. * [Refactor] Update Function Signatures in AtomicAddVectorize - Modified the `VectorizeAtomicAdd` function signature to use `const` references for `thread_var` and `thread_bounds`, enhancing performance and code clarity. - This change aligns with previous refactoring efforts to improve parameter handling and consistency across the codebase. 
* [Cache] Introduce detailed target information for the disk kernel cache (#780) * Fix type hint for target_host parameter in compile function to allow None value * Refactor target handling in compile function to utilize determine_target for improved clarity and consistency * Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code. * Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity. * Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling. * Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits. * [Example] Adds example for top-k operation (#775) * [Example] Adds example for top-k operation: adds an example demonstrating the top-k operation using tilelang * format * Adds topk tilelang example test * fix lint * [Math] Dispatch `T.rsqrt(x)` into cuda intrin instead of `1 / T.sqrt(x)` (#781) * Fix type hint for target_host parameter in compile function to allow None value * Refactor target handling in compile function to utilize determine_target for improved clarity and consistency * Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code. 
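The `T.rsqrt` dispatch in #781 above replaces a generic `1 / sqrt(x)` lowering with a direct per-dtype CUDA intrinsic. The lookup can be sketched as follows — `rsqrtf` is the standard CUDA single-precision intrinsic, `hrsqrt` is the half_t helper the commit adds to `common.h`, and everything else here (table shape, function name) is an illustrative assumption, not the actual intrin_rule code:

```python
# Map (op, dtype) to the CUDA expression a codegen rule would emit.
INTRIN_TABLE = {
    ("rsqrt", "float32"): "rsqrtf({0})",
    ("rsqrt", "float16"): "hrsqrt({0})",
}

def lower_rsqrt(arg, dtype):
    """Prefer a hardware intrinsic when one is registered for the
    dtype; otherwise fall back to the old 1/sqrt lowering."""
    pattern = INTRIN_TABLE.get(("rsqrt", dtype))
    if pattern is not None:
        return pattern.format(arg)
    return f"(1.0f / sqrtf({arg}))"
```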
* Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity. * Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling. * Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits. * Add intrin_rule source files to CMakeLists.txt and implement hrsqrt function for half_t in common.h * lint fix * remove cmake dep in pyproject as it may lead to different cmake paths in diff stages * lint fix * Add cmake dependency to pyproject.toml and improve build logging in setup.py * [CI] Adds pytest-durations for test timing (#782) * [CI] Adds pytest-durations for test timing: adds `pytest-durations` to the test requirements and configures pytest to display test durations. This helps in identifying slow-running tests and optimizing the test suite for faster feedback. * add amd ci durations * Removes flash_attn installation from CI * [Refactor] Support python reflection for tile operators (#783) * Implement Fill operator and related reflection methods in TileLang - Added Fill operator implementation in `fill.cc` and `fill.h` for element-wise filling of buffers. - Introduced reflection methods for Fill, AtomicAdd, Copy, Conv2DIm2Col, FinalizeReducer, Gemm, and Parallel operators to enhance introspection capabilities. - Updated relevant files to register reflection methods and ensure proper initialization in static blocks. - Removed outdated comments and unnecessary code in various operator files to improve clarity and maintainability. - Added new Python bindings for the Fill operator in `tilelang/ir/fill.py` and updated the module imports accordingly. 
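The Python reflection described above follows TVM's object-registration pattern: each C++ node registers reflection metadata under a type key, and a Python class is bound to the same key so reflected nodes resolve to it. A self-contained mock of that pattern — this is an illustration of the mechanism, not the actual TVM FFI:

```python
_OBJECT_REGISTRY = {}

def register_object(type_key):
    """Bind a Python class to a C++ type key, mimicking the role of
    tvm's register_object decorator used by the tilelang/ir bindings."""
    def wrap(cls):
        cls.type_key = type_key
        _OBJECT_REGISTRY[type_key] = cls
        return cls
    return wrap

@register_object("tl.Fill")
class Fill:
    """Stand-in for the Fill operator binding (element-wise fill)."""
    def __init__(self, dst, value):
        self.dst, self.value = dst, value

@register_object("tl.Gemm")
class Gemm:
    """Stand-in for the Gemm operator binding."""

def resolve(type_key):
    """Look up the Python class registered for a reflected node."""
    return _OBJECT_REGISTRY[type_key]
```

With real TVM, the lookup happens automatically when a reflected C++ object crosses the FFI boundary; the registry above just makes the mapping visible.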
  - Refactor operator reflection methods for clarity: use `empty()` instead of size checks in AtomicAdd, Copy, FinalizeReducer, Gemm, and Parallel; consolidate static initialization blocks to a single line; clean up whitespace and formatting; add Python bindings for operators in the `tilelang/ir` module with proper registration and organized imports.
  - Refactor GEMM and AtomicAdd: update `GetArchInt` in `atomic_add.cc` to use `std::string` and `std::stoi` for readability and type safety; remove unnecessary variables and comments in `gemm_sp.cc` and `gemm.cc` to streamline `ComputeWarpPartition`; drop unused variable declarations in `layout_reducer.cc`; import the `ir` module in `tilelang/__init__.py`.
  - Remove deprecated operator files from the tilelang IR module: delete the Fill, AtomicAdd, Copy, Gemm, GemmSP, FinalizeReducer, Parallel, Reduce, and Region operator files to streamline the codebase.
  - Refactor imports in `tilelang/ir.py` to reflect changes in the TVM library structure; lint fix.
  - Refactor GEMM and GEMM-SP to use a new `GemmWarpPolicy` object for warp partitioning, improving encapsulation and readability.
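The `GetArchInt` cleanup above swaps C-style parsing for `std::string` and `std::stoi`. A Python analogue of what such a helper does (illustrative only, assuming an `sm_`-prefixed arch string; the actual C++ signature and input format are not shown here):

```python
def get_arch_int(arch: str) -> int:
    """Extract the numeric compute capability from an arch string.

    Hypothetical Python analogue of the std::stoi-based C++ helper,
    e.g. "sm_90" -> 90.
    """
    prefix = "sm_"
    if not arch.startswith(prefix):
        raise ValueError(f"unexpected arch string: {arch!r}")
    return int(arch[len(prefix):])
```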
  - Remove the deprecated `ComputeWarpPartition` methods in favor of calls to the new policy object; clean up comments and unnecessary code in `gemm.cc`, `gemm_sp.cc`, and related headers; introduce a `GemmWarpPolicyNode` class to manage warp-policy attributes and methods; update reflection methods to cover the new policy structure.
  - Refactor the Reduce operation to use a `ReduceType` class: replace scattered conditional checks on reduce types with a single `ReduceType` object; introduce a `ReduceTypeNode` class encapsulating reduce-type logic; update `MakeInitValue`, `MakeReduce`, and `Lower` to leverage the new class; add Python bindings for `ReduceType` in the tilelang IR module.
  - Clean up formatting and whitespace in `atomic_add.h`, `copy.h`, `fill.h`, `reduce.cc`, and `reduce.h`; consolidate comments and adjust line breaks across operator definitions.
  - Rename the `MakeReduce` parameter from `rhs` to `b`, assigning it to `rhs` internally for readability.
  - Normalize reduce-type string comparisons in `MakeReduce` from "abs_sum" to "abssum" and "abs_max" to "absmax" for uniformity.
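The `ReduceType` consolidation above centralizes the per-type choices that `MakeInitValue` and `MakeReduce` make. A hypothetical Python sketch of that dispatch (names and values are illustrative; the real operator emits TIR expressions, not Python floats):

```python
def make_init_value(reduce_type: str) -> float:
    # identity element for each reduction kind (illustrative)
    return {
        "sum": 0.0,
        "abssum": 0.0,
        "max": float("-inf"),
        "absmax": 0.0,   # |x| >= 0, so 0 is a safe identity
        "min": float("inf"),
    }[reduce_type]

def make_reduce(reduce_type: str, lhs: float, rhs: float) -> float:
    # combine step, mirroring MakeReduce's per-type branches (illustrative)
    if reduce_type == "sum":
        return lhs + rhs
    if reduce_type == "abssum":
        return lhs + abs(rhs)
    if reduce_type == "max":
        return max(lhs, rhs)
    if reduce_type == "absmax":
        return max(lhs, abs(rhs))
    if reduce_type == "min":
        return min(lhs, rhs)
    raise ValueError(f"unknown reduce type: {reduce_type}")
```

Folding a buffer with these two functions reproduces the reduction; centralizing them in one class is what removes the scattered string checks.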
  - This makes reduce-type handling consistent across the codebase.
* [AMD] Fix AMD TIR and add examples (#784)
  - [Enhancement] Refactor buffer index handling for improved precision and clarity (#668): remove redundant operations, streamline the buffer-overlap logic for more accurate conflict detection, and update the related documentation.
  - Remove the obsolete test script for the AMD example and the unused `dtype_size` variable in the AMD example script.
  - Add an input configuration file and update the AMD example script for flexibility: introduce `input.txt` for configurable parameters; extend `example_amd_flash_attn_fwd.py` with additional options for `num_stages`, `enable_rasterization`, and `k_pack`; streamline the main function; add a test script to run the example with specified parameters.
  - Remove `input.txt` and `test.sh` (no longer needed); annotate shared memory in `example_amd_flash_attn_fwd.py` with swizzle layouts to avoid bank conflicts; reintroduce swizzle usage in the kernel for better performance.
  - Refactor the AMD FlashAttention-2 example: rename `get_v2_configs` to `get_configs`, `fast_flashattn_v2` to `fast_flashattn`, and `main_v2` to `main`; remove outdated comments and improve organization.
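The buffer-overlap work in #668 above comes down to deciding whether two index ranges can touch the same elements. A minimal sketch of that kind of check, assuming half-open `[start, start + extent)` ranges (hypothetical helper, not TileLang's actual implementation, which operates on TIR index expressions):

```python
def ranges_overlap(a_start: int, a_extent: int,
                   b_start: int, b_extent: int) -> bool:
    # two half-open ranges overlap iff each one starts before the other ends;
    # conservative checks like this underlie buffer conflict detection
    return a_start < b_start + b_extent and b_start < a_start + a_extent
```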
  - Improve readability in `fast_flashattn` by adjusting line breaks and indentation; streamline `main` parameter formatting; remove unnecessary blank lines.
  - Update `example_amd_flash_attn_fwd.py` for clarity; add new CI workflows for AMD and documentation publishing; update requirements files with necessary dependencies; introduce new test cases and examples for better coverage; refactor existing code for readability.
  - Remove the redundant tool-cache cleanup step from the AMD CI workflow; remove the `torch` dependency from `requirements-rocm.txt` to streamline requirements.
  - Add a new AMD FlashAttention example and test script: `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang, plus a `test.sh` script to run it with specified parameters.
  - Tighten autotuner configurations in `example_amd_flash_attn_fwd.py`: reduce the thread-count and `num_split_q` options for improved performance, and trim the `panel_size` options.
  - Update submodule `tvm` to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217, then to 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c.
  - Add `example_amd_flash_attn_bwd.py` demonstrating the forward and backward Flash Attention passes with JIT-compiled functions, including preprocessing and postprocessing steps.
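The FlashAttention kernels referenced above are built around the streaming-softmax trick: the softmax normalizer is accumulated tile by tile, rescaling the running sum whenever a larger maximum appears. A pure-Python sketch of that rescaling (illustrative of the algorithm, independent of the TileLang kernels):

```python
import math

def online_softmax(xs):
    """Numerically stable softmax computed in one streaming pass."""
    m, d = float("-inf"), 0.0   # running max and rescaled denominator
    for x in xs:
        m_new = max(m, x)
        # rescale the old partial sum to the new max before adding the term
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]
```

Because each step only needs the running `(m, d)` pair, the attention scores never have to be materialized in full, which is what lets the kernel work on shared-memory tiles.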
  - Add a main function for testing and benchmarking the attention mechanism with configurable parameters, and include a reference implementation for validation against PyTorch's attention mechanism, giving users a comprehensive guide to Flash Attention.
  - Extend `example_amd_flash_attn_bwd.py` with more comprehensive testing features, better parameter configuration and benchmarking, and validation checks against PyTorch to ensure accuracy and reliability.
  - Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a.
  - Rename `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the shift in focus from HIP to CUDA intrinsic rules, and adjust include paths for better organization.
  - Update the AMD CI workflow: drop the `flash_attn==2.5.8` installation and uninstall `torch`, `torchvision`, and `torchaudio` before installing pre-release versions to ensure compatibility and reduce conflicts.
  - Remove unused shared-memory allocations (`dv_shared`, `dk_shared`) from `example_amd_flash_attn_bwd.py` to reduce memory overhead in the backward pass.
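Validating a hand-written backward pass against a reference, as the bullets above describe, usually boils down to comparing analytic gradients with finite differences. A minimal, framework-free sketch of that style of check (illustrative; the example itself compares against PyTorch):

```python
def numerical_grad(f, x, eps=1e-4):
    """Central-difference approximation of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

# sanity check: d/dx x^2 = 2x, so the gradient at x = 3 should be 6
analytic = 2.0 * 3.0
numeric = numerical_grad(lambda v: v * v, 3.0)
assert abs(analytic - numeric) < 1e-6
```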
  - Remove the pip uninstall step for `torch`, `torchvision`, and `torchaudio` from the AMD CI workflow; it is no longer required before installing pre-release versions, simplifying the CI process.
  - Refactor `DispatchHIPWarpActiveMask`: use `std::string` concatenation for the 16-bit-type case and add a null check on the `CallNode` pointer to guard against potential dereferencing issues.
  - Reformat the `TVM_REGISTER_OP` registrations for the HIP intrinsic rules, aligning method chaining for readability; no functional changes.
  - Update file na…