[FFI] Rebase tvm to v0.22.0 to utilize tvm-ffi #1108
Conversation
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run … before submitting. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in its review comments. This may lead to a less comprehensive review.

Walkthrough

Mass migration to TVM FFI: replace local object factories/GetRef and TVM object macros with tvm::ffi equivalents, add an ffi_aliases header, switch many public/internal types to ffi variants, normalize static-init macro usage, change several API signatures (ArgBinder, target/runtime modules), remove the WebGPU TileLang backend, and update Python FFI registrations and packaging.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Caller
participant Cpp as C++ code
participant FFI as tvm::ffi
Caller->>Cpp: request create Node / access Node
alt pre-migration
Cpp->>Cpp: make_object<Node>() / GetRef<Node>(...)
note right of Cpp: local TVM factory / unqualified GetRef
Cpp-->>Caller: ObjectRef
else post-migration
Cpp->>FFI: tvm::ffi::make_object<Node>() / tvm::ffi::GetRef<Node>(...)
    note right of FFI: FFI-managed allocation/reflection and ffi types
FFI-->>Cpp: ObjectPtr / ObjectRef
Cpp-->>Caller: ObjectRef
    end
```
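As a concrete illustration of the call-site change in the diagram (a minimal sketch; `LayoutNode`/`Layout` stand in for any migrated node type and are not tied to a specific hunk):

```cpp
#include <tvm/ffi/object.h>  // tvm::ffi::make_object / tvm::ffi::GetRef after the rebase

// pre-migration: unqualified helpers resolved against local TVM factories
//   ObjectPtr<LayoutNode> n = make_object<LayoutNode>();
//   Layout ref = GetRef<Layout>(node_ptr);

// post-migration: explicitly FFI-managed allocation and reference recovery
//   ObjectPtr<LayoutNode> n = tvm::ffi::make_object<LayoutNode>();
//   Layout ref = tvm::ffi::GetRef<Layout>(node_ptr);
```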
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
🔇 Additional comments (2)
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (14)
src/transform/make_packed_api.cc (2)
366-377: FFI type-code check is wrong for ints/floats.
`buf_packed_arg_type_ids` holds FFI type indices, but lines 369 and 375 check against `kDLInt` and `kDLFloat` (DLPack dtype codes), causing invalid runtime validation. The handle type case (lines 361-363) correctly uses `ffi::TypeIndex::kTVMFFI*` constants on the same buffer. Replace with:

```diff
 } else if (t.is_int() || t.is_uint()) {
   std::ostringstream msg;
   msg << name_hint << ": Expect arg[" << i << "] to be int";
-  seq_init.emplace_back(AssertStmt(type_index == kDLInt,
+  seq_init.emplace_back(AssertStmt(type_index == ffi::TypeIndex::kTVMFFIInt,
                                    tvm::tir::StringImm(msg.str()), nop));
 } else {
   ICHECK(t.is_float());
   std::ostringstream msg;
   msg << name_hint << ": Expect arg[" << i << "] to be float";
-  seq_init.emplace_back(AssertStmt(type_index == kDLFloat,
+  seq_init.emplace_back(AssertStmt(type_index == ffi::TypeIndex::kTVMFFIFloat,
                                    tvm::tir::StringImm(msg.str()), nop));
 }
```
129-135: Fix incorrect ret() call construction at lines 133 and 450.

Both lines use undefined or incorrect functions: line 133 calls `tvm::ret()`, which doesn't exist, and line 450 uses a bare `ret()`, which is undefined. The correct approach is to construct a `Call` with `builtin::ret()`, consistent with the pattern already in the codebase at line 417.

```diff
-  Stmt ret_zero = Evaluate(tvm::ret(0));
+  Stmt ret_zero = Evaluate(Call(DataType::Int(32), builtin::ret(), {Integer(0)}));
```

and

```diff
-  body = SeqStmt({body, Evaluate(ret(Integer(0)))});
+  body = SeqStmt({body, Evaluate(Call(DataType::Int(32), builtin::ret(), {Integer(0)}))});
```

src/target/utils.cc (1)
59-64: Bug: bitwise '&' used instead of logical '&&' in TargetIsSm100.

With `&`, both operands are always evaluated and the intent is obscured; use the logical operator. Apply this diff:

```diff
-  return arch >= 100 & arch <= 110;
+  return arch >= 100 && arch <= 110;
```

src/op/reduce.cc (1)
45-90: 64-bit overflow/UB in bit shifts when building init constants.

`(1 << (bits - 1))` and `(1 << bits)` overflow for bits ≥ 31 (signed) / 32 (unsigned). Use the `min_value`/`max_value` helpers instead of shifting:

```diff
   if (type->isMax()) {
-    if (is_int) {
-      return make_const(dst->dtype, -(1 << (bits - 1)));
-    } else if (is_uint) {
-      return make_const(dst->dtype, 0);
-    } else {
-      return make_const(dst->dtype, -INFINITY);
-    }
+    if (is_int) {
+      return min_value(dst->dtype);
+    } else if (is_uint) {
+      return make_const(dst->dtype, 0);
+    } else {
+      return make_const(dst->dtype, -INFINITY);
+    }
   } else if (type->isMin()) {
-    if (is_int) {
-      return make_const(dst->dtype, (1 << (bits - 1)) - 1);
-    } else if (is_uint) {
-      return make_const(dst->dtype, (1 << bits) - 1);
-    } else {
-      return make_const(dst->dtype, INFINITY);
-    }
+    if (is_int) {
+      return max_value(dst->dtype);
+    } else if (is_uint) {
+      return max_value(dst->dtype);
+    } else {
+      return make_const(dst->dtype, INFINITY);
+    }
   } else if (type->isBitAnd()) {
-    if (is_int) {
-      return make_const(dst->dtype, -1);
-    } else if (is_uint) {
-      return make_const(dst->dtype, (1 << bits) - 1);
-    } else {
-      // Should not arrive here
-      return make_const(dst->dtype, -INFINITY);
-    }
+    if (is_int) {
+      return make_const(dst->dtype, -1);
+    } else if (is_uint) {
+      return max_value(dst->dtype);
+    } else {
+      // Should not arrive here
+      return make_const(dst->dtype, -INFINITY);
+    }
```

Note: `min_value`/`max_value` are in tir/op.h, which is already included above.
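For readers unfamiliar with the failure mode, a minimal standalone illustration (plain C++ with <limits>, independent of TVM's helpers):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

int main() {
  const unsigned bits = 32;
  // UB on a 32-bit int: int bad = (1 << bits) - 1;
  // Widening to 64 bits before shifting (or using limits-style helpers,
  // like TVM's min_value/max_value) avoids the overflow:
  const uint64_t safe = (uint64_t{1} << bits) - 1;  // 4294967295
  assert(safe == std::numeric_limits<uint32_t>::max());
  return 0;
}
```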
src/transform/vectorize_loop.cc (1)
170-178: Bug: strides are cleared due to iterating an empty array.

You initialize `Array<PrimExpr> strides;` and then loop over `strides.size()`, which is 0. This drops existing strides when assigning back, changing layout semantics. Apply:

```diff
-  // Update strides if defined.
-  Array<PrimExpr> strides;
-  for (size_t i = 0; i < strides.size(); i++) {
-    PrimExpr stride = strides[i];
-    if (i != strides.size() - 1) {
-      stride *= var_lanes_;
-    }
-    strides.push_back(analyzer_.Simplify(stride));
-  }
+  // Update strides if defined.
+  Array<PrimExpr> strides = node->buffer->strides;
+  if (!strides.empty()) {
+    for (size_t i = 0; i < strides.size(); i++) {
+      PrimExpr stride = strides[i];
+      if (i != strides.size() - 1) {
+        stride *= var_lanes_;
+      }
+      strides.Set(i, analyzer_.Simplify(stride));
+    }
+  }
```

src/op/gemm_py.cc (1)
65-66: Fix type casting for clear_accum field.

The comment in the file (lines 38-41) and the Python interface both document `clear_accum` as `Bool` type. The parallel implementation in `src/op/gemm_sp.cc` (line 98) and the validation in `src/op/gemm.cc` (lines 592-595) both use `.as<Bool>()` casting. Update the parsing to match:

```diff
-  node->clear_accum = args[9].as<PrimExpr>().value();
+  node->clear_accum = args[9].as<Bool>().value();
```

src/target/rt_mod_cuda.cc (3)
47-54: Unqualified TIR types; qualify to avoid ADL/visibility pitfall.
`PrimFuncNode`/`PrimFunc` are used without `tir::` here, unlike earlier in the file. This can break if no `using` brings them in.

```diff
-  ICHECK(kv.second->IsInstance<PrimFuncNode>())
+  ICHECK(kv.second->IsInstance<tir::PrimFuncNode>())
 ...
-  auto f = Downcast<PrimFunc>(kv.second);
+  auto f = Downcast<tir::PrimFunc>(kv.second);
```

Applies to both loops.
Also applies to: 79-87
64-68: Guard against empty PTX before indexing.
`ptx[0]` is UB when `ptx` is empty (the compile callback could return an empty string).

```diff
-    if (ptx[0] != '/')
+    if (!ptx.empty() && ptx[0] != '/')
       fmt = "cubin";
```
69-70: Replace `ICHECK(0)` with a descriptive fatal.

Improves diagnosability.

```diff
-  } else {
-    ICHECK(0);
-  }
+  } else {
+    LOG(FATAL) << "Missing global func: tilelang_callback_cuda_compile";
+  }
```

src/ir.cc (1)
7-16: Include the standard headers used here.
- std::vector and std::sort are used but not explicitly included.
Apply:
#include "./transform/common/attr.h" #include "op/builtin.h" #include "tvm/ffi/any.h" #include <tvm/ffi/object.h> #include "support/ffi_aliases.h" #include <tvm/arith/analyzer.h> #include <tvm/ffi/reflection/registry.h> #include <tvm/script/ir_builder/tir/ir.h> +#include <vector> +#include <algorithm>src/op/gemm.h (1)
10-12: Add missing `<optional>` header to fix std::optional usage.

The header uses `std::optional` (line 101) but does not include `<optional>`. While `support/ffi_aliases.h` is already transitively included via `operator.h`, it provides only the FFI wrapper `Optional`, not the standard library `std::optional`.

```diff
 #include "operator.h"
+#include <optional>
```

The `#include "support/ffi_aliases.h"` is optional; it's already transitively available through `operator.h` → `layout.h` → `support/ffi_aliases.h`, but adding it explicitly improves header self-sufficiency if desired.
84-86: Potential null deref in extent check.

If `T.thread_bounds->extent` is symbolic, `as_const_int(...)` may be null and dereferenced in the ICHECK. Guard before deref. Apply this minimal fix:

```diff
-  int extent = *p_extent, scale = 1;
-  ICHECK(extent == 1 || extent == *as_const_int(T.thread_bounds->extent))
+  int extent = *p_extent, scale = 1;
+  const int64_t* p_thr = as_const_int(T.thread_bounds->extent);
+  ICHECK(p_thr != nullptr && (extent == 1 || extent == *p_thr))
       << "Illegal finalize_reducer: extent=" << extent
       << "; T.thread_bounds=" << T.thread_bounds;
```

src/op/gemm.cc (2)
115-152: Constructor reads beyond args without size checks; fix parsing robustness.

Accessing args[16..18] unconditionally can OOB when older call sites omit them. Also prefer robust Downcast/Optional handling for types. Apply:

```diff
 Gemm::Gemm(Array<PrimExpr> args, BufferMap vmap) {
   ObjectPtr<GemmNode> node = tvm::ffi::make_object<GemmNode>();
   node->Aptr = args[0];
   node->Bptr = args[1];
   node->Cptr = args[2];
   node->A = vmap[GetVarFromAccessPtr(node->Aptr)];
   node->B = vmap[GetVarFromAccessPtr(node->Bptr)];
   node->C = vmap[GetVarFromAccessPtr(node->Cptr)];
-  node->trans_A = args[3].as<Bool>().value();
-  node->trans_B = args[4].as<Bool>().value();
+  node->trans_A = Downcast<Bool>(args[3]).value();
+  node->trans_B = Downcast<Bool>(args[4]).value();
   node->M = args[5].as<IntImm>().value()->value;
   node->N = args[6].as<IntImm>().value()->value;
   node->K = args[7].as<IntImm>().value()->value;
   node->policy = GemmWarpPolicy(args[8].as<IntImm>().value()->value);
-  node->clear_accum = args[9].as<PrimExpr>().value();
+  node->clear_accum = Downcast<Bool>(args[9]);
   node->stride_A = args[10].as<IntImm>().value()->value;
   node->stride_B = args[11].as<IntImm>().value()->value;
   node->offset_A = args[12].as<IntImm>().value()->value;
   node->offset_B = args[13].as<IntImm>().value()->value;
   if (args.size() > 14) {
     node->kPack = args[14].as<IntImm>().value()->value;
     if (node->kPack != 1 && node->kPack != 2) {
       ICHECK(false) << "kPack must be 1 or 2";
     }
   }
-  if (args.size() > 15) {
-    node->wg_wait = args[15].as<IntImm>().value()->value;
-  }
-  node->mbarptr = args[16];
-  if (node->mbarptr.as<CallNode>()) {
-    node->mbar = vmap[GetVarFromAccessPtr(node->mbarptr)];
-  } else {
-    node->mbar = std::nullopt;
-  }
-  node->C_coords = Array<PrimExpr>(
-      {args[17].as<PrimExpr>().value(), args[18].as<PrimExpr>().value()});
+  if (args.size() > 15) {
+    node->wg_wait = args[15].as<IntImmNode>()->value;
+  }
+  if (args.size() > 16) {
+    node->mbarptr = args[16];
+    if (node->mbarptr.as<CallNode>()) {
+      node->mbar = vmap[GetVarFromAccessPtr(node->mbarptr)];
+    } else {
+      node->mbar = std::nullopt;
+    }
+  }
+  if (args.size() > 18) {
+    node->C_coords = {args[17], args[18]};
+  } else {
+    node->C_coords = {};
+  }
```

Also update the comment above to document the additional parameters and new optionality.
877-887: FFI binding drops return value from ComputeWarpPartition.

The lambda doesn't return the computed pair; callers can't observe the result. Return both values (e.g., as an Array or a Tuple). Apply one of:

```diff
-  refl::GlobalDef().def("tl.GemmWarpPolicyComputeWarpPartition",
-                        [](GemmWarpPolicy policy, int M, int N, int block_size,
-                           Target target, GemmInst gemm_inst) {
-                          policy->ComputeWarpPartition(M, N, block_size, target,
-                                                       gemm_inst);
-                        });
+  refl::GlobalDef().def("tl.GemmWarpPolicyComputeWarpPartition",
+                        [](GemmWarpPolicy policy, int M, int N, int block_size,
+                           Target target, GemmInst gemm_inst) -> Array<Integer> {
+                          auto res = policy->ComputeWarpPartition(M, N, block_size, target, gemm_inst);
+                          return {Integer(res.first), Integer(res.second)};
+                        });
```

If your FFI prefers packed returns, switch to def_packed and set rv accordingly.
🧹 Nitpick comments (21)
src/transform/lower_device_kernel_launch.cc (1)
39-39: Optional: The `using namespace ffi;` directive appears unused.

All FFI components in this file are explicitly qualified (e.g., `tvm::ffi::GetRef` on lines 376 and 392), so this namespace directive isn't currently utilized. Consider removing it, or use unqualified FFI names if consistency with the `using namespace tir;` pattern is desired.

src/transform/storage_access.cc (1)
343-407: FFI migration correctly applied.

All `tvm::ffi::GetRef<Var>` conversions are necessary to convert the raw `const VarNode*` (line 346) to `Var` objects.

Minor optimization opportunity: lines 350, 355, 360-361, and 392 each create a new `Var` reference from the same `buffer_var` pointer. You could cache `tvm::ffi::GetRef<Var>(buffer_var)` in a local variable to avoid repeated conversions, though the performance impact is likely negligible.

Optional optimization:

```diff
   DataType dtype = op->args[0].dtype();
   const VarNode *buffer_var = op->args[1].as<VarNode>();
+  Var buffer_var_ref = tvm::ffi::GetRef<Var>(buffer_var);
   PrimExpr offset = op->args[2];
   PrimExpr extent = op->args[3];
   const IntImmNode *flag = op->args[4].as<IntImmNode>();
-  StorageScope scope = GetScope(tvm::ffi::GetRef<Var>(buffer_var));
+  StorageScope scope = GetScope(buffer_var_ref);
   // The buffer scope.
   if (Enabled(buffer_var, scope)) {
     ICHECK(allow_append_);
     Array<Range> buffer_ranges;
-    if (buffer_data_to_buffer_.find(tvm::ffi::GetRef<Var>(buffer_var)) ==
+    if (buffer_data_to_buffer_.find(buffer_var_ref) ==
         buffer_data_to_buffer_.end()) {
       // cannot find buffer map, use the default buffer
       buffer_ranges = {Range::FromMinExtent(offset, extent)};
     } else {
-      Buffer buffer =
-          buffer_data_to_buffer_.at(tvm::ffi::GetRef<Var>(buffer_var));
+      Buffer buffer = buffer_data_to_buffer_.at(buffer_var_ref);
       // ... (rest of logic unchanged)
     }
     AccessEntry e;
     e.threads = env_threads();
     e.thread_range = this->ComputeThreadRange(e.threads);
     e.dtype = dtype;
-    e.buffer = tvm::ffi::GetRef<Var>(buffer_var);
+    e.buffer = buffer_var_ref;
```

src/transform/inject_pipeline.cc (1)
40-40: Verify if this namespace import is actually used.

The `using namespace ffi;` directive appears unused, as all FFI members throughout the file are explicitly qualified (e.g., `tvm::ffi::make_object` at line 463, `tvm::ffi::GetRef` at line 985, `tvm::ffi::reflection` at line 1077). If this import is not used elsewhere in the file, consider removing it:

```diff
 using namespace tir;
-using namespace ffi;
```
42-42: Consider moving the `using namespace ffi;` directive to the implementation file.

Using-namespace directives in headers can lead to namespace pollution for all translation units that include this header. Since no FFI types or functions are directly used in this header declaration, consider moving this directive to the corresponding `.cc` implementation file, where FFI functions like `make_object` or `GetRef` are likely called.

Alternatively, use explicit `ffi::` qualification at call sites, or prefer specific using-declarations (e.g., `using ffi::make_object;`) for better clarity and to avoid potential name collisions.

src/transform/inject_tma_barrier.cc (4)
296-301: Assumes barrier_id is always `Call(get_mbarrier(...))`; verify or guard.

`tma_op_to_barrier_id_[GetRef(op)].as<CallNode>()->args[0]` will segfault if the value isn't a Call. Add an ICHECK on the `as<CallNode>()` result, or handle the Var/IntImm cases.
Would you like me to harden this with a small guard and fallback?
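A minimal sketch of such a guard, assuming the map stores `ObjectRef` values and that `Call`, `Var`, and `IntImm` are the only expected payload kinds (names follow the snippet above; the fallback branch is illustrative):

```cpp
// Hedged sketch: validate the stored barrier id before indexing args.
ObjectRef barrier = tma_op_to_barrier_id_[tvm::ffi::GetRef<Call>(op)];
if (const auto *call = barrier.as<CallNode>()) {
  ICHECK(!call->args.empty()) << "get_mbarrier call has no arguments";
  return call->args[0];
}
// Fall back to plain expressions (Var/IntImm) instead of segfaulting.
ICHECK(barrier.as<VarNode>() || barrier.as<IntImmNode>())
    << "unexpected barrier id kind: " << barrier;
return Downcast<PrimExpr>(barrier);
```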
408-416: Unused local `block`.

`auto block = tvm::ffi::GetRef<Block>(op)` is unused. Remove it to avoid warnings.

```diff
-  auto block = tvm::ffi::GetRef<Block>(op);
```
487-494: Eliding arrive_barrier by returning 0; confirm semantics.

Replacing `builtin::ptx_arrive_barrier()` with the literal 0 relies on the surrounding `Evaluate(...)` context treating it as a no-op. If any code inspects side effects, this may mislead. Consider returning an `Evaluate(0)` at the stmt layer instead where possible.
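One way to do that, sketched under the assumption that the rewrite happens in a `StmtExprMutator` and that dropping the arrive is actually safe here:

```cpp
// Hedged sketch: elide the arrive as a statement-level no-op rather than
// rewriting the call expression to a bare literal 0.
Stmt VisitStmt_(const EvaluateNode *op) final {
  if (const auto *call = op->value.as<CallNode>()) {
    if (call->op.same_as(builtin::ptx_arrive_barrier())) {
      return Evaluate(0);  // explicit no-op statement
    }
  }
  return StmtExprMutator::VisitStmt_(op);
}
```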
82-87: Typo: old_loop_evtents.

Rename to `old_loop_extents` for clarity.

```diff
-  PrimExpr old_loop_evtents = loop_extents;
+  PrimExpr old_loop_extents = loop_extents;
   loop_extents *= op->extent;
   StmtExprVisitor::VisitStmt_(op);
-  loop_extents = old_loop_evtents;
+  loop_extents = old_loop_extents;
```
106-114: Expose remaining fields in reflection (disable_tma, eviction_policy, par_op_).

For parity with other ops (e.g., AtomicAddNode), add `def_ro` for `disable_tma` and `eviction_policy` to aid Python-side introspection/debugging.

```diff
   refl::ObjectDef<CopyNode>()
       .def_ro("src", &CopyNode::src)
       .def_ro("dst", &CopyNode::dst)
       .def_ro("src_range", &CopyNode::src_range)
       .def_ro("dst_range", &CopyNode::dst_range)
-      .def_ro("coalesced_width", &CopyNode::coalesced_width);
+      .def_ro("coalesced_width", &CopyNode::coalesced_width)
+      .def_ro("disable_tma", &CopyNode::disable_tma)
+      .def_ro("eviction_policy", &CopyNode::eviction_policy);
```

308-345: FFI macros for Conv2DIm2ColOpNode/Op look correct; consider reflecting nhw_step/c_step.

`nhw_step` and `c_step` are important for debugging/instrumentation but not reflected. Consider adding `def_ro` for both.

```diff
   refl::ObjectDef<Conv2DIm2ColOpNode>()
       .def_ro("src", &Conv2DIm2ColOpNode::src)
       .def_ro("dst", &Conv2DIm2ColOpNode::dst)
       .def_ro("stride", &Conv2DIm2ColOpNode::stride)
       .def_ro("padding", &Conv2DIm2ColOpNode::padding)
       .def_ro("dilation", &Conv2DIm2ColOpNode::dilation)
       .def_ro("kernel", &Conv2DIm2ColOpNode::kernel)
-      .def_ro("eviction_policy", &Conv2DIm2ColOpNode::eviction_policy);
+      .def_ro("eviction_policy", &Conv2DIm2ColOpNode::eviction_policy)
+      .def_ro("nhw_step", &Conv2DIm2ColOpNode::nhw_step)
+      .def_ro("c_step", &Conv2DIm2ColOpNode::c_step);
```

src/transform/simplify.cc (2)
26-27: Avoid broad `using namespace ffi` in .cc unless needed.

Localize to needed symbols or qualify with `tvm::ffi` to reduce ambiguity.

515-523: Default argument differs from Python wrapper.

C++ Simplify defaults `simplify_arguments = true`, while tilelang/transform/simplify.py defaults to False. Align to False to avoid surprises when invoked from C++.

```diff
-tvm::transform::Pass Simplify(bool simplify_arguments = true) {
+tvm::transform::Pass Simplify(bool simplify_arguments = false) {
```

Would you like me to scan for in-repo C++ call sites that rely on the current default?
src/op/reduce.h (1)
54-58: Constructor uses `tvm::ffi::make_object`; good. Consider `std::string_view`.

Optional: accept `std::string_view` to avoid copies, and normalize the input via a lower-casing pass for robustness.
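A minimal sketch of the suggested change (the helper name is illustrative, not from the file):

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <string_view>

// Normalize the reduce-type token once, taking std::string_view so call
// sites with string literals or String fields avoid an extra copy.
static std::string NormalizeReduceType(std::string_view type) {
  std::string s(type);
  std::transform(s.begin(), s.end(), s.begin(),
                 [](unsigned char c) { return std::tolower(c); });
  return s;
}
```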
src/op/reduce.cc (1)
317-321: AllReduce workspace conditional looks fine; small nit: hoist the literal 32 into a constexpr.

Optional micro-refactor to `constexpr int kWarpSize = 32` for clarity.
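For illustration, the nit amounts to (constant name as suggested above; the usage line is hypothetical):

```cpp
// Hedged sketch: name the warp width once instead of repeating 32.
constexpr int kWarpSize = 32;
// e.g. bool needs_workspace = reducing_threads > kWarpSize;
```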
src/layout/swizzle.cc (3)
103-106: Register reflection in a static init block.

`RegisterReflection()` won't run unless called. Add a static init block. Apply:

```diff
 void SwizzledLayoutNode::RegisterReflection() {
   namespace refl = tvm::ffi::reflection;
   refl::ObjectDef<SwizzledLayoutNode>();
 }
+
+TVM_FFI_STATIC_INIT_BLOCK() { SwizzledLayoutNode::RegisterReflection(); }
```
82-86: Use ICHECK instead of CHECK for consistency.

Elsewhere you use ICHECK; align this check as well.

```diff
-  CHECK(is_zero(forward_var[i]->dom->min));
+  ICHECK(is_zero(forward_var[i]->dom->min));
```
27-33: Make bit shifts 64-bit safe.

`1 << bits_` and `1 << base_` on int can overflow/UB for large values. Use `1LL`.

```diff
-  int base = (1 << base_);
-  int mask = ((1 << bits_) - 1) << shift_;
+  int base = static_cast<int>(1LL << base_);
+  int mask = static_cast<int>(((1LL << bits_) - 1) << shift_);
```
8-11: Make header self-sufficient for FFI types.

ReducerInfoNode/ReducerInfo use String and FFI macros, but this header doesn't include the necessary FFI headers. Include the aliases header or specific FFI headers to avoid relying on transitive includes.

```diff
 #include <tvm/tir/op.h>
 #include "../layout/layout.h"
+#include "../support/ffi_aliases.h"
+#include <tvm/ffi/object.h>
```

src/target/codegen_cuda.h (1)
67-69: Type qualification inconsistency confirmed; refactor to maintain consistency across translation-unit boundaries.

The header declares `PrintFunctionSignature` and `PrintCallExtern` with `ffi::String` and `ffi::Array<PrimExpr>`, while the `.cc` implementation uses bare `String` and `Array<PrimExpr>`. Although semantically equivalent due to `using namespace ffi;` at line 22 of codegen_cuda.cc, this mixed qualification style can obscure intent across TU boundaries and risks confusion in future maintenance.

Recommendation: align the `.cc` definitions to use explicit `ffi::String` and `ffi::Array` for both methods to match the header declarations, or conversely, update the header to use bare types if the unqualified form is preferred in this codebase.
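A minimal sketch of the aligned spelling, using a toy class (the method shape is illustrative, not the real CodeGen signature):

```cpp
#include <tvm/ffi/string.h>

namespace demo {
class CodeGen {
 public:
  // Header: fully qualified FFI type.
  tvm::ffi::String PrintFunctionSignature(const tvm::ffi::String &name);
};

// .cc: same fully qualified spelling on both sides of the TU boundary.
tvm::ffi::String CodeGen::PrintFunctionSignature(const tvm::ffi::String &name) {
  return name;
}
}  // namespace demo
```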
src/ir.cc (1)
246-249: Remove no-op ICHECK.
`ICHECK(grid_size.size() >= 0)` is always true; drop it.

```diff
-  ICHECK(grid_size.size() >= 0);
```
481-565: FFI GlobalDef and reflection blocks look valid. Two static init blocks for API defs and reflection are fine; no duplicate registrations detected.
Minor: In LayoutNode::Inverse() and FragmentNode::Inverse(), returning std::move(...) is unnecessary for local values; consider returning the value directly for clarity.
Also applies to: 567-572
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (107)
Numbers in parentheses are hunk counts: 3rdparty/tvm (1), CMakeLists.txt (1), cmake/load_tvm.cmake (1), src/ir.cc (11), src/layout/layout.cc (7), src/layout/layout.h (5), src/layout/swizzle.cc (3), src/layout/swizzle.h (2), src/layout/utils.cc (1), src/layout/utils.h (1), src/op/atomic_add.cc (3), src/op/atomic_add.h (2), src/op/copy.cc (5), src/op/copy.h (4), src/op/fill.cc (3), src/op/fill.h (2), src/op/finalize_reducer.cc (3), src/op/finalize_reducer.h (2), src/op/gemm.cc (5), src/op/gemm.h (4), src/op/gemm_py.cc (7), src/op/gemm_py.h (2), src/op/gemm_sp.cc (3), src/op/gemm_sp.h (3), src/op/logical.cc (2), src/op/math.cc (1), src/op/operator.cc (2), src/op/operator.h (1), src/op/parallel.cc (2), src/op/parallel.h (2), src/op/reduce.cc (5), src/op/reduce.h (6), src/op/region.cc (3), src/op/region.h (2), src/runtime/runtime.cc (4), src/support/ffi_aliases.h (1), src/target/codegen_cpp.cc (3), src/target/codegen_cpp.h (2), src/target/codegen_cuda.cc (4), src/target/codegen_cuda.h (1), src/target/codegen_hip.cc (1), src/target/codegen_webgpu.cc (0), src/target/codegen_webgpu.h (0), src/target/intrin_rule_cuda.cc (1), src/target/intrin_rule_hip.cc (2), src/target/rt_mod_cpp.cc (2), src/target/rt_mod_cuda.cc (3), src/target/rt_mod_hip.cc (1), src/target/utils.cc (6), src/transform/align_dynamic_shared_memory_allocations.cc (5), src/transform/annotate_device_regions.cc (2), src/transform/annotate_warp_group_reg_alloc.cc (1), src/transform/cluster_planning.cc (3), src/transform/common/loop_parallel_transform_utils.h (2), src/transform/common/loop_vectorization_utils.h (19), src/transform/config_index_bitwidth.cc (3), src/transform/eliminate_storage_sync_for_mbarrier.cc (3), src/transform/flatten_buffer.cc (3), src/transform/frontend_legalize.cc (1), src/transform/if_stmt_binding.cc (2), src/transform/inject_assumes.cc (1), src/transform/inject_fence_proxy.cc (1), src/transform/inject_pipeline.cc (5), src/transform/inject_ptx_async_copy.cc (1), src/transform/inject_tma_barrier.cc (6), src/transform/layout_inference.cc (5), src/transform/layout_reducer.cc (1), src/transform/layout_reducer.h (1), src/transform/legalize_safe_memory_access.cc (2), src/transform/legalize_vectorized_loop.cc (1), src/transform/loop_partition.cc (1), src/transform/loop_vectorize.cc (1), src/transform/loop_vectorize_dynamic.cc (2), src/transform/lower_device_kernel_launch.cc (4), src/transform/lower_device_storage_access_info.cc (1), src/transform/lower_hopper_intrin.cc (2), src/transform/lower_intrin.cc (7), src/transform/lower_l2_persistent_annotation.cc (1), src/transform/lower_opaque_block.cc (2), src/transform/lower_shared_barrier.cc (2), src/transform/lower_shared_tmem.cc (2), src/transform/lower_thread_allreduce.cc (2), src/transform/lower_tile_op.cc (4), src/transform/make_packed_api.cc (6), src/transform/merge_if_stmt.cc (1), src/transform/merge_shared_memory_allocations.cc (5), src/transform/multi_version_buffer_rewriter.cc (3), src/transform/persist_threadblock.cc (1), src/transform/pipeline_planning.cc (5), src/transform/simplify.cc (5), src/transform/split_host_device.cc (2), src/transform/storage_access.cc (5), src/transform/storage_access.h (1), src/transform/storage_rewrite.cc (5), src/transform/thread_storage_sync.cc (1), src/transform/vectorize_loop.cc (20), src/transform/warp_specialized_rewriter.cc (5), src/transform/wgmma_sync_rewriter.cc (1), tilelang/_ffi_api.py (1), tilelang/contrib/dlpack.py (2), tilelang/contrib/hipcc.py (2), tilelang/contrib/nvcc.py (6), tilelang/contrib/rocm.py (4), tilelang/engine/callback.py (3), tilelang/engine/lower.py (5), tilelang/ir.py (2), tilelang/layout/fragment.py (1)
⛔ Files not processed due to max files limit (4)
- tilelang/layout/layout.py
- tilelang/tileop/gemm/__init__.py
- tilelang/transform/_ffi_api.py
- tilelang/utils/tensor.py
💤 Files with no reviewable changes (2)
- src/target/codegen_webgpu.h
- src/target/codegen_webgpu.cc
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
PR: tile-ai/tilelang#794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.
Applied to files:
- src/transform/lower_shared_barrier.cc
- src/transform/inject_tma_barrier.cc
- src/transform/align_dynamic_shared_memory_allocations.cc
- src/transform/lower_shared_tmem.cc
🧬 Code graph analysis (75)

(Collapsed CodeRabbit context: per-file cross-reference listings pairing each of the 75 analyzed files with the definitions it touches, e.g. the pass wrappers in tilelang/transform/__init__.py and the RegisterReflection implementations across src/op, src/layout, and src/transform.)
🪛 Ruff (0.14.1)
tilelang/contrib/nvcc.py
186-186: Unused function argument: target
(ARG001)
| #include "../support/ffi_aliases.h" | ||
| #include "../target/utils.h" | ||
| #include "tvm/ffi/string.h" | ||
|
|
Ensure ffi::Function is declared
This file uses ffi::Function::GetGlobal but doesn’t include <tvm/ffi/function.h>. Either include it here or rely on the aliases header after adding the missing include there (preferred).
```diff
+#include <tvm/ffi/function.h>
 #include "../support/ffi_aliases.h"
```

🤖 Prompt for AI Agents
In src/op/gemm_py.cc around lines 14 to 16, the file calls
ffi::Function::GetGlobal but does not include the declaration for ffi::Function;
add #include <tvm/ffi/function.h> so the symbol is declared. Preferably, add
that include to src/support/ffi_aliases.h (which this file already includes) so
all files using ffi::Function get the declaration transitively; otherwise add
the include directly in src/op/gemm_py.cc.
In src/target/rt_mod_cuda.cc:

```diff
 }

-runtime::Module BuildTileLangCUDA(IRModule mod, Target target) {
+ffi::Module BuildTileLangCUDA(IRModule mod, Target target) {
```
🧩 Analysis chain
Return type likely mismatched with factory return (ffi::Module vs runtime::Module).
runtime::CUDAModuleCreate(...) returns runtime::Module, but these functions return ffi::Module. Unless tvm::ffi::Module is an alias of tvm::runtime::Module, this won’t compile.
Apply one of:
- Preferred: keep FFI-neutral and return `runtime::Module`.
- If you do intend to use an alias, ensure `tvm::ffi::Module` is defined and visible here (otherwise switch back to `runtime::Module`).
```diff
-ffi::Module BuildTileLangCUDA(IRModule mod, Target target) {
+runtime::Module BuildTileLangCUDA(IRModule mod, Target target) {
 ...
-ffi::Module BuildTileLangCUDAWithoutCompile(IRModule mod, Target target) {
+runtime::Module BuildTileLangCUDAWithoutCompile(IRModule mod, Target target) {
```

Run the script below to confirm whether tvm::ffi::Module exists and is an alias to tvm::runtime::Module:
Also applies to: 74-74
🏁 Script executed:
#!/usr/bin/env bash
set -euo pipefail
# Find any definition/alias of ffi::Module
rg -nP -C2 '\bnamespace\s+ffi\b|using\s+Module\s*=' --glob '!**/third_party/**'
# Check all uses of CUDAModuleCreate and its expected return type
rg -nP -C2 'CUDAModuleCreate\s*\(' --glob '!**/third_party/**'

Length of output: 3253
I need to inspect the actual file and search more thoroughly for any ffi::Module definition or typedef.
🏁 Script executed:
#!/bin/bash
# View the actual function signatures at lines 41 and 74
head -100 src/target/rt_mod_cuda.cc | tail -60
# Search for any Module typedef/using in ffi namespace more broadly
rg -nP 'class\s+Module|struct\s+Module|using\s+Module|typedef.*Module' --glob '!**/third_party/**' -A2 -B2 | head -50
# Check includes in rt_mod_cuda.cc for ffi-related headers
head -50 src/target/rt_mod_cuda.cc

Length of output: 3742
🏁 Script executed:
#!/bin/bash
# Search for ffi::Module in header files (likely in include/tvm/ffi or similar)
rg -nP 'class\s+Module|struct\s+Module' src/target/rt_mod_cuda.cc --context=30
# Find ffi header files
fd -e 'h$|hpp$' 'ffi' --glob '!**/third_party/**' | head -20
# Search for Module in ffi-related includes
rg -nP '\bModule\b' include/tvm/ffi --type=h -A1 -B1 | grep -i module | head -40
# Check if ffi namespace re-exports runtime::Module
rg -nP 'using.*Module|Module.*=' include/tvm/ffi --type=h

Length of output: 217
🏁 Script executed:
#!/bin/bash
# Find all header files with ffi in them
find . -type f \( -name '*ffi*.h' -o -name '*ffi*.hpp' \) -not -path '*/third_party/*' 2>/dev/null | head -30
# Search for ffi namespace definition and Module references
rg -nP 'namespace\s+ffi\s*\{|class\s+Module\s*\{|struct\s+Module\s*\{' --type=h --type=cc -A3 | head -60
# Check the actual include path in rt_mod_cuda.cc and trace it
rg -nP '#include.*ffi' src/target/rt_mod_cuda.cc
# Search for runtime::Module definition to understand the type
rg -nP 'class\s+Module|struct\s+Module' --type=h --type=cc -B2 -A5 2>/dev/null | grep -A7 'runtime' | head -40
# Check if ffi::Module is used elsewhere in the codebase
rg -nP 'ffi::Module' --type=h --type=cc | head -20

Length of output: 124
🏁 Script executed:
#!/bin/bash
# View the ffi_aliases.h file which likely contains Module alias
cat src/support/ffi_aliases.h
# Search for Module in runtime namespace
grep -rn "class Module\|struct Module" src/ include/ 2>/dev/null | grep -i runtime
# Check if ffi::Module is used elsewhere
grep -rn "ffi::Module" src/ include/ 2>/dev/null | head -30
# Find runtime module creation function
grep -rn "CUDAModuleCreate" src/ 2>/dev/null

Length of output: 816
🏁 Script executed:
#!/bin/bash
# Search for ALL Module definitions/usages
grep -rn "^.*Module" src/support/ src/target/*.h src/target/*.cc 2>/dev/null | grep -E "(class|struct|using|typedef).*Module" | head -40
# Check rt_mod_cpp.cc for context
cat src/target/rt_mod_cpp.cc | head -50
# Search in include directories for Module
find . -path './third_party' -prune -o -type f -name '*.h' -print 2>/dev/null | xargs grep -l "class Module\|struct Module" 2>/dev/null | head -20
# Look for ffi::Module specifically
grep -rn "ffi::Module" . --include="*.h" --include="*.cpp" 2>/dev/null | grep -v third_party | head -30
# Check if Module is in runtime namespace and its definition
grep -rn "namespace runtime" src/ include/ 2>/dev/null | head -5
grep -A10 "class Module" src/runtime/ include/tvm/runtime/ 2>/dev/null | head -50

Length of output: 1859
🏁 Script executed:
#!/bin/bash
# Get ALL includes from rt_mod_cuda.cc
head -20 src/target/rt_mod_cuda.cc
# Check if codegen_cuda.h includes tvm/ffi/extra/module.h
cat src/target/codegen_cuda.h 2>/dev/null | head -30
# Check if runtime/cuda/cuda_module.h includes it
find . -name "cuda_module.h" -type f | head -5
# Search for where ffi::Module is actually defined/declared
grep -rn "ffi::Module\|namespace.*ffi" . --include="*.h" 2>/dev/null | grep -v third_party | grep -E "(class|struct|using|typedef|namespace)" | head -50

Length of output: 2406
🏁 Script executed:
#!/bin/bash
# Check what codegen_cuda.h includes - it was partially shown, get full file
cat src/target/codegen_cuda.h
# Check runtime/cuda/cuda_module.h if it exists
find . -name "cuda_module.h" -type f ! -path "*/third_party/*" -exec cat {} \; 2>/dev/null | head -50
# Search for where ffi::Module is actually defined
grep -rn "class Module\|struct Module" . --include="*.h" 2>/dev/null | grep -v third_party | head -30
# Check if any header brings in ffi/extra/module.h transitively
grep -rn "ffi/extra/module.h\|ffi::Module" . --include="*.h" 2>/dev/null | grep -v third_party

Length of output: 5913
Add missing #include <tvm/ffi/extra/module.h> and align return type with actual value returned.
The functions on lines 41 and 74 declare ffi::Module as the return type but are missing the header that defines this type. Additionally, runtime::CUDAModuleCreate() returns runtime::Module, not ffi::Module. The parallel file src/target/rt_mod_cpp.cc correctly includes <tvm/ffi/extra/module.h> and declares the same return type, but rt_mod_cuda.cc lacks this include and has the type mismatch.
Apply the suggested fix:
```diff
+#include <tvm/ffi/extra/module.h>
 #include "codegen_cuda.h"
 #include "runtime/cuda/cuda_module.h"
 #include "runtime/pack_args.h"
 #include <tvm/ffi/reflection/registry.h>
```

And change the return types:
```diff
-ffi::Module BuildTileLangCUDA(IRModule mod, Target target) {
+runtime::Module BuildTileLangCUDA(IRModule mod, Target target) {
```

```diff
-ffi::Module BuildTileLangCUDAWithoutCompile(IRModule mod, Target target) {
+runtime::Module BuildTileLangCUDAWithoutCompile(IRModule mod, Target target) {
```

This matches the actual return value from runtime::CUDAModuleCreate().
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/target/rt_mod_cuda.cc around line 41 (and also the second function
around line 74), add the missing include #include <tvm/ffi/extra/module.h> at
the top of the file and change the return type of the functions that currently
declare ffi::Module to runtime::Module so the signature matches the actual value
returned by runtime::CUDAModuleCreate(); keep function bodies unchanged.
In src/transform/lower_tile_op.cc:

```diff
 if (const auto *var_node = expr.as<VarNode>()) {
-  Var var = GetRef<Var>(var_node);
+  Var var = tvm::ffi::GetRef<Var>(var_node);
   auto it = let_bindings_.find(var);
```
FFI GetRef migration — LGTM; fix IterVar bound lookup.
- Var/Stmt GetRef updates and static-init macro usage are correct.
- Issue: In thread bounds computation, Analyzer is called with IterVar instead of its Var.
Apply this diff:
```diff
-  if (analyzer_->const_int_bound.IsBound(thread_var_->var)) {
-    auto const_int_bound = analyzer_->const_int_bound(thread_var_);
+  if (analyzer_->const_int_bound.IsBound(thread_var_->var)) {
+    auto const_int_bound = analyzer_->const_int_bound(thread_var_->var);
```

Also applies to: 615-615, 655-657, 734-737
🤖 Prompt for AI Agents
In src/transform/lower_tile_op.cc around lines 439 (and also at 615, 655-657,
734-737), the code passes an IterVar object into the Analyzer call when
computing thread bounds; instead, extract and pass the underlying Var from the
IterVar (e.g., use iter_var->var or iter_var->var.get() per existing code
patterns) so Analyzer receives the Var key it expects. Update each location to
obtain the IterVar's Var before calling the Analyzer-bound lookup and use that
Var in place of the IterVar argument.
In tilelang/engine/lower.py:

```diff
   host_mod, device_mod, params, codegen_mod.inspect_source(), rt_mod=host_mod)
...
-  return CompiledArtifact(host_mod, device_mod, params, codegen_mod.get_source())
+  return CompiledArtifact(host_mod, device_mod, params, codegen_mod.inspect_source())
```
🧩 Analysis chain
get_source → inspect_source: confirm presence on all backends.
inspect_source() is the right API on newer TVM, but ensure all paths (CUDA/HIP/LLVM/C/Metal/WebGPU) return a Module that implements inspect_source(), otherwise Artifact.kernel_source may be None.
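A hedged way to stay robust while backends converge (the helper name is illustrative; it simply prefers the newer accessor):

```python
def module_source(mod):
    # Prefer inspect_source() where the runtime Module exposes it,
    # fall back to get_source() on backends that still only have that.
    getter = getattr(mod, "inspect_source", None) or getattr(mod, "get_source", None)
    return getter() if getter is not None else None
```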
🏁 Script executed:
#!/bin/bash
# Grep for inspect_source uses and ensure we don’t still call get_source elsewhere.
rg -n 'get_source\(|inspect_source\(' -g 'tilelang/**'

Length of output: 323
Let me gather more context to verify the scope and completeness of the migration by examining the codebase directly.
I need to inspect the actual codebase to understand what inspect_source() is and whether it's a valid method.
🏁 Script executed:
#!/bin/bash
# Check if inspect_source is a custom method in tilelang
rg -n 'def inspect_source|\.inspect_source' tilelang/ --type py -B 2 -A 2
# Also check for codegen_mod type and imports in lower.py
echo "=== lower.py context around 243-245 ==="
sed -n '230,250p' tilelang/engine/lower.py
# Check base.py context
echo "=== base.py context around line 52 ==="
sed -n '40,65p' tilelang/jit/adapter/base.py

Length of output: 1581
Let me search for the device_codegen function and inspect_source imports to verify if this is a valid method.
🏁 Script executed:
#!/bin/bash
# Find device_codegen definition
echo "=== Searching for device_codegen definition ==="
rg -n 'def device_codegen\b|^device_codegen\s*=' tilelang/ -A 3
# Check for inspect_source imports
echo "=== Searching for inspect_source in imports ==="
rg -n 'inspect_source' tilelang/ --type py -B 1 -A 1
# Check what codegen_mod type should be - look at return type annotations
echo "=== Checking device_codegen and device_codegen_without_compile signatures ==="
rg -n 'device_codegen|device_codegen_without_compile' tilelang/engine/lower.py -B 2 -A 2
# Check imports at top of lower.py
echo "=== Imports in lower.py ==="
head -40 tilelang/engine/lower.py

Length of output: 2878
🏁 Script executed:
#!/bin/bash
# Find device_codegen definition
echo "=== Searching for device_codegen definition ==="
rg -n 'def device_codegen\b' tilelang/ --type py -A 5
# Check for inspect_source imports or usage
echo "=== Searching for inspect_source anywhere in codebase ==="
rg -n 'inspect_source' tilelang/ --type py
# Check what codegen_mod type annotation is
echo "=== Checking lower.py imports and device_codegen calls ==="
head -50 tilelang/engine/lower.py
# Find where device_codegen comes from
echo "=== Search for device_codegen in entire tilelang ==="
rg -n 'device_codegen' tilelang/ --type py -g '!lower.py' -B 1

Length of output: 2396
Let me verify if tvm.IRModule has an inspect_source() method by checking TVM module API.
Let me verify one more time if inspect_source() is defined anywhere in the tilelang custom code or if this is purely broken code.
🏁 Script executed:
#!/bin/bash
# Comprehensive search for inspect_source definition
echo "=== Searching for ANY definition of inspect_source ==="
find tilelang -type f -name "*.py" -exec grep -l "inspect_source" {} \;
# Look for monkey-patching or property decoration
echo "=== Searching for property or method definitions that might alias inspect_source ==="
rg -n "@property|def " tilelang/ --type py -B 1 -A 1 | grep -C 2 "source"
# Check if it's being imported from external library
echo "=== Checking tvm module or tilelang.tvm for inspect_source ==="
python3 -c "from tilelang import tvm; print(dir(tvm))" 2>&1 | grep -i inspect

Length of output: 6901
Replace inspect_source() with get_source() on lines 243 and 245.
The method inspect_source() does not exist on tvm.IRModule. Calling it will fail at runtime with AttributeError. The standard TVM API for retrieving source code is get_source(), as evidenced by its use elsewhere in the codebase (e.g., tilelang/jit/adapter/base.py:52).
```python
# Lines 243, 245 should use:
codegen_mod.get_source()
# instead of:
codegen_mod.inspect_source()
```

🤖 Prompt for AI Agents
In tilelang/engine/lower.py around lines 243 to 245, the code calls
codegen_mod.inspect_source() which does not exist on tvm.IRModule and will raise
AttributeError; replace both calls to use codegen_mod.get_source() instead so
the module source is retrieved via the correct TVM API.
Thanks @LeiWang1999
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/target/codegen_cuda.cc (1)
23-23: Limit namespace pollution from ffi.
`using namespace ffi;` in a large TU can shadow or collide with nearby symbols. Prefer explicit `tvm::ffi::...` or local `using` declarations for specific types/functions. Apply:
-using namespace ffi;
+// Prefer explicit tvm::ffi:: qualifiers at use sites to avoid namespace pollution.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- 3rdparty/tvm (1 hunks)
- src/op/copy.cc (7 hunks)
- src/target/codegen_cuda.cc (4 hunks)
- src/transform/loop_partition.cc (1 hunks)
- testing/python/jit/test_tilelang_jit_gemm_ctypes.py (2 hunks)
- tilelang/contrib/hipcc.py (2 hunks)
- tilelang/contrib/nvcc.py (6 hunks)
- tilelang/engine/callback.py (3 hunks)
- tilelang/engine/lower.py (5 hunks)
- tilelang/layout/fragment.py (1 hunks)
- tilelang/layout/layout.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- src/transform/loop_partition.cc
- tilelang/engine/callback.py
- tilelang/contrib/hipcc.py
- 3rdparty/tvm
- tilelang/layout/fragment.py
- src/op/copy.cc
- tilelang/engine/lower.py
🧰 Additional context used
🧬 Code graph analysis (3)
testing/python/jit/test_tilelang_jit_gemm_ctypes.py (2)
testing/python/jit/test_tilelang_jit_gemm_cython.py (1)
- test_gemm_f16f16f16_nn (100-114)
testing/python/jit/test_tilelang_jit_callback.py (1)
- test_gemm_f16f16f16_nn (101-115)
src/target/codegen_cuda.cc (1)
src/transform/warp_specialized_rewriter.cc (16)
- op (38-43), op (38-38), op (73-83), op (73-73), op (85-93), op (85-85), op (95-100), op (95-95), op (102-110), op (102-102), op (134-146), op (134-134), op (148-177), op (148-148), op (179-189), op (179-179)
tilelang/contrib/nvcc.py (2)
tilelang/engine/lower.py (1)
- tilelang_callback_cuda_compile (57-90)
tilelang/contrib/nvrtc.py (1)
- compile_cuda (14-111)
🪛 Ruff (0.14.1)
tilelang/contrib/nvcc.py
186-186: Unused function argument: target
(ARG001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Quick Lint
- GitHub Check: Build wheels for Python 3.8 on macos-latest with Metal
- GitHub Check: Build wheels for Python 3.8 on ubuntu-24.04-arm with CUDA-12.8
- GitHub Check: Build wheels for Python 3.8 on ubuntu-latest with CUDA-12.1
🔇 Additional comments (9)
testing/python/jit/test_tilelang_jit_gemm_ctypes.py (2)
88-88: LGTM! FFI migration applied correctly.
The migration from `@tvm.register_func` to `@tvm.register_global_func` with explicit function naming aligns with the TVM v0.22.0 FFI update. The `override=True` parameter appropriately allows the test to replace any existing registration.
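For illustration, a minimal sketch of the before/after pattern described here; the callback name and body are placeholders, not the test's actual code:

import tvm

# Legacy API: the Python function name doubled as the registration key.
# @tvm.register_func
# def tilelang_callback_cuda_compile(code, target): ...

# TVM v0.22.0 FFI: explicit key, and override=True replaces any
# existing registration under the same name.
@tvm.register_global_func("tilelang_callback_cuda_compile", override=True)
def tilelang_callback_cuda_compile(code, target):
    # Pass the generated CUDA source through unchanged.
    return code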
410-411: Ignore this review comment—this change follows an established codebase pattern.
The change reflects a deliberate design choice, not a temporary debug modification. Five test files across different directories (kernel, transform, jit) use the identical pattern of commenting out `tilelang.testing.main()` and calling specific test functions directly:
- testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
- testing/python/transform/test_tilelang_transform_inject_set_max_nreg.py
- testing/python/transform/test_tilelang_transform_layout_inference.py
- testing/python/kernel/test_tilelang_kernel_bf16_gemm_mma.py
- testing/python/jit/test_tilelang_jit_gemm_ctypes.py (this file)
This consistent pattern across multiple files indicates intentional design, not ad-hoc debugging. The review's assumption that this is temporary and should be reverted contradicts established codebase practice.
Likely an incorrect or invalid review comment.
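For reference, a minimal sketch of the shared pattern those files use (test name taken from the list above):

if __name__ == "__main__":
    # tilelang.testing.main()
    test_gemm_f16f16f16_nn()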
tilelang/layout/layout.py (1)
5-5: LGTM: FFI migration verified and complete.
The migration from `tvm.ffi` to `tvm_ffi` has been comprehensively applied across the entire codebase. The `@tvm_ffi.register_object()` decorator is confirmed functional and actively used for 15+ class registrations (Fill, AtomicAdd, Copy, Gemm, Layout, Fragment, GemmPy, and others) as well as global function registrations. No orphaned old-style patterns remain, and all imports are consistent.
src/target/codegen_cuda.cc (4)
2025-2026: Correct migration to tvm::ffi::GetRef for tl_gemm.
Switching to `tvm::ffi::GetRef<PrimExpr>(op)` is consistent with the new FFI. Looks good.
Confirm the TU includes the header that declares `tvm::ffi::GetRef` (e.g., `<tvm/ffi/object.h>` or the umbrella that provides it). If not, add it to avoid ADL/dep-inclusion pitfalls.
2034-2035: Same as above for tl_gemm_sp.
The FFI-based `GetRef` usage is consistent and correct. Same header check as the previous comment to ensure `GetRef` is declared in this TU.
2308-2310: FFI GetRef in Ramp CHECK message is fine.
Diagnostic now prints an FFI-backed `Ramp` ref; no functional change. OK.
2811-2811: Optional truthiness check is acceptable.
`ICHECK(global_symbol)` is equivalent to `defined()` for `Optional<String>` on modern TVM. Safe change. If you target older compilers/TVM variants, keep `defined()` for clarity; otherwise this is fine.
tilelang/contrib/nvcc.py (2)
11-12: LGTM! Import changes align with FFI migration.
The new imports correctly support the migration to `tvm_ffi.register_global_func` and maintain access to TVM utilities through the local `tvm` alias.
185-185: LGTM! Consistent FFI decorator migration.
All function registrations have been properly migrated from the legacy `@tvm.ffi.register_func` to `@tvm_ffi.register_global_func` with `override=True`, maintaining backward compatibility through unchanged registered names.
Also applies to: 192-192, 257-257, 403-403, 416-416, 433-433
def tilelang_callback_cuda_compile(code, target):  # pylint: disable=unused-argument
    """use nvcc to generate fatbin code for better optimization"""
    ptx = compile_cuda(code, target_format="fatbin")
    return ptx
🧩 Analysis chain
Verify whether the target parameter should be used.
The target parameter is accepted but unused in the callback. The function relies on Target.current() within compile_cuda to determine architecture. However, the caller in tilelang/engine/lower.py explicitly extracts architecture from the target parameter (nvcc.get_target_compute_version(target)), suggesting explicit usage might be preferred over ambient context.
Consider either:
- Using the target parameter explicitly if it provides more reliable architecture detection, or
- Documenting why relying on `Target.current()` is the intended design
Based on learnings (static analysis hints).
🏁 Script executed:
#!/bin/bash
# Search for other usages of tilelang_callback_cuda_compile to understand expected behavior
rg -n "tilelang_callback_cuda_compile" -A5 -B5Length of output: 2342
Based on my verification, the unused target parameter in the tilelang_callback_cuda_compile function in tilelang/contrib/nvcc.py is indeed a concern.
The underlying compile_cuda function accepts an arch parameter and defaults to Target.current() if none is provided. The current implementation relies on this ambient context. However, compiling with an explicit architecture flag enables faster runtime, since code generation happens at compile time, whereas omitting it forces JIT compilation by the CUDA driver. The docstring's promise of "better optimization" therefore suggests the target should be used explicitly.
Use the target parameter to pass architecture to compile_cuda.
The target parameter is passed by the C++ caller but unused. To achieve the promised "better optimization," extract the compute version from the target and pass it as the arch parameter to compile_cuda, or use the target to establish the proper compilation context. This ensures architecture-specific optimization occurs at compile time rather than relying on ambient Target.current() context.
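A minimal sketch of the explicit-target variant, assuming compile_cuda accepts the arch keyword described above and that get_target_compute_version returns a dotted string such as "8.9":

from tilelang.contrib.nvcc import compile_cuda, get_target_compute_version

def tilelang_callback_cuda_compile(code, target):
    """use nvcc to generate fatbin code for better optimization"""
    # Derive e.g. "sm_89" from "8.9" and pass it explicitly instead of
    # relying on the ambient Target.current() context inside compile_cuda.
    compute_version = get_target_compute_version(target)
    arch = "sm_" + compute_version.replace(".", "")
    return compile_cuda(code, target_format="fatbin", arch=arch)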
🧰 Tools
🪛 Ruff (0.14.1)
186-186: Unused function argument: target
(ARG001)
Actionable comments posted: 1
🧹 Nitpick comments (1)
requirements-test.txt (1)
15-15: Remove redundant apache-tvm-ffi from test requirements.
Since apache-tvm-ffi is now a runtime requirement (in requirements.txt), it is automatically available during test execution. Listing it in test requirements is redundant and increases maintenance burden. Test environments typically install both runtime and test dependencies.
Apply this diff to remove the duplicate:
-apache-tvm-ffi
 cffi
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- requirements-test.txt (2 hunks)
- requirements.txt (1 hunks)
🔇 Additional comments (1)
requirements-test.txt (1)
15-15: Review comment is based on incorrect assumptions about apache-tvm-ffi versioning and TVM dependency model.
The codebase manages TVM as a C++ build dependency via CMake submodule (3rdparty/tvm), not as a pip package. The PyPI package `apache-tvm-ffi` is only a Python FFI wrapper and contains no explicit TVM version constraint in its dependencies. Additionally, only version 0.1.0 is available on PyPI, so there is no "known-compatible version" to pin to as the review suggests.
While the inconsistency between unpinned `apache-tvm-ffi` and pinned `flash-linear-attention==0.3.2` is technically observable, it does not represent a practical risk because apache-tvm-ffi's version does not control which TVM library is used; the C++ build system does. Pinning a single version on PyPI that has no version alternatives would provide no additional stability benefit.
Likely an incorrect or invalid review comment.
|
CI tests passed, but the bdist step failed. Would you mind taking a look, @oraluben or @XuehaiPan? |
|
@codex review |
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
I think it's most likely due to py38 compatibility of tvm 0.22.0. I'll take a look then. |
Actionable comments posted: 1
🧹 Nitpick comments (2)
.github/workflows/ci.yml (1)
45-46: Update step ID to match step name for clarity.
The step ID is `setup-py38` but the step name is "Setup Python 3.9". While this doesn't break functionality (outputs are keyed by ID), updating the step ID to `setup-py39` would improve consistency and avoid confusion for future maintainers.
- - name: Setup Python 3.9
-   id: setup-py38
+ - name: Setup Python 3.9
+   id: setup-py39
Then update line 54 to reference the new ID:
- "${{ steps.setup-py38.outputs.python-path }}" -m compileall -q -f tilelang
+ "${{ steps.setup-py39.outputs.python-path }}" -m compileall -q -f tilelang
pyproject.toml (1)
86-86: Update ruff target-version to match project Python baseline.
The `target-version` is set to `py38`, but the project now requires Python ≥3.9. Updating this to `py39` ensures ruff applies rules and fixes aligned with the current minimum supported version. This improves consistency between tool configuration and project requirements.
-target-version = "py38"
+target-version = "py39"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- .github/workflows/ci.yml (1 hunks)
- .github/workflows/dist.yml (1 hunks)
- docs/get_started/Installation.md (2 hunks)
- examples/gemm/README.md (1 hunks)
- format.sh (1 hunks)
- pyproject.toml (2 hunks)
✅ Files skipped from review due to trivial changes (2)
- examples/gemm/README.md
- docs/get_started/Installation.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
- GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
- GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.1
🔇 Additional comments (3)
format.sh (1)
86-87: PIP_USER environment variable improves virtualenv isolation.
The addition of `export PIP_USER=0` prevents pip from defaulting to `--user` installs, ensuring packages are installed in the virtual environment rather than the user site-packages. This is a sensible safeguard for CI and scripted contexts.
.github/workflows/ci.yml (1)
49-49: Python version bump is consistent with project requirements.
Updating to Python 3.9 aligns with the `requires-python = ">=3.9"` in pyproject.toml and reflects the TVM 0.22.0 compatibility baseline.
.github/workflows/dist.yml (1)
44-45: Python 3.9 wheel build aligns with TVM 0.22.0 Limited API.
The build matrix now uses Python 3.9, and the comment correctly notes that TVM is built with the Python 3.9 Limited API (enabling wheel compatibility with all Python versions ≥3.9). This supports the project's shift to Python 3.9+ and the TVM rebase.
Actionable comments posted: 0
🧹 Nitpick comments (1)
.github/workflows/ci.yml (1)
45-54: Update step ID to reflect Python 3.9.
The step ID is still `setup-py38` but now sets up Python 3.9. For clarity and maintainability, rename the step ID and update the reference on line 54.
   - name: Setup Python 3.9
-    id: setup-py38
+    id: setup-py39
     uses: actions/setup-python@v6
     with:
       python-version: "3.9" # use lowest supported version for linting
       update-environment: false
   - name: Check AST with Python 3.9
     run: |
-      "${{ steps.setup-py38.outputs.python-path }}" -m compileall -q -f tilelang
+      "${{ steps.setup-py39.outputs.python-path }}" -m compileall -q -f tilelang
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- .github/workflows/ci.yml (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Quick Lint
- GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.1
- GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
- GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
🔇 Additional comments (1)
.github/workflows/ci.yml (1)
170-170: HTTP timeout configuration added for CUDA setup.
Adding `UV_HTTP_TIMEOUT=600` to the CUDA environment is reasonable for addressing potential timeout issues during package downloads in that environment.
@oraluben We have already packed the whole `3rdparty` directory into the wheel. Here are the contents of a wheel built from a fresh git clone:
$ unzip tilelang-*.whl
$ du -d 1 -h tilelang | sort -rh
662M tilelang
592M tilelang/3rdparty
65M tilelang/lib
3.3M tilelang/src
408K tilelang/language
292K tilelang/carver
212K tilelang/jit
128K tilelang/intrinsics
88K tilelang/quantize
64K tilelang/contrib
60K tilelang/autotuner
52K tilelang/utils
36K tilelang/transform
36K tilelang/layout
36K tilelang/engine
32K tilelang/tileop
32K tilelang/primitives
24K tilelang/tools
24K tilelang/cache
20K tilelang/profiler
8.0K tilelang/common
4.0K tilelang/testing
4.0K tilelang/math
$ du -d 1 -h tilelang/3rdparty | sort -rh
592M tilelang/3rdparty
406M tilelang/3rdparty/tvm
149M tilelang/3rdparty/cutlass
38M tilelang/3rdparty/composable_kernel
|
@XuehaiPan It's unexpected when we have
@oraluben Then you need to set an unreliable RPATH to `$ORIGIN:$ORIGIN/../../tvm_ffi/lib`. I can change to this approach, but it might not work with
I don't think that's necessary, when we have |
Then you rely on another unguaranteed thing: the execution order of Python code. |
That's true. But when we're dealing with ffi I think it's ok to add one exception and maybe to explicitly
Update: isn't it very dangerous trying to load two
This makes sense to me. |
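A sketch of the explicit-import idea under discussion, assuming the pip-installed apache-tvm-ffi package should win: importing tvm_ffi before tilelang pins the process to that copy of libtvm_ffi before anything else can load a second one.

import tvm_ffi   # loads the pip-installed libtvm_ffi first
import tilelang  # later loads resolve against the already-loaded copy

print(tvm_ffi.__file__)  # sanity check: which installation was picked up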
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- .github/workflows/dist.yml (2 hunks)
- CMakeLists.txt (3 hunks)
- pyproject.toml (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- .github/workflows/dist.yml
- CMakeLists.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
- GitHub Check: Build wheels for Python 3.8 on ubuntu-24.04-arm with CUDA-12.8
- GitHub Check: Build wheels for Python 3.8 on macos-latest with Metal
- GitHub Check: Build SDist
- GitHub Check: Build wheels for Python 3.8 on ubuntu-latest with CUDA-12.1
🔇 Additional comments (5)
pyproject.toml (5)
11-28: Classifiers look good; align with FFI modernization.
Added Development Status Beta, C++ and CPython classifiers, plus Python 3.13 and 3.14 support. These changes appropriately reflect the C++ FFI integration and broader Python version support.
30-40: apache-tvm-ffi dependency is critical; audit torch specification.
The new `apache-tvm-ffi~=0.1.0` dependency directly supports the PR's FFI rebase. However, note that lines 36–37 list torch twice:
"torch",
"torch>=2.7; platform_system == 'Darwin'",
This is redundant and may confuse dependency resolvers. Typically, you'd either specify a single unconditional version (`"torch>=2.7"`), or platform-specific variants without duplication. Verify whether torch should require ≥2.7 universally, or only on macOS for a specific reason.
48-48: Build system requires list is well-formatted.
Single-line `requires` is cleaner and more maintainable than the previous multi-entry format.
52-52: AI summary inconsistency: py-api and target-version.
The AI summary claims `wheel.py-api` changed from "cp38" to "cp39" and `target-version` changed from "py38" to "py39", but neither line is marked with `~` and both still show cp38/py38. Verify whether these were intentionally kept at cp38/py38 or whether the summary is outdated.
Also applies to: 111-111
182-194: Linux wheel repair command: libtvm_ffi exclusion is correct.
The updated auditwheel command (line 193) includes `-v` for verbosity and correctly excludes `libtvm_ffi.so` along with CUDA libraries. This aligns with the PR's objective to avoid bundling conflicting libtvm_ffi instances alongside the apache-tvm-ffi pip package.
This reverts commit cd9ab57.
|
The built wheel is not usable on Python 3.8 due to the following error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/tmp.OwA4ZfWgj8/venv/lib/python3.8/site-packages/tilelang/__init__.py", line 77, in <module>
import tvm
File "/tmp/tmp.OwA4ZfWgj8/venv/lib/python3.8/site-packages/tilelang/3rdparty/tvm/python/tvm/__init__.py", line 24, in <module>
from tvm_ffi import register_object, register_global_func, get_global_func
File "/tmp/tmp.OwA4ZfWgj8/venv/lib/python3.8/site-packages/tvm_ffi/__init__.py", line 39, in <module>
from .registry import (
File "/tmp/tmp.OwA4ZfWgj8/venv/lib/python3.8/site-packages/tvm_ffi/registry.py", line 25, in <module>
from . import core
File "python/tvm_ffi/cython/dtype.pxi", line 192, in init core
File "/tmp/tmp.OwA4ZfWgj8/venv/lib/python3.8/site-packages/ml_dtypes/__init__.py", line 71, in __getattr__
raise AttributeError(f'cannot import name {name!r} from {__name__!r}')
AttributeError: cannot import name 'int2' from 'ml_dtypes'
|
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- .github/workflows/ci.yml (2 hunks)
- .github/workflows/dist.yml (1 hunks)
- .github/workflows/pr-perfbench-bot.yml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- .github/workflows/ci.yml
- .github/workflows/dist.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
- GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
- GitHub Check: Build SDist
- GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
- GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.1
- GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
🔇 Additional comments (1)
.github/workflows/pr-perfbench-bot.yml (1)
49-49: Verify Python 3.12 compatibility with TVM v0.22.0.
Upgrading from 3.9 to 3.12 addresses the Python 3.8 incompatibility noted in PR comments. However, verify that this single-version workflow is intentional and that TVM v0.22.0 is fully tested on Python 3.12. If the project maintains support for multiple Python versions, consider whether other workflows (e.g., build, test) validate compatibility across a version range.
Checked a little bit: this happens because, when installing tilelang on py3.8, pip installs the latest ml_dtypes (0.4.x), which does not have those types.
|
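To make the failure mode concrete, a small guard sketch; per the traceback above, int2 is the attribute tvm_ffi's dtype shim looks up, and it only exists in ml_dtypes >= 0.5 (which requires Python >= 3.9):

import ml_dtypes

if not hasattr(ml_dtypes, "int2"):
    raise ImportError("ml_dtypes>=0.5 (and thus Python>=3.9) is required for int2")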
Actionable comments posted: 0
♻️ Duplicate comments (1)
pyproject.toml (1)
227-231: Add explicit dylib exclusions to macOS repair-wheel command.
The macOS `repair-wheel-command` (lines 227–231) uses `--no-sanitize-rpaths` but does not include `--exclude` flags to prevent bundling of `libtvm_ffi.dylib` and `libcuda.dylib`. Per web search results in past review comments, `--no-sanitize-rpaths` only preserves RPATH entries; it does not prevent delocate-wheel from copying dylibs into the wheel.
In contrast, the Linux configuration (lines 222–225) explicitly excludes `libtvm_ffi.so` and `libcuda.so.1`. Without equivalent exclusions on macOS, the wheel will bundle `libtvm_ffi.dylib`, creating a conflict with the pip-installed `apache-tvm-ffi` package and risking duplication or ABI mismatches.
Add `--exclude libtvm_ffi.dylib` and `--exclude libcuda.dylib` to the delocate-wheel command:
 [tool.cibuildwheel.macos]
 repair-wheel-command = [
-    "delocate-wheel --verbose --ignore-missing-dependencies --no-sanitize-rpaths --require-archs {delocate_archs} -w {dest_dir} -v {wheel}",
+    "delocate-wheel --verbose --ignore-missing-dependencies --no-sanitize-rpaths --exclude libtvm_ffi.dylib --exclude libcuda.dylib --require-archs {delocate_archs} -w {dest_dir} -v {wheel}",
     "pipx run abi3audit --verbose --strict {wheel}",
 ]
🧹 Nitpick comments (2)
pyproject.toml (2)
31-31: Clarify redundant torch dependency.
Line 31 adds `apache-tvm-ffi~=0.1.0`, but lines 36–37 declare torch twice: once without a version constraint (`torch`) and again with a macOS-specific pin (`torch>=2.7; platform_system == 'Darwin'`). Typically, a base requirement should precede environment markers.
Consider restructuring to a single torch entry with the appropriate version constraint:
 dependencies = [
     "apache-tvm-ffi~=0.1.0",
     "cloudpickle",
     "ml-dtypes",
     "numpy>=1.23.5",
     "psutil",
-    "torch",
-    "torch>=2.7; platform_system == 'Darwin'",
+    "torch>=2.7; platform_system == 'Darwin'",
+    "torch; platform_system != 'Darwin'",
     "tqdm>=4.62.3",
     "typing-extensions>=4.10.0",
 ]
Alternatively, if the minimum torch version differs by platform, clarify the intent with a comment.
Also applies to: 36-37
186-188: Consider expanding test-command to verify FFI integration.
The added `test-command` (lines 186–188) only checks that tilelang imports and prints its version. For a PR that migrates extensively to FFI-based object creation and adds a new FFI dependency (apache-tvm-ffi), consider a slightly more comprehensive smoke test to ensure FFI components load correctly.
For example:
python -c "import tilelang; from tilelang import ir; print(f'tilelang {tilelang.__version__}'); print('FFI OK')"
This ensures not just import, but also FFI-dependent submodules (like `ir`) load without error.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- .github/workflows/pr-perfbench-bot.yml (2 hunks)
- pyproject.toml (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/pr-perfbench-bot.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Quick Lint
- GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.1
- GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
- GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
- GitHub Check: Build SDist
🔇 Additional comments (1)
pyproject.toml (1)
30-40: Pin ml-dtypes version to ensure Python 3.8 compatibility.
The project declares `requires-python = ">=3.8"` (line 5), but does not pin `ml-dtypes` (line 33). Per PR comments, `apache-tvm-ffi` requires `ml-dtypes >= 0.5.0`, which in turn requires Python ≥ 3.9 for the `int2` type. This creates a latent incompatibility: pip may resolve `ml-dtypes >= 0.5.0` on Python 3.8, causing an import failure.
Either:
- Explicitly pin `ml-dtypes>=0.5.0` and update `requires-python` to `>=3.9`, or
- Pin `ml-dtypes` to a version compatible with Python 3.8 if older versions support it, or
- Make `ml-dtypes` conditional on Python ≥ 3.9 (if the library is optional for older Python).
As a workaround until `apache-tvm-ffi` resolves this upstream, consider pinning `ml-dtypes>=0.5.0; python_version >= '3.9'`.
As title, currently it failed with `ml_dtypes==0.4.x`: tile-ai/tilelang#1108 (comment)
This pull request introduces significant refactoring and modernization of the FFI (Foreign Function Interface) and object system usage in the codebase, particularly in the layout and IR (Intermediate Representation) modules. The changes improve consistency, type safety, and compatibility with updated TVM submodules. Additionally, there are CMake and build system updates to better handle third-party dependencies and include paths.
Key changes include:
FFI and Object System Refactoring
Replaced direct usage of `make_object` and related macros with `tvm::ffi::make_object` and new FFI macros (e.g., `TVM_FFI_DECLARE_OBJECT_INFO_FINAL`, `TVM_FFI_DEFINE_OBJECT_REF_METHODS_NULLABLE/NOTNULLABLE`) across `src/layout/layout.cc`, `src/layout/layout.h`, `src/layout/swizzle.cc`, and `src/ir.cc` for `Layout`, `Fragment`, `SwizzledLayout`, and various IR frame classes. This modernizes the object creation and reflection system, improving type safety and FFI integration.
Removed custom `SEqualReduce` methods and replaced legacy object registration macros with FFI-based equivalents in layout classes, further aligning with the new FFI system.
Updated FFI static initialization blocks to use the new syntax and patterns for registering reflection and global functions.
Build System and Third-party Dependency Handling
Updated the TVM submodule to a newer commit, ensuring compatibility with the latest upstream changes.
Improved CMake logic for including TVM and related third-party headers, adding conditional checks for alternative include paths (e.g., for `ffi/include` and `dlpack/include`), increasing build robustness across different setups.
Removed installation and dependency logic for TVM's Cython extension, simplifying the build process and avoiding unnecessary steps for certain build types.
Removed the now-unused `codegen_webgpu.cc` from build sources, reflecting changes in dependency or platform support.
Minor Codebase Cleanups
These changes collectively modernize the codebase, improve maintainability, and ensure compatibility with upstream dependencies and evolving FFI standards.
Summary by CodeRabbit
New Features
Improvements
Removals
Breaking Changes
Chores