
Conversation

LeiWang1999 (Member) commented Oct 22, 2025

This pull request introduces significant refactoring and modernization of the FFI (Foreign Function Interface) and object system usage in the codebase, particularly in the layout and IR (Intermediate Representation) modules. The changes improve consistency, type safety, and compatibility with updated TVM submodules. Additionally, there are CMake and build system updates to better handle third-party dependencies and include paths.

Key changes include:

FFI and Object System Refactoring

  • Replaced direct usage of make_object and related macros with tvm::ffi::make_object and new FFI macros (e.g., TVM_FFI_DECLARE_OBJECT_INFO_FINAL, TVM_FFI_DEFINE_OBJECT_REF_METHODS_NULLABLE/NOTNULLABLE) across src/layout/layout.cc, src/layout/layout.h, src/layout/swizzle.cc, and src/ir.cc for Layout, Fragment, SwizzledLayout, and various IR frame classes. This modernizes the object creation and reflection system, improving type safety and FFI integration. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] (A before/after sketch follows this list.)

  • Removed custom SEqualReduce methods and replaced legacy object registration macros with FFI-based equivalents in layout classes, further aligning with the new FFI system. [1] [2] [3]

  • Updated FFI static initialization blocks to use the new syntax and patterns for registering reflection and global functions. [1] [2] [3] [4]
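
To make the first bullet concrete, here is a rough before/after sketch of the pattern this PR moves to. The macro names come from the description above; their exact argument lists are assumed, so treat this as an illustration rather than a copy of the diff.

// Before: legacy TVM object macros and the unqualified factory.
class LayoutNode : public Object {
 public:
  // fields ...
  TVM_DECLARE_FINAL_OBJECT_INFO(LayoutNode, Object);
};
// ObjectPtr<LayoutNode> n = make_object<LayoutNode>();

// After: FFI macros plus the tvm::ffi factory (argument order assumed).
class LayoutNode : public Object {
 public:
  // fields ...
  TVM_FFI_DECLARE_OBJECT_INFO_FINAL("tl.Layout", LayoutNode, Object);
};
class Layout : public ObjectRef {
 public:
  TVM_FFI_DEFINE_OBJECT_REF_METHODS_NULLABLE(Layout, ObjectRef, LayoutNode);
};
// ObjectPtr<LayoutNode> n = tvm::ffi::make_object<LayoutNode>();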

Build System and Third-party Dependency Handling

  • Updated the TVM submodule to a newer commit, ensuring compatibility with the latest upstream changes.

  • Improved CMake logic for including TVM and related third-party headers, adding conditional checks for alternative include paths (e.g., for ffi/include and dlpack/include), increasing build robustness across different setups.

  • Removed installation and dependency logic for TVM's Cython extension, simplifying the build process and avoiding unnecessary steps for certain build types.

  • Removed the now-unused codegen_webgpu.cc from build sources, reflecting changes in dependency or platform support.

Minor Codebase Cleanups

  • Added missing includes for FFI aliases and object headers, and made minor formatting and whitespace adjustments for consistency. [1] [2] [3] [4]

These changes collectively modernize the codebase, improve maintainability, and ensure compatibility with upstream dependencies and evolving FFI standards.

Summary by CodeRabbit

  • New Features

    • Added is_wgmma option for GEMM warp-partition tuning.
  • Improvements

    • Large-scale migration to a new FFI-based ABI for cross-language types and registrations.
    • Consolidated install behavior and improved runtime library path handling.
    • Centralized FFI type aliases for consistent usage.
  • Removals

    • WebGPU TileLang backend removed.
  • Breaking Changes

    • Public APIs and module returns now use FFI-backed types; update callers.
  • Chores

    • Packaging, dependencies, and CI/workflow updates.

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai bot (Contributor) commented Oct 22, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Mass migration to TVM FFI: replace local object factories/GetRef and TVM object macros with tvm::ffi equivalents, add an ffi_aliases header, switch many public/internal types to ffi variants, normalize static-init macro usage, change several API signatures (ArgBinder, target/runtime modules), remove the WebGPU TileLang backend, and update Python FFI registrations and packaging.

Changes

Cohort / File(s) Summary
Submodule & Build
3rdparty/tvm, CMakeLists.txt, cmake/load_tvm.cmake
TVM submodule pointer updated; added GNU warning suppression and SCCACHE fallback; removed WebGPU source from build; conditionalized FFI/DLPack include paths; consolidated install(TARGETS ...) and expanded INSTALL_RPATH handling.
FFI aliases
src/support/ffi_aliases.h
New header re-exporting select tvm::ffi types as tvm::Array/Map/Optional/String/Function (a sketch of this pattern appears after this table).
Core IR & Layout
src/ir.cc, src/layout/layout.cc, src/layout/layout.h, tilelang/layout/layout.py, tilelang/layout/fragment.py
Replace make_object/GetRef with tvm::ffi factories/refs, swap TVM object macros for TVM_FFI_* variants, add ffi includes/aliases, and update Python object registration to tvm_ffi.
Operators
src/op/*.cc, src/op/*.h (atomic_add, copy, fill, gemm*, parallel, reduce, region, etc.)
Migrate object creation to tvm::ffi::make_object, replace object-info macros with TVM_FFI_DECLARE_*, remove SEqualReduce/SHashReduce hooks/flags, and normalize TVM_FFI_STATIC_INIT_BLOCK() usage.
Transforms / Passes
src/transform/** (many files)
Qualify GetRef/make_object as tvm::ffi::..., add ffi_aliases includes where needed, and standardize TVM_FFI_STATIC_INIT_BLOCK() invocation across passes.
ArgBinder API
src/transform/arg_binder.h, src/transform/arg_binder.cc
Public signature/type changes: BindArray and def_handle_dtype/member now use ffi::Array / ffi::Map instead of Array / Map.
Targets & runtime modules
src/target/*, src/target/rt_mod_*.cc, src/target/codegen_*.{cc,h}, src/target/intrin_rule_*.cc
Convert String/Array/Module to ffi::String/ffi::Array/ffi::Module, update GetRef usage and existence checks, change codegen/rt function signatures (CUDA/HIP/CPP), and include ffi_aliases.
WebGPU removal
src/target/codegen_webgpu.cc, src/target/codegen_webgpu.h, build entries
Entire WebGPU TileLang backend (header + implementation) removed; Python device mapping updated to target.build.webgpu.
Python FFI & registrations
tilelang/_ffi_api.py, tilelang/ir.py, tilelang/contrib/*.py, tilelang/engine/*.py, tilelang/engine/callback.py
Replace tvm.ffi imports with tvm_ffi, switch registration decorators to tvm_ffi.register_*/register_global_func; small API tweak (e.g., GemmWarpPolicy.compute_warp_partition signature).
Runtime / tests / examples
src/target/rt_mod_*.cc, tilelang/contrib/*, testing/python/*, examples/gemm/README.md
Update runtime/module return types and attribute types to ffi variants; update test registration decorators; minor README and test-runner tweaks.
Packaging & CI
pyproject.toml, requirements*.txt, .github/workflows/*, .clang-tidy, format.sh
Added apache-tvm-ffi dependency and other packaging updates, changed cibuildwheel/repair commands, CI env/steps adjusted, removed clang-tidy verbose arg, and set PIP_USER=0 in format.sh.
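
For orientation, the ffi_aliases header mentioned in the table above is essentially a block of using-declarations that re-export the FFI containers under the tvm:: namespace. A minimal sketch of the idea follows; the include paths and the exact set of aliases are assumed, not copied from the file.

// src/support/ffi_aliases.h (sketch)
#include <tvm/ffi/container/array.h>  // include paths assumed
#include <tvm/ffi/container/map.h>
#include <tvm/ffi/function.h>
#include <tvm/ffi/optional.h>
#include <tvm/ffi/string.h>

namespace tvm {
using ffi::Array;
using ffi::Function;
using ffi::Map;
using ffi::Optional;
using ffi::String;
}  // namespace tvm

With such a header in place, code that previously spelled tvm::Array or tvm::String keeps compiling while resolving to the FFI-backed types.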

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant Cpp as C++ code
    participant FFI as tvm::ffi
    Caller->>Cpp: request create Node / access Node
    alt pre-migration
      Cpp->>Cpp: make_object<Node>() / GetRef<Node>(...)
      note right of Cpp: local TVM factory / unqualified GetRef
      Cpp-->>Caller: ObjectRef
    else post-migration
      Cpp->>FFI: tvm::ffi::make_object<Node>() / tvm::ffi::GetRef<Node>(...)
      note right of FFI #D2E7D5: FFI-managed allocation/reflection and ffi types
      FFI-->>Cpp: ObjectPtr / ObjectRef
      Cpp-->>Caller: ObjectRef
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Areas to focus during review:
    • ArgBinder signature changes and all call sites (ffi::Array / ffi::Map migration).
    • Target/runtime API changes returning ffi::Module and updated codegen signatures for CUDA/HIP/CPP.
    • Broad consistency of migrations: make_object → tvm::ffi::make_object, GetRef → tvm::ffi::GetRef, and replacement of TVM object macros with TVM_FFI_*.
    • Removal of SEqualReduce/SHashReduce methods and flags—verify there are no remaining consumers (serialization, hashing, or reflection expectations). See the sketch after this list.
    • Complete removal of WebGPU backend—ensure no residual references in build scripts, Python mappings, or tests.
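
For the SEqualReduce/SHashReduce point, the old and new patterns look roughly like this (simplified; the real CopyNode registration in copy.h lists more fields):

// Legacy hook being removed (hand-written structural equality on the node):
//   bool SEqualReduce(const CopyNode* other, SEqualReducer equal) const {
//     return equal(src, other->src) && equal(dst, other->dst);
//   }

// FFI replacement: register fields once; equality/hashing are expected to be
// derived from this reflection metadata rather than per-node hooks.
void CopyNode::RegisterReflection() {
  namespace refl = tvm::ffi::reflection;
  refl::ObjectDef<CopyNode>()
      .def_ro("src", &CopyNode::src)
      .def_ro("dst", &CopyNode::dst);
}
TVM_FFI_STATIC_INIT_BLOCK() { CopyNode::RegisterReflection(); }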

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • LeiWang1999
  • oraluben

Poem

"I hopped through headers, swift and spry,
swapped make_object for ffi nearby.
WebGPU waved and hopped away,
Python hooks now lead the way.
A rabbit’s patch — neat, small, and spry 🐇"

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Docstring Coverage (⚠️ Warning): Docstring coverage is 42.62%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check (✅ Passed): The pull request title "[FFI] Rebase tvm to v0.22.0 to utilize tvm-ffi" directly describes the primary changes in this changeset. The title accurately identifies that the main effort is rebasing the TVM submodule to a newer version (v0.22.0) and refactoring the codebase to use updated TVM FFI APIs. This aligns with the substantial changes visible across the repository: updating object creation macros to tvm::ffi::make_object, replacing legacy object-reflection macros with FFI equivalents (TVM_FFI_DECLARE_OBJECT_INFO_FINAL, TVM_FFI_DEFINE_OBJECT_REF_METHODS_NULLABLE, etc.), updating Python FFI registration decorators, and adjusting build system configuration. The title is concise and specific—it clearly conveys the version target and the underlying motivation without excessive noise.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8424e99 and 9ef647c.

📒 Files selected for processing (1)
  • pyproject.toml (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.1
  • GitHub Check: Build SDist
🔇 Additional comments (2)
pyproject.toml (2)

30-40: Clarify torch dependency redundancy and constrain ml-dtypes version.

Line 36 specifies "torch" unconditionally, while line 37 specifies "torch>=2.7; platform_system == 'Darwin'" for macOS. This creates a redundancy: pip will resolve to the most restrictive constraint, so macOS gets torch>=2.7 but other platforms get any version. Please clarify intent.

Additionally, line 33 adds "ml-dtypes" without an explicit version constraint. Per the PR objectives, apache-tvm-ffi requires ml_dtypes.int2 which is only available in ml-dtypes >= 0.5.0 and effectively requires Python ≥ 3.9. The current unconstrained version may pull incompatible versions on Python 3.8. Consider pinning "ml-dtypes>=0.5.0" or documenting the version strategy.


47-49: Verify scikit-build-core sufficiency after setuptools removal.

The build-system requires field removed "setuptools>=63", relying solely on scikit-build-core. Confirm that all prior setuptools functionality (e.g., custom entry points, version hooks, setup.py scripts) is now handled by scikit-build-core and its configuration in [tool.scikit-build] and the provided pyproject.toml metadata. If any legacy setup.py hooks or setuptools-specific tasks remain, this change may break the build.



coderabbitai bot (Contributor) left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (14)
src/transform/make_packed_api.cc (2)

366-377: FFI type-code check is wrong for ints/floats.

buf_packed_arg_type_ids holds FFI type indices, but lines 369 and 375 check against kDLInt and kDLFloat (DLPack dtype codes), causing invalid runtime validation. The handle type case (line 361–363) correctly uses ffi::TypeIndex::kTVMFFI* constants on the same buffer. Replace with:

     } else if (t.is_int() || t.is_uint()) {
       std::ostringstream msg;
       msg << name_hint << ": Expect arg[" << i << "] to be int";
-      seq_init.emplace_back(AssertStmt(type_index == kDLInt,
+      seq_init.emplace_back(AssertStmt(type_index == ffi::TypeIndex::kTVMFFIInt,
                                        tvm::tir::StringImm(msg.str()), nop));
     } else {
       ICHECK(t.is_float());
       std::ostringstream msg;
       msg << name_hint << ": Expect arg[" << i << "] to be float";
-      seq_init.emplace_back(AssertStmt(type_index == kDLFloat,
+      seq_init.emplace_back(AssertStmt(type_index == ffi::TypeIndex::kTVMFFIFloat,
                                        tvm::tir::StringImm(msg.str()), nop));
     }

129-135: Fix incorrect ret() call construction at lines 133 and 450.

Both lines incorrectly use undefined or incorrect functions. Line 133 uses tvm::ret() which doesn't exist, and line 450 uses bare ret() which is undefined. The correct approach is to construct a Call with builtin::ret(), consistent with the pattern already in the codebase at line 417.

-    Stmt ret_zero = Evaluate(tvm::ret(0));
+    Stmt ret_zero = Evaluate(Call(DataType::Int(32), builtin::ret(), {Integer(0)}));

and

-  body = SeqStmt({body, Evaluate(ret(Integer(0)))});
+  body = SeqStmt({body, Evaluate(Call(DataType::Int(32), builtin::ret(), {Integer(0)}))});
src/target/utils.cc (1)

59-64: Bug: bitwise '&' used instead of logical '&&' in TargetIsSm100.

This will yield incorrect results for most values. Use logical and.

Apply this diff:

-  return arch >= 100 & arch <= 110;
+  return arch >= 100 && arch <= 110;
src/op/reduce.cc (1)

45-90: 64‑bit overflow/UB in bit shifts when building init constants.

(1 << (bits-1)) and (1 << bits) overflow for bits ≥ 31 (signed) / 32 (unsigned). Use min_value/max_value helpers instead of shifting.

   if (type->isMax()) {
-    if (is_int) {
-      return make_const(dst->dtype, -(1 << (bits - 1)));
-    } else if (is_uint) {
-      return make_const(dst->dtype, 0);
-    } else {
-      return make_const(dst->dtype, -INFINITY);
-    }
+    if (is_int) {
+      return min_value(dst->dtype);
+    } else if (is_uint) {
+      return make_const(dst->dtype, 0);
+    } else {
+      return make_const(dst->dtype, -INFINITY);
+    }
   } else if (type->isMin()) {
-    if (is_int) {
-      return make_const(dst->dtype, (1 << (bits - 1)) - 1);
-    } else if (is_uint) {
-      return make_const(dst->dtype, (1 << bits) - 1);
-    } else {
-      return make_const(dst->dtype, INFINITY);
-    }
+    if (is_int) {
+      return max_value(dst->dtype);
+    } else if (is_uint) {
+      return max_value(dst->dtype);
+    } else {
+      return make_const(dst->dtype, INFINITY);
+    }
   } else if (type->isBitAnd()) {
-    if (is_int) {
-      return make_const(dst->dtype, -1);
-    } else if (is_uint) {
-      return make_const(dst->dtype, (1 << bits) - 1);
-    } else {
-      // Should not arrive here
-      return make_const(dst->dtype, -INFINITY);
-    }
+    if (is_int) {
+      return make_const(dst->dtype, -1);
+    } else if (is_uint) {
+      return max_value(dst->dtype);
+    } else {
+      // Should not arrive here
+      return make_const(dst->dtype, -INFINITY);
+    }

Note: min_value/max_value are in tir/op.h; already included above.

src/transform/vectorize_loop.cc (1)

170-178: Bug: strides are cleared due to iterating an empty array

You initialize Array<PrimExpr> strides; and then loop over strides.size(), which is 0. This drops existing strides when assigning back, changing layout semantics.

Apply:

-      // Update strides if defined.
-      Array<PrimExpr> strides;
-      for (size_t i = 0; i < strides.size(); i++) {
-        PrimExpr stride = strides[i];
-        if (i != strides.size() - 1) {
-          stride *= var_lanes_;
-        }
-        strides.push_back(analyzer_.Simplify(stride));
-      }
+      // Update strides if defined.
+      Array<PrimExpr> strides = node->buffer->strides;
+      if (!strides.empty()) {
+        for (size_t i = 0; i < strides.size(); i++) {
+          PrimExpr stride = strides[i];
+          if (i != strides.size() - 1) {
+            stride *= var_lanes_;
+          }
+          strides.Set(i, analyzer_.Simplify(stride));
+        }
+      }
src/op/gemm_py.cc (1)

65-66: Fix type casting for clear_accum field

The comment in the file (lines 38-41) and Python interface both document clear_accum as Bool type. The parallel implementation in src/op/gemm_sp.cc (line 98) and validation in src/op/gemm.cc (lines 592-595) both use .as<Bool>() casting. Update the parsing to match:

-  node->clear_accum = args[9].as<PrimExpr>().value();
+  node->clear_accum = args[9].as<Bool>().value();
src/target/rt_mod_cuda.cc (3)

47-54: Unqualified TIR types; qualify to avoid ADL/visibility pitfall.

PrimFuncNode/PrimFunc are used without tir:: here, unlike earlier in the file. This can break if no using brings them in.

-    ICHECK(kv.second->IsInstance<PrimFuncNode>())
+    ICHECK(kv.second->IsInstance<tir::PrimFuncNode>())
...
-    auto f = Downcast<PrimFunc>(kv.second);
+    auto f = Downcast<tir::PrimFunc>(kv.second);

Applies to both loops.

Also applies to: 79-87


64-68: Guard against empty PTX before indexing.

ptx[0] is UB when ptx is empty (compile callback could return an empty string).

-    if (ptx[0] != '/')
+    if (!ptx.empty() && ptx[0] != '/')
       fmt = "cubin";

69-70: Replace ICHECK(0) with a descriptive fatal.

Improves diagnosability.

-  } else {
-    ICHECK(0);
-  }
+  } else {
+    LOG(FATAL) << "Missing global func: tilelang_callback_cuda_compile";
+  }
src/ir.cc (1)

7-16: Include the standard headers used here.

  • std::vector and std::sort are used but not explicitly included.

Apply:

 #include "./transform/common/attr.h"
 #include "op/builtin.h"
 #include "tvm/ffi/any.h"
 #include <tvm/ffi/object.h>
 
 #include "support/ffi_aliases.h"
 #include <tvm/arith/analyzer.h>
 #include <tvm/ffi/reflection/registry.h>
 #include <tvm/script/ir_builder/tir/ir.h>
+#include <vector>
+#include <algorithm>
src/op/gemm.h (1)

10-12: Add missing <optional> header to fix std::optional usage.

The header uses std::optional (line 101) but does not include <optional>. While support/ffi_aliases.h is already transitively included via operator.h, it provides only the FFI wrapper Optional, not the standard library std::optional.

 #include "operator.h"
+#include <optional>

The #include "support/ffi_aliases.h" is optional—it's already transitively available through operator.hlayout.hsupport/ffi_aliases.h, but adding it explicitly improves header self-sufficiency if desired.

src/op/finalize_reducer.cc (1)

84-86: Potential null deref in extent check.

If T.thread_bounds->extent is symbolic, as_const_int(...) may be null and dereferenced in the ICHECK. Guard before deref.

Apply this minimal fix:

-  int extent = *p_extent, scale = 1;
-  ICHECK(extent == 1 || extent == *as_const_int(T.thread_bounds->extent))
+  int extent = *p_extent, scale = 1;
+  const int64_t* p_thr = as_const_int(T.thread_bounds->extent);
+  ICHECK(p_thr != nullptr && (extent == 1 || extent == *p_thr))
       << "Illegal finalize_reducer: extent=" << extent
       << "; T.thread_bounds=" << T.thread_bounds;
src/op/gemm.cc (2)

115-152: Constructor reads beyond args without size checks; fix parsing robustness.

Accessing args[16..18] unconditionally can read out of bounds when older call sites omit them. Also prefer robust Downcast/Optional handling for types.

Apply:

 Gemm::Gemm(Array<PrimExpr> args, BufferMap vmap) {
-  ObjectPtr<GemmNode> node = tvm::ffi::make_object<GemmNode>();
+  ObjectPtr<GemmNode> node = tvm::ffi::make_object<GemmNode>();

   node->Aptr = args[0];
   node->Bptr = args[1];
   node->Cptr = args[2];
   node->A = vmap[GetVarFromAccessPtr(node->Aptr)];
   node->B = vmap[GetVarFromAccessPtr(node->Bptr)];
   node->C = vmap[GetVarFromAccessPtr(node->Cptr)];
-  node->trans_A = args[3].as<Bool>().value();
-  node->trans_B = args[4].as<Bool>().value();
+  node->trans_A = Downcast<Bool>(args[3]).value();
+  node->trans_B = Downcast<Bool>(args[4]).value();
   node->M = args[5].as<IntImm>().value()->value;
   node->N = args[6].as<IntImm>().value()->value;
   node->K = args[7].as<IntImm>().value()->value;
   node->policy = GemmWarpPolicy(args[8].as<IntImm>().value()->value);
-  node->clear_accum = args[9].as<PrimExpr>().value();
+  node->clear_accum = Downcast<Bool>(args[9]);
   node->stride_A = args[10].as<IntImm>().value()->value;
   node->stride_B = args[11].as<IntImm>().value()->value;
   node->offset_A = args[12].as<IntImm>().value()->value;
   node->offset_B = args[13].as<IntImm>().value()->value;
   if (args.size() > 14) {
     node->kPack = args[14].as<IntImm>().value()->value;
     if (node->kPack != 1 && node->kPack != 2) {
       ICHECK(false) << "kPack must be 1 or 2";
     }
   }
-  if (args.size() > 15) {
-    node->wg_wait = args[15].as<IntImm>().value()->value;
-  }
-  node->mbarptr = args[16];
-  if (node->mbarptr.as<CallNode>()) {
-    node->mbar = vmap[GetVarFromAccessPtr(node->mbarptr)];
-  } else {
-    node->mbar = std::nullopt;
-  }
-  node->C_coords = Array<PrimExpr>(
-      {args[17].as<PrimExpr>().value(), args[18].as<PrimExpr>().value()});
+  if (args.size() > 15) {
+    node->wg_wait = args[15].as<IntImmNode>()->value;
+  }
+  if (args.size() > 16) {
+    node->mbarptr = args[16];
+    if (node->mbarptr.as<CallNode>()) {
+      node->mbar = vmap[GetVarFromAccessPtr(node->mbarptr)];
+    } else {
+      node->mbar = std::nullopt;
+    }
+  }
+  if (args.size() > 18) {
+    node->C_coords = {args[17], args[18]};
+  } else {
+    node->C_coords = {};
+  }

Also update the comment above to document the additional parameters and new optionality.


877-887: FFI binding drops return value from ComputeWarpPartition.

The lambda doesn’t return the computed pair; callers can’t observe the result. Return both values (e.g., as Array or a Tuple).

Apply one of:

-  refl::GlobalDef().def("tl.GemmWarpPolicyComputeWarpPartition",
-                        [](GemmWarpPolicy policy, int M, int N, int block_size,
-                           Target target, GemmInst gemm_inst) {
-                          policy->ComputeWarpPartition(M, N, block_size, target,
-                                                       gemm_inst);
-                        });
+  refl::GlobalDef().def("tl.GemmWarpPolicyComputeWarpPartition",
+                        [](GemmWarpPolicy policy, int M, int N, int block_size,
+                           Target target, GemmInst gemm_inst) -> Array<Integer> {
+                          auto res = policy->ComputeWarpPartition(M, N, block_size, target, gemm_inst);
+                          return {Integer(res.first), Integer(res.second)};
+                        });

If your FFI prefers packed returns, switch to def_packed and set rv accordingly.

🧹 Nitpick comments (21)
src/transform/lower_device_kernel_launch.cc (1)

39-39: Optional: The using namespace ffi; directive appears unused.

All FFI components in this file are explicitly qualified (e.g., tvm::ffi::GetRef on lines 376 and 392), so this namespace directive isn't currently utilized. Consider removing it or using unqualified FFI names if consistency with the using namespace tir; pattern is desired.

src/transform/storage_access.cc (1)

343-407: FFI migration correctly applied.

All tvm::ffi::GetRef<Var> conversions are necessary to convert the raw const VarNode* (line 346) to Var objects.

Minor optimization opportunity: Lines 350, 355, 360-361, and 392 each create a new Var reference from the same buffer_var pointer. You could cache tvm::ffi::GetRef<Var>(buffer_var) in a local variable to avoid repeated conversions, though the performance impact is likely negligible.

Optional optimization
     DataType dtype = op->args[0].dtype();
     const VarNode *buffer_var = op->args[1].as<VarNode>();
+    Var buffer_var_ref = tvm::ffi::GetRef<Var>(buffer_var);
     PrimExpr offset = op->args[2];
     PrimExpr extent = op->args[3];
     const IntImmNode *flag = op->args[4].as<IntImmNode>();
-    StorageScope scope = GetScope(tvm::ffi::GetRef<Var>(buffer_var));
+    StorageScope scope = GetScope(buffer_var_ref);
     // The buffer scope.
     if (Enabled(buffer_var, scope)) {
       ICHECK(allow_append_);
       Array<Range> buffer_ranges;
-      if (buffer_data_to_buffer_.find(tvm::ffi::GetRef<Var>(buffer_var)) ==
+      if (buffer_data_to_buffer_.find(buffer_var_ref) ==
           buffer_data_to_buffer_.end()) {
         // cannot find buffer map, use the default buffer
         buffer_ranges = {Range::FromMinExtent(offset, extent)};
       } else {
-        Buffer buffer =
-            buffer_data_to_buffer_.at(tvm::ffi::GetRef<Var>(buffer_var));
+        Buffer buffer = buffer_data_to_buffer_.at(buffer_var_ref);
         // ... (rest of logic unchanged)
       }
       AccessEntry e;
       e.threads = env_threads();
       e.thread_range = this->ComputeThreadRange(e.threads);
       e.dtype = dtype;
-      e.buffer = tvm::ffi::GetRef<Var>(buffer_var);
+      e.buffer = buffer_var_ref;
src/transform/inject_pipeline.cc (1)

40-40: Verify if this namespace import is actually used.

The using namespace ffi; directive appears unused, as all FFI members throughout the file are explicitly qualified (e.g., tvm::ffi::make_object at line 463, tvm::ffi::GetRef at line 985, tvm::ffi::reflection at line 1077).

If this import is not used elsewhere in the file, consider removing it:

 using namespace tir;
-using namespace ffi;
src/transform/storage_access.h (1)

42-42: Consider moving the using namespace ffi; directive to the implementation file.

Using namespace directives in headers can lead to namespace pollution for all translation units that include this header. Since no FFI types or functions are directly used in this header declaration, consider moving this directive to the corresponding .cc implementation file where FFI functions like make_object or GetRef are likely called.

Alternatively, use explicit ffi:: qualification at call sites or prefer specific using declarations (e.g., using ffi::make_object;) for better clarity and to avoid potential name collisions.

src/transform/inject_tma_barrier.cc (4)

296-301: Assumes barrier_id is always Call(get_mbarrier(...)) — verify or guard.

tma_op_to_barrier_id_[GetRef(op)].as<CallNode>()->args[0] will segfault if the value isn’t a Call. Add an ICHECK on the as<CallNode>() result or handle Var/IntImm cases.

Would you like me to harden this with a small guard and fallback?


408-416: Unused local ‘block’.

auto block = tvm::ffi::GetRef<Block>(op) is unused. Remove to avoid warnings.

-  auto block = tvm::ffi::GetRef<Block>(op);

487-494: Eliding arrive_barrier by returning 0 — confirm semantics.

Replacing builtin::ptx_arrive_barrier() with literal 0 relies on the surrounding Evaluate(…) context treating it as a no-op. If any code inspects side effects, this may mislead. Consider returning an Evaluate(0) at the stmt layer instead where possible.
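
A rough sketch of the stmt-layer alternative, assuming the rewrite lives in a StmtExprMutator (the surrounding class and method placement here are illustrative, not the file's actual structure):

Stmt VisitStmt_(const EvaluateNode* op) final {
  if (const auto* call = op->value.as<CallNode>()) {
    if (call->op.same_as(builtin::ptx_arrive_barrier())) {
      // Drop the arrive_barrier as an explicit no-op statement instead of
      // rewriting the call expression itself to the literal 0.
      return Evaluate(0);
    }
  }
  return StmtExprMutator::VisitStmt_(op);
}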


82-87: Typo: old_loop_evtents.

Rename to old_loop_extents for clarity.

-    PrimExpr old_loop_evtents = loop_extents;
+    PrimExpr old_loop_extents = loop_extents;
     loop_extents *= op->extent;
     StmtExprVisitor::VisitStmt_(op);
-    loop_extents = old_loop_evtents;
+    loop_extents = old_loop_extents;
src/op/copy.h (2)

106-114: Expose remaining fields in reflection (disable_tma, eviction_policy, par_op_).

For parity with other ops (e.g., AtomicAddNode), add def_ro for disable_tma and eviction_policy to aid Python-side introspection/debug.

   refl::ObjectDef<CopyNode>()
       .def_ro("src", &CopyNode::src)
       .def_ro("dst", &CopyNode::dst)
       .def_ro("src_range", &CopyNode::src_range)
       .def_ro("dst_range", &CopyNode::dst_range)
-      .def_ro("coalesced_width", &CopyNode::coalesced_width);
+      .def_ro("coalesced_width", &CopyNode::coalesced_width)
+      .def_ro("disable_tma", &CopyNode::disable_tma)
+      .def_ro("eviction_policy", &CopyNode::eviction_policy);

308-345: FFI macros for Conv2DIm2ColOpNode/Op look correct; consider reflecting nhw_step/c_step.

nhw_step and c_step are important for debugging/instrumentation but not reflected. Consider adding def_ro for both.

   refl::ObjectDef<Conv2DIm2ColOpNode>()
       .def_ro("src", &Conv2DIm2ColOpNode::src)
       .def_ro("dst", &Conv2DIm2ColOpNode::dst)
       .def_ro("stride", &Conv2DIm2ColOpNode::stride)
       .def_ro("padding", &Conv2DIm2ColOpNode::padding)
       .def_ro("dilation", &Conv2DIm2ColOpNode::dilation)
       .def_ro("kernel", &Conv2DIm2ColOpNode::kernel)
-      .def_ro("eviction_policy", &Conv2DIm2ColOpNode::eviction_policy);
+      .def_ro("eviction_policy", &Conv2DIm2ColOpNode::eviction_policy)
+      .def_ro("nhw_step", &Conv2DIm2ColOpNode::nhw_step)
+      .def_ro("c_step", &Conv2DIm2ColOpNode::c_step);
src/transform/simplify.cc (2)

26-27: Avoid broad ‘using namespace ffi’ in .cc unless needed.

Localize to needed symbols or qualify with tvm::ffi to reduce ambiguity.


515-523: Default argument differs from Python wrapper.

C++ Simplify defaults simplify_arguments = true, while tilelang/transform/simplify.py defaults to False. Align to False to avoid surprises when invoked from C++.

-tvm::transform::Pass Simplify(bool simplify_arguments = true) {
+tvm::transform::Pass Simplify(bool simplify_arguments = false) {

Would you like me to scan for in-repo C++ call sites that rely on the current default?

src/op/reduce.h (1)

54-58: Constructor uses tvm::ffi::make_object — good. Consider std::string_view.

Optional: accept std::string_view to avoid copies and normalize input via to_lower for robustness.
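
One possible shape for that suggestion; ParseReduceType and the field it feeds are hypothetical, and only the make_object call mirrors the existing constructor:

// Requires <algorithm>, <cctype>, <string>, <string_view>.
ReduceOp::ReduceOp(std::string_view type) {
  auto node = tvm::ffi::make_object<ReduceOpNode>();
  std::string normalized(type);
  std::transform(normalized.begin(), normalized.end(), normalized.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  node->type = ParseReduceType(normalized);  // hypothetical helper; the real field/setter may differ
  data_ = std::move(node);
}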

src/op/reduce.cc (1)

317-321: AllReduce workspace conditional looks fine; small nit: reduce literal 32 as constexpr.

Optional micro‑refactor to constexpr kWarpSize = 32 for clarity.

src/layout/swizzle.cc (3)

103-106: Register reflection in a static init block

RegisterReflection() won’t run unless called. Add a static init block.

Apply:

 void SwizzledLayoutNode::RegisterReflection() {
   namespace refl = tvm::ffi::reflection;
   refl::ObjectDef<SwizzledLayoutNode>();
 }
+
+TVM_FFI_STATIC_INIT_BLOCK() { SwizzledLayoutNode::RegisterReflection(); }

82-86: Use ICHECK instead of CHECK for consistency

Elsewhere you use ICHECK; align this check as well.

-    CHECK(is_zero(forward_var[i]->dom->min));
+    ICHECK(is_zero(forward_var[i]->dom->min));

27-33: Make bit shifts 64-bit safe

1 << bits_ and 1 << base_ on int can overflow/UB for large values. Use 1LL.

-  int base = (1 << base_);
-  int mask = ((1 << bits_) - 1) << shift_;
+  int base = static_cast<int>(1LL << base_);
+  int mask = static_cast<int>(((1LL << bits_) - 1) << shift_);
src/transform/layout_reducer.h (1)

8-11: Make header self-sufficient for FFI types

ReducerInfoNode/ReducerInfo use String and FFI macros but this header doesn’t include the necessary FFI headers. Include the aliases header or specific FFI headers to avoid relying on transitive includes.

 #include <tvm/tir/op.h>
 
 #include "../layout/layout.h"
+#include "../support/ffi_aliases.h"
+#include <tvm/ffi/object.h>
src/target/codegen_cuda.h (1)

67-69: Type qualification inconsistency confirmed—refactor to maintain consistency across translation unit boundaries.

The header declares PrintFunctionSignature and PrintCallExtern with ffi::String and ffi::Array<PrimExpr>, while the .cc implementation uses bare String and Array<PrimExpr>. Although semantically equivalent due to using namespace ffi; at line 22 of codegen_cuda.cc, this mixed qualification style can obscure intent across TU boundaries and risks confusion in future maintenance.

Recommendation: Align the .cc definitions to use explicit ffi::String and ffi::Array for both methods to match the header declarations, or conversely, update the header to use bare types if the unqualified form is preferred in this codebase.

src/ir.cc (1)

246-249: Remove no-op ICHECK.

  • ICHECK(grid_size.size() >= 0) is always true; drop it.
-    ICHECK(grid_size.size() >= 0);
src/layout/layout.cc (1)

481-565: FFI GlobalDef and reflection blocks look valid.

Two static init blocks for API defs and reflection are fine; no duplicate registrations detected.

Minor: In LayoutNode::Inverse() and FragmentNode::Inverse(), returning std::move(...) is unnecessary for local values; consider returning the value directly for clarity.

Also applies to: 567-572
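
The std::move-on-return nitpick in general terms (a self-contained illustration, not the actual Inverse() body):

#include <vector>

std::vector<int> MakeSquares(int n) {
  std::vector<int> result;
  result.reserve(n);
  for (int i = 0; i < n; ++i) result.push_back(i * i);
  // NRVO / implicit move applies here; writing `return std::move(result);`
  // would actually inhibit copy elision.
  return result;
}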

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e28433e and 951f2de.

📒 Files selected for processing (107)
  • 3rdparty/tvm (1 hunks)
  • CMakeLists.txt (1 hunks)
  • cmake/load_tvm.cmake (1 hunks)
  • src/ir.cc (11 hunks)
  • src/layout/layout.cc (7 hunks)
  • src/layout/layout.h (5 hunks)
  • src/layout/swizzle.cc (3 hunks)
  • src/layout/swizzle.h (2 hunks)
  • src/layout/utils.cc (1 hunks)
  • src/layout/utils.h (1 hunks)
  • src/op/atomic_add.cc (3 hunks)
  • src/op/atomic_add.h (2 hunks)
  • src/op/copy.cc (5 hunks)
  • src/op/copy.h (4 hunks)
  • src/op/fill.cc (3 hunks)
  • src/op/fill.h (2 hunks)
  • src/op/finalize_reducer.cc (3 hunks)
  • src/op/finalize_reducer.h (2 hunks)
  • src/op/gemm.cc (5 hunks)
  • src/op/gemm.h (4 hunks)
  • src/op/gemm_py.cc (7 hunks)
  • src/op/gemm_py.h (2 hunks)
  • src/op/gemm_sp.cc (3 hunks)
  • src/op/gemm_sp.h (3 hunks)
  • src/op/logical.cc (2 hunks)
  • src/op/math.cc (1 hunks)
  • src/op/operator.cc (2 hunks)
  • src/op/operator.h (1 hunks)
  • src/op/parallel.cc (2 hunks)
  • src/op/parallel.h (2 hunks)
  • src/op/reduce.cc (5 hunks)
  • src/op/reduce.h (6 hunks)
  • src/op/region.cc (3 hunks)
  • src/op/region.h (2 hunks)
  • src/runtime/runtime.cc (4 hunks)
  • src/support/ffi_aliases.h (1 hunks)
  • src/target/codegen_cpp.cc (3 hunks)
  • src/target/codegen_cpp.h (2 hunks)
  • src/target/codegen_cuda.cc (4 hunks)
  • src/target/codegen_cuda.h (1 hunks)
  • src/target/codegen_hip.cc (1 hunks)
  • src/target/codegen_webgpu.cc (0 hunks)
  • src/target/codegen_webgpu.h (0 hunks)
  • src/target/intrin_rule_cuda.cc (1 hunks)
  • src/target/intrin_rule_hip.cc (2 hunks)
  • src/target/rt_mod_cpp.cc (2 hunks)
  • src/target/rt_mod_cuda.cc (3 hunks)
  • src/target/rt_mod_hip.cc (1 hunks)
  • src/target/utils.cc (6 hunks)
  • src/transform/align_dynamic_shared_memory_allocations.cc (5 hunks)
  • src/transform/annotate_device_regions.cc (2 hunks)
  • src/transform/annotate_warp_group_reg_alloc.cc (1 hunks)
  • src/transform/cluster_planning.cc (3 hunks)
  • src/transform/common/loop_parallel_transform_utils.h (2 hunks)
  • src/transform/common/loop_vectorization_utils.h (19 hunks)
  • src/transform/config_index_bitwidth.cc (3 hunks)
  • src/transform/eliminate_storage_sync_for_mbarrier.cc (3 hunks)
  • src/transform/flatten_buffer.cc (3 hunks)
  • src/transform/frontend_legalize.cc (1 hunks)
  • src/transform/if_stmt_binding.cc (2 hunks)
  • src/transform/inject_assumes.cc (1 hunks)
  • src/transform/inject_fence_proxy.cc (1 hunks)
  • src/transform/inject_pipeline.cc (5 hunks)
  • src/transform/inject_ptx_async_copy.cc (1 hunks)
  • src/transform/inject_tma_barrier.cc (6 hunks)
  • src/transform/layout_inference.cc (5 hunks)
  • src/transform/layout_reducer.cc (1 hunks)
  • src/transform/layout_reducer.h (1 hunks)
  • src/transform/legalize_safe_memory_access.cc (2 hunks)
  • src/transform/legalize_vectorized_loop.cc (1 hunks)
  • src/transform/loop_partition.cc (1 hunks)
  • src/transform/loop_vectorize.cc (1 hunks)
  • src/transform/loop_vectorize_dynamic.cc (2 hunks)
  • src/transform/lower_device_kernel_launch.cc (4 hunks)
  • src/transform/lower_device_storage_access_info.cc (1 hunks)
  • src/transform/lower_hopper_intrin.cc (2 hunks)
  • src/transform/lower_intrin.cc (7 hunks)
  • src/transform/lower_l2_persistent_annotation.cc (1 hunks)
  • src/transform/lower_opaque_block.cc (2 hunks)
  • src/transform/lower_shared_barrier.cc (2 hunks)
  • src/transform/lower_shared_tmem.cc (2 hunks)
  • src/transform/lower_thread_allreduce.cc (2 hunks)
  • src/transform/lower_tile_op.cc (4 hunks)
  • src/transform/make_packed_api.cc (6 hunks)
  • src/transform/merge_if_stmt.cc (1 hunks)
  • src/transform/merge_shared_memory_allocations.cc (5 hunks)
  • src/transform/multi_version_buffer_rewriter.cc (3 hunks)
  • src/transform/persist_threadblock.cc (1 hunks)
  • src/transform/pipeline_planning.cc (5 hunks)
  • src/transform/simplify.cc (5 hunks)
  • src/transform/split_host_device.cc (2 hunks)
  • src/transform/storage_access.cc (5 hunks)
  • src/transform/storage_access.h (1 hunks)
  • src/transform/storage_rewrite.cc (5 hunks)
  • src/transform/thread_storage_sync.cc (1 hunks)
  • src/transform/vectorize_loop.cc (20 hunks)
  • src/transform/warp_specialized_rewriter.cc (5 hunks)
  • src/transform/wgmma_sync_rewriter.cc (1 hunks)
  • tilelang/_ffi_api.py (1 hunks)
  • tilelang/contrib/dlpack.py (2 hunks)
  • tilelang/contrib/hipcc.py (2 hunks)
  • tilelang/contrib/nvcc.py (6 hunks)
  • tilelang/contrib/rocm.py (4 hunks)
  • tilelang/engine/callback.py (3 hunks)
  • tilelang/engine/lower.py (5 hunks)
  • tilelang/ir.py (2 hunks)
  • tilelang/layout/fragment.py (1 hunks)
⛔ Files not processed due to max files limit (4)
  • tilelang/layout/layout.py
  • tilelang/tileop/gemm/__init__.py
  • tilelang/transform/_ffi_api.py
  • tilelang/utils/tensor.py
💤 Files with no reviewable changes (2)
  • src/target/codegen_webgpu.h
  • src/target/codegen_webgpu.cc
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
PR: tile-ai/tilelang#794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/lower_shared_barrier.cc
  • src/transform/inject_tma_barrier.cc
  • src/transform/align_dynamic_shared_memory_allocations.cc
  • src/transform/lower_shared_tmem.cc
🧬 Code graph analysis (75)
src/transform/legalize_vectorized_loop.cc (1)
tilelang/transform/__init__.py (1)
  • LegalizeVectorizedLoop (241-249)
src/transform/persist_threadblock.cc (1)
tilelang/transform/__init__.py (1)
  • PersistThreadblock (392-395)
src/op/gemm_sp.cc (2)
src/op/finalize_reducer.h (1)
  • RegisterReflection (33-38)
src/op/gemm.h (2)
  • RegisterReflection (35-41)
  • RegisterReflection (106-128)
src/transform/inject_ptx_async_copy.cc (1)
tilelang/transform/__init__.py (1)
  • InjectPTXAsyncCopy (307-315)
src/transform/annotate_warp_group_reg_alloc.cc (1)
tilelang/transform/__init__.py (1)
  • AnnotateWarpGroupRegAlloc (204-216)
src/transform/inject_assumes.cc (1)
tilelang/transform/__init__.py (1)
  • InjectAssumes (83-91)
src/transform/flatten_buffer.cc (2)
src/transform/config_index_bitwidth.cc (16)
  • op (27-35)
  • op (27-27)
  • op (37-42)
  • op (37-37)
  • op (44-50)
  • op (44-44)
  • op (52-59)
  • op (52-52)
  • op (61-68)
  • op (61-61)
  • op (89-94)
  • op (89-89)
  • op (96-101)
  • op (96-96)
  • op (103-108)
  • op (103-103)
tilelang/transform/__init__.py (1)
  • FlattenBuffer (357-365)
tilelang/contrib/nvcc.py (2)
tilelang/engine/lower.py (1)
  • tilelang_callback_cuda_compile (56-89)
tilelang/contrib/nvrtc.py (1)
  • compile_cuda (13-110)
src/transform/lower_shared_barrier.cc (2)
src/transform/layout_inference.cc (10)
  • op (42-48)
  • op (42-42)
  • op (327-387)
  • op (327-327)
  • op (417-441)
  • op (417-417)
  • op (443-460)
  • op (443-443)
  • op (462-471)
  • op (462-462)
tilelang/transform/__init__.py (1)
  • LowerSharedBarrier (412-415)
src/transform/common/loop_parallel_transform_utils.h (2)
src/transform/common/loop_vectorization_utils.h (1)
  • tvm (39-783)
src/transform/layout_inference.cc (16)
  • op (42-48)
  • op (42-42)
  • op (327-387)
  • op (327-327)
  • op (417-441)
  • op (417-417)
  • op (443-460)
  • op (443-443)
  • op (462-471)
  • op (462-462)
  • op (657-669)
  • op (657-657)
  • op (706-801)
  • op (706-706)
  • op (803-812)
  • op (803-803)
src/transform/layout_reducer.cc (1)
tilelang/transform/__init__.py (1)
  • LayoutReducer (466-475)
src/layout/swizzle.h (2)
src/layout/swizzle.cc (3)
  • SwizzledLayoutNode (40-49)
  • SwizzledLayout (77-93)
  • SwizzledLayout (95-101)
src/layout/layout.h (1)
  • LayoutNode (24-209)
src/transform/common/loop_vectorization_utils.h (1)
src/transform/common/loop_parallel_transform_utils.h (1)
  • tvm (20-165)
src/transform/lower_hopper_intrin.cc (2)
src/op/builtin.cc (2)
  • cuTensorMapType (38-38)
  • cuTensorMapType (38-38)
tilelang/transform/__init__.py (1)
  • LowerHopperIntrin (94-103)
src/transform/storage_access.cc (2)
src/transform/merge_shared_memory_allocations.cc (6)
  • buf (215-232)
  • buf (215-215)
  • buffer_var (614-619)
  • buffer_var (614-614)
  • buffer (549-568)
  • buffer (549-549)
src/transform/thread_storage_sync.cc (21)
  • buf (57-59)
  • buf (57-57)
  • op (353-365)
  • op (353-353)
  • op (411-425)
  • op (411-411)
  • op (456-462)
  • op (456-456)
  • op (463-469)
  • op (463-463)
  • op (470-488)
  • op (470-470)
  • op (490-529)
  • op (490-490)
  • op (544-563)
  • op (544-544)
  • op (614-633)
  • op (614-614)
  • ICHECK (564-586)
  • buffer_var (539-541)
  • buffer_var (539-539)
src/transform/split_host_device.cc (1)
tilelang/transform/__init__.py (1)
  • SplitHostDevice (285-293)
src/layout/utils.cc (1)
src/transform/layout_inference.cc (12)
  • op (42-48)
  • op (42-42)
  • op (327-387)
  • op (327-327)
  • op (417-441)
  • op (417-417)
  • op (443-460)
  • op (443-443)
  • op (462-471)
  • op (462-462)
  • op (657-669)
  • op (657-657)
src/transform/loop_vectorize.cc (1)
src/transform/lower_tile_op.cc (2)
  • expr (433-445)
  • expr (433-433)
src/transform/inject_tma_barrier.cc (2)
src/transform/lower_hopper_intrin.cc (2)
  • call (102-132)
  • call (102-102)
src/transform/warp_specialized_rewriter.cc (18)
  • call (31-36)
  • call (31-31)
  • op (38-43)
  • op (38-38)
  • op (73-83)
  • op (73-73)
  • op (85-93)
  • op (85-85)
  • op (95-100)
  • op (95-95)
  • op (102-110)
  • op (102-102)
  • op (134-146)
  • op (134-134)
  • op (148-177)
  • op (148-148)
  • op (179-189)
  • op (179-179)
tilelang/layout/fragment.py (2)
src/layout/layout.cc (2)
  • Layout (57-69)
  • Layout (71-74)
tilelang/layout/layout.py (1)
  • Layout (13-148)
src/op/fill.h (2)
src/op/operator.h (2)
  • TileOperatorNode (56-66)
  • TileOperator (68-72)
src/op/fill.cc (1)
  • Fill (62-109)
src/layout/swizzle.cc (1)
src/layout/swizzle.h (1)
  • SwizzledLayout (58-66)
src/transform/layout_reducer.h (5)
src/op/atomic_add.h (1)
  • tvm (12-73)
src/op/finalize_reducer.h (1)
  • tvm (20-56)
src/op/gemm_sp.h (1)
  • tvm (13-104)
src/op/parallel.h (1)
  • tvm (22-148)
src/support/ffi_aliases.h (1)
  • tvm (10-16)
src/op/fill.cc (1)
src/op/fill.h (1)
  • RegisterReflection (29-35)
src/transform/eliminate_storage_sync_for_mbarrier.cc (2)
src/transform/lower_hopper_intrin.cc (2)
  • op (55-100)
  • op (55-55)
tilelang/transform/__init__.py (1)
  • EliminateStorageSyncForMBarrier (368-371)
src/op/atomic_add.cc (6)
src/op/finalize_reducer.h (1)
  • RegisterReflection (33-38)
src/op/gemm.h (2)
  • RegisterReflection (35-41)
  • RegisterReflection (106-128)
src/op/parallel.h (1)
  • RegisterReflection (72-78)
src/op/reduce.h (3)
  • RegisterReflection (35-38)
  • RegisterReflection (92-100)
  • RegisterReflection (137-144)
src/op/copy.h (2)
  • RegisterReflection (106-114)
  • RegisterReflection (311-321)
src/op/fill.h (1)
  • RegisterReflection (29-35)
src/op/finalize_reducer.cc (1)
src/op/finalize_reducer.h (1)
  • RegisterReflection (33-38)
src/transform/make_packed_api.cc (2)
src/transform/lower_device_kernel_launch.cc (6)
  • gvar (74-98)
  • gvar (74-74)
  • gvar (206-219)
  • gvar (206-206)
  • gvar (221-266)
  • gvar (221-221)
tilelang/transform/__init__.py (1)
  • MakePackedAPI (263-271)
src/transform/config_index_bitwidth.cc (1)
src/transform/flatten_buffer.cc (16)
  • op (76-81)
  • op (76-76)
  • op (83-88)
  • op (83-83)
  • op (90-95)
  • op (90-90)
  • op (97-101)
  • op (97-97)
  • op (103-106)
  • op (103-103)
  • op (111-144)
  • op (111-111)
  • op (146-217)
  • op (146-146)
  • op (219-225)
  • op (219-219)
src/op/finalize_reducer.h (3)
src/op/operator.h (1)
  • TileOperatorNode (56-66)
src/op/finalize_reducer.cc (1)
  • FinalizeReducerOp (35-40)
tilelang/ir.py (1)
  • FinalizeReducerOp (53-54)
src/op/atomic_add.h (2)
src/op/operator.h (2)
  • TileOperatorNode (56-66)
  • TileOperator (68-72)
src/op/atomic_add.cc (1)
  • AtomicAdd (44-69)
src/op/parallel.h (3)
src/op/parallel.cc (1)
  • ParallelOpNode (176-178)
src/op/operator.h (2)
  • TileOperatorNode (56-66)
  • TileOperator (68-72)
src/transform/layout_inference.cc (16)
  • op (42-48)
  • op (42-42)
  • op (327-387)
  • op (327-327)
  • op (417-441)
  • op (417-417)
  • op (443-460)
  • op (443-443)
  • op (462-471)
  • op (462-462)
  • op (657-669)
  • op (657-657)
  • op (706-801)
  • op (706-706)
  • op (803-812)
  • op (803-803)
src/transform/storage_rewrite.cc (1)
src/transform/flatten_buffer.cc (10)
  • op (76-81)
  • op (76-76)
  • op (83-88)
  • op (83-83)
  • op (90-95)
  • op (90-90)
  • op (97-101)
  • op (97-97)
  • op (103-106)
  • op (103-103)
tilelang/engine/lower.py (1)
tilelang/engine/param.py (1)
  • CompiledArtifact (107-117)
src/transform/wgmma_sync_rewriter.cc (1)
tilelang/transform/__init__.py (1)
  • RewriteWgmmaSync (117-125)
src/transform/thread_storage_sync.cc (1)
tilelang/transform/__init__.py (1)
  • ThreadSync (128-141)
src/transform/if_stmt_binding.cc (2)
src/transform/warp_specialized_rewriter.cc (14)
  • op (38-43)
  • op (38-38)
  • op (73-83)
  • op (73-73)
  • op (85-93)
  • op (85-85)
  • op (95-100)
  • op (95-95)
  • op (102-110)
  • op (102-102)
  • op (134-146)
  • op (134-134)
  • op (148-177)
  • op (148-148)
tilelang/transform/__init__.py (1)
  • IfStmtBinding (160-168)
src/op/gemm_py.cc (3)
src/transform/pipeline_planning.cc (12)
  • op (48-57)
  • op (48-48)
  • op (87-143)
  • op (87-87)
  • op (166-184)
  • op (166-166)
  • op (186-204)
  • op (186-186)
  • op (206-260)
  • op (206-206)
  • f (287-297)
  • f (287-287)
src/layout/layout.cc (4)
  • RegisterReflection (76-81)
  • RegisterReflection (76-76)
  • RegisterReflection (474-479)
  • RegisterReflection (474-474)
src/op/gemm_py.h (1)
  • RegisterReflection (38-60)
src/op/gemm.h (7)
src/op/atomic_add.h (1)
  • tvm (12-73)
src/op/finalize_reducer.h (1)
  • tvm (20-56)
src/op/gemm_sp.h (1)
  • tvm (13-104)
src/op/region.h (1)
  • tvm (72-116)
src/support/ffi_aliases.h (1)
  • tvm (10-16)
src/op/operator.h (2)
  • TileOperatorNode (56-66)
  • TileOperator (68-72)
src/op/gemm.cc (1)
  • Gemm (114-152)
src/support/ffi_aliases.h (4)
src/op/atomic_add.h (1)
  • tvm (12-73)
src/op/parallel.h (1)
  • tvm (22-148)
src/op/region.h (1)
  • tvm (72-116)
src/layout/layout.h (1)
  • Array (39-71)
src/op/reduce.cc (1)
src/op/reduce.h (4)
  • ReduceOp (120-126)
  • RegisterReflection (35-38)
  • RegisterReflection (92-100)
  • RegisterReflection (137-144)
src/transform/legalize_safe_memory_access.cc (1)
tilelang/transform/__init__.py (1)
  • LegalizeSafeMemoryAccess (252-260)
src/transform/lower_device_storage_access_info.cc (1)
tilelang/transform/__init__.py (1)
  • LowerDeviceStorageAccessInfo (318-330)
src/layout/layout.cc (4)
src/layout/swizzle.cc (2)
  • RegisterReflection (103-106)
  • RegisterReflection (103-103)
src/op/finalize_reducer.h (1)
  • RegisterReflection (33-38)
src/op/gemm.h (2)
  • RegisterReflection (35-41)
  • RegisterReflection (106-128)
src/op/parallel.h (1)
  • RegisterReflection (72-78)
src/op/region.cc (4)
src/op/gemm.h (2)
  • RegisterReflection (35-41)
  • RegisterReflection (106-128)
src/op/parallel.h (1)
  • RegisterReflection (72-78)
src/op/reduce.h (3)
  • RegisterReflection (35-38)
  • RegisterReflection (92-100)
  • RegisterReflection (137-144)
src/op/copy.h (2)
  • RegisterReflection (106-114)
  • RegisterReflection (311-321)
src/transform/multi_version_buffer_rewriter.cc (2)
src/transform/warp_specialized_rewriter.cc (16)
  • op (38-43)
  • op (38-38)
  • op (73-83)
  • op (73-73)
  • op (85-93)
  • op (85-85)
  • op (95-100)
  • op (95-95)
  • op (102-110)
  • op (102-102)
  • op (134-146)
  • op (134-134)
  • op (148-177)
  • op (148-148)
  • op (179-189)
  • op (179-179)
src/transform/inject_pipeline.cc (4)
  • buffer (398-453)
  • buffer (398-399)
  • buffer (461-471)
  • buffer (461-461)
src/transform/simplify.cc (3)
src/op/atomic_add.h (1)
  • tl (13-72)
src/layout/layout.cc (4)
  • RegisterReflection (76-81)
  • RegisterReflection (76-76)
  • RegisterReflection (474-479)
  • RegisterReflection (474-474)
tilelang/transform/simplify.py (1)
  • Simplify (19-27)
src/transform/inject_fence_proxy.cc (1)
tilelang/transform/__init__.py (1)
  • InjectFenceProxy (230-238)
src/transform/merge_if_stmt.cc (1)
tilelang/transform/__init__.py (1)
  • MergeIfStmt (171-179)
src/layout/layout.h (4)
src/op/atomic_add.h (1)
  • tl (13-72)
src/op/parallel.h (1)
  • tl (23-146)
src/layout/swizzle.h (1)
  • tl (13-68)
src/layout/layout.cc (6)
  • Layout (57-69)
  • Layout (71-74)
  • LayoutNode (48-55)
  • Fragment (318-340)
  • Fragment (342-352)
  • FragmentNode (302-316)
src/transform/align_dynamic_shared_memory_allocations.cc (1)
tilelang/transform/__init__.py (1)
  • AlignDynamicSharedMemoryAllocations (398-409)
src/transform/layout_inference.cc (3)
src/op/operator.cc (4)
  • ParseOperator (32-41)
  • ParseOperator (32-32)
  • ParseOperator (55-61)
  • ParseOperator (55-55)
src/op/parallel.h (1)
  • ParallelOp (137-145)
tilelang/transform/__init__.py (1)
  • LayoutInference (39-47)
src/op/gemm_py.h (2)
src/op/operator.h (2)
  • TileOperatorNode (56-66)
  • TileOperator (68-72)
src/op/gemm_py.cc (1)
  • GemmPy (50-80)
src/transform/frontend_legalize.cc (1)
tilelang/transform/simplify.py (1)
  • LetInline (8-16)
src/ir.cc (3)
src/op/gemm_sp.h (1)
  • tl (15-103)
src/op/parallel.h (2)
  • tl (23-146)
  • RegisterReflection (72-78)
src/transform/layout_reducer.h (1)
  • tl (58-87)
src/transform/lower_intrin.cc (1)
tilelang/transform/__init__.py (1)
  • LowerIntrin (441-444)
src/transform/lower_shared_tmem.cc (2)
src/transform/lower_tile_op.cc (10)
  • op (96-102)
  • op (96-96)
  • op (117-151)
  • op (117-117)
  • op (181-186)
  • op (181-181)
  • op (193-207)
  • op (193-193)
  • op (282-316)
  • op (282-282)
tilelang/transform/__init__.py (1)
  • LowerSharedTmem (460-463)
src/op/copy.cc (4)
src/layout/layout.cc (4)
  • RegisterReflection (76-81)
  • RegisterReflection (76-76)
  • RegisterReflection (474-479)
  • RegisterReflection (474-474)
src/op/gemm.h (2)
  • RegisterReflection (35-41)
  • RegisterReflection (106-128)
src/op/parallel.h (1)
  • RegisterReflection (72-78)
src/op/copy.h (2)
  • RegisterReflection (106-114)
  • RegisterReflection (311-321)
src/target/codegen_cpp.h (1)
src/target/codegen_cpp.cc (2)
  • GenerateForwardFunctionDeclarations (65-87)
  • GenerateForwardFunctionDeclarations (65-68)
src/op/region.h (2)
src/op/operator.h (2)
  • TileOperatorNode (56-66)
  • TileOperator (68-72)
src/op/region.cc (1)
  • RegionOp (34-52)
src/transform/loop_vectorize_dynamic.cc (2)
src/transform/make_packed_api.cc (2)
  • op (168-189)
  • op (168-168)
tilelang/transform/__init__.py (1)
  • LoopVectorizeDynamic (333-342)
src/op/parallel.cc (7)
src/layout/layout.cc (4)
  • RegisterReflection (76-81)
  • RegisterReflection (76-76)
  • RegisterReflection (474-479)
  • RegisterReflection (474-474)
src/layout/swizzle.cc (2)
  • RegisterReflection (103-106)
  • RegisterReflection (103-103)
src/op/finalize_reducer.h (1)
  • RegisterReflection (33-38)
src/op/gemm.h (2)
  • RegisterReflection (35-41)
  • RegisterReflection (106-128)
src/op/parallel.h (1)
  • RegisterReflection (72-78)
src/op/copy.h (2)
  • RegisterReflection (106-114)
  • RegisterReflection (311-321)
src/op/fill.h (1)
  • RegisterReflection (29-35)
src/transform/annotate_device_regions.cc (3)
src/transform/make_packed_api.cc (2)
  • op (168-189)
  • op (168-168)
src/transform/layout_inference.cc (14)
  • op (42-48)
  • op (42-42)
  • op (327-387)
  • op (327-327)
  • op (417-441)
  • op (417-417)
  • op (443-460)
  • op (443-443)
  • op (462-471)
  • op (462-462)
  • op (657-669)
  • op (657-657)
  • op (706-801)
  • op (706-706)
tilelang/transform/__init__.py (1)
  • AnnotateDeviceRegions (274-282)
src/transform/inject_pipeline.cc (2)
src/transform/multi_version_buffer_rewriter.cc (2)
  • buffer (255-265)
  • buffer (255-255)
src/transform/pipeline_planning.cc (14)
  • op (48-57)
  • op (48-48)
  • op (87-143)
  • op (87-87)
  • op (166-184)
  • op (166-166)
  • op (186-204)
  • op (186-186)
  • op (206-260)
  • op (206-206)
  • op (262-272)
  • op (262-262)
  • op (703-712)
  • op (703-703)
src/transform/lower_tile_op.cc (2)
src/transform/legalize_safe_memory_access.cc (10)
  • op (33-46)
  • op (33-33)
  • op (64-74)
  • op (64-64)
  • op (76-84)
  • op (76-76)
  • op (158-180)
  • op (158-158)
  • op (182-214)
  • op (182-182)
src/op/operator.cc (4)
  • ParseOperator (32-41)
  • ParseOperator (32-32)
  • ParseOperator (55-61)
  • ParseOperator (55-55)
src/transform/warp_specialized_rewriter.cc (2)
src/transform/annotate_warp_group_reg_alloc.cc (10)
  • op (30-47)
  • op (30-30)
  • op (49-54)
  • op (49-49)
  • op (71-78)
  • op (71-71)
  • op (96-105)
  • op (96-96)
  • op (107-166)
  • op (107-107)
tilelang/transform/__init__.py (1)
  • WarpSpecialized (193-201)
src/transform/cluster_planning.cc (1)
tilelang/transform/__init__.py (1)
  • ClusterPlanning (17-25)
src/target/codegen_cuda.h (1)
src/target/codegen_cuda.cc (12)
  • PrintFunctionSignature (2675-2734)
  • PrintFunctionSignature (2675-2677)
  • t (26-64)
  • t (26-26)
  • t (68-75)
  • t (68-68)
  • t (79-95)
  • t (79-79)
  • t (99-107)
  • t (99-100)
  • PrintCallExtern (974-1030)
  • PrintCallExtern (974-977)
src/target/codegen_cuda.cc (1)
src/transform/warp_specialized_rewriter.cc (16)
  • op (38-43)
  • op (38-38)
  • op (73-83)
  • op (73-73)
  • op (85-93)
  • op (85-85)
  • op (95-100)
  • op (95-95)
  • op (102-110)
  • op (102-102)
  • op (134-146)
  • op (134-134)
  • op (148-177)
  • op (148-148)
  • op (179-189)
  • op (179-179)
src/transform/merge_shared_memory_allocations.cc (2)
src/transform/layout_inference.cc (6)
  • op (42-48)
  • op (42-42)
  • op (327-387)
  • op (327-327)
  • op (417-441)
  • op (417-417)
tilelang/transform/__init__.py (1)
  • MergeSharedMemoryAllocations (374-383)
src/op/reduce.h (8)
src/op/atomic_add.h (1)
  • tvm (12-73)
src/op/finalize_reducer.h (2)
  • tvm (20-56)
  • RegisterReflection (33-38)
src/op/gemm_sp.h (1)
  • tvm (13-104)
src/op/region.h (1)
  • tvm (72-116)
src/support/ffi_aliases.h (1)
  • tvm (10-16)
src/op/fill.h (2)
  • tvm (12-53)
  • RegisterReflection (29-35)
src/op/operator.h (2)
  • TileOperatorNode (56-66)
  • TileOperator (68-72)
src/op/reduce.cc (2)
  • ReduceOp (24-33)
  • CumSumOp (509-523)
src/op/operator.h (1)
src/transform/layout_reducer.h (2)
  • Object (63-70)
  • ObjectRef (72-76)
src/op/gemm_sp.h (10)
src/op/gemm.h (1)
  • GemmWarpPolicyNode (27-57)
src/transform/layout_reducer.h (1)
  • ObjectRef (72-76)
src/op/atomic_add.h (1)
  • tvm (12-73)
src/op/finalize_reducer.h (1)
  • tvm (20-56)
src/op/parallel.h (1)
  • tvm (22-148)
src/op/region.h (1)
  • tvm (72-116)
src/support/ffi_aliases.h (1)
  • tvm (10-16)
src/op/fill.h (1)
  • tvm (12-53)
src/op/operator.h (2)
  • TileOperatorNode (56-66)
  • TileOperator (68-72)
src/op/gemm_sp.cc (1)
  • GemmSP (86-109)
src/transform/lower_l2_persistent_annotation.cc (1)
tilelang/transform/__init__.py (1)
  • LowerL2Persistent (386-389)
src/op/copy.h (2)
src/op/operator.h (1)
  • TileOperatorNode (56-66)
src/op/copy.cc (1)
  • Conv2DIm2ColOp (1725-1738)
🪛 Ruff (0.14.1)
tilelang/contrib/nvcc.py

186-186: Unused function argument: target

(ARG001)

Comment on lines +14 to 16
#include "../support/ffi_aliases.h"
#include "../target/utils.h"
#include "tvm/ffi/string.h"


⚠️ Potential issue | 🟠 Major

Ensure ffi::Function is declared

This file uses ffi::Function::GetGlobal but doesn’t include <tvm/ffi/function.h>. Either include it here or rely on the aliases header after adding the missing include there (preferred).

+#include <tvm/ffi/function.h>
 #include "../support/ffi_aliases.h"
🤖 Prompt for AI Agents
In src/op/gemm_py.cc around lines 14 to 16, the file calls
ffi::Function::GetGlobal but does not include the declaration for ffi::Function;
add #include <tvm/ffi/function.h> so the symbol is declared. Preferably, add
that include to src/support/ffi_aliases.h (which this file already includes) so
all files using ffi::Function get the declaration transitively; otherwise add
the include directly in src/op/gemm_py.cc.

}

runtime::Module BuildTileLangCUDA(IRModule mod, Target target) {
ffi::Module BuildTileLangCUDA(IRModule mod, Target target) {

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Return type likely mismatched with factory return (ffi::Module vs runtime::Module).

runtime::CUDAModuleCreate(...) returns runtime::Module, but these functions return ffi::Module. Unless tvm::ffi::Module is an alias of tvm::runtime::Module, this won’t compile.

Apply one of:

  • Preferred: keep FFI-neutral and return runtime::Module.
  • If you do intend to use an alias, ensure tvm::ffi::Module is defined and visible here (otherwise switch back to runtime::Module).
-ffi::Module BuildTileLangCUDA(IRModule mod, Target target) {
+runtime::Module BuildTileLangCUDA(IRModule mod, Target target) {
...
-ffi::Module BuildTileLangCUDAWithoutCompile(IRModule mod, Target target) {
+runtime::Module BuildTileLangCUDAWithoutCompile(IRModule mod, Target target) {

Run the script below to confirm whether tvm::ffi::Module exists and is an alias to tvm::runtime::Module:

Also applies to: 74-74


🏁 Script executed:

#!/usr/bin/env bash
set -euo pipefail
# Find any definition/alias of ffi::Module
rg -nP -C2 '\bnamespace\s+ffi\b|using\s+Module\s*=' --glob '!**/third_party/**'
# Check all uses of CUDAModuleCreate and its expected return type
rg -nP -C2 'CUDAModuleCreate\s*\(' --glob '!**/third_party/**'

Length of output: 3253


I need to inspect the actual file and search more thoroughly for any ffi::Module definition or typedef.



🏁 Script executed:

#!/bin/bash
# View the actual function signatures at lines 41 and 74
head -100 src/target/rt_mod_cuda.cc | tail -60

# Search for any Module typedef/using in ffi namespace more broadly
rg -nP 'class\s+Module|struct\s+Module|using\s+Module|typedef.*Module' --glob '!**/third_party/**' -A2 -B2 | head -50

# Check includes in rt_mod_cuda.cc for ffi-related headers
head -50 src/target/rt_mod_cuda.cc

Length of output: 3742




🏁 Script executed:

#!/bin/bash
# Search for ffi::Module in header files (likely in include/tvm/ffi or similar)
rg -nP 'class\s+Module|struct\s+Module' src/target/rt_mod_cuda.cc --context=30

# Find ffi header files
fd -e 'h$|hpp$' 'ffi' --glob '!**/third_party/**' | head -20

# Search for Module in ffi-related includes
rg -nP '\bModule\b' include/tvm/ffi --type=h -A1 -B1 | grep -i module | head -40

# Check if ffi namespace re-exports runtime::Module
rg -nP 'using.*Module|Module.*=' include/tvm/ffi --type=h

Length of output: 217




🏁 Script executed:

#!/bin/bash
# Find all header files with ffi in them
find . -type f \( -name '*ffi*.h' -o -name '*ffi*.hpp' \) -not -path '*/third_party/*' 2>/dev/null | head -30

# Search for ffi namespace definition and Module references
rg -nP 'namespace\s+ffi\s*\{|class\s+Module\s*\{|struct\s+Module\s*\{' --type=h --type=cc -A3 | head -60

# Check the actual include path in rt_mod_cuda.cc and trace it
rg -nP '#include.*ffi' src/target/rt_mod_cuda.cc

# Search for runtime::Module definition to understand the type
rg -nP 'class\s+Module|struct\s+Module' --type=h --type=cc -B2 -A5 2>/dev/null | grep -A7 'runtime' | head -40

# Check if ffi::Module is used elsewhere in the codebase
rg -nP 'ffi::Module' --type=h --type=cc | head -20

Length of output: 124




🏁 Script executed:

#!/bin/bash
# View the ffi_aliases.h file which likely contains Module alias
cat src/support/ffi_aliases.h

# Search for Module in runtime namespace
grep -rn "class Module\|struct Module" src/ include/ 2>/dev/null | grep -i runtime

# Check if ffi::Module is used elsewhere
grep -rn "ffi::Module" src/ include/ 2>/dev/null | head -30

# Find runtime module creation function
grep -rn "CUDAModuleCreate" src/ 2>/dev/null

Length of output: 816




🏁 Script executed:

#!/bin/bash
# Search for ALL Module definitions/usages
grep -rn "^.*Module" src/support/ src/target/*.h src/target/*.cc 2>/dev/null | grep -E "(class|struct|using|typedef).*Module" | head -40

# Check rt_mod_cpp.cc for context
cat src/target/rt_mod_cpp.cc | head -50

# Search in include directories for Module
find . -path './third_party' -prune -o -type f -name '*.h' -print 2>/dev/null | xargs grep -l "class Module\|struct Module" 2>/dev/null | head -20

# Look for ffi::Module specifically
grep -rn "ffi::Module" . --include="*.h" --include="*.cpp" 2>/dev/null | grep -v third_party | head -30

# Check if Module is in runtime namespace and its definition
grep -rn "namespace runtime" src/ include/ 2>/dev/null | head -5
grep -A10 "class Module" src/runtime/ include/tvm/runtime/ 2>/dev/null | head -50

Length of output: 1859




🏁 Script executed:

#!/bin/bash
# Get ALL includes from rt_mod_cuda.cc
head -20 src/target/rt_mod_cuda.cc

# Check if codegen_cuda.h includes tvm/ffi/extra/module.h
cat src/target/codegen_cuda.h 2>/dev/null | head -30

# Check if runtime/cuda/cuda_module.h includes it
find . -name "cuda_module.h" -type f | head -5

# Search for where ffi::Module is actually defined/declared
grep -rn "ffi::Module\|namespace.*ffi" . --include="*.h" 2>/dev/null | grep -v third_party | grep -E "(class|struct|using|typedef|namespace)" | head -50

Length of output: 2406




🏁 Script executed:

#!/bin/bash
# Check what codegen_cuda.h includes - it was partially shown, get full file
cat src/target/codegen_cuda.h

# Check runtime/cuda/cuda_module.h if it exists
find . -name "cuda_module.h" -type f ! -path "*/third_party/*" -exec cat {} \; 2>/dev/null | head -50

# Search for where ffi::Module is actually defined
grep -rn "class Module\|struct Module" . --include="*.h" 2>/dev/null | grep -v third_party | head -30

# Check if any header brings in ffi/extra/module.h transitively
grep -rn "ffi/extra/module.h\|ffi::Module" . --include="*.h" 2>/dev/null | grep -v third_party

Length of output: 5913


Add missing #include <tvm/ffi/extra/module.h> and align return type with actual value returned.

The functions on lines 41 and 74 declare ffi::Module as the return type but are missing the header that defines this type. Additionally, runtime::CUDAModuleCreate() returns runtime::Module, not ffi::Module. The parallel file src/target/rt_mod_cpp.cc correctly includes <tvm/ffi/extra/module.h> and declares the same return type, but rt_mod_cuda.cc lacks this include and has the type mismatch.

Apply the suggested fix:

+#include <tvm/ffi/extra/module.h>
 #include "codegen_cuda.h"
 #include "runtime/cuda/cuda_module.h"
 #include "runtime/pack_args.h"
 #include <tvm/ffi/reflection/registry.h>

And change the return types:

-ffi::Module BuildTileLangCUDA(IRModule mod, Target target) {
+runtime::Module BuildTileLangCUDA(IRModule mod, Target target) {
-ffi::Module BuildTileLangCUDAWithoutCompile(IRModule mod, Target target) {
+runtime::Module BuildTileLangCUDAWithoutCompile(IRModule mod, Target target) {

This matches the actual return value from runtime::CUDAModuleCreate().

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/target/rt_mod_cuda.cc around lines 41 to 41 (and also the second function
around line 74), add the missing include #include <tvm/ffi/extra/module.h> at
the top of the file and change the return type of the functions that currently
declare ffi::Module to runtime::Module so the signature matches the actual value
returned by runtime::CUDAModuleCreate(); keep function bodies unchanged.

if (const auto *var_node = expr.as<VarNode>()) {
Var var = GetRef<Var>(var_node);
Var var = tvm::ffi::GetRef<Var>(var_node);
auto it = let_bindings_.find(var);

⚠️ Potential issue | 🔴 Critical

FFI GetRef migration — LGTM; fix IterVar bound lookup.

  • Var/Stmt GetRef updates and static-init macro usage are correct.
  • Issue: In thread bounds computation, Analyzer is called with IterVar instead of its Var.

Apply this diff:

-    if (analyzer_->const_int_bound.IsBound(thread_var_->var)) {
-      auto const_int_bound = analyzer_->const_int_bound(thread_var_);
+    if (analyzer_->const_int_bound.IsBound(thread_var_->var)) {
+      auto const_int_bound = analyzer_->const_int_bound(thread_var_->var);

Also applies to: 615-615, 655-657, 734-737

🤖 Prompt for AI Agents
In src/transform/lower_tile_op.cc around lines 439 (and also at 615, 655-657,
734-737), the code passes an IterVar object into the Analyzer call when
computing thread bounds; instead, extract and pass the underlying Var from the
IterVar (e.g., use iter_var->var or iter_var->var.get() per existing code
patterns) so Analyzer receives the Var key it expects. Update each location to
obtain the IterVar's Var before calling the Analyzer-bound lookup and use that
Var in place of the IterVar argument.

Comment on lines +243 to +245
host_mod, device_mod, params, codegen_mod.inspect_source(), rt_mod=host_mod)

return CompiledArtifact(host_mod, device_mod, params, codegen_mod.get_source())
return CompiledArtifact(host_mod, device_mod, params, codegen_mod.inspect_source())

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

get_source → inspect_source: confirm presence on all backends.

inspect_source() is the right API on newer TVM, but ensure all paths (CUDA/HIP/LLVM/C/Metal/WebGPU) return a Module that implements inspect_source(), otherwise Artifact.kernel_source may be None.


🏁 Script executed:

#!/bin/bash
# Grep for inspect_source uses and ensure we don’t still call get_source elsewhere.
rg -n 'get_source\(|inspect_source\(' -g 'tilelang/**'

Length of output: 323


Let me gather more context to verify the scope and completeness of the migration by examining the codebase directly.

I need to inspect the actual codebase to understand what inspect_source() is and whether it's a valid method.


🏁 Script executed:

#!/bin/bash
# Check if inspect_source is a custom method in tilelang
rg -n 'def inspect_source|\.inspect_source' tilelang/ --type py -B 2 -A 2

# Also check for codegen_mod type and imports in lower.py
echo "=== lower.py context around 243-245 ==="
sed -n '230,250p' tilelang/engine/lower.py

# Check base.py context
echo "=== base.py context around line 52 ==="
sed -n '40,65p' tilelang/jit/adapter/base.py

Length of output: 1581


Let me search for the device_codegen function and inspect_source imports to verify if this is a valid method.


🏁 Script executed:

#!/bin/bash
# Find device_codegen definition
echo "=== Searching for device_codegen definition ==="
rg -n 'def device_codegen\b|^device_codegen\s*=' tilelang/ -A 3

# Check for inspect_source imports
echo "=== Searching for inspect_source in imports ==="
rg -n 'inspect_source' tilelang/ --type py -B 1 -A 1

# Check what codegen_mod type should be - look at return type annotations
echo "=== Checking device_codegen and device_codegen_without_compile signatures ==="
rg -n 'device_codegen|device_codegen_without_compile' tilelang/engine/lower.py -B 2 -A 2

# Check imports at top of lower.py
echo "=== Imports in lower.py ==="
head -40 tilelang/engine/lower.py

Length of output: 2878


🏁 Script executed:

#!/bin/bash
# Find device_codegen definition
echo "=== Searching for device_codegen definition ==="
rg -n 'def device_codegen\b' tilelang/ --type py -A 5

# Check for inspect_source imports or usage
echo "=== Searching for inspect_source anywhere in codebase ==="
rg -n 'inspect_source' tilelang/ --type py

# Check what codegen_mod type annotation is
echo "=== Checking lower.py imports and device_codegen calls ==="
head -50 tilelang/engine/lower.py

# Find where device_codegen comes from
echo "=== Search for device_codegen in entire tilelang ==="
rg -n 'device_codegen' tilelang/ --type py -g '!lower.py' -B 1

Length of output: 2396


Let me verify if tvm.IRModule has an inspect_source() method by checking TVM module API.

Let me verify one more time if inspect_source() is defined anywhere in the tilelang custom code or if this is purely broken code.


🏁 Script executed:

#!/bin/bash
# Comprehensive search for inspect_source definition
echo "=== Searching for ANY definition of inspect_source ==="
find tilelang -type f -name "*.py" -exec grep -l "inspect_source" {} \;

# Look for monkey-patching or property decoration
echo "=== Searching for property or method definitions that might alias inspect_source ==="
rg -n "@property|def " tilelang/ --type py -B 1 -A 1 | grep -C 2 "source"

# Check if it's being imported from external library
echo "=== Checking tvm module or tilelang.tvm for inspect_source ==="
python3 -c "from tilelang import tvm; print(dir(tvm))" 2>&1 | grep -i inspect

Length of output: 6901


Replace inspect_source() with get_source() on lines 243 and 245.

The method inspect_source() does not exist on tvm.IRModule. Calling it will fail at runtime with AttributeError. The standard TVM API for retrieving source code is get_source(), as evidenced by its use elsewhere in the codebase (e.g., tilelang/jit/adapter/base.py:52).

# Lines 243, 245 should use:
codegen_mod.get_source()
# instead of:
codegen_mod.inspect_source()
🤖 Prompt for AI Agents
In tilelang/engine/lower.py around lines 243 to 245, the code calls
codegen_mod.inspect_source() which does not exist on tvm.IRModule and will raise
AttributeError; replace both calls to use codegen_mod.get_source() instead so
the module source is retrieved via the correct TVM API.
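If there is lingering doubt about which API a given codegen module exposes across backends, a small defensive helper can bridge the transition. This is a sketch only; the two method names are exactly the ones discussed above, and the preference order follows the conclusion that get_source() is the correct call here.

def module_source(mod) -> str:
    """Return the kernel source from a codegen module, tolerating both APIs."""
    # Prefer get_source(), the API this codebase relies on elsewhere; fall back
    # to inspect_source() for module types that only expose the newer name.
    if hasattr(mod, "get_source"):
        return mod.get_source()
    return mod.inspect_source()

# usage sketch:
# return CompiledArtifact(host_mod, device_mod, params, module_source(codegen_mod))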

@Hzfengsy
Collaborator

Thanks @LeiWang1999

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/target/codegen_cuda.cc (1)

23-23: Limit namespace pollution from ffi.

using namespace ffi; in a large TU can shadow or collide with nearby symbols. Prefer explicit tvm::ffi::... or local using declarations for specific types/functions.

Apply:

-using namespace ffi;
+// Prefer explicit tvm::ffi:: qualifiers at use sites to avoid namespace pollution.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 21e7a0a and cd6daaf.

📒 Files selected for processing (11)
  • 3rdparty/tvm (1 hunks)
  • src/op/copy.cc (7 hunks)
  • src/target/codegen_cuda.cc (4 hunks)
  • src/transform/loop_partition.cc (1 hunks)
  • testing/python/jit/test_tilelang_jit_gemm_ctypes.py (2 hunks)
  • tilelang/contrib/hipcc.py (2 hunks)
  • tilelang/contrib/nvcc.py (6 hunks)
  • tilelang/engine/callback.py (3 hunks)
  • tilelang/engine/lower.py (5 hunks)
  • tilelang/layout/fragment.py (1 hunks)
  • tilelang/layout/layout.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • src/transform/loop_partition.cc
  • tilelang/engine/callback.py
  • tilelang/contrib/hipcc.py
  • 3rdparty/tvm
  • tilelang/layout/fragment.py
  • src/op/copy.cc
  • tilelang/engine/lower.py
🧰 Additional context used
🧬 Code graph analysis (3)
testing/python/jit/test_tilelang_jit_gemm_ctypes.py (2)
testing/python/jit/test_tilelang_jit_gemm_cython.py (1)
  • test_gemm_f16f16f16_nn (100-114)
testing/python/jit/test_tilelang_jit_callback.py (1)
  • test_gemm_f16f16f16_nn (101-115)
src/target/codegen_cuda.cc (1)
src/transform/warp_specialized_rewriter.cc (16)
  • op (38-43)
  • op (38-38)
  • op (73-83)
  • op (73-73)
  • op (85-93)
  • op (85-85)
  • op (95-100)
  • op (95-95)
  • op (102-110)
  • op (102-102)
  • op (134-146)
  • op (134-134)
  • op (148-177)
  • op (148-148)
  • op (179-189)
  • op (179-179)
tilelang/contrib/nvcc.py (2)
tilelang/engine/lower.py (1)
  • tilelang_callback_cuda_compile (57-90)
tilelang/contrib/nvrtc.py (1)
  • compile_cuda (14-111)
🪛 Ruff (0.14.1)
tilelang/contrib/nvcc.py

186-186: Unused function argument: target

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Quick Lint
  • GitHub Check: Build wheels for Python 3.8 on macos-latest with Metal
  • GitHub Check: Build wheels for Python 3.8 on ubuntu-24.04-arm with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.8 on ubuntu-latest with CUDA-12.1
🔇 Additional comments (9)
testing/python/jit/test_tilelang_jit_gemm_ctypes.py (2)

88-88: LGTM! FFI migration applied correctly.

The migration from @tvm.register_func to @tvm.register_global_func with explicit function naming aligns with the TVM v0.22.0 FFI update. The override=True parameter appropriately allows the test to replace any existing registration.
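For readers unfamiliar with the new decorator, a minimal sketch of the pattern is below; the global name and callback body are illustrative rather than copied from the test.

from tvm_ffi import register_global_func

@register_global_func("tilelang_callback_cuda_postproc", override=True)
def _cuda_postproc(code, _target):
    # override=True replaces any prior registration under the same global name,
    # which is what lets the test swap in its own callback.
    return code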


410-411: Ignore this review comment—this change follows an established codebase pattern.

The change reflects a deliberate design choice, not a temporary debug modification. Five test files across different directories (kernel, transform, jit) use the identical pattern of commenting out tilelang.testing.main() and calling specific test functions directly:

  • testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
  • testing/python/transform/test_tilelang_transform_inject_set_max_nreg.py
  • testing/python/transform/test_tilelang_transform_layout_inference.py
  • testing/python/kernel/test_tilelang_kernel_bf16_gemm_mma.py
  • testing/python/jit/test_tilelang_jit_gemm_ctypes.py (this file)

This consistent pattern across multiple files indicates intentional design, not ad-hoc debugging. The review's assumption that this is temporary and should be reverted contradicts established codebase practice.

Likely an incorrect or invalid review comment.
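For reference, the shared pattern looks roughly like this (the test name is taken from the code-graph listing above and stands in for whichever test each file calls directly):

if __name__ == "__main__":
    # tilelang.testing.main()
    test_gemm_f16f16f16_nn()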

tilelang/layout/layout.py (1)

5-5: LGTM: FFI migration verified and complete.

The migration from tvm.ffi to tvm_ffi has been comprehensively applied across the entire codebase. The @tvm_ffi.register_object() decorator is confirmed functional and actively used for 15+ class registrations (Fill, AtomicAdd, Copy, Gemm, Layout, Fragment, GemmPy, and others) as well as global function registrations. No orphaned old-style patterns remain, and all imports are consistent.
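A hedged sketch of one such registration under the new import; the type key "tl.Layout" and the base class are assumptions chosen to match the class names listed above, not copied from the file.

import tvm_ffi

@tvm_ffi.register_object("tl.Layout")
class Layout(tvm_ffi.Object):
    """Python mirror of the C++ LayoutNode registered through the FFI."""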

src/target/codegen_cuda.cc (4)

2025-2026: Correct migration to tvm::ffi::GetRef for tl_gemm.

Switching to tvm::ffi::GetRef<PrimExpr>(op) is consistent with the new FFI. Looks good.

Confirm the TU includes the header that declares tvm::ffi::GetRef (e.g., <tvm/ffi/object.h> or the umbrella that provides it). If not, add it to avoid ADL/dep-inclusion pitfalls.


2034-2035: Same as above for tl_gemm_sp.

The FFI-based GetRef usage is consistent and correct.

Same header check as the previous comment to ensure GetRef is declared in this TU.


2308-2310: FFI GetRef in Ramp CHECK message is fine.

Diagnostic now prints an FFI-backed Ramp ref; no functional change. OK.


2811-2811: Optional truthiness check is acceptable.

ICHECK(global_symbol) is equivalent to defined() for Optional<String> on modern TVM. Safe change.

If you target older compilers/TVM variants, keep defined() for clarity; otherwise this is fine.

tilelang/contrib/nvcc.py (2)

11-12: LGTM! Import changes align with FFI migration.

The new imports correctly support the migration to tvm_ffi.register_global_func and maintain access to TVM utilities through the local tvm alias.


185-185: LGTM! Consistent FFI decorator migration.

All function registrations have been properly migrated from the legacy @tvm.ffi.register_func to @tvm_ffi.register_global_func with override=True, maintaining backward compatibility through unchanged registered names.

Also applies to: 192-192, 257-257, 403-403, 416-416, 433-433

Comment on lines 186 to 189
def tilelang_callback_cuda_compile(code, target): # pylint: disable=unused-argument
"""use nvcc to generate fatbin code for better optimization"""
ptx = compile_cuda(code, target_format="fatbin")
return ptx

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify whether the target parameter should be used.

The target parameter is accepted but unused in the callback. The function relies on Target.current() within compile_cuda to determine architecture. However, the caller in tilelang/engine/lower.py explicitly extracts architecture from the target parameter (nvcc.get_target_compute_version(target)), suggesting explicit usage might be preferred over ambient context.

Consider either:

  1. Using the target parameter explicitly if it provides more reliable architecture detection, or
  2. Documenting why relying on Target.current() is the intended design

Based on learnings (static analysis hints).


🏁 Script executed:

#!/bin/bash
# Search for other usages of tilelang_callback_cuda_compile to understand expected behavior
rg -n "tilelang_callback_cuda_compile" -A5 -B5

Length of output: 2342


Based on my verification, the unused target parameter in the tilelang_callback_cuda_compile function in tilelang/contrib/nvcc.py is indeed a concern.

The underlying compile_cuda function accepts an arch parameter and defaults to Target.current() when none is provided, so the current implementation relies on ambient context. Compiling with an explicit architecture flag yields faster runtime because code generation happens at compile time, whereas omitting it forces JIT compilation by the CUDA driver; given the docstring's promise of "better optimization", the target should be used explicitly.

Use the target parameter to pass architecture to compile_cuda.

The target parameter is passed by the C++ caller but unused. To achieve the promised "better optimization," extract the compute version from the target and pass it as the arch parameter to compile_cuda, or use the target to establish the proper compilation context. This ensures architecture-specific optimization occurs at compile time rather than relying on ambient Target.current() context.
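A sketch of the explicit-arch variant, assuming compile_cuda accepts an arch string of the form "sm_XY" and that get_target_compute_version returns "major.minor"; both assumptions should be checked against the real signatures before adopting it.

from tilelang.contrib.nvcc import compile_cuda, get_target_compute_version

def tilelang_callback_cuda_compile(code, target):
    """Use nvcc to generate fatbin code, pinning the arch from the given target."""
    major, minor = get_target_compute_version(target).split(".")
    return compile_cuda(code, target_format="fatbin", arch=f"sm_{major}{minor}")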

🧰 Tools
🪛 Ruff (0.14.1)

186-186: Unused function argument: target

(ARG001)

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
requirements-test.txt (1)

15-15: Remove redundant apache-tvm-ffi from test requirements.

Since apache-tvm-ffi is now a runtime requirement (in requirements.txt), it is automatically available during test execution. Listing it in test requirements is redundant and increases maintenance burden. Test environments typically install both runtime and test dependencies.

Apply this diff to remove the duplicate:

-apache-tvm-ffi
 cffi
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 995315e and 7ef5d01.

📒 Files selected for processing (2)
  • requirements-test.txt (2 hunks)
  • requirements.txt (1 hunks)
🔇 Additional comments (1)
requirements-test.txt (1)

15-15: Review comment is based on incorrect assumptions about apache-tvm-ffi versioning and TVM dependency model.

The codebase manages TVM as a C++ build dependency via CMake submodule (3rdparty/tvm), not as a pip package. The PyPI package apache-tvm-ffi is only a Python FFI wrapper and contains no explicit TVM version constraint in its dependencies. Additionally, only version 0.1.0 is available on PyPI, so there is no "known-compatible version" to pin to as the review suggests.

While the inconsistency between unpinned apache-tvm-ffi and pinned flash-linear-attention==0.3.2 is technically observable, it does not represent a practical risk because apache-tvm-ffi's version does not control which TVM library is used—the C++ build system does. Pinning a single version on PyPI that has no version alternatives would provide no additional stability benefit.

Likely an incorrect or invalid review comment.

@LeiWang1999
Member Author

CI tests passed, but the bdist step failed. Would you mind taking a look, @oraluben or @XuehaiPan?

@LeiWang1999
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@LeiWang1999
Member Author

LeiWang1999 commented Oct 24, 2025

I think it's most likely due to py38 compatibility of tvm 0.22.0. I'll take a look then.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
.github/workflows/ci.yml (1)

45-46: Update step ID to match step name for clarity.

The step ID is setup-py38 but the step name is "Setup Python 3.9". While this doesn't break functionality (outputs are keyed by ID), updating the step ID to setup-py39 would improve consistency and avoid confusion for future maintainers.

-      - name: Setup Python 3.9
-        id: setup-py38
+      - name: Setup Python 3.9
+        id: setup-py39

Then update line 54 to reference the new ID:

-          "${{ steps.setup-py38.outputs.python-path }}" -m compileall -q -f tilelang
+          "${{ steps.setup-py39.outputs.python-path }}" -m compileall -q -f tilelang
pyproject.toml (1)

86-86: Update ruff target-version to match project Python baseline.

The target-version is set to py38, but the project now requires Python ≥3.9. Updating this to py39 ensures ruff applies rules and fixes aligned with the current minimum supported version. This improves consistency between tool configuration and project requirements.

-target-version = "py38"
+target-version = "py39"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ecf41a and dc12ebc.

📒 Files selected for processing (6)
  • .github/workflows/ci.yml (1 hunks)
  • .github/workflows/dist.yml (1 hunks)
  • docs/get_started/Installation.md (2 hunks)
  • examples/gemm/README.md (1 hunks)
  • format.sh (1 hunks)
  • pyproject.toml (2 hunks)
✅ Files skipped from review due to trivial changes (2)
  • examples/gemm/README.md
  • docs/get_started/Installation.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.1
🔇 Additional comments (3)
format.sh (1)

86-87: PIP_USER environment variable improves virtualenv isolation.

The addition of export PIP_USER=0 prevents pip from defaulting to --user installs, ensuring packages are installed in the virtual environment rather than the user site-packages. This is a sensible safeguard for CI and scripted contexts.

.github/workflows/ci.yml (1)

49-49: Python version bump is consistent with project requirements.

Updating to Python 3.9 aligns with the requires-python = ">=3.9" in pyproject.toml and reflects the TVM 0.22.0 compatibility baseline.

.github/workflows/dist.yml (1)

44-45: Python 3.9 wheel build aligns with TVM 0.22.0 Limited API.

The build matrix now uses Python 3.9, and the comment correctly notes that TVM is built with the Python 3.9 Limited API (enabling wheel compatibility with all Python versions ≥3.9). This supports the project's shift to Python 3.9+ and the TVM rebase.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
.github/workflows/ci.yml (1)

45-54: Update step ID to reflect Python 3.9.

The step ID is still setup-py38 but now sets up Python 3.9. For clarity and maintainability, rename the step ID and update the reference on line 54.

      - name: Setup Python 3.9
-       id: setup-py38
+       id: setup-py39
        uses: actions/setup-python@v6
        with:
          python-version: "3.9" # use lowest supported version for linting
          update-environment: false

      - name: Check AST with Python 3.9
        run: |
-         "${{ steps.setup-py38.outputs.python-path }}" -m compileall -q -f tilelang
+         "${{ steps.setup-py39.outputs.python-path }}" -m compileall -q -f tilelang
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 38f7e49 and 46dfea1.

📒 Files selected for processing (1)
  • .github/workflows/ci.yml (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Quick Lint
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.1
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
🔇 Additional comments (1)
.github/workflows/ci.yml (1)

170-170: HTTP timeout configuration added for CUDA setup.

Adding UV_HTTP_TIMEOUT=600 to the CUDA environment is reasonable for addressing potential timeout issues during package downloads in that environment.

@XuehaiPan
Collaborator

XuehaiPan commented Oct 28, 2025

Some issues with the duplicated libtvm_ffi coming from the bundled TVM and the PyPI apache-tvm-ffi package.

@oraluben We already pack the whole 3rdparty directory into the wheel, which contains many built artifacts. I don't think duplicating libtvm_ffi is a very big problem because we include it in the wheel anyway. We should stop shipping the 3rdparty artifacts first.

Here are the contents of a wheel built from a fresh git clone:

$ unzip tilelang-*.whl

$ du -d 1 -h tilelang | sort -rh
662M    tilelang
592M    tilelang/3rdparty
 65M    tilelang/lib
3.3M    tilelang/src
408K    tilelang/language
292K    tilelang/carver
212K    tilelang/jit
128K    tilelang/intrinsics
 88K    tilelang/quantize
 64K    tilelang/contrib
 60K    tilelang/autotuner
 52K    tilelang/utils
 36K    tilelang/transform
 36K    tilelang/layout
 36K    tilelang/engine
 32K    tilelang/tileop
 32K    tilelang/primitives
 24K    tilelang/tools
 24K    tilelang/cache
 20K    tilelang/profiler
8.0K    tilelang/common
4.0K    tilelang/testing
4.0K    tilelang/math

$ du -d 1 -h tilelang/3rdparty | sort -rh
592M    tilelang/3rdparty
406M    tilelang/3rdparty/tvm
149M    tilelang/3rdparty/cutlass
 38M    tilelang/3rdparty/composable_kernel

@oraluben
Contributor

@XuehaiPan It's unexpected to have apache-tvm-ffi as a dependency but use a bundled lib. Space is not my main concern.

@XuehaiPan
Collaborator

when we have apache-tvm-ffi in dependency but use a bundled lib.

@oraluben Then you need to set an unreliable RPATH to site-packages:

$ORIGIN:$ORIGIN/../../tvm_ffi/lib

I can change to this approach, but it might not work in a venv created with --system-site-packages.

@oraluben
Contributor

Then you need to set an unreliable RPATH to site-packages

I don't think that's necessary, when we have import tvm_ffi in python, libtvm_ffi.so would be loaded automatically.

@XuehaiPan
Collaborator

Then you need to set an unreliable RPATH to site-packages

I don't think that's necessary, when we have import tvm_ffi in python, libtvm_ffi.so would be loaded automatically.

Then you rely on another unguaranteed thing: the execution order of Python code.

@oraluben
Contributor

oraluben commented Oct 28, 2025

Then you need to set an unreliable RPATH to site-packages

I don't think that's necessary, when we have import tvm_ffi in python, libtvm_ffi.so would be loaded automatically.

Then you rely on another unguaranteed thing: the execution order of Python code.

That's true. But since we're dealing with FFI, I think it's ok to add one exception and explicitly import tvm_ffi before loading other libs (even after we have import sorting in lint).

Update: isn't it very dangerous to load two copies of libtvm_ffi.so in the same process, e.g. the bundled one and the one from the pip package (maybe requested by FlashInfer or another package)?

@XuehaiPan
Collaborator

Update: isn't it very dangerous trying to load two libtvm_ffi.so in same process?

This makes sense to me.
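A minimal sketch of the explicit-import ordering discussed above, assuming tilelang's package __init__ is the place where the bundled TVM gets imported (illustrative, not the current code):

# tilelang/__init__.py (sketch)
import tvm_ffi  # noqa: F401  (resolve libtvm_ffi from the pip package first)

import tvm  # the bundled TVM then binds to the already-loaded libtvm_ffi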

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5e2ed4f and 1807298.

📒 Files selected for processing (3)
  • .github/workflows/dist.yml (2 hunks)
  • CMakeLists.txt (3 hunks)
  • pyproject.toml (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/dist.yml
  • CMakeLists.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Build wheels for Python 3.8 on ubuntu-24.04-arm with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.8 on macos-latest with Metal
  • GitHub Check: Build SDist
  • GitHub Check: Build wheels for Python 3.8 on ubuntu-latest with CUDA-12.1
🔇 Additional comments (5)
pyproject.toml (5)

11-28: Classifiers look good; align with FFI modernization.

Added Development Status Beta, C++ and CPython classifiers, plus Python 3.13 and 3.14 support. These changes appropriately reflect the C++ FFI integration and broader Python version support.


30-40: apache-tvm-ffi dependency is critical; audit torch specification.

The new apache-tvm-ffi~=0.1.0 dependency directly supports the PR's FFI rebase. However, note that lines 36–37 list torch twice:

"torch",
"torch>=2.7; platform_system == 'Darwin'",

This is redundant and may confuse dependency resolvers. Typically, you'd either specify a single unconditional version ("torch>=2.7"), or platform-specific variants without duplication. Verify whether torch should require ≥2.7 universally, or only on macOS for a specific reason.


48-48: Build system requires list is well-formatted.

Single-line requires is cleaner and more maintainable than the previous multi-entry format.


52-52: AI summary inconsistency: py-api and target-version.

The AI summary claims wheel.py-api changed from "cp38" to "cp39" and target-version changed from "py38" to "py39", but neither line is marked with ~ and both still show cp38/py38. Verify whether these were intentionally kept at cp38/py38 or whether the summary is outdated.

Also applies to: 111-111


182-194: Linux wheel repair command: libtvm_ffi exclusion is correct.

The updated auditwheel command (line 193) includes -v for verbosity and correctly excludes libtvm_ffi.so along with CUDA libraries. This aligns with the PR's objective to avoid bundling conflicting libtvm_ffi instances alongside the apache-tvm-ffi pip package.

@XuehaiPan
Collaborator

XuehaiPan commented Oct 28, 2025

The built wheel is not usable on Python 3.8 because apache-tvm-ffi requires ml_dtypes.int2, which is only available in ml-dtypes >= 0.5.0 (and that in turn requires Python 3.9).

  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/tmp.OwA4ZfWgj8/venv/lib/python3.8/site-packages/tilelang/__init__.py", line 77, in <module>
      import tvm
    File "/tmp/tmp.OwA4ZfWgj8/venv/lib/python3.8/site-packages/tilelang/3rdparty/tvm/python/tvm/__init__.py", line 24, in <module>
      from tvm_ffi import register_object, register_global_func, get_global_func
    File "/tmp/tmp.OwA4ZfWgj8/venv/lib/python3.8/site-packages/tvm_ffi/__init__.py", line 39, in <module>
      from .registry import (
    File "/tmp/tmp.OwA4ZfWgj8/venv/lib/python3.8/site-packages/tvm_ffi/registry.py", line 25, in <module>
      from . import core
    File "python/tvm_ffi/cython/dtype.pxi", line 192, in init core
    File "/tmp/tmp.OwA4ZfWgj8/venv/lib/python3.8/site-packages/ml_dtypes/__init__.py", line 71, in __getattr__
      raise AttributeError(f'cannot import name {name!r} from {__name__!r}')
  AttributeError: cannot import name 'int2' from 'ml_dtypes'

apache-tvm-ffi claims to support Python 3.8, but that path is apparently untested.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aa4eb5d and f276a26.

📒 Files selected for processing (3)
  • .github/workflows/ci.yml (2 hunks)
  • .github/workflows/dist.yml (1 hunks)
  • .github/workflows/pr-perfbench-bot.yml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/ci.yml
  • .github/workflows/dist.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Build SDist
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.1
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
🔇 Additional comments (1)
.github/workflows/pr-perfbench-bot.yml (1)

49-49: Verify Python 3.12 compatibility with TVM v0.22.0.

Upgrading from 3.9 to 3.12 addresses the Python 3.8 incompatibility noted in PR comments. However, verify that this single-version workflow is intentional and that TVM v0.22.0 is fully tested on Python 3.12. If the project maintains support for multiple Python versions, consider whether other workflows (e.g., build, test) validate compatibility across a version range.

@oraluben
Contributor

oraluben commented Oct 28, 2025

apache-tvm-ffi claims to support Python 3.8, but that path is apparently untested.

Checked a little bit. ml_dtypes is an optional dependency of tvm-ffi; the tvm-ffi issue is that it does not support ml_dtypes<0.5:

https://github.com/apache/tvm-ffi/blob/789e9e5ccf3fe590132d352671461922d30ba526/python/tvm_ffi/cython/dtype.pxi#L35-L39

When installing tilelang on py3.8, pip installs the newest ml_dtypes available for that interpreter (0.4.x), which does not have those types.
Users should be able to build from the sdist after tvm-ffi's next release, which includes apache/tvm-ffi#198.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
pyproject.toml (1)

227-231: Add explicit dylib exclusions to macOS repair-wheel command.

The macOS repair-wheel-command (lines 227–231) uses --no-sanitize-rpaths but does not include --exclude flags to prevent bundling of libtvm_ffi.dylib and libcuda.dylib. Per web search results in past review comments, --no-sanitize-rpaths only preserves RPATH entries; it does not prevent delocate-wheel from copying dylibs into the wheel.

In contrast, the Linux configuration (lines 222–225) explicitly excludes libtvm_ffi.so and libcuda.so.1. Without equivalent exclusions on macOS, the wheel will bundle libtvm_ffi.dylib, creating a conflict with the pip-installed apache-tvm-ffi package and risking duplication or ABI mismatches.

Add --exclude libtvm_ffi.dylib and --exclude libcuda.dylib to the delocate-wheel command:

[tool.cibuildwheel.macos]
repair-wheel-command = [
-   "delocate-wheel --verbose --ignore-missing-dependencies --no-sanitize-rpaths --require-archs {delocate_archs} -w {dest_dir} -v {wheel}",
+   "delocate-wheel --verbose --ignore-missing-dependencies --no-sanitize-rpaths --exclude libtvm_ffi.dylib --exclude libcuda.dylib --require-archs {delocate_archs} -w {dest_dir} -w {dest_dir} {wheel}",
    "pipx run abi3audit --verbose --strict {wheel}",
]
🧹 Nitpick comments (2)
pyproject.toml (2)

31-31: Clarify redundant torch dependency.

Line 31 adds apache-tvm-ffi~=0.1.0, but lines 36–37 declare torch twice: once without a version constraint (torch) and again with a macOS-specific pin (torch>=2.7; platform_system == 'Darwin'). Typically, a base requirement should precede environment markers.

Consider restructuring to a single torch entry with the appropriate version constraint:

dependencies = [
    "apache-tvm-ffi~=0.1.0",
    "cloudpickle",
    "ml-dtypes",
    "numpy>=1.23.5",
    "psutil",
-   "torch",
-   "torch>=2.7; platform_system == 'Darwin'",
+   "torch>=2.7; platform_system == 'Darwin'",
+   "torch; platform_system != 'Darwin'",
    "tqdm>=4.62.3",
    "typing-extensions>=4.10.0",
]

Alternatively, if the minimum torch version differs by platform, clarify the intent with a comment.

Also applies to: 36-37


186-188: Consider expanding test-command to verify FFI integration.

The added test-command (lines 186–188) only checks that tilelang imports and prints its version. For a PR that migrates extensively to FFI-based object creation and adds a new FFI dependency (apache-tvm-ffi), consider a slightly more comprehensive smoke test to ensure FFI components load correctly.

For example:

python -c "import tilelang; from tilelang import ir; print(f'tilelang {tilelang.__version__}'); print('FFI OK')"

This ensures not just import, but also FFI-dependent submodules (like ir) load without error.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f276a26 and 8424e99.

📒 Files selected for processing (2)
  • .github/workflows/pr-perfbench-bot.yml (2 hunks)
  • pyproject.toml (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/pr-perfbench-bot.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Quick Lint
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.1
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
  • GitHub Check: Build SDist
🔇 Additional comments (1)
pyproject.toml (1)

30-40: Pin ml-dtypes version to ensure Python 3.8 compatibility.

The project declares requires-python = ">=3.8" (line 5), but does not pin ml-dtypes (line 33). Per PR comments, apache-tvm-ffi requires ml-dtypes >= 0.5.0, which in turn requires Python ≥ 3.9 for the int2 type. This creates a latent incompatibility: pip may resolve ml-dtypes >= 0.5.0 on Python 3.8, causing an import failure.

Either:

  1. Explicitly pin ml-dtypes>=0.5.0 and update requires-python to >=3.9, or
  2. Pin ml-dtypes to a version compatible with Python 3.8 if older versions support it, or
  3. Make ml-dtypes conditional on Python ≥ 3.9 (if the library is optional for older Python).

As a workaround until apache-tvm-ffi resolves this upstream, consider pinning ml-dtypes>=0.5.0; python_version >= '3.9'.

tqchen pushed a commit to apache/tvm-ffi that referenced this pull request Oct 28, 2025
As title, currently it failed with `ml_dtypes==0.4.x`:
tile-ai/tilelang#1108 (comment)