
Conversation

LeiWang1999 (Member) commented Oct 22, 2025

This pull request introduces significant refactoring and modernization of the FFI (Foreign Function Interface) and object system usage in the codebase, particularly in the layout and IR (Intermediate Representation) modules. The changes improve consistency, type safety, and compatibility with updated TVM submodules. Additionally, there are CMake and build system updates to better handle third-party dependencies and include paths.

Key changes include:

FFI and Object System Refactoring

  • Replaced direct usage of make_object and related macros with tvm::ffi::make_object and new FFI macros (e.g., TVM_FFI_DECLARE_OBJECT_INFO_FINAL, TVM_FFI_DEFINE_OBJECT_REF_METHODS_NULLABLE/NOTNULLABLE) across src/layout/layout.cc, src/layout/layout.h, src/layout/swizzle.cc, and src/ir.cc for Layout, Fragment, SwizzledLayout, and various IR frame classes. This modernizes the object creation and reflection system, improving type safety and FFI integration. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] (A before/after sketch follows this list.)

  • Removed custom SEqualReduce methods and replaced legacy object registration macros with FFI-based equivalents in layout classes, further aligning with the new FFI system. [1] [2] [3]

  • Updated FFI static initialization blocks to use the new syntax and patterns for registering reflection and global functions. [1] [2] [3] [4]
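
To make the first bullet concrete, here is a rough before/after sketch of the pattern this PR moves to. The macro names come from the description above; their exact argument lists are assumed, so treat this as an illustration rather than a copy of the diff.

// Before: legacy TVM object macros and the unqualified factory.
class LayoutNode : public Object {
 public:
  // fields ...
  TVM_DECLARE_FINAL_OBJECT_INFO(LayoutNode, Object);
};
// ObjectPtr<LayoutNode> n = make_object<LayoutNode>();

// After: FFI macros plus the tvm::ffi factory (argument order assumed).
class LayoutNode : public Object {
 public:
  // fields ...
  TVM_FFI_DECLARE_OBJECT_INFO_FINAL("tl.Layout", LayoutNode, Object);
};
class Layout : public ObjectRef {
 public:
  TVM_FFI_DEFINE_OBJECT_REF_METHODS_NULLABLE(Layout, ObjectRef, LayoutNode);
};
// ObjectPtr<LayoutNode> n = tvm::ffi::make_object<LayoutNode>();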

Build System and Third-party Dependency Handling

  • Updated the TVM submodule to a newer commit, ensuring compatibility with the latest upstream changes.

  • Improved CMake logic for including TVM and related third-party headers, adding conditional checks for alternative include paths (e.g., for ffi/include and dlpack/include), increasing build robustness across different setups.

  • Removed installation and dependency logic for TVM's Cython extension, simplifying the build process and avoiding unnecessary steps for certain build types.

  • Removed the now-unused codegen_webgpu.cc from build sources, reflecting changes in dependency or platform support.

Minor Codebase Cleanups

  • Added missing includes for FFI aliases and object headers, and made minor formatting and whitespace adjustments for consistency. [1] [2] [3] [4]

These changes collectively modernize the codebase, improve maintainability, and ensure compatibility with upstream dependencies and evolving FFI standards.

Summary by CodeRabbit

  • New Features

    • Added is_wgmma option for GEMM warp-partition tuning.
  • Improvements

    • Large-scale migration to a new FFI-based ABI for cross-language types and registrations.
    • Consolidated install behavior and improved runtime library path handling.
    • Centralized FFI type aliases for consistent usage.
  • Removals

    • WebGPU TileLang backend removed.
  • Breaking Changes

    • Public APIs and module returns now use FFI-backed types; update callers.
  • Chores

    • Packaging, dependencies, and CI/workflow updates.

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai bot (Contributor) commented Oct 22, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Mass migration to TVM FFI: replace local object factories/GetRef and TVM object macros with tvm::ffi equivalents, add an ffi_aliases header, switch many public/internal types to ffi variants, normalize static-init macro usage, change several API signatures (ArgBinder, target/runtime modules), remove the WebGPU TileLang backend, and update Python FFI registrations and packaging.

Changes

Cohort / File(s) Summary
Submodule & Build
3rdparty/tvm, CMakeLists.txt, cmake/load_tvm.cmake
TVM submodule pointer updated; added GNU warning suppression and SCCACHE fallback; removed WebGPU source from build; conditionalized FFI/DLPack include paths; consolidated install(TARGETS ...) and expanded INSTALL_RPATH handling.
FFI aliases
src/support/ffi_aliases.h
New header re-exporting select tvm::ffi types as tvm::Array/Map/Optional/String/Function (a sketch of this pattern appears after this table).
Core IR & Layout
src/ir.cc, src/layout/layout.cc, src/layout/layout.h, tilelang/layout/layout.py, tilelang/layout/fragment.py
Replace make_object/GetRef with tvm::ffi factories/refs, swap TVM object macros for TVM_FFI_* variants, add ffi includes/aliases, and update Python object registration to tvm_ffi.
Operators
src/op/*.cc, src/op/*.h (atomic_add, copy, fill, gemm*, parallel, reduce, region, etc.)
Migrate object creation to tvm::ffi::make_object, replace object-info macros with TVM_FFI_DECLARE_*, remove SEqualReduce/SHashReduce hooks/flags, and normalize TVM_FFI_STATIC_INIT_BLOCK() usage.
Transforms / Passes
src/transform/** (many files)
Qualify GetRef/make_object as tvm::ffi::..., add ffi_aliases includes where needed, and standardize TVM_FFI_STATIC_INIT_BLOCK() invocation across passes.
ArgBinder API
src/transform/arg_binder.h, src/transform/arg_binder.cc
Public signature/type changes: BindArray and def_handle_dtype/member now use ffi::Array / ffi::Map instead of Array / Map.
Targets & runtime modules
src/target/*, src/target/rt_mod_*.cc, src/target/codegen_*.{cc,h}, src/target/intrin_rule_*.cc
Convert String/Array/Module to ffi::String/ffi::Array/ffi::Module, update GetRef usage and existence checks, change codegen/rt function signatures (CUDA/HIP/CPP), and include ffi_aliases.
WebGPU removal
src/target/codegen_webgpu.cc, src/target/codegen_webgpu.h, build entries
Entire WebGPU TileLang backend (header + implementation) removed; Python device mapping updated to target.build.webgpu.
Python FFI & registrations
tilelang/_ffi_api.py, tilelang/ir.py, tilelang/contrib/*.py, tilelang/engine/*.py, tilelang/engine/callback.py
Replace tvm.ffi imports with tvm_ffi, switch registration decorators to tvm_ffi.register_*/register_global_func; small API tweak (e.g., GemmWarpPolicy.compute_warp_partition signature).
Runtime / tests / examples
src/target/rt_mod_*.cc, tilelang/contrib/*, testing/python/*, examples/gemm/README.md
Update runtime/module return types and attribute types to ffi variants; update test registration decorators; minor README and test-runner tweaks.
Packaging & CI
pyproject.toml, requirements*.txt, .github/workflows/*, .clang-tidy, format.sh
Added apache-tvm-ffi dependency and other packaging updates, changed cibuildwheel/repair commands, CI env/steps adjusted, removed clang-tidy verbose arg, and set PIP_USER=0 in format.sh.
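
For orientation, the ffi_aliases header mentioned in the table above is essentially a block of using-declarations that re-export the FFI containers under the tvm:: namespace. A minimal sketch of the idea follows; the include paths and the exact set of aliases are assumed, not copied from the file.

// src/support/ffi_aliases.h (sketch)
#include <tvm/ffi/container/array.h>  // include paths assumed
#include <tvm/ffi/container/map.h>
#include <tvm/ffi/function.h>
#include <tvm/ffi/optional.h>
#include <tvm/ffi/string.h>

namespace tvm {
using ffi::Array;
using ffi::Function;
using ffi::Map;
using ffi::Optional;
using ffi::String;
}  // namespace tvm

With such a header in place, code that previously spelled tvm::Array or tvm::String keeps compiling while resolving to the FFI-backed types.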

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant Cpp as C++ code
    participant FFI as tvm::ffi
    Caller->>Cpp: request create Node / access Node
    alt pre-migration
      Cpp->>Cpp: make_object<Node>() / GetRef<Node>(...)
      note right of Cpp: local TVM factory / unqualified GetRef
      Cpp-->>Caller: ObjectRef
    else post-migration
      Cpp->>FFI: tvm::ffi::make_object<Node>() / tvm::ffi::GetRef<Node>(...)
      note right of FFI #D2E7D5: FFI-managed allocation/reflection and ffi types
      FFI-->>Cpp: ObjectPtr / ObjectRef
      Cpp-->>Caller: ObjectRef
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Areas to focus during review:
    • ArgBinder signature changes and all call sites (ffi::Array / ffi::Map migration).
    • Target/runtime API changes returning ffi::Module and updated codegen signatures for CUDA/HIP/CPP.
    • Broad consistency of migrations: make_object → tvm::ffi::make_object, GetRef → tvm::ffi::GetRef, and replacement of TVM object macros with TVM_FFI_*.
    • Removal of SEqualReduce/SHashReduce methods and flags—verify there are no remaining consumers (serialization, hashing, or reflection expectations). See the sketch after this list.
    • Complete removal of WebGPU backend—ensure no residual references in build scripts, Python mappings, or tests.
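
For the SEqualReduce/SHashReduce point, the old and new patterns look roughly like this (simplified; the real CopyNode registration in copy.h lists more fields):

// Legacy hook being removed (hand-written structural equality on the node):
//   bool SEqualReduce(const CopyNode* other, SEqualReducer equal) const {
//     return equal(src, other->src) && equal(dst, other->dst);
//   }

// FFI replacement: register fields once; equality/hashing are expected to be
// derived from this reflection metadata rather than per-node hooks.
void CopyNode::RegisterReflection() {
  namespace refl = tvm::ffi::reflection;
  refl::ObjectDef<CopyNode>()
      .def_ro("src", &CopyNode::src)
      .def_ro("dst", &CopyNode::dst);
}
TVM_FFI_STATIC_INIT_BLOCK() { CopyNode::RegisterReflection(); }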

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • LeiWang1999
  • oraluben

Poem

"I hopped through headers, swift and spry,
swapped make_object for ffi nearby.
WebGPU waved and hopped away,
Python hooks now lead the way.
A rabbit’s patch — neat, small, and spry 🐇"

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Docstring Coverage (⚠️ Warning): Docstring coverage is 42.62%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check (✅ Passed): The pull request title "[FFI] Rebase tvm to v0.22.0 to utilize tvm-ffi" directly describes the primary changes in this changeset. The title accurately identifies that the main effort is rebasing the TVM submodule to a newer version (v0.22.0) and refactoring the codebase to use updated TVM FFI APIs. This aligns with the substantial changes visible across the repository: updating object creation macros to tvm::ffi::make_object, replacing legacy object-reflection macros with FFI equivalents (TVM_FFI_DECLARE_OBJECT_INFO_FINAL, TVM_FFI_DEFINE_OBJECT_REF_METHODS_NULLABLE, etc.), updating Python FFI registration decorators, and adjusting build system configuration. The title is concise and specific—it clearly conveys the version target and the underlying motivation without excessive noise.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8424e99 and 9ef647c.

📒 Files selected for processing (1)
  • pyproject.toml (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.1
  • GitHub Check: Build SDist
🔇 Additional comments (2)
pyproject.toml (2)

30-40: Clarify torch dependency redundancy and constrain ml-dtypes version.

Line 36 specifies "torch" unconditionally, while line 37 specifies "torch>=2.7; platform_system == 'Darwin'" for macOS. This creates a redundancy: pip will resolve to the most restrictive constraint, so macOS gets torch>=2.7 but other platforms get any version. Please clarify intent.

Additionally, line 33 adds "ml-dtypes" without an explicit version constraint. Per the PR objectives, apache-tvm-ffi requires ml_dtypes.int2 which is only available in ml-dtypes >= 0.5.0 and effectively requires Python ≥ 3.9. The current unconstrained version may pull incompatible versions on Python 3.8. Consider pinning "ml-dtypes>=0.5.0" or documenting the version strategy.


47-49: Verify scikit-build-core sufficiency after setuptools removal.

The build-system requires field removed "setuptools>=63", relying solely on scikit-build-core. Confirm that all prior setuptools functionality (e.g., custom entry points, version hooks, setup.py scripts) is now handled by scikit-build-core and its configuration in [tool.scikit-build] and the provided pyproject.toml metadata. If any legacy setup.py hooks or setuptools-specific tasks remain, this change may break the build.



coderabbitai bot (Contributor) left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (14)
src/transform/make_packed_api.cc (2)

366-377: FFI type-code check is wrong for ints/floats.

buf_packed_arg_type_ids holds FFI type indices, but lines 369 and 375 check against kDLInt and kDLFloat (DLPack dtype codes), causing invalid runtime validation. The handle type case (line 361–363) correctly uses ffi::TypeIndex::kTVMFFI* constants on the same buffer. Replace with:

     } else if (t.is_int() || t.is_uint()) {
       std::ostringstream msg;
       msg << name_hint << ": Expect arg[" << i << "] to be int";
-      seq_init.emplace_back(AssertStmt(type_index == kDLInt,
+      seq_init.emplace_back(AssertStmt(type_index == ffi::TypeIndex::kTVMFFIInt,
                                        tvm::tir::StringImm(msg.str()), nop));
     } else {
       ICHECK(t.is_float());
       std::ostringstream msg;
       msg << name_hint << ": Expect arg[" << i << "] to be float";
-      seq_init.emplace_back(AssertStmt(type_index == kDLFloat,
+      seq_init.emplace_back(AssertStmt(type_index == ffi::TypeIndex::kTVMFFIFloat,
                                        tvm::tir::StringImm(msg.str()), nop));
     }

129-135: Fix incorrect ret() call construction at lines 133 and 450.

Both lines incorrectly use undefined or incorrect functions. Line 133 uses tvm::ret() which doesn't exist, and line 450 uses bare ret() which is undefined. The correct approach is to construct a Call with builtin::ret(), consistent with the pattern already in the codebase at line 417.

-    Stmt ret_zero = Evaluate(tvm::ret(0));
+    Stmt ret_zero = Evaluate(Call(DataType::Int(32), builtin::ret(), {Integer(0)}));

and

-  body = SeqStmt({body, Evaluate(ret(Integer(0)))});
+  body = SeqStmt({body, Evaluate(Call(DataType::Int(32), builtin::ret(), {Integer(0)}))});
src/target/utils.cc (1)

59-64: Bug: bitwise '&' used instead of logical '&&' in TargetIsSm100.

This will yield incorrect results for most values. Use logical and.

Apply this diff:

-  return arch >= 100 & arch <= 110;
+  return arch >= 100 && arch <= 110;
src/op/reduce.cc (1)

45-90: 64‑bit overflow/UB in bit shifts when building init constants.

(1 << (bits-1)) and (1 << bits) overflow for bits ≥ 31 (signed) / 32 (unsigned). Use min_value/max_value helpers instead of shifting.

   if (type->isMax()) {
-    if (is_int) {
-      return make_const(dst->dtype, -(1 << (bits - 1)));
-    } else if (is_uint) {
-      return make_const(dst->dtype, 0);
-    } else {
-      return make_const(dst->dtype, -INFINITY);
-    }
+    if (is_int) {
+      return min_value(dst->dtype);
+    } else if (is_uint) {
+      return make_const(dst->dtype, 0);
+    } else {
+      return make_const(dst->dtype, -INFINITY);
+    }
   } else if (type->isMin()) {
-    if (is_int) {
-      return make_const(dst->dtype, (1 << (bits - 1)) - 1);
-    } else if (is_uint) {
-      return make_const(dst->dtype, (1 << bits) - 1);
-    } else {
-      return make_const(dst->dtype, INFINITY);
-    }
+    if (is_int) {
+      return max_value(dst->dtype);
+    } else if (is_uint) {
+      return max_value(dst->dtype);
+    } else {
+      return make_const(dst->dtype, INFINITY);
+    }
   } else if (type->isBitAnd()) {
-    if (is_int) {
-      return make_const(dst->dtype, -1);
-    } else if (is_uint) {
-      return make_const(dst->dtype, (1 << bits) - 1);
-    } else {
-      // Should not arrive here
-      return make_const(dst->dtype, -INFINITY);
-    }
+    if (is_int) {
+      return make_const(dst->dtype, -1);
+    } else if (is_uint) {
+      return max_value(dst->dtype);
+    } else {
+      // Should not arrive here
+      return make_const(dst->dtype, -INFINITY);
+    }

Note: min_value/max_value are in tir/op.h; already included above.

src/transform/vectorize_loop.cc (1)

170-178: Bug: strides are cleared due to iterating an empty array

You initialize Array<PrimExpr> strides; and then loop over strides.size(), which is 0. This drops existing strides when assigning back, changing layout semantics.

Apply:

-      // Update strides if defined.
-      Array<PrimExpr> strides;
-      for (size_t i = 0; i < strides.size(); i++) {
-        PrimExpr stride = strides[i];
-        if (i != strides.size() - 1) {
-          stride *= var_lanes_;
-        }
-        strides.push_back(analyzer_.Simplify(stride));
-      }
+      // Update strides if defined.
+      Array<PrimExpr> strides = node->buffer->strides;
+      if (!strides.empty()) {
+        for (size_t i = 0; i < strides.size(); i++) {
+          PrimExpr stride = strides[i];
+          if (i != strides.size() - 1) {
+            stride *= var_lanes_;
+          }
+          strides.Set(i, analyzer_.Simplify(stride));
+        }
+      }
src/op/gemm_py.cc (1)

65-66: Fix type casting for clear_accum field

The comment in the file (lines 38-41) and Python interface both document clear_accum as Bool type. The parallel implementation in src/op/gemm_sp.cc (line 98) and validation in src/op/gemm.cc (lines 592-595) both use .as<Bool>() casting. Update the parsing to match:

-  node->clear_accum = args[9].as<PrimExpr>().value();
+  node->clear_accum = args[9].as<Bool>().value();
src/target/rt_mod_cuda.cc (3)

47-54: Unqualified TIR types; qualify to avoid ADL/visibility pitfall.

PrimFuncNode/PrimFunc are used without tir:: here, unlike earlier in the file. This can break if no using brings them in.

-    ICHECK(kv.second->IsInstance<PrimFuncNode>())
+    ICHECK(kv.second->IsInstance<tir::PrimFuncNode>())
...
-    auto f = Downcast<PrimFunc>(kv.second);
+    auto f = Downcast<tir::PrimFunc>(kv.second);

Applies to both loops.

Also applies to: 79-87


64-68: Guard against empty PTX before indexing.

ptx[0] is UB when ptx is empty (compile callback could return an empty string).

-    if (ptx[0] != '/')
+    if (!ptx.empty() && ptx[0] != '/')
       fmt = "cubin";

69-70: Replace ICHECK(0) with a descriptive fatal.

Improves diagnosability.

-  } else {
-    ICHECK(0);
-  }
+  } else {
+    LOG(FATAL) << "Missing global func: tilelang_callback_cuda_compile";
+  }
src/ir.cc (1)

7-16: Include the standard headers used here.

  • std::vector and std::sort are used but not explicitly included.

Apply:

 #include "./transform/common/attr.h"
 #include "op/builtin.h"
 #include "tvm/ffi/any.h"
 #include <tvm/ffi/object.h>
 
 #include "support/ffi_aliases.h"
 #include <tvm/arith/analyzer.h>
 #include <tvm/ffi/reflection/registry.h>
 #include <tvm/script/ir_builder/tir/ir.h>
+#include <vector>
+#include <algorithm>
src/op/gemm.h (1)

10-12: Add missing <optional> header to fix std::optional usage.

The header uses std::optional (line 101) but does not include <optional>. While support/ffi_aliases.h is already transitively included via operator.h, it provides only the FFI wrapper Optional, not the standard library std::optional.

 #include "operator.h"
+#include <optional>

The #include "support/ffi_aliases.h" is optional—it's already transitively available through operator.hlayout.hsupport/ffi_aliases.h, but adding it explicitly improves header self-sufficiency if desired.

src/op/finalize_reducer.cc (1)

84-86: Potential null deref in extent check.

If T.thread_bounds->extent is symbolic, as_const_int(...) may be null and dereferenced in the ICHECK. Guard before deref.

Apply this minimal fix:

-  int extent = *p_extent, scale = 1;
-  ICHECK(extent == 1 || extent == *as_const_int(T.thread_bounds->extent))
+  int extent = *p_extent, scale = 1;
+  const int64_t* p_thr = as_const_int(T.thread_bounds->extent);
+  ICHECK(p_thr != nullptr && (extent == 1 || extent == *p_thr))
       << "Illegal finalize_reducer: extent=" << extent
       << "; T.thread_bounds=" << T.thread_bounds;
src/op/gemm.cc (2)

115-152: Constructor reads beyond args without size checks; fix parsing robustness.

Accessing args[16..18] unconditionally can read out of bounds when older call sites omit them. Also prefer robust Downcast/Optional handling for types.

Apply:

 Gemm::Gemm(Array<PrimExpr> args, BufferMap vmap) {
-  ObjectPtr<GemmNode> node = tvm::ffi::make_object<GemmNode>();
+  ObjectPtr<GemmNode> node = tvm::ffi::make_object<GemmNode>();

   node->Aptr = args[0];
   node->Bptr = args[1];
   node->Cptr = args[2];
   node->A = vmap[GetVarFromAccessPtr(node->Aptr)];
   node->B = vmap[GetVarFromAccessPtr(node->Bptr)];
   node->C = vmap[GetVarFromAccessPtr(node->Cptr)];
-  node->trans_A = args[3].as<Bool>().value();
-  node->trans_B = args[4].as<Bool>().value();
+  node->trans_A = Downcast<Bool>(args[3]).value();
+  node->trans_B = Downcast<Bool>(args[4]).value();
   node->M = args[5].as<IntImm>().value()->value;
   node->N = args[6].as<IntImm>().value()->value;
   node->K = args[7].as<IntImm>().value()->value;
   node->policy = GemmWarpPolicy(args[8].as<IntImm>().value()->value);
-  node->clear_accum = args[9].as<PrimExpr>().value();
+  node->clear_accum = Downcast<Bool>(args[9]);
   node->stride_A = args[10].as<IntImm>().value()->value;
   node->stride_B = args[11].as<IntImm>().value()->value;
   node->offset_A = args[12].as<IntImm>().value()->value;
   node->offset_B = args[13].as<IntImm>().value()->value;
   if (args.size() > 14) {
     node->kPack = args[14].as<IntImm>().value()->value;
     if (node->kPack != 1 && node->kPack != 2) {
       ICHECK(false) << "kPack must be 1 or 2";
     }
   }
-  if (args.size() > 15) {
-    node->wg_wait = args[15].as<IntImm>().value()->value;
-  }
-  node->mbarptr = args[16];
-  if (node->mbarptr.as<CallNode>()) {
-    node->mbar = vmap[GetVarFromAccessPtr(node->mbarptr)];
-  } else {
-    node->mbar = std::nullopt;
-  }
-  node->C_coords = Array<PrimExpr>(
-      {args[17].as<PrimExpr>().value(), args[18].as<PrimExpr>().value()});
+  if (args.size() > 15) {
+    node->wg_wait = args[15].as<IntImmNode>()->value;
+  }
+  if (args.size() > 16) {
+    node->mbarptr = args[16];
+    if (node->mbarptr.as<CallNode>()) {
+      node->mbar = vmap[GetVarFromAccessPtr(node->mbarptr)];
+    } else {
+      node->mbar = std::nullopt;
+    }
+  }
+  if (args.size() > 18) {
+    node->C_coords = {args[17], args[18]};
+  } else {
+    node->C_coords = {};
+  }

Also update the comment above to document the additional parameters and new optionality.


877-887: FFI binding drops return value from ComputeWarpPartition.

The lambda doesn’t return the computed pair; callers can’t observe the result. Return both values (e.g., as Array or a Tuple).

Apply one of:

-  refl::GlobalDef().def("tl.GemmWarpPolicyComputeWarpPartition",
-                        [](GemmWarpPolicy policy, int M, int N, int block_size,
-                           Target target, GemmInst gemm_inst) {
-                          policy->ComputeWarpPartition(M, N, block_size, target,
-                                                       gemm_inst);
-                        });
+  refl::GlobalDef().def("tl.GemmWarpPolicyComputeWarpPartition",
+                        [](GemmWarpPolicy policy, int M, int N, int block_size,
+                           Target target, GemmInst gemm_inst) -> Array<Integer> {
+                          auto res = policy->ComputeWarpPartition(M, N, block_size, target, gemm_inst);
+                          return {Integer(res.first), Integer(res.second)};
+                        });

If your FFI prefers packed returns, switch to def_packed and set rv accordingly.

🧹 Nitpick comments (21)
src/transform/lower_device_kernel_launch.cc (1)

39-39: Optional: The using namespace ffi; directive appears unused.

All FFI components in this file are explicitly qualified (e.g., tvm::ffi::GetRef on lines 376 and 392), so this namespace directive isn't currently utilized. Consider removing it or using unqualified FFI names if consistency with the using namespace tir; pattern is desired.

src/transform/storage_access.cc (1)

343-407: FFI migration correctly applied.

All tvm::ffi::GetRef<Var> conversions are necessary to convert the raw const VarNode* (line 346) to Var objects.

Minor optimization opportunity: Lines 350, 355, 360-361, and 392 each create a new Var reference from the same buffer_var pointer. You could cache tvm::ffi::GetRef<Var>(buffer_var) in a local variable to avoid repeated conversions, though the performance impact is likely negligible.

Optional optimization
     DataType dtype = op->args[0].dtype();
     const VarNode *buffer_var = op->args[1].as<VarNode>();
+    Var buffer_var_ref = tvm::ffi::GetRef<Var>(buffer_var);
     PrimExpr offset = op->args[2];
     PrimExpr extent = op->args[3];
     const IntImmNode *flag = op->args[4].as<IntImmNode>();
-    StorageScope scope = GetScope(tvm::ffi::GetRef<Var>(buffer_var));
+    StorageScope scope = GetScope(buffer_var_ref);
     // The buffer scope.
     if (Enabled(buffer_var, scope)) {
       ICHECK(allow_append_);
       Array<Range> buffer_ranges;
-      if (buffer_data_to_buffer_.find(tvm::ffi::GetRef<Var>(buffer_var)) ==
+      if (buffer_data_to_buffer_.find(buffer_var_ref) ==
           buffer_data_to_buffer_.end()) {
         // cannot find buffer map, use the default buffer
         buffer_ranges = {Range::FromMinExtent(offset, extent)};
       } else {
-        Buffer buffer =
-            buffer_data_to_buffer_.at(tvm::ffi::GetRef<Var>(buffer_var));
+        Buffer buffer = buffer_data_to_buffer_.at(buffer_var_ref);
         // ... (rest of logic unchanged)
       }
       AccessEntry e;
       e.threads = env_threads();
       e.thread_range = this->ComputeThreadRange(e.threads);
       e.dtype = dtype;
-      e.buffer = tvm::ffi::GetRef<Var>(buffer_var);
+      e.buffer = buffer_var_ref;
src/transform/inject_pipeline.cc (1)

40-40: Verify if this namespace import is actually used.

The using namespace ffi; directive appears unused, as all FFI members throughout the file are explicitly qualified (e.g., tvm::ffi::make_object at line 463, tvm::ffi::GetRef at line 985, tvm::ffi::reflection at line 1077).

If this import is not used elsewhere in the file, consider removing it:

 using namespace tir;
-using namespace ffi;
src/transform/storage_access.h (1)

42-42: Consider moving the using namespace ffi; directive to the implementation file.

Using namespace directives in headers can lead to namespace pollution for all translation units that include this header. Since no FFI types or functions are directly used in this header declaration, consider moving this directive to the corresponding .cc implementation file where FFI functions like make_object or GetRef are likely called.

Alternatively, use explicit ffi:: qualification at call sites or prefer specific using declarations (e.g., using ffi::make_object;) for better clarity and to avoid potential name collisions.

src/transform/inject_tma_barrier.cc (4)

296-301: Assumes barrier_id is always Call(get_mbarrier(...)) — verify or guard.

tma_op_to_barrier_id_[GetRef(op)].as<CallNode>()->args[0] will segfault if the value isn’t a Call. Add an ICHECK on the as<CallNode>() result or handle Var/IntImm cases.

Would you like me to harden this with a small guard and fallback?


408-416: Unused local ‘block’.

auto block = tvm::ffi::GetRef<Block>(op) is unused. Remove to avoid warnings.

-  auto block = tvm::ffi::GetRef<Block>(op);

487-494: Eliding arrive_barrier by returning 0 — confirm semantics.

Replacing builtin::ptx_arrive_barrier() with literal 0 relies on the surrounding Evaluate(…) context treating it as a no-op. If any code inspects side effects, this may mislead. Consider returning an Evaluate(0) at the stmt layer instead where possible.
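
A rough sketch of the stmt-layer alternative, assuming the rewrite lives in a StmtExprMutator (the surrounding class and method placement here are illustrative, not the file's actual structure):

Stmt VisitStmt_(const EvaluateNode* op) final {
  if (const auto* call = op->value.as<CallNode>()) {
    if (call->op.same_as(builtin::ptx_arrive_barrier())) {
      // Drop the arrive_barrier as an explicit no-op statement instead of
      // rewriting the call expression itself to the literal 0.
      return Evaluate(0);
    }
  }
  return StmtExprMutator::VisitStmt_(op);
}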


82-87: Typo: old_loop_evtents.

Rename to old_loop_extents for clarity.

-    PrimExpr old_loop_evtents = loop_extents;
+    PrimExpr old_loop_extents = loop_extents;
     loop_extents *= op->extent;
     StmtExprVisitor::VisitStmt_(op);
-    loop_extents = old_loop_evtents;
+    loop_extents = old_loop_extents;
src/op/copy.h (2)

106-114: Expose remaining fields in reflection (disable_tma, eviction_policy, par_op_).

For parity with other ops (e.g., AtomicAddNode), add def_ro for disable_tma and eviction_policy to aid Python-side introspection/debug.

   refl::ObjectDef<CopyNode>()
       .def_ro("src", &CopyNode::src)
       .def_ro("dst", &CopyNode::dst)
       .def_ro("src_range", &CopyNode::src_range)
       .def_ro("dst_range", &CopyNode::dst_range)
-      .def_ro("coalesced_width", &CopyNode::coalesced_width);
+      .def_ro("coalesced_width", &CopyNode::coalesced_width)
+      .def_ro("disable_tma", &CopyNode::disable_tma)
+      .def_ro("eviction_policy", &CopyNode::eviction_policy);

308-345: FFI macros for Conv2DIm2ColOpNode/Op look correct; consider reflecting nhw_step/c_step.

nhw_step and c_step are important for debugging/instrumentation but not reflected. Consider adding def_ro for both.

   refl::ObjectDef<Conv2DIm2ColOpNode>()
       .def_ro("src", &Conv2DIm2ColOpNode::src)
       .def_ro("dst", &Conv2DIm2ColOpNode::dst)
       .def_ro("stride", &Conv2DIm2ColOpNode::stride)
       .def_ro("padding", &Conv2DIm2ColOpNode::padding)
       .def_ro("dilation", &Conv2DIm2ColOpNode::dilation)
       .def_ro("kernel", &Conv2DIm2ColOpNode::kernel)
-      .def_ro("eviction_policy", &Conv2DIm2ColOpNode::eviction_policy);
+      .def_ro("eviction_policy", &Conv2DIm2ColOpNode::eviction_policy)
+      .def_ro("nhw_step", &Conv2DIm2ColOpNode::nhw_step)
+      .def_ro("c_step", &Conv2DIm2ColOpNode::c_step);
src/transform/simplify.cc (2)

26-27: Avoid broad ‘using namespace ffi’ in .cc unless needed.

Localize to needed symbols or qualify with tvm::ffi to reduce ambiguity.


515-523: Default argument differs from Python wrapper.

C++ Simplify defaults simplify_arguments = true, while tilelang/transform/simplify.py defaults to False. Align to False to avoid surprises when invoked from C++.

-tvm::transform::Pass Simplify(bool simplify_arguments = true) {
+tvm::transform::Pass Simplify(bool simplify_arguments = false) {

Would you like me to scan for in-repo C++ call sites that rely on the current default?

src/op/reduce.h (1)

54-58: Constructor uses tvm::ffi::make_object — good. Consider std::string_view.

Optional: accept std::string_view to avoid copies and normalize input via to_lower for robustness.
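
One possible shape for that suggestion; ParseReduceType and the field it feeds are hypothetical, and only the make_object call mirrors the existing constructor:

// Requires <algorithm>, <cctype>, <string>, <string_view>.
ReduceOp::ReduceOp(std::string_view type) {
  auto node = tvm::ffi::make_object<ReduceOpNode>();
  std::string normalized(type);
  std::transform(normalized.begin(), normalized.end(), normalized.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  node->type = ParseReduceType(normalized);  // hypothetical helper; the real field/setter may differ
  data_ = std::move(node);
}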

src/op/reduce.cc (1)

317-321: AllReduce workspace conditional looks fine; small nit: reduce literal 32 as constexpr.

Optional micro‑refactor to constexpr kWarpSize = 32 for clarity.

src/layout/swizzle.cc (3)

103-106: Register reflection in a static init block

RegisterReflection() won’t run unless called. Add a static init block.

Apply:

 void SwizzledLayoutNode::RegisterReflection() {
   namespace refl = tvm::ffi::reflection;
   refl::ObjectDef<SwizzledLayoutNode>();
 }
+
+TVM_FFI_STATIC_INIT_BLOCK() { SwizzledLayoutNode::RegisterReflection(); }

82-86: Use ICHECK instead of CHECK for consistency

Elsewhere you use ICHECK; align this check as well.

-    CHECK(is_zero(forward_var[i]->dom->min));
+    ICHECK(is_zero(forward_var[i]->dom->min));

27-33: Make bit shifts 64-bit safe

1 << bits_ and 1 << base_ on int can overflow/UB for large values. Use 1LL.

-  int base = (1 << base_);
-  int mask = ((1 << bits_) - 1) << shift_;
+  int base = static_cast<int>(1LL << base_);
+  int mask = static_cast<int>(((1LL << bits_) - 1) << shift_);
src/transform/layout_reducer.h (1)

8-11: Make header self-sufficient for FFI types

ReducerInfoNode/ReducerInfo use String and FFI macros but this header doesn’t include the necessary FFI headers. Include the aliases header or specific FFI headers to avoid relying on transitive includes.

 #include <tvm/tir/op.h>
 
 #include "../layout/layout.h"
+#include "../support/ffi_aliases.h"
+#include <tvm/ffi/object.h>
src/target/codegen_cuda.h (1)

67-69: Type qualification inconsistency confirmed—refactor to maintain consistency across translation unit boundaries.

The header declares PrintFunctionSignature and PrintCallExtern with ffi::String and ffi::Array<PrimExpr>, while the .cc implementation uses bare String and Array<PrimExpr>. Although semantically equivalent due to using namespace ffi; at line 22 of codegen_cuda.cc, this mixed qualification style can obscure intent across TU boundaries and risks confusion in future maintenance.

Recommendation: Align the .cc definitions to use explicit ffi::String and ffi::Array for both methods to match the header declarations, or conversely, update the header to use bare types if the unqualified form is preferred in this codebase.

src/ir.cc (1)

246-249: Remove no-op ICHECK.

  • ICHECK(grid_size.size() >= 0) is always true; drop it.
-    ICHECK(grid_size.size() >= 0);
src/layout/layout.cc (1)

481-565: FFI GlobalDef and reflection blocks look valid.

Two static init blocks for API defs and reflection are fine; no duplicate registrations detected.

Minor: In LayoutNode::Inverse() and FragmentNode::Inverse(), returning std::move(...) is unnecessary for local values; consider returning the value directly for clarity.

Also applies to: 567-572
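
The std::move-on-return nitpick in general terms (a self-contained illustration, not the actual Inverse() body):

#include <vector>

std::vector<int> MakeSquares(int n) {
  std::vector<int> result;
  result.reserve(n);
  for (int i = 0; i < n; ++i) result.push_back(i * i);
  // NRVO / implicit move applies here; writing `return std::move(result);`
  // would actually inhibit copy elision.
  return result;
}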

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e28433e and 951f2de.

📒 Files selected for processing (107)
  • 3rdparty/tvm (1 hunks)
  • CMakeLists.txt (1 hunks)
  • cmake/load_tvm.cmake (1 hunks)
  • src/ir.cc (11 hunks)
  • src/layout/layout.cc (7 hunks)
  • src/layout/layout.h (5 hunks)
  • src/layout/swizzle.cc (3 hunks)
  • src/layout/swizzle.h (2 hunks)
  • src/layout/utils.cc (1 hunks)
  • src/layout/utils.h (1 hunks)
  • src/op/atomic_add.cc (3 hunks)
  • src/op/atomic_add.h (2 hunks)
  • src/op/copy.cc (5 hunks)
  • src/op/copy.h (4 hunks)
  • src/op/fill.cc (3 hunks)
  • src/op/fill.h (2 hunks)
  • src/op/finalize_reducer.cc (3 hunks)
  • src/op/finalize_reducer.h (2 hunks)
  • src/op/gemm.cc (5 hunks)
  • src/op/gemm.h (4 hunks)
  • src/op/gemm_py.cc (7 hunks)
  • src/op/gemm_py.h (2 hunks)
  • src/op/gemm_sp.cc (3 hunks)
  • src/op/gemm_sp.h (3 hunks)
  • src/op/logical.cc (2 hunks)
  • src/op/math.cc (1 hunks)
  • src/op/operator.cc (2 hunks)
  • src/op/operator.h (1 hunks)
  • src/op/parallel.cc (2 hunks)
  • src/op/parallel.h (2 hunks)
  • src/op/reduce.cc (5 hunks)
  • src/op/reduce.h (6 hunks)
  • src/op/region.cc (3 hunks)
  • src/op/region.h (2 hunks)
  • src/runtime/runtime.cc (4 hunks)
  • src/support/ffi_aliases.h (1 hunks)
  • src/target/codegen_cpp.cc (3 hunks)
  • src/target/codegen_cpp.h (2 hunks)
  • src/target/codegen_cuda.cc (4 hunks)
  • src/target/codegen_cuda.h (1 hunks)
  • src/target/codegen_hip.cc (1 hunks)
  • src/target/codegen_webgpu.cc (0 hunks)
  • src/target/codegen_webgpu.h (0 hunks)
  • src/target/intrin_rule_cuda.cc (1 hunks)
  • src/target/intrin_rule_hip.cc (2 hunks)
  • src/target/rt_mod_cpp.cc (2 hunks)
  • src/target/rt_mod_cuda.cc (3 hunks)
  • src/target/rt_mod_hip.cc (1 hunks)
  • src/target/utils.cc (6 hunks)
  • src/transform/align_dynamic_shared_memory_allocations.cc (5 hunks)
  • src/transform/annotate_device_regions.cc (2 hunks)
  • src/transform/annotate_warp_group_reg_alloc.cc (1 hunks)
  • src/transform/cluster_planning.cc (3 hunks)
  • src/transform/common/loop_parallel_transform_utils.h (2 hunks)
  • src/transform/common/loop_vectorization_utils.h (19 hunks)
  • src/transform/config_index_bitwidth.cc (3 hunks)
  • src/transform/eliminate_storage_sync_for_mbarrier.cc (3 hunks)
  • src/transform/flatten_buffer.cc (3 hunks)
  • src/transform/frontend_legalize.cc (1 hunks)
  • src/transform/if_stmt_binding.cc (2 hunks)
  • src/transform/inject_assumes.cc (1 hunks)
  • src/transform/inject_fence_proxy.cc (1 hunks)
  • src/transform/inject_pipeline.cc (5 hunks)
  • src/transform/inject_ptx_async_copy.cc (1 hunks)
  • src/transform/inject_tma_barrier.cc (6 hunks)
  • src/transform/layout_inference.cc (5 hunks)
  • src/transform/layout_reducer.cc (1 hunks)
  • src/transform/layout_reducer.h (1 hunks)
  • src/transform/legalize_safe_memory_access.cc (2 hunks)
  • src/transform/legalize_vectorized_loop.cc (1 hunks)
  • src/transform/loop_partition.cc (1 hunks)
  • src/transform/loop_vectorize.cc (1 hunks)
  • src/transform/loop_vectorize_dynamic.cc (2 hunks)
  • src/transform/lower_device_kernel_launch.cc (4 hunks)
  • src/transform/lower_device_storage_access_info.cc (1 hunks)
  • src/transform/lower_hopper_intrin.cc (2 hunks)
  • src/transform/lower_intrin.cc (7 hunks)
  • src/transform/lower_l2_persistent_annotation.cc (1 hunks)
  • src/transform/lower_opaque_block.cc (2 hunks)
  • src/transform/lower_shared_barrier.cc (2 hunks)
  • src/transform/lower_shared_tmem.cc (2 hunks)
  • src/transform/lower_thread_allreduce.cc (2 hunks)
  • src/transform/lower_tile_op.cc (4 hunks)
  • src/transform/make_packed_api.cc (6 hunks)
  • src/transform/merge_if_stmt.cc (1 hunks)
  • src/transform/merge_shared_memory_allocations.cc (5 hunks)
  • src/transform/multi_version_buffer_rewriter.cc (3 hunks)
  • src/transform/persist_threadblock.cc (1 hunks)
  • src/transform/pipeline_planning.cc (5 hunks)
  • src/transform/simplify.cc (5 hunks)
  • src/transform/split_host_device.cc (2 hunks)
  • src/transform/storage_access.cc (5 hunks)
  • src/transform/storage_access.h (1 hunks)
  • src/transform/storage_rewrite.cc (5 hunks)
  • src/transform/thread_storage_sync.cc (1 hunks)
  • src/transform/vectorize_loop.cc (20 hunks)
  • src/transform/warp_specialized_rewriter.cc (5 hunks)
  • src/transform/wgmma_sync_rewriter.cc (1 hunks)
  • tilelang/_ffi_api.py (1 hunks)
  • tilelang/contrib/dlpack.py (2 hunks)
  • tilelang/contrib/hipcc.py (2 hunks)
  • tilelang/contrib/nvcc.py (6 hunks)
  • tilelang/contrib/rocm.py (4 hunks)
  • tilelang/engine/callback.py (3 hunks)
  • tilelang/engine/lower.py (5 hunks)
  • tilelang/ir.py (2 hunks)
  • tilelang/layout/fragment.py (1 hunks)
⛔ Files not processed due to max files limit (4)
  • tilelang/layout/layout.py
  • tilelang/tileop/gemm/__init__.py
  • tilelang/transform/_ffi_api.py
  • tilelang/utils/tensor.py
💤 Files with no reviewable changes (2)
  • src/target/codegen_webgpu.h
  • src/target/codegen_webgpu.cc
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
PR: tile-ai/tilelang#794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/lower_shared_barrier.cc
  • src/transform/inject_tma_barrier.cc
  • src/transform/align_dynamic_shared_memory_allocations.cc
  • src/transform/lower_shared_tmem.cc
🧬 Code graph analysis (75)
src/transform/legalize_vectorized_loop.cc (1)
tilelang/transform/__init__.py (1)
  • LegalizeVectorizedLoop (241-249)
src/transform/persist_threadblock.cc (1)
tilelang/transform/__init__.py (1)
  • PersistThreadblock (392-395)
src/op/gemm_sp.cc (2)
src/op/finalize_reducer.h (1)
  • RegisterReflection (33-38)
src/op/gemm.h (2)
  • RegisterReflection (35-41)
  • RegisterReflection (106-128)
src/transform/inject_ptx_async_copy.cc (1)
tilelang/transform/__init__.py (1)
  • InjectPTXAsyncCopy (307-315)
src/transform/annotate_warp_group_reg_alloc.cc (1)
tilelang/transform/__init__.py (1)
  • AnnotateWarpGroupRegAlloc (204-216)
src/transform/inject_assumes.cc (1)
tilelang/transform/__init__.py (1)
  • InjectAssumes (83-91)
src/transform/flatten_buffer.cc (2)
src/transform/config_index_bitwidth.cc (16)
  • op (27-35)
  • op (27-27)
  • op (37-42)
  • op (37-37)
  • op (44-50)
  • op (44-44)
  • op (52-59)
  • op (52-52)
  • op (61-68)
  • op (61-61)
  • op (89-94)
  • op (89-89)
  • op (96-101)
  • op (96-96)
  • op (103-108)
  • op (103-103)
tilelang/transform/__init__.py (1)
  • FlattenBuffer (357-365)
tilelang/contrib/nvcc.py (2)
tilelang/engine/lower.py (1)
  • tilelang_callback_cuda_compile (56-89)
tilelang/contrib/nvrtc.py (1)
  • compile_cuda (13-110)
src/transform/lower_shared_barrier.cc (2)
src/transform/layout_inference.cc (10)
  • op (42-48)
  • op (42-42)
  • op (327-387)
  • op (327-327)
  • op (417-441)
  • op (417-417)
  • op (443-460)
  • op (443-443)
  • op (462-471)
  • op (462-462)
tilelang/transform/__init__.py (1)
  • LowerSharedBarrier (412-415)
src/transform/common/loop_parallel_transform_utils.h (2)
src/transform/common/loop_vectorization_utils.h (1)
  • tvm (39-783)
src/transform/layout_inference.cc (16)
  • op (42-48)
  • op (42-42)
  • op (327-387)
  • op (327-327)
  • op (417-441)
  • op (417-417)
  • op (443-460)
  • op (443-443)
  • op (462-471)
  • op (462-462)
  • op (657-669)
  • op (657-657)
  • op (706-801)
  • op (706-706)
  • op (803-812)
  • op (803-803)
src/transform/layout_reducer.cc (1)
tilelang/transform/__init__.py (1)
  • LayoutReducer (466-475)
src/layout/swizzle.h (2)
src/layout/swizzle.cc (3)
  • SwizzledLayoutNode (40-49)
  • SwizzledLayout (77-93)
  • SwizzledLayout (95-101)
src/layout/layout.h (1)
  • LayoutNode (24-209)
src/transform/common/loop_vectorization_utils.h (1)
src/transform/common/loop_parallel_transform_utils.h (1)
  • tvm (20-165)
src/transform/lower_hopper_intrin.cc (2)
src/op/builtin.cc (2)
  • cuTensorMapType (38-38)
  • cuTensorMapType (38-38)
tilelang/transform/__init__.py (1)
  • LowerHopperIntrin (94-103)
src/transform/storage_access.cc (2)
src/transform/merge_shared_memory_allocations.cc (6)
  • buf (215-232)
  • buf (215-215)
  • buffer_var (614-619)
  • buffer_var (614-614)
  • buffer (549-568)
  • buffer (549-549)
src/transform/thread_storage_sync.cc (21)
  • buf (57-59)
  • buf (57-57)
  • op (353-365)
  • op (353-353)
  • op (411-425)
  • op (411-411)
  • op (456-462)
  • op (456-456)
  • op (463-469)
  • op (463-463)
  • op (470-488)
  • op (470-470)
  • op (490-529)
  • op (490-490)
  • op (544-563)
  • op (544-544)
  • op (614-633)
  • op (614-614)
  • ICHECK (564-586)
  • buffer_var (539-541)
  • buffer_var (539-539)
src/transform/split_host_device.cc (1)
tilelang/transform/__init__.py (1)
  • SplitHostDevice (285-293)
src/layout/utils.cc (1)
src/transform/layout_inference.cc (12)
  • op (42-48)
  • op (42-42)
  • op (327-387)
  • op (327-327)
  • op (417-441)
  • op (417-417)
  • op (443-460)
  • op (443-443)
  • op (462-471)
  • op (462-462)
  • op (657-669)
  • op (657-657)
src/transform/loop_vectorize.cc (1)
src/transform/lower_tile_op.cc (2)
  • expr (433-445)
  • expr (433-433)
src/transform/inject_tma_barrier.cc (2)
src/transform/lower_hopper_intrin.cc (2)
  • call (102-132)
  • call (102-102)
src/transform/warp_specialized_rewriter.cc (18)
  • call (31-36)
  • call (31-31)
  • op (38-43)
  • op (38-38)
  • op (73-83)
  • op (73-73)
  • op (85-93)
  • op (85-85)
  • op (95-100)
  • op (95-95)
  • op (102-110)
  • op (102-102)
  • op (134-146)
  • op (134-134)
  • op (148-177)
  • op (148-148)
  • op (179-189)
  • op (179-179)
tilelang/layout/fragment.py (2)
src/layout/layout.cc (2)
  • Layout (57-69)
  • Layout (71-74)
tilelang/layout/layout.py (1)
  • Layout (13-148)
src/op/fill.h (2)
src/op/operator.h (2)
  • TileOperatorNode (56-66)
  • TileOperator (68-72)
src/op/fill.cc (1)
  • Fill (62-109)
src/layout/swizzle.cc (1)
src/layout/swizzle.h (1)
  • SwizzledLayout (58-66)
src/transform/layout_reducer.h (5)
src/op/atomic_add.h (1)
  • tvm (12-73)
src/op/finalize_reducer.h (1)
  • tvm (20-56)
src/op/gemm_sp.h (1)
  • tvm (13-104)
src/op/parallel.h (1)
  • tvm (22-148)
src/support/ffi_aliases.h (1)
  • tvm (10-16)
src/op/fill.cc (1)
src/op/fill.h (1)
  • RegisterReflection (29-35)
src/transform/eliminate_storage_sync_for_mbarrier.cc (2)
src/transform/lower_hopper_intrin.cc (2)
  • op (55-100)
  • op (55-55)
tilelang/transform/__init__.py (1)
  • EliminateStorageSyncForMBarrier (368-371)
src/op/atomic_add.cc (6)
src/op/finalize_reducer.h (1)
  • RegisterReflection (33-38)
src/op/gemm.h (2)
  • RegisterReflection (35-41)
  • RegisterReflection (106-128)
src/op/parallel.h (1)
  • RegisterReflection (72-78)
src/op/reduce.h (3)
  • RegisterReflection (35-38)
  • RegisterReflection (92-100)
  • RegisterReflection (137-144)
src/op/copy.h (2)
  • RegisterReflection (106-114)
  • RegisterReflection (311-321)
src/op/fill.h (1)
  • RegisterReflection (29-35)
src/op/finalize_reducer.cc (1)
src/op/finalize_reducer.h (1)
  • RegisterReflection (33-38)
src/transform/make_packed_api.cc (2)
src/transform/lower_device_kernel_launch.cc (6)
  • gvar (74-98)
  • gvar (74-74)
  • gvar (206-219)
  • gvar (206-206)
  • gvar (221-266)
  • gvar (221-221)
tilelang/transform/__init__.py (1)
  • MakePackedAPI (263-271)
src/transform/config_index_bitwidth.cc (1)
src/transform/flatten_buffer.cc (16)
  • op (76-81)
  • op (76-76)
  • op (83-88)
  • op (83-83)
  • op (90-95)
  • op (90-90)
  • op (97-101)
  • op (97-97)
  • op (103-106)
  • op (103-103)
  • op (111-144)
  • op (111-111)
  • op (146-217)
  • op (146-146)
  • op (219-225)
  • op (219-219)
src/op/finalize_reducer.h (3)
src/op/operator.h (1)
  • TileOperatorNode (56-66)
src/op/finalize_reducer.cc (1)
  • FinalizeReducerOp (35-40)
tilelang/ir.py (1)
  • FinalizeReducerOp (53-54)
src/op/atomic_add.h (2)
src/op/operator.h (2)
  • TileOperatorNode (56-66)
  • TileOperator (68-72)
src/op/atomic_add.cc (1)
  • AtomicAdd (44-69)
src/op/parallel.h (3)
src/op/parallel.cc (1)
  • ParallelOpNode (176-178)
src/op/operator.h (2)
  • TileOperatorNode (56-66)
  • TileOperator (68-72)
src/transform/layout_inference.cc (16)
  • op (42-48)
  • op (42-42)
  • op (327-387)
  • op (327-327)
  • op (417-441)
  • op (417-417)
  • op (443-460)
  • op (443-443)
  • op (462-471)
  • op (462-462)
  • op (657-669)
  • op (657-657)
  • op (706-801)
  • op (706-706)
  • op (803-812)
  • op (803-803)
src/transform/storage_rewrite.cc (1)
src/transform/flatten_buffer.cc (10)
  • op (76-81)
  • op (76-76)
  • op (83-88)
  • op (83-83)
  • op (90-95)
  • op (90-90)
  • op (97-101)
  • op (97-97)
  • op (103-106)
  • op (103-103)
tilelang/engine/lower.py (1)
tilelang/engine/param.py (1)
  • CompiledArtifact (107-117)
src/transform/wgmma_sync_rewriter.cc (1)
tilelang/transform/__init__.py (1)
  • RewriteWgmmaSync (117-125)
src/transform/thread_storage_sync.cc (1)
tilelang/transform/__init__.py (1)
  • ThreadSync (128-141)
src/transform/if_stmt_binding.cc (2)
src/transform/warp_specialized_rewriter.cc (14)
  • op (38-43)
  • op (38-38)
  • op (73-83)
  • op (73-73)
  • op (85-93)
  • op (85-85)
  • op (95-100)
  • op (95-95)
  • op (102-110)
  • op (102-102)
  • op (134-146)
  • op (134-134)
  • op (148-177)
  • op (148-148)
tilelang/transform/__init__.py (1)
  • IfStmtBinding (160-168)
src/op/gemm_py.cc (3)
src/transform/pipeline_planning.cc (12)
  • op (48-57)
  • op (48-48)
  • op (87-143)
  • op (87-87)
  • op (166-184)
  • op (166-166)
  • op (186-204)
  • op (186-186)
  • op (206-260)
  • op (206-206)
  • f (287-297)
  • f (287-287)
src/layout/layout.cc (4)
  • RegisterReflection (76-81)
  • RegisterReflection (76-76)
  • RegisterReflection (474-479)
  • RegisterReflection (474-474)
src/op/gemm_py.h (1)
  • RegisterReflection (38-60)
src/op/gemm.h (7)
src/op/atomic_add.h (1)
  • tvm (12-73)
src/op/finalize_reducer.h (1)
  • tvm (20-56)
src/op/gemm_sp.h (1)
  • tvm (13-104)
src/op/region.h (1)
  • tvm (72-116)
src/support/ffi_aliases.h (1)
  • tvm (10-16)
src/op/operator.h (2)
  • TileOperatorNode (56-66)
  • TileOperator (68-72)
src/op/gemm.cc (1)
  • Gemm (114-152)
src/support/ffi_aliases.h (4)
src/op/atomic_add.h (1)
  • tvm (12-73)
src/op/parallel.h (1)
  • tvm (22-148)
src/op/region.h (1)
  • tvm (72-116)
src/layout/layout.h (1)
  • Array (39-71)
src/op/reduce.cc (1)
src/op/reduce.h (4)
  • ReduceOp (120-126)
  • RegisterReflection (35-38)
  • RegisterReflection (92-100)
  • RegisterReflection (137-144)
src/transform/legalize_safe_memory_access.cc (1)
tilelang/transform/__init__.py (1)
  • LegalizeSafeMemoryAccess (252-260)
src/transform/lower_device_storage_access_info.cc (1)
tilelang/transform/__init__.py (1)
  • LowerDeviceStorageAccessInfo (318-330)
src/layout/layout.cc (4)
src/layout/swizzle.cc (2)
  • RegisterReflection (103-106)
  • RegisterReflection (103-103)
src/op/finalize_reducer.h (1)
  • RegisterReflection (33-38)
src/op/gemm.h (2)
  • RegisterReflection (35-41)
  • RegisterReflection (106-128)
src/op/parallel.h (1)
  • RegisterReflection (72-78)
src/op/region.cc (4)
src/op/gemm.h (2)
  • RegisterReflection (35-41)
  • RegisterReflection (106-128)
src/op/parallel.h (1)
  • RegisterReflection (72-78)
src/op/reduce.h (3)
  • RegisterReflection (35-38)
  • RegisterReflection (92-100)
  • RegisterReflection (137-144)
src/op/copy.h (2)
  • RegisterReflection (106-114)
  • RegisterReflection (311-321)
src/transform/multi_version_buffer_rewriter.cc (2)
src/transform/warp_specialized_rewriter.cc (16)
  • op (38-43)
  • op (38-38)
  • op (73-83)
  • op (73-73)
  • op (85-93)
  • op (85-85)
  • op (95-100)
  • op (95-95)
  • op (102-110)
  • op (102-102)
  • op (134-146)
  • op (134-134)
  • op (148-177)
  • op (148-148)
  • op (179-189)
  • op (179-179)
src/transform/inject_pipeline.cc (4)
  • buffer (398-453)
  • buffer (398-399)
  • buffer (461-471)
  • buffer (461-461)
src/transform/simplify.cc (3)
src/op/atomic_add.h (1)
  • tl (13-72)
src/layout/layout.cc (4)
  • RegisterReflection (76-81)
  • RegisterReflection (76-76)
  • RegisterReflection (474-479)
  • RegisterReflection (474-474)
tilelang/transform/simplify.py (1)
  • Simplify (19-27)
src/transform/inject_fence_proxy.cc (1)
tilelang/transform/__init__.py (1)
  • InjectFenceProxy (230-238)
src/transform/merge_if_stmt.cc (1)
tilelang/transform/__init__.py (1)
  • MergeIfStmt (171-179)
src/layout/layout.h (4)
src/op/atomic_add.h (1)
  • tl (13-72)
src/op/parallel.h (1)
  • tl (23-146)
src/layout/swizzle.h (1)
  • tl (13-68)
src/layout/layout.cc (6)
  • Layout (57-69)
  • Layout (71-74)
  • LayoutNode (48-55)
  • Fragment (318-340)
  • Fragment (342-352)
  • FragmentNode (302-316)
src/transform/align_dynamic_shared_memory_allocations.cc (1)
tilelang/transform/__init__.py (1)
  • AlignDynamicSharedMemoryAllocations (398-409)
src/transform/layout_inference.cc (3)
src/op/operator.cc (4)
  • ParseOperator (32-41)
  • ParseOperator (32-32)
  • ParseOperator (55-61)
  • ParseOperator (55-55)
src/op/parallel.h (1)
  • ParallelOp (137-145)
tilelang/transform/__init__.py (1)
  • LayoutInference (39-47)
src/op/gemm_py.h (2)
src/op/operator.h (2)
  • TileOperatorNode (56-66)
  • TileOperator (68-72)
src/op/gemm_py.cc (1)
  • GemmPy (50-80)
src/transform/frontend_legalize.cc (1)
tilelang/transform/simplify.py (1)
  • LetInline (8-16)
src/ir.cc (3)
src/op/gemm_sp.h (1)
  • tl (15-103)
src/op/parallel.h (2)
  • tl (23-146)
  • RegisterReflection (72-78)
src/transform/layout_reducer.h (1)
  • tl (58-87)
src/transform/lower_intrin.cc (1)
tilelang/transform/__init__.py (1)
  • LowerIntrin (441-444)
src/transform/lower_shared_tmem.cc (2)
src/transform/lower_tile_op.cc (10)
  • op (96-102)
  • op (96-96)
  • op (117-151)
  • op (117-117)
  • op (181-186)
  • op (181-181)
  • op (193-207)
  • op (193-193)
  • op (282-316)
  • op (282-282)
tilelang/transform/__init__.py (1)
  • LowerSharedTmem (460-463)
src/op/copy.cc (4)
src/layout/layout.cc (4)
  • RegisterReflection (76-81)
  • RegisterReflection (76-76)
  • RegisterReflection (474-479)
  • RegisterReflection (474-474)
src/op/gemm.h (2)
  • RegisterReflection (35-41)
  • RegisterReflection (106-128)
src/op/parallel.h (1)
  • RegisterReflection (72-78)
src/op/copy.h (2)
  • RegisterReflection (106-114)
  • RegisterReflection (311-321)
src/target/codegen_cpp.h (1)
src/target/codegen_cpp.cc (2)
  • GenerateForwardFunctionDeclarations (65-87)
  • GenerateForwardFunctionDeclarations (65-68)
src/op/region.h (2)
src/op/operator.h (2)
  • TileOperatorNode (56-66)
  • TileOperator (68-72)
src/op/region.cc (1)
  • RegionOp (34-52)
src/transform/loop_vectorize_dynamic.cc (2)
src/transform/make_packed_api.cc (2)
  • op (168-189)
  • op (168-168)
tilelang/transform/__init__.py (1)
  • LoopVectorizeDynamic (333-342)
src/op/parallel.cc (7)
src/layout/layout.cc (4)
  • RegisterReflection (76-81)
  • RegisterReflection (76-76)
  • RegisterReflection (474-479)
  • RegisterReflection (474-474)
src/layout/swizzle.cc (2)
  • RegisterReflection (103-106)
  • RegisterReflection (103-103)
src/op/finalize_reducer.h (1)
  • RegisterReflection (33-38)
src/op/gemm.h (2)
  • RegisterReflection (35-41)
  • RegisterReflection (106-128)
src/op/parallel.h (1)
  • RegisterReflection (72-78)
src/op/copy.h (2)
  • RegisterReflection (106-114)
  • RegisterReflection (311-321)
src/op/fill.h (1)
  • RegisterReflection (29-35)
src/transform/annotate_device_regions.cc (3)
src/transform/make_packed_api.cc (2)
  • op (168-189)
  • op (168-168)
src/transform/layout_inference.cc (14)
  • op (42-48)
  • op (42-42)
  • op (327-387)
  • op (327-327)
  • op (417-441)
  • op (417-417)
  • op (443-460)
  • op (443-443)
  • op (462-471)
  • op (462-462)
  • op (657-669)
  • op (657-657)
  • op (706-801)
  • op (706-706)
tilelang/transform/__init__.py (1)
  • AnnotateDeviceRegions (274-282)
src/transform/inject_pipeline.cc (2)
src/transform/multi_version_buffer_rewriter.cc (2)
  • buffer (255-265)
  • buffer (255-255)
src/transform/pipeline_planning.cc (14)
  • op (48-57)
  • op (48-48)
  • op (87-143)
  • op (87-87)
  • op (166-184)
  • op (166-166)
  • op (186-204)
  • op (186-186)
  • op (206-260)
  • op (206-206)
  • op (262-272)
  • op (262-262)
  • op (703-712)
  • op (703-703)
src/transform/lower_tile_op.cc (2)
src/transform/legalize_safe_memory_access.cc (10)
  • op (33-46)
  • op (33-33)
  • op (64-74)
  • op (64-64)
  • op (76-84)
  • op (76-76)
  • op (158-180)
  • op (158-158)
  • op (182-214)
  • op (182-182)
src/op/operator.cc (4)
  • ParseOperator (32-41)
  • ParseOperator (32-32)
  • ParseOperator (55-61)
  • ParseOperator (55-55)
src/transform/warp_specialized_rewriter.cc (2)
src/transform/annotate_warp_group_reg_alloc.cc (10)
  • op (30-47)
  • op (30-30)
  • op (49-54)
  • op (49-49)
  • op (71-78)
  • op (71-71)
  • op (96-105)
  • op (96-96)
  • op (107-166)
  • op (107-107)
tilelang/transform/__init__.py (1)
  • WarpSpecialized (193-201)
src/transform/cluster_planning.cc (1)
tilelang/transform/__init__.py (1)
  • ClusterPlanning (17-25)
src/target/codegen_cuda.h (1)
src/target/codegen_cuda.cc (12)
  • PrintFunctionSignature (2675-2734)
  • PrintFunctionSignature (2675-2677)
  • t (26-64)
  • t (26-26)
  • t (68-75)
  • t (68-68)
  • t (79-95)
  • t (79-79)
  • t (99-107)
  • t (99-100)
  • PrintCallExtern (974-1030)
  • PrintCallExtern (974-977)
src/target/codegen_cuda.cc (1)
src/transform/warp_specialized_rewriter.cc (16)
  • op (38-43)
  • op (38-38)
  • op (73-83)
  • op (73-73)
  • op (85-93)
  • op (85-85)
  • op (95-100)
  • op (95-95)
  • op (102-110)
  • op (102-102)
  • op (134-146)
  • op (134-134)
  • op (148-177)
  • op (148-148)
  • op (179-189)
  • op (179-179)
src/transform/merge_shared_memory_allocations.cc (2)
src/transform/layout_inference.cc (6)
  • op (42-48)
  • op (42-42)
  • op (327-387)
  • op (327-327)
  • op (417-441)
  • op (417-417)
tilelang/transform/__init__.py (1)
  • MergeSharedMemoryAllocations (374-383)
src/op/reduce.h (8)
src/op/atomic_add.h (1)
  • tvm (12-73)
src/op/finalize_reducer.h (2)
  • tvm (20-56)
  • RegisterReflection (33-38)
src/op/gemm_sp.h (1)
  • tvm (13-104)
src/op/region.h (1)
  • tvm (72-116)
src/support/ffi_aliases.h (1)
  • tvm (10-16)
src/op/fill.h (2)
  • tvm (12-53)
  • RegisterReflection (29-35)
src/op/operator.h (2)
  • TileOperatorNode (56-66)
  • TileOperator (68-72)
src/op/reduce.cc (2)
  • ReduceOp (24-33)
  • CumSumOp (509-523)
src/op/operator.h (1)
src/transform/layout_reducer.h (2)
  • Object (63-70)
  • ObjectRef (72-76)
src/op/gemm_sp.h (10)
src/op/gemm.h (1)
  • GemmWarpPolicyNode (27-57)
src/transform/layout_reducer.h (1)
  • ObjectRef (72-76)
src/op/atomic_add.h (1)
  • tvm (12-73)
src/op/finalize_reducer.h (1)
  • tvm (20-56)
src/op/parallel.h (1)
  • tvm (22-148)
src/op/region.h (1)
  • tvm (72-116)
src/support/ffi_aliases.h (1)
  • tvm (10-16)
src/op/fill.h (1)
  • tvm (12-53)
src/op/operator.h (2)
  • TileOperatorNode (56-66)
  • TileOperator (68-72)
src/op/gemm_sp.cc (1)
  • GemmSP (86-109)
src/transform/lower_l2_persistent_annotation.cc (1)
tilelang/transform/__init__.py (1)
  • LowerL2Persistent (386-389)
src/op/copy.h (2)
src/op/operator.h (1)
  • TileOperatorNode (56-66)
src/op/copy.cc (1)
  • Conv2DIm2ColOp (1725-1738)
🪛 Ruff (0.14.1)
tilelang/contrib/nvcc.py

186-186: Unused function argument: target

(ARG001)

Comment on lines +14 to 16
#include "../support/ffi_aliases.h"
#include "../target/utils.h"
#include "tvm/ffi/string.h"


⚠️ Potential issue | 🟠 Major

Ensure ffi::Function is declared

This file uses ffi::Function::GetGlobal but doesn’t include <tvm/ffi/function.h>. Either include it here or rely on the aliases header after adding the missing include there (preferred).

+#include <tvm/ffi/function.h>
 #include "../support/ffi_aliases.h"
🤖 Prompt for AI Agents
In src/op/gemm_py.cc around lines 14 to 16, the file calls
ffi::Function::GetGlobal but does not include the declaration for ffi::Function;
add #include <tvm/ffi/function.h> so the symbol is declared. Preferably, add
that include to src/support/ffi_aliases.h (which this file already includes) so
all files using ffi::Function get the declaration transitively; otherwise add
the include directly in src/op/gemm_py.cc.

}

runtime::Module BuildTileLangCUDA(IRModule mod, Target target) {
ffi::Module BuildTileLangCUDA(IRModule mod, Target target) {

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Return type likely mismatched with factory return (ffi::Module vs runtime::Module).

runtime::CUDAModuleCreate(...) returns runtime::Module, but these functions return ffi::Module. Unless tvm::ffi::Module is an alias of tvm::runtime::Module, this won’t compile.

Apply one of:

  • Preferred: keep FFI-neutral and return runtime::Module.
  • If you do intend to use an alias, ensure tvm::ffi::Module is defined and visible here (otherwise switch back to runtime::Module).
-ffi::Module BuildTileLangCUDA(IRModule mod, Target target) {
+runtime::Module BuildTileLangCUDA(IRModule mod, Target target) {
...
-ffi::Module BuildTileLangCUDAWithoutCompile(IRModule mod, Target target) {
+runtime::Module BuildTileLangCUDAWithoutCompile(IRModule mod, Target target) {

Run the script below to confirm whether tvm::ffi::Module exists and is an alias to tvm::runtime::Module:

Also applies to: 74-74


🏁 Script executed:

#!/usr/bin/env bash
set -euo pipefail
# Find any definition/alias of ffi::Module
rg -nP -C2 '\bnamespace\s+ffi\b|using\s+Module\s*=' --glob '!**/third_party/**'
# Check all uses of CUDAModuleCreate and its expected return type
rg -nP -C2 'CUDAModuleCreate\s*\(' --glob '!**/third_party/**'

Length of output: 3253


I need to inspect the actual file and search more thoroughly for any ffi::Module definition or typedef.



🏁 Script executed:

#!/bin/bash
# View the actual function signatures at lines 41 and 74
head -100 src/target/rt_mod_cuda.cc | tail -60

# Search for any Module typedef/using in ffi namespace more broadly
rg -nP 'class\s+Module|struct\s+Module|using\s+Module|typedef.*Module' --glob '!**/third_party/**' -A2 -B2 | head -50

# Check includes in rt_mod_cuda.cc for ffi-related headers
head -50 src/target/rt_mod_cuda.cc

Length of output: 3742




🏁 Script executed:

#!/bin/bash
# Search for ffi::Module in header files (likely in include/tvm/ffi or similar)
rg -nP 'class\s+Module|struct\s+Module' src/target/rt_mod_cuda.cc --context=30

# Find ffi header files
fd -e 'h$|hpp$' 'ffi' --glob '!**/third_party/**' | head -20

# Search for Module in ffi-related includes
rg -nP '\bModule\b' include/tvm/ffi --type=h -A1 -B1 | grep -i module | head -40

# Check if ffi namespace re-exports runtime::Module
rg -nP 'using.*Module|Module.*=' include/tvm/ffi --type=h

Length of output: 217




🏁 Script executed:

#!/bin/bash
# Find all header files with ffi in them
find . -type f \( -name '*ffi*.h' -o -name '*ffi*.hpp' \) -not -path '*/third_party/*' 2>/dev/null | head -30

# Search for ffi namespace definition and Module references
rg -nP 'namespace\s+ffi\s*\{|class\s+Module\s*\{|struct\s+Module\s*\{' --type=h --type=cc -A3 | head -60

# Check the actual include path in rt_mod_cuda.cc and trace it
rg -nP '#include.*ffi' src/target/rt_mod_cuda.cc

# Search for runtime::Module definition to understand the type
rg -nP 'class\s+Module|struct\s+Module' --type=h --type=cc -B2 -A5 2>/dev/null | grep -A7 'runtime' | head -40

# Check if ffi::Module is used elsewhere in the codebase
rg -nP 'ffi::Module' --type=h --type=cc | head -20

Length of output: 124




🏁 Script executed:

#!/bin/bash
# View the ffi_aliases.h file which likely contains Module alias
cat src/support/ffi_aliases.h

# Search for Module in runtime namespace
grep -rn "class Module\|struct Module" src/ include/ 2>/dev/null | grep -i runtime

# Check if ffi::Module is used elsewhere
grep -rn "ffi::Module" src/ include/ 2>/dev/null | head -30

# Find runtime module creation function
grep -rn "CUDAModuleCreate" src/ 2>/dev/null

Length of output: 816




🏁 Script executed:

#!/bin/bash
# Search for ALL Module definitions/usages
grep -rn "^.*Module" src/support/ src/target/*.h src/target/*.cc 2>/dev/null | grep -E "(class|struct|using|typedef).*Module" | head -40

# Check rt_mod_cpp.cc for context
cat src/target/rt_mod_cpp.cc | head -50

# Search in include directories for Module
find . -path './third_party' -prune -o -type f -name '*.h' -print 2>/dev/null | xargs grep -l "class Module\|struct Module" 2>/dev/null | head -20

# Look for ffi::Module specifically
grep -rn "ffi::Module" . --include="*.h" --include="*.cpp" 2>/dev/null | grep -v third_party | head -30

# Check if Module is in runtime namespace and its definition
grep -rn "namespace runtime" src/ include/ 2>/dev/null | head -5
grep -A10 "class Module" src/runtime/ include/tvm/runtime/ 2>/dev/null | head -50

Length of output: 1859




🏁 Script executed:

#!/bin/bash
# Get ALL includes from rt_mod_cuda.cc
head -20 src/target/rt_mod_cuda.cc

# Check if codegen_cuda.h includes tvm/ffi/extra/module.h
cat src/target/codegen_cuda.h 2>/dev/null | head -30

# Check if runtime/cuda/cuda_module.h includes it
find . -name "cuda_module.h" -type f | head -5

# Search for where ffi::Module is actually defined/declared
grep -rn "ffi::Module\|namespace.*ffi" . --include="*.h" 2>/dev/null | grep -v third_party | grep -E "(class|struct|using|typedef|namespace)" | head -50

Length of output: 2406




🏁 Script executed:

#!/bin/bash
# Check what codegen_cuda.h includes - it was partially shown, get full file
cat src/target/codegen_cuda.h

# Check runtime/cuda/cuda_module.h if it exists
find . -name "cuda_module.h" -type f ! -path "*/third_party/*" -exec cat {} \; 2>/dev/null | head -50

# Search for where ffi::Module is actually defined
grep -rn "class Module\|struct Module" . --include="*.h" 2>/dev/null | grep -v third_party | head -30

# Check if any header brings in ffi/extra/module.h transitively
grep -rn "ffi/extra/module.h\|ffi::Module" . --include="*.h" 2>/dev/null | grep -v third_party

Length of output: 5913


Add missing #include <tvm/ffi/extra/module.h> and align return type with actual value returned.

The functions on lines 41 and 74 declare ffi::Module as the return type but are missing the header that defines this type. Additionally, runtime::CUDAModuleCreate() returns runtime::Module, not ffi::Module. The parallel file src/target/rt_mod_cpp.cc correctly includes <tvm/ffi/extra/module.h> and declares the same return type, but rt_mod_cuda.cc lacks this include and has the type mismatch.

Apply the suggested fix:

+#include <tvm/ffi/extra/module.h>
 #include "codegen_cuda.h"
 #include "runtime/cuda/cuda_module.h"
 #include "runtime/pack_args.h"
 #include <tvm/ffi/reflection/registry.h>

And change the return types:

-ffi::Module BuildTileLangCUDA(IRModule mod, Target target) {
+runtime::Module BuildTileLangCUDA(IRModule mod, Target target) {
-ffi::Module BuildTileLangCUDAWithoutCompile(IRModule mod, Target target) {
+runtime::Module BuildTileLangCUDAWithoutCompile(IRModule mod, Target target) {

This matches the actual return value from runtime::CUDAModuleCreate().

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/target/rt_mod_cuda.cc around lines 41 to 41 (and also the second function
around line 74), add the missing include #include <tvm/ffi/extra/module.h> at
the top of the file and change the return type of the functions that currently
declare ffi::Module to runtime::Module so the signature matches the actual value
returned by runtime::CUDAModuleCreate(); keep function bodies unchanged.

if (const auto *var_node = expr.as<VarNode>()) {
Var var = GetRef<Var>(var_node);
Var var = tvm::ffi::GetRef<Var>(var_node);
auto it = let_bindings_.find(var);

⚠️ Potential issue | 🔴 Critical

FFI GetRef migration — LGTM; fix IterVar bound lookup.

  • Var/Stmt GetRef updates and static-init macro usage are correct.
  • Issue: In thread bounds computation, Analyzer is called with IterVar instead of its Var.

Apply this diff:

-    if (analyzer_->const_int_bound.IsBound(thread_var_->var)) {
-      auto const_int_bound = analyzer_->const_int_bound(thread_var_);
+    if (analyzer_->const_int_bound.IsBound(thread_var_->var)) {
+      auto const_int_bound = analyzer_->const_int_bound(thread_var_->var);

Also applies to: 615-615, 655-657, 734-737

🤖 Prompt for AI Agents
In src/transform/lower_tile_op.cc around lines 439 (and also at 615, 655-657,
734-737), the code passes an IterVar object into the Analyzer call when
computing thread bounds; instead, extract and pass the underlying Var from the
IterVar (e.g., use iter_var->var or iter_var->var.get() per existing code
patterns) so Analyzer receives the Var key it expects. Update each location to
obtain the IterVar's Var before calling the Analyzer-bound lookup and use that
Var in place of the IterVar argument.

Comment on lines +243 to +245
host_mod, device_mod, params, codegen_mod.inspect_source(), rt_mod=host_mod)

return CompiledArtifact(host_mod, device_mod, params, codegen_mod.get_source())
return CompiledArtifact(host_mod, device_mod, params, codegen_mod.inspect_source())

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

get_source → inspect_source: confirm presence on all backends.

inspect_source() is the right API on newer TVM, but ensure all paths (CUDA/HIP/LLVM/C/Metal/WebGPU) return a Module that implements inspect_source(), otherwise Artifact.kernel_source may be None.


🏁 Script executed:

#!/bin/bash
# Grep for inspect_source uses and ensure we don’t still call get_source elsewhere.
rg -n 'get_source\(|inspect_source\(' -g 'tilelang/**'

Length of output: 323


Let me gather more context to verify the scope and completeness of the migration by examining the codebase directly.

I need to inspect the actual codebase to understand what inspect_source() is and whether it's a valid method.


🏁 Script executed:

#!/bin/bash
# Check if inspect_source is a custom method in tilelang
rg -n 'def inspect_source|\.inspect_source' tilelang/ --type py -B 2 -A 2

# Also check for codegen_mod type and imports in lower.py
echo "=== lower.py context around 243-245 ==="
sed -n '230,250p' tilelang/engine/lower.py

# Check base.py context
echo "=== base.py context around line 52 ==="
sed -n '40,65p' tilelang/jit/adapter/base.py

Length of output: 1581


Let me search for the device_codegen function and inspect_source imports to verify if this is a valid method.


🏁 Script executed:

#!/bin/bash
# Find device_codegen definition
echo "=== Searching for device_codegen definition ==="
rg -n 'def device_codegen\b|^device_codegen\s*=' tilelang/ -A 3

# Check for inspect_source imports
echo "=== Searching for inspect_source in imports ==="
rg -n 'inspect_source' tilelang/ --type py -B 1 -A 1

# Check what codegen_mod type should be - look at return type annotations
echo "=== Checking device_codegen and device_codegen_without_compile signatures ==="
rg -n 'device_codegen|device_codegen_without_compile' tilelang/engine/lower.py -B 2 -A 2

# Check imports at top of lower.py
echo "=== Imports in lower.py ==="
head -40 tilelang/engine/lower.py

Length of output: 2878


🏁 Script executed:

#!/bin/bash
# Find device_codegen definition
echo "=== Searching for device_codegen definition ==="
rg -n 'def device_codegen\b' tilelang/ --type py -A 5

# Check for inspect_source imports or usage
echo "=== Searching for inspect_source anywhere in codebase ==="
rg -n 'inspect_source' tilelang/ --type py

# Check what codegen_mod type annotation is
echo "=== Checking lower.py imports and device_codegen calls ==="
head -50 tilelang/engine/lower.py

# Find where device_codegen comes from
echo "=== Search for device_codegen in entire tilelang ==="
rg -n 'device_codegen' tilelang/ --type py -g '!lower.py' -B 1

Length of output: 2396


Let me verify if tvm.IRModule has an inspect_source() method by checking TVM module API.

Let me verify one more time if inspect_source() is defined anywhere in the tilelang custom code or if this is purely broken code.


🏁 Script executed:

#!/bin/bash
# Comprehensive search for inspect_source definition
echo "=== Searching for ANY definition of inspect_source ==="
find tilelang -type f -name "*.py" -exec grep -l "inspect_source" {} \;

# Look for monkey-patching or property decoration
echo "=== Searching for property or method definitions that might alias inspect_source ==="
rg -n "@property|def " tilelang/ --type py -B 1 -A 1 | grep -C 2 "source"

# Check if it's being imported from external library
echo "=== Checking tvm module or tilelang.tvm for inspect_source ==="
python3 -c "from tilelang import tvm; print(dir(tvm))" 2>&1 | grep -i inspect

Length of output: 6901


Replace inspect_source() with get_source() on lines 243 and 245.

The method inspect_source() does not exist on tvm.IRModule. Calling it will fail at runtime with AttributeError. The standard TVM API for retrieving source code is get_source(), as evidenced by its use elsewhere in the codebase (e.g., tilelang/jit/adapter/base.py:52).

# Lines 243, 245 should use:
codegen_mod.get_source()
# instead of:
codegen_mod.inspect_source()
🤖 Prompt for AI Agents
In tilelang/engine/lower.py around lines 243 to 245, the code calls
codegen_mod.inspect_source() which does not exist on tvm.IRModule and will raise
AttributeError; replace both calls to use codegen_mod.get_source() instead so
the module source is retrieved via the correct TVM API.
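If there is lingering doubt about which API a given codegen module exposes across backends, a small defensive helper can bridge the transition. This is a sketch only; the two method names are exactly the ones discussed above, and the preference order follows the conclusion that get_source() is the correct call here.

def module_source(mod) -> str:
    """Return the kernel source from a codegen module, tolerating both APIs."""
    # Prefer get_source(), the API this codebase relies on elsewhere; fall back
    # to inspect_source() for module types that only expose the newer name.
    if hasattr(mod, "get_source"):
        return mod.get_source()
    return mod.inspect_source()

# usage sketch:
# return CompiledArtifact(host_mod, device_mod, params, module_source(codegen_mod))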

@Hzfengsy
Collaborator

Thanks @LeiWang1999

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/target/codegen_cuda.cc (1)

23-23: Limit namespace pollution from ffi.

using namespace ffi; in a large TU can shadow or collide with nearby symbols. Prefer explicit tvm::ffi::... or local using declarations for specific types/functions.

Apply:

-using namespace ffi;
+// Prefer explicit tvm::ffi:: qualifiers at use sites to avoid namespace pollution.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 21e7a0a and cd6daaf.

📒 Files selected for processing (11)
  • 3rdparty/tvm (1 hunks)
  • src/op/copy.cc (7 hunks)
  • src/target/codegen_cuda.cc (4 hunks)
  • src/transform/loop_partition.cc (1 hunks)
  • testing/python/jit/test_tilelang_jit_gemm_ctypes.py (2 hunks)
  • tilelang/contrib/hipcc.py (2 hunks)
  • tilelang/contrib/nvcc.py (6 hunks)
  • tilelang/engine/callback.py (3 hunks)
  • tilelang/engine/lower.py (5 hunks)
  • tilelang/layout/fragment.py (1 hunks)
  • tilelang/layout/layout.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • src/transform/loop_partition.cc
  • tilelang/engine/callback.py
  • tilelang/contrib/hipcc.py
  • 3rdparty/tvm
  • tilelang/layout/fragment.py
  • src/op/copy.cc
  • tilelang/engine/lower.py
🧰 Additional context used
🧬 Code graph analysis (3)
testing/python/jit/test_tilelang_jit_gemm_ctypes.py (2)
testing/python/jit/test_tilelang_jit_gemm_cython.py (1)
  • test_gemm_f16f16f16_nn (100-114)
testing/python/jit/test_tilelang_jit_callback.py (1)
  • test_gemm_f16f16f16_nn (101-115)
src/target/codegen_cuda.cc (1)
src/transform/warp_specialized_rewriter.cc (16)
  • op (38-43)
  • op (38-38)
  • op (73-83)
  • op (73-73)
  • op (85-93)
  • op (85-85)
  • op (95-100)
  • op (95-95)
  • op (102-110)
  • op (102-102)
  • op (134-146)
  • op (134-134)
  • op (148-177)
  • op (148-148)
  • op (179-189)
  • op (179-179)
tilelang/contrib/nvcc.py (2)
tilelang/engine/lower.py (1)
  • tilelang_callback_cuda_compile (57-90)
tilelang/contrib/nvrtc.py (1)
  • compile_cuda (14-111)
🪛 Ruff (0.14.1)
tilelang/contrib/nvcc.py

186-186: Unused function argument: target

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Quick Lint
  • GitHub Check: Build wheels for Python 3.8 on macos-latest with Metal
  • GitHub Check: Build wheels for Python 3.8 on ubuntu-24.04-arm with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.8 on ubuntu-latest with CUDA-12.1
🔇 Additional comments (9)
testing/python/jit/test_tilelang_jit_gemm_ctypes.py (2)

88-88: LGTM! FFI migration applied correctly.

The migration from @tvm.register_func to @tvm.register_global_func with explicit function naming aligns with the TVM v0.22.0 FFI update. The override=True parameter appropriately allows the test to replace any existing registration.
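For readers unfamiliar with the new decorator, a minimal sketch of the pattern is below; the global name and callback body are illustrative rather than copied from the test.

from tvm_ffi import register_global_func

@register_global_func("tilelang_callback_cuda_postproc", override=True)
def _cuda_postproc(code, _target):
    # override=True replaces any prior registration under the same global name,
    # which is what lets the test swap in its own callback.
    return code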


410-411: Ignore this review comment—this change follows an established codebase pattern.

The change reflects a deliberate design choice, not a temporary debug modification. Five test files across different directories (kernel, transform, jit) use the identical pattern of commenting out tilelang.testing.main() and calling specific test functions directly:

  • testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
  • testing/python/transform/test_tilelang_transform_inject_set_max_nreg.py
  • testing/python/transform/test_tilelang_transform_layout_inference.py
  • testing/python/kernel/test_tilelang_kernel_bf16_gemm_mma.py
  • testing/python/jit/test_tilelang_jit_gemm_ctypes.py (this file)

This consistent pattern across multiple files indicates intentional design, not ad-hoc debugging. The review's assumption that this is temporary and should be reverted contradicts established codebase practice.

Likely an incorrect or invalid review comment.
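For reference, the shared pattern looks roughly like this (the test name is taken from the code-graph listing above and stands in for whichever test each file calls directly):

if __name__ == "__main__":
    # tilelang.testing.main()
    test_gemm_f16f16f16_nn()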

tilelang/layout/layout.py (1)

5-5: LGTM: FFI migration verified and complete.

The migration from tvm.ffi to tvm_ffi has been comprehensively applied across the entire codebase. The @tvm_ffi.register_object() decorator is confirmed functional and actively used for 15+ class registrations (Fill, AtomicAdd, Copy, Gemm, Layout, Fragment, GemmPy, and others) as well as global function registrations. No orphaned old-style patterns remain, and all imports are consistent.
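A hedged sketch of one such registration under the new import; the type key "tl.Layout" and the base class are assumptions chosen to match the class names listed above, not copied from the file.

import tvm_ffi

@tvm_ffi.register_object("tl.Layout")
class Layout(tvm_ffi.Object):
    """Python mirror of the C++ LayoutNode registered through the FFI."""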

src/target/codegen_cuda.cc (4)

2025-2026: Correct migration to tvm::ffi::GetRef for tl_gemm.

Switching to tvm::ffi::GetRef<PrimExpr>(op) is consistent with the new FFI. Looks good.

Confirm the TU includes the header that declares tvm::ffi::GetRef (e.g., <tvm/ffi/object.h> or the umbrella that provides it). If not, add it to avoid ADL/dep-inclusion pitfalls.


2034-2035: Same as above for tl_gemm_sp.

The FFI-based GetRef usage is consistent and correct.

Same header check as the previous comment to ensure GetRef is declared in this TU.


2308-2310: FFI GetRef in Ramp CHECK message is fine.

Diagnostic now prints an FFI-backed Ramp ref; no functional change. OK.


2811-2811: Optional truthiness check is acceptable.

ICHECK(global_symbol) is equivalent to defined() for Optional<String> on modern TVM. Safe change.

If you target older compilers/TVM variants, keep defined() for clarity; otherwise this is fine.

tilelang/contrib/nvcc.py (2)

11-12: LGTM! Import changes align with FFI migration.

The new imports correctly support the migration to tvm_ffi.register_global_func and maintain access to TVM utilities through the local tvm alias.


185-185: LGTM! Consistent FFI decorator migration.

All function registrations have been properly migrated from the legacy @tvm.ffi.register_func to @tvm_ffi.register_global_func with override=True, maintaining backward compatibility through unchanged registered names.

Also applies to: 192-192, 257-257, 403-403, 416-416, 433-433

Comment on lines 186 to 189
def tilelang_callback_cuda_compile(code, target): # pylint: disable=unused-argument
"""use nvcc to generate fatbin code for better optimization"""
ptx = compile_cuda(code, target_format="fatbin")
return ptx

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify whether the target parameter should be used.

The target parameter is accepted but unused in the callback. The function relies on Target.current() within compile_cuda to determine architecture. However, the caller in tilelang/engine/lower.py explicitly extracts architecture from the target parameter (nvcc.get_target_compute_version(target)), suggesting explicit usage might be preferred over ambient context.

Consider either:

  1. Using the target parameter explicitly if it provides more reliable architecture detection, or
  2. Documenting why relying on Target.current() is the intended design

Based on learnings (static analysis hints).


🏁 Script executed:

#!/bin/bash
# Search for other usages of tilelang_callback_cuda_compile to understand expected behavior
rg -n "tilelang_callback_cuda_compile" -A5 -B5

Length of output: 2342


Based on my verification, the unused target parameter in the tilelang_callback_cuda_compile function in tilelang/contrib/nvcc.py is indeed a concern.

The underlying compile_cuda function accepts an arch parameter and defaults to Target.current() when none is provided, so the current implementation relies on ambient context. Compiling with an explicit architecture flag yields faster runtime because code generation happens at compile time, whereas omitting it forces JIT compilation by the CUDA driver; given the docstring's promise of "better optimization", the target should be used explicitly.

Use the target parameter to pass architecture to compile_cuda.

The target parameter is passed by the C++ caller but unused. To achieve the promised "better optimization," extract the compute version from the target and pass it as the arch parameter to compile_cuda, or use the target to establish the proper compilation context. This ensures architecture-specific optimization occurs at compile time rather than relying on ambient Target.current() context.
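A sketch of the explicit-arch variant, assuming compile_cuda accepts an arch string of the form "sm_XY" and that get_target_compute_version returns "major.minor"; both assumptions should be checked against the real signatures before adopting it.

from tilelang.contrib.nvcc import compile_cuda, get_target_compute_version

def tilelang_callback_cuda_compile(code, target):
    """Use nvcc to generate fatbin code, pinning the arch from the given target."""
    major, minor = get_target_compute_version(target).split(".")
    return compile_cuda(code, target_format="fatbin", arch=f"sm_{major}{minor}")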

🧰 Tools
🪛 Ruff (0.14.1)

186-186: Unused function argument: target

(ARG001)

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
requirements-test.txt (1)

15-15: Remove redundant apache-tvm-ffi from test requirements.

Since apache-tvm-ffi is now a runtime requirement (in requirements.txt), it is automatically available during test execution. Listing it in test requirements is redundant and increases maintenance burden. Test environments typically install both runtime and test dependencies.

Apply this diff to remove the duplicate:

-apache-tvm-ffi
 cffi
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 995315e and 7ef5d01.

📒 Files selected for processing (2)
  • requirements-test.txt (2 hunks)
  • requirements.txt (1 hunks)
🔇 Additional comments (1)
requirements-test.txt (1)

15-15: Review comment is based on incorrect assumptions about apache-tvm-ffi versioning and TVM dependency model.

The codebase manages TVM as a C++ build dependency via CMake submodule (3rdparty/tvm), not as a pip package. The PyPI package apache-tvm-ffi is only a Python FFI wrapper and contains no explicit TVM version constraint in its dependencies. Additionally, only version 0.1.0 is available on PyPI, so there is no "known-compatible version" to pin to as the review suggests.

While the inconsistency between unpinned apache-tvm-ffi and pinned flash-linear-attention==0.3.2 is technically observable, it does not represent a practical risk because apache-tvm-ffi's version does not control which TVM library is used—the C++ build system does. Pinning a single version on PyPI that has no version alternatives would provide no additional stability benefit.

Likely an incorrect or invalid review comment.

@LeiWang1999
Member Author

CI tests passed, but the bdist step failed. Would you mind taking a look, @oraluben or @XuehaiPan?

@LeiWang1999
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@LeiWang1999
Member Author

LeiWang1999 commented Oct 24, 2025

I think it's most likely due to py38 compatibility of tvm 0.22.0. I'll take a look then.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
.github/workflows/ci.yml (1)

45-46: Update step ID to match step name for clarity.

The step ID is setup-py38 but the step name is "Setup Python 3.9". While this doesn't break functionality (outputs are keyed by ID), updating the step ID to setup-py39 would improve consistency and avoid confusion for future maintainers.

-      - name: Setup Python 3.9
-        id: setup-py38
+      - name: Setup Python 3.9
+        id: setup-py39

Then update line 54 to reference the new ID:

-          "${{ steps.setup-py38.outputs.python-path }}" -m compileall -q -f tilelang
+          "${{ steps.setup-py39.outputs.python-path }}" -m compileall -q -f tilelang
pyproject.toml (1)

86-86: Update ruff target-version to match project Python baseline.

The target-version is set to py38, but the project now requires Python ≥3.9. Updating this to py39 ensures ruff applies rules and fixes aligned with the current minimum supported version. This improves consistency between tool configuration and project requirements.

-target-version = "py38"
+target-version = "py39"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ecf41a and dc12ebc.

📒 Files selected for processing (6)
  • .github/workflows/ci.yml (1 hunks)
  • .github/workflows/dist.yml (1 hunks)
  • docs/get_started/Installation.md (2 hunks)
  • examples/gemm/README.md (1 hunks)
  • format.sh (1 hunks)
  • pyproject.toml (2 hunks)
✅ Files skipped from review due to trivial changes (2)
  • examples/gemm/README.md
  • docs/get_started/Installation.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.1
🔇 Additional comments (3)
format.sh (1)

86-87: PIP_USER environment variable improves virtualenv isolation.

The addition of export PIP_USER=0 prevents pip from defaulting to --user installs, ensuring packages are installed in the virtual environment rather than the user site-packages. This is a sensible safeguard for CI and scripted contexts.

.github/workflows/ci.yml (1)

49-49: Python version bump is consistent with project requirements.

Updating to Python 3.9 aligns with the requires-python = ">=3.9" in pyproject.toml and reflects the TVM 0.22.0 compatibility baseline.

.github/workflows/dist.yml (1)

44-45: Python 3.9 wheel build aligns with TVM 0.22.0 Limited API.

The build matrix now uses Python 3.9, and the comment correctly notes that TVM is built with the Python 3.9 Limited API (enabling wheel compatibility with all Python versions ≥3.9). This supports the project's shift to Python 3.9+ and the TVM rebase.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
.github/workflows/ci.yml (1)

45-54: Update step ID to reflect Python 3.9.

The step ID is still setup-py38 but now sets up Python 3.9. For clarity and maintainability, rename the step ID and update the reference on line 54.

      - name: Setup Python 3.9
-       id: setup-py38
+       id: setup-py39
        uses: actions/setup-python@v6
        with:
          python-version: "3.9" # use lowest supported version for linting
          update-environment: false

      - name: Check AST with Python 3.9
        run: |
-         "${{ steps.setup-py38.outputs.python-path }}" -m compileall -q -f tilelang
+         "${{ steps.setup-py39.outputs.python-path }}" -m compileall -q -f tilelang
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 38f7e49 and 46dfea1.

📒 Files selected for processing (1)
  • .github/workflows/ci.yml (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Quick Lint
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.1
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
🔇 Additional comments (1)
.github/workflows/ci.yml (1)

170-170: HTTP timeout configuration added for CUDA setup.

Adding UV_HTTP_TIMEOUT=600 to the CUDA environment is reasonable for addressing potential timeout issues during package downloads in that environment.

@XuehaiPan
Collaborator

XuehaiPan commented Oct 28, 2025

Some issues with the duplicated libtvm_ffi coming from the bundled TVM and the PyPI apache-tvm-ffi package.

@oraluben We already pack the whole 3rdparty directory into the wheel, which contains many built artifacts. I don't think duplicating libtvm_ffi is a very big problem because we include it in the wheel anyway. We should stop shipping the 3rdparty artifacts first.

Here are the contents of a wheel built from a fresh git clone:

$ unzip tilelang-*.whl

$ du -d 1 -h tilelang | sort -rh
662M    tilelang
592M    tilelang/3rdparty
 65M    tilelang/lib
3.3M    tilelang/src
408K    tilelang/language
292K    tilelang/carver
212K    tilelang/jit
128K    tilelang/intrinsics
 88K    tilelang/quantize
 64K    tilelang/contrib
 60K    tilelang/autotuner
 52K    tilelang/utils
 36K    tilelang/transform
 36K    tilelang/layout
 36K    tilelang/engine
 32K    tilelang/tileop
 32K    tilelang/primitives
 24K    tilelang/tools
 24K    tilelang/cache
 20K    tilelang/profiler
8.0K    tilelang/common
4.0K    tilelang/testing
4.0K    tilelang/math

$ du -d 1 -h tilelang/3rdparty | sort -rh
592M    tilelang/3rdparty
406M    tilelang/3rdparty/tvm
149M    tilelang/3rdparty/cutlass
 38M    tilelang/3rdparty/composable_kernel

@oraluben
Contributor

@XuehaiPan It's unexpected to have apache-tvm-ffi as a dependency but use a bundled lib. Space is not my main concern.

@XuehaiPan
Collaborator

when we have apache-tvm-ffi in dependency but use a bundled lib.

@oraluben Then you need to set an unreliable RPATH to site-packages:

$ORIGIN:$ORIGIN/../../tvm_ffi/lib

I can change to this approach, but it might not work in a venv created with --system-site-packages.

@oraluben
Contributor

Then you need to set an unreliable RPATH to site-packages

I don't think that's necessary, when we have import tvm_ffi in python, libtvm_ffi.so would be loaded automatically.

@XuehaiPan
Collaborator

Then you need to set an unreliable RPATH to site-packages

I don't think that's necessary, when we have import tvm_ffi in python, libtvm_ffi.so would be loaded automatically.

Then you rely on another unguaranteed thing: the execution order of Python code.

@oraluben
Contributor

oraluben commented Oct 28, 2025

Then you need to set an unreliable RPATH to site-packages

I don't think that's necessary, when we have import tvm_ffi in python, libtvm_ffi.so would be loaded automatically.

Then you rely on another unguaranteed thing: the execution order of Python code.

That's true. But since we're dealing with FFI, I think it's ok to add one exception and explicitly import tvm_ffi before loading other libs (even after we have import sorting in lint).

Update: isn't it very dangerous to load two copies of libtvm_ffi.so in the same process, e.g. the bundled one and the one from the pip package (maybe requested by FlashInfer or another package)?

@XuehaiPan
Collaborator

Update: isn't it very dangerous trying to load two libtvm_ffi.so in same process?

This makes sense to me.
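A minimal sketch of the explicit-import ordering discussed above, assuming tilelang's package __init__ is the place where the bundled TVM gets imported (illustrative, not the current code):

# tilelang/__init__.py (sketch)
import tvm_ffi  # noqa: F401  (resolve libtvm_ffi from the pip package first)

import tvm  # the bundled TVM then binds to the already-loaded libtvm_ffi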

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5e2ed4f and 1807298.

📒 Files selected for processing (3)
  • .github/workflows/dist.yml (2 hunks)
  • CMakeLists.txt (3 hunks)
  • pyproject.toml (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/dist.yml
  • CMakeLists.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Build wheels for Python 3.8 on ubuntu-24.04-arm with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.8 on macos-latest with Metal
  • GitHub Check: Build SDist
  • GitHub Check: Build wheels for Python 3.8 on ubuntu-latest with CUDA-12.1
🔇 Additional comments (5)
pyproject.toml (5)

11-28: Classifiers look good; align with FFI modernization.

Added Development Status Beta, C++ and CPython classifiers, plus Python 3.13 and 3.14 support. These changes appropriately reflect the C++ FFI integration and broader Python version support.


30-40: apache-tvm-ffi dependency is critical; audit torch specification.

The new apache-tvm-ffi~=0.1.0 dependency directly supports the PR's FFI rebase. However, note that lines 36–37 list torch twice:

"torch",
"torch>=2.7; platform_system == 'Darwin'",

This is redundant and may confuse dependency resolvers. Typically, you'd either specify a single unconditional version ("torch>=2.7"), or platform-specific variants without duplication. Verify whether torch should require ≥2.7 universally, or only on macOS for a specific reason.


48-48: Build system requires list is well-formatted.

Single-line requires is cleaner and more maintainable than the previous multi-entry format.


52-52: AI summary inconsistency: py-api and target-version.

The AI summary claims wheel.py-api changed from "cp38" to "cp39" and target-version changed from "py38" to "py39", but neither line is marked with ~ and both still show cp38/py38. Verify whether these were intentionally kept at cp38/py38 or whether the summary is outdated.

Also applies to: 111-111


182-194: Linux wheel repair command: libtvm_ffi exclusion is correct.

The updated auditwheel command (line 193) includes -v for verbosity and correctly excludes libtvm_ffi.so along with CUDA libraries. This aligns with the PR's objective to avoid bundling conflicting libtvm_ffi instances alongside the apache-tvm-ffi pip package.

@XuehaiPan
Collaborator

XuehaiPan commented Oct 28, 2025

The built wheel is not usable on Python 3.8 because apache-tvm-ffi requires ml_dtypes.int2, which is only available in ml-dtypes >= 0.5.0 (and that in turn requires Python 3.9).

  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/tmp.OwA4ZfWgj8/venv/lib/python3.8/site-packages/tilelang/__init__.py", line 77, in <module>
      import tvm
    File "/tmp/tmp.OwA4ZfWgj8/venv/lib/python3.8/site-packages/tilelang/3rdparty/tvm/python/tvm/__init__.py", line 24, in <module>
      from tvm_ffi import register_object, register_global_func, get_global_func
    File "/tmp/tmp.OwA4ZfWgj8/venv/lib/python3.8/site-packages/tvm_ffi/__init__.py", line 39, in <module>
      from .registry import (
    File "/tmp/tmp.OwA4ZfWgj8/venv/lib/python3.8/site-packages/tvm_ffi/registry.py", line 25, in <module>
      from . import core
    File "python/tvm_ffi/cython/dtype.pxi", line 192, in init core
    File "/tmp/tmp.OwA4ZfWgj8/venv/lib/python3.8/site-packages/ml_dtypes/__init__.py", line 71, in __getattr__
      raise AttributeError(f'cannot import name {name!r} from {__name__!r}')
  AttributeError: cannot import name 'int2' from 'ml_dtypes'

apache-tvm-ffi claims to support Python 3.8, but that path is apparently untested.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aa4eb5d and f276a26.

📒 Files selected for processing (3)
  • .github/workflows/ci.yml (2 hunks)
  • .github/workflows/dist.yml (1 hunks)
  • .github/workflows/pr-perfbench-bot.yml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/ci.yml
  • .github/workflows/dist.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Build SDist
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.1
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
🔇 Additional comments (1)
.github/workflows/pr-perfbench-bot.yml (1)

49-49: Verify Python 3.12 compatibility with TVM v0.22.0.

Upgrading from 3.9 to 3.12 addresses the Python 3.8 incompatibility noted in PR comments. However, verify that this single-version workflow is intentional and that TVM v0.22.0 is fully tested on Python 3.12. If the project maintains support for multiple Python versions, consider whether other workflows (e.g., build, test) validate compatibility across a version range.

@oraluben
Contributor

oraluben commented Oct 28, 2025

apache-tvm-ffi claims to support Python 3.8, but that path is apparently untested.

Checked a little bit. ml_dtypes is an optional dependency of tvm-ffi; the tvm-ffi issue is that it does not support ml_dtypes<0.5:

https://github.com/apache/tvm-ffi/blob/789e9e5ccf3fe590132d352671461922d30ba526/python/tvm_ffi/cython/dtype.pxi#L35-L39

When installing tilelang on py3.8, pip installs the newest ml_dtypes available for that interpreter (0.4.x), which does not have those types.
Users should be able to build from the sdist after tvm-ffi's next release, which includes apache/tvm-ffi#198.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
pyproject.toml (1)

227-231: Add explicit dylib exclusions to macOS repair-wheel command.

The macOS repair-wheel-command (lines 227–231) uses --no-sanitize-rpaths but does not include --exclude flags to prevent bundling of libtvm_ffi.dylib and libcuda.dylib. Per web search results in past review comments, --no-sanitize-rpaths only preserves RPATH entries; it does not prevent delocate-wheel from copying dylibs into the wheel.

In contrast, the Linux configuration (lines 222–225) explicitly excludes libtvm_ffi.so and libcuda.so.1. Without equivalent exclusions on macOS, the wheel will bundle libtvm_ffi.dylib, creating a conflict with the pip-installed apache-tvm-ffi package and risking duplication or ABI mismatches.

Add --exclude libtvm_ffi.dylib and --exclude libcuda.dylib to the delocate-wheel command:

[tool.cibuildwheel.macos]
repair-wheel-command = [
-   "delocate-wheel --verbose --ignore-missing-dependencies --no-sanitize-rpaths --require-archs {delocate_archs} -w {dest_dir} -v {wheel}",
+   "delocate-wheel --verbose --ignore-missing-dependencies --no-sanitize-rpaths --exclude libtvm_ffi.dylib --exclude libcuda.dylib --require-archs {delocate_archs} -w {dest_dir} -w {dest_dir} {wheel}",
    "pipx run abi3audit --verbose --strict {wheel}",
]
🧹 Nitpick comments (2)
pyproject.toml (2)

31-31: Clarify redundant torch dependency.

Line 31 adds apache-tvm-ffi~=0.1.0, but lines 36–37 declare torch twice: once without a version constraint (torch) and again with a macOS-specific pin (torch>=2.7; platform_system == 'Darwin'). Typically, a base requirement should precede environment markers.

Consider restructuring to a single torch entry with the appropriate version constraint:

dependencies = [
    "apache-tvm-ffi~=0.1.0",
    "cloudpickle",
    "ml-dtypes",
    "numpy>=1.23.5",
    "psutil",
-   "torch",
-   "torch>=2.7; platform_system == 'Darwin'",
+   "torch>=2.7; platform_system == 'Darwin'",
+   "torch; platform_system != 'Darwin'",
    "tqdm>=4.62.3",
    "typing-extensions>=4.10.0",
]

Alternatively, if the minimum torch version differs by platform, clarify the intent with a comment.

Also applies to: 36-37


186-188: Consider expanding test-command to verify FFI integration.

The added test-command (lines 186–188) only checks that tilelang imports and prints its version. For a PR that migrates extensively to FFI-based object creation and adds a new FFI dependency (apache-tvm-ffi), consider a slightly more comprehensive smoke test to ensure FFI components load correctly.

For example:

python -c "import tilelang; from tilelang import ir; print(f'tilelang {tilelang.__version__}'); print('FFI OK')"

This ensures not just import, but also FFI-dependent submodules (like ir) load without error.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f276a26 and 8424e99.

📒 Files selected for processing (2)
  • .github/workflows/pr-perfbench-bot.yml (2 hunks)
  • pyproject.toml (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/pr-perfbench-bot.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Quick Lint
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.1
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
  • GitHub Check: Build SDist
🔇 Additional comments (1)
pyproject.toml (1)

30-40: Pin ml-dtypes version to ensure Python 3.8 compatibility.

The project declares requires-python = ">=3.8" (line 5), but does not pin ml-dtypes (line 33). Per PR comments, apache-tvm-ffi requires ml-dtypes >= 0.5.0, which in turn requires Python ≥ 3.9 for the int2 type. This creates a latent incompatibility: pip may resolve ml-dtypes >= 0.5.0 on Python 3.8, causing an import failure.

Either:

  1. Explicitly pin ml-dtypes>=0.5.0 and update requires-python to >=3.9, or
  2. Pin ml-dtypes to a version compatible with Python 3.8 if older versions support it, or
  3. Make ml-dtypes conditional on Python ≥ 3.9 (if the library is optional for older Python).

As a workaround until apache-tvm-ffi resolves this upstream, consider pinning ml-dtypes>=0.5.0; python_version >= '3.9'.

tqchen pushed a commit to apache/tvm-ffi that referenced this pull request Oct 28, 2025
As title, currently it failed with `ml_dtypes==0.4.x`:
tile-ai/tilelang#1108 (comment)