
Conversation

@LeiWang1999 (Member) commented Sep 1, 2025

Summary by CodeRabbit

  • New Features

    • clang-tidy integrated into formatting checks with selectable run modes (--files/--all/changed).
  • Chores

    • CI/AMD workflows run temporary build steps to generate compile commands and clean up; added clang-tidy to lint deps.
  • Refactor

    • Widespread const-correctness, move-semantics, emptiness checks, and safer initializations to reduce copies.
  • Style

    • Lint rules narrowed to a targeted, less-intrusive subset.
  • Breaking Changes

    • Multiple public signatures, enum storage types, and a small helper/field were changed or removed.

- Enhanced the .clang-tidy configuration with specific checks for better bug detection and performance optimization.
- Added clang-tidy checks to the format script and listed clang-tidy as a dependency in requirements-lint.txt.
- Refactored function signatures across multiple files to use `const` references for parameters, improving performance and code clarity.
- Updated various methods to handle parameters consistently, particularly in the `AddPredicate`, `Substitute`, and `PlanLoopPartition` functions, using `std::move` where applicable.
- Replaced size checks with `empty()` method calls in several locations for clearer intent.
- General code cleanup and adherence to best practices for better maintainability.
- Added steps to create a build directory and configure CMake with ROCm support during the format check process.
- Ensured cleanup of the build directory after the format check to maintain a clean workspace.
- Removed the `args_` member variable from both `AtomicAddNode` and `CopyNode` classes to streamline the code and eliminate unnecessary data members.
- This change enhances code clarity and maintainability by focusing on relevant attributes for each class.

coderabbitai bot commented Sep 1, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Narrow clang-tidy rules and add clang-tidy runs to format/CI; introduce clang-tidy dependency. Apply extensive C++ hygiene: convert many parameters to const references, add std::move and <utility> includes, default-initialize members, replace size() checks with empty(), and set explicit enum underlying types; several public signatures changed.
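A compact sketch of the hygiene patterns the walkthrough lists: const-reference parameters, std::move into stored members, default member initialization, empty() over size() == 0, and explicit uint8_t enum storage. The names here (EvictionPolicy, Node, HasChildren) are illustrative only, not identifiers from this PR:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Explicit underlying type fixes the enum's storage at one byte.
enum class EvictionPolicy : uint8_t { kNone = 0, kFirst = 1, kLast = 2 };

struct Node {
  std::string name;
  int size = 0;  // default-initialize members instead of leaving them indeterminate

  // Take read-only parameters by const reference to avoid a copy...
  explicit Node(const std::string &n) : name(n) {}
  // ...or take by value and std::move when the callee stores the argument.
  Node(std::string n, int s) : name(std::move(n)), size(s) {}
};

// Prefer empty() over size() == 0: clearer intent, O(1) for every container.
bool HasChildren(const std::vector<Node> &children) { return !children.empty(); }
```

Fixing the enum's storage with `: uint8_t` is also what allows fields such as eviction_policy (per the table below) to be narrowed to uint8_t without changing their value range.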

Changes

Cohort / File(s) Summary
Lint config
.clang-tidy
Replace broad check groups with a curated enabled set, add categorized comments, and disable a large set of intrusive/style/modernize rules; keep WarningsAsErrors and HeaderFilterRegex.
Format tooling & CI
format.sh, requirements-lint.txt, .github/workflows/ci.yml, .github/workflows/amd_ci.yml
Add clang-tidy dependency; extend format.sh to run clang-tidy (modes: --files/--all/changed, -p build, -j64) — note duplicated clang-tidy block; CI/AMD CI create build dir, run cmake to generate compile_commands.json for CUDA/ROCm, invoke ./format.sh, and remove build dir on success.
Core / IR
src/ir.cc
Convert multiple public/FFI signatures to take const &, add std::move usage and <utility>, and replace size() checks with empty() checks.
Operators & enums
src/op/... (copy.h, atomic_add.h, gemm.h, gemm_sp.h, reduce.h, operator.h, parallel.h)
Add explicit : uint8_t enum underlying types, add CopyInst members (kBulkLoad, kBulkStore), remove public args_ from some nodes, change field types (e.g., eviction_policy → uint8_t), and adjust a few public method/constructor parameter types.
Pass factory / pass callbacks
src/transform/* (many files)
Widely change pass-factory lambdas to accept const IRModule & and const PassContext &; convert many rewriters/collectors to accept const & params; add <utility> and std::move in many passes; default-initialize members and replace size() checks with empty(); a set of public signatures updated (see raw summary).
Loop partition & vectorization
src/transform/loop_partition.*, src/transform/loop_vectorize.*, src/transform/atomicadd_vectorize.*, src/transform/vectorize_loop.*
Change public/private signatures to const refs, add move semantics and default initialization, update IndiceCanVectorize public signature, and tighten parameter passing.
Pipeline / warp / sync
src/transform/pipeline_planning.cc, src/transform/inject_pipeline.cc, src/transform/warp_specialized_rewriter.cc, src/transform/wgmma_sync_rewriter.cc, src/transform/lower_thread_allreduce.cc
Make enums explicit-typed, convert many helpers/constructors to accept const &, add moves, tighten sync/index types, and update several public-like signatures to const-ref.
Storage & access rewrites
src/transform/storage_access.*, src/transform/storage_rewrite.cc, src/transform/merge_shared_memory_allocations.*, src/transform/lower_device_storage_access_info.cc
ComputeThreadRange/GetScope/GetBufferOffset now take const &; some internals call GetPtrStorageScope(std::move(...)); many idiomatic empty() checks, default-inits, and const-correctness changes.
Lowering / device / barrier helpers
src/transform/lower_*.cc, src/transform/inject_*.*, src/transform/inject_fence_proxy.cc, src/transform/inject_tma_barrier.cc
Helper signatures to const &, enums given uint8_t underlying types, std::move added in constructors and call sites, and emptiness checks unified.
Misc utilities & transforms
src/transform/make_packed_api.cc, src/transform/simplify.cc, src/transform/cluster_planning.cc, src/transform/flatten_buffer.cc, src/transform/common/*, etc.
Broad const-correctness, std::move usage, default initialization, and smaller exported signature adjustments as listed in the raw summary.
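The --files/--all/changed modes described for format.sh above can be sketched as a small dispatch helper. This is a hypothetical outline based on the summary; the function name, the origin/main merge base, and the branch bodies are assumptions, not the script's actual contents:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of format.sh's clang-tidy target selection.
select_tidy_targets() {
  local mode="$1"
  shift
  case "$mode" in
    --all)
      # Whole tree; find recurses into subdirectories, unlike a flat src/*.cc glob.
      find src -name '*.cc'
      ;;
    --files)
      # Explicit file list passed by the caller, one per line.
      printf '%s\n' "$@"
      ;;
    *)
      # Default: only C/C++ files changed since the merge base.
      local base
      base=$(git merge-base origin/main HEAD)
      git diff --name-only --diff-filter=ACM "$base" -- \
        '*.c' '*.cc' '*.cpp' '*.h' '*.hpp'
      ;;
  esac
}

# The selected targets would then be fed to:
#   run-clang-tidy -p build -j 64 <files...>
```

Whatever the mode, clang-tidy needs the compile_commands.json that the CI steps generate via `-DCMAKE_EXPORT_COMPILE_COMMANDS=ON`, which is why `-p build` is passed.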

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Dev as Developer
  participant GH as GitHub Actions
  participant CMake as cmake
  participant Fmt as ./format.sh
  participant Tidy as clang-tidy

  Dev->>GH: Push/PR triggers workflow
  GH->>GH: create temporary build dir
  GH->>CMake: cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_CUDA/USE_ROCM=ON
  CMake-->>GH: generate compile_commands.json
  GH->>Fmt: run ./format.sh
  alt mode == all/files
    Fmt->>Tidy: clang-tidy -p build -j64 on selected files
  else mode == changed
    Fmt->>GH: compute changed C/C++ files (git merge-base)
    GH-->>Fmt: file list
    Fmt->>Tidy: clang-tidy -p build -j64 on changed files
  end
  Tidy-->>Fmt: diagnostics (WarningsAsErrors enforced)
  Fmt-->>GH: exit status
  GH->>GH: remove build dir on success
sequenceDiagram
  autonumber
  participant Pass as Transform Pass
  participant Mod as IRModule (const&)
  participant Ctx as PassContext (const&)
  participant Rewriter as Rewriter/Collector
  participant IR as Stmt/Expr

  Pass->>Pass: receive (PrimFunc f, Mod, Ctx)
  Pass->>Rewriter: construct (const refs / moves)
  Rewriter->>IR: analyze/visit (const& inputs)
  Rewriter-->>Pass: produce new IR (moved when applicable)
  Pass-->>Mod: return updated PrimFunc

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120+ minutes

Poem

I nibble on headers, hop through the code,
Consts snug like carrots in a neat little row.
Enums trimmed skinny, moves tuck them in tight,
CI hums a lullaby, tidy checks through the night.
🥕 — a rabbit who loves compile-time delight



github-actions bot commented Sep 1, 2025

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀

@gemini-code-assist (Contributor) bot left a comment

Summary of Changes

Hello @LeiWang1999, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates clang-tidy into the project's linting workflow, enhancing code quality and maintainability. The changes involve updating the build script to incorporate clang-tidy checks and refining the .clang-tidy configuration to enforce a focused set of C++ best practices. Additionally, numerous existing C++ files have been refactored to comply with the newly enforced linting rules, improving code robustness and clarity.

Highlights

  • Clang-Tidy Integration: Introduced clang-tidy into the format.sh script to enhance code quality and enforce coding standards for C/C++ files.
  • Updated Linting Configuration: Refined the .clang-tidy configuration to focus on detecting bugs, performance issues, and practical readability improvements, while explicitly disabling overly intrusive or style-altering rules.
  • Automated Clang-Tidy Execution: Modified format.sh to automatically run clang-tidy on all C/C++ source files or only on changed files, with new command-line options (--files, --all) for flexible execution.
  • Codebase Refactoring for Compliance: Applied extensive changes across numerous C++ files (including src/ir.cc, src/op/*.h, and various src/transform/*.cc files) to address clang-tidy warnings. This involved adopting modern C++ practices such as using std::move for efficient resource transfer, adding const qualifiers for immutability, and replacing size() == 0 checks with empty() for clarity and correctness.
  • Dependency Update: Added clang-tidy==18.1.8 to requirements-lint.txt, making it a required tool for the project's linting setup.

@gemini-code-assist (Contributor) bot left a comment

Code Review

This pull request introduces clang-tidy for linting and applies a large number of fixes across the C++ codebase. The changes are generally positive, improving code quality, modern C++ practices, and performance in some areas. However, I've identified a few issues, including some potential performance regressions from removing std::move, unsafe reinterpret_cast usage, and potential data loss from incorrect type casting. I've also suggested improvements to the new shell script logic to make it more robust.

Comment on lines 108 to 109
if (gcd_base < static_cast<int>(Downcast<IntImm>(last_dim)->value)) {
max_vector_size = gcd_base;

high

The result of Downcast<IntImm>(last_dim)->value is int64_t, and gcd_base is also int64_t. Casting the result to int is unnecessary and potentially lossy if the value exceeds INT_MAX. Please remove the static_cast<int>.

Suggested change
-if (gcd_base < static_cast<int>(Downcast<IntImm>(last_dim)->value)) {
+if (gcd_base < Downcast<IntImm>(last_dim)->value) {
   max_vector_size = gcd_base;

Comment on lines +221 to +222
auto sync_index =
static_cast<int>(Downcast<IntImm>(call->args[1])->value);

high

The result of Downcast<IntImm>(call->args[1])->value is int64_t. Casting it to int is unnecessary and potentially lossy if the value exceeds INT_MAX. sync_index should be int64_t (or auto).

Suggested change
-auto sync_index =
-    static_cast<int>(Downcast<IntImm>(call->args[1])->value);
+auto sync_index = Downcast<IntImm>(call->args[1])->value;

Comment on lines +38 to +40
-cppcoreguidelines-narrowing-conversions,
-clang-diagnostic-error,
-cppcoreguidelines-pro-type-member-init,

medium

There are duplicated rules that are both enabled and disabled. cppcoreguidelines-pro-type-member-init is enabled on line 5 and disabled on line 40. cppcoreguidelines-narrowing-conversions is enabled on line 9 and disabled on line 38. This is likely unintentional. Please remove one of each pair to resolve the conflict.
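The conflict matters because clang-tidy evaluates the `Checks` globs in order of appearance and the last matching entry wins, so in each duplicated pair the trailing `-` glob silently disables the check enabled earlier. A minimal fragment illustrating the ordering (the check name is real; the fragment is illustrative, not the project's actual .clang-tidy):

```yaml
# Globs in `Checks` are evaluated left to right; the last match wins.
Checks: >
  cppcoreguidelines-pro-type-member-init,
  -cppcoreguidelines-pro-type-member-init
# Net effect: the check is disabled, because the '-' glob appears later.
```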

Comment on lines +273 to +274
clang_tidy_all() {
run-clang-tidy -j 64 src/*.cc -p build

medium

The use of src/*.cc will not find files in subdirectories of src, and it can fail if the number of files is very large, exceeding command-line length limits. Using find with xargs is a more robust way to handle this.

Suggested change
 clang_tidy_all() {
-  run-clang-tidy -j 64 src/*.cc -p build
+  find src -name '*.cc' | xargs run-clang-tidy -j 64 -p build

Comment on lines 288 to 298
CHANGED_FILES=$(git diff --name-only --diff-filter=ACM "$MERGEBASE" -- '*.c' '*.cc' '*.cpp' '*.h' '*.hpp' 2>/dev/null || true)

if [ -n "$CHANGED_FILES" ]; then
echo "Running clang-tidy on changed files:"
echo "$CHANGED_FILES"
# Convert newline-separated files to space-separated and run clang-tidy once
CHANGED_FILES_SPACE=$(echo "$CHANGED_FILES" | tr '\n' ' ')
run-clang-tidy -j 64 $CHANGED_FILES_SPACE -p build
else
echo "No C/C++ files changed. Skipping clang-tidy."
fi

medium

The current implementation for getting changed files and running clang-tidy on them can fail if filenames contain spaces. Using xargs with null-terminated filenames is a more robust approach.

Suggested change
-CHANGED_FILES=$(git diff --name-only --diff-filter=ACM "$MERGEBASE" -- '*.c' '*.cc' '*.cpp' '*.h' '*.hpp' 2>/dev/null || true)
-
-if [ -n "$CHANGED_FILES" ]; then
-  echo "Running clang-tidy on changed files:"
-  echo "$CHANGED_FILES"
-  # Convert newline-separated files to space-separated and run clang-tidy once
-  CHANGED_FILES_SPACE=$(echo "$CHANGED_FILES" | tr '\n' ' ')
-  run-clang-tidy -j 64 $CHANGED_FILES_SPACE -p build
-else
-  echo "No C/C++ files changed. Skipping clang-tidy."
-fi
+# Get changed C/C++ files
+if ! git diff --diff-filter=ACM --quiet --exit-code "$MERGEBASE" -- '*.c' '*.cc' '*.cpp' '*.h' '*.hpp' &>/dev/null; then
+  echo "Running clang-tidy on changed files:"
+  git diff --name-only --diff-filter=ACM -z "$MERGEBASE" -- '*.c' '*.cc' '*.cpp' '*.h' '*.hpp' | tee /dev/stderr | xargs -0 run-clang-tidy -j 64 -p build
+else
+  echo "No C/C++ files changed. Skipping clang-tidy."
+fi

Comment on lines +71 to +72
body = For(var, dom->min, dom->extent, ForKind::kParallel, body,
/*thread_binding=*/std::nullopt, /*annotations=*/annotations);

medium

The std::move(body) was removed from the For constructor call. Since body is a local variable that is being consumed to create a new Stmt and then reassigned within the loop, std::move is appropriate and more efficient. This change introduces a performance regression due to an unnecessary copy.

      body = For(var, dom->min, dom->extent, ForKind::kParallel, std::move(body),
                 /*thread_binding=*/std::nullopt, /*annotations=*/annotations);

Comment on lines +105 to 106
body = For(vars[0], doms[0]->min, doms[0]->extent, ForKind::kSerial, body,
/*thread_binding=*/std::nullopt, /*annotations=*/anno);

medium

The std::move(body) was removed from the For constructor call. Since body is a local variable that is being consumed to create a new Stmt, std::move is appropriate and more efficient. This change introduces a performance regression due to an unnecessary copy.

    body = For(vars[0], doms[0]->min, doms[0]->extent, ForKind::kSerial, std::move(body),
               /*thread_binding=*/std::nullopt, /*annotations=*/anno);

Comment on lines 97 to 99
void AddPredicate(const PrimExpr &&expr) const {
predicate_ = predicate_.defined() ? And(expr, predicate_.value()) : expr;
}

medium

The function parameter expr is taken as const PrimExpr&&. This is an unusual type that binds to rvalues but doesn't allow moving from them. It should probably be PrimExpr&& to allow moving, or const PrimExpr& to avoid forcing temporaries on the caller.

Suggested change
-void AddPredicate(const PrimExpr &&expr) const {
+void AddPredicate(const PrimExpr &expr) const {
   predicate_ = predicate_.defined() ? And(expr, predicate_.value()) : expr;
 }

Comment on lines +150 to +151
reads.push_back(access[0]);
writes.push_back(access[1]);

medium

The std::move was removed from reads.push_back and writes.push_back. Since access is a local variable and its elements are not used after being pushed back, std::move is appropriate and more efficient. This change introduces a performance regression.

Suggested change
-reads.push_back(access[0]);
-writes.push_back(access[1]);
+reads.push_back(std::move(access[0]));
+writes.push_back(std::move(access[1]));

@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (22)
src/transform/inject_pipeline.cc (1)

157-159: Bug: missing subtraction of loop min in versioned access offset (tvm_access_ptr).
Other places compute the version as floormod(loop_var - min, N). Here it uses floormod(loop_var, N), which is incorrect when pipeline_loop_->min != 0 and will mis-index the versioned slice.

Apply this fix:

-        PrimExpr new_index =
-            old_index +
-            floormod(pipeline_loop_->loop_var, new_buffer->shape[0]) * offset;
+        PrimExpr new_index =
+            old_index +
+            floormod((pipeline_loop_->loop_var - pipeline_loop_->min),
+                     new_buffer->shape[0]) *
+            offset;

Please add a test exercising InjectSoftwarePipeline on a loop with non-zero min that lowers to tvm_access_ptr to prevent regressions.

src/transform/wgmma_sync_rewriter.cc (1)

122-124: Fix possible null deref: check SeqStmt before use.

CollectWgmmaInfo op->seq is called before ICHECK; if body isn’t SeqStmt, this derefs null.

-    CollectWgmmaInfo(op->body.as<SeqStmtNode>());
-    auto stmt_node = (op->body).as<SeqStmtNode>();
-    ICHECK(stmt_node);
+    auto stmt_node = op->body.as<SeqStmtNode>();
+    ICHECK(stmt_node);
+    CollectWgmmaInfo(stmt_node);
src/transform/common/thread_sync_types.h (1)

27-32: Scoped enum arithmetic breaks compilation; fix enumerator initializer.

enum class disables implicit integral ops; kReduce_1 + 1 is ill-formed.

Apply:

-enum class ReservedNamedBarriers : uint8_t {
-  kSyncThreads = 0,
-  kReduce_0 = 1,
-  kReduce_1 = 2,
-  kFirstUsedBarrier = kReduce_1 + 1
-};
+enum class ReservedNamedBarriers : uint8_t {
+  kSyncThreads = 0,
+  kReduce_0 = 1,
+  kReduce_1 = 2,
+  // enum class: cast before arithmetic in a constant expression
+  kFirstUsedBarrier = static_cast<uint8_t>(kReduce_1) + 1
+};
src/transform/if_stmt_binding.cc (1)

32-37: Bug: else-branch short-circuits traversal and drops child rewrites

When else_case is defined, the code returns the original node, discarding the visited then_case and skipping visiting else. Also, the condition isn’t visited. This suppresses downstream rewrites under if-else.

Apply this fix to recurse into both branches and visit the condition:

-    auto condition = op->condition;
-    auto then_case = VisitStmt(op->then_case);
-    Optional<Stmt> else_case = op->else_case;
-    if (else_case.defined()) {
-      return GetRef<Stmt>(op);
-    }
+    PrimExpr condition = StmtExprMutator::VisitExpr(op->condition);
+    Stmt then_case = VisitStmt(op->then_case);
+    Optional<Stmt> else_case = op->else_case;
+    if (else_case.defined()) {
+      Stmt else_case_new = VisitStmt(else_case.value());
+      return IfThenElse(condition, then_case, else_case_new);
+    }
src/transform/inject_fence_proxy.cc (1)

73-83: Potential OOB in ProxyMarker for empty SeqStmt

ProxyMarker::VisitStmt_(SeqStmtNode) uses op->seq[0] without an emptiness guard. This can still trip even though InjectFenceProxy now guards its own SeqStmt.

   void VisitStmt_(const SeqStmtNode *op) final {
-    StmtVisitor::VisitStmt_(op);
-    auto role = GetProxy(op->seq[0]);
+    StmtVisitor::VisitStmt_(op);
+    if (op->seq.empty()) {
+      SetProxy(op, Proxy::kGeneric);
+      return;
+    }
+    auto role = GetProxy(op->seq[0]);
     for (auto stmt : op->seq) {
       if (role != GetProxy(stmt)) {
         role = Proxy::kBoth;
         break;
       }
     }
     SetProxy(op, role);
   }
src/transform/lower_shared_barrier.cc (2)

48-55: Null deref risk: ptr_type used before ICHECK

storage_scope is read from ptr_type before verifying ptr_type != nullptr.

-    for (const auto &[data, buffer] : buffer_map_) {
-      const auto *ptr_type =
-          buffer->data->type_annotation.as<PointerTypeNode>();
-      auto storage_scope = ptr_type->storage_scope;
-      ICHECK(ptr_type) << "Buffer Var's type annotation must be of PointerType";
+    for (const auto &[data, buffer] : buffer_map_) {
+      const auto* ptr_type = buffer->data->type_annotation.as<PointerTypeNode>();
+      ICHECK(ptr_type) << "Buffer Var's type annotation must be of PointerType";
+      auto storage_scope = ptr_type->storage_scope;

87-94: Barrier→shared scope not updated; data Var still has "shared.barrier"

New Buffer reuses the old data Var, whose PointerType still encodes storage_scope "shared.barrier". This likely leaves the allocation in the barrier scope despite intent to convert to plain shared.

Recommend creating a fresh Var with PointerType(..., "shared") and building the new Buffer on that:

-    for (auto buffer : barrier_buffers) {
-      auto data = buffer->data;
-      auto new_buffer = Buffer(data, buffer->dtype, Array<PrimExpr>({1}),
-                               Array<PrimExpr>({1}), PrimExpr(0), buffer->name,
-                               buffer->data_alignment, buffer->offset_factor,
-                               buffer->buffer_type);
+    for (auto buffer : barrier_buffers) {
+      const auto* old_ptr = buffer->data->type_annotation.as<PointerTypeNode>();
+      ICHECK(old_ptr);
+      // Create a new data var in "shared" scope
+      Var data_shared(buffer->data->name_hint + "_shared",
+                      PointerType(PrimType(buffer->dtype), "shared"));
+      auto new_buffer = Buffer(data_shared, buffer->dtype, Array<PrimExpr>({1}),
+                               Array<PrimExpr>({1}), PrimExpr(0), buffer->name,
+                               buffer->data_alignment, buffer->offset_factor,
+                               buffer->buffer_type);
       new_buffers.push_back(new_buffer);
       buffer_remap_.Set(buffer, new_buffer);
     }
src/transform/layout_inference.cc (1)

561-566: Typo in memory scope disables fragment checks

"local.framgent" should be "local.fragment". As written, the ICHECK will never trigger for fragment buffers, potentially hiding layout inference errors.

-      if (buffer.scope() == "local.framgent") {
+      if (buffer.scope() == "local.fragment") {
src/transform/config_index_bitwidth.cc (2)

126-137: Index promotion bug: loop edits don’t persist.

The for-each copies PrimExprs; assignments to index are lost. Write back to indices.

Apply this diff in both BufferStore and BufferLoad visitors:

-    for (auto index : indices) {
+    for (size_t i = 0; i < indices.size(); ++i) {
+      PrimExpr index = indices[i];
       if (index->dtype.is_int() && index->dtype.bits() < 64) {
         auto int_bound = analyzer_->const_int_bound(index);
         if (int_bound->max_value >= (1LL << (index->dtype.bits() - 1)) - 1 ||
             int_bound->min_value < -(1LL << (index->dtype.bits() - 1))) {
           Int64Promoter promoter;
-          index = promoter(index);
+          index = promoter(index);
+          indices.Set(i, index);
         }
       }
-    }
+    }

Also applies to: 144-156


27-35: Respect configured bitwidth for Var remap (and gate by is_enabled_).

Var remap is hardcoded to Int(64) and ignores index_bitwidth. Also, unlike other rewrites, it isn’t gated by is_enabled_. Align it with the pass intent.

-  PrimExpr VisitExpr_(const VarNode *op) final {
-    if (op->dtype.is_int() && op->dtype.bits() < 64) {
-      DataType new_dtype = DataType::Int(64);
+  PrimExpr VisitExpr_(const VarNode *op) final {
+    if (is_enabled_ && op->dtype.is_int() && op->dtype.bits() < _index_bitwidth_) {
+      DataType new_dtype = DataType::Int(_index_bitwidth_);
       if (!var_remap_.count(op)) {
         var_remap_[op] = Var(op->name_hint, new_dtype);
       }
     }
     return Parent::VisitExpr_(op);
   }
src/transform/loop_vectorize.cc (2)

86-92: Fix incorrect visitor call in BufferStore visitor (likely compile-time error).

VisitStmt_(const BufferStoreNode*) returns void, but the code uses return arith::IRVisitorWithAnalyzer::VisitExpr(node->value);. This both calls the wrong base (Expr vs Stmt) and attempts to return a value in a void function, which will not compile and also skips traversing the store node structure.

Apply:

   void VisitStmt_(const BufferStoreNode *node) final {
     if (node->buffer.scope() == "shared" || node->buffer.scope() == "global" ||
         node->buffer.scope() == "shared.dyn")
       has_nonlocal_memory_access_ = true;
     UpdateVectorSize(node->indices, node->buffer);
-    return arith::IRVisitorWithAnalyzer::VisitExpr(node->value);
+    arith::IRVisitorWithAnalyzer::VisitStmt_(node);
+    return;
   }

142-149: Reverse-construction of strides is incorrect; fix Array construction.

Array<PrimExpr>{strides.rbegin(), strides.rend()} interprets iterators as elements and won’t build the reversed array. Build explicitly, then assign.

       // Generate strides if not existed
       auto strides = buffer->strides;
       if (buffer->strides.empty()) {
         PrimExpr stride = 1;
         for (int i = indices.size() - 1; i >= 0; --i) {
           strides.push_back(stride);
           stride = stride * buffer->shape[i];
         }
-        strides = Array<PrimExpr>{strides.rbegin(), strides.rend()};
+        Array<PrimExpr> reversed;
+        for (auto it = strides.rbegin(); it != strides.rend(); ++it) {
+          reversed.push_back(*it);
+        }
+        strides = std::move(reversed);
       }
src/transform/lower_thread_allreduce.cc (2)

145-158: Return NullOpt instead of std::nullopt for tvm::Optional.

tvm::Optional uses NullOpt (not std::nullopt). Returning std::nullopt will fail to compile.

   if (auto it = var_remap_.find(buf->data.get()); it != var_remap_.end()) {
     Buffer new_buf = buf;
     new_buf.CopyOnWrite()->data = it->second;
     buf_remap_[buf.get()] = new_buf;
     return new_buf;
   }
-
-  return std::nullopt;
+  return NullOpt;

387-390: Use NullOpt for Optional parameters.

Passing std::nullopt to const Optional<PrimExpr>& relies on an unsupported conversion. Use NullOpt.

-        std::tie(reduce_results, new_alloc_bufs) = MakeWarpAllreduce(
-            values, types, combiner, reduce_index, reduce_extent, group_index,
-            mask, std::nullopt, &seq);
+        std::tie(reduce_results, new_alloc_bufs) = MakeWarpAllreduce(
+            values, types, combiner, reduce_index, reduce_extent, group_index,
+            mask, NullOpt, &seq);
-        std::tie(reduce_results, local_bufs) =
-            MakeWarpAllreduce(values, types, combiner, reduce_index, warp_size_,
-                              group_index, mask, std::nullopt, &seq);
+        std::tie(reduce_results, local_bufs) =
+            MakeWarpAllreduce(values, types, combiner, reduce_index, warp_size_,
+                              group_index, mask, NullOpt, &seq);
-        std::tie(reduce_results, local_bufs) = MakeWarpAllreduce(
+        std::tie(reduce_results, local_bufs) = MakeWarpAllreduce(
             values, types, combiner, reduce_index, n_warps, group_index, mask,
-            /*predicate=*/reduce_index <
-                make_const(reduce_index->dtype, n_warps),
+            /*predicate=*/reduce_index < make_const(reduce_index->dtype, n_warps),
             &seq);

Also applies to: 421-424, 451-454

src/transform/lower_tile_op.cc (1)

299-300: Implement tvm_access_ptr support in HandleAccessPtrAndOffset
Replace the LOG(FATAL) at src/transform/lower_tile_op.cc:299-300 with actual handling of tvm_access_ptr, mirroring existing logic (e.g. in storage_rewrite.cc and inject_pipeline.cc) instead of aborting.

src/transform/make_packed_api.cc (2)

487-489: Critical: updates IRModule is null and dereferenced (crash).

The declaration IRModule updates; leaves it default-null. Both updates->Add(...) and updates->functions.empty() will dereference a null pointer when the first change is encountered or when no changes exist.

Apply this fix to initialize updates and keep the emptiness check safe:

-    IRModuleNode *mptr = mod.CopyOnWrite();
-    IRModule updates;
+    IRModuleNode *mptr = mod.CopyOnWrite();
+    IRModule updates = IRModule(Map<GlobalVar, BaseFunc>());
@@
-        if (!func.same_as(orig_func)) {
-          updates->Add(gvar, func);
-        }
+        if (!func.same_as(orig_func)) {
+          updates->Add(gvar, func);
+        }
@@
-    if (!updates->functions.empty()) {
+    if (!updates->functions.empty()) {
       mod.CopyOnWrite()->Update(updates);
     }

Also applies to: 500-505, 508-510


360-376: Use consistent FFI TypeIndex for scalar assertions
In src/transform/make_packed_api.cc (at the AssertStmt for ints on line 368 and floats on line 374), you’re comparing type_index to DLPack codes (kDLInt/kDLFloat) but the buffer stores FFI codes. Change those to ffi::TypeIndex::kTVMFFIInt and ffi::TypeIndex::kTVMFFIFloat, respectively.

src/transform/warp_specialized_rewriter.cc (3)

558-566: Map<String, Any> is invalid here; use ObjectRef.

TVM maps for annotations use ObjectRef, not Any. This likely fails to compile.

Apply:

-    Map<String, Any> for_annotations = op->annotations;
+    Map<String, ObjectRef> for_annotations = op->annotations;

596-603: HasWgMMA defaults to true → always-on predicate; fix detection.

has_wgmma_ starts true and only flips inside gemm branch, so a block with no gemm will incorrectly report true.

Apply:

-  void VisitExpr_(const CallNode *op) final {
-    if (op->op.same_as(tl_gemm()) || op->op.same_as(tl_gemm_sp())) {
-      auto op_name = std::string(op->args[0].as<StringImmNode>()->value);
-      if (has_wgmma_) {
-        has_wgmma_ =
-            op_name.find("false") == std::string::npos && !in_if_scope_;
-      }
-    }
-    StmtExprVisitor::VisitExpr_(op);
-  }
+  void VisitExpr_(const CallNode *op) final {
+    if (op->op.same_as(tl_gemm()) || op->op.same_as(tl_gemm_sp())) {
+      auto op_name = std::string(op->args[0].as<StringImmNode>()->value);
+      if (op_name.find("false") == std::string::npos && !in_if_scope_) {
+        has_wgmma_ = true;
+      }
+    }
+    StmtExprVisitor::VisitExpr_(op);
+  }
@@
-  bool has_wgmma_{true};
+  bool has_wgmma_{false};

768-801: Producer path asserts on empty release; handle no-dependency producers gracefully.

ICHECK(!map.release[i].empty()) will fail when a producer has no dependent consumer (valid scenario). Emit the transformed stmt without barriers in that case.

Apply:

-        ICHECK(!map.release[i].empty());
+        if (map.release[i].empty()) {
+          block_stmt.push_back(seq_transformed[i]);
+          new_body.push_back(MakeGroupBlock(
+              block_stmt.size() == 1 ? block_stmt[0] : SeqStmt(std::move(block_stmt)),
+              annotations));
+          continue;
+        }
src/transform/lower_device_storage_access_info.cc (1)

111-114: Fix typo in error message ("adddressable" → "addressable")

User-facing diagnostic has a spelling mistake.

-          << buffer_var << " is not adddressable.";
+          << buffer_var << " is not addressable.";
src/transform/storage_rewrite.cc (1)

1506-1513: Fix potential 0-lane dtype (crashes/asserts) when coeff==0.

When ModularSet.coeff==0 (unknown), with_lanes(0) is invalid and can trip dtype checks. Guard to at least 1.

-    if (detect_scalar_read_patterns_ && is_buffer_load && !indices.empty()) {
+    if (detect_scalar_read_patterns_ && is_buffer_load && !indices.empty()) {
       const PrimExpr last_dim_index = indices[indices.size() - 1];
       if (last_dim_index.dtype().lanes() == 1) {
         arith::ModularSet me = analyzer_.modular_set(last_dim_index);
-        var_info.scalar_read_dtype.emplace(access_dtype.with_lanes(me->coeff));
+        int coeff = me->coeff;
+        if (coeff <= 0) coeff = 1;
+        var_info.scalar_read_dtype.emplace(access_dtype.with_lanes(coeff));
         return;
       }
     }
🧹 Nitpick comments (72)
src/transform/persist_threadblock.cc (4)

56-58: Silence unused-parameter warnings and drop unnecessary lambda capture.

m and ctx aren’t used; clang-tidy/-Wunused-parameter will flag this. Also no capture needed.

-  auto pass_func = [=](PrimFunc f, const IRModule &m, const PassContext &ctx) {
-    return PersistThreadblock::Substitute(f);
-  };
+  auto pass_func = [](PrimFunc f, const IRModule& /*m*/, const PassContext& /*ctx*/) {
+    return PersistThreadblock::Substitute(f);
+  };

2-4: Fix header metadata to match file contents.

Keep doxygen accurate.

- * \file lower_l2_persistent_annotation.cc
- * \brief Lower L2 persistent annotation
+ * \file persist_threadblock.cc
+ * \brief Mark functions to use cooperative groups when grid sync is detected

30-33: Correct stale comment.

The pass detects sync_grid, not buffer maps.

-    // Trace the buffer map for tvm_access_ptr
+    // Detect builtin::sync_grid usage to enable cooperative groups

27-37: Prefer pass-by-value for PrimFunc in Substitute.

PrimFunc is an ObjectRef; taking by value is cheap and avoids non-const reference API. Call sites already pass by value.

-  static PrimFunc Substitute(PrimFunc &f) {
+  static PrimFunc Substitute(PrimFunc f) {
     PrimFuncNode *fptr = f.CopyOnWrite();
     PersistThreadblock substituter;
     // Detect builtin::sync_grid usage to enable cooperative groups
     fptr->body = substituter.VisitStmt(f->body);
     if (substituter.has_sync_grid_) {
       f = WithAttr(std::move(f), attr::kUseCooperativeGroups,
                    IntImm(DataType::Int(32), 1));
     }
     return f;
   }
src/transform/inject_pipeline.cc (1)

624-656: need_bound_check flag in EmitImpl is reasonable; small doc nudge.
Using bound checks only in prologue/epilogue trims predicates in the steady-state loop. Consider a brief comment on why the body is safe without bounds checks given the chosen [start, end) ranges.

src/transform/simplify.cc (3)

90-90: Optional: take PrimFunc by const& to avoid clang-tidy performance warning.
PrimFunc is a cheap ObjectRef, but this silences performance-unnecessary-value-param if enabled.

Apply this diff:

-    Visitor(PrimFunc func) : func(std::move(func)) {}
+    explicit Visitor(const PrimFunc& func) : func(func) {}

220-223: Nit: mark unused param for clang-tidy.
If simplify_arguments isn’t referenced, annotate it to avoid readability-unused-parameter.

Apply this diff:

-        bool simplify_arguments = false) {
+        [[maybe_unused]] bool simplify_arguments = false) {

482-482: Nit: suppress unused parameter warning for m.
Helps with clang-tidy integration.

Apply this diff:

-  auto pass_func = [=](PrimFunc f, const IRModule &m, PassContext ctx) {
+  auto pass_func = [=](PrimFunc f, [[maybe_unused]] const IRModule &m, PassContext ctx) {
src/transform/wgmma_sync_rewriter.cc (5)

22-36: Const-ref helpers look good; limit linkage to this TU.

Mark these free helpers static to avoid exporting symbols unnecessarily.

-bool isGemm(const Stmt &stmt) {
+static bool isGemm(const Stmt &stmt) {
@@
-bool isGemmSync(const Stmt &stmt) {
+static bool isGemmSync(const Stmt &stmt) {
@@
-bool isArriveBarrier(const Stmt &stmt) {
+static bool isArriveBarrier(const Stmt &stmt) {

Also applies to: 38-52, 54-63


183-186: Guard template-suffix manipulation.

Assume the extern name ends with '>' before substr(…, size()-1) to avoid UB on unexpected inputs.

-          std::string new_name = name.substr(0, name.size() - 1) + ", -1>";
+          ICHECK(!name.empty() && name.back() == '>')
+              << "unexpected GEMM extern name: " << name;
+          std::string new_name = name.substr(0, name.size() - 1) + ", -1>";

221-223: Defensive check before Downcast on args[1].

If a pre-existing warpgroup_wait has an unexpected signature, Downcast will crash. Skip rewrite in that case.

-        auto sync_index =
-            static_cast<int>(Downcast<IntImm>(call->args[1])->value);
+        if (call->args.size() <= 1 || !call->args[1].as<IntImmNode>()) {
+          // Unexpected signature; leave as-is
+          continue;
+        }
+        int sync_index =
+            static_cast<int>(Downcast<IntImm>(call->args[1])->value);

263-265: Drop unnecessary capture.

No captures are used; prefer empty capture for clarity.

-  auto pass_func = [=](PrimFunc f, const IRModule &m, const PassContext &ctx) {
+  auto pass_func = [](PrimFunc f, const IRModule &m, const PassContext &ctx) {
     return WgmmaSyncRewriter::Substitute(std::move(f));
   };

2-3: Header name mismatch.

Comment says warp_specialized_pipeline.cc; file is wgmma_sync_rewriter.cc. Please update to avoid confusion.

requirements-lint.txt (1)

8-8: Align clang-format major version with clang-tidy
Both clang-tidy==18.1.8 and clang-format==18.1.8 are available on PyPI. Bump clang-format from 15.0.7 to 18.1.8 in requirements-lint.txt to keep both tools on the same major version and avoid AST/config mismatches.

- clang-format==15.0.7
+ clang-format==18.1.8
src/transform/lower_l2_persistent_annotation.cc (2)

95-98: Const-ref lambda params look good; prefer empty capture.

No captures are used; switching [=] to [] avoids accidental captures.

Apply:

-auto pass_func = [=](PrimFunc f, const IRModule &m, const PassContext &ctx) {
+auto pass_func = [](PrimFunc f, const IRModule &m, const PassContext &ctx) {

26-36: Remove unused buffer_map_ to satisfy clang-tidy (-Wunused-private-field).

The std::unordered_map buffer_map_ is populated but never read; buffer_data_to_buffer_ already covers lookups.

Apply:

 class LowerL2Persistent : public StmtExprMutator {
 public:
   static PrimFunc Substitute(PrimFunc &f) {
     PrimFuncNode *fptr = f.CopyOnWrite();
     LowerL2Persistent substituter;
-    // Trace the buffer map for tvm_access_ptr
-    substituter.buffer_map_.insert(f->buffer_map.begin(), f->buffer_map.end());
     for (const auto &[_, buffer] : f->buffer_map) {
       substituter.buffer_data_to_buffer_.Set(buffer->data, buffer);
     }
@@
   Stmt VisitStmt_(const BlockNode *op) final {
     // Record the mapping from buffer data var to buffer for later lookup
-    for (auto buffer : op->alloc_buffers) {
-      buffer_map_.insert({buffer->data, buffer});
-    }
-    for (auto match_buffer : op->match_buffers) {
-      buffer_map_.insert({match_buffer->buffer->data, match_buffer->buffer});
-    }
     for (auto buffer : op->alloc_buffers) {
       buffer_data_to_buffer_.Set(buffer->data, buffer);
     }
@@
 private:
   // Mapping from data Var of a Buffer to Buffer, for lookup
   Map<Var, Buffer> buffer_data_to_buffer_;
-  std::unordered_map<Var, Buffer, ObjectPtrHash, ObjectPtrEqual> buffer_map_;
   Map<Buffer, FloatImm> hit_ratio_map_;
   LowerL2Persistent() = default;

Also applies to: 59-64, 84-89

src/op/reduce.h (1)

17-23: Add size guard and verify serialization ABI
In src/op/reduce.h, append:

static_assert(sizeof(ReduceType) == 1, "ReduceType must remain 1 byte.");

No static_cast<int>(ReduceType) usages were found; still audit any JSON/AttrVisitor serialization paths to ensure they handle a 1-byte enum.

src/transform/merge_if_stmt.cc (1)

95-99: Const-ref params: good; also drop captures.

No lambda captures are used; prefer [].

Apply:

-auto pass_func = [=](PrimFunc f, const IRModule &m, const PassContext &ctx) {
+auto pass_func = [](PrimFunc f, const IRModule &m, const PassContext &ctx) {
src/transform/frontend_legalize.cc (1)

86-90: LGTM on const-ref params; minor capture nit.

Same minor as others: prefer [] since nothing is captured.

Apply:

-auto pass_func = [=](PrimFunc f, const IRModule &m, const PassContext &ctx) {
+auto pass_func = [](PrimFunc f, const IRModule &m, const PassContext &ctx) {
src/op/gemm_sp.h (1)

10-11: Make header self-contained: include <cstdint>.

This file uses uint8_t but doesn’t include <cstdint>. Avoid relying on transitive includes.

Apply:

-#include "operator.h"
+#include <cstdint>
+#include "operator.h"
src/transform/legalize_vectorized_loop.cc (2)

84-86: Narrow the lambda capture list.

No captures used; prefer empty capture for clarity.

Apply:

-auto pass_func = [=](PrimFunc f, const IRModule &m, const PassContext &ctx) {
+auto pass_func = [](PrimFunc f, const IRModule &m, const PassContext &ctx) {

70-72: Avoid double-visiting non-vectorized loops.

Return the already-mutated for_node instead of re-calling VisitStmt_, which does redundant work.

Apply:

-    if (for_node->kind != ForKind::kVectorized) {
-      return IRMutatorWithAnalyzer::VisitStmt_(op);
-    }
+    if (for_node->kind != ForKind::kVectorized) {
+      return for_node;
+    }
src/transform/annotate_device_regions.cc (1)

69-71: Silence clang-tidy unused-parameter in pass lambda.
Since mod/ctx aren’t used, drop parameter names (or cast to void) to appease clang-tidy’s misc-unused-parameters.

Apply:

-  auto pass_func = [](PrimFunc func, const IRModule &mod,
-                      const tvm::transform::PassContext &ctx) -> PrimFunc {
+  auto pass_func = [](PrimFunc func, const IRModule &,
+                      const tvm::transform::PassContext &) -> PrimFunc {
src/transform/lower_device_kernel_launch.cc (1)

359-360: Same here: avoid clang-tidy unused-parameter on ctx.
Name can be omitted.

-  auto pass_func = [](IRModule mod,
-                      const tir::transform::PassContext &ctx) -> IRModule {
+  auto pass_func = [](IRModule mod,
+                      const tir::transform::PassContext &) -> IRModule {
src/op/operator.h (1)

28-32: Enum underlying type changed to uint8_t — add cstdint and verify ABI/FFI assumptions.

  • Add an explicit include of <cstdint> for uint8_t.
  • If InferLevel crosses FFI/serialization boundaries, confirm no sizeof/underlying-type assumptions.
 #include <tvm/tir/op_attr_types.h>
+#include <cstdint>
 #include <tvm/tir/stmt.h>
src/transform/if_stmt_binding.cc (1)

78-80: Pass lambda signature looks good; keep style consistent across passes

Using const IRModule& and const PassContext& avoids refcount churn; consider aligning other passes to the same signature for consistency.

src/op/gemm.h (2)

17-21: Underlying-type change to uint8_t: verify ABI/FFI assumptions and lock enum values

Good for size, but double-check no code assumes sizeof(GemmWarpPolicy)==sizeof(int) or serializes raw bytes. Consider a compile-time guard to pin values.

Add static assertions right after the enum:

 enum class GemmWarpPolicy : uint8_t {
   kSquare = 0,
   kFullRow = 1,
   kFullCol = 2,
 };
+static_assert(static_cast<int>(GemmWarpPolicy::kSquare) == 0 &&
+              static_cast<int>(GemmWarpPolicy::kFullRow)  == 1 &&
+              static_cast<int>(GemmWarpPolicy::kFullCol)  == 2,
+              "GemmWarpPolicy values changed; update Python IntEnum/FFI accordingly");

39-39: Default-initialize policy to a safe value

Avoids UB from uninitialized enum field.

-  GemmWarpPolicy policy;
+  GemmWarpPolicy policy = GemmWarpPolicy::kSquare;
src/transform/lower_hopper_intrin.cc (1)

146-150: Optional: make PassContext a const-ref for consistency

Match the style used elsewhere (e.g., IfStmtBinding) to reduce refcount churn.

-auto pass_func = [=](PrimFunc f, const IRModule &m, PassContext ctx) {
+auto pass_func = [=](PrimFunc f, const IRModule &m, const PassContext &ctx) {
src/transform/common/loop_fusion_utils.h (1)

94-99: Annotate Fuse() with [[nodiscard]] to prevent accidental dropping of its return value. All existing call sites assign or use the returned Stmt, so no caller updates are needed.

src/transform/inject_fence_proxy.cc (1)

39-40: Include <cstdint> for uint8_t

Relying on transitive includes is brittle. Add the header explicitly.

 #include <tvm/tir/transform.h>
 
 #include "../op/builtin.h"
+#include <cstdint>
src/transform/common/loop_parallel_transform_utils.h (2)

29-33: Unused parameter in Substitute()

skip_thread_partition is not used. This will trigger clang-tidy warnings and confuses intent.

Apply one of:

-static Stmt Substitute(const Stmt &stmt, bool skip_thread_partition = false) {
+static Stmt Substitute(const Stmt &stmt, [[maybe_unused]] bool skip_thread_partition = false) {

or remove the parameter entirely if not needed.


71-112: Predicate construction runs per-inner-loop and can duplicate conditions

bound/upper_bound/shape are recomputed inside the for (size_t j...) loop and the predicate is appended regardless of whether loop_var is actually used in index, leading to repeated identical And(...)s and unnecessary IR bloat. Compute related loop vars first, then build the predicate once only when exactly one related var is found.

Minimal refactor:

-      for (size_t j = 0; j < loop_vars.size(); ++j) {
-        auto loop_var = loop_vars[j];
-        if (used_vars.count(loop_var)) {
-          related_loop_vars.push_back(loop_var);
-        }
-        if (related_loop_vars.size() > 1) {
-          return for_node;
-        }
-
-        auto bound = analyzer_->const_int_bound(index);
-        int64_t upper_bound = bound->max_value + 1;
-        int64_t shape = Downcast<IntImm>(buffer->shape[i])->value;
-        if (upper_bound < shape) {
-          PrimExpr predicate = LT(index, IntImm(index.dtype(), upper_bound));
-          condition = condition.defined() ? And(condition, predicate) : predicate;
-        }
-      }
+      for (size_t j = 0; j < loop_vars.size(); ++j) {
+        if (used_vars.count(loop_vars[j])) {
+          related_loop_vars.push_back(loop_vars[j]);
+          if (related_loop_vars.size() > 1) {
+            return for_node;  // only support a single related loop var
+          }
+        }
+      }
+      if (related_loop_vars.size() == 1) {
+        auto cbound = analyzer_->const_int_bound(index);
+        int64_t upper_bound = cbound->max_value + 1;
+        int64_t shape = Downcast<IntImm>(buffer->shape[i])->value;
+        if (upper_bound < shape) {
+          PrimExpr predicate = LT(index, IntImm(index.dtype(), upper_bound));
+          condition = condition.defined() ? And(condition, predicate) : predicate;
+        }
+      }
src/transform/layout_inference.cc (1)

651-658: Pass lambda signature update looks good; minor nit on move + const refs

Switching m and ctx to const-refs avoids copies. You can also drop std::move(f) in the return if you prefer clarity—NRVO handles it—but it’s benign with TVM handles.

-    return LayoutInferencer::Substitute(std::move(f), skip_thread_partition);
+    return LayoutInferencer::Substitute(f, skip_thread_partition);
src/transform/flatten_buffer.cc (1)

84-95: reads/writes MutateByApply: move is fine but unnecessary

You pass by value and immediately move; this is okay. Alternatively, take const BufferRegion& and return a new region to avoid the extra copy.

Example:

-reads.MutateByApply([this](BufferRegion region) {
-  return MutateBufferRegion(std::move(region));
-});
+reads.MutateByApply([this](const BufferRegion &region) {
+  return MutateBufferRegion(region);
+});

Apply similarly to writes.

src/transform/vectorize_loop.cc (1)

131-131: Remove unused TLVecAllocAccess in vectorize_loop.cc
Ctor move-init is correct; the class is never referenced in this translation unit—delete to eliminate dead code.

src/op/copy.h (2)

242-243: Inconsistent eviction_policy type across ops.
Conv2DIm2ColOpNode uses int while CopyNode now uses a byte-sized enum; align types or document rationale.


106-112: Use strongly-typed EvictionPolicy enum for eviction_policy field

  • In src/op/copy.h add #include <cstdint> and change
    - uint8_t eviction_policy; // Policy for cache eviction
    + EvictionPolicy eviction_policy; // Policy for cache eviction
  • In src/op/copy.cc, change
    node->eviction_policy = args[…].as<IntImmNode>()->value;
    to
    node->eviction_policy = static_cast<EvictionPolicy>(
      args[…].as<IntImmNode>()->value);
    at lines ~139–140 and ~1175–1176.
  • For the TIR op calls in src/op/copy.cc (around lines 1110–1111, 1119–1120), wrap the enum in static_cast<uint8_t>(…) when pushing into args.
  • In src/target/codegen_cuda.cc, when indexing eviction_policy_names_, cast EvictionPolicy back to its underlying byte:
    this->eviction_policy_names_[static_cast<uint8_t>(eviction_policy)]
  • Ensure any other serialization/FFI sites use explicit casts between EvictionPolicy and uint8_t.
src/transform/config_index_bitwidth.cc (2)

61-68: Nit: unnecessary std::move on local temporaries when returning.

Returning local objects already moves; explicit std::move here is noise.

-    return std::move(node);
+    return node;

(Apply similarly to BufferLoad/BufferStore returns.)

Also applies to: 110-119, 137-138, 155-156


162-170: Use the provided PassContext instead of PassContext::Current().

This avoids surprising behavior under nested/isolated contexts.

-    tvm::transform::PassContext ctxt = tvm::transform::PassContext::Current();
+    const tvm::transform::PassContext &ctxt = ctx;
src/transform/legalize_safe_memory_access.cc (1)

347-354: Align pass lambda signature with others: take PassContext by const reference.

Elsewhere in this PR lambdas use const PassContext& to avoid copies and match TVM helpers.

-  auto pass_func = [=](PrimFunc f, const IRModule &m, PassContext ctx) {
+  auto pass_func = [=](PrimFunc f, const IRModule &m, const PassContext &ctx) {
src/transform/inject_tma_barrier.cc (1)

193-202: Simplify UpdateBarrierRange to avoid duplicated Set.
No behavior change; trims one branch write.

-  void UpdateBarrierRange(const PrimExpr &barrier_id, const IntImm &extent) {
-    if (barrier_id_to_range_.count(barrier_id)) {
-      auto old_extent = barrier_id_to_range_[barrier_id];
-      ICHECK_EQ(old_extent->value, extent->value)
-          << "barrier_id: " << barrier_id << " has different extent";
-      barrier_id_to_range_.Set(barrier_id, extent);
-    } else {
-      barrier_id_to_range_.Set(barrier_id, extent);
-    }
-  }
+  void UpdateBarrierRange(const PrimExpr &barrier_id, const IntImm &extent) {
+    if (barrier_id_to_range_.count(barrier_id)) {
+      auto old_extent = barrier_id_to_range_[barrier_id];
+      ICHECK_EQ(old_extent->value, extent->value)
+          << "barrier_id: " << barrier_id << " has different extent";
+    }
+    barrier_id_to_range_.Set(barrier_id, extent);
+  }
src/transform/annotate_warp_group_reg_alloc.cc (1)

112-121: Avoid inserting no-op Evaluate(0) when hints are absent.
Currently, Evaluate(0) is injected even when no register hints are applied. Push inc/dec only when the condition holds.

-      auto inc_reg_stmt = Evaluate(0);
-      auto dec_reg_stmt = Evaluate(0);
+      Optional<Stmt> inc_reg_stmt, dec_reg_stmt;
@@
-      if (dec_reg >= 0 && inc_reg >= 0 && !has_simt_copy) {
+      if (dec_reg >= 0 && inc_reg >= 0 && !has_simt_copy) {
         auto inc_reg_num =
             IntImm(DataType::Int(32), inc_reg == 0 ? 240 : inc_reg);
         auto dec_reg_num =
             IntImm(DataType::Int(32), dec_reg == 0 ? 24 : dec_reg);
-        inc_reg_stmt = Evaluate(
-            Call(DataType::Handle(), set_max_nreg(), {inc_reg_num, 1}));
-        dec_reg_stmt = Evaluate(
-            Call(DataType::Handle(), set_max_nreg(), {dec_reg_num, 0}));
+        inc_reg_stmt = Evaluate(
+            Call(DataType::Handle(), set_max_nreg(), {inc_reg_num, 1}));
+        dec_reg_stmt = Evaluate(
+            Call(DataType::Handle(), set_max_nreg(), {dec_reg_num, 0}));
       }
@@
-      producer_stmts.push_back(dec_reg_stmt);
+      if (dec_reg_stmt.defined()) producer_stmts.push_back(dec_reg_stmt.value());
       producer_stmts.push_back(producer_body);
@@
-      consumer_stmts.push_back(inc_reg_stmt);
+      if (inc_reg_stmt.defined()) consumer_stmts.push_back(inc_reg_stmt.value());
       consumer_stmts.push_back(consumer_body.value());

Also applies to: 123-133

src/transform/loop_vectorize.cc (2)

166-167: Be explicit about FloorMod’s constant type.

Use IntImm for the modulus to avoid implicit conversions and match patterns used elsewhere.

-      condition_ = (FloorMod(offset, vector_size_) == 0);
+      condition_ = (FloorMod(offset, IntImm(DataType::Int(32), vector_size_)) == 0);

239-241: Const-correctness improvement looks good; consider const-ref Var too.

Accepting expr and iter_var_size by const-ref is good. You can also take Var by const Var& for consistency.

-bool IndiceCanVectorize(const PrimExpr &expr, Var var,
+bool IndiceCanVectorize(const PrimExpr &expr, const Var &var,
                         const PrimExpr &iter_var_size,
                         int target_vectorized_size, arith::Analyzer *analyzer)
src/transform/pipeline_planning.cc (2)

47-48: Move into member: OK (minor nit).

std::move into buffer_data_to_buffer_ is fine. Note GetReads()/GetWrites() return by value later, so downstream std::move of those copies adds no benefit.


423-426: Prefer same_as for Buffer identity checks.

Using == on Buffer may perform structural equality; for dependency/conflict checks you likely want object identity like elsewhere in this file (same_as).

-                             [&](const BufferRegion &r) {
-                             return r->buffer == read->buffer &&
+                             [&](const BufferRegion &r) {
+                             return r->buffer.same_as(read->buffer) &&
                                      MayConflict(r->region, read->region);
                              })
-                             [&](const BufferRegion &r) {
-                             return r->buffer == write->buffer &&
+                             [&](const BufferRegion &r) {
+                             return r->buffer.same_as(write->buffer) &&
                                      MayConflict(r->region, write->region);
                              })

Also applies to: 439-443

src/transform/lower_thread_allreduce.cc (1)

114-123: Minor: null-check after mutation.

After stmt = StmtExprMutator::VisitStmt_(op);, stmt.as<EvaluateNode>() could return null if transformed; add a defensive check.

   Stmt stmt = StmtExprMutator::VisitStmt_(op);
-  op = stmt.as<EvaluateNode>();
-  const CallNode *call = op->value.as<CallNode>();
+  op = stmt.as<EvaluateNode>();
+  if (!op) return stmt;
+  const CallNode *call = op->value.as<CallNode>();
src/transform/make_packed_api.cc (1)

127-134: Make stored type code width explicit.

Store type_index as an explicit IntImm(32) to avoid implicit conversions and ensure dtype matches the buffer.

-    Stmt store_tcode =
-        BufferStore(info.dummy_tcode_buffer, info.type_index, {0});
+    Stmt store_tcode = BufferStore(
+        info.dummy_tcode_buffer,
+        IntImm(DataType::Int(32), info.type_index), {0});
src/ir.cc (3)

262-279: Prefer checking Optional presence instead of Array.

block_size is always defined after value_or(...); if (block_size.defined()) is thus always true. Check the Optional, and then gate on emptiness.

-    if (block_size.defined()) {
+    if (block_size_opt.defined()) {
       ICHECK(block_size.size() <= 3);
       if (!block_size.empty()) {
         n->frames.push_back(LaunchThread(
             CreateEnvThread("tx", "threadIdx.x", block_size[0].dtype()),
             block_size[0]));
       }

236-241: Drop tautological check.

ICHECK(grid_size.size() >= 0); is always true. Remove or replace with a meaningful constraint.

-    ICHECK(grid_size.size() >= 0);
+    // No-op check removed; CPU kernels allow arbitrary grid dims.

338-347: Guard against empty warp group set.

Constructing If(condition) with an undefined condition when warp_group_ids is empty is unsafe. Assert non-empty to avoid invalid IR.

 WarpSpecializeFrame WarpSpecialize(const Array<IntImm> &warp_group_ids,
                                    const PrimExpr &thread_idx,
                                    int warp_group_size = 128) {
   ObjectPtr<WarpSpecializeFrameNode> n = make_object<WarpSpecializeFrameNode>();
   PrimExpr condition;
   std::vector<int> warp_groups;
+  ICHECK(!warp_group_ids.empty()) << "warp_group_ids must be non-empty";
   warp_groups.reserve(warp_group_ids.size());
src/transform/loop_vectorize_dynamic.cc (3)

186-193: Make alignment condition width explicit.

Use an explicit IntImm(32) for vector_size_ in FloorMod to avoid implicit type assumptions and match other files.

-      condition_ = (FloorMod(offset, vector_size_) == 0);
+      condition_ =
+          (FloorMod(offset, IntImm(DataType::Int(32), vector_size_)) == 0);

358-363: Avoid shadowing pass-config flag.

disable_dynamic_tail_split (local) shadows the ctor’s disable_dynamic_tail_split_. Use the member to honor the value passed into the rewriter and reduce confusion.

-    Optional<Bool> opt_disable_dynamic_tail_split =
-        ctxt->GetConfig(kDisableDynamicTailSplit, Optional<Bool>());
-    bool disable_dynamic_tail_split =
-        opt_disable_dynamic_tail_split.value_or(Bool(false));
+    // Use the ctor-provided setting to avoid shadowing/confusion.
+    bool disable_dynamic_tail_split = disable_dynamic_tail_split_;

141-194: Bounds shrink loop: ensure progress when vector_size_ becomes 1.

The while-loop halves vector_size_ without guarding vector_size_ > 1. Add a floor to avoid infinite loops if IndiceCanVectorizeDynamic wrongly returns false for 1.

-      while (!IndiceCanVectorizeDynamic(elem_offset, inner_for_->loop_var,
-                                        inner_for_->extent, vector_size_,
-                                        &analyzer_)) {
-        vector_size_ /= 2;
-      }
+      while (vector_size_ > 1 &&
+             !IndiceCanVectorizeDynamic(elem_offset, inner_for_->loop_var,
+                                        inner_for_->extent, vector_size_,
+                                        &analyzer_)) {
+        vector_size_ /= 2;
+      }
src/transform/warp_specialized_rewriter.cc (2)

447-451: Use defaulted copy/move for PipelineInfo.

Manual element-wise copy is unnecessary and costlier. Let the compiler default it.

Apply:

-  PipelineInfo(const PipelineInfo &other) {
-    for (const auto &op_info : other.op_infos) {
-      op_infos.push_back(op_info);
-    }
-  }
+  PipelineInfo(const PipelineInfo &) = default;
+  PipelineInfo(PipelineInfo &&) = default;
+  PipelineInfo &operator=(const PipelineInfo &) = default;
+  PipelineInfo &operator=(PipelineInfo &&) = default;
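
To illustrate (with a hypothetical Info type, not the real PipelineInfo), the defaulted special members copy and move a vector member correctly without any hand-written loop:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for PipelineInfo: defaulted special members
// copy/move the vector member correctly, making a manual element-wise
// copy loop redundant.
struct Info {
  std::vector<std::string> op_infos;
  Info() = default;
  Info(const Info &) = default;
  Info(Info &&) = default;
  Info &operator=(const Info &) = default;
  Info &operator=(Info &&) = default;
};
```

The defaulted move constructor also lets the compiler steal the vector's storage instead of copying it, which the manual copy loop could never do.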

65-76: Typo in method name: FindProducerusedBuffer → FindProducerUsedBuffer.

Minor readability nit.

-  auto FindProducerusedBuffer(const Stmt &stmt) {
+  auto FindProducerUsedBuffer(const Stmt &stmt) {
.github/workflows/ci.yml (3)

21-25: Upgrade setup-python to v5.

Newer action has fixes and better caching. Safe drop-in.

-    - name: Set up Python
-      uses: actions/setup-python@v2
+    - name: Set up Python
+      uses: actions/setup-python@v5

41-45: Skip flash_attn in format-only job.

It’s unnecessary for formatting and adds time; keep installs lean here.

-          pip install flash_attn==2.5.8 --no-user --no-build-isolation
+          # flash_attn not needed for formatting; skip to speed up CI

63-67: Auto-commit message clarity.

Consider a more descriptive message (e.g., "ci: apply clang-format/clang-tidy fixes") for traceability.

.github/workflows/amd_ci.yml (3)

23-26: Upgrade setup-python to v5.

Align with main CI for consistency and fixes.

-    - name: Set up Python
-      uses: actions/setup-python@v2
+    - name: Set up Python
+      uses: actions/setup-python@v5

102-108: Use consistent install flags and reduce duplication.

You install the project twice; keep only one install (the later “wheel form”) to simplify. Also, set PIP_NO_BUILD_ISOLATION only where it is needed.

Also applies to: 111-116


63-67: Auto-commit message clarity.

Mirror the NVIDIA suggestion for commit message.

src/transform/storage_access.h (1)

147-147: Consider making ComputeThreadRange a const method

Method reads analyzer_ and inputs only; making it const improves API clarity and enables calls on const visitors.

-  Map<Var, Range> ComputeThreadRange(const Array<IterVar> &threads);
+  Map<Var, Range> ComputeThreadRange(const Array<IterVar> &threads) const;

Outside this file (implementation in src/transform/storage_access.cc), update the definition accordingly:

// in src/transform/storage_access.cc
Map<Var, Range> TileLangStorageAccessVisitor::ComputeThreadRange(
    const Array<IterVar> &threads) const { /* ... */ }
src/transform/storage_access.cc (4)

188-205: Guarded relax + invariant on touched — OK; consider defensive handling.

ICHECK(!e.touched.empty()) is sound given the buffer-defined guard, but if upstream ever emits a buffer-defined entry without touched, this will hard-fail. Optional: skip relax for such entries instead of crashing.

-        ICHECK(!e.touched.empty());
+        if (e.touched.empty()) {
+          continue;  // defensive: nothing to relax
+        }

342-345: Match stride dtype to shape dtype to avoid overflow/mismatch.

Shape may be 64-bit; using Int(32) can overflow or force casts.

-            PrimExpr stride = make_const(DataType::Int(32), 1);
+            PrimExpr stride = make_const(shape[i].dtype(), 1);

398-415: ComputeThreadRange: const-ref param and bound-based construction — good.

If const_int_bound can’t infer min/max, consider a fallback (e.g., skip Set or default [0,1)) to avoid bogus extents.

       auto min_value = const_int_bound->min_value;
       auto max_value = const_int_bound->max_value;
+      if (min_value > max_value) continue;  // unknown/invalid bound

417-423: Align GetScope in thread_storage_sync to use const Var&

  • Callers in storage_access already bind to the new const Var& signature; no updates required.
  • In src/transform/thread_storage_sync.cc (around line 538), change
    StorageScope GetScope(Var buffer_var) const {
      return StorageScope::Create(GetPtrStorageScope(std::move(buffer_var)));
    }
    to
    - StorageScope GetScope(Var buffer_var) const {
    -   return StorageScope::Create(GetPtrStorageScope(std::move(buffer_var)));
    - }
    + StorageScope GetScope(const Var &buffer_var) const {
    +   return StorageScope::Create(GetPtrStorageScope(buffer_var));
    + }
src/transform/storage_rewrite.cc (4)

349-358: Prefer static_cast over reinterpret_cast for downcasts after IsInstance checks.

You already check the dynamic type using IsInstance. Use static_cast for safer, intention-revealing casts.

-      VisitStmt_(reinterpret_cast<const AttrStmtNode *>(stmt));
+      VisitStmt_(static_cast<const AttrStmtNode *>(stmt));
-      VisitStmt_(reinterpret_cast<const ForNode *>(stmt));
+      VisitStmt_(static_cast<const ForNode *>(stmt));
-      VisitStmt_(reinterpret_cast<const IfThenElseNode *>(stmt));
+      VisitStmt_(static_cast<const IfThenElseNode *>(stmt));
-      VisitStmt_(reinterpret_cast<const WhileNode *>(stmt));
+      VisitStmt_(static_cast<const WhileNode *>(stmt));
-      VisitStmt_(reinterpret_cast<const BufferStoreNode *>(stmt));
+      VisitStmt_(static_cast<const BufferStoreNode *>(stmt));
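
For reference, a minimal self-contained sketch (toy Node/ForNode types, not TVM's actual object system) of why static_cast is sufficient once the dynamic type has been verified:

```cpp
#include <cassert>

// Toy hierarchy mirroring the IsInstance-then-downcast pattern: after the
// runtime type check succeeds, static_cast expresses a checked downcast;
// reinterpret_cast would compile too but documents no intent and bypasses
// the compiler's hierarchy knowledge.
struct Node {
  int type_key;
  explicit Node(int k) : type_key(k) {}
  template <typename T> bool IsInstance() const { return type_key == T::kKey; }
};

struct ForNode : Node {
  static constexpr int kKey = 1;
  int extent;
  explicit ForNode(int e) : Node(kKey), extent(e) {}
};

int GetExtent(const Node *stmt) {
  if (stmt->IsInstance<ForNode>()) {
    // Safe: the check above guarantees the dynamic type.
    return static_cast<const ForNode *>(stmt)->extent;
  }
  return -1;
}
```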

997-1006: Same here: use static_cast for typed access to s.stmt.

Consistent with the above, avoid reinterpret_cast when a static_cast suffices post type-check.

-        const auto *op = reinterpret_cast<const AttrStmtNode *>(s.stmt);
+        const auto *op = static_cast<const AttrStmtNode *>(s.stmt);
...
-        const auto *op = reinterpret_cast<const ForNode *>(s.stmt);
+        const auto *op = static_cast<const ForNode *>(s.stmt);

1814-1816: Use last-extent dtype for division constant to avoid mixed dtypes.

extents[0].dtype() may differ from the last extent; use the last extent’s dtype for the make_const divisor.

-    extents.Set(extents.size() - 1, extents[extents.size() - 1] /
-                                        make_const(extents[0].dtype(), factor));
+    extents.Set(extents.size() - 1, extents[extents.size() - 1] /
+                                        make_const(extents.back().dtype(), factor));

1914-1951: No compatibility issue with CreatePrimFuncPass. Both const IRModule& and passing PassContext by value or as const PassContext& bind correctly to the expected TypedPackedFunc<PrimFunc(PrimFunc, IRModule, PassContext)>. Optional: update the StorageRewrite lambda to take const PassContext& ctx (instead of by value) to match PointerValueTypeRewrite and most other passes.

Comment on lines +51 to 53
mkdir -p build
cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_ROCM=ON; cd ..
if ! output=$(./format.sh 2>&1); then

🛠️ Refactor suggestion

Harden format step (ROCm): fail on cmake error and verify compile_commands.json.

Mirror NVIDIA workflow hardening.

-        mkdir -p build
-        cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_ROCM=ON; cd ..
+        set -euo pipefail
+        cmake -S . -B build -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_ROCM=ON
+        test -f build/compile_commands.json || { echo "compile_commands.json not found"; exit 2; }
@@
-        rm -rf build
+        rm -rf build

Also applies to: 61-61

🤖 Prompt for AI Agents
.github/workflows/amd_ci.yml around lines 51-53 (and line 61): the format step
does not fail fast on cmake errors and doesn't verify that
build/compile_commands.json was generated; update the workflow to check the
cmake exit status (fail if cmake returns non-zero) and after running cmake
verify that build/compile_commands.json exists (fail the job if missing) before
invoking ./format.sh so the pipeline mirrors the NVIDIA hardening behavior.

Comment on lines +50 to 53
mkdir -p build
# run cmake to create the build directory with compile_commands.json
cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_CUDA=ON; cd ..
if ! output=$(./format.sh 2>&1); then

🛠️ Refactor suggestion

Harden format step: fail fast on cmake errors and verify compile_commands.json.

Currently, cmake failures won’t fail the step. Also check the file exists before running format.sh.

-        mkdir -p build
-        # run cmake to create the build directory with compile_commands.json
-        cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_CUDA=ON; cd ..
+        set -euo pipefail
+        cmake -S . -B build -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_CUDA=ON
+        test -f build/compile_commands.json || { echo "compile_commands.json not found"; exit 2; }
@@
-        rm -rf build
+        rm -rf build

Also applies to: 61-61

🤖 Prompt for AI Agents
In .github/workflows/ci.yml around lines 50-53 and also at line 61, the workflow
currently runs cmake but does not fail the step on cmake errors and proceeds to
run format.sh without ensuring compile_commands.json exists; modify the script
so that the cmake invocation is checked (fail fast if cmake returns non-zero)
and then verify that build/compile_commands.json exists before calling
./format.sh (fail the job with a clear error if it's missing), applying the same
checks to the later occurrence at line 61.

Comment on lines 252 to 318
echo 'tile-lang clang-tidy: Check Start'
# If clang-tidy is available, run it; otherwise, skip
if command -v run-clang-tidy &>/dev/null; then
# Check if clang-tidy is available
if ! command -v clang-tidy &>/dev/null; then
echo "clang-tidy not found. Skipping clang-tidy checks."
else
# Get clang-tidy version
CLANG_TIDY_VERSION=$(clang-tidy --version | head -n1 | awk '{print $4}')
echo "Using clang-tidy version: $CLANG_TIDY_VERSION"

# Check if build directory exists
if [ ! -d "build" ]; then
echo "Build directory not found. Skipping clang-tidy checks."
else
# Run clang-tidy on specified files
clang_tidy_files() {
run-clang-tidy -j 64 "$@" -p build
}

# Run clang-tidy on all C/C++ source files
clang_tidy_all() {
run-clang-tidy -j 64 src/*.cc -p build
}

# Run clang-tidy on changed C/C++ files relative to main
clang_tidy_changed() {
if git show-ref --verify --quiet refs/remotes/origin/main; then
BASE_BRANCH="origin/main"
else
BASE_BRANCH="main"
fi

MERGEBASE="$(git merge-base $BASE_BRANCH HEAD)"

# Get changed C/C++ files
CHANGED_FILES=$(git diff --name-only --diff-filter=ACM "$MERGEBASE" -- '*.c' '*.cc' '*.cpp' '*.h' '*.hpp' 2>/dev/null || true)

if [ -n "$CHANGED_FILES" ]; then
echo "Running clang-tidy on changed files:"
echo "$CHANGED_FILES"
# Convert newline-separated files to space-separated and run clang-tidy once
CHANGED_FILES_SPACE=$(echo "$CHANGED_FILES" | tr '\n' ' ')
run-clang-tidy -j 64 $CHANGED_FILES_SPACE -p build
else
echo "No C/C++ files changed. Skipping clang-tidy."
fi
}

if [[ "$1" == '--files' ]]; then
# If --files is given, run clang-tidy only on the provided files
clang_tidy_files "${@:2}"
elif [[ "$1" == '--all' ]]; then
# If --all is given, run clang-tidy on all source files
clang_tidy_all
else
# Otherwise, run clang-tidy only on changed C/C++ files
clang_tidy_changed
fi
fi
fi
else
echo "run-clang-tidy not found. Skipping clang-tidy checks."
echo "To install clang-tidy tools, you may need to install clang-tidy and run-clang-tidy."
fi
echo 'tile-lang clang-tidy: Done'


🛠️ Refactor suggestion

Harden clang-tidy integration: detect binaries, honor custom build dir, and lint all TUs

Current block:

  • Assumes run-clang-tidy and clang-tidy exact names; misses version-suffixed binaries (e.g., run-clang-tidy-18).
  • Hardcodes build path to build and only lints src/*.cc for “--all”, which skips nested sources and headers compiled from compile_commands.json.
  • Uses fixed -j 64; better to adapt to host cores and allow override.

Proposed update keeps behavior but improves robustness.

-echo 'tile-lang clang-tidy: Check Start'
-# If clang-tidy is available, run it; otherwise, skip
-if command -v run-clang-tidy &>/dev/null; then
-    # Check if clang-tidy is available
-    if ! command -v clang-tidy &>/dev/null; then
-        echo "clang-tidy not found. Skipping clang-tidy checks."
-    else
-        # Get clang-tidy version
-        CLANG_TIDY_VERSION=$(clang-tidy --version | head -n1 | awk '{print $4}')
-        echo "Using clang-tidy version: $CLANG_TIDY_VERSION"
-
-        # Check if build directory exists
-        if [ ! -d "build" ]; then
-            echo "Build directory not found. Skipping clang-tidy checks."
-        else
-            # Run clang-tidy on specified files
-            clang_tidy_files() {
-                run-clang-tidy -j 64 "$@" -p build
-            }
-
-            # Run clang-tidy on all C/C++ source files
-            clang_tidy_all() {
-                run-clang-tidy -j 64 src/*.cc -p build
-            }
-
-            # Run clang-tidy on changed C/C++ files relative to main
-            clang_tidy_changed() {
-                if git show-ref --verify --quiet refs/remotes/origin/main; then
-                    BASE_BRANCH="origin/main"
-                else
-                    BASE_BRANCH="main"
-                fi
-
-                MERGEBASE="$(git merge-base $BASE_BRANCH HEAD)"
-
-                # Get changed C/C++ files
-                CHANGED_FILES=$(git diff --name-only --diff-filter=ACM "$MERGEBASE" -- '*.c' '*.cc' '*.cpp' '*.h' '*.hpp' 2>/dev/null || true)
-                
-                if [ -n "$CHANGED_FILES" ]; then
-                    echo "Running clang-tidy on changed files:"
-                    echo "$CHANGED_FILES"
-                    # Convert newline-separated files to space-separated and run clang-tidy once
-                    CHANGED_FILES_SPACE=$(echo "$CHANGED_FILES" | tr '\n' ' ')
-                    run-clang-tidy -j 64 $CHANGED_FILES_SPACE -p build
-                else
-                    echo "No C/C++ files changed. Skipping clang-tidy."
-                fi
-            }
-
-            if [[ "$1" == '--files' ]]; then
-               # If --files is given, run clang-tidy only on the provided files
-               clang_tidy_files "${@:2}"
-            elif [[ "$1" == '--all' ]]; then
-               # If --all is given, run clang-tidy on all source files
-               clang_tidy_all
-            else
-               # Otherwise, run clang-tidy only on changed C/C++ files
-               clang_tidy_changed
-            fi
-        fi
-    fi
-else
-    echo "run-clang-tidy not found. Skipping clang-tidy checks."
-    echo "To install clang-tidy tools, you may need to install clang-tidy and run-clang-tidy."
-fi
-echo 'tile-lang clang-tidy: Done'
+echo 'tile-lang clang-tidy: Check Start'
+# Prefer explicit environment overrides when provided:
+#  - RUN_CLANG_TIDY_BIN / CLANG_TIDY_BIN
+#  - BUILD_DIR (or CMAKE_BUILD_DIR)
+RUN_CLANG_TIDY_BIN="${RUN_CLANG_TIDY_BIN:-$(command -v run-clang-tidy || command -v run-clang-tidy-18 || command -v run-clang-tidy-17 || true)}"
+CLANG_TIDY_BIN="${CLANG_TIDY_BIN:-$(command -v clang-tidy || command -v clang-tidy-18 || command -v clang-tidy-17 || true)}"
+BUILD_DIR="${BUILD_DIR:-${CMAKE_BUILD_DIR:-build}}"
+JOBS="${CLANG_TIDY_JOBS:-$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 8)}"
+
+if [[ -z "$RUN_CLANG_TIDY_BIN" ]]; then
+    echo "run-clang-tidy not found. Skipping clang-tidy checks."
+    echo "To install: llvm (provides run-clang-tidy)."
+elif [[ -z "$CLANG_TIDY_BIN" ]]; then
+    echo "clang-tidy not found. Skipping clang-tidy checks."
+elif [[ ! -f "$BUILD_DIR/compile_commands.json" ]]; then
+    echo "compile_commands.json not found under '$BUILD_DIR'. Skipping clang-tidy checks."
+else
+    CLANG_TIDY_VERSION="$($CLANG_TIDY_BIN --version | head -n1)"
+    echo "Using $CLANG_TIDY_VERSION"
+
+    clang_tidy_files() {
+        "$RUN_CLANG_TIDY_BIN" -j "$JOBS" "$@" -p "$BUILD_DIR"
+    }
+    # Run on all TUs from compile_commands.json
+    clang_tidy_all() {
+        "$RUN_CLANG_TIDY_BIN" -j "$JOBS" -p "$BUILD_DIR"
+    }
+    clang_tidy_changed() {
+        if git show-ref --verify --quiet refs/remotes/origin/main; then
+            BASE_BRANCH="origin/main"
+        else
+            BASE_BRANCH="main"
+        fi
+        MERGEBASE="$(git merge-base "$BASE_BRANCH" HEAD)"
+        CHANGED_FILES="$(git diff --name-only --diff-filter=ACM "$MERGEBASE" -- '*.c' '*.cc' '*.cpp' '*.h' '*.hpp' 2>/dev/null || true)"
+        if [[ -n "$CHANGED_FILES" ]]; then
+            echo "Running clang-tidy on changed files:"
+            echo "$CHANGED_FILES"
+            # Space-separate and run once (run-clang-tidy handles parallelism)
+            CHANGED_FILES_SPACE="$(echo "$CHANGED_FILES" | tr '\n' ' ')"
+            "$RUN_CLANG_TIDY_BIN" -j "$JOBS" $CHANGED_FILES_SPACE -p "$BUILD_DIR"
+        else
+            echo "No C/C++ files changed. Skipping clang-tidy."
+        fi
+    }
+    if [[ "$1" == '--files' ]]; then
+        clang_tidy_files "${@:2}"
+    elif [[ "$1" == '--all' ]]; then
+        clang_tidy_all
+    else
+        clang_tidy_changed
+    fi
+fi
+echo 'tile-lang clang-tidy: Done'
🤖 Prompt for AI Agents
In format.sh around lines 252 to 318, the clang-tidy block assumes exact binary
names, hardcodes the build dir and file globs, and uses a fixed -j 64; update it
to (1) detect version-suffixed binaries by searching for run-clang-tidy* and
clang-tidy* and pick the first available, storing their names in variables; (2)
allow a configurable build directory (e.g., use BUILD_DIR env var or fallback to
"build") and pass it to -p; (3) for --all prefer invoking run-clang-tidy with
just -p $BUILD_DIR (or if you must supply files, recursively find source files
under src) so all translation units from compile_commands.json are linted; (4)
replace the hardcoded -j 64 with JOBS=${CLANG_TIDY_JOBS:-$(nproc)} so users can
override parallelism; and (5) quote variables and safely convert changed-file
lists to arguments when calling run-clang-tidy.

Comment on lines 121 to 124
ParallelOp(const For &&root) {
auto op = make_object<ParallelOpNode>(root);
data_ = std::move(op);
}

⚠️ Potential issue

Anti-pattern: constructor takes const rvalue reference

ParallelOp(const For&&) rejects lvalues and blocks moving (const). Prefer pass-by-value or For&& with forwarding. Given TVM handles are cheap, pass-by-value is simplest.

-  ParallelOp(const For &&root) {
-    auto op = make_object<ParallelOpNode>(root);
+  explicit ParallelOp(For root) {
+    auto op = make_object<ParallelOpNode>(std::move(root));
     data_ = std::move(op);
   }
🤖 Prompt for AI Agents
In src/op/parallel.h around lines 121-124, the constructor currently takes a
const rvalue reference (const For&&) which rejects lvalues and prevents moving;
change the signature to take For by value (ParallelOp(For root)) and then move
root into the node (e.g., pass std::move(root) to make_object or to the
ParallelOpNode constructor). Update any call sites if they relied on the old
signature so they continue to compile (no changes needed for temporaries;
lvalues will now be copied/moved as expected).
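
A standalone sketch of the pattern (toy Op type, not the actual ParallelOp): a by-value parameter accepts both lvalues and rvalues, and std::move inside the constructor avoids a second copy, whereas `const T&&` would bind only to rvalues and still forbid moving from them.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Toy stand-in for ParallelOp: pass-by-value + std::move handles lvalues
// (copied into the parameter) and rvalues (moved into the parameter) with
// at most one copy. A `const T&&` parameter would reject lvalues entirely
// and, being const, could never be moved from.
struct Op {
  std::string name;
  explicit Op(std::string n) : name(std::move(n)) {}
};
```

An lvalue argument is copied into the parameter and then moved into the member; a temporary is moved twice, both cheap operations.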

// we should analyze the second-to-last dimension
// in relation to the last dimension.
if (gcd_base < Downcast<IntImm>(last_dim)->value) {
if (gcd_base < static_cast<int>(Downcast<IntImm>(last_dim)->value)) {

💡 Verification agent

🧩 Analysis chain

Verify type compatibility in comparison.

The explicit cast to int when comparing with last_dim value ensures type compatibility. However, ensure that the Downcast to IntImm doesn't fail before this comparison.


🏁 Script executed:

#!/bin/bash
# Check if there are any other similar comparisons that might need attention
rg -n "gcd_base.*Downcast.*IntImm.*value" --type cpp

Length of output: 367


Enforce IntImm type before downcast
In src/transform/atomicadd_vectorize.cc:108, add a precondition to guarantee last_dim is an IntImm before calling Downcast<IntImm>(…). For example:

ICHECK(last_dim->IsInstance<IntImmNode>());
auto imm = Downcast<IntImm>(last_dim);
if (gcd_base < static_cast<int>(imm->value)) { … }

This prevents an invalid downcast at runtime.

🤖 Prompt for AI Agents
In src/transform/atomicadd_vectorize.cc around line 108, the code directly
Downcast<IntImm>(last_dim) without ensuring last_dim is an IntImm; add a
precondition check using ICHECK(last_dim->IsInstance<IntImmNode>()) (or equivalent)
before the downcast, assign the result to a local (e.g., imm) and use
imm->value in the comparison so the downcast is safe and avoids runtime errors.

Comment on lines 68 to 78
if (mem_reuse_max > 0) {
cluster_tag =
"clusterIdx" + String(cluster_tag.c_str() + strlen("blockIdx"));
std::string tag_str = cluster_tag; // Convert to std::string
if (tag_str.rfind("blockIdx", 0) == 0) {
// starts with "blockIdx"
tag_str = "clusterIdx" + tag_str.substr(strlen("blockIdx"));
} else {
// Unexpected format — maybe just prefix
tag_str = "clusterIdx" + tag_str;
}
cluster_tag = tvm::ffi::String(tag_str); // Convert back
return WithAttr(f, cluster_tag, Integer(cluster_size_));

⚠️ Potential issue

Avoid strlen (missing <cstring>) and simplify prefix replacement.

Using strlen without including <cstring> can break builds; simpler string ops suffice. Also, prefer constructing a tvm::String directly.

-      std::string tag_str = cluster_tag; // Convert to std::string
-      if (tag_str.rfind("blockIdx", 0) == 0) {
-        // starts with "blockIdx"
-        tag_str = "clusterIdx" + tag_str.substr(strlen("blockIdx"));
-      } else {
-        // Unexpected format — maybe just prefix
-        tag_str = "clusterIdx" + tag_str;
-      }
-      cluster_tag = tvm::ffi::String(tag_str); // Convert back
+      std::string tag_str = std::string(cluster_tag);
+      if (tag_str.rfind("blockIdx", 0) == 0) {
+        // Replace "blockIdx" prefix (length 8) with "clusterIdx"
+        tag_str.replace(0, 8, "clusterIdx");
+      } else {
+        // Fallback: prefix as-is
+        tag_str = "clusterIdx" + tag_str;
+      }
+      cluster_tag = String(tag_str);
🤖 Prompt for AI Agents
In src/transform/cluster_planning.cc around lines 68-78, replace the current
block that uses strlen and conversions with a safer, simpler approach: use
std::string::rfind("blockIdx", 0) and the prefix length (a literal 8) to
compute the substring index instead of strlen, perform the prefix replacement
directly on a std::string, and then construct the tvm::ffi::String (or
tvm::String) from that std::string in one step; remove the implicit dependency
on <cstring> and avoid unnecessary conversions by creating the tvm string
directly from the modified std::string before returning WithAttr.

Comment on lines 860 to 864
num_barriers_ += map.patterns.size() * num_stages_;

ICHECK(new_body.size() > 0);
ICHECK(!new_body.empty());
return new_body.size() == 1 ? new_body[0] : SeqStmt(std::move(new_body));
}

🛠️ Refactor suggestion

Avoid crash when new_body is empty; fall back to role-filtered stmt.

Returning with ICHECK(!new_body.empty()) can trip when everything was filtered out. Prefer a safe fallback.

Apply:

-    ICHECK(!new_body.empty());
-    return new_body.size() == 1 ? new_body[0] : SeqStmt(std::move(new_body));
+    if (new_body.empty()) {
+      return FilterByRole(op);
+    }
+    return new_body.size() == 1 ? new_body[0] : SeqStmt(std::move(new_body));
🤖 Prompt for AI Agents
In src/transform/warp_specialized_rewriter.cc around lines 860 to 864, remove
the ICHECK(!new_body.empty()) and instead handle the case where new_body is
empty by returning the role-filtered statement as a safe fallback; specifically,
if new_body.empty() then return the role-filtered stmt (the statement produced
before applying the body filtering), otherwise return new_body[0] when size==1
or SeqStmt(std::move(new_body)) as before.

- Modified the format script to include the `-fix` option in the clang-tidy command for automatic code fixes.
- Refactored the `AtomicAddVectorizePlanner` class to improve variable handling and consistency, including changes to member variable types and function signatures.
- Enhanced code clarity by removing unnecessary `std::move` calls and ensuring consistent usage of types across the class.
- General code cleanup to adhere to best practices and improve maintainability.
…torize

- Updated function signatures in `AtomicAddVectorizePlanResult` and `AtomicAddVectorizeRewriter` to use `const` references and `std::move` for better performance and clarity.
- Enhanced the `UpdateVectorSize` method to accept `const Array<PrimExpr>&` for improved efficiency.
- General code cleanup to maintain consistency and adhere to best practices.
- Included a step to initialize and update git submodules recursively in the CI workflow.
- This change ensures that all necessary submodules are available during the format check process, improving build reliability.
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)
src/transform/lower_tile_op.cc (3)

282-290: Guard against non-IntImm last dimension to prevent null deref

as<IntImmNode>()->value will crash for symbolic/dynamic shapes. Add a check.

   int CheckAndGetBufferRowSize(const Buffer &buffer) {
     CHECK(buffer->shape.size() >= 2)
         << "The dimension of Buffer \"" << buffer->name << "\" with shape "
         << buffer->shape << " should be at least 2";
 
     auto dim = buffer->shape.size();
-    auto buffer_row_size = buffer->shape[dim - 1].as<IntImmNode>()->value;
-    return buffer_row_size;
+    const auto* last_dim_imm = buffer->shape[dim - 1].as<IntImmNode>();
+    ICHECK(last_dim_imm)
+        << "The last dimension of Buffer \"" << buffer->name
+        << "\" must be IntImm but got " << buffer->shape[dim - 1];
+    return static_cast<int>(last_dim_imm->value);
   }

409-416: Fix call site: pass NullOpt to tvm::Optional

Passing std::nullopt will not compile with tvm::Optional.

-      auto new_access_ptr =
-          HandleAccessPtrAndOffset(access_ptr, std::nullopt, call->dtype);
+      auto new_access_ptr =
+          HandleAccessPtrAndOffset(access_ptr, NullOpt, call->dtype);

516-523: Incorrect ConstIntBound usage (IterVar vs Var argument)

You’re querying the bound with the IterVar instead of its underlying Var. This is a compile-time bug.

-    if (analyzer_->const_int_bound.IsBound(thread_var_->var)) {
-      auto const_int_bound = analyzer_->const_int_bound(thread_var_);
-      auto min_value = const_int_bound->min_value;
-      auto max_value = const_int_bound->max_value;
-      auto extent = max_value + 1 - min_value;
+    if (analyzer_->const_int_bound.IsBound(thread_var_->var)) {
+      const auto bound = analyzer_->const_int_bound(thread_var_->var);
+      const auto min_value = bound->min_value;
+      const auto max_value = bound->max_value;
+      const auto extent = max_value + 1 - min_value;
       thread_bounds =
           Range::FromMinExtent(IntImm(thread_var_->var.dtype(), min_value),
                                IntImm(thread_var_->var.dtype(), extent));
src/transform/warp_specialized_rewriter.cc (2)

558-566: Fix annotation map type (Any → ObjectRef) to match For::annotations.

Map<String, Any> is not a valid C++ TVM map type here and will fail to compile; annotations are Map<String, ObjectRef>.

-    Map<String, Any> for_annotations = op->annotations;
+    Map<String, ObjectRef> for_annotations = op->annotations;

804-837: Don’t hard-crash when a producer has no release; emit safely instead.

ICHECK(!map.release[i].empty()) can trip for producers with no consumer dependency. Fall back to emitting the stmt (unless mbarrier-only) and continue.

-        ICHECK(!map.release[i].empty());
-        for (size_t j = 0; j < map.release[i].size(); j++) {
+        if (map.release[i].empty()) {
+          if (!mbarrier_only_) {
+            block_stmt.push_back(seq_transformed[i]);
+            new_body.push_back(MakeGroupBlock(
+                block_stmt.size() == 1 ? block_stmt[0]
+                                       : SeqStmt(std::move(block_stmt)),
+                annotations));
+          }
+          continue;
+        }
+        for (size_t j = 0; j < map.release[i].size(); ++j) {
           int pattern_idx = map.release[i][j];
           PrimExpr release_barrier_id =
               stage_ + num_barriers_ + num_stages_ * pattern_idx;
           auto stmt =
               MbarrierRewriter::Rewrite(seq_transformed[i], release_barrier_id);
           collector.Collect(stmt);
           block_stmt.push_back(stmt);
           if (collector.HasSimtCopy()) {
             block_stmt.push_back(makeCpAsyncBarrier(release_barrier_id));
             has_simt_copy_ = true;
           }
           if (map.release_after[i][j]) {
             block_stmt.push_back(makeArriveBarrier(release_barrier_id));
             for (int s = 0; s < num_stages_; s++) {
               released_barrier_.insert(s + num_barriers_ +
                                        num_stages_ * pattern_idx);
             }
           }
           collector.Clear();
           new_body.push_back(MakeGroupBlock(
               block_stmt.size() == 1 ? block_stmt[0]
                                      : SeqStmt(std::move(block_stmt)),
               annotations));
         }
src/transform/atomicadd_vectorize.cc (3)

53-81: Fix void-return misuse and strengthen arg bound check in VisitExpr_

  • Returning a value from a void function is ill-formed.
  • You access args[2] but only guard size() >= 2.

Apply:

-    if (node->op == builtin::call_extern() && node->args.size() >= 2) {
+    if (node->op == builtin::call_extern() && node->args.size() >= 3) {
@@
-    return arith::IRVisitorWithAnalyzer::VisitExpr_(node);
+    arith::IRVisitorWithAnalyzer::VisitExpr_(node);

133-141: Initialize inner_for_ to avoid UB
inner_for_ is read before guaranteed assignment; default-initialize to nullptr.

-  const ForNode *inner_for_;
+  const ForNode *inner_for_ = nullptr;
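Outside TVM, the guarantee this gives can be shown with a minimal stand-in (the type and visitor names below are illustrative, not the real classes):

```cpp
#include <cassert>

struct ForNode {};  // stand-in for tir::ForNode

// The pointer member carries an in-class default initializer, so reading it
// before any visit is well-defined (nullptr) instead of indeterminate.
struct InnerForFinder {
  const ForNode *inner_for_ = nullptr;

  void Visit(const ForNode *op) { inner_for_ = op; }
  bool Found() const { return inner_for_ != nullptr; }
};
```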

235-265: Potential division-by-zero when stride_x_ < vector_size_
Several expressions divide by (stride_x_ / vector_size_). If stride_x_ < vector_size_, this is zero.

Minimal safeguards:

  • Bail out early if stride_x is not a positive multiple of the planned vector size.
  • Assert the same in the rewriter ctor.

Apply (outside this hunk):

  1. In VectorizeAtomicAdd, after vectorize_hint is finalized:
@@
-    vectorize_hint = res.vector_size;
+    vectorize_hint = res.vector_size;
+    if (stride_x % vectorize_hint != 0) {
+      return for_node;
+    }
  2. In AtomicAddVectorizeRewriter ctor (after extent check):
@@
   ICHECK(tx_ext)
       << "thread_bounds->extent must be a constant for vectorization.";
   extent_tx_ = static_cast<int>(*tx_ext);
+  ICHECK(stride_x_ > 0 && (stride_x_ % vector_size_) == 0)
+      << "stride_x must be a positive multiple of vector_size_ to avoid division by zero.";
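The invariant the ICHECK enforces can be sketched standalone; `ClampVectorSize` is a hypothetical helper for illustration, not code from this PR:

```cpp
#include <cassert>

// Shrink the planned vector size until stride_x is a positive multiple of
// it. This is exactly what makes later divisions by (stride_x / vector_size)
// safe: the quotient can no longer be zero.
inline int ClampVectorSize(int stride_x, int vector_size) {
  while (vector_size > 1 && (stride_x <= 0 || stride_x % vector_size != 0)) {
    vector_size /= 2;  // fall back to the next smaller power of two
  }
  return vector_size;
}
```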
♻️ Duplicate comments (5)
format.sh (3)

252-318: Harden clang-tidy block: detect binaries, honor BUILD_DIR, verify compile_commands.json, safe filenames, correct “--all”, configurable jobs/fix

Consolidate to a robust, reproducible flow. Also avoids globbing limits and space-in-filename bugs.

-echo 'tile-lang clang-tidy: Check Start'
-# If clang-tidy is available, run it; otherwise, skip
-if command -v run-clang-tidy &>/dev/null; then
-    # Check if clang-tidy is available
-    if ! command -v clang-tidy &>/dev/null; then
-        echo "clang-tidy not found. Skipping clang-tidy checks."
-    else
-        # Get clang-tidy version
-        CLANG_TIDY_VERSION=$(clang-tidy --version | head -n1 | awk '{print $4}')
-        echo "Using clang-tidy version: $CLANG_TIDY_VERSION"
-
-        # Check if build directory exists
-        if [ ! -d "build" ]; then
-            echo "Build directory not found. Skipping clang-tidy checks."
-        else
-            # Run clang-tidy on specified files
-            clang_tidy_files() {
-                run-clang-tidy -j 64 "$@" -p build
-            }
-
-            # Run clang-tidy on all C/C++ source files
-            clang_tidy_all() {
-                run-clang-tidy -j 64 src/*.cc -p build
-            }
-
-            # Run clang-tidy on changed C/C++ files relative to main
-            clang_tidy_changed() {
-                if git show-ref --verify --quiet refs/remotes/origin/main; then
-                    BASE_BRANCH="origin/main"
-                else
-                    BASE_BRANCH="main"
-                fi
-
-                MERGEBASE="$(git merge-base $BASE_BRANCH HEAD)"
-
-                # Get changed C/C++ files
-                CHANGED_FILES=$(git diff --name-only --diff-filter=ACM "$MERGEBASE" -- '*.c' '*.cc' '*.cpp' '*.h' '*.hpp' 2>/dev/null || true)
-                
-                if [ -n "$CHANGED_FILES" ]; then
-                    echo "Running clang-tidy on changed files:"
-                    echo "$CHANGED_FILES"
-                    # Convert newline-separated files to space-separated and run clang-tidy once
-                    CHANGED_FILES_SPACE=$(echo "$CHANGED_FILES" | tr '\n' ' ')
-                    run-clang-tidy -j 64 $CHANGED_FILES_SPACE -p build -fix
-                else
-                    echo "No C/C++ files changed. Skipping clang-tidy."
-                fi
-            }
-
-            if [[ "$1" == '--files' ]]; then
-               # If --files is given, run clang-tidy only on the provided files
-               clang_tidy_files "${@:2}"
-            elif [[ "$1" == '--all' ]]; then
-               # If --all is given, run clang-tidy on all source files
-               clang_tidy_all
-            else
-               # Otherwise, run clang-tidy only on changed C/C++ files
-               clang_tidy_changed
-            fi
-        fi
-    fi
-else
-    echo "run-clang-tidy not found. Skipping clang-tidy checks."
-    echo "To install clang-tidy tools, you may need to install clang-tidy and run-clang-tidy."
-fi
-echo 'tile-lang clang-tidy: Done'
+echo 'tile-lang clang-tidy: Check Start'
+# Allow overrides and support version-suffixed binaries
+RUN_CLANG_TIDY_BIN="${RUN_CLANG_TIDY_BIN:-$(command -v run-clang-tidy || command -v run-clang-tidy-18 || command -v run-clang-tidy-17 || true)}"
+CLANG_TIDY_BIN="${CLANG_TIDY_BIN:-$(command -v clang-tidy || command -v clang-tidy-18 || command -v clang-tidy-17 || true)}"
+BUILD_DIR="${BUILD_DIR:-${CMAKE_BUILD_DIR:-build}}"
+JOBS="${CLANG_TIDY_JOBS:-$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 8)}"
+FIX_FLAG="${CLANG_TIDY_FIX:+-fix}"
+
+if [[ -z "$RUN_CLANG_TIDY_BIN" ]]; then
+  echo "run-clang-tidy not found. Skipping clang-tidy checks."
+elif [[ -z "$CLANG_TIDY_BIN" ]]; then
+  echo "clang-tidy not found. Skipping clang-tidy checks."
+elif [[ ! -f "$BUILD_DIR/compile_commands.json" ]]; then
+  echo "compile_commands.json not found under '$BUILD_DIR'. Skipping clang-tidy checks."
+else
+  CLANG_TIDY_VERSION="$("$CLANG_TIDY_BIN" --version | head -n1)"
+  echo "Using $CLANG_TIDY_VERSION"
+
+  clang_tidy_files() {
+    "$RUN_CLANG_TIDY_BIN" -j "$JOBS" "$@" -p "$BUILD_DIR" ${FIX_FLAG}
+  }
+  # Lint all translation units from compile_commands.json
+  clang_tidy_all() {
+    "$RUN_CLANG_TIDY_BIN" -j "$JOBS" -p "$BUILD_DIR" ${FIX_FLAG}
+  }
+  clang_tidy_changed() {
+    if git show-ref --verify --quiet refs/remotes/origin/main; then
+      BASE_BRANCH="origin/main"
+    else
+      BASE_BRANCH="main"
+    fi
+    MERGEBASE="$(git merge-base "$BASE_BRANCH" HEAD)"
+    # Read NUL-delimited paths into an array with mapfile: bash command
+    # substitution drops NUL bytes, so a plain "$(...)" capture cannot work
+    mapfile -d '' CHANGED_FILES < <(git diff --name-only -z --diff-filter=ACM "$MERGEBASE" -- '*.c' '*.cc' '*.cpp' '*.h' '*.hpp' 2>/dev/null || true)
+    if ((${#CHANGED_FILES[@]})); then
+      echo "Running clang-tidy on changed files:"
+      printf '%s\n' "${CHANGED_FILES[@]}"
+      # xargs -0 passes paths safely; run once and let run-clang-tidy parallelize
+      printf '%s\0' "${CHANGED_FILES[@]}" | xargs -0 "$RUN_CLANG_TIDY_BIN" -j "$JOBS" -p "$BUILD_DIR" ${FIX_FLAG}
+    else
+      echo "No C/C++ files changed. Skipping clang-tidy."
+    fi
+  }
+
+  if [[ "$1" == '--files' ]]; then
+    clang_tidy_files "${@:2}"
+  elif [[ "$1" == '--all' ]]; then
+    clang_tidy_all
+  else
+    clang_tidy_changed
+  fi
+fi
+echo 'tile-lang clang-tidy: Done'

273-275: “--all” should lint all TUs, not just src/*.cc

run-clang-tidy without an explicit file list uses compile_commands.json; using src/*.cc skips subdirs and headers. The diff above switches to “-p $BUILD_DIR” only.


288-296: Space-safe changed-file handling and configurable -fix

Current approach breaks on spaces and always applies -fix. The diff above uses -z/xargs -0 and gates fixing via CLANG_TIDY_FIX.

.github/workflows/ci.yml (1)

50-54: Fail fast on CMake and assert compile_commands.json before running format.sh

The current inline “cd build; cmake ..; cd ..” can mask cmake failures and proceed without compile_commands.json.

-        git submodule update --init --recursive
-        mkdir -p build
-        # run cmake to create the build directory with compile_commands.json
-        cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_CUDA=ON; cd ..
+        set -euo pipefail
+        git submodule update --init --recursive
+        cmake -S . -B build -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_CUDA=ON
+        test -f build/compile_commands.json || { echo "compile_commands.json not found"; exit 2; }
src/transform/warp_specialized_rewriter.cc (1)

898-900: Avoid crash when new_body is empty; fall back to role-filtered stmt.

Match prior feedback: return FilterByRole(op) when everything was filtered out.

-    ICHECK(!new_body.empty());
-    return new_body.size() == 1 ? new_body[0] : SeqStmt(std::move(new_body));
+    if (new_body.empty()) {
+      return FilterByRole(op);
+    }
+    return new_body.size() == 1 ? new_body[0] : SeqStmt(std::move(new_body));
🧹 Nitpick comments (11)
format.sh (1)

260-262: Version parsing robustness (nit)

awk '{print $4}' is brittle; the diff above prints the full --version line.

.github/workflows/ci.yml (1)

62-62: Ensure build cleanup on both success and failure (trap)

Current rm -rf build won’t run if the step exits early. Use a trap to always clean.

-        rm -rf build
+        trap 'rm -rf build' EXIT
src/op/reduce.h (3)

10-11: Include for uint8_t

Avoid relying on transitive includes; this can break under stricter clang-tidy/header-ordering.

 #include "operator.h"
+#include <cstdint>

167-171: Default-initialize fundamental members to satisfy clang-tidy and prevent accidental UB

Optional, but aligns with cppcoreguidelines-pro-type-member-init and the PR’s hygiene theme.

   tir::Buffer src, dst;
-  int dim;
-  ReduceType type;
-  bool clear;
+  int dim{0};
+  ReduceType type{ReduceType::kSum};
+  bool clear{false};

196-199: Ditto for CumSumOpNode: default-initialize

Keeps headers clean under clang-tidy and documents intended defaults.

   tir::Buffer src, dst;
-  int dim;
-  bool reverse;
+  int dim{0};
+  bool reverse{false};
src/transform/lower_tile_op.cc (1)

573-575: Silence unused lambda parameters (clang-tidy)

Avoid readability-unused-parameter by void-casting.

-  auto pass_func = [=](PrimFunc f, const IRModule &m, const PassContext &ctx) {
-    return LowerTileOpPass::Substitute(std::move(f));
-  };
+  auto pass_func = [=](PrimFunc f, const IRModule &m, const PassContext &ctx) {
+    (void)m;
+    (void)ctx;
+    return LowerTileOpPass::Substitute(std::move(f));
+  };
src/transform/warp_specialized_rewriter.cc (4)

39-61: Avoid copying large sets in ProducerBufferDetector; hold a const reference.

Constructor currently copies the set and then moves; passing/holding a const reference eliminates repeated copies during traversal.

-class ProducerBufferDetector : public StmtExprVisitor {
+class ProducerBufferDetector : public StmtExprVisitor {
 public:
-  ProducerBufferDetector(
-      std::unordered_set<const BufferNode *> cur_producer_buffers)
-      : cur_producer_buffers_(std::move(cur_producer_buffers)) {}
+  explicit ProducerBufferDetector(
+      const std::unordered_set<const BufferNode *> &cur_producer_buffers)
+      : cur_producer_buffers_(cur_producer_buffers) {}
   ...
-  std::unordered_set<const BufferNode *> cur_producer_buffers_;
+  const std::unordered_set<const BufferNode *> &cur_producer_buffers_;

(Calls at Lines 88 and 101 remain valid.)

Also applies to: 88-101
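A TVM-free sketch of the same pattern (type names are illustrative); holding the reference is safe only because the visitor is a short-lived local that never outlives the caller's set:

```cpp
#include <cassert>
#include <unordered_set>

class ProducerBufferDetector {
 public:
  // Hold a const reference instead of copying the (potentially large) set;
  // the referenced set must outlive this object.
  explicit ProducerBufferDetector(const std::unordered_set<int> &buffers)
      : cur_producer_buffers_(buffers) {}

  bool Uses(int buffer) const {
    return cur_producer_buffers_.count(buffer) != 0;
  }

 private:
  const std::unordered_set<int> &cur_producer_buffers_;
};
```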


65-76: Consistent naming: FindProducerUsedBuffers.

Minor readability nit: fix the camel-case “used” and pluralization.

-  auto FindProducerusedBuffer(const Stmt &stmt) {
+  auto FindProducerUsedBuffers(const Stmt &stmt) {
     ...
   }
-    producer_buffers_ = finder.FindProducerusedBuffer(stmt);
+    producer_buffers_ = finder.FindProducerUsedBuffers(stmt);

Also applies to: 136-138


257-263: Pass pred by value (and move) to avoid binding a default temporary to a const&.

Small simplification; consistent with makeGetBarrier/makeParityWait style.

-static Stmt makeArriveBarrier(PrimExpr barrier_id, int cta_id = -1,
-                              const PrimExpr &pred = 1) {
+static Stmt makeArriveBarrier(PrimExpr barrier_id, int cta_id = -1,
+                              PrimExpr pred = 1) {
   Array<PrimExpr> args = {makeGetBarrier(std::move(barrier_id))};
   if (cta_id != -1) {
     args.push_back(cta_id);
-    args.push_back(pred);
+    args.push_back(std::move(pred));
   }
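The by-value-then-move "sink parameter" idiom behind this suggestion, in a minimal standalone form (names are illustrative):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct ArgList {
  std::vector<std::string> args;

  // Sink parameter: an rvalue caller pays one move, an lvalue caller pays
  // exactly one copy (into the parameter), never two copies.
  void Push(std::string arg) { args.push_back(std::move(arg)); }
};
```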

966-971: Remove unused shadowed variable.

The outer for_node is unused and shadowed by the inner declaration.

-    auto for_node = result.as<For>();
     if (result.as<ForNode>()) {
       auto for_node = Downcast<For>(result);
src/transform/atomicadd_vectorize.cc (1)

145-152: Move-initialization OK; consider wider type for extent_tx_
Moving Var/PrimExpr into const members is fine. However, extent_tx_ is stored as int; thread extent may exceed INT_MAX on some targets. Prefer int64_t.

Apply (member change shown outside this hunk):

-  int extent_tx_;
+  int64_t extent_tx_;

And in the ctor:

-  extent_tx_ = static_cast<int>(*tx_ext);
+  extent_tx_ = *tx_ext;
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 1282724 and af8c56c.

📒 Files selected for processing (15)
  • .github/workflows/ci.yml (2 hunks)
  • format.sh (1 hunks)
  • src/op/atomic_add.h (0 hunks)
  • src/op/copy.h (2 hunks)
  • src/op/gemm.h (2 hunks)
  • src/op/gemm_sp.h (1 hunks)
  • src/op/parallel.h (2 hunks)
  • src/op/reduce.h (1 hunks)
  • src/op/region.h (1 hunks)
  • src/transform/atomicadd_vectorize.cc (6 hunks)
  • src/transform/layout_inference.cc (2 hunks)
  • src/transform/lower_tile_op.cc (5 hunks)
  • src/transform/merge_shared_memory_allocations.cc (8 hunks)
  • src/transform/storage_access.cc (6 hunks)
  • src/transform/warp_specialized_rewriter.cc (28 hunks)
💤 Files with no reviewable changes (1)
  • src/op/atomic_add.h
🚧 Files skipped from review as they are similar to previous changes (8)
  • src/op/gemm.h
  • src/op/region.h
  • src/transform/layout_inference.cc
  • src/op/parallel.h
  • src/op/gemm_sp.h
  • src/op/copy.h
  • src/transform/storage_access.cc
  • src/transform/merge_shared_memory_allocations.cc
🧰 Additional context used
🧬 Code graph analysis (3)
src/transform/atomicadd_vectorize.cc (2)
src/transform/loop_vectorize.cc (2)
  • indices (113-168)
  • indices (113-113)
src/transform/loop_vectorize_dynamic.cc (2)
  • indices (141-194)
  • indices (141-141)
src/transform/warp_specialized_rewriter.cc (3)
src/transform/lower_tile_op.cc (6)
  • stmt (82-82)
  • stmt (82-82)
  • stmt (136-141)
  • stmt (136-136)
  • f (210-240)
  • f (210-210)
src/transform/inject_tma_barrier.cc (15)
  • stmt (60-60)
  • stmt (60-60)
  • barrier_id (107-109)
  • barrier_id (107-107)
  • barrier_id (111-115)
  • barrier_id (111-111)
  • barrier_id (193-202)
  • barrier_id (193-193)
  • call (65-80)
  • call (65-65)
  • rewriter (96-96)
  • f (95-99)
  • f (95-95)
  • f (379-405)
  • f (379-379)
src/transform/multi_version_buffer_rewriter.cc (20)
  • stmt (29-33)
  • stmt (29-29)
  • stmt (35-35)
  • stmt (35-35)
  • stmt (117-117)
  • stmt (117-117)
  • call (281-311)
  • call (281-282)
  • op (37-46)
  • op (37-37)
  • op (48-71)
  • op (48-48)
  • op (73-83)
  • op (73-73)
  • op (85-94)
  • op (85-85)
  • op (96-99)
  • op (96-96)
  • f (126-135)
  • f (126-126)
src/transform/lower_tile_op.cc (2)
src/transform/warp_specialized_rewriter.cc (18)
  • stmt (65-76)
  • stmt (65-65)
  • stmt (135-138)
  • stmt (135-135)
  • stmt (140-144)
  • stmt (140-140)
  • stmt (146-146)
  • stmt (146-146)
  • stmt (244-244)
  • stmt (244-244)
  • stmt (285-285)
  • stmt (285-285)
  • stmt (316-320)
  • stmt (316-316)
  • stmt (347-353)
  • stmt (347-348)
  • f (1222-1241)
  • f (1222-1223)
src/transform/layout_inference.cc (6)
  • buffer (340-346)
  • buffer (340-340)
  • f (284-293)
  • f (284-284)
  • f (541-551)
  • f (541-541)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build-test-nvidia
🔇 Additional comments (9)
src/op/reduce.h (1)

157-163: Explicit uint8_t underlying type for ReduceType looks good

Consistent with the PR’s enum-size standardization; no behavioral change implied by the enumerators.
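The effect of pinning the underlying type can be checked at compile time; the enumerators below are illustrative stand-ins for the real ones:

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

enum class ReduceType : std::uint8_t { kSum, kMax, kMin, kAbsSum };

// Pinning the storage type keeps sizeof stable across compilers, which
// matters when the enum is packed into small node fields.
static_assert(sizeof(ReduceType) == 1, "exactly one byte of storage");
static_assert(
    std::is_same_v<std::underlying_type_t<ReduceType>, std::uint8_t>,
    "underlying type is fixed by the declaration");
```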

src/transform/lower_tile_op.cc (2)

82-82: Const-correct Collect() looks good

Switching to const Stmt& matches usage elsewhere and avoids copies.


136-141: Substitute(): pass-by-value + move is appropriate

Taking buffer_remap by value and moving it into the rewriter is the right tradeoff here.

src/transform/warp_specialized_rewriter.cc (2)

508-514: Replace std::endl with '\n' — good.

Removes unnecessary flushes; improves perf in debug print.


1478-1493: Pass lambda signature uses const IRModule& — good.

Avoids IRModule copies in pass driver.

src/transform/atomicadd_vectorize.cc (4)

17-17: Include for std::move — LGTM
Required for the new move semantics below.


39-41: Good use of move in Plan()
Since parameters are by-value, moving into fields avoids an extra copy.


83-131: Bounds logic OK, but one correctness hole remains elsewhere
UpdateVectorSize() refactor to const refs looks good. The remaining correctness issues are flagged in separate comments below.


108-111: Unnecessary int cast removed — LGTM
Comparing gcd_base with `Downcast<IntImm>(last_dim)->value` without narrowing avoids potential loss.

Comment on lines +292 to 296
  PrimExpr
  HandleAccessPtrAndOffset(const PrimExpr &access_ptr,
                           const Optional<PrimExpr> &offset = std::nullopt,
                           DataType dtype = DataType::Int(32)) {
    // The 2th arg of T.tvm_access_ptr call is offset, we set it to 0 and

⚠️ Potential issue

Use TVM NullOpt instead of std::nullopt; mark dtype maybe_unused

tvm::Optional does not accept std::nullopt. Also silence clang-tidy for the unused dtype.

-  PrimExpr
-  HandleAccessPtrAndOffset(const PrimExpr &access_ptr,
-                           const Optional<PrimExpr> &offset = std::nullopt,
-                           DataType dtype = DataType::Int(32)) {
+  PrimExpr
+  HandleAccessPtrAndOffset(const PrimExpr &access_ptr,
+                           const Optional<PrimExpr> &offset = NullOpt,
+                           DataType dtype [[maybe_unused]] = DataType::Int(32)) {
🤖 Prompt for AI Agents
In src/transform/lower_tile_op.cc around lines 292-296, replace the std::nullopt
default for the tvm::Optional<PrimExpr> parameter with the TVM sentinel
(tvm::NullOpt) because tvm::Optional does not accept std::nullopt, and mark the
unused dtype parameter with the maybe_unused attribute (e.g., [[maybe_unused]]
DataType dtype) to silence clang-tidy; update the function signature accordingly
and keep the rest of the implementation unchanged.

Comment on lines +15 to +16
#include <utility>


🛠️ Refactor suggestion

Add missing cstdint include for uint8_t.

Role uses uint8_t (Line 35). Include cstdint explicitly to avoid transitive include reliance.

+#include <cstdint>
 #include <utility>
🤖 Prompt for AI Agents
In src/transform/warp_specialized_rewriter.cc around lines 15 to 16, the file
uses uint8_t (line 35) but doesn't include <cstdint>, relying on transitive
includes; add an explicit #include <cstdint> near the existing includes (e.g.,
alongside <utility>) so uint8_t is defined and the file no longer depends on
indirect headers.

- Included a command to initialize and update git submodules recursively in the CI workflow during the format check process.
- This enhancement ensures that all required submodules are available, contributing to improved build reliability.
- Modified the `VectorizeAtomicAdd` function signature to use `const` references for `thread_var` and `thread_bounds`, enhancing performance and code clarity.
- This change aligns with previous refactoring efforts to improve parameter handling and consistency across the codebase.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
src/transform/atomicadd_vectorize.h (2)

9-11: Trim unused heavy includes in header

arith/analyzer.h isn’t used in this declaration; keep heavy deps in the .cc to speed up builds.

Apply:

-#include <tvm/arith/analyzer.h>
 #include <tvm/tir/op.h>

Optionally, if op.h is also unused here, replace it with the minimal headers needed for types (e.g., stmt/expr) in a follow-up after a quick compile check.


3-3: Minor doc nit

“for atomicadd” → “for AtomicAdd” or “for atomic add loops” for clarity.

- * \brief A tool to automatically vectorize a for atomicadd
+ * \brief A tool to automatically vectorize AtomicAdd loops
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d9f2a70 and d43d622.

📒 Files selected for processing (1)
  • src/transform/atomicadd_vectorize.h (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/atomicadd_vectorize.h (1)
src/transform/atomicadd_vectorize.cc (2)
  • VectorizeAtomicAdd (330-385)
  • VectorizeAtomicAdd (330-331)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: format-check
  • GitHub Check: format-check
🔇 Additional comments (1)
src/transform/atomicadd_vectorize.h (1)

17-18: Signature update is consistent across header, implementation, and call sites. No outdated prototypes detected.

namespace tvm {
namespace tl {

using namespace tir;

🛠️ Refactor suggestion

Avoid using namespace in headers

Header-wide using-directives leak symbols to all includers and will likely be flagged by clang-tidy (google-build-using-namespace). Prefer fully qualified names.

Apply:

-using namespace tir;
+// Avoid namespace pollution in headers.

And qualify the declaration below:

-For VectorizeAtomicAdd(const For &for_node, const Var &thread_var,
-                       const Range &thread_bounds, int compute_capability);
+tir::For VectorizeAtomicAdd(const tir::For &for_node, const tir::Var &thread_var,
+                            const Range &thread_bounds, int compute_capability);

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/transform/atomicadd_vectorize.h around line 15, remove the header-level
"using namespace tir;" statement and instead qualify any symbols from the tir
namespace in the declarations below (replace unqualified types/functions with
tir::TypeName or tir::functionName as needed); ensure all occurrences in this
header are updated to fully-qualified tir:: identifiers to avoid leaking symbols
to includers.

@LeiWang1999 LeiWang1999 merged commit cdc5d8d into tile-ai:main Sep 2, 2025
6 checks passed
chengyupku added a commit to tile-ai/tilescale that referenced this pull request Oct 24, 2025
* [Index] Relocate Int64 Auto Promoter to ConfigBitWidth Pass, removing it from FlattenBuffer (#714)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107

* Refactor inject_pipeline.cc to enhance pipeline body rewriting and condition handling

- Introduced a new function to replace IfThenElse nodes with their then_case while preserving attributes.
- Streamlined the PipelineBodyRewriter to improve buffer access rewriting and async state management.
- Enhanced the handling of pipeline loop conditions and added support for predicate conditions in the pipeline body.
- Removed obsolete code and improved overall code clarity and maintainability.

* lint fix

* Refactor return statements in inject_pipeline.cc to remove unnecessary std::move calls

- Updated return statements in multiple methods to return objects directly instead of using std::move, improving code clarity and potentially avoiding unnecessary moves.
- Ensured consistent handling of BufferStore and BufferLoad nodes during pipeline transformations.

* test fix

* Enhance global read detection in pipeline planning

- Updated the handling of global reads to account for condition expressions within IfThenElse nodes, ensuring accurate identification of global memory accesses.
- Introduced a new flag to track whether the visitor is within a condition expression, improving the correctness of buffer access analysis.
- Refactored the VisitStmt_ method to properly handle the structure of IfThenElse nodes, enhancing the clarity and maintainability of the code.

* Add IndexLegalizer to enforce int64 for out-of-bound indices

- Introduced the IndexLegalizer class to ensure that indices in BufferStore and BufferLoad nodes are promoted to int64 when they exceed their type bounds.
- Refactored the Int64Promoter logic from flatten_buffer.cc into IndexLegalizer, improving code organization and reusability.
- Updated the ConfigIndexBitwidth pass to apply IndexLegalizer after rewriting the body, enhancing the handling of index bitwidths in transformations.

* [CI] Bind build-test CI to NVIDIA as AMD runners are being introduced (#718)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107

* Rename build-test job to build-test-nvidia and specify nvidia as a runner label in CI workflow.

* Update CI workflow to specify 'nvidia' as an additional runner label for the format-check job.

* fix: NVRTC backend (#717)

* fix: NVRTC backend

* fix: CI

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [CUDA] Init support for sm_120 (#716)

* Init support for sm120

* fmt

* resolve comments

* unify mma gemm

* fmt

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [CI] fix docs ci (#720)

* [Chore] fix typos (#719)

* chore: fix typos

* chore: fix ruff

* chore: fix clang-format

* [CI][AMD] Add AMD GPU CI and fix some related bugs (#694)

* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

* Add input configuration file and update AMD example script for enhanced flexibility

- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.

* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.

* Refactor AMD example script for FlashAttention-2

- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.

* Refactor formatting in AMD FlashAttention example script

- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.

* Update example_amd_flash_attn_fwd.py

* Update AMD FlashAttention example and TVM submodule

- Added a new example script `example_amd_flash_attn_fwd_k_block.py` for FlashAttention with K-blocking support.
- Enhanced `example_amd_flash_attn_fwd.py` by expanding configuration options for block sizes and threads.
- Updated the TVM submodule to the latest commit for improved functionality.
- Introduced a new test script `test.sh` to facilitate running the new example with specified parameters.

* Add CI workflow for automated format checking and testing

- Introduced a new GitHub Actions workflow in `amd_ci.yml` to automate format checks and testing for pull requests.
- The workflow includes steps for setting up a Python environment, running format checks, and executing tests.
- Removed obsolete example script `example_amd_flash_attn_fwd_k_block.py` and test script `test.sh` to streamline the examples directory.

* Rename CI workflow from "CI" to "AMD CI" for clarity and specificity.

* Update AMD CI workflow to include copying PyTorch, TorchVision, and Torchaudio packages to the virtual environment for improved dependency management.

* Update AMD CI workflow to install pytest directly instead of using requirements-test.txt

* Update AMD CI workflow to remove 'flash-attn' from requirements and install dependencies from requirements-test.txt

* Refactor AMD CI workflow to enhance clarity in removing 'flash-attn' from requirements-test.txt before installation

* Remove Torchaudio package copying from AMD CI workflow to streamline dependency management.

* Refactor AMD CI workflow to remove the format-check job and streamline the build-test process by directly copying PyTorch and TorchVision packages to the virtual environment.

* Add installation of ROCm in AMD CI workflow

- Included a step to execute the `install_rocm.sh` script for improved setup.
- Removed unnecessary blank line for better readability in the workflow script.

* Remove installation step for ROCm in AMD CI workflow to simplify the setup process.

* Update AMD CI workflow to run specific test file with verbose output instead of all tests.

* Add new tilelang built-in operations for AMD architecture

- Introduced `tvm_mfma`, `tvm_mfma_store`, `tvm_rdna_wmma`, and `tvm_rdna_wmma_store` built-in operations to enhance support for matrix multiplication and storage in tilelang.
- Each operation is configured with the appropriate number of inputs and marked as opaque in terms of call effects.

* Enhance autotuner configurations and GEMM operations in AMD example

- Updated block sizes and num_split_q parameters in `get_configs` for improved autotuning.
- Modified `T.gemm` calls in `fast_flashattn` to utilize `GemmWarpPolicy.FullRow`, optimizing performance for matrix multiplications.

* Update autotuner configurations in AMD example for enhanced performance

- Refined block sizes, thread counts, and added new parameters in `get_configs` to optimize autotuning.
- Adjusted `fast_flashattn` function to incorporate new parameters for panel size and coalesced widths, improving memory access patterns.

* Enhance autotuner configurations and memory handling in AMD example

- Expanded block sizes and thread counts in `get_configs` for improved autotuning capabilities.
- Updated `fast_flashattn` to utilize a new shared memory allocation strategy, optimizing memory access patterns during GEMM operations.

* Refine autotuner configurations and memory usage in AMD example

- Reduced block sizes and adjusted thread counts in `get_configs` for optimized autotuning.
- Updated `fast_flashattn` to utilize register fragments for accumulation, minimizing LDS usage and enhancing performance during GEMM operations.

* Update autotuner configurations in AMD example for enhanced performance

- Expanded block sizes and thread counts in `get_configs` to improve autotuning capabilities.
- Adjusted `num_split_q` and `v_coalesced_width` parameters for better optimization during GEMM operations.

* Enhance autotuner configurations and GEMM operations in AMD example

- Expanded thread counts in `get_configs` to include higher values for improved autotuning.
- Updated `fast_flashattn` to adjust accumulation logic and ensure proper handling of causal conditions, optimizing performance during matrix multiplications.

* Update AMD CI workflow and remove obsolete test script

- Modified the CI workflow to run on multiple environments: self-hosted, amd, and gpu.
- Deleted the outdated `test.sh` script from the examples directory, streamlining the project structure.

* Remove TVM subproject from 3rdparty directory

* Refactor configuration generation and accumulation logic in AMD example

- Reformatted the `get_configs` function for improved readability by aligning parameters.
- Adjusted the `fast_flashattn` function to enhance clarity in the conditional logic for accumulation, ensuring better handling of causal conditions.

* Enhance AMD CI workflow with additional logging and setup steps

- Added echo statements to provide feedback during the CI process, indicating when the environment is running on an AMD GPU, copying necessary packages, and installing requirements.
- Improved clarity in the workflow by explicitly stating when the project is being installed and when tests are being executed.

* Comment out package copying in AMD CI workflow to prevent potential issues during environment setup

* Update AMD CI workflow to install nightly versions of PyTorch and remove obsolete package copying steps

* Enhance BuildTileLangHIP function by adding whitespace for improved readability

* Refactor kTVMGridConstant definition for clarity and remove unnecessary comment

* Update TVM subproject to latest commit a64a5926a6e59f5417ef2501f9d88b467337cf6a

* lint fix

* Update AMD CI workflow to use requirements-rocm.txt for dependency installation

* fix ci

* Remove dependency on format-check from AMD CI workflow

* fix ci

* fix ci

* fix ci

* Remove format-check job from AMD CI workflow

* Add torch to requirements-rocm.txt and remove explicit pip install commands from AMD CI workflow

* Add dependency on format-check job in AMD CI workflow

* Add format-check job to AMD CI workflow

* Update format-check job in AMD CI workflow to run on self-hosted environment

* Enhance format-check job in AMD CI workflow with improved Python environment setup and automatic commit of lint changes

* Update amd_ci.yml

---------

Co-authored-by: xinxyxiao <xinyxiao@amd.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy (#724)

* [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy

* [Typo] Correct architecture selection for CUDA and CDNA

* [Refactor] Refactor CUDA code generation to simplify eviction policy handling (#721)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107

* Refactor CUDA code generation to simplify eviction policy handling

- Updated `VisitExpr_` methods in `codegen_cuda.cc` to use default eviction policy for `tma_load`, `tma_load_im2col`, and `tma_store` functions, reducing complexity.
- Removed conditional assembly code for `EVICT_NORMAL` in `copy_sm90.h`, streamlining the assembly calls for tensor memory operations.

* lint fix

* [Language] Introduce `StridedTensor` to support non-contiguous torch inputs (#722)

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107

* Support strided tensors

* Refactor target attribute helper functions for improved clarity

* No code changes made in proxy.py and setup.py

* lint fix

* lint fix via gemini

* lint fix

* test fix

* test fix

* lint fix

* Update wrapper.py

* test fix

* Enhance test for InjectSoftwarePipeline by adding LowerOpaqueBlock transformation and updating expected function signature to use match_buffer for better clarity.

* lint fix

---------

Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

* [Enhancement][Bugfix] Fix bug in warp specialized pass and add gemm_sr fallback support for Hopper (#712)

* bug fix and support gemm_sr fallback for hopper

* Update gemm.cc

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* 📝 Add docstrings to `fix` (#726)

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/712#issuecomment-3190680851

The following files were modified:

* `src/op/gemm.cc`
* `src/tl_templates/cuda/gemm_sm90.h`
* `src/transform/warp_specialized_rewriter.cc`

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* [CI] Fix AMD CI (#729)

* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

* Add input configuration file and update AMD example script for enhanced flexibility

- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.

* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.

* Refactor AMD example script for FlashAttention-2

- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.

* Refactor formatting in AMD FlashAttention example script

- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.

* Update example_amd_flash_attn_fwd.py

* Enhance AMD example script and update CI workflows

- Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
- Added new CI workflows for AMD and documentation publishing.
- Updated various requirements files to include necessary dependencies.
- Introduced new test cases and examples for better coverage and functionality.
- Refactored existing code for improved readability and maintainability.

* Remove redundant tool cache cleanup step in AMD CI workflow

* Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.

---------

Co-authored-by: xinxyxiao <xinyxiao@amd.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

* [Feature] Low-bit twiddling dequantization and FP4 GEMM (#725)

* [Dequant] Add bit-twiddling dequantization CUDA kernel for fp4 --> bf16

* [Dequant] Add extern call and serial dequantization

* [Dequant] Parallel dequantization; fence issue pending debug

* [Scale] Add scale matrix to mxfp4 gemm

* [Remove] Remove fence-buggy example and some generated source cuda code

* [MXFP4] Update initial version of MXFP4 GEMM

* [Scale] Add scale to latest mxfp4 gemm

* [Lint]

* [BugFix] Load Scale, disable TMA to recover performance

* [Lint]

* [Lint]

* [Scale] Use L2 to hold Scale and enable TMA will slightly boost performance

* [Lint]

* Update example_dequant_gemm_bf16_fp4_hopper_serial.py

* Remove deprecated dequantization examples for BF16 and MXFP4 in the dequantize_gemm directory.

* Refactor dequantization examples for improved readability and consistency. Adjusted formatting in matmul function and added spacing for clarity. Updated function signatures and comments for better understanding.

* Refactor index_to_coordinates usage in bitnet example and update dequantization example configurations. Removed the custom index_to_coordinates function and replaced it with the built-in version. Adjusted block_K parameter in dequantization example for consistency.
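
For reference, a typical row-major implementation of such an index-to-coordinates helper looks like the following (a sketch of the general idea, not necessarily tilelang's exact built-in):

```python
def index_to_coordinates(index, shape):
    """Unravel a flat row-major index into per-dimension coordinates."""
    coords = []
    for dim in reversed(shape):
        index, rem = divmod(index, dim)
        coords.append(rem)
    return list(reversed(coords))
```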

* lint fix

* ci fix

* Remove non-existent example

* [BugFix] Add smem swizzle to recover performance of TMA

* [BugFix] Enough reg for producer when threads=512

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* 📝 Add docstrings to `mxfp4` (#732)

* 📝 Add docstrings to `mxfp4`

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/725#issuecomment-3191656561

The following files were modified:

* `examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py`
* `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py`
* `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py`
* `examples/dequantize_gemm/utils.py`
* `examples/gemm/example_gemm_autotune.py`
* `tilelang/intrinsics/utils.py`
* `tilelang/language/__init__.py`
* `tilelang/language/utils.py`
* `tilelang/quantize/mxfp.py`
* `tilelang/quantize/quantization.py`

* [Lint] More accurate docstring

* [Lint]

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: tzj-fxz <tzjfxz@gmail.com>

* [Refactor] Refactor env into a more flexible version (#740)

* Fix environment variable name for compilation print setting in `env.py`

* Remove deprecated test file for warp specialized pass configuration and refactor environment variable access in `env.py` to utilize a centralized `EnvVar` class for better management and clarity.

* lint fix

* Refactor cache check to use `env.is_cache_enabled()` for consistency in `tuner.py`

* [Enhancement] Add stride index validation in CythonKernelWrapper (#743)

* Introduced an assertion to ensure that the stride index is within the valid range of tensor dimensions in `cython_wrapper.pyx`.
* This change prevents potential out-of-bounds errors when accessing tensor dimensions, enhancing the robustness of the code.
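
The added guard amounts to a bounds check of the following shape (a hypothetical Python rendering of the Cython assertion, names illustrative):

```python
def assert_stride_index_in_range(ndim: int, stride_index: int) -> None:
    """Reject out-of-range dimension indices before querying tensor strides."""
    assert 0 <= stride_index < ndim, (
        f"stride index {stride_index} out of bounds for a {ndim}-D tensor")
```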

* [Bugfix]:Fix atomic add auto vectorize memory access out of bound error (#742)

* [Bugfix]:Fix atomic add auto vectorize memory access out of bound error

* Update atomicadd_vectorize.cc

* format
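
A common way such out-of-bounds vector accesses are avoided is to split the iteration space into a vector-width-aligned body plus a scalar tail; a generic sketch of that pattern (illustrative only, not the pass's actual logic):

```python
def split_for_vectorize(total: int, vec: int):
    """Split `total` elements into a vector-width-aligned body and a
    scalar tail, so vectorized accesses never index past the buffer."""
    body = (total // vec) * vec
    return body, total - body
```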

* 📝 Add docstrings to PR #744 (#745)

* 📝 Add docstrings to `main`

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/742#issuecomment-3205103559

The following files were modified:

* `src/transform/atomicadd_vectorize.cc`

* lint fix

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Refactor] Refactor barrier management (#744)

* Introduce Barrier

* Enhance CUDA kernel with new barrier management and post-processing support

- Added a new CUDA kernel implementation in `example_mla_decode.py` for improved performance with shared memory barriers.
- Refactored barrier handling in `codegen_cuda.cc` and `codegen_hip.cc` to utilize a more flexible mbarrier structure.
- Updated intrinsic definitions from `ptx_stmatirx` to `ptx_stmatrix` across multiple files for consistency.
- Introduced additional print statements for debugging in the lowering phase of the TileLang engine.
- Enhanced the overall structure and readability of the codebase.

* Remove unused barrier handling code in CUDA and HIP code generators to streamline the implementation. This change enhances code clarity and reduces complexity in the barrier management logic.

* Enhance barrier management in TileLang

- Introduced a new intrinsic `allocate_barrier` for dynamic barrier allocation in the TileLang framework.
- Updated CUDA code generation to support the new barrier structure, allowing for improved synchronization in shared memory.
- Refactored existing barrier handling logic to accommodate the new intrinsic and streamline code.
- Added print statements for debugging purposes in various examples and the lowering phase of the TileLang engine.
- Removed deprecated memory scope handling code to enhance clarity and maintainability.

* lint fix

* lint fix

* Remove `allocate_barrier` intrinsic and related code from TileLang to streamline barrier management. This includes updates to CUDA code generation and the removal of associated Python wrappers, enhancing code clarity and maintainability.

* Refactor logging in JITKernel to improve kernel compilation tracking

- Removed unused import of `torch.backends` in the example file.
- Introduced logging for kernel compilation in `JITKernel`, replacing print statements with structured logging for better traceability and debugging.
- Added an assertion to ensure the presence of the `global_symbol` attribute in the kernel function.

* Refactor dequantization tests and update barrier function

- Removed the test for `example_dequant_gemm_bf16_fp4_hopper_serial` to streamline the testing suite.
- Updated the `mbarrier_cp_async_arrive` function to support both pointer and non-pointer types, enhancing flexibility in barrier management.

* Update CI configuration to increase pytest parallelism from 4 to 8 threads for improved test execution speed.

* Fix typos in rasterization parameters and update import path for cached module

- Corrected the spelling of `enable_rasteration` to `enable_rasterization` in the matmul function and its usage.
- Updated the import statement for the `cached` module to reflect the new path in the cache submodule.
- Added `StridedTensor` import in the language module for enhanced tensor functionality.

* Update ci.yml

* [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy (#746)

* [Refactor] Merge bulk copy into copy and refactor layout inference for bulk copy

* Deleted the `bulk_copy` operator implementation and its header file as it is no longer needed.
* Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping.
* Updated related files to reflect these changes, ensuring that the codebase remains clean and maintainable.

* lint fix

* Fix typos in intrinsic names and remove unused print statement in block_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency.

* remove bulk copy

* Refactor copy and atomic add operations to support TMA lower configuration

- Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations.
- Modified `Lower` method in `Copy` to incorporate the new TMA configuration.
- Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic.
- Removed unused `disable_tma_lower` field from `LowerArgs` structure for clarity.
- Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach.

* Enhance TMA bulk copy logic in `LowerBulkCopy` method

- Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling.
- Updated warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities.

* lint fix

* Remove fallback logging for non-swizzled global layout in `LowerBulkCopy` method to streamline the bulk copy logic. This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions.

* Enhance reshape kernel compilation in `run_reshape` and `run_reshape_smem_1d_2_2d` functions

- Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations.
- Added TODO comments to indicate the need for further improvements in shared memory handling.

* Update `native_sparse_attention` function to include TMA configuration options

- Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations.
- Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1.

* Refactor JIT decorator formatting in `native_sparse_attention` function

- Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase.
- No functional changes were made; this update focuses on code clarity and maintainability.

* Enhance thread management and logging in TileLang compilation

- Added a method to check if printing is enabled during compilation, improving control over logging behavior.
- Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output.
- Added comments to clarify the purpose of changes and improve code readability.

* Add warp specialization scope and refactor register management in TileLang

- Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management.
- Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process.
- Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management.
- Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process.

* Refactor test for InjectSetMaxNReg pass in TileLang

- Improved readability by restructuring conditional checks and assertions in the test cases.
- Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic.
- Ensured consistent formatting and spacing throughout the test functions for better maintainability.

* Enhance bulk copy and store checks in `Copy` class

- Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options.
- Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations.
- Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging.

* lint fix

* [Refactor] Merge ThreadPartialSync and ThreadStorageSync (#741)

* Remove `thread_partial_sync.cc` and refactor `thread_storage_sync.cc` to streamline synchronization handling. Introduce `thread_sync_types.h` for thread-bound key definitions and reserved named barriers. Update related logic in `ThreadSyncInserter` and `TileLangThreadSync` for improved clarity and efficiency.

* Remove `sync_thread_partial` references and related documentation from the codebase. Update CUDA and HIP code generation files to eliminate calls to the removed function. Refactor `__sync_thread_partial` to `sync_thread_partial` in CUDA common header for consistency.

* Remove unused import of `bulk_copy.h` in `codegen_hip.cc` to enhance code clarity and maintainability.

* Add import of `bulk_copy.h` in `codegen_hip.cc` to support new functionality.

* typo fix

* Update data type in reduce_sum tests from float16 to float32 for consistency and clarity. Remove redundant dtype tests and streamline run functions. Enhance reshape kernel compilation with pass configurations to address shared memory layout issues.

* lint fix

* test fix

* Enhance CI configuration by adding verbose output to pip install command for better visibility during installation.

* use ninja instead of make

* Add CMake configuration step for Ninja build system in setup.py

* Update pyproject.toml to include additional build dependencies: build, torch, tox, auditwheel, patchelf, and ninja.

* Enhance CI configuration by adding verbose output to pytest commands for improved test visibility.

* Update pyproject.toml to add Cython as a build dependency. Enhance thread storage synchronization in thread_storage_sync.cc by introducing new thread variable handling and improving index disjointness checks.

* Update data type in cumulative sum tests from float16 to float32 for consistency. Modify run_cumsum function to utilize the updated dtype and enhance result validation with assertions. Adjust test cases accordingly.

* Refactor storage access handling by introducing buffer data mapping in TileLangStorageAccessVisitor. Enhance access entry structure to include pointer access flag. Update thread storage synchronization to accommodate new buffer data mappings. Adjust quickstart example to print kernel source for debugging purposes.

* Refactor linear index conversion in TileLangStorageAccessVisitor to utilize the analyzer for simplification. Update buffer index calculations to ensure consistent simplification of range expressions.

* bugfix

* Refactor buffer index calculation in TileLangStorageAccessVisitor to simplify access handling. Removed unused buffer mapping logic, ensuring consistent buffer index generation with a default ramp.

* Refactor TileLangStorageAccessVisitor to replace buffer indices with buffer ranges for improved pointer access handling. Update AccessEntry structure to include buffer_ranges and adjust thread storage synchronization logic to account for pointer access conflicts.

* Refactor thread storage synchronization to replace 'shared.dyn' with 'shared' for consistency in memory allocation. Update related test cases to reflect this change and ensure proper functionality.

* [Enhancement] Optimize loop body handling in IR (#749)

- Updated the loop body construction in `ir.cc` to conditionally include an output statement based on the analyzable condition of the `waves` variable.
- This change enhances performance by avoiding unnecessary statement wrapping when the condition is met, improving the efficiency of loop execution.

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [MXFP4] Fix bugs and optimize exponential operation (#750)

* [MXFP4] Fix bugs
- Optimize exp2 with shift operation to boost performance
- Fix bug of simple dequantization function call
- Fix bug of scaling factor with bias
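
The shift-based exp2 mentioned above relies on the IEEE-754 float32 layout: for an integer exponent, 2**k can be materialized by writing the biased exponent straight into the exponent bits. A minimal standalone sketch (generic illustration, not the kernel's actual CUDA code):

```python
import struct

def exp2_int(k: int) -> float:
    """Return 2**k for integer k by placing the biased exponent into the
    IEEE-754 float32 exponent field (a shift instead of calling exp2)."""
    assert -126 <= k <= 127, "stay within the normal float32 exponent range"
    bits = (k + 127) << 23              # bias, then shift into bits 23..30
    return struct.unpack('<f', struct.pack('<I', bits))[0]
```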

* [Lint]

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h (#751)

- Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations.
- Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations.

* [Enhancement] Add shape checking for reduce options (#748)

* Add shape checking for reduce options

* lint fix

* Handle special case reducing into shape-1 tensor

Allow reducing [X, d, Y] into [X, Y] or [X, 1, Y]
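
The two accepted output shapes can be illustrated with NumPy (used here only to show the shape semantics, not tilelang's API):

```python
import numpy as np

x = np.ones((2, 5, 3))                    # [X, d, Y]

collapsed = x.sum(axis=1)                 # [X, Y]    -> (2, 3)
kept_dim = x.sum(axis=1, keepdims=True)   # [X, 1, Y] -> (2, 1, 3)

assert collapsed.shape == (2, 3)
assert kept_dim.shape == (2, 1, 3)
```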

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Bugfix] Add missing FP8 header include (#752)

* [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h

- Introduced specialized DispatchInstruction templates for fp8_e4_t and fp8_e5_t types, enhancing support for new data formats in CUDA GEMM operations.
- Each specialization defines the corresponding MMA and MMA_Group types, optimizing performance for specific configurations.

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Enhancement] Include cuda_fp8.h in gemm_sm90.h

- Added the inclusion of the "cuda_fp8.h" header file to support new data formats in CUDA GEMM operations, enhancing compatibility with recent updates for fp8 types.

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* lint fix

* [Refactor] Remove unused tl_shuffle_elect and related functions from common.h

- Deleted the `tl_shuffle_elect` function and its associated comments to streamline the codebase.
- Added inclusion of "intrin.h" for improved intrinsic support in CUDA operations.
- Cleaned up the file by removing unnecessary template parameters and functions, enhancing clarity and maintainability.

* lint fix

* [Refactor] Update header inclusions in common.h and gemm_sm90.h

- Removed the inclusion of "intrin.h" from common.h to streamline dependencies.
- Added "intrin.h" inclusion in gemm_sm90.h to ensure intrinsic support for CUDA operations, enhancing functionality and maintainability.

* bug fix

* [MXFP4] Add bias to MXFP4 GEMM kernel (#753)

* [MXFP4] Add bias to gemm kernel

* [Lint]

* [Lint] Rename "bias" to "Bias"

* [Bugfix][WS] Consider loop min extent when computing phase id (#754)

* Update test parameters and remove debug print statement

- Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution.
- Removed a debug print statement from `phase.py` to clean up the code and enhance clarity.

* Refactor loop stack management in warp_specialized_rewriter

- Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability.
- Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability.
- Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations.
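
The phase-id fix amounts to normalizing each loop variable by its loop's min before linearizing; a hypothetical sketch of that calculation (names are illustrative, not the pass's actual code):

```python
def linearize(loops, values):
    """Flatten nested loop variables into one linear index, subtracting
    each loop's min extent before accumulating (loops: [(min, extent)])."""
    idx = 0
    for (lo, extent), v in zip(loops, values):
        idx = idx * extent + (v - lo)
    return idx
```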

* [Typo] Remove `disable_cache` in some tests (#755)

* Update test parameters and remove debug print statement

- Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution.
- Removed a debug print statement from `phase.py` to clean up the code and enhance clarity.

* Refactor loop stack management in warp_specialized_rewriter

- Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability.
- Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability.
- Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations.

* Remove unused `torch.backends` import and `tilelang.disable_cache()` calls from multiple test files to enhance code clarity and maintainability.

* [README] Update GDN README for clarity and add acknowledgements (#758)

- Improved formatting and clarity of the GDN kernel implementation description.
- Updated requirement section to list dependencies in a clearer format.
- Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions.

* cutlass v4.2.0 supporting cuda 13 (#760)

* [Feature] Add 1D TMA support (#761)

* [Feature] Add 1D TMA support
- Check the contiguity conditions for 1D TMA copies
- Add a new interface and parameter order for the `tma_load` and `tma_store` calls
- Add 1D `tma_store` interface in sm90 template
- Add elementwise kernel for 1D TMA example

* [Lint]

* [BugFix] Add conditions for 1D TMA copy on non-swizzled shared tensors

* [Lint]

* [BugFix] 1D TMA load

* [README] Update GDN README for clarity and add acknowledgements (#758)

- Improved formatting and clarity of the GDN kernel implementation description.
- Updated requirement section to list dependencies in a clearer format.
- Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions.

* cutlass v4.2.0 supporting cuda 13 (#760)

* [Lint]

* [Lint]

* [MXFP4] Add test for bf16&mxfp4 gemm

* [BugFix]

* [Lint]

---------

Co-authored-by: Yu Cheng <54519279+chengyupku@users.noreply.github.com>
Co-authored-by: Johnny <johnnync13@gmail.com>

* [Example] Add vertical slash sparse attention pattern (#762)

* upd sparse attn

* lint

* rename

* update test file

* update benchmark

* lint

* update benchmark

* [Bugfix] Address PassContext contamination from CI and fix incorrect rewrites in warp specialized pass (#767)

* fix ci and pass bug

* fix

* try

* lint

* [MXFP4] Add 1D TMA copy for Scale tensor in MXFP4 GEMM (#766)

* [TMA] Add 1D TMA copy for Scale tensor

* [Lint]

* [Test] Add test for kernel

* [BugFix]

* hot fix blackwell (#768)

* [Refactor] Refactor `Operator` into `TileOperator` and with tvm reflection (#763)

* Refactor operator classes to inherit from TileOperator and update layout inference methods

- Changed base class of several operator classes (AtomicAdd, Copy, Gemm, etc.) from Operator to TileOperator for better alignment with tile operations.
- Updated InferLayout and Lower methods to use 'override' specifier for clarity and consistency.
- Adjusted header inclusions to replace "op.h" with "operator.h" across multiple files for improved organization.
- Added missing layout inference implementations for Fill and Conv2DIm2ColOp.
- Removed deprecated op.cc and op.h files to streamline the codebase.

* lint fix

* Refactor operator classes to use Node pattern and improve memory management

- Updated several operator classes (AtomicAdd, Copy, Gemm, etc.) to utilize the Node pattern for better memory management and encapsulation.
- Changed constructors to initialize member variables through a node object, enhancing clarity and reducing direct member access.
- Updated Clone methods to return TileOperator instances instead of unique pointers, aligning with the new design.
- Refactored InferLayout and Lower methods to ensure consistency across operator implementations.
- Adjusted header files to reflect the new class structure and removed deprecated code for a cleaner codebase.

* Enhance Clone methods in AtomicAdd and Copy classes to support parallel operation cloning

- Updated the Clone methods in AtomicAddNode and CopyNode to ensure that the parallel operation (par_op_) is properly cloned when defined, improving the integrity of cloned objects.
- Refactored the FillNode class to use ParallelOp directly instead of std::make_unique, streamlining the creation of parallel operations.
- Made minor adjustments in layout inference and other related methods for consistency and clarity.

* Refactor FillNode::Lower method to remove unused global function call

- Eliminated the call to the global function "tl.fill.lower" in the FillNode::Lower method, streamlining the code and improving clarity.
- Retained the core functionality of the method while enhancing maintainability by reducing unnecessary dependencies.

* [Reducer] Introduce `alloc_reducer` to separate inter and intra warp reduction (#757)

* [Enhancement] Introduce finalize_reducer operator and layout reducer support

- Added `FinalizeReducer` operator to handle reduction finalization in the TileLang framework, allowing for efficient reduction operations.
- Implemented layout inference for local.reducer buffers, enhancing the handling of layout mappings and reducing complexity in buffer management.
- Updated `setup.py` to include logging for build directory paths, improving build process visibility.
- Enhanced atomic operations with new functions for atomic max, min, load, and store, providing more robust atomicity control in memory operations.
- Refactored parallel loop handling to incorporate reducer information, ensuring proper management of reduction operations in parallel contexts.
- Cleaned up test cases by removing unnecessary cache disabling and optimizing test parameters for better performance.

* Refactor code formatting and improve readability in multiple files

- Cleaned up whitespace in `setup.py` to enhance logging clarity.
- Reformatted `AtomicMax` and `AtomicMin` functions in `common.h` for better alignment and readability.
- Adjusted `debug_print_var` function in `debug.h` to improve code structure and maintainability.
- Enhanced readability of the `atomic_add` function in `customize.py` by breaking long lines for better clarity.

* Remove debug print statements from `copy.cc` and `inject_tma_barrier.cc` to enhance code clarity and maintainability.

* [Enhancement] Disable reuse of small arrays in shared memory allocation

- Added logic to prevent the reuse of small arrays (<= 32 bits) in `merge_shared_memory_allocations.cc`, ensuring they are lowered to registers in LLVM for improved performance and memory management.

* Refactor `setup.py` to remove duplicate logging statements and enhance clarity. Update `finalize_reducer` function documentation in `reduce.py` to include detailed parameter and return descriptions, improving code readability and maintainability.

* Refactor `finalize_reducer` and `reduce` functions to remove redundant target checks. Simplified conditionals by retaining only the `TargetIsHopper` check, enhancing code clarity and maintainability.

* bug fix

* Add thread checks workaround for replicated cases

* Remove the is_one check

* fix lint error

* lint fix

* Update autotune tests to use smaller matrix sizes for improved performance and reliability

* [Refactor] Update FinalizeReducer to FinalizeReducerOp and adjust related methods

- Refactored FinalizeReducer class to FinalizeReducerOp, updating constructor and method signatures for consistency with the new TileOperator structure.
- Enhanced layout inference and cloning methods in FinalizeReducerOpNode.
- Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main.
- Adjusted header inclusions for improved organization and clarity across multiple files.

* [Refactor] Update atomic operations in common.h and modify test_example_flash_attention.py

- Enhanced atomic operations (Add, Min, Max) in common.h to handle half and bfloat16 types more efficiently.
- Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main, improving test organization.

* [Refactor] Simplify CopyNode::LowerBulkCopy logic and update test execution

- Removed redundant checks for contiguous memory access in CopyNode::LowerBulkCopy, streamlining the logic for TMA copy operations.
- Updated test_tilelang_kernel_gemm.py to comment out the main testing function and call a specific test for i8i8i32 tensor operations instead, improving test focus.

---------

Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com>
Co-authored-by: Freebase6912 <amid-gauze-racing@duck.com>

* 📝 Add docstrings to `pytile_0826` (#770)

* 📝 Add docstrings to `pytile_0826`

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/763#issuecomment-3224197814

The following files were modified:

* `src/op/atomic_add.cc`
* `src/op/atomic_add.h`
* `src/op/copy.cc`
* `src/op/copy.h`
* `src/op/elem.cc`
* `src/op/elem.h`
* `src/op/gemm.cc`
* `src/op/gemm.h`
* `src/op/gemm_sp.cc`
* `src/op/gemm_sp.h`
* `src/op/operator.cc`
* `src/op/operator.h`
* `src/op/parallel.cc`
* `src/op/parallel.h`
* `src/op/reduce.cc`
* `src/op/reduce.h`
* `src/op/region.cc`
* `src/op/region.h`
* `src/transform/layout_inference.cc`
* `src/transform/lower_tile_op.cc`

* lint fix

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Bugfix]:Fix atomic add auto vectorize negative optimization (#765)

* [Bugfix]:Fix atomic add auto vectorize negative optimization

* fixbug

* format

* fix bug

* 📝 Add docstrings to `reducer_0825` (#772)

* 📝 Add docstrings to `reducer_0825`

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/757#issuecomment-3219088118

The following files were modified:

* `setup.py`
* `src/op/builtin.h`
* `src/op/finalize_reducer.cc`
* `src/op/finalize_reducer.h`
* `src/op/parallel.cc`
* `src/op/parallel.h`
* `src/op/reduce.cc`
* `src/target/codegen_cuda.cc`
* `src/tl_templates/cuda/common.h`
* `src/transform/layout_inference.cc`
* `src/transform/layout_reducer.cc`
* `src/transform/layout_reducer.h`
* `src/transform/merge_shared_memory_allocations.cc`
* `src/transform/storage_access.cc`
* `src/transform/warp_specialized_rewriter.cc`
* `testing/python/autotune/test_tilelang_autotune_with_inputs.py`
* `tilelang/engine/phase.py`
* `tilelang/language/customize.py`
* `tilelang/language/reduce.py`
* `tilelang/transform/__init__.py`

* lint fix

* lint fix

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* Allow fill global buffer (#774)

* Allow fill global buffer

* fix lint error

* [BugFix] Refactor the op check in LowerTileOp pass using the member function instead of string match (#771)

* [BugFix] Refactor the op check in LowerTileOp pass using the member function instead of string match

* [Lint]

* add bf16 exp fallback (#776)

* [Lint] Introduce clang-tidy into format.sh (#777)

* [Refactor] Update Clang-Tidy Checks and Improve Code Consistency

- Enhanced .clang-tidy configuration by adding specific checks for better bug detection and performance optimization.
- Refactored function signatures across multiple files to use `const` references for parameters, improving performance and code clarity.
- Updated various methods to ensure consistent handling of parameters, particularly in `AddPredicate`, `Substitute`, and `PlanLoopPartition` functions.
- Improved readability by replacing size checks with `empty()` method calls in several locations, ensuring clearer intent in the code.
- General code cleanup and adherence to best practices for better maintainability.

* [Refactor] Enhance Code Consistency and Clang-Tidy Configuration

- Updated .clang-tidy configuration to include additional checks for improved code quality and performance.
- Refactored function signatures across multiple files to use `const` references, enhancing performance and clarity.
- Replaced size checks with `empty()` method calls in various locations for clearer intent.
- Improved handling of parameters in several functions, ensuring consistent usage of `std::move` where applicable.
- General code cleanup to adhere to best practices and improve maintainability.

* [Refactor] Integrate Clang-Tidy Checks and Enhance Code Consistency

- Added clang-tidy checks to the format script for improved code quality assurance.
- Refactored function signatures across multiple files to consistently use `const` references, enhancing performance and clarity.
- Updated the requirements-lint.txt file to include clang-tidy as a dependency.
- General code cleanup to adhere to best practices and improve maintainability.

* [CI] Update AMD CI Workflow to Include Build Directory Creation

- Added steps to create a build directory and configure CMake with ROCm support during the format check process.
- Ensured cleanup of the build directory after the format check to maintain a clean workspace.

* [Refactor] Remove Unused Member Variables in AtomicAddNode and CopyNode

- Removed the `args_` member variable from both `AtomicAddNode` and `CopyNode` classes to streamline the code and eliminate unnecessary data members.
- This change enhances code clarity and maintainability by focusing on relevant attributes for each class.

* [Refactor] Update Clang-Tidy Integration and Code Improvements

- Modified the format script to include the `-fix` option in the clang-tidy command for automatic code fixes.
- Refactored the `AtomicAddVectorizePlanner` class to improve variable handling and consistency, including changes to member variable types and function signatures.
- Enhanced code clarity by removing unnecessary `std::move` calls and ensuring consistent usage of types across the class.
- General code cleanup to adhere to best practices and improve maintainability.

* [Refactor] Improve Parameter Handling and Consistency in AtomicAddVectorize

- Updated function signatures in `AtomicAddVectorizePlanResult` and `AtomicAddVectorizeRewriter` to use `const` references and `std::move` for better performance and clarity.
- Enhanced the `UpdateVectorSize` method to accept `const Array<PrimExpr>&` for improved efficiency.
- General code cleanup to maintain consistency and adhere to best practices.

* [CI] Add Git Submodule Initialization to CI Workflow

- Included a step to initialize and update git submodules recursively in the CI workflow.
- This change ensures that all necessary submodules are available during the format check process, improving build reliability.

* [CI] Add Git Submodule Update Step to Format Check

- Included a command to initialize and update git submodules recursively in the CI workflow during the format check process.
- This enhancement ensures that all required submodules are available, contributing to improved build reliability.

* [Refactor] Update Function Signatures in AtomicAddVectorize

- Modified the `VectorizeAtomicAdd` function signature to use `const` references for `thread_var` and `thread_bounds`, enhancing performance and code clarity.
- This change aligns with previous refactoring efforts to improve parameter handling and consistency across the codebase.

* [Cache] Introduce detailed target information for the disk kernel cache (#780)

* Fix type hint for target_host parameter in compile function to allow None value

* Refactor target handling in compile function to utilize determine_target for improved clarity and consistency

* Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code.

* Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity.

* Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling.

* Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits.

* [Example]Adds example for top-k operation (#775)

* [Example]Adds example for top-k operation

Adds an example demonstrating the top-k operation using tilelang

* format

* Adds topk tilelang example test

* fix lint

* [Math] Dispatch `T.rsqrt(x)` into cuda intrin instead of `1 / T.sqrt(x)` (#781)

* Fix type hint for target_host parameter in compile function to allow None value

* Refactor target handling in compile function to utilize determine_target for improved clarity and consistency

* Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code.

* Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity.

* Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling.

* Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits.

* Add intrin_rule source files to CMakeLists.txt and implement hrsqrt function for half_t in common.h

* lint fix

* remove cmake dep in pyproject as it may lead to different cmake paths in diff stages

* lint fix

* Add cmake dependency to pyproject.toml and improve build logging in setup.py

* [CI] Adds pytest-durations for test timing (#782)

* [Ci] Adds pytest-durations for test timing

Adds `pytest-durations` to the test requirements and configures pytest to display test durations.

This helps in identifying slow-running tests and optimizing the test suite for faster feedback.

* add amd ci durations

* Removes flash_attn installation from CI

* [Refactor] Support python reflection for tile operators (#783)

* Implement Fill operator and related reflection methods in TileLang

- Added Fill operator implementation in `fill.cc` and `fill.h` for element-wise filling of buffers.
- Introduced reflection methods for Fill, AtomicAdd, Copy, Conv2DIm2Col, FinalizeReducer, Gemm, and Parallel operators to enhance introspection capabilities.
- Updated relevant files to register reflection methods and ensure proper initialization in static blocks.
- Removed outdated comments and unnecessary code in various operator files to improve clarity and maintainability.
- Added new Python bindings for the Fill operator in `tilelang/ir/fill.py` and updated the module imports accordingly.

* Refactor operator reflection methods and improve code clarity

- Updated reflection methods for AtomicAdd, Copy, FinalizeReducer, Gemm, and Parallel operators to enhance readability by using `empty()` instead of size checks.
- Consolidated static initialization blocks for various operators to a single line for improved consistency.
- Cleaned up whitespace and formatting in multiple files to adhere to coding standards and improve maintainability.
- Added new Python bindings for operators in the `tilelang/ir` module, ensuring proper registration and organization of imports.

* Refactor GEMM and AtomicAdd operations for improved clarity

- Updated the `GetArchInt` function in `atomic_add.cc` to use `std::string` and `std::stoi` for better readability and type safety.
- Removed unnecessary variables and comments in `gemm_sp.cc` and `gemm.cc` to streamline the `ComputeWarpPartition` method.
- Cleaned up the `layout_reducer.cc` file by removing unused variable declarations, enhancing code clarity.
- Added import for the `ir` module in `tilelang/__init__.py` to ensure proper organization of module imports.

* Remove deprecated operator files from the tilelang IR module

- Deleted files for Fill, AtomicAdd, Copy, Gemm, GemmSP, FinalizeReducer, Parallel, Reduce, and Region operators to streamline the codebase.
- This cleanup enhances maintainability by removing unused code and improving overall organization of the module.

* Refactor imports in tilelang IR module for improved organization

- Updated import statements in `tilelang/ir.py` to reflect changes in the TVM library structure, enhancing clarity and maintainability of the codebase.

* lint fix

* Refactor GEMM and GEMM-SP operations to enhance clarity and maintainability

- Updated the `Gemm` and `GemmSP` classes to utilize a new `GemmWarpPolicy` object for warp partitioning, improving encapsulation and readability.
- Removed deprecated `ComputeWarpPartition` methods and replaced them with calls to the new policy object, streamlining the code.
- Cleaned up comments and unnecessary code in `gemm.cc`, `gemm_sp.cc`, and related header files to enhance overall clarity.
- Introduced a new `GemmWarpPolicyNode` class to manage warp policy attributes and methods, facilitating better organization of related functionalities.
- Updated reflection methods to include the new policy structure, ensuring proper registration and introspection capabilities.

* Refactor Reduce operation to utilize ReduceType class for improved clarity and maintainability

- Replaced multiple conditional checks for reduce types with a single ReduceType object, simplifying the code structure.
- Introduced a new ReduceTypeNode class to encapsulate reduce type logic and methods, enhancing organization.
- Updated MakeInitValue, MakeReduce, and Lower methods to leverage the new ReduceType class, improving readability.
- Added Python bindings for the ReduceType class in tilelang IR module to ensure proper registration and usability.

* comment

* Refactor operator header files for improved readability

- Cleaned up formatting and whitespace in `atomic_add.h`, `copy.h`, `fill.h`, `reduce.cc`, and `reduce.h` to enhance code clarity.
- Consolidated comments and adjusted line breaks for better organization and maintainability across multiple operator definitions.

* Refactor MakeReduce method in ReduceOpNode for clarity

- Updated the parameter name in the MakeReduce method from `rhs` to `b` and assigned it to `rhs` for improved readability.
- This change enhances the clarity of the method's purpose and aligns with the overall refactoring efforts in the Reduce operation.

* Update Reduce operation type checks for consistency

- Changed string comparisons for reduce types in the MakeReduce method from "abs_sum" to "abssum" and "abs_max" to "absmax" for uniformity.
- This adjustment enhances the clarity and consistency of the reduce type handling in the codebase.

* [AMD] Fix amd tir&add examples (#784)

* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

* Add input configuration file and update AMD example script for enhanced flexibility

- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.

* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.

* Refactor AMD example script for FlashAttention-2

- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.

* Refactor formatting in AMD FlashAttention example script

- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.

* Update example_amd_flash_attn_fwd.py

* Enhance AMD example script and update CI workflows

- Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
- Added new CI workflows for AMD and documentation publishing.
- Updated various requirements files to include necessary dependencies.
- Introduced new test cases and examples for better coverage and functionality.
- Refactored existing code for improved readability and maintainability.

* Remove redundant tool cache cleanup step in AMD CI workflow

* Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.

* Add new AMD FlashAttention example and test script

- Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
- Added `test.sh` script to facilitate running the new example with specified parameters.
- Enhanced the overall structure and organization of the example for better clarity and usability.

* Update configurations in `example_amd_flash_attn_fwd.py` for autotuner

- Reduced the number of threads and `num_split_q` options for improved performance.
- Adjusted `panel_size` options to streamline configuration settings.

* Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217

* Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c

* Add example for AMD Flash Attention backward pass implementation

- Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
- Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
- Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
- Included reference implementation for validation against PyTorch's attention mechanism.

This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.

* Enhance AMD Flash Attention example with additional testing capabilities

- Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
- Improved the main function to allow for better parameter configuration and benchmarking.
- Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.

This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.

* Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a

* Refactor HIP intrinsic rules to CUDA

- Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
- Adjusted include paths for better organization and clarity in the code structure.

* Update AMD CI workflow to uninstall specific PyTorch packages before installation

- Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
- Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.

* Remove unused shared memory allocations in AMD Flash Attention backward example

- Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
- This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.

* Remove unnecessary pip uninstall command from AMD CI workflow

- Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
- This change simplifies the CI process and reduces potential overhead during package management.

* Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules

- Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
- Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.

* Refactor formatting of HIP intrinsic rule registrations

- Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
- No functional changes were made; this update focuses on code style improvements to enhance maintainability.

* Update file na…
