Skip to content

Conversation

@SiriusNEO
Copy link
Contributor

@SiriusNEO SiriusNEO commented Oct 16, 2025

Current LegalizeSafeMemoryAccess Pass always rewrite the value of outermost BufferStore stmt, which may introduce some strange wrong logic (See #1013). This PR refactors the Pass so that it substitute the value recursively for each Load/Store.

Summary by CodeRabbit

  • Refactor

    • Renamed annotation API from annotate_padding to annotate_safe_value and renamed public annotation key to "safe_value_map".
    • Memory-safety legalization now uses safe-value mappings and consistently wraps loads/stores and extern calls with conditional boundary checks that supply safe replacement values when accesses are out-of-bounds.
    • Public constant name updated to reflect safe-value terminology.
  • Tests

    • Added tests for boundary-check legalization, atomic-add/vectorized access paths, and out-of-bounds store handling.

@github-actions
Copy link

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 16, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This PR renames the "padding map" concept to "safe value map", updates APIs and annotations accordingly, and refactors legalization to collect boundary conditions and wrap loads, stores, and extern calls with conditional IfThenElse chains that use GetSafeValue lookups for out-of-bounds cases.

Changes

Cohort / File(s) Summary
Core transform refactor
src/transform/legalize_safe_memory_access.cc
Adds recursively_collect_conds flag to GlobalMemChecker; renames annotated_padding_map_annotated_safe_value_map_; introduces GetSafeValue; applies non-recursive checks for Loads and recursive checks for Stores/CallExtern; wraps loads/stores/extern-eval in IfThenElse chains and switches to safe-value replacements.
Public constant rename
src/op/builtin.h
Replaces kPaddingMap ("padding_map") with kSafeValueMap ("safe_value_map") in tvm::tl::attr.
Lowering update
src/transform/lower_tile_op.cc
RemapBufferRewriter and related logic now read/write SafeValueMap annotations instead of PaddingMap; local variable names and annotation Set/Get updated to safe_value_map.
TileLang API rename
tilelang/language/__init__.py
Renames annotate_paddingannotate_safe_value; changes parameter name and returned block_attr key from padding_mapsafe_value_map; updates docstring/example and internal variable names.
Language test update
testing/python/language/test_tilelang_language_annotate_safe_value.py
Replaces calls to T.annotate_padding(...) with T.annotate_safe_value(...) and adjusts annotation dict keys accordingly.
Transform tests added/updated
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py
Updates dtype usage in a vectorize test; adds helpers and tests for safe-value legalization (issue_1013_buggy_kernel, vectorize_access_with_atmoic_add_legalize, assert_vectorize_access_with_atmoic_add) and new tests (test_issue_1013, test_vectorize_access_with_atmoic_add, test_oob_store).

Sequence Diagram(s)

sequenceDiagram
    participant Load as BufferLoadNode
    participant Checker as GlobalMemChecker
    participant Rewriter as SafeMemorysRewriter
    participant Safe as GetSafeValue

    Load->>Checker: Collect boundary conditions (non-recursive)
    Checker-->>Rewriter: Return conditions
    alt Any OOB condition
        Rewriter->>Safe: Lookup safe value from annotated_safe_value_map_
        Safe-->>Rewriter: Safe value
        Rewriter->>Rewriter: Build nested IfThenElse chain using conditions
        Rewriter-->>Load: Replace load with conditional GetSafeValue choices
    else All in-bounds
        Rewriter-->>Load: Keep original load
    end
Loading
sequenceDiagram
    participant Store as BufferStoreNode
    participant Checker as GlobalMemChecker
    participant Rewriter as SafeMemorysRewriter

    Store->>Checker: Collect boundary conditions (recursive)
    Checker-->>Rewriter: Return conditions
    alt Conditions exist
        Rewriter->>Rewriter: Wrap store in nested IfThenElse chain across conditions
        Rewriter-->>Store: Conditional store execution (use safe-value fallbacks where applicable)
    else No conditions
        Rewriter-->>Store: Emit direct store
    end
Loading
sequenceDiagram
    participant Extern as CallExtern
    participant Checker as GlobalMemChecker
    participant Rewriter as SafeMemorysRewriter

    Extern->>Checker: Collect boundary conditions (recursive)
    Checker-->>Rewriter: Return conditions
    alt Conditions exist
        Rewriter->>Rewriter: Wrap extern evaluation in IfThenElse chain using conditions
        Rewriter-->>Extern: Conditional extern call / safe-value fallbacks
    else No conditions
        Rewriter-->>Extern: Direct extern call
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

Poem

🐇 I nibble bits where buffers roam,
I swap old pads for safe-value home,
When bounds wander far from sight,
GetSafeValue keeps outputs right,
Hop—conditionally safe tonight.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 43.75% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The PR title "[Refactor] Refactor Pass LegalizeSafeMemoryAccess to support recursive load/store rewrite" accurately captures the main objective of the changeset. The changes substantiate this claim through: (1) modifications to GlobalMemChecker that introduce a parameter to control recursive condition collection, (2) per-load and per-store safe-value application logic in SafeMemorysRewriter instead of only rewriting the outermost BufferStore, and (3) new test cases (test_issue_1013, test_vectorize_access_with_atmoic_add, test_oob_store) that validate the recursive rewrite functionality. While the changeset also includes broader nomenclature shifts from "padding" to "safe value" across multiple files, the title appropriately focuses on the primary technical change without attempting to enumerate every refactoring detail.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b7eea81 and f821d0d.

📒 Files selected for processing (2)
  • src/transform/legalize_safe_memory_access.cc (8 hunks)
  • tilelang/language/__init__.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tilelang/language/__init__.py (3)
src/transform/lower_tile_op.cc (4)
  • safe_value_map (227-238)
  • safe_value_map (227-228)
  • buffer (318-326)
  • buffer (318-318)
src/transform/legalize_safe_memory_access.cc (12)
  • buffer (87-95)
  • buffer (87-87)
  • buffer (98-138)
  • buffer (98-99)
  • buffer (256-260)
  • buffer (256-256)
  • buffer (262-265)
  • buffer (262-262)
  • buffer (267-270)
  • buffer (267-267)
  • buffer (272-277)
  • buffer (272-272)
tilelang/language/ast/ir.py (1)
  • block_attr (430-438)
src/transform/legalize_safe_memory_access.cc (1)
src/transform/lower_tile_op.cc (22)
  • op (96-102)
  • op (96-96)
  • op (117-151)
  • op (117-117)
  • op (181-186)
  • op (181-181)
  • op (193-207)
  • op (193-193)
  • op (282-316)
  • op (282-282)
  • op (485-548)
  • op (485-485)
  • op (550-569)
  • op (550-550)
  • op (571-587)
  • op (571-571)
  • buffer (318-326)
  • buffer (318-318)
  • buffer (447-464)
  • buffer (447-447)
  • buffer (466-483)
  • buffer (466-466)
🪛 Clang (14.0.6)
src/transform/legalize_safe_memory_access.cc

[error] 61-61: constructor does not initialize these fields: conditions, analyzer

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 64-64: method 'VisitExpr_' can be made const

(readability-make-member-function-const,-warnings-as-errors)


[error] 64-64: parameter name 'op' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 72-72: variable 'op' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 72-72: variable name 'op' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 76-76: method 'VisitStmt_' can be made const

(readability-make-member-function-const,-warnings-as-errors)


[error] 76-76: parameter name 'op' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 82-82: variable 'op' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 82-82: variable name 'op' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 148-148: constructor does not initialize these fields: analyzer_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 152-152: member variable '' has public visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)

🪛 Ruff (0.14.0)
tilelang/language/__init__.py

184-184: block_attr may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
🔇 Additional comments (7)
src/transform/legalize_safe_memory_access.cc (6)

61-63: LGTM: Recursion control for condition collection.

The addition of the recursively_collect_conds parameter to GlobalMemChecker is a well-designed solution that enables precise control over when to recurse into child nodes. This distinction is crucial for:

  • Load/Store nodes: Non-recursive checking (since the rewriter visits children separately).
  • CallExtern nodes: Recursive checking (since we can't rewrite BufferLoads inside extern calls).

The conditional invocation of base visitors based on this flag is implemented correctly.

Also applies to: 71-73, 81-83, 145-145


152-155: LGTM: Cleaner abstraction with GetSafeValue.

The rename from annotated_padding_map to annotated_safe_value_map improves semantic clarity. The new GetSafeValue helper method provides a clean abstraction for retrieving safe values, with a sensible fallback to make_zero for buffers without explicit annotations.

Also applies to: 271-277, 279-279


158-180: LGTM: Core fix for recursive load rewriting.

This is the key change that addresses issue #1013. Each BufferLoad is now individually checked and wrapped with if_then_else using GetSafeValue for out-of-bounds cases, rather than only rewriting the outermost BufferStore. The use of a non-recursive checker is correct since the rewriter recursively visits child nodes. The ICHECK for boolean conditions adds a good defensive validation.


182-214: LGTM: Consistent per-store handling.

The refactored store handling is now consistent with load handling:

  • Uses a non-recursive checker (the rewriter visits children).
  • Directly wraps all boundary conditions with IfThenElse.
  • Preserves the special-case warning for stores with IfThenElse values (manual boundary checks).

The simplified logic improves maintainability.


221-254: LGTM: Appropriate handling of extern calls.

The CallExtern handling correctly uses a recursive checker (recursively_collect_conds=true) because BufferLoad nodes inside extern calls cannot be rewritten (rewriting would cause exceptions). The approach of collecting all boundary conditions from call arguments and applying them to the outer Evaluate statement is pragmatic and, as the comment notes, handles most common scenarios like atomicAdd. The explicit documentation of this design choice is helpful.


311-311: LGTM: Consistent updates in SafeMemoryLegalizer.

All references to the renamed annotated_safe_value_map_ are consistent throughout SafeMemoryLegalizer. The annotation lookup from BlockNode correctly uses attr::kSafeValueMap and properly populates the map for use by SafeMemorysRewriter.

Also applies to: 339-350, 361-361

tilelang/language/__init__.py (1)

149-184: LGTM: Complete and consistent API rename.

The function rename from annotate_padding to annotate_safe_value is thorough and improves semantic clarity:

  • Function name, parameter names, and internal variables all updated consistently.
  • Docstring clearly explains the "safe value" concept (value used when buffer is accessed out of bounds).
  • Example usage demonstrates the annotation in context.
  • Return statement correctly uses "safe_value_map" key, matching attr::kSafeValueMap in the C++ implementation.

The past review comment about the docstring has been addressed (line 156 correctly says "safe value").


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/transform/legalize_safe_memory_access.cc (1)

103-112: Do not skip constant-index bounds checks

Skipping indices without variables can miss out-of-bounds constant accesses. Let the analyzer prove constants too, and only add guards when not provable.

Apply this diff to remove the early-continue:

-      bool has_variable = false;
-      PostOrderVisit(index, [&](const ObjectRef &obj) {
-        if (const VarNode *v = obj.as<VarNode>()) {
-          has_variable = true;
-        }
-      });
-      if (!has_variable) {
-        // If index is a constant, we can skip the check
-        continue;
-      }
🧹 Nitpick comments (7)
src/transform/legalize_safe_memory_access.cc (2)

148-168: Load rewrite logic LGTM; consider merging conditions

The recursive load rewrite is correct and composes guards as conjunction. You can reduce ITE nesting by combining conditions first (e.g., using tir::And/tir::call_intrin("tir.bool.and") or a fold-all helper), then a single if_then_else.


229-241: Comment clarifications

Minor nits:

  • Typo: “Recursively check Load/Store…” comment precedes Evaluate-only handling.
  • “not the most rigorous solution” is fine; consider briefly stating why you avoid mutating call_extern args (to gate side effects instead of rewriting them).
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (5)

41-70: Good corner-case coverage; minor dtype/style nits

Looks good. Two optional cleanups:

  • Use T.int32 in alloc_var if that’s the intended width for count to match the cast and avoid width ambiguity.
  • If Ruff flags dtype names, consider x: T.Tensor((num_tokens,), dtype="int64") for consistency with other tests, though the current annotation is fine in TileLang.

72-103: Fix typo and silence unused kernel indices

  • Rename “atmoic” → “atomic” in function names.
  • _bx, _by instead of bx, by to silence unused-variable warnings.
-def vectorize_access_with_atmoic_add_legalize(M: int = 64,
+def vectorize_access_with_atomic_add_legalize(M: int = 64,
@@
-    def main(A: T.Tensor((M, N), dtype="float32"),):
-        with T.Kernel(1, 1, threads=M) as (bx, by):
+    def main(A: T.Tensor((M, N), dtype="float32"),):
+        with T.Kernel(1, 1, threads=M) as (_bx, _by):
@@
-    def expected(A: T.Tensor((M, N), dtype="float32"),):
-        with T.Kernel(1, 1, threads=M) as (bx, by):
+    def expected(A: T.Tensor((M, N), dtype="float32"),):
+        with T.Kernel(1, 1, threads=M) as (_bx, _by):

106-111: Keep names consistent after rename

Update the assertion helper name to match the typo fix.

-def assert_vectorize_access_with_atmoic_add(M: int = 64, N: int = 64):
-    func, expected = vectorize_access_with_atmoic_add_legalize(M, N)
+def assert_vectorize_access_with_atomic_add(M: int = 64, N: int = 64):
+    func, expected = vectorize_access_with_atomic_add_legalize(M, N)

117-122: Test reads well; consider adding a negative-size probe later

Test matches the intended guarded semantics for issue 1013. Optionally add a case where the loop bounds guarantee safety to ensure the pass does not introduce guards (non-regression).


124-126: Update test call after rename

Adjust to the new helper name.

-def test_vectorize_access_with_atmoic_add():
-    assert_vectorize_access_with_atmoic_add(64, 64)
+def test_vectorize_access_with_atomic_add():
+    assert_vectorize_access_with_atomic_add(64, 64)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e3742d3 and 3899678.

📒 Files selected for processing (2)
  • src/transform/legalize_safe_memory_access.cc (4 hunks)
  • testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/transform/legalize_safe_memory_access.cc (3)
src/transform/layout_inference.cc (16)
  • op (42-48)
  • op (42-42)
  • op (327-387)
  • op (327-327)
  • op (417-441)
  • op (417-417)
  • op (443-460)
  • op (443-443)
  • op (462-471)
  • op (462-462)
  • op (657-669)
  • op (657-657)
  • op (706-783)
  • op (706-706)
  • op (785-794)
  • op (785-785)
src/op/parallel.cc (2)
  • VisitExpr_ (162-173)
  • VisitExpr_ (162-162)
src/transform/warp_specialized_rewriter.cc (2)
  • call (31-36)
  • call (31-31)
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (4)
tilelang/language/__init__.py (1)
  • symbolic (85-96)
tilelang/language/tir/op.py (2)
  • if_then_else (2906-2936)
  • call_extern (172-194)
tilelang/language/ast/ir.py (1)
  • reads (390-407)
src/transform/legalize_safe_memory_access.cc (2)
  • LegalizeSafeMemoryAccess (375-388)
  • LegalizeSafeMemoryAccess (375-375)
🪛 Ruff (0.14.0)
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py

48-48: Undefined name int64

(F821)


60-60: Undefined name int64

(F821)


80-80: Unpacked variable bx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


80-80: Unpacked variable by is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


89-89: Unpacked variable bx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


89-89: Unpacked variable by is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🔇 Additional comments (1)
src/transform/legalize_safe_memory_access.cc (1)

242-260: Gating extern calls looks correct; verify broader coverage

Guarding call_extern with conjunction of conditions is sound to prevent side effects on OOB. If other extern forms (e.g., call_pure_extern for some backends) appear, consider including them, or document scope is limited to call_extern.

Would you like me to scan the repo for other extern call forms and propose an exhaustive matcher?

@LeiWang1999 LeiWang1999 self-requested a review October 16, 2025 15:18
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (2)
src/transform/legalize_safe_memory_access.cc (2)

170-227: Consider reducing code duplication in buffer scope handling.

The code for shared and local buffers (lines 203-221) is identical. This duplication can be reduced by consolidating the logic.

Apply this diff to consolidate the shared and local buffer handling:

     auto value = store->value;
     if (IsGlobalBuffer(store->buffer)) {
       Stmt store_with_conditions = store;
       for (auto cond : conditions) {
         store_with_conditions = IfThenElse(cond, store_with_conditions);
       }
       return store_with_conditions;
-    } else if (isSharedBuffer(store->buffer)) {
+    } else if (isSharedBuffer(store->buffer) || IsLocalBuffer(store->buffer)) {
       PrimExpr value = store->value;
       for (auto cond : conditions) {
         ICHECK(cond.dtype() == DataType::Bool(1))
             << "condition is not a boolean: " << cond;
         value = if_then_else(cond, value, GetPadding(store->buffer));
       }
       store.CopyOnWrite()->value = value;
       return store;
-    } else if (IsLocalBuffer(store->buffer)) {
-      PrimExpr value = store->value;
-      for (auto cond : conditions) {
-        ICHECK(cond.dtype() == DataType::Bool(1))
-            << "condition is not a boolean: " << cond;
-        value = if_then_else(cond, value, GetPadding(store->buffer));
-      }
-      store.CopyOnWrite()->value = value;
-      return store;
     } else {
       LOG(FATAL) << "Check store buffer: " << store->buffer
                  << " is not a global or shared or local buffer";
     }

234-240: Clarify the NOTE about extern call boundary checking.

The NOTE provides valuable context about the simplified approach for handling extern functions. However, it could benefit from a concrete example of what "not entirely precise" means and what edge cases might not be covered.

Consider enhancing the comment with a specific example:

 // NOTE(chaofan): This is currently not the most rigorous solution.
 // The check here is primarily intended to handle extern functions like
 // atomicAdd, which may involve memory access. Due to their special nature,
 // the BufferLoad in their parameters might be used for boundary checks of the
 // current statement. The current solution adopts a simplified approach:
 // directly applying the boundary constraints of all parameters to the
 // statement. While not entirely precise, it addresses most common scenarios.
+// Example edge case: if an extern function has multiple buffer parameters
+// with independent boundary conditions, this approach may overly restrict
+// the statement by requiring all conditions to be satisfied simultaneously,
+// even if only one buffer is actually accessed in a given execution path.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3899678 and 631b2c2.

📒 Files selected for processing (2)
  • src/transform/legalize_safe_memory_access.cc (4 hunks)
  • testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/transform/legalize_safe_memory_access.cc (3)
src/transform/atomicadd_vectorize.cc (2)
  • VisitExpr_ (77-97)
  • VisitExpr_ (77-77)
src/op/parallel.cc (2)
  • VisitExpr_ (162-173)
  • VisitExpr_ (162-162)
tilelang/language/tir/op.py (1)
  • if_then_else (2906-2936)
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (5)
tilelang/language/__init__.py (1)
  • symbolic (85-96)
tilelang/language/tir/op.py (2)
  • if_then_else (2906-2936)
  • call_extern (172-194)
tilelang/language/allocate.py (1)
  • alloc_shared (21-36)
tilelang/language/ast/ir.py (1)
  • reads (390-407)
src/transform/legalize_safe_memory_access.cc (2)
  • LegalizeSafeMemoryAccess (375-388)
  • LegalizeSafeMemoryAccess (375-375)
🪛 Ruff (0.14.0)
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py

1-1: Unused noqa directive (non-enabled: SIM102)

Remove unused noqa directive

(RUF100)


82-82: Unpacked variable bx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


82-82: Unpacked variable by is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


91-91: Unpacked variable bx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


91-91: Unpacked variable by is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🔇 Additional comments (5)
src/transform/legalize_safe_memory_access.cc (3)

63-75: LGTM! Improved boundary checking with explicit buffer scope validation.

The refactored VisitExpr_ and VisitStmt_ methods now explicitly check if buffers are in global scope before performing boundary checks. This aligns with the PR objective to handle load/store operations individually rather than only at the outermost level.


148-168: LGTM! Recursive boundary checking for BufferLoad nodes.

The new VisitExpr_ for BufferLoadNode implements the core refactoring objective: it applies boundary checks recursively to each load operation and wraps out-of-bounds accesses with if_then_else using padding values. This is a significant improvement over the previous approach of only rewriting outermost stores.


241-264: LGTM! Appropriate boundary checking for extern calls.

The implementation correctly detects boundary conditions for extern calls (like atomicAdd) and wraps them with IfThenElse statements. The approach is pragmatic and addresses the common use cases mentioned in the NOTE.

testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (2)

43-71: LGTM! Comprehensive test for issue #1013.

The test correctly exercises the corner case where boundary checks are needed for symbolic tensor sizes. The expected kernel properly validates that out-of-bounds accesses are guarded with if_then_else statements. The NOTE comment (lines 58-60) provides useful context about current prover limitations.


109-128: LGTM! Well-structured test helpers and test functions.

The test functions assert_vectorize_access_with_atmoic_add, test_issue_1013, and test_vectorize_access_with_atmoic_add are well-implemented and provide good coverage for the refactored pass. They correctly validate the transformed IR against expected results.

@SiriusNEO
Copy link
Contributor Author

@codex review

1 similar comment
@LeiWang1999
Copy link
Member

@codex review

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (1)

72-76: Fix unused bx/by, remove unused noqa, and correct typo “atmoic” → “atomic”.

  • bx/by are never used; prefer underscore-prefixed.
  • The inline # noqa: SIM102 is unused per Ruff.
  • Fix function name typo for clarity/consistency; update callsites accordingly.

As per static analysis hints.

Apply these diffs:

-def vectorize_access_with_atmoic_add_legalize(M: int = 64,
+def vectorize_access_with_atomic_add_legalize(M: int = 64,
                                               N: int = 64,
                                               M_offset: int = 2,
                                               N_offset: int = 2):
-        with T.Kernel(1, 1, threads=M) as (bx, by):
+        with T.Kernel(1, 1, threads=M) as (_bx, _by):
-        with T.Kernel(1, 1, threads=M) as (bx, by):
+        with T.Kernel(1, 1, threads=M) as (_bx, _by):
-                    if tid + 2 < 64:  # noqa: SIM102
+                    if tid + 2 < 64:

Also applies to: 80-80, 89-89, 101-101

🧹 Nitpick comments (2)
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (2)

107-112: Propagate “atomic” rename in helper and its call.

-def assert_vectorize_access_with_atmoic_add(M: int = 64, N: int = 64):
-    func, expected = vectorize_access_with_atmoic_add_legalize(M, N)
+def assert_vectorize_access_with_atomic_add(M: int = 64, N: int = 64):
+    func, expected = vectorize_access_with_atomic_add_legalize(M, N)

118-127: Propagate “atomic” rename in test and its call.

-def test_vectorize_access_with_atmoic_add():
-    assert_vectorize_access_with_atmoic_add(64, 64)
+def test_vectorize_access_with_atomic_add():
+    assert_vectorize_access_with_atomic_add(64, 64)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c85ae9a and 930f914.

📒 Files selected for processing (1)
  • testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (3)
tilelang/language/__init__.py (1)
  • symbolic (85-96)
tilelang/language/tir/op.py (2)
  • if_then_else (2906-2936)
  • call_extern (172-194)
tilelang/language/ast/ir.py (1)
  • reads (390-407)
🪛 Ruff (0.14.0)
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py

80-80: Unpacked variable bx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


80-80: Unpacked variable by is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


89-89: Unpacked variable bx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


89-89: Unpacked variable by is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


101-101: Unused noqa directive (non-enabled: SIM102)

Remove unused noqa directive

(RUF100)

🔇 Additional comments (1)
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (1)

41-70: Issue 1013 kernel test: LGTM

The boundary-check rewrite and explicit cast look correct for the corner case you’re asserting.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (2)

80-80: Prefix unused variables with underscore.

The variables bx and by are unpacked from the Kernel context but never used. Static analysis correctly flags these as unused variables.

Also applies to: 89-89


100-100: Remove unused noqa directive.

The noqa: SIM102 directive is unnecessary as SIM102 is not enabled in the project's ruff configuration.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 930f914 and aaa26a0.

📒 Files selected for processing (1)
  • testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (4)
tilelang/language/__init__.py (1)
  • symbolic (85-96)
tilelang/language/tir/op.py (2)
  • if_then_else (2906-2936)
  • call_extern (172-194)
tilelang/language/ast/ir.py (1)
  • reads (390-407)
src/transform/legalize_safe_memory_access.cc (2)
  • LegalizeSafeMemoryAccess (388-401)
  • LegalizeSafeMemoryAccess (388-388)
🪛 Ruff (0.14.0)
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py

80-80: Unpacked variable bx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


80-80: Unpacked variable by is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


89-89: Unpacked variable bx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


89-89: Unpacked variable by is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


100-100: Unused noqa directive (non-enabled: SIM102)

Remove unused noqa directive

(RUF100)

🔇 Additional comments (2)
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (2)

41-69: LGTM! Good test case for symbolic boundary checking.

This test effectively validates that the transformation correctly handles boundary checks with symbolic parameters. The expected kernel appropriately wraps the load in an if_then_else with a padding value of 0, and the cast to int32 for the boolean comparison result is correct.


107-127: LGTM! Test functions follow proper pattern.

The helper and test functions correctly:

  • Construct IRModules from the test kernels
  • Apply the LegalizeSafeMemoryAccess transformation
  • Assert structural equality between transformed and expected IR

These additions properly exercise the boundary checking and atomic operation handling in the refactored pass.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/transform/lower_tile_op.cc (1)

227-237: Consider renaming RemapPaddingMap function.

While the parameter and variable names have been updated to use safe_value_map, the function name RemapPaddingMap still references the old terminology.

Apply this diff to align the function name with the updated terminology:

-  Map<Var, PrimExpr> RemapPaddingMap(const Map<Var, PrimExpr> &safe_value_map,
+  Map<Var, PrimExpr> RemapSafeValueMap(const Map<Var, PrimExpr> &safe_value_map,
                                      const Map<Var, Var> &var_remap) const {

Also update the call site at line 200:

-    Map<Var, PrimExpr> new_safe_value_map = RemapPaddingMap(
+    Map<Var, PrimExpr> new_safe_value_map = RemapSafeValueMap(
         Downcast<Map<Var, PrimExpr>>(safe_value_map.value()), var_remap);
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7cc1a66 and e219fc7.

📒 Files selected for processing (6)
  • src/op/builtin.h (1 hunks)
  • src/transform/legalize_safe_memory_access.cc (8 hunks)
  • src/transform/lower_tile_op.cc (3 hunks)
  • testing/python/language/test_tilelang_language_annotate_safe_value.py (1 hunks)
  • tilelang/engine/phase.py (1 hunks)
  • tilelang/language/__init__.py (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
PR: tile-ai/tilelang#794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/lower_tile_op.cc
🧬 Code graph analysis (4)
testing/python/language/test_tilelang_language_annotate_safe_value.py (1)
tilelang/language/__init__.py (1)
  • annotate_safe_value (149-184)
src/transform/legalize_safe_memory_access.cc (1)
src/transform/lower_tile_op.cc (22)
  • op (96-102)
  • op (96-96)
  • op (117-151)
  • op (117-117)
  • op (181-186)
  • op (181-181)
  • op (193-207)
  • op (193-193)
  • op (282-316)
  • op (282-282)
  • op (485-548)
  • op (485-485)
  • op (550-569)
  • op (550-550)
  • op (571-587)
  • op (571-571)
  • buffer (318-326)
  • buffer (318-318)
  • buffer (447-464)
  • buffer (447-447)
  • buffer (466-483)
  • buffer (466-466)
src/transform/lower_tile_op.cc (1)
src/transform/legalize_safe_memory_access.cc (10)
  • op (33-46)
  • op (33-33)
  • op (64-74)
  • op (64-64)
  • op (76-84)
  • op (76-76)
  • op (158-180)
  • op (158-158)
  • op (182-215)
  • op (182-182)
tilelang/language/__init__.py (3)
src/transform/lower_tile_op.cc (4)
  • safe_value_map (227-238)
  • safe_value_map (227-228)
  • buffer (318-326)
  • buffer (318-318)
src/transform/legalize_safe_memory_access.cc (12)
  • buffer (87-95)
  • buffer (87-87)
  • buffer (98-138)
  • buffer (98-99)
  • buffer (257-261)
  • buffer (257-257)
  • buffer (263-266)
  • buffer (263-263)
  • buffer (268-271)
  • buffer (268-268)
  • buffer (273-278)
  • buffer (273-273)
tilelang/language/ast/ir.py (1)
  • block_attr (430-438)
🪛 Clang (14.0.6)
src/transform/legalize_safe_memory_access.cc

[error] 61-61: constructor does not initialize these fields: conditions, analyzer

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 64-64: method 'VisitExpr_' can be made const

(readability-make-member-function-const,-warnings-as-errors)


[error] 64-64: parameter name 'op' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 72-72: variable 'op' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 72-72: variable name 'op' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 76-76: method 'VisitStmt_' can be made const

(readability-make-member-function-const,-warnings-as-errors)


[error] 76-76: parameter name 'op' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 82-82: variable 'op' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 82-82: variable name 'op' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 148-148: constructor does not initialize these fields: analyzer_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 152-152: member variable '' has public visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)

src/transform/lower_tile_op.cc

[error] 199-199: variable 'var_remap' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 200-200: variable 'new_safe_value_map' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 227-227: 2 adjacent parameters of 'RemapPaddingMap' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 229-229: variable 'new_safe_value_map' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)

🪛 Ruff (0.14.0)
tilelang/language/__init__.py

184-184: block_attr may be undefined, or defined from star imports

(F405)

🔇 Additional comments (11)
src/op/builtin.h (1)

25-25: LGTM!

The constant rename from kPaddingMap to kSafeValueMap aligns with the PR's refactoring objectives and maintains consistency across the codebase.

testing/python/language/test_tilelang_language_annotate_safe_value.py (1)

20-20: LGTM!

The API update from annotate_padding to annotate_safe_value and the key change from A_shared to A correctly reflect the new semantics where safe values are annotated on the source buffer being accessed.

src/transform/lower_tile_op.cc (1)

182-205: LGTM!

The renaming from kPaddingMap to kSafeValueMap throughout the block rewriting logic is consistent and aligns with the broader refactoring.

tilelang/language/__init__.py (1)

149-184: LGTM!

The API rename from annotate_padding to annotate_safe_value is well-documented and correctly implemented. The docstring clearly explains the new semantics, and the example demonstrates proper usage.

Note: The Ruff warning about block_attr being undefined (F405) is a false positive due to the star import from .tir.ir at line 14.

src/transform/legalize_safe_memory_access.cc (7)

53-63: LGTM!

The addition of the recursively_collect_conds parameter to GlobalMemChecker enables fine-grained control over condition collection, which is essential for the recursive load/store rewrite approach.


64-84: LGTM!

The conditional recursion based on recursively_collect_conds_ correctly implements the dual-mode behavior needed for different scenarios:

  • Non-recursive for direct Load/Store rewriting (where the rewriter handles recursion)
  • Recursive for extern calls (where we need all nested conditions upfront)

182-215: LGTM!

The BufferStore rewriting correctly wraps stores with IfThenElse statements to skip out-of-bounds writes. The warning when an IfThenElse value is detected (lines 191-201) appropriately alerts users to potential manual boundary check conflicts.


272-280: LGTM!

The rename from GetPadding to GetSafeValue and the updated logic correctly retrieve safe values from annotated_safe_value_map_, defaulting to zero when no annotation exists.


336-352: LGTM!

The BlockNode visitor correctly processes kSafeValueMap annotations and populates annotated_safe_value_map_ for use by the rewriter.


217-255: The recursive condition collection for extern calls is working as intended and properly validated.

The implementation correctly recursively collects boundary conditions from all buffer accesses within call_extern arguments. The test at lines 102–108 confirms this: a single T.call_extern("handle", "AtomicAdd", A[tid + M_offset, j + N_offset], 1) generates nested if-guards wrapping both dimension checks (j + N_offset < N and tid + M_offset < M).

The conservative approach is explicitly acknowledged in the code comments (lines 222–228) as "a simplified approach" that "addresses most common scenarios." This is the correct design choice: for safety-critical memory operations, it is preferable to over-guard than to miss a boundary check. There is no practical scenario where partial condition application would be more precise without sacrificing correctness.


158-180: BufferLoad rewriting logic is correct and well-tested.

All verification checks pass:

  1. Condition ordering — The loop nesting produces correct logical precedence. Conditions are collected as upper_bound then lower_bound, creating nested if_then_else that short-circuits correctly: if(upper && lower) load else safe.

  2. Safe value retrievalGetSafeValue() handles all buffer types via make_zero(buffer->dtype) with support for annotated safe values. Tests cover float32 and int64 types.

  3. Non-recursive checker — The design is sound: SafeMemorysRewriter inherits from StmtExprMutator for recursive visitation, while GlobalMemChecker with recursively_collect_conds=false checks only the current node to avoid double-wrapping. The comment justifies this.

Tests demonstrate the expected nested if_then_else structure and the transformation handles multiple conditions correctly.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e219fc7 and 80f53cd.

📒 Files selected for processing (1)
  • tilelang/language/__init__.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/language/__init__.py (3)
src/transform/lower_tile_op.cc (4)
  • safe_value_map (227-238)
  • safe_value_map (227-228)
  • buffer (318-326)
  • buffer (318-318)
src/transform/legalize_safe_memory_access.cc (12)
  • buffer (87-95)
  • buffer (87-87)
  • buffer (98-138)
  • buffer (98-99)
  • buffer (257-261)
  • buffer (257-257)
  • buffer (263-266)
  • buffer (263-263)
  • buffer (268-271)
  • buffer (268-268)
  • buffer (273-278)
  • buffer (273-273)
tilelang/language/ast/ir.py (1)
  • block_attr (430-438)
🪛 Ruff (0.14.0)
tilelang/language/__init__.py

184-184: block_attr may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
🔇 Additional comments (1)
tilelang/language/__init__.py (1)

149-184: All call sites have been successfully updated to use the new function name.

The codebase contains zero references to annotate_padding, confirming the rename is complete. The new annotate_safe_value function is properly defined and actively used in tests (testing/python/language/test_tilelang_language_annotate_safe_value.py). The implementation is correct and the migration is complete.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
tilelang/language/__init__.py (1)

156-156: Docstring inconsistency remains unresolved.

The parameter description still references "padding value" instead of "safe value", which is inconsistent with the function rename.

🧹 Nitpick comments (2)
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (2)

12-12: Prefix unused Kernel context variables with underscores.

The variables bx and by are unpacked from the Kernel context but never used. Following Python conventions, prefix them with underscores to indicate they are intentionally unused.

Apply this diff to fix the unused variables:

-        with T.Kernel(1, 1, threads=M) as (bx, by):
+        with T.Kernel(1, 1, threads=M) as (_bx, _by):

Apply this pattern to all occurrences in the file (lines 12, 20, 80, 89, 119, 126).

Also applies to: 20-20, 80-80, 89-89, 119-119, 126-126


100-100: Remove unused noqa directives.

The # noqa: SIM102 directives are unused if the SIM102 check is not enabled in the project's ruff configuration.

Apply this diff:

-                if j + N_offset < N:  # noqa: SIM102
+                if j + N_offset < N:

Apply this to both occurrences (lines 100 and 130).

Also applies to: 130-130

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 80f53cd and b7eea81.

📒 Files selected for processing (3)
  • src/transform/legalize_safe_memory_access.cc (8 hunks)
  • testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (2 hunks)
  • tilelang/language/__init__.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/transform/legalize_safe_memory_access.cc (1)
src/transform/lower_tile_op.cc (22)
  • op (96-102)
  • op (96-96)
  • op (117-151)
  • op (117-117)
  • op (181-186)
  • op (181-181)
  • op (193-207)
  • op (193-193)
  • op (282-316)
  • op (282-282)
  • op (485-548)
  • op (485-485)
  • op (550-569)
  • op (550-550)
  • op (571-587)
  • op (571-571)
  • buffer (318-326)
  • buffer (318-318)
  • buffer (447-464)
  • buffer (447-447)
  • buffer (466-483)
  • buffer (466-466)
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (3)
tilelang/language/tir/op.py (2)
  • if_then_else (2906-2936)
  • call_extern (172-194)
tilelang/language/ast/ir.py (2)
  • reads (390-407)
  • writes (410-427)
src/transform/legalize_safe_memory_access.cc (2)
  • LegalizeSafeMemoryAccess (365-378)
  • LegalizeSafeMemoryAccess (365-365)
tilelang/language/__init__.py (3)
src/transform/lower_tile_op.cc (4)
  • safe_value_map (227-238)
  • safe_value_map (227-228)
  • buffer (318-326)
  • buffer (318-318)
src/transform/legalize_safe_memory_access.cc (12)
  • buffer (87-95)
  • buffer (87-87)
  • buffer (98-138)
  • buffer (98-99)
  • buffer (256-260)
  • buffer (256-256)
  • buffer (262-265)
  • buffer (262-262)
  • buffer (267-270)
  • buffer (267-267)
  • buffer (272-277)
  • buffer (272-272)
tilelang/language/ast/ir.py (1)
  • block_attr (430-438)
🪛 Clang (14.0.6)
src/transform/legalize_safe_memory_access.cc

[error] 61-61: constructor does not initialize these fields: conditions, analyzer

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 64-64: method 'VisitExpr_' can be made const

(readability-make-member-function-const,-warnings-as-errors)


[error] 64-64: parameter name 'op' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 72-72: variable 'op' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 72-72: variable name 'op' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 76-76: method 'VisitStmt_' can be made const

(readability-make-member-function-const,-warnings-as-errors)


[error] 76-76: parameter name 'op' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 82-82: variable 'op' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 82-82: variable name 'op' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 148-148: constructor does not initialize these fields: analyzer_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 152-152: member variable '' has public visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)

🪛 Ruff (0.14.0)
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py

12-12: Unpacked variable bx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


12-12: Unpacked variable by is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


80-80: Unpacked variable bx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


80-80: Unpacked variable by is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


89-89: Unpacked variable bx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


89-89: Unpacked variable by is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


100-100: Unused noqa directive (non-enabled: SIM102)

Remove unused noqa directive

(RUF100)


119-119: Unpacked variable bx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


119-119: Unpacked variable by is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


126-126: Unpacked variable bx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


126-126: Unpacked variable by is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


130-130: Unused noqa directive (non-enabled: SIM102)

Remove unused noqa directive

(RUF100)

tilelang/language/__init__.py

184-184: block_attr may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
🔇 Additional comments (5)
src/transform/legalize_safe_memory_access.cc (4)

158-180: LGTM: Recursive boundary checking for loads.

The refactored VisitExpr_ for BufferLoad correctly:

  1. Collects boundary conditions non-recursively (since the rewriter will visit children)
  2. Wraps the load with nested if_then_else using GetSafeValue as the fallback
  3. Returns the original load when no boundary violations are detected

This addresses the issue described in PR #1013 by performing value substitution recursively for each load.


182-214: LGTM: Conditional store for out-of-bounds protection.

The BufferStore handling correctly wraps the store statement with nested IfThenElse to prevent out-of-bounds writes. The special handling for stores with existing IfThenElse values (lines 191-202) appropriately warns about potential conflicts while respecting manual boundary checks.


228-254: LGTM: Extern call boundary checking.

The EvaluateNode handling for call_extern correctly uses recursive condition collection (recursively_collect_conds=true) since BufferLoad nodes in extern call arguments cannot be safely rewritten. The approach of collecting all boundary conditions and wrapping the entire evaluate statement is appropriate for this use case.


271-277: LGTM: Safe value resolution.

The GetSafeValue method correctly fetches annotated safe values from the map or defaults to zero for the buffer's dtype, aligning with the renamed API.

testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (1)

41-69: Excellent test coverage for the refactoring.

The new test cases comprehensively validate the refactored safe memory access legalization:

  1. test_issue_1013() - validates the fix for the issue described in #1013
  2. test_vectorize_access_with_atmoic_add() - tests atomic operations with boundary checks
  3. test_oob_store() - validates out-of-bounds store protection

The expected kernels correctly show nested if_then_else for loads and nested IfThenElse statements for stores and extern calls, aligning with the new recursive rewriting logic.

Also applies to: 72-104, 114-134, 148-160

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/transform/legalize_safe_memory_access.cc (1)

216-254: Non-extern T.evaluate expressions are no longer rewritten — potential OOB loads.

Overriding VisitStmt_(EvaluateNode) and returning the original node for non-extern calls bypasses child rewrites, so BufferLoad inside T.evaluate(...) won’t be guarded. Restore base mutation for non-extern cases.

   Stmt VisitStmt_(const EvaluateNode *op) final {
-    auto evaluate = Downcast<Evaluate>(op);
+    auto evaluate = Downcast<Evaluate>(op);
 
     if (const CallNode *call_op = op->value.as<CallNode>()) {
       auto call = Downcast<Call>(op->value);
       if (call->op == builtin::call_extern()) {
@@
         return evaluate_with_conditions;
       }
     }
-
-    return evaluate;
+    // For non-extern calls or plain Evaluate(expr), allow recursive rewriting
+    // so BufferLoad inside Evaluate still gets legalized.
+    return StmtExprMutator::VisitStmt_(op);
   }
🧹 Nitpick comments (5)
tilelang/language/__init__.py (1)

149-185: API rename looks good; add type-safety and explicit import.

  • Enforce safe_value dtype compatibility with buffer dtype (cast if needed) to avoid type errors at use sites.
  • Silence Ruff F405 by explicitly importing block_attr.

Apply outside-range imports:

@@
-from .tir.ir import *  # noqa: F401
+from .tir.ir import *  # noqa: F401
+from .ast.ir import block_attr  # explicit import for clarity

Optionally clarify dtype in docstring:

-    # safe_value_map is a dictionary of buffer to safe value
+    # safe_value_map is a dictionary of buffer to safe value (dtype-compatible with the buffer)
src/transform/legalize_safe_memory_access.cc (1)

271-277: Cast annotated safe values to buffer dtype to ensure type consistency.

If an annotated safe value’s dtype differs from the buffer’s dtype, cast it to avoid type mismatch during rewriting.

   // Get the safe value of the buffer
   PrimExpr GetSafeValue(const Buffer &buffer) {
     if (annotated_safe_value_map_.count(buffer)) {
-      return annotated_safe_value_map_[buffer];
+      PrimExpr v = annotated_safe_value_map_[buffer];
+      if (v.dtype() != buffer->dtype) {
+        v = cast(buffer->dtype, v);
+      }
+      return v;
     }
     return make_zero(buffer->dtype);
   }
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (3)

10-21: Silence unused unpacked vars in Kernel contexts.

Prefix bx/by with underscores.

-    def main(A: T.Tensor((M, N), dtype=dtype),):
-        with T.Kernel(1, 1, threads=M) as (bx, by):
+    def main(A: T.Tensor((M, N), dtype=dtype),):
+        with T.Kernel(1, 1, threads=M) as (_bx, _by):
@@
-    def expected(A: T.Tensor((M, N), dtype=dtype),):
-        with T.Kernel(1, 1, threads=M) as (bx, by):
+    def expected(A: T.Tensor((M, N), dtype=dtype),):
+        with T.Kernel(1, 1, threads=M) as (_bx, _by):

72-112: Fix typo “atmoic” → “atomic” and silence unused vars/noqa.

  • Rename helper/assert/test to “atomic”.
  • Prefix bx/by with underscores.
  • Remove unused SIM102 noqa markers (the rule isn’t enabled).
-def vectorize_access_with_atmoic_add_legalize(M: int = 64,
+def vectorize_access_with_atomic_add_legalize(M: int = 64,
                                               N: int = 64,
                                               M_offset: int = 2,
                                               N_offset: int = 2):
@@
-    def main(A: T.Tensor((M, N), dtype=dtype),):
-        with T.Kernel(1, 1, threads=M) as (bx, by):
+    def main(A: T.Tensor((M, N), dtype=dtype),):
+        with T.Kernel(1, 1, threads=M) as (_bx, _by):
@@
-    def expected(A: T.Tensor((M, N), dtype=dtype),):
-        with T.Kernel(1, 1, threads=M) as (bx, by):
+    def expected(A: T.Tensor((M, N), dtype=dtype),):
+        with T.Kernel(1, 1, threads=M) as (_bx, _by):
@@
-                # Nest if-then-else is expected, do not flatten it to pass structural equal check
-                if j + N_offset < N:  # noqa: SIM102
+                # Nest if-then-else is expected, do not flatten it to pass structural equal check
+                if j + N_offset < N:
                     if tid + M_offset < M:
                         T.call_extern("handle", "AtomicAdd", A[tid + M_offset, j + N_offset], 1)
@@
-def assert_vectorize_access_with_atmoic_add(M: int = 64, N: int = 64):
-    func, expected = vectorize_access_with_atmoic_add_legalize(M, N)
+def assert_vectorize_access_with_atomic_add(M: int = 64, N: int = 64):
+    func, expected = vectorize_access_with_atomic_add_legalize(M, N)

And update the test entry point:

-def test_vectorize_access_with_atmoic_add():
-    assert_vectorize_access_with_atmoic_add(64, 64)
+def test_vectorize_access_with_atomic_add():
+    assert_vectorize_access_with_atomic_add(64, 64)

114-134: Silence unused unpacked vars and remove unused noqa.

     def main(A: T.Tensor((M, N), dtype=dtype),):
-        with T.Kernel(1, 1, threads=M) as (bx, by):
+        with T.Kernel(1, 1, threads=M) as (_bx, _by):
@@
     def expected(A: T.Tensor((M, N), dtype=dtype),):
-        with T.Kernel(1, 1, threads=M) as (bx, by):
+        with T.Kernel(1, 1, threads=M) as (_bx, _by):
@@
-                if j + N_offset < N:  # noqa: SIM102
+                if j + N_offset < N:
                     if tid + M_offset < M:
                         A[tid + M_offset, j + N_offset] = T.float32(1.0)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 80f53cd and b7eea81.

📒 Files selected for processing (3)
  • src/transform/legalize_safe_memory_access.cc (8 hunks)
  • testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (2 hunks)
  • tilelang/language/__init__.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tilelang/language/__init__.py (3)
src/transform/lower_tile_op.cc (4)
  • safe_value_map (227-238)
  • safe_value_map (227-228)
  • buffer (318-326)
  • buffer (318-318)
src/transform/legalize_safe_memory_access.cc (12)
  • buffer (87-95)
  • buffer (87-87)
  • buffer (98-138)
  • buffer (98-99)
  • buffer (256-260)
  • buffer (256-256)
  • buffer (262-265)
  • buffer (262-262)
  • buffer (267-270)
  • buffer (267-267)
  • buffer (272-277)
  • buffer (272-272)
tilelang/language/ast/ir.py (1)
  • block_attr (430-438)
src/transform/legalize_safe_memory_access.cc (1)
src/transform/lower_tile_op.cc (22)
  • op (96-102)
  • op (96-96)
  • op (117-151)
  • op (117-117)
  • op (181-186)
  • op (181-181)
  • op (193-207)
  • op (193-193)
  • op (282-316)
  • op (282-282)
  • op (485-548)
  • op (485-485)
  • op (550-569)
  • op (550-550)
  • op (571-587)
  • op (571-571)
  • buffer (318-326)
  • buffer (318-318)
  • buffer (447-464)
  • buffer (447-447)
  • buffer (466-483)
  • buffer (466-466)
testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py (4)
tilelang/language/__init__.py (1)
  • symbolic (85-96)
tilelang/language/tir/op.py (2)
  • if_then_else (2906-2936)
  • call_extern (172-194)
tilelang/language/ast/ir.py (2)
  • reads (390-407)
  • writes (410-427)
src/transform/legalize_safe_memory_access.cc (2)
  • LegalizeSafeMemoryAccess (365-378)
  • LegalizeSafeMemoryAccess (365-365)
🪛 Clang (14.0.6)
src/transform/legalize_safe_memory_access.cc

[error] 61-61: constructor does not initialize these fields: conditions, analyzer

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 64-64: method 'VisitExpr_' can be made const

(readability-make-member-function-const,-warnings-as-errors)


[error] 64-64: parameter name 'op' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 72-72: variable 'op' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 72-72: variable name 'op' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 76-76: method 'VisitStmt_' can be made const

(readability-make-member-function-const,-warnings-as-errors)


[error] 76-76: parameter name 'op' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 82-82: variable 'op' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 82-82: variable name 'op' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 148-148: constructor does not initialize these fields: analyzer_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 152-152: member variable '' has public visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)

🪛 Ruff (0.14.0)
tilelang/language/__init__.py

184-184: block_attr may be undefined, or defined from star imports

(F405)

testing/python/transform/test_tilelang_transform_legalize_safe_memory_access.py

12-12: Unpacked variable bx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


12-12: Unpacked variable by is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


80-80: Unpacked variable bx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


80-80: Unpacked variable by is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


89-89: Unpacked variable bx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


89-89: Unpacked variable by is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


100-100: Unused noqa directive (non-enabled: SIM102)

Remove unused noqa directive

(RUF100)


119-119: Unpacked variable bx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


119-119: Unpacked variable by is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


126-126: Unpacked variable bx is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


126-126: Unpacked variable by is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


130-130: Unused noqa directive (non-enabled: SIM102)

Remove unused noqa directive

(RUF100)

@LeiWang1999 LeiWang1999 merged commit 7211164 into tile-ai:main Oct 17, 2025
6 checks passed
RubiaCx added a commit to RubiaCx/tilelang that referenced this pull request Oct 20, 2025
commit b2acfc3
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Sun Oct 19 22:08:13 2025 +0800

    [Benchmark] Add matmul FP16 benchmark results (tile-ai#1067)

commit 17bd0a6
Author: Tong WU <109033598+Rachmanino@users.noreply.github.com>
Date:   Sun Oct 19 17:34:12 2025 +0800

    [Enhancement] Deprecate split&sum in attn bwd examples on Hopper and migrate to vectorized atomic add (tile-ai#1065)

commit ae9a6f0
Author: Tong WU <109033598+Rachmanino@users.noreply.github.com>
Date:   Sun Oct 19 15:45:58 2025 +0800

    [Refactor][Example] Update linear attention examples and add tests (tile-ai#1010)

    * [Refactor][Example] Update linear attention examples and add tests

    - Refactored the backward and forward linear attention kernels to use shared memory and atomic additions for improved performance.
    - Introduced L2 normalization in the main functions of both examples.
    - Added a new test suite for the linear attention examples to ensure correctness and performance.
    - Updated argument parsing in the main functions for better usability.

    * upd docstring for tma atomic add

    * lint

    * Add flash-linear-attention dependency to requirements.txt

    * Rename main function to chunk_linear_attn_bwd

    * Rename main function to chunk_linear_attn_fwd

    * chore

    ---------

    Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

commit b7dfdb3
Author: Xuehai Pan <XuehaiPan@pku.edu.cn>
Date:   Sun Oct 19 12:16:41 2025 +0800

    [Misc] Add GitHub issue templates (tile-ai#1057)

commit fb8b3af
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Sun Oct 19 12:15:44 2025 +0800

    [Benchmark] Add H800 SXM Benchmark results (tile-ai#1063)

    * Add document PYTHONPATH build path

    * update fp8 benchmark result

    * remove redpath

    * remove path

    * tflops fix

commit 4ca6c13
Author: Yuqi Dong <134183314+yyttt6@users.noreply.github.com>
Date:   Sun Oct 19 02:43:00 2025 +0800

    [CI]:Reduce test shapes to avoid OOM errors during CI. (tile-ai#1060)

    * [CI]:Reduce test shapes to avoid OOM errors during CI.

    * rabbit

    * Increase number of processes for pytest from 2 to 4

    ---------

    Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

commit 759c2e3
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Sun Oct 19 00:35:06 2025 +0800

    [DOC] Add document for develop with PYTHONPATH (tile-ai#1062)

commit bf2de5b
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Sun Oct 19 00:21:59 2025 +0800

    Making version parser more robust against missing or unavailable metadata (tile-ai#1061)

commit 7211164
Author: Chaofan Lin <linchaofan@bytedance.com>
Date:   Fri Oct 17 20:56:01 2025 +0800

    [Refactor] Refactor Pass `LegalizeSafeMemoryAccess` to support recursive load/store rewrite (tile-ai#1050)

    * [Refactor] Refactor Pass  to support recursive load/store rewrite

    * lint

    * recursive collect conds for call_extern

    * fix name

    * [Lint]: [pre-commit.ci] auto fixes [...]

    * lint

    * [Lint]: [pre-commit.ci] auto fixes [...]

    * lint

    * [Lint]: [pre-commit.ci] auto fixes [...]

    * address comment

    * rename pad_value to safe_value

    * lint

    * add oob store test

    * [Lint]: [pre-commit.ci] auto fixes [...]

    * fix

    * fix

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit 278c0fb
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Fri Oct 17 18:32:43 2025 +0800

    [Enhancement] Introduce a workaround for layout inference for local buffer store (tile-ai#1055)

    * [Enhancement] Improve layout inference for local buffer handling in parallel operations

    * Added logic to check if a loop only manipulates "local" buffers, which affects thread binding decisions.
    * Updated the condition for determining parallel loop execution to account for local buffer stores.
    * Cleaned up comments for clarity and future considerations.

    * [Refactor] Clean up parallel loop condition formatting in layout inference

    * Reformatted the condition for determining parallel loop execution for better readability.
    * Maintained existing logic while enhancing code clarity for future modifications.

    ---------

    Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

commit 37b3dbd
Author: LJC00118 <77378439+LJC00118@users.noreply.github.com>
Date:   Fri Oct 17 17:15:59 2025 +0800

    [Enhancement] Improve CUDA compiler detection in CMake (tile-ai#1054)

    * improve CUDA compiler detection in CMake

    * Minor fix

commit 1281d6f
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Fri Oct 17 13:44:08 2025 +0800

    [CI] Disable autofix for pre-commit CI (tile-ai#1053)

commit 35cf888
Author: LJC00118 <77378439+LJC00118@users.noreply.github.com>
Date:   Fri Oct 17 13:43:08 2025 +0800

    [Enhancement] Remove constraint requiring last dimension stride to be 1 (tile-ai#1040)

    * remove last dimension stride must be 1 constraint

    * add vectorize test

    * minor fix

    * [Lint]: [pre-commit.ci] auto fixes [...]

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit fd1493b
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Fri Oct 17 11:34:35 2025 +0800

    Automatically initialize submodule if missing (tile-ai#1052)

commit cc00fb6
Author: Tong WU <109033598+Rachmanino@users.noreply.github.com>
Date:   Fri Oct 17 11:28:14 2025 +0800

    [Enhancement] Add support for symbolic dimensions in Cython kernel adapter and improve static shape validation in wrapper (tile-ai#1024)

    * [Enhancement] Add support for symbolic dimensions in Cython kernel adapter and improve static shape validation in wrapper

    * [BugFix] Fix shape mismatch and deprecate `T.if()` in fused_moe example

    * [Fix] Add `is_symbolic_expr` function to check for symbolic expressions in TIR

    - Introduced a new utility function `is_symbolic_expr` to determine if an expression is a symbolic expression, enhancing type checking capabilities.
    - Updated shape handling in `CythonKernelAdapter` to utilize the new function, improving handling for symbolic shapes.

commit a79bc5c
Author: Xuehai Pan <XuehaiPan@pku.edu.cn>
Date:   Thu Oct 16 20:38:23 2025 +0800

    [CI] Fix ROCm CI (tile-ai#1043)

    * [CI] fix ROCm CI

    * feat: add a hook to error out on no test runs

commit 1f4ffdb
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Thu Oct 16 17:53:45 2025 +0800

    [Bugfix] Improves compatibility when checking for MPS availability in different PyTorch builds. (tile-ai#1051)

commit e3742d3
Author: Yichen Yan <wenji.yyc@alibaba-inc.com>
Date:   Thu Oct 16 15:52:10 2025 +0800

    Allow mma gemm for all cuda (tile-ai#1047)

commit 0ff4f42
Author: Yuqi Dong <134183314+yyttt6@users.noreply.github.com>
Date:   Thu Oct 16 12:41:09 2025 +0800

    [Feature]: Add test for atomicadd auto vectorize and remove useless code (tile-ai#1019)

    * update

    * format

    * rabbit

commit bd1c7b3
Author: Yu Cheng <54519279+chengyupku@users.noreply.github.com>
Date:   Thu Oct 16 02:52:35 2025 +0800

    [Refactor] Use `has_simt_copy` to decide whether to insert `set_max_nreg` (tile-ai#982)

commit 8f001e0
Author: Tong WU <109033598+Rachmanino@users.noreply.github.com>
Date:   Thu Oct 16 01:10:28 2025 +0800

    [BugFix] Phaseout dependency of Triton in sink examples to make CI happy (tile-ai#1045)

    * [BugFix] Phaseout dependency of Triton in sink examples to make CI happy

    - Added `benchmark_gqa_sink_fwd.py` and `benchmark_mha_sink_fwd.py` to evaluate performance of GQA and MHA attention mechanisms using Triton.
    - Refactored existing attention sink implementations to remove Triton kernel definitions from the reference programs, streamlining the code.
    - Updated input generation and benchmarking logic to enhance configurability and performance measurement.
    - Improved overall structure and organization of the examples for better clarity and usability.

    * [Lint]: [pre-commit.ci] auto fixes [...]

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit 8ce2778
Author: Xuehai Pan <XuehaiPan@pku.edu.cn>
Date:   Wed Oct 15 22:12:41 2025 +0800

    [CI][Refactor] Merge test CI workflow files into one (tile-ai#973)

    * refactor: merge test CI workflow files into one

    * chore: set `UV_INDEX_STRATEGY=unsafe-best-match`

    * feat: add AST test with Python 3.8

    * feat: implement manual caching mechanism for self-hosted runners

    * refactor: simplify cache logic for self-hosted runners

    * chore: clear uv cache on failure

    * chore: print format.sh output to logs

    * chore: improve uv caching

    * chore: disable parallel test

    * chore: use `PYTHONDEVMODE=1` in CI

    * feat: enable coredump generation

    * fix: fix perfbench condition

    * Revert "feat: enable coredump generation"

    This reverts commit c52da65.

    * chore: move example CI down

    * Revert "chore: move example CI down"

    This reverts commit 9d8e650.

    * chore: skip example `test_example_mha_sink_bwd_bhsd`

    * chore: skip example `test_example_gqa_sink_bwd_bhsd`

    * fix: fix example argument passing

    * fix: loosen test criteria

    * chore: rename `CMAKE_CONFIGURE_OPTIONS` -> `CLANG_TIDY_CMAKE_OPTIONS` for clarity

    * feat: enable parallel testings

    * chore: update pytest options

    * remove skipped test as now been resolved

    * chore: empty commit to re-trigger ci

    * test for n 1

    * chore: remove ` --numprocesses=1` option in example

    * chore: disable failfast

    * chore: update cibw selection

    * fix: fix git submodule clone

    * chore: update cibw commands

    * fix: fix yapf multiprocessing

    * chore: setup ccache for CIBW on macOS only

    * chore: update comments

    * chore: update artifact listing

    * fix: do not fail if not found nvcc in PATH

    * fix: fix flash-attn installation

    * chore: update dist workflow trigger

    * chore: remove outdated comments

    * chore(workflows/dist): simplify build matrix strategy

    * fix: fix CUDA path finding

    * fix: fix CUDA path finding

    * chore: imcrease CI timeout

    * ci: disable failfast

    * fix: hide path prefix

    * chore: more verbose

    * chore: disable PR trigger for dist workflow

    * fix: seed for tests

    * fix: use nightly torch for ROCm tests

    * chore: enable PR trigger for dist workflow

    * chore: stop uploading debug wheels as artifacts in PR

    * chore: do not run workflows in forks

    * chore: housekeep requirements

    * chore: use Nightly-ROCm-6.3 for CI

    * chore: use Nightly-ROCm-6.4 for CI

    * Update ROCm toolkit version to 7.0

    * chore: restore previous rocm-ci.yml for test

    * fix: cleanup PYTHONPATH

    * chore: remove previous rocm-ci.yml

    * ci fix

    * chore: remove previous rocm-ci.yml

    * chore: enable parallel example run

    ---------

    Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    Co-authored-by: alex_xiao <xinyuxiao2024@gmail.com>

commit 80665cd
Author: alex_xiao <xinyuxiao2024@gmail.com>
Date:   Wed Oct 15 21:17:14 2025 +0800

    fix bug&add amd examples (tile-ai#966)

    * [Enhancement] Refactor buffer index handling for improved precision and clarity (tile-ai#668)

    - Enhanced buffer index handling to address precision issues by removing redundant operations.
    - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
    - Updated related documentation to reflect changes in buffer management practices.

    * Remove obsolete test script for AMD example, streamlining the examples directory.

    * Remove unused dtype_size variable in AMD example script to streamline code.

    * Add input configuration file and update AMD example script for enhanced flexibility

    - Introduced a new input.txt file for configurable parameters.
    - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
    - Streamlined the main function for better clarity and organization.
    - Added a new test script to facilitate running the example with specified parameters.

    * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

    - Deleted input.txt and test.sh files as they are no longer needed.
    - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
    - Reintroduced swizzle usage in the kernel for better performance.

    * Refactor AMD example script for FlashAttention-2

    - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
    - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
    - Removed outdated comments and improved code organization for better readability.

    * Refactor formatting in AMD FlashAttention example script

    - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
    - Streamlined the `main` function parameter formatting for consistency.
    - Removed unnecessary blank lines to enhance overall code organization.

    * Update example_amd_flash_attn_fwd.py

    * Enhance AMD example script and update CI workflows

    - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
    - Added new CI workflows for AMD and documentation publishing.
    - Updated various requirements files to include necessary dependencies.
    - Introduced new test cases and examples for better coverage and functionality.
    - Refactored existing code for improved readability and maintainability.

    * Remove redundant tool cache cleanup step in AMD CI workflow

    * Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.

    * Add new AMD FlashAttention example and test script

    - Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
    - Added `test.sh` script to facilitate running the new example with specified parameters.
    - Enhanced the overall structure and organization of the example for better clarity and usability.

    * Update configurations in `example_amd_flash_attn_fwd.py` for autotuner

    - Reduced the number of threads and `num_split_q` options for improved performance.
    - Adjusted `panel_size` options to streamline configuration settings.

    * Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217

    * Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c

    * Add example for AMD Flash Attention backward pass implementation

    - Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
    - Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
    - Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
    - Included reference implementation for validation against PyTorch's attention mechanism.

    This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.

    * Enhance AMD Flash Attention example with additional testing capabilities

    - Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
    - Improved the main function to allow for better parameter configuration and benchmarking.
    - Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.

    This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.

    * Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a

    * Refactor HIP intrinsic rules to CUDA

    - Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
    - Adjusted include paths for better organization and clarity in the code structure.

    * Update AMD CI workflow to uninstall specific PyTorch packages before installation

    - Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
    - Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.

    * Remove unused shared memory allocations in AMD Flash Attention backward example

    - Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
    - This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.

    * Remove unnecessary pip uninstall command from AMD CI workflow

    - Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
    - This change simplifies the CI process and reduces potential overhead during package management.

    * Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules

    - Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
    - Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.

    * Refactor formatting of HIP intrinsic rule registrations

    - Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
    - No functional changes were made; this update focuses on code style improvements to enhance maintainability.

    * Update file name and documentation for HIP intrinsic rules

    - Renamed the file from `intrin_rule_cuda.cc` to `intrin_rule_hip.cc` to accurately reflect the focus on HIP intrinsic rules.
    - Updated the file documentation to clarify its purpose as related to HIP rather than CUDA.

    * Enhance DispatchHIPShuffle function with clang-analyzer comments

    - Added NOLINTBEGIN and NOLINTEND comments to the DispatchHIPShuffle function to suppress clang-analyzer warnings related to inner pointer usage.
    - This change improves code clarity and maintains compliance with static analysis tools.

    * lint fix

    * fix

    * Enhance autotuner configurations in example_amd_flash_attn_fwd.py by adding new block sizes, stages, and panel sizes. Update test script to use relative Python path and adjust parameters for consistency.

    * Add backward attention example to test script

    - Extended the test.sh script to include a new backward attention example using example_amd_flash_attn_bwd.py.
    - Added parameters for batch size, context length, and head dimensions to ensure consistency with the forward example.
    - Updated the command for the backward tile example to match the new configuration.

    * Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py

    - Introduced new functions for forward and backward configurations to enhance autotuning capabilities.
    - Updated the FlashAttention forward and backward functions to improve performance and maintainability.
    - Adjusted test script parameters for consistency and clarity, including the addition of group handling.
    - Enhanced the autotuner configurations by refining block sizes and stages for better performance tuning.
    - Updated the main function to reflect changes in parameter names and types for better usability.

    * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

    - Updated the backward function to return additional outputs, including log-sum-exp (LSE) values for improved gradient calculations.
    - Refined autotuner configurations by adding new block sizes and adjusting parameters for better performance tuning.
    - Improved shared memory usage in the backward pass to optimize memory access patterns and enhance computational efficiency.
    - Updated the main function to reflect changes in parameter handling and ensure consistency with the forward pass.
    - Enhanced correctness checks in the main function to include LSE validation alongside gradient checks.

    * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

    - Introduced a scaling factor for improved numerical stability in gradient calculations.
    - Optimized shared memory usage by adding new shared buffers for intermediate calculations.
    - Refined the handling of tensor fragments to improve performance and maintainability.
    - Updated the main function to ensure compatibility with the new output parameters for backward operations.
    - Removed unnecessary parameters from the test script to streamline execution.

    * Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py and example_mha_bwd.py

    - Updated the forward and backward functions to improve numerical stability and performance.
    - Enhanced shared memory usage by optimizing buffer allocations and reducing unnecessary parameters.
    - Adjusted autotuner configurations for better performance tuning and compatibility with new output parameters.
    - Added debugging and benchmarking functions for improved correctness verification and performance analysis.
    - Updated the main function to reflect changes in parameter handling and ensure consistency across examples.

    * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

    - Updated scaling factor application for improved numerical stability in gradient calculations.
    - Refined tensor handling to ensure consistency with forward pass operations.
    - Optimized atomic operations for writing gradients to dK and dV using fp32 for better precision.
    - Adjusted comments for clarity and alignment with standard implementation practices.

    * Expand autotuner configurations in example_amd_flash_attn_bwd.py and update test.sh

    - Increased the range of block sizes and stages for forward and backward configurations to enhance performance tuning.
    - Adjusted the test script to include additional parameters for batch size and head dimensions, ensuring consistency with the forward example.
    - Improved comments for clarity and alignment with the updated configurations.

    * Enhance performance calculations and benchmarking in example_amd_flash_attn_bwd.py

    - Updated FLOPs calculation to account for both forward and backward passes, clarifying the total computational cost.
    - Modified benchmarking functions to evaluate the complete forward and backward performance of both reference and Tile-lang implementations.
    - Improved comments for better understanding of the performance metrics and implementation details.
    - Removed unnecessary parameter from test.sh to streamline execution.

    * Remove forward attention test commands from test.sh and retain backward attention execution for streamlined testing.

    * Refactor FlashAttention forward and backward implementations in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py

    - Updated the forward function to return both output and log-sum-exp (LSE) values for improved gradient calculations.
    - Enhanced autotuner configurations for forward pass, including new parameters for better performance tuning.
    - Refined scaling factor calculations for numerical stability in both forward and backward passes.
    - Improved comments and documentation for clarity and consistency across implementations.
    - Adjusted main function to reflect changes in parameter handling and ensure compatibility with new output requirements.

    * Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py

    - Removed outdated comments and improved clarity in the code.
    - Enhanced the forward function to consistently return output and log-sum-exp (LSE) values.
    - Updated autotuner configurations to include new parameters for better performance tuning.
    - Refined tensor handling and scaling factor calculations for improved numerical stability.
    - Adjusted the main function to ensure compatibility with updated output requirements and parameter handling.

    * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

    - Updated configuration parameters for backward calculations, including new options for block sizes, threads, and rasterization.
    - Added new parameters (k_pack, qk_coalesced_width, v_coalesced_width) to improve performance tuning and memory access patterns.
    - Modified tensor copy operations to utilize coalesced widths for optimized memory loads.
    - Enhanced GEMM operations with k_pack for improved computational efficiency.
    - Refined the configuration generation logic to accommodate the new parameters, ensuring comprehensive coverage for backward pass scenarios.

    * Refactor configuration and tensor operations in example_amd_flash_attn_bwd.py

    - Updated backward configuration parameters to include larger block sizes and a wider range of threads for enhanced performance tuning.
    - Removed unnecessary parameters (k_pack, qk_coalesced_width, v_coalesced_width) from function signatures and tensor operations to simplify the implementation.
    - Optimized tensor copy operations by eliminating coalesced width specifications, streamlining memory access patterns.
    - Adjusted GEMM operations to improve computational efficiency without the use of k_pack.

    * Enhance HIP code generation and FP8 type support

    - Added support for additional FP8 types (e4m3, e4m3b11fnuz, e5m2fnuz, e8m0) in codegen_hip.cc to improve compatibility.
    - Updated error logging to include unsupported FP8 type details for better debugging.
    - Implemented handling for loop break and no-op register management in HIP within VisitExpr_ method.
    - Introduced new FP8 vector types (e5 and e8) in hip_fp8.h for enhanced functionality.
    - Added overloads for AtomicAdd in common.h to support both pointer and value arguments.

    * Enhance FP8 type support and clarify accumulator handling in HIP

    - Expanded FP8 type support in codegen_hip.cc to include additional float8 formats.
    - Updated gemm.h to clarify the handling of the accumulator when clear_accum is true.
    - Added comments in hip_fp8.h to indicate that E8M0 types are not supported in the current HIP version.

    * Remove deprecated files and update print statements for clarity in example_amd_flash_attn_bwd.py

    * Update print statement formatting for clarity in example_amd_flash_attn_bwd.py

    * Remove redundant verification results summary print statement in example_amd_flash_attn_bwd.py for cleaner output.

    * Fix formatting inconsistencies in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py by adding spaces for improved readability in configuration parameters and print statements.

    * Refactor and enhance HIP code generation for improved FP8 support

    - Reorganized and cleaned up code in codegen_hip.cc for better readability and maintainability.
    - Enhanced handling of FP8 types, including additional formats and improved error logging for unsupported types.
    - Updated AtomicAdd function in common.h to streamline its implementation.
    - Refined the PrintVecElemLoadExpr method to handle volatile loads more effectively.
    - Added function to manage the addition of new functions in the code generation process.

    * Fix formatting issue in HIP code generation for MFMA call

    - Adjusted the indentation of the MFMA call code block in codegen_hip.cc for improved readability and consistency.

    * Refactor HIP code generation and enhance FP8 type handling

    - Reintroduced necessary includes and reorganized code in codegen_hip.cc for improved structure and readability.
    - Enhanced the GetFP8Type function to support additional FP8 formats and improved error handling for unsupported types.
    - Updated PrintType and PrintVecElemLoadExpr methods to better manage type conversions and vector element loading.
    - Refined the AddFunction method to streamline function addition in the code generation process.

    * Remove unnecessary blank line in example_amd_flash_attn_bwd.py for improved code cleanliness.

    * Refactor backward attention implementation in example_amd_flash_attn_bwd.py

    - Updated the GEMM operation to use shared memory for improved performance.
    - Adjusted parallelization parameters to enhance efficiency in the backward pass.

    * Fix formatting by removing an unnecessary blank line in example_amd_flash_attn_bwd.py for improved code cleanliness.

    * Add additional test cases for `assert_tl_matmul_correctness` with `float8_e4m3fnuz` and various configurations

    * Refactor test case formatting for `assert_tl_matmul_correctness` in `test_tilelang_gemm_mfma_intrinsic.py`

    ---------

    Co-authored-by: xinxyxiao <xinyxiao@amd.com>
    Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

commit b78d840
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Wed Oct 15 16:38:55 2025 +0800

    [Language] Expose `T.get_warp_idx_sync` and `T.shuffle_elect` for efficient thread election (tile-ai#989)

    * Expose CUDA warp/lane intrinsics in TileLang frontend

    * generalize warp indexing intrinsics and add coverage

    * [Lint]: [pre-commit.ci] auto fixes [...]

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit 32ddc1a
Author: LJC00118 <77378439+LJC00118@users.noreply.github.com>
Date:   Wed Oct 15 15:25:43 2025 +0800

    [CUDA] Add pack functions for FP8 types (tile-ai#967)

    * Remove an incorrect check

    * add fp8 pack function

    * code lint

    * minor fix

    * minor fix

    * minor fix

    * Minor fix

    * Minor fix

commit c67f73b
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Wed Oct 15 15:12:08 2025 +0800

    [Env] Optimize the mechanism for locating `TL_LIBS` (tile-ai#1038)

commit e539952
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Wed Oct 15 15:11:40 2025 +0800

    [TIR] Revert some changes of Pass `LowerIntrin` (tile-ai#1035)

    * keep >> instead of /

    * re think replicate

    * lint fix

    * handle const int buffers

    * rep fix

    ---------

    Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

commit 5767475
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Tue Oct 14 23:55:27 2025 +0800

    [CI] Disable buggy(maybe) warp specialized kernel ci test for H20 (tile-ai#1033)

commit eed320f
Author: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Date:   Tue Oct 14 21:51:31 2025 +0800

    [Bugfix] Recover code for flexible parallel (tile-ai#1032)

    * recover flex parallel process

    * lint fix

    ---------

    Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

commit 1e8f0b1
Author: Tong WU <109033598+Rachmanino@users.noreply.github.com>
Date:   Tue Oct 14 17:26:23 2025 +0800

    [Enhancement] Update abs function for half_t and bfloat_t to use cutlass implementation (tile-ai#1023)

    * [Enhancement] Update abs function for half_t and bfloat_t to use cutlass implementation

    * [Lint]: [pre-commit.ci] auto fixes [...]

    * optimize amd ci

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants