Skip to content

Conversation

@LeiWang1999
Copy link
Member

@LeiWang1999 LeiWang1999 commented Oct 7, 2025

as title.

  • Add a doc about let stmt inline.

Summary by CodeRabbit

  • New Features

    • FusedMoE example main() is now configurable via function parameters (model size, experts, batch, seq_len).
  • Bug Fixes

    • More robust software-pipeline injection/planning that preserves wrappers and attributes.
    • Relaxed shared-memory boundary checks and improved buffer remapping/versioning for correctness.
    • Safer let-binding handling and IR simplification to avoid incorrect aliasing/inlining.
    • New pass option to control eager let inlining.
  • Documentation

    • Added “Compiler Internals: LetStmt Inlining” page and updated docs index.
  • Tests

    • Added regression tests for temporary-variable handling and multi-version buffers; updated example tests to pass explicit parameters.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 7, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Adds tests and docs for LetStmt inlining, introduces a TL_FORCE_LET_INLINE config flag and conditional inlining, parameterizes examples/tests, and implements extensive transform and lowering changes for wrapper-aware pipeline injection/planning, buffer remapping/layout handling, shared-memory merge logic, multi-version buffer rewriting, and LetStmt simplification heuristics.

Changes

Cohort / File(s) Summary
New test: TileLang issue 814
testing/python/issue/test_tilelang_issue_814.py
New CUDA test module adding _tmp_var_kernel, run_tmp_var_test, test_issue_814, and main guard; compiles/runs a tilelang CUDA kernel and validates outputs against a PyTorch CUDA reference with tilelang.testing.torch_assert_close.
Engine phase / pass config
tilelang/engine/phase.py, src/op/builtin.{cc,h}, tilelang/transform/pass_config.py
Adds TL_FORCE_LET_INLINE pass config key and exposes kForceLetInline; introduces should_force_let_inline() and conditions LetStmt inlining in LowerAndLegalize based on the config; swaps TIR Simplify() to tilelang.transform.Simplify() in LowerAndLegalize.
Docs: LetStmt inlining
docs/compiler_internals/letstmt_inline.md, docs/index.md
New compiler-internals documentation describing LetStmt inlining heuristics, collection of variables used in buffer definitions, and the TL_FORCE_LET_INLINE option; adds COMPILER INTERNALS toctree entry.
Examples & tests: parameterization
examples/fusedmoe/example_fusedmoe_tilelang.py, examples/fusedmoe/test_example_fusedmoe.py, examples/blocksparse_attention/test_example_blocksparse_attention.py
Parameterizes main() in fusedmoe example (adds hyperparameter args with defaults) and updates tests to call main() with explicit, smaller keyword args; blocksparse attention tests pass explicit batch and max_cache_seqlen.
Transform: shared memory merging
src/transform/merge_shared_memory_allocations.cc
Relaxes scope boundary checks (ICHECK_LTICHECK_LE), computes an access_level = min(level, scope_.size()-1) when not aggressive, and updates touched-buffer logic for BufferLoad/Var reads.
Transform: wrapper-aware pipeline injection
src/transform/inject_pipeline.cc
Introduces wrapper-collection/apply mechanism (AttrStmt/IfThenElse/LetStmt/etc.), tracks pipeline body origin, unwraps to find SeqStmt, and reapplies wrappers LIFO around the constructed pipeline; adds <functional>.
Transform: pipeline planning updates
src/transform/pipeline_planning.cc
Adds pipeline_body_root and unwrap traversal to locate SeqStmt through Let/If/Attr wrappers; updates BufferRegionCollector to handle BufferLoad/Var-backed reads and uses unwrapped SeqStmt for planning and validation.
Lowering: remapped layouts & Let bindings
src/transform/lower_tile_op.cc
Adds remap/layout helpers (ResolveBufferLoad, FindRemapBuffer, FindLayout, AccessPtrResult), optionally rewrites access_ptr/offset when buffers are remapped, and introduces let_bindings_ to bind BufferLoad results during LetStmt handling for analyzer use.
Transform: multi-version buffer rewriter
src/transform/multi_version_buffer_rewriter.cc
Flattens pipeline statements, tracks copy-stage semantics (is_copy_stage), records first_write_index/last_read_index, block_alloc_buffers_, stmt_stack_, unwraps wrappers to find SeqStmt, and adjusts buffer remapping/versioning decisions accordingly.
Simplify: LetStmt buffer-alias suppression
src/transform/simplify.cc
Adds a pre-inline path that removes Let bindings whose value is a BufferLoad/BufferRegion spanning multiple non-unit extents (alias-like), using range analysis to safely bypass the Let.
New test: multi-version buffer with let
testing/python/transform/test_tilelang_transform_multi_version_buffer.py
Adds test_multi_version_buffer_with_let() validating multi-version buffer rewriting in a Let-like scenario; compares before/after prim_funcs via existing helper.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant For as ForLoop
  participant Planner as PipelinePlanner
  participant Unwrapper as WrapperUnwrapper
  participant Builder as PipelineBuilder
  participant Injector as Injector

  Note over For,Injector: Wrapper-aware pipeline injection (new flow)
  For->>Planner: identify pipeline_body_root
  Planner->>Unwrapper: traverse Let/If/Attr wrappers to find SeqStmt
  Unwrapper-->>Planner: SeqStmt + wrappers(stack)
  Planner->>Builder: build stages from SeqStmt
  Builder->>Injector: transformed pipeline body
  Injector->>Injector: reapply wrappers LIFO around body
  Injector-->>For: attach wrapped pipeline body
Loading
sequenceDiagram
  autonumber
  participant Lower as lower_tile_op
  participant Remap as RemapResolver
  participant Layout as LayoutFinder
  participant Analyzer as Analyzer
  participant IR as IRTransformer

  Note over Lower,IR: Buffer remap/layout resolution and Let bindings
  Lower->>Remap: Resolve BufferLoad/Var -> remapped buffer?
  Remap-->>Layout: query layout info
  alt remap found
    Lower->>IR: rewrite access_ptr/offset/mma with remapped expr
  else
    Lower->>IR: keep original access
  end
  Lower->>Analyzer: bind BufferLoad values into let_bindings_ for LetStmt scope
  Analyzer-->>Lower: supply bound values for later passes
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • tzj-fxz

Poem

🐇 I hopped through loops where wrappers hide,
I unbound Lets and found SeqStmt inside.
I remapped buffers, tucked shared memory tight,
Ran CUDA kernels that blinked with light.
A little rabbit cheers — compile, pass, good night!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly describes the primary bug fix of removing forced inlining of let statements, which aligns directly with the pull request’s main change to disable LetStmt inlining.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ee3540 and 2311fc7.

📒 Files selected for processing (5)
  • docs/compiler_internals/letstmt_inline.md (1 hunks)
  • src/op/builtin.cc (1 hunks)
  • src/op/builtin.h (2 hunks)
  • tilelang/engine/phase.py (2 hunks)
  • tilelang/transform/pass_config.py (1 hunks)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link

github-actions bot commented Oct 7, 2025

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
testing/python/issue/test_tilelang_issue_814.py (2)

7-21: Consider adding a docstring.

Adding a docstring would clarify the purpose of this kernel factory function and its parameters.

 def test_tmp_var(N, block_N, dtype="float"):
+    """
+    Create a kernel that tests temporary variable handling in TileLang.
+    
+    The kernel computes tmp = max(A[idx], 1) and uses it twice to ensure
+    the temporary is not over-inlined, which would cause redundant computation.
+    
+    Parameters
+    ----------
+    N : int
+        Tensor size
+    block_N : int
+        Elements processed per block
+    dtype : str
+        Data type (default: "float")
+    """

14-19: Align thread count to block_N and drop manual bounds guard

  • Safe memory access checks are automatically inserted by the LegalizeSafeMemoryAccess pass, so a manual if idx < N guard isn’t needed.
  • Replace threads=128 with threads=block_N to maintain one-to-one thread-to-element mapping and avoid mismatch. [line 14]
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7fb0677 and abc2f8c.

📒 Files selected for processing (3)
  • testing/python/issue/test_tilelang_issue_814.py (1 hunks)
  • tilelang/engine/phase.py (0 hunks)
  • tilelang/language/utils.py (0 hunks)
💤 Files with no reviewable changes (2)
  • tilelang/engine/phase.py
  • tilelang/language/utils.py
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/issue/test_tilelang_issue_814.py (4)
tilelang/language/parallel.py (1)
  • Parallel (8-28)
tilelang/jit/__init__.py (1)
  • compile (33-86)
tilelang/transform/pass_config.py (1)
  • PassConfigKey (6-101)
tilelang/utils/tensor.py (1)
  • torch_assert_close (220-312)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-test-amd
  • GitHub Check: bot-task
  • GitHub Check: format-check
🔇 Additional comments (2)
testing/python/issue/test_tilelang_issue_814.py (2)

24-49: LGTM! Test logic is sound.

The test correctly validates temporary variable handling:

  • Compiles the kernel with appropriate pass configs (disabling warp specialization and TMA lowering)
  • Creates reference computation using PyTorch
  • Validates both output tensors with reasonable tolerances
  • The key test: tmp is computed once and used twice (for both B and A updates), ensuring the compiler doesn't over-inline the let binding

52-54: LGTM! Clear test entry point.

The docstring accurately describes the test's purpose.

…checks

- Changed SKV parameter from 32768 to 8192 in sparse MLA backward and forward tests.
- Added check_correctness parameter to test functions for validation of outputs.
- Updated test cases to reflect new SKV values and correctness checks.
…in example_fusedmoe_tilelang.py

- Added a new section for compiler internals in the documentation.
- Refactored the main function in example_fusedmoe_tilelang.py to accept parameters for hidden dimensions, expert configurations, and batch/sequence sizes, improving flexibility and readability.
@LeiWang1999
Copy link
Member Author

a fix for issue #814

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
examples/fusedmoe/example_fusedmoe_tilelang.py (2)

1-1: Fix the formatting issue flagged by CI.

The pipeline failures indicate this file needs reformatting. Please run the project's formatter and commit the changes.


536-536: Align generate_input parameters with config keys
The function signature uses names like dhidden, dexpert, nroutedexperts, etc., while the config dict uses d_hidden, d_expert, n_routed_experts, etc., causing keyword argument mismatches. Rename either the config keys or the function parameters so they match exactly.

🧹 Nitpick comments (4)
docs/compiler_internals/letstmt_inline.md (4)

11-22: Clarify current pipeline behavior (no forced inlining).

Given this PR stops forcing Let inlining in LowerAndLegalize, add a short note here that Simplify may still inline under these rules, but the pipeline no longer forces it. This avoids confusion about “when” inlining happens.


15-21: Avoid fragile line-number references; link to source or name symbols instead.

Hardcoding source line numbers will drift. Replace “(lines 315–326)”, “line 357”, and “(lines 168–206)” with:

  • Symbol names (e.g., CanInlineLetStmt, used_in_buffer_def_, VarUseDefAnalyzer)
  • A GitHub permalink to the file/anchor

Also applies to: 47-55, 75-79


1-8: Add a brief “As of PR #947” note for traceability.

Add one sentence noting that the pipeline no longer forces Let inlining as of PR #947, and this page documents the Simplify pass rules.


141-144: Use GitHub permalinks for Related Files Replace the plain paths under “Related Files” in docs/compiler_internals/letstmt_inline.md with clickable permalinks to the exact lines in src/transform/simplify.cc (e.g. line 315 for CanInlineLetStmt) and in testing/python/issue/test_tilelang_issue_814.py.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f1aa27e and 9354899.

📒 Files selected for processing (4)
  • docs/compiler_internals/letstmt_inline.md (1 hunks)
  • docs/index.md (1 hunks)
  • examples/fusedmoe/example_fusedmoe_tilelang.py (1 hunks)
  • examples/fusedmoe/test_example_fusedmoe.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/fusedmoe/test_example_fusedmoe.py (1)
examples/fusedmoe/example_fusedmoe_tilelang.py (1)
  • main (524-545)
examples/fusedmoe/example_fusedmoe_tilelang.py (1)
examples/warp_specialize/example_warp_specialize_flashmla.py (1)
  • main (385-399)
🪛 GitHub Actions: CI Test on AMD
examples/fusedmoe/test_example_fusedmoe.py

[error] 1-1: Reformatted files. Please review and stage the changes.

examples/fusedmoe/example_fusedmoe_tilelang.py

[error] 1-1: Reformatted files. Please review and stage the changes.

🪛 GitHub Actions: CI Test on Metal
examples/fusedmoe/test_example_fusedmoe.py

[error] 1-1: Code reformatted by formatter. Changes not staged for commit. Please run 'git add' and commit the changes, or revert formatting if undesired.

examples/fusedmoe/example_fusedmoe_tilelang.py

[error] 1-1: Code reformatted by formatter. Changes not staged for commit. Please run 'git add' and commit the changes, or revert formatting if undesired.

🔇 Additional comments (1)
docs/index.md (1)

38-44: Section addition looks good; verify link resolves.

New COMPILER INTERNALS toctree is consistent with existing style. Please confirm docs/compiler_internals/letstmt_inline.md builds and the link renders in the sidebar.

Comment on lines +126 to +127
If this causes issues (e.g., `A[idx]` being read twice with different values due to the first write), it indicates a potential problem with the inlining heuristic or the code pattern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Revise the hazard explanation; current sequence doesn’t demonstrate a read-after-write issue.

In the example, both reads occur before the write to A[idx], so “due to the first write” isn’t accurate. Consider a case where A[idx] is written between two reads to illustrate why duplicated loads from inlining can change behavior, or rephrase to note duplicate loads impact performance but not semantics here.

-If this causes issues (e.g., `A[idx]` being read twice with different values due to the first write)...
+Be aware that inlining duplicates loads. If a write to the same location occurs between two uses, re‑evaluation can observe updated memory and change behavior. In this example, both reads occur before the write, so semantics are unchanged, though duplicated loads may affect performance.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
If this causes issues (e.g., `A[idx]` being read twice with different values due to the first write), it indicates a potential problem with the inlining heuristic or the code pattern.
Be aware that inlining duplicates loads. If a write to the same location occurs between two uses, re-evaluation can observe updated memory and change behavior. In this example, both reads occur before the write, so semantics are unchanged, though duplicated loads may affect performance.
🤖 Prompt for AI Agents
In docs/compiler_internals/letstmt_inline.md around lines 126 to 127, the hazard
explanation is incorrect because the shown sequence has both reads before the
write so it does not demonstrate a read-after-write semantic change; either
replace the example with one where A[idx] is written between two reads (so the
first write can change the value seen by the second read) to illustrate the
semantic hazard, or rephrase the sentence to say that duplicated loads in this
specific sequence only affect performance (not semantics) and clarify when
duplicated loads can change behavior (i.e., when a write to the same location
occurs between the reads).

Comment on lines 524 to 534
def main(d_hidden=7168, d_expert=2048, n_routed_experts=8, n_shared_experts=1, n_experts_per_token=4, batch_size=1, seq_len=8192):
config = {
"dhidden": 7168,
"dexpert": 2048,
"nroutedexperts": 8,
"nsharedexperts": 1,
"nexpertspertoken": 4,
"bs": 1,
"seqlen": 8192,
"dhidden": d_hidden,
"dexpert": d_expert,
"nroutedexperts": n_routed_experts,
"nsharedexperts": n_shared_experts,
"nexpertspertoken": n_experts_per_token,
"bs": batch_size,
"seqlen": seq_len,
"seed": 81394
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: Config dictionary keys don't match usage throughout the codebase.

The config dictionary uses lowercase keys without underscores ("dhidden", "dexpert", "nroutedexperts", etc.), but the Expert, MoEGate, and MoE classes expect keys with underscores ("d_hidden", "d_expert", "n_routed_experts", etc.). This will cause KeyError exceptions at runtime.

For example:

  • Line 282: self.d_hidden: int = config["d_hidden"]
  • Line 283: self.d_expert: int = config["d_expert"]
  • Line 300: self.top_k: int = config["n_experts_per_token"]
  • Line 301: self.num_experts: int = config["n_routed_experts"]

Apply this diff to fix the config keys:

 def main(d_hidden=7168, d_expert=2048, n_routed_experts=8, n_shared_experts=1, n_experts_per_token=4, batch_size=1, seq_len=8192):
     config = {
-        "dhidden": d_hidden,
-        "dexpert": d_expert,
-        "nroutedexperts": n_routed_experts,
-        "nsharedexperts": n_shared_experts,
-        "nexpertspertoken": n_experts_per_token,
-        "bs": batch_size,
-        "seqlen": seq_len,
+        "d_hidden": d_hidden,
+        "d_expert": d_expert,
+        "n_routed_experts": n_routed_experts,
+        "n_shared_experts": n_shared_experts,
+        "n_experts_per_token": n_experts_per_token,
+        "batch_size": batch_size,
+        "seq_len": seq_len,
         "seed": 81394
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def main(d_hidden=7168, d_expert=2048, n_routed_experts=8, n_shared_experts=1, n_experts_per_token=4, batch_size=1, seq_len=8192):
config = {
"dhidden": 7168,
"dexpert": 2048,
"nroutedexperts": 8,
"nsharedexperts": 1,
"nexpertspertoken": 4,
"bs": 1,
"seqlen": 8192,
"dhidden": d_hidden,
"dexpert": d_expert,
"nroutedexperts": n_routed_experts,
"nsharedexperts": n_shared_experts,
"nexpertspertoken": n_experts_per_token,
"bs": batch_size,
"seqlen": seq_len,
"seed": 81394
}
def main(d_hidden=7168, d_expert=2048, n_routed_experts=8, n_shared_experts=1, n_experts_per_token=4, batch_size=1, seq_len=8192):
config = {
"d_hidden": d_hidden,
"d_expert": d_expert,
"n_routed_experts": n_routed_experts,
"n_shared_experts": n_shared_experts,
"n_experts_per_token": n_experts_per_token,
"batch_size": batch_size,
"seq_len": seq_len,
"seed": 81394
}
🤖 Prompt for AI Agents
In examples/fusedmoe/example_fusedmoe_tilelang.py around lines 524 to 534, the
config dict uses keys like "dhidden", "dexpert", "nroutedexperts", etc., but the
rest of the code expects underscored keys (e.g., "d_hidden", "d_expert",
"n_routed_experts", "n_shared_experts", "n_experts_per_token", "bs", "seqlen",
"seed"); update the dict keys to match those expected names (replace
"dhidden"->"d_hidden", "dexpert"->"d_expert",
"nroutedexperts"->"n_routed_experts", "nsharedexperts"->"n_shared_experts",
"nexpertspertoken"->"n_experts_per_token") and keep "bs", "seqlen", and "seed"
as-is so the Expert/MoEGate/MoE classes access config without KeyError.

- Changed the condition for buffer access from less than (<) to less than or equal to (<=) to allow access at the same scope level.
- Adjusted the logic for determining the access level when touching buffers to ensure correct handling of scope levels.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9354899 and f372812.

📒 Files selected for processing (1)
  • src/transform/merge_shared_memory_allocations.cc (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/merge_shared_memory_allocations.cc (1)
src/transform/storage_rewrite.cc (6)
  • buf (247-254)
  • buf (247-247)
  • buf (505-522)
  • buf (505-505)
  • buf (1715-1740)
  • buf (1715-1715)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: format-check
  • GitHub Check: format-check
  • GitHub Check: bot-task
🔇 Additional comments (2)
src/transform/merge_shared_memory_allocations.cc (2)

219-220: LGTM!

The relaxation of the boundary check is consistent with the fix in BufferLoadNode and correctly handles buffer variable references at the same scope level. The comment clearly explains the change.


197-201: LGTM with consistency check The update in merge_shared_memory_allocations.cc is correct. storage_rewrite.cc still uses strict ICHECK_LT(it->second.level, scope_.size()). Confirm whether it should also be relaxed to ICHECK_LE and use the access_level logic to prevent out-of-bounds.

Comment on lines +207 to +209
// When accessing at the same level, use that level
size_t access_level = std::min(it->second.level, scope_.size() - 1);
scope_[access_level].touched.push_back(buf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Guard against unsigned underflow in access_level calculation.

The expression scope_.size() - 1 can underflow when scope_.size() == 0 (wrapping to SIZE_MAX for size_t). While the ICHECK_LE guard makes this scenario unlikely in practice, relying on wraparound behavior is a code smell that could mask bugs or cause issues in unexpected edge cases.

Consider using a safer calculation that explicitly handles the zero case:

         if (enable_aggressive_merge) {
           scope_[scope_.size() - 1].touched.push_back(buf);
         } else {
           // When accessing at the same level, use that level
-          size_t access_level = std::min(it->second.level, scope_.size() - 1);
+          size_t access_level = (scope_.size() == 0) ? 0 : std::min(it->second.level, scope_.size() - 1);
           scope_[access_level].touched.push_back(buf);
         }

Apply the same fix to both locations (lines 207-209 and lines 226-228).

Alternatively, add a runtime check:

           // When accessing at the same level, use that level
+          ICHECK_GT(scope_.size(), 0) << "Scope stack should not be empty when accessing buffers";
           size_t access_level = std::min(it->second.level, scope_.size() - 1);

Also applies to: 226-228

🤖 Prompt for AI Agents
In src/transform/merge_shared_memory_allocations.cc around lines 207-209 and
226-228, the computation scope_.size() - 1 can underflow when scope_.size() ==
0; change the calculation to guard the zero case (e.g. compute size_t max_index
= scope_.empty() ? 0 : scope_.size() - 1; then use std::min(it->second.level,
max_index) for access_level) and replace both occurrences; optionally keep or
add a runtime check/ICHECK that scope_ is non-empty where logically required to
catch unexpected states.

  - inline let-bound BufferLoad nodes via resolver helpers and structured return
  - remap layouts/buffers using original data vars and only rewrite when needed
  - update pipeline planner to understand let-bound address_of buffers
  - document the new inline behaviour in docs/let_inline_fix.md
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/transform/lower_tile_op.cc (1)

8-8: Remove unused #include <tvm/tir/op.h> (line 8). No tvm::tir operator functions or symbols from this header are referenced in lower_tile_op.cc.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 597d8b1 and 36736f3.

📒 Files selected for processing (2)
  • src/transform/lower_tile_op.cc (10 hunks)
  • src/transform/pipeline_planning.cc (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/transform/pipeline_planning.cc (1)
src/transform/inject_pipeline.cc (20)
  • buffer_region (113-132)
  • buffer_region (114-114)
  • op (166-182)
  • op (166-166)
  • op (184-197)
  • op (184-184)
  • op (199-212)
  • op (199-199)
  • op (214-220)
  • op (214-214)
  • op (840-1020)
  • op (840-840)
  • op (1022-1039)
  • op (1022-1022)
  • op (1041-1058)
  • op (1041-1041)
  • buffer (398-453)
  • buffer (398-399)
  • buffer (461-470)
  • buffer (461-461)
src/transform/lower_tile_op.cc (3)
src/transform/layout_inference.cc (4)
  • expr (389-407)
  • expr (389-389)
  • buffer (409-415)
  • buffer (409-409)
src/transform/inject_pipeline.cc (2)
  • call (134-164)
  • call (134-135)
src/transform/pipeline_planning.cc (14)
  • op (48-57)
  • op (48-48)
  • op (87-143)
  • op (87-87)
  • op (166-184)
  • op (166-166)
  • op (186-204)
  • op (186-186)
  • op (206-260)
  • op (206-206)
  • op (262-272)
  • op (262-262)
  • op (703-712)
  • op (703-703)
🔇 Additional comments (10)
src/transform/pipeline_planning.cc (4)

209-222: LGTM! Enhanced buffer region resolution.

The refactored address_of handling now properly resolves both BufferLoad and Var cases, making buffer region tracking more robust. The defensive programming with the early check if (buffer_region.defined()) prevents unnecessary additions.


413-449: LGTM! Robust unwrapping of pipeline body wrappers.

The new unwrapping loop properly handles LetStmt and IfThenElse (without else) wrappers around the pipeline body, aligning with the PR objective of not forcing inline let statements. The error message accurately describes the supported wrapper types.


455-455: Verify that pipeline_body_root is used consistently.

Ensure that downstream processing uses pipeline_body_seq for iterating the pipeline stages, as pipeline_body_root may still contain wrappers. Based on line 458, this appears correct.


5-5: Retain the <tvm/tir/op.h> include – it provides definitions for tir::builtin::if_then_else() and tir::attr constants.

src/transform/lower_tile_op.cc (6)

323-426: LGTM! Well-structured refactoring with AccessPtrResult.

The introduction of AccessPtrResult struct and the refactored flow clearly separates cases where rewriting occurs from cases where the original expression is preserved. The use of helper functions (ResolveBufferLoad, FindRemapBuffer, FindLayout) improves modularity and readability.


428-440: LGTM! ResolveBufferLoad properly handles let-bound expressions.

This helper correctly resolves variables to their let-bound BufferLoad expressions when available, enabling support for non-inlined let statements as per the PR objective.


442-478: LGTM! Comprehensive buffer and layout resolution.

Both FindRemapBuffer and FindLayout implement thorough lookup strategies:

  • Direct map lookup
  • Lookup via buffer_map_ by data var
  • Fallback to same_as comparisons by data or name

This handles various buffer aliasing scenarios robustly.


502-543: LGTM! Consistent use of ResolveBufferLoad in PTX paths.

The ptx_ldmatrix and mma_store handling now properly resolve let-bound buffer loads before transformation, ensuring correct remapping when buffers are wrapped in let statements.


594-616: LGTM! Proper LetStmt handling with scope management.

The new VisitStmt_ for LetStmtNode:

  • Records BufferLoad values in let_bindings_ for later resolution
  • Binds pure expressions in the analyzer for constant folding
  • Properly erases bindings after visiting the body to maintain correct scope

This implementation correctly supports non-inlined let statements as intended by the PR.


708-709: LGTM! let_bindings_ properly scoped.

The let_bindings_ member is correctly managed:

  • Added in VisitStmt_(LetStmtNode) when visiting the let statement
  • Erased after visiting the body to maintain lexical scope
  • Used by ResolveBufferLoad for resolution

No memory leaks or scope issues detected.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 36736f3 and 6f115d3.

📒 Files selected for processing (1)
  • src/transform/multi_version_buffer_rewriter.cc (5 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
PR: tile-ai/tilelang#794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/multi_version_buffer_rewriter.cc
🧬 Code graph analysis (1)
src/transform/multi_version_buffer_rewriter.cc (1)
src/transform/warp_specialized_rewriter.cc (36)
  • seq_stmt (917-990)
  • seq_stmt (918-919)
  • seq_stmt (1033-1090)
  • seq_stmt (1033-1033)
  • op (38-43)
  • op (38-38)
  • op (73-83)
  • op (73-73)
  • op (85-93)
  • op (85-85)
  • op (95-100)
  • op (95-95)
  • op (102-110)
  • op (102-102)
  • op (134-146)
  • op (134-134)
  • op (148-177)
  • op (148-148)
  • op (179-189)
  • op (179-179)
  • stmt (51-62)
  • stmt (51-51)
  • stmt (121-124)
  • stmt (121-121)
  • stmt (126-130)
  • stmt (126-126)
  • stmt (132-132)
  • stmt (132-132)
  • stmt (230-230)
  • stmt (230-230)
  • stmt (271-271)
  • stmt (271-271)
  • stmt (302-306)
  • stmt (302-302)
  • stmt (333-339)
  • stmt (333-334)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-test-metal
  • GitHub Check: build-test-amd

Comment on lines +301 to 333
bool in_scope = false;
for (const StmtNode *ancestor : stmt_stack_) {
if (ancestor == lca) {
in_scope = true;
break;
}
}
if (!in_scope)
continue;
// Only double-buffer shared allocations; locals do not need versioning.
auto scope = buffer.scope();
if (!(scope == "shared" || scope == "shared.dyn"))
continue;
if (seen.insert(buffer.get()).second) {
scoped_buffers.push_back(buffer);
}
}
for (auto it = stmt_stack_.rbegin(); it != stmt_stack_.rend(); ++it) {
if (!(*it)->IsInstance<BlockNode>())
continue;
const auto *block = static_cast<const BlockNode *>(*it);
auto map_it = block_alloc_buffers_.find(block);
if (map_it == block_alloc_buffers_.end())
continue;
for (const Buffer &buffer : map_it->second) {
auto scope = buffer.scope();
if (!(scope == "shared" || scope == "shared.dyn"))
continue;
if (seen.insert(buffer.get()).second) {
scoped_buffers.push_back(buffer);
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix LCA scope tracking for Let/If wrappers.

We only push For/Block nodes into stmt_stack_, so when DetectBufferAccessLCA says a buffer’s LCA is a LetStmt or IfThenElse (exactly the wrappers we now keep instead of inlining), the ancestor scan never sees it. in_scope stays false, the buffer drops out of scoped_buffers, and we silently skip versioning—regressing double buffering for the new “non-inlined let” pipelines. Please push those wrapper nodes before visiting their bodies.

+  Stmt VisitStmt_(const LetStmtNode *op) final {
+    stmt_stack_.push_back(op);
+    Stmt stmt = StmtExprMutator::VisitStmt_(op);
+    stmt_stack_.pop_back();
+    return stmt;
+  }
+
+  Stmt VisitStmt_(const IfThenElseNode *op) final {
+    stmt_stack_.push_back(op);
+    Stmt stmt = StmtExprMutator::VisitStmt_(op);
+    stmt_stack_.pop_back();
+    return stmt;
+  }
🤖 Prompt for AI Agents
In src/transform/multi_version_buffer_rewriter.cc around lines 301 to 333, the
ancestor scan for LCAs always checks stmt_stack_ which only contains For/Block
nodes, so LCAs that are LetStmt or IfThenElse wrappers are missed causing
buffers to be excluded from scoped_buffers; update the traversal to push LetStmt
and IfThenElse nodes onto stmt_stack_ before visiting their bodies (and pop them
after visiting) so DetectBufferAccessLCA finds wrapper LCAs and buffers get
properly considered for double-buffering.

@LeiWang1999
Copy link
Member Author

PR Summary via GPT-5 Pro – LetStmt-Aware Pipeline Fixes (Issue #814)

Overview

  • Addressed correctness regressions that appear when LetInline is disabled for CUDA pipelines.
  • Hardened multiple lowering passes so they can work with LetStmt, AttrStmt, and single-branch IfThenElse wrappers that now remain in the IR.
  • Added documentation and regression tests to capture the behaviour.

Compiler Pass Updates

  • Remove unconditional LetInline (tilelang/engine/phase.py:88): the pass was masking issues in downstream transforms; we now rely on updated passes that explicitly tolerate LetStmt wrappers.

  • Pipeline planning (src/transform/pipeline_planning.cc:410):

    • Walks through LetStmt/single-branch IfThenElse wrappers to recover the underlying SeqStmt.
    • Records shared buffers defined in ancestor blocks and handles address_of calls whose operand is a Var, ensuring dependency analysis still sees shared memory accesses.
  • Software-pipeline injection (src/transform/inject_pipeline.cc:846):

    • Uses the same unwrapping logic and re-applies the stripped wrappers after the pipeline is rewritten so attributes, lets, and guards survive the transform intact.
  • Multi-version buffer rewriter (src/transform/multi_version_buffer_rewriter.cc:141):

    • Tracks ancestor statements and block-local allocations to find shared buffers even when their LCA is outside the loop body.
    • Treats Role::kBoth stages as both producers and consumers and falls back to “first write / last read” indices to decide when double-buffering is required.
    • Restricts multi-versioning to shared / shared.dyn scopes so locals are not duplicated accidentally and handles let/if wrappers around the pipeline body.
  • Shared-memory merger (src/transform/merge_shared_memory_allocations.cc:194):

    • Allows reads in the same scope level where the buffer is allocated, preventing false negatives in the presence of let-wrapped expressions.
  • Tile op lowering (src/transform/lower_tile_op.cc:330):

    • Resolves let-bound BufferLoad expressions before remapping, so address_of / tvm_access_ptr seen after LetInline removal still convert to the permuted layouts required by Hopper intrinsics.
    • Caches let bindings during visitation to avoid repeated re-resolution.
  • Docs (docs/compiler_internals/letstmt_inline.md, docs/index.md:35):

    • Document how and why LetStmt inlining works, providing context for the removal in the lowering pipeline.

Examples & Tests

  • FusedMoE example (examples/fusedmoe/example_fusedmoe_tilelang.py:521):

    • Parameterized the demo so CI can run a reduced problem size, keeping coverage while respecting resource limits.
    • Updated test harness (examples/fusedmoe/test_example_fusedmoe.py:3) to pass the smaller configuration.
  • Issue regression test (testing/python/issue/test_tilelang_issue_814.py):

    • Captures the scenario where a let-bound temporary feeds both load and store paths, ensuring future changes do not reintroduce silent inlining bugs.

Verification

  • python examples/deepseek_deepgemm/example_deepgemm_fp8_2xAcc.py
  • python -m pytest testing/python/issue/test_tilelang_issue_814.py
  • python -m pytest examples/fusedmoe/test_example_fusedmoe.py

@LeiWang1999
Copy link
Member Author

have to modify some codes for aws related pass, @chengyupku should take a look.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
src/transform/multi_version_buffer_rewriter.cc (1)

286-291: Fix LCA scope tracking: push Let/If/Attr onto stmt_stack_.

When LCA is a LetStmt or IfThenElse (now preserved), in-scope detection fails because stmt_stack_ only tracks For/Block. Buffers with wrapper LCAs get skipped for versioning.

Add visitors that push/pop these wrappers, mirroring Block/For.

   Stmt VisitStmt_(const BlockNode *op) final {
     stmt_stack_.push_back(op);
     Stmt stmt = StmtExprMutator::VisitStmt_(op);
     stmt_stack_.pop_back();
     return stmt;
   }

   Stmt VisitStmt_(const ForNode *op) final {
     stmt_stack_.push_back(op);
     ...
     stmt_stack_.pop_back();
     return for_node;
   }
+
+  Stmt VisitStmt_(const LetStmtNode *op) final {
+    stmt_stack_.push_back(op);
+    Stmt stmt = StmtExprMutator::VisitStmt_(op);
+    stmt_stack_.pop_back();
+    return stmt;
+  }
+
+  Stmt VisitStmt_(const IfThenElseNode *op) final {
+    stmt_stack_.push_back(op);
+    Stmt stmt = StmtExprMutator::VisitStmt_(op);
+    stmt_stack_.pop_back();
+    return stmt;
+  }
+
+  Stmt VisitStmt_(const AttrStmtNode *op) final {
+    stmt_stack_.push_back(op);
+    Stmt stmt = StmtExprMutator::VisitStmt_(op);
+    stmt_stack_.pop_back();
+    return stmt;
+  }

Also applies to: 293-406, 474-485

🧹 Nitpick comments (5)
examples/blocksparse_attention/test_example_blocksparse_attention.py (1)

10-11: Consider parameterizing the remaining tests for consistency.

Tests test_block_sparse_attn_triton, test_example_tilelang_block_sparse_attn, test_example_triton_sparse_gqa_decode_varlen_indice, and test_example_triton_sparse_gqa_decode_varlen_mask still call main() without explicit parameters. If their default configurations also result in long CI runtimes, consider applying similar parameterization for consistency and efficiency.

Also applies to: 14-15, 26-31

testing/python/transform/test_tilelang_transform_multi_version_buffer.py (2)

108-135: Good regression test for let-bound producer/consumer.

Covers non-inlined let across pipeline stages and matches expected double buffering (k % 2). Consider adding complementary cases:

  • Single-branch IfThenElse wrapper around the pipeline body.
  • AttrStmt wrapper (e.g., pragma) above SeqStmt.

This will guard the additional wrapper paths tightened in the pass.


114-121: Optional: zero‑init accum to avoid undefined reads if executed.

If this TIR is ever run, accum[i] += shared[...] reads undefined initial values. You can add a small init loop before k.

         accum = T.alloc_buffer((8,), "float32", scope="local")
+        for i in T.serial(8):
+            accum[i] = T.float32(0)
         for k in T.serial(4, annotations={"num_stages": T.int32(2)}):

Also applies to: 126-134

src/transform/multi_version_buffer_rewriter.cc (2)

142-173: Consider unwrapping single-branch IfThenElse in collect_stmts.

collect_stmts flattens Let/Attr/Block wrappers but leaves IfThenElse. For trivial guards inserted around stages, flattening the then_case (when else is absent) can stabilize stage detection and access analysis.

Optional improvement: unwrap IfThenElse without else same as Let/Attr.

     if (const auto *let = stmt.as<LetStmtNode>()) {
       collect_stmts(let->body);
       return;
     }
     if (const auto *attr = stmt.as<AttrStmtNode>()) {
       collect_stmts(attr->body);
       return;
     }
+    if (const auto *ite = stmt.as<IfThenElseNode>()) {
+      if (!ite->else_case.defined()) {
+        collect_stmts(ite->then_case);
+        return;
+      }
+    }

434-472: Also rewrite address_of to include version offset (optional).

Only tvm_access_ptr is adjusted. Calls like tir.address_of(buffer[...]) or builtin::tvm_address_of may appear after letting/intrinsic lowering and need the version stride added.

Add a branch similar to tvm_access_ptr to rewrite address_of indices.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c637767 and 22ae8c5.

📒 Files selected for processing (4)
  • examples/blocksparse_attention/test_example_blocksparse_attention.py (1 hunks)
  • src/transform/multi_version_buffer_rewriter.cc (6 hunks)
  • src/transform/simplify.cc (2 hunks)
  • testing/python/transform/test_tilelang_transform_multi_version_buffer.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
PR: tile-ai/tilelang#794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/multi_version_buffer_rewriter.cc
🧬 Code graph analysis (4)
src/transform/simplify.cc (2)
src/transform/lower_tile_op.cc (2)
  • expr (428-440)
  • expr (428-428)
src/transform/layout_inference.cc (2)
  • expr (389-407)
  • expr (389-389)
testing/python/transform/test_tilelang_transform_multi_version_buffer.py (2)
tilelang/language/ast/ir.py (1)
  • alloc_buffer (441-508)
src/transform/storage_rewrite.cc (1)
  • i (705-841)
examples/blocksparse_attention/test_example_blocksparse_attention.py (2)
examples/blocksparse_attention/example_tilelang_sparse_gqa_decode_varlen_indice.py (2)
  • main (173-186)
  • main (381-468)
examples/blocksparse_attention/example_tilelang_sparse_gqa_decode_varlen_mask.py (2)
  • main (162-173)
  • main (371-449)
src/transform/multi_version_buffer_rewriter.cc (1)
src/transform/warp_specialized_rewriter.cc (17)
  • stmt (51-62)
  • stmt (51-51)
  • stmt (121-124)
  • stmt (121-121)
  • stmt (126-130)
  • stmt (126-126)
  • stmt (132-132)
  • stmt (132-132)
  • stmt (230-230)
  • stmt (230-230)
  • stmt (271-271)
  • stmt (271-271)
  • seq_stmt (917-990)
  • seq_stmt (918-919)
  • seq_stmt (1033-1090)
  • seq_stmt (1033-1033)
  • marker (1181-1181)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build-test-metal
🔇 Additional comments (5)
examples/blocksparse_attention/test_example_blocksparse_attention.py (2)

18-19: LGTM! Explicit parameterization reduces CI runtime.

The explicit parameters (batch=1, max_cache_seqlen=2048) reduce the test workload from the defaults (batch=8, max_cache_seqlen=8192), aligning with the PR's goal of parameterizing examples for faster CI runs.


22-23: LGTM! Consistent parameterization with the other test.

The parameters match line 19, ensuring consistency across the modified tests.

src/transform/simplify.cc (3)

8-8: LGTM!

The inclusion of tvm/tir/analysis.h supports the new buffer alias analysis functionality.


15-15: LGTM!

The <optional> header is appropriate for C++17 optional types used in this file.


356-370: Verify the non-unit dimension counting aligns with intended removal criteria.

The current logic counts non-unit dimensions to decide on buffer alias removal. Please confirm:

  1. Should a single-dimension vectorized access (e.g., buffer[ramp(0, 1, 4)] representing a 4-lane vector) trigger removal? The comment mentions "or carries vector lanes" as a separate condition, but the current non_unit_dims > 1 check would NOT remove such cases.

  2. Is the threshold non_unit_dims > 1 (2+ dims) intentional, or should it be >= 2 or some other value?

If single-dimension vectorized accesses should also be removed, consider detecting Ramp nodes explicitly:

     Array<Range> ranges = get_ranges(value);
     if (!ranges.empty()) {
       int non_unit_dims = 0;
+      bool has_ramp = false;
       for (const Range &range : ranges) {
         PrimExpr extent = analyzer_->Simplify(range->extent);
         if (is_const_int(extent, 1) || analyzer_->CanProveEqual(extent, 1)) {
           continue;
         }
+        // Check if this came from a Ramp (vectorized access)
+        if (value.as<BufferLoadNode>()) {
+          auto load = value.as<BufferLoadNode>();
+          for (const PrimExpr &index : load->indices) {
+            if (index.as<RampNode>()) {
+              has_ramp = true;
+            }
+          }
+        }
         ++non_unit_dims;
         if (non_unit_dims > 1) {
           remove_buffer_alias = true;
           break;
         }
       }
+      if (has_ramp) {
+        remove_buffer_alias = true;
+      }
     }

Comment on lines +319 to +344
const SeqStmtNode *pipeline_body_seq = nullptr;
{
// Traverse trivial wrappers (let/if) to find the actual SeqStmt body.
Stmt current = pipeline_body_root;
while (true) {
if (const auto *seq_stmt = current.as<SeqStmtNode>()) {
pipeline_body_seq = seq_stmt;
break;
}
if (const auto *if_then_else = current.as<IfThenElseNode>()) {
ICHECK(!if_then_else->else_case.defined())
<< "MultiVersionBuffer: Can't handle the body of the loop "
"because the IfThenElse node has an else branch";
current = if_then_else->then_case;
continue;
}
if (const auto *let_stmt = current.as<LetStmtNode>()) {
current = let_stmt->body;
continue;
}
LOG(FATAL)
<< "MultiVersionBuffer: Can't handle the body of the loop because "
<< "it is not a SeqStmt, IfThenElse without else, "
<< "or LetStmt wrapping them, but got " << current->GetTypeKey();
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Unwrap AttrStmt while locating pipeline body.

The finder unwraps IfThenElse and LetStmt but not AttrStmt. If For->body is an AttrStmt wrapping a SeqStmt (common after LetInline removal), this path LOG(FATAL)s.

Add AttrStmt unwrapping alongside Let/If.

       while (true) {
         if (const auto *seq_stmt = current.as<SeqStmtNode>()) {
           pipeline_body_seq = seq_stmt;
           break;
         }
         if (const auto *if_then_else = current.as<IfThenElseNode>()) {
           ICHECK(!if_then_else->else_case.defined())
               << "MultiVersionBuffer: Can't handle the body of the loop "
                  "because the IfThenElse node has an else branch";
           current = if_then_else->then_case;
           continue;
         }
+        if (const auto *attr_stmt = current.as<AttrStmtNode>()) {
+          current = attr_stmt->body;
+          continue;
+        }
         if (const auto *let_stmt = current.as<LetStmtNode>()) {
           current = let_stmt->body;
           continue;
         }
         LOG(FATAL)
             << "MultiVersionBuffer: Can't handle the body of the loop because "
             << "it is not a SeqStmt, IfThenElse without else, "
-            << "or LetStmt wrapping them, but got " << current->GetTypeKey();
+            << "or LetStmt/AttrStmt wrapping them, but got "
+            << current->GetTypeKey();
       }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const SeqStmtNode *pipeline_body_seq = nullptr;
{
// Traverse trivial wrappers (let/if) to find the actual SeqStmt body.
Stmt current = pipeline_body_root;
while (true) {
if (const auto *seq_stmt = current.as<SeqStmtNode>()) {
pipeline_body_seq = seq_stmt;
break;
}
if (const auto *if_then_else = current.as<IfThenElseNode>()) {
ICHECK(!if_then_else->else_case.defined())
<< "MultiVersionBuffer: Can't handle the body of the loop "
"because the IfThenElse node has an else branch";
current = if_then_else->then_case;
continue;
}
if (const auto *let_stmt = current.as<LetStmtNode>()) {
current = let_stmt->body;
continue;
}
LOG(FATAL)
<< "MultiVersionBuffer: Can't handle the body of the loop because "
<< "it is not a SeqStmt, IfThenElse without else, "
<< "or LetStmt wrapping them, but got " << current->GetTypeKey();
}
}
const SeqStmtNode *pipeline_body_seq = nullptr;
{
// Traverse trivial wrappers (let/if) to find the actual SeqStmt body.
Stmt current = pipeline_body_root;
while (true) {
if (const auto *seq_stmt = current.as<SeqStmtNode>()) {
pipeline_body_seq = seq_stmt;
break;
}
if (const auto *if_then_else = current.as<IfThenElseNode>()) {
ICHECK(!if_then_else->else_case.defined())
<< "MultiVersionBuffer: Can't handle the body of the loop "
"because the IfThenElse node has an else branch";
current = if_then_else->then_case;
continue;
}
if (const auto *attr_stmt = current.as<AttrStmtNode>()) {
current = attr_stmt->body;
continue;
}
if (const auto *let_stmt = current.as<LetStmtNode>()) {
current = let_stmt->body;
continue;
}
LOG(FATAL)
<< "MultiVersionBuffer: Can't handle the body of the loop because "
<< "it is not a SeqStmt, IfThenElse without else, "
<< "or LetStmt/AttrStmt wrapping them, but got "
<< current->GetTypeKey();
}
}
🤖 Prompt for AI Agents
In src/transform/multi_version_buffer_rewriter.cc around lines 319 to 344, the
traversal that unwraps trivial wrappers when locating the SeqStmt body handles
IfThenElse and LetStmt but omits AttrStmt, causing a fatal error when the loop
body is an AttrStmt-wrapped SeqStmt; add a branch that recognizes AttrStmtNode
and sets current = attr_stmt->body (continuing the loop) alongside the existing
Let/If handling so AttrStmt wrappers are unwrapped before the SeqStmt check.

Comment on lines +333 to +338
// TileLang emits aliases like `X_shared = buffer[0:128, 0:32]` to annotate
// fragment types. TVM currently reinterprets vectorized/shared accesses as
// Let-bound BufferLoad/BufferRegion nodes. If these bindings survive, later
// passes (Layout rewrite, FlattenBuffer) substitute them with vector lanes
// that our layout can't handle. Force-inline (by dropping the let) whenever
// the alias spans more than 2 dims or carries vector lanes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Clarify the comment to match implementation logic.

The comment states removal triggers when the alias "spans more than 2 dims or carries vector lanes", but the implementation at line 365 checks non_unit_dims > 1, which means 2+ non-unit dimensions. This creates two mismatches:

  1. "More than 2 dims" typically means 3+, but the code triggers on 2+.
  2. The "or carries vector lanes" suggests a vectorized single-dimension access should also trigger removal, but it won't because a single Ramp index contributes only 1 non-unit dimension.

Apply this diff to align the comment with the actual logic:

-    // TileLang emits aliases like `X_shared = buffer[0:128, 0:32]` to annotate
-    // fragment types. TVM currently reinterprets vectorized/shared accesses as
-    // Let-bound BufferLoad/BufferRegion nodes. If these bindings survive, later
-    // passes (Layout rewrite, FlattenBuffer) substitute them with vector lanes
-    // that our layout can't handle. Force-inline (by dropping the let) whenever
-    // the alias spans more than 2 dims or carries vector lanes.
+    // TileLang emits aliases like `X_shared = buffer[0:128, 0:32]` to annotate
+    // fragment types. TVM currently reinterprets vectorized/shared accesses as
+    // Let-bound BufferLoad/BufferRegion nodes. If these bindings survive, later
+    // passes (Layout rewrite, FlattenBuffer) substitute them with vector lanes
+    // that our layout can't handle. Force-inline (by dropping the let) whenever
+    // the alias has 2 or more non-unit dimensions (counting vector lanes).

Alternatively, if the intent is to also catch single-dimension vectorized accesses, update the logic:

     if (!ranges.empty()) {
       int non_unit_dims = 0;
+      bool has_vector_lanes = false;
       for (const Range &range : ranges) {
         PrimExpr extent = analyzer_->Simplify(range->extent);
         if (is_const_int(extent, 1) || analyzer_->CanProveEqual(extent, 1)) {
           continue;
         }
+        if (is_const_int(extent) && *as_const_int(extent) > 1) {
+          has_vector_lanes = true;
+        }
         ++non_unit_dims;
         if (non_unit_dims > 1) {
           remove_buffer_alias = true;
           break;
         }
       }
+      if (has_vector_lanes) {
+        remove_buffer_alias = true;
+      }
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// TileLang emits aliases like `X_shared = buffer[0:128, 0:32]` to annotate
// fragment types. TVM currently reinterprets vectorized/shared accesses as
// Let-bound BufferLoad/BufferRegion nodes. If these bindings survive, later
// passes (Layout rewrite, FlattenBuffer) substitute them with vector lanes
// that our layout can't handle. Force-inline (by dropping the let) whenever
// the alias spans more than 2 dims or carries vector lanes.
// TileLang emits aliases like `X_shared = buffer[0:128, 0:32]` to annotate
// fragment types. TVM currently reinterprets vectorized/shared accesses as
// Let-bound BufferLoad/BufferRegion nodes. If these bindings survive, later
// passes (Layout rewrite, FlattenBuffer) substitute them with vector lanes
// that our layout can't handle. Force-inline (by dropping the let) whenever
// the alias has 2 or more non-unit dimensions (counting vector lanes).

Comment on lines +371 to +379
if (remove_buffer_alias) {
Stmt body = this->VisitStmt(op->body);
bool used = UsesVar(
body, [&](const VarNode *var) { return var == op->var.get(); });
ICHECK(!used) << "Let binding of BufferLoad is expected to be unused "
"before removal "
<< op->var << " : " << op->value << " .";
return body;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

ICHECK could crash on valid code; consider graceful fallback.

The ICHECK at line 375 will crash the compiler if the let-bound variable is used in the body. While the comment states "expected to be unused", this expectation may not always hold—especially during incremental refactoring or in edge cases not covered by tests. A compiler crash is a poor user experience.

Consider one of these alternatives:

Option 1 (preferred): Fall back to normal inlining if the variable is used:

     if (remove_buffer_alias) {
       Stmt body = this->VisitStmt(op->body);
       bool used = UsesVar(
           body, [&](const VarNode *var) { return var == op->var.get(); });
-      ICHECK(!used) << "Let binding of BufferLoad is expected to be unused "
-                       "before removal "
-                    << op->var << " : " << op->value << " .";
-      return body;
+      if (!used) {
+        return body;
+      }
+      // Variable is used; fall back to normal inlining logic below
+      remove_buffer_alias = false;
     }

Option 2: Perform actual inlining (substitute the variable) instead of just removal:

     if (remove_buffer_alias) {
-      Stmt body = this->VisitStmt(op->body);
-      bool used = UsesVar(
-          body, [&](const VarNode *var) { return var == op->var.get(); });
-      ICHECK(!used) << "Let binding of BufferLoad is expected to be unused "
-                       "before removal "
-                    << op->var << " : " << op->value << " .";
-      return body;
+      // Inline the buffer alias by substituting it in the body
+      analyzer_->Bind(op->var, value);
+      Stmt body = this->VisitStmt(op->body);
+      return body;
     }

Option 3: If the ICHECK is truly an invariant, add a comment explaining why the variable is guaranteed to be unused (e.g., due to upstream pass behavior).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (remove_buffer_alias) {
Stmt body = this->VisitStmt(op->body);
bool used = UsesVar(
body, [&](const VarNode *var) { return var == op->var.get(); });
ICHECK(!used) << "Let binding of BufferLoad is expected to be unused "
"before removal "
<< op->var << " : " << op->value << " .";
return body;
}
if (remove_buffer_alias) {
Stmt body = this->VisitStmt(op->body);
bool used = UsesVar(
body, [&](const VarNode *var) { return var == op->var.get(); });
if (!used) {
return body;
}
// Variable is used; fall back to normal inlining logic below
remove_buffer_alias = false;
}
🤖 Prompt for AI Agents
In src/transform/simplify.cc around lines 371-379, replace the ICHECK that
aborts when the let-bound variable is actually used with a graceful fallback:
detect when used==true and in that case do not crash but return a
preserved/updated Let binding (e.g., reconstruct the original Let with the
visited value and visited body or simply return a let node built from op->var,
the visited value and the visited body) so the pass continues; only remove the
let when used==false. Ensure no crash occurs and the original behavior (removal
when unused) is kept.

@LeiWang1999
Copy link
Member Author

ci has failed due to limited storage, but tests and examples have been passed in my local environment.

  - register the new config in builtin headers/registration
  - add helper to pipeline enabling LetInline based on pass context
  - document LetStmt inlining controls and usage
@LeiWang1999 LeiWang1999 merged commit f8ae600 into tile-ai:main Oct 10, 2025
4 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant