[Bugfix] Do not force inline let stmt #947

LeiWang1999 · 2025-10-07T08:25:02Z

as title.

Add a doc about let stmt inline.

Summary by CodeRabbit

New Features
- FusedMoE example main() is now configurable via function parameters (model size, experts, batch, seq_len).
Bug Fixes
- More robust software-pipeline injection/planning that preserves wrappers and attributes.
- Relaxed shared-memory boundary checks and improved buffer remapping/versioning for correctness.
- Safer let-binding handling and IR simplification to avoid incorrect aliasing/inlining.
- New pass option to control eager let inlining.
Documentation
- Added “Compiler Internals: LetStmt Inlining” page and updated docs index.
Tests
- Added regression tests for temporary-variable handling and multi-version buffers; updated example tests to pass explicit parameters.

…hase.py

coderabbitai · 2025-10-07T08:25:13Z

Caution

Review failed

The pull request is closed.

Walkthrough

Adds tests and docs for LetStmt inlining, introduces a TL_FORCE_LET_INLINE config flag and conditional inlining, parameterizes examples/tests, and implements extensive transform and lowering changes for wrapper-aware pipeline injection/planning, buffer remapping/layout handling, shared-memory merge logic, multi-version buffer rewriting, and LetStmt simplification heuristics.

Changes

Cohort / File(s)	Summary
New test: TileLang issue 814 `testing/python/issue/test_tilelang_issue_814.py`	New CUDA test module adding `_tmp_var_kernel`, `run_tmp_var_test`, `test_issue_814`, and main guard; compiles/runs a tilelang CUDA kernel and validates outputs against a PyTorch CUDA reference with `tilelang.testing.torch_assert_close`.
Engine phase / pass config `tilelang/engine/phase.py`, `src/op/builtin.{cc,h}`, `tilelang/transform/pass_config.py`	Adds TL_FORCE_LET_INLINE pass config key and exposes `kForceLetInline`; introduces `should_force_let_inline()` and conditions LetStmt inlining in LowerAndLegalize based on the config; swaps TIR `Simplify()` to `tilelang.transform.Simplify()` in LowerAndLegalize.
Docs: LetStmt inlining `docs/compiler_internals/letstmt_inline.md`, `docs/index.md`	New compiler-internals documentation describing LetStmt inlining heuristics, collection of variables used in buffer definitions, and the `TL_FORCE_LET_INLINE` option; adds COMPILER INTERNALS toctree entry.
Examples & tests: parameterization `examples/fusedmoe/example_fusedmoe_tilelang.py`, `examples/fusedmoe/test_example_fusedmoe.py`, `examples/blocksparse_attention/test_example_blocksparse_attention.py`	Parameterizes `main()` in fusedmoe example (adds hyperparameter args with defaults) and updates tests to call `main()` with explicit, smaller keyword args; blocksparse attention tests pass explicit `batch` and `max_cache_seqlen`.
Transform: shared memory merging `src/transform/merge_shared_memory_allocations.cc`	Relaxes scope boundary checks (`ICHECK_LT` → `ICHECK_LE`), computes an `access_level = min(level, scope_.size()-1)` when not aggressive, and updates touched-buffer logic for `BufferLoad`/`Var` reads.
Transform: wrapper-aware pipeline injection `src/transform/inject_pipeline.cc`	Introduces wrapper-collection/apply mechanism (AttrStmt/IfThenElse/LetStmt/etc.), tracks pipeline body origin, unwraps to find `SeqStmt`, and reapplies wrappers LIFO around the constructed pipeline; adds `<functional>`.
Transform: pipeline planning updates `src/transform/pipeline_planning.cc`	Adds `pipeline_body_root` and unwrap traversal to locate `SeqStmt` through Let/If/Attr wrappers; updates BufferRegionCollector to handle BufferLoad/Var-backed reads and uses unwrapped `SeqStmt` for planning and validation.
Lowering: remapped layouts & Let bindings `src/transform/lower_tile_op.cc`	Adds remap/layout helpers (`ResolveBufferLoad`, `FindRemapBuffer`, `FindLayout`, `AccessPtrResult`), optionally rewrites access_ptr/offset when buffers are remapped, and introduces `let_bindings_` to bind `BufferLoad` results during `LetStmt` handling for analyzer use.
Transform: multi-version buffer rewriter `src/transform/multi_version_buffer_rewriter.cc`	Flattens pipeline statements, tracks copy-stage semantics (`is_copy_stage`), records `first_write_index`/`last_read_index`, `block_alloc_buffers_`, `stmt_stack_`, unwraps wrappers to find `SeqStmt`, and adjusts buffer remapping/versioning decisions accordingly.
Simplify: LetStmt buffer-alias suppression `src/transform/simplify.cc`	Adds a pre-inline path that removes Let bindings whose value is a `BufferLoad`/`BufferRegion` spanning multiple non-unit extents (alias-like), using range analysis to safely bypass the Let.
New test: multi-version buffer with let `testing/python/transform/test_tilelang_transform_multi_version_buffer.py`	Adds `test_multi_version_buffer_with_let()` validating multi-version buffer rewriting in a Let-like scenario; compares before/after prim_funcs via existing helper.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant For as ForLoop
  participant Planner as PipelinePlanner
  participant Unwrapper as WrapperUnwrapper
  participant Builder as PipelineBuilder
  participant Injector as Injector

  Note over For,Injector: Wrapper-aware pipeline injection (new flow)
  For->>Planner: identify pipeline_body_root
  Planner->>Unwrapper: traverse Let/If/Attr wrappers to find SeqStmt
  Unwrapper-->>Planner: SeqStmt + wrappers(stack)
  Planner->>Builder: build stages from SeqStmt
  Builder->>Injector: transformed pipeline body
  Injector->>Injector: reapply wrappers LIFO around body
  Injector-->>For: attach wrapped pipeline body

sequenceDiagram
  autonumber
  participant Lower as lower_tile_op
  participant Remap as RemapResolver
  participant Layout as LayoutFinder
  participant Analyzer as Analyzer
  participant IR as IRTransformer

  Note over Lower,IR: Buffer remap/layout resolution and Let bindings
  Lower->>Remap: Resolve BufferLoad/Var -> remapped buffer?
  Remap-->>Layout: query layout info
  alt remap found
    Lower->>IR: rewrite access_ptr/offset/mma with remapped expr
  else
    Lower->>IR: keep original access
  end
  Lower->>Analyzer: bind BufferLoad values into let_bindings_ for LetStmt scope
  Analyzer-->>Lower: supply bound values for later passes

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

[Feature] Add 1D TMA support #761 — Touches src/transform/lower_tile_op.cc; related to buffer remap/layout and lowering plumbing.
[Refactor] Merge ThreadPartialSync and ThreadStorageSync #741 — Modifies phase.py/LowerAndLegalize pipeline; overlaps with conditional LetStmt inlining work.
[TileOp] Introduce a experimental python defined T.gemm_v2 #793 — Also concerns LetStmt inlining and related transform pass changes.

Suggested reviewers

tzj-fxz

Poem

🐇 I hopped through loops where wrappers hide,
I unbound Lets and found SeqStmt inside.
I remapped buffers, tucked shared memory tight,
Ran CUDA kernels that blinked with light.
A little rabbit cheers — compile, pass, good night!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title succinctly describes the primary bug fix of removing forced inlining of let statements, which aligns directly with the pull request’s main change to disable LetStmt inlining.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ee3540 and 2311fc7.

📒 Files selected for processing (5)

docs/compiler_internals/letstmt_inline.md (1 hunks)
src/op/builtin.cc (1 hunks)
src/op/builtin.h (2 hunks)
tilelang/engine/phase.py (2 hunks)
tilelang/transform/pass_config.py (1 hunks)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2025-10-07T08:25:14Z

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

testing/python/issue/test_tilelang_issue_814.py (2)
7-21: Consider adding a docstring.

Adding a docstring would clarify the purpose of this kernel factory function and its parameters.
 def test_tmp_var(N, block_N, dtype="float"):
+    """
+    Create a kernel that tests temporary variable handling in TileLang.
+    
+    The kernel computes tmp = max(A[idx], 1) and uses it twice to ensure
+    the temporary is not over-inlined, which would cause redundant computation.
+    
+    Parameters
+    ----------
+    N : int
+        Tensor size
+    block_N : int
+        Elements processed per block
+    dtype : str
+        Data type (default: "float")
+    """
14-19: Align thread count to block_N and drop manual bounds guard

Safe memory access checks are automatically inserted by the LegalizeSafeMemoryAccess pass, so a manual if idx < N guard isn’t needed.

Replace threads=128 with threads=block_N to maintain one-to-one thread-to-element mapping and avoid mismatch. [line 14]

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7fb0677 and abc2f8c.

📒 Files selected for processing (3)

testing/python/issue/test_tilelang_issue_814.py (1 hunks)
tilelang/engine/phase.py (0 hunks)
tilelang/language/utils.py (0 hunks)

💤 Files with no reviewable changes (2)

tilelang/engine/phase.py
tilelang/language/utils.py

🧰 Additional context used

🧬 Code graph analysis (1)

testing/python/issue/test_tilelang_issue_814.py (4)

tilelang/language/parallel.py (1)

Parallel (8-28)

tilelang/jit/__init__.py (1)

compile (33-86)

tilelang/transform/pass_config.py (1)

PassConfigKey (6-101)

tilelang/utils/tensor.py (1)

torch_assert_close (220-312)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: build-test-amd
GitHub Check: bot-task
GitHub Check: format-check

🔇 Additional comments (2)

testing/python/issue/test_tilelang_issue_814.py (2)

24-49: LGTM! Test logic is sound.

The test correctly validates temporary variable handling:

Compiles the kernel with appropriate pass configs (disabling warp specialization and TMA lowering)

Creates reference computation using PyTorch

Validates both output tensors with reasonable tolerances

The key test: tmp is computed once and used twice (for both B and A updates), ensuring the compiler doesn't over-inline the let binding

52-54: LGTM! Clear test entry point.

The docstring accurately describes the test's purpose.

…checks - Changed SKV parameter from 32768 to 8192 in sparse MLA backward and forward tests. - Added check_correctness parameter to test functions for validation of outputs. - Updated test cases to reflect new SKV values and correctness checks.

…in example_fusedmoe_tilelang.py - Added a new section for compiler internals in the documentation. - Refactored the main function in example_fusedmoe_tilelang.py to accept parameters for hidden dimensions, expert configurations, and batch/sequence sizes, improving flexibility and readability.

LeiWang1999 · 2025-10-07T12:48:42Z

a fix for issue #814

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

examples/fusedmoe/example_fusedmoe_tilelang.py (2)

1-1: Fix the formatting issue flagged by CI.

The pipeline failures indicate this file needs reformatting. Please run the project's formatter and commit the changes.

536-536: Align generate_input parameters with config keys
The function signature uses names like dhidden, dexpert, nroutedexperts, etc., while the config dict uses d_hidden, d_expert, n_routed_experts, etc., causing keyword argument mismatches. Rename either the config keys or the function parameters so they match exactly.

🧹 Nitpick comments (4)

docs/compiler_internals/letstmt_inline.md (4)

11-22: Clarify current pipeline behavior (no forced inlining).

Given this PR stops forcing Let inlining in LowerAndLegalize, add a short note here that Simplify may still inline under these rules, but the pipeline no longer forces it. This avoids confusion about “when” inlining happens.

15-21: Avoid fragile line-number references; link to source or name symbols instead.

Hardcoding source line numbers will drift. Replace “(lines 315–326)”, “line 357”, and “(lines 168–206)” with:

Symbol names (e.g., CanInlineLetStmt, used_in_buffer_def_, VarUseDefAnalyzer)

A GitHub permalink to the file/anchor

Also applies to: 47-55, 75-79

1-8: Add a brief “As of PR #947” note for traceability.

Add one sentence noting that the pipeline no longer forces Let inlining as of PR #947, and this page documents the Simplify pass rules.

141-144: Use GitHub permalinks for Related Files Replace the plain paths under “Related Files” in docs/compiler_internals/letstmt_inline.md with clickable permalinks to the exact lines in src/transform/simplify.cc (e.g. line 315 for CanInlineLetStmt) and in testing/python/issue/test_tilelang_issue_814.py.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f1aa27e and 9354899.

📒 Files selected for processing (4)

docs/compiler_internals/letstmt_inline.md (1 hunks)
docs/index.md (1 hunks)
examples/fusedmoe/example_fusedmoe_tilelang.py (1 hunks)
examples/fusedmoe/test_example_fusedmoe.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

examples/fusedmoe/test_example_fusedmoe.py (1)

examples/fusedmoe/example_fusedmoe_tilelang.py (1)

main (524-545)

examples/fusedmoe/example_fusedmoe_tilelang.py (1)

examples/warp_specialize/example_warp_specialize_flashmla.py (1)

main (385-399)

🪛 GitHub Actions: CI Test on AMD

examples/fusedmoe/test_example_fusedmoe.py

[error] 1-1: Reformatted files. Please review and stage the changes.

examples/fusedmoe/example_fusedmoe_tilelang.py

[error] 1-1: Reformatted files. Please review and stage the changes.

🪛 GitHub Actions: CI Test on Metal

examples/fusedmoe/test_example_fusedmoe.py

[error] 1-1: Code reformatted by formatter. Changes not staged for commit. Please run 'git add' and commit the changes, or revert formatting if undesired.

examples/fusedmoe/example_fusedmoe_tilelang.py

[error] 1-1: Code reformatted by formatter. Changes not staged for commit. Please run 'git add' and commit the changes, or revert formatting if undesired.

🔇 Additional comments (1)

docs/index.md (1)

38-44: Section addition looks good; verify link resolves.

New COMPILER INTERNALS toctree is consistent with existing style. Please confirm docs/compiler_internals/letstmt_inline.md builds and the link renders in the sidebar.

docs/compiler_internals/letstmt_inline.md

coderabbitai · 2025-10-07T12:52:09Z

docs/compiler_internals/letstmt_inline.md

+If this causes issues (e.g., `A[idx]` being read twice with different values due to the first write), it indicates a potential problem with the inlining heuristic or the code pattern.
+


⚠️ Potential issue | 🟡 Minor

Revise the hazard explanation; current sequence doesn’t demonstrate a read-after-write issue.

In the example, both reads occur before the write to A[idx], so “due to the first write” isn’t accurate. Consider a case where A[idx] is written between two reads to illustrate why duplicated loads from inlining can change behavior, or rephrase to note duplicate loads impact performance but not semantics here.

-If this causes issues (e.g., `A[idx]` being read twice with different values due to the first write)... +Be aware that inlining duplicates loads. If a write to the same location occurs between two uses, re‑evaluation can observe updated memory and change behavior. In this example, both reads occur before the write, so semantics are unchanged, though duplicated loads may affect performance.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

If this causes issues (e.g., `A[idx]` being read twice with different values due to the first write), it indicates a potential problem with the inlining heuristic or the code pattern.

Be aware that inlining duplicates loads. If a write to the same location occurs between two uses, re-evaluation can observe updated memory and change behavior. In this example, both reads occur before the write, so semantics are unchanged, though duplicated loads may affect performance.

🤖 Prompt for AI Agents

In docs/compiler_internals/letstmt_inline.md around lines 126 to 127, the hazard explanation is incorrect because the shown sequence has both reads before the write so it does not demonstrate a read-after-write semantic change; either replace the example with one where A[idx] is written between two reads (so the first write can change the value seen by the second read) to illustrate the semantic hazard, or rephrase the sentence to say that duplicated loads in this specific sequence only affect performance (not semantics) and clarify when duplicated loads can change behavior (i.e., when a write to the same location occurs between the reads).

coderabbitai · 2025-10-07T12:52:10Z

examples/fusedmoe/example_fusedmoe_tilelang.py

+def main(d_hidden=7168, d_expert=2048, n_routed_experts=8, n_shared_experts=1, n_experts_per_token=4, batch_size=1, seq_len=8192):
    config = {
-        "dhidden": 7168,
-        "dexpert": 2048,
-        "nroutedexperts": 8,
-        "nsharedexperts": 1,
-        "nexpertspertoken": 4,
-        "bs": 1,
-        "seqlen": 8192,
+        "dhidden": d_hidden,
+        "dexpert": d_expert,
+        "nroutedexperts": n_routed_experts,
+        "nsharedexperts": n_shared_experts,
+        "nexpertspertoken": n_experts_per_token,
+        "bs": batch_size,
+        "seqlen": seq_len,
        "seed": 81394
    }


⚠️ Potential issue | 🔴 Critical

Critical: Config dictionary keys don't match usage throughout the codebase.

The config dictionary uses lowercase keys without underscores ("dhidden", "dexpert", "nroutedexperts", etc.), but the Expert, MoEGate, and MoE classes expect keys with underscores ("d_hidden", "d_expert", "n_routed_experts", etc.). This will cause KeyError exceptions at runtime.

For example:

Line 282: self.d_hidden: int = config["d_hidden"]

Line 283: self.d_expert: int = config["d_expert"]

Line 300: self.top_k: int = config["n_experts_per_token"]

Line 301: self.num_experts: int = config["n_routed_experts"]

Apply this diff to fix the config keys:

def main(d_hidden=7168, d_expert=2048, n_routed_experts=8, n_shared_experts=1, n_experts_per_token=4, batch_size=1, seq_len=8192): config = { - "dhidden": d_hidden, - "dexpert": d_expert, - "nroutedexperts": n_routed_experts, - "nsharedexperts": n_shared_experts, - "nexpertspertoken": n_experts_per_token, - "bs": batch_size, - "seqlen": seq_len, + "d_hidden": d_hidden, + "d_expert": d_expert, + "n_routed_experts": n_routed_experts, + "n_shared_experts": n_shared_experts, + "n_experts_per_token": n_experts_per_token, + "batch_size": batch_size, + "seq_len": seq_len, "seed": 81394 }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def main(d_hidden=7168, d_expert=2048, n_routed_experts=8, n_shared_experts=1, n_experts_per_token=4, batch_size=1, seq_len=8192):

config = {

"dhidden": 7168,

"dexpert": 2048,

"nroutedexperts": 8,

"nsharedexperts": 1,

"nexpertspertoken": 4,

"bs": 1,

"seqlen": 8192,

"dhidden": d_hidden,

"dexpert": d_expert,

"nroutedexperts": n_routed_experts,

"nsharedexperts": n_shared_experts,

"nexpertspertoken": n_experts_per_token,

"bs": batch_size,

"seqlen": seq_len,

"seed": 81394

}

def main(d_hidden=7168, d_expert=2048, n_routed_experts=8, n_shared_experts=1, n_experts_per_token=4, batch_size=1, seq_len=8192):

config = {

"d_hidden": d_hidden,

"d_expert": d_expert,

"n_routed_experts": n_routed_experts,

"n_shared_experts": n_shared_experts,

"n_experts_per_token": n_experts_per_token,

"batch_size": batch_size,

"seq_len": seq_len,

"seed": 81394

}

🤖 Prompt for AI Agents

In examples/fusedmoe/example_fusedmoe_tilelang.py around lines 524 to 534, the config dict uses keys like "dhidden", "dexpert", "nroutedexperts", etc., but the rest of the code expects underscored keys (e.g., "d_hidden", "d_expert", "n_routed_experts", "n_shared_experts", "n_experts_per_token", "bs", "seqlen", "seed"); update the dict keys to match those expected names (replace "dhidden"->"d_hidden", "dexpert"->"d_expert", "nroutedexperts"->"n_routed_experts", "nsharedexperts"->"n_shared_experts", "nexpertspertoken"->"n_experts_per_token") and keep "bs", "seqlen", and "seed" as-is so the Expert/MoEGate/MoE classes access config without KeyError.

examples/fusedmoe/test_example_fusedmoe.py

- Changed the condition for buffer access from less than (<) to less than or equal to (<=) to allow access at the same scope level. - Adjusted the logic for determining the access level when touching buffers to ensure correct handling of scope levels.

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9354899 and f372812.

📒 Files selected for processing (1)

src/transform/merge_shared_memory_allocations.cc (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

src/transform/merge_shared_memory_allocations.cc (1)

src/transform/storage_rewrite.cc (6)

buf (247-254)

buf (247-247)

buf (505-522)

buf (505-505)

buf (1715-1740)

buf (1715-1715)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: format-check
GitHub Check: format-check
GitHub Check: bot-task

🔇 Additional comments (2)

src/transform/merge_shared_memory_allocations.cc (2)

219-220: LGTM!

The relaxation of the boundary check is consistent with the fix in BufferLoadNode and correctly handles buffer variable references at the same scope level. The comment clearly explains the change.

197-201: LGTM with consistency check The update in merge_shared_memory_allocations.cc is correct. storage_rewrite.cc still uses strict ICHECK_LT(it->second.level, scope_.size()). Confirm whether it should also be relaxed to ICHECK_LE and use the access_level logic to prevent out-of-bounds.

coderabbitai · 2025-10-07T17:02:27Z

src/transform/merge_shared_memory_allocations.cc

+          // When accessing at the same level, use that level
+          size_t access_level = std::min(it->second.level, scope_.size() - 1);
+          scope_[access_level].touched.push_back(buf);


⚠️ Potential issue | 🟡 Minor

Guard against unsigned underflow in access_level calculation.

The expression scope_.size() - 1 can underflow when scope_.size() == 0 (wrapping to SIZE_MAX for size_t). While the ICHECK_LE guard makes this scenario unlikely in practice, relying on wraparound behavior is a code smell that could mask bugs or cause issues in unexpected edge cases.

Consider using a safer calculation that explicitly handles the zero case:

if (enable_aggressive_merge) { scope_[scope_.size() - 1].touched.push_back(buf); } else { // When accessing at the same level, use that level - size_t access_level = std::min(it->second.level, scope_.size() - 1); + size_t access_level = (scope_.size() == 0) ? 0 : std::min(it->second.level, scope_.size() - 1); scope_[access_level].touched.push_back(buf); }

Apply the same fix to both locations (lines 207-209 and lines 226-228).

Alternatively, add a runtime check:

// When accessing at the same level, use that level + ICHECK_GT(scope_.size(), 0) << "Scope stack should not be empty when accessing buffers"; size_t access_level = std::min(it->second.level, scope_.size() - 1);

Also applies to: 226-228

🤖 Prompt for AI Agents

In src/transform/merge_shared_memory_allocations.cc around lines 207-209 and 226-228, the computation scope_.size() - 1 can underflow when scope_.size() == 0; change the calculation to guard the zero case (e.g. compute size_t max_index = scope_.empty() ? 0 : scope_.size() - 1; then use std::min(it->second.level, max_index) for access_level) and replace both occurrences; optionally keep or add a runtime check/ICHECK that scope_ is non-empty where logically required to catch unexpected states.

- inline let-bound BufferLoad nodes via resolver helpers and structured return - remap layouts/buffers using original data vars and only rewrite when needed - update pipeline planner to understand let-bound address_of buffers - document the new inline behaviour in docs/let_inline_fix.md

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

src/transform/lower_tile_op.cc (1)

8-8: Remove unused #include <tvm/tir/op.h> (line 8). No tvm::tir operator functions or symbols from this header are referenced in lower_tile_op.cc.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 597d8b1 and 36736f3.

📒 Files selected for processing (2)

src/transform/lower_tile_op.cc (10 hunks)
src/transform/pipeline_planning.cc (3 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

src/transform/pipeline_planning.cc (1)

src/transform/inject_pipeline.cc (20)

buffer_region (113-132)

buffer_region (114-114)

op (166-182)

op (166-166)

op (184-197)

op (184-184)

op (199-212)

op (199-199)

op (214-220)

op (214-214)

op (840-1020)

op (840-840)

op (1022-1039)

op (1022-1022)

op (1041-1058)

op (1041-1041)

buffer (398-453)

buffer (398-399)

buffer (461-470)

buffer (461-461)

src/transform/lower_tile_op.cc (3)

src/transform/layout_inference.cc (4)

expr (389-407)

expr (389-389)

buffer (409-415)

buffer (409-409)

src/transform/inject_pipeline.cc (2)

call (134-164)

call (134-135)

src/transform/pipeline_planning.cc (14)

op (48-57)

op (48-48)

op (87-143)

op (87-87)

op (166-184)

op (166-166)

op (186-204)

op (186-186)

op (206-260)

op (206-206)

op (262-272)

op (262-262)

op (703-712)

op (703-703)

🔇 Additional comments (10)

src/transform/pipeline_planning.cc (4)

209-222: LGTM! Enhanced buffer region resolution.

The refactored address_of handling now properly resolves both BufferLoad and Var cases, making buffer region tracking more robust. The defensive programming with the early check if (buffer_region.defined()) prevents unnecessary additions.

413-449: LGTM! Robust unwrapping of pipeline body wrappers.

The new unwrapping loop properly handles LetStmt and IfThenElse (without else) wrappers around the pipeline body, aligning with the PR objective of not forcing inline let statements. The error message accurately describes the supported wrapper types.

455-455: Verify that pipeline_body_root is used consistently.

Ensure that downstream processing uses pipeline_body_seq for iterating the pipeline stages, as pipeline_body_root may still contain wrappers. Based on line 458, this appears correct.

5-5: Retain the <tvm/tir/op.h> include – it provides definitions for tir::builtin::if_then_else() and tir::attr constants.

src/transform/lower_tile_op.cc (6)

323-426: LGTM! Well-structured refactoring with AccessPtrResult.

The introduction of AccessPtrResult struct and the refactored flow clearly separates cases where rewriting occurs from cases where the original expression is preserved. The use of helper functions (ResolveBufferLoad, FindRemapBuffer, FindLayout) improves modularity and readability.

428-440: LGTM! ResolveBufferLoad properly handles let-bound expressions.

This helper correctly resolves variables to their let-bound BufferLoad expressions when available, enabling support for non-inlined let statements as per the PR objective.

442-478: LGTM! Comprehensive buffer and layout resolution.

Both FindRemapBuffer and FindLayout implement thorough lookup strategies:

Direct map lookup

Lookup via buffer_map_ by data var

Fallback to same_as comparisons by data or name

This handles various buffer aliasing scenarios robustly.

502-543: LGTM! Consistent use of ResolveBufferLoad in PTX paths.

The ptx_ldmatrix and mma_store handling now properly resolve let-bound buffer loads before transformation, ensuring correct remapping when buffers are wrapped in let statements.

594-616: LGTM! Proper LetStmt handling with scope management.

The new VisitStmt_ for LetStmtNode:

Records BufferLoad values in let_bindings_ for later resolution

Binds pure expressions in the analyzer for constant folding

Properly erases bindings after visiting the body to maintain correct scope

This implementation correctly supports non-inlined let statements as intended by the PR.

708-709: LGTM! let_bindings_ properly scoped.

The let_bindings_ member is correctly managed:

Added in VisitStmt_(LetStmtNode) when visiting the let statement

Erased after visiting the body to maintain lexical scope

Used by ResolveBufferLoad for resolution

No memory leaks or scope issues detected.

…_814

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 36736f3 and 6f115d3.

📒 Files selected for processing (1)

src/transform/multi_version_buffer_rewriter.cc (5 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-09-12T09:47:46.474Z

Learnt from: kurisu6912
PR: tile-ai/tilelang#794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

src/transform/multi_version_buffer_rewriter.cc

🧬 Code graph analysis (1)

src/transform/multi_version_buffer_rewriter.cc (1)

src/transform/warp_specialized_rewriter.cc (36)

seq_stmt (917-990)

seq_stmt (918-919)

seq_stmt (1033-1090)

seq_stmt (1033-1033)

op (38-43)

op (38-38)

op (73-83)

op (73-73)

op (85-93)

op (85-85)

op (95-100)

op (95-95)

op (102-110)

op (102-102)

op (134-146)

op (134-134)

op (148-177)

op (148-148)

op (179-189)

op (179-179)

stmt (51-62)

stmt (51-51)

stmt (121-124)

stmt (121-121)

stmt (126-130)

stmt (126-126)

stmt (132-132)

stmt (132-132)

stmt (230-230)

stmt (230-230)

stmt (271-271)

stmt (271-271)

stmt (302-306)

stmt (302-302)

stmt (333-339)

stmt (333-334)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: build-test-metal
GitHub Check: build-test-amd

coderabbitai · 2025-10-10T02:29:24Z

src/transform/multi_version_buffer_rewriter.cc

+      bool in_scope = false;
+      for (const StmtNode *ancestor : stmt_stack_) {
+        if (ancestor == lca) {
+          in_scope = true;
+          break;
+        }
+      }
+      if (!in_scope)
+        continue;
+      // Only double-buffer shared allocations; locals do not need versioning.
+      auto scope = buffer.scope();
+      if (!(scope == "shared" || scope == "shared.dyn"))
+        continue;
+      if (seen.insert(buffer.get()).second) {
        scoped_buffers.push_back(buffer);
+      }
+    }
+    for (auto it = stmt_stack_.rbegin(); it != stmt_stack_.rend(); ++it) {
+      if (!(*it)->IsInstance<BlockNode>())
+        continue;
+      const auto *block = static_cast<const BlockNode *>(*it);
+      auto map_it = block_alloc_buffers_.find(block);
+      if (map_it == block_alloc_buffers_.end())
+        continue;
+      for (const Buffer &buffer : map_it->second) {
+        auto scope = buffer.scope();
+        if (!(scope == "shared" || scope == "shared.dyn"))
+          continue;
+        if (seen.insert(buffer.get()).second) {
+          scoped_buffers.push_back(buffer);
+        }
+      }
    }


⚠️ Potential issue | 🔴 Critical

Fix LCA scope tracking for Let/If wrappers.

We only push For/Block nodes into stmt_stack_, so when DetectBufferAccessLCA says a buffer’s LCA is a LetStmt or IfThenElse (exactly the wrappers we now keep instead of inlining), the ancestor scan never sees it. in_scope stays false, the buffer drops out of scoped_buffers, and we silently skip versioning—regressing double buffering for the new “non-inlined let” pipelines. Please push those wrapper nodes before visiting their bodies.

+ Stmt VisitStmt_(const LetStmtNode *op) final { + stmt_stack_.push_back(op); + Stmt stmt = StmtExprMutator::VisitStmt_(op); + stmt_stack_.pop_back(); + return stmt; + } + + Stmt VisitStmt_(const IfThenElseNode *op) final { + stmt_stack_.push_back(op); + Stmt stmt = StmtExprMutator::VisitStmt_(op); + stmt_stack_.pop_back(); + return stmt; + }

🤖 Prompt for AI Agents

In src/transform/multi_version_buffer_rewriter.cc around lines 301 to 333, the ancestor scan for LCAs always checks stmt_stack_ which only contains For/Block nodes, so LCAs that are LetStmt or IfThenElse wrappers are missed causing buffers to be excluded from scoped_buffers; update the traversal to push LetStmt and IfThenElse nodes onto stmt_stack_ before visiting their bodies (and pop them after visiting) so DetectBufferAccessLCA finds wrapper LCAs and buffers get properly considered for double-buffering.

LeiWang1999 · 2025-10-10T02:33:47Z

PR Summary via GPT-5 Pro – LetStmt-Aware Pipeline Fixes (Issue #814)

Overview

Addressed correctness regressions that appear when LetInline is disabled for CUDA pipelines.
Hardened multiple lowering passes so they can work with LetStmt, AttrStmt, and single-branch IfThenElse wrappers that now remain in the IR.
Added documentation and regression tests to capture the behaviour.

Compiler Pass Updates

Remove unconditional LetInline (tilelang/engine/phase.py:88): the pass was masking issues in downstream transforms; we now rely on updated passes that explicitly tolerate LetStmt wrappers.
Pipeline planning (src/transform/pipeline_planning.cc:410):
- Walks through LetStmt/single-branch IfThenElse wrappers to recover the underlying SeqStmt.
- Records shared buffers defined in ancestor blocks and handles address_of calls whose operand is a Var, ensuring dependency analysis still sees shared memory accesses.
Software-pipeline injection (src/transform/inject_pipeline.cc:846):
- Uses the same unwrapping logic and re-applies the stripped wrappers after the pipeline is rewritten so attributes, lets, and guards survive the transform intact.
Multi-version buffer rewriter (src/transform/multi_version_buffer_rewriter.cc:141):
- Tracks ancestor statements and block-local allocations to find shared buffers even when their LCA is outside the loop body.
- Treats Role::kBoth stages as both producers and consumers and falls back to “first write / last read” indices to decide when double-buffering is required.
- Restricts multi-versioning to shared / shared.dyn scopes so locals are not duplicated accidentally and handles let/if wrappers around the pipeline body.
Shared-memory merger (src/transform/merge_shared_memory_allocations.cc:194):
- Allows reads in the same scope level where the buffer is allocated, preventing false negatives in the presence of let-wrapped expressions.
Tile op lowering (src/transform/lower_tile_op.cc:330):
- Resolves let-bound BufferLoad expressions before remapping, so address_of / tvm_access_ptr seen after LetInline removal still convert to the permuted layouts required by Hopper intrinsics.
- Caches let bindings during visitation to avoid repeated re-resolution.
Docs (docs/compiler_internals/letstmt_inline.md, docs/index.md:35):
- Document how and why LetStmt inlining works, providing context for the removal in the lowering pipeline.

Examples & Tests

FusedMoE example (examples/fusedmoe/example_fusedmoe_tilelang.py:521):
- Parameterized the demo so CI can run a reduced problem size, keeping coverage while respecting resource limits.
- Updated test harness (examples/fusedmoe/test_example_fusedmoe.py:3) to pass the smaller configuration.
Issue regression test (testing/python/issue/test_tilelang_issue_814.py):
- Captures the scenario where a let-bound temporary feeds both load and store paths, ensuring future changes do not reintroduce silent inlining bugs.

Verification

python examples/deepseek_deepgemm/example_deepgemm_fp8_2xAcc.py
python -m pytest testing/python/issue/test_tilelang_issue_814.py
python -m pytest examples/fusedmoe/test_example_fusedmoe.py

LeiWang1999 · 2025-10-10T03:41:18Z

have to modify some codes for aws related pass, @chengyupku should take a look.

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (1)

src/transform/multi_version_buffer_rewriter.cc (1)

286-291: Fix LCA scope tracking: push Let/If/Attr onto stmt_stack_.

When LCA is a LetStmt or IfThenElse (now preserved), in-scope detection fails because stmt_stack_ only tracks For/Block. Buffers with wrapper LCAs get skipped for versioning.

Add visitors that push/pop these wrappers, mirroring Block/For.

   Stmt VisitStmt_(const BlockNode *op) final {
     stmt_stack_.push_back(op);
     Stmt stmt = StmtExprMutator::VisitStmt_(op);
     stmt_stack_.pop_back();
     return stmt;
   }

   Stmt VisitStmt_(const ForNode *op) final {
     stmt_stack_.push_back(op);
     ...
     stmt_stack_.pop_back();
     return for_node;
   }
+
+  Stmt VisitStmt_(const LetStmtNode *op) final {
+    stmt_stack_.push_back(op);
+    Stmt stmt = StmtExprMutator::VisitStmt_(op);
+    stmt_stack_.pop_back();
+    return stmt;
+  }
+
+  Stmt VisitStmt_(const IfThenElseNode *op) final {
+    stmt_stack_.push_back(op);
+    Stmt stmt = StmtExprMutator::VisitStmt_(op);
+    stmt_stack_.pop_back();
+    return stmt;
+  }
+
+  Stmt VisitStmt_(const AttrStmtNode *op) final {
+    stmt_stack_.push_back(op);
+    Stmt stmt = StmtExprMutator::VisitStmt_(op);
+    stmt_stack_.pop_back();
+    return stmt;
+  }

Also applies to: 293-406, 474-485

🧹 Nitpick comments (5)

examples/blocksparse_attention/test_example_blocksparse_attention.py (1)

10-11: Consider parameterizing the remaining tests for consistency.

Tests test_block_sparse_attn_triton, test_example_tilelang_block_sparse_attn, test_example_triton_sparse_gqa_decode_varlen_indice, and test_example_triton_sparse_gqa_decode_varlen_mask still call main() without explicit parameters. If their default configurations also result in long CI runtimes, consider applying similar parameterization for consistency and efficiency.

Also applies to: 14-15, 26-31
testing/python/transform/test_tilelang_transform_multi_version_buffer.py (2)
108-135: Good regression test for let-bound producer/consumer.

Covers non-inlined let across pipeline stages and matches expected double buffering (k % 2). Consider adding complementary cases:

Single-branch IfThenElse wrapper around the pipeline body.

AttrStmt wrapper (e.g., pragma) above SeqStmt.

This will guard the additional wrapper paths tightened in the pass.

114-121: Optional: zero‑init accum to avoid undefined reads if executed.

If this TIR is ever run, accum[i] += shared[...] reads undefined initial values. You can add a small init loop before k.
         accum = T.alloc_buffer((8,), "float32", scope="local")
+        for i in T.serial(8):
+            accum[i] = T.float32(0)
         for k in T.serial(4, annotations={"num_stages": T.int32(2)}):
Also applies to: 126-134
src/transform/multi_version_buffer_rewriter.cc (2)
142-173: Consider unwrapping single-branch IfThenElse in collect_stmts.

collect_stmts flattens Let/Attr/Block wrappers but leaves IfThenElse. For trivial guards inserted around stages, flattening the then_case (when else is absent) can stabilize stage detection and access analysis.

Optional improvement: unwrap IfThenElse without else same as Let/Attr.
     if (const auto *let = stmt.as<LetStmtNode>()) {
       collect_stmts(let->body);
       return;
     }
     if (const auto *attr = stmt.as<AttrStmtNode>()) {
       collect_stmts(attr->body);
       return;
     }
+    if (const auto *ite = stmt.as<IfThenElseNode>()) {
+      if (!ite->else_case.defined()) {
+        collect_stmts(ite->then_case);
+        return;
+      }
+    }
434-472: Also rewrite address_of to include version offset (optional).

Only tvm_access_ptr is adjusted. Calls like tir.address_of(buffer[...]) or builtin::tvm_address_of may appear after letting/intrinsic lowering and need the version stride added.

Add a branch similar to tvm_access_ptr to rewrite address_of indices.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c637767 and 22ae8c5.

📒 Files selected for processing (4)

examples/blocksparse_attention/test_example_blocksparse_attention.py (1 hunks)
src/transform/multi_version_buffer_rewriter.cc (6 hunks)
src/transform/simplify.cc (2 hunks)
testing/python/transform/test_tilelang_transform_multi_version_buffer.py (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-09-12T09:47:46.474Z

Learnt from: kurisu6912
PR: tile-ai/tilelang#794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

src/transform/multi_version_buffer_rewriter.cc

🧬 Code graph analysis (4)

src/transform/simplify.cc (2)

src/transform/lower_tile_op.cc (2)

expr (428-440)

expr (428-428)

src/transform/layout_inference.cc (2)

expr (389-407)

expr (389-389)

testing/python/transform/test_tilelang_transform_multi_version_buffer.py (2)

tilelang/language/ast/ir.py (1)

alloc_buffer (441-508)

src/transform/storage_rewrite.cc (1)

i (705-841)

examples/blocksparse_attention/test_example_blocksparse_attention.py (2)

examples/blocksparse_attention/example_tilelang_sparse_gqa_decode_varlen_indice.py (2)

main (173-186)

main (381-468)

examples/blocksparse_attention/example_tilelang_sparse_gqa_decode_varlen_mask.py (2)

main (162-173)

main (371-449)

src/transform/multi_version_buffer_rewriter.cc (1)

src/transform/warp_specialized_rewriter.cc (17)

stmt (51-62)

stmt (51-51)

stmt (121-124)

stmt (121-121)

stmt (126-130)

stmt (126-126)

stmt (132-132)

stmt (132-132)

stmt (230-230)

stmt (230-230)

stmt (271-271)

stmt (271-271)

seq_stmt (917-990)

seq_stmt (918-919)

seq_stmt (1033-1090)

seq_stmt (1033-1033)

marker (1181-1181)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: build-test-metal

🔇 Additional comments (5)

examples/blocksparse_attention/test_example_blocksparse_attention.py (2)

18-19: LGTM! Explicit parameterization reduces CI runtime.

The explicit parameters (batch=1, max_cache_seqlen=2048) reduce the test workload from the defaults (batch=8, max_cache_seqlen=8192), aligning with the PR's goal of parameterizing examples for faster CI runs.

22-23: LGTM! Consistent parameterization with the other test.

The parameters match line 19, ensuring consistency across the modified tests.
src/transform/simplify.cc (3)
8-8: LGTM!

The inclusion of tvm/tir/analysis.h supports the new buffer alias analysis functionality.

15-15: LGTM!

The <optional> header is appropriate for C++17 optional types used in this file.

356-370: Verify the non-unit dimension counting aligns with intended removal criteria.

The current logic counts non-unit dimensions to decide on buffer alias removal. Please confirm:

Should a single-dimension vectorized access (e.g., buffer[ramp(0, 1, 4)] representing a 4-lane vector) trigger removal? The comment mentions "or carries vector lanes" as a separate condition, but the current non_unit_dims > 1 check would NOT remove such cases.

Is the threshold non_unit_dims > 1 (2+ dims) intentional, or should it be >= 2 or some other value?

If single-dimension vectorized accesses should also be removed, consider detecting Ramp nodes explicitly:
     Array<Range> ranges = get_ranges(value);
     if (!ranges.empty()) {
       int non_unit_dims = 0;
+      bool has_ramp = false;
       for (const Range &range : ranges) {
         PrimExpr extent = analyzer_->Simplify(range->extent);
         if (is_const_int(extent, 1) || analyzer_->CanProveEqual(extent, 1)) {
           continue;
         }
+        // Check if this came from a Ramp (vectorized access)
+        if (value.as<BufferLoadNode>()) {
+          auto load = value.as<BufferLoadNode>();
+          for (const PrimExpr &index : load->indices) {
+            if (index.as<RampNode>()) {
+              has_ramp = true;
+            }
+          }
+        }
         ++non_unit_dims;
         if (non_unit_dims > 1) {
           remove_buffer_alias = true;
           break;
         }
       }
+      if (has_ramp) {
+        remove_buffer_alias = true;
+      }
     }

coderabbitai · 2025-10-10T06:41:16Z

src/transform/multi_version_buffer_rewriter.cc

+    const SeqStmtNode *pipeline_body_seq = nullptr;
+    {
+      // Traverse trivial wrappers (let/if) to find the actual SeqStmt body.
+      Stmt current = pipeline_body_root;
+      while (true) {
+        if (const auto *seq_stmt = current.as<SeqStmtNode>()) {
+          pipeline_body_seq = seq_stmt;
+          break;
+        }
+        if (const auto *if_then_else = current.as<IfThenElseNode>()) {
+          ICHECK(!if_then_else->else_case.defined())
+              << "MultiVersionBuffer: Can't handle the body of the loop "
+                 "because the IfThenElse node has an else branch";
+          current = if_then_else->then_case;
+          continue;
+        }
+        if (const auto *let_stmt = current.as<LetStmtNode>()) {
+          current = let_stmt->body;
+          continue;
+        }
+        LOG(FATAL)
+            << "MultiVersionBuffer: Can't handle the body of the loop because "
+            << "it is not a SeqStmt, IfThenElse without else, "
+            << "or LetStmt wrapping them, but got " << current->GetTypeKey();
+      }
+    }


⚠️ Potential issue | 🔴 Critical

Unwrap AttrStmt while locating pipeline body.

The finder unwraps IfThenElse and LetStmt but not AttrStmt. If For->body is an AttrStmt wrapping a SeqStmt (common after LetInline removal), this path LOG(FATAL)s.

Add AttrStmt unwrapping alongside Let/If.

while (true) { if (const auto *seq_stmt = current.as<SeqStmtNode>()) { pipeline_body_seq = seq_stmt; break; } if (const auto *if_then_else = current.as<IfThenElseNode>()) { ICHECK(!if_then_else->else_case.defined()) << "MultiVersionBuffer: Can't handle the body of the loop " "because the IfThenElse node has an else branch"; current = if_then_else->then_case; continue; } + if (const auto *attr_stmt = current.as<AttrStmtNode>()) { + current = attr_stmt->body; + continue; + } if (const auto *let_stmt = current.as<LetStmtNode>()) { current = let_stmt->body; continue; } LOG(FATAL) << "MultiVersionBuffer: Can't handle the body of the loop because " << "it is not a SeqStmt, IfThenElse without else, " - << "or LetStmt wrapping them, but got " << current->GetTypeKey(); + << "or LetStmt/AttrStmt wrapping them, but got " + << current->GetTypeKey(); }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const SeqStmtNode *pipeline_body_seq = nullptr;

{

// Traverse trivial wrappers (let/if) to find the actual SeqStmt body.

Stmt current = pipeline_body_root;

while (true) {

if (const auto *seq_stmt = current.as<SeqStmtNode>()) {

pipeline_body_seq = seq_stmt;

break;

}

if (const auto *if_then_else = current.as<IfThenElseNode>()) {

ICHECK(!if_then_else->else_case.defined())

<< "MultiVersionBuffer: Can't handle the body of the loop "

"because the IfThenElse node has an else branch";

current = if_then_else->then_case;

continue;

}

if (const auto *let_stmt = current.as<LetStmtNode>()) {

current = let_stmt->body;

continue;

}

LOG(FATAL)

<< "MultiVersionBuffer: Can't handle the body of the loop because "

<< "it is not a SeqStmt, IfThenElse without else, "

<< "or LetStmt wrapping them, but got " << current->GetTypeKey();

}

}

const SeqStmtNode *pipeline_body_seq = nullptr;

{

// Traverse trivial wrappers (let/if) to find the actual SeqStmt body.

Stmt current = pipeline_body_root;

while (true) {

if (const auto *seq_stmt = current.as<SeqStmtNode>()) {

pipeline_body_seq = seq_stmt;

break;

}

if (const auto *if_then_else = current.as<IfThenElseNode>()) {

ICHECK(!if_then_else->else_case.defined())

<< "MultiVersionBuffer: Can't handle the body of the loop "

"because the IfThenElse node has an else branch";

current = if_then_else->then_case;

continue;

}

if (const auto *attr_stmt = current.as<AttrStmtNode>()) {

current = attr_stmt->body;

continue;

}

if (const auto *let_stmt = current.as<LetStmtNode>()) {

current = let_stmt->body;

continue;

}

LOG(FATAL)

<< "MultiVersionBuffer: Can't handle the body of the loop because "

<< "it is not a SeqStmt, IfThenElse without else, "

<< "or LetStmt/AttrStmt wrapping them, but got "

<< current->GetTypeKey();

}

}

🤖 Prompt for AI Agents

In src/transform/multi_version_buffer_rewriter.cc around lines 319 to 344, the traversal that unwraps trivial wrappers when locating the SeqStmt body handles IfThenElse and LetStmt but omits AttrStmt, causing a fatal error when the loop body is an AttrStmt-wrapped SeqStmt; add a branch that recognizes AttrStmtNode and sets current = attr_stmt->body (continuing the loop) alongside the existing Let/If handling so AttrStmt wrappers are unwrapped before the SeqStmt check.

coderabbitai · 2025-10-10T06:41:17Z

src/transform/simplify.cc

+    // TileLang emits aliases like `X_shared = buffer[0:128, 0:32]` to annotate
+    // fragment types. TVM currently reinterprets vectorized/shared accesses as
+    // Let-bound BufferLoad/BufferRegion nodes. If these bindings survive, later
+    // passes (Layout rewrite, FlattenBuffer) substitute them with vector lanes
+    // that our layout can't handle. Force-inline (by dropping the let) whenever
+    // the alias spans more than 2 dims or carries vector lanes.


⚠️ Potential issue | 🟠 Major

Clarify the comment to match implementation logic.

The comment states removal triggers when the alias "spans more than 2 dims or carries vector lanes", but the implementation at line 365 checks non_unit_dims > 1, which means 2+ non-unit dimensions. This creates two mismatches:

"More than 2 dims" typically means 3+, but the code triggers on 2+.

The "or carries vector lanes" suggests a vectorized single-dimension access should also trigger removal, but it won't because a single Ramp index contributes only 1 non-unit dimension.

Apply this diff to align the comment with the actual logic:

- // TileLang emits aliases like `X_shared = buffer[0:128, 0:32]` to annotate - // fragment types. TVM currently reinterprets vectorized/shared accesses as - // Let-bound BufferLoad/BufferRegion nodes. If these bindings survive, later - // passes (Layout rewrite, FlattenBuffer) substitute them with vector lanes - // that our layout can't handle. Force-inline (by dropping the let) whenever - // the alias spans more than 2 dims or carries vector lanes. + // TileLang emits aliases like `X_shared = buffer[0:128, 0:32]` to annotate + // fragment types. TVM currently reinterprets vectorized/shared accesses as + // Let-bound BufferLoad/BufferRegion nodes. If these bindings survive, later + // passes (Layout rewrite, FlattenBuffer) substitute them with vector lanes + // that our layout can't handle. Force-inline (by dropping the let) whenever + // the alias has 2 or more non-unit dimensions (counting vector lanes).

Alternatively, if the intent is to also catch single-dimension vectorized accesses, update the logic:

if (!ranges.empty()) { int non_unit_dims = 0; + bool has_vector_lanes = false; for (const Range &range : ranges) { PrimExpr extent = analyzer_->Simplify(range->extent); if (is_const_int(extent, 1) || analyzer_->CanProveEqual(extent, 1)) { continue; } + if (is_const_int(extent) && *as_const_int(extent) > 1) { + has_vector_lanes = true; + } ++non_unit_dims; if (non_unit_dims > 1) { remove_buffer_alias = true; break; } } + if (has_vector_lanes) { + remove_buffer_alias = true; + } }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// TileLang emits aliases like `X_shared = buffer[0:128, 0:32]` to annotate

// fragment types. TVM currently reinterprets vectorized/shared accesses as

// Let-bound BufferLoad/BufferRegion nodes. If these bindings survive, later

// passes (Layout rewrite, FlattenBuffer) substitute them with vector lanes

// that our layout can't handle. Force-inline (by dropping the let) whenever

// the alias spans more than 2 dims or carries vector lanes.

// TileLang emits aliases like `X_shared = buffer[0:128, 0:32]` to annotate

// fragment types. TVM currently reinterprets vectorized/shared accesses as

// Let-bound BufferLoad/BufferRegion nodes. If these bindings survive, later

// passes (Layout rewrite, FlattenBuffer) substitute them with vector lanes

// that our layout can't handle. Force-inline (by dropping the let) whenever

// the alias has 2 or more non-unit dimensions (counting vector lanes).

coderabbitai · 2025-10-10T06:41:17Z

src/transform/simplify.cc

+    if (remove_buffer_alias) {
+      Stmt body = this->VisitStmt(op->body);
+      bool used = UsesVar(
+          body, [&](const VarNode *var) { return var == op->var.get(); });
+      ICHECK(!used) << "Let binding of BufferLoad is expected to be unused "
+                       "before removal "
+                    << op->var << " : " << op->value << " .";
+      return body;
+    }


⚠️ Potential issue | 🔴 Critical

ICHECK could crash on valid code; consider graceful fallback.

The ICHECK at line 375 will crash the compiler if the let-bound variable is used in the body. While the comment states "expected to be unused", this expectation may not always hold—especially during incremental refactoring or in edge cases not covered by tests. A compiler crash is a poor user experience.

Consider one of these alternatives:

Option 1 (preferred): Fall back to normal inlining if the variable is used:

if (remove_buffer_alias) { Stmt body = this->VisitStmt(op->body); bool used = UsesVar( body, [&](const VarNode *var) { return var == op->var.get(); }); - ICHECK(!used) << "Let binding of BufferLoad is expected to be unused " - "before removal " - << op->var << " : " << op->value << " ."; - return body; + if (!used) { + return body; + } + // Variable is used; fall back to normal inlining logic below + remove_buffer_alias = false; }

Option 2: Perform actual inlining (substitute the variable) instead of just removal:

if (remove_buffer_alias) { - Stmt body = this->VisitStmt(op->body); - bool used = UsesVar( - body, [&](const VarNode *var) { return var == op->var.get(); }); - ICHECK(!used) << "Let binding of BufferLoad is expected to be unused " - "before removal " - << op->var << " : " << op->value << " ."; - return body; + // Inline the buffer alias by substituting it in the body + analyzer_->Bind(op->var, value); + Stmt body = this->VisitStmt(op->body); + return body; }

Option 3: If the ICHECK is truly an invariant, add a comment explaining why the variable is guaranteed to be unused (e.g., due to upstream pass behavior).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if (remove_buffer_alias) {

Stmt body = this->VisitStmt(op->body);

bool used = UsesVar(

body, [&](const VarNode *var) { return var == op->var.get(); });

ICHECK(!used) << "Let binding of BufferLoad is expected to be unused "

"before removal "

<< op->var << " : " << op->value << " .";

return body;

}

if (remove_buffer_alias) {

Stmt body = this->VisitStmt(op->body);

bool used = UsesVar(

body, [&](const VarNode *var) { return var == op->var.get(); });

if (!used) {

return body;

}

// Variable is used; fall back to normal inlining logic below

remove_buffer_alias = false;

}

🤖 Prompt for AI Agents

In src/transform/simplify.cc around lines 371-379, replace the ICHECK that aborts when the let-bound variable is actually used with a graceful fallback: detect when used==true and in that case do not crash but return a preserved/updated Let binding (e.g., reconstruct the original Let with the visited value and visited body or simply return a let node built from op->var, the visited value and the visited body) so the pass continues; only remove the let when used==false. Ensure no crash occurs and the original behavior (removal when unused) is kept.

LeiWang1999 · 2025-10-10T07:21:19Z

ci has failed due to limited storage, but tests and examples have been passed in my local environment.

- register the new config in builtin headers/registration - add helper to pipeline enabling LetInline based on pass context - document LetStmt inlining controls and usage

* remove debug print * Remove inline let expressions from the LowerAndLegalize function in phase.py * add test * Update sparse MLA examples to support SKV adjustment and correctness checks - Changed SKV parameter from 32768 to 8192 in sparse MLA backward and forward tests. - Added check_correctness parameter to test functions for validation of outputs. - Updated test cases to reflect new SKV values and correctness checks. * reduce test shape * Update documentation structure and refactor main function parameters in example_fusedmoe_tilelang.py - Added a new section for compiler internals in the documentation. - Refactored the main function in example_fusedmoe_tilelang.py to accept parameters for hidden dimensions, expert configurations, and batch/sequence sizes, improving flexibility and readability. * Update buffer access checks in merge_shared_memory_allocations.cc - Changed the condition for buffer access from less than (<) to less than or equal to (<=) to allow access at the same scope level. - Adjusted the logic for determining the access level when touching buffers to ensure correct handling of scope levels. * lint fix * Support pipeline with LetStmt * lint fix * • Fix LowerTileOp let handling to avoid LetInline dependency - inline let-bound BufferLoad nodes via resolver helpers and structured return - remap layouts/buffers using original data vars and only rewrite when needed - update pipeline planner to understand let-bound address_of buffers - document the new inline behaviour in docs/let_inline_fix.md * fix for wgmma pipeline with let binding * lint fix * test fix * reduce smem usage. * let binding enhancement * fix for dpgm * fix simplify * lint fix * use tilelang.Simplify instead of tir.Simplify * • Add TL_FORCE_LET_INLINE pass config and gate eager LetInline usage - register the new config in builtin headers/registration - add helper to pipeline enabling LetInline based on pass context - document LetStmt inlining controls and usage

LeiWang1999 added 3 commits October 7, 2025 15:17

remove debug print

5d6e821

Remove inline let expressions from the LowerAndLegalize function in p…

81f1cae

…hase.py

add test

abc2f8c

coderabbitai bot reviewed Oct 7, 2025

View reviewed changes

LeiWang1999 added 3 commits October 7, 2025 16:53

reduce test shape

6efbef8

coderabbitai bot reviewed Oct 7, 2025

View reviewed changes

lint fix

ede05d3

LeiWang1999 mentioned this pull request Oct 9, 2025

[Bug] Incorrect Inlining of Local Variables in TileLang #814

Closed

LeiWang1999 added 3 commits October 9, 2025 13:46

Support pipeline with LetStmt

cc3138a

lint fix

597d8b1

coderabbitai bot reviewed Oct 9, 2025

View reviewed changes

LeiWang1999 added 3 commits October 10, 2025 09:25

Merge branch 'main' of https://github.com/tile-ai/tilelang into issue…

0da83a8

…_814

fix for wgmma pipeline with let binding

d0648e5

lint fix

6f115d3

coderabbitai bot reviewed Oct 10, 2025

View reviewed changes

test fix

04d66d6

reduce smem usage.

afc668d

LeiWang1999 requested a review from chengyupku October 10, 2025 03:41

LeiWang1999 added 2 commits October 10, 2025 12:34

let binding enhancement

c637767

fix for dpgm

531a3ae

LeiWang1999 added 2 commits October 10, 2025 14:34

fix simplify

4fa28c9

lint fix

22ae8c5

coderabbitai bot reviewed Oct 10, 2025

View reviewed changes

use tilelang.Simplify instead of tir.Simplify

9ee3540

• Add TL_FORCE_LET_INLINE pass config and gate eager LetInline usage

2311fc7

- register the new config in builtin headers/registration - add helper to pipeline enabling LetInline based on pass context - document LetStmt inlining controls and usage

LeiWang1999 merged commit f8ae600 into tile-ai:main Oct 10, 2025
4 of 7 checks passed

coderabbitai bot mentioned this pull request Oct 11, 2025

[Refactor] Refactor Pass InjectFenceProxy and expose some warp group primitives in frontend #977

Merged

This was referenced Oct 24, 2025

[Language] Initial version of tilelang frontend v2 #1120

Merged

[BugFix] alloc_var init failed to handle complex expression #1144

Merged

This was referenced Nov 12, 2025

[Fix] Fix a type that make wrong T.macro backtrace #1234

Merged

[Language] Add type stubs for tir op #1239

Merged

This was referenced Nov 21, 2025

[Feat] Add missing support for uint32x2, add unsigned implicit cast in bitwise op, add T.Ref as macro annotation #1302

Closed

[Fix] Remove unused let_bindings_ in CodeGenC to fix #1300 #1305

Merged

[Fix] Fix frame scope error in T.macro #1308

Merged

		If this causes issues (e.g., `A[idx]` being read twice with different values due to the first write), it indicates a potential problem with the inlining heuristic or the code pattern.

	If this causes issues (e.g., `A[idx]` being read twice with different values due to the first write), it indicates a potential problem with the inlining heuristic or the code pattern.
	Be aware that inlining duplicates loads. If a write to the same location occurs between two uses, re-evaluation can observe updated memory and change behavior. In this example, both reads occur before the write, so semantics are unchanged, though duplicated loads may affect performance.

[Bugfix] Do not force inline let stmt #947

[Bugfix] Do not force inline let stmt #947

Uh oh!

Conversation

LeiWang1999 commented Oct 7, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Pre-merge checks and finishing touches

Uh oh!

github-actions bot commented Oct 7, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

LeiWang1999 commented Oct 7, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Oct 7, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 7, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 7, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 10, 2025

Choose a reason for hiding this comment

Uh oh!

LeiWang1999 commented Oct 10, 2025

PR Summary via GPT-5 Pro – LetStmt-Aware Pipeline Fixes (Issue #814)

Overview

Compiler Pass Updates

Examples & Tests

Verification

Uh oh!

LeiWang1999 commented Oct 10, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 10, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 10, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 10, 2025

Choose a reason for hiding this comment

Uh oh!

LeiWang1999 commented Oct 10, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

LeiWang1999 commented Oct 7, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 7, 2025 •

edited

Loading