[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group primitives in frontend #977

LeiWang1999 · 2025-10-11T06:04:03Z

This pull request introduces a new TIR pass tl.InjectFenceProxy for NVIDIA Hopper (SM90+) GPUs and adds support for new warpgroup synchronization intrinsics. The main goal is to ensure correct ordering between generic and asynchronous memory operations by automatically injecting fence.proxy.async instructions and handling TMA store synchronization. The pass is now documented, and the codebase has been refactored for clarity and extensibility. Additionally, new intrinsics for warpgroup synchronization are registered and lowered to CUDA code.

TIR Fence Proxy Pass and Synchronization Improvements

Added a new TIR pass tl.InjectFenceProxy to automatically insert fence.proxy.async instructions when control flow switches from generic memory operations to asynchronous proxy operations, preventing race conditions and undefined behavior on NVIDIA Hopper (SM90+). The pass is conservative and treats unknown extern calls as async.
Refactored the implementation of the fence proxy injection pass for clarity and extensibility: replaced the old ProxyMarker and InjectFenceProxy classes with new ProxyKind tracking and a more robust analysis of IR nodes, supporting structured control flow and easy extension for new intrinsics.
Added automatic lowering of tma_store intrinsics into the required arrive/wait handshake to ensure proper synchronization for TMA stores.
Documented the new pass in docs/compiler_internals/inject_fence_proxy.md, explaining why fences are needed, what the pass does, coverage of intrinsics, usage, and extension instructions. [1] [2]

Warpgroup Synchronization Intrinsics

Registered new TileLang built-in ops: warpgroup_arrive, warpgroup_commit_batch, and warpgroup_wait in src/op/builtin.cc and documented them in src/op/builtin.h. These ops enable explicit warpgroup synchronization for WGMMA sequences. [1] [2]
Lowered the new warpgroup intrinsics to CUDA code using cute library calls in src/tl_templates/cuda/intrin.h and added lowering logic to the CUDA codegen in src/target/codegen_cuda.cc. [1] [2]

Miscellaneous

Minor refactoring in pipeline injection logic for clarity and correctness in src/transform/inject_pipeline.cc. [1] [2]

Summary by CodeRabbit

New Features
- Added warpgroup synchronization intrinsics (arrive, commit batch, wait) for WGMMA workflows.
- Automatic fence injection to ensure GPU proxy ordering across generic↔async transitions; CUDA emission updated to emit new warpgroup ops.
Refactor
- Simplified pipeline block handling for more predictable injection behavior.
- WGMMA macro updated to include warpgroup synchronization steps.
Documentation
- New detailed guide on the fence proxy pass and linked in docs.
Tests
- Added tests covering fence injection, TMA sync, WGMMA sequencing; updated example test parameters.

- annotate proxy fence injector with context comments for async/generic detection - add compiler internals doc covering the pass mechanics and link it in docs index - repair fence proxy test by fixing descriptor init usage and fence counter logic

github-actions · 2025-10-11T06:04:13Z

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai · 2025-10-11T06:04:14Z

Caution

Review failed

The pull request is closed.

Walkthrough

Adds an InjectFenceProxy transform (including TMA store arrive/wait lowering and proxy-kind tracking), introduces warpgroup synchronization intrinsics and codegen, updates WGMMA macro to use them, alters pipeline child handling, adds docs/tests, and adds debug logging around the pass.

Changes

Cohort / File(s)	Summary
Docs: InjectFenceProxy `docs/compiler_internals/inject_fence_proxy.md`, `docs/index.md`	New detailed doc for the InjectFenceProxy pass and added link in docs index; fixed toctree trailing marker.
Warpgroup intrinsics API + Codegen `src/op/builtin.h`, `src/op/builtin.cc`, `src/tl_templates/cuda/intrin.h`, `src/target/codegen_cuda.cc`, `tilelang/language/builtin.py`	Add `tl.warpgroup_arrive`, `tl.warpgroup_commit_batch`, `tl.warpgroup_wait` (API, TL device wrappers, and CUDA codegen emission); Python wrappers added.
Fence/proxy transform `src/transform/inject_fence_proxy.cc`, `testing/python/transform/test_tilelang_transform_inject_fence_proxy.py`	Replace ProxyMarker with ProxyKind tracking; inject `fence.proxy.async` on generic→async transitions; add TMAStoreSyncInjector to split `tma_store` into arrive/wait; convert passes to Apply API; add tests for fence behavior and TMA/WGMMA ordering.
Pipeline injector tweak `src/transform/inject_pipeline.cc`	For-nodes no longer unwrap nested SeqStmt children; always add original child Stmt to pipeline.
Engine debug logging `tilelang/engine/phase.py`	Add prints logging IRModule before and after InjectFenceProxy in OptimizeForTarget.
WGMMA macro updates `tilelang/intrinsics/wgmma_macro_generator.py`	Insert `warpgroup_arrive()` after descriptor init and `warpgroup_commit_batch()` + `warpgroup_wait(0)` after inner loops to bracket MMA compute.
Examples/tests `examples/deepseek_v32/test_tilelang_example_deepseek_v32.py`	Tweak test parameter shapes to smaller constants.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Dev as Developer
  participant TL as TileLang OptimizeForTarget
  participant P1 as TMAStoreSyncInjector
  participant P2 as ProxyFenceInjector
  participant IR as IRModule
  participant CG as CUDA CodeGen
  participant CU as CUDA Device (cute)

  Dev->>TL: Lower module
  TL->>IR: Input IR
  TL->>P1: Apply TMA store sync rewrite
  P1-->>TL: IR' (tma_store_arrive/wait)
  TL->>P2: Apply InjectFenceProxy
  P2-->>IR: IR'' (fence.proxy.async on generic→async)
  TL-->>Dev: Lowered IR''
  Dev->>CG: Generate CUDA
  CG->>CU: Emit tl::warpgroup_arrive / commit_batch / wait<N>()
  CG-->>Dev: CUDA kernel

sequenceDiagram
  autonumber
  participant K as Kernel
  participant TL as tl::intrinsics
  participant CU as cute::arch (SM90)

  K->>TL: warpgroup_arrive()
  TL->>CU: cute::warpgroup_arrive()

  rect rgba(230,240,255,0.5)
    K->>K: MMA compute region (wgmma)
  end

  K->>TL: warpgroup_commit_batch()
  TL->>CU: cute::warpgroup_commit_batch()
  K->>TL: warpgroup_wait<N>()
  TL->>CU: cute::warpgroup_wait<N>()

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

[Refactor] Refactor Pass InjectFenceProxy and expose some warp group primitives in frontend #977: Appears to implement the same inject_fence_proxy refactor, warpgroup intrinsics, tests, and codegen changes — likely directly related or overlapping.
[Refactor] Refactor barrier management #744: Also modifies inject_fence_proxy internals (proxy/fence detection); overlaps with pass rewrite and fence detection logic.
[Feature] Add 1D TMA support #761: Modifies TMA load/store handling and interfaces; related to the TMAStoreSyncInjector changes here.

Suggested reviewers

tzj-fxz

Poem

I nudge the fences, light and spry,
Async winds leap lightning-high—
Arrive, commit, then patiently wait,
Warpgroup pulses keep timing straight.
A rabbit hops, guards order fine,
Tiles aligned, no missing sign. 🐇

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 19.05% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title succinctly highlights both the refactoring of the InjectFenceProxy pass and the exposure of new warp group primitives, matching the two primary objectives of the PR without extraneous details.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 671b226 and 0f1060c.

📒 Files selected for processing (2)

docs/compiler_internals/inject_fence_proxy.md (1 hunks)
examples/deepseek_v32/test_tilelang_example_deepseek_v32.py (1 hunks)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

docs/compiler_internals/inject_fence_proxy.md (1)
19-25: Specify a code-fence language for the timeline block.

Markdown lint (MD040) flags this fence because it lacks a language tag. Please annotate it (e.g., ```text) so the docs build stays clean.
-```
+```text
 generic initialize_descriptor → generic shared-store → async wgmma
              │                           │                   │
              └─ generic proxy            ┴─ generic proxy    ┴─ async proxy
                          │        fence inserted here   ↑
                          └──────────────────────────────┘

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ae183d and 1fb79d6.

📒 Files selected for processing (12)

docs/compiler_internals/inject_fence_proxy.md (1 hunks)
docs/index.md (2 hunks)
src/op/builtin.cc (1 hunks)
src/op/builtin.h (1 hunks)
src/target/codegen_cuda.cc (1 hunks)
src/tl_templates/cuda/intrin.h (2 hunks)
src/transform/inject_fence_proxy.cc (1 hunks)
src/transform/inject_pipeline.cc (2 hunks)
testing/python/transform/test_tilelang_transform_inject_fence_proxy.py (1 hunks)
tilelang/engine/phase.py (1 hunks)
tilelang/intrinsics/wgmma_macro_generator.py (2 hunks)
tilelang/language/builtin.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (9)

tilelang/engine/phase.py (2)

src/transform/inject_fence_proxy.cc (2)

InjectFenceProxy (305-313)

InjectFenceProxy (305-305)

tilelang/transform/__init__.py (1)

InjectFenceProxy (230-238)

tilelang/language/builtin.py (2)

tilelang/language/tir/op.py (1)

call_intrin (119-144)

src/tl_templates/cuda/intrin.h (1)

warpgroup_wait (13-15)

src/target/codegen_cuda.cc (2)

tilelang/language/builtin.py (3)

warpgroup_arrive (252-258)

warpgroup_commit_batch (261-267)

warpgroup_wait (270-280)

src/tl_templates/cuda/intrin.h (1)

warpgroup_wait (13-15)

src/op/builtin.h (2)

tilelang/language/builtin.py (3)

warpgroup_arrive (252-258)

warpgroup_commit_batch (261-267)

warpgroup_wait (270-280)

src/tl_templates/cuda/intrin.h (1)

warpgroup_wait (13-15)

testing/python/transform/test_tilelang_transform_inject_fence_proxy.py (3)

tilelang/language/tir/op.py (3)

ptx_cp_async (1244-1273)

call_intrin (119-144)

ptx_wgmma_ss (1064-1103)

tilelang/language/builtin.py (3)

fence_proxy_async (82-91)

warpgroup_arrive (252-258)

initialize_descriptor (391-422)

src/transform/inject_fence_proxy.cc (24)

InjectFenceProxy (305-313)

InjectFenceProxy (305-305)

stmt (138-138)

stmt (138-138)

stmt (279-281)

stmt (279-279)

stmt (283-292)

stmt (283-283)

op (140-154)

op (140-140)

op (171-196)

op (171-171)

op (198-218)

op (198-198)

op (220-224)

op (220-220)

op (226-235)

op (226-226)

op (237-243)

op (237-237)

op (245-250)

op (245-245)

op (252-262)

op (252-252)

tilelang/intrinsics/wgmma_macro_generator.py (3)

tilelang/language/allocate.py (1)

alloc_descriptor (158-164)

tilelang/language/builtin.py (4)

initialize_descriptor (391-422)

warpgroup_arrive (252-258)

warpgroup_commit_batch (261-267)

warpgroup_wait (270-280)

src/tl_templates/cuda/intrin.h (1)

warpgroup_wait (13-15)

src/transform/inject_fence_proxy.cc (2)

tilelang/language/builtin.py (5)

fence_proxy_async (82-91)

tma_load (70-79)

tma_store_arrive (94-103)

tma_store_wait (106-115)

initialize_descriptor (391-422)

tilelang/language/tir/op.py (6)

ptx_wgmma_ss (1064-1103)

ptx_wgmma_rs (1106-1143)

ptx_cp_async (1244-1273)

ptx_cp_async_barrier (1592-1606)

ptx_cp_async_bulk (1276-1310)

ptx_ldmatrix (1205-1241)

src/op/builtin.cc (2)

tilelang/language/builtin.py (3)

warpgroup_arrive (252-258)

warpgroup_commit_batch (261-267)

warpgroup_wait (270-280)

src/tl_templates/cuda/intrin.h (1)

warpgroup_wait (13-15)

src/tl_templates/cuda/intrin.h (3)

src/op/builtin.h (1)

tl (22-473)

src/tl_templates/cuda/gemm_sm90.h (1)

tl (231-385)

tilelang/language/builtin.py (3)

warpgroup_arrive (252-258)

warpgroup_commit_batch (261-267)

warpgroup_wait (270-280)

🪛 markdownlint-cli2 (0.18.1)

docs/compiler_internals/inject_fence_proxy.md

19-19: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (5)

tilelang/language/builtin.py (1)

252-280: Warpgroup intrinsics wired cleanly.

The Python wrappers mirror the new TL ops and keep the handle signatures consistent, so the frontend API stays aligned with the backend changes.

src/target/codegen_cuda.cc (1)

1377-1385: CUDA emission matches the new intrinsics.

The dedicated branches map straight to the cute warpgroup helpers and preserve the compile-time MMA count, keeping codegen in lockstep with the intrinsic surface.

src/op/builtin.h (1)

337-359: Header exports look complete.

Good to see the warpgroup arrive/commit/wait ops surfaced alongside the existing intrinsics so downstream components can rely on the shared registry.

src/tl_templates/cuda/intrin.h (1)

10-15: Device wrappers line up with cute helpers.

Including the SM90 GMMA header and delegating through cute keeps the TL device API tidy while exposing the new warpgroup primitives.

testing/python/transform/test_tilelang_transform_inject_fence_proxy.py (1)

56-224: Great coverage for the new fence logic.

The new cases validate no double-fencing, proxy hints, TMA store sync insertion, and the warpgroup sequencing windows—exactly the behavior we need confidence in.

coderabbitai · 2025-10-11T06:11:55Z

src/transform/inject_fence_proxy.cc

+  Stmt VisitStmt_(const ForNode *op) final { return VisitSingleBody(op); }
+  Stmt VisitStmt_(const LetStmtNode *op) final { return VisitSingleBody(op); }
+  Stmt VisitStmt_(const AssertStmtNode *op) final {
+    return VisitSingleBody(op);
+  }
+  Stmt VisitStmt_(const WhileNode *op) final { return VisitSingleBody(op); }
+
+  template <typename NodeType> Stmt VisitSingleBody(const NodeType *op) {
+    Stmt stmt = StmtMutator::VisitStmt_(op);
+    const auto *node = stmt.as<NodeType>();
+    ProxyKind body_kind = GetProxyKind(node->body);
+    SetProxyKind(stmt, body_kind);
+    return stmt;
+  }


⚠️ Potential issue | 🔴 Critical

Propagate proxy kind through Allocate (and similar) nodes.

ProxyFenceInjector never overrides VisitStmt_(const AllocateNode*), so an Allocate winds up with ProxyKind::kUnknown. When a generic op precedes an Allocate whose body starts with an async op, NeedsFence() sees Unknown → Async and skips the fence. That recreates the very race this pass is supposed to eliminate—most Hopper kernels wrap async copies inside shared-memory alloc scopes, so the fence silently goes missing.

Propagate the body’s proxy kind through Allocate (and related single-body nodes) the same way you already do for For, Let, etc.

Stmt VisitStmt_(const ForNode *op) final { return VisitSingleBody(op); } Stmt VisitStmt_(const LetStmtNode *op) final { return VisitSingleBody(op); } Stmt VisitStmt_(const AssertStmtNode *op) final { return VisitSingleBody(op); } Stmt VisitStmt_(const WhileNode *op) final { return VisitSingleBody(op); } + Stmt VisitStmt_(const AllocateNode *op) final { return VisitSingleBody(op); } + Stmt VisitStmt_(const AllocateConstNode *op) final { + return VisitSingleBody(op); + } + Stmt VisitStmt_(const DeclBufferNode *op) final { + return VisitSingleBody(op); + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Stmt VisitStmt_(const ForNode *op) final { return VisitSingleBody(op); }

Stmt VisitStmt_(const LetStmtNode *op) final { return VisitSingleBody(op); }

Stmt VisitStmt_(const AssertStmtNode *op) final {

return VisitSingleBody(op);

}

Stmt VisitStmt_(const WhileNode *op) final { return VisitSingleBody(op); }

template <typename NodeType> Stmt VisitSingleBody(const NodeType *op) {

Stmt stmt = StmtMutator::VisitStmt_(op);

const auto *node = stmt.as<NodeType>();

ProxyKind body_kind = GetProxyKind(node->body);

SetProxyKind(stmt, body_kind);

return stmt;

}

Stmt VisitStmt_(const ForNode *op) final { return VisitSingleBody(op); }

Stmt VisitStmt_(const LetStmtNode *op) final { return VisitSingleBody(op); }

Stmt VisitStmt_(const AssertStmtNode *op) final {

return VisitSingleBody(op);

}

Stmt VisitStmt_(const WhileNode *op) final { return VisitSingleBody(op); }

Stmt VisitStmt_(const AllocateNode *op) final { return VisitSingleBody(op); }

Stmt VisitStmt_(const AllocateConstNode *op) final {

return VisitSingleBody(op);

}

Stmt VisitStmt_(const DeclBufferNode *op) final {

return VisitSingleBody(op);

}

template <typename NodeType> Stmt VisitSingleBody(const NodeType *op) {

Stmt stmt = StmtMutator::VisitStmt_(op);

const auto *node = stmt.as<NodeType>();

ProxyKind body_kind = GetProxyKind(node->body);

SetProxyKind(stmt, body_kind);

return stmt;

}

🤖 Prompt for AI Agents

In src/transform/inject_fence_proxy.cc around lines 264-277, the pass doesn't override VisitStmt_(const AllocateNode*), so Allocate (and similar single-body nodes) retain ProxyKind::kUnknown and can hide needed fences; add a VisitStmt_ override for const AllocateNode* that returns VisitSingleBody(op) (and add similar overrides for any other single-body statement node types not yet covered) so the body's ProxyKind is propagated into the Allocate node before returning.

coderabbitai · 2025-10-11T06:11:55Z

tilelang/engine/phase.py

+            print("Before injectFenceProxy")
+            print(mod)
            mod = tilelang.transform.InjectFenceProxy()(mod)
+            print("After InjectFenceProxy")
+            print(mod)
+


⚠️ Potential issue | 🟠 Major

Remove unconditional debug prints

Dumping the entire IR module every compile will spam stdout/stderr and severely degrade usability. Please drop these prints or gate them behind an explicit debug flag/logger before merging.

- print("Before injectFenceProxy") - print(mod) mod = tilelang.transform.InjectFenceProxy()(mod) - print("After InjectFenceProxy") - print(mod)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

print("Before injectFenceProxy")

print(mod)

mod = tilelang.transform.InjectFenceProxy()(mod)

print("After InjectFenceProxy")

print(mod)

mod = tilelang.transform.InjectFenceProxy()(mod)

🤖 Prompt for AI Agents

In tilelang/engine/phase.py around lines 159 to 164 there are unconditional print statements that dump the IR module before and after InjectFenceProxy; remove these prints or replace them with a proper logger call guarded by a debug flag. Specifically, delete the two print(...) lines or change them to something like logger.debug(...) and ensure the logger's level is controlled by a global debug setting or environment/config flag so the IR is only emitted when debugging is explicitly enabled. Ensure no stdout/stderr spamming remains in the hot compile path.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

src/transform/inject_fence_proxy.cc (1)
76-107: Consider explicitly classifying the new warpgroup operations.

The IsAsyncIntrinsic function correctly identifies WGMMA operations (ptx_wgmma_ss, ptx_wgmma_rs) as async. However, the newly added warpgroup synchronization operations (warpgroup_arrive, warpgroup_commit_batch, warpgroup_wait) are not explicitly classified in either IsAsyncIntrinsic or IsKnownGeneric.

Currently, they will fall through to the conservative default at line 212 (treating unknown externs as async), which is safe but not explicit. For better maintainability and clarity, consider adding explicit classification for these operations, likely as neutral barriers since they are synchronization primitives:
 bool IsAsyncIntrinsic(const CallNode *call) {
   if (call == nullptr) {
     return false;
   }
 
   // TileLang async intrinsics
   if (call->op.same_as(tma_load()) || call->op.same_as(tma_load_im2col()) ||
       call->op.same_as(tma_store()) || call->op.same_as(tma_store_arrive()) ||
       call->op.same_as(tma_store_wait()) ||
       call->op.same_as(ptx_cp_async_barrier_noinc()) ||
       call->op.same_as(ptx_wgmma_ss()) || call->op.same_as(ptx_wgmma_rs())) {
     return true;
   }
 
   // PTX async copy intrinsics
   if (call->op.same_as(builtin::ptx_cp_async()) ||
       call->op.same_as(builtin::ptx_cp_async_barrier()) ||
       call->op.same_as(builtin::ptx_cp_async_bulk())) {
     return true;
   }
 
   return false;
 }
And add a helper for neutral operations:
+// Operations that act as synchronization barriers and reset proxy state
+bool IsNeutralIntrinsic(const CallNode *call) {
+  if (call == nullptr) {
+    return false;
+  }
+  return call->op.same_as(warpgroup_arrive()) || 
+         call->op.same_as(warpgroup_commit_batch()) ||
+         call->op.same_as(warpgroup_wait());
+}
Then update the VisitStmt_(const EvaluateNode *op) to check this:
   if (const auto *call = evaluate->value.as<CallNode>()) {
     if (IsFenceCall(call)) {
       kind = ProxyKind::kNeutral;
+    } else if (IsNeutralIntrinsic(call)) {
+      kind = ProxyKind::kNeutral;
     } else if (IsAsyncIntrinsic(call)) {
testing/python/transform/test_tilelang_transform_inject_fence_proxy.py (2)
56-88: LGTM! Test correctly validates no redundant fence insertion.

The test verifies that the pass doesn't add a second fence when one is already present between async and generic operations, which is the expected behavior.

Consider extracting the fence counting logic into a helper function to reduce duplication across tests:
def _count_intrinsic_calls(stmt, intrinsic_name):
    count = 0
    def visit(node):
        nonlocal count
        if isinstance(node, tir.Evaluate) and isinstance(node.value, tir.Call):
            name = getattr(node.value.op, "name", None)
            if name == intrinsic_name:
                count += 1
    tir.stmt_functor.post_order_visit(stmt, visit)
    return count
Then simplify line 87 to:
assert _count_intrinsic_calls(mod["main"].body, "tl.fence_proxy_async") == 1
90-121: LGTM! Test correctly validates proxy hint override behavior.

The test confirms that proxy_hint="neutral" scopes prevent automatic fence insertion, giving users manual control over fence placement when needed.

The _has_fence visitor follows the same pattern as _count_fences in the previous test. Consider using the helper function suggested above with a zero-count check:
assert _count_intrinsic_calls(mod["main"].body, "tl.fence_proxy_async") == 0

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ae183d and 671b226.

📒 Files selected for processing (12)

docs/compiler_internals/inject_fence_proxy.md (1 hunks)
docs/index.md (2 hunks)
src/op/builtin.cc (1 hunks)
src/op/builtin.h (1 hunks)
src/target/codegen_cuda.cc (1 hunks)
src/tl_templates/cuda/intrin.h (2 hunks)
src/transform/inject_fence_proxy.cc (1 hunks)
src/transform/inject_pipeline.cc (2 hunks)
testing/python/transform/test_tilelang_transform_inject_fence_proxy.py (1 hunks)
tilelang/engine/phase.py (1 hunks)
tilelang/intrinsics/wgmma_macro_generator.py (2 hunks)
tilelang/language/builtin.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (9)

tilelang/intrinsics/wgmma_macro_generator.py (3)

tilelang/language/allocate.py (1)

alloc_descriptor (158-164)

tilelang/language/builtin.py (4)

initialize_descriptor (391-422)

warpgroup_arrive (252-258)

warpgroup_commit_batch (261-267)

warpgroup_wait (270-280)

src/tl_templates/cuda/intrin.h (1)

warpgroup_wait (13-15)

tilelang/language/builtin.py (2)

tilelang/language/tir/op.py (1)

call_intrin (119-144)

src/tl_templates/cuda/intrin.h (1)

warpgroup_wait (13-15)

tilelang/engine/phase.py (2)

src/transform/inject_fence_proxy.cc (2)

InjectFenceProxy (305-313)

InjectFenceProxy (305-305)

tilelang/transform/__init__.py (1)

InjectFenceProxy (230-238)

src/target/codegen_cuda.cc (2)

tilelang/language/builtin.py (3)

warpgroup_arrive (252-258)

warpgroup_commit_batch (261-267)

warpgroup_wait (270-280)

src/tl_templates/cuda/intrin.h (1)

warpgroup_wait (13-15)

src/op/builtin.h (2)

tilelang/language/builtin.py (3)

warpgroup_arrive (252-258)

warpgroup_commit_batch (261-267)

warpgroup_wait (270-280)

src/tl_templates/cuda/intrin.h (1)

warpgroup_wait (13-15)

src/op/builtin.cc (2)

tilelang/language/builtin.py (3)

warpgroup_arrive (252-258)

warpgroup_commit_batch (261-267)

warpgroup_wait (270-280)

src/tl_templates/cuda/intrin.h (1)

warpgroup_wait (13-15)

src/tl_templates/cuda/intrin.h (3)

src/op/builtin.h (1)

tl (22-473)

src/tl_templates/cuda/gemm_sm90.h (1)

tl (231-385)

tilelang/language/builtin.py (3)

warpgroup_arrive (252-258)

warpgroup_commit_batch (261-267)

warpgroup_wait (270-280)

testing/python/transform/test_tilelang_transform_inject_fence_proxy.py (3)

tilelang/language/tir/op.py (4)

ptx_cp_async (1244-1273)

call_extern (172-194)

call_intrin (119-144)

ptx_wgmma_ss (1064-1103)

tilelang/language/builtin.py (3)

fence_proxy_async (82-91)

warpgroup_arrive (252-258)

initialize_descriptor (391-422)

src/transform/inject_fence_proxy.cc (24)

InjectFenceProxy (305-313)

InjectFenceProxy (305-305)

stmt (138-138)

stmt (138-138)

stmt (279-281)

stmt (279-279)

stmt (283-292)

stmt (283-283)

op (140-154)

op (140-140)

op (171-196)

op (171-171)

op (198-218)

op (198-198)

op (220-224)

op (220-220)

op (226-235)

op (226-226)

op (237-243)

op (237-237)

op (245-250)

op (245-245)

op (252-262)

op (252-252)

src/transform/inject_fence_proxy.cc (2)

tilelang/language/builtin.py (5)

fence_proxy_async (82-91)

tma_load (70-79)

tma_store_arrive (94-103)

tma_store_wait (106-115)

initialize_descriptor (391-422)

tilelang/language/tir/op.py (6)

ptx_wgmma_ss (1064-1103)

ptx_wgmma_rs (1106-1143)

ptx_cp_async (1244-1273)

ptx_cp_async_barrier (1592-1606)

ptx_cp_async_bulk (1276-1310)

ptx_ldmatrix (1205-1241)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: build-test-metal

🔇 Additional comments (12)

src/tl_templates/cuda/intrin.h (1)

5-16: LGTM!

The new warpgroup synchronization wrappers are correctly implemented:

The include for mma_sm90_gmma.hpp is necessary for the cute library functions

All three wrappers (warpgroup_arrive, warpgroup_commit_batch, warpgroup_wait) correctly delegate to their cute counterparts

The templated warpgroup_wait<NumMma>() properly forwards the template parameter

All functions are appropriately marked TL_DEVICE and guarded by the SM90+ architecture check

src/target/codegen_cuda.cc (1)

1377-1385: LGTM!

The codegen for the three warpgroup operations is correctly implemented:

warpgroup_arrive and warpgroup_commit_batch emit simple external calls with no arguments

warpgroup_wait correctly extracts the num_mma argument from op->args[0] and emits it as a template parameter

The implementation follows the established pattern for similar intrinsics in this file

src/op/builtin.h (1)

337-359: LGTM!

The three new Op declarations are properly implemented:

All follow the established pattern for Op declarations in this file

Doxygen documentation is clear and appropriate

The TVM_DLL export macro is correctly applied

Logical placement between no_set_max_nreg and wait_wgmma maintains file organization

docs/compiler_internals/inject_fence_proxy.md (1)

1-36: LGTM!

This is excellent documentation for the InjectFenceProxy pass:

Clear explanation of the problem (Hopper generic vs. async proxy paths)

Comprehensive description of the pass behavior (state tracking, TMA store lowering, fence injection)

Appropriate technical detail on intrinsic coverage

Helpful usage example with code

Useful extension guidance for future maintainers

tilelang/language/builtin.py (1)

252-280: LGTM!

The three new Python wrappers are correctly implemented:

All follow the established pattern of using tir.call_intrin("handle", tir.op.Op.get(...))

The warpgroup_wait function correctly accepts and forwards the num_mma: int parameter

Docstrings are clear and appropriate

Type hints are provided where applicable

src/transform/inject_fence_proxy.cc (3)

126-155: LGTM!

The TMAStoreSyncInjector correctly rewrites TMA store sequences:

Properly identifies tma_store intrinsics in evaluate nodes

Constructs the required synchronization sequence (store → arrive → wait)

Uses SeqStmt to maintain the ordering

The Apply method follows the established pattern for TIR passes

159-301: LGTM!

The ProxyFenceInjector is well-implemented with comprehensive statement coverage:

The SeqStmtNode visitor correctly tracks proxy state across statements and injects fences at generic-to-async transitions

Statement classification is appropriate:

EvaluateNode: classifies calls based on intrinsic type, with conservative async default for unknowns

BufferStoreNode: correctly marked as generic

IfThenElseNode: properly combines proxy kinds from branches

Other control flow nodes appropriately propagate body kinds

The proxy state management through proxy_map_ is clean and efficient

The MakeFenceStmt helper correctly creates neutral fence statements

305-313: LGTM!

The pass composition is clean and correct:

Properly composes TMAStoreSyncInjector and ProxyFenceInjector in the right order (TMA synchronization rewriting must happen before fence injection)

Uses the standard CreatePrimFuncPass API appropriately

Pass metadata (priority 0, name "tl.InjectFenceProxy") is correctly specified
testing/python/transform/test_tilelang_transform_inject_fence_proxy.py (4)
123-152: LGTM! Test correctly validates TMA store synchronization injection.

The test confirms that tl.tma_store is automatically rewritten to include tma_store_arrive and tma_store_wait handshakes, matching the behavior described in the PR objectives and implemented in TMAStoreSyncInjector::Apply (src/transform/inject_fence_proxy.cc).

154-186: LGTM! Test correctly validates wgmma async marking and fence ordering.

The test confirms that ptx_wgmma_ss operations are treated as async and that a fence_proxy_async is inserted before them when transitioning from generic operations. The ordering verification using order.index() is appropriate and will fail clearly if the expected intrinsics are not present.

227-227: LGTM! Standard test runner pattern.

The change to use tilelang.testing.main() enables standard test discovery and execution, allowing individual tests to be run via command line arguments or IDE test runners.

188-224: Verify and tighten the fence_count assertion

The test uses assert fence_count >= 1, which may mask redundant fence insertions. Determine the exact number of tl.fence_proxy_async calls in this scenario (e.g., 1) and update the assertion to:
assert fence_count == <expected_count>
Run the test with the correct PYTHONPATH or via pytest to load the tilelang package before adjusting.

coderabbitai · 2025-10-11T06:12:03Z

tilelang/engine/phase.py

+            print("Before injectFenceProxy")
+            print(mod)
            mod = tilelang.transform.InjectFenceProxy()(mod)
+            print("After InjectFenceProxy")
+            print(mod)
+


⚠️ Potential issue | 🟠 Major

Remove debug prints from the lowering pipeline.
These unconditional print calls dump entire IRModules during every lowering run, which will flood stdout and noticeably slow compilation for any Hopper target. Please drop them or gate behind an explicit debug flag.

- print("Before injectFenceProxy") - print(mod) mod = tilelang.transform.InjectFenceProxy()(mod) - print("After InjectFenceProxy") - print(mod)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

print("Before injectFenceProxy")

print(mod)

mod = tilelang.transform.InjectFenceProxy()(mod)

print("After InjectFenceProxy")

print(mod)

mod = tilelang.transform.InjectFenceProxy()(mod)

🤖 Prompt for AI Agents

In tilelang/engine/phase.py around lines 159 to 164, remove the unconditional print statements that dump the IRModule before and after InjectFenceProxy; either delete these prints or replace them with a gated debug mechanism (e.g., use an existing logger.debug call or check a module-level/dev flag or environment variable before printing) so the IR is not always written to stdout and compilation is not slowed or flooded.

LeiWang1999 · 2025-10-11T08:46:44Z

local test passed :)

LeiWang1999 added 2 commits October 11, 2025 13:59

• InjectFenceProxy docs and tests

14d3ac3

- annotate proxy fence injector with context comments for async/generic detection - add compiler internals doc covering the pass mechanics and link it in docs index - repair fence proxy test by fixing descriptor init usage and fence counter logic

do not consider call_extern as async.

671b226

doc update.

1fb79d6

coderabbitai bot reviewed Oct 11, 2025

View reviewed changes

reduce test size for sparse mla

0f1060c

LeiWang1999 merged commit ddfaac3 into tile-ai:main Oct 11, 2025
3 of 6 checks passed

This was referenced Oct 12, 2025

[Language] Expose T.warpgroup_fence_operand for nvcc code motion #986

Open

[Language] Expose T.get_warp_idx_sync and T.shuffle_elect for efficient thread election #989

Merged

This was referenced Oct 20, 2025

[Enhancement] Update async intrinsic handling in inject_fence_proxy #1068

Merged

[FFI] Rebase tvm to v0.22.0 to utilize tvm-ffi #1108

Open

This was referenced Oct 24, 2025

[Language] Initial version of tilelang frontend v2 #1120

Open

[BugFix] alloc_var init failed to handle complex expression #1144

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group primitives in frontend #977

[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group primitives in frontend #977

Uh oh!

LeiWang1999 commented Oct 11, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

github-actions bot commented Oct 11, 2025

Uh oh!

coderabbitai bot commented Oct 11, 2025 •

edited

Loading

Review failed

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Oct 11, 2025

Uh oh!

coderabbitai bot Oct 11, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Oct 11, 2025

Uh oh!

LeiWang1999 commented Oct 11, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

[Refactor] Refactor Pass InjectFenceProxy and expose some warp group primitives in frontend #977

[Refactor] Refactor Pass InjectFenceProxy and expose some warp group primitives in frontend #977

Uh oh!

Conversation

LeiWang1999 commented Oct 11, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TIR Fence Proxy Pass and Synchronization Improvements

Warpgroup Synchronization Intrinsics

Miscellaneous

Summary by CodeRabbit

Uh oh!

github-actions bot commented Oct 11, 2025

Uh oh!

coderabbitai bot commented Oct 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 11, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 11, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 11, 2025

Choose a reason for hiding this comment

Uh oh!

LeiWang1999 commented Oct 11, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group primitives in frontend #977

[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group primitives in frontend #977

LeiWang1999 commented Oct 11, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 11, 2025 •

edited

Loading