Commit ddfaac3

[Refactor] Refactor Pass InjectFenceProxy and expose some warp group primitives in frontend (#977)
* InjectFenceProxy docs and tests
  - annotate proxy fence injector with context comments for async/generic detection
  - add compiler internals doc covering the pass mechanics and link it in docs index
  - repair fence proxy test by fixing descriptor init usage and fence counter logic
* do not consider call_extern as async
* doc update
* reduce test size for sparse mla
1 parent 77e31e5 commit ddfaac3

File tree

13 files changed

+639
-145
lines changed

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
# InjectFenceProxy Pass

`tl.InjectFenceProxy` is a TIR-level transform that keeps the GPU proxy state consistent on NVIDIA Hopper (SM90+) by inserting `fence.proxy.async` instructions when control flow switches from generic memory operations to asynchronous proxy operations.
## Why Fences Are Needed

Hopper separates memory instructions into generic and asynchronous proxy paths. When an asynchronous instruction (for example, `cp.async` or `tma.load`) issues after generic traffic (like `ldmatrix` or plain buffer stores), the hardware requires a `fence.proxy.async` to guarantee ordering. Missing fences can lead to race conditions or undefined behaviour.
## What the Pass Does

- Walks every statement in the `PrimFunc`, tracking whether it behaves as a **generic**, **async**, or **neutral** proxy (neutral statements, such as an explicit fence, reset the state).
- Automatically lowers `tma_store` intrinsics into the required `arrive`/`wait` handshake so that TMA stores participate correctly in synchronization.
- Injects an explicit `fence.proxy.async` whenever a generic statement is followed by an async statement without an intervening neutral barrier.

The pass is conservative: unknown extern calls are treated as async so that the fence is inserted rather than accidentally omitted.
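The scan can be pictured as a small state machine. The sketch below is a toy model of the transition rule, not the real TIR pass; the statement names and the `classify` helper are illustrative:

```python
# Toy model of the InjectFenceProxy scan: classify each statement as
# "generic", "async", or "neutral", and emit a fence on every
# generic -> async transition. Names here are illustrative only.
KINDS = {
    "initialize_descriptor": "generic",
    "shared_store": "generic",
    "ldmatrix": "generic",
    "wgmma": "async",
    "cp.async": "async",
    "fence.proxy.async": "neutral",
}

def classify(stmt):
    # mirror the conservative rule above: unknown calls count as async
    return KINDS.get(stmt, "async")

def inject_fences(stmts):
    out, prev = [], None
    for s in stmts:
        kind = classify(s)
        if prev == "generic" and kind == "async":
            out.append("fence.proxy.async")
        out.append(s)
        # a neutral statement (an explicit fence) resets the tracked state
        prev = None if kind == "neutral" else kind
    return out
```

For example, `inject_fences(["initialize_descriptor", "shared_store", "wgmma"])` inserts exactly one fence, immediately before the `wgmma`; a sequence that already contains an explicit fence at the transition point is left unchanged.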
### Timeline View

```
generic initialize_descriptor → generic shared-store → async wgmma
          │                          │                     │
          └─ generic proxy           ┴─ generic proxy      ┴─ async proxy
                                     │ fence inserted here ↑
                                     └─────────────────────┘
```
The proxy tracker scans the sequence from left to right. The moment it detects a transition from generic to async (between the shared-memory store and the `wgmma` above), it synthesizes a `fence.proxy.async` to reset the hardware proxy state before the async path runs.
## Coverage of Intrinsics

The tracker understands the TileLang intrinsics for TMA load/store, shared-memory MMA (`wgmma`), and TVM/PTX async copy intrinsics (`cp.async` variants). Generic operations currently include `ldmatrix`, `stmatrix`, and descriptor initialization. Other IR nodes (loops, blocks, attributes) receive a proxy kind derived from their bodies so that the analysis survives structured control flow.
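As a rough picture of that classification (the name spellings below are illustrative; the authoritative lists live in `src/transform/inject_fence_proxy.cc`):

```python
# Hypothetical classifier mirroring the coverage described above.
# The intrinsic name spellings are illustrative, not the real ones.
ASYNC_OPS = {"tma_load", "tma_store", "wgmma", "cp.async"}
GENERIC_OPS = {"ldmatrix", "stmatrix", "initialize_descriptor"}

def proxy_kind(op_name):
    if op_name in ASYNC_OPS:
        return "async"
    if op_name in GENERIC_OPS:
        return "generic"
    return "unknown"  # compound nodes derive a kind from their bodies instead

# proxy_kind("wgmma") -> "async"; proxy_kind("ldmatrix") -> "generic"
```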
## Usage

The pass is part of the default TileLang lowering pipeline. To apply it manually:
```python
import tvm
from tvm import IRModule

from tilelang import tl

mod = IRModule({"main": prim_func})
with tvm.transform.PassContext():
    mod = tl.transform.InjectFenceProxy()(mod)
```
## End-to-End Example

Before the pass:
```python
@T.prim_func
def kernel():
    with T.Kernel(1):
        desc = T.decl_buffer((1,), "uint64", scope="local.descriptor")
        smem = T.decl_buffer((128,), "float16", scope="shared")
        T.initialize_descriptor(desc, T.uint64(0), 2, 1, 32)
        smem[0] = T.float16(0)
        T.ptx_wgmma_ss(
            "float16",
            "m64n64k16",
            T.bool(True),
            T.bool(True),
            "fp16",
            "fp16",
            "fp16",
            desc.data,
            T.int32(0),
            desc.data,
            T.int32(0),
            smem.data,
            T.int32(0),
            T.bool(True),
            1,
            1,
        )
```
After `tl.transform.InjectFenceProxy`:
```python
@T.prim_func
def kernel():
    with T.Kernel(1):
        desc = T.decl_buffer((1,), "uint64", scope="local.descriptor")
        smem = T.decl_buffer((128,), "float16", scope="shared")
        T.initialize_descriptor(desc, T.uint64(0), 2, 1, 32)
        smem[0] = T.float16(0)
        T.fence_proxy_async()
        T.ptx_wgmma_ss(
            "float16",
            "m64n64k16",
            T.bool(True),
            T.bool(True),
            "fp16",
            "fp16",
            "fp16",
            desc.data,
            T.int32(0),
            desc.data,
            T.int32(0),
            smem.data,
            T.int32(0),
            T.bool(True),
            1,
            1,
        )
```
The only change is the `fence_proxy_async` between the generic descriptor setup / shared-memory write and the async `wgmma`. In larger kernels the pass performs the same operation across nested blocks, loops, and conditional branches.
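One way to picture how the analysis crosses structured control flow is to give each compound statement the proxy kinds of its first and last effectful children; a minimal sketch under that assumption (not the pass's real data structures):

```python
# Toy recursion: a nested sequence (modeling a loop or block body) exposes
# the kind of its first and last leaf, so a generic -> async transition is
# still visible across the structural boundary. Illustrative only.
def boundary_kinds(stmt):
    if isinstance(stmt, str):  # leaf: already a proxy kind
        return stmt, stmt
    first, _ = boundary_kinds(stmt[0])
    _, last = boundary_kinds(stmt[-1])
    return first, last

# A loop ending in a generic store, followed by an async wgmma: the
# boundary pair ("generic", "async") tells the tracker a fence is due.
kinds = boundary_kinds([["generic", "generic"], "async"])
```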
## Extending the Pass

If you introduce a new intrinsic that behaves like an async proxy, add it to `IsAsyncIntrinsic` in `src/transform/inject_fence_proxy.cc`. Likewise, extend `IsKnownGeneric` for additional generic operations. When adding new neutral barriers, make sure they set the proxy kind to `kNeutral` so the state resets correctly.

docs/index.md

Lines changed: 2 additions & 1 deletion
```diff
@@ -40,6 +40,7 @@ deeplearning_operators/deepseek_mla
 :caption: COMPILER INTERNALS

 compiler_internals/letstmt_inline
+compiler_internals/inject_fence_proxy
 :::

 :::{toctree}
@@ -54,4 +55,4 @@ autoapi/tilelang/index
 :caption: Privacy

 privacy
-:::
+:::
```

examples/deepseek_v32/test_tilelang_example_deepseek_v32.py

Lines changed: 3 additions & 3 deletions
```diff
@@ -21,22 +21,22 @@ def test_example_fp8_lighting_indexer():
 def test_example_sparse_mla_fwd():
     # small shapes for testing
     test_sparse_mla_fwd(
-        S=1024, SKV=2048, H=128, HKV=1, DQK=576, DV=512, topk=256, check_correctness=False)
+        S=256, SKV=1024, H=64, HKV=1, DQK=576, DV=512, topk=256, check_correctness=False)


 @tilelang.testing.requires_cuda
 @tilelang.testing.requires_cuda_compute_version_ge(9, 0)
 def test_example_sparse_mla_fwd_pipelined():
     # small shapes for testing
     test_sparse_mla_fwd_pipelined(
-        S=1024, SKV=2048, H=128, HKV=1, DQK=576, DV=512, topk=256, check_correctness=False)
+        S=256, SKV=1024, H=64, HKV=1, DQK=576, DV=512, topk=256, check_correctness=False)


 @tilelang.testing.requires_cuda
 @tilelang.testing.requires_cuda_compute_version_ge(9, 0)
 def test_example_sparse_mla_bwd():
     test_sparse_mla_bwd(
-        S=1024, SKV=2048, H=64, HKV=1, DQKV=576, DV=512, topk=256, check_correctness=False)
+        S=256, SKV=1024, H=64, HKV=1, DQKV=576, DV=512, topk=256, check_correctness=False)


 if __name__ == "__main__":
```

src/op/builtin.cc

Lines changed: 15 additions & 0 deletions
```diff
@@ -203,6 +203,21 @@ TIR_DEFINE_TL_BUILTIN(no_set_max_nreg)
     .set_attr<TCallEffectKind>("TCallEffectKind",
                                Integer(CallEffectKind::kOpaque));

+TIR_DEFINE_TL_BUILTIN(warpgroup_arrive)
+    .set_num_inputs(0)
+    .set_attr<TCallEffectKind>("TCallEffectKind",
+                               Integer(CallEffectKind::kOpaque));
+
+TIR_DEFINE_TL_BUILTIN(warpgroup_commit_batch)
+    .set_num_inputs(0)
+    .set_attr<TCallEffectKind>("TCallEffectKind",
+                               Integer(CallEffectKind::kOpaque));
+
+TIR_DEFINE_TL_BUILTIN(warpgroup_wait)
+    .set_num_inputs(1)
+    .set_attr<TCallEffectKind>("TCallEffectKind",
+                               Integer(CallEffectKind::kOpaque));
+
 TIR_DEFINE_TL_BUILTIN(wait_wgmma)
     .set_num_inputs(1)
     .set_attr<TCallEffectKind>("TCallEffectKind",
```

src/op/builtin.h

Lines changed: 24 additions & 0 deletions
```diff
@@ -334,6 +334,30 @@ TVM_DLL const Op &set_max_nreg();
  */
 TVM_DLL const Op &no_set_max_nreg();

+/*!
+ * \brief Arrive at a warpgroup fence for WGMMA sequences
+ *
+ * warpgroup_arrive()
+ *
+ */
+TVM_DLL const Op &warpgroup_arrive();
+
+/*!
+ * \brief Commit the current warpgroup batch for WGMMA sequences
+ *
+ * warpgroup_commit_batch()
+ *
+ */
+TVM_DLL const Op &warpgroup_commit_batch();
+
+/*!
+ * \brief Wait for the warpgroup batch identified by num_mma
+ *
+ * warpgroup_wait(num_mma)
+ *
+ */
+TVM_DLL const Op &warpgroup_wait();
+
 /*!
  * \brief Wait the previous wgmma to finish
  *
```

src/target/codegen_cuda.cc

Lines changed: 9 additions & 0 deletions
```diff
@@ -1374,6 +1374,15 @@ void CodeGenTileLangCUDA::VisitExpr_(const CallNode *op, std::ostream &os) {
     print_extern_call_stmt("tl::tma_store_arrive");
   } else if (op->op.same_as(tl::tma_store_wait())) {
     print_extern_call_stmt("tl::tma_store_wait<0>");
+  } else if (op->op.same_as(tl::warpgroup_arrive())) {
+    print_extern_call_stmt("tl::warpgroup_arrive");
+  } else if (op->op.same_as(tl::warpgroup_commit_batch())) {
+    print_extern_call_stmt("tl::warpgroup_commit_batch");
+  } else if (op->op.same_as(tl::warpgroup_wait())) {
+    this->PrintIndent();
+    int num_mma = Downcast<IntImm>(op->args[0])->value;
+    this->stream << "tl::warpgroup_wait<" << std::to_string(num_mma)
+                 << ">();\n";
   } else if (op->op.same_as(tl::set_max_nreg())) {
     this->PrintIndent();
     int nreg = Downcast<IntImm>(op->args[0])->value;
```

src/tl_templates/cuda/intrin.h

Lines changed: 10 additions & 1 deletion
```diff
@@ -2,9 +2,18 @@

 #if __CUDA_ARCH_LIST__ >= 900
 #include "cute/arch/cluster_sm90.hpp"
+#include "cute/arch/mma_sm90_gmma.hpp"
 #include "cutlass/cutlass.h"

 namespace tl {
+
+TL_DEVICE void warpgroup_arrive() { cute::warpgroup_arrive(); }
+TL_DEVICE void warpgroup_commit_batch() { cute::warpgroup_commit_batch(); }
+
+template <int NumMma> TL_DEVICE void warpgroup_wait() {
+  cute::warpgroup_wait<NumMma>();
+}
+
 // Template parameter:
 //   thread_extent: the logical size (in number of threads) of each "group"
 //   within which we want to elect exactly ONE representative
@@ -53,4 +62,4 @@ template <uint32_t RegCount> TL_DEVICE void warpgroup_reg_dealloc() {
   asm volatile("setmaxnreg.dec.sync.aligned.u32 %0;\n" : : "n"(RegCount));
 }
 } // namespace tl
-#endif
+#endif
```
