[Backend] Optimize membar insertion on hopper #8374
base: main
Conversation
Per my benchmarking, the redundant bars have around a 10% performance impact on the included example:
after:
Some functionality of this PR may be a duplicate of https://github.com/triton-lang/triton/pull/7846/files .
I have tested #7846 locally; the performance is not as good:
Adding some printf calls in Membar.cpp shows that some bars are not eliminated between mbarrier.try_wait and mbarrier.arrive.
let's move the tutorial change to a separate PR
// FIXME: work-around
return;
definitely not correct
Sorry, committed by mistake.
ptxBuilderTMA.newOperand(barrierMemObj.getBase(), "r"));
tmaInst += "}], [$" + std::to_string(operandIdx++) + "];";
tmaInst +=
    "}], [$" + std::to_string(operandIdx++) + "], 1152921504606846976;";
why do we need this: 1152921504606846976?
I got it from cutlass and tile-lang. It seems it's not necessary. Removed.
// pred = b.and_(pred, LLVM::NVIDIA::createElectPredicate(loc, rewriter));
// FIXME: is elect.sync slower, and hard to optimize?
auto [laneId, warpId] = getLaneAndWarpId(rewriter, loc);
pred = b.and_(pred, b.icmp_eq(b.i32_val(0), laneId));
the problem with not using elect is that ptxas will not be able to figure out that the op is uniform, and in general I noticed it will insert a loop around the TMA instruction.
Could you check if that is happening?
Yes, elect is better than lane 0.
Some TMA SASS instructions are issued on a per-warp basis, so ptxas has to implement the PTX spec with a loop around the underlying SASS instruction. When guarded by an elect, the loop is eliminated; however, lane == 0 won't eliminate the loop.
https://godbolt.org/z/TTxaosecn
https://godbolt.org/z/zxvrz6YK8
Unfortunately, godbolt has no support for cutlass yet.
There is a special register %laneid in PTX which, when compared with zero, generates no loop and no elect.sync instruction:
https://godbolt.org/z/9M5jxGoxh
https://godbolt.org/z/TqGYr9src
interesting, I don't see it being used in cutlass much. Do you have the SASS generated in each case?
if (auto mbarArriveOp =
        dyn_cast<triton::nvidia_gpu::ArriveBarrierOp>(afterOp)) {
  auto numWarps = triton::gpu::lookupNumWarps(afterOp);
  auto numArrive = mbarArriveOp.getCount();
  return numArrive >= numWarps;
}
how is it possible that we don't need a bar.sync when we have a storeOp followed by an arrive? We would want to make sure the store is done before the mbarrier is signaled
Suppose we give the ArriveBarrierOp release semantics, and the ArriveBarrierOp is executed by all warps in this partition. Then by the time mbarrier.arrive finishes, we know that all warps in this partition have reached this point, and the previous smem store has taken effect (a fence.proxy.async.shared::cta is needed in between, but it is not the responsibility of this pass to insert it).
If we want to give ArriveBarrierOp semantics as strong as bar.sync, I think some extra bar.syncs are unavoidable.
For the ArriveBarrierOp around WarpGroupDotOp (you will have both before and after inside a loop), I think the real problem could be that ArriveBarrierOp is modeled as a read/write to everything:
// If this op may be signalling other threads asynchronously, make sure
// all shared memory transactions are complete beforehand.
if (isa<triton::nvidia_gpu::ArriveBarrierOp>(op)) {
  Interval<size_t> allIntervals(0, std::numeric_limits<size_t>::max());
  curBlockInfo.syncWriteIntervals[allIntervals].insert(op);
  curBlockInfo.syncReadIntervals[allIntervals].insert(op);
}
Maybe just a read effect is enough for a release fence to guarantee previous stores finished.
# Helper class for passing arguments around partitions.
@aggregate
let's move that in a separate PR
OpBuilder *builder) {
if (isa<gpu::BarrierOp, triton::gpu::LocalBarrierOp>(op)) {
if (isa<gpu::BarrierOp, triton::gpu::LocalBarrierOp,
        triton::nvidia_gpu::WaitBarrierOp>(op)) {
I'm not sure WaitBarrierOp always synchronizes the different threads; it is possible that the barrier is passed by some threads but not others, so I think it is wrong to assume it counts as a sync
You are correct that WaitBarrierOp doesn't always sync threads.
There are two aspects to bar.sync:
- memory fence
- once a thread passes a bar, it's guaranteed all threads have reached the bar, in the sense that previous loads/stores have taken effect.
I think WaitBarrierOp is more like an acquire fence:
- It synchronizes the TMA data transaction.
- If the mbarrier is arrived on by all warps in another warp-specialization partition, it guarantees that all warps in the other partition have reached that point.
I will remove this change and think of another way to eliminate my redundant local_sync.
// *       Atomic, Write, Read
// Atomic  F,      T,     T
// Write   T,      T,     T
// Read    T,      T,     F
this makes sense, better than what I had done in my previous PR
#include "triton/Dialect/TritonGPU/IR/AttrInterfaces.h.inc"
// clang-format on

namespace mlir::MemoryEffects {
Can you add more comments to describe why they are "atomic"? Also, is atomic the correct terminology to describe the effect? I understand that these operations are internally "synced" since only a single thread will perform them, but this isn't the same as "atomic".
The name MBarAtomic may be changed to something else. It simply means that when multiple threads/warps do mbarrier.arrive/expect_tx/complete_tx without other ordering ("at the same time"), there will be no hazard.
BTW, per my experiment, mbarrier.arrive/expect_tx are not issued on a per-warp basis; only TMA load/store are issued on a per-warp basis. See the godbolt links I posted above.
> It simply means that when multiple threads/warps do mbarrier.arrive/expect_tx/complete_tx without other ordering ("at the same time"), there will be no hazard.

It's due to the fact that these instructions are executed on a per-warp-group basis. In that sense it's kind of "atomic" to each warp group, but I'm still not sure it describes the behavior correctly. Can you add some more comments in the code, either here or in Membar.h.

> BTW, per my experiment, mbarrier.arrive/expect_tx are not issued on a per-warp basis; only TMA load/store are issued on a per-warp basis.

Yes
- `mbarrier.try_wait` has the same effects as a bar.
- Don't insert a bar between mbarrier arrive/expect-tx/etc.
- Distribute `mbarrier.arrive`'s arrive-count across as many warps as possible.
- When all warps participate in `mbarrier.arrive`, don't insert a bar between it and a previous `wgmma.mma_async` or `stmatrix`.
Force-pushed from 24d7486 to 6ca49d1
Also, I have found that the mbarrier.init outside the ttg.warp-specialization op has reached the warp specialization partitions and causes an extra local_barrier.
yeah, the warp_specialize op can count as a bar sync
New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these rules.
- [x] I have run pre-commit run --from-ref origin/main --to-ref HEAD.

Select one of the following.
- [ ] I have added tests.
  - /test for lit tests
  - /unittest for C++ tests
  - /python/test for end-to-end tests
- [ ] This PR does not need a test because FILL THIS IN.

Select one of the following.
- [ ] I have not added any lit tests.
- [ ] The lit tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)