
Conversation

@lijinpei lijinpei commented Oct 5, 2025

New contributor declaration

  • [x] I am not making a trivial change, such as fixing a typo in a comment.

  • [x] I have written a PR description following these rules.

  • [x] I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • [x] I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@lijinpei (Contributor Author) commented Oct 5, 2025

Per my benchmarking, the redundant bars have around a 10% performance impact on the included example:
before:

Benchmarking matmul_warp_specialized on hopper
====================================
    K  warp-specialized    cublas
  512            515.24    575.15
 1024            565.71    637.72
 2048            590.59    621.93
 4096            597.59    636.09
 8192            606.03    665.42
16384            629.67    643.29

after:

Benchmarking matmul_warp_specialized on hopper
====================================
    K  warp-specialized    cublas
  512            533.60    572.95
 1024            607.96    635.05
 2048            644.24    657.84
 4096            653.42    664.32
 8192            672.24    679.02
16384            680.23    671.07

@lijinpei (Contributor Author) commented Oct 5, 2025

Some functionality of this PR may be a duplicate of https://github.com/triton-lang/triton/pull/7846/files .
But some other parts are not, like the assumption that mbarrier.try_wait should function as a synchronization. That assumption has a potential problem: other warps are not guaranteed to have reached the mbarrier.try_wait.

@lijinpei lijinpei closed this Oct 5, 2025
@lijinpei (Contributor Author) commented Oct 5, 2025

I have tested #7846 locally; the performance is not as good:

Benchmarking matmul_warp_specialized on hopper
====================================
    K  warp-specialized    cublas
  512            521.02    580.96
 1024            566.28    655.77
 2048            602.45    621.87
 4096            601.08    629.14
 8192            613.34    639.98
16384            623.64    690.13

Adding some printf calls in Membar.cpp shows that some bars between mbarrier.try_wait and mbarrier.arrive are not eliminated.

@ThomasRaoux (Collaborator) left a comment:

let's move the tutorial change to a separate PR

Comment on lines 130 to 131
// FIXME: work-around
return;
Collaborator:

definitely not correct

Contributor Author:

Sorry, committed by mistake.

ptxBuilderTMA.newOperand(barrierMemObj.getBase(), "r"));
tmaInst += "}], [$" + std::to_string(operandIdx++) + "];";
tmaInst +=
"}], [$" + std::to_string(operandIdx++) + "], 1152921504606846976;";
Collaborator:

why do we need this: 1152921504606846976?

Contributor Author:

I got it from cutlass and tile-lang. It seems it's not necessary; removed.

Comment on lines 1345 to 1348
// pred = b.and_(pred, LLVM::NVIDIA::createElectPredicate(loc, rewriter));
// FIXME: is elect.sync slower, and hard to optimize?
auto [laneId, warpId] = getLaneAndWarpId(rewriter, loc);
pred = b.and_(pred, b.icmp_eq(b.i32_val(0), laneId));
Collaborator:

The problem with not using elect is that ptxas will not be able to figure out that the op is uniform, and in general I noticed it will insert a loop around the TMA instruction.
Could you check if that is happening?

@lijinpei (Contributor Author), Oct 6, 2025:

Yes, elect is better than lane 0.
Some TMA SASS instructions are issued on a per-warp basis, so ptxas has to implement the PTX spec with a loop around the underlying SASS instruction. When the TMA instruction is guarded by an elect, the loop is eliminated; however, a lane == 0 guard does not eliminate the loop.
https://godbolt.org/z/TTxaosecn
https://godbolt.org/z/zxvrz6YK8
Unluckily, godbolt has no support for cutlass yet.
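
For reference, here is a minimal CUDA sketch of the two guard styles being compared. The elect path is modeled on the elect_one_sync idiom used by CUTLASS; the kernel bodies and names are hypothetical, and it assumes sm_90:

    #include <cstdint>

    // Elect one lane of the warp via elect.sync (CUTLASS-style idiom).
    __device__ __forceinline__ uint32_t elect_one_sync() {
      uint32_t pred = 0;
      uint32_t laneid = 0;
      asm volatile(
          "{\n"
          ".reg .b32 %%rx;\n"
          ".reg .pred %%px;\n"
          "     elect.sync %%rx|%%px, %2;\n"
          "@%%px mov.s32 %1, 1;\n"
          "     mov.s32 %0, %%rx;\n"
          "}\n"
          : "+r"(laneid), "+r"(pred)
          : "r"(0xFFFFFFFF));
      return pred;
    }

    __global__ void issue_tma_with_elect() {
      if (elect_one_sync()) {
        // TMA issue would go here; ptxas can see the guarded region is
        // warp-uniform, so it does not wrap the instruction in a loop.
      }
    }

    __global__ void issue_tma_with_lane0() {
      // Deriving the lane index from threadIdx hides the uniformity from
      // ptxas, which (per the godbolt links above) can leave a loop around
      // the per-warp TMA instruction.
      if ((threadIdx.x % 32) == 0) {
        // TMA issue would go here.
      }
    }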

Contributor Author:

There is a special register %laneid in PTX which, when compared with zero, generates neither a loop nor an elect.sync instruction:
https://godbolt.org/z/9M5jxGoxh
https://godbolt.org/z/TqGYr9src
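
For completeness, a minimal CUDA sketch of guarding on the %laneid special register directly (the function and kernel names are hypothetical; whether ptxas keeps the guarded TMA loop-free is per the godbolt links above):

    #include <cstdint>

    __device__ __forceinline__ uint32_t read_laneid() {
      uint32_t lane;
      // %laneid is a PTX special register holding this thread's lane index
      // within its warp.
      asm("mov.u32 %0, %%laneid;" : "=r"(lane));
      return lane;
    }

    __global__ void issue_tma_with_laneid_guard() {
      if (read_laneid() == 0) {
        // TMA issue would go here.
      }
    }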

Collaborator:

Interesting, I don't see it being used in cutlass much. Do you have the SASS generated in each case?

Comment on lines +74 to +79
if (auto mbarArriveOp =
dyn_cast<triton::nvidia_gpu::ArriveBarrierOp>(afterOp)) {
auto numWarps = triton::gpu::lookupNumWarps(afterOp);
auto numArrive = mbarArriveOp.getCount();
return numArrive >= numWarps;
}
Collaborator:

how is it possible that we don't need a bar.sync when we have storeOp followed by arrive? We would want to make sure the store is done before the mbarrier is signaled

Contributor Author:

Suppose we give the ArriveBarrierOp release semantics, and the ArriveBarrierOp is executed by all warps in this partition. Then by the time mbarrier.arrive finishes, we know that all warps in this partition have reached this point and the previous smem stores have taken effect. (A fence.proxy.async.shared::cta is needed in between, but it is not the responsibility of this pass to insert it.)
If we want to give ArriveBarrierOp semantics as strong as bar.sync, I think some extra bar.syncs are unavoidable.
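
To make the intended ordering concrete, here is a minimal CUDA/PTX sketch of the store / fence / arrive sequence described above (the function and names are hypothetical, it assumes sm_90, and it uses the plain mbarrier.arrive form, whose default semantics are release at CTA scope):

    #include <cstdint>

    // Every warp of the producing partition executes this: store the result,
    // fence the proxy, then arrive on the shared-memory mbarrier.
    __device__ void store_then_arrive(float* smem_tile, float value,
                                      uint64_t* mbar) {
      // 1. Write the result to shared memory through the generic proxy.
      smem_tile[threadIdx.x] = value;

      // 2. Make the generic-proxy writes visible to the async proxy before
      //    the barrier is signaled.
      asm volatile("fence.proxy.async.shared::cta;" ::: "memory");

      // 3. Arrive on the mbarrier. Once the phase completes, a waiter that
      //    observes the flip also observes the stores above.
      uint32_t mbar_addr =
          static_cast<uint32_t>(__cvta_generic_to_shared(mbar));
      uint64_t state;
      asm volatile("mbarrier.arrive.shared::cta.b64 %0, [%1];"
                   : "=l"(state)
                   : "r"(mbar_addr)
                   : "memory");
    }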

Contributor Author:

For the ArriveBarrierOps around a WarpGroupDotOp (you will have one both before and after it inside a loop), I think the real problem could be that ArriveBarrierOp is modeled as a read and a write to everything:

    // If this op is may be signalling other threads asynchronously, make sure
    // all shared memory transactions are complete beforehand.
    if (isa<triton::nvidia_gpu::ArriveBarrierOp>(op)) {
      Interval<size_t> allIntervals(0, std::numeric_limits<size_t>::max());
      curBlockInfo.syncWriteIntervals[allIntervals].insert(op);
      curBlockInfo.syncReadIntervals[allIntervals].insert(op);
    }

Maybe modeling just the read is enough for a release fence to guarantee that previous stores have finished.



# Helper class for passing arguments around partitions.
@aggregate
Collaborator:

let's move that in a separate PR

OpBuilder *builder) {
if (isa<gpu::BarrierOp, triton::gpu::LocalBarrierOp>(op)) {
if (isa<gpu::BarrierOp, triton::gpu::LocalBarrierOp,
triton::nvidia_gpu::WaitBarrierOp>(op)) {
Collaborator:

I'm not sure WaitBarrierOp always synchronizes the different threads; it is possible that the barrier is passed by some threads but not others, so I think it is wrong to assume it counts as a sync.

Contributor Author:

You are correct that WaitBarrierOp doesn't always sync threads.
There are two aspects to bar.sync:

  • It acts as a memory fence.
  • Once a thread passes the bar, it is guaranteed that all threads have reached the bar, in the sense that their previous loads/stores have taken effect.

I think WaitBarrierOp is more like an acquire fence (see the sketch below):

  • It synchronizes with the TMA data transaction.
  • If the mbarrier is arrived on by all warps in another warp-specialization partition, it guarantees that all warps in the other partition have reached that point.

I will remove this change and think of another way to eliminate my redundant local_sync.
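
A minimal CUDA sketch of the acquire-style wait described above, written as a parity-based mbarrier.try_wait loop (names are hypothetical, assumes sm_90; the default sem/scope of mbarrier.try_wait is acquire at CTA scope):

    #include <cstdint>

    // Spin on mbarrier.try_wait until the barrier's phase flips to
    // phase_parity. The acquire semantics make the arriving warps' prior
    // shared-memory writes visible to this thread, but threads in *this*
    // partition are not guaranteed to have reached this point, unlike
    // bar.sync.
    __device__ void wait_phase(uint64_t* mbar, uint32_t phase_parity) {
      uint32_t mbar_addr =
          static_cast<uint32_t>(__cvta_generic_to_shared(mbar));
      uint32_t complete = 0;
      while (!complete) {
        asm volatile(
            "{\n"
            ".reg .pred P;\n"
            "mbarrier.try_wait.parity.shared::cta.b64 P, [%1], %2;\n"
            "selp.u32 %0, 1, 0, P;\n"
            "}\n"
            : "=r"(complete)
            : "r"(mbar_addr), "r"(phase_parity)
            : "memory");
      }
    }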

Comment on lines +74 to +77
// *        Atomic, Write, Read
// Atomic     F,      T,    T
// Write      T,      T,    T
// Read       T,      T,    F
Collaborator:

this makes sense, better than what I had done in my previous PR

#include "triton/Dialect/TritonGPU/IR/AttrInterfaces.h.inc"
// clang-format on

namespace mlir::MemoryEffects {
Contributor:

Can you add more comments to describe why they are "atomic"?

Also, is atomic the correct terminology to describe the effect?

I understand that these operations are internally "synced" as only a single thread will perform them, but this isn't the same as "atomic".

Contributor Author:

The name MBarAtomic may be changed to something else. It simply means that when multiple threads/warps do mbarrier.arrive/expect_tx/complete_tx without other ordering ("at the same time"), there will be no hazard.
BTW, per my experiment, mbarrier.arrive/expect_tx are not issued on a per-warp basis; only TMA load/store are issued on a per-warp basis. See the godbolt links I posted above.

Contributor:

> It simply means that when multiple threads/warps do mbarrier.arrive/expect_tx/complete_tx without other ordering ("at the same time"), there will be no hazard.

It's due to the fact that these instructions are executed on a per-warp-group basis. In that sense it's kind of "atomic" within each warp group, but I'm still not sure it describes the behavior correctly. Can you add some more comments in the code, either here or in Membar.h?

> BTW, per my experiment, mbarrier.arrive/expect_tx are not issued on a per-warp basis; only TMA load/store are issued on a per-warp basis.

Yes

- `mbarrier.try_wait` has the same effects as a bar.
- Don't insert a bar between mbarrier arrive/expect-tx/etc.
- Distribute `mbarrier.arrive`'s arrive-count across as many warps as
  possible.
- When all warps participate in `mbarrier.arrive`, don't insert a bar
  between it and the previous `wgmma.mma_async` or `stmatrix`.
@lijinpei lijinpei force-pushed the demo-hoppper-gluon-gemm branch from 24d7486 to 6ca49d1 on October 6, 2025 02:43
@lijinpei (Contributor Author) commented Oct 6, 2025

Also, I have found that the mbarrier.init outside the ttg.warp-specialization op has its effects propagated into the warp-specialization partitions, which causes extra local_barriers.
Since there are implicit __syncthreads() before entering and after leaving the ttg.warp-specialization op, this propagation should be stopped.

@ThomasRaoux (Collaborator):

> Also, I have found that the mbarrier.init outside the ttg.warp-specialization op has its effects propagated into the warp-specialization partitions, which causes extra local_barriers. Since there are implicit __syncthreads() before entering and after leaving the ttg.warp-specialization op, this propagation should be stopped.

Yeah, the warp_specialization op can count as a bar.sync.
