[Backend] Optimize membar insertion on hopper #8374
base: main
Conversation
Per my benchmarking, the redundant bars have around a 10% performance impact on the included example:
after:
Some functionality of this PR may be a duplicate of https://github.com/triton-lang/triton/pull/7846/files .
I have tested #7846 locally; the performance is not as good:
Adding some printf calls in Membar.cpp shows that some bars are not eliminated between mbarrier.try_wait and mbarrier.arrive.
let's move the tutorial change to a separate PR
// FIXME: work-around
return;
definitely not correct
Sorry, committed by mistake.
ptxBuilderTMA.newOperand(barrierMemObj.getBase(), "r"));
tmaInst += "}], [$" + std::to_string(operandIdx++) + "];";
tmaInst +=
    "}], [$" + std::to_string(operandIdx++) + "], 1152921504606846976;";
why do we need this: 1152921504606846976?
I got it from cutlass and tile-lang. It seems it's not necessary. Removed.
// pred = b.and_(pred, LLVM::NVIDIA::createElectPredicate(loc, rewriter));
// FIXME: is elect.sync slower, and hard to optimize?
auto [laneId, warpId] = getLaneAndWarpId(rewriter, loc);
pred = b.and_(pred, b.icmp_eq(b.i32_val(0), laneId));
the problem with not using elect is that ptxas will not be able to figure out that the op is uniform, and in general I noticed it will insert a loop around the TMA instruction.
Could you check if that is happening?
Yes, elect is better than lane 0.
Some TMA SASS instructions are issued on a per-warp basis, so ptxas has to implement the PTX spec with a loop around the underlying SASS instruction. When guarded by an elect, the loop is eliminated; however, lane == 0 won't eliminate the loop.
https://godbolt.org/z/TTxaosecn
https://godbolt.org/z/zxvrz6YK8
Unfortunately, godbolt has no support for cutlass yet.
There is a special register %laneid in PTX which, when compared with zero, generates no loop and no elect.sync instruction:
https://godbolt.org/z/9M5jxGoxh
https://godbolt.org/z/TqGYr9src
interesting, I don't see it being used in cutlass much. Do you have the SASS generated in each case?
if (auto mbarArriveOp =
        dyn_cast<triton::nvidia_gpu::ArriveBarrierOp>(afterOp)) {
  auto numWarps = triton::gpu::lookupNumWarps(afterOp);
  auto numArrive = mbarArriveOp.getCount();
  return numArrive >= numWarps;
}
how is it possible that we don't need a bar.sync when we have a storeOp followed by an arrive? We would want to make sure the store is done before the mbarrier is signaled
Suppose we give the ArriveBarrierOp release semantics, and the ArriveBarrierOp is executed by all warps in this partition. Then by the time mbarrier.arrive finishes, we know that all warps in this partition have reached this point, and the previous smem store has taken effect (a fence.proxy.async.shared::cta is needed in between, but it is not the responsibility of this pass to insert it).
If we want to give ArriveBarrierOp semantics as strong as bar.sync, I think some extra bar.syncs are unavoidable.
For the ArriveBarrierOp around WarpGroupDotOp (you will have both before and after inside a loop), I think the real problem could be that ArriveBarrierOp is modeled as a read/write to everything:
// If this op may be signalling other threads asynchronously, make sure
// all shared memory transactions are complete beforehand.
if (isa<triton::nvidia_gpu::ArriveBarrierOp>(op)) {
  Interval<size_t> allIntervals(0, std::numeric_limits<size_t>::max());
  curBlockInfo.syncWriteIntervals[allIntervals].insert(op);
  curBlockInfo.syncReadIntervals[allIntervals].insert(op);
}
Maybe just a read effect is enough for a release fence to guarantee previous stores finished.
# Helper class for passing arguments around partitions.
@aggregate
let's move that in a separate PR
OpBuilder *builder) {
if (isa<gpu::BarrierOp, triton::gpu::LocalBarrierOp>(op)) {
if (isa<gpu::BarrierOp, triton::gpu::LocalBarrierOp,
        triton::nvidia_gpu::WaitBarrierOp>(op)) {
I'm not sure WaitBarrierOp always synchronizes the different threads; it is possible that the barrier is passed by some threads but not others, so I think it is wrong to assume it counts as a sync
You are correct that WaitBarrierOp doesn't always sync threads.
There are two aspects to bar.sync:
- memory fence
- once a thread passes a bar, it's guaranteed all threads have reached the bar, in the sense that previous loads/stores have taken effect.
I think WaitBarrierOp is more like an acquire fence:
- It synchronizes the TMA data transaction.
- If the mbarrier is arrived on by all warps in another warp-specialization partition, it guarantees that all warps in the other partition have reached that point.
I will remove this change and think of another way to eliminate my redundant local_sync.
// *       Atomic, Write, Read
// Atomic  F,      T,     T
// Write   T,      T,     T
// Read    T,      T,     F
this makes sense, better than what I had done in my previous PR
#include "triton/Dialect/TritonGPU/IR/AttrInterfaces.h.inc"
// clang-format on

namespace mlir::MemoryEffects {
Can you add more comments to describe why they are "atomic"? Also, is atomic the correct terminology to describe the effect? I understand that these operations are internally "synced" since only a single thread will perform them, but this isn't the same as "atomic".
The name MBarAtomic may be changed to something else. It simply means that when multiple threads/warps do mbarrier.arrive/expect_tx/complete_tx without other ordering ("at the same time"), there will be no hazard.
BTW, per my experiment, mbarrier.arrive/expect_tx are not issued on a per-warp basis; only TMA load/store are issued on a per-warp basis. See the godbolt links I posted above.
> It simply means that when multiple threads/warps do mbarrier.arrive/expect_tx/complete_tx without other ordering ("at the same time"), there will be no hazard.

It's due to the fact that these instructions are executed on a per-warp-group basis. In that sense it's kind of "atomic" to each warp group, but I'm still not sure it describes the behavior correctly. Can you add some more comments in the code, either here or in Membar.h.

> BTW, per my experiment, mbarrier.arrive/expect_tx are not issued on a per-warp basis; only TMA load/store are issued on a per-warp basis.

Yes
- `mbarrier.try_wait` has the same effects as a bar.
- Don't insert a bar between mbarrier arrive/expect-tx/etc.
- Distribute `mbarrier.arrive`'s arrive-count across as many warps as possible.
- When all warps participate in `mbarrier.arrive`, don't insert a bar between it and a previous `wgmma.mma_async` or `stmatrix`.
Force-pushed from 24d7486 to 6ca49d1
Also, I have found that the mbarrier.init outside the ttg.warp-specialization op has reached the warp specialization partitions and causes an extra local_barrier.
yeah, the warp_specialize op can count as a bar sync
New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these rules.
- [x] I have run pre-commit run --from-ref origin/main --to-ref HEAD.

Select one of the following.
- [ ] I have added tests.
  - /test for lit tests
  - /unittest for C++ tests
  - /python/test for end-to-end tests
- [ ] This PR does not need a test because FILL THIS IN.

Select one of the following.
- [ ] I have not added any lit tests.
- [ ] The lit tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)