-
Notifications
You must be signed in to change notification settings - Fork 2k
[BACKEND] Support bf16 global atomic add on Hopper and Ampere #2708
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BACKEND] Support bf16 global atomic add on Hopper and Ampere #2708
Conversation
Added implementation for scalar atomics: - atomic_add - uint, int and float types - atomic_min/max - uint and int atomic_min/max for float is not supported in LLVM 14. We can wait to moving in upstream to latest LLVM or have to implement these atomics with inline asm. Tensor variants and atomic_cas, as well as rest of scalar atomics will be introduced with next PR(s). (cherry picked from commit 463591a)
This patch allows for the optional use of the ROCm implementation of atomicrmw matchAndRewrite. It's initial purpose is for supporting atom.global.add.bf16 fallback on Ampere (which ROCm's generic LLVM IR atomic + NVPTX codegen supports well).
This patch adds support for bf16 global atomic adds on Hopper. If the check for Hopper (sm_90) fails, then a atomic compare and swap pattern is produced (as is the case with the cuda C library's atomicAdd(). For the <sm_90 case, lowering code obtained from the ROCm fork of Triton is used to produce the necessary pattern (which is actually handled by the NVPTX LLVM backend for LLVM IR `atomicrmw fadd ptr addrspace(1) %ptr, bfloat %val`). The ROCm matchAndRewrite method works in this case because it is generating a generic LLVM IR instruction. The patterns produced for each Target are as follows: Hopper: `@%p1 atom.global.gpu.acq_rel.add.noftz.bf16 %rs1, [ %rd1 + 0 ], %rs2;` Ampere: ``` $BB: shr.u32 %r1, %r2, %r3; cvt.u16.u32 %rs1, %r1; add.bf16 %rs2, %rs1, %rs1; cvt.u32.u16 %r4, %rs2; shl.b32 %r5, %r4, %r3; and.b32 %r6, %r2, %r9; or.b32 %r7, %r6, %r5; atom.global.cas.b32 %r8, [%rd1], %r2, %r7; setp.ne.s32 %p1, %r8, %r2; mov.u32 %r2, %r8; @%p1 bra $BB; ```
Still experimenting with some things here, will add test cases soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thank you for your patch!
Can you clarify, is this ready to be reviewed (other than tests), or are you still working on it?
In any case, I left a few preliminary comments.
default: | ||
return std::nullopt; | ||
} | ||
llvm_unreachable("Invalid RMWOp"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary, since you have a default
case anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is picked from https://github.com/ROCmSoftwarePlatform/triton, it also would be interesting to try lowering more atomics cases using generic llvm ir in the future.
#define STORE_BF16_AS_BF16 1 | ||
#if STORE_BF16_AS_BF16 | ||
return FloatType::getBF16(type.getContext()); | ||
#else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would prefer we deleted this if it doesn't work with LLVM as-is today.
if (valElements.size() && !valElements[0].getType().isBF16()) { | ||
assert(false && "atom.add.bf16 fallback requires bf16 stored elements"); | ||
} else { | ||
return matchAndRewriteROCm(op, adaptor, rewriter); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be called rewriteUsingCAS or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well so, based on my experimentation it only does an atomic cas because the nvptx backend seems to support only that pattern and not the more compact atom.global.add ptx (this goes for f16 and f32 as well). At the moment on ampere for bf16 there doesn't seem to be anything better I know of that we can do so the nvptx backends lowering is good enough as a fall back I think.
LogicalResult | ||
matchAndRewrite(triton::AtomicRMWOp op, OpAdaptor adaptor, | ||
ConversionPatternRewriter &rewriter) const override { | ||
#ifdef USE_ROCM | ||
return matchAndRewriteROCm(op, adaptor, rewriter); | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is not the final state of the patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. I just need to make sure I'm not breaking other use cases or tests by doing this.
@@ -1157,7 +1315,7 @@ struct AtomicRMWOpConversion | |||
case RMWOp::FADD: | |||
rmwOp = "add"; | |||
rmwOp += (valueElemNBits == 16 ? ".noftz" : ""); | |||
sTy = "f" + sBits; | |||
sTy = ((isHopper && elementTy.isBF16()) ? "bf" : "f") + sBits; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the isHopper
check here. If it's not Hopper, shouldn't it be an error (maybe caught above) to try to use bf16?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I was being a little redundant here.
rewriter.setInsertionPointToEnd(atomicBlock); | ||
auto maybeKind = matchAtomicOp(atomicRmwAttr); | ||
// TODO: use rocdl.raw.buffer.atomic from ROCDL dialect to use efficient | ||
// atomics for MI-* series of AMD GPU. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this code copy-pasted from somewhere? (I didn't review this code carefully yet.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cherry picked from the rocm fork mentioned in other comments.
Thanks for the fast comments! It is ready for review but please note that the fallback path is based on code cherry-picked from https://github.com/ROCmSoftwarePlatform/triton. I'd like to leave that bit of the code alone so that there's as little difference as possible, and I also would like to experiment with possibly moving atomics off of inline ptx assembly in the future too (this would require better much better lowering on the llvm backend side, I suspect in isel). |
Got it, thanks for the info! I don't think I'm comfortable copy-pasting code from the ROCm fork without cleaning it up to "make sense" in the context of mainline. It's no different than copying code from a random other project. |
sorry but we can't accept that. The exact same feature was proposed a few months ago and rejected: #1689. TLDR; I don't see the point of doing this:
|
Does also this mean that it generally it isn't a good idea to replace inline ptx with llvm irgen or that in this case it just doesn't make sense? |
I think in this case it is too specific. I think we could potentially allow |
Actually we had to work it around in Pytorch (https://github.com/pytorch/pytorch/pull/113204/files#r1387160822) and it'd be nice if it can be supported in Triton. |
The main place where TorchInductor will generate atomics is the backwards of an indirect load (e.g. embeddings and graph neural networks). Embeddings seem pretty common, so it would be nice fp16/bf16 embeddings to be supported. |
Agree, it's not only graph neural networks, some kernels like SplitK GEMM also require atomic accumulation, it's generally good to see the BF16 atomic add support for wider user case |
…d on Hopper and Ampere Summary: This is a cherry pick of D62682155 to Triton OSMETA Based on OSS patch triton-lang#2708 This adds the ability to do atomics using BF16 types which are less precise but cheaper. I suspect these are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate. I also have a comparison of BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG Manual rebase conflict history: https://phabricator.intern.facebook.com/D64616323 https://phabricator.intern.facebook.com/D64840725 https://phabricator.intern.facebook.com/D64973444 https://phabricator.intern.facebook.com/D66253680 https://phabricator.intern.facebook.com/D67126772 Test Plan: lit + pytest + test in scripts dir Reviewers: bertrand, kmanivannan, mren, hoy, plotfi, amainz, aakhundov, #triton_dev Subscribers: jianyuhuang Differential Revision: https://phabricator.intern.facebook.com/D69543248 (cherry picked from commit 0db557bf3a1cb4af5bb8c61f424fd9db5a7025c8)
Based on OSS patch triton-lang/triton#2708 This adds the ability to do atomics using BF16 types which are less precise but cheaper. These are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate. I also have a comparison of BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
Based on OSS patch triton-lang/triton#2708 This adds the ability to do atomics using BF16 types which are less precise but cheaper. These are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate. I also have a comparison of BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
Summary: Based on OSS patch triton-lang#2708 This adds the ability to do atomics using BF16 types which are less precise but cheaper. I suspect these are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate. I also have a comparison of BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG Test Plan: lit + pytest + test in scripts dir Reviewers: bertrand, kmanivannan, mren, hoy, #triton_dev Reviewed By: bertrand Subscribers: binbao, ivankobzarev, jianyuhuang Differential Revision: https://phabricator.intern.facebook.com/D62682155 HG Revision: aa663178f9c1e769ad2ed25ec131187335c448d1 (cherry picked from commit 88ed6c26cfb0d7d9a41afbf1c690319021255f66)
Summary: Based on OSS patch triton-lang#2708 This adds the ability to do atomics using BF16 types which are less precise but cheaper. I suspect these are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate. I also have a comparison of BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG Test Plan: lit + pytest + test in scripts dir Reviewers: bertrand, kmanivannan, mren, hoy, #triton_dev Reviewed By: bertrand Subscribers: binbao, ivankobzarev, jianyuhuang Differential Revision: https://phabricator.intern.facebook.com/D62682155 HG Revision: aa663178f9c1e769ad2ed25ec131187335c448d1 (cherry picked from commit 88ed6c26cfb0d7d9a41afbf1c690319021255f66)
Summary: Based on OSS patch triton-lang#2708 This adds the ability to do atomics using BF16 types which are less precise but cheaper. I suspect these are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate. I also have a comparison of BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG Test Plan: lit + pytest + test in scripts dir Reviewers: bertrand, kmanivannan, mren, hoy, #triton_dev Reviewed By: bertrand Subscribers: binbao, ivankobzarev, jianyuhuang Differential Revision: https://phabricator.intern.facebook.com/D62682155 HG Revision: aa663178f9c1e769ad2ed25ec131187335c448d1 (cherry picked from commit 88ed6c26cfb0d7d9a41afbf1c690319021255f66)
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for buffer ops on AMDGPU. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for buffer ops on AMDGPU. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for buffer ops on AMDGPU. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for buffer ops on AMDGPU. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for buffer ops on AMDGPU. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for buffer ops on AMDGPU. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for some of the AMD buffer atomics work (ie BufferAtomicRMWOp) as well a the need for a path to add unit tests for AMD's backend. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
Based on OSS patch triton-lang/triton#2708 This adds the ability to do atomics using BF16 types which are less precise but cheaper. These are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate. I also have a comparison of BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG (cherry picked from commit ac6076c)
Based on OSS patch triton-lang/triton#2708 This adds the ability to do atomics using BF16 types which are less precise but cheaper. These are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate. I also have a comparison of BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG (cherry picked from commit ac6076c)
Based on OSS patch triton-lang/triton#2708 This adds the ability to do atomics using BF16 types which are less precise but cheaper. These are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate. I also have a comparison of BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG (cherry picked from commit ac6076c)
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for some of the AMD buffer atomics work (ie BufferAtomicRMWOp) as well a the need for a path to add unit tests for AMD's backend. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for some of the AMD buffer atomics work (ie BufferAtomicRMWOp) as well a the need for a path to add unit tests for AMD's backend. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for some of the AMD buffer atomics work (ie BufferAtomicRMWOp) as well a the need for a path to add unit tests for AMD's backend. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for some of the AMD buffer atomics work (ie BufferAtomicRMWOp) as well a the need for a path to add unit tests for AMD's backend. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for some of the AMD buffer atomics work (ie BufferAtomicRMWOp) as well a the need for a path to add unit tests for AMD's backend. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for some of the AMD buffer atomics work (ie BufferAtomicRMWOp) as well a the need for a path to add unit tests for AMD's backend. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for some of the AMD buffer atomics work (ie BufferAtomicRMWOp) as well a the need for a path to add unit tests for AMD's backend. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for some of the AMD buffer atomics work (ie BufferAtomicRMWOp) as well a the need for a path to add unit tests for AMD's backend. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for some of the AMD buffer atomics work (ie BufferAtomicRMWOp) as well a the need for a path to add unit tests for AMD's backend. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for some of the AMD buffer atomics work (ie BufferAtomicRMWOp) as well a the need for a path to add unit tests for AMD's backend. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for some of the AMD buffer atomics work (ie BufferAtomicRMWOp) as well a the need for a path to add unit tests for AMD's backend. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for some of the AMD buffer atomics work (ie BufferAtomicRMWOp) as well a the need for a path to add unit tests for AMD's backend. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for some of the AMD buffer atomics work (ie BufferAtomicRMWOp) as well a the need for a path to add unit tests for AMD's backend. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for some of the AMD buffer atomics work (ie BufferAtomicRMWOp) as well a the need for a path to add unit tests for AMD's backend. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG
This patch adds support for bf16 global atomic adds on Hopper. If the check for Hopper (sm_90) fails, then a atomic compare and swap pattern is produced (as is the case with the cuda C library's atomicAdd().
For the <sm_90 case, lowering code obtained from the ROCm fork of Triton is used to produce the necessary pattern (which is actually handled by the NVPTX LLVM backend for LLVM IR
atomicrmw fadd ptr addrspace(1) %ptr, bfloat %val
). The ROCm matchAndRewrite method works in this case because it is generating a generic LLVM IR instruction.The patterns produced for each Target are as follows:
Hopper:
@%p1 atom.global.gpu.acq_rel.add.noftz.bf16 %rs1, [ %rd1 + 0 ], %rs2;
Ampere: