[BACKEND] Support bf16 global atomic add on Hopper and Ampere #2708

plotfi · 2023-11-28T04:32:45Z

This patch adds support for bf16 global atomic adds on Hopper. If the check for Hopper (sm_90) fails, then a atomic compare and swap pattern is produced (as is the case with the cuda C library's atomicAdd().

For the <sm_90 case, lowering code obtained from the ROCm fork of Triton is used to produce the necessary pattern (which is actually handled by the NVPTX LLVM backend for LLVM IR atomicrmw fadd ptr addrspace(1) %ptr, bfloat %val). The ROCm matchAndRewrite method works in this case because it is generating a generic LLVM IR instruction.

The patterns produced for each Target are as follows:

Hopper:

@%p1 atom.global.gpu.acq_rel.add.noftz.bf16 %rs1, [ %rd1 + 0 ], %rs2;

Ampere:

  $BB:
  shr.u32               %r1, %r2, %r3;
  cvt.u16.u32           %rs1, %r1;
  add.bf16              %rs2, %rs1, %rs1;
  cvt.u32.u16           %r4, %rs2;
  shl.b32               %r5, %r4, %r3;
  and.b32               %r6, %r2, %r9;
  or.b32                %r7, %r6, %r5;
  atom.global.cas.b32   %r8, [%rd1], %r2, %r7;
  setp.ne.s32           %p1, %r8, %r2;
  mov.u32               %r2, %r8;
  @%p1 bra              $BB;

Added implementation for scalar atomics: - atomic_add - uint, int and float types - atomic_min/max - uint and int atomic_min/max for float is not supported in LLVM 14. We can wait to moving in upstream to latest LLVM or have to implement these atomics with inline asm. Tensor variants and atomic_cas, as well as rest of scalar atomics will be introduced with next PR(s). (cherry picked from commit 463591a)

Based on changes from: ROCm@841784d1e308ea0aebbde2 ROCm@13facab95f06c7cc9d7700

This patch allows for the optional use of the ROCm implementation of atomicrmw matchAndRewrite. It's initial purpose is for supporting atom.global.add.bf16 fallback on Ampere (which ROCm's generic LLVM IR atomic + NVPTX codegen supports well).

This patch adds support for bf16 global atomic adds on Hopper. If the check for Hopper (sm_90) fails, then a atomic compare and swap pattern is produced (as is the case with the cuda C library's atomicAdd(). For the <sm_90 case, lowering code obtained from the ROCm fork of Triton is used to produce the necessary pattern (which is actually handled by the NVPTX LLVM backend for LLVM IR `atomicrmw fadd ptr addrspace(1) %ptr, bfloat %val`). The ROCm matchAndRewrite method works in this case because it is generating a generic LLVM IR instruction. The patterns produced for each Target are as follows: Hopper: `@%p1 atom.global.gpu.acq_rel.add.noftz.bf16 %rs1, [ %rd1 + 0 ], %rs2;` Ampere: ``` $BB: shr.u32 %r1, %r2, %r3; cvt.u16.u32 %rs1, %r1; add.bf16 %rs2, %rs1, %rs1; cvt.u32.u16 %r4, %rs2; shl.b32 %r5, %r4, %r3; and.b32 %r6, %r2, %r9; or.b32 %r7, %r6, %r5; atom.global.cas.b32 %r8, [%rd1], %r2, %r7; setp.ne.s32 %p1, %r8, %r2; mov.u32 %r2, %r8; @%p1 bra $BB; ```

plotfi · 2023-11-28T04:34:47Z

Still experimenting with some things here, will add test cases soon.

@htyu

jlebar

Hi, thank you for your patch!

Can you clarify, is this ready to be reviewed (other than tests), or are you still working on it?

In any case, I left a few preliminary comments.

jlebar · 2023-11-28T04:58:44Z

lib/Conversion/TritonGPUToLLVM/LoadStoreOpToLLVM.cpp

+    default:
+      return std::nullopt;
+    }
+    llvm_unreachable("Invalid RMWOp");


Is this necessary, since you have a default case anyway?

This is picked from https://github.com/ROCmSoftwarePlatform/triton, it also would be interesting to try lowering more atomics cases using generic llvm ir in the future.

jlebar · 2023-11-28T04:59:36Z

lib/Conversion/TritonGPUToLLVM/TypeConverter.cpp

+    #define STORE_BF16_AS_BF16 1
+    #if STORE_BF16_AS_BF16
+    return FloatType::getBF16(type.getContext());
+    #else


Would prefer we deleted this if it doesn't work with LLVM as-is today.

jlebar · 2023-11-28T05:04:14Z

lib/Conversion/TritonGPUToLLVM/LoadStoreOpToLLVM.cpp

+      if (valElements.size() && !valElements[0].getType().isBF16()) {
+        assert(false && "atom.add.bf16 fallback requires bf16 stored elements");
+      } else {
+        return matchAndRewriteROCm(op, adaptor, rewriter);


Should this be called rewriteUsingCAS or something?

Well so, based on my experimentation it only does an atomic cas because the nvptx backend seems to support only that pattern and not the more compact atom.global.add ptx (this goes for f16 and f32 as well). At the moment on ampere for bf16 there doesn't seem to be anything better I know of that we can do so the nvptx backends lowering is good enough as a fall back I think.

jlebar · 2023-11-28T05:04:26Z

lib/Conversion/TritonGPUToLLVM/LoadStoreOpToLLVM.cpp

  LogicalResult
  matchAndRewrite(triton::AtomicRMWOp op, OpAdaptor adaptor,
                  ConversionPatternRewriter &rewriter) const override {
+    #ifdef USE_ROCM
+    return matchAndRewriteROCm(op, adaptor, rewriter);
+    #endif


I assume this is not the final state of the patch?

Yep. I just need to make sure I'm not breaking other use cases or tests by doing this.

jlebar · 2023-11-28T05:05:34Z

lib/Conversion/TritonGPUToLLVM/LoadStoreOpToLLVM.cpp

@@ -1157,7 +1315,7 @@ struct AtomicRMWOpConversion
      case RMWOp::FADD:
        rmwOp = "add";
        rmwOp += (valueElemNBits == 16 ? ".noftz" : "");
-        sTy = "f" + sBits;
+        sTy = ((isHopper && elementTy.isBF16()) ? "bf" : "f") + sBits;


I don't understand the isHopper check here. If it's not Hopper, shouldn't it be an error (maybe caught above) to try to use bf16?

You're right. I was being a little redundant here.

jlebar · 2023-11-28T05:06:11Z

lib/Conversion/TritonGPUToLLVM/LoadStoreOpToLLVM.cpp

+      rewriter.setInsertionPointToEnd(atomicBlock);
+      auto maybeKind = matchAtomicOp(atomicRmwAttr);
+      // TODO: use rocdl.raw.buffer.atomic from ROCDL dialect to use efficient
+      // atomics for MI-* series of AMD GPU.


Was this code copy-pasted from somewhere? (I didn't review this code carefully yet.)

Cherry picked from the rocm fork mentioned in other comments.

plotfi · 2023-11-28T05:41:07Z

Hi, thank you for your patch!

Can you clarify, is this ready to be reviewed (other than tests), or are you still working on it?

In any case, I left a few preliminary comments.

Thanks for the fast comments!

It is ready for review but please note that the fallback path is based on code cherry-picked from https://github.com/ROCmSoftwarePlatform/triton. I'd like to leave that bit of the code alone so that there's as little difference as possible, and I also would like to experiment with possibly moving atomics off of inline ptx assembly in the future too (this would require better much better lowering on the llvm backend side, I suspect in isel).

jlebar · 2023-11-28T06:14:07Z

Got it, thanks for the info!

I don't think I'm comfortable copy-pasting code from the ROCm fork without cleaning it up to "make sense" in the context of mainline. It's no different than copying code from a random other project.

ptillet · 2023-11-28T06:18:44Z

sorry but we can't accept that. The exact same feature was proposed a few months ago and rejected: #1689.

TLDR; I don't see the point of doing this:

fp16 atomics are barely useful already, and bf16 has even fewer mantissa bits
for H100, people can use elementwise ptx if they are accumulating a small enough number of elements for bf16 to make sense
for hardware that doesn't support it natively, the codegen seems unnecessarily specific. Maybe we could talk about atomics that take arbitrary binary functions (although I'd need to see a compelling use case), but baking that in the compiler for standard binary operators seems like a bad idea.

plotfi · 2023-11-28T06:43:34Z

for hardware that doesn't support it natively, the codegen seems unnecessarily specific. Maybe we could talk about atomics that take arbitrary binary functions (although I'd need to see a compelling use case), but baking that in the compiler for standard binary operators seems like a bad idea.

Does also this mean that it generally it isn't a good idea to replace inline ptx with llvm irgen or that in this case it just doesn't make sense?

ptillet · 2023-11-28T06:52:16Z

I think in this case it is too specific. I think we could potentially allow AtomicRMW to take in an arbitrary associative operator instead of a hardcoded op flag, although I'd have to see a better use case to justify the added complexity

htyu · 2023-11-28T17:20:54Z

I think in this case it is too specific. I think we could potentially allow AtomicRMW to take in an arbitrary associative operator instead of a hardcoded op flag, although I'd have to see a better use case to justify the added complexity

Actually we had to work it around in Pytorch (https://github.com/pytorch/pytorch/pull/113204/files#r1387160822) and it'd be nice if it can be supported in Triton.

jansel · 2023-12-04T22:59:04Z

fp16 atomics are barely useful already, and bf16 has even fewer mantissa bits

The main place where TorchInductor will generate atomics is the backwards of an indirect load (e.g. embeddings and graph neural networks). Embeddings seem pretty common, so it would be nice fp16/bf16 embeddings to be supported.

harveyp123 · 2024-07-01T17:16:57Z

fp16 atomics are barely useful already, and bf16 has even fewer mantissa bits

The main place where TorchInductor will generate atomics is the backwards of an indirect load (e.g. embeddings and graph neural networks). Embeddings seem pretty common, so it would be nice fp16/bf16 embeddings to be supported.

Agree, it's not only graph neural networks, some kernels like SplitK GEMM also require atomic accumulation, it's generally good to see the BF16 atomic add support for wider user case

…d on Hopper and Ampere Summary: This is a cherry pick of D62682155 to Triton OSMETA Based on OSS patch triton-lang#2708 This adds the ability to do atomics using BF16 types which are less precise but cheaper. I suspect these are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate. I also have a comparison of BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG Manual rebase conflict history: https://phabricator.intern.facebook.com/D64616323 https://phabricator.intern.facebook.com/D64840725 https://phabricator.intern.facebook.com/D64973444 https://phabricator.intern.facebook.com/D66253680 https://phabricator.intern.facebook.com/D67126772 Test Plan: lit + pytest + test in scripts dir Reviewers: bertrand, kmanivannan, mren, hoy, plotfi, amainz, aakhundov, #triton_dev Subscribers: jianyuhuang Differential Revision: https://phabricator.intern.facebook.com/D69543248 (cherry picked from commit 0db557bf3a1cb4af5bb8c61f424fd9db5a7025c8)

Based on OSS patch triton-lang/triton#2708 This adds the ability to do atomics using BF16 types which are less precise but cheaper. These are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate. I also have a comparison of BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG

Summary: Based on OSS patch triton-lang#2708 This adds the ability to do atomics using BF16 types which are less precise but cheaper. I suspect these are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate. I also have a comparison of BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG Test Plan: lit + pytest + test in scripts dir Reviewers: bertrand, kmanivannan, mren, hoy, #triton_dev Reviewed By: bertrand Subscribers: binbao, ivankobzarev, jianyuhuang Differential Revision: https://phabricator.intern.facebook.com/D62682155 HG Revision: aa663178f9c1e769ad2ed25ec131187335c448d1 (cherry picked from commit 88ed6c26cfb0d7d9a41afbf1c690319021255f66)

This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for buffer ops on AMDGPU. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG

This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for some of the AMD buffer atomics work (ie BufferAtomicRMWOp) as well a the need for a path to add unit tests for AMD's backend. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG

Based on OSS patch triton-lang/triton#2708 This adds the ability to do atomics using BF16 types which are less precise but cheaper. These are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate. I also have a comparison of BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG (cherry picked from commit ac6076c)

This revives triton-lang#2708 to add support for atomics using BF16 types which are less precise but cheaper. BF16 accumulators have proven to be useful in the context of Split-K's where it is necessary to have cheaper atomic accumulation across two SMs. BF16 atomics are also needed for some of the AMD buffer atomics work (ie BufferAtomicRMWOp) as well a the need for a path to add unit tests for AMD's backend. BF16 atomics across A100, H100 and MI300 at: https://godbolt.org/z/jW3EMbxrG

dfukalov and others added 4 commits November 27, 2023 17:53

[BACKEND] Adding changes AtomicRMW updates from ROCm fork

870e4b6

Based on changes from: ROCm@841784d1e308ea0aebbde2 ROCm@13facab95f06c7cc9d7700

plotfi requested a review from ptillet as a code owner November 28, 2023 04:32

jlebar reviewed Nov 28, 2023

View reviewed changes

ptillet closed this Nov 28, 2023

This was referenced Apr 2, 2024

[Inductor] atomic_add does not support bf16 pytorch/pytorch#97016

Open

Bf16 with tl.dot and tl.atomic_add #2834

Open

daviswer mentioned this pull request Jul 10, 2024

Telescoping cache precision/throughput issues foundation-model-stack/avengers#1

Open

plotfi mentioned this pull request Oct 25, 2024

[AMD] Emit vectorized 16-bit float LLVM atomic ops #4925

Merged

This was referenced Dec 3, 2024

[kernel] introduce fused_moe_triton_splitk to sglang sgl-project/sglang#2333

Closed

[kernel] introduce fused_moe_triton_splitk optimization sgl-project/sglang#2334

Closed

plotfi mentioned this pull request Apr 22, 2025

Support bf16 global atomic add on Hopper (only) facebookexperimental/triton#99

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BACKEND] Support bf16 global atomic add on Hopper and Ampere #2708

[BACKEND] Support bf16 global atomic add on Hopper and Ampere #2708

plotfi commented Nov 28, 2023 •

edited

Loading

plotfi commented Nov 28, 2023

jlebar left a comment

jlebar Nov 28, 2023

plotfi Nov 28, 2023

jlebar Nov 28, 2023

jlebar Nov 28, 2023

plotfi Nov 28, 2023

jlebar Nov 28, 2023

plotfi Nov 28, 2023

jlebar Nov 28, 2023

plotfi Nov 28, 2023

jlebar Nov 28, 2023

plotfi Nov 28, 2023

plotfi commented Nov 28, 2023

jlebar commented Nov 28, 2023

ptillet commented Nov 28, 2023 •

edited

Loading

plotfi commented Nov 28, 2023

ptillet commented Nov 28, 2023

htyu commented Nov 28, 2023

jansel commented Dec 4, 2023

harveyp123 commented Jul 1, 2024

[BACKEND] Support bf16 global atomic add on Hopper and Ampere #2708

[BACKEND] Support bf16 global atomic add on Hopper and Ampere #2708

Conversation

plotfi commented Nov 28, 2023 • edited Loading

plotfi commented Nov 28, 2023

jlebar left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

plotfi commented Nov 28, 2023

jlebar commented Nov 28, 2023

ptillet commented Nov 28, 2023 • edited Loading

plotfi commented Nov 28, 2023

ptillet commented Nov 28, 2023

htyu commented Nov 28, 2023

jansel commented Dec 4, 2023

harveyp123 commented Jul 1, 2024

plotfi commented Nov 28, 2023 •

edited

Loading

ptillet commented Nov 28, 2023 •

edited

Loading