Skip to content

[BACKEND] Support bf16 global atomic add on Hopper and Ampere #2708

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

plotfi
Copy link
Contributor

@plotfi plotfi commented Nov 28, 2023

This patch adds support for bf16 global atomic adds on Hopper. If the check for Hopper (sm_90) fails, then a atomic compare and swap pattern is produced (as is the case with the cuda C library's atomicAdd().

For the <sm_90 case, lowering code obtained from the ROCm fork of Triton is used to produce the necessary pattern (which is actually handled by the NVPTX LLVM backend for LLVM IR atomicrmw fadd ptr addrspace(1) %ptr, bfloat %val). The ROCm matchAndRewrite method works in this case because it is generating a generic LLVM IR instruction.

The patterns produced for each Target are as follows:

Hopper:

@%p1 atom.global.gpu.acq_rel.add.noftz.bf16 %rs1, [ %rd1 + 0 ], %rs2;

Ampere:

  $BB:
  shr.u32               %r1, %r2, %r3;
  cvt.u16.u32           %rs1, %r1;
  add.bf16              %rs2, %rs1, %rs1;
  cvt.u32.u16           %r4, %rs2;
  shl.b32               %r5, %r4, %r3;
  and.b32               %r6, %r2, %r9;
  or.b32                %r7, %r6, %r5;
  atom.global.cas.b32   %r8, [%rd1], %r2, %r7;
  setp.ne.s32           %p1, %r8, %r2;
  mov.u32               %r2, %r8;
  @%p1 bra              $BB;

dfukalov and others added 4 commits November 27, 2023 17:53
Added implementation for scalar atomics:
- atomic_add - uint, int and float types
- atomic_min/max - uint and int

atomic_min/max for float is not supported in LLVM 14. We can wait to
moving in upstream to latest LLVM or have to implement these
atomics with inline asm.

Tensor variants and atomic_cas, as well as rest of scalar atomics
will be introduced with next PR(s).

(cherry picked from commit 463591a)
This patch allows for the optional use of the ROCm implementation of atomicrmw
matchAndRewrite. It's initial purpose is for supporting atom.global.add.bf16
fallback on Ampere (which ROCm's generic LLVM IR atomic + NVPTX codegen
supports well).
This patch adds support for bf16 global atomic adds on Hopper. If the check for
Hopper (sm_90) fails, then a atomic compare and swap pattern is produced (as is
the case with the cuda C library's atomicAdd().

For the <sm_90 case, lowering code obtained from the ROCm fork of Triton is used
to produce the necessary pattern (which is actually handled by the NVPTX LLVM
backend for LLVM IR `atomicrmw fadd ptr addrspace(1) %ptr, bfloat %val`). The
ROCm matchAndRewrite method works in this case because it is generating a
generic LLVM IR instruction.

The patterns produced for each Target are as follows:

Hopper:

  `@%p1 atom.global.gpu.acq_rel.add.noftz.bf16 %rs1, [ %rd1 + 0 ], %rs2;`

Ampere:

```
  $BB:
  shr.u32               %r1, %r2, %r3;
  cvt.u16.u32           %rs1, %r1;
  add.bf16              %rs2, %rs1, %rs1;
  cvt.u32.u16           %r4, %rs2;
  shl.b32               %r5, %r4, %r3;
  and.b32               %r6, %r2, %r9;
  or.b32                %r7, %r6, %r5;
  atom.global.cas.b32   %r8, [%rd1], %r2, %r7;
  setp.ne.s32           %p1, %r8, %r2;
  mov.u32               %r2, %r8;
  @%p1 bra              $BB;
```
@plotfi plotfi requested a review from ptillet as a code owner November 28, 2023 04:32
@plotfi
Copy link
Contributor Author

plotfi commented Nov 28, 2023

Still experimenting with some things here, will add test cases soon.

@htyu

Copy link
Collaborator

@jlebar jlebar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thank you for your patch!

Can you clarify, is this ready to be reviewed (other than tests), or are you still working on it?

In any case, I left a few preliminary comments.

default:
return std::nullopt;
}
llvm_unreachable("Invalid RMWOp");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary, since you have a default case anyway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is picked from https://github.com/ROCmSoftwarePlatform/triton, it also would be interesting to try lowering more atomics cases using generic llvm ir in the future.

#define STORE_BF16_AS_BF16 1
#if STORE_BF16_AS_BF16
return FloatType::getBF16(type.getContext());
#else
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would prefer we deleted this if it doesn't work with LLVM as-is today.

if (valElements.size() && !valElements[0].getType().isBF16()) {
assert(false && "atom.add.bf16 fallback requires bf16 stored elements");
} else {
return matchAndRewriteROCm(op, adaptor, rewriter);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be called rewriteUsingCAS or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well so, based on my experimentation it only does an atomic cas because the nvptx backend seems to support only that pattern and not the more compact atom.global.add ptx (this goes for f16 and f32 as well). At the moment on ampere for bf16 there doesn't seem to be anything better I know of that we can do so the nvptx backends lowering is good enough as a fall back I think.

LogicalResult
matchAndRewrite(triton::AtomicRMWOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const override {
#ifdef USE_ROCM
return matchAndRewriteROCm(op, adaptor, rewriter);
#endif
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is not the final state of the patch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. I just need to make sure I'm not breaking other use cases or tests by doing this.

@@ -1157,7 +1315,7 @@ struct AtomicRMWOpConversion
case RMWOp::FADD:
rmwOp = "add";
rmwOp += (valueElemNBits == 16 ? ".noftz" : "");
sTy = "f" + sBits;
sTy = ((isHopper && elementTy.isBF16()) ? "bf" : "f") + sBits;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the isHopper check here. If it's not Hopper, shouldn't it be an error (maybe caught above) to try to use bf16?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I was being a little redundant here.

rewriter.setInsertionPointToEnd(atomicBlock);
auto maybeKind = matchAtomicOp(atomicRmwAttr);
// TODO: use rocdl.raw.buffer.atomic from ROCDL dialect to use efficient
// atomics for MI-* series of AMD GPU.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this code copy-pasted from somewhere? (I didn't review this code carefully yet.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cherry picked from the rocm fork mentioned in other comments.

@plotfi
Copy link
Contributor Author

plotfi commented Nov 28, 2023

Hi, thank you for your patch!

Can you clarify, is this ready to be reviewed (other than tests), or are you still working on it?

In any case, I left a few preliminary comments.

Thanks for the fast comments!

It is ready for review but please note that the fallback path is based on code cherry-picked from https://github.com/ROCmSoftwarePlatform/triton. I'd like to leave that bit of the code alone so that there's as little difference as possible, and I also would like to experiment with possibly moving atomics off of inline ptx assembly in the future too (this would require better much better lowering on the llvm backend side, I suspect in isel).

@jlebar
Copy link
Collaborator

jlebar commented Nov 28, 2023

Got it, thanks for the info!

I don't think I'm comfortable copy-pasting code from the ROCm fork without cleaning it up to "make sense" in the context of mainline. It's no different than copying code from a random other project.

@ptillet
Copy link
Collaborator

ptillet commented Nov 28, 2023

sorry but we can't accept that. The exact same feature was proposed a few months ago and rejected: #1689.

TLDR; I don't see the point of doing this:

  • fp16 atomics are barely useful already, and bf16 has even fewer mantissa bits
  • for H100, people can use elementwise ptx if they are accumulating a small enough number of elements for bf16 to make sense
  • for hardware that doesn't support it natively, the codegen seems unnecessarily specific. Maybe we could talk about atomics that take arbitrary binary functions (although I'd need to see a compelling use case), but baking that in the compiler for standard binary operators seems like a bad idea.

@ptillet ptillet closed this Nov 28, 2023
@plotfi
Copy link
Contributor Author

plotfi commented Nov 28, 2023

  • for hardware that doesn't support it natively, the codegen seems unnecessarily specific. Maybe we could talk about atomics that take arbitrary binary functions (although I'd need to see a compelling use case), but baking that in the compiler for standard binary operators seems like a bad idea.

Does also this mean that it generally it isn't a good idea to replace inline ptx with llvm irgen or that in this case it just doesn't make sense?

@ptillet
Copy link
Collaborator

ptillet commented Nov 28, 2023

I think in this case it is too specific. I think we could potentially allow AtomicRMW to take in an arbitrary associative operator instead of a hardcoded op flag, although I'd have to see a better use case to justify the added complexity

@htyu
Copy link
Collaborator

htyu commented Nov 28, 2023

I think in this case it is too specific. I think we could potentially allow AtomicRMW to take in an arbitrary associative operator instead of a hardcoded op flag, although I'd have to see a better use case to justify the added complexity

Actually we had to work it around in Pytorch (https://github.com/pytorch/pytorch/pull/113204/files#r1387160822) and it'd be nice if it can be supported in Triton.

@jansel
Copy link
Contributor

jansel commented Dec 4, 2023

fp16 atomics are barely useful already, and bf16 has even fewer mantissa bits

The main place where TorchInductor will generate atomics is the backwards of an indirect load (e.g. embeddings and graph neural networks). Embeddings seem pretty common, so it would be nice fp16/bf16 embeddings to be supported.

@harveyp123
Copy link

fp16 atomics are barely useful already, and bf16 has even fewer mantissa bits

The main place where TorchInductor will generate atomics is the backwards of an indirect load (e.g. embeddings and graph neural networks). Embeddings seem pretty common, so it would be nice fp16/bf16 embeddings to be supported.

Agree, it's not only graph neural networks, some kernels like SplitK GEMM also require atomic accumulation, it's generally good to see the BF16 atomic add support for wider user case

plotfi pushed a commit to plotfi/triton that referenced this pull request Apr 3, 2025
…d on Hopper and Ampere

Summary:
This is a cherry pick of D62682155 to Triton OSMETA

Based on OSS patch triton-lang#2708

This adds the ability to do atomics using BF16 types which are less precise but cheaper.

I suspect these are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate.

I also have a comparison of BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG

Manual rebase conflict history:
https://phabricator.intern.facebook.com/D64616323
https://phabricator.intern.facebook.com/D64840725
https://phabricator.intern.facebook.com/D64973444
https://phabricator.intern.facebook.com/D66253680
https://phabricator.intern.facebook.com/D67126772

Test Plan: lit + pytest + test in scripts dir

Reviewers: bertrand, kmanivannan, mren, hoy, plotfi, amainz, aakhundov, #triton_dev

Subscribers: jianyuhuang

Differential Revision: https://phabricator.intern.facebook.com/D69543248

(cherry picked from commit 0db557bf3a1cb4af5bb8c61f424fd9db5a7025c8)
plotfi pushed a commit to facebookexperimental/triton that referenced this pull request Apr 4, 2025
Based on OSS patch triton-lang/triton#2708

This adds the ability to do atomics using BF16 types which are less precise but cheaper.
These are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate.

I also have a comparison of BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to facebookexperimental/triton that referenced this pull request Apr 4, 2025
Based on OSS patch triton-lang/triton#2708

This adds the ability to do atomics using BF16 types which are less precise but cheaper.
These are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate.

I also have a comparison of BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request Apr 10, 2025
Summary:
Based on OSS patch triton-lang#2708

This adds the ability to do atomics using BF16 types which are less precise but cheaper.

I suspect these are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate.

I also have a comparison of BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG

Test Plan: lit + pytest + test in scripts dir

Reviewers: bertrand, kmanivannan, mren, hoy, #triton_dev

Reviewed By: bertrand

Subscribers: binbao, ivankobzarev, jianyuhuang

Differential Revision: https://phabricator.intern.facebook.com/D62682155

HG Revision: aa663178f9c1e769ad2ed25ec131187335c448d1

(cherry picked from commit 88ed6c26cfb0d7d9a41afbf1c690319021255f66)
plotfi added a commit to plotfi/triton that referenced this pull request Apr 11, 2025
Summary:
Based on OSS patch triton-lang#2708

This adds the ability to do atomics using BF16 types which are less precise but cheaper.

I suspect these are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate.

I also have a comparison of BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG

Test Plan: lit + pytest + test in scripts dir

Reviewers: bertrand, kmanivannan, mren, hoy, #triton_dev

Reviewed By: bertrand

Subscribers: binbao, ivankobzarev, jianyuhuang

Differential Revision: https://phabricator.intern.facebook.com/D62682155

HG Revision: aa663178f9c1e769ad2ed25ec131187335c448d1

(cherry picked from commit 88ed6c26cfb0d7d9a41afbf1c690319021255f66)
plotfi added a commit to plotfi/triton that referenced this pull request Apr 15, 2025
Summary:
Based on OSS patch triton-lang#2708

This adds the ability to do atomics using BF16 types which are less precise but cheaper.

I suspect these are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate.

I also have a comparison of BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG

Test Plan: lit + pytest + test in scripts dir

Reviewers: bertrand, kmanivannan, mren, hoy, #triton_dev

Reviewed By: bertrand

Subscribers: binbao, ivankobzarev, jianyuhuang

Differential Revision: https://phabricator.intern.facebook.com/D62682155

HG Revision: aa663178f9c1e769ad2ed25ec131187335c448d1

(cherry picked from commit 88ed6c26cfb0d7d9a41afbf1c690319021255f66)
plotfi added a commit to plotfi/triton that referenced this pull request Apr 16, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for buffer ops on AMDGPU.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request Apr 16, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for buffer ops on AMDGPU.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request Apr 16, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for buffer ops on AMDGPU.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request Apr 16, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for buffer ops on AMDGPU.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request Apr 17, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for buffer ops on AMDGPU.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request Apr 17, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for buffer ops on AMDGPU.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request Apr 17, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for some of the AMD buffer atomics work
(ie BufferAtomicRMWOp) as well a the need for a path to add unit tests
for AMD's backend.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton-facebookexperimental that referenced this pull request Apr 22, 2025
Based on OSS patch triton-lang/triton#2708

This adds the ability to do atomics using BF16 types which are less precise but cheaper.
These are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate.

I also have a comparison of BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
(cherry picked from commit ac6076c)
plotfi added a commit to plotfi/triton-facebookexperimental that referenced this pull request Apr 22, 2025
Based on OSS patch triton-lang/triton#2708

This adds the ability to do atomics using BF16 types which are less precise but cheaper.
These are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate.

I also have a comparison of BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
(cherry picked from commit ac6076c)
plotfi added a commit to plotfi/triton-facebookexperimental that referenced this pull request Apr 25, 2025
Based on OSS patch triton-lang/triton#2708

This adds the ability to do atomics using BF16 types which are less precise but cheaper.
These are useful to have as accumulators because they make Split-K's cheaper to atomically accumulate.

I also have a comparison of BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
(cherry picked from commit ac6076c)
plotfi added a commit to plotfi/triton that referenced this pull request Apr 25, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for some of the AMD buffer atomics work
(ie BufferAtomicRMWOp) as well a the need for a path to add unit tests
for AMD's backend.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request Apr 25, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for some of the AMD buffer atomics work
(ie BufferAtomicRMWOp) as well a the need for a path to add unit tests
for AMD's backend.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request Apr 26, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for some of the AMD buffer atomics work
(ie BufferAtomicRMWOp) as well a the need for a path to add unit tests
for AMD's backend.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request May 1, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for some of the AMD buffer atomics work
(ie BufferAtomicRMWOp) as well a the need for a path to add unit tests
for AMD's backend.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request May 1, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for some of the AMD buffer atomics work
(ie BufferAtomicRMWOp) as well a the need for a path to add unit tests
for AMD's backend.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request May 1, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for some of the AMD buffer atomics work
(ie BufferAtomicRMWOp) as well a the need for a path to add unit tests
for AMD's backend.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request May 1, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for some of the AMD buffer atomics work
(ie BufferAtomicRMWOp) as well a the need for a path to add unit tests
for AMD's backend.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request May 1, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for some of the AMD buffer atomics work
(ie BufferAtomicRMWOp) as well a the need for a path to add unit tests
for AMD's backend.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request May 1, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for some of the AMD buffer atomics work
(ie BufferAtomicRMWOp) as well a the need for a path to add unit tests
for AMD's backend.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request May 1, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for some of the AMD buffer atomics work
(ie BufferAtomicRMWOp) as well a the need for a path to add unit tests
for AMD's backend.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request May 1, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for some of the AMD buffer atomics work
(ie BufferAtomicRMWOp) as well a the need for a path to add unit tests
for AMD's backend.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request May 1, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for some of the AMD buffer atomics work
(ie BufferAtomicRMWOp) as well a the need for a path to add unit tests
for AMD's backend.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request May 2, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for some of the AMD buffer atomics work
(ie BufferAtomicRMWOp) as well a the need for a path to add unit tests
for AMD's backend.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
plotfi added a commit to plotfi/triton that referenced this pull request May 2, 2025
This revives triton-lang#2708 to add
support for atomics using BF16 types which are less precise but cheaper.

BF16 accumulators have proven to be useful in the context of Split-K's
where it is necessary to have cheaper atomic accumulation across two SMs.

BF16 atomics are also needed for some of the AMD buffer atomics work
(ie BufferAtomicRMWOp) as well a the need for a path to add unit tests
for AMD's backend.

BF16 atomics across A100, H100 and MI300 at:

https://godbolt.org/z/jW3EMbxrG
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants