
[AMDGPU][AtomicExpand] Use full flat emulation if a target supports f64 global atomic add instruction #142859


Merged

Conversation

Contributor

@shiltian shiltian commented Jun 4, 2025

If a target supports the f64 global atomic add instruction, we can also use full flat emulation.


@shiltian shiltian requested review from arsenm, Pierre-vh and rampitec June 4, 2025 21:32
Member

llvmbot commented Jun 4, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes

If a target supports the f64 atomic add instruction, we can also use full flat emulation.
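
For context, a minimal sketch of the kind of flat-address-space f64 atomicrmw this change affects (the function and value names are illustrative, not taken from the patch):

; Hypothetical example; names are illustrative. On targets with a global
; f64 atomic add (Subtarget->hasFlatBufferGlobalAtomicFaddF64Inst(),
; e.g. gfx90a and gfx942 in the tests below), AtomicExpand now emits the
; full three-way address-space dispatch: shared memory takes
; ds_add_rtn_f64, private memory takes a scratch load/add/store, and
; global memory takes global_atomic_add_f64, instead of a private-only
; check around a flat instruction.
define double @flat_fadd_f64(ptr %p, double %v) {
  %old = atomicrmw fadd ptr %p, double %v syncscope("agent") seq_cst, align 8
  ret double %old
}

The f32 path guarded by hasAtomicFaddInsts() is unchanged; the new clause only widens the condition to cover f64.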


Patch is 109.78 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/142859.diff

7 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+5-3)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll (+105-47)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll (+348-108)
  • (modified) llvm/test/CodeGen/AMDGPU/gep-const-address-space.ll (+31-12)
  • (modified) llvm/test/CodeGen/AMDGPU/infer-addrspace-flat-atomic.ll (+29-10)
  • (modified) llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-rmw-fadd.ll (+606-97)
  • (modified) llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomicrmw-flat-noalias-addrspace.ll (+22-6)
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 1957e442dbabb..7575510dd7f98 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -17541,9 +17541,11 @@ void SITargetLowering::emitExpandAtomicAddrSpacePredicate(
   // where we only insert a check for private and still use the flat instruction
   // for global and shared.
 
-  bool FullFlatEmulation = RMW && RMW->getOperation() == AtomicRMWInst::FAdd &&
-                           Subtarget->hasAtomicFaddInsts() &&
-                           RMW->getType()->isFloatTy();
+  bool FullFlatEmulation =
+      RMW && RMW->getOperation() == AtomicRMWInst::FAdd &&
+      ((Subtarget->hasAtomicFaddInsts() && RMW->getType()->isFloatTy()) ||
+       (Subtarget->hasFlatBufferGlobalAtomicFaddF64Inst() &&
+        RMW->getType()->isDoubleTy()));
 
   // If the return value isn't used, do not introduce a false use in the phi.
   bool ReturnValueIsUsed = !AI->use_empty();
diff --git a/llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll b/llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll
index 1669909e96eb1..231f53d7f3710 100644
--- a/llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll
@@ -799,6 +799,31 @@ define double @optnone_atomicrmw_fadd_f64_expand(double %val) #1 {
 ; GFX90A-LABEL: optnone_atomicrmw_fadd_f64_expand:
 ; GFX90A:       ; %bb.0:
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:    s_mov_b64 s[4:5], src_shared_base
+; GFX90A-NEXT:    s_mov_b32 s6, 32
+; GFX90A-NEXT:    s_lshr_b64 s[4:5], s[4:5], s6
+; GFX90A-NEXT:    s_getpc_b64 s[6:7]
+; GFX90A-NEXT:    s_add_u32 s6, s6, global@rel32@lo+4
+; GFX90A-NEXT:    s_addc_u32 s7, s7, global@rel32@hi+12
+; GFX90A-NEXT:    s_cmp_eq_u32 s7, s4
+; GFX90A-NEXT:    s_cselect_b64 s[4:5], -1, 0
+; GFX90A-NEXT:    v_cndmask_b32_e64 v2, 0, 1, s[4:5]
+; GFX90A-NEXT:    s_mov_b64 s[4:5], -1
+; GFX90A-NEXT:    s_mov_b32 s6, 1
+; GFX90A-NEXT:    v_cmp_ne_u32_e64 s[6:7], v2, s6
+; GFX90A-NEXT:    s_and_b64 vcc, exec, s[6:7]
+; GFX90A-NEXT:    ; implicit-def: $vgpr2_vgpr3
+; GFX90A-NEXT:    s_cbranch_vccnz .LBB5_3
+; GFX90A-NEXT:  .LBB5_1: ; %Flow4
+; GFX90A-NEXT:    v_cndmask_b32_e64 v4, 0, 1, s[4:5]
+; GFX90A-NEXT:    s_mov_b32 s4, 1
+; GFX90A-NEXT:    v_cmp_ne_u32_e64 s[4:5], v4, s4
+; GFX90A-NEXT:    s_and_b64 vcc, exec, s[4:5]
+; GFX90A-NEXT:    s_cbranch_vccnz .LBB5_10
+; GFX90A-NEXT:  ; %bb.2: ; %atomicrmw.shared
+; GFX90A-NEXT:    ds_add_rtn_f64 v[2:3], v0, v[0:1]
+; GFX90A-NEXT:    s_branch .LBB5_10
+; GFX90A-NEXT:  .LBB5_3: ; %atomicrmw.check.private
 ; GFX90A-NEXT:    s_mov_b64 s[4:5], src_private_base
 ; GFX90A-NEXT:    s_mov_b32 s6, 32
 ; GFX90A-NEXT:    s_lshr_b64 s[4:5], s[4:5], s6
@@ -813,50 +838,54 @@ define double @optnone_atomicrmw_fadd_f64_expand(double %val) #1 {
 ; GFX90A-NEXT:    v_cmp_ne_u32_e64 s[6:7], v2, s6
 ; GFX90A-NEXT:    s_and_b64 vcc, exec, s[6:7]
 ; GFX90A-NEXT:    ; implicit-def: $vgpr2_vgpr3
-; GFX90A-NEXT:    s_cbranch_vccnz .LBB5_2
-; GFX90A-NEXT:    s_branch .LBB5_3
-; GFX90A-NEXT:  .LBB5_1: ; %atomicrmw.private
+; GFX90A-NEXT:    s_cbranch_vccnz .LBB5_5
+; GFX90A-NEXT:    s_branch .LBB5_6
+; GFX90A-NEXT:  .LBB5_4: ; %atomicrmw.private
 ; GFX90A-NEXT:    buffer_load_dword v2, v0, s[0:3], 0 offen
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    v_mov_b32_e32 v3, v2
-; GFX90A-NEXT:    v_add_f64 v[0:1], v[2:3], v[0:1]
-; GFX90A-NEXT:    buffer_store_dword v1, v0, s[0:3], 0 offen
-; GFX90A-NEXT:    buffer_store_dword v0, v0, s[0:3], 0 offen
-; GFX90A-NEXT:    s_branch .LBB5_6
-; GFX90A-NEXT:  .LBB5_2: ; %atomicrmw.global
+; GFX90A-NEXT:    v_add_f64 v[4:5], v[2:3], v[0:1]
+; GFX90A-NEXT:    buffer_store_dword v5, v0, s[0:3], 0 offen
+; GFX90A-NEXT:    buffer_store_dword v4, v0, s[0:3], 0 offen
+; GFX90A-NEXT:    s_branch .LBB5_9
+; GFX90A-NEXT:  .LBB5_5: ; %atomicrmw.global
+; GFX90A-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX90A-NEXT:    s_getpc_b64 s[4:5]
 ; GFX90A-NEXT:    s_add_u32 s4, s4, global@rel32@lo+4
 ; GFX90A-NEXT:    s_addc_u32 s5, s5, global@rel32@hi+12
-; GFX90A-NEXT:    v_pk_mov_b32 v[2:3], s[4:5], s[4:5] op_sel:[0,1]
-; GFX90A-NEXT:    flat_load_dwordx2 v[2:3], v[2:3]
+; GFX90A-NEXT:    global_load_dwordx2 v[2:3], v2, s[4:5]
 ; GFX90A-NEXT:    s_mov_b64 s[4:5], 0
-; GFX90A-NEXT:    s_branch .LBB5_4
-; GFX90A-NEXT:  .LBB5_3: ; %Flow
+; GFX90A-NEXT:    s_branch .LBB5_7
+; GFX90A-NEXT:  .LBB5_6: ; %Flow
 ; GFX90A-NEXT:    s_and_b64 vcc, exec, s[4:5]
-; GFX90A-NEXT:    s_cbranch_vccnz .LBB5_1
-; GFX90A-NEXT:    s_branch .LBB5_6
-; GFX90A-NEXT:  .LBB5_4: ; %atomicrmw.start
+; GFX90A-NEXT:    s_cbranch_vccnz .LBB5_4
+; GFX90A-NEXT:    s_branch .LBB5_9
+; GFX90A-NEXT:  .LBB5_7: ; %atomicrmw.start
 ; GFX90A-NEXT:    ; =>This Inner Loop Header: Depth=1
-; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    v_pk_mov_b32 v[4:5], v[2:3], v[2:3] op_sel:[0,1]
 ; GFX90A-NEXT:    v_add_f64 v[2:3], v[4:5], v[0:1]
 ; GFX90A-NEXT:    s_getpc_b64 s[6:7]
 ; GFX90A-NEXT:    s_add_u32 s6, s6, global@rel32@lo+4
 ; GFX90A-NEXT:    s_addc_u32 s7, s7, global@rel32@hi+12
-; GFX90A-NEXT:    v_pk_mov_b32 v[6:7], s[6:7], s[6:7] op_sel:[0,1]
-; GFX90A-NEXT:    flat_atomic_cmpswap_x2 v[2:3], v[6:7], v[2:5] glc
-; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:    v_mov_b32_e32 v6, 0
+; GFX90A-NEXT:    global_atomic_cmpswap_x2 v[2:3], v6, v[2:5], s[6:7] glc
+; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    v_cmp_eq_u64_e64 s[6:7], v[2:3], v[4:5]
 ; GFX90A-NEXT:    s_or_b64 s[4:5], s[6:7], s[4:5]
 ; GFX90A-NEXT:    s_andn2_b64 exec, exec, s[4:5]
-; GFX90A-NEXT:    s_cbranch_execnz .LBB5_4
-; GFX90A-NEXT:  ; %bb.5: ; %atomicrmw.end1
+; GFX90A-NEXT:    s_cbranch_execnz .LBB5_7
+; GFX90A-NEXT:  ; %bb.8: ; %atomicrmw.end1
 ; GFX90A-NEXT:    s_or_b64 exec, exec, s[4:5]
 ; GFX90A-NEXT:    s_mov_b64 s[4:5], 0
-; GFX90A-NEXT:    s_branch .LBB5_3
-; GFX90A-NEXT:  .LBB5_6: ; %atomicrmw.phi
-; GFX90A-NEXT:  ; %bb.7: ; %atomicrmw.end
+; GFX90A-NEXT:    s_branch .LBB5_6
+; GFX90A-NEXT:  .LBB5_9: ; %Flow3
+; GFX90A-NEXT:    s_mov_b64 s[4:5], 0
+; GFX90A-NEXT:    s_branch .LBB5_1
+; GFX90A-NEXT:  .LBB5_10: ; %atomicrmw.phi
+; GFX90A-NEXT:  ; %bb.11: ; %atomicrmw.end
 ; GFX90A-NEXT:    s_mov_b32 s4, 32
+; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX90A-NEXT:    v_lshrrev_b64 v[4:5], s4, v[2:3]
 ; GFX90A-NEXT:    v_mov_b32_e32 v0, v2
 ; GFX90A-NEXT:    v_mov_b32_e32 v1, v4
@@ -866,6 +895,31 @@ define double @optnone_atomicrmw_fadd_f64_expand(double %val) #1 {
 ; GFX942-LABEL: optnone_atomicrmw_fadd_f64_expand:
 ; GFX942:       ; %bb.0:
 ; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX942-NEXT:    s_mov_b64 s[0:1], src_shared_base
+; GFX942-NEXT:    s_mov_b32 s2, 32
+; GFX942-NEXT:    s_lshr_b64 s[0:1], s[0:1], s2
+; GFX942-NEXT:    s_getpc_b64 s[2:3]
+; GFX942-NEXT:    s_add_u32 s2, s2, global@rel32@lo+4
+; GFX942-NEXT:    s_addc_u32 s3, s3, global@rel32@hi+12
+; GFX942-NEXT:    s_cmp_eq_u32 s3, s0
+; GFX942-NEXT:    s_cselect_b64 s[0:1], -1, 0
+; GFX942-NEXT:    v_cndmask_b32_e64 v2, 0, 1, s[0:1]
+; GFX942-NEXT:    s_mov_b64 s[0:1], -1
+; GFX942-NEXT:    s_mov_b32 s2, 1
+; GFX942-NEXT:    v_cmp_ne_u32_e64 s[2:3], v2, s2
+; GFX942-NEXT:    s_and_b64 vcc, exec, s[2:3]
+; GFX942-NEXT:    ; implicit-def: $vgpr2_vgpr3
+; GFX942-NEXT:    s_cbranch_vccnz .LBB5_3
+; GFX942-NEXT:  .LBB5_1: ; %Flow4
+; GFX942-NEXT:    v_cndmask_b32_e64 v4, 0, 1, s[0:1]
+; GFX942-NEXT:    s_mov_b32 s0, 1
+; GFX942-NEXT:    v_cmp_ne_u32_e64 s[0:1], v4, s0
+; GFX942-NEXT:    s_and_b64 vcc, exec, s[0:1]
+; GFX942-NEXT:    s_cbranch_vccnz .LBB5_10
+; GFX942-NEXT:  ; %bb.2: ; %atomicrmw.shared
+; GFX942-NEXT:    ds_add_rtn_f64 v[2:3], v0, v[0:1]
+; GFX942-NEXT:    s_branch .LBB5_10
+; GFX942-NEXT:  .LBB5_3: ; %atomicrmw.check.private
 ; GFX942-NEXT:    s_mov_b64 s[0:1], src_private_base
 ; GFX942-NEXT:    s_mov_b32 s2, 32
 ; GFX942-NEXT:    s_lshr_b64 s[0:1], s[0:1], s2
@@ -880,48 +934,52 @@ define double @optnone_atomicrmw_fadd_f64_expand(double %val) #1 {
 ; GFX942-NEXT:    v_cmp_ne_u32_e64 s[2:3], v2, s2
 ; GFX942-NEXT:    s_and_b64 vcc, exec, s[2:3]
 ; GFX942-NEXT:    ; implicit-def: $vgpr2_vgpr3
-; GFX942-NEXT:    s_cbranch_vccnz .LBB5_2
-; GFX942-NEXT:    s_branch .LBB5_3
-; GFX942-NEXT:  .LBB5_1: ; %atomicrmw.private
+; GFX942-NEXT:    s_cbranch_vccnz .LBB5_5
+; GFX942-NEXT:    s_branch .LBB5_6
+; GFX942-NEXT:  .LBB5_4: ; %atomicrmw.private
 ; GFX942-NEXT:    scratch_load_dwordx2 v[2:3], off, s0
 ; GFX942-NEXT:    s_waitcnt vmcnt(0)
-; GFX942-NEXT:    v_add_f64 v[0:1], v[2:3], v[0:1]
-; GFX942-NEXT:    scratch_store_dwordx2 off, v[0:1], s0
-; GFX942-NEXT:    s_branch .LBB5_6
-; GFX942-NEXT:  .LBB5_2: ; %atomicrmw.global
+; GFX942-NEXT:    v_add_f64 v[4:5], v[2:3], v[0:1]
+; GFX942-NEXT:    scratch_store_dwordx2 off, v[4:5], s0
+; GFX942-NEXT:    s_branch .LBB5_9
+; GFX942-NEXT:  .LBB5_5: ; %atomicrmw.global
+; GFX942-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX942-NEXT:    s_getpc_b64 s[0:1]
 ; GFX942-NEXT:    s_add_u32 s0, s0, global@rel32@lo+4
 ; GFX942-NEXT:    s_addc_u32 s1, s1, global@rel32@hi+12
-; GFX942-NEXT:    v_mov_b64_e32 v[2:3], s[0:1]
-; GFX942-NEXT:    flat_load_dwordx2 v[2:3], v[2:3]
+; GFX942-NEXT:    global_load_dwordx2 v[2:3], v2, s[0:1]
 ; GFX942-NEXT:    s_mov_b64 s[0:1], 0
-; GFX942-NEXT:    s_branch .LBB5_4
-; GFX942-NEXT:  .LBB5_3: ; %Flow
+; GFX942-NEXT:    s_branch .LBB5_7
+; GFX942-NEXT:  .LBB5_6: ; %Flow
 ; GFX942-NEXT:    s_and_b64 vcc, exec, s[0:1]
-; GFX942-NEXT:    s_cbranch_vccnz .LBB5_1
-; GFX942-NEXT:    s_branch .LBB5_6
-; GFX942-NEXT:  .LBB5_4: ; %atomicrmw.start
+; GFX942-NEXT:    s_cbranch_vccnz .LBB5_4
+; GFX942-NEXT:    s_branch .LBB5_9
+; GFX942-NEXT:  .LBB5_7: ; %atomicrmw.start
 ; GFX942-NEXT:    ; =>This Inner Loop Header: Depth=1
-; GFX942-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX942-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-NEXT:    v_mov_b64_e32 v[4:5], v[2:3]
 ; GFX942-NEXT:    v_add_f64 v[2:3], v[4:5], v[0:1]
 ; GFX942-NEXT:    s_getpc_b64 s[2:3]
 ; GFX942-NEXT:    s_add_u32 s2, s2, global@rel32@lo+4
 ; GFX942-NEXT:    s_addc_u32 s3, s3, global@rel32@hi+12
-; GFX942-NEXT:    v_mov_b64_e32 v[6:7], s[2:3]
-; GFX942-NEXT:    flat_atomic_cmpswap_x2 v[2:3], v[6:7], v[2:5] sc0 sc1
-; GFX942-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX942-NEXT:    v_mov_b32_e32 v6, 0
+; GFX942-NEXT:    global_atomic_cmpswap_x2 v[2:3], v6, v[2:5], s[2:3] sc0 sc1
+; GFX942-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-NEXT:    v_cmp_eq_u64_e64 s[2:3], v[2:3], v[4:5]
 ; GFX942-NEXT:    s_or_b64 s[0:1], s[2:3], s[0:1]
 ; GFX942-NEXT:    s_andn2_b64 exec, exec, s[0:1]
-; GFX942-NEXT:    s_cbranch_execnz .LBB5_4
-; GFX942-NEXT:  ; %bb.5: ; %atomicrmw.end1
+; GFX942-NEXT:    s_cbranch_execnz .LBB5_7
+; GFX942-NEXT:  ; %bb.8: ; %atomicrmw.end1
 ; GFX942-NEXT:    s_or_b64 exec, exec, s[0:1]
 ; GFX942-NEXT:    s_mov_b64 s[0:1], 0
-; GFX942-NEXT:    s_branch .LBB5_3
-; GFX942-NEXT:  .LBB5_6: ; %atomicrmw.phi
-; GFX942-NEXT:  ; %bb.7: ; %atomicrmw.end
+; GFX942-NEXT:    s_branch .LBB5_6
+; GFX942-NEXT:  .LBB5_9: ; %Flow3
+; GFX942-NEXT:    s_mov_b64 s[0:1], 0
+; GFX942-NEXT:    s_branch .LBB5_1
+; GFX942-NEXT:  .LBB5_10: ; %atomicrmw.phi
+; GFX942-NEXT:  ; %bb.11: ; %atomicrmw.end
 ; GFX942-NEXT:    s_mov_b32 s0, 32
+; GFX942-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX942-NEXT:    v_lshrrev_b64 v[4:5], s0, v[2:3]
 ; GFX942-NEXT:    v_mov_b32_e32 v0, v2
 ; GFX942-NEXT:    v_mov_b32_e32 v1, v4
diff --git a/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll b/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll
index e13c895a1cc85..cfe4d24d427e7 100644
--- a/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll
+++ b/llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll
@@ -5758,29 +5758,38 @@ define double @flat_agent_atomic_fadd_ret_f64__amdgpu_no_fine_grained_memory(ptr
 ; GFX942:       ; %bb.0:
 ; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX942-NEXT:    v_mov_b32_e32 v5, v1
-; GFX942-NEXT:    s_mov_b64 s[0:1], src_private_base
+; GFX942-NEXT:    s_mov_b64 s[0:1], src_shared_base
 ; GFX942-NEXT:    v_mov_b32_e32 v4, v0
 ; GFX942-NEXT:    v_cmp_ne_u32_e32 vcc, s1, v5
 ; GFX942-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; GFX942-NEXT:    s_and_saveexec_b64 s[0:1], vcc
 ; GFX942-NEXT:    s_xor_b64 s[0:1], exec, s[0:1]
 ; GFX942-NEXT:    s_cbranch_execnz .LBB30_3
-; GFX942-NEXT:  ; %bb.1: ; %Flow
+; GFX942-NEXT:  ; %bb.1: ; %Flow2
 ; GFX942-NEXT:    s_andn2_saveexec_b64 s[0:1], s[0:1]
-; GFX942-NEXT:    s_cbranch_execnz .LBB30_4
+; GFX942-NEXT:    s_cbranch_execnz .LBB30_8
 ; GFX942-NEXT:  .LBB30_2: ; %atomicrmw.phi
 ; GFX942-NEXT:    s_or_b64 exec, exec, s[0:1]
+; GFX942-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-NEXT:    s_setpc_b64 s[30:31]
-; GFX942-NEXT:  .LBB30_3: ; %atomicrmw.global
+; GFX942-NEXT:  .LBB30_3: ; %atomicrmw.check.private
+; GFX942-NEXT:    s_mov_b64 s[2:3], src_private_base
+; GFX942-NEXT:    v_cmp_ne_u32_e32 vcc, s3, v5
+; GFX942-NEXT:    ; implicit-def: $vgpr0_vgpr1
+; GFX942-NEXT:    s_and_saveexec_b64 s[2:3], vcc
+; GFX942-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
+; GFX942-NEXT:    s_cbranch_execz .LBB30_5
+; GFX942-NEXT:  ; %bb.4: ; %atomicrmw.global
 ; GFX942-NEXT:    buffer_wbl2 sc1
-; GFX942-NEXT:    flat_atomic_add_f64 v[0:1], v[4:5], v[2:3] sc0
-; GFX942-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX942-NEXT:    global_atomic_add_f64 v[0:1], v[4:5], v[2:3], off sc0
+; GFX942-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-NEXT:    buffer_inv sc1
 ; GFX942-NEXT:    ; implicit-def: $vgpr4_vgpr5
 ; GFX942-NEXT:    ; implicit-def: $vgpr2_vgpr3
-; GFX942-NEXT:    s_andn2_saveexec_b64 s[0:1], s[0:1]
-; GFX942-NEXT:    s_cbranch_execz .LBB30_2
-; GFX942-NEXT:  .LBB30_4: ; %atomicrmw.private
+; GFX942-NEXT:  .LBB30_5: ; %Flow
+; GFX942-NEXT:    s_andn2_saveexec_b64 s[2:3], s[2:3]
+; GFX942-NEXT:    s_cbranch_execz .LBB30_7
+; GFX942-NEXT:  ; %bb.6: ; %atomicrmw.private
 ; GFX942-NEXT:    v_cmp_ne_u64_e32 vcc, 0, v[4:5]
 ; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32_e32 v4, -1, v4, vcc
@@ -5788,6 +5797,18 @@ define double @flat_agent_atomic_fadd_ret_f64__amdgpu_no_fine_grained_memory(ptr
 ; GFX942-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-NEXT:    v_add_f64 v[2:3], v[0:1], v[2:3]
 ; GFX942-NEXT:    scratch_store_dwordx2 v4, v[2:3], off
+; GFX942-NEXT:  .LBB30_7: ; %Flow1
+; GFX942-NEXT:    s_or_b64 exec, exec, s[2:3]
+; GFX942-NEXT:    ; implicit-def: $vgpr4_vgpr5
+; GFX942-NEXT:    ; implicit-def: $vgpr2_vgpr3
+; GFX942-NEXT:    s_andn2_saveexec_b64 s[0:1], s[0:1]
+; GFX942-NEXT:    s_cbranch_execz .LBB30_2
+; GFX942-NEXT:  .LBB30_8: ; %atomicrmw.shared
+; GFX942-NEXT:    v_cmp_ne_u64_e32 vcc, 0, v[4:5]
+; GFX942-NEXT:    s_nop 1
+; GFX942-NEXT:    v_cndmask_b32_e32 v0, -1, v4, vcc
+; GFX942-NEXT:    ds_add_rtn_f64 v[0:1], v0, v[2:3]
+; GFX942-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX942-NEXT:    s_or_b64 exec, exec, s[0:1]
 ; GFX942-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-NEXT:    s_setpc_b64 s[30:31]
@@ -5894,28 +5915,37 @@ define double @flat_agent_atomic_fadd_ret_f64__amdgpu_no_fine_grained_memory(ptr
 ; GFX90A:       ; %bb.0:
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    v_mov_b32_e32 v5, v1
-; GFX90A-NEXT:    s_mov_b64 s[4:5], src_private_base
+; GFX90A-NEXT:    s_mov_b64 s[4:5], src_shared_base
 ; GFX90A-NEXT:    v_mov_b32_e32 v4, v0
 ; GFX90A-NEXT:    v_cmp_ne_u32_e32 vcc, s5, v5
 ; GFX90A-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; GFX90A-NEXT:    s_and_saveexec_b64 s[4:5], vcc
 ; GFX90A-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
 ; GFX90A-NEXT:    s_cbranch_execnz .LBB30_3
-; GFX90A-NEXT:  ; %bb.1: ; %Flow
+; GFX90A-NEXT:  ; %bb.1: ; %Flow2
 ; GFX90A-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
-; GFX90A-NEXT:    s_cbranch_execnz .LBB30_4
+; GFX90A-NEXT:    s_cbranch_execnz .LBB30_8
 ; GFX90A-NEXT:  .LBB30_2: ; %atomicrmw.phi
 ; GFX90A-NEXT:    s_or_b64 exec, exec, s[4:5]
+; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    s_setpc_b64 s[30:31]
-; GFX90A-NEXT:  .LBB30_3: ; %atomicrmw.global
-; GFX90A-NEXT:    flat_atomic_add_f64 v[0:1], v[4:5], v[2:3] glc
-; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:  .LBB30_3: ; %atomicrmw.check.private
+; GFX90A-NEXT:    s_mov_b64 s[6:7], src_private_base
+; GFX90A-NEXT:    v_cmp_ne_u32_e32 vcc, s7, v5
+; GFX90A-NEXT:    ; implicit-def: $vgpr0_vgpr1
+; GFX90A-NEXT:    s_and_saveexec_b64 s[6:7], vcc
+; GFX90A-NEXT:    s_xor_b64 s[6:7], exec, s[6:7]
+; GFX90A-NEXT:    s_cbranch_execz .LBB30_5
+; GFX90A-NEXT:  ; %bb.4: ; %atomicrmw.global
+; GFX90A-NEXT:    global_atomic_add_f64 v[0:1], v[4:5], v[2:3], off glc
+; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    buffer_wbinvl1
 ; GFX90A-NEXT:    ; implicit-def: $vgpr4_vgpr5
 ; GFX90A-NEXT:    ; implicit-def: $vgpr2_vgpr3
-; GFX90A-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
-; GFX90A-NEXT:    s_cbranch_execz .LBB30_2
-; GFX90A-NEXT:  .LBB30_4: ; %atomicrmw.private
+; GFX90A-NEXT:  .LBB30_5: ; %Flow
+; GFX90A-NEXT:    s_andn2_saveexec_b64 s[6:7], s[6:7]
+; GFX90A-NEXT:    s_cbranch_execz .LBB30_7
+; GFX90A-NEXT:  ; %bb.6: ; %atomicrmw.private
 ; GFX90A-NEXT:    v_cmp_ne_u64_e32 vcc, 0, v[4:5]
 ; GFX90A-NEXT:    v_cndmask_b32_e32 v4, -1, v4, vcc
 ; GFX90A-NEXT:    buffer_load_dword v0, v4, s[0:3], 0 offen
@@ -5924,6 +5954,17 @@ define double @flat_agent_atomic_fadd_ret_f64__amdgpu_no_fine_grained_memory(ptr
 ; GFX90A-NEXT:    v_add_f64 v[2:3], v[0:1], v[2:3]
 ; GFX90A-NEXT:    buffer_store_dword v2, v4, s[0:3], 0 offen
 ; GFX90A-NEXT:    buffer_store_dword v3, v4, s[0:3], 0 offen offset:4
+; GFX90A-NEXT:  .LBB30_7: ; %Flow1
+; GFX90A-NEXT:    s_or_b64 exec, exec, s[6:7]
+; GFX90A-NEXT:    ; implicit-def: $vgpr4_vgpr5
+; GFX90A-NEXT:    ; implicit-def: $vgpr2_vgpr3
+; GFX90A-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
+; GFX90A-NEXT:    s_cbranch_execz .LBB30_2
+; GFX90A-NEXT:  .LBB30_8: ; %atomicrmw.shared
+; GFX90A-NEXT:    v_cmp_ne_u64_e32 vcc, 0, v[4:5]
+; GFX90A-NEXT:    v_cndmask_b32_e32 v0, -1, v4, vcc
+; GFX90A-NEXT:    ds_add_rtn_f64 v[0:1], v0, v[2:3]
+; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX90A-NEXT:    s_or_b64 exec, exec, s[4:5]
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    s_setpc_b64 s[30:31]
@@ -6160,28 +6201,37 @@ define double @flat_agent_atomic_fadd_ret_f64__offset12b_pos__amdgpu_no_fine_gra
 ; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX942-NEXT:    s_mov_b64 s[0:1], 0x7f8
 ; GFX942-NEXT:    v_lshl_add_u64 v[4:5], v[0:1], 0, s[0:1]
-; GFX942-NEXT:    s_mov_b64 s[0:1], src_private_base
+; GFX942-NEXT:    s_mov_b64 s[0:1], src_shared_base
 ; GFX942-NEXT:    v_cmp_ne_u32_e32 vcc, s1, v5
 ; GFX942-NEXT:    ; implicit-def: $vgpr0_vgpr1
 ; GFX942-NEXT:    s_and_saveexec_b64 s[0:1], vcc
 ; GFX942-NEXT:    s_xor_b64 s[0:1], exec, s[0:1]
 ; GFX942-NEXT:    s_cbranch_execnz .LBB31_3
-; GFX942-NEXT:  ; %bb.1: ; %Flow
+; GFX942-NEXT:  ; %bb.1: ; %Flow2
 ; GFX942-NEXT:    s_andn2_saveexec_b64 s[0:1], s[0:1]
-; GFX942-NEXT:    s_cbranch_execnz .LBB31_4
+; GFX942-NEXT:    s_cbranch_execnz .LBB31_8
 ; GFX942-NEXT:  .LBB31_2: ; %atomicrmw.phi
 ; GFX942-NEXT:    s_or_b64 exec, exec, s[0:1]
+; GFX942-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-NEXT:    s_setpc_b64 s[30:31]
-; GFX942-NEXT:  .LBB31_3: ; %atomicrmw.global
+; GFX942-NEXT:  .LBB31_3: ; %atomicrmw.check.private
+; GFX942-NEXT:    s_mov_b64 s[2:3], src_private_base
+; GFX942-NEXT:    v_cmp_ne_u32_e32 vcc, s3, v5
+; GFX942-NEXT:    ; implicit-def: $vgpr0_vgpr1
+; GFX942-NEXT:    s_and_saveexec_b64 s[2:3], vcc
+; GFX942-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
+; GFX942-NEXT:    s_cbranch_execz .LBB31_5
+; GFX942-NEXT:  ; %bb.4: ; %atomicrmw.global
 ; GFX942-NEXT:    buffer_wbl2 sc1
-; GFX942-NEXT:    flat_atomic_add_f64 v[0:1], v[4:5], v[2:3] sc0
-; GFX942-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX942-NEXT:    global_atomic_add_f64 v[0:1], v[4:5], v[2:3], off sc0
+; GFX942-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-NEXT:    buffer_inv sc1
 ; GFX942-NEXT:    ; implicit-def: $vgpr4_vgpr5
 ; GFX942-NEXT:    ; implicit-def: $vgpr2_vgpr3
-; GFX942-NEXT:    s_andn2_saveexec_b64 s[0:1], s[0:1]
-; GFX942-NEXT:    s_cbranch_execz .LBB31_2
-; GFX942-NEXT:  .LBB31_4: ; %atomicrmw.private
+; GFX942-NEXT:  .LBB31_5: ; %Flow
+; GFX942-NEXT:    s_andn2_saveexec_b64 s[2:3], s[2:3]
+; GFX942-NEXT:    s_cbranch_execz .LBB31_7
+; GFX942-NEXT:  ; %bb.6: ; %atomicrmw.private
 ; GFX942-NEXT:    v_cmp_ne_u64_e32 vcc, 0, v[4:5]
 ; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32...
[truncated]

@shiltian shiltian requested a review from jayfoad June 4, 2025 21:32
Comment on lines +17545 to +17548
RMW && RMW->getOperation() == AtomicRMWInst::FAdd &&
((Subtarget->hasAtomicFaddInsts() && RMW->getType()->isFloatTy()) ||
(Subtarget->hasFlatBufferGlobalAtomicFaddF64Inst() &&
RMW->getType()->isDoubleTy()));
Contributor

Can you fix the description? This is only about adding support for the f64 case where we have global but no shared operation.

Contributor Author

Updated.

Contributor

Title is the same?

Contributor Author

Thought that was fine. Updated.

@shiltian shiltian changed the title [AMDGPU][AtomicExpand] Use full flat emulation if feasible [AMDGPU][AtomicExpand] Use full flat emulation if a target supports f64 global atomic add instruction Jun 5, 2025
@shiltian shiltian merged commit 8cd5604 into main Jun 5, 2025
14 checks passed
@shiltian shiltian deleted the users/shiltian/use-full-flat-emulation-for-double-if-feasible branch June 5, 2025 04:45