[AMDGPU] Add DAG mutation to improve scheduling before barriers #142716
Conversation
Add a scheduler DAG mutation that adds data dependencies between atomic fences and preceding memory reads. This allows some modelling of the impact an atomic fence can have on outstanding memory accesses. This is beneficial when a fence would cause wait count insertion, as more instructions will be scheduled before the fence, hiding memory latency. It also reduces the risk of a fence causing a premature wait on all active memory operations.
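At its core the mutation is a single pass over the scheduling DAG: it records recent memory reads and, on reaching an ATOMIC_FENCE, adds a data edge from each recorded read to the fence with a large synthetic latency, as if the fence consumed the loaded values. A condensed sketch of that loop (names match the patch; the bundle handling and the 32-entry tracking cap of the full implementation in the diff below are elided):

void BarrierLatency::apply(ScheduleDAGInstrs *DAG) {
  SmallVector<std::pair<SUnit *, const MachineInstr *>, 32> ReadOps;
  for (SUnit &SU : DAG->SUnits) {
    MachineInstr *MI = SU.getInstr();
    if (const MachineInstr *ReadMI = getReadInstr(MI)) {
      ReadOps.emplace_back(&SU, ReadMI); // remember an outstanding read
      continue;
    }
    if (MI->getOpcode() == AMDGPU::ATOMIC_FENCE) {
      for (auto &[ReadSU, Read] : ReadOps) {
        // Pretend the fence reads the load's destination register, with a
        // latency the scheduler will try to hide with independent work.
        SDep Edge(ReadSU, SDep::Data, Read->getOperand(0).getReg());
        Edge.setLatency(2000);
        DAG->addEdge(&SU, Edge);
      }
      ReadOps.clear(); // reads before this fence are now accounted for
    }
  }
}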
@llvm/pr-subscribers-backend-amdgpu

Author: Carl Ritson (perlfu)

Changes: Add a scheduler DAG mutation that adds data dependencies between atomic fences and preceding memory reads. This allows some modelling of the impact an atomic fence can have on outstanding memory accesses. This is beneficial when a fence would cause wait count insertion, as more instructions will be scheduled before the fence, hiding memory latency. It also reduces the risk of a fence causing a premature wait on all active memory operations.

Patch is 368.34 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/142716.diff

14 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUBarrierLatency.cpp b/llvm/lib/Target/AMDGPU/AMDGPUBarrierLatency.cpp
new file mode 100644
index 0000000000000..b633cbf91b7ff
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUBarrierLatency.cpp
@@ -0,0 +1,100 @@
+//===--- AMDGPUBarrierLatency.cpp - AMDGPU Barrier Latency ----------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+/// \file This file contains a DAG scheduling mutation to add data dependency
+/// edges between ATOMIC_FENCE instructions and preceding memory
+/// accesses that might be affected by the fence.
+/// This is beneficial when a fence would cause wait count insertion,
+/// as more instructions will be scheduled before the fence hiding
+/// memory latency.
+/// It also reduces the risk of a fence causing a premature wait
+/// on all active memory operations.
+//
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPUBarrierLatency.h"
+#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
+#include "SIInstrInfo.h"
+#include "llvm/CodeGen/ScheduleDAGInstrs.h"
+
+using namespace llvm;
+
+namespace {
+
+class BarrierLatency : public ScheduleDAGMutation {
+public:
+ BarrierLatency() = default;
+ void apply(ScheduleDAGInstrs *DAG) override;
+};
+
+static bool isMemRead(const MachineInstr *MI) {
+ return (SIInstrInfo::isDS(*MI) || SIInstrInfo::isVMEM(*MI) ||
+ SIInstrInfo::isSMRD(*MI)) &&
+ MI->mayLoad();
+}
+
+static const MachineInstr *getReadInstr(const MachineInstr *MI) {
+ if (MI->isBundle()) {
+ auto I = std::next(MI->getIterator());
+ if (I != MI->getParent()->instr_end() && I->isInsideBundle() &&
+ isMemRead(&*I))
+ return &*I;
+ } else if (isMemRead(MI)) {
+ return MI;
+ }
+
+ return nullptr;
+}
+
+void BarrierLatency::apply(ScheduleDAGInstrs *DAG) {
+ const unsigned SyntheticLatency = 2000;
+ const unsigned MaxTracked = 32;
+ SmallVector<std::pair<SUnit *, const MachineInstr *>, MaxTracked> ReadOps;
+ unsigned NextIdx = 0;
+
+ for (SUnit &SU : DAG->SUnits) {
+ auto *MI = SU.getInstr();
+ auto *ReadMI = getReadInstr(MI);
+
+ // Record read operations.
+ // If SU represents a bundle, then ReadMI is the first instruction in the
+ // bundle.
+ if (ReadMI) {
+ if (ReadOps.size() < MaxTracked) {
+ ReadOps.emplace_back(&SU, ReadMI);
+ } else {
+ ReadOps[NextIdx] = std::pair(&SU, ReadMI);
+ NextIdx = (NextIdx + 1) % MaxTracked;
+ }
+ continue;
+ }
+
+ // Create new edges on ATOMIC_FENCE for recorded reads.
+ // We don't consider the scope of the fence so it is possible there will
+ // be no impact of this fence on the recorded operations.
+ if (MI->getOpcode() == AMDGPU::ATOMIC_FENCE) {
+ for (auto &DSOp : ReadOps) {
+ Register DstReg = DSOp.second->getOperand(0).getReg();
+ SDep Edge = SDep(DSOp.first, SDep::Data, DstReg);
+ Edge.setLatency(SyntheticLatency);
+ DAG->addEdge(&SU, Edge);
+ }
+ // Clear tracked operations
+ ReadOps.clear();
+ NextIdx = 0;
+ continue;
+ }
+ }
+}
+
+} // end anonymous namespace
+
+std::unique_ptr<ScheduleDAGMutation>
+llvm::createAMDGPUBarrierLatencyDAGMutation() {
+ return std::make_unique<BarrierLatency>();
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUBarrierLatency.h b/llvm/lib/Target/AMDGPU/AMDGPUBarrierLatency.h
new file mode 100644
index 0000000000000..c23f0b99fe822
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUBarrierLatency.h
@@ -0,0 +1,21 @@
+//===- AMDGPUBarrierLatency.h - AMDGPU Barrier Latency ----------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIB_TARGET_AMDGPU_AMDGPUBARRIERLATENCY_H
+#define LLVM_LIB_TARGET_AMDGPU_AMDGPUBARRIERLATENCY_H
+
+#include "llvm/CodeGen/ScheduleDAGMutation.h"
+#include <memory>
+
+namespace llvm {
+
+std::unique_ptr<ScheduleDAGMutation> createAMDGPUBarrierLatencyDAGMutation();
+
+} // namespace llvm
+
+#endif // LLVM_LIB_TARGET_AMDGPU_AMDGPUBARRIERLATENCY_H
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index f0a0c2113bf81..fcf4111ea16de 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -17,6 +17,7 @@
#include "AMDGPUTargetMachine.h"
#include "AMDGPU.h"
#include "AMDGPUAliasAnalysis.h"
+#include "AMDGPUBarrierLatency.h"
#include "AMDGPUCtorDtorLowering.h"
#include "AMDGPUExportClustering.h"
#include "AMDGPUExportKernelRuntimeHandles.h"
@@ -588,6 +589,7 @@ createGCNMaxOccupancyMachineScheduler(MachineSchedContext *C) {
DAG->addMutation(createIGroupLPDAGMutation(AMDGPU::SchedulingPhase::Initial));
DAG->addMutation(createAMDGPUMacroFusionDAGMutation());
DAG->addMutation(createAMDGPUExportClusteringDAGMutation());
+ DAG->addMutation(createAMDGPUBarrierLatencyDAGMutation());
return DAG;
}
@@ -608,6 +610,7 @@ createGCNMaxMemoryClauseMachineScheduler(MachineSchedContext *C) {
if (ST.shouldClusterStores())
DAG->addMutation(createStoreClusterDAGMutation(DAG->TII, DAG->TRI));
DAG->addMutation(createAMDGPUExportClusteringDAGMutation());
+ DAG->addMutation(createAMDGPUBarrierLatencyDAGMutation());
return DAG;
}
@@ -1156,6 +1159,7 @@ GCNTargetMachine::createPostMachineScheduler(MachineSchedContext *C) const {
EnableVOPD)
DAG->addMutation(createVOPDPairingMutation());
DAG->addMutation(createAMDGPUExportClusteringDAGMutation());
+ DAG->addMutation(createAMDGPUBarrierLatencyDAGMutation());
return DAG;
}
//===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index c6d70ee39202e..d09de7f91d8c5 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -50,6 +50,7 @@ add_llvm_target(AMDGPUCodeGen
AMDGPUAsmPrinter.cpp
AMDGPUAtomicOptimizer.cpp
AMDGPUAttributor.cpp
+ AMDGPUBarrierLatency.cpp
AMDGPUCallLowering.cpp
AMDGPUCodeGenPrepare.cpp
AMDGPUCombinerHelper.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
index 666523c88860c..75039722b141b 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll
@@ -1528,9 +1528,9 @@ define float @buffer_fat_ptr_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_m
; GFX942-NEXT: buffer_wbl2 sc1
; GFX942-NEXT: buffer_atomic_cmpswap v[0:1], v2, s[0:3], 0 offen sc0
; GFX942-NEXT: s_waitcnt vmcnt(0)
-; GFX942-NEXT: buffer_inv sc1
; GFX942-NEXT: v_cmp_eq_u32_e32 vcc, v0, v5
; GFX942-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
+; GFX942-NEXT: buffer_inv sc1
; GFX942-NEXT: s_andn2_b64 exec, exec, s[4:5]
; GFX942-NEXT: s_cbranch_execnz .LBB12_1
; GFX942-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1576,9 +1576,9 @@ define float @buffer_fat_ptr_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_m
; GFX90A-NEXT: v_pk_mov_b32 v[0:1], v[4:5], v[4:5] op_sel:[0,1]
; GFX90A-NEXT: buffer_atomic_cmpswap v[0:1], v2, s[16:19], 0 offen glc
; GFX90A-NEXT: s_waitcnt vmcnt(0)
-; GFX90A-NEXT: buffer_wbinvl1
; GFX90A-NEXT: v_cmp_eq_u32_e32 vcc, v0, v5
; GFX90A-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
+; GFX90A-NEXT: buffer_wbinvl1
; GFX90A-NEXT: s_andn2_b64 exec, exec, s[4:5]
; GFX90A-NEXT: s_cbranch_execnz .LBB12_1
; GFX90A-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1603,9 +1603,9 @@ define float @buffer_fat_ptr_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_m
; GFX908-NEXT: v_mov_b32_e32 v1, v5
; GFX908-NEXT: buffer_atomic_cmpswap v[0:1], v2, s[16:19], 0 offen glc
; GFX908-NEXT: s_waitcnt vmcnt(0)
-; GFX908-NEXT: buffer_wbinvl1
; GFX908-NEXT: v_cmp_eq_u32_e32 vcc, v0, v5
; GFX908-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
+; GFX908-NEXT: buffer_wbinvl1
; GFX908-NEXT: s_andn2_b64 exec, exec, s[4:5]
; GFX908-NEXT: s_cbranch_execnz .LBB12_1
; GFX908-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1630,9 +1630,9 @@ define float @buffer_fat_ptr_agent_atomic_fmax_ret_f32__amdgpu_no_fine_grained_m
; GFX8-NEXT: v_mov_b32_e32 v1, v5
; GFX8-NEXT: buffer_atomic_cmpswap v[0:1], v2, s[16:19], 0 offen glc
; GFX8-NEXT: s_waitcnt vmcnt(0)
-; GFX8-NEXT: buffer_wbinvl1
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, v0, v5
; GFX8-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
+; GFX8-NEXT: buffer_wbinvl1
; GFX8-NEXT: s_andn2_b64 exec, exec, s[4:5]
; GFX8-NEXT: s_cbranch_execnz .LBB12_1
; GFX8-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1683,10 +1683,10 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_
; GFX942-NEXT: buffer_wbl2 sc1
; GFX942-NEXT: buffer_atomic_cmpswap v[4:5], v2, s[0:3], 0 offen sc0
; GFX942-NEXT: s_waitcnt vmcnt(0)
-; GFX942-NEXT: buffer_inv sc1
; GFX942-NEXT: v_cmp_eq_u32_e32 vcc, v4, v1
-; GFX942-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
; GFX942-NEXT: v_mov_b32_e32 v1, v4
+; GFX942-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
+; GFX942-NEXT: buffer_inv sc1
; GFX942-NEXT: s_andn2_b64 exec, exec, s[4:5]
; GFX942-NEXT: s_cbranch_execnz .LBB13_1
; GFX942-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1730,10 +1730,10 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_
; GFX90A-NEXT: v_pk_mov_b32 v[4:5], v[0:1], v[0:1] op_sel:[0,1]
; GFX90A-NEXT: buffer_atomic_cmpswap v[4:5], v2, s[16:19], 0 offen glc
; GFX90A-NEXT: s_waitcnt vmcnt(0)
-; GFX90A-NEXT: buffer_wbinvl1
; GFX90A-NEXT: v_cmp_eq_u32_e32 vcc, v4, v1
-; GFX90A-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
; GFX90A-NEXT: v_mov_b32_e32 v1, v4
+; GFX90A-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
+; GFX90A-NEXT: buffer_wbinvl1
; GFX90A-NEXT: s_andn2_b64 exec, exec, s[4:5]
; GFX90A-NEXT: s_cbranch_execnz .LBB13_1
; GFX90A-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1756,10 +1756,10 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_
; GFX908-NEXT: v_mov_b32_e32 v4, v0
; GFX908-NEXT: buffer_atomic_cmpswap v[4:5], v2, s[16:19], 0 offen glc
; GFX908-NEXT: s_waitcnt vmcnt(0)
-; GFX908-NEXT: buffer_wbinvl1
; GFX908-NEXT: v_cmp_eq_u32_e32 vcc, v4, v1
-; GFX908-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
; GFX908-NEXT: v_mov_b32_e32 v1, v4
+; GFX908-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
+; GFX908-NEXT: buffer_wbinvl1
; GFX908-NEXT: s_andn2_b64 exec, exec, s[4:5]
; GFX908-NEXT: s_cbranch_execnz .LBB13_1
; GFX908-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1782,10 +1782,10 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f32__amdgpu_no_fine_grained_
; GFX8-NEXT: v_mov_b32_e32 v4, v0
; GFX8-NEXT: buffer_atomic_cmpswap v[4:5], v2, s[16:19], 0 offen glc
; GFX8-NEXT: s_waitcnt vmcnt(0)
-; GFX8-NEXT: buffer_wbinvl1
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, v4, v1
-; GFX8-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
; GFX8-NEXT: v_mov_b32_e32 v1, v4
+; GFX8-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
+; GFX8-NEXT: buffer_wbinvl1
; GFX8-NEXT: s_andn2_b64 exec, exec, s[4:5]
; GFX8-NEXT: s_cbranch_execnz .LBB13_1
; GFX8-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1830,9 +1830,9 @@ define double @buffer_fat_ptr_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_
; GFX12-NEXT: v_dual_mov_b32 v2, v9 :: v_dual_mov_b32 v3, v10
; GFX12-NEXT: buffer_atomic_cmpswap_b64 v[0:3], v6, s[0:3], null offen th:TH_ATOMIC_RETURN
; GFX12-NEXT: s_wait_loadcnt 0x0
-; GFX12-NEXT: global_inv scope:SCOPE_DEV
; GFX12-NEXT: v_cmp_eq_u64_e32 vcc_lo, v[0:1], v[9:10]
; GFX12-NEXT: s_or_b32 s4, vcc_lo, s4
+; GFX12-NEXT: global_inv scope:SCOPE_DEV
; GFX12-NEXT: s_wait_alu 0xfffe
; GFX12-NEXT: s_and_not1_b32 exec_lo, exec_lo, s4
; GFX12-NEXT: s_cbranch_execnz .LBB14_1
@@ -1872,11 +1872,10 @@ define double @buffer_fat_ptr_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_
; GFX11-NEXT: v_dual_mov_b32 v2, v9 :: v_dual_mov_b32 v3, v10
; GFX11-NEXT: buffer_atomic_cmpswap_b64 v[0:3], v6, s[0:3], 0 offen glc
; GFX11-NEXT: s_waitcnt vmcnt(0)
-; GFX11-NEXT: buffer_gl1_inv
-; GFX11-NEXT: buffer_gl0_inv
; GFX11-NEXT: v_cmp_eq_u64_e32 vcc_lo, v[0:1], v[9:10]
; GFX11-NEXT: s_or_b32 s4, vcc_lo, s4
-; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT: buffer_gl1_inv
+; GFX11-NEXT: buffer_gl0_inv
; GFX11-NEXT: s_and_not1_b32 exec_lo, exec_lo, s4
; GFX11-NEXT: s_cbranch_execnz .LBB14_1
; GFX11-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1925,9 +1924,9 @@ define double @buffer_fat_ptr_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_
; GFX908-NEXT: v_mov_b32_e32 v3, v10
; GFX908-NEXT: buffer_atomic_cmpswap_x2 v[0:3], v6, s[16:19], 0 offen glc
; GFX908-NEXT: s_waitcnt vmcnt(0)
-; GFX908-NEXT: buffer_wbinvl1
; GFX908-NEXT: v_cmp_eq_u64_e32 vcc, v[0:1], v[9:10]
; GFX908-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
+; GFX908-NEXT: buffer_wbinvl1
; GFX908-NEXT: s_andn2_b64 exec, exec, s[4:5]
; GFX908-NEXT: s_cbranch_execnz .LBB14_1
; GFX908-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1956,9 +1955,9 @@ define double @buffer_fat_ptr_agent_atomic_fmax_ret_f64__amdgpu_no_fine_grained_
; GFX8-NEXT: v_mov_b32_e32 v3, v10
; GFX8-NEXT: buffer_atomic_cmpswap_x2 v[0:3], v6, s[16:19], 0 offen glc
; GFX8-NEXT: s_waitcnt vmcnt(0)
-; GFX8-NEXT: buffer_wbinvl1
; GFX8-NEXT: v_cmp_eq_u64_e32 vcc, v[0:1], v[9:10]
; GFX8-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
+; GFX8-NEXT: buffer_wbinvl1
; GFX8-NEXT: s_andn2_b64 exec, exec, s[4:5]
; GFX8-NEXT: s_cbranch_execnz .LBB14_1
; GFX8-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -2000,10 +1999,10 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_
; GFX12-NEXT: v_dual_mov_b32 v8, v1 :: v_dual_mov_b32 v7, v0
; GFX12-NEXT: buffer_atomic_cmpswap_b64 v[7:10], v6, s[0:3], null offen th:TH_ATOMIC_RETURN
; GFX12-NEXT: s_wait_loadcnt 0x0
-; GFX12-NEXT: global_inv scope:SCOPE_DEV
; GFX12-NEXT: v_cmp_eq_u64_e32 vcc_lo, v[7:8], v[2:3]
; GFX12-NEXT: v_dual_mov_b32 v2, v7 :: v_dual_mov_b32 v3, v8
; GFX12-NEXT: s_or_b32 s4, vcc_lo, s4
+; GFX12-NEXT: global_inv scope:SCOPE_DEV
; GFX12-NEXT: s_wait_alu 0xfffe
; GFX12-NEXT: s_and_not1_b32 exec_lo, exec_lo, s4
; GFX12-NEXT: s_cbranch_execnz .LBB15_1
@@ -2040,12 +2039,11 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_
; GFX11-NEXT: v_dual_mov_b32 v8, v1 :: v_dual_mov_b32 v7, v0
; GFX11-NEXT: buffer_atomic_cmpswap_b64 v[7:10], v6, s[0:3], 0 offen glc
; GFX11-NEXT: s_waitcnt vmcnt(0)
-; GFX11-NEXT: buffer_gl1_inv
-; GFX11-NEXT: buffer_gl0_inv
; GFX11-NEXT: v_cmp_eq_u64_e32 vcc_lo, v[7:8], v[2:3]
; GFX11-NEXT: v_dual_mov_b32 v2, v7 :: v_dual_mov_b32 v3, v8
; GFX11-NEXT: s_or_b32 s4, vcc_lo, s4
-; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT: buffer_gl1_inv
+; GFX11-NEXT: buffer_gl0_inv
; GFX11-NEXT: s_and_not1_b32 exec_lo, exec_lo, s4
; GFX11-NEXT: s_cbranch_execnz .LBB15_1
; GFX11-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -2090,11 +2088,11 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_
; GFX908-NEXT: v_mov_b32_e32 v7, v0
; GFX908-NEXT: buffer_atomic_cmpswap_x2 v[7:10], v6, s[16:19], 0 offen glc
; GFX908-NEXT: s_waitcnt vmcnt(0)
-; GFX908-NEXT: buffer_wbinvl1
; GFX908-NEXT: v_cmp_eq_u64_e32 vcc, v[7:8], v[2:3]
; GFX908-NEXT: v_mov_b32_e32 v2, v7
-; GFX908-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
; GFX908-NEXT: v_mov_b32_e32 v3, v8
+; GFX908-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
+; GFX908-NEXT: buffer_wbinvl1
; GFX908-NEXT: s_andn2_b64 exec, exec, s[4:5]
; GFX908-NEXT: s_cbranch_execnz .LBB15_1
; GFX908-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -2119,11 +2117,11 @@ define void @buffer_fat_ptr_agent_atomic_fmax_noret_f64__amdgpu_no_fine_grained_
; GFX8-NEXT: v_mov_b32_e32 v7, v0
; GFX8-NEXT: buffer_atomic_cmpswap_x2 v[7:10], v6, s[16:19], 0 offen glc
; GFX8-NEXT: s_waitcnt vmcnt(0)
-; GFX8-NEXT: buffer_wbinvl1
; GFX8-NEXT: v_cmp_eq_u64_e32 vcc, v[7:8], v[2:3]
; GFX8-NEXT: v_mov_b32_e32 v2, v7
-; GFX8-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
; GFX8-NEXT: v_mov_b32_e32 v3, v8
+; GFX8-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
+; GFX8-NEXT: buffer_wbinvl1
; GFX8-NEXT: s_andn2_b64 exec, exec, s[4:5]
; GFX8-NEXT: s_cbranch_execnz .LBB15_1
; GFX8-NEXT: ; %bb.2: ; %atomicrmw.end
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll
index 351502816ae6e..8988b2fd5d01b 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll
@@ -1528,9 +1528,9 @@ define float @buffer_fat_ptr_agent_atomic_fmin_ret_f32__amdgpu_no_fine_grained_m
; GFX942-NEXT: buffer_wbl2 sc1
; GFX942-NEXT: buffer_atomic_cmpswap v[0:1], v2, s[0:3], 0 offen sc0
; GFX942-NEXT: s_waitcnt vmcnt(0)
-; GFX942-NEXT: buffer_inv sc1
; GFX942-NEXT: v_cmp_eq_u32_e32 vcc, v0, v5
; GFX942-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
+; GFX942-NEXT: buffer_inv sc1
; GFX942-NEXT: s_andn2_b64 exec, exec, s[4:5]
; GFX942-NEXT: s_cbranch_execnz .LBB12_1
; GFX942-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1576,9 +1576,9 @@ define float @buffer_fat_ptr_agent_atomic_fmin_ret_f32__amdgpu_no_fine_grained_m
; GFX90A-NEXT: v_pk_mov_b32 v[0:1], v[4:5], v[4:5] op_sel:[0,1]
; GFX90A-NEXT: buffer_atomic_cmpswap v[0:1], v2, s[16:19], 0 offen glc
; GFX90A-NEXT: s_waitcnt vmcnt(0)
-; GFX90A-NEXT: buffer_wbinvl1
; GFX90A-NEXT: v_cmp_eq_u32_e32 vcc, v0, v5
; GFX90A-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
+; GFX90A-NEXT: buffer_wbinvl1
; GFX90A-NEXT: s_andn2_b64 exec, exec, s[4:5]
; GFX90A-NEXT: s_cbranch_execnz .LBB12_1
; GFX90A-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1603,9 +1603,9 @@ define float @buffer_fat_ptr_agent_atomic_fmin_ret_f32__amdgpu_no_fine_grained_m
; GFX908-NEXT: v_mov_b32_e32 v1, v5
; GFX908-NEXT: buffer_atomic_cmpswap v[0:1], v2, s[16:19], 0 offen glc
; GFX908-NEXT: s_waitcnt vmcnt(0)
-; GFX908-NEXT: buffer_wbinvl1
; GFX908-NEXT: v_cmp_eq_u32_e32 vcc, v0, v5
; GFX908-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
+; GFX908-NEXT: buffer_wbinvl1
; GFX908-NEXT: s_andn2_b64 exec, exec, s[4:5]
; GFX908-NEXT: s_cbranch_execnz .LBB12_1
; GFX908-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1630,9 +1630,9 @@ define float @buffer_fat_ptr_agent_atomic_fmin_ret_f32__amdgpu_no_fine_grained_m
; GFX8-NEXT: v_mov_b32_e32 v1, v5
; GFX8-NEXT: buffer_atomic_cmpswap v[0:1], v2, s[16:19], 0 offen glc
; GFX8-NEXT: s_waitcnt vmcnt(0)
-; GFX8-NEXT: buffer_wbinvl1
; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, v0, v5
; GFX8-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
+; GFX8-NEXT: buffer_wbinvl1
; GFX8-NEXT: s_andn2_b64 exec, exec, s[4:5]
; GFX8-NEXT: s_cbranch_execnz .LBB12_1
; GFX8-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1683,10 +1683,10 @@ define void @buffer_fat_ptr_agent_atomic_fmin_noret_f32__amdgpu_no_fine_grained_
; GFX942-NEXT: buffer_wbl2 sc1
; GFX942-NEXT: buffer_atomic_cmpswap v[4:5], v2, s[0:3], 0 offen sc0
; GFX942-NEXT: s_waitcnt vmcnt(0)
-; GFX942-NEXT: buffer_inv sc1
; GFX942-NEXT: v_cmp_eq_u32_e32 vcc, v4, v1
-; GFX942-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
; GFX942-NEXT: v_mov_b32_e32 v1, v4
+; GFX942-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
+; GFX942-NEXT: buffer_inv sc1
; GFX942-NEXT: s_andn2_b64 exec, exec, s[4:5]
; GFX942-NEXT: s_cbranch_execnz .LBB13_1
; GFX942-NEXT: ; %bb.2: ; %atomicrmw.end
@@ -1730,10 +1730,10 @@ define void @buffer_fat_ptr_agent_atomic_fmin_noret_f32__amdgpu_no_fine_grained_
; GFX90A-NEXT: v_pk_mov_b32 v[4:5], v[0:1], v[0:1] op_sel:[0,1]
; GFX90A-NEXT: buffer_atomic_cmpswap v[4:5], v2, s[16:19], 0 offen glc
; GFX90A-NEXT: s_waitcnt v...
[truncated]
@llvm/pr-subscribers-llvm-globalisel

Author: Carl Ritson (perlfu)
I'm trying to understand this. In some cases the load is required to complete before the fence for correctness, so there must already be an edge in the DAG representing that, right? So why do you need to add extra edges? Are you handling cases where the ordering was not required for correctness? Or do you just want to set a different latency on the edge? If so, can't you use …
// Create new edges on ATOMIC_FENCE for recorded reads.
// We don't consider the scope of the fence so it is possible there will
// be no impact of this fence on the recorded operations.
if (MI->getOpcode() == AMDGPU::ATOMIC_FENCE) { |
It's confusing that the description talks about "barriers", but the code handles "fence".
✅ With the latest revision this PR passed the C/C++ code formatter.
In an earlier version of this I tried to add latency by introducing artificial edges with the new latency; however, this had no impact on scheduling. I was unaware of …
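For context, the existing hook for tweaking edge latencies is presumably TargetSubtargetInfo::adjustSchedDependency (an assumption; both comments above are truncated in this capture), which lets a target adjust the latency of dependencies the DAG builder has already created. A minimal sketch of the two approaches discussed, with FenceSU, LoadSU and LoadMI as illustrative placeholders:

// Sketch only: contrasts the two kinds of edges discussed above.
void addFenceEdge(ScheduleDAGInstrs *DAG, SUnit *FenceSU, SUnit *LoadSU,
                  const MachineInstr *LoadMI) {
  // (1) Earlier attempt: an artificial ordering edge. It constrains the
  // relative order of the two nodes, but the scheduler's latency
  // heuristics largely ignore it, which matches the "no impact on
  // scheduling" observation above.
  SDep ArtEdge(LoadSU, SDep::Artificial);
  DAG->addEdge(FenceSU, ArtEdge);

  // (2) This patch: a data edge on the load's destination register with
  // a synthetic latency, which the scheduler actively tries to hide by
  // placing independent instructions between the load and the fence.
  SDep DataEdge(LoadSU, SDep::Data, LoadMI->getOperand(0).getReg());
  DataEdge.setLatency(2000);
  DAG->addEdge(FenceSU, DataEdge);
}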