[AMDGPU][MISched] Allow memory ops of different base pointers to be clustered #140674
base: main
Conversation
This patch relaxes the same-base-pointer requirement for memory op clustering, testing only for an identical address space. Testing has shown that clustering memory ops with different base pointers can improve performance; in particular, the Babelstream dot_kernel(double) performed up to 15% better when memory loads with different base pointers were clustered. Internal CQE testing did not show significant regressions.
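For illustration, a minimal sketch of the pattern this change affects (hypothetical IR, not taken from the patch's tests): the two loads below use unrelated base pointers %a and %b but live in the same address space, so under this patch they become clustering candidates.

define amdgpu_kernel void @diff_base(ptr addrspace(1) %a, ptr addrspace(1) %b, ptr addrspace(1) %out) {
  ; Different base pointers, but both loads are global (addrspace 1).
  %x = load double, ptr addrspace(1) %a, align 8
  %y = load double, ptr addrspace(1) %b, align 8
  %p = fmul double %x, %y
  store double %p, ptr addrspace(1) %out, align 8
  ret void
}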
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-amdgpu
Author: choikwa (choikwa)
Changes: same as the description above.
Patch is 2.67 MiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/140674.diff 107 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 85276bd24bcf4..8b19ab35bc822 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -47,6 +47,12 @@ namespace llvm::AMDGPU {
#include "AMDGPUGenSearchableTables.inc"
} // namespace llvm::AMDGPU
+static cl::opt<bool> DisableDiffBasePtrMemClustering(
+ "amdgpu-disable-diff-baseptr-mem-clustering",
+ cl::desc("Disable clustering memory ops with different base pointers"),
+ cl::init(false),
+ cl::Hidden);
+
// Must be at least 4 to be able to branch over minimum unconditional branch
// code. This is only for making it possible to write reasonably small tests for
// long branches.
@@ -522,6 +528,22 @@ bool SIInstrInfo::getMemOperandsWithOffsetWidth(
return false;
}
+static bool memOpsHaveSameAddrspace(const MachineInstr &MI1,
+ ArrayRef<const MachineOperand *> BaseOps1,
+ const MachineInstr &MI2,
+ ArrayRef<const MachineOperand *> BaseOps2) {
+ // If base is identical, assume identical addrspace
+ if (BaseOps1.front()->isIdenticalTo(*BaseOps2.front()))
+ return true;
+
+ if (!MI1.hasOneMemOperand() || !MI2.hasOneMemOperand())
+ return false;
+
+ auto *MO1 = *MI1.memoperands_begin();
+ auto *MO2 = *MI2.memoperands_begin();
+ return MO1->getAddrSpace() == MO2->getAddrSpace();
+}
+
static bool memOpsHaveSameBasePtr(const MachineInstr &MI1,
ArrayRef<const MachineOperand *> BaseOps1,
const MachineInstr &MI2,
@@ -559,14 +581,21 @@ bool SIInstrInfo::shouldClusterMemOps(ArrayRef<const MachineOperand *> BaseOps1,
int64_t Offset2, bool OffsetIsScalable2,
unsigned ClusterSize,
unsigned NumBytes) const {
- // If the mem ops (to be clustered) do not have the same base ptr, then they
- // should not be clustered
unsigned MaxMemoryClusterDWords = DefaultMemoryClusterDWordsLimit;
if (!BaseOps1.empty() && !BaseOps2.empty()) {
const MachineInstr &FirstLdSt = *BaseOps1.front()->getParent();
const MachineInstr &SecondLdSt = *BaseOps2.front()->getParent();
- if (!memOpsHaveSameBasePtr(FirstLdSt, BaseOps1, SecondLdSt, BaseOps2))
- return false;
+
+ if (!DisableDiffBasePtrMemClustering) {
+ // Only consider memory ops from same addrspace for clustering
+ if (!memOpsHaveSameAddrspace(FirstLdSt, BaseOps1, SecondLdSt, BaseOps2))
+ return false;
+ } else {
+ // If the mem ops (to be clustered) do not have the same base ptr, then they
+ // should not be clustered
+ if (!memOpsHaveSameBasePtr(FirstLdSt, BaseOps1, SecondLdSt, BaseOps2))
+ return false;
+ }
const SIMachineFunctionInfo *MFI =
FirstLdSt.getMF()->getInfo<SIMachineFunctionInfo>();
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll
index 27b93872b9f1d..f562d958529d1 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll
@@ -8,31 +8,31 @@ define void @add_v3i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addrs
; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX8-NEXT: v_add_u32_e32 v6, vcc, 2, v0
; GFX8-NEXT: v_addc_u32_e32 v7, vcc, 0, v1, vcc
-; GFX8-NEXT: flat_load_ushort v8, v[0:1]
-; GFX8-NEXT: v_add_u32_e32 v0, vcc, 4, v0
-; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; GFX8-NEXT: flat_load_ushort v9, v[6:7]
-; GFX8-NEXT: flat_load_ushort v10, v[0:1]
-; GFX8-NEXT: v_add_u32_e32 v0, vcc, 2, v2
+; GFX8-NEXT: v_add_u32_e32 v8, vcc, 4, v0
+; GFX8-NEXT: v_addc_u32_e32 v9, vcc, 0, v1, vcc
+; GFX8-NEXT: flat_load_ushort v10, v[8:9]
+; GFX8-NEXT: v_add_u32_e32 v8, vcc, 2, v2
+; GFX8-NEXT: v_addc_u32_e32 v9, vcc, 0, v3, vcc
+; GFX8-NEXT: flat_load_ushort v11, v[0:1]
+; GFX8-NEXT: flat_load_ushort v12, v[2:3]
+; GFX8-NEXT: flat_load_ushort v8, v[8:9]
+; GFX8-NEXT: v_add_u32_e32 v0, vcc, 4, v2
; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v3, vcc
-; GFX8-NEXT: v_add_u32_e32 v6, vcc, 4, v2
-; GFX8-NEXT: v_addc_u32_e32 v7, vcc, 0, v3, vcc
-; GFX8-NEXT: flat_load_ushort v11, v[2:3]
-; GFX8-NEXT: flat_load_ushort v12, v[0:1]
; GFX8-NEXT: flat_load_ushort v6, v[6:7]
+; GFX8-NEXT: flat_load_ushort v7, v[0:1]
; GFX8-NEXT: v_add_u32_e32 v0, vcc, 2, v4
; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v5, vcc
; GFX8-NEXT: v_add_u32_e32 v2, vcc, 4, v4
; GFX8-NEXT: v_addc_u32_e32 v3, vcc, 0, v5, vcc
-; GFX8-NEXT: s_waitcnt vmcnt(2)
-; GFX8-NEXT: v_add_u16_e32 v7, v8, v11
+; GFX8-NEXT: s_waitcnt vmcnt(3)
+; GFX8-NEXT: v_add_u16_e32 v9, v11, v12
; GFX8-NEXT: s_waitcnt vmcnt(1)
-; GFX8-NEXT: v_add_u16_e32 v8, v9, v12
+; GFX8-NEXT: v_add_u16_e32 v6, v6, v8
; GFX8-NEXT: s_waitcnt vmcnt(0)
-; GFX8-NEXT: v_add_u16_e32 v6, v10, v6
-; GFX8-NEXT: flat_store_short v[4:5], v7
-; GFX8-NEXT: flat_store_short v[0:1], v8
-; GFX8-NEXT: flat_store_short v[2:3], v6
+; GFX8-NEXT: v_add_u16_e32 v7, v10, v7
+; GFX8-NEXT: flat_store_short v[4:5], v9
+; GFX8-NEXT: flat_store_short v[0:1], v6
+; GFX8-NEXT: flat_store_short v[2:3], v7
; GFX8-NEXT: s_waitcnt vmcnt(0)
; GFX8-NEXT: s_setpc_b64 s[30:31]
;
@@ -153,28 +153,28 @@ define void @add_v5i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addrs
; GFX8-NEXT: v_addc_u32_e32 v7, vcc, 0, v1, vcc
; GFX8-NEXT: v_add_u32_e32 v8, vcc, 4, v0
; GFX8-NEXT: v_addc_u32_e32 v9, vcc, 0, v1, vcc
-; GFX8-NEXT: v_add_u32_e32 v10, vcc, 6, v0
-; GFX8-NEXT: v_addc_u32_e32 v11, vcc, 0, v1, vcc
-; GFX8-NEXT: flat_load_ushort v12, v[0:1]
-; GFX8-NEXT: v_add_u32_e32 v0, vcc, 8, v0
-; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; GFX8-NEXT: flat_load_ushort v13, v[6:7]
-; GFX8-NEXT: flat_load_ushort v14, v[8:9]
-; GFX8-NEXT: flat_load_ushort v15, v[10:11]
-; GFX8-NEXT: flat_load_ushort v16, v[0:1]
-; GFX8-NEXT: v_add_u32_e32 v0, vcc, 2, v2
-; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v3, vcc
-; GFX8-NEXT: v_add_u32_e32 v6, vcc, 4, v2
+; GFX8-NEXT: flat_load_ushort v12, v[6:7]
+; GFX8-NEXT: flat_load_ushort v13, v[8:9]
+; GFX8-NEXT: v_add_u32_e32 v6, vcc, 6, v0
+; GFX8-NEXT: v_addc_u32_e32 v7, vcc, 0, v1, vcc
+; GFX8-NEXT: v_add_u32_e32 v8, vcc, 8, v0
+; GFX8-NEXT: v_addc_u32_e32 v9, vcc, 0, v1, vcc
+; GFX8-NEXT: flat_load_ushort v14, v[6:7]
+; GFX8-NEXT: flat_load_ushort v15, v[8:9]
+; GFX8-NEXT: v_add_u32_e32 v6, vcc, 2, v2
; GFX8-NEXT: v_addc_u32_e32 v7, vcc, 0, v3, vcc
-; GFX8-NEXT: v_add_u32_e32 v8, vcc, 6, v2
+; GFX8-NEXT: v_add_u32_e32 v8, vcc, 4, v2
; GFX8-NEXT: v_addc_u32_e32 v9, vcc, 0, v3, vcc
-; GFX8-NEXT: v_add_u32_e32 v10, vcc, 8, v2
+; GFX8-NEXT: v_add_u32_e32 v10, vcc, 6, v2
; GFX8-NEXT: v_addc_u32_e32 v11, vcc, 0, v3, vcc
+; GFX8-NEXT: flat_load_ushort v16, v[0:1]
; GFX8-NEXT: flat_load_ushort v17, v[2:3]
-; GFX8-NEXT: flat_load_ushort v18, v[0:1]
-; GFX8-NEXT: flat_load_ushort v19, v[6:7]
-; GFX8-NEXT: flat_load_ushort v20, v[8:9]
+; GFX8-NEXT: flat_load_ushort v18, v[6:7]
+; GFX8-NEXT: flat_load_ushort v19, v[8:9]
; GFX8-NEXT: flat_load_ushort v10, v[10:11]
+; GFX8-NEXT: v_add_u32_e32 v0, vcc, 8, v2
+; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v3, vcc
+; GFX8-NEXT: flat_load_ushort v11, v[0:1]
; GFX8-NEXT: v_add_u32_e32 v0, vcc, 2, v4
; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v5, vcc
; GFX8-NEXT: v_add_u32_e32 v2, vcc, 4, v4
@@ -184,20 +184,20 @@ define void @add_v5i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addrs
; GFX8-NEXT: v_add_u32_e32 v8, vcc, 8, v4
; GFX8-NEXT: v_addc_u32_e32 v9, vcc, 0, v5, vcc
; GFX8-NEXT: s_waitcnt vmcnt(4)
-; GFX8-NEXT: v_add_u16_e32 v11, v12, v17
+; GFX8-NEXT: v_add_u16_e32 v16, v16, v17
; GFX8-NEXT: s_waitcnt vmcnt(3)
-; GFX8-NEXT: v_add_u16_e32 v12, v13, v18
+; GFX8-NEXT: v_add_u16_e32 v12, v12, v18
; GFX8-NEXT: s_waitcnt vmcnt(2)
-; GFX8-NEXT: v_add_u16_e32 v13, v14, v19
+; GFX8-NEXT: v_add_u16_e32 v13, v13, v19
; GFX8-NEXT: s_waitcnt vmcnt(1)
-; GFX8-NEXT: v_add_u16_e32 v14, v15, v20
+; GFX8-NEXT: v_add_u16_e32 v10, v14, v10
; GFX8-NEXT: s_waitcnt vmcnt(0)
-; GFX8-NEXT: v_add_u16_e32 v10, v16, v10
-; GFX8-NEXT: flat_store_short v[4:5], v11
+; GFX8-NEXT: v_add_u16_e32 v11, v15, v11
+; GFX8-NEXT: flat_store_short v[4:5], v16
; GFX8-NEXT: flat_store_short v[0:1], v12
; GFX8-NEXT: flat_store_short v[2:3], v13
-; GFX8-NEXT: flat_store_short v[6:7], v14
-; GFX8-NEXT: flat_store_short v[8:9], v10
+; GFX8-NEXT: flat_store_short v[6:7], v10
+; GFX8-NEXT: flat_store_short v[8:9], v11
; GFX8-NEXT: s_waitcnt vmcnt(0)
; GFX8-NEXT: s_setpc_b64 s[30:31]
;
@@ -513,25 +513,25 @@ define void @add_v9i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addrs
; GFX8-NEXT: flat_load_dwordx4 v[10:13], v[2:3]
; GFX8-NEXT: v_add_u32_e32 v0, vcc, 16, v0
; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; GFX8-NEXT: flat_load_ushort v14, v[0:1]
-; GFX8-NEXT: v_add_u32_e32 v0, vcc, 16, v2
-; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v3, vcc
+; GFX8-NEXT: v_add_u32_e32 v2, vcc, 16, v2
+; GFX8-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc
; GFX8-NEXT: flat_load_ushort v0, v[0:1]
+; GFX8-NEXT: flat_load_ushort v1, v[2:3]
; GFX8-NEXT: s_waitcnt vmcnt(2)
-; GFX8-NEXT: v_add_u16_e32 v1, v6, v10
-; GFX8-NEXT: v_add_u16_sdwa v2, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT: v_add_u16_e32 v3, v7, v11
-; GFX8-NEXT: v_add_u16_sdwa v10, v7, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT: v_add_u16_e32 v11, v8, v12
+; GFX8-NEXT: v_add_u16_e32 v2, v6, v10
+; GFX8-NEXT: v_add_u16_sdwa v3, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT: v_add_u16_e32 v10, v7, v11
+; GFX8-NEXT: v_add_u16_sdwa v11, v7, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT: v_add_u16_e32 v14, v8, v12
; GFX8-NEXT: v_add_u16_sdwa v8, v8, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX8-NEXT: v_add_u16_e32 v12, v9, v13
; GFX8-NEXT: v_add_u16_sdwa v9, v9, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX8-NEXT: v_add_u32_e32 v6, vcc, 16, v4
; GFX8-NEXT: s_waitcnt vmcnt(0)
-; GFX8-NEXT: v_add_u16_e32 v13, v14, v0
-; GFX8-NEXT: v_or_b32_e32 v0, v1, v2
-; GFX8-NEXT: v_or_b32_e32 v1, v3, v10
-; GFX8-NEXT: v_or_b32_e32 v2, v11, v8
+; GFX8-NEXT: v_add_u16_e32 v13, v0, v1
+; GFX8-NEXT: v_or_b32_e32 v0, v2, v3
+; GFX8-NEXT: v_or_b32_e32 v1, v10, v11
+; GFX8-NEXT: v_or_b32_e32 v2, v14, v8
; GFX8-NEXT: v_or_b32_e32 v3, v12, v9
; GFX8-NEXT: v_addc_u32_e32 v7, vcc, 0, v5, vcc
; GFX8-NEXT: flat_store_dwordx4 v[4:5], v[0:3]
@@ -604,10 +604,10 @@ define void @add_v10i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addr
; GFX8-NEXT: flat_load_dwordx4 v[10:13], v[2:3]
; GFX8-NEXT: v_add_u32_e32 v0, vcc, 16, v0
; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
+; GFX8-NEXT: v_add_u32_e32 v2, vcc, 16, v2
+; GFX8-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc
; GFX8-NEXT: flat_load_dword v14, v[0:1]
-; GFX8-NEXT: v_add_u32_e32 v0, vcc, 16, v2
-; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v3, vcc
-; GFX8-NEXT: flat_load_dword v15, v[0:1]
+; GFX8-NEXT: flat_load_dword v15, v[2:3]
; GFX8-NEXT: s_waitcnt vmcnt(2)
; GFX8-NEXT: v_add_u16_e32 v0, v6, v10
; GFX8-NEXT: v_add_u16_sdwa v1, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
@@ -663,53 +663,53 @@ define void @add_v11i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addr
; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX8-NEXT: flat_load_dwordx4 v[6:9], v[0:1]
; GFX8-NEXT: flat_load_dwordx4 v[10:13], v[2:3]
-; GFX8-NEXT: v_add_u32_e32 v14, vcc, 16, v2
-; GFX8-NEXT: v_addc_u32_e32 v15, vcc, 0, v3, vcc
-; GFX8-NEXT: v_add_u32_e32 v16, vcc, 18, v2
-; GFX8-NEXT: v_addc_u32_e32 v17, vcc, 0, v3, vcc
-; GFX8-NEXT: v_add_u32_e32 v2, vcc, 20, v2
-; GFX8-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc
-; GFX8-NEXT: flat_load_ushort v14, v[14:15]
-; GFX8-NEXT: flat_load_ushort v15, v[16:17]
-; GFX8-NEXT: flat_load_ushort v16, v[2:3]
-; GFX8-NEXT: v_add_u32_e32 v2, vcc, 16, v0
-; GFX8-NEXT: v_addc_u32_e32 v3, vcc, 0, v1, vcc
-; GFX8-NEXT: s_waitcnt vmcnt(3)
-; GFX8-NEXT: v_add_u16_e32 v17, v6, v10
-; GFX8-NEXT: v_add_u16_sdwa v10, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT: v_add_u32_e32 v6, vcc, 18, v0
-; GFX8-NEXT: v_add_u16_e32 v18, v7, v11
-; GFX8-NEXT: v_add_u16_sdwa v11, v7, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT: s_waitcnt vmcnt(0)
+; GFX8-NEXT: v_add_u16_e32 v14, v6, v10
+; GFX8-NEXT: v_add_u16_sdwa v15, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT: v_add_u32_e32 v6, vcc, 16, v0
+; GFX8-NEXT: v_add_u16_e32 v16, v7, v11
+; GFX8-NEXT: v_add_u16_sdwa v17, v7, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX8-NEXT: v_addc_u32_e32 v7, vcc, 0, v1, vcc
+; GFX8-NEXT: v_add_u16_e32 v18, v8, v12
+; GFX8-NEXT: v_add_u16_sdwa v12, v8, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT: v_add_u32_e32 v8, vcc, 18, v0
+; GFX8-NEXT: v_add_u16_e32 v19, v9, v13
+; GFX8-NEXT: v_add_u16_sdwa v13, v9, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT: v_addc_u32_e32 v9, vcc, 0, v1, vcc
+; GFX8-NEXT: v_add_u32_e32 v10, vcc, 16, v2
+; GFX8-NEXT: v_addc_u32_e32 v11, vcc, 0, v3, vcc
+; GFX8-NEXT: flat_load_ushort v20, v[6:7]
+; GFX8-NEXT: flat_load_ushort v21, v[8:9]
+; GFX8-NEXT: v_add_u32_e32 v6, vcc, 18, v2
+; GFX8-NEXT: v_addc_u32_e32 v7, vcc, 0, v3, vcc
; GFX8-NEXT: v_add_u32_e32 v0, vcc, 20, v0
-; GFX8-NEXT: flat_load_ushort v2, v[2:3]
-; GFX8-NEXT: flat_load_ushort v3, v[6:7]
; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; GFX8-NEXT: flat_load_ushort v21, v[0:1]
+; GFX8-NEXT: v_add_u32_e32 v2, vcc, 20, v2
+; GFX8-NEXT: flat_load_ushort v10, v[10:11]
+; GFX8-NEXT: flat_load_ushort v11, v[6:7]
+; GFX8-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc
+; GFX8-NEXT: flat_load_ushort v22, v[0:1]
+; GFX8-NEXT: flat_load_ushort v2, v[2:3]
; GFX8-NEXT: v_add_u32_e32 v6, vcc, 16, v4
; GFX8-NEXT: v_addc_u32_e32 v7, vcc, 0, v5, vcc
-; GFX8-NEXT: v_add_u16_e32 v19, v8, v12
-; GFX8-NEXT: v_add_u16_sdwa v12, v8, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX8-NEXT: v_add_u32_e32 v8, vcc, 18, v4
-; GFX8-NEXT: v_add_u16_e32 v20, v9, v13
-; GFX8-NEXT: v_add_u16_sdwa v13, v9, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX8-NEXT: v_addc_u32_e32 v9, vcc, 0, v5, vcc
-; GFX8-NEXT: v_or_b32_e32 v0, v17, v10
-; GFX8-NEXT: v_or_b32_e32 v1, v18, v11
+; GFX8-NEXT: v_or_b32_e32 v0, v14, v15
+; GFX8-NEXT: v_or_b32_e32 v1, v16, v17
+; GFX8-NEXT: v_or_b32_e32 v3, v19, v13
+; GFX8-NEXT: s_waitcnt vmcnt(3)
+; GFX8-NEXT: v_add_u16_e32 v20, v20, v10
; GFX8-NEXT: v_add_u32_e32 v10, vcc, 20, v4
-; GFX8-NEXT: v_addc_u32_e32 v11, vcc, 0, v5, vcc
; GFX8-NEXT: s_waitcnt vmcnt(2)
-; GFX8-NEXT: v_add_u16_e32 v14, v2, v14
-; GFX8-NEXT: s_waitcnt vmcnt(1)
-; GFX8-NEXT: v_add_u16_e32 v15, v3, v15
-; GFX8-NEXT: v_or_b32_e32 v2, v19, v12
-; GFX8-NEXT: v_or_b32_e32 v3, v20, v13
+; GFX8-NEXT: v_add_u16_e32 v21, v21, v11
; GFX8-NEXT: s_waitcnt vmcnt(0)
-; GFX8-NEXT: v_add_u16_e32 v16, v21, v16
+; GFX8-NEXT: v_add_u16_e32 v14, v22, v2
+; GFX8-NEXT: v_or_b32_e32 v2, v18, v12
+; GFX8-NEXT: v_addc_u32_e32 v11, vcc, 0, v5, vcc
; GFX8-NEXT: flat_store_dwordx4 v[4:5], v[0:3]
-; GFX8-NEXT: flat_store_short v[6:7], v14
-; GFX8-NEXT: flat_store_short v[8:9], v15
-; GFX8-NEXT: flat_store_short v[10:11], v16
+; GFX8-NEXT: flat_store_short v[6:7], v20
+; GFX8-NEXT: flat_store_short v[8:9], v21
+; GFX8-NEXT: flat_store_short v[10:11], v14
; GFX8-NEXT: s_waitcnt vmcnt(0)
; GFX8-NEXT: s_setpc_b64 s[30:31]
;
@@ -794,34 +794,34 @@ define void @add_v12i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addr
; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX8-NEXT: flat_load_dwordx4 v[6:9], v[0:1]
; GFX8-NEXT: flat_load_dwordx4 v[10:13], v[2:3]
-; GFX8-NEXT: v_add_u32_e32 v2, vcc, 16, v2
-; GFX8-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc
; GFX8-NEXT: v_add_u32_e32 v0, vcc, 16, v0
; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; GFX8-NEXT: flat_load_dwordx2 v[14:15], v[2:3]
-; GFX8-NEXT: s_waitcnt vmcnt(1)
-; GFX8-NEXT: v_add_u16_e32 v2, v6, v10
-; GFX8-NEXT: v_add_u16_sdwa v3, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT: v_add_u16_e32 v10, v7, v11
+; GFX8-NEXT: v_add_u32_e32 v2, vcc, 16, v2
+; GFX8-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc
+; GFX8-NEXT: s_waitcnt vmcnt(0)
+; GFX8-NEXT: v_add_u16_e32 v14, v6, v10
+; GFX8-NEXT: v_add_u16_sdwa v10, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT: v_add_u16_e32 v15, v7, v11
; GFX8-NEXT: v_add_u16_sdwa v11, v7, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT: flat_load_dwordx2 v[6:7], v[0:1]
; GFX8-NEXT: v_add_u16_e32 v16, v8, v12
-; GFX8-NEXT: v_add_u16_sdwa v8, v8, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT: v_add_u16_e32 v12, v9, v13
-; GFX8-NEXT: v_add_u16_sdwa v9, v9, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT: v_or_b32_e32 v0, v2, v3
-; GFX8-NEXT: v_or_b32_e32 v1, v10, v11
-; GFX8-NEXT: v_or_b32_e32 v2, v16, v8
-; GFX8-NEXT: v_or_b32_e32 v3, v12, v9
+; GFX8-NEXT: v_add_u16_sdwa v12, v8, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT: v_add_u16_e32 v17, v9, v13
+; GFX8-NEXT: v_add_u16_sdwa v13, v9, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT: flat_load_dwordx2 v[6:7], v[0:1]
+; GFX8-NEXT: flat_load_dwordx2 v[8:9], v[2:3]
+; GFX8-NEXT: v_or_b32_e32 v0, v14, v10
+; GFX8-NEXT: v_or_b32_e32 v1, v15, v11
+; GFX8-NEXT: v_or_b32_e32 v2, v16, v12
+; GFX8-NEXT: v_or_b32_e32 v3, v17, v13
; GFX8-NEXT: flat_store_dwordx4 v[4:5], v[0:3]
; GFX8-NEXT: s_waitcnt vmcnt(1)
-; GFX8-NEXT: v_add_u16_e32 v8, v6, v14
-; GFX8-NEXT: v_add_u16_sdwa v6, v6, v14 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT: v_add_u16_e32 v9, v7, v15
-; GFX8-NEXT: v_add_u16_sdwa v7, v7, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT: v_add_u16_e32 v10, v6, v8
+; GFX8-NEXT: v_add_u16_sdwa v6, v6, v8 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT: v_add_u16_e32 v8, v7, v9
+; GFX8-NEXT: v_add_u16_sdwa v7, v7, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; GFX8-NEXT: v_add_u32_e32 v0, vcc, 16, v4
-; GFX8-NEXT: v_or_b32_e32 v6, v8, v6
-; GFX8-NEXT: v_or_b32_e32 v7, v9, v7
+; GFX8-NEXT: v_or_b32_e32 v6, v10, v6
+; GFX8-NEXT: v_or_b32_e32 v7, v8, v7
; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v5, vcc
; GFX8-NEXT: flat_store_dwordx2 v[0:1], v[6:7]
; GFX8-NEXT: s_waitcnt vmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/implicit-kernarg-backend-usage-global-isel.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/implicit-kernarg-backend-usage-global-isel.ll
index 86766e2904619..89f896a2b1656 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/implicit-kernarg-backend-usage-global-isel.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/implicit-kernarg-backend-usage-global-isel.ll
@@ -288,16 +288,16 @@ define amdgpu_kernel void...
[truncated]
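The patch also adds llvm/test/CodeGen/AMDGPU/test-enable-diffbase-clustering-flag.ll (not shown in the truncated diff). As a rough sketch, a test exercising the new flag on IR like the @diff_base example above could use RUN lines such as the following — the triple, CPU, and check prefixes here are assumptions, not copied from the patch:

; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck --check-prefix=CLUSTER %s
; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -amdgpu-disable-diff-baseptr-mem-clustering < %s | FileCheck --check-prefix=NOCLUSTER %s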
✅ With the latest revision this PR passed the C/C++ code formatter.
The main motivation came from looking at MISched logs and observing a degradation when two loads from different arrays in a dot product were not placed adjacently. shouldClusterMemOps was the main determinant rejecting the clustering of two loads whenever their base pointers differed; otherwise the scheduler relied only on tie-breaking heuristics to decide whether loads end up together, which is not deterministic. The Shader Programming Guide, section 3.1.8 on the "Soft" Memory Clause, also notes that back-to-back requests are much more efficient for the cache.
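As a concrete illustration, here is hypothetical IR paraphrasing the Babelstream dot-kernel body (a sketch, not the benchmark source): the two loads feeding the multiply-accumulate come from distinct arrays, so memOpsHaveSameBasePtr previously rejected clustering them even though both are global loads.

define amdgpu_kernel void @dot_body(ptr addrspace(1) %a, ptr addrspace(1) %b, ptr addrspace(1) %acc, i64 %i) {
  %pa = getelementptr inbounds double, ptr addrspace(1) %a, i64 %i
  %pb = getelementptr inbounds double, ptr addrspace(1) %b, i64 %i
  ; Two loads with different base pointers in the same address space;
  ; before this patch they were never considered for clustering.
  %va = load double, ptr addrspace(1) %pa, align 8
  %vb = load double, ptr addrspace(1) %pb, align 8
  %old = load double, ptr addrspace(1) %acc, align 8
  %r = call double @llvm.fma.f64(double %va, double %vb, double %old)
  store double %r, ptr addrspace(1) %acc, align 8
  ret void
}
declare double @llvm.fma.f64(double, double, double)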
You can test this locally with the following command:git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' 'HEAD~1' HEAD llvm/test/CodeGen/AMDGPU/test-enable-diffbase-clustering-flag.ll llvm/lib/Target/AMDGPU/SIInstrInfo.cpp llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wmma_32.ll llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wmma_64.ll llvm/test/CodeGen/AMDGPU/GlobalISel/localizer.ll llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w32-swmmac-index_key.ll llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w64-swmmac-index_key.ll llvm/test/CodeGen/AMDGPU/add.v2i16.ll llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-cc.ll llvm/test/CodeGen/AMDGPU/array-ptr-calc-i32.ll llvm/test/CodeGen/AMDGPU/attributor-flatscratchinit-undefined-behavior2.ll llvm/test/CodeGen/AMDGPU/bf16.ll llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll llvm/test/CodeGen/AMDGPU/call-argument-types.ll llvm/test/CodeGen/AMDGPU/chain-hi-to-lo.ll llvm/test/CodeGen/AMDGPU/clamp-modifier.ll llvm/test/CodeGen/AMDGPU/clamp.ll llvm/test/CodeGen/AMDGPU/cluster_stores.ll llvm/test/CodeGen/AMDGPU/constant-address-space-32bit.ll llvm/test/CodeGen/AMDGPU/copy-to-reg-scc-clobber.ll llvm/test/CodeGen/AMDGPU/ctpop16.ll llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll llvm/test/CodeGen/AMDGPU/divergence-driven-buildvector.ll llvm/test/CodeGen/AMDGPU/ds_read2.ll llvm/test/CodeGen/AMDGPU/fcmp.f16.ll llvm/test/CodeGen/AMDGPU/fcopysign.f16.ll llvm/test/CodeGen/AMDGPU/fma-combine.ll llvm/test/CodeGen/AMDGPU/fmed3.ll llvm/test/CodeGen/AMDGPU/fmul.f16.ll llvm/test/CodeGen/AMDGPU/frem.ll llvm/test/CodeGen/AMDGPU/fsub.f16.ll llvm/test/CodeGen/AMDGPU/function-args-inreg.ll llvm/test/CodeGen/AMDGPU/function-args.ll llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll llvm/test/CodeGen/AMDGPU/group-image-instructions.ll llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll llvm/test/CodeGen/AMDGPU/idot2.ll llvm/test/CodeGen/AMDGPU/idot4s.ll llvm/test/CodeGen/AMDGPU/idot4u.ll llvm/test/CodeGen/AMDGPU/idot8s.ll llvm/test/CodeGen/AMDGPU/idot8u.ll llvm/test/CodeGen/AMDGPU/indirect-call-known-callees.ll llvm/test/CodeGen/AMDGPU/insert_vector_elt.v2i16.ll llvm/test/CodeGen/AMDGPU/issue130120-eliminate-frame-index.ll llvm/test/CodeGen/AMDGPU/kernel-args.ll llvm/test/CodeGen/AMDGPU/lds-frame-extern.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bvh8_intersect_ray.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dead.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dual_intersect_ray.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.bf16.bf16.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.f16.f16.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.f32.bf16.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fmad.ftz.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.intersect_ray.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.lds.kernel.id.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.scale.f32.32x32x64.f8f6f4.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.tfe.ll 
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.softwqm.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.load.tfe.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.waitcnt.out.order.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wmma_32.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wmma_64.ll llvm/test/CodeGen/AMDGPU/llvm.fma.f16.ll llvm/test/CodeGen/AMDGPU/llvm.fmuladd.f16.ll llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll llvm/test/CodeGen/AMDGPU/llvm.maxnum.f16.ll llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll llvm/test/CodeGen/AMDGPU/llvm.minnum.f16.ll llvm/test/CodeGen/AMDGPU/load-select-ptr.ll llvm/test/CodeGen/AMDGPU/max.i16.ll llvm/test/CodeGen/AMDGPU/min.ll llvm/test/CodeGen/AMDGPU/mixed-vmem-types.ll llvm/test/CodeGen/AMDGPU/mul.ll llvm/test/CodeGen/AMDGPU/or.ll llvm/test/CodeGen/AMDGPU/permute_i8.ll llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll llvm/test/CodeGen/AMDGPU/reassoc-mul-add-1-to-mad.ll llvm/test/CodeGen/AMDGPU/rotl.ll llvm/test/CodeGen/AMDGPU/rotr.ll llvm/test/CodeGen/AMDGPU/sdwa-commute.ll llvm/test/CodeGen/AMDGPU/select.f16.ll llvm/test/CodeGen/AMDGPU/sitofp.f16.ll llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll llvm/test/CodeGen/AMDGPU/sub.ll llvm/test/CodeGen/AMDGPU/sub.v2i16.ll llvm/test/CodeGen/AMDGPU/uitofp.f16.ll llvm/test/CodeGen/AMDGPU/v_madak_f16.ll llvm/test/CodeGen/AMDGPU/vector-reduce-fadd.ll llvm/test/CodeGen/AMDGPU/vector-reduce-fmul.ll llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll llvm/test/CodeGen/AMDGPU/vselect.ll llvm/test/CodeGen/AMDGPU/wmma-gfx12-w32-swmmac-index_key.ll llvm/test/CodeGen/AMDGPU/wmma-gfx12-w64-swmmac-index_key.ll llvm/test/CodeGen/AMDGPU/wmma_multiple_32.ll llvm/test/CodeGen/AMDGPU/wmma_multiple_64.ll llvm/test/CodeGen/AMDGPU/wqm.ll llvm/test/CodeGen/AMDGPU/xor.ll The following files introduce new uses of undef:
Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef.

In tests, avoid using undef and having tests that trigger undefined behavior. For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.
Have you done any other benchmarking on this patch? It seems like it could have a big effect on performance, both good and bad.
MIR test exercising the flag would be good
I ran the ROCmValidation suite but didn't observe a significant perf delta.
Ping |