[AMDGPU][MISched] Allow memory ops of different base pointers to be clustered #140674


Open · choikwa wants to merge 8 commits into main

Conversation

choikwa (Contributor) commented May 20, 2025

This patch relaxes the same-base-pointer requirement for memory-op clustering, requiring only that the ops access the same address space. In testing, clustering memory ops with different base pointers has been observed to improve performance; in particular, Babelstream dot_kernel(double) performed up to 15% better with clustered memory loads from different base pointers. Internal CQE testing did not show significant regressions.

RFC
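
For illustration (not part of the patch), a hypothetical reduced form of a Babelstream-style dot kernel step in LLVM IR: %a and %b are different base pointers in the same global address space (addrspace(1)), which is exactly the pattern this change newly allows the scheduler to cluster.

define amdgpu_kernel void @dot_step(ptr addrspace(1) %a, ptr addrspace(1) %b,
                                    ptr addrspace(1) %out, i32 %i) {
  ; Two loads with different base pointers but the same address space.
  %pa = getelementptr inbounds double, ptr addrspace(1) %a, i32 %i
  %pb = getelementptr inbounds double, ptr addrspace(1) %b, i32 %i
  %va = load double, ptr addrspace(1) %pa
  %vb = load double, ptr addrspace(1) %pb
  %m = fmul double %va, %vb
  store double %m, ptr addrspace(1) %out
  ret void
}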

llvmbot (Member) commented May 20, 2025

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: choikwa (choikwa)

Changes



Patch is 2.67 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/140674.diff

107 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+33-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll (+114-114)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/implicit-kernarg-backend-usage-global-isel.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll (+6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wmma_32.ll (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wmma_64.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/localizer.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w32-swmmac-index_key.ll (+10-16)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w64-swmmac-index_key.ll (+11)
  • (modified) llvm/test/CodeGen/AMDGPU/add.v2i16.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll (+25-25)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll (+8927-9038)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll (+204-210)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll (+1293-1303)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-cc.ll (+58-54)
  • (modified) llvm/test/CodeGen/AMDGPU/array-ptr-calc-i32.ll (+1-2)
  • (modified) llvm/test/CodeGen/AMDGPU/attributor-flatscratchinit-undefined-behavior2.ll (+86-78)
  • (modified) llvm/test/CodeGen/AMDGPU/bf16.ll (+43-44)
  • (modified) llvm/test/CodeGen/AMDGPU/call-argument-types.ll (+8-3)
  • (modified) llvm/test/CodeGen/AMDGPU/chain-hi-to-lo.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/clamp-modifier.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/clamp.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/cluster_stores.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/constant-address-space-32bit.ll (+841-136)
  • (modified) llvm/test/CodeGen/AMDGPU/copy-to-reg-scc-clobber.ll (+14-13)
  • (modified) llvm/test/CodeGen/AMDGPU/ctpop16.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/divergence-driven-buildvector.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/ds_read2.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/fcmp.f16.ll (+56)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f16.ll (+30-11)
  • (modified) llvm/test/CodeGen/AMDGPU/fma-combine.ll (+10-11)
  • (modified) llvm/test/CodeGen/AMDGPU/fmed3.ll (+3)
  • (modified) llvm/test/CodeGen/AMDGPU/fmul.f16.ll (+47-45)
  • (modified) llvm/test/CodeGen/AMDGPU/frem.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/fsub.f16.ll (+15-14)
  • (modified) llvm/test/CodeGen/AMDGPU/function-args-inreg.ll (+42-42)
  • (modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+76-82)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll (+60-56)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (+114-98)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (+60-42)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (+60-42)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll (+114-98)
  • (modified) llvm/test/CodeGen/AMDGPU/group-image-instructions.ll (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/idot2.ll (+348-350)
  • (modified) llvm/test/CodeGen/AMDGPU/idot4s.ll (+346-351)
  • (modified) llvm/test/CodeGen/AMDGPU/idot4u.ll (+610-618)
  • (modified) llvm/test/CodeGen/AMDGPU/idot8s.ll (+117-121)
  • (modified) llvm/test/CodeGen/AMDGPU/idot8u.ll (+98-105)
  • (modified) llvm/test/CodeGen/AMDGPU/implicit-kernarg-backend-usage.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-call-known-callees.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_elt.v2i16.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/issue130120-eliminate-frame-index.ll (+7-6)
  • (modified) llvm/test/CodeGen/AMDGPU/lds-frame-extern.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bvh8_intersect_ray.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dead.ll (+10-1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dual_intersect_ray.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.bf16.bf16.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.f16.f16.ll (+8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.f32.bf16.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fmad.ftz.ll (+1-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.intersect_ray.ll (+10)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.lds.kernel.id.ll (+8-7)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.scale.f32.32x32x64.f8f6f4.ll (+17-17)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.tfe.ll (+20)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.softwqm.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.load.tfe.ll (+20)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.waitcnt.out.order.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wmma_32.ll (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wmma_64.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.fma.f16.ll (+44-44)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.fmuladd.f16.ll (+31-21)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll (+39-39)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maxnum.f16.ll (+80-86)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll (+39-39)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minnum.f16.ll (+80-86)
  • (modified) llvm/test/CodeGen/AMDGPU/load-select-ptr.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/max.i16.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/min.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/mixed-vmem-types.ll (+6-1)
  • (modified) llvm/test/CodeGen/AMDGPU/mul.ll (+68-62)
  • (modified) llvm/test/CodeGen/AMDGPU/or.ll (+14-14)
  • (modified) llvm/test/CodeGen/AMDGPU/permute_i8.ll (+101-51)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll (+70-53)
  • (modified) llvm/test/CodeGen/AMDGPU/reassoc-mul-add-1-to-mad.ll (+59-61)
  • (modified) llvm/test/CodeGen/AMDGPU/rotl.ll (+8-5)
  • (modified) llvm/test/CodeGen/AMDGPU/rotr.ll (+8-5)
  • (modified) llvm/test/CodeGen/AMDGPU/sdwa-commute.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/select.f16.ll (+155-155)
  • (modified) llvm/test/CodeGen/AMDGPU/sitofp.f16.ll (+32-30)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.ll (+19-19)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.v2i16.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/uitofp.f16.ll (+32-30)
  • (modified) llvm/test/CodeGen/AMDGPU/v_madak_f16.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/vector-reduce-fadd.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/vector-reduce-fmul.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll (+191-16)
  • (modified) llvm/test/CodeGen/AMDGPU/vselect.ll (+25-29)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma-gfx12-w32-swmmac-index_key.ll (+10-16)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma-gfx12-w64-swmmac-index_key.ll (+11)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma_multiple_32.ll (+22-44)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma_multiple_64.ll (+22)
  • (modified) llvm/test/CodeGen/AMDGPU/wqm.ll (+42-53)
  • (modified) llvm/test/CodeGen/AMDGPU/xor.ll (+32-32)
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 85276bd24bcf4..8b19ab35bc822 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -47,6 +47,12 @@ namespace llvm::AMDGPU {
 #include "AMDGPUGenSearchableTables.inc"
 } // namespace llvm::AMDGPU
 
+static cl::opt<bool> DisableDiffBasePtrMemClustering(
+  "amdgpu-disable-diff-baseptr-mem-clustering",
+  cl::desc("Disable clustering memory ops with different base pointers"),
+  cl::init(false),
+  cl::Hidden);
+
 // Must be at least 4 to be able to branch over minimum unconditional branch
 // code. This is only for making it possible to write reasonably small tests for
 // long branches.
@@ -522,6 +528,22 @@ bool SIInstrInfo::getMemOperandsWithOffsetWidth(
   return false;
 }
 
+static bool memOpsHaveSameAddrspace(const MachineInstr &MI1,
+                                  ArrayRef<const MachineOperand *> BaseOps1,
+                                  const MachineInstr &MI2,
+                                  ArrayRef<const MachineOperand *> BaseOps2) {
+  // If base is identical, assume identical addrspace
+  if (BaseOps1.front()->isIdenticalTo(*BaseOps2.front()))
+    return true;
+
+  if (!MI1.hasOneMemOperand() || !MI2.hasOneMemOperand())
+    return false;
+
+  auto *MO1 = *MI1.memoperands_begin();
+  auto *MO2 = *MI2.memoperands_begin();
+  return MO1->getAddrSpace() == MO2->getAddrSpace();
+}
+
 static bool memOpsHaveSameBasePtr(const MachineInstr &MI1,
                                   ArrayRef<const MachineOperand *> BaseOps1,
                                   const MachineInstr &MI2,
@@ -559,14 +581,21 @@ bool SIInstrInfo::shouldClusterMemOps(ArrayRef<const MachineOperand *> BaseOps1,
                                       int64_t Offset2, bool OffsetIsScalable2,
                                       unsigned ClusterSize,
                                       unsigned NumBytes) const {
-  // If the mem ops (to be clustered) do not have the same base ptr, then they
-  // should not be clustered
   unsigned MaxMemoryClusterDWords = DefaultMemoryClusterDWordsLimit;
   if (!BaseOps1.empty() && !BaseOps2.empty()) {
     const MachineInstr &FirstLdSt = *BaseOps1.front()->getParent();
     const MachineInstr &SecondLdSt = *BaseOps2.front()->getParent();
-    if (!memOpsHaveSameBasePtr(FirstLdSt, BaseOps1, SecondLdSt, BaseOps2))
-      return false;
+    
+    if (!DisableDiffBasePtrMemClustering) {
+      // Only consider memory ops from same addrspace for clustering
+      if (!memOpsHaveSameAddrspace(FirstLdSt, BaseOps1, SecondLdSt, BaseOps2))
+        return false;
+    } else {
+      // If the mem ops (to be clustered) do not have the same base ptr, then they
+      // should not be clustered
+      if (!memOpsHaveSameBasePtr(FirstLdSt, BaseOps1, SecondLdSt, BaseOps2))
+        return false;
+    }
 
     const SIMachineFunctionInfo *MFI =
         FirstLdSt.getMF()->getInfo<SIMachineFunctionInfo>();
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll
index 27b93872b9f1d..f562d958529d1 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll
@@ -8,31 +8,31 @@ define void @add_v3i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addrs
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 2, v0
 ; GFX8-NEXT:    v_addc_u32_e32 v7, vcc, 0, v1, vcc
-; GFX8-NEXT:    flat_load_ushort v8, v[0:1]
-; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 4, v0
-; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; GFX8-NEXT:    flat_load_ushort v9, v[6:7]
-; GFX8-NEXT:    flat_load_ushort v10, v[0:1]
-; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 2, v2
+; GFX8-NEXT:    v_add_u32_e32 v8, vcc, 4, v0
+; GFX8-NEXT:    v_addc_u32_e32 v9, vcc, 0, v1, vcc
+; GFX8-NEXT:    flat_load_ushort v10, v[8:9]
+; GFX8-NEXT:    v_add_u32_e32 v8, vcc, 2, v2
+; GFX8-NEXT:    v_addc_u32_e32 v9, vcc, 0, v3, vcc
+; GFX8-NEXT:    flat_load_ushort v11, v[0:1]
+; GFX8-NEXT:    flat_load_ushort v12, v[2:3]
+; GFX8-NEXT:    flat_load_ushort v8, v[8:9]
+; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 4, v2
 ; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v3, vcc
-; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 4, v2
-; GFX8-NEXT:    v_addc_u32_e32 v7, vcc, 0, v3, vcc
-; GFX8-NEXT:    flat_load_ushort v11, v[2:3]
-; GFX8-NEXT:    flat_load_ushort v12, v[0:1]
 ; GFX8-NEXT:    flat_load_ushort v6, v[6:7]
+; GFX8-NEXT:    flat_load_ushort v7, v[0:1]
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 2, v4
 ; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v5, vcc
 ; GFX8-NEXT:    v_add_u32_e32 v2, vcc, 4, v4
 ; GFX8-NEXT:    v_addc_u32_e32 v3, vcc, 0, v5, vcc
-; GFX8-NEXT:    s_waitcnt vmcnt(2)
-; GFX8-NEXT:    v_add_u16_e32 v7, v8, v11
+; GFX8-NEXT:    s_waitcnt vmcnt(3)
+; GFX8-NEXT:    v_add_u16_e32 v9, v11, v12
 ; GFX8-NEXT:    s_waitcnt vmcnt(1)
-; GFX8-NEXT:    v_add_u16_e32 v8, v9, v12
+; GFX8-NEXT:    v_add_u16_e32 v6, v6, v8
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    v_add_u16_e32 v6, v10, v6
-; GFX8-NEXT:    flat_store_short v[4:5], v7
-; GFX8-NEXT:    flat_store_short v[0:1], v8
-; GFX8-NEXT:    flat_store_short v[2:3], v6
+; GFX8-NEXT:    v_add_u16_e32 v7, v10, v7
+; GFX8-NEXT:    flat_store_short v[4:5], v9
+; GFX8-NEXT:    flat_store_short v[0:1], v6
+; GFX8-NEXT:    flat_store_short v[2:3], v7
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -153,28 +153,28 @@ define void @add_v5i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addrs
 ; GFX8-NEXT:    v_addc_u32_e32 v7, vcc, 0, v1, vcc
 ; GFX8-NEXT:    v_add_u32_e32 v8, vcc, 4, v0
 ; GFX8-NEXT:    v_addc_u32_e32 v9, vcc, 0, v1, vcc
-; GFX8-NEXT:    v_add_u32_e32 v10, vcc, 6, v0
-; GFX8-NEXT:    v_addc_u32_e32 v11, vcc, 0, v1, vcc
-; GFX8-NEXT:    flat_load_ushort v12, v[0:1]
-; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 8, v0
-; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; GFX8-NEXT:    flat_load_ushort v13, v[6:7]
-; GFX8-NEXT:    flat_load_ushort v14, v[8:9]
-; GFX8-NEXT:    flat_load_ushort v15, v[10:11]
-; GFX8-NEXT:    flat_load_ushort v16, v[0:1]
-; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 2, v2
-; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v3, vcc
-; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 4, v2
+; GFX8-NEXT:    flat_load_ushort v12, v[6:7]
+; GFX8-NEXT:    flat_load_ushort v13, v[8:9]
+; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 6, v0
+; GFX8-NEXT:    v_addc_u32_e32 v7, vcc, 0, v1, vcc
+; GFX8-NEXT:    v_add_u32_e32 v8, vcc, 8, v0
+; GFX8-NEXT:    v_addc_u32_e32 v9, vcc, 0, v1, vcc
+; GFX8-NEXT:    flat_load_ushort v14, v[6:7]
+; GFX8-NEXT:    flat_load_ushort v15, v[8:9]
+; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 2, v2
 ; GFX8-NEXT:    v_addc_u32_e32 v7, vcc, 0, v3, vcc
-; GFX8-NEXT:    v_add_u32_e32 v8, vcc, 6, v2
+; GFX8-NEXT:    v_add_u32_e32 v8, vcc, 4, v2
 ; GFX8-NEXT:    v_addc_u32_e32 v9, vcc, 0, v3, vcc
-; GFX8-NEXT:    v_add_u32_e32 v10, vcc, 8, v2
+; GFX8-NEXT:    v_add_u32_e32 v10, vcc, 6, v2
 ; GFX8-NEXT:    v_addc_u32_e32 v11, vcc, 0, v3, vcc
+; GFX8-NEXT:    flat_load_ushort v16, v[0:1]
 ; GFX8-NEXT:    flat_load_ushort v17, v[2:3]
-; GFX8-NEXT:    flat_load_ushort v18, v[0:1]
-; GFX8-NEXT:    flat_load_ushort v19, v[6:7]
-; GFX8-NEXT:    flat_load_ushort v20, v[8:9]
+; GFX8-NEXT:    flat_load_ushort v18, v[6:7]
+; GFX8-NEXT:    flat_load_ushort v19, v[8:9]
 ; GFX8-NEXT:    flat_load_ushort v10, v[10:11]
+; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 8, v2
+; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v3, vcc
+; GFX8-NEXT:    flat_load_ushort v11, v[0:1]
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 2, v4
 ; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v5, vcc
 ; GFX8-NEXT:    v_add_u32_e32 v2, vcc, 4, v4
@@ -184,20 +184,20 @@ define void @add_v5i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addrs
 ; GFX8-NEXT:    v_add_u32_e32 v8, vcc, 8, v4
 ; GFX8-NEXT:    v_addc_u32_e32 v9, vcc, 0, v5, vcc
 ; GFX8-NEXT:    s_waitcnt vmcnt(4)
-; GFX8-NEXT:    v_add_u16_e32 v11, v12, v17
+; GFX8-NEXT:    v_add_u16_e32 v16, v16, v17
 ; GFX8-NEXT:    s_waitcnt vmcnt(3)
-; GFX8-NEXT:    v_add_u16_e32 v12, v13, v18
+; GFX8-NEXT:    v_add_u16_e32 v12, v12, v18
 ; GFX8-NEXT:    s_waitcnt vmcnt(2)
-; GFX8-NEXT:    v_add_u16_e32 v13, v14, v19
+; GFX8-NEXT:    v_add_u16_e32 v13, v13, v19
 ; GFX8-NEXT:    s_waitcnt vmcnt(1)
-; GFX8-NEXT:    v_add_u16_e32 v14, v15, v20
+; GFX8-NEXT:    v_add_u16_e32 v10, v14, v10
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    v_add_u16_e32 v10, v16, v10
-; GFX8-NEXT:    flat_store_short v[4:5], v11
+; GFX8-NEXT:    v_add_u16_e32 v11, v15, v11
+; GFX8-NEXT:    flat_store_short v[4:5], v16
 ; GFX8-NEXT:    flat_store_short v[0:1], v12
 ; GFX8-NEXT:    flat_store_short v[2:3], v13
-; GFX8-NEXT:    flat_store_short v[6:7], v14
-; GFX8-NEXT:    flat_store_short v[8:9], v10
+; GFX8-NEXT:    flat_store_short v[6:7], v10
+; GFX8-NEXT:    flat_store_short v[8:9], v11
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -513,25 +513,25 @@ define void @add_v9i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addrs
 ; GFX8-NEXT:    flat_load_dwordx4 v[10:13], v[2:3]
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 16, v0
 ; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; GFX8-NEXT:    flat_load_ushort v14, v[0:1]
-; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 16, v2
-; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v3, vcc
+; GFX8-NEXT:    v_add_u32_e32 v2, vcc, 16, v2
+; GFX8-NEXT:    v_addc_u32_e32 v3, vcc, 0, v3, vcc
 ; GFX8-NEXT:    flat_load_ushort v0, v[0:1]
+; GFX8-NEXT:    flat_load_ushort v1, v[2:3]
 ; GFX8-NEXT:    s_waitcnt vmcnt(2)
-; GFX8-NEXT:    v_add_u16_e32 v1, v6, v10
-; GFX8-NEXT:    v_add_u16_sdwa v2, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT:    v_add_u16_e32 v3, v7, v11
-; GFX8-NEXT:    v_add_u16_sdwa v10, v7, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT:    v_add_u16_e32 v11, v8, v12
+; GFX8-NEXT:    v_add_u16_e32 v2, v6, v10
+; GFX8-NEXT:    v_add_u16_sdwa v3, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    v_add_u16_e32 v10, v7, v11
+; GFX8-NEXT:    v_add_u16_sdwa v11, v7, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    v_add_u16_e32 v14, v8, v12
 ; GFX8-NEXT:    v_add_u16_sdwa v8, v8, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_add_u16_e32 v12, v9, v13
 ; GFX8-NEXT:    v_add_u16_sdwa v9, v9, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 16, v4
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    v_add_u16_e32 v13, v14, v0
-; GFX8-NEXT:    v_or_b32_e32 v0, v1, v2
-; GFX8-NEXT:    v_or_b32_e32 v1, v3, v10
-; GFX8-NEXT:    v_or_b32_e32 v2, v11, v8
+; GFX8-NEXT:    v_add_u16_e32 v13, v0, v1
+; GFX8-NEXT:    v_or_b32_e32 v0, v2, v3
+; GFX8-NEXT:    v_or_b32_e32 v1, v10, v11
+; GFX8-NEXT:    v_or_b32_e32 v2, v14, v8
 ; GFX8-NEXT:    v_or_b32_e32 v3, v12, v9
 ; GFX8-NEXT:    v_addc_u32_e32 v7, vcc, 0, v5, vcc
 ; GFX8-NEXT:    flat_store_dwordx4 v[4:5], v[0:3]
@@ -604,10 +604,10 @@ define void @add_v10i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addr
 ; GFX8-NEXT:    flat_load_dwordx4 v[10:13], v[2:3]
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 16, v0
 ; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
+; GFX8-NEXT:    v_add_u32_e32 v2, vcc, 16, v2
+; GFX8-NEXT:    v_addc_u32_e32 v3, vcc, 0, v3, vcc
 ; GFX8-NEXT:    flat_load_dword v14, v[0:1]
-; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 16, v2
-; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v3, vcc
-; GFX8-NEXT:    flat_load_dword v15, v[0:1]
+; GFX8-NEXT:    flat_load_dword v15, v[2:3]
 ; GFX8-NEXT:    s_waitcnt vmcnt(2)
 ; GFX8-NEXT:    v_add_u16_e32 v0, v6, v10
 ; GFX8-NEXT:    v_add_u16_sdwa v1, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
@@ -663,53 +663,53 @@ define void @add_v11i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addr
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX8-NEXT:    flat_load_dwordx4 v[6:9], v[0:1]
 ; GFX8-NEXT:    flat_load_dwordx4 v[10:13], v[2:3]
-; GFX8-NEXT:    v_add_u32_e32 v14, vcc, 16, v2
-; GFX8-NEXT:    v_addc_u32_e32 v15, vcc, 0, v3, vcc
-; GFX8-NEXT:    v_add_u32_e32 v16, vcc, 18, v2
-; GFX8-NEXT:    v_addc_u32_e32 v17, vcc, 0, v3, vcc
-; GFX8-NEXT:    v_add_u32_e32 v2, vcc, 20, v2
-; GFX8-NEXT:    v_addc_u32_e32 v3, vcc, 0, v3, vcc
-; GFX8-NEXT:    flat_load_ushort v14, v[14:15]
-; GFX8-NEXT:    flat_load_ushort v15, v[16:17]
-; GFX8-NEXT:    flat_load_ushort v16, v[2:3]
-; GFX8-NEXT:    v_add_u32_e32 v2, vcc, 16, v0
-; GFX8-NEXT:    v_addc_u32_e32 v3, vcc, 0, v1, vcc
-; GFX8-NEXT:    s_waitcnt vmcnt(3)
-; GFX8-NEXT:    v_add_u16_e32 v17, v6, v10
-; GFX8-NEXT:    v_add_u16_sdwa v10, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 18, v0
-; GFX8-NEXT:    v_add_u16_e32 v18, v7, v11
-; GFX8-NEXT:    v_add_u16_sdwa v11, v7, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    s_waitcnt vmcnt(0)
+; GFX8-NEXT:    v_add_u16_e32 v14, v6, v10
+; GFX8-NEXT:    v_add_u16_sdwa v15, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 16, v0
+; GFX8-NEXT:    v_add_u16_e32 v16, v7, v11
+; GFX8-NEXT:    v_add_u16_sdwa v17, v7, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_addc_u32_e32 v7, vcc, 0, v1, vcc
+; GFX8-NEXT:    v_add_u16_e32 v18, v8, v12
+; GFX8-NEXT:    v_add_u16_sdwa v12, v8, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    v_add_u32_e32 v8, vcc, 18, v0
+; GFX8-NEXT:    v_add_u16_e32 v19, v9, v13
+; GFX8-NEXT:    v_add_u16_sdwa v13, v9, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    v_addc_u32_e32 v9, vcc, 0, v1, vcc
+; GFX8-NEXT:    v_add_u32_e32 v10, vcc, 16, v2
+; GFX8-NEXT:    v_addc_u32_e32 v11, vcc, 0, v3, vcc
+; GFX8-NEXT:    flat_load_ushort v20, v[6:7]
+; GFX8-NEXT:    flat_load_ushort v21, v[8:9]
+; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 18, v2
+; GFX8-NEXT:    v_addc_u32_e32 v7, vcc, 0, v3, vcc
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 20, v0
-; GFX8-NEXT:    flat_load_ushort v2, v[2:3]
-; GFX8-NEXT:    flat_load_ushort v3, v[6:7]
 ; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; GFX8-NEXT:    flat_load_ushort v21, v[0:1]
+; GFX8-NEXT:    v_add_u32_e32 v2, vcc, 20, v2
+; GFX8-NEXT:    flat_load_ushort v10, v[10:11]
+; GFX8-NEXT:    flat_load_ushort v11, v[6:7]
+; GFX8-NEXT:    v_addc_u32_e32 v3, vcc, 0, v3, vcc
+; GFX8-NEXT:    flat_load_ushort v22, v[0:1]
+; GFX8-NEXT:    flat_load_ushort v2, v[2:3]
 ; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 16, v4
 ; GFX8-NEXT:    v_addc_u32_e32 v7, vcc, 0, v5, vcc
-; GFX8-NEXT:    v_add_u16_e32 v19, v8, v12
-; GFX8-NEXT:    v_add_u16_sdwa v12, v8, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_add_u32_e32 v8, vcc, 18, v4
-; GFX8-NEXT:    v_add_u16_e32 v20, v9, v13
-; GFX8-NEXT:    v_add_u16_sdwa v13, v9, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_addc_u32_e32 v9, vcc, 0, v5, vcc
-; GFX8-NEXT:    v_or_b32_e32 v0, v17, v10
-; GFX8-NEXT:    v_or_b32_e32 v1, v18, v11
+; GFX8-NEXT:    v_or_b32_e32 v0, v14, v15
+; GFX8-NEXT:    v_or_b32_e32 v1, v16, v17
+; GFX8-NEXT:    v_or_b32_e32 v3, v19, v13
+; GFX8-NEXT:    s_waitcnt vmcnt(3)
+; GFX8-NEXT:    v_add_u16_e32 v20, v20, v10
 ; GFX8-NEXT:    v_add_u32_e32 v10, vcc, 20, v4
-; GFX8-NEXT:    v_addc_u32_e32 v11, vcc, 0, v5, vcc
 ; GFX8-NEXT:    s_waitcnt vmcnt(2)
-; GFX8-NEXT:    v_add_u16_e32 v14, v2, v14
-; GFX8-NEXT:    s_waitcnt vmcnt(1)
-; GFX8-NEXT:    v_add_u16_e32 v15, v3, v15
-; GFX8-NEXT:    v_or_b32_e32 v2, v19, v12
-; GFX8-NEXT:    v_or_b32_e32 v3, v20, v13
+; GFX8-NEXT:    v_add_u16_e32 v21, v21, v11
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    v_add_u16_e32 v16, v21, v16
+; GFX8-NEXT:    v_add_u16_e32 v14, v22, v2
+; GFX8-NEXT:    v_or_b32_e32 v2, v18, v12
+; GFX8-NEXT:    v_addc_u32_e32 v11, vcc, 0, v5, vcc
 ; GFX8-NEXT:    flat_store_dwordx4 v[4:5], v[0:3]
-; GFX8-NEXT:    flat_store_short v[6:7], v14
-; GFX8-NEXT:    flat_store_short v[8:9], v15
-; GFX8-NEXT:    flat_store_short v[10:11], v16
+; GFX8-NEXT:    flat_store_short v[6:7], v20
+; GFX8-NEXT:    flat_store_short v[8:9], v21
+; GFX8-NEXT:    flat_store_short v[10:11], v14
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -794,34 +794,34 @@ define void @add_v12i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addr
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX8-NEXT:    flat_load_dwordx4 v[6:9], v[0:1]
 ; GFX8-NEXT:    flat_load_dwordx4 v[10:13], v[2:3]
-; GFX8-NEXT:    v_add_u32_e32 v2, vcc, 16, v2
-; GFX8-NEXT:    v_addc_u32_e32 v3, vcc, 0, v3, vcc
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 16, v0
 ; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; GFX8-NEXT:    flat_load_dwordx2 v[14:15], v[2:3]
-; GFX8-NEXT:    s_waitcnt vmcnt(1)
-; GFX8-NEXT:    v_add_u16_e32 v2, v6, v10
-; GFX8-NEXT:    v_add_u16_sdwa v3, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT:    v_add_u16_e32 v10, v7, v11
+; GFX8-NEXT:    v_add_u32_e32 v2, vcc, 16, v2
+; GFX8-NEXT:    v_addc_u32_e32 v3, vcc, 0, v3, vcc
+; GFX8-NEXT:    s_waitcnt vmcnt(0)
+; GFX8-NEXT:    v_add_u16_e32 v14, v6, v10
+; GFX8-NEXT:    v_add_u16_sdwa v10, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    v_add_u16_e32 v15, v7, v11
 ; GFX8-NEXT:    v_add_u16_sdwa v11, v7, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT:    flat_load_dwordx2 v[6:7], v[0:1]
 ; GFX8-NEXT:    v_add_u16_e32 v16, v8, v12
-; GFX8-NEXT:    v_add_u16_sdwa v8, v8, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT:    v_add_u16_e32 v12, v9, v13
-; GFX8-NEXT:    v_add_u16_sdwa v9, v9, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT:    v_or_b32_e32 v0, v2, v3
-; GFX8-NEXT:    v_or_b32_e32 v1, v10, v11
-; GFX8-NEXT:    v_or_b32_e32 v2, v16, v8
-; GFX8-NEXT:    v_or_b32_e32 v3, v12, v9
+; GFX8-NEXT:    v_add_u16_sdwa v12, v8, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    v_add_u16_e32 v17, v9, v13
+; GFX8-NEXT:    v_add_u16_sdwa v13, v9, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    flat_load_dwordx2 v[6:7], v[0:1]
+; GFX8-NEXT:    flat_load_dwordx2 v[8:9], v[2:3]
+; GFX8-NEXT:    v_or_b32_e32 v0, v14, v10
+; GFX8-NEXT:    v_or_b32_e32 v1, v15, v11
+; GFX8-NEXT:    v_or_b32_e32 v2, v16, v12
+; GFX8-NEXT:    v_or_b32_e32 v3, v17, v13
 ; GFX8-NEXT:    flat_store_dwordx4 v[4:5], v[0:3]
 ; GFX8-NEXT:    s_waitcnt vmcnt(1)
-; GFX8-NEXT:    v_add_u16_e32 v8, v6, v14
-; GFX8-NEXT:    v_add_u16_sdwa v6, v6, v14 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT:    v_add_u16_e32 v9, v7, v15
-; GFX8-NEXT:    v_add_u16_sdwa v7, v7, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    v_add_u16_e32 v10, v6, v8
+; GFX8-NEXT:    v_add_u16_sdwa v6, v6, v8 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    v_add_u16_e32 v8, v7, v9
+; GFX8-NEXT:    v_add_u16_sdwa v7, v7, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 16, v4
-; GFX8-NEXT:    v_or_b32_e32 v6, v8, v6
-; GFX8-NEXT:    v_or_b32_e32 v7, v9, v7
+; GFX8-NEXT:    v_or_b32_e32 v6, v10, v6
+; GFX8-NEXT:    v_or_b32_e32 v7, v8, v7
 ; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v5, vcc
 ; GFX8-NEXT:    flat_store_dwordx2 v[0:1], v[6:7]
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/implicit-kernarg-backend-usage-global-isel.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/implicit-kernarg-backend-usage-global-isel.ll
index 86766e2904619..89f896a2b1656 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/implicit-kernarg-backend-usage-global-isel.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/implicit-kernarg-backend-usage-global-isel.ll
@@ -288,16 +288,16 @@ define amdgpu_kernel void...
[truncated]

github-actions bot commented May 20, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

choikwa (Contributor, Author) commented May 20, 2025

The main motivation for this change came from looking at MISched logs and observing a degradation when two loads from different arrays in dot_product were not placed adjacently. shouldClusterMemOps was the main determinant rejecting clustering of two loads whenever their base pointers differed; beyond that, the scheduler relied only on tie-breaking heuristics to decide whether loads end up together, which is not deterministic.

The Shader Programming Guide (section 3.1.8, "Soft" Memory Clauses) also notes that back-to-back requests are much more efficient for the cache.
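
For reference, the patch keeps the old behavior reachable via the hidden flag it introduces; a hypothetical way to compare the two schedules (the flag name comes from the diff above, the rest of the command line is illustrative, and -debug-only requires an assertions-enabled build):

# New default: cluster loads/stores that share an address space.
llc -mtriple=amdgcn -mcpu=gfx90a -debug-only=machine-scheduler dot.ll

# Old behavior: only cluster ops with an identical base pointer.
llc -mtriple=amdgcn -mcpu=gfx90a -amdgpu-disable-diff-baseptr-mem-clustering \
    -debug-only=machine-scheduler dot.ll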

@shiltian shiltian changed the title [AMDGPU][MISched] Allow memory ops of different base pointers to be c… [AMDGPU][MISched] Allow memory ops of different base pointers to be clustered May 20, 2025
github-actions bot commented May 20, 2025

⚠️ undef deprecator found issues in your code. ⚠️

You can test this locally with the following command:
git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' 'HEAD~1' HEAD llvm/test/CodeGen/AMDGPU/test-enable-diffbase-clustering-flag.ll llvm/lib/Target/AMDGPU/SIInstrInfo.cpp llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wmma_32.ll llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wmma_64.ll llvm/test/CodeGen/AMDGPU/GlobalISel/localizer.ll llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w32-swmmac-index_key.ll llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w64-swmmac-index_key.ll llvm/test/CodeGen/AMDGPU/add.v2i16.ll llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-cc.ll llvm/test/CodeGen/AMDGPU/array-ptr-calc-i32.ll llvm/test/CodeGen/AMDGPU/attributor-flatscratchinit-undefined-behavior2.ll llvm/test/CodeGen/AMDGPU/bf16.ll llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll llvm/test/CodeGen/AMDGPU/call-argument-types.ll llvm/test/CodeGen/AMDGPU/chain-hi-to-lo.ll llvm/test/CodeGen/AMDGPU/clamp-modifier.ll llvm/test/CodeGen/AMDGPU/clamp.ll llvm/test/CodeGen/AMDGPU/cluster_stores.ll llvm/test/CodeGen/AMDGPU/constant-address-space-32bit.ll llvm/test/CodeGen/AMDGPU/copy-to-reg-scc-clobber.ll llvm/test/CodeGen/AMDGPU/ctpop16.ll llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll llvm/test/CodeGen/AMDGPU/divergence-driven-buildvector.ll llvm/test/CodeGen/AMDGPU/ds_read2.ll llvm/test/CodeGen/AMDGPU/fcmp.f16.ll llvm/test/CodeGen/AMDGPU/fcopysign.f16.ll llvm/test/CodeGen/AMDGPU/fma-combine.ll llvm/test/CodeGen/AMDGPU/fmed3.ll llvm/test/CodeGen/AMDGPU/fmul.f16.ll llvm/test/CodeGen/AMDGPU/frem.ll llvm/test/CodeGen/AMDGPU/fsub.f16.ll llvm/test/CodeGen/AMDGPU/function-args-inreg.ll llvm/test/CodeGen/AMDGPU/function-args.ll llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll llvm/test/CodeGen/AMDGPU/group-image-instructions.ll llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll llvm/test/CodeGen/AMDGPU/idot2.ll llvm/test/CodeGen/AMDGPU/idot4s.ll llvm/test/CodeGen/AMDGPU/idot4u.ll llvm/test/CodeGen/AMDGPU/idot8s.ll llvm/test/CodeGen/AMDGPU/idot8u.ll llvm/test/CodeGen/AMDGPU/indirect-call-known-callees.ll llvm/test/CodeGen/AMDGPU/insert_vector_elt.v2i16.ll llvm/test/CodeGen/AMDGPU/issue130120-eliminate-frame-index.ll llvm/test/CodeGen/AMDGPU/kernel-args.ll llvm/test/CodeGen/AMDGPU/lds-frame-extern.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bvh8_intersect_ray.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dead.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dual_intersect_ray.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.bf16.bf16.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.f16.f16.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.f32.bf16.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fmad.ftz.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.intersect_ray.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.lds.kernel.id.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.scale.f32.32x32x64.f8f6f4.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.tfe.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.softwqm.ll 
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.load.tfe.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.waitcnt.out.order.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wmma_32.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wmma_64.ll llvm/test/CodeGen/AMDGPU/llvm.fma.f16.ll llvm/test/CodeGen/AMDGPU/llvm.fmuladd.f16.ll llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll llvm/test/CodeGen/AMDGPU/llvm.maxnum.f16.ll llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll llvm/test/CodeGen/AMDGPU/llvm.minnum.f16.ll llvm/test/CodeGen/AMDGPU/load-select-ptr.ll llvm/test/CodeGen/AMDGPU/max.i16.ll llvm/test/CodeGen/AMDGPU/min.ll llvm/test/CodeGen/AMDGPU/mixed-vmem-types.ll llvm/test/CodeGen/AMDGPU/mul.ll llvm/test/CodeGen/AMDGPU/or.ll llvm/test/CodeGen/AMDGPU/permute_i8.ll llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll llvm/test/CodeGen/AMDGPU/reassoc-mul-add-1-to-mad.ll llvm/test/CodeGen/AMDGPU/rotl.ll llvm/test/CodeGen/AMDGPU/rotr.ll llvm/test/CodeGen/AMDGPU/sdwa-commute.ll llvm/test/CodeGen/AMDGPU/select.f16.ll llvm/test/CodeGen/AMDGPU/sitofp.f16.ll llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll llvm/test/CodeGen/AMDGPU/sub.ll llvm/test/CodeGen/AMDGPU/sub.v2i16.ll llvm/test/CodeGen/AMDGPU/uitofp.f16.ll llvm/test/CodeGen/AMDGPU/v_madak_f16.ll llvm/test/CodeGen/AMDGPU/vector-reduce-fadd.ll llvm/test/CodeGen/AMDGPU/vector-reduce-fmul.ll llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll llvm/test/CodeGen/AMDGPU/vselect.ll llvm/test/CodeGen/AMDGPU/wmma-gfx12-w32-swmmac-index_key.ll llvm/test/CodeGen/AMDGPU/wmma-gfx12-w64-swmmac-index_key.ll llvm/test/CodeGen/AMDGPU/wmma_multiple_32.ll llvm/test/CodeGen/AMDGPU/wmma_multiple_64.ll llvm/test/CodeGen/AMDGPU/wqm.ll llvm/test/CodeGen/AMDGPU/xor.ll

The following files introduce new uses of undef:

  • llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.

@choikwa choikwa requested a review from kerbowa May 20, 2025 20:02
jayfoad (Contributor) commented May 21, 2025

Have you done any other benchmarking on this patch? It seems like it could have a big effect on performance, both good and bad.

arsenm (Contributor) left a comment

An MIR test exercising the flag would be good.
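
A minimal sketch of what such a test could look like at the IR level (file placement, CHECK prefixes, and the expected instruction sequence are illustrative, not from the patch; an MIR test would pin the scheduler input more directly):

; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=CLUSTER %s
; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -amdgpu-disable-diff-baseptr-mem-clustering < %s | FileCheck -check-prefix=NOBASE %s

; With addrspace-based clustering, the two global loads should be emitted
; back to back even though %a and %b are different base pointers.
; CLUSTER: global_load_dword
; CLUSTER-NEXT: global_load_dword
; NOBASE: global_load_dword
define float @two_bases(ptr addrspace(1) %a, ptr addrspace(1) %b) {
  %va = load float, ptr addrspace(1) %a
  %vb = load float, ptr addrspace(1) %b
  %r = fadd float %va, %vb
  ret float %r
}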

choikwa (Contributor, Author) commented May 21, 2025

Have you done any other benchmarking on this patch? It seems like it could have a big effect on performance, both good and bad.

I ran the ROCmValidation suite but didn't observe a significant perf delta.

choikwa (Contributor, Author) commented Jun 2, 2025

Ping
