
[X86] shouldReduceLoadWidth - don't split loads if we can freely reuse full width legal binop #129695


Merged
4 commits merged into llvm:main from x86-reduce-load-legalops on Apr 29, 2025

Conversation

Collaborator

@RKSimon RKSimon commented Mar 4, 2025

Currently shouldReduceLoadWidth is very relaxed about when loads can be split to avoid extractions from the original full width load - resulting in many cases where the number of memory operations notably increases, replacing the cost of an extract_subvector with additional loads.

This patch adjusts the 256/512-bit vector load splitting metric to detect cases where ANY use of the full width load is used directly - in which case we will now reuse that load for smaller types, unless we'd need to extract an upper subvector / integer element - i.e. we now correctly treat (extract_subvector cst, 0) as free.

We retain the existing logic of never splitting loads if all uses are extract+stores but we improve this by peeking through bitcasts while looking for extract_subvector/store chains.

This required a number of fixes - shouldReduceLoadWidth now needs to peek through bitcasts UP the use-chain to find final users (limited to hasOneUse cases to reduce complexity). It also exposed an issue in isTargetCanonicalConstantNode which assumed that a load of vector constant data would always extract, which is no longer the case.
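
A condensed sketch of the new decision logic (paraphrased from the X86ISelLowering.cpp hunk further down; the use-loop header is reconstructed from the surrounding context and the GOTTPOFF early-out is omitted):

  auto PeekThroughOneUserBitcasts = [](const SDNode *N) {
    while (N->getOpcode() == ISD::BITCAST && N->hasOneUse())
      N = *N->user_begin();
    return N;
  };

  EVT VT = Load->getValueType(0);
  if ((VT.is256BitVector() || VT.is512BitVector()) &&
      !SDValue(Load, 0).hasOneUse()) {
    for (SDUse &Use : Load->uses()) {
      if (Use.getResNo() != 0)
        continue; // Skip chain uses.
      const SDNode *User = PeekThroughOneUserBitcasts(Use.getUser());

      // An extract whose uses are all stores can be store-folded - keep the
      // full width load.
      if (User->getOpcode() == ISD::EXTRACT_SUBVECTOR &&
          all_of(User->uses(), [&](const SDUse &U) {
            return PeekThroughOneUserBitcasts(U.getUser())->getOpcode() ==
                   ISD::STORE;
          }))
        return false;

      // A legal (or target-specific) full width binop can consume the load
      // directly - keep the full width load.
      if (isBinOp(User->getOpcode()) &&
          (isOperationLegal(User->getOpcode(), VT) ||
           User->getOpcode() > ISD::BUILTIN_OP_END))
        return false;
    }
  }
  return true; // No profitable full width reuse found - allow the split.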

Member

llvmbot commented Mar 4, 2025

@llvm/pr-subscribers-backend-x86

Author: Simon Pilgrim (RKSimon)

Changes

Currently shouldReduceLoadWidth is very relaxed about when loads can be split to avoid extractions from the original full width load - resulting in many cases where the number of memory operations notably increases, replacing the cost of an extract_subvector with additional loads.

This patch adjusts the 256/512-bit vector load splitting metric to not split if ANY use of the full width load can be used directly - either by an extract+store (previously ALL uses had to be extract+store to prevent splits) or by a legal binop (which is unlikely to be split itself).

This required a number of fixes - shouldReduceLoadWidth now needs to peek through bitcasts UP the use-chain to find final users (limited to hasOneUse cases to reduce complexity). It also exposed an issue in isTargetCanonicalConstantNode which assumed that a load of vector constant data would always extract, which is no longer the case.


Patch is 482.57 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/129695.diff

16 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+32-11)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.h (+5-2)
  • (modified) llvm/test/CodeGen/X86/oddsubvector.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/setcc-lowering.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/vec_int_to_fp.ll (+10-7)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-3.ll (+64-64)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i16-stride-3.ll (+28-24)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i16-stride-5.ll (+32-32)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i16-stride-6.ll (+1139-1144)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i16-stride-7.ll (+426-416)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i8-stride-5.ll (+18-18)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i8-stride-6.ll (+28-28)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i8-stride-7.ll (+70-78)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-v192.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/vselect-pcmp.ll (+10-8)
  • (modified) llvm/test/CodeGen/X86/widen_load-3.ll (+9-9)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index bbab43d4e92af..9e51bef650562 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -3260,6 +3260,12 @@ bool X86TargetLowering::shouldReduceLoadWidth(SDNode *Load,
                                               EVT NewVT) const {
   assert(cast<LoadSDNode>(Load)->isSimple() && "illegal to narrow");
 
+  auto PeekThroughOneUserBitcasts = [](const SDNode *N) {
+    while (N->getOpcode() == ISD::BITCAST && N->hasOneUse())
+      N = *N->user_begin();
+    return N;
+  };
+
   // "ELF Handling for Thread-Local Storage" specifies that R_X86_64_GOTTPOFF
   // relocation target a movq or addq instruction: don't let the load shrink.
   SDValue BasePtr = cast<LoadSDNode>(Load)->getBasePtr();
@@ -3267,9 +3273,10 @@ bool X86TargetLowering::shouldReduceLoadWidth(SDNode *Load,
     if (const auto *GA = dyn_cast<GlobalAddressSDNode>(BasePtr.getOperand(0)))
       return GA->getTargetFlags() != X86II::MO_GOTTPOFF;
 
-  // If this is an (1) AVX vector load with (2) multiple uses and (3) all of
+  // If this is an (1) AVX vector load with (2) multiple uses and (3) any of
   // those uses are extracted directly into a store, then the extract + store
-  // can be store-folded. Therefore, it's probably not worth splitting the load.
+  // can be store-folded, or (4) any use will be used by legal full width
+  // instruction. Then, it's probably not worth splitting the load.
   EVT VT = Load->getValueType(0);
   if ((VT.is256BitVector() || VT.is512BitVector()) &&
       !SDValue(Load, 0).hasOneUse()) {
@@ -3278,15 +3285,23 @@ bool X86TargetLowering::shouldReduceLoadWidth(SDNode *Load,
       if (Use.getResNo() != 0)
         continue;
 
-      SDNode *User = Use.getUser();
+      const SDNode *User = PeekThroughOneUserBitcasts(Use.getUser());
 
-      // If this use is not an extract + store, it's probably worth splitting.
-      if (User->getOpcode() != ISD::EXTRACT_SUBVECTOR || !User->hasOneUse() ||
-          User->user_begin()->getOpcode() != ISD::STORE)
-        return true;
+      // If any use is an extract + store, it's probably not worth splitting.
+      if (User->getOpcode() == ISD::EXTRACT_SUBVECTOR &&
+          all_of(User->uses(), [&](const SDUse &U) {
+            const SDNode *Inner = PeekThroughOneUserBitcasts(U.getUser());
+            return Inner->getOpcode() == ISD::STORE;
+          }))
+        return false;
+
+      // If any use is a full width legal/target bin op, then assume its legal
+      // and shouldn't split.
+      if (isBinOp(User->getOpcode()) &&
+          (isOperationLegal(User->getOpcode(), VT) ||
+           User->getOpcode() > ISD::BUILTIN_OP_END))
+        return false;
     }
-    // All non-chain uses are extract + store.
-    return false;
   }
 
   return true;
@@ -4001,8 +4016,14 @@ static SDValue getConstVector(ArrayRef<APInt> Bits, const APInt &Undefs,
     const APInt &V = Bits[i];
     assert(V.getBitWidth() == VT.getScalarSizeInBits() && "Unexpected sizes");
     if (Split) {
-      Ops.push_back(DAG.getConstant(V.trunc(32), dl, EltVT));
-      Ops.push_back(DAG.getConstant(V.lshr(32).trunc(32), dl, EltVT));
+      Ops.push_back(DAG.getConstant(V.extractBits(32, 0), dl, EltVT));
+      Ops.push_back(DAG.getConstant(V.extractBits(32, 32), dl, EltVT));
+    } else if (EltVT == MVT::bf16) {
+      APFloat FV(APFloat::BFloat(), V);
+      Ops.push_back(DAG.getConstantFP(FV, dl, EltVT));
+    } else if (EltVT == MVT::f16) {
+      APFloat FV(APFloat::IEEEhalf(), V);
+      Ops.push_back(DAG.getConstantFP(FV, dl, EltVT));
     } else if (EltVT == MVT::f32) {
       APFloat FV(APFloat::IEEEsingle(), V);
       Ops.push_back(DAG.getConstantFP(FV, dl, EltVT));
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index 4a2b35e9efe7c..0332325b145eb 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1337,8 +1337,8 @@ namespace llvm {
                                    unsigned Depth) const override;
 
     bool isTargetCanonicalConstantNode(SDValue Op) const override {
-      // Peek through bitcasts/extracts/inserts to see if we have a broadcast
-      // vector from memory.
+      // Peek through bitcasts/extracts/inserts to see if we have a vector
+      // load/broadcast from memory.
       while (Op.getOpcode() == ISD::BITCAST ||
              Op.getOpcode() == ISD::EXTRACT_SUBVECTOR ||
              (Op.getOpcode() == ISD::INSERT_SUBVECTOR &&
@@ -1346,6 +1346,9 @@ namespace llvm {
         Op = Op.getOperand(Op.getOpcode() == ISD::INSERT_SUBVECTOR ? 1 : 0);
 
       return Op.getOpcode() == X86ISD::VBROADCAST_LOAD ||
+             Op.getOpcode() == X86ISD::SUBV_BROADCAST_LOAD ||
+             (Op.getOpcode() == ISD::LOAD &&
+              getTargetConstantFromLoad(cast<LoadSDNode>(Op))) ||
              TargetLowering::isTargetCanonicalConstantNode(Op);
     }
 
diff --git a/llvm/test/CodeGen/X86/oddsubvector.ll b/llvm/test/CodeGen/X86/oddsubvector.ll
index 2f557679a1558..a1da40e7e7655 100644
--- a/llvm/test/CodeGen/X86/oddsubvector.ll
+++ b/llvm/test/CodeGen/X86/oddsubvector.ll
@@ -261,9 +261,9 @@ define void @PR42833() {
 ;
 ; AVX2-LABEL: PR42833:
 ; AVX2:       # %bb.0:
-; AVX2-NEXT:    movl b(%rip), %eax
 ; AVX2-NEXT:    vmovdqu c+128(%rip), %ymm0
-; AVX2-NEXT:    addl c+128(%rip), %eax
+; AVX2-NEXT:    vmovd %xmm0, %eax
+; AVX2-NEXT:    addl b(%rip), %eax
 ; AVX2-NEXT:    vmovd %eax, %xmm1
 ; AVX2-NEXT:    vpaddd %ymm1, %ymm0, %ymm2
 ; AVX2-NEXT:    vpaddd %ymm0, %ymm0, %ymm3
@@ -284,10 +284,10 @@ define void @PR42833() {
 ;
 ; AVX512-LABEL: PR42833:
 ; AVX512:       # %bb.0:
-; AVX512-NEXT:    movl b(%rip), %eax
 ; AVX512-NEXT:    vmovdqu c+128(%rip), %ymm0
 ; AVX512-NEXT:    vmovdqu64 c+128(%rip), %zmm1
-; AVX512-NEXT:    addl c+128(%rip), %eax
+; AVX512-NEXT:    vmovd %xmm0, %eax
+; AVX512-NEXT:    addl b(%rip), %eax
 ; AVX512-NEXT:    vmovd %eax, %xmm2
 ; AVX512-NEXT:    vpaddd %ymm2, %ymm0, %ymm2
 ; AVX512-NEXT:    vpaddd %ymm0, %ymm0, %ymm0
diff --git a/llvm/test/CodeGen/X86/setcc-lowering.ll b/llvm/test/CodeGen/X86/setcc-lowering.ll
index cdf9e180345ed..5a72bf3e18bd8 100644
--- a/llvm/test/CodeGen/X86/setcc-lowering.ll
+++ b/llvm/test/CodeGen/X86/setcc-lowering.ll
@@ -11,7 +11,7 @@ define <8 x i16> @pr25080(<8 x i32> %a) nounwind {
 ; AVX1-LABEL: pr25080:
 ; AVX1:       # %bb.0: # %entry
 ; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm0
-; AVX1-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX1-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}+16(%rip), %xmm0, %xmm0
 ; AVX1-NEXT:    vpxor %xmm1, %xmm1, %xmm1
 ; AVX1-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm0
 ; AVX1-NEXT:    vpackssdw %xmm0, %xmm0, %xmm0
diff --git a/llvm/test/CodeGen/X86/vec_int_to_fp.ll b/llvm/test/CodeGen/X86/vec_int_to_fp.ll
index af841cf38b24a..c3b67bc5a8c8c 100644
--- a/llvm/test/CodeGen/X86/vec_int_to_fp.ll
+++ b/llvm/test/CodeGen/X86/vec_int_to_fp.ll
@@ -4228,7 +4228,7 @@ define <4 x float> @uitofp_load_4i64_to_4f32(ptr%a) {
 ; AVX1:       # %bb.0:
 ; AVX1-NEXT:    vmovdqa (%rdi), %ymm0
 ; AVX1-NEXT:    vpsrlq $1, %xmm0, %xmm1
-; AVX1-NEXT:    vmovdqa 16(%rdi), %xmm2
+; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm2
 ; AVX1-NEXT:    vpsrlq $1, %xmm2, %xmm3
 ; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm1, %ymm1
 ; AVX1-NEXT:    vandpd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm3
@@ -4273,7 +4273,8 @@ define <4 x float> @uitofp_load_4i64_to_4f32(ptr%a) {
 ; AVX2-NEXT:    vcvtsi2ss %rax, %xmm4, %xmm1
 ; AVX2-NEXT:    vinsertps {{.*#+}} xmm1 = xmm2[0,1,2],xmm1[0]
 ; AVX2-NEXT:    vaddps %xmm1, %xmm1, %xmm2
-; AVX2-NEXT:    vpackssdw 16(%rdi), %xmm0, %xmm0
+; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm3
+; AVX2-NEXT:    vpackssdw %xmm3, %xmm0, %xmm0
 ; AVX2-NEXT:    vblendvps %xmm0, %xmm2, %xmm1, %xmm0
 ; AVX2-NEXT:    vzeroupper
 ; AVX2-NEXT:    retq
@@ -4658,7 +4659,7 @@ define <8 x float> @uitofp_load_8i64_to_8f32(ptr%a) {
 ; AVX1-NEXT:    vbroadcastsd {{.*#+}} ymm2 = [1,1,1,1]
 ; AVX1-NEXT:    vandps %ymm2, %ymm1, %ymm3
 ; AVX1-NEXT:    vpsrlq $1, %xmm1, %xmm4
-; AVX1-NEXT:    vmovdqa 48(%rdi), %xmm5
+; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm5
 ; AVX1-NEXT:    vpsrlq $1, %xmm5, %xmm6
 ; AVX1-NEXT:    vinsertf128 $1, %xmm6, %ymm4, %ymm4
 ; AVX1-NEXT:    vorps %ymm3, %ymm4, %ymm3
@@ -4680,7 +4681,7 @@ define <8 x float> @uitofp_load_8i64_to_8f32(ptr%a) {
 ; AVX1-NEXT:    vblendvps %xmm1, %xmm4, %xmm3, %xmm1
 ; AVX1-NEXT:    vandps %ymm2, %ymm0, %ymm2
 ; AVX1-NEXT:    vpsrlq $1, %xmm0, %xmm3
-; AVX1-NEXT:    vmovdqa 16(%rdi), %xmm4
+; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm4
 ; AVX1-NEXT:    vpsrlq $1, %xmm4, %xmm5
 ; AVX1-NEXT:    vinsertf128 $1, %xmm5, %ymm3, %ymm3
 ; AVX1-NEXT:    vorps %ymm2, %ymm3, %ymm2
@@ -4725,7 +4726,8 @@ define <8 x float> @uitofp_load_8i64_to_8f32(ptr%a) {
 ; AVX2-NEXT:    vcvtsi2ss %rax, %xmm6, %xmm3
 ; AVX2-NEXT:    vinsertps {{.*#+}} xmm3 = xmm4[0,1,2],xmm3[0]
 ; AVX2-NEXT:    vaddps %xmm3, %xmm3, %xmm4
-; AVX2-NEXT:    vpackssdw 48(%rdi), %xmm1, %xmm1
+; AVX2-NEXT:    vextracti128 $1, %ymm1, %xmm5
+; AVX2-NEXT:    vpackssdw %xmm5, %xmm1, %xmm1
 ; AVX2-NEXT:    vblendvps %xmm1, %xmm4, %xmm3, %xmm1
 ; AVX2-NEXT:    vandps %ymm2, %ymm0, %ymm2
 ; AVX2-NEXT:    vpsrlq $1, %ymm0, %ymm3
@@ -4744,7 +4746,8 @@ define <8 x float> @uitofp_load_8i64_to_8f32(ptr%a) {
 ; AVX2-NEXT:    vcvtsi2ss %rax, %xmm6, %xmm2
 ; AVX2-NEXT:    vinsertps {{.*#+}} xmm2 = xmm3[0,1,2],xmm2[0]
 ; AVX2-NEXT:    vaddps %xmm2, %xmm2, %xmm3
-; AVX2-NEXT:    vpackssdw 16(%rdi), %xmm0, %xmm0
+; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm4
+; AVX2-NEXT:    vpackssdw %xmm4, %xmm0, %xmm0
 ; AVX2-NEXT:    vblendvps %xmm0, %xmm3, %xmm2, %xmm0
 ; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
 ; AVX2-NEXT:    retq
@@ -4849,7 +4852,7 @@ define <8 x float> @uitofp_load_8i32_to_8f32(ptr%a) {
 ; AVX1:       # %bb.0:
 ; AVX1-NEXT:    vmovdqa (%rdi), %ymm0
 ; AVX1-NEXT:    vpsrld $16, %xmm0, %xmm1
-; AVX1-NEXT:    vmovdqa 16(%rdi), %xmm2
+; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm2
 ; AVX1-NEXT:    vpsrld $16, %xmm2, %xmm2
 ; AVX1-NEXT:    vinsertf128 $1, %xmm2, %ymm1, %ymm1
 ; AVX1-NEXT:    vcvtdq2ps %ymm1, %ymm1
diff --git a/llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-3.ll b/llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-3.ll
index a39bc6b668669..73a1684c09c70 100644
--- a/llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-3.ll
+++ b/llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-3.ll
@@ -977,11 +977,11 @@ define void @load_i16_stride3_vf16(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, pt
 ;
 ; AVX512-LABEL: load_i16_stride3_vf16:
 ; AVX512:       # %bb.0:
-; AVX512-NEXT:    vmovdqa 32(%rdi), %ymm1
-; AVX512-NEXT:    vmovdqa (%rdi), %ymm2
+; AVX512-NEXT:    vmovdqa (%rdi), %ymm1
+; AVX512-NEXT:    vmovdqa 32(%rdi), %ymm2
 ; AVX512-NEXT:    vmovdqa {{.*#+}} ymm0 = [65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535]
 ; AVX512-NEXT:    vmovdqa %ymm0, %ymm3
-; AVX512-NEXT:    vpternlogq {{.*#+}} ymm3 = ymm1 ^ (ymm3 & (ymm2 ^ ymm1))
+; AVX512-NEXT:    vpternlogq {{.*#+}} ymm3 = ymm2 ^ (ymm3 & (ymm1 ^ ymm2))
 ; AVX512-NEXT:    vpermq {{.*#+}} ymm4 = ymm3[2,3,0,1]
 ; AVX512-NEXT:    vpblendw {{.*#+}} ymm3 = ymm3[0],ymm4[1],ymm3[2,3],ymm4[4],ymm3[5,6],ymm4[7],ymm3[8],ymm4[9],ymm3[10,11],ymm4[12],ymm3[13,14],ymm4[15]
 ; AVX512-NEXT:    vpshufb {{.*#+}} ymm3 = ymm3[0,1,6,7,12,13,2,3,4,5,14,15,8,9,10,11,16,17,22,23,28,29,18,19,20,21,30,31,24,25,26,27]
@@ -993,19 +993,19 @@ define void @load_i16_stride3_vf16(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, pt
 ; AVX512-NEXT:    vpblendw {{.*#+}} ymm6 = ymm3[0,1,2],ymm6[3,4,5,6,7],ymm3[8,9,10],ymm6[11,12,13,14,15]
 ; AVX512-NEXT:    vpshufhw {{.*#+}} xmm3 = xmm3[0,1,2,3,6,5,4,7]
 ; AVX512-NEXT:    vpblendd {{.*#+}} ymm3 = ymm3[0,1,2,3],ymm6[4,5,6,7]
-; AVX512-NEXT:    vmovdqa {{.*#+}} ymm6 = [65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535]
-; AVX512-NEXT:    vpternlogq {{.*#+}} ymm6 = ymm2 ^ (ymm6 & (ymm1 ^ ymm2))
-; AVX512-NEXT:    vpermq {{.*#+}} ymm7 = ymm6[2,3,0,1]
-; AVX512-NEXT:    vpblendw {{.*#+}} ymm6 = ymm6[0,1],ymm7[2],ymm6[3,4],ymm7[5],ymm6[6,7,8,9],ymm7[10],ymm6[11,12],ymm7[13],ymm6[14,15]
-; AVX512-NEXT:    vpshufb {{.*#+}} ymm6 = ymm6[2,3,8,9,14,15,4,5,12,13,10,11,0,1,6,7,18,19,24,25,30,31,20,21,28,29,26,27,16,17,22,23]
-; AVX512-NEXT:    vpblendw {{.*#+}} xmm7 = xmm4[0,1],xmm5[2],xmm4[3,4],xmm5[5],xmm4[6,7]
-; AVX512-NEXT:    vpshufb {{.*#+}} xmm7 = xmm7[u,u,u,u,u,u,4,5,10,11,0,1,6,7,12,13]
-; AVX512-NEXT:    vinserti128 $1, %xmm7, %ymm0, %ymm7
-; AVX512-NEXT:    vpblendw {{.*#+}} ymm7 = ymm6[0,1,2],ymm7[3,4,5,6,7],ymm6[8,9,10],ymm7[11,12,13,14,15]
-; AVX512-NEXT:    vpshufhw {{.*#+}} xmm6 = xmm6[0,1,2,3,5,6,7,4]
-; AVX512-NEXT:    vpblendd {{.*#+}} ymm6 = ymm6[0,1,2,3],ymm7[4,5,6,7]
-; AVX512-NEXT:    vpternlogq {{.*#+}} ymm0 = ymm2 ^ (ymm0 & (ymm1 ^ ymm2))
-; AVX512-NEXT:    vmovdqa 16(%rdi), %xmm1
+; AVX512-NEXT:    vpblendw {{.*#+}} xmm6 = xmm4[0,1],xmm5[2],xmm4[3,4],xmm5[5],xmm4[6,7]
+; AVX512-NEXT:    vpshufb {{.*#+}} xmm6 = xmm6[u,u,u,u,u,u,4,5,10,11,0,1,6,7,12,13]
+; AVX512-NEXT:    vinserti128 $1, %xmm6, %ymm0, %ymm6
+; AVX512-NEXT:    vmovdqa {{.*#+}} ymm7 = [65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535]
+; AVX512-NEXT:    vpternlogq {{.*#+}} ymm7 = ymm1 ^ (ymm7 & (ymm2 ^ ymm1))
+; AVX512-NEXT:    vpermq {{.*#+}} ymm8 = ymm7[2,3,0,1]
+; AVX512-NEXT:    vpblendw {{.*#+}} ymm7 = ymm7[0,1],ymm8[2],ymm7[3,4],ymm8[5],ymm7[6,7,8,9],ymm8[10],ymm7[11,12],ymm8[13],ymm7[14,15]
+; AVX512-NEXT:    vpshufb {{.*#+}} ymm7 = ymm7[2,3,8,9,14,15,4,5,12,13,10,11,0,1,6,7,18,19,24,25,30,31,20,21,28,29,26,27,16,17,22,23]
+; AVX512-NEXT:    vpblendw {{.*#+}} ymm6 = ymm7[0,1,2],ymm6[3,4,5,6,7],ymm7[8,9,10],ymm6[11,12,13,14,15]
+; AVX512-NEXT:    vpshufhw {{.*#+}} xmm7 = xmm7[0,1,2,3,5,6,7,4]
+; AVX512-NEXT:    vpblendd {{.*#+}} ymm6 = ymm7[0,1,2,3],ymm6[4,5,6,7]
+; AVX512-NEXT:    vpternlogq {{.*#+}} ymm0 = ymm1 ^ (ymm0 & (ymm2 ^ ymm1))
+; AVX512-NEXT:    vextracti128 $1, %ymm1, %xmm1
 ; AVX512-NEXT:    vpblendw {{.*#+}} ymm0 = ymm1[0],ymm0[1,2],ymm1[3],ymm0[4,5],ymm1[6],ymm0[7],ymm1[8],ymm0[9,10],ymm1[11],ymm0[12,13],ymm1[14],ymm0[15]
 ; AVX512-NEXT:    vmovdqa {{.*#+}} ymm1 = [4,5,10,11,0,1,6,7,12,13,2,3,8,9,14,15,20,21,26,27,u,u,u,u,u,u,u,u,u,u,u,u]
 ; AVX512-NEXT:    vpshufb %ymm1, %ymm0, %ymm0
@@ -1021,11 +1021,11 @@ define void @load_i16_stride3_vf16(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, pt
 ;
 ; AVX512-FCP-LABEL: load_i16_stride3_vf16:
 ; AVX512-FCP:       # %bb.0:
-; AVX512-FCP-NEXT:    vmovdqa 32(%rdi), %ymm1
-; AVX512-FCP-NEXT:    vmovdqa (%rdi), %ymm2
+; AVX512-FCP-NEXT:    vmovdqa (%rdi), %ymm1
+; AVX512-FCP-NEXT:    vmovdqa 32(%rdi), %ymm2
 ; AVX512-FCP-NEXT:    vmovdqa {{.*#+}} ymm0 = [65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535]
 ; AVX512-FCP-NEXT:    vmovdqa %ymm0, %ymm3
-; AVX512-FCP-NEXT:    vpternlogq {{.*#+}} ymm3 = ymm1 ^ (ymm3 & (ymm2 ^ ymm1))
+; AVX512-FCP-NEXT:    vpternlogq {{.*#+}} ymm3 = ymm2 ^ (ymm3 & (ymm1 ^ ymm2))
 ; AVX512-FCP-NEXT:    vpermq {{.*#+}} ymm4 = ymm3[2,3,0,1]
 ; AVX512-FCP-NEXT:    vpblendw {{.*#+}} ymm3 = ymm3[0],ymm4[1],ymm3[2,3],ymm4[4],ymm3[5,6],ymm4[7],ymm3[8],ymm4[9],ymm3[10,11],ymm4[12],ymm3[13,14],ymm4[15]
 ; AVX512-FCP-NEXT:    vpshufb {{.*#+}} ymm3 = ymm3[0,1,6,7,12,13,2,3,4,5,14,15,8,9,10,11,16,17,22,23,28,29,18,19,20,21,30,31,24,25,26,27]
@@ -1037,19 +1037,19 @@ define void @load_i16_stride3_vf16(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, pt
 ; AVX512-FCP-NEXT:    vpblendw {{.*#+}} ymm6 = ymm3[0,1,2],ymm6[3,4,5,6,7],ymm3[8,9,10],ymm6[11,12,13,14,15]
 ; AVX512-FCP-NEXT:    vpshufhw {{.*#+}} xmm3 = xmm3[0,1,2,3,6,5,4,7]
 ; AVX512-FCP-NEXT:    vpblendd {{.*#+}} ymm3 = ymm3[0,1,2,3],ymm6[4,5,6,7]
-; AVX512-FCP-NEXT:    vmovdqa {{.*#+}} ymm6 = [65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535]
-; AVX512-FCP-NEXT:    vpternlogq {{.*#+}} ymm6 = ymm2 ^ (ymm6 & (ymm1 ^ ymm2))
-; AVX512-FCP-NEXT:    vpermq {{.*#+}} ymm7 = ymm6[2,3,0,1]
-; AVX512-FCP-NEXT:    vpblendw {{.*#+}} ymm6 = ymm6[0,1],ymm7[2],ymm6[3,4],ymm7[5],ymm6[6,7,8,9],ymm7[10],ymm6[11,12],ymm7[13],ymm6[14,15]
-; AVX512-FCP-NEXT:    vpshufb {{.*#+}} ymm6 = ymm6[2,3,8,9,14,15,4,5,12,13,10,11,0,1,6,7,18,19,24,25,30,31,20,21,28,29,26,27,16,17,22,23]
-; AVX512-FCP-NEXT:    vpblendw {{.*#+}} xmm7 = xmm4[0,1],xmm5[2],xmm4[3,4],xmm5[5],xmm4[6,7]
-; AVX512-FCP-NEXT:    vpshufb {{.*#+}} xmm7 = xmm7[u,u,u,u,u,u,4,5,10,11,0,1,6,7,12,13]
-; AVX512-FCP-NEXT:    vinserti128 $1, %xmm7, %ymm0, %ymm7
-; AVX512-FCP-NEXT:    vpblendw {{.*#+}} ymm7 = ymm6[0,1,2],ymm7[3,4,5,6,7],ymm6[8,9,10],ymm7[11,12,13,14,15]
-; AVX512-FCP-NEXT:    vpshufhw {{.*#+}} xmm6 = xmm6[0,1,2,3,5,6,7,4]
-; AVX512-FCP-NEXT:    vpblendd {{.*#+}} ymm6 = ymm6[0,1,2,3],ymm7[4,5,6,7]
-; AVX512-FCP-NEXT:    vpternlogq {{.*#+}} ymm0 = ymm2 ^ (ymm0 & (ymm1 ^ ymm2))
-; AVX512-FCP-NEXT:    vmovdqa 16(%rdi), %xmm1
+; AVX512-FCP-NEXT:    vpblendw {{.*#+}} xmm6 = xmm4[0,1],xmm5[2],xmm4[3,4],xmm5[5],xmm4[6,7]
+; AVX512-FCP-NEXT:    vpshufb {{.*#+}} xmm6 = xmm6[u,u,u,u,u,u,4,5,10,11,0,1,6,7,12,13]
+; AVX512-FCP-NEXT:    vinserti128 $1, %xmm6, %ymm0, %ymm6
+; AVX512-FCP-NEXT:    vmovdqa {{.*#+}} ymm7 = [65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535]
+; AVX512-FCP-NEXT:    vpternlogq {{.*#+}} ymm7 = ymm1 ^ (ymm7 & (ymm2 ^ ymm1))
+; AVX512-FCP-NEXT:    vpermq {{.*#+}} ymm8 = ymm7[2,3,0,1]
+; AVX512-FCP-NEXT:    vpblendw {{.*#+}} ymm7 = ymm7[0,1],ymm8[2],ymm7[3,4],ymm8[5],ymm7[6,7,8,9],ymm8[10],ymm7[11,12],ymm8[13],ymm7[14,15]
+; AVX512-FCP-NEXT:    vpshufb {{.*#+}} ymm7 = ymm7[2,3,8,9,14,15,4,5,12,13,10,11,0,1,6,7,18,19,24,25,30,31,20,21,28,29,26,27,16,17,22,23]
+; AVX512-FCP-NEXT:    vpblendw {{.*#+}} ymm6 = ymm7[0,1,2],ymm6[3,4,5,6,7],ymm7[8,9,10],ymm6[11,12,13,14,15]
+; AVX512-FCP-NEXT:    vpshufhw {{.*#+}} xmm7 = xmm7[0,1,2,3,5,6,7,4]
+; AVX512-FCP-NEXT:    vpblendd {{.*#+}} ymm6 = ymm7[0,1,2,3],ymm6[4,5,6,7]
+; AVX512-FCP-NEXT:    vpternlogq {{.*#+}} ymm0 = ymm1 ^ (ymm0 & (ymm2 ^ ymm1))
+; AVX512-FCP-NEXT:    vextracti128 $1, %ymm1, %xmm1
 ; AVX512-FCP-NEXT:    vpblendw {{.*#+}} ymm0 = ymm1[0],ymm0[1,2],ymm1[3],ymm0[4,5],ymm1[6],ymm0[7],ymm1[8],ymm0[9,10],ymm1[11],ymm0[12,13],ymm1[14],ymm0[15]
 ; AVX512-FCP-NEXT:    vmovdqa {{.*#+}} ymm1 = [4,5,10,11,0,1,6,7,12,13,2,3,8,9,14,15,20,21,26,27,u,u,u,u,u,u,u,u,u,u,u,u]
 ; AVX512-FCP-NEXT:    vpshufb %ymm1, %ymm0, %ymm0
@@ -1065,11 +1065,11 @@ define void @load_i16_stride3_vf16(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, pt
 ;
 ; AVX512DQ-LABEL: load_i16_stride3_vf16:
 ; AVX512DQ:       # %bb.0:
-; AVX512DQ-NEXT:    vmovdqa 32(%rdi), %ymm1
-; AVX512DQ-NEXT:    vmovdqa (%rdi), %ymm2
+; AVX512DQ-NEXT:    vmovdqa (%rdi), %ymm1
+; AVX512DQ-NEXT:    vmovdqa 32(%rdi), %ymm2
 ; AVX512DQ-NEXT:    vmovdqa {{.*#+}} ymm0 = [65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535]
 ; AVX512DQ-NEXT:    vmovdqa %ymm0, %ymm3
-; AVX512DQ-NEXT:    vpternlogq {{.*#+}} ymm3 = ymm1 ^ (ymm3 & (ymm2 ^ ymm1))
+; AVX512DQ-NEXT:    vpternlogq {{.*#+}} ymm3 = ymm2 ^ (ymm3 & (ymm1 ^ ymm2))
 ; AVX512DQ-NEXT:    vpermq {{.*#+}} ymm4 = ymm3[2,3,0,1]
 ; AVX512DQ-NEXT:    vpblendw {{.*#+}} ymm3 = ymm3[0],ymm4[1],ymm3[2,3],ymm4[4],ymm3[5,6],ymm4[7],ymm3[8],ymm4[9],ymm3[10,11],ymm4[12],ymm3[13,14],ymm4[15]
 ; AVX512DQ-NEXT:    vpshufb {{.*#+}} ymm3 = ymm3[0,1,6,7,12,13,2,3,4,5,14,15,8,9,10,11,16,17,22,23,28,29,18,19,20,21,30,31,24,25,26,27]
@@ -1081,19 +1081,19 @@ define void @load_i16_stride3_vf16(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, pt
 ; AVX512DQ-NEXT:    vpblendw {{.*#+}} ymm6 = ymm3[0,1,2],ymm6[3,4,5,6,7],ymm3[8,9,10],ymm6[11,12,13,14,15]
 ; AVX512DQ-NEXT:    vpshufhw {{.*#+}} xmm3 = xmm3[0,1,2,3,6,5,4,7]
 ; AVX512DQ-NEXT:    vpblendd {{.*#+}} ymm3 = ymm3[0,1,2,3],ymm6[4,5,6,7]
-; AVX512DQ-NEXT:    vmovdqa {{.*#+}} ymm6 = [65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535]
-; AVX512DQ-NEXT:    vpternlogq {{.*#+}} ymm6 = ymm2 ^ (ymm6 & (ymm1 ^ ymm2))
-; AVX512DQ-NEXT:    vpermq {{.*#+}} ymm7 = ymm6[2,3,0,1]
-; AVX512DQ-NEXT:    vpblendw {{.*#+}} ymm6 = ymm6[0,1],ymm7[2],ymm6[3,4],ymm7[5],ymm6[6,7,8,9],ymm7[10],ymm6[11,12],ymm7[13],ymm6[14,15]
-; AVX512DQ-NEXT:    vpshufb {{.*#+}} ymm6 = ymm6[2,3,8,9,14,15,4,5,12,13,10,11,0,1,6,7,18,19,24,25,30,31,20,21,28,29,26,27,16,17,22,23]
-; AVX512DQ-NEXT:    v...
[truncated]

@RKSimon RKSimon force-pushed the x86-reduce-load-legalops branch from 678ca30 to 8489c8e Compare March 4, 2025 13:30
@RKSimon RKSimon marked this pull request as draft March 6, 2025 09:01
RKSimon added a commit to RKSimon/llvm-project that referenced this pull request Apr 22, 2025
…ment

Based off feedback for llvm#129695 - we need to be able to determine the load offset of smaller loads when trying to determine whether a multiple use load should be split (in particular for AVX subvector extractions).

This patch adds a std::optional<unsigned> ByteOffset argument to shouldReduceLoadWidth calls for where we know the constant offset to allow targets to make use of it in future patches.
RKSimon added a commit to RKSimon/llvm-project that referenced this pull request Apr 22, 2025
…ment

RKSimon added a commit to RKSimon/llvm-project that referenced this pull request Apr 23, 2025
RKSimon added a commit that referenced this pull request Apr 23, 2025
…ment (#136723)

Based off feedback for #129695 - we need to be able to determine the
load offset of smaller loads when trying to determine whether a multiple
use load should be split (in particular for AVX subvector extractions).

This patch adds a std::optional<unsigned> ByteOffset argument to
shouldReduceLoadWidth calls for where we know the constant offset to
allow targets to make use of it in future patches.
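
A minimal sketch of the extended hook, assuming the new argument is appended with a std::nullopt default (the exact declaration in #136723 may differ):

  // TargetLowering hook with the new byte-offset argument. When the combiner
  // knows the constant offset of the narrowed load relative to the original
  // full width load, it passes it here; otherwise std::nullopt.
  virtual bool
  shouldReduceLoadWidth(SDNode *Load, ISD::LoadExtType ExtTy, EVT NewVT,
                        std::optional<unsigned> ByteOffset = std::nullopt) const;

A target override can then, for example, treat a zero offset (the low subvector) as a free reuse of the wide load while still permitting splits that would otherwise need an upper-lane extract.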
@RKSimon RKSimon force-pushed the x86-reduce-load-legalops branch from 8489c8e to 3b3e83c Compare April 23, 2025 12:57

github-actions bot commented Apr 23, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@RKSimon RKSimon marked this pull request as ready for review April 23, 2025 13:15
@RKSimon RKSimon force-pushed the x86-reduce-load-legalops branch from 3b3e83c to bbf31a1 Compare April 23, 2025 13:16
@RKSimon RKSimon changed the title [X86] shouldReduceLoadWidth - don't split loads if ANY uses are a extract+store or a full width legal binop [X86] shouldReduceLoadWidth - don't split loads if we can freely reuse full width legal binop Apr 23, 2025
…ract+store or a full width legal binop

Currently shouldReduceLoadWidth is very relaxed about when loads can be split to avoid extractions from the original full width load - resulting in many cases where the number of memory operations notably increases, replacing the cost of an extract_subvector with additional loads.

This patch adjusts the 256/512-bit vector load splitting metric to not split if ANY use of the full width load can be used directly - either by an extract+store (previously ALL uses had to be extract+store to prevent splits) or by a legal binop (which is unlikely to be split itself).

This required a number of fixes - shouldReduceLoadWidth now needs to peek through bitcasts UP the use-chain to find final users (limited to hasOneUse cases to reduce complexity). It also exposed an issue in isTargetCanonicalConstantNode which assumed that a load of vector constant data would always extract, which is no longer the case.
@RKSimon RKSimon force-pushed the x86-reduce-load-legalops branch from bbf31a1 to 878a957 Compare April 24, 2025 09:42
RKSimon added a commit to RKSimon/llvm-project that referenced this pull request Apr 25, 2025
…128/I128 into a regular load

This is mainly to help remove subvector extractions from spilled YMM registers.

We can extend this for all the AVX512 variants; I've tried to make the implementation ready for this (a quick test indicated it's mainly test coverage for AVX512 that we're missing).

What I'm not sure on is how best we can then fold this new smaller load into another instruction (you can see some examples of this in vector-interleaved-load-i32-stride-8.ll)?

The comment still saying "32-byte Reload" is annoying, but we already have this for many other element/subvector load folds.

Noticed while looking at next steps after llvm#129695
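
A hypothetical sketch of the rewrite being described - turning extract_subvector of a simple one-use load into a narrower load at the matching byte offset. The helper name and parameters are invented for illustration; real code would also need legality/profitability checks and must update the original load's chain users:

  #include "llvm/CodeGen/SelectionDAG.h"
  #include "llvm/Support/Alignment.h"
  using namespace llvm;

  // extract_subvector (load *Ptr), Idx  -->  load (Ptr + Idx * EltBytes)
  // Assumes a simple (non-volatile, non-atomic) load and a constant element
  // index, as required for this kind of fold.
  static SDValue narrowExtractedLoad(SelectionDAG &DAG, LoadSDNode *Ld,
                                     EVT SubVT, unsigned Idx, const SDLoc &DL) {
    unsigned Offset = Idx * (SubVT.getScalarSizeInBits() / 8);
    SDValue NewPtr = DAG.getMemBasePlusOffset(
        Ld->getBasePtr(), TypeSize::getFixed(Offset), DL);
    return DAG.getLoad(SubVT, DL, Ld->getChain(), NewPtr,
                       Ld->getPointerInfo().getWithOffset(Offset),
                       commonAlignment(Ld->getAlign(), Offset),
                       Ld->getMemOperand()->getFlags());
  }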
Collaborator Author

RKSimon commented Apr 28, 2025

ping?

; AVX512-NEXT: vpermq {{.*#+}} zmm2 = zmm2[2,2,2,3,6,6,6,7]
; AVX512-NEXT: vpternlogq {{.*#+}} zmm2 = zmm2 ^ (mem & (zmm2 ^ zmm8))
; AVX512-NEXT: vpshufb %ymm7, %ymm1, %ymm3
; AVX512-NEXT: vbroadcasti64x4 {{.*#+}} zmm3 = [65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535]
Contributor

Seems we have one more vbroadcasti64x4 here?

Collaborator Author

It's been hoisted out of the folded vpternlogq load below - we've managed to simplify the constant and allow a zmm full width load to be replaced with a broadcast.

Contributor

@phoebewang phoebewang left a comment

LGTM.

@RKSimon RKSimon merged commit 8961f3e into llvm:main Apr 29, 2025
9 of 11 checks passed
@RKSimon RKSimon deleted the x86-reduce-load-legalops branch April 29, 2025 09:35
gizmondo pushed a commit to gizmondo/llvm-project that referenced this pull request Apr 29, 2025
…e full width legal binop (llvm#129695)

IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
…ment (llvm#136723)

IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
…e full width legal binop (llvm#129695)

IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
…ment (llvm#136723)

IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
…e full width legal binop (llvm#129695)

IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
…ment (llvm#136723)

IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
…e full width legal binop (llvm#129695)

GeorgeARM pushed a commit to GeorgeARM/llvm-project that referenced this pull request May 7, 2025
…e full width legal binop (llvm#129695)

Ankur-0429 pushed a commit to Ankur-0429/llvm-project that referenced this pull request May 9, 2025
…e full width legal binop (llvm#129695)