[X86] shouldReduceLoadWidth - don't split loads if we can freely reuse full width legal binop #129695
Conversation
@llvm/pr-subscribers-backend-x86

Author: Simon Pilgrim (RKSimon)

Changes

Currently shouldReduceLoadWidth is very relaxed about when loads can be split to avoid extractions from the original full width load - resulting in many cases where the number of memory operations notably increases, trading the cost of an extract_subvector for additional loads.

This patch adjusts the 256/512-bit vector load splitting metric to not split if ANY use of the full width load can be used directly - either in an extract+store (previously ALL uses had to be extract+store to prevent splits) or by a legal binop (which is therefore unlikely to be split itself).

This required a number of fixes - shouldReduceLoadWidth now needs to peek through bitcasts UP the use-chain to find final users (limited to hasOneUse cases to reduce complexity). It also exposed an issue in isTargetCanonicalConstantNode, which assumed that a load of vector constant data would always be extracted, which is no longer the case.

Patch is 482.57 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/129695.diff

16 Files Affected:
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index bbab43d4e92af..9e51bef650562 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -3260,6 +3260,12 @@ bool X86TargetLowering::shouldReduceLoadWidth(SDNode *Load,
EVT NewVT) const {
assert(cast<LoadSDNode>(Load)->isSimple() && "illegal to narrow");
+ auto PeekThroughOneUserBitcasts = [](const SDNode *N) {
+ while (N->getOpcode() == ISD::BITCAST && N->hasOneUse())
+ N = *N->user_begin();
+ return N;
+ };
+
// "ELF Handling for Thread-Local Storage" specifies that R_X86_64_GOTTPOFF
// relocation target a movq or addq instruction: don't let the load shrink.
SDValue BasePtr = cast<LoadSDNode>(Load)->getBasePtr();
@@ -3267,9 +3273,10 @@ bool X86TargetLowering::shouldReduceLoadWidth(SDNode *Load,
if (const auto *GA = dyn_cast<GlobalAddressSDNode>(BasePtr.getOperand(0)))
return GA->getTargetFlags() != X86II::MO_GOTTPOFF;
- // If this is an (1) AVX vector load with (2) multiple uses and (3) all of
+ // If this is an (1) AVX vector load with (2) multiple uses and (3) any of
// those uses are extracted directly into a store, then the extract + store
- // can be store-folded. Therefore, it's probably not worth splitting the load.
+ // can be store-folded, or (4) any use is consumed by a legal full width
+ // instruction. In those cases it's probably not worth splitting the load.
EVT VT = Load->getValueType(0);
if ((VT.is256BitVector() || VT.is512BitVector()) &&
!SDValue(Load, 0).hasOneUse()) {
@@ -3278,15 +3285,23 @@ bool X86TargetLowering::shouldReduceLoadWidth(SDNode *Load,
if (Use.getResNo() != 0)
continue;
- SDNode *User = Use.getUser();
+ const SDNode *User = PeekThroughOneUserBitcasts(Use.getUser());
- // If this use is not an extract + store, it's probably worth splitting.
- if (User->getOpcode() != ISD::EXTRACT_SUBVECTOR || !User->hasOneUse() ||
- User->user_begin()->getOpcode() != ISD::STORE)
- return true;
+ // If any use is an extract + store, it's probably not worth splitting.
+ if (User->getOpcode() == ISD::EXTRACT_SUBVECTOR &&
+ all_of(User->uses(), [&](const SDUse &U) {
+ const SDNode *Inner = PeekThroughOneUserBitcasts(U.getUser());
+ return Inner->getOpcode() == ISD::STORE;
+ }))
+ return false;
+
+ // If any use is a full width legal/target bin op, then assume it's legal
+ // and we shouldn't split.
+ if (isBinOp(User->getOpcode()) &&
+ (isOperationLegal(User->getOpcode(), VT) ||
+ User->getOpcode() > ISD::BUILTIN_OP_END))
+ return false;
}
- // All non-chain uses are extract + store.
- return false;
}
return true;
@@ -4001,8 +4016,14 @@ static SDValue getConstVector(ArrayRef<APInt> Bits, const APInt &Undefs,
const APInt &V = Bits[i];
assert(V.getBitWidth() == VT.getScalarSizeInBits() && "Unexpected sizes");
if (Split) {
- Ops.push_back(DAG.getConstant(V.trunc(32), dl, EltVT));
- Ops.push_back(DAG.getConstant(V.lshr(32).trunc(32), dl, EltVT));
+ Ops.push_back(DAG.getConstant(V.extractBits(32, 0), dl, EltVT));
+ Ops.push_back(DAG.getConstant(V.extractBits(32, 32), dl, EltVT));
+ } else if (EltVT == MVT::bf16) {
+ APFloat FV(APFloat::BFloat(), V);
+ Ops.push_back(DAG.getConstantFP(FV, dl, EltVT));
+ } else if (EltVT == MVT::f16) {
+ APFloat FV(APFloat::IEEEhalf(), V);
+ Ops.push_back(DAG.getConstantFP(FV, dl, EltVT));
} else if (EltVT == MVT::f32) {
APFloat FV(APFloat::IEEEsingle(), V);
Ops.push_back(DAG.getConstantFP(FV, dl, EltVT));
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index 4a2b35e9efe7c..0332325b145eb 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1337,8 +1337,8 @@ namespace llvm {
unsigned Depth) const override;
bool isTargetCanonicalConstantNode(SDValue Op) const override {
- // Peek through bitcasts/extracts/inserts to see if we have a broadcast
- // vector from memory.
+ // Peek through bitcasts/extracts/inserts to see if we have a vector
+ // load/broadcast from memory.
while (Op.getOpcode() == ISD::BITCAST ||
Op.getOpcode() == ISD::EXTRACT_SUBVECTOR ||
(Op.getOpcode() == ISD::INSERT_SUBVECTOR &&
@@ -1346,6 +1346,9 @@ namespace llvm {
Op = Op.getOperand(Op.getOpcode() == ISD::INSERT_SUBVECTOR ? 1 : 0);
return Op.getOpcode() == X86ISD::VBROADCAST_LOAD ||
+ Op.getOpcode() == X86ISD::SUBV_BROADCAST_LOAD ||
+ (Op.getOpcode() == ISD::LOAD &&
+ getTargetConstantFromLoad(cast<LoadSDNode>(Op))) ||
TargetLowering::isTargetCanonicalConstantNode(Op);
}
diff --git a/llvm/test/CodeGen/X86/oddsubvector.ll b/llvm/test/CodeGen/X86/oddsubvector.ll
index 2f557679a1558..a1da40e7e7655 100644
--- a/llvm/test/CodeGen/X86/oddsubvector.ll
+++ b/llvm/test/CodeGen/X86/oddsubvector.ll
@@ -261,9 +261,9 @@ define void @PR42833() {
;
; AVX2-LABEL: PR42833:
; AVX2: # %bb.0:
-; AVX2-NEXT: movl b(%rip), %eax
; AVX2-NEXT: vmovdqu c+128(%rip), %ymm0
-; AVX2-NEXT: addl c+128(%rip), %eax
+; AVX2-NEXT: vmovd %xmm0, %eax
+; AVX2-NEXT: addl b(%rip), %eax
; AVX2-NEXT: vmovd %eax, %xmm1
; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm2
; AVX2-NEXT: vpaddd %ymm0, %ymm0, %ymm3
@@ -284,10 +284,10 @@ define void @PR42833() {
;
; AVX512-LABEL: PR42833:
; AVX512: # %bb.0:
-; AVX512-NEXT: movl b(%rip), %eax
; AVX512-NEXT: vmovdqu c+128(%rip), %ymm0
; AVX512-NEXT: vmovdqu64 c+128(%rip), %zmm1
-; AVX512-NEXT: addl c+128(%rip), %eax
+; AVX512-NEXT: vmovd %xmm0, %eax
+; AVX512-NEXT: addl b(%rip), %eax
; AVX512-NEXT: vmovd %eax, %xmm2
; AVX512-NEXT: vpaddd %ymm2, %ymm0, %ymm2
; AVX512-NEXT: vpaddd %ymm0, %ymm0, %ymm0
diff --git a/llvm/test/CodeGen/X86/setcc-lowering.ll b/llvm/test/CodeGen/X86/setcc-lowering.ll
index cdf9e180345ed..5a72bf3e18bd8 100644
--- a/llvm/test/CodeGen/X86/setcc-lowering.ll
+++ b/llvm/test/CodeGen/X86/setcc-lowering.ll
@@ -11,7 +11,7 @@ define <8 x i16> @pr25080(<8 x i32> %a) nounwind {
; AVX1-LABEL: pr25080:
; AVX1: # %bb.0: # %entry
; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm0
-; AVX1-NEXT: vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX1-NEXT: vpand {{\.?LCPI[0-9]+_[0-9]+}}+16(%rip), %xmm0, %xmm0
; AVX1-NEXT: vpxor %xmm1, %xmm1, %xmm1
; AVX1-NEXT: vpcmpeqd %xmm1, %xmm0, %xmm0
; AVX1-NEXT: vpackssdw %xmm0, %xmm0, %xmm0
diff --git a/llvm/test/CodeGen/X86/vec_int_to_fp.ll b/llvm/test/CodeGen/X86/vec_int_to_fp.ll
index af841cf38b24a..c3b67bc5a8c8c 100644
--- a/llvm/test/CodeGen/X86/vec_int_to_fp.ll
+++ b/llvm/test/CodeGen/X86/vec_int_to_fp.ll
@@ -4228,7 +4228,7 @@ define <4 x float> @uitofp_load_4i64_to_4f32(ptr%a) {
; AVX1: # %bb.0:
; AVX1-NEXT: vmovdqa (%rdi), %ymm0
; AVX1-NEXT: vpsrlq $1, %xmm0, %xmm1
-; AVX1-NEXT: vmovdqa 16(%rdi), %xmm2
+; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm2
; AVX1-NEXT: vpsrlq $1, %xmm2, %xmm3
; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm1, %ymm1
; AVX1-NEXT: vandpd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm3
@@ -4273,7 +4273,8 @@ define <4 x float> @uitofp_load_4i64_to_4f32(ptr%a) {
; AVX2-NEXT: vcvtsi2ss %rax, %xmm4, %xmm1
; AVX2-NEXT: vinsertps {{.*#+}} xmm1 = xmm2[0,1,2],xmm1[0]
; AVX2-NEXT: vaddps %xmm1, %xmm1, %xmm2
-; AVX2-NEXT: vpackssdw 16(%rdi), %xmm0, %xmm0
+; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm3
+; AVX2-NEXT: vpackssdw %xmm3, %xmm0, %xmm0
; AVX2-NEXT: vblendvps %xmm0, %xmm2, %xmm1, %xmm0
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
@@ -4658,7 +4659,7 @@ define <8 x float> @uitofp_load_8i64_to_8f32(ptr%a) {
; AVX1-NEXT: vbroadcastsd {{.*#+}} ymm2 = [1,1,1,1]
; AVX1-NEXT: vandps %ymm2, %ymm1, %ymm3
; AVX1-NEXT: vpsrlq $1, %xmm1, %xmm4
-; AVX1-NEXT: vmovdqa 48(%rdi), %xmm5
+; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm5
; AVX1-NEXT: vpsrlq $1, %xmm5, %xmm6
; AVX1-NEXT: vinsertf128 $1, %xmm6, %ymm4, %ymm4
; AVX1-NEXT: vorps %ymm3, %ymm4, %ymm3
@@ -4680,7 +4681,7 @@ define <8 x float> @uitofp_load_8i64_to_8f32(ptr%a) {
; AVX1-NEXT: vblendvps %xmm1, %xmm4, %xmm3, %xmm1
; AVX1-NEXT: vandps %ymm2, %ymm0, %ymm2
; AVX1-NEXT: vpsrlq $1, %xmm0, %xmm3
-; AVX1-NEXT: vmovdqa 16(%rdi), %xmm4
+; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm4
; AVX1-NEXT: vpsrlq $1, %xmm4, %xmm5
; AVX1-NEXT: vinsertf128 $1, %xmm5, %ymm3, %ymm3
; AVX1-NEXT: vorps %ymm2, %ymm3, %ymm2
@@ -4725,7 +4726,8 @@ define <8 x float> @uitofp_load_8i64_to_8f32(ptr%a) {
; AVX2-NEXT: vcvtsi2ss %rax, %xmm6, %xmm3
; AVX2-NEXT: vinsertps {{.*#+}} xmm3 = xmm4[0,1,2],xmm3[0]
; AVX2-NEXT: vaddps %xmm3, %xmm3, %xmm4
-; AVX2-NEXT: vpackssdw 48(%rdi), %xmm1, %xmm1
+; AVX2-NEXT: vextracti128 $1, %ymm1, %xmm5
+; AVX2-NEXT: vpackssdw %xmm5, %xmm1, %xmm1
; AVX2-NEXT: vblendvps %xmm1, %xmm4, %xmm3, %xmm1
; AVX2-NEXT: vandps %ymm2, %ymm0, %ymm2
; AVX2-NEXT: vpsrlq $1, %ymm0, %ymm3
@@ -4744,7 +4746,8 @@ define <8 x float> @uitofp_load_8i64_to_8f32(ptr%a) {
; AVX2-NEXT: vcvtsi2ss %rax, %xmm6, %xmm2
; AVX2-NEXT: vinsertps {{.*#+}} xmm2 = xmm3[0,1,2],xmm2[0]
; AVX2-NEXT: vaddps %xmm2, %xmm2, %xmm3
-; AVX2-NEXT: vpackssdw 16(%rdi), %xmm0, %xmm0
+; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm4
+; AVX2-NEXT: vpackssdw %xmm4, %xmm0, %xmm0
; AVX2-NEXT: vblendvps %xmm0, %xmm3, %xmm2, %xmm0
; AVX2-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
; AVX2-NEXT: retq
@@ -4849,7 +4852,7 @@ define <8 x float> @uitofp_load_8i32_to_8f32(ptr%a) {
; AVX1: # %bb.0:
; AVX1-NEXT: vmovdqa (%rdi), %ymm0
; AVX1-NEXT: vpsrld $16, %xmm0, %xmm1
-; AVX1-NEXT: vmovdqa 16(%rdi), %xmm2
+; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm2
; AVX1-NEXT: vpsrld $16, %xmm2, %xmm2
; AVX1-NEXT: vinsertf128 $1, %xmm2, %ymm1, %ymm1
; AVX1-NEXT: vcvtdq2ps %ymm1, %ymm1
diff --git a/llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-3.ll b/llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-3.ll
index a39bc6b668669..73a1684c09c70 100644
--- a/llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-3.ll
+++ b/llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-3.ll
@@ -977,11 +977,11 @@ define void @load_i16_stride3_vf16(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, pt
;
; AVX512-LABEL: load_i16_stride3_vf16:
; AVX512: # %bb.0:
-; AVX512-NEXT: vmovdqa 32(%rdi), %ymm1
-; AVX512-NEXT: vmovdqa (%rdi), %ymm2
+; AVX512-NEXT: vmovdqa (%rdi), %ymm1
+; AVX512-NEXT: vmovdqa 32(%rdi), %ymm2
; AVX512-NEXT: vmovdqa {{.*#+}} ymm0 = [65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535]
; AVX512-NEXT: vmovdqa %ymm0, %ymm3
-; AVX512-NEXT: vpternlogq {{.*#+}} ymm3 = ymm1 ^ (ymm3 & (ymm2 ^ ymm1))
+; AVX512-NEXT: vpternlogq {{.*#+}} ymm3 = ymm2 ^ (ymm3 & (ymm1 ^ ymm2))
; AVX512-NEXT: vpermq {{.*#+}} ymm4 = ymm3[2,3,0,1]
; AVX512-NEXT: vpblendw {{.*#+}} ymm3 = ymm3[0],ymm4[1],ymm3[2,3],ymm4[4],ymm3[5,6],ymm4[7],ymm3[8],ymm4[9],ymm3[10,11],ymm4[12],ymm3[13,14],ymm4[15]
; AVX512-NEXT: vpshufb {{.*#+}} ymm3 = ymm3[0,1,6,7,12,13,2,3,4,5,14,15,8,9,10,11,16,17,22,23,28,29,18,19,20,21,30,31,24,25,26,27]
@@ -993,19 +993,19 @@ define void @load_i16_stride3_vf16(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, pt
; AVX512-NEXT: vpblendw {{.*#+}} ymm6 = ymm3[0,1,2],ymm6[3,4,5,6,7],ymm3[8,9,10],ymm6[11,12,13,14,15]
; AVX512-NEXT: vpshufhw {{.*#+}} xmm3 = xmm3[0,1,2,3,6,5,4,7]
; AVX512-NEXT: vpblendd {{.*#+}} ymm3 = ymm3[0,1,2,3],ymm6[4,5,6,7]
-; AVX512-NEXT: vmovdqa {{.*#+}} ymm6 = [65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535]
-; AVX512-NEXT: vpternlogq {{.*#+}} ymm6 = ymm2 ^ (ymm6 & (ymm1 ^ ymm2))
-; AVX512-NEXT: vpermq {{.*#+}} ymm7 = ymm6[2,3,0,1]
-; AVX512-NEXT: vpblendw {{.*#+}} ymm6 = ymm6[0,1],ymm7[2],ymm6[3,4],ymm7[5],ymm6[6,7,8,9],ymm7[10],ymm6[11,12],ymm7[13],ymm6[14,15]
-; AVX512-NEXT: vpshufb {{.*#+}} ymm6 = ymm6[2,3,8,9,14,15,4,5,12,13,10,11,0,1,6,7,18,19,24,25,30,31,20,21,28,29,26,27,16,17,22,23]
-; AVX512-NEXT: vpblendw {{.*#+}} xmm7 = xmm4[0,1],xmm5[2],xmm4[3,4],xmm5[5],xmm4[6,7]
-; AVX512-NEXT: vpshufb {{.*#+}} xmm7 = xmm7[u,u,u,u,u,u,4,5,10,11,0,1,6,7,12,13]
-; AVX512-NEXT: vinserti128 $1, %xmm7, %ymm0, %ymm7
-; AVX512-NEXT: vpblendw {{.*#+}} ymm7 = ymm6[0,1,2],ymm7[3,4,5,6,7],ymm6[8,9,10],ymm7[11,12,13,14,15]
-; AVX512-NEXT: vpshufhw {{.*#+}} xmm6 = xmm6[0,1,2,3,5,6,7,4]
-; AVX512-NEXT: vpblendd {{.*#+}} ymm6 = ymm6[0,1,2,3],ymm7[4,5,6,7]
-; AVX512-NEXT: vpternlogq {{.*#+}} ymm0 = ymm2 ^ (ymm0 & (ymm1 ^ ymm2))
-; AVX512-NEXT: vmovdqa 16(%rdi), %xmm1
+; AVX512-NEXT: vpblendw {{.*#+}} xmm6 = xmm4[0,1],xmm5[2],xmm4[3,4],xmm5[5],xmm4[6,7]
+; AVX512-NEXT: vpshufb {{.*#+}} xmm6 = xmm6[u,u,u,u,u,u,4,5,10,11,0,1,6,7,12,13]
+; AVX512-NEXT: vinserti128 $1, %xmm6, %ymm0, %ymm6
+; AVX512-NEXT: vmovdqa {{.*#+}} ymm7 = [65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535]
+; AVX512-NEXT: vpternlogq {{.*#+}} ymm7 = ymm1 ^ (ymm7 & (ymm2 ^ ymm1))
+; AVX512-NEXT: vpermq {{.*#+}} ymm8 = ymm7[2,3,0,1]
+; AVX512-NEXT: vpblendw {{.*#+}} ymm7 = ymm7[0,1],ymm8[2],ymm7[3,4],ymm8[5],ymm7[6,7,8,9],ymm8[10],ymm7[11,12],ymm8[13],ymm7[14,15]
+; AVX512-NEXT: vpshufb {{.*#+}} ymm7 = ymm7[2,3,8,9,14,15,4,5,12,13,10,11,0,1,6,7,18,19,24,25,30,31,20,21,28,29,26,27,16,17,22,23]
+; AVX512-NEXT: vpblendw {{.*#+}} ymm6 = ymm7[0,1,2],ymm6[3,4,5,6,7],ymm7[8,9,10],ymm6[11,12,13,14,15]
+; AVX512-NEXT: vpshufhw {{.*#+}} xmm7 = xmm7[0,1,2,3,5,6,7,4]
+; AVX512-NEXT: vpblendd {{.*#+}} ymm6 = ymm7[0,1,2,3],ymm6[4,5,6,7]
+; AVX512-NEXT: vpternlogq {{.*#+}} ymm0 = ymm1 ^ (ymm0 & (ymm2 ^ ymm1))
+; AVX512-NEXT: vextracti128 $1, %ymm1, %xmm1
; AVX512-NEXT: vpblendw {{.*#+}} ymm0 = ymm1[0],ymm0[1,2],ymm1[3],ymm0[4,5],ymm1[6],ymm0[7],ymm1[8],ymm0[9,10],ymm1[11],ymm0[12,13],ymm1[14],ymm0[15]
; AVX512-NEXT: vmovdqa {{.*#+}} ymm1 = [4,5,10,11,0,1,6,7,12,13,2,3,8,9,14,15,20,21,26,27,u,u,u,u,u,u,u,u,u,u,u,u]
; AVX512-NEXT: vpshufb %ymm1, %ymm0, %ymm0
@@ -1021,11 +1021,11 @@ define void @load_i16_stride3_vf16(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, pt
;
; AVX512-FCP-LABEL: load_i16_stride3_vf16:
; AVX512-FCP: # %bb.0:
-; AVX512-FCP-NEXT: vmovdqa 32(%rdi), %ymm1
-; AVX512-FCP-NEXT: vmovdqa (%rdi), %ymm2
+; AVX512-FCP-NEXT: vmovdqa (%rdi), %ymm1
+; AVX512-FCP-NEXT: vmovdqa 32(%rdi), %ymm2
; AVX512-FCP-NEXT: vmovdqa {{.*#+}} ymm0 = [65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535]
; AVX512-FCP-NEXT: vmovdqa %ymm0, %ymm3
-; AVX512-FCP-NEXT: vpternlogq {{.*#+}} ymm3 = ymm1 ^ (ymm3 & (ymm2 ^ ymm1))
+; AVX512-FCP-NEXT: vpternlogq {{.*#+}} ymm3 = ymm2 ^ (ymm3 & (ymm1 ^ ymm2))
; AVX512-FCP-NEXT: vpermq {{.*#+}} ymm4 = ymm3[2,3,0,1]
; AVX512-FCP-NEXT: vpblendw {{.*#+}} ymm3 = ymm3[0],ymm4[1],ymm3[2,3],ymm4[4],ymm3[5,6],ymm4[7],ymm3[8],ymm4[9],ymm3[10,11],ymm4[12],ymm3[13,14],ymm4[15]
; AVX512-FCP-NEXT: vpshufb {{.*#+}} ymm3 = ymm3[0,1,6,7,12,13,2,3,4,5,14,15,8,9,10,11,16,17,22,23,28,29,18,19,20,21,30,31,24,25,26,27]
@@ -1037,19 +1037,19 @@ define void @load_i16_stride3_vf16(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, pt
; AVX512-FCP-NEXT: vpblendw {{.*#+}} ymm6 = ymm3[0,1,2],ymm6[3,4,5,6,7],ymm3[8,9,10],ymm6[11,12,13,14,15]
; AVX512-FCP-NEXT: vpshufhw {{.*#+}} xmm3 = xmm3[0,1,2,3,6,5,4,7]
; AVX512-FCP-NEXT: vpblendd {{.*#+}} ymm3 = ymm3[0,1,2,3],ymm6[4,5,6,7]
-; AVX512-FCP-NEXT: vmovdqa {{.*#+}} ymm6 = [65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535]
-; AVX512-FCP-NEXT: vpternlogq {{.*#+}} ymm6 = ymm2 ^ (ymm6 & (ymm1 ^ ymm2))
-; AVX512-FCP-NEXT: vpermq {{.*#+}} ymm7 = ymm6[2,3,0,1]
-; AVX512-FCP-NEXT: vpblendw {{.*#+}} ymm6 = ymm6[0,1],ymm7[2],ymm6[3,4],ymm7[5],ymm6[6,7,8,9],ymm7[10],ymm6[11,12],ymm7[13],ymm6[14,15]
-; AVX512-FCP-NEXT: vpshufb {{.*#+}} ymm6 = ymm6[2,3,8,9,14,15,4,5,12,13,10,11,0,1,6,7,18,19,24,25,30,31,20,21,28,29,26,27,16,17,22,23]
-; AVX512-FCP-NEXT: vpblendw {{.*#+}} xmm7 = xmm4[0,1],xmm5[2],xmm4[3,4],xmm5[5],xmm4[6,7]
-; AVX512-FCP-NEXT: vpshufb {{.*#+}} xmm7 = xmm7[u,u,u,u,u,u,4,5,10,11,0,1,6,7,12,13]
-; AVX512-FCP-NEXT: vinserti128 $1, %xmm7, %ymm0, %ymm7
-; AVX512-FCP-NEXT: vpblendw {{.*#+}} ymm7 = ymm6[0,1,2],ymm7[3,4,5,6,7],ymm6[8,9,10],ymm7[11,12,13,14,15]
-; AVX512-FCP-NEXT: vpshufhw {{.*#+}} xmm6 = xmm6[0,1,2,3,5,6,7,4]
-; AVX512-FCP-NEXT: vpblendd {{.*#+}} ymm6 = ymm6[0,1,2,3],ymm7[4,5,6,7]
-; AVX512-FCP-NEXT: vpternlogq {{.*#+}} ymm0 = ymm2 ^ (ymm0 & (ymm1 ^ ymm2))
-; AVX512-FCP-NEXT: vmovdqa 16(%rdi), %xmm1
+; AVX512-FCP-NEXT: vpblendw {{.*#+}} xmm6 = xmm4[0,1],xmm5[2],xmm4[3,4],xmm5[5],xmm4[6,7]
+; AVX512-FCP-NEXT: vpshufb {{.*#+}} xmm6 = xmm6[u,u,u,u,u,u,4,5,10,11,0,1,6,7,12,13]
+; AVX512-FCP-NEXT: vinserti128 $1, %xmm6, %ymm0, %ymm6
+; AVX512-FCP-NEXT: vmovdqa {{.*#+}} ymm7 = [65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535]
+; AVX512-FCP-NEXT: vpternlogq {{.*#+}} ymm7 = ymm1 ^ (ymm7 & (ymm2 ^ ymm1))
+; AVX512-FCP-NEXT: vpermq {{.*#+}} ymm8 = ymm7[2,3,0,1]
+; AVX512-FCP-NEXT: vpblendw {{.*#+}} ymm7 = ymm7[0,1],ymm8[2],ymm7[3,4],ymm8[5],ymm7[6,7,8,9],ymm8[10],ymm7[11,12],ymm8[13],ymm7[14,15]
+; AVX512-FCP-NEXT: vpshufb {{.*#+}} ymm7 = ymm7[2,3,8,9,14,15,4,5,12,13,10,11,0,1,6,7,18,19,24,25,30,31,20,21,28,29,26,27,16,17,22,23]
+; AVX512-FCP-NEXT: vpblendw {{.*#+}} ymm6 = ymm7[0,1,2],ymm6[3,4,5,6,7],ymm7[8,9,10],ymm6[11,12,13,14,15]
+; AVX512-FCP-NEXT: vpshufhw {{.*#+}} xmm7 = xmm7[0,1,2,3,5,6,7,4]
+; AVX512-FCP-NEXT: vpblendd {{.*#+}} ymm6 = ymm7[0,1,2,3],ymm6[4,5,6,7]
+; AVX512-FCP-NEXT: vpternlogq {{.*#+}} ymm0 = ymm1 ^ (ymm0 & (ymm2 ^ ymm1))
+; AVX512-FCP-NEXT: vextracti128 $1, %ymm1, %xmm1
; AVX512-FCP-NEXT: vpblendw {{.*#+}} ymm0 = ymm1[0],ymm0[1,2],ymm1[3],ymm0[4,5],ymm1[6],ymm0[7],ymm1[8],ymm0[9,10],ymm1[11],ymm0[12,13],ymm1[14],ymm0[15]
; AVX512-FCP-NEXT: vmovdqa {{.*#+}} ymm1 = [4,5,10,11,0,1,6,7,12,13,2,3,8,9,14,15,20,21,26,27,u,u,u,u,u,u,u,u,u,u,u,u]
; AVX512-FCP-NEXT: vpshufb %ymm1, %ymm0, %ymm0
@@ -1065,11 +1065,11 @@ define void @load_i16_stride3_vf16(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, pt
;
; AVX512DQ-LABEL: load_i16_stride3_vf16:
; AVX512DQ: # %bb.0:
-; AVX512DQ-NEXT: vmovdqa 32(%rdi), %ymm1
-; AVX512DQ-NEXT: vmovdqa (%rdi), %ymm2
+; AVX512DQ-NEXT: vmovdqa (%rdi), %ymm1
+; AVX512DQ-NEXT: vmovdqa 32(%rdi), %ymm2
; AVX512DQ-NEXT: vmovdqa {{.*#+}} ymm0 = [65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535]
; AVX512DQ-NEXT: vmovdqa %ymm0, %ymm3
-; AVX512DQ-NEXT: vpternlogq {{.*#+}} ymm3 = ymm1 ^ (ymm3 & (ymm2 ^ ymm1))
+; AVX512DQ-NEXT: vpternlogq {{.*#+}} ymm3 = ymm2 ^ (ymm3 & (ymm1 ^ ymm2))
; AVX512DQ-NEXT: vpermq {{.*#+}} ymm4 = ymm3[2,3,0,1]
; AVX512DQ-NEXT: vpblendw {{.*#+}} ymm3 = ymm3[0],ymm4[1],ymm3[2,3],ymm4[4],ymm3[5,6],ymm4[7],ymm3[8],ymm4[9],ymm3[10,11],ymm4[12],ymm3[13,14],ymm4[15]
; AVX512DQ-NEXT: vpshufb {{.*#+}} ymm3 = ymm3[0,1,6,7,12,13,2,3,4,5,14,15,8,9,10,11,16,17,22,23,28,29,18,19,20,21,30,31,24,25,26,27]
@@ -1081,19 +1081,19 @@ define void @load_i16_stride3_vf16(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, pt
; AVX512DQ-NEXT: vpblendw {{.*#+}} ymm6 = ymm3[0,1,2],ymm6[3,4,5,6,7],ymm3[8,9,10],ymm6[11,12,13,14,15]
; AVX512DQ-NEXT: vpshufhw {{.*#+}} xmm3 = xmm3[0,1,2,3,6,5,4,7]
; AVX512DQ-NEXT: vpblendd {{.*#+}} ymm3 = ymm3[0,1,2,3],ymm6[4,5,6,7]
-; AVX512DQ-NEXT: vmovdqa {{.*#+}} ymm6 = [65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535]
-; AVX512DQ-NEXT: vpternlogq {{.*#+}} ymm6 = ymm2 ^ (ymm6 & (ymm1 ^ ymm2))
-; AVX512DQ-NEXT: vpermq {{.*#+}} ymm7 = ymm6[2,3,0,1]
-; AVX512DQ-NEXT: vpblendw {{.*#+}} ymm6 = ymm6[0,1],ymm7[2],ymm6[3,4],ymm7[5],ymm6[6,7,8,9],ymm7[10],ymm6[11,12],ymm7[13],ymm6[14,15]
-; AVX512DQ-NEXT: vpshufb {{.*#+}} ymm6 = ymm6[2,3,8,9,14,15,4,5,12,13,10,11,0,1,6,7,18,19,24,25,30,31,20,21,28,29,26,27,16,17,22,23]
-; AVX512DQ-NEXT: v...
[truncated]
Force-pushed from 678ca30 to 8489c8e
…ment (#136723) Based off feedback for #129695 - we need to be able to determine the load offset of smaller loads when trying to determine whether a multiple use load should be split (in particular for AVX subvector extractions). This patch adds a std::optional<unsigned> ByteOffset argument to shouldReduceLoadWidth calls for where we know the constant offset to allow targets to make use of it in future patches.
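For reference, a minimal sketch of how the ByteOffset argument described in that commit might be threaded through and consumed. The surrounding types and names here are stand-ins for illustration only, not the real TargetLowering/DAGCombiner code:

```cpp
#include <optional>

// Stand-in for the hook after #136723: callers that know where the narrowed
// load would read from pass the constant byte offset; others pass
// std::nullopt and the target has to stay conservative.
struct TargetHookSketch {
  virtual ~TargetHookSketch() = default;
  virtual bool shouldReduceLoadWidth(unsigned WideBits, unsigned NarrowBits,
                                     std::optional<unsigned> ByteOffset) const {
    // Default behaviour: narrowing is allowed.
    (void)WideBits; (void)NarrowBits; (void)ByteOffset;
    return true;
  }
};

// Example caller: narrowing a wide vector load because of a subvector
// extraction at a constant index - the byte offset is index * element size,
// so an AVX target can tell a free low-half reuse (offset 0) from an
// upper-half extraction.
bool canNarrowForExtract(const TargetHookSketch &TLI, unsigned WideBits,
                         unsigned NarrowBits, unsigned ExtractEltIdx,
                         unsigned EltSizeInBytes) {
  std::optional<unsigned> ByteOffset = ExtractEltIdx * EltSizeInBytes;
  return TLI.shouldReduceLoadWidth(WideBits, NarrowBits, ByteOffset);
}
```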
Force-pushed from 8489c8e to 3b3e83c
✅ With the latest revision this PR passed the C/C++ code formatter.
Force-pushed from 3b3e83c to bbf31a1
…ract+store or a full width legal binop Currently shouldReduceLoadWidth is very relaxed about when loads can be split to avoid extractions from the original full width load - resulting in many cases where the number of memory operations notably increases, trading the cost of an extract_subvector for additional loads. This patch adjusts the 256/512-bit vector load splitting metric to not split if ANY use of the full width load can be used directly - either in an extract+store (previously ALL uses had to be extract+store to prevent splits) or is used by a legal binop (so unlikely to be split itself). This required a number of fixes - shouldReduceLoadWidth now needs to peek through bitcasts UP the use-chain to find final users (limited to hasOneUse cases to reduce complexity). It also exposed an issue in isTargetCanonicalConstantNode which assumed that a load of vector constant data would always extract, which is no longer the case.
Force-pushed from bbf31a1 to 878a957
…128/I128 into a regular load This is mainly to help remove subvector extractions from spilled YMM registers. We can extend this to all the AVX512 variants; I've tried to make the implementation ready for this (a quick test indicated it's mainly test coverage for AVX512 that we're missing). What I'm not sure about is how best we can then fold this new smaller load into another instruction (you can see some examples of this in vector-interleaved-load-i32-stride-8.ll). The comment still saying "32-byte Reload" is annoying, but we already have this for many other element/subvector load folds. Noticed while looking at next steps after llvm#129695
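As a rough illustration of the combine that commit describes (concatenating two consecutive 128-bit loads back into one wide load), here is a simplified DAG-combine sketch. It is not the patch's implementation and omits details such as chain updates, type checks on the half loads, and profitability heuristics:

```cpp
#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"

using namespace llvm;

// Sketch: fold (concat_vectors (load p), (load p+N/2)) -> (load p) when both
// halves are simple, normal, consecutive loads.
static SDValue combineConcatOfConsecutiveLoads(SDNode *N, SelectionDAG &DAG) {
  if (N->getOpcode() != ISD::CONCAT_VECTORS || N->getNumOperands() != 2)
    return SDValue();

  auto *Lo = dyn_cast<LoadSDNode>(peekThroughBitcasts(N->getOperand(0)));
  auto *Hi = dyn_cast<LoadSDNode>(peekThroughBitcasts(N->getOperand(1)));
  if (!Lo || !Hi || !ISD::isNormalLoad(Lo) || !ISD::isNormalLoad(Hi) ||
      !Lo->isSimple() || !Hi->isSimple())
    return SDValue();

  EVT VT = N->getValueType(0);
  unsigned HalfBytes = VT.getStoreSize().getFixedValue() / 2;

  // The high half must load exactly HalfBytes beyond the low half.
  if (!DAG.areNonVolatileConsecutiveLoads(Hi, Lo, HalfBytes, /*Dist=*/1))
    return SDValue();

  // Replace the concat with a single wide load; the narrow loads are left to
  // be cleaned up if they end up unused.
  SDLoc DL(N);
  return DAG.getLoad(VT, DL, Lo->getChain(), Lo->getBasePtr(),
                     Lo->getPointerInfo(), Lo->getAlign());
}
```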
ping?
; AVX512-NEXT: vpermq {{.*#+}} zmm2 = zmm2[2,2,2,3,6,6,6,7]
; AVX512-NEXT: vpternlogq {{.*#+}} zmm2 = zmm2 ^ (mem & (zmm2 ^ zmm8))
; AVX512-NEXT: vpshufb %ymm7, %ymm1, %ymm3
; AVX512-NEXT: vbroadcasti64x4 {{.*#+}} zmm3 = [65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535]
Seems we have one more vbroadcasti64x4 here?
It's been hoisted out of the folded vpternlogq load below - we've managed to simplify the constant and allowed a zmm full width load to be replaced with a broadcast.
LGTM.
…e full width legal binop (llvm#129695) Currently shouldReduceLoadWidth is very relaxed about when loads can be split to avoid extractions from the original full width load - resulting in many cases where the number of memory operations notably increases, trading the cost of an extract_subvector for additional loads. This patch adjusts the 256/512-bit vector load splitting metric to detect cases where ANY use of the full width load is used directly - in which case we will now reuse that load for smaller types, unless we'd need to extract an upper subvector / integer element - i.e. we now correctly treat (extract_subvector cst, 0) as free. We retain the existing logic of never splitting loads if all uses are extract+stores but we improve this by peeking through bitcasts while looking for extract_subvector/store chains. This required a number of fixes - shouldReduceLoadWidth now needs to peek through bitcasts UP the use-chain to find final users (limited to hasOneUse cases to reduce complexity). It also exposed an issue in isTargetCanonicalConstantNode which assumed that a load of vector constant data would always extract, which is no longer the case.
Currently shouldReduceLoadWidth is very relaxed about when loads can be split to avoid extractions from the original full width load - resulting in many cases where the number of memory operations notably increases, trading the cost of an extract_subvector for additional loads.
This patch adjusts the 256/512-bit vector load splitting metric to detect cases where ANY use of the full width load is used directly - in which case we will now reuse that load for smaller types, unless we'd need to extract an upper subvector / integer element - i.e. we now correctly treat (extract_subvector cst, 0) as free.
We retain the existing logic of never splitting loads if all uses are extract+stores but we improve this by peeking through bitcasts while looking for extract_subvector/store chains.
This required a number of fixes - shouldReduceLoadWidth now needs to peek through bitcasts UP the use-chain to find final users (limited to hasOneUse cases to reduce complexity). It also exposed an issue in isTargetCanonicalConstantNode which assumed that a load of vector constant data would always extract, which is no longer the case.
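To summarise the policy above in one place, here is a small self-contained C++ sketch of the decision logic as described in this PR and #136723. It is a restatement for readability only, not the code in X86ISelLowering.cpp, which operates on the actual SelectionDAG use lists:

```cpp
#include <optional>

// Returning true permits narrowing/splitting the wide 256/512-bit load;
// returning false keeps the full width load live and reuses it.
bool shouldReduceWideVectorLoad(bool AllUsesAreExtractPlusStore,
                                bool AnyUseConsumesFullWidth,
                                std::optional<unsigned> NarrowByteOffset) {
  // Retained rule: if every use is an extract_subvector feeding a store
  // (looking through one-use bitcasts), the extract folds into a narrow
  // store, so splitting buys nothing.
  if (AllUsesAreExtractPlusStore)
    return false;

  // New rule: if any use already consumes the full width load, keep it and
  // reuse it for the narrow access as long as that access reads the low part
  // (byte offset 0) - i.e. (extract_subvector x, 0) is free. An unknown or
  // non-zero offset would still need an upper subvector/element extraction,
  // so splitting stays allowed in that case.
  if (AnyUseConsumesFullWidth)
    return NarrowByteOffset.value_or(1) != 0;

  return true;
}
```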