[X86] shouldReduceLoadWidth - don't split loads if we can freely reuse full width legal binop #129695
Conversation
@llvm/pr-subscribers-backend-x86

Author: Simon Pilgrim (RKSimon)

Changes

Currently shouldReduceLoadWidth is very relaxed about when loads can be split to avoid extractions from the original full width load - resulting in many cases where the number of memory operations notably increases, trading the cost of an extract_subvector for additional loads.

This patch adjusts the 256/512-bit vector load splitting metric to not split if ANY use of the full width load can be used directly - either in an extract+store (previously ALL uses had to be extract+store to prevent splits) or by a legal binop (which is therefore unlikely to be split itself).

This required a number of fixes - shouldReduceLoadWidth now needs to peek through bitcasts UP the use-chain to find final users (limited to hasOneUse cases to reduce complexity). It also exposed an issue in isTargetCanonicalConstantNode, which assumed that a load of vector constant data would always be extracted, which is no longer the case.

Patch is 482.57 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/129695.diff

16 Files Affected:
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index bbab43d4e92af..9e51bef650562 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -3260,6 +3260,12 @@ bool X86TargetLowering::shouldReduceLoadWidth(SDNode *Load,
EVT NewVT) const {
assert(cast<LoadSDNode>(Load)->isSimple() && "illegal to narrow");
+ auto PeekThroughOneUserBitcasts = [](const SDNode *N) {
+ while (N->getOpcode() == ISD::BITCAST && N->hasOneUse())
+ N = *N->user_begin();
+ return N;
+ };
+
// "ELF Handling for Thread-Local Storage" specifies that R_X86_64_GOTTPOFF
// relocation target a movq or addq instruction: don't let the load shrink.
SDValue BasePtr = cast<LoadSDNode>(Load)->getBasePtr();
@@ -3267,9 +3273,10 @@ bool X86TargetLowering::shouldReduceLoadWidth(SDNode *Load,
if (const auto *GA = dyn_cast<GlobalAddressSDNode>(BasePtr.getOperand(0)))
return GA->getTargetFlags() != X86II::MO_GOTTPOFF;
- // If this is an (1) AVX vector load with (2) multiple uses and (3) all of
+ // If this is an (1) AVX vector load with (2) multiple uses and (3) any of
// those uses are extracted directly into a store, then the extract + store
- // can be store-folded. Therefore, it's probably not worth splitting the load.
+ // can be store-folded, or (4) any use is consumed by a legal full width
+ // instruction. In those cases it's probably not worth splitting the load.
EVT VT = Load->getValueType(0);
if ((VT.is256BitVector() || VT.is512BitVector()) &&
!SDValue(Load, 0).hasOneUse()) {
@@ -3278,15 +3285,23 @@ bool X86TargetLowering::shouldReduceLoadWidth(SDNode *Load,
if (Use.getResNo() != 0)
continue;
- SDNode *User = Use.getUser();
+ const SDNode *User = PeekThroughOneUserBitcasts(Use.getUser());
- // If this use is not an extract + store, it's probably worth splitting.
- if (User->getOpcode() != ISD::EXTRACT_SUBVECTOR || !User->hasOneUse() ||
- User->user_begin()->getOpcode() != ISD::STORE)
- return true;
+ // If any use is an extract + store, it's probably not worth splitting.
+ if (User->getOpcode() == ISD::EXTRACT_SUBVECTOR &&
+ all_of(User->uses(), [&](const SDUse &U) {
+ const SDNode *Inner = PeekThroughOneUserBitcasts(U.getUser());
+ return Inner->getOpcode() == ISD::STORE;
+ }))
+ return false;
+
+ // If any use is a full width legal/target bin op, then assume it's legal
+ // and we shouldn't split.
+ if (isBinOp(User->getOpcode()) &&
+ (isOperationLegal(User->getOpcode(), VT) ||
+ User->getOpcode() > ISD::BUILTIN_OP_END))
+ return false;
}
- // All non-chain uses are extract + store.
- return false;
}
return true;
@@ -4001,8 +4016,14 @@ static SDValue getConstVector(ArrayRef<APInt> Bits, const APInt &Undefs,
const APInt &V = Bits[i];
assert(V.getBitWidth() == VT.getScalarSizeInBits() && "Unexpected sizes");
if (Split) {
- Ops.push_back(DAG.getConstant(V.trunc(32), dl, EltVT));
- Ops.push_back(DAG.getConstant(V.lshr(32).trunc(32), dl, EltVT));
+ Ops.push_back(DAG.getConstant(V.extractBits(32, 0), dl, EltVT));
+ Ops.push_back(DAG.getConstant(V.extractBits(32, 32), dl, EltVT));
+ } else if (EltVT == MVT::bf16) {
+ APFloat FV(APFloat::BFloat(), V);
+ Ops.push_back(DAG.getConstantFP(FV, dl, EltVT));
+ } else if (EltVT == MVT::f16) {
+ APFloat FV(APFloat::IEEEhalf(), V);
+ Ops.push_back(DAG.getConstantFP(FV, dl, EltVT));
} else if (EltVT == MVT::f32) {
APFloat FV(APFloat::IEEEsingle(), V);
Ops.push_back(DAG.getConstantFP(FV, dl, EltVT));
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index 4a2b35e9efe7c..0332325b145eb 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1337,8 +1337,8 @@ namespace llvm {
unsigned Depth) const override;
bool isTargetCanonicalConstantNode(SDValue Op) const override {
- // Peek through bitcasts/extracts/inserts to see if we have a broadcast
- // vector from memory.
+ // Peek through bitcasts/extracts/inserts to see if we have a vector
+ // load/broadcast from memory.
while (Op.getOpcode() == ISD::BITCAST ||
Op.getOpcode() == ISD::EXTRACT_SUBVECTOR ||
(Op.getOpcode() == ISD::INSERT_SUBVECTOR &&
@@ -1346,6 +1346,9 @@ namespace llvm {
Op = Op.getOperand(Op.getOpcode() == ISD::INSERT_SUBVECTOR ? 1 : 0);
return Op.getOpcode() == X86ISD::VBROADCAST_LOAD ||
+ Op.getOpcode() == X86ISD::SUBV_BROADCAST_LOAD ||
+ (Op.getOpcode() == ISD::LOAD &&
+ getTargetConstantFromLoad(cast<LoadSDNode>(Op))) ||
TargetLowering::isTargetCanonicalConstantNode(Op);
}
diff --git a/llvm/test/CodeGen/X86/oddsubvector.ll b/llvm/test/CodeGen/X86/oddsubvector.ll
index 2f557679a1558..a1da40e7e7655 100644
--- a/llvm/test/CodeGen/X86/oddsubvector.ll
+++ b/llvm/test/CodeGen/X86/oddsubvector.ll
@@ -261,9 +261,9 @@ define void @PR42833() {
;
; AVX2-LABEL: PR42833:
; AVX2: # %bb.0:
-; AVX2-NEXT: movl b(%rip), %eax
; AVX2-NEXT: vmovdqu c+128(%rip), %ymm0
-; AVX2-NEXT: addl c+128(%rip), %eax
+; AVX2-NEXT: vmovd %xmm0, %eax
+; AVX2-NEXT: addl b(%rip), %eax
; AVX2-NEXT: vmovd %eax, %xmm1
; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm2
; AVX2-NEXT: vpaddd %ymm0, %ymm0, %ymm3
@@ -284,10 +284,10 @@ define void @PR42833() {
;
; AVX512-LABEL: PR42833:
; AVX512: # %bb.0:
-; AVX512-NEXT: movl b(%rip), %eax
; AVX512-NEXT: vmovdqu c+128(%rip), %ymm0
; AVX512-NEXT: vmovdqu64 c+128(%rip), %zmm1
-; AVX512-NEXT: addl c+128(%rip), %eax
+; AVX512-NEXT: vmovd %xmm0, %eax
+; AVX512-NEXT: addl b(%rip), %eax
; AVX512-NEXT: vmovd %eax, %xmm2
; AVX512-NEXT: vpaddd %ymm2, %ymm0, %ymm2
; AVX512-NEXT: vpaddd %ymm0, %ymm0, %ymm0
diff --git a/llvm/test/CodeGen/X86/setcc-lowering.ll b/llvm/test/CodeGen/X86/setcc-lowering.ll
index cdf9e180345ed..5a72bf3e18bd8 100644
--- a/llvm/test/CodeGen/X86/setcc-lowering.ll
+++ b/llvm/test/CodeGen/X86/setcc-lowering.ll
@@ -11,7 +11,7 @@ define <8 x i16> @pr25080(<8 x i32> %a) nounwind {
; AVX1-LABEL: pr25080:
; AVX1: # %bb.0: # %entry
; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm0
-; AVX1-NEXT: vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX1-NEXT: vpand {{\.?LCPI[0-9]+_[0-9]+}}+16(%rip), %xmm0, %xmm0
; AVX1-NEXT: vpxor %xmm1, %xmm1, %xmm1
; AVX1-NEXT: vpcmpeqd %xmm1, %xmm0, %xmm0
; AVX1-NEXT: vpackssdw %xmm0, %xmm0, %xmm0
diff --git a/llvm/test/CodeGen/X86/vec_int_to_fp.ll b/llvm/test/CodeGen/X86/vec_int_to_fp.ll
index af841cf38b24a..c3b67bc5a8c8c 100644
--- a/llvm/test/CodeGen/X86/vec_int_to_fp.ll
+++ b/llvm/test/CodeGen/X86/vec_int_to_fp.ll
@@ -4228,7 +4228,7 @@ define <4 x float> @uitofp_load_4i64_to_4f32(ptr%a) {
; AVX1: # %bb.0:
; AVX1-NEXT: vmovdqa (%rdi), %ymm0
; AVX1-NEXT: vpsrlq $1, %xmm0, %xmm1
-; AVX1-NEXT: vmovdqa 16(%rdi), %xmm2
+; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm2
; AVX1-NEXT: vpsrlq $1, %xmm2, %xmm3
; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm1, %ymm1
; AVX1-NEXT: vandpd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm3
@@ -4273,7 +4273,8 @@ define <4 x float> @uitofp_load_4i64_to_4f32(ptr%a) {
; AVX2-NEXT: vcvtsi2ss %rax, %xmm4, %xmm1
; AVX2-NEXT: vinsertps {{.*#+}} xmm1 = xmm2[0,1,2],xmm1[0]
; AVX2-NEXT: vaddps %xmm1, %xmm1, %xmm2
-; AVX2-NEXT: vpackssdw 16(%rdi), %xmm0, %xmm0
+; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm3
+; AVX2-NEXT: vpackssdw %xmm3, %xmm0, %xmm0
; AVX2-NEXT: vblendvps %xmm0, %xmm2, %xmm1, %xmm0
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
@@ -4658,7 +4659,7 @@ define <8 x float> @uitofp_load_8i64_to_8f32(ptr%a) {
; AVX1-NEXT: vbroadcastsd {{.*#+}} ymm2 = [1,1,1,1]
; AVX1-NEXT: vandps %ymm2, %ymm1, %ymm3
; AVX1-NEXT: vpsrlq $1, %xmm1, %xmm4
-; AVX1-NEXT: vmovdqa 48(%rdi), %xmm5
+; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm5
; AVX1-NEXT: vpsrlq $1, %xmm5, %xmm6
; AVX1-NEXT: vinsertf128 $1, %xmm6, %ymm4, %ymm4
; AVX1-NEXT: vorps %ymm3, %ymm4, %ymm3
@@ -4680,7 +4681,7 @@ define <8 x float> @uitofp_load_8i64_to_8f32(ptr%a) {
; AVX1-NEXT: vblendvps %xmm1, %xmm4, %xmm3, %xmm1
; AVX1-NEXT: vandps %ymm2, %ymm0, %ymm2
; AVX1-NEXT: vpsrlq $1, %xmm0, %xmm3
-; AVX1-NEXT: vmovdqa 16(%rdi), %xmm4
+; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm4
; AVX1-NEXT: vpsrlq $1, %xmm4, %xmm5
; AVX1-NEXT: vinsertf128 $1, %xmm5, %ymm3, %ymm3
; AVX1-NEXT: vorps %ymm2, %ymm3, %ymm2
@@ -4725,7 +4726,8 @@ define <8 x float> @uitofp_load_8i64_to_8f32(ptr%a) {
; AVX2-NEXT: vcvtsi2ss %rax, %xmm6, %xmm3
; AVX2-NEXT: vinsertps {{.*#+}} xmm3 = xmm4[0,1,2],xmm3[0]
; AVX2-NEXT: vaddps %xmm3, %xmm3, %xmm4
-; AVX2-NEXT: vpackssdw 48(%rdi), %xmm1, %xmm1
+; AVX2-NEXT: vextracti128 $1, %ymm1, %xmm5
+; AVX2-NEXT: vpackssdw %xmm5, %xmm1, %xmm1
; AVX2-NEXT: vblendvps %xmm1, %xmm4, %xmm3, %xmm1
; AVX2-NEXT: vandps %ymm2, %ymm0, %ymm2
; AVX2-NEXT: vpsrlq $1, %ymm0, %ymm3
@@ -4744,7 +4746,8 @@ define <8 x float> @uitofp_load_8i64_to_8f32(ptr%a) {
; AVX2-NEXT: vcvtsi2ss %rax, %xmm6, %xmm2
; AVX2-NEXT: vinsertps {{.*#+}} xmm2 = xmm3[0,1,2],xmm2[0]
; AVX2-NEXT: vaddps %xmm2, %xmm2, %xmm3
-; AVX2-NEXT: vpackssdw 16(%rdi), %xmm0, %xmm0
+; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm4
+; AVX2-NEXT: vpackssdw %xmm4, %xmm0, %xmm0
; AVX2-NEXT: vblendvps %xmm0, %xmm3, %xmm2, %xmm0
; AVX2-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
; AVX2-NEXT: retq
@@ -4849,7 +4852,7 @@ define <8 x float> @uitofp_load_8i32_to_8f32(ptr%a) {
; AVX1: # %bb.0:
; AVX1-NEXT: vmovdqa (%rdi), %ymm0
; AVX1-NEXT: vpsrld $16, %xmm0, %xmm1
-; AVX1-NEXT: vmovdqa 16(%rdi), %xmm2
+; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm2
; AVX1-NEXT: vpsrld $16, %xmm2, %xmm2
; AVX1-NEXT: vinsertf128 $1, %xmm2, %ymm1, %ymm1
; AVX1-NEXT: vcvtdq2ps %ymm1, %ymm1
diff --git a/llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-3.ll b/llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-3.ll
index a39bc6b668669..73a1684c09c70 100644
--- a/llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-3.ll
+++ b/llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-3.ll
@@ -977,11 +977,11 @@ define void @load_i16_stride3_vf16(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, pt
;
; AVX512-LABEL: load_i16_stride3_vf16:
; AVX512: # %bb.0:
-; AVX512-NEXT: vmovdqa 32(%rdi), %ymm1
-; AVX512-NEXT: vmovdqa (%rdi), %ymm2
+; AVX512-NEXT: vmovdqa (%rdi), %ymm1
+; AVX512-NEXT: vmovdqa 32(%rdi), %ymm2
; AVX512-NEXT: vmovdqa {{.*#+}} ymm0 = [65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535]
; AVX512-NEXT: vmovdqa %ymm0, %ymm3
-; AVX512-NEXT: vpternlogq {{.*#+}} ymm3 = ymm1 ^ (ymm3 & (ymm2 ^ ymm1))
+; AVX512-NEXT: vpternlogq {{.*#+}} ymm3 = ymm2 ^ (ymm3 & (ymm1 ^ ymm2))
; AVX512-NEXT: vpermq {{.*#+}} ymm4 = ymm3[2,3,0,1]
; AVX512-NEXT: vpblendw {{.*#+}} ymm3 = ymm3[0],ymm4[1],ymm3[2,3],ymm4[4],ymm3[5,6],ymm4[7],ymm3[8],ymm4[9],ymm3[10,11],ymm4[12],ymm3[13,14],ymm4[15]
; AVX512-NEXT: vpshufb {{.*#+}} ymm3 = ymm3[0,1,6,7,12,13,2,3,4,5,14,15,8,9,10,11,16,17,22,23,28,29,18,19,20,21,30,31,24,25,26,27]
@@ -993,19 +993,19 @@ define void @load_i16_stride3_vf16(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, pt
; AVX512-NEXT: vpblendw {{.*#+}} ymm6 = ymm3[0,1,2],ymm6[3,4,5,6,7],ymm3[8,9,10],ymm6[11,12,13,14,15]
; AVX512-NEXT: vpshufhw {{.*#+}} xmm3 = xmm3[0,1,2,3,6,5,4,7]
; AVX512-NEXT: vpblendd {{.*#+}} ymm3 = ymm3[0,1,2,3],ymm6[4,5,6,7]
-; AVX512-NEXT: vmovdqa {{.*#+}} ymm6 = [65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535]
-; AVX512-NEXT: vpternlogq {{.*#+}} ymm6 = ymm2 ^ (ymm6 & (ymm1 ^ ymm2))
-; AVX512-NEXT: vpermq {{.*#+}} ymm7 = ymm6[2,3,0,1]
-; AVX512-NEXT: vpblendw {{.*#+}} ymm6 = ymm6[0,1],ymm7[2],ymm6[3,4],ymm7[5],ymm6[6,7,8,9],ymm7[10],ymm6[11,12],ymm7[13],ymm6[14,15]
-; AVX512-NEXT: vpshufb {{.*#+}} ymm6 = ymm6[2,3,8,9,14,15,4,5,12,13,10,11,0,1,6,7,18,19,24,25,30,31,20,21,28,29,26,27,16,17,22,23]
-; AVX512-NEXT: vpblendw {{.*#+}} xmm7 = xmm4[0,1],xmm5[2],xmm4[3,4],xmm5[5],xmm4[6,7]
-; AVX512-NEXT: vpshufb {{.*#+}} xmm7 = xmm7[u,u,u,u,u,u,4,5,10,11,0,1,6,7,12,13]
-; AVX512-NEXT: vinserti128 $1, %xmm7, %ymm0, %ymm7
-; AVX512-NEXT: vpblendw {{.*#+}} ymm7 = ymm6[0,1,2],ymm7[3,4,5,6,7],ymm6[8,9,10],ymm7[11,12,13,14,15]
-; AVX512-NEXT: vpshufhw {{.*#+}} xmm6 = xmm6[0,1,2,3,5,6,7,4]
-; AVX512-NEXT: vpblendd {{.*#+}} ymm6 = ymm6[0,1,2,3],ymm7[4,5,6,7]
-; AVX512-NEXT: vpternlogq {{.*#+}} ymm0 = ymm2 ^ (ymm0 & (ymm1 ^ ymm2))
-; AVX512-NEXT: vmovdqa 16(%rdi), %xmm1
+; AVX512-NEXT: vpblendw {{.*#+}} xmm6 = xmm4[0,1],xmm5[2],xmm4[3,4],xmm5[5],xmm4[6,7]
+; AVX512-NEXT: vpshufb {{.*#+}} xmm6 = xmm6[u,u,u,u,u,u,4,5,10,11,0,1,6,7,12,13]
+; AVX512-NEXT: vinserti128 $1, %xmm6, %ymm0, %ymm6
+; AVX512-NEXT: vmovdqa {{.*#+}} ymm7 = [65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535]
+; AVX512-NEXT: vpternlogq {{.*#+}} ymm7 = ymm1 ^ (ymm7 & (ymm2 ^ ymm1))
+; AVX512-NEXT: vpermq {{.*#+}} ymm8 = ymm7[2,3,0,1]
+; AVX512-NEXT: vpblendw {{.*#+}} ymm7 = ymm7[0,1],ymm8[2],ymm7[3,4],ymm8[5],ymm7[6,7,8,9],ymm8[10],ymm7[11,12],ymm8[13],ymm7[14,15]
+; AVX512-NEXT: vpshufb {{.*#+}} ymm7 = ymm7[2,3,8,9,14,15,4,5,12,13,10,11,0,1,6,7,18,19,24,25,30,31,20,21,28,29,26,27,16,17,22,23]
+; AVX512-NEXT: vpblendw {{.*#+}} ymm6 = ymm7[0,1,2],ymm6[3,4,5,6,7],ymm7[8,9,10],ymm6[11,12,13,14,15]
+; AVX512-NEXT: vpshufhw {{.*#+}} xmm7 = xmm7[0,1,2,3,5,6,7,4]
+; AVX512-NEXT: vpblendd {{.*#+}} ymm6 = ymm7[0,1,2,3],ymm6[4,5,6,7]
+; AVX512-NEXT: vpternlogq {{.*#+}} ymm0 = ymm1 ^ (ymm0 & (ymm2 ^ ymm1))
+; AVX512-NEXT: vextracti128 $1, %ymm1, %xmm1
; AVX512-NEXT: vpblendw {{.*#+}} ymm0 = ymm1[0],ymm0[1,2],ymm1[3],ymm0[4,5],ymm1[6],ymm0[7],ymm1[8],ymm0[9,10],ymm1[11],ymm0[12,13],ymm1[14],ymm0[15]
; AVX512-NEXT: vmovdqa {{.*#+}} ymm1 = [4,5,10,11,0,1,6,7,12,13,2,3,8,9,14,15,20,21,26,27,u,u,u,u,u,u,u,u,u,u,u,u]
; AVX512-NEXT: vpshufb %ymm1, %ymm0, %ymm0
@@ -1021,11 +1021,11 @@ define void @load_i16_stride3_vf16(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, pt
;
; AVX512-FCP-LABEL: load_i16_stride3_vf16:
; AVX512-FCP: # %bb.0:
-; AVX512-FCP-NEXT: vmovdqa 32(%rdi), %ymm1
-; AVX512-FCP-NEXT: vmovdqa (%rdi), %ymm2
+; AVX512-FCP-NEXT: vmovdqa (%rdi), %ymm1
+; AVX512-FCP-NEXT: vmovdqa 32(%rdi), %ymm2
; AVX512-FCP-NEXT: vmovdqa {{.*#+}} ymm0 = [65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535]
; AVX512-FCP-NEXT: vmovdqa %ymm0, %ymm3
-; AVX512-FCP-NEXT: vpternlogq {{.*#+}} ymm3 = ymm1 ^ (ymm3 & (ymm2 ^ ymm1))
+; AVX512-FCP-NEXT: vpternlogq {{.*#+}} ymm3 = ymm2 ^ (ymm3 & (ymm1 ^ ymm2))
; AVX512-FCP-NEXT: vpermq {{.*#+}} ymm4 = ymm3[2,3,0,1]
; AVX512-FCP-NEXT: vpblendw {{.*#+}} ymm3 = ymm3[0],ymm4[1],ymm3[2,3],ymm4[4],ymm3[5,6],ymm4[7],ymm3[8],ymm4[9],ymm3[10,11],ymm4[12],ymm3[13,14],ymm4[15]
; AVX512-FCP-NEXT: vpshufb {{.*#+}} ymm3 = ymm3[0,1,6,7,12,13,2,3,4,5,14,15,8,9,10,11,16,17,22,23,28,29,18,19,20,21,30,31,24,25,26,27]
@@ -1037,19 +1037,19 @@ define void @load_i16_stride3_vf16(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, pt
; AVX512-FCP-NEXT: vpblendw {{.*#+}} ymm6 = ymm3[0,1,2],ymm6[3,4,5,6,7],ymm3[8,9,10],ymm6[11,12,13,14,15]
; AVX512-FCP-NEXT: vpshufhw {{.*#+}} xmm3 = xmm3[0,1,2,3,6,5,4,7]
; AVX512-FCP-NEXT: vpblendd {{.*#+}} ymm3 = ymm3[0,1,2,3],ymm6[4,5,6,7]
-; AVX512-FCP-NEXT: vmovdqa {{.*#+}} ymm6 = [65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535]
-; AVX512-FCP-NEXT: vpternlogq {{.*#+}} ymm6 = ymm2 ^ (ymm6 & (ymm1 ^ ymm2))
-; AVX512-FCP-NEXT: vpermq {{.*#+}} ymm7 = ymm6[2,3,0,1]
-; AVX512-FCP-NEXT: vpblendw {{.*#+}} ymm6 = ymm6[0,1],ymm7[2],ymm6[3,4],ymm7[5],ymm6[6,7,8,9],ymm7[10],ymm6[11,12],ymm7[13],ymm6[14,15]
-; AVX512-FCP-NEXT: vpshufb {{.*#+}} ymm6 = ymm6[2,3,8,9,14,15,4,5,12,13,10,11,0,1,6,7,18,19,24,25,30,31,20,21,28,29,26,27,16,17,22,23]
-; AVX512-FCP-NEXT: vpblendw {{.*#+}} xmm7 = xmm4[0,1],xmm5[2],xmm4[3,4],xmm5[5],xmm4[6,7]
-; AVX512-FCP-NEXT: vpshufb {{.*#+}} xmm7 = xmm7[u,u,u,u,u,u,4,5,10,11,0,1,6,7,12,13]
-; AVX512-FCP-NEXT: vinserti128 $1, %xmm7, %ymm0, %ymm7
-; AVX512-FCP-NEXT: vpblendw {{.*#+}} ymm7 = ymm6[0,1,2],ymm7[3,4,5,6,7],ymm6[8,9,10],ymm7[11,12,13,14,15]
-; AVX512-FCP-NEXT: vpshufhw {{.*#+}} xmm6 = xmm6[0,1,2,3,5,6,7,4]
-; AVX512-FCP-NEXT: vpblendd {{.*#+}} ymm6 = ymm6[0,1,2,3],ymm7[4,5,6,7]
-; AVX512-FCP-NEXT: vpternlogq {{.*#+}} ymm0 = ymm2 ^ (ymm0 & (ymm1 ^ ymm2))
-; AVX512-FCP-NEXT: vmovdqa 16(%rdi), %xmm1
+; AVX512-FCP-NEXT: vpblendw {{.*#+}} xmm6 = xmm4[0,1],xmm5[2],xmm4[3,4],xmm5[5],xmm4[6,7]
+; AVX512-FCP-NEXT: vpshufb {{.*#+}} xmm6 = xmm6[u,u,u,u,u,u,4,5,10,11,0,1,6,7,12,13]
+; AVX512-FCP-NEXT: vinserti128 $1, %xmm6, %ymm0, %ymm6
+; AVX512-FCP-NEXT: vmovdqa {{.*#+}} ymm7 = [65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535]
+; AVX512-FCP-NEXT: vpternlogq {{.*#+}} ymm7 = ymm1 ^ (ymm7 & (ymm2 ^ ymm1))
+; AVX512-FCP-NEXT: vpermq {{.*#+}} ymm8 = ymm7[2,3,0,1]
+; AVX512-FCP-NEXT: vpblendw {{.*#+}} ymm7 = ymm7[0,1],ymm8[2],ymm7[3,4],ymm8[5],ymm7[6,7,8,9],ymm8[10],ymm7[11,12],ymm8[13],ymm7[14,15]
+; AVX512-FCP-NEXT: vpshufb {{.*#+}} ymm7 = ymm7[2,3,8,9,14,15,4,5,12,13,10,11,0,1,6,7,18,19,24,25,30,31,20,21,28,29,26,27,16,17,22,23]
+; AVX512-FCP-NEXT: vpblendw {{.*#+}} ymm6 = ymm7[0,1,2],ymm6[3,4,5,6,7],ymm7[8,9,10],ymm6[11,12,13,14,15]
+; AVX512-FCP-NEXT: vpshufhw {{.*#+}} xmm7 = xmm7[0,1,2,3,5,6,7,4]
+; AVX512-FCP-NEXT: vpblendd {{.*#+}} ymm6 = ymm7[0,1,2,3],ymm6[4,5,6,7]
+; AVX512-FCP-NEXT: vpternlogq {{.*#+}} ymm0 = ymm1 ^ (ymm0 & (ymm2 ^ ymm1))
+; AVX512-FCP-NEXT: vextracti128 $1, %ymm1, %xmm1
; AVX512-FCP-NEXT: vpblendw {{.*#+}} ymm0 = ymm1[0],ymm0[1,2],ymm1[3],ymm0[4,5],ymm1[6],ymm0[7],ymm1[8],ymm0[9,10],ymm1[11],ymm0[12,13],ymm1[14],ymm0[15]
; AVX512-FCP-NEXT: vmovdqa {{.*#+}} ymm1 = [4,5,10,11,0,1,6,7,12,13,2,3,8,9,14,15,20,21,26,27,u,u,u,u,u,u,u,u,u,u,u,u]
; AVX512-FCP-NEXT: vpshufb %ymm1, %ymm0, %ymm0
@@ -1065,11 +1065,11 @@ define void @load_i16_stride3_vf16(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, pt
;
; AVX512DQ-LABEL: load_i16_stride3_vf16:
; AVX512DQ: # %bb.0:
-; AVX512DQ-NEXT: vmovdqa 32(%rdi), %ymm1
-; AVX512DQ-NEXT: vmovdqa (%rdi), %ymm2
+; AVX512DQ-NEXT: vmovdqa (%rdi), %ymm1
+; AVX512DQ-NEXT: vmovdqa 32(%rdi), %ymm2
; AVX512DQ-NEXT: vmovdqa {{.*#+}} ymm0 = [65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535]
; AVX512DQ-NEXT: vmovdqa %ymm0, %ymm3
-; AVX512DQ-NEXT: vpternlogq {{.*#+}} ymm3 = ymm1 ^ (ymm3 & (ymm2 ^ ymm1))
+; AVX512DQ-NEXT: vpternlogq {{.*#+}} ymm3 = ymm2 ^ (ymm3 & (ymm1 ^ ymm2))
; AVX512DQ-NEXT: vpermq {{.*#+}} ymm4 = ymm3[2,3,0,1]
; AVX512DQ-NEXT: vpblendw {{.*#+}} ymm3 = ymm3[0],ymm4[1],ymm3[2,3],ymm4[4],ymm3[5,6],ymm4[7],ymm3[8],ymm4[9],ymm3[10,11],ymm4[12],ymm3[13,14],ymm4[15]
; AVX512DQ-NEXT: vpshufb {{.*#+}} ymm3 = ymm3[0,1,6,7,12,13,2,3,4,5,14,15,8,9,10,11,16,17,22,23,28,29,18,19,20,21,30,31,24,25,26,27]
@@ -1081,19 +1081,19 @@ define void @load_i16_stride3_vf16(ptr %in.vec, ptr %out.vec0, ptr %out.vec1, pt
; AVX512DQ-NEXT: vpblendw {{.*#+}} ymm6 = ymm3[0,1,2],ymm6[3,4,5,6,7],ymm3[8,9,10],ymm6[11,12,13,14,15]
; AVX512DQ-NEXT: vpshufhw {{.*#+}} xmm3 = xmm3[0,1,2,3,6,5,4,7]
; AVX512DQ-NEXT: vpblendd {{.*#+}} ymm3 = ymm3[0,1,2,3],ymm6[4,5,6,7]
-; AVX512DQ-NEXT: vmovdqa {{.*#+}} ymm6 = [65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535]
-; AVX512DQ-NEXT: vpternlogq {{.*#+}} ymm6 = ymm2 ^ (ymm6 & (ymm1 ^ ymm2))
-; AVX512DQ-NEXT: vpermq {{.*#+}} ymm7 = ymm6[2,3,0,1]
-; AVX512DQ-NEXT: vpblendw {{.*#+}} ymm6 = ymm6[0,1],ymm7[2],ymm6[3,4],ymm7[5],ymm6[6,7,8,9],ymm7[10],ymm6[11,12],ymm7[13],ymm6[14,15]
-; AVX512DQ-NEXT: vpshufb {{.*#+}} ymm6 = ymm6[2,3,8,9,14,15,4,5,12,13,10,11,0,1,6,7,18,19,24,25,30,31,20,21,28,29,26,27,16,17,22,23]
-; AVX512DQ-NEXT: v...
[truncated]
Force-pushed from 678ca30 to 8489c8e
…ment (#136723) Based off feedback for #129695 - we need to be able to determine the load offset of smaller loads when trying to determine whether a multiple use load should be split (in particular for AVX subvector extractions). This patch adds a std::optional<unsigned> ByteOffset argument to shouldReduceLoadWidth calls for where we know the constant offset to allow targets to make use of it in future patches.
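For reference, a minimal sketch of how the ByteOffset argument described in that commit might be threaded through and consumed. The surrounding types and names here are stand-ins for illustration only, not the real TargetLowering/DAGCombiner code:

```cpp
#include <optional>

// Stand-in for the hook after #136723: callers that know where the narrowed
// load would read from pass the constant byte offset; others pass
// std::nullopt and the target has to stay conservative.
struct TargetHookSketch {
  virtual ~TargetHookSketch() = default;
  virtual bool shouldReduceLoadWidth(unsigned WideBits, unsigned NarrowBits,
                                     std::optional<unsigned> ByteOffset) const {
    // Default behaviour: narrowing is allowed.
    (void)WideBits; (void)NarrowBits; (void)ByteOffset;
    return true;
  }
};

// Example caller: narrowing a wide vector load because of a subvector
// extraction at a constant index - the byte offset is index * element size,
// so an AVX target can tell a free low-half reuse (offset 0) from an
// upper-half extraction.
bool canNarrowForExtract(const TargetHookSketch &TLI, unsigned WideBits,
                         unsigned NarrowBits, unsigned ExtractEltIdx,
                         unsigned EltSizeInBytes) {
  std::optional<unsigned> ByteOffset = ExtractEltIdx * EltSizeInBytes;
  return TLI.shouldReduceLoadWidth(WideBits, NarrowBits, ByteOffset);
}
```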
Force-pushed from 8489c8e to 3b3e83c
✅ With the latest revision this PR passed the C/C++ code formatter.
Force-pushed from 3b3e83c to bbf31a1
…ract+store or a full width legal binop Currently shouldReduceLoadWidth is very relaxed about when loads can be split to avoid extractions from the original full width load - resulting in many cases where the number of memory operations notably increases, trading the cost of an extract_subvector for additional loads. This patch adjusts the 256/512-bit vector load splitting metric to not split if ANY use of the full width load can be used directly - either in an extract+store (previously ALL uses had to be extract+store to prevent splits) or is used by a legal binop (so unlikely to be split itself). This required a number of fixes - shouldReduceLoadWidth now needs to peek through bitcasts UP the use-chain to find final users (limited to hasOneUse cases to reduce complexity). It also exposed an issue in isTargetCanonicalConstantNode which assumed that a load of vector constant data would always extract, which is no longer the case.
Force-pushed from bbf31a1 to 878a957
…128/I128 into a regular load This is mainly to help remove subvector extractions from spilled YMM registers. We can extend this to all the AVX512 variants; I've tried to make the implementation ready for this (a quick test indicated it's mainly test coverage for AVX512 that we're missing). What I'm not sure about is how best we can then fold this new smaller load into another instruction (you can see some examples of this in vector-interleaved-load-i32-stride-8.ll). The comment still saying "32-byte Reload" is annoying, but we already have this for many other element/subvector load folds. Noticed while looking at next steps after llvm#129695
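As a rough illustration of the combine that commit describes (concatenating two consecutive 128-bit loads back into one wide load), here is a simplified DAG-combine sketch. It is not the patch's implementation and omits details such as chain updates, type checks on the half loads, and profitability heuristics:

```cpp
#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"

using namespace llvm;

// Sketch: fold (concat_vectors (load p), (load p+N/2)) -> (load p) when both
// halves are simple, normal, consecutive loads.
static SDValue combineConcatOfConsecutiveLoads(SDNode *N, SelectionDAG &DAG) {
  if (N->getOpcode() != ISD::CONCAT_VECTORS || N->getNumOperands() != 2)
    return SDValue();

  auto *Lo = dyn_cast<LoadSDNode>(peekThroughBitcasts(N->getOperand(0)));
  auto *Hi = dyn_cast<LoadSDNode>(peekThroughBitcasts(N->getOperand(1)));
  if (!Lo || !Hi || !ISD::isNormalLoad(Lo) || !ISD::isNormalLoad(Hi) ||
      !Lo->isSimple() || !Hi->isSimple())
    return SDValue();

  EVT VT = N->getValueType(0);
  unsigned HalfBytes = VT.getStoreSize().getFixedValue() / 2;

  // The high half must load exactly HalfBytes beyond the low half.
  if (!DAG.areNonVolatileConsecutiveLoads(Hi, Lo, HalfBytes, /*Dist=*/1))
    return SDValue();

  // Replace the concat with a single wide load; the narrow loads are left to
  // be cleaned up if they end up unused.
  SDLoc DL(N);
  return DAG.getLoad(VT, DL, Lo->getChain(), Lo->getBasePtr(),
                     Lo->getPointerInfo(), Lo->getAlign());
}
```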
ping?
; AVX512-NEXT: vpermq {{.*#+}} zmm2 = zmm2[2,2,2,3,6,6,6,7]
; AVX512-NEXT: vpternlogq {{.*#+}} zmm2 = zmm2 ^ (mem & (zmm2 ^ zmm8))
; AVX512-NEXT: vpshufb %ymm7, %ymm1, %ymm3
; AVX512-NEXT: vbroadcasti64x4 {{.*#+}} zmm3 = [65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535,65535,0,65535]
Seems we have one more vbroadcasti64x4 here?
It's been hoisted out of the folded vpternlogq load below - we've managed to simplify the constant and allowed a zmm full width load to be replaced with a broadcast.
LGTM.
…e full width legal binop (llvm#129695) Currently shouldReduceLoadWidth is very relaxed about when loads can be split to avoid extractions from the original full width load - resulting in many cases where the number of memory operations notably increases, trading the cost of an extract_subvector for additional loads. This patch adjusts the 256/512-bit vector load splitting metric to detect cases where ANY use of the full width load is used directly - in which case we will now reuse that load for smaller types, unless we'd need to extract an upper subvector / integer element - i.e. we now correctly treat (extract_subvector cst, 0) as free. We retain the existing logic of never splitting loads if all uses are extract+stores but we improve this by peeking through bitcasts while looking for extract_subvector/store chains. This required a number of fixes - shouldReduceLoadWidth now needs to peek through bitcasts UP the use-chain to find final users (limited to hasOneUse cases to reduce complexity). It also exposed an issue in isTargetCanonicalConstantNode which assumed that a load of vector constant data would always extract, which is no longer the case.
Currently shouldReduceLoadWidth is very relaxed about when loads can be split to avoid extractions from the original full width load - resulting in many cases where the number of memory operations notably increases, trading the cost of an extract_subvector for additional loads.
This patch adjusts the 256/512-bit vector load splitting metric to detect cases where ANY use of the full width load is used directly - in which case we will now reuse that load for smaller types, unless we'd need to extract an upper subvector / integer element - i.e. we now correctly treat (extract_subvector cst, 0) as free.
We retain the existing logic of never splitting loads if all uses are extract+stores but we improve this by peeking through bitcasts while looking for extract_subvector/store chains.
This required a number of fixes - shouldReduceLoadWidth now needs to peek through bitcasts UP the use-chain to find final users (limited to hasOneUse cases to reduce complexity). It also exposed an issue in isTargetCanonicalConstantNode which assumed that a load of vector constant data would always extract, which is no longer the case.
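To summarise the policy above in one place, here is a small self-contained C++ sketch of the decision logic as described in this PR and #136723. It is a restatement for readability only, not the code in X86ISelLowering.cpp, which operates on the actual SelectionDAG use lists:

```cpp
#include <optional>

// Returning true permits narrowing/splitting the wide 256/512-bit load;
// returning false keeps the full width load live and reuses it.
bool shouldReduceWideVectorLoad(bool AllUsesAreExtractPlusStore,
                                bool AnyUseConsumesFullWidth,
                                std::optional<unsigned> NarrowByteOffset) {
  // Retained rule: if every use is an extract_subvector feeding a store
  // (looking through one-use bitcasts), the extract folds into a narrow
  // store, so splitting buys nothing.
  if (AllUsesAreExtractPlusStore)
    return false;

  // New rule: if any use already consumes the full width load, keep it and
  // reuse it for the narrow access as long as that access reads the low part
  // (byte offset 0) - i.e. (extract_subvector x, 0) is free. An unknown or
  // non-zero offset would still need an upper subvector/element extraction,
  // so splitting stays allowed in that case.
  if (AnyUseConsumesFullWidth)
    return NarrowByteOffset.value_or(1) != 0;

  return true;
}
```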