[RISCV] Add DAG combine for forming VAADDU_VL from VP intrinsics. #124848
Conversation
This adds a VP version of an existing DAG combine. I've put it in RISCVISelLowering since we would need to add an ISD::VP_AVGCEIL opcode otherwise. This pattern appears in 525.x264_r.
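For context, the IR pattern this targets is rounding (ceiling) pixel averaging; a minimal C++ sketch of the kind of loop that produces it (the exact routine in 525.x264_r is an assumption, not taken from the patch):

#include <cstddef>
#include <cstdint>

// dst[i] = (a[i] + b[i] + 1) >> 1 on u8 data. Vectorized with EVL tail folding,
// the arithmetic is widened, yielding the vp.zext / vp.add / add-splat-1 /
// vp.lshr / vp.trunc chain this combine matches and turns into vaaddu (RNU).
void avg_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
  for (size_t i = 0; i != n; ++i)
    dst[i] = static_cast<uint8_t>((a[i] + b[i] + 1) >> 1);
}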
@llvm/pr-subscribers-backend-risc-v

Author: Craig Topper (topperc)

Changes

This adds a VP version of an existing DAG combine. I've put it in RISCVISelLowering since we would need to add an ISD::VP_AVGCEIL opcode otherwise. This pattern appears in 525.x264_r.

I extracted this code from our downstream compiler. Hopefully it works with the current EVL vectorizer in upstream.

@lukel97 can you double check that this pattern triggers on 525.x264_r with the EVL vectorizer.

Patch is 20.27 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/124848.diff

2 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 8d09e534b1858b..f004a00b19d205 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -1526,18 +1526,16 @@ RISCVTargetLowering::RISCVTargetLowering(const TargetMachine &TM,
setTargetDAGCombine({ISD::ZERO_EXTEND, ISD::FP_TO_SINT, ISD::FP_TO_UINT,
ISD::FP_TO_SINT_SAT, ISD::FP_TO_UINT_SAT});
if (Subtarget.hasVInstructions())
- setTargetDAGCombine({ISD::FCOPYSIGN, ISD::MGATHER,
- ISD::MSCATTER, ISD::VP_GATHER,
- ISD::VP_SCATTER, ISD::SRA,
- ISD::SRL, ISD::SHL,
- ISD::STORE, ISD::SPLAT_VECTOR,
- ISD::BUILD_VECTOR, ISD::CONCAT_VECTORS,
- ISD::VP_STORE, ISD::EXPERIMENTAL_VP_REVERSE,
- ISD::MUL, ISD::SDIV,
- ISD::UDIV, ISD::SREM,
- ISD::UREM, ISD::INSERT_VECTOR_ELT,
- ISD::ABS, ISD::CTPOP,
- ISD::VECTOR_SHUFFLE, ISD::VSELECT});
+ setTargetDAGCombine(
+ {ISD::FCOPYSIGN, ISD::MGATHER, ISD::MSCATTER,
+ ISD::VP_GATHER, ISD::VP_SCATTER, ISD::SRA,
+ ISD::SRL, ISD::SHL, ISD::STORE,
+ ISD::SPLAT_VECTOR, ISD::BUILD_VECTOR, ISD::CONCAT_VECTORS,
+ ISD::VP_STORE, ISD::VP_TRUNCATE, ISD::EXPERIMENTAL_VP_REVERSE,
+ ISD::MUL, ISD::SDIV, ISD::UDIV,
+ ISD::SREM, ISD::UREM, ISD::INSERT_VECTOR_ELT,
+ ISD::ABS, ISD::CTPOP, ISD::VECTOR_SHUFFLE,
+ ISD::VSELECT});
if (Subtarget.hasVendorXTHeadMemPair())
setTargetDAGCombine({ISD::LOAD, ISD::STORE});
@@ -16373,6 +16371,101 @@ static SDValue performVP_STORECombine(SDNode *N, SelectionDAG &DAG,
VPStore->isTruncatingStore(), VPStore->isCompressingStore());
}
+// Peephole avgceil pattern.
+// %1 = zext <N x i8> %a to <N x i32>
+// %2 = zext <N x i8> %b to <N x i32>
+// %3 = add nuw nsw <N x i32> %1, splat (i32 1)
+// %4 = add nuw nsw <N x i32> %3, %2
+// %5 = lshr <N x i32> %N, <i32 1 x N>
+// %6 = trunc <N x i32> %5 to <N x i8>
+static SDValue performVP_TRUNCATECombine(SDNode *N, SelectionDAG &DAG,
+ const RISCVSubtarget &Subtarget) {
+ EVT VT = N->getValueType(0);
+
+ // Ignore fixed vectors.
+ const TargetLowering &TLI = DAG.getTargetLoweringInfo();
+ if (!VT.isScalableVector() || !TLI.isTypeLegal(VT))
+ return SDValue();
+
+ SDValue In = N->getOperand(0);
+ SDValue Mask = N->getOperand(1);
+ SDValue VL = N->getOperand(2);
+
+ // Input should be a vp_srl with same mask and VL.
+ if (In.getOpcode() != ISD::VP_SRL || In.getOperand(2) != Mask ||
+ In.getOperand(3) != VL)
+ return SDValue();
+
+ // Shift amount should be 1.
+ if (!isOneOrOneSplat(In.getOperand(1)))
+ return SDValue();
+
+ // Shifted value should be a vp_add with same mask and VL.
+ SDValue LHS = In.getOperand(0);
+ if (LHS.getOpcode() != ISD::VP_ADD || LHS.getOperand(2) != Mask ||
+ LHS.getOperand(3) != VL)
+ return SDValue();
+
+ SDValue Operands[3];
+ Operands[0] = LHS.getOperand(0);
+ Operands[1] = LHS.getOperand(1);
+
+ // Matches another VP_ADD with same VL and Mask.
+ auto FindAdd = [&](SDValue V, SDValue &Op0, SDValue &Op1) {
+ if (V.getOpcode() != ISD::VP_ADD || V.getOperand(2) != Mask ||
+ V.getOperand(3) != VL)
+ return false;
+
+ Op0 = V.getOperand(0);
+ Op1 = V.getOperand(1);
+ return true;
+ };
+
+ // We need to find another VP_ADD in one of the operands.
+ SDValue Op0, Op1;
+ if (FindAdd(Operands[0], Op0, Op1))
+ Operands[0] = Operands[1];
+ else if (!FindAdd(Operands[1], Op0, Op1))
+ return SDValue();
+ Operands[2] = Op0;
+ Operands[1] = Op1;
+
+ // Now we have three operands of two additions. Check that one of them is a
+ // constant vector with ones.
+ auto I = llvm::find_if(Operands,
+ [](const SDValue &Op) { return isOneOrOneSplat(Op); });
+ if (I == std::end(Operands))
+ return SDValue();
+ // We found a vector with ones, move if it to the end of the Operands array.
+ std::swap(Operands[I - std::begin(Operands)], Operands[2]);
+
+ // Make sure the other 2 operands can be promoted from the result type.
+ for (int i = 0; i < 2; ++i) {
+ if (Operands[i].getOpcode() != ISD::VP_ZERO_EXTEND ||
+ Operands[i].getOperand(1) != Mask || Operands[i].getOperand(2) != VL)
+ return SDValue();
+ // Input must be smaller than our result.
+ if (Operands[i].getOperand(0).getScalarValueSizeInBits() >
+ VT.getScalarSizeInBits())
+ return SDValue();
+ }
+
+ // Pattern is detected.
+ Op0 = Operands[0].getOperand(0);
+ Op1 = Operands[1].getOperand(0);
+ // Rebuild the zero extends if the inputs are smaller than our result.
+ if (Op0.getValueType() != VT)
+ Op0 =
+ DAG.getNode(ISD::VP_ZERO_EXTEND, SDLoc(Operands[0]), VT, Op0, Mask, VL);
+ if (Op1.getValueType() != VT)
+ Op1 =
+ DAG.getNode(ISD::VP_ZERO_EXTEND, SDLoc(Operands[1]), VT, Op1, Mask, VL);
+ // Build a VAADDU with RNU rounding mode.
+ SDLoc DL(N);
+ return DAG.getNode(RISCVISD::AVGCEILU_VL, DL, VT,
+ {Op0, Op1, DAG.getUNDEF(VT), Mask, VL});
+}
+
// Convert from one FMA opcode to another based on whether we are negating the
// multiply result and/or the accumulator.
// NOTE: Only supports RVV operations with VL.
@@ -17930,6 +18023,8 @@ SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
if (SDValue V = combineTruncOfSraSext(N, DAG))
return V;
return combineTruncToVnclip(N, DAG, Subtarget);
+ case ISD::VP_TRUNCATE:
+ return performVP_TRUNCATECombine(N, DAG, Subtarget);
case ISD::TRUNCATE:
return performTRUNCATECombine(N, DAG, Subtarget);
case ISD::SELECT:
diff --git a/llvm/test/CodeGen/RISCV/rvv/vp-vaaddu.ll b/llvm/test/CodeGen/RISCV/rvv/vp-vaaddu.ll
new file mode 100644
index 00000000000000..42f8e38d5ac163
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/vp-vaaddu.ll
@@ -0,0 +1,169 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc < %s -mtriple=riscv64 -mattr=+v | FileCheck %s
+
+declare <vscale x 2 x i16> @llvm.vp.zext.nxv2i16.nxv2i8(<vscale x 2 x i8>, <vscale x 2 x i1>, i32)
+declare <vscale x 2 x i32> @llvm.vp.zext.nxv2i32.nxv2i8(<vscale x 2 x i8>, <vscale x 2 x i1>, i32)
+declare <vscale x 2 x i32> @llvm.vp.zext.nxv2i32.nxv2i16(<vscale x 2 x i16>, <vscale x 2 x i1>, i32)
+declare <vscale x 2 x i8> @llvm.vp.trunc.nxv2i8.nxv2i16(<vscale x 2 x i16>, <vscale x 2 x i1>, i32)
+declare <vscale x 2 x i16> @llvm.vp.trunc.nxv2i16.nxv2i32(<vscale x 2 x i32>, <vscale x 2 x i1>, i32)
+declare <vscale x 2 x i8> @llvm.vp.trunc.nxv2i8.nxv2i32(<vscale x 2 x i32>, <vscale x 2 x i1>, i32)
+declare <vscale x 2 x i16> @llvm.vp.add.nxv2i16(<vscale x 2 x i16>, <vscale x 2 x i16>, <vscale x 2 x i1>, i32)
+declare <vscale x 2 x i16> @llvm.vp.lshr.nxv2i16(<vscale x 2 x i16>, <vscale x 2 x i16>, <vscale x 2 x i1>, i32)
+declare <vscale x 2 x i32> @llvm.vp.add.nxv2i32(<vscale x 2 x i32>, <vscale x 2 x i32>, <vscale x 2 x i1>, i32)
+declare <vscale x 2 x i32> @llvm.vp.lshr.nxv2i32(<vscale x 2 x i32>, <vscale x 2 x i32>, <vscale x 2 x i1>, i32)
+
+define <vscale x 2 x i8> @vaaddu_1(<vscale x 2 x i8> %x, <vscale x 2 x i8> %y, <vscale x 2 x i1> %m, i32 zeroext %vl) {
+; CHECK-LABEL: vaaddu_1:
+; CHECK: # %bb.0:
+; CHECK-NEXT: csrwi vxrm, 0
+; CHECK-NEXT: vsetvli zero, a0, e8, mf4, ta, ma
+; CHECK-NEXT: vaaddu.vv v8, v8, v9, v0.t
+; CHECK-NEXT: ret
+ %xz = call <vscale x 2 x i16> @llvm.vp.zext.nxv2i16.nxv2i8(<vscale x 2 x i8> %x, <vscale x 2 x i1> %m, i32 %vl)
+ %yz = call <vscale x 2 x i16> @llvm.vp.zext.nxv2i16.nxv2i8(<vscale x 2 x i8> %y, <vscale x 2 x i1> %m, i32 %vl)
+ %a = call <vscale x 2 x i16> @llvm.vp.add.nxv2i16(<vscale x 2 x i16> %xz, <vscale x 2 x i16> %yz, <vscale x 2 x i1> %m, i32 %vl)
+ %b = call <vscale x 2 x i16> @llvm.vp.add.nxv2i16(<vscale x 2 x i16> %a, <vscale x 2 x i16> shufflevector (<vscale x 2 x i16> insertelement (<vscale x 2 x i16> poison, i16 1, i32 0), <vscale x 2 x i16> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i1> %m, i32 %vl)
+ %c = call <vscale x 2 x i16> @llvm.vp.lshr.nxv2i16(<vscale x 2 x i16> %b, <vscale x 2 x i16> shufflevector (<vscale x 2 x i16> insertelement (<vscale x 2 x i16> poison, i16 1, i32 0), <vscale x 2 x i16> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i1> %m, i32 %vl)
+ %d = call <vscale x 2 x i8> @llvm.vp.trunc.nxv2i8.nxv2i16(<vscale x 2 x i16> %c, <vscale x 2 x i1> %m, i32 %vl)
+ ret <vscale x 2 x i8> %d
+}
+
+define <vscale x 2 x i8> @vaaddu_2(<vscale x 2 x i8> %x, <vscale x 2 x i8> %y, <vscale x 2 x i1> %m, i32 zeroext %vl) {
+; CHECK-LABEL: vaaddu_2:
+; CHECK: # %bb.0:
+; CHECK-NEXT: csrwi vxrm, 0
+; CHECK-NEXT: vsetvli zero, a0, e8, mf4, ta, ma
+; CHECK-NEXT: vaaddu.vv v8, v8, v9, v0.t
+; CHECK-NEXT: ret
+ %xz = call <vscale x 2 x i16> @llvm.vp.zext.nxv2i16.nxv2i8(<vscale x 2 x i8> %x, <vscale x 2 x i1> %m, i32 %vl)
+ %yz = call <vscale x 2 x i16> @llvm.vp.zext.nxv2i16.nxv2i8(<vscale x 2 x i8> %y, <vscale x 2 x i1> %m, i32 %vl)
+ %a = call <vscale x 2 x i16> @llvm.vp.add.nxv2i16(<vscale x 2 x i16> %xz, <vscale x 2 x i16> %yz, <vscale x 2 x i1> %m, i32 %vl)
+ %b = call <vscale x 2 x i16> @llvm.vp.add.nxv2i16(<vscale x 2 x i16> shufflevector (<vscale x 2 x i16> insertelement (<vscale x 2 x i16> poison, i16 1, i32 0), <vscale x 2 x i16> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i16> %a, <vscale x 2 x i1> %m, i32 %vl)
+ %c = call <vscale x 2 x i16> @llvm.vp.lshr.nxv2i16(<vscale x 2 x i16> %b, <vscale x 2 x i16> shufflevector (<vscale x 2 x i16> insertelement (<vscale x 2 x i16> poison, i16 1, i32 0), <vscale x 2 x i16> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i1> %m, i32 %vl)
+ %d = call <vscale x 2 x i8> @llvm.vp.trunc.nxv2i8.nxv2i16(<vscale x 2 x i16> %c, <vscale x 2 x i1> %m, i32 %vl)
+ ret <vscale x 2 x i8> %d
+}
+
+define <vscale x 2 x i8> @vaaddu_3(<vscale x 2 x i8> %x, <vscale x 2 x i8> %y, <vscale x 2 x i1> %m, i32 zeroext %vl) {
+; CHECK-LABEL: vaaddu_3:
+; CHECK: # %bb.0:
+; CHECK-NEXT: csrwi vxrm, 0
+; CHECK-NEXT: vsetvli zero, a0, e8, mf4, ta, ma
+; CHECK-NEXT: vaaddu.vv v8, v9, v8, v0.t
+; CHECK-NEXT: ret
+ %xz = call <vscale x 2 x i16> @llvm.vp.zext.nxv2i16.nxv2i8(<vscale x 2 x i8> %x, <vscale x 2 x i1> %m, i32 %vl)
+ %yz = call <vscale x 2 x i16> @llvm.vp.zext.nxv2i16.nxv2i8(<vscale x 2 x i8> %y, <vscale x 2 x i1> %m, i32 %vl)
+ %a = call <vscale x 2 x i16> @llvm.vp.add.nxv2i16(<vscale x 2 x i16> %xz, <vscale x 2 x i16> shufflevector (<vscale x 2 x i16> insertelement (<vscale x 2 x i16> poison, i16 1, i32 0), <vscale x 2 x i16> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i1> %m, i32 %vl)
+ %b = call <vscale x 2 x i16> @llvm.vp.add.nxv2i16(<vscale x 2 x i16> %a, <vscale x 2 x i16> %yz, <vscale x 2 x i1> %m, i32 %vl)
+ %c = call <vscale x 2 x i16> @llvm.vp.lshr.nxv2i16(<vscale x 2 x i16> %b, <vscale x 2 x i16> shufflevector (<vscale x 2 x i16> insertelement (<vscale x 2 x i16> poison, i16 1, i32 0), <vscale x 2 x i16> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i1> %m, i32 %vl)
+ %d = call <vscale x 2 x i8> @llvm.vp.trunc.nxv2i8.nxv2i16(<vscale x 2 x i16> %c, <vscale x 2 x i1> %m, i32 %vl)
+ ret <vscale x 2 x i8> %d
+}
+
+define <vscale x 2 x i8> @vaaddu_4(<vscale x 2 x i8> %x, <vscale x 2 x i8> %y, <vscale x 2 x i1> %m, i32 zeroext %vl) {
+; CHECK-LABEL: vaaddu_4:
+; CHECK: # %bb.0:
+; CHECK-NEXT: csrwi vxrm, 0
+; CHECK-NEXT: vsetvli zero, a0, e8, mf4, ta, ma
+; CHECK-NEXT: vaaddu.vv v8, v9, v8, v0.t
+; CHECK-NEXT: ret
+ %xz = call <vscale x 2 x i16> @llvm.vp.zext.nxv2i16.nxv2i8(<vscale x 2 x i8> %x, <vscale x 2 x i1> %m, i32 %vl)
+ %yz = call <vscale x 2 x i16> @llvm.vp.zext.nxv2i16.nxv2i8(<vscale x 2 x i8> %y, <vscale x 2 x i1> %m, i32 %vl)
+ %a = call <vscale x 2 x i16> @llvm.vp.add.nxv2i16(<vscale x 2 x i16> %xz, <vscale x 2 x i16> shufflevector (<vscale x 2 x i16> insertelement (<vscale x 2 x i16> poison, i16 1, i32 0), <vscale x 2 x i16> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i1> %m, i32 %vl)
+ %b = call <vscale x 2 x i16> @llvm.vp.add.nxv2i16(<vscale x 2 x i16> %yz, <vscale x 2 x i16> %a, <vscale x 2 x i1> %m, i32 %vl)
+ %c = call <vscale x 2 x i16> @llvm.vp.lshr.nxv2i16(<vscale x 2 x i16> %b, <vscale x 2 x i16> shufflevector (<vscale x 2 x i16> insertelement (<vscale x 2 x i16> poison, i16 1, i32 0), <vscale x 2 x i16> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i1> %m, i32 %vl)
+ %d = call <vscale x 2 x i8> @llvm.vp.trunc.nxv2i8.nxv2i16(<vscale x 2 x i16> %c, <vscale x 2 x i1> %m, i32 %vl)
+ ret <vscale x 2 x i8> %d
+}
+
+define <vscale x 2 x i8> @vaaddu_5(<vscale x 2 x i8> %x, <vscale x 2 x i8> %y, <vscale x 2 x i1> %m, i32 zeroext %vl) {
+; CHECK-LABEL: vaaddu_5:
+; CHECK: # %bb.0:
+; CHECK-NEXT: csrwi vxrm, 0
+; CHECK-NEXT: vsetvli zero, a0, e8, mf4, ta, ma
+; CHECK-NEXT: vaaddu.vv v8, v9, v8, v0.t
+; CHECK-NEXT: ret
+ %xz = call <vscale x 2 x i16> @llvm.vp.zext.nxv2i16.nxv2i8(<vscale x 2 x i8> %x, <vscale x 2 x i1> %m, i32 %vl)
+ %yz = call <vscale x 2 x i16> @llvm.vp.zext.nxv2i16.nxv2i8(<vscale x 2 x i8> %y, <vscale x 2 x i1> %m, i32 %vl)
+ %a = call <vscale x 2 x i16> @llvm.vp.add.nxv2i16(<vscale x 2 x i16> shufflevector (<vscale x 2 x i16> insertelement (<vscale x 2 x i16> poison, i16 1, i32 0), <vscale x 2 x i16> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i16> %xz, <vscale x 2 x i1> %m, i32 %vl)
+ %b = call <vscale x 2 x i16> @llvm.vp.add.nxv2i16(<vscale x 2 x i16> %a, <vscale x 2 x i16> %yz, <vscale x 2 x i1> %m, i32 %vl)
+ %c = call <vscale x 2 x i16> @llvm.vp.lshr.nxv2i16(<vscale x 2 x i16> %b, <vscale x 2 x i16> shufflevector (<vscale x 2 x i16> insertelement (<vscale x 2 x i16> poison, i16 1, i32 0), <vscale x 2 x i16> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i1> %m, i32 %vl)
+ %d = call <vscale x 2 x i8> @llvm.vp.trunc.nxv2i8.nxv2i16(<vscale x 2 x i16> %c, <vscale x 2 x i1> %m, i32 %vl)
+ ret <vscale x 2 x i8> %d
+}
+
+define <vscale x 2 x i8> @vaaddu_6(<vscale x 2 x i8> %x, <vscale x 2 x i8> %y, <vscale x 2 x i1> %m, i32 zeroext %vl) {
+; CHECK-LABEL: vaaddu_6:
+; CHECK: # %bb.0:
+; CHECK-NEXT: csrwi vxrm, 0
+; CHECK-NEXT: vsetvli zero, a0, e8, mf4, ta, ma
+; CHECK-NEXT: vaaddu.vv v8, v9, v8, v0.t
+; CHECK-NEXT: ret
+ %xz = call <vscale x 2 x i16> @llvm.vp.zext.nxv2i16.nxv2i8(<vscale x 2 x i8> %x, <vscale x 2 x i1> %m, i32 %vl)
+ %yz = call <vscale x 2 x i16> @llvm.vp.zext.nxv2i16.nxv2i8(<vscale x 2 x i8> %y, <vscale x 2 x i1> %m, i32 %vl)
+ %a = call <vscale x 2 x i16> @llvm.vp.add.nxv2i16(<vscale x 2 x i16> shufflevector (<vscale x 2 x i16> insertelement (<vscale x 2 x i16> poison, i16 1, i32 0), <vscale x 2 x i16> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i16> %xz, <vscale x 2 x i1> %m, i32 %vl)
+ %b = call <vscale x 2 x i16> @llvm.vp.add.nxv2i16(<vscale x 2 x i16> %yz, <vscale x 2 x i16> %a, <vscale x 2 x i1> %m, i32 %vl)
+ %c = call <vscale x 2 x i16> @llvm.vp.lshr.nxv2i16(<vscale x 2 x i16> %b, <vscale x 2 x i16> shufflevector (<vscale x 2 x i16> insertelement (<vscale x 2 x i16> poison, i16 1, i32 0), <vscale x 2 x i16> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i1> %m, i32 %vl)
+ %d = call <vscale x 2 x i8> @llvm.vp.trunc.nxv2i8.nxv2i16(<vscale x 2 x i16> %c, <vscale x 2 x i1> %m, i32 %vl)
+ ret <vscale x 2 x i8> %d
+}
+
+; Test where the size is reduced by 4x instead of 2x.
+define <vscale x 2 x i8> @vaaddu_7(<vscale x 2 x i8> %x, <vscale x 2 x i8> %y, <vscale x 2 x i1> %m, i32 zeroext %vl) {
+; CHECK-LABEL: vaaddu_7:
+; CHECK: # %bb.0:
+; CHECK-NEXT: csrwi vxrm, 0
+; CHECK-NEXT: vsetvli zero, a0, e8, mf4, ta, ma
+; CHECK-NEXT: vaaddu.vv v8, v8, v9, v0.t
+; CHECK-NEXT: ret
+ %xz = call <vscale x 2 x i32> @llvm.vp.zext.nxv2i32.nxv2i8(<vscale x 2 x i8> %x, <vscale x 2 x i1> %m, i32 %vl)
+ %yz = call <vscale x 2 x i32> @llvm.vp.zext.nxv2i32.nxv2i8(<vscale x 2 x i8> %y, <vscale x 2 x i1> %m, i32 %vl)
+ %a = call <vscale x 2 x i32> @llvm.vp.add.nxv2i32(<vscale x 2 x i32> %xz, <vscale x 2 x i32> %yz, <vscale x 2 x i1> %m, i32 %vl)
+ %b = call <vscale x 2 x i32> @llvm.vp.add.nxv2i32(<vscale x 2 x i32> %a, <vscale x 2 x i32> shufflevector (<vscale x 2 x i32> insertelement (<vscale x 2 x i32> poison, i32 1, i32 0), <vscale x 2 x i32> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i1> %m, i32 %vl)
+ %c = call <vscale x 2 x i32> @llvm.vp.lshr.nxv2i32(<vscale x 2 x i32> %b, <vscale x 2 x i32> shufflevector (<vscale x 2 x i32> insertelement (<vscale x 2 x i32> poison, i32 1, i32 0), <vscale x 2 x i32> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i1> %m, i32 %vl)
+ %d = call <vscale x 2 x i8> @llvm.vp.trunc.nxv2i8.nxv2i32(<vscale x 2 x i32> %c, <vscale x 2 x i1> %m, i32 %vl)
+ ret <vscale x 2 x i8> %d
+}
+
+; Test where the zext can't be completely removed.
+define <vscale x 2 x i16> @vaaddu_8(<vscale x 2 x i8> %x, <vscale x 2 x i8> %y, <vscale x 2 x i1> %m, i32 zeroext %vl) {
+; CHECK-LABEL: vaaddu_8:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli zero, a0, e16, mf2, ta, ma
+; CHECK-NEXT: vzext.vf2 v10, v8, v0.t
+; CHECK-NEXT: csrwi vxrm, 0
+; CHECK-NEXT: vzext.vf2 v8, v9, v0.t
+; CHECK-NEXT: vaaddu.vv v8, v10, v8, v0.t
+; CHECK-NEXT: ret
+ %xz = call <vscale x 2 x i32> @llvm.vp.zext.nxv2i32.nxv2i8(<vscale x 2 x i8> %x, <vscale x 2 x i1> %m, i32 %vl)
+ %yz = call <vscale x 2 x i32> @llvm.vp.zext.nxv2i32.nxv2i8(<vscale x 2 x i8> %y, <vscale x 2 x i1> %m, i32 %vl)
+ %a = call <vscale x 2 x i32> @llvm.vp.add.nxv2i32(<vscale x 2 x i32> %xz, <vscale x 2 x i32> %yz, <vscale x 2 x i1> %m, i32 %vl)
+ %b = call <vscale x 2 x i32> @llvm.vp.add.nxv2i32(<vscale x 2 x i32> %a, <vscale x 2 x i32> shufflevector (<vscale x 2 x i32> insertelement (<vscale x 2 x i32> poison, i32 1, i32 0), <vscale x 2 x i32> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i1> %m, i32 %vl)
+ %c = call <vscale x 2 x i32> @llvm.vp.lshr.nxv2i32(<vscale x 2 x i32> %b, <vscale x 2 x i32> shufflevector (<vscale x 2 x i32> insertelement (<vscale x 2 x i32> poison, i32 1, i32 0), <vscale x 2 x i32> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i1> %m, i32 %vl)
+ %d = call <vscale x 2 x i16> @llvm.vp.trunc.nxv2i16.nxv2i32(<vscale x 2 x i32> %c, <vscale x 2 x i1> %m, i32 %vl)
+ ret <vscale x 2 x i16> %d
+}
+
+; Negative test. The truncate has a smaller type than the zero extend.
+; TODO: Could still handle this by truncating after an i16 vaaddu.
+define <vscale x 2 x i8> @vaaddu_9(<vscale x 2 x i16> %x, <vscale x 2 x i16> %y, <vscale x 2 x i1> %m, i32 zeroext %vl) {
+; CHECK-LABEL: vaaddu_9:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli zero, a0, e16, mf2, ta, ma
+; CHECK-NEXT: vwaddu.vv v10, v8, v9, v0.t
+; CHECK-NEXT: vsetvli zero, zero, e32, m1, ta, ma
+; CHECK-NEXT: vadd.vi v8, v10, 1, v0.t
+; CHECK-NEXT: vsrl.vi v8, v8, 1, v0.t
+; CHECK-NEXT: vsetvli zero, zero, e16, mf2, ta, ma
+; CHECK-NEXT: vnsrl.wi v8, v8, 0, v0.t
+; CHECK-NEXT: vsetvli zero, zero, e8, mf4, ta, ma
+; CHECK-NEXT: vnsrl.wi v8, v8, 0, v0.t
+; CHECK-NEXT: ret
+ %xz = call <vscale x 2 x i32> @llvm.vp.zext.nxv2i32.nxv2i16(<vscale x 2 x i16> %x, <vscale x 2 x i1> %m, i32 %vl)
+ %yz = call <vscale x 2 x i32> @llvm.vp.zext.nxv2i32.nxv2i16(<vscale x 2 x i16> %y, <vscale x 2 x i1> %m, i32 %vl)
+ %a = call <vscale x 2 x i32> @llvm.vp.add.nxv2i32(<vscale x 2 x i32> %xz, <vscale x 2 x i32> %yz, <vscale x 2 x i1> %m, i32 %vl)
+ %b = call <vscale x 2 x i32> @llvm.vp.add.nxv2i32(<vscale x 2 x i32> %a, <vscale x 2 x i32> shufflevector (<vscale x 2 x i32> insertelement (<vscale x 2 x i32> poison, i32 1, i32 0), <vscale x 2 x i32> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i1> %m, i32 %vl)
+ %c = call <vscale x 2 x i32> @llvm.vp.lshr.nxv2i32(<vscale x 2 x i32> %b, <vscale x 2 x i32> shufflevector (<vscale x 2 x i32> insertelement (<vscale...
[truncated]
Just checked, this patch triggers on 525.x264_r with EVL tail folding.
// %2 = zext <N x i8> %b to <N x i32>
// %3 = add nuw nsw <N x i32> %1, splat (i32 1)
// %4 = add nuw nsw <N x i32> %3, %2
// %5 = lshr <N x i32> %N, <i32 1 x N>
I think %N should be %4 here? And the second operand looks invalid?
Yes
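For reference, the corrected comment would presumably read:

// %1 = zext <N x i8> %a to <N x i32>
// %2 = zext <N x i8> %b to <N x i32>
// %3 = add nuw nsw <N x i32> %1, splat (i32 1)
// %4 = add nuw nsw <N x i32> %3, %2
// %5 = lshr <N x i32> %4, splat (i32 1)
// %6 = trunc <N x i32> %5 to <N x i8>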
// We need to find another VP_ADD in one of the operands.
SDValue Op0, Op1;
if (FindAdd(Operands[0], Op0, Op1))
The multiple uses of Operands[0] seem a tad confusing to me. The logic appears correct, but might it be simpler to pass LHS.getOperand(X) to the helper and update Operands[1] and Operands[2] by reference?
This code was lifted from some X86 code that was removed here ea2ee5d
I'll look into improving it
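A rough sketch of that restructuring, reusing the names from the patch (one possible reading of the suggestion, not the committed fix): the helper fills the inner add's operands by reference, and the remaining outer-add operand lands in Operands[0].

SDValue Operands[3];

// Matches another VP_ADD with the same mask and VL and extracts its operands.
auto FindAdd = [&](SDValue V, SDValue &A, SDValue &B) {
  if (V.getOpcode() != ISD::VP_ADD || V.getOperand(2) != Mask ||
      V.getOperand(3) != VL)
    return false;
  A = V.getOperand(0);
  B = V.getOperand(1);
  return true;
};

if (FindAdd(LHS.getOperand(0), Operands[1], Operands[2]))
  Operands[0] = LHS.getOperand(1);
else if (FindAdd(LHS.getOperand(1), Operands[1], Operands[2]))
  Operands[0] = LHS.getOperand(0);
else
  return SDValue();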
    Operands[i].getOperand(1) != Mask || Operands[i].getOperand(2) != VL)
  return SDValue();
// Input must be smaller than our result.
if (Operands[i].getOperand(0).getScalarValueSizeInBits() >
A zero extend's input should always be less than or equal in width, right? If so, is this check needed? And don't you need to check for the degenerate equal-sized type case? Or do we simplify those at construction?
This is checking the destination type of the VP_TRUNCATE against the input type of VP_ZERO_EXTEND.
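Concretely, from the tests in this patch:

// vaaddu_8: vp.zext nxv2i8 -> nxv2i32, vp.trunc nxv2i32 -> nxv2i16.
//   The zext input (i8) is narrower than the truncate result (i16), so the
//   combine fires but a vp.zext to i16 must be rebuilt ("zext can't be
//   completely removed").
// vaaddu_9: vp.zext nxv2i16 -> nxv2i32, vp.trunc nxv2i32 -> nxv2i8.
//   The zext input (i16) is wider than the truncate result (i8), so the check
//   fails and the combine bails (the negative test).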
std::swap(Operands[I - std::begin(Operands)], Operands[2]);

// Make sure the other 2 operands can be promoted from the result type.
for (int i = 0; i < 2; ++i) {
Might be easier to read as for (auto Op : Operands.take_front(2)). -- Actually, ignore this, we had take_front on ArrayRef, but not SmallVector. Might be worth adding, but definitely non-blocking for this review.
Note there is no SmallVector in this code. It's a plain C array. We can use drop_end on it to skip the last element.
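A sketch of what that might look like with llvm::drop_end from ADT/STLExtras.h, which over the plain C array visits every element but the last (this mirrors the loop in the patch; the committed form may differ):

for (SDValue &Op : llvm::drop_end(Operands)) {
  if (Op.getOpcode() != ISD::VP_ZERO_EXTEND || Op.getOperand(1) != Mask ||
      Op.getOperand(2) != VL)
    return SDValue();
  // Input must be no wider than the truncate's result element type.
  if (Op.getOperand(0).getScalarValueSizeInBits() > VT.getScalarSizeInBits())
    return SDValue();
}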
}

// Pattern is detected.
Op0 = Operands[0].getOperand(0);
The reuse of Op0, Op1 here is a bit odd. Maybe reuse Operands instead, and drop the splat of 1 above rather than moving it to the end? That would simplify the loop just above too. Though I guess that complicates the debug loc management just below.
Operands is a plain C array so we can't really "drop" the splat of 1.
if (Op0.getValueType() != VT)
  Op0 =
      DAG.getNode(ISD::VP_ZERO_EXTEND, SDLoc(Operands[0]), VT, Op0, Mask, VL);
if (Op1.getValueType() != VT)
Surely, we short circuit these on construction and can just unconditionally call the getNode API?
      DAG.getNode(ISD::VP_ZERO_EXTEND, SDLoc(Operands[1]), VT, Op1, Mask, VL);
// Build a VAADDU with RNU rounding mode.
SDLoc DL(N);
return DAG.getNode(RISCVISD::AVGCEILU_VL, DL, VT,
AVGCEILU_VL appears to be previously unused code. Do we have any other such nodes which need to be deleted?
It's not unused; it's used via a macro in getRISCVVLOp.
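From memory, the dispatch in getRISCVVLOp is macro-generated, roughly like this (shape approximated from recollection, not quoted from the tree):

// RISCVISelLowering.cpp: map a generic opcode to its VL-predicated RISCVISD node.
#define OP_CASE(NODE)                                                          \
  case ISD::NODE:                                                              \
    return RISCVISD::NODE##_VL;
// ...
//   OP_CASE(AVGCEILU)  // expands to: case ISD::AVGCEILU: return RISCVISD::AVGCEILU_VL;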
ISD::UREM, ISD::INSERT_VECTOR_ELT,
ISD::ABS, ISD::CTPOP,
ISD::VECTOR_SHUFFLE, ISD::VSELECT});
setTargetDAGCombine(
Stray format change.
I added VP_TRUNCATE to it. I added it to the middle because that's where it was in my downstream.
Not opposed to doing this in RISCV DAG, but an alternative in case you haven't considered it would be to emit an AVGCEIL node for the full type width (no predication), and then a VP.VSELECT just after it. I believe the VLOptimizer should fold this back into the predicated form without issue. This works for all instructions which are non-faulting (i.e. we can speculatively execute the additional lanes).
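As a hedged sketch only (the ISD::VP_SELECT operand order of mask, true value, false value, EVL is assumed, and whether the VL optimizer folds this back is the claim above, not something verified here), that alternative might look like:

// Compute the average unpredicated for the full type, then keep only the active lanes.
SDValue Avg = DAG.getNode(ISD::AVGCEILU, DL, VT, Op0, Op1);
return DAG.getNode(ISD::VP_SELECT, DL, VT, Mask, Avg, DAG.getUNDEF(VT), VL);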
LGTM w/minor comment.
if (Op.getOpcode() != ISD::VP_ZERO_EXTEND || Op.getOperand(1) != Mask ||
    Op.getOperand(2) != VL)
  return SDValue();
// Input must be smaller than our result.
The comment says "smaller", but the code checks less than or equal. I think the comment just needs to be updated.
LGTM
Could we also do this after legalization and try and match it on TRUNCATE_VECTOR_VL? That way we would also match fixed length vectors.
auto I = llvm::find_if(Operands,
                       [](const SDValue &Op) { return isOneOrOneSplat(Op); });
Can you pass the predicate directly?
Suggested change:
-  auto I = llvm::find_if(Operands,
-                         [](const SDValue &Op) { return isOneOrOneSplat(Op); });
+  auto I = llvm::find_if(Operands, isOneOrOneSplat);
This doesn't compile. Possibly because isOneOrOneSplat has a second defaulted argument.
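That is consistent with how default arguments work in C++: they are not part of the function type, so the bare name decays to a two-parameter function pointer that find_if cannot call with a single argument. Illustrative only (the exact declaration is assumed):

bool isOneOrOneSplat(SDValue V, bool AllowUndefs = false); // assumed signature
// llvm::find_if(Operands, isOneOrOneSplat);               // ill-formed: the predicate would receive one argument
auto I = llvm::find_if(Operands,
                       [](SDValue Op) { return isOneOrOneSplat(Op); }); // OK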
Don't we have a generic DAGCombine for fixed vectors using non-VP operations?
Yeah, but this would be for fixed length VP operations. Just putting the idea out there.
I think moving it after legalization is a little more complex. We'll have to handle VP_TRUNCATE i32->i8 being split into 2 TRUNCATE_VECTOR_VLs and we'll have to look for combinations of VWADDU_VL and VWADDU_W_VL.
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/30/builds/14941
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/73/builds/12743
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/190/builds/13807