[AArch64] Improve urem by constant costs #122236
@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-backend-aarch64
Author: David Green (davemgreen)
Changes
A urem by a constant, much like a udiv by a constant, can be expanded into a series of mul/add/shift instructions. The exact sequence of instructions depends on the constants and the types.
If the constant is a power-2 then a shift / and will be used, so the cost will be 1. This canonicalization happens relatively early so this likely has very little effect in practice (it does help the cost of funnel shifts).
For a non-power 2 the code for div will expand to a series of UMULH + Add + Shift + Add, depending on the constant. urem is generally udiv + mul + sub, so involves a few extra instructions. The UMULH is not always available, i32 will use umull+shift, and vector types will use umull+shift or umull+umull2+uzp depending on the vector size. v2i64 will be scalarized because there is no mul available. SVE does have a UMULH instruction.
The end result is that the costs should be closer to reality, with scalable types a little lower cost than the fixed-width versions. (In the future we might be able to use umulh for fixed-width when the SVE instruction is available, but for the moment this should favour scalable vectorization a little).
I've tried to make this patch only apply to constant UREM/UDIV instructions. SDIV and SREM are left until a later patch to prevent this becoming too complex. The funnel shift costs are changing as it believes it will need a urem to clamp the shift amount, which should be a power-2 value for most common types.
Patch is 182.65 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/122236.diff
10 Files Affected:
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index b32dffa9f0fe86..13a56709ed10f5 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -893,7 +893,8 @@ TargetTransformInfo::getOperandInfo(const Value *V) {
// Check for a splat of a constant or for a non uniform vector of constants
// and check if the constant(s) are all powers of two.
- if (isa<ConstantVector>(V) || isa<ConstantDataVector>(V)) {
+ if (isa<ConstantVector>(V) || isa<ConstantDataVector>(V) ||
+ isa<ConstantExpr>(V)) {
OpInfo = OK_NonUniformConstantValue;
if (Splat) {
OpInfo = OK_UniformConstantValue;
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index 25b6731cb313a1..2b51d2de060b6c 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -3510,21 +3510,61 @@ InstructionCost AArch64TTIImpl::getArithmeticInstrCost(
return Cost;
}
[[fallthrough]];
- case ISD::UDIV: {
+ case ISD::UDIV:
+ case ISD::UREM: {
auto VT = TLI->getValueType(DL, Ty);
- if (Op2Info.isConstant() && Op2Info.isUniform()) {
- if (TLI->isOperationLegalOrCustom(ISD::MULHU, VT)) {
+ if (Op2Info.isConstant()) {
+ // If the operand is a power of 2 we can use the shift or and cost.
+ if (ISD == ISD::UDIV && Op2Info.isPowerOf2())
+ return getArithmeticInstrCost(Instruction::LShr, Ty, CostKind,
+ Op1Info.getNoProps(),
+ Op2Info.getNoProps());
+ if (ISD == ISD::UREM && Op2Info.isPowerOf2())
+ return getArithmeticInstrCost(Instruction::And, Ty, CostKind,
+ Op1Info.getNoProps(),
+ Op2Info.getNoProps());
+
+ if (ISD == ISD::UDIV || ISD == ISD::UREM) {
+ // Divides by a constant are expanded to MULHU + SUB + SRL + ADD + SRL.
+ // The MULHU will be expanded to UMULL for the types not listed below,
+ // and will become a pair of UMULL+UMULL2 for 128bit vectors.
+ bool HasMULH = VT == MVT::i64 || LT.second == MVT::nxv2i64 ||
+ LT.second == MVT::nxv4i32 || LT.second == MVT::nxv8i16 ||
+ LT.second == MVT::nxv16i8;
+ bool Is128bit = LT.second.is128BitVector();
+
+ InstructionCost MulCost =
+ getArithmeticInstrCost(Instruction::Mul, Ty, CostKind,
+ Op1Info.getNoProps(), Op2Info.getNoProps());
+ InstructionCost AddCost =
+ getArithmeticInstrCost(Instruction::Add, Ty, CostKind,
+ Op1Info.getNoProps(), Op2Info.getNoProps());
+ InstructionCost ShrCost =
+ getArithmeticInstrCost(Instruction::AShr, Ty, CostKind,
+ Op1Info.getNoProps(), Op2Info.getNoProps());
+ InstructionCost DivCost = MulCost * (Is128bit ? 2 : 1) + // UMULL/UMULH
+ (HasMULH ? 0 : ShrCost) + // UMULL shift
+ AddCost * 2 + ShrCost;
+ return DivCost + (ISD == ISD::UREM ? MulCost + AddCost : 0);
+ }
+
+ // TODO: Fix SDIV and SREM costs, similar to the above.
+ if (TLI->isOperationLegalOrCustom(ISD::MULHU, VT) &&
+ Op2Info.isUniform()) {
// Vector signed division by constant are expanded to the
- // sequence MULHS + ADD/SUB + SRA + SRL + ADD, and unsigned division
- // to MULHS + SUB + SRL + ADD + SRL.
- InstructionCost MulCost = getArithmeticInstrCost(
- Instruction::Mul, Ty, CostKind, Op1Info.getNoProps(), Op2Info.getNoProps());
- InstructionCost AddCost = getArithmeticInstrCost(
- Instruction::Add, Ty, CostKind, Op1Info.getNoProps(), Op2Info.getNoProps());
- InstructionCost ShrCost = getArithmeticInstrCost(
- Instruction::AShr, Ty, CostKind, Op1Info.getNoProps(), Op2Info.getNoProps());
+ // sequence MULHS + ADD/SUB + SRA + SRL + ADD.
+ InstructionCost MulCost =
+ getArithmeticInstrCost(Instruction::Mul, Ty, CostKind,
+ Op1Info.getNoProps(), Op2Info.getNoProps());
+ InstructionCost AddCost =
+ getArithmeticInstrCost(Instruction::Add, Ty, CostKind,
+ Op1Info.getNoProps(), Op2Info.getNoProps());
+ InstructionCost ShrCost =
+ getArithmeticInstrCost(Instruction::AShr, Ty, CostKind,
+ Op1Info.getNoProps(), Op2Info.getNoProps());
return MulCost * 2 + AddCost * 2 + ShrCost * 2 + 1;
}
+
}
// div i128's are lowered as libcalls. Pass nullptr as (u)divti3 calls are
@@ -3535,7 +3575,7 @@ InstructionCost AArch64TTIImpl::getArithmeticInstrCost(
InstructionCost Cost = BaseT::getArithmeticInstrCost(
Opcode, Ty, CostKind, Op1Info, Op2Info);
- if (Ty->isVectorTy()) {
+ if (Ty->isVectorTy() && (ISD == ISD::SDIV || ISD == ISD::UDIV)) {
if (TLI->isOperationLegalOrCustom(ISD, LT.second) && ST->hasSVE()) {
// SDIV/UDIV operations are lowered using SVE, then we can have less
// costs.
diff --git a/llvm/test/Analysis/CostModel/AArch64/div.ll b/llvm/test/Analysis/CostModel/AArch64/div.ll
index ef52d0db01eefd..fcf567f37b665a 100644
--- a/llvm/test/Analysis/CostModel/AArch64/div.ll
+++ b/llvm/test/Analysis/CostModel/AArch64/div.ll
@@ -180,28 +180,28 @@ define i32 @sdiv_const() {
define i32 @udiv_const() {
; CHECK-LABEL: 'udiv_const'
; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %I128 = udiv i128 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %I64 = udiv i64 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %V2i64 = udiv <2 x i64> undef, <i64 6, i64 7>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 56 for instruction: %V4i64 = udiv <4 x i64> undef, <i64 4, i64 5, i64 6, i64 7>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 112 for instruction: %V8i64 = udiv <8 x i64> undef, <i64 4, i64 5, i64 6, i64 7, i64 8, i64 9, i64 10, i64 11>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %I32 = udiv i32 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %V2i32 = udiv <2 x i32> undef, <i32 4, i32 5>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 48 for instruction: %V4i32 = udiv <4 x i32> undef, <i32 4, i32 5, i32 6, i32 7>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 96 for instruction: %V8i32 = udiv <8 x i32> undef, <i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 192 for instruction: %V16i32 = udiv <16 x i32> undef, <i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %I16 = udiv i16 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %V2i16 = udiv <2 x i16> undef, <i16 4, i16 5>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 48 for instruction: %V4i16 = udiv <4 x i16> undef, <i16 4, i16 5, i16 6, i16 7>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 88 for instruction: %V8i16 = udiv <8 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 176 for instruction: %V16i16 = udiv <16 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 352 for instruction: %V32i16 = udiv <32 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19, i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %I8 = udiv i8 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %V2i8 = udiv <2 x i8> undef, <i8 4, i8 5>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 48 for instruction: %V4i8 = udiv <4 x i8> undef, <i8 4, i8 5, i8 6, i8 7>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 88 for instruction: %V8i8 = udiv <8 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 168 for instruction: %V16i8 = udiv <16 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 336 for instruction: %V32i8 = udiv <32 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 672 for instruction: %V64i8 = udiv <64 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %I64 = udiv i64 undef, 7
+; CHECK-NEXT: Cost Model: Found an estimated cost of 32 for instruction: %V2i64 = udiv <2 x i64> undef, <i64 6, i64 7>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 64 for instruction: %V4i64 = udiv <4 x i64> undef, <i64 4, i64 5, i64 6, i64 7>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 128 for instruction: %V8i64 = udiv <8 x i64> undef, <i64 4, i64 5, i64 6, i64 7, i64 8, i64 9, i64 10, i64 11>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %I32 = udiv i32 undef, 7
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2i32 = udiv <2 x i32> undef, <i32 4, i32 5>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V4i32 = udiv <4 x i32> undef, <i32 4, i32 5, i32 6, i32 7>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %V8i32 = udiv <8 x i32> undef, <i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 24 for instruction: %V16i32 = udiv <16 x i32> undef, <i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %I16 = udiv i16 undef, 7
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2i16 = udiv <2 x i16> undef, <i16 4, i16 5>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V4i16 = udiv <4 x i16> undef, <i16 4, i16 5, i16 6, i16 7>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V8i16 = udiv <8 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %V16i16 = udiv <16 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 24 for instruction: %V32i16 = udiv <32 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19, i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %I8 = udiv i8 undef, 7
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2i8 = udiv <2 x i8> undef, <i8 4, i8 5>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V4i8 = udiv <4 x i8> undef, <i8 4, i8 5, i8 6, i8 7>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V8i8 = udiv <8 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V16i8 = udiv <16 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %V32i8 = udiv <32 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 24 for instruction: %V64i8 = udiv <64 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
;
@@ -297,28 +297,28 @@ define i32 @sdiv_uniformconst() {
define i32 @udiv_uniformconst() {
; CHECK-LABEL: 'udiv_uniformconst'
; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %I128 = udiv i128 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %I64 = udiv i64 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V2i64 = udiv <2 x i64> undef, splat (i64 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V4i64 = udiv <4 x i64> undef, splat (i64 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 40 for instruction: %V8i64 = udiv <8 x i64> undef, splat (i64 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %I32 = udiv i32 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V2i32 = udiv <2 x i32> undef, splat (i32 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %V4i32 = udiv <4 x i32> undef, splat (i32 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 40 for instruction: %V8i32 = udiv <8 x i32> undef, splat (i32 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 80 for instruction: %V16i32 = udiv <16 x i32> undef, splat (i32 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %I16 = udiv i16 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V2i16 = udiv <2 x i16> undef, splat (i16 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V4i16 = udiv <4 x i16> undef, splat (i16 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %V8i16 = udiv <8 x i16> undef, splat (i16 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 80 for instruction: %V16i16 = udiv <16 x i16> undef, splat (i16 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 160 for instruction: %V32i16 = udiv <32 x i16> undef, splat (i16 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %I8 = udiv i8 undef, 7
-; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %V2i8 = udiv <2 x i8> undef, splat (i8 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %V4i8 = udiv <4 x i8> undef, splat (i8 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 40 for instruction: %V8i8 = udiv <8 x i8> undef, splat (i8 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %V16i8 = udiv <16 x i8> undef, splat (i8 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 160 for instruction: %V32i8 = udiv <32 x i8> undef, splat (i8 7)
-; CHECK-NEXT: Cost Model: Found an estimated cost of 320 for instruction: %V64i8 = udiv <64 x i8> undef, splat (i8 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %I64 = udiv i64 undef, 7
+; CHECK-NEXT: Cost Model: Found an estimated cost of 32 for instruction: %V2i64 = udiv <2 x i64> undef, splat (i64 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 64 for instruction: %V4i64 = udiv <4 x i64> undef, splat (i64 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 128 for instruction: %V8i64 = udiv <8 x i64> undef, splat (i64 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %I32 = udiv i32 undef, 7
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2i32 = udiv <2 x i32> undef, splat (i32 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V4i32 = udiv <4 x i32> undef, splat (i32 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %V8i32 = udiv <8 x i32> undef, splat (i32 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 24 for instruction: %V16i32 = udiv <16 x i32> undef, splat (i32 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %I16 = udiv i16 undef, 7
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2i16 = udiv <2 x i16> undef, splat (i16 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V4i16 = udiv <4 x i16> undef, splat (i16 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V8i16 = udiv <8 x i16> undef, splat (i16 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %V16i16 = udiv <16 x i16> undef, splat (i16 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 24 for instruction: %V32i16 = udiv <32 x i16> undef, splat (i16 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %I8 = udiv i8 undef, 7
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V2i8 = udiv <2 x i8> undef, splat (i8 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V4i8 = udiv <4 x i8> undef, splat (i8 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %V8i8 = udiv <8 x i8> undef, splat (i8 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %V16i8 = udiv <16 x i8> undef, splat (i8 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %V32i8 = udiv <32 x i8> undef, splat (i8 7)
+; CHECK-NEXT: Cost Model: Found an estimated cost of 24 for instruction: %V64i8 = udiv <64 x i8> undef, splat (i8 7)
; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
;
%I128 = udiv i128 undef, 7
@@ -412,29 +412,29 @@ define i32 @sdiv_constpow2() {
define i32 @udiv_constpow2() {
; CHECK-LABEL: 'udiv_constpow2'
-; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %I128 = udiv i128 undef, 16
-; CHECK-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %I64 = udiv i64 undef, 16
-; CHECK-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %V2i64 = udiv <2 x i64> undef, <i64 8, i64 16>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 56 for instruction: %V4i64 = udiv <4 x i64> undef, <i64 2, i64 4, i64 8, i64 16>
-; CHECK-NEXT: Cost Model: Found an estimated cost of 112 for instruction: %V8i64 = udiv <8 x i64> undef, <i64 2, i64 4, i64 8, i64 16, i64 32, i64 64, i64 128, i64 256>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %I128 = udiv i128 undef, 16
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %I64 = udiv i64 undef, 16
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %V2i64 = udiv <2 x i64> undef, <i64 8, i64 16>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V4i64 = udiv <4 x i64> undef, <i64 2, i64 4, i64 8, i64 16>
+; CHECK-NEXT: Cost Model: Found ...
[truncated]
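The cost arithmetic in the AArch64TargetTransformInfo.cpp hunk above can be sketched as a standalone function. This is only an illustration: the name constantDivRemCost, the Op enum and the unit default costs are assumptions, whereas the real code obtains MulCost, AddCost and ShrCost from getArithmeticInstrCost for the type in question. Under unit costs it reproduces the new CHECK values for the legal types shown in div.ll (4 for i64, 5 for i32, 6 for <4 x i32>).

```cpp
#include <cassert>

enum class Op { UDiv, URem };

// Mirrors the arithmetic of the hunk: one multiply (two for 128-bit NEON
// vectors), an extra shift when UMULH has to be emulated with UMULL, then
// two add/sub fixups and a final shift. A urem adds a mul and a sub.
// The default costs of 1 are an assumption for illustration only.
int constantDivRemCost(Op O, bool HasMULH, bool Is128bit, int MulCost = 1,
                       int AddCost = 1, int ShrCost = 1) {
  int DivCost = MulCost * (Is128bit ? 2 : 1) // UMULL/UMULH
                + (HasMULH ? 0 : ShrCost)    // UMULL shift
                + AddCost * 2 + ShrCost;
  return DivCost + (O == Op::URem ? MulCost + AddCost : 0);
}

// Compares the sketch with the updated expectations in div.ll.
void checkAgainstDivTests() {
  assert(constantDivRemCost(Op::UDiv, /*HasMULH=*/true, /*Is128bit=*/false) == 4);  // i64
  assert(constantDivRemCost(Op::UDiv, /*HasMULH=*/false, /*Is128bit=*/false) == 5); // i32
  assert(constantDivRemCost(Op::UDiv, /*HasMULH=*/false, /*Is128bit=*/true) == 6);  // <4 x i32>
}
```

Types that need splitting or scalarizing (for example <8 x i32> or <2 x i64>) are not covered by the unit-cost assumption, since their Mul/Add/Shr costs are larger than 1 in the real model.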
@@ -893,7 +893,8 @@ TargetTransformInfo::getOperandInfo(const Value *V) {
// Check for a splat of a constant or for a non uniform vector of constants
// and check if the constant(s) are all powers of two.
- if (isa<ConstantVector>(V) || isa<ConstantDataVector>(V)) {
+ if (isa<ConstantVector>(V) || isa<ConstantDataVector>(V) ||
+ isa<ConstantExpr>(V)) {
I wonder if it's worth committing this separately, or is it only exposed with your code changes? I know with the work @paulwalker-arm has been doing on migrating LLVM to use splat(i32 ...) for constant splats we should be seeing ConstantExpr used a lot more in future.
Yeah this was a relatively late addition whilst writing SVE tests. #122469 has it separated out.
Pulled out of llvm#122236, this allows splat constants to be recognized by getOperandInfo, allowing "better" costs for instructions like divides by constants to be produced (which are expanded into mul+add+shift). Some of the costs are not very accurate yet, but the comparison of scalar vs fixed-width vs scalable for the same div can become more accurate, especially with patches like llvm#122236.
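As a rough illustration of what recognizing a constant operand buys the cost model, the sketch below classifies a vector of constant lanes as uniform or non-uniform and records whether every lane is a power of two. The names Kind, OperandSummary and summarize are hypothetical simplifications and not the TargetTransformInfo API; the real getOperandInfo works on llvm::Value and handles more cases.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-ins for the operand kinds the cost model distinguishes.
enum class Kind { NonUniformConstant, UniformConstant };

struct OperandSummary {
  Kind K;           // splat (all lanes equal) or not
  bool AllPowerOf2; // true only if every lane is a power of two
};

inline bool isPowerOf2(uint64_t V) { return V != 0 && (V & (V - 1)) == 0; }

// Classify a non-empty list of constant lanes.
OperandSummary summarize(const std::vector<uint64_t> &Lanes) {
  assert(!Lanes.empty() && "expects at least one lane");
  bool Uniform = true;
  bool Pow2 = true;
  for (uint64_t L : Lanes) {
    Uniform = Uniform && (L == Lanes.front());
    Pow2 = Pow2 && isPowerOf2(L);
  }
  return {Uniform ? Kind::UniformConstant : Kind::NonUniformConstant, Pow2};
}
```

With this summary, a divisor such as splat (i32 7) is a uniform non-power-of-2 constant, while <i64 2, i64 4, i64 8, i64 16> is non-uniform but all powers of two, which is the distinction the power-of-2 shortcut relies on.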
This patch draws its inspiration from the udiv/urem patch llvm#122236.
For sdiv, the typical sequence of instructions, depending on the type and the divisor property, is as follows:
Scalar power-of-2: cmp + csel + asr
Neon power-of-2: usra + sshr
Scalar non-power-2: smulh/smull + asr/lsr + add/sub + asr + add
Vector non-power-2:
a) <2 x i64>: 2 * (smulh + asr + add). This yields the scalarized form.
b) <4 x i32>: smull2 + smull + uzp2 + add + sshr + usra
SVE versions should have more or less the same cost, because they sometimes yield native sdiv instructions, which should cost less, and otherwise yield the same sequence of Neon instructions.
For srem, the typical sequence of instructions, depending on the type and the divisor property, is as follows:
Scalar version: <set of sdiv instructions> + msub
Vector version: <set of sdiv instructions> + 2-msub/1-mls
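The scalar power-of-2 case listed above can be sketched in C++ as follows. This is an illustration of the select-then-shift idea, not the backend's lowering; the function names are hypothetical, and the code assumes that right-shifting a negative signed value is an arithmetic shift (guaranteed from C++20, and the behaviour of mainstream compilers before that).

```cpp
#include <cassert>
#include <cstdint>

// Signed divide by 2^K, rounding toward zero. Negative inputs are biased by
// 2^K - 1 before the shift so that the shift does not round toward -infinity.
int64_t sdivPow2(int64_t X, unsigned K) {
  assert(K > 0 && K < 63 && "shift amount out of the range this sketch handles");
  int64_t Bias = (int64_t(1) << K) - 1;
  int64_t Adjusted = X < 0 ? X + Bias : X; // compare + select
  return Adjusted >> K;                    // arithmetic shift right
}

// Remainder recovered as X - Q * D, the role msub plays in the srem sequence.
int64_t sremPow2(int64_t X, unsigned K) {
  return X - sdivPow2(X, K) * (int64_t(1) << K);
}
```

For example, sdivPow2(-7, 1) gives -3 and sremPow2(-7, 2) gives -3, matching C++'s truncating / and % on signed integers.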
@davemgreen
So, essentially, it's like having 8 possibilities for the constant with these 3 variables. Now, things change pretty quickly if the constant is negative (I think I said on the other patch that things don't change much if the constant is negative, but I was wrong).
The non-uniform case also brings in more complexity, but not that much. So maybe we can break this patch up so that we conquer just a few of the 8 possibilities in each patch? I believe this will ensure precision in cost estimation and easy verification.
I believe for udiv/urem there are only really 2 cases - positive power-2 and everything else. The power-2 cases almost never come up as they are canonicalised early and are mostly added for completeness. Like you say I don't think there is a lot of difference between uniform and non-uniforms, it just takes the worst-case. The exact cost will depend on the constant though, and we are currently keeping things simple and assuming no simplification happens. Some positive constants will simplify, some negative ones will, it is hard to tell without doing a lot of maths to see what the exact instructions will be. We could try to use DivisionByConstantInfo to get a more precise expansion and figure out what will simplify, but that feels a bit more complex than is helpful considering the other costs that could be improved. Maybe one for the future, if you don't want to look into it yourself.
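The two cases mentioned here can be made concrete with a small C++ sketch for 32-bit values. The function names are hypothetical, and 0x24924925 is the well-known multiplier for dividing a 32-bit unsigned value by 7; other divisors need other constants and may not need the sub/add fixup at all, which is the simplification the cost model deliberately ignores. The last function shows why funnel-shift costs are affected: the shift amount is clamped with a urem by the bit width, which is a power of two.

```cpp
#include <cassert>
#include <cstdint>

// Case 1, power-of-2 divisor: udiv is a shift, urem is an and.
uint32_t udivBy32(uint32_t N) { return N >> 5; }
uint32_t uremBy32(uint32_t N) { return N & 31u; }

// Case 2, everything else (7 here): high half of a widening multiply,
// then sub + shift + add + shift.
uint32_t udivBy7(uint32_t N) {
  const uint64_t Magic = 0x24924925u;                 // specific to d = 7, 32 bits
  uint32_t Q = uint32_t((uint64_t(N) * Magic) >> 32); // widening mul + shift
  uint32_t T = ((N - Q) >> 1) + Q;                    // sub + shift + add
  return T >> 2;                                      // final shift
}

// urem is the udiv sequence plus a multiply and a subtract.
uint32_t uremBy7(uint32_t N) { return N - udivBy7(N) * 7u; }

// Funnel shift left: the amount is reduced modulo 32 via the and form.
uint32_t fshl32(uint32_t Hi, uint32_t Lo, uint32_t Amt) {
  uint32_t S = Amt & 31u; // Amt urem 32
  return S == 0 ? Hi : (Hi << S) | (Lo >> (32u - S));
}
```

The non-power-of-2 path uses one multiply, one extra shift for the emulated high half, two add/sub operations and one more shift, which lines up with the MULHU + SUB + SRL + ADD + SRL comment in the patch.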
Op1Info.getNoProps(), Op2Info.getNoProps());
InstructionCost ShrCost =
getArithmeticInstrCost(Instruction::AShr, Ty, CostKind,
Op1Info.getNoProps(), Op2Info.getNoProps());
return MulCost * 2 + AddCost * 2 + ShrCost * 2 + 1;
}
Maybe this part can be removed. Division by non-constant, uniform divisor is giving scalar code. Check here
Hi - This is inside the top-level if (Op2Info.isConstant()) {, so I don't think that should be an issue. We are only trying to update constant costs in this patch, to keep it simpler. I've added some uniform tests in f08824b to check.
I think this bit of code can probably be removed when sdiv gets added. For now I will re-add the isScalableVector check to make sure the sdiv scores don't change yet.
ok. Thanks. Will remove this if(){...} when I update the sdiv/srem patch.
Thanks
LGTM
…nt (#123552) This patch revises the cost model for sdiv/srem and draws its inspiration from the udiv/urem patch #122236. The typical codegen for the different scenarios is described in notes/comments in the code itself (there are too many scenarios to list them all in the patch description).