[AArch64] Improve urem by constant costs #122236

Merged
merged 1 commit into from
Feb 26, 2025

davemgreen
Collaborator

A urem by a constant, much like a udiv by a constant, can be expanded into a series of mul/add/shift instructions. The exact sequence of instructions depends on the constants and the types.

If the constant is a power of 2, a shift (for udiv) or an and (for urem) is used, so the cost is 1. This canonicalization happens relatively early, so it likely has very little effect in practice (though it does help the cost of funnel shifts).
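
A minimal sketch of the equivalence this relies on, for 32-bit unsigned values. The helper names `udivPow2` and `uremPow2` are illustrative and not part of the patch.

```cpp
#include <cstdint>

// Illustrative helpers (hypothetical names), valid for 0 <= k < 32.

// Unsigned divide by 2^k: a single logical shift right.
constexpr uint32_t udivPow2(uint32_t x, unsigned k) { return x >> k; }

// Unsigned remainder by 2^k: a single AND with the low-bit mask.
constexpr uint32_t uremPow2(uint32_t x, unsigned k) {
  return x & ((uint32_t{1} << k) - 1);
}
```

For example, `udivPow2(x, 4)` computes `x / 16` and `uremPow2(x, 4)` computes `x % 16`, each in one instruction, which is why the cost is modelled as that of a single shift or and.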

For a non-power-of-2 constant, the div expands to a series of UMULH + Add + Shift + Add, depending on the constant. urem is generally udiv + mul + sub, so it involves a few extra instructions. UMULH is not always available: i32 uses umull+shift, and vector types use umull+shift or umull+umull2+uzp depending on the vector size. v2i64 is scalarized because there is no mul available. SVE does have a UMULH instruction.
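
As a concrete illustration of that expansion, here is a sketch of a 32-bit udiv and urem by 7 written without a divide. The constant is the well-known multiplier for dividing by 7. The function names are illustrative, and the mapping of each C++ operation to a specific AArch64 instruction is an assumption rather than verified compiler output.

```cpp
#include <cstdint>

// High 32 bits of a 32x32 -> 64-bit product. For i32 this corresponds
// roughly to the umull + shift pair described above (an assumption).
inline uint32_t mulhu32(uint32_t a, uint32_t b) {
  return static_cast<uint32_t>((static_cast<uint64_t>(a) * b) >> 32);
}

// x / 7 as multiply-high + sub + shift + add + shift.
// 0x24924925 is the low 32 bits of the 33-bit multiplier ceil(2^35 / 7);
// the sub/shift/add steps account for the implicit 33rd bit.
inline uint32_t udiv7(uint32_t x) {
  uint32_t t = mulhu32(x, 0x24924925u);
  return (((x - t) >> 1) + t) >> 2;
}

// x % 7 as udiv + mul + sub.
inline uint32_t urem7(uint32_t x) { return x - udiv7(x) * 7u; }
```

The remainder reuses the whole division sequence and then adds a multiply and a subtract, which is why the patch prices urem as the udiv cost plus one mul and one add/sub.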

The end result is that the costs should be closer to reality, with scalable types a little lower cost than the fixed-width versions. (In the future we might be able to use umulh for fixed-width when the SVE instruction is available, but for the moment this should favour scalable vectorization a little).
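
The cost formula in the patch can be sketched as a standalone function. This is a simplified model, not the real TTI code: `LoweringInfo` and `constDivRemCost` are hypothetical names, and the unit mul/add/shift costs are an assumption (the real implementation queries `getArithmeticInstrCost` for each component, so types with expensive multiplies such as v2i64 come out much higher).

```cpp
// Hypothetical stand-in for the type properties the patch checks.
struct LoweringInfo {
  bool hasMulh;  // a UMULH-style instruction exists (scalar i64, SVE vectors)
  bool is128Bit; // the legalized type is a 128-bit fixed-width vector
};

enum class Op { UDiv, URem };

// Mirrors the DivCost expression from the patch, with component costs
// defaulting to 1 (an assumption made for illustration).
inline int constDivRemCost(Op op, LoweringInfo ty, int mulCost = 1,
                           int addCost = 1, int shrCost = 1) {
  int divCost = mulCost * (ty.is128Bit ? 2 : 1) // UMULL/UMULH
                + (ty.hasMulh ? 0 : shrCost)    // shift after UMULL
                + addCost * 2 + shrCost;
  // urem adds a multiply and a subtract on top of the division.
  return divCost + (op == Op::URem ? mulCost + addCost : 0);
}
```

With unit costs this gives 4 for a scalar i64 udiv, 5 for i32 and 6 for v4i32, which agrees with the updated expectations in div.ll. A type with a native UMULH, such as an SVE vector, gets 4 under the same assumptions, which is the small bias towards scalable vectorization mentioned above.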

I've tried to make this patch only apply to constant UREM/UDIV instructions. SDIV and SREM are left until a later patch to prevent this becoming too complex. The funnel shift costs change because the cost model assumes a urem is needed to clamp the shift amount, and for most common types that urem is by a power-of-2 value.
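
To show where the urem comes from in a funnel shift, here is a reference sketch of a 32-bit funnel shift left, in which the shift amount is taken modulo the bit width. The function name `fshl32` is illustrative.

```cpp
#include <cstdint>

// Reference semantics of a 32-bit funnel shift left: concatenate a:b,
// shift left by (c mod 32), and keep the high 32 bits.
inline uint32_t fshl32(uint32_t a, uint32_t b, uint32_t c) {
  uint32_t s = c % 32u;  // urem by the bit width; identical to c & 31u
  if (s == 0) return a;  // avoids the undefined shift by 32 below
  return (a << s) | (b >> (32u - s));
}
```

Because 32 is a power of 2, the `% 32u` is a single and, so pricing that urem at 1 instead of as a full division lowers the estimated funnel shift cost.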

@llvmbot
Member

llvmbot commented Jan 9, 2025

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-backend-aarch64

Author: David Green (davemgreen)

Patch is 182.65 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/122236.diff

10 Files Affected:

  • (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+2-1)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+52-12)
  • (modified) llvm/test/Analysis/CostModel/AArch64/div.ll (+128-128)
  • (modified) llvm/test/Analysis/CostModel/AArch64/div_cte.ll (+3-3)
  • (modified) llvm/test/Analysis/CostModel/AArch64/fshl.ll (+8-8)
  • (modified) llvm/test/Analysis/CostModel/AArch64/fshr.ll (+8-8)
  • (modified) llvm/test/Analysis/CostModel/AArch64/rem.ll (+135-135)
  • (modified) llvm/test/Analysis/CostModel/AArch64/sve-div.ll (+127-127)
  • (modified) llvm/test/Analysis/CostModel/AArch64/sve-intrinsics.ll (+16-16)
  • (modified) llvm/test/Analysis/CostModel/AArch64/sve-rem.ll (+128-128)
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index b32dffa9f0fe86..13a56709ed10f5 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -893,7 +893,8 @@ TargetTransformInfo::getOperandInfo(const Value *V) {
 
   // Check for a splat of a constant or for a non uniform vector of constants
   // and check if the constant(s) are all powers of two.
-  if (isa<ConstantVector>(V) || isa<ConstantDataVector>(V)) {
+  if (isa<ConstantVector>(V) || isa<ConstantDataVector>(V) ||
+      isa<ConstantExpr>(V)) {
     OpInfo = OK_NonUniformConstantValue;
     if (Splat) {
       OpInfo = OK_UniformConstantValue;
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index 25b6731cb313a1..2b51d2de060b6c 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -3510,21 +3510,61 @@ InstructionCost AArch64TTIImpl::getArithmeticInstrCost(
       return Cost;
     }
     [[fallthrough]];
-  case ISD::UDIV: {
+  case ISD::UDIV:
+  case ISD::UREM: {
     auto VT = TLI->getValueType(DL, Ty);
-    if (Op2Info.isConstant() && Op2Info.isUniform()) {
-      if (TLI->isOperationLegalOrCustom(ISD::MULHU, VT)) {
+    if (Op2Info.isConstant()) {
+      // If the operand is a power of 2 we can use the shift or and cost.
+      if (ISD == ISD::UDIV && Op2Info.isPowerOf2())
+        return getArithmeticInstrCost(Instruction::LShr, Ty, CostKind,
+                                      Op1Info.getNoProps(),
+                                      Op2Info.getNoProps());
+      if (ISD == ISD::UREM && Op2Info.isPowerOf2())
+        return getArithmeticInstrCost(Instruction::And, Ty, CostKind,
+                                      Op1Info.getNoProps(),
+                                      Op2Info.getNoProps());
+
+      if (ISD == ISD::UDIV || ISD == ISD::UREM) {
+        // Divides by a constant are expanded to MULHU + SUB + SRL + ADD + SRL.
+        // The MULHU will be expanded to UMULL for the types not listed below,
+        // and will become a pair of UMULL+UMULL2 for 128-bit vectors.
+        bool HasMULH = VT == MVT::i64 || LT.second == MVT::nxv2i64 ||
+                       LT.second == MVT::nxv4i32 || LT.second == MVT::nxv8i16 ||
+                       LT.second == MVT::nxv16i8;
+        bool Is128bit = LT.second.is128BitVector();
+
+        InstructionCost MulCost =
+            getArithmeticInstrCost(Instruction::Mul, Ty, CostKind,
+                                   Op1Info.getNoProps(), Op2Info.getNoProps());
+        InstructionCost AddCost =
+            getArithmeticInstrCost(Instruction::Add, Ty, CostKind,
+                                   Op1Info.getNoProps(), Op2Info.getNoProps());
+        InstructionCost ShrCost =
+            getArithmeticInstrCost(Instruction::AShr, Ty, CostKind,
+                                   Op1Info.getNoProps(), Op2Info.getNoProps());
+        InstructionCost DivCost = MulCost * (Is128bit ? 2 : 1) + // UMULL/UMULH
+                                  (HasMULH ? 0 : ShrCost) +      // UMULL shift
+                                  AddCost * 2 + ShrCost;
+        return DivCost + (ISD == ISD::UREM ? MulCost + AddCost : 0);
+      }
+
+      // TODO: Fix SDIV and SREM costs, similar to the above.
+      if (TLI->isOperationLegalOrCustom(ISD::MULHU, VT) &&
+          Op2Info.isUniform()) {
         // Vector signed division by constant are expanded to the
-        // sequence MULHS + ADD/SUB + SRA + SRL + ADD, and unsigned division
-        // to MULHS + SUB + SRL + ADD + SRL.
-        InstructionCost MulCost = getArithmeticInstrCost(
-            Instruction::Mul, Ty, CostKind, Op1Info.getNoProps(), Op2Info.getNoProps());
-        InstructionCost AddCost = getArithmeticInstrCost(
-            Instruction::Add, Ty, CostKind, Op1Info.getNoProps(), Op2Info.getNoProps());
-        InstructionCost ShrCost = getArithmeticInstrCost(
-            Instruction::AShr, Ty, CostKind, Op1Info.getNoProps(), Op2Info.getNoProps());
+        // sequence MULHS + ADD/SUB + SRA + SRL + ADD.
+        InstructionCost MulCost =
+            getArithmeticInstrCost(Instruction::Mul, Ty, CostKind,
+                                   Op1Info.getNoProps(), Op2Info.getNoProps());
+        InstructionCost AddCost =
+            getArithmeticInstrCost(Instruction::Add, Ty, CostKind,
+                                   Op1Info.getNoProps(), Op2Info.getNoProps());
+        InstructionCost ShrCost =
+            getArithmeticInstrCost(Instruction::AShr, Ty, CostKind,
+                                   Op1Info.getNoProps(), Op2Info.getNoProps());
         return MulCost * 2 + AddCost * 2 + ShrCost * 2 + 1;
       }
+
     }
 
     // div i128's are lowered as libcalls.  Pass nullptr as (u)divti3 calls are
@@ -3535,7 +3575,7 @@ InstructionCost AArch64TTIImpl::getArithmeticInstrCost(
 
     InstructionCost Cost = BaseT::getArithmeticInstrCost(
         Opcode, Ty, CostKind, Op1Info, Op2Info);
-    if (Ty->isVectorTy()) {
+    if (Ty->isVectorTy() && (ISD == ISD::SDIV || ISD == ISD::UDIV)) {
       if (TLI->isOperationLegalOrCustom(ISD, LT.second) && ST->hasSVE()) {
         // SDIV/UDIV operations are lowered using SVE, then we can have less
         // costs.
diff --git a/llvm/test/Analysis/CostModel/AArch64/div.ll b/llvm/test/Analysis/CostModel/AArch64/div.ll
index ef52d0db01eefd..fcf567f37b665a 100644
--- a/llvm/test/Analysis/CostModel/AArch64/div.ll
+++ b/llvm/test/Analysis/CostModel/AArch64/div.ll
@@ -180,28 +180,28 @@ define i32 @sdiv_const() {
 define i32 @udiv_const() {
 ; CHECK-LABEL: 'udiv_const'
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %I128 = udiv i128 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %I64 = udiv i64 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V2i64 = udiv <2 x i64> undef, <i64 6, i64 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 56 for instruction: %V4i64 = udiv <4 x i64> undef, <i64 4, i64 5, i64 6, i64 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 112 for instruction: %V8i64 = udiv <8 x i64> undef, <i64 4, i64 5, i64 6, i64 7, i64 8, i64 9, i64 10, i64 11>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I32 = udiv i32 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V2i32 = udiv <2 x i32> undef, <i32 4, i32 5>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 48 for instruction: %V4i32 = udiv <4 x i32> undef, <i32 4, i32 5, i32 6, i32 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 96 for instruction: %V8i32 = udiv <8 x i32> undef, <i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 192 for instruction: %V16i32 = udiv <16 x i32> undef, <i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I16 = udiv i16 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V2i16 = udiv <2 x i16> undef, <i16 4, i16 5>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 48 for instruction: %V4i16 = udiv <4 x i16> undef, <i16 4, i16 5, i16 6, i16 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 88 for instruction: %V8i16 = udiv <8 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 176 for instruction: %V16i16 = udiv <16 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 352 for instruction: %V32i16 = udiv <32 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19, i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I8 = udiv i8 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V2i8 = udiv <2 x i8> undef, <i8 4, i8 5>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 48 for instruction: %V4i8 = udiv <4 x i8> undef, <i8 4, i8 5, i8 6, i8 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 88 for instruction: %V8i8 = udiv <8 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 168 for instruction: %V16i8 = udiv <16 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 336 for instruction: %V32i8 = udiv <32 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 672 for instruction: %V64i8 = udiv <64 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %I64 = udiv i64 undef, 7
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %V2i64 = udiv <2 x i64> undef, <i64 6, i64 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 64 for instruction: %V4i64 = udiv <4 x i64> undef, <i64 4, i64 5, i64 6, i64 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 128 for instruction: %V8i64 = udiv <8 x i64> undef, <i64 4, i64 5, i64 6, i64 7, i64 8, i64 9, i64 10, i64 11>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %I32 = udiv i32 undef, 7
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V2i32 = udiv <2 x i32> undef, <i32 4, i32 5>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V4i32 = udiv <4 x i32> undef, <i32 4, i32 5, i32 6, i32 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V8i32 = udiv <8 x i32> undef, <i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V16i32 = udiv <16 x i32> undef, <i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %I16 = udiv i16 undef, 7
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V2i16 = udiv <2 x i16> undef, <i16 4, i16 5>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4i16 = udiv <4 x i16> undef, <i16 4, i16 5, i16 6, i16 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V8i16 = udiv <8 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V16i16 = udiv <16 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V32i16 = udiv <32 x i16> undef, <i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19, i16 4, i16 5, i16 6, i16 7, i16 8, i16 9, i16 10, i16 11, i16 12, i16 13, i16 14, i16 15, i16 16, i16 17, i16 18, i16 19>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %I8 = udiv i8 undef, 7
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V2i8 = udiv <2 x i8> undef, <i8 4, i8 5>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4i8 = udiv <4 x i8> undef, <i8 4, i8 5, i8 6, i8 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V8i8 = udiv <8 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V16i8 = udiv <16 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V32i8 = udiv <32 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V64i8 = udiv <64 x i8> undef, <i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16, i8 17, i8 18, i8 19>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
 ;
 
@@ -297,28 +297,28 @@ define i32 @sdiv_uniformconst() {
 define i32 @udiv_uniformconst() {
 ; CHECK-LABEL: 'udiv_uniformconst'
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %I128 = udiv i128 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %I64 = udiv i64 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2i64 = udiv <2 x i64> undef, splat (i64 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %V4i64 = udiv <4 x i64> undef, splat (i64 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %V8i64 = udiv <8 x i64> undef, splat (i64 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I32 = udiv i32 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2i32 = udiv <2 x i32> undef, splat (i32 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V4i32 = udiv <4 x i32> undef, splat (i32 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %V8i32 = udiv <8 x i32> undef, splat (i32 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 80 for instruction: %V16i32 = udiv <16 x i32> undef, splat (i32 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I16 = udiv i16 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2i16 = udiv <2 x i16> undef, splat (i16 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %V4i16 = udiv <4 x i16> undef, splat (i16 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8i16 = udiv <8 x i16> undef, splat (i16 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 80 for instruction: %V16i16 = udiv <16 x i16> undef, splat (i16 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 160 for instruction: %V32i16 = udiv <32 x i16> undef, splat (i16 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I8 = udiv i8 undef, 7
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2i8 = udiv <2 x i8> undef, splat (i8 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %V4i8 = udiv <4 x i8> undef, splat (i8 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %V8i8 = udiv <8 x i8> undef, splat (i8 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V16i8 = udiv <16 x i8> undef, splat (i8 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 160 for instruction: %V32i8 = udiv <32 x i8> undef, splat (i8 7)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 320 for instruction: %V64i8 = udiv <64 x i8> undef, splat (i8 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %I64 = udiv i64 undef, 7
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %V2i64 = udiv <2 x i64> undef, splat (i64 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 64 for instruction: %V4i64 = udiv <4 x i64> undef, splat (i64 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 128 for instruction: %V8i64 = udiv <8 x i64> undef, splat (i64 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %I32 = udiv i32 undef, 7
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V2i32 = udiv <2 x i32> undef, splat (i32 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V4i32 = udiv <4 x i32> undef, splat (i32 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V8i32 = udiv <8 x i32> undef, splat (i32 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V16i32 = udiv <16 x i32> undef, splat (i32 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %I16 = udiv i16 undef, 7
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V2i16 = udiv <2 x i16> undef, splat (i16 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4i16 = udiv <4 x i16> undef, splat (i16 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V8i16 = udiv <8 x i16> undef, splat (i16 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V16i16 = udiv <16 x i16> undef, splat (i16 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V32i16 = udiv <32 x i16> undef, splat (i16 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %I8 = udiv i8 undef, 7
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V2i8 = udiv <2 x i8> undef, splat (i8 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4i8 = udiv <4 x i8> undef, splat (i8 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V8i8 = udiv <8 x i8> undef, splat (i8 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V16i8 = udiv <16 x i8> undef, splat (i8 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V32i8 = udiv <32 x i8> undef, splat (i8 7)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V64i8 = udiv <64 x i8> undef, splat (i8 7)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
 ;
   %I128 = udiv i128 undef, 7
@@ -412,29 +412,29 @@ define i32 @sdiv_constpow2() {
 
 define i32 @udiv_constpow2() {
 ; CHECK-LABEL: 'udiv_constpow2'
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %I128 = udiv i128 undef, 16
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %I64 = udiv i64 undef, 16
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V2i64 = udiv <2 x i64> undef, <i64 8, i64 16>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 56 for instruction: %V4i64 = udiv <4 x i64> undef, <i64 2, i64 4, i64 8, i64 16>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 112 for instruction: %V8i64 = udiv <8 x i64> undef, <i64 2, i64 4, i64 8, i64 16, i64 32, i64 64, i64 128, i64 256>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %I128 = udiv i128 undef, 16
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I64 = udiv i64 undef, 16
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2i64 = udiv <2 x i64> undef, <i64 8, i64 16>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4i64 = udiv <4 x i64> undef, <i64 2, i64 4, i64 8, i64 16>
+; CHECK-NEXT:  Cost Model: Found ...
[truncated]

github-actions bot commented Jan 9, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@@ -893,7 +893,8 @@ TargetTransformInfo::getOperandInfo(const Value *V) {

   // Check for a splat of a constant or for a non uniform vector of constants
   // and check if the constant(s) are all powers of two.
-  if (isa<ConstantVector>(V) || isa<ConstantDataVector>(V)) {
+  if (isa<ConstantVector>(V) || isa<ConstantDataVector>(V) ||
+      isa<ConstantExpr>(V)) {
Contributor

I wonder if it's worth committing this separately, or is it only exposed with your code changes? I know with the work @paulwalker-arm has been doing on migrating LLVM to use splat(i32 ...) for constant splats we should be seeing ConstantExpr used a lot more in future.

Collaborator Author

Yeah this was a relatively late addition whilst writing SVE tests. #122469 has it separated out.

davemgreen added a commit to davemgreen/llvm-project that referenced this pull request Jan 10, 2025
Pulled out of llvm#122236, this allows splat constants to be recognized by
getOperandInfo, allowing "better" costs for instructions like divides by
constants to be produced (which are expanded into mul+add+shift). Some of the
costs are not very accurate yet, but the comparison of scalar vs fixed-width vs
scalable for the same div can become more accurate, especially with patches
like llvm#122236.
sushgokh added a commit to sushgokh/llvm-project that referenced this pull request Jan 20, 2025
This patch draws its inspiration from the udiv/urem patch llvm#122236

For sdiv, typical sequence of instructions as per the type and divisor property is as follows:

Scalar power-of-2: cmp + csel + asr
Neon power-of-2: usra + sshr

Scalar non-power-2: smulh/smull + asr/lsr + add/sub + asr + add
Vector non-power-2:
  a) <2 x i64>: 2 * (smulh + asr + add). This yields the scalarized form.
  b) <4 x i32>: smull2 + smull + uzp2 + add + sshr + usra

SVE versions should cost more or less the same, because they yield either native sdiv instructions (which should cost less) or the same sequence of Neon instructions.

For srem, typical sequence of instructions as per the type and divisor property is as follows:
Scalar version: <set of sdiv instructions> + msub
Vector version: <set of sdiv instructions> + 2-msub/1-mls
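
A minimal sketch of the scalar power-of-2 case listed above, assuming 32-bit values and a divisor of 2^k with 1 <= k <= 30. The names `sdivPow2` and `sremPow2` are illustrative, and the comments that map steps to instructions are an approximation of the sequences listed, not verified compiler output.

```cpp
#include <cstdint>

// x / 2^k with round-toward-zero semantics.
// Assumes >> on a negative int32_t is an arithmetic shift (guaranteed
// since C++20, and the behaviour of mainstream compilers before that).
inline int32_t sdivPow2(int32_t x, unsigned k) {
  int32_t bias = (int32_t{1} << k) - 1;
  int32_t adjusted = x < 0 ? x + bias : x; // select on sign (cmp + csel)
  return adjusted >> k;                    // asr
}

// x % 2^k as "sdiv sequence + msub": x - (x / d) * d.
inline int32_t sremPow2(int32_t x, unsigned k) {
  return x - sdivPow2(x, k) * (int32_t{1} << k);
}
```

The bias is what separates this from the unsigned case: a plain arithmetic shift rounds towards negative infinity, so negative inputs are nudged up first to get the round-toward-zero result that sdiv requires.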
sushgokh added a commit that referenced this pull request Jan 22, 2025
This patch updates the motivating test for the above PR so that it does
not conflict with urem PR #122236
davemgreen added a commit to davemgreen/llvm-project that referenced this pull request Feb 1, 2025
davemgreen added a commit to davemgreen/llvm-project that referenced this pull request Feb 1, 2025
@davemgreen davemgreen requested a review from sushgokh February 4, 2025 06:35
@sushgokh
Contributor

sushgokh commented Feb 5, 2025

@davemgreen
I was just looking at the examples with 3 different scenarios for the constant in my mind:

  1. If the constant is pow-2
  2. If the constant is uniform
  3. If the constant is negative

So, essentially, it's like having 8 possibilities for the constant with these 3 variables.

Now, things change pretty quickly if the constant is negative (I think I said on the other patch that things don't change much if the constant is negative, but I was wrong).
e.g.

for <4 x i32> :
a) positive constant: ushr
b) negative constant: ushr + umull2 + umull + uzp2 + ushr     (for certain cases)

The non-uniform case also brings in more complexity but not that much.

So maybe we can break this patch up so that each patch tackles just a few of the 8 possibilities at a time? I believe this would keep the cost estimates precise and make verification easier.

davemgreen added a commit to davemgreen/llvm-project that referenced this pull request Feb 13, 2025
davemgreen added a commit that referenced this pull request Feb 21, 2025
Pulled out of #122236, this allows splat constants to be recognized by
getOperandInfo, allowing "better" costs for instructions like divides by
constants to be produced (which are expanded into mul+add+shift). Some
of the costs are not very accurate yet, but the comparison of scalar vs
fixed-width vs scalable for the same div can become more accurate,
especially with patches like #122236.
@davemgreen
Collaborator Author

I believe for udiv/urem there are only really 2 cases - positive power-2 and everything else. The power-2 cases almost never come up as they are canonicalised early and are mostly added for completeness. Like you say I don't think there is a lot of difference between uniform and non-uniforms, it just takes the worst-case.

The exact cost will depend on the constant though, and we are currently keeping things simple and assuming no simplification happens. Some positive constants will simplify, some negative ones will, it is hard to tell without doing a lot of maths to see what the exact instructions will be. We could try to use DivisionByConstantInfo to get a more precise expansion and figure out what will simplify, but that feels a bit more complex than is helpful considering the other costs that could be improved. Maybe one for the future, if you don't want to look into it yourself.
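
To illustrate why the exact sequence depends on the constant, here is a simplified sketch for 16-bit unsigned division. It is not LLVM's DivisionByConstantInfo algorithm. It searches for the smallest shift `s` such that `m = ceil(2^(16+s) / d)` satisfies `m*d - 2^(16+s) <= 2^s` (the standard sufficient condition for the multiply-and-shift to be exact), then reports whether `m` still fits in 16 bits. All names are hypothetical, and refinements such as pre-shifting even divisors are ignored.

```cpp
#include <cstdint>

struct Magic16 {
  uint32_t multiplier;      // may need 17 bits
  unsigned shift;           // extra shift applied after the high half
  bool needsWideMultiplier; // true when the add/sub fixup would be needed
};

// Precondition: d >= 1. The loop ends by s = ceil(log2(d)) at the latest.
inline Magic16 findMagic16(uint16_t d) {
  for (unsigned s = 0;; ++s) {
    uint64_t pow = uint64_t{1} << (16 + s);
    uint64_t m = (pow + d - 1) / d; // ceil(2^(16+s) / d)
    uint64_t err = m * d - pow;
    if (err <= (uint64_t{1} << s))
      return {static_cast<uint32_t>(m), s, m > 0xFFFFu};
  }
}

// Divide using the magic pair; 64-bit arithmetic holds the wide product.
inline uint16_t udiv16(uint16_t x, const Magic16 &mg) {
  return static_cast<uint16_t>(
      (static_cast<uint64_t>(x) * mg.multiplier) >> (16 + mg.shift));
}
```

Under this model a divisor of 3 gets a multiplier that fits in 16 bits, while a divisor of 7 needs a 17-bit multiplier and therefore the longer sequence with the extra sub, shift and add. The patch prices every non-power-of-2 constant as the longer sequence, which is the "assume no simplification" choice described above.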

Op1Info.getNoProps(), Op2Info.getNoProps());
InstructionCost ShrCost =
getArithmeticInstrCost(Instruction::AShr, Ty, CostKind,
Op1Info.getNoProps(), Op2Info.getNoProps());
return MulCost * 2 + AddCost * 2 + ShrCost * 2 + 1;
}
Contributor

Maybe this part can be removed. Division by non-constant, uniform divisor is giving scalar code. Check here

Collaborator Author

Hi - This is inside the top-level if (Op2Info.isConstant()) {, so I don't think that should be an issue. We are only trying to update constant costs in this patch, to keep it simpler. I've added some uniform tests in f08824b to check.

I think this bit of code can probably be removed when sdiv gets added. For now I will re-add the isScalableVector check to make sure the sdiv scores don't change yet.

Contributor

ok. Thanks. Will remove this if(){...} when I update the sdiv/srem patch.

Collaborator Author

Thanks

A urem by a constant, much like a udiv by a constant, can be expanded into a
series of mul/add/shift instructions. The exact sequence of instructions
depends on the constants and the types.

If the constant is a power of 2 then a shift or an AND will be used, so the cost will
be 1. This canonicalization happens relatively early so this likely has very
little effect in practice (it does help the cost of funnel shifts).
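
As a small illustration of the power-of-2 case, assuming a u32 operand and a divisor of the form 2^k with k < 32 (the function names are hypothetical):

```cpp
#include <cstdint>

// x / 2^k is a single logical shift right.
constexpr uint32_t udiv_pow2(uint32_t x, unsigned k) {
  return x >> k;
}

// x % 2^k is a single AND with the low-bit mask.
constexpr uint32_t urem_pow2(uint32_t x, unsigned k) {
  return x & ((1u << k) - 1u);
}

static_assert(udiv_pow2(29u, 3) == 29u / 8u, "29 / 8 == 3");
static_assert(urem_pow2(29u, 3) == 29u % 8u, "29 % 8 == 5");
```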

For a non-power-of-2 constant the code for div will expand to a series of UMULH + Add +
Shift + Add, depending on the constant. urem is generally udiv + mul + sub, so
involves a few extra instructions. The UMULH is not always available, i32 will
use umull+shift, and vector types will use umull+shift or umull+umull2+uzp
depending on the vector size. v2i64 will be scalarized because there is no mul
available. SVE does have a UMULH instruction.
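
The "udiv + mul + sub" shape of urem can be sketched as follows. This is only an illustration: urem_via_udiv is a hypothetical name, and the plain `/` stands in for whichever constant-division expansion the target picks.

```cpp
#include <cstdint>

// urem expressed through udiv: r = x - (x / c) * c.
constexpr uint32_t urem_via_udiv(uint32_t x, uint32_t c) {
  uint32_t q = x / c;  // placeholder for the mul/add/shift division sequence
  return x - q * c;    // the extra mul + sub on top of the udiv cost
}

static_assert(urem_via_udiv(29u, 7u) == 1u, "29 % 7 == 1");
static_assert(urem_via_udiv(0xFFFFFFFFu, 7u) == 0xFFFFFFFFu % 7u, "max input");
```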

The end result is that the costs should be closer to reality, with scalable
types a little lower cost than the fixed-width versions. (In the future we
might be able to use umulh for fixed-width when the SVE instruction is
available, but for the moment this should favour scalable vectorization a
little).

I've tried to make this patch only apply to constant UREM/UDIV instructions.
SDIV and SREM are left until a later patch to prevent this becoming too
complex. The funnel shift costs are changing as it believes it will need a urem
to clamp the shift amount, which should be a power-2 value for most common
types.
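
As a rough illustration of where that urem comes from: a funnel shift takes its shift amount modulo the bit width. A minimal sketch for i32 follows, with fshl32 as a hypothetical helper that models the documented semantics of llvm.fshl.i32 rather than how the backend lowers it.

```cpp
#include <cstdint>

// Funnel shift left: concatenate hi:lo, shift left by amt mod 32, and keep
// the top 32 bits.
constexpr uint32_t fshl32(uint32_t hi, uint32_t lo, uint32_t amt) {
  uint32_t s = amt % 32u;  // urem by the bit width; a power of 2, so amt & 31
  return s == 0 ? hi : (hi << s) | (lo >> (32u - s));
}

static_assert(fshl32(0x12345678u, 0x9ABCDEF0u, 8u) == 0x3456789Au, "shift by 8");
static_assert(fshl32(0x12345678u, 0x9ABCDEF0u, 40u) == 0x3456789Au, "40 mod 32 == 8");
static_assert(fshl32(0x12345678u, 0x9ABCDEF0u, 32u) == 0x12345678u, "32 mod 32 == 0");
```

Because the bit width is a power of 2 for the common integer types, the clamp is the cheap power-of-2 urem case rather than a full constant-division sequence.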
Contributor

@sushgokh sushgokh left a comment

LGTM

@davemgreen davemgreen merged commit a5d8b7a into llvm:main Feb 26, 2025
11 checks passed
@davemgreen davemgreen deleted the gh-a64-uremcost branch February 26, 2025 13:49
sushgokh added a commit to sushgokh/llvm-project that referenced this pull request Mar 10, 2025
This patch draws its inspiration from the udiv/urem patch llvm#122236

For sdiv, typical sequence of instructions as per the type and divisor property is as follows:

Scalar power-of-2: cmp + csel + asr
Neon power-of-2: usra + sshr

Scalar non-power-2: smulh/smull + asr/lsr + add/sub + asr + add
Vector non-power-2:
  a) <2 x i64>: 2 * (smulh + asr + add). This yields the scalarized form.
  b) <4 x i32>: smull2 + smull + uzp2 + add + sshr + usra

SVE versions should have more or less the same cost: sometimes they yield native sdiv instructions, which should cost less, and otherwise they yield the same sequence of instructions as Neon.

For srem, typical sequence of instructions as per the type and divisor property is as follows:
Scalar version: <set of sdiv instructions> + msub
Vector version: <set of sdiv instructions> + 2-msub/1-mls
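
A hedged sketch of the scalar power-of-2 case and of srem built on top of it, for i32. The names sdiv_pow2 and srem_pow2 are hypothetical, the code assumes k < 31 and an arithmetic right shift for negative values (guaranteed from C++20, and what common compilers do before that), and the comments only loosely map onto the cmp + csel + asr and msub sequences listed above.

```cpp
#include <cstdint>

// Signed x / 2^k with C-style rounding toward zero: bias negative inputs by
// 2^k - 1 before the arithmetic shift.
constexpr int32_t sdiv_pow2(int32_t x, unsigned k) {
  int32_t biased = x < 0 ? x + ((int32_t{1} << k) - 1) : x;  // compare + select
  return biased >> k;                                        // asr
}

// Signed x % 2^k as x - (x / 2^k) * 2^k: the division plus a multiply-subtract.
constexpr int32_t srem_pow2(int32_t x, unsigned k) {
  return x - sdiv_pow2(x, k) * (int32_t{1} << k);
}

static_assert(sdiv_pow2(-7, 1) == -7 / 2, "-7 / 2 == -3");
static_assert(sdiv_pow2(-1, 3) == -1 / 8, "-1 / 8 == 0");
static_assert(sdiv_pow2(7, 2) == 7 / 4, "7 / 4 == 1");
static_assert(srem_pow2(-7, 2) == -7 % 4, "-7 % 4 == -3");
```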
sushgokh added a commit that referenced this pull request Mar 10, 2025
…nt (#123552)

This patch revises the cost model for sdiv/srem and draws its inspiration from the udiv/urem patch #122236

The typical codegen for the different scenarios is described in notes/comments in the code itself (there are too many scenarios to list them all here in the patch description).