[RISCV][TTI] Fix a costing mistake for truncate/fp_round with LMUL>m1 #101051

preames · 2024-07-29T18:07:08Z

For a narrowing operation, the work performed scales with the source LMUL not the destination LMUL. A side effect of the code sharing with FP_EXTEND was that we used the wrong LMUL when costing the inserted narrowing operations. For casts which start with a high LMUL operation, this change makes the cost significantly more expensive.

llvmbot · 2024-07-29T18:07:43Z

@llvm/pr-subscribers-backend-risc-v

@llvm/pr-subscribers-llvm-analysis

Author: Philip Reames (preames)

Changes

For a narrowing operation, the work performed scales with the source LMUL not the destination LMUL. A side effect of the code sharing with FP_EXTEND was that we used the wrong LMUL when costing the inserted narrowing operations. For casts which start with a high LMUL operation, this change makes the cost significantly more expensive.

Patch is 62.76 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/101051.diff

2 Files Affected:

(modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp (+20-11)
(modified) llvm/test/Analysis/CostModel/RISCV/cast.ll (+161-161)

diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index a61a9f10be86c..3738c52b11db5 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -1077,24 +1077,33 @@ InstructionCost RISCVTTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
                                      SrcLT.second, CostKind);
     }
     [[fallthrough]];
-  case ISD::FP_EXTEND:
   case ISD::FP_ROUND: {
-    // Counts of narrow/widen instructions.
+    // Counts of narrowing instructions.
     unsigned SrcEltSize = Src->getScalarSizeInBits();
     unsigned DstEltSize = Dst->getScalarSizeInBits();
 
-    unsigned Op = (ISD == ISD::TRUNCATE)    ? RISCV::VNSRL_WI
-                  : (ISD == ISD::FP_EXTEND) ? RISCV::VFWCVT_F_F_V
-                                            : RISCV::VFNCVT_F_F_W;
+    const unsigned Op =
+        (ISD == ISD::TRUNCATE) ? RISCV::VNSRL_WI : RISCV::VFNCVT_F_F_W;
     InstructionCost Cost = 0;
-    for (; SrcEltSize != DstEltSize;) {
+    for (; SrcEltSize != DstEltSize; SrcEltSize = SrcEltSize >> 1) {
       MVT ElementMVT = (ISD == ISD::TRUNCATE)
-                           ? MVT::getIntegerVT(DstEltSize)
-                           : MVT::getFloatingPointVT(DstEltSize);
+                           ? MVT::getIntegerVT(SrcEltSize)
+                           : MVT::getFloatingPointVT(SrcEltSize);
+      MVT SrcMVT = SrcLT.second.changeVectorElementType(ElementMVT);
+      Cost += getRISCVInstructionCost(Op, SrcMVT, CostKind);
+    }
+    return Cost;
+  }
+  case ISD::FP_EXTEND: {
+    // Counts of widening instructions.
+    unsigned SrcEltSize = Src->getScalarSizeInBits();
+    unsigned DstEltSize = Dst->getScalarSizeInBits();
+
+    InstructionCost Cost = 0;
+    for (; SrcEltSize != DstEltSize; DstEltSize = DstEltSize >> 1) {
+      MVT ElementMVT = MVT::getFloatingPointVT(DstEltSize);
       MVT DstMVT = DstLT.second.changeVectorElementType(ElementMVT);
-      DstEltSize =
-          (DstEltSize > SrcEltSize) ? DstEltSize >> 1 : DstEltSize << 1;
-      Cost += getRISCVInstructionCost(Op, DstMVT, CostKind);
+      Cost += getRISCVInstructionCost(RISCV::VFWCVT_F_F_V, DstMVT, CostKind);
     }
     return Cost;
   }
diff --git a/llvm/test/Analysis/CostModel/RISCV/cast.ll b/llvm/test/Analysis/CostModel/RISCV/cast.ll
index 669e7028ff54d..b460e81ec348f 100644
--- a/llvm/test/Analysis/CostModel/RISCV/cast.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/cast.ll
@@ -1028,20 +1028,20 @@ define void @trunc() {
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v2i64_v2i1 = trunc <2 x i64> undef to <2 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %v4i16_v4i8 = trunc <4 x i16> undef to <4 x i8>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4i32_v4i8 = trunc <4 x i32> undef to <4 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v4i64_v4i8 = trunc <4 x i64> undef to <4 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %v4i64_v4i8 = trunc <4 x i64> undef to <4 x i8>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %v4i32_v4i16 = trunc <4 x i32> undef to <4 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4i64_v4i16 = trunc <4 x i64> undef to <4 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %v4i64_v4i32 = trunc <4 x i64> undef to <4 x i32>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v4i64_v4i16 = trunc <4 x i64> undef to <4 x i16>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4i64_v4i32 = trunc <4 x i64> undef to <4 x i32>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4i8_v4i1 = trunc <4 x i8> undef to <4 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4i16_v4i1 = trunc <4 x i16> undef to <4 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4i32_v4i1 = trunc <4 x i32> undef to <4 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %v4i64_v4i1 = trunc <4 x i64> undef to <4 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %v8i16_v8i8 = trunc <8 x i16> undef to <8 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v8i32_v8i8 = trunc <8 x i32> undef to <8 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %v8i64_v8i8 = trunc <8 x i64> undef to <8 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %v8i32_v8i16 = trunc <8 x i32> undef to <8 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v8i64_v8i16 = trunc <8 x i64> undef to <8 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v8i64_v8i32 = trunc <8 x i64> undef to <8 x i32>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v8i32_v8i8 = trunc <8 x i32> undef to <8 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %v8i64_v8i8 = trunc <8 x i64> undef to <8 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v8i32_v8i16 = trunc <8 x i32> undef to <8 x i16>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %v8i64_v8i16 = trunc <8 x i64> undef to <8 x i16>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %v8i64_v8i32 = trunc <8 x i64> undef to <8 x i32>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v8i8_v8i1 = trunc <8 x i8> undef to <8 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v8i16_v8i1 = trunc <8 x i16> undef to <8 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %v8i32_v8i1 = trunc <8 x i32> undef to <8 x i1>
@@ -1056,42 +1056,42 @@ define void @trunc() {
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v16i16_v16i1 = trunc <2 x i16> undef to <2 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v16i32_v16i1 = trunc <2 x i32> undef to <2 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v16i64_v16i1 = trunc <2 x i64> undef to <2 x i1>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %v32i16_v32i8 = trunc <16 x i16> undef to <16 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v32i32_v32i8 = trunc <16 x i32> undef to <16 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %v32i64_v32i8 = trunc <16 x i64> undef to <16 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v32i32_v32i16 = trunc <16 x i32> undef to <16 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %v32i64_v32i16 = trunc <16 x i64> undef to <16 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %v32i64_v32i32 = trunc <16 x i64> undef to <16 x i32>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v32i16_v32i8 = trunc <16 x i16> undef to <16 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %v32i32_v32i8 = trunc <16 x i32> undef to <16 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %v32i64_v32i8 = trunc <16 x i64> undef to <16 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %v32i32_v32i16 = trunc <16 x i32> undef to <16 x i16>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %v32i64_v32i16 = trunc <16 x i64> undef to <16 x i16>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v32i64_v32i32 = trunc <16 x i64> undef to <16 x i32>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v32i8_v32i1 = trunc <16 x i8> undef to <16 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %v32i16_v32i1 = trunc <16 x i16> undef to <16 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v32i32_v32i1 = trunc <16 x i32> undef to <16 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %v32i64_v32i1 = trunc <16 x i64> undef to <16 x i1>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %v64i16_v64i8 = trunc <64 x i16> undef to <64 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 13 for instruction: %v64i32_v64i8 = trunc <64 x i32> undef to <64 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 31 for instruction: %v64i64_v64i8 = trunc <64 x i64> undef to <64 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v64i32_v64i16 = trunc <64 x i32> undef to <64 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 27 for instruction: %v64i64_v64i16 = trunc <64 x i64> undef to <64 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 18 for instruction: %v64i64_v64i32 = trunc <64 x i64> undef to <64 x i32>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v64i16_v64i8 = trunc <64 x i16> undef to <64 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 25 for instruction: %v64i32_v64i8 = trunc <64 x i32> undef to <64 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 59 for instruction: %v64i64_v64i8 = trunc <64 x i64> undef to <64 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v64i32_v64i16 = trunc <64 x i32> undef to <64 x i16>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 51 for instruction: %v64i64_v64i16 = trunc <64 x i64> undef to <64 x i16>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: %v64i64_v64i32 = trunc <64 x i64> undef to <64 x i32>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v64i8_v64i1 = trunc <64 x i8> undef to <64 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %v64i16_v64i1 = trunc <64 x i16> undef to <64 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v64i32_v64i1 = trunc <64 x i32> undef to <64 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %v64i64_v64i1 = trunc <64 x i64> undef to <64 x i1>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v128i16_v128i8 = trunc <128 x i16> undef to <128 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 27 for instruction: %v128i32_v128i8 = trunc <128 x i32> undef to <128 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 63 for instruction: %v128i64_v128i8 = trunc <128 x i64> undef to <128 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 18 for instruction: %v128i32_v128i16 = trunc <128 x i32> undef to <128 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 54 for instruction: %v128i64_v128i16 = trunc <128 x i64> undef to <128 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 36 for instruction: %v128i64_v128i32 = trunc <128 x i64> undef to <128 x i32>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %v128i16_v128i8 = trunc <128 x i16> undef to <128 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 51 for instruction: %v128i32_v128i8 = trunc <128 x i32> undef to <128 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 119 for instruction: %v128i64_v128i8 = trunc <128 x i64> undef to <128 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: %v128i32_v128i16 = trunc <128 x i32> undef to <128 x i16>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 102 for instruction: %v128i64_v128i16 = trunc <128 x i64> undef to <128 x i16>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 68 for instruction: %v128i64_v128i32 = trunc <128 x i64> undef to <128 x i32>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %v128i8_v128i1 = trunc <128 x i8> undef to <128 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %v128i16_v128i1 = trunc <128 x i16> undef to <128 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 67 for instruction: %v128i32_v128i1 = trunc <128 x i32> undef to <128 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v128i64_v128i1 = trunc <128 x i64> undef to <128 x i1>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 18 for instruction: %v256i16_v256i8 = trunc <256 x i16> undef to <256 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 54 for instruction: %v256i32_v256i8 = trunc <256 x i32> undef to <256 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 126 for instruction: %v256i64_v256i8 = trunc <256 x i64> undef to <256 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 36 for instruction: %v256i32_v256i16 = trunc <256 x i32> undef to <256 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 108 for instruction: %v256i64_v256i16 = trunc <256 x i64> undef to <256 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 72 for instruction: %v256i64_v256i32 = trunc <256 x i64> undef to <256 x i32>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: %v256i16_v256i8 = trunc <256 x i16> undef to <256 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 102 for instruction: %v256i32_v256i8 = trunc <256 x i32> undef to <256 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 238 for instruction: %v256i64_v256i8 = trunc <256 x i64> undef to <256 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 68 for instruction: %v256i32_v256i16 = trunc <256 x i32> undef to <256 x i16>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 204 for instruction: %v256i64_v256i16 = trunc <256 x i64> undef to <256 x i16>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 136 for instruction: %v256i64_v256i32 = trunc <256 x i64> undef to <256 x i32>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %v256i8_v256i1 = trunc <256 x i8> undef to <256 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 66 for instruction: %v256i16_v256i1 = trunc <256 x i16> undef to <256 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 134 for instruction: %v256i32_v256i1 = trunc <256 x i32> undef to <256 x i1>
@@ -1108,60 +1108,60 @@ define void @trunc() {
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %nxv1i64_nxv1i1 = trunc <vscale x 1 x i64> undef to <vscale x 1 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %nxv2i16_nxv2i8 = trunc <vscale x 2 x i16> undef to <vscale x 2 x i8>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %nxv2i32_nxv2i8 = trunc <vscale x 2 x i32> undef to <vscale x 2 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %nxv2i64_nxv2i8 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %nxv2i64_nxv2i8 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i8>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %nxv2i32_nxv2i16 = trunc <vscale x 2 x i32> undef to <vscale x 2 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %nxv2i64_nxv2i16 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %nxv2i64_nxv2i32 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i32>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %nxv2i64_nxv2i16 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i16>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %nxv2i64_nxv2i32 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i32>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %nxv2i8_nxv2i1 = trunc <vscale x 2 x i8> undef to <vscale x 2 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %nxv2i16_nxv2i1 = trunc <vscale x 2 x i16> undef to <vscale x 2 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %nxv2i32_nxv2i1 = trunc <vscale x 2 x i32> undef to <vscale x 2 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %nxv2i64_nxv2i1 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %nxv4i16_nxv4i8 = trunc <vscale x 4 x i16> undef to <vscale x 4 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %nxv4i32_nxv4i8 = trunc <vscale x 4 x i32> undef to <vscale x 4 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %nxv4i64_nxv4i8 = trunc <vscale x 4 x i64> undef to <vscale x 4 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %nxv4i32_nxv4i16 = trunc <vscale x 4 x i32> undef to <vscale x 4 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %nxv4i64_nxv4i16 = trunc <vscale x 4 x i64> undef to <vscale x 4 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %nxv4i64_nxv4i32 = trunc <vscale x 4 x i64> undef to <vscale x 4 x i32>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %nxv4i32_nxv4i8 = trunc <vscale x 4 x i32> undef to <vscale x 4 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %nxv4i64_nxv4i8 = trunc <vscale x 4 x i64> undef to <vscale x 4 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %nxv4i32_nxv4i16 = trunc <vscale x 4 x i32> undef to <vscale x 4 x i16>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %nxv4i64_nxv4i16 = trunc <vscale x 4 x i64> undef to <vscale x 4 x i16>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %nxv4i64_nxv4i32 = trunc <vscale x 4 x i64> undef to <vscale x 4 x i32>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %nxv4i8_nxv4i1 = trunc <vscale x 4 x i8> undef to <vscale x 4 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %nxv4i16_nxv4i1 = trunc <vscale x 4 x i16> undef to <vscale x 4 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %nxv4i32_nxv4i1 = trunc <vscale x 4 x i32> undef to <vscale x 4 x i1>
 ; RV32-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %nxv4i64_nxv4i1 = trunc <vscale x 4 x i64> undef to <vscale x 4 x i1>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %nxv8i16_nxv8i8 = trunc <vscale x 8 x i16> undef to <vscale x 8 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %nxv8i32_nxv8i8 = trunc <vscale x 8 x i32> undef to <vscale x 8 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %nxv8i64_nxv8i8 = trunc <vscale x 8 x i64> undef to <vscale x 8 x i8>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %nxv8i32_nxv8i16 = trunc <vscale x 8 x i32> undef to <vscale x 8 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %nxv8i64_nxv8i16 = trunc <vscale x 8 x i64> undef to <vscale x 8 x i16>
-; RV32-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %nxv8i64_nxv8i32 = trunc <vscale x 8 x i64> undef to <vscale x 8 x i32>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %nxv8i16_nxv8i8 = trunc <vscale x 8 x i16> undef to <vscale x 8 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %nxv8i32_nxv8i8 = trunc <vscale x 8 x i32> undef to <vscale x 8 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %nxv8i64_nxv8i8 = trunc <vscale x 8 x i64> undef to <vscale x 8 x i8>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %nxv8i32_nxv8i16 = trunc <vscale x 8 x i32> undef to <vscale x 8 x i16>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %nxv8i64_nxv8i16 = trunc <vscale x 8 x i64> undef to <vscale x 8 x i16>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %nxv8i64_nxv8i32 = trunc <vscale x 8 x i64> undef to <vscale x 8 x i32>
 ; RV32-NEXT:  Cost Model: Found an estimated co...
[truncated]

arcbbb · 2024-07-30T02:04:47Z

llvm/test/Analysis/CostModel/RISCV/cast.ll

-; RV32-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %nxv2i64_nxv2i32 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i32>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %nxv2i64_nxv2i16 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i16>
+; RV32-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %nxv2i64_nxv2i32 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i32>


In this case, (LMUL2, e64) is truncated to (LMUL1, e32).
Cost Model: Found an estimated cost of 1 for instruction: %nxv2i64_nxv2i32 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i32>
I believe the cost is implementation-defined. It can be executed either as a single uop reading two registers from vs2 or as two uops reading different parts from vs2. In the former case, the cost is 1, and in the latter case, it is 2.
The implementation-related decision would be encapsulated in getRISCVInstructionCost(RISCV::VNSRL_WI), so we should always pass the destination type to getRISCVInstructionCost.

In general, we model the cost of an LMUL operation proportional to the number of registers read or written - not just written. As an example, consider that we give reductions a non-unit cost. In that particular case, we actually scale by log(VL) so it's not a perfect analogy.

As for the point about encapsulating this change in getRISCVInstructionCost, I'd be fine shifting to that approach. On reflection, passing in the destination type (which matches the instruction semantic of defining it's output in terms of SEW and input as 2 x SEW) seems reasonable.

If we do that, we probably want to model the other narrowing instructions (vnsra, vnclip, vfncvt), in an analogous manner.

I went and ran a couple micro benchmarks on the bp3.

SEW=e32 Running vnsrl-mf2.out ~1.994110 cycles-per-inst ~2013.251700 cycles-per-iteration ~1009.599300 insts-per-iteration Running vnsrl-m1.out ~3.987263 cycles-per-inst ~20049.131700 cycles-per-iteration ~5028.294000 insts-per-iteration Running vnsrl-m2.out ~7.899607 cycles-per-inst ~8022.488800 cycles-per-iteration ~1015.555400 insts-per-iteration SEW=e16 Running vnsrl-e16-mf4.out ~1.010277 cycles-per-inst ~1020.599450 cycles-per-iteration ~1010.217100 insts-per-iteration Running vnsrl-e16-mf2.out ~1.995437 cycles-per-inst ~2033.040150 cycles-per-iteration ~1018.844450 insts-per-iteration Running vnsrl-e16-m1.out ~3.970088 cycles-per-inst ~4017.414500 cycles-per-iteration ~1011.920650 insts-per-iteration

A key detail is that these are throughput tests. I have five independent chains of instructions to minimize the dependency through the destination register. The initial dataset I shared (and then deleted) were latency tests. I had accidentally chained all the instructions through the destination register. For this particular piece of hardware, there's a big difference between the two approaches. Using a vadd.vv as an example, it appears to be able to execute one a cycle, but takes a total of four cycles to make that result available for the next dependent instruction.

Note as well that the actual execution width is less than VLEN. You can see this very distinctly in the SEW=e16 data set.

(As an unrelated aside, trying to use an mf4 SEW=e32 w/VL=2 causes an illegal instruction fault. That's a bit surprising on VLEN=256 machine...)

So, at least on this machine, it does appear that the cost of a vnsrl.wi scales with source type.

The throughput tests show that the processing time doubles when the LMUL doubles.
I am not sure if I understand it correctly, but it looks like the vnsrl on this machine is capable to output VLEN/4 per cycle.
Based on the observation that CPI is ~1 for mf4, ~2 for mf2, ~4 for m1, and ~8 for m2, regardless of sew.
And I would say the number of uops for each vnsrl is LMUL*4.
It is just assumption because I don't know the micro-architecture.
Based on my understanding, vnsrl on sifive-x280 is capable to handle DLEN/2 per cycle, and vnsrl on sifive-p670 is DLEN per cycle.

preames · 2024-11-07T03:43:02Z

No longer actively working on this.

preames requested review from arcbbb, topperc, luke957 and pcwang-thead July 29, 2024 18:07

llvmbot added backend:RISC-V llvm:analysis labels Jul 29, 2024

preames mentioned this pull request Jul 29, 2024

[RISCV][TTI] Cost non-power-of-two size changing casts #101047

Merged

arcbbb reviewed Jul 30, 2024

View reviewed changes

preames closed this Nov 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[RISCV][TTI] Fix a costing mistake for truncate/fp_round with LMUL>m1 #101051

[RISCV][TTI] Fix a costing mistake for truncate/fp_round with LMUL>m1 #101051

Uh oh!

preames commented Jul 29, 2024

Uh oh!

llvmbot commented Jul 29, 2024 •

edited

Loading

Uh oh!

arcbbb Jul 30, 2024

Uh oh!

preames Jul 30, 2024

Uh oh!

preames Jul 30, 2024 •

edited

Loading

Uh oh!

arcbbb Jul 31, 2024

Uh oh!

preames commented Nov 7, 2024

Uh oh!

Uh oh!

[RISCV][TTI] Fix a costing mistake for truncate/fp_round with LMUL>m1 #101051

[RISCV][TTI] Fix a costing mistake for truncate/fp_round with LMUL>m1 #101051

Uh oh!

Conversation

preames commented Jul 29, 2024

Uh oh!

llvmbot commented Jul 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

arcbbb Jul 30, 2024

Choose a reason for hiding this comment

Uh oh!

preames Jul 30, 2024

Choose a reason for hiding this comment

Uh oh!

preames Jul 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

arcbbb Jul 31, 2024

Choose a reason for hiding this comment

Uh oh!

preames commented Nov 7, 2024

Uh oh!

Uh oh!

llvmbot commented Jul 29, 2024 •

edited

Loading

preames Jul 30, 2024 •

edited

Loading