[RISCV][TTI] Fix a costing mistake for truncate/fp_round with LMUL>m1 #101051
For a narrowing operation, the work performed scales with the source LMUL not the destination LMUL. A side effect of the code sharing with FP_EXTEND was that we used the wrong LMUL when costing the inserted narrowing operations. For casts which start with a high LMUL operation, this change makes the cost significantly more expensive.
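The effect of costing by source rather than destination can be illustrated with a small standalone model. This is only a sketch, not the LLVM implementation: `lmulCost`, `narrowCostBySource`, and `narrowCostByDest` are hypothetical names, the per-instruction cost is assumed to be the number of vector registers the type occupies (fractional LMUL counted as 1), and VLEN=128 is an assumption chosen because it reproduces the fixed-vector figures in the test diff.

```cpp
#include <cassert>

// Hypothetical stand-in for a per-instruction LMUL cost: the number of
// VLEN-bit registers a fixed-length vector occupies. Fractional LMUL
// rounds up to 1 register.
unsigned lmulCost(unsigned numElts, unsigned eltBits, unsigned vlen) {
  assert(numElts != 0 && eltBits != 0 && vlen != 0);
  return (numElts * eltBits + vlen - 1) / vlen; // ceil(bits / vlen)
}

// Cost each halving step at the type it reads (the behaviour this change
// introduces). Element sizes are assumed to be powers of two.
unsigned narrowCostBySource(unsigned numElts, unsigned srcBits,
                            unsigned dstBits, unsigned vlen) {
  assert(srcBits >= dstBits);
  unsigned cost = 0;
  for (unsigned sew = srcBits; sew != dstBits; sew >>= 1)
    cost += lmulCost(numElts, sew, vlen);
  return cost;
}

// Cost each halving step at the type it writes (the previous behaviour).
unsigned narrowCostByDest(unsigned numElts, unsigned srcBits,
                          unsigned dstBits, unsigned vlen) {
  assert(srcBits >= dstBits);
  unsigned cost = 0;
  for (unsigned sew = dstBits; sew != srcBits; sew <<= 1)
    cost += lmulCost(numElts, sew, vlen);
  return cost;
}
```

Under these assumptions, `narrowCostByDest(4, 64, 8, 128)` gives 3 and `narrowCostBySource(4, 64, 8, 128)` gives 4, matching the old and new costs for `trunc <4 x i64> ... to <4 x i8>` in the test update.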
@llvm/pr-subscribers-backend-risc-v @llvm/pr-subscribers-llvm-analysis
Author: Philip Reames (preames)
Patch is 62.76 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/101051.diff
2 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index a61a9f10be86c..3738c52b11db5 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -1077,24 +1077,33 @@ InstructionCost RISCVTTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
SrcLT.second, CostKind);
}
[[fallthrough]];
- case ISD::FP_EXTEND:
case ISD::FP_ROUND: {
- // Counts of narrow/widen instructions.
+ // Counts of narrowing instructions.
unsigned SrcEltSize = Src->getScalarSizeInBits();
unsigned DstEltSize = Dst->getScalarSizeInBits();
- unsigned Op = (ISD == ISD::TRUNCATE) ? RISCV::VNSRL_WI
- : (ISD == ISD::FP_EXTEND) ? RISCV::VFWCVT_F_F_V
- : RISCV::VFNCVT_F_F_W;
+ const unsigned Op =
+ (ISD == ISD::TRUNCATE) ? RISCV::VNSRL_WI : RISCV::VFNCVT_F_F_W;
InstructionCost Cost = 0;
- for (; SrcEltSize != DstEltSize;) {
+ for (; SrcEltSize != DstEltSize; SrcEltSize = SrcEltSize >> 1) {
MVT ElementMVT = (ISD == ISD::TRUNCATE)
- ? MVT::getIntegerVT(DstEltSize)
- : MVT::getFloatingPointVT(DstEltSize);
+ ? MVT::getIntegerVT(SrcEltSize)
+ : MVT::getFloatingPointVT(SrcEltSize);
+ MVT SrcMVT = SrcLT.second.changeVectorElementType(ElementMVT);
+ Cost += getRISCVInstructionCost(Op, SrcMVT, CostKind);
+ }
+ return Cost;
+ }
+ case ISD::FP_EXTEND: {
+ // Counts of widening instructions.
+ unsigned SrcEltSize = Src->getScalarSizeInBits();
+ unsigned DstEltSize = Dst->getScalarSizeInBits();
+
+ InstructionCost Cost = 0;
+ for (; SrcEltSize != DstEltSize; DstEltSize = DstEltSize >> 1) {
+ MVT ElementMVT = MVT::getFloatingPointVT(DstEltSize);
MVT DstMVT = DstLT.second.changeVectorElementType(ElementMVT);
- DstEltSize =
- (DstEltSize > SrcEltSize) ? DstEltSize >> 1 : DstEltSize << 1;
- Cost += getRISCVInstructionCost(Op, DstMVT, CostKind);
+ Cost += getRISCVInstructionCost(RISCV::VFWCVT_F_F_V, DstMVT, CostKind);
}
return Cost;
}
diff --git a/llvm/test/Analysis/CostModel/RISCV/cast.ll b/llvm/test/Analysis/CostModel/RISCV/cast.ll
index 669e7028ff54d..b460e81ec348f 100644
--- a/llvm/test/Analysis/CostModel/RISCV/cast.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/cast.ll
@@ -1028,20 +1028,20 @@ define void @trunc() {
; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v2i64_v2i1 = trunc <2 x i64> undef to <2 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v4i16_v4i8 = trunc <4 x i16> undef to <4 x i8>
; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v4i32_v4i8 = trunc <4 x i32> undef to <4 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v4i64_v4i8 = trunc <4 x i64> undef to <4 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %v4i64_v4i8 = trunc <4 x i64> undef to <4 x i8>
; RV32-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v4i32_v4i16 = trunc <4 x i32> undef to <4 x i16>
-; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v4i64_v4i16 = trunc <4 x i64> undef to <4 x i16>
-; RV32-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v4i64_v4i32 = trunc <4 x i64> undef to <4 x i32>
+; RV32-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v4i64_v4i16 = trunc <4 x i64> undef to <4 x i16>
+; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v4i64_v4i32 = trunc <4 x i64> undef to <4 x i32>
; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v4i8_v4i1 = trunc <4 x i8> undef to <4 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v4i16_v4i1 = trunc <4 x i16> undef to <4 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v4i32_v4i1 = trunc <4 x i32> undef to <4 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %v4i64_v4i1 = trunc <4 x i64> undef to <4 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v8i16_v8i8 = trunc <8 x i16> undef to <8 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v8i32_v8i8 = trunc <8 x i32> undef to <8 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %v8i64_v8i8 = trunc <8 x i64> undef to <8 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v8i32_v8i16 = trunc <8 x i32> undef to <8 x i16>
-; RV32-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v8i64_v8i16 = trunc <8 x i64> undef to <8 x i16>
-; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v8i64_v8i32 = trunc <8 x i64> undef to <8 x i32>
+; RV32-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v8i32_v8i8 = trunc <8 x i32> undef to <8 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %v8i64_v8i8 = trunc <8 x i64> undef to <8 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v8i32_v8i16 = trunc <8 x i32> undef to <8 x i16>
+; RV32-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %v8i64_v8i16 = trunc <8 x i64> undef to <8 x i16>
+; RV32-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %v8i64_v8i32 = trunc <8 x i64> undef to <8 x i32>
; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v8i8_v8i1 = trunc <8 x i8> undef to <8 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v8i16_v8i1 = trunc <8 x i16> undef to <8 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %v8i32_v8i1 = trunc <8 x i32> undef to <8 x i1>
@@ -1056,42 +1056,42 @@ define void @trunc() {
; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v16i16_v16i1 = trunc <2 x i16> undef to <2 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v16i32_v16i1 = trunc <2 x i32> undef to <2 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v16i64_v16i1 = trunc <2 x i64> undef to <2 x i1>
-; RV32-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v32i16_v32i8 = trunc <16 x i16> undef to <16 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v32i32_v32i8 = trunc <16 x i32> undef to <16 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %v32i64_v32i8 = trunc <16 x i64> undef to <16 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v32i32_v32i16 = trunc <16 x i32> undef to <16 x i16>
-; RV32-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %v32i64_v32i16 = trunc <16 x i64> undef to <16 x i16>
-; RV32-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %v32i64_v32i32 = trunc <16 x i64> undef to <16 x i32>
+; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v32i16_v32i8 = trunc <16 x i16> undef to <16 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %v32i32_v32i8 = trunc <16 x i32> undef to <16 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v32i64_v32i8 = trunc <16 x i64> undef to <16 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %v32i32_v32i16 = trunc <16 x i32> undef to <16 x i16>
+; RV32-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %v32i64_v32i16 = trunc <16 x i64> undef to <16 x i16>
+; RV32-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %v32i64_v32i32 = trunc <16 x i64> undef to <16 x i32>
; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v32i8_v32i1 = trunc <16 x i8> undef to <16 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %v32i16_v32i1 = trunc <16 x i16> undef to <16 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %v32i32_v32i1 = trunc <16 x i32> undef to <16 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %v32i64_v32i1 = trunc <16 x i64> undef to <16 x i1>
-; RV32-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %v64i16_v64i8 = trunc <64 x i16> undef to <64 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 13 for instruction: %v64i32_v64i8 = trunc <64 x i32> undef to <64 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 31 for instruction: %v64i64_v64i8 = trunc <64 x i64> undef to <64 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %v64i32_v64i16 = trunc <64 x i32> undef to <64 x i16>
-; RV32-NEXT: Cost Model: Found an estimated cost of 27 for instruction: %v64i64_v64i16 = trunc <64 x i64> undef to <64 x i16>
-; RV32-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %v64i64_v64i32 = trunc <64 x i64> undef to <64 x i32>
+; RV32-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %v64i16_v64i8 = trunc <64 x i16> undef to <64 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 25 for instruction: %v64i32_v64i8 = trunc <64 x i32> undef to <64 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 59 for instruction: %v64i64_v64i8 = trunc <64 x i64> undef to <64 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 17 for instruction: %v64i32_v64i16 = trunc <64 x i32> undef to <64 x i16>
+; RV32-NEXT: Cost Model: Found an estimated cost of 51 for instruction: %v64i64_v64i16 = trunc <64 x i64> undef to <64 x i16>
+; RV32-NEXT: Cost Model: Found an estimated cost of 34 for instruction: %v64i64_v64i32 = trunc <64 x i64> undef to <64 x i32>
; RV32-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %v64i8_v64i1 = trunc <64 x i8> undef to <64 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %v64i16_v64i1 = trunc <64 x i16> undef to <64 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 33 for instruction: %v64i32_v64i1 = trunc <64 x i32> undef to <64 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v64i64_v64i1 = trunc <64 x i64> undef to <64 x i1>
-; RV32-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %v128i16_v128i8 = trunc <128 x i16> undef to <128 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 27 for instruction: %v128i32_v128i8 = trunc <128 x i32> undef to <128 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 63 for instruction: %v128i64_v128i8 = trunc <128 x i64> undef to <128 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %v128i32_v128i16 = trunc <128 x i32> undef to <128 x i16>
-; RV32-NEXT: Cost Model: Found an estimated cost of 54 for instruction: %v128i64_v128i16 = trunc <128 x i64> undef to <128 x i16>
-; RV32-NEXT: Cost Model: Found an estimated cost of 36 for instruction: %v128i64_v128i32 = trunc <128 x i64> undef to <128 x i32>
+; RV32-NEXT: Cost Model: Found an estimated cost of 17 for instruction: %v128i16_v128i8 = trunc <128 x i16> undef to <128 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 51 for instruction: %v128i32_v128i8 = trunc <128 x i32> undef to <128 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 119 for instruction: %v128i64_v128i8 = trunc <128 x i64> undef to <128 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 34 for instruction: %v128i32_v128i16 = trunc <128 x i32> undef to <128 x i16>
+; RV32-NEXT: Cost Model: Found an estimated cost of 102 for instruction: %v128i64_v128i16 = trunc <128 x i64> undef to <128 x i16>
+; RV32-NEXT: Cost Model: Found an estimated cost of 68 for instruction: %v128i64_v128i32 = trunc <128 x i64> undef to <128 x i32>
; RV32-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %v128i8_v128i1 = trunc <128 x i8> undef to <128 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 33 for instruction: %v128i16_v128i1 = trunc <128 x i16> undef to <128 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 67 for instruction: %v128i32_v128i1 = trunc <128 x i32> undef to <128 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v128i64_v128i1 = trunc <128 x i64> undef to <128 x i1>
-; RV32-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %v256i16_v256i8 = trunc <256 x i16> undef to <256 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 54 for instruction: %v256i32_v256i8 = trunc <256 x i32> undef to <256 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 126 for instruction: %v256i64_v256i8 = trunc <256 x i64> undef to <256 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 36 for instruction: %v256i32_v256i16 = trunc <256 x i32> undef to <256 x i16>
-; RV32-NEXT: Cost Model: Found an estimated cost of 108 for instruction: %v256i64_v256i16 = trunc <256 x i64> undef to <256 x i16>
-; RV32-NEXT: Cost Model: Found an estimated cost of 72 for instruction: %v256i64_v256i32 = trunc <256 x i64> undef to <256 x i32>
+; RV32-NEXT: Cost Model: Found an estimated cost of 34 for instruction: %v256i16_v256i8 = trunc <256 x i16> undef to <256 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 102 for instruction: %v256i32_v256i8 = trunc <256 x i32> undef to <256 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 238 for instruction: %v256i64_v256i8 = trunc <256 x i64> undef to <256 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 68 for instruction: %v256i32_v256i16 = trunc <256 x i32> undef to <256 x i16>
+; RV32-NEXT: Cost Model: Found an estimated cost of 204 for instruction: %v256i64_v256i16 = trunc <256 x i64> undef to <256 x i16>
+; RV32-NEXT: Cost Model: Found an estimated cost of 136 for instruction: %v256i64_v256i32 = trunc <256 x i64> undef to <256 x i32>
; RV32-NEXT: Cost Model: Found an estimated cost of 32 for instruction: %v256i8_v256i1 = trunc <256 x i8> undef to <256 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 66 for instruction: %v256i16_v256i1 = trunc <256 x i16> undef to <256 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 134 for instruction: %v256i32_v256i1 = trunc <256 x i32> undef to <256 x i1>
@@ -1108,60 +1108,60 @@ define void @trunc() {
; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv1i64_nxv1i1 = trunc <vscale x 1 x i64> undef to <vscale x 1 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nxv2i16_nxv2i8 = trunc <vscale x 2 x i16> undef to <vscale x 2 x i8>
; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv2i32_nxv2i8 = trunc <vscale x 2 x i32> undef to <vscale x 2 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %nxv2i64_nxv2i8 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %nxv2i64_nxv2i8 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i8>
; RV32-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nxv2i32_nxv2i16 = trunc <vscale x 2 x i32> undef to <vscale x 2 x i16>
-; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv2i64_nxv2i16 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i16>
-; RV32-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nxv2i64_nxv2i32 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i32>
+; RV32-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %nxv2i64_nxv2i16 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i16>
+; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv2i64_nxv2i32 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i32>
; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv2i8_nxv2i1 = trunc <vscale x 2 x i8> undef to <vscale x 2 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv2i16_nxv2i1 = trunc <vscale x 2 x i16> undef to <vscale x 2 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv2i32_nxv2i1 = trunc <vscale x 2 x i32> undef to <vscale x 2 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %nxv2i64_nxv2i1 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nxv4i16_nxv4i8 = trunc <vscale x 4 x i16> undef to <vscale x 4 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv4i32_nxv4i8 = trunc <vscale x 4 x i32> undef to <vscale x 4 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %nxv4i64_nxv4i8 = trunc <vscale x 4 x i64> undef to <vscale x 4 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nxv4i32_nxv4i16 = trunc <vscale x 4 x i32> undef to <vscale x 4 x i16>
-; RV32-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %nxv4i64_nxv4i16 = trunc <vscale x 4 x i64> undef to <vscale x 4 x i16>
-; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv4i64_nxv4i32 = trunc <vscale x 4 x i64> undef to <vscale x 4 x i32>
+; RV32-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %nxv4i32_nxv4i8 = trunc <vscale x 4 x i32> undef to <vscale x 4 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %nxv4i64_nxv4i8 = trunc <vscale x 4 x i64> undef to <vscale x 4 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv4i32_nxv4i16 = trunc <vscale x 4 x i32> undef to <vscale x 4 x i16>
+; RV32-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %nxv4i64_nxv4i16 = trunc <vscale x 4 x i64> undef to <vscale x 4 x i16>
+; RV32-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %nxv4i64_nxv4i32 = trunc <vscale x 4 x i64> undef to <vscale x 4 x i32>
; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv4i8_nxv4i1 = trunc <vscale x 4 x i8> undef to <vscale x 4 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv4i16_nxv4i1 = trunc <vscale x 4 x i16> undef to <vscale x 4 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %nxv4i32_nxv4i1 = trunc <vscale x 4 x i32> undef to <vscale x 4 x i1>
; RV32-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %nxv4i64_nxv4i1 = trunc <vscale x 4 x i64> undef to <vscale x 4 x i1>
-; RV32-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nxv8i16_nxv8i8 = trunc <vscale x 8 x i16> undef to <vscale x 8 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %nxv8i32_nxv8i8 = trunc <vscale x 8 x i32> undef to <vscale x 8 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %nxv8i64_nxv8i8 = trunc <vscale x 8 x i64> undef to <vscale x 8 x i8>
-; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv8i32_nxv8i16 = trunc <vscale x 8 x i32> undef to <vscale x 8 x i16>
-; RV32-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %nxv8i64_nxv8i16 = trunc <vscale x 8 x i64> undef to <vscale x 8 x i16>
-; RV32-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %nxv8i64_nxv8i32 = trunc <vscale x 8 x i64> undef to <vscale x 8 x i32>
+; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv8i16_nxv8i8 = trunc <vscale x 8 x i16> undef to <vscale x 8 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %nxv8i32_nxv8i8 = trunc <vscale x 8 x i32> undef to <vscale x 8 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %nxv8i64_nxv8i8 = trunc <vscale x 8 x i64> undef to <vscale x 8 x i8>
+; RV32-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %nxv8i32_nxv8i16 = trunc <vscale x 8 x i32> undef to <vscale x 8 x i16>
+; RV32-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %nxv8i64_nxv8i16 = trunc <vscale x 8 x i64> undef to <vscale x 8 x i16>
+; RV32-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %nxv8i64_nxv8i32 = trunc <vscale x 8 x i64> undef to <vscale x 8 x i32>
; RV32-NEXT: Cost Model: Found an estimated co...
[truncated]
; RV32-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %nxv2i64_nxv2i32 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i32>
; RV32-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %nxv2i64_nxv2i16 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i16>
; RV32-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv2i64_nxv2i32 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i32>
In this case, (LMUL2, e64) is truncated to (LMUL1, e32).
Cost Model: Found an estimated cost of 1 for instruction: %nxv2i64_nxv2i32 = trunc <vscale x 2 x i64> undef to <vscale x 2 x i32>
I believe the cost is implementation-defined. It can be executed either as a single uop reading two registers from vs2 or as two uops reading different parts from vs2. In the former case, the cost is 1, and in the latter case, it is 2.
The implementation-related decision would be encapsulated in getRISCVInstructionCost(RISCV::VNSRL_WI), so we should always pass the destination type to getRISCVInstructionCost.
In general, we model the cost of an LMUL operation as proportional to the number of registers read or written, not just written. As an example, consider that we give reductions a non-unit cost. In that particular case, we actually scale by log(VL), so it's not a perfect analogy.
As for the point about encapsulating this change in getRISCVInstructionCost, I'd be fine shifting to that approach. On reflection, passing in the destination type (which matches the instruction semantics of defining its output in terms of SEW and its input as 2 x SEW) seems reasonable.
If we do that, we probably want to model the other narrowing instructions (vnsra, vnclip, vfncvt) in an analogous manner.
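The relationship between the destination type and the source register group of a narrowing instruction can be sketched as follows. This is an illustrative model only: `VType`, `narrowingSourceOf`, and `regsTouched` are hypothetical names and not part of the RISC-V backend. It relies on the narrowing convention mentioned above (output at SEW, input at 2 x SEW, hence a source group twice as large) and represents LMUL in eighths so that fractional values stay integral.

```cpp
#include <cassert>

// LMUL expressed in eighths: mf8 = 1, mf4 = 2, mf2 = 4, m1 = 8,
// m2 = 16, m4 = 32, m8 = 64.
struct VType {
  unsigned sew;         // element width in bits
  unsigned lmulEighths; // register group size, in eighths of a register
};

// For a narrowing op whose destination is (SEW, LMUL), the source operand
// is read at (2 * SEW, 2 * LMUL).
VType narrowingSourceOf(VType dst) {
  assert(dst.sew <= 32 && dst.lmulEighths <= 32 && "source must be legal");
  return VType{dst.sew * 2, dst.lmulEighths * 2};
}

// Whole registers touched by a group; a fractional group still touches one.
unsigned regsTouched(VType t) {
  return t.lmulEighths <= 8 ? 1 : t.lmulEighths / 8;
}
```

For the case discussed in this thread, a destination of (e32, m1) corresponds to a source of (e64, m2): one register written and two read. A cost function given only the destination type could therefore still account for the wider source.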
I went and ran a couple micro benchmarks on the bp3.
SEW=e32
Running vnsrl-mf2.out
~1.994110 cycles-per-inst
~2013.251700 cycles-per-iteration
~1009.599300 insts-per-iteration
Running vnsrl-m1.out
~3.987263 cycles-per-inst
~20049.131700 cycles-per-iteration
~5028.294000 insts-per-iteration
Running vnsrl-m2.out
~7.899607 cycles-per-inst
~8022.488800 cycles-per-iteration
~1015.555400 insts-per-iteration
SEW=e16
Running vnsrl-e16-mf4.out
~1.010277 cycles-per-inst
~1020.599450 cycles-per-iteration
~1010.217100 insts-per-iteration
Running vnsrl-e16-mf2.out
~1.995437 cycles-per-inst
~2033.040150 cycles-per-iteration
~1018.844450 insts-per-iteration
Running vnsrl-e16-m1.out
~3.970088 cycles-per-inst
~4017.414500 cycles-per-iteration
~1011.920650 insts-per-iteration
A key detail is that these are throughput tests. I have five independent chains of instructions to minimize the dependency through the destination register. The initial dataset I shared (and then deleted) consisted of latency tests: I had accidentally chained all the instructions through the destination register. For this particular piece of hardware, there's a big difference between the two approaches. Using a vadd.vv as an example, it appears to be able to execute one per cycle, but takes a total of four cycles to make that result available to the next dependent instruction.
Note as well that the actual execution width is less than VLEN. You can see this very distinctly in the SEW=e16 data set.
(As an unrelated aside, trying to use an mf4 SEW=e32 w/VL=2 causes an illegal instruction fault. That's a bit surprising on a VLEN=256 machine...)
So, at least on this machine, it does appear that the cost of a vnsrl.wi scales with the source type.
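The cycles-per-inst figures above are the quotient of the other two counters, and the scaling claim is the ratio between adjacent LMUL settings. A minimal sketch of that arithmetic, with hypothetical helper names (`cyclesPerInst`, `scalingRatio`) that are not part of the benchmark harness:

```cpp
#include <cassert>
#include <cmath>

// Cycles per instruction from the two per-iteration counters.
double cyclesPerInst(double cyclesPerIter, double instsPerIter) {
  assert(instsPerIter > 0.0);
  return cyclesPerIter / instsPerIter;
}

// Ratio of CPI between a larger and a smaller LMUL setting. A value
// near 2 means the cost doubles when LMUL doubles.
double scalingRatio(double cpiLarger, double cpiSmaller) {
  assert(cpiSmaller > 0.0);
  return cpiLarger / cpiSmaller;
}
```

Feeding in the SEW=e32 numbers, `cyclesPerInst(2013.2517, 1009.5993)` is about 1.99 for mf2 and `cyclesPerInst(8022.4888, 1015.5554)` is about 7.90 for m2, and the ratios between adjacent settings (mf2 to m1, m1 to m2) both come out close to 2.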
The throughput tests show that the processing time doubles when the LMUL doubles.
I am not sure I understand it correctly, but it looks like the vnsrl on this machine is capable of outputting VLEN/4 per cycle.
This is based on the observation that CPI is ~1 for mf4, ~2 for mf2, ~4 for m1, and ~8 for m2, regardless of SEW.
So I would say the number of uops for each vnsrl is LMUL*4.
This is just an assumption, because I don't know the micro-architecture.
Based on my understanding, vnsrl on sifive-x280 is capable of handling DLEN/2 per cycle, and vnsrl on sifive-p670 DLEN per cycle.
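The LMUL*4 estimate can be written out as a small model. This is a sketch of the reasoning in this comment only, not a description of any real micro-architecture: `modeledCycles` is a hypothetical helper, LMUL is passed as a fraction, and `outputDivisor` encodes the assumed throughput of VLEN/outputDivisor bits per cycle.

```cpp
#include <cassert>

// Modeled cycles (or uops) for one narrowing instruction, assuming the
// unit produces VLEN/outputDivisor bits per cycle and the instruction
// covers LMUL = lmulNum/lmulDen registers.
unsigned modeledCycles(unsigned lmulNum, unsigned lmulDen,
                       unsigned outputDivisor) {
  assert(lmulNum != 0 && lmulDen != 0 && outputDivisor != 0);
  // ceil(LMUL * outputDivisor), kept in integer arithmetic.
  return (lmulNum * outputDivisor + lmulDen - 1) / lmulDen;
}
```

With `outputDivisor` set to 4, the model yields 1, 2, 4, and 8 for mf4, mf2, m1, and m2, which lines up with the measured CPI values quoted above.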
No longer actively working on this.