[CostModel][X86] Fix fpext conversion cost for 16 elements #76278
The fpext conversion cost for 16 elements should be 5.
@llvm/pr-subscribers-backend-x86 @llvm/pr-subscribers-llvm-analysis
Author: None (HaohaiWen)
Changes
The fpext conversion cost for 16 elements should be 5.
Full diff: https://github.com/llvm/llvm-project/pull/76278.diff
2 Files Affected:
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index 8a04987e768a12..e7b7c9666ed43c 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -2223,6 +2223,7 @@ InstructionCost X86TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
static const TypeConversionCostTblEntry AVX512FConversionTbl[] = {
{ ISD::FP_EXTEND, MVT::v8f64, MVT::v8f32, 1 },
{ ISD::FP_EXTEND, MVT::v8f64, MVT::v16f32, 3 },
+ { ISD::FP_EXTEND, MVT::v16f64, MVT::v16f32, 5 }, // 2*vcvtps2pd+vextractf64x4
{ ISD::FP_ROUND, MVT::v8f32, MVT::v8f64, 1 },
{ ISD::TRUNCATE, MVT::v2i1, MVT::v2i8, 3 }, // sext+vpslld+vptestmd
diff --git a/llvm/test/Analysis/CostModel/X86/cast.ll b/llvm/test/Analysis/CostModel/X86/cast.ll
index 5a83d4e81fd38e..790f408b00609d 100644
--- a/llvm/test/Analysis/CostModel/X86/cast.ll
+++ b/llvm/test/Analysis/CostModel/X86/cast.ll
@@ -616,27 +616,31 @@ define void @fp_conv(<8 x float> %a, <16 x float>%b, <4 x float> %c) {
; SSE-LABEL: 'fp_conv'
; SSE-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %A1 = fpext <4 x float> %c to <4 x double>
; SSE-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %A2 = fpext <8 x float> %a to <8 x double>
-; SSE-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %A3 = fptrunc <4 x double> undef to <4 x float>
-; SSE-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %A4 = fptrunc <8 x double> undef to <8 x float>
+; SSE-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %A3 = fpext <16 x float> %b to <16 x double>
+; SSE-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %A4 = fptrunc <4 x double> undef to <4 x float>
+; SSE-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %A5 = fptrunc <8 x double> undef to <8 x float>
; SSE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
;
; AVX-LABEL: 'fp_conv'
; AVX-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %A1 = fpext <4 x float> %c to <4 x double>
; AVX-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %A2 = fpext <8 x float> %a to <8 x double>
-; AVX-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %A3 = fptrunc <4 x double> undef to <4 x float>
-; AVX-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %A4 = fptrunc <8 x double> undef to <8 x float>
+; AVX-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %A3 = fpext <16 x float> %b to <16 x double>
+; AVX-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %A4 = fptrunc <4 x double> undef to <4 x float>
+; AVX-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %A5 = fptrunc <8 x double> undef to <8 x float>
; AVX-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
;
; AVX512-LABEL: 'fp_conv'
; AVX512-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %A1 = fpext <4 x float> %c to <4 x double>
; AVX512-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %A2 = fpext <8 x float> %a to <8 x double>
-; AVX512-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %A3 = fptrunc <4 x double> undef to <4 x float>
-; AVX512-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %A4 = fptrunc <8 x double> undef to <8 x float>
+; AVX512-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %A3 = fpext <16 x float> %b to <16 x double>
+; AVX512-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %A4 = fptrunc <4 x double> undef to <4 x float>
+; AVX512-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %A5 = fptrunc <8 x double> undef to <8 x float>
; AVX512-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
;
%A1 = fpext <4 x float> %c to <4 x double>
%A2 = fpext <8 x float> %a to <8 x double>
- %A3 = fptrunc <4 x double> undef to <4 x float>
- %A4 = fptrunc <8 x double> undef to <8 x float>
+ %A3 = fpext <16 x float> %b to <16 x double>
+ %A4 = fptrunc <4 x double> undef to <4 x float>
+ %A5 = fptrunc <8 x double> undef to <8 x float>
ret void
}
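To make the effect of the new table row concrete, here is a minimal, self-contained sketch of a conversion-cost lookup keyed on (opcode, destination type, source type). It is not LLVM's implementation: `Op`, `VT`, `ConvCostEntry`, and `lookupConversionCost` are hypothetical stand-ins for the ISD opcodes, MVT types, and table helper used in X86TargetTransformInfo.cpp, and the table holds only the four entries visible in the first hunk, with the cost of 5 as originally proposed.

```cpp
#include <array>
#include <optional>

// Hypothetical stand-ins for LLVM's ISD opcodes and MVT value types.
enum class Op { FpExtend, FpRound };
enum class VT { v8f32, v16f32, v8f64, v16f64 };

// One row of the table: converting Src to Dst with Opcode costs Cost.
struct ConvCostEntry {
  Op Opcode;
  VT Dst;
  VT Src;
  unsigned Cost;
};

// Mirrors the four AVX512F entries shown in the hunk above.
constexpr std::array<ConvCostEntry, 4> AVX512FTable{{
    {Op::FpExtend, VT::v8f64, VT::v8f32, 1},
    {Op::FpExtend, VT::v8f64, VT::v16f32, 3},
    {Op::FpExtend, VT::v16f64, VT::v16f32, 5}, // row added by this PR
    {Op::FpRound, VT::v8f32, VT::v8f64, 1},
}};

// Linear search: the first row matching (opcode, dst, src) wins.
// Returns std::nullopt when the table has no matching row, in which case a
// real cost model would fall through to another table or a generic estimate.
std::optional<unsigned> lookupConversionCost(Op Opcode, VT Dst, VT Src) {
  for (const ConvCostEntry &E : AVX512FTable)
    if (E.Opcode == Opcode && E.Dst == Dst && E.Src == Src)
      return E.Cost;
  return std::nullopt;
}
```

With this sketch, `lookupConversionCost(Op::FpExtend, VT::v16f64, VT::v16f32)` yields 5, the value the AVX512 check line for `%A3` expects, while a combination with no row (for example `Op::FpRound` from `v16f64` to `v16f32`) yields an empty optional.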
|
This PR should be landed after #76277 |
You can test this locally with the following command:
git-clang-format --diff e147dcbcbc8f92b7f4973eaebe800308f480dd84 a925d7f32a0615ad1d07c43b6d2f314adf480ab1 -- llvm/lib/Target/X86/X86TargetTransformInfo.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index cd40b1d3b0..c8229b59fa 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -2230,140 +2230,141 @@ InstructionCost X86TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
// 256-bit wide vectors.
static const TypeConversionCostTblEntry AVX512FConversionTbl[] = {
- { ISD::FP_EXTEND, MVT::v8f64, MVT::v8f32, 1 },
- { ISD::FP_EXTEND, MVT::v8f64, MVT::v16f32, 3 },
- { ISD::FP_EXTEND, MVT::v16f64, MVT::v16f32, 4 }, // 2*vcvtps2pd+vextractf64x4
- { ISD::FP_ROUND, MVT::v8f32, MVT::v8f64, 1 },
-
- { ISD::TRUNCATE, MVT::v2i1, MVT::v2i8, 3 }, // sext+vpslld+vptestmd
- { ISD::TRUNCATE, MVT::v4i1, MVT::v4i8, 3 }, // sext+vpslld+vptestmd
- { ISD::TRUNCATE, MVT::v8i1, MVT::v8i8, 3 }, // sext+vpslld+vptestmd
- { ISD::TRUNCATE, MVT::v16i1, MVT::v16i8, 3 }, // sext+vpslld+vptestmd
- { ISD::TRUNCATE, MVT::v2i1, MVT::v2i16, 3 }, // sext+vpsllq+vptestmq
- { ISD::TRUNCATE, MVT::v4i1, MVT::v4i16, 3 }, // sext+vpsllq+vptestmq
- { ISD::TRUNCATE, MVT::v8i1, MVT::v8i16, 3 }, // sext+vpsllq+vptestmq
- { ISD::TRUNCATE, MVT::v16i1, MVT::v16i16, 3 }, // sext+vpslld+vptestmd
- { ISD::TRUNCATE, MVT::v2i1, MVT::v2i32, 2 }, // zmm vpslld+vptestmd
- { ISD::TRUNCATE, MVT::v4i1, MVT::v4i32, 2 }, // zmm vpslld+vptestmd
- { ISD::TRUNCATE, MVT::v8i1, MVT::v8i32, 2 }, // zmm vpslld+vptestmd
- { ISD::TRUNCATE, MVT::v16i1, MVT::v16i32, 2 }, // vpslld+vptestmd
- { ISD::TRUNCATE, MVT::v2i1, MVT::v2i64, 2 }, // zmm vpsllq+vptestmq
- { ISD::TRUNCATE, MVT::v4i1, MVT::v4i64, 2 }, // zmm vpsllq+vptestmq
- { ISD::TRUNCATE, MVT::v8i1, MVT::v8i64, 2 }, // vpsllq+vptestmq
- { ISD::TRUNCATE, MVT::v2i8, MVT::v2i32, 2 }, // vpmovdb
- { ISD::TRUNCATE, MVT::v4i8, MVT::v4i32, 2 }, // vpmovdb
- { ISD::TRUNCATE, MVT::v16i8, MVT::v16i32, 2 }, // vpmovdb
- { ISD::TRUNCATE, MVT::v32i8, MVT::v16i32, 2 }, // vpmovdb
- { ISD::TRUNCATE, MVT::v64i8, MVT::v16i32, 2 }, // vpmovdb
- { ISD::TRUNCATE, MVT::v16i16, MVT::v16i32, 2 }, // vpmovdw
- { ISD::TRUNCATE, MVT::v32i16, MVT::v16i32, 2 }, // vpmovdw
- { ISD::TRUNCATE, MVT::v2i8, MVT::v2i64, 2 }, // vpmovqb
- { ISD::TRUNCATE, MVT::v2i16, MVT::v2i64, 1 }, // vpshufb
- { ISD::TRUNCATE, MVT::v8i8, MVT::v8i64, 2 }, // vpmovqb
- { ISD::TRUNCATE, MVT::v16i8, MVT::v8i64, 2 }, // vpmovqb
- { ISD::TRUNCATE, MVT::v32i8, MVT::v8i64, 2 }, // vpmovqb
- { ISD::TRUNCATE, MVT::v64i8, MVT::v8i64, 2 }, // vpmovqb
- { ISD::TRUNCATE, MVT::v8i16, MVT::v8i64, 2 }, // vpmovqw
- { ISD::TRUNCATE, MVT::v16i16, MVT::v8i64, 2 }, // vpmovqw
- { ISD::TRUNCATE, MVT::v32i16, MVT::v8i64, 2 }, // vpmovqw
- { ISD::TRUNCATE, MVT::v8i32, MVT::v8i64, 1 }, // vpmovqd
- { ISD::TRUNCATE, MVT::v4i32, MVT::v4i64, 1 }, // zmm vpmovqd
- { ISD::TRUNCATE, MVT::v16i8, MVT::v16i64, 5 },// 2*vpmovqd+concat+vpmovdb
-
- { ISD::TRUNCATE, MVT::v16i8, MVT::v16i16, 3 }, // extend to v16i32
- { ISD::TRUNCATE, MVT::v32i8, MVT::v32i16, 8 },
- { ISD::TRUNCATE, MVT::v64i8, MVT::v32i16, 8 },
-
- // Sign extend is zmm vpternlogd+vptruncdb.
- // Zero extend is zmm broadcast load+vptruncdw.
- { ISD::SIGN_EXTEND, MVT::v2i8, MVT::v2i1, 3 },
- { ISD::ZERO_EXTEND, MVT::v2i8, MVT::v2i1, 4 },
- { ISD::SIGN_EXTEND, MVT::v4i8, MVT::v4i1, 3 },
- { ISD::ZERO_EXTEND, MVT::v4i8, MVT::v4i1, 4 },
- { ISD::SIGN_EXTEND, MVT::v8i8, MVT::v8i1, 3 },
- { ISD::ZERO_EXTEND, MVT::v8i8, MVT::v8i1, 4 },
- { ISD::SIGN_EXTEND, MVT::v16i8, MVT::v16i1, 3 },
- { ISD::ZERO_EXTEND, MVT::v16i8, MVT::v16i1, 4 },
-
- // Sign extend is zmm vpternlogd+vptruncdw.
- // Zero extend is zmm vpternlogd+vptruncdw+vpsrlw.
- { ISD::SIGN_EXTEND, MVT::v2i16, MVT::v2i1, 3 },
- { ISD::ZERO_EXTEND, MVT::v2i16, MVT::v2i1, 4 },
- { ISD::SIGN_EXTEND, MVT::v4i16, MVT::v4i1, 3 },
- { ISD::ZERO_EXTEND, MVT::v4i16, MVT::v4i1, 4 },
- { ISD::SIGN_EXTEND, MVT::v8i16, MVT::v8i1, 3 },
- { ISD::ZERO_EXTEND, MVT::v8i16, MVT::v8i1, 4 },
- { ISD::SIGN_EXTEND, MVT::v16i16, MVT::v16i1, 3 },
- { ISD::ZERO_EXTEND, MVT::v16i16, MVT::v16i1, 4 },
-
- { ISD::SIGN_EXTEND, MVT::v2i32, MVT::v2i1, 1 }, // zmm vpternlogd
- { ISD::ZERO_EXTEND, MVT::v2i32, MVT::v2i1, 2 }, // zmm vpternlogd+psrld
- { ISD::SIGN_EXTEND, MVT::v4i32, MVT::v4i1, 1 }, // zmm vpternlogd
- { ISD::ZERO_EXTEND, MVT::v4i32, MVT::v4i1, 2 }, // zmm vpternlogd+psrld
- { ISD::SIGN_EXTEND, MVT::v8i32, MVT::v8i1, 1 }, // zmm vpternlogd
- { ISD::ZERO_EXTEND, MVT::v8i32, MVT::v8i1, 2 }, // zmm vpternlogd+psrld
- { ISD::SIGN_EXTEND, MVT::v2i64, MVT::v2i1, 1 }, // zmm vpternlogq
- { ISD::ZERO_EXTEND, MVT::v2i64, MVT::v2i1, 2 }, // zmm vpternlogq+psrlq
- { ISD::SIGN_EXTEND, MVT::v4i64, MVT::v4i1, 1 }, // zmm vpternlogq
- { ISD::ZERO_EXTEND, MVT::v4i64, MVT::v4i1, 2 }, // zmm vpternlogq+psrlq
-
- { ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i1, 1 }, // vpternlogd
- { ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i1, 2 }, // vpternlogd+psrld
- { ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i1, 1 }, // vpternlogq
- { ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i1, 2 }, // vpternlogq+psrlq
-
- { ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i8, 1 },
- { ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i8, 1 },
- { ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i16, 1 },
- { ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i16, 1 },
- { ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i8, 1 },
- { ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i8, 1 },
- { ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i16, 1 },
- { ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i16, 1 },
- { ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i32, 1 },
- { ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i32, 1 },
-
- { ISD::SIGN_EXTEND, MVT::v32i16, MVT::v32i8, 3 }, // FIXME: May not be right
- { ISD::ZERO_EXTEND, MVT::v32i16, MVT::v32i8, 3 }, // FIXME: May not be right
-
- { ISD::SINT_TO_FP, MVT::v8f64, MVT::v8i1, 4 },
- { ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i1, 3 },
- { ISD::SINT_TO_FP, MVT::v8f64, MVT::v16i8, 2 },
- { ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i8, 1 },
- { ISD::SINT_TO_FP, MVT::v8f64, MVT::v8i16, 2 },
- { ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i16, 1 },
- { ISD::SINT_TO_FP, MVT::v8f64, MVT::v8i32, 1 },
- { ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i32, 1 },
-
- { ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i1, 4 },
- { ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i1, 3 },
- { ISD::UINT_TO_FP, MVT::v8f64, MVT::v16i8, 2 },
- { ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i8, 1 },
- { ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i16, 2 },
- { ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i16, 1 },
- { ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i32, 1 },
- { ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i32, 1 },
- { ISD::UINT_TO_FP, MVT::v8f32, MVT::v8i64, 26 },
- { ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i64, 5 },
-
- { ISD::FP_TO_SINT, MVT::v16i8, MVT::v16f32, 2 },
- { ISD::FP_TO_SINT, MVT::v16i8, MVT::v16f64, 7 },
- { ISD::FP_TO_SINT, MVT::v32i8, MVT::v32f64,15 },
- { ISD::FP_TO_SINT, MVT::v64i8, MVT::v64f32,11 },
- { ISD::FP_TO_SINT, MVT::v64i8, MVT::v64f64,31 },
- { ISD::FP_TO_SINT, MVT::v8i16, MVT::v8f64, 3 },
- { ISD::FP_TO_SINT, MVT::v16i16, MVT::v16f64, 7 },
- { ISD::FP_TO_SINT, MVT::v32i16, MVT::v32f32, 5 },
- { ISD::FP_TO_SINT, MVT::v32i16, MVT::v32f64,15 },
- { ISD::FP_TO_SINT, MVT::v8i32, MVT::v8f64, 1 },
- { ISD::FP_TO_SINT, MVT::v16i32, MVT::v16f64, 3 },
-
- { ISD::FP_TO_UINT, MVT::v8i32, MVT::v8f64, 1 },
- { ISD::FP_TO_UINT, MVT::v8i16, MVT::v8f64, 3 },
- { ISD::FP_TO_UINT, MVT::v8i8, MVT::v8f64, 3 },
- { ISD::FP_TO_UINT, MVT::v16i32, MVT::v16f32, 1 },
- { ISD::FP_TO_UINT, MVT::v16i16, MVT::v16f32, 3 },
- { ISD::FP_TO_UINT, MVT::v16i8, MVT::v16f32, 3 },
+ {ISD::FP_EXTEND, MVT::v8f64, MVT::v8f32, 1},
+ {ISD::FP_EXTEND, MVT::v8f64, MVT::v16f32, 3},
+ {ISD::FP_EXTEND, MVT::v16f64, MVT::v16f32,
+ 4}, // 2*vcvtps2pd+vextractf64x4
+ {ISD::FP_ROUND, MVT::v8f32, MVT::v8f64, 1},
+
+ {ISD::TRUNCATE, MVT::v2i1, MVT::v2i8, 3}, // sext+vpslld+vptestmd
+ {ISD::TRUNCATE, MVT::v4i1, MVT::v4i8, 3}, // sext+vpslld+vptestmd
+ {ISD::TRUNCATE, MVT::v8i1, MVT::v8i8, 3}, // sext+vpslld+vptestmd
+ {ISD::TRUNCATE, MVT::v16i1, MVT::v16i8, 3}, // sext+vpslld+vptestmd
+ {ISD::TRUNCATE, MVT::v2i1, MVT::v2i16, 3}, // sext+vpsllq+vptestmq
+ {ISD::TRUNCATE, MVT::v4i1, MVT::v4i16, 3}, // sext+vpsllq+vptestmq
+ {ISD::TRUNCATE, MVT::v8i1, MVT::v8i16, 3}, // sext+vpsllq+vptestmq
+ {ISD::TRUNCATE, MVT::v16i1, MVT::v16i16, 3}, // sext+vpslld+vptestmd
+ {ISD::TRUNCATE, MVT::v2i1, MVT::v2i32, 2}, // zmm vpslld+vptestmd
+ {ISD::TRUNCATE, MVT::v4i1, MVT::v4i32, 2}, // zmm vpslld+vptestmd
+ {ISD::TRUNCATE, MVT::v8i1, MVT::v8i32, 2}, // zmm vpslld+vptestmd
+ {ISD::TRUNCATE, MVT::v16i1, MVT::v16i32, 2}, // vpslld+vptestmd
+ {ISD::TRUNCATE, MVT::v2i1, MVT::v2i64, 2}, // zmm vpsllq+vptestmq
+ {ISD::TRUNCATE, MVT::v4i1, MVT::v4i64, 2}, // zmm vpsllq+vptestmq
+ {ISD::TRUNCATE, MVT::v8i1, MVT::v8i64, 2}, // vpsllq+vptestmq
+ {ISD::TRUNCATE, MVT::v2i8, MVT::v2i32, 2}, // vpmovdb
+ {ISD::TRUNCATE, MVT::v4i8, MVT::v4i32, 2}, // vpmovdb
+ {ISD::TRUNCATE, MVT::v16i8, MVT::v16i32, 2}, // vpmovdb
+ {ISD::TRUNCATE, MVT::v32i8, MVT::v16i32, 2}, // vpmovdb
+ {ISD::TRUNCATE, MVT::v64i8, MVT::v16i32, 2}, // vpmovdb
+ {ISD::TRUNCATE, MVT::v16i16, MVT::v16i32, 2}, // vpmovdw
+ {ISD::TRUNCATE, MVT::v32i16, MVT::v16i32, 2}, // vpmovdw
+ {ISD::TRUNCATE, MVT::v2i8, MVT::v2i64, 2}, // vpmovqb
+ {ISD::TRUNCATE, MVT::v2i16, MVT::v2i64, 1}, // vpshufb
+ {ISD::TRUNCATE, MVT::v8i8, MVT::v8i64, 2}, // vpmovqb
+ {ISD::TRUNCATE, MVT::v16i8, MVT::v8i64, 2}, // vpmovqb
+ {ISD::TRUNCATE, MVT::v32i8, MVT::v8i64, 2}, // vpmovqb
+ {ISD::TRUNCATE, MVT::v64i8, MVT::v8i64, 2}, // vpmovqb
+ {ISD::TRUNCATE, MVT::v8i16, MVT::v8i64, 2}, // vpmovqw
+ {ISD::TRUNCATE, MVT::v16i16, MVT::v8i64, 2}, // vpmovqw
+ {ISD::TRUNCATE, MVT::v32i16, MVT::v8i64, 2}, // vpmovqw
+ {ISD::TRUNCATE, MVT::v8i32, MVT::v8i64, 1}, // vpmovqd
+ {ISD::TRUNCATE, MVT::v4i32, MVT::v4i64, 1}, // zmm vpmovqd
+ {ISD::TRUNCATE, MVT::v16i8, MVT::v16i64, 5}, // 2*vpmovqd+concat+vpmovdb
+
+ {ISD::TRUNCATE, MVT::v16i8, MVT::v16i16, 3}, // extend to v16i32
+ {ISD::TRUNCATE, MVT::v32i8, MVT::v32i16, 8},
+ {ISD::TRUNCATE, MVT::v64i8, MVT::v32i16, 8},
+
+ // Sign extend is zmm vpternlogd+vptruncdb.
+ // Zero extend is zmm broadcast load+vptruncdw.
+ {ISD::SIGN_EXTEND, MVT::v2i8, MVT::v2i1, 3},
+ {ISD::ZERO_EXTEND, MVT::v2i8, MVT::v2i1, 4},
+ {ISD::SIGN_EXTEND, MVT::v4i8, MVT::v4i1, 3},
+ {ISD::ZERO_EXTEND, MVT::v4i8, MVT::v4i1, 4},
+ {ISD::SIGN_EXTEND, MVT::v8i8, MVT::v8i1, 3},
+ {ISD::ZERO_EXTEND, MVT::v8i8, MVT::v8i1, 4},
+ {ISD::SIGN_EXTEND, MVT::v16i8, MVT::v16i1, 3},
+ {ISD::ZERO_EXTEND, MVT::v16i8, MVT::v16i1, 4},
+
+ // Sign extend is zmm vpternlogd+vptruncdw.
+ // Zero extend is zmm vpternlogd+vptruncdw+vpsrlw.
+ {ISD::SIGN_EXTEND, MVT::v2i16, MVT::v2i1, 3},
+ {ISD::ZERO_EXTEND, MVT::v2i16, MVT::v2i1, 4},
+ {ISD::SIGN_EXTEND, MVT::v4i16, MVT::v4i1, 3},
+ {ISD::ZERO_EXTEND, MVT::v4i16, MVT::v4i1, 4},
+ {ISD::SIGN_EXTEND, MVT::v8i16, MVT::v8i1, 3},
+ {ISD::ZERO_EXTEND, MVT::v8i16, MVT::v8i1, 4},
+ {ISD::SIGN_EXTEND, MVT::v16i16, MVT::v16i1, 3},
+ {ISD::ZERO_EXTEND, MVT::v16i16, MVT::v16i1, 4},
+
+ {ISD::SIGN_EXTEND, MVT::v2i32, MVT::v2i1, 1}, // zmm vpternlogd
+ {ISD::ZERO_EXTEND, MVT::v2i32, MVT::v2i1, 2}, // zmm vpternlogd+psrld
+ {ISD::SIGN_EXTEND, MVT::v4i32, MVT::v4i1, 1}, // zmm vpternlogd
+ {ISD::ZERO_EXTEND, MVT::v4i32, MVT::v4i1, 2}, // zmm vpternlogd+psrld
+ {ISD::SIGN_EXTEND, MVT::v8i32, MVT::v8i1, 1}, // zmm vpternlogd
+ {ISD::ZERO_EXTEND, MVT::v8i32, MVT::v8i1, 2}, // zmm vpternlogd+psrld
+ {ISD::SIGN_EXTEND, MVT::v2i64, MVT::v2i1, 1}, // zmm vpternlogq
+ {ISD::ZERO_EXTEND, MVT::v2i64, MVT::v2i1, 2}, // zmm vpternlogq+psrlq
+ {ISD::SIGN_EXTEND, MVT::v4i64, MVT::v4i1, 1}, // zmm vpternlogq
+ {ISD::ZERO_EXTEND, MVT::v4i64, MVT::v4i1, 2}, // zmm vpternlogq+psrlq
+
+ {ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i1, 1}, // vpternlogd
+ {ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i1, 2}, // vpternlogd+psrld
+ {ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i1, 1}, // vpternlogq
+ {ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i1, 2}, // vpternlogq+psrlq
+
+ {ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i8, 1},
+ {ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i8, 1},
+ {ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i16, 1},
+ {ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i16, 1},
+ {ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i8, 1},
+ {ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i8, 1},
+ {ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i16, 1},
+ {ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i16, 1},
+ {ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i32, 1},
+ {ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i32, 1},
+
+ {ISD::SIGN_EXTEND, MVT::v32i16, MVT::v32i8, 3}, // FIXME: May not be right
+ {ISD::ZERO_EXTEND, MVT::v32i16, MVT::v32i8, 3}, // FIXME: May not be right
+
+ {ISD::SINT_TO_FP, MVT::v8f64, MVT::v8i1, 4},
+ {ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i1, 3},
+ {ISD::SINT_TO_FP, MVT::v8f64, MVT::v16i8, 2},
+ {ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i8, 1},
+ {ISD::SINT_TO_FP, MVT::v8f64, MVT::v8i16, 2},
+ {ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i16, 1},
+ {ISD::SINT_TO_FP, MVT::v8f64, MVT::v8i32, 1},
+ {ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i32, 1},
+
+ {ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i1, 4},
+ {ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i1, 3},
+ {ISD::UINT_TO_FP, MVT::v8f64, MVT::v16i8, 2},
+ {ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i8, 1},
+ {ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i16, 2},
+ {ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i16, 1},
+ {ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i32, 1},
+ {ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i32, 1},
+ {ISD::UINT_TO_FP, MVT::v8f32, MVT::v8i64, 26},
+ {ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i64, 5},
+
+ {ISD::FP_TO_SINT, MVT::v16i8, MVT::v16f32, 2},
+ {ISD::FP_TO_SINT, MVT::v16i8, MVT::v16f64, 7},
+ {ISD::FP_TO_SINT, MVT::v32i8, MVT::v32f64, 15},
+ {ISD::FP_TO_SINT, MVT::v64i8, MVT::v64f32, 11},
+ {ISD::FP_TO_SINT, MVT::v64i8, MVT::v64f64, 31},
+ {ISD::FP_TO_SINT, MVT::v8i16, MVT::v8f64, 3},
+ {ISD::FP_TO_SINT, MVT::v16i16, MVT::v16f64, 7},
+ {ISD::FP_TO_SINT, MVT::v32i16, MVT::v32f32, 5},
+ {ISD::FP_TO_SINT, MVT::v32i16, MVT::v32f64, 15},
+ {ISD::FP_TO_SINT, MVT::v8i32, MVT::v8f64, 1},
+ {ISD::FP_TO_SINT, MVT::v16i32, MVT::v16f64, 3},
+
+ {ISD::FP_TO_UINT, MVT::v8i32, MVT::v8f64, 1},
+ {ISD::FP_TO_UINT, MVT::v8i16, MVT::v8f64, 3},
+ {ISD::FP_TO_UINT, MVT::v8i8, MVT::v8f64, 3},
+ {ISD::FP_TO_UINT, MVT::v16i32, MVT::v16f32, 1},
+ {ISD::FP_TO_UINT, MVT::v16i16, MVT::v16f32, 3},
+ {ISD::FP_TO_UINT, MVT::v16i8, MVT::v16f32, 3},
};
static const TypeConversionCostTblEntry AVX512BWVLConversionTbl[] {
|
$ llc test.ll -mtriple=x86_64-unknown-unknown -mattr=avx512f -o -
uiCA measures a throughput (TP) of 5 cycles for SKX: https://bit.ly/3rGcUKF |
Can you please confirm this? llvm-mca predicts the worst case (znver4) to be 4. |
Currently, uiCA doesn't support Zen4, and I don't have a Zen4 machine.
|
I meant - llvm-mca currently says the throughput for skylake etc. is 3cy not 5cy - so do you know why the intel scheduler models are underestimating the throughput? |
The SKX schedule model reports the correct latency, uop count, and throughput for each instruction.
vextractf64x4: https://uops.info/html-instr/VEXTRACTF64X4_YMM_ZMM_I8.html#SKX
There are 5 uops in total: 3 for p5 and 2 for p05. I guess mca assumed those 3*p5 and 2*p05 uops can run in parallel. |
There was a cross-iteration true dependency in the previous experiment.
The second cvt, and the first cvt of the next iteration, had to wait for vextractf64x4 to finish. Therefore its cost was 5.
This breaks the dependency, and the cost is now 3.
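The 3-cycle figure is consistent with a plain port-pressure estimate built from the uop breakdown quoted above (3 uops restricted to p5, 2 uops that can go to p0 or p5). The sketch below is only a back-of-the-envelope model, not how llvm-mca or uiCA compute throughput: it assumes each port accepts one uop per cycle, and it ignores data dependencies, the front end, and every other port. The function name `portPressureBound` is made up for illustration.

```cpp
#include <algorithm>

// Lower bound, in cycles per iteration, for a loop body whose uops split into
// `p5Only` uops that can issue only on port 5 and `p05` uops that can issue
// on either port 0 or port 5. Assumes one uop per port per cycle and no
// dependencies between uops.
int portPressureBound(int p5Only, int p05) {
  int total = p5Only + p05;
  int evenSplit = (total + 1) / 2; // ceil(total / 2): both ports kept busy
  // Port 5 must take every p5-only uop, so it can never finish sooner than
  // p5Only cycles, even if port 0 absorbs all of the flexible uops.
  return std::max(p5Only, evenSplit);
}
```

For the breakdown in this thread, `portPressureBound(3, 2)` gives 3. A loop-carried dependency like the one described above can push the measured cost beyond this bound, which is the gap between the 5-cycle and 3-cycle results.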
|
Got it thanks - given znver4 now has the worst cost, we should be setting the cost to 4 |
LGTM - cheers!
The fpext conversion cost for 16 elements should be 4 from Znver4.