[CostModel][X86] Fix fpext conversion cost for 16 elements #76278

Merged
merged 8 commits into from
Jan 9, 2024

Conversation

Contributor

@HaohaiWen HaohaiWen commented Dec 23, 2023

The fpext conversion cost for 16 elements should be 4 from Znver4.

@llvmbot
Member

llvmbot commented Dec 23, 2023

@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-analysis

Author: None (HaohaiWen)

Changes

The fpext conversion cost for 16 elements should be 5.


Full diff: https://github.com/llvm/llvm-project/pull/76278.diff

2 Files Affected:

  • (modified) llvm/lib/Target/X86/X86TargetTransformInfo.cpp (+1)
  • (modified) llvm/test/Analysis/CostModel/X86/cast.ll (+12-8)
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index 8a04987e768a12..e7b7c9666ed43c 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -2223,6 +2223,7 @@ InstructionCost X86TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
   static const TypeConversionCostTblEntry AVX512FConversionTbl[] = {
     { ISD::FP_EXTEND, MVT::v8f64,   MVT::v8f32,  1 },
     { ISD::FP_EXTEND, MVT::v8f64,   MVT::v16f32, 3 },
+    { ISD::FP_EXTEND, MVT::v16f64,  MVT::v16f32, 5 }, // 2*vcvtps2pd+vextractf64x4
     { ISD::FP_ROUND,  MVT::v8f32,   MVT::v8f64,  1 },
 
     { ISD::TRUNCATE,  MVT::v2i1,    MVT::v2i8,   3 }, // sext+vpslld+vptestmd
diff --git a/llvm/test/Analysis/CostModel/X86/cast.ll b/llvm/test/Analysis/CostModel/X86/cast.ll
index 5a83d4e81fd38e..790f408b00609d 100644
--- a/llvm/test/Analysis/CostModel/X86/cast.ll
+++ b/llvm/test/Analysis/CostModel/X86/cast.ll
@@ -616,27 +616,31 @@ define void @fp_conv(<8 x float> %a, <16 x float>%b, <4 x float> %c) {
 ; SSE-LABEL: 'fp_conv'
 ; SSE-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %A1 = fpext <4 x float> %c to <4 x double>
 ; SSE-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %A2 = fpext <8 x float> %a to <8 x double>
-; SSE-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %A3 = fptrunc <4 x double> undef to <4 x float>
-; SSE-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %A4 = fptrunc <8 x double> undef to <8 x float>
+; SSE-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %A3 = fpext <16 x float> %b to <16 x double>
+; SSE-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %A4 = fptrunc <4 x double> undef to <4 x float>
+; SSE-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %A5 = fptrunc <8 x double> undef to <8 x float>
 ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
 ; AVX-LABEL: 'fp_conv'
 ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %A1 = fpext <4 x float> %c to <4 x double>
 ; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %A2 = fpext <8 x float> %a to <8 x double>
-; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %A3 = fptrunc <4 x double> undef to <4 x float>
-; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %A4 = fptrunc <8 x double> undef to <8 x float>
+; AVX-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %A3 = fpext <16 x float> %b to <16 x double>
+; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %A4 = fptrunc <4 x double> undef to <4 x float>
+; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %A5 = fptrunc <8 x double> undef to <8 x float>
 ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
 ; AVX512-LABEL: 'fp_conv'
 ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %A1 = fpext <4 x float> %c to <4 x double>
 ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %A2 = fpext <8 x float> %a to <8 x double>
-; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %A3 = fptrunc <4 x double> undef to <4 x float>
-; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %A4 = fptrunc <8 x double> undef to <8 x float>
+; AVX512-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %A3 = fpext <16 x float> %b to <16 x double>
+; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %A4 = fptrunc <4 x double> undef to <4 x float>
+; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %A5 = fptrunc <8 x double> undef to <8 x float>
 ; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
   %A1 = fpext <4 x float> %c to <4 x double>
   %A2 = fpext <8 x float> %a to <8 x double>
-  %A3 = fptrunc <4 x double> undef to <4 x float>
-  %A4 = fptrunc <8 x double> undef to <8 x float>
+  %A3 = fpext <16 x float> %b to <16 x double>
+  %A4 = fptrunc <4 x double> undef to <4 x float>
+  %A5 = fptrunc <8 x double> undef to <8 x float>
   ret void
 }
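For readers unfamiliar with how rows such as the new `FP_EXTEND` entry are consumed, here is a minimal sketch of a conversion-cost table lookup. It is not LLVM's code: `Op`, `VT`, `CostEntry`, and `lookupConversionCost` are hypothetical stand-ins that only mirror the `{ opcode, destination type, source type, cost }` shape of `TypeConversionCostTblEntry`. The cost values are copied from the diff above as first posted, so the 16-element row carries 5 here; the review discussion later settles on 4.

```cpp
// Hypothetical, simplified stand-ins for the ISD opcodes and MVT types
// that appear in the diff. These are NOT the real LLVM enums.
enum class Op { FpExtend, FpRound };
enum class VT { v8f32, v16f32, v8f64, v16f64 };

// Mirrors the row shape { opcode, destination type, source type, cost }.
struct CostEntry {
  Op Opcode;
  VT Dst;
  VT Src;
  unsigned Cost;
};

// Values copied from the AVX512F rows shown in the diff as first posted.
static const CostEntry AVX512FTable[] = {
    {Op::FpExtend, VT::v8f64, VT::v8f32, 1},
    {Op::FpExtend, VT::v8f64, VT::v16f32, 3},
    {Op::FpExtend, VT::v16f64, VT::v16f32, 5}, // 2*vcvtps2pd+vextractf64x4
    {Op::FpRound, VT::v8f32, VT::v8f64, 1},
};

// Linear scan over the table. Returns a pointer to the matching row, or
// nullptr when the (opcode, dst, src) triple has no dedicated entry.
const CostEntry *lookupConversionCost(Op Opcode, VT Dst, VT Src) {
  for (const CostEntry &E : AVX512FTable)
    if (E.Opcode == Opcode && E.Dst == Dst && E.Src == Src)
      return &E;
  return nullptr;
}
```

In this sketch, `lookupConversionCost(Op::FpExtend, VT::v16f64, VT::v16f32)` finds the new row, while a triple with no row returns `nullptr`, which is the point at which a caller would have to fall back to some other estimate.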

@HaohaiWen
Contributor Author

This PR should be landed after #76277

github-actions bot commented Dec 23, 2023

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff e147dcbcbc8f92b7f4973eaebe800308f480dd84 a925d7f32a0615ad1d07c43b6d2f314adf480ab1 -- llvm/lib/Target/X86/X86TargetTransformInfo.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index cd40b1d3b0..c8229b59fa 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -2230,140 +2230,141 @@ InstructionCost X86TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
   // 256-bit wide vectors.
 
   static const TypeConversionCostTblEntry AVX512FConversionTbl[] = {
-    { ISD::FP_EXTEND, MVT::v8f64,   MVT::v8f32,  1 },
-    { ISD::FP_EXTEND, MVT::v8f64,   MVT::v16f32, 3 },
-    { ISD::FP_EXTEND, MVT::v16f64,  MVT::v16f32, 4 }, // 2*vcvtps2pd+vextractf64x4
-    { ISD::FP_ROUND,  MVT::v8f32,   MVT::v8f64,  1 },
-
-    { ISD::TRUNCATE,  MVT::v2i1,    MVT::v2i8,   3 }, // sext+vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v4i1,    MVT::v4i8,   3 }, // sext+vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v8i1,    MVT::v8i8,   3 }, // sext+vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v16i1,   MVT::v16i8,  3 }, // sext+vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v2i1,    MVT::v2i16,  3 }, // sext+vpsllq+vptestmq
-    { ISD::TRUNCATE,  MVT::v4i1,    MVT::v4i16,  3 }, // sext+vpsllq+vptestmq
-    { ISD::TRUNCATE,  MVT::v8i1,    MVT::v8i16,  3 }, // sext+vpsllq+vptestmq
-    { ISD::TRUNCATE,  MVT::v16i1,   MVT::v16i16, 3 }, // sext+vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v2i1,    MVT::v2i32,  2 }, // zmm vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v4i1,    MVT::v4i32,  2 }, // zmm vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v8i1,    MVT::v8i32,  2 }, // zmm vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v16i1,   MVT::v16i32, 2 }, // vpslld+vptestmd
-    { ISD::TRUNCATE,  MVT::v2i1,    MVT::v2i64,  2 }, // zmm vpsllq+vptestmq
-    { ISD::TRUNCATE,  MVT::v4i1,    MVT::v4i64,  2 }, // zmm vpsllq+vptestmq
-    { ISD::TRUNCATE,  MVT::v8i1,    MVT::v8i64,  2 }, // vpsllq+vptestmq
-    { ISD::TRUNCATE,  MVT::v2i8,    MVT::v2i32,  2 }, // vpmovdb
-    { ISD::TRUNCATE,  MVT::v4i8,    MVT::v4i32,  2 }, // vpmovdb
-    { ISD::TRUNCATE,  MVT::v16i8,   MVT::v16i32, 2 }, // vpmovdb
-    { ISD::TRUNCATE,  MVT::v32i8,   MVT::v16i32, 2 }, // vpmovdb
-    { ISD::TRUNCATE,  MVT::v64i8,   MVT::v16i32, 2 }, // vpmovdb
-    { ISD::TRUNCATE,  MVT::v16i16,  MVT::v16i32, 2 }, // vpmovdw
-    { ISD::TRUNCATE,  MVT::v32i16,  MVT::v16i32, 2 }, // vpmovdw
-    { ISD::TRUNCATE,  MVT::v2i8,    MVT::v2i64,  2 }, // vpmovqb
-    { ISD::TRUNCATE,  MVT::v2i16,   MVT::v2i64,  1 }, // vpshufb
-    { ISD::TRUNCATE,  MVT::v8i8,    MVT::v8i64,  2 }, // vpmovqb
-    { ISD::TRUNCATE,  MVT::v16i8,   MVT::v8i64,  2 }, // vpmovqb
-    { ISD::TRUNCATE,  MVT::v32i8,   MVT::v8i64,  2 }, // vpmovqb
-    { ISD::TRUNCATE,  MVT::v64i8,   MVT::v8i64,  2 }, // vpmovqb
-    { ISD::TRUNCATE,  MVT::v8i16,   MVT::v8i64,  2 }, // vpmovqw
-    { ISD::TRUNCATE,  MVT::v16i16,  MVT::v8i64,  2 }, // vpmovqw
-    { ISD::TRUNCATE,  MVT::v32i16,  MVT::v8i64,  2 }, // vpmovqw
-    { ISD::TRUNCATE,  MVT::v8i32,   MVT::v8i64,  1 }, // vpmovqd
-    { ISD::TRUNCATE,  MVT::v4i32,   MVT::v4i64,  1 }, // zmm vpmovqd
-    { ISD::TRUNCATE,  MVT::v16i8,   MVT::v16i64, 5 },// 2*vpmovqd+concat+vpmovdb
-
-    { ISD::TRUNCATE,  MVT::v16i8,  MVT::v16i16,  3 }, // extend to v16i32
-    { ISD::TRUNCATE,  MVT::v32i8,  MVT::v32i16,  8 },
-    { ISD::TRUNCATE,  MVT::v64i8,  MVT::v32i16,  8 },
-
-    // Sign extend is zmm vpternlogd+vptruncdb.
-    // Zero extend is zmm broadcast load+vptruncdw.
-    { ISD::SIGN_EXTEND, MVT::v2i8,   MVT::v2i1,   3 },
-    { ISD::ZERO_EXTEND, MVT::v2i8,   MVT::v2i1,   4 },
-    { ISD::SIGN_EXTEND, MVT::v4i8,   MVT::v4i1,   3 },
-    { ISD::ZERO_EXTEND, MVT::v4i8,   MVT::v4i1,   4 },
-    { ISD::SIGN_EXTEND, MVT::v8i8,   MVT::v8i1,   3 },
-    { ISD::ZERO_EXTEND, MVT::v8i8,   MVT::v8i1,   4 },
-    { ISD::SIGN_EXTEND, MVT::v16i8,  MVT::v16i1,  3 },
-    { ISD::ZERO_EXTEND, MVT::v16i8,  MVT::v16i1,  4 },
-
-    // Sign extend is zmm vpternlogd+vptruncdw.
-    // Zero extend is zmm vpternlogd+vptruncdw+vpsrlw.
-    { ISD::SIGN_EXTEND, MVT::v2i16,  MVT::v2i1,   3 },
-    { ISD::ZERO_EXTEND, MVT::v2i16,  MVT::v2i1,   4 },
-    { ISD::SIGN_EXTEND, MVT::v4i16,  MVT::v4i1,   3 },
-    { ISD::ZERO_EXTEND, MVT::v4i16,  MVT::v4i1,   4 },
-    { ISD::SIGN_EXTEND, MVT::v8i16,  MVT::v8i1,   3 },
-    { ISD::ZERO_EXTEND, MVT::v8i16,  MVT::v8i1,   4 },
-    { ISD::SIGN_EXTEND, MVT::v16i16, MVT::v16i1,  3 },
-    { ISD::ZERO_EXTEND, MVT::v16i16, MVT::v16i1,  4 },
-
-    { ISD::SIGN_EXTEND, MVT::v2i32,  MVT::v2i1,   1 }, // zmm vpternlogd
-    { ISD::ZERO_EXTEND, MVT::v2i32,  MVT::v2i1,   2 }, // zmm vpternlogd+psrld
-    { ISD::SIGN_EXTEND, MVT::v4i32,  MVT::v4i1,   1 }, // zmm vpternlogd
-    { ISD::ZERO_EXTEND, MVT::v4i32,  MVT::v4i1,   2 }, // zmm vpternlogd+psrld
-    { ISD::SIGN_EXTEND, MVT::v8i32,  MVT::v8i1,   1 }, // zmm vpternlogd
-    { ISD::ZERO_EXTEND, MVT::v8i32,  MVT::v8i1,   2 }, // zmm vpternlogd+psrld
-    { ISD::SIGN_EXTEND, MVT::v2i64,  MVT::v2i1,   1 }, // zmm vpternlogq
-    { ISD::ZERO_EXTEND, MVT::v2i64,  MVT::v2i1,   2 }, // zmm vpternlogq+psrlq
-    { ISD::SIGN_EXTEND, MVT::v4i64,  MVT::v4i1,   1 }, // zmm vpternlogq
-    { ISD::ZERO_EXTEND, MVT::v4i64,  MVT::v4i1,   2 }, // zmm vpternlogq+psrlq
-
-    { ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i1,  1 }, // vpternlogd
-    { ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i1,  2 }, // vpternlogd+psrld
-    { ISD::SIGN_EXTEND, MVT::v8i64,  MVT::v8i1,   1 }, // vpternlogq
-    { ISD::ZERO_EXTEND, MVT::v8i64,  MVT::v8i1,   2 }, // vpternlogq+psrlq
-
-    { ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i8,  1 },
-    { ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i8,  1 },
-    { ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i16, 1 },
-    { ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i16, 1 },
-    { ISD::SIGN_EXTEND, MVT::v8i64,  MVT::v8i8,   1 },
-    { ISD::ZERO_EXTEND, MVT::v8i64,  MVT::v8i8,   1 },
-    { ISD::SIGN_EXTEND, MVT::v8i64,  MVT::v8i16,  1 },
-    { ISD::ZERO_EXTEND, MVT::v8i64,  MVT::v8i16,  1 },
-    { ISD::SIGN_EXTEND, MVT::v8i64,  MVT::v8i32,  1 },
-    { ISD::ZERO_EXTEND, MVT::v8i64,  MVT::v8i32,  1 },
-
-    { ISD::SIGN_EXTEND, MVT::v32i16, MVT::v32i8,  3 }, // FIXME: May not be right
-    { ISD::ZERO_EXTEND, MVT::v32i16, MVT::v32i8,  3 }, // FIXME: May not be right
-
-    { ISD::SINT_TO_FP,  MVT::v8f64,  MVT::v8i1,   4 },
-    { ISD::SINT_TO_FP,  MVT::v16f32, MVT::v16i1,  3 },
-    { ISD::SINT_TO_FP,  MVT::v8f64,  MVT::v16i8,  2 },
-    { ISD::SINT_TO_FP,  MVT::v16f32, MVT::v16i8,  1 },
-    { ISD::SINT_TO_FP,  MVT::v8f64,  MVT::v8i16,  2 },
-    { ISD::SINT_TO_FP,  MVT::v16f32, MVT::v16i16, 1 },
-    { ISD::SINT_TO_FP,  MVT::v8f64,  MVT::v8i32,  1 },
-    { ISD::SINT_TO_FP,  MVT::v16f32, MVT::v16i32, 1 },
-
-    { ISD::UINT_TO_FP,  MVT::v8f64,  MVT::v8i1,   4 },
-    { ISD::UINT_TO_FP,  MVT::v16f32, MVT::v16i1,  3 },
-    { ISD::UINT_TO_FP,  MVT::v8f64,  MVT::v16i8,  2 },
-    { ISD::UINT_TO_FP,  MVT::v16f32, MVT::v16i8,  1 },
-    { ISD::UINT_TO_FP,  MVT::v8f64,  MVT::v8i16,  2 },
-    { ISD::UINT_TO_FP,  MVT::v16f32, MVT::v16i16, 1 },
-    { ISD::UINT_TO_FP,  MVT::v8f64,  MVT::v8i32,  1 },
-    { ISD::UINT_TO_FP,  MVT::v16f32, MVT::v16i32, 1 },
-    { ISD::UINT_TO_FP,  MVT::v8f32,  MVT::v8i64, 26 },
-    { ISD::UINT_TO_FP,  MVT::v8f64,  MVT::v8i64,  5 },
-
-    { ISD::FP_TO_SINT,  MVT::v16i8,  MVT::v16f32, 2 },
-    { ISD::FP_TO_SINT,  MVT::v16i8,  MVT::v16f64, 7 },
-    { ISD::FP_TO_SINT,  MVT::v32i8,  MVT::v32f64,15 },
-    { ISD::FP_TO_SINT,  MVT::v64i8,  MVT::v64f32,11 },
-    { ISD::FP_TO_SINT,  MVT::v64i8,  MVT::v64f64,31 },
-    { ISD::FP_TO_SINT,  MVT::v8i16,  MVT::v8f64,  3 },
-    { ISD::FP_TO_SINT,  MVT::v16i16, MVT::v16f64, 7 },
-    { ISD::FP_TO_SINT,  MVT::v32i16, MVT::v32f32, 5 },
-    { ISD::FP_TO_SINT,  MVT::v32i16, MVT::v32f64,15 },
-    { ISD::FP_TO_SINT,  MVT::v8i32,  MVT::v8f64,  1 },
-    { ISD::FP_TO_SINT,  MVT::v16i32, MVT::v16f64, 3 },
-
-    { ISD::FP_TO_UINT,  MVT::v8i32,  MVT::v8f64,  1 },
-    { ISD::FP_TO_UINT,  MVT::v8i16,  MVT::v8f64,  3 },
-    { ISD::FP_TO_UINT,  MVT::v8i8,   MVT::v8f64,  3 },
-    { ISD::FP_TO_UINT,  MVT::v16i32, MVT::v16f32, 1 },
-    { ISD::FP_TO_UINT,  MVT::v16i16, MVT::v16f32, 3 },
-    { ISD::FP_TO_UINT,  MVT::v16i8,  MVT::v16f32, 3 },
+      {ISD::FP_EXTEND, MVT::v8f64, MVT::v8f32, 1},
+      {ISD::FP_EXTEND, MVT::v8f64, MVT::v16f32, 3},
+      {ISD::FP_EXTEND, MVT::v16f64, MVT::v16f32,
+       4}, // 2*vcvtps2pd+vextractf64x4
+      {ISD::FP_ROUND, MVT::v8f32, MVT::v8f64, 1},
+
+      {ISD::TRUNCATE, MVT::v2i1, MVT::v2i8, 3},     // sext+vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v4i1, MVT::v4i8, 3},     // sext+vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v8i1, MVT::v8i8, 3},     // sext+vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v16i1, MVT::v16i8, 3},   // sext+vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v2i1, MVT::v2i16, 3},    // sext+vpsllq+vptestmq
+      {ISD::TRUNCATE, MVT::v4i1, MVT::v4i16, 3},    // sext+vpsllq+vptestmq
+      {ISD::TRUNCATE, MVT::v8i1, MVT::v8i16, 3},    // sext+vpsllq+vptestmq
+      {ISD::TRUNCATE, MVT::v16i1, MVT::v16i16, 3},  // sext+vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v2i1, MVT::v2i32, 2},    // zmm vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v4i1, MVT::v4i32, 2},    // zmm vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v8i1, MVT::v8i32, 2},    // zmm vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v16i1, MVT::v16i32, 2},  // vpslld+vptestmd
+      {ISD::TRUNCATE, MVT::v2i1, MVT::v2i64, 2},    // zmm vpsllq+vptestmq
+      {ISD::TRUNCATE, MVT::v4i1, MVT::v4i64, 2},    // zmm vpsllq+vptestmq
+      {ISD::TRUNCATE, MVT::v8i1, MVT::v8i64, 2},    // vpsllq+vptestmq
+      {ISD::TRUNCATE, MVT::v2i8, MVT::v2i32, 2},    // vpmovdb
+      {ISD::TRUNCATE, MVT::v4i8, MVT::v4i32, 2},    // vpmovdb
+      {ISD::TRUNCATE, MVT::v16i8, MVT::v16i32, 2},  // vpmovdb
+      {ISD::TRUNCATE, MVT::v32i8, MVT::v16i32, 2},  // vpmovdb
+      {ISD::TRUNCATE, MVT::v64i8, MVT::v16i32, 2},  // vpmovdb
+      {ISD::TRUNCATE, MVT::v16i16, MVT::v16i32, 2}, // vpmovdw
+      {ISD::TRUNCATE, MVT::v32i16, MVT::v16i32, 2}, // vpmovdw
+      {ISD::TRUNCATE, MVT::v2i8, MVT::v2i64, 2},    // vpmovqb
+      {ISD::TRUNCATE, MVT::v2i16, MVT::v2i64, 1},   // vpshufb
+      {ISD::TRUNCATE, MVT::v8i8, MVT::v8i64, 2},    // vpmovqb
+      {ISD::TRUNCATE, MVT::v16i8, MVT::v8i64, 2},   // vpmovqb
+      {ISD::TRUNCATE, MVT::v32i8, MVT::v8i64, 2},   // vpmovqb
+      {ISD::TRUNCATE, MVT::v64i8, MVT::v8i64, 2},   // vpmovqb
+      {ISD::TRUNCATE, MVT::v8i16, MVT::v8i64, 2},   // vpmovqw
+      {ISD::TRUNCATE, MVT::v16i16, MVT::v8i64, 2},  // vpmovqw
+      {ISD::TRUNCATE, MVT::v32i16, MVT::v8i64, 2},  // vpmovqw
+      {ISD::TRUNCATE, MVT::v8i32, MVT::v8i64, 1},   // vpmovqd
+      {ISD::TRUNCATE, MVT::v4i32, MVT::v4i64, 1},   // zmm vpmovqd
+      {ISD::TRUNCATE, MVT::v16i8, MVT::v16i64, 5},  // 2*vpmovqd+concat+vpmovdb
+
+      {ISD::TRUNCATE, MVT::v16i8, MVT::v16i16, 3}, // extend to v16i32
+      {ISD::TRUNCATE, MVT::v32i8, MVT::v32i16, 8},
+      {ISD::TRUNCATE, MVT::v64i8, MVT::v32i16, 8},
+
+      // Sign extend is zmm vpternlogd+vptruncdb.
+      // Zero extend is zmm broadcast load+vptruncdw.
+      {ISD::SIGN_EXTEND, MVT::v2i8, MVT::v2i1, 3},
+      {ISD::ZERO_EXTEND, MVT::v2i8, MVT::v2i1, 4},
+      {ISD::SIGN_EXTEND, MVT::v4i8, MVT::v4i1, 3},
+      {ISD::ZERO_EXTEND, MVT::v4i8, MVT::v4i1, 4},
+      {ISD::SIGN_EXTEND, MVT::v8i8, MVT::v8i1, 3},
+      {ISD::ZERO_EXTEND, MVT::v8i8, MVT::v8i1, 4},
+      {ISD::SIGN_EXTEND, MVT::v16i8, MVT::v16i1, 3},
+      {ISD::ZERO_EXTEND, MVT::v16i8, MVT::v16i1, 4},
+
+      // Sign extend is zmm vpternlogd+vptruncdw.
+      // Zero extend is zmm vpternlogd+vptruncdw+vpsrlw.
+      {ISD::SIGN_EXTEND, MVT::v2i16, MVT::v2i1, 3},
+      {ISD::ZERO_EXTEND, MVT::v2i16, MVT::v2i1, 4},
+      {ISD::SIGN_EXTEND, MVT::v4i16, MVT::v4i1, 3},
+      {ISD::ZERO_EXTEND, MVT::v4i16, MVT::v4i1, 4},
+      {ISD::SIGN_EXTEND, MVT::v8i16, MVT::v8i1, 3},
+      {ISD::ZERO_EXTEND, MVT::v8i16, MVT::v8i1, 4},
+      {ISD::SIGN_EXTEND, MVT::v16i16, MVT::v16i1, 3},
+      {ISD::ZERO_EXTEND, MVT::v16i16, MVT::v16i1, 4},
+
+      {ISD::SIGN_EXTEND, MVT::v2i32, MVT::v2i1, 1}, // zmm vpternlogd
+      {ISD::ZERO_EXTEND, MVT::v2i32, MVT::v2i1, 2}, // zmm vpternlogd+psrld
+      {ISD::SIGN_EXTEND, MVT::v4i32, MVT::v4i1, 1}, // zmm vpternlogd
+      {ISD::ZERO_EXTEND, MVT::v4i32, MVT::v4i1, 2}, // zmm vpternlogd+psrld
+      {ISD::SIGN_EXTEND, MVT::v8i32, MVT::v8i1, 1}, // zmm vpternlogd
+      {ISD::ZERO_EXTEND, MVT::v8i32, MVT::v8i1, 2}, // zmm vpternlogd+psrld
+      {ISD::SIGN_EXTEND, MVT::v2i64, MVT::v2i1, 1}, // zmm vpternlogq
+      {ISD::ZERO_EXTEND, MVT::v2i64, MVT::v2i1, 2}, // zmm vpternlogq+psrlq
+      {ISD::SIGN_EXTEND, MVT::v4i64, MVT::v4i1, 1}, // zmm vpternlogq
+      {ISD::ZERO_EXTEND, MVT::v4i64, MVT::v4i1, 2}, // zmm vpternlogq+psrlq
+
+      {ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i1, 1}, // vpternlogd
+      {ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i1, 2}, // vpternlogd+psrld
+      {ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i1, 1},   // vpternlogq
+      {ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i1, 2},   // vpternlogq+psrlq
+
+      {ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i8, 1},
+      {ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i8, 1},
+      {ISD::SIGN_EXTEND, MVT::v16i32, MVT::v16i16, 1},
+      {ISD::ZERO_EXTEND, MVT::v16i32, MVT::v16i16, 1},
+      {ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i8, 1},
+      {ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i8, 1},
+      {ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i16, 1},
+      {ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i16, 1},
+      {ISD::SIGN_EXTEND, MVT::v8i64, MVT::v8i32, 1},
+      {ISD::ZERO_EXTEND, MVT::v8i64, MVT::v8i32, 1},
+
+      {ISD::SIGN_EXTEND, MVT::v32i16, MVT::v32i8, 3}, // FIXME: May not be right
+      {ISD::ZERO_EXTEND, MVT::v32i16, MVT::v32i8, 3}, // FIXME: May not be right
+
+      {ISD::SINT_TO_FP, MVT::v8f64, MVT::v8i1, 4},
+      {ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i1, 3},
+      {ISD::SINT_TO_FP, MVT::v8f64, MVT::v16i8, 2},
+      {ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i8, 1},
+      {ISD::SINT_TO_FP, MVT::v8f64, MVT::v8i16, 2},
+      {ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i16, 1},
+      {ISD::SINT_TO_FP, MVT::v8f64, MVT::v8i32, 1},
+      {ISD::SINT_TO_FP, MVT::v16f32, MVT::v16i32, 1},
+
+      {ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i1, 4},
+      {ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i1, 3},
+      {ISD::UINT_TO_FP, MVT::v8f64, MVT::v16i8, 2},
+      {ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i8, 1},
+      {ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i16, 2},
+      {ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i16, 1},
+      {ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i32, 1},
+      {ISD::UINT_TO_FP, MVT::v16f32, MVT::v16i32, 1},
+      {ISD::UINT_TO_FP, MVT::v8f32, MVT::v8i64, 26},
+      {ISD::UINT_TO_FP, MVT::v8f64, MVT::v8i64, 5},
+
+      {ISD::FP_TO_SINT, MVT::v16i8, MVT::v16f32, 2},
+      {ISD::FP_TO_SINT, MVT::v16i8, MVT::v16f64, 7},
+      {ISD::FP_TO_SINT, MVT::v32i8, MVT::v32f64, 15},
+      {ISD::FP_TO_SINT, MVT::v64i8, MVT::v64f32, 11},
+      {ISD::FP_TO_SINT, MVT::v64i8, MVT::v64f64, 31},
+      {ISD::FP_TO_SINT, MVT::v8i16, MVT::v8f64, 3},
+      {ISD::FP_TO_SINT, MVT::v16i16, MVT::v16f64, 7},
+      {ISD::FP_TO_SINT, MVT::v32i16, MVT::v32f32, 5},
+      {ISD::FP_TO_SINT, MVT::v32i16, MVT::v32f64, 15},
+      {ISD::FP_TO_SINT, MVT::v8i32, MVT::v8f64, 1},
+      {ISD::FP_TO_SINT, MVT::v16i32, MVT::v16f64, 3},
+
+      {ISD::FP_TO_UINT, MVT::v8i32, MVT::v8f64, 1},
+      {ISD::FP_TO_UINT, MVT::v8i16, MVT::v8f64, 3},
+      {ISD::FP_TO_UINT, MVT::v8i8, MVT::v8f64, 3},
+      {ISD::FP_TO_UINT, MVT::v16i32, MVT::v16f32, 1},
+      {ISD::FP_TO_UINT, MVT::v16i16, MVT::v16f32, 3},
+      {ISD::FP_TO_UINT, MVT::v16i8, MVT::v16f32, 3},
   };
 
   static const TypeConversionCostTblEntry AVX512BWVLConversionTbl[] {

@HaohaiWen
Contributor Author

 $cat test.ll
define <8 x double> @foo8(<8 x float> %in) {
  %1 = fpext <8 x float> %in to <8 x double>
  ret <8 x double> %1
}

define <16 x double> @foo16(<16 x float> %in) {
  %1 = fpext <16 x float> %in to <16 x double>
  ret <16 x double> %1
}

$llc test.ll -mtriple=x86_64-unknown-unknown -mattr=avx512f -o -

        .text
        .file   "test.ll"
        .globl  foo8                            # -- Begin function foo8
        .p2align        4, 0x90
        .type   foo8,@function
foo8:                                   # @foo8
        .cfi_startproc
# %bb.0:
        vcvtps2pd       %ymm0, %zmm0
        retq
.Lfunc_end0:
        .size   foo8, .Lfunc_end0-foo8
        .cfi_endproc
                                        # -- End function
        .globl  foo16                           # -- Begin function foo16
        .p2align        4, 0x90
        .type   foo16,@function
foo16:                                  # @foo16
        .cfi_startproc
# %bb.0:
        vcvtps2pd       %ymm0, %zmm2
        vextractf64x4   $1, %zmm0, %ymm0
        vcvtps2pd       %ymm0, %zmm1
        vmovaps %zmm2, %zmm0
        retq
.Lfunc_end1:
        .size   foo16, .Lfunc_end1-foo16
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

uiCA reports a throughput of 5 cycles for this sequence on SKX: https://bit.ly/3rGcUKF

@RKSimon
Collaborator

RKSimon commented Dec 23, 2023

Please can you confirm this, as llvm-mca predicts the worst case (znver4) to be 4
https://llvm.godbolt.org/z/fxWTaf3Gv

@HaohaiWen
Contributor Author

HaohaiWen commented Dec 25, 2023

Please can you confirm this, as llvm-mca predicts the worst case (znver4) to be 4 https://llvm.godbolt.org/z/fxWTaf3Gv

Currently, uiCA doesn't support Zen4 and I don't have a Zen4 machine.
I can measure it on a local SKX machine with nanoBench (https://github.com/andreas-abel/nanoBench). If you have a Zen4 machine, maybe you can use it to confirm the Zen4 cost.
For example:

./nanoBench.sh -init "xor zmm0, zmm0" -asm "vcvtps2pd zmm2, ymm0; vextractf64x4 ymm0, zmm0, 1; vcvtps2pd zmm1, ymm0" -config configs/cfg_SkylakeX_common.txt -unroll 1000 -loop 1000 -warm_up_count 10 -cpu 0
Note: Hyper-threading is enabled; it can be disabled with "sudo ./disable-HT.sh"
CORE_CYCLES: 4.77
INST_RETIRED: 3.00
IDQ.MITE_UOPS: 5.71
IDQ.DSB_UOPS: -0.70
IDQ.MS_UOPS: 0.01
LSD.UOPS: 0.00
UOPS_ISSUED: 5.01
UOPS_EXECUTED: 5.01
UOPS_RETIRED.RETIRE_SLOTS: 5.01
UOPS_DISPATCHED_PORT.PORT_0: 2.00
UOPS_DISPATCHED_PORT.PORT_1: 0.00
UOPS_DISPATCHED_PORT.PORT_2: 0.00
UOPS_DISPATCHED_PORT.PORT_3: 0.00
UOPS_DISPATCHED_PORT.PORT_4: 0.00
UOPS_DISPATCHED_PORT.PORT_5: 3.00
UOPS_DISPATCHED_PORT.PORT_6: 0.00
UOPS_DISPATCHED_PORT.PORT_7: 0.00

@RKSimon
Collaborator

RKSimon commented Dec 28, 2023

I meant - llvm-mca currently says the throughput for skylake etc. is 3cy not 5cy - so do you know why the intel scheduler models are underestimating the throughput?

@HaohaiWen
Contributor Author

I meant - llvm-mca currently says the throughput for skylake etc. is 3cy not 5cy - so do you know why the intel scheduler models are underestimating the throughput?

The SKX schedule model reports the correct latency, uop count, and throughput for each individual instruction.

vcvtps2pd: https://uops.info/html-instr/VCVTPS2PD_ZMM_YMM.html#SKX

Instruction                                Lat  TP           Uops   Ports
VCVTPS2PD (ZMM, YMM)          AVX512 EVEX  7    1.00 / 1.09  2 / 2  1*p05+1*p5

vextractf64x4: https://uops.info/html-instr/VEXTRACTF64X4_YMM_ZMM_I8.html#SKX

Instruction                                Lat  TP           Uops   Ports
VEXTRACTF64X4 (YMM, ZMM, I8)  AVX512 EVEX  3    1.00 / 1.00  1 / 1  1*p5

In total there are 5 uops: 3 that can only use p5 and 2 that can use p0 or p5. I guess mca assumes the 3*p5 and 2*p05 uops can run in parallel.
The nanoBench result shows that the 2*p05 uops did go to p0. It looks like there are dependencies that keep them from running fully in parallel. I don't know how uiCA analyzed it.
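The port arithmetic in this comment can be checked with a small sketch. It is not how llvm-mca or uiCA model the pipeline: `portPressureBound` and `fpext16Uops` are hypothetical helpers that compute only the best-case port-pressure bound, ignoring latency and register dependencies entirely. The uop breakdown is the one quoted from uops.info in this comment.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Each uop is a bitmask of the ports it may issue on (bit N = port N).
using PortMask = unsigned;
constexpr PortMask P0 = 1u << 0;
constexpr PortMask P5 = 1u << 5;
constexpr int NumPorts = 8;

// Try every assignment of uops to their allowed ports and return the
// smallest achievable maximum per-port load. Exponential in the number of
// uops, which is fine for a handful of them.
static int bestMaxLoad(const std::vector<PortMask> &Uops, std::size_t Idx,
                       std::array<int, NumPorts> &Load) {
  if (Idx == Uops.size())
    return *std::max_element(Load.begin(), Load.end());
  int Best = static_cast<int>(Uops.size()); // upper bound: all on one port
  for (int P = 0; P < NumPorts; ++P) {
    if (!(Uops[Idx] & (1u << P)))
      continue; // this uop cannot issue on port P
    ++Load[P];
    Best = std::min(Best, bestMaxLoad(Uops, Idx + 1, Load));
    --Load[P];
  }
  return Best;
}

// Best-case cycles per iteration if port pressure were the only limit.
int portPressureBound(const std::vector<PortMask> &Uops) {
  std::array<int, NumPorts> Load{};
  return bestMaxLoad(Uops, 0, Load);
}

// The three-instruction fpext sequence, using the SKX uop data above.
std::vector<PortMask> fpext16Uops() {
  return {
      P0 | P5, P5, // vcvtps2pd zmm2, ymm0:        1*p05 + 1*p5
      P5,          // vextractf64x4 ymm0, zmm0, 1: 1*p5
      P0 | P5, P5, // vcvtps2pd zmm1, ymm0:        1*p05 + 1*p5
  };
}
```

`portPressureBound(fpext16Uops())` evaluates to 3: the three p5-only uops pin port 5 at 3, and both p05 uops fit on port 0. That agrees with the 3cy figure quoted for llvm-mca and with the PORT_0: 2.00 / PORT_5: 3.00 split in the nanoBench output, so the gap to the measured 4.77 cycles must come from something this bound ignores.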

@HaohaiWen
Contributor Author

There's a cross-iteration true dependency in the previous experiment.

vcvtps2pd zmm2, ymm0
vextractf64x4 ymm0, zmm0, 1
vcvtps2pd zmm1, ymm0

The second cvt, and the first cvt of the next iteration, have to wait for vextractf64x4 to finish. That is why the cost is 5.
In a real scenario, zmm0 would be reset to the new fpext input.

vmovaps zmm0, zmm3
vcvtps2pd zmm2, ymm0
vextractf64x4 ymm0, zmm0, 1
vcvtps2pd zmm1, ymm0

This breaks the dependency, and the cost is now 3.

# ./nanoBench.sh -init "xor zmm0, zmm0" -asm "vmovaps zmm0, zmm3; vcvtps2pd zmm2, ymm0; vextractf64x4 ymm0, zmm0, 1; vcvtps2pd zmm1, ymm0" -config configs/cfg_SkylakeX_common.txt -unroll 1000 -loop 1000 -warm_up_count 10 -cpu 0
Note: Hyper-threading is enabled; it can be disabled with "sudo ./disable-HT.sh"
CORE_CYCLES: 3.00
INST_RETIRED: 4.00
IDQ.MITE_UOPS: 6.46
IDQ.DSB_UOPS: -0.45
IDQ.MS_UOPS: 0.01
LSD.UOPS: 0.00
UOPS_ISSUED: 6.01
UOPS_EXECUTED: 5.01
UOPS_RETIRED.RETIRE_SLOTS: 6.01
UOPS_DISPATCHED_PORT.PORT_0: 2.00
UOPS_DISPATCHED_PORT.PORT_1: 0.00
UOPS_DISPATCHED_PORT.PORT_2: 0.00
UOPS_DISPATCHED_PORT.PORT_3: 0.00
UOPS_DISPATCHED_PORT.PORT_4: 0.00
UOPS_DISPATCHED_PORT.PORT_5: 3.00
UOPS_DISPATCHED_PORT.PORT_6: 0.01
UOPS_DISPATCHED_PORT.PORT_7: 0.00

@RKSimon
Collaborator

RKSimon commented Jan 5, 2024

Got it thanks - given znver4 now has the worst cost, we should be setting the cost to 4
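The rule applied here amounts to taking the maximum over per-CPU estimates. A minimal sketch, assuming hypothetical names (`TargetEstimate`, `worstCaseCost`, `threadEstimates`) and using only the two figures reported in this thread: 3 cycles measured on SKX once the dependency is broken, and 4 cycles predicted by llvm-mca for znver4.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical record: one throughput estimate (in cycles) per CPU.
struct TargetEstimate {
  std::string Cpu;
  unsigned Cycles;
};

// A table shared by several CPUs gets the most pessimistic estimate.
// Returns 0 for an empty list.
unsigned worstCaseCost(const std::vector<TargetEstimate> &Estimates) {
  unsigned Worst = 0;
  for (const TargetEstimate &E : Estimates)
    Worst = std::max(Worst, E.Cycles);
  return Worst;
}

// Figures taken from this thread only.
std::vector<TargetEstimate> threadEstimates() {
  return {
      {"SKX", 3},    // nanoBench, CORE_CYCLES 3.00 with the vmovaps reset
      {"znver4", 4}, // llvm-mca prediction
  };
}
```

With these inputs `worstCaseCost(threadEstimates())` is 4, the value requested for the table entry.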

Collaborator

@RKSimon RKSimon left a comment

LGTM - cheers!

@HaohaiWen HaohaiWen merged commit 4147b72 into llvm:main Jan 9, 2024
3 of 4 checks passed
@HaohaiWen HaohaiWen deleted the tti-fix branch January 9, 2024 01:05
justinfargnoli pushed a commit to justinfargnoli/llvm-project that referenced this pull request Jan 28, 2024
The fpext conversion cost for 16 elements should be 4 from Znver4.