Skip to content

[LLVM][CodeGen][AArch64] Don't scalarise v8{f16,bf16} vsetcc operations. #135398

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

paulwalker-arm
Copy link
Collaborator

I have also removed custom promotion code for the v4{f16,bf16} cases because the same common code can be used.

@llvmbot
Copy link
Member

llvmbot commented Apr 11, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Paul Walker (paulwalker-arm)

Changes

I have also removed custom promotion code for the v4{f16,bf16} cases because the same common code can be used.


Patch is 115.40 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/135398.diff

7 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+11-25)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+3-1)
  • (modified) llvm/test/Analysis/CostModel/AArch64/cmp.ll (+1-1)
  • (modified) llvm/test/Analysis/CostModel/AArch64/vector-select.ll (+7-7)
  • (modified) llvm/test/CodeGen/AArch64/bf16-v8-instructions.ll (+139-762)
  • (modified) llvm/test/CodeGen/AArch64/fcmp.ll (+111-471)
  • (modified) llvm/test/CodeGen/AArch64/fp16-v8-instructions.ll (+163-786)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index e366d7cb54490..92b77fde8bab2 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -841,11 +841,11 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
     setOperationPromotedToType(ISD::FRINT,      V4Narrow, MVT::v4f32);
     setOperationPromotedToType(ISD::FNEARBYINT, V4Narrow, MVT::v4f32);
     setOperationPromotedToType(ISD::FCANONICALIZE, V4Narrow, MVT::v4f32);
+    setOperationPromotedToType(ISD::SETCC,         V4Narrow, MVT::v4f32);
 
     setOperationAction(ISD::FABS,        V4Narrow, Legal);
-    setOperationAction(ISD::FNEG, 	 V4Narrow, Legal);
+    setOperationAction(ISD::FNEG,        V4Narrow, Legal);
     setOperationAction(ISD::FMA,         V4Narrow, Expand);
-    setOperationAction(ISD::SETCC,       V4Narrow, Custom);
     setOperationAction(ISD::BR_CC,       V4Narrow, Expand);
     setOperationAction(ISD::SELECT,      V4Narrow, Expand);
     setOperationAction(ISD::SELECT_CC,   V4Narrow, Expand);
@@ -853,6 +853,9 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::FSQRT,       V4Narrow, Expand);
 
     auto V8Narrow = MVT::getVectorVT(ScalarVT, 8);
+    setOperationPromotedToType(ISD::FCANONICALIZE, V8Narrow, MVT::v8f32);
+    setOperationPromotedToType(ISD::SETCC,         V8Narrow, MVT::v8f32);
+
     setOperationAction(ISD::FABS,        V8Narrow, Legal);
     setOperationAction(ISD::FADD,        V8Narrow, Legal);
     setOperationAction(ISD::FCEIL,       V8Narrow, Legal);
@@ -862,19 +865,17 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::FMA,         V8Narrow, Expand);
     setOperationAction(ISD::FMUL,        V8Narrow, Legal);
     setOperationAction(ISD::FNEARBYINT,  V8Narrow, Legal);
-    setOperationAction(ISD::FNEG, 	 V8Narrow, Legal);
+    setOperationAction(ISD::FNEG,        V8Narrow, Legal);
     setOperationAction(ISD::FROUND,      V8Narrow, Legal);
     setOperationAction(ISD::FROUNDEVEN,  V8Narrow, Legal);
     setOperationAction(ISD::FRINT,       V8Narrow, Legal);
     setOperationAction(ISD::FSQRT,       V8Narrow, Expand);
     setOperationAction(ISD::FSUB,        V8Narrow, Legal);
     setOperationAction(ISD::FTRUNC,      V8Narrow, Legal);
-    setOperationAction(ISD::SETCC,       V8Narrow, Expand);
     setOperationAction(ISD::BR_CC,       V8Narrow, Expand);
     setOperationAction(ISD::SELECT,      V8Narrow, Expand);
     setOperationAction(ISD::SELECT_CC,   V8Narrow, Expand);
     setOperationAction(ISD::FP_EXTEND,   V8Narrow, Expand);
-    setOperationPromotedToType(ISD::FCANONICALIZE, V8Narrow, MVT::v8f32);
   };
 
   if (!Subtarget->hasFullFP16()) {
@@ -15905,6 +15906,11 @@ SDValue AArch64TargetLowering::LowerVSETCC(SDValue Op,
   if (LHS.getValueType().getVectorElementType().isInteger())
     return Op;
 
+  assert((!Subtarget->hasFullFP16() &&
+          LHS.getValueType().getVectorElementType() != MVT::f16) ||
+         LHS.getValueType().getVectorElementType() != MVT::bf16 ||
+         LHS.getValueType().getVectorElementType() != MVT::f128);
+
   // Lower isnan(x) | isnan(never-nan) to x != x.
   // Lower !isnan(x) & !isnan(never-nan) to x == x.
   if (CC == ISD::SETUO || CC == ISD::SETO) {
@@ -15923,26 +15929,6 @@ SDValue AArch64TargetLowering::LowerVSETCC(SDValue Op,
     }
   }
 
-  const bool FullFP16 = DAG.getSubtarget<AArch64Subtarget>().hasFullFP16();
-
-  // Make v4f16 (only) fcmp operations utilise vector instructions
-  // v8f16 support will be a litle more complicated
-  if ((!FullFP16 && LHS.getValueType().getVectorElementType() == MVT::f16) ||
-      LHS.getValueType().getVectorElementType() == MVT::bf16) {
-    if (LHS.getValueType().getVectorNumElements() == 4) {
-      LHS = DAG.getNode(ISD::FP_EXTEND, dl, MVT::v4f32, LHS);
-      RHS = DAG.getNode(ISD::FP_EXTEND, dl, MVT::v4f32, RHS);
-      SDValue NewSetcc = DAG.getSetCC(dl, MVT::v4i16, LHS, RHS, CC);
-      DAG.ReplaceAllUsesWith(Op, NewSetcc);
-      CmpVT = MVT::v4i32;
-    } else
-      return SDValue();
-  }
-
-  assert((!FullFP16 && LHS.getValueType().getVectorElementType() != MVT::f16) ||
-         LHS.getValueType().getVectorElementType() != MVT::bf16 ||
-         LHS.getValueType().getVectorElementType() != MVT::f128);
-
   // Unfortunately, the mapping of LLVM FP CC's onto AArch64 CC's isn't totally
   // clean.  Some of them require two branches to implement.
   AArch64CC::CondCode CC1, CC2;
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index b1d8277182add..7356392c51df7 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -4236,9 +4236,11 @@ InstructionCost AArch64TTIImpl::getCmpSelInstrCost(
 
   if (isa<FixedVectorType>(ValTy) && ISD == ISD::SETCC) {
     auto LT = getTypeLegalizationCost(ValTy);
-    // Cost v4f16 FCmp without FP16 support via converting to v4f32 and back.
+    // Cost v#f16 FCmp without FP16 support via converting to v#f32 and back.
     if (LT.second == MVT::v4f16 && !ST->hasFullFP16())
       return LT.first * 4; // fcvtl + fcvtl + fcmp + xtn
+    if (LT.second == MVT::v8f16 && !ST->hasFullFP16())
+      return LT.first * 8; // 2*(fcvtl + fcvtl2 + fcmp) + uzp1 + xtn
   }
 
   // Treat the icmp in icmp(and, 0) as free, as we can make use of ands.
diff --git a/llvm/test/Analysis/CostModel/AArch64/cmp.ll b/llvm/test/Analysis/CostModel/AArch64/cmp.ll
index c79005ee0e5d4..ea43057ec3d2d 100644
--- a/llvm/test/Analysis/CostModel/AArch64/cmp.ll
+++ b/llvm/test/Analysis/CostModel/AArch64/cmp.ll
@@ -16,7 +16,7 @@ define void @cmps() {
 ; CHECK-NEXT:  Cost Model: Found costs of 1 for: %cf16 = fcmp oge half undef, undef
 ; CHECK-NEXT:  Cost Model: Found costs of 1 for: %cf32 = fcmp ogt float undef, undef
 ; CHECK-NEXT:  Cost Model: Found costs of 1 for: %cf64 = fcmp ogt double undef, undef
-; CHECK-NEXT:  Cost Model: Found costs of RThru:22 CodeSize:1 Lat:1 SizeLat:1 for: %cfv816 = fcmp olt <8 x half> undef, undef
+; CHECK-NEXT:  Cost Model: Found costs of RThru:8 CodeSize:1 Lat:1 SizeLat:1 for: %cfv816 = fcmp olt <8 x half> undef, undef
 ; CHECK-NEXT:  Cost Model: Found costs of 1 for: %cfv432 = fcmp oge <4 x float> undef, undef
 ; CHECK-NEXT:  Cost Model: Found costs of 1 for: %cfv264 = fcmp oge <2 x double> undef, undef
 ; CHECK-NEXT:  Cost Model: Found costs of RThru:0 CodeSize:1 Lat:1 SizeLat:1 for: ret void
diff --git a/llvm/test/Analysis/CostModel/AArch64/vector-select.ll b/llvm/test/Analysis/CostModel/AArch64/vector-select.ll
index e35eabfb35f58..452a782ed5026 100644
--- a/llvm/test/Analysis/CostModel/AArch64/vector-select.ll
+++ b/llvm/test/Analysis/CostModel/AArch64/vector-select.ll
@@ -119,7 +119,7 @@ define <4 x half> @v4f16_select_ogt(<4 x half> %a, <4 x half> %b, <4 x half> %c)
 
 define <8 x half> @v8f16_select_ogt(<8 x half> %a, <8 x half> %b, <8 x half> %c) {
 ; COST-NOFP16-LABEL: 'v8f16_select_ogt'
-; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:22 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp ogt <8 x half> %a, %b
+; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:8 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp ogt <8 x half> %a, %b
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:2 CodeSize:1 Lat:1 SizeLat:1 for: %s.1 = select <8 x i1> %cmp.1, <8 x half> %a, <8 x half> %c
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:0 CodeSize:1 Lat:1 SizeLat:1 for: ret <8 x half> %s.1
 ;
@@ -184,7 +184,7 @@ define <4 x half> @v4f16_select_oge(<4 x half> %a, <4 x half> %b, <4 x half> %c)
 
 define <8 x half> @v8f16_select_oge(<8 x half> %a, <8 x half> %b, <8 x half> %c) {
 ; COST-NOFP16-LABEL: 'v8f16_select_oge'
-; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:22 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp oge <8 x half> %a, %b
+; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:8 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp oge <8 x half> %a, %b
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:2 CodeSize:1 Lat:1 SizeLat:1 for: %s.1 = select <8 x i1> %cmp.1, <8 x half> %a, <8 x half> %c
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:0 CodeSize:1 Lat:1 SizeLat:1 for: ret <8 x half> %s.1
 ;
@@ -249,7 +249,7 @@ define <4 x half> @v4f16_select_olt(<4 x half> %a, <4 x half> %b, <4 x half> %c)
 
 define <8 x half> @v8f16_select_olt(<8 x half> %a, <8 x half> %b, <8 x half> %c) {
 ; COST-NOFP16-LABEL: 'v8f16_select_olt'
-; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:22 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp olt <8 x half> %a, %b
+; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:8 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp olt <8 x half> %a, %b
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:2 CodeSize:1 Lat:1 SizeLat:1 for: %s.1 = select <8 x i1> %cmp.1, <8 x half> %a, <8 x half> %c
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:0 CodeSize:1 Lat:1 SizeLat:1 for: ret <8 x half> %s.1
 ;
@@ -314,7 +314,7 @@ define <4 x half> @v4f16_select_ole(<4 x half> %a, <4 x half> %b, <4 x half> %c)
 
 define <8 x half> @v8f16_select_ole(<8 x half> %a, <8 x half> %b, <8 x half> %c) {
 ; COST-NOFP16-LABEL: 'v8f16_select_ole'
-; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:22 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp ole <8 x half> %a, %b
+; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:8 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp ole <8 x half> %a, %b
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:2 CodeSize:1 Lat:1 SizeLat:1 for: %s.1 = select <8 x i1> %cmp.1, <8 x half> %a, <8 x half> %c
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:0 CodeSize:1 Lat:1 SizeLat:1 for: ret <8 x half> %s.1
 ;
@@ -379,7 +379,7 @@ define <4 x half> @v4f16_select_oeq(<4 x half> %a, <4 x half> %b, <4 x half> %c)
 
 define <8 x half> @v8f16_select_oeq(<8 x half> %a, <8 x half> %b, <8 x half> %c) {
 ; COST-NOFP16-LABEL: 'v8f16_select_oeq'
-; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:22 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp oeq <8 x half> %a, %b
+; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:8 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp oeq <8 x half> %a, %b
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:2 CodeSize:1 Lat:1 SizeLat:1 for: %s.1 = select <8 x i1> %cmp.1, <8 x half> %a, <8 x half> %c
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:0 CodeSize:1 Lat:1 SizeLat:1 for: ret <8 x half> %s.1
 ;
@@ -444,7 +444,7 @@ define <4 x half> @v4f16_select_one(<4 x half> %a, <4 x half> %b, <4 x half> %c)
 
 define <8 x half> @v8f16_select_one(<8 x half> %a, <8 x half> %b, <8 x half> %c) {
 ; COST-NOFP16-LABEL: 'v8f16_select_one'
-; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:22 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp one <8 x half> %a, %b
+; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:8 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp one <8 x half> %a, %b
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:2 CodeSize:1 Lat:1 SizeLat:1 for: %s.1 = select <8 x i1> %cmp.1, <8 x half> %a, <8 x half> %c
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:0 CodeSize:1 Lat:1 SizeLat:1 for: ret <8 x half> %s.1
 ;
@@ -513,7 +513,7 @@ define <4 x half> @v4f16_select_une(<4 x half> %a, <4 x half> %b, <4 x half> %c)
 
 define <8 x half> @v8f16_select_une(<8 x half> %a, <8 x half> %b, <8 x half> %c) {
 ; COST-NOFP16-LABEL: 'v8f16_select_une'
-; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:22 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp une <8 x half> %a, %b
+; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:8 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp une <8 x half> %a, %b
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:2 CodeSize:1 Lat:1 SizeLat:1 for: %s.1 = select <8 x i1> %cmp.1, <8 x half> %a, <8 x half> %c
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:0 CodeSize:1 Lat:1 SizeLat:1 for: ret <8 x half> %s.1
 ;
diff --git a/llvm/test/CodeGen/AArch64/bf16-v8-instructions.ll b/llvm/test/CodeGen/AArch64/bf16-v8-instructions.ll
index 3a55b68f2d1a3..715693dd6ed07 100644
--- a/llvm/test/CodeGen/AArch64/bf16-v8-instructions.ll
+++ b/llvm/test/CodeGen/AArch64/bf16-v8-instructions.ll
@@ -882,61 +882,17 @@ define <8 x i16> @fptoui_i16(<8 x bfloat> %a) #0 {
 define <8 x i1> @test_fcmp_une(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 ; CHECK-LABEL: test_fcmp_une:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    dup v2.4h, v1.h[1]
-; CHECK-NEXT:    dup v3.4h, v0.h[1]
-; CHECK-NEXT:    dup v4.4h, v1.h[2]
-; CHECK-NEXT:    dup v5.4h, v0.h[2]
-; CHECK-NEXT:    dup v6.4h, v0.h[3]
-; CHECK-NEXT:    shll v2.4s, v2.4h, #16
-; CHECK-NEXT:    shll v3.4s, v3.4h, #16
-; CHECK-NEXT:    fcmp s3, s2
-; CHECK-NEXT:    shll v2.4s, v1.4h, #16
-; CHECK-NEXT:    shll v3.4s, v0.4h, #16
-; CHECK-NEXT:    csetm w8, ne
-; CHECK-NEXT:    fcmp s3, s2
-; CHECK-NEXT:    shll v3.4s, v4.4h, #16
-; CHECK-NEXT:    shll v4.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.4h, v1.h[3]
-; CHECK-NEXT:    csetm w9, ne
-; CHECK-NEXT:    fmov s2, w9
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[4]
-; CHECK-NEXT:    dup v6.8h, v0.h[4]
-; CHECK-NEXT:    mov v2.h[1], w8
-; CHECK-NEXT:    csetm w8, ne
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[5]
-; CHECK-NEXT:    dup v6.8h, v0.h[5]
-; CHECK-NEXT:    mov v2.h[2], w8
-; CHECK-NEXT:    csetm w8, ne
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[6]
-; CHECK-NEXT:    dup v6.8h, v0.h[6]
-; CHECK-NEXT:    dup v1.8h, v1.h[7]
-; CHECK-NEXT:    dup v0.8h, v0.h[7]
-; CHECK-NEXT:    mov v2.h[3], w8
-; CHECK-NEXT:    csetm w8, ne
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
+; CHECK-NEXT:    shll2 v2.4s, v1.8h, #16
+; CHECK-NEXT:    shll2 v3.4s, v0.8h, #16
 ; CHECK-NEXT:    shll v1.4s, v1.4h, #16
 ; CHECK-NEXT:    shll v0.4s, v0.4h, #16
-; CHECK-NEXT:    mov v2.h[4], w8
-; CHECK-NEXT:    csetm w8, ne
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    mov v2.h[5], w8
-; CHECK-NEXT:    csetm w8, ne
-; CHECK-NEXT:    fcmp s0, s1
-; CHECK-NEXT:    mov v2.h[6], w8
-; CHECK-NEXT:    csetm w8, ne
-; CHECK-NEXT:    mov v2.h[7], w8
-; CHECK-NEXT:    xtn v0.8b, v2.8h
+; CHECK-NEXT:    fcmeq v2.4s, v3.4s, v2.4s
+; CHECK-NEXT:    fcmeq v0.4s, v0.4s, v1.4s
+; CHECK-NEXT:    uzp1 v0.8h, v0.8h, v2.8h
+; CHECK-NEXT:    mvn v0.16b, v0.16b
+; CHECK-NEXT:    shl v0.8h, v0.8h, #15
+; CHECK-NEXT:    cmlt v0.8h, v0.8h, #0
+; CHECK-NEXT:    xtn v0.8b, v0.8h
 ; CHECK-NEXT:    ret
   %1 = fcmp une <8 x bfloat> %a, %b
   ret <8 x i1> %1
@@ -945,69 +901,21 @@ define <8 x i1> @test_fcmp_une(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 define <8 x i1> @test_fcmp_ueq(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 ; CHECK-LABEL: test_fcmp_ueq:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    dup v2.4h, v1.h[1]
-; CHECK-NEXT:    dup v3.4h, v0.h[1]
-; CHECK-NEXT:    dup v4.4h, v1.h[2]
-; CHECK-NEXT:    dup v5.4h, v0.h[2]
-; CHECK-NEXT:    dup v6.4h, v0.h[3]
-; CHECK-NEXT:    shll v2.4s, v2.4h, #16
-; CHECK-NEXT:    shll v3.4s, v3.4h, #16
-; CHECK-NEXT:    fcmp s3, s2
-; CHECK-NEXT:    shll v2.4s, v1.4h, #16
-; CHECK-NEXT:    shll v3.4s, v0.4h, #16
-; CHECK-NEXT:    csetm w8, eq
-; CHECK-NEXT:    csinv w8, w8, wzr, vc
-; CHECK-NEXT:    fcmp s3, s2
-; CHECK-NEXT:    shll v3.4s, v4.4h, #16
-; CHECK-NEXT:    shll v4.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.4h, v1.h[3]
-; CHECK-NEXT:    csetm w9, eq
-; CHECK-NEXT:    csinv w9, w9, wzr, vc
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    fmov s2, w9
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[4]
-; CHECK-NEXT:    dup v6.8h, v0.h[4]
-; CHECK-NEXT:    mov v2.h[1], w8
-; CHECK-NEXT:    csetm w8, eq
-; CHECK-NEXT:    csinv w8, w8, wzr, vc
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[5]
-; CHECK-NEXT:    dup v6.8h, v0.h[5]
-; CHECK-NEXT:    mov v2.h[2], w8
-; CHECK-NEXT:    csetm w8, eq
-; CHECK-NEXT:    csinv w8, w8, wzr, vc
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[6]
-; CHECK-NEXT:    dup v6.8h, v0.h[6]
-; CHECK-NEXT:    dup v1.8h, v1.h[7]
-; CHECK-NEXT:    dup v0.8h, v0.h[7]
-; CHECK-NEXT:    mov v2.h[3], w8
-; CHECK-NEXT:    csetm w8, eq
-; CHECK-NEXT:    csinv w8, w8, wzr, vc
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
+; CHECK-NEXT:    shll2 v2.4s, v1.8h, #16
+; CHECK-NEXT:    shll2 v3.4s, v0.8h, #16
 ; CHECK-NEXT:    shll v1.4s, v1.4h, #16
 ; CHECK-NEXT:    shll v0.4s, v0.4h, #16
-; CHECK-NEXT:    mov v2.h[4], w8
-; CHECK-NEXT:    csetm w8, eq
-; CHECK-NEXT:    csinv w8, w8, wzr, vc
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    mov v2.h[5], w8
-; CHECK-NEXT:    csetm w8, eq
-; CHECK-NEXT:    csinv w8, w8, wzr, vc
-; CHECK-NEXT:    fcmp s0, s1
-; CHECK-NEXT:    mov v2.h[6], w8
-; CHECK-NEXT:    csetm w8, eq
-; CHECK-NEXT:    csinv w8, w8, wzr, vc
-; CHECK-NEXT:    mov v2.h[7], w8
-; CHECK-NEXT:    xtn v0.8b, v2.8h
+; CHECK-NEXT:    fcmgt v4.4s, v3.4s, v2.4s
+; CHECK-NEXT:    fcmgt v2.4s, v2.4s, v3.4s
+; CHECK-NEXT:    fcmgt v3.4s, v0.4s, v1.4s
+; CHECK-NEXT:    fcmgt v0.4s, v1.4s, v0.4s
+; CHECK-NEXT:    orr v1.16b, v2.16b, v4.16b
+; CHECK-NEXT:    orr v0.16b, v0.16b, v3.16b
+; CHECK-NEXT:    uzp1 v0.8h, v0.8h, v1.8h
+; CHECK-NEXT:    mvn v0.16b, v0.16b
+; CHECK-NEXT:    shl v0.8h, v0.8h, #15
+; CHECK-NEXT:    cmlt v0.8h, v0.8h, #0
+; CHECK-NEXT:    xtn v0.8b, v0.8h
 ; CHECK-NEXT:    ret
   %1 = fcmp ueq <8 x bfloat> %a, %b
   ret <8 x i1> %1
@@ -1016,61 +924,17 @@ define <8 x i1> @test_fcmp_ueq(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 define <8 x i1> @test_fcmp_ugt(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 ; CHECK-LABEL: test_fcmp_ugt:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    dup v2.4h, v1.h[1]
-; CHECK-NEXT:    dup v3.4h, v0.h[1]
-; CHECK-NEXT:    dup v4.4h, v1.h[2]
-; CHECK-NEXT:    dup v5.4h, v0.h[2]
-; CHECK-NEXT:    dup v6.4h, v0.h[3]
-; CHECK-NEXT:    shll v2.4s, v2.4h, #16
-; CHECK-NEXT:    shll v3.4s, v3.4h, #16
-; CHECK-NEXT:    fcmp s3, s2
-; CHECK-NEXT:    shll v2.4s, v1.4h, #16
-; CHECK-NEXT:    shll v3.4s, v0.4h, #16
-; CHECK-NEXT:    csetm w8, hi
-; CHECK-NEXT:    fcmp s3, s2
-; CHECK-NEXT:    shll v3.4s, v4.4h, #16
-; CHECK-NEXT:    shll v4.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.4h, v1.h[3]
-; CHECK-NEXT:    csetm w9, hi
-; CHECK-NEXT:    fmov s2, w9
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[4]
-; CHECK-NEXT:    dup v6.8h, v0.h[4]
-; CHECK-NEXT:    mov v2.h[1], w8
-; CHECK-NEXT:    csetm w8, hi
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[5]
-; CHECK-NEXT:    dup v6.8h, v0.h[5]
-; CHECK-NEXT:    mov v2.h[2], w8
-; CHECK-NEXT:    csetm w8, hi
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[6]
-; CHECK-NEXT:    dup v6.8h, v0.h[6]
-; CHECK-NEXT:    dup v1.8h, v1.h[7]
-; CHECK-NEXT:    dup v0.8h, v0.h[7]
-; CHECK-NEXT:    mov v2.h[3], w8
-; CHECK-NEXT:    csetm w8, hi
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    shll v1.4s, v1.4h, #16
+; CHECK-NEXT:    shll2 v2.4s, v0.8h, #16
+; CHECK-NEXT:    shll2 v3.4s, v1.8h, #16
 ; CHECK-NEXT:    shll v0.4s, v0.4h, #16
-; CHECK-NEXT:    mov v2.h[4], w8
-; CHECK-NEXT:    csetm w8, hi
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    mov v2.h[5], w8
-; CHECK-NEXT:    csetm w8, hi
-; CHECK-NEXT:    fcmp s0, s1
-; CHECK-NEXT:    mov v2.h[6], w8
-; CHECK-NEXT:    csetm w8, hi
-; CHECK-NEXT:    mov v2.h[7], w8
-; CHECK-NEXT:    xtn v0.8b, v2.8h
+; CHECK-NEXT:    shll v1.4s, v1.4h, #16
+; CHECK-NEXT:    fcmge v2.4s, v3.4s, v2.4s
+; CHECK-NEXT:    fcmge v...
[truncated]


setOperationAction(ISD::FABS, V4Narrow, Legal);
setOperationAction(ISD::FNEG, V4Narrow, Legal);
setOperationAction(ISD::FNEG, V4Narrow, Legal);
Copy link
Collaborator Author

@paulwalker-arm paulwalker-arm Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a tab removal.

@llvmbot
Copy link
Member

llvmbot commented Apr 11, 2025

@llvm/pr-subscribers-llvm-analysis

Author: Paul Walker (paulwalker-arm)

Changes

I have also removed custom promotion code for the v4{f16,bf16} cases because the same common code can be used.


Patch is 115.40 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/135398.diff

7 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+11-25)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+3-1)
  • (modified) llvm/test/Analysis/CostModel/AArch64/cmp.ll (+1-1)
  • (modified) llvm/test/Analysis/CostModel/AArch64/vector-select.ll (+7-7)
  • (modified) llvm/test/CodeGen/AArch64/bf16-v8-instructions.ll (+139-762)
  • (modified) llvm/test/CodeGen/AArch64/fcmp.ll (+111-471)
  • (modified) llvm/test/CodeGen/AArch64/fp16-v8-instructions.ll (+163-786)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index e366d7cb54490..92b77fde8bab2 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -841,11 +841,11 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
     setOperationPromotedToType(ISD::FRINT,      V4Narrow, MVT::v4f32);
     setOperationPromotedToType(ISD::FNEARBYINT, V4Narrow, MVT::v4f32);
     setOperationPromotedToType(ISD::FCANONICALIZE, V4Narrow, MVT::v4f32);
+    setOperationPromotedToType(ISD::SETCC,         V4Narrow, MVT::v4f32);
 
     setOperationAction(ISD::FABS,        V4Narrow, Legal);
-    setOperationAction(ISD::FNEG, 	 V4Narrow, Legal);
+    setOperationAction(ISD::FNEG,        V4Narrow, Legal);
     setOperationAction(ISD::FMA,         V4Narrow, Expand);
-    setOperationAction(ISD::SETCC,       V4Narrow, Custom);
     setOperationAction(ISD::BR_CC,       V4Narrow, Expand);
     setOperationAction(ISD::SELECT,      V4Narrow, Expand);
     setOperationAction(ISD::SELECT_CC,   V4Narrow, Expand);
@@ -853,6 +853,9 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::FSQRT,       V4Narrow, Expand);
 
     auto V8Narrow = MVT::getVectorVT(ScalarVT, 8);
+    setOperationPromotedToType(ISD::FCANONICALIZE, V8Narrow, MVT::v8f32);
+    setOperationPromotedToType(ISD::SETCC,         V8Narrow, MVT::v8f32);
+
     setOperationAction(ISD::FABS,        V8Narrow, Legal);
     setOperationAction(ISD::FADD,        V8Narrow, Legal);
     setOperationAction(ISD::FCEIL,       V8Narrow, Legal);
@@ -862,19 +865,17 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::FMA,         V8Narrow, Expand);
     setOperationAction(ISD::FMUL,        V8Narrow, Legal);
     setOperationAction(ISD::FNEARBYINT,  V8Narrow, Legal);
-    setOperationAction(ISD::FNEG, 	 V8Narrow, Legal);
+    setOperationAction(ISD::FNEG,        V8Narrow, Legal);
     setOperationAction(ISD::FROUND,      V8Narrow, Legal);
     setOperationAction(ISD::FROUNDEVEN,  V8Narrow, Legal);
     setOperationAction(ISD::FRINT,       V8Narrow, Legal);
     setOperationAction(ISD::FSQRT,       V8Narrow, Expand);
     setOperationAction(ISD::FSUB,        V8Narrow, Legal);
     setOperationAction(ISD::FTRUNC,      V8Narrow, Legal);
-    setOperationAction(ISD::SETCC,       V8Narrow, Expand);
     setOperationAction(ISD::BR_CC,       V8Narrow, Expand);
     setOperationAction(ISD::SELECT,      V8Narrow, Expand);
     setOperationAction(ISD::SELECT_CC,   V8Narrow, Expand);
     setOperationAction(ISD::FP_EXTEND,   V8Narrow, Expand);
-    setOperationPromotedToType(ISD::FCANONICALIZE, V8Narrow, MVT::v8f32);
   };
 
   if (!Subtarget->hasFullFP16()) {
@@ -15905,6 +15906,11 @@ SDValue AArch64TargetLowering::LowerVSETCC(SDValue Op,
   if (LHS.getValueType().getVectorElementType().isInteger())
     return Op;
 
+  assert((!Subtarget->hasFullFP16() &&
+          LHS.getValueType().getVectorElementType() != MVT::f16) ||
+         LHS.getValueType().getVectorElementType() != MVT::bf16 ||
+         LHS.getValueType().getVectorElementType() != MVT::f128);
+
   // Lower isnan(x) | isnan(never-nan) to x != x.
   // Lower !isnan(x) & !isnan(never-nan) to x == x.
   if (CC == ISD::SETUO || CC == ISD::SETO) {
@@ -15923,26 +15929,6 @@ SDValue AArch64TargetLowering::LowerVSETCC(SDValue Op,
     }
   }
 
-  const bool FullFP16 = DAG.getSubtarget<AArch64Subtarget>().hasFullFP16();
-
-  // Make v4f16 (only) fcmp operations utilise vector instructions
-  // v8f16 support will be a litle more complicated
-  if ((!FullFP16 && LHS.getValueType().getVectorElementType() == MVT::f16) ||
-      LHS.getValueType().getVectorElementType() == MVT::bf16) {
-    if (LHS.getValueType().getVectorNumElements() == 4) {
-      LHS = DAG.getNode(ISD::FP_EXTEND, dl, MVT::v4f32, LHS);
-      RHS = DAG.getNode(ISD::FP_EXTEND, dl, MVT::v4f32, RHS);
-      SDValue NewSetcc = DAG.getSetCC(dl, MVT::v4i16, LHS, RHS, CC);
-      DAG.ReplaceAllUsesWith(Op, NewSetcc);
-      CmpVT = MVT::v4i32;
-    } else
-      return SDValue();
-  }
-
-  assert((!FullFP16 && LHS.getValueType().getVectorElementType() != MVT::f16) ||
-         LHS.getValueType().getVectorElementType() != MVT::bf16 ||
-         LHS.getValueType().getVectorElementType() != MVT::f128);
-
   // Unfortunately, the mapping of LLVM FP CC's onto AArch64 CC's isn't totally
   // clean.  Some of them require two branches to implement.
   AArch64CC::CondCode CC1, CC2;
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index b1d8277182add..7356392c51df7 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -4236,9 +4236,11 @@ InstructionCost AArch64TTIImpl::getCmpSelInstrCost(
 
   if (isa<FixedVectorType>(ValTy) && ISD == ISD::SETCC) {
     auto LT = getTypeLegalizationCost(ValTy);
-    // Cost v4f16 FCmp without FP16 support via converting to v4f32 and back.
+    // Cost v#f16 FCmp without FP16 support via converting to v#f32 and back.
     if (LT.second == MVT::v4f16 && !ST->hasFullFP16())
       return LT.first * 4; // fcvtl + fcvtl + fcmp + xtn
+    if (LT.second == MVT::v8f16 && !ST->hasFullFP16())
+      return LT.first * 8; // 2*(fcvtl + fcvtl2 + fcmp) + uzp1 + xtn
   }
 
   // Treat the icmp in icmp(and, 0) as free, as we can make use of ands.
diff --git a/llvm/test/Analysis/CostModel/AArch64/cmp.ll b/llvm/test/Analysis/CostModel/AArch64/cmp.ll
index c79005ee0e5d4..ea43057ec3d2d 100644
--- a/llvm/test/Analysis/CostModel/AArch64/cmp.ll
+++ b/llvm/test/Analysis/CostModel/AArch64/cmp.ll
@@ -16,7 +16,7 @@ define void @cmps() {
 ; CHECK-NEXT:  Cost Model: Found costs of 1 for: %cf16 = fcmp oge half undef, undef
 ; CHECK-NEXT:  Cost Model: Found costs of 1 for: %cf32 = fcmp ogt float undef, undef
 ; CHECK-NEXT:  Cost Model: Found costs of 1 for: %cf64 = fcmp ogt double undef, undef
-; CHECK-NEXT:  Cost Model: Found costs of RThru:22 CodeSize:1 Lat:1 SizeLat:1 for: %cfv816 = fcmp olt <8 x half> undef, undef
+; CHECK-NEXT:  Cost Model: Found costs of RThru:8 CodeSize:1 Lat:1 SizeLat:1 for: %cfv816 = fcmp olt <8 x half> undef, undef
 ; CHECK-NEXT:  Cost Model: Found costs of 1 for: %cfv432 = fcmp oge <4 x float> undef, undef
 ; CHECK-NEXT:  Cost Model: Found costs of 1 for: %cfv264 = fcmp oge <2 x double> undef, undef
 ; CHECK-NEXT:  Cost Model: Found costs of RThru:0 CodeSize:1 Lat:1 SizeLat:1 for: ret void
diff --git a/llvm/test/Analysis/CostModel/AArch64/vector-select.ll b/llvm/test/Analysis/CostModel/AArch64/vector-select.ll
index e35eabfb35f58..452a782ed5026 100644
--- a/llvm/test/Analysis/CostModel/AArch64/vector-select.ll
+++ b/llvm/test/Analysis/CostModel/AArch64/vector-select.ll
@@ -119,7 +119,7 @@ define <4 x half> @v4f16_select_ogt(<4 x half> %a, <4 x half> %b, <4 x half> %c)
 
 define <8 x half> @v8f16_select_ogt(<8 x half> %a, <8 x half> %b, <8 x half> %c) {
 ; COST-NOFP16-LABEL: 'v8f16_select_ogt'
-; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:22 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp ogt <8 x half> %a, %b
+; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:8 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp ogt <8 x half> %a, %b
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:2 CodeSize:1 Lat:1 SizeLat:1 for: %s.1 = select <8 x i1> %cmp.1, <8 x half> %a, <8 x half> %c
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:0 CodeSize:1 Lat:1 SizeLat:1 for: ret <8 x half> %s.1
 ;
@@ -184,7 +184,7 @@ define <4 x half> @v4f16_select_oge(<4 x half> %a, <4 x half> %b, <4 x half> %c)
 
 define <8 x half> @v8f16_select_oge(<8 x half> %a, <8 x half> %b, <8 x half> %c) {
 ; COST-NOFP16-LABEL: 'v8f16_select_oge'
-; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:22 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp oge <8 x half> %a, %b
+; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:8 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp oge <8 x half> %a, %b
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:2 CodeSize:1 Lat:1 SizeLat:1 for: %s.1 = select <8 x i1> %cmp.1, <8 x half> %a, <8 x half> %c
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:0 CodeSize:1 Lat:1 SizeLat:1 for: ret <8 x half> %s.1
 ;
@@ -249,7 +249,7 @@ define <4 x half> @v4f16_select_olt(<4 x half> %a, <4 x half> %b, <4 x half> %c)
 
 define <8 x half> @v8f16_select_olt(<8 x half> %a, <8 x half> %b, <8 x half> %c) {
 ; COST-NOFP16-LABEL: 'v8f16_select_olt'
-; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:22 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp olt <8 x half> %a, %b
+; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:8 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp olt <8 x half> %a, %b
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:2 CodeSize:1 Lat:1 SizeLat:1 for: %s.1 = select <8 x i1> %cmp.1, <8 x half> %a, <8 x half> %c
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:0 CodeSize:1 Lat:1 SizeLat:1 for: ret <8 x half> %s.1
 ;
@@ -314,7 +314,7 @@ define <4 x half> @v4f16_select_ole(<4 x half> %a, <4 x half> %b, <4 x half> %c)
 
 define <8 x half> @v8f16_select_ole(<8 x half> %a, <8 x half> %b, <8 x half> %c) {
 ; COST-NOFP16-LABEL: 'v8f16_select_ole'
-; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:22 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp ole <8 x half> %a, %b
+; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:8 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp ole <8 x half> %a, %b
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:2 CodeSize:1 Lat:1 SizeLat:1 for: %s.1 = select <8 x i1> %cmp.1, <8 x half> %a, <8 x half> %c
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:0 CodeSize:1 Lat:1 SizeLat:1 for: ret <8 x half> %s.1
 ;
@@ -379,7 +379,7 @@ define <4 x half> @v4f16_select_oeq(<4 x half> %a, <4 x half> %b, <4 x half> %c)
 
 define <8 x half> @v8f16_select_oeq(<8 x half> %a, <8 x half> %b, <8 x half> %c) {
 ; COST-NOFP16-LABEL: 'v8f16_select_oeq'
-; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:22 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp oeq <8 x half> %a, %b
+; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:8 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp oeq <8 x half> %a, %b
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:2 CodeSize:1 Lat:1 SizeLat:1 for: %s.1 = select <8 x i1> %cmp.1, <8 x half> %a, <8 x half> %c
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:0 CodeSize:1 Lat:1 SizeLat:1 for: ret <8 x half> %s.1
 ;
@@ -444,7 +444,7 @@ define <4 x half> @v4f16_select_one(<4 x half> %a, <4 x half> %b, <4 x half> %c)
 
 define <8 x half> @v8f16_select_one(<8 x half> %a, <8 x half> %b, <8 x half> %c) {
 ; COST-NOFP16-LABEL: 'v8f16_select_one'
-; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:22 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp one <8 x half> %a, %b
+; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:8 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp one <8 x half> %a, %b
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:2 CodeSize:1 Lat:1 SizeLat:1 for: %s.1 = select <8 x i1> %cmp.1, <8 x half> %a, <8 x half> %c
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:0 CodeSize:1 Lat:1 SizeLat:1 for: ret <8 x half> %s.1
 ;
@@ -513,7 +513,7 @@ define <4 x half> @v4f16_select_une(<4 x half> %a, <4 x half> %b, <4 x half> %c)
 
 define <8 x half> @v8f16_select_une(<8 x half> %a, <8 x half> %b, <8 x half> %c) {
 ; COST-NOFP16-LABEL: 'v8f16_select_une'
-; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:22 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp une <8 x half> %a, %b
+; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:8 CodeSize:1 Lat:1 SizeLat:1 for: %cmp.1 = fcmp une <8 x half> %a, %b
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:2 CodeSize:1 Lat:1 SizeLat:1 for: %s.1 = select <8 x i1> %cmp.1, <8 x half> %a, <8 x half> %c
 ; COST-NOFP16-NEXT:  Cost Model: Found costs of RThru:0 CodeSize:1 Lat:1 SizeLat:1 for: ret <8 x half> %s.1
 ;
diff --git a/llvm/test/CodeGen/AArch64/bf16-v8-instructions.ll b/llvm/test/CodeGen/AArch64/bf16-v8-instructions.ll
index 3a55b68f2d1a3..715693dd6ed07 100644
--- a/llvm/test/CodeGen/AArch64/bf16-v8-instructions.ll
+++ b/llvm/test/CodeGen/AArch64/bf16-v8-instructions.ll
@@ -882,61 +882,17 @@ define <8 x i16> @fptoui_i16(<8 x bfloat> %a) #0 {
 define <8 x i1> @test_fcmp_une(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 ; CHECK-LABEL: test_fcmp_une:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    dup v2.4h, v1.h[1]
-; CHECK-NEXT:    dup v3.4h, v0.h[1]
-; CHECK-NEXT:    dup v4.4h, v1.h[2]
-; CHECK-NEXT:    dup v5.4h, v0.h[2]
-; CHECK-NEXT:    dup v6.4h, v0.h[3]
-; CHECK-NEXT:    shll v2.4s, v2.4h, #16
-; CHECK-NEXT:    shll v3.4s, v3.4h, #16
-; CHECK-NEXT:    fcmp s3, s2
-; CHECK-NEXT:    shll v2.4s, v1.4h, #16
-; CHECK-NEXT:    shll v3.4s, v0.4h, #16
-; CHECK-NEXT:    csetm w8, ne
-; CHECK-NEXT:    fcmp s3, s2
-; CHECK-NEXT:    shll v3.4s, v4.4h, #16
-; CHECK-NEXT:    shll v4.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.4h, v1.h[3]
-; CHECK-NEXT:    csetm w9, ne
-; CHECK-NEXT:    fmov s2, w9
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[4]
-; CHECK-NEXT:    dup v6.8h, v0.h[4]
-; CHECK-NEXT:    mov v2.h[1], w8
-; CHECK-NEXT:    csetm w8, ne
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[5]
-; CHECK-NEXT:    dup v6.8h, v0.h[5]
-; CHECK-NEXT:    mov v2.h[2], w8
-; CHECK-NEXT:    csetm w8, ne
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[6]
-; CHECK-NEXT:    dup v6.8h, v0.h[6]
-; CHECK-NEXT:    dup v1.8h, v1.h[7]
-; CHECK-NEXT:    dup v0.8h, v0.h[7]
-; CHECK-NEXT:    mov v2.h[3], w8
-; CHECK-NEXT:    csetm w8, ne
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
+; CHECK-NEXT:    shll2 v2.4s, v1.8h, #16
+; CHECK-NEXT:    shll2 v3.4s, v0.8h, #16
 ; CHECK-NEXT:    shll v1.4s, v1.4h, #16
 ; CHECK-NEXT:    shll v0.4s, v0.4h, #16
-; CHECK-NEXT:    mov v2.h[4], w8
-; CHECK-NEXT:    csetm w8, ne
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    mov v2.h[5], w8
-; CHECK-NEXT:    csetm w8, ne
-; CHECK-NEXT:    fcmp s0, s1
-; CHECK-NEXT:    mov v2.h[6], w8
-; CHECK-NEXT:    csetm w8, ne
-; CHECK-NEXT:    mov v2.h[7], w8
-; CHECK-NEXT:    xtn v0.8b, v2.8h
+; CHECK-NEXT:    fcmeq v2.4s, v3.4s, v2.4s
+; CHECK-NEXT:    fcmeq v0.4s, v0.4s, v1.4s
+; CHECK-NEXT:    uzp1 v0.8h, v0.8h, v2.8h
+; CHECK-NEXT:    mvn v0.16b, v0.16b
+; CHECK-NEXT:    shl v0.8h, v0.8h, #15
+; CHECK-NEXT:    cmlt v0.8h, v0.8h, #0
+; CHECK-NEXT:    xtn v0.8b, v0.8h
 ; CHECK-NEXT:    ret
   %1 = fcmp une <8 x bfloat> %a, %b
   ret <8 x i1> %1
@@ -945,69 +901,21 @@ define <8 x i1> @test_fcmp_une(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 define <8 x i1> @test_fcmp_ueq(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 ; CHECK-LABEL: test_fcmp_ueq:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    dup v2.4h, v1.h[1]
-; CHECK-NEXT:    dup v3.4h, v0.h[1]
-; CHECK-NEXT:    dup v4.4h, v1.h[2]
-; CHECK-NEXT:    dup v5.4h, v0.h[2]
-; CHECK-NEXT:    dup v6.4h, v0.h[3]
-; CHECK-NEXT:    shll v2.4s, v2.4h, #16
-; CHECK-NEXT:    shll v3.4s, v3.4h, #16
-; CHECK-NEXT:    fcmp s3, s2
-; CHECK-NEXT:    shll v2.4s, v1.4h, #16
-; CHECK-NEXT:    shll v3.4s, v0.4h, #16
-; CHECK-NEXT:    csetm w8, eq
-; CHECK-NEXT:    csinv w8, w8, wzr, vc
-; CHECK-NEXT:    fcmp s3, s2
-; CHECK-NEXT:    shll v3.4s, v4.4h, #16
-; CHECK-NEXT:    shll v4.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.4h, v1.h[3]
-; CHECK-NEXT:    csetm w9, eq
-; CHECK-NEXT:    csinv w9, w9, wzr, vc
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    fmov s2, w9
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[4]
-; CHECK-NEXT:    dup v6.8h, v0.h[4]
-; CHECK-NEXT:    mov v2.h[1], w8
-; CHECK-NEXT:    csetm w8, eq
-; CHECK-NEXT:    csinv w8, w8, wzr, vc
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[5]
-; CHECK-NEXT:    dup v6.8h, v0.h[5]
-; CHECK-NEXT:    mov v2.h[2], w8
-; CHECK-NEXT:    csetm w8, eq
-; CHECK-NEXT:    csinv w8, w8, wzr, vc
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[6]
-; CHECK-NEXT:    dup v6.8h, v0.h[6]
-; CHECK-NEXT:    dup v1.8h, v1.h[7]
-; CHECK-NEXT:    dup v0.8h, v0.h[7]
-; CHECK-NEXT:    mov v2.h[3], w8
-; CHECK-NEXT:    csetm w8, eq
-; CHECK-NEXT:    csinv w8, w8, wzr, vc
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
+; CHECK-NEXT:    shll2 v2.4s, v1.8h, #16
+; CHECK-NEXT:    shll2 v3.4s, v0.8h, #16
 ; CHECK-NEXT:    shll v1.4s, v1.4h, #16
 ; CHECK-NEXT:    shll v0.4s, v0.4h, #16
-; CHECK-NEXT:    mov v2.h[4], w8
-; CHECK-NEXT:    csetm w8, eq
-; CHECK-NEXT:    csinv w8, w8, wzr, vc
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    mov v2.h[5], w8
-; CHECK-NEXT:    csetm w8, eq
-; CHECK-NEXT:    csinv w8, w8, wzr, vc
-; CHECK-NEXT:    fcmp s0, s1
-; CHECK-NEXT:    mov v2.h[6], w8
-; CHECK-NEXT:    csetm w8, eq
-; CHECK-NEXT:    csinv w8, w8, wzr, vc
-; CHECK-NEXT:    mov v2.h[7], w8
-; CHECK-NEXT:    xtn v0.8b, v2.8h
+; CHECK-NEXT:    fcmgt v4.4s, v3.4s, v2.4s
+; CHECK-NEXT:    fcmgt v2.4s, v2.4s, v3.4s
+; CHECK-NEXT:    fcmgt v3.4s, v0.4s, v1.4s
+; CHECK-NEXT:    fcmgt v0.4s, v1.4s, v0.4s
+; CHECK-NEXT:    orr v1.16b, v2.16b, v4.16b
+; CHECK-NEXT:    orr v0.16b, v0.16b, v3.16b
+; CHECK-NEXT:    uzp1 v0.8h, v0.8h, v1.8h
+; CHECK-NEXT:    mvn v0.16b, v0.16b
+; CHECK-NEXT:    shl v0.8h, v0.8h, #15
+; CHECK-NEXT:    cmlt v0.8h, v0.8h, #0
+; CHECK-NEXT:    xtn v0.8b, v0.8h
 ; CHECK-NEXT:    ret
   %1 = fcmp ueq <8 x bfloat> %a, %b
   ret <8 x i1> %1
@@ -1016,61 +924,17 @@ define <8 x i1> @test_fcmp_ueq(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 define <8 x i1> @test_fcmp_ugt(<8 x bfloat> %a, <8 x bfloat> %b) #0 {
 ; CHECK-LABEL: test_fcmp_ugt:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    dup v2.4h, v1.h[1]
-; CHECK-NEXT:    dup v3.4h, v0.h[1]
-; CHECK-NEXT:    dup v4.4h, v1.h[2]
-; CHECK-NEXT:    dup v5.4h, v0.h[2]
-; CHECK-NEXT:    dup v6.4h, v0.h[3]
-; CHECK-NEXT:    shll v2.4s, v2.4h, #16
-; CHECK-NEXT:    shll v3.4s, v3.4h, #16
-; CHECK-NEXT:    fcmp s3, s2
-; CHECK-NEXT:    shll v2.4s, v1.4h, #16
-; CHECK-NEXT:    shll v3.4s, v0.4h, #16
-; CHECK-NEXT:    csetm w8, hi
-; CHECK-NEXT:    fcmp s3, s2
-; CHECK-NEXT:    shll v3.4s, v4.4h, #16
-; CHECK-NEXT:    shll v4.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.4h, v1.h[3]
-; CHECK-NEXT:    csetm w9, hi
-; CHECK-NEXT:    fmov s2, w9
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[4]
-; CHECK-NEXT:    dup v6.8h, v0.h[4]
-; CHECK-NEXT:    mov v2.h[1], w8
-; CHECK-NEXT:    csetm w8, hi
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[5]
-; CHECK-NEXT:    dup v6.8h, v0.h[5]
-; CHECK-NEXT:    mov v2.h[2], w8
-; CHECK-NEXT:    csetm w8, hi
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    dup v5.8h, v1.h[6]
-; CHECK-NEXT:    dup v6.8h, v0.h[6]
-; CHECK-NEXT:    dup v1.8h, v1.h[7]
-; CHECK-NEXT:    dup v0.8h, v0.h[7]
-; CHECK-NEXT:    mov v2.h[3], w8
-; CHECK-NEXT:    csetm w8, hi
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    shll v3.4s, v5.4h, #16
-; CHECK-NEXT:    shll v4.4s, v6.4h, #16
-; CHECK-NEXT:    shll v1.4s, v1.4h, #16
+; CHECK-NEXT:    shll2 v2.4s, v0.8h, #16
+; CHECK-NEXT:    shll2 v3.4s, v1.8h, #16
 ; CHECK-NEXT:    shll v0.4s, v0.4h, #16
-; CHECK-NEXT:    mov v2.h[4], w8
-; CHECK-NEXT:    csetm w8, hi
-; CHECK-NEXT:    fcmp s4, s3
-; CHECK-NEXT:    mov v2.h[5], w8
-; CHECK-NEXT:    csetm w8, hi
-; CHECK-NEXT:    fcmp s0, s1
-; CHECK-NEXT:    mov v2.h[6], w8
-; CHECK-NEXT:    csetm w8, hi
-; CHECK-NEXT:    mov v2.h[7], w8
-; CHECK-NEXT:    xtn v0.8b, v2.8h
+; CHECK-NEXT:    shll v1.4s, v1.4h, #16
+; CHECK-NEXT:    fcmge v2.4s, v3.4s, v2.4s
+; CHECK-NEXT:    fcmge v...
[truncated]

Copy link

github-actions bot commented Apr 11, 2025

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff HEAD~1 HEAD --extensions cpp -- llvm/lib/Target/AArch64/AArch64ISelLowering.cpp llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 5201e1a3c..54b84b80e 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -843,11 +843,11 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
     setOperationPromotedToType(ISD::FRINT,      V4Narrow, MVT::v4f32);
     setOperationPromotedToType(ISD::FNEARBYINT, V4Narrow, MVT::v4f32);
     setOperationPromotedToType(ISD::FCANONICALIZE, V4Narrow, MVT::v4f32);
-    setOperationPromotedToType(ISD::SETCC,         V4Narrow, MVT::v4f32);
+    setOperationPromotedToType(ISD::SETCC, V4Narrow, MVT::v4f32);
 
     setOperationAction(ISD::FABS,        V4Narrow, Legal);
-    setOperationAction(ISD::FNEG,        V4Narrow, Legal);
-    setOperationAction(ISD::FMA,         V4Narrow, Expand);
+    setOperationAction(ISD::FNEG, V4Narrow, Legal);
+    setOperationAction(ISD::FMA, V4Narrow, Expand);
     setOperationAction(ISD::BR_CC,       V4Narrow, Expand);
     setOperationAction(ISD::SELECT,      V4Narrow, Expand);
     setOperationAction(ISD::SELECT_CC,   V4Narrow, Expand);
@@ -856,7 +856,7 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
 
     auto V8Narrow = MVT::getVectorVT(ScalarVT, 8);
     setOperationPromotedToType(ISD::FCANONICALIZE, V8Narrow, MVT::v8f32);
-    setOperationPromotedToType(ISD::SETCC,         V8Narrow, MVT::v8f32);
+    setOperationPromotedToType(ISD::SETCC, V8Narrow, MVT::v8f32);
 
     setOperationAction(ISD::FABS,        V8Narrow, Legal);
     setOperationAction(ISD::FADD,        V8Narrow, Legal);
@@ -867,17 +867,17 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::FMA,         V8Narrow, Expand);
     setOperationAction(ISD::FMUL,        V8Narrow, Legal);
     setOperationAction(ISD::FNEARBYINT,  V8Narrow, Legal);
-    setOperationAction(ISD::FNEG,        V8Narrow, Legal);
+    setOperationAction(ISD::FNEG, V8Narrow, Legal);
     setOperationAction(ISD::FROUND,      V8Narrow, Legal);
     setOperationAction(ISD::FROUNDEVEN,  V8Narrow, Legal);
     setOperationAction(ISD::FRINT,       V8Narrow, Legal);
     setOperationAction(ISD::FSQRT,       V8Narrow, Expand);
     setOperationAction(ISD::FSUB,        V8Narrow, Legal);
-    setOperationAction(ISD::FTRUNC,      V8Narrow, Legal);
+    setOperationAction(ISD::FTRUNC, V8Narrow, Legal);
     setOperationAction(ISD::BR_CC,       V8Narrow, Expand);
     setOperationAction(ISD::SELECT,      V8Narrow, Expand);
     setOperationAction(ISD::SELECT_CC,   V8Narrow, Expand);
-    setOperationAction(ISD::FP_EXTEND,   V8Narrow, Expand);
+    setOperationAction(ISD::FP_EXTEND, V8Narrow, Expand);
   };
 
   if (!Subtarget->hasFullFP16()) {

Comment on lines 4242 to 4243
if (LT.second == MVT::v8f16 && !ST->hasFullFP16())
return LT.first * 8; // 2*(fcvtl + fcvtl2 + fcmp) + uzp1 + xtn
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is slightly generous based on the output for fcmp oeq but I believe the two extra instructions are redundant and the previous cost of 22 was not close to reality either.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would probably be best as separate calls to getCastCost * 2 + getCmpSelCost + getCastCost to the extended type, unless there is a need to override it because it is a special case. It can be slower due to the extra calls being made, but composes better and is more extendable as the cost model changes.

Either way it is likely better to separate out into another commit - they can go in at roughly the same time but it is better to keep patches smaller and the codegen changes can be added before the cost model is updated. Or was this updated as the generic cost model was making changes itself?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh - and should it add bf16 costs too?

Copy link
Collaborator Author

@paulwalker-arm paulwalker-arm Apr 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the change to the lowering code triggered a change to the cost model (causing it to start returning 1) so I made this change to counteract it.

It does not look like I can precommit the cost model change because when using getCmpSelCost with the promoted type it assumes no scalarisation, returning a much lower cost than reality, which this PR fixes. That said, presumably the existing MVT::v4f16 is equally bad? I could fix that to use getCastCost * 2 + getCmpSelCost + getCastCost and then this PR would just remove the type restriction?

Edit: Removed previous side note as I was confusing truncating to the IR result type with the expected truncation from V128->V64.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored the costing under #135795. That contains a FIXME for the path that is currently scalarised, which this PR will remove.

@paulwalker-arm
Copy link
Collaborator Author

I'm ignoring the code formatter errors because I'm following the format of existing code.

@@ -841,18 +841,21 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
setOperationPromotedToType(ISD::FRINT, V4Narrow, MVT::v4f32);
setOperationPromotedToType(ISD::FNEARBYINT, V4Narrow, MVT::v4f32);
setOperationPromotedToType(ISD::FCANONICALIZE, V4Narrow, MVT::v4f32);
setOperationPromotedToType(ISD::SETCC, V4Narrow, MVT::v4f32);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect this to just mark the node as promote. Do you know why this needs to provide the necessary vector type to promote to, and that isn't automatically handled by getTypeToPromoteTo?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getTypeToPromoteTo does not work in the natural way when operating on vector types. Specially, there is no guarantee the input and output element counts will match and indeed on AArch64 getTypeToPromoteTo(v4f16) returns v2f32.

Comment on lines 4242 to 4243
if (LT.second == MVT::v8f16 && !ST->hasFullFP16())
return LT.first * 8; // 2*(fcvtl + fcvtl2 + fcmp) + uzp1 + xtn
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would probably be best as separate calls to getCastCost * 2 + getCmpSelCost + getCastCost to the extended type, unless there is a need to override it because it is a special case. It can be slower due to the extra calls being made, but composes better and is more extendable as the cost model changes.

Either way it is likely better to separate out into another commit - they can go in at roughly the same time but it is better to keep patches smaller and the codegen changes can be added before the cost model is updated. Or was this updated as the generic cost model was making changes itself?

paulwalker-arm added a commit to paulwalker-arm/llvm-project that referenced this pull request Apr 15, 2025
…res.

The PR also extends the code to cover bfloat vector compares that are
also promoted to float.

NOTE: There is a bail out for the compares that are scalarised that
will be removed by llvm#135398.
@paulwalker-arm
Copy link
Collaborator Author

paulwalker-arm commented Apr 15, 2025

Rebased to sit on top of #135795. Only the most recent commit is relevant to this PR.

; CHECK-CVT-NEXT: uzp1 v0.8h, v0.8h, v2.8h
; CHECK-CVT-NEXT: mvn v0.16b, v0.16b
; CHECK-CVT-NEXT: shl v0.8h, v0.8h, #15
; CHECK-CVT-NEXT: cmlt v0.8h, v0.8h, #0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the "shl+cmlt" a no-op?

Copy link
Collaborator Author

@paulwalker-arm paulwalker-arm Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe so. I alluded to this via a GitHub comment attached to why the instruction cost seems lower than the generated code suggests. I've not investigated yet but given it's not a regression compared to the original code I figured I could tackle it separately. Is this ok or do you prefer me to do the work under this PR?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine.

paulwalker-arm added a commit that referenced this pull request Apr 22, 2025
…res. (#135795)

The PR also extends the code to cover bfloat vector compares that are
also promoted to float.

NOTE: There is a bail out for the compares that are scalarised that will
be removed by #135398.
I have also removed custom promotion code for the v4{f16,bf16} cases
because the same common code can be used.
@paulwalker-arm
Copy link
Collaborator Author

Rebased now the dependent PR has landed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants