[LV] Add initial support for vectorizing literal struct return values #109833

Open
MacDue wants to merge 10 commits into main from loop_vec_struct_experiment

Conversation

@MacDue MacDue (Member) commented Sep 24, 2024

This patch adds initial support for vectorizing literal struct return values. Currently, this is limited to the case where the struct is homogeneous (all elements have the same type) and not packed.

The intended use case for this is vectorizing intrinsics such as:

declare { float, float } @llvm.sincos.f32(float %x)

Mapping them to structure-returning library calls such as:

declare { <4 x float>, <4 x float> } @Sleef_sincosf4_u10advsimd(<4 x float>)

It would also be possible to vectorize the intrinsic (without a libcall) and then later lower the intrinsic to a library call. This may be desirable if the only library calls available take output pointers rather than returning multiple values.
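For the libcall-mapping case, a rough before/after sketch (hypothetical IR; the value names and surrounding loop are assumed, not taken from the patch):

```llvm
; Scalar loop body before vectorization:
%sincos = call { float, float } @llvm.sincos.f32(float %x)
%sin = extractvalue { float, float } %sincos, 0
%cos = extractvalue { float, float } %sincos, 1

; Widened loop body at VF = 4, mapped to the vector library call:
%wide.sincos = call { <4 x float>, <4 x float> } @Sleef_sincosf4_u10advsimd(<4 x float> %wide.x)
%wide.sin = extractvalue { <4 x float>, <4 x float> } %wide.sincos, 0
%wide.cos = extractvalue { <4 x float>, <4 x float> } %wide.sincos, 1
```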

Implementing this required two main changes:

  1. Supporting widening extractvalue
  2. Adding support for "wide" types (in LV and parts of the cost model)

The first change is relatively straightforward; the second is larger, as it requires changing the assumption that types are always scalars or vectors.

In this patch, a "wide" type is defined as a vector, or a struct literal where all elements are vectors (of the same element count).

To help with the second change, some helpers for wide types have been added (these work similarly to the existing vector helpers). They have been used along the paths needed to support vectorizing calls; however, I expect there are still places that only expect vector types.
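To make the relationship between the helpers concrete, here is a minimal sketch using the signatures added in this patch (the function and type setup are illustrative only):

```cpp
#include <cassert>

#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/VectorUtils.h" // New header added by this patch.

using namespace llvm;

// Round-trip the { float, float } return type of llvm.sincos.f32 through the
// new wide-type helpers.
static void wideTypeRoundTrip(LLVMContext &Ctx) {
  Type *FloatTy = Type::getFloatTy(Ctx);
  StructType *ScalarTy = StructType::get(Ctx, {FloatTy, FloatTy});

  // Widen to VF = 4, giving { <4 x float>, <4 x float> }.
  Type *WideTy = ToWideTy(ScalarTy, ElementCount::getFixed(4));
  assert(isWideTy(WideTy));
  assert(getWideTypeVF(WideTy) == ElementCount::getFixed(4));
  assert(getContainedTypes(WideTy).size() == 2);

  // Narrow back to the original scalar struct.
  assert(ToNarrowTy(WideTy) == ScalarTy);
}
```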

github-actions bot commented Sep 24, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@SamTebbs33 SamTebbs33 self-requested a review September 25, 2024 10:04
@MacDue MacDue force-pushed the loop_vec_struct_experiment branch 3 times, most recently from 8ce936e to 0f2cdbb on October 3, 2024 at 15:40
@MacDue MacDue (Member, Author) commented Oct 3, 2024

Marking the PR as "ready to review". I think this PR likely still needs further changes, but I don't plan on iterating on this much more without review 🙂 (to be sure this is the right direction).

@llvmbot llvmbot (Collaborator) commented Oct 3, 2024

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-llvm-analysis

Author: Benjamin Maxwell (MacDue)



Patch is 49.61 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/109833.diff

15 Files Affected:

  • (modified) llvm/include/llvm/Analysis/VectorUtils.h (+3-12)
  • (modified) llvm/include/llvm/CodeGen/BasicTTIImpl.h (+25-17)
  • (added) llvm/include/llvm/IR/VectorUtils.h (+53)
  • (modified) llvm/lib/Analysis/VectorUtils.cpp (+14)
  • (modified) llvm/lib/IR/CMakeLists.txt (+1)
  • (modified) llvm/lib/IR/VFABIDemangler.cpp (+11-7)
  • (added) llvm/lib/IR/VectorUtils.cpp (+69)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp (+2-2)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+34-25)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+20-8)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+3-3)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp (+2)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+9-1)
  • (modified) llvm/test/Analysis/CostModel/AMDGPU/frexp.ll (+28-28)
  • (added) llvm/test/Transforms/LoopVectorize/AArch64/struct-return.ll (+251)
diff --git a/llvm/include/llvm/Analysis/VectorUtils.h b/llvm/include/llvm/Analysis/VectorUtils.h
index e2dd4976f39065..2a419560be3030 100644
--- a/llvm/include/llvm/Analysis/VectorUtils.h
+++ b/llvm/include/llvm/Analysis/VectorUtils.h
@@ -18,6 +18,7 @@
 #include "llvm/Analysis/LoopAccessAnalysis.h"
 #include "llvm/IR/Module.h"
 #include "llvm/IR/VFABIDemangler.h"
+#include "llvm/IR/VectorUtils.h"
 #include "llvm/Support/CheckedArithmetic.h"
 
 namespace llvm {
@@ -127,18 +128,8 @@ namespace Intrinsic {
 typedef unsigned ID;
 }
 
-/// A helper function for converting Scalar types to vector types. If
-/// the incoming type is void, we return void. If the EC represents a
-/// scalar, we return the scalar type.
-inline Type *ToVectorTy(Type *Scalar, ElementCount EC) {
-  if (Scalar->isVoidTy() || Scalar->isMetadataTy() || EC.isScalar())
-    return Scalar;
-  return VectorType::get(Scalar, EC);
-}
-
-inline Type *ToVectorTy(Type *Scalar, unsigned VF) {
-  return ToVectorTy(Scalar, ElementCount::getFixed(VF));
-}
+/// Returns true if `Ty` can be widened by the loop vectorizer.
+bool canWidenType(Type *Ty);
 
 /// Identify if the intrinsic is trivially vectorizable.
 /// This method returns true if the intrinsic's argument types are all scalars
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index c36a346c1b2e05..710a9a107e1eba 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -1564,8 +1564,8 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
     Type *RetTy = ICA.getReturnType();
 
     ElementCount RetVF =
-        (RetTy->isVectorTy() ? cast<VectorType>(RetTy)->getElementCount()
-                             : ElementCount::getFixed(1));
+        isWideTy(RetTy) ? getWideTypeVF(RetTy) : ElementCount::getFixed(1);
+
     const IntrinsicInst *I = ICA.getInst();
     const SmallVectorImpl<const Value *> &Args = ICA.getArgs();
     FastMathFlags FMF = ICA.getFlags();
@@ -1886,10 +1886,13 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
     InstructionCost ScalarizationCost = InstructionCost::getInvalid();
     if (RetVF.isVector() && !RetVF.isScalable()) {
       ScalarizationCost = 0;
-      if (!RetTy->isVoidTy())
-        ScalarizationCost += getScalarizationOverhead(
-            cast<VectorType>(RetTy),
-            /*Insert*/ true, /*Extract*/ false, CostKind);
+      if (!RetTy->isVoidTy()) {
+        for (Type *VectorTy : getContainedTypes(RetTy)) {
+          ScalarizationCost += getScalarizationOverhead(
+              cast<VectorType>(VectorTy),
+              /*Insert*/ true, /*Extract*/ false, CostKind);
+        }
+      }
       ScalarizationCost +=
           getOperandsScalarizationOverhead(Args, ICA.getArgTypes(), CostKind);
     }
@@ -2480,27 +2483,32 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
     // Else, assume that we need to scalarize this intrinsic. For math builtins
     // this will emit a costly libcall, adding call overhead and spills. Make it
     // very expensive.
-    if (auto *RetVTy = dyn_cast<VectorType>(RetTy)) {
+    if (isWideTy(RetTy)) {
+      const SmallVector<Type *, 2> RetVTys = getContainedTypes(RetTy);
+
       // Scalable vectors cannot be scalarized, so return Invalid.
-      if (isa<ScalableVectorType>(RetTy) || any_of(Tys, [](const Type *Ty) {
-            return isa<ScalableVectorType>(Ty);
-          }))
+      if (any_of(concat<Type *const>(RetVTys, Tys),
+                 [](Type *Ty) { return isa<ScalableVectorType>(Ty); }))
         return InstructionCost::getInvalid();
 
-      InstructionCost ScalarizationCost =
-          SkipScalarizationCost
-              ? ScalarizationCostPassed
-              : getScalarizationOverhead(RetVTy, /*Insert*/ true,
-                                         /*Extract*/ false, CostKind);
+      InstructionCost ScalarizationCost = ScalarizationCostPassed;
+      if (!SkipScalarizationCost) {
+        ScalarizationCost = 0;
+        for (Type *RetVTy : RetVTys) {
+          ScalarizationCost += getScalarizationOverhead(
+              cast<VectorType>(RetVTy), /*Insert*/ true,
+              /*Extract*/ false, CostKind);
+        }
+      }
 
-      unsigned ScalarCalls = cast<FixedVectorType>(RetVTy)->getNumElements();
+      unsigned ScalarCalls = getWideTypeVF(RetTy).getFixedValue();
       SmallVector<Type *, 4> ScalarTys;
       for (Type *Ty : Tys) {
         if (Ty->isVectorTy())
           Ty = Ty->getScalarType();
         ScalarTys.push_back(Ty);
       }
-      IntrinsicCostAttributes Attrs(IID, RetTy->getScalarType(), ScalarTys, FMF);
+      IntrinsicCostAttributes Attrs(IID, ToNarrowTy(RetTy), ScalarTys, FMF);
       InstructionCost ScalarCost =
           thisT()->getIntrinsicInstrCost(Attrs, CostKind);
       for (Type *Ty : Tys) {
diff --git a/llvm/include/llvm/IR/VectorUtils.h b/llvm/include/llvm/IR/VectorUtils.h
new file mode 100644
index 00000000000000..e8e838d8287c42
--- /dev/null
+++ b/llvm/include/llvm/IR/VectorUtils.h
@@ -0,0 +1,53 @@
+//===----------- VectorUtils.h -  Vector type utility functions -*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/IR/DerivedTypes.h"
+
+namespace llvm {
+
+/// A helper function for converting Scalar types to vector types. If
+/// the incoming type is void, we return void. If the EC represents a
+/// scalar, we return the scalar type.
+inline Type *ToVectorTy(Type *Scalar, ElementCount EC) {
+  if (Scalar->isVoidTy() || Scalar->isMetadataTy() || EC.isScalar())
+    return Scalar;
+  return VectorType::get(Scalar, EC);
+}
+
+inline Type *ToVectorTy(Type *Scalar, unsigned VF) {
+  return ToVectorTy(Scalar, ElementCount::getFixed(VF));
+}
+
+/// A helper for converting to wider (vector) types. For scalar types, this is
+/// equivalent to calling `ToVectorTy`. For struct types, this returns a new
+/// struct where each element type has been widened to a vector type. Note: Only
+/// unpacked literal struct types are supported.
+Type *ToWideTy(Type *Ty, ElementCount EC);
+
+/// A helper for converting wide types to narrow (non-vector) types. For vector
+/// types, this is equivalent to calling .getScalarType(). For struct types,
+/// this returns a new struct where each element type has been converted to a
+/// scalar type. Note: Only unpacked literal struct types are supported.
+Type *ToNarrowTy(Type *Ty);
+
+/// Returns the types contained in `Ty`. For struct types, it returns the
+/// elements, all other types are returned directly.
+SmallVector<Type *, 2> getContainedTypes(Type *Ty);
+
+/// Returns true if `Ty` is a vector type or a struct of vector types where all
+/// vector types share the same VF.
+bool isWideTy(Type *Ty);
+
+/// Returns the vectorization factor for a widened type.
+inline ElementCount getWideTypeVF(Type *Ty) {
+  assert(isWideTy(Ty) && "expected widened type!");
+  return cast<VectorType>(getContainedTypes(Ty).front())->getElementCount();
+}
+
+} // namespace llvm
diff --git a/llvm/lib/Analysis/VectorUtils.cpp b/llvm/lib/Analysis/VectorUtils.cpp
index dbffbb8a5f81d9..38b9da69ae2b76 100644
--- a/llvm/lib/Analysis/VectorUtils.cpp
+++ b/llvm/lib/Analysis/VectorUtils.cpp
@@ -39,6 +39,20 @@ static cl::opt<unsigned> MaxInterleaveGroupFactor(
     cl::desc("Maximum factor for an interleaved access group (default = 8)"),
     cl::init(8));
 
+/// Returns true if `Ty` can be widened by the loop vectorizer.
+bool llvm::canWidenType(Type *Ty) {
+  Type *ElTy = Ty;
+  // For now, only allow widening non-packed literal structs where all
+  // element types are the same. This simplifies the cost model and
+  // conversion between scalar and wide types.
+  if (auto *StructTy = dyn_cast<StructType>(Ty);
+      StructTy && !StructTy->isPacked() && StructTy->isLiteral() &&
+      StructTy->containsHomogeneousTypes()) {
+    ElTy = StructTy->elements().front();
+  }
+  return VectorType::isValidElementType(ElTy);
+}
+
 /// Return true if all of the intrinsic's arguments and return type are scalars
 /// for the scalar form of the intrinsic, and vectors for the vector form of the
 /// intrinsic (except operands that are marked as always being scalar by
diff --git a/llvm/lib/IR/CMakeLists.txt b/llvm/lib/IR/CMakeLists.txt
index 544f4ea9223d0e..7eaf35e10ebc67 100644
--- a/llvm/lib/IR/CMakeLists.txt
+++ b/llvm/lib/IR/CMakeLists.txt
@@ -73,6 +73,7 @@ add_llvm_component_library(LLVMCore
   Value.cpp
   ValueSymbolTable.cpp
   VectorBuilder.cpp
+  VectorUtils.cpp
   Verifier.cpp
   VFABIDemangler.cpp
   RuntimeLibcalls.cpp
diff --git a/llvm/lib/IR/VFABIDemangler.cpp b/llvm/lib/IR/VFABIDemangler.cpp
index cdfb9fbfaa084d..6ccd77fd23793a 100644
--- a/llvm/lib/IR/VFABIDemangler.cpp
+++ b/llvm/lib/IR/VFABIDemangler.cpp
@@ -11,6 +11,7 @@
 #include "llvm/ADT/SmallString.h"
 #include "llvm/ADT/StringSwitch.h"
 #include "llvm/IR/Module.h"
+#include "llvm/IR/VectorUtils.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/raw_ostream.h"
 #include <limits>
@@ -346,12 +347,15 @@ getScalableECFromSignature(const FunctionType *Signature, const VFISAKind ISA,
   // Also check the return type if not void.
   Type *RetTy = Signature->getReturnType();
   if (!RetTy->isVoidTy()) {
-    std::optional<ElementCount> ReturnEC = getElementCountForTy(ISA, RetTy);
-    // If we have an unknown scalar element type we can't find a reasonable VF.
-    if (!ReturnEC)
-      return std::nullopt;
-    if (ElementCount::isKnownLT(*ReturnEC, MinEC))
-      MinEC = *ReturnEC;
+    for (Type *RetTy : getContainedTypes(RetTy)) {
+      std::optional<ElementCount> ReturnEC = getElementCountForTy(ISA, RetTy);
+      // If we have an unknown scalar element type we can't find a reasonable
+      // VF.
+      if (!ReturnEC)
+        return std::nullopt;
+      if (ElementCount::isKnownLT(*ReturnEC, MinEC))
+        MinEC = *ReturnEC;
+    }
   }
 
   // The SVE Vector function call ABI bases the VF on the widest element types
@@ -566,7 +570,7 @@ FunctionType *VFABI::createFunctionType(const VFInfo &Info,
 
   auto *RetTy = ScalarFTy->getReturnType();
   if (!RetTy->isVoidTy())
-    RetTy = VectorType::get(RetTy, VF);
+    RetTy = ToWideTy(RetTy, VF);
   return FunctionType::get(RetTy, VecTypes, false);
 }
 
diff --git a/llvm/lib/IR/VectorUtils.cpp b/llvm/lib/IR/VectorUtils.cpp
new file mode 100644
index 00000000000000..c89a8eaf2ad1e0
--- /dev/null
+++ b/llvm/lib/IR/VectorUtils.cpp
@@ -0,0 +1,69 @@
+//===----------- VectorUtils.cpp - Vector type utility functions ----------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/IR/VectorUtils.h"
+#include "llvm/ADT/SmallVectorExtras.h"
+
+using namespace llvm;
+
+/// A helper for converting to wider (vector) types. For scalar types, this is
+/// equivalent to calling `ToVectorTy`. For struct types, this returns a new
+/// struct where each element type has been widened to a vector type. Note: Only
+/// unpacked literal struct types are supported.
+Type *llvm::ToWideTy(Type *Ty, ElementCount EC) {
+  if (EC.isScalar())
+    return Ty;
+  auto *StructTy = dyn_cast<StructType>(Ty);
+  if (!StructTy)
+    return ToVectorTy(Ty, EC);
+  assert(StructTy->isLiteral() && !StructTy->isPacked() &&
+         "expected unpacked struct literal");
+  return StructType::get(
+      Ty->getContext(),
+      map_to_vector(StructTy->elements(), [&](Type *ElTy) -> Type * {
+        return VectorType::get(ElTy, EC);
+      }));
+}
+
+/// A helper for converting wide types to narrow (non-vector) types. For vector
+/// types, this is equivalent to calling .getScalarType(). For struct types,
+/// this returns a new struct where each element type has been converted to a
+/// scalar type. Note: Only unpacked literal struct types are supported.
+Type *llvm::ToNarrowTy(Type *Ty) {
+  auto *StructTy = dyn_cast<StructType>(Ty);
+  if (!StructTy)
+    return Ty->getScalarType();
+  assert(StructTy->isLiteral() && !StructTy->isPacked() &&
+         "expected unpacked struct literal");
+  return StructType::get(
+      Ty->getContext(),
+      map_to_vector(StructTy->elements(), [](Type *ElTy) -> Type * {
+        return ElTy->getScalarType();
+      }));
+}
+
+/// Returns the types contained in `Ty`. For struct types, it returns the
+/// elements, all other types are returned directly.
+SmallVector<Type *, 2> llvm::getContainedTypes(Type *Ty) {
+  auto *StructTy = dyn_cast<StructType>(Ty);
+  if (StructTy)
+    return to_vector<2>(StructTy->elements());
+  return {Ty};
+}
+
+/// Returns true if `Ty` is a vector type or a struct of vector types where all
+/// vector types share the same VF.
+bool llvm::isWideTy(Type *Ty) {
+  auto ContainedTys = getContainedTypes(Ty);
+  if (ContainedTys.empty() || !ContainedTys.front()->isVectorTy())
+    return false;
+  ElementCount VF = cast<VectorType>(ContainedTys.front())->getElementCount();
+  return all_of(ContainedTys, [&](Type *Ty) {
+    return Ty->isVectorTy() && cast<VectorType>(Ty)->getElementCount() == VF;
+  });
+}
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
index 43be72f0f34d45..cb6327640dbdbb 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
@@ -949,8 +949,8 @@ bool LoopVectorizationLegality::canVectorizeInstrs() {
       // Check that the instruction return type is vectorizable.
       // We can't vectorize casts from vector type to scalar type.
       // Also, we can't vectorize extractelement instructions.
-      if ((!VectorType::isValidElementType(I.getType()) &&
-           !I.getType()->isVoidTy()) ||
+      Type *InstTy = I.getType();
+      if (!(InstTy->isVoidTy() || canWidenType(InstTy)) ||
           (isa<CastInst>(I) &&
            !VectorType::isValidElementType(I.getOperand(0)->getType())) ||
           isa<ExtractElementInst>(I)) {
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 6e082b1c134dee..7283bf8d323971 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -2302,7 +2302,9 @@ void InnerLoopVectorizer::scalarizeInstruction(const Instruction *Instr,
                                                VPReplicateRecipe *RepRecipe,
                                                const VPLane &Lane,
                                                VPTransformState &State) {
-  assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");
+  assert((!Instr->getType()->isAggregateType() ||
+          canWidenType(Instr->getType())) &&
+         "expected widenable or non-aggregate type!");
 
   // Does this instruction return a value ?
   bool IsVoidRetTy = Instr->getType()->isVoidTy();
@@ -2864,10 +2866,10 @@ LoopVectorizationCostModel::getVectorCallCost(CallInst *CI,
   return ScalarCallCost;
 }
 
-static Type *maybeVectorizeType(Type *Elt, ElementCount VF) {
-  if (VF.isScalar() || (!Elt->isIntOrPtrTy() && !Elt->isFloatingPointTy()))
-    return Elt;
-  return VectorType::get(Elt, VF);
+static Type *maybeVectorizeType(Type *Ty, ElementCount VF) {
+  if (VF.isScalar() || !canWidenType(Ty))
+    return Ty;
+  return ToWideTy(Ty, VF);
 }
 
 InstructionCost
@@ -3633,9 +3635,8 @@ void LoopVectorizationCostModel::collectLoopUniforms(ElementCount VF) {
 
       // ExtractValue instructions must be uniform, because the operands are
       // known to be loop-invariant.
-      if (auto *EVI = dyn_cast<ExtractValueInst>(&I)) {
-        assert(IsOutOfScope(EVI->getAggregateOperand()) &&
-               "Expected aggregate value to be loop invariant");
+      if (auto *EVI = dyn_cast<ExtractValueInst>(&I);
+          EVI && IsOutOfScope(EVI->getAggregateOperand())) {
         AddToWorklistIfAllowed(EVI);
         continue;
       }
@@ -4471,8 +4472,7 @@ static bool willGenerateVectors(VPlan &Plan, ElementCount VF,
         llvm_unreachable("unhandled recipe");
       }
 
-      auto WillWiden = [&TTI, VF](Type *ScalarTy) {
-        Type *VectorTy = ToVectorTy(ScalarTy, VF);
+      auto WillWiden = [&TTI, VF](Type *VectorTy) {
         unsigned NumLegalParts = TTI.getNumberOfParts(VectorTy);
         if (!NumLegalParts)
           return false;
@@ -4503,7 +4503,8 @@ static bool willGenerateVectors(VPlan &Plan, ElementCount VF,
       Type *ScalarTy = TypeInfo.inferScalarType(ToCheck);
       if (!Visited.insert({ScalarTy}).second)
         continue;
-      if (WillWiden(ScalarTy))
+      Type *WideTy = ToWideTy(ScalarTy, VF);
+      if (any_of(getContainedTypes(WideTy), WillWiden))
         return true;
     }
   }
@@ -5452,10 +5453,13 @@ InstructionCost LoopVectorizationCostModel::computePredInstDiscount(
     // and phi nodes.
     TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
     if (isScalarWithPredication(I, VF) && !I->getType()->isVoidTy()) {
-      ScalarCost += TTI.getScalarizationOverhead(
-          cast<VectorType>(ToVectorTy(I->getType(), VF)),
-          APInt::getAllOnes(VF.getFixedValue()), /*Insert*/ true,
-          /*Extract*/ false, CostKind);
+      Type *WideTy = ToWideTy(I->getType(), VF);
+      for (Type *VectorTy : getContainedTypes(WideTy)) {
+        ScalarCost += TTI.getScalarizationOverhead(
+            cast<VectorType>(VectorTy), APInt::getAllOnes(VF.getFixedValue()),
+            /*Insert*/ true,
+            /*Extract*/ false, CostKind);
+      }
       ScalarCost +=
           VF.getFixedValue() * TTI.getCFInstrCost(Instruction::PHI, CostKind);
     }
@@ -5944,13 +5948,17 @@ InstructionCost LoopVectorizationCostModel::getScalarizationOverhead(
     return 0;
 
   InstructionCost Cost = 0;
-  Type *RetTy = ToVectorTy(I->getType(), VF);
+  Type *RetTy = ToWideTy(I->getType(), VF);
   if (!RetTy->isVoidTy() &&
-      (!isa<LoadInst>(I) || !TTI.supportsEfficientVectorElementLoadStore()))
-    Cost += TTI.getScalarizationOverhead(
-        cast<VectorType>(RetTy), APInt::getAllOnes(VF.getKnownMinValue()),
-        /*Insert*/ true,
-        /*Extract*/ false, CostKind);
+      (!isa<LoadInst>(I) || !TTI.supportsEfficientVectorElementLoadStore())) {
+
+    for (Type *VectorTy : getContainedTypes(RetTy)) {
+      Cost += TTI.getScalarizationOverhead(
+          cast<VectorType>(VectorTy), APInt::getAllOnes(VF.getKnownMinValue()),
+          /*Insert*/ true,
+          /*Extract*/ false, CostKind);
+    }
+  }
 
   // Some targets keep addresses scalar.
   if (isa<LoadInst>(I) && !TTI.prefersVectorizedAddressing())
@@ -6209,9 +6217,9 @@ void LoopVectorizationCostModel::setVectorizedCallDecision(ElementCount VF) {
 
       bool MaskRequired = Legal->isMaskRequired(CI);
       // Compute corresponding vector type for return value and arguments.
-      Type *RetTy = ToVectorTy(ScalarRetTy, VF);
+      Type *RetTy = ToWideTy(ScalarRetTy, VF);
       for (Type *ScalarTy : ScalarTys)
-        Tys.push_back(ToVectorTy(ScalarTy, VF));
+        Tys.push_back(ToWideTy(ScalarTy, VF));
 
       // An in-loop reduction using an fmuladd intrinsic is a special case;
       // we don't want the normal cost for that intrinsic.
@@ -6388,7 +6396,7 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I,
            HasSingleCopyAfterVectorization(I, VF));
     VectorTy = RetTy;
   } else
-    VectorTy = ToVectorTy(RetTy, VF);
+    VectorTy = ToWideTy(RetTy, VF);
 
   if (VF.isVector() && VectorTy->isVectorTy() &&
       !TTI.getNumberOfParts(VectorTy))
@@ -8423,6 +8431,7 @@ VPWidenRecipe *VPRecipeBuilder::tryToWiden(Instruction *I,
   case Instruction::Sub:
   case Instruction::Xor:
   case Instruction::Freeze:
+  case Instruction::ExtractValue:
     SmallVector<VPValue *> NewOps(Operands);
     if (Instruction::isBinaryOp(I->getOpcode())) {
       // The legacy cost model uses SCEV to check if some of the operands are
@@ -9466,7 +9475,7 @...
[truncated]

@MacDue MacDue (Member, Author) commented Oct 9, 2024

Kind ping 🙂

@SamTebbs33 SamTebbs33 (Collaborator) left a comment

Looks good to me with a couple of comments.

@fhahn fhahn (Contributor) left a comment

VPlan supports recipes that produce multiple results, e.g. VPInterleaveRecipe. The recipe would then take care of de-interleaving the result, which would help avoid the need to modify various places in VPTransformState.

I might have missed it, but could you make sure that there are tests with struct return values that cannot be widened?

@@ -0,0 +1,284 @@
; RUN: opt < %s -passes=loop-vectorize,dce,instcombine -force-vector-interleave=1 -S | FileCheck %s --check-prefixes=NEON
Contributor

Does this test need to be specific to SVE? Ideally we would have a target-specific test for a target-specific feature.

Contributor

Can you also add a test that prints a VPlan with the new recipes?

Member Author

Does this test need to be specific to SVE?

Not really, but I find it quite finicky to convince the vectorizer to use scalable vectors without +sve... I was playing with -force-target-supports-scalable-vectors=true, but it seems not to try scalable VFs unless I force the vector width too, and then it's hit or miss. I'm probably doing something wrong, so I'll play around with it a bit more.

Contributor

The target-independent tests don't necessarily have to use scalable vectors; it would be sufficient to cover the various code paths with fixed vectors, and then have some scalable codegen & AArch64 cost-modeling tests in AArch64/.

Member Author

Alright, I've split the fixed-size tests into a target-agnostic test and left the scalable tests under AArch64 👍
I've also added another test that prints the VPlan (when widening or replicating struct returns).

@@ -122,6 +122,8 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPWidenRecipe *R) {
  case Instruction::FNeg:
  case Instruction::Freeze:
    return inferScalarType(R->getOperand(0));
  case Instruction::ExtractValue:
    return R->getUnderlyingInstr()->getType();
Contributor

This may return a vector type from inferScalarTypeForRecipe now?

Member Author

I think this should return a struct of scalars (so a "narrow" type rather than a scalar).

Contributor

Hmm, would it work to infer the scalar type of the operand and then extract the type for the index?

Member Author

Yes, that works, and now that the extract indices are in the VPlan, it does not require looking at the underlying value.

P.S. I was wrong before; this should always return a scalar for extractvalue, as this patch only allows widening structs of scalars.
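A minimal sketch of that inference (illustrative only; how the extract's index is obtained is an assumption here, not the exact code in the PR):

```cpp
case Instruction::ExtractValue: {
  // Infer the narrow struct type of the aggregate operand, then return the
  // element type at the extract's (single) index.
  auto *StructTy = cast<StructType>(inferScalarType(R->getOperand(0)));
  unsigned Idx = 0; // Placeholder for the extractvalue's single index.
  return StructTy->getElementType(Idx);
}
```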

case Instruction::ExtractValue: {
  Value *Op = State.get(getOperand(0));
  Value *Extract = Builder.CreateExtractValue(
      Op, cast<ExtractValueInst>(getUnderlyingValue())->getIndices());
Contributor

This should use the operand of the recipe and not access the underlying IR value

Member Author

The operand of the recipe is the return value of the call (the struct); to build the widened extract I need the original extract indices, and I'm not sure how else to access them.

Contributor

It depends on how the recipe is constructed (it would be helpful to have a test that prints a VPlan with the recipes to see the operands). But as the indices are not Value * operands of the instruction, they might need to be added as operands manually. (It would also be good to make sure there are tests with multiple indices.)

Member Author

Ah, I see what you mean. I've updated tryToWiden() to create live-in operands for the extract indices, and updated this code to use those operands (rather than the underlying value). I've also added an assert that more than one index is unsupported, since this patch only allows widening structs of scalars.
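A rough sketch of what that lowering could look like with the index as a live-in operand (an assumed shape, not the exact code in the PR):

```cpp
case Instruction::ExtractValue: {
  assert(getNumOperands() == 2 && "expected a single-index extractvalue");
  Value *Struct = State.get(getOperand(0));
  // Operand 1 is assumed to be a live-in constant carrying the original
  // extract index.
  unsigned Idx =
      cast<ConstantInt>(getOperand(1)->getLiveInIRValue())->getZExtValue();
  State.set(this, Builder.CreateExtractValue(Struct, Idx));
  break;
}
```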

@MacDue MacDue (Member, Author) commented Oct 11, 2024

VPlan supports recipes that produce multiple results, e.g. VPInterleaveRecipe. The recipe would then take care of de-interleaving the result, which would help avoid the need to modify various places in VPTransformState.

I'm a little unclear on how this would work. Do you mean making VPWidenCallRecipe/VPReplicateRecipe return a value for each element of the struct? I'm not sure what to do around the extractvalues in the original IR/plan; I guess you'd have to replace them with the widen call recipe while constructing the plan?

I might have missed it, but could you make sure that there are tests with struct return values that cannot be widened?

There are a few tests in llvm/test/Transforms/LoopVectorize/struct-return.ll (negative_mixed_element_type_struct_return, test_named_struct_return, test_overflow_intrinsic).
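For illustration (a declaration shown only as an example), a mixed-element-type return such as frexp's is rejected by canWidenType and stays scalar:

```llvm
declare { float, i32 } @llvm.frexp.f32.i32(float)
```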

@fhahn fhahn (Contributor) commented Oct 14, 2024

VPlan supports recipes that produce multiple results, e.g. VPInterleaveRecipe. The recipe would then take care of de-interleaving the result, which would help avoid the need to modify various places in VPTransformState.

I'm a little unclear on how this would work. Do you mean making VPWidenCallRecipe/VPReplicateRecipe return a value for each element of the struct? I'm not sure what to do around the extractvalues in the original IR/plan; I guess you'd have to replace them with the widen call recipe while constructing the plan?

Hm, yes, VPWidenCallRecipe could define one VPValue per element vector and take care of splitting up the return value directly, which may also allow estimating the cost overhead more accurately. It might be worth introducing a separate recipe for calls that return multiple results. Perhaps @ayalz has additional thoughts?

One question is how to deal with VPReplicateRecipe; I suppose we would need a version of it that produces multiple results as well.

I'm not sure what to do around the extractvalues in the original IR/plan; I guess you'd have to replace them with the widen call recipe while constructing the plan?

When lowering extractvalues, they would be replaced by just looking up the VPValue from the first multi-def operand at the index of the extract.

I might have missed it, but could you make sure that there are tests with struct return values that cannot be widened?

There are a few tests in llvm/test/Transforms/LoopVectorize/struct-return.ll (negative_mixed_element_type_struct_return, test_named_struct_return, test_overflow_intrinsic).

Thanks!

@MacDue MacDue (Member, Author) commented Oct 15, 2024

I'm looking into the multiple-results approach now; I think it may require some amount of refactoring. The recipes likely affected all inherit from VPRecipeWithIRFlags, which in turn inherits from VPSingleDefRecipe.

  • VPWidenCallRecipe -> VPRecipeWithIRFlags -> VPSingleDefRecipe
  • VPWidenIntrinsicRecipe -> VPRecipeWithIRFlags -> VPSingleDefRecipe
  • VPReplicateRecipe -> VPRecipeWithIRFlags -> VPSingleDefRecipe

Fast-math flags can also apply to struct results (see #110506). So if we go with the multiple-results approach, I don't think VPRecipeWithIRFlags can stay a VPSingleDefRecipe, since even if we make new recipes for widen call/intrinsic, the new recipes would need to be some kind of VPRecipeWithIRFlags.

I'll see what I can prototype today 🙂

@MacDue MacDue (Member, Author) commented Oct 15, 2024

I've looked at the multiple result VPWidenCallRecipe approach in #112402 (prototype patch). I've got it working for VPWidenCallRecipe (no replication yet), but I'm unsure of the approach. It seems to require more refactoring and special cases than allowing widened struct values. The initial problems are:

  • VPRecipeWithIRFlags currently forces the recipe to inherit from VPSingleDefRecipe
  • VPRecipeBuilder assumes a 1-to-1 mapping from Instruction to (single-def) recipe
    • Currently worked around by special-casing extractvalue:
      // Ugh: Not sure where to handle this :(
      if (auto *EVI = dyn_cast<ExtractValueInst>(V)) {
        Value *AggOp = EVI->getAggregateOperand();
        if (auto *R = getRecipe(cast<Instruction>(AggOp))) {
          assert(R->getNumDefinedValues() ==
                 cast<StructType>(AggOp->getType())->getNumElements());
          unsigned Idx = EVI->getIndices()[0];
          return R->getVPValue(Idx);
        }
      }
    • Attempted to improve this by allowing RecipeBuilder to map (LLVM) instructions to VPValues in cd73ad1
  • extractvalue is also special-cased in tryToBuildVPlanWithVPRecipes() (as it does not directly produce a recipe)

There may be a few issues to solve around replication too (but I've not got there yet).

Just posting my initial thoughts in case I've missed something, as I'm still fairly new to looking at VPlan 🙂

MacDue added a commit to MacDue/llvm-project that referenced this pull request Oct 16, 2024