
Conversation

@Pierre-vh
Contributor

I was benchmarking the MatchTable when I found that getConstantVRegValWithLookThrough took a non-negligible amount of time, about 7.5% of all of AArch64PreLegalizerCombinerImpl::tryCombineAll.

I decided to take a closer look to see if I could squeeze some performance out of it, and I landed on a few changes that:

  • Avoid copying APInt unnecessarily; in particular, returning std::optional&lt;APInt&gt; can be expensive when an out parameter also works.
  • Avoid an indirect call by using templated function pointers instead of function_ref/std::function.

Both of these changes together seem to speed up this function by about 50%, but my benchmarking (`perf record`) seems inconsistent, so take the measurements with a grain of salt: on the exact same input I saw this function as high as 4.5% and as low as 2% after the changes, but it never got close to 7% again in a few runs, so this looks like a stable improvement.

@llvmbot
Member

llvmbot commented May 13, 2024

@llvm/pr-subscribers-llvm-globalisel

Author: Pierre van Houtryve (Pierre-vh)

Changes



Full diff: https://github.com/llvm/llvm-project/pull/91969.diff

1 file affected:

  • (modified) llvm/lib/CodeGen/GlobalISel/Utils.cpp (+44-32)
diff --git a/llvm/lib/CodeGen/GlobalISel/Utils.cpp b/llvm/lib/CodeGen/GlobalISel/Utils.cpp
index 4e3781cb4e9d5..cd5dc0e01ed0e 100644
--- a/llvm/lib/CodeGen/GlobalISel/Utils.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/Utils.cpp
@@ -313,13 +313,22 @@ llvm::getIConstantVRegSExtVal(Register VReg, const MachineRegisterInfo &MRI) {
 
 namespace {
 
-typedef std::function<bool(const MachineInstr *)> IsOpcodeFn;
-typedef std::function<std::optional<APInt>(const MachineInstr *MI)> GetAPCstFn;
-
-std::optional<ValueAndVReg> getConstantVRegValWithLookThrough(
-    Register VReg, const MachineRegisterInfo &MRI, IsOpcodeFn IsConstantOpcode,
-    GetAPCstFn getAPCstValue, bool LookThroughInstrs = true,
-    bool LookThroughAnyExt = false) {
+// This function is used in many places, and as such, it has some
+// micro-optimizations to try and make it as fast as it can be.
+//
+// - We use template arguments to avoid an indirect call caused by passing a
+// function_ref/std::function
+// - GetAPCstValue does not return std::optional<APInt> as that's expensive.
+// Instead it returns true/false and places the result in a pre-constructed
+// APInt.
+//
+// Please change this function carefully and benchmark your changes.
+template <bool (*IsConstantOpcode)(const MachineInstr *),
+          bool (*GetAPCstValue)(const MachineInstr *MI, APInt &)>
+std::optional<ValueAndVReg>
+getConstantVRegValWithLookThrough(Register VReg, const MachineRegisterInfo &MRI,
+                                  bool LookThroughInstrs = true,
+                                  bool LookThroughAnyExt = false) {
   SmallVector<std::pair<unsigned, unsigned>, 4> SeenOpcodes;
   MachineInstr *MI;
 
@@ -353,26 +362,25 @@ std::optional<ValueAndVReg> getConstantVRegValWithLookThrough(
   if (!MI || !IsConstantOpcode(MI))
     return std::nullopt;
 
-  std::optional<APInt> MaybeVal = getAPCstValue(MI);
-  if (!MaybeVal)
+  APInt Val;
+  if (!GetAPCstValue(MI, Val))
     return std::nullopt;
-  APInt &Val = *MaybeVal;
-  for (auto [Opcode, Size] : reverse(SeenOpcodes)) {
-    switch (Opcode) {
+  for (auto &Pair : reverse(SeenOpcodes)) {
+    switch (Pair.first) {
     case TargetOpcode::G_TRUNC:
-      Val = Val.trunc(Size);
+      Val = Val.trunc(Pair.second);
       break;
     case TargetOpcode::G_ANYEXT:
     case TargetOpcode::G_SEXT:
-      Val = Val.sext(Size);
+      Val = Val.sext(Pair.second);
       break;
     case TargetOpcode::G_ZEXT:
-      Val = Val.zext(Size);
+      Val = Val.zext(Pair.second);
       break;
     }
   }
 
-  return ValueAndVReg{Val, VReg};
+  return ValueAndVReg{std::move(Val), VReg};
 }
 
 bool isIConstant(const MachineInstr *MI) {
@@ -394,42 +402,46 @@ bool isAnyConstant(const MachineInstr *MI) {
   return Opc == TargetOpcode::G_CONSTANT || Opc == TargetOpcode::G_FCONSTANT;
 }
 
-std::optional<APInt> getCImmAsAPInt(const MachineInstr *MI) {
+bool getCImmAsAPInt(const MachineInstr *MI, APInt &Result) {
   const MachineOperand &CstVal = MI->getOperand(1);
-  if (CstVal.isCImm())
-    return CstVal.getCImm()->getValue();
-  return std::nullopt;
+  if (!CstVal.isCImm())
+    return false;
+  Result = CstVal.getCImm()->getValue();
+  return true;
 }
 
-std::optional<APInt> getCImmOrFPImmAsAPInt(const MachineInstr *MI) {
+bool getCImmOrFPImmAsAPInt(const MachineInstr *MI, APInt &Result) {
   const MachineOperand &CstVal = MI->getOperand(1);
   if (CstVal.isCImm())
-    return CstVal.getCImm()->getValue();
-  if (CstVal.isFPImm())
-    return CstVal.getFPImm()->getValueAPF().bitcastToAPInt();
-  return std::nullopt;
+    Result = CstVal.getCImm()->getValue();
+  else if (CstVal.isFPImm())
+    Result = CstVal.getFPImm()->getValueAPF().bitcastToAPInt();
+  else
+    return false;
+  return true;
 }
 
 } // end anonymous namespace
 
 std::optional<ValueAndVReg> llvm::getIConstantVRegValWithLookThrough(
     Register VReg, const MachineRegisterInfo &MRI, bool LookThroughInstrs) {
-  return getConstantVRegValWithLookThrough(VReg, MRI, isIConstant,
-                                           getCImmAsAPInt, LookThroughInstrs);
+  return getConstantVRegValWithLookThrough<isIConstant, getCImmAsAPInt>(
+      VReg, MRI, LookThroughInstrs);
 }
 
 std::optional<ValueAndVReg> llvm::getAnyConstantVRegValWithLookThrough(
     Register VReg, const MachineRegisterInfo &MRI, bool LookThroughInstrs,
     bool LookThroughAnyExt) {
-  return getConstantVRegValWithLookThrough(
-      VReg, MRI, isAnyConstant, getCImmOrFPImmAsAPInt, LookThroughInstrs,
-      LookThroughAnyExt);
+  return getConstantVRegValWithLookThrough<isAnyConstant,
+                                           getCImmOrFPImmAsAPInt>(
+      VReg, MRI, LookThroughInstrs, LookThroughAnyExt);
 }
 
 std::optional<FPValueAndVReg> llvm::getFConstantVRegValWithLookThrough(
     Register VReg, const MachineRegisterInfo &MRI, bool LookThroughInstrs) {
-  auto Reg = getConstantVRegValWithLookThrough(
-      VReg, MRI, isFConstant, getCImmOrFPImmAsAPInt, LookThroughInstrs);
+  auto Reg =
+      getConstantVRegValWithLookThrough<isFConstant, getCImmOrFPImmAsAPInt>(
+          VReg, MRI, LookThroughInstrs);
   if (!Reg)
     return std::nullopt;
   return FPValueAndVReg{getConstantFPVRegVal(Reg->VReg, MRI)->getValueAPF(),

@qcolombet
Collaborator

Do you know how each of these items contributes to the compile time individually:

  • Avoid copying APInt unnecessarily, especially returning std::optional&lt;APInt&gt; can be expensive when an out parameter also works.
  • Avoid indirect call by using templated function pointers instead of function_ref/std::function

The use of optional versus bool + out parameter is what I think we've been moving towards in the past few years within the LLVM code base and I'm wondering if that means this is generally bad for performance.

@Pierre-vh
Contributor Author

Pierre-vh commented May 13, 2024

Do you know how each of these items contributes to the compile time individually:

  • Avoid copying APInt unnecessarily, especially returning std::optional&lt;APInt&gt; can be expensive when an out parameter also works.
  • Avoid indirect call by using templated function pointers instead of function_ref/std::function

The use of optional versus bool + out parameter is what I think we've been moving towards in the past few years within the LLVM code base and I'm wondering if that means this is generally bad for performance.

I can come back to this and do more thorough testing if you want a precise answer, but from what I remember (take with a big grain of salt): the biggest contributor was removing the indirect call (which probably also enabled inlining), as the call occurred within a loop.

The removal of optional brought an additional nice improvement (maybe 1/3rd of the indirect call removal?) mostly because it removed APInt copies, and APInt copies aren't trivial.

Finally this patch needs to be put in context: this is a micro-optimization I did for fun and profit (because it was a low-hanging fruit). The gain is a single digit % improvement on the AArch64PreLegalizerCombine match table execution. That part of the compilation process is in itself like 5-7% of the build in my test, so in the best case this saved 0.2% of a full build.

I wouldn't worry about using optional more, but I'd just be careful with it when dealing with non-trivial types (especially large types). It's easy to lose optimizations like copy elision with it. (hot take) I'd also be wary of using optional just "out of convenience" and ending up with sub-optimal code when sometimes returning a bool & using an out parameter just makes more sense and avoids a copy or two.

Contributor

@aemerson aemerson left a comment


How are you benchmarking MatchTable for these changes? Not saying you have to do this, but I'd be curious to see if we still see gains with LTO builds of llvm.

@Pierre-vh
Contributor Author

How are you benchmarking MatchTable for these changes? Not saying you have to do this, but I'd be curious to see if we still see gains with LTO builds of llvm.

I was writing a comment but then it got too long so I just added it to the thread, let's continue the discussion there: https://discourse.llvm.org/t/rfc-measuring-globalisel-compile-time-performance/78412/19

@Pierre-vh Pierre-vh merged commit 922fafa into llvm:main May 14, 2024
@Pierre-vh Pierre-vh deleted the microopt-getConstantVRegVal branch May 14, 2024 06:36
@qcolombet
Collaborator


Thanks for the details!

I was afraid the optional pattern was a time sink in general, but like you said, here we are looking at a micro-optimization.

