Skip to content

[SamplePGO] Add a cutoff for number of profile matching anchors #95542

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 17, 2024

Conversation

amharc
Copy link
Contributor

@amharc amharc commented Jun 14, 2024

The algorithm added by PR #87375 can be potentially quadratic in the number of anchors. This is almost never a problem because normally functions have a reasonable number of function calls.

However, in some rare cases of auto-generated code we observed very large functions that trigger quadratic behaviour here (resulting in >130GB of peak heap memory usage for clang). Let's add a knob for controlling the max number of callsites in a function above which stale profile matching won't be performed.

@llvmbot llvmbot added PGO Profile Guided Optimizations llvm:transforms labels Jun 14, 2024
@llvmbot
Copy link
Member

llvmbot commented Jun 14, 2024

@llvm/pr-subscribers-pgo

Author: Krzysztof Pszeniczny (amharc)

Changes

The algorithm added by PR #87375 can be potentially quadratic in the number of anchors. This is almost never a problem because normally functions have a reasonable number of function calls.

However, in some rare cases of auto-generated code we observed very large functions that trigger quadratic behaviour here (resulting in >130GB of peak heap memory usage for clang). Let's add a knob for controlling the max number of callsites in a function above which stale profile matching won't be performed.


Full diff: https://github.com/llvm/llvm-project/pull/95542.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/IPO/SampleProfileMatcher.cpp (+18)
  • (modified) llvm/test/Transforms/SampleProfile/pseudo-probe-stale-profile-matching-LCS.ll (+3)
diff --git a/llvm/lib/Transforms/IPO/SampleProfileMatcher.cpp b/llvm/lib/Transforms/IPO/SampleProfileMatcher.cpp
index d7613bce4c52e..d484b53904801 100644
--- a/llvm/lib/Transforms/IPO/SampleProfileMatcher.cpp
+++ b/llvm/lib/Transforms/IPO/SampleProfileMatcher.cpp
@@ -14,6 +14,7 @@
 #include "llvm/Transforms/IPO/SampleProfileMatcher.h"
 #include "llvm/IR/IntrinsicInst.h"
 #include "llvm/IR/MDBuilder.h"
+#include "llvm/Support/CommandLine.h"
 
 using namespace llvm;
 using namespace sampleprof;
@@ -24,6 +25,11 @@ extern cl::opt<bool> SalvageStaleProfile;
 extern cl::opt<bool> PersistProfileStaleness;
 extern cl::opt<bool> ReportProfileStaleness;
 
+static cl::opt<int> SalvageStaleProfileMaxCallsites(
+    "salvage-stale-profile-max-callsites", cl::Hidden, cl::init(INT_MAX),
+    cl::desc("The maximum number of callsites in a function, above which stale "
+             "profile matching will be skipped."));
+
 void SampleProfileMatcher::findIRAnchors(const Function &F,
                                          AnchorMap &IRAnchors) {
   // For inlined code, recover the original callsite and callee by finding the
@@ -300,6 +306,18 @@ void SampleProfileMatcher::runStaleProfileMatching(
   if (FilteredIRAnchorsList.empty() || FilteredProfileAnchorList.empty())
     return;
 
+
+  if (FilteredIRAnchorsList.size() > SalvageStaleProfileMaxCallsites ||
+      FilteredProfileAnchorList.size() > SalvageStaleProfileMaxCallsites) {
+    LLVM_DEBUG(dbgs() << "Skip stale profile matching for " << F.getName()
+                      << " because the number of callsites in the IR is "
+                      << FilteredIRAnchorsList.size()
+                      << " and in the profile is "
+                      << FilteredProfileAnchorList.size() << "\n");
+    return;
+  }
+
+
   // Match the callsite anchors by finding the longest common subsequence
   // between IR and profile. Note that we need to use IR anchor as base(A side)
   // to align with the order of IRToProfileLocationMap.
diff --git a/llvm/test/Transforms/SampleProfile/pseudo-probe-stale-profile-matching-LCS.ll b/llvm/test/Transforms/SampleProfile/pseudo-probe-stale-profile-matching-LCS.ll
index ecf8484d98e59..4b8cd853301ed 100644
--- a/llvm/test/Transforms/SampleProfile/pseudo-probe-stale-profile-matching-LCS.ll
+++ b/llvm/test/Transforms/SampleProfile/pseudo-probe-stale-profile-matching-LCS.ll
@@ -1,6 +1,7 @@
 ; REQUIRES: x86_64-linux
 ; REQUIRES: asserts
 ; RUN: opt < %s -passes=sample-profile -sample-profile-file=%S/Inputs/pseudo-probe-stale-profile-matching-LCS.prof --salvage-stale-profile -S --debug-only=sample-profile,sample-profile-matcher,sample-profile-impl 2>&1 | FileCheck %s
+; RUN: opt < %s -passes=sample-profile -sample-profile-file=%S/Inputs/pseudo-probe-stale-profile-matching-LCS.prof --salvage-stale-profile -S --debug-only=sample-profile,sample-profile-matcher,sample-profile-impl --salvage-stale-profile-max-callsites=6 2>&1 | FileCheck %s -check-prefix=CHECK-MAX-CALLSITES
 
 ; CHECK: Run stale profile matching for test_direct_call
 ; CHECK: Location is matched from 1 to 1
@@ -27,6 +28,8 @@
 ; CHECK: Callsite with callee:unknown.indirect.callee is matched from 9 to 6
 ; CHECK: Callsite with callee:C is matched from 10 to 7
 
+; CHECK-MAX-CALLSITES: Skip stale profile matching for test_direct_call
+; CHECK-MAX-CALLSITES-NOT: Skip stale profile matching for test_indirect_call
 
 target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-unknown-linux-gnu"

@llvmbot
Copy link
Member

llvmbot commented Jun 14, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Krzysztof Pszeniczny (amharc)

Changes

The algorithm added by PR #87375 can be potentially quadratic in the number of anchors. This is almost never a problem because normally functions have a reasonable number of function calls.

However, in some rare cases of auto-generated code we observed very large functions that trigger quadratic behaviour here (resulting in >130GB of peak heap memory usage for clang). Let's add a knob for controlling the max number of callsites in a function above which stale profile matching won't be performed.


Full diff: https://github.com/llvm/llvm-project/pull/95542.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/IPO/SampleProfileMatcher.cpp (+18)
  • (modified) llvm/test/Transforms/SampleProfile/pseudo-probe-stale-profile-matching-LCS.ll (+3)
diff --git a/llvm/lib/Transforms/IPO/SampleProfileMatcher.cpp b/llvm/lib/Transforms/IPO/SampleProfileMatcher.cpp
index d7613bce4c52e..d484b53904801 100644
--- a/llvm/lib/Transforms/IPO/SampleProfileMatcher.cpp
+++ b/llvm/lib/Transforms/IPO/SampleProfileMatcher.cpp
@@ -14,6 +14,7 @@
 #include "llvm/Transforms/IPO/SampleProfileMatcher.h"
 #include "llvm/IR/IntrinsicInst.h"
 #include "llvm/IR/MDBuilder.h"
+#include "llvm/Support/CommandLine.h"
 
 using namespace llvm;
 using namespace sampleprof;
@@ -24,6 +25,11 @@ extern cl::opt<bool> SalvageStaleProfile;
 extern cl::opt<bool> PersistProfileStaleness;
 extern cl::opt<bool> ReportProfileStaleness;
 
+static cl::opt<int> SalvageStaleProfileMaxCallsites(
+    "salvage-stale-profile-max-callsites", cl::Hidden, cl::init(INT_MAX),
+    cl::desc("The maximum number of callsites in a function, above which stale "
+             "profile matching will be skipped."));
+
 void SampleProfileMatcher::findIRAnchors(const Function &F,
                                          AnchorMap &IRAnchors) {
   // For inlined code, recover the original callsite and callee by finding the
@@ -300,6 +306,18 @@ void SampleProfileMatcher::runStaleProfileMatching(
   if (FilteredIRAnchorsList.empty() || FilteredProfileAnchorList.empty())
     return;
 
+
+  if (FilteredIRAnchorsList.size() > SalvageStaleProfileMaxCallsites ||
+      FilteredProfileAnchorList.size() > SalvageStaleProfileMaxCallsites) {
+    LLVM_DEBUG(dbgs() << "Skip stale profile matching for " << F.getName()
+                      << " because the number of callsites in the IR is "
+                      << FilteredIRAnchorsList.size()
+                      << " and in the profile is "
+                      << FilteredProfileAnchorList.size() << "\n");
+    return;
+  }
+
+
   // Match the callsite anchors by finding the longest common subsequence
   // between IR and profile. Note that we need to use IR anchor as base(A side)
   // to align with the order of IRToProfileLocationMap.
diff --git a/llvm/test/Transforms/SampleProfile/pseudo-probe-stale-profile-matching-LCS.ll b/llvm/test/Transforms/SampleProfile/pseudo-probe-stale-profile-matching-LCS.ll
index ecf8484d98e59..4b8cd853301ed 100644
--- a/llvm/test/Transforms/SampleProfile/pseudo-probe-stale-profile-matching-LCS.ll
+++ b/llvm/test/Transforms/SampleProfile/pseudo-probe-stale-profile-matching-LCS.ll
@@ -1,6 +1,7 @@
 ; REQUIRES: x86_64-linux
 ; REQUIRES: asserts
 ; RUN: opt < %s -passes=sample-profile -sample-profile-file=%S/Inputs/pseudo-probe-stale-profile-matching-LCS.prof --salvage-stale-profile -S --debug-only=sample-profile,sample-profile-matcher,sample-profile-impl 2>&1 | FileCheck %s
+; RUN: opt < %s -passes=sample-profile -sample-profile-file=%S/Inputs/pseudo-probe-stale-profile-matching-LCS.prof --salvage-stale-profile -S --debug-only=sample-profile,sample-profile-matcher,sample-profile-impl --salvage-stale-profile-max-callsites=6 2>&1 | FileCheck %s -check-prefix=CHECK-MAX-CALLSITES
 
 ; CHECK: Run stale profile matching for test_direct_call
 ; CHECK: Location is matched from 1 to 1
@@ -27,6 +28,8 @@
 ; CHECK: Callsite with callee:unknown.indirect.callee is matched from 9 to 6
 ; CHECK: Callsite with callee:C is matched from 10 to 7
 
+; CHECK-MAX-CALLSITES: Skip stale profile matching for test_direct_call
+; CHECK-MAX-CALLSITES-NOT: Skip stale profile matching for test_indirect_call
 
 target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-unknown-linux-gnu"

@amharc amharc force-pushed the stale_profile_matching_limit branch 2 times, most recently from af8188c to e349ed7 Compare June 14, 2024 13:25
The algorithm added by PR llvm#87375 can be potentially quadratic in the
number of anchors. This is almost never a problem because normally
functions have a reasonable number of function calls.

However, in some rare cases of auto-generated code we observed very
large functions that trigger quadratic behaviour here (resulting in
>130GB of peak heap memory usage for clang). Let's add a knob for
controlling the max number of callsites in a function above which stale
profile matching won't be performed.
@amharc amharc force-pushed the stale_profile_matching_limit branch from e349ed7 to ea2e568 Compare June 14, 2024 13:32
@amharc
Copy link
Contributor Author

amharc commented Jun 14, 2024

Tagging @WenleiHe @wlei-llvm @david-xl.

@@ -300,6 +306,16 @@ void SampleProfileMatcher::runStaleProfileMatching(
if (FilteredIRAnchorsList.empty() || FilteredProfileAnchorList.empty())
return;

if (FilteredIRAnchorsList.size() > SalvageStaleProfileMaxCallsites ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to check both anchor list size or just one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this check is meant to prevent superlinear behaviour in case of very large functions, I think it makes sense to check for both "the profile has too many anchors" and "the IR has too many anchors". In the regular case both should be small and cases where any of those is say >1000 are exotic enough that it might be better to err on the side of caution and skip those functions.

For example: those functions are often autogenerated, in which case stale profile matching (which targets localised edits made by humans) might not be very appropriate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the default to be something other than UINT_MAX?

@WenleiHe WenleiHe requested a review from wlei-llvm June 14, 2024 15:14
Copy link
Member

@WenleiHe WenleiHe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Myer's diffing used here is supposed to be of O(ND), usually we don't have a ton of diff from LCS, so it's surprising to see quadratic behavior here..

@wlei-llvm
Copy link
Contributor

Thanks for the work, acknowledged that I supposed the num of anchors is not super big while working on #87375 . However, we could lost some opportunities if skipping the large function matching , especially since larger functions are more likely to have code changes(no matter if it's edited by human). Trying to think if we can fix it by improve the current one: One is to improve the myers-diff memory cost, which I thinking maybe we can use a more efficient structure instead of 2-D array. Another way is I'm wondering if it makes sense to support both the algorithm(the diff and the old one), then make it configurable, e.g. if the callsite num is big, it automatically switches to a the lightweight one. (Git also support different diff algorithm, like git diff --diff-algorithm=minimal|myers|...).

(More for discussion for the best approach than asking to do it in this path)

@david-xl
Copy link
Contributor

Thanks for the work, acknowledged that I supposed the num of anchors is not super big while working on #87375 . However, we could lost some opportunities if skipping the large function matching , especially since larger functions are more likely to have code changes(no matter if it's edited by human). Trying to think if we can fix it by improve the current one: One is to improve the myers-diff memory cost, which I thinking maybe we can use a more efficient structure instead of 2-D array. Another way is I'm wondering if it makes sense to support both the algorithm(the diff and the old one), then make it configurable, e.g. if the callsite num is big, it automatically switches to a the lightweight one. (Git also support different diff algorithm, like git diff --diff-algorithm=minimal|myers|...).

(More for discussion for the best approach than asking to do it in this path)

Sounds good. For the time being, the patch looks ok (to workaround OOM issues experience)?

@WenleiHe
Copy link
Member

Yes, this looks good to get unblocked for pathological cases.

Copy link
Contributor

@wlei-llvm wlei-llvm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. For the time being, the patch looks ok (to workaround OOM issues experience)?

Sounds good to unblock the OOM.

@amharc amharc merged commit 21ba91c into llvm:main Jun 17, 2024
5 of 7 checks passed
@amharc amharc deleted the stale_profile_matching_limit branch June 17, 2024 17:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
llvm:transforms PGO Profile Guided Optimizations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants