-
Notifications
You must be signed in to change notification settings - Fork 14.1k
[SamplePGO] Add a cutoff for number of profile matching anchors #95542
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@llvm/pr-subscribers-pgo Author: Krzysztof Pszeniczny (amharc) ChangesThe algorithm added by PR #87375 can be potentially quadratic in the number of anchors. This is almost never a problem because normally functions have a reasonable number of function calls. However, in some rare cases of auto-generated code we observed very large functions that trigger quadratic behaviour here (resulting in >130GB of peak heap memory usage for clang). Let's add a knob for controlling the max number of callsites in a function above which stale profile matching won't be performed. Full diff: https://github.com/llvm/llvm-project/pull/95542.diff 2 Files Affected:
diff --git a/llvm/lib/Transforms/IPO/SampleProfileMatcher.cpp b/llvm/lib/Transforms/IPO/SampleProfileMatcher.cpp
index d7613bce4c52e..d484b53904801 100644
--- a/llvm/lib/Transforms/IPO/SampleProfileMatcher.cpp
+++ b/llvm/lib/Transforms/IPO/SampleProfileMatcher.cpp
@@ -14,6 +14,7 @@
#include "llvm/Transforms/IPO/SampleProfileMatcher.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/MDBuilder.h"
+#include "llvm/Support/CommandLine.h"
using namespace llvm;
using namespace sampleprof;
@@ -24,6 +25,11 @@ extern cl::opt<bool> SalvageStaleProfile;
extern cl::opt<bool> PersistProfileStaleness;
extern cl::opt<bool> ReportProfileStaleness;
+static cl::opt<int> SalvageStaleProfileMaxCallsites(
+ "salvage-stale-profile-max-callsites", cl::Hidden, cl::init(INT_MAX),
+ cl::desc("The maximum number of callsites in a function, above which stale "
+ "profile matching will be skipped."));
+
void SampleProfileMatcher::findIRAnchors(const Function &F,
AnchorMap &IRAnchors) {
// For inlined code, recover the original callsite and callee by finding the
@@ -300,6 +306,18 @@ void SampleProfileMatcher::runStaleProfileMatching(
if (FilteredIRAnchorsList.empty() || FilteredProfileAnchorList.empty())
return;
+
+ if (FilteredIRAnchorsList.size() > SalvageStaleProfileMaxCallsites ||
+ FilteredProfileAnchorList.size() > SalvageStaleProfileMaxCallsites) {
+ LLVM_DEBUG(dbgs() << "Skip stale profile matching for " << F.getName()
+ << " because the number of callsites in the IR is "
+ << FilteredIRAnchorsList.size()
+ << " and in the profile is "
+ << FilteredProfileAnchorList.size() << "\n");
+ return;
+ }
+
+
// Match the callsite anchors by finding the longest common subsequence
// between IR and profile. Note that we need to use IR anchor as base(A side)
// to align with the order of IRToProfileLocationMap.
diff --git a/llvm/test/Transforms/SampleProfile/pseudo-probe-stale-profile-matching-LCS.ll b/llvm/test/Transforms/SampleProfile/pseudo-probe-stale-profile-matching-LCS.ll
index ecf8484d98e59..4b8cd853301ed 100644
--- a/llvm/test/Transforms/SampleProfile/pseudo-probe-stale-profile-matching-LCS.ll
+++ b/llvm/test/Transforms/SampleProfile/pseudo-probe-stale-profile-matching-LCS.ll
@@ -1,6 +1,7 @@
; REQUIRES: x86_64-linux
; REQUIRES: asserts
; RUN: opt < %s -passes=sample-profile -sample-profile-file=%S/Inputs/pseudo-probe-stale-profile-matching-LCS.prof --salvage-stale-profile -S --debug-only=sample-profile,sample-profile-matcher,sample-profile-impl 2>&1 | FileCheck %s
+; RUN: opt < %s -passes=sample-profile -sample-profile-file=%S/Inputs/pseudo-probe-stale-profile-matching-LCS.prof --salvage-stale-profile -S --debug-only=sample-profile,sample-profile-matcher,sample-profile-impl --salvage-stale-profile-max-callsites=6 2>&1 | FileCheck %s -check-prefix=CHECK-MAX-CALLSITES
; CHECK: Run stale profile matching for test_direct_call
; CHECK: Location is matched from 1 to 1
@@ -27,6 +28,8 @@
; CHECK: Callsite with callee:unknown.indirect.callee is matched from 9 to 6
; CHECK: Callsite with callee:C is matched from 10 to 7
+; CHECK-MAX-CALLSITES: Skip stale profile matching for test_direct_call
+; CHECK-MAX-CALLSITES-NOT: Skip stale profile matching for test_indirect_call
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
|
@llvm/pr-subscribers-llvm-transforms Author: Krzysztof Pszeniczny (amharc) ChangesThe algorithm added by PR #87375 can be potentially quadratic in the number of anchors. This is almost never a problem because normally functions have a reasonable number of function calls. However, in some rare cases of auto-generated code we observed very large functions that trigger quadratic behaviour here (resulting in >130GB of peak heap memory usage for clang). Let's add a knob for controlling the max number of callsites in a function above which stale profile matching won't be performed. Full diff: https://github.com/llvm/llvm-project/pull/95542.diff 2 Files Affected:
diff --git a/llvm/lib/Transforms/IPO/SampleProfileMatcher.cpp b/llvm/lib/Transforms/IPO/SampleProfileMatcher.cpp
index d7613bce4c52e..d484b53904801 100644
--- a/llvm/lib/Transforms/IPO/SampleProfileMatcher.cpp
+++ b/llvm/lib/Transforms/IPO/SampleProfileMatcher.cpp
@@ -14,6 +14,7 @@
#include "llvm/Transforms/IPO/SampleProfileMatcher.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/MDBuilder.h"
+#include "llvm/Support/CommandLine.h"
using namespace llvm;
using namespace sampleprof;
@@ -24,6 +25,11 @@ extern cl::opt<bool> SalvageStaleProfile;
extern cl::opt<bool> PersistProfileStaleness;
extern cl::opt<bool> ReportProfileStaleness;
+static cl::opt<int> SalvageStaleProfileMaxCallsites(
+ "salvage-stale-profile-max-callsites", cl::Hidden, cl::init(INT_MAX),
+ cl::desc("The maximum number of callsites in a function, above which stale "
+ "profile matching will be skipped."));
+
void SampleProfileMatcher::findIRAnchors(const Function &F,
AnchorMap &IRAnchors) {
// For inlined code, recover the original callsite and callee by finding the
@@ -300,6 +306,18 @@ void SampleProfileMatcher::runStaleProfileMatching(
if (FilteredIRAnchorsList.empty() || FilteredProfileAnchorList.empty())
return;
+
+ if (FilteredIRAnchorsList.size() > SalvageStaleProfileMaxCallsites ||
+ FilteredProfileAnchorList.size() > SalvageStaleProfileMaxCallsites) {
+ LLVM_DEBUG(dbgs() << "Skip stale profile matching for " << F.getName()
+ << " because the number of callsites in the IR is "
+ << FilteredIRAnchorsList.size()
+ << " and in the profile is "
+ << FilteredProfileAnchorList.size() << "\n");
+ return;
+ }
+
+
// Match the callsite anchors by finding the longest common subsequence
// between IR and profile. Note that we need to use IR anchor as base(A side)
// to align with the order of IRToProfileLocationMap.
diff --git a/llvm/test/Transforms/SampleProfile/pseudo-probe-stale-profile-matching-LCS.ll b/llvm/test/Transforms/SampleProfile/pseudo-probe-stale-profile-matching-LCS.ll
index ecf8484d98e59..4b8cd853301ed 100644
--- a/llvm/test/Transforms/SampleProfile/pseudo-probe-stale-profile-matching-LCS.ll
+++ b/llvm/test/Transforms/SampleProfile/pseudo-probe-stale-profile-matching-LCS.ll
@@ -1,6 +1,7 @@
; REQUIRES: x86_64-linux
; REQUIRES: asserts
; RUN: opt < %s -passes=sample-profile -sample-profile-file=%S/Inputs/pseudo-probe-stale-profile-matching-LCS.prof --salvage-stale-profile -S --debug-only=sample-profile,sample-profile-matcher,sample-profile-impl 2>&1 | FileCheck %s
+; RUN: opt < %s -passes=sample-profile -sample-profile-file=%S/Inputs/pseudo-probe-stale-profile-matching-LCS.prof --salvage-stale-profile -S --debug-only=sample-profile,sample-profile-matcher,sample-profile-impl --salvage-stale-profile-max-callsites=6 2>&1 | FileCheck %s -check-prefix=CHECK-MAX-CALLSITES
; CHECK: Run stale profile matching for test_direct_call
; CHECK: Location is matched from 1 to 1
@@ -27,6 +28,8 @@
; CHECK: Callsite with callee:unknown.indirect.callee is matched from 9 to 6
; CHECK: Callsite with callee:C is matched from 10 to 7
+; CHECK-MAX-CALLSITES: Skip stale profile matching for test_direct_call
+; CHECK-MAX-CALLSITES-NOT: Skip stale profile matching for test_indirect_call
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
|
af8188c
to
e349ed7
Compare
The algorithm added by PR llvm#87375 can be potentially quadratic in the number of anchors. This is almost never a problem because normally functions have a reasonable number of function calls. However, in some rare cases of auto-generated code we observed very large functions that trigger quadratic behaviour here (resulting in >130GB of peak heap memory usage for clang). Let's add a knob for controlling the max number of callsites in a function above which stale profile matching won't be performed.
e349ed7
to
ea2e568
Compare
Tagging @WenleiHe @wlei-llvm @david-xl. |
@@ -300,6 +306,16 @@ void SampleProfileMatcher::runStaleProfileMatching( | |||
if (FilteredIRAnchorsList.empty() || FilteredProfileAnchorList.empty()) | |||
return; | |||
|
|||
if (FilteredIRAnchorsList.size() > SalvageStaleProfileMaxCallsites || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to check both anchor list size or just one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this check is meant to prevent superlinear behaviour in case of very large functions, I think it makes sense to check for both "the profile has too many anchors" and "the IR has too many anchors". In the regular case both should be small and cases where any of those is say >1000 are exotic enough that it might be better to err on the side of caution and skip those functions.
For example: those functions are often autogenerated, in which case stale profile matching (which targets localised edits made by humans) might not be very appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the default to be something other than UINT_MAX?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Myer's diffing used here is supposed to be of O(ND), usually we don't have a ton of diff from LCS, so it's surprising to see quadratic behavior here..
Thanks for the work, acknowledged that I supposed the num of anchors is not super big while working on #87375 . However, we could lost some opportunities if skipping the large function matching , especially since larger functions are more likely to have code changes(no matter if it's edited by human). Trying to think if we can fix it by improve the current one: One is to improve the myers-diff memory cost, which I thinking maybe we can use a more efficient structure instead of 2-D array. Another way is I'm wondering if it makes sense to support both the algorithm(the diff and the old one), then make it configurable, e.g. if the callsite num is big, it automatically switches to a the lightweight one. ( (More for discussion for the best approach than asking to do it in this path) |
Sounds good. For the time being, the patch looks ok (to workaround OOM issues experience)? |
Yes, this looks good to get unblocked for pathological cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. For the time being, the patch looks ok (to workaround OOM issues experience)?
Sounds good to unblock the OOM.
The algorithm added by PR #87375 can be potentially quadratic in the number of anchors. This is almost never a problem because normally functions have a reasonable number of function calls.
However, in some rare cases of auto-generated code we observed very large functions that trigger quadratic behaviour here (resulting in >130GB of peak heap memory usage for clang). Let's add a knob for controlling the max number of callsites in a function above which stale profile matching won't be performed.