AlwaysInliner: A new inlining algorithm to interleave alloca promotion with inlines. #145613
base: users/aemerson/spr/main.alwaysinliner-a-new-inlining-algorithm-to-interleave-alloca-promotion-with-inlines
Conversation
Created using spr 1.3.5
@llvm/pr-subscribers-backend-x86 @llvm/pr-subscribers-llvm-transforms

Author: Amara Emerson (aemerson)

Changes: see the full patch description below.

Full diff: https://github.com/llvm/llvm-project/pull/145613.diff

4 Files Affected:
diff --git a/llvm/lib/Transforms/IPO/AlwaysInliner.cpp b/llvm/lib/Transforms/IPO/AlwaysInliner.cpp
index 8de1467ea47ec..8fab2985a0dca 100644
--- a/llvm/lib/Transforms/IPO/AlwaysInliner.cpp
+++ b/llvm/lib/Transforms/IPO/AlwaysInliner.cpp
@@ -12,21 +12,39 @@
//===----------------------------------------------------------------------===//
#include "llvm/Transforms/IPO/AlwaysInliner.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/MapVector.h"
#include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/SmallPtrSet.h"
+#include "llvm/ADT/Statistic.h"
#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/AssumptionCache.h"
+#include "llvm/Analysis/DominanceFrontier.h"
#include "llvm/Analysis/InlineAdvisor.h"
#include "llvm/Analysis/InlineCost.h"
#include "llvm/Analysis/OptimizationRemarkEmitter.h"
#include "llvm/Analysis/ProfileSummaryInfo.h"
+#include "llvm/Analysis/ValueTracking.h"
+#include "llvm/IR/BasicBlock.h"
+#include "llvm/IR/Dominators.h"
#include "llvm/IR/Module.h"
+#include "llvm/IR/ValueHandle.h"
#include "llvm/InitializePasses.h"
+#include "llvm/Support/CommandLine.h"
#include "llvm/Transforms/Utils/Cloning.h"
#include "llvm/Transforms/Utils/ModuleUtils.h"
+#include "llvm/Transforms/Utils/PromoteMemToReg.h"
+
using namespace llvm;
#define DEBUG_TYPE "inline"
+static cl::opt<bool> EnableMem2RegInterleaving(
+ "enable-always-inliner-mem2reg", cl::init(true), cl::Hidden,
+ cl::desc("Enable interleaving always-inlining with alloca promotion"));
+
+STATISTIC(NumAllocasPromoted,
+ "Number of allocas promoted to registers after inlining");
namespace {
@@ -129,6 +147,245 @@ bool AlwaysInlineImpl(
return Changed;
}
+/// Promote allocas to registers if possible.
+static void promoteAllocas(
+ Function *Caller, SmallPtrSetImpl<AllocaInst *> &AllocasToPromote,
+ function_ref<AssumptionCache &(Function &)> &GetAssumptionCache) {
+ if (AllocasToPromote.empty())
+ return;
+
+ SmallVector<AllocaInst *, 4> PromotableAllocas;
+ llvm::copy_if(AllocasToPromote, std::back_inserter(PromotableAllocas),
+ isAllocaPromotable);
+ if (PromotableAllocas.empty())
+ return;
+
+ DominatorTree DT(*Caller);
+ AssumptionCache &AC = GetAssumptionCache(*Caller);
+ PromoteMemToReg(PromotableAllocas, DT, &AC);
+ NumAllocasPromoted += PromotableAllocas.size();
+ // Emit a remark for the promotion.
+ OptimizationRemarkEmitter ORE(Caller);
+ DebugLoc DLoc = Caller->getEntryBlock().getTerminator()->getDebugLoc();
+ ORE.emit([&]() {
+ return OptimizationRemark(DEBUG_TYPE, "PromoteAllocas", DLoc,
+ &Caller->getEntryBlock())
+ << "Promoting " << ore::NV("NumAlloca", PromotableAllocas.size())
+ << " allocas to SSA registers in function '"
+ << ore::NV("Function", Caller) << "'";
+ });
+ LLVM_DEBUG(dbgs() << "Promoted " << PromotableAllocas.size()
+ << " allocas to registers in function " << Caller->getName()
+ << "\n");
+}
+
+/// We use a different visitation order of functions here to solve a phase
+/// ordering problem. After inlining, a caller function may have allocas that
+/// were previously used for passing reference arguments to the callee that
+/// are now promotable to registers, using SROA/mem2reg. However if we just let
+/// the AlwaysInliner continue inlining everything at once, the later SROA pass
+/// in the pipeline will end up placing phis for these allocas into blocks along
+/// the dominance frontier which may extend further than desired (e.g. loop
+/// headers). This can happen when the caller is then inlined into another
+/// caller, and the allocas end up hoisted further before SROA is run.
+///
+/// Instead, what we want to do, as best we can, is to inline leaf
+/// functions into callers, and then run PromoteMemToReg() on the allocas that
+/// were passed into the callee before it was inlined.
+///
+/// We want to do this *before* the caller is inlined into another caller
+/// because we want the alloca promotion to happen before its scope extends too
+/// far because of further inlining.
+///
+/// Here's a simple pseudo-example:
+/// outermost_caller() {
+/// for (...) {
+/// middle_caller();
+/// }
+/// }
+///
+/// middle_caller() {
+/// int stack_var;
+/// inner_callee(&stack_var);
+/// }
+///
+/// inner_callee(int *x) {
+/// // Do something with x.
+/// }
+///
+/// In this case, we want to inline inner_callee() into middle_caller() and
+/// then promote stack_var to a register before we inline middle_caller() into
+/// outermost_caller(). The regular always_inliner would inline everything at
+/// once, and then SROA/mem2reg would promote stack_var to a register but in
+/// the context of outermost_caller() which is not what we want.
+bool AlwaysInlineInterleavedMem2RegImpl(
+ Module &M, bool InsertLifetime, ProfileSummaryInfo &PSI,
+ FunctionAnalysisManager &FAM,
+ function_ref<AssumptionCache &(Function &)> GetAssumptionCache,
+ function_ref<AAResults &(Function &)> GetAAR) {
+
+ bool Changed = false;
+
+ // Use SetVector as we may rely on the deterministic iteration order for
+ // finding candidates later.
+ SetVector<Function *> AlwaysInlineFunctions;
+
+ MapVector<Function *, SmallVector<WeakVH>> CalleeToCallSites;
+ // Incoming always-inline calls for a function.
+ DenseMap<Function *, unsigned> IncomingAICount;
+ // Outgoing always-inline calls for a function.
+ DenseMap<Function *, unsigned> OutgoingAICount;
+ // First collect all always_inline functions.
+ for (Function &F : M) {
+ if (F.isDeclaration() || !F.hasFnAttribute(Attribute::AlwaysInline) ||
+ !isInlineViable(F).isSuccess())
+ continue;
+ if (F.isPresplitCoroutine())
+ continue;
+ AlwaysInlineFunctions.insert(&F);
+ }
+
+ DenseSet<Function *> ProcessedFunctions;
+ SmallVector<Function *> InlinedComdatFns;
+ // Build the call graph of always_inline functions.
+ for (Function *F : AlwaysInlineFunctions) {
+ for (User *U : F->users()) {
+ if (auto *CB = dyn_cast<CallBase>(U)) {
+ if (CB->getCalledFunction() != F || !canInlineCallBase(CB))
+ continue;
+ CalleeToCallSites[F].push_back(WeakVH(CB));
+ // Keep track of the number of incoming calls to this function.
+ // This is used to determine the order in which we inline functions.
+ IncomingAICount[F]++;
+ if (AlwaysInlineFunctions.count(CB->getCaller()))
+ OutgoingAICount[CB->getCaller()]++;
+ }
+ }
+ }
+
+ SmallVector<Function *, 16> Worklist;
+ for (Function *F : AlwaysInlineFunctions) {
+ // If this is an always_inline leaf function, we select it for inlining.
+ if (OutgoingAICount.lookup(F) == 0)
+ Worklist.push_back(F);
+ }
+
+ while (!Worklist.empty()) {
+ Function *Callee = Worklist.pop_back_val();
+ auto &Calls = CalleeToCallSites[Callee];
+
+ // Group the calls by their caller. This allows us to collect all allocas
+ // which need to be promoted together.
+ MapVector<Function *, SmallVector<WeakVH>> CallerToCalls;
+
+ for (WeakVH &WH : Calls)
+ if (auto *CB = dyn_cast_or_null<CallBase>(WH))
+ CallerToCalls[CB->getCaller()].push_back(WH);
+
+ // Now collect the allocas.
+ for (auto &CallerAndCalls : CallerToCalls) {
+ Function *Caller = CallerAndCalls.first;
+ SmallVector<WeakVH> &CallerCalls = CallerAndCalls.second;
+ SmallPtrSet<AllocaInst *, 4> AllocasToPromote;
+
+ for (WeakVH &WH : CallerCalls) {
+ if (auto *CB = dyn_cast_or_null<CallBase>(WH)) {
+ for (Value *Arg : CB->args())
+ if (auto *AI = dyn_cast<AllocaInst>(getUnderlyingObject(Arg)))
+ AllocasToPromote.insert(AI);
+ }
+ }
+
+ // Do the actual inlining.
+ bool InlinedAny = false;
+ SmallVector<WeakVH> SuccessfullyInlinedCalls;
+
+ for (WeakVH &WH : CallerCalls) {
+ if (auto *CB = dyn_cast_or_null<CallBase>(WH)) {
+ if (attemptInlineFunction(*Callee, CB, InsertLifetime, GetAAR,
+ GetAssumptionCache, PSI)) {
+ Changed = true;
+ InlinedAny = true;
+ SuccessfullyInlinedCalls.push_back(WH);
+ }
+ }
+ }
+
+ if (!InlinedAny)
+ continue;
+
+ // Promote any allocas that were used by the just-inlined call site.
+ promoteAllocas(Caller, AllocasToPromote, GetAssumptionCache);
+
+ unsigned InlinedCountForCaller = SuccessfullyInlinedCalls.size();
+ if (!AlwaysInlineFunctions.contains(Caller))
+ continue; // Caller wasn't part of our always-inline call graph.
+ unsigned OldOutgoing = OutgoingAICount[Caller];
+ assert(OldOutgoing >= InlinedCountForCaller &&
+ "Inlined more calls than we had outgoing calls!");
+ OutgoingAICount[Caller] = OldOutgoing - InlinedCountForCaller;
+ // If these were the last outgoing calls in the caller, we can now
+ // consider it a leaf function and add it to the worklist.
+ if (OutgoingAICount[Caller] == 0 && !ProcessedFunctions.count(Caller))
+ Worklist.push_back(Caller);
+ }
+
+ ProcessedFunctions.insert(Callee);
+ AlwaysInlineFunctions.remove(Callee);
+ CalleeToCallSites.erase(Callee);
+
+ Callee->removeDeadConstantUsers();
+ if (Callee->hasFnAttribute(Attribute::AlwaysInline) &&
+ Callee->isDefTriviallyDead()) {
+ if (Callee->hasComdat()) {
+ InlinedComdatFns.push_back(Callee);
+ } else {
+ M.getFunctionList().erase(Callee);
+ Changed = true;
+ }
+ }
+
+ if (AlwaysInlineFunctions.empty())
+ break;
+
+ // If we have no more leaf functions to inline, we use a greedy heuristic
+ // that selects the function with the most incoming calls. The intuition is
+ // inlining this function will eliminate the most call sites and give the
+ // highest chance of creating new leaf functions.
+ if (Worklist.empty()) {
+ Function *BestFunc = nullptr;
+ unsigned MaxIncoming = 0;
+ for (Function *F : AlwaysInlineFunctions) {
+ if (ProcessedFunctions.count(F))
+ continue;
+
+ unsigned CurrentIncoming = IncomingAICount.lookup(F);
+ if (!BestFunc || CurrentIncoming > MaxIncoming) {
+ BestFunc = F;
+ MaxIncoming = CurrentIncoming;
+ }
+ }
+ Worklist.push_back(BestFunc);
+ }
+ }
+
+ if (!InlinedComdatFns.empty()) {
+ filterDeadComdatFunctions(InlinedComdatFns);
+ for (Function *F : InlinedComdatFns) {
+ M.getFunctionList().erase(F);
+ Changed = true;
+ }
+ }
+
+ // We may have missed some call sites that were marked as always_inline but
+ // for which the callee function itself wasn't always_inline. Call the
+ // standard handler here to deal with those.
+ Changed |= AlwaysInlineImpl(M, InsertLifetime, PSI, &FAM, GetAssumptionCache,
+ GetAAR);
+ return Changed;
+}
+
+
struct AlwaysInlinerLegacyPass : public ModulePass {
bool InsertLifetime;
@@ -191,8 +448,14 @@ PreservedAnalyses AlwaysInlinerPass::run(Module &M,
};
auto &PSI = MAM.getResult<ProfileSummaryAnalysis>(M);
- bool Changed = AlwaysInlineImpl(M, InsertLifetime, PSI, &FAM,
- GetAssumptionCache, GetAAR);
+ bool Changed = false;
+ if (EnableMem2RegInterleaving) {
+ Changed = AlwaysInlineInterleavedMem2RegImpl(M, InsertLifetime, PSI, FAM,
+ GetAssumptionCache, GetAAR);
+ } else {
+ Changed = AlwaysInlineImpl(M, InsertLifetime, PSI, &FAM, GetAssumptionCache,
+ GetAAR);
+ }
if (!Changed)
return PreservedAnalyses::all();
diff --git a/llvm/test/Transforms/Inline/always-inline-alloca-promotion-no-leaf.ll b/llvm/test/Transforms/Inline/always-inline-alloca-promotion-no-leaf.ll
new file mode 100644
index 0000000000000..5048c2868e691
--- /dev/null
+++ b/llvm/test/Transforms/Inline/always-inline-alloca-promotion-no-leaf.ll
@@ -0,0 +1,46 @@
+; RUN: opt -p always-inline -enable-always-inliner-mem2reg -S -pass-remarks=inline < %s 2>&1 | FileCheck %s
+
+; CHECK: remark: <unknown>:0:0: 'inner' inlined into 'middle' with (cost=always):
+; CHECK: remark: <unknown>:0:0: 'inner' inlined into 'middle' with (cost=always):
+; CHECK: remark: <unknown>:0:0: 'inner' inlined into 'middle' with (cost=always):
+; CHECK-NEXT: remark: <unknown>:0:0: Promoting 1 allocas to SSA registers in function 'middle'
+; CHECK-NEXT: remark: <unknown>:0:0: 'middle' inlined into 'outer' with (cost=always):
+
+; This test constructs a call graph with no always-inline leaf functions,
+; showing that the function with the most incoming always-inline calls
+; is inlined first.
+
+declare void @side(i32)
+
+define linkonce_odr void @inner(i32* %x) #0 {
+entry:
+ store i32 42, i32* %x
+ %v = load i32, i32* %x
+ call void @side(i32 %v)
+ call void @outer() ; not an always-inline leaf.
+ ret void
+}
+
+define linkonce_odr void @middle() #0 {
+entry:
+ %var = alloca i32
+ call void @inner(i32* %var)
+ call void @inner(i32* %var)
+ call void @inner(i32* %var)
+ ret void
+}
+
+define void @outer() {
+; CHECK-LABEL: outer()
+; CHECK: call void @side(i32 42)
+; CHECK-NEXT: call void @outer()
+; CHECK-NEXT: call void @side(i32 42)
+; CHECK-NEXT: call void @outer()
+; CHECK-NEXT: call void @side(i32 42)
+; CHECK-NEXT: call void @outer()
+entry:
+ call void @middle()
+ ret void
+}
+
+attributes #0 = { alwaysinline }
\ No newline at end of file
diff --git a/llvm/test/Transforms/Inline/always-inline-alloca-promotion.ll b/llvm/test/Transforms/Inline/always-inline-alloca-promotion.ll
new file mode 100644
index 0000000000000..d52ca012a1b59
--- /dev/null
+++ b/llvm/test/Transforms/Inline/always-inline-alloca-promotion.ll
@@ -0,0 +1,35 @@
+; RUN: opt -p always-inline -enable-always-inliner-mem2reg -S -pass-remarks=inline < %s 2>&1 | FileCheck %s
+; CHECK: remark: <unknown>:0:0: 'inner' inlined into 'middle' with (cost=always): always inline attribute
+; CHECK-NEXT: remark: <unknown>:0:0: Promoting 1 allocas to SSA registers in function 'middle'
+; CHECK-NEXT: remark: <unknown>:0:0: 'middle' inlined into 'outer' with (cost=always): always inline attribute
+
+; A simple example to ensure we inline leaf function @inner into @middle,
+; promote the alloca in @middle, then inline @middle into @outer.
+
+declare void @side(i32)
+
+define linkonce_odr void @inner(i32* %x) #0 {
+entry:
+ store i32 42, i32* %x
+ %v = load i32, i32* %x
+ call void @side(i32 %v)
+ ret void
+}
+
+define linkonce_odr void @middle() #0 {
+entry:
+ %var = alloca i32
+ call void @inner(i32* %var)
+ ret void
+}
+
+define void @outer() {
+; CHECK-LABEL: outer()
+; CHECK: call void @side(i32 42)
+; CHECK: ret void
+entry:
+ call void @middle()
+ ret void
+}
+
+attributes #0 = { alwaysinline }
\ No newline at end of file
diff --git a/llvm/test/Transforms/Inline/always-inline-phase-ordering.ll b/llvm/test/Transforms/Inline/always-inline-phase-ordering.ll
index defd1f4fd426b..319f0f806bbb3 100644
--- a/llvm/test/Transforms/Inline/always-inline-phase-ordering.ll
+++ b/llvm/test/Transforms/Inline/always-inline-phase-ordering.ll
@@ -2,10 +2,10 @@
target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "arm64e-apple-macosx13"
+; CHECK: remark: <unknown>:0:0: 'wobble' inlined into 'snork' with (cost=always): always inline attribute
+; CHECK: remark: <unknown>:0:0: 'snork' inlined into 'blam' with (cost=always): always inline attribute
; CHECK: remark: <unknown>:0:0: 'wibble' inlined into 'bar.8' with (cost=always): always inline attribute
; CHECK: remark: <unknown>:0:0: 'wibble' inlined into 'pluto' with (cost=always): always inline attribute
-; CHECK: remark: <unknown>:0:0: 'snork' inlined into 'blam' with (cost=always): always inline attribute
-; CHECK: remark: <unknown>:0:0: 'wobble' inlined into 'blam' with (cost=always): always inline attribute
; CHECK: remark: <unknown>:0:0: 'spam' inlined into 'blam' with (cost=65, threshold=75)
; CHECK: remark: <unknown>:0:0: 'wibble.1' inlined into 'widget' with (cost=30, threshold=75)
; CHECK: remark: <unknown>:0:0: 'widget' inlined into 'bar.8' with (cost=30, threshold=75)
✅ With the latest revision this PR passed the C/C++ code formatter.
Depends on refactoring PR: #145614
There's another alternative to this approach: simply run the main inliner twice in the pipeline, along with the instcombine/simplify opts, the first time for alwaysinline functions only. The constraint is that we must inline all always-inline call sites before we do any heuristic analysis in the main inliner. We can't do "intra-SCC" splitting of the always-inline and normal call sites as we used to do.
/// then promote stack_var to a register before we inline middle_caller() into
/// outermost_caller(). The regular always_inliner would inline everything at
/// once, and then SROA/mem2reg would promote stack_var to a register but in
/// the context of outermost_caller() which is not what we want. |
Could you expand a bit more on why this (promoting stack_var in outermost_caller) is bad?
Sure. The problem is that mem2reg promotion has to place phi nodes for the value along the dominance frontier. This frontier is different depending on inlining order. For allocas, what you want is to insert phis when the size of the dominance frontier is as small as possible. The motivation is that allocas inside nested loops can "leak" phis beyond the innermost loop header, and that's bad for register pressure.
The main inliner already handles this because the pass manager interleaves optimizations with inlining, but for always-inliner we don't have that capability.
Thanks. We have been experimenting with other traversal orders (hence ModuleInliner.cpp) and this aspect is good to keep in mind. In that context, could the problem addressed here be decoupled from inlining order? It seems like it'd result in a more robust system.
(I'm not trying to scope-creep, rather want to understand what options we have, and that doesn't have to impact what we do right now)
> In that context, could the problem addressed here be decoupled from inlining order? It seems like it'd result in a more robust system.
I don't think so, unless there's something I've missed. Before doing this I tried other approaches, such as:
- Trying to detect these over-extended PHIs and then demoting them back to allocas. Didn't work as we end up pessimizing codegen.
- Avoiding hoisting large vector allocas to the entry block, in order to block mem2reg. This works but is conceptually the wrong place to do it (no other heuristics code exists there).
I wasn't aware of ModuleInliner. Is the long term plan for it to replace the existing inliner? If so we could in future merge it with AlwaysInliner and if we interleave optimization as the current SCC manager does then this should fix the problem.
There's no plan yet with the ModuleInliner; currently it lets us experiment with alternative traversals, and some of them have been showing promise.
I'm mainly trying to understand if:
- the order of traversal matters (for this problem here)
- do all the function simplification passes need to be run after some inlining or just some? I'm guessing it's really "just a specific subset", correct?
Yes the traversal order matters here, because for optimal codegen we want mem2reg to happen between the inner->middle and middle->outer inlines. If you do it the other way around mem2reg can't do anything until the final inner->outer inline and by that point it's too late.
For now I think only this promotion is a known issue, I don't know of general issues with simplification.
Ack, so that means that ModuleInliner running interleaved simplifications won't help (if the order isn't bottom-up traversal).
(picking on "Avoiding hoisting large vector allocas to the entry block") this happens in vectorizable kind of code? Asking (to learn) because in the experiments I mentioned, when we changed traversal order, we also (orthogonally) postponed function simplification to after all inlining was done, with no discernable performance effect, but for datacenter kind of apps, though.
(also to learn/understand) by delaying promoting the allocas, you're kind of pre-spilling live ranges, right?
(picking on "Avoiding hoisting large vector allocas to the entry block") this happens in vectorizable kind of code? Asking (to learn) because in the experiments I mentioned, when we changed traversal order, we also (orthogonally) postponed function simplification to after all inlining was done, with no discernable performance effect, but for datacenter kind of apps, though.
The motivating use case happens to be builtins code written with multiple large vectors (like <16 x float>) using the Arm SME extension. It's common for programmers to define these large vector values on the stack and then pass references to helper functions. It's these allocas that we're trying to ensure get promoted at the right time. You won't see this in general code like loop-vectorization since that all happens after inlining.
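To make that pattern concrete, here is a small hypothetical C++ sketch of the kind of code being described (not taken from the PR or from real SME builtins; the names, the generic vector-extension type, and the loop shape are invented for illustration): a large vector value lives on the stack and is passed by reference to an always_inline helper, so after inlining the alloca behind it becomes promotable.

// Hypothetical illustration only; real SME code would use ACLE types/intrinsics.
typedef float Vec16 __attribute__((vector_size(64)));   // roughly a <16 x float>

__attribute__((always_inline))
static inline void accumulate(Vec16 *acc, const Vec16 *row) {
  *acc += *row;                  // operates on the caller's stack slot through the pointer
}

void kernel(const Vec16 *rows, int n, Vec16 *out) {
  Vec16 acc = {0};               // stack object: the alloca we want promoted early
  for (int i = 0; i < n; ++i)
    accumulate(&acc, &rows[i]);  // once inlined, &acc no longer escapes and mem2reg applies
  *out = acc;
}

The earlier the inlining of accumulate() is followed by promotion of acc, the smaller the scope mem2reg has to reason about when kernel() itself is later inlined elsewhere.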
That said, this isn't solely restricted to builtins/vectors. We also see some negligible changes (I wouldn't go as far as "improvements") in WebKit, which is a very heavy user of always_inline.
> (also to learn/understand) by delaying promoting the allocas, you're kind of pre-spilling live ranges, right?
Not sure I understand the question. Here we're not really delaying the promotion, but eagerly promoting at the smallest scope possible. In general though yes, if you were to avoid promotion/mem2reg you'd effectively force spilling and hope that other memory optimizations clean things up (unlikely).
Created using spr 1.3.6
Pull Request Overview
This PR introduces an improved inlining algorithm that interleaves alloca promotion with inlining, addressing phase ordering issues that could otherwise extend alloca lifetimes unintentionally. Key changes include updated inline remark checks in test files, new tests for both leaf and non-leaf always-inline functions, and significant modifications in AlwaysInliner.cpp to interleave PromoteMemToReg calls during the inlining process.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
llvm/test/Transforms/Inline/always-inline-phase-ordering.ll | Updated expected inline remark ordering in CHECK directives. |
llvm/test/Transforms/Inline/always-inline-alloca-promotion.ll | Added new test to ensure proper promotion of allocas after inlining leaf functions. |
llvm/test/Transforms/Inline/always-inline-alloca-promotion-no-leaf.ll | Added test for non-leaf functions to verify the greedy inlining heuristic. |
llvm/lib/Transforms/IPO/AlwaysInliner.cpp | Introduced interleaved alloca promotion via a new function and controlled by a command-line flag. |
// that selects the function with the most incoming calls. The intuition is
// inlining this function will eliminate the most call sites and give the
// highest chance of creating new leaf functions.
if (Worklist.empty()) { |
Consider checking whether 'BestFunc' is non-null before pushing it onto 'Worklist' to prevent a potential null pointer from being added in cases where no unprocessed always-inline functions are found.
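For reference, the guard being suggested would look roughly like this (a sketch against the names used in the patch above, not an actual committed fix):

  // The greedy scan can come up empty if every remaining always-inline
  // function has already been processed, leaving BestFunc null.
  if (BestFunc)
    Worklist.push_back(BestFunc);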
Not bad! Good bot.
I just realized a bunch of clang tests are failing because they're running the inliner. Will fix and update this PR along with the copilot reported bug.
Created using spr 1.3.6
ping
Can you please pre-commit a PhaseOrdering test that demonstrates the problem you are trying to solve? It's hard to understand whether this is the correct solution to the problem without an actual test case.
Created using spr 1.3.6
I managed to reduce down the original SME test to
You can test this locally with the following command:

git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' 'HEAD~1' HEAD llvm/test/Transforms/Inline/always-inline-alloca-promotion-no-leaf.ll llvm/test/Transforms/Inline/always-inline-alloca-promotion.ll clang/test/CodeGen/PowerPC/builtins-ppc-xl-xst.c clang/test/CodeGen/X86/avx512f-builtins.c clang/test/CodeGen/X86/avx512vl-builtins.c clang/test/CodeGen/X86/avx512vlbw-builtins.c clang/test/CodeGen/arm_acle.c clang/test/Headers/amdgcn_openmp_device_math.c clang/test/Headers/openmp_device_math_isnan.cpp clang/test/OpenMP/bug57757.cpp llvm/lib/Transforms/IPO/AlwaysInliner.cpp llvm/test/Transforms/Inline/always-inline-phase-ordering.ll llvm/test/Transforms/PhaseOrdering/always-inline-alloca-promotion.ll

The following files introduce new uses of undef:

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef.

In tests, avoid using undef and having tests that trigger undefined behavior.

For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.
We use a different visitation order of functions here to solve a phase
ordering problem. After inlining, a caller function may have allocas that
were previously used for passing reference arguments to the callee that
are now promotable to registers, using SROA/mem2reg. However if we just let
the AlwaysInliner continue inlining everything at once, the later SROA pass
in the pipeline will end up placing phis for these allocas into blocks along
the dominance frontier which may extend further than desired (e.g. loop
headers). This can happen when the caller is then inlined into another
caller, and the allocas end up hoisted further before SROA is run.
Instead, what we want to do, as best we can, is to inline leaf
functions into callers, and then run PromoteMemToReg() on the allocas that
were passed into the callee before it was inlined.
We want to do this before the caller is inlined into another caller
because we want the alloca promotion to happen before its scope extends too
far because of further inlining.
Here's a simple pseudo-example:
outermost_caller() {
for (...) {
middle_caller();
}
}
middle_caller() {
int stack_var;
inner_callee(&stack_var);
}
inner_callee(int *x) {
// Do something with x.
}
In this case, we want to inline inner_callee() into middle_caller() and
then promote stack_var to a register before we inline middle_caller() into
outermost_caller(). The regular always_inliner would inline everything at
once, and then SROA/mem2reg would promote stack_var to a register but in
the context of outermost_caller() which is not what we want.