
AlwaysInliner: A new inlining algorithm to interleave alloca promotion with inlines. #145613


Open · wants to merge 4 commits into base: users/aemerson/spr/main.alwaysinliner-a-new-inlining-algorithm-to-interleave-alloca-promotion-with-inlines

Conversation

aemerson (Contributor)

We use a different visitation order of functions here to solve a phase-ordering
problem. After inlining, a caller function may have allocas that were
previously used to pass reference arguments to the callee and that are now
promotable to registers via SROA/mem2reg. However, if we just let the
AlwaysInliner continue inlining everything at once, the later SROA pass in the
pipeline will place phis for these allocas in blocks along the dominance
frontier, which may extend further than desired (e.g. into loop headers). This
can happen when the caller is itself inlined into another caller and the
allocas end up hoisted further before SROA is run.

Instead, what we want to do, as best we can, is inline leaf functions into
their callers, and then run PromoteMemToReg() on the allocas that were passed
into the callee before it was inlined.

We want to do this before the caller is inlined into another caller, so that
the alloca promotion happens before the alloca's scope is extended too far by
further inlining.

Here's a simple pseudo-example:

outermost_caller() {
  for (...) {
    middle_caller();
  }
}

middle_caller() {
  int stack_var;
  inner_callee(&stack_var);
}

inner_callee(int *x) {
  // Do something with x.
}

In this case, we want to inline inner_callee() into middle_caller() and then
promote stack_var to a register before we inline middle_caller() into
outermost_caller(). The regular always_inliner would inline everything at
once, and then SROA/mem2reg would promote stack_var to a register, but in the
context of outermost_caller(), which is not what we want.
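
To make the failure mode concrete, here is a rough LLVM IR sketch (hypothetical, not taken from the patch) of the outcome being described: once everything has been inlined into outermost_caller(), a later SROA/mem2reg run promotes stack_var there and places a phi at the loop header:

define void @outermost_caller(i1 %cond) {
entry:
  br label %loop

loop:
  ; mem2reg places the phi for what used to be stack_var on the dominance
  ; frontier, i.e. the loop header, so its live range now spans the loop.
  %stack_var = phi i32 [ 0, %entry ], [ %v, %loop ]
  ; ... inlined bodies of middle_caller() and inner_callee() ...
  %v = add i32 %stack_var, 42
  br i1 %cond, label %loop, label %exit

exit:
  ret void
}

Promoting stack_var while it is still local to middle_caller() instead yields a plain SSA value in the loop body, with no loop-header phi.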

Created using spr 1.3.5
llvmbot (Member) commented Jun 24, 2025

@llvm/pr-subscribers-backend-x86
@llvm/pr-subscribers-backend-powerpc
@llvm/pr-subscribers-hlsl
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-llvm-transforms

Author: Amara Emerson (aemerson)

Full diff: https://github.com/llvm/llvm-project/pull/145613.diff

4 Files Affected:

  • (modified) llvm/lib/Transforms/IPO/AlwaysInliner.cpp (+265-2)
  • (added) llvm/test/Transforms/Inline/always-inline-alloca-promotion-no-leaf.ll (+46)
  • (added) llvm/test/Transforms/Inline/always-inline-alloca-promotion.ll (+35)
  • (modified) llvm/test/Transforms/Inline/always-inline-phase-ordering.ll (+2-2)
diff --git a/llvm/lib/Transforms/IPO/AlwaysInliner.cpp b/llvm/lib/Transforms/IPO/AlwaysInliner.cpp
index 8de1467ea47ec..8fab2985a0dca 100644
--- a/llvm/lib/Transforms/IPO/AlwaysInliner.cpp
+++ b/llvm/lib/Transforms/IPO/AlwaysInliner.cpp
@@ -12,21 +12,39 @@
 //===----------------------------------------------------------------------===//
 
 #include "llvm/Transforms/IPO/AlwaysInliner.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/MapVector.h"
 #include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/SmallPtrSet.h"
+#include "llvm/ADT/Statistic.h"
 #include "llvm/Analysis/AliasAnalysis.h"
 #include "llvm/Analysis/AssumptionCache.h"
+#include "llvm/Analysis/DominanceFrontier.h"
 #include "llvm/Analysis/InlineAdvisor.h"
 #include "llvm/Analysis/InlineCost.h"
 #include "llvm/Analysis/OptimizationRemarkEmitter.h"
 #include "llvm/Analysis/ProfileSummaryInfo.h"
+#include "llvm/Analysis/ValueTracking.h"
+#include "llvm/IR/BasicBlock.h"
+#include "llvm/IR/Dominators.h"
 #include "llvm/IR/Module.h"
+#include "llvm/IR/ValueHandle.h"
 #include "llvm/InitializePasses.h"
+#include "llvm/Support/CommandLine.h"
 #include "llvm/Transforms/Utils/Cloning.h"
 #include "llvm/Transforms/Utils/ModuleUtils.h"
+#include "llvm/Transforms/Utils/PromoteMemToReg.h"
+
 
 using namespace llvm;
 
 #define DEBUG_TYPE "inline"
+static cl::opt<bool> EnableMem2RegInterleaving(
+    "enable-always-inliner-mem2reg", cl::init(true), cl::Hidden,
+    cl::desc("Enable interleaving always-inlining with alloca promotion"));
+
+STATISTIC(NumAllocasPromoted,
+          "Number of allocas promoted to registers after inlining");
 
 namespace {
 
@@ -129,6 +147,245 @@ bool AlwaysInlineImpl(
   return Changed;
 }
 
+/// Promote allocas to registers if possible.
+static void promoteAllocas(
+    Function *Caller, SmallPtrSetImpl<AllocaInst *> &AllocasToPromote,
+    function_ref<AssumptionCache &(Function &)> &GetAssumptionCache) {
+  if (AllocasToPromote.empty())
+    return;
+
+  SmallVector<AllocaInst *, 4> PromotableAllocas;
+  llvm::copy_if(AllocasToPromote, std::back_inserter(PromotableAllocas),
+                isAllocaPromotable);
+  if (PromotableAllocas.empty())
+    return;
+
+  DominatorTree DT(*Caller);
+  AssumptionCache &AC = GetAssumptionCache(*Caller);
+  PromoteMemToReg(PromotableAllocas, DT, &AC);
+  NumAllocasPromoted += PromotableAllocas.size();
+  // Emit a remark for the promotion.
+  OptimizationRemarkEmitter ORE(Caller);
+  DebugLoc DLoc = Caller->getEntryBlock().getTerminator()->getDebugLoc();
+  ORE.emit([&]() {
+    return OptimizationRemark(DEBUG_TYPE, "PromoteAllocas", DLoc,
+                              &Caller->getEntryBlock())
+           << "Promoting " << ore::NV("NumAlloca", PromotableAllocas.size())
+           << " allocas to SSA registers in function '"
+           << ore::NV("Function", Caller) << "'";
+  });
+  LLVM_DEBUG(dbgs() << "Promoted " << PromotableAllocas.size()
+                    << " allocas to registers in function " << Caller->getName()
+                    << "\n");
+}
+
+/// We use a different visitation order of functions here to solve a phase
+/// ordering problem. After inlining, a caller function may have allocas that
+/// were previously used for passing reference arguments to the callee that
+/// are now promotable to registers, using SROA/mem2reg. However if we just let
+/// the AlwaysInliner continue inlining everything at once, the later SROA pass
+/// in the pipeline will end up placing phis for these allocas into blocks along
+/// the dominance frontier which may extend further than desired (e.g. loop
+/// headers). This can happen when the caller is then inlined into another
+/// caller, and the allocas end up hoisted further before SROA is run.
+///
+/// Instead, what we want to do, as best we can, is to inline leaf
+/// functions into callers, and then run PromoteMemToReg() on the allocas that
+/// were passed into the callee before it was inlined.
+///
+/// We want to do this *before* the caller is inlined into another caller
+/// because we want the alloca promotion to happen before its scope extends too
+/// far because of further inlining.
+///
+/// Here's a simple pseudo-example:
+/// outermost_caller() {
+///   for (...) {
+///     middle_caller();
+///   }
+/// }
+///
+/// middle_caller() {
+///   int stack_var;
+///   inner_callee(&stack_var);
+/// }
+///
+/// inner_callee(int *x) {
+///   // Do something with x.
+/// }
+///
+/// In this case, we want to inline inner_callee() into middle_caller() and
+/// then promote stack_var to a register before we inline middle_caller() into
+/// outermost_caller(). The regular always_inliner would inline everything at
+/// once, and then SROA/mem2reg would promote stack_var to a register but in
+/// the context of outermost_caller() which is not what we want.
+bool AlwaysInlineInterleavedMem2RegImpl(
+    Module &M, bool InsertLifetime, ProfileSummaryInfo &PSI,
+    FunctionAnalysisManager &FAM,
+    function_ref<AssumptionCache &(Function &)> GetAssumptionCache,
+    function_ref<AAResults &(Function &)> GetAAR) {
+
+  bool Changed = false;
+
+  // Use SetVector as we may rely on the deterministic iteration order for
+  // finding candidates later.
+  SetVector<Function *> AlwaysInlineFunctions;
+
+  MapVector<Function *, SmallVector<WeakVH>> CalleeToCallSites;
+  // Incoming always-inline calls for a function.
+  DenseMap<Function *, unsigned> IncomingAICount;
+  // Outgoing always-inline calls for a function.
+  DenseMap<Function *, unsigned> OutgoingAICount;
+  // First collect all always_inline functions.
+  for (Function &F : M) {
+    if (F.isDeclaration() || !F.hasFnAttribute(Attribute::AlwaysInline) ||
+        !isInlineViable(F).isSuccess())
+      continue;
+    if (F.isPresplitCoroutine())
+      continue;
+    AlwaysInlineFunctions.insert(&F);
+  }
+
+  DenseSet<Function *> ProcessedFunctions;
+  SmallVector<Function *> InlinedComdatFns;
+  // Build the call graph of always_inline functions.
+  for (Function *F : AlwaysInlineFunctions) {
+    for (User *U : F->users()) {
+      if (auto *CB = dyn_cast<CallBase>(U)) {
+        if (CB->getCalledFunction() != F || !canInlineCallBase(CB))
+          continue;
+        CalleeToCallSites[F].push_back(WeakVH(CB));
+        // Keep track of the number of incoming calls to this function.
+        // This is used to determine the order in which we inline functions.
+        IncomingAICount[F]++;
+        if (AlwaysInlineFunctions.count(CB->getCaller()))
+          OutgoingAICount[CB->getCaller()]++;
+      }
+    }
+  }
+
+  SmallVector<Function *, 16> Worklist;
+  for (Function *F : AlwaysInlineFunctions) {
+    // If this is an always_inline leaf function, we select it for inlining.
+    if (OutgoingAICount.lookup(F) == 0)
+      Worklist.push_back(F);
+  }
+
+  while (!Worklist.empty()) {
+    Function *Callee = Worklist.pop_back_val();
+    auto &Calls = CalleeToCallSites[Callee];
+
+    // Group the calls by their caller. This allows us to collect all allocas
+    // which need to be promoted together.
+    MapVector<Function *, SmallVector<WeakVH>> CallerToCalls;
+
+    for (WeakVH &WH : Calls)
+      if (auto *CB = dyn_cast_or_null<CallBase>(WH))
+        CallerToCalls[CB->getCaller()].push_back(WH);
+
+    // Now collect the allocas.
+    for (auto &CallerAndCalls : CallerToCalls) {
+      Function *Caller = CallerAndCalls.first;
+      SmallVector<WeakVH> &CallerCalls = CallerAndCalls.second;
+      SmallPtrSet<AllocaInst *, 4> AllocasToPromote;
+
+      for (WeakVH &WH : CallerCalls) {
+        if (auto *CB = dyn_cast_or_null<CallBase>(WH)) {
+          for (Value *Arg : CB->args())
+            if (auto *AI = dyn_cast<AllocaInst>(getUnderlyingObject(Arg)))
+              AllocasToPromote.insert(AI);
+        }
+      }
+
+      // Do the actual inlining.
+      bool InlinedAny = false;
+      SmallVector<WeakVH> SuccessfullyInlinedCalls;
+
+      for (WeakVH &WH : CallerCalls) {
+        if (auto *CB = dyn_cast_or_null<CallBase>(WH)) {
+          if (attemptInlineFunction(*Callee, CB, InsertLifetime, GetAAR,
+                                    GetAssumptionCache, PSI)) {
+            Changed = true;
+            InlinedAny = true;
+            SuccessfullyInlinedCalls.push_back(WH);
+          }
+        }
+      }
+
+      if (!InlinedAny)
+        continue;
+
+      // Promote any allocas that were used by the just-inlined call site.
+      promoteAllocas(Caller, AllocasToPromote, GetAssumptionCache);
+
+      unsigned InlinedCountForCaller = SuccessfullyInlinedCalls.size();
+      if (!AlwaysInlineFunctions.contains(Caller))
+        continue; // Caller wasn't part of our always-inline call graph.
+      unsigned OldOutgoing = OutgoingAICount[Caller];
+      assert(OldOutgoing >= InlinedCountForCaller &&
+             "Inlined more calls than we had outgoing calls!");
+      OutgoingAICount[Caller] = OldOutgoing - InlinedCountForCaller;
+      // If these were the last outgoing calls in the caller, we can now
+      // consider it a leaf function and add it to the worklist.
+      if (OutgoingAICount[Caller] == 0 && !ProcessedFunctions.count(Caller))
+        Worklist.push_back(Caller);
+    }
+
+    ProcessedFunctions.insert(Callee);
+    AlwaysInlineFunctions.remove(Callee);
+    CalleeToCallSites.erase(Callee);
+
+    Callee->removeDeadConstantUsers();
+    if (Callee->hasFnAttribute(Attribute::AlwaysInline) &&
+        Callee->isDefTriviallyDead()) {
+      if (Callee->hasComdat()) {
+        InlinedComdatFns.push_back(Callee);
+      } else {
+        M.getFunctionList().erase(Callee);
+        Changed = true;
+      }
+    }
+
+    if (AlwaysInlineFunctions.empty())
+      break;
+
+    // If we have no more leaf functions to inline, we use a greedy heuristic
+    // that selects the function with the most incoming calls. The intuition is
+    // inlining this function will eliminate the most call sites and give the
+    // highest chance of creating new leaf functions.
+    if (Worklist.empty()) {
+      Function *BestFunc = nullptr;
+      unsigned MaxIncoming = 0;
+      for (Function *F : AlwaysInlineFunctions) {
+        if (ProcessedFunctions.count(F))
+          continue;
+
+        unsigned CurrentIncoming = IncomingAICount.lookup(F);
+        if (!BestFunc || CurrentIncoming > MaxIncoming) {
+          BestFunc = F;
+          MaxIncoming = CurrentIncoming;
+        }
+      }
+      Worklist.push_back(BestFunc);
+    }
+  }
+
+  if (!InlinedComdatFns.empty()) {
+    filterDeadComdatFunctions(InlinedComdatFns);
+    for (Function *F : InlinedComdatFns) {
+      M.getFunctionList().erase(F);
+      Changed = true;
+    }
+  }
+
+  // We may have missed some call sites that were marked as always_inline but
+  // for which the callee function itself wasn't always_inline. Call the
+  // standard handler here to deal with those.
+  Changed |= AlwaysInlineImpl(M, InsertLifetime, PSI, &FAM, GetAssumptionCache,
+                              GetAAR);
+  return Changed;
+}
+
+
 struct AlwaysInlinerLegacyPass : public ModulePass {
   bool InsertLifetime;
 
@@ -191,8 +448,14 @@ PreservedAnalyses AlwaysInlinerPass::run(Module &M,
   };
   auto &PSI = MAM.getResult<ProfileSummaryAnalysis>(M);
 
-  bool Changed = AlwaysInlineImpl(M, InsertLifetime, PSI, &FAM,
-                                  GetAssumptionCache, GetAAR);
+  bool Changed = false;
+  if (EnableMem2RegInterleaving) {
+    Changed = AlwaysInlineInterleavedMem2RegImpl(M, InsertLifetime, PSI, FAM,
+                                                 GetAssumptionCache, GetAAR);
+  } else {
+    Changed = AlwaysInlineImpl(M, InsertLifetime, PSI, &FAM, GetAssumptionCache,
+                               GetAAR);
+  }
   if (!Changed)
     return PreservedAnalyses::all();
 
diff --git a/llvm/test/Transforms/Inline/always-inline-alloca-promotion-no-leaf.ll b/llvm/test/Transforms/Inline/always-inline-alloca-promotion-no-leaf.ll
new file mode 100644
index 0000000000000..5048c2868e691
--- /dev/null
+++ b/llvm/test/Transforms/Inline/always-inline-alloca-promotion-no-leaf.ll
@@ -0,0 +1,46 @@
+; RUN: opt -p always-inline -enable-always-inliner-mem2reg -S -pass-remarks=inline < %s 2>&1 | FileCheck %s
+
+; CHECK: remark: <unknown>:0:0: 'inner' inlined into 'middle' with (cost=always):
+; CHECK: remark: <unknown>:0:0: 'inner' inlined into 'middle' with (cost=always):
+; CHECK: remark: <unknown>:0:0: 'inner' inlined into 'middle' with (cost=always):
+; CHECK-NEXT: remark: <unknown>:0:0: Promoting 1 allocas to SSA registers in function 'middle'
+; CHECK-NEXT: remark: <unknown>:0:0: 'middle' inlined into 'outer' with (cost=always):
+
+; This test constructs a call graph with no always-inline leaf functions,
+; showing that the function with the most incoming always-inline calls
+; is inlined first.
+
+declare void @side(i32)
+
+define linkonce_odr void @inner(i32* %x) #0 {
+entry:
+  store i32 42, i32* %x
+  %v = load i32, i32* %x
+  call void @side(i32 %v)
+  call void @outer() ; not an always-inline leaf.
+  ret void
+}
+
+define linkonce_odr void @middle() #0 {
+entry:
+  %var = alloca i32
+  call void @inner(i32* %var)
+  call void @inner(i32* %var)
+  call void @inner(i32* %var)
+  ret void
+}
+
+define void @outer() {
+; CHECK-LABEL: outer()
+; CHECK:       call void @side(i32 42)
+; CHECK-NEXT:  call void @outer()
+; CHECK-NEXT:  call void @side(i32 42)
+; CHECK-NEXT:  call void @outer()
+; CHECK-NEXT:  call void @side(i32 42)
+; CHECK-NEXT:  call void @outer()
+entry:
+  call void @middle()
+  ret void
+}
+
+attributes #0 = { alwaysinline }
\ No newline at end of file
diff --git a/llvm/test/Transforms/Inline/always-inline-alloca-promotion.ll b/llvm/test/Transforms/Inline/always-inline-alloca-promotion.ll
new file mode 100644
index 0000000000000..d52ca012a1b59
--- /dev/null
+++ b/llvm/test/Transforms/Inline/always-inline-alloca-promotion.ll
@@ -0,0 +1,35 @@
+; RUN: opt -p always-inline -enable-always-inliner-mem2reg -S -pass-remarks=inline < %s 2>&1 | FileCheck %s
+; CHECK: remark: <unknown>:0:0: 'inner' inlined into 'middle' with (cost=always): always inline attribute
+; CHECK-NEXT: remark: <unknown>:0:0: Promoting 1 allocas to SSA registers in function 'middle'
+; CHECK-NEXT: remark: <unknown>:0:0: 'middle' inlined into 'outer' with (cost=always): always inline attribute
+
+; A simple example to ensure we inline leaf function @inner into @middle,
+; promote the alloca in @middle, then inline @middle into @outer.
+
+declare void @side(i32)
+
+define linkonce_odr void @inner(i32* %x) #0 {
+entry:
+  store i32 42, i32* %x
+  %v = load i32, i32* %x
+  call void @side(i32 %v)
+  ret void
+}
+
+define linkonce_odr void @middle() #0 {
+entry:
+  %var = alloca i32
+  call void @inner(i32* %var)
+  ret void
+}
+
+define void @outer() {
+; CHECK-LABEL: outer()
+; CHECK:  call void @side(i32 42)
+; CHECK:  ret void
+entry:
+  call void @middle()
+  ret void
+}
+
+attributes #0 = { alwaysinline }
\ No newline at end of file
diff --git a/llvm/test/Transforms/Inline/always-inline-phase-ordering.ll b/llvm/test/Transforms/Inline/always-inline-phase-ordering.ll
index defd1f4fd426b..319f0f806bbb3 100644
--- a/llvm/test/Transforms/Inline/always-inline-phase-ordering.ll
+++ b/llvm/test/Transforms/Inline/always-inline-phase-ordering.ll
@@ -2,10 +2,10 @@
 target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
 target triple = "arm64e-apple-macosx13"
 
+; CHECK: remark: <unknown>:0:0: 'wobble' inlined into 'snork' with (cost=always): always inline attribute
+; CHECK: remark: <unknown>:0:0: 'snork' inlined into 'blam' with (cost=always): always inline attribute
 ; CHECK: remark: <unknown>:0:0: 'wibble' inlined into 'bar.8' with (cost=always): always inline attribute
 ; CHECK: remark: <unknown>:0:0: 'wibble' inlined into 'pluto' with (cost=always): always inline attribute
-; CHECK: remark: <unknown>:0:0: 'snork' inlined into 'blam' with (cost=always): always inline attribute
-; CHECK: remark: <unknown>:0:0: 'wobble' inlined into 'blam' with (cost=always): always inline attribute
 ; CHECK: remark: <unknown>:0:0: 'spam' inlined into 'blam' with (cost=65, threshold=75)
 ; CHECK: remark: <unknown>:0:0: 'wibble.1' inlined into 'widget' with (cost=30, threshold=75)
 ; CHECK: remark: <unknown>:0:0: 'widget' inlined into 'bar.8' with (cost=30, threshold=75)


github-actions bot commented Jun 24, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

aemerson (Contributor, Author)

Depends on refactoring PR: #145614

aemerson (Contributor, Author)

There's another alternative to this approach: we could simply run the main inliner twice in the pipeline, along with the instcombine/simplify opts, with the first run handling only alwaysinline functions.

The constraint is that we must inline all always-inline call sites before we do any heuristic analysis in the main inliner. We can't do "intra-SCC" splitting of the always-inline and normal call sites as we used to do.
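
For reference, a hypothetical sketch (not part of this patch) of what that alternative could look like as a new-pass-manager pipeline, using the existing inline<only-mandatory> parameterization so the first inliner run only handles alwaysinline call sites:

opt -passes='cgscc(inline<only-mandatory>,function(sroa,instcombine)),cgscc(inline)' input.ll -S

Here the function(sroa,instcombine) adaptor runs after the mandatory inlining within each SCC, providing the simplification that round needs before the full inliner runs.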

/// then promote stack_var to a register before we inline middle_caller() into
/// outermost_caller(). The regular always_inliner would inline everything at
/// once, and then SROA/mem2reg would promote stack_var to a register but in
/// the context of outermost_caller() which is not what we want.
Member

Could you expand a bit more why this (promoting stack_var in outermost_caller) is bad?

aemerson (Contributor, Author) commented Jun 25, 2025

Sure. The problem is that mem2reg promotion has to place phi nodes for the value along the dominance frontier. This frontier is different depending on inlining order. For allocas, what you want is to insert phis when the size of the dominance frontier is as small as possible. The motivation is that allocas inside nested loops can "leak" phis beyond the innermost loop header, and that's bad for register pressure.

The main inliner already handles this because the pass manager interleaves optimizations with inlining, but the always-inliner doesn't have that capability.

Member

Thanks. We have been experimenting with other traversal orders (hence ModuleInliner.cpp) and this aspect is good to keep in mind. In that context, could the problem addressed here be decoupled from inlining order? It seems like it'd result in a more robust system.

(I'm not trying to scope-creep, rather want to understand what options we have, and that doesn't have to impact what we do right now)

aemerson (Contributor, Author)

In that context, could the problem addressed here be decoupled from inlining order? It seems like it'd result in a more robust system.

I don't think so, unless there's something I've missed. Before doing this I tried other approaches, such as:

  • Trying to detect these over-extended PHIs and then demoting them back to allocas. Didn't work as we end up pessimizing codegen.
  • Avoiding hoisting large vector allocas to the entry block, in order to block mem2reg. This works but is conceptually the wrong place to do it (no other heuristics code exists there).

I wasn't aware of ModuleInliner. Is the long term plan for it to replace the existing inliner? If so we could in future merge it with AlwaysInliner and if we interleave optimization as the current SCC manager does then this should fix the problem.

Member

There's no plan yet for the ModuleInliner; currently it lets us experiment with alternative traversals, and some of them have been showing promise.

I'm mainly trying to understand if:

  • does the order of traversal matter (for this problem here)?
  • do all the function simplification passes need to be run after some inlining, or just some of them? I'm guessing it's really "just a specific subset", correct?

aemerson (Contributor, Author) commented Jun 25, 2025

Yes, the traversal order matters here, because for optimal codegen we want mem2reg to happen between the inner->middle and middle->outer inlines. If you do it the other way around, mem2reg can't do anything until the final inner->outer inline, and by that point it's too late.

For now I think only this promotion is a known issue; I don't know of general issues with simplification.

Member

Ack, so that means that ModuleInliner running interleaved simplifications won't help (if the order isn't bottom-up traversal).

(picking on "Avoiding hoisting large vector allocas to the entry block") does this happen in vectorizable kinds of code? Asking (to learn) because in the experiments I mentioned, when we changed traversal order, we also (orthogonally) postponed function simplification to after all inlining was done, with no discernible performance effect, though that was for datacenter kinds of apps.

(also to learn/understand) by delaying promoting the allocas, you're kind of pre-spilling live ranges, right?

aemerson (Contributor, Author) commented Jun 25, 2025

(picking on "Avoiding hoisting large vector allocas to the entry block") this happens in vectorizable kind of code? Asking (to learn) because in the experiments I mentioned, when we changed traversal order, we also (orthogonally) postponed function simplification to after all inlining was done, with no discernable performance effect, but for datacenter kind of apps, though.

The motivating use case happens to be builtins code written with multiple large vectors (like <16 x float>) using the Arm SME extension. It's common for programmers to define these large vector values on the stack and then pass references to helper functions. It's these allocas that we're trying to ensure get promoted at the right time. You won't see this in general code like loop-vectorization since that all happens after inlining.
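
As a concrete illustration (hypothetical code, not from the PR), the pattern looks roughly like this using GNU vector extensions:

typedef float v16f32 __attribute__((vector_size(64)));

// An always_inline helper taking the accumulator by reference; until the
// call is inlined and the alloca promoted, acc lives in a stack slot.
__attribute__((always_inline)) static inline void
accumulate(v16f32 *acc, v16f32 x) {
  *acc += x;
}

void kernel(v16f32 *out, const v16f32 *in, int n) {
  v16f32 acc = {0};            // the stack_var analogue
  for (int i = 0; i < n; ++i)
    accumulate(&acc, in[i]);   // inline here, then promote acc early
  *out = acc;
}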

That said, this isn't solely restricted to builtins/vectors. We also see some negligible changes (I wouldn't go as far as "improvements") in WebKit which is a very heavy user of always_inline.

(also to learn/understand) by delaying promoting the allocas, you're kind of pre-spilling live ranges, right?

Not sure I understand the question. Here we're not really delaying the promotion, but eagerly promoting at the smallest scope possible. In general though yes, if you were to avoid promotion/mem2reg you'd effectively force spilling and hope that other memory optimizations clean things up (unlikely).

Created using spr 1.3.6
aemerson requested a review from Copilot, June 25, 2025 22:13
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR introduces an improved inlining algorithm that interleaves alloca promotion with inlining, addressing phase ordering issues that could otherwise extend alloca lifetimes unintentionally. Key changes include updated inline remark checks in test files, new tests for both leaf and non-leaf always-inline functions, and significant modifications in AlwaysInliner.cpp to interleave PromoteMemToReg calls during the inlining process.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

  • llvm/test/Transforms/Inline/always-inline-phase-ordering.ll: Updated expected inline remark ordering in CHECK directives.
  • llvm/test/Transforms/Inline/always-inline-alloca-promotion.ll: Added new test to ensure proper promotion of allocas after inlining leaf functions.
  • llvm/test/Transforms/Inline/always-inline-alloca-promotion-no-leaf.ll: Added test for non-leaf functions to verify the greedy inlining heuristic.
  • llvm/lib/Transforms/IPO/AlwaysInliner.cpp: Introduced interleaved alloca promotion via a new function, controlled by a command-line flag.

// that selects the function with the most incoming calls. The intuition is
// inlining this function will eliminate the most call sites and give the
// highest chance of creating new leaf functions.
if (Worklist.empty()) {
Copilot AI commented Jun 25, 2025

Consider checking whether 'BestFunc' is non-null before pushing it onto 'Worklist' to prevent a potential null pointer from being added in cases where no unprocessed always-inline functions are found.
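
A minimal sketch of the suggested guard over the loop shown above:

      // Only enqueue a candidate if the scan actually found one; every
      // remaining function may already have been processed.
      if (BestFunc)
        Worklist.push_back(BestFunc);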


aemerson (Contributor, Author)

Not bad! Good bot.

aemerson (Contributor, Author)

I just realized a bunch of clang tests are failing because they're running the inliner. Will fix and update this PR, along with the Copilot-reported bug.

Created using spr 1.3.6
llvmbot added labels: clang, backend:PowerPC, backend:X86, HLSL, clang:openmp (Jun 26, 2025)
aemerson (Contributor, Author) commented Jul 2, 2025

ping

nikic (Contributor) left a comment

Can you please pre-commit a PhaseOrdering test that demonstrates the problem you are trying to solve? It's hard to understand whether this is the correct solution to the problem without an actual test case.

aemerson (Contributor, Author) commented Jul 3, 2025

I managed to reduce the original SME test down to Transforms/PhaseOrdering/always-inline-alloca-promotion.ll. Compiling that to assembly with clang with and without the change shows the differences in codegen quality, while the IR shows the kind of scenario this patch is meant to handle.


github-actions bot commented Jul 3, 2025

⚠️ undef deprecator found issues in your code. ⚠️

You can test this locally with the following command:
git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' 'HEAD~1' HEAD llvm/test/Transforms/Inline/always-inline-alloca-promotion-no-leaf.ll llvm/test/Transforms/Inline/always-inline-alloca-promotion.ll clang/test/CodeGen/PowerPC/builtins-ppc-xl-xst.c clang/test/CodeGen/X86/avx512f-builtins.c clang/test/CodeGen/X86/avx512vl-builtins.c clang/test/CodeGen/X86/avx512vlbw-builtins.c clang/test/CodeGen/arm_acle.c clang/test/Headers/amdgcn_openmp_device_math.c clang/test/Headers/openmp_device_math_isnan.cpp clang/test/OpenMP/bug57757.cpp llvm/lib/Transforms/IPO/AlwaysInliner.cpp llvm/test/Transforms/Inline/always-inline-phase-ordering.ll llvm/test/Transforms/PhaseOrdering/always-inline-alloca-promotion.ll

The following files introduce new uses of undef:

  • llvm/test/Transforms/PhaseOrdering/always-inline-alloca-promotion.ll

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.
