
AlwaysInliner: A new inlining algorithm to interleave alloca promotion with inlines. #145613


Open · wants to merge 4 commits into base: users/aemerson/spr/main.alwaysinliner-a-new-inlining-algorithm-to-interleave-alloca-promotion-with-inlines

Conversation

aemerson (Contributor)

We use a different visitation order of functions here to solve a phase-ordering
problem. After inlining, a caller function may have allocas that were
previously used to pass reference arguments to the callee and that are now
promotable to registers via SROA/mem2reg. However, if we just let the
AlwaysInliner continue inlining everything at once, the later SROA pass in the
pipeline will place phis for these allocas in blocks along the dominance
frontier, which may extend further than desired (e.g. into loop headers). This
can happen when the caller is itself inlined into another caller and the
allocas end up hoisted further before SROA is run.

Instead, what we want to do, as best we can, is inline leaf functions into
their callers, and then run PromoteMemToReg() on the allocas that were passed
into the callee before it was inlined.

We want to do this before the caller is inlined into another caller, so that
the alloca promotion happens before the alloca's scope is extended too far by
further inlining.

Here's a simple pseudo-example:

outermost_caller() {
  for (...) {
    middle_caller();
  }
}

middle_caller() {
  int stack_var;
  inner_callee(&stack_var);
}

inner_callee(int *x) {
  // Do something with x.
}

In this case, we want to inline inner_callee() into middle_caller() and then
promote stack_var to a register before we inline middle_caller() into
outermost_caller(). The regular always_inliner would inline everything at
once, and then SROA/mem2reg would promote stack_var to a register, but in the
context of outermost_caller(), which is not what we want.
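
To make the failure mode concrete, here is a rough LLVM IR sketch (hypothetical, not taken from the patch) of the outcome being described: once everything has been inlined into outermost_caller(), a later SROA/mem2reg run promotes stack_var there and places a phi at the loop header:

define void @outermost_caller(i1 %cond) {
entry:
  br label %loop

loop:
  ; mem2reg places the phi for what used to be stack_var on the dominance
  ; frontier, i.e. the loop header, so its live range now spans the loop.
  %stack_var = phi i32 [ 0, %entry ], [ %v, %loop ]
  ; ... inlined bodies of middle_caller() and inner_callee() ...
  %v = add i32 %stack_var, 42
  br i1 %cond, label %loop, label %exit

exit:
  ret void
}

Promoting stack_var while it is still local to middle_caller() instead yields a plain SSA value in the loop body, with no loop-header phi.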

Created using spr 1.3.5
llvmbot (Member) commented Jun 24, 2025

@llvm/pr-subscribers-backend-x86
@llvm/pr-subscribers-backend-powerpc
@llvm/pr-subscribers-hlsl
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-llvm-transforms

Author: Amara Emerson (aemerson)

Full diff: https://github.com/llvm/llvm-project/pull/145613.diff

4 Files Affected:

  • (modified) llvm/lib/Transforms/IPO/AlwaysInliner.cpp (+265-2)
  • (added) llvm/test/Transforms/Inline/always-inline-alloca-promotion-no-leaf.ll (+46)
  • (added) llvm/test/Transforms/Inline/always-inline-alloca-promotion.ll (+35)
  • (modified) llvm/test/Transforms/Inline/always-inline-phase-ordering.ll (+2-2)
diff --git a/llvm/lib/Transforms/IPO/AlwaysInliner.cpp b/llvm/lib/Transforms/IPO/AlwaysInliner.cpp
index 8de1467ea47ec..8fab2985a0dca 100644
--- a/llvm/lib/Transforms/IPO/AlwaysInliner.cpp
+++ b/llvm/lib/Transforms/IPO/AlwaysInliner.cpp
@@ -12,21 +12,39 @@
 //===----------------------------------------------------------------------===//
 
 #include "llvm/Transforms/IPO/AlwaysInliner.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/MapVector.h"
 #include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/SmallPtrSet.h"
+#include "llvm/ADT/Statistic.h"
 #include "llvm/Analysis/AliasAnalysis.h"
 #include "llvm/Analysis/AssumptionCache.h"
+#include "llvm/Analysis/DominanceFrontier.h"
 #include "llvm/Analysis/InlineAdvisor.h"
 #include "llvm/Analysis/InlineCost.h"
 #include "llvm/Analysis/OptimizationRemarkEmitter.h"
 #include "llvm/Analysis/ProfileSummaryInfo.h"
+#include "llvm/Analysis/ValueTracking.h"
+#include "llvm/IR/BasicBlock.h"
+#include "llvm/IR/Dominators.h"
 #include "llvm/IR/Module.h"
+#include "llvm/IR/ValueHandle.h"
 #include "llvm/InitializePasses.h"
+#include "llvm/Support/CommandLine.h"
 #include "llvm/Transforms/Utils/Cloning.h"
 #include "llvm/Transforms/Utils/ModuleUtils.h"
+#include "llvm/Transforms/Utils/PromoteMemToReg.h"
+
 
 using namespace llvm;
 
 #define DEBUG_TYPE "inline"
+static cl::opt<bool> EnableMem2RegInterleaving(
+    "enable-always-inliner-mem2reg", cl::init(true), cl::Hidden,
+    cl::desc("Enable interleaving always-inlining with alloca promotion"));
+
+STATISTIC(NumAllocasPromoted,
+          "Number of allocas promoted to registers after inlining");
 
 namespace {
 
@@ -129,6 +147,245 @@ bool AlwaysInlineImpl(
   return Changed;
 }
 
+/// Promote allocas to registers if possible.
+static void promoteAllocas(
+    Function *Caller, SmallPtrSetImpl<AllocaInst *> &AllocasToPromote,
+    function_ref<AssumptionCache &(Function &)> &GetAssumptionCache) {
+  if (AllocasToPromote.empty())
+    return;
+
+  SmallVector<AllocaInst *, 4> PromotableAllocas;
+  llvm::copy_if(AllocasToPromote, std::back_inserter(PromotableAllocas),
+                isAllocaPromotable);
+  if (PromotableAllocas.empty())
+    return;
+
+  DominatorTree DT(*Caller);
+  AssumptionCache &AC = GetAssumptionCache(*Caller);
+  PromoteMemToReg(PromotableAllocas, DT, &AC);
+  NumAllocasPromoted += PromotableAllocas.size();
+  // Emit a remark for the promotion.
+  OptimizationRemarkEmitter ORE(Caller);
+  DebugLoc DLoc = Caller->getEntryBlock().getTerminator()->getDebugLoc();
+  ORE.emit([&]() {
+    return OptimizationRemark(DEBUG_TYPE, "PromoteAllocas", DLoc,
+                              &Caller->getEntryBlock())
+           << "Promoting " << ore::NV("NumAlloca", PromotableAllocas.size())
+           << " allocas to SSA registers in function '"
+           << ore::NV("Function", Caller) << "'";
+  });
+  LLVM_DEBUG(dbgs() << "Promoted " << PromotableAllocas.size()
+                    << " allocas to registers in function " << Caller->getName()
+                    << "\n");
+}
+
+/// We use a different visitation order of functions here to solve a phase
+/// ordering problem. After inlining, a caller function may have allocas that
+/// were previously used for passing reference arguments to the callee that
+/// are now promotable to registers, using SROA/mem2reg. However if we just let
+/// the AlwaysInliner continue inlining everything at once, the later SROA pass
+/// in the pipeline will end up placing phis for these allocas into blocks along
+/// the dominance frontier which may extend further than desired (e.g. loop
+/// headers). This can happen when the caller is then inlined into another
+/// caller, and the allocas end up hoisted further before SROA is run.
+///
+/// Instead, what we want to do, as best we can, is to inline leaf
+/// functions into callers, and then run PromoteMemToReg() on the allocas that
+/// were passed into the callee before it was inlined.
+///
+/// We want to do this *before* the caller is inlined into another caller
+/// because we want the alloca promotion to happen before its scope extends too
+/// far because of further inlining.
+///
+/// Here's a simple pseudo-example:
+/// outermost_caller() {
+///   for (...) {
+///     middle_caller();
+///   }
+/// }
+///
+/// middle_caller() {
+///   int stack_var;
+///   inner_callee(&stack_var);
+/// }
+///
+/// inner_callee(int *x) {
+///   // Do something with x.
+/// }
+///
+/// In this case, we want to inline inner_callee() into middle_caller() and
+/// then promote stack_var to a register before we inline middle_caller() into
+/// outermost_caller(). The regular always_inliner would inline everything at
+/// once, and then SROA/mem2reg would promote stack_var to a register but in
+/// the context of outermost_caller() which is not what we want.
+bool AlwaysInlineInterleavedMem2RegImpl(
+    Module &M, bool InsertLifetime, ProfileSummaryInfo &PSI,
+    FunctionAnalysisManager &FAM,
+    function_ref<AssumptionCache &(Function &)> GetAssumptionCache,
+    function_ref<AAResults &(Function &)> GetAAR) {
+
+  bool Changed = false;
+
+  // Use SetVector as we may rely on the deterministic iteration order for
+  // finding candidates later.
+  SetVector<Function *> AlwaysInlineFunctions;
+
+  MapVector<Function *, SmallVector<WeakVH>> CalleeToCallSites;
+  // Incoming always-inline calls for a function.
+  DenseMap<Function *, unsigned> IncomingAICount;
+  // Outgoing always-inline calls for a function.
+  DenseMap<Function *, unsigned> OutgoingAICount;
+  // First collect all always_inline functions.
+  for (Function &F : M) {
+    if (F.isDeclaration() || !F.hasFnAttribute(Attribute::AlwaysInline) ||
+        !isInlineViable(F).isSuccess())
+      continue;
+    if (F.isPresplitCoroutine())
+      continue;
+    AlwaysInlineFunctions.insert(&F);
+  }
+
+  DenseSet<Function *> ProcessedFunctions;
+  SmallVector<Function *> InlinedComdatFns;
+  // Build the call graph of always_inline functions.
+  for (Function *F : AlwaysInlineFunctions) {
+    for (User *U : F->users()) {
+      if (auto *CB = dyn_cast<CallBase>(U)) {
+        if (CB->getCalledFunction() != F || !canInlineCallBase(CB))
+          continue;
+        CalleeToCallSites[F].push_back(WeakVH(CB));
+        // Keep track of the number of incoming calls to this function.
+        // This is used to determine the order in which we inline functions.
+        IncomingAICount[F]++;
+        if (AlwaysInlineFunctions.count(CB->getCaller()))
+          OutgoingAICount[CB->getCaller()]++;
+      }
+    }
+  }
+
+  SmallVector<Function *, 16> Worklist;
+  for (Function *F : AlwaysInlineFunctions) {
+    // If this is an always_inline leaf function, we select it for inlining.
+    if (OutgoingAICount.lookup(F) == 0)
+      Worklist.push_back(F);
+  }
+
+  while (!Worklist.empty()) {
+    Function *Callee = Worklist.pop_back_val();
+    auto &Calls = CalleeToCallSites[Callee];
+
+    // Group the calls by their caller. This allows us to collect all allocas
+    // which need to be promoted together.
+    MapVector<Function *, SmallVector<WeakVH>> CallerToCalls;
+
+    for (WeakVH &WH : Calls)
+      if (auto *CB = dyn_cast_or_null<CallBase>(WH))
+        CallerToCalls[CB->getCaller()].push_back(WH);
+
+    // Now collect the allocas.
+    for (auto &CallerAndCalls : CallerToCalls) {
+      Function *Caller = CallerAndCalls.first;
+      SmallVector<WeakVH> &CallerCalls = CallerAndCalls.second;
+      SmallPtrSet<AllocaInst *, 4> AllocasToPromote;
+
+      for (WeakVH &WH : CallerCalls) {
+        if (auto *CB = dyn_cast_or_null<CallBase>(WH)) {
+          for (Value *Arg : CB->args())
+            if (auto *AI = dyn_cast<AllocaInst>(getUnderlyingObject(Arg)))
+              AllocasToPromote.insert(AI);
+        }
+      }
+
+      // Do the actual inlining.
+      bool InlinedAny = false;
+      SmallVector<WeakVH> SuccessfullyInlinedCalls;
+
+      for (WeakVH &WH : CallerCalls) {
+        if (auto *CB = dyn_cast_or_null<CallBase>(WH)) {
+          if (attemptInlineFunction(*Callee, CB, InsertLifetime, GetAAR,
+                                    GetAssumptionCache, PSI)) {
+            Changed = true;
+            InlinedAny = true;
+            SuccessfullyInlinedCalls.push_back(WH);
+          }
+        }
+      }
+
+      if (!InlinedAny)
+        continue;
+
+      // Promote any allocas that were used by the just-inlined call site.
+      promoteAllocas(Caller, AllocasToPromote, GetAssumptionCache);
+
+      unsigned InlinedCountForCaller = SuccessfullyInlinedCalls.size();
+      if (!AlwaysInlineFunctions.contains(Caller))
+        continue; // Caller wasn't part of our always-inline call graph.
+      unsigned OldOutgoing = OutgoingAICount[Caller];
+      assert(OldOutgoing >= InlinedCountForCaller &&
+             "Inlined more calls than we had outgoing calls!");
+      OutgoingAICount[Caller] = OldOutgoing - InlinedCountForCaller;
+      // If these were the last outgoing calls in the caller, we can now
+      // consider it a leaf function and add it to the worklist.
+      if (OutgoingAICount[Caller] == 0 && !ProcessedFunctions.count(Caller))
+        Worklist.push_back(Caller);
+    }
+
+    ProcessedFunctions.insert(Callee);
+    AlwaysInlineFunctions.remove(Callee);
+    CalleeToCallSites.erase(Callee);
+
+    Callee->removeDeadConstantUsers();
+    if (Callee->hasFnAttribute(Attribute::AlwaysInline) &&
+        Callee->isDefTriviallyDead()) {
+      if (Callee->hasComdat()) {
+        InlinedComdatFns.push_back(Callee);
+      } else {
+        M.getFunctionList().erase(Callee);
+        Changed = true;
+      }
+    }
+
+    if (AlwaysInlineFunctions.empty())
+      break;
+
+    // If we have no more leaf functions to inline, we use a greedy heuristic
+    // that selects the function with the most incoming calls. The intuition is
+    // inlining this function will eliminate the most call sites and give the
+    // highest chance of creating new leaf functions.
+    if (Worklist.empty()) {
+      Function *BestFunc = nullptr;
+      unsigned MaxIncoming = 0;
+      for (Function *F : AlwaysInlineFunctions) {
+        if (ProcessedFunctions.count(F))
+          continue;
+
+        unsigned CurrentIncoming = IncomingAICount.lookup(F);
+        if (!BestFunc || CurrentIncoming > MaxIncoming) {
+          BestFunc = F;
+          MaxIncoming = CurrentIncoming;
+        }
+      }
+      Worklist.push_back(BestFunc);
+    }
+  }
+
+  if (!InlinedComdatFns.empty()) {
+    filterDeadComdatFunctions(InlinedComdatFns);
+    for (Function *F : InlinedComdatFns) {
+      M.getFunctionList().erase(F);
+      Changed = true;
+    }
+  }
+
+  // We may have missed some call sites that were marked as always_inline but
+  // for which the callee function itself wasn't always_inline. Call the
+  // standard handler here to deal with those.
+  Changed |= AlwaysInlineImpl(M, InsertLifetime, PSI, &FAM, GetAssumptionCache,
+                              GetAAR);
+  return Changed;
+}
+
+
 struct AlwaysInlinerLegacyPass : public ModulePass {
   bool InsertLifetime;
 
@@ -191,8 +448,14 @@ PreservedAnalyses AlwaysInlinerPass::run(Module &M,
   };
   auto &PSI = MAM.getResult<ProfileSummaryAnalysis>(M);
 
-  bool Changed = AlwaysInlineImpl(M, InsertLifetime, PSI, &FAM,
-                                  GetAssumptionCache, GetAAR);
+  bool Changed = false;
+  if (EnableMem2RegInterleaving) {
+    Changed = AlwaysInlineInterleavedMem2RegImpl(M, InsertLifetime, PSI, FAM,
+                                                 GetAssumptionCache, GetAAR);
+  } else {
+    Changed = AlwaysInlineImpl(M, InsertLifetime, PSI, &FAM, GetAssumptionCache,
+                               GetAAR);
+  }
   if (!Changed)
     return PreservedAnalyses::all();
 
diff --git a/llvm/test/Transforms/Inline/always-inline-alloca-promotion-no-leaf.ll b/llvm/test/Transforms/Inline/always-inline-alloca-promotion-no-leaf.ll
new file mode 100644
index 0000000000000..5048c2868e691
--- /dev/null
+++ b/llvm/test/Transforms/Inline/always-inline-alloca-promotion-no-leaf.ll
@@ -0,0 +1,46 @@
+; RUN: opt -p always-inline -enable-always-inliner-mem2reg -S -pass-remarks=inline < %s 2>&1 | FileCheck %s
+
+; CHECK: remark: <unknown>:0:0: 'inner' inlined into 'middle' with (cost=always):
+; CHECK: remark: <unknown>:0:0: 'inner' inlined into 'middle' with (cost=always):
+; CHECK: remark: <unknown>:0:0: 'inner' inlined into 'middle' with (cost=always):
+; CHECK-NEXT: remark: <unknown>:0:0: Promoting 1 allocas to SSA registers in function 'middle'
+; CHECK-NEXT: remark: <unknown>:0:0: 'middle' inlined into 'outer' with (cost=always):
+
+; This test constructs a call graph with no always-inline leaf functions,
+; showing that the function with the most incoming always-inline calls
+; is inlined first.
+
+declare void @side(i32)
+
+define linkonce_odr void @inner(i32* %x) #0 {
+entry:
+  store i32 42, i32* %x
+  %v = load i32, i32* %x
+  call void @side(i32 %v)
+  call void @outer() ; not an always-inline leaf.
+  ret void
+}
+
+define linkonce_odr void @middle() #0 {
+entry:
+  %var = alloca i32
+  call void @inner(i32* %var)
+  call void @inner(i32* %var)
+  call void @inner(i32* %var)
+  ret void
+}
+
+define void @outer() {
+; CHECK-LABEL: outer()
+; CHECK:       call void @side(i32 42)
+; CHECK-NEXT:  call void @outer()
+; CHECK-NEXT:  call void @side(i32 42)
+; CHECK-NEXT:  call void @outer()
+; CHECK-NEXT:  call void @side(i32 42)
+; CHECK-NEXT:  call void @outer()
+entry:
+  call void @middle()
+  ret void
+}
+
+attributes #0 = { alwaysinline }
\ No newline at end of file
diff --git a/llvm/test/Transforms/Inline/always-inline-alloca-promotion.ll b/llvm/test/Transforms/Inline/always-inline-alloca-promotion.ll
new file mode 100644
index 0000000000000..d52ca012a1b59
--- /dev/null
+++ b/llvm/test/Transforms/Inline/always-inline-alloca-promotion.ll
@@ -0,0 +1,35 @@
+; RUN: opt -p always-inline -enable-always-inliner-mem2reg -S -pass-remarks=inline < %s 2>&1 | FileCheck %s
+; CHECK: remark: <unknown>:0:0: 'inner' inlined into 'middle' with (cost=always): always inline attribute
+; CHECK-NEXT: remark: <unknown>:0:0: Promoting 1 allocas to SSA registers in function 'middle'
+; CHECK-NEXT: remark: <unknown>:0:0: 'middle' inlined into 'outer' with (cost=always): always inline attribute
+
+; A simple example to ensure we inline leaf function @inner into @middle,
+; promote the alloca in @middle, then inline @middle into @outer.
+
+declare void @side(i32)
+
+define linkonce_odr void @inner(i32* %x) #0 {
+entry:
+  store i32 42, i32* %x
+  %v = load i32, i32* %x
+  call void @side(i32 %v)
+  ret void
+}
+
+define linkonce_odr void @middle() #0 {
+entry:
+  %var = alloca i32
+  call void @inner(i32* %var)
+  ret void
+}
+
+define void @outer() {
+; CHECK-LABEL: outer()
+; CHECK:  call void @side(i32 42)
+; CHECK:  ret void
+entry:
+  call void @middle()
+  ret void
+}
+
+attributes #0 = { alwaysinline }
\ No newline at end of file
diff --git a/llvm/test/Transforms/Inline/always-inline-phase-ordering.ll b/llvm/test/Transforms/Inline/always-inline-phase-ordering.ll
index defd1f4fd426b..319f0f806bbb3 100644
--- a/llvm/test/Transforms/Inline/always-inline-phase-ordering.ll
+++ b/llvm/test/Transforms/Inline/always-inline-phase-ordering.ll
@@ -2,10 +2,10 @@
 target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
 target triple = "arm64e-apple-macosx13"
 
+; CHECK: remark: <unknown>:0:0: 'wobble' inlined into 'snork' with (cost=always): always inline attribute
+; CHECK: remark: <unknown>:0:0: 'snork' inlined into 'blam' with (cost=always): always inline attribute
 ; CHECK: remark: <unknown>:0:0: 'wibble' inlined into 'bar.8' with (cost=always): always inline attribute
 ; CHECK: remark: <unknown>:0:0: 'wibble' inlined into 'pluto' with (cost=always): always inline attribute
-; CHECK: remark: <unknown>:0:0: 'snork' inlined into 'blam' with (cost=always): always inline attribute
-; CHECK: remark: <unknown>:0:0: 'wobble' inlined into 'blam' with (cost=always): always inline attribute
 ; CHECK: remark: <unknown>:0:0: 'spam' inlined into 'blam' with (cost=65, threshold=75)
 ; CHECK: remark: <unknown>:0:0: 'wibble.1' inlined into 'widget' with (cost=30, threshold=75)
 ; CHECK: remark: <unknown>:0:0: 'widget' inlined into 'bar.8' with (cost=30, threshold=75)


github-actions bot commented Jun 24, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

aemerson (Contributor, Author)

Depends on refactoring PR: #145614

aemerson (Contributor, Author)

There's another alternative to this approach: we could simply run the main inliner twice in the pipeline, along with the instcombine/simplify opts, with the first run handling only alwaysinline functions.

The constraint is that we must inline all always-inline call sites before we do any heuristic analysis in the main inliner. We can't do "intra-SCC" splitting of the always-inline and normal call sites as we used to do.
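
For reference, a hypothetical sketch (not part of this patch) of what that alternative could look like as a new-pass-manager pipeline, using the existing inline<only-mandatory> parameterization so the first inliner run only handles alwaysinline call sites:

opt -passes='cgscc(inline<only-mandatory>,function(sroa,instcombine)),cgscc(inline)' input.ll -S

Here the function(sroa,instcombine) adaptor runs after the mandatory inlining within each SCC, providing the simplification that round needs before the full inliner runs.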

/// then promote stack_var to a register before we inline middle_caller() into
/// outermost_caller(). The regular always_inliner would inline everything at
/// once, and then SROA/mem2reg would promote stack_var to a register but in
/// the context of outermost_caller() which is not what we want.
Member

Could you expand a bit more why this (promoting stack_var in outermost_caller) is bad?

aemerson (Contributor, Author) commented Jun 25, 2025

Sure. The problem is that mem2reg promotion has to place phi nodes for the value along the dominance frontier. This frontier is different depending on inlining order. For allocas, what you want is to insert phis when the size of the dominance frontier is as small as possible. The motivation is that allocas inside nested loops can "leak" phis beyond the innermost loop header, and that's bad for register pressure.

The main inliner already handles this because the pass manager interleaves optimizations with inlining, but the always-inliner doesn't have that capability.

Member

Thanks. We have been experimenting with other traversal orders (hence ModuleInliner.cpp) and this aspect is good to keep in mind. In that context, could the problem addressed here be decoupled from inlining order? It seems like it'd result in a more robust system.

(I'm not trying to scope-creep, rather want to understand what options we have, and that doesn't have to impact what we do right now)

aemerson (Contributor, Author)

In that context, could the problem addressed here be decoupled from inlining order? It seems like it'd result in a more robust system.

I don't think so, unless there's something I've missed. Before doing this I tried other approaches, such as:

  • Trying to detect these over-extended PHIs and then demoting them back to allocas. Didn't work as we end up pessimizing codegen.
  • Avoiding hoisting large vector allocas to the entry block, in order to block mem2reg. This works but is conceptually the wrong place to do it (no other heuristics code exists there).

I wasn't aware of ModuleInliner. Is the long term plan for it to replace the existing inliner? If so we could in future merge it with AlwaysInliner and if we interleave optimization as the current SCC manager does then this should fix the problem.

Member

There's no plan yet for the ModuleInliner; currently it lets us experiment with alternative traversals, and some of them have been showing promise.

I'm mainly trying to understand if:

  • does the order of traversal matter (for this problem here)?
  • do all the function simplification passes need to be run after some inlining, or just some of them? I'm guessing it's really "just a specific subset", correct?

aemerson (Contributor, Author) commented Jun 25, 2025

Yes, the traversal order matters here, because for optimal codegen we want mem2reg to happen between the inner->middle and middle->outer inlines. If you do it the other way around, mem2reg can't do anything until the final inner->outer inline, and by that point it's too late.

For now I think only this promotion is a known issue; I don't know of general issues with simplification.

Member

Ack, so that means that ModuleInliner running interleaved simplifications won't help (if the order isn't bottom-up traversal).

(picking on "Avoiding hoisting large vector allocas to the entry block") does this happen in vectorizable kinds of code? Asking (to learn) because in the experiments I mentioned, when we changed traversal order, we also (orthogonally) postponed function simplification to after all inlining was done, with no discernible performance effect, though that was for datacenter kinds of apps.

(also to learn/understand) by delaying promoting the allocas, you're kind of pre-spilling live ranges, right?

aemerson (Contributor, Author) commented Jun 25, 2025

(picking on "Avoiding hoisting large vector allocas to the entry block") this happens in vectorizable kind of code? Asking (to learn) because in the experiments I mentioned, when we changed traversal order, we also (orthogonally) postponed function simplification to after all inlining was done, with no discernable performance effect, but for datacenter kind of apps, though.

The motivating use case happens to be builtins code written with multiple large vectors (like <16 x float>) using the Arm SME extension. It's common for programmers to define these large vector values on the stack and then pass references to helper functions. It's these allocas that we're trying to ensure get promoted at the right time. You won't see this in general code like loop-vectorization since that all happens after inlining.
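
As a concrete illustration (hypothetical code, not from the PR), the pattern looks roughly like this using GNU vector extensions:

typedef float v16f32 __attribute__((vector_size(64)));

// An always_inline helper taking the accumulator by reference; until the
// call is inlined and the alloca promoted, acc lives in a stack slot.
__attribute__((always_inline)) static inline void
accumulate(v16f32 *acc, v16f32 x) {
  *acc += x;
}

void kernel(v16f32 *out, const v16f32 *in, int n) {
  v16f32 acc = {0};            // the stack_var analogue
  for (int i = 0; i < n; ++i)
    accumulate(&acc, in[i]);   // inline here, then promote acc early
  *out = acc;
}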

That said, this isn't solely restricted to builtins/vectors. We also see some negligible changes (I wouldn't go as far as "improvements") in WebKit which is a very heavy user of always_inline.

(also to learn/understand) by delaying promoting the allocas, you're kind of pre-spilling live ranges, right?

Not sure I understand the question. Here we're not really delaying the promotion, but eagerly promoting at the smallest scope possible. In general though yes, if you were to avoid promotion/mem2reg you'd effectively force spilling and hope that other memory optimizations clean things up (unlikely).

Created using spr 1.3.6
aemerson requested a review from Copilot, June 25, 2025 22:13
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR introduces an improved inlining algorithm that interleaves alloca promotion with inlining, addressing phase ordering issues that could otherwise extend alloca lifetimes unintentionally. Key changes include updated inline remark checks in test files, new tests for both leaf and non-leaf always-inline functions, and significant modifications in AlwaysInliner.cpp to interleave PromoteMemToReg calls during the inlining process.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

  • llvm/test/Transforms/Inline/always-inline-phase-ordering.ll: Updated expected inline remark ordering in CHECK directives.
  • llvm/test/Transforms/Inline/always-inline-alloca-promotion.ll: Added new test to ensure proper promotion of allocas after inlining leaf functions.
  • llvm/test/Transforms/Inline/always-inline-alloca-promotion-no-leaf.ll: Added test for non-leaf functions to verify the greedy inlining heuristic.
  • llvm/lib/Transforms/IPO/AlwaysInliner.cpp: Introduced interleaved alloca promotion via a new function, controlled by a command-line flag.

// that selects the function with the most incoming calls. The intuition is
// inlining this function will eliminate the most call sites and give the
// highest chance of creating new leaf functions.
if (Worklist.empty()) {
Copilot AI commented Jun 25, 2025

Consider checking whether 'BestFunc' is non-null before pushing it onto 'Worklist' to prevent a potential null pointer from being added in cases where no unprocessed always-inline functions are found.
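
A minimal sketch of the suggested guard over the loop shown above:

      // Only enqueue a candidate if the scan actually found one; every
      // remaining function may already have been processed.
      if (BestFunc)
        Worklist.push_back(BestFunc);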


aemerson (Contributor, Author)

Not bad! Good bot.

aemerson (Contributor, Author)

I just realized a bunch of clang tests are failing because they're running the inliner. Will fix and update this PR, along with the Copilot-reported bug.

Created using spr 1.3.6
llvmbot added labels: clang, backend:PowerPC, backend:X86, HLSL, clang:openmp (Jun 26, 2025)
aemerson (Contributor, Author) commented Jul 2, 2025

ping

nikic (Contributor) left a comment

Can you please pre-commit a PhaseOrdering test that demonstrates the problem you are trying to solve? It's hard to understand whether this is the correct solution to the problem without an actual test case.

aemerson (Contributor, Author) commented Jul 3, 2025

I managed to reduce the original SME test down to Transforms/PhaseOrdering/always-inline-alloca-promotion.ll. Compiling that to assembly with clang with and without the change shows the differences in codegen quality, while the IR shows the kind of scenario this patch is meant to handle.


github-actions bot commented Jul 3, 2025

⚠️ undef deprecator found issues in your code. ⚠️

You can test this locally with the following command:
git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' 'HEAD~1' HEAD llvm/test/Transforms/Inline/always-inline-alloca-promotion-no-leaf.ll llvm/test/Transforms/Inline/always-inline-alloca-promotion.ll clang/test/CodeGen/PowerPC/builtins-ppc-xl-xst.c clang/test/CodeGen/X86/avx512f-builtins.c clang/test/CodeGen/X86/avx512vl-builtins.c clang/test/CodeGen/X86/avx512vlbw-builtins.c clang/test/CodeGen/arm_acle.c clang/test/Headers/amdgcn_openmp_device_math.c clang/test/Headers/openmp_device_math_isnan.cpp clang/test/OpenMP/bug57757.cpp llvm/lib/Transforms/IPO/AlwaysInliner.cpp llvm/test/Transforms/Inline/always-inline-phase-ordering.ll llvm/test/Transforms/PhaseOrdering/always-inline-alloca-promotion.ll

The following files introduce new uses of undef:

  • llvm/test/Transforms/PhaseOrdering/always-inline-alloca-promotion.ll

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.
