[AArch64][SME] Support split ZPR and PPR area allocation #142392

Open
wants to merge 3 commits into base: users/MacDue/prepare_split

Conversation

MacDue
Member

@MacDue MacDue commented Jun 2, 2025

For a while we have supported the -aarch64-stack-hazard-size=<size> option, which adds "hazard padding" between GPRs and FPR/ZPRs. However, there is currently a hole in this mitigation as PPR and FPR/ZPR accesses to the same area also cause streaming memory hazards (this is noted by -pass-remarks-analysis=sme -aarch64-stack-hazard-remark-size=<val>), and the current stack layout places PPRs and ZPRs within the same area.

The current layout looks like this:

------------------------------------  Higher address
| callee-saved gpr registers        |
|---------------------------------- |
| lr,fp  (a.k.a. "frame record")    |
|-----------------------------------| <- fp(=x29)
|   <hazard padding>                |
|-----------------------------------|
| callee-saved fp/simd/SVE regs     |
|-----------------------------------|
|        SVE stack objects          |
|-----------------------------------|
| local variables of fixed size     |
|   <FPR>                           |
|   <hazard padding>                |
|   <GPR>                           |
------------------------------------| <- sp
                                    | Lower address

With this patch, the stack (and hazard padding) is rearranged so that hazard padding is placed between the PPRs and ZPRs rather than within the (fixed-size) callee-save region, which looks something like this:

------------------------------------  Higher address
| callee-saved gpr registers        |
|---------------------------------- |
| lr,fp  (a.k.a. "frame record")    |
|-----------------------------------| <- fp(=x29)
|        callee-saved PPRs          |
|        PPR stack objects          | (These are SVE predicates)
|-----------------------------------|
|   <hazard padding>                |
|-----------------------------------|
|       callee-saved ZPR regs       | (These are SVE vectors)
|        ZPR stack objects          | Note: FPRs are promoted to ZPRs
|-----------------------------------|
| local variables of fixed size     |
|   <FPR>                           |
|   <hazard padding>                |
|   <GPR>                           |
------------------------------------| <- sp
                                    | Lower address

This layout is only enabled if:

  • SplitSVEObjects is enabled (-aarch64-split-sve-objects); see the example invocation after this list
    • (This may be enabled by default in a later patch)
  • Streaming memory hazards are present
    • (-aarch64-stack-hazard-size=<val> != 0)
  • PPRs and FPRs/ZPRs are on the stack
  • There's no stack realignment or variable-sized objects
    • This is left as a TODO for now
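
As a rough illustration (not part of the patch itself; the flag names come from the description above, while the exact values and input file are placeholders), the new layout could be exercised with something like:

  llc -mtriple=aarch64 -mattr=+sve,+sme -aarch64-stack-hazard-size=1024 \
      -aarch64-split-sve-objects -pass-remarks-analysis=sme \
      -aarch64-stack-hazard-remark-size=64 test.ll -o -

on a function that spills both predicates and vectors. Without -aarch64-split-sve-objects, the same remark flags should still report PPR/ZPR hazards within the SVE area.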

Additionally, any FPR callee-saves that are present will be promoted to ZPRs. This is to prevent stack hazards between FPRs and GPRs in the fixed-size callee-save area (which would otherwise require more hazard padding, or moving the FPR callee-saves).
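For example (an illustration of the promotion, not spelled out above): a callee-saved d8 would presumably be spilled and reloaded as the full z8 using the SVE STR/LDR forms, so it ends up in the ZPR area beneath the hazard padding rather than in the fixed-size callee-save area next to the GPRs.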

This layout should resolve the hole in the hazard padding mitigation, and is not intended to change codegen for non-SME code.

@MacDue MacDue marked this pull request as ready for review June 2, 2025 14:00
@llvmbot
Member

llvmbot commented Jun 2, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Benjamin Maxwell (MacDue)

Patch is 172.93 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/142392.diff

8 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64FrameLowering.cpp (+391-133)
  • (modified) llvm/lib/Target/AArch64/AArch64FrameLowering.h (+2-1)
  • (modified) llvm/lib/Target/AArch64/AArch64MachineFunctionInfo.h (+7)
  • (added) llvm/test/CodeGen/AArch64/framelayout-split-sve.mir (+526)
  • (modified) llvm/test/CodeGen/AArch64/spill-fill-zpr-predicates.mir (+7-10)
  • (added) llvm/test/CodeGen/AArch64/split-sve-stack-frame-layout.ll (+751)
  • (modified) llvm/test/CodeGen/AArch64/stack-hazard.ll (+542-326)
  • (modified) llvm/test/CodeGen/AArch64/sve-stack-frame-layout.ll (-2)
diff --git a/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp b/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
index e5592a921e192..36775b8dc05e5 100644
--- a/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
@@ -270,6 +270,11 @@ static cl::opt<bool> OrderFrameObjects("aarch64-order-frame-objects",
                                        cl::desc("sort stack allocations"),
                                        cl::init(true), cl::Hidden);
 
+static cl::opt<bool>
+    SplitSVEObjects("aarch64-split-sve-objects",
+                    cl::desc("Split allocation of ZPR & PPR objects"),
+                    cl::init(false), cl::Hidden);
+
 cl::opt<bool> EnableHomogeneousPrologEpilog(
     "homogeneous-prolog-epilog", cl::Hidden,
     cl::desc("Emit homogeneous prologue and epilogue for the size "
@@ -458,10 +463,13 @@ static StackOffset getZPRStackSize(const MachineFunction &MF) {
   return StackOffset::getScalable(AFI->getStackSizeZPR());
 }
 
-/// Returns the size of the entire PPR stackframe (calleesaves + spills).
+/// Returns the size of the entire PPR stackframe (calleesaves + spills + hazard
+/// padding).
 static StackOffset getPPRStackSize(const MachineFunction &MF) {
   const AArch64FunctionInfo *AFI = MF.getInfo<AArch64FunctionInfo>();
-  return StackOffset::getScalable(AFI->getStackSizePPR());
+  return StackOffset::get(AFI->hasSplitSVEObjects() ? getStackHazardSize(MF)
+                                                    : 0,
+                          AFI->getStackSizePPR());
 }
 
 /// Returns the size of the entire SVE stackframe (PPRs + ZPRs).
@@ -484,6 +492,10 @@ bool AArch64FrameLowering::canUseRedZone(const MachineFunction &MF) const {
   if (!EnableRedZone)
     return false;
 
+  const AArch64FunctionInfo *AFI = MF.getInfo<AArch64FunctionInfo>();
+  if (AFI->hasSplitSVEObjects())
+    return false;
+
   // Don't use the red zone if the function explicitly asks us not to.
   // This is typically used for kernel code.
   const AArch64Subtarget &Subtarget = MF.getSubtarget<AArch64Subtarget>();
@@ -493,7 +505,6 @@ bool AArch64FrameLowering::canUseRedZone(const MachineFunction &MF) const {
     return false;
 
   const MachineFrameInfo &MFI = MF.getFrameInfo();
-  const AArch64FunctionInfo *AFI = MF.getInfo<AArch64FunctionInfo>();
   uint64_t NumBytes = AFI->getLocalStackSize();
 
   // If neither NEON or SVE are available, a COPY from one Q-reg to
@@ -666,9 +677,11 @@ void AArch64FrameLowering::emitCalleeSavedSVELocations(
   const TargetRegisterInfo &TRI = *STI.getRegisterInfo();
   AArch64FunctionInfo &AFI = *MF.getInfo<AArch64FunctionInfo>();
   CFIInstBuilder CFIBuilder(MBB, MBBI, MachineInstr::FrameSetup);
+  StackOffset PPRStackSize = getPPRStackSize(MF);
 
   for (const auto &Info : CSI) {
-    if (!MFI.isScalableStackID(Info.getFrameIdx()))
+    int FI = Info.getFrameIdx();
+    if (!MFI.isScalableStackID(FI))
       continue;
 
     // Not all unwinders may know about SVE registers, so assume the lowest
@@ -679,9 +692,13 @@ void AArch64FrameLowering::emitCalleeSavedSVELocations(
       continue;
 
     StackOffset Offset =
-        StackOffset::getScalable(MFI.getObjectOffset(Info.getFrameIdx())) -
+        StackOffset::getScalable(MFI.getObjectOffset(FI)) -
         StackOffset::getFixed(AFI.getCalleeSavedStackSize(MFI));
 
+    if (AFI.hasSplitSVEObjects() &&
+        MFI.getStackID(FI) == TargetStackID::ScalableVector)
+      Offset -= PPRStackSize;
+
     CFIBuilder.insertCFIInst(createCFAOffset(TRI, Reg, Offset));
   }
 }
@@ -2188,33 +2205,68 @@ void AArch64FrameLowering::emitPrologue(MachineFunction &MF,
       StackOffset::getFixed((int64_t)MFI.getStackSize() - NumBytes);
   StackOffset LocalsSize =
       PPRLocalsSize + ZPRLocalsSize + StackOffset::getFixed(NumBytes);
-  StackOffset SVECalleeSavesSize = PPRCalleeSavesSize + ZPRCalleeSavesSize;
-  MachineBasicBlock::iterator CalleeSavesBegin =
-      AFI->getPPRCalleeSavedStackSize() ? PPRCalleeSavesBegin
-                                        : ZPRCalleeSavesBegin;
-  allocateStackSpace(MBB, CalleeSavesBegin, 0, SVECalleeSavesSize, false,
-                     nullptr, EmitAsyncCFI && !HasFP, CFAOffset,
-                     MFI.hasVarSizedObjects() || LocalsSize);
-  CFAOffset += SVECalleeSavesSize;
-
-  MachineBasicBlock::iterator CalleeSavesEnd =
-      AFI->getZPRCalleeSavedStackSize() ? ZPRCalleeSavesEnd : PPRCalleeSavesEnd;
-  if (EmitAsyncCFI)
-    emitCalleeSavedSVELocations(MBB, CalleeSavesEnd);
-
-  // Allocate space for the rest of the frame including SVE locals. Align the
-  // stack as necessary.
-  assert(!(canUseRedZone(MF) && NeedsRealignment) &&
-         "Cannot use redzone with stack realignment");
-  if (!canUseRedZone(MF)) {
-    // FIXME: in the case of dynamic re-alignment, NumBytes doesn't have
-    // the correct value here, as NumBytes also includes padding bytes,
-    // which shouldn't be counted here.
-    StackOffset SVELocalsSize = PPRLocalsSize + ZPRLocalsSize;
-    allocateStackSpace(MBB, CalleeSavesEnd, RealignmentPadding,
-                       SVELocalsSize + StackOffset::getFixed(NumBytes),
-                       NeedsWinCFI, &HasWinCFI, EmitAsyncCFI && !HasFP,
-                       CFAOffset, MFI.hasVarSizedObjects());
+  if (!AFI->hasSplitSVEObjects()) {
+    StackOffset SVECalleeSavesSize = PPRCalleeSavesSize + ZPRCalleeSavesSize;
+    MachineBasicBlock::iterator CalleeSavesBegin =
+        AFI->getPPRCalleeSavedStackSize() ? PPRCalleeSavesBegin
+                                          : ZPRCalleeSavesBegin;
+    allocateStackSpace(MBB, CalleeSavesBegin, 0, SVECalleeSavesSize, false,
+                       nullptr, EmitAsyncCFI && !HasFP, CFAOffset,
+                       MFI.hasVarSizedObjects() || LocalsSize);
+    CFAOffset += SVECalleeSavesSize;
+
+    MachineBasicBlock::iterator CalleeSavesEnd =
+        AFI->getZPRCalleeSavedStackSize() ? ZPRCalleeSavesEnd
+                                          : PPRCalleeSavesEnd;
+    if (EmitAsyncCFI)
+      emitCalleeSavedSVELocations(MBB, CalleeSavesEnd);
+
+    // Allocate space for the rest of the frame including SVE locals. Align the
+    // stack as necessary.
+    assert(!(canUseRedZone(MF) && NeedsRealignment) &&
+           "Cannot use redzone with stack realignment");
+    if (!canUseRedZone(MF)) {
+      // FIXME: in the case of dynamic re-alignment, NumBytes doesn't have
+      // the correct value here, as NumBytes also includes padding bytes,
+      // which shouldn't be counted here.
+      StackOffset SVELocalsSize = PPRLocalsSize + ZPRLocalsSize;
+      allocateStackSpace(MBB, CalleeSavesEnd, RealignmentPadding,
+                         SVELocalsSize + StackOffset::getFixed(NumBytes),
+                         NeedsWinCFI, &HasWinCFI, EmitAsyncCFI && !HasFP,
+                         CFAOffset, MFI.hasVarSizedObjects());
+    }
+  } else {
+    assert(!canUseRedZone(MF) &&
+           "Cannot use redzone with aarch64-split-sve-objects");
+    // TODO: Handle HasWinCFI/NeedsWinCFI?
+    assert(!NeedsWinCFI &&
+           "WinCFI with aarch64-split-sve-objects is not supported");
+
+    // Split ZPR and PPR allocation.
+    // Allocate PPR callee saves
+    allocateStackSpace(MBB, PPRCalleeSavesBegin, 0, PPRCalleeSavesSize, false,
+                       nullptr, EmitAsyncCFI && !HasFP, CFAOffset,
+                       MFI.hasVarSizedObjects() || ZPRCalleeSavesSize ||
+                           ZPRLocalsSize || PPRLocalsSize);
+    CFAOffset += PPRCalleeSavesSize;
+
+    // Allocate PPR locals + ZPR callee saves
+    assert(PPRCalleeSavesEnd == ZPRCalleeSavesBegin &&
+           "Expected ZPR callee saves after PPR locals");
+    allocateStackSpace(MBB, PPRCalleeSavesEnd, RealignmentPadding,
+                       PPRLocalsSize + ZPRCalleeSavesSize, false, nullptr,
+                       EmitAsyncCFI && !HasFP, CFAOffset,
+                       MFI.hasVarSizedObjects() || ZPRLocalsSize);
+    CFAOffset += PPRLocalsSize + ZPRCalleeSavesSize;
+
+    // Allocate ZPR locals
+    allocateStackSpace(MBB, ZPRCalleeSavesEnd, RealignmentPadding,
+                       ZPRLocalsSize + StackOffset::getFixed(NumBytes), false,
+                       nullptr, EmitAsyncCFI && !HasFP, CFAOffset,
+                       MFI.hasVarSizedObjects());
+
+    if (EmitAsyncCFI)
+      emitCalleeSavedSVELocations(MBB, ZPRCalleeSavesEnd);
   }
 
   // If we need a base pointer, set it up here. It's whatever the value of the
@@ -2456,7 +2508,9 @@ void AArch64FrameLowering::emitEpilogue(MachineFunction &MF,
     }
   }
 
-  StackOffset SVEStackSize = getSVEStackSize(MF);
+  StackOffset ZPRStackSize = getZPRStackSize(MF);
+  StackOffset PPRStackSize = getPPRStackSize(MF);
+  StackOffset SVEStackSize = ZPRStackSize + PPRStackSize;
 
   // If there is a single SP update, insert it before the ret and we're done.
   if (CombineSPBump) {
@@ -2477,69 +2531,141 @@ void AArch64FrameLowering::emitEpilogue(MachineFunction &MF,
   NumBytes -= PrologueSaveSize;
   assert(NumBytes >= 0 && "Negative stack allocation size!?");
 
-  // Process the SVE callee-saves to determine what space needs to be
-  // deallocated.
-  StackOffset DeallocateBefore = {}, DeallocateAfter = SVEStackSize;
-  MachineBasicBlock::iterator RestoreBegin = LastPopI, RestoreEnd = LastPopI;
-  int64_t ZPRCalleeSavedSize = AFI->getZPRCalleeSavedStackSize();
-  int64_t PPRCalleeSavedSize = AFI->getPPRCalleeSavedStackSize();
-  int64_t SVECalleeSavedSize = ZPRCalleeSavedSize + PPRCalleeSavedSize;
-
-  if (SVECalleeSavedSize) {
-    RestoreBegin = std::prev(RestoreEnd);
-    while (RestoreBegin != MBB.begin() &&
-           IsSVECalleeSave(std::prev(RestoreBegin)))
-      --RestoreBegin;
-
-    assert(IsSVECalleeSave(RestoreBegin) &&
-           IsSVECalleeSave(std::prev(RestoreEnd)) && "Unexpected instruction");
-
-    StackOffset CalleeSavedSizeAsOffset =
-        StackOffset::getScalable(SVECalleeSavedSize);
-    DeallocateBefore = SVEStackSize - CalleeSavedSizeAsOffset;
-    DeallocateAfter = CalleeSavedSizeAsOffset;
-  }
-
-  // Deallocate the SVE area.
-  if (SVEStackSize) {
-    // If we have stack realignment or variable sized objects on the stack,
-    // restore the stack pointer from the frame pointer prior to SVE CSR
-    // restoration.
-    if (AFI->isStackRealigned() || MFI.hasVarSizedObjects()) {
-      if (SVECalleeSavedSize) {
-        // Set SP to start of SVE callee-save area from which they can
-        // be reloaded. The code below will deallocate the stack space
-        // space by moving FP -> SP.
-        emitFrameOffset(MBB, RestoreBegin, DL, AArch64::SP, AArch64::FP,
-                        StackOffset::getScalable(-SVECalleeSavedSize), TII,
-                        MachineInstr::FrameDestroy);
-      }
-    } else {
-      if (SVECalleeSavedSize) {
-        // Deallocate the non-SVE locals first before we can deallocate (and
-        // restore callee saves) from the SVE area.
-        emitFrameOffset(
-            MBB, RestoreBegin, DL, AArch64::SP, AArch64::SP,
-            StackOffset::getFixed(NumBytes), TII, MachineInstr::FrameDestroy,
-            false, false, nullptr, EmitCFI && !hasFP(MF),
-            SVEStackSize + StackOffset::getFixed(NumBytes + PrologueSaveSize));
-        NumBytes = 0;
+  if (!AFI->hasSplitSVEObjects()) {
+    // Process the SVE callee-saves to determine what space needs to be
+    // deallocated.
+    StackOffset DeallocateBefore = {}, DeallocateAfter = SVEStackSize;
+    MachineBasicBlock::iterator RestoreBegin = LastPopI, RestoreEnd = LastPopI;
+    int64_t ZPRCalleeSavedSize = AFI->getZPRCalleeSavedStackSize();
+    int64_t PPRCalleeSavedSize = AFI->getPPRCalleeSavedStackSize();
+    int64_t SVECalleeSavedSize = ZPRCalleeSavedSize + PPRCalleeSavedSize;
+
+    if (SVECalleeSavedSize) {
+      RestoreBegin = std::prev(RestoreEnd);
+      while (RestoreBegin != MBB.begin() &&
+             IsSVECalleeSave(std::prev(RestoreBegin)))
+        --RestoreBegin;
+
+      assert(IsSVECalleeSave(RestoreBegin) &&
+             IsSVECalleeSave(std::prev(RestoreEnd)) &&
+             "Unexpected instruction");
+
+      StackOffset CalleeSavedSizeAsOffset =
+          StackOffset::getScalable(SVECalleeSavedSize);
+      DeallocateBefore = SVEStackSize - CalleeSavedSizeAsOffset;
+      DeallocateAfter = CalleeSavedSizeAsOffset;
+    }
+
+    // Deallocate the SVE area.
+    if (SVEStackSize) {
+      // If we have stack realignment or variable sized objects on the stack,
+      // restore the stack pointer from the frame pointer prior to SVE CSR
+      // restoration.
+      if (AFI->isStackRealigned() || MFI.hasVarSizedObjects()) {
+        if (SVECalleeSavedSize) {
+          // Set SP to start of SVE callee-save area from which they can
+          // be reloaded. The code below will deallocate the stack space
+          // space by moving FP -> SP.
+          emitFrameOffset(MBB, RestoreBegin, DL, AArch64::SP, AArch64::FP,
+                          StackOffset::getScalable(-SVECalleeSavedSize), TII,
+                          MachineInstr::FrameDestroy);
+        }
+      } else {
+        if (SVECalleeSavedSize) {
+          // Deallocate the non-SVE locals first before we can deallocate (and
+          // restore callee saves) from the SVE area.
+          emitFrameOffset(MBB, RestoreBegin, DL, AArch64::SP, AArch64::SP,
+                          StackOffset::getFixed(NumBytes), TII,
+                          MachineInstr::FrameDestroy, false, false, nullptr,
+                          EmitCFI && !hasFP(MF),
+                          SVEStackSize + StackOffset::getFixed(
+                                             NumBytes + PrologueSaveSize));
+          NumBytes = 0;
+        }
+
+        emitFrameOffset(MBB, RestoreBegin, DL, AArch64::SP, AArch64::SP,
+                        DeallocateBefore, TII, MachineInstr::FrameDestroy,
+                        false, false, nullptr, EmitCFI && !hasFP(MF),
+                        SVEStackSize +
+                            StackOffset::getFixed(NumBytes + PrologueSaveSize));
+
+        emitFrameOffset(MBB, RestoreEnd, DL, AArch64::SP, AArch64::SP,
+                        DeallocateAfter, TII, MachineInstr::FrameDestroy, false,
+                        false, nullptr, EmitCFI && !hasFP(MF),
+                        DeallocateAfter +
+                            StackOffset::getFixed(NumBytes + PrologueSaveSize));
       }
+      if (EmitCFI)
+        emitCalleeSavedSVERestores(MBB, RestoreEnd);
+    }
+  } else if (SVEStackSize) {
+    assert(!AFI->isStackRealigned() && !MFI.hasVarSizedObjects() &&
+           "TODO: Support stack realigment / variable-sized objects");
+    // SplitSVEObjects. Determine the sizes and starts/ends of the ZPR and PPR
+    // areas.
+    auto ZPRCalleeSavedSize =
+        StackOffset::getScalable(AFI->getZPRCalleeSavedStackSize());
+    auto PPRCalleeSavedSize =
+        StackOffset::getScalable(AFI->getPPRCalleeSavedStackSize());
+    StackOffset PPRLocalsSize = PPRStackSize - PPRCalleeSavedSize;
+    StackOffset ZPRLocalsSize = ZPRStackSize - ZPRCalleeSavedSize;
+
+    MachineBasicBlock::iterator PPRRestoreBegin = LastPopI,
+                                PPRRestoreEnd = LastPopI;
+    if (PPRCalleeSavedSize) {
+      PPRRestoreBegin = std::prev(PPRRestoreEnd);
+      while (PPRRestoreBegin != MBB.begin() &&
+             IsPPRCalleeSave(std::prev(PPRRestoreBegin)))
+        --PPRRestoreBegin;
+    }
+
+    MachineBasicBlock::iterator ZPRRestoreBegin = PPRRestoreBegin,
+                                ZPRRestoreEnd = PPRRestoreBegin;
+    if (ZPRCalleeSavedSize) {
+      ZPRRestoreBegin = std::prev(ZPRRestoreEnd);
+      while (ZPRRestoreBegin != MBB.begin() &&
+             IsZPRCalleeSave(std::prev(ZPRRestoreBegin)))
+        --ZPRRestoreBegin;
+    }
+
+    auto CFAOffset =
+        SVEStackSize + StackOffset::getFixed(NumBytes + PrologueSaveSize);
+    if (PPRCalleeSavedSize || ZPRCalleeSavedSize) {
+      // Deallocate the non-SVE locals first before we can deallocate (and
+      // restore callee saves) from the SVE area.
+      auto NonSVELocals = StackOffset::getFixed(NumBytes);
+      emitFrameOffset(MBB, ZPRRestoreBegin, DL, AArch64::SP, AArch64::SP,
+                      NonSVELocals, TII, MachineInstr::FrameDestroy, false,
+                      false, nullptr, EmitCFI && !hasFP(MF), CFAOffset);
+      NumBytes = 0;
+      CFAOffset -= NonSVELocals;
+    }
+
+    if (ZPRLocalsSize) {
+      emitFrameOffset(MBB, ZPRRestoreBegin, DL, AArch64::SP, AArch64::SP,
+                      ZPRLocalsSize, TII, MachineInstr::FrameDestroy, false,
+                      false, nullptr, EmitCFI && !hasFP(MF), CFAOffset);
+      CFAOffset -= ZPRLocalsSize;
+    }
 
-      emitFrameOffset(MBB, RestoreBegin, DL, AArch64::SP, AArch64::SP,
-                      DeallocateBefore, TII, MachineInstr::FrameDestroy, false,
-                      false, nullptr, EmitCFI && !hasFP(MF),
-                      SVEStackSize +
-                          StackOffset::getFixed(NumBytes + PrologueSaveSize));
-
-      emitFrameOffset(MBB, RestoreEnd, DL, AArch64::SP, AArch64::SP,
-                      DeallocateAfter, TII, MachineInstr::FrameDestroy, false,
-                      false, nullptr, EmitCFI && !hasFP(MF),
-                      DeallocateAfter +
-                          StackOffset::getFixed(NumBytes + PrologueSaveSize));
+    if (PPRLocalsSize || ZPRCalleeSavedSize) {
+      assert(PPRRestoreBegin == ZPRRestoreEnd &&
+             "Expected PPR restores after ZPR");
+      emitFrameOffset(MBB, PPRRestoreBegin, DL, AArch64::SP, AArch64::SP,
+                      PPRLocalsSize + ZPRCalleeSavedSize, TII,
+                      MachineInstr::FrameDestroy, false, false, nullptr,
+                      EmitCFI && !hasFP(MF), CFAOffset);
+      CFAOffset -= PPRLocalsSize + ZPRCalleeSavedSize;
     }
+    if (PPRCalleeSavedSize) {
+      emitFrameOffset(MBB, PPRRestoreEnd, DL, AArch64::SP, AArch64::SP,
+                      PPRCalleeSavedSize, TII, MachineInstr::FrameDestroy,
+                      false, false, nullptr, EmitCFI && !hasFP(MF), CFAOffset);
+    }
+
+    // We only emit CFI information for ZPRs so emit CFI after the ZPR restores.
     if (EmitCFI)
-      emitCalleeSavedSVERestores(MBB, RestoreEnd);
+      emitCalleeSavedSVERestores(MBB, ZPRRestoreEnd);
   }
 
   if (!hasFP(MF)) {
@@ -2660,9 +2786,15 @@ AArch64FrameLowering::getFrameIndexReferenceFromSP(const MachineFunction &MF,
     return StackOffset::getFixed(ObjectOffset - getOffsetOfLocalArea());
 
   const auto *AFI = MF.getInfo<AArch64FunctionInfo>();
-  if (MFI.isScalableStackID(FI))
-    return StackOffset::get(-((int64_t)AFI->getCalleeSavedStackSize()),
+  if (MFI.isScalableStackID(FI)) {
+    StackOffset AccessOffset{};
+    if (AFI->hasSplitSVEObjects() &&
+        MFI.getStackID(FI) == TargetStackID::ScalableVector)
+      AccessOffset = -PPRStackSize;
+    return AccessOffset +
+           StackOffset::get(-((int64_t)AFI->getCalleeSavedStackSize()),
                             ObjectOffset);
+  }
 
   bool IsFixed = MFI.isFixedObjectIndex(FI);
   bool IsCSR =
@@ -2720,12 +2852,12 @@ StackOffset AArch64FrameLowering::resolveFrameIndexReference(
   bool isFixed = MFI.isFixedObjectIndex(FI);
   bool isSVE = MFI.isScalableStackID(FI);
   return resolveFrameOffsetReference(MF, ObjectOffset, isFixed, isSVE, FrameReg,
-                                     PreferFP, ForSimm);
+                                     PreferFP, ForSimm, FI);
 }
 
 StackOffset AArch64FrameLowering::resolveFrameOffsetReference(
     const MachineFunction &MF, int64_t ObjectOffset, bool isFixed, bool isSVE,
-    Register &FrameReg, bool PreferFP, bool ForSimm) const {
+    Register &FrameReg, bool PreferFP, bool ForSimm, int64_t FI) const {
   const auto &MFI = MF.getFrameInfo();
   const auto *RegInfo = static_cast<const AArch64RegisterInfo *>(
       MF.getSubtarget().getRegisterInfo());
@@ -2737,7 +2869,9 @@ StackOffset AArch64FrameLowering::resolveFrameOffsetReference(
   bool isCSR =
       !isFixed && ObjectOffset >= -((int)AFI->getCalleeSavedStackSize(MFI));
 
-  const StackOffset SVEStackSize = getSVEStackSize(MF);
+  StackOffset ZPRStackSize = getZPRStackSize(MF);
+  StackOffset PPRStackSize = getPPRStackSize(MF);
+  StackOffset SVEStackSize = ZPRStackSize + PPRStackSize;
 
   // Use frame pointer to reference fixed object...
[truncated]

@efriedma-quic
Collaborator

In the implementation you're interested in, is there not a hazard between PPRs and GPRs?

What's the interaction between this and aarch64-enable-zpr-predicate-spills?

@MacDue
Member Author

MacDue commented Jun 3, 2025

In the implementation you're interested in, is there not a hazard between PPRs and GPRs?

There are no hazards between PPRs and GPRs (those types of memory accesses can both be considered as occurring on the CPU).

What's the interaction between this and aarch64-enable-zpr-predicate-spills?

The two cannot be enabled at the same time. We intend to remove aarch64-enable-zpr-predicate-spills after this lands (it was a stopgap solution).

@MacDue MacDue force-pushed the users/MacDue/prepare_split branch from 0b09392 to 42af819 Compare June 3, 2025 15:15
@MacDue MacDue force-pushed the users/MacDue/split_pprs branch from 9052fbd to bd86d4a Compare June 3, 2025 15:15
@MacDue
Member Author

MacDue commented Jun 3, 2025

Rebased this PR stack on the changes from #138609... Which makes things even hairier 😅 It would be nice if all these modes were not so intertwined in the code.

I may look into whether there's some way to move some of the logic into some kind of SVEFrameLayout class with overridable hooks, to tidy some of this up.

Collaborator

@efriedma-quic efriedma-quic left a comment


If you express the size of the hazard padding between the PPRs and ZPRs as a scalable size, that might simplify some of the logic? You wouldn't need to represent the two areas as separate stacks, at least.

Maybe it's also worth considering "preallocating" some stack slots, along the lines of LocalStackSlotAllocation; anything we can allocate in an earlier pass doesn't have to be part of the core stack layout, and we probably get better code if we have a "base" for ZPR variables.

Do we care at all how stack protectors fit into all of this? I suspect the stack protector ends up in an unfortunate position by default.

HasFPRStackObjects |= SlotTypes[FI] == SlotType::ZPRorFPR;
// For SplitSVEObjects remember that this stack slot is a predicate, this
// will be needed later when determining the frame layout.
if (SlotTypes[FI] == SlotType::PPR) {
Collaborator


This doesn't look quite right. Anything with scalable size needs to be either ScalablePredVector or ScalableVector. Anything with non-scalable size needs to be on the normal stack, I think?

Member Author


Which bit does not look right? SlotType::PPR (not to be confused with SlotType::GPR) is only set if the original stack ID was scalable and all accesses to that slot used predicate load/stores. The original stack ID could be ScalableVector, as the earlier selection of the stack ID is only based on the type size. Therefore, we change it to ScalablePredVector here so that it can be sorted into the correct region.

Collaborator


Oh, I got confused by this bit:

        if (IsScalable || AArch64InstrInfo::isFpOrNEON(MI)) {
          SlotTypes[*FI] |= IsPPR ? SlotType::PPR : SlotType::ZPRorFPR;
        } else {
          SlotTypes[*FI] |= SlotType::GPR;
        }

Maybe rewrite it as the following, to split the cases a bit more cleanly?

if (IsScalable) {
  SlotTypes[*FI] |= isPPRAccess(MI) ? SlotType::PPR : SlotType::ZPRorFPR;
} else {
  SlotTypes[*FI] |= AArch64InstrInfo::isFpOrNEON(MI) ? SlotType::ZPRorFPR : SlotType::GPR;
}

Also, I'm not sure isFpOrNEON is quite sufficient here; we probably also want to handle SVE accesses to fixed-width slots.

Member Author


Maybe rewrite it as the following, to split the cases a bit more cleanly?

👍 I've tidied up this case.

Also, I'm not sure isFpOrNEON is quite sufficient here; we probably also want to handle SVE accesses to fixed-width slots.

Hmmm, I'll look into this; these checks are mostly the same as for the current hazard padding option.

@MacDue
Member Author

MacDue commented Jun 4, 2025

If you express the size of the hazard padding between the PPRs and ZPRs as a scalable size, that might simplify some of the logic? You wouldn't need to represent the two areas as separate stacks, at least.

It would, but for the sizes of hazard padding and vscale we're interested in, it would result in a much larger allocation than necessary, and would likely make addressing predicates and vectors harder (due to the limited ranges of scalable addressing offsets).
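To put rough numbers on it (illustrative, assuming -aarch64-stack-hazard-size=1024): scalable offsets scale with vscale, and vscale can be as small as 1, so guaranteeing at least 1024 bytes of separation would require 1024 scalable bytes of padding; at the architectural maximum of vscale = 16 that balloons to 16 KiB per frame, versus a fixed 1 KiB with the split layout proposed here.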

Maybe it's also worth considering "preallocating" some stack slots, along the lines of LocalStackSlotAllocation; anything we can allocate in an earlier pass doesn't have to be part of the core stack layout, and we probably get better code if we have a "base" for ZPR variables.

I'll take a look at LocalStackSlotAllocation; I've not worked on that pass before. From the name, though, I believe we'll still need to do something for the CSRs (which are in the same area). Having a base pointer for ZPRs is something we intend to add, as it would improve addressing (though I initially considered that as an extension for a later patch).

Do we care at all how stack protectors fit into all of this? I suspect the stack protector ends up in an unfortunate position by default.

I don't think it's that important. Since the stack protector can end up in the scalable area, it will result in a hazard if enabled (with this change or the previous hazard padding patch), but I don't think it's likely to be enabled for library kernels.

MacDue added 2 commits June 4, 2025 10:11
Change-Id: I30e2cf5ea7a1df932f145e685a3fbd39cd974d4d