[AArch64][SME] Support split ZPR and PPR area allocation #142392

Open
wants to merge 3 commits into base: users/MacDue/prepare_split

Conversation

MacDue
Member

@MacDue MacDue commented Jun 2, 2025

For a while we have supported the -aarch64-stack-hazard-size=<size> option, which adds "hazard padding" between GPRs and FPR/ZPRs. However, there is currently a hole in this mitigation as PPR and FPR/ZPR accesses to the same area also cause streaming memory hazards (this is noted by -pass-remarks-analysis=sme -aarch64-stack-hazard-remark-size=<val>), and the current stack layout places PPRs and ZPRs within the same area.

The current layout looks like this:

------------------------------------  Higher address
| callee-saved gpr registers        |
|---------------------------------- |
| lr,fp  (a.k.a. "frame record")    |
|-----------------------------------| <- fp(=x29)
|   <hazard padding>                |
|-----------------------------------|
| callee-saved fp/simd/SVE regs     |
|-----------------------------------|
|        SVE stack objects          |
|-----------------------------------|
| local variables of fixed size     |
|   <FPR>                           |
|   <hazard padding>                |
|   <GPR>                           |
------------------------------------| <- sp
                                    | Lower address

With this patch, the stack (and hazard padding) is rearranged so that hazard padding is placed between the PPRs and ZPRs rather than within the (fixed-size) callee-save region, which looks something like this:

------------------------------------  Higher address
| callee-saved gpr registers        |
|---------------------------------- |
| lr,fp  (a.k.a. "frame record")    |
|-----------------------------------| <- fp(=x29)
|        callee-saved PPRs          |
|        PPR stack objects          | (These are SVE predicates)
|-----------------------------------|
|   <hazard padding>                |
|-----------------------------------|
|       callee-saved ZPR regs       | (These are SVE vectors)
|        ZPR stack objects          | Note: FPRs are promoted to ZPRs
|-----------------------------------|
| local variables of fixed size     |
|   <FPR>                           |
|   <hazard padding>                |
|   <GPR>                           |
------------------------------------| <- sp
                                    | Lower address

This layout is only enabled if:

  • SplitSVEObjects is enabled (-aarch64-split-sve-objects); see the example invocation after this list
    • (This may be enabled by default in a later patch)
  • Streaming memory hazards are present
    • (-aarch64-stack-hazard-size=<val> != 0)
  • PPRs and FPRs/ZPRs are on the stack
  • There's no stack realignment or variable-sized objects
    • This is left as a TODO for now
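
As a rough illustration (not part of the patch itself; the flag names come from the description above, while the exact values and input file are placeholders), the new layout could be exercised with something like:

  llc -mtriple=aarch64 -mattr=+sve,+sme -aarch64-stack-hazard-size=1024 \
      -aarch64-split-sve-objects -pass-remarks-analysis=sme \
      -aarch64-stack-hazard-remark-size=64 test.ll -o -

on a function that spills both predicates and vectors. Without -aarch64-split-sve-objects, the same remark flags should still report PPR/ZPR hazards within the SVE area.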

Additionally, any FPR callee-saves that are present will be promoted to ZPRs. This is to prevent stack hazards between FPRs and GPRs in the fixed-size callee-save area (which would otherwise require more hazard padding, or moving the FPR callee-saves).
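For example (an illustration of the promotion, not spelled out above): a callee-saved d8 would presumably be spilled and reloaded as the full z8 using the SVE STR/LDR forms, so it ends up in the ZPR area beneath the hazard padding rather than in the fixed-size callee-save area next to the GPRs.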

This layout should resolve the hole in the hazard padding mitigation, and is not intended to change codegen for non-SME code.

@MacDue MacDue marked this pull request as ready for review June 2, 2025 14:00
@llvmbot
Member

llvmbot commented Jun 2, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Benjamin Maxwell (MacDue)

Patch is 172.93 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/142392.diff

8 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64FrameLowering.cpp (+391-133)
  • (modified) llvm/lib/Target/AArch64/AArch64FrameLowering.h (+2-1)
  • (modified) llvm/lib/Target/AArch64/AArch64MachineFunctionInfo.h (+7)
  • (added) llvm/test/CodeGen/AArch64/framelayout-split-sve.mir (+526)
  • (modified) llvm/test/CodeGen/AArch64/spill-fill-zpr-predicates.mir (+7-10)
  • (added) llvm/test/CodeGen/AArch64/split-sve-stack-frame-layout.ll (+751)
  • (modified) llvm/test/CodeGen/AArch64/stack-hazard.ll (+542-326)
  • (modified) llvm/test/CodeGen/AArch64/sve-stack-frame-layout.ll (-2)
diff --git a/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp b/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
index e5592a921e192..36775b8dc05e5 100644
--- a/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
@@ -270,6 +270,11 @@ static cl::opt<bool> OrderFrameObjects("aarch64-order-frame-objects",
                                        cl::desc("sort stack allocations"),
                                        cl::init(true), cl::Hidden);
 
+static cl::opt<bool>
+    SplitSVEObjects("aarch64-split-sve-objects",
+                    cl::desc("Split allocation of ZPR & PPR objects"),
+                    cl::init(false), cl::Hidden);
+
 cl::opt<bool> EnableHomogeneousPrologEpilog(
     "homogeneous-prolog-epilog", cl::Hidden,
     cl::desc("Emit homogeneous prologue and epilogue for the size "
@@ -458,10 +463,13 @@ static StackOffset getZPRStackSize(const MachineFunction &MF) {
   return StackOffset::getScalable(AFI->getStackSizeZPR());
 }
 
-/// Returns the size of the entire PPR stackframe (calleesaves + spills).
+/// Returns the size of the entire PPR stackframe (calleesaves + spills + hazard
+/// padding).
 static StackOffset getPPRStackSize(const MachineFunction &MF) {
   const AArch64FunctionInfo *AFI = MF.getInfo<AArch64FunctionInfo>();
-  return StackOffset::getScalable(AFI->getStackSizePPR());
+  return StackOffset::get(AFI->hasSplitSVEObjects() ? getStackHazardSize(MF)
+                                                    : 0,
+                          AFI->getStackSizePPR());
 }
 
 /// Returns the size of the entire SVE stackframe (PPRs + ZPRs).
@@ -484,6 +492,10 @@ bool AArch64FrameLowering::canUseRedZone(const MachineFunction &MF) const {
   if (!EnableRedZone)
     return false;
 
+  const AArch64FunctionInfo *AFI = MF.getInfo<AArch64FunctionInfo>();
+  if (AFI->hasSplitSVEObjects())
+    return false;
+
   // Don't use the red zone if the function explicitly asks us not to.
   // This is typically used for kernel code.
   const AArch64Subtarget &Subtarget = MF.getSubtarget<AArch64Subtarget>();
@@ -493,7 +505,6 @@ bool AArch64FrameLowering::canUseRedZone(const MachineFunction &MF) const {
     return false;
 
   const MachineFrameInfo &MFI = MF.getFrameInfo();
-  const AArch64FunctionInfo *AFI = MF.getInfo<AArch64FunctionInfo>();
   uint64_t NumBytes = AFI->getLocalStackSize();
 
   // If neither NEON or SVE are available, a COPY from one Q-reg to
@@ -666,9 +677,11 @@ void AArch64FrameLowering::emitCalleeSavedSVELocations(
   const TargetRegisterInfo &TRI = *STI.getRegisterInfo();
   AArch64FunctionInfo &AFI = *MF.getInfo<AArch64FunctionInfo>();
   CFIInstBuilder CFIBuilder(MBB, MBBI, MachineInstr::FrameSetup);
+  StackOffset PPRStackSize = getPPRStackSize(MF);
 
   for (const auto &Info : CSI) {
-    if (!MFI.isScalableStackID(Info.getFrameIdx()))
+    int FI = Info.getFrameIdx();
+    if (!MFI.isScalableStackID(FI))
       continue;
 
     // Not all unwinders may know about SVE registers, so assume the lowest
@@ -679,9 +692,13 @@ void AArch64FrameLowering::emitCalleeSavedSVELocations(
       continue;
 
     StackOffset Offset =
-        StackOffset::getScalable(MFI.getObjectOffset(Info.getFrameIdx())) -
+        StackOffset::getScalable(MFI.getObjectOffset(FI)) -
         StackOffset::getFixed(AFI.getCalleeSavedStackSize(MFI));
 
+    if (AFI.hasSplitSVEObjects() &&
+        MFI.getStackID(FI) == TargetStackID::ScalableVector)
+      Offset -= PPRStackSize;
+
     CFIBuilder.insertCFIInst(createCFAOffset(TRI, Reg, Offset));
   }
 }
@@ -2188,33 +2205,68 @@ void AArch64FrameLowering::emitPrologue(MachineFunction &MF,
       StackOffset::getFixed((int64_t)MFI.getStackSize() - NumBytes);
   StackOffset LocalsSize =
       PPRLocalsSize + ZPRLocalsSize + StackOffset::getFixed(NumBytes);
-  StackOffset SVECalleeSavesSize = PPRCalleeSavesSize + ZPRCalleeSavesSize;
-  MachineBasicBlock::iterator CalleeSavesBegin =
-      AFI->getPPRCalleeSavedStackSize() ? PPRCalleeSavesBegin
-                                        : ZPRCalleeSavesBegin;
-  allocateStackSpace(MBB, CalleeSavesBegin, 0, SVECalleeSavesSize, false,
-                     nullptr, EmitAsyncCFI && !HasFP, CFAOffset,
-                     MFI.hasVarSizedObjects() || LocalsSize);
-  CFAOffset += SVECalleeSavesSize;
-
-  MachineBasicBlock::iterator CalleeSavesEnd =
-      AFI->getZPRCalleeSavedStackSize() ? ZPRCalleeSavesEnd : PPRCalleeSavesEnd;
-  if (EmitAsyncCFI)
-    emitCalleeSavedSVELocations(MBB, CalleeSavesEnd);
-
-  // Allocate space for the rest of the frame including SVE locals. Align the
-  // stack as necessary.
-  assert(!(canUseRedZone(MF) && NeedsRealignment) &&
-         "Cannot use redzone with stack realignment");
-  if (!canUseRedZone(MF)) {
-    // FIXME: in the case of dynamic re-alignment, NumBytes doesn't have
-    // the correct value here, as NumBytes also includes padding bytes,
-    // which shouldn't be counted here.
-    StackOffset SVELocalsSize = PPRLocalsSize + ZPRLocalsSize;
-    allocateStackSpace(MBB, CalleeSavesEnd, RealignmentPadding,
-                       SVELocalsSize + StackOffset::getFixed(NumBytes),
-                       NeedsWinCFI, &HasWinCFI, EmitAsyncCFI && !HasFP,
-                       CFAOffset, MFI.hasVarSizedObjects());
+  if (!AFI->hasSplitSVEObjects()) {
+    StackOffset SVECalleeSavesSize = PPRCalleeSavesSize + ZPRCalleeSavesSize;
+    MachineBasicBlock::iterator CalleeSavesBegin =
+        AFI->getPPRCalleeSavedStackSize() ? PPRCalleeSavesBegin
+                                          : ZPRCalleeSavesBegin;
+    allocateStackSpace(MBB, CalleeSavesBegin, 0, SVECalleeSavesSize, false,
+                       nullptr, EmitAsyncCFI && !HasFP, CFAOffset,
+                       MFI.hasVarSizedObjects() || LocalsSize);
+    CFAOffset += SVECalleeSavesSize;
+
+    MachineBasicBlock::iterator CalleeSavesEnd =
+        AFI->getZPRCalleeSavedStackSize() ? ZPRCalleeSavesEnd
+                                          : PPRCalleeSavesEnd;
+    if (EmitAsyncCFI)
+      emitCalleeSavedSVELocations(MBB, CalleeSavesEnd);
+
+    // Allocate space for the rest of the frame including SVE locals. Align the
+    // stack as necessary.
+    assert(!(canUseRedZone(MF) && NeedsRealignment) &&
+           "Cannot use redzone with stack realignment");
+    if (!canUseRedZone(MF)) {
+      // FIXME: in the case of dynamic re-alignment, NumBytes doesn't have
+      // the correct value here, as NumBytes also includes padding bytes,
+      // which shouldn't be counted here.
+      StackOffset SVELocalsSize = PPRLocalsSize + ZPRLocalsSize;
+      allocateStackSpace(MBB, CalleeSavesEnd, RealignmentPadding,
+                         SVELocalsSize + StackOffset::getFixed(NumBytes),
+                         NeedsWinCFI, &HasWinCFI, EmitAsyncCFI && !HasFP,
+                         CFAOffset, MFI.hasVarSizedObjects());
+    }
+  } else {
+    assert(!canUseRedZone(MF) &&
+           "Cannot use redzone with aarch64-split-sve-objects");
+    // TODO: Handle HasWinCFI/NeedsWinCFI?
+    assert(!NeedsWinCFI &&
+           "WinCFI with aarch64-split-sve-objects is not supported");
+
+    // Split ZPR and PPR allocation.
+    // Allocate PPR callee saves
+    allocateStackSpace(MBB, PPRCalleeSavesBegin, 0, PPRCalleeSavesSize, false,
+                       nullptr, EmitAsyncCFI && !HasFP, CFAOffset,
+                       MFI.hasVarSizedObjects() || ZPRCalleeSavesSize ||
+                           ZPRLocalsSize || PPRLocalsSize);
+    CFAOffset += PPRCalleeSavesSize;
+
+    // Allocate PPR locals + ZPR callee saves
+    assert(PPRCalleeSavesEnd == ZPRCalleeSavesBegin &&
+           "Expected ZPR callee saves after PPR locals");
+    allocateStackSpace(MBB, PPRCalleeSavesEnd, RealignmentPadding,
+                       PPRLocalsSize + ZPRCalleeSavesSize, false, nullptr,
+                       EmitAsyncCFI && !HasFP, CFAOffset,
+                       MFI.hasVarSizedObjects() || ZPRLocalsSize);
+    CFAOffset += PPRLocalsSize + ZPRCalleeSavesSize;
+
+    // Allocate ZPR locals
+    allocateStackSpace(MBB, ZPRCalleeSavesEnd, RealignmentPadding,
+                       ZPRLocalsSize + StackOffset::getFixed(NumBytes), false,
+                       nullptr, EmitAsyncCFI && !HasFP, CFAOffset,
+                       MFI.hasVarSizedObjects());
+
+    if (EmitAsyncCFI)
+      emitCalleeSavedSVELocations(MBB, ZPRCalleeSavesEnd);
   }
 
   // If we need a base pointer, set it up here. It's whatever the value of the
@@ -2456,7 +2508,9 @@ void AArch64FrameLowering::emitEpilogue(MachineFunction &MF,
     }
   }
 
-  StackOffset SVEStackSize = getSVEStackSize(MF);
+  StackOffset ZPRStackSize = getZPRStackSize(MF);
+  StackOffset PPRStackSize = getPPRStackSize(MF);
+  StackOffset SVEStackSize = ZPRStackSize + PPRStackSize;
 
   // If there is a single SP update, insert it before the ret and we're done.
   if (CombineSPBump) {
@@ -2477,69 +2531,141 @@ void AArch64FrameLowering::emitEpilogue(MachineFunction &MF,
   NumBytes -= PrologueSaveSize;
   assert(NumBytes >= 0 && "Negative stack allocation size!?");
 
-  // Process the SVE callee-saves to determine what space needs to be
-  // deallocated.
-  StackOffset DeallocateBefore = {}, DeallocateAfter = SVEStackSize;
-  MachineBasicBlock::iterator RestoreBegin = LastPopI, RestoreEnd = LastPopI;
-  int64_t ZPRCalleeSavedSize = AFI->getZPRCalleeSavedStackSize();
-  int64_t PPRCalleeSavedSize = AFI->getPPRCalleeSavedStackSize();
-  int64_t SVECalleeSavedSize = ZPRCalleeSavedSize + PPRCalleeSavedSize;
-
-  if (SVECalleeSavedSize) {
-    RestoreBegin = std::prev(RestoreEnd);
-    while (RestoreBegin != MBB.begin() &&
-           IsSVECalleeSave(std::prev(RestoreBegin)))
-      --RestoreBegin;
-
-    assert(IsSVECalleeSave(RestoreBegin) &&
-           IsSVECalleeSave(std::prev(RestoreEnd)) && "Unexpected instruction");
-
-    StackOffset CalleeSavedSizeAsOffset =
-        StackOffset::getScalable(SVECalleeSavedSize);
-    DeallocateBefore = SVEStackSize - CalleeSavedSizeAsOffset;
-    DeallocateAfter = CalleeSavedSizeAsOffset;
-  }
-
-  // Deallocate the SVE area.
-  if (SVEStackSize) {
-    // If we have stack realignment or variable sized objects on the stack,
-    // restore the stack pointer from the frame pointer prior to SVE CSR
-    // restoration.
-    if (AFI->isStackRealigned() || MFI.hasVarSizedObjects()) {
-      if (SVECalleeSavedSize) {
-        // Set SP to start of SVE callee-save area from which they can
-        // be reloaded. The code below will deallocate the stack space
-        // space by moving FP -> SP.
-        emitFrameOffset(MBB, RestoreBegin, DL, AArch64::SP, AArch64::FP,
-                        StackOffset::getScalable(-SVECalleeSavedSize), TII,
-                        MachineInstr::FrameDestroy);
-      }
-    } else {
-      if (SVECalleeSavedSize) {
-        // Deallocate the non-SVE locals first before we can deallocate (and
-        // restore callee saves) from the SVE area.
-        emitFrameOffset(
-            MBB, RestoreBegin, DL, AArch64::SP, AArch64::SP,
-            StackOffset::getFixed(NumBytes), TII, MachineInstr::FrameDestroy,
-            false, false, nullptr, EmitCFI && !hasFP(MF),
-            SVEStackSize + StackOffset::getFixed(NumBytes + PrologueSaveSize));
-        NumBytes = 0;
+  if (!AFI->hasSplitSVEObjects()) {
+    // Process the SVE callee-saves to determine what space needs to be
+    // deallocated.
+    StackOffset DeallocateBefore = {}, DeallocateAfter = SVEStackSize;
+    MachineBasicBlock::iterator RestoreBegin = LastPopI, RestoreEnd = LastPopI;
+    int64_t ZPRCalleeSavedSize = AFI->getZPRCalleeSavedStackSize();
+    int64_t PPRCalleeSavedSize = AFI->getPPRCalleeSavedStackSize();
+    int64_t SVECalleeSavedSize = ZPRCalleeSavedSize + PPRCalleeSavedSize;
+
+    if (SVECalleeSavedSize) {
+      RestoreBegin = std::prev(RestoreEnd);
+      while (RestoreBegin != MBB.begin() &&
+             IsSVECalleeSave(std::prev(RestoreBegin)))
+        --RestoreBegin;
+
+      assert(IsSVECalleeSave(RestoreBegin) &&
+             IsSVECalleeSave(std::prev(RestoreEnd)) &&
+             "Unexpected instruction");
+
+      StackOffset CalleeSavedSizeAsOffset =
+          StackOffset::getScalable(SVECalleeSavedSize);
+      DeallocateBefore = SVEStackSize - CalleeSavedSizeAsOffset;
+      DeallocateAfter = CalleeSavedSizeAsOffset;
+    }
+
+    // Deallocate the SVE area.
+    if (SVEStackSize) {
+      // If we have stack realignment or variable sized objects on the stack,
+      // restore the stack pointer from the frame pointer prior to SVE CSR
+      // restoration.
+      if (AFI->isStackRealigned() || MFI.hasVarSizedObjects()) {
+        if (SVECalleeSavedSize) {
+          // Set SP to start of SVE callee-save area from which they can
+          // be reloaded. The code below will deallocate the stack space
+          // space by moving FP -> SP.
+          emitFrameOffset(MBB, RestoreBegin, DL, AArch64::SP, AArch64::FP,
+                          StackOffset::getScalable(-SVECalleeSavedSize), TII,
+                          MachineInstr::FrameDestroy);
+        }
+      } else {
+        if (SVECalleeSavedSize) {
+          // Deallocate the non-SVE locals first before we can deallocate (and
+          // restore callee saves) from the SVE area.
+          emitFrameOffset(MBB, RestoreBegin, DL, AArch64::SP, AArch64::SP,
+                          StackOffset::getFixed(NumBytes), TII,
+                          MachineInstr::FrameDestroy, false, false, nullptr,
+                          EmitCFI && !hasFP(MF),
+                          SVEStackSize + StackOffset::getFixed(
+                                             NumBytes + PrologueSaveSize));
+          NumBytes = 0;
+        }
+
+        emitFrameOffset(MBB, RestoreBegin, DL, AArch64::SP, AArch64::SP,
+                        DeallocateBefore, TII, MachineInstr::FrameDestroy,
+                        false, false, nullptr, EmitCFI && !hasFP(MF),
+                        SVEStackSize +
+                            StackOffset::getFixed(NumBytes + PrologueSaveSize));
+
+        emitFrameOffset(MBB, RestoreEnd, DL, AArch64::SP, AArch64::SP,
+                        DeallocateAfter, TII, MachineInstr::FrameDestroy, false,
+                        false, nullptr, EmitCFI && !hasFP(MF),
+                        DeallocateAfter +
+                            StackOffset::getFixed(NumBytes + PrologueSaveSize));
       }
+      if (EmitCFI)
+        emitCalleeSavedSVERestores(MBB, RestoreEnd);
+    }
+  } else if (SVEStackSize) {
+    assert(!AFI->isStackRealigned() && !MFI.hasVarSizedObjects() &&
+           "TODO: Support stack realigment / variable-sized objects");
+    // SplitSVEObjects. Determine the sizes and starts/ends of the ZPR and PPR
+    // areas.
+    auto ZPRCalleeSavedSize =
+        StackOffset::getScalable(AFI->getZPRCalleeSavedStackSize());
+    auto PPRCalleeSavedSize =
+        StackOffset::getScalable(AFI->getPPRCalleeSavedStackSize());
+    StackOffset PPRLocalsSize = PPRStackSize - PPRCalleeSavedSize;
+    StackOffset ZPRLocalsSize = ZPRStackSize - ZPRCalleeSavedSize;
+
+    MachineBasicBlock::iterator PPRRestoreBegin = LastPopI,
+                                PPRRestoreEnd = LastPopI;
+    if (PPRCalleeSavedSize) {
+      PPRRestoreBegin = std::prev(PPRRestoreEnd);
+      while (PPRRestoreBegin != MBB.begin() &&
+             IsPPRCalleeSave(std::prev(PPRRestoreBegin)))
+        --PPRRestoreBegin;
+    }
+
+    MachineBasicBlock::iterator ZPRRestoreBegin = PPRRestoreBegin,
+                                ZPRRestoreEnd = PPRRestoreBegin;
+    if (ZPRCalleeSavedSize) {
+      ZPRRestoreBegin = std::prev(ZPRRestoreEnd);
+      while (ZPRRestoreBegin != MBB.begin() &&
+             IsZPRCalleeSave(std::prev(ZPRRestoreBegin)))
+        --ZPRRestoreBegin;
+    }
+
+    auto CFAOffset =
+        SVEStackSize + StackOffset::getFixed(NumBytes + PrologueSaveSize);
+    if (PPRCalleeSavedSize || ZPRCalleeSavedSize) {
+      // Deallocate the non-SVE locals first before we can deallocate (and
+      // restore callee saves) from the SVE area.
+      auto NonSVELocals = StackOffset::getFixed(NumBytes);
+      emitFrameOffset(MBB, ZPRRestoreBegin, DL, AArch64::SP, AArch64::SP,
+                      NonSVELocals, TII, MachineInstr::FrameDestroy, false,
+                      false, nullptr, EmitCFI && !hasFP(MF), CFAOffset);
+      NumBytes = 0;
+      CFAOffset -= NonSVELocals;
+    }
+
+    if (ZPRLocalsSize) {
+      emitFrameOffset(MBB, ZPRRestoreBegin, DL, AArch64::SP, AArch64::SP,
+                      ZPRLocalsSize, TII, MachineInstr::FrameDestroy, false,
+                      false, nullptr, EmitCFI && !hasFP(MF), CFAOffset);
+      CFAOffset -= ZPRLocalsSize;
+    }
 
-      emitFrameOffset(MBB, RestoreBegin, DL, AArch64::SP, AArch64::SP,
-                      DeallocateBefore, TII, MachineInstr::FrameDestroy, false,
-                      false, nullptr, EmitCFI && !hasFP(MF),
-                      SVEStackSize +
-                          StackOffset::getFixed(NumBytes + PrologueSaveSize));
-
-      emitFrameOffset(MBB, RestoreEnd, DL, AArch64::SP, AArch64::SP,
-                      DeallocateAfter, TII, MachineInstr::FrameDestroy, false,
-                      false, nullptr, EmitCFI && !hasFP(MF),
-                      DeallocateAfter +
-                          StackOffset::getFixed(NumBytes + PrologueSaveSize));
+    if (PPRLocalsSize || ZPRCalleeSavedSize) {
+      assert(PPRRestoreBegin == ZPRRestoreEnd &&
+             "Expected PPR restores after ZPR");
+      emitFrameOffset(MBB, PPRRestoreBegin, DL, AArch64::SP, AArch64::SP,
+                      PPRLocalsSize + ZPRCalleeSavedSize, TII,
+                      MachineInstr::FrameDestroy, false, false, nullptr,
+                      EmitCFI && !hasFP(MF), CFAOffset);
+      CFAOffset -= PPRLocalsSize + ZPRCalleeSavedSize;
     }
+    if (PPRCalleeSavedSize) {
+      emitFrameOffset(MBB, PPRRestoreEnd, DL, AArch64::SP, AArch64::SP,
+                      PPRCalleeSavedSize, TII, MachineInstr::FrameDestroy,
+                      false, false, nullptr, EmitCFI && !hasFP(MF), CFAOffset);
+    }
+
+    // We only emit CFI information for ZPRs so emit CFI after the ZPR restores.
     if (EmitCFI)
-      emitCalleeSavedSVERestores(MBB, RestoreEnd);
+      emitCalleeSavedSVERestores(MBB, ZPRRestoreEnd);
   }
 
   if (!hasFP(MF)) {
@@ -2660,9 +2786,15 @@ AArch64FrameLowering::getFrameIndexReferenceFromSP(const MachineFunction &MF,
     return StackOffset::getFixed(ObjectOffset - getOffsetOfLocalArea());
 
   const auto *AFI = MF.getInfo<AArch64FunctionInfo>();
-  if (MFI.isScalableStackID(FI))
-    return StackOffset::get(-((int64_t)AFI->getCalleeSavedStackSize()),
+  if (MFI.isScalableStackID(FI)) {
+    StackOffset AccessOffset{};
+    if (AFI->hasSplitSVEObjects() &&
+        MFI.getStackID(FI) == TargetStackID::ScalableVector)
+      AccessOffset = -PPRStackSize;
+    return AccessOffset +
+           StackOffset::get(-((int64_t)AFI->getCalleeSavedStackSize()),
                             ObjectOffset);
+  }
 
   bool IsFixed = MFI.isFixedObjectIndex(FI);
   bool IsCSR =
@@ -2720,12 +2852,12 @@ StackOffset AArch64FrameLowering::resolveFrameIndexReference(
   bool isFixed = MFI.isFixedObjectIndex(FI);
   bool isSVE = MFI.isScalableStackID(FI);
   return resolveFrameOffsetReference(MF, ObjectOffset, isFixed, isSVE, FrameReg,
-                                     PreferFP, ForSimm);
+                                     PreferFP, ForSimm, FI);
 }
 
 StackOffset AArch64FrameLowering::resolveFrameOffsetReference(
     const MachineFunction &MF, int64_t ObjectOffset, bool isFixed, bool isSVE,
-    Register &FrameReg, bool PreferFP, bool ForSimm) const {
+    Register &FrameReg, bool PreferFP, bool ForSimm, int64_t FI) const {
   const auto &MFI = MF.getFrameInfo();
   const auto *RegInfo = static_cast<const AArch64RegisterInfo *>(
       MF.getSubtarget().getRegisterInfo());
@@ -2737,7 +2869,9 @@ StackOffset AArch64FrameLowering::resolveFrameOffsetReference(
   bool isCSR =
       !isFixed && ObjectOffset >= -((int)AFI->getCalleeSavedStackSize(MFI));
 
-  const StackOffset SVEStackSize = getSVEStackSize(MF);
+  StackOffset ZPRStackSize = getZPRStackSize(MF);
+  StackOffset PPRStackSize = getPPRStackSize(MF);
+  StackOffset SVEStackSize = ZPRStackSize + PPRStackSize;
 
   // Use frame pointer to reference fixed object...
[truncated]

@efriedma-quic
Collaborator

In the implementation you're interested in, is there not a hazard between PPRs and GPRs?

What's the interaction between this and aarch64-enable-zpr-predicate-spills?

@MacDue
Member Author

MacDue commented Jun 3, 2025

In the implementation you're interested in, is there not a hazard between PPRs and GPRs?

There are no hazards between PPRs and GPRs (those types of memory accesses can both be considered as occurring on the CPU).

What's the interaction between this and aarch64-enable-zpr-predicate-spills?

The two cannot be enabled at the same time. We intend to remove aarch64-enable-zpr-predicate-spills after this lands (it was a stopgap solution).

@MacDue MacDue force-pushed the users/MacDue/prepare_split branch from 0b09392 to 42af819 Compare June 3, 2025 15:15
@MacDue MacDue force-pushed the users/MacDue/split_pprs branch from 9052fbd to bd86d4a Compare June 3, 2025 15:15
@MacDue
Member Author

MacDue commented Jun 3, 2025

Rebased this PR stack on the changes from #138609... Which makes things even hairier 😅 It would be nice if all these modes were not so intertwined in the code.

I may look into whether there's some way to move some of the logic into some kind of SVEFrameLayout class with overridable hooks, to tidy some of this up.

Collaborator

@efriedma-quic efriedma-quic left a comment


If you express the size of the hazard padding between the PPRs and ZPRs as a scalable size, that might simplify some of the logic? You wouldn't need to represent the two areas as separate stacks, at least.

Maybe it's also worth considering "preallocating" some stack slots, along the lines of LocalStackSlotAllocation; anything we can allocate in an earlier pass doesn't have to be part of the core stack layout, and we probably get better code if we have a "base" for ZPR variables.

Do we care at all how stack protectors fit into all of this? I suspect the stack protector ends up in an unfortunate position by default.

HasFPRStackObjects |= SlotTypes[FI] == SlotType::ZPRorFPR;
// For SplitSVEObjects remember that this stack slot is a predicate, this
// will be needed later when determining the frame layout.
if (SlotTypes[FI] == SlotType::PPR) {
Collaborator


This doesn't look quite right. Anything with scalable size needs to be either ScalablePredVector or ScalableVector. Anything with non-scalable size needs to be on the normal stack, I think?

Member Author


Which bit does not look right? SlotType::PPR (not to be confused with SlotType::GPR) is only set if the original stack ID was scalable and all accesses to that slot used predicate load/stores. The original stack ID could be ScalableVector, as the earlier selection of the stack ID is only based on the type size. Therefore, we change it to ScalablePredVector here so that it can be sorted into the correct region.

Collaborator


Oh, I got confused by this bit:

        if (IsScalable || AArch64InstrInfo::isFpOrNEON(MI)) {
          SlotTypes[*FI] |= IsPPR ? SlotType::PPR : SlotType::ZPRorFPR;
        } else {
          SlotTypes[*FI] |= SlotType::GPR;
        }

Maybe rewrite it as the following, to split the cases a bit more cleanly?

if (IsScalable) {
  SlotTypes[*FI] |= isPPRAccess(MI) ? SlotType::PPR : SlotType::ZPRorFPR;
} else {
  SlotTypes[*FI] |= AArch64InstrInfo::isFpOrNEON(MI) ? SlotType::ZPRorFPR : SlotType::GPR;
}

Also, I'm not sure isFpOrNEON is quite sufficient here; we probably also want to handle SVE accesses to fixed-width slots.

Member Author


Maybe rewrite it as the following, to split the cases a bit more cleanly?

👍 I've tidied up this case.

Also, I'm not sure isFpOrNEON is quite sufficient here; we probably also want to handle SVE accesses to fixed-width slots.

Hmmm, I'll look into this; these checks are mostly the same as for the current hazard padding option.

@MacDue
Member Author

MacDue commented Jun 4, 2025

If you express the size of the hazard padding between the PPRs and ZPRs as a scalable size, that might simplify some of the logic? You wouldn't need to represent the two areas as separate stacks, at least.

It would, but for the sizes of hazard padding and vscale we're interested in, it would result in a much larger allocation than necessary, and would likely make addressing predicates and vectors harder (due to the limited ranges of scalable addressing offsets).
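To put rough numbers on it (illustrative, assuming -aarch64-stack-hazard-size=1024): scalable offsets scale with vscale, and vscale can be as small as 1, so guaranteeing at least 1024 bytes of separation would require 1024 scalable bytes of padding; at the architectural maximum of vscale = 16 that balloons to 16 KiB per frame, versus a fixed 1 KiB with the split layout proposed here.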

Maybe it's also worth considering "preallocating" some stack slots, along the lines of LocalStackSlotAllocation; anything we can allocate in an earlier pass doesn't have to be part of the core stack layout, and we probably get better code if we have a "base" for ZPR variables.

I'll take a look at LocalStackSlotAllocation; I've not worked on that pass before. From the name, though, I believe we'll still need to do something for the CSRs (which are in the same area). Having a base pointer for ZPRs is something we intend to add, as it would improve addressing (though I initially considered that as an extension for a later patch).

Do we care at all how stack protectors fit into all of this? I suspect the stack protector ends up in an unfortunate position by default.

I don't think it's that important. Since the stack protector can end up in the scalable area, it will result in a hazard if enabled (with this change or the previous hazard padding patch), but I don't think it's likely to be enabled for library kernels.

MacDue added 2 commits June 4, 2025 10:11
Change-Id: I30e2cf5ea7a1df932f145e685a3fbd39cd974d4d