
[AArch64] Add streaming-mode stack hazards. #98956


Merged
davemgreen merged 3 commits into llvm:main from davemgreen:gh-a64-stackhazards
Jul 18, 2024

Conversation

davemgreen
Collaborator

Under some SME contexts, a coprocessor with its own separate cache will be used for FPR operations. This can create hazards if the CPU and the SME unit try to access the same area of memory, including if the access is to an area of the stack.

To try to alleviate that, this patch attempts to introduce extra padding into the stack frame between FPR and GPR accesses, controlled by the StackHazardSize option. Without changing the overall layout of the stack frame, a stack object of the right size is added between the GPR and FPR CSRs. Another is added to the stack objects section, and stack objects are sorted so that FPR > Hazard padding slot > GPRs (where possible).

Unfortunately some things are not handled well (the VLA area, FPR arguments on the stack, objects with both GPR and FPR accesses), but if those are controlled by the user then the entire stack frame becomes GPR at the start/end with FPR in the middle, surrounded by hazard padding. This can greatly help mitigate a problem that is otherwise difficult for the user to control themselves.

The current implementation is opt-in through an -aarch64-stack-hazard-size flag, and should have no effect if the option is unset. In the long run the implementation might change (for example using more base pointers to separate in more cases, re-enabling ldp/stp using an extra register, etc.), but this gets at least something for people to use in llvm-19 if they need it. The only change whilst the option is unset is a fix for making sure the stack increment is added at the right place when it cannot be converted to a post-increment (++MBBI). I believe that case cannot normally be reached without extra padding.
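As a usage sketch, mirroring the RUN lines in the new test file (the clang spelling via -mllvm is an assumption here, not something exercised by this patch):

  # enable 1024 bytes of hazard padding when compiling with llc
  llc -mtriple=aarch64 -mattr=+sve2 -aarch64-stack-hazard-size=1024 input.ll -o out.s
  # the same backend cl::opt can typically be reached from clang via -mllvm (assumption)
  clang --target=aarch64-linux-gnu -mllvm -aarch64-stack-hazard-size=1024 -c input.c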

@llvmbot
Member

llvmbot commented Jul 15, 2024

@llvm/pr-subscribers-backend-aarch64

Author: David Green (davemgreen)

Changes

Patch is 154.77 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/98956.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64FrameLowering.cpp (+193-8)
  • (modified) llvm/lib/Target/AArch64/AArch64FrameLowering.h (+4)
  • (modified) llvm/lib/Target/AArch64/AArch64MachineFunctionInfo.h (+27)
  • (added) llvm/test/CodeGen/AArch64/stack-hazard.ll (+2812)
diff --git a/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp b/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
index 0f1e860fac732..58cfa42b6d445 100644
--- a/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
@@ -52,6 +52,8 @@
 // | async context if needed           |
 // | (a.k.a. "frame record")           |
 // |-----------------------------------| <- fp(=x29)
+// |   <hazard padding>                |
+// |-----------------------------------|
 // |                                   |
 // | callee-saved fp/simd/SVE regs     |
 // |                                   |
@@ -64,9 +66,11 @@
 // |.aligned.in.case.it.needs.more.than| (size of this area is unknown at
 // |.the.standard.16-byte.alignment....|  compile time; if present)
 // |-----------------------------------|
-// |                                   |
 // | local variables of fixed size     |
 // | including spill slots             |
+// |   <FPR>                           |
+// |   <hazard padding>                |
+// |   <GPR>                           |
 // |-----------------------------------| <- bp(not defined by ABI,
 // |.variable-sized.local.variables....|       LLVM chooses X19)
 // |.(VLAs)............................| (size of this area is unknown at
@@ -117,6 +121,20 @@
 //
 // FIXME: also explain the redzone concept.
 //
+// About stack hazards: Under some SME contexts, a coprocessor with its own
+// separate cache can be used for FP operations. This can create hazards if the CPU
+// and the SME unit try to access the same area of memory, including if the
+// access is to an area of the stack. To try to alleviate this we attempt to
+// introduce extra padding into the stack frame between FP and GPR accesses,
+// controlled by the StackHazardSize option. Without changing the layout of the
+// stack frame in the diagram above, a stack object of size StackHazardSize is
+// added between GPR and FPR CSRs. Another is added to the stack objects
+// section, and stack objects are sorted so that FPR > Hazard padding slot >
+// GPRs (where possible). Unfortunately some things are not handled well (VLA
+// area, arguments on the stack, objects with both GPR and FPR accesses), but if
+// those are controlled by the user then the entire stack frame becomes GPR at
+// the start/end with FPR in the middle, surrounded by Hazard padding.
+//
 // An example of the prologue:
 //
 //     .globl __foo
@@ -196,6 +214,7 @@
 #include "llvm/ADT/ScopeExit.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/Statistic.h"
+#include "llvm/Analysis/ValueTracking.h"
 #include "llvm/CodeGen/LivePhysRegs.h"
 #include "llvm/CodeGen/MachineBasicBlock.h"
 #include "llvm/CodeGen/MachineFrameInfo.h"
@@ -253,6 +272,14 @@ cl::opt<bool> EnableHomogeneousPrologEpilog(
     cl::desc("Emit homogeneous prologue and epilogue for the size "
              "optimization (default = off)"));
 
+// Stack hazard padding size. 0 = disabled.
+static cl::opt<unsigned> StackHazardSize("aarch64-stack-hazard-size",
+                                         cl::init(0), cl::Hidden);
+// Whether to insert padding into non-streaming functions (for testing).
+static cl::opt<bool>
+    StackHazardInNonStreaming("aarch64-stack-hazard-in-non-streaming",
+                              cl::init(false), cl::Hidden);
+
 STATISTIC(NumRedZoneFunctions, "Number of functions using red zone");
 
 /// Returns how much of the incoming argument stack area (in bytes) we should
@@ -1461,6 +1488,10 @@ static MachineBasicBlock::iterator convertCalleeSaveRestoreToSPPrePostIncDec(
   // update in so create a normal arithmetic instruction instead.
   if (MBBI->getOperand(MBBI->getNumOperands() - 1).getImm() != 0 ||
       CSStackSizeInc < MinOffset || CSStackSizeInc > MaxOffset) {
+    // If we are destroying the frame, make sure we add the increment after the
+    // last frame operation.
+    if (FrameFlag == MachineInstr::FrameDestroy)
+      ++MBBI;
     emitFrameOffset(MBB, MBBI, DL, AArch64::SP, AArch64::SP,
                     StackOffset::getFixed(CSStackSizeInc), TII, FrameFlag,
                     false, false, nullptr, EmitCFI,
@@ -2901,6 +2932,7 @@ static void computeCalleeSaveRegisterPairs(
   }
   int ScalableByteOffset = AFI->getSVECalleeSavedStackSize();
   bool NeedGapToAlignStack = AFI->hasCalleeSaveStackFreeSpace();
+  Register LastReg = 0;
 
   // When iterating backwards, the loop condition relies on unsigned wraparound.
   for (unsigned i = FirstReg; i < Count; i += RegInc) {
@@ -2922,8 +2954,15 @@ static void computeCalleeSaveRegisterPairs(
     else
       llvm_unreachable("Unsupported register class.");
 
+    // Add the stack hazard size as we transition from GPR->FPR CSRs.
+    if (AFI->hasStackHazardSlotIndex() &&
+        (!LastReg || !AArch64InstrInfo::isFpOrNEON(LastReg)) &&
+        AArch64InstrInfo::isFpOrNEON(RPI.Reg1))
+      ByteOffset += StackFillDir * StackHazardSize;
+    LastReg = RPI.Reg1;
+
     // Add the next reg to the pair if it is in the same register class.
-    if (unsigned(i + RegInc) < Count) {
+    if (unsigned(i + RegInc) < Count && !AFI->hasStackHazardSlotIndex()) {
       Register NextReg = CSI[i + RegInc].getReg();
       bool IsFirst = i == FirstReg;
       switch (RPI.Type) {
@@ -3034,7 +3073,8 @@ static void computeCalleeSaveRegisterPairs(
       Offset += 8;
     RPI.Offset = Offset / Scale;
 
-    assert(((!RPI.isScalable() && RPI.Offset >= -64 && RPI.Offset <= 63) ||
+    assert((!RPI.isPaired() ||
+            (!RPI.isScalable() && RPI.Offset >= -64 && RPI.Offset <= 63) ||
             (RPI.isScalable() && RPI.Offset >= -256 && RPI.Offset <= 255)) &&
            "Offset out of bounds for LDP/STP immediate");
 
@@ -3455,6 +3495,81 @@ bool AArch64FrameLowering::restoreCalleeSavedRegisters(
   return true;
 }
 
+// Return the FrameID for a Load/Store instruction by looking at the MMO.
+static std::optional<int> getLdStFrameID(const MachineInstr &MI,
+                                         const MachineFrameInfo &MFI) {
+  if (!MI.mayLoadOrStore() || MI.getNumMemOperands() < 1)
+    return std::nullopt;
+
+  MachineMemOperand *MMO = *MI.memoperands_begin();
+  auto *PSV =
+      dyn_cast_or_null<FixedStackPseudoSourceValue>(MMO->getPseudoValue());
+  if (PSV)
+    return std::optional<int>(PSV->getFrameIndex());
+
+  if (MMO->getValue()) {
+    if (auto *Al = dyn_cast<AllocaInst>(getUnderlyingObject(MMO->getValue()))) {
+      for (int FI = MFI.getObjectIndexBegin(); FI < MFI.getObjectIndexEnd();
+           FI++)
+        if (MFI.getObjectAllocation(FI) == Al)
+          return FI;
+    }
+  }
+
+  return std::nullopt;
+}
+
+// Check if a Hazard slot is needed for the current function, and if so create
+// one for it. The index is stored in AArch64FunctionInfo->StackHazardSlotIndex,
+// which can be used to determine if any hazard padding is needed.
+void AArch64FrameLowering::determineStackHazardSlot(
+    MachineFunction &MF, BitVector &SavedRegs) const {
+  if (StackHazardSize == 0 || StackHazardSize % 16 != 0 ||
+      MF.getInfo<AArch64FunctionInfo>()->hasStackHazardSlotIndex())
+    return;
+
+  // Stack hazards are only needed in streaming functions.
+  SMEAttrs Attrs(MF.getFunction());
+  if (!StackHazardInNonStreaming &&
+      Attrs.hasNonStreamingInterfaceAndBody())
+    return;
+
+  MachineFrameInfo &MFI = MF.getFrameInfo();
+
+  // Add a hazard slot if there are any CSR FPR registers, or there are any
+  // FP-only stack objects.
+  bool HasFPRCSRs = any_of(SavedRegs.set_bits(), [](unsigned Reg) {
+    return AArch64::FPR64RegClass.contains(Reg) ||
+           AArch64::FPR128RegClass.contains(Reg) ||
+           AArch64::ZPRRegClass.contains(Reg) ||
+           AArch64::PPRRegClass.contains(Reg);
+  });
+  bool HasFPRStackObjects = false;
+  if (!HasFPRCSRs) {
+    std::vector<unsigned> FrameObjects(MFI.getObjectIndexEnd());
+    for (auto &MBB : MF) {
+      for (auto &MI : MBB) {
+        std::optional<int> FI = getLdStFrameID(MI, MFI);
+        if (FI && *FI >= 0 && *FI < (int)FrameObjects.size()) {
+          if (MFI.getStackID(*FI) == 2 || AArch64InstrInfo::isFpOrNEON(MI))
+            FrameObjects[*FI] |= 2;
+          else
+            FrameObjects[*FI] |= 1;
+        }
+      }
+    }
+    HasFPRStackObjects =
+        any_of(FrameObjects, [](unsigned B) { return (B & 3) == 2; });
+  }
+
+  if (HasFPRCSRs || HasFPRStackObjects) {
+    int ID = MFI.CreateStackObject(StackHazardSize, Align(16), false);
+    LLVM_DEBUG(dbgs() << "Created Hazard slot at " << ID << " size "
+                      << StackHazardSize << "\n");
+    MF.getInfo<AArch64FunctionInfo>()->setStackHazardSlotIndex(ID);
+  }
+}
+
 void AArch64FrameLowering::determineCalleeSaves(MachineFunction &MF,
                                                 BitVector &SavedRegs,
                                                 RegScavenger *RS) const {
@@ -3595,6 +3710,12 @@ void AArch64FrameLowering::determineCalleeSaves(MachineFunction &MF,
       CSStackSize += 8;
   }
 
+  // Determine if a Hazard slot should be used, and increase the CSStackSize by
+  // StackHazardSize if so.
+  determineStackHazardSlot(MF, SavedRegs);
+  if (AFI->hasStackHazardSlotIndex())
+    CSStackSize += StackHazardSize;
+
   // Save number of saved regs, so we can easily update CSStackSize later.
   unsigned NumSavedRegs = SavedRegs.count();
 
@@ -3761,10 +3882,28 @@ bool AArch64FrameLowering::assignCalleeSavedSpillSlots(
       CSI.insert(CSI.end(), VGSaves.begin(), VGSaves.end());
   }
 
+  Register LastReg = 0;
+  int HazardSlotIndex = std::numeric_limits<int>::max();
   for (auto &CS : CSI) {
     Register Reg = CS.getReg();
     const TargetRegisterClass *RC = RegInfo->getMinimalPhysRegClass(Reg);
 
+    // Create a hazard slot as we switch between GPR and FPR CSRs.
+    if (AFI->hasStackHazardSlotIndex() &&
+        (!LastReg || !AArch64InstrInfo::isFpOrNEON(LastReg)) &&
+        AArch64InstrInfo::isFpOrNEON(Reg)) {
+      assert(HazardSlotIndex == std::numeric_limits<int>::max() &&
+             "Unexpected register order for hazard slot");
+      HazardSlotIndex = MFI.CreateStackObject(StackHazardSize, Align(8), true);
+      LLVM_DEBUG(dbgs() << "Created CSR Hazard at slot " << HazardSlotIndex
+                        << "\n");
+      AFI->setStackHazardCSRSlotIndex(HazardSlotIndex);
+      if ((unsigned)HazardSlotIndex < MinCSFrameIndex)
+        MinCSFrameIndex = HazardSlotIndex;
+      if ((unsigned)HazardSlotIndex > MaxCSFrameIndex)
+        MaxCSFrameIndex = HazardSlotIndex;
+    }
+
     unsigned Size = RegInfo->getSpillSize(*RC);
     Align Alignment(RegInfo->getSpillAlign(*RC));
     int FrameIdx = MFI.CreateStackObject(Size, Alignment, true);
@@ -3785,7 +3924,22 @@ bool AArch64FrameLowering::assignCalleeSavedSpillSlots(
       if ((unsigned)FrameIdx > MaxCSFrameIndex)
         MaxCSFrameIndex = FrameIdx;
     }
+    LastReg = Reg;
+  }
+
+  // Add hazard slot in the case where no FPR CSRs are present.
+  if (AFI->hasStackHazardSlotIndex() &&
+      HazardSlotIndex == std::numeric_limits<int>::max()) {
+    HazardSlotIndex = MFI.CreateStackObject(StackHazardSize, Align(8), true);
+    LLVM_DEBUG(dbgs() << "Created CSR Hazard at slot " << HazardSlotIndex
+                      << "\n");
+    AFI->setStackHazardCSRSlotIndex(HazardSlotIndex);
+    if ((unsigned)HazardSlotIndex < MinCSFrameIndex)
+      MinCSFrameIndex = HazardSlotIndex;
+    if ((unsigned)HazardSlotIndex > MaxCSFrameIndex)
+      MaxCSFrameIndex = HazardSlotIndex;
   }
+
   return true;
 }
 
@@ -3798,6 +3952,10 @@ bool AArch64FrameLowering::enableStackSlotScavenging(
   // function doesn't use a FP.
   if (AFI->hasStreamingModeChanges() && !hasFP(MF))
     return false;
+  // Don't allow register scavenging with hazard slots, in case it moves objects
+  // into the wrong place.
+  if (AFI->hasStackHazardSlotIndex())
+    return false;
   return AFI->hasCalleeSaveStackFreeSpace();
 }
 
@@ -4492,6 +4650,10 @@ struct FrameObject {
   // This object's group (which always contains the object with
   // ObjectFirst==true) should be placed first.
   bool GroupFirst = false;
+
+  // Used to distinguish between FP and GPR accesses.
+  // 1 = GPR, 2 = FPR, 8 = Hazard Object.
+  unsigned Accesses = 0;
 };
 
 class GroupBuilder {
@@ -4527,8 +4689,12 @@ bool FrameObjectCompare(const FrameObject &A, const FrameObject &B) {
   // at the end. This also allows us to stop walking when we hit the
   // first invalid item after it's all sorted.
   //
-  // The "first" object goes first (closest to SP), followed by the members of
-  // the "first" group.
+  // If we want to include a stack hazard region, order FPR accesses < the
+  // hazard object < GPRs accesses in order to create a separation between the
+  // two. For the Accesses field 1 = GPR, 2 = FPR, 8 = Hazard Object.
+  //
+  // Otherwise the "first" object goes first (closest to SP), followed by the
+  // members of the "first" group.
   //
   // The rest are sorted by the group index to keep the groups together.
   // Higher numbered groups are more likely to be around longer (i.e. untagged
@@ -4537,9 +4703,15 @@ bool FrameObjectCompare(const FrameObject &A, const FrameObject &B) {
   //
   // If all else equal, sort by the object index to keep the objects in the
   // original order.
-  return std::make_tuple(!A.IsValid, A.ObjectFirst, A.GroupFirst, A.GroupIndex,
+  if (A.IsValid != B.IsValid)
+    return A.IsValid;
+  if (A.Accesses == 2 && B.Accesses != 2)
+    return true;
+  if (A.Accesses == 8 && B.Accesses != 2)
+    return true;
+  return std::make_tuple(A.ObjectFirst, A.GroupFirst, A.GroupIndex,
                          A.ObjectIndex) <
-         std::make_tuple(!B.IsValid, B.ObjectFirst, B.GroupFirst, B.GroupIndex,
+         std::make_tuple(B.ObjectFirst, B.GroupFirst, B.GroupIndex,
                          B.ObjectIndex);
 }
 } // namespace
@@ -4549,6 +4721,7 @@ void AArch64FrameLowering::orderFrameObjects(
   if (!OrderFrameObjects || ObjectsToAllocate.empty())
     return;
 
+  const AArch64FunctionInfo &AFI = *MF.getInfo<AArch64FunctionInfo>();
   const MachineFrameInfo &MFI = MF.getFrameInfo();
   std::vector<FrameObject> FrameObjects(MFI.getObjectIndexEnd());
   for (auto &Obj : ObjectsToAllocate) {
@@ -4595,16 +4768,28 @@ void AArch64FrameLowering::orderFrameObjects(
         GB.AddMember(TaggedFI);
       else
         GB.EndCurrentGroup();
+
+      if (AFI.hasStackHazardSlotIndex()) {
+        std::optional<int> FI = getLdStFrameID(MI, MFI);
+        if (FI && *FI >= 0 && *FI < (int)FrameObjects.size()) {
+          if (MFI.getStackID(*FI) == 2 || AArch64InstrInfo::isFpOrNEON(MI))
+            FrameObjects[*FI].Accesses |= 2;
+          else
+            FrameObjects[*FI].Accesses |= 1;
+        }
+      }
     }
     // Groups should never span multiple basic blocks.
     GB.EndCurrentGroup();
   }
 
+  if (AFI.hasStackHazardSlotIndex())
+    FrameObjects[AFI.getStackHazardSlotIndex()].Accesses = 8;
+
   // If the function's tagged base pointer is pinned to a stack slot, we want to
   // put that slot first when possible. This will likely place it at SP + 0,
   // and save one instruction when generating the base pointer because IRG does
   // not allow an immediate offset.
-  const AArch64FunctionInfo &AFI = *MF.getInfo<AArch64FunctionInfo>();
   std::optional<int> TBPI = AFI.getTaggedBasePointerIndex();
   if (TBPI) {
     FrameObjects[*TBPI].ObjectFirst = true;
diff --git a/llvm/lib/Target/AArch64/AArch64FrameLowering.h b/llvm/lib/Target/AArch64/AArch64FrameLowering.h
index 941af03a78b73..da315850d6362 100644
--- a/llvm/lib/Target/AArch64/AArch64FrameLowering.h
+++ b/llvm/lib/Target/AArch64/AArch64FrameLowering.h
@@ -155,6 +155,10 @@ class AArch64FrameLowering : public TargetFrameLowering {
                           int64_t RealignmentPadding, StackOffset AllocSize,
                           bool NeedsWinCFI, bool *HasWinCFI, bool EmitCFI,
                           StackOffset InitialOffset, bool FollowupAllocs) const;
+  /// Make a determination whether a Hazard slot is used and create it if
+  /// needed.
+  void determineStackHazardSlot(MachineFunction &MF,
+                                BitVector &SavedRegs) const;
 
   /// Emit target zero call-used regs.
   void emitZeroCallUsedRegs(BitVector RegsToZero,
diff --git a/llvm/lib/Target/AArch64/AArch64MachineFunctionInfo.h b/llvm/lib/Target/AArch64/AArch64MachineFunctionInfo.h
index 001521d1101eb..72f110cebbdc8 100644
--- a/llvm/lib/Target/AArch64/AArch64MachineFunctionInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64MachineFunctionInfo.h
@@ -109,6 +109,12 @@ class AArch64FunctionInfo final : public MachineFunctionInfo {
   /// registers.
   unsigned VarArgsFPRSize = 0;
 
+  /// The stack slots used to add space between FPR and GPR accesses when using
+  /// hazard padding. StackHazardCSRSlotIndex is added between GPR and FPR CSRs.
+  /// StackHazardSlotIndex is added between (sorted) stack objects.
+  int StackHazardSlotIndex = std::numeric_limits<int>::max();
+  int StackHazardCSRSlotIndex = std::numeric_limits<int>::max();
+
   /// True if this function has a subset of CSRs that is handled explicitly via
   /// copies.
   bool IsSplitCSR = false;
@@ -346,6 +352,13 @@ class AArch64FunctionInfo final : public MachineFunctionInfo {
         MaxOffset = std::max<int64_t>(Offset + ObjSize, MaxOffset);
       }
 
+      if (StackHazardCSRSlotIndex != std::numeric_limits<int>::max()) {
+        int64_t Offset = MFI.getObjectOffset(StackHazardCSRSlotIndex);
+        int64_t ObjSize = MFI.getObjectSize(StackHazardCSRSlotIndex);
+        MinOffset = std::min<int64_t>(Offset, MinOffset);
+        MaxOffset = std::max<int64_t>(Offset + ObjSize, MaxOffset);
+      }
+
       unsigned Size = alignTo(MaxOffset - MinOffset, 16);
       assert((!HasCalleeSavedStackSize || getCalleeSavedStackSize() == Size) &&
              "Invalid size calculated for callee saves");
@@ -403,6 +416,20 @@ class AArch64FunctionInfo final : public MachineFunctionInfo {
   unsigned getVarArgsFPRSize() const { return VarArgsFPRSize; }
   void setVarArgsFPRSize(unsigned Size) { VarArgsFPRSize = Size; }
 
+  bool hasStackHazardSlotIndex() const {
+    return StackHazardSlotIndex != std::numeric_limits<int>::max();
+  }
+  int getStackHazardSlotIndex() const { return StackHazardSlotIndex; }
+  void setStackHazardSlotIndex(int Index) {
+    assert(StackHazardSlotIndex == std::numeric_limits<int>::max());
+    StackHazardSlotIndex = Index;
+  }
+  int getStackHazardCSRSlotIndex() const { return StackHazardCSRSlotIndex; }
+  void setStackHazardCSRSlotIndex(int Index) {
+    assert(StackHazardCSRSlotIndex == std::numeric_limits<int>::max());
+    StackHazardCSRSlotIndex = Index;
+  }
+
   unsigned getSRetReturnReg() const { return SRetReturnReg; }
   void setSRetReturnReg(unsigned Reg) { SRetReturnReg = Reg; }
 
diff --git a/llvm/test/CodeGen/AArch64/stack-hazard.ll b/llvm/test/CodeGen/AArch64/stack-hazard.ll
new file mode 100644
index 0000000000000..7ba9a1a43ce7e
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/stack-hazard.ll
@@ -0,0 +1,2812 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -mtriple=aarch64 -mattr=+sve2 -aarch64-stack-hazard-size=0 | FileCheck %s --check-prefixes=CHECK,CHECK0
+; RUN: llc < %s -mtriple=aarch64 -mattr=+sve2 -aarch64-stack-hazard-size=64 | FileCheck %s --check-prefixes=CHECK,CHECK64
+; RUN: llc < %s -mtriple=aarch64 -mattr=+sve2 -aarch64-stack-hazard-size=1024 | FileCheck %s --check-prefixes=CHECK,CHECK1024
+
+define i32 @basic(i32 noundef %num) {
+; CHECK-LABEL: basic:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    mov w0, wzr
+; CHECK-NEXT:    ret
+entry:
+  ret i32 0
+}
+
+; Non-streaming functions don't need hazards
+define i32 @csr_d8_notsc(i32 noundef %num) {
+; CHECK-LABEL: csr_d8_notsc:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    str d8, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    .cfi_offset b8, -16
+; CHECK-NEXT:    mov w0, wzr
+; CHECK-NEXT:    //APP
+; CHECK-NEXT:    //NO_APP
+; CHECK-NEXT:    ldr d8, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+entry:
+  tail call void asm sideeffect "", "~{d8}"() #1
+  ret i32 0
+}
+
+; Very simple - doesn't require hazards
+define i32 @basic_sc(i32 noundef %num) "aarch64_pstate_sm_compatible" {
+; CHECK-LABEL: basic_sc:
+; CHECK:       // %bb.0...
[truncated]


github-actions bot commented Jul 15, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@efriedma-quic efriedma-quic (Collaborator) left a comment

A couple ways to possibly save some stack space:

  • For FP CSRs, would it make sense to go through integer registers? e.g. fmov x16, d8; str x16, [sp, #16]. More microops, but you don't need hazard padding for the FP CSRs. Maybe depends on how expensive the micro-ops are.
  • If the hazard is related to cache-lines, realigning the stack could reduce the amount of necessary padding.
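Spelling out the first bullet as a save/restore sketch (hypothetical code illustrating the suggestion, not something this patch emits; the restore direction is the assumed inverse):

  // save the FP CSR via a scratch GPR, so the store itself is a GPR access
  fmov x16, d8
  str  x16, [sp, #16]
  ...
  // restore by reloading into the GPR and moving back to the FPR
  ldr  x16, [sp, #16]
  fmov d8, x16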

@@ -4595,16 +4768,28 @@ void AArch64FrameLowering::orderFrameObjects(
GB.AddMember(TaggedFI);
else
GB.EndCurrentGroup();

if (AFI.hasStackHazardSlotIndex()) {
Collaborator:

Please rearrange this code so it isn't mixed with the stack tagging code.

Collaborator Author:

I've moved it earlier, as opposed to at the end. It is in the same loop to avoid iterating instructions more than once.

// Don't allow register scavenging with hazard slots, in case it moves objects
// into the wrong place.
if (AFI->hasStackHazardSlotIndex())
return false;
Collaborator:

What's the interaction here? Just that scavenging could introduce hazards?

Collaborator Author:

Yep - in some cases it was moving the FPR registers into the empty slot next to fp or the scavenged register.

@davemgreen davemgreen (Collaborator Author) left a comment

A couple ways to possibly save some stack space:

  • For FP CSRs, would it make sense to go through integer registers? e.g. fmov x16, d8; str x16, [sp, #16]. More microops, but you don't need hazard padding for the FP CSRs. Maybe depends on how expensive the micro-ops are.
  • If the hazard is related to cache-lines, realigning the stack could reduce the amount of necessary padding.

Thanks - we were hoping to get something into the LLVM-19 compiler so the amount of changes was deliberately kept minimal. In the longer term we might try and change more. Realigning the stack (or portions of it) is certainly an option.

If the SME unit is separate from the CPU and the SME unit holds the values of vector registers, the fmov d->gpr might be a relatively expensive operation. Depending on the function, we might need to stack frames on top of one another, and with the current scheme we end up with GPR (CPU) => GPR > hazard > FPR > FPR > hazard > GPR => GPR > hazard > FPR > ..., so I think in the general case you need two hazard paddings. If you don't have one of the regions (or any sub-calls) then things might become simpler though.
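As an illustrative sketch of that stacked layout (simplified, with each frame's CSRs and locals merged into single GPR/FPR regions, in the style of the frame diagram in AArch64FrameLowering.cpp):

  |  GPR             |  <- caller frame
  |  hazard padding  |
  |  FPR             |
  |  hazard padding  |
  |  GPR             |
  |------------------|  <- frame boundary: GPR meets GPR, so no extra hazard
  |  GPR             |  <- callee frame
  |  hazard padding  |
  |  FPR             |
  |  hazard padding  |
  |  GPR             |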

@efriedma-quic efriedma-quic (Collaborator) left a comment

LGTM

tail call void @other()
ret i32 %x
}
declare void @other()
Contributor:

Is it worth also adding one test which is locally-streaming?

@kmclaughlin-arm kmclaughlin-arm (Contributor) left a comment

Thanks @davemgreen, I've just left one minor comment on the test file but otherwise this LGTM.

Formatted. Added an enum and sorted Accesses. Moved identification of frame
objects up.
@davemgreen davemgreen force-pushed the gh-a64-stackhazards branch from 4dfd759 to 5977845 Compare July 18, 2024 07:01
@davemgreen (Collaborator Author)

Thanks for the reviews.

@davemgreen davemgreen merged commit 4b9bcab into llvm:main Jul 18, 2024
4 of 6 checks passed
@davemgreen davemgreen deleted the gh-a64-stackhazards branch July 18, 2024 07:16
@llvm-ci (Collaborator)

llvm-ci commented Jul 18, 2024

LLVM Buildbot has detected a new failure on builder clang-hip-vega20 running on hip-vega20-0 while building llvm at step 3 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/123/builds/2047

Here is the relevant piece of the build log for the reference:

Step 3 (annotate) failure: '../llvm-zorg/zorg/buildbot/builders/annotated/hip-build.sh --jobs=' (failure)
...
[36/38] : && /buildbot/hip-vega20-0/clang-hip-vega20/llvm/bin/clang++ -O3 -DNDEBUG  External/HIP/CMakeFiles/InOneWeekend-hip-6.0.2.dir/workload/ray-tracing/InOneWeekend/main.cc.o -o External/HIP/InOneWeekend-hip-6.0.2  --rocm-path=/buildbot/Externals/hip/rocm-6.0.2 --hip-link -rtlib=compiler-rt -unwindlib=libgcc -frtlib-add-rpath && cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP && /usr/local/bin/cmake -E create_symlink /buildbot/llvm-test-suite/External/HIP/InOneWeekend.reference_output /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/InOneWeekend.reference_output-hip-6.0.2
[37/38] /buildbot/hip-vega20-0/clang-hip-vega20/llvm/bin/clang++ -DNDEBUG  -O3 -DNDEBUG   -w -Werror=date-time --rocm-path=/buildbot/Externals/hip/rocm-6.0.2 --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx1030 --offload-arch=gfx1100 -xhip -mfma -MD -MT External/HIP/CMakeFiles/TheNextWeek-hip-6.0.2.dir/workload/ray-tracing/TheNextWeek/main.cc.o -MF External/HIP/CMakeFiles/TheNextWeek-hip-6.0.2.dir/workload/ray-tracing/TheNextWeek/main.cc.o.d -o External/HIP/CMakeFiles/TheNextWeek-hip-6.0.2.dir/workload/ray-tracing/TheNextWeek/main.cc.o -c /buildbot/llvm-test-suite/External/HIP/workload/ray-tracing/TheNextWeek/main.cc
[38/38] : && /buildbot/hip-vega20-0/clang-hip-vega20/llvm/bin/clang++ -O3 -DNDEBUG  External/HIP/CMakeFiles/TheNextWeek-hip-6.0.2.dir/workload/ray-tracing/TheNextWeek/main.cc.o -o External/HIP/TheNextWeek-hip-6.0.2  --rocm-path=/buildbot/Externals/hip/rocm-6.0.2 --hip-link -rtlib=compiler-rt -unwindlib=libgcc -frtlib-add-rpath && cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP && /usr/local/bin/cmake -E create_symlink /buildbot/llvm-test-suite/External/HIP/TheNextWeek.reference_output /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/TheNextWeek.reference_output-hip-6.0.2
+ build_step 'Testing HIP test-suite'
+ echo '@@@BUILD_STEP Testing HIP test-suite@@@'
@@@BUILD_STEP Testing HIP test-suite@@@
+ ninja -v check-hip-simple
[0/1] cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP && /buildbot/hip-vega20-0/clang-hip-vega20/llvm/bin/llvm-lit -sv empty-hip-6.0.2.test with-fopenmp-hip-6.0.2.test saxpy-hip-6.0.2.test InOneWeekend-hip-6.0.2.test TheNextWeek-hip-6.0.2.test blender.test
-- Testing: 6 tests, 6 workers --
Testing:  0.. 10.. 20.. 30.. 40.. 
FAIL: test-suite :: External/HIP/InOneWeekend-hip-6.0.2.test (4 of 6)
******************** TEST 'test-suite :: External/HIP/InOneWeekend-hip-6.0.2.test' FAILED ********************

/buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/timeit-target --timeout 7200 --limit-core 0 --limit-cpu 7200 --limit-file-size 209715200 --limit-rss-size 838860800 --append-exitstatus --redirect-output /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.out --redirect-input /dev/null --summary /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.time /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/InOneWeekend-hip-6.0.2
cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP ; /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/fpcmp-target /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.out InOneWeekend.reference_output-hip-6.0.2

+ cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP
+ /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/fpcmp-target /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.out InOneWeekend.reference_output-hip-6.0.2
/buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/fpcmp-target: Comparison failed, textual difference between 'M' and 'i'

********************
/usr/bin/strip: /bin/bash.stripped: Bad file descriptor
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90.. 
********************
Failed Tests (1):
  test-suite :: External/HIP/InOneWeekend-hip-6.0.2.test


Testing Time: 383.82s

Total Discovered Tests: 6
  Passed: 5 (83.33%)
  Failed: 1 (16.67%)
FAILED: External/HIP/CMakeFiles/check-hip-simple-hip-6.0.2 
cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP && /buildbot/hip-vega20-0/clang-hip-vega20/llvm/bin/llvm-lit -sv empty-hip-6.0.2.test with-fopenmp-hip-6.0.2.test saxpy-hip-6.0.2.test InOneWeekend-hip-6.0.2.test TheNextWeek-hip-6.0.2.test blender.test
ninja: build stopped: subcommand failed.
program finished with exit code 1
elapsedTime=496.230781

yuxuanchen1997 pushed a commit that referenced this pull request Jul 25, 2024