[MachinePipeliner] Fix loop-carried dependencies analysis #121907

Open
wants to merge 1 commit into base: main

Conversation

kasuga-fj
Contributor

@kasuga-fj kasuga-fj commented Jan 7, 2025

In the current MachinePipeliner, several loop-carried edges are missed, which can result in generating invalid code. At least the following loop-carried dependencies can be missed.

  • Memory dependencies from top to bottom.
    • Example:
      // There is a loop-carried dependence from the first store to 
      // the second one. 
      for (int i=1; i<n; i++) {
        a[i] = ...;
        a[i-1] = ...; 
      }
      
  • Store to store dependencies.
  • Store to load dependencies (see the sketch after this list).
  • Output (write-after-write for physical registers) dependencies.
  • Use of alias analysis results that are valid only within a single iteration.
    • Example:
      void f(double * restrict a, double * restrict b);
      ... 
      for (int i=0; i<n; i++) 
        f(ptr0, ptr1);  // will be inlined 
      
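To illustrate the store-to-load case, here is a minimal sketch of my own (not taken from the patch), in the same style as the examples above: the value stored in one iteration is loaded in the next, so the load of iteration i+1 must not be moved ahead of the store of iteration i.

      // Hypothetical example: the store to a[i] in iteration i is read back
      // as a[i-1] in iteration i+1, i.e. a loop-carried store-to-load
      // dependence.
      for (int i=1; i<n; i++) {
        a[i] = b[i];
        c[i] = a[i-1];
      }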

This patch adds these dependencies to fix the correctness issues.

In addition, the current analysis may add excessive dependencies, because loop-carried memory dependencies from bottom to top are expressed by using dependencies in the forward direction (i.e., as top-to-bottom edges). This patch also removes such dependencies.
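Here is a minimal sketch, again my own and based on my reading of the sentence above, of the kind of dependence this refers to: the only true dependence is loop-carried, from the store at the bottom of the body to the load at the top of the next iteration, but the old analysis models it as a forward (load-to-store) edge within a single iteration, which constrains the schedule more than the loop requires.

      // The store to a[i] in iteration i is only read back (as a[i-1]) in
      // iteration i+1, i.e. a "bottom to top" loop-carried dependence.
      for (int i=1; i<n; i++) {
        c[i] = a[i-1];   // top: load
        a[i] = b[i];     // bottom: store
      }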

I tested performance changes with and without this patch, using llvm-test-suite as the test cases, and checked the following:

  • Changes in Initiation Interval on Hexagon (since I don't have a real machine, I couldn't check the actual execution time).
  • Changes in Initiation Interval and execution time on AArch64 (Neoverse V1).
    • (Note: as described below, loop unrolling was disabled.)

As far as I have tested, there has been no significant performance impact. It's worth noting, however, that a huge performance degradation can occur when:

  • Loop unrolling is enabled and
  • The target architecture doesn't implement TargetInstrInfo::getIncrementValue.

This is because loop-carried edges are added between each pair of unrolled memory instructions. For example, suppose a loop contains a store to a[i] and is unrolled 4 times. In this case, the loop has stores to a[i], a[i+1], a[i+2], and a[i+3]. If getIncrementValue isn't implemented, we cannot be sure that these stores are independent, so loop-carried dependencies are added between each pair of them. We can avoid this problem by disabling loop unrolling or by implementing getIncrementValue. I don't think it makes much sense to enable both loop unrolling and software pipelining, so I believe disabling loop unrolling is not a big problem.
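A minimal sketch of the unrolled situation just described (my illustration, not from the patch), in the same style as the earlier examples:

      // After unrolling by 4, one iteration contains four stores. Without
      // TargetInstrInfo::getIncrementValue, the pipeliner cannot prove that
      // a[i], a[i+1], a[i+2], and a[i+3] from different iterations never
      // overlap, so a conservative loop-carried edge is added between every
      // pair of these stores.
      for (int i=0; i<n; i+=4) {
        a[i]   = ...;
        a[i+1] = ...;
        a[i+2] = ...;
        a[i+3] = ...;
      }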

Related: #109918

@llvmbot
Member

llvmbot commented Jan 7, 2025

@llvm/pr-subscribers-backend-powerpc

Author: Ryotaro Kasuga (kasuga-fj)

Patch is 84.58 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/121907.diff

18 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachinePipeliner.h (+45-35)
  • (modified) llvm/lib/CodeGen/MachinePipeliner.cpp (+497-230)
  • (modified) llvm/test/CodeGen/AArch64/sms-instruction-scheduled-at-correct-cycle.mir (+6-1)
  • (added) llvm/test/CodeGen/AArch64/sms-loop-carried-fp-exceptions1.mir (+107)
  • (added) llvm/test/CodeGen/AArch64/sms-loop-carried-fp-exceptions2.mir (+100)
  • (modified) llvm/test/CodeGen/Hexagon/swp-carried-dep1.mir (+5-6)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi7.ll (+5)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi9.ll (+4-4)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep1.mir (+110)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep2.mir (+104)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep3.mir (+108)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep4.mir (+107)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep5.mir (+106)
  • (added) llvm/test/CodeGen/Hexagon/swp-loop-carried-order-dep6.mir (+153)
  • (modified) llvm/test/CodeGen/Hexagon/swp-loop-carried-unknown.ll (+9-6)
  • (modified) llvm/test/CodeGen/Hexagon/swp-resmii-1.ll (+1-1)
  • (modified) llvm/test/CodeGen/PowerPC/sms-recmii.ll (+1-1)
  • (modified) llvm/test/CodeGen/PowerPC/sms-store-dependence.ll (+8-41)
diff --git a/llvm/include/llvm/CodeGen/MachinePipeliner.h b/llvm/include/llvm/CodeGen/MachinePipeliner.h
index 8e47d0cead7571..810a5d9f6dff00 100644
--- a/llvm/include/llvm/CodeGen/MachinePipeliner.h
+++ b/llvm/include/llvm/CodeGen/MachinePipeliner.h
@@ -42,6 +42,7 @@
 
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SetVector.h"
+#include "llvm/Analysis/AliasAnalysis.h"
 #include "llvm/CodeGen/DFAPacketizer.h"
 #include "llvm/CodeGen/MachineDominators.h"
 #include "llvm/CodeGen/MachineOptimizationRemarkEmitter.h"
@@ -190,6 +191,33 @@ class SwingSchedulerDDGEdge {
   bool ignoreDependence(bool IgnoreAnti) const;
 };
 
+struct LoopCarriedEdges {
+  using OutputDep = SmallDenseMap<Register, SmallSetVector<SUnit *, 4>>;
+  using OrderDep = SmallSetVector<SUnit *, 8>;
+  using OutputDepsType = DenseMap<SUnit *, OutputDep>;
+  using OrderDepsType = DenseMap<SUnit *, OrderDep>;
+
+  OutputDepsType OutputDeps;
+  OrderDepsType OrderDeps;
+
+  const OutputDep *getOutputDepOrNull(SUnit *Key) const {
+    auto Ite = OutputDeps.find(Key);
+    if (Ite == OutputDeps.end())
+      return nullptr;
+    return &Ite->second;
+  }
+
+  const OrderDep *getOrderDepOrNull(SUnit *Key) const {
+    auto Ite = OrderDeps.find(Key);
+    if (Ite == OrderDeps.end())
+      return nullptr;
+    return &Ite->second;
+  }
+
+  void dump(SUnit *SU, const TargetRegisterInfo *TRI,
+            const MachineRegisterInfo *MRI) const;
+};
+
 /// Represents dependencies between instructions. This class is a wrapper of
 /// `SUnits` and its dependencies to manipulate back-edges in a natural way.
 /// Currently it only supports back-edges via PHI, which are expressed as
@@ -217,8 +245,12 @@ class SwingSchedulerDDG {
   SwingSchedulerDDGEdges &getEdges(const SUnit *SU);
   const SwingSchedulerDDGEdges &getEdges(const SUnit *SU) const;
 
+  void addLoopCarriedEdges(std::vector<SUnit> &SUnits,
+                           const LoopCarriedEdges &LCE);
+
 public:
-  SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU);
+  SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU,
+                    const LoopCarriedEdges &LCE);
 
   const EdgesType &getInEdges(const SUnit *SU) const;
 
@@ -285,22 +317,14 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
     BitVector Blocked;
     SmallVector<SmallPtrSet<SUnit *, 4>, 10> B;
     SmallVector<SmallVector<int, 4>, 16> AdjK;
-    // Node to Index from ScheduleDAGTopologicalSort
-    std::vector<int> *Node2Idx;
+    SmallVector<BitVector, 16> LoopCarried;
     unsigned NumPaths = 0u;
-    static unsigned MaxPaths;
 
   public:
-    Circuits(std::vector<SUnit> &SUs, ScheduleDAGTopologicalSort &Topo)
-        : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {
-      Node2Idx = new std::vector<int>(SUs.size());
-      unsigned Idx = 0;
-      for (const auto &NodeNum : Topo)
-        Node2Idx->at(NodeNum) = Idx++;
-    }
+    Circuits(std::vector<SUnit> &SUs)
+        : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {}
     Circuits &operator=(const Circuits &other) = delete;
     Circuits(const Circuits &other) = delete;
-    ~Circuits() { delete Node2Idx; }
 
     /// Reset the data structures used in the circuit algorithm.
     void reset() {
@@ -310,9 +334,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
       NumPaths = 0;
     }
 
-    void createAdjacencyStructure(SwingSchedulerDAG *DAG);
+    void createAdjacencyStructure(const SwingSchedulerDDG *DDG);
     bool circuit(int V, int S, NodeSetType &NodeSets,
-                 const SwingSchedulerDAG *DAG, bool HasBackedge = false);
+                 const SwingSchedulerDDG *DDG, bool HasLoopCarriedEdge = false);
     void unblock(int U);
   };
 
@@ -366,7 +390,8 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
     return ScheduleInfo[Node->NodeNum].ZeroLatencyHeight;
   }
 
-  bool isLoopCarriedDep(const SwingSchedulerDDGEdge &Edge) const;
+  bool hasLoopCarriedMemDep(const MachineInstr *Src, const MachineInstr *Dst,
+                            BatchAAResults *BAA) const;
 
   void applyInstrChange(MachineInstr *MI, SMSchedule &Schedule);
 
@@ -391,7 +416,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
   const SwingSchedulerDDG *getDDG() const { return DDG.get(); }
 
 private:
-  void addLoopCarriedDependences(AAResults *AA);
+  LoopCarriedEdges addLoopCarriedDependences(AAResults *AA);
+  AliasResult::Kind checkLoopCarriedMemDep(const MachineInstr *Src,
+                                           const MachineInstr *Dst) const;
   void updatePhiDependences();
   void changeDependences();
   unsigned calculateResMII();
@@ -409,7 +436,7 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
   void computeNodeOrder(NodeSetType &NodeSets);
   void checkValidNodeOrder(const NodeSetType &Circuits) const;
   bool schedulePipeline(SMSchedule &Schedule);
-  bool computeDelta(MachineInstr &MI, unsigned &Delta) const;
+  bool computeDelta(const MachineInstr &MI, unsigned &Delta) const;
   MachineInstr *findDefInLoop(Register Reg);
   bool canUseLastOffsetValue(MachineInstr *MI, unsigned &BasePos,
                              unsigned &OffsetPos, unsigned &NewBase,
@@ -437,7 +464,7 @@ class NodeSet {
   using iterator = SetVector<SUnit *>::const_iterator;
 
   NodeSet() = default;
-  NodeSet(iterator S, iterator E, const SwingSchedulerDAG *DAG)
+  NodeSet(iterator S, iterator E, const SwingSchedulerDDG *DDG)
       : Nodes(S, E), HasRecurrence(true) {
     // Calculate the latency of this node set.
     // Example to demonstrate the calculation:
@@ -453,7 +480,6 @@ class NodeSet {
     //
     // Hold a map from each SUnit in the circle to the maximum distance from the
     // source node by only considering the nodes.
-    const SwingSchedulerDDG *DDG = DAG->getDDG();
     DenseMap<SUnit *, unsigned> SUnitToDistance;
     for (auto *Node : Nodes)
       SUnitToDistance[Node] = 0;
@@ -470,22 +496,6 @@ class NodeSet {
         }
       }
     }
-    // Handle a back-edge in loop carried dependencies
-    SUnit *FirstNode = Nodes[0];
-    SUnit *LastNode = Nodes[Nodes.size() - 1];
-
-    for (auto &PI : DDG->getInEdges(LastNode)) {
-      // If we have an order dep that is potentially loop carried then a
-      // back-edge exists between the last node and the first node that isn't
-      // modeled in the DAG. Handle it manually by adding 1 to the distance of
-      // the last node.
-      if (PI.getSrc() != FirstNode || !PI.isOrderDep() ||
-          !DAG->isLoopCarriedDep(PI))
-        continue;
-      SUnitToDistance[FirstNode] =
-          std::max(SUnitToDistance[FirstNode], SUnitToDistance[LastNode] + 1);
-    }
-
     // The latency is the distance from the source node to itself.
     Latency = SUnitToDistance[Nodes.front()];
   }
diff --git a/llvm/lib/CodeGen/MachinePipeliner.cpp b/llvm/lib/CodeGen/MachinePipeliner.cpp
index acd42aa497c6fe..f0731ccdaa6532 100644
--- a/llvm/lib/CodeGen/MachinePipeliner.cpp
+++ b/llvm/lib/CodeGen/MachinePipeliner.cpp
@@ -194,6 +194,10 @@ static cl::opt<bool>
     MVECodeGen("pipeliner-mve-cg", cl::Hidden, cl::init(false),
                cl::desc("Use the MVE code generator for software pipelining"));
 
+static cl::opt<unsigned> MaxCircuitPaths(
+    "pipeliner-max-circuit-paths", cl::Hidden, cl::init(5),
+    cl::desc("Maximum number of circles to be detected for each vertex"));
+
 namespace llvm {
 
 // A command line option to enable the CopyToPhi DAG mutation.
@@ -221,7 +225,6 @@ cl::opt<WindowSchedulingFlag> WindowSchedulingOption(
 
 } // end namespace llvm
 
-unsigned SwingSchedulerDAG::Circuits::MaxPaths = 5;
 char MachinePipeliner::ID = 0;
 #ifndef NDEBUG
 int MachinePipeliner::NumTries = 0;
@@ -562,14 +565,20 @@ void SwingSchedulerDAG::setMAX_II() {
 void SwingSchedulerDAG::schedule() {
   AliasAnalysis *AA = &Pass.getAnalysis<AAResultsWrapperPass>().getAAResults();
   buildSchedGraph(AA);
-  addLoopCarriedDependences(AA);
   updatePhiDependences();
   Topo.InitDAGTopologicalSorting();
   changeDependences();
   postProcessDAG();
-  DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU);
   LLVM_DEBUG(dump());
 
+  auto LCE = addLoopCarriedDependences(AA);
+  LLVM_DEBUG({
+    dbgs() << "Loop Carried Edges:\n";
+    for (SUnit &SU : SUnits)
+      LCE.dump(&SU, TRI, &MRI);
+  });
+  DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU, LCE);
+
   NodeSetType NodeSets;
   findCircuits(NodeSets);
   NodeSetType Circuits = NodeSets;
@@ -779,42 +788,18 @@ static unsigned getLoopPhiReg(const MachineInstr &Phi,
   return 0;
 }
 
-/// Return true if SUb can be reached from SUa following the chain edges.
-static bool isSuccOrder(SUnit *SUa, SUnit *SUb) {
-  SmallPtrSet<SUnit *, 8> Visited;
-  SmallVector<SUnit *, 8> Worklist;
-  Worklist.push_back(SUa);
-  while (!Worklist.empty()) {
-    const SUnit *SU = Worklist.pop_back_val();
-    for (const auto &SI : SU->Succs) {
-      SUnit *SuccSU = SI.getSUnit();
-      if (SI.getKind() == SDep::Order) {
-        if (Visited.count(SuccSU))
-          continue;
-        if (SuccSU == SUb)
-          return true;
-        Worklist.push_back(SuccSU);
-        Visited.insert(SuccSU);
-      }
-    }
-  }
-  return false;
-}
-
 /// Return true if the instruction causes a chain between memory
 /// references before and after it.
-static bool isDependenceBarrier(MachineInstr &MI) {
-  return MI.isCall() || MI.mayRaiseFPException() ||
-         MI.hasUnmodeledSideEffects() ||
-         (MI.hasOrderedMemoryRef() &&
-          (!MI.mayLoad() || !MI.isDereferenceableInvariantLoad()));
+static bool isGlobalMemoryObject(MachineInstr &MI) {
+  return MI.isCall() || MI.hasUnmodeledSideEffects() ||
+         (MI.hasOrderedMemoryRef() && !MI.isDereferenceableInvariantLoad());
 }
 
 /// Return the underlying objects for the memory references of an instruction.
 /// This function calls the code in ValueTracking, but first checks that the
 /// instruction has a memory operand.
-static void getUnderlyingObjects(const MachineInstr *MI,
-                                 SmallVectorImpl<const Value *> &Objs) {
+static void getUnderlyingObjectsForInstr(const MachineInstr *MI,
+                                         SmallVectorImpl<const Value *> &Objs) {
   if (!MI->hasOneMemOperand())
     return;
   MachineMemOperand *MM = *MI->memoperands_begin();
@@ -829,97 +814,63 @@ static void getUnderlyingObjects(const MachineInstr *MI,
   }
 }
 
-/// Add a chain edge between a load and store if the store can be an
-/// alias of the load on a subsequent iteration, i.e., a loop carried
-/// dependence. This code is very similar to the code in ScheduleDAGInstrs
-/// but that code doesn't create loop carried dependences.
-void SwingSchedulerDAG::addLoopCarriedDependences(AliasAnalysis *AA) {
-  MapVector<const Value *, SmallVector<SUnit *, 4>> PendingLoads;
-  Value *UnknownValue =
-    UndefValue::get(Type::getVoidTy(MF.getFunction().getContext()));
-  for (auto &SU : SUnits) {
-    MachineInstr &MI = *SU.getInstr();
-    if (isDependenceBarrier(MI))
-      PendingLoads.clear();
-    else if (MI.mayLoad()) {
-      SmallVector<const Value *, 4> Objs;
-      ::getUnderlyingObjects(&MI, Objs);
-      if (Objs.empty())
-        Objs.push_back(UnknownValue);
-      for (const auto *V : Objs) {
-        SmallVector<SUnit *, 4> &SUs = PendingLoads[V];
-        SUs.push_back(&SU);
-      }
-    } else if (MI.mayStore()) {
-      SmallVector<const Value *, 4> Objs;
-      ::getUnderlyingObjects(&MI, Objs);
-      if (Objs.empty())
-        Objs.push_back(UnknownValue);
-      for (const auto *V : Objs) {
-        MapVector<const Value *, SmallVector<SUnit *, 4>>::iterator I =
-            PendingLoads.find(V);
-        if (I == PendingLoads.end())
-          continue;
-        for (auto *Load : I->second) {
-          if (isSuccOrder(Load, &SU))
-            continue;
-          MachineInstr &LdMI = *Load->getInstr();
-          // First, perform the cheaper check that compares the base register.
-          // If they are the same and the load offset is less than the store
-          // offset, then mark the dependence as loop carried potentially.
-          const MachineOperand *BaseOp1, *BaseOp2;
-          int64_t Offset1, Offset2;
-          bool Offset1IsScalable, Offset2IsScalable;
-          if (TII->getMemOperandWithOffset(LdMI, BaseOp1, Offset1,
-                                           Offset1IsScalable, TRI) &&
-              TII->getMemOperandWithOffset(MI, BaseOp2, Offset2,
-                                           Offset2IsScalable, TRI)) {
-            if (BaseOp1->isIdenticalTo(*BaseOp2) &&
-                Offset1IsScalable == Offset2IsScalable &&
-                (int)Offset1 < (int)Offset2) {
-              assert(TII->areMemAccessesTriviallyDisjoint(LdMI, MI) &&
-                     "What happened to the chain edge?");
-              SDep Dep(Load, SDep::Barrier);
-              Dep.setLatency(1);
-              SU.addPred(Dep);
-              continue;
-            }
-          }
-          // Second, the more expensive check that uses alias analysis on the
-          // base registers. If they alias, and the load offset is less than
-          // the store offset, the mark the dependence as loop carried.
-          if (!AA) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          MachineMemOperand *MMO1 = *LdMI.memoperands_begin();
-          MachineMemOperand *MMO2 = *MI.memoperands_begin();
-          if (!MMO1->getValue() || !MMO2->getValue()) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          if (MMO1->getValue() == MMO2->getValue() &&
-              MMO1->getOffset() <= MMO2->getOffset()) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          if (!AA->isNoAlias(
-                  MemoryLocation::getAfter(MMO1->getValue(), MMO1->getAAInfo()),
-                  MemoryLocation::getAfter(MMO2->getValue(),
-                                           MMO2->getAAInfo()))) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-          }
-        }
-      }
-    }
+static std::optional<MemoryLocation>
+getMemoryLocationForAA(const MachineInstr *MI) {
+  const MachineMemOperand *MMO = *MI->memoperands_begin();
+  const Value *Val = MMO->getValue();
+  if (!Val)
+    return std::nullopt;
+  auto MemLoc = MemoryLocation::getBeforeOrAfter(Val, MMO->getAAInfo());
+
+  // Peel off noalias information from `AATags` because it might be valid only
+  // in single iteration.
+  // FIXME: This is too conservative. Checking
+  // `llvm.experimental.noalias.scope.decl` instrinsics in the original LLVM IR
+  // can perform more accuurately.
+  MemLoc.AATags.NoAlias = nullptr;
+  return MemLoc;
+}
+
+/// Return true for an memory dependence that is loop carried
+/// potentially. A dependence is loop carried if the destination defines a value
+/// that may be used or defined by the source in a subsequent iteration.
+bool SwingSchedulerDAG::hasLoopCarriedMemDep(const MachineInstr *Src,
+                                             const MachineInstr *Dst,
+                                             BatchAAResults *BAA) const {
+  if (!SwpPruneLoopCarried)
+    return true;
+
+  // First, check the dependence by comparing base register, offset, and
+  // step value of the loop.
+  switch (checkLoopCarriedMemDep(Src, Dst)) {
+  case AliasResult::Kind::MustAlias:
+    return true;
+  case AliasResult::Kind::NoAlias:
+    return false;
+  case AliasResult::Kind::MayAlias:
+    break;
+  default:
+    llvm_unreachable("Unexpected alias");
+  }
+
+  // If we cannot determine the dependence by previouse check, then
+  // check by using alias analysis.
+  if (!BAA)
+    return true;
+
+  const auto MemLoc1 = getMemoryLocationForAA(Src);
+  const auto MemLoc2 = getMemoryLocationForAA(Dst);
+  if (!MemLoc1.has_value() || !MemLoc2.has_value())
+    return true;
+  switch (BAA->alias(*MemLoc1, *MemLoc2)) {
+  case AliasResult::Kind::MayAlias:
+  case AliasResult::Kind::MustAlias:
+  case AliasResult::Kind::PartialAlias:
+    return true;
+  case AliasResult::Kind::NoAlias:
+    return false;
+  default:
+    llvm_unreachable("Unexpected alias");
   }
 }
 
@@ -1544,8 +1495,311 @@ class HighRegisterPressureDetector {
   }
 };
 
+/// Add loop-carried chain dependencies. This class handles the same type of
+/// dependencies added by `ScheduleDAGInstrs::buildSchedGraph`, but takes into
+/// account dependencies across iterations.
+class LoopCarriedOrderDepsTracker {
+  // Type of instruction that is relevant to order-dependencies
+  enum class InstrTag {
+    // Instruction related to global memory objects. There are order
+    // dependencies between instructions that may load or store or raise
+    // floating-point exception before and after this one.
+    GlobalMemoryObject = 0,
+
+    // Instruction that may load or store memory, but does not form a global
+    // barrier.
+    LoadOrStore = 1,
+
+    // Instruction that does not match above, but may raise floatin-point
+    // exceptions.
+    FPExceptions = 2,
+  };
+
+  struct TaggedSUnit : PointerIntPair<SUnit *, 2> {
+    TaggedSUnit(SUnit *SU, InstrTag Tag)
+        : PointerIntPair<SUnit *, 2>(SU, unsigned(Tag)) {}
+
+    InstrTag getTag() const { return InstrTag(getInt()); }
+  };
+
+  using SUsType = SmallVector<SUnit *, 4>;
+  using Value2SUs = MapVector<const Value *, SUsType>;
+
+  // Retains loads and stores classified by the underlying objects.
+  struct LoadStoreChunk {
+    Value2SUs Loads, Stores;
+    SUsType UnknownLoads, UnknownStores;
+  };
+
+  SwingSchedulerDAG *DAG;
+  std::unique_ptr<BatchAAResults> BAA;
+  const Value *UnknownValue;
+  std::vector<SUnit> &SUnits;
+
+  // The size of SUnits, for convenience.
+  const unsigned N;
+
+  // Adjacency matrix consisiting of order dependencies of the original DAG.
+  std::vector<BitVector> AdjMatrix;
+
+  // Loop-carried Edges.
+  std::vector<BitVector> LoopCarried;
+
+  // Instructions related to chain dependencies. They are one of the following.
+  //
+  //   1. Global memory object.
+  //   2. Load, but not a global memory object, not invariant, or may load trap
+  //      value.
+  //   3. Store, but not global memory object.
+  //   4. None of them, but may raise floating-point exceptions.
+  //
+  // This is used when analyzing loop-carried dependencies that access global
+  // barrier instructions.
+  std::vector<TaggedSUnit> TaggedSUnits;
+
+public:
+  LoopCarriedOrderDepsTracker(SwingSchedulerDAG *SSD, AAResults *AA)
+      : DAG(SSD), BAA(nullptr), SUnits(DAG->SUnits), N(SUnits.size()),
+        AdjMatrix(N, BitVector(N)), LoopCarried(N, BitVector(N)) {
+    UnknownValue =
+        UndefValue::get(Type::getVoidTy(DAG->MF.getFunction().getContext()));
+    if (AA) {
+      BAA = std::make_unique<BatchAAResults>(*AA);
+      BAA->enableCrossIterationMode();
+    }
+    initAdjMatrix();
+  }
+
+  void computeDependencies() {
+    // Traverse all instructions and extract only what we are targetting.
+    for (auto &SU : SUnits) {
+      auto Tagged = checkInstrType(&SU);
+
+      // This instruction has no loop-carried order-dependencies.
+      if (!Tagged)
+        continue;
+
+      TaggedSUnits.push_back(*Tagged);
+    }
+
+    addLoopCarriedDependencies();
+
+    // Finalize the results.
+    for (int I = 0; I != int(N); I++) {
+      // If the dependence between two instructions already exists in the
+      // original DAG, then loop-carried dependence of the same instructions is
+      // unnecessary because the original one expresses stricter
+      // constraint than loop-carried one.
+      LoopCarried[I].reset(AdjMatrix[I]);
+
+      // Self-loops are noisy.
+      LoopCarried[I].reset(I);
+    }
+  }
+
+  const BitVector &getLoopCarried(unsigned Idx) const {
+    return LoopCarried[Idx];
+  }
+
+private:
+  // Calculate reachability induced by the adjacency matrix. The original graph
+  // is DAG, so we can compute them from bottom to top.
+  void initAdjMatrix() {
+    for (int RI = 0; RI != int(N); RI++) {
+      int I = SUnits.size() - (RI + 1);
+      f...
[truncated]

@llvmbot
Member

llvmbot commented Jan 7, 2025

@llvm/pr-subscribers-backend-aarch64

@llvmbot
Member

llvmbot commented Jan 7, 2025

@llvm/pr-subscribers-backend-hexagon

-  void addLoopCarriedDependences(AAResults *AA);
+  LoopCarriedEdges addLoopCarriedDependences(AAResults *AA);
+  AliasResult::Kind checkLoopCarriedMemDep(const MachineInstr *Src,
+                                           const MachineInstr *Dst) const;
   void updatePhiDependences();
   void changeDependences();
   unsigned calculateResMII();
@@ -409,7 +436,7 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
   void computeNodeOrder(NodeSetType &NodeSets);
   void checkValidNodeOrder(const NodeSetType &Circuits) const;
   bool schedulePipeline(SMSchedule &Schedule);
-  bool computeDelta(MachineInstr &MI, unsigned &Delta) const;
+  bool computeDelta(const MachineInstr &MI, unsigned &Delta) const;
   MachineInstr *findDefInLoop(Register Reg);
   bool canUseLastOffsetValue(MachineInstr *MI, unsigned &BasePos,
                              unsigned &OffsetPos, unsigned &NewBase,
@@ -437,7 +464,7 @@ class NodeSet {
   using iterator = SetVector<SUnit *>::const_iterator;
 
   NodeSet() = default;
-  NodeSet(iterator S, iterator E, const SwingSchedulerDAG *DAG)
+  NodeSet(iterator S, iterator E, const SwingSchedulerDDG *DDG)
       : Nodes(S, E), HasRecurrence(true) {
     // Calculate the latency of this node set.
     // Example to demonstrate the calculation:
@@ -453,7 +480,6 @@ class NodeSet {
     //
     // Hold a map from each SUnit in the circle to the maximum distance from the
     // source node by only considering the nodes.
-    const SwingSchedulerDDG *DDG = DAG->getDDG();
     DenseMap<SUnit *, unsigned> SUnitToDistance;
     for (auto *Node : Nodes)
       SUnitToDistance[Node] = 0;
@@ -470,22 +496,6 @@ class NodeSet {
         }
       }
     }
-    // Handle a back-edge in loop carried dependencies
-    SUnit *FirstNode = Nodes[0];
-    SUnit *LastNode = Nodes[Nodes.size() - 1];
-
-    for (auto &PI : DDG->getInEdges(LastNode)) {
-      // If we have an order dep that is potentially loop carried then a
-      // back-edge exists between the last node and the first node that isn't
-      // modeled in the DAG. Handle it manually by adding 1 to the distance of
-      // the last node.
-      if (PI.getSrc() != FirstNode || !PI.isOrderDep() ||
-          !DAG->isLoopCarriedDep(PI))
-        continue;
-      SUnitToDistance[FirstNode] =
-          std::max(SUnitToDistance[FirstNode], SUnitToDistance[LastNode] + 1);
-    }
-
     // The latency is the distance from the source node to itself.
     Latency = SUnitToDistance[Nodes.front()];
   }
diff --git a/llvm/lib/CodeGen/MachinePipeliner.cpp b/llvm/lib/CodeGen/MachinePipeliner.cpp
index acd42aa497c6fe..f0731ccdaa6532 100644
--- a/llvm/lib/CodeGen/MachinePipeliner.cpp
+++ b/llvm/lib/CodeGen/MachinePipeliner.cpp
@@ -194,6 +194,10 @@ static cl::opt<bool>
     MVECodeGen("pipeliner-mve-cg", cl::Hidden, cl::init(false),
                cl::desc("Use the MVE code generator for software pipelining"));
 
+static cl::opt<unsigned> MaxCircuitPaths(
+    "pipeliner-max-circuit-paths", cl::Hidden, cl::init(5),
+    cl::desc("Maximum number of circles to be detected for each vertex"));
+
 namespace llvm {
 
 // A command line option to enable the CopyToPhi DAG mutation.
@@ -221,7 +225,6 @@ cl::opt<WindowSchedulingFlag> WindowSchedulingOption(
 
 } // end namespace llvm
 
-unsigned SwingSchedulerDAG::Circuits::MaxPaths = 5;
 char MachinePipeliner::ID = 0;
 #ifndef NDEBUG
 int MachinePipeliner::NumTries = 0;
@@ -562,14 +565,20 @@ void SwingSchedulerDAG::setMAX_II() {
 void SwingSchedulerDAG::schedule() {
   AliasAnalysis *AA = &Pass.getAnalysis<AAResultsWrapperPass>().getAAResults();
   buildSchedGraph(AA);
-  addLoopCarriedDependences(AA);
   updatePhiDependences();
   Topo.InitDAGTopologicalSorting();
   changeDependences();
   postProcessDAG();
-  DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU);
   LLVM_DEBUG(dump());
 
+  auto LCE = addLoopCarriedDependences(AA);
+  LLVM_DEBUG({
+    dbgs() << "Loop Carried Edges:\n";
+    for (SUnit &SU : SUnits)
+      LCE.dump(&SU, TRI, &MRI);
+  });
+  DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU, LCE);
+
   NodeSetType NodeSets;
   findCircuits(NodeSets);
   NodeSetType Circuits = NodeSets;
@@ -779,42 +788,18 @@ static unsigned getLoopPhiReg(const MachineInstr &Phi,
   return 0;
 }
 
-/// Return true if SUb can be reached from SUa following the chain edges.
-static bool isSuccOrder(SUnit *SUa, SUnit *SUb) {
-  SmallPtrSet<SUnit *, 8> Visited;
-  SmallVector<SUnit *, 8> Worklist;
-  Worklist.push_back(SUa);
-  while (!Worklist.empty()) {
-    const SUnit *SU = Worklist.pop_back_val();
-    for (const auto &SI : SU->Succs) {
-      SUnit *SuccSU = SI.getSUnit();
-      if (SI.getKind() == SDep::Order) {
-        if (Visited.count(SuccSU))
-          continue;
-        if (SuccSU == SUb)
-          return true;
-        Worklist.push_back(SuccSU);
-        Visited.insert(SuccSU);
-      }
-    }
-  }
-  return false;
-}
-
 /// Return true if the instruction causes a chain between memory
 /// references before and after it.
-static bool isDependenceBarrier(MachineInstr &MI) {
-  return MI.isCall() || MI.mayRaiseFPException() ||
-         MI.hasUnmodeledSideEffects() ||
-         (MI.hasOrderedMemoryRef() &&
-          (!MI.mayLoad() || !MI.isDereferenceableInvariantLoad()));
+static bool isGlobalMemoryObject(MachineInstr &MI) {
+  return MI.isCall() || MI.hasUnmodeledSideEffects() ||
+         (MI.hasOrderedMemoryRef() && !MI.isDereferenceableInvariantLoad());
 }
 
 /// Return the underlying objects for the memory references of an instruction.
 /// This function calls the code in ValueTracking, but first checks that the
 /// instruction has a memory operand.
-static void getUnderlyingObjects(const MachineInstr *MI,
-                                 SmallVectorImpl<const Value *> &Objs) {
+static void getUnderlyingObjectsForInstr(const MachineInstr *MI,
+                                         SmallVectorImpl<const Value *> &Objs) {
   if (!MI->hasOneMemOperand())
     return;
   MachineMemOperand *MM = *MI->memoperands_begin();
@@ -829,97 +814,63 @@ static void getUnderlyingObjects(const MachineInstr *MI,
   }
 }
 
-/// Add a chain edge between a load and store if the store can be an
-/// alias of the load on a subsequent iteration, i.e., a loop carried
-/// dependence. This code is very similar to the code in ScheduleDAGInstrs
-/// but that code doesn't create loop carried dependences.
-void SwingSchedulerDAG::addLoopCarriedDependences(AliasAnalysis *AA) {
-  MapVector<const Value *, SmallVector<SUnit *, 4>> PendingLoads;
-  Value *UnknownValue =
-    UndefValue::get(Type::getVoidTy(MF.getFunction().getContext()));
-  for (auto &SU : SUnits) {
-    MachineInstr &MI = *SU.getInstr();
-    if (isDependenceBarrier(MI))
-      PendingLoads.clear();
-    else if (MI.mayLoad()) {
-      SmallVector<const Value *, 4> Objs;
-      ::getUnderlyingObjects(&MI, Objs);
-      if (Objs.empty())
-        Objs.push_back(UnknownValue);
-      for (const auto *V : Objs) {
-        SmallVector<SUnit *, 4> &SUs = PendingLoads[V];
-        SUs.push_back(&SU);
-      }
-    } else if (MI.mayStore()) {
-      SmallVector<const Value *, 4> Objs;
-      ::getUnderlyingObjects(&MI, Objs);
-      if (Objs.empty())
-        Objs.push_back(UnknownValue);
-      for (const auto *V : Objs) {
-        MapVector<const Value *, SmallVector<SUnit *, 4>>::iterator I =
-            PendingLoads.find(V);
-        if (I == PendingLoads.end())
-          continue;
-        for (auto *Load : I->second) {
-          if (isSuccOrder(Load, &SU))
-            continue;
-          MachineInstr &LdMI = *Load->getInstr();
-          // First, perform the cheaper check that compares the base register.
-          // If they are the same and the load offset is less than the store
-          // offset, then mark the dependence as loop carried potentially.
-          const MachineOperand *BaseOp1, *BaseOp2;
-          int64_t Offset1, Offset2;
-          bool Offset1IsScalable, Offset2IsScalable;
-          if (TII->getMemOperandWithOffset(LdMI, BaseOp1, Offset1,
-                                           Offset1IsScalable, TRI) &&
-              TII->getMemOperandWithOffset(MI, BaseOp2, Offset2,
-                                           Offset2IsScalable, TRI)) {
-            if (BaseOp1->isIdenticalTo(*BaseOp2) &&
-                Offset1IsScalable == Offset2IsScalable &&
-                (int)Offset1 < (int)Offset2) {
-              assert(TII->areMemAccessesTriviallyDisjoint(LdMI, MI) &&
-                     "What happened to the chain edge?");
-              SDep Dep(Load, SDep::Barrier);
-              Dep.setLatency(1);
-              SU.addPred(Dep);
-              continue;
-            }
-          }
-          // Second, the more expensive check that uses alias analysis on the
-          // base registers. If they alias, and the load offset is less than
-          // the store offset, the mark the dependence as loop carried.
-          if (!AA) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          MachineMemOperand *MMO1 = *LdMI.memoperands_begin();
-          MachineMemOperand *MMO2 = *MI.memoperands_begin();
-          if (!MMO1->getValue() || !MMO2->getValue()) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          if (MMO1->getValue() == MMO2->getValue() &&
-              MMO1->getOffset() <= MMO2->getOffset()) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-            continue;
-          }
-          if (!AA->isNoAlias(
-                  MemoryLocation::getAfter(MMO1->getValue(), MMO1->getAAInfo()),
-                  MemoryLocation::getAfter(MMO2->getValue(),
-                                           MMO2->getAAInfo()))) {
-            SDep Dep(Load, SDep::Barrier);
-            Dep.setLatency(1);
-            SU.addPred(Dep);
-          }
-        }
-      }
-    }
+static std::optional<MemoryLocation>
+getMemoryLocationForAA(const MachineInstr *MI) {
+  const MachineMemOperand *MMO = *MI->memoperands_begin();
+  const Value *Val = MMO->getValue();
+  if (!Val)
+    return std::nullopt;
+  auto MemLoc = MemoryLocation::getBeforeOrAfter(Val, MMO->getAAInfo());
+
+  // Peel off noalias information from `AATags` because it might be valid only
+  // in a single iteration.
+  // FIXME: This is too conservative. Checking
+  // `llvm.experimental.noalias.scope.decl` intrinsics in the original LLVM IR
+  // could make this more accurate.
+  MemLoc.AATags.NoAlias = nullptr;
+  return MemLoc;
+}
+
+/// Return true for a memory dependence that is potentially loop
+/// carried. A dependence is loop carried if the destination defines a value
+/// that may be used or defined by the source in a subsequent iteration.
+bool SwingSchedulerDAG::hasLoopCarriedMemDep(const MachineInstr *Src,
+                                             const MachineInstr *Dst,
+                                             BatchAAResults *BAA) const {
+  if (!SwpPruneLoopCarried)
+    return true;
+
+  // First, check the dependence by comparing base register, offset, and
+  // step value of the loop.
+  switch (checkLoopCarriedMemDep(Src, Dst)) {
+  case AliasResult::Kind::MustAlias:
+    return true;
+  case AliasResult::Kind::NoAlias:
+    return false;
+  case AliasResult::Kind::MayAlias:
+    break;
+  default:
+    llvm_unreachable("Unexpected alias");
+  }
+
+  // If we cannot determine the dependence by the previous check, then
+  // check by using alias analysis.
+  if (!BAA)
+    return true;
+
+  const auto MemLoc1 = getMemoryLocationForAA(Src);
+  const auto MemLoc2 = getMemoryLocationForAA(Dst);
+  if (!MemLoc1.has_value() || !MemLoc2.has_value())
+    return true;
+  switch (BAA->alias(*MemLoc1, *MemLoc2)) {
+  case AliasResult::Kind::MayAlias:
+  case AliasResult::Kind::MustAlias:
+  case AliasResult::Kind::PartialAlias:
+    return true;
+  case AliasResult::Kind::NoAlias:
+    return false;
+  default:
+    llvm_unreachable("Unexpected alias");
   }
 }
 
@@ -1544,8 +1495,311 @@ class HighRegisterPressureDetector {
   }
 };
 
+/// Add loop-carried chain dependencies. This class handles the same type of
+/// dependencies added by `ScheduleDAGInstrs::buildSchedGraph`, but takes into
+/// account dependencies across iterations.
+class LoopCarriedOrderDepsTracker {
+  // Type of instruction that is relevant to order-dependencies
+  enum class InstrTag {
+    // Instruction related to global memory objects. There are order
+    // dependencies between instructions that may load or store or raise
+    // floating-point exception before and after this one.
+    GlobalMemoryObject = 0,
+
+    // Instruction that may load or store memory, but does not form a global
+    // barrier.
+    LoadOrStore = 1,
+
+    // Instruction that does not match the above, but may raise floating-point
+    // exceptions.
+    FPExceptions = 2,
+  };
+
+  struct TaggedSUnit : PointerIntPair<SUnit *, 2> {
+    TaggedSUnit(SUnit *SU, InstrTag Tag)
+        : PointerIntPair<SUnit *, 2>(SU, unsigned(Tag)) {}
+
+    InstrTag getTag() const { return InstrTag(getInt()); }
+  };
+
+  using SUsType = SmallVector<SUnit *, 4>;
+  using Value2SUs = MapVector<const Value *, SUsType>;
+
+  // Retains loads and stores classified by the underlying objects.
+  struct LoadStoreChunk {
+    Value2SUs Loads, Stores;
+    SUsType UnknownLoads, UnknownStores;
+  };
+
+  SwingSchedulerDAG *DAG;
+  std::unique_ptr<BatchAAResults> BAA;
+  const Value *UnknownValue;
+  std::vector<SUnit> &SUnits;
+
+  // The size of SUnits, for convenience.
+  const unsigned N;
+
+  // Adjacency matrix consisting of order dependencies of the original DAG.
+  std::vector<BitVector> AdjMatrix;
+
+  // Loop-carried Edges.
+  std::vector<BitVector> LoopCarried;
+
+  // Instructions related to chain dependencies. They are one of the following.
+  //
+  //   1. Global memory object.
+  //   2. Load, but not a global memory object, not invariant, or may load trap
+  //      value.
+  //   3. Store, but not global memory object.
+  //   4. None of them, but may raise floating-point exceptions.
+  //
+  // This is used when analyzing loop-carried dependencies that access global
+  // barrier instructions.
+  std::vector<TaggedSUnit> TaggedSUnits;
+
+public:
+  LoopCarriedOrderDepsTracker(SwingSchedulerDAG *SSD, AAResults *AA)
+      : DAG(SSD), BAA(nullptr), SUnits(DAG->SUnits), N(SUnits.size()),
+        AdjMatrix(N, BitVector(N)), LoopCarried(N, BitVector(N)) {
+    UnknownValue =
+        UndefValue::get(Type::getVoidTy(DAG->MF.getFunction().getContext()));
+    if (AA) {
+      BAA = std::make_unique<BatchAAResults>(*AA);
+      BAA->enableCrossIterationMode();
+    }
+    initAdjMatrix();
+  }
+
+  void computeDependencies() {
+    // Traverse all instructions and extract only what we are targeting.
+    for (auto &SU : SUnits) {
+      auto Tagged = checkInstrType(&SU);
+
+      // This instruction has no loop-carried order-dependencies.
+      if (!Tagged)
+        continue;
+
+      TaggedSUnits.push_back(*Tagged);
+    }
+
+    addLoopCarriedDependencies();
+
+    // Finalize the results.
+    for (int I = 0; I != int(N); I++) {
+      // If the dependence between two instructions already exists in the
+      // original DAG, then loop-carried dependence of the same instructions is
+      // unnecessary because the original one expresses stricter
+      // constraint than loop-carried one.
+      LoopCarried[I].reset(AdjMatrix[I]);
+
+      // Self-loops are noisy.
+      LoopCarried[I].reset(I);
+    }
+  }
+
+  const BitVector &getLoopCarried(unsigned Idx) const {
+    return LoopCarried[Idx];
+  }
+
+private:
+  // Calculate reachability induced by the adjacency matrix. The original graph
+  // is DAG, so we can compute them from bottom to top.
+  void initAdjMatrix() {
+    for (int RI = 0; RI != int(N); RI++) {
+      int I = SUnits.size() - (RI + 1);
+      f...
[truncated]

@kasuga-fj
Copy link
Contributor Author

I have modified some existing tests that depend on scheduling results more than necessary. But I now feel that these changes should be committed separately before this PR. I will split those changes.

kasuga-fj added a commit to kasuga-fj/llvm-project that referenced this pull request Jan 16, 2025
There are tests that are more sensitive to scheduling results than
necessary, for example, a test that is intended to verify the
correctness of the dependency analysis but validates the final
scheduling results. These tests can be affected by unrelated changes.
This patch fixes them to make them robust against such changes. This
patch is a prelude to llvm#121907.
kasuga-fj added a commit to kasuga-fj/llvm-project that referenced this pull request Jan 16, 2025
Currently, some tests are too sensitive to changes in MachinePipeliner.
Some of them are more sensitive to scheduling results than necessary;
others assume the current dependency analysis, which is actually
incorrect. As an example of the former, there is a test that is intended
to verify the correctness of the dependency analysis, but validates the
final scheduling results. These tests can be affected by unrelated
changes in MachinePipeliner. This patch fixes them to make them robust
against such changes.

This patch is a prelude to llvm#121907.
In the current MachinePipeliner, several loop-carried edges are missed.
This can result in generating invalid code. At least the following
loop-carried dependencies can be missed.

  - Memory dependencies from top to bottom.
    - Example:
      ```
        for (int i=1; i<n; i++) {
          a[i] = ...;
          a[i-1] = ...;
        }
      ```
  - Store to store dependencies.
  - Store to load dependencies.
  - Output (write-after-write) dependencies.
  - Use of alias analysis results that are valid only in the single
    iteration.
    - Example:
      ```
      void f(double * restrict a, double * restrict b);
      ...
      for (int i=0; i<n; i++)
        f(ptr0, ptr1);  // will be inlined
      ```

This patch adds these dependencies to fix the correctness issues.

In addition, the current analysis can add excessive dependencies because
loop-carried memory dependencies from bottom to top are expressed by
forward-direction (i.e., top-to-bottom) edges. This patch also removes
such dependencies.
@kasuga-fj kasuga-fj force-pushed the machinepipeliner-fix-loop-carried-deps branch from 62423c1 to 996c869 on February 20, 2025 07:47
Copy link
Contributor

@iajbar iajbar left a comment


This patch causes a performance regression. I am working on reducing a testcase.

@kasuga-fj
Copy link
Contributor Author

I see, thanks in advance.

@iajbar
Copy link
Contributor

iajbar commented Feb 25, 2025

@kasuga-fj, please find the testcase.ll below. With this patch, only one loop is pipelined. Thank you.
clang -O2 -mhvx -mv68 -S -Rpass=pipeliner reduced.ll -o reduced.s

target triple = "hexagon"

define void @IntegratePerRow(i32 %width, ptr %0, ptr %1) {
entry:
  br label %for.body

for.body: ; preds = %for.body, %entry
  %sIntg3.036 = phi <32 x i32> [ zeroinitializer, %entry ], [ %4, %for.body ]
  %niter = phi i32 [ 0, %entry ], [ %niter.next.1, %for.body ]
  %2 = tail call <32 x i32> @llvm.hexagon.V6.vrdelta.128B(<32 x i32> %sIntg3.036, <32 x i32> zeroinitializer)
  %3 = tail call <32 x i32> @llvm.hexagon.V6.vaddw.128B(<32 x i32> %2, <32 x i32> zeroinitializer)
  %4 = tail call <32 x i32> @llvm.hexagon.V6.vaddw.128B(<32 x i32> %3, <32 x i32> zeroinitializer)
  store <32 x i32> %4, ptr %0, align 128
  store <32 x i32> zeroinitializer, ptr %1, align 128
  %niter.next.1 = add i32 %niter, 1
  %niter.ncmp.1.not = icmp eq i32 %niter, %width
  br i1 %niter.ncmp.1.not, label %for.end.loopexit.unr-lcssa, label %for.body

for.end.loopexit.unr-lcssa: ; preds = %for.body
  ret void
}

; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare <64 x i32> @llvm.hexagon.V6.vshuffvdd.128B(<32 x i32>, <32 x i32>, i32) #0

; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare <32 x i32> @llvm.hexagon.V6.vrdelta.128B(<32 x i32>, <32 x i32>) #0

; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare <32 x i32> @llvm.hexagon.V6.hi.128B(<64 x i32>) #0

; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare <32 x i32> @llvm.hexagon.V6.vaddw.128B(<32 x i32>, <32 x i32>) #0

define void @IntegrateImage(ptr %src, i32 %srcWidth, ptr %dst) {
entry:
  br label %for.body.i

for.body.i: ; preds = %for.body.i, %entry
  %pIn.038.i = phi ptr [ null, %entry ], [ %incdec.ptr.i.1, %for.body.i ]
  %niter = phi i32 [ 0, %entry ], [ %niter.next.1, %for.body.i ]
  %0 = load <32 x i32>, ptr %pIn.038.i, align 128, !noalias !0
  %1 = tail call <64 x i32> @llvm.hexagon.V6.vshuffvdd.128B(<32 x i32> zeroinitializer, <32 x i32> %0, i32 0)
  %2 = tail call <32 x i32> @llvm.hexagon.V6.hi.128B(<64 x i32> %1)
  store <32 x i32> zeroinitializer, ptr %src, align 128, !alias.scope !0
  store <32 x i32> %2, ptr %dst, align 128, !alias.scope !0
  %incdec.ptr.i.1 = getelementptr i8, ptr %pIn.038.i, i32 256
  %niter.next.1 = add i32 %niter, 1
  %niter.ncmp.1.not = icmp eq i32 %niter, %srcWidth
  br i1 %niter.ncmp.1.not, label %IntegratePerRow.exit.loopexit.unr-lcssa, label %for.body.i

IntegratePerRow.exit.loopexit.unr-lcssa: ; preds = %for.body.i
  ret void
}
uselistorder ptr @llvm.hexagon.V6.vaddw.128B, { 1, 0 }

attributes #0 = { nocallback nofree nosync nounwind willreturn memory(none) }
!0 = !{!1}
!1 = distinct !{!1, !2, !"IntegratePerRow: %dst"}
!2 = distinct !{!2, !"IntegratePerRow"}

@kasuga-fj
Copy link
Contributor Author

Thank you for sharing. Let me ask a question: is this the only case that causes a regression, or are there other cases? This patch addresses correctness issues, and I believe that is generally more important than performance. However, if there are many performance regressions, one option would be to add a flag to enable/disable the changes in this patch (a possible shape is sketched below). Does such a flag make sense for the Hexagon backend?
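
Purely as an illustration (not part of this patch, and the flag name below is hypothetical), such an escape hatch could follow the existing cl::opt pattern in MachinePipeliner.cpp, assuming the surrounding includes and namespace of that file:

// Hypothetical flag, defaulting to the new, correct behavior.
static cl::opt<bool> SwpStrictLoopCarriedDeps(
    "pipeliner-strict-loop-carried-deps", cl::Hidden, cl::init(true),
    cl::desc("Model the additional loop-carried dependencies added by this "
             "patch (disabling restores the previous, potentially unsafe "
             "behavior)"));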

@aankit-ca
Copy link
Contributor

Thank you @kasuga-fj for the patch. There are other regressions as well. It would be ideal to avoid introducing a new flag for this patch. In my opinion, a correctness patch should not cause regressions in tests that were previously passing without correctness issues. The best approach would be to address the correctness issue without causing any regressions.

@kasuga-fj
Copy link
Contributor Author

@aankit-ca Thank you for your comment.

There are other regressions as well.

I see, thank you.

In my opinion, a correctness patch should not cause regressions in tests that were previously passing without correctness issues. The best approach would be to address the correctness issue without causing any regressions.

I agree with you, but this is not such a simple case. The pipeliner roughly consists of the following two phases:

  • Analyze dependencies and build a dependency graph.
  • Search for a schedule based on the graph.

The issue this patch addresses is that some dependencies are missed when the dependency graph is built. This can "potentially" cause correctness issues, but not necessarily: if a found schedule happens to satisfy the missed dependencies, it works fine. Adding these currently missing dependencies when creating the graph is one way to fix this "potential" correctness issue, but it affects the subsequent schedule search heuristics, which is the problem I'm now facing. Improving those heuristics may resolve the regressions, but that is difficult without knowing their details.

Alternatively, these results may imply that pruning the currently missing dependencies is a good heuristic: that is, in many cases, using the pruned dependency graph and validating all dependencies at the end tends to produce a better schedule (a rough sketch of that flow follows below).
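
To make that alternative concrete, here is a minimal, self-contained sketch (all names are hypothetical stand-ins, not the actual MachinePipeliner API) of the two-phase flow with late validation of the pruned loop-carried dependencies:

#include <optional>

// Hypothetical, simplified stand-ins for the real data structures.
struct DepGraph { /* nodes and (possibly pruned) dependence edges */ };
struct Schedule { /* instruction -> (cycle, stage) assignment */ };

// Phase 1: build the dependence graph, possibly without some loop-carried
// edges (pruned to keep the search heuristics effective).
DepGraph buildDependenceGraph() { return {}; }

// Phase 2: heuristic search for a schedule over that graph.
std::optional<Schedule> searchSchedule(const DepGraph &) { return Schedule{}; }

// Late validation: check the pruned loop-carried dependencies against the
// candidate schedule instead of encoding them in the graph up front.
bool satisfiesAllLoopCarriedDeps(const Schedule &) { return true; }

std::optional<Schedule> pipelineLoop() {
  DepGraph G = buildDependenceGraph();
  std::optional<Schedule> S = searchSchedule(G);
  if (S && !satisfiesAllLoopCarriedDeps(*S))
    return std::nullopt; // reject rather than emit invalid code
  return S;
}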

Anyway, I don't know much about the details of the Hexagon benchmarks, so there is nothing more I can say. I would appreciate your input on this.

@aankit-ca
Copy link
Contributor

aankit-ca commented Mar 4, 2025

Let's review a few cases to address the performance issues. Can you confirm if the testcase shared by @iajbar introduces a new dependency that prevents one of the loops from being pipelined?

Comment on lines +817 to +832
static std::optional<MemoryLocation>
getMemoryLocationForAA(const MachineInstr *MI) {
  const MachineMemOperand *MMO = *MI->memoperands_begin();
  const Value *Val = MMO->getValue();
  if (!Val)
    return std::nullopt;
  auto MemLoc = MemoryLocation::getBeforeOrAfter(Val, MMO->getAAInfo());

  // Peel off noalias information from `AATags` because it might be valid only
  // in a single iteration.
  // FIXME: This is too conservative. Checking
  // `llvm.experimental.noalias.scope.decl` intrinsics in the original LLVM IR
  // could make this more accurate.
  MemLoc.AATags.NoAlias = nullptr;
  return MemLoc;
}
Copy link
Contributor Author

@kasuga-fj kasuga-fj Mar 4, 2025


Let's review a few cases to address the performance issues.

Thanks.

Can you confirm if the testcase shared by @iajbar introduces a new dependency that prevents one of the loops from being pipelined?

Yes. The second function (IntegrateImage) was not pipelined with this patch. The loop body is shown below:

SU(0): %9:intregs = PHI %7:intregs, %bb.6, %11:intregs, %bb.4
SU(1): %52:hvxvr, %11:intregs = V6_vL32b_pi %9:intregs(tied-def 1), 256 :: (load (s1024) from %ir.pIn.038.i.epil, !noalias !2)
SU(2): %55:hvxwr = V6_vshuffvdd %53:hvxvr, %52:hvxvr, %54:intregslow8
SU(3): V6_vS32b_ai %12:intregs, 0, %53:hvxvr :: (store (s1024) into %ir.src, !alias.scope !2)
SU(4): V6_vS32b_ai %14:intregs, 0, %55.vsub_hi:hvxwr :: (store (s1024) into %ir.dst, !alias.scope !2)
ExitSU: ENDLOOP0 %bb.4, implicit-def $pc, implicit-def $lc0, implicit $sa0, implicit $lc0

In this function, new loop-carried dependencies are added between all load/store pairs (SU(1), SU(3), and SU(4), excluding self-loop edges). This is because the noalias metadata is removed by this patch, as shown in the diff. The reason I did this is that the alias metadata may be valid only in a single iteration, not across all iterations, if there is a @llvm.experimental.noalias.scope.decl intrinsic call inside the loop. Please also refer to LoopAccessAnalysis, which handles the same situation:

  MemoryLocation adjustLoc(MemoryLocation Loc) const {
    // The accessed location varies within the loop, but remains within the
    // underlying object.
    Loc.Size = LocationSize::beforeOrAfterPointer();
    Loc.AATags.Scope = adjustAliasScopeList(Loc.AATags.Scope);
    Loc.AATags.NoAlias = adjustAliasScopeList(Loc.AATags.NoAlias);
    return Loc;
  }

  /// Drop alias scopes that are only valid within a single loop iteration.
  MDNode *adjustAliasScopeList(MDNode *ScopeList) const {
    if (!ScopeList)
      return nullptr;

    // For the sake of simplicity, drop the whole scope list if any scope is
    // iteration-local.
    if (any_of(ScopeList->operands(), [&](Metadata *Scope) {
          return LoopAliasScopes.contains(cast<MDNode>(Scope));
        }))
      return nullptr;

    return ScopeList;
  }

Unfortunately, calls to this intrinsic are eliminated when LLVM IR is lowered to Machine IR. This means that we cannot make precise decisions about whether the alias information is valid across iterations, so I've removed all the metadata here. However, this is too conservative, and I think we can improve this analysis, e.g., if there are no @llvm.experimental.noalias.scope.decl calls in the original IR, then stop removing the metadata (a rough sketch is below). (It might be better to have separate patches for changes related to noalias.)
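
For illustration only, a rough sketch of that idea, assuming the original IR function is still reachable via MachineInstr::getMF()->getFunction() (the helper name is hypothetical): if the function contains no llvm.experimental.noalias.scope.decl call, no alias scope can be iteration-local, so the noalias metadata could be kept.

#include "llvm/IR/Function.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/Intrinsics.h"

using namespace llvm;

// Hypothetical helper: returns true if the function contains any
// llvm.experimental.noalias.scope.decl call, i.e., some alias scopes might be
// valid only within a single loop iteration.
static bool hasNoAliasScopeDecl(const Function &F) {
  for (const BasicBlock &BB : F)
    for (const Instruction &I : BB)
      if (const auto *II = dyn_cast<IntrinsicInst>(&I))
        if (II->getIntrinsicID() == Intrinsic::experimental_noalias_scope_decl)
          return true;
  return false;
}

// getMemoryLocationForAA would then drop the metadata only when needed:
//   if (hasNoAliasScopeDecl(MI->getMF()->getFunction()))
//     MemLoc.AATags.NoAlias = nullptr;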

@kasuga-fj
Copy link
Contributor Author

Hi @aankit-ca. Let me ask you one question. If there is another patch, would it be possible for you to evaluate its impact on the benchmark as you did before? If so, I would like to try a different approach that (I believe) would cause fewer regressions than this patch.
Thanks.

@aankit-ca
Copy link
Contributor

Hi @aankit-ca. Let me ask you one question. If there is another patch, would it be possible for you to evaluate its impact on the benchmark as you did before? If so, I would like to try a different approach that (I believe) would cause fewer regressions than this patch. Thanks.

Yes @kasuga-fj, we can run the benchmarks again on other patches too.

@kasuga-fj
Copy link
Contributor Author

OK, I'm working on the patch and will put it up for review as soon as it's ready. Thank you.
