-
Notifications
You must be signed in to change notification settings - Fork 13.5k
[MachinePipeliner] Fix loop-carried dependencies analysis #121907
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[MachinePipeliner] Fix loop-carried dependencies analysis #121907
Conversation
@llvm/pr-subscribers-backend-powerpc Author: Ryotaro Kasuga (kasuga-fj) ChangesIn the current MachinePipeliner, several loop-carried edges are missed. It can result generating invalid code. At least following loop-carried dependencies can be missed.
This patch added these dependencies to fix correctness issues. In addition, the current analysis may add excessive dependencies because loop-carried memory dependencies from bottom to top by are expressed by using dependencies in the forward direction (i.e., from top to bottom edge). This patch also removes such dependencies. I tested performance changes with and without this patch. I used llvm-test-suite as test cases and checked the following:
As far as I have tested, there has been no significant performance impact. It's worth noting, however, that a huge performance degradation can occur when
This is because loop-carried edges are added to each pair of unrolled memory instructions. For example, suppose a loop contains a store to a[i] and it's unrolled 4 times. In this case, the loop has store to a[i], a[i+1], a[i+2], and a[i+3]. If Patch is 84.58 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/121907.diff 18 Files Affected:
diff --git a/llvm/include/llvm/CodeGen/MachinePipeliner.h b/llvm/include/llvm/CodeGen/MachinePipeliner.h
index 8e47d0cead7571..810a5d9f6dff00 100644
--- a/llvm/include/llvm/CodeGen/MachinePipeliner.h
+++ b/llvm/include/llvm/CodeGen/MachinePipeliner.h
@@ -42,6 +42,7 @@
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SetVector.h"
+#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/CodeGen/DFAPacketizer.h"
#include "llvm/CodeGen/MachineDominators.h"
#include "llvm/CodeGen/MachineOptimizationRemarkEmitter.h"
@@ -190,6 +191,33 @@ class SwingSchedulerDDGEdge {
bool ignoreDependence(bool IgnoreAnti) const;
};
+struct LoopCarriedEdges {
+ using OutputDep = SmallDenseMap<Register, SmallSetVector<SUnit *, 4>>;
+ using OrderDep = SmallSetVector<SUnit *, 8>;
+ using OutputDepsType = DenseMap<SUnit *, OutputDep>;
+ using OrderDepsType = DenseMap<SUnit *, OrderDep>;
+
+ OutputDepsType OutputDeps;
+ OrderDepsType OrderDeps;
+
+ const OutputDep *getOutputDepOrNull(SUnit *Key) const {
+ auto Ite = OutputDeps.find(Key);
+ if (Ite == OutputDeps.end())
+ return nullptr;
+ return &Ite->second;
+ }
+
+ const OrderDep *getOrderDepOrNull(SUnit *Key) const {
+ auto Ite = OrderDeps.find(Key);
+ if (Ite == OrderDeps.end())
+ return nullptr;
+ return &Ite->second;
+ }
+
+ void dump(SUnit *SU, const TargetRegisterInfo *TRI,
+ const MachineRegisterInfo *MRI) const;
+};
+
/// Represents dependencies between instructions. This class is a wrapper of
/// `SUnits` and its dependencies to manipulate back-edges in a natural way.
/// Currently it only supports back-edges via PHI, which are expressed as
@@ -217,8 +245,12 @@ class SwingSchedulerDDG {
SwingSchedulerDDGEdges &getEdges(const SUnit *SU);
const SwingSchedulerDDGEdges &getEdges(const SUnit *SU) const;
+ void addLoopCarriedEdges(std::vector<SUnit> &SUnits,
+ const LoopCarriedEdges &LCE);
+
public:
- SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU);
+ SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU,
+ const LoopCarriedEdges &LCE);
const EdgesType &getInEdges(const SUnit *SU) const;
@@ -285,22 +317,14 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
BitVector Blocked;
SmallVector<SmallPtrSet<SUnit *, 4>, 10> B;
SmallVector<SmallVector<int, 4>, 16> AdjK;
- // Node to Index from ScheduleDAGTopologicalSort
- std::vector<int> *Node2Idx;
+ SmallVector<BitVector, 16> LoopCarried;
unsigned NumPaths = 0u;
- static unsigned MaxPaths;
public:
- Circuits(std::vector<SUnit> &SUs, ScheduleDAGTopologicalSort &Topo)
- : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {
- Node2Idx = new std::vector<int>(SUs.size());
- unsigned Idx = 0;
- for (const auto &NodeNum : Topo)
- Node2Idx->at(NodeNum) = Idx++;
- }
+ Circuits(std::vector<SUnit> &SUs)
+ : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {}
Circuits &operator=(const Circuits &other) = delete;
Circuits(const Circuits &other) = delete;
- ~Circuits() { delete Node2Idx; }
/// Reset the data structures used in the circuit algorithm.
void reset() {
@@ -310,9 +334,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
NumPaths = 0;
}
- void createAdjacencyStructure(SwingSchedulerDAG *DAG);
+ void createAdjacencyStructure(const SwingSchedulerDDG *DDG);
bool circuit(int V, int S, NodeSetType &NodeSets,
- const SwingSchedulerDAG *DAG, bool HasBackedge = false);
+ const SwingSchedulerDDG *DDG, bool HasLoopCarriedEdge = false);
void unblock(int U);
};
@@ -366,7 +390,8 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
return ScheduleInfo[Node->NodeNum].ZeroLatencyHeight;
}
- bool isLoopCarriedDep(const SwingSchedulerDDGEdge &Edge) const;
+ bool hasLoopCarriedMemDep(const MachineInstr *Src, const MachineInstr *Dst,
+ BatchAAResults *BAA) const;
void applyInstrChange(MachineInstr *MI, SMSchedule &Schedule);
@@ -391,7 +416,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
const SwingSchedulerDDG *getDDG() const { return DDG.get(); }
private:
- void addLoopCarriedDependences(AAResults *AA);
+ LoopCarriedEdges addLoopCarriedDependences(AAResults *AA);
+ AliasResult::Kind checkLoopCarriedMemDep(const MachineInstr *Src,
+ const MachineInstr *Dst) const;
void updatePhiDependences();
void changeDependences();
unsigned calculateResMII();
@@ -409,7 +436,7 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
void computeNodeOrder(NodeSetType &NodeSets);
void checkValidNodeOrder(const NodeSetType &Circuits) const;
bool schedulePipeline(SMSchedule &Schedule);
- bool computeDelta(MachineInstr &MI, unsigned &Delta) const;
+ bool computeDelta(const MachineInstr &MI, unsigned &Delta) const;
MachineInstr *findDefInLoop(Register Reg);
bool canUseLastOffsetValue(MachineInstr *MI, unsigned &BasePos,
unsigned &OffsetPos, unsigned &NewBase,
@@ -437,7 +464,7 @@ class NodeSet {
using iterator = SetVector<SUnit *>::const_iterator;
NodeSet() = default;
- NodeSet(iterator S, iterator E, const SwingSchedulerDAG *DAG)
+ NodeSet(iterator S, iterator E, const SwingSchedulerDDG *DDG)
: Nodes(S, E), HasRecurrence(true) {
// Calculate the latency of this node set.
// Example to demonstrate the calculation:
@@ -453,7 +480,6 @@ class NodeSet {
//
// Hold a map from each SUnit in the circle to the maximum distance from the
// source node by only considering the nodes.
- const SwingSchedulerDDG *DDG = DAG->getDDG();
DenseMap<SUnit *, unsigned> SUnitToDistance;
for (auto *Node : Nodes)
SUnitToDistance[Node] = 0;
@@ -470,22 +496,6 @@ class NodeSet {
}
}
}
- // Handle a back-edge in loop carried dependencies
- SUnit *FirstNode = Nodes[0];
- SUnit *LastNode = Nodes[Nodes.size() - 1];
-
- for (auto &PI : DDG->getInEdges(LastNode)) {
- // If we have an order dep that is potentially loop carried then a
- // back-edge exists between the last node and the first node that isn't
- // modeled in the DAG. Handle it manually by adding 1 to the distance of
- // the last node.
- if (PI.getSrc() != FirstNode || !PI.isOrderDep() ||
- !DAG->isLoopCarriedDep(PI))
- continue;
- SUnitToDistance[FirstNode] =
- std::max(SUnitToDistance[FirstNode], SUnitToDistance[LastNode] + 1);
- }
-
// The latency is the distance from the source node to itself.
Latency = SUnitToDistance[Nodes.front()];
}
diff --git a/llvm/lib/CodeGen/MachinePipeliner.cpp b/llvm/lib/CodeGen/MachinePipeliner.cpp
index acd42aa497c6fe..f0731ccdaa6532 100644
--- a/llvm/lib/CodeGen/MachinePipeliner.cpp
+++ b/llvm/lib/CodeGen/MachinePipeliner.cpp
@@ -194,6 +194,10 @@ static cl::opt<bool>
MVECodeGen("pipeliner-mve-cg", cl::Hidden, cl::init(false),
cl::desc("Use the MVE code generator for software pipelining"));
+static cl::opt<unsigned> MaxCircuitPaths(
+ "pipeliner-max-circuit-paths", cl::Hidden, cl::init(5),
+ cl::desc("Maximum number of circles to be detected for each vertex"));
+
namespace llvm {
// A command line option to enable the CopyToPhi DAG mutation.
@@ -221,7 +225,6 @@ cl::opt<WindowSchedulingFlag> WindowSchedulingOption(
} // end namespace llvm
-unsigned SwingSchedulerDAG::Circuits::MaxPaths = 5;
char MachinePipeliner::ID = 0;
#ifndef NDEBUG
int MachinePipeliner::NumTries = 0;
@@ -562,14 +565,20 @@ void SwingSchedulerDAG::setMAX_II() {
void SwingSchedulerDAG::schedule() {
AliasAnalysis *AA = &Pass.getAnalysis<AAResultsWrapperPass>().getAAResults();
buildSchedGraph(AA);
- addLoopCarriedDependences(AA);
updatePhiDependences();
Topo.InitDAGTopologicalSorting();
changeDependences();
postProcessDAG();
- DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU);
LLVM_DEBUG(dump());
+ auto LCE = addLoopCarriedDependences(AA);
+ LLVM_DEBUG({
+ dbgs() << "Loop Carried Edges:\n";
+ for (SUnit &SU : SUnits)
+ LCE.dump(&SU, TRI, &MRI);
+ });
+ DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU, LCE);
+
NodeSetType NodeSets;
findCircuits(NodeSets);
NodeSetType Circuits = NodeSets;
@@ -779,42 +788,18 @@ static unsigned getLoopPhiReg(const MachineInstr &Phi,
return 0;
}
-/// Return true if SUb can be reached from SUa following the chain edges.
-static bool isSuccOrder(SUnit *SUa, SUnit *SUb) {
- SmallPtrSet<SUnit *, 8> Visited;
- SmallVector<SUnit *, 8> Worklist;
- Worklist.push_back(SUa);
- while (!Worklist.empty()) {
- const SUnit *SU = Worklist.pop_back_val();
- for (const auto &SI : SU->Succs) {
- SUnit *SuccSU = SI.getSUnit();
- if (SI.getKind() == SDep::Order) {
- if (Visited.count(SuccSU))
- continue;
- if (SuccSU == SUb)
- return true;
- Worklist.push_back(SuccSU);
- Visited.insert(SuccSU);
- }
- }
- }
- return false;
-}
-
/// Return true if the instruction causes a chain between memory
/// references before and after it.
-static bool isDependenceBarrier(MachineInstr &MI) {
- return MI.isCall() || MI.mayRaiseFPException() ||
- MI.hasUnmodeledSideEffects() ||
- (MI.hasOrderedMemoryRef() &&
- (!MI.mayLoad() || !MI.isDereferenceableInvariantLoad()));
+static bool isGlobalMemoryObject(MachineInstr &MI) {
+ return MI.isCall() || MI.hasUnmodeledSideEffects() ||
+ (MI.hasOrderedMemoryRef() && !MI.isDereferenceableInvariantLoad());
}
/// Return the underlying objects for the memory references of an instruction.
/// This function calls the code in ValueTracking, but first checks that the
/// instruction has a memory operand.
-static void getUnderlyingObjects(const MachineInstr *MI,
- SmallVectorImpl<const Value *> &Objs) {
+static void getUnderlyingObjectsForInstr(const MachineInstr *MI,
+ SmallVectorImpl<const Value *> &Objs) {
if (!MI->hasOneMemOperand())
return;
MachineMemOperand *MM = *MI->memoperands_begin();
@@ -829,97 +814,63 @@ static void getUnderlyingObjects(const MachineInstr *MI,
}
}
-/// Add a chain edge between a load and store if the store can be an
-/// alias of the load on a subsequent iteration, i.e., a loop carried
-/// dependence. This code is very similar to the code in ScheduleDAGInstrs
-/// but that code doesn't create loop carried dependences.
-void SwingSchedulerDAG::addLoopCarriedDependences(AliasAnalysis *AA) {
- MapVector<const Value *, SmallVector<SUnit *, 4>> PendingLoads;
- Value *UnknownValue =
- UndefValue::get(Type::getVoidTy(MF.getFunction().getContext()));
- for (auto &SU : SUnits) {
- MachineInstr &MI = *SU.getInstr();
- if (isDependenceBarrier(MI))
- PendingLoads.clear();
- else if (MI.mayLoad()) {
- SmallVector<const Value *, 4> Objs;
- ::getUnderlyingObjects(&MI, Objs);
- if (Objs.empty())
- Objs.push_back(UnknownValue);
- for (const auto *V : Objs) {
- SmallVector<SUnit *, 4> &SUs = PendingLoads[V];
- SUs.push_back(&SU);
- }
- } else if (MI.mayStore()) {
- SmallVector<const Value *, 4> Objs;
- ::getUnderlyingObjects(&MI, Objs);
- if (Objs.empty())
- Objs.push_back(UnknownValue);
- for (const auto *V : Objs) {
- MapVector<const Value *, SmallVector<SUnit *, 4>>::iterator I =
- PendingLoads.find(V);
- if (I == PendingLoads.end())
- continue;
- for (auto *Load : I->second) {
- if (isSuccOrder(Load, &SU))
- continue;
- MachineInstr &LdMI = *Load->getInstr();
- // First, perform the cheaper check that compares the base register.
- // If they are the same and the load offset is less than the store
- // offset, then mark the dependence as loop carried potentially.
- const MachineOperand *BaseOp1, *BaseOp2;
- int64_t Offset1, Offset2;
- bool Offset1IsScalable, Offset2IsScalable;
- if (TII->getMemOperandWithOffset(LdMI, BaseOp1, Offset1,
- Offset1IsScalable, TRI) &&
- TII->getMemOperandWithOffset(MI, BaseOp2, Offset2,
- Offset2IsScalable, TRI)) {
- if (BaseOp1->isIdenticalTo(*BaseOp2) &&
- Offset1IsScalable == Offset2IsScalable &&
- (int)Offset1 < (int)Offset2) {
- assert(TII->areMemAccessesTriviallyDisjoint(LdMI, MI) &&
- "What happened to the chain edge?");
- SDep Dep(Load, SDep::Barrier);
- Dep.setLatency(1);
- SU.addPred(Dep);
- continue;
- }
- }
- // Second, the more expensive check that uses alias analysis on the
- // base registers. If they alias, and the load offset is less than
- // the store offset, the mark the dependence as loop carried.
- if (!AA) {
- SDep Dep(Load, SDep::Barrier);
- Dep.setLatency(1);
- SU.addPred(Dep);
- continue;
- }
- MachineMemOperand *MMO1 = *LdMI.memoperands_begin();
- MachineMemOperand *MMO2 = *MI.memoperands_begin();
- if (!MMO1->getValue() || !MMO2->getValue()) {
- SDep Dep(Load, SDep::Barrier);
- Dep.setLatency(1);
- SU.addPred(Dep);
- continue;
- }
- if (MMO1->getValue() == MMO2->getValue() &&
- MMO1->getOffset() <= MMO2->getOffset()) {
- SDep Dep(Load, SDep::Barrier);
- Dep.setLatency(1);
- SU.addPred(Dep);
- continue;
- }
- if (!AA->isNoAlias(
- MemoryLocation::getAfter(MMO1->getValue(), MMO1->getAAInfo()),
- MemoryLocation::getAfter(MMO2->getValue(),
- MMO2->getAAInfo()))) {
- SDep Dep(Load, SDep::Barrier);
- Dep.setLatency(1);
- SU.addPred(Dep);
- }
- }
- }
- }
+static std::optional<MemoryLocation>
+getMemoryLocationForAA(const MachineInstr *MI) {
+ const MachineMemOperand *MMO = *MI->memoperands_begin();
+ const Value *Val = MMO->getValue();
+ if (!Val)
+ return std::nullopt;
+ auto MemLoc = MemoryLocation::getBeforeOrAfter(Val, MMO->getAAInfo());
+
+ // Peel off noalias information from `AATags` because it might be valid only
+ // in single iteration.
+ // FIXME: This is too conservative. Checking
+ // `llvm.experimental.noalias.scope.decl` instrinsics in the original LLVM IR
+ // can perform more accuurately.
+ MemLoc.AATags.NoAlias = nullptr;
+ return MemLoc;
+}
+
+/// Return true for an memory dependence that is loop carried
+/// potentially. A dependence is loop carried if the destination defines a value
+/// that may be used or defined by the source in a subsequent iteration.
+bool SwingSchedulerDAG::hasLoopCarriedMemDep(const MachineInstr *Src,
+ const MachineInstr *Dst,
+ BatchAAResults *BAA) const {
+ if (!SwpPruneLoopCarried)
+ return true;
+
+ // First, check the dependence by comparing base register, offset, and
+ // step value of the loop.
+ switch (checkLoopCarriedMemDep(Src, Dst)) {
+ case AliasResult::Kind::MustAlias:
+ return true;
+ case AliasResult::Kind::NoAlias:
+ return false;
+ case AliasResult::Kind::MayAlias:
+ break;
+ default:
+ llvm_unreachable("Unexpected alias");
+ }
+
+ // If we cannot determine the dependence by previouse check, then
+ // check by using alias analysis.
+ if (!BAA)
+ return true;
+
+ const auto MemLoc1 = getMemoryLocationForAA(Src);
+ const auto MemLoc2 = getMemoryLocationForAA(Dst);
+ if (!MemLoc1.has_value() || !MemLoc2.has_value())
+ return true;
+ switch (BAA->alias(*MemLoc1, *MemLoc2)) {
+ case AliasResult::Kind::MayAlias:
+ case AliasResult::Kind::MustAlias:
+ case AliasResult::Kind::PartialAlias:
+ return true;
+ case AliasResult::Kind::NoAlias:
+ return false;
+ default:
+ llvm_unreachable("Unexpected alias");
}
}
@@ -1544,8 +1495,311 @@ class HighRegisterPressureDetector {
}
};
+/// Add loop-carried chain dependencies. This class handles the same type of
+/// dependencies added by `ScheduleDAGInstrs::buildSchedGraph`, but takes into
+/// account dependencies across iterations.
+class LoopCarriedOrderDepsTracker {
+ // Type of instruction that is relevant to order-dependencies
+ enum class InstrTag {
+ // Instruction related to global memory objects. There are order
+ // dependencies between instructions that may load or store or raise
+ // floating-point exception before and after this one.
+ GlobalMemoryObject = 0,
+
+ // Instruction that may load or store memory, but does not form a global
+ // barrier.
+ LoadOrStore = 1,
+
+ // Instruction that does not match above, but may raise floatin-point
+ // exceptions.
+ FPExceptions = 2,
+ };
+
+ struct TaggedSUnit : PointerIntPair<SUnit *, 2> {
+ TaggedSUnit(SUnit *SU, InstrTag Tag)
+ : PointerIntPair<SUnit *, 2>(SU, unsigned(Tag)) {}
+
+ InstrTag getTag() const { return InstrTag(getInt()); }
+ };
+
+ using SUsType = SmallVector<SUnit *, 4>;
+ using Value2SUs = MapVector<const Value *, SUsType>;
+
+ // Retains loads and stores classified by the underlying objects.
+ struct LoadStoreChunk {
+ Value2SUs Loads, Stores;
+ SUsType UnknownLoads, UnknownStores;
+ };
+
+ SwingSchedulerDAG *DAG;
+ std::unique_ptr<BatchAAResults> BAA;
+ const Value *UnknownValue;
+ std::vector<SUnit> &SUnits;
+
+ // The size of SUnits, for convenience.
+ const unsigned N;
+
+ // Adjacency matrix consisiting of order dependencies of the original DAG.
+ std::vector<BitVector> AdjMatrix;
+
+ // Loop-carried Edges.
+ std::vector<BitVector> LoopCarried;
+
+ // Instructions related to chain dependencies. They are one of the following.
+ //
+ // 1. Global memory object.
+ // 2. Load, but not a global memory object, not invariant, or may load trap
+ // value.
+ // 3. Store, but not global memory object.
+ // 4. None of them, but may raise floating-point exceptions.
+ //
+ // This is used when analyzing loop-carried dependencies that access global
+ // barrier instructions.
+ std::vector<TaggedSUnit> TaggedSUnits;
+
+public:
+ LoopCarriedOrderDepsTracker(SwingSchedulerDAG *SSD, AAResults *AA)
+ : DAG(SSD), BAA(nullptr), SUnits(DAG->SUnits), N(SUnits.size()),
+ AdjMatrix(N, BitVector(N)), LoopCarried(N, BitVector(N)) {
+ UnknownValue =
+ UndefValue::get(Type::getVoidTy(DAG->MF.getFunction().getContext()));
+ if (AA) {
+ BAA = std::make_unique<BatchAAResults>(*AA);
+ BAA->enableCrossIterationMode();
+ }
+ initAdjMatrix();
+ }
+
+ void computeDependencies() {
+ // Traverse all instructions and extract only what we are targetting.
+ for (auto &SU : SUnits) {
+ auto Tagged = checkInstrType(&SU);
+
+ // This instruction has no loop-carried order-dependencies.
+ if (!Tagged)
+ continue;
+
+ TaggedSUnits.push_back(*Tagged);
+ }
+
+ addLoopCarriedDependencies();
+
+ // Finalize the results.
+ for (int I = 0; I != int(N); I++) {
+ // If the dependence between two instructions already exists in the
+ // original DAG, then loop-carried dependence of the same instructions is
+ // unnecessary because the original one expresses stricter
+ // constraint than loop-carried one.
+ LoopCarried[I].reset(AdjMatrix[I]);
+
+ // Self-loops are noisy.
+ LoopCarried[I].reset(I);
+ }
+ }
+
+ const BitVector &getLoopCarried(unsigned Idx) const {
+ return LoopCarried[Idx];
+ }
+
+private:
+ // Calculate reachability induced by the adjacency matrix. The original graph
+ // is DAG, so we can compute them from bottom to top.
+ void initAdjMatrix() {
+ for (int RI = 0; RI != int(N); RI++) {
+ int I = SUnits.size() - (RI + 1);
+ f...
[truncated]
|
@llvm/pr-subscribers-backend-aarch64 Author: Ryotaro Kasuga (kasuga-fj) ChangesIn the current MachinePipeliner, several loop-carried edges are missed. It can result generating invalid code. At least following loop-carried dependencies can be missed.
This patch added these dependencies to fix correctness issues. In addition, the current analysis may add excessive dependencies because loop-carried memory dependencies from bottom to top by are expressed by using dependencies in the forward direction (i.e., from top to bottom edge). This patch also removes such dependencies. I tested performance changes with and without this patch. I used llvm-test-suite as test cases and checked the following:
As far as I have tested, there has been no significant performance impact. It's worth noting, however, that a huge performance degradation can occur when
This is because loop-carried edges are added to each pair of unrolled memory instructions. For example, suppose a loop contains a store to a[i] and it's unrolled 4 times. In this case, the loop has store to a[i], a[i+1], a[i+2], and a[i+3]. If Patch is 84.58 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/121907.diff 18 Files Affected:
diff --git a/llvm/include/llvm/CodeGen/MachinePipeliner.h b/llvm/include/llvm/CodeGen/MachinePipeliner.h
index 8e47d0cead7571..810a5d9f6dff00 100644
--- a/llvm/include/llvm/CodeGen/MachinePipeliner.h
+++ b/llvm/include/llvm/CodeGen/MachinePipeliner.h
@@ -42,6 +42,7 @@
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SetVector.h"
+#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/CodeGen/DFAPacketizer.h"
#include "llvm/CodeGen/MachineDominators.h"
#include "llvm/CodeGen/MachineOptimizationRemarkEmitter.h"
@@ -190,6 +191,33 @@ class SwingSchedulerDDGEdge {
bool ignoreDependence(bool IgnoreAnti) const;
};
+struct LoopCarriedEdges {
+ using OutputDep = SmallDenseMap<Register, SmallSetVector<SUnit *, 4>>;
+ using OrderDep = SmallSetVector<SUnit *, 8>;
+ using OutputDepsType = DenseMap<SUnit *, OutputDep>;
+ using OrderDepsType = DenseMap<SUnit *, OrderDep>;
+
+ OutputDepsType OutputDeps;
+ OrderDepsType OrderDeps;
+
+ const OutputDep *getOutputDepOrNull(SUnit *Key) const {
+ auto Ite = OutputDeps.find(Key);
+ if (Ite == OutputDeps.end())
+ return nullptr;
+ return &Ite->second;
+ }
+
+ const OrderDep *getOrderDepOrNull(SUnit *Key) const {
+ auto Ite = OrderDeps.find(Key);
+ if (Ite == OrderDeps.end())
+ return nullptr;
+ return &Ite->second;
+ }
+
+ void dump(SUnit *SU, const TargetRegisterInfo *TRI,
+ const MachineRegisterInfo *MRI) const;
+};
+
/// Represents dependencies between instructions. This class is a wrapper of
/// `SUnits` and its dependencies to manipulate back-edges in a natural way.
/// Currently it only supports back-edges via PHI, which are expressed as
@@ -217,8 +245,12 @@ class SwingSchedulerDDG {
SwingSchedulerDDGEdges &getEdges(const SUnit *SU);
const SwingSchedulerDDGEdges &getEdges(const SUnit *SU) const;
+ void addLoopCarriedEdges(std::vector<SUnit> &SUnits,
+ const LoopCarriedEdges &LCE);
+
public:
- SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU);
+ SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU,
+ const LoopCarriedEdges &LCE);
const EdgesType &getInEdges(const SUnit *SU) const;
@@ -285,22 +317,14 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
BitVector Blocked;
SmallVector<SmallPtrSet<SUnit *, 4>, 10> B;
SmallVector<SmallVector<int, 4>, 16> AdjK;
- // Node to Index from ScheduleDAGTopologicalSort
- std::vector<int> *Node2Idx;
+ SmallVector<BitVector, 16> LoopCarried;
unsigned NumPaths = 0u;
- static unsigned MaxPaths;
public:
- Circuits(std::vector<SUnit> &SUs, ScheduleDAGTopologicalSort &Topo)
- : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {
- Node2Idx = new std::vector<int>(SUs.size());
- unsigned Idx = 0;
- for (const auto &NodeNum : Topo)
- Node2Idx->at(NodeNum) = Idx++;
- }
+ Circuits(std::vector<SUnit> &SUs)
+ : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {}
Circuits &operator=(const Circuits &other) = delete;
Circuits(const Circuits &other) = delete;
- ~Circuits() { delete Node2Idx; }
/// Reset the data structures used in the circuit algorithm.
void reset() {
@@ -310,9 +334,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
NumPaths = 0;
}
- void createAdjacencyStructure(SwingSchedulerDAG *DAG);
+ void createAdjacencyStructure(const SwingSchedulerDDG *DDG);
bool circuit(int V, int S, NodeSetType &NodeSets,
- const SwingSchedulerDAG *DAG, bool HasBackedge = false);
+ const SwingSchedulerDDG *DDG, bool HasLoopCarriedEdge = false);
void unblock(int U);
};
@@ -366,7 +390,8 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
return ScheduleInfo[Node->NodeNum].ZeroLatencyHeight;
}
- bool isLoopCarriedDep(const SwingSchedulerDDGEdge &Edge) const;
+ bool hasLoopCarriedMemDep(const MachineInstr *Src, const MachineInstr *Dst,
+ BatchAAResults *BAA) const;
void applyInstrChange(MachineInstr *MI, SMSchedule &Schedule);
@@ -391,7 +416,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
const SwingSchedulerDDG *getDDG() const { return DDG.get(); }
private:
- void addLoopCarriedDependences(AAResults *AA);
+ LoopCarriedEdges addLoopCarriedDependences(AAResults *AA);
+ AliasResult::Kind checkLoopCarriedMemDep(const MachineInstr *Src,
+ const MachineInstr *Dst) const;
void updatePhiDependences();
void changeDependences();
unsigned calculateResMII();
@@ -409,7 +436,7 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
void computeNodeOrder(NodeSetType &NodeSets);
void checkValidNodeOrder(const NodeSetType &Circuits) const;
bool schedulePipeline(SMSchedule &Schedule);
- bool computeDelta(MachineInstr &MI, unsigned &Delta) const;
+ bool computeDelta(const MachineInstr &MI, unsigned &Delta) const;
MachineInstr *findDefInLoop(Register Reg);
bool canUseLastOffsetValue(MachineInstr *MI, unsigned &BasePos,
unsigned &OffsetPos, unsigned &NewBase,
@@ -437,7 +464,7 @@ class NodeSet {
using iterator = SetVector<SUnit *>::const_iterator;
NodeSet() = default;
- NodeSet(iterator S, iterator E, const SwingSchedulerDAG *DAG)
+ NodeSet(iterator S, iterator E, const SwingSchedulerDDG *DDG)
: Nodes(S, E), HasRecurrence(true) {
// Calculate the latency of this node set.
// Example to demonstrate the calculation:
@@ -453,7 +480,6 @@ class NodeSet {
//
// Hold a map from each SUnit in the circle to the maximum distance from the
// source node by only considering the nodes.
- const SwingSchedulerDDG *DDG = DAG->getDDG();
DenseMap<SUnit *, unsigned> SUnitToDistance;
for (auto *Node : Nodes)
SUnitToDistance[Node] = 0;
@@ -470,22 +496,6 @@ class NodeSet {
}
}
}
- // Handle a back-edge in loop carried dependencies
- SUnit *FirstNode = Nodes[0];
- SUnit *LastNode = Nodes[Nodes.size() - 1];
-
- for (auto &PI : DDG->getInEdges(LastNode)) {
- // If we have an order dep that is potentially loop carried then a
- // back-edge exists between the last node and the first node that isn't
- // modeled in the DAG. Handle it manually by adding 1 to the distance of
- // the last node.
- if (PI.getSrc() != FirstNode || !PI.isOrderDep() ||
- !DAG->isLoopCarriedDep(PI))
- continue;
- SUnitToDistance[FirstNode] =
- std::max(SUnitToDistance[FirstNode], SUnitToDistance[LastNode] + 1);
- }
-
// The latency is the distance from the source node to itself.
Latency = SUnitToDistance[Nodes.front()];
}
diff --git a/llvm/lib/CodeGen/MachinePipeliner.cpp b/llvm/lib/CodeGen/MachinePipeliner.cpp
index acd42aa497c6fe..f0731ccdaa6532 100644
--- a/llvm/lib/CodeGen/MachinePipeliner.cpp
+++ b/llvm/lib/CodeGen/MachinePipeliner.cpp
@@ -194,6 +194,10 @@ static cl::opt<bool>
MVECodeGen("pipeliner-mve-cg", cl::Hidden, cl::init(false),
cl::desc("Use the MVE code generator for software pipelining"));
+static cl::opt<unsigned> MaxCircuitPaths(
+ "pipeliner-max-circuit-paths", cl::Hidden, cl::init(5),
+ cl::desc("Maximum number of circles to be detected for each vertex"));
+
namespace llvm {
// A command line option to enable the CopyToPhi DAG mutation.
@@ -221,7 +225,6 @@ cl::opt<WindowSchedulingFlag> WindowSchedulingOption(
} // end namespace llvm
-unsigned SwingSchedulerDAG::Circuits::MaxPaths = 5;
char MachinePipeliner::ID = 0;
#ifndef NDEBUG
int MachinePipeliner::NumTries = 0;
@@ -562,14 +565,20 @@ void SwingSchedulerDAG::setMAX_II() {
void SwingSchedulerDAG::schedule() {
AliasAnalysis *AA = &Pass.getAnalysis<AAResultsWrapperPass>().getAAResults();
buildSchedGraph(AA);
- addLoopCarriedDependences(AA);
updatePhiDependences();
Topo.InitDAGTopologicalSorting();
changeDependences();
postProcessDAG();
- DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU);
LLVM_DEBUG(dump());
+ auto LCE = addLoopCarriedDependences(AA);
+ LLVM_DEBUG({
+ dbgs() << "Loop Carried Edges:\n";
+ for (SUnit &SU : SUnits)
+ LCE.dump(&SU, TRI, &MRI);
+ });
+ DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU, LCE);
+
NodeSetType NodeSets;
findCircuits(NodeSets);
NodeSetType Circuits = NodeSets;
@@ -779,42 +788,18 @@ static unsigned getLoopPhiReg(const MachineInstr &Phi,
return 0;
}
-/// Return true if SUb can be reached from SUa following the chain edges.
-static bool isSuccOrder(SUnit *SUa, SUnit *SUb) {
- SmallPtrSet<SUnit *, 8> Visited;
- SmallVector<SUnit *, 8> Worklist;
- Worklist.push_back(SUa);
- while (!Worklist.empty()) {
- const SUnit *SU = Worklist.pop_back_val();
- for (const auto &SI : SU->Succs) {
- SUnit *SuccSU = SI.getSUnit();
- if (SI.getKind() == SDep::Order) {
- if (Visited.count(SuccSU))
- continue;
- if (SuccSU == SUb)
- return true;
- Worklist.push_back(SuccSU);
- Visited.insert(SuccSU);
- }
- }
- }
- return false;
-}
-
/// Return true if the instruction causes a chain between memory
/// references before and after it.
-static bool isDependenceBarrier(MachineInstr &MI) {
- return MI.isCall() || MI.mayRaiseFPException() ||
- MI.hasUnmodeledSideEffects() ||
- (MI.hasOrderedMemoryRef() &&
- (!MI.mayLoad() || !MI.isDereferenceableInvariantLoad()));
+static bool isGlobalMemoryObject(MachineInstr &MI) {
+ return MI.isCall() || MI.hasUnmodeledSideEffects() ||
+ (MI.hasOrderedMemoryRef() && !MI.isDereferenceableInvariantLoad());
}
/// Return the underlying objects for the memory references of an instruction.
/// This function calls the code in ValueTracking, but first checks that the
/// instruction has a memory operand.
-static void getUnderlyingObjects(const MachineInstr *MI,
- SmallVectorImpl<const Value *> &Objs) {
+static void getUnderlyingObjectsForInstr(const MachineInstr *MI,
+ SmallVectorImpl<const Value *> &Objs) {
if (!MI->hasOneMemOperand())
return;
MachineMemOperand *MM = *MI->memoperands_begin();
@@ -829,97 +814,63 @@ static void getUnderlyingObjects(const MachineInstr *MI,
}
}
-/// Add a chain edge between a load and store if the store can be an
-/// alias of the load on a subsequent iteration, i.e., a loop carried
-/// dependence. This code is very similar to the code in ScheduleDAGInstrs
-/// but that code doesn't create loop carried dependences.
-void SwingSchedulerDAG::addLoopCarriedDependences(AliasAnalysis *AA) {
- MapVector<const Value *, SmallVector<SUnit *, 4>> PendingLoads;
- Value *UnknownValue =
- UndefValue::get(Type::getVoidTy(MF.getFunction().getContext()));
- for (auto &SU : SUnits) {
- MachineInstr &MI = *SU.getInstr();
- if (isDependenceBarrier(MI))
- PendingLoads.clear();
- else if (MI.mayLoad()) {
- SmallVector<const Value *, 4> Objs;
- ::getUnderlyingObjects(&MI, Objs);
- if (Objs.empty())
- Objs.push_back(UnknownValue);
- for (const auto *V : Objs) {
- SmallVector<SUnit *, 4> &SUs = PendingLoads[V];
- SUs.push_back(&SU);
- }
- } else if (MI.mayStore()) {
- SmallVector<const Value *, 4> Objs;
- ::getUnderlyingObjects(&MI, Objs);
- if (Objs.empty())
- Objs.push_back(UnknownValue);
- for (const auto *V : Objs) {
- MapVector<const Value *, SmallVector<SUnit *, 4>>::iterator I =
- PendingLoads.find(V);
- if (I == PendingLoads.end())
- continue;
- for (auto *Load : I->second) {
- if (isSuccOrder(Load, &SU))
- continue;
- MachineInstr &LdMI = *Load->getInstr();
- // First, perform the cheaper check that compares the base register.
- // If they are the same and the load offset is less than the store
- // offset, then mark the dependence as loop carried potentially.
- const MachineOperand *BaseOp1, *BaseOp2;
- int64_t Offset1, Offset2;
- bool Offset1IsScalable, Offset2IsScalable;
- if (TII->getMemOperandWithOffset(LdMI, BaseOp1, Offset1,
- Offset1IsScalable, TRI) &&
- TII->getMemOperandWithOffset(MI, BaseOp2, Offset2,
- Offset2IsScalable, TRI)) {
- if (BaseOp1->isIdenticalTo(*BaseOp2) &&
- Offset1IsScalable == Offset2IsScalable &&
- (int)Offset1 < (int)Offset2) {
- assert(TII->areMemAccessesTriviallyDisjoint(LdMI, MI) &&
- "What happened to the chain edge?");
- SDep Dep(Load, SDep::Barrier);
- Dep.setLatency(1);
- SU.addPred(Dep);
- continue;
- }
- }
- // Second, the more expensive check that uses alias analysis on the
- // base registers. If they alias, and the load offset is less than
- // the store offset, the mark the dependence as loop carried.
- if (!AA) {
- SDep Dep(Load, SDep::Barrier);
- Dep.setLatency(1);
- SU.addPred(Dep);
- continue;
- }
- MachineMemOperand *MMO1 = *LdMI.memoperands_begin();
- MachineMemOperand *MMO2 = *MI.memoperands_begin();
- if (!MMO1->getValue() || !MMO2->getValue()) {
- SDep Dep(Load, SDep::Barrier);
- Dep.setLatency(1);
- SU.addPred(Dep);
- continue;
- }
- if (MMO1->getValue() == MMO2->getValue() &&
- MMO1->getOffset() <= MMO2->getOffset()) {
- SDep Dep(Load, SDep::Barrier);
- Dep.setLatency(1);
- SU.addPred(Dep);
- continue;
- }
- if (!AA->isNoAlias(
- MemoryLocation::getAfter(MMO1->getValue(), MMO1->getAAInfo()),
- MemoryLocation::getAfter(MMO2->getValue(),
- MMO2->getAAInfo()))) {
- SDep Dep(Load, SDep::Barrier);
- Dep.setLatency(1);
- SU.addPred(Dep);
- }
- }
- }
- }
+static std::optional<MemoryLocation>
+getMemoryLocationForAA(const MachineInstr *MI) {
+ const MachineMemOperand *MMO = *MI->memoperands_begin();
+ const Value *Val = MMO->getValue();
+ if (!Val)
+ return std::nullopt;
+ auto MemLoc = MemoryLocation::getBeforeOrAfter(Val, MMO->getAAInfo());
+
+ // Peel off noalias information from `AATags` because it might be valid only
+ // in single iteration.
+ // FIXME: This is too conservative. Checking
+ // `llvm.experimental.noalias.scope.decl` instrinsics in the original LLVM IR
+ // can perform more accuurately.
+ MemLoc.AATags.NoAlias = nullptr;
+ return MemLoc;
+}
+
+/// Return true for an memory dependence that is loop carried
+/// potentially. A dependence is loop carried if the destination defines a value
+/// that may be used or defined by the source in a subsequent iteration.
+bool SwingSchedulerDAG::hasLoopCarriedMemDep(const MachineInstr *Src,
+ const MachineInstr *Dst,
+ BatchAAResults *BAA) const {
+ if (!SwpPruneLoopCarried)
+ return true;
+
+ // First, check the dependence by comparing base register, offset, and
+ // step value of the loop.
+ switch (checkLoopCarriedMemDep(Src, Dst)) {
+ case AliasResult::Kind::MustAlias:
+ return true;
+ case AliasResult::Kind::NoAlias:
+ return false;
+ case AliasResult::Kind::MayAlias:
+ break;
+ default:
+ llvm_unreachable("Unexpected alias");
+ }
+
+ // If we cannot determine the dependence by previouse check, then
+ // check by using alias analysis.
+ if (!BAA)
+ return true;
+
+ const auto MemLoc1 = getMemoryLocationForAA(Src);
+ const auto MemLoc2 = getMemoryLocationForAA(Dst);
+ if (!MemLoc1.has_value() || !MemLoc2.has_value())
+ return true;
+ switch (BAA->alias(*MemLoc1, *MemLoc2)) {
+ case AliasResult::Kind::MayAlias:
+ case AliasResult::Kind::MustAlias:
+ case AliasResult::Kind::PartialAlias:
+ return true;
+ case AliasResult::Kind::NoAlias:
+ return false;
+ default:
+ llvm_unreachable("Unexpected alias");
}
}
@@ -1544,8 +1495,311 @@ class HighRegisterPressureDetector {
}
};
+/// Add loop-carried chain dependencies. This class handles the same type of
+/// dependencies added by `ScheduleDAGInstrs::buildSchedGraph`, but takes into
+/// account dependencies across iterations.
+class LoopCarriedOrderDepsTracker {
+ // Type of instruction that is relevant to order-dependencies
+ enum class InstrTag {
+ // Instruction related to global memory objects. There are order
+ // dependencies between instructions that may load or store or raise
+ // floating-point exception before and after this one.
+ GlobalMemoryObject = 0,
+
+ // Instruction that may load or store memory, but does not form a global
+ // barrier.
+ LoadOrStore = 1,
+
+ // Instruction that does not match above, but may raise floatin-point
+ // exceptions.
+ FPExceptions = 2,
+ };
+
+ struct TaggedSUnit : PointerIntPair<SUnit *, 2> {
+ TaggedSUnit(SUnit *SU, InstrTag Tag)
+ : PointerIntPair<SUnit *, 2>(SU, unsigned(Tag)) {}
+
+ InstrTag getTag() const { return InstrTag(getInt()); }
+ };
+
+ using SUsType = SmallVector<SUnit *, 4>;
+ using Value2SUs = MapVector<const Value *, SUsType>;
+
+ // Retains loads and stores classified by the underlying objects.
+ struct LoadStoreChunk {
+ Value2SUs Loads, Stores;
+ SUsType UnknownLoads, UnknownStores;
+ };
+
+ SwingSchedulerDAG *DAG;
+ std::unique_ptr<BatchAAResults> BAA;
+ const Value *UnknownValue;
+ std::vector<SUnit> &SUnits;
+
+ // The size of SUnits, for convenience.
+ const unsigned N;
+
+ // Adjacency matrix consisiting of order dependencies of the original DAG.
+ std::vector<BitVector> AdjMatrix;
+
+ // Loop-carried Edges.
+ std::vector<BitVector> LoopCarried;
+
+ // Instructions related to chain dependencies. They are one of the following.
+ //
+ // 1. Global memory object.
+ // 2. Load, but not a global memory object, not invariant, or may load trap
+ // value.
+ // 3. Store, but not global memory object.
+ // 4. None of them, but may raise floating-point exceptions.
+ //
+ // This is used when analyzing loop-carried dependencies that access global
+ // barrier instructions.
+ std::vector<TaggedSUnit> TaggedSUnits;
+
+public:
+ LoopCarriedOrderDepsTracker(SwingSchedulerDAG *SSD, AAResults *AA)
+ : DAG(SSD), BAA(nullptr), SUnits(DAG->SUnits), N(SUnits.size()),
+ AdjMatrix(N, BitVector(N)), LoopCarried(N, BitVector(N)) {
+ UnknownValue =
+ UndefValue::get(Type::getVoidTy(DAG->MF.getFunction().getContext()));
+ if (AA) {
+ BAA = std::make_unique<BatchAAResults>(*AA);
+ BAA->enableCrossIterationMode();
+ }
+ initAdjMatrix();
+ }
+
+ void computeDependencies() {
+ // Traverse all instructions and extract only what we are targetting.
+ for (auto &SU : SUnits) {
+ auto Tagged = checkInstrType(&SU);
+
+ // This instruction has no loop-carried order-dependencies.
+ if (!Tagged)
+ continue;
+
+ TaggedSUnits.push_back(*Tagged);
+ }
+
+ addLoopCarriedDependencies();
+
+ // Finalize the results.
+ for (int I = 0; I != int(N); I++) {
+ // If the dependence between two instructions already exists in the
+ // original DAG, then loop-carried dependence of the same instructions is
+ // unnecessary because the original one expresses stricter
+ // constraint than loop-carried one.
+ LoopCarried[I].reset(AdjMatrix[I]);
+
+ // Self-loops are noisy.
+ LoopCarried[I].reset(I);
+ }
+ }
+
+ const BitVector &getLoopCarried(unsigned Idx) const {
+ return LoopCarried[Idx];
+ }
+
+private:
+ // Calculate reachability induced by the adjacency matrix. The original graph
+ // is DAG, so we can compute them from bottom to top.
+ void initAdjMatrix() {
+ for (int RI = 0; RI != int(N); RI++) {
+ int I = SUnits.size() - (RI + 1);
+ f...
[truncated]
|
@llvm/pr-subscribers-backend-hexagon Author: Ryotaro Kasuga (kasuga-fj) ChangesIn the current MachinePipeliner, several loop-carried edges are missed. It can result generating invalid code. At least following loop-carried dependencies can be missed.
This patch added these dependencies to fix correctness issues. In addition, the current analysis may add excessive dependencies because loop-carried memory dependencies from bottom to top by are expressed by using dependencies in the forward direction (i.e., from top to bottom edge). This patch also removes such dependencies. I tested performance changes with and without this patch. I used llvm-test-suite as test cases and checked the following:
As far as I have tested, there has been no significant performance impact. It's worth noting, however, that a huge performance degradation can occur when
This is because loop-carried edges are added to each pair of unrolled memory instructions. For example, suppose a loop contains a store to a[i] and it's unrolled 4 times. In this case, the loop has store to a[i], a[i+1], a[i+2], and a[i+3]. If Patch is 84.58 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/121907.diff 18 Files Affected:
diff --git a/llvm/include/llvm/CodeGen/MachinePipeliner.h b/llvm/include/llvm/CodeGen/MachinePipeliner.h
index 8e47d0cead7571..810a5d9f6dff00 100644
--- a/llvm/include/llvm/CodeGen/MachinePipeliner.h
+++ b/llvm/include/llvm/CodeGen/MachinePipeliner.h
@@ -42,6 +42,7 @@
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SetVector.h"
+#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/CodeGen/DFAPacketizer.h"
#include "llvm/CodeGen/MachineDominators.h"
#include "llvm/CodeGen/MachineOptimizationRemarkEmitter.h"
@@ -190,6 +191,33 @@ class SwingSchedulerDDGEdge {
bool ignoreDependence(bool IgnoreAnti) const;
};
+struct LoopCarriedEdges {
+ using OutputDep = SmallDenseMap<Register, SmallSetVector<SUnit *, 4>>;
+ using OrderDep = SmallSetVector<SUnit *, 8>;
+ using OutputDepsType = DenseMap<SUnit *, OutputDep>;
+ using OrderDepsType = DenseMap<SUnit *, OrderDep>;
+
+ OutputDepsType OutputDeps;
+ OrderDepsType OrderDeps;
+
+ const OutputDep *getOutputDepOrNull(SUnit *Key) const {
+ auto Ite = OutputDeps.find(Key);
+ if (Ite == OutputDeps.end())
+ return nullptr;
+ return &Ite->second;
+ }
+
+ const OrderDep *getOrderDepOrNull(SUnit *Key) const {
+ auto Ite = OrderDeps.find(Key);
+ if (Ite == OrderDeps.end())
+ return nullptr;
+ return &Ite->second;
+ }
+
+ void dump(SUnit *SU, const TargetRegisterInfo *TRI,
+ const MachineRegisterInfo *MRI) const;
+};
+
/// Represents dependencies between instructions. This class is a wrapper of
/// `SUnits` and its dependencies to manipulate back-edges in a natural way.
/// Currently it only supports back-edges via PHI, which are expressed as
@@ -217,8 +245,12 @@ class SwingSchedulerDDG {
SwingSchedulerDDGEdges &getEdges(const SUnit *SU);
const SwingSchedulerDDGEdges &getEdges(const SUnit *SU) const;
+ void addLoopCarriedEdges(std::vector<SUnit> &SUnits,
+ const LoopCarriedEdges &LCE);
+
public:
- SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU);
+ SwingSchedulerDDG(std::vector<SUnit> &SUnits, SUnit *EntrySU, SUnit *ExitSU,
+ const LoopCarriedEdges &LCE);
const EdgesType &getInEdges(const SUnit *SU) const;
@@ -285,22 +317,14 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
BitVector Blocked;
SmallVector<SmallPtrSet<SUnit *, 4>, 10> B;
SmallVector<SmallVector<int, 4>, 16> AdjK;
- // Node to Index from ScheduleDAGTopologicalSort
- std::vector<int> *Node2Idx;
+ SmallVector<BitVector, 16> LoopCarried;
unsigned NumPaths = 0u;
- static unsigned MaxPaths;
public:
- Circuits(std::vector<SUnit> &SUs, ScheduleDAGTopologicalSort &Topo)
- : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {
- Node2Idx = new std::vector<int>(SUs.size());
- unsigned Idx = 0;
- for (const auto &NodeNum : Topo)
- Node2Idx->at(NodeNum) = Idx++;
- }
+ Circuits(std::vector<SUnit> &SUs)
+ : SUnits(SUs), Blocked(SUs.size()), B(SUs.size()), AdjK(SUs.size()) {}
Circuits &operator=(const Circuits &other) = delete;
Circuits(const Circuits &other) = delete;
- ~Circuits() { delete Node2Idx; }
/// Reset the data structures used in the circuit algorithm.
void reset() {
@@ -310,9 +334,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
NumPaths = 0;
}
- void createAdjacencyStructure(SwingSchedulerDAG *DAG);
+ void createAdjacencyStructure(const SwingSchedulerDDG *DDG);
bool circuit(int V, int S, NodeSetType &NodeSets,
- const SwingSchedulerDAG *DAG, bool HasBackedge = false);
+ const SwingSchedulerDDG *DDG, bool HasLoopCarriedEdge = false);
void unblock(int U);
};
@@ -366,7 +390,8 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
return ScheduleInfo[Node->NodeNum].ZeroLatencyHeight;
}
- bool isLoopCarriedDep(const SwingSchedulerDDGEdge &Edge) const;
+ bool hasLoopCarriedMemDep(const MachineInstr *Src, const MachineInstr *Dst,
+ BatchAAResults *BAA) const;
void applyInstrChange(MachineInstr *MI, SMSchedule &Schedule);
@@ -391,7 +416,9 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
const SwingSchedulerDDG *getDDG() const { return DDG.get(); }
private:
- void addLoopCarriedDependences(AAResults *AA);
+ LoopCarriedEdges addLoopCarriedDependences(AAResults *AA);
+ AliasResult::Kind checkLoopCarriedMemDep(const MachineInstr *Src,
+ const MachineInstr *Dst) const;
void updatePhiDependences();
void changeDependences();
unsigned calculateResMII();
@@ -409,7 +436,7 @@ class SwingSchedulerDAG : public ScheduleDAGInstrs {
void computeNodeOrder(NodeSetType &NodeSets);
void checkValidNodeOrder(const NodeSetType &Circuits) const;
bool schedulePipeline(SMSchedule &Schedule);
- bool computeDelta(MachineInstr &MI, unsigned &Delta) const;
+ bool computeDelta(const MachineInstr &MI, unsigned &Delta) const;
MachineInstr *findDefInLoop(Register Reg);
bool canUseLastOffsetValue(MachineInstr *MI, unsigned &BasePos,
unsigned &OffsetPos, unsigned &NewBase,
@@ -437,7 +464,7 @@ class NodeSet {
using iterator = SetVector<SUnit *>::const_iterator;
NodeSet() = default;
- NodeSet(iterator S, iterator E, const SwingSchedulerDAG *DAG)
+ NodeSet(iterator S, iterator E, const SwingSchedulerDDG *DDG)
: Nodes(S, E), HasRecurrence(true) {
// Calculate the latency of this node set.
// Example to demonstrate the calculation:
@@ -453,7 +480,6 @@ class NodeSet {
//
// Hold a map from each SUnit in the circle to the maximum distance from the
// source node by only considering the nodes.
- const SwingSchedulerDDG *DDG = DAG->getDDG();
DenseMap<SUnit *, unsigned> SUnitToDistance;
for (auto *Node : Nodes)
SUnitToDistance[Node] = 0;
@@ -470,22 +496,6 @@ class NodeSet {
}
}
}
- // Handle a back-edge in loop carried dependencies
- SUnit *FirstNode = Nodes[0];
- SUnit *LastNode = Nodes[Nodes.size() - 1];
-
- for (auto &PI : DDG->getInEdges(LastNode)) {
- // If we have an order dep that is potentially loop carried then a
- // back-edge exists between the last node and the first node that isn't
- // modeled in the DAG. Handle it manually by adding 1 to the distance of
- // the last node.
- if (PI.getSrc() != FirstNode || !PI.isOrderDep() ||
- !DAG->isLoopCarriedDep(PI))
- continue;
- SUnitToDistance[FirstNode] =
- std::max(SUnitToDistance[FirstNode], SUnitToDistance[LastNode] + 1);
- }
-
// The latency is the distance from the source node to itself.
Latency = SUnitToDistance[Nodes.front()];
}
diff --git a/llvm/lib/CodeGen/MachinePipeliner.cpp b/llvm/lib/CodeGen/MachinePipeliner.cpp
index acd42aa497c6fe..f0731ccdaa6532 100644
--- a/llvm/lib/CodeGen/MachinePipeliner.cpp
+++ b/llvm/lib/CodeGen/MachinePipeliner.cpp
@@ -194,6 +194,10 @@ static cl::opt<bool>
MVECodeGen("pipeliner-mve-cg", cl::Hidden, cl::init(false),
cl::desc("Use the MVE code generator for software pipelining"));
+static cl::opt<unsigned> MaxCircuitPaths(
+ "pipeliner-max-circuit-paths", cl::Hidden, cl::init(5),
+ cl::desc("Maximum number of circles to be detected for each vertex"));
+
namespace llvm {
// A command line option to enable the CopyToPhi DAG mutation.
@@ -221,7 +225,6 @@ cl::opt<WindowSchedulingFlag> WindowSchedulingOption(
} // end namespace llvm
-unsigned SwingSchedulerDAG::Circuits::MaxPaths = 5;
char MachinePipeliner::ID = 0;
#ifndef NDEBUG
int MachinePipeliner::NumTries = 0;
@@ -562,14 +565,20 @@ void SwingSchedulerDAG::setMAX_II() {
void SwingSchedulerDAG::schedule() {
AliasAnalysis *AA = &Pass.getAnalysis<AAResultsWrapperPass>().getAAResults();
buildSchedGraph(AA);
- addLoopCarriedDependences(AA);
updatePhiDependences();
Topo.InitDAGTopologicalSorting();
changeDependences();
postProcessDAG();
- DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU);
LLVM_DEBUG(dump());
+ auto LCE = addLoopCarriedDependences(AA);
+ LLVM_DEBUG({
+ dbgs() << "Loop Carried Edges:\n";
+ for (SUnit &SU : SUnits)
+ LCE.dump(&SU, TRI, &MRI);
+ });
+ DDG = std::make_unique<SwingSchedulerDDG>(SUnits, &EntrySU, &ExitSU, LCE);
+
NodeSetType NodeSets;
findCircuits(NodeSets);
NodeSetType Circuits = NodeSets;
@@ -779,42 +788,18 @@ static unsigned getLoopPhiReg(const MachineInstr &Phi,
return 0;
}
-/// Return true if SUb can be reached from SUa following the chain edges.
-static bool isSuccOrder(SUnit *SUa, SUnit *SUb) {
- SmallPtrSet<SUnit *, 8> Visited;
- SmallVector<SUnit *, 8> Worklist;
- Worklist.push_back(SUa);
- while (!Worklist.empty()) {
- const SUnit *SU = Worklist.pop_back_val();
- for (const auto &SI : SU->Succs) {
- SUnit *SuccSU = SI.getSUnit();
- if (SI.getKind() == SDep::Order) {
- if (Visited.count(SuccSU))
- continue;
- if (SuccSU == SUb)
- return true;
- Worklist.push_back(SuccSU);
- Visited.insert(SuccSU);
- }
- }
- }
- return false;
-}
-
/// Return true if the instruction causes a chain between memory
/// references before and after it.
-static bool isDependenceBarrier(MachineInstr &MI) {
- return MI.isCall() || MI.mayRaiseFPException() ||
- MI.hasUnmodeledSideEffects() ||
- (MI.hasOrderedMemoryRef() &&
- (!MI.mayLoad() || !MI.isDereferenceableInvariantLoad()));
+static bool isGlobalMemoryObject(MachineInstr &MI) {
+ return MI.isCall() || MI.hasUnmodeledSideEffects() ||
+ (MI.hasOrderedMemoryRef() && !MI.isDereferenceableInvariantLoad());
}
/// Return the underlying objects for the memory references of an instruction.
/// This function calls the code in ValueTracking, but first checks that the
/// instruction has a memory operand.
-static void getUnderlyingObjects(const MachineInstr *MI,
- SmallVectorImpl<const Value *> &Objs) {
+static void getUnderlyingObjectsForInstr(const MachineInstr *MI,
+ SmallVectorImpl<const Value *> &Objs) {
if (!MI->hasOneMemOperand())
return;
MachineMemOperand *MM = *MI->memoperands_begin();
@@ -829,97 +814,63 @@ static void getUnderlyingObjects(const MachineInstr *MI,
}
}
-/// Add a chain edge between a load and store if the store can be an
-/// alias of the load on a subsequent iteration, i.e., a loop carried
-/// dependence. This code is very similar to the code in ScheduleDAGInstrs
-/// but that code doesn't create loop carried dependences.
-void SwingSchedulerDAG::addLoopCarriedDependences(AliasAnalysis *AA) {
- MapVector<const Value *, SmallVector<SUnit *, 4>> PendingLoads;
- Value *UnknownValue =
- UndefValue::get(Type::getVoidTy(MF.getFunction().getContext()));
- for (auto &SU : SUnits) {
- MachineInstr &MI = *SU.getInstr();
- if (isDependenceBarrier(MI))
- PendingLoads.clear();
- else if (MI.mayLoad()) {
- SmallVector<const Value *, 4> Objs;
- ::getUnderlyingObjects(&MI, Objs);
- if (Objs.empty())
- Objs.push_back(UnknownValue);
- for (const auto *V : Objs) {
- SmallVector<SUnit *, 4> &SUs = PendingLoads[V];
- SUs.push_back(&SU);
- }
- } else if (MI.mayStore()) {
- SmallVector<const Value *, 4> Objs;
- ::getUnderlyingObjects(&MI, Objs);
- if (Objs.empty())
- Objs.push_back(UnknownValue);
- for (const auto *V : Objs) {
- MapVector<const Value *, SmallVector<SUnit *, 4>>::iterator I =
- PendingLoads.find(V);
- if (I == PendingLoads.end())
- continue;
- for (auto *Load : I->second) {
- if (isSuccOrder(Load, &SU))
- continue;
- MachineInstr &LdMI = *Load->getInstr();
- // First, perform the cheaper check that compares the base register.
- // If they are the same and the load offset is less than the store
- // offset, then mark the dependence as loop carried potentially.
- const MachineOperand *BaseOp1, *BaseOp2;
- int64_t Offset1, Offset2;
- bool Offset1IsScalable, Offset2IsScalable;
- if (TII->getMemOperandWithOffset(LdMI, BaseOp1, Offset1,
- Offset1IsScalable, TRI) &&
- TII->getMemOperandWithOffset(MI, BaseOp2, Offset2,
- Offset2IsScalable, TRI)) {
- if (BaseOp1->isIdenticalTo(*BaseOp2) &&
- Offset1IsScalable == Offset2IsScalable &&
- (int)Offset1 < (int)Offset2) {
- assert(TII->areMemAccessesTriviallyDisjoint(LdMI, MI) &&
- "What happened to the chain edge?");
- SDep Dep(Load, SDep::Barrier);
- Dep.setLatency(1);
- SU.addPred(Dep);
- continue;
- }
- }
- // Second, the more expensive check that uses alias analysis on the
- // base registers. If they alias, and the load offset is less than
- // the store offset, the mark the dependence as loop carried.
- if (!AA) {
- SDep Dep(Load, SDep::Barrier);
- Dep.setLatency(1);
- SU.addPred(Dep);
- continue;
- }
- MachineMemOperand *MMO1 = *LdMI.memoperands_begin();
- MachineMemOperand *MMO2 = *MI.memoperands_begin();
- if (!MMO1->getValue() || !MMO2->getValue()) {
- SDep Dep(Load, SDep::Barrier);
- Dep.setLatency(1);
- SU.addPred(Dep);
- continue;
- }
- if (MMO1->getValue() == MMO2->getValue() &&
- MMO1->getOffset() <= MMO2->getOffset()) {
- SDep Dep(Load, SDep::Barrier);
- Dep.setLatency(1);
- SU.addPred(Dep);
- continue;
- }
- if (!AA->isNoAlias(
- MemoryLocation::getAfter(MMO1->getValue(), MMO1->getAAInfo()),
- MemoryLocation::getAfter(MMO2->getValue(),
- MMO2->getAAInfo()))) {
- SDep Dep(Load, SDep::Barrier);
- Dep.setLatency(1);
- SU.addPred(Dep);
- }
- }
- }
- }
+static std::optional<MemoryLocation>
+getMemoryLocationForAA(const MachineInstr *MI) {
+ const MachineMemOperand *MMO = *MI->memoperands_begin();
+ const Value *Val = MMO->getValue();
+ if (!Val)
+ return std::nullopt;
+ auto MemLoc = MemoryLocation::getBeforeOrAfter(Val, MMO->getAAInfo());
+
+ // Peel off noalias information from `AATags` because it might be valid only
+ // in single iteration.
+ // FIXME: This is too conservative. Checking
+ // `llvm.experimental.noalias.scope.decl` instrinsics in the original LLVM IR
+ // can perform more accuurately.
+ MemLoc.AATags.NoAlias = nullptr;
+ return MemLoc;
+}
+
+/// Return true for an memory dependence that is loop carried
+/// potentially. A dependence is loop carried if the destination defines a value
+/// that may be used or defined by the source in a subsequent iteration.
+bool SwingSchedulerDAG::hasLoopCarriedMemDep(const MachineInstr *Src,
+ const MachineInstr *Dst,
+ BatchAAResults *BAA) const {
+ if (!SwpPruneLoopCarried)
+ return true;
+
+ // First, check the dependence by comparing base register, offset, and
+ // step value of the loop.
+ switch (checkLoopCarriedMemDep(Src, Dst)) {
+ case AliasResult::Kind::MustAlias:
+ return true;
+ case AliasResult::Kind::NoAlias:
+ return false;
+ case AliasResult::Kind::MayAlias:
+ break;
+ default:
+ llvm_unreachable("Unexpected alias");
+ }
+
+ // If we cannot determine the dependence by previouse check, then
+ // check by using alias analysis.
+ if (!BAA)
+ return true;
+
+ const auto MemLoc1 = getMemoryLocationForAA(Src);
+ const auto MemLoc2 = getMemoryLocationForAA(Dst);
+ if (!MemLoc1.has_value() || !MemLoc2.has_value())
+ return true;
+ switch (BAA->alias(*MemLoc1, *MemLoc2)) {
+ case AliasResult::Kind::MayAlias:
+ case AliasResult::Kind::MustAlias:
+ case AliasResult::Kind::PartialAlias:
+ return true;
+ case AliasResult::Kind::NoAlias:
+ return false;
+ default:
+ llvm_unreachable("Unexpected alias");
}
}
@@ -1544,8 +1495,311 @@ class HighRegisterPressureDetector {
}
};
+/// Add loop-carried chain dependencies. This class handles the same type of
+/// dependencies added by `ScheduleDAGInstrs::buildSchedGraph`, but takes into
+/// account dependencies across iterations.
+class LoopCarriedOrderDepsTracker {
+ // Type of instruction that is relevant to order-dependencies
+ enum class InstrTag {
+ // Instruction related to global memory objects. There are order
+ // dependencies between instructions that may load or store or raise
+ // floating-point exception before and after this one.
+ GlobalMemoryObject = 0,
+
+ // Instruction that may load or store memory, but does not form a global
+ // barrier.
+ LoadOrStore = 1,
+
+ // Instruction that does not match above, but may raise floatin-point
+ // exceptions.
+ FPExceptions = 2,
+ };
+
+ struct TaggedSUnit : PointerIntPair<SUnit *, 2> {
+ TaggedSUnit(SUnit *SU, InstrTag Tag)
+ : PointerIntPair<SUnit *, 2>(SU, unsigned(Tag)) {}
+
+ InstrTag getTag() const { return InstrTag(getInt()); }
+ };
+
+ using SUsType = SmallVector<SUnit *, 4>;
+ using Value2SUs = MapVector<const Value *, SUsType>;
+
+ // Retains loads and stores classified by the underlying objects.
+ struct LoadStoreChunk {
+ Value2SUs Loads, Stores;
+ SUsType UnknownLoads, UnknownStores;
+ };
+
+ SwingSchedulerDAG *DAG;
+ std::unique_ptr<BatchAAResults> BAA;
+ const Value *UnknownValue;
+ std::vector<SUnit> &SUnits;
+
+ // The size of SUnits, for convenience.
+ const unsigned N;
+
+ // Adjacency matrix consisiting of order dependencies of the original DAG.
+ std::vector<BitVector> AdjMatrix;
+
+ // Loop-carried Edges.
+ std::vector<BitVector> LoopCarried;
+
+ // Instructions related to chain dependencies. They are one of the following.
+ //
+ // 1. Global memory object.
+ // 2. Load, but not a global memory object, not invariant, or may load trap
+ // value.
+ // 3. Store, but not global memory object.
+ // 4. None of them, but may raise floating-point exceptions.
+ //
+ // This is used when analyzing loop-carried dependencies that access global
+ // barrier instructions.
+ std::vector<TaggedSUnit> TaggedSUnits;
+
+public:
+ LoopCarriedOrderDepsTracker(SwingSchedulerDAG *SSD, AAResults *AA)
+ : DAG(SSD), BAA(nullptr), SUnits(DAG->SUnits), N(SUnits.size()),
+ AdjMatrix(N, BitVector(N)), LoopCarried(N, BitVector(N)) {
+ UnknownValue =
+ UndefValue::get(Type::getVoidTy(DAG->MF.getFunction().getContext()));
+ if (AA) {
+ BAA = std::make_unique<BatchAAResults>(*AA);
+ BAA->enableCrossIterationMode();
+ }
+ initAdjMatrix();
+ }
+
+ void computeDependencies() {
+ // Traverse all instructions and extract only what we are targetting.
+ for (auto &SU : SUnits) {
+ auto Tagged = checkInstrType(&SU);
+
+ // This instruction has no loop-carried order-dependencies.
+ if (!Tagged)
+ continue;
+
+ TaggedSUnits.push_back(*Tagged);
+ }
+
+ addLoopCarriedDependencies();
+
+ // Finalize the results.
+ for (int I = 0; I != int(N); I++) {
+ // If the dependence between two instructions already exists in the
+ // original DAG, then loop-carried dependence of the same instructions is
+ // unnecessary because the original one expresses stricter
+ // constraint than loop-carried one.
+ LoopCarried[I].reset(AdjMatrix[I]);
+
+ // Self-loops are noisy.
+ LoopCarried[I].reset(I);
+ }
+ }
+
+ const BitVector &getLoopCarried(unsigned Idx) const {
+ return LoopCarried[Idx];
+ }
+
+private:
+ // Calculate reachability induced by the adjacency matrix. The original graph
+ // is DAG, so we can compute them from bottom to top.
+ void initAdjMatrix() {
+ for (int RI = 0; RI != int(N); RI++) {
+ int I = SUnits.size() - (RI + 1);
+ f...
[truncated]
|
a2495c0
to
62423c1
Compare
I have modified some existing tests that depend on scheduling results more than necessary. But I now feel that these changes should be committed separately before this PR. I will split those changes. |
There are tests that are more sensitive to scheduling results than necessary. For example, a test that is intended to verify the correctness of the dependency analysis but is validating the final scheduling results These tests can be affected by unrelated changes. This patch fixes them to make them robust against such changes. This patch is a prelude to llvm#121907.
Currently, some tests are too sensitive to changes in MachinePipeliner. Some of them are more sensitive to scheduling results than necessary, others assume the current dependency analysis, which is actually incorrect. As an example of the former, there is a test that is intended to verify the correctness of the dependency analysis, but validates the final scheduling results. These tests can be affected by unrelated changes in MachinePipeliner. This patch fixes them to make them robust against such changes. This patch is a prelude to llvm#121907.
In current MachinePipeliner, several loop-carried edges are missed. It can result generating invalid code. At least following loop-carried dependencies can be missed. - Memory dependencies from top to bottom. - Example: ``` for (int i=1; i<n; i++) { a[i] = ...; a[i-1] = ...; } ``` - Store to store dependencies. - Store to load dependencies. - Output (write-after-write) dependencies. - Use of alias analysis results that are valid only in the single iteration. - Example: ``` void f(double * restrict a, double * restrict b); ... for (int i=0; i<n; i++) f(ptr0, ptr1); // will be inlined ``` This patch added these dependencies and fix correctness issues. In addition, the current analysis can add excessive dependencies because loop-carried memory dependence from bottom to top by forward direction (i.e., top to bottom) edge. This patch also removes such dependencies.
62423c1
to
996c869
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This patch causes performance regression. I am working on reducing a testcase.
I see, thanks in advance. |
@kasuga-fj, please find the testcase.ll bellow. With this patch, only one loop is pipelined. Thank you. target triple = "hexagon" define void @IntegratePerRow(i32 %width, ptr %0, ptr %1) { for.body: ; preds = %for.body, %entry for.end.loopexit.unr-lcssa: ; preds = %for.body ; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none) ; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none) ; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none) ; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none) define void @IntegrateImage(ptr %src, i32 %srcWidth, ptr %dst) { for.body.i: ; preds = %for.body.i, %entry IntegratePerRow.exit.loopexit.unr-lcssa: ; preds = %for.body.i attributes #0 = { nocallback nofree nosync nounwind willreturn memory(none) } |
Thank you for sharing. Let me ask a question. Is this the only case that causes regression? Or are there other cases? This patch addresses correctness issues, and I believe that is generally more important than performance. However, if there are many performance regressions, I think it would be one way to create an option to enable/disable the changes of this patch. Does such a flag make sense for the Hexagon backend? |
Thank you @kasuga-fj for the patch. There are other regressions as well. It would be ideal to avoid introducing a new flag for this patch. In my opinion, a correctness patch should not cause regressions in tests that were previously passing without correctness issues. The best approach would be to address the correctness issue without causing any regressions. |
@aankit-ca Thank you for your comment.
I see, thank you.
I agree with you, but this is not such a simple case. The pipeliner is roughly consists of the following two phases:
The issue this patch addresses is that some dependencies are missed when a dependency graph is built. This can "potentially" cause correctness issues, but not necessarily: if a found schedule happens to satisfy missed dependencies, it works fine. Adding these currently missing dependencies when creating the graph is one way to fix this "potential" correctness issue, but this affects the subsequent schedule search heuristics, which is a problem I'm now facing. Improving those heuristics may resolve the regressions, but it's difficult without knowing the details of them. Alternatively, these results may imply that pruning currently missing dependencies is a good heuristic: That is, in many cases, using the pruned dependency graph and validating all dependencies at the end tends to produce a better schedule. Anyway, I don't know much about the details of the Hexagon benchmarks, so there is nothing more I can say. I would appreciate your input on this. |
Let's review a few cases to address the performance issues. Can you confirm if the testcase shared by @iajbar introduces a new dependency that prevents one of the loops from being pipelined? |
static std::optional<MemoryLocation> | ||
getMemoryLocationForAA(const MachineInstr *MI) { | ||
const MachineMemOperand *MMO = *MI->memoperands_begin(); | ||
const Value *Val = MMO->getValue(); | ||
if (!Val) | ||
return std::nullopt; | ||
auto MemLoc = MemoryLocation::getBeforeOrAfter(Val, MMO->getAAInfo()); | ||
|
||
// Peel off noalias information from `AATags` because it might be valid only | ||
// in single iteration. | ||
// FIXME: This is too conservative. Checking | ||
// `llvm.experimental.noalias.scope.decl` instrinsics in the original LLVM IR | ||
// can perform more accuurately. | ||
MemLoc.AATags.NoAlias = nullptr; | ||
return MemLoc; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's review a few cases to address the performance issues.
Thanks.
Can you confirm if the testcase shared by @iajbar introduces a new dependency that prevents one of the loops from being pipelined?
Yes. The second function (IntegrateImage
) was not pipelined with this patch. The loop body is shown as below:
SU(0): %9:intregs = PHI %7:intregs, %bb.6, %11:intregs, %bb.4
SU(1): %52:hvxvr, %11:intregs = V6_vL32b_pi %9:intregs(tied-def 1), 256 :: (load (s1024) from %ir.pIn.038.i.epil, !noalias !2)
SU(2): %55:hvxwr = V6_vshuffvdd %53:hvxvr, %52:hvxvr, %54:intregslow8
SU(3): V6_vS32b_ai %12:intregs, 0, %53:hvxvr :: (store (s1024) into %ir.src, !alias.scope !2)
SU(4): V6_vS32b_ai %14:intregs, 0, %55.vsub_hi:hvxwr :: (store (s1024) into %ir.dst, !alias.scope !2)
ExitSU: ENDLOOP0 %bb.4, implicit-def $pc, implicit-def $lc0, implicit $sa0, implicit $lc0
In this function, new loop-carried dependencies are added between all load/store pairs (SU(1)
, SU(3)
and SU(4)
, except for self loop edges). This is because noalias
metadata is removed by this patch, as shown the diff. The reason why I did this is that, the alias metadata may be valid only in the single iteration, not in the all iterations, if there is a @llvm.experimental.noalias.scope.decl
intrinsic call inside the loop. Please also refer to LoopAccessAnalysis
, where processes the same things.
llvm-project/llvm/lib/Analysis/LoopAccessAnalysis.cpp
Lines 718 to 740 in af4ec59
MemoryLocation adjustLoc(MemoryLocation Loc) const { | |
// The accessed location varies within the loop, but remains within the | |
// underlying object. | |
Loc.Size = LocationSize::beforeOrAfterPointer(); | |
Loc.AATags.Scope = adjustAliasScopeList(Loc.AATags.Scope); | |
Loc.AATags.NoAlias = adjustAliasScopeList(Loc.AATags.NoAlias); | |
return Loc; | |
} | |
/// Drop alias scopes that are only valid within a single loop iteration. | |
MDNode *adjustAliasScopeList(MDNode *ScopeList) const { | |
if (!ScopeList) | |
return nullptr; | |
// For the sake of simplicity, drop the whole scope list if any scope is | |
// iteration-local. | |
if (any_of(ScopeList->operands(), [&](Metadata *Scope) { | |
return LoopAliasScopes.contains(cast<MDNode>(Scope)); | |
})) | |
return nullptr; | |
return ScopeList; | |
} |
Unfortunately, calls to this intrinsic are eliminated when LLVM IR is lowered to Machine IR. This means that we cannot make precious decisions about whether the alias information is valid across iterations, so I've removed all the metadata here. However, this is too conservative and I think we can improve this analysis, e.g., if there are no @llvm.experimental.noalias.scope.decl
calls in the original IR, then stop removing the metadata. (It might be better to have separate patches for changes related to noalias
.)
Hi @aankit-ca . Let me ask you one question. If there is another patch, would it be possible for you to evaluate its impact on the benchmark as you did before? If so, I would like to try a different approach that (I believe) would reduce regression than this patch. |
Yes @kasuga-fj we can try benchmarking again on other patches too |
Ok, I'm working on the patch and will request it as soon as it's ready, thank you. |
In the current MachinePipeliner, several loop-carried edges are missed. It can result generating invalid code. At least following loop-carried dependencies can be missed.
This patch added these dependencies to fix correctness issues.
In addition, the current analysis may add excessive dependencies because loop-carried memory dependencies from bottom to top by are expressed by using dependencies in the forward direction (i.e., from top to bottom edge). This patch also removes such dependencies.
I tested performance changes with and without this patch. I used llvm-test-suite as test cases and checked the following:
As far as I have tested, there has been no significant performance impact. It's worth noting, however, that a huge performance degradation can occur when
TargetInstrInfo::getIncrementValue
.This is because loop-carried edges are added to each pair of unrolled memory instructions. For example, suppose a loop contains a store to a[i] and it's unrolled 4 times. In this case, the loop has store to a[i], a[i+1], a[i+2], and a[i+3]. If
getIncrementValue
isn't implemented, we cannot be sure that these stores are independent, so loop-carried dependencies are added against each pair of them. We can avoid this problem by disabling loop unrolling or implementing getIncrementValue. I don't think it makes much sense to enable both loop unrolling and software pipelining, so I believe disabling loop unrolling is not a big problem.Related: #109918