MachineScheduler: Improve instruction clustering #137784

Open: ruiling wants to merge 4 commits into base: main

Conversation

@ruiling (Contributor) commented Apr 29, 2025

Clustered nodes used to be managed by adding weak edges between the neighbouring cluster nodes, which forms a sort of ordered queue. The next node in the queue is later recorded as `NextClusterPred` or `NextClusterSucc` in `ScheduleDAGMI`.

But instructions may not be picked in the exact order of the queue. For example, suppose we have a queue of cluster nodes A B C. During scheduling, node B might be picked first, and then for Top-Down scheduling it is very likely that we only cluster B and C, leaving A alone.
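To make this concrete, here is a small self-contained toy (not LLVM code; it merely mimics the single next-cluster pointer described above) showing how picking B first strands A:

```
#include <cstdio>
#include <map>
#include <string>

int main() {
  // Weak cluster edges forming the ordered queue A -> B -> C.
  std::map<std::string, std::string> NextInQueue = {{"A", "B"}, {"B", "C"}};
  std::string NextClusterSucc; // the single pointer the old scheme maintained

  // Suppose the picker happens to take B first during top-down scheduling.
  for (std::string Picked : {"B", "C", "A"}) {
    bool Clustered = !NextClusterSucc.empty() && Picked == NextClusterSucc;
    std::printf("pick %s%s\n", Picked.c_str(),
                Clustered ? " (clustered with previous pick)" : "");
    // Releasing the picked node's weak edge advances the pointer.
    auto It = NextInQueue.find(Picked);
    NextClusterSucc = It != NextInQueue.end() ? It->second : "";
  }
  // Only B and C end up clustered; A is scheduled alone.
  return 0;
}
```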

Another issue is:

```
   if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum)
      std::swap(SUa, SUb);
   if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster)))
```

may break the cluster queue.

For example, suppose we want to cluster nodes 1 3 2 (in `MemOpRecords` order). Normally 1 (SUa) becomes a pred of 3 (SUb). But when it comes to the pair (3, 2), since 3 (SUa) > 2 (SUb), the two nodes are swapped, which makes 2 a pred of 3. Now both 1 and 2 are preds of 3, but there is no edge between 1 and 2, so we get a broken cluster chain.
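A tiny standalone snippet (toy code; only the swap condition is taken from the quoted logic) reproduces the resulting edge pattern:

```
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
  const bool ReorderWhileClustering = false;
  // Adjacent pairs tried in MemOpRecords order 1 3 2: (1,3), then (3,2).
  std::vector<std::pair<unsigned, unsigned>> Pairs = {{1, 3}, {3, 2}};
  for (auto [SUa, SUb] : Pairs) {
    if (!ReorderWhileClustering && SUa > SUb)
      std::swap(SUa, SUb); // (3,2) becomes (2,3)
    std::printf("cluster edge: SU(%u) -> SU(%u)\n", SUa, SUb);
  }
  // Prints 1 -> 3 and 2 -> 3: both 1 and 2 become preds of 3, with no edge
  // between 1 and 2, i.e. a broken chain instead of 1 -> 3 -> 2.
  return 0;
}
```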

To fix both issues, this change introduces an unordered set of clustered nodes. This should help improve clustering in some hard cases.
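Concretely, the patch models each cluster as a `ClusterInfo` (a `SmallSet<SUnit *, 8>`) and gives every `SUnit` a `ParentClusterIdx`, as shown in the diff below. A self-contained sketch of the idea, using `std::set` as a stand-in for `llvm::SmallSet`:

```
#include <cstdio>
#include <set>
#include <vector>

constexpr unsigned InvalidClusterId = ~0u;

struct SUnit {
  unsigned NodeNum;
  unsigned ParentClusterIdx = InvalidClusterId;
};

using ClusterInfo = std::set<SUnit *>; // stand-in for SmallSet<SUnit *, 8>

int main() {
  std::vector<SUnit> SUnits{{0}, {1}, {2}, {3}};
  std::vector<ClusterInfo> Clusters;

  // The DAG mutation puts SU(1), SU(2), SU(3) into one unordered group.
  ClusterInfo Group{&SUnits[1], &SUnits[2], &SUnits[3]};
  for (SUnit *SU : Group)
    SU->ParentClusterIdx = Clusters.size();
  Clusters.push_back(Group);

  // schedNode(): after picking SU(2), remember its whole cluster (if any)...
  const SUnit &Picked = SUnits[2];
  ClusterInfo *TopCluster = Picked.ParentClusterIdx != InvalidClusterId
                                ? &Clusters[Picked.ParentClusterIdx]
                                : nullptr;

  // ...and tryCandidate() only asks for membership, in any order.
  for (SUnit &SU : SUnits)
    std::printf("SU(%u) preferred by Cluster heuristic: %s\n", SU.NodeNum,
                TopCluster && TopCluster->count(&SU) ? "yes" : "no");
  return 0;
}
```

Because the preference is a membership test rather than a pointer comparison, it no longer depends on the order in which cluster members become ready.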

There are two major reasons why there are so many test check changes.

  1. The existing implementation has some buggy behavior: the scheduler does not reset the pointer to the next cluster candidate. For example, suppose we want to cluster A and B, but after picking A we pick node C instead. In theory, we should reset the next cluster candidate at that point, because we have already decided not to cluster A and B during scheduling; picking B later because of the Cluster reason is not logical (see the sketch after this list).

  2. As the cluster candidates are no longer ordered, they might be picked in a different order than before.
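For point 1, the new bookkeeping avoids the stale hint by construction: the remembered cluster is recomputed from whichever node was actually picked (`TopCluster = DAG->getCluster(SU->ParentClusterIdx)` in the diff below). A toy illustration (not LLVM code):

```
#include <cstdio>

constexpr unsigned InvalidClusterId = ~0u;

struct SUnit {
  unsigned NodeNum;
  unsigned ParentClusterIdx = InvalidClusterId;
};

int main() {
  SUnit A{0, 0}, B{1, 0}, C{2}; // A and B share cluster 0; C is unclustered
  // Pick order A, C, B: the hint is recomputed on *every* pick, so picking
  // the unrelated node C clears the leftover preference from picking A.
  for (const SUnit *SU : {&A, &C, &B}) {
    bool HaveClusterHint = SU->ParentClusterIdx != InvalidClusterId;
    std::printf("picked SU(%u): cluster hint %s\n", SU->NodeNum,
                HaveClusterHint ? "set" : "cleared");
  }
  return 0;
}
```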

The most affected targets are AMDGPU, AArch64, and RISCV.

For RISCV, most changes look like minor instruction reordering; I don't see obvious regressions.

For AArch64, some combining of ldr pairs into ldp was affected, with two cases regressed and two improved. The deeper reason is that the machine scheduler cannot cluster these loads well either before or after the change, and the later load-combine algorithm is not smart enough either.

For AMDGPU, some cases use more v_dual instructions while others regressed; this seems less critical. The test `v_vselect_v32bf16` gets more buffer_load instructions claused.

@llvmbot (Member) commented Apr 29, 2025

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-powerpc

Author: Ruiling, Song (ruiling)


Patch is 5.52 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/137784.diff

176 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachineScheduler.h (+6-8)
  • (modified) llvm/include/llvm/CodeGen/ScheduleDAG.h (+7)
  • (modified) llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h (+10)
  • (modified) llvm/lib/CodeGen/MachineScheduler.cpp (+52-23)
  • (modified) llvm/lib/CodeGen/MacroFusion.cpp (+13)
  • (modified) llvm/lib/CodeGen/ScheduleDAG.cpp (+3)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (+10-12)
  • (modified) llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp (+10-8)
  • (modified) llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/bcmp.ll (+4-3)
  • (modified) llvm/test/CodeGen/AArch64/expand-select.ll (+10-10)
  • (modified) llvm/test/CodeGen/AArch64/extbinopload.ll (+58-57)
  • (modified) llvm/test/CodeGen/AArch64/fcmp.ll (+17-17)
  • (modified) llvm/test/CodeGen/AArch64/fp-conversion-to-tbl.ll (+15-15)
  • (modified) llvm/test/CodeGen/AArch64/fptoi.ll (+70-70)
  • (modified) llvm/test/CodeGen/AArch64/fptoui-sat-vector.ll (+16-16)
  • (modified) llvm/test/CodeGen/AArch64/itofp.ll (+90-90)
  • (modified) llvm/test/CodeGen/AArch64/mul.ll (+12-12)
  • (modified) llvm/test/CodeGen/AArch64/nontemporal-load.ll (+9-8)
  • (modified) llvm/test/CodeGen/AArch64/nzcv-save.ll (+9-9)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-vector-llrint.ll (+43-43)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-vector-lrint.ll (+43-43)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-bitselect.ll (+47-47)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-convert.ll (+8-8)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-reduce.ll (+12-12)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-vselect.ll (+12-12)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-extends.ll (+80-82)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-mulh.ll (+12-12)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-reduce.ll (+16-16)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-to-fp.ll (+74-72)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-vselect.ll (+24-24)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-permute-zip-uzp-trn.ll (+42-42)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-ptest.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/vec_uaddo.ll (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/vec_umulo.ll (+4-4)
  • (modified) llvm/test/CodeGen/AArch64/vselect-ext.ll (+15-15)
  • (modified) llvm/test/CodeGen/AArch64/wide-scalar-shift-legalization.ll (+28-31)
  • (modified) llvm/test/CodeGen/AArch64/zext-to-tbl.ll (+55-54)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll (+27-27)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/ashr.ll (+11-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.ll (+30-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll (+321-314)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fshr.ll (+292-289)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.i16.ll (+12-11)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.i8.ll (+10-9)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.ll (+4-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/lshr.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll (+124-125)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/saddsat.ll (+40-39)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/sdivrem.ll (+19-19)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/ssubsat.ll (+49-49)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll (+24-24)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll (+69-69)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll (+24-24)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll (+18539-18522)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll (+14-12)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.256bit.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll (+134-134)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll (+3747-3714)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.576bit.ll (+107-127)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.704bit.ll (+173-183)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.960bit.ll (+423-414)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-cc.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-preserve-cc.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/bf16.ll (+1672-1693)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll (+9-12)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-memcpy.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/call-argument-types.ll (+15-15)
  • (modified) llvm/test/CodeGen/AMDGPU/carryout-selection.ll (+1-2)
  • (modified) llvm/test/CodeGen/AMDGPU/ctlz_zero_undef.ll (+42-44)
  • (modified) llvm/test/CodeGen/AMDGPU/cttz_zero_undef.ll (+43-45)
  • (modified) llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll (+21-21)
  • (modified) llvm/test/CodeGen/AMDGPU/ds-alignment.ll (+42-42)
  • (modified) llvm/test/CodeGen/AMDGPU/ds_read2.ll (+69-64)
  • (modified) llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.global.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f16.ll (+13-11)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f32.ll (+3-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fdiv.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fmed3.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-modifier-casting.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fp-classify.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/freeze.ll (+121-112)
  • (modified) llvm/test/CodeGen/AMDGPU/function-args-inreg.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+286-212)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll (+23-22)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll (+29-32)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/half.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/i1-to-bf16.ll (+16-16)
  • (modified) llvm/test/CodeGen/AMDGPU/idiv-licm.ll (+3-4)
  • (modified) llvm/test/CodeGen/AMDGPU/idot4s.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/idot4u.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+32-32)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_elt.ll (+18-18)
  • (modified) llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/kernel-args.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.atomic.ordered.add.b64.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.intersect_ray.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i64.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane64.ptr.ll (+9-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readfirstlane.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.writelane.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log.ll (+4-3)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log10.ll (+4-3)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maximum.f16.ll (+106-106)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maximum.f32.ll (+115-115)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll (+290-290)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minimum.f16.ll (+11-11)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minimum.f32.ll (+115-115)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll (+290-290)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.round.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i1.ll (+20-21)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i16.ll (+85-85)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i32.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i8.ll (+18-18)
  • (modified) llvm/test/CodeGen/AMDGPU/load-global-i16.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/load-global-i32.ll (+170-171)
  • (modified) llvm/test/CodeGen/AMDGPU/load-local-redundant-copies.ll (+21-21)
  • (modified) llvm/test/CodeGen/AMDGPU/load-local.128.ll (+34-34)
  • (modified) llvm/test/CodeGen/AMDGPU/load-local.96.ll (+25-25)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-lastuse-metadata.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-nontemporal-metadata.ll (+16-16)
  • (modified) llvm/test/CodeGen/AMDGPU/max.i16.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll (+58-60)
  • (modified) llvm/test/CodeGen/AMDGPU/memcpy-param-combinations.ll (+38-40)
  • (modified) llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll (+474-480)
  • (modified) llvm/test/CodeGen/AMDGPU/memmove-param-combinations.ll (+96-109)
  • (modified) llvm/test/CodeGen/AMDGPU/min.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/mul.ll (+18-18)
  • (modified) llvm/test/CodeGen/AMDGPU/narrow_math_for_and.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/or.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/permute_i8.ll (+57-57)
  • (modified) llvm/test/CodeGen/AMDGPU/pr51516.mir (+5-1)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll (+73-73)
  • (modified) llvm/test/CodeGen/AMDGPU/repeated-divisor.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/sdiv.ll (+96-96)
  • (modified) llvm/test/CodeGen/AMDGPU/select.f16.ll (+168-173)
  • (modified) llvm/test/CodeGen/AMDGPU/shl.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/sra.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/srem.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/srl.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/store-local.128.ll (+29-28)
  • (modified) llvm/test/CodeGen/AMDGPU/store-local.96.ll (+15-14)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.ll (+15-15)
  • (modified) llvm/test/CodeGen/AMDGPU/udivrem.ll (+4-4)
  • (modified) llvm/test/CodeGen/PowerPC/p10-fi-elim.ll (+2-2)
  • (modified) llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll (+34-34)
  • (modified) llvm/test/CodeGen/RISCV/abds-neg.ll (+30-30)
  • (modified) llvm/test/CodeGen/RISCV/abds.ll (+400-400)
  • (modified) llvm/test/CodeGen/RISCV/abdu-neg.ll (+26-26)
  • (modified) llvm/test/CodeGen/RISCV/add-before-shl.ll (+10-10)
  • (modified) llvm/test/CodeGen/RISCV/fold-mem-offset.ll (+8-8)
  • (modified) llvm/test/CodeGen/RISCV/legalize-fneg.ll (+5-5)
  • (modified) llvm/test/CodeGen/RISCV/memcmp-optsize.ll (+42-42)
  • (modified) llvm/test/CodeGen/RISCV/memcmp.ll (+42-42)
  • (modified) llvm/test/CodeGen/RISCV/rv32zbb.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-elen.ll (+17-17)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll (+148-148)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll (+5-5)
  • (modified) llvm/test/CodeGen/RISCV/rvv/pr125306.ll (+8-8)
  • (modified) llvm/test/CodeGen/RISCV/scmp.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/srem-vector-lkk.ll (+24-24)
  • (modified) llvm/test/CodeGen/RISCV/ucmp.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/unaligned-load-store.ll (+16-16)
  • (modified) llvm/test/CodeGen/RISCV/urem-vector-lkk.ll (+18-18)
  • (modified) llvm/test/CodeGen/RISCV/vararg.ll (+9-9)
  • (modified) llvm/test/CodeGen/RISCV/wide-scalar-shift-by-byte-multiple-legalization.ll (+359-359)
  • (modified) llvm/test/CodeGen/RISCV/wide-scalar-shift-legalization.ll (+201-201)
  • (modified) llvm/test/CodeGen/RISCV/xtheadmempair.ll (+7-7)
diff --git a/llvm/include/llvm/CodeGen/MachineScheduler.h b/llvm/include/llvm/CodeGen/MachineScheduler.h
index bc00d0b4ff852..14f3fda90ef6d 100644
--- a/llvm/include/llvm/CodeGen/MachineScheduler.h
+++ b/llvm/include/llvm/CodeGen/MachineScheduler.h
@@ -303,10 +303,6 @@ class ScheduleDAGMI : public ScheduleDAGInstrs {
   /// The bottom of the unscheduled zone.
   MachineBasicBlock::iterator CurrentBottom;
 
-  /// Record the next node in a scheduled cluster.
-  const SUnit *NextClusterPred = nullptr;
-  const SUnit *NextClusterSucc = nullptr;
-
 #if LLVM_ENABLE_ABI_BREAKING_CHECKS
   /// The number of instructions scheduled so far. Used to cut off the
   /// scheduler at the point determined by misched-cutoff.
@@ -367,10 +363,6 @@ class ScheduleDAGMI : public ScheduleDAGInstrs {
   /// live ranges and region boundary iterators.
   void moveInstruction(MachineInstr *MI, MachineBasicBlock::iterator InsertPos);
 
-  const SUnit *getNextClusterPred() const { return NextClusterPred; }
-
-  const SUnit *getNextClusterSucc() const { return NextClusterSucc; }
-
   void viewGraph(const Twine &Name, const Twine &Title) override;
   void viewGraph() override;
 
@@ -1292,6 +1284,9 @@ class GenericScheduler : public GenericSchedulerBase {
   SchedBoundary Top;
   SchedBoundary Bot;
 
+  ClusterInfo *TopCluster;
+  ClusterInfo *BotCluster;
+
   /// Candidate last picked from Top boundary.
   SchedCandidate TopCand;
   /// Candidate last picked from Bot boundary.
@@ -1332,6 +1327,9 @@ class PostGenericScheduler : public GenericSchedulerBase {
   /// Candidate last picked from Bot boundary.
   SchedCandidate BotCand;
 
+  ClusterInfo *TopCluster;
+  ClusterInfo *BotCluster;
+
 public:
   PostGenericScheduler(const MachineSchedContext *C)
       : GenericSchedulerBase(C), Top(SchedBoundary::TopQID, "TopQ"),
diff --git a/llvm/include/llvm/CodeGen/ScheduleDAG.h b/llvm/include/llvm/CodeGen/ScheduleDAG.h
index 1c8d92d149adc..a4301d11a4454 100644
--- a/llvm/include/llvm/CodeGen/ScheduleDAG.h
+++ b/llvm/include/llvm/CodeGen/ScheduleDAG.h
@@ -17,6 +17,7 @@
 
 #include "llvm/ADT/BitVector.h"
 #include "llvm/ADT/PointerIntPair.h"
+#include "llvm/ADT/SmallSet.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/iterator.h"
 #include "llvm/CodeGen/MachineInstr.h"
@@ -234,6 +235,10 @@ class TargetRegisterInfo;
     void dump(const TargetRegisterInfo *TRI = nullptr) const;
   };
 
+  /// Keep record of which SUnit are in the same cluster group.
+  typedef SmallSet<SUnit *, 8> ClusterInfo;
+  constexpr unsigned InvalidClusterId = ~0u;
+
   /// Scheduling unit. This is a node in the scheduling DAG.
   class SUnit {
   private:
@@ -274,6 +279,8 @@ class TargetRegisterInfo;
     unsigned TopReadyCycle = 0; ///< Cycle relative to start when node is ready.
     unsigned BotReadyCycle = 0; ///< Cycle relative to end when node is ready.
 
+    unsigned ParentClusterIdx = InvalidClusterId; ///< The parent cluster id.
+
   private:
     unsigned Depth = 0;  ///< Node depth.
     unsigned Height = 0; ///< Node height.
diff --git a/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h b/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
index e79b03c57a1e8..6c6bd8015ee69 100644
--- a/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
+++ b/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
@@ -180,6 +180,8 @@ namespace llvm {
     /// case of a huge region that gets reduced).
     SUnit *BarrierChain = nullptr;
 
+    SmallVector<ClusterInfo> Clusters;
+
   public:
     /// A list of SUnits, used in Value2SUsMap, during DAG construction.
     /// Note: to gain speed it might be worth investigating an optimized
@@ -383,6 +385,14 @@ namespace llvm {
     /// equivalent edge already existed (false indicates failure).
     bool addEdge(SUnit *SuccSU, const SDep &PredDep);
 
+    /// Returns the array of the clusters.
+    SmallVector<ClusterInfo> &getClusters() { return Clusters; }
+
+    /// Get the specific cluster, return nullptr for InvalidClusterId.
+    ClusterInfo *getCluster(unsigned Idx) {
+      return Idx != InvalidClusterId ? &Clusters[Idx] : nullptr;
+    }
+
   protected:
     void initSUnits();
     void addPhysRegDataDeps(SUnit *SU, unsigned OperIdx);
diff --git a/llvm/lib/CodeGen/MachineScheduler.cpp b/llvm/lib/CodeGen/MachineScheduler.cpp
index 0c3ffb1bbaa6f..91da22612eac6 100644
--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -15,6 +15,7 @@
 #include "llvm/ADT/ArrayRef.h"
 #include "llvm/ADT/BitVector.h"
 #include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/EquivalenceClasses.h"
 #include "llvm/ADT/PriorityQueue.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SmallVector.h"
@@ -844,8 +845,6 @@ void ScheduleDAGMI::releaseSucc(SUnit *SU, SDep *SuccEdge) {
 
   if (SuccEdge->isWeak()) {
     --SuccSU->WeakPredsLeft;
-    if (SuccEdge->isCluster())
-      NextClusterSucc = SuccSU;
     return;
   }
 #ifndef NDEBUG
@@ -881,8 +880,6 @@ void ScheduleDAGMI::releasePred(SUnit *SU, SDep *PredEdge) {
 
   if (PredEdge->isWeak()) {
     --PredSU->WeakSuccsLeft;
-    if (PredEdge->isCluster())
-      NextClusterPred = PredSU;
     return;
   }
 #ifndef NDEBUG
@@ -1077,11 +1074,8 @@ findRootsAndBiasEdges(SmallVectorImpl<SUnit*> &TopRoots,
 }
 
 /// Identify DAG roots and setup scheduler queues.
-void ScheduleDAGMI::initQueues(ArrayRef<SUnit*> TopRoots,
-                               ArrayRef<SUnit*> BotRoots) {
-  NextClusterSucc = nullptr;
-  NextClusterPred = nullptr;
-
+void ScheduleDAGMI::initQueues(ArrayRef<SUnit *> TopRoots,
+                               ArrayRef<SUnit *> BotRoots) {
   // Release all DAG roots for scheduling, not including EntrySU/ExitSU.
   //
   // Nodes with unreleased weak edges can still be roots.
@@ -2008,6 +2002,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
     ScheduleDAGInstrs *DAG) {
   // Keep track of the current cluster length and bytes for each SUnit.
   DenseMap<unsigned, std::pair<unsigned, unsigned>> SUnit2ClusterInfo;
+  EquivalenceClasses<SUnit *> Clusters;
 
   // At this point, `MemOpRecords` array must hold atleast two mem ops. Try to
   // cluster mem ops collected within `MemOpRecords` array.
@@ -2047,6 +2042,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
 
     SUnit *SUa = MemOpa.SU;
     SUnit *SUb = MemOpb.SU;
+
     if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum)
       std::swap(SUa, SUb);
 
@@ -2054,6 +2050,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
     if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster)))
       continue;
 
+    Clusters.unionSets(SUa, SUb);
     LLVM_DEBUG(dbgs() << "Cluster ld/st SU(" << SUa->NodeNum << ") - SU("
                       << SUb->NodeNum << ")\n");
     ++NumClustered;
@@ -2093,6 +2090,21 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
                       << ", Curr cluster bytes: " << CurrentClusterBytes
                       << "\n");
   }
+
+  // Add cluster group information.
+  // Iterate over all of the equivalence sets.
+  auto &AllClusters = DAG->getClusters();
+  for (auto &I : Clusters) {
+    if (!I->isLeader())
+      continue;
+    ClusterInfo Group;
+    unsigned ClusterIdx = AllClusters.size();
+    for (auto *MemberI : Clusters.members(*I)) {
+      MemberI->ParentClusterIdx = ClusterIdx;
+      Group.insert(MemberI);
+    }
+    AllClusters.push_back(Group);
+  }
 }
 
 void BaseMemOpClusterMutation::collectMemOpRecords(
@@ -3456,6 +3468,9 @@ void GenericScheduler::initialize(ScheduleDAGMI *dag) {
   }
   TopCand.SU = nullptr;
   BotCand.SU = nullptr;
+
+  TopCluster = nullptr;
+  BotCluster = nullptr;
 }
 
 /// Initialize the per-region scheduling policy.
@@ -3762,13 +3777,11 @@ bool GenericScheduler::tryCandidate(SchedCandidate &Cand,
   // This is a best effort to set things up for a post-RA pass. Optimizations
   // like generating loads of multiple registers should ideally be done within
   // the scheduler pass by combining the loads during DAG postprocessing.
-  const SUnit *CandNextClusterSU =
-    Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-    TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU,
-                 TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   if (SameBoundary) {
@@ -4015,11 +4028,25 @@ void GenericScheduler::reschedulePhysReg(SUnit *SU, bool isTop) {
 void GenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
   if (IsTopNode) {
     SU->TopReadyCycle = std::max(SU->TopReadyCycle, Top.getCurrCycle());
+    TopCluster = DAG->getCluster(SU->ParentClusterIdx);
+    LLVM_DEBUG(if (TopCluster) {
+      dbgs() << "  Top Cluster: ";
+      for (auto *N : *TopCluster)
+        dbgs() << N->NodeNum << '\t';
+      dbgs() << "\n";
+    });
     Top.bumpNode(SU);
     if (SU->hasPhysRegUses)
       reschedulePhysReg(SU, true);
   } else {
     SU->BotReadyCycle = std::max(SU->BotReadyCycle, Bot.getCurrCycle());
+    BotCluster = DAG->getCluster(SU->ParentClusterIdx);
+    LLVM_DEBUG(if (BotCluster) {
+      dbgs() << "  Bot Cluster: ";
+      for (auto *N : *BotCluster)
+        dbgs() << N->NodeNum << '\t';
+      dbgs() << "\n";
+    });
     Bot.bumpNode(SU);
     if (SU->hasPhysRegDefs)
       reschedulePhysReg(SU, false);
@@ -4076,6 +4103,8 @@ void PostGenericScheduler::initialize(ScheduleDAGMI *Dag) {
   if (!Bot.HazardRec) {
     Bot.HazardRec = DAG->TII->CreateTargetMIHazardRecognizer(Itin, DAG);
   }
+  TopCluster = nullptr;
+  BotCluster = nullptr;
 }
 
 void PostGenericScheduler::initPolicy(MachineBasicBlock::iterator Begin,
@@ -4137,14 +4166,12 @@ bool PostGenericScheduler::tryCandidate(SchedCandidate &Cand,
     return TryCand.Reason != NoCand;
 
   // Keep clustered nodes together.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
-
   // Avoid critical resource consumption and balance the schedule.
   if (tryLess(TryCand.ResDelta.CritResources, Cand.ResDelta.CritResources,
               TryCand, Cand, ResourceReduce))
@@ -4329,9 +4356,11 @@ SUnit *PostGenericScheduler::pickNode(bool &IsTopNode) {
 void PostGenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
   if (IsTopNode) {
     SU->TopReadyCycle = std::max(SU->TopReadyCycle, Top.getCurrCycle());
+    TopCluster = DAG->getCluster(SU->ParentClusterIdx);
     Top.bumpNode(SU);
   } else {
     SU->BotReadyCycle = std::max(SU->BotReadyCycle, Bot.getCurrCycle());
+    BotCluster = DAG->getCluster(SU->ParentClusterIdx);
     Bot.bumpNode(SU);
   }
 }
diff --git a/llvm/lib/CodeGen/MacroFusion.cpp b/llvm/lib/CodeGen/MacroFusion.cpp
index 5bd6ca0978a4b..c614e477a9d8f 100644
--- a/llvm/lib/CodeGen/MacroFusion.cpp
+++ b/llvm/lib/CodeGen/MacroFusion.cpp
@@ -61,6 +61,11 @@ bool llvm::fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,
   for (SDep &SI : SecondSU.Preds)
     if (SI.isCluster())
       return false;
+
+  unsigned FirstCluster = FirstSU.ParentClusterIdx;
+  unsigned SecondCluster = SecondSU.ParentClusterIdx;
+  assert(FirstCluster == InvalidClusterId && SecondCluster == InvalidClusterId);
+
   // Though the reachability checks above could be made more generic,
   // perhaps as part of ScheduleDAGInstrs::addEdge(), since such edges are valid,
   // the extra computation cost makes it less interesting in general cases.
@@ -70,6 +75,14 @@ bool llvm::fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,
   if (!DAG.addEdge(&SecondSU, SDep(&FirstSU, SDep::Cluster)))
     return false;
 
+  auto &Clusters = DAG.getClusters();
+
+  FirstSU.ParentClusterIdx = Clusters.size();
+  SecondSU.ParentClusterIdx = Clusters.size();
+
+  SmallSet<SUnit *, 8> Cluster{{&FirstSU, &SecondSU}};
+  Clusters.emplace_back(Cluster);
+
   // TODO - If we want to chain more than two instructions, we need to create
   // artifical edges to make dependencies from the FirstSU also dependent
   // on other chained instructions, and other chained instructions also
diff --git a/llvm/lib/CodeGen/ScheduleDAG.cpp b/llvm/lib/CodeGen/ScheduleDAG.cpp
index 26857edd871e2..e630b80e33ab4 100644
--- a/llvm/lib/CodeGen/ScheduleDAG.cpp
+++ b/llvm/lib/CodeGen/ScheduleDAG.cpp
@@ -365,6 +365,9 @@ LLVM_DUMP_METHOD void ScheduleDAG::dumpNodeName(const SUnit &SU) const {
 LLVM_DUMP_METHOD void ScheduleDAG::dumpNodeAll(const SUnit &SU) const {
   dumpNode(SU);
   SU.dumpAttributes();
+  if (SU.ParentClusterIdx != InvalidClusterId)
+    dbgs() << "  Parent Cluster Index: " << SU.ParentClusterIdx << '\n';
+
   if (SU.Preds.size() > 0) {
     dbgs() << "  Predecessors:\n";
     for (const SDep &Dep : SU.Preds) {
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 5678512748569..6c6c81ab2b4cc 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -584,12 +584,11 @@ bool GCNMaxILPSchedStrategy::tryCandidate(SchedCandidate &Cand,
   // This is a best effort to set things up for a post-RA pass. Optimizations
   // like generating loads of multiple registers should ideally be done within
   // the scheduler pass by combining the loads during DAG postprocessing.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   // Avoid increasing the max critical pressure in the scheduled region.
@@ -659,12 +658,11 @@ bool GCNMaxMemoryClauseSchedStrategy::tryCandidate(SchedCandidate &Cand,
 
   // MaxMemoryClause-specific: We prioritize clustered instructions as we would
   // get more benefit from clausing these memory instructions.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   // We only compare a subset of features when comparing nodes between
diff --git a/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp b/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp
index 03712879f7c49..5eb1f0128643d 100644
--- a/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp
+++ b/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp
@@ -100,12 +100,11 @@ bool PPCPreRASchedStrategy::tryCandidate(SchedCandidate &Cand,
   // This is a best effort to set things up for a post-RA pass. Optimizations
   // like generating loads of multiple registers should ideally be done within
   // the scheduler pass by combining the loads during DAG postprocessing.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   if (SameBoundary) {
@@ -190,8 +189,11 @@ bool PPCPostRASchedStrategy::tryCandidate(SchedCandidate &Cand,
     return TryCand.Reason != NoCand;
 
   // Keep clustered nodes together.
-  if (tryGreater(TryCand.SU == DAG->getNextClusterSucc(),
-                 Cand.SU == DAG->getNextClusterSucc(), TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   // Avoid critical resource consumption and balance the schedule.
diff --git a/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll b/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll
index b944194dae8fc..f9176bc9d3fa5 100644
--- a/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll
+++ b/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll
@@ -477,9 +477,8 @@ define void @callee_in_memory(%T_IN_MEMORY %a) {
 ; CHECK-NEXT:    add x8, x8, :lo12:in_memory_store
 ; CHECK-NEXT:    ldr d0, [sp, #64]
 ; CHECK-NEXT:    str d0, [x8, #64]
-; CHECK-NEXT:    ldr q0, [sp, #16]
 ; CHECK-NEXT:    str q2, [x8, #48]
-; CHECK-NEXT:    ldr q2, [sp]
+; CHECK-NEXT:    ldp q2, q0, [sp]
 ; CHECK-NEXT:    stp q0, q1, [x8, #16]
 ; CHECK-NEXT:    str q2, [x8]
 ; CHECK-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll b/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll
index 7e72e8de01f4f..3bada9d5b3bb4 100644
--- a/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll
@@ -7,8 +7,8 @@
 
 ; CHECK-LABEL: @test
 ; CHECK: add [[BASE:x[0-9]+]], x0, x1, lsl #3
-; CHECK: ldp [[CPLX1_I:s[0-9]+]], [[CPLX1_R:s[0-9]+]], [[[BASE]]]
-; CHECK: ldp [[CPLX2_I:s[0-9]+]], [[CPLX2_R:s[0-9]+]], [[[BASE]], #64]
+; CHECK-DAG: ldp [[CPLX1_I:s[0-9]+]], [[CPLX1_R:s[0-9]+]], [[[BASE]]]
+; CHECK-DAG: ldp [[CPLX2_I:s[0-9]+]], [[CPLX2_R:s[0-9]+]], [[[BASE]], #64]
 ; CHECK: fadd {{s[0-9]+}}, [[CPLX2_I]], [[CPLX1_I]]
 ; CHECK: fadd {{s[0-9]+}}, [[CPLX2_R]], [[CPLX1_R]]
 ; CHECK: ret
@@ -36,8 +36,8 @@ entry:
 
 ; CHECK-LABEL: @test_int
 ; CHECK: add [[BASE:x[0-9]+]], x0, x1, lsl #3
-; CHECK: ldp [[CPLX1_I:w[0-9]+]], [[CPLX1_R:w[0-9]+]], [[[BASE]]]
-; CHECK: ldp [[CPLX2_I:w[0-9]+]], [[CPLX2_R:w[0-9]+]], [[[BASE]], #64]
+; CHECK-DAG: ldp [[CPLX1_I:w[0-9]+]], [[CPLX1_R:w[0-9]+]], [[[BASE]]]
+; CHECK-DAG: ldp [[CPLX2_I:w[0-9]+]], [[CPLX2_R:w[0-9]+]], [[[BASE]], #64]
 ; CHECK: add {{w[0-9]+}}, [[CPLX2_I]], [[CPLX1_I]]
 ; CHECK: add {{w[0-9]+}}, [[CPLX2_R]], [[CPLX1_R]]
 ; CHECK: ret
@@ -65,8 +65,8 @@ entry:
 
 ; CHECK-LABEL: @test_long
 ; CHECK: add [[BASE:x[0-9]+]], x0, x1, lsl #4
-; CHECK: ldp [[CPLX1_I:x[0-9]+]], [[CPLX1_R:x[0-9]+]], [[[BASE]]]
-; CHECK: ldp [[CPLX2_I:x[0-9]+]], [[CPLX2_R:x[0-9]+]], [[[BASE]], #128]
+; CHEC...
[truncated]

@llvmbot (Member) commented Apr 29, 2025

@llvm/pr-subscribers-backend-aarch64

--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -15,6 +15,7 @@
 #include "llvm/ADT/ArrayRef.h"
 #include "llvm/ADT/BitVector.h"
 #include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/EquivalenceClasses.h"
 #include "llvm/ADT/PriorityQueue.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SmallVector.h"
@@ -844,8 +845,6 @@ void ScheduleDAGMI::releaseSucc(SUnit *SU, SDep *SuccEdge) {
 
   if (SuccEdge->isWeak()) {
     --SuccSU->WeakPredsLeft;
-    if (SuccEdge->isCluster())
-      NextClusterSucc = SuccSU;
     return;
   }
 #ifndef NDEBUG
@@ -881,8 +880,6 @@ void ScheduleDAGMI::releasePred(SUnit *SU, SDep *PredEdge) {
 
   if (PredEdge->isWeak()) {
     --PredSU->WeakSuccsLeft;
-    if (PredEdge->isCluster())
-      NextClusterPred = PredSU;
     return;
   }
 #ifndef NDEBUG
@@ -1077,11 +1074,8 @@ findRootsAndBiasEdges(SmallVectorImpl<SUnit*> &TopRoots,
 }
 
 /// Identify DAG roots and setup scheduler queues.
-void ScheduleDAGMI::initQueues(ArrayRef<SUnit*> TopRoots,
-                               ArrayRef<SUnit*> BotRoots) {
-  NextClusterSucc = nullptr;
-  NextClusterPred = nullptr;
-
+void ScheduleDAGMI::initQueues(ArrayRef<SUnit *> TopRoots,
+                               ArrayRef<SUnit *> BotRoots) {
   // Release all DAG roots for scheduling, not including EntrySU/ExitSU.
   //
   // Nodes with unreleased weak edges can still be roots.
@@ -2008,6 +2002,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
     ScheduleDAGInstrs *DAG) {
   // Keep track of the current cluster length and bytes for each SUnit.
   DenseMap<unsigned, std::pair<unsigned, unsigned>> SUnit2ClusterInfo;
+  EquivalenceClasses<SUnit *> Clusters;
 
   // At this point, `MemOpRecords` array must hold atleast two mem ops. Try to
   // cluster mem ops collected within `MemOpRecords` array.
@@ -2047,6 +2042,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
 
     SUnit *SUa = MemOpa.SU;
     SUnit *SUb = MemOpb.SU;
+
     if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum)
       std::swap(SUa, SUb);
 
@@ -2054,6 +2050,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
     if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster)))
       continue;
 
+    Clusters.unionSets(SUa, SUb);
     LLVM_DEBUG(dbgs() << "Cluster ld/st SU(" << SUa->NodeNum << ") - SU("
                       << SUb->NodeNum << ")\n");
     ++NumClustered;
@@ -2093,6 +2090,21 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
                       << ", Curr cluster bytes: " << CurrentClusterBytes
                       << "\n");
   }
+
+  // Add cluster group information.
+  // Iterate over all of the equivalence sets.
+  auto &AllClusters = DAG->getClusters();
+  for (auto &I : Clusters) {
+    if (!I->isLeader())
+      continue;
+    ClusterInfo Group;
+    unsigned ClusterIdx = AllClusters.size();
+    for (auto *MemberI : Clusters.members(*I)) {
+      MemberI->ParentClusterIdx = ClusterIdx;
+      Group.insert(MemberI);
+    }
+    AllClusters.push_back(Group);
+  }
 }
 
 void BaseMemOpClusterMutation::collectMemOpRecords(
@@ -3456,6 +3468,9 @@ void GenericScheduler::initialize(ScheduleDAGMI *dag) {
   }
   TopCand.SU = nullptr;
   BotCand.SU = nullptr;
+
+  TopCluster = nullptr;
+  BotCluster = nullptr;
 }
 
 /// Initialize the per-region scheduling policy.
@@ -3762,13 +3777,11 @@ bool GenericScheduler::tryCandidate(SchedCandidate &Cand,
   // This is a best effort to set things up for a post-RA pass. Optimizations
   // like generating loads of multiple registers should ideally be done within
   // the scheduler pass by combining the loads during DAG postprocessing.
-  const SUnit *CandNextClusterSU =
-    Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-    TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU,
-                 TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   if (SameBoundary) {
@@ -4015,11 +4028,25 @@ void GenericScheduler::reschedulePhysReg(SUnit *SU, bool isTop) {
 void GenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
   if (IsTopNode) {
     SU->TopReadyCycle = std::max(SU->TopReadyCycle, Top.getCurrCycle());
+    TopCluster = DAG->getCluster(SU->ParentClusterIdx);
+    LLVM_DEBUG(if (TopCluster) {
+      dbgs() << "  Top Cluster: ";
+      for (auto *N : *TopCluster)
+        dbgs() << N->NodeNum << '\t';
+      dbgs() << "\n";
+    });
     Top.bumpNode(SU);
     if (SU->hasPhysRegUses)
       reschedulePhysReg(SU, true);
   } else {
     SU->BotReadyCycle = std::max(SU->BotReadyCycle, Bot.getCurrCycle());
+    BotCluster = DAG->getCluster(SU->ParentClusterIdx);
+    LLVM_DEBUG(if (BotCluster) {
+      dbgs() << "  Bot Cluster: ";
+      for (auto *N : *BotCluster)
+        dbgs() << N->NodeNum << '\t';
+      dbgs() << "\n";
+    });
     Bot.bumpNode(SU);
     if (SU->hasPhysRegDefs)
       reschedulePhysReg(SU, false);
@@ -4076,6 +4103,8 @@ void PostGenericScheduler::initialize(ScheduleDAGMI *Dag) {
   if (!Bot.HazardRec) {
     Bot.HazardRec = DAG->TII->CreateTargetMIHazardRecognizer(Itin, DAG);
   }
+  TopCluster = nullptr;
+  BotCluster = nullptr;
 }
 
 void PostGenericScheduler::initPolicy(MachineBasicBlock::iterator Begin,
@@ -4137,14 +4166,12 @@ bool PostGenericScheduler::tryCandidate(SchedCandidate &Cand,
     return TryCand.Reason != NoCand;
 
   // Keep clustered nodes together.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
-
   // Avoid critical resource consumption and balance the schedule.
   if (tryLess(TryCand.ResDelta.CritResources, Cand.ResDelta.CritResources,
               TryCand, Cand, ResourceReduce))
@@ -4329,9 +4356,11 @@ SUnit *PostGenericScheduler::pickNode(bool &IsTopNode) {
 void PostGenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
   if (IsTopNode) {
     SU->TopReadyCycle = std::max(SU->TopReadyCycle, Top.getCurrCycle());
+    TopCluster = DAG->getCluster(SU->ParentClusterIdx);
     Top.bumpNode(SU);
   } else {
     SU->BotReadyCycle = std::max(SU->BotReadyCycle, Bot.getCurrCycle());
+    BotCluster = DAG->getCluster(SU->ParentClusterIdx);
     Bot.bumpNode(SU);
   }
 }
diff --git a/llvm/lib/CodeGen/MacroFusion.cpp b/llvm/lib/CodeGen/MacroFusion.cpp
index 5bd6ca0978a4b..c614e477a9d8f 100644
--- a/llvm/lib/CodeGen/MacroFusion.cpp
+++ b/llvm/lib/CodeGen/MacroFusion.cpp
@@ -61,6 +61,11 @@ bool llvm::fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,
   for (SDep &SI : SecondSU.Preds)
     if (SI.isCluster())
       return false;
+
+  unsigned FirstCluster = FirstSU.ParentClusterIdx;
+  unsigned SecondCluster = SecondSU.ParentClusterIdx;
+  assert(FirstCluster == InvalidClusterId && SecondCluster == InvalidClusterId);
+
   // Though the reachability checks above could be made more generic,
   // perhaps as part of ScheduleDAGInstrs::addEdge(), since such edges are valid,
   // the extra computation cost makes it less interesting in general cases.
@@ -70,6 +75,14 @@ bool llvm::fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,
   if (!DAG.addEdge(&SecondSU, SDep(&FirstSU, SDep::Cluster)))
     return false;
 
+  auto &Clusters = DAG.getClusters();
+
+  FirstSU.ParentClusterIdx = Clusters.size();
+  SecondSU.ParentClusterIdx = Clusters.size();
+
+  SmallSet<SUnit *, 8> Cluster{{&FirstSU, &SecondSU}};
+  Clusters.emplace_back(Cluster);
+
   // TODO - If we want to chain more than two instructions, we need to create
   // artifical edges to make dependencies from the FirstSU also dependent
   // on other chained instructions, and other chained instructions also
diff --git a/llvm/lib/CodeGen/ScheduleDAG.cpp b/llvm/lib/CodeGen/ScheduleDAG.cpp
index 26857edd871e2..e630b80e33ab4 100644
--- a/llvm/lib/CodeGen/ScheduleDAG.cpp
+++ b/llvm/lib/CodeGen/ScheduleDAG.cpp
@@ -365,6 +365,9 @@ LLVM_DUMP_METHOD void ScheduleDAG::dumpNodeName(const SUnit &SU) const {
 LLVM_DUMP_METHOD void ScheduleDAG::dumpNodeAll(const SUnit &SU) const {
   dumpNode(SU);
   SU.dumpAttributes();
+  if (SU.ParentClusterIdx != InvalidClusterId)
+    dbgs() << "  Parent Cluster Index: " << SU.ParentClusterIdx << '\n';
+
   if (SU.Preds.size() > 0) {
     dbgs() << "  Predecessors:\n";
     for (const SDep &Dep : SU.Preds) {
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 5678512748569..6c6c81ab2b4cc 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -584,12 +584,11 @@ bool GCNMaxILPSchedStrategy::tryCandidate(SchedCandidate &Cand,
   // This is a best effort to set things up for a post-RA pass. Optimizations
   // like generating loads of multiple registers should ideally be done within
   // the scheduler pass by combining the loads during DAG postprocessing.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   // Avoid increasing the max critical pressure in the scheduled region.
@@ -659,12 +658,11 @@ bool GCNMaxMemoryClauseSchedStrategy::tryCandidate(SchedCandidate &Cand,
 
   // MaxMemoryClause-specific: We prioritize clustered instructions as we would
   // get more benefit from clausing these memory instructions.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   // We only compare a subset of features when comparing nodes between
diff --git a/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp b/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp
index 03712879f7c49..5eb1f0128643d 100644
--- a/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp
+++ b/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp
@@ -100,12 +100,11 @@ bool PPCPreRASchedStrategy::tryCandidate(SchedCandidate &Cand,
   // This is a best effort to set things up for a post-RA pass. Optimizations
   // like generating loads of multiple registers should ideally be done within
   // the scheduler pass by combining the loads during DAG postprocessing.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   if (SameBoundary) {
@@ -190,8 +189,11 @@ bool PPCPostRASchedStrategy::tryCandidate(SchedCandidate &Cand,
     return TryCand.Reason != NoCand;
 
   // Keep clustered nodes together.
-  if (tryGreater(TryCand.SU == DAG->getNextClusterSucc(),
-                 Cand.SU == DAG->getNextClusterSucc(), TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   // Avoid critical resource consumption and balance the schedule.
diff --git a/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll b/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll
index b944194dae8fc..f9176bc9d3fa5 100644
--- a/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll
+++ b/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll
@@ -477,9 +477,8 @@ define void @callee_in_memory(%T_IN_MEMORY %a) {
 ; CHECK-NEXT:    add x8, x8, :lo12:in_memory_store
 ; CHECK-NEXT:    ldr d0, [sp, #64]
 ; CHECK-NEXT:    str d0, [x8, #64]
-; CHECK-NEXT:    ldr q0, [sp, #16]
 ; CHECK-NEXT:    str q2, [x8, #48]
-; CHECK-NEXT:    ldr q2, [sp]
+; CHECK-NEXT:    ldp q2, q0, [sp]
 ; CHECK-NEXT:    stp q0, q1, [x8, #16]
 ; CHECK-NEXT:    str q2, [x8]
 ; CHECK-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll b/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll
index 7e72e8de01f4f..3bada9d5b3bb4 100644
--- a/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll
@@ -7,8 +7,8 @@
 
 ; CHECK-LABEL: @test
 ; CHECK: add [[BASE:x[0-9]+]], x0, x1, lsl #3
-; CHECK: ldp [[CPLX1_I:s[0-9]+]], [[CPLX1_R:s[0-9]+]], [[[BASE]]]
-; CHECK: ldp [[CPLX2_I:s[0-9]+]], [[CPLX2_R:s[0-9]+]], [[[BASE]], #64]
+; CHECK-DAG: ldp [[CPLX1_I:s[0-9]+]], [[CPLX1_R:s[0-9]+]], [[[BASE]]]
+; CHECK-DAG: ldp [[CPLX2_I:s[0-9]+]], [[CPLX2_R:s[0-9]+]], [[[BASE]], #64]
 ; CHECK: fadd {{s[0-9]+}}, [[CPLX2_I]], [[CPLX1_I]]
 ; CHECK: fadd {{s[0-9]+}}, [[CPLX2_R]], [[CPLX1_R]]
 ; CHECK: ret
@@ -36,8 +36,8 @@ entry:
 
 ; CHECK-LABEL: @test_int
 ; CHECK: add [[BASE:x[0-9]+]], x0, x1, lsl #3
-; CHECK: ldp [[CPLX1_I:w[0-9]+]], [[CPLX1_R:w[0-9]+]], [[[BASE]]]
-; CHECK: ldp [[CPLX2_I:w[0-9]+]], [[CPLX2_R:w[0-9]+]], [[[BASE]], #64]
+; CHECK-DAG: ldp [[CPLX1_I:w[0-9]+]], [[CPLX1_R:w[0-9]+]], [[[BASE]]]
+; CHECK-DAG: ldp [[CPLX2_I:w[0-9]+]], [[CPLX2_R:w[0-9]+]], [[[BASE]], #64]
 ; CHECK: add {{w[0-9]+}}, [[CPLX2_I]], [[CPLX1_I]]
 ; CHECK: add {{w[0-9]+}}, [[CPLX2_R]], [[CPLX1_R]]
 ; CHECK: ret
@@ -65,8 +65,8 @@ entry:
 
 ; CHECK-LABEL: @test_long
 ; CHECK: add [[BASE:x[0-9]+]], x0, x1, lsl #4
-; CHECK: ldp [[CPLX1_I:x[0-9]+]], [[CPLX1_R:x[0-9]+]], [[[BASE]]]
-; CHECK: ldp [[CPLX2_I:x[0-9]+]], [[CPLX2_R:x[0-9]+]], [[[BASE]], #128]
+; CHEC...
[truncated]
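
Condensed from the hunks above, here is a minimal sketch of how the new pieces are meant to interact, assuming a tree with this patch applied. `Strategy` is a simplified stand-in for `GenericScheduler`/`PostGenericScheduler`, which carry far more state; everything else (`ClusterInfo`, `getCluster`, `ParentClusterIdx`) comes from the patch itself.

```
#include "llvm/CodeGen/MachineScheduler.h"
using namespace llvm;

// Sketch only: "Strategy" is illustrative, not a real LLVM class.
struct Strategy {
  ScheduleDAGMI *DAG = nullptr;
  ClusterInfo *TopCluster = nullptr; // cluster of the last top-down pick
  ClusterInfo *BotCluster = nullptr; // cluster of the last bottom-up pick

  // After each pick, record the picked node's cluster group.
  // getCluster() returns nullptr for InvalidClusterId, so picking a
  // node outside any cluster clears the active cluster automatically,
  // which is exactly the "reset" the old pointer-based scheme missed.
  void schedNode(SUnit *SU, bool IsTopNode) {
    if (IsTopNode)
      TopCluster = DAG->getCluster(SU->ParentClusterIdx);
    else
      BotCluster = DAG->getCluster(SU->ParentClusterIdx);
  }

  // Cluster preference becomes a set-membership test against the
  // active cluster rather than a comparison with a single "next"
  // node, so any member of the group may be picked next.
  bool prefersForCluster(SUnit *SU, bool AtTop) const {
    const ClusterInfo *Active = AtTop ? TopCluster : BotCluster;
    return Active && Active->contains(SU);
  }
};
```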

@llvmbot
Copy link
Member

llvmbot commented Apr 29, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Ruiling, Song (ruiling)

Patch is 5.52 MiB, truncated to 20.00 KiB above, full version: https://github.com/llvm/llvm-project/pull/137784.diff

176 Files Affected.

@jayfoad
Copy link
Contributor

jayfoad commented Apr 29, 2025

The existing implementation has some buggy behavior: the scheduler does not reset the pointer to the next cluster candidate. For example, we want to cluster A and B, but after picking A, we might pick node C. In theory, we should reset the next cluster candidate here, because we have decided not to cluster A and B during scheduling. Later picking B because of Cluster is not logical.

Could you fix that bug first in a separate PR? Then this PR should have fewer test check changes.
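
For illustration, the reset being requested would look roughly like the following against the pre-patch design; `NextClusterPred`/`NextClusterSucc` are the `ScheduleDAGMI` members this PR deletes (passed by reference here because the real fields are private), and no claim is made that the separate fix takes exactly this shape.

```
#include "llvm/CodeGen/ScheduleDAG.h"
using llvm::SUnit;

// Sketch only: drop a stale cluster candidate as soon as the scheduler
// picks something else, so B is not later preferred "because of
// Cluster" after the pairing with A was already abandoned.
static void resetStaleClusterCandidate(const SUnit *Picked, bool IsTopNode,
                                       const SUnit *&NextClusterPred,
                                       const SUnit *&NextClusterSucc) {
  if (IsTopNode && Picked != NextClusterSucc)
    NextClusterSucc = nullptr;
  if (!IsTopNode && Picked != NextClusterPred)
    NextClusterPred = nullptr;
}
```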

@jayfoad
Copy link
Contributor

jayfoad commented Apr 29, 2025

For AMDGPU, some cases use more v_dual instructions while others regress. It seems less critical.

Agreed, this is probably just due to good luck or bad luck in register allocation. We could put -amdgpu-enable-vopd=0 on these tests.
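
For reference, pinning down such a test would just mean adding the flag to its RUN line. A hypothetical example follows; the triple and `-mcpu` are illustrative and not taken from any file in this diff.

```
; Hypothetical RUN line: -amdgpu-enable-vopd=0 disables v_dual_* pairing
; so the checks exercise load/store clustering rather than VOPD luck.
; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -amdgpu-enable-vopd=0 < %s | FileCheck %s
```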

@@ -61,6 +61,11 @@ bool llvm::fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,
  for (SDep &SI : SecondSU.Preds)
    if (SI.isCluster())
      return false;

Copy link
Contributor


I think the MacroFusion changes have no effect; do I understand it correctly?

Copy link
Contributor Author


It is a little bit confusing here. The "cluster" weak edge no longer functions as it did before. I thought about removing these edges at first, but then it is hard to detect cases like:

  // Check that neither instr is already paired with another along the edge
  // between them.

We now only keep the set of SUnits in a cluster group, with no ordering anymore, so we cannot detect "the edge between them". I am not sure whether it is important to check for this pattern. Maybe we just need to check that the two nodes are not being clustered with anything else? I don't have a clear answer yet. So I still keep the "cluster" weak edge, but which nodes get clustered during scheduling is now determined only by the Clusters in the DAG. Maybe we can clean this up later if people have a clear idea of how best to handle this.
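
A sketch of the simpler membership check suggested above, assuming the `ParentClusterIdx`/`InvalidClusterId` bookkeeping this patch adds. Note it is stricter than the old edge scan (it rejects fusion when either node is already in any cluster group, matching the assert added in MacroFusion.cpp) and is not what the PR currently does.

```
#include "llvm/CodeGen/ScheduleDAG.h"

// Sketch only: replace the Cluster-edge scan with a group-membership
// test. InvalidClusterId marks a node that belongs to no cluster.
static bool neitherAlreadyClustered(const llvm::SUnit &FirstSU,
                                    const llvm::SUnit &SecondSU) {
  return FirstSU.ParentClusterIdx == llvm::InvalidClusterId &&
         SecondSU.ParentClusterIdx == llvm::InvalidClusterId;
}
```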

@ruiling
Copy link
Contributor Author

ruiling commented May 12, 2025

Could you fix that bug first in a separate PR? Then this PR should have fewer test check changes.

See #139513. After that change, the diff here drops from "176 files changed, 31670 insertions(+), 31540 deletions(-)" to "101 files changed, 25804 insertions(+), 25670 deletions(-)".
