MachineScheduler: Improve instruction clustering #137784

Open: ruiling wants to merge 4 commits into base: main

Conversation

@ruiling (Contributor) commented Apr 29, 2025

Clustered nodes used to be managed by adding weak edges between the neighbouring cluster nodes, which forms a sort of ordered queue. The next node in the queue is later recorded as `NextClusterPred` or `NextClusterSucc` in `ScheduleDAGMI`.

But instructions may not be picked in the exact order of the queue. For example, suppose we have a queue of cluster nodes A B C. During scheduling, node B might be picked first, and then for Top-Down scheduling it is very likely that we only cluster B and C, leaving A alone.
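To make this concrete, here is a small self-contained toy (not LLVM code; it merely mimics the single next-cluster pointer described above) showing how picking B first strands A:

```
#include <cstdio>
#include <map>
#include <string>

int main() {
  // Weak cluster edges forming the ordered queue A -> B -> C.
  std::map<std::string, std::string> NextInQueue = {{"A", "B"}, {"B", "C"}};
  std::string NextClusterSucc; // the single pointer the old scheme maintained

  // Suppose the picker happens to take B first during top-down scheduling.
  for (std::string Picked : {"B", "C", "A"}) {
    bool Clustered = !NextClusterSucc.empty() && Picked == NextClusterSucc;
    std::printf("pick %s%s\n", Picked.c_str(),
                Clustered ? " (clustered with previous pick)" : "");
    // Releasing the picked node's weak edge advances the pointer.
    auto It = NextInQueue.find(Picked);
    NextClusterSucc = It != NextInQueue.end() ? It->second : "";
  }
  // Only B and C end up clustered; A is scheduled alone.
  return 0;
}
```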

Another issue is:

```
   if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum)
      std::swap(SUa, SUb);
   if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster)))
```

may break the cluster queue.

For example, suppose we want to cluster nodes 1 3 2 (in `MemOpRecords` order). Normally 1 (SUa) becomes a pred of 3 (SUb). But when it comes to the pair (3, 2), since 3 (SUa) > 2 (SUb), the two nodes are swapped, which makes 2 a pred of 3. Now both 1 and 2 are preds of 3, but there is no edge between 1 and 2, so we get a broken cluster chain.
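A tiny standalone snippet (toy code; only the swap condition is taken from the quoted logic) reproduces the resulting edge pattern:

```
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
  const bool ReorderWhileClustering = false;
  // Adjacent pairs tried in MemOpRecords order 1 3 2: (1,3), then (3,2).
  std::vector<std::pair<unsigned, unsigned>> Pairs = {{1, 3}, {3, 2}};
  for (auto [SUa, SUb] : Pairs) {
    if (!ReorderWhileClustering && SUa > SUb)
      std::swap(SUa, SUb); // (3,2) becomes (2,3)
    std::printf("cluster edge: SU(%u) -> SU(%u)\n", SUa, SUb);
  }
  // Prints 1 -> 3 and 2 -> 3: both 1 and 2 become preds of 3, with no edge
  // between 1 and 2, i.e. a broken chain instead of 1 -> 3 -> 2.
  return 0;
}
```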

To fix both issues, this change introduces an unordered set of clustered nodes. This should help improve clustering in some hard cases.
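Concretely, the patch models each cluster as a `ClusterInfo` (a `SmallSet<SUnit *, 8>`) and gives every `SUnit` a `ParentClusterIdx`, as shown in the diff below. A self-contained sketch of the idea, using `std::set` as a stand-in for `llvm::SmallSet`:

```
#include <cstdio>
#include <set>
#include <vector>

constexpr unsigned InvalidClusterId = ~0u;

struct SUnit {
  unsigned NodeNum;
  unsigned ParentClusterIdx = InvalidClusterId;
};

using ClusterInfo = std::set<SUnit *>; // stand-in for SmallSet<SUnit *, 8>

int main() {
  std::vector<SUnit> SUnits{{0}, {1}, {2}, {3}};
  std::vector<ClusterInfo> Clusters;

  // The DAG mutation puts SU(1), SU(2), SU(3) into one unordered group.
  ClusterInfo Group{&SUnits[1], &SUnits[2], &SUnits[3]};
  for (SUnit *SU : Group)
    SU->ParentClusterIdx = Clusters.size();
  Clusters.push_back(Group);

  // schedNode(): after picking SU(2), remember its whole cluster (if any)...
  const SUnit &Picked = SUnits[2];
  ClusterInfo *TopCluster = Picked.ParentClusterIdx != InvalidClusterId
                                ? &Clusters[Picked.ParentClusterIdx]
                                : nullptr;

  // ...and tryCandidate() only asks for membership, in any order.
  for (SUnit &SU : SUnits)
    std::printf("SU(%u) preferred by Cluster heuristic: %s\n", SU.NodeNum,
                TopCluster && TopCluster->count(&SU) ? "yes" : "no");
  return 0;
}
```

Because the preference is a membership test rather than a pointer comparison, it no longer depends on the order in which cluster members become ready.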

There are two major reasons why there are so many test check changes.

  1. The existing implementation has some buggy behavior: the scheduler does not reset the pointer to the next cluster candidate. For example, suppose we want to cluster A and B, but after picking A we pick node C instead. In theory, we should reset the next cluster candidate at that point, because we have already decided not to cluster A and B during scheduling; picking B later because of the Cluster reason is not logical (see the sketch after this list).

  2. As the cluster candidates are no longer ordered, they might be picked in a different order than before.
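For point 1, the new bookkeeping avoids the stale hint by construction: the remembered cluster is recomputed from whichever node was actually picked (`TopCluster = DAG->getCluster(SU->ParentClusterIdx)` in the diff below). A toy illustration (not LLVM code):

```
#include <cstdio>

constexpr unsigned InvalidClusterId = ~0u;

struct SUnit {
  unsigned NodeNum;
  unsigned ParentClusterIdx = InvalidClusterId;
};

int main() {
  SUnit A{0, 0}, B{1, 0}, C{2}; // A and B share cluster 0; C is unclustered
  // Pick order A, C, B: the hint is recomputed on *every* pick, so picking
  // the unrelated node C clears the leftover preference from picking A.
  for (const SUnit *SU : {&A, &C, &B}) {
    bool HaveClusterHint = SU->ParentClusterIdx != InvalidClusterId;
    std::printf("picked SU(%u): cluster hint %s\n", SU->NodeNum,
                HaveClusterHint ? "set" : "cleared");
  }
  return 0;
}
```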

The most affected targets are AMDGPU, AArch64, and RISCV.

For RISCV, most changes look like minor instruction reordering; I don't see obvious regressions.

For AArch64, some combining of ldr pairs into ldp was affected, with two cases regressed and two improved. The deeper reason is that the machine scheduler cannot cluster these loads well either before or after the change, and the later load-combine algorithm is not smart enough either.

For AMDGPU, some cases use more v_dual instructions while others regressed; this seems less critical. The test `v_vselect_v32bf16` gets more buffer_load instructions claused.

@llvmbot (Member) commented Apr 29, 2025

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-powerpc

Author: Ruiling, Song (ruiling)


Patch is 5.52 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/137784.diff

176 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachineScheduler.h (+6-8)
  • (modified) llvm/include/llvm/CodeGen/ScheduleDAG.h (+7)
  • (modified) llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h (+10)
  • (modified) llvm/lib/CodeGen/MachineScheduler.cpp (+52-23)
  • (modified) llvm/lib/CodeGen/MacroFusion.cpp (+13)
  • (modified) llvm/lib/CodeGen/ScheduleDAG.cpp (+3)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (+10-12)
  • (modified) llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp (+10-8)
  • (modified) llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/bcmp.ll (+4-3)
  • (modified) llvm/test/CodeGen/AArch64/expand-select.ll (+10-10)
  • (modified) llvm/test/CodeGen/AArch64/extbinopload.ll (+58-57)
  • (modified) llvm/test/CodeGen/AArch64/fcmp.ll (+17-17)
  • (modified) llvm/test/CodeGen/AArch64/fp-conversion-to-tbl.ll (+15-15)
  • (modified) llvm/test/CodeGen/AArch64/fptoi.ll (+70-70)
  • (modified) llvm/test/CodeGen/AArch64/fptoui-sat-vector.ll (+16-16)
  • (modified) llvm/test/CodeGen/AArch64/itofp.ll (+90-90)
  • (modified) llvm/test/CodeGen/AArch64/mul.ll (+12-12)
  • (modified) llvm/test/CodeGen/AArch64/nontemporal-load.ll (+9-8)
  • (modified) llvm/test/CodeGen/AArch64/nzcv-save.ll (+9-9)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-vector-llrint.ll (+43-43)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-vector-lrint.ll (+43-43)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-bitselect.ll (+47-47)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-convert.ll (+8-8)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-reduce.ll (+12-12)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-vselect.ll (+12-12)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-extends.ll (+80-82)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-mulh.ll (+12-12)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-reduce.ll (+16-16)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-to-fp.ll (+74-72)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-vselect.ll (+24-24)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-permute-zip-uzp-trn.ll (+42-42)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-ptest.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/vec_uaddo.ll (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/vec_umulo.ll (+4-4)
  • (modified) llvm/test/CodeGen/AArch64/vselect-ext.ll (+15-15)
  • (modified) llvm/test/CodeGen/AArch64/wide-scalar-shift-legalization.ll (+28-31)
  • (modified) llvm/test/CodeGen/AArch64/zext-to-tbl.ll (+55-54)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll (+27-27)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/ashr.ll (+11-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.ll (+30-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll (+321-314)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fshr.ll (+292-289)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.i16.ll (+12-11)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.i8.ll (+10-9)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.ll (+4-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/lshr.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll (+124-125)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/saddsat.ll (+40-39)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/sdivrem.ll (+19-19)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/ssubsat.ll (+49-49)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll (+24-24)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll (+69-69)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll (+24-24)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll (+18539-18522)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll (+14-12)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.256bit.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll (+134-134)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll (+3747-3714)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.576bit.ll (+107-127)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.704bit.ll (+173-183)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.960bit.ll (+423-414)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-cc.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-preserve-cc.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/bf16.ll (+1672-1693)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll (+9-12)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-memcpy.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/call-argument-types.ll (+15-15)
  • (modified) llvm/test/CodeGen/AMDGPU/carryout-selection.ll (+1-2)
  • (modified) llvm/test/CodeGen/AMDGPU/ctlz_zero_undef.ll (+42-44)
  • (modified) llvm/test/CodeGen/AMDGPU/cttz_zero_undef.ll (+43-45)
  • (modified) llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll (+21-21)
  • (modified) llvm/test/CodeGen/AMDGPU/ds-alignment.ll (+42-42)
  • (modified) llvm/test/CodeGen/AMDGPU/ds_read2.ll (+69-64)
  • (modified) llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.global.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f16.ll (+13-11)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f32.ll (+3-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fdiv.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fmed3.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-modifier-casting.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/fp-classify.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/freeze.ll (+121-112)
  • (modified) llvm/test/CodeGen/AMDGPU/function-args-inreg.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+286-212)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll (+23-22)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll (+29-32)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/half.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/i1-to-bf16.ll (+16-16)
  • (modified) llvm/test/CodeGen/AMDGPU/idiv-licm.ll (+3-4)
  • (modified) llvm/test/CodeGen/AMDGPU/idot4s.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/idot4u.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+32-32)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_elt.ll (+18-18)
  • (modified) llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/kernel-args.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.atomic.ordered.add.b64.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.intersect_ray.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i64.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane64.ptr.ll (+9-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readfirstlane.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.writelane.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log.ll (+4-3)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log10.ll (+4-3)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maximum.f16.ll (+106-106)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maximum.f32.ll (+115-115)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll (+290-290)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minimum.f16.ll (+11-11)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minimum.f32.ll (+115-115)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll (+290-290)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.round.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i1.ll (+20-21)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i16.ll (+85-85)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i32.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i8.ll (+18-18)
  • (modified) llvm/test/CodeGen/AMDGPU/load-global-i16.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/load-global-i32.ll (+170-171)
  • (modified) llvm/test/CodeGen/AMDGPU/load-local-redundant-copies.ll (+21-21)
  • (modified) llvm/test/CodeGen/AMDGPU/load-local.128.ll (+34-34)
  • (modified) llvm/test/CodeGen/AMDGPU/load-local.96.ll (+25-25)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-lastuse-metadata.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-nontemporal-metadata.ll (+16-16)
  • (modified) llvm/test/CodeGen/AMDGPU/max.i16.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll (+58-60)
  • (modified) llvm/test/CodeGen/AMDGPU/memcpy-param-combinations.ll (+38-40)
  • (modified) llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll (+474-480)
  • (modified) llvm/test/CodeGen/AMDGPU/memmove-param-combinations.ll (+96-109)
  • (modified) llvm/test/CodeGen/AMDGPU/min.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/mul.ll (+18-18)
  • (modified) llvm/test/CodeGen/AMDGPU/narrow_math_for_and.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/or.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/permute_i8.ll (+57-57)
  • (modified) llvm/test/CodeGen/AMDGPU/pr51516.mir (+5-1)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll (+73-73)
  • (modified) llvm/test/CodeGen/AMDGPU/repeated-divisor.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/sdiv.ll (+96-96)
  • (modified) llvm/test/CodeGen/AMDGPU/select.f16.ll (+168-173)
  • (modified) llvm/test/CodeGen/AMDGPU/shl.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/sra.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/srem.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/srl.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/store-local.128.ll (+29-28)
  • (modified) llvm/test/CodeGen/AMDGPU/store-local.96.ll (+15-14)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.ll (+15-15)
  • (modified) llvm/test/CodeGen/AMDGPU/udivrem.ll (+4-4)
  • (modified) llvm/test/CodeGen/PowerPC/p10-fi-elim.ll (+2-2)
  • (modified) llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll (+34-34)
  • (modified) llvm/test/CodeGen/RISCV/abds-neg.ll (+30-30)
  • (modified) llvm/test/CodeGen/RISCV/abds.ll (+400-400)
  • (modified) llvm/test/CodeGen/RISCV/abdu-neg.ll (+26-26)
  • (modified) llvm/test/CodeGen/RISCV/add-before-shl.ll (+10-10)
  • (modified) llvm/test/CodeGen/RISCV/fold-mem-offset.ll (+8-8)
  • (modified) llvm/test/CodeGen/RISCV/legalize-fneg.ll (+5-5)
  • (modified) llvm/test/CodeGen/RISCV/memcmp-optsize.ll (+42-42)
  • (modified) llvm/test/CodeGen/RISCV/memcmp.ll (+42-42)
  • (modified) llvm/test/CodeGen/RISCV/rv32zbb.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-elen.ll (+17-17)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll (+148-148)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll (+5-5)
  • (modified) llvm/test/CodeGen/RISCV/rvv/pr125306.ll (+8-8)
  • (modified) llvm/test/CodeGen/RISCV/scmp.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/srem-vector-lkk.ll (+24-24)
  • (modified) llvm/test/CodeGen/RISCV/ucmp.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/unaligned-load-store.ll (+16-16)
  • (modified) llvm/test/CodeGen/RISCV/urem-vector-lkk.ll (+18-18)
  • (modified) llvm/test/CodeGen/RISCV/vararg.ll (+9-9)
  • (modified) llvm/test/CodeGen/RISCV/wide-scalar-shift-by-byte-multiple-legalization.ll (+359-359)
  • (modified) llvm/test/CodeGen/RISCV/wide-scalar-shift-legalization.ll (+201-201)
  • (modified) llvm/test/CodeGen/RISCV/xtheadmempair.ll (+7-7)
diff --git a/llvm/include/llvm/CodeGen/MachineScheduler.h b/llvm/include/llvm/CodeGen/MachineScheduler.h
index bc00d0b4ff852..14f3fda90ef6d 100644
--- a/llvm/include/llvm/CodeGen/MachineScheduler.h
+++ b/llvm/include/llvm/CodeGen/MachineScheduler.h
@@ -303,10 +303,6 @@ class ScheduleDAGMI : public ScheduleDAGInstrs {
   /// The bottom of the unscheduled zone.
   MachineBasicBlock::iterator CurrentBottom;
 
-  /// Record the next node in a scheduled cluster.
-  const SUnit *NextClusterPred = nullptr;
-  const SUnit *NextClusterSucc = nullptr;
-
 #if LLVM_ENABLE_ABI_BREAKING_CHECKS
   /// The number of instructions scheduled so far. Used to cut off the
   /// scheduler at the point determined by misched-cutoff.
@@ -367,10 +363,6 @@ class ScheduleDAGMI : public ScheduleDAGInstrs {
   /// live ranges and region boundary iterators.
   void moveInstruction(MachineInstr *MI, MachineBasicBlock::iterator InsertPos);
 
-  const SUnit *getNextClusterPred() const { return NextClusterPred; }
-
-  const SUnit *getNextClusterSucc() const { return NextClusterSucc; }
-
   void viewGraph(const Twine &Name, const Twine &Title) override;
   void viewGraph() override;
 
@@ -1292,6 +1284,9 @@ class GenericScheduler : public GenericSchedulerBase {
   SchedBoundary Top;
   SchedBoundary Bot;
 
+  ClusterInfo *TopCluster;
+  ClusterInfo *BotCluster;
+
   /// Candidate last picked from Top boundary.
   SchedCandidate TopCand;
   /// Candidate last picked from Bot boundary.
@@ -1332,6 +1327,9 @@ class PostGenericScheduler : public GenericSchedulerBase {
   /// Candidate last picked from Bot boundary.
   SchedCandidate BotCand;
 
+  ClusterInfo *TopCluster;
+  ClusterInfo *BotCluster;
+
 public:
   PostGenericScheduler(const MachineSchedContext *C)
       : GenericSchedulerBase(C), Top(SchedBoundary::TopQID, "TopQ"),
diff --git a/llvm/include/llvm/CodeGen/ScheduleDAG.h b/llvm/include/llvm/CodeGen/ScheduleDAG.h
index 1c8d92d149adc..a4301d11a4454 100644
--- a/llvm/include/llvm/CodeGen/ScheduleDAG.h
+++ b/llvm/include/llvm/CodeGen/ScheduleDAG.h
@@ -17,6 +17,7 @@
 
 #include "llvm/ADT/BitVector.h"
 #include "llvm/ADT/PointerIntPair.h"
+#include "llvm/ADT/SmallSet.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/iterator.h"
 #include "llvm/CodeGen/MachineInstr.h"
@@ -234,6 +235,10 @@ class TargetRegisterInfo;
     void dump(const TargetRegisterInfo *TRI = nullptr) const;
   };
 
+  /// Keep record of which SUnit are in the same cluster group.
+  typedef SmallSet<SUnit *, 8> ClusterInfo;
+  constexpr unsigned InvalidClusterId = ~0u;
+
   /// Scheduling unit. This is a node in the scheduling DAG.
   class SUnit {
   private:
@@ -274,6 +279,8 @@ class TargetRegisterInfo;
     unsigned TopReadyCycle = 0; ///< Cycle relative to start when node is ready.
     unsigned BotReadyCycle = 0; ///< Cycle relative to end when node is ready.
 
+    unsigned ParentClusterIdx = InvalidClusterId; ///< The parent cluster id.
+
   private:
     unsigned Depth = 0;  ///< Node depth.
     unsigned Height = 0; ///< Node height.
diff --git a/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h b/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
index e79b03c57a1e8..6c6bd8015ee69 100644
--- a/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
+++ b/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
@@ -180,6 +180,8 @@ namespace llvm {
     /// case of a huge region that gets reduced).
     SUnit *BarrierChain = nullptr;
 
+    SmallVector<ClusterInfo> Clusters;
+
   public:
     /// A list of SUnits, used in Value2SUsMap, during DAG construction.
     /// Note: to gain speed it might be worth investigating an optimized
@@ -383,6 +385,14 @@ namespace llvm {
     /// equivalent edge already existed (false indicates failure).
     bool addEdge(SUnit *SuccSU, const SDep &PredDep);
 
+    /// Returns the array of the clusters.
+    SmallVector<ClusterInfo> &getClusters() { return Clusters; }
+
+    /// Get the specific cluster, return nullptr for InvalidClusterId.
+    ClusterInfo *getCluster(unsigned Idx) {
+      return Idx != InvalidClusterId ? &Clusters[Idx] : nullptr;
+    }
+
   protected:
     void initSUnits();
     void addPhysRegDataDeps(SUnit *SU, unsigned OperIdx);
diff --git a/llvm/lib/CodeGen/MachineScheduler.cpp b/llvm/lib/CodeGen/MachineScheduler.cpp
index 0c3ffb1bbaa6f..91da22612eac6 100644
--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -15,6 +15,7 @@
 #include "llvm/ADT/ArrayRef.h"
 #include "llvm/ADT/BitVector.h"
 #include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/EquivalenceClasses.h"
 #include "llvm/ADT/PriorityQueue.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SmallVector.h"
@@ -844,8 +845,6 @@ void ScheduleDAGMI::releaseSucc(SUnit *SU, SDep *SuccEdge) {
 
   if (SuccEdge->isWeak()) {
     --SuccSU->WeakPredsLeft;
-    if (SuccEdge->isCluster())
-      NextClusterSucc = SuccSU;
     return;
   }
 #ifndef NDEBUG
@@ -881,8 +880,6 @@ void ScheduleDAGMI::releasePred(SUnit *SU, SDep *PredEdge) {
 
   if (PredEdge->isWeak()) {
     --PredSU->WeakSuccsLeft;
-    if (PredEdge->isCluster())
-      NextClusterPred = PredSU;
     return;
   }
 #ifndef NDEBUG
@@ -1077,11 +1074,8 @@ findRootsAndBiasEdges(SmallVectorImpl<SUnit*> &TopRoots,
 }
 
 /// Identify DAG roots and setup scheduler queues.
-void ScheduleDAGMI::initQueues(ArrayRef<SUnit*> TopRoots,
-                               ArrayRef<SUnit*> BotRoots) {
-  NextClusterSucc = nullptr;
-  NextClusterPred = nullptr;
-
+void ScheduleDAGMI::initQueues(ArrayRef<SUnit *> TopRoots,
+                               ArrayRef<SUnit *> BotRoots) {
   // Release all DAG roots for scheduling, not including EntrySU/ExitSU.
   //
   // Nodes with unreleased weak edges can still be roots.
@@ -2008,6 +2002,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
     ScheduleDAGInstrs *DAG) {
   // Keep track of the current cluster length and bytes for each SUnit.
   DenseMap<unsigned, std::pair<unsigned, unsigned>> SUnit2ClusterInfo;
+  EquivalenceClasses<SUnit *> Clusters;
 
   // At this point, `MemOpRecords` array must hold atleast two mem ops. Try to
   // cluster mem ops collected within `MemOpRecords` array.
@@ -2047,6 +2042,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
 
     SUnit *SUa = MemOpa.SU;
     SUnit *SUb = MemOpb.SU;
+
     if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum)
       std::swap(SUa, SUb);
 
@@ -2054,6 +2050,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
     if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster)))
       continue;
 
+    Clusters.unionSets(SUa, SUb);
     LLVM_DEBUG(dbgs() << "Cluster ld/st SU(" << SUa->NodeNum << ") - SU("
                       << SUb->NodeNum << ")\n");
     ++NumClustered;
@@ -2093,6 +2090,21 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
                       << ", Curr cluster bytes: " << CurrentClusterBytes
                       << "\n");
   }
+
+  // Add cluster group information.
+  // Iterate over all of the equivalence sets.
+  auto &AllClusters = DAG->getClusters();
+  for (auto &I : Clusters) {
+    if (!I->isLeader())
+      continue;
+    ClusterInfo Group;
+    unsigned ClusterIdx = AllClusters.size();
+    for (auto *MemberI : Clusters.members(*I)) {
+      MemberI->ParentClusterIdx = ClusterIdx;
+      Group.insert(MemberI);
+    }
+    AllClusters.push_back(Group);
+  }
 }
 
 void BaseMemOpClusterMutation::collectMemOpRecords(
@@ -3456,6 +3468,9 @@ void GenericScheduler::initialize(ScheduleDAGMI *dag) {
   }
   TopCand.SU = nullptr;
   BotCand.SU = nullptr;
+
+  TopCluster = nullptr;
+  BotCluster = nullptr;
 }
 
 /// Initialize the per-region scheduling policy.
@@ -3762,13 +3777,11 @@ bool GenericScheduler::tryCandidate(SchedCandidate &Cand,
   // This is a best effort to set things up for a post-RA pass. Optimizations
   // like generating loads of multiple registers should ideally be done within
   // the scheduler pass by combining the loads during DAG postprocessing.
-  const SUnit *CandNextClusterSU =
-    Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-    TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU,
-                 TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   if (SameBoundary) {
@@ -4015,11 +4028,25 @@ void GenericScheduler::reschedulePhysReg(SUnit *SU, bool isTop) {
 void GenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
   if (IsTopNode) {
     SU->TopReadyCycle = std::max(SU->TopReadyCycle, Top.getCurrCycle());
+    TopCluster = DAG->getCluster(SU->ParentClusterIdx);
+    LLVM_DEBUG(if (TopCluster) {
+      dbgs() << "  Top Cluster: ";
+      for (auto *N : *TopCluster)
+        dbgs() << N->NodeNum << '\t';
+      dbgs() << "\n";
+    });
     Top.bumpNode(SU);
     if (SU->hasPhysRegUses)
       reschedulePhysReg(SU, true);
   } else {
     SU->BotReadyCycle = std::max(SU->BotReadyCycle, Bot.getCurrCycle());
+    BotCluster = DAG->getCluster(SU->ParentClusterIdx);
+    LLVM_DEBUG(if (BotCluster) {
+      dbgs() << "  Bot Cluster: ";
+      for (auto *N : *BotCluster)
+        dbgs() << N->NodeNum << '\t';
+      dbgs() << "\n";
+    });
     Bot.bumpNode(SU);
     if (SU->hasPhysRegDefs)
       reschedulePhysReg(SU, false);
@@ -4076,6 +4103,8 @@ void PostGenericScheduler::initialize(ScheduleDAGMI *Dag) {
   if (!Bot.HazardRec) {
     Bot.HazardRec = DAG->TII->CreateTargetMIHazardRecognizer(Itin, DAG);
   }
+  TopCluster = nullptr;
+  BotCluster = nullptr;
 }
 
 void PostGenericScheduler::initPolicy(MachineBasicBlock::iterator Begin,
@@ -4137,14 +4166,12 @@ bool PostGenericScheduler::tryCandidate(SchedCandidate &Cand,
     return TryCand.Reason != NoCand;
 
   // Keep clustered nodes together.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
-
   // Avoid critical resource consumption and balance the schedule.
   if (tryLess(TryCand.ResDelta.CritResources, Cand.ResDelta.CritResources,
               TryCand, Cand, ResourceReduce))
@@ -4329,9 +4356,11 @@ SUnit *PostGenericScheduler::pickNode(bool &IsTopNode) {
 void PostGenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
   if (IsTopNode) {
     SU->TopReadyCycle = std::max(SU->TopReadyCycle, Top.getCurrCycle());
+    TopCluster = DAG->getCluster(SU->ParentClusterIdx);
     Top.bumpNode(SU);
   } else {
     SU->BotReadyCycle = std::max(SU->BotReadyCycle, Bot.getCurrCycle());
+    BotCluster = DAG->getCluster(SU->ParentClusterIdx);
     Bot.bumpNode(SU);
   }
 }
diff --git a/llvm/lib/CodeGen/MacroFusion.cpp b/llvm/lib/CodeGen/MacroFusion.cpp
index 5bd6ca0978a4b..c614e477a9d8f 100644
--- a/llvm/lib/CodeGen/MacroFusion.cpp
+++ b/llvm/lib/CodeGen/MacroFusion.cpp
@@ -61,6 +61,11 @@ bool llvm::fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,
   for (SDep &SI : SecondSU.Preds)
     if (SI.isCluster())
       return false;
+
+  unsigned FirstCluster = FirstSU.ParentClusterIdx;
+  unsigned SecondCluster = SecondSU.ParentClusterIdx;
+  assert(FirstCluster == InvalidClusterId && SecondCluster == InvalidClusterId);
+
   // Though the reachability checks above could be made more generic,
   // perhaps as part of ScheduleDAGInstrs::addEdge(), since such edges are valid,
   // the extra computation cost makes it less interesting in general cases.
@@ -70,6 +75,14 @@ bool llvm::fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,
   if (!DAG.addEdge(&SecondSU, SDep(&FirstSU, SDep::Cluster)))
     return false;
 
+  auto &Clusters = DAG.getClusters();
+
+  FirstSU.ParentClusterIdx = Clusters.size();
+  SecondSU.ParentClusterIdx = Clusters.size();
+
+  SmallSet<SUnit *, 8> Cluster{{&FirstSU, &SecondSU}};
+  Clusters.emplace_back(Cluster);
+
   // TODO - If we want to chain more than two instructions, we need to create
   // artifical edges to make dependencies from the FirstSU also dependent
   // on other chained instructions, and other chained instructions also
diff --git a/llvm/lib/CodeGen/ScheduleDAG.cpp b/llvm/lib/CodeGen/ScheduleDAG.cpp
index 26857edd871e2..e630b80e33ab4 100644
--- a/llvm/lib/CodeGen/ScheduleDAG.cpp
+++ b/llvm/lib/CodeGen/ScheduleDAG.cpp
@@ -365,6 +365,9 @@ LLVM_DUMP_METHOD void ScheduleDAG::dumpNodeName(const SUnit &SU) const {
 LLVM_DUMP_METHOD void ScheduleDAG::dumpNodeAll(const SUnit &SU) const {
   dumpNode(SU);
   SU.dumpAttributes();
+  if (SU.ParentClusterIdx != InvalidClusterId)
+    dbgs() << "  Parent Cluster Index: " << SU.ParentClusterIdx << '\n';
+
   if (SU.Preds.size() > 0) {
     dbgs() << "  Predecessors:\n";
     for (const SDep &Dep : SU.Preds) {
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 5678512748569..6c6c81ab2b4cc 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -584,12 +584,11 @@ bool GCNMaxILPSchedStrategy::tryCandidate(SchedCandidate &Cand,
   // This is a best effort to set things up for a post-RA pass. Optimizations
   // like generating loads of multiple registers should ideally be done within
   // the scheduler pass by combining the loads during DAG postprocessing.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   // Avoid increasing the max critical pressure in the scheduled region.
@@ -659,12 +658,11 @@ bool GCNMaxMemoryClauseSchedStrategy::tryCandidate(SchedCandidate &Cand,
 
   // MaxMemoryClause-specific: We prioritize clustered instructions as we would
   // get more benefit from clausing these memory instructions.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   // We only compare a subset of features when comparing nodes between
diff --git a/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp b/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp
index 03712879f7c49..5eb1f0128643d 100644
--- a/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp
+++ b/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp
@@ -100,12 +100,11 @@ bool PPCPreRASchedStrategy::tryCandidate(SchedCandidate &Cand,
   // This is a best effort to set things up for a post-RA pass. Optimizations
   // like generating loads of multiple registers should ideally be done within
   // the scheduler pass by combining the loads during DAG postprocessing.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   if (SameBoundary) {
@@ -190,8 +189,11 @@ bool PPCPostRASchedStrategy::tryCandidate(SchedCandidate &Cand,
     return TryCand.Reason != NoCand;
 
   // Keep clustered nodes together.
-  if (tryGreater(TryCand.SU == DAG->getNextClusterSucc(),
-                 Cand.SU == DAG->getNextClusterSucc(), TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   // Avoid critical resource consumption and balance the schedule.
diff --git a/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll b/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll
index b944194dae8fc..f9176bc9d3fa5 100644
--- a/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll
+++ b/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll
@@ -477,9 +477,8 @@ define void @callee_in_memory(%T_IN_MEMORY %a) {
 ; CHECK-NEXT:    add x8, x8, :lo12:in_memory_store
 ; CHECK-NEXT:    ldr d0, [sp, #64]
 ; CHECK-NEXT:    str d0, [x8, #64]
-; CHECK-NEXT:    ldr q0, [sp, #16]
 ; CHECK-NEXT:    str q2, [x8, #48]
-; CHECK-NEXT:    ldr q2, [sp]
+; CHECK-NEXT:    ldp q2, q0, [sp]
 ; CHECK-NEXT:    stp q0, q1, [x8, #16]
 ; CHECK-NEXT:    str q2, [x8]
 ; CHECK-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll b/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll
index 7e72e8de01f4f..3bada9d5b3bb4 100644
--- a/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll
@@ -7,8 +7,8 @@
 
 ; CHECK-LABEL: @test
 ; CHECK: add [[BASE:x[0-9]+]], x0, x1, lsl #3
-; CHECK: ldp [[CPLX1_I:s[0-9]+]], [[CPLX1_R:s[0-9]+]], [[[BASE]]]
-; CHECK: ldp [[CPLX2_I:s[0-9]+]], [[CPLX2_R:s[0-9]+]], [[[BASE]], #64]
+; CHECK-DAG: ldp [[CPLX1_I:s[0-9]+]], [[CPLX1_R:s[0-9]+]], [[[BASE]]]
+; CHECK-DAG: ldp [[CPLX2_I:s[0-9]+]], [[CPLX2_R:s[0-9]+]], [[[BASE]], #64]
 ; CHECK: fadd {{s[0-9]+}}, [[CPLX2_I]], [[CPLX1_I]]
 ; CHECK: fadd {{s[0-9]+}}, [[CPLX2_R]], [[CPLX1_R]]
 ; CHECK: ret
@@ -36,8 +36,8 @@ entry:
 
 ; CHECK-LABEL: @test_int
 ; CHECK: add [[BASE:x[0-9]+]], x0, x1, lsl #3
-; CHECK: ldp [[CPLX1_I:w[0-9]+]], [[CPLX1_R:w[0-9]+]], [[[BASE]]]
-; CHECK: ldp [[CPLX2_I:w[0-9]+]], [[CPLX2_R:w[0-9]+]], [[[BASE]], #64]
+; CHECK-DAG: ldp [[CPLX1_I:w[0-9]+]], [[CPLX1_R:w[0-9]+]], [[[BASE]]]
+; CHECK-DAG: ldp [[CPLX2_I:w[0-9]+]], [[CPLX2_R:w[0-9]+]], [[[BASE]], #64]
 ; CHECK: add {{w[0-9]+}}, [[CPLX2_I]], [[CPLX1_I]]
 ; CHECK: add {{w[0-9]+}}, [[CPLX2_R]], [[CPLX1_R]]
 ; CHECK: ret
@@ -65,8 +65,8 @@ entry:
 
 ; CHECK-LABEL: @test_long
 ; CHECK: add [[BASE:x[0-9]+]], x0, x1, lsl #4
-; CHECK: ldp [[CPLX1_I:x[0-9]+]], [[CPLX1_R:x[0-9]+]], [[[BASE]]]
-; CHECK: ldp [[CPLX2_I:x[0-9]+]], [[CPLX2_R:x[0-9]+]], [[[BASE]], #128]
+; CHEC...
[truncated]

@llvmbot (Member) commented Apr 29, 2025

@llvm/pr-subscribers-backend-aarch64

--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -15,6 +15,7 @@
 #include "llvm/ADT/ArrayRef.h"
 #include "llvm/ADT/BitVector.h"
 #include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/EquivalenceClasses.h"
 #include "llvm/ADT/PriorityQueue.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SmallVector.h"
@@ -844,8 +845,6 @@ void ScheduleDAGMI::releaseSucc(SUnit *SU, SDep *SuccEdge) {
 
   if (SuccEdge->isWeak()) {
     --SuccSU->WeakPredsLeft;
-    if (SuccEdge->isCluster())
-      NextClusterSucc = SuccSU;
     return;
   }
 #ifndef NDEBUG
@@ -881,8 +880,6 @@ void ScheduleDAGMI::releasePred(SUnit *SU, SDep *PredEdge) {
 
   if (PredEdge->isWeak()) {
     --PredSU->WeakSuccsLeft;
-    if (PredEdge->isCluster())
-      NextClusterPred = PredSU;
     return;
   }
 #ifndef NDEBUG
@@ -1077,11 +1074,8 @@ findRootsAndBiasEdges(SmallVectorImpl<SUnit*> &TopRoots,
 }
 
 /// Identify DAG roots and setup scheduler queues.
-void ScheduleDAGMI::initQueues(ArrayRef<SUnit*> TopRoots,
-                               ArrayRef<SUnit*> BotRoots) {
-  NextClusterSucc = nullptr;
-  NextClusterPred = nullptr;
-
+void ScheduleDAGMI::initQueues(ArrayRef<SUnit *> TopRoots,
+                               ArrayRef<SUnit *> BotRoots) {
   // Release all DAG roots for scheduling, not including EntrySU/ExitSU.
   //
   // Nodes with unreleased weak edges can still be roots.
@@ -2008,6 +2002,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
     ScheduleDAGInstrs *DAG) {
   // Keep track of the current cluster length and bytes for each SUnit.
   DenseMap<unsigned, std::pair<unsigned, unsigned>> SUnit2ClusterInfo;
+  EquivalenceClasses<SUnit *> Clusters;
 
   // At this point, `MemOpRecords` array must hold atleast two mem ops. Try to
   // cluster mem ops collected within `MemOpRecords` array.
@@ -2047,6 +2042,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
 
     SUnit *SUa = MemOpa.SU;
     SUnit *SUb = MemOpb.SU;
+
     if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum)
       std::swap(SUa, SUb);
 
@@ -2054,6 +2050,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
     if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster)))
       continue;
 
+    Clusters.unionSets(SUa, SUb);
     LLVM_DEBUG(dbgs() << "Cluster ld/st SU(" << SUa->NodeNum << ") - SU("
                       << SUb->NodeNum << ")\n");
     ++NumClustered;
@@ -2093,6 +2090,21 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
                       << ", Curr cluster bytes: " << CurrentClusterBytes
                       << "\n");
   }
+
+  // Add cluster group information.
+  // Iterate over all of the equivalence sets.
+  auto &AllClusters = DAG->getClusters();
+  for (auto &I : Clusters) {
+    if (!I->isLeader())
+      continue;
+    ClusterInfo Group;
+    unsigned ClusterIdx = AllClusters.size();
+    for (auto *MemberI : Clusters.members(*I)) {
+      MemberI->ParentClusterIdx = ClusterIdx;
+      Group.insert(MemberI);
+    }
+    AllClusters.push_back(Group);
+  }
 }
 
 void BaseMemOpClusterMutation::collectMemOpRecords(
@@ -3456,6 +3468,9 @@ void GenericScheduler::initialize(ScheduleDAGMI *dag) {
   }
   TopCand.SU = nullptr;
   BotCand.SU = nullptr;
+
+  TopCluster = nullptr;
+  BotCluster = nullptr;
 }
 
 /// Initialize the per-region scheduling policy.
@@ -3762,13 +3777,11 @@ bool GenericScheduler::tryCandidate(SchedCandidate &Cand,
   // This is a best effort to set things up for a post-RA pass. Optimizations
   // like generating loads of multiple registers should ideally be done within
   // the scheduler pass by combining the loads during DAG postprocessing.
-  const SUnit *CandNextClusterSU =
-    Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-    TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU,
-                 TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   if (SameBoundary) {
@@ -4015,11 +4028,25 @@ void GenericScheduler::reschedulePhysReg(SUnit *SU, bool isTop) {
 void GenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
   if (IsTopNode) {
     SU->TopReadyCycle = std::max(SU->TopReadyCycle, Top.getCurrCycle());
+    TopCluster = DAG->getCluster(SU->ParentClusterIdx);
+    LLVM_DEBUG(if (TopCluster) {
+      dbgs() << "  Top Cluster: ";
+      for (auto *N : *TopCluster)
+        dbgs() << N->NodeNum << '\t';
+      dbgs() << "\n";
+    });
     Top.bumpNode(SU);
     if (SU->hasPhysRegUses)
       reschedulePhysReg(SU, true);
   } else {
     SU->BotReadyCycle = std::max(SU->BotReadyCycle, Bot.getCurrCycle());
+    BotCluster = DAG->getCluster(SU->ParentClusterIdx);
+    LLVM_DEBUG(if (BotCluster) {
+      dbgs() << "  Bot Cluster: ";
+      for (auto *N : *BotCluster)
+        dbgs() << N->NodeNum << '\t';
+      dbgs() << "\n";
+    });
     Bot.bumpNode(SU);
     if (SU->hasPhysRegDefs)
       reschedulePhysReg(SU, false);
@@ -4076,6 +4103,8 @@ void PostGenericScheduler::initialize(ScheduleDAGMI *Dag) {
   if (!Bot.HazardRec) {
     Bot.HazardRec = DAG->TII->CreateTargetMIHazardRecognizer(Itin, DAG);
   }
+  TopCluster = nullptr;
+  BotCluster = nullptr;
 }
 
 void PostGenericScheduler::initPolicy(MachineBasicBlock::iterator Begin,
@@ -4137,14 +4166,12 @@ bool PostGenericScheduler::tryCandidate(SchedCandidate &Cand,
     return TryCand.Reason != NoCand;
 
   // Keep clustered nodes together.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
-
   // Avoid critical resource consumption and balance the schedule.
   if (tryLess(TryCand.ResDelta.CritResources, Cand.ResDelta.CritResources,
               TryCand, Cand, ResourceReduce))
@@ -4329,9 +4356,11 @@ SUnit *PostGenericScheduler::pickNode(bool &IsTopNode) {
 void PostGenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
   if (IsTopNode) {
     SU->TopReadyCycle = std::max(SU->TopReadyCycle, Top.getCurrCycle());
+    TopCluster = DAG->getCluster(SU->ParentClusterIdx);
     Top.bumpNode(SU);
   } else {
     SU->BotReadyCycle = std::max(SU->BotReadyCycle, Bot.getCurrCycle());
+    BotCluster = DAG->getCluster(SU->ParentClusterIdx);
     Bot.bumpNode(SU);
   }
 }
diff --git a/llvm/lib/CodeGen/MacroFusion.cpp b/llvm/lib/CodeGen/MacroFusion.cpp
index 5bd6ca0978a4b..c614e477a9d8f 100644
--- a/llvm/lib/CodeGen/MacroFusion.cpp
+++ b/llvm/lib/CodeGen/MacroFusion.cpp
@@ -61,6 +61,11 @@ bool llvm::fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,
   for (SDep &SI : SecondSU.Preds)
     if (SI.isCluster())
       return false;
+
+  unsigned FirstCluster = FirstSU.ParentClusterIdx;
+  unsigned SecondCluster = SecondSU.ParentClusterIdx;
+  assert(FirstCluster == InvalidClusterId && SecondCluster == InvalidClusterId);
+
   // Though the reachability checks above could be made more generic,
   // perhaps as part of ScheduleDAGInstrs::addEdge(), since such edges are valid,
   // the extra computation cost makes it less interesting in general cases.
@@ -70,6 +75,14 @@ bool llvm::fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,
   if (!DAG.addEdge(&SecondSU, SDep(&FirstSU, SDep::Cluster)))
     return false;
 
+  auto &Clusters = DAG.getClusters();
+
+  FirstSU.ParentClusterIdx = Clusters.size();
+  SecondSU.ParentClusterIdx = Clusters.size();
+
+  SmallSet<SUnit *, 8> Cluster{{&FirstSU, &SecondSU}};
+  Clusters.emplace_back(Cluster);
+
   // TODO - If we want to chain more than two instructions, we need to create
   // artifical edges to make dependencies from the FirstSU also dependent
   // on other chained instructions, and other chained instructions also
diff --git a/llvm/lib/CodeGen/ScheduleDAG.cpp b/llvm/lib/CodeGen/ScheduleDAG.cpp
index 26857edd871e2..e630b80e33ab4 100644
--- a/llvm/lib/CodeGen/ScheduleDAG.cpp
+++ b/llvm/lib/CodeGen/ScheduleDAG.cpp
@@ -365,6 +365,9 @@ LLVM_DUMP_METHOD void ScheduleDAG::dumpNodeName(const SUnit &SU) const {
 LLVM_DUMP_METHOD void ScheduleDAG::dumpNodeAll(const SUnit &SU) const {
   dumpNode(SU);
   SU.dumpAttributes();
+  if (SU.ParentClusterIdx != InvalidClusterId)
+    dbgs() << "  Parent Cluster Index: " << SU.ParentClusterIdx << '\n';
+
   if (SU.Preds.size() > 0) {
     dbgs() << "  Predecessors:\n";
     for (const SDep &Dep : SU.Preds) {
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 5678512748569..6c6c81ab2b4cc 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -584,12 +584,11 @@ bool GCNMaxILPSchedStrategy::tryCandidate(SchedCandidate &Cand,
   // This is a best effort to set things up for a post-RA pass. Optimizations
   // like generating loads of multiple registers should ideally be done within
   // the scheduler pass by combining the loads during DAG postprocessing.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   // Avoid increasing the max critical pressure in the scheduled region.
@@ -659,12 +658,11 @@ bool GCNMaxMemoryClauseSchedStrategy::tryCandidate(SchedCandidate &Cand,
 
   // MaxMemoryClause-specific: We prioritize clustered instructions as we would
   // get more benefit from clausing these memory instructions.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   // We only compare a subset of features when comparing nodes between
diff --git a/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp b/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp
index 03712879f7c49..5eb1f0128643d 100644
--- a/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp
+++ b/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp
@@ -100,12 +100,11 @@ bool PPCPreRASchedStrategy::tryCandidate(SchedCandidate &Cand,
   // This is a best effort to set things up for a post-RA pass. Optimizations
   // like generating loads of multiple registers should ideally be done within
   // the scheduler pass by combining the loads during DAG postprocessing.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   if (SameBoundary) {
@@ -190,8 +189,11 @@ bool PPCPostRASchedStrategy::tryCandidate(SchedCandidate &Cand,
     return TryCand.Reason != NoCand;
 
   // Keep clustered nodes together.
-  if (tryGreater(TryCand.SU == DAG->getNextClusterSucc(),
-                 Cand.SU == DAG->getNextClusterSucc(), TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   // Avoid critical resource consumption and balance the schedule.
diff --git a/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll b/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll
index b944194dae8fc..f9176bc9d3fa5 100644
--- a/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll
+++ b/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll
@@ -477,9 +477,8 @@ define void @callee_in_memory(%T_IN_MEMORY %a) {
 ; CHECK-NEXT:    add x8, x8, :lo12:in_memory_store
 ; CHECK-NEXT:    ldr d0, [sp, #64]
 ; CHECK-NEXT:    str d0, [x8, #64]
-; CHECK-NEXT:    ldr q0, [sp, #16]
 ; CHECK-NEXT:    str q2, [x8, #48]
-; CHECK-NEXT:    ldr q2, [sp]
+; CHECK-NEXT:    ldp q2, q0, [sp]
 ; CHECK-NEXT:    stp q0, q1, [x8, #16]
 ; CHECK-NEXT:    str q2, [x8]
 ; CHECK-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll b/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll
index 7e72e8de01f4f..3bada9d5b3bb4 100644
--- a/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll
@@ -7,8 +7,8 @@
 
 ; CHECK-LABEL: @test
 ; CHECK: add [[BASE:x[0-9]+]], x0, x1, lsl #3
-; CHECK: ldp [[CPLX1_I:s[0-9]+]], [[CPLX1_R:s[0-9]+]], [[[BASE]]]
-; CHECK: ldp [[CPLX2_I:s[0-9]+]], [[CPLX2_R:s[0-9]+]], [[[BASE]], #64]
+; CHECK-DAG: ldp [[CPLX1_I:s[0-9]+]], [[CPLX1_R:s[0-9]+]], [[[BASE]]]
+; CHECK-DAG: ldp [[CPLX2_I:s[0-9]+]], [[CPLX2_R:s[0-9]+]], [[[BASE]], #64]
 ; CHECK: fadd {{s[0-9]+}}, [[CPLX2_I]], [[CPLX1_I]]
 ; CHECK: fadd {{s[0-9]+}}, [[CPLX2_R]], [[CPLX1_R]]
 ; CHECK: ret
@@ -36,8 +36,8 @@ entry:
 
 ; CHECK-LABEL: @test_int
 ; CHECK: add [[BASE:x[0-9]+]], x0, x1, lsl #3
-; CHECK: ldp [[CPLX1_I:w[0-9]+]], [[CPLX1_R:w[0-9]+]], [[[BASE]]]
-; CHECK: ldp [[CPLX2_I:w[0-9]+]], [[CPLX2_R:w[0-9]+]], [[[BASE]], #64]
+; CHECK-DAG: ldp [[CPLX1_I:w[0-9]+]], [[CPLX1_R:w[0-9]+]], [[[BASE]]]
+; CHECK-DAG: ldp [[CPLX2_I:w[0-9]+]], [[CPLX2_R:w[0-9]+]], [[[BASE]], #64]
 ; CHECK: add {{w[0-9]+}}, [[CPLX2_I]], [[CPLX1_I]]
 ; CHECK: add {{w[0-9]+}}, [[CPLX2_R]], [[CPLX1_R]]
 ; CHECK: ret
@@ -65,8 +65,8 @@ entry:
 
 ; CHECK-LABEL: @test_long
 ; CHECK: add [[BASE:x[0-9]+]], x0, x1, lsl #4
-; CHECK: ldp [[CPLX1_I:x[0-9]+]], [[CPLX1_R:x[0-9]+]], [[[BASE]]]
-; CHECK: ldp [[CPLX2_I:x[0-9]+]], [[CPLX2_R:x[0-9]+]], [[[BASE]], #128]
+; CHEC...
[truncated]
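
Condensed from the hunks above, here is a minimal sketch of how the new pieces are meant to interact, assuming a tree with this patch applied. `Strategy` is a simplified stand-in for `GenericScheduler`/`PostGenericScheduler`, which carry far more state; everything else (`ClusterInfo`, `getCluster`, `ParentClusterIdx`) comes from the patch itself.

```
#include "llvm/CodeGen/MachineScheduler.h"
using namespace llvm;

// Sketch only: "Strategy" is illustrative, not a real LLVM class.
struct Strategy {
  ScheduleDAGMI *DAG = nullptr;
  ClusterInfo *TopCluster = nullptr; // cluster of the last top-down pick
  ClusterInfo *BotCluster = nullptr; // cluster of the last bottom-up pick

  // After each pick, record the picked node's cluster group.
  // getCluster() returns nullptr for InvalidClusterId, so picking a
  // node outside any cluster clears the active cluster automatically,
  // which is exactly the "reset" the old pointer-based scheme missed.
  void schedNode(SUnit *SU, bool IsTopNode) {
    if (IsTopNode)
      TopCluster = DAG->getCluster(SU->ParentClusterIdx);
    else
      BotCluster = DAG->getCluster(SU->ParentClusterIdx);
  }

  // Cluster preference becomes a set-membership test against the
  // active cluster rather than a comparison with a single "next"
  // node, so any member of the group may be picked next.
  bool prefersForCluster(SUnit *SU, bool AtTop) const {
    const ClusterInfo *Active = AtTop ? TopCluster : BotCluster;
    return Active && Active->contains(SU);
  }
};
```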

@llvmbot
Copy link
Member

llvmbot commented Apr 29, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Ruiling, Song (ruiling)

Patch is 5.52 MiB, truncated to 20.00 KiB above, full version: https://github.com/llvm/llvm-project/pull/137784.diff

176 Files Affected.

@jayfoad
Copy link
Contributor

jayfoad commented Apr 29, 2025

The existing implementation has some buggy behavior: the scheduler does not reset the pointer to the next cluster candidate. For example, we want to cluster A and B, but after picking A, we might pick node C. In theory, we should reset the next cluster candidate here, because we have decided not to cluster A and B during scheduling. Later picking B because of Cluster is not logical.

Could you fix that bug first in a separate PR? Then this PR should have fewer test check changes.
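
For illustration, the reset being requested would look roughly like the following against the pre-patch design; `NextClusterPred`/`NextClusterSucc` are the `ScheduleDAGMI` members this PR deletes (passed by reference here because the real fields are private), and no claim is made that the separate fix takes exactly this shape.

```
#include "llvm/CodeGen/ScheduleDAG.h"
using llvm::SUnit;

// Sketch only: drop a stale cluster candidate as soon as the scheduler
// picks something else, so B is not later preferred "because of
// Cluster" after the pairing with A was already abandoned.
static void resetStaleClusterCandidate(const SUnit *Picked, bool IsTopNode,
                                       const SUnit *&NextClusterPred,
                                       const SUnit *&NextClusterSucc) {
  if (IsTopNode && Picked != NextClusterSucc)
    NextClusterSucc = nullptr;
  if (!IsTopNode && Picked != NextClusterPred)
    NextClusterPred = nullptr;
}
```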

@jayfoad
Copy link
Contributor

jayfoad commented Apr 29, 2025

For AMDGPU, some cases use more v_dual instructions while others regress. It seems less critical.

Agreed, this is probably just due to good luck or bad luck in register allocation. We could put -amdgpu-enable-vopd=0 on these tests.
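
For reference, pinning down such a test would just mean adding the flag to its RUN line. A hypothetical example follows; the triple and `-mcpu` are illustrative and not taken from any file in this diff.

```
; Hypothetical RUN line: -amdgpu-enable-vopd=0 disables v_dual_* pairing
; so the checks exercise load/store clustering rather than VOPD luck.
; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -amdgpu-enable-vopd=0 < %s | FileCheck %s
```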

@@ -61,6 +61,11 @@ bool llvm::fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,
  for (SDep &SI : SecondSU.Preds)
    if (SI.isCluster())
      return false;

Copy link
Contributor


I think the MacroFusion changes have no effect; do I understand it correctly?

Copy link
Contributor Author


It is a little bit confusing here. The "cluster" weak edge no longer functions as it did before. I thought about removing these edges at first, but then it is hard to detect cases like:

  // Check that neither instr is already paired with another along the edge
  // between them.

We now only keep the set of SUnits in a cluster group, with no ordering anymore, so we cannot detect "the edge between them". I am not sure whether it is important to check for this pattern. Maybe we just need to check that the two nodes are not being clustered with anything else? I don't have a clear answer yet. So I still keep the "cluster" weak edge, but which nodes get clustered during scheduling is now determined only by the Clusters in the DAG. Maybe we can clean this up later if people have a clear idea of how best to handle this.
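
A sketch of the simpler membership check suggested above, assuming the `ParentClusterIdx`/`InvalidClusterId` bookkeeping this patch adds. Note it is stricter than the old edge scan (it rejects fusion when either node is already in any cluster group, matching the assert added in MacroFusion.cpp) and is not what the PR currently does.

```
#include "llvm/CodeGen/ScheduleDAG.h"

// Sketch only: replace the Cluster-edge scan with a group-membership
// test. InvalidClusterId marks a node that belongs to no cluster.
static bool neitherAlreadyClustered(const llvm::SUnit &FirstSU,
                                    const llvm::SUnit &SecondSU) {
  return FirstSU.ParentClusterIdx == llvm::InvalidClusterId &&
         SecondSU.ParentClusterIdx == llvm::InvalidClusterId;
}
```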

@ruiling
Copy link
Contributor Author

ruiling commented May 12, 2025

Could you fix that bug first in a separate PR? Then this PR should have fewer test check changes.

See #139513. After that change, the diff here drops from "176 files changed, 31670 insertions(+), 31540 deletions(-)" to "101 files changed, 25804 insertions(+), 25670 deletions(-)".
