AMDGPU: Try to perform copy to agpr from reg_sequence at the copy #129463

Conversation

arsenm
Contributor

@arsenm arsenm commented Mar 3, 2025

SIFoldOperands is frustratingly written in a def-folds-into-use
iteration pattern, with a few random cases starting at the uses.
We were handling this case by looking at the reg_sequence and finding
the copy. This did not work for the most basic pattern of materializing
a vector constant that started in SGPRs. It just happens that there is an
optimization bug in SelectionDAG that produced the expected pattern.

Perform an additional attempt at the fold rooted at the copy. This
mostly shows test improvements. There were some tricky updates to
perform. remaining-virtual-register-operands.ll managed to stop failing
the allocator, so it needed to be tricked into failing again. I also do
not understand what schedule-xdl-resource.ll is trying to test, so this
changes it to check some arbitrary output that exists in the debug output.
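
For illustration, a minimal MIR sketch of the kind of pattern the fold targets (register numbers, classes, and constant values here are hypothetical, not taken from the patch): a vector constant is materialized in SGPRs, assembled with a reg_sequence, and then copied into an AGPR class; folding at the copy rewrites it into a reg_sequence of accvgpr writes of the inline constants.

    ; Hypothetical input: constant built in SGPRs, then copied to AGPRs.
    %0:sreg_32 = S_MOV_B32 1
    %1:sreg_32 = S_MOV_B32 2
    %2:sreg_64 = REG_SEQUENCE %0, %subreg.sub0, %1, %subreg.sub1
    %3:areg_64 = COPY %2

    ; Hypothetical result of the fold: the copy becomes a REG_SEQUENCE of
    ; inline constants written directly into AGPRs.
    %4:agpr_32 = V_ACCVGPR_WRITE_B32_e64 1, implicit $exec
    %5:agpr_32 = V_ACCVGPR_WRITE_B32_e64 2, implicit $exec
    %3:areg_64 = REG_SEQUENCE %4, %subreg.sub0, %5, %subreg.sub1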

@llvmbot
Member

llvmbot commented Mar 3, 2025

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Changes

Patch is 272.87 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/129463.diff

15 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIFoldOperands.cpp (+106-47)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.mfma.gfx90a.ll (+31-35)
  • (modified) llvm/test/CodeGen/AMDGPU/acc-ldst.ll (+7-4)
  • (added) llvm/test/CodeGen/AMDGPU/coalesces-better.mir (+74)
  • (added) llvm/test/CodeGen/AMDGPU/coalesces-worse.mir (+71)
  • (modified) llvm/test/CodeGen/AMDGPU/fold-agpr-phis.mir (+36)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.scale.f32.16x16x128.f8f6f4.ll (+11-15)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.scale.f32.32x32x64.f8f6f4.ll (+172-258)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smfmac.gfx950.ll (+1382-1986)
  • (modified) llvm/test/CodeGen/AMDGPU/mfma-no-register-aliasing.ll (+10-8)
  • (modified) llvm/test/CodeGen/AMDGPU/no-fold-accvgpr-mov.ll (+23-26)
  • (modified) llvm/test/CodeGen/AMDGPU/no-fold-accvgpr-mov.mir (+19-16)
  • (modified) llvm/test/CodeGen/AMDGPU/remaining-virtual-register-operands.ll (+7-2)
  • (modified) llvm/test/CodeGen/AMDGPU/schedule-xdl-resource.ll (+2-3)
  • (modified) llvm/test/CodeGen/AMDGPU/si-fold-operands-agpr-copy-reg-sequence.mir (+100-58)
diff --git a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
index eb9aabf8b6317..3a3f303293461 100644
--- a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
@@ -1510,76 +1510,128 @@ bool SIFoldOperandsImpl::foldCopyToAGPRRegSequence(MachineInstr *CopyMI) const {
   // only accept VGPR or inline immediate. Recreate a reg_sequence with its
   // initializers right here, so we will rematerialize immediates and avoid
   // copies via different reg classes.
-  if (!TRI->isAGPR(*MRI, CopyMI->getOperand(0).getReg()))
+  const TargetRegisterClass *DefRC =
+      MRI->getRegClass(CopyMI->getOperand(0).getReg());
+  if (!TRI->isAGPRClass(DefRC))
     return false;
+
   Register UseReg = CopyMI->getOperand(1).getReg();
-  SmallVector<std::pair<MachineOperand *, unsigned>, 32> Defs;
-  if (!getRegSeqInit(Defs, UseReg, AMDGPU::OPERAND_REG_INLINE_C_INT32))
+  MachineInstr *RegSeq = MRI->getVRegDef(UseReg);
+  if (!RegSeq || !RegSeq->isRegSequence())
     return false;
 
   const DebugLoc &DL = CopyMI->getDebugLoc();
   MachineBasicBlock &MBB = *CopyMI->getParent();
 
+  MachineInstrBuilder B(*MBB.getParent(), CopyMI);
+  DenseMap<TargetInstrInfo::RegSubRegPair, Register> VGPRCopies;
+  SmallSetVector<TargetInstrInfo::RegSubRegPair, 32> SeenInputs;
+
+  const TargetRegisterClass *UseRC =
+      MRI->getRegClass(CopyMI->getOperand(1).getReg());
+
+  // Value, subregindex for new REG_SEQUENCE
+  SmallVector<std::pair<MachineOperand *, unsigned>, 32> NewDefs;
+
+  unsigned NumRegSeqOperands = RegSeq->getNumOperands();
+  unsigned NumFoldable = 0;
+
+  for (unsigned I = 1; I != NumRegSeqOperands; I += 2) {
+    MachineOperand &RegOp = RegSeq->getOperand(I);
+    unsigned SubRegIdx = RegSeq->getOperand(I + 1).getImm();
+
+    if (RegOp.getSubReg()) {
+      // TODO: Handle subregister compose
+      NewDefs.emplace_back(&RegOp, SubRegIdx);
+      continue;
+    }
+
+    MachineOperand *Lookup = lookUpCopyChain(*TII, *MRI, RegOp.getReg());
+    if (!Lookup)
+      Lookup = &RegOp;
+
+    if (Lookup->isImm()) {
+      // Check if this is an agpr_32 subregister.
+      const TargetRegisterClass *DestSuperRC = TRI->getMatchingSuperRegClass(
+          DefRC, &AMDGPU::AGPR_32RegClass, SubRegIdx);
+      if (DestSuperRC &&
+          TII->isInlineConstant(*Lookup, AMDGPU::OPERAND_REG_INLINE_C_INT32)) {
+        ++NumFoldable;
+        NewDefs.emplace_back(Lookup, SubRegIdx);
+        continue;
+      }
+    }
+
+    const TargetRegisterClass *InputRC =
+        Lookup->isReg() ? MRI->getRegClass(Lookup->getReg())
+                        : MRI->getRegClass(RegOp.getReg());
+
+    // TODO: Account for Lookup->getSubReg()
+
+    // If we can't find a matching super class, this is an SGPR->AGPR or
+    // VGPR->AGPR subreg copy (or something constant-like we have to materialize
+    // in the AGPR). We can't directly copy from SGPR to AGPR on gfx908, so we
+    // want to rewrite to copy to an intermediate VGPR class.
+    const TargetRegisterClass *MatchRC =
+        TRI->getMatchingSuperRegClass(DefRC, InputRC, SubRegIdx);
+    if (!MatchRC) {
+      ++NumFoldable;
+      NewDefs.emplace_back(&RegOp, SubRegIdx);
+      continue;
+    }
+
+    NewDefs.emplace_back(&RegOp, SubRegIdx);
+  }
+
+  // Do not clone a reg_sequence and merely change the result register class.
+  if (NumFoldable == 0)
+    return false;
+
   CopyMI->setDesc(TII->get(AMDGPU::REG_SEQUENCE));
   for (unsigned I = CopyMI->getNumOperands() - 1; I > 0; --I)
     CopyMI->removeOperand(I);
 
-  MachineInstrBuilder B(*MBB.getParent(), CopyMI);
-  DenseMap<TargetInstrInfo::RegSubRegPair, Register> VGPRCopies;
-  SmallSetVector<TargetInstrInfo::RegSubRegPair, 32> SeenAGPRs;
-  for (unsigned I = 0, NumElts = Defs.size(); I != NumElts; ++I) {
-    MachineOperand *Def = Defs[I].first;
-    TargetInstrInfo::RegSubRegPair CopyToVGPR;
-    if (Def->isImm() &&
-        TII->isInlineConstant(*Def, AMDGPU::OPERAND_REG_INLINE_C_INT32)) {
-      int64_t Imm = Def->getImm();
-
-      auto Tmp = MRI->createVirtualRegister(&AMDGPU::AGPR_32RegClass);
+  for (auto [Def, DestSubIdx] : NewDefs) {
+    if (!Def->isReg()) {
+      // TODO: Should we use single write for each repeated value like in
+      // register case?
+      Register Tmp = MRI->createVirtualRegister(&AMDGPU::AGPR_32RegClass);
       BuildMI(MBB, CopyMI, DL, TII->get(AMDGPU::V_ACCVGPR_WRITE_B32_e64), Tmp)
-          .addImm(Imm);
+          .add(*Def);
       B.addReg(Tmp);
-    } else if (Def->isReg() && TRI->isAGPR(*MRI, Def->getReg())) {
-      auto Src = getRegSubRegPair(*Def);
+    } else {
+      TargetInstrInfo::RegSubRegPair Src = getRegSubRegPair(*Def);
       Def->setIsKill(false);
-      if (!SeenAGPRs.insert(Src)) {
+
+      Register &VGPRCopy = VGPRCopies[Src];
+      if (!VGPRCopy) {
+        const TargetRegisterClass *VGPRUseSubRC =
+            TRI->getSubRegisterClass(UseRC, DestSubIdx);
+
         // We cannot build a reg_sequence out of the same registers, they
         // must be copied. Better do it here before copyPhysReg() created
         // several reads to do the AGPR->VGPR->AGPR copy.
-        CopyToVGPR = Src;
-      } else {
-        B.addReg(Src.Reg, Def->isUndef() ? RegState::Undef : 0, Src.SubReg);
-      }
-    } else {
-      assert(Def->isReg());
-      Def->setIsKill(false);
-      auto Src = getRegSubRegPair(*Def);
 
-      // Direct copy from SGPR to AGPR is not possible. To avoid creation
-      // of exploded copies SGPR->VGPR->AGPR in the copyPhysReg() later,
-      // create a copy here and track if we already have such a copy.
-      if (TRI->isSGPRReg(*MRI, Src.Reg)) {
-        CopyToVGPR = Src;
+        // Direct copy from SGPR to AGPR is not possible on gfx908. To avoid
+        // creation of exploded copies SGPR->VGPR->AGPR in the copyPhysReg()
+        // later, create a copy here and track if we already have such a copy.
+        if (TRI->getSubRegisterClass(MRI->getRegClass(Src.Reg), Src.SubReg) !=
+            VGPRUseSubRC) {
+          VGPRCopy = MRI->createVirtualRegister(VGPRUseSubRC);
+          BuildMI(MBB, CopyMI, DL, TII->get(AMDGPU::COPY), VGPRCopy).add(*Def);
+          B.addReg(VGPRCopy);
+        } else {
+          // If it is already a VGPR, do not copy the register.
+          B.add(*Def);
+        }
       } else {
-        auto Tmp = MRI->createVirtualRegister(&AMDGPU::AGPR_32RegClass);
-        BuildMI(MBB, CopyMI, DL, TII->get(AMDGPU::COPY), Tmp).add(*Def);
-        B.addReg(Tmp);
+        B.addReg(VGPRCopy);
       }
     }
 
-    if (CopyToVGPR.Reg) {
-      auto [It, Inserted] = VGPRCopies.try_emplace(CopyToVGPR);
-      Register &Vgpr = It->second;
-      if (Inserted) {
-        Vgpr = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);
-        BuildMI(MBB, CopyMI, DL, TII->get(AMDGPU::COPY), Vgpr).add(*Def);
-      }
-      Register Tmp = MRI->createVirtualRegister(&AMDGPU::AGPR_32RegClass);
-      BuildMI(MBB, CopyMI, DL, TII->get(AMDGPU::COPY), Tmp).addReg(Vgpr);
-      B.addReg(Tmp);
-    }
-
-    B.addImm(Defs[I].second);
+    B.addImm(DestSubIdx);
   }
+
   LLVM_DEBUG(dbgs() << "Folded " << *CopyMI);
   return true;
 }
@@ -1634,6 +1686,13 @@ bool SIFoldOperandsImpl::tryFoldFoldableCopy(
       foldCopyToVGPROfScalarAddOfFrameIndex(DstReg, OpToFold.getReg(), MI))
     return true;
 
+  // Fold copy to AGPR through reg_sequence
+  // TODO: Handle with subregister extract
+  if (OpToFold.isReg() && MI.isCopy() && !MI.getOperand(1).getSubReg()) {
+    if (foldCopyToAGPRRegSequence(&MI))
+      return true;
+  }
+
   bool Changed = foldInstOperand(MI, OpToFold);
 
   // If we managed to fold all uses of this copy then we might as well
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.mfma.gfx90a.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.mfma.gfx90a.ll
index a31064e293622..3f5a99cad9543 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.mfma.gfx90a.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.mfma.gfx90a.ll
@@ -16,13 +16,15 @@ define amdgpu_kernel void @test_mfma_f32_32x32x4bf16_1k(ptr addrspace(1) %arg) #
 ; GCN-NEXT:    s_load_dwordx2 s[34:35], s[4:5], 0x24
 ; GCN-NEXT:    s_mov_b64 s[36:37], 1
 ; GCN-NEXT:    v_pk_mov_b32 v[0:1], s[36:37], s[36:37] op_sel:[0,1]
-; GCN-NEXT:    s_mov_b32 s36, 2
-; GCN-NEXT:    v_pk_mov_b32 v[2:3], s[36:37], s[36:37] op_sel:[0,1]
+; GCN-NEXT:    s_mov_b32 s38, 2
+; GCN-NEXT:    s_mov_b32 s39, s37
 ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
 ; GCN-NEXT:    s_load_dwordx16 s[0:15], s[34:35], 0x0
 ; GCN-NEXT:    s_load_dwordx16 s[16:31], s[34:35], 0x40
+; GCN-NEXT:    v_pk_mov_b32 v[2:3], s[38:39], s[38:39] op_sel:[0,1]
 ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
 ; GCN-NEXT:    v_accvgpr_write_b32 a0, s0
+; GCN-NEXT:    v_accvgpr_write_b32 a16, s16
 ; GCN-NEXT:    v_accvgpr_write_b32 a1, s1
 ; GCN-NEXT:    v_accvgpr_write_b32 a2, s2
 ; GCN-NEXT:    v_accvgpr_write_b32 a3, s3
@@ -38,7 +40,6 @@ define amdgpu_kernel void @test_mfma_f32_32x32x4bf16_1k(ptr addrspace(1) %arg) #
 ; GCN-NEXT:    v_accvgpr_write_b32 a13, s13
 ; GCN-NEXT:    v_accvgpr_write_b32 a14, s14
 ; GCN-NEXT:    v_accvgpr_write_b32 a15, s15
-; GCN-NEXT:    v_accvgpr_write_b32 a16, s16
 ; GCN-NEXT:    v_accvgpr_write_b32 a17, s17
 ; GCN-NEXT:    v_accvgpr_write_b32 a18, s18
 ; GCN-NEXT:    v_accvgpr_write_b32 a19, s19
@@ -317,31 +318,29 @@ bb:
 define amdgpu_kernel void @test_mfma_f64_16x16x4f64_imm(ptr addrspace(1) %arg, double %a, double %b) #0 {
 ; GCN-LABEL: test_mfma_f64_16x16x4f64_imm:
 ; GCN:       ; %bb.0: ; %bb
-; GCN-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x24
-; GCN-NEXT:    s_load_dwordx2 s[12:13], s[4:5], 0x34
-; GCN-NEXT:    s_mov_b64 s[0:1], 0
+; GCN-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; GCN-NEXT:    s_load_dwordx2 s[10:11], s[4:5], 0x34
 ; GCN-NEXT:    s_mov_b64 s[6:7], 1.0
-; GCN-NEXT:    s_mov_b64 s[2:3], s[0:1]
+; GCN-NEXT:    s_mov_b64 s[8:9], 0
+; GCN-NEXT:    v_accvgpr_write_b32 a0, s8
 ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
-; GCN-NEXT:    v_pk_mov_b32 v[0:1], s[10:11], s[10:11] op_sel:[0,1]
-; GCN-NEXT:    s_mov_b64 s[4:5], s[0:1]
-; GCN-NEXT:    v_accvgpr_write_b32 a0, s0
-; GCN-NEXT:    v_accvgpr_write_b32 a1, s1
-; GCN-NEXT:    v_accvgpr_write_b32 a2, s2
-; GCN-NEXT:    v_accvgpr_write_b32 a3, s3
-; GCN-NEXT:    v_accvgpr_write_b32 a4, s4
-; GCN-NEXT:    v_accvgpr_write_b32 a5, s5
+; GCN-NEXT:    v_pk_mov_b32 v[0:1], s[2:3], s[2:3] op_sel:[0,1]
+; GCN-NEXT:    v_accvgpr_write_b32 a2, s8
+; GCN-NEXT:    v_accvgpr_write_b32 a4, s8
 ; GCN-NEXT:    v_accvgpr_write_b32 a6, s6
+; GCN-NEXT:    v_accvgpr_write_b32 a1, s9
+; GCN-NEXT:    v_accvgpr_write_b32 a3, s9
+; GCN-NEXT:    v_accvgpr_write_b32 a5, s9
 ; GCN-NEXT:    v_accvgpr_write_b32 a7, s7
-; GCN-NEXT:    v_pk_mov_b32 v[2:3], s[12:13], s[12:13] op_sel:[0,1]
+; GCN-NEXT:    v_pk_mov_b32 v[2:3], s[10:11], s[10:11] op_sel:[0,1]
 ; GCN-NEXT:    s_nop 1
 ; GCN-NEXT:    v_mfma_f64_16x16x4f64 a[0:7], v[0:1], v[2:3], a[0:7]
 ; GCN-NEXT:    v_mov_b32_e32 v0, 0
 ; GCN-NEXT:    s_nop 7
 ; GCN-NEXT:    s_nop 7
 ; GCN-NEXT:    s_nop 0
-; GCN-NEXT:    global_store_dwordx4 v0, a[0:3], s[8:9]
-; GCN-NEXT:    global_store_dwordx4 v0, a[4:7], s[8:9] offset:16
+; GCN-NEXT:    global_store_dwordx4 v0, a[0:3], s[0:1]
+; GCN-NEXT:    global_store_dwordx4 v0, a[4:7], s[0:1] offset:16
 ; GCN-NEXT:    s_endpgm
 bb:
   %mai.1 = tail call <4 x double> @llvm.amdgcn.mfma.f64.16x16x4f64(double %a, double %b, <4 x double> <double 0.0, double 0.0, double 0.0, double 1.0>, i32 0, i32 0, i32 0)
@@ -352,32 +351,29 @@ bb:
 define amdgpu_kernel void @test_mfma_f64_16x16x4f64_splat_lit(ptr addrspace(1) %arg, double %a, double %b) #0 {
 ; GCN-LABEL: test_mfma_f64_16x16x4f64_splat_lit:
 ; GCN:       ; %bb.0: ; %bb
-; GCN-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x24
-; GCN-NEXT:    s_load_dwordx2 s[12:13], s[4:5], 0x34
-; GCN-NEXT:    s_mov_b32 s0, 0
-; GCN-NEXT:    s_mov_b32 s1, 0x405ec000
-; GCN-NEXT:    s_mov_b64 s[2:3], s[0:1]
+; GCN-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; GCN-NEXT:    s_load_dwordx2 s[8:9], s[4:5], 0x34
+; GCN-NEXT:    s_mov_b32 s6, 0
+; GCN-NEXT:    s_mov_b32 s7, 0x405ec000
+; GCN-NEXT:    v_accvgpr_write_b32 a0, s6
 ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
-; GCN-NEXT:    v_pk_mov_b32 v[0:1], s[10:11], s[10:11] op_sel:[0,1]
-; GCN-NEXT:    s_mov_b64 s[4:5], s[0:1]
-; GCN-NEXT:    s_mov_b64 s[6:7], s[0:1]
-; GCN-NEXT:    v_accvgpr_write_b32 a0, s0
-; GCN-NEXT:    v_accvgpr_write_b32 a1, s1
-; GCN-NEXT:    v_accvgpr_write_b32 a2, s2
-; GCN-NEXT:    v_accvgpr_write_b32 a3, s3
-; GCN-NEXT:    v_accvgpr_write_b32 a4, s4
-; GCN-NEXT:    v_accvgpr_write_b32 a5, s5
+; GCN-NEXT:    v_pk_mov_b32 v[0:1], s[2:3], s[2:3] op_sel:[0,1]
+; GCN-NEXT:    v_accvgpr_write_b32 a2, s6
+; GCN-NEXT:    v_accvgpr_write_b32 a4, s6
 ; GCN-NEXT:    v_accvgpr_write_b32 a6, s6
+; GCN-NEXT:    v_accvgpr_write_b32 a1, s7
+; GCN-NEXT:    v_accvgpr_write_b32 a3, s7
+; GCN-NEXT:    v_accvgpr_write_b32 a5, s7
 ; GCN-NEXT:    v_accvgpr_write_b32 a7, s7
-; GCN-NEXT:    v_pk_mov_b32 v[2:3], s[12:13], s[12:13] op_sel:[0,1]
+; GCN-NEXT:    v_pk_mov_b32 v[2:3], s[8:9], s[8:9] op_sel:[0,1]
 ; GCN-NEXT:    s_nop 1
 ; GCN-NEXT:    v_mfma_f64_16x16x4f64 a[0:7], v[0:1], v[2:3], a[0:7]
 ; GCN-NEXT:    v_mov_b32_e32 v0, 0
 ; GCN-NEXT:    s_nop 7
 ; GCN-NEXT:    s_nop 7
 ; GCN-NEXT:    s_nop 0
-; GCN-NEXT:    global_store_dwordx4 v0, a[0:3], s[8:9]
-; GCN-NEXT:    global_store_dwordx4 v0, a[4:7], s[8:9] offset:16
+; GCN-NEXT:    global_store_dwordx4 v0, a[0:3], s[0:1]
+; GCN-NEXT:    global_store_dwordx4 v0, a[4:7], s[0:1] offset:16
 ; GCN-NEXT:    s_endpgm
 bb:
   %mai.1 = tail call <4 x double> @llvm.amdgcn.mfma.f64.16x16x4f64(double %a, double %b, <4 x double> <double 123.0, double 123.0, double 123.0, double 123.0>, i32 0, i32 0, i32 0)
diff --git a/llvm/test/CodeGen/AMDGPU/acc-ldst.ll b/llvm/test/CodeGen/AMDGPU/acc-ldst.ll
index 5c484e1e52da8..726bfbab7ad48 100644
--- a/llvm/test/CodeGen/AMDGPU/acc-ldst.ll
+++ b/llvm/test/CodeGen/AMDGPU/acc-ldst.ll
@@ -190,11 +190,11 @@ bb:
 
 ; NB: for atomics both vdata and vdst shall be either VGPR or AGPR
 ; GCN-LABEL: {{^}}test_atomic_mfma_4xi32_atomic_store:
+; GCN: v_accvgpr_write_b32 [[A_ZERO:a[0-9]+]], 0
 ; GCN:     global_atomic_sub [[IN:v[0-9]+]], v{{[0-9:]+}}, v{{[0-9]+}}, s[{{[0-9:]+}}] glc
+; GCN-DAG: v_accvgpr_mov_b32 a{{[0-9]+}}, [[A_ZERO]]
+; GCN-DAG: v_accvgpr_mov_b32 a{{[0-9]+}}, [[A_ZERO]]
 ; GCN-DAG: v_accvgpr_write_b32 a{{[0-9]+}}, [[IN]]
-; GCN-DAG: v_accvgpr_write_b32 a{{[0-9]+}}, v{{[0-9]+}}
-; GCN-DAG: v_accvgpr_write_b32 a{{[0-9]+}}, v{{[0-9]+}}
-; GCN-DAG: v_accvgpr_write_b32 a{{[0-9]+}}, v{{[0-9]+}}
 ; GCN:     v_mfma_i32_4x4x4i8 a[[[N:[0-9]+]]:
 ; GCN:     v_accvgpr_read_b32 [[V:v[0-9]+]], a[[N]]{{$}}
 ; GCN:     global_atomic_add v{{[0-9]+}}, v{{[0-9:]+}}, [[V]], s[{{[0-9:]+}}] glc
@@ -217,7 +217,10 @@ bb:
 
 ; GCN-LABEL: {{^}}test_atomic_mfma_4xi32_atomic64_store:
 ; GCN:         global_atomic_sub_x2 v[{{[0-9:]+}}], v{{[0-9:]+}}, v[{{[0-9:]+}}], s[{{[0-9:]+}}] glc
-; GCN-COUNT-4: v_accvgpr_write_b32 a{{[0-9]+}}, v{{[0-9]+}}
+; GCN: v_accvgpr_write_b32 [[A_ZERO:a[0-9]+]], 0
+; GCN: v_accvgpr_mov_b32 a{{[0-9]+}}, [[A_ZERO]]
+; GCN: v_accvgpr_write_b32 a{{[0-9]+}}, v{{[0-9]+}}
+; GCN: v_accvgpr_write_b32 a{{[0-9]+}}, v{{[0-9]+}}
 ; GCN:         v_mfma_i32_4x4x4i8 a[[[N:[0-9]+]]:
 ; GCN:         v_accvgpr_read_b32 v{{[0-9]+}}, a{{[0-9]+}}
 ; GCN:         v_accvgpr_read_b32 v{{[0-9]+}}, a{{[0-9]+}}
diff --git a/llvm/test/CodeGen/AMDGPU/coalesces-better.mir b/llvm/test/CodeGen/AMDGPU/coalesces-better.mir
new file mode 100644
index 0000000000000..593220d879c2a
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/coalesces-better.mir
@@ -0,0 +1,74 @@
+# NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx950 -start-after=si-fold-operands -o - %s | FileCheck %s
+
+--- |
+  target triple = "amdgcn-mesa-mesa3d"
+
+  define <4 x float> @test_smfmac_f32_16x16x64_f16(<8 x half> %arg0, <16 x half> %arg1, <4 x float> %arg2, i32 %arg3) #0 {
+  ; CHECK-LABEL: test_smfmac_f32_16x16x64_f16:
+  ; CHECK:       ; %bb.0:
+  ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+  ; CHECK-NEXT:    v_smfmac_f32_16x16x64_f16 v[12:15], v[0:3], v[4:11], v16
+  ; CHECK-NEXT:    s_nop 7
+  ; CHECK-NEXT:    v_mov_b32_e32 v0, v12
+  ; CHECK-NEXT:    v_mov_b32_e32 v1, v13
+  ; CHECK-NEXT:    v_mov_b32_e32 v2, v14
+  ; CHECK-NEXT:    v_mov_b32_e32 v3, v15
+  ; CHECK-NEXT:    s_setpc_b64 s[30:31]
+    %result = call <4 x float> @llvm.amdgcn.smfmac.f32.16x16x64.f16(<8 x half> %arg0, <16 x half> %arg1, <4 x float> %arg2, i32 %arg3, i32 immarg 0, i32 immarg 0)
+    ret <4 x float> %result
+  }
+
+  ; Function Attrs: convergent nocallback nofree nosync nounwind willreturn memory(none)
+  declare <4 x float> @llvm.amdgcn.smfmac.f32.16x16x64.f16(<8 x half>, <16 x half>, <4 x float>, i32, i32 immarg, i32 immarg) #1
+
+  attributes #0 = { "target-cpu"="gfx950" }
+  attributes #1 = { convergent nocallback nofree nosync nounwind willreturn memory(none) "target-cpu"="gfx950" }
+
+...
+---
+name:            test_smfmac_f32_16x16x64_f16
+tracksRegLiveness: true
+isSSA:           true
+machineFunctionInfo:
+  frameOffsetReg:  '$sgpr33'
+  stackPtrOffsetReg: '$sgpr32'
+  sgprForEXECCopy: '$sgpr100_sgpr101'
+body:             |
+  bb.0:
+    liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5, $vgpr6, $vgpr7, $vgpr8, $vgpr9, $vgpr10, $vgpr11, $vgpr12, $vgpr13, $vgpr14, $vgpr15, $vgpr16
+
+
+    %0:vgpr_32 = COPY $vgpr16
+    %1:vgpr_32 = COPY $vgpr15
+    %2:vgpr_32 = COPY $vgpr14
+    %3:vgpr_32 = COPY $vgpr13
+    %4:vgpr_32 = COPY $vgpr12
+    %5:vgpr_32 = COPY $vgpr11
+    %6:vgpr_32 = COPY $vgpr10
+    %7:vgpr_32 = COPY $vgpr9
+    %8:vgpr_32 = COPY $vgpr8
+    %9:vgpr_32 = COPY $vgpr7
+    %10:vgpr_32 = COPY $vgpr6
+    %11:vgpr_32 = COPY $vgpr5
+    %12:vgpr_32 = COPY $vgpr4
+    %13:vgpr_32 = COPY $vgpr3
+    %14:vgpr_32 = COPY $vgpr2
+    %15:vgpr_32 = COPY $vgpr1
+    %16:vgpr_32 = COPY $vgpr0
+    %17:vreg_256_align2 = REG_SEQUENCE %12, %subreg.sub0, %11, %subreg.sub1, %10, %subreg.sub2, %9, %subreg.sub3, %8, %subreg.sub4, %7, %subreg.sub5, %6, %subreg.sub6, %5, %subreg.sub7
+    %18:vreg_128_align2 = REG_SEQUENCE %4, %subreg.sub0, %3, %subreg.sub1, %2, %subreg.sub2, %1, %subreg.sub3
+    %19:vreg_128_align2 = REG_SEQUENCE %16, %subreg.sub0, %15, %subreg.sub1, %14, %subreg.sub2, %13, %subreg.sub3
+    %24:areg_128_align2 = COPY %18
+    %25:areg_128_align2 = V_SMFMAC_F32_16X16X64_F16_e64 %19, %17, %0, 0, 0, %24, implicit $mode, implicit $exec
+    %26:vgpr_32 = COPY %25.sub0
+    %27:vgpr_32 = COPY %25.sub1
+    %28:vgpr_32 = COPY %25.sub2
+    %29:vgpr_32 = COPY %25.sub3
+    $vgpr0 = COPY %26
+    $vgpr1 = COPY %27
+    $vgpr2 = COPY %28
+    $vgpr3 = COPY %29
+    SI_RETURN implicit $vgpr0, implicit $vgpr1, implicit $vgpr2, implicit $vgpr3
+
+...
diff --git a/llvm/test/CodeGen/AMDGPU/coalesces-worse.mir b/llvm/test/CodeGen/AMDGPU/coalesces-worse.mir
new file mode 100644
index 0000000000000..0718f825fbacb
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/coalesces-worse.mir
@@ -0,0 +1,71 @@
+# NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx950 -start-after=si-fold-operands -o - %s | FileCheck %s
+
+--- |
+  target triple = "amdgcn-mesa-mesa3d"
+
+  define <4 x float> @test_smfmac_f32_16x16x64_f16(<8 x half> %arg0, <16 x half> %arg1, <4 x float> %arg2, i32 %arg3) #0 {
+  ; CHECK-LABEL: test_smfmac_f32_16x16x64_f16:
+  ; CHECK:       ; %bb.0:
+  ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+  ; CHECK-NEXT:    v_accvgpr_write_b32 a0, v12
+  ; CHECK-NEXT:    v_accvgpr_write_b32 a1, v13
+  ; CHECK-NEXT:    v_accvgpr_write_b32 a2, v14
+  ; CHECK-NEXT:    v_accvgpr_write_b32 a3, v15
+  ; CHECK-NEXT:    s_nop 1
+  ; CHECK-NEXT:    v_smfmac_f32_16x16x64_f16 a[0:3], v[0:3], v[4:11], v16
+  ; CHECK-NEXT:    s_nop 7
+  ; CHECK-NEXT:    v_accvgpr_read_b32 v0, a0
+  ; CHECK-NEXT:    v_accvgpr_read_b32 v1, a1
+  ; CHECK-NEXT:    v_accvgpr_read_b32 v2, a2
+  ; CHECK-NEXT:    v_accvgpr_read_b32 v3, a3
+  ; CHECK-NEXT:    s_setpc_b64 s[30:31]
+    %result = call <4 x float> @llvm.amdgcn.smfmac.f32.16x16x64.f16(<8 x half> %arg0, <16 x half> %arg1, <4 x float> %arg2, i32 %arg3, i32 immarg 0, i32 immarg 0)
+    ret <4 x float> %result
+  }
+
+  declare <4 x float> @llvm.amdgcn.smfmac.f3...
[truncated]

@arsenm arsenm force-pushed the users/arsenm/amdgpu/stop-introducing-accvpr-write-for-reg-copy branch from 2340ede to f3d76df on March 3, 2025 09:19
Base automatically changed from users/arsenm/amdgpu/stop-introducing-accvpr-write-for-reg-copy to main March 3, 2025 09:22
@arsenm arsenm force-pushed the users/arsenm/amdgpu/fold-agpr-reg-sequence-at-copy-not-reg-sequence-user branch 2 times, most recently from a5de66e to d6ca555 on March 3, 2025 13:17
Comment on lines +1533 to +1588
  // Value, subregindex for new REG_SEQUENCE
  SmallVector<std::pair<MachineOperand *, unsigned>, 32> NewDefs;

  unsigned NumRegSeqOperands = RegSeq->getNumOperands();
  unsigned NumFoldable = 0;

  for (unsigned I = 1; I != NumRegSeqOperands; I += 2) {
    MachineOperand &RegOp = RegSeq->getOperand(I);
    unsigned SubRegIdx = RegSeq->getOperand(I + 1).getImm();

    if (RegOp.getSubReg()) {
      // TODO: Handle subregister compose
      NewDefs.emplace_back(&RegOp, SubRegIdx);
      continue;
    }

    MachineOperand *Lookup = lookUpCopyChain(*TII, *MRI, RegOp.getReg());
    if (!Lookup)
      Lookup = &RegOp;

    if (Lookup->isImm()) {
      // Check if this is an agpr_32 subregister.
      const TargetRegisterClass *DestSuperRC = TRI->getMatchingSuperRegClass(
          DefRC, &AMDGPU::AGPR_32RegClass, SubRegIdx);
      if (DestSuperRC &&
          TII->isInlineConstant(*Lookup, AMDGPU::OPERAND_REG_INLINE_C_INT32)) {
        ++NumFoldable;
        NewDefs.emplace_back(Lookup, SubRegIdx);
        continue;
      }
    }

    const TargetRegisterClass *InputRC =
        Lookup->isReg() ? MRI->getRegClass(Lookup->getReg())
                        : MRI->getRegClass(RegOp.getReg());

    // TODO: Account for Lookup->getSubReg()

    // If we can't find a matching super class, this is an SGPR->AGPR or
    // VGPR->AGPR subreg copy (or something constant-like we have to materialize
    // in the AGPR). We can't directly copy from SGPR to AGPR on gfx908, so we
    // want to rewrite to copy to an intermediate VGPR class.
    const TargetRegisterClass *MatchRC =
        TRI->getMatchingSuperRegClass(DefRC, InputRC, SubRegIdx);
    if (!MatchRC) {
      ++NumFoldable;
      NewDefs.emplace_back(&RegOp, SubRegIdx);
      continue;
    }

    NewDefs.emplace_back(&RegOp, SubRegIdx);
  }

  // Do not clone a reg_sequence and merely change the result register class.
  if (NumFoldable == 0)
    return false;
Contributor

Forgive my ignorance, but why isn't it sufficient to use (or modify) getRegSeqInit?

Contributor Author

getRegSeqInit bakes in the assumption that every entry is 32 bits. This should handle arbitrary subregister indexes, and it has the heuristic checks mixed in with the source look-throughs.
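
For example (a hypothetical case, not taken from this patch), a reg_sequence can be built from 64-bit elements, so an entry corresponds to a wide subregister index such as sub0_sub1 rather than a single 32-bit lane:

    %0:vreg_64_align2 = COPY $vgpr0_vgpr1
    %1:vreg_64_align2 = COPY $vgpr2_vgpr3
    ; Each element is 64 bits wide (sub0_sub1 / sub2_sub3), so this is not a
    ; sequence of 32-bit initializers.
    %2:vreg_128_align2 = REG_SEQUENCE %0, %subreg.sub0_sub1, %1, %subreg.sub2_sub3
    %3:areg_128_align2 = COPY %2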

Comment on lines +1615 to +1619
        // Direct copy from SGPR to AGPR is not possible on gfx908. To avoid
        // creation of exploded copies SGPR->VGPR->AGPR in the copyPhysReg()
        // later, create a copy here and track if we already have such a copy.
        if (TRI->getSubRegisterClass(MRI->getRegClass(Src.Reg), Src.SubReg) !=
            VGPRUseSubRC) {
Contributor

How does this check for SGPR (as the comments explain)?

Contributor Author

The register bank mismatch is covered by the generic subregister class logic.

Comment on lines +19 to +20
; GCN-NEXT:    s_mov_b32 s38, 2
; GCN-NEXT:    s_mov_b32 s39, s37
Contributor

Why can't it use s[36:37] as it did before?

Contributor Author

This amounts to reordering; the v_pk_mov_b32 just moved below. I don't expect consistent regalloc decisions with movement.

Contributor Author

arsenm commented Mar 4, 2025

Merge activity

  • Mar 4, 2:37 AM EST: A user started a stack merge that includes this pull request via Graphite.
  • Mar 4, 2:39 AM EST: Graphite rebased this pull request as part of a merge.
  • Mar 4, 2:41 AM EST: A user merged this pull request with Graphite.

@arsenm arsenm force-pushed the users/arsenm/amdgpu/fold-agpr-reg-sequence-at-copy-not-reg-sequence-user branch from d6ca555 to 0ed5d46 on March 4, 2025 07:39
@arsenm arsenm merged commit b3d5056 into main Mar 4, 2025
6 of 9 checks passed
@arsenm arsenm deleted the users/arsenm/amdgpu/fold-agpr-reg-sequence-at-copy-not-reg-sequence-user branch March 4, 2025 07:41
jph-13 pushed a commit to jph-13/llvm-project that referenced this pull request Mar 21, 2025
…vm#129463)
