AMDGPU: Try to perform copy to agpr from reg_sequence at the copy #129463

Conversation

arsenm
Contributor

@arsenm arsenm commented Mar 3, 2025

SIFoldOperands is frustratingly written in a def-folds-into-use
iteration pattern, with a few random cases starting at the uses.
We were handling this case by looking at the reg_sequence and finding
the copy. This did not work for the most basic pattern of materializing
a vector constant that started in SGPRs. It just happens that there is an
optimization bug in SelectionDAG that produced the expected pattern.

Perform an additional attempt at the fold rooted at the copy. This
mostly shows test improvements. There were some tricky updates to
perform. remaining-virtual-register-operands.ll managed to stop failing
the allocator, so it needed to be tricked into failing again. I also do
not understand what schedule-xdl-resource.ll is trying to test, so this
changes it to check some arbitrary output that exists in the debug output.
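
For illustration, a minimal MIR sketch of the kind of pattern the fold targets (register numbers, classes, and constant values here are hypothetical, not taken from the patch): a vector constant is materialized in SGPRs, assembled with a reg_sequence, and then copied into an AGPR class; folding at the copy rewrites it into a reg_sequence of accvgpr writes of the inline constants.

    ; Hypothetical input: constant built in SGPRs, then copied to AGPRs.
    %0:sreg_32 = S_MOV_B32 1
    %1:sreg_32 = S_MOV_B32 2
    %2:sreg_64 = REG_SEQUENCE %0, %subreg.sub0, %1, %subreg.sub1
    %3:areg_64 = COPY %2

    ; Hypothetical result of the fold: the copy becomes a REG_SEQUENCE of
    ; inline constants written directly into AGPRs.
    %4:agpr_32 = V_ACCVGPR_WRITE_B32_e64 1, implicit $exec
    %5:agpr_32 = V_ACCVGPR_WRITE_B32_e64 2, implicit $exec
    %3:areg_64 = REG_SEQUENCE %4, %subreg.sub0, %5, %subreg.sub1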

@llvmbot
Member

llvmbot commented Mar 3, 2025

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Changes

Patch is 272.87 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/129463.diff

15 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIFoldOperands.cpp (+106-47)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.mfma.gfx90a.ll (+31-35)
  • (modified) llvm/test/CodeGen/AMDGPU/acc-ldst.ll (+7-4)
  • (added) llvm/test/CodeGen/AMDGPU/coalesces-better.mir (+74)
  • (added) llvm/test/CodeGen/AMDGPU/coalesces-worse.mir (+71)
  • (modified) llvm/test/CodeGen/AMDGPU/fold-agpr-phis.mir (+36)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.scale.f32.16x16x128.f8f6f4.ll (+11-15)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.scale.f32.32x32x64.f8f6f4.ll (+172-258)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smfmac.gfx950.ll (+1382-1986)
  • (modified) llvm/test/CodeGen/AMDGPU/mfma-no-register-aliasing.ll (+10-8)
  • (modified) llvm/test/CodeGen/AMDGPU/no-fold-accvgpr-mov.ll (+23-26)
  • (modified) llvm/test/CodeGen/AMDGPU/no-fold-accvgpr-mov.mir (+19-16)
  • (modified) llvm/test/CodeGen/AMDGPU/remaining-virtual-register-operands.ll (+7-2)
  • (modified) llvm/test/CodeGen/AMDGPU/schedule-xdl-resource.ll (+2-3)
  • (modified) llvm/test/CodeGen/AMDGPU/si-fold-operands-agpr-copy-reg-sequence.mir (+100-58)
diff --git a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
index eb9aabf8b6317..3a3f303293461 100644
--- a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
@@ -1510,76 +1510,128 @@ bool SIFoldOperandsImpl::foldCopyToAGPRRegSequence(MachineInstr *CopyMI) const {
   // only accept VGPR or inline immediate. Recreate a reg_sequence with its
   // initializers right here, so we will rematerialize immediates and avoid
   // copies via different reg classes.
-  if (!TRI->isAGPR(*MRI, CopyMI->getOperand(0).getReg()))
+  const TargetRegisterClass *DefRC =
+      MRI->getRegClass(CopyMI->getOperand(0).getReg());
+  if (!TRI->isAGPRClass(DefRC))
     return false;
+
   Register UseReg = CopyMI->getOperand(1).getReg();
-  SmallVector<std::pair<MachineOperand *, unsigned>, 32> Defs;
-  if (!getRegSeqInit(Defs, UseReg, AMDGPU::OPERAND_REG_INLINE_C_INT32))
+  MachineInstr *RegSeq = MRI->getVRegDef(UseReg);
+  if (!RegSeq || !RegSeq->isRegSequence())
     return false;
 
   const DebugLoc &DL = CopyMI->getDebugLoc();
   MachineBasicBlock &MBB = *CopyMI->getParent();
 
+  MachineInstrBuilder B(*MBB.getParent(), CopyMI);
+  DenseMap<TargetInstrInfo::RegSubRegPair, Register> VGPRCopies;
+  SmallSetVector<TargetInstrInfo::RegSubRegPair, 32> SeenInputs;
+
+  const TargetRegisterClass *UseRC =
+      MRI->getRegClass(CopyMI->getOperand(1).getReg());
+
+  // Value, subregindex for new REG_SEQUENCE
+  SmallVector<std::pair<MachineOperand *, unsigned>, 32> NewDefs;
+
+  unsigned NumRegSeqOperands = RegSeq->getNumOperands();
+  unsigned NumFoldable = 0;
+
+  for (unsigned I = 1; I != NumRegSeqOperands; I += 2) {
+    MachineOperand &RegOp = RegSeq->getOperand(I);
+    unsigned SubRegIdx = RegSeq->getOperand(I + 1).getImm();
+
+    if (RegOp.getSubReg()) {
+      // TODO: Handle subregister compose
+      NewDefs.emplace_back(&RegOp, SubRegIdx);
+      continue;
+    }
+
+    MachineOperand *Lookup = lookUpCopyChain(*TII, *MRI, RegOp.getReg());
+    if (!Lookup)
+      Lookup = &RegOp;
+
+    if (Lookup->isImm()) {
+      // Check if this is an agpr_32 subregister.
+      const TargetRegisterClass *DestSuperRC = TRI->getMatchingSuperRegClass(
+          DefRC, &AMDGPU::AGPR_32RegClass, SubRegIdx);
+      if (DestSuperRC &&
+          TII->isInlineConstant(*Lookup, AMDGPU::OPERAND_REG_INLINE_C_INT32)) {
+        ++NumFoldable;
+        NewDefs.emplace_back(Lookup, SubRegIdx);
+        continue;
+      }
+    }
+
+    const TargetRegisterClass *InputRC =
+        Lookup->isReg() ? MRI->getRegClass(Lookup->getReg())
+                        : MRI->getRegClass(RegOp.getReg());
+
+    // TODO: Account for Lookup->getSubReg()
+
+    // If we can't find a matching super class, this is an SGPR->AGPR or
+    // VGPR->AGPR subreg copy (or something constant-like we have to materialize
+    // in the AGPR). We can't directly copy from SGPR to AGPR on gfx908, so we
+    // want to rewrite to copy to an intermediate VGPR class.
+    const TargetRegisterClass *MatchRC =
+        TRI->getMatchingSuperRegClass(DefRC, InputRC, SubRegIdx);
+    if (!MatchRC) {
+      ++NumFoldable;
+      NewDefs.emplace_back(&RegOp, SubRegIdx);
+      continue;
+    }
+
+    NewDefs.emplace_back(&RegOp, SubRegIdx);
+  }
+
+  // Do not clone a reg_sequence and merely change the result register class.
+  if (NumFoldable == 0)
+    return false;
+
   CopyMI->setDesc(TII->get(AMDGPU::REG_SEQUENCE));
   for (unsigned I = CopyMI->getNumOperands() - 1; I > 0; --I)
     CopyMI->removeOperand(I);
 
-  MachineInstrBuilder B(*MBB.getParent(), CopyMI);
-  DenseMap<TargetInstrInfo::RegSubRegPair, Register> VGPRCopies;
-  SmallSetVector<TargetInstrInfo::RegSubRegPair, 32> SeenAGPRs;
-  for (unsigned I = 0, NumElts = Defs.size(); I != NumElts; ++I) {
-    MachineOperand *Def = Defs[I].first;
-    TargetInstrInfo::RegSubRegPair CopyToVGPR;
-    if (Def->isImm() &&
-        TII->isInlineConstant(*Def, AMDGPU::OPERAND_REG_INLINE_C_INT32)) {
-      int64_t Imm = Def->getImm();
-
-      auto Tmp = MRI->createVirtualRegister(&AMDGPU::AGPR_32RegClass);
+  for (auto [Def, DestSubIdx] : NewDefs) {
+    if (!Def->isReg()) {
+      // TODO: Should we use single write for each repeated value like in
+      // register case?
+      Register Tmp = MRI->createVirtualRegister(&AMDGPU::AGPR_32RegClass);
       BuildMI(MBB, CopyMI, DL, TII->get(AMDGPU::V_ACCVGPR_WRITE_B32_e64), Tmp)
-          .addImm(Imm);
+          .add(*Def);
       B.addReg(Tmp);
-    } else if (Def->isReg() && TRI->isAGPR(*MRI, Def->getReg())) {
-      auto Src = getRegSubRegPair(*Def);
+    } else {
+      TargetInstrInfo::RegSubRegPair Src = getRegSubRegPair(*Def);
       Def->setIsKill(false);
-      if (!SeenAGPRs.insert(Src)) {
+
+      Register &VGPRCopy = VGPRCopies[Src];
+      if (!VGPRCopy) {
+        const TargetRegisterClass *VGPRUseSubRC =
+            TRI->getSubRegisterClass(UseRC, DestSubIdx);
+
         // We cannot build a reg_sequence out of the same registers, they
         // must be copied. Better do it here before copyPhysReg() created
         // several reads to do the AGPR->VGPR->AGPR copy.
-        CopyToVGPR = Src;
-      } else {
-        B.addReg(Src.Reg, Def->isUndef() ? RegState::Undef : 0, Src.SubReg);
-      }
-    } else {
-      assert(Def->isReg());
-      Def->setIsKill(false);
-      auto Src = getRegSubRegPair(*Def);
 
-      // Direct copy from SGPR to AGPR is not possible. To avoid creation
-      // of exploded copies SGPR->VGPR->AGPR in the copyPhysReg() later,
-      // create a copy here and track if we already have such a copy.
-      if (TRI->isSGPRReg(*MRI, Src.Reg)) {
-        CopyToVGPR = Src;
+        // Direct copy from SGPR to AGPR is not possible on gfx908. To avoid
+        // creation of exploded copies SGPR->VGPR->AGPR in the copyPhysReg()
+        // later, create a copy here and track if we already have such a copy.
+        if (TRI->getSubRegisterClass(MRI->getRegClass(Src.Reg), Src.SubReg) !=
+            VGPRUseSubRC) {
+          VGPRCopy = MRI->createVirtualRegister(VGPRUseSubRC);
+          BuildMI(MBB, CopyMI, DL, TII->get(AMDGPU::COPY), VGPRCopy).add(*Def);
+          B.addReg(VGPRCopy);
+        } else {
+          // If it is already a VGPR, do not copy the register.
+          B.add(*Def);
+        }
       } else {
-        auto Tmp = MRI->createVirtualRegister(&AMDGPU::AGPR_32RegClass);
-        BuildMI(MBB, CopyMI, DL, TII->get(AMDGPU::COPY), Tmp).add(*Def);
-        B.addReg(Tmp);
+        B.addReg(VGPRCopy);
       }
     }
 
-    if (CopyToVGPR.Reg) {
-      auto [It, Inserted] = VGPRCopies.try_emplace(CopyToVGPR);
-      Register &Vgpr = It->second;
-      if (Inserted) {
-        Vgpr = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);
-        BuildMI(MBB, CopyMI, DL, TII->get(AMDGPU::COPY), Vgpr).add(*Def);
-      }
-      Register Tmp = MRI->createVirtualRegister(&AMDGPU::AGPR_32RegClass);
-      BuildMI(MBB, CopyMI, DL, TII->get(AMDGPU::COPY), Tmp).addReg(Vgpr);
-      B.addReg(Tmp);
-    }
-
-    B.addImm(Defs[I].second);
+    B.addImm(DestSubIdx);
   }
+
   LLVM_DEBUG(dbgs() << "Folded " << *CopyMI);
   return true;
 }
@@ -1634,6 +1686,13 @@ bool SIFoldOperandsImpl::tryFoldFoldableCopy(
       foldCopyToVGPROfScalarAddOfFrameIndex(DstReg, OpToFold.getReg(), MI))
     return true;
 
+  // Fold copy to AGPR through reg_sequence
+  // TODO: Handle with subregister extract
+  if (OpToFold.isReg() && MI.isCopy() && !MI.getOperand(1).getSubReg()) {
+    if (foldCopyToAGPRRegSequence(&MI))
+      return true;
+  }
+
   bool Changed = foldInstOperand(MI, OpToFold);
 
   // If we managed to fold all uses of this copy then we might as well
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.mfma.gfx90a.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.mfma.gfx90a.ll
index a31064e293622..3f5a99cad9543 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.mfma.gfx90a.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.mfma.gfx90a.ll
@@ -16,13 +16,15 @@ define amdgpu_kernel void @test_mfma_f32_32x32x4bf16_1k(ptr addrspace(1) %arg) #
 ; GCN-NEXT:    s_load_dwordx2 s[34:35], s[4:5], 0x24
 ; GCN-NEXT:    s_mov_b64 s[36:37], 1
 ; GCN-NEXT:    v_pk_mov_b32 v[0:1], s[36:37], s[36:37] op_sel:[0,1]
-; GCN-NEXT:    s_mov_b32 s36, 2
-; GCN-NEXT:    v_pk_mov_b32 v[2:3], s[36:37], s[36:37] op_sel:[0,1]
+; GCN-NEXT:    s_mov_b32 s38, 2
+; GCN-NEXT:    s_mov_b32 s39, s37
 ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
 ; GCN-NEXT:    s_load_dwordx16 s[0:15], s[34:35], 0x0
 ; GCN-NEXT:    s_load_dwordx16 s[16:31], s[34:35], 0x40
+; GCN-NEXT:    v_pk_mov_b32 v[2:3], s[38:39], s[38:39] op_sel:[0,1]
 ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
 ; GCN-NEXT:    v_accvgpr_write_b32 a0, s0
+; GCN-NEXT:    v_accvgpr_write_b32 a16, s16
 ; GCN-NEXT:    v_accvgpr_write_b32 a1, s1
 ; GCN-NEXT:    v_accvgpr_write_b32 a2, s2
 ; GCN-NEXT:    v_accvgpr_write_b32 a3, s3
@@ -38,7 +40,6 @@ define amdgpu_kernel void @test_mfma_f32_32x32x4bf16_1k(ptr addrspace(1) %arg) #
 ; GCN-NEXT:    v_accvgpr_write_b32 a13, s13
 ; GCN-NEXT:    v_accvgpr_write_b32 a14, s14
 ; GCN-NEXT:    v_accvgpr_write_b32 a15, s15
-; GCN-NEXT:    v_accvgpr_write_b32 a16, s16
 ; GCN-NEXT:    v_accvgpr_write_b32 a17, s17
 ; GCN-NEXT:    v_accvgpr_write_b32 a18, s18
 ; GCN-NEXT:    v_accvgpr_write_b32 a19, s19
@@ -317,31 +318,29 @@ bb:
 define amdgpu_kernel void @test_mfma_f64_16x16x4f64_imm(ptr addrspace(1) %arg, double %a, double %b) #0 {
 ; GCN-LABEL: test_mfma_f64_16x16x4f64_imm:
 ; GCN:       ; %bb.0: ; %bb
-; GCN-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x24
-; GCN-NEXT:    s_load_dwordx2 s[12:13], s[4:5], 0x34
-; GCN-NEXT:    s_mov_b64 s[0:1], 0
+; GCN-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; GCN-NEXT:    s_load_dwordx2 s[10:11], s[4:5], 0x34
 ; GCN-NEXT:    s_mov_b64 s[6:7], 1.0
-; GCN-NEXT:    s_mov_b64 s[2:3], s[0:1]
+; GCN-NEXT:    s_mov_b64 s[8:9], 0
+; GCN-NEXT:    v_accvgpr_write_b32 a0, s8
 ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
-; GCN-NEXT:    v_pk_mov_b32 v[0:1], s[10:11], s[10:11] op_sel:[0,1]
-; GCN-NEXT:    s_mov_b64 s[4:5], s[0:1]
-; GCN-NEXT:    v_accvgpr_write_b32 a0, s0
-; GCN-NEXT:    v_accvgpr_write_b32 a1, s1
-; GCN-NEXT:    v_accvgpr_write_b32 a2, s2
-; GCN-NEXT:    v_accvgpr_write_b32 a3, s3
-; GCN-NEXT:    v_accvgpr_write_b32 a4, s4
-; GCN-NEXT:    v_accvgpr_write_b32 a5, s5
+; GCN-NEXT:    v_pk_mov_b32 v[0:1], s[2:3], s[2:3] op_sel:[0,1]
+; GCN-NEXT:    v_accvgpr_write_b32 a2, s8
+; GCN-NEXT:    v_accvgpr_write_b32 a4, s8
 ; GCN-NEXT:    v_accvgpr_write_b32 a6, s6
+; GCN-NEXT:    v_accvgpr_write_b32 a1, s9
+; GCN-NEXT:    v_accvgpr_write_b32 a3, s9
+; GCN-NEXT:    v_accvgpr_write_b32 a5, s9
 ; GCN-NEXT:    v_accvgpr_write_b32 a7, s7
-; GCN-NEXT:    v_pk_mov_b32 v[2:3], s[12:13], s[12:13] op_sel:[0,1]
+; GCN-NEXT:    v_pk_mov_b32 v[2:3], s[10:11], s[10:11] op_sel:[0,1]
 ; GCN-NEXT:    s_nop 1
 ; GCN-NEXT:    v_mfma_f64_16x16x4f64 a[0:7], v[0:1], v[2:3], a[0:7]
 ; GCN-NEXT:    v_mov_b32_e32 v0, 0
 ; GCN-NEXT:    s_nop 7
 ; GCN-NEXT:    s_nop 7
 ; GCN-NEXT:    s_nop 0
-; GCN-NEXT:    global_store_dwordx4 v0, a[0:3], s[8:9]
-; GCN-NEXT:    global_store_dwordx4 v0, a[4:7], s[8:9] offset:16
+; GCN-NEXT:    global_store_dwordx4 v0, a[0:3], s[0:1]
+; GCN-NEXT:    global_store_dwordx4 v0, a[4:7], s[0:1] offset:16
 ; GCN-NEXT:    s_endpgm
 bb:
   %mai.1 = tail call <4 x double> @llvm.amdgcn.mfma.f64.16x16x4f64(double %a, double %b, <4 x double> <double 0.0, double 0.0, double 0.0, double 1.0>, i32 0, i32 0, i32 0)
@@ -352,32 +351,29 @@ bb:
 define amdgpu_kernel void @test_mfma_f64_16x16x4f64_splat_lit(ptr addrspace(1) %arg, double %a, double %b) #0 {
 ; GCN-LABEL: test_mfma_f64_16x16x4f64_splat_lit:
 ; GCN:       ; %bb.0: ; %bb
-; GCN-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x24
-; GCN-NEXT:    s_load_dwordx2 s[12:13], s[4:5], 0x34
-; GCN-NEXT:    s_mov_b32 s0, 0
-; GCN-NEXT:    s_mov_b32 s1, 0x405ec000
-; GCN-NEXT:    s_mov_b64 s[2:3], s[0:1]
+; GCN-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; GCN-NEXT:    s_load_dwordx2 s[8:9], s[4:5], 0x34
+; GCN-NEXT:    s_mov_b32 s6, 0
+; GCN-NEXT:    s_mov_b32 s7, 0x405ec000
+; GCN-NEXT:    v_accvgpr_write_b32 a0, s6
 ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
-; GCN-NEXT:    v_pk_mov_b32 v[0:1], s[10:11], s[10:11] op_sel:[0,1]
-; GCN-NEXT:    s_mov_b64 s[4:5], s[0:1]
-; GCN-NEXT:    s_mov_b64 s[6:7], s[0:1]
-; GCN-NEXT:    v_accvgpr_write_b32 a0, s0
-; GCN-NEXT:    v_accvgpr_write_b32 a1, s1
-; GCN-NEXT:    v_accvgpr_write_b32 a2, s2
-; GCN-NEXT:    v_accvgpr_write_b32 a3, s3
-; GCN-NEXT:    v_accvgpr_write_b32 a4, s4
-; GCN-NEXT:    v_accvgpr_write_b32 a5, s5
+; GCN-NEXT:    v_pk_mov_b32 v[0:1], s[2:3], s[2:3] op_sel:[0,1]
+; GCN-NEXT:    v_accvgpr_write_b32 a2, s6
+; GCN-NEXT:    v_accvgpr_write_b32 a4, s6
 ; GCN-NEXT:    v_accvgpr_write_b32 a6, s6
+; GCN-NEXT:    v_accvgpr_write_b32 a1, s7
+; GCN-NEXT:    v_accvgpr_write_b32 a3, s7
+; GCN-NEXT:    v_accvgpr_write_b32 a5, s7
 ; GCN-NEXT:    v_accvgpr_write_b32 a7, s7
-; GCN-NEXT:    v_pk_mov_b32 v[2:3], s[12:13], s[12:13] op_sel:[0,1]
+; GCN-NEXT:    v_pk_mov_b32 v[2:3], s[8:9], s[8:9] op_sel:[0,1]
 ; GCN-NEXT:    s_nop 1
 ; GCN-NEXT:    v_mfma_f64_16x16x4f64 a[0:7], v[0:1], v[2:3], a[0:7]
 ; GCN-NEXT:    v_mov_b32_e32 v0, 0
 ; GCN-NEXT:    s_nop 7
 ; GCN-NEXT:    s_nop 7
 ; GCN-NEXT:    s_nop 0
-; GCN-NEXT:    global_store_dwordx4 v0, a[0:3], s[8:9]
-; GCN-NEXT:    global_store_dwordx4 v0, a[4:7], s[8:9] offset:16
+; GCN-NEXT:    global_store_dwordx4 v0, a[0:3], s[0:1]
+; GCN-NEXT:    global_store_dwordx4 v0, a[4:7], s[0:1] offset:16
 ; GCN-NEXT:    s_endpgm
 bb:
   %mai.1 = tail call <4 x double> @llvm.amdgcn.mfma.f64.16x16x4f64(double %a, double %b, <4 x double> <double 123.0, double 123.0, double 123.0, double 123.0>, i32 0, i32 0, i32 0)
diff --git a/llvm/test/CodeGen/AMDGPU/acc-ldst.ll b/llvm/test/CodeGen/AMDGPU/acc-ldst.ll
index 5c484e1e52da8..726bfbab7ad48 100644
--- a/llvm/test/CodeGen/AMDGPU/acc-ldst.ll
+++ b/llvm/test/CodeGen/AMDGPU/acc-ldst.ll
@@ -190,11 +190,11 @@ bb:
 
 ; NB: for atomics both vdata and vdst shall be either VGPR or AGPR
 ; GCN-LABEL: {{^}}test_atomic_mfma_4xi32_atomic_store:
+; GCN: v_accvgpr_write_b32 [[A_ZERO:a[0-9]+]], 0
 ; GCN:     global_atomic_sub [[IN:v[0-9]+]], v{{[0-9:]+}}, v{{[0-9]+}}, s[{{[0-9:]+}}] glc
+; GCN-DAG: v_accvgpr_mov_b32 a{{[0-9]+}}, [[A_ZERO]]
+; GCN-DAG: v_accvgpr_mov_b32 a{{[0-9]+}}, [[A_ZERO]]
 ; GCN-DAG: v_accvgpr_write_b32 a{{[0-9]+}}, [[IN]]
-; GCN-DAG: v_accvgpr_write_b32 a{{[0-9]+}}, v{{[0-9]+}}
-; GCN-DAG: v_accvgpr_write_b32 a{{[0-9]+}}, v{{[0-9]+}}
-; GCN-DAG: v_accvgpr_write_b32 a{{[0-9]+}}, v{{[0-9]+}}
 ; GCN:     v_mfma_i32_4x4x4i8 a[[[N:[0-9]+]]:
 ; GCN:     v_accvgpr_read_b32 [[V:v[0-9]+]], a[[N]]{{$}}
 ; GCN:     global_atomic_add v{{[0-9]+}}, v{{[0-9:]+}}, [[V]], s[{{[0-9:]+}}] glc
@@ -217,7 +217,10 @@ bb:
 
 ; GCN-LABEL: {{^}}test_atomic_mfma_4xi32_atomic64_store:
 ; GCN:         global_atomic_sub_x2 v[{{[0-9:]+}}], v{{[0-9:]+}}, v[{{[0-9:]+}}], s[{{[0-9:]+}}] glc
-; GCN-COUNT-4: v_accvgpr_write_b32 a{{[0-9]+}}, v{{[0-9]+}}
+; GCN: v_accvgpr_write_b32 [[A_ZERO:a[0-9]+]], 0
+; GCN: v_accvgpr_mov_b32 a{{[0-9]+}}, [[A_ZERO]]
+; GCN: v_accvgpr_write_b32 a{{[0-9]+}}, v{{[0-9]+}}
+; GCN: v_accvgpr_write_b32 a{{[0-9]+}}, v{{[0-9]+}}
 ; GCN:         v_mfma_i32_4x4x4i8 a[[[N:[0-9]+]]:
 ; GCN:         v_accvgpr_read_b32 v{{[0-9]+}}, a{{[0-9]+}}
 ; GCN:         v_accvgpr_read_b32 v{{[0-9]+}}, a{{[0-9]+}}
diff --git a/llvm/test/CodeGen/AMDGPU/coalesces-better.mir b/llvm/test/CodeGen/AMDGPU/coalesces-better.mir
new file mode 100644
index 0000000000000..593220d879c2a
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/coalesces-better.mir
@@ -0,0 +1,74 @@
+# NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx950 -start-after=si-fold-operands -o - %s | FileCheck %s
+
+--- |
+  target triple = "amdgcn-mesa-mesa3d"
+
+  define <4 x float> @test_smfmac_f32_16x16x64_f16(<8 x half> %arg0, <16 x half> %arg1, <4 x float> %arg2, i32 %arg3) #0 {
+  ; CHECK-LABEL: test_smfmac_f32_16x16x64_f16:
+  ; CHECK:       ; %bb.0:
+  ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+  ; CHECK-NEXT:    v_smfmac_f32_16x16x64_f16 v[12:15], v[0:3], v[4:11], v16
+  ; CHECK-NEXT:    s_nop 7
+  ; CHECK-NEXT:    v_mov_b32_e32 v0, v12
+  ; CHECK-NEXT:    v_mov_b32_e32 v1, v13
+  ; CHECK-NEXT:    v_mov_b32_e32 v2, v14
+  ; CHECK-NEXT:    v_mov_b32_e32 v3, v15
+  ; CHECK-NEXT:    s_setpc_b64 s[30:31]
+    %result = call <4 x float> @llvm.amdgcn.smfmac.f32.16x16x64.f16(<8 x half> %arg0, <16 x half> %arg1, <4 x float> %arg2, i32 %arg3, i32 immarg 0, i32 immarg 0)
+    ret <4 x float> %result
+  }
+
+  ; Function Attrs: convergent nocallback nofree nosync nounwind willreturn memory(none)
+  declare <4 x float> @llvm.amdgcn.smfmac.f32.16x16x64.f16(<8 x half>, <16 x half>, <4 x float>, i32, i32 immarg, i32 immarg) #1
+
+  attributes #0 = { "target-cpu"="gfx950" }
+  attributes #1 = { convergent nocallback nofree nosync nounwind willreturn memory(none) "target-cpu"="gfx950" }
+
+...
+---
+name:            test_smfmac_f32_16x16x64_f16
+tracksRegLiveness: true
+isSSA:           true
+machineFunctionInfo:
+  frameOffsetReg:  '$sgpr33'
+  stackPtrOffsetReg: '$sgpr32'
+  sgprForEXECCopy: '$sgpr100_sgpr101'
+body:             |
+  bb.0:
+    liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5, $vgpr6, $vgpr7, $vgpr8, $vgpr9, $vgpr10, $vgpr11, $vgpr12, $vgpr13, $vgpr14, $vgpr15, $vgpr16
+
+
+    %0:vgpr_32 = COPY $vgpr16
+    %1:vgpr_32 = COPY $vgpr15
+    %2:vgpr_32 = COPY $vgpr14
+    %3:vgpr_32 = COPY $vgpr13
+    %4:vgpr_32 = COPY $vgpr12
+    %5:vgpr_32 = COPY $vgpr11
+    %6:vgpr_32 = COPY $vgpr10
+    %7:vgpr_32 = COPY $vgpr9
+    %8:vgpr_32 = COPY $vgpr8
+    %9:vgpr_32 = COPY $vgpr7
+    %10:vgpr_32 = COPY $vgpr6
+    %11:vgpr_32 = COPY $vgpr5
+    %12:vgpr_32 = COPY $vgpr4
+    %13:vgpr_32 = COPY $vgpr3
+    %14:vgpr_32 = COPY $vgpr2
+    %15:vgpr_32 = COPY $vgpr1
+    %16:vgpr_32 = COPY $vgpr0
+    %17:vreg_256_align2 = REG_SEQUENCE %12, %subreg.sub0, %11, %subreg.sub1, %10, %subreg.sub2, %9, %subreg.sub3, %8, %subreg.sub4, %7, %subreg.sub5, %6, %subreg.sub6, %5, %subreg.sub7
+    %18:vreg_128_align2 = REG_SEQUENCE %4, %subreg.sub0, %3, %subreg.sub1, %2, %subreg.sub2, %1, %subreg.sub3
+    %19:vreg_128_align2 = REG_SEQUENCE %16, %subreg.sub0, %15, %subreg.sub1, %14, %subreg.sub2, %13, %subreg.sub3
+    %24:areg_128_align2 = COPY %18
+    %25:areg_128_align2 = V_SMFMAC_F32_16X16X64_F16_e64 %19, %17, %0, 0, 0, %24, implicit $mode, implicit $exec
+    %26:vgpr_32 = COPY %25.sub0
+    %27:vgpr_32 = COPY %25.sub1
+    %28:vgpr_32 = COPY %25.sub2
+    %29:vgpr_32 = COPY %25.sub3
+    $vgpr0 = COPY %26
+    $vgpr1 = COPY %27
+    $vgpr2 = COPY %28
+    $vgpr3 = COPY %29
+    SI_RETURN implicit $vgpr0, implicit $vgpr1, implicit $vgpr2, implicit $vgpr3
+
+...
diff --git a/llvm/test/CodeGen/AMDGPU/coalesces-worse.mir b/llvm/test/CodeGen/AMDGPU/coalesces-worse.mir
new file mode 100644
index 0000000000000..0718f825fbacb
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/coalesces-worse.mir
@@ -0,0 +1,71 @@
+# NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx950 -start-after=si-fold-operands -o - %s | FileCheck %s
+
+--- |
+  target triple = "amdgcn-mesa-mesa3d"
+
+  define <4 x float> @test_smfmac_f32_16x16x64_f16(<8 x half> %arg0, <16 x half> %arg1, <4 x float> %arg2, i32 %arg3) #0 {
+  ; CHECK-LABEL: test_smfmac_f32_16x16x64_f16:
+  ; CHECK:       ; %bb.0:
+  ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+  ; CHECK-NEXT:    v_accvgpr_write_b32 a0, v12
+  ; CHECK-NEXT:    v_accvgpr_write_b32 a1, v13
+  ; CHECK-NEXT:    v_accvgpr_write_b32 a2, v14
+  ; CHECK-NEXT:    v_accvgpr_write_b32 a3, v15
+  ; CHECK-NEXT:    s_nop 1
+  ; CHECK-NEXT:    v_smfmac_f32_16x16x64_f16 a[0:3], v[0:3], v[4:11], v16
+  ; CHECK-NEXT:    s_nop 7
+  ; CHECK-NEXT:    v_accvgpr_read_b32 v0, a0
+  ; CHECK-NEXT:    v_accvgpr_read_b32 v1, a1
+  ; CHECK-NEXT:    v_accvgpr_read_b32 v2, a2
+  ; CHECK-NEXT:    v_accvgpr_read_b32 v3, a3
+  ; CHECK-NEXT:    s_setpc_b64 s[30:31]
+    %result = call <4 x float> @llvm.amdgcn.smfmac.f32.16x16x64.f16(<8 x half> %arg0, <16 x half> %arg1, <4 x float> %arg2, i32 %arg3, i32 immarg 0, i32 immarg 0)
+    ret <4 x float> %result
+  }
+
+  declare <4 x float> @llvm.amdgcn.smfmac.f3...
[truncated]

@arsenm arsenm force-pushed the users/arsenm/amdgpu/stop-introducing-accvpr-write-for-reg-copy branch from 2340ede to f3d76df on March 3, 2025 09:19
Base automatically changed from users/arsenm/amdgpu/stop-introducing-accvpr-write-for-reg-copy to main March 3, 2025 09:22
@arsenm arsenm force-pushed the users/arsenm/amdgpu/fold-agpr-reg-sequence-at-copy-not-reg-sequence-user branch 2 times, most recently from a5de66e to d6ca555 on March 3, 2025 13:17
Comment on lines +1533 to +1588
  // Value, subregindex for new REG_SEQUENCE
  SmallVector<std::pair<MachineOperand *, unsigned>, 32> NewDefs;

  unsigned NumRegSeqOperands = RegSeq->getNumOperands();
  unsigned NumFoldable = 0;

  for (unsigned I = 1; I != NumRegSeqOperands; I += 2) {
    MachineOperand &RegOp = RegSeq->getOperand(I);
    unsigned SubRegIdx = RegSeq->getOperand(I + 1).getImm();

    if (RegOp.getSubReg()) {
      // TODO: Handle subregister compose
      NewDefs.emplace_back(&RegOp, SubRegIdx);
      continue;
    }

    MachineOperand *Lookup = lookUpCopyChain(*TII, *MRI, RegOp.getReg());
    if (!Lookup)
      Lookup = &RegOp;

    if (Lookup->isImm()) {
      // Check if this is an agpr_32 subregister.
      const TargetRegisterClass *DestSuperRC = TRI->getMatchingSuperRegClass(
          DefRC, &AMDGPU::AGPR_32RegClass, SubRegIdx);
      if (DestSuperRC &&
          TII->isInlineConstant(*Lookup, AMDGPU::OPERAND_REG_INLINE_C_INT32)) {
        ++NumFoldable;
        NewDefs.emplace_back(Lookup, SubRegIdx);
        continue;
      }
    }

    const TargetRegisterClass *InputRC =
        Lookup->isReg() ? MRI->getRegClass(Lookup->getReg())
                        : MRI->getRegClass(RegOp.getReg());

    // TODO: Account for Lookup->getSubReg()

    // If we can't find a matching super class, this is an SGPR->AGPR or
    // VGPR->AGPR subreg copy (or something constant-like we have to materialize
    // in the AGPR). We can't directly copy from SGPR to AGPR on gfx908, so we
    // want to rewrite to copy to an intermediate VGPR class.
    const TargetRegisterClass *MatchRC =
        TRI->getMatchingSuperRegClass(DefRC, InputRC, SubRegIdx);
    if (!MatchRC) {
      ++NumFoldable;
      NewDefs.emplace_back(&RegOp, SubRegIdx);
      continue;
    }

    NewDefs.emplace_back(&RegOp, SubRegIdx);
  }

  // Do not clone a reg_sequence and merely change the result register class.
  if (NumFoldable == 0)
    return false;
Contributor

Forgive my ignorance, but why isn't it sufficient to use (or modify) getRegSeqInit?

Contributor Author

getRegSeqInit bakes in the assumption that every entry is 32 bits. This should handle arbitrary subregister indexes, and it has the heuristic checks mixed in with the source look-throughs.
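
For example (a hypothetical case, not taken from this patch), a reg_sequence can be built from 64-bit elements, so an entry corresponds to a wide subregister index such as sub0_sub1 rather than a single 32-bit lane:

    %0:vreg_64_align2 = COPY $vgpr0_vgpr1
    %1:vreg_64_align2 = COPY $vgpr2_vgpr3
    ; Each element is 64 bits wide (sub0_sub1 / sub2_sub3), so this is not a
    ; sequence of 32-bit initializers.
    %2:vreg_128_align2 = REG_SEQUENCE %0, %subreg.sub0_sub1, %1, %subreg.sub2_sub3
    %3:areg_128_align2 = COPY %2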

Comment on lines +1615 to +1619
        // Direct copy from SGPR to AGPR is not possible on gfx908. To avoid
        // creation of exploded copies SGPR->VGPR->AGPR in the copyPhysReg()
        // later, create a copy here and track if we already have such a copy.
        if (TRI->getSubRegisterClass(MRI->getRegClass(Src.Reg), Src.SubReg) !=
            VGPRUseSubRC) {
Contributor

How does this check for SGPR (as the comments explain)?

Contributor Author

The register bank mismatch is covered by the generic subregister class logic.

Comment on lines +19 to +20
; GCN-NEXT:    s_mov_b32 s38, 2
; GCN-NEXT:    s_mov_b32 s39, s37
Contributor

Why can't it use s[36:37] as it did before?

Contributor Author

This amounts to reordering; the v_pk_mov_b32 just moved below. I don't expect consistent regalloc decisions with movement.

Contributor Author

arsenm commented Mar 4, 2025

Merge activity

  • Mar 4, 2:37 AM EST: A user started a stack merge that includes this pull request via Graphite.
  • Mar 4, 2:39 AM EST: Graphite rebased this pull request as part of a merge.
  • Mar 4, 2:41 AM EST: A user merged this pull request with Graphite.

@arsenm arsenm force-pushed the users/arsenm/amdgpu/fold-agpr-reg-sequence-at-copy-not-reg-sequence-user branch from d6ca555 to 0ed5d46 on March 4, 2025 07:39
@arsenm arsenm merged commit b3d5056 into main Mar 4, 2025
6 of 9 checks passed
@arsenm arsenm deleted the users/arsenm/amdgpu/fold-agpr-reg-sequence-at-copy-not-reg-sequence-user branch March 4, 2025 07:41
jph-13 pushed a commit to jph-13/llvm-project that referenced this pull request Mar 21, 2025
…vm#129463)
