[AMDGPU] Allocate AVRegClass last #146606

Open · jrbyrnes wants to merge 5 commits into main

Conversation

@jrbyrnes (Contributor) commented Jul 1, 2025

This changes the RC priorities such that AVRegClass is the least prioritized. Registers in this class are less constrained than those in VRegClass and ARegClass, since they can be assigned either an AGPR or a VGPR. Assigning them last therefore removes unnecessary constraints from the VRegClass and ARegClass assignments and lets the RA make smarter decisions about whether to use a VGPR or an AGPR for each AVRegClass register.

We only have 5 bits for RC priorities, and we still want to prioritize larger RCs over smaller ones. Since the new scheme uses the 5th bit to distinguish AVRegClass from ARegClass/VRegClass, only 4 bits remain to encode the size priorities. Previously, each RC with a distinct size had a distinct priority; this PR instead maps multiple sizes to the same priority. In practice this currently has no effect on prioritization, because only one actually defined RC per vector register type falls into each group.

For example, register classes with 15 or 16 32-bit registers share the same size priority (14). However, we only have VReg_512 (VReg_480 doesn't exist), so only one actual RC in VRegClass has this priority. Similarly, register classes with 17 or more 32-bit registers get a size priority of 15, but we only have VReg_1024.

The net effect of this PR is to prioritize first by vector register type (VReg and AReg have top priority, then AVReg), with the size of the register class as the secondary priority.
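To make the encoding concrete, here is a minimal standalone sketch of how the 5-bit AllocationPriority is assembled, mirroring the TableGen change in the diff below. The helper names (`sizePriority`, `allocationPriority`) and the standalone program form are illustrative only, not LLVM APIs:

```cpp
// Sketch of the new 5-bit AllocationPriority encoding (assumption: helper
// names are illustrative; the real computation lives in SIRegisterInfo.td).
#include <cassert>
#include <cstdio>

// Mirrors the TableGen: classes of 15-16 registers share size priority 14,
// and anything larger shares 15, so the size fits in 4 bits.
unsigned sizePriority(unsigned NumRegs) {
  if (NumRegs <= 14)
    return NumRegs - 1;
  return NumRegs <= 16 ? 14 : 15;
}

// Mirrors: AllocationPriority = SizePriority + RegClassPriority * 16.
// RegClassPriority is 1 for VReg/AReg classes and 0 for AVReg classes, so
// bit 4 puts every V/A class above every AV class regardless of size.
unsigned allocationPriority(unsigned NumRegs, bool IsAVClass) {
  unsigned Prio = sizePriority(NumRegs) + (IsAVClass ? 0 : 16);
  assert(Prio < 32 && "priorities must fit in 5 bits");
  return Prio;
}

int main() {
  std::printf("VReg_64:   %u\n", allocationPriority(2, false));  // 17
  std::printf("AV_64:     %u\n", allocationPriority(2, true));   // 1
  std::printf("VReg_512:  %u\n", allocationPriority(16, false)); // 30
  std::printf("VReg_1024: %u\n", allocationPriority(32, false)); // 31
  std::printf("AV_1024:   %u\n", allocationPriority(32, true));  // 15
}
```

With this split, even the smallest VReg/AReg class (priority 16) outranks the largest AV class (priority 15), which is exactly the "allocate AVRegClass last" behavior.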

Passes PSDB.

jrbyrnes added 2 commits July 1, 2025 09:47
Change-Id: Iace3462f27ea276b22716793ebfa13b5026b0e58
Change-Id: Ia1e4719bd9edada963d9e5f07371f786eb490d15
@llvmbot (Member) commented Jul 1, 2025

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Jeffrey Byrnes (jrbyrnes)

Changes



Patch is 445.81 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/146606.diff

40 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIRegisterInfo.td (+9-4)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll (+38-36)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll (+36-36)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll (+36-36)
  • (modified) llvm/test/CodeGen/AMDGPU/chain-hi-to-lo.ll (+4-3)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll (+114-104)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmax.ll (+240-261)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmin.ll (+240-261)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fsub.ll (+114-104)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics.ll (+62-71)
  • (modified) llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/freeze.ll (+6-5)
  • (modified) llvm/test/CodeGen/AMDGPU/frem.ll (+197-199)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll (+114-104)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmax.ll (+240-261)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmin.ll (+240-261)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fsub.ll (+114-104)
  • (modified) llvm/test/CodeGen/AMDGPU/half.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/infer-addrspace-flat-atomic.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/inflated-reg-class-snippet-copy-use-after-free.mir (+7-25)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.dim.gfx90a.ll (+53-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx950.bf16.ll (+32-32)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx950.ll (+94-98)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.scale.f32.32x32x64.f8f6f4.ll (+112-116)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.smfmac.gfx950.ll (+40-40)
  • (modified) llvm/test/CodeGen/AMDGPU/packed-fp32.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/partial-regcopy-and-spill-missed-at-regalloc.ll (+6-13)
  • (modified) llvm/test/CodeGen/AMDGPU/regalloc-failure-overlapping-insert-assert.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/regalloc-illegal-eviction-assert.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/register-killed-error-after-alloc-failure1.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/scalar_to_vector.v8i16.ll (+18-18)
  • (modified) llvm/test/CodeGen/AMDGPU/sext-in-reg-vector-shuffle.ll (+18-16)
  • (modified) llvm/test/CodeGen/AMDGPU/sint_to_fp.i64.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/uint_to_fp.i64.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/undef-handling-crash-in-ra.ll (+21-26)
  • (modified) llvm/test/CodeGen/AMDGPU/v_pack.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll (+14-16)
  • (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/widen-smrd-loads.ll (+11-9)
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.td b/llvm/lib/Target/AMDGPU/SIRegisterInfo.td
index d595163f820cb..b4e968f5f455a 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.td
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.td
@@ -109,6 +109,10 @@ class SIRegisterClass <string n, list<ValueType> rTypes, int Align, dag rList>
   let TSFlags{2} = HasVGPR;
   let TSFlags{3} = HasAGPR;
   let TSFlags{4} = HasSGPR;
+
+  // RegisterClass (e.g. AGPR / VGPR) priority for allocation
+  field int RegClassPriority = 1;
+
 }
 
 multiclass SIRegLoHi16 <string n, bits<8> regIdx, bit ArtificialHigh = 1,
@@ -936,14 +940,15 @@ class VRegClassBase<int numRegs, list<ValueType> regTypes, dag regList> :
 
   // Requires n v_mov_b32 to copy
   let CopyCost = numRegs;
-  let AllocationPriority = !sub(numRegs, 1);
+  defvar SizePriority = !if(!le(numRegs, 14), !sub(numRegs, 1), !if(!le(numRegs, 16), 14, 15));
+  let AllocationPriority = !add(SizePriority, !mul(RegClassPriority, 16));
   let Weight = numRegs;
 }
 
 // Define a register tuple class, along with one requiring an even
 // aligned base register.
 multiclass VRegClass<int numRegs, list<ValueType> regTypes, dag regList> {
-  let HasVGPR = 1 in {
+  let HasVGPR = 1, RegClassPriority = 1 in {
     // Define the regular class.
     def "" : VRegClassBase<numRegs, regTypes, regList> {
       let BaseClassOrder = !mul(numRegs, 32);
@@ -977,7 +982,7 @@ defm VReg_1024 : VRegClass<32, Reg1024Types.types, (add VGPR_1024)>;
 }
 
 multiclass ARegClass<int numRegs, list<ValueType> regTypes, dag regList> {
-  let CopyCost = !add(numRegs, numRegs, 1), HasAGPR = 1 in {
+  let CopyCost = !add(numRegs, numRegs, 1), HasAGPR = 1, RegClassPriority = 1 in {
     // Define the regular class.
     def "" : VRegClassBase<numRegs, regTypes, regList> {
       let BaseClassOrder = !mul(numRegs, 32);
@@ -1070,7 +1075,7 @@ def AV_32 : SIRegisterClass<"AMDGPU", VGPR_32.RegTypes, 32, (add VGPR_32, AGPR_3
 // aligned base register.
 multiclass AVRegClass<int numRegs, list<ValueType> regTypes,
                       dag vregList,  dag aregList> {
-  let HasVGPR = 1, HasAGPR = 1 in {
+  let HasVGPR = 1, HasAGPR = 1, RegClassPriority = 0 in {
     // Define the regular class.
     def "" : VRegClassBase<numRegs, regTypes, (add vregList, aregList)>;
 
diff --git a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll
index 36370361b677d..dc84a85d6c207 100644
--- a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll
+++ b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll
@@ -4295,15 +4295,15 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset__waterfall__amdgpu
 ; GFX12-TRUE16-NEXT:    s_wait_samplecnt 0x0
 ; GFX12-TRUE16-NEXT:    s_wait_bvhcnt 0x0
 ; GFX12-TRUE16-NEXT:    s_wait_kmcnt 0x0
-; GFX12-TRUE16-NEXT:    v_add_nc_u32_e32 v6, 0x200, v4
+; GFX12-TRUE16-NEXT:    v_add_nc_u32_e32 v4, 0x200, v4
 ; GFX12-TRUE16-NEXT:    s_mov_b32 s1, exec_lo
 ; GFX12-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX12-TRUE16-NEXT:    v_and_b32_e32 v4, 3, v6
-; GFX12-TRUE16-NEXT:    v_and_b32_e32 v9, -4, v6
-; GFX12-TRUE16-NEXT:    v_lshlrev_b32_e32 v4, 3, v4
+; GFX12-TRUE16-NEXT:    v_and_b32_e32 v6, 3, v4
+; GFX12-TRUE16-NEXT:    v_and_b32_e32 v9, -4, v4
+; GFX12-TRUE16-NEXT:    v_lshlrev_b32_e32 v8, 3, v6
 ; GFX12-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT:    v_lshlrev_b32_e64 v7, v4, 0xffff
-; GFX12-TRUE16-NEXT:    v_not_b32_e32 v10, v7
+; GFX12-TRUE16-NEXT:    v_lshlrev_b32_e64 v6, v8, 0xffff
+; GFX12-TRUE16-NEXT:    v_not_b32_e32 v10, v6
 ; GFX12-TRUE16-NEXT:  .LBB15_1: ; =>This Inner Loop Header: Depth=1
 ; GFX12-TRUE16-NEXT:    v_readfirstlane_b32 s4, v0
 ; GFX12-TRUE16-NEXT:    v_readfirstlane_b32 s5, v1
@@ -4317,29 +4317,30 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset__waterfall__amdgpu
 ; GFX12-TRUE16-NEXT:    s_wait_alu 0xfffe
 ; GFX12-TRUE16-NEXT:    s_and_saveexec_b32 s0, s0
 ; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x0
-; GFX12-TRUE16-NEXT:    buffer_load_b32 v6, v9, s[4:7], null offen
+; GFX12-TRUE16-NEXT:    buffer_load_b32 v4, v9, s[4:7], null offen
 ; GFX12-TRUE16-NEXT:    s_xor_b32 exec_lo, exec_lo, s0
 ; GFX12-TRUE16-NEXT:    s_cbranch_execnz .LBB15_1
 ; GFX12-TRUE16-NEXT:  ; %bb.2:
 ; GFX12-TRUE16-NEXT:    s_mov_b32 exec_lo, s1
+; GFX12-TRUE16-NEXT:    v_mov_b16_e32 v7.l, v5.l
 ; GFX12-TRUE16-NEXT:    s_mov_b32 s1, 0
 ; GFX12-TRUE16-NEXT:  .LBB15_3: ; %atomicrmw.start
 ; GFX12-TRUE16-NEXT:    ; =>This Loop Header: Depth=1
 ; GFX12-TRUE16-NEXT:    ; Child Loop BB15_4 Depth 2
 ; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x0
-; GFX12-TRUE16-NEXT:    v_mov_b32_e32 v8, v6
+; GFX12-TRUE16-NEXT:    v_mov_b32_e32 v6, v4
 ; GFX12-TRUE16-NEXT:    s_mov_b32 s2, exec_lo
 ; GFX12-TRUE16-NEXT:    s_wait_storecnt 0x0
 ; GFX12-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT:    v_lshrrev_b32_e32 v6, v4, v8
-; GFX12-TRUE16-NEXT:    v_add_f16_e32 v6.l, v6.l, v5.l
+; GFX12-TRUE16-NEXT:    v_lshrrev_b32_e32 v4, v8, v6
+; GFX12-TRUE16-NEXT:    v_add_f16_e32 v4.l, v4.l, v7.l
 ; GFX12-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT:    v_and_b32_e32 v6, 0xffff, v6
-; GFX12-TRUE16-NEXT:    v_lshlrev_b32_e32 v6, v4, v6
+; GFX12-TRUE16-NEXT:    v_and_b32_e32 v4, 0xffff, v4
+; GFX12-TRUE16-NEXT:    v_lshlrev_b32_e32 v4, v8, v4
 ; GFX12-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT:    v_and_or_b32 v7, v8, v10, v6
-; GFX12-TRUE16-NEXT:    v_mov_b32_e32 v6, v7
-; GFX12-TRUE16-NEXT:    v_mov_b32_e32 v7, v8
+; GFX12-TRUE16-NEXT:    v_and_or_b32 v5, v6, v10, v4
+; GFX12-TRUE16-NEXT:    v_mov_b32_e32 v4, v5
+; GFX12-TRUE16-NEXT:    v_mov_b32_e32 v5, v6
 ; GFX12-TRUE16-NEXT:  .LBB15_4: ; Parent Loop BB15_3 Depth=1
 ; GFX12-TRUE16-NEXT:    ; => This Inner Loop Header: Depth=2
 ; GFX12-TRUE16-NEXT:    v_readfirstlane_b32 s4, v0
@@ -4354,13 +4355,13 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset__waterfall__amdgpu
 ; GFX12-TRUE16-NEXT:    s_wait_alu 0xfffe
 ; GFX12-TRUE16-NEXT:    s_and_saveexec_b32 s0, s0
 ; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x0
-; GFX12-TRUE16-NEXT:    buffer_atomic_cmpswap_b32 v[6:7], v9, s[4:7], null offen th:TH_ATOMIC_RETURN
+; GFX12-TRUE16-NEXT:    buffer_atomic_cmpswap_b32 v[4:5], v9, s[4:7], null offen th:TH_ATOMIC_RETURN
 ; GFX12-TRUE16-NEXT:    s_xor_b32 exec_lo, exec_lo, s0
 ; GFX12-TRUE16-NEXT:    s_cbranch_execnz .LBB15_4
 ; GFX12-TRUE16-NEXT:  ; %bb.5: ; in Loop: Header=BB15_3 Depth=1
 ; GFX12-TRUE16-NEXT:    s_mov_b32 exec_lo, s2
 ; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x0
-; GFX12-TRUE16-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v6, v8
+; GFX12-TRUE16-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v4, v6
 ; GFX12-TRUE16-NEXT:    global_inv scope:SCOPE_DEV
 ; GFX12-TRUE16-NEXT:    s_or_b32 s1, vcc_lo, s1
 ; GFX12-TRUE16-NEXT:    s_wait_alu 0xfffe
@@ -4368,7 +4369,7 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset__waterfall__amdgpu
 ; GFX12-TRUE16-NEXT:    s_cbranch_execnz .LBB15_3
 ; GFX12-TRUE16-NEXT:  ; %bb.6: ; %atomicrmw.end
 ; GFX12-TRUE16-NEXT:    s_or_b32 exec_lo, exec_lo, s1
-; GFX12-TRUE16-NEXT:    v_lshrrev_b32_e32 v0, v4, v6
+; GFX12-TRUE16-NEXT:    v_lshrrev_b32_e32 v0, v8, v4
 ; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x0
 ; GFX12-TRUE16-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -4526,16 +4527,16 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset__waterfall__amdgpu
 ; GFX11-TRUE16-LABEL: buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset__waterfall__amdgpu_no_fine_grained_memory:
 ; GFX11-TRUE16:       ; %bb.0:
 ; GFX11-TRUE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT:    v_add_nc_u32_e32 v6, 0x200, v4
+; GFX11-TRUE16-NEXT:    v_add_nc_u32_e32 v4, 0x200, v4
 ; GFX11-TRUE16-NEXT:    s_mov_b32 s1, 0
 ; GFX11-TRUE16-NEXT:    s_mov_b32 s2, exec_lo
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-TRUE16-NEXT:    v_and_b32_e32 v4, 3, v6
-; GFX11-TRUE16-NEXT:    v_and_b32_e32 v9, -4, v6
-; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v4, 3, v4
+; GFX11-TRUE16-NEXT:    v_and_b32_e32 v6, 3, v4
+; GFX11-TRUE16-NEXT:    v_and_b32_e32 v9, -4, v4
+; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v8, 3, v6
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e64 v7, v4, 0xffff
-; GFX11-TRUE16-NEXT:    v_not_b32_e32 v10, v7
+; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e64 v6, v8, 0xffff
+; GFX11-TRUE16-NEXT:    v_not_b32_e32 v10, v6
 ; GFX11-TRUE16-NEXT:  .LBB15_1: ; =>This Inner Loop Header: Depth=1
 ; GFX11-TRUE16-NEXT:    v_readfirstlane_b32 s4, v0
 ; GFX11-TRUE16-NEXT:    v_readfirstlane_b32 s5, v1
@@ -4547,29 +4548,30 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset__waterfall__amdgpu
 ; GFX11-TRUE16-NEXT:    s_and_b32 s0, vcc_lo, s0
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX11-TRUE16-NEXT:    s_and_saveexec_b32 s0, s0
-; GFX11-TRUE16-NEXT:    buffer_load_b32 v6, v9, s[4:7], 0 offen
+; GFX11-TRUE16-NEXT:    buffer_load_b32 v4, v9, s[4:7], 0 offen
 ; GFX11-TRUE16-NEXT:    s_xor_b32 exec_lo, exec_lo, s0
 ; GFX11-TRUE16-NEXT:    s_cbranch_execnz .LBB15_1
 ; GFX11-TRUE16-NEXT:  ; %bb.2:
 ; GFX11-TRUE16-NEXT:    s_mov_b32 exec_lo, s2
+; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v7.l, v5.l
 ; GFX11-TRUE16-NEXT:    .p2align 6
 ; GFX11-TRUE16-NEXT:  .LBB15_3: ; %atomicrmw.start
 ; GFX11-TRUE16-NEXT:    ; =>This Loop Header: Depth=1
 ; GFX11-TRUE16-NEXT:    ; Child Loop BB15_4 Depth 2
 ; GFX11-TRUE16-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT:    v_mov_b32_e32 v8, v6
+; GFX11-TRUE16-NEXT:    v_mov_b32_e32 v6, v4
 ; GFX11-TRUE16-NEXT:    s_mov_b32 s2, exec_lo
 ; GFX11-TRUE16-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v6, v4, v8
-; GFX11-TRUE16-NEXT:    v_add_f16_e32 v6.l, v6.l, v5.l
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v4, v8, v6
+; GFX11-TRUE16-NEXT:    v_add_f16_e32 v4.l, v4.l, v7.l
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT:    v_and_b32_e32 v6, 0xffff, v6
-; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v6, v4, v6
+; GFX11-TRUE16-NEXT:    v_and_b32_e32 v4, 0xffff, v4
+; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v4, v8, v4
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT:    v_and_or_b32 v7, v8, v10, v6
-; GFX11-TRUE16-NEXT:    v_mov_b32_e32 v6, v7
-; GFX11-TRUE16-NEXT:    v_mov_b32_e32 v7, v8
+; GFX11-TRUE16-NEXT:    v_and_or_b32 v5, v6, v10, v4
+; GFX11-TRUE16-NEXT:    v_mov_b32_e32 v4, v5
+; GFX11-TRUE16-NEXT:    v_mov_b32_e32 v5, v6
 ; GFX11-TRUE16-NEXT:  .LBB15_4: ; Parent Loop BB15_3 Depth=1
 ; GFX11-TRUE16-NEXT:    ; => This Inner Loop Header: Depth=2
 ; GFX11-TRUE16-NEXT:    v_readfirstlane_b32 s4, v0
@@ -4583,13 +4585,13 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset__waterfall__amdgpu
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX11-TRUE16-NEXT:    s_and_saveexec_b32 s0, s0
 ; GFX11-TRUE16-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT:    buffer_atomic_cmpswap_b32 v[6:7], v9, s[4:7], 0 offen glc
+; GFX11-TRUE16-NEXT:    buffer_atomic_cmpswap_b32 v[4:5], v9, s[4:7], 0 offen glc
 ; GFX11-TRUE16-NEXT:    s_xor_b32 exec_lo, exec_lo, s0
 ; GFX11-TRUE16-NEXT:    s_cbranch_execnz .LBB15_4
 ; GFX11-TRUE16-NEXT:  ; %bb.5: ; in Loop: Header=BB15_3 Depth=1
 ; GFX11-TRUE16-NEXT:    s_mov_b32 exec_lo, s2
 ; GFX11-TRUE16-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v6, v8
+; GFX11-TRUE16-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v4, v6
 ; GFX11-TRUE16-NEXT:    buffer_gl1_inv
 ; GFX11-TRUE16-NEXT:    buffer_gl0_inv
 ; GFX11-TRUE16-NEXT:    s_or_b32 s1, vcc_lo, s1
@@ -4598,7 +4600,7 @@ define half @buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset__waterfall__amdgpu
 ; GFX11-TRUE16-NEXT:    s_cbranch_execnz .LBB15_3
 ; GFX11-TRUE16-NEXT:  ; %bb.6: ; %atomicrmw.end
 ; GFX11-TRUE16-NEXT:    s_or_b32 exec_lo, exec_lo, s1
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v0, v4, v6
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v0, v8, v4
 ; GFX11-TRUE16-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11-FAKE16-LABEL: buffer_fat_ptr_agent_atomic_fadd_ret_f16__offset__waterfall__amdgpu_no_fine_grained_memory:
diff --git a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll
index 3ad1e5c0b81e0..e0a81187aeed1 100644
--- a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll
+++ b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll
@@ -3417,11 +3417,11 @@ define half @buffer_fat_ptr_agent_atomic_fmax_ret_f16__offset__waterfall__amdgpu
 ; GFX12-TRUE16-NEXT:    s_mov_b32 s1, exec_lo
 ; GFX12-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
 ; GFX12-TRUE16-NEXT:    v_and_b32_e32 v6, 3, v4
-; GFX12-TRUE16-NEXT:    v_and_b32_e32 v10, -4, v4
-; GFX12-TRUE16-NEXT:    v_lshlrev_b32_e32 v9, 3, v6
+; GFX12-TRUE16-NEXT:    v_and_b32_e32 v9, -4, v4
+; GFX12-TRUE16-NEXT:    v_lshlrev_b32_e32 v8, 3, v6
 ; GFX12-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT:    v_lshlrev_b32_e64 v6, v9, 0xffff
-; GFX12-TRUE16-NEXT:    v_not_b32_e32 v11, v6
+; GFX12-TRUE16-NEXT:    v_lshlrev_b32_e64 v6, v8, 0xffff
+; GFX12-TRUE16-NEXT:    v_not_b32_e32 v10, v6
 ; GFX12-TRUE16-NEXT:  .LBB12_1: ; =>This Inner Loop Header: Depth=1
 ; GFX12-TRUE16-NEXT:    v_readfirstlane_b32 s4, v0
 ; GFX12-TRUE16-NEXT:    v_readfirstlane_b32 s5, v1
@@ -3435,32 +3435,32 @@ define half @buffer_fat_ptr_agent_atomic_fmax_ret_f16__offset__waterfall__amdgpu
 ; GFX12-TRUE16-NEXT:    s_wait_alu 0xfffe
 ; GFX12-TRUE16-NEXT:    s_and_saveexec_b32 s0, s0
 ; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x0
-; GFX12-TRUE16-NEXT:    buffer_load_b32 v6, v10, s[4:7], null offen
+; GFX12-TRUE16-NEXT:    buffer_load_b32 v4, v9, s[4:7], null offen
 ; GFX12-TRUE16-NEXT:    s_xor_b32 exec_lo, exec_lo, s0
 ; GFX12-TRUE16-NEXT:    s_cbranch_execnz .LBB12_1
 ; GFX12-TRUE16-NEXT:  ; %bb.2:
 ; GFX12-TRUE16-NEXT:    s_mov_b32 exec_lo, s1
-; GFX12-TRUE16-NEXT:    v_max_num_f16_e32 v4.l, v5.l, v5.l
+; GFX12-TRUE16-NEXT:    v_max_num_f16_e32 v7.l, v5.l, v5.l
 ; GFX12-TRUE16-NEXT:    s_mov_b32 s1, 0
 ; GFX12-TRUE16-NEXT:  .LBB12_3: ; %atomicrmw.start
 ; GFX12-TRUE16-NEXT:    ; =>This Loop Header: Depth=1
 ; GFX12-TRUE16-NEXT:    ; Child Loop BB12_4 Depth 2
 ; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x0
-; GFX12-TRUE16-NEXT:    v_mov_b32_e32 v8, v6
+; GFX12-TRUE16-NEXT:    v_mov_b32_e32 v6, v4
 ; GFX12-TRUE16-NEXT:    s_mov_b32 s2, exec_lo
 ; GFX12-TRUE16-NEXT:    s_wait_storecnt 0x0
 ; GFX12-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT:    v_lshrrev_b32_e32 v5, v9, v8
-; GFX12-TRUE16-NEXT:    v_max_num_f16_e32 v4.h, v5.l, v5.l
+; GFX12-TRUE16-NEXT:    v_lshrrev_b32_e32 v4, v8, v6
+; GFX12-TRUE16-NEXT:    v_max_num_f16_e32 v4.l, v4.l, v4.l
 ; GFX12-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT:    v_max_num_f16_e32 v5.l, v4.h, v4.l
-; GFX12-TRUE16-NEXT:    v_and_b32_e32 v5, 0xffff, v5
+; GFX12-TRUE16-NEXT:    v_max_num_f16_e32 v4.l, v4.l, v7.l
+; GFX12-TRUE16-NEXT:    v_and_b32_e32 v4, 0xffff, v4
 ; GFX12-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-TRUE16-NEXT:    v_lshlrev_b32_e32 v5, v9, v5
-; GFX12-TRUE16-NEXT:    v_and_or_b32 v7, v8, v11, v5
+; GFX12-TRUE16-NEXT:    v_lshlrev_b32_e32 v4, v8, v4
+; GFX12-TRUE16-NEXT:    v_and_or_b32 v5, v6, v10, v4
 ; GFX12-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX12-TRUE16-NEXT:    v_mov_b32_e32 v6, v7
-; GFX12-TRUE16-NEXT:    v_mov_b32_e32 v7, v8
+; GFX12-TRUE16-NEXT:    v_mov_b32_e32 v4, v5
+; GFX12-TRUE16-NEXT:    v_mov_b32_e32 v5, v6
 ; GFX12-TRUE16-NEXT:  .LBB12_4: ; Parent Loop BB12_3 Depth=1
 ; GFX12-TRUE16-NEXT:    ; => This Inner Loop Header: Depth=2
 ; GFX12-TRUE16-NEXT:    v_readfirstlane_b32 s4, v0
@@ -3475,13 +3475,13 @@ define half @buffer_fat_ptr_agent_atomic_fmax_ret_f16__offset__waterfall__amdgpu
 ; GFX12-TRUE16-NEXT:    s_wait_alu 0xfffe
 ; GFX12-TRUE16-NEXT:    s_and_saveexec_b32 s0, s0
 ; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x0
-; GFX12-TRUE16-NEXT:    buffer_atomic_cmpswap_b32 v[6:7], v10, s[4:7], null offen th:TH_ATOMIC_RETURN
+; GFX12-TRUE16-NEXT:    buffer_atomic_cmpswap_b32 v[4:5], v9, s[4:7], null offen th:TH_ATOMIC_RETURN
 ; GFX12-TRUE16-NEXT:    s_xor_b32 exec_lo, exec_lo, s0
 ; GFX12-TRUE16-NEXT:    s_cbranch_execnz .LBB12_4
 ; GFX12-TRUE16-NEXT:  ; %bb.5: ; in Loop: Header=BB12_3 Depth=1
 ; GFX12-TRUE16-NEXT:    s_mov_b32 exec_lo, s2
 ; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x0
-; GFX12-TRUE16-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v6, v8
+; GFX12-TRUE16-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v4, v6
 ; GFX12-TRUE16-NEXT:    global_inv scope:SCOPE_DEV
 ; GFX12-TRUE16-NEXT:    s_or_b32 s1, vcc_lo, s1
 ; GFX12-TRUE16-NEXT:    s_wait_alu 0xfffe
@@ -3489,7 +3489,7 @@ define half @buffer_fat_ptr_agent_atomic_fmax_ret_f16__offset__waterfall__amdgpu
 ; GFX12-TRUE16-NEXT:    s_cbranch_execnz .LBB12_3
 ; GFX12-TRUE16-NEXT:  ; %bb.6: ; %atomicrmw.end
 ; GFX12-TRUE16-NEXT:    s_or_b32 exec_lo, exec_lo, s1
-; GFX12-TRUE16-NEXT:    v_lshrrev_b32_e32 v0, v9, v6
+; GFX12-TRUE16-NEXT:    v_lshrrev_b32_e32 v0, v8, v4
 ; GFX12-TRUE16-NEXT:    s_wait_loadcnt 0x0
 ; GFX12-TRUE16-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -3657,11 +3657,11 @@ define half @buffer_fat_ptr_agent_atomic_fmax_ret_f16__offset__waterfall__amdgpu
 ; GFX11-TRUE16-NEXT:    s_mov_b32 s2, exec_lo
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
 ; GFX11-TRUE16-NEXT:    v_and_b32_e32 v6, 3, v4
-; GFX11-TRUE16-NEXT:    v_and_b32_e32 v10, -4, v4
-; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v9, 3, v6
+; GFX11-TRUE16-NEXT:    v_and_b32_e32 v9, -4, v4
+; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e32 v8, 3, v6
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e64 v6, v9, 0xffff
-; GFX11-TRUE16-NEXT:    v_not_b32_e32 v11, v6
+; GFX11-TRUE16-NEXT:    v_lshlrev_b32_e64 v6, v8, 0xffff
+; GFX11-TRUE16-NEXT:    v_not_b32_e32 v10, v6
 ; GFX11-TRUE16-NEXT:  .LBB12_1: ; =>This Inner Loop Header: Depth=1
 ; GFX11-TRUE16-NEXT:    v_readfirstlane_b32 s4, v0
 ; GFX11-TRUE16-NEXT:    v_readfirstlane_b32 s5, v1
@@ -3673,32 +3673,32 @@ define half @buffer_fat_ptr_agent_atomic_fmax_ret_f16__offset__waterfall__amdgpu
 ; GFX11-TRUE16-NEXT:    s_and_b32 s0, vcc_lo, s0
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX11-TRUE16-NEXT:    s_and_saveexec_b32 s0, s0
-; GFX11-TRUE16-NEXT:    buffer_load_b32 v6, v10, s[4:7], 0 offen
+; GFX11-TRUE16-NEXT:    buffer_load_b32 v4, v9, s[4:7], 0 offen
 ; GFX11-TRUE16-NEXT:    s_xor_b32 exec_lo, exec_lo, s0
 ; GFX11-TRUE16-NEXT:    s_cbranch_execnz .LBB12_1
 ; GFX11-TRUE16-NEXT:  ; %bb.2:
 ; GFX11-TRUE16-NEXT:    s_mov_b32 exec_lo, s2
-; GFX11-TRUE16-NEXT:    v_max_f16_e32 v4.l, v5.l, v5.l
+; GFX11-TRUE16-NEXT:    v_max_f16_e32 v7.l, v5.l, v5.l
 ; GFX11-TRUE16-NEXT:    .p2align 6
 ; GFX11-TRUE16-NEXT:  .LBB12_3: ; %atomicrmw.start
 ; GFX11-TRUE16-NEXT:    ; =>This Loop Header: Depth=1
 ; GFX11-TRUE16-NEXT:    ; Child Loop BB12_4 Depth 2
 ; GFX11-TRUE16-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-TRUE16-NEXT:    v_mov_b32_e32 v8, v6
+; GFX11-TRUE16-NEXT:    v_mov_b32_e32 v6, v4
 ; GFX11-TRUE16-NEXT:    s_mov_b32 s2, exec_lo
 ; GFX11-TRUE16-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v5, v9, v8
-; GFX11-TRUE16-NEXT:    v_max_f16_e32 v4.h, v5.l, v5.l
+; GFX11-TRUE16-NEXT:    v_lshrrev_b32_e32 v4, v8, v6
+; GFX11-TRUE16-NEXT:    v_max_f16_e32 v4.l, v4.l, v4.l
 ; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-TRUE16-NEXT:    v_max_f16_e32 v5.l, v4.h, v4.l
-; GFX11-TRUE...
[truncated]

@@ -1,4 +1,4 @@
# RUN: not llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs -start-before=greedy,1 -stop-after=virtregrewriter,2 %s -o /dev/null 2>&1 | FileCheck -check-prefix=ERR %s
# RUN: not llc -mtriple=amdgcn -mcpu=gfx908 -start-before=greedy,1 -stop-after=virtregrewriter,2 %s -o /dev/null 2>&1 | FileCheck -check-prefix=ERR %s
@jrbyrnes (Contributor, Author) commented:

I've disabled the verifier on this and a couple of other tests. It seems the changes to allocation have triggered a latent issue: if RA fails and we enter the error state, we call cleanupFailedVReg to do a sort of rewrite + liveness workaround. This adds some undefs to split the live range of the problematic vreg and its assignments. The effect is to put the regunit LiveRanges in a state that is inconsistent with the other RA structures. I'm seeing a spill of a PhysReg inserted at a point where the RegUnit LiveRange thinks the register is already dead, so the RegUnit LiveRange doesn't model the liveness of the spill and the verifier complains. I'll keep looking into it for a bit, but since 1. it is independent of this PR, 2. it is very contrived and not hit in internal tests, and 3. it is not the original error of the test case, I think it may make sense to address it separately.

See also:

regalloc-illegal-eviction-assert.ll
register-killed-error-after-alloc-failure1.ll

Contributor:

This doesn't disable the verifier, it just moves the failure to be EXPENSIVE_CHECKS builds only

Contributor:

Need to address this before proceeding in any case

Contributor:

I had various more complex versions of the error-handling patch to handle split virtregs, but they turned out to be unnecessary at the time. It might help to look into those again (i.e., the versions that used splitSeparateComponents in various places).

@jrbyrnes jrbyrnes requested a review from srpande July 1, 2025 22:53
@jayfoad (Contributor) commented Jul 2, 2025

> Previously, each RC with a distinct size had a distinct priority; this PR instead maps multiple sizes to the same priority. In practice this currently has no effect on prioritization, because only one actually defined RC per vector register type falls into each group.

So why does this patch affect codegen for GFX11 and GFX12, which don't have AGPRs?

jrbyrnes added 3 commits July 2, 2025 14:17
Change-Id: I2c56c9148c68fe820895d6eabb0a9058d04d7b4d
Change-Id: I7df86a09024b78c48e728119d42c2d3d812bbebd
Change-Id: Iaaa9a2317a2dc67f35d8664735c0dd9b2056314f
@jrbyrnes (Contributor, Author) commented Jul 2, 2025

> So why does this patch affect codegen for GFX11 and GFX12, which don't have AGPRs?

Thanks for catching this! I had missed the VGPR32 base and VGPR16 classes.
