[AMDGPU] Skip register uses in AMDGPUResourceUsageAnalysis #133242

rovka · 2025-03-27T12:27:59Z

Don't count register uses when determining the maximum number of
registers used by a function. Count only the defs. This is really an
underestimate of the true register usage, but in practice that's not
a problem because if a function uses a register, then it has either
defined it earlier, or some other function that executed before has
defined it.

In particular, the register counts are used:

1. When launching an entry function - in which case we're safe because
   the register counts of the entry function will include the register
   counts of all callees.
2. At function boundaries in dynamic VGPR mode. In this case it's safe
   because whenever we set the new VGPR allocation we take into account
   the outgoing_vgpr_count set by the middle-end.
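
The entry-function case can be illustrated with a small toy model. This is only a sketch: it assumes the aggregation is a plain maximum over the call graph, and `ToyFunction`, `ToyCallGraph`, `totalVGPRs` and `makeExampleGraph` are hypothetical names, not LLVM APIs.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model of a function's resource info; not an LLVM class.
struct ToyFunction {
  int OwnVGPRs;                     // VGPRs counted from the body (defs only)
  std::vector<std::string> Callees; // direct callees, by name
};

using ToyCallGraph = std::map<std::string, ToyFunction>;

// Depth-first maximum over everything reachable from Name. A function that
// was already visited contributes 0 here because its count has already been
// folded into the running maximum.
int totalVGPRsImpl(const ToyCallGraph &CG, const std::string &Name,
                   std::set<std::string> &Visited) {
  auto It = CG.find(Name);
  assert(It != CG.end() && "unknown function");
  if (!Visited.insert(Name).second)
    return 0;
  int Total = It->second.OwnVGPRs;
  for (const std::string &Callee : It->second.Callees)
    Total = std::max(Total, totalVGPRsImpl(CG, Callee, Visited));
  return Total;
}

int totalVGPRs(const ToyCallGraph &CG, const std::string &Name) {
  std::set<std::string> Visited;
  return totalVGPRsImpl(CG, Name, Visited);
}

// Hypothetical program: "writer" defines registers up to v31, while "reader"
// only reads v31 and defines v0 and v1, so its defs-only count is 2.
ToyCallGraph makeExampleGraph() {
  ToyCallGraph CG;
  CG["entry"] = ToyFunction{4, {"writer", "reader"}};
  CG["writer"] = ToyFunction{32, {}};
  CG["reader"] = ToyFunction{2, {}};
  return CG;
}
```

In this model `reader` under-reports on its own, but the count for `entry` still covers v31 because `writer` defines it.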

The main advantage of doing this is that the artificial VGPR arguments
used only for preserving the inactive lanes when using the
llvm.amdgcn.init.whole.wave intrinsic are no longer counted. This
enables us to allocate only the registers we need in dynamic VGPR mode.
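
As a rough illustration of the difference between counting defs only and counting defs plus uses, here is a self-contained toy model. It does not use the real MachineInstr/MachineOperand classes. `ToyOperand`, `ToyInstr`, `countVGPRs` and the register layout in `makeChainFunction` are assumptions made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy stand-ins for MachineOperand / MachineInstr; not the LLVM classes.
struct ToyOperand {
  int VGPRIndex; // index N of register vN
  bool IsDef;    // true for a def (write), false for a use (read)
};

struct ToyInstr {
  std::vector<ToyOperand> Operands;
};

// Returns the highest counted VGPR index + 1, or 0 if nothing was counted.
int countVGPRs(const std::vector<ToyInstr> &Function, bool CountUses) {
  int MaxVGPR = -1; // -1 means "no register seen yet"
  for (const ToyInstr &MI : Function) {
    for (const ToyOperand &MO : MI.Operands) {
      assert(MO.VGPRIndex >= 0 && "register index must be non-negative");
      if (!MO.IsDef && !CountUses)
        continue; // defs-only mode: reads do not contribute
      MaxVGPR = std::max(MaxVGPR, MO.VGPRIndex);
    }
  }
  return MaxVGPR + 1;
}

// Hypothetical chain function: the real work defines v8 and v9, while
// v10..v19 are only read by the final chain call.
std::vector<ToyInstr> makeChainFunction() {
  std::vector<ToyInstr> F;
  F.push_back(ToyInstr{{ToyOperand{8, true}, ToyOperand{9, false}}});
  F.push_back(ToyInstr{{ToyOperand{9, true}, ToyOperand{8, false}}});
  ToyInstr ChainCall;
  for (int R = 8; R <= 19; ++R)
    ChainCall.Operands.push_back(ToyOperand{R, false});
  F.push_back(ChainCall);
  return F;
}
```

With uses counted, the toy function reports 20 registers. With defs only, it reports 10, because the registers that are merely passed through to the chain call no longer contribute.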

When using the amdgcn.init.whole.wave intrinsic, we add dummy VGPR arguments with the purpose of preserving their inactive lanes. The pattern may look something like this:

```
entry:
  call amdgcn.init.whole.wave
  branch to shader or tail

shader:
  $vInactive = IMPLICIT_DEF ; Tells regalloc it's safe to use the active lanes
  actual code...

tail:
  call amdgcn.cs.chain [...], implicit $vInactive
```

We should not report these VGPRs in the .vgpr_count metadata. This patch achieves that goal by ignoring IMPLICIT_DEFs and SI_TCRETURNs in functions that use the amdgcn.init.whole.wave intrinsic. All other VGPRs are counted as usual.

llvmbot · 2025-03-27T12:28:23Z

@llvm/pr-subscribers-backend-amdgpu

Author: Diana Picus (rovka)

Changes

When using the amdgcn.init.whole.wave intrinsic, we add dummy VGPR arguments with the purpose of preserving their inactive lanes. The pattern may look something like this:

entry:
  call amdgcn.init.whole.wave
  branch to shader or tail

shader:
  $vInactive = IMPLICIT_DEF ; Tells regalloc it's safe to use the active lanes
  actual code...

tail:
  call amdgcn.cs.chain [...], implicit $vInactive

We should not report these VGPRs in the .vgpr_count metadata. This patch achieves that goal by ignoring IMPLICIT_DEFs and SI_TCRETURNs in functions that use the amdgcn.init.whole.wave intrinsic. All other VGPRs are counted as usual.
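
The filtering rule described above (the first revision of this patch) can be sketched with a toy model. This is not the code from AMDGPUResourceUsageAnalysis. `ToyOpcode`, `IWWInstr`, `countVGPRsFirstRevision` and the register layout in `makeIWWFunction` are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy stand-ins; these are not the LLVM classes or opcodes.
enum class ToyOpcode { Other, ImplicitDef, ChainCall };

struct IWWInstr {
  ToyOpcode Opcode;
  std::vector<int> VGPRs; // indices N of the vN operands
};

// Highest counted VGPR index + 1. In functions that use init.whole.wave,
// VGPR operands of IMPLICIT_DEF and of the chain call are skipped.
int countVGPRsFirstRevision(const std::vector<IWWInstr> &Function,
                            bool UsesInitWholeWave) {
  int MaxVGPR = -1;
  for (const IWWInstr &MI : Function) {
    bool Skippable = MI.Opcode == ToyOpcode::ImplicitDef ||
                     MI.Opcode == ToyOpcode::ChainCall;
    if (UsesInitWholeWave && Skippable)
      continue;
    for (int Reg : MI.VGPRs) {
      assert(Reg >= 0 && "register index must be non-negative");
      MaxVGPR = std::max(MaxVGPR, Reg);
    }
  }
  return MaxVGPR + 1;
}

// Hypothetical function: v10..v19 are dummy arguments, the real code touches
// v8 and v9, and optionally one extra register (ExtraVGPR >= 0).
std::vector<IWWInstr> makeIWWFunction(int ExtraVGPR = -1) {
  std::vector<IWWInstr> F;
  for (int R = 10; R <= 19; ++R)
    F.push_back(IWWInstr{ToyOpcode::ImplicitDef, {R}});
  F.push_back(IWWInstr{ToyOpcode::Other, {8, 9}});
  if (ExtraVGPR >= 0)
    F.push_back(IWWInstr{ToyOpcode::Other, {ExtraVGPR}});
  std::vector<int> All;
  for (int R = 8; R <= 19; ++R)
    All.push_back(R);
  F.push_back(IWWInstr{ToyOpcode::ChainCall, All});
  return F;
}
```

If ordinary code touches one of the dummy registers, that register is counted again. The tests in the patch use the same highest-index-plus-one arithmetic: a clobber of v12 is checked as 0xd and a clobber of v28 as 0x1d.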

Patch is 24.60 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/133242.diff

6 Files Affected:

(modified) llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp (+4-1)
(modified) llvm/lib/Target/AMDGPU/AMDGPUResourceUsageAnalysis.cpp (+16)
(added) llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-large.ll (+76)
(added) llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-leaf.ll (+50)
(added) llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-use-inactive.ll (+78)
(added) llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count.ll (+75)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
index 800e2b9c0e657..7769bc5d74ebd 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
@@ -990,7 +990,10 @@ void AMDGPUAsmPrinter::getSIProgramInfo(SIProgramInfo &ProgInfo,
   // dispatch registers are function args.
   unsigned WaveDispatchNumSGPR = 0, WaveDispatchNumVGPR = 0;
 
-  if (isShader(F.getCallingConv())) {
+  // Shaders that use the init.whole.wave intrinsic sometimes have VGPR
+  // arguments that are only added for the purpose of preserving their inactive
+  // lanes. Skip including them in the VGPR count.
+  if (isShader(F.getCallingConv()) && !MFI->hasInitWholeWave()) {
     bool IsPixelShader =
         F.getCallingConv() == CallingConv::AMDGPU_PS && !STM.isAmdHsaOS();
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUResourceUsageAnalysis.cpp b/llvm/lib/Target/AMDGPU/AMDGPUResourceUsageAnalysis.cpp
index 9a609a1752de0..05d1aa38d4a25 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUResourceUsageAnalysis.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUResourceUsageAnalysis.cpp
@@ -156,8 +156,14 @@ AMDGPUResourceUsageAnalysis::analyzeResourceUsage(
   int32_t MaxSGPR = -1;
   Info.CalleeSegmentSize = 0;
 
+  bool IsIWWFunction = MFI->hasInitWholeWave();
+
   for (const MachineBasicBlock &MBB : MF) {
     for (const MachineInstr &MI : MBB) {
+      // At this point, the chain call pseudos are already expanded.
+      bool IsChainCall = MI.getOpcode() == AMDGPU::SI_TCRETURN;
+      bool IsImplicitDef = MI.isImplicitDef();
+
       // TODO: Check regmasks? Do they occur anywhere except calls?
       for (const MachineOperand &MO : MI.operands()) {
         unsigned Width = 0;
@@ -239,6 +245,16 @@ AMDGPUResourceUsageAnalysis::analyzeResourceUsage(
           break;
         }
 
+        // For functions that use the llvm.amdgcn.init.whole.wave intrinsic, we
+        // often add artificial VGPR arguments for the purpose of preserving
+        // their inactive lanes. These should not be reported as part of our
+        // VGPR usage. We can identify them easily because they're only used in
+        // the chain call, and possibly in an IMPLICIT_DEF coming from an
+        // llvm.amdgcn.dead intrinsic.
+        if (IsIWWFunction && (IsChainCall || IsImplicitDef) &&
+            TRI.isVectorRegister(MRI, Reg))
+          continue;
+
         if (AMDGPU::SGPR_32RegClass.contains(Reg) ||
             AMDGPU::SGPR_LO16RegClass.contains(Reg) ||
             AMDGPU::SGPR_HI16RegClass.contains(Reg)) {
diff --git a/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-large.ll b/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-large.ll
new file mode 100644
index 0000000000000..e47f5e25ead3a
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-large.ll
@@ -0,0 +1,76 @@
+; RUN: llc -mtriple=amdgcn--amdpal -mcpu=gfx1200 < %s | FileCheck %s
+
+; CHECK-LABEL: .shader_functions:
+
+; Use VGPRs above the input arguments.
+; CHECK-LABEL: _miss_1:
+; CHECK: .vgpr_count:{{.*}}0x1d{{$}}
+
+define amdgpu_cs_chain void @_miss_1(ptr inreg %next.callee, i32 inreg %global.table, i32 inreg %max.outgoing.vgpr.count,
+                                    i32 %vcr, { i32 } %system.data,
+                                    i32 %inactive.vgpr, i32 %inactive.vgpr1, i32 %inactive.vgpr2, i32 %inactive.vgpr3,
+                                    i32 %inactive.vgpr4, i32 %inactive.vgpr5, i32 %inactive.vgpr6, i32 %inactive.vgpr7,
+                                    i32 %inactive.vgpr8, i32 %inactive.vgpr9)
+                                    local_unnamed_addr {
+entry:
+  %system.data.value = extractvalue { i32 } %system.data, 0
+  %dead.val = call i32 @llvm.amdgcn.dead.i32()
+  %is.whole.wave = call i1 @llvm.amdgcn.init.whole.wave()
+  br i1 %is.whole.wave, label %shader, label %tail
+
+shader:
+  %system.data.extract = extractvalue { i32 } %system.data, 0
+  %data.mul = mul i32 %system.data.extract, 2
+  %data.add = add i32 %data.mul, 1
+  call void asm sideeffect "; clobber v28", "~{v28}"()
+  br label %tail
+
+tail:
+  %final.vcr = phi i32 [ %vcr, %entry ], [ %data.mul, %shader ]
+  %final.sys.data = phi i32 [ %system.data.value, %entry ], [ %data.add, %shader ]
+  %final.inactive0 = phi i32 [ %inactive.vgpr, %entry ], [ %dead.val, %shader ]
+  %final.inactive1 = phi i32 [ %inactive.vgpr1, %entry ], [ %dead.val, %shader ]
+  %final.inactive2 = phi i32 [ %inactive.vgpr2, %entry ], [ %dead.val, %shader ]
+  %final.inactive3 = phi i32 [ %inactive.vgpr3, %entry ], [ %dead.val, %shader ]
+  %final.inactive4 = phi i32 [ %inactive.vgpr4, %entry ], [ %dead.val, %shader ]
+  %final.inactive5 = phi i32 [ %inactive.vgpr5, %entry ], [ %dead.val, %shader ]
+  %final.inactive6 = phi i32 [ %inactive.vgpr6, %entry ], [ %dead.val, %shader ]
+  %final.inactive7 = phi i32 [ %inactive.vgpr7, %entry ], [ %dead.val, %shader ]
+  %final.inactive8 = phi i32 [ %inactive.vgpr8, %entry ], [ %dead.val, %shader ]
+  %final.inactive9 = phi i32 [ %inactive.vgpr9, %entry ], [ %dead.val, %shader ]
+
+  %struct.init = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } poison, i32 %final.vcr, 0
+  %struct.with.data = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.init, i32 %final.sys.data, 1
+  %struct.with.inactive0 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.data, i32 %final.inactive0, 2
+  %struct.with.inactive1 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive0, i32 %final.inactive1, 3
+  %struct.with.inactive2 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive1, i32 %final.inactive2, 4
+  %struct.with.inactive3 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive2, i32 %final.inactive3, 5
+  %struct.with.inactive4 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive3, i32 %final.inactive4, 6
+  %struct.with.inactive5 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive4, i32 %final.inactive5, 7
+  %struct.with.inactive6 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive5, i32 %final.inactive6, 8
+  %struct.with.inactive7 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive6, i32 %final.inactive7, 9
+  %struct.with.inactive8 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive7, i32 %final.inactive8, 10
+  %final.struct = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive8, i32 %final.inactive9, 11
+
+  %vec.global = insertelement <4 x i32> poison, i32 %global.table, i64 0
+  %vec.max.vgpr = insertelement <4 x i32> %vec.global, i32 %max.outgoing.vgpr.count, i64 1
+  %vec.sys.data = insertelement <4 x i32> %vec.max.vgpr, i32 %final.sys.data, i64 2
+  %final.vec = insertelement <4 x i32> %vec.sys.data, i32 0, i64 3
+
+  call void (ptr, i32, <4 x i32>, { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 }, i32, ...)
+        @llvm.amdgcn.cs.chain.p0.i32.v4i32.sl_i32i32i32i32i32i32i32i32i32i32i32i32s(
+        ptr %next.callee, i32 0, <4 x i32> inreg %final.vec,
+        { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %final.struct,
+        i32 1, i32 %max.outgoing.vgpr.count, i32 -1, ptr @retry_vgpr_alloc.v4i32)
+  unreachable
+}
+
+declare i32 @llvm.amdgcn.dead.i32()
+declare i1 @llvm.amdgcn.init.whole.wave()
+declare void @llvm.amdgcn.cs.chain.p0.i32.v4i32.sl_i32i32i32i32i32i32i32i32i32i32i32i32s(ptr, i32, <4 x i32>, { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 }, i32 immarg, ...)
+
+declare amdgpu_cs_chain void @retry_vgpr_alloc.v4i32(<4 x i32> inreg)
+
+!amdgpu.pal.metadata.msgpack = !{!0}
+
+!0 = !{!"\82\B0amdpal.pipelines\91\8B\A4.api\A6Vulkan\B2.compute_registers\85\AB.tg_size_en\C3\AA.tgid_x_en\C3\AA.tgid_y_en\C3\AA.tgid_z_en\C3\AF.tidig_comp_cnt\00\B0.hardware_stages\81\A3.cs\8D\AF.checksum_value\00\AB.debug_mode\00\AB.float_mode\CC\C0\A9.image_op\C2\AC.mem_ordered\C3\AB.sgpr_limitj\B7.threadgroup_dimensions\93 \01\01\AD.trap_present\00\B2.user_data_reg_map\90\AB.user_sgprs\10\AB.vgpr_limit\CD\01\00\AF.wavefront_size \AF.wg_round_robin\C2\B7.internal_pipeline_hash\92\CF|{2&\DCC\85M\CFep\8A\EDR\DE\D6\E1\B1.shader_functions\81\A7_miss_1\82\B4.frontend_stack_size\00\B4.outgoing_vgpr_countP\A8.shaders\81\A8.compute\82\B0.api_shader_hash\92\00\00\B1.hardware_mapping\91\A3.cs\B0.spill_threshold\CD\FF\FF\A5.type\A2Cs\B0.user_data_limit\01\A9.uses_cps\C3\AF.xgl_cache_info\82\B3.128_bit_cache_hash\92\CF\B4\AF\9D\0B\07\88\03\02\CF\01o\C9\CAf?)\DA\AD.llpc_version\A476.0\AEamdpal.version\92\03\00"}
\ No newline at end of file
diff --git a/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-leaf.ll b/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-leaf.ll
new file mode 100644
index 0000000000000..5d7472fd3c56e
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-leaf.ll
@@ -0,0 +1,50 @@
+; RUN: llc -mtriple=amdgcn--amdpal -mcpu=gfx1200 < %s | FileCheck %s
+
+; CHECK-LABEL: .shader_functions:
+
+; Make sure that .vgpr_count doesn't include the %inactive.vgpr registers.
+; CHECK-LABEL: leaf_shader:
+; CHECK: .vgpr_count:{{.*}}0xc{{$}}
+
+; Function without calls.
+define amdgpu_cs_chain void @_leaf_shader(ptr %output.ptr, i32 inreg %input.value,
+                              i32 %active.vgpr1, i32 %active.vgpr2,
+                              i32 %inactive.vgpr1, i32 %inactive.vgpr2, i32 %inactive.vgpr3,
+                              i32 %inactive.vgpr4, i32 %inactive.vgpr5, i32 %inactive.vgpr6)
+                              local_unnamed_addr {
+entry:
+  %dead.val = call i32 @llvm.amdgcn.dead.i32()
+  %is.whole.wave = call i1 @llvm.amdgcn.init.whole.wave()
+  br i1 %is.whole.wave, label %compute, label %merge
+
+compute:
+  ; Perform a more complex computation using active VGPRs
+  %square = mul i32 %active.vgpr1, %active.vgpr1
+  %product = mul i32 %square, %active.vgpr2
+  %sum = add i32 %product, %input.value
+  %result = add i32 %sum, 42
+  br label %merge
+
+merge:
+  %final.result = phi i32 [ 0, %entry ], [ %result, %compute ]
+  %final.inactive1 = phi i32 [ %inactive.vgpr1, %entry ], [ %dead.val, %compute ]
+  %final.inactive2 = phi i32 [ %inactive.vgpr2, %entry ], [ %dead.val, %compute ]
+  %final.inactive3 = phi i32 [ %inactive.vgpr3, %entry ], [ %dead.val, %compute ]
+  %final.inactive4 = phi i32 [ %inactive.vgpr4, %entry ], [ %dead.val, %compute ]
+  %final.inactive5 = phi i32 [ %inactive.vgpr5, %entry ], [ %dead.val, %compute ]
+  %final.inactive6 = phi i32 [ %inactive.vgpr6, %entry ], [ %dead.val, %compute ]
+
+  store i32 %final.result, ptr %output.ptr, align 4
+
+  ret void
+}
+
+declare i32 @llvm.amdgcn.dead.i32()
+declare i1 @llvm.amdgcn.init.whole.wave()
+declare void @llvm.amdgcn.cs.chain.p0.i32.v4i32.sl_i32i32i32i32i32i32i32i32i32i32i32i32s(ptr, i32, <4 x i32>, { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 }, i32 immarg, ...)
+
+declare amdgpu_cs_chain void @retry_vgpr_alloc.v4i32(<4 x i32> inreg)
+
+!amdgpu.pal.metadata.msgpack = !{!0}
+
+!0 = !{!"\82\B0amdpal.pipelines\91\8B\A4.api\A6Vulkan\B2.compute_registers\85\AB.tg_size_en\C3\AA.tgid_x_en\C3\AA.tgid_y_en\C3\AA.tgid_z_en\C3\AF.tidig_comp_cnt\00\B0.hardware_stages\81\A3.cs\8D\AF.checksum_value\00\AB.debug_mode\00\AB.float_mode\CC\C0\A9.image_op\C2\AC.mem_ordered\C3\AB.sgpr_limitj\B7.threadgroup_dimensions\93 \01\01\AD.trap_present\00\B2.user_data_reg_map\90\AB.user_sgprs\10\AB.vgpr_limit\CD\01\00\AF.wavefront_size \AF.wg_round_robin\C2\B7.internal_pipeline_hash\92\CF|{2&\DCC\85M\CFep\8A\EDR\DE\D6\E1\B1.shader_functions\81\A7_miss_1\82\B4.frontend_stack_size\00\B4.outgoing_vgpr_countP\A8.shaders\81\A8.compute\82\B0.api_shader_hash\92\00\00\B1.hardware_mapping\91\A3.cs\B0.spill_threshold\CD\FF\FF\A5.type\A2Cs\B0.user_data_limit\01\A9.uses_cps\C3\AF.xgl_cache_info\82\B3.128_bit_cache_hash\92\CF\B4\AF\9D\0B\07\88\03\02\CF\01o\C9\CAf?)\DA\AD.llpc_version\A476.0\AEamdpal.version\92\03\00"}
diff --git a/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-use-inactive.ll b/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-use-inactive.ll
new file mode 100644
index 0000000000000..0c699a07cb3fd
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-use-inactive.ll
@@ -0,0 +1,78 @@
+; RUN: llc -mtriple=amdgcn--amdpal -mcpu=gfx1200 < %s | FileCheck %s
+
+; CHECK-LABEL: .shader_functions:
+
+; Make sure that .vgpr_count doesn't include the %inactive.vgpr registers.
+; The shader is free to use any of the VGPRs mapped to a %inactive.vgpr as long as it only touches its active lanes.
+; In that case, the VGPR should be included in the .vgpr_count
+; CHECK-LABEL: _miss_1:
+; CHECK: .vgpr_count:{{.*}}0xd{{$}}
+
+define amdgpu_cs_chain void @_miss_1(ptr inreg %next.callee, i32 inreg %global.table, i32 inreg %max.outgoing.vgpr.count,
+                                    i32 %vcr, { i32 } %system.data,
+                                    i32 %inactive.vgpr, i32 %inactive.vgpr1, i32 %inactive.vgpr2, i32 %inactive.vgpr3,
+                                    i32 %inactive.vgpr4, i32 %inactive.vgpr5, i32 %inactive.vgpr6, i32 %inactive.vgpr7,
+                                    i32 %inactive.vgpr8, i32 %inactive.vgpr9)
+                                    local_unnamed_addr {
+entry:
+  %system.data.value = extractvalue { i32 } %system.data, 0
+  %dead.val = call i32 @llvm.amdgcn.dead.i32()
+  %is.whole.wave = call i1 @llvm.amdgcn.init.whole.wave()
+  br i1 %is.whole.wave, label %shader, label %tail
+
+shader:
+  %system.data.extract = extractvalue { i32 } %system.data, 0
+  %data.mul = mul i32 %system.data.extract, 2
+  %data.add = add i32 %data.mul, 1
+  call void asm sideeffect "; use VGPR for %inactive.vgpr2", "~{v12}"()
+  br label %tail
+
+tail:
+  %final.vcr = phi i32 [ %vcr, %entry ], [ %data.mul, %shader ]
+  %final.sys.data = phi i32 [ %system.data.value, %entry ], [ %data.add, %shader ]
+  %final.inactive0 = phi i32 [ %inactive.vgpr, %entry ], [ %dead.val, %shader ]
+  %final.inactive1 = phi i32 [ %inactive.vgpr1, %entry ], [ %dead.val, %shader ]
+  %final.inactive2 = phi i32 [ %inactive.vgpr2, %entry ], [ %dead.val, %shader ]
+  %final.inactive3 = phi i32 [ %inactive.vgpr3, %entry ], [ %dead.val, %shader ]
+  %final.inactive4 = phi i32 [ %inactive.vgpr4, %entry ], [ %dead.val, %shader ]
+  %final.inactive5 = phi i32 [ %inactive.vgpr5, %entry ], [ %dead.val, %shader ]
+  %final.inactive6 = phi i32 [ %inactive.vgpr6, %entry ], [ %dead.val, %shader ]
+  %final.inactive7 = phi i32 [ %inactive.vgpr7, %entry ], [ %dead.val, %shader ]
+  %final.inactive8 = phi i32 [ %inactive.vgpr8, %entry ], [ %dead.val, %shader ]
+  %final.inactive9 = phi i32 [ %inactive.vgpr9, %entry ], [ %dead.val, %shader ]
+
+  %struct.init = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } poison, i32 %final.vcr, 0
+  %struct.with.data = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.init, i32 %final.sys.data, 1
+  %struct.with.inactive0 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.data, i32 %final.inactive0, 2
+  %struct.with.inactive1 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive0, i32 %final.inactive1, 3
+  %struct.with.inactive2 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive1, i32 %final.inactive2, 4
+  %struct.with.inactive3 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive2, i32 %final.inactive3, 5
+  %struct.with.inactive4 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive3, i32 %final.inactive4, 6
+  %struct.with.inactive5 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive4, i32 %final.inactive5, 7
+  %struct.with.inactive6 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive5, i32 %final.inactive6, 8
+  %struct.with.inactive7 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive6, i32 %final.inactive7, 9
+  %struct.with.inactive8 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive7, i32 %final.inactive8, 10
+  %final.struct = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive8, i32 %final.inactive9, 11
+
+  %vec.global = insertelement <4 x i32> poison, i32 %global.table, i64 0
+  %vec.max.vgpr = insertelement <4 x i32> %vec.global, i32 %max.outgoing.vgpr.count, i64 1
+  %vec.sys.data = insertelement <4 x i32> %vec.max.vgpr, i32 %final.sys.data, i64 2
+  %final.vec = insertelement <4 x i32> %vec.sys.data, i32 0, i64 3
+
+  call void (ptr, i32, <4 x i32>, { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 }, i32, ...)
+        @llvm.amdgcn.cs.chain.p0.i32.v4i32.sl_i32i32i32i32i32i32i32i32i32i32i32i32s(
+        ptr %next.callee, i32 0, <4 x i32> inreg %final.vec,
+        { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %final.struct,
+        i32 1, i32 %max.outgoing.vgpr.count, i32 -1, ptr @retry_vgpr_alloc.v4i32)
+  unreachable
+}
+
+declare i32 @llvm.amdgcn.dead.i32()
+declare i1 @llvm.amdgcn.init.whole.wave()
+declare void @llvm.amdgcn.cs.chain.p0.i32.v4i32.sl_i32i32i32i32i32i32i32i32i32i32i32i32s(ptr, i32, <4 x i32>, { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 }, i32 immarg, ...)
+
+declare amdgpu_cs_chain void @retry_vgpr_alloc.v4i32(<4 x i32> inreg)
+
+!amdgpu.pal.metadata.msgpack = !{!0}
+
+!0 = !{!"\82\B0amdpal.pipelines\91\8B\A4.api\A6Vulkan\B2.compute_registers\85\AB.tg_size_en\C3\AA.tgid_x_en\C3\AA.tgid_y_en\C3\AA.tgid_z_en\C3\AF.tidig_comp_cnt\00\B0.hardware_stages\81\A3.cs\8D\AF.checksum_value\00\AB.debug_mode\00\AB.float_mode\CC\C0\A9.image_op\C2\AC.mem_ordered\C3\AB.sgpr_limitj\B7.threadgroup_dimensions\93 \01\01\AD.trap_present\00\B2.user_data_reg_map\90\AB.user_sgprs\10\AB.vgpr_limit\CD\01\00\AF.wavefront_size \AF.wg_round_robin\C2\B7.internal_pipeline_hash\92\CF|{2&\DCC\85M\CFep\8A\EDR\DE\D6\E1\B1.shader_functions\81\A7_miss_1\82\B4.frontend_stack_size\00\B4.outgoing_vgpr_countP\A8.shaders\81\A8.compute\82\B0.api_shader_hash\92\00\00\B1.hardware_mapping\91\A3.cs\B0.spill_threshold\CD\FF\FF\A5.type\A2Cs\B0.user_data_limit\01\A9.uses_cps\C3\AF.xgl_cache_info\82\B3.128_bit_cache_hash\92\CF\B4\AF\9D\0B\07\88\03\02\CF\01o\C9\CAf?)\DA\AD.llpc_version\A476.0\AEamdpal.version\92\03\00"}
diff --git a/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count.ll b/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count.ll
new file mode 100644
index 0000000000000..b9130dd1b7ed4
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count.ll
@@ -0,0 +1,75 @@
+; RUN: llc -mtriple=amdgcn--amdpal -mcpu=gfx1200 < %s | FileCheck %s
+
+; CHECK-LABEL: .shader_functions:
+
+; Make sure that .vgpr_count doesn't include the %inactive.vgpr registers.
+; CHECK-LABEL: _miss_1:
+; CHECK: .vgpr_count:{{.*}}0xa{{$}}
+
+define amdgpu_cs_chain void @_miss_1(ptr inreg %next.callee, i32 inreg %global.table, i32 inreg %max.outgoing.vgpr.count,
+                                    i32 %vcr, { i32 } %system.data,
+                                    i32 %inactive.vgpr, i32 %inactive.vgpr1, i32 %inactive.vgpr2, i32 %inactive.vgpr3,
+                                    i32 %inactive.vgpr4, i32 %inactive.vgpr5, i32 %inactive.vgpr6, i32 %inactive.vgpr7,
+                                    i32 %inactive.vgpr8, i32 %inactive.vgpr9)
+                                    local_unnamed_addr {
+entry:
+  %system.data.value = extractvalue { i32 } %system.data, 0
+  %dead.val = call i32 @llvm.amdgcn.dead.i32()
+  %is.whole.wave = call i1 @llvm.amdgcn.init.whole.wave()
+  br i1 %is.whole.wave, label %shader, label %tail
+
+shader:
+  %system.data.extract = extractvalue { i32 } %system.data, 0
+  %data.mul = mul i32 %system.data.extract, 2
+  %data.add = add i32 %data.mul, 1
+  br label %tail
+
+tail:
+  %final.vcr = phi i32 [ %vcr, %entry ], [ %data.mul, %shader ]
+  %final.sys.data = phi i...
[truncated]

jayfoad · 2025-03-27T12:46:01Z

This could be much more generic. AMDGPUResourceUsageAnalysis could unconditionally ignore IMPLICIT_DEF and all use operands.

arsenm · 2025-03-27T14:03:26Z

llvm/lib/Target/AMDGPU/AMDGPUResourceUsageAnalysis.cpp

+        if (IsIWWFunction && (IsChainCall || IsImplicitDef) &&
+            TRI.isVectorRegister(MRI, Reg))
+          continue;


Doesn't matter if it's a vector register. I thought we already had a helper somewhere to skip synthetic uses

jayfoad · 2025-03-27T18:37:23Z

Skip implicit defs unconditionally

I don't think it's safe to ignore all implicit def operands (but you can ignore the def operand of IMPLICIT_DEF). Do you need this to get the optimization you're hoping for? Is ignoring all use operands not enough?

arsenm · 2025-03-28T01:34:46Z

llvm/lib/Target/AMDGPU/AMDGPUResourceUsageAnalysis.cpp

+        // For functions that use the llvm.amdgcn.init.whole.wave intrinsic, we
+        // often add artificial VGPR arguments for the purpose of preserving
+        // their inactive lanes. These should not be reported as part of our
+        // VGPR usage. We can identify them easily because they're only used in
+        // the chain call, and possibly in an IMPLICIT_DEF coming from an
+        // llvm.amdgcn.dead intrinsic.
+        if (IsIWWFunction && IsChainCall && TRI.isVectorRegister(MRI, Reg))
+          continue;


extend to any implicit argument that isn't in the MCInstrDesc, ignoring variadic instructions in the variadic section


rovka · 2025-04-02T12:16:50Z

Skip implicit defs unconditionally

I don't think it's safe to ignore all implicit def operands (but you can ignore the def operand of IMPLICIT_DEF). Do you need this to get the optimization you're hoping for? Is ignoring all use operands not enough?

The code was only ignoring IMPLICIT_DEF instructions, not implicit def operands :) Sorry about the confusing commit message.

rovka · 2025-04-02T12:19:02Z

This could be much more generic. AMDGPUResourceUsageAnalysis could unconditionally ignore IMPLICIT_DEF and all use operands.

Ok, I switched to that.

@arsenm I think this makes your other comments obsolete. But anyway, let me know what you think of this approach :)

llvm/docs/AMDGPUUsage.rst

llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp

llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-use-inactive.ll

Co-authored-by: Thomas Symalla <5754458+tsymalla@users.noreply.github.com>

Flakebi · 2025-04-07T11:59:02Z

I created some small tests to make sure this works as intended in all cases. It probably makes sense to add them here.
One of them yields surprising results, which is probably something that should be fixed.

PAL tests:

; RUN: llc -mcpu=gfx1200 -o - < %s | FileCheck %s
; Check that reads of a VGPR in kernels count towards the VGPR count, but in functions, only writes of VGPRs count towards the VGPR count.
target triple = "amdgcn--amdpal"

@global = addrspace(1) global i32 poison, align 4

; CHECK-LABEL: amdpal.pipelines:

; Neither uses nor writes a VGPR, but the hardware initializes the VGPRs that the kernel receives, so they count as used.
; CHECK-LABEL: .entry_point_symbol: kernel_use
; CHECK: .vgpr_count:     0x20
define amdgpu_cs void @kernel_use([32 x i32] %args) {
entry:
  %a = extractvalue [32 x i32] %args, 14
  store i32 %a, ptr addrspace(1) @global
  ret void
}

; Neither uses nor writes a VGPR
; CHECK-LABEL: gfx_func:
; CHECK: .vgpr_count:     0x20
define amdgpu_gfx [32 x i32] @gfx_func([32 x i32] %args) {
entry:
  ret [32 x i32] %args
}

; Neither uses nor writes a VGPR
; CHECK-LABEL: chain_func:
; CHECK: .vgpr_count:     0x1
define amdgpu_cs_chain void @chain_func([32 x i32] %args) {
entry:
  call void (ptr, i32, {}, [32 x i32], i32, ...) @llvm.amdgcn.cs.chain.p0.i32.s.a(
        ptr @chain_func, i32 0, {} inreg {}, [32 x i32] %args, i32 0)
  unreachable
}

The (to me) surprising one is gfx_func: it contains only SALU instructions, so it should have no defs of VGPRs and only uses for the return. I would expect it to have vgpr_count: 0x0 or maybe 0x1.

This one works as expected:

; RUN: llc -mcpu=gfx1200 -o - < %s | FileCheck %s
target triple = "amdgcn--amdpal"

declare amdgpu_gfx void @gfx_dummy([32 x i32] %args)

; CHECK-LABEL: .entry_point_symbol: kernel_call
; CHECK: .vgpr_count:     0x20
define amdgpu_cs void @kernel_call([32 x i32] %args) {
entry:
  call amdgpu_gfx void @gfx_dummy([32 x i32] %args)
  ret void
}

Carefully crafted compute test (the hw initializes at most one VGPR, so the test needs to ensure that no VGPR is ever written by any instruction). It also works as expected (correctly marks one VGPR as used).

; RUN: llc -mcpu=gfx1200 -o - < %s | FileCheck %s
target triple = "amdgcn-amd-amdhsa"

@global = addrspace(1) global i32 poison, align 4

; Carefully crafted kernel that uses v0 but never writes a VGPR or reads another VGPR.
; Only hardware-initialized VGPRs (v0) are read in this kernel.

; CHECK-LABEL: amdhsa.kernels:
; CHECK: .vgpr_count:     1
define amdgpu_kernel void @kernel(ptr addrspace(8) %rsrc) #0 {
entry:
  %id = call i32 @llvm.amdgcn.workitem.id.x()
  call void @llvm.amdgcn.raw.ptr.buffer.store.i32(i32 %id, ptr addrspace(8) %rsrc, i32 0, i32 0, i32 0)
  ret void
}

attributes #0 = { "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" }

rovka · 2025-04-09T12:36:26Z

Hi Sebastian, thanks for looking into this!

I don't think the existing test coverage is really that bad. The testcase that you pointed out (amdgpu_gfx leaf function) is not conceptually that different from the many amdgpu_gfx testcases we have here, or even from the leaf function test I added. All of these have in common the fact that they follow the code path for leaf functions, which gets the VGPR/SGPR/AGPR usage by just checking TRI.

I tried updating this code path to look only at register defs, so we're consistent with non-leaf functions, but that produces different results in a lot of tests, including for things like .numbered_sgpr, granulated_wavefront_sgpr_count, wavefront_sgpr_count, .amdhsa_next_free_sgpr... you get the idea. I'm feeling a bit uncomfortable updating all of those, because I'm not sure what they're used for and if it's actually ok for them to ignore uses. In any case, over-reporting the register usage of leaf functions is benign. It's still going to be less than the usage of whatever function/kernel actually defines those registers, so we won't be allocating too much. Can I get away with just a comment or something? :D I'm still looking into these :)

rovka

I made some pretty large changes:

Unified handling of leaf and non-leaf functions as much as possible.
Added any used preloaded SGPRs to the SGPR count (since they'll be written by the hardware)

rovka · 2025-04-30T08:27:07Z

llvm/test/CodeGen/AMDGPU/coalescer_remat.ll

@@ -12,7 +12,7 @@ declare float @llvm.fma.f32(float, float, float)
 ; CHECK:  v_mov_b32_e32 v{{[0-9]+}}, 0
 ; CHECK:  v_mov_b32_e32 v{{[0-9]+}}, 0
 ; It's probably OK if this is slightly higher:
-; CHECK: ; NumVgprs: 8
+; CHECK: ; NumVgprs: 5


I could use another pair of eyes on this test. From what I can tell, it uses v[4:7] (buffer_store_dwordx4 v[4:7], off, s[0:3], 0) but only defines v4. Am I missing smth, or is this just broken?

The first store in the loop is doing a vector store with only the 0th element defined
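
To make the 8 vs. 5 difference concrete, here is a minimal sketch of the two counting policies. The `RegOperand` struct, `numVgprs`, and `exampleOps` are hypothetical names for illustration only, not LLVM's `MachineOperand` API, and the operand list is an assumed stand-in for what the test produces (writes up to v4, then a read of v[4:7]).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical, simplified operand model; this is not LLVM's MachineOperand.
struct RegOperand {
  unsigned FirstReg; // index of the first 32-bit VGPR, e.g. 4 for v[4:7]
  unsigned NumRegs;  // number of consecutive VGPRs covered by the operand
  bool IsDef;        // true for a write, false for a read
};

// Returns one past the highest VGPR index seen. When CountUses is false,
// reads are skipped, which models the "count only the defs" policy.
unsigned numVgprs(const std::vector<RegOperand> &Ops, bool CountUses) {
  unsigned Count = 0;
  for (const RegOperand &Op : Ops) {
    if (!Op.IsDef && !CountUses)
      continue;
    Count = std::max(Count, Op.FirstReg + Op.NumRegs);
  }
  return Count;
}

// Assumed operand list: v[0:3] and v4 are written, then a 128-bit store
// reads v[4:7] although only v4 was ever written.
std::vector<RegOperand> exampleOps() {
  return {{0, 4, true}, {4, 1, true}, {4, 4, false}};
}
```

With uses counted, the store's v[4:7] operand pushes the result to 8; with defs only, the highest written register is v4 and the result is 5.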

Flakebi · 2025-04-30T12:00:29Z

Thanks for the refactoring!

The tests I quoted above test specific corner cases, I find it much easier to look at and verify a 2 line test than a 20 line test.
The last test of the three I wrote fails with the new version of this change. It emits .vgpr_count: 0 even though v0 is live and defined as it’s initialized by the hardware (v2 is never written in the shader).
I’d still like it if we can have the tests in llvm :)

arsenm · 2025-05-01T14:09:05Z

llvm/lib/Target/AMDGPU/AMDGPUResourceUsageAnalysis.cpp

+    Info.NumAGPR = TRI.getNumDefinedPhysRegs(MRI, AMDGPU::AGPR_32RegClass);
+
+  // Count any user or system SGPRs that are actually used.
+  for (int I = MFI->getNumPreloadedSGPRs() - 1; I >= 0; I--)


Braces, but also should just take the raw number of preloaded SGPRs to start with. What happens if you don't allocate preloaded registers that were requested? If we wanted to trim out unused preloaded registers, it should have happened earlier (or, we wouldn't have preloaded in the first place). This could also break the debugger use case if it expects to find something there
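
The two options in this comment can be sketched as follows. This is only one plausible reading of the quoted loop (its body is not shown here), and `sgprCountTrimmed` / `sgprCountRaw` are hypothetical helpers, not functions from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Option A (assumed reading of the quoted loop): walk the preloaded SGPRs
// from the top and stop at the highest one that is actually used, so unused
// trailing preloaded registers are not counted.
unsigned sgprCountTrimmed(unsigned NumDefined,
                          const std::vector<bool> &PreloadedUsed) {
  for (int I = static_cast<int>(PreloadedUsed.size()) - 1; I >= 0; --I) {
    if (PreloadedUsed[I])
      return std::max(NumDefined, static_cast<unsigned>(I) + 1);
  }
  return NumDefined;
}

// Option B (the suggestion in this comment): start from the raw number of
// preloaded SGPRs, whether or not the function reads them.
unsigned sgprCountRaw(unsigned NumDefined, unsigned NumPreloaded) {
  return std::max(NumDefined, NumPreloaded);
}
```

For five preloaded SGPRs of which only s0 and s2 are read, and two SGPRs written by the function, option A gives 3 while option B gives 5, so everything the hardware writes stays inside the reported count.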

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp

llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-large.ll

arsenm · 2025-05-01T14:15:37Z

llvm/test/CodeGen/AMDGPU/coalescer_remat.ll

The first store in the loop is doing a vector store with only the 0th element defined

github-actions · 2025-05-05T15:00:44Z

✅ With the latest revision this PR passed the C/C++ code formatter.

jayfoad · 2025-05-06T10:42:45Z

Carefully crafted compute test (the hw initializes at most one VGPR, so the test needs to ensure, no VGPR is ever written from any instruction). Also works as expected (correctly marks one VGPR as used).

This is getting a bit philosophical! Do we actually need to report that one VGPR is used? What will the hardware do if you ask it to launch (say) a PS with 16 VGPR inputs, but with an allocation of only 8 VGPRs? Will it automatically increase the allocation to include all the inputs?

rovka · 2025-05-08T15:53:10Z

Carefully crafted compute test (the hw initializes at most one VGPR, so the test needs to ensure, no VGPR is ever written from any instruction). Also works as expected (correctly marks one VGPR as used).

This is getting a bit philosophical! Do we actually need to report that one VGPR is used? What will the hardware do if you ask it to launch (say) a PS with 16 VGPR inputs, but with an allocation of only 8 VGPRs? Will it automatically increase the allocation to include all the inputs?

Aren't these different use cases? Graphics shaders (especially PS!) have some special handling in AMDGPUAsmPrinter to make sure we include all the inputs (otherwise we might end up overwriting another wave's registers, from what I've heard). The case in the original quote is compute, and to be honest I don't know what the hardware would do. We weren't including all of these inputs before (hence the test churn in the latest version), so either we got lucky and never hit any edge cases, or this is handled somewhere else (firmware?).

jayfoad · 2025-05-08T16:09:11Z

Carefully crafted compute test (the hw initializes at most one VGPR, so the test needs to ensure, no VGPR is ever written from any instruction). Also works as expected (correctly marks one VGPR as used).

This is getting a bit philosophical! Do we actually need to report that one VGPR is used? What will the hardware do if you ask it to launch (say) a PS with 16 VGPR inputs, but with an allocation of only 8 VGPRs? Will it automatically increase the allocation to include all the inputs?

Aren't these different use cases? Graphics shaders (especially PS!) have some special handling in AMDGPUAsmPrinter to make sure we include all the inputs (otherwise we might end up overwriting another wave's registers, from what I've heard). The case in the original quote is compute, and to be honest I don't know what the hardware would do. We weren't including all of these inputs before (hence the test churn in the latest version), so either we got lucky and never hit any edge cases, or this is handled somewhere else (firmware?).

All shaders must allocate at least one block of VGPRs. For a compute shader the hardware only initializes VGPR0, so there can never be a problem with the hardware not allocating enough registers to include all the inputs. That's the only reason I started talking about pixel shaders.

So I would ask again, do we actually need to report that one VGPR is used in Sebastian's carefully crafted compute test?
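
The "at least one block" argument can be illustrated with a small sketch. `allocatedVgprs` is a hypothetical helper and the block size is an assumed parameter (the real allocation granule depends on the target and wave size); the point is only that reporting 0 or 1 VGPRs yields the same allocation under this rule.

```cpp
#include <algorithm>
#include <cassert>

// Rounds the reported VGPR count up to whole allocation blocks and never
// returns fewer than one block. BlockSize is an illustrative parameter.
unsigned allocatedVgprs(unsigned ReportedCount, unsigned BlockSize) {
  unsigned Blocks = (ReportedCount + BlockSize - 1) / BlockSize;
  return std::max(Blocks, 1u) * BlockSize;
}
```

With an assumed block size of 8, reported counts of 0 and 1 both allocate 8 registers, while 9 allocates 16.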

Flakebi · 2025-05-14T10:19:11Z

So I would ask again, do we actually need to report that one VGPR is used in Sebastian's carefully crafted compute test?

Not a definitive answer, but I think it would be cleaner to report v0 as used.

It allows tools to display the correct count,
and if the hw behavior of always allocating at least one VGPR block changes, it still works correctly.

jayfoad · 2025-05-14T10:27:28Z

Not a definitive answer, but I think it would be cleaner to report v0 as used.

It allows tools to display the correct count,

and if the hw behavior of always allocating at least one VGPR block changes, it still works correctly.

So you would distinguish between:

Shader reads v0 which was initialized by hardware. This counts as "used".
Shader reads an undefined value from v0 which was not initialized by hardware. This does not count as "used".

I guess we can do that. It just seems like extra effort in the compiler for (I claim) no tangible benefit.

rovka · 2025-05-20T10:53:41Z

So you would distinguish between:

Shader reads v0 which was initialized by hardware. This counts as "used".

Shader reads an undefined value from v0 which was not initialized by hardware. This does not count as "used".

I guess we can do that. It just seems like extra effort in the compiler for (I claim) no tangible benefit.

So are we good with the patch as it is now? It currently reports v0 as used if the workitem-id-x isn't explicitly disabled.

nhaehnle

I agree with @Flakebi that for entry functions it's better to report what the HW initializes.
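
A minimal sketch of the rule being agreed on here, assuming a simplified per-function summary. `FunctionRegs` and `reportedVgprs` are hypothetical names, and the number of hardware-initialized VGPRs is taken as an input rather than derived from attributes such as "amdgpu-no-workitem-id-y".

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-function summary; field names are illustrative.
struct FunctionRegs {
  bool IsEntry;             // entry function (kernel/shader) vs. callable
  unsigned NumDefinedVgprs; // one past the highest VGPR written in the body
  unsigned NumHwInitVgprs;  // VGPRs the hardware initializes at launch
};

// Entry functions also report what the hardware initializes; other
// functions report only the registers they write.
unsigned reportedVgprs(const FunctionRegs &F) {
  if (!F.IsEntry)
    return F.NumDefinedVgprs;
  return std::max(F.NumDefinedVgprs, F.NumHwInitVgprs);
}
```

For the compute test discussed above (no VGPR writes, v0 initialized by hardware) this yields 1, whereas a non-entry function that only reads v0 would report 0.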

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp

llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp

Don't count register uses when determining the maximum number of registers used by a function. Count only the defs. This is really an underestimate of the true register usage, but in practice that's not a problem because if a function uses a register, then it has either defined it earlier, or some other function that executed before has defined it.

In particular, the register counts are used:

1. When launching an entry function - in which case we're safe because the register counts of the entry function will include the register counts of all callees.
2. At function boundaries in dynamic VGPR mode. In this case it's safe because whenever we set the new VGPR allocation we take into account the outgoing_vgpr_count set by the middle-end.

The main advantage of doing this is that the artificial VGPR arguments used only for preserving the inactive lanes when using the llvm.amdgcn.init.whole.wave intrinsic are no longer counted. This enables us to allocate only the registers we need in dynamic VGPR mode.

---------

Co-authored-by: Thomas Symalla <5754458+tsymalla@users.noreply.github.com>
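
The first point in the commit message (an entry function's count includes the counts of all its callees) can be sketched as a maximum over the call graph. This is an illustrative model only: `FuncInfo`, `totalVgprs`, and `exampleCallGraph` are hypothetical, and it assumes an acyclic call graph with no indirect calls, which the real analysis has to handle separately.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical call-graph node: the function's own defs-only VGPR count
// plus the names of the functions it calls directly.
struct FuncInfo {
  unsigned OwnVgprs;
  std::vector<std::string> Callees;
};

// Register count for launching Name: the maximum of its own count and the
// counts of everything reachable from it. Assumes no recursion.
unsigned totalVgprs(const std::map<std::string, FuncInfo> &CG,
                    const std::string &Name) {
  const FuncInfo &F = CG.at(Name);
  unsigned Max = F.OwnVgprs;
  for (const std::string &Callee : F.Callees)
    Max = std::max(Max, totalVgprs(CG, Callee));
  return Max;
}

// Illustrative graph: a kernel that writes 4 VGPRs and calls a helper
// that writes 32.
std::map<std::string, FuncInfo> exampleCallGraph() {
  std::map<std::string, FuncInfo> CG;
  CG["helper"] = FuncInfo{32, {}};
  CG["kernel"] = FuncInfo{4, {"helper"}};
  return CG;
}
```

Under this model the kernel is launched with 32 VGPRs, so a callee that only reads a register written by its caller never sees an unallocated register even though its own count ignores that read.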

llvm-ci · 2025-06-03T10:55:40Z

LLVM Buildbot has detected a new failure on builder sanitizer-x86_64-linux-fast running on sanitizer-buildbot3 while building llvm at step 2 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/169/builds/11943

Here is the relevant piece of the build log for the reference

Step 2 (annotate) failure: 'python ../sanitizer_buildbot/sanitizers/zorg/buildbot/builders/sanitizers/buildbot_selector.py' (failure)
...
llvm-lit: /home/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using lld-link: /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/bin/lld-link
llvm-lit: /home/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using ld64.lld: /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/bin/ld64.lld
llvm-lit: /home/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using wasm-ld: /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/bin/wasm-ld
llvm-lit: /home/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using ld.lld: /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/bin/ld.lld
llvm-lit: /home/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using lld-link: /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/bin/lld-link
llvm-lit: /home/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using ld64.lld: /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/bin/ld64.lld
llvm-lit: /home/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using wasm-ld: /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/bin/wasm-ld
llvm-lit: /home/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/utils/lit/lit/main.py:73: note: The test suite configuration requested an individual test timeout of 0 seconds but a timeout of 900 seconds was requested on the command line. Forcing timeout to be 900 seconds.
-- Testing: 90009 tests, 88 workers --
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90
FAIL: LLVM :: ExecutionEngine/JITLink/x86-64/MachO_weak_references.s (51280 of 90009)
******************** TEST 'LLVM :: ExecutionEngine/JITLink/x86-64/MachO_weak_references.s' FAILED ********************
Exit Code: 134

Command Output (stderr):
--
rm -rf /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/test/ExecutionEngine/JITLink/x86-64/Output/MachO_weak_references.s.tmp && mkdir -p /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/test/ExecutionEngine/JITLink/x86-64/Output/MachO_weak_references.s.tmp # RUN: at line 1
+ rm -rf /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/test/ExecutionEngine/JITLink/x86-64/Output/MachO_weak_references.s.tmp
+ mkdir -p /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/test/ExecutionEngine/JITLink/x86-64/Output/MachO_weak_references.s.tmp
/home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/bin/llvm-mc -triple=x86_64-apple-macosx10.9 -filetype=obj -o /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/test/ExecutionEngine/JITLink/x86-64/Output/MachO_weak_references.s.tmp/macho_weak_refs.o /home/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/ExecutionEngine/JITLink/x86-64/MachO_weak_references.s # RUN: at line 2
+ /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/bin/llvm-mc -triple=x86_64-apple-macosx10.9 -filetype=obj -o /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/test/ExecutionEngine/JITLink/x86-64/Output/MachO_weak_references.s.tmp/macho_weak_refs.o /home/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/ExecutionEngine/JITLink/x86-64/MachO_weak_references.s
/home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/bin/llvm-jitlink -noexec -check-name=jitlink-check-bar-present -abs bar=0x1 -check=/home/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/ExecutionEngine/JITLink/x86-64/MachO_weak_references.s /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/test/ExecutionEngine/JITLink/x86-64/Output/MachO_weak_references.s.tmp/macho_weak_refs.o # RUN: at line 3
+ /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/bin/llvm-jitlink -noexec -check-name=jitlink-check-bar-present -abs bar=0x1 -check=/home/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/ExecutionEngine/JITLink/x86-64/MachO_weak_references.s /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/test/ExecutionEngine/JITLink/x86-64/Output/MachO_weak_references.s.tmp/macho_weak_refs.o
/home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/bin/llvm-jitlink -noexec -check-name=jitlink-check-bar-absent -check=/home/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/ExecutionEngine/JITLink/x86-64/MachO_weak_references.s /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/test/ExecutionEngine/JITLink/x86-64/Output/MachO_weak_references.s.tmp/macho_weak_refs.o # RUN: at line 4
+ /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/bin/llvm-jitlink -noexec -check-name=jitlink-check-bar-absent -check=/home/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/ExecutionEngine/JITLink/x86-64/MachO_weak_references.s /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/test/ExecutionEngine/JITLink/x86-64/Output/MachO_weak_references.s.tmp/macho_weak_refs.o
libc++abi: Pure virtual function called!
/home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/test/ExecutionEngine/JITLink/x86-64/Output/MachO_weak_references.s.script: line 4: 2164743 Aborted                 /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/bin/llvm-jitlink -noexec -check-name=jitlink-check-bar-absent -check=/home/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/test/ExecutionEngine/JITLink/x86-64/MachO_weak_references.s /home/b/sanitizer-x86_64-linux-fast/build/llvm_build_asan_ubsan/test/ExecutionEngine/JITLink/x86-64/Output/MachO_weak_references.s.tmp/macho_weak_refs.o

--

********************
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90.. 
Slowest Tests:
--------------------------------------------------------------------------
332.20s: LLVM :: CodeGen/AMDGPU/sched-group-barrier-pipeline-solver.mir
261.91s: Clang :: Driver/fsanitize.c
206.55s: Clang :: Preprocessor/riscv-target-features.c
186.62s: LLVM :: CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll
161.46s: Clang :: OpenMP/target_update_codegen.cpp
160.47s: Clang :: Driver/arm-cortex-cpus-2.c
158.13s: Clang :: OpenMP/target_defaultmap_codegen_01.cpp
155.13s: Clang :: Driver/arm-cortex-cpus-1.c
152.78s: Clang :: Preprocessor/aarch64-target-features.c
141.99s: Clang :: Preprocessor/arm-target-features.c
131.37s: LLVM :: CodeGen/AMDGPU/memintrinsic-unroll.ll
129.92s: LLVM :: CodeGen/RISCV/attributes.ll
125.24s: Clang :: Preprocessor/predefined-arch-macros.c
124.76s: Clang :: Analysis/a_flaky_crash.cpp
111.41s: Clang :: CodeGen/AArch64/sve-intrinsics/acle_sve_reinterpret.c
Step 10 (stage2/asan_ubsan check) failure: stage2/asan_ubsan check (failure)

rovka added the backend:AMDGPU label Mar 27, 2025

rovka requested review from nhaehnle, tsymalla and Flakebi March 27, 2025 12:27

arsenm reviewed Mar 27, 2025

Skip implicit defs unconditionally

5089dad

arsenm reviewed Mar 28, 2025

rovka changed the title ~~[AMDGPU] Ignore inactive VGPRs in .vgpr_count~~ [AMDGPU] Skip register uses in AMDGPUResourceUsageAnalysis Apr 2, 2025

Merge branch 'main' into vgpr-count

63bba7e

Flakebi reviewed Apr 2, 2025

llvm/docs/AMDGPUUsage.rst

llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp

rovka requested a review from trenouf April 2, 2025 12:59

s/no-init-whole-wave/isEntryFunction

f42baae

tsymalla reviewed Apr 3, 2025

llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-use-inactive.ll

Fix typo in comment. NFC

affc837

Co-authored-by: Thomas Symalla <5754458+tsymalla@users.noreply.github.com>

rovka added 2 commits April 30, 2025 09:52

Unify code paths; include user + sys SGPRS if used

733829b

Merge remote-tracking branch 'remotes/origin/main' into vgpr-count

0824cf2

rovka commented Apr 30, 2025

arsenm reviewed May 1, 2025

Add missing tests. Fix preloaded VGPR issue

563cbd6

Formatting

f27621d

rovka added 2 commits May 6, 2025 13:32

Cound all preloaded regs

ca7e723

Merge remote-tracking branch 'origin/main' into vgpr-count

e45bbf9

nhaehnle approved these changes May 20, 2025

jayfoad reviewed May 21, 2025

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h

arsenm reviewed May 21, 2025

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp

arsenm reviewed May 22, 2025

llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp

rovka and others added 4 commits May 26, 2025 10:43

De Morgan

49fa86f

Comments

3da979a

Merge remote-tracking branch 'remotes/origin/main' into vgpr-count-rovka

0dbc113

Merge branch 'main' into vgpr-count

35af826

rovka merged commit 130080f into llvm:main Jun 3, 2025
12 checks passed

rovka added a commit to rovka/llvm-project that referenced this pull request Jun 3, 2025

Update test after merging llvm#133242

fa8acf5
