
[IR][LangRef] Add partial reduction add intrinsic #94499

Merged
NickGuy-Arm merged 12 commits into llvm:main from partial-reduction-add-intrinsic on Jul 4, 2024

Conversation

NickGuy-Arm
Contributor

Adds the llvm.experimental.partial.reduce.add.* overloaded intrinsic. This intrinsic represents add reductions that result in a narrower vector. In the generic case it is lowered to a chain of add reductions; it is assumed that it will only be emitted for targets that provide their own lowering.
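
For reference, a minimal IR sketch of a call to the intrinsic in the single-operand form proposed in this revision (the declaration is taken from the LangRef change below; the surrounding function is purely illustrative):

    define <vscale x 4 x i32> @partial_reduce_example(<vscale x 16 x i32> %in) {
      ; Reduce the wide input down to a quarter-length result vector. The generic
      ; lowering emits a chain of add reductions over subvectors of %in.
      %r = call <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 16 x i32> %in)
      ret <vscale x 4 x i32> %r
    }

    declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 16 x i32>)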

@llvmbot
Member

llvmbot commented Jun 5, 2024

@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-aarch64

Author: None (NickGuy-Arm)

Changes

Adds the llvm.experimental.partial.reduce.add.* overloaded intrinsic. This intrinsic represents add reductions that result in a narrower vector. In the generic case it is lowered to a chain of add reductions; it is assumed that it will only be emitted for targets that provide their own lowering.


Full diff: https://github.com/llvm/llvm-project/pull/94499.diff

4 Files Affected:

  • (modified) llvm/docs/LangRef.rst (+31-2)
  • (modified) llvm/include/llvm/IR/Intrinsics.td (+6)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp (+21)
  • (added) llvm/test/CodeGen/AArch64/partial-reduce-sdot-ir.ll (+76)
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 9d7ade8eb523b..95f839e35b673 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -14250,7 +14250,7 @@ Arguments:
 """"""""""
 The first 4 arguments are similar to ``llvm.instrprof.increment``. The indexing
 is specific to callsites, meaning callsites are indexed from 0, independent from
-the indexes used by the other intrinsics (such as 
+the indexes used by the other intrinsics (such as
 ``llvm.instrprof.increment[.step]``).
 
 The last argument is the called value of the callsite this intrinsic precedes.
@@ -14264,7 +14264,7 @@ a buffer LLVM can use to perform counter increments (i.e. the lowering of
 ``llvm.instrprof.increment[.step]``. The address range following the counter
 buffer, ``<num-counters>`` x ``sizeof(ptr)`` - sized, is expected to contain
 pointers to contexts of functions called from this function ("subcontexts").
-LLVM does not dereference into that memory region, just calculates GEPs. 
+LLVM does not dereference into that memory region, just calculates GEPs.
 
 The lowering of ``llvm.instrprof.callsite`` consists of:
 
@@ -19209,6 +19209,35 @@ will be on any later loop iteration.
 This intrinsic will only return 0 if the input count is also 0. A non-zero input
 count will produce a non-zero result.
 
+'``llvm.experimental.vector.partial.reduce.add.*``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic.
+
+::
+
+      declare <4 x i32> @llvm.experimental.vector.partial.reduce.add.v2i32.v8i32(<8 x i32> %in)
+      declare <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v16i32(<16 x i32> %in)
+      declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv2i32.nxv8i32(<vscale x 8 x i32> %in)
+      declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 16 x i32> %in)
+
+Overview:
+"""""""""
+
+The '``llvm.experimental.vector.partial.reduce.add.*``' intrinsics do an integer
+``ADD`` reduction of subvectors within a vector, returning each scalar result as
+a lane within a vector. The return type is a vector type with an
+element-type of the vector input and a width a factor of the vector input
+(typically either half or quarter).
+
+Arguments:
+""""""""""
+
+The argument to this intrinsic must be a vector of integer values.
+
+
 '``llvm.experimental.vector.histogram.*``' Intrinsic
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index 107442623ab7b..08c516bd1cea1 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -2635,6 +2635,12 @@ def int_vector_deinterleave2 : DefaultAttrsIntrinsic<[LLVMHalfElementsVectorType
                                                      [llvm_anyvector_ty],
                                                      [IntrNoMem]>;
 
+//===-------------- Intrinsics to perform partial reduction ---------------===//
+
+def int_experimental_vector_partial_reduce_add : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+                                                                       [llvm_anyvector_ty],
+                                                                       [IntrNoMem]>;
+
 //===----------------- Pointer Authentication Intrinsics ------------------===//
 //
 
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index ba76456b5836a..f24723a45237d 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -7914,6 +7914,27 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
     setValue(&I, Trunc);
     return;
   }
+  case Intrinsic::experimental_vector_partial_reduce_add: {
+    auto DL = getCurSDLoc();
+    auto ReducedTy = EVT::getEVT(I.getType());
+    auto OpNode = getValue(I.getOperand(0));
+    auto Index = DAG.getVectorIdxConstant(0, DL);
+    auto FullTy = OpNode.getValueType();
+
+    auto ResultVector = DAG.getSplat(ReducedTy, DL, DAG.getConstant(0, DL, ReducedTy.getScalarType()));
+    unsigned ScaleFactor = FullTy.getVectorMinNumElements() / ReducedTy.getVectorMinNumElements();
+
+    for(unsigned i = 0; i < ScaleFactor; i++) {
+      auto SourceIndex = DAG.getVectorIdxConstant(i * ScaleFactor, DL);
+      auto TargetIndex = DAG.getVectorIdxConstant(i, DL);
+      auto N = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ReducedTy, {OpNode, SourceIndex});
+      N = DAG.getNode(ISD::VECREDUCE_ADD, DL, ReducedTy.getScalarType(), N);
+      ResultVector = DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, ReducedTy, {ResultVector, N, TargetIndex});
+    }
+
+    setValue(&I, ResultVector);
+    return;
+  }
   case Intrinsic::experimental_cttz_elts: {
     auto DL = getCurSDLoc();
     SDValue Op = getValue(I.getOperand(0));
diff --git a/llvm/test/CodeGen/AArch64/partial-reduce-sdot-ir.ll b/llvm/test/CodeGen/AArch64/partial-reduce-sdot-ir.ll
new file mode 100644
index 0000000000000..6a5b3bd5ace2e
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/partial-reduce-sdot-ir.ll
@@ -0,0 +1,76 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -force-vector-interleave=1 %s | FileCheck %s
+
+target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
+target triple = "aarch64-none-unknown-elf"
+
+define void @partial_reduce_add(<vscale x 16 x i8> %wide.load.pre, <vscale x 16 x i32> %0, <vscale x 16 x i32> %1, i64 %index) #0 {
+; CHECK-LABEL: partial_reduce_add:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    ptrue p0.s
+; CHECK-NEXT:    mov w8, #1 // =0x1
+; CHECK-NEXT:    index z2.s, #0, #1
+; CHECK-NEXT:    mov z4.s, w8
+; CHECK-NEXT:    mov w8, #2 // =0x2
+; CHECK-NEXT:    ptrue p2.s, vl1
+; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x0]
+; CHECK-NEXT:    ld1w { z1.s }, p0/z, [x0, #1, mul vl]
+; CHECK-NEXT:    ld1w { z5.s }, p0/z, [x0, #2, mul vl]
+; CHECK-NEXT:    mov z6.s, w8
+; CHECK-NEXT:    cmpeq p1.s, p0/z, z2.s, z4.s
+; CHECK-NEXT:    uaddv d3, p0, z0.s
+; CHECK-NEXT:    mov z0.s, #0 // =0x0
+; CHECK-NEXT:    uaddv d7, p0, z1.s
+; CHECK-NEXT:    uaddv d4, p0, z5.s
+; CHECK-NEXT:    mov z1.d, z0.d
+; CHECK-NEXT:    fmov x8, d3
+; CHECK-NEXT:    ld1w { z3.s }, p0/z, [x0, #3, mul vl]
+; CHECK-NEXT:    mov z1.s, p2/m, w8
+; CHECK-NEXT:    mov w8, #3 // =0x3
+; CHECK-NEXT:    cmpeq p2.s, p0/z, z2.s, z6.s
+; CHECK-NEXT:    mov z5.s, w8
+; CHECK-NEXT:    fmov x8, d7
+; CHECK-NEXT:    uaddv d3, p0, z3.s
+; CHECK-NEXT:    mov z1.s, p1/m, w8
+; CHECK-NEXT:    fmov x8, d4
+; CHECK-NEXT:    cmpeq p0.s, p0/z, z2.s, z5.s
+; CHECK-NEXT:    mov z1.s, p2/m, w8
+; CHECK-NEXT:    fmov x8, d3
+; CHECK-NEXT:    mov z1.s, p0/m, w8
+; CHECK-NEXT:    addvl x8, x1, #1
+; CHECK-NEXT:  .LBB0_1: // %vector.body
+; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    orr z0.d, z1.d, z0.d
+; CHECK-NEXT:    cbnz x8, .LBB0_1
+; CHECK-NEXT:  // %bb.2: // %middle.block
+; CHECK-NEXT:    ret
+entry:
+  %2 = call i64 @llvm.vscale.i64()
+  %3 = mul i64 %2, 16
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %entry
+  %vec.phi = phi <vscale x 4 x i32> [ zeroinitializer, %entry ], [ %4, %vector.body ]
+  %partial.reduce = call <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 16 x i32> %1)
+  %4 = or <vscale x 4 x i32> %partial.reduce, %vec.phi
+  %index.next = add i64 %index, %3
+  %5 = icmp eq i64 %index.next, 0
+  br i1 %5, label %middle.block, label %vector.body
+
+middle.block:                                     ; preds = %vector.body
+  %6 = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> %4)
+  ret void
+}
+
+; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
+declare i64 @llvm.vscale.i64() #1
+
+; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
+declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 16 x i32>) #1
+
+; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
+declare i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32>) #2
+
+attributes #0 = { "target-features"="+fp-armv8,+fullfp16,+neon,+sve,+sve2,+v8a" }
+attributes #1 = { nocallback nofree nosync nounwind willreturn memory(none) }
+attributes #2 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }

@llvmbot
Member

llvmbot commented Jun 5, 2024

@llvm/pr-subscribers-llvm-selectiondag

Author: None (NickGuy-Arm)


github-actions bot commented Jun 5, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@arsenm
Contributor

arsenm commented Jun 6, 2024

What's the benefit to claiming this is a separate operation from the existing reduction intrinsics? Can't you just loosen the type declaration to allow vector types for the result?
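
For context, the shapes being compared look roughly like this; the first declaration is the existing scalar-producing reduction intrinsic, while the second is the new experimental intrinsic that effectively plays the role of a vector-returning variant:

    ; Existing reduction intrinsic: always reduces to a single scalar.
    declare i32 @llvm.vector.reduce.add.nxv16i32(<vscale x 16 x i32>)

    ; The new experimental intrinsic: the same kind of add reduction, but the
    ; result is a narrower vector rather than a scalar.
    declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 16 x i32>)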

%3 = mul i64 %2, 16
br label %vector.body

vector.body: ; preds = %vector.body, %entry
Collaborator

It doesn't need a loop for the test.
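
A loop-free form of the test, along the lines suggested here, might simply call the intrinsic and return the result (a sketch; the test in later revisions of this PR takes a similar shape):

    define <vscale x 4 x i32> @partial_reduce_add_noloop(<vscale x 16 x i32> %in) {
    entry:
      ; Exercising the generic lowering does not require a vector.body loop;
      ; calling the intrinsic directly is enough.
      %partial.reduce = call <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 16 x i32> %in)
      ret <vscale x 4 x i32> %partial.reduce
    }

    declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 16 x i32>)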

Comment on lines 19229 to 19233
The '``llvm.vector.experimental.partial.reduce.add.*``' intrinsics do an integer
``ADD`` reduction of subvectors within a vector, returning each scalar result as
a lane within a vector. The return type is a vector type with an
element-type of the vector input and a width a factor of the vector input
(typically either half or quarter).
Collaborator

I haven't been involved in defining these intrinsics internally, but have thought about how they might work before. I'm not sure whether it is better to have a generic partial reduction like this or something more specific to dotprod that includes the zext/sext and mul. They both have advantages and disadvantages. The more instructions there are, the harder they are to cost-model well, but more can be done with them.

But it would seem that we should be defining how these are expected to reduce the inputs into the output lanes. Otherwise the definition is a bit wishy-washy in a way that can make them more difficult to use than is necessary. I would expect them to perform pair-wise reductions, and it might be simpler if they are limited to powers of 2 so that they can deinterleave in steps.
https://godbolt.org/z/G737aj1n6

The codegen that currently exists doesn't seem to do that though.
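
As an illustration of the pair-wise idea (not the semantics this PR ultimately adopts), one deinterleave-and-add step halving <vscale x 8 x i32> to <vscale x 4 x i32> could be written as:

    define <vscale x 4 x i32> @pairwise_step(<vscale x 8 x i32> %in) {
      ; Split even and odd lanes, then add them: each result lane is the sum of
      ; one adjacent pair of input lanes. Repeating this halves the vector at
      ; each step.
      %di = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> %in)
      %even = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } %di, 0
      %odd = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } %di, 1
      %sum = add <vscale x 4 x i32> %even, %odd
      ret <vscale x 4 x i32> %sum
    }

    declare { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.vector.deinterleave2.nxv8i32(<vscale x 8 x i32>)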

Collaborator

The intent here is to keep the reduction intrinsics as loose as possible so we don't lock the code generator into a specific ordering.

If there's an option to simply extend the original intrinsics that would be super but I figured it would be easier to move current uses to a newer intrinsic (assuming it leaves the experimental space) than the other way round.

Comment on lines 19221 to 19222
declare <4 x i32> @llvm.experimental.vector.partial.reduce.add.v2i32.v8i32(<8 x i32> %in)
declare <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v16i32(<16 x i32> %in)
@paulwalker-arm (Collaborator) commented Jun 6, 2024

Is it best for these intrinsics to take a single operand? I kind of see them more like binary operators whose input and output operands are less restrictive.

The intent is that we're doing a partial reduction. LoopVectorize currently achieves this via an add (or other binop) instruction, which is too restrictive (i.e. it forces a very specific ordering for how elements are combined), so we're introducing a new intrinsic that better reflects what LoopVectorize wants to achieve but without the baggage. This would allow existing code to pick a format of the intrinsic that can be mapped directly to an add instruction if need be.

Collaborator

To be specific I'm proposing

declare <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v16i32(<4 x i32>,<16 x i32>)

whereby the result and first operand types match, but the second operand differs (perhaps with a restriction that it must have the same number of elements or more).
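
A sketch of how LoopVectorize might use that two-operand form, with the partial reduction feeding an accumulator phi and a final full reduction after the loop (illustrative IR assuming the declaration proposed above; the fixed trip count stands in for a real loop bound):

    define i32 @accumulate_example(<16 x i32> %chunk) {
    entry:
      br label %vector.body

    vector.body:
      %iv = phi i32 [ 0, %entry ], [ %iv.next, %vector.body ]
      %vec.phi = phi <4 x i32> [ zeroinitializer, %entry ], [ %acc.next, %vector.body ]
      ; Fold the 16 elements of %chunk into the 4-lane accumulator; in a real
      ; vectorized loop this operand would be a freshly loaded (and widened) vector.
      %acc.next = call <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v16i32(<4 x i32> %vec.phi, <16 x i32> %chunk)
      %iv.next = add i32 %iv, 1
      %done = icmp eq i32 %iv.next, 4
      br i1 %done, label %exit, label %vector.body

    exit:
      ; The scalar result only needs an ordinary full reduction once, after the loop.
      %r = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %acc.next)
      ret i32 %r
    }

    declare <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v16i32(<4 x i32>, <16 x i32>)
    declare i32 @llvm.vector.reduce.add.v4i32(<4 x i32>)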

@@ -14264,7 +14264,7 @@ a buffer LLVM can use to perform counter increments (i.e. the lowering of
``llvm.instrprof.increment[.step]``. The address range following the counter
buffer, ``<num-counters>`` x ``sizeof(ptr)`` - sized, is expected to contain
pointers to contexts of functions called from this function ("subcontexts").
LLVM does not dereference into that memory region, just calculates GEPs.
LLVM does not dereference into that memory region, just calculates GEPs.
Collaborator

nit: unrelated whitespace change.

//===-------------- Intrinsics to perform partial reduction ---------------===//

def int_experimental_vector_partial_reduce_add : DefaultAttrsIntrinsic<[LLVMMatchType<0>],
[llvm_anyvector_ty, llvm_anyvector_ty],
Collaborator

I think adding a new matcher class to constrain the second parameter to the restrictions you defined in the langref would be helpful (same element type, width an integer multiple).

Collaborator

Given this is an experimental intrinsic is it worth implementing that plumbing?

Also, the matcher classes typically exist to allow for fewer explicit types when creating a call, which in this instance is not possible because both vector lengths are unknown (or to put another way, there's no 1-1 link between them).

Personally I think the verifier route is better, plus it allows for a more user-friendly error message.

auto SourceIndex = DAG.getVectorIdxConstant(i * ScaleFactor, DL);
auto TargetIndex = DAG.getVectorIdxConstant(i, DL);
auto ExistingValue = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, ReducedTy.getScalarType(), {Accumulator, TargetIndex});
auto N = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ReducedTy, {OpNode, SourceIndex});
Collaborator

This seems to assume that each subvector will be the same size as the smaller vector type? It works for the case we're interested in (e.g. <vscale x 16 x i32> to <vscale x 4 x i32>), but would fail if the larger type were <vscale x 8 x i32> -- you'd want to extract <vscale x 2 x i32> and reduce that. (We might never create such a partial reduction, but I think it should work correctly if we did).

@@ -7914,6 +7914,28 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
setValue(&I, Trunc);
return;
}
case Intrinsic::experimental_vector_partial_reduce_add: {
Collaborator

I think we can pass this through as an INTRINSIC_WO_CHAIN node, at least for targets that support it.

Collaborator

We need to be careful because I don't think common code exists to type legalise arbitrary INTRINSIC_WO_CHAIN calls (given their nature). Presumably we'll just follow the precedent set for get.active.lane.mask and cttz.elts when we add AArch64 specific lowering.

I can't help but think at some point we'll just want to relax the "same element type" restriction of VECREDUCE_ADD and have explicit signed and unsigned versions, like we have for ABDS/ABDU, but I guess we can see how things work out (again, much as we are for the intrinsics mentioned before).

auto Accumulator = getValue(I.getOperand(0));
unsigned ScaleFactor = FullTy.getVectorMinNumElements() / ReducedTy.getVectorMinNumElements();

for(unsigned i = 0; i < ScaleFactor; i++) {
@huntergr-arm (Collaborator) commented Jun 12, 2024

I'm now a bit concerned about the semantics of the intrinsic. In one of the test cases below (partial_reduce_add), you have the same size vector for both inputs. Applying this lowering results in the second vector being reduced and the result added to the first lane of the accumulator, with the other lanes being untouched.

I think the idea was to reduce the second input vector until it matched the size of the first, then perform a vector add of the two. If both are the same size to begin with, you just need to perform a single vector add. @paulwalker-arm can you please clarify?

The langref text will need to make the exact semantics clear.

Collaborator

I think the previous design may have been better, since it was clearly just performing the reduction of a single vector value into another (and possibly to a scalar, as @arsenm suggests). Making it a binop as well seems to make it less flexible vs. just having a separate binop afterwards. Maybe I'm missing something though...

Collaborator

The problem with the "having a separate binop" approach is that it constrains optimisation/code generation because that binop requires a very specific ordering for how elements are combined, which is the very problem the partial reduction is solving.

I think folk are stuck in a "how can we use dot instructions" mindset, whilst I'm trying to push for "what is the loosest way reductions can be represented in IR". To this point, the current suggested langref text for the intrinsic is still too strict because it gives the impression there's a defined order for how the second operand's elements are combined with the first, where there shouldn't be.

Collaborator

@huntergr-arm - Yes, the intent for "same size operands" is to emit a stock binop. This will effectively match what LoopVectorize does today and thus allow the intrinsic to be used regardless of the target rather than having to implement target specific/controlled paths within the vectorizer.
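
A sketch of that degenerate case: with equal-length operands the partial reduction is equivalent to a stock vector add (assuming the two-operand form and its mangling for matching operand types):

    define <vscale x 4 x i32> @same_size_case(<vscale x 4 x i32> %acc, <vscale x 4 x i32> %x) {
      ; Both operands already have the result type, so the reduction collapses to
      ; element-wise addition, i.e. it is equivalent to: add <vscale x 4 x i32> %acc, %x
      %r = call <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv4i32(<vscale x 4 x i32> %acc, <vscale x 4 x i32> %x)
      ret <vscale x 4 x i32> %r
    }

    declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv4i32(<vscale x 4 x i32>, <vscale x 4 x i32>)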

Collaborator

Ok, I was thrown off by the langref description. I guess then I'd like to see the default lowering changed to just extract the subvectors from the second operand and perform a vector add on to the first operand, instead of reducing the subvectors and adding the result to individual lanes. It technically meets the defined semantics (target-defined order of reduction operations), but the current codegen is pretty awful compared to a series of vector adds.
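
At the IR level, the lowering being requested amounts to extracting equal-length subvectors and folding them in with plain vector adds, roughly like the hand-written sketch below for the nxv16i32 to nxv4i32 case (using the generic llvm.vector.extract intrinsic for illustration; the actual lowering would build the equivalent SelectionDAG nodes):

    define <vscale x 4 x i32> @lower_via_vector_adds(<vscale x 4 x i32> %acc, <vscale x 16 x i32> %in) {
      ; Split the wide input into four nxv4i32 subvectors and accumulate them
      ; with ordinary vector adds instead of scalarising each result lane.
      %s0 = call <vscale x 4 x i32> @llvm.vector.extract.nxv4i32.nxv16i32(<vscale x 16 x i32> %in, i64 0)
      %s1 = call <vscale x 4 x i32> @llvm.vector.extract.nxv4i32.nxv16i32(<vscale x 16 x i32> %in, i64 4)
      %s2 = call <vscale x 4 x i32> @llvm.vector.extract.nxv4i32.nxv16i32(<vscale x 16 x i32> %in, i64 8)
      %s3 = call <vscale x 4 x i32> @llvm.vector.extract.nxv4i32.nxv16i32(<vscale x 16 x i32> %in, i64 12)
      %a0 = add <vscale x 4 x i32> %acc, %s0
      %a1 = add <vscale x 4 x i32> %a0, %s1
      %a2 = add <vscale x 4 x i32> %a1, %s2
      %a3 = add <vscale x 4 x i32> %a2, %s3
      ret <vscale x 4 x i32> %a3
    }

    declare <vscale x 4 x i32> @llvm.vector.extract.nxv4i32.nxv16i32(<vscale x 16 x i32>, i64)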

@@ -6131,6 +6131,19 @@ void Verifier::visitIntrinsicCall(Intrinsic::ID ID, CallBase &Call) {
}
break;
}
case Intrinsic::experimental_vector_partial_reduce_add: {
Collaborator

I guess my matcher class suggestion would remove the need for this code.

Collaborator

See above for my 2c.

ret <vscale x 4 x i32> %partial.reduce
}

define <vscale x 4 x i32> @partial_reduce_add_quart(<vscale x 4 x i32> %accumulator, <vscale x 16 x i32> %0) #0 {
Collaborator

This is reducing into the first 4 elements of the accumulator; it doesn't work correctly with vscale.

declare i32 @llvm.vector.reduce.add.nxv8i32(<vscale x 8 x i32>) #2

attributes #0 = { "target-features"="+fp-armv8,+fullfp16,+neon,+sve,+sve2,+v8a" }
attributes #1 = { nocallback nofree nosync nounwind willreturn memory(none) }
Collaborator

I think attributes 1 and 2 can be removed entirely, and 0 only really needs +sve2.

@arsenm
Contributor

arsenm commented Jun 12, 2024

What's the benefit to claiming this is a separate operation from the existing reduction intrinsics? Can't you just loosen the type declaration to allow vector types for the result?

ping?

Comment on lines 19229 to 19230
The '``llvm.vector.experimental.partial.reduce.add.*``' intrinsics perform an integer
``ADD`` reduction of subvectors within a vector, before adding the resulting vector
Collaborator

This should be loosened so that the way the operands are combined is unrestricted. In its broadest sense the operands are concatenated into a single vector that's then reduced down to the number of elements dictated by the result type (and hence first operand type), but there's no specification for how the reduction is distributed throughout those elements.

Arguments:
""""""""""

The first argument is the accumulator vector, or a `zeroinitializer`. The type of
Collaborator

I don't think zeroinitializer adds anything to the description, as in, there's no change of behaviour based on this specific value.

Comment on lines 19239 to 19240
into the accumulator, the width of this vector must be a positive integer multiple of
the accumulator vector/return type.
Collaborator

Somewhat contentious so feel free to ignore but when talking about the number of elements I see vectors having length not width.

For now it's worth an extra restriction for the two vector types to have matching styles (i.e. both fixed or both scalable) whilst also making it clear both vectors must have the same element type. The "style" restriction is something I think we'll want to relax in the future (AArch64's SVE2p1 feature is a possible enabling use case) but there's no point worrying about that yet.


@NickGuy-Arm
Contributor Author

I've updated the langref, addressing the comments left. However I have not yet modified the codegen.

What's the benefit to claiming this is a separate operation from the existing reduction intrinsics? Can't you just loosen the type declaration to allow vector types for the result?

ping?

Apologies @arsenm, I missed your message initially.
That would be the ideal result, yes; however at the moment it's useful, until feature parity is achieved, to keep this intrinsic in the experimental namespace. I believe Paul alluded to this further up, but the end goal is to migrate uses of reductions to use this over time, rather than attempting to retrofit vector->vector reductions into the existing reduction intrinsics.

Comment on lines 19229 to 19232
The '``llvm.vector.experimental.partial.reduce.add.*``' intrinsics reduce the
input vector down to the number of elements dictated by the result vector, and
then adds the resulting vector to the accumulator vector. The return type is a
vector type that matches the type of the accumulator vector.
Collaborator

Please re-read my previous review comments because this description is still too strict. There is no accumulator vector (that's just an artefact of how LoopVectorize will use the intrinsic). There are just two vectors whose combined elements are to be reduced.

Comment on lines 19230 to 19232
second operand vector down to the number of elements dictated by the result
vector, and then adds the resulting vector to the first operand vector. The
return type is a vector type that matches the type of the first operand vector.
Collaborator

May I suggest

concatenation of its two vector operands down to the number of elements dictated by the result. The result type....

@davemgreen
Collaborator

Was there an RFC for this anywhere yet to point to the review? The reviewer list looks a bit narrow. Thanks

@paulwalker-arm
Collaborator

Was there an RFC for this anywhere yet to point to the review? The reviewer list looks a bit narrow. Thanks

There is https://discourse.llvm.org/t/rfc-is-a-more-expressive-way-to-represent-reductions-useful/74929, which started the conversation but as you'll see nobody seemed interested and so I said Arm would take the lead. We should have referenced the PR when available though, which I've rectified.

@nikic requested review from fhahn and preames on June 18, 2024 10:24
@@ -7914,6 +7915,35 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
setValue(&I, Trunc);
return;
}
case Intrinsic::experimental_vector_partial_reduce_add: {
auto DL = getCurSDLoc();
@huntergr-arm (Collaborator) commented Jul 2, 2024

nit: It would be good to remove the 'auto' declarations and use the appropriate named types (SDValue, EVT, int, etc). I think you should already have a variable in scope for getCurSDLoc() as well (sdl, from the start of the function).

VectorType *AccTy = cast<VectorType>(Call.getArgOperand(0)->getType());
VectorType *VecTy = cast<VectorType>(Call.getArgOperand(1)->getType());

auto VecWidth = VecTy->getElementCount().getKnownMinValue();
Collaborator

nit: more autos.

NickGuy-Arm merged commit 6222c8f into llvm:main on Jul 4, 2024 (8 checks passed)
NickGuy-Arm deleted the partial-reduction-add-intrinsic branch on July 4, 2024 12:36
@llvm-ci
Collaborator

llvm-ci commented Jul 4, 2024

LLVM Buildbot has detected a new failure on builder mlir-nvidia-gcc7 running on mlir-nvidia while building llvm at step 5 "build-check-mlir-build-only".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/116/builds/875

Here is the relevant piece of the build log for the reference:

Step 5 (build-check-mlir-build-only) failure: build (failure)
...
/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.src/mlir/examples/transform/Ch4/lib/MyExtension.cpp:72:65:   required from here
/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.src/mlir/examples/transform/Ch4/lib/MyExtension.cpp:63:31: warning: suggest parentheses around ‘&&’ within ‘||’ [-Wparentheses]
   return ((llvm::isa<Tys>(t1) && llvm::isa<Tys>(t2)) || ... || false);
           ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.src/mlir/examples/transform/Ch4/lib/MyExtension.cpp:63:31: warning: suggest parentheses around ‘&&’ within ‘||’ [-Wparentheses]
/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.src/mlir/examples/transform/Ch4/lib/MyExtension.cpp:63:31: warning: suggest parentheses around ‘&&’ within ‘||’ [-Wparentheses]
654.834 [53/16/4409] Building CXX object tools/mlir/test/lib/Dialect/Test/CMakeFiles/MLIRTestToLLVMIRTranslation.dir/TestToLLVMIRTranslation.cpp.o
654.849 [52/16/4410] Linking CXX static library lib/libMLIRTestToLLVMIRTranslation.a
657.007 [51/16/4411] Linking CXX executable bin/mlir-translate
657.026 [50/16/4412] Linking CXX static library lib/libMyExtensionCh4.a
command timed out: 1200 seconds without output running [b'ninja', b'-j', b'16', b'check-mlir-build-only'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=3830.141216

kbluck pushed a commit to kbluck/llvm-project that referenced this pull request Jul 6, 2024
Adds the llvm.experimental.partial.reduce.add.* overloaded intrinsic,
this intrinsic represents add reductions that result in a narrower
vector.
SamTebbs33 added a commit that referenced this pull request Dec 19, 2024
Following on from #94499, this
patch adds support to the Loop Vectorizer to emit the partial reduction
intrinsics where they may be beneficial for the target.

---------

Co-authored-by: Samuel Tebbs <samuel.tebbs@arm.com>
github-actions bot pushed a commit to arm/arm-toolchain that referenced this pull request Jan 10, 2025
Following on from llvm/llvm-project#94499, this
patch adds support to the Loop Vectorizer to emit the partial reduction
intrinsics where they may be beneficial for the target.

---------

Co-authored-by: Samuel Tebbs <samuel.tebbs@arm.com>