AMDGPU: Custom lower 32-bit element shuffles #123711

arsenm · 2025-01-21T08:36:13Z

This is so we can try to make use of v_pk_mov_b32 when available.
Note this currently has little observable effect: the combiner
undoes the common extract-of-shuffle pattern. The lack of test
changes should demonstrate that this change is minimally correct.

We should probably try to make better use of wider extracts in
even-aligned cases, but I'm trying to avoid some really ugly
regalloc regressions in some MFMA tests. The DAG scheduler ends
up doing a worse job if we use vector extracts, resulting
in failure to do 3-address conversion of MFMAs.
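
As a rough illustration of the per-pair decision this change makes, here is a minimal standalone sketch. It is not the LLVM code: it uses `std::vector<int>` in place of `ArrayRef<int>`, and `PairKind`, `classifyPairs`, `UseConsecutiveExtract` and `PackShuffleIsLegal` are hypothetical names introduced only for this example. The two predicates follow the conditions used by `elementPairIsContiguous` and `elementPairIsOddToEven` in the patch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical labels for how one pair of result lanes gets lowered.
// These names are illustrative only and do not exist in LLVM.
enum class PairKind { ContiguousExtract, OddToEvenShuffle, Scalarized };

// Same condition as elementPairIsContiguous in the patch context: the pair
// reads an even-aligned source pair in order, e.g. <2,3>.
static bool pairIsContiguous(const std::vector<int> &Mask, int Elt) {
  assert(Elt % 2 == 0);
  return Mask[Elt + 1] == Mask[Elt] + 1 && (Mask[Elt] % 2 == 0);
}

// Same condition as elementPairIsOddToEven: the first lane reads an odd
// source element, the second an even one, and neither lane is undef (-1).
static bool pairIsOddToEven(const std::vector<int> &Mask, int Elt) {
  assert(Elt % 2 == 0);
  return Mask[Elt] >= 0 && Mask[Elt + 1] >= 0 && (Mask[Elt] & 1) &&
         !(Mask[Elt + 1] & 1);
}

// Walk the mask two lanes at a time and record which path each pair takes.
// UseConsecutiveExtract stands in for "elements are 16-bit";
// PackShuffleIsLegal stands in for the isOperationLegal query (assumption).
static std::vector<PairKind> classifyPairs(const std::vector<int> &Mask,
                                           bool UseConsecutiveExtract,
                                           bool PackShuffleIsLegal) {
  std::vector<PairKind> Kinds;
  for (int I = 0, N = static_cast<int>(Mask.size()); I != N; I += 2) {
    if (UseConsecutiveExtract && pairIsContiguous(Mask, I))
      Kinds.push_back(PairKind::ContiguousExtract);
    else if (pairIsOddToEven(Mask, I) && PackShuffleIsLegal)
      Kinds.push_back(PairKind::OddToEvenShuffle);
    else
      Kinds.push_back(PairKind::Scalarized);
  }
  return Kinds;
}
```

Under these assumptions, a 32-bit element mask such as `<0,1,6,7>` takes the scalarized path for both pairs, which fits the description's note that wider extracts are deliberately not used for 32-bit elements yet.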

llvmbot · 2025-01-21T08:36:52Z

@llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Full diff: https://github.com/llvm/llvm-project/pull/123711.diff

1 Files Affected:

(modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+80-5)

diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 1aeca7f370aa1b..b632c50dae0e35 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -419,8 +419,9 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
   }
 
   setOperationAction(ISD::VECTOR_SHUFFLE,
-                     {MVT::v8i32, MVT::v8f32, MVT::v16i32, MVT::v16f32},
-                     Expand);
+                     {MVT::v4i32, MVT::v4f32, MVT::v8i32, MVT::v8f32,
+                      MVT::v16i32, MVT::v16f32, MVT::v32i32, MVT::v32f32},
+                     Custom);
 
   if (Subtarget->hasPkMovB32()) {
     // TODO: 16-bit element vectors should be legal with even aligned elements.
@@ -7589,15 +7590,38 @@ static bool elementPairIsContiguous(ArrayRef<int> Mask, int Elt) {
   return Mask[Elt + 1] == Mask[Elt] + 1 && (Mask[Elt] % 2 == 0);
 }
 
+static bool elementPairIsOddToEven(ArrayRef<int> Mask, int Elt) {
+  assert(Elt % 2 == 0);
+  return Mask[Elt] >= 0 && Mask[Elt + 1] >= 0 && (Mask[Elt] & 1) &&
+         !(Mask[Elt + 1] & 1);
+}
+
 SDValue SITargetLowering::lowerVECTOR_SHUFFLE(SDValue Op,
                                               SelectionDAG &DAG) const {
   SDLoc SL(Op);
   EVT ResultVT = Op.getValueType();
   ShuffleVectorSDNode *SVN = cast<ShuffleVectorSDNode>(Op);
   MVT EltVT = ResultVT.getVectorElementType().getSimpleVT();
-  MVT PackVT = MVT::getVectorVT(EltVT, 2);
+  const int NewSrcNumElts = 2;
+  MVT PackVT = MVT::getVectorVT(EltVT, NewSrcNumElts);
   int SrcNumElts = Op.getOperand(0).getValueType().getVectorNumElements();
 
+  // Break up the shuffle into registers sized pieces.
+  //
+  // We're trying to form sub-shuffles that the register allocation pipeline
+  // won't be able to figure out, like how to use v_pk_mov_b32 to do a register
+  // blend or 16-bit op_sel. It should be able to figure out how to reassemble a
+  // pair of copies into a consecutive register copy, so use the ordinary
+  // extract_vector_elt lowering unless we can use the shuffle.
+  //
+  // TODO: This is a bit of hack, and we should probably always use
+  // extract_subvector for the largest possible subvector we can (or at least
+  // use it for PackVT aligned pieces). However we have worse support for
+  // combines on them don't directly treat extract_subvector / insert_subvector
+  // as legal. The DAG scheduler also ends up doing a worse job with the
+  // extract_subvectors.
+  const bool ShouldUseConsecutiveExtract = EltVT.getSizeInBits() == 16;
+
   // vector_shuffle <0,1,6,7> lhs, rhs
   // -> concat_vectors (extract_subvector lhs, 0), (extract_subvector rhs, 2)
   //
@@ -7608,9 +7632,18 @@ SDValue SITargetLowering::lowerVECTOR_SHUFFLE(SDValue Op,
   // -> concat_vectors (extract_subvector rhs, 2), (extract_subvector lhs, 0)
 
   // Avoid scalarizing when both halves are reading from consecutive elements.
-  SmallVector<SDValue, 4> Pieces;
+
+  // If we're treating 2 element shuffles as legal, also create odd-to-even
+  // shuffles of neighboring pairs.
+  //
+  // vector_shuffle <3,2,7,6> lhs, rhs
+  //  -> concat_vectors vector_shuffle <1, 0> (extract_subvector lhs, 0)
+  //                    vector_shuffle <1, 0> (extract_subvector rhs, 2)
+
+  SmallVector<SDValue, 16> Pieces;
   for (int I = 0, N = ResultVT.getVectorNumElements(); I != N; I += 2) {
-    if (elementPairIsContiguous(SVN->getMask(), I)) {
+    if (ShouldUseConsecutiveExtract &&
+        elementPairIsContiguous(SVN->getMask(), I)) {
       const int Idx = SVN->getMaskElt(I);
       int VecIdx = Idx < SrcNumElts ? 0 : 1;
       int EltIdx = Idx < SrcNumElts ? Idx : Idx - SrcNumElts;
@@ -7618,6 +7651,48 @@ SDValue SITargetLowering::lowerVECTOR_SHUFFLE(SDValue Op,
                                    SVN->getOperand(VecIdx),
                                    DAG.getConstant(EltIdx, SL, MVT::i32));
       Pieces.push_back(SubVec);
+    } else if (elementPairIsOddToEven(SVN->getMask(), I) &&
+               isOperationLegal(ISD::VECTOR_SHUFFLE, PackVT)) {
+      int Idx0 = SVN->getMaskElt(I);
+      int Idx1 = SVN->getMaskElt(I + 1);
+
+      SDValue SrcOp0 = SVN->getOperand(0);
+      SDValue SrcOp1 = SrcOp0;
+      if (Idx0 >= SrcNumElts) {
+        SrcOp0 = SVN->getOperand(1);
+        Idx0 -= SrcNumElts;
+      }
+
+      if (Idx1 >= SrcNumElts) {
+        SrcOp1 = SVN->getOperand(1);
+        Idx1 -= SrcNumElts;
+      }
+
+      int AlignedIdx0 = Idx0 & ~(NewSrcNumElts - 1);
+      int AlignedIdx1 = Idx1 & ~(NewSrcNumElts - 1);
+
+      // Extract nearest even aligned piece.
+      SDValue SubVec0 = DAG.getNode(ISD::EXTRACT_SUBVECTOR, SL, PackVT, SrcOp0,
+                                    DAG.getConstant(AlignedIdx0, SL, MVT::i32));
+      SDValue SubVec1 = DAG.getNode(ISD::EXTRACT_SUBVECTOR, SL, PackVT, SrcOp1,
+                                    DAG.getConstant(AlignedIdx1, SL, MVT::i32));
+
+      int NewMaskIdx0 = Idx0 - AlignedIdx0;
+      int NewMaskIdx1 = Idx1 - AlignedIdx1;
+
+      SDValue Result0 = SubVec0;
+      SDValue Result1 = SubVec0;
+
+      if (SubVec0 != SubVec1) {
+        NewMaskIdx1 += NewSrcNumElts;
+        Result1 = SubVec1;
+      } else {
+        Result1 = DAG.getUNDEF(PackVT);
+      }
+
+      SDValue Shuf = DAG.getVectorShuffle(PackVT, SL, Result0, Result1,
+                                          {NewMaskIdx0, NewMaskIdx1});
+      Pieces.push_back(Shuf);
     } else {
       const int Idx0 = SVN->getMaskElt(I);
       const int Idx1 = SVN->getMaskElt(I + 1);
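
The index arithmetic in the odd-to-even branch above can be traced with plain integers. The following is a minimal sketch under stated assumptions, not the patch itself: `PairShuffle` and `lowerOddToEvenPair` are hypothetical names, source operands are represented as 0 (lhs) or 1 (rhs) rather than `SDValue`s, and the patch's `SubVec0 != SubVec1` comparison is approximated by comparing operand number and aligned index.

```cpp
#include <cassert>
#include <vector>

// Hypothetical plain-data description of one lowered pair (not an LLVM type).
struct PairShuffle {
  int SrcOp0, SrcOp1;           // 0 = lhs, 1 = rhs
  int AlignedIdx0, AlignedIdx1; // start of each extracted 2-element subvector
  int NewMask0, NewMask1;       // mask of the resulting 2-element shuffle
};

static PairShuffle lowerOddToEvenPair(const std::vector<int> &Mask, int I,
                                      int SrcNumElts) {
  const int NewSrcNumElts = 2;
  int Idx0 = Mask[I];
  int Idx1 = Mask[I + 1];

  // Indices at or past SrcNumElts refer to the second shuffle operand.
  int SrcOp0 = 0, SrcOp1 = 0;
  if (Idx0 >= SrcNumElts) {
    SrcOp0 = 1;
    Idx0 -= SrcNumElts;
  }
  if (Idx1 >= SrcNumElts) {
    SrcOp1 = 1;
    Idx1 -= SrcNumElts;
  }

  // Round each index down to the nearest even element.
  int AlignedIdx0 = Idx0 & ~(NewSrcNumElts - 1);
  int AlignedIdx1 = Idx1 & ~(NewSrcNumElts - 1);

  int NewMask0 = Idx0 - AlignedIdx0;
  int NewMask1 = Idx1 - AlignedIdx1;

  // Assumption: "same subvector" means same operand and same aligned index.
  // When the two subvectors differ, the second lane indexes into the second
  // shuffle input, so its mask value is offset by the subvector width.
  bool SameSubVec = SrcOp0 == SrcOp1 && AlignedIdx0 == AlignedIdx1;
  if (!SameSubVec)
    NewMask1 += NewSrcNumElts;

  return {SrcOp0, SrcOp1, AlignedIdx0, AlignedIdx1, NewMask0, NewMask1};
}
```

For the mask `<3,2,7,6>` with four-element sources, this sketch yields the two-element mask `<1,0>` for both pairs, reading from element 2 of lhs and element 2 of rhs respectively. A pair such as `<1,4>` spans both operands and yields `<1,2>` over the subvectors starting at element 0 of each.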

rampitec · 2025-01-21T19:16:52Z

Is there any way at all to test it?

arsenm · 2025-01-22T01:03:57Z

> Is there any way at all to test it?

Many shuffle tests were added in 7786266, this shows they are a no-op. The expected test changes from this are in #123711

rampitec · 2025-01-22T01:07:08Z

> > Is there any way at all to test it?
>
> Many shuffle tests were added in 7786266, this shows they are a no-op. The expected test changes from this are in #123711

OK, I see. LGTM.

arsenm · 2025-01-28T04:15:18Z

Merge activity

Jan 27, 11:15 PM EST: A user started a stack merge that includes this pull request via Graphite.
Jan 27, 11:17 PM EST: A user merged this pull request with Graphite.

arsenm mentioned this pull request Jan 21, 2025

DAG: Fix vector_shuffle -> splat fold defining undef lanes #123596

Merged

This was referenced Jan 21, 2025

AMDGPU: Make vector_shuffle legal for v2i32 with v_pk_mov_b32 #123684

Merged

DAG: Avoid breaking legal vector_shuffle with multiple uses #123712

Merged

arsenm added the backend:AMDGPU label Jan 21, 2025 — with Graphite App

arsenm requested review from jayfoad, jrbyrnes, kerbowa, rampitec and shiltian January 21, 2025 08:36

arsenm marked this pull request as ready for review January 21, 2025 08:36

arsenm force-pushed the users/arsenm/amdgpu/make-shufflevector-v2i32-legal-pk-mov-b32 branch from c5caf56 to d75be50 Compare January 21, 2025 14:58

arsenm force-pushed the users/arsenm/custom-lower-vector-shuffle-i32-elements branch from 1c0f60a to 6434af5 Compare January 21, 2025 14:59

arsenm force-pushed the users/arsenm/amdgpu/make-shufflevector-v2i32-legal-pk-mov-b32 branch from d75be50 to 6c86ad2 Compare January 21, 2025 16:58

arsenm force-pushed the users/arsenm/custom-lower-vector-shuffle-i32-elements branch from 6434af5 to 4e1bdb4 Compare January 21, 2025 17:00

rampitec approved these changes Jan 22, 2025

arsenm force-pushed the users/arsenm/amdgpu/make-shufflevector-v2i32-legal-pk-mov-b32 branch from 6c86ad2 to 84c8a20 Compare January 22, 2025 03:11

arsenm force-pushed the users/arsenm/custom-lower-vector-shuffle-i32-elements branch from 4e1bdb4 to 0a3791a Compare January 22, 2025 03:11

Base automatically changed from users/arsenm/amdgpu/make-shufflevector-v2i32-legal-pk-mov-b32 to main January 23, 2025 13:58

arsenm force-pushed the users/arsenm/custom-lower-vector-shuffle-i32-elements branch from 0a3791a to 1c68965 Compare January 23, 2025 14:00

arsenm force-pushed the users/arsenm/custom-lower-vector-shuffle-i32-elements branch from 1c68965 to 6851cfb Compare January 28, 2025 01:34

arsenm merged commit cc97653 into main Jan 28, 2025
6 of 7 checks passed

arsenm deleted the users/arsenm/custom-lower-vector-shuffle-i32-elements branch January 28, 2025 04:17
