Skip to content

[AMDGPU] add s_bitset[10]_b32 optimization for shl+[or,andn2] pattern #134155

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

BaoshanPang
Copy link
Contributor

This PR is for #130245

@llvmbot
Copy link
Member

llvmbot commented Apr 2, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Baoshan (BaoshanPang)

Changes

This PR is for #130245


Patch is 32.54 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/134155.diff

6 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp (+53-2)
  • (added) llvm/test/CodeGen/AMDGPU/130245.ll (+23)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll (+9-18)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+24-48)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_raw_buffer.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_struct_buffer.ll (+6-12)
diff --git a/llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp b/llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp
index 73343e1c80f33..b0b0e46066bc7 100644
--- a/llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp
+++ b/llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp
@@ -15,6 +15,7 @@
 #include "Utils/AMDGPUBaseInfo.h"
 #include "llvm/ADT/Statistic.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/MachineOperand.h"
 
 #define DEBUG_TYPE "si-shrink-instructions"
 
@@ -534,6 +535,56 @@ bool SIShrinkInstructions::shrinkScalarLogicOp(MachineInstr &MI) const {
   MachineOperand *SrcReg = Src0;
   MachineOperand *SrcImm = Src1;
 
+  //  case 1:
+  //  From:
+  //  s_lshl_b32 s1, 1, s1
+  //  s_or_b32 s0, s0, s1
+  //  To:
+  //  s_bitset1_b32 s0, s1
+  //
+  //  case 2:
+  //  s_lshl_b32 s1, 1, s1
+  //  s_andn2_b32 s0, s0, s1
+  //  To:
+  //  s_bitset0_b32 s0, s1
+  if ((MI.getOpcode() == AMDGPU::S_OR_B32 ||
+       MI.getOpcode() == AMDGPU::S_ANDN2_B32) &&
+      Src0->isReg() && Src1->isReg() && Dest->getReg() == Src0->getReg()) {
+    MachineInstr *Shl = MRI->getUniqueVRegDef(Src1->getReg());
+    if (Shl && Shl->getOpcode() == AMDGPU::S_LSHL_B32) {
+      MachineOperand &ShlDest = Shl->getOperand(0);
+      MachineOperand &ShlSrc0 = Shl->getOperand(1);
+      MachineOperand &ShlSrc1 = Shl->getOperand(2);
+      const bool IsUndef = SrcReg->isUndef();
+      const bool IsKill = SrcReg->isKill();
+      if (MI.getParent() == Shl->getParent() &&
+          MRI->hasAtMostUserInstrs(ShlDest.getReg(), 2) && ShlSrc0.isImm() &&
+          ShlSrc0.getImm() == 1 && ShlSrc1.isReg()) {
+        unsigned int NewOpc = (MI.getOpcode() == AMDGPU::S_OR_B32)
+                                  ? AMDGPU::S_BITSET1_B32
+                                  : AMDGPU::S_BITSET0_B32;
+        MI.setDesc(TII->get(NewOpc));
+        Src0->setReg(ShlSrc1.getReg());
+        MI.getOperand(2).ChangeToRegister(Dest->getReg(), /*IsDef*/ false,
+                                          /*isImp*/ false, IsKill,
+                                          /*isDead*/ false, IsUndef);
+        MI.tieOperands(0, 2);
+        for (auto IE = MI.getIterator(), I = std::next(Shl->getIterator());
+             I != IE; ++I) {
+          for (MachineOperand &MO : I->operands()) {
+            if (MO.isReg() && MO.getReg() == ShlSrc1.getReg()) {
+              if (MO.isKill())
+                MO.setIsKill(false);
+            }
+          }
+        }
+        Shl->eraseFromParent();
+        return true;
+      }
+    }
+    return false;
+  }
+
   if (!SrcImm->isImm() ||
       AMDGPU::isInlinableLiteral32(SrcImm->getImm(), ST->hasInv2PiInlineImm()))
     return false;
@@ -577,8 +628,7 @@ bool SIShrinkInstructions::shrinkScalarLogicOp(MachineInstr &MI) const {
       const bool IsUndef = SrcReg->isUndef();
       const bool IsKill = SrcReg->isKill();
       MI.setDesc(TII->get(Opc));
-      if (Opc == AMDGPU::S_BITSET0_B32 ||
-          Opc == AMDGPU::S_BITSET1_B32) {
+      if (Opc == AMDGPU::S_BITSET0_B32 || Opc == AMDGPU::S_BITSET1_B32) {
         Src0->ChangeToImmediate(NewImm);
         // Remove the immediate and add the tied input.
         MI.getOperand(2).ChangeToRegister(Dest->getReg(), /*IsDef*/ false,
@@ -945,6 +995,7 @@ bool SIShrinkInstructions::run(MachineFunction &MF) {
 
       // Shrink scalar logic operations.
       if (MI.getOpcode() == AMDGPU::S_AND_B32 ||
+          MI.getOpcode() == AMDGPU::S_ANDN2_B32 ||
           MI.getOpcode() == AMDGPU::S_OR_B32 ||
           MI.getOpcode() == AMDGPU::S_XOR_B32) {
         if (shrinkScalarLogicOp(MI))
diff --git a/llvm/test/CodeGen/AMDGPU/130245.ll b/llvm/test/CodeGen/AMDGPU/130245.ll
new file mode 100644
index 0000000000000..54e7e0f022f59
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/130245.ll
@@ -0,0 +1,23 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn-amd-amdpal -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefix=SI %s
+
+define amdgpu_ps i32 @s_bitset1_b32(i32 inreg %src0, i32 inreg %bit.index) {
+; SI-LABEL: s_bitset1_b32:
+; SI:       ; %bb.0:
+; SI-NEXT:    s_bitset1_b32 s0, s1
+; SI-NEXT:    ; return to shader part epilog
+  %set.bit.at.index = shl i32 1, %bit.index
+  %or = or i32 %src0, %set.bit.at.index
+  ret i32 %or
+}
+
+define amdgpu_ps i32 @s_bitset0_b32(i32 inreg %src0, i32 inreg %bit.index) {
+; SI-LABEL: s_bitset0_b32:
+; SI:       ; %bb.0:
+; SI-NEXT:    s_bitset0_b32 s0, s1
+; SI-NEXT:    ; return to shader part epilog
+  %set.bit.at.index = shl i32 1, %bit.index
+  %other.bits = xor i32 %set.bit.at.index, -1
+  %and = and i32 %src0, %other.bits
+  ret i32 %and
+}
diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll
index 8319e112f526e..bada7eb3eba93 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll
@@ -728,9 +728,8 @@ define amdgpu_kernel void @add_i32_varying_vdata(ptr addrspace(1) %out, ptr addr
 ; GFX10W32-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX10W32-NEXT:    s_ff1_i32_b32 s2, s1
 ; GFX10W32-NEXT:    v_readlane_b32 s3, v0, s2
-; GFX10W32-NEXT:    s_lshl_b32 s6, 1, s2
 ; GFX10W32-NEXT:    v_writelane_b32 v1, s0, s2
-; GFX10W32-NEXT:    s_andn2_b32 s1, s1, s6
+; GFX10W32-NEXT:    s_bitset0_b32 s1, s2
 ; GFX10W32-NEXT:    s_add_i32 s0, s0, s3
 ; GFX10W32-NEXT:    s_cmp_lg_u32 s1, 0
 ; GFX10W32-NEXT:    s_cbranch_scc1 .LBB2_1
@@ -813,9 +812,8 @@ define amdgpu_kernel void @add_i32_varying_vdata(ptr addrspace(1) %out, ptr addr
 ; GFX11W32-NEXT:    s_ctz_i32_b32 s2, s1
 ; GFX11W32-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
 ; GFX11W32-NEXT:    v_readlane_b32 s3, v1, s2
-; GFX11W32-NEXT:    s_lshl_b32 s6, 1, s2
 ; GFX11W32-NEXT:    v_writelane_b32 v0, s0, s2
-; GFX11W32-NEXT:    s_and_not1_b32 s1, s1, s6
+; GFX11W32-NEXT:    s_bitset0_b32 s1, s2
 ; GFX11W32-NEXT:    s_add_i32 s0, s0, s3
 ; GFX11W32-NEXT:    s_cmp_lg_u32 s1, 0
 ; GFX11W32-NEXT:    s_cbranch_scc1 .LBB2_1
@@ -898,9 +896,8 @@ define amdgpu_kernel void @add_i32_varying_vdata(ptr addrspace(1) %out, ptr addr
 ; GFX12W32-NEXT:    s_ctz_i32_b32 s2, s1
 ; GFX12W32-NEXT:    s_wait_alu 0xfffe
 ; GFX12W32-NEXT:    v_readlane_b32 s3, v1, s2
-; GFX12W32-NEXT:    s_lshl_b32 s6, 1, s2
 ; GFX12W32-NEXT:    v_writelane_b32 v0, s0, s2
-; GFX12W32-NEXT:    s_and_not1_b32 s1, s1, s6
+; GFX12W32-NEXT:    s_bitset0_b32 s1, s2
 ; GFX12W32-NEXT:    s_add_co_i32 s0, s0, s3
 ; GFX12W32-NEXT:    s_wait_alu 0xfffe
 ; GFX12W32-NEXT:    s_cmp_lg_u32 s1, 0
@@ -1120,9 +1117,8 @@ define amdgpu_kernel void @struct_add_i32_varying_vdata(ptr addrspace(1) %out, p
 ; GFX10W32-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX10W32-NEXT:    s_ff1_i32_b32 s2, s1
 ; GFX10W32-NEXT:    v_readlane_b32 s3, v0, s2
-; GFX10W32-NEXT:    s_lshl_b32 s6, 1, s2
 ; GFX10W32-NEXT:    v_writelane_b32 v1, s0, s2
-; GFX10W32-NEXT:    s_andn2_b32 s1, s1, s6
+; GFX10W32-NEXT:    s_bitset0_b32 s1, s2
 ; GFX10W32-NEXT:    s_add_i32 s0, s0, s3
 ; GFX10W32-NEXT:    s_cmp_lg_u32 s1, 0
 ; GFX10W32-NEXT:    s_cbranch_scc1 .LBB3_1
@@ -1211,9 +1207,8 @@ define amdgpu_kernel void @struct_add_i32_varying_vdata(ptr addrspace(1) %out, p
 ; GFX11W32-NEXT:    s_ctz_i32_b32 s2, s1
 ; GFX11W32-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
 ; GFX11W32-NEXT:    v_readlane_b32 s3, v1, s2
-; GFX11W32-NEXT:    s_lshl_b32 s6, 1, s2
 ; GFX11W32-NEXT:    v_writelane_b32 v0, s0, s2
-; GFX11W32-NEXT:    s_and_not1_b32 s1, s1, s6
+; GFX11W32-NEXT:    s_bitset0_b32 s1, s2
 ; GFX11W32-NEXT:    s_add_i32 s0, s0, s3
 ; GFX11W32-NEXT:    s_cmp_lg_u32 s1, 0
 ; GFX11W32-NEXT:    s_cbranch_scc1 .LBB3_1
@@ -1301,9 +1296,8 @@ define amdgpu_kernel void @struct_add_i32_varying_vdata(ptr addrspace(1) %out, p
 ; GFX12W32-NEXT:    s_ctz_i32_b32 s2, s1
 ; GFX12W32-NEXT:    s_wait_alu 0xfffe
 ; GFX12W32-NEXT:    v_readlane_b32 s3, v1, s2
-; GFX12W32-NEXT:    s_lshl_b32 s6, 1, s2
 ; GFX12W32-NEXT:    v_writelane_b32 v0, s0, s2
-; GFX12W32-NEXT:    s_and_not1_b32 s1, s1, s6
+; GFX12W32-NEXT:    s_bitset0_b32 s1, s2
 ; GFX12W32-NEXT:    s_add_co_i32 s0, s0, s3
 ; GFX12W32-NEXT:    s_wait_alu 0xfffe
 ; GFX12W32-NEXT:    s_cmp_lg_u32 s1, 0
@@ -2183,9 +2177,8 @@ define amdgpu_kernel void @sub_i32_varying_vdata(ptr addrspace(1) %out, ptr addr
 ; GFX10W32-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX10W32-NEXT:    s_ff1_i32_b32 s2, s1
 ; GFX10W32-NEXT:    v_readlane_b32 s3, v0, s2
-; GFX10W32-NEXT:    s_lshl_b32 s6, 1, s2
 ; GFX10W32-NEXT:    v_writelane_b32 v1, s0, s2
-; GFX10W32-NEXT:    s_andn2_b32 s1, s1, s6
+; GFX10W32-NEXT:    s_bitset0_b32 s1, s2
 ; GFX10W32-NEXT:    s_add_i32 s0, s0, s3
 ; GFX10W32-NEXT:    s_cmp_lg_u32 s1, 0
 ; GFX10W32-NEXT:    s_cbranch_scc1 .LBB7_1
@@ -2268,9 +2261,8 @@ define amdgpu_kernel void @sub_i32_varying_vdata(ptr addrspace(1) %out, ptr addr
 ; GFX11W32-NEXT:    s_ctz_i32_b32 s2, s1
 ; GFX11W32-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
 ; GFX11W32-NEXT:    v_readlane_b32 s3, v1, s2
-; GFX11W32-NEXT:    s_lshl_b32 s6, 1, s2
 ; GFX11W32-NEXT:    v_writelane_b32 v0, s0, s2
-; GFX11W32-NEXT:    s_and_not1_b32 s1, s1, s6
+; GFX11W32-NEXT:    s_bitset0_b32 s1, s2
 ; GFX11W32-NEXT:    s_add_i32 s0, s0, s3
 ; GFX11W32-NEXT:    s_cmp_lg_u32 s1, 0
 ; GFX11W32-NEXT:    s_cbranch_scc1 .LBB7_1
@@ -2354,9 +2346,8 @@ define amdgpu_kernel void @sub_i32_varying_vdata(ptr addrspace(1) %out, ptr addr
 ; GFX12W32-NEXT:    s_ctz_i32_b32 s2, s1
 ; GFX12W32-NEXT:    s_wait_alu 0xfffe
 ; GFX12W32-NEXT:    v_readlane_b32 s3, v1, s2
-; GFX12W32-NEXT:    s_lshl_b32 s6, 1, s2
 ; GFX12W32-NEXT:    v_writelane_b32 v0, s0, s2
-; GFX12W32-NEXT:    s_and_not1_b32 s1, s1, s6
+; GFX12W32-NEXT:    s_bitset0_b32 s1, s2
 ; GFX12W32-NEXT:    s_add_co_i32 s0, s0, s3
 ; GFX12W32-NEXT:    s_wait_alu 0xfffe
 ; GFX12W32-NEXT:    s_cmp_lg_u32 s1, 0
diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
index 9775a37276dfd..15ad3c03a2b9c 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
@@ -620,9 +620,8 @@ define amdgpu_kernel void @add_i32_varying(ptr addrspace(1) %out) {
 ; GFX1032_ITERATIVE-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX1032_ITERATIVE-NEXT:    s_ff1_i32_b32 s2, s1
 ; GFX1032_ITERATIVE-NEXT:    v_readlane_b32 s3, v0, s2
-; GFX1032_ITERATIVE-NEXT:    s_lshl_b32 s6, 1, s2
 ; GFX1032_ITERATIVE-NEXT:    v_writelane_b32 v1, s0, s2
-; GFX1032_ITERATIVE-NEXT:    s_andn2_b32 s1, s1, s6
+; GFX1032_ITERATIVE-NEXT:    s_bitset0_b32 s1, s2
 ; GFX1032_ITERATIVE-NEXT:    s_add_i32 s0, s0, s3
 ; GFX1032_ITERATIVE-NEXT:    s_cmp_lg_u32 s1, 0
 ; GFX1032_ITERATIVE-NEXT:    s_cbranch_scc1 .LBB2_1
@@ -706,9 +705,8 @@ define amdgpu_kernel void @add_i32_varying(ptr addrspace(1) %out) {
 ; GFX1132_ITERATIVE-NEXT:    s_ctz_i32_b32 s2, s1
 ; GFX1132_ITERATIVE-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
 ; GFX1132_ITERATIVE-NEXT:    v_readlane_b32 s3, v1, s2
-; GFX1132_ITERATIVE-NEXT:    s_lshl_b32 s6, 1, s2
 ; GFX1132_ITERATIVE-NEXT:    v_writelane_b32 v0, s0, s2
-; GFX1132_ITERATIVE-NEXT:    s_and_not1_b32 s1, s1, s6
+; GFX1132_ITERATIVE-NEXT:    s_bitset0_b32 s1, s2
 ; GFX1132_ITERATIVE-NEXT:    s_add_i32 s0, s0, s3
 ; GFX1132_ITERATIVE-NEXT:    s_cmp_lg_u32 s1, 0
 ; GFX1132_ITERATIVE-NEXT:    s_cbranch_scc1 .LBB2_1
@@ -4027,9 +4025,8 @@ define amdgpu_kernel void @sub_i32_varying(ptr addrspace(1) %out) {
 ; GFX1032_ITERATIVE-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX1032_ITERATIVE-NEXT:    s_ff1_i32_b32 s2, s1
 ; GFX1032_ITERATIVE-NEXT:    v_readlane_b32 s3, v0, s2
-; GFX1032_ITERATIVE-NEXT:    s_lshl_b32 s6, 1, s2
 ; GFX1032_ITERATIVE-NEXT:    v_writelane_b32 v1, s0, s2
-; GFX1032_ITERATIVE-NEXT:    s_andn2_b32 s1, s1, s6
+; GFX1032_ITERATIVE-NEXT:    s_bitset0_b32 s1, s2
 ; GFX1032_ITERATIVE-NEXT:    s_add_i32 s0, s0, s3
 ; GFX1032_ITERATIVE-NEXT:    s_cmp_lg_u32 s1, 0
 ; GFX1032_ITERATIVE-NEXT:    s_cbranch_scc1 .LBB10_1
@@ -4113,9 +4110,8 @@ define amdgpu_kernel void @sub_i32_varying(ptr addrspace(1) %out) {
 ; GFX1132_ITERATIVE-NEXT:    s_ctz_i32_b32 s2, s1
 ; GFX1132_ITERATIVE-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
 ; GFX1132_ITERATIVE-NEXT:    v_readlane_b32 s3, v1, s2
-; GFX1132_ITERATIVE-NEXT:    s_lshl_b32 s6, 1, s2
 ; GFX1132_ITERATIVE-NEXT:    v_writelane_b32 v0, s0, s2
-; GFX1132_ITERATIVE-NEXT:    s_and_not1_b32 s1, s1, s6
+; GFX1132_ITERATIVE-NEXT:    s_bitset0_b32 s1, s2
 ; GFX1132_ITERATIVE-NEXT:    s_add_i32 s0, s0, s3
 ; GFX1132_ITERATIVE-NEXT:    s_cmp_lg_u32 s1, 0
 ; GFX1132_ITERATIVE-NEXT:    s_cbranch_scc1 .LBB10_1
@@ -6436,9 +6432,8 @@ define amdgpu_kernel void @and_i32_varying(ptr addrspace(1) %out) {
 ; GFX1032_ITERATIVE-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX1032_ITERATIVE-NEXT:    s_ff1_i32_b32 s2, s1
 ; GFX1032_ITERATIVE-NEXT:    v_readlane_b32 s3, v0, s2
-; GFX1032_ITERATIVE-NEXT:    s_lshl_b32 s6, 1, s2
 ; GFX1032_ITERATIVE-NEXT:    v_writelane_b32 v1, s0, s2
-; GFX1032_ITERATIVE-NEXT:    s_andn2_b32 s1, s1, s6
+; GFX1032_ITERATIVE-NEXT:    s_bitset0_b32 s1, s2
 ; GFX1032_ITERATIVE-NEXT:    s_and_b32 s0, s0, s3
 ; GFX1032_ITERATIVE-NEXT:    s_cmp_lg_u32 s1, 0
 ; GFX1032_ITERATIVE-NEXT:    s_cbranch_scc1 .LBB15_1
@@ -6522,9 +6517,8 @@ define amdgpu_kernel void @and_i32_varying(ptr addrspace(1) %out) {
 ; GFX1132_ITERATIVE-NEXT:    s_ctz_i32_b32 s2, s1
 ; GFX1132_ITERATIVE-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
 ; GFX1132_ITERATIVE-NEXT:    v_readlane_b32 s3, v1, s2
-; GFX1132_ITERATIVE-NEXT:    s_lshl_b32 s6, 1, s2
 ; GFX1132_ITERATIVE-NEXT:    v_writelane_b32 v0, s0, s2
-; GFX1132_ITERATIVE-NEXT:    s_and_not1_b32 s1, s1, s6
+; GFX1132_ITERATIVE-NEXT:    s_bitset0_b32 s1, s2
 ; GFX1132_ITERATIVE-NEXT:    s_and_b32 s0, s0, s3
 ; GFX1132_ITERATIVE-NEXT:    s_cmp_lg_u32 s1, 0
 ; GFX1132_ITERATIVE-NEXT:    s_cbranch_scc1 .LBB15_1
@@ -7068,10 +7062,9 @@ define amdgpu_kernel void @and_i64_varying(ptr addrspace(1) %out) {
 ; GFX1032_ITERATIVE-NEXT:    s_ff1_i32_b32 s3, s2
 ; GFX1032_ITERATIVE-NEXT:    v_readlane_b32 s7, v3, s3
 ; GFX1032_ITERATIVE-NEXT:    v_readlane_b32 s6, v0, s3
-; GFX1032_ITERATIVE-NEXT:    s_lshl_b32 s8, 1, s3
 ; GFX1032_ITERATIVE-NEXT:    v_writelane_b32 v2, s1, s3
 ; GFX1032_ITERATIVE-NEXT:    v_writelane_b32 v1, s0, s3
-; GFX1032_ITERATIVE-NEXT:    s_andn2_b32 s2, s2, s8
+; GFX1032_ITERATIVE-NEXT:    s_bitset0_b32 s2, s3
 ; GFX1032_ITERATIVE-NEXT:    s_and_b64 s[0:1], s[0:1], s[6:7]
 ; GFX1032_ITERATIVE-NEXT:    s_cmp_lg_u32 s2, 0
 ; GFX1032_ITERATIVE-NEXT:    s_cbranch_scc1 .LBB16_1
@@ -7166,10 +7159,9 @@ define amdgpu_kernel void @and_i64_varying(ptr addrspace(1) %out) {
 ; GFX1132_ITERATIVE-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
 ; GFX1132_ITERATIVE-NEXT:    v_readlane_b32 s7, v3, s3
 ; GFX1132_ITERATIVE-NEXT:    v_readlane_b32 s6, v2, s3
-; GFX1132_ITERATIVE-NEXT:    s_lshl_b32 s8, 1, s3
 ; GFX1132_ITERATIVE-NEXT:    v_writelane_b32 v1, s1, s3
 ; GFX1132_ITERATIVE-NEXT:    v_writelane_b32 v0, s0, s3
-; GFX1132_ITERATIVE-NEXT:    s_and_not1_b32 s2, s2, s8
+; GFX1132_ITERATIVE-NEXT:    s_bitset0_b32 s2, s3
 ; GFX1132_ITERATIVE-NEXT:    s_and_b64 s[0:1], s[0:1], s[6:7]
 ; GFX1132_ITERATIVE-NEXT:    s_cmp_lg_u32 s2, 0
 ; GFX1132_ITERATIVE-NEXT:    s_cbranch_scc1 .LBB16_1
@@ -7797,9 +7789,8 @@ define amdgpu_kernel void @or_i32_varying(ptr addrspace(1) %out) {
 ; GFX1032_ITERATIVE-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX1032_ITERATIVE-NEXT:    s_ff1_i32_b32 s2, s1
 ; GFX1032_ITERATIVE-NEXT:    v_readlane_b32 s3, v0, s2
-; GFX1032_ITERATIVE-NEXT:    s_lshl_b32 s6, 1, s2
 ; GFX1032_ITERATIVE-NEXT:    v_writelane_b32 v1, s0, s2
-; GFX1032_ITERATIVE-NEXT:    s_andn2_b32 s1, s1, s6
+; GFX1032_ITERATIVE-NEXT:    s_bitset0_b32 s1, s2
 ; GFX1032_ITERATIVE-NEXT:    s_or_b32 s0, s0, s3
 ; GFX1032_ITERATIVE-NEXT:    s_cmp_lg_u32 s1, 0
 ; GFX1032_ITERATIVE-NEXT:    s_cbranch_scc1 .LBB17_1
@@ -7883,9 +7874,8 @@ define amdgpu_kernel void @or_i32_varying(ptr addrspace(1) %out) {
 ; GFX1132_ITERATIVE-NEXT:    s_ctz_i32_b32 s2, s1
 ; GFX1132_ITERATIVE-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
 ; GFX1132_ITERATIVE-NEXT:    v_readlane_b32 s3, v1, s2
-; GFX1132_ITERATIVE-NEXT:    s_lshl_b32 s6, 1, s2
 ; GFX1132_ITERATIVE-NEXT:    v_writelane_b32 v0, s0, s2
-; GFX1132_ITERATIVE-NEXT:    s_and_not1_b32 s1, s1, s6
+; GFX1132_ITERATIVE-NEXT:    s_bitset0_b32 s1, s2
 ; GFX1132_ITERATIVE-NEXT:    s_or_b32 s0, s0, s3
 ; GFX1132_ITERATIVE-NEXT:    s_cmp_lg_u32 s1, 0
 ; GFX1132_ITERATIVE-NEXT:    s_cbranch_scc1 .LBB17_1
@@ -8428,10 +8418,9 @@ define amdgpu_kernel void @or_i64_varying(ptr addrspace(1) %out) {
 ; GFX1032_ITERATIVE-NEXT:    s_ff1_i32_b32 s3, s2
 ; GFX1032_ITERATIVE-NEXT:    v_readlane_b32 s7, v3, s3
 ; GFX1032_ITERATIVE-NEXT:    v_readlane_b32 s6, v0, s3
-; GFX1032_ITERATIVE-NEXT:    s_lshl_b32 s8, 1, s3
 ; GFX1032_ITERATIVE-NEXT:    v_writelane_b32 v2, s1, s3
 ; GFX1032_ITERATIVE-NEXT:    v_writelane_b32 v1, s0, s3
-; GFX1032_ITERATIVE-NEXT:    s_andn2_b32 s2, s2, s8
+; GFX1032_ITERATIVE-NEXT:    s_bitset0_b32 s2, s3
 ; GFX1032_ITERATIVE-NEXT:    s_or_b64 s[0:1], s[0:1], s[6:7]
 ; GFX1032_ITERATIVE-NEXT:    s_cmp_lg_u32 s2, 0
 ; GFX1032_ITERATIVE-NEXT:    s_cbranch_scc1 .LBB18_1
@@ -8526,10 +8515,9 @@ define amdgpu_kernel void @or_i64_varying(ptr addrspace(1) %out) {
 ; GFX1132_ITERATIVE-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
 ; GFX1132_ITERATIVE-NEXT:    v_readlane_b32 s7, v3, s3
 ; GFX1132_ITERATIVE-NEXT:    v_readlane_b32 s6, v2, s3
-; GFX1132_ITERATIVE-NEXT:    s_lshl_b32 s8, 1, s3
 ; GFX1132_ITERATIVE-NEXT:    v_writelane_b32 v1, s1, s3
 ; GFX1132_ITERATIVE-NEXT:    v_writelane_b32 v0, s0, s3
-; GFX1132_ITERATIVE-NEXT:    s_and_not1_b32 s2, s2, s8
+; GFX1132_ITERATIVE-NEXT:    s_bitset0_b32 s2, s3
 ; GFX1132_ITERATIVE-NEXT:    s_or_b64 s[0:1], s[0:1], s[6:7]
 ; GFX1132_ITERATIVE-NEXT:    s_cmp_lg_u32 s2, 0
 ; GFX1132_ITERATIVE-NEXT:    s_cbranch_scc1 .LBB18_1
@@ -9157,9 +9145,8 @@ define amdgpu_kernel void @xor_i32_varying(ptr addrspace(1) %out) {
 ; GFX1032_ITERATIVE-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX1032_ITERATIVE-NEXT:    s_ff1_i32_b32 s2, s1
 ; GFX1032_ITERATIVE-NEXT:    v_readlane_b32 s3, v0, s2
-; GFX1032_ITERATIVE-NEXT:    s_lshl_b32 s6, 1, s2
 ; GFX1032_ITERATIVE-NEXT:    v_writelane_b32 v1, s0, s2
-; GFX1032_ITERATIVE-NEXT:    s_andn2_b32 s1, s1, s6
+; GFX1032_ITERATIVE-NEXT:    s_bitset0_b32 s1, s2
 ; GFX1032_ITERATIVE-NEXT:    s_xor_b32 s0, s0, s3
 ; GFX1032_ITERATIVE-NEXT:    s_cmp_lg_u32 s1, 0
 ; GFX1032_ITERATIVE-NEXT:    s_cbranch_scc1 .LBB19_1
@@ -9243,9 +9230,8 @@ define amdgpu_kernel void @xor_i32_varying(ptr addrspace(1) %out) {
 ; GFX1132_ITERATIVE-NEXT:    s_ctz_i32_b32 s2, s1
 ; GFX1132_ITERATIVE-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
 ; GFX1132_ITERATIVE-NEXT:    v_readlane_b32 s3, v1, s2
-; GFX1132_ITERATIVE-NEXT:    s_lshl_b32 s6, 1, s2
 ; GFX1132_ITERATIVE-NEXT:    v_writelane_b32 v0, s0, s2
-; GFX1132_ITERATIVE-NEXT:    s_and_not1_b32 s1, s1, s6
+; GFX1132_ITERATIVE-NEXT:    s_bitset0_b32 s1, s2
 ; GFX1132_ITERATIVE-NEXT:    s_xor_b32 s0, s0, s3
 ; GFX1132_ITERATIVE-NEXT:    s_cmp_lg_u32 s1, 0
 ; GFX1132_ITERATIVE-NEXT:    s_cbranch_scc1 .LBB19_1
@@ -9788,10 +9774,9 @@ define amdgpu_kernel void @xor_i64_varying(ptr addrspace(1) %out) {
 ; GFX1032_ITERATIVE-NEXT:    s_ff1_i32_b32 s3, s2
 ; GFX1032_ITERATIVE-NEXT:    v_readlane_b32 s7, v3, s3
 ; GFX1032_ITERATIVE-NEXT:    v_readlane_b32 s6, v0, s3
-; GFX1032_ITERATIVE-NEXT:    s_lshl_b32 s8, 1, s3
 ; GFX1032_ITERATIVE-NEXT:    v_writelane_b32 v2, s1, s3
 ; GFX1032_ITERATIVE-NEXT:    v_writelane_b32 v1, s0, s3
-; GFX1032_ITERATIVE-NEXT:    s_andn2_b32 s2, s2, s8
+; GFX1032_ITERATIVE-NEXT:    s_bitset0_b32 s2, s3
 ; GFX1032_ITERATIVE-NEXT:    s_xor_b64 s[0:1], s[0:1], s[6:7]
 ; GFX1032_ITERATIVE-NEXT:    s_cmp_lg_u32 s2, 0
 ; GFX1032_ITERATIVE-NEXT:    s_cbranch_scc1 .LBB20_1
@@ -9886,10 +9871,9 @@ define amdgpu_kernel void @xor_i64_varying(ptr addrspace(1) %out) {
 ; GFX1132_ITERATIVE-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
 ; GFX1132_ITERATIVE-NEXT:    v_readlane_b32 s7, v3, s3
 ; GFX1132_ITERATIVE-NEXT:    v_readlane_b32 s...
[truncated]

@BaoshanPang
Copy link
Contributor Author

@arsenm Please help to review this PR.

The PR is for 32-bit. I want to use a different PR for the 64-bit version if it is OK.

@BaoshanPang
Copy link
Contributor Author

The failed test case seems not related to my change:

Failed Tests (1):
--
  | LLVM-Unit :: Transforms/Vectorize/./VectorizeTests.exe/VPRecipeTest/CastVPInterleaveRecipeToVPUser

Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This more naturally belongs in a selection pattern

@BaoshanPang
Copy link
Contributor Author

This more naturally belongs in a selection pattern

Do you mean should I do it in td file?

@arsenm
Copy link
Contributor

arsenm commented Apr 3, 2025

Do you mean should I do it in td file?

Yes. This shouldn't require anything special that would require manual selection, or post-select folding (as you have here)

@BaoshanPang
Copy link
Contributor Author

Need help.

This pat is not working:

+def : GCNPat <
+  (i32 (and (i32 SSrc_b32:$src),(not (shl 1, (i32 SSrc_b32:$bit_idx))))),
+  (i32 (S_BITSET0_B32 SSrc_b32:$bit_idx, SSrc_b32:$src))
+>;

I want it to be used only when src and bit_idx both are SGPR, but llvm would insert a copy from VGPR to SGPR, and at later stage llvm also convert one SGPR back to VGPR which result an error. What should I do for such situation?

@BaoshanPang
Copy link
Contributor Author

Need help.

This pat is not working:

+def : GCNPat <
+  (i32 (and (i32 SSrc_b32:$src),(not (shl 1, (i32 SSrc_b32:$bit_idx))))),
+  (i32 (S_BITSET0_B32 SSrc_b32:$bit_idx, SSrc_b32:$src))
+>;

I want it to be used only when src and bit_idx both are SGPR, but llvm would insert a copy from VGPR to SGPR, and at later stage llvm also convert one SGPR back to VGPR which result an error. What should I do for such situation?

I am able to find a way to do it.

Copy link

github-actions bot commented Apr 5, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@BaoshanPang
Copy link
Contributor Author

BaoshanPang commented Apr 6, 2025

Do you mean should I do it in td file?

Yes. This shouldn't require anything special that would require manual selection, or post-select folding (as you have here)

@arsenm
With some debugging, I feel maybe it is better to do it in post-select. because:

  1. to do it in fd file, we need to use different method for global-isel and non-global-isel.
  2. in selection stage, the instructions are more generic and complex than what can be seen at post-select phase:
# *** IR Dump After AMDGPUPreLegalizerCombiner (amdgpu-prelegalizer-combiner) ***:
# Machine code for function s_bitset1_b32: IsSSA, TracksLiveness
Function Live Ins: $sgpr2 in %2

bb.1 (%ir-block.0):
  liveins: $sgpr0, $sgpr1
  %0:_(s32) = COPY $sgpr0
  %1:_(s32) = COPY $sgpr1
  %3:_(s32) = G_CONSTANT i32 1
  %4:_(s32) = G_SHL %3:_, %1:_(s32)
  %5:_(s32) = G_OR %0:_, %4:_
  %6:_(s32) = G_INTRINSIC_CONVERGENT intrinsic(@llvm.amdgcn.readfirstlane), %5:_(s32)
  $sgpr0 = COPY %6:_(s32)
  SI_RETURN_TO_EPILOG implicit $sgpr0

# End machine code for function s_bitset1_b32.

also in later phase like RegBankSelect, some sgpr would be converted to vgpr which inrease the complexity:

# *** IR Dump After RegBankSelect (regbankselect) ***:
# Machine code for function s_bitset1_b32: IsSSA, TracksLiveness, Legalized, RegBankSelected
Function Live Ins: $sgpr2 in %2


bb.1 (%ir-block.0):
  liveins: $sgpr0, $sgpr1
  %0:sgpr(s32) = COPY $sgpr0
  %1:sgpr(s32) = COPY $sgpr1
  %3:sgpr(s32) = G_CONSTANT i32 1
  %4:sgpr(s32) = G_SHL %3:sgpr, %1:sgpr(s32)
  %5:sgpr(s32) = G_OR %0:sgpr, %4:sgpr
  %7:vgpr(s32) = COPY %5:sgpr(s32)
  %6:sgpr(s32) = G_INTRINSIC_CONVERGENT intrinsic(@llvm.amdgcn.readfirstlane), %7:vgpr(s32)
  $sgpr0 = COPY %6:sgpr(s32)
  SI_RETURN_TO_EPILOG implicit $sgpr0

# End machine code for function s_bitset1_b32.

While if do it in post-select phase, these would be what we need to handle wich is much simple and straghtforward:

# *** IR Dump After SI Shrink Instructions (si-shrink-instructions) ***:
# Machine code for function s_bitset1_b32: NoPHIs, TracksLiveness, NoVRegs, Legalized, RegBankSelected, Selected, TiedOpsRewritten, TracksDebugUserValues
Function Live Ins: $sgpr2

bb.0 (%ir-block.0):
  liveins: $sgpr0, $sgpr1
  renamable $sgpr1 = S_LSHL_B32 1, killed renamable $sgpr1, implicit-def dead $scc
  renamable $sgpr0 = S_OR_B32 killed renamable $sgpr0, killed renamable $sgpr1, implicit-def dead $scc
  SI_RETURN_TO_EPILOG implicit $sgpr0

# End machine code for function s_bitset1_b32.


@BaoshanPang
Copy link
Contributor Author

This more naturally belongs in a selection pattern

With the knowledage what I have so far, doing it in SiShrink pass is the easiest way.

@arsenm
Copy link
Contributor

arsenm commented Apr 23, 2025

With the knowledage what I have so far, doing it in SiShrink pass is the easiest way.

This is more difficult than selecting this as a tablegen pattern

to do it in fd file, we need to use different method for global-isel and non-global-isel.

This is not true

in selection stage, the instructions are more generic and complex than what can be seen at post-select phase

The opposite is true. One generic operation becomes many more complicated selected operations

@BaoshanPang
Copy link
Contributor Author

With the knowledage what I have so far, doing it in SiShrink pass is the easiest way.

This is more difficult than selecting this as a tablegen pattern

to do it in fd file, we need to use different method for global-isel and non-global-isel.

This is not true

in selection stage, the instructions are more generic and complex than what can be seen at post-select phase

The opposite is true. One generic operation becomes many more complicated selected operations

Ok, I guess I need to get better understanding of the system. For now, I will try to find some workable issues to fix. Please feel free to assign #130245 to others.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants