[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic #105822
@llvm/pr-subscribers-backend-amdgpu @llvm/pr-subscribers-llvm-ir

Author: Diana Picus (rovka)

Changes

This intrinsic is meant to be used in functions that have a "tail" that needs to be run with all the lanes enabled. The "tail" may contain complex control flow that makes it unsuitable for the use of the existing WWM intrinsics. Instead, we will pretend that the function starts with all the lanes enabled, then branches into the actual body of the function for the lanes that were meant to run it, and then finally all the lanes will rejoin and run the tail. As such, the intrinsic will return the EXEC mask for the body of the function, and is meant to be used only as part of a very limited pattern (for now only in amdgpu_cs_chain functions):
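```llvm
entry:
  %func_exec = call i1 @llvm.amdgcn.init.whole.wave()
  br i1 %func_exec, label %func, label %tail

func:
  ; ... stuff that should run with the actual EXEC mask
  br label %tail

tail:
  ; ... stuff that runs with all the lanes enabled;
  ; can contain more than one basic block
```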
It's an error to use the result of this intrinsic for anything other than a branch (but unfortunately checking that in the verifier is non-trivial because SIAnnotateControlFlow will introduce an amdgcn.if between the intrinsic and the branch).

Since the usage pattern is very strict, the backend can optimize away the intrinsic and the branch following it (in practice EXEC will already contain the correct value when entering the function, because it will be set by the llvm.amdgcn.cs.chain intrinsic before jumping in). The removal is done early on, in `SITargetLowering::finalizeLowering`.

Some of the generated code could use some more optimization (#98864 might help with some of that). One important thing for front-ends to note is that for now it's recommended to avoid phi's in the tail for wide values (anything larger than 64 bits), since those are currently expanded into one V_SET_INACTIVE_B32 per 32-bit subregister.

Patch is 107.26 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/105822.diff

21 Files Affected:
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index dc13a35c66f9ab..a552b758a1ace7 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -208,6 +208,20 @@ def int_amdgcn_init_exec_from_input : Intrinsic<[],
[IntrConvergent, IntrHasSideEffects, IntrNoMem, IntrNoCallback,
IntrNoFree, IntrWillReturn, ImmArg<ArgIndex<1>>]>;
+// Sets the function into whole-wave-mode and returns whether the lane was
+// active when entering the function. A branch depending on this return will
+// revert the EXEC mask to what it was when entering the function, thus
+// resulting in a no-op. This pattern is used to optimize branches when function
+// tails need to be run in whole-wave-mode. It may also have other consequences
+// (mostly related to WWM CSR handling) that differentiate it from using
+// a plain `amdgcn.init.exec -1`.
+//
+// Can only be used in functions with the `amdgpu_cs_chain` calling convention.
+// Using this intrinsic without immediately branching on its return value is an
+// error.
+def int_amdgcn_init_whole_wave : Intrinsic<[llvm_i1_ty], [], [
+ IntrHasSideEffects, IntrNoMem, IntrNoDuplicate, IntrConvergent]>;
+
def int_amdgcn_wavefrontsize :
ClangBuiltin<"__builtin_amdgcn_wavefrontsize">,
DefaultAttrsIntrinsic<[llvm_i32_ty], [], [NoUndef<RetIndex>, IntrNoMem, IntrSpeculatable]>;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index 0daaf6b6576030..380dc7d3312f32 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -2738,6 +2738,11 @@ void AMDGPUDAGToDAGISel::SelectINTRINSIC_W_CHAIN(SDNode *N) {
case Intrinsic::amdgcn_ds_bvh_stack_rtn:
SelectDSBvhStackIntrinsic(N);
return;
+ case Intrinsic::amdgcn_init_whole_wave:
+ CurDAG->getMachineFunction()
+ .getInfo<SIMachineFunctionInfo>()
+ ->setInitWholeWave();
+ break;
}
SelectCode(N);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index 17071970ca4bfe..06192f74e1e0ed 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -1772,6 +1772,14 @@ bool AMDGPUInstructionSelector::selectDSAppendConsume(MachineInstr &MI,
return constrainSelectedInstRegOperands(*MIB, TII, TRI, RBI);
}
+bool AMDGPUInstructionSelector::selectInitWholeWave(MachineInstr &MI) const {
+ MachineFunction *MF = MI.getParent()->getParent();
+ SIMachineFunctionInfo *MFInfo = MF->getInfo<SIMachineFunctionInfo>();
+
+ MFInfo->setInitWholeWave();
+ return selectImpl(MI, *CoverageInfo);
+}
+
bool AMDGPUInstructionSelector::selectSBarrier(MachineInstr &MI) const {
if (TM.getOptLevel() > CodeGenOptLevel::None) {
unsigned WGSize = STI.getFlatWorkGroupSizes(MF->getFunction()).second;
@@ -2099,6 +2107,8 @@ bool AMDGPUInstructionSelector::selectG_INTRINSIC_W_SIDE_EFFECTS(
return selectDSAppendConsume(I, true);
case Intrinsic::amdgcn_ds_consume:
return selectDSAppendConsume(I, false);
+ case Intrinsic::amdgcn_init_whole_wave:
+ return selectInitWholeWave(I);
case Intrinsic::amdgcn_s_barrier:
return selectSBarrier(I);
case Intrinsic::amdgcn_raw_buffer_load_lds:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
index 207cd67f0eda0e..2ddee05d096b23 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
@@ -120,6 +120,7 @@ class AMDGPUInstructionSelector final : public InstructionSelector {
bool selectDSOrderedIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
bool selectDSGWSIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
bool selectDSAppendConsume(MachineInstr &MI, bool IsAppend) const;
+ bool selectInitWholeWave(MachineInstr &MI) const;
bool selectSBarrier(MachineInstr &MI) const;
bool selectDSBvhStackIntrinsic(MachineInstr &MI) const;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h b/llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h
index 7efb7f825348e3..b1022e48b8d34f 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.h
@@ -67,6 +67,8 @@ class AMDGPUMachineFunction : public MachineFunctionInfo {
// Kernel may need limited waves per EU for better performance.
bool WaveLimiter = false;
+ bool HasInitWholeWave = false;
+
public:
AMDGPUMachineFunction(const Function &F, const AMDGPUSubtarget &ST);
@@ -109,6 +111,9 @@ class AMDGPUMachineFunction : public MachineFunctionInfo {
return WaveLimiter;
}
+ bool hasInitWholeWave() const { return HasInitWholeWave; }
+ void setInitWholeWave() { HasInitWholeWave = true; }
+
unsigned allocateLDSGlobal(const DataLayout &DL, const GlobalVariable &GV) {
return allocateLDSGlobal(DL, GV, DynLDSAlign);
}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 69a1936a11fe05..5e7986734871cf 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -4979,6 +4979,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
OpdsMapping[3] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, WaveSize);
break;
}
+ case Intrinsic::amdgcn_init_whole_wave:
case Intrinsic::amdgcn_live_mask: {
OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::VCCRegBankID, 1);
break;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td b/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
index 95c4859674ecc4..2cd5fb2b94285c 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
@@ -329,6 +329,7 @@ def : SourceOfDivergence<int_amdgcn_mov_dpp>;
def : SourceOfDivergence<int_amdgcn_mov_dpp8>;
def : SourceOfDivergence<int_amdgcn_update_dpp>;
def : SourceOfDivergence<int_amdgcn_writelane>;
+def : SourceOfDivergence<int_amdgcn_init_whole_wave>;
foreach intr = AMDGPUMFMAIntrinsics908 in
def : SourceOfDivergence<intr>;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 7a9735790371a1..3fcc750646d6ab 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -1731,6 +1731,9 @@ bool GCNTargetMachine::parseMachineFunctionInfo(
? DenormalMode::IEEE
: DenormalMode::PreserveSign;
+ if (YamlMFI.HasInitWholeWave)
+ MFI->setInitWholeWave();
+
return false;
}
diff --git a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
index 8c951105101d96..dfdc7ad32b00c7 100644
--- a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
@@ -1343,10 +1343,14 @@ void SIFrameLowering::processFunctionBeforeFrameFinalized(
// Allocate spill slots for WWM reserved VGPRs.
// For chain functions, we only need to do this if we have calls to
- // llvm.amdgcn.cs.chain.
- bool IsChainWithoutCalls =
- FuncInfo->isChainFunction() && !MF.getFrameInfo().hasTailCall();
- if (!FuncInfo->isEntryFunction() && !IsChainWithoutCalls) {
+ // llvm.amdgcn.cs.chain (otherwise there's no one to save them for, since
+ // chain functions do not return) and the function did not contain a call to
+ // llvm.amdgcn.init.whole.wave (since in that case there are no inactive lanes
+ // when entering the function).
+ bool IsChainWithoutRestores =
+ FuncInfo->isChainFunction() &&
+ (!MF.getFrameInfo().hasTailCall() || FuncInfo->hasInitWholeWave());
+ if (!FuncInfo->isEntryFunction() && !IsChainWithoutRestores) {
for (Register Reg : FuncInfo->getWWMReservedRegs()) {
const TargetRegisterClass *RC = TRI->getPhysRegBaseClass(Reg);
FuncInfo->allocateWWMSpill(MF, Reg, TRI->getSpillSize(*RC),
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index ecd4451c504727..b50c92df5479fc 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -15671,6 +15671,133 @@ static int getAlignedAGPRClassID(unsigned UnalignedClassID) {
}
}
+static void removeInitWholeWaveBranch(MachineFunction &MF,
+ MachineRegisterInfo &MRI,
+ const SIInstrInfo *TII) {
+ // Remove SI_INIT_WHOLE_WAVE and the following SI_IF/END_CF and instead set
+ // EXEC to -1 at SI_END_CF.
+ auto IWWIt = find_if(MF.begin()->instrs(), [](const MachineInstr &MI) {
+ return MI.getOpcode() == AMDGPU::SI_INIT_WHOLE_WAVE;
+ });
+ if (IWWIt == MF.begin()->instr_end())
+ return; // We've been here before (GISel runs finalizeLowering twice).
+
+ MachineInstr &If = *MRI.use_begin(IWWIt->getOperand(0).getReg())->getParent();
+ assert(If.getOpcode() == AMDGPU::SI_IF &&
+ "Unexpected user for init.whole.wave result");
+ assert(MRI.hasOneUse(IWWIt->getOperand(0).getReg()) &&
+ "Expected simple control flow");
+
+ MachineInstr &EndCf = *MRI.use_begin(If.getOperand(0).getReg())->getParent();
+ MachineBasicBlock *EndBB = EndCf.getParent();
+
+ // Update all the Phis: since we're removing a predecessor, we need to remove
+ // the corresponding pair of operands. However, we can't just drop the value
+ // coming from the 'if' block - that's going to be the value of the inactive
+ // lanes.
+ // %v = phi (%inactive, %if), (%active1, %shader1), ... (%activeN, %shaderN)
+ // should become
+ // %t = phi (%active1, %shader1), ... (%activeN, %shaderN)
+ // %v = v_set_inactive %t, %inactive
+ // Note that usually EndCf will be the first instruction after the phis and as
+ // such will serve as the end of the range when iterating over phis.
+ // Therefore, we shouldn't introduce any new instructions before it.
+ const SIRegisterInfo &TRI = TII->getRegisterInfo();
+ auto AfterEndCf = std::next(EndCf.getIterator());
+ for (auto &Phi : EndBB->phis()) {
+ Register PhiDest = Phi.getOperand(0).getReg();
+ const TargetRegisterClass *PhiRC = MRI.getRegClass(PhiDest);
+
+ Register NewPhiDest = MRI.createVirtualRegister(PhiRC);
+ Phi.getOperand(0).setReg(NewPhiDest);
+
+ unsigned InactiveOpIdx = 0;
+ for (unsigned I = 1; I < Phi.getNumOperands(); I += 2) {
+ if (Phi.getOperand(I + 1).getMBB() == If.getParent()) {
+ InactiveOpIdx = I;
+ break;
+ }
+ }
+ assert(InactiveOpIdx != 0 && "Broken phi?");
+
+ // At this point, the register class could be larger than 32 or 64, so we
+ // might have to use more than one V_SET_INACTIVE instruction.
+ unsigned Size = TRI.getRegSizeInBits(*PhiRC);
+ switch (Size) {
+ case 32:
+ BuildMI(*EndBB, AfterEndCf, Phi.getDebugLoc(),
+ TII->get(AMDGPU::V_SET_INACTIVE_B32), PhiDest)
+ .addReg(NewPhiDest)
+ .add(Phi.getOperand(InactiveOpIdx));
+ break;
+ case 64:
+ BuildMI(*EndBB, AfterEndCf, Phi.getDebugLoc(),
+ TII->get(AMDGPU::V_SET_INACTIVE_B64), PhiDest)
+ .addReg(NewPhiDest)
+ .add(Phi.getOperand(InactiveOpIdx));
+ break;
+ default: {
+ // For each 32-bit subregister of the register at InactiveOpIdx, insert
+ // a COPY to a new register, and a V_SET_INACTIVE_B32 using the
+ // corresponding subregisters of PhiDest and NewPhiDest.
+ // FIXME: There has to be a better way to iterate over this...
+ llvm::SmallVector<Register, 16> PhiSubRegs;
+ const unsigned SubRegIndices[] = {
+ AMDGPU::sub0, AMDGPU::sub1, AMDGPU::sub2, AMDGPU::sub3,
+ AMDGPU::sub4, AMDGPU::sub5, AMDGPU::sub6, AMDGPU::sub7,
+ AMDGPU::sub8, AMDGPU::sub9, AMDGPU::sub10, AMDGPU::sub11,
+ AMDGPU::sub12, AMDGPU::sub13, AMDGPU::sub14, AMDGPU::sub15,
+ AMDGPU::sub16, AMDGPU::sub17, AMDGPU::sub18, AMDGPU::sub19,
+ AMDGPU::sub20, AMDGPU::sub21, AMDGPU::sub22, AMDGPU::sub23,
+ AMDGPU::sub24, AMDGPU::sub25, AMDGPU::sub26, AMDGPU::sub27,
+ AMDGPU::sub28, AMDGPU::sub29, AMDGPU::sub30, AMDGPU::sub31};
+ const unsigned NumSubRegs = Size / 32;
+ assert(sizeof(SubRegIndices) / sizeof(SubRegIndices[0]) >= NumSubRegs &&
+ "Not enough subregister indices");
+ for (unsigned I = 0; I != NumSubRegs; ++I) {
+ unsigned SubRegIdx = SubRegIndices[I];
+ Register InactiveSubReg =
+ MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);
+ BuildMI(*EndBB, AfterEndCf, Phi.getDebugLoc(), TII->get(AMDGPU::COPY),
+ InactiveSubReg)
+ .addReg(Phi.getOperand(InactiveOpIdx).getReg(), 0, SubRegIdx);
+
+ Register AllLanesSubReg =
+ MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);
+ BuildMI(*EndBB, AfterEndCf, Phi.getDebugLoc(),
+ TII->get(AMDGPU::V_SET_INACTIVE_B32), AllLanesSubReg)
+ .addReg(NewPhiDest, 0, SubRegIdx)
+ .addReg(InactiveSubReg);
+ PhiSubRegs.push_back(AllLanesSubReg);
+ }
+ // Now we need to combine the subregisters into the original register.
+ auto RegSequence = BuildMI(*EndBB, AfterEndCf, Phi.getDebugLoc(),
+ TII->get(AMDGPU::REG_SEQUENCE), PhiDest);
+ for (unsigned I = 0; I < NumSubRegs; ++I) {
+ RegSequence.addReg(PhiSubRegs[I]);
+ RegSequence.addImm(SubRegIndices[I]);
+ }
+ break;
+ }
+ }
+
+ Phi.removeOperand(InactiveOpIdx + 1);
+ Phi.removeOperand(InactiveOpIdx);
+ }
+ If.getParent()->removeSuccessor(EndBB);
+
+ BuildMI(*EndBB, AfterEndCf, IWWIt->getDebugLoc(),
+ TII->get(MF.getSubtarget<GCNSubtarget>().isWave32()
+ ? AMDGPU::S_MOV_B32
+ : AMDGPU::S_MOV_B64),
+ TII->getRegisterInfo().getExec())
+ .addImm(-1);
+
+ EndCf.eraseFromParent();
+ If.eraseFromParent();
+ IWWIt->eraseFromParent();
+}
+
// Figure out which registers should be reserved for stack access. Only after
// the function is legalized do we know all of the non-spill stack objects or if
// calls are present.
@@ -15681,6 +15808,12 @@ void SITargetLowering::finalizeLowering(MachineFunction &MF) const {
const SIRegisterInfo *TRI = Subtarget->getRegisterInfo();
const SIInstrInfo *TII = ST.getInstrInfo();
+ if (Info->hasInitWholeWave()) {
+ assert(Info->isChainFunction() &&
+ "init.whole.wave may only be used in chain functions");
+ removeInitWholeWaveBranch(MF, MRI, TII);
+ }
+
if (Info->isEntryFunction()) {
// Callable functions have fixed registers used for stack access.
reservePrivateMemoryRegs(getTargetMachine(), MF, *TRI, *Info);
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 814d3182fb5df8..ffffa18e13c30d 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -575,6 +575,14 @@ def SI_INIT_EXEC_FROM_INPUT : SPseudoInstSI <
let Defs = [EXEC];
}
+// Sets EXEC to all lanes and returns the previous EXEC.
+def SI_INIT_WHOLE_WAVE : SPseudoInstSI <
+ (outs SReg_1:$dst), (ins),
+ [(set i1:$dst, (int_amdgcn_init_whole_wave))]> {
+ let Defs = [EXEC];
+ let Uses = [EXEC];
+}
+
// Return for returning shaders to a shader variant epilog.
def SI_RETURN_TO_EPILOG : SPseudoInstSI <
(outs), (ins variable_ops), [(AMDGPUreturn_to_epilog)]> {
diff --git a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
index 7af5e7388f841e..7cebfa29fe7b8d 100644
--- a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
@@ -289,6 +289,8 @@ struct SIMachineFunctionInfo final : public yaml::MachineFunctionInfo {
StringValue SGPRForEXECCopy;
StringValue LongBranchReservedReg;
+ bool HasInitWholeWave = false;
+
SIMachineFunctionInfo() = default;
SIMachineFunctionInfo(const llvm::SIMachineFunctionInfo &,
const TargetRegisterInfo &TRI,
@@ -336,6 +338,7 @@ template <> struct MappingTraits<SIMachineFunctionInfo> {
StringValue()); // Don't print out when it's empty.
YamlIO.mapOptional("longBranchReservedReg", MFI.LongBranchReservedReg,
StringValue());
+ YamlIO.mapOptional("hasInitWholeWave", MFI.HasInitWholeWave, false);
}
};
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll
new file mode 100644
index 00000000000000..6ad9e684273b28
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll
@@ -0,0 +1,1545 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -global-isel=1 -O2 -mtriple=amdgcn -mcpu=gfx1200 -verify-machineinstrs < %s | FileCheck --check-prefix=GISEL12 %s
+; RUN: llc -global-isel=0 -O2 -mtriple=amdgcn -mcpu=gfx1200 -verify-machineinstrs < %s | FileCheck --check-prefix=DAGISEL12 %s
+; RUN: llc -global-isel=1 -O2 -mtriple=amdgcn -mcpu=gfx1030 -verify-machineinstrs < %s | FileCheck --check-prefix=GISEL10 %s
+; RUN: llc -global-isel=0 -O2 -mtriple=amdgcn -mcpu=gfx1030 -verify-machineinstrs < %s | FileCheck --check-prefix=DAGISEL10 %s
+
+define amdgpu_cs_chain void @basic(<3 x i32> inreg %sgpr, ptr inreg %callee, i32 inreg %exec, { i32, ptr addrspace(5), i32, i32 } %vgpr, i32 %x, i32 %y) {
+; GISEL12-LABEL: basic:
+; GISEL12: ; %bb.0: ; %entry
+; GISEL12-NEXT: s_wait_loadcnt_dscnt 0x0
+; GISEL12-NEXT: s_wait_expcnt 0x0
+; GISEL12-NEXT: s_wait_samplecnt 0x0
+; GISEL12-NEXT: s_wait_bvhcnt 0x0
+; GISEL12-NEXT: s_wait_kmcnt 0x0
+; GISEL12-NEXT: s_mov_b32 s6, s3
+; GISEL12-NEXT: s_or_saveexec_b32 s3, -1
+; GISEL12-NEXT: v_dual_mov_b32 v0, v12 :: v_dual_mov_b32 v1, v13
+; GISEL12-NEXT: s_mov_b32 exec_lo, s3
+; GISEL12-NEXT: s_mov_b32 s7, s4
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_mov_b32_e32 v2, v0
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v2, 0x47
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: s_or_saveexec_b32 s3, -1
+; GISEL12-NEXT: v_cmp_ne_u32_e64 s4, 0, v2
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_mov_b32_e32 v2, s4
+; GISEL12-NEXT: s_mov_b32 exec_lo, s3
+; GISEL12-NEXT: v_dual_mov_b32 v4, v2 :: v_dual_add_nc_u32 v3, 42, v0
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_mov_b32_e32 v0, v3
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v0, v0
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: v_mov_b32_e32 v1, v4
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GISEL12-NEXT: v_mov_b32_e32 v1, v1
+; GISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; GISEL12-NEXT: s_mov_b32 exec_lo, -1
+; GISEL12-NEXT: v_dual_mov_b32 v10, v0 :: v_dual_mov_b32 v11, v1
+; GISEL12-NEXT: s_mov_b32 exec_lo, s5
+; GISEL12-NEXT: s_setpc_b64 s[6:7]
+;
+; DAGISEL12-LABEL: basic:
+; DAGISEL12: ; %bb.0: ; %entry
+; DAGISEL12-NEXT: s_wait_loadcnt_dscnt 0x0
+; DAGISEL12-NEXT: s_wait_expcnt 0x0
+; DAGISEL12-NEXT: s_wait_samplecnt 0x0
+; DAGISEL12-NEXT: s_wait_bvhcnt 0x0
+; DAGISEL12-NEXT: s_wait_kmcnt 0x0
+; DAGISEL12-NEXT: s_or_saveexec_b32 s6, -1
+; DAGISEL12-NEXT: v_dual_mov_b32 v0, v13 :: v_dual_mov_b32 v1, v12
+; DAGISEL12-NEXT: s_mov_b32 exec_lo, s6
+; DAGISEL12-NEXT: s_mov_b32 s7, s4
+; DAGISEL12-NEXT: s_mov_b32 s6, s3
+; DAGISEL12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_4) | instid1(VALU_DEP_1)
+; DAGISEL12-NEXT: v_mov_b32_e32 v2, v1
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: v_mov_b32_e32 v2, 0x47
+; DAGISEL12-NEXT: s_not_b32 exec_lo, exec_lo
+; DAGISEL12-NEXT: s_or_saveexec_b32 s3, -1
+; DAGISEL12-NEXT: v_...
[truncated]
Can this replace llvm.amdgcn.init.exec? (Sorry if that's a dumb question, I have not grokked your new intrinsic yet.)
In theory, no, but I guess it depends on what use cases it actually has in practice. init.exec can take any exec mask, and I think it can exist anywhere in the function (the comment only says it gets moved to the beginning of the basic block, but it doesn't say it has to be the entry block). OTOH init.whole.wave should only appear in the entry block (maybe I could add a Verifier check for that if you think it's worth having) and enables all the lanes.
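For reference, a rough sketch of the two interfaces at the IR level (the exact attributes live in IntrinsicsAMDGPU.td; this is illustrative only):

```llvm
; Existing intrinsic: sets EXEC to an arbitrary (immediate) mask, returns nothing.
declare void @llvm.amdgcn.init.exec(i64 immarg)

; This patch: enables all lanes and returns whether the current lane was
; already active on entry, so the function can branch on it.
declare i1 @llvm.amdgcn.init.whole.wave()
```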
When used at the start of a function, their use-cases are rather different. Is there a test-case that does not use whole-wave-mode?

```llvm
define amdgpu_cs_chain void @basic(<3 x i32> inreg %sgpr, ptr inreg %callee, i32 inreg %exec, { i32, ptr addrspace(5), i32, i32 } %vgpr, i32 %x, i32 %y) {
entry:
  %entry_exec = call i1 @llvm.amdgcn.init.whole.wave()
  br i1 %entry_exec, label %shader, label %tail

shader:
  %newx = add i32 %x, 42
  %oldval = extractvalue { i32, ptr addrspace(5), i32, i32 } %vgpr, 0
  %newval = add i32 %oldval, 5
  %newvgpr = insertvalue { i32, ptr addrspace(5), i32, i32 } %vgpr, i32 %newval, 0
  br label %tail

tail:
  %full.x = phi i32 [%x, %entry], [%newx, %shader]
  %full.vgpr = phi { i32, ptr addrspace(5), i32, i32 } [%vgpr, %entry], [%newvgpr, %shader]
  call void(ptr, i32, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32, ...) @llvm.amdgcn.cs.chain(ptr %callee, i32 %exec, <3 x i32> inreg %sgpr, { i32, ptr addrspace(5), i32, i32 } %full.vgpr, i32 %full.x)
  unreachable
}
```
// Using this intrinsic without immediately branching on its return value is an
// error.
def int_amdgcn_init_whole_wave : Intrinsic<[llvm_i1_ty], [], [
  IntrHasSideEffects, IntrNoMem, IntrNoDuplicate, IntrConvergent]>;
Does this really need IntrNoDuplicate? I would like to eliminate it.
// (mostly related to WWM CSR handling) that differentiate it from using
// a plain `amdgcn.init.exec -1`.
//
// Can only be used in functions with the `amdgpu_cs_chain` calling convention.
The verifier needs to enforce this if it's a requirement
// Sets the function into whole-wave-mode and returns whether the lane was
// active when entering the function. A branch depending on this return will
If we add the exec on entry to the function ABI (as we probably need to), can we remove the special calling convention requirement?
Sure, I don't think that's a hard requirement as much as an "I'm currently too lazy to think about/implement this for other calling conventions" requirement. I'm going to reword this.
It would be pretty cool to support something like this for arbitrary functions. However, the semantics of it would have to be different.
For chain functions, we explicitly plan to rely on the fact that function arguments in the inactive lanes have meaningful data in them, and the function that uses this intrinsic is expected to be aware of this and in fact to make use of this data. In other words, there's no need for the backend to save/restore the contents of inactive lanes. That would not be true for default calling convention functions.
I have long thought that we should rip out the current whole wave mode implementation in favor of something that works more like an MLIR region (represented as a function to be called). However, something needs to replace the functionality of set.inactive. So whatever we end up with in the end, it could well be somewhat similar, but it's different enough that we should keep it separate from this PR.
// might have to use more than one V_SET_INACTIVE instruction.
unsigned Size = TRI.getRegSizeInBits(*PhiRC);
switch (Size) {
case 32:
Braces
// For each 32-bit subregister of the register at InactiveOpIdx, insert
// a COPY to a new register, and a V_SET_INACTIVE_B32 using the
// corresponding subregisters of PhiDest and NewPhiDest.
// FIXME: There has to be a better way to iterate over this...
Is this just getSubRegFromChannel?
I think it could replace it.
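If so, the hard-coded table of sub-register indices could presumably go away; a rough, untested sketch of the inner loop using that helper (assuming SIRegisterInfo::getSubRegFromChannel applies here):

```cpp
// Sketch only: walk the 32-bit channels of the phi value and set the
// inactive lanes per channel, asking SIRegisterInfo for the sub-register
// index instead of indexing a hard-coded SubRegIndices array.
for (unsigned I = 0; I != NumSubRegs; ++I) {
  unsigned SubRegIdx = SIRegisterInfo::getSubRegFromChannel(I);

  Register InactiveSubReg =
      MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);
  BuildMI(*EndBB, AfterEndCf, Phi.getDebugLoc(), TII->get(AMDGPU::COPY),
          InactiveSubReg)
      .addReg(Phi.getOperand(InactiveOpIdx).getReg(), 0, SubRegIdx);

  Register AllLanesSubReg =
      MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);
  BuildMI(*EndBB, AfterEndCf, Phi.getDebugLoc(),
          TII->get(AMDGPU::V_SET_INACTIVE_B32), AllLanesSubReg)
      .addReg(NewPhiDest, 0, SubRegIdx)
      .addReg(InactiveSubReg);
  PhiSubRegs.push_back(AllLanesSubReg);
}
```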
Thanks! We've just had a quick offline discussion. To summarize, I think the overall flow of where stuff happens is good, but the optimization of the initial branch is subtle. I suggest splitting this up into multiple changes, where the first one implements init.whole.wave in a simple but hopefully obviously correct manner. The optimization of the initial branch can then be done as a separate change.
I also think we should try to find an alternative to using V_SET_INACTIVE, since the whole WWM infrastructure is awfully fragile. The closer we can get to "normal looking" IR between isel and register allocation, the better.
I don't think we can simply remove the EXEC setup. This would be very likely lowered to something like: (Sorry it is in very rough shape.)

The point is that when we have an instruction in the tail block which wants to operate on a function argument, I think the expectation is that the argument also holds meaningful values in its inactive lanes.
- Add findImplicitExecSrc helper.
- Use helper to ignore V_SET_INACTIVE instructions during WQM/WWM processing. This allows other passes to emit V_SET_INACTIVE for already known WWM sections. This supports llvm#105822.
- Add test for above.
This intrinsic is meant to be used in functions that have a "tail" that needs to be run with all the lanes enabled. The "tail" may contain complex control flow that makes it unsuitable for the use of the existing WWM intrinsics. Instead, we will pretend that the function starts with all the lanes enabled, then branches into the actual body of the function for the lanes that were meant to run it, and then finally all the lanes will rejoin and run the tail.

As such, the intrinsic will return the EXEC mask for the body of the function, and is meant to be used only as part of a very limited pattern (for now only in amdgpu_cs_chain functions):

```
entry:
  %func_exec = call i1 @llvm.amdgcn.init.whole.wave()
  br i1 %func_exec, label %func, label %tail

func:
  ; ... stuff that should run with the actual EXEC mask
  br label %tail

tail:
  ; ... stuff that runs with all the lanes enabled;
  ; can contain more than one basic block
```

It's an error to use the result of this intrinsic for anything other than a branch (but unfortunately checking that in the verifier is non-trivial because SIAnnotateControlFlow will introduce an amdgcn.if between the intrinsic and the branch).

The intrinsic is lowered to a SI_INIT_WHOLE_WAVE pseudo, which for now is expanded in si-wqm (which is where SI_INIT_EXEC is handled too); however the information that the function was conceptually started in whole wave mode is stored in the machine function info (hasInitWholeWave). This will be useful in prolog epilog insertion, where we can skip saving the inactive lanes for CSRs (since if the function started with all the lanes active, then there are no inactive lanes to preserve).
Force-pushed from 8fda219 to f79595f.
Yes, that makes sense, and that's why I was introducing the V_SET_INACTIVE in the tail block. Anyway, the new version is following Nicolai's suggestion to leave the branch optimization for a future patch. So for now the EXEC setup is still there and we can discuss what to do with the inactive lanes in a future patch :)
Thanks! Two nits inline, but apart from that LGTM.
// Using this intrinsic without immediately branching on its return value is an
// error.
Remove this part of the comment.
@@ -1582,6 +1583,28 @@ void SIWholeQuadMode::lowerInitExec(MachineInstr &MI) {
   MachineBasicBlock *MBB = MI.getParent();
   bool IsWave32 = ST->isWave32();

   if (MI.getOpcode() == AMDGPU::SI_INIT_WHOLE_WAVE) {
     // TODO: Assert that it's in the entry block
Do the TODO? :)
Ooops, I thought for sure I would notice that before pushing...
; GISEL12-NEXT: v_add_nc_u32_e32 v8, 5, v8
; GISEL12-NEXT: ; %bb.2: ; %tail
; GISEL12-NEXT: s_wait_alu 0xfffe
; GISEL12-NEXT: s_or_b32 exec_lo, exec_lo, s3
We could just `s_mov_b32 exec_lo, -1` here. Not particularly high priority, and if anything should happen as part of a more general micro-optimization that detects the equivalent condition in compute shaders with known workgroup size.
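Concretely, relative to the checks quoted above, the suggestion amounts to something like the following (illustrative only, not what the patch currently emits):

```
; current reconvergence at the start of the tail:
s_or_b32 exec_lo, exec_lo, s3
; possible replacement when the tail is known to run with all lanes enabled:
s_mov_b32 exec_lo, -1
```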
Thanks, LGTM!
LLVM Buildbot has detected a new failure on a builder. Full details are available at https://lab.llvm.org/buildbot/#/builders/161/builds/1943. Here is the relevant piece of the build log for reference:
LLVM Buildbot has detected a new failure on a builder. Full details are available at https://lab.llvm.org/buildbot/#/builders/169/builds/3049. Here is the relevant piece of the build log for reference:
This reverts commit 44556e6.
This intrinsic is meant to be used in functions that have a "tail" that
needs to be run with all the lanes enabled. The "tail" may contain
complex control flow that makes it unsuitable for the use of the
existing WWM intrinsics. Instead, we will pretend that the function
starts with all the lanes enabled, then branches into the actual body of
the function for the lanes that were meant to run it, and then finally
all the lanes will rejoin and run the tail.
As such, the intrinsic will return the EXEC mask for the body of the
function, and is meant to be used only as part of a very limited pattern
(for now only in amdgpu_cs_chain functions):
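```
entry:
  %func_exec = call i1 @llvm.amdgcn.init.whole.wave()
  br i1 %func_exec, label %func, label %tail

func:
  ; ... stuff that should run with the actual EXEC mask
  br label %tail

tail:
  ; ... stuff that runs with all the lanes enabled;
  ; can contain more than one basic block
```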
It's an error to use the result of this intrinsic for anything
other than a branch (but unfortunately checking that in the verifier is
non-trivial because SIAnnotateControlFlow will introduce an amdgcn.if
between the intrinsic and the branch).
The intrinsic is lowered to a SI_INIT_WHOLE_WAVE pseudo, which for now
is expanded in si-wqm (which is where SI_INIT_EXEC is handled too);
however the information that the function was conceptually started in
whole wave mode is stored in the machine function info
(hasInitWholeWave). This will be useful in prolog epilog insertion,
where we can skip saving the inactive lanes for CSRs (since if the
function started with all the lanes active, then there are no inactive
lanes to preserve).