[AMDGPU] ISel for @llvm.amdgcn.cs.chain intrinsic #68186

Merged: 5 commits merged into llvm:main from amdgcn-cs-chain-isel on Nov 6, 2023

Conversation

@rovka (Collaborator) commented Oct 4, 2023

The @llvm.amdgcn.cs.chain intrinsic is essentially a call. The call parameters are bundled up into two intrinsic arguments: one for those that should go in the SGPRs (the 3rd intrinsic argument) and one for those that should go in the VGPRs (the 4th intrinsic argument). Both will often be some kind of aggregate.

Both instruction selection frameworks have an internal representation for intrinsics (G_INTRINSIC[_W_SIDE_EFFECTS] for GlobalISel, ISD::INTRINSIC_[VOID|W_CHAIN] for DAGISel), but we can't use those because aggregates are dissolved very early during ISel and we'd lose the inreg information. Therefore, this patch short-circuits both the IRTranslator and the SelectionDAGBuilder to lower this intrinsic as a call from the very start. It tries to reuse the existing infrastructure as much as possible by calling into the code for lowering tail calls.
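For illustration, a call to the intrinsic looks roughly like this in IR (a minimal sketch reconstructed from the declarations and check lines of the new irtranslator test; the callee, the aggregate types and the exec mask/flags constants are just example values):

declare amdgpu_cs_chain void @callee(<3 x i32> inreg, { i32, ptr addrspace(5), i32, i32 })
declare void @llvm.amdgcn.cs.chain(ptr, i32, <3 x i32> inreg, { i32, ptr addrspace(5), i32, i32 }, i32, ...) noreturn

define amdgpu_cs_chain void @chain_call(<3 x i32> inreg %sgpr, { i32, ptr addrspace(5), i32, i32 } %vgpr) {
  ; Operands: callee, EXEC mask (i32 for wave32), inreg SGPR aggregate, VGPR aggregate, flags (must be zero for now).
  call void (ptr, i32, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32, ...) @llvm.amdgcn.cs.chain(ptr @callee, i32 -1, <3 x i32> inreg %sgpr, { i32, ptr addrspace(5), i32, i32 } %vgpr, i32 0)
  unreachable
}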

This has already gone through a few rounds of review in Phab:

Differential Revision: https://reviews.llvm.org/D153761

@llvmbot (Member) commented Oct 4, 2023

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-llvm-selectiondag

@llvm/pr-subscribers-llvm-globalisel


Patch is 165.73 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/68186.diff

11 Files Affected:

  • (modified) llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp (+3)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp (+54)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp (+87-7)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUCallLowering.h (+3)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+51-7)
  • (modified) llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp (+5)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-amdgcn-cs-chain.ll (+129)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-cc.ll (+424)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-preserve-cc.ll (+499-2)
  • (added) llvm/test/CodeGen/AMDGPU/isel-amdgcn-cs-chain-intrinsic-w32.ll (+637)
  • (added) llvm/test/CodeGen/AMDGPU/isel-amdgcn-cs-chain-intrinsic-w64.ll (+645)
diff --git a/llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp b/llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
index 764567ac7baada6..cdbfba63932231c 100644
--- a/llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
@@ -62,6 +62,7 @@
 #include "llvm/IR/Instructions.h"
 #include "llvm/IR/IntrinsicInst.h"
 #include "llvm/IR/Intrinsics.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
 #include "llvm/IR/LLVMContext.h"
 #include "llvm/IR/Metadata.h"
 #include "llvm/IR/PatternMatch.h"
@@ -2388,6 +2389,8 @@ bool IRTranslator::translateKnownIntrinsic(const CallInst &CI, Intrinsic::ID ID,
     Info.OrigRet = {Register(), Type::getVoidTy(CI.getContext()), 0};
     return CLI->lowerCall(MIRBuilder, Info);
   }
+  case Intrinsic::amdgcn_cs_chain:
+    return translateCallBase(CI, MIRBuilder);
   case Intrinsic::fptrunc_round: {
     uint32_t Flags = MachineInstr::copyFlagsFromInstruction(CI);
 
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index c5fd56795a5201a..cf026fa293f6496 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -76,6 +76,7 @@
 #include "llvm/IR/IntrinsicInst.h"
 #include "llvm/IR/Intrinsics.h"
 #include "llvm/IR/IntrinsicsAArch64.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
 #include "llvm/IR/IntrinsicsWebAssembly.h"
 #include "llvm/IR/LLVMContext.h"
 #include "llvm/IR/Metadata.h"
@@ -7418,6 +7419,59 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
     setValue(&I, Val);
     return;
   }
+  case Intrinsic::amdgcn_cs_chain: {
+    assert(I.arg_size() == 5 && "Additional args not supported yet");
+    assert(cast<ConstantInt>(I.getOperand(4))->isZero() &&
+           "Non-zero flags not supported yet");
+
+    // At this point we don't care if it's amdgpu_cs_chain or
+    // amdgpu_cs_chain_preserve.
+    CallingConv::ID CC = CallingConv::AMDGPU_CS_Chain;
+
+    Type *RetTy = I.getType();
+    assert(RetTy->isVoidTy() && "Should not return");
+
+    SDValue Callee = getValue(I.getOperand(0));
+
+    // We only have 2 actual args: one for the SGPRs and one for the VGPRs.
+    TargetLowering::ArgListTy Args;
+    Args.reserve(2);
+
+    for (unsigned Idx : {2, 3}) {
+      TargetLowering::ArgListEntry Arg;
+      Arg.Node = getValue(I.getOperand(Idx));
+      Arg.Ty = I.getOperand(Idx)->getType();
+      Arg.setAttributes(&I, Idx);
+      Args.push_back(Arg);
+    }
+
+    assert(Args[0].IsInReg && "SGPR args should be marked inreg");
+    assert(!Args[1].IsInReg && "VGPR args should not be marked inreg");
+
+    // We're also going to pass the EXEC mask as the last argument.
+    TargetLowering::ArgListEntry Arg;
+    Arg.Node = getValue(I.getOperand(1));
+    Arg.Ty = I.getOperand(1)->getType();
+    Arg.IsInReg = true;
+    Args.push_back(Arg);
+
+    TargetLowering::CallLoweringInfo CLI(DAG);
+    CLI.setDebugLoc(getCurSDLoc())
+        .setChain(getRoot())
+        .setCallee(CC, RetTy, Callee, std::move(Args))
+        .setNoReturn(true)
+        .setTailCall(true)
+        .setConvergent(I.isConvergent());
+    CLI.CB = &I;
+    std::pair<SDValue, SDValue> Result =
+        lowerInvokable(CLI, /*EHPadBB*/ nullptr);
+    (void)Result;
+    assert(!Result.first.getNode() && !Result.second.getNode() &&
+           "Should've lowered as tail call");
+
+    HasTailCall = true;
+    return;
+  }
   case Intrinsic::ptrmask: {
     SDValue Ptr = getValue(I.getOperand(0));
     SDValue Const = getValue(I.getOperand(1));
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
index db0f56416051f01..02e1a5aa818bc8f 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
@@ -958,8 +958,10 @@ getAssignFnsForCC(CallingConv::ID CC, const SITargetLowering &TLI) {
 
 static unsigned getCallOpcode(const MachineFunction &CallerF, bool IsIndirect,
                               bool IsTailCall, CallingConv::ID CC) {
-  assert(!(IsIndirect && IsTailCall) && "Indirect calls can't be tail calls, "
-                                        "because the address can be divergent");
+  // For calls to amdgpu_cs_chain functions, the address is known to be uniform.
+  assert((AMDGPU::isChainCC(CC) || !IsIndirect || !IsTailCall) &&
+         "Indirect calls can't be tail calls, "
+         "because the address can be divergent");
   if (!IsTailCall)
     return AMDGPU::G_SI_CALL;
 
@@ -1150,14 +1152,20 @@ bool AMDGPUCallLowering::isEligibleForTailCallOptimization(
 void AMDGPUCallLowering::handleImplicitCallArguments(
     MachineIRBuilder &MIRBuilder, MachineInstrBuilder &CallInst,
     const GCNSubtarget &ST, const SIMachineFunctionInfo &FuncInfo,
+    CallingConv::ID CalleeCC,
     ArrayRef<std::pair<MCRegister, Register>> ImplicitArgRegs) const {
   if (!ST.enableFlatScratch()) {
     // Insert copies for the SRD. In the HSA case, this should be an identity
     // copy.
     auto ScratchRSrcReg = MIRBuilder.buildCopy(LLT::fixed_vector(4, 32),
                                                FuncInfo.getScratchRSrcReg());
-    MIRBuilder.buildCopy(AMDGPU::SGPR0_SGPR1_SGPR2_SGPR3, ScratchRSrcReg);
-    CallInst.addReg(AMDGPU::SGPR0_SGPR1_SGPR2_SGPR3, RegState::Implicit);
+
+    auto CalleeRSrcReg = AMDGPU::isChainCC(CalleeCC)
+                             ? AMDGPU::SGPR48_SGPR49_SGPR50_SGPR51
+                             : AMDGPU::SGPR0_SGPR1_SGPR2_SGPR3;
+
+    MIRBuilder.buildCopy(CalleeRSrcReg, ScratchRSrcReg);
+    CallInst.addReg(CalleeRSrcReg, RegState::Implicit);
   }
 
   for (std::pair<MCRegister, Register> ArgReg : ImplicitArgRegs) {
@@ -1253,7 +1261,8 @@ bool AMDGPUCallLowering::lowerTailCall(
   // after the ordinary user argument registers.
   SmallVector<std::pair<MCRegister, Register>, 12> ImplicitArgRegs;
 
-  if (Info.CallConv != CallingConv::AMDGPU_Gfx) {
+  if (Info.CallConv != CallingConv::AMDGPU_Gfx &&
+      !AMDGPU::isChainCC(Info.CallConv)) {
     // With a fixed ABI, allocate fixed registers before user arguments.
     if (!passSpecialInputs(MIRBuilder, CCInfo, ImplicitArgRegs, Info))
       return false;
@@ -1269,7 +1278,8 @@ bool AMDGPUCallLowering::lowerTailCall(
   if (!handleAssignments(Handler, OutArgs, CCInfo, ArgLocs, MIRBuilder))
     return false;
 
-  handleImplicitCallArguments(MIRBuilder, MIB, ST, *FuncInfo, ImplicitArgRegs);
+  handleImplicitCallArguments(MIRBuilder, MIB, ST, *FuncInfo, CalleeCC,
+                              ImplicitArgRegs);
 
   // If we have -tailcallopt, we need to adjust the stack. We'll do the call
   // sequence start and end here.
@@ -1283,6 +1293,23 @@ bool AMDGPUCallLowering::lowerTailCall(
     MIRBuilder.buildInstr(AMDGPU::ADJCALLSTACKDOWN).addImm(NumBytes).addImm(0);
   }
 
+  // If this is a chain call, we need to set EXEC right before the call.
+  if (AMDGPU::isChainCC(Info.CallConv)) {
+    ArgInfo ExecArg = Info.OrigArgs[1];
+    assert(ExecArg.Regs.size() == 1 && "Too many regs for EXEC");
+
+    if (!ExecArg.Ty->isIntegerTy(ST.getWavefrontSize()))
+      return false;
+
+    unsigned MovOpc = ST.isWave32() ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
+    MCRegister Exec = TRI->getExec();
+    auto SetExec =
+        MIRBuilder.buildInstr(MovOpc).addDef(Exec).addReg(ExecArg.Regs[0]);
+    SetExec->getOperand(1).setReg(constrainOperandRegClass(
+        MF, *TRI, MRI, *ST.getInstrInfo(), *ST.getRegBankInfo(), *SetExec,
+        SetExec->getDesc(), SetExec->getOperand(1), 1));
+  }
+
   // Now we can add the actual call instruction to the correct basic block.
   MIRBuilder.insertInstr(MIB);
 
@@ -1303,8 +1330,60 @@ bool AMDGPUCallLowering::lowerTailCall(
   return true;
 }
 
+/// Lower a call to the @llvm.amdgcn.cs.chain intrinsic.
+bool AMDGPUCallLowering::lowerChainCall(MachineIRBuilder &MIRBuilder,
+                                        CallLoweringInfo &Info) const {
+  ArgInfo Callee = Info.OrigArgs[0];
+  ArgInfo SGPRArgs = Info.OrigArgs[2];
+  ArgInfo VGPRArgs = Info.OrigArgs[3];
+  ArgInfo Flags = Info.OrigArgs[4];
+
+  assert(cast<ConstantInt>(Flags.OrigValue)->isZero() &&
+         "Non-zero flags aren't supported yet.");
+  assert(Info.OrigArgs.size() == 5 && "Additional args aren't supported yet.");
+
+  MachineFunction &MF = MIRBuilder.getMF();
+  const Function &F = MF.getFunction();
+  const DataLayout &DL = F.getParent()->getDataLayout();
+
+  // The function to jump to is actually the first argument, so we'll change the
+  // Callee and other info to match that before using our existing helper.
+  const Value *CalleeV = Callee.OrigValue->stripPointerCasts();
+  if (const Function *F = dyn_cast<Function>(CalleeV)) {
+    Info.Callee = MachineOperand::CreateGA(F, 0);
+    Info.CallConv = F->getCallingConv();
+  } else {
+    assert(Callee.Regs.size() == 1 && "Too many regs for the callee");
+    Info.Callee = MachineOperand::CreateReg(Callee.Regs[0], false);
+    Info.CallConv = CallingConv::AMDGPU_CS_Chain; // amdgpu_cs_chain_preserve
+                                                  // behaves the same here.
+  }
+
+  // The function that we're calling cannot be vararg (only the intrinsic is).
+  Info.IsVarArg = false;
+
+  assert(std::all_of(SGPRArgs.Flags.begin(), SGPRArgs.Flags.end(),
+                     [](ISD::ArgFlagsTy F) { return F.isInReg(); }) &&
+         "SGPR arguments should be marked inreg");
+  assert(std::none_of(VGPRArgs.Flags.begin(), VGPRArgs.Flags.end(),
+                      [](ISD::ArgFlagsTy F) { return F.isInReg(); }) &&
+         "VGPR arguments should not be marked inreg");
+
+  SmallVector<ArgInfo, 8> OutArgs;
+  splitToValueTypes(SGPRArgs, OutArgs, DL, Info.CallConv);
+  splitToValueTypes(VGPRArgs, OutArgs, DL, Info.CallConv);
+
+  Info.IsMustTailCall = true;
+  return lowerTailCall(MIRBuilder, Info, OutArgs);
+}
+
 bool AMDGPUCallLowering::lowerCall(MachineIRBuilder &MIRBuilder,
                                    CallLoweringInfo &Info) const {
+  if (Function *F = Info.CB->getCalledFunction())
+    if (F->isIntrinsic())
+      return F->getIntrinsicID() == Intrinsic::amdgcn_cs_chain &&
+             lowerChainCall(MIRBuilder, Info);
+
   if (Info.IsVarArg) {
     LLVM_DEBUG(dbgs() << "Variadic functions not implemented\n");
     return false;
@@ -1395,7 +1474,8 @@ bool AMDGPUCallLowering::lowerCall(MachineIRBuilder &MIRBuilder,
 
   const SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();
 
-  handleImplicitCallArguments(MIRBuilder, MIB, ST, *MFI, ImplicitArgRegs);
+  handleImplicitCallArguments(MIRBuilder, MIB, ST, *MFI, Info.CallConv,
+                              ImplicitArgRegs);
 
   // Get a count of how many bytes are to be pushed on the stack.
   unsigned NumBytes = CCInfo.getStackSize();
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.h b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.h
index 569c6d75204d417..a6e801f2a547b53 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.h
@@ -75,10 +75,13 @@ class AMDGPUCallLowering final : public CallLowering {
   void handleImplicitCallArguments(
       MachineIRBuilder &MIRBuilder, MachineInstrBuilder &CallInst,
       const GCNSubtarget &ST, const SIMachineFunctionInfo &MFI,
+      CallingConv::ID CalleeCC,
       ArrayRef<std::pair<MCRegister, Register>> ImplicitArgRegs) const;
 
   bool lowerTailCall(MachineIRBuilder &MIRBuilder, CallLoweringInfo &Info,
                      SmallVectorImpl<ArgInfo> &OutArgs) const;
+  bool lowerChainCall(MachineIRBuilder &MIRBuilder,
+                      CallLoweringInfo &Info) const;
   bool lowerCall(MachineIRBuilder &MIRBuilder,
                  CallLoweringInfo &Info) const override;
 
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index f170428b38c49a5..8eadec5e0797831 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -3259,6 +3259,9 @@ bool SITargetLowering::isEligibleForTailCallOptimization(
     const SmallVectorImpl<ISD::OutputArg> &Outs,
     const SmallVectorImpl<SDValue> &OutVals,
     const SmallVectorImpl<ISD::InputArg> &Ins, SelectionDAG &DAG) const {
+  if (AMDGPU::isChainCC(CalleeCC))
+    return true;
+
   if (!mayTailCallThisCC(CalleeCC))
     return false;
 
@@ -3343,7 +3346,36 @@ bool SITargetLowering::mayBeEmittedAsTailCall(const CallInst *CI) const {
 // The wave scratch offset register is used as the global base pointer.
 SDValue SITargetLowering::LowerCall(CallLoweringInfo &CLI,
                                     SmallVectorImpl<SDValue> &InVals) const {
+  CallingConv::ID CallConv = CLI.CallConv;
+  bool IsChainCallConv = AMDGPU::isChainCC(CallConv);
+
   SelectionDAG &DAG = CLI.DAG;
+
+  TargetLowering::ArgListEntry RequestedExec;
+  if (IsChainCallConv) {
+    // The last argument should be the value that we need to put in EXEC.
+    // Pop it out of CLI.Outs and CLI.OutVals before we do any processing so we
+    // don't treat it like the rest of the arguments.
+    RequestedExec = CLI.Args.back();
+    assert(RequestedExec.Node && "No node for EXEC");
+
+    if (!RequestedExec.Ty->isIntegerTy(Subtarget->getWavefrontSize()))
+      return lowerUnhandledCall(CLI, InVals, "Invalid value for EXEC");
+
+    assert(CLI.Outs.back().OrigArgIndex == 2 && "Unexpected last arg");
+    CLI.Outs.pop_back();
+    CLI.OutVals.pop_back();
+
+    if (RequestedExec.Ty->isIntegerTy(64)) {
+      assert(CLI.Outs.back().OrigArgIndex == 2 && "Exec wasn't split up");
+      CLI.Outs.pop_back();
+      CLI.OutVals.pop_back();
+    }
+
+    assert(CLI.Outs.back().OrigArgIndex != 2 &&
+           "Haven't popped all the pieces of the EXEC mask");
+  }
+
   const SDLoc &DL = CLI.DL;
   SmallVector<ISD::OutputArg, 32> &Outs = CLI.Outs;
   SmallVector<SDValue, 32> &OutVals = CLI.OutVals;
@@ -3351,7 +3383,6 @@ SDValue SITargetLowering::LowerCall(CallLoweringInfo &CLI,
   SDValue Chain = CLI.Chain;
   SDValue Callee = CLI.Callee;
   bool &IsTailCall = CLI.IsTailCall;
-  CallingConv::ID CallConv = CLI.CallConv;
   bool IsVarArg = CLI.IsVarArg;
   bool IsSibCall = false;
   bool IsThisReturn = false;
@@ -3382,9 +3413,10 @@ SDValue SITargetLowering::LowerCall(CallLoweringInfo &CLI,
   if (IsTailCall) {
     IsTailCall = isEligibleForTailCallOptimization(
       Callee, CallConv, IsVarArg, Outs, OutVals, Ins, DAG);
-    if (!IsTailCall && CLI.CB && CLI.CB->isMustTailCall()) {
+    if (!IsTailCall &&
+        ((CLI.CB && CLI.CB->isMustTailCall()) || IsChainCallConv)) {
       report_fatal_error("failed to perform tail call elimination on a call "
-                         "site marked musttail");
+                         "site marked musttail or to llvm.amdgcn.cs.chain");
     }
 
     bool TailCallOpt = MF.getTarget().Options.GuaranteedTailCallOpt;
@@ -3407,7 +3439,7 @@ SDValue SITargetLowering::LowerCall(CallLoweringInfo &CLI,
   CCState CCInfo(CallConv, IsVarArg, MF, ArgLocs, *DAG.getContext());
   CCAssignFn *AssignFn = CCAssignFnForCall(CallConv, IsVarArg);
 
-  if (CallConv != CallingConv::AMDGPU_Gfx) {
+  if (CallConv != CallingConv::AMDGPU_Gfx && !AMDGPU::isChainCC(CallConv)) {
     // With a fixed ABI, allocate fixed registers before user arguments.
     passSpecialInputs(CLI, CCInfo, *Info, RegsToPass, MemOpChains, Chain);
   }
@@ -3433,16 +3465,20 @@ SDValue SITargetLowering::LowerCall(CallLoweringInfo &CLI,
 
   // Adjust the stack pointer for the new arguments...
   // These operations are automatically eliminated by the prolog/epilog pass
-  if (!IsSibCall) {
+  if (!IsSibCall)
     Chain = DAG.getCALLSEQ_START(Chain, 0, 0, DL);
 
+  if (!IsSibCall || IsChainCallConv) {
     if (!Subtarget->enableFlatScratch()) {
       SmallVector<SDValue, 4> CopyFromChains;
 
       // In the HSA case, this should be an identity copy.
       SDValue ScratchRSrcReg
         = DAG.getCopyFromReg(Chain, DL, Info->getScratchRSrcReg(), MVT::v4i32);
-      RegsToPass.emplace_back(AMDGPU::SGPR0_SGPR1_SGPR2_SGPR3, ScratchRSrcReg);
+      RegsToPass.emplace_back(IsChainCallConv
+                                  ? AMDGPU::SGPR48_SGPR49_SGPR50_SGPR51
+                                  : AMDGPU::SGPR0_SGPR1_SGPR2_SGPR3,
+                              ScratchRSrcReg);
       CopyFromChains.push_back(ScratchRSrcReg.getValue(1));
       Chain = DAG.getTokenFactor(DL, CopyFromChains);
     }
@@ -3568,6 +3604,15 @@ SDValue SITargetLowering::LowerCall(CallLoweringInfo &CLI,
     InGlue = Chain.getValue(1);
   }
 
+  auto *TRI = static_cast<const SIRegisterInfo *>(Subtarget->getRegisterInfo());
+
+  if (IsChainCallConv) {
+    // Set EXEC right before the call.
+    MCRegister ExecReg = TRI->getExec();
+    Chain = DAG.getCopyToReg(Chain, DL, ExecReg, RequestedExec.Node, InGlue);
+    InGlue = Chain.getValue(1);
+  }
+
   std::vector<SDValue> Ops;
   Ops.push_back(Chain);
   Ops.push_back(Callee);
@@ -3596,7 +3641,6 @@ SDValue SITargetLowering::LowerCall(CallLoweringInfo &CLI,
 
   // Add a register mask operand representing the call-preserved registers.
 
-  auto *TRI = static_cast<const SIRegisterInfo*>(Subtarget->getRegisterInfo());
   const uint32_t *Mask = TRI->getCallPreservedMask(MF, CallConv);
   assert(Mask && "Missing call preserved mask for calling convention");
   Ops.push_back(DAG.getRegisterMask(Mask));
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
index e065b104db4eb33..f4dde175063c089 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
@@ -421,6 +421,11 @@ const uint32_t *SIRegisterInfo::getCallPreservedMask(const MachineFunction &MF,
   case CallingConv::AMDGPU_Gfx:
     return ST.hasGFX90AInsts() ? CSR_AMDGPU_SI_Gfx_GFX90AInsts_RegMask
                                : CSR_AMDGPU_SI_Gfx_RegMask;
+  case CallingConv::AMDGPU_CS_Chain:
+  case CallingConv::AMDGPU_CS_ChainPreserve:
+    // Calls to these functions never return, so we can pretend everything is
+    // preserved.
+    return AMDGPU_AllVGPRs_RegMask;
   default:
     return nullptr;
   }
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-amdgcn-cs-chain.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-amdgcn-cs-chain.ll
new file mode 100644
index 000000000000000..15ebda007520e72
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-amdgcn-cs-chain.ll
@@ -0,0 +1,129 @@
+; NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 2
+; RUN: llc --global-isel=1 -march=amdgcn -mcpu=gfx1100 -stop-after=irtranslator -verify-machineinstrs %s -o - | FileCheck %s --check-prefix=GFX11
+; RUN: llc --global-isel=1 -march=amdgcn -mcpu=gfx1030 -stop-after=irtranslator -verify-machineinstrs %s -o - | FileCheck %s --check-prefix=GFX10
+
+declare amdgpu_cs_chain void @callee(<3 x i32> inreg, { i32, ptr addrspace(5), i32, i32 })
+declare amdgpu_cs_chain_preserve void @callee_preserve(<3 x i32> inreg, { i32, ptr addrspace(5), i32, i32 })
+declare void @llvm.amdgcn.cs.chain(ptr, i32, <3 x i32> inreg, { i32, ptr addrspace(5), i32, i32 }, i32, ...) noreturn
+
+define amdgpu_cs_chain void @chain_call(<3 x i32> inreg %sgpr, { i32, ptr addrspace(5), i32, i32 } %vgpr) {
+  ; GFX11-LABEL: name: chain_call
+  ; GFX11: bb.1 (%ir-block.0):
+  ; GFX11-NEXT:   liveins: $sgpr0, $sgpr1, $sgpr2, $vgpr8, $vgpr9, $vgpr10, $vgpr11
+  ; GFX11-NEXT: {{  $}}
+  ; GFX11-NEXT:   [[COPY:%[0-9]+]]:_(s32) = COPY $sgpr0
+  ; GFX11-NEXT:   [[COPY1:%[0-9]+]]:_(s32) = COPY $sgpr1
+  ; GFX11-NEXT:   [[COPY2:%[0-9]+]]:_(s32) = COPY $sgpr2
+  ; GFX11-NEXT:   [[BUILD_VECTOR:%[0-9]+]]:_(<3 x s32>) = G_BUILD_VECTOR [[COPY]](s32), [[COPY1]](s32), [[COPY2]](s32)
+  ; GFX11-NEXT:   [[COPY3:%[0-9]+]]:_(s32) = COPY $vgpr8
+  ; GFX11-NEXT:   [[COPY4:%[0-9]+]]:_(p5) = COPY $vgpr9
+  ; GFX11-NEXT:   [[COPY5:%[0-9]+]]:_(s32) = COPY $vgpr10
+  ; GFX11-NEXT:   [[COPY6:%[0-9]+]]:_(s32) = COPY $vgpr11
+  ; GFX11-NEXT:   [[GV:%[0-9]+]]:_(p0) = G_GLOBAL_VALUE @callee
+  ; GFX11-NEXT:   [[C:%[0-9]+]]:sreg_32(s32) = G_CONSTANT i32 -1
+  ; GFX11-NEXT:   [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
+  ; GFX11-NEXT:   [[GV1:%[0-9]+]]:ccr_sgpr_64(p0) = G_GLOBAL_VALUE @callee
+  ; GFX11-NEXT:   [[UV:%[0-9]+]]:_(s32), [[UV1:%[0-9]+]]:_(s32), [[UV2:%[0-9]+]]:_(s32) = G_UNMERGE_VALUES [[BUILD_VECTOR]](<3 x s32>)
+  ; GFX11-NEXT:   $sgpr0 = COPY [[UV]](s32)
+  ; GFX11-N...
[truncated]

@rovka (Collaborator, Author) commented Oct 10, 2023

Ping?

@rovka (Collaborator, Author) commented Oct 23, 2023

Ping.

@rovka requested review from jayfoad, nhaehnle and ruiling on October 23, 2023 11:48
@nhaehnle (Collaborator) left a comment

The high-level flow looks very reasonable to me, but I did notice a test problem.

; GISEL-GFX11-NEXT: s_getpc_b64 s[4:5]
; GISEL-GFX11-NEXT: s_add_u32 s4, s4, chain_callee@gotpcrel32@lo+4
; GISEL-GFX11-NEXT: s_addc_u32 s5, s5, chain_callee@gotpcrel32@hi+12
; GISEL-GFX11-NEXT: s_load_b64 s[4:5], s[4:5], 0x0
Collaborator:

To make these code sequences a bit more representative, can you change the RUN lines to use a triple with amdpal as the OS? I believe the change to use absolute relocations instead of gotpcrel here has already landed...
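For reference, a sketch of the suggested change, taking the RUN line of the new irtranslator test as an example; the only assumption here is replacing -march=amdgcn with an explicit amdgcn--amdpal triple:

; RUN: llc --global-isel=1 -mtriple=amdgcn--amdpal -mcpu=gfx1100 -stop-after=irtranslator -verify-machineinstrs %s -o - | FileCheck %s --check-prefix=GFX11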

Collaborator Author:

Ok, I did that but I think the change for absolute relocations wasn't in yet when I made this PR, so I'd have to either merge or rebase for us to see that. IIUC both have disadvantages in GitHub? I'll upload just the change to the triple for now, since that leads to some other changes too (only in the whole-backend tests, not in the ISel-only tests).

; DAGISEL-GFX10-NEXT: s_mov_b32 exec_lo, -1
; DAGISEL-GFX10-NEXT: s_waitcnt lgkmcnt(0)
; DAGISEL-GFX10-NEXT: s_setpc_b64 s[4:5]
call void asm "s_nop", "~{v0},~{v8},~{v16},~{s0}"()
Collaborator:

Looks like this case isn't actually handled correctly? v16 is clobbered here, but I don't see it being preserved.

Collaborator Author:

There's a FIXME for that at line 204 above. This will be handled by this patch, which has already been approved and which I'll have to look into rebasing after this one lands.

; DAGISEL-GFX10-NEXT: s_mov_b32 exec_lo, -1
; DAGISEL-GFX10-NEXT: s_waitcnt lgkmcnt(0)
; DAGISEL-GFX10-NEXT: s_setpc_b64 s[4:5]
call void asm "s_nop", "~{v0},~{v8},~{v16},~{s0}"()
Collaborator:

Same here.

rovka added 2 commits October 24, 2023 14:56
@rovka force-pushed the amdgcn-cs-chain-isel branch from 423f552 to 4485b78 on October 24, 2023 12:59
@nhaehnle (Collaborator) commented Nov 3, 2023

@ruiling will you get a chance to review this?

@ruiling (Contributor) left a comment

I think the intrinsic here is quite special, which makes us lower it in a different way from an ordinary intrinsic. It works quite like invoke, acting as a kind of indirect call with additional custom operations captured in extra operands and with a custom calling convention. So it sounds reasonable to lower it in a similar way to invoke. The change LGTM overall. Please wait some time to see if there are further concerns.

@nhaehnle (Collaborator) left a comment

Thanks for taking a look. I haven't fully reviewed it, but it's probably good to go.

@jayfoad (Contributor) left a comment

I don't have the expertise to review this in detail but I've skimmed the code and it looks pretty good to me.

bool AMDGPUCallLowering::lowerCall(MachineIRBuilder &MIRBuilder,
CallLoweringInfo &Info) const {
if (Function *F = Info.CB->getCalledFunction())
if (F->isIntrinsic())
return F->getIntrinsicID() == Intrinsic::amdgcn_cs_chain &&
Contributor:

It feels like you ought to be able to assert F->getIntrinsicID() == Intrinsic::amdgcn_cs_chain here.

@@ -3570,6 +3606,8 @@ SDValue SITargetLowering::LowerCall(CallLoweringInfo &CLI,
InGlue = Chain.getValue(1);
}

auto *TRI = static_cast<const SIRegisterInfo *>(Subtarget->getRegisterInfo());
Contributor:

What is this for?

@@ -3589,6 +3627,10 @@ SDValue SITargetLowering::LowerCall(CallLoweringInfo &CLI,
Ops.push_back(DAG.getTargetConstant(FPDiff, DL, MVT::i32));
}

if (IsChainCallConv) {
Contributor:

Don't need braces.

// right before jumping to the callee.
class SI_CS_CHAIN_TC<
ValueType execvt,
RegisterOperand execrc = !if(!eq(execvt, i32), SSrc_b32, SSrc_b64)>
Contributor:

Use getSOPSrcForVT<...>.ret?


multiclass si_cs_chain_tc_patterns<
ValueType execvt,
RegisterOperand execrc = !if(!eq(execvt, i32), SSrc_b32, SSrc_b64),
Contributor:

Same.

let SchedRW = [WriteBranch];
let isConvergent = 1;

let WaveSizePredicate = !if(!eq(execvt, i32), isWave32, isWave64);
Contributor:

It's a matter of taste I guess, but you could just pass the predicate in as a second template argument.

Comment on lines 126 to 127
BuildMI(*MI.getParent(), MI, MI.getDebugLoc(), TII->get(MovOpc), ExecReg);
SetExec->addOperand(MI.getOperand(ExecIdx));
Contributor:

Could write this as BuildMI(...).add(...); Then there is no need for SetExec.

@rovka merged commit 7f5d59b into llvm:main on Nov 6, 2023
@rovka deleted the amdgcn-cs-chain-isel branch on November 6, 2023 11:30