[NVPTX] Add TMA Bulk Copy intrinsics (llvm#122344)
PR llvm#96083 added intrinsics for async copy of 'tensor' data
using TMA. Following a similar design, this PR adds intrinsics
for async copy of bulk data (non-tensor variants) through TMA.

* These intrinsics optionally support multicast and cache_hints,
   as indicated by the boolean arguments at the end of the intrinsics.
* The backend looks through these flag arguments and lowers to the
   appropriate PTX instructions.
* Lit tests are added for all combinations of these intrinsics in
   cp-async-bulk.ll.
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in NVPTXUsage.rst file.

PTX Spec reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
durga4github authored and Mel-Chen committed Jan 13, 2025
1 parent 861e09c commit f9e9301
Showing 6 changed files with 391 additions and 3 deletions.
88 changes: 88 additions & 0 deletions llvm/docs/NVPTXUsage.rst
@@ -465,6 +465,94 @@ least-significant bit position. 0xffffffff is returned if no 1 bit is found.
TMA family of Intrinsics
------------------------

'``llvm.nvvm.cp.async.bulk.global.to.shared.cluster``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.cp.async.bulk.global.to.shared.cluster(ptr addrspace(3) %dst, ptr addrspace(3) %mbar, ptr addrspace(1) %src, i32 %size, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch)

Overview:
"""""""""

The '``@llvm.nvvm.cp.async.bulk.global.to.shared.cluster``' intrinsic
corresponds to the ``cp.async.bulk.shared::cluster.global.*`` family
of PTX instructions. These instructions initiate an asynchronous
copy of bulk data from global memory to shared::cluster memory.
The 32-bit operand ``%size`` specifies the amount of memory to be
copied, in bytes, and must be a multiple of 16.

* The last two arguments to these intrinsics are boolean flags
  indicating whether the multicast and/or cache_hint modifiers are used.
  These flag arguments must be compile-time constants. The backend
  looks through these flags and lowers the intrinsics appropriately.

* The last argument (``i1 %flag_ch``), when set, indicates
  a valid cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
  variant of the PTX instruction.

* The second-to-last argument (``i1 %flag_mc``), when set, indicates
  the presence of a multicast mask (``i16 %mc``) and generates the PTX
  instruction with the ``.multicast::cluster`` modifier.

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk>`_.
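
As a rough illustration of how the two flags pick a PTX variant, the sketch below mirrors the string concatenation that the ``CpAsyncBulkStr`` TableGen class in this patch performs for the global-to-shared case. The function name ``g2sMnemonic`` is hypothetical and exists only for this example; it is not an LLVM API.

```cpp
#include <string>

// Hypothetical helper (not part of LLVM): builds the PTX mnemonic for the
// global -> shared::cluster bulk copy from the two boolean flags, following
// the same order of modifiers as CpAsyncBulkStr<mc, ch>.G2S in this patch.
std::string g2sMnemonic(bool flagMc, bool flagCh) {
  std::string mnemonic =
      "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes";
  if (flagMc)
    mnemonic += ".multicast::cluster"; // multicast mask (%mc) is used
  if (flagCh)
    mnemonic += ".L2::cache_hint";     // cache hint (%ch) is used
  return mnemonic;
}
```

With both flags clear the base instruction is produced; setting either flag appends the corresponding modifier, and the multicast modifier always precedes the cache-hint modifier.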

'``llvm.nvvm.cp.async.bulk.shared.cta.to.global``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.cp.async.bulk.shared.cta.to.global(ptr addrspace(1) %dst, ptr addrspace(3) %src, i32 %size, i64 %ch, i1 %flag_ch)

Overview:
"""""""""

The '``@llvm.nvvm.cp.async.bulk.shared.cta.to.global``' intrinsic
corresponds to the ``cp.async.bulk.global.shared::cta.*`` set of PTX
instructions. These instructions initiate an asynchronous copy from
shared::cta to global memory. The 32-bit operand ``%size`` specifies
the amount of memory to be copied, in bytes, and must be a multiple of 16.

* The last argument to these intrinsics is a boolean flag
  indicating support for cache_hint. This flag argument must
  be a compile-time constant. When set, it indicates a valid
  cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
  variant of the PTX instruction.

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk>`_.
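
The multiple-of-16 requirement on ``%size`` can be checked before emitting a call to these intrinsics. The sketch below is one possible way to do that; both helper names are hypothetical, and neither the intrinsics nor this patch provide such a check.

```cpp
#include <cstdint>

// Hypothetical helpers (not LLVM APIs), assuming %size is a byte count.

// True when the size satisfies the "multiple of 16" requirement.
constexpr bool isValidBulkCopySize(std::uint32_t size) {
  return size % 16 == 0;
}

// Rounds a size up to the next multiple of 16. Only meaningful when the
// source and destination buffers are padded accordingly, and when
// size + 15 does not overflow 32 bits.
constexpr std::uint32_t roundUpBulkCopySize(std::uint32_t size) {
  return (size + 15u) & ~15u;
}
```

Rounding up changes how many bytes are copied, so a caller would normally prefer to reject an invalid size unless the buffers are known to be padded.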

'``llvm.nvvm.cp.async.bulk.shared.cta.to.cluster``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.cp.async.bulk.shared.cta.to.cluster(ptr addrspace(3) %dst, ptr addrspace(3) %mbar, ptr addrspace(3) %src, i32 %size)

Overview:
"""""""""

The '``@llvm.nvvm.cp.async.bulk.shared.cta.to.cluster``' intrinsic
corresponds to the ``cp.async.bulk.shared::cluster.shared::cta.*``
PTX instruction. This instruction initiates an asynchronous copy from
shared::cta to shared::cluster memory. The destination must be in
the shared memory of a different CTA within the cluster. The 32-bit
operand ``%size`` specifies the amount of memory to be copied, in bytes,
and must be a multiple of 16.

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk>`_.

'``llvm.nvvm.cp.async.bulk.tensor.g2s.tile.[1-5]d``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

43 changes: 43 additions & 0 deletions llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -4980,4 +4980,47 @@ foreach dim = [1, 2, 3, 4, 5] in {
}
}

// Intrinsics for Bulk Copy using TMA (non-tensor)
// From Global to Shared Cluster
def int_nvvm_cp_async_bulk_global_to_shared_cluster
  : DefaultAttrsIntrinsic<[],
      [llvm_shared_ptr_ty,  // dst_smem_ptr
       llvm_shared_ptr_ty,  // mbarrier_ptr
       llvm_global_ptr_ty,  // src_gmem_ptr
       llvm_i32_ty,         // copy_size
       llvm_i16_ty,         // cta_mask
       llvm_i64_ty,         // cache_hint
       llvm_i1_ty,          // Flag for cta_mask
       llvm_i1_ty],         // Flag for cache_hint
      [IntrConvergent, IntrArgMemOnly,
       WriteOnly<ArgIndex<0>>, ReadOnly<ArgIndex<2>>,
       NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>,
       NoCapture<ArgIndex<2>>, ImmArg<ArgIndex<6>>,
       ImmArg<ArgIndex<7>>]>;

// From Shared CTA to Shared Cluster
def int_nvvm_cp_async_bulk_shared_cta_to_cluster
  : DefaultAttrsIntrinsic<[],
      [llvm_shared_ptr_ty,  // dst_smem_ptr
       llvm_shared_ptr_ty,  // mbarrier_ptr
       llvm_shared_ptr_ty,  // src_smem_ptr
       llvm_i32_ty],        // copy_size
      [IntrConvergent, IntrArgMemOnly,
       WriteOnly<ArgIndex<0>>, ReadOnly<ArgIndex<2>>,
       NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>,
       NoCapture<ArgIndex<2>>]>;

// From Shared CTA to Global memory
def int_nvvm_cp_async_bulk_shared_cta_to_global
  : DefaultAttrsIntrinsic<[],
      [llvm_global_ptr_ty,  // dst_gmem_ptr
       llvm_shared_ptr_ty,  // src_smem_ptr
       llvm_i32_ty,         // copy_size
       llvm_i64_ty,         // cache_hint
       llvm_i1_ty],         // Flag for cache_hint
      [IntrConvergent, IntrArgMemOnly,
       WriteOnly<ArgIndex<0>>, ReadOnly<ArgIndex<1>>,
       NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>,
       ImmArg<ArgIndex<4>>]>;

} // let TargetPrefix = "nvvm"
75 changes: 75 additions & 0 deletions llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
@@ -3024,13 +3024,88 @@ void NVPTXDAGToDAGISel::SelectCpAsyncBulkTensorReduceCommon(SDNode *N,
  ReplaceNode(N, CurDAG->getMachineNode(Opcode, DL, N->getVTList(), Ops));
}

void NVPTXDAGToDAGISel::SelectCpAsyncBulkS2G(SDNode *N) {
  // We have {Chain, Intrinsic-ID} followed by the actual intrinsic args:
  // dst, src, size, cache_hint, cache_hint_flag
  // NumOperands = {Chain, IID} + {Actual intrinsic args}
  //             = {2}          + {5}
  size_t NumOps = N->getNumOperands();
  bool IsCacheHint = N->getConstantOperandVal(NumOps - 1) == 1;
  size_t NumArgs = IsCacheHint ? 4 : 3; // dst, src, size[, cache_hint]

  SDLoc DL(N);
  SmallVector<SDValue, 8> Ops(N->ops().slice(2, NumArgs));
  Ops.push_back(N->getOperand(0)); // Chain operand

  bool IsShared32 =
      CurDAG->getDataLayout().getPointerSizeInBits(ADDRESS_SPACE_SHARED) == 32;
  unsigned Opcode;
  if (IsCacheHint)
    Opcode = IsShared32 ? NVPTX::CP_ASYNC_BULK_S2G_SHARED32_CH
                        : NVPTX::CP_ASYNC_BULK_S2G_CH;
  else
    Opcode = IsShared32 ? NVPTX::CP_ASYNC_BULK_S2G_SHARED32
                        : NVPTX::CP_ASYNC_BULK_S2G;
  ReplaceNode(N, CurDAG->getMachineNode(Opcode, DL, N->getVTList(), Ops));
}

void NVPTXDAGToDAGISel::SelectCpAsyncBulkG2S(SDNode *N) {
  // We have {Chain, Intrinsic-ID} followed by the actual intrinsic args:
  // {dst, mbar, src, size, multicast, cache_hint,
  // multicast_flag, cache_hint_flag}
  // NumOperands = {Chain, IID} + {Actual intrinsic args}
  //             = {2}          + {8}
  size_t NumOps = N->getNumOperands();
  bool IsCacheHint = N->getConstantOperandVal(NumOps - 1) == 1;
  bool IsMultiCast = N->getConstantOperandVal(NumOps - 2) == 1;
  size_t NumBaseArgs = 4;                // dst, mbar, src, size
  size_t MultiCastIdx = NumBaseArgs + 2; // for Chain and IID

  SDLoc DL(N);
  SmallVector<SDValue, 8> Ops(N->ops().slice(2, NumBaseArgs));

  // Push MultiCast operand, if available
  if (IsMultiCast)
    Ops.push_back(N->getOperand(MultiCastIdx));

  // Push CacheHint operand, if available
  if (IsCacheHint)
    Ops.push_back(N->getOperand(MultiCastIdx + 1));

  // Finally, the chain operand
  Ops.push_back(N->getOperand(0));

  bool IsShared32 =
      CurDAG->getDataLayout().getPointerSizeInBits(ADDRESS_SPACE_SHARED) == 32;
  unsigned Opcode = [&]() {
    if (IsMultiCast && IsCacheHint)
      return IsShared32 ? NVPTX::CP_ASYNC_BULK_G2S_SHARED32_MC_CH
                        : NVPTX::CP_ASYNC_BULK_G2S_MC_CH;
    if (IsMultiCast)
      return IsShared32 ? NVPTX::CP_ASYNC_BULK_G2S_SHARED32_MC
                        : NVPTX::CP_ASYNC_BULK_G2S_MC;
    if (IsCacheHint)
      return IsShared32 ? NVPTX::CP_ASYNC_BULK_G2S_SHARED32_CH
                        : NVPTX::CP_ASYNC_BULK_G2S_CH;
    return IsShared32 ? NVPTX::CP_ASYNC_BULK_G2S_SHARED32
                      : NVPTX::CP_ASYNC_BULK_G2S;
  }();
  ReplaceNode(N, CurDAG->getMachineNode(Opcode, DL, N->getVTList(), Ops));
}
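
Reviewer note: the operand selection in SelectCpAsyncBulkG2S can be modelled outside of LLVM to check the index arithmetic. The sketch below is a standalone approximation in which plain strings stand in for SDValue operands; the function name selectG2SOps and the string encoding of the flags are assumptions made for this example only.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Standalone model of the operand selection above (not LLVM code).
// Expected layout of nodeOps, matching the comment in the real function:
//   chain, iid, dst, mbar, src, size, mc, ch, flag_mc, flag_ch
// The flags are encoded as "0" or "1" purely for illustration.
std::vector<std::string>
selectG2SOps(const std::vector<std::string> &nodeOps) {
  const std::size_t numOps = nodeOps.size();
  const bool isCacheHint = nodeOps[numOps - 1] == "1";
  const bool isMultiCast = nodeOps[numOps - 2] == "1";
  const std::size_t numBaseArgs = 4;                // dst, mbar, src, size
  const std::size_t multiCastIdx = numBaseArgs + 2; // skip chain and iid

  // Equivalent of N->ops().slice(2, NumBaseArgs).
  std::vector<std::string> ops(nodeOps.begin() + 2,
                               nodeOps.begin() + 2 + numBaseArgs);
  if (isMultiCast)
    ops.push_back(nodeOps[multiCastIdx]);     // multicast mask
  if (isCacheHint)
    ops.push_back(nodeOps[multiCastIdx + 1]); // cache hint
  ops.push_back(nodeOps[0]);                  // chain goes last
  return ops;
}
```

With flag_mc clear and flag_ch set, the model keeps the four base operands, skips the multicast mask, appends the cache hint, and ends with the chain, which matches the operand list of the _CH instruction variants.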

bool NVPTXDAGToDAGISel::tryIntrinsicVoid(SDNode *N) {
  unsigned IID = N->getConstantOperandVal(1);
  using TMARedTy = llvm::nvvm::TMAReductionOp;
  auto CastTy = [](TMARedTy Op) { return static_cast<unsigned>(Op); };
  switch (IID) {
  default:
    return false;
  case Intrinsic::nvvm_cp_async_bulk_global_to_shared_cluster:
    SelectCpAsyncBulkG2S(N);
    return true;
  case Intrinsic::nvvm_cp_async_bulk_shared_cta_to_global:
    SelectCpAsyncBulkS2G(N);
    return true;
  case Intrinsic::nvvm_cp_async_bulk_tensor_s2g_tile_1d:
  case Intrinsic::nvvm_cp_async_bulk_tensor_s2g_tile_2d:
  case Intrinsic::nvvm_cp_async_bulk_tensor_s2g_tile_3d:
2 changes: 2 additions & 0 deletions llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
@@ -90,6 +90,8 @@ class LLVM_LIBRARY_VISIBILITY NVPTXDAGToDAGISel : public SelectionDAGISel {
  bool tryEXTRACT_VECTOR_ELEMENT(SDNode *N);
  void SelectV2I64toI128(SDNode *N);
  void SelectI128toV2I64(SDNode *N);
  void SelectCpAsyncBulkG2S(SDNode *N);
  void SelectCpAsyncBulkS2G(SDNode *N);
  void SelectCpAsyncBulkTensorG2SCommon(SDNode *N, bool IsIm2Col = false);
  void SelectCpAsyncBulkTensorS2GCommon(SDNode *N, bool IsIm2Col = false);
  void SelectCpAsyncBulkTensorPrefetchCommon(SDNode *N, bool IsIm2Col = false);
68 changes: 65 additions & 3 deletions llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -498,9 +498,71 @@ def CP_ASYNC_BULK_WAIT_GROUP_READ :
[(int_nvvm_cp_async_bulk_wait_group_read timm:$n)]>,
Requires<[hasPTX<80>, hasSM<90>]>;

//-----------------------------------
// TMA Async Tensor Copy Functions
//-----------------------------------
//------------------------------
// TMA Async Bulk Copy Functions
//------------------------------

class CpAsyncBulkStr<bit mc, bit ch> {
  // Shared to Global memory
  string S2G = "cp.async.bulk.global.shared::cta.bulk_group"
               # !if(ch, ".L2::cache_hint", "");

  // Global to Shared cluster memory
  string G2S = "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes"
               # !if(mc, ".multicast::cluster", "")
               # !if(ch, ".L2::cache_hint", "");

  // Shared CTA to Cluster memory
  string C2C = "cp.async.bulk.shared::cluster.shared::cta.mbarrier::complete_tx::bytes";
}

multiclass CP_ASYNC_BULK_S2G<NVPTXRegClass rc> {
  def NAME: NVPTXInst<(outs),
            (ins Int64Regs:$dst, rc:$src, Int32Regs:$size),
            !strconcat(CpAsyncBulkStr<0, 0>.S2G, " [$dst], [$src], $size;"), []>,
            Requires<[hasPTX<80>, hasSM<90>]>;
  def NAME # _CH: NVPTXInst<(outs),
                  (ins Int64Regs:$dst, rc:$src, Int32Regs:$size, Int64Regs:$ch),
                  !strconcat(CpAsyncBulkStr<0, 1>.S2G, " [$dst], [$src], $size, $ch;"), []>,
                  Requires<[hasPTX<80>, hasSM<90>]>;
}
defm CP_ASYNC_BULK_S2G : CP_ASYNC_BULK_S2G<Int64Regs>;
defm CP_ASYNC_BULK_S2G_SHARED32 : CP_ASYNC_BULK_S2G<Int32Regs>;

multiclass CP_ASYNC_BULK_G2S<NVPTXRegClass rc> {
  def NAME: NVPTXInst<(outs),
            (ins rc:$dst, rc:$mbar, Int64Regs:$src, Int32Regs:$size),
            !strconcat(CpAsyncBulkStr<0, 0>.G2S, " [$dst], [$src], $size, [$mbar];"), []>,
            Requires<[hasPTX<80>, hasSM<90>]>;
  def NAME # _MC: NVPTXInst<(outs),
                  (ins rc:$dst, rc:$mbar, Int64Regs:$src, Int32Regs:$size, Int16Regs:$mc),
                  !strconcat(CpAsyncBulkStr<1, 0>.G2S, " [$dst], [$src], $size, [$mbar], $mc;"), []>,
                  Requires<[hasPTX<80>, hasSM<90>]>;
  def NAME # _CH: NVPTXInst<(outs),
                  (ins rc:$dst, rc:$mbar, Int64Regs:$src, Int32Regs:$size, Int64Regs:$ch),
                  !strconcat(CpAsyncBulkStr<0, 1>.G2S, " [$dst], [$src], $size, [$mbar], $ch;"), []>,
                  Requires<[hasPTX<80>, hasSM<90>]>;
  def NAME # _MC_CH: NVPTXInst<(outs),
                     (ins rc:$dst, rc:$mbar, Int64Regs:$src, Int32Regs:$size, Int16Regs:$mc, Int64Regs:$ch),
                     !strconcat(CpAsyncBulkStr<1, 1>.G2S, " [$dst], [$src], $size, [$mbar], $mc, $ch;"), []>,
                     Requires<[hasPTX<80>, hasSM<90>]>;
}
defm CP_ASYNC_BULK_G2S : CP_ASYNC_BULK_G2S<Int64Regs>;
defm CP_ASYNC_BULK_G2S_SHARED32 : CP_ASYNC_BULK_G2S<Int32Regs>;

multiclass CP_ASYNC_BULK_CTA_TO_CLUSTER<NVPTXRegClass rc> {
  def NAME: NVPTXInst<(outs),
            (ins rc:$dst, rc:$mbar, rc:$src, Int32Regs:$size),
            !strconcat(CpAsyncBulkStr<0, 0>.C2C, " [$dst], [$src], $size, [$mbar];"),
            [(int_nvvm_cp_async_bulk_shared_cta_to_cluster rc:$dst, rc:$mbar, rc:$src, Int32Regs:$size)]>,
            Requires<[hasPTX<80>, hasSM<90>]>;
}
defm CP_ASYNC_BULK_CTA_TO_CLUSTER : CP_ASYNC_BULK_CTA_TO_CLUSTER<Int64Regs>;
defm CP_ASYNC_BULK_CTA_TO_CLUSTER_SHARED32 : CP_ASYNC_BULK_CTA_TO_CLUSTER<Int32Regs>;

//-------------------------------------
// TMA Async Bulk Tensor Copy Functions
//-------------------------------------

// From Global to Shared memory (G2S)
class G2S_STRINGS<int dim, string mode, bit mc, bit ch, bit is_shared32 = 0> {