[NVPTX] Add TMA bulk tensor reduction intrinsics #116854
Conversation
@llvm/pr-subscribers-backend-nvptx @llvm/pr-subscribers-llvm-ir

Author: Durgadoss R (durga4github)

Changes

This patch adds NVVM intrinsics and NVPTX codegen for:
* cp.async.bulk.tensor.reduce.1D -> 5D variants, supporting both Tile and Im2Col modes.
* These intrinsics optionally support cache_hints, as indicated by the boolean flag argument.
* Lit tests are added for all combinations of these intrinsics in cp-async-bulk-tensor-reduce.ll.
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in the NVPTXUsage.rst file.

PTX Spec reference: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor

Patch is 65.48 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/116854.diff

9 Files Affected:
diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index 2152de9709dc6e..ed29d87edbad8e 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -663,6 +663,84 @@ the same functionality as described in the ``tile`` mode intrinsics above.
For more information, refer PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch-tensor>`_.
+'``llvm.nvvm.cp.async.bulk.tensor.reduce.tile.[1-5]d``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch, i8 %flag_red_op)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(..., i32 %d0, i32 %d1, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.cp.async.bulk.tensor.reduce.tile.[1-5]d``' intrinsics
+correspond to the ``cp.reduce.async.bulk.tensor.[1-5]d.*`` set of PTX instructions.
+These instructions initiate an asynchronous reduction operation of tensor data
+in global memory with tensor data in shared::cta memory, using ``tile`` mode.
+The dimension of the tensor data ranges from 1d to 5d with the coordinates
+specified by the ``i32 %d0 ... i32 %d4`` arguments.
+
+* The last two arguments to these intrinsics are flags.
+ These flag arguments must be compile-time constants. The backend
+ looks through these flags and lowers the intrinsics appropriately.
+
+* The Nth argument (denoted by ``i8 flag_red_op``) indicates the
+ kind of reduction operation performed. The argument must be in
+ the range [0, 7], representing the following reduction operations:
+
+ ========== =============
+ Enum Value Reduction Op
+ ========== =============
+ ``0`` ADD
+ ``1`` MIN
+ ``2`` MAX
+ ``3`` INC
+ ``4`` DEC
+ ``5`` AND
+ ``6`` OR
+ ``7`` XOR
+ ========== =============
+
+* The [N-1]th argument (denoted by ``i1 flag_ch``), when set, indicates
+ the presence of a valid cache_hint (``i64 %ch``) and generates the
+ ``.L2::cache_hint`` variant of the PTX instruction.
+
+For more information, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor>`_.
+
+'``llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.[3-5]d``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 %flag_ch, i8 %flag_red_op)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.[3-5]d``' intrinsics
+correspond to the ``cp.reduce.async.bulk.tensor.[3-5]d.*`` set of PTX instructions.
+These instructions initiate an asynchronous reduction operation of tensor data
+in global memory with tensor data in shared::cta memory, using ``im2col`` mode.
+In this mode, the tensor has to be at least three-dimensional.
+The last two arguments of these intrinsics are compile-time flags,
+with the same functionality as described in the ``tile`` mode intrinsics above.
+
+For more information, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor>`_.
+
Other Intrinsics
----------------
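For a concrete picture of the operand layout documented above, here is a minimal LLVM IR sketch (the function name, dimension values, and choice of reduction ops are illustrative, not taken from the patch): a 2D tile-mode ADD reduction with a cache hint, followed by a 3D im2col-mode MIN reduction without one.

```llvm
declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(ptr addrspace(3), ptr, i32, i32, i64, i1, i8)
declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3), ptr, i32, i32, i32, i64, i1, i8)

define void @tma_reduce_examples(ptr addrspace(3) %src, ptr %tmap, i64 %ch) {
  ; Tile mode, 2D, ADD (flag_red_op = 0), with a valid cache hint (flag_ch = 1).
  call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(ptr addrspace(3) %src, ptr %tmap, i32 16, i32 32, i64 %ch, i1 1, i8 0)
  ; Im2col mode, 3D, MIN (flag_red_op = 1); flag_ch = 0, so the i64 cache-hint operand is unused.
  call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tmap, i32 8, i32 8, i32 4, i64 0, i1 0, i8 1)
  ret void
}
```

Both trailing flag operands must be compile-time immediates; the ImmArg properties added in IntrinsicsNVVM.td below reject non-constant values.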
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index 115fcee0b04f22..b55b71a9418baa 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -635,6 +635,26 @@ class CP_ASYNC_BULK_TENSOR_PREFETCH_INTR<int dim, string mode> {
ImmArg<ArgIndex<FlagsStartIdx>>];
}
+class CP_ASYNC_BULK_TENSOR_REDUCE_INTR<int dim, string mode> {
+ string Name = "int_nvvm_cp_async_bulk_tensor_reduce_" # mode # "_" # dim # "d";
+
+ list<LLVMType> TensorDimsTy = !listsplat(llvm_i32_ty, dim);
+ list<LLVMType> ArgsTy = !listconcat(
+ [llvm_shared_ptr_ty, // src_smem_ptr
+ llvm_ptr_ty], // tensormap_ptr
+ TensorDimsTy, // actual tensor dims
+ [llvm_i64_ty, // cache_hint
+ llvm_i1_ty, // Flag for cache_hint
+ llvm_i8_ty] // Flag for Reduction Op
+ );
+ int FlagsStartIdx = !add(dim, 3);
+ list<IntrinsicProperty> IntrProp = [IntrConvergent,
+ ReadOnly<ArgIndex<0>>, ReadOnly<ArgIndex<1>>,
+ NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>,
+ ImmArg<ArgIndex<FlagsStartIdx>>,
+ ImmArg<ArgIndex<!add(FlagsStartIdx, 1)>>];
+}
+
let TargetPrefix = "nvvm" in {
def int_nvvm_prmt : ClangBuiltin<"__nvvm_prmt">,
DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
@@ -4926,6 +4946,8 @@ foreach dim = [1, 2, 3, 4, 5] in {
def s2g.Name : DefaultAttrsIntrinsic<[], s2g.ArgsTy, s2g.IntrProp>;
foreach prefetch = [CP_ASYNC_BULK_TENSOR_PREFETCH_INTR<dim, mode>] in
def prefetch.Name : DefaultAttrsIntrinsic<[], prefetch.ArgsTy, prefetch.IntrProp>;
+ foreach reduce = [CP_ASYNC_BULK_TENSOR_REDUCE_INTR<dim, mode>] in
+ def reduce.Name : DefaultAttrsIntrinsic<[], reduce.ArgsTy, reduce.IntrProp>;
}
}
diff --git a/llvm/include/llvm/IR/NVVMIntrinsicFlags.h b/llvm/include/llvm/IR/NVVMIntrinsicFlags.h
new file mode 100644
index 00000000000000..c82c4044e03cfa
--- /dev/null
+++ b/llvm/include/llvm/IR/NVVMIntrinsicFlags.h
@@ -0,0 +1,37 @@
+//===--- NVVMIntrinsicFlags.h -----------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+/// \file
+/// This file contains the definitions of the enumerations and flags
+/// associated with NVVM Intrinsics.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_SUPPORT_NVVMINTRINSICFLAGS_H
+#define LLVM_SUPPORT_NVVMINTRINSICFLAGS_H
+
+#include <cstdint>
+
+namespace llvm {
+namespace nvvm {
+
+// Reduction Ops supported with TMA Copy from Shared
+// to Global Memory for the "cp.reduce.async.bulk.tensor.*"
+// family of PTX instructions.
+enum class TMAReductionOp : uint8_t {
+ ADD = 0,
+ MIN = 1,
+ MAX = 2,
+ INC = 3,
+ DEC = 4,
+ AND = 5,
+ OR = 6,
+ XOR = 7,
+};
+
+} // namespace nvvm
+} // namespace llvm
+#endif // LLVM_SUPPORT_NVVMINTRINSICFLAGS_H
diff --git a/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp b/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp
index 7af3f76249d61d..7c80736adf3e8e 100644
--- a/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp
+++ b/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp
@@ -14,6 +14,7 @@
#include "NVPTX.h"
#include "NVPTXUtilities.h"
#include "llvm/ADT/StringRef.h"
+#include "llvm/IR/NVVMIntrinsicFlags.h"
#include "llvm/MC/MCExpr.h"
#include "llvm/MC/MCInst.h"
#include "llvm/MC/MCInstrInfo.h"
@@ -416,3 +417,39 @@ void NVPTXInstPrinter::printPrmtMode(const MCInst *MI, int OpNum,
return;
}
}
+
+void NVPTXInstPrinter::printTmaReductionMode(const MCInst *MI, int OpNum,
+ raw_ostream &O,
+ const char *Modifier) {
+ const MCOperand &MO = MI->getOperand(OpNum);
+
+ switch (static_cast<nvvm::TMAReductionOp>(MO.getImm())) {
+ case nvvm::TMAReductionOp::ADD:
+ O << ".add";
+ return;
+ case nvvm::TMAReductionOp::MIN:
+ O << ".min";
+ return;
+ case nvvm::TMAReductionOp::MAX:
+ O << ".max";
+ return;
+ case nvvm::TMAReductionOp::INC:
+ O << ".inc";
+ return;
+ case nvvm::TMAReductionOp::DEC:
+ O << ".dec";
+ return;
+ case nvvm::TMAReductionOp::AND:
+ O << ".and";
+ return;
+ case nvvm::TMAReductionOp::OR:
+ O << ".or";
+ return;
+ case nvvm::TMAReductionOp::XOR:
+ O << ".xor";
+ return;
+ default:
+    llvm_unreachable("Invalid Reduction Op in printTmaReductionMode");
+ }
+}
diff --git a/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.h b/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.h
index 2ce40bd6e8b973..2b19386ef17fe5 100644
--- a/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.h
+++ b/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.h
@@ -54,6 +54,8 @@ class NVPTXInstPrinter : public MCInstPrinter {
raw_ostream &O, const char *Modifier = nullptr);
void printPrmtMode(const MCInst *MI, int OpNum, raw_ostream &O,
const char *Modifier = nullptr);
+ void printTmaReductionMode(const MCInst *MI, int OpNum, raw_ostream &O,
+ const char *Modifier = nullptr);
};
}
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
index 173c37cfd8c8f7..15172a8d06fddb 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
@@ -4157,9 +4157,9 @@ bool NVPTXScopes::empty() const { return Scopes.size() == 0; }
? NVPTX::CP_ASYNC_BULK_TENSOR_##dir##_##dim##_SHARED32_##mode##suffix \
: NVPTX::CP_ASYNC_BULK_TENSOR_##dir##_##dim##_##mode##suffix)
-#define GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(dim, mode) \
- (IsCacheHint ? (CP_ASYNC_BULK_TENSOR_OPCODE(S2G, dim, mode, _CH)) \
- : (CP_ASYNC_BULK_TENSOR_OPCODE(S2G, dim, mode, )))
+#define GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(op, dim, mode) \
+ (IsCacheHint ? (CP_ASYNC_BULK_TENSOR_OPCODE(op, dim, mode, _CH)) \
+ : (CP_ASYNC_BULK_TENSOR_OPCODE(op, dim, mode, )))
#define GET_CP_ASYNC_BULK_TENSOR_OPCODE_G2S(dim, mode) \
[&]() -> auto { \
@@ -4177,15 +4177,19 @@ bool NVPTXScopes::empty() const { return Scopes.size() == 0; }
: NVPTX::CP_ASYNC_BULK_TENSOR_PREFETCH_##dim##_##mode)
static unsigned GetCpAsyncBulkTensorS2GOpcode(size_t Dim, bool IsShared32,
- bool IsCacheHint, bool IsIm2Col) {
+ bool IsCacheHint, bool IsIm2Col,
+ bool IsReduce = false) {
if (IsIm2Col) {
switch (Dim) {
case 3:
- return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(3D, IM2COL);
+ return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 3D, IM2COL)
+ : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 3D, IM2COL);
case 4:
- return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(4D, IM2COL);
+ return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 4D, IM2COL)
+ : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 4D, IM2COL);
case 5:
- return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(5D, IM2COL);
+ return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 5D, IM2COL)
+ : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 5D, IM2COL);
default:
llvm_unreachable("Invalid Dimension in im2col mode for "
"GetCpAsyncBulkTensorS2GOpcode.");
@@ -4193,15 +4197,20 @@ static unsigned GetCpAsyncBulkTensorS2GOpcode(size_t Dim, bool IsShared32,
} else {
switch (Dim) {
case 1:
- return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(1D, TILE);
+ return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 1D, TILE)
+ : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 1D, TILE);
case 2:
- return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(2D, TILE);
+ return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 2D, TILE)
+ : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 2D, TILE);
case 3:
- return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(3D, TILE);
+ return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 3D, TILE)
+ : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 3D, TILE);
case 4:
- return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(4D, TILE);
+ return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 4D, TILE)
+ : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 4D, TILE);
case 5:
- return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(5D, TILE);
+ return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 5D, TILE)
+ : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 5D, TILE);
default:
llvm_unreachable(
"Invalid Dimension in tile mode for GetCpAsyncBulkTensorS2GOpcode.");
@@ -4377,6 +4386,30 @@ void NVPTXDAGToDAGISel::SelectCpAsyncBulkTensorPrefetchCommon(SDNode *N,
ReplaceNode(N, CurDAG->getMachineNode(Opcode, DL, N->getVTList(), Ops));
}
+void NVPTXDAGToDAGISel::SelectCpAsyncBulkTensorReduceCommon(SDNode *N,
+ bool IsIm2Col) {
+  // We have {Chain, Intrinsic-ID} followed by the actual intrinsic args:
+ // src, dst, dims{d0...dN}, cache_hint, cache_hint_flag, reduction_kind_flag
+ // NumOperands = {Chain, IID} + {Actual intrinsic args}
+ // = {2} + {5 + dims}
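+  // e.g., the tile.3d variant has 2 + (5 + 3) = 10 operands, so NumDims = 10 - 7 = 3.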
+ size_t NumOps = N->getNumOperands();
+ size_t NumDims = NumOps - 7;
+ unsigned ReductionKind = N->getConstantOperandVal(NumOps - 1);
+ bool IsCacheHint = N->getConstantOperandVal(NumOps - 2) == 1;
+  size_t NumArgs = NumDims + (IsCacheHint ? 3 : 2); // src, dst[, cache_hint]
+
+ SDLoc DL(N);
+ SmallVector<SDValue, 12> Ops(N->ops().slice(2, NumArgs));
+ Ops.push_back(getI32Imm(ReductionKind, DL)); // Reduction Op
+ Ops.push_back(N->getOperand(0)); // Chain operand
+
+ bool IsShared32 =
+ CurDAG->getDataLayout().getPointerSizeInBits(ADDRESS_SPACE_SHARED) == 32;
+ unsigned Opcode = GetCpAsyncBulkTensorS2GOpcode(
+ NumDims, IsShared32, IsCacheHint, IsIm2Col, /*IsReduce=*/true);
+ ReplaceNode(N, CurDAG->getMachineNode(Opcode, DL, N->getVTList(), Ops));
+}
+
bool NVPTXDAGToDAGISel::tryIntrinsicVoid(SDNode *N) {
unsigned IID = N->getConstantOperandVal(1);
switch (IID) {
@@ -4418,5 +4451,17 @@ bool NVPTXDAGToDAGISel::tryIntrinsicVoid(SDNode *N) {
case Intrinsic::nvvm_cp_async_bulk_tensor_prefetch_im2col_5d:
SelectCpAsyncBulkTensorPrefetchCommon(N, /*IsIm2Col=*/true);
return true;
+ case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_tile_1d:
+ case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_tile_2d:
+ case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_tile_3d:
+ case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_tile_4d:
+ case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_tile_5d:
+ SelectCpAsyncBulkTensorReduceCommon(N);
+ return true;
+ case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_im2col_3d:
+ case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_im2col_4d:
+ case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_im2col_5d:
+ SelectCpAsyncBulkTensorReduceCommon(N, /*IsIm2Col=*/true);
+ return true;
}
}
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
index d6c80a31b7463d..0a79428fcec2d8 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
@@ -95,6 +95,8 @@ class LLVM_LIBRARY_VISIBILITY NVPTXDAGToDAGISel : public SelectionDAGISel {
void SelectCpAsyncBulkTensorG2SCommon(SDNode *N, bool IsIm2Col = false);
void SelectCpAsyncBulkTensorS2GCommon(SDNode *N, bool IsIm2Col = false);
void SelectCpAsyncBulkTensorPrefetchCommon(SDNode *N, bool IsIm2Col = false);
+ void SelectCpAsyncBulkTensorReduceCommon(SDNode *N, bool IsIm2Col = false);
+
inline SDValue getI32Imm(unsigned Imm, const SDLoc &DL) {
return CurDAG->getTargetConstant(Imm, DL, MVT::i32);
}
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index 5878940812f62b..103c92b608dfba 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -564,17 +564,19 @@ foreach dim = [1, 2, 3, 4, 5] in {
}
// From Shared to Global memory (S2G)
-class S2G_STRINGS<int dim, string mode, bit ch, bit is_shared32 = 0> {
- string prefix = "cp.async.bulk.tensor";
+class S2G_STRINGS<int dim, string mode, bit ch,
+ bit is_shared32 = 0, bit is_reduce = 0> {
string dir = "global.shared::cta";
string completion = "bulk_group";
- string inst_name = prefix
+ string inst_name = !if(is_reduce, "cp.reduce", "cp")
+ # ".async.bulk.tensor"
# "." # dim # "d"
# "." # dir
# "." # mode
# "." # completion
# !if(ch, ".L2::cache_hint", "");
- string intr_name = "CP_ASYNC_BULK_TENSOR_S2G_"
+ string intr_name = "CP_ASYNC_BULK_TENSOR_"
+ # !if(is_reduce, "RED_", "S2G_")
# dim # "D"
# !if(is_shared32, "_SHARED32", "")
# !if(!eq(mode, "tile"), "_TILE", "_IM2COL");
@@ -596,11 +598,37 @@ multiclass CP_ASYNC_BULK_TENSOR_S2G_INTR<int dim, bit shared32, string mode> {
Requires<[hasPTX<80>, hasSM<90>]>;
}
+def TMAReductionFlags : Operand<i32> {
+ let PrintMethod = "printTmaReductionMode";
+}
+
+// TMA Copy from Shared to Global memory with Reduction
+multiclass CP_ASYNC_BULK_TENSOR_REDUCE_INTR<int dim, bit shared32, string mode> {
+ defvar dims_dag = !dag(ins, !listsplat(Int32Regs, dim), !foreach(i, !range(dim), "d" # i));
+ defvar dims_str = !interleave(!foreach(i, !range(dim), "$d" # i), ", ");
+ defvar asm_str = " [$tmap, {{" # dims_str # "}}], [$src]";
+ defvar rc = !if(shared32, Int32Regs, Int64Regs);
+
+ defvar prefix = "cp.reduce.async.bulk.tensor" # "." # dim # "d" # ".global.shared::cta";
+ defvar suffix = "." # mode # ".bulk_group";
+
+ def "": NVPTXInst<(outs),
+ !con((ins rc:$src, Int64Regs:$tmap), dims_dag, (ins TMAReductionFlags:$red_op)),
+ !strconcat(prefix, "${red_op}", suffix, asm_str, ";"), []>,
+ Requires<[hasPTX<80>, hasSM<90>]>;
+ def _CH: NVPTXInst<(outs),
+ !con((ins rc:$src, Int64Regs:$tmap), dims_dag, (ins Int64Regs:$ch, TMAReductionFlags:$red_op)),
+ !strconcat(prefix, "${red_op}", suffix, ".L2::cache_hint", asm_str, ", $ch;"), []>,
+ Requires<[hasPTX<80>, hasSM<90>]>;
+}
+
foreach dim = [1, 2, 3, 4, 5] in {
foreach shared32 = [true, false] in {
foreach mode = !if(!ge(dim, 3), ["tile", "im2col_no_offs"], ["tile"]) in {
defm S2G_STRINGS<dim, mode, 0, shared32>.intr_name :
CP_ASYNC_BULK_TENSOR_S2G_INTR<dim, shared32, mode>;
+ defm S2G_STRINGS<dim, mode, 0, shared32, 1>.intr_name :
+ CP_ASYNC_BULK_TENSOR_REDUCE_INTR<dim, shared32, mode>;
}
}
}
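To see what the asm strings assembled above expand to, here is a sketch of the expected lowering for the simplest case. The PTX line is derived from the prefix/suffix/asm_str variables in the multiclass (not copied from the test file below), and the virtual register numbers are illustrative.

```llvm
declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3), ptr, i32, i64, i1, i8)

define void @reduce_1d_add(ptr addrspace(3) %src, ptr %tmap, i32 %d0) {
  ; Expected PTX (sketch, assuming 64-bit shared pointers):
  ;   cp.reduce.async.bulk.tensor.1d.global.shared::cta.add.tile.bulk_group [%rd2, {%r1}], [%rd1];
  call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i64 0, i1 0, i8 0)
  ret void
}
```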
diff --git a/llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-reduce.ll b/llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-reduce.ll
new file mode 100644
index 00000000000000..28713109e742fa
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-reduce.ll
@@ -0,0 +1,426 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx80 | FileCheck --check-prefixes=CHECK-PTX %s
+; RUN: %if ptxas-12.3 %{ llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx80 | %ptxas-verify -arch=sm_90 %}
+
+target triple = "nvptx64-nvidia-cuda"
+
+declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %s, ptr %tm, i32 %d0, i64 %ch, i1 %flag_ch, i8 %flag_red);
+declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(p...
[truncated]
@Artem-B, kindly help with the review.
Ping.
✅ With the latest revision this PR passed the C/C++ code formatter.
LGTM with a minor nit.
This patch adds NVVM intrinsics and NVPTX codegen for:

* cp.async.bulk.tensor.reduce.1D -> 5D variants, supporting both Tile and Im2Col modes.
* These intrinsics optionally support cache_hints as indicated by the boolean flag argument.
* Lit tests are added for all combinations of these intrinsics in cp-async-bulk-tensor-reduce.ll.
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in NVPTXUsage.rst file.

PTX Spec reference: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/168/builds/5960. Here is the relevant piece of the build log for reference:
@kazutakahirata Thanks for merging the fix.

If you could send me a separate PR, I'll LGTM it right away. Thanks!
PR #116854 adds intrinsics for TMA Store with reduction. This patch adds an NVVM Dialect Op for the same.

* Lit tests are added to verify the lowering to LLVM intrinsics and invalid cases.
* The common verifier method is updated to handle im2col modes without offsets. This helps Ops like TMA Store, TMA StoreReduce etc.
* The nvvmir.mlir test file is already large. So, this patch adds the tests for this Op in a new file under a separate "nvvm/" directory. [mlir/test/Target/LLVMIR/"nvvm"/tma_store_reduce.mlir]

PTX Spec reference: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
This patch adds NVVM intrinsics and NVPTX codegen for the cp.async.bulk.tensor.reduce 1D -> 5D variants, in both Tile and Im2Col modes, with optional cache_hint support.

PTX Spec reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor