[NVPTX] Add TMA bulk tensor reduction intrinsics #116854

Merged: 1 commit merged into llvm:main from durgadossr/nvptx_tma_reduce on Nov 27, 2024

Conversation

durga4github
Contributor

This patch adds NVVM intrinsics and NVPTX codegen for:

  • cp.async.bulk.tensor.reduce.1D -> 5D variants, supporting both Tile and Im2Col modes.
  • These intrinsics optionally support a cache hint, as indicated by a boolean flag argument.
  • Lit tests are added for all combinations of these intrinsics in cp-async-bulk-tensor-reduce.ll.
  • The generated PTX is verified with a 12.3 ptxas executable.
  • Docs for these intrinsics are added in the NVPTXUsage.rst file.

PTX Spec reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor
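
For illustration, here is a minimal LLVM IR sketch of calling the 1D tile-mode intrinsic. The function name, the coordinate value, and the choice of an ADD reduction with no cache hint are assumptions made for the example, not taken from the patch's tests:

  define void @reduce_add_1d(ptr addrspace(3) %src, ptr %tmap) {
    ; d0 = 128 is the tensor coordinate; i64 0 is an unused cache-hint value,
    ; i1 0 disables the cache hint, and i8 0 selects the ADD reduction.
    call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tmap, i32 128, i64 0, i1 0, i8 0)
    ret void
  }

  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3), ptr, i32, i64, i1, i8)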

@llvmbot
Member

llvmbot commented Nov 19, 2024

@llvm/pr-subscribers-backend-nvptx

@llvm/pr-subscribers-llvm-ir

Author: Durgadoss R (durga4github)

Changes

Patch is 65.48 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/116854.diff

9 Files Affected:

  • (modified) llvm/docs/NVPTXUsage.rst (+78)
  • (modified) llvm/include/llvm/IR/IntrinsicsNVVM.td (+22)
  • (added) llvm/include/llvm/IR/NVVMIntrinsicFlags.h (+37)
  • (modified) llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp (+37)
  • (modified) llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.h (+2)
  • (modified) llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp (+57-12)
  • (modified) llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h (+2)
  • (modified) llvm/lib/Target/NVPTX/NVPTXIntrinsics.td (+32-4)
  • (added) llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-reduce.ll (+426)
diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index 2152de9709dc6e..ed29d87edbad8e 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -663,6 +663,84 @@ the same functionality as described in the ``tile`` mode intrinsics above.
 For more information, refer PTX ISA
 `<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch-tensor>`_.
 
+'``llvm.nvvm.cp.async.bulk.tensor.reduce.tile.[1-5]d``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch, i8 %flag_red_op)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(..., i32 %d0, i32 %d1, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.cp.async.bulk.tensor.reduce.tile.[1-5]d``' intrinsics
+correspond to the ``cp.reduce.async.bulk.tensor.[1-5]d.*`` set of PTX instructions.
+These instructions initiate an asynchronous reduction operation of tensor data
+in global memory with tensor data in shared::cta memory, using ``tile`` mode.
+The dimension of the tensor data ranges from 1d to 5d with the coordinates
+specified by the ``i32 %d0 ... i32 %d4`` arguments.
+
+* The last two arguments to these intrinsics are flags.
+  These flag arguments must be compile-time constants. The backend
+  looks through these flags and lowers the intrinsics appropriately.
+
+* The Nth argument (denoted by ``i8 flag_red_op``) indicates the
+  kind of reduction operation performed. The argument must be in
+  the range [0, 7], representing the following reduction operations:
+
+  ========== =============
+  Enum Value  Reduction Op
+  ========== =============
+  ``0``       ADD
+  ``1``       MIN
+  ``2``       MAX
+  ``3``       INC
+  ``4``       DEC
+  ``5``       AND
+  ``6``       OR
+  ``7``       XOR
+  ========== =============
+
+* The [N-1]th argument (denoted by ``i1 flag_ch``) when set, indicates
+  the presence of a valid cache_hint (``i64 %ch``) and generates the
+  ``.L2::cache_hint`` variant of the PTX instruction.
+
+For more information, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor>`_.
+
+'``llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.[1-5]d``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 %flag_ch, i8 %flag_red_op)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.[1-5]d``' intrinsics
+correspond to the ``cp.reduce.async.bulk.tensor.[1-5]d.*`` set of PTX instructions.
+These instructions initiate an asynchronous reduction operation of tensor data
+in global memory with tensor data in shared::cta memory, using ``im2col`` mode.
+In this mode, the tensor has to be at least three-dimensional.
+The last two arguments of these intrinsics are compile-time flags,
+with the same functionality as described in the ``tile`` mode intrinsics above.
+
+For more information, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor>`_.
+
 Other Intrinsics
 ----------------
 
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index 115fcee0b04f22..b55b71a9418baa 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -635,6 +635,26 @@ class CP_ASYNC_BULK_TENSOR_PREFETCH_INTR<int dim, string mode> {
         ImmArg<ArgIndex<FlagsStartIdx>>];
 }
 
+class CP_ASYNC_BULK_TENSOR_REDUCE_INTR<int dim, string mode> {
+  string Name = "int_nvvm_cp_async_bulk_tensor_reduce_" # mode # "_" # dim # "d";
+
+  list<LLVMType> TensorDimsTy = !listsplat(llvm_i32_ty, dim);
+  list<LLVMType> ArgsTy = !listconcat(
+                          [llvm_shared_ptr_ty,  // src_smem_ptr
+                           llvm_ptr_ty],        // tensormap_ptr
+                           TensorDimsTy,        // actual tensor dims
+                          [llvm_i64_ty,         // cache_hint
+                           llvm_i1_ty,          // Flag for cache_hint
+                           llvm_i8_ty]          // Flag for Reduction Op
+                          );
+  int FlagsStartIdx = !add(dim, 3);
+  list<IntrinsicProperty> IntrProp = [IntrConvergent,
+        ReadOnly<ArgIndex<0>>, ReadOnly<ArgIndex<1>>,
+        NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>,
+        ImmArg<ArgIndex<FlagsStartIdx>>,
+        ImmArg<ArgIndex<!add(FlagsStartIdx, 1)>>];
+}
+
 let TargetPrefix = "nvvm" in {
   def int_nvvm_prmt : ClangBuiltin<"__nvvm_prmt">,
       DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
@@ -4926,6 +4946,8 @@ foreach dim = [1, 2, 3, 4, 5] in {
       def s2g.Name : DefaultAttrsIntrinsic<[], s2g.ArgsTy, s2g.IntrProp>;
     foreach prefetch = [CP_ASYNC_BULK_TENSOR_PREFETCH_INTR<dim, mode>] in
       def prefetch.Name : DefaultAttrsIntrinsic<[], prefetch.ArgsTy, prefetch.IntrProp>;
+    foreach reduce = [CP_ASYNC_BULK_TENSOR_REDUCE_INTR<dim, mode>] in
+      def reduce.Name : DefaultAttrsIntrinsic<[], reduce.ArgsTy, reduce.IntrProp>;
   }
 }
 
diff --git a/llvm/include/llvm/IR/NVVMIntrinsicFlags.h b/llvm/include/llvm/IR/NVVMIntrinsicFlags.h
new file mode 100644
index 00000000000000..c82c4044e03cfa
--- /dev/null
+++ b/llvm/include/llvm/IR/NVVMIntrinsicFlags.h
@@ -0,0 +1,37 @@
+//===--- NVVMIntrinsicFlags.h -----------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+/// \file
+/// This file contains the definitions of the enumerations and flags
+/// associated with NVVM Intrinsics.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_SUPPORT_NVVMINTRINSICFLAGS_H
+#define LLVM_SUPPORT_NVVMINTRINSICFLAGS_H
+
+namespace llvm {
+namespace nvvm {
+
+// Reduction Ops supported with TMA Copy from Shared
+// to Global Memory for the "cp.reduce.async.bulk.tensor.*"
+// family of PTX instructions.
+enum class TMAReductionOp : uint8_t {
+  ADD = 0,
+  MIN = 1,
+  MAX = 2,
+  INC = 3,
+  DEC = 4,
+  AND = 5,
+  OR = 6,
+  XOR = 7,
+};
+
+} // namespace nvvm
+} // namespace llvm
+#endif // LLVM_SUPPORT_NVVMINTRINSICFLAGS_H
diff --git a/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp b/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp
index 7af3f76249d61d..7c80736adf3e8e 100644
--- a/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp
+++ b/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp
@@ -14,6 +14,7 @@
 #include "NVPTX.h"
 #include "NVPTXUtilities.h"
 #include "llvm/ADT/StringRef.h"
+#include "llvm/IR/NVVMIntrinsicFlags.h"
 #include "llvm/MC/MCExpr.h"
 #include "llvm/MC/MCInst.h"
 #include "llvm/MC/MCInstrInfo.h"
@@ -416,3 +417,39 @@ void NVPTXInstPrinter::printPrmtMode(const MCInst *MI, int OpNum,
     return;
   }
 }
+
+void NVPTXInstPrinter::printTmaReductionMode(const MCInst *MI, int OpNum,
+                                             raw_ostream &O,
+                                             const char *Modifier) {
+  const MCOperand &MO = MI->getOperand(OpNum);
+
+  switch (static_cast<nvvm::TMAReductionOp>(MO.getImm())) {
+  case nvvm::TMAReductionOp::ADD:
+    O << ".add";
+    return;
+  case nvvm::TMAReductionOp::MIN:
+    O << ".min";
+    return;
+  case nvvm::TMAReductionOp::MAX:
+    O << ".max";
+    return;
+  case nvvm::TMAReductionOp::INC:
+    O << ".inc";
+    return;
+  case nvvm::TMAReductionOp::DEC:
+    O << ".dec";
+    return;
+  case nvvm::TMAReductionOp::AND:
+    O << ".and";
+    return;
+  case nvvm::TMAReductionOp::OR:
+    O << ".or";
+    return;
+  case nvvm::TMAReductionOp::XOR:
+    O << ".xor";
+    return;
+  default:
+    llvm_unreachable(
+        "Invalid Reduction Op in printCpAsyncBulkTensorReductionMode");
+  }
+}
diff --git a/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.h b/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.h
index 2ce40bd6e8b973..2b19386ef17fe5 100644
--- a/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.h
+++ b/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.h
@@ -54,6 +54,8 @@ class NVPTXInstPrinter : public MCInstPrinter {
                        raw_ostream &O, const char *Modifier = nullptr);
   void printPrmtMode(const MCInst *MI, int OpNum, raw_ostream &O,
                      const char *Modifier = nullptr);
+  void printTmaReductionMode(const MCInst *MI, int OpNum, raw_ostream &O,
+                             const char *Modifier = nullptr);
 };
 
 }
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
index 173c37cfd8c8f7..15172a8d06fddb 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
@@ -4157,9 +4157,9 @@ bool NVPTXScopes::empty() const { return Scopes.size() == 0; }
        ? NVPTX::CP_ASYNC_BULK_TENSOR_##dir##_##dim##_SHARED32_##mode##suffix   \
        : NVPTX::CP_ASYNC_BULK_TENSOR_##dir##_##dim##_##mode##suffix)
 
-#define GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(dim, mode)                         \
-  (IsCacheHint ? (CP_ASYNC_BULK_TENSOR_OPCODE(S2G, dim, mode, _CH))            \
-               : (CP_ASYNC_BULK_TENSOR_OPCODE(S2G, dim, mode, )))
+#define GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(op, dim, mode)                      \
+  (IsCacheHint ? (CP_ASYNC_BULK_TENSOR_OPCODE(op, dim, mode, _CH))             \
+               : (CP_ASYNC_BULK_TENSOR_OPCODE(op, dim, mode, )))
 
 #define GET_CP_ASYNC_BULK_TENSOR_OPCODE_G2S(dim, mode)                         \
   [&]() -> auto {                                                              \
@@ -4177,15 +4177,19 @@ bool NVPTXScopes::empty() const { return Scopes.size() == 0; }
                : NVPTX::CP_ASYNC_BULK_TENSOR_PREFETCH_##dim##_##mode)
 
 static unsigned GetCpAsyncBulkTensorS2GOpcode(size_t Dim, bool IsShared32,
-                                              bool IsCacheHint, bool IsIm2Col) {
+                                              bool IsCacheHint, bool IsIm2Col,
+                                              bool IsReduce = false) {
   if (IsIm2Col) {
     switch (Dim) {
     case 3:
-      return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(3D, IM2COL);
+      return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 3D, IM2COL)
+                      : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 3D, IM2COL);
     case 4:
-      return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(4D, IM2COL);
+      return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 4D, IM2COL)
+                      : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 4D, IM2COL);
     case 5:
-      return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(5D, IM2COL);
+      return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 5D, IM2COL)
+                      : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 5D, IM2COL);
     default:
       llvm_unreachable("Invalid Dimension in im2col mode for "
                        "GetCpAsyncBulkTensorS2GOpcode.");
@@ -4193,15 +4197,20 @@ static unsigned GetCpAsyncBulkTensorS2GOpcode(size_t Dim, bool IsShared32,
   } else {
     switch (Dim) {
     case 1:
-      return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(1D, TILE);
+      return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 1D, TILE)
+                      : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 1D, TILE);
     case 2:
-      return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(2D, TILE);
+      return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 2D, TILE)
+                      : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 2D, TILE);
     case 3:
-      return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(3D, TILE);
+      return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 3D, TILE)
+                      : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 3D, TILE);
     case 4:
-      return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(4D, TILE);
+      return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 4D, TILE)
+                      : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 4D, TILE);
     case 5:
-      return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(5D, TILE);
+      return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 5D, TILE)
+                      : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 5D, TILE);
     default:
       llvm_unreachable(
           "Invalid Dimension in tile mode for GetCpAsyncBulkTensorS2GOpcode.");
@@ -4377,6 +4386,30 @@ void NVPTXDAGToDAGISel::SelectCpAsyncBulkTensorPrefetchCommon(SDNode *N,
   ReplaceNode(N, CurDAG->getMachineNode(Opcode, DL, N->getVTList(), Ops));
 }
 
+void NVPTXDAGToDAGISel::SelectCpAsyncBulkTensorReduceCommon(SDNode *N,
+                                                            bool IsIm2Col) {
+  // We have {Chain, Intrinsic-ID} followed by the actual intrisic args:
+  // src, dst, dims{d0...dN}, cache_hint, cache_hint_flag, reduction_kind_flag
+  // NumOperands = {Chain, IID} + {Actual intrinsic args}
+  //             = {2}          + {5 + dims}
+  size_t NumOps = N->getNumOperands();
+  size_t NumDims = NumOps - 7;
+  unsigned ReductionKind = N->getConstantOperandVal(NumOps - 1);
+  bool IsCacheHint = N->getConstantOperandVal(NumOps - 2) == 1;
+  size_t NumArgs = NumDims + (IsCacheHint ? 3 : 2); // src, dst, cache_hint
+
+  SDLoc DL(N);
+  SmallVector<SDValue, 12> Ops(N->ops().slice(2, NumArgs));
+  Ops.push_back(getI32Imm(ReductionKind, DL)); // Reduction Op
+  Ops.push_back(N->getOperand(0));             // Chain operand
+
+  bool IsShared32 =
+      CurDAG->getDataLayout().getPointerSizeInBits(ADDRESS_SPACE_SHARED) == 32;
+  unsigned Opcode = GetCpAsyncBulkTensorS2GOpcode(
+      NumDims, IsShared32, IsCacheHint, IsIm2Col, /*IsReduce=*/true);
+  ReplaceNode(N, CurDAG->getMachineNode(Opcode, DL, N->getVTList(), Ops));
+}
+
 bool NVPTXDAGToDAGISel::tryIntrinsicVoid(SDNode *N) {
   unsigned IID = N->getConstantOperandVal(1);
   switch (IID) {
@@ -4418,5 +4451,17 @@ bool NVPTXDAGToDAGISel::tryIntrinsicVoid(SDNode *N) {
   case Intrinsic::nvvm_cp_async_bulk_tensor_prefetch_im2col_5d:
     SelectCpAsyncBulkTensorPrefetchCommon(N, /*IsIm2Col=*/true);
     return true;
+  case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_tile_1d:
+  case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_tile_2d:
+  case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_tile_3d:
+  case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_tile_4d:
+  case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_tile_5d:
+    SelectCpAsyncBulkTensorReduceCommon(N);
+    return true;
+  case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_im2col_3d:
+  case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_im2col_4d:
+  case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_im2col_5d:
+    SelectCpAsyncBulkTensorReduceCommon(N, /*IsIm2Col=*/true);
+    return true;
   }
 }
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
index d6c80a31b7463d..0a79428fcec2d8 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
@@ -95,6 +95,8 @@ class LLVM_LIBRARY_VISIBILITY NVPTXDAGToDAGISel : public SelectionDAGISel {
   void SelectCpAsyncBulkTensorG2SCommon(SDNode *N, bool IsIm2Col = false);
   void SelectCpAsyncBulkTensorS2GCommon(SDNode *N, bool IsIm2Col = false);
   void SelectCpAsyncBulkTensorPrefetchCommon(SDNode *N, bool IsIm2Col = false);
+  void SelectCpAsyncBulkTensorReduceCommon(SDNode *N, bool IsIm2Col = false);
+
   inline SDValue getI32Imm(unsigned Imm, const SDLoc &DL) {
     return CurDAG->getTargetConstant(Imm, DL, MVT::i32);
   }
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index 5878940812f62b..103c92b608dfba 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -564,17 +564,19 @@ foreach dim = [1, 2, 3, 4, 5] in {
 }
 
 // From Shared to Global memory (S2G)
-class S2G_STRINGS<int dim, string mode, bit ch, bit is_shared32 = 0> {
-  string prefix = "cp.async.bulk.tensor";
+class S2G_STRINGS<int dim, string mode, bit ch,
+                  bit is_shared32 = 0, bit is_reduce = 0> {
   string dir = "global.shared::cta";
   string completion = "bulk_group";
-  string inst_name = prefix
+  string inst_name = !if(is_reduce, "cp.reduce", "cp")
+                     # ".async.bulk.tensor"
                      # "." # dim # "d"
                      # "." # dir
                      # "." # mode
                      # "." # completion
                      # !if(ch, ".L2::cache_hint", "");
-  string intr_name = "CP_ASYNC_BULK_TENSOR_S2G_"
+  string intr_name = "CP_ASYNC_BULK_TENSOR_"
+                     # !if(is_reduce, "RED_", "S2G_")
                      # dim # "D"
                      # !if(is_shared32, "_SHARED32", "")
                      # !if(!eq(mode, "tile"), "_TILE", "_IM2COL");
@@ -596,11 +598,37 @@ multiclass CP_ASYNC_BULK_TENSOR_S2G_INTR<int dim, bit shared32, string mode> {
            Requires<[hasPTX<80>, hasSM<90>]>;
 }
 
+def TMAReductionFlags : Operand<i32> {
+  let PrintMethod = "printTmaReductionMode";
+}
+
+// TMA Copy from Shared to Global memory with Reduction
+multiclass CP_ASYNC_BULK_TENSOR_REDUCE_INTR<int dim, bit shared32, string mode> {
+  defvar dims_dag = !dag(ins, !listsplat(Int32Regs, dim), !foreach(i, !range(dim), "d" # i));
+  defvar dims_str = !interleave(!foreach(i, !range(dim), "$d" # i), ", ");
+  defvar asm_str = " [$tmap, {{" # dims_str # "}}], [$src]";
+  defvar rc = !if(shared32, Int32Regs, Int64Regs);
+
+  defvar prefix = "cp.reduce.async.bulk.tensor" # "." # dim # "d" # ".global.shared::cta";
+  defvar suffix = "." # mode # ".bulk_group";
+
+  def "": NVPTXInst<(outs),
+          !con((ins rc:$src, Int64Regs:$tmap), dims_dag, (ins TMAReductionFlags:$red_op)),
+          !strconcat(prefix, "${red_op}", suffix, asm_str, ";"), []>,
+          Requires<[hasPTX<80>, hasSM<90>]>;
+  def _CH: NVPTXInst<(outs),
+           !con((ins rc:$src, Int64Regs:$tmap), dims_dag, (ins Int64Regs:$ch, TMAReductionFlags:$red_op)),
+           !strconcat(prefix, "${red_op}", suffix, ".L2::cache_hint", asm_str, ", $ch;"), []>,
+           Requires<[hasPTX<80>, hasSM<90>]>;
+}
+
 foreach dim = [1, 2, 3, 4, 5] in {
   foreach shared32 = [true, false] in {
     foreach mode = !if(!ge(dim, 3), ["tile", "im2col_no_offs"], ["tile"]) in {
       defm S2G_STRINGS<dim, mode, 0, shared32>.intr_name :
         CP_ASYNC_BULK_TENSOR_S2G_INTR<dim, shared32, mode>;
+      defm S2G_STRINGS<dim, mode, 0, shared32, 1>.intr_name :
+        CP_ASYNC_BULK_TENSOR_REDUCE_INTR<dim, shared32, mode>;
     }
   }
 }
diff --git a/llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-reduce.ll b/llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-reduce.ll
new file mode 100644
index 00000000000000..28713109e742fa
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-reduce.ll
@@ -0,0 +1,426 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx80| FileCheck --check-prefixes=CHECK-PTX %s
+; RUN: %if ptxas-12.3 %{ llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx80| %ptxas-verify -arch=sm_90 %}
+
+target triple = "nvptx64-nvidia-cuda"
+
+declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %s, ptr %tm, i32 %d0, i64 %ch, i1 %flag_ch, i8 %flag_red);
+declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(p...
[truncated]
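
As a sketch of what the truncated test calls look like in spirit (the function name and the choice of a MIN reduction with an L2 cache hint are illustrative assumptions, not the actual contents of the truncated file):

  define void @reduce_min_im2col_3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch) {
    ; i1 1 enables the L2 cache hint carried in %ch; i8 1 selects the MIN reduction.
    call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 1, i8 1)
    ret void
  }

  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3), ptr, i32, i32, i32, i64, i1, i8)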

@durga4github requested a review from Artem-B on November 19, 2024 18:28
@durga4github force-pushed the durgadossr/nvptx_tma_reduce branch from 51207bc to 1631ebe on November 19, 2024 18:34
@durga4github
Contributor Author

@Artem-B, kindly help with the review.

@durga4github force-pushed the durgadossr/nvptx_tma_reduce branch from 1631ebe to ce817e2 on November 21, 2024 09:55
@durga4github
Contributor Author

Ping: @Artem-B, could you please help with this review?

@durga4github force-pushed the durgadossr/nvptx_tma_reduce branch from ce817e2 to 67da5e5 on November 26, 2024 13:05

github-actions bot commented Nov 26, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@durga4github force-pushed the durgadossr/nvptx_tma_reduce branch from 67da5e5 to db41224 on November 26, 2024 13:11
Member

@Artem-B left a comment

LGTM with a minor nit.

This patch adds NVVM intrinsics and NVPTX codegen for:
* cp.async.bulk.tensor.reduce.1D -> 5D variants,
  supporting both Tile and Im2Col modes.
* These intrinsics optionally support cache_hints as
  indicated by the boolean flag argument.
* Lit tests are added for all combinations of these
  intrinsics in cp-async-bulk-tensor-reduce.ll.
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in NVPTXUsage.rst file.

PTX Spec reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
@durga4github force-pushed the durgadossr/nvptx_tma_reduce branch from db41224 to 091821f on November 26, 2024 19:41
@durga4github merged commit 40d0058 into llvm:main on Nov 27, 2024
9 checks passed
@llvm-ci
Collaborator

llvm-ci commented Nov 27, 2024

LLVM Buildbot has detected a new failure on builder ppc64le-lld-multistage-test running on ppc64le-lld-multistage-test while building llvm at step 12 "build-stage2-unified-tree".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/168/builds/5960

Here is the relevant piece of the build log for the reference
Step 12 (build-stage2-unified-tree) failure: build (failure)
...
347.737 [1207/1154/3983] Building CXX object lib/Target/Hexagon/CMakeFiles/LLVMHexagonCodeGen.dir/HexagonGenMux.cpp.o
347.766 [1206/1154/3984] Building CXX object lib/Target/Hexagon/MCTargetDesc/CMakeFiles/LLVMHexagonDesc.dir/HexagonMCTargetDesc.cpp.o
347.786 [1205/1154/3985] Building CXX object tools/clang/lib/AST/CMakeFiles/obj.clangAST.dir/StmtIterator.cpp.o
347.808 [1204/1154/3986] Building CXX object tools/llvm-exegesis/lib/CMakeFiles/LLVMExegesis.dir/Target.cpp.o
347.827 [1203/1154/3987] Building CXX object lib/Target/BPF/CMakeFiles/LLVMBPFCodeGen.dir/BPFISelDAGToDAG.cpp.o
347.868 [1202/1154/3988] Building CXX object tools/llvm-exegesis/lib/CMakeFiles/LLVMExegesis.dir/SnippetRepetitor.cpp.o
347.887 [1201/1154/3989] Building CXX object lib/Target/NVPTX/CMakeFiles/LLVMNVPTXCodeGen.dir/NVPTXFrameLowering.cpp.o
347.968 [1200/1154/3990] Building CXX object lib/Target/BPF/CMakeFiles/LLVMBPFCodeGen.dir/BPFISelLowering.cpp.o
348.018 [1199/1154/3991] Building CXX object tools/clang/lib/AST/CMakeFiles/obj.clangAST.dir/ASTConcept.cpp.o
348.038 [1198/1154/3992] Building CXX object lib/Target/NVPTX/MCTargetDesc/CMakeFiles/LLVMNVPTXDesc.dir/NVPTXInstPrinter.cpp.o
FAILED: lib/Target/NVPTX/MCTargetDesc/CMakeFiles/LLVMNVPTXDesc.dir/NVPTXInstPrinter.cpp.o 
ccache /home/buildbots/llvm-external-buildbots/workers/ppc64le-lld-multistage-test/ppc64le-lld-multistage-test/install/stage1/bin/clang++ -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/buildbots/llvm-external-buildbots/workers/ppc64le-lld-multistage-test/ppc64le-lld-multistage-test/build/stage2/lib/Target/NVPTX/MCTargetDesc -I/home/buildbots/llvm-external-buildbots/workers/ppc64le-lld-multistage-test/ppc64le-lld-multistage-test/llvm-project/llvm/lib/Target/NVPTX/MCTargetDesc -I/home/buildbots/llvm-external-buildbots/workers/ppc64le-lld-multistage-test/ppc64le-lld-multistage-test/llvm-project/llvm/lib/Target/NVPTX -I/home/buildbots/llvm-external-buildbots/workers/ppc64le-lld-multistage-test/ppc64le-lld-multistage-test/build/stage2/lib/Target/NVPTX -I/home/buildbots/llvm-external-buildbots/workers/ppc64le-lld-multistage-test/ppc64le-lld-multistage-test/build/stage2/include -I/home/buildbots/llvm-external-buildbots/workers/ppc64le-lld-multistage-test/ppc64le-lld-multistage-test/llvm-project/llvm/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -O3 -DNDEBUG -std=c++17 -fvisibility=hidden  -fno-exceptions -funwind-tables -fno-rtti -UNDEBUG -MD -MT lib/Target/NVPTX/MCTargetDesc/CMakeFiles/LLVMNVPTXDesc.dir/NVPTXInstPrinter.cpp.o -MF lib/Target/NVPTX/MCTargetDesc/CMakeFiles/LLVMNVPTXDesc.dir/NVPTXInstPrinter.cpp.o.d -o lib/Target/NVPTX/MCTargetDesc/CMakeFiles/LLVMNVPTXDesc.dir/NVPTXInstPrinter.cpp.o -c /home/buildbots/llvm-external-buildbots/workers/ppc64le-lld-multistage-test/ppc64le-lld-multistage-test/llvm-project/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp
/home/buildbots/llvm-external-buildbots/workers/ppc64le-lld-multistage-test/ppc64le-lld-multistage-test/llvm-project/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp:452:3: error: default label in switch which covers all enumeration values [-Werror,-Wcovered-switch-default]
  452 |   default:
      |   ^
1 error generated.
348.109 [1198/1153/3993] Building CXX object lib/Target/LoongArch/CMakeFiles/LLVMLoongArchCodeGen.dir/LoongArchDeadRegisterDefinitions.cpp.o
348.112 [1198/1152/3994] Building CXX object lib/ExecutionEngine/Orc/CMakeFiles/LLVMOrcJIT.dir/ObjectLinkingLayer.cpp.o
348.158 [1198/1151/3995] Building CXX object lib/Target/WebAssembly/CMakeFiles/LLVMWebAssemblyCodeGen.dir/WebAssemblyRefTypeMem2Local.cpp.o
348.207 [1198/1150/3996] Linking CXX static library lib/libLLVMHexagonDesc.a
348.327 [1198/1149/3997] Linking CXX static library lib/libLLVMSystemZDesc.a
348.369 [1198/1148/3998] Building CXX object lib/Target/Hexagon/CMakeFiles/LLVMHexagonCodeGen.dir/HexagonRegisterInfo.cpp.o
348.398 [1198/1147/3999] Building CXX object tools/clang/lib/Lex/CMakeFiles/obj.clangLex.dir/PPDirectives.cpp.o
348.865 [1198/1146/4000] Building CXX object tools/clang/lib/AST/CMakeFiles/obj.clangAST.dir/ExternalASTSource.cpp.o
348.869 [1198/1145/4001] Building CXX object tools/llvm-exegesis/lib/CMakeFiles/LLVMExegesis.dir/UopsBenchmarkRunner.cpp.o
348.870 [1198/1144/4002] Building CXX object lib/Target/AVR/CMakeFiles/LLVMAVRCodeGen.dir/AVRISelLowering.cpp.o
348.872 [1198/1143/4003] Building CXX object lib/Target/Hexagon/CMakeFiles/LLVMHexagonCodeGen.dir/HexagonGenMemAbsolute.cpp.o
348.873 [1198/1142/4004] Building CXX object lib/Target/SystemZ/CMakeFiles/LLVMSystemZCodeGen.dir/SystemZPostRewrite.cpp.o
348.874 [1198/1141/4005] Building CXX object lib/Target/VE/MCTargetDesc/CMakeFiles/LLVMVEDesc.dir/VEMCTargetDesc.cpp.o
348.875 [1198/1140/4006] Building CXX object lib/Target/WebAssembly/CMakeFiles/LLVMWebAssemblyCodeGen.dir/WebAssemblyMCLowerPrePass.cpp.o
348.876 [1198/1139/4007] Building CXX object tools/clang/tools/diagtool/CMakeFiles/diagtool.dir/TreeView.cpp.o
348.888 [1198/1138/4008] Building CXX object tools/clang/lib/Format/CMakeFiles/obj.clangFormat.dir/TokenAnnotator.cpp.o
348.988 [1198/1137/4009] Building CXX object tools/clang/tools/apinotes-test/CMakeFiles/apinotes-test.dir/APINotesTest.cpp.o
349.018 [1198/1136/4010] Building CXX object tools/clang/lib/AST/CMakeFiles/obj.clangAST.dir/CommentLexer.cpp.o
349.068 [1198/1135/4011] Building CXX object lib/Target/ARM/MCTargetDesc/CMakeFiles/LLVMARMDesc.dir/ARMMCCodeEmitter.cpp.o
349.088 [1198/1134/4012] Building CXX object lib/Target/VE/Disassembler/CMakeFiles/LLVMVEDisassembler.dir/VEDisassembler.cpp.o
349.129 [1198/1133/4013] Building CXX object lib/Target/PowerPC/CMakeFiles/LLVMPowerPCCodeGen.dir/GISel/PPCCallLowering.cpp.o
349.137 [1198/1132/4014] Building CXX object lib/Target/LoongArch/CMakeFiles/LLVMLoongArchCodeGen.dir/LoongArchFrameLowering.cpp.o
349.179 [1198/1131/4015] Building CXX object lib/Target/LoongArch/CMakeFiles/LLVMLoongArchCodeGen.dir/LoongArchExpandPseudoInsts.cpp.o
349.300 [1198/1130/4016] Building CXX object tools/clang/lib/Format/CMakeFiles/obj.clangFormat.dir/UnwrappedLineParser.cpp.o
349.578 [1198/1129/4017] Building CXX object tools/clang/lib/Driver/CMakeFiles/obj.clangDriver.dir/ToolChain.cpp.o
349.638 [1198/1128/4018] Building CXX object lib/Target/WebAssembly/AsmParser/CMakeFiles/LLVMWebAssemblyAsmParser.dir/WebAssemblyAsmTypeCheck.cpp.o
349.658 [1198/1127/4019] Building CXX object tools/clang/lib/CodeGen/CMakeFiles/obj.clangCodeGen.dir/SanitizerMetadata.cpp.o
349.687 [1198/1126/4020] Building CXX object lib/Target/Mips/CMakeFiles/LLVMMipsCodeGen.dir/MicroMipsSizeReduction.cpp.o
349.878 [1198/1125/4021] Building CXX object lib/Target/Hexagon/CMakeFiles/LLVMHexagonCodeGen.dir/HexagonRDFOpt.cpp.o
350.098 [1198/1124/4022] Building CXX object lib/Target/Mips/CMakeFiles/LLVMMipsCodeGen.dir/MipsCCState.cpp.o
350.100 [1198/1123/4023] Building CXX object lib/Target/Mips/CMakeFiles/LLVMMipsCodeGen.dir/MipsRegisterInfo.cpp.o
350.168 [1198/1122/4024] Building CXX object lib/Target/LoongArch/CMakeFiles/LLVMLoongArchCodeGen.dir/LoongArchAsmPrinter.cpp.o
350.170 [1198/1121/4025] Building CXX object lib/Target/NVPTX/CMakeFiles/LLVMNVPTXCodeGen.dir/NVPTXRegisterInfo.cpp.o

@durga4github
Contributor Author

@kazutakahirata
I was writing a fix ;-)

Thanks for merging the fix.
However, instead of removing the unreachable altogether, we could place it outside the switch statement.
Would you amend, or shall I submit a separate PR for this?

@kazutakahirata
Contributor

Thanks for merging the fix. However, instead of removing the unreachable altogether, we could place it outside the switch statement. Would you amend, or shall I submit a separate PR for this?

If you could send me a separate PR, I'll LGTM it right away. Thanks!

@durga4github
Contributor Author

Thanks for merging the fix. However, instead of removing the unreachable altogether, we could place it outside the switch statement. Would you amend, or shall I submit a separate PR for this?

If you could send me a separate PR, I'll LGTM it right away. Thanks!

#117850
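
For context, the pattern suggested above looks roughly like the following sketch (the helper name is hypothetical, and this is not the contents of the follow-up PR): a fully covered switch with no default label keeps -Wcovered-switch-default quiet, while an llvm_unreachable placed after the switch still traps out-of-range values.

  #include "llvm/IR/NVVMIntrinsicFlags.h"
  #include "llvm/Support/ErrorHandling.h"
  #include "llvm/Support/raw_ostream.h"

  // Hypothetical helper illustrating the pattern discussed above.
  static void printRedOp(llvm::raw_ostream &O, llvm::nvvm::TMAReductionOp Op) {
    switch (Op) {
    case llvm::nvvm::TMAReductionOp::ADD: O << ".add"; return;
    case llvm::nvvm::TMAReductionOp::MIN: O << ".min"; return;
    case llvm::nvvm::TMAReductionOp::MAX: O << ".max"; return;
    case llvm::nvvm::TMAReductionOp::INC: O << ".inc"; return;
    case llvm::nvvm::TMAReductionOp::DEC: O << ".dec"; return;
    case llvm::nvvm::TMAReductionOp::AND: O << ".and"; return;
    case llvm::nvvm::TMAReductionOp::OR:  O << ".or";  return;
    case llvm::nvvm::TMAReductionOp::XOR: O << ".xor"; return;
    }
    // Reached only if Op holds a value outside the enumeration.
    llvm_unreachable("Invalid TMA reduction op");
  }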

@durga4github deleted the durgadossr/nvptx_tma_reduce branch on November 27, 2024 06:46
durga4github added a commit to durga4github/llvm-project that referenced this pull request Dec 5, 2024
PR llvm#116854 adds intrinsics for TMA Store with reduction.
This patch adds an NVVM Dialect Op for the same.

Lit tests to verify the lowering to LLVM intrinsics as well as
verifier tests (for invalid cases) are added.

* The Common verifier method for TMA Ops is updated
  to handle im2col modes without offsets. This helps
  Ops like TMA Store, TMA StoreReduce etc.
* The nvvmir.mlir test file is already large. So, this
  patch adds the tests for this Op into a separate file
  under a separate "nvvm/" directory.
  [mlir/test/Target/LLVMIR/"nvvm"/tma_store_reduce.mlir]

PTX Spec reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
durga4github added a commit to durga4github/llvm-project that referenced this pull request Dec 6, 2024
joker-eph pushed a commit that referenced this pull request Dec 11, 2024
PR #116854 adds intrinsics for TMA Store with reduction.
This patch adds an NVVM Dialect Op for the same.

* Lit tests are added to verify the lowering to LLVM intrinsics and
invalid cases.
* The common verifier method is updated to handle im2col modes without
offsets.
   This helps Ops like TMA Store, TMA StoreReduce etc.
* The nvvmir.mlir test file is already large. So, this patch adds the
tests for this Op
   in a new file under a separate "nvvm/" directory.
   [mlir/test/Target/LLVMIR/"nvvm"/tma_store_reduce.mlir]

PTX Spec reference:

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>