Merge pull request #386 from cjnolet/branch-21.12-merge-22.02

Branch 21.12 merge 22.02
rapidsai · Nov 17, 2021 · 891ff74
2 parents 3a7e3ed + 5dd75eb

Showing 87 changed files with 1,939 additions and 986 deletions.
diff --git a/README.md b/README.md
@@ -1,15 +1,94 @@
-# <div align="left"><img src="https://rapids.ai/assets/images/rapids_logo.png" width="90px"/>&nbsp;RAFT: RAPIDS Analytics Frameworks Toolset</div>
+# <div align="left"><img src="https://rapids.ai/assets/images/rapids_logo.png" width="90px"/>&nbsp;RAFT: RAPIDS Analytics Framework Toolkit</div>
 
-RAFT is a repository containining shared utilities, mathematical operations and common functions for the analytics components of RAPIDS. Both the C++ and Python components can be included in consuming libraries.
+RAFT is a library containing building-blocks for rapid composition of RAPIDS Analytics. These building-blocks include shared representations, mathematical computational primitives, and utilities that accelerate building analytics and data science algorithms in the RAPIDS ecosystem. Both the C++ and Python components can be included in consuming libraries, providing building-blocks for both dense and sparse matrix formats in the following general categories:
+#####
+| Category | Description / Examples |
+| --- | --- |
+| **Data Formats** | tensor representations and conversions for both sparse and dense formats |
+| **Data Generation** | graph, spatial, and machine learning dataset generation |
+| **Dense Operations** | linear algebra, statistics |
+| **Spatial** | pairwise distances, nearest neighbors, neighborhood / proximity graph construction |
+| **Sparse/Graph Operations** | linear algebra, statistics, slicing, msf, spectral embedding/clustering, slhc, vertex degree |
+| **Solvers** | eigenvalue decomposition, least squares, lanczos |
+| **Tools** | multi-node multi-gpu communicator, utilities |
+
+By taking a primitives-based approach to algorithm development, RAFT accelerates algorithm construction time and reduces
+the maintenance burden by maximizing reuse across projects. RAFT relies on the [RAPIDS memory manager (RMM)](https://github.com/rapidsai/rmm) which, 
+like other projects in the RAPIDS ecosystem, eases the burden of configuring different allocation strategies globally 
+across the libraries that use it. RMM also provides RAII wrappers around device arrays that handle the allocation and cleanup.
+
+## Getting started
 
 Refer to the [Build and Development Guide](BUILD.md) for details on RAFT's design, building, testing and development guidelines.
 
+Most of the primitives in RAFT accept a `raft::handle_t` object for the management of resources which are expensive to create, such as CUDA streams, stream pools, and handles to other CUDA libraries like `cublas` and `cusolver`.
+
+
+### C++ Example
+
+The example below demonstrates creating a RAFT handle and using it with RMM's `device_uvector` to allocate memory on device and compute
+pairwise Euclidean distances:
+```c++
+#include <raft/handle.hpp>
+#include <raft/distance/distance.hpp>
+
+#include <rmm/device_uvector.hpp>
+raft::handle_t handle;
+
+int n_samples = ...;
+int n_features = ...;
+
+rmm::device_uvector<float> input(n_samples * n_features, handle.get_stream());
+rmm::device_uvector<float> output(n_samples * n_samples, handle.get_stream());
+
+// ... Populate feature matrix ...
+
+auto metric = raft::distance::DistanceType::L2SqrtExpanded;
+rmm::device_uvector<char> workspace(0, handle.get_stream());
+raft::distance::pairwise_distance(handle, input.data(), input.data(),
+                                  output.data(),
+                                  n_samples, n_samples, n_features,
+                                  workspace.data(), metric);
+```
+
+
+
+
 ## Folder Structure and Contents
 
-The folder structure mirrors the main RAPIDS repos (cuDF, cuML, cuGraph...), with the following folders:
+The folder structure mirrors other RAPIDS repos (cuDF, cuML, cuGraph...), with the following folders:
 
-- `cpp`: Source code for all C++ code. The code is header only, therefore it is in the `include` folder (with no `src`).
+- `cpp`: Source code for all C++ code. The code is currently header-only, therefore it is in the `include` folder (with no `src`).
 - `python`: Source code for all Python code.
 - `ci`: Scripts for running CI in PRs
 
+[comment]: <> (TODO: This needs to be updated after the public API is established)
+[comment]: <> (The library layout contains the following structure:)
+
+[comment]: <> (```bash)
+
+[comment]: <> (cpp/include/raft)
+
+[comment]: <> (     |------------ comms      [communication abstraction layer])
+
+[comment]: <> (     |------------ distance   [dense pairwise distances])
+
+[comment]: <> (     |------------ linalg     [dense linear algebra])
+
+[comment]: <> (     |------------ matrix     [dense matrix format])
+
+[comment]: <> (     |------------ random     [random matrix generation])
+
+[comment]: <> (     |------------ sparse     [sparse matrix and graph algorithms])
+
+[comment]: <> (     |------------ spatial    [spatial algorithms])
+
+[comment]: <> (     |------------ spectral   [spectral clustering])
+
+[comment]: <> (     |------------ stats      [statistics primitives])
+
+[comment]: <> (     |------------ handle.hpp [raft handle])
+
+[comment]: <> (```)
+
 
diff --git a/ci/gpu/build.sh b/ci/gpu/build.sh
@@ -15,7 +15,7 @@ function hasArg {
 
 # Set path and build parallel level
 export PATH=/opt/conda/bin:/usr/local/cuda/bin:$PATH
-export PARALLEL_LEVEL=${PARALLEL_LEVEL:-4}
+export PARALLEL_LEVEL=${PARALLEL_LEVEL:-8}
 export CUDA_REL=${CUDA_VERSION%.*}
 
 # Set home to the job's workspace

diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
@@ -100,8 +100,10 @@ endif()
 # add third party dependencies using CPM
 rapids_cpm_init()
 
+# thrust and libcudacxx need to be before cuco!
 include(cmake/thirdparty/get_thrust.cmake)
 include(cmake/thirdparty/get_rmm.cmake)
+include(cmake/thirdparty/get_libcudacxx.cmake)
 include(cmake/thirdparty/get_cuco.cmake)
 
 if(BUILD_TESTS)

diff --git a/cpp/Doxyfile.in b/cpp/Doxyfile.in
@@ -771,10 +771,7 @@ WARN_LOGFILE           =
 # spaces. See also FILE_PATTERNS and EXTENSION_MAPPING
 # Note: If this tag is empty the current directory is searched.
 
-INPUT                  = @CMAKE_CURRENT_SOURCE_DIR@/comms \
-                         @CMAKE_CURRENT_SOURCE_DIR@/include \
-                         @CMAKE_CURRENT_SOURCE_DIR@/src \
-                         @CMAKE_CURRENT_SOURCE_DIR@/src_prims
+INPUT                  = @CMAKE_CURRENT_SOURCE_DIR@/include \
 
 # This tag can be used to specify the character encoding of the source files
 # that doxygen parses. Internally doxygen uses the UTF-8 encoding. Doxygen uses
@@ -799,12 +796,7 @@ INPUT_ENCODING         = UTF-8
 # *.m, *.markdown, *.md, *.mm, *.dox, *.py, *.pyw, *.f90, *.f, *.for, *.tcl,
 # *.vhd, *.vhdl, *.ucf, *.qsf, *.as and *.js.
 
-FILE_PATTERNS          = *.cpp \
-                         *.h \
-                         *.hpp \
-                         *.hxx \
-                         *.cu \
-                         *.cuh
+FILE_PATTERNS          = *.hpp
 
 # The RECURSIVE tag can be used to specify whether or not subdirectories should
 # be searched for input files as well.
@@ -835,8 +827,8 @@ EXCLUDE_SYMLINKS       = NO
 # Note that the wildcards are matched against the file with absolute path, so to
 # exclude all test directories for example use the pattern */test/*
 
-EXCLUDE_PATTERNS       = columnWiseSort.h \
-                         smoblocksolve.h
+EXCLUDE_PATTERNS       = **/detail/** \
+                         **/spectral/**
 
 # The EXCLUDE_SYMBOLS tag can be used to specify one or more symbol names
 # (namespaces, classes, functions, etc.) that should be excluded from the
@@ -873,7 +865,7 @@ EXAMPLE_RECURSIVE      = NO
 # that contain images that are to be included in the documentation (see the
 # \image command).
 
-IMAGE_PATH             = @CMAKE_CURRENT_SOURCE_DIR@/doxygen/images
+IMAGE_PATH             =
 
 # The INPUT_FILTER tag can be used to specify a program that doxygen should
 # invoke to filter for each input file. Doxygen will invoke the filter program

diff --git a/cpp/cmake/doxygen.cmake b/cpp/cmake/doxygen.cmake
@@ -22,7 +22,7 @@ function(add_doxygen_target)
     set(multiValueArgs "")
     cmake_parse_arguments(dox "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
     configure_file(${dox_IN_DOXYFILE} ${dox_OUT_DOXYFILE} @ONLY)
-    add_custom_target(doc
+    add_custom_target(docs_raft
       ${DOXYGEN_EXECUTABLE} ${dox_OUT_DOXYFILE}
       WORKING_DIRECTORY ${dox_CWD}
       VERBATIM

diff --git a/cpp/cmake/libcudacxx.patch b/cpp/cmake/libcudacxx.patch
@@ -0,0 +1,21 @@
+diff --git a/include/cuda/std/detail/__config b/include/cuda/std/detail/__config
+index d55a43688..654142d7e 100644
+--- a/include/cuda/std/detail/__config
++++ b/include/cuda/std/detail/__config
+@@ -23,7 +23,7 @@
+     #define _LIBCUDACXX_CUDACC_VER_MINOR __CUDACC_VER_MINOR__
+     #define _LIBCUDACXX_CUDACC_VER_BUILD __CUDACC_VER_BUILD__
+     #define _LIBCUDACXX_CUDACC_VER                                                  \
+-        _LIBCUDACXX_CUDACC_VER_MAJOR * 10000 + _LIBCUDACXX_CUDACC_VER_MINOR * 100 + \
++        _LIBCUDACXX_CUDACC_VER_MAJOR * 100000 + _LIBCUDACXX_CUDACC_VER_MINOR * 1000 + \
+         _LIBCUDACXX_CUDACC_VER_BUILD
+
+     #define _LIBCUDACXX_HAS_NO_LONG_DOUBLE
+@@ -64,7 +64,7 @@
+ #  endif
+ #endif
+
+-#if defined(_LIBCUDACXX_COMPILER_MSVC) || (defined(_LIBCUDACXX_CUDACC_VER) && (_LIBCUDACXX_CUDACC_VER < 110500))
++#if defined(_LIBCUDACXX_COMPILER_MSVC) || (defined(_LIBCUDACXX_CUDACC_VER) && (_LIBCUDACXX_CUDACC_VER < 1105000))
+ #  define _LIBCUDACXX_HAS_NO_INT128
+ #endif
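
The patch above widens the packed compiler version from `MAJOR * 10000 + MINOR * 100 + BUILD` to `MAJOR * 100000 + MINOR * 1000 + BUILD` and moves the `__int128` threshold from `110500` to `1105000`. The source does not state the motivation; one plausible reading is that a three-digit build number overflows the two decimal digits the old scheme leaves for it. The sketch below illustrates that reading. The helper names `encode_old` and `encode_new` and the version triple 11.4.152 are illustrative assumptions, not taken from libcudacxx.

```cpp
#include <cstdint>

// Hypothetical helpers mirroring the two packing schemes in the patch.
// Old scheme: only two decimal digits are left for the build number.
constexpr std::int64_t encode_old(int major, int minor, int build) {
  return major * 10000LL + minor * 100LL + build;
}

// New scheme: three decimal digits are left for the build number.
constexpr std::int64_t encode_new(int major, int minor, int build) {
  return major * 100000LL + minor * 1000LL + build;
}

// Assumed example: compiler version 11.4 with build number 152.
// Old packing: 110000 + 400 + 152 = 110552, which is not below 110500,
// so an "older than 11.5" check would fail for an 11.4 compiler.
static_assert(encode_old(11, 4, 152) == 110552, "old packing of 11.4.152");
static_assert(!(encode_old(11, 4, 152) < 110500), "11.4.152 looks like >= 11.5");

// New packing: 1100000 + 4000 + 152 = 1104152, which is below 1105000,
// so the same check classifies 11.4 correctly.
static_assert(encode_new(11, 4, 152) == 1104152, "new packing of 11.4.152");
static_assert(encode_new(11, 4, 152) < 1105000, "11.4.152 is older than 11.5");
static_assert(!(encode_new(11, 5, 0) < 1105000), "11.5.0 is not older than 11.5");
```

Because every check is a `static_assert`, compiling the snippet is enough to confirm the arithmetic.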
diff --git a/cpp/cmake/thirdparty/get_cuco.cmake b/cpp/cmake/thirdparty/get_cuco.cmake
@@ -22,7 +22,7 @@ function(find_and_configure_cuco VERSION)
       INSTALL_EXPORT_SET  raft-exports
       CPM_ARGS
         GIT_REPOSITORY https://github.com/NVIDIA/cuCollections.git
-        GIT_TAG        729857a5698a0e8d8f812e0464f65f37854ae17b
+        GIT_TAG        f0eecb203590f1f4ac4a9f1700229f4434ac64dc
         OPTIONS        "BUILD_TESTS OFF"
                        "BUILD_BENCHMARKS OFF"
                        "BUILD_EXAMPLES OFF"

diff --git a/cpp/cmake/thirdparty/get_libcudacxx.cmake b/cpp/cmake/thirdparty/get_libcudacxx.cmake
@@ -0,0 +1,26 @@
+# =============================================================================
+# Copyright (c) 2020-2021, NVIDIA CORPORATION.
+#
+# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
+# in compliance with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software distributed under the License
+# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+# or implied. See the License for the specific language governing permissions and limitations under
+# the License.
+# =============================================================================
+
+# This function finds libcudacxx and sets any additional necessary environment variables.
+function(find_and_configure_libcudacxx)
+  include(${rapids-cmake-dir}/cpm/libcudacxx.cmake)
+
+  rapids_cpm_libcudacxx(
+    BUILD_EXPORT_SET raft-exports INSTALL_EXPORT_SET raft-exports PATCH_COMMAND patch
+    --reject-file=- -p1 -N < ${RAFT_SOURCE_DIR}/cmake/libcudacxx.patch || true
+  )
+
+endfunction()
+
+find_and_configure_libcudacxx()
diff --git a/cpp/include/raft/comms/comms.hpp b/cpp/include/raft/comms/comms.hpp
@@ -339,9 +339,9 @@ class comms_t {
 
   /**
    * Gathers data from all ranks and delivers the combined data to all ranks
-   * @param value_t datatype of underlying buffers
-   * @param sendbuff buffer containing data to send
-   * @param recvbuff buffer containing data to receive
+   * @tparam value_t datatype of underlying buffers
+   * @param sendbuf buffer containing data to send
+   * @param recvbuf buffer containing data to receive
    * @param recvcounts pointer to an array (of length num_ranks size) containing the number of
    *                   elements that are to be received from each rank
    * @param displs pointer to an array (of length num_ranks size) to specify the displacement
@@ -376,9 +376,9 @@ class comms_t {
 
   /**
    * Gathers data from all ranks and delivers the combined data to all ranks
-   * @param value_t datatype of underlying buffers
-   * @param sendbuff buffer containing data to send
-   * @param recvbuff buffer containing data to receive
+   * @tparam value_t datatype of underlying buffers
+   * @param sendbuf buffer containing data to send
+   * @param recvbuf buffer containing data to receive
    * @param sendcount number of elements in send buffer
    * @param recvcounts pointer to an array (of length num_ranks size) containing the number of
    *                   elements that are to be received from each rank
@@ -401,6 +401,7 @@ class comms_t {
    * @tparam value_t datatype of underlying buffers
    * @param sendbuff buffer containing data to send (size recvcount * num_ranks)
    * @param recvbuff buffer containing received data
+   * @param recvcount number of items to receive
    * @param op reduction operation to perform
    * @param stream CUDA stream to synchronize operation
    */
@@ -476,7 +477,7 @@ class comms_t {
    * @param sendbuf pointer to array of data to send
    * @param sendsizes numbers of elements to send
    * @param sendoffsets offsets in a number of elements from sendbuf
-   * @param dest destination ranks
+   * @param dests destination ranks
    * @param recvbuf pointer to (initialized) array that will hold received data
    * @param recvsizes numbers of elements to recv
    * @param recvoffsets offsets in a number of elements from recvbuf
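
The `recvcounts` and `displs` parameters documented in the gather hunks above are both arrays with one entry per rank. Assuming they follow the usual `MPI_Allgatherv`-style convention, where rank `i`'s data lands at element offset `displs[i]` in the receive buffer, a tightly packed layout uses the exclusive prefix sum of `recvcounts`. The helper `make_displs` below is a hypothetical host-side sketch, not part of RAFT, and the `std::size_t` element type is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not a RAFT API): builds a displacement array for a
// tightly packed receive buffer. displs[i] is the sum of recvcounts[0..i-1],
// i.e. the exclusive prefix sum of the per-rank element counts.
std::vector<std::size_t> make_displs(const std::vector<std::size_t>& recvcounts) {
  std::vector<std::size_t> displs(recvcounts.size());
  std::size_t offset = 0;
  for (std::size_t i = 0; i < recvcounts.size(); ++i) {
    displs[i] = offset;       // rank i starts where the previous ranks end
    offset += recvcounts[i];  // advance by the number of elements rank i sends
  }
  return displs;
}
```

For three ranks contributing 3, 1, and 4 elements, `make_displs({3, 1, 4})` yields the offsets 0, 3, and 4, and the receive buffer needs 8 elements in total.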

diff --git a/cpp/include/raft/comms/std_comms.hpp b/cpp/include/raft/comms/std_comms.hpp
@@ -56,11 +56,13 @@ class std_comms : public comms_iface {
 
   /**
    * @brief Constructor for collective + point-to-point operation.
-   * @param comm initialized nccl comm
+   * @param nccl_comm initialized nccl comm
    * @param ucp_worker initialized ucp_worker instance
    * @param eps shared pointer to array of ucp endpoints
-   * @param size size of the cluster
+   * @param num_ranks number of ranks in the cluster
    * @param rank rank of the current worker
+   * @param stream cuda stream for synchronizing and ordering collective operations
+   * @param subcomms_ucp use ucp for subcommunicators
    */
   std_comms(ncclComm_t nccl_comm, ucp_worker_h ucp_worker,
             std::shared_ptr<ucp_ep_h *> eps, int num_ranks, int rank,
@@ -79,9 +81,10 @@ class std_comms : public comms_iface {
 
   /**
    * @brief constructor for collective-only operation
-   * @param comm initilized nccl communicator
-   * @param size size of the cluster
+   * @param nccl_comm initialized nccl communicator
+   * @param num_ranks size of the cluster
    * @param rank rank of the current worker
+   * @param stream stream for ordering collective operations
    */
   std_comms(const ncclComm_t nccl_comm, int num_ranks, int rank,
             cudaStream_t stream)