INSTALL
This file documents software installation of UPC++.
For information on using UPC++, see: README
The team which develops and maintains UPC++ also provides public installs of current UPC++ releases at several HPC centers. Before you invest time in installing UPC++ for yourself, please consider checking the online documentation which describes these installs, including site-specific usage instructions regarding compiling and running on each such system.
UPC++ makes aggressive use of template meta-programming techniques, and requires a modern C++ compiler and corresponding standard library implementation.
The current release is known to work on the following configurations:
- Apple macOS/x86_64:
  - The most recent Xcode release for each macOS release is generally well-tested
  - It is suspected that any Xcode (i.e. Apple clang) release 8.0 or newer will work
  - Free Software Foundation g++ (e.g., as installed by Homebrew or Fink) version 6.4.0 or newer should also work
- Apple macOS/aarch64 (aka "Apple Silicon"):
  - The most recent Xcode release for each macOS release is generally well-tested
  - It is suspected that any compatible Xcode (i.e. Apple clang) will work
  - Free Software Foundation g++ (e.g., as installed by Homebrew or Fink) of a recent version should also work
  - Caveat: Nothing platform-specific has been implemented for the mix of "performance" and "efficiency" cores, meaning performance could be highly variable.
- Linux/x86_64 with one of the following compilers:
  - g++ 6.4.0 or newer
  - clang++ 4.0.0 or newer (with libstdc++ from g++ 6.4.0 or newer)
  - Intel C++ 17.0.2 or newer (with libstdc++ from g++ 6.4.0 or newer)
  - Intel oneAPI compilers 2021.1.2 or newer (with libstdc++ from g++ 6.4.0 or newer)
  - PGI C++ 19.3 through 20.4 (with libstdc++ from g++ 6.4.0 or newer)
  - NVIDIA HPC SDK (aka nvhpc) 20.9 and newer (with libstdc++ from g++ 6.4.0 or newer)
  - AMD AOCC compilers 2.3.0 or newer (with libstdc++ from g++ 6.4.0 or newer)

  If /usr/bin/g++ is older than 6.4.0 (even if using another compiler), see Linux Compiler Notes, below.
- Linux/ppc64le (aka IBM POWER little-endian) with one of the following compilers:
  - g++ 6.4.0 or newer
  - clang++ 5.0.0 or newer (with libstdc++ from g++ 6.4.0 or newer)
  - PGI C++ 19.3 through 20.4 (with libstdc++ from g++ 6.4.0 or newer)
  - NVIDIA HPC SDK (aka nvhpc) 20.9 and newer (with libstdc++ from g++ 6.4.0 or newer)

  If /usr/bin/g++ is older than 6.4.0 (even if using another compiler), see Linux Compiler Notes, below.
- Linux/aarch64 (aka "arm64" or "armv8") with one of the following compilers:
  - g++ 6.4.0 or newer
  - clang++ 4.0.0 or newer (with libstdc++ from g++ 6.4.0 or newer)

  If /usr/bin/g++ is older than 6.4.0 (even if using another compiler), see Linux Compiler Notes, below.
  Note the GPUDirect drivers necessary for GDR-accelerated memory kinds on InfiniBand are not supported on the Linux/aarch64 platform.
- HPE Cray EX with x86_64 CPUs and one of the following PrgEnv environment modules, plus its dependencies (smp and ofi conduits):
  - PrgEnv-gnu with gcc/10.3.0 (or later) loaded.
  - PrgEnv-cray with cce/12.0.0 (or later) loaded.
  - PrgEnv-amd with amd/4.2.0 (or later) loaded.
  - PrgEnv-aocc with aocc/3.1.0 (or later) loaded.
  - PrgEnv-nvidia with nvidia/21.9 (or later) loaded.
  - PrgEnv-nvhpc with nvhpc/21.9 (or later) loaded.
  - PrgEnv-intel with intel/2023.1.0 (or later) loaded.
- NOT officially supported:
  - Vendor-specific clang++ or g++ variants.
    At least Arm Ltd. provides compilers based on their own modifications to Clang/LLVM. Similarly, at least Arm Ltd. and IBM provide forks of g++.
    To the best of our limited current knowledge, these all behave as their respective "upstream" compilers, with no additional compiler-specific issues.
    At this time we do not consider these compilers to be officially supported due to insufficient periodic automated testing.
    The presence or absence of a warning from configure varies.
- Python3 or Python2 version 2.7.5 or newer
- Perl version 5.005 or newer
- GNU Bash 3.2.57 or newer (must be installed, user's shell doesn't matter)
- GNU Make 3.80 or newer
- The following standard Unix tools: 'awk', 'sed', 'env', 'basename', 'dirname'
- If /usr/bin/g++ is older than 6.4.0 (even if using a different C++ compiler for UPC++) please read docs/local-gcc.
- If using a non-GNU compiler with /usr/bin/g++ older than 6.4.0, please also read docs/alt-compilers.
The recipe for building and installing UPC++ is the same as for many packages that use the GNU Autoconf and Automake infrastructure (though UPC++ does not use either). The high-level steps are as follows:
1. configure
   Configures UPC++ with key settings such as the installation location
2. make all
   Compiles the UPC++ package
3. make check (optional, but recommended)
   Verifies the correctness of the UPC++ build prior to its installation
4. make install
   Installs the UPC++ package to the user-specified location
5. make test_install (optional, but highly recommended)
   Verifies the installed package
6. Post-install recommendations
The following numbered sections provide detailed descriptions of each step above. Following those are sections with platform-specific instructions.
Step 1: Configuring UPC++

cd <upcxx-source-path>
./configure --prefix=<upcxx-install-path>

Or, to have distinct source and build trees (for instance to compile multiple configurations from a common source directory):

mkdir <upcxx-build-path>
cd <upcxx-build-path>
<upcxx-source-path>/configure --prefix=<upcxx-install-path>

This will configure the UPC++ library to be installed to the given
<upcxx-install-path> directory. Users are recommended to use paths to
non-existent or empty directories as the installation path so that
uninstallation is as trivial as rm -rf <upcxx-install-path>.
Depending on the platform, additional command-line arguments may be necessary
when invoking configure. For guidance, see the platform-specific instructions
in the following sections, below:
- Configuration: HPE Cray EX
- Configuration: Linux
- Configuration: Apple macOS
- Configuration: CUDA GPU support
- Configuration: AMD ROCm/HIP GPU support
- Configuration: HIP-over-CUDA GPU support
- Configuration: Intel oneAPI GPU support
Running <upcxx-source-path>/configure --help will provide general
information on the available configuration options, and similar information is
provided in the Advanced Configuration
section below.
If you are using a source tarball release downloaded from the website, it
should include an embedded copy of GASNet-EX and configure will default to
using that. However if you are using a git clone or other repo snapshot of
UPC++, then configure may default to downloading the GASNet-EX communication
library, in which case an Internet connection is needed at configuration time.
GNU Make 3.80 or newer is required to build UPC++. If neither make nor
gmake in your $PATH meets this requirement, you may use --with-gmake=...
to specify the full path to an appropriate version. You may need to
substitute gmake, or your configured value, for make where it appears in
the following steps. The final output from configure will provide the
appropriate commands.
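For example, assuming a hypothetical gmake installed under /usr/local/bin, one might configure as:

./configure --prefix=<upcxx-install-path> --with-gmake=/usr/local/bin/gmake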
Python3 or Python2 (version 2.7.5 or later) is required by UPC++. By
default, configure searches $PATH for several common Python interpreter
names. If that does not produce a suitable interpreter, you may override
this using --with-python=... to specify a python interpreter. If you
provide a full path, the value is used as given. Otherwise, the $PATH at
configure-time is searched to produce a full path. Either way, the resulting
full path to the python interpreter will be used in the installed upcxx-run
script, rather than a runtime search of $PATH. Therefore, the interpreter
specified must be available in a batch-job environment where applicable.
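For example, assuming a hypothetical Python 3 interpreter installed under /opt/python3:

./configure --prefix=<upcxx-install-path> --with-python=/opt/python3/bin/python3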
Bash 3.2.57 or newer is required by UPC++ scripts, including configure. By
default, configure will try /bin/sh and then the first instance of bash
found in $PATH. If neither of these is bash 3.2.57 (or newer), or if the one
found is not appropriate to use (for instance not accessible on compute
nodes), one can override the automated selection by invoking configure via
the desired instance of bash:
/usr/gnu/bin/bash <upcxx-source-path>/configure ...

By default, the configure script will attempt to enforce use of C++ and C
compilers which report the same family and version. If necessary, this
can be disabled using --enable-allow-compiler-mismatch. However,
installation of UPC++ configured in this manner is not supported.
Step 2: Compiling UPC++

make all

This will compile the UPC++ runtime libraries, including the GASNet-EX
communications runtime. One may run, for instance, make -j8 all to build
with eight concurrent processes. This may significantly reduce the time
required. However, parallel make can also obscure error messages, so if you
encounter a failure you should retry without a -j option.
Some combinations of network and configure options require that CXX be
capable of linking MPI applications. If that requirement exists but is unmet,
then this step will fail with output giving instructions to read the section
Configuration: Linux in this document,
where this issue is described in more detail.
The output generated at the successful conclusion of this step gives the
default network and a list of available networks. This is an appropriate time
to verify that the default network is the one you expect to use. If it is
not, but it is listed as available, you can specify your preferred network
to the later make install step without starting over. However, if your
preferred network is not listed as available, then you will need to return
to the previous (configure) step, where additional arguments or environment
modules may be required to enable detection of the appropriate headers and/or
libraries.
Step 3: Verifying the UPC++ build

Though it is not required, we recommend testing the completeness and correctness of the UPC++ build before proceeding to the installation step. In general, the environment used to compile the UPC++ tests and the environment used to run them may not be the same (most notably, on batch-scheduled and/or cross-compiled platforms). The following command assumes it is invoked in an environment suitable for both, if such is available:
make check

This compiles all available tests for the default network and then runs them.
One can override the default network by appending NETWORKS=net1,net2
to this command, with network names (such as smp, udp, ibv or ofi)
substituted for the netN placeholders.
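For example, to build and run the tests for only the smp and udp networks (one possible selection):

make check NETWORKS=smp,udp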
Setting of NETWORKS to restrict what is tested may be necessary, for
instance, if GASNet-EX detected libraries for a network not physically present
in your system. This will often occur for InfiniBand (which GASNet-EX
identifies as ibv) due to presence of the associated libraries on many Linux
distributions. One may, if desired, return to the configure step and pass
--disable-ibv (or other undesired network) to remove support for a given
network from the build of UPC++.
By default the test-running step runs each test with a 5 minute time limit
(assuming the timeout command from GNU coreutils appears in $PATH).
If any tests terminate with FAILED (exitcode=124): probable timeout, this
indicates a timeout (which might happen in environments with very slow hardware
or slow job launch). The simplest workaround in such cases is to set
TIMEOUT=false to disable the timeout. Alternatively, one can set envvar
UPCXX_RUN_TIME_LIMIT to a value in seconds to enforce a longer timeout.
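For example, either of the following can be used (the 600-second limit is purely illustrative):

make check TIMEOUT=false
UPCXX_RUN_TIME_LIMIT=600 make check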
Variables TESTS and NO_TESTS can optionally be set to a space-delimited
list of test name substrings used as a filter to select or discard a subset
of tests to be compiled/run. Variable EXTRAFLAGS can optionally inject
upcxx compile options, e.g. EXTRAFLAGS=-Werror.
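For example (the test-name substrings shown are purely illustrative):

make check TESTS="copy view" NO_TESTS=slow EXTRAFLAGS=-Werror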
If it is not possible to both compile and run parallel applications in the
same environment, then one may apply the following two steps in place of
make check:
1. In an environment suited to compilation, run make tests-clean tests.
   This will remove any test executables left over from previous attempts, and then compile all tests for all available networks. One may restrict this to a subset of the available networks by appending a setting for NETWORKS, as described above for make check.
2. In an environment suited to execution of parallel applications, run make run-tests.
   As in the first step, one may set NETWORKS on the make command line to limit the tests run to some subset of the tests built above. (A sketch of both phases appears below.)
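A minimal sketch of this two-phase approach, assuming ofi is among the available networks:

# Phase 1: on a login/compile node
make tests-clean tests NETWORKS=ofi

# Phase 2: later, inside a batch allocation on the compute nodes
make run-tests NETWORKS=ofi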
Step 4: Installing UPC++

make install [NETWORK=net]

This will install the UPC++ runtime libraries and accompanying utilities to
the location specified via --prefix=... at configuration time. If that
value is not the desired installation location, then
make install prefix=<desired-install-directory>
may be used to override the value given at configure time.
One may optionally pass NETWORK=net (replacing net by a supported network
name) to specify the default network (overriding --with-default-network=...
specified at configure time, if any). Output at the end of the all and
check steps report the default to be used in the absence of an explicit
setting, and the available networks.
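For example, to install with InfiniBand as the default network, or to redirect the install to a different (hypothetical) location:

make install NETWORK=ibv
make install prefix=/opt/upcxx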
Step 5: Verifying the installation

make tests-clean test_install

This optional command removes any test executables left over from previous attempts, and then builds a simple "Hello, World" test for each supported network using the installed UPC++ libraries and compiler wrapper.
At the end of the output will be instructions for running these tests if desired.
Post-install recommendations

After step 5 (or step 4, if skipping step 5) one may safely remove the
directory <upcxx-source-path> (and <upcxx-build-path>, if used) since they
are not needed by the installed package.
One may use the utilities upcxx (compiler wrapper), upcxx-run (launch
wrapper) and upcxx-meta (UPC++ metadata utility) by their full path in
<upcxx-install-path>/bin. However, it is common to append that directory to
one's $PATH environment variable (the best means to do so are beyond the
scope of this document).
Additionally, one may wish to set the environment variable $UPCXX_INSTALL
to <upcxx-install-path>, as this is assumed by several UPC++ examples.
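For example, in a bash shell (paths are placeholders):

export UPCXX_INSTALL=<upcxx-install-path>
export PATH="$UPCXX_INSTALL/bin:$PATH"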
For systems using "environment modules" an example module file is provided
as <upcxx-install-path>/share/modulefiles/upcxx/<upcxx-version>. This
sets both $PATH and $UPCXX_INSTALL as recommended above. Consult
the documentation for the environment modules package on how to use this file.
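For example, with a typical environment-modules setup one might load the provided module file as follows (a sketch; the exact commands depend on your modules installation):

module use <upcxx-install-path>/share/modulefiles
module load upcxx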
For users of CMake 3.6 or newer, <upcxx-install-path>/share/cmake/UPCXX
contains a UPCXXConfig.cmake. Consult CMake documentation for instructions
on use of this file.
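As a hedged sketch, a CMake project whose CMakeLists.txt calls find_package(UPCXX) can usually be pointed at this install by adding the prefix to CMake's search path:

cmake -DCMAKE_PREFIX_PATH=<upcxx-install-path> <app-source-dir>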
Finally, <upcxx-install-path>/bin/test-upcxx-install.sh is a script which can
be run to replicate the verification performed by make test_install without
<upcxx-source-path> and/or <upcxx-build-path>. This could be useful, for
instance, to verify permissions for a user other than the one performing the
installation.
Configuration: HPE Cray EX

This release of UPC++ includes initial support for the HPE Cray EX platform, including both the "Slingshot-10" and "Slingshot-11" network interface cards (NICs) and GPUs from both Nvidia and AMD. When built in a supported configuration, this release passes all of the UPC++ test suite. However, the performance has not yet been tuned on this platform.
Unlike the Cray XC, the HPE Cray EX is not treated as a cross-compilation
target when building UPC++. However, we strongly advise use of the vendor's
wrapper compilers, cc and CC. Additionally, the two NICs require distinct
non-default settings. The following shows our recommended configure command
with some placeholders (in angle brackets) which are explained below. Note that these assume
use of the default version of GASNet-EX. If using an earlier release of
GASNet-EX, please consult documentation in a UPC++ release of similar age.
module load libfabric cray-pmi <GPU_MODULES>
cd <upcxx-source-path>
./configure --prefix=<upcxx-install-path> \
--with-cc=cc --with-cxx=CC --with-mpi-cc=cc \
--with-ofi-provider=<PROVIDER> \
--with-pmi-runcmd='<RUNCMD>' \
<GPU_OPTIONS>

The libfabric and cray-pmi environment modules may or may not be loaded by
default at any given site. Please ensure they are loaded (as shown above) or
the configure or build steps may fail.
As denoted by the <GPU_MODULES> placeholder, one or more environment modules
may be needed for GPU support. Example module names (to help locate the
appropriate information in site-specific documentation) include cudatoolkit,
rocm and intel_compute_runtime, though variations on these names exist.
In some cases one may also need a device-specific module, often with a name
starting with craype-accel-, to avoid link errors or warnings on every
compile. Be advised that some sites may bundle the programming model and
device modules into a single module.
There are two NIC options in an HPE Cray EX system, known as "Slingshot-10" and
"Slingshot-11". They require different libfabric "providers", as indicated by
the <PROVIDER> placeholder above:
- --with-ofi-provider=verbs for Slingshot-10.
  This is a Mellanox ConnectX-5 (or -6) 100Gbps NIC.
- --with-ofi-provider=cxi for Slingshot-11.
  This is an HPE 200Gbps NIC.
If you are uncertain of which NIC is used on a given system, please consult the
site-specific documentation or ask the support staff for assistance.
Another alternative is to pass --with-ofi-provider=generic, which requests
provider adaptation be performed during runtime startup, at some cost in
additional communication overhead. This option may be useful for systems
with a mix of Slingshot-10 and Slingshot-11 nodes, although all processes in a
job still need to be using matching hardware and software (including provider)
at runtime.
On some systems with multiple Slingshot NICs, one will need to add
--with-host-detect=hostname. This option is recommended only when actually
required. If your system does require this setting, then you will see a
message at application run time directing you to use this option, or an
environment-based alternative.
You will also need to select the proper argument to --with-pmi-runcmd=...
(the <RUNCMD> placeholder, above).
- If using the Slurm Workload Manager: --with-pmi-runcmd='srun -n %N -- %C' (see the worked example below)
- For most other cases: --with-pmi-runcmd='aprun --cc none -n %N %C'
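For example, one complete configure invocation for a hypothetical Slingshot-11 system using Slurm (GPU modules and options omitted):

module load libfabric cray-pmi
cd <upcxx-source-path>
./configure --prefix=<upcxx-install-path> \
    --with-cc=cc --with-cxx=CC --with-mpi-cc=cc \
    --with-ofi-provider=cxi \
    --with-pmi-runcmd='srun -n %N -- %C'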
At the time of this writing we've only tested UPC++ on HPE Cray EX systems with AMD CPUs.
As mentioned earlier and indicated by the <GPU_OPTIONS> placeholder, this
UPC++ release supports GPUs using Nvidia CUDA, AMD ROCm/HIP, and Intel oneAPI
in HPE Cray EX systems. Please also see the respective sections of this
document for UPC++ configure options needed to enable this support:
- Configuration: CUDA GPU support
- Configuration: AMD ROCm/HIP GPU support
- Configuration: Intel oneAPI GPU support
With the Slingshot-11 network, some users have seen application hangs due to what appears to be "lost" RPCs. At the time this is written, there are two possible workarounds for this issue. Descriptions of both workarounds, along with the most up-to-date information on this issue in general, can be found in the corresponding GASNet-EX report: bug 4461.
With the Slingshot-10 network, there are conditions (not yet characterized)
under which RPCs may be corrupted in such a way that their reception results in
a fatal error. This can manifest with a fatal error containing the text "no
associated AM handler function" or (in a debug build) an "Assertion failure"
message with the expression isreq == header->isreq. At the time this is
written, there is a known workaround for this issue. A description of the
workaround, along with the most up-to-date information on this issue in
general, can be found in the corresponding GASNet-EX report:
bug 4517.
After running configure, return to
Step 2: Compiling UPC++, above.
Configuration: Linux

The configure command above will work as-is. The default compilers used will
be gcc/g++. The --with-cc=... and --with-cxx=... options may specify
alternatives to override this behavior. Additional options providing finer
control over how UPC++ is configured can be found in the
Advanced Configuration section below.
By default ibv-conduit (InfiniBand support) will use MPI for job spawning if a
working mpicc is found in your $PATH when UPC++ is built. The same is
true for SMP, MPI, OFI and UCX conduits, if these have been enabled. To ensure that
UPC++ applications will link when one of these conduits is used, one of three
options must be chosen. Failure to do so will typically result in an error
message at UPC++ build time, directing you to this documentation.
Note that smp-conduit is a recent addition to the list above with GASNet
versions 2024.5.3 or later.
Option 1. The most direct solution is to configure using --with-cxx=mpicxx (or
similar) to ensure correct linking of UPC++ applications which use MPI for job
spawning. When one is using MPI for job spawning, it is important that
GASNet's MPI support use a corresponding/compatible mpicc and mpirun. In
the common case, the un-prefixed mpicc and mpirun in $PATH are compatible
(ie. same vendor/version/ABI) with the provided --with-cxx=mpicxx, in which
case nothing more should be required. Otherwise, one may need to additionally
pass options like --with-mpi-cc='/path/to/compatible/mpicc -options' and/or
--with-mpirun-cmd='/path/to/compatible/mpirun -np %N %C'.
Please see GASNet's mpi-conduit documentation for details.
Option 2. If any of these networks are enabled but are not necessary, one can
configure using --disable-[network] to disable it. One may wish to select
this option if there is no corresponding network hardware or no interest in
using the given network API. The case of missing hardware can often occur for
IBV when Linux distros install the corresponding development packages as
dependencies of other packages.
Option 3. If one does not require MPI for job spawning (because SSH-based or
PMI-based or SMP fork-based spawning in GASNet are sufficient), then one may configure using
--disable-mpi-compat to eliminate the link-time dependence on MPI.
Note that this particular option does NOT work for mpi-conduit.
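Hedged sketches of the three options (each is a separate, alternative configure invocation):

# Option 1: link UPC++ applications with an MPI C++ compiler
./configure --prefix=<upcxx-install-path> --with-cxx=mpicxx

# Option 2: disable an unneeded network (InfiniBand shown here)
./configure --prefix=<upcxx-install-path> --disable-ibv

# Option 3: remove the MPI link-time dependence (SSH-, PMI- or SMP-based spawning only)
./configure --prefix=<upcxx-install-path> --disable-mpi-compat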
After running configure, return to
Step 2: Compiling UPC++, above.
Configuration: Apple macOS

On macOS, the default network is "smp": multiple processes running on a single
host, communicating over shared memory. One may specify a different default
using --with-default-network=... at configure time. However, you will also
have the opportunity to make such a selection at the make install step.
On macOS, UPC++ defaults to using the Apple LLVM clang compiler that is part of the Xcode Command Line Tools.
The Xcode Command Line Tools need to be installed before invoking configure,
i.e.:
xcode-select --install

Alternatively, the --with-cc=... and --with-cxx=... options to configure
may be used to specify different compilers.
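For example, to use a Homebrew-installed GNU compiler (the version number here is hypothetical):

./configure --prefix=<upcxx-install-path> --with-cc=gcc-13 --with-cxx=g++-13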
Note that with GASNet versions 2024.5.3 or later, if you have MPI installed
then you may also need to specify configure option --with-cxx=mpicxx or
--disable-mpi-compat accordingly. See Configuration: Linux
above for more details.
In order to use a debugger on macOS, we advise you to enable "Developer
Mode". This is a system setting, not directly related to UPC++.
Developer Mode may already be enabled, for instance if one granted Xcode
permission when it asked to enable it. If not, then an Administrator must
run DevToolsSecurity -enable in Terminal. This mode allows all users to
use development tools, including the lldb debugger. If that is not
desirable, then use of debuggers will be limited to members of the
_developer group. An internet search for macos _developer group will
provide additional information.
After running configure, return to
Step 2: Compiling UPC++, above.
Configuration: CUDA GPU support

UPC++ includes support for RMA communication operations on memory buffers resident in a CUDA-compatible NVIDIA GPU. General requirements:
- Modern NVIDIA-branded CUDA-compatible GPU hardware
- NVIDIA CUDA toolkit v9.0 or later. Available for download here.
This version of UPC++ supports GPUDirect RDMA (GDR) acceleration of memory
kinds data transfers on selected platforms using modern NVIDIA-branded GPUs
with NVIDIA- or Mellanox-branded InfiniBand or HPE Slingshot network hardware.
This support requires one of the following high-performance network conduit
configurations, and the current/default version of GASNet-EX:
- ibv-conduit with recent NVIDIA/Mellanox-branded InfiniBand network hardware
- ofi-conduit on HPE Cray EX with HPE Slingshot-11 (cxi provider)
- ofi-conduit on HPE Cray EX with HPE Slingshot-10 (verbs provider)
Additional requirements:
- Linux OS with x86_64 or ppc64le CPU (not ARM)
- GPUDirect RDMA drivers installed
When using GDR-accelerated memory kinds, calls to upcxx::copy will offload
the data transfer to the network adapter, streaming data directly between the
source and destination memory locations (in host or device memory on any node),
without staging through additional memory buffers.
For all other platforms, the CUDA support in this UPC++ release utilizes a
reference implementation which has not been tuned for performance. In
particular, upcxx::copy will stage data transfers involving device
memory through intermediate buffers in host memory, and is expected to
underperform relative to solutions using RDMA, GPUDirect and similar
zero-copy technologies. Future versions of UPC++ will introduce
native memory kinds acceleration for additional GPU and network variants.
To activate the UPC++ support for CUDA, pass --enable-cuda to the configure
script:
cd <upcxx-source-path>
./configure --prefix=<upcxx-install-path> --enable-cuda

This will detect whether the requirements for GDR acceleration are met and automatically activate that feature. For troubleshooting installation of GASNet's GDR support, please see docs/memory_kinds in the GASNet distribution.
configure --enable-cuda expects to find the NVIDIA nvcc compiler wrapper in your $PATH and
will attempt to extract the correct build settings for your system. If this
automatic extraction fails (resulting in preprocessor or linker errors
mentioning CUDA), then you may need to manually override the following
options to configure:
- --with-nvcc=...: the full path to the nvcc compiler wrapper from the CUDA toolkit.
  Eg --with-nvcc=/Developer/NVIDIA/CUDA-10.0/bin/nvcc
- --with-cuda-cppflags=...: preprocessor flags to add for locating the CUDA toolkit headers.
  Eg --with-cuda-cppflags='-I/Developer/NVIDIA/CUDA-10.0/include'
- --with-cuda-libflags=...: linker flags to use for linking CUDA executables.
  Eg --with-cuda-libflags='-Xlinker -force_load -Xlinker /Developer/NVIDIA/CUDA-10.0/lib/libcudart_static.a -L/Developer/NVIDIA/CUDA-10.0/lib -lcudadevrt -Xlinker -rpath -Xlinker /usr/local/cuda/lib -Xlinker -framework -Xlinker CoreFoundation -framework CUDA'
Note that you must build UPC++ with the same host compiler toolchain as is used
by nvcc when compiling any UPC++ CUDA programs. That is, both UPC++ and your
UPC++ application must be compiled using the same host compiler toolchain.
You can ensure this is the case by either (1) configuring UPC++ with the same
compiler as your system nvcc uses, or (2) using the -ccbin command line
argument to nvcc during application compilation to ensure it uses the same host
compiler as was passed to the UPC++ configure script.
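A sketch of option (2): compile device code with nvcc, using -ccbin to select the same host compiler that UPC++ was configured with (a hypothetical g++-12 here), then link with the upcxx wrapper:

nvcc -ccbin g++-12 -c my_kernels.cu -o my_kernels.o   # my_kernels.cu is a hypothetical source file
upcxx my_app.cpp my_kernels.o -o my_app               # additional CUDA link flags may be needed on some systems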
One can validate CUDA support in a given UPC++ install using a command like the following:
$ upcxx-info | grep CUDA
UPCXX_CUDA: 1
UPCXX_CUDA_NVCC: /path/to/cuda/bin/nvcc
UPCXX_CUDA_CPPFLAGS: ...CUDA include options...
UPCXX_CUDA_LIBFLAGS: ...CUDA library options...
GPUs with NVIDIA CUDA API (cuda-uva) ON (enabled)

Where the UPCXX_CUDA: 1 indicates the UPC++ install is CUDA-aware, and in the last line
ON indicates that GASNet-EX may include GDR acceleration support (actual availability
also depends on network backend selection at application compile time).
UPC++ CUDA operation can be validated using the following programs in the source tree:
- test/copy.cpp and test/copy-cover.cpp: correctness testers for the UPC++ cuda_device
- bench/gpu_microbenchmark.cpp: performance microbenchmark for upcxx::copy using GPU memory
- make cuda_vecadd in example/gpu_vecadd: demonstration of using UPC++ cuda_device to orchestrate communication for a program invoking CUDA computational kernels on the GPU.
One can validate use of GDR acceleration in a given UPC++ executable with a command like the following:
$ upcxx-run -i a.out | grep CUDA
UPCXXKindCUDA: 202103L
UPCXXCUDAGASNet: 1
UPCXXCUDAEnabled: 1
GASNetMKClassCUDAUVA: 1

Where the UPCXXCUDAGASNet: 1 and GASNetMKClassCUDAUVA: 1 lines together confirm the
use of GDR acceleration. If either value is 0 or absent then GDR acceleration is not in use.
There is a known bug in the vendor-provided IB Verbs firmware affecting GDR Gets that
causes crashes inside the IB Verbs network stack during copy() operations
targeting small objects in a cuda_device segment with affinity to the calling
process on some platforms. This problem can be worked-around by setting
MLX5_SCATTER_TO_CQE=0, but this setting has a global negative impact on RMA
Get operations (even those not involving device memory) so should only be used
on affected platforms. Details are here:
There is also a known bug in the libfabric "verbs provider" (recommended for
use on Slingshot-10 networks) that causes crashes inside libfabric during
copy() operations where the source objects reside in a cuda_device segment
with affinity to the calling process. This problem can be worked-around by
setting FI_VERBS_INLINE_SIZE=0. Because this setting disables a valuable
performance optimization, it may increase the latency of all small RMA Puts,
including those from host memory, as well as some RPCs. Therefore, it is
strongly recommended that you set this variable only if your system exhibits
this issue. Details are here:
In addition to the two issues described above, the current implementation of
GDR-accelerated memory kinds enforces a per-process limit of 32 active cuda_device
opens over the lifetime of the process. This static limit can be raised at configure time
via configure --with-maxeps=N, and is expected to become a more dynamic limit
in a future release.
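For example, to configure CUDA support with a raised endpoint limit (the value 64 is purely illustrative):

./configure --prefix=<upcxx-install-path> --enable-cuda --with-maxeps=64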
See the "Memory Kinds" section in the UPC++ Programmer's Guide for more details on using the CUDA support.
After running configure, return to
Step 2: Compiling UPC++, above.
Configuration: AMD ROCm/HIP GPU support

UPC++ includes support for RMA communication operations on memory buffers resident in a ROCm/HIP-compatible AMD GPU. General requirements:
- Modern AMD-branded HIP-compatible GPU hardware
- AMD ROCm drivers version 4.5.0 or later (earlier versions of ROCm MIGHT also work, but are not recommended)
This version of UPC++ supports ROCmRDMA acceleration of memory kinds data transfers on selected platforms using modern AMD-branded GPUs. This support requires one of the following high-performance network conduit configurations, and the current/default version of GASNet-EX:
- ibv-conduit with recent NVIDIA/Mellanox-branded InfiniBand network hardware
- ofi-conduit on HPE Cray EX with HPE Slingshot-11 (cxi provider)
- ofi-conduit on HPE Cray EX with HPE Slingshot-10 (verbs provider)
Additional Requirements:
- Linux OS with x86_64 or ppc64le CPU (not ARM)
- AMD GPU kernel driver installed
When using ROCmRDMA-accelerated memory kinds, calls to upcxx::copy will offload
the data transfer to the network adapter, streaming data directly between the
source and destination memory locations (in host or device memory on any node),
without staging through additional memory buffers.
For all other platforms, the ROCm/HIP support in this UPC++ release utilizes a
reference implementation which has not been tuned for performance. In
particular, upcxx::copy will stage data transfers involving device
memory through intermediate buffers in host memory, and is expected to
underperform relative to solutions using RDMA, ROCmRDMA and similar
zero-copy technologies. Future versions of UPC++ will introduce
native memory kinds acceleration for additional GPU and network variants.
To activate the UPC++ support for AMD ROCm/HIP, pass --enable-hip to the configure
script:
cd <upcxx-source-path>
./configure --prefix=<upcxx-install-path> --enable-hip

This will detect whether the requirements for ROCmRDMA acceleration are met and automatically activate that feature. For troubleshooting installation of GASNet's ROCmRDMA support, please see docs/memory_kinds in the GASNet distribution.
configure --enable-hip expects to find the AMD ROCm hipcc compiler wrapper
in your $PATH and will attempt to infer the correct ROCm/HIP install location for
your system. If this automatic detection fails, then you may need to manually
override the following options to configure:
- --with-hip-home=...: the install prefix for the ROCm/HIP developer tools.
  Eg --with-hip-home=/opt/rocm-4.5.0/hip
- --with-hip-cppflags=...: the pre-processor flags needed to find HIP runtime headers.
  Eg --with-hip-cppflags='-I/opt/rocm-4.5.0/hip/include'
- --with-hip-libflags=...: the linker flags needed to link HIP runtime libraries.
  Eg --with-hip-libflags='-L/opt/rocm-4.5.0/hip/lib -lamdhip64'
Note that you must build UPC++ with the same host compiler toolchain as is used
by hipcc when compiling any UPC++ ROCm programs. That is, both UPC++ and your
UPC++ application must be compiled using the same host compiler toolchain.
You can ensure this is the case by either (1) configuring UPC++ with the same
compiler as your system hipcc uses, or (2) using the --gcc-toolchain= command line
argument to hipcc during application compilation to ensure it uses the same host
compiler as was passed to the UPC++ configure script.
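A sketch of option (2), assuming UPC++ was configured with a hypothetical GCC installed under /opt/gcc-12:

hipcc --gcc-toolchain=/opt/gcc-12 -c my_kernels.hip.cpp -o my_kernels.o   # hypothetical source file
upcxx my_app.cpp my_kernels.o -o my_app                                   # additional HIP link flags may be needed on some systems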
One can validate HIP/ROCm support in a given UPC++ install using a command like the following:
$ upcxx-info | grep HIP
UPCXX_HIP: 1
UPCXX_HIP_CPPFLAGS: ...HIP include options...
UPCXX_HIP_LIBFLAGS: ...HIP library options...
GPUs with AMD HIP API (hip) ON (enabled)

Where the UPCXX_HIP: 1 indicates the UPC++ install is HIP-aware, and in the last line
ON indicates that GASNet-EX may include ROCmRDMA acceleration support (actual availability
also depends on network backend selection at application compile time).
UPC++ ROCm/HIP operation can be validated using the following programs in the source tree:
- test/copy.cpp and test/copy-cover.cpp: correctness testers for the UPC++ hip_device
- bench/gpu_microbenchmark.cpp: performance microbenchmark for upcxx::copy using GPU memory
- make hip_vecadd in example/gpu_vecadd: demonstration of using UPC++ hip_device to orchestrate communication for a program invoking HIP computational kernels on the GPU.
One can validate use of ROCmRDMA acceleration in a given UPC++ executable with a command like the following:
$ upcxx-run -i a.out | grep HIP
UPCXXKindHIP: 202203L
UPCXXHIPEnabled: 1
UPCXXHIPGASNet: 1
GASNetMKClassHIP: 1

Where the UPCXXHIPGASNet: 1 and GASNetMKClassHIP: 1 lines together confirm the
use of ROCmRDMA acceleration. If either value is 0 or absent then ROCmRDMA acceleration is not in use.
The current implementation of
ROCmRDMA-accelerated memory kinds enforces a per-process limit of 32 active hip_device
opens over the lifetime of the process. This static limit can be raised at configure time
via configure --with-maxeps=N, and is expected to become a more dynamic limit
in a future release.
See the "Memory Kinds" section in the UPC++ Programmer's Guide for more details on using the UPC++ GPU support.
After running configure, return to
Step 2: Compiling UPC++, above.
Configuration: HIP-over-CUDA GPU support

AMD ROCm provides an implementation of HIP-over-CUDA allowing HIP code to target
NVIDIA-branded GPUs. UPC++ can interoperate with this translation layer,
allowing the use of upcxx::hip_device on NVIDIA GPU hardware. This enables RMA
communication on memory buffers resident in these GPUs just as if
they were AMD GPUs (or if the code being compiled was written in CUDA). This is
an experimental capability, but has been shown to work with the following
configurations:
- AMD ROCm version 5.1.0 and CUDA toolkit version 11.4.0
- AMD ROCm version 5.3.2 and CUDA toolkit version 11.7.0
as well as modern NVIDIA-branded CUDA-compatible GPU hardware.
Additional requirements for GPUDirect RDMA can be found in the section Configuration: CUDA GPU support.
To activate the UPC++ support for HIP-over-CUDA, pass --enable-hip and
--with-hip-platform=nvidia to the configure script:
cd <upcxx-source-path>
./configure --prefix=<upcxx-install-path> --enable-hip --with-hip-platform=nvidia

For issues with automatic detection of compiler location or build flags,
consult the relevant sections of
Configuration: CUDA GPU support and
Configuration: AMD ROCm/HIP GPU support.
As mentioned in prior sections, both UPC++ and your UPC++ application must be compiled using the same host compiler toolchain.
After running configure, return to
Step 2: Compiling UPC++, above.
Configuration: Intel oneAPI GPU support

UPC++ includes initial EXPERIMENTAL support for RMA communication operations on memory buffers resident in a oneAPI-compatible Intel GPU, using the oneAPI Level-Zero (ZE) interface.
Intel GPU memory kind support in this release is believed to be functionally correct,
but has not been tuned for performance. upcxx::copy() operations on ze_device
memory are currently staged through host memory by default and do not yet leverage network-direct RDMA.
- Modern Intel-branded oneAPI-compatible GPU hardware with appropriate kernel drivers
- Intel Level-Zero development headers (level-zero-dev package)
The full Intel oneAPI Toolkits are NOT required to build UPC++ and use
ze_device, but are likely required by applications that want to use the
GPU for computation.
To activate the UPC++ support for Intel oneAPI Level-Zero,
pass --enable-ze to the configure script:
cd <upcxx-source-path>
./configure --prefix=<upcxx-install-path> --enable-ze

configure --enable-ze attempts to automatically detect the install prefix of
the Level-Zero developer tools and related compilation options for your system.
If this automatic detection fails, then you may need to manually
override one or more of the following options to configure:
- --with-ze-home=...: the install prefix for the Level Zero developer tools.
  Eg --with-ze-home=/usr/local/pkg/intel/level-zero/1.9.4
- --with-ze-cppflags=...: the pre-processor flags needed to find Level Zero headers.
  Eg --with-ze-cppflags='-I/usr/local/pkg/intel/level-zero/1.9.4/include'
- --with-ze-libflags=...: the linker flags needed to link Level Zero runtime libraries.
  Eg --with-ze-libflags='-L/usr/local/pkg/intel/level-zero/1.9.4/lib64 -lze_loader'
Note that you must build UPC++ with the same host compiler toolchain used for compiling objects linked to any UPC++ oneAPI programs. That is, both UPC++ and your UPC++ application must be compiled using the same host compiler toolchain.
One can validate ze_device support in a given UPC++ install using a command like the following:
$ upcxx-info | grep ZE
UPCXX_ZE: 1
UPCXX_ZE_CPPFLAGS: ...ZE include options...
UPCXX_ZE_LIBFLAGS: ...ZE library options...

Where the UPCXX_ZE: 1 indicates the UPC++ install is ZE-aware.
UPC++ ze_device operation can be validated using the following programs in the source tree:
- test/copy.cpp and test/copy-cover.cpp: correctness testers for the UPC++ ze_device
- bench/gpu_microbenchmark.cpp: performance microbenchmark for upcxx::copy using GPU memory
One can validate a given UPC++ executable includes ze_device support with a command
like the following:
$ upcxx-run -i a.out | grep ZE
UPCXXKindZE: 202303L
UPCXXZEEnabled: 1
UPCXXZEGASNet: 0

Where the UPCXXZEEnabled: 1 line indicates the presence of ze_device
support in UPC++, and UPCXXZEGASNet: 0 indicates the default lack of hardware
acceleration for ze_device transfers in the current GASNet release.
This version of UPC++ includes an EXPERIMENTAL prototype-quality implementation
of accelerated memory kinds data transfers on selected platforms using modern
Intel-branded GPUs with HPE Slingshot-11 network hardware. This support is
preliminary and has known correctness and functionality limitations, and
is thus disabled by default; configure option --enable-kind-ze must be
provided to activate this support.
This support requires the following high-performance network conduit configurations, and the current/default version of GASNet-EX:
- ofi-conduit on HPE Cray EX with HPE Slingshot-11 (cxi provider)
Additional requirements:
- Recent Linux OS with x86_64 CPU
- Appropriate Intel GPU drivers installed
When using accelerated memory kinds, calls to upcxx::copy will offload
the data transfer to the network adapter, streaming data directly between the
source and destination memory locations (in host or device memory on any node),
without staging through additional memory buffers. Presence of this support
can be validated using the same commands in the previous section, where the
output includes a UPCXXZEGASNet: 1 line to indicate presence of the support.
In the absence of this experimental support, the Level Zero memory kinds
support in this UPC++ release utilizes a reference implementation which has not
been tuned for performance. In particular, upcxx::copy will stage data
transfers involving device memory through intermediate buffers in host memory,
and is expected to underperform relative to solutions using zero-copy technologies.
Future versions of UPC++ and GASNet-EX will expand and enhance the support
for memory kinds acceleration on Intel GPUs.
See the "Memory Kinds" section in the UPC++ Programmer's Guide for more details on using the UPC++ GPU support.
After running configure, return to
Step 2: Compiling UPC++, above.
Advanced Configuration

The configure script tries to pick sensible defaults for the platform it is
running on, but its behavior can be controlled using the following command-line
options (a combined example follows the list):
- --prefix=...: The location at which UPC++ is to be installed. The default is /usr/local/upcxx.
- --with-cc=... and --with-cxx=...: The C and C++ compilers to use.
- --with-cross=...: The cross-configure settings script to pull from the GASNet-EX source tree (<gasnet>/other/contrib/cross-configure-${VALUE}).
- --without-cross: Disable automatic cross-compilation, for instance to compile for the front-end of a Cray XC system.
- --with-default-network=...: Sets the default network to be used by the upcxx compiler wrapper. Valid values are listed under "UPC++ Backends" in README. The default is (currently) smp. Users with high-speed networks, such as InfiniBand (ibv), are encouraged to set this parameter to a value appropriate for their system.
- --with-gasnet=...: Provides the GASNet-EX source tree from which UPC++ will configure and build its own copies of GASNet-EX. This can be a path to a tarball, URL to a tarball, or path to a full source tree. If provided, this must correspond to a recent and compatible version of GASNet-EX (NOT GASNet-1). Defaults to an embedded copy of GASNet-EX, or the GASNet-EX download URL.
- --with-gmake=...: GNU Make command to use; must be 3.80 or newer. The default behavior is to search $PATH for a make or gmake which meets this minimum version requirement.
- --with-python=...: Python interpreter to use; must be Python3 or Python2 version 2.7.5 or newer. The default behavior is to search $PATH for a suitable interpreter when upcxx-run is executed. This option results in the use of a full path to the Python interpreter in upcxx-run.
- Options for control of (optional) CUDA support are documented in the section Configuration: CUDA GPU support
- Options for control of (optional) AMD ROCm/HIP GPU support are documented in the section Configuration: AMD ROCm/HIP GPU support
- Options for control of (optional) Intel oneAPI GPU support are documented in the section Configuration: Intel oneAPI GPU support
- Options not recognized by the UPC++ configure script will be passed to the GASNet-EX configure. For instance, --with-mpirun-cmd=... might be required to set up MPI-based launch of ibv-conduit applications. Please read the GASNet-EX documentation for more information on this and many other options available to configure GASNet-EX. Additionally, passing the option --help=recursive to the UPC++ configure script will produce GASNet-EX's configure help message.
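A combined example pulling several of these options together (all values shown are illustrative):

./configure --prefix=/opt/upcxx \
    --with-cc=gcc --with-cxx=g++ \
    --with-default-network=ibv \
    --with-gmake=/usr/local/bin/gmake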
In addition to these explicit configure options, there are several environment
variables which can implicitly affect the configuration of GASNet-EX. The most
common of these are listed at the end of the output of configure --help.
Since these influence the GASNet-EX configure script, they are used in the
make or make all stages of the UPC++ build, not its configure stage.