INSTALL
This file documents software installation of UPC++.
For information on using UPC++, see: README
The team which develops and maintains UPC++ also provides public installs of current UPC++ releases at several HPC centers. Before you invest time in installing UPC++ for yourself, please consider checking the online documentation which describes these installs, including site-specific usage instructions regarding compiling and running on each such system.
UPC++ makes aggressive use of template meta-programming techniques, and requires a modern C++ compiler and corresponding standard library implementation.
The current release is known to work on the following configurations:
- Apple macOS/x86_64:
  - The most recent Xcode release for each macOS release is generally well-tested
  - It is suspected that any Xcode (i.e. Apple clang) release 8.0 or newer will work
  - Free Software Foundation g++ (e.g., as installed by Homebrew or Fink) version 6.4.0 or newer should also work
- Apple macOS/aarch64 (aka "Apple Silicon"):
  - The most recent Xcode release for each macOS release is generally well-tested
  - It is suspected that any compatible Xcode (i.e. Apple clang) will work
  - Free Software Foundation g++ (e.g., as installed by Homebrew or Fink) of a recent version should also work
  - Caveat: Nothing platform-specific has been implemented for the mix of "performance" and "efficiency" cores, meaning performance could be highly variable.
- Linux/x86_64 with one of the following compilers:
  - g++ 6.4.0 or newer
  - clang++ 4.0.0 or newer (with libstdc++ from g++ 6.4.0 or newer)
  - Intel C++ 17.0.2 or newer (with libstdc++ from g++ 6.4.0 or newer)
  - Intel oneAPI compilers 2021.1.2 or newer (with libstdc++ from g++ 6.4.0 or newer)
  - PGI C++ 19.3 through 20.4 (with libstdc++ from g++ 6.4.0 or newer)
  - NVIDIA HPC SDK (aka nvhpc) 20.9 and newer (with libstdc++ from g++ 6.4.0 or newer)
  - AMD AOCC compilers 2.3.0 or newer (with libstdc++ from g++ 6.4.0 or newer)

  If /usr/bin/g++ is older than 6.4.0 (even if using another compiler), see Linux Compiler Notes, below.
- Linux/ppc64le (aka IBM POWER little-endian) with one of the following compilers:
  - g++ 6.4.0 or newer
  - clang++ 5.0.0 or newer (with libstdc++ from g++ 6.4.0 or newer)
  - PGI C++ 19.3 through 20.4 (with libstdc++ from g++ 6.4.0 or newer)
  - NVIDIA HPC SDK (aka nvhpc) 20.9 and newer (with libstdc++ from g++ 6.4.0 or newer)

  If /usr/bin/g++ is older than 6.4.0 (even if using another compiler), see Linux Compiler Notes, below.
- Linux/aarch64 (aka "arm64" or "armv8") with one of the following compilers:
  - g++ 6.4.0 or newer
  - clang++ 4.0.0 or newer (with libstdc++ from g++ 6.4.0 or newer)

  If /usr/bin/g++ is older than 6.4.0 (even if using another compiler), see Linux Compiler Notes, below.
  Note the GPUDirect drivers necessary for GDR-accelerated memory kinds on InfiniBand are not supported on the Linux/aarch64 platform.
- HPE Cray EX with x86_64 CPUs and one of the following PrgEnv environment modules, plus its dependencies (smp and ofi conduits):
  - PrgEnv-gnu with gcc/10.3.0 (or later) loaded.
  - PrgEnv-cray with cce/12.0.0 (or later) loaded.
  - PrgEnv-amd with amd/4.2.0 (or later) loaded.
  - PrgEnv-aocc with aocc/3.1.0 (or later) loaded.
  - PrgEnv-nvidia with nvidia/21.9 (or later) loaded.
  - PrgEnv-nvhpc with nvhpc/21.9 (or later) loaded.
  - PrgEnv-intel with intel/2023.1.0 (or later) loaded.
- NOT officially supported:
  - Vendor-specific clang++ or g++ variants.
    At least Arm Ltd. provides compilers based on their own modifications to Clang/LLVM. Similarly, at least Arm Ltd. and IBM provide forks of g++.
    To the best of our limited current knowledge, these all behave as their respective "upstream" compilers, with no additional compiler-specific issues.
    At this time we do not consider these compilers to be officially supported due to insufficient periodic automated testing.
    The presence or absence of a warning from configure varies.
- Python3 or Python2 version 2.7.5 or newer
- Perl version 5.005 or newer
- GNU Bash 3.2.57 or newer (must be installed, user's shell doesn't matter)
- GNU Make 3.80 or newer
- The following standard Unix tools: 'awk', 'sed', 'env', 'basename', 'dirname'
- If /usr/bin/g++ is older than 6.4.0 (even if using a different C++ compiler for UPC++) please read docs/local-gcc.
- If using a non-GNU compiler with /usr/bin/g++ older than 6.4.0, please also read docs/alt-compilers.
The recipe for building and installing UPC++ is the same as for many packages that use the GNU Autoconf and Automake infrastructure (though UPC++ does not use either). The high-level steps are as follows:
1. configure
   Configures UPC++ with key settings such as the installation location
2. make all
   Compiles the UPC++ package
3. make check (optional, but recommended)
   Verifies the correctness of the UPC++ build prior to its installation
4. make install
   Installs the UPC++ package to the user-specified location
5. make test_install (optional, but highly recommended)
   Verifies the installed package
6. Post-install recommendations
The following numbered sections provide detailed descriptions of each step above. Following those are sections with platform-specific instructions.
Step 1: Configuring UPC++

cd <upcxx-source-path>
./configure --prefix=<upcxx-install-path>

Or, to have distinct source and build trees (for instance to compile multiple configurations from a common source directory):

mkdir <upcxx-build-path>
cd <upcxx-build-path>
<upcxx-source-path>/configure --prefix=<upcxx-install-path>

This will configure the UPC++ library to be installed to the given
<upcxx-install-path> directory. Users are recommended to use paths to
non-existent or empty directories as the installation path so that
uninstallation is as trivial as rm -rf <upcxx-install-path>.
Depending on the platform, additional command-line arguments may be necessary
when invoking configure. For guidance, see the platform-specific instructions
in the following sections, below:
- Configuration: HPE Cray EX
- Configuration: Linux
- Configuration: Apple macOS
- Configuration: CUDA GPU support
- Configuration: AMD ROCm/HIP GPU support
- Configuration: HIP-over-CUDA GPU support
- Configuration: Intel oneAPI GPU support
Running <upcxx-source-path>/configure --help will provide general
information on the available configuration options, and similar information is
provided in the Advanced Configuration
section below.
If you are using a source tarball release downloaded from the website, it
should include an embedded copy of GASNet-EX and configure will default to
using that. However if you are using a git clone or other repo snapshot of
UPC++, then configure may default to downloading the GASNet-EX communication
library, in which case an Internet connection is needed at configuration time.
GNU Make 3.80 or newer is required to build UPC++. If neither make nor
gmake in your $PATH meets this requirement, you may use --with-gmake=...
to specify the full path to an appropriate version. You may need to
substitute gmake, or your configured value, for make where it appears in
the following steps. The final output from configure will provide the
appropriate commands.
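For example, assuming a hypothetical gmake installed under /usr/local/bin, one might configure as:

./configure --prefix=<upcxx-install-path> --with-gmake=/usr/local/bin/gmake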
Python3 or Python2 (version 2.7.5 or later) is required by UPC++. By
default, configure searches $PATH for several common Python interpreter
names. If that does not produce a suitable interpreter, you may override
this using --with-python=... to specify a python interpreter. If you
provide a full path, the value is used as given. Otherwise, the $PATH at
configure-time is searched to produce a full path. Either way, the resulting
full path to the python interpreter will be used in the installed upcxx-run
script, rather than a runtime search of $PATH. Therefore, the interpreter
specified must be available in a batch-job environment where applicable.
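For example, assuming a hypothetical Python 3 interpreter installed under /opt/python3:

./configure --prefix=<upcxx-install-path> --with-python=/opt/python3/bin/python3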
Bash 3.2.57 or newer is required by UPC++ scripts, including configure. By
default, configure will try /bin/sh and then the first instance of bash
found in $PATH. If neither of these is bash 3.2.57 (or newer), or if the one
found is not appropriate to use (for instance not accessible on compute
nodes), one can override the automated selection by invoking configure via
the desired instance of bash:
/usr/gnu/bin/bash <upcxx-source-path>/configure ...

By default, the configure script will attempt to enforce use of C++ and C
compilers which report the same family and version. If necessary, this
can be disabled using --enable-allow-compiler-mismatch. However,
installation of UPC++ configured in this manner is not supported.
Step 2: Compiling UPC++

make all

This will compile the UPC++ runtime libraries, including the GASNet-EX
communications runtime. One may run, for instance, make -j8 all to build
with eight concurrent processes. This may significantly reduce the time
required. However, parallel make can also obscure error messages, so if you
encounter a failure you should retry without a -j option.
Some combinations of network and configure options require that CXX be
capable of linking MPI applications. If that requirement exists but is unmet,
then this step will fail with output giving instructions to read the section
Configuration: Linux in this document,
where this issue is described in more detail.
The output generated at the successful conclusion of this step gives the
default network and a list of available networks. This is an appropriate time
to verify that the default network is the one you expect to use. If it is
not, but it is listed as available, you can specify your preferred network
to the later make install step without starting over. However, if your
preferred network is not listed as available, then you will need to return
to the previous (configure) step, where additional arguments or environment
modules may be required to enable detection of the appropriate headers and/or
libraries.
Step 3: Verifying the UPC++ build

Though it is not required, we recommend testing the completeness and correctness of the UPC++ build before proceeding to the installation step. In general, the environment used to compile the UPC++ tests and the environment used to run them may not be the same (most notably, on batch-scheduled and/or cross-compiled platforms). The following command assumes it is invoked in an environment suitable for both, if such is available:
make check

This compiles all available tests for the default network and then runs them.
One can override the default network by appending NETWORKS=net1,net2
to this command, with network names (such as smp, udp, ibv or ofi)
substituted for the netN placeholders.
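For example, to build and run the tests for only the smp and udp networks (one possible selection):

make check NETWORKS=smp,udp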
Setting of NETWORKS to restrict what is tested may be necessary, for
instance, if GASNet-EX detected libraries for a network not physically present
in your system. This will often occur for InfiniBand (which GASNet-EX
identifies as ibv) due to presence of the associated libraries on many Linux
distributions. One may, if desired, return to the configure step and pass
--disable-ibv (or other undesired network) to remove support for a given
network from the build of UPC++.
By default the test-running step runs each test with a 5 minute time limit
(assuming the timeout command from GNU coreutils appears in $PATH).
If any tests terminate with FAILED (exitcode=124): probable timeout, this
indicates a timeout (which might happen in environments with very slow hardware
or slow job launch). The simplest workaround in such cases is to set
TIMEOUT=false to disable the timeout. Alternatively, one can set envvar
UPCXX_RUN_TIME_LIMIT to a value in seconds to enforce a longer timeout.
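For example, either of the following can be used (the 600-second limit is purely illustrative):

make check TIMEOUT=false
UPCXX_RUN_TIME_LIMIT=600 make check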
Variables TESTS and NO_TESTS can optionally be set to a space-delimited
list of test name substrings used as a filter to select or discard a subset
of tests to be compiled/run. Variable EXTRAFLAGS can optionally inject
upcxx compile options, e.g. EXTRAFLAGS=-Werror.
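For example (the test-name substrings shown are purely illustrative):

make check TESTS="copy view" NO_TESTS=slow EXTRAFLAGS=-Werror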
If it is not possible to both compile and run parallel applications in the
same environment, then one may apply the following two steps in place of
make check:
1. In an environment suited to compilation, run make tests-clean tests.
   This will remove any test executables left over from previous attempts, and then compile all tests for all available networks. One may restrict this to a subset of the available networks by appending a setting for NETWORKS, as described above for make check.
2. In an environment suited to execution of parallel applications, run make run-tests.
   As in the first step, one may set NETWORKS on the make command line to limit the tests run to some subset of the tests built above. (A sketch of both phases appears below.)
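A minimal sketch of this two-phase approach, assuming ofi is among the available networks:

# Phase 1: on a login/compile node
make tests-clean tests NETWORKS=ofi

# Phase 2: later, inside a batch allocation on the compute nodes
make run-tests NETWORKS=ofi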
Step 4: Installing UPC++

make install [NETWORK=net]

This will install the UPC++ runtime libraries and accompanying utilities to
the location specified via --prefix=... at configuration time. If that
value is not the desired installation location, then
make install prefix=<desired-install-directory>
may be used to override the value given at configure time.
One may optionally pass NETWORK=net (replacing net by a supported network
name) to specify the default network (overriding --with-default-network=...
specified at configure time, if any). Output at the end of the all and
check steps report the default to be used in the absence of an explicit
setting, and the available networks.
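For example, to install with InfiniBand as the default network, or to redirect the install to a different (hypothetical) location:

make install NETWORK=ibv
make install prefix=/opt/upcxx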
Step 5: Verifying the installation

make tests-clean test_install

This optional command removes any test executables left over from previous attempts, and then builds a simple "Hello, World" test for each supported network using the installed UPC++ libraries and compiler wrapper.
At the end of the output will be instructions for running these tests if desired.
Post-install recommendations

After step 5 (or step 4, if skipping step 5) one may safely remove the
directory <upcxx-source-path> (and <upcxx-build-path>, if used) since they
are not needed by the installed package.
One may use the utilities upcxx (compiler wrapper), upcxx-run (launch
wrapper) and upcxx-meta (UPC++ metadata utility) by their full path in
<upcxx-install-path>/bin. However, it is common to append that directory to
one's $PATH environment variable (the best means to do so are beyond the
scope of this document).
Additionally, one may wish to set the environment variable $UPCXX_INSTALL
to <upcxx-install-path>, as this is assumed by several UPC++ examples.
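For example, in a bash shell (paths are placeholders):

export UPCXX_INSTALL=<upcxx-install-path>
export PATH="$UPCXX_INSTALL/bin:$PATH"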
For systems using "environment modules" an example module file is provided
as <upcxx-install-path>/share/modulefiles/upcxx/<upcxx-version>. This
sets both $PATH and $UPCXX_INSTALL as recommended above. Consult
the documentation for the environment modules package on how to use this file.
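For example, with a typical environment-modules setup one might load the provided module file as follows (a sketch; the exact commands depend on your modules installation):

module use <upcxx-install-path>/share/modulefiles
module load upcxx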
For users of CMake 3.6 or newer, <upcxx-install-path>/share/cmake/UPCXX
contains a UPCXXConfig.cmake. Consult CMake documentation for instructions
on use of this file.
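As a hedged sketch, a CMake project whose CMakeLists.txt calls find_package(UPCXX) can usually be pointed at this install by adding the prefix to CMake's search path:

cmake -DCMAKE_PREFIX_PATH=<upcxx-install-path> <app-source-dir>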
Finally, <upcxx-install-path>/bin/test-upcxx-install.sh is a script which can
be run to replicate the verification performed by make test_install without
<upcxx-source-path> and/or <upcxx-build-path>. This could be useful, for
instance, to verify permissions for a user other than the one performing the
installation.
Configuration: HPE Cray EX

This release of UPC++ includes initial support for the HPE Cray EX platform, including both the "Slingshot-10" and "Slingshot-11" network interface cards (NICs) and GPUs from both Nvidia and AMD. When built in a supported configuration, this release passes all of the UPC++ test suite. However, the performance has not yet been tuned on this platform.
Unlike the Cray XC, the HPE Cray EX is not treated as a cross-compilation
target when building UPC++. However, we strongly advise use of the vendor's
wrapper compilers, cc and CC. Additionally, the two NICs require distinct
non-default settings. The following shows our recommended configure command
with some placeholders (in angle brackets) which are explained below. Note that these assume
use of the default version of GASNet-EX. If using an earlier release of
GASNet-EX, please consult documentation in a UPC++ release of similar age.
module load libfabric cray-pmi <GPU_MODULES>
cd <upcxx-source-path>
./configure --prefix=<upcxx-install-path> \
--with-cc=cc --with-cxx=CC --with-mpi-cc=cc \
--with-ofi-provider=<PROVIDER> \
--with-pmi-runcmd='<RUNCMD>' \
<GPU_OPTIONS>

The libfabric and cray-pmi environment modules may or may not be loaded by
default at any given site. Please ensure they are loaded (as shown above) or
the configure or build steps may fail.
As denoted by the <GPU_MODULES> placeholder, one or more environment modules
may be needed for GPU support. Example module names (to help locate the
appropriate information in site-specific documentation) include cudatoolkit,
rocm and intel_compute_runtime, though variations on these names exist.
In some cases one may also need a device-specific module, often with a name
starting with craype-accel-, to avoid link errors or warnings on every
compile. Be advised that some sites may bundle the programming model and
device modules into a single module.
There are two NIC options in an HPE Cray EX system, known as "Slingshot-10" and
"Slingshot-11". They require different libfabric "providers", as indicated by
the <PROVIDER> placeholder above:
- --with-ofi-provider=verbs for Slingshot-10.
  This is a Mellanox ConnectX-5 (or -6) 100Gbps NIC.
- --with-ofi-provider=cxi for Slingshot-11.
  This is an HPE 200Gbps NIC.
If you are uncertain of which NIC is used on a given system, please consult the
site-specific documentation or ask the support staff for assistance.
Another alternative is to pass --with-ofi-provider=generic, which requests
provider adaptation be performed during runtime startup, at some cost in
additional communication overhead. This option may be useful for systems
with a mix of Slingshot-10 and Slingshot-11 nodes, although all processes in a
job still need to be using matching hardware and software (including provider)
at runtime.
On some systems with multiple Slingshot NICs, one will need to add
--with-host-detect=hostname. This option is recommended only when actually
required. If your system does require this setting, then you will see a
message at application run time directing you to use this option, or an
environment-based alternative.
You will also need to select the proper argument to --with-pmi-runcmd=...
(the <RUNCMD> placeholder, above).
- If using the Slurm Workload Manager: --with-pmi-runcmd='srun -n %N -- %C' (see the worked example below)
- For most other cases: --with-pmi-runcmd='aprun --cc none -n %N %C'
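For example, one complete configure invocation for a hypothetical Slingshot-11 system using Slurm (GPU modules and options omitted):

module load libfabric cray-pmi
cd <upcxx-source-path>
./configure --prefix=<upcxx-install-path> \
    --with-cc=cc --with-cxx=CC --with-mpi-cc=cc \
    --with-ofi-provider=cxi \
    --with-pmi-runcmd='srun -n %N -- %C'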
At the time of this writing we've only tested UPC++ on HPE Cray EX systems with AMD CPUs.
As mentioned earlier and indicated by the <GPU_OPTIONS> placeholder, this
UPC++ release supports GPUs using Nvidia CUDA, AMD ROCm/HIP, and Intel oneAPI
in HPE Cray EX systems. Please also see the respective sections of this
document for UPC++ configure options needed to enable this support:
- Configuration: CUDA GPU support
- Configuration: AMD ROCm/HIP GPU support
- Configuration: Intel oneAPI GPU support
With the Slingshot-11 network, some users have seen application hangs due to what appears to be "lost" RPCs. At the time this is written, there are two possible workarounds for this issue. Descriptions of both workarounds, along with the most up-to-date information on this issue in general, can be found in the corresponding GASNet-EX report: bug 4461.
With the Slingshot-10 network, there are conditions (not yet characterized)
under which RPCs may be corrupted in such a way that their reception results in
a fatal error. This can manifest with a fatal error containing the text "no
associated AM handler function" or (in a debug build) an "Assertion failure"
message with the expression isreq == header->isreq. At the time this is
written, there is a known workaround for this issue. A description of the
workaround, along with the most up-to-date information on this issue in
general, can be found in the corresponding GASNet-EX report:
bug 4517.
After running configure, return to
Step 2: Compiling UPC++, above.
Configuration: Linux

The configure command above will work as-is. The default compilers used will
be gcc/g++. The --with-cc=... and --with-cxx=... options may specify
alternatives to override this behavior. Additional options providing finer
control over how UPC++ is configured can be found in the
Advanced Configuration section below.
By default ibv-conduit (InfiniBand support) will use MPI for job spawning if a
working mpicc is found in your $PATH when UPC++ is built. The same is
true for SMP, MPI, OFI and UCX conduits, if these have been enabled. To ensure that
UPC++ applications will link when one of these conduits is used, one of three
options must be chosen. Failure to do so will typically result in an error
message at UPC++ build time, directing you to this documentation.
Note that smp-conduit is a recent addition to the list above with GASNet
versions 2024.5.3 or later.
Option 1. The most direct solution is to configure using --with-cxx=mpicxx (or
similar) to ensure correct linking of UPC++ applications which use MPI for job
spawning. When one is using MPI for job spawning, it is important that
GASNet's MPI support use a corresponding/compatible mpicc and mpirun. In
the common case, the un-prefixed mpicc and mpirun in $PATH are compatible
(ie. same vendor/version/ABI) with the provided --with-cxx=mpicxx, in which
case nothing more should be required. Otherwise, one may need to additionally
pass options like --with-mpi-cc='/path/to/compatible/mpicc -options' and/or
--with-mpirun-cmd='/path/to/compatible/mpirun -np %N %C'.
Please see GASNet's mpi-conduit documentation for details.
Option 2. If any of these networks are enabled but are not necessary, one can
configure using --disable-[network] to disable it. One may wish to select
this option if there is no corresponding network hardware or no interest in
using the given network API. The case of missing hardware can often occur for
IBV when Linux distros install the corresponding development packages as
dependencies of other packages.
Option 3. If one does not require MPI for job spawning (because SSH-based or
PMI-based or SMP fork-based spawning in GASNet are sufficient), then one may configure using
--disable-mpi-compat to eliminate the link-time dependence on MPI.
Note that this particular option does NOT work for mpi-conduit.
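Hedged sketches of the three options (each is a separate, alternative configure invocation):

# Option 1: link UPC++ applications with an MPI C++ compiler
./configure --prefix=<upcxx-install-path> --with-cxx=mpicxx

# Option 2: disable an unneeded network (InfiniBand shown here)
./configure --prefix=<upcxx-install-path> --disable-ibv

# Option 3: remove the MPI link-time dependence (SSH-, PMI- or SMP-based spawning only)
./configure --prefix=<upcxx-install-path> --disable-mpi-compat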
After running configure, return to
Step 2: Compiling UPC++, above.
Configuration: Apple macOS

On macOS, the default network is "smp": multiple processes running on a single
host, communicating over shared memory. One may specify a different default
using --with-default-network=... at configure time. However, you will also
have the opportunity to make such a selection at the make install step.
On macOS, UPC++ defaults to using the Apple LLVM clang compiler that is part of the Xcode Command Line Tools.
The Xcode Command Line Tools need to be installed before invoking configure,
i.e.:
xcode-select --install

Alternatively, the --with-cc=... and --with-cxx=... options to configure
may be used to specify different compilers.
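For example, to use a Homebrew-installed GNU compiler (the version number here is hypothetical):

./configure --prefix=<upcxx-install-path> --with-cc=gcc-13 --with-cxx=g++-13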
Note that with GASNet versions 2024.5.3 or later, if you have MPI installed
then you may also need to specify configure option --with-cxx=mpicxx or
--disable-mpi-compat accordingly. See Configuration: Linux
above for more details.
In order to use a debugger on macOS, we advise you to enable "Developer
Mode". This is a system setting, not directly related to UPC++.
Developer Mode may already be enabled, for instance if one granted Xcode
permission when it asked to enable it. If not, then an Administrator must
run DevToolsSecurity -enable in Terminal. This mode allows all users to
use development tools, including the lldb debugger. If that is not
desirable, then use of debuggers will be limited to members of the
_developer group. An internet search for macos _developer group will
provide additional information.
After running configure, return to
Step 2: Compiling UPC++, above.
Configuration: CUDA GPU support

UPC++ includes support for RMA communication operations on memory buffers resident in a CUDA-compatible NVIDIA GPU. General requirements:
- Modern NVIDIA-branded CUDA-compatible GPU hardware
- NVIDIA CUDA toolkit v9.0 or later. Available for download here.
This version of UPC++ supports GPUDirect RDMA (GDR) acceleration of memory
kinds data transfers on selected platforms using modern NVIDIA-branded GPUs
with NVIDIA- or Mellanox-branded InfiniBand or HPE Slingshot network hardware.
This support requires one of the following high-performance network conduit
configurations, and the current/default version of GASNet-EX:
- ibv-conduit with recent NVIDIA/Mellanox-branded InfiniBand network hardware
- ofi-conduit on HPE Cray EX with HPE Slingshot-11 (cxi provider)
- ofi-conduit on HPE Cray EX with HPE Slingshot-10 (verbs provider)
Additional requirements:
- Linux OS with x86_64 or ppc64le CPU (not ARM)
- GPUDirect RDMA drivers installed
When using GDR-accelerated memory kinds, calls to upcxx::copy will offload
the data transfer to the network adapter, streaming data directly between the
source and destination memory locations (in host or device memory on any node),
without staging through additional memory buffers.
For all other platforms, the CUDA support in this UPC++ release utilizes a
reference implementation which has not been tuned for performance. In
particular, upcxx::copy will stage data transfers involving device
memory through intermediate buffers in host memory, and is expected to
underperform relative to solutions using RDMA, GPUDirect and similar
zero-copy technologies. Future versions of UPC++ will introduce
native memory kinds acceleration for additional GPU and network variants.
To activate the UPC++ support for CUDA, pass --enable-cuda to the configure
script:
cd <upcxx-source-path>
./configure --prefix=<upcxx-install-path> --enable-cuda

This will detect whether the requirements for GDR acceleration are met and automatically activate that feature. For troubleshooting installation of GASNet's GDR support, please see docs/memory_kinds in the GASNet distribution.
configure --enable-cuda expects to find the NVIDIA nvcc compiler wrapper in your $PATH and
will attempt to extract the correct build settings for your system. If this
automatic extraction fails (resulting in preprocessor or linker errors
mentioning CUDA), then you may need to manually override the following
options to configure:
- --with-nvcc=...: the full path to the nvcc compiler wrapper from the CUDA toolkit.
  Eg --with-nvcc=/Developer/NVIDIA/CUDA-10.0/bin/nvcc
- --with-cuda-cppflags=...: preprocessor flags to add for locating the CUDA toolkit headers.
  Eg --with-cuda-cppflags='-I/Developer/NVIDIA/CUDA-10.0/include'
- --with-cuda-libflags=...: linker flags to use for linking CUDA executables.
  Eg --with-cuda-libflags='-Xlinker -force_load -Xlinker /Developer/NVIDIA/CUDA-10.0/lib/libcudart_static.a -L/Developer/NVIDIA/CUDA-10.0/lib -lcudadevrt -Xlinker -rpath -Xlinker /usr/local/cuda/lib -Xlinker -framework -Xlinker CoreFoundation -framework CUDA'
Note that you must build UPC++ with the same host compiler toolchain as is used
by nvcc when compiling any UPC++ CUDA programs. That is, both UPC++ and your
UPC++ application must be compiled using the same host compiler toolchain.
You can ensure this is the case by either (1) configuring UPC++ with the same
compiler as your system nvcc uses, or (2) using the -ccbin command line
argument to nvcc during application compilation to ensure it uses the same host
compiler as was passed to the UPC++ configure script.
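A sketch of option (2): compile device code with nvcc, using -ccbin to select the same host compiler that UPC++ was configured with (a hypothetical g++-12 here), then link with the upcxx wrapper:

nvcc -ccbin g++-12 -c my_kernels.cu -o my_kernels.o   # my_kernels.cu is a hypothetical source file
upcxx my_app.cpp my_kernels.o -o my_app               # additional CUDA link flags may be needed on some systems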
One can validate CUDA support in a given UPC++ install using a command like the following:
$ upcxx-info | grep CUDA
UPCXX_CUDA: 1
UPCXX_CUDA_NVCC: /path/to/cuda/bin/nvcc
UPCXX_CUDA_CPPFLAGS: ...CUDA include options...
UPCXX_CUDA_LIBFLAGS: ...CUDA library options...
GPUs with NVIDIA CUDA API (cuda-uva) ON (enabled)

Where the UPCXX_CUDA: 1 indicates the UPC++ install is CUDA-aware, and in the last line
ON indicates that GASNet-EX may include GDR acceleration support (actual availability
also depends on network backend selection at application compile time).
UPC++ CUDA operation can be validated using the following programs in the source tree:
- test/copy.cpp and test/copy-cover.cpp: correctness testers for the UPC++ cuda_device
- bench/gpu_microbenchmark.cpp: performance microbenchmark for upcxx::copy using GPU memory
- make cuda_vecadd in example/gpu_vecadd: demonstration of using UPC++ cuda_device to orchestrate communication for a program invoking CUDA computational kernels on the GPU.
One can validate use of GDR acceleration in a given UPC++ executable with a command like the following:
$ upcxx-run -i a.out | grep CUDA
UPCXXKindCUDA: 202103L
UPCXXCUDAGASNet: 1
UPCXXCUDAEnabled: 1
GASNetMKClassCUDAUVA: 1

Where the UPCXXCUDAGASNet: 1 and GASNetMKClassCUDAUVA: 1 lines together confirm the
use of GDR acceleration. If either value is 0 or absent then GDR acceleration is not in use.
There is a known bug in the vendor-provided IB Verbs firmware affecting GDR Gets that
causes crashes inside the IB Verbs network stack during copy() operations
targeting small objects in a cuda_device segment with affinity to the calling
process on some platforms. This problem can be worked-around by setting
MLX5_SCATTER_TO_CQE=0, but this setting has a global negative impact on RMA
Get operations (even those not involving device memory) so should only be used
on affected platforms. Details are here:
There is also a known bug in the libfabric "verbs provider" (recommended for
use on Slingshot-10 networks) that causes crashes inside libfabric during
copy() operations where the source objects reside in a cuda_device segment
with affinity to the calling process. This problem can be worked-around by
setting FI_VERBS_INLINE_SIZE=0. Because this setting disables a valuable
performance optimization, it may increase the latency of all small RMA Puts,
including those from host memory, as well as some RPCs. Therefore, it is
strongly recommended that you set this variable only if your system exhibits
this issue. Details are here:
In addition to the two issues described above, the current implementation of
GDR-accelerated memory kinds enforces a per-process limit of 32 active cuda_device
opens over the lifetime of the process. This static limit can be raised at configure time
via configure --with-maxeps=N, and is expected to become a more dynamic limit
in a future release.
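For example, to configure CUDA support with a raised endpoint limit (the value 64 is purely illustrative):

./configure --prefix=<upcxx-install-path> --enable-cuda --with-maxeps=64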
See the "Memory Kinds" section in the UPC++ Programmer's Guide for more details on using the CUDA support.
After running configure, return to
Step 2: Compiling UPC++, above.
Configuration: AMD ROCm/HIP GPU support

UPC++ includes support for RMA communication operations on memory buffers resident in a ROCm/HIP-compatible AMD GPU. General requirements:
- Modern AMD-branded HIP-compatible GPU hardware
- AMD ROCm drivers version 4.5.0 or later (earlier versions of ROCm MIGHT also work, but are not recommended)
This version of UPC++ supports ROCmRDMA acceleration of memory kinds data transfers on selected platforms using modern AMD-branded GPUs. This support requires one of the following high-performance network conduit configurations, and the current/default version of GASNet-EX:
- ibv-conduit with recent NVIDIA/Mellanox-branded InfiniBand network hardware
- ofi-conduit on HPE Cray EX with HPE Slingshot-11 (cxi provider)
- ofi-conduit on HPE Cray EX with HPE Slingshot-10 (verbs provider)
Additional Requirements:
- Linux OS with x86_64 or ppc64le CPU (not ARM)
- AMD GPU kernel driver installed
When using ROCmRDMA-accelerated memory kinds, calls to upcxx::copy will offload
the data transfer to the network adapter, streaming data directly between the
source and destination memory locations (in host or device memory on any node),
without staging through additional memory buffers.
For all other platforms, the ROCm/HIP support in this UPC++ release utilizes a
reference implementation which has not been tuned for performance. In
particular, upcxx::copy will stage data transfers involving device
memory through intermediate buffers in host memory, and is expected to
underperform relative to solutions using RDMA, ROCmRDMA and similar
zero-copy technologies. Future versions of UPC++ will introduce
native memory kinds acceleration for additional GPU and network variants.
To activate the UPC++ support for AMD ROCm/HIP, pass --enable-hip to the configure
script:
cd <upcxx-source-path>
./configure --prefix=<upcxx-install-path> --enable-hip

This will detect whether the requirements for ROCmRDMA acceleration are met and automatically activate that feature. For troubleshooting installation of GASNet's ROCmRDMA support, please see docs/memory_kinds in the GASNet distribution.
configure --enable-hip expects to find the AMD ROCm hipcc compiler wrapper
in your $PATH and will attempt to infer the correct ROCm/HIP install location for
your system. If this automatic detection fails, then you may need to manually
override the following options to configure:
- --with-hip-home=...: the install prefix for the ROCm/HIP developer tools.
  Eg --with-hip-home=/opt/rocm-4.5.0/hip
- --with-hip-cppflags=...: the pre-processor flags needed to find HIP runtime headers.
  Eg --with-hip-cppflags='-I/opt/rocm-4.5.0/hip/include'
- --with-hip-libflags=...: the linker flags needed to link HIP runtime libraries.
  Eg --with-hip-libflags='-L/opt/rocm-4.5.0/hip/lib -lamdhip64'
Note that you must build UPC++ with the same host compiler toolchain as is used
by hipcc when compiling any UPC++ ROCm programs. That is, both UPC++ and your
UPC++ application must be compiled using the same host compiler toolchain.
You can ensure this is the case by either (1) configuring UPC++ with the same
compiler as your system hipcc uses, or (2) using the --gcc-toolchain= command line
argument to hipcc during application compilation to ensure it uses the same host
compiler as was passed to the UPC++ configure script.
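A sketch of option (2), assuming UPC++ was configured with a hypothetical GCC installed under /opt/gcc-12:

hipcc --gcc-toolchain=/opt/gcc-12 -c my_kernels.hip.cpp -o my_kernels.o   # hypothetical source file
upcxx my_app.cpp my_kernels.o -o my_app                                   # additional HIP link flags may be needed on some systems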
One can validate HIP/ROCm support in a given UPC++ install using a command like the following:
$ upcxx-info | grep HIP
UPCXX_HIP: 1
UPCXX_HIP_CPPFLAGS: ...HIP include options...
UPCXX_HIP_LIBFLAGS: ...HIP library options...
GPUs with AMD HIP API (hip) ON (enabled)

Where the UPCXX_HIP: 1 indicates the UPC++ install is HIP-aware, and in the last line
ON indicates that GASNet-EX may include ROCmRDMA acceleration support (actual availability
also depends on network backend selection at application compile time).
UPC++ ROCm/HIP operation can be validated using the following programs in the source tree:
- test/copy.cpp and test/copy-cover.cpp: correctness testers for the UPC++ hip_device
- bench/gpu_microbenchmark.cpp: performance microbenchmark for upcxx::copy using GPU memory
- make hip_vecadd in example/gpu_vecadd: demonstration of using UPC++ hip_device to orchestrate communication for a program invoking HIP computational kernels on the GPU.
One can validate use of ROCmRDMA acceleration in a given UPC++ executable with a command like the following:
$ upcxx-run -i a.out | grep HIP
UPCXXKindHIP: 202203L
UPCXXHIPEnabled: 1
UPCXXHIPGASNet: 1
GASNetMKClassHIP: 1

Where the UPCXXHIPGASNet: 1 and GASNetMKClassHIP: 1 lines together confirm the
use of ROCmRDMA acceleration. If either value is 0 or absent then ROCmRDMA acceleration is not in use.
The current implementation of
ROCmRDMA-accelerated memory kinds enforces a per-process limit of 32 active hip_device
opens over the lifetime of the process. This static limit can be raised at configure time
via configure --with-maxeps=N, and is expected to become a more dynamic limit
in a future release.
See the "Memory Kinds" section in the UPC++ Programmer's Guide for more details on using the UPC++ GPU support.
After running configure, return to
Step 2: Compiling UPC++, above.
Configuration: HIP-over-CUDA GPU support

AMD ROCm provides an implementation of HIP-over-CUDA allowing HIP code to target
NVIDIA-branded GPUs. UPC++ can interoperate with this translation layer,
allowing the use of upcxx::hip_device on NVIDIA GPU hardware. This enables RMA
communication on memory buffers resident in these GPUs just as if
they were AMD GPUs (or if the code being compiled was written in CUDA). This is
an experimental capability, but has been shown to work with the following
configurations:
- AMD ROCm version 5.1.0 and CUDA toolkit version 11.4.0
- AMD ROCm version 5.3.2 and CUDA toolkit version 11.7.0
as well as modern NVIDIA-branded CUDA-compatible GPU hardware.
Additional requirements for GPUDirect RDMA can be found in the section Configuration: CUDA GPU support.
To activate the UPC++ support for HIP-over-CUDA, pass --enable-hip and
--with-hip-platform=nvidia to the configure script:
cd <upcxx-source-path>
./configure --prefix=<upcxx-install-path> --enable-hip --with-hip-platform=nvidia

For issues with automatic detection of compiler location or build flags,
consult the relevant sections of
Configuration: CUDA GPU support and
Configuration: AMD ROCm/HIP GPU support.
As mentioned in prior sections, both UPC++ and your UPC++ application must be compiled using the same host compiler toolchain.
After running configure, return to
Step 2: Compiling UPC++, above.
Configuration: Intel oneAPI GPU support

UPC++ includes initial EXPERIMENTAL support for RMA communication operations on memory buffers resident in a oneAPI-compatible Intel GPU, using the oneAPI Level-Zero (ZE) interface.
Intel GPU memory kind support in this release is believed to be functionally correct,
but has not been tuned for performance. upcxx::copy() operations on ze_device
memory are currently staged through host memory by default and do not yet leverage network-direct RDMA.
- Modern Intel-branded oneAPI-compatible GPU hardware with appropriate kernel drivers
- Intel Level-Zero development headers (level-zero-dev package)
The full Intel oneAPI Toolkits are NOT required to build UPC++ and use
ze_device, but are likely required by applications that want to use the
GPU for computation.
To activate the UPC++ support for Intel oneAPI Level-Zero,
pass --enable-ze to the configure script:
cd <upcxx-source-path>
./configure --prefix=<upcxx-install-path> --enable-ze

configure --enable-ze attempts to automatically detect the install prefix of
the Level-Zero developer tools and related compilation options for your system.
If this automatic detection fails, then you may need to manually
override one or more of the following options to configure:
- --with-ze-home=...: the install prefix for the Level Zero developer tools.
  Eg --with-ze-home=/usr/local/pkg/intel/level-zero/1.9.4
- --with-ze-cppflags=...: the pre-processor flags needed to find Level Zero headers.
  Eg --with-ze-cppflags='-I/usr/local/pkg/intel/level-zero/1.9.4/include'
- --with-ze-libflags=...: the linker flags needed to link Level Zero runtime libraries.
  Eg --with-ze-libflags='-L/usr/local/pkg/intel/level-zero/1.9.4/lib64 -lze_loader'
Note that you must build UPC++ with the same host compiler toolchain used for compiling objects linked to any UPC++ oneAPI programs. That is, both UPC++ and your UPC++ application must be compiled using the same host compiler toolchain.
One can validate ze_device support in a given UPC++ install using a command like the following:
$ upcxx-info | grep ZE
UPCXX_ZE: 1
UPCXX_ZE_CPPFLAGS: ...ZE include options...
UPCXX_ZE_LIBFLAGS: ...ZE library options...

Where the UPCXX_ZE: 1 indicates the UPC++ install is ZE-aware.
UPC++ ze_device operation can be validated using the following programs in the source tree:
- test/copy.cpp and test/copy-cover.cpp: correctness testers for the UPC++ ze_device
- bench/gpu_microbenchmark.cpp: performance microbenchmark for upcxx::copy using GPU memory
One can validate a given UPC++ executable includes ze_device support with a command
like the following:
$ upcxx-run -i a.out | grep ZE
UPCXXKindZE: 202303L
UPCXXZEEnabled: 1
UPCXXZEGASNet: 0

Where the UPCXXZEEnabled: 1 line indicates the presence of ze_device
support in UPC++, and UPCXXZEGASNet: 0 indicates the default lack of hardware
acceleration for ze_device transfers in the current GASNet release.
This version of UPC++ includes an EXPERIMENTAL prototype-quality implementation
of accelerated memory kinds data transfers on selected platforms using modern
Intel-branded GPUs with HPE Slingshot-11 network hardware. This support is
preliminary and has known correctness and functionality limitations, and
is thus disabled by default; configure option --enable-kind-ze must be
provided to activate this support.
This support requires the following high-performance network conduit configurations, and the current/default version of GASNet-EX:
- ofi-conduit on HPE Cray EX with HPE Slingshot-11 (cxi provider)
Additional requirements:
- Recent Linux OS with x86_64 CPU
- Appropriate Intel GPU drivers installed
When using accelerated memory kinds, calls to upcxx::copy will offload
the data transfer to the network adapter, streaming data directly between the
source and destination memory locations (in host or device memory on any node),
without staging through additional memory buffers. Presence of this support
can be validated using the same commands in the previous section, where the
output includes a UPCXXZEGASNet: 1 line to indicate presence of the support.
In the absence of this experimental support, the Level Zero memory kinds
support in this UPC++ release utilizes a reference implementation which has not
been tuned for performance. In particular, upcxx::copy will stage data
transfers involving device memory through intermediate buffers in host memory,
and is expected to underperform relative to solutions using zero-copy technologies.
Future versions of UPC++ and GASNet-EX will expand and enhance the support
for memory kinds acceleration on Intel GPUs.
See the "Memory Kinds" section in the UPC++ Programmer's Guide for more details on using the UPC++ GPU support.
After running configure, return to
Step 2: Compiling UPC++, above.
Advanced Configuration

The configure script tries to pick sensible defaults for the platform it is
running on, but its behavior can be controlled using the following command-line
options (a combined example follows the list):
- --prefix=...: The location at which UPC++ is to be installed. The default is /usr/local/upcxx.
- --with-cc=... and --with-cxx=...: The C and C++ compilers to use.
- --with-cross=...: The cross-configure settings script to pull from the GASNet-EX source tree (<gasnet>/other/contrib/cross-configure-${VALUE}).
- --without-cross: Disable automatic cross-compilation, for instance to compile for the front-end of a Cray XC system.
- --with-default-network=...: Sets the default network to be used by the upcxx compiler wrapper. Valid values are listed under "UPC++ Backends" in README. The default is (currently) smp. Users with high-speed networks, such as InfiniBand (ibv), are encouraged to set this parameter to a value appropriate for their system.
- --with-gasnet=...: Provides the GASNet-EX source tree from which UPC++ will configure and build its own copies of GASNet-EX. This can be a path to a tarball, URL to a tarball, or path to a full source tree. If provided, this must correspond to a recent and compatible version of GASNet-EX (NOT GASNet-1). Defaults to an embedded copy of GASNet-EX, or the GASNet-EX download URL.
- --with-gmake=...: GNU Make command to use; must be 3.80 or newer. The default behavior is to search $PATH for a make or gmake which meets this minimum version requirement.
- --with-python=...: Python interpreter to use; must be Python3 or Python2 version 2.7.5 or newer. The default behavior is to search $PATH for a suitable interpreter when upcxx-run is executed. This option results in the use of a full path to the Python interpreter in upcxx-run.
- Options for control of (optional) CUDA support are documented in the section Configuration: CUDA GPU support
- Options for control of (optional) AMD ROCm/HIP GPU support are documented in the section Configuration: AMD ROCm/HIP GPU support
- Options for control of (optional) Intel oneAPI GPU support are documented in the section Configuration: Intel oneAPI GPU support
- Options not recognized by the UPC++ configure script will be passed to the GASNet-EX configure. For instance, --with-mpirun-cmd=... might be required to set up MPI-based launch of ibv-conduit applications. Please read the GASNet-EX documentation for more information on this and many other options available to configure GASNet-EX. Additionally, passing the option --help=recursive to the UPC++ configure script will produce GASNet-EX's configure help message.
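A combined example pulling several of these options together (all values shown are illustrative):

./configure --prefix=/opt/upcxx \
    --with-cc=gcc --with-cxx=g++ \
    --with-default-network=ibv \
    --with-gmake=/usr/local/bin/gmake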
In addition to these explicit configure options, there are several environment
variables which can implicitly affect the configuration of GASNet-EX. The most
common of these are listed at the end of the output of configure --help.
Since these influence the GASNet-EX configure script, they are used in the
make or make all stages of the UPC++ build, not its configure stage.