implementation defined
This document describes stable features supported by this implementation that go beyond the requirements of the UPC++ Specification, or are specified as implementation-defined behavior.
The following macro definitions are provided by upcxx/upcxx.hpp:
- UPCXX_VERSION: An integer literal providing the release version of the implementation, in the format [YYYY][MM][PP] corresponding to release YYYY.MM.PP
- UPCXX_SPEC_VERSION: An integer literal providing the revision of the UPC++ specification to which this implementation adheres. See the specification for the specified value.
- UPCXX_KIND_CUDA: An integer literal providing the version number of the CUDA memory-kind feature to which this implementation adheres, defined only when the library is built with CUDA enabled. See the UPC++ specification for the specified value.
- UPCXX_KIND_HIP: An integer literal providing the version number of the ROCm/HIP memory-kind feature to which this implementation adheres, defined only when the library is built with ROCm/HIP enabled. See the UPC++ specification for the specified value.
- UPCXX_KIND_ZE: An integer literal providing the version number of the oneAPI Level Zero (ZE) memory-kind feature to which this implementation adheres, defined only when the library is built with ZE support enabled. See the UPC++ specification for the specified value.
- UPCXX_THREADMODE: This is either undefined (for the default "seq" threadmode) or defined to an unspecified non-zero integer value for the "par" threadmode. Recommended usage is #if UPCXX_THREADMODE to identify the need for thread-safety constructs, such as locks.
- UPCXX_CODEMODE: This is either undefined (for the "debug" codemode) or defined to an unspecified non-zero integer value for the "opt" (production) codemode.
- UPCXX_NETWORK_*: The network being targeted is indicated by a preprocessor identifier with a UPCXX_NETWORK_ prefix followed by the network name in capitals, which is defined to a non-zero integer value. Identifiers corresponding to other networks are undefined. Examples include UPCXX_NETWORK_IBV and UPCXX_NETWORK_UDP.
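As a minimal sketch of how two of these macros might be consumed, the block below splits a release number in the [YYYY][MM][PP] format into its parts and selects a lock type with #if UPCXX_THREADMODE. The names release_version, decode_release, maybe_mutex and bump are hypothetical helpers introduced for illustration; they are not part of UPC++. The sketch compiles without the UPC++ header, in which case UPCXX_THREADMODE is undefined and the "seq" branch is taken.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical helper type: the three fields of a YYYY.MM.PP release.
struct release_version { int year, month, patch; };

// Split an integer in the [YYYY][MM][PP] format (e.g. a UPCXX_VERSION value).
inline release_version decode_release(long v) {
  assert(v >= 0);
  return release_version{ static_cast<int>(v / 10000),
                          static_cast<int>((v / 100) % 100),
                          static_cast<int>(v % 100) };
}

// UPCXX_THREADMODE is undefined in "seq" builds, so `#if UPCXX_THREADMODE`
// evaluates to 0 there and selects the no-op lock.
#if UPCXX_THREADMODE
using maybe_mutex = std::mutex;   // "par": real mutual exclusion
#else
struct maybe_mutex {              // "seq": locking compiles away
  void lock() {}
  void unlock() {}
};
#endif

// Hypothetical shared counter, protected only when the threadmode needs it.
inline int bump(int &counter, maybe_mutex &m) {
  std::lock_guard<maybe_mutex> guard(m);
  return ++counter;
}
```

For example, decode_release(20220300) yields the fields 2022, 3 and 0, matching release 2022.3.0.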
Future and promise completions default to eager notification. Thus:
- source_cx::as_future() and operation_cx::as_future() are equivalent to source_cx::as_eager_future() and operation_cx::as_eager_future(), respectively
- source_cx::as_promise(p) and operation_cx::as_promise(p) are equivalent to source_cx::as_eager_promise(p) and operation_cx::as_eager_promise(p), respectively
The default can be changed on a per-translation-unit basis by defining the
UPCXX_DEFER_COMPLETION macro prior to including upcxx/upcxx.hpp. Defining
the macro to a non-zero value makes the default deferred (so that as_future()
and as_promise(p) are equivalent to as_defer_future() and
as_defer_promise(p), respectively), while defining it to 0 makes the default
eager.
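The ordering requirement can be sketched as follows: the macro is defined first, and only then is the header included. The __has_include guard is an addition made here so that the fragment also compiles on a system without UPC++ installed, and completions_deferred_by_default is a hypothetical constant that merely restates the documented rule. Neither is something the library provides.

```cpp
// Request deferred notification for this translation unit only.
// The definition must come before the UPC++ header is included.
#define UPCXX_DEFER_COMPLETION 1

// Guard added for this sketch only, so it builds without UPC++ installed.
#if defined(__has_include)
#  if __has_include(<upcxx/upcxx.hpp>)
#    include <upcxx/upcxx.hpp>
#  endif
#endif

// Restates the documented rule: non-zero selects deferred, 0 selects eager.
constexpr bool completions_deferred_by_default = (UPCXX_DEFER_COMPLETION != 0);
```

Other translation units in the same program that do not define the macro keep the eager default.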
Calls to RPC communication functions (e.g., upcxx::rpc()) may throw exceptions.
The exceptions may be thrown on the initiating thread before or
after serialization of the function arguments. In all other ways, a call
throwing such an exception is effectively "canceled" -- it will not lead to
invocation of the function object at the target, nor will it deliver any event
notifications (for example, a promise passed using an as_promise() completion
will remain unchanged by the exceptional call).
Starting in 2021.9.0, resource exhaustion while trying to inject an RPC will
throw a upcxx::bad_shared_alloc exception. The what() member function
includes information about the shared heap state at the point of failure. In
the current release, this should only occur when the RPC payload is somewhat
large (over a few KiB) and the shared heap on the initiating process fails to
allocate a temporary buffer large enough to hold the serialized RPC. In
releases prior to 2021.9.0, such conditions led to an immediate fatal error.
Starting in 2022.3.0, attempting to make a cross-code segment (CCS) RPC call
into an unverified segment when CCS segment verification is enabled will throw
a upcxx::segment_verification_error. CCS RPC calls are RPC calls which
directly invoke functions existing in other executable program segments, such
as dynamic libraries. The what() member function includes information about
the state of the function pointer relocation tables. Catching this exception
may be used to arrange for later collective synchronization of cross-segment
function pointer relocation information using
upcxx::experimental::relo::verify_all() or
upcxx::experimental::relo::verify_segment() when libraries are dlopened
asynchronously. See docs/ccs-rpc for more information
about the CCS RPC feature.
The experimental immediate-mode RPC injection calls
may additionally throw a upcxx::experimental::network_busy exception in
the presence of network congestion.
UPC++ specifies the type upcxx::gpu_default_device which is an implementation-defined
alias for a GPU device type. The binding of that alias is determined as follows:
- For the common case where UPC++ is configured for exactly one GPU variety (e.g. --enable-cuda OR --enable-hip OR --enable-ze), upcxx::gpu_default_device defaults to an alias for the corresponding device type (i.e. upcxx::cuda_device, upcxx::hip_device or upcxx::ze_device, respectively).
- When no device support is configured, upcxx::gpu_default_device defaults to an alias for upcxx::cuda_device.
- User programs may override this default choice by defining one of the following preprocessor macros to 1 before including upcxx.hpp (these may be set independently per translation unit):
  - UPCXX_GPU_DEFAULT_DEVICE_CUDA=1 makes gpu_default_device an alias for cuda_device
  - UPCXX_GPU_DEFAULT_DEVICE_HIP=1 makes gpu_default_device an alias for hip_device
  - UPCXX_GPU_DEFAULT_DEVICE_ZE=1 makes gpu_default_device an alias for ze_device
- For rare cases where UPC++ is configured to support two or more GPU varieties, upcxx::gpu_default_device will default to aliasing an unspecified device type. Users of such configurations are advised to define one of the macros described above.
The resulting memory kind can be queried via the gpu_default_device::kind constant.
upcxx::make_gpu_allocator() defaults to returning a device_allocator<gpu_default_device>,
but this can also be overridden on a call-site basis via template argument.
The upcxx::make_gpu_allocator<Device>(sz,device_id) factory function defaults
to device_id = auto_device_id which activates an implementation-defined
"smart" choice of valid GPU device when a device ID was not explicitly provided
by the caller. That "smart" choice is determined as follows:
- If Device::device_n() is zero, there are no valid GPUs at the calling process and the call to upcxx::make_gpu_allocator(sz, auto_device_id) will return an inactive device_allocator (one with no corresponding segment).
- If Device::device_n() is one, there is a single valid GPU at the calling process and the call will attempt to construct a device_allocator segment for that GPU.
- Otherwise, there are multiple valid GPUs at the calling process. In this case the "smart" choice will cycle through valid IDs with subsequent calls, with a starting point determined from the process rank in local_team(). The resulting device ID can be queried via device_allocator::device_id(). Programs wanting finer-grained control over device selection in multi-GPU environments may override this choice by explicitly passing the device_id argument to upcxx::make_gpu_allocator(sz,device_id).
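The three cases above can be modeled with a small stand-alone class. This is a sketch of the documented behavior, not the library's code: auto_device_picker and no_device are hypothetical names, and the starting point is assumed to be the local rank modulo the device count, which the text above leaves unspecified beyond "determined from the process rank in local_team()".

```cpp
#include <cassert>

// Hypothetical stand-in for the "smart" device choice; not the library's code.
class auto_device_picker {
  int device_n_;   // number of valid GPUs visible to this process
  int next_;       // ID handed out by the next call
public:
  enum { no_device = -1 };  // placeholder meaning "inactive allocator"

  // ASSUMPTION: the starting point is local_rank modulo device_n.
  auto_device_picker(int device_n, int local_rank)
    : device_n_(device_n),
      next_(device_n > 0 ? local_rank % device_n : 0) {
    assert(device_n >= 0 && local_rank >= 0);
  }

  int pick() {
    if (device_n_ == 0) return no_device;  // no GPUs: inactive allocator
    int id = next_;
    next_ = (next_ + 1) % device_n_;       // cycle on subsequent calls
    return id;                             // with one GPU this is always 0
  }
};
```

With four GPUs and a local rank of 6, successive calls under this assumption pick devices 2, 3, 0, 1 and then repeat.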
This implementation provides assertion macros to facilitate debugging on
distributed systems. Unlike the standard assert() macro, the macros below
print a backtrace and/or freeze to allow a debugger to be attached before
aborting program execution.
- UPCXX_ASSERT_ALWAYS(test), UPCXX_ASSERT_ALWAYS(test, message):
  Evaluates test exactly once, and if the result is a false value, then:
  - outputs message (if provided) along with file location information to standard error,
  - optionally prints a backtrace and/or freezes for debugger (controlled by environment variables), and
  - aborts execution by calling std::abort().

  message may be any expression such that std::cerr << message is well-formed; for instance, it may itself include stream-insertion operators (e.g. UPCXX_ASSERT_ALWAYS(x > 5, "error! x = " << x)). message is only evaluated when test produces a false value. If message is not provided, it defaults to a string that includes a textual representation of test. In all cases, this macro expands to an expression with type void.
- UPCXX_ASSERT(test), UPCXX_ASSERT(test, message):
  In the "debug" codemode, provides the same behavior as UPCXX_ASSERT_ALWAYS(). In the "opt" codemode, this macro expands to a side-effect-free expression with type void that does not evaluate the arguments.
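To make the evaluation rules concrete, here is a simplified model of such a macro in plain C++. It is a sketch only: SKETCH_ASSERT_ALWAYS, SKETCH_ASSERT, SKETCH_OPT, sketch_assert_fail and checked_positive are hypothetical names, the message argument is mandatory here, and the backtrace and debugger-freeze steps are omitted. It shows one way to get a void-typed expression that evaluates test once and builds the streamed message only on failure.

```cpp
#include <cassert>
#include <cstdlib>
#include <iostream>

// Reports the failure and aborts. `write_message` streams the user message.
template <typename WriteMessage>
[[noreturn]] void sketch_assert_fail(const char *file, int line,
                                     const char *test_text,
                                     WriteMessage write_message) {
  std::cerr << file << ":" << line << ": assertion failed: " << test_text << ": ";
  write_message(std::cerr);
  std::cerr << std::endl;
  std::abort();
}

// `test` is evaluated exactly once; `message` only on failure. Both arms of
// the conditional have type void, so the whole expression has type void.
#define SKETCH_ASSERT_ALWAYS(test, message)                               \
  ((test) ? static_cast<void>(0)                                          \
          : sketch_assert_fail(__FILE__, __LINE__, #test,                 \
                               [&](std::ostream &os) { os << message; }))

// ASSUMPTION: SKETCH_OPT is a hypothetical stand-in for the "opt" codemode.
#ifdef SKETCH_OPT
#  define SKETCH_ASSERT(test, message) static_cast<void>(0)
#else
#  define SKETCH_ASSERT(test, message) SKETCH_ASSERT_ALWAYS(test, message)
#endif

// Example use next to the standard macro, which takes no custom message.
inline int checked_positive(int x) {
  assert(x != 0);
  SKETCH_ASSERT_ALWAYS(x > 0, "error! x = " << x);
  return x;
}
```

Wrapping the message in a lambda is what defers its evaluation: the stream insertions run only inside sketch_assert_fail, after test has already produced a false value.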
Several unspecified, experimental features are implemented in the
upcxx::experimental namespace.
All the features described in the following sections are subject to change or removal at
any time. If you find any of them useful, please send an email to
upcxx@googlegroups.com, and we will consider adding them to the specification proper.
upcxx::experimental interfaces for collectives over non-TriviallySerializable values:
- broadcast of Serializable but non-TriviallySerializable values:
  template<typename T, typename Cx=/*unspecified*/>
  RType broadcast_nontrivial(T &&value, intrank_t root,
                             const team &team=world(),
                             Cx &&completions=operation_cx::as_future());
- reduction of Serializable but non-TriviallySerializable values:
  constexpr /*unspecified*/ op_add;
  constexpr /*unspecified*/ op_mul;
  constexpr /*unspecified*/ op_min;
  constexpr /*unspecified*/ op_max;
  constexpr /*unspecified*/ op_bit_and;
  constexpr /*unspecified*/ op_bit_or;
  constexpr /*unspecified*/ op_bit_xor;
  template <typename T, typename BinaryOp, typename Cx=/*unspecified*/>
  RType reduce_one_nontrivial(T &&value, BinaryOp &&op, intrank_t root,
                              const team &team = world(),
                              Cx &&completions=operation_cx::as_future());
  template <typename T, typename BinaryOp, typename Cx=/*unspecified*/>
  RType reduce_all_nontrivial(T &&value, BinaryOp &&op,
                              const team &team = world(),
                              Cx &&completions=operation_cx::as_future());
Miscellaneous upcxx::experimental interfaces:
- utilities for reading environment variables:
  template<class T>
  T os_env(const std::string &name);
  template<class T>
  T os_env(const std::string &name, const T &otherwise);
  std::int64_t os_env(const std::string &name, const std::int64_t &otherwise,
                      std::size_t mem_size_multiplier);
  Example uses:
  int thread_per_rank = upcxx::experimental::os_env<int>("THREADS", 4);
  size_t szval = upcxx::experimental::os_env("SEGSZ", 128<<20, 1<<20); // default units = MB
- ostream-like class that prints to a stream with an optional prefix and as much atomicity as possible:
  class say {
  public:
    say(std::ostream &output, const char *prefix="[%d] ");
    say(const char *prefix="[%d] ");
    ~say();
    template<typename T>
    say& operator<<(T const &that);
  };
  Example use:
  upcxx::experimental::say() << "my value: " << d;
  Could result in output like this when run with three processes:
  [0] my value: 0
  [1] my value: 24.742
  [2] my value: 49.484
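One common way to get "as much atomicity as possible" from an ostream-like helper is to buffer the whole line and emit it with a single write when the temporary is destroyed. The sketch below illustrates that idea in plain C++. It is not the implementation of say: buffered_say is a hypothetical class, and it takes the rank as an explicit constructor argument instead of formatting a "[%d] " prefix itself.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for an ostream-like printer; not the library's class.
class buffered_say {
  std::ostream &out_;
  std::ostringstream buf_;
public:
  // ASSUMPTION: the rank is supplied by the caller.
  buffered_say(std::ostream &out, int rank) : out_(out) {
    assert(rank >= 0);
    buf_ << "[" << rank << "] ";
  }
  // Emit the whole line with one write so that concurrent writers are less
  // likely to interleave within a line.
  ~buffered_say() {
    buf_ << '\n';
    const std::string line = buf_.str();
    out_.write(line.data(), static_cast<std::streamsize>(line.size()));
    out_.flush();
  }
  template <typename T>
  buffered_say &operator<<(const T &value) {
    buf_ << value;   // accumulate; nothing reaches out_ yet
    return *this;
  }
};
```

A statement such as buffered_say(std::cout, 1) << "my value: " << 24.742; creates a temporary that is destroyed at the end of the full expression, which is when the buffered line is written.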
upcxx::experimental interfaces for "immediate-mode" injection of RPCs.
rpc_ff_immediate() and rpc_immediate() calls accept the same arguments as
the corresponding non-immediate calls (rpc_ff and rpc, respectively),
and have exactly the same semantics under conditions of low network congestion.
However, when an injection attempt detects that network congestion is likely to cause
the initiating thread to be blocked inside the call (stalling due to
constrained network resources), the immediate-mode RPC calls will instead
cancel the injection attempt by throwing upcxx::experimental::network_busy.
template <typename Func, typename ...Args>
void rpc_ff_immediate(intrank_t recipient,
Func &&func, Args &&...args);
template <typename Cx, typename Func, typename ...Args>
RType rpc_ff_immediate(intrank_t recipient,
Cx &&completions,
Func &&func, Args &&...args);
template <typename Func, typename ...Args>
void rpc_ff_immediate(const team &team, intrank_t recipient,
Func &&func, Args &&...args);
template <typename Cx, typename Func, typename ...Args>
RType rpc_ff_immediate(const team &team, intrank_t recipient,
Cx &&completions,
Func &&func, Args &&...args);
template <typename Func, typename ...Args>
RType rpc_immediate(intrank_t recipient,
Func &&func, Args &&...args);
template <typename Cx, typename Func, typename ...Args>
RType rpc_immediate(intrank_t recipient,
Cx &&completions,
Func &&func, Args &&...args);
template <typename Func, typename ...Args>
RType rpc_immediate(const team &team, intrank_t recipient,
Func &&func, Args &&...args);
template <typename Cx, typename Func, typename ...Args>
RType rpc_immediate(const team &team, intrank_t recipient,
Cx &&completions,
Func &&func, Args &&...args);
Exceptions:
- May throw upcxx::experimental::network_busy on the calling thread (at the initiating process) under implementation-defined conditions. The ordering of any such exception throw with respect to argument serialization is unspecified. However, a call throwing such an exception shall not deliver any event notifications, nor shall it lead to invocation of the function object.
- As with non-immediate RPC, calls may also throw any of the usual exceptions thrown from RPC.
For discussion of this enhancement and experimental results, consult:
- Paul H. Hargrove, Dan Bonachea.
"Investigation into the Performance Benefits of Exposing Network Backpressure in UPC++ and GASNet-EX",
Lawrence Berkeley National Laboratory Technical Report (LBNL-2001668), May 2025.
https://doi.org/10.25344/S4088R
Current caveats:
-
Avoidance of injection-time blocking is "best effort" and not guaranteed, even when using immediate-mode injection. Whether any given injection call actually blocks at injection time depends on details of the network stack and dynamic system state.
-
Currently immediate-mode behavior is only enabled for RPC payloads small enough to use the eager-mode RPC algorithm (under a tunable threshold with a system-dependent default). Injection of larger RPC payloads may still block due to network congestion.
-
Acknowledgments for round-trip RPC never use immediate-mode injection, and might cause an injection stall on the master persona of the target process.
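Because a call that throws network_busy delivers no notifications and never runs the function object, the caller can retry it later without risking a duplicate. A minimal sketch of such a retry loop follows. inject_with_retry is a hypothetical helper, not a UPC++ interface, and it is written as a template over the exception type so that it compiles without UPC++. With UPC++ one would presumably instantiate it with upcxx::experimental::network_busy, pass a lambda that calls rpc_ff_immediate() or rpc_immediate(), and pass a lambda that calls upcxx::progress() or does other local work.

```cpp
#include <cassert>

// Try `inject` up to `max_attempts` times. When it throws `Busy`, run
// `make_progress` and try again. Returns true once an attempt succeeds.
template <typename Busy, typename Inject, typename Progress>
bool inject_with_retry(Inject inject, Progress make_progress, int max_attempts) {
  assert(max_attempts > 0);
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    try {
      inject();          // e.g. an immediate-mode RPC injection call
      return true;
    } catch (const Busy &) {
      make_progress();   // e.g. advance progress or do other useful work
    }
  }
  return false;          // still congested; the caller decides what to do next
}
```

Exceptions of any other type propagate to the caller unchanged, so the usual RPC exceptions are not swallowed by the loop.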
In addition, the implementation provides the following unspecified, experimental macro:
- upcxx_experimental_memberof_unsafe is a variant of upcxx_memberof that can be used on a type T that is either standard-layout (in which case the equivalent, specified upcxx_memberof should be preferred), or for which the compiler conditionally supports offsetof:
  // Macro: function template syntax used for clarity
  template<typename T, memory_kind Kind>
  global_ptr<MType, Kind> upcxx_experimental_memberof_unsafe(
      global_ptr<T, Kind> ptr, member-designator MEMBER )
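The role of offsetof can be illustrated on ordinary local pointers. The sketch below is an analogy, not the macro's implementation: particle, SKETCH_MEMBER_ADDR and read_id are hypothetical names, and it operates on raw T* values where the real macro operates on global_ptr. It shows why a macro is needed (the member designator is not an expression) and how a member address follows from the object address plus a byte offset.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical standard-layout record, used only for illustration.
struct particle {
  double position[3];
  int    id;
};

// Turn a pointer to the enclosing object into a pointer to one member by
// adding the member's byte offset. MEMBER is a member designator, which is
// why this has to be a macro rather than a function.
#define SKETCH_MEMBER_ADDR(T, ptr, MEMBER)                                \
  (reinterpret_cast<decltype(&(ptr)->MEMBER)>(                            \
      reinterpret_cast<char *>(ptr) + offsetof(T, MEMBER)))

// Example use: read the `id` field through the computed member address.
inline int read_id(particle *p) {
  assert(p != nullptr);
  return *SKETCH_MEMBER_ADDR(particle, p, id);
}
```

For a standard-layout type such as particle, offsetof is unconditionally supported; for other types the C++ standard makes its use conditionally supported, which is the distinction the text above draws.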
Aside from upcxx::experimental, all other namespaces nested inside of upcxx
are intended solely for internal use by the implementation (e.g. upcxx::backend,
upcxx::detail). Similarly, all identifiers with the UPCXXI or upcxxi
prefix are intended solely for internal use by the implementation.
The behavior and existence of all such interfaces and identifiers is subject
to change without notice, and as such their use in user code is STRONGLY discouraged.
The UPC++ v1.0 Specification is the canonical authoritative document that specifies all the required and guaranteed behaviors of the UPC++ interface. Users are strongly advised to rely solely on features and behaviors specified by that document, or implementation-defined behaviors outlined in the other sections of this document.
The "seq" build of libupcxx is performance-optimized for single-threaded processes, or for a model where only a single thread per process will ever be invoking interprocess communication via UPC++. The performance gains with respect to the "par" build stem from the removal of internal synchronization (mutexes, atomic memory ops) within the UPC++ runtime. Affected UPC++ routines will be observed to have lower overhead than their "par" counterparts.
Whereas "par-mode" libupcxx permits the full generality of the UPC++ specification with respect to multi-threading concerns, "seq" imposes these additional restrictions on the client application:
- Only the thread which invokes upcxx::init() may ever hold the master persona. This thread is regarded as the "primordial" thread.
- Any UPC++ routine with internal or user-progress (typically inter-process communication, e.g. upcxx::rput/rget/rpc/...) must be called from the primordial thread with the master persona at the top of the active persona stack. There are some routines which are excepted from this restriction and are listed below.
- Shared-heap allocation/deallocation (e.g. upcxx::allocate/deallocate/new_/new_array/delete_/delete_array) must be called from the primordial thread while holding the master persona. The same applies to device_allocator functions that manipulate a device heap.
Note that these restrictions must be respected by all object files linked into the final executable, as they are all sharing the same libupcxx.
Types of communication that do not experience restriction:
- Sending LPCs via upcxx::persona::lpc() or <completion>_cx::as_lpc() has no added restriction.
- upcxx::progress() and upcxx::future::wait() have no added restriction. Incoming RPCs are only processed if progress is called from the primordial thread while it has the master persona.
- Upcasting/downcasting shared heap memory (e.g. global_ptr::local()) is always OK. This facilitates a kind of interprocess communication via load/store CPU shared memory access which is permitted in "seq". Note that upcxx::rput/rget is still invalid from non-primordial threads even when the remote memory is downcastable locally.
The legality of lpc and progress from the non-primordial thread permits users to orchestrate their own "funneling" strategy, e.g.:
// How a non-primordial thread can tell the master persona to put an rpc on the
// wire on its behalf.
upcxx::master_persona().lpc_ff([=]() {
upcxx::rpc_ff(99, [=]() { std::cout << "Initiated from far away."; });
});
UPC++ specifies that processes who are members of upcxx::local_team() have
the ability to obtain valid "raw" C++ pointers (i.e. T*) referencing
shared objects allocated by team members (specifically, global_ptr::is_local()
is guaranteed to return true for such objects). In practice, this generally means
these processes must be co-located on the same compute node, defined as a
set of CPU resources sharing an OS image and coherent physical memory domain.
UPC++ computes upcxx::local_team() membership at startup by examining the
job layout of processes across physical nodes. By default, UPC++ attempts to
maximize the size of each local team to encompass all processes co-resident
on the same compute node (this strategy can be adjusted via GASNet environment
variables, but the default is strongly recommended).
The algorithm used to construct upcxx::local_team() membership additionally
ensures the following invariant:
- Processes within a single local team always have consecutive rank indexes in upcxx::world().
- More formally, for all I in [0, local_team().rank_n() - 1), local_team()[I+1] == local_team()[I] + 1
This invariant is not currently required by the UPC++ specification, but it is maintained by all versions of the LBNL UPC++ v1.0 implementation.
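The invariant can be restated as a small check over a list of world ranks, together with the shortcut it enables. Both functions below are hypothetical helpers written in plain C++ so they can be read without UPC++. In a UPC++ program the vector would presumably be filled with local_team()[I] for each I in [0, local_team().rank_n()).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True when world_ranks[i+1] == world_ranks[i] + 1 for every valid i,
// i.e. the listed processes occupy one contiguous range of world ranks.
inline bool ranks_are_consecutive(const std::vector<int> &world_ranks) {
  for (std::size_t i = 0; i + 1 < world_ranks.size(); ++i)
    if (world_ranks[i + 1] != world_ranks[i] + 1) return false;
  return true;
}

// Under the invariant, the world rank of local peer `i` follows from the
// world rank of local peer 0 by simple addition.
inline int world_rank_of_local_peer(int first_world_rank, int i, int local_rank_n) {
  assert(i >= 0 && i < local_rank_n);
  return first_world_rank + i;
}
```

Code that relies on this shortcut depends on implementation-defined behavior, so a debug-time check along the lines of ranks_are_consecutive may be a reasonable safeguard.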