Skip to content

[SYCL][Doc] extension proposal and Impl for ext_oneapi_P2P. #4332

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
121 changes: 121 additions & 0 deletions sycl/doc/extensions/InfoPlatformP2P/SYCL_ext_oneapi_P2P.asciidoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,121 @@
= SYCL_ext_oneapi_P2P

:source-highlighter: coderay
:coderay-linenums-mode: table

// This section needs to be after the document title.
:doctype: book
:toc2:
:toc: left
:encoding: utf-8
:lang: en

:blank: pass:[ +]

// Set the default source code type in this document to C++,
// for syntax highlighting purposes. This is needed because
// docbook uses c++ and html5 uses cpp.
:language: {basebackend@docbook:c++:cpp}

// This is necessary for asciidoc, but not for asciidoctor
:cpp: C++

== Introduction
IMPORTANT: This specification is a draft.

NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are
trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc.
used by permission by Khronos.

NOTE: This document is better viewed when rendered as html with asciidoctor.
GitHub does not render image icons.

This document describes an extension to add a boolean platform information descriptor which returns true if devices within the platform are capable of Peer to Peer memory copies.

== Name Strings

+= SYCL_ext_oneapi_P2P

== Status

Working Draft

This is a preview extension specification, intended to provide early access to
a feature for review and community feedback. When the feature matures, this
specification may be released as a formal extension.

Because the interfaces defined by this specification are not final and are
subject to change they are not intended to be used by shipping software
products.

== Version

Built On: {docdate}

== Contact
Jack Kirk, Codeplay (jack 'dot' kirk 'at' codeplay 'dot' com)

== Dependencies

This extension is written against the SYCL 2020 specification, Revision 3.

== Overview

Some vendors provide functionality for direct memory copies between peer devices. This extension adds a platform information descriptor, 'ext_oneapi_P2P', which is used to determine whether or not devices within a given platform are suitable for direct Peer to Peer memory copies. For some vendors, driver APIs that perform Peer to Peer memory copies are available for the full set of devices supported (e.g. sm50 and above for cuda). For the cuda driver API, in the case that a Peer to Peer copy is not available due to a limitation in the connection topology between the devices, then the functions performing the Peer to Peer copy, such as 'cuMemcpyPeer', instead perform a Peer copy via the Host. For these reasons we propose adding a platform information descriptor which indicates whether such Peer to Peer memory copy operations may be arbitrarily performed using any platform member device, acting as a source or destination, that share the same backend as a peer device, acting correspondingly as a destination or source.
For some backends, such as cuda, currently a single SYCL context corresponds with a single backend context, and a backend context may not be shared between devices. For such cases Peer to Peer memory copy may only currently occur between devices that do not share a SYCL context. It is therefore important to account for this use case.
The platform information descriptor, 'ext_oneapi_P2P', may be used as part of the runtime in order to determine whether it is possible to directly copy buffer/image memory between devices that are part of different contexts. A check can be made to ensure that both the source and destination contexts share the same backend. Then, if one of either the source or destination platforms return true for the 'info::platform::ext_oneapi_P2P' query, the direct Peer to Peer memory is considered legal. Other backends such as hip/ROCm have similar APIs to cuda for Peer to Peer copies and also use a single context per device. This consideration is one reason that we considered it more appropriate to add the P2P platform information descriptor, rather than account for a particular case (cuda) in the runtime explicitly.
An alternative to using a platform information descriptor for the P2P memory copy property would be to use a device information descriptor. However, choosing an information descriptor at the device level instead of the platform level would imply that a P2P memory copy between devices will depend upon the particular device capability level in addition to the device vendor. This may be a useful information descriptor in the future, but for the current requirements it would be unnecessary.

== Extension of SYCL 2020 Specification, Revision 3

=== Extension of Section 4.6.2.2. Platform information descriptors

==== Add Platform descriptor 'ext_oneapi_P2P' to table 18.

Add row `info::platform::ext_oneapi_P2P`:

[width="40%",frame="topbot",options="header,footer"]
|======================
|Platform descriptors |Return type |Description
|info::platform::ext_oneapi_P2P | bool| Returns whether the platform supports Peer to Peer memory copies
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make this check a more fine-grained, i.e. check if 2 given devices have P2P capabilities? I think it may happen that devices from the same platform have no P2P link, or devices from different platform (of the same backend) do have P2P. Here is how similar query is defined in Level-Zero API: https://spec.oneapi.io/level-zero/latest/core/api.html?highlight=zedevicegetp2pproperties#_CPPv424zeDeviceGetP2PProperties18ze_device_handle_t18ze_device_handle_tP26ze_device_p2p_properties_t

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the corresponding cuda check would use cuDeviceCanAccessPeer(int* canAccessPeer, CUdevice dev, CUdevice peerDev) which is summarized as 'Queries if a device may directly access a peer device's memory. '

However for our current purposes this finegrained binary device check isn't what we need. A more complete description of the P2P descriptor as it is intended to be used within the runtime is:
'P2P' descriptor:
'Indicates whether the backend used by the platform/context has an API which can perform P2P memory copies if the device topology allows; in the case that a pair of devices using the backend are not directly connected via PCIe or otherwise the API is lowered to a pair of memory copies, the first of which copies from the first device to host, the second of which copies from the host to the second device'

For the cuda API comprising cuMemcpyPeerAsync/cuMemcpy3DPeerAsync, in the case that P2P is not available and the second route is taken, the memory copy can be a few times faster using the API compared to the currently existing case whereby two separate sycl events are created corresponding to two separate cuda memcpy function calls (for P2H and H2P) directly. For small array sizes there is little different in speed whether the P2H H2P copy is controlled by the P2P API or called directly by the runtime. Quantitatively the speedup appears to depend on whether there is a memory copy between separate devices (peers) or whether there is a move between contexts on a single device (which also requires calling cuMemcpyPeerAsync etc or making an explicit D2H H2D copy pair), but qualitatively the behaviour is the same so that for all use cases calling the cu Peer memcpy API is preferred.

Apologies, I should have given a more thorough description originally. In any case I agree that the descriptor 'P2P' as defined above is probably of limited use to the user and could be confusing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whether the backend used by the platform/context has an API which can perform P2P memory copies if the device topology allows

So, is the answer is "true, backend has P2P capabilities", then you still need to check P2P capabilities of each individual pair of devices involved. Why not just have the device-level query, which would return "false" for any devices under a backend not having P2P API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romanovvlad @smaslov-intel @steffenlarsen @AerialMantis
So for the cuda backend I don't think it is necessary to explicitly check P2P capabilities of each individual pair of devices involved. The way I have implemented it initially is to call cuCtxEnablePeerAccess which as described in the docs will return an error that can be silently handled if cuDeviceCanAccessPeer returns false (meaning that the connection topology does not allow P2P access). Then whether cuDeviceCanAccessPeer is either true or false the cuda driver API Peer copy functions can be called (if cuDeviceCanAccessPeer is false it will do P2H H2P and if true there will be a single P2P copy). Since the HIP/ROCm API is very similar to the cuda API we think that it may possibly behave similarly, but we have not confirmed this yet. However, if Level Zero does require a device-level query then it makes sense to have this, even if the CUDA backend doesn't check at this granularity.
There could be a binary PI query which takes the two devices as arguments, returns true in the case that both devices use the cuda backend, does a binary device check for e.g. level zero using zeDeviceGetP2PProperties or similar if such a check is required. This wouldn't be a big problem and it sounds like this is what will be needed. Although I think there is a small but non-trivial technical issue with such a PI query taking two devices as arguments that is worth mentioning:

The binary device P2P query would be first needed in graph_builder::addCG. In addition to the Source device the query would need to access the Destination device which I think can only be immediately accessed via the 'Record' variable. The destination queue/device can be found from Record in the same way as is done in graph_builder::insertMemoryMove. Perhaps there is a cleaner way but I don't see it currently.
The bigger problem is that there is a kind of 'redundancy check' that adds a connection command if the memory is not in the correct context in Command::processDepEvent that is triggered when the graph_builder::addCG is adjusted to make a direct P2P memory copy instead of the D2H H2D route. There would need to be another binary device query at this point in the code and currently I don't believe there is any immediate way to access the source device via e.g. the DepEvent instance in Command::processDepEvent which cannot uniquely identify the source device when there is more than one device associated with the 'source' context.
One question perhaps someone can answer is what use case this 'redundancy check ' is for? Removing it does not lead to any failed tests using the cuda backend. My current understanding is that any case where memory is not in the correct context is already handled in graph_builder::addCG (The only case I know of is when a second queue submits a command group using memory located on a device of a primary queue that has a different context). The question is therefore whether there is any circumstance that Command::processDepEvent is called with the following if clause returning true, then creating the connection command, without the Command instance being created from graph_builder::insertMemoryMove via graph_builder::addCG

  if (DepEventContext != WorkerContext && !WorkerContext->is_host()) {
    Scheduler::GraphBuilder &GB = Scheduler::getInstance().MGraphBuilder;
    ConnectionCmd = GB.connectDepEvent(this, DepEvent, Dep);
  }

The other important point mentioned is that it is currently undesirable that only a single cuda (I believe also ROCm) device/context (cuda uses only one device per context) can be held by a single sycl context, whereas for OpenCL it is normal that only a single SYCL context is used for all devices. See here for a full description and resolution discussion of this issue. It sounds like it is a sizeable technical challenge to change the runtime and cuda PI so that a single SYCL context can hold more than one cuda context/device, so that at least for the short term the only means of using multiple cuda devices is via multiple SYCL contexts.

How would you recommend we proceed with this? I am assuming that inter-context memory copy would still be supported in the case that all backends use a single SYCL context for several devices (at least via the host): presumably it would be necessary in the case that more than one backend is used simultaneously, for memory copy between devices with different backends (and different SYCL contexts). If inter-context memory copy would then still be supported for devices sharing the same backend then I think it would make sense to make use of the faster Peer to Peer copies for the cases where this is supported.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best course of action would be to scrap the extension and focus on the PI/runtime side of this.

The problem of having multiple CUDA contexts in a single SYCL context is an enormous can of worms and though related I would argue it's not the main focus of the changes you are introducing.

If the plan is to implement this in tandem with a P2P copy operation in PI, I suggest you implement another PI function for checking if two devices can do P2P copy from each other. Having the finer granularity potentially helps other backends. For the CUDA backend this query could always return true as the P2P copy in the CUDA driver API is able to make P2P copies even when the devices aren't interconnected. For other backends you can just leave them returning false (with a note to implement them correctly in the future). If the new P2P support query returns true you can then utilize the associated P2P copy functions, and if the query returns false then make the copy as a round-trip through the host, which I believe is the current implementation.

In short, I suggest you close this PR and implement the P2P support query as a new PI function that checks if the backend can P2P copy between two devices (or contexts?). This does not need an extension document. Then open a new PR with this, potentially together with the P2P copy function (and it being used by the runtime if you'd like.)

If you have any questions along the way, feel free to reach out or open a discussion topic here on Github.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK thanks. The new PR is here: #4401. I will not close this PR.

|======================

== Example Usage

This non-normative section shows some example usages of the extension.

[source,c++]
----
bool P2P = plt.get_info<sycl::info::platform::ext_oneapi_P2P>();
if (P2P)
std::cout << "P2P memory copies supported\n";
else
std::cout << "P2P memory copies not supported\n";
----

The example above calls the get_info method of the sycl platform instance, 'plt', using the new info::platform::ext_oneapi_P2P descriptor as template parameter. The return value is a boolean which determines whether P2P memory copies are supported by the platform, 'plt'.

== Issues

None.

== Revision History

example usage:

[cols="5,15,15,70"]
[grid="rows"]
[options="header"]
|========================================
|Rev|Date|Author|Changes
|1|2021-08-13|Jack Kirk|*Initial public working draft*
|========================================

//************************************************************************
//Other formatting suggestions:
//
//* Use *bold* text for host APIs, or [source] syntax highlighting.
//* Use +mono+ text for device APIs, or [source] syntax highlighting.
//* Use +mono+ text for extension names, types, or enum values.
//* Use _italics_ for parameters.
//************************************************************************
3 changes: 2 additions & 1 deletion sycl/include/CL/sycl/detail/pi.h
Original file line number Diff line number Diff line change
Expand Up @@ -126,7 +126,8 @@ typedef enum {
PI_PLATFORM_INFO_NAME = CL_PLATFORM_NAME,
PI_PLATFORM_INFO_PROFILE = CL_PLATFORM_PROFILE,
PI_PLATFORM_INFO_VENDOR = CL_PLATFORM_VENDOR,
PI_PLATFORM_INFO_VERSION = CL_PLATFORM_VERSION
PI_PLATFORM_INFO_VERSION = CL_PLATFORM_VERSION,
PI_PLATFORM_INFO_ext_oneapi_P2P = 0x40110
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not completely sure whether 'ext_oneapi_P2P' is the preferred name for such an extension. I imagine that 'P2P' is not favoured since it does not indicate that it is an extension. I can also use the capitalization, e.g. 'EXT_ONEAPI_P2P' once the preferred name is confirmed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've added many device info extensions already, see

// These are Intel-specific extensions.

No naming convention, just use all capital case. The important part is to pick a value that isn't used by anything else already.

} _pi_platform_info;

typedef enum {
Expand Down
11 changes: 6 additions & 5 deletions sycl/include/CL/sycl/info/info_desc.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -25,11 +25,12 @@ namespace info {
// Information descriptors
// A.1 Platform information descriptors
enum class platform {
profile = PI_PLATFORM_INFO_PROFILE,
version = PI_PLATFORM_INFO_VERSION,
name = PI_PLATFORM_INFO_NAME,
vendor = PI_PLATFORM_INFO_VENDOR,
extensions = PI_PLATFORM_INFO_EXTENSIONS,
profile = PI_PLATFORM_INFO_PROFILE,
version = PI_PLATFORM_INFO_VERSION,
name = PI_PLATFORM_INFO_NAME,
vendor = PI_PLATFORM_INFO_VENDOR,
extensions = PI_PLATFORM_INFO_EXTENSIONS,
ext_oneapi_P2P = PI_PLATFORM_INFO_ext_oneapi_P2P,
};

// A.2 Context information desctiptors
Expand Down
1 change: 1 addition & 0 deletions sycl/include/CL/sycl/info/platform_traits.def
Original file line number Diff line number Diff line change
Expand Up @@ -3,3 +3,4 @@ __SYCL_PARAM_TRAITS_SPEC(platform, version, std::string)
__SYCL_PARAM_TRAITS_SPEC(platform, name, std::string)
__SYCL_PARAM_TRAITS_SPEC(platform, vendor, std::string)
__SYCL_PARAM_TRAITS_SPEC(platform, extensions, std::vector<std::string>)
__SYCL_PARAM_TRAITS_SPEC(platform, ext_oneapi_P2P, bool)
2 changes: 2 additions & 0 deletions sycl/source/detail/pi.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -201,6 +201,8 @@ std::string platformInfoToString(pi_platform_info info) {
return "PI_PLATFORM_INFO_VENDOR";
case PI_PLATFORM_INFO_EXTENSIONS:
return "PI_PLATFORM_INFO_EXTENSIONS";
case PI_PLATFORM_INFO_ext_oneapi_P2P:
return "PI_PLATFORM_INFO_ext_oneapi_P2P";
}
die("Unknown pi_platform_info value passed to "
"cl::sycl::detail::pi::platformInfoToString");
Expand Down
16 changes: 16 additions & 0 deletions sycl/source/detail/platform_info.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -49,11 +49,27 @@ struct get_platform_info<std::vector<std::string>, info::platform::extensions> {
}
};

template <> struct get_platform_info<bool, info::platform::ext_oneapi_P2P> {
static bool get(RT::PiPlatform plt, const plugin &Plugin) {

std::string vendor_name =
get_platform_info<string_class, info::platform::vendor>::get(plt,
Plugin);
bool result = (vendor_name == "NVIDIA Corporation") ? true : false;
return result;
Comment on lines +55 to +59
Copy link
Contributor

@steffenlarsen steffenlarsen Aug 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behavior should be handled by the backends, i.e.

Suggested change
std::string vendor_name =
get_platform_info<string_class, info::platform::vendor>::get(plt,
Plugin);
bool result = (vendor_name == "NVIDIA Corporation") ? true : false;
return result;
bool SupportsP2P = false;
Plugin.call<PiApiKind::piPlatformGetInfo>(
plt, pi::cast<pi_platform_info>(info::platform::ext_oneapi_P2P), sizeof(bool), &SupportsP2P,
nullptr);
return SupportsP2P;

and then implement cases for PI_PLATFORM_INFO_ext_oneapi_P2P in piPlatformGetInfo.

}
};

// Host platform information methods
template <info::platform param>
inline typename info::param_traits<info::platform, param>::return_type
get_platform_info_host() = delete;

template <>
inline bool get_platform_info_host<info::platform::ext_oneapi_P2P>() {
return false;
}

template <>
inline std::string get_platform_info_host<info::platform::profile>() {
return "FULL PROFILE";
Expand Down