-
Notifications
You must be signed in to change notification settings - Fork 772
[SYCL][Doc] extension proposal and Impl for ext_oneapi_P2P. #4332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
de49f37
7966442
d0ef95f
1dbc791
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,121 @@ | ||
= SYCL_ext_oneapi_P2P | ||
|
||
:source-highlighter: coderay | ||
:coderay-linenums-mode: table | ||
|
||
// This section needs to be after the document title. | ||
:doctype: book | ||
:toc2: | ||
:toc: left | ||
:encoding: utf-8 | ||
:lang: en | ||
|
||
:blank: pass:[ +] | ||
|
||
// Set the default source code type in this document to C++, | ||
// for syntax highlighting purposes. This is needed because | ||
// docbook uses c++ and html5 uses cpp. | ||
:language: {basebackend@docbook:c++:cpp} | ||
|
||
// This is necessary for asciidoc, but not for asciidoctor | ||
:cpp: C++ | ||
|
||
== Introduction | ||
IMPORTANT: This specification is a draft. | ||
|
||
NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are | ||
trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. | ||
used by permission by Khronos. | ||
|
||
NOTE: This document is better viewed when rendered as html with asciidoctor. | ||
GitHub does not render image icons. | ||
|
||
This document describes an extension to add a boolean platform information descriptor which returns true if devices within the platform are capable of Peer to Peer memory copies. | ||
|
||
== Name Strings | ||
|
||
+= SYCL_ext_oneapi_P2P | ||
|
||
== Status | ||
|
||
Working Draft | ||
|
||
This is a preview extension specification, intended to provide early access to | ||
a feature for review and community feedback. When the feature matures, this | ||
specification may be released as a formal extension. | ||
|
||
Because the interfaces defined by this specification are not final and are | ||
subject to change they are not intended to be used by shipping software | ||
products. | ||
|
||
== Version | ||
|
||
Built On: {docdate} | ||
|
||
== Contact | ||
Jack Kirk, Codeplay (jack 'dot' kirk 'at' codeplay 'dot' com) | ||
|
||
== Dependencies | ||
|
||
This extension is written against the SYCL 2020 specification, Revision 3. | ||
|
||
== Overview | ||
|
||
Some vendors provide functionality for direct memory copies between peer devices. This extension adds a platform information descriptor, 'ext_oneapi_P2P', which is used to determine whether or not devices within a given platform are suitable for direct Peer to Peer memory copies. For some vendors, driver APIs that perform Peer to Peer memory copies are available for the full set of devices supported (e.g. sm50 and above for cuda). For the cuda driver API, in the case that a Peer to Peer copy is not available due to a limitation in the connection topology between the devices, then the functions performing the Peer to Peer copy, such as 'cuMemcpyPeer', instead perform a Peer copy via the Host. For these reasons we propose adding a platform information descriptor which indicates whether such Peer to Peer memory copy operations may be arbitrarily performed using any platform member device, acting as a source or destination, that share the same backend as a peer device, acting correspondingly as a destination or source. | ||
For some backends, such as cuda, currently a single SYCL context corresponds with a single backend context, and a backend context may not be shared between devices. For such cases Peer to Peer memory copy may only currently occur between devices that do not share a SYCL context. It is therefore important to account for this use case. | ||
The platform information descriptor, 'ext_oneapi_P2P', may be used as part of the runtime in order to determine whether it is possible to directly copy buffer/image memory between devices that are part of different contexts. A check can be made to ensure that both the source and destination contexts share the same backend. Then, if one of either the source or destination platforms return true for the 'info::platform::ext_oneapi_P2P' query, the direct Peer to Peer memory is considered legal. Other backends such as hip/ROCm have similar APIs to cuda for Peer to Peer copies and also use a single context per device. This consideration is one reason that we considered it more appropriate to add the P2P platform information descriptor, rather than account for a particular case (cuda) in the runtime explicitly. | ||
An alternative to using a platform information descriptor for the P2P memory copy property would be to use a device information descriptor. However, choosing an information descriptor at the device level instead of the platform level would imply that a P2P memory copy between devices will depend upon the particular device capability level in addition to the device vendor. This may be a useful information descriptor in the future, but for the current requirements it would be unnecessary. | ||
|
||
== Extension of SYCL 2020 Specification, Revision 3 | ||
|
||
=== Extension of Section 4.6.2.2. Platform information descriptors | ||
|
||
==== Add Platform descriptor 'ext_oneapi_P2P' to table 18. | ||
|
||
Add row `info::platform::ext_oneapi_P2P`: | ||
|
||
[width="40%",frame="topbot",options="header,footer"] | ||
|====================== | ||
|Platform descriptors |Return type |Description | ||
|info::platform::ext_oneapi_P2P | bool| Returns whether the platform supports Peer to Peer memory copies | ||
|====================== | ||
|
||
== Example Usage | ||
|
||
This non-normative section shows some example usages of the extension. | ||
|
||
[source,c++] | ||
---- | ||
bool P2P = plt.get_info<sycl::info::platform::ext_oneapi_P2P>(); | ||
if (P2P) | ||
std::cout << "P2P memory copies supported\n"; | ||
else | ||
std::cout << "P2P memory copies not supported\n"; | ||
---- | ||
|
||
The example above calls the get_info method of the sycl platform instance, 'plt', using the new info::platform::ext_oneapi_P2P descriptor as template parameter. The return value is a boolean which determines whether P2P memory copies are supported by the platform, 'plt'. | ||
|
||
== Issues | ||
|
||
None. | ||
|
||
== Revision History | ||
|
||
example usage: | ||
|
||
[cols="5,15,15,70"] | ||
[grid="rows"] | ||
[options="header"] | ||
|======================================== | ||
|Rev|Date|Author|Changes | ||
|1|2021-08-13|Jack Kirk|*Initial public working draft* | ||
|======================================== | ||
|
||
//************************************************************************ | ||
//Other formatting suggestions: | ||
// | ||
//* Use *bold* text for host APIs, or [source] syntax highlighting. | ||
//* Use +mono+ text for device APIs, or [source] syntax highlighting. | ||
//* Use +mono+ text for extension names, types, or enum values. | ||
//* Use _italics_ for parameters. | ||
//************************************************************************ |
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -126,7 +126,8 @@ typedef enum { | |||
PI_PLATFORM_INFO_NAME = CL_PLATFORM_NAME, | ||||
PI_PLATFORM_INFO_PROFILE = CL_PLATFORM_PROFILE, | ||||
PI_PLATFORM_INFO_VENDOR = CL_PLATFORM_VENDOR, | ||||
PI_PLATFORM_INFO_VERSION = CL_PLATFORM_VERSION | ||||
PI_PLATFORM_INFO_VERSION = CL_PLATFORM_VERSION, | ||||
PI_PLATFORM_INFO_ext_oneapi_P2P = 0x40110 | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not completely sure whether 'ext_oneapi_P2P' is the preferred name for such an extension. I imagine that 'P2P' is not favoured since it does not indicate that it is an extension. I can also use the capitalization, e.g. 'EXT_ONEAPI_P2P' once the preferred name is confirmed. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We've added many device info extensions already, see llvm/sycl/include/CL/sycl/detail/pi.h Line 288 in 4af2eb5
No naming convention, just use all capital case. The important part is to pick a value that isn't used by anything else already. |
||||
} _pi_platform_info; | ||||
|
||||
typedef enum { | ||||
|
Original file line number | Diff line number | Diff line change | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -49,11 +49,27 @@ struct get_platform_info<std::vector<std::string>, info::platform::extensions> { | |||||||||||||||||||||
} | ||||||||||||||||||||||
}; | ||||||||||||||||||||||
|
||||||||||||||||||||||
template <> struct get_platform_info<bool, info::platform::ext_oneapi_P2P> { | ||||||||||||||||||||||
static bool get(RT::PiPlatform plt, const plugin &Plugin) { | ||||||||||||||||||||||
|
||||||||||||||||||||||
std::string vendor_name = | ||||||||||||||||||||||
get_platform_info<string_class, info::platform::vendor>::get(plt, | ||||||||||||||||||||||
Plugin); | ||||||||||||||||||||||
bool result = (vendor_name == "NVIDIA Corporation") ? true : false; | ||||||||||||||||||||||
return result; | ||||||||||||||||||||||
Comment on lines
+55
to
+59
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This behavior should be handled by the backends, i.e.
Suggested change
and then implement cases for |
||||||||||||||||||||||
} | ||||||||||||||||||||||
}; | ||||||||||||||||||||||
|
||||||||||||||||||||||
// Host platform information methods | ||||||||||||||||||||||
template <info::platform param> | ||||||||||||||||||||||
inline typename info::param_traits<info::platform, param>::return_type | ||||||||||||||||||||||
get_platform_info_host() = delete; | ||||||||||||||||||||||
|
||||||||||||||||||||||
template <> | ||||||||||||||||||||||
inline bool get_platform_info_host<info::platform::ext_oneapi_P2P>() { | ||||||||||||||||||||||
return false; | ||||||||||||||||||||||
} | ||||||||||||||||||||||
|
||||||||||||||||||||||
template <> | ||||||||||||||||||||||
inline std::string get_platform_info_host<info::platform::profile>() { | ||||||||||||||||||||||
return "FULL PROFILE"; | ||||||||||||||||||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make this check a more fine-grained, i.e. check if 2 given devices have P2P capabilities? I think it may happen that devices from the same platform have no P2P link, or devices from different platform (of the same backend) do have P2P. Here is how similar query is defined in Level-Zero API: https://spec.oneapi.io/level-zero/latest/core/api.html?highlight=zedevicegetp2pproperties#_CPPv424zeDeviceGetP2PProperties18ze_device_handle_t18ze_device_handle_tP26ze_device_p2p_properties_t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the corresponding cuda check would use cuDeviceCanAccessPeer(int* canAccessPeer, CUdevice dev, CUdevice peerDev) which is summarized as 'Queries if a device may directly access a peer device's memory. '
However for our current purposes this finegrained binary device check isn't what we need. A more complete description of the P2P descriptor as it is intended to be used within the runtime is:
'P2P' descriptor:
'Indicates whether the backend used by the platform/context has an API which can perform P2P memory copies if the device topology allows; in the case that a pair of devices using the backend are not directly connected via PCIe or otherwise the API is lowered to a pair of memory copies, the first of which copies from the first device to host, the second of which copies from the host to the second device'
For the cuda API comprising cuMemcpyPeerAsync/cuMemcpy3DPeerAsync, in the case that P2P is not available and the second route is taken, the memory copy can be a few times faster using the API compared to the currently existing case whereby two separate sycl events are created corresponding to two separate cuda memcpy function calls (for P2H and H2P) directly. For small array sizes there is little different in speed whether the P2H H2P copy is controlled by the P2P API or called directly by the runtime. Quantitatively the speedup appears to depend on whether there is a memory copy between separate devices (peers) or whether there is a move between contexts on a single device (which also requires calling cuMemcpyPeerAsync etc or making an explicit D2H H2D copy pair), but qualitatively the behaviour is the same so that for all use cases calling the cu Peer memcpy API is preferred.
Apologies, I should have given a more thorough description originally. In any case I agree that the descriptor 'P2P' as defined above is probably of limited use to the user and could be confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, is the answer is "true, backend has P2P capabilities", then you still need to check P2P capabilities of each individual pair of devices involved. Why not just have the device-level query, which would return "false" for any devices under a backend not having P2P API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@romanovvlad @smaslov-intel @steffenlarsen @AerialMantis
So for the cuda backend I don't think it is necessary to explicitly check P2P capabilities of each individual pair of devices involved. The way I have implemented it initially is to call cuCtxEnablePeerAccess which as described in the docs will return an error that can be silently handled if cuDeviceCanAccessPeer returns false (meaning that the connection topology does not allow P2P access). Then whether cuDeviceCanAccessPeer is either true or false the cuda driver API Peer copy functions can be called (if cuDeviceCanAccessPeer is false it will do P2H H2P and if true there will be a single P2P copy). Since the HIP/ROCm API is very similar to the cuda API we think that it may possibly behave similarly, but we have not confirmed this yet. However, if Level Zero does require a device-level query then it makes sense to have this, even if the CUDA backend doesn't check at this granularity.
There could be a binary PI query which takes the two devices as arguments, returns true in the case that both devices use the cuda backend, does a binary device check for e.g. level zero using zeDeviceGetP2PProperties or similar if such a check is required. This wouldn't be a big problem and it sounds like this is what will be needed. Although I think there is a small but non-trivial technical issue with such a PI query taking two devices as arguments that is worth mentioning:
The binary device P2P query would be first needed in graph_builder::addCG. In addition to the Source device the query would need to access the Destination device which I think can only be immediately accessed via the 'Record' variable. The destination queue/device can be found from Record in the same way as is done in graph_builder::insertMemoryMove. Perhaps there is a cleaner way but I don't see it currently.
The bigger problem is that there is a kind of 'redundancy check' that adds a connection command if the memory is not in the correct context in Command::processDepEvent that is triggered when the graph_builder::addCG is adjusted to make a direct P2P memory copy instead of the D2H H2D route. There would need to be another binary device query at this point in the code and currently I don't believe there is any immediate way to access the source device via e.g. the DepEvent instance in Command::processDepEvent which cannot uniquely identify the source device when there is more than one device associated with the 'source' context.
One question perhaps someone can answer is what use case this 'redundancy check ' is for? Removing it does not lead to any failed tests using the cuda backend. My current understanding is that any case where memory is not in the correct context is already handled in graph_builder::addCG (The only case I know of is when a second queue submits a command group using memory located on a device of a primary queue that has a different context). The question is therefore whether there is any circumstance that Command::processDepEvent is called with the following if clause returning true, then creating the connection command, without the Command instance being created from graph_builder::insertMemoryMove via graph_builder::addCG
The other important point mentioned is that it is currently undesirable that only a single cuda (I believe also ROCm) device/context (cuda uses only one device per context) can be held by a single sycl context, whereas for OpenCL it is normal that only a single SYCL context is used for all devices. See here for a full description and resolution discussion of this issue. It sounds like it is a sizeable technical challenge to change the runtime and cuda PI so that a single SYCL context can hold more than one cuda context/device, so that at least for the short term the only means of using multiple cuda devices is via multiple SYCL contexts.
How would you recommend we proceed with this? I am assuming that inter-context memory copy would still be supported in the case that all backends use a single SYCL context for several devices (at least via the host): presumably it would be necessary in the case that more than one backend is used simultaneously, for memory copy between devices with different backends (and different SYCL contexts). If inter-context memory copy would then still be supported for devices sharing the same backend then I think it would make sense to make use of the faster Peer to Peer copies for the cases where this is supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the best course of action would be to scrap the extension and focus on the PI/runtime side of this.
The problem of having multiple CUDA contexts in a single SYCL context is an enormous can of worms and though related I would argue it's not the main focus of the changes you are introducing.
If the plan is to implement this in tandem with a P2P copy operation in PI, I suggest you implement another PI function for checking if two devices can do P2P copy from each other. Having the finer granularity potentially helps other backends. For the CUDA backend this query could always return
true
as the P2P copy in the CUDA driver API is able to make P2P copies even when the devices aren't interconnected. For other backends you can just leave them returningfalse
(with a note to implement them correctly in the future). If the new P2P support query returnstrue
you can then utilize the associated P2P copy functions, and if the query returnsfalse
then make the copy as a round-trip through the host, which I believe is the current implementation.In short, I suggest you close this PR and implement the P2P support query as a new PI function that checks if the backend can P2P copy between two devices (or contexts?). This does not need an extension document. Then open a new PR with this, potentially together with the P2P copy function (and it being used by the runtime if you'd like.)
If you have any questions along the way, feel free to reach out or open a discussion topic here on Github.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK thanks. The new PR is here: #4401. I will not close this PR.