-
Notifications
You must be signed in to change notification settings - Fork 769
[SYCL] Extension spec for queue index #7520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
214 changes: 214 additions & 0 deletions
214
sycl/doc/extensions/proposed/sycl_ext_intel_queue_index.asciidoc
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,214 @@ | ||
= sycl_ext_intel_queue_index | ||
|
||
:source-highlighter: coderay | ||
:coderay-linenums-mode: table | ||
|
||
// This section needs to be after the document title. | ||
:doctype: book | ||
:toc2: | ||
:toc: left | ||
:encoding: utf-8 | ||
:lang: en | ||
:dpcpp: pass:[DPC++] | ||
|
||
// Set the default source code type in this document to C++, | ||
// for syntax highlighting purposes. This is needed because | ||
// docbook uses c++ and html5 uses cpp. | ||
:language: {basebackend@docbook:c++:cpp} | ||
|
||
|
||
== Notice | ||
|
||
[%hardbreaks] | ||
Copyright (C) 2022-2022 Intel Corporation. All rights reserved. | ||
|
||
Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks | ||
of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by | ||
permission by Khronos. | ||
|
||
|
||
== Contact | ||
|
||
To report problems with this extension, please open a new issue at: | ||
|
||
https://github.com/intel/llvm/issues | ||
|
||
|
||
== Dependencies | ||
|
||
This extension is written against the SYCL 2020 revision 6 specification. All | ||
references below to the "core SYCL specification" or to section numbers in the | ||
SYCL specification refer to that revision. | ||
|
||
|
||
== Status | ||
|
||
This is a proposed extension specification, intended to gather community | ||
feedback. Interfaces defined in this specification may not be implemented yet | ||
or may be in a preliminary state. The specification itself may also change in | ||
incompatible ways before it is finalized. *Shipping software products should | ||
not rely on APIs defined in this specification.* | ||
|
||
|
||
== Overview | ||
|
||
Backends such as Level Zero and OpenCL expose an "index" to a device's work | ||
submission queue, which allows the application to fine tune the way work is | ||
submitted to a device. This extension exposes that same concept to SYCL | ||
applications. | ||
|
||
Most SYCL applications should not need to use this extension because the SYCL | ||
implementation automatically selects an efficient way to submit work to a | ||
device, including automatic selection of a queue index when necessary. | ||
Therefore, this extension is aimed at advanced users who understand the device | ||
hardware and think they can outperform the default implementation by specifying | ||
an explicit queue index. | ||
|
||
Note that this extension can be supported on any backend, even if the backend | ||
has no notion of a "queue index". Backends that have no native support for a | ||
queue index can report that a device has only a single available queue index. | ||
Applications can then only request one possible queue index, and the backend | ||
can treat this as the default behavior (i.e. the backend can ignore the index). | ||
|
||
|
||
== Specification | ||
|
||
=== Feature test macro | ||
|
||
This extension provides a feature-test macro as described in the core SYCL | ||
specification. An implementation supporting this extension must predefine the | ||
macro `SYCL_EXT_INTEL_QUEUE_INDEX` to one of the values defined in the table | ||
below. Applications can test for the existence of this macro to determine if | ||
the implementation supports this feature, or applications can test the macro's | ||
value to determine which of the extension's features the implementation | ||
supports. | ||
|
||
[%header,cols="1,5"] | ||
|=== | ||
|Value | ||
|Description | ||
|
||
|1 | ||
|Initial version of this extension. | ||
|=== | ||
|
||
=== New device information descriptor | ||
|
||
This extension adds the following new device information descriptor which | ||
allows the application to query the number of available queue indices for the | ||
device. | ||
|
||
``` | ||
namespace sycl::ext::intel::info::device { | ||
|
||
struct max_compute_queue_indices; | ||
|
||
} // namespace sycl::ext::intel::info::device | ||
``` | ||
|
||
The return type for this information descriptor is `int`, and the value is a | ||
positive integer telling the number queue indices that are available for the | ||
device. These indices are numbered sequentially starting at `0`. | ||
|
||
=== New queue property | ||
|
||
This extension adds the following new queue property which can be specified to | ||
the queue constructor via the `property_list` parameter. | ||
|
||
``` | ||
namespace sycl::ext::intel::property::queue { | ||
|
||
class compute_index { | ||
public: | ||
compute_index(int idx); | ||
int get_index(); | ||
}; | ||
|
||
} // namespace sycl::ext::intel::property::queue | ||
``` | ||
|
||
The `compute_index` property is a hint to the implementation which can affect | ||
work submission concurrency. When two queues for the same device have | ||
different queue indices, there is a greater chance that commands submitted to | ||
the two queues will be concurrently submitted to the device. | ||
|
||
It is an error to specify a queue index that is out of range for the queue's | ||
device. The `queue` constructor throws an `exception` with `errc::invalid` if | ||
the index is less than `0` or if the index is greater than or equal to the | ||
value returned by `max_compute_queue_indices` for the queue's device. | ||
|
||
The constructor and member functions of the `compute_index` property have the | ||
following semantics. | ||
|
||
[%header,cols="1,3"] | ||
|=== | ||
|Function | ||
|Description | ||
|
||
|`compute_index(int idx)` | ||
|Constructs a property with the given queue index. | ||
|
||
|`int get_index()` | ||
|Returns the queue index associated with the property. | ||
|=== | ||
|
||
|
||
== Example usage | ||
|
||
The following code snippet shows how to create a SYCL queue using a specific | ||
queue index. | ||
|
||
``` | ||
#include <sycl/sycl.hpp> | ||
|
||
using sycl; | ||
using sycl::ext::intel; | ||
|
||
void foo(device d) { | ||
int max_index = d.get_info<info::device::max_compute_queue_indices>(); | ||
int index = /* choose value between 0 and max_index-1 */; | ||
queue q{d, property::queue::compute_index{index}}; | ||
} | ||
``` | ||
|
||
|
||
== Behavior on Intel GPU devices | ||
|
||
:multi-CCS: https://github.com/intel/compute-runtime/blob/master/level_zero/doc/experimental_extensions/MULTI_CCS_MODES.md | ||
:sycl_ext_intel_cslice: https://github.com/intel/llvm/pull/7513 | ||
|
||
This non-normative section describes the behavior of the `compute_index` | ||
property for some specific Intel GPU devices when using {dpcpp}. These details | ||
are not part of the extension specification, and this behavior may not apply to | ||
other devices. | ||
|
||
On many Intel devices, there is just one available queue index, and there is | ||
therefore no advantage to using the `compute_index` property. However, this | ||
property can sometimes be useful when running on Data Center GPU Flex series | ||
devices (aka ATS-M) or Data Center GPU Max series devices (aka PVC). | ||
|
||
Some models of ATS-M support multiple queue indices with the semantics | ||
described in the sections above. When a single process submits kernels to | ||
different queue indices, there is a greater likelihood that the kernels will | ||
be submitted concurrently. | ||
|
||
PVC also supports multiple queue indices on each tile, but these queue indices | ||
have a different semantic. In order to expose multiple queue indices on PVC, | ||
the device driver must be configured in {multi-CCS}[multi-CCS] mode. In this | ||
mode, the PVC root device still has just one queue index, however each "tile" | ||
has multiple queue indices. Therefore, the application must first create | ||
sub-devices to access each tile, and then the application can construct a queue | ||
on these sub-devices using the `compute_index` property. | ||
|
||
The semantics of these PVC queue indices is different, though. On PVC, each | ||
queue index corresponds to a fixed subset of the execution units. Queues using | ||
different indices still have a greater likelihood of submitting kernels | ||
concurrently, but each kernel also runs on its own partition of the execution | ||
units. Therefore, the `compute_index` property is just an alternate way to | ||
run on a partition of the device, exactly the same as creating a "cslice" | ||
sub-device via the {sycl_ext_intel_cslice}[sycl_ext_intel_cslice] extension. | ||
|
||
In both the ATS-M case and the PVC case, constructing a SYCL queue with | ||
`compute_index` causes the runtime to submit kernels exclusively to that index | ||
on the underlying Level Zero or OpenCL driver. Without this property, the | ||
runtime is free to distribute kernels across the available queue indices. |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Level Zero and OpenCL drivers have 2 levels of exposing those indexes.
The first level is command queue groups, there can be multiple of those and they can have different charecteristics, i.e. you can have Compute command queue group and Copy Command Queue group.
Then inside of those command queue groups we can have multiple indexes representing engines with those capabilieis.
Here I see that everything if flattened to one level, how would this map to what Level Zero and OpenCL drivers offers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we are not trying to "flatten" the 2 levels that the Level Zero and OpenCL drivers expose. Instead, we are assuming that a SYCL
queue
always corresponds to a Level Zero / OpenCL "compute command queue group". Thus,compute_index
corresponds to the second level -- i.e. an index within the compute command queue group.In the future, we may decide to add another property like
copy_index
, which allows the application to select a particular queue index within the Level Zero / OpenCL "copy command group". However, we have no customer request for this today.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarification, what would happen if we would have multiple compute groups within single device?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it would depend on what the semantic difference was between the various compute groups. If this happened, my first concern would be the behavior of a SYCL
queue
when there is no property. Would this choose one of the various compute groups arbitrarily?Can you give me an example of why Level Zero might expose multiple compute groups? What would be the reason for doing this vs. exposing multiple indices all within the same compute group?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One reason would be special command queue group for cooperative dispatch kernels.
Another example would be exposing RCS/CCS separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gmlueck RCS and CCS may have different advantages. Or going in a more generic approach: there could be different types of compute engines, in the same say we have different types of copy engines (main copy engine vs link engines). So we need to have something generic and flexible enough that could accommodate those kind of cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I see @MichalMrozek concerns here. Here we are saying that there's a given number of indexes available
however, not all are the same. From what L0 defines here https://spec.oneapi.io/level-zero/latest/core/PROG.html#discovery
so there's an amount of indices per queue group, while here we are presenting an aggregated count of those indices, which might not be completely correct. if we want to make this more generic, we should present also the notion of queue groups, so there's an amount of indices per group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I see it, the property name
compute_index
corresponds to the (first) "compute queue group". If we want to expose other queue groups to SYCL in the future, we will add additional queue properties. For example, if Level Zero decides to expose an RCS compute queue group, we can add a new SYCL queue property likercs_compute_index
.I would rather not change
compute_index
to take both a group number and a queue index. If we did that, we'd need to add documentation about what the group number means because it seems the different queue groups all have different semantics. Since they have different semantics, I'd rather tie each to its own property, so that it's easy to document the semantics of each one.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @gmlueck .
considering then that we are open to future additions or changes, then I have no immediate concerns.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same from me +1