Skip to content

[SYCL] Extension spec for queue index #7520

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Dec 9, 2022
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
214 changes: 214 additions & 0 deletions sycl/doc/extensions/proposed/sycl_ext_intel_queue_index.asciidoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,214 @@
= sycl_ext_intel_queue_index

:source-highlighter: coderay
:coderay-linenums-mode: table

// This section needs to be after the document title.
:doctype: book
:toc2:
:toc: left
:encoding: utf-8
:lang: en
:dpcpp: pass:[DPC++]

// Set the default source code type in this document to C++,
// for syntax highlighting purposes. This is needed because
// docbook uses c++ and html5 uses cpp.
:language: {basebackend@docbook:c++:cpp}


== Notice

[%hardbreaks]
Copyright (C) 2022-2022 Intel Corporation. All rights reserved.

Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks
of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by
permission by Khronos.


== Contact

To report problems with this extension, please open a new issue at:

https://github.com/intel/llvm/issues


== Dependencies

This extension is written against the SYCL 2020 revision 6 specification. All
references below to the "core SYCL specification" or to section numbers in the
SYCL specification refer to that revision.


== Status

This is a proposed extension specification, intended to gather community
feedback. Interfaces defined in this specification may not be implemented yet
or may be in a preliminary state. The specification itself may also change in
incompatible ways before it is finalized. *Shipping software products should
not rely on APIs defined in this specification.*


== Overview

Backends such as Level Zero and OpenCL expose an "index" to a device's work
submission queue, which allows the application to fine tune the way work is
submitted to a device. This extension exposes that same concept to SYCL
applications.

Most SYCL applications should not need to use this extension because the SYCL
implementation automatically selects an efficient way to submit work to a
device, including automatic selection of a queue index when necessary.
Therefore, this extension is aimed at advanced users who understand the device
hardware and think they can outperform the default implementation by specifying
an explicit queue index.

Note that this extension can be supported on any backend, even if the backend
has no notion of a "queue index". Backends that have no native support for a
queue index can report that a device has only a single available queue index.
Applications can then only request one possible queue index, and the backend
can treat this as the default behavior (i.e. the backend can ignore the index).


== Specification

=== Feature test macro

This extension provides a feature-test macro as described in the core SYCL
specification. An implementation supporting this extension must predefine the
macro `SYCL_EXT_INTEL_QUEUE_INDEX` to one of the values defined in the table
below. Applications can test for the existence of this macro to determine if
the implementation supports this feature, or applications can test the macro's
value to determine which of the extension's features the implementation
supports.

[%header,cols="1,5"]
|===
|Value
|Description

|1
|Initial version of this extension.
|===

=== New device information descriptor

This extension adds the following new device information descriptor which
allows the application to query the number of available queue indices for the
device.

```
namespace sycl::ext::intel::info::device {

struct max_compute_queue_indices;

} // namespace sycl::ext::intel::info::device
```

The return type for this information descriptor is `int`, and the value is a
positive integer telling the number queue indices that are available for the
device. These indices are numbered sequentially starting at `0`.

=== New queue property

This extension adds the following new queue property which can be specified to
the queue constructor via the `property_list` parameter.

```
namespace sycl::ext::intel::property::queue {

class compute_index {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Level Zero and OpenCL drivers have 2 levels of exposing those indexes.
The first level is command queue groups, there can be multiple of those and they can have different charecteristics, i.e. you can have Compute command queue group and Copy Command Queue group.
Then inside of those command queue groups we can have multiple indexes representing engines with those capabilieis.

Here I see that everything if flattened to one level, how would this map to what Level Zero and OpenCL drivers offers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we are not trying to "flatten" the 2 levels that the Level Zero and OpenCL drivers expose. Instead, we are assuming that a SYCL queue always corresponds to a Level Zero / OpenCL "compute command queue group". Thus, compute_index corresponds to the second level -- i.e. an index within the compute command queue group.

In the future, we may decide to add another property like copy_index, which allows the application to select a particular queue index within the Level Zero / OpenCL "copy command group". However, we have no customer request for this today.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarification, what would happen if we would have multiple compute groups within single device?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it would depend on what the semantic difference was between the various compute groups. If this happened, my first concern would be the behavior of a SYCL queue when there is no property. Would this choose one of the various compute groups arbitrarily?

Can you give me an example of why Level Zero might expose multiple compute groups? What would be the reason for doing this vs. exposing multiple indices all within the same compute group?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One reason would be special command queue group for cooperative dispatch kernels.
Another example would be exposing RCS/CCS separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gmlueck RCS and CCS may have different advantages. Or going in a more generic approach: there could be different types of compute engines, in the same say we have different types of copy engines (main copy engine vs link engines). So we need to have something generic and flexible enough that could accommodate those kind of cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I see @MichalMrozek concerns here. Here we are saying that there's a given number of indexes available

namespace sycl::ext::intel::info::device {

struct max_compute_queue_indices;

} /

however, not all are the same. From what L0 defines here https://spec.oneapi.io/level-zero/latest/core/PROG.html#discovery

The number and properties of command queue groups is queried by using zeDeviceGetCommandQueueGroupProperties.

The number of physical engines within a group is queried from ze_command_queue_group_properties_t.numQueues.

so there's an amount of indices per queue group, while here we are presenting an aggregated count of those indices, which might not be completely correct. if we want to make this more generic, we should present also the notion of queue groups, so there's an amount of indices per group.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I see it, the property name compute_index corresponds to the (first) "compute queue group". If we want to expose other queue groups to SYCL in the future, we will add additional queue properties. For example, if Level Zero decides to expose an RCS compute queue group, we can add a new SYCL queue property like rcs_compute_index.

I would rather not change compute_index to take both a group number and a queue index. If we did that, we'd need to add documentation about what the group number means because it seems the different queue groups all have different semantics. Since they have different semantics, I'd rather tie each to its own property, so that it's easy to document the semantics of each one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @gmlueck .

considering then that we are open to future additions or changes, then I have no immediate concerns.

+1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same from me +1

public:
compute_index(int idx);
int get_index();
};

} // namespace sycl::ext::intel::property::queue
```

The `compute_index` property is a hint to the implementation which can affect
work submission concurrency. When two queues for the same device have
different queue indices, there is a greater chance that commands submitted to
the two queues will be concurrently submitted to the device.

It is an error to specify a queue index that is out of range for the queue's
device. The `queue` constructor throws an `exception` with `errc::invalid` if
the index is less than `0` or if the index is greater than or equal to the
value returned by `max_compute_queue_indices` for the queue's device.

The constructor and member functions of the `compute_index` property have the
following semantics.

[%header,cols="1,3"]
|===
|Function
|Description

|`compute_index(int idx)`
|Constructs a property with the given queue index.

|`int get_index()`
|Returns the queue index associated with the property.
|===


== Example usage

The following code snippet shows how to create a SYCL queue using a specific
queue index.

```
#include <sycl/sycl.hpp>

using sycl;
using sycl::ext::intel;

void foo(device d) {
int max_index = d.get_info<info::device::max_compute_queue_indices>();
int index = /* choose value between 0 and max_index-1 */;
queue q{d, property::queue::compute_index{index}};
}
```


== Behavior on Intel GPU devices

:multi-CCS: https://github.com/intel/compute-runtime/blob/master/level_zero/doc/experimental_extensions/MULTI_CCS_MODES.md
:sycl_ext_intel_cslice: https://github.com/intel/llvm/pull/7513

This non-normative section describes the behavior of the `compute_index`
property for some specific Intel GPU devices when using {dpcpp}. These details
are not part of the extension specification, and this behavior may not apply to
other devices.

On many Intel devices, there is just one available queue index, and there is
therefore no advantage to using the `compute_index` property. However, this
property can sometimes be useful when running on Data Center GPU Flex series
devices (aka ATS-M) or Data Center GPU Max series devices (aka PVC).

Some models of ATS-M support multiple queue indices with the semantics
described in the sections above. When a single process submits kernels to
different queue indices, there is a greater likelihood that the kernels will
be submitted concurrently.

PVC also supports multiple queue indices on each tile, but these queue indices
have a different semantic. In order to expose multiple queue indices on PVC,
the device driver must be configured in {multi-CCS}[multi-CCS] mode. In this
mode, the PVC root device still has just one queue index, however each "tile"
has multiple queue indices. Therefore, the application must first create
sub-devices to access each tile, and then the application can construct a queue
on these sub-devices using the `compute_index` property.

The semantics of these PVC queue indices is different, though. On PVC, each
queue index corresponds to a fixed subset of the execution units. Queues using
different indices still have a greater likelihood of submitting kernels
concurrently, but each kernel also runs on its own partition of the execution
units. Therefore, the `compute_index` property is just an alternate way to
run on a partition of the device, exactly the same as creating a "cslice"
sub-device via the {sycl_ext_intel_cslice}[sycl_ext_intel_cslice] extension.

In both the ATS-M case and the PVC case, constructing a SYCL queue with
`compute_index` causes the runtime to submit kernels exclusively to that index
on the underlying Level Zero or OpenCL driver. Without this property, the
runtime is free to distribute kernels across the available queue indices.