Copyright © 2022-2022 Intel Corporation. All rights reserved.
Khronos® is a registered trademark and SYCL™ and SPIR™ are trademarks of The Khronos Group Inc. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos.
To report problems with this extension, please open a new issue at:
This extension is written against the SYCL 2020 revision 6 specification. All references below to the "core SYCL specification" or to section numbers in the SYCL specification refer to that revision.
Some Intel GPU devices can be partitioned at a granularity of "cslice" (compute slice), which is a smaller granularity than "tile". This form of partitioning is not currently enabled by default, so it is considered an advanced feature which most applications are not expected to use. This extension provides a way for these advanced applications to partition a device by cslice when it is enabled in the device driver.
Unlike "tile" partitions, a cslice partition does not have any different cache
affinity from its sibling cslice partitions. Therefore, this extension does
not expose this type of partitioning through
info::partition_property::partition_by_affinity_domain
. Instead, it adds a
new partitioning type
info::partition_property::ext_intel_partition_by_cslice
.
The only Intel GPU devices that currently support this type of partitioning are the Data Center GPU Max series (aka PVC), and this support is only available when the device driver is configured in multi-CCS mode. See that documentation for instructions on how to enable this mode and for other important information. Currently, it is only possible to partition a device by cslice if the driver is in "2 CCS Mode" or "4 CCS Mode". When in 2 CCS Mode, a tile can be partitioned into two cslice sub-devices. When in 4 CCS Mode, a tile can be partitioned into four cslice sub-devices.
This type of partitioning is currently supported only at the "tile" level.
A device should be queried using info::device::partition_properties
to
determine if it supports partitioning by ext_intel_partition_by_cslice
. If a
device does not support partitioning by ext_intel_partition_by_cslice
it may
first need to be partitioned into per-tile sub-devices via
partition_by_affinity_domain
, and then each of the resulting sub-devices may
be further partitioned by ext_intel_partition_by_cslice
.
It is important to understand that the device driver virtualizes work submission to the cslice sub-devices. (More specifically, the device driver virtualizes work submission to different CCS-es, and this means that on Data Center GPU Max series devices the work submission to a cslice is virtualized.) This virtualization happens only between processes, and not within a single process. For example, consider a single process that constructs two SYCL queues on cslice sub-device #0. Kernels submitted to these two queues are guaranteed to conflict, both using the same set of execution units. Therefore, if a single process wants to explicitly submit kernels to cslice sub-devices and it wants to avoid conflict, it should create queues on different sub-devices. By contrast, consider an example where two separate processes create a SYCL queue on cslice sub-device #0. In this case, the device driver virtualizes access to this cslice, and kernels submitted from the first process may run on different execution units than kernels submitted from the second process. In this second case, the device driver binds the process’s requested cslice to a physical cslice according to the overall system load.
Note that this extension can be supported by any implementation. If an
implementation supports a backend or device without the concept of cslice
partitions it can still conform to this extension by declaring the new
enumerator and member functions specified below. If the info descriptor query
info::device::partition_properties
does not report
ext_intel_partition_by_cslice
, then the backend or device does not support
the creation of cslice partitions.
This extension provides a feature-test macro as described in the core SYCL
specification. An implementation supporting this extension must predefine the
macro SYCL_EXT_INTEL_CSLICE
to one of the values defined in the table
below. Applications can test for the existence of this macro to determine if
the implementation supports this feature, or applications can test the macro’s
value to determine which of the extension’s features the implementation
supports.
Value | Description |
---|---|
1 |
Initial version of this extension. |
This extension adds a new enumerator named ext_intel_partition_by_cslice
to
info::partition_property
:
namespace sycl::info {
enum class partition_property : /* unspecified */ {
// ...
ext_intel_partition_by_cslice
};
} // namespace sycl::info
The behavior of the info::device::partition_properties
info descriptor query
is also extended to include ext_intel_partition_by_cslice
in the vector of
returned values if the device can be partitioned into at least two sub-devices
along that partition property.
This extension adds a new function template specialization to the device
class:
namespace sycl {
class device {
// ...
// Available only when
// Prop == info::partition_property::ext_intel_partition_by_cslice
template <info::partition_property Prop>
std::vector<device> create_sub_devices() const;
};
} // namespace sycl
This function only participates in overload resolution if the Prop
template
parameter is info::partition_property::ext_intel_partition_by_cslice
. It
returns a std::vector
of sub-devices partitioned from this SYCL device
,
each representing a fixed set of hardware cslices.
If the SYCL device
does not support
info::partition_property::ext_intel_partition_by_cslice
, calling this
function throws a synchronous exception
with the
errc::feature_not_supported
error code.
This section describes the behavior for some of the device info queries when
applied to a device
object that represents a "cslice" partition.
-
info::device::partition_type_property
Returns
ext_intel_partition_by_cslice
. -
info::device::max_compute_units
When partitioning by
ext_intel_partition_by_cslice
, each sub-device represents a fixed subset of the parent device’s compute units. This query returns the number of compute units represented by the sub-device.
The remaining device info queries return the properties or limits of the
sub-device, as is typical for these queries. In general, if a resource is
partitioned among the sub-devices, then the associated info query will
return each sub-device’s share of the resource. However, if a resource is
shared by all of the sub-devices, then the associated info query for each
sub-device will return the same value as for the parent device. For example,
if device global memory is shared by all cslice partitions in a tile, then the
info query info::device::global_mem_size
will return the same value for the
device
object representing the tile as for the device
object representing
a cslice.
The Level Zero device driver doesn’t use the concept of sub-device to represent a fixed partition of cslices. Instead, a Level Zero command queue can be created with a particular queue index, which represents a partition of the cslices.
As a result, calling get_native
for a SYCL device
that represents a cslice
partition returns the same ze_device_handle_t
as the parent device. If an
application wants a native handle to the cslice partition, it must create a
SYCL queue
and then call get_native
on the queue
. This will return a
ze_command_queue_handle_t
that corresponds to the cslice partition.
The OpenCL device driver doesn’t use the concept of sub-device to represent a fixed partition of cslices. Instead, an OpenCL command queue can be created with a particular queue index, which represents a partition of the cslices.
As a result, calling get_native
for a SYCL device
that represents a cslice
partition returns the same cl_device_id
as the parent device. If an
application wants a native handle to the cslice partition, it must create a
SYCL queue
and then call get_native
on the queue
. This will return a
cl_command_queue
that corresponds to the cslice partition.
This section describes the effect of this extension on the DPC++
ONEAPI_DEVICE_SELECTOR
environment variable. Since this environment variable
is not part of the SYCL specification, this section is not a normative part of
the extension specification. Rather, it only describes the impact on DPC++.
As described in the documentation for the
ONEAPI_DEVICE_SELECTOR
, a term in the selector string can be an integral
number followed by a decimal point (.
), where the decimal point indicates a
sub-device. For example, 1.2
means sub-device #2 of device #1. These
decimal points can represent either a sub-device created via
partition_by_affinity_domain
or via ext_intel_partition_by_cslice
. When
DPC++ processes a term with a decimal point, it first attempts to partition
by ext_intel_partition_by_cslice
. If that is not possible, it next attempts
to partition by partition_by_affinity_domain
/
partition_affinity_domain::next_partitionable
.
It is important to keep in mind, though, that requesting a specific cslice via this environment variable has limited effect due to the device driver’s virtualization of cslices. To illustrate, consider an example where two processes are launched as follows, selecting different cslice sub-devices:
$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.0 ZEX_NUMBER_OF_CCS=0:2 ./foo
$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.1 ZEX_NUMBER_OF_CCS=0:2 ./foo
The first process selects cslice #0 while the second selects cslice #1. This does have the effect that each process is constrained to a single cslice (which is not the DPC++ default). However, the actual cslice number is irrelevant. Because of cslice virtualization, the device driver will choose some available cslice for each process instead of honoring the value requested in the environment variable. As a result, the following example has exactly the same effect:
$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.0 ZEX_NUMBER_OF_CCS=0:2 ./foo
$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.0 ZEX_NUMBER_OF_CCS=0:2 ./foo