-
Notifications
You must be signed in to change notification settings - Fork 738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL] Add ext_intel_device_id to sycl_ext_intel_device_info #7010
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding this! Could you please also add tests for it in sycl/unittests/kernel-and-program/DeviceInfo.cpp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
sycl/source/detail/device_impl.cpp
Outdated
case aspect::ext_intel_pci_device_id: | ||
return getPlugin().call_nocheck<detail::PiApiKind::piDeviceGetInfo>( | ||
MDevice, PI_DEVICE_INFO_PCI_DEVICE_ID, 0, | ||
nullptr, &return_size) == PI_SUCCESS; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of comments/questions here:
cl_intel_attributes_query
says that the corresponding query returns PCI ID only if device has it, i.e. the fact that the extension is supported, doesn't mean that PCI ID exists for the device:
If the implementation is driven primarily by a PCI device with a PCI device ID
I don't think that this should block the PR, but I think that we do need to add a FIXME
comment here, because strictly speaking, this implementation (of both has
and get_info
) is not entirely correct.
the low 16 bits of the device identifier must contain that PCI device ID, and the remaining bits must be set to zero.
Unfortunately, deviceId
in Level Zero is poorly documented and it is not clear how returned 32 bits are filled and with which exact data. My concern here is that returned value might not have a uniform format across all backends. See my other comment on the spec extension document with other comments on the topic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cl_intel_attributes_query says that the corresponding query returns PCI ID only if device has it, i.e. the fact that the extension is supported, doesn't mean that PCI ID exists for the device:
I don't see the issue here. The proposed ext_intel_pci_device_id
does not claim that it would return such ID for any device. It would return a PCI ID. If a device is not a PCI device, it's obvious that it shouldn't have a PCI ID. This is what cl_intel_attributes_query says. If it's a PCI device, why wouldn't it have a PCI device ID? How is this any different from the existing ext_intel_pci_address
?
Unfortunately, deviceId in Level Zero is poorly documented and it is not clear how returned 32 bits are filled and with which exact data. My concern here is that returned value might not have a uniform format across all backends. See my other comment on the spec extension document with other comments on the topic.
cl_intel_attributes_query is clear, and in Level Zero are you aware of any reason why deviceId
could suddenly change? That structure also contains vendorId
, which together with deviceId
form the full PCI ID. Why would Level Zero not return the actual device ID, and what would be the point of not returning it in the lower 16 bits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that this should block the PR, but I think that we do need to add a FIXME comment here, because strictly speaking, this implementation (of both has and get_info) is not entirely correct.
What needs to be fixed specifically? In contrast, why is the implementation ext_intel_pci_address
considered correct, which also exists only for PCI devices? How is has
handled differently for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the issue here. The proposed
ext_intel_pci_device_id
does not claim that it would return such ID for any device. It would return a PCI ID. If a device is not a PCI device, it's obvious that it shouldn't have a PCI ID. This is what cl_intel_attributes_query says. If it's a PCI device, why wouldn't it have a PCI device ID?
With that I agree. The inconsistency I'm highlighting here is that I don't see us checking the result returned by CL_DEVICE_ID_INTEL
before saying whether or not the corresponding aspect is supported. I also don't see us raising an exception in case CL_DEVICE_ID_INTEL
did not return a PCI ID (i.e. any bit except for low 16 bits is set to non-zero value).
How is this any different from the existing
ext_intel_pci_address
?
Querying PCI address is not currently supported for OpenCL backend.
llvm/sycl/plugins/opencl/pi_opencl.cpp
Lines 263 to 267 in 9717cc5
// TODO: Check regularly to see if support in enabled in OpenCL. | |
// Intel GPU EU device-specific information extensions. | |
// Some of the queries are enabled by cl_intel_device_attribute_query | |
// extension, but it's not yet in the Registry. | |
case PI_DEVICE_INFO_PCI_ADDRESS: |
Depending on how the corresponding OpenCL extension will be worded (or perhaps there is already such an extension? I'm just not aware of it) it might or might not have the same implementation inconsistency.
cl_intel_attributes_query is clear, and in Level Zero are you aware of any reason why
deviceId
could suddenly change? That structure also containsvendorId
, which together withdeviceId
form the full PCI ID. Why would Level Zero not return the actual device ID, and what would be the point of not returning it in the lower 16 bits?
The reason I asked the question is because I'm unfamiliar with PCI ID encoding/representation. Is it documented somewhere as a standard? If so, there is absolutely no problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With that I agree. The inconsistency I'm highlighting here is that I don't see us checking the result returned by CL_DEVICE_ID_INTEL before saying whether or not the corresponding aspect is supported. I also don't see us raising an exception in case CL_DEVICE_ID_INTEL did not return a PCI ID (i.e. any bit except for low 16 bits is set to non-zero value).
Even if there wouldn't be any bits in the upper 16 bits, it still wouldn't mean that it's an actual valid PCI ID. There still wouldn't be any guarantee by adding this check. But if you think it would be useful, please add it.
The reason I asked the question is because I'm unfamiliar with PCI ID encoding/representation. Is it documented somewhere as a standard? If so, there is absolutely no problem.
The PCI encoding is widely documented, e.g. https://man7.org/linux/man-pages/man5/pci.ids.5.html:
Devices on the PCI bus are identified by a combination of a
vendor ID (assigned by the PCI SIG) and device ID (assigned by
the vendor). Both IDs are 16-bit integers and the device itself
provides no translation to a human-readable string.
SYCL can already report the vendor ID part of the PCI ID (vendor_id
), and a lot of applications are already relying on it. This extension would enable the device ID part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that it does guarantee it, if it's a PCI device, which is the only kind of the device this extension cares about.
Then the aspect shouldn't get defined for cases the extension doesn't "care about". The aspect is meant to be a promise that a certain feature is supported on the device. If the user checks if a device has it and it returns true
then the following query should give a value as defined per this new query extension, which currently says it should be a PCI device ID.
Even if we would rename it to
ext_intel_device_id
, I would still make it mandatory to return the PCI ID if the device is a PCI device, and something unspecified otherwise (but I'm not sure what would be the use of that). Which is exactly what cl_intel_device_attribute_query specifies.
Absolutely! That seems like a reasonable requirement that the OpenCL extension seems to agree with. The problem here boils down to the fact that this extension is currently promising more than the used OpenCL query does, without doing any reliable checks or filtering for values that do not adhere to that promise (i.e. values that are not PCI device IDs).
Only the specs could guarantee the intent, the implementation could be faulty regardless. But if the same issue applies to the PCI address, why was that approved to be part of this extension, while adding the PCI ID is currently blocked because of this?
I am okay with assuming that the intent of the L0 specification is to return the PCI device ID. It does mention the PCI configuration after all.
A completely vague specification would make this extension unusable. If it might break in the future, then it has to be fixed just like any other bug. Yes, we might suddenly have a faulty extension, which would be just like any other bug which has to be fixed. But if we would have a vaguely specified extension, it wouldn't be a bug anymore, but the applications relying on it would break anyway, with no fix in sight. Then all applications would need to come up with their own workarounds, instead of fixing the issue in a centralized way, in the SYCL implementation.
Amending the L0 specification is not the big problem here. The issue is that the OpenCL extension used here is very clear that it may not return a PCI device ID. As such, we cannot just assume that if the query works we will also be able to return a PCI device ID. What I was trying to convey is that any of our OpenCL implementations would be in their full right to return something (now or in the future) that is not a PCI device ID, but with the current version of this SYCL extension and implementation we would accept it as if it was a PCI device ID. Likewise we could have another OpenCL implementation report that it supports the extension query and not return a PCI device ID, and since we would report the device as having the new aspect we would be considering it as supported for this new SYCL extension. Since it isn't a bug in the OpenCL implementations, it is a bug in this PR's implementation/extension.
An option is to ask for a new OpenCL query that guarantees a PCI device ID if it returns a value, but until we have that new query we would have to consider OpenCL backends as not supporting this extension.
But this is far from being the only value that we're already querying from various runtimes, the same risk applies to all of them. What makes querying the PCI ID so much riskier that all the others? Why do we accept the risk for all other values but not for this one?
Other queries do not use the same L0 property and the properties they use may be clearer, such as the address
member of zes_pci_properties_t
for PCI addresses. Again, I am okay with assuming that L0 always returns a PCI device ID as part of deviceId
, but the bigger problem here is not L0 but OpenCL. The rest of the extension queries in the SYCL extension document this PR touches are only supported on L0 and return PI_ERROR_INVALID_VALUE
on OpenCL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely! That seems like a reasonable requirement that the OpenCL extension seems to agree with. The problem here boils down to the fact that this extension is currently promising more than the used OpenCL query does, without doing any reliable checks or filtering for values that do not adhere to that promise (i.e. values that are not PCI device IDs).
Then the solution is straightforward. Rename the aspect to ext_intel_device_id
and copy the specs from the OpenCL extension (except mentioning OpenCL of course), which is the following:
A unique device identifier for the OpenCL device.
If the implementation is driven primarily by a PCI device with a PCI device ID, the low 16 bits of the device identifier must contain that PCI device ID, and the remaining bits must be set to zero. Otherwise, the choice of what to return may be dictated by operating system or platform policies - but should uniquely identify both the device version and any major configuration options (for example, core count in the case of multi-core devices).
Thus the implementation of ext_intel_device_id
would be perfect for both L0 and OpenCL. Do you have any objections?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me! 🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If @AlexeySachkov still has concerns regarding which exact bits contain the PCI device ID in the value returned by L0 (but I don't see any reason why it would be bit-shifted, the current implementation clearly doesn't), then I would be okay with the following relaxed wording:
A unique device identifier for the device.
If the implementation is driven primarily by a PCI device with a PCI device ID, the device identifier must be that PCI device ID. Otherwise, the choice of what to return may be dictated by operating system or platform policies - but should uniquely identify both the device version and any major configuration options (for example, core count in the case of multi-core devices).
@AlexeySachkov , do you find this acceptable?
According to L0 specs deviceId
is "device id from PCI configuration". If the return value would be manipulated in any way, including bit-shifted it would not be the "device id from PCI configuration" anymore but something like "device id from PCI configuration shifted by N bits". It doesn't matter whether the device ID is returned in a 16-bit, 32-bit or 64-bit value. If it's not equal to the device ID, it's not the device ID but something else. Many applications, including the L0 implementation itself depends the correctness of the PCI device ID. If L0 would suddenly introduce a bit shift in the future, it would surely break many applications. What would be the possible usefulness of introducing such a breaking change? Why introduce specifically a bit shift anyway? I don't see where is this coming from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to ext_intel_device_id
, updated the specs.
@@ -57,6 +58,41 @@ The UUID can be obtained using the standard get\_info() interface. | |||
|
|||
|
|||
|
|||
# PCI Device ID # | |||
|
|||
A new device descriptor will be added which will provide the 16-bit PCI device ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return type of the information descriptor is defined as uint32_t
: do we plan to require this API to return a value in a single format for all backends or each backend is free to return the value in its specific format?
In any case, I think the wording here should be adjusted a bit either to say which exact 16 bits within a 32 bit value correspond to PCI device ID or that the returned value is backend-specific (and therefore it should probably avoid mentioning 16 bits at all).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value should be the same regardless of backend because the PCI device ID is independent of backends and OSs. It's specific to the device.
The 16-bit is mentioned to avoid confusion with other values like the PCI address. The exact bits could be mentioned but this is not the only value that does not use the full 32 bits of the return value, so why specify this only for this specific value? If nothing is mentioned, why would anyone assume that the ID is returned in e.g. the upper 16 bits?
This adds the ability to query the device ID of a device, which must be equal to the PCI device ID for PCI devices. An important use case of this feature is detecting the GPU architecture, which is necessary in applications that have dedicated code paths for different architectures. Querying the device ID is implemented for Level Zero and OpenCL as well.
@@ -264,6 +264,7 @@ typedef enum { | |||
// Intel UUID extension. | |||
PI_DEVICE_INFO_UUID = 0x106A, | |||
// These are Intel-specific extensions. | |||
PI_DEVICE_INFO_DEVICE_ID = 0x4251, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value here is the same as the value of CL_DEVICE_ID_INTEL
which is defined in the OpenCL extension cl_intel_device_attribute_query
. I presume the intention is that the OpenCL plugin doesn't need any changes because we can pass PI_DEVICE_INFO_DEVICE_ID
in place of CL_DEVICE_ID_INTEL
and it will just work.
However, I don't see any code checking to see if the OpenCL implementation supports the cl_intel_device_attribute_query
extension. Is that consistent with the way we use other OpenCL extensions? Do we just assume that OpenCL supports whatever extension we need without checking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clGetDeviceInfo states that it returns
CL_INVALID_VALUE if param_name is not one of the supported values or if size in bytes specified by param_value_size is < size of return type as specified in the Device Queries table and param_value is not a NULL value or if param_name is a value that is available as an extension and the corresponding extension is not supported by the device.
The runtime uses this to determine if the aspect for each of the extension descriptors is supported or not. So OpenCL should be ready out-of-the-gate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I presume the intention is that the OpenCL plugin doesn't need any changes because we can pass PI_DEVICE_INFO_DEVICE_ID in place of CL_DEVICE_ID_INTEL and it will just work.
Yes, that is the intention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is the intention.
I think it should be documented in the sources somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please be a bit more specific about what kind of documentation do you expect here? Most other constants in pi.h
also map to an OpenCL parameter the same way, some of them are part of the core OpenCL but many others map to extensions. Neither of those are documented. So there's nothing new in the way PI_DEVICE_INFO_DEVICE_ID
works here. I followed the conventions of the existing codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, OK. I was under a wrong impression that we are only mapping PI to CL values in the OpenCL plugin, and this new one had a direct CL value used, which I thought needs a documentation. Nevermind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spec changes LGTM.
Please resolve open issues with @steffenlarsen and @AlexeySachkov who also made comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am happy with the current state of this, but we should have some new unittest cases in sycl/unittests/kernel-and-program/DeviceInfo.cpp for it. Once that's in, you can have a 👍 from me!
Test PR: intel/llvm-test-suite#1320 |
/verify with intel/llvm-test-suite#1320 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@intel/dpcpp-esimd-reviewers & @intel/dpcpp-l0-pi-reviewers - Friendly ping. |
Following failures are unrelated and fixed with #7034:
Merging this. |
This commit adds test for `ext_intel_device_id` introduced in intel/llvm#7010
This commit adds test for `ext_intel_device_id` introduced in intel#7010
This adds the ability to query the device ID of a device, which must be equal to the PCI device ID for PCI devices. An important use case of this feature is detecting the GPU architecture, which is necessary in applications that have dedicated code paths for different architectures. Querying the device ID is implemented for Level Zero and OpenCL as well.