[SYCL][CUDA][HIP] Fix PI version reporting #5509

npmiller · 2022-02-08T11:20:38Z

Report the actual PI version rather than 0.0.

steffenlarsen

This added information is definitely more informative than previously, but I wonder if it would be more in line with PI_DEVICE_INFO_VERSION to report the compute capabilities of the CUDA device instead, i.e. cuDeviceGetAttribute with CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR and CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR. See CL_DEVICE_VERSION in clGetDeviceInfo.

I don't know if HIP has a similar info query.

alexbatashev · 2022-02-08T12:41:03Z

@npmiller there was a failure in the OCL CPU job, which was caused by a machine misconfiguration. I fixed that and restarted the job for you. Should be fine now.

npmiller · 2022-02-08T13:07:15Z

What the device version is meant to be is pretty confusing to me; it seems that it could be backend specific.

For info::device::version the SYCL spec says:

Returns the SYCL version as a std::string in the form: <major_version>.<minor_version>.

https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#_device_information_descriptors

But it does mention in that section that the meaning of any property can be redefined in the backend spec, and in the OpenCL backend spec it says the following:

The device version is an indication of the device’s capabilities, as represented by the device information returned by the sycl::device::get_info() member function. Examples of attributes associated with the device version are resource limits and information about functionality beyond the requirements in the core SYCL specification. The version returned corresponds to the highest version of the OpenCL specification for which the device is conformant, but is not higher than the version of the device’s platform which bounds the overall capabilities of the runtime operating the device.

https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#_platform_mixed_version_support_2

This also sounds a bit contradictory, but it suggests that the highest supported OpenCL version should be reported there.

And in fact AMD OpenCL devices report 2.0 for this field, which likely refers to OpenCL 2.0.

In OpenCL the same version property is defined as OpenCL<space><major_version.minor_version><space><vendor-specific information>
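To make that format concrete, here is a minimal sketch of splitting such a string into its parts. The names `OpenCLVersion`, `parseNumber` and `parseOpenCLVersion` are hypothetical helpers written for illustration; they are not taken from the SYCL runtime or any plugin, and the handling of a missing vendor part is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Pieces of "OpenCL<space><major.minor><space><vendor-specific information>".
// Hypothetical type, for illustration only.
struct OpenCLVersion {
  int majorVersion;
  int minorVersion;
  std::string vendorInfo; // empty when no vendor-specific part is present
};

// Parses a short run of decimal digits; rejects empty or non-digit input.
inline std::optional<int> parseNumber(std::string_view digits) {
  if (digits.empty() || digits.size() > 4)
    return std::nullopt;
  int value = 0;
  for (char c : digits) {
    if (c < '0' || c > '9')
      return std::nullopt;
    value = value * 10 + (c - '0');
  }
  // At most four digits were accepted, so the value cannot overflow.
  assert(value >= 0 && value <= 9999);
  return value;
}

// Returns std::nullopt when the string does not follow the OpenCL layout.
inline std::optional<OpenCLVersion> parseOpenCLVersion(std::string_view s) {
  constexpr std::string_view prefix = "OpenCL ";
  if (s.substr(0, prefix.size()) != prefix)
    return std::nullopt;
  s.remove_prefix(prefix.size());

  // The version token runs up to the next space (or the end of the string).
  const std::size_t space = s.find(' ');
  const std::string_view version = s.substr(0, space);
  const std::size_t dot = version.find('.');
  if (dot == std::string_view::npos)
    return std::nullopt;

  const auto majorPart = parseNumber(version.substr(0, dot));
  const auto minorPart = parseNumber(version.substr(dot + 1));
  if (!majorPart || !minorPart)
    return std::nullopt;

  // Everything after the second space is vendor-specific information.
  std::string vendor;
  if (space != std::string_view::npos)
    vendor = std::string(s.substr(space + 1));
  return OpenCLVersion{*majorPart, *minorPart, vendor};
}
```

A string such as "CUDA 11.6" is rejected by this sketch because it lacks the "OpenCL " prefix, which illustrates why a parser written for the OpenCL layout is a poor fit for version strings coming from other backends.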

For Level Zero, it's set to the Level Zero API version, which is seemingly different from the Level Zero driver version.

https://github.com/intel/llvm/blob/sycl/sycl/plugins/level_zero/pi_level_zero.cpp#L2396

So all of that to say, I'm really not sure what the best thing to report here is. I think the PI version makes quite a bit of sense when compared to the OpenCL backend, but it doesn't seem very helpful. Compute capabilities would make a lot of sense, but it doesn't seem like any other backend uses the device version that way.

What do you think?

As a side note, HIP does have similar compute capabilities version numbers but discourages using them, and advises to use the feature flags instead (AMD HIP API 4.5, section 5.39.2.6):

Major compute capability. On HCC, this is an approximation and features may differ from CUDA CC. See the arch feature flags for portable ways to query feature caps.

steffenlarsen · 2022-02-08T13:49:06Z

@npmiller - Thank you for the thorough analysis! I agree, the wording and intent of the info query is a little vague.

I see what you mean, that it is hard to draw a comparison between the compute capability and the API versions reported by the other plugins. I am not opposed to reporting the PI version here, but my thinking is that reporting the compute capability gives the user more device-specific information.
Maybe it could report both in the same query? E.g. "PI X.Y (Compute capability A.B)". Since users need to explicitly set an SM version higher than sm_50 (?), this would be useful information, and I believe the user could get it simply by running sycl-ls (assuming this information is added to the query).
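As a rough sketch of what such a combined string could look like, the helper below assembles it from four integers. `formatCombinedVersion` is a hypothetical name, not an existing PI or plugin function; in a real plugin the compute capability values would presumably come from cuDeviceGetAttribute, but here they are plain parameters so the example has no CUDA dependency.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, for illustration only: builds a string of the form
// "PI X.Y (Compute capability A.B)".
// Assumption: the PI version and the compute capability (as returned for
// CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR / _MINOR) are already
// available as non-negative integers.
std::string formatCombinedVersion(int piMajor, int piMinor,
                                  int ccMajor, int ccMinor) {
  assert(piMajor >= 0 && piMinor >= 0 && ccMajor >= 0 && ccMinor >= 0);
  return "PI " + std::to_string(piMajor) + "." + std::to_string(piMinor) +
         " (Compute capability " + std::to_string(ccMajor) + "." +
         std::to_string(ccMinor) + ")";
}
```

Keeping the formatting in one small function like this would make it easy to change the layout later if the backend specification settles on a different form.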

smaslov-intel · 2022-02-08T14:50:27Z

I am not opposed to reporting the PI version here, but my thinking is that reporting the compute capability gives the user more device-specific information.

Additionally, the PI API version is available in other ways, whereas the native device's API version is not.

npmiller · 2022-02-08T15:44:30Z

I see, yeah, that makes sense; it would be really helpful. I think it should be fine to just put the useful data in this field. I've updated it so that it gives the following output:

CUDA plugin and HIP Nvidia plugin:

Platform [#3]:
    Version  : CUDA 11.6
    Name     : NVIDIA CUDA BACKEND
    Vendor   : NVIDIA Corporation
    Devices  : 1
        Device [#0]:
        Type       : gpu
        Version    : Compute Capability 6.1
        Name       : NVIDIA GeForce GTX 1050 Ti
        Vendor     : NVIDIA Corporation
        Driver     : CUDA 11.6
Platform [#4]:
    Version  : HIP 0.0
    Name     : AMD HIP BACKEND
    Vendor   : AMD Corporation
    Devices  : 1
        Device [#0]:
        Type       : gpu
        Version    : Compute Capability 6.1
        Name       : NVIDIA GeForce GTX 1050 Ti
        Vendor     : AMD Corporation
        Driver     : HIP 0.0

HIP AMD plugin:

Platform [#2]:
    Version  : HIP 40421.43
    Name     : AMD HIP BACKEND
    Vendor   : AMD Corporation
    Devices  : 1
        Device [#0]:
        Type       : gpu
        Version    : gfx908:sramecc+:xnack-
        Name       :
        Vendor     : AMD Corporation
        Driver     : HIP 40421.43

It's going to lead to a bit of duplication on AMD with #5508 but I don't think that's a big deal.

I had to update the function that returns the device version string, as all the plugins were going through the OpenCL path, which trimmed the string we reported in the plugins down to just the numerical version.

npmiller · 2022-02-09T13:00:38Z

So after further discussions and reviewing the current SYCL specification, I've decided to file a ticket with the SYCL specification to seek clarifications on this:

Clarify what the device version property should report KhronosGroup/SYCL-Docs#222

So I'll put this PR on hold until we get some clarifications on what this property is meant to be and what we're allowed to use it for.

npmiller · 2022-05-05T09:52:00Z

The SYCL specification PR has now been merged, so the device version is now backend-defined. We should therefore be fine to report the compute capabilities and define it as such in the CUDA backend specification. The CUDA backend specification is not finalized yet, so this may need to be tweaked later on, but it should be fine for now.

steffenlarsen

LGTM!

npmiller · 2022-05-09T09:28:54Z

Updated to remove the extra processing that extracted just the version number from OpenCL version strings. The full version string can now be returned directly, deferring the formatting to the OpenCL spec.

This is a change in behavior, but the previous behavior was incorrect: it should have returned the SYCL version, not the OpenCL version. The new behavior is reflected in the spec changes of KhronosGroup/SYCL-Docs#231

npmiller · 2022-05-09T11:38:07Z

/verify with intel/llvm-test-suite#1019

npmiller · 2023-01-05T15:33:05Z

All test failures are expected, and are fixed by:

[SYCL] Fix tests using device version llvm-test-suite#1019

And one of the AMD failures is fixed by:

[SYCL] Check that fp16 aspect is supported before using half llvm-test-suite#1487

npmiller · 2023-01-24T14:44:57Z

/verify with intel/llvm-test-suite#1019

npmiller · 2023-03-10T16:34:51Z

ping @steffenlarsen @smaslov-intel

Sorry I keep forgetting about this PR, are we good to merge this and the matching tests PR: intel/llvm-test-suite#1019 ?

Or does it need more reviews and/or rebasing?

steffenlarsen

LGTM! @npmiller - Could you please push a merge commit to make sure it works with tip?

npmiller · 2023-03-14T12:20:55Z

/verify with intel/llvm-test-suite#1019

steffenlarsen · 2023-03-16T12:12:25Z

/verify with intel/llvm-test-suite#1019

steffenlarsen · 2023-03-17T11:06:03Z

@npmiller - It looks like there are new failing tests. Could you please address these?

npmiller · 2023-03-17T11:34:08Z

/verify with intel/llvm-test-suite#1019

npmiller · 2023-03-22T18:38:59Z

/verify with intel/llvm-test-suite#1019

npmiller · 2023-03-23T09:41:16Z

/verify with intel/llvm-test-suite#1019

npmiller · 2023-03-23T16:49:28Z

/verify with intel/llvm-test-suite#1019

npmiller · 2023-03-24T11:05:24Z

/verify with intel/llvm-test-suite#1019

bader · 2023-03-31T03:20:08Z

@npmiller, please, update your branch and move tests from intel/llvm-test-suite#1019 to sycl/test-e2e.
This way they will be tested on HIP backend before the merge!

Report the actual PI version rather than `0.0`.

Use the Compute Capability for the device version for Nvidia GPUs, and use the architecture for AMD GPUs.

See discussions on: intel/llvm-test-suite#1019

npmiller requested review from a team as code owners February 8, 2022 11:20

npmiller requested a review from sergey-semenov February 8, 2022 11:20

steffenlarsen reviewed Feb 8, 2022

npmiller force-pushed the fix-pi-version branch from 065bc73 to 94414d0 Compare May 5, 2022 09:48

npmiller requested a review from steffenlarsen May 5, 2022 09:52

steffenlarsen previously approved these changes May 5, 2022

npmiller mentioned this pull request May 5, 2022

[SYCL] Fix tests using device version intel/llvm-test-suite#1019

npmiller dismissed steffenlarsen’s stale review via 43b34a4 May 9, 2022 09:25

npmiller force-pushed the fix-pi-version branch from 94414d0 to 43b34a4 Compare May 9, 2022 09:25

npmiller requested a review from steffenlarsen May 9, 2022 11:38

npmiller force-pushed the fix-pi-version branch 2 times, most recently from 5935c4c to b2bca2c Compare August 29, 2022 18:25

npmiller force-pushed the fix-pi-version branch from b2bca2c to fbc2c8e Compare January 5, 2023 11:49

npmiller temporarily deployed to aws January 5, 2023 12:13 — with GitHub Actions Inactive

npmiller temporarily deployed to aws January 5, 2023 12:44 — with GitHub Actions Inactive

steffenlarsen approved these changes Mar 14, 2023

npmiller temporarily deployed to aws March 14, 2023 11:51 — with GitHub Actions Inactive

npmiller temporarily deployed to aws March 14, 2023 12:18 — with GitHub Actions Inactive

npmiller temporarily deployed to aws March 22, 2023 13:00 — with GitHub Actions Inactive

npmiller temporarily deployed to aws March 22, 2023 14:01 — with GitHub Actions Inactive

npmiller added 6 commits March 31, 2023 09:47

[SYCL][CUDA][HIP] Fix PI version reporting

c72c71d

Report the actual PI version rather than `0.0`.

[SYCL][CUDA][HIP] Update device version

f1a7c70

Use the Compute Capability for the device version for Nvidia GPUs, and use the architecture for AMD GPUs.

[SYCL][HIP] Fix HIP plugin build

74e7af1

[SYCL][CUDA][HIP] Fix device version reporting

d8cbb82

[SYCL][CUDA][HIP] Fix assertion namespace

1048cbb

Update tests for new OpenCL version definition

02dc606

See discussions on: intel/llvm-test-suite#1019

npmiller force-pushed the fix-pi-version branch from 2c036d8 to 02dc606 Compare March 31, 2023 09:09

npmiller temporarily deployed to aws March 31, 2023 09:36 — with GitHub Actions Inactive

npmiller temporarily deployed to aws March 31, 2023 10:07 — with GitHub Actions Inactive

Fix device version parsing

8b6719a

npmiller temporarily deployed to aws March 31, 2023 11:31 — with GitHub Actions Inactive

npmiller temporarily deployed to aws March 31, 2023 12:12 — with GitHub Actions Inactive

Allow intel subgroup extension for subgroup tests

97d6710

npmiller temporarily deployed to aws March 31, 2023 13:42 — with GitHub Actions Inactive

npmiller temporarily deployed to aws March 31, 2023 14:16 — with GitHub Actions Inactive

bader merged commit 88e459f into intel:sycl Mar 31, 2023
