Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][CUDA][HIP] Fix PI version reporting #5509

Merged
merged 8 commits into from
Mar 31, 2023
Merged

Conversation

npmiller
Copy link
Contributor

@npmiller npmiller commented Feb 8, 2022

Report the actual PI version rather than 0.0.

Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This added information is definitely more informative than previously, but I wonder if it would be more in line with PI_DEVICE_INFO_VERSION to report the compute capabilities of the CUDA device instead, i.e. cuDeviceGetAttribute with CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR and CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR. See CL_DEVICE_VERSION in clGetDeviceInfo.

I don't know if HIP has a similar info query.

@alexbatashev
Copy link
Contributor

@npmiller there was a fail for OCL CPU job, which was caused by machine misconfiugration. I fixed that and restarted the job for you. Should be fine now.

@npmiller
Copy link
Contributor Author

npmiller commented Feb 8, 2022

What the device version is meant to be is pretty confusing to me, it seems that it could be backend specific.

For info::device::version the SYCL spec says:

Returns the SYCL version as a std::string in the form: <major_version>.<minor_version>.

But it does mention in that section that the meaning of any property can be redefined in the backend spec, and in the OpenCL backend spec it says the following:

The device version is an indication of the device’s capabilities, as represented by the device information returned by the sycl::device::get_info() member function. Examples of attributes associated with the device version are resource limits and information about functionality beyond the requirements in the core SYCL specification. The version returned corresponds to the highest version of the OpenCL specification for which the device is conformant, but is not higher than the version of the device’s platform which bounds the overall capabilities of the runtime operating the device.

Which also sounds a bit contradicting but suggests that it should be the highest OpenCL version reported there.

And in fact AMD OpenCL devices report 2.0 for this field, which is likely to be OpenCL 2.0.

In OpenCL the same version property is defined as OpenCL<space><major_version.minor_version><space><vendor-specific information>

For level zero it's set to the LevelZero API version which is seemingly different from the LevelZero driver version.

So all of that to say I'm really not sure what's the best thing to report here, I think the PI version makes quite a bit of sense when compared to the OpenCL backend, but it doesn't seem very helpful, compute capabilities would make a lot of sense but it doesn't seem like any other backend uses the device version that way.

What do you think?

As a side note, HIP does have similar compute capabilities version numbers but discourages using them, and advises to use the feature flags instead (AMD HIP API 4.5, section 5.39.2.6):

Major compute capability. On HCC, this is an approximation and features may differ from CUDA CC. See the arch
feature flags for portable ways to query feature caps.

@steffenlarsen
Copy link
Contributor

@npmiller - Thank you for the thorough analysis! I agree, the wording and intent of the info query is a little vague.

I see what you mean, that it is hard to draw a comparison between the compute capability and the API versions reported by the other plugins. I am not opposed to reporting the PI version here, but my thinking is that reporting the compute capability gives the user more device-specific information.
Maybe it could report both in the same query? E.g. "PI X.Y (Compute capability A.B)". Since users need to explicitly set SM version higher than sm_50 (?) it could be useful information which I believe the user can get through simply running sycl-ls (assuming this information is added to the query).

@smaslov-intel
Copy link
Contributor

I am not opposed to reporting the PI version here, but my thinking is that reporting the compute capability gives the user more device-specific information.

Additionally PI API version is available other ways, and reporting native device's API version is not.

@npmiller
Copy link
Contributor Author

npmiller commented Feb 8, 2022

I see, yeah that makes sense it would be really helpful. I think it should be fine to just put the useful data in this field, I've updated it so that it gives the following output:

CUDA plugin and HIP Nvidia plugin:

Platform [#3]:
    Version  : CUDA 11.6
    Name     : NVIDIA CUDA BACKEND
    Vendor   : NVIDIA Corporation
    Devices  : 1
        Device [#0]:
        Type       : gpu
        Version    : Compute Capability 6.6
        Name       : NVIDIA GeForce GTX 1050 Ti
        Vendor     : NVIDIA Corporation
        Driver     : CUDA 11.6
Platform [#4]:
    Version  : HIP 0.0
    Name     : AMD HIP BACKEND
    Vendor   : AMD Corporation
    Devices  : 1
        Device [#0]:
        Type       : gpu
        Version    : Compute Capability 6.1
        Name       : NVIDIA GeForce GTX 1050 Ti
        Vendor     : AMD Corporation
        Driver     : HIP 0.0

HIP AMD plugin:

Platform [#2]:          
    Version  : HIP 40421.43          
    Name     : AMD HIP BACKEND
    Vendor   : AMD Corporation
    Devices  : 1                                                                                                                                                             
        Device [#0]:                                                                  
        Type       : gpu                                                              
        Version    : gfx908:sramecc+:xnack-                                    
        Name       :                                                                                                                                                         
        Vendor     : AMD Corporation                                                                                                                                         
        Driver     : HIP 40421.43  

It's going to lead to a bit of duplication on AMD with #5508 but I don't think that's a big deal.

I had to update the function that was returning the device version string as all the plugins were going through the OpenCL path which trimmed the string we reported in the plugins down to just the numerical version.

@npmiller
Copy link
Contributor Author

npmiller commented Feb 9, 2022

So after further discussions and reviewing the current SYCL specification, I've decided to file a ticket with the SYCL specification to seek clarifications on this:

So I'll put this PR on hold until we get some clarifications on what this property is meant to be and what we're allowed to use it for.

@npmiller
Copy link
Contributor Author

npmiller commented May 5, 2022

The SYCL specification PR has now been merged so now the device version is backend defined, so we should be fine to report the compute capabilities and define it as such in the CUDA backend specification. The CUDA backend specification is not finalized yet so this may need to be tweaked later on but it should be fine for now.

steffenlarsen
steffenlarsen previously approved these changes May 5, 2022
Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@npmiller
Copy link
Contributor Author

npmiller commented May 9, 2022

Updated to remove the extra processing on OpenCL version strings to extract just the version number, the full version can now be returned directly, deferring the formatting to the OpenCL spec.

This is a change in behavior but the previous behavior was incorrect as it should have been the SYCL version returned not the OpenCL version. And is reflected in the spec changes of KhronosGroup/SYCL-Docs#231

@npmiller
Copy link
Contributor Author

npmiller commented May 9, 2022

/verify with intel/llvm-test-suite#1019

@npmiller npmiller force-pushed the fix-pi-version branch 2 times, most recently from 5935c4c to b2bca2c Compare August 29, 2022 18:25
@npmiller npmiller temporarily deployed to aws January 5, 2023 12:13 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws January 5, 2023 12:44 — with GitHub Actions Inactive
@npmiller
Copy link
Contributor Author

npmiller commented Jan 5, 2023

All tests failures are expected, and are fixed by:

And for one of the AMD failure by:

@npmiller
Copy link
Contributor Author

/verify with intel/llvm-test-suite#1019

@npmiller
Copy link
Contributor Author

ping @steffenlarsen @smaslov-intel

Sorry I keep forgetting about this PR, are we good to merge this and the matching tests PR: intel/llvm-test-suite#1019 ?

Or does it need more reviews and/or rebasing?

Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! @npmiller - Could you please push a merge commit to make sure it works with tip.

@npmiller npmiller temporarily deployed to aws March 14, 2023 11:51 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws March 14, 2023 12:18 — with GitHub Actions Inactive
@npmiller
Copy link
Contributor Author

/verify with intel/llvm-test-suite#1019

1 similar comment
@steffenlarsen
Copy link
Contributor

/verify with intel/llvm-test-suite#1019

@steffenlarsen
Copy link
Contributor

@npmiller - It looks like there are new failing tests. Could you please address these?

@npmiller
Copy link
Contributor Author

/verify with intel/llvm-test-suite#1019

@npmiller npmiller temporarily deployed to aws March 22, 2023 13:00 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws March 22, 2023 14:01 — with GitHub Actions Inactive
@npmiller
Copy link
Contributor Author

/verify with intel/llvm-test-suite#1019

3 similar comments
@npmiller
Copy link
Contributor Author

/verify with intel/llvm-test-suite#1019

@npmiller
Copy link
Contributor Author

/verify with intel/llvm-test-suite#1019

@npmiller
Copy link
Contributor Author

/verify with intel/llvm-test-suite#1019

@bader
Copy link
Contributor

bader commented Mar 31, 2023

@npmiller, please, update your branch and move tests from intel/llvm-test-suite#1019 to sycl/test-e2e.
This way they will be tested on HIP backend before the merge!

@npmiller npmiller temporarily deployed to aws March 31, 2023 09:36 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws March 31, 2023 10:07 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws March 31, 2023 11:31 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws March 31, 2023 12:12 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws March 31, 2023 13:42 — with GitHub Actions Inactive
@npmiller npmiller temporarily deployed to aws March 31, 2023 14:16 — with GitHub Actions Inactive
@bader bader merged commit 88e459f into intel:sycl Mar 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants