[SYCL] Support missing types with ldg extension #8748

Merged
merged 6 commits into intel:sycl on Jul 24, 2023

Conversation

jchlanda
Contributor

No description provided.

@jchlanda jchlanda requested review from a team as code owners March 23, 2023 10:32
@jchlanda jchlanda requested a review from cperkinsintel March 23, 2023 10:32
@jchlanda jchlanda marked this pull request as draft March 23, 2023 10:32
@jchlanda
Contributor Author

Marking as a draft until the pull-down (#8738) is done, as this patch relies on the half ldg builtins that were added to mainline LLVM here: llvm/llvm-project@71b0658

@jchlanda
Contributor Author

cc: @JackAKirk

@JackAKirk
Contributor

LGTM

btw you can add half cases to the e2e test: intel/llvm-test-suite#1417 (although initial test cases aren't merged yet either)

@@ -98,7 +98,7 @@ T ldg(const T* ptr);
`ldg` returns the data of type `T` located at address `ptr`. When called from the `ext_oneapi_cuda` backend the data is cached in the read-only texture cache.
When called from any other backend a copy of the data stored at address `ptr` is returned without using any special cache.

The template parameter `T` can be one of `char`, `signed char`, `short`, `int`, `long`, `long long`, `unsigned char`, `unsigned short`, `unsigned int`, `unsigned long`, `unsigned long long`, `vec<char, 2>`, `vec<char, 4>`, `vec<short, 2>`, `vec<short, 4>`, `vec<int, 2>`, `vec<int, 4>`, `vec<long long, 2>`, `vec<uchar, 2>`, `vec<uchar, 4>`, `vec<ushort, 2>`, `vec<ushort, 4>`, `vec<uint, 2>`, `vec<uint, 4>`, `vec<unsigned long long, 2>`, `float`, `vec<float, 2>`, `vec<float, 4>`, `double`, or `vec<double, 2>`.
The template parameter `T` can be one of `char`, `signed char`, `short`, `int`, `long`, `long long`, `unsigned char`, `unsigned short`, `unsigned int`, `unsigned long`, `unsigned long long`, `half`, `float`, `double`, `vec<char, 2>`, `vec<char, 4>`, `vec<short, 2>`, `vec<short, 4>`, `vec<int, 2>`, `vec<int, 4>`, `vec<long long, 2>`, `vec<uchar, 2>`, `vec<uchar, 4>`, `vec<ushort, 2>`, `vec<ushort, 4>`, `vec<uint, 2>`, `vec<uint, 4>`, `vec<unsigned long long, 2>`, `vec<half, 2>`, `vec<float, 2>`, `vec<float, 4>`, or `vec<double, 2>`.
Contributor

Did you intentionally omit vec<half, 4>? If so, why?

Contributor

@JackAKirk JackAKirk Mar 28, 2023

vec<half, 4> isn't natively supported with a single __ldg instruction. But it could be supported by just calling the vec2 instruction twice. There is at least one case I saw where an application implements this itself for double4, so this could be something useful to add in the future.
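The two-call composition described above could be sketched in plain C++ (a hypothetical `ldg2` stands in for the native 2-element builtin; on a non-CUDA backend the documented fallback is just a plain load, so that is what is shown here):

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for a native 2-element read-only load; the
// real CUDA path would lower to a single __ldg/ld.global.nc call.
template <typename T>
std::array<T, 2> ldg2(const T* ptr) {
    return {ptr[0], ptr[1]};
}

// A 4-element load composed from two 2-element loads, the approach
// suggested above for vec<half, 4> (shown with a generic element type).
template <typename T>
std::array<T, 4> ldg4(const T* ptr) {
    const std::array<T, 2> lo = ldg2(ptr);
    const std::array<T, 2> hi = ldg2(ptr + 2);
    return {lo[0], lo[1], hi[0], hi[1]};
}
```

Note that the two halves are issued as separate loads, so unlike a single native instruction this is not one memory transaction.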

Contributor

Actually, maybe half4/double4 are natively implementable; at least that is what is suggested by https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-ld-global-nc
We could investigate this in the future.
The types that we added so far match those exposed through CUDA: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#read-only-data-cache-load-function

Contributor

If this is what the CUDA docs say are natively implemented, that is fine. I only pointed it out because it doesn't seem to match your existing pattern. For the other types, it seems like there are 2- and 4-element versions for all the types that are 32-bits or less. (E.g. float2, float4, int2, int4, short2, short4, etc.)

Contributor

Yes this matches the pattern here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#read-only-data-cache-load-function

However, I don't see why we can't also implement it for additional types like double4 in the future, especially if the PTX allows this to be done via a single instruction. It is just that the equivalent of double4 etc. is not exposed via the CUDA Runtime API.

Contributor Author

That is a good point. I followed the types we exposed as builtins, but you are right, I should have been more consistent with the types already supported. I'll add vec<half, 4> as calls to the existing builtins and will revisit it if/when we implement support for the instructions mentioned by Jack.

Contributor

@JackAKirk JackAKirk Mar 29, 2023

If we are adding vec<half, 4>, can you also add vec<double, 4>? MD (molecular dynamics) codes use the 4D version for double (basically they need 3 dimensions for positions/velocities): glotzerlab/hoomd-blue@730a809#diff-c20b0af2267267f69ae103ae2c7e4aecea76b0e5feb3d3f22f0e48f4b652c08dR28

This will save them having to implement this themselves.

Thanks

Contributor Author

Sure, if we are sticking to vectors of 2 and 4 elements, I'll also add the 4-element variants of long long and unsigned long long.

@@ -98,7 +98,7 @@ T ldg(const T* ptr);
`ldg` returns the data of type `T` located at address `ptr`. When called from the `ext_oneapi_cuda` backend the data is cached in the read-only texture cache.
When called from any other backend a copy of the data stored at address `ptr` is returned without using any special cache.

The template parameter `T` can be one of `char`, `signed char`, `short`, `int`, `long`, `long long`, `unsigned char`, `unsigned short`, `unsigned int`, `unsigned long`, `unsigned long long`, `vec<char, 2>`, `vec<char, 4>`, `vec<short, 2>`, `vec<short, 4>`, `vec<int, 2>`, `vec<int, 4>`, `vec<long long, 2>`, `vec<uchar, 2>`, `vec<uchar, 4>`, `vec<ushort, 2>`, `vec<ushort, 4>`, `vec<uint, 2>`, `vec<uint, 4>`, `vec<unsigned long long, 2>`, `float`, `vec<float, 2>`, `vec<float, 4>`, `double`, or `vec<double, 2>`.
The template parameter `T` can be one of `char`, `signed char`, `short`, `int`, `long`, `long long`, `unsigned char`, `unsigned short`, `unsigned int`, `unsigned long`, `unsigned long long`, `half`, `float`, `double`, `vec<char, 2>`, `vec<char, 4>`, `vec<short, 2>`, `vec<short, 4>`, `vec<int, 2>`, `vec<int, 4>`, `vec<long long, 2>`, `vec<long long, 4>`, `vec<uchar, 2>`, `vec<uchar, 4>`, `vec<ushort, 2>`, `vec<ushort, 4>`, `vec<uint, 2>`, `vec<uint, 4>`, `vec<unsigned long long, 2>`, `vec<unsigned long long, 4>`, `vec<half, 2>`, `vec<half, 4>`, `vec<float, 2>`, `vec<float, 4>`, `vec<double, 2>`, or `vec<double, 4>`.
Contributor

At this point you are missing these combinations:

  • vec<signed char, 2>
  • vec<signed char, 4>
  • vec<long, 2>
  • vec<long, 4>
  • vec<unsigned long, 2>
  • vec<unsigned long, 4>

Does it make sense to add them to make a complete set?

I also noticed that you use aliases uchar, ushort, and uint which are being deprecated in the core SYCL spec because they don't mean what you would expect. I think you probably want to change these to unsigned char, unsigned short, and unsigned int.

Contributor Author

Yes to both.

Contributor Author

I've added the missing types and tidied up the syntax of the vector usage. This PR will need to stay in draft mode for a wee bit longer, as we don't yet have all the relevant builtins added in upstream LLVM.

Contributor

@gmlueck gmlueck left a comment

spec changes LGTM.

@jchlanda jchlanda changed the title [SYCL] Support half and vec2 half types with ldg extension [SYCL] Support missing types with ldg extension Mar 31, 2023
@zjin-lcf
Contributor

Are uchar3 and char3 equivalent to uchar4 and char4?

@zjin-lcf
Contributor

#ifdef ENABLE_BF16
template<>
inline __device__ __nv_bfloat162 ldg(const __nv_bfloat162* val) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800
    // __ldg for bfloat16 requires sm_80+; fall back to a plain load.
    return val[0];
#else
    return __ldg(val);
#endif
}

template<>
inline __device__ __nv_bfloat16 ldg(const __nv_bfloat16* val) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800
    return val[0];
#else
    return __ldg(val);
#endif
}
#endif // ENABLE_BF16

@zjin-lcf
Contributor

Could the reviewers (cperkinsintel and intel llvm-reviewers) comment on the PR? Quite a few programs need the extension for the half-precision and char3 data types. Thanks.

@jchlanda
Contributor Author

@gmlueck are you still happy with the patch? With this PR in, ldg should cover the following types:

  • char,
  • signed char,
  • short,
  • int,
  • long,
  • long long,
  • unsigned char,
  • unsigned short,
  • unsigned int,
  • unsigned long,
  • unsigned long long,
  • half,
  • float,
  • double

alongside their vectors of 2, 3 and 4 elements.
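For backends other than ext_oneapi_cuda, the extension document quoted earlier specifies that ldg simply returns a copy of the pointed-to data without using any special cache. That fallback path can be sketched in portable C++ (the real extension additionally dispatches to the __ldg builtins on the CUDA backend; this is only the generic path):

```cpp
#include <cassert>

// Minimal sketch of the documented non-CUDA fallback of ldg: return a
// copy of the data at ptr without any special caching. On the CUDA
// backend the real implementation instead routes the load through the
// read-only texture cache.
template <typename T>
T ldg_fallback(const T* ptr) {
    return *ptr;
}
```

This is why the extension is safe to call from any backend: on non-CUDA devices it degrades to an ordinary load with the same result.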

@jchlanda jchlanda requested a review from gmlueck June 30, 2023 10:12
Contributor

@gmlueck gmlueck left a comment

Yes, spec still looks good!

@jchlanda
Contributor Author

jchlanda commented Jul 4, 2023

@intel/llvm-reviewers-runtime gentle ping, is there anything else that this PR needs?

Contributor

@steffenlarsen steffenlarsen left a comment

I am a little sad that we do so many seemingly repeated yet different branches, but I don't have a good alternative.

LGTM!

@jchlanda
Contributor Author

jchlanda commented Jul 6, 2023

I am a little sad that we do so many seemingly repeated yet different branches, but I don't have a good alternative.

LGTM!

Thanks @steffenlarsen, I had a go at macros, but it got counterproductive really quickly.

@jinz2014
Contributor

@jchlanda
I have a question: some CUDA kernels take arguments of user-defined struct/class types. Could we still apply "ldg"? Thanks.

@jchlanda
Contributor Author

@jchlanda I have a question: some CUDA kernels take arguments of user-defined struct/class types. Could we still apply "ldg"? Thanks.

Hi @jinz2014
This will not work with the current implementation, sorry. And I am not sure it could ever be implemented for arbitrary user types. While the ISA specifies that the ld.global.nc instruction can operate on the untyped b64 type, you would still be limited by the maximum number of vector elements per load, which is set to 4. See https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-ld-global-nc for details.
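One workaround for user-defined types (not part of this PR; a sketch only, with a hypothetical plain-load stand-in for the extension's ldg) is to apply the load member-wise to fields of supported types:

```cpp
#include <cassert>

// Hypothetical stand-in for ldg on a supported scalar type; on the
// CUDA backend the real call would be the extension's ldg, which maps
// to a read-only cached load.
template <typename T>
T ldg_stub(const T* ptr) { return *ptr; }

// Hypothetical user struct of the kind asked about above: it cannot be
// loaded with a single ldg call, but each supported member can be.
struct Particle {
    double position[3];
    double velocity[3];
};

Particle load_particle(const Particle* p) {
    Particle out{};
    for (int i = 0; i < 3; ++i) {
        out.position[i] = ldg_stub(&p->position[i]);
        out.velocity[i] = ldg_stub(&p->velocity[i]);
    }
    return out;
}
```

Each member load is a separate transaction, so this does not give the single-instruction behaviour of a native vector load; it only lets each field benefit from the read-only cache individually.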

@jchlanda
Contributor Author

@intel/llvm-gatekeepers is this PR OK to merge in?

@steffenlarsen steffenlarsen merged commit aec8a35 into intel:sycl Jul 24, 2023
mdtoguchi pushed a commit to mdtoguchi/llvm that referenced this pull request Oct 18, 2023