-
Notifications
You must be signed in to change notification settings - Fork 788
[SYCL] Support missing types with ldg extension #8748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Marking as a draft till the pull-down is done (#8738), as this patch relies on the half ldg builtins that were added in to the mainline here: llvm/llvm-project@71b0658 |
cc: @JackAKirk |
9148b06
to
93ea36f
Compare
LGTM btw you can add half cases to the e2e test: intel/llvm-test-suite#1417 (although initial test cases aren't merged yet either) |
@@ -98,7 +98,7 @@ T ldg(const T* ptr); | |||
`ldg` returns the data of type `T` located at address `ptr`. When called from the `ext_oneapi_cuda` backend the data is cached in the read-only texture cache. | |||
When called from any other backend a copy of the data stored at address `ptr` is returned without using any special cache. | |||
|
|||
The template parameter `T` can be one of `char`, `signed char`, `short`, `int`, `long`, `long long`, `unsigned char`, `unsigned short`, `unsigned int`, `unsigned long`, `unsigned long long`, `vec<char, 2>`, `vec<char, 4>`, `vec<short, 2>`, `vec<short, 4>`, `vec<int, 2>`, `vec<int, 4>`, `vec<long long, 2>`, `vec<uchar, 2>`, `vec<uchar, 4>`, `vec<ushort, 2>`, `vec<ushort, 4>`, `vec<uint, 2>`, `vec<uint, 4>`, `vec<unsigned long long, 2>`, `float`, `vec<float, 2>`, `vec<float, 4>`, `double`, or `vec<double, 2>`. | |||
The template parameter `T` can be one of `char`, `signed char`, `short`, `int`, `long`, `long long`, `unsigned char`, `unsigned short`, `unsigned int`, `unsigned long`, `unsigned long long`, `half`, `float`, `double`, `vec<char, 2>`, `vec<char, 4>`, `vec<short, 2>`, `vec<short, 4>`, `vec<int, 2>`, `vec<int, 4>`, `vec<long long, 2>`, `vec<uchar, 2>`, `vec<uchar, 4>`, `vec<ushort, 2>`, `vec<ushort, 4>`, `vec<uint, 2>`, `vec<uint, 4>`, `vec<unsigned long long, 2>`, `vec<half, 2>`, `vec<float, 2>`, `vec<float, 4>`, or `vec<double, 2>`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you intentionally omit vec<half, 4>
? If so, why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it could be supported by just calling the vec2 instruction twice. There is at least one case I saw where an application implements this themselves for vec<half, 4>
isn't natively supported with a single __ldg instruction.double4
, so this could be something useful to add in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually maybe half4/double4 are natively implementable, at least this is what is suggested by https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-ld-global-nc
We could investigate this in the future.
The types that we added so far match those exposed through CUDA: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#read-only-data-cache-load-function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is what the CUDA docs say are natively implemented, that is fine. I only pointed it out because it doesn't seem to match your existing pattern. For the other types, it seems like there are 2- and 4-element versions for all the types that are 32-bits or less. (E.g. float2, float4, int2, int4, short2, short4, etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this matches the pattern here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#read-only-data-cache-load-function
However I don't see why we can't also implement it for additional types like double4 in the future, especially if the ptx allows this to be done via a single instruction. It is just that the equivalent of double4 etc is not exposed via CUDA Runtime API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good point I followed the types we exposed as builtins, but you are right, I should have been more consistent with types already supported. I'll add the vec<half, 4>
as calls to the existing builtins and will revisit it if/when we implement support for instructions mentioned by Jack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are adding vec<half, 4>
can you also add vec<double, 4>
? MD uses the 4D version for double (basically they need 3 dimensions for positions/velocities): glotzerlab/hoomd-blue@730a809#diff-c20b0af2267267f69ae103ae2c7e4aecea76b0e5feb3d3f22f0e48f4b652c08dR28
This will save them having to implement this themselves.
Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, if we are sticking to vectors of 2 and 4 elements, I'll add 4 elements variants of longlong
and ulonglong
.
@@ -98,7 +98,7 @@ T ldg(const T* ptr); | |||
`ldg` returns the data of type `T` located at address `ptr`. When called from the `ext_oneapi_cuda` backend the data is cached in the read-only texture cache. | |||
When called from any other backend a copy of the data stored at address `ptr` is returned without using any special cache. | |||
|
|||
The template parameter `T` can be one of `char`, `signed char`, `short`, `int`, `long`, `long long`, `unsigned char`, `unsigned short`, `unsigned int`, `unsigned long`, `unsigned long long`, `vec<char, 2>`, `vec<char, 4>`, `vec<short, 2>`, `vec<short, 4>`, `vec<int, 2>`, `vec<int, 4>`, `vec<long long, 2>`, `vec<uchar, 2>`, `vec<uchar, 4>`, `vec<ushort, 2>`, `vec<ushort, 4>`, `vec<uint, 2>`, `vec<uint, 4>`, `vec<unsigned long long, 2>`, `float`, `vec<float, 2>`, `vec<float, 4>`, `double`, or `vec<double, 2>`. | |||
The template parameter `T` can be one of `char`, `signed char`, `short`, `int`, `long`, `long long`, `unsigned char`, `unsigned short`, `unsigned int`, `unsigned long`, `unsigned long long`, `half`, `float`, `double`, `vec<char, 2>`, `vec<char, 4>`, `vec<short, 2>`, `vec<short, 4>`, `vec<int, 2>`, `vec<int, 4>`, `vec<long long, 2>`, `vec<long long, 4>`, `vec<uchar, 2>`, `vec<uchar, 4>`, `vec<ushort, 2>`, `vec<ushort, 4>`, `vec<uint, 2>`, `vec<uint, 4>`, `vec<unsigned long long, 2>`, `vec<unsigned long long, 4>`, `vec<half, 2>`, `vec<half, 4>`, `vec<float, 2>`, `vec<float, 4>`, `vec<double, 2>`, or `vec<double, 4>`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point you are missing these combinations:
vec<signed char, 2>
vec<signed char, 4>
vec<long, 2>
vec<long, 4>
vec<unsigned long, 2>
vec<unsigned long, 4>
Does it make sense to add them to make a complete set?
I also noticed that you use aliases uchar
, ushort
, and uint
which are being deprecated in the core SYCL spec because they don't mean what you would expect. I think you probably want to change these to unsigned char
, unsigned short
, and unsigned int
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes to both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added missing types and tidied up the syntax on the vector usage. This PR will need to stay in a draft mode for a wee bit longer, as we didn't have all the relevant builtins added in upstream llvm.
abc174b
to
f0ef805
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spec changes LGTM.
Are uchar3 and char3 equivalent to uchar4 and char4 ? |
|
Could the reviewers (cperkinsintel and intel llvm-reviewers) comment on the PR ? Quite a few programs need the extension for half-precision and char3 data type. Thanks. |
@gmlueck are you still happy with the patch, with this PR in
alongside their vectors of 2, 3 and 4 elements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, spec still looks good!
@intel/llvm-reviewers-runtime gentle ping, is there anything else that this PR needs? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little sad that we do so many seemingly repeated yet different branches, but I don't have a good alternative.
LGTM!
Thanks @steffenlarsen, had a go at macros, but it got counterproductive really quick. |
@jchlanda |
Hi @jinz2014 |
@intel/llvm-gatekeepers is this PR OK to merge in? |
No description provided.