[SYCL] Add fma_relu extension #5749

Closed
wants to merge 24 commits

Conversation

hdelan
Contributor

@hdelan commented Mar 7, 2022

This extension adds `fma_relu`, a fused multiply-add operation that returns `a * b + c > 0 ? a * b + c : 0`. `fma_relu` is implemented here only for the data types `sycl::half`, `bfloat16` (using storage type `uint16_t`) and `bfloat16x2` (using storage type `uint32_t`).
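For illustration, a minimal usage sketch (not part of the extension text); the namespace, header, and USM setup below are assumptions made for the example:

```c++
#include <sycl/sycl.hpp>

// Hypothetical use of the proposed free function for sycl::half; the exact
// namespace is an assumption, see the extension document added in this PR.
void half_fma_relu(sycl::queue &q, const sycl::half *a, const sycl::half *b,
                   const sycl::half *c, sycl::half *out, size_t n) {
  // a, b, c and out are assumed to be USM device allocations of n elements.
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> idx) {
     size_t i = idx[0];
     out[i] = sycl::ext::oneapi::experimental::fma_relu(a[i], b[i], c[i]);
   }).wait();
}
```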

This PR depends on:

- Intel PR: #5724
- Upstream patch: https://reviews.llvm.org/D118977
- Upstream patch: https://reviews.llvm.org/D116673

Merged extension: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_intel_bf16_conversion.asciidoc

Tests PR: intel/llvm-test-suite#898

@hdelan requested review from a team and bader as code owners March 7, 2022 16:14
@hdelan requested a review from againull March 7, 2022 16:14
@hdelan changed the title from "[SYCL][] Adding fma_relu extension" to "[SYCL] Adding fma_relu extension" Mar 7, 2022
@hdelan changed the title from "[SYCL] Adding fma_relu extension" to "[SYCL] Add fma_relu extension" Mar 7, 2022
bader
bader previously approved these changes Mar 9, 2022
Contributor

@bader left a comment

libclc changes look good to me.

againull
againull previously approved these changes Mar 11, 2022
@bader
Contributor

bader commented Mar 11, 2022

@hdelan, please resolve the merge conflicts and pre-commit failures.

@hdelan dismissed stale reviews from againull and bader via 37a18d7 March 11, 2022 12:18
@hdelan
Contributor Author

hdelan commented Mar 11, 2022

> @hdelan, please resolve the merge conflicts and pre-commit failures.

Should be fixed now.

@hdelan requested a review from againull March 11, 2022 14:59
bader
bader previously approved these changes Mar 11, 2022
@bader
Contributor

bader commented Mar 11, 2022

@intel/dpcpp-specification-reviewers, ping.

```c++
template <typename T>
T fma_relu(T a, T b, T c);
}
```
Contributor

Wouldn't it make more sense for this function to take the bfloat16 or bfloat16x2 types themselves rather than uint16_t and uint32_t?

Also a nit about the organization of this spec ... the "Specification" section below is the formal specification of your extension. The description of the fma_relu function should be there, not in the "Overview" section.

Contributor Author

> Wouldn't it make more sense for this function to take the bfloat16 or bfloat16x2 types themselves rather than uint16_t and uint32_t?

I am following the convention used by these bfloat16 PRs (#5748, #5724), which use uint16_t and uint32_t as storage types. Perhaps this mention of storage types doesn't belong in this document. Should I remove it?

> The description of the fma_relu function should be there, not in the "Overview" section.

Thanks, I have moved that into the "Specification" section.
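For readers following the storage-type discussion, here is a minimal sketch (standard C++ only, helper names made up for the example) of what the `uint16_t` storage convention represents: a bfloat16 value is just the upper 16 bits of an IEEE-754 `float`.

```c++
#include <cstdint>
#include <cstring>

// Pack a float into bfloat16 storage bits by keeping the upper 16 bits.
// (A real implementation would round to nearest; truncation keeps the sketch short.)
inline uint16_t float_to_bf16_bits(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return static_cast<uint16_t>(bits >> 16);
}

// Expand bfloat16 storage bits back to float by zero-filling the low mantissa bits.
inline float bf16_bits_to_float(uint16_t h) {
  uint32_t bits = static_cast<uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}
```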

Contributor

I talked with @dkhaldi about the Matrix API, and she says they will add APIs that take the bfloat16 type soon, but they will also keep the uint16_t versions for a transition period. Does it make sense to add bfloat16 versions of fma_relu to this PR, or will you do that in a subsequent one?

Contributor

> I talked with @dkhaldi about the Matrix API, and she says they will add APIs that take the bfloat16 type soon, but they will also keep the uint16_t versions for a transition period. Does it make sense to add bfloat16 versions of fma_relu to this PR, or will you do that in a subsequent one?

Good point, cc @hdelan. We should be able to add bfloat16 implementations of the fma_relu functions in this PR provided that #5393 is merged. We do want the bfloat16x2 cases too, but that will first require defining a bfloat16x2 class and updating the extension doc, analogous to bfloat16 in #5393, so the corresponding bfloat16x2 implementations will probably be done in a separate PR. For the joint_matrix API and the other bfloat16 math builtins (fabs, fma, fmin, fmax), the uint16_t implementations are already merged and we are already working on follow-up PRs for the corresponding bfloat16 implementations.

Removed aspect reference: can be added once the ext_oneapi_bfloat16 aspect is merged.
@hdelan requested review from bader and gmlueck March 28, 2022 11:20
bader
bader previously approved these changes Mar 28, 2022
Contributor

@bader left a comment

libclc changes look good to me.

```asciidoc
== Overview

This extension introduces the `fma_relu` function for datatypes `sycl::half`,
`bfloat16` and `bfloat16x2`. `bfloat16` and `bfloat16x2` refer to the bfloat16
```
Contributor

I think this came up in another review also, but I forget which one. There is no bfloat16x2 type defined currently in sycl_ext_*_bf16_conversion (soon to be renamed sycl_ext_oneapi_bfloat16).

Contributor Author

These changes have been made to the doc. fma_relu now accepts the bfloat16 class instead of uint16_t. The bfloat16x2 version still takes uint32_t as a storage type, but the doc explains that this will change once the bfloat16x2 class has been implemented as an extension.

@hdelan requested review from gmlueck and JackAKirk April 4, 2022 14:24
hdelan and others added 4 commits April 4, 2022 15:44

```c++
// Available when T is sycl::half, uint16_t (bfloat16) or uint32_t (bfloat16x2)
template <typename T>
T fma_relu(T a, T b, T c);
```
Contributor

As part of extending the math functions, you are already adding bfloat16/half variants of fma, fmax, etc. What would be the benefit of adding a specific fma_relu over doing an fma followed by fmax(res, 0)?
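For reference, a minimal sketch of the composition being asked about, written for `sycl::half` with the core SYCL builtins (it expresses the same math, but carries no explicit fusion hint for the backend):

```c++
#include <sycl/sycl.hpp>

// relu(a * b + c) expressed as an fma followed by fmax.
inline sycl::half fma_then_relu(sycl::half a, sycl::half b, sycl::half c) {
  sycl::half r = sycl::fma(a, b, c);       // fused multiply-add from the core library
  return sycl::fmax(r, sycl::half{0.0f});  // clamp negative results to zero
}
```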

This fma_relu extension introduces two big "new" territories to DPC++:

1. Introducing ML activation functions to DPC++: the issue is that these functions are numerous, both the ones we know of and the ones we don't know about yet. Is the expectation to keep adding them as free functions in DPC++? relu is an easy one that can be written using max. What about the others? Why is relu so special here?

2. Introducing fusions to DPC++: fma_relu tells the compiler that these two operations can be fused together. While this can be important in libraries, is it really necessary for DPC++? DPC++ has a compiler that can detect that a relu (or other function) follows an fma and trigger the fusion the user intended.

One other open question: if we end up deciding to have these very ML-specific functions in DPC++, what objects should they operate on? Scalar, vector, marray? Why is the only vector type here bfloat16x2? Should this be put under the joint_matrix umbrella as another potential tensor-hardware-accelerated function?

Contributor Author

@hdelan commented Apr 7, 2022

These are valid points.

The primary benefit of this sort of extension is that it allows users to concisely target builtins specific to a particular backend. Since the fma_relu function is in the CUDA math headers, we think it is appropriate to have it in DPC++ as well, for ease of porting code etc. It is our feeling that since this extension targets just the CUDA backend, it will always be an extension and will not enter the core spec libraries. A DPC++ extension should (as much as possible) give users access to all of the functionality of the backend API, but not necessarily more. Therefore we do not need to be concerned about making fma_relu work for other backends (unless they also have a similar builtin to target).

The question of fusions is an interesting one, and something we will discuss a bit internally. Perhaps in the long run this is the approach that will be used in some instances.

The objects that use the function should be scalar and vector. The reason bfloat16 has not been vectorized is that the vector types for the bfloat16 class have not been implemented yet. Once they are implemented we will add the bfloat16 vector versions of this function. bfloat16x2 is vectorized because we are relying on an older implementation of bf16x2 which uses uint32_t as its storage type.

For the time being, however, we are interested in representing backend-specific features in DPC++, and since this feature is exposed to the user as a free function in the CUDA headers, we think that is reason enough to bring it into DPC++ as an extension.

Contributor

Can you share a link to the CUDA math headers that contain the full list of math/ML functions?

Contributor Author

I can't find a link to the headers online, but you can find __hfma_relu in any regular install of CUDA 11.6:

```
/usr/local/cuda-11.6 $ grep "fma_relu" * -HnrI
include/cuda_bf16.h:3216:__CUDA_BF16_DECL__ __nv_bfloat16 __hfma_relu(const __nv_bfloat16 a, const __nv_bfloat16 b, const __nv_bfloat16 c);
include/cuda_bf16.hpp:2142:__CUDA_BF16_DECL__ __nv_bfloat16 __hfma_relu(const __nv_bfloat16 a, const __nv_bfloat16 b, const __nv_bfloat16 c)
include/cuda_fp16.hpp:2453:__CUDA_FP16_DECL__ __half __hfma_relu(const __half a, const __half b, const __half c)
include/cuda_fp16.h:3251:__CUDA_FP16_DECL__ __half __hfma_relu(const __half a, const __half b, const __half c);
targets/x86_64-linux/include/cuda_bf16.h:3216:__CUDA_BF16_DECL__ __nv_bfloat16 __hfma_relu(const __nv_bfloat16 a, const __nv_bfloat16 b, const __nv_bfloat16 c);
targets/x86_64-linux/include/cuda_bf16.hpp:2142:__CUDA_BF16_DECL__ __nv_bfloat16 __hfma_relu(const __nv_bfloat16 a, const __nv_bfloat16 b, const __nv_bfloat16 c)
targets/x86_64-linux/include/cuda_fp16.hpp:2453:__CUDA_FP16_DECL__ __half __hfma_relu(const __half a, const __half b, const __half c)
targets/x86_64-linux/include/cuda_fp16.h:3251:__CUDA_FP16_DECL__ __half __hfma_relu(const __half a, const __half b, const __half c);
```

Contributor Author

What do you think the approach should be with these functions?

Should we:

  1. Implement bfloat16 versions of the core SYCL math functions. In the meantime, we could make sure that when, for instance, exp(my_bf16) is called, the argument is cast to float to ensure correctness until the bfloat16 specialization of exp is fully implemented (see the sketch after this list). The problem is that there are a lot of core math functions, and there might not be a clear distinction between those that have a native bf16 version and those that rely on casting. Functions that are not in the core SYCL math library could be added as extensions, as is the case for fma_relu.
  2. Do what CUDA does and add new free functions specifically catering to bf16, like hexp for instance (CUDA uses the same functions for bf16 and half). This means introducing more and more functions, and the list is likely to keep growing.
  3. Don't try to add support for these bf16 functions.
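A minimal sketch of option 1 above, assuming the bfloat16 class from the sycl_ext_oneapi_bfloat16 conversion extension with its float conversions; the header path, namespace, and function name here are assumptions for illustration:

```c++
#include <sycl/sycl.hpp>
// Assumed header for the bfloat16 class from the conversion extension.
#include <sycl/ext/oneapi/experimental/bfloat16.hpp>

using bfloat16 = sycl::ext::oneapi::experimental::bfloat16;

// Fallback exp for bfloat16: promote to float, call the core SYCL builtin,
// then narrow the result back. A native bf16 specialization could replace
// this later without changing call sites.
inline bfloat16 bf16_exp(bfloat16 x) {
  return bfloat16{sycl::exp(static_cast<float>(x))};
}
```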

It is worth noting that not all of the functions listed above have their own builtins, but it seems that all of them produce far less PTX than their float counterparts, so it would be worthwhile to expose these special bf16 functions in some way.

The reason we added fma_relu is so that users can target the corresponding PTX builtin. We did this relatively blindly because we thought it was a good idea to have access to all PTX builtins, which we still consider the right approach.

steffenlarsen pushed a commit that referenced this pull request Jun 30, 2022
This PR introduces full support for element-wise operations in the CUDA backend. `wi_data`, `get_matrix_fill`, and `joint_matrix.get_wi_data()` are introduced for portability with the Intel backend. In addition, in the CUDA backend users can call `joint_matrix.wi_marray` to access the marray that stores the WI-owned elements of the matrix and perform optimized element-wise operations using math functions that take marrays.
Support for bfloat16 element-wise operations is also included: this PR adds bfloat16 scalar/marray implementations replacing the existing uint16_t "storage type" implementations of the fma, fmax, fmin, and fabs math functions. The bfloat16 fma_relu implementation has now been added directly in #5749.
The existing temporary uint16_t implementations (introduced in #5748 with unmerged tests intel/llvm-test-suite#897) have been removed, since these bfloat16 implementations replace them.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@github-actions bot added the Stale label Oct 5, 2022
@github-actions bot closed this Nov 5, 2022