
generic: sycl: Vanilla RNN FWD l2r #2236

Open · wants to merge 1 commit into base: main

Conversation

ShanoToni (Contributor)

Description

Adding implementation for generic RNN Vanilla forward left-to-right.
(Note: this PR serves as the first part of a generic RNN implementation; additional functionality will be added in follow-up PRs.)

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

@ShanoToni ShanoToni requested review from a team as code owners November 29, 2024 16:15
@github-actions github-actions bot added platform:gpu-nvidia Codeowner: @oneapi-src/onednn-gpu-nvidia platform:gpu-generic Codeowner: @oneapi-src/onednn-gpu-generic labels Nov 29, 2024
@ShanoToni ShanoToni self-assigned this Nov 29, 2024
@ShanoToni ShanoToni force-pushed the rnn_fwd_l2r branch 2 times, most recently from 5383baf to 76f6ce0 Compare December 2, 2024 12:58
@@ -124,6 +124,7 @@ std::unique_ptr<memory_storage_t> buffer_memory_storage_t::clone() const {

storage->buffer_ = buffer_;
storage->base_offset_ = base_offset_;
storage->offset_ = this->offset_;
@densamoilov (Contributor) commented on Dec 3, 2024:
Can we use the offset() method to get it without moving the offset_ field from private to protected?

@rjoursler, you seem to have used the offset in the RNN implementation. Can you please clarify the semantics of offset_ in memory storage? To me, it seems that the offset is meant to be used for splitting a memory storage into separate logical parts depending on how/where it is being used, so carrying over the offset when cloning the memory storage seems weird because the meaning of the offset is context dependent.

Contributor replied:

The meaning used here is that offset is always relative to the base memory object. The reason for this is to align with subbuffer behaviors in the runtimes. For example in OpenCL,

buffer must be a valid buffer object and cannot be a sub-buffer object.

And in SYCL

b is the buffer with the real data, which must not be a sub-buffer.

Contributor replied:

The meaning used here is that offset is always relative to the base memory object.

Why can't we use a sub-storage for that?

Contributor replied:

Are you suggesting to use sub-buffers?
From my experience using sub-buffers is not as simple as using an offset. The main issue I have faced is that some backends require sub-buffers to be aligned to some size larger than the buffer's data type. From the OpenCL link above:

CL_MISALIGNED_SUB_BUFFER_OFFSET if there are no devices in context associated with buffer for which the origin field of the cl_buffer_region structure passed in buffer_create_info is aligned to the CL_DEVICE_MEM_BASE_ADDR_ALIGN value.

@rjoursler (Contributor) commented on Dec 9, 2024:

Why can't we use a sub-storage for that?

Potentially, we could use sub-storage for that, but it would require some changes. Currently, the get_sub_storage interface is synonymous with generating sub-buffer runtime objects and, as @Rbiessy mentioned, the problem is that these sub-buffers have restrictions on nesting and alignment. Due to the nature of RNN, it is common to need different ways of splitting the memory into sub-objects. For example, the workspace which corresponds to a single memory allocation by our API actually corresponds to multiple buffers such as gates and states. In addition, each of those buffers may be further subdivided into cells or layers based on what computation is being performed.

To handle this, we need a more robust sub_storage interface which allows arbitrary data alignment, ensures no nested sub-buffers occur, and gives control over when sub-buffers are created. The sub_buffer_t structure I introduced in gpu/intel/ocl/rnn/rnn_utils.hpp (and which is duplicated by this PR), was intended to handle this complexity within the Intel GPU RNN implementation. As this implementation needs similar support, merging the functionality into memory_storage_t would make sense.
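The splitting scheme described above can be sketched as a lightweight view type that always tracks its offset against the root allocation (a hypothetical simplification for illustration, not the actual sub_buffer_t from gpu/intel/ocl/rnn/rnn_utils.hpp):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch, not the real sub_buffer_t: a view that records its
// byte offset relative to the ROOT allocation, so deriving a view from a
// view never creates a nested runtime sub-buffer.
struct sub_view_t {
    void *root;         // root allocation; never itself a sub-buffer
    std::size_t offset; // byte offset relative to root

    // Offsets accumulate against the same root allocation.
    sub_view_t sub(std::size_t extra_bytes) const {
        return {root, offset + extra_bytes};
    }

    // Resolve to a raw pointer only at the point of use.
    std::uint8_t *get() const {
        return static_cast<std::uint8_t *>(root) + offset;
    }
};
```

With such a type, the workspace can be split into gates/states and further into cells or layers without ever asking the runtime for a sub-buffer object.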

Contributor replied:

To handle this, we need a more robust sub_storage interface which allows arbitrary data alignment, ensures no nested sub-buffers occur, and gives control over when sub-buffers are created.

memory_storage_t::get_sub_storage() takes an offset and creates a sub-storage that is essentially a sub-buffer. It is always created from the root buffer, so we don't create nested sub-buffers. As for the alignment issue, I'm not sure I understand it; when do we run into the issue?

Contributor replied:

I think I need to clarify: the restriction I was referring to applies to OpenCL sub-buffers (and SYCL sub-buffers that may use OpenCL under the hood), which are created when using memory_storage_t::get_sub_storage(). On PVC, for instance, querying CL_DEVICE_MEM_BASE_ADDR_ALIGN returns 1024 bits. This means that if the root buffer is at address 0x00000400 (already aligned to 1024), the only possible offsets are multiples of 1024 bits, i.e. multiples of 32 elements when using fp32. This gets more complicated if the root buffer is not aligned and when supporting more data types. In general, it is much easier to track the offset as a separate variable.

I agree with the first comment to use a getter method and Anton is planning to look into merging the functionality of sub_buffer_t into the main memory_storage_t object.
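The alignment arithmetic in the comment above can be made concrete (a sketch only; 1024 bits and fp32 are the example values quoted, not universal constants):

```cpp
#include <cstddef>

// Sketch of the alignment math above: with CL_DEVICE_MEM_BASE_ADDR_ALIGN
// reporting 1024 bits, a sub-buffer origin must be a multiple of
// 1024 / 8 = 128 bytes, i.e. 32 fp32 elements.
constexpr std::size_t sub_buffer_align_elems(
        std::size_t base_addr_align_bits, std::size_t elem_size_bytes) {
    return base_addr_align_bits / 8 / elem_size_bytes;
}

// Round an element offset up to the nearest legal sub-buffer origin.
constexpr std::size_t round_up(std::size_t elems, std::size_t align) {
    return (elems + align - 1) / align * align;
}
```

So an fp32 offset of 5 elements would have to be pushed to 32, which is exactly why a separately tracked offset is simpler than a runtime sub-buffer.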

Contributor replied:

when do we run into the issue?

All of the calls generating sub_buffer_t objects from the user_data_t structure here may be unaligned depending on the layout and problem dimensions submitted to oneDNN.

Author replied:

The ref matmul has now removed the use of sub_buffer and sub_storage; instead we leverage a new clone_ptr_off method on memory_storage_t. This new function returns a shallow copy with an additional offset applied to the memory storage. Since the offset was not previously used by the buffer and USM memory storages, there is a proposed change for get_memory_arg to return a ranged accessor instead of a plain accessor. To apply the offset, the get_pointer function used to obtain memory inside the kernel offsets the pointer by the accessor's offset (multi_ptr does not take the accessor's offset into account by default). This is similar to what get_native_pointer does, which adds the offset of the memory_storage it was created with. In this way we can get the required memory without needing to create a separate sub_buffer.
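The clone_ptr_off idea can be sketched as follows (hypothetical types and signatures for illustration; the real memory_storage_t differs):

```cpp
#include <cstddef>
#include <memory>

// Hypothetical sketch of the clone_ptr_off idea, not the oneDNN API: a
// shallow clone that shares the underlying buffer and accumulates an
// offset, instead of creating a runtime sub-buffer.
struct storage_t {
    std::shared_ptr<float[]> data; // shared underlying buffer
    std::size_t offset = 0;        // element offset into data

    std::unique_ptr<storage_t> clone_ptr_off(std::size_t extra) const {
        auto copy = std::make_unique<storage_t>(*this); // shallow copy
        copy->offset += extra;                          // accumulate offset
        return copy;
    }

    // Mirrors the get_pointer behavior described above: the offset is
    // applied only when handing out the raw pointer.
    float *get_pointer() const { return data.get() + offset; }
};
```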

#include <memory>
#include "gpu/intel/utils.hpp"

#define DPRINT(fmt, ...) \
Contributor replied:

I assume this is meant to be removed

Author replied:

This is actually used for debugging purposes only, when the debug mode is set. It is also present in the OCL implementation, so I believe it can remain.

Contributor replied:

This adds branching, and it can affect the generated code, as resources need to be allocated in case the kernel needs to print something. We can keep the debugging macro, but I think it should be enabled only when building in debug mode; DPRINT can be defined to be empty when building in release mode.

@rjoursler (Contributor) commented on Jan 28, 2025:

FYI, get_verbose_dev_mode is effectively a compile time constant in a header (see here and here). As such, there shouldn't be any branching in Release builds.

This macro should likely be replaced with VDEBUGINFO, though. VDEBUGINFO is not well set up for the multi-line logging done with DPRINT due to its long prefix, so this may require some work to align the logging with the VDEBUGINFO layout.
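The debug-only macro pattern being discussed can be sketched like this (names are illustrative, not oneDNN's actual macros):

```cpp
#include <cstdio>

// Illustrative debug-print macro: expands to a real printf only in debug
// builds and to a no-op in release builds (NDEBUG defined), so it cannot
// introduce branching or extra resource allocation in release kernels.
#ifndef NDEBUG
#define RNN_DPRINT(...) std::printf(__VA_ARGS__)
#else
#define RNN_DPRINT(...) ((void)0)
#endif
```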

@ShanoToni ShanoToni force-pushed the rnn_fwd_l2r branch 2 times, most recently from 4151b5b to 746808b Compare December 10, 2024 13:29
@github-actions github-actions bot removed the platform:gpu-nvidia Codeowner: @oneapi-src/onednn-gpu-nvidia label Dec 20, 2024
@ShanoToni ShanoToni force-pushed the rnn_fwd_l2r branch 2 times, most recently from 428f9fc to 3aacdf3 Compare January 17, 2025 14:37
@ShanoToni ShanoToni force-pushed the rnn_fwd_l2r branch 2 times, most recently from d69ed38 to 0496586 Compare January 27, 2025 16:12
@AD2605 (Contributor) commented on Jan 28, 2025:

I think we would also need to update the README by mentioning that the SYCL backend now supports RNN L2R.

@AD2605 (Contributor) commented on Jan 28, 2025:

I think we would also need to add a SKIP_IF_GENERIC in the GTests for other RNN configurations, such as right-to-left, since this PR only adds left-to-right?

dim_t t = type_size(conf_.wei_layer_type);
dim_t lay_stride = offsets_.weights_layer[0];
dim_t dir_stride = offsets_.weights_layer[1];
dim_t offset = (lay * lay_stride + dir * dir_stride) * t;
@rjoursler (Contributor) commented on Jan 28, 2025:

FYI, if the memory_descriptor_t structures were stored in user_data_t, rather than just memory_storage_t, we could use memory_desc_wrapper::off to perform offset calculations.
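For a plain strided layout, the offset calculation that memory_desc_wrapper::off would replace reduces to a dot product of logical indices with strides, as in the snippet above (a simplified sketch; real oneDNN descriptors also handle blocked layouts):

```cpp
#include <array>
#include <cstddef>

// Simplified sketch of a strided offset calculation: the linear element
// offset of a logical index is the dot product of indices and strides.
// This covers only the dense strided case, not blocked layouts.
template <std::size_t N>
std::size_t strided_off(const std::array<std::size_t, N> &idx,
        const std::array<std::size_t, N> &strides) {
    std::size_t off = 0;
    for (std::size_t d = 0; d < N; ++d)
        off += idx[d] * strides[d];
    return off;
}
```

Keeping the computation behind the descriptor wrapper avoids hand-maintaining per-tensor stride arrays like the offsets_ struct.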

Author replied:

Great suggestion, added the memory_desc_wrapper and removed offsets struct.

@rjoursler (Contributor) commented on Jan 30, 2025:

Thanks! Looking at the change, you should be able to remove the conf structure as well; it seems to be used only for accessing the data type, but the data type is contained in the memory_desc_wrapper. You can directly access the size via:

/** return the size of data type (a shortcut) */
size_t data_type_size() const { return types::data_type_size(data_type()); }

Another minor note on naming style: using a _mdw suffix is the standard method to denote a memory_desc_wrapper, rather than the _wrap suffix used here.

@ShanoToni ShanoToni requested a review from a team as a code owner January 29, 2025 14:00
@github-actions github-actions bot added documentation A request to change/fix/improve the documentation. Codeowner: @oneapi-src/onednn-doc component:tests Codeowner: @oneapi-src/onednn-arch labels Jan 29, 2025
@ShanoToni ShanoToni force-pushed the rnn_fwd_l2r branch 4 times, most recently from 6e81137 to c2392c4 Compare January 30, 2025 11:50
@ShanoToni (Author) commented:

make test
disable build_vendor_intel
disable build_vendor_intel
disable test_device_cpu
enable build_vendor_generic
enable arch_gpu_ampere
enable compiler_icx-oss

Labels: component:tests, documentation, platform:gpu-generic
5 participants