-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
generic: sycl: Vanilla RNN FWD l2r #2236
base: main
Are you sure you want to change the base?
Conversation
5383baf
to
76f6ce0
Compare
@@ -124,6 +124,7 @@ std::unique_ptr<memory_storage_t> buffer_memory_storage_t::clone() const { | |||
|
|||
storage->buffer_ = buffer_; | |||
storage->base_offset_ = base_offset_; | |||
storage->offset_ = this->offset_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use offset()
method to get it without moving offset_
field from private to protected?
@rjoursler, you seem to used the offset in the RNN implementation. Can you please clarify the semantics of the offset_
in memory storage? To me, it seems that the offset is meant to be used for splitting a memory storage into separate logical parts depending on how/where it's being used so carrying over the offset when cloning the memory storage seems weird because the meaning of the offset is context dependent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The meaning used here is that offset is always relative to the base memory object. The reason for this is to align with subbuffer behaviors in the runtimes. For example in OpenCL,
buffer must be a valid buffer object and cannot be a sub-buffer object.
And in SYCL
b is the buffer with the real data, which must not be a sub-buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The meaning used here is that offset is always relative to the base memory object.
Why can't we use a sub-storage for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting to use sub-buffers?
From my experience using sub-buffers is not as simple as using an offset. The main issue I have faced is that some backends require sub-buffers to be aligned to some size larger than the buffer's data type. From the OpenCL link above:
CL_MISALIGNED_SUB_BUFFER_OFFSET if there are no devices in context associated with buffer for which the origin field of the cl_buffer_region structure passed in buffer_create_info is aligned to the CL_DEVICE_MEM_BASE_ADDR_ALIGN value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we use a sub-storage for that?
Potentially, we could use sub-storage for that, but it would require some changes. Currently, the get_sub_storage
interface is synonymous with generating sub-buffer runtime objects and, as @Rbiessy mentioned, the problem is that these sub-buffers have restrictions on nesting and alignment. Due to the nature of RNN, it is common to need different ways of splitting the memory into sub-objects. For example, the workspace which corresponds to a single memory allocation by our API actually corresponds to multiple buffers such as gates and states. In addition, each of those buffers may be further subdivided into cells or layers based on what computation is being performed.
To handle this, we need a more robust sub_storage interface which allows arbitrary data alignment, ensures no nested sub-buffers occur, and gives control over when sub-buffers are created. The sub_buffer_t structure I introduced in gpu/intel/ocl/rnn/rnn_utils.hpp
(and which is duplicated by this PR), was intended to handle this complexity within the Intel GPU RNN implementation. As this implementation needs similar support, merging the functionality into memory_storage_t
would make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To handle this, we need a more robust sub_storage interface which allows arbitrary data alignment, ensures no nested sub-buffers occur, and gives control over when sub-buffers are created.
memory_storage_t::get_sub_storage()
takes an offset and creates a a sub-storage that is essentially a sub-buffer that is always created from the root buffer so we don't create nesting sub-buffers. As for the alignment issue, I'm no sure I understand it, when do we run into the issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I need to clarify the restriction I was referring to apply to OpenCL sub-buffers (and SYCL sub-buffers that may use OpenCL under the hood) which are created when using memory_storage_t::get_sub_storage()
. On PVC for instance querying CL_DEVICE_MEM_BASE_ADDR_ALIGN
returns 1024 bits. This means that if for instance the root buffer is at address 0x00000400 (aligned to 1024 already) the only possible offsets are multiples of 1024 bits i.e. multiples of 32 elements when using fp32. This gets more complicated if the root buffer is not aligned and when supporting more data types. In general it is much easier to track the offset as a separate variable.
I agree with the first comment to use a getter method and Anton is planning to look into merging the functionality of sub_buffer_t
into the main memory_storage_t
object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when do we run into the issue?
All of the calls generating sub_buffer_t
objects from the user_data_t
structure here may be unaligned depending on the layout and problem dimensions submitted to oneDNN.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ref matmul has now removed use of sub_buffer
and sub_storage, instead now we levarage a new clone_ptr_off
of the memory_storage_t
.
This new function returns a shallow copy with the addition of offset of the memory storage.
The offset was not used currently by the buffer and usm memory storages from get_memory_arg
there is a proposed change to return a ranged accessor instead of an accessor.
To get the offset the get_pointer
func to get the memory from inside the kernel offsets the memory by the accessors offset (multi_ptr does not take into accout the offset of the accessor by default). This is similar to what get_native_pointer
does, adding the offset of the memory_storage it was created with.
In this way we can get the required memory without the need to create a separate sub_buffer.
#include <memory> | ||
#include "gpu/intel/utils.hpp" | ||
|
||
#define DPRINT(fmt, ...) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is meant to be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually used for debugging purposes only when the debug mode is set, which is also present for the ocl implementation, I believe it can remain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This adds branching and it can affect the generated code as ressources need to be allocated in case the kernel needs to print something. We can keep the debugging macro but I think it should be enabled only when building in debug mode. DPRINT
can be defined to be empty when building in release mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, get_verbose_dev_mode
is effectively a compile time constant in a header (see here and here). As such, there shouldn't be any branching in Release builds.
This macro should likely be replaced with VDEBUGINFO though. VDEBUGINFO
is not well setup for multi-line logging due to the long prefix as is done with DPRINT, so this may require work to align logging with the VDEBUGINFO layout.
@@ -124,6 +124,7 @@ std::unique_ptr<memory_storage_t> buffer_memory_storage_t::clone() const { | |||
|
|||
storage->buffer_ = buffer_; | |||
storage->base_offset_ = base_offset_; | |||
storage->offset_ = this->offset_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting to use sub-buffers?
From my experience using sub-buffers is not as simple as using an offset. The main issue I have faced is that some backends require sub-buffers to be aligned to some size larger than the buffer's data type. From the OpenCL link above:
CL_MISALIGNED_SUB_BUFFER_OFFSET if there are no devices in context associated with buffer for which the origin field of the cl_buffer_region structure passed in buffer_create_info is aligned to the CL_DEVICE_MEM_BASE_ADDR_ALIGN value.
4151b5b
to
746808b
Compare
746808b
to
cc274d4
Compare
cc274d4
to
9826a50
Compare
428f9fc
to
3aacdf3
Compare
d69ed38
to
0496586
Compare
I think we would also need to update the README by mentioning the SYCL backend now supports RNN L2R |
I think we would also need to a |
dim_t t = type_size(conf_.wei_layer_type); | ||
dim_t lay_stride = offsets_.weights_layer[0]; | ||
dim_t dir_stride = offsets_.weights_layer[1]; | ||
dim_t offset = (lay * lay_stride + dir * dir_stride) * t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, if the memory_descriptor_t
structures were stored in user_data_t
, rather than just memory_storage_t
, we could use memory_desc_wrapper::off to perform offset calculations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion, added the memory_desc_wrapper
and removed offsets struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Looking at the change, you should be able to remove the conf
structure as well, it seems like it is only used for accessing the data type, but the data type is contained in the memory_desc_wrapper
. You can directly access the size via:
oneDNN/src/common/memory_desc_wrapper.hpp
Lines 138 to 139 in 8ec3240
/** return the size of data type (a shortcut) */ | |
size_t data_type_size() const { return types::data_type_size(data_type()); } |
Another minor note on naming style: using a _mdw
suffix is the standard method to denote a memory_desc_wrapper
, rather than the _wrap
suffix used here.
0496586
to
a4d5131
Compare
6e81137
to
c2392c4
Compare
c2392c4
to
95f2edb
Compare
make test |
Description
Adding implementation for generic RNN Vanilla forward left-to-right.
(Note: This PR is to serve as the first part of adding a generic RNN implementation which would support additional functionality with following PRs)
Checklist
General
make test
andmake test_benchdnn_*
) pass locally for each commit?