-
Notifications
You must be signed in to change notification settings - Fork 769
[SYCL] Fix memory dependency leaks caused by failed kernel enqueue #5120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL] Fix memory dependency leaks caused by failed kernel enqueue #5120
Conversation
If a kernel enqueue fails the runtime will immediately try and clean it up. However, if it has any dependencies or users the cleanup will be skipped. This can cause the dependencies to stay alive and leak. These changes forces a full sub-graph cleanup of the command if enqueuing failed. Additionally, sub-graph cleanup is changed to account for failed kernel enqueues and will remove the failed command from its leaves. Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
@sergey-semenov Could you please take a look? |
Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
/summary:run |
I have converted this PR to a draft while I do further investigation into a more robust solution. |
This commit adds a new strategy for cleaning up failed commands, namely to walk up the failed commands and its users and replace them by empty commands. This preserves the structure of the graph while replacing failed state. Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
The new commit changes the strategy.
@romanovvlad | @dm-vodopyanov | @sergey-semenov - Thoughts on this strategy? Additional things to consider:
Once a final solution has been found, I will document it in the relevant headers. |
Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
Sounds legit.
I can't get what is an auxiliary command. |
Should there be any policy for back-off enqueue of failed command or it's user's responsibility to restore and re-enqueue required kernels/commands? |
Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
For simplicity, this is not implemented. I am not convinced it is the right approach to assume all users should be deleted to or if it is actually ever applicable.
|
Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
@sergey-semenov @dm-vodopyanov - I believe this is ready for review. |
@dm-vodopyanov & @sergey-semenov - Friendly ping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review, the changes LGTM.
if (FailedCmd->getType() == Command::CommandType::RUN_CG) { | ||
auto ExecCmd = static_cast<ExecCGCommand *>(FailedCmd); | ||
std::vector<std::shared_ptr<stream_impl>> Streams = ExecCmd->getStreams(); | ||
ExecCmd->clearStreams(); | ||
StreamsToDeallocate.insert(StreamsToDeallocate.end(), Streams.begin(), | ||
Streams.end()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do similar handling for reduction resources here once #5653 lands?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, absolutely! I will make sure it is included into whichever is merged last.
Due to performance regressions and test failures in post-commit, these changes have been reverted. I will revisit this shortly. |
…ntel#5120) If a kernel enqueue fails the runtime will immediately try and clean it up. However, if it has any dependencies or users the cleanup will be skipped. This can cause the dependencies to stay alive and leak. These changes forces a full sub-graph cleanup of the command if enqueuing failed. Additionally, sub-graph cleanup is changed to account for failed kernel enqueues and will remove the failed command from its leaves.
If a kernel enqueue fails the runtime will immediately try and clean it up. However, if it has any dependencies or users the cleanup will be skipped. This can cause the dependencies to stay alive and leak. These changes forces a full sub-graph cleanup of the command if enqueuing failed. Additionally, sub-graph cleanup is changed to account for failed kernel enqueues and will remove the failed command from its leaves.