[SYCL] Fix memory dependency leaks caused by failed kernel enqueue #5120

Merged
merged 12 commits into intel:sycl on Mar 2, 2022

Conversation

steffenlarsen
Contributor

If a kernel enqueue fails, the runtime will immediately try to clean it up. However, if the command has any dependencies or users, the cleanup is skipped. This can cause the dependencies to stay alive and leak. These changes force a full sub-graph cleanup of the command if enqueuing failed. Additionally, sub-graph cleanup is changed to account for failed kernel enqueues and will remove the failed command from its leaves.
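
For illustration, here is a minimal, self-contained sketch of the intended behavior (Cmd, cleanupFailedCommand, and the leaf list below are hypothetical stand-ins, not the runtime's actual classes):

#include <algorithm>
#include <memory>
#include <vector>

// Simplified stand-in for a scheduler command; the real graph also tracks
// users, events, memory requirements, etc.
struct Cmd {
  std::vector<std::shared_ptr<Cmd>> Deps; // commands this one depends on
};

// Hypothetical helper: on a failed enqueue, drop the command from the leaf
// list and release its dependency edges so the dependencies can be freed.
void cleanupFailedCommand(const std::shared_ptr<Cmd> &Failed,
                          std::vector<std::shared_ptr<Cmd>> &Leaves) {
  Leaves.erase(std::remove(Leaves.begin(), Leaves.end(), Failed),
               Leaves.end());
  Failed->Deps.clear(); // without this, the dependencies stay alive and leak
}

int main() {
  auto Dep = std::make_shared<Cmd>();
  auto Failed = std::make_shared<Cmd>();
  Failed->Deps.push_back(Dep);

  std::vector<std::shared_ptr<Cmd>> Leaves{Failed};
  cleanupFailedCommand(Failed, Leaves); // e.g. after the enqueue reports an error
}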

If a kernel enqueue fails, the runtime will immediately try to clean it
up. However, if the command has any dependencies or users, the cleanup
is skipped. This can cause the dependencies to stay alive and leak.
These changes force a full sub-graph cleanup of the command if
enqueuing failed. Additionally, sub-graph cleanup is changed to account
for failed kernel enqueues and will remove the failed command from its
leaves.

Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
@romanovvlad
Contributor

@sergey-semenov Could you please take a look?

Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
@steffenlarsen
Contributor Author

/summary:run

@steffenlarsen marked this pull request as draft on December 14, 2021 08:27
@steffenlarsen changed the title from "[SYCL] Fixes memory dependency leaks caused by failed kernel enqueue" to "Draft: [SYCL] Fixes memory dependency leaks caused by failed kernel enqueue" on Dec 14, 2021
@steffenlarsen
Contributor Author

I have converted this PR to a draft while I do further investigation into a more robust solution.

This commit adds a new strategy for cleaning up failed commands, namely
to walk up the failed command and its users and replace them with empty
commands. This preserves the structure of the graph while removing the
failed state.

Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
@steffenlarsen
Contributor Author

steffenlarsen commented Dec 14, 2021

The new commit changes the strategy.

This commit adds a new strategy for cleaning up failed commands, namely
to walk up the failed command and its users and replace them with empty
commands. This preserves the structure of the graph while removing the
failed state.

@romanovvlad | @dm-vodopyanov | @sergey-semenov - Thoughts on this strategy?

Additional things to consider:

  1. Should the failed command and its users be merged into a single empty command? This command could have all the dependencies of the entire user-subgraph.
  2. What happens when an auxiliary command fails? Is it enough to clean up the primary command and its users? Alternatively, we could clean up the failed auxiliary command and all the ones that weren't enqueued after the first failure, then clean up the primary command.

Once a final solution has been found, I will document it in the relevant headers.
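
To make the proposed strategy concrete, here is a rough, self-contained sketch of the walk over the failed command and its users (Node, Empty, and replaceWithEmpty are illustrative names only, not the scheduler's actual types):

#include <queue>
#include <unordered_set>
#include <vector>

// Simplified graph node; a real command also owns a command group, events,
// memory requirements, etc.
struct Node {
  bool Empty = false;        // marks the node as replaced by an empty command
  std::vector<Node *> Users; // commands that depend on this one
};

// Mark the failed command and all of its transitive users as empty while
// keeping the edges of the graph intact.
void replaceWithEmpty(Node *Failed) {
  std::queue<Node *> Worklist;
  std::unordered_set<Node *> Seen;
  Worklist.push(Failed);
  while (!Worklist.empty()) {
    Node *Cur = Worklist.front();
    Worklist.pop();
    if (!Seen.insert(Cur).second)
      continue;
    Cur->Empty = true; // no-op stand-in that preserves the graph structure
    for (Node *User : Cur->Users)
      Worklist.push(User);
  }
}

int main() {
  Node A, B, C;         // A -> B -> C, i.e. B and C use A's result
  A.Users = {&B};
  B.Users = {&C};
  replaceWithEmpty(&A); // A failed to enqueue, so B and C become empty too
}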

Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
@s-kanaev
Contributor

Should the failed command and its users be merged into a single empty command? This command could have all the dependencies of the entire user-subgraph.

Sounds legit.

What happens when an auxiliary command fails? Is it enough to clean up the primary command and its users? Alternatively, we could clean up the failed auxiliary command and all the ones that weren't enqueued after the first failure, then clean up the primary command.

I don't understand what an auxiliary command is.

@s-kanaev
Contributor

Should there be any policy for back-off enqueue of a failed command, or is it the user's responsibility to restore and re-enqueue the required kernels/commands?

Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
@steffenlarsen
Contributor Author

Should the failed command and its users be merged into a single empty command? This command could have all the dependencies of the entire user-subgraph.

For simplicity, this is not implemented. I am not convinced it is the right approach to assume all users should be deleted too, or whether it is ever actually applicable.

What happens when an auxiliary command fails? Is it enough to clean up the primary command and its users? Alternatively, we could clean up the failed auxiliary command and all the ones that weren't enqueued after the first failure, then clean up the primary command.

I don't understand what an auxiliary command is.

addCG will first try to enqueue "auxiliary commands", and any of these may cause the failure as well.
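
For context, here is a minimal sketch of that ordering, with enqueueCG, Aux, and Primary as hypothetical stand-ins for what addCG does (the real code builds and enqueues command objects rather than callables):

#include <functional>
#include <vector>

// Hypothetical model: each command is a callable that returns false when its
// enqueue fails.
using Command = std::function<bool()>;

// Auxiliary commands are enqueued before the primary command, so a failure in
// any of them needs the same cleanup as a failed primary enqueue.
bool enqueueCG(const std::vector<Command> &Aux, const Command &Primary,
               const std::function<void()> &CleanupFailed) {
  for (const Command &C : Aux)
    if (!C()) {
      CleanupFailed(); // the failed command may be an auxiliary one
      return false;
    }
  if (!Primary()) {
    CleanupFailed();
    return false;
  }
  return true;
}

int main() {
  // The second auxiliary command fails, so cleanup runs and the primary
  // command is never enqueued.
  bool Done = enqueueCG({[] { return true; }, [] { return false; }},
                        [] { return true; }, [] {});
  return Done ? 0 : 1;
}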

Signed-off-by: Steffen Larsen <steffen.larsen@intel.com>
@steffenlarsen marked this pull request as ready for review on February 14, 2022 14:34
@steffenlarsen
Contributor Author

@sergey-semenov @dm-vodopyanov - I believe this is ready for review.

@steffenlarsen changed the title from "Draft: [SYCL] Fixes memory dependency leaks caused by failed kernel enqueue" to "[SYCL] Fixes memory dependency leaks caused by failed kernel enqueue" on Feb 15, 2022
@bader requested a review from sergey-semenov on February 22, 2022 09:59
@bader changed the title from "[SYCL] Fixes memory dependency leaks caused by failed kernel enqueue" to "[SYCL] Fix memory dependency leaks caused by failed kernel enqueue" on Feb 22, 2022
@steffenlarsen
Contributor Author

@dm-vodopyanov & @sergey-semenov - Friendly ping.

Contributor

@sergey-semenov left a comment


Sorry for the late review; the changes LGTM.

Comment on lines +1280 to +1285
if (FailedCmd->getType() == Command::CommandType::RUN_CG) {
  auto ExecCmd = static_cast<ExecCGCommand *>(FailedCmd);
  // Collect the failed command's streams for deallocation so they are still
  // released even though the command never executed.
  std::vector<std::shared_ptr<stream_impl>> Streams = ExecCmd->getStreams();
  ExecCmd->clearStreams();
  StreamsToDeallocate.insert(StreamsToDeallocate.end(), Streams.begin(),
                             Streams.end());
Contributor


Should we do similar handling for reduction resources here once #5653 lands?

Contributor Author


Yes, absolutely! I will make sure it is included in whichever is merged last.

@bader merged commit 104e439 into intel:sycl on Mar 2, 2022
steffenlarsen added a commit that referenced this pull request Mar 3, 2022
bader pushed a commit that referenced this pull request Mar 3, 2022
…queue (#5120)" (#5718)

This reverts commit 104e439.

#5120 is causing post-commit failures and performance regressions. This reverts it.
@steffenlarsen
Contributor Author

Due to performance regressions and test failures in post-commit, these changes have been reverted. I will revisit this shortly.

raaiq1 pushed a commit to raaiq1/llvm that referenced this pull request Dec 5, 2022
…ntel#5120)

If a kernel enqueue fails, the runtime will immediately try to clean it up. However, if the command has any dependencies or users, the cleanup is skipped. This can cause the dependencies to stay alive and leak. These changes force a full sub-graph cleanup of the command if enqueuing failed. Additionally, sub-graph cleanup is changed to account for failed kernel enqueues and will remove the failed command from its leaves.