
[SYCL] Enable post-enqueue execution graph cleanup #5070


Merged: 28 commits merged into intel:sycl from graphcleanup2electricboogaloo on Dec 27, 2021

Conversation

@sergey-semenov (Contributor) commented Dec 2, 2021:

This patch contains the initial implementation of post-enqueue cleanup of
command nodes. It is primarily motivated by the significant overhead of
post-wait cleanup in queue::wait when it is lowered to piQueueFinish: in that
case we cannot alternate between waiting for individual events and cleaning up
their commands while other tasks are still executing on the device.

Post-enqueue cleanup is performed for enqueued non-leaf nodes, so it can be
triggered either by the enqueue process or by removing an enqueued node from
the leaves. The initial implementation has several exceptions: host tasks
(which currently cannot be cleaned up after enqueue), kernels with streams
(stream handling is tied to finished-command cleanup), and CGs without
dependencies (for now, still deleted in addCG as before). Because of this,
finished-command cleanup is still triggered unconditionally in event::wait()
and, for host tasks and kernels with streams, in queue::wait().

In addition, this patch removes the queue::wait workarounds for Level Zero
that were required to bypass the overhead of finished-command cleanup.
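
For illustration, here is a minimal, self-contained sketch of the eligibility rule described above. The `Command` fields and the `isCleanupCandidate` helper are simplified, hypothetical stand-ins, not the actual runtime code:

```cpp
#include <cstddef>

// Hypothetical, simplified stand-in for the scheduler's command node.
struct Command {
  std::size_t MLeafCounter = 0; // number of leaf lists this node belongs to
  bool MEnqueued = false;       // successfully enqueued to the backend
  bool MIsHostTask = false;     // exception: host tasks are not cleaned up here
  bool MHasStreams = false;     // exception: stream handling needs post-wait cleanup
};

// A command qualifies for post-enqueue cleanup once it is no longer a leaf
// and has been successfully enqueued, unless it falls under one of the
// exceptions listed in the description above.
bool isCleanupCandidate(const Command &Cmd) {
  return Cmd.MEnqueued && Cmd.MLeafCounter == 0 && !Cmd.MIsHostTask &&
         !Cmd.MHasStreams;
}
```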

@sergey-semenov (Contributor, Author) commented:

/summary:run

A very rough implementation of cleaning up command nodes after they're
enqueued and stop being leaves, alloca commands excluded. Handles only a
subset of cases.
@sergey-semenov force-pushed the graphcleanup2electricboogaloo branch from 2030398 to de35323 on December 9, 2021 at 13:26

@romanovvlad (Contributor) left a comment:

Did a partial review. Please look at the latest comment first.

Cmd->MLeafCounter -= Record->MReadLeaves.remove(Cmd);
Cmd->MLeafCounter -= Record->MWriteLeaves.remove(Cmd);
if (WasLeaf && Cmd->MLeafCounter == 0 && Cmd->isSuccessfullyEnqueued() &&
@romanovvlad (Contributor) commented Dec 17, 2021:

I'm a little bit lost. Wouldn't it be simpler to:

  1. Add a readyForCleanUp method to Command which does all the required checks for a command.
  2. Add a "scheduler global" vector ToCleanUp which is guarded by a mutex (can later be converted to thread-local if needed).
  3. In GraphProcessor::enqueue, check if a dep of a command is readyForCleanup (1.) and, if yes, add it to the ToCleanUp vector (can have a local cache if needed to reduce the number of locks for accessing the vector).
  4. In all* Scheduler high-level entry points, check if we have something in the ToCleanUp vector and run cleanup.

?
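
A minimal sketch of the shape being suggested here, using hypothetical names rather than the actual patch (the comments map to the numbered steps above):

```cpp
#include <mutex>
#include <utility>
#include <vector>

struct Command {
  bool MEnqueued = false;
  int MLeafCounter = 0;
  // Step 1: all the checks a command must pass before it can be cleaned up.
  bool readyForCleanup() const { return MEnqueued && MLeafCounter == 0; }
};

class Scheduler {
public:
  // Step 3: called from the enqueue path when a dependency may have become
  // eligible for cleanup.
  void markForCleanup(Command *Cmd) {
    if (!Cmd->readyForCleanup())
      return;
    std::lock_guard<std::mutex> Lock(MCleanupMutex);
    MToCleanUp.push_back(Cmd);
  }

  // Step 4: called from high-level Scheduler entry points.
  void cleanupDeferredCommands() {
    std::vector<Command *> Local;
    {
      std::lock_guard<std::mutex> Lock(MCleanupMutex);
      std::swap(Local, MToCleanUp);
    }
    for (Command *Cmd : Local)
      delete Cmd; // real cleanup would also detach the node from the graph
  }

private:
  // Step 2: a "scheduler global" vector guarded by a mutex.
  std::mutex MCleanupMutex;
  std::vector<Command *> MToCleanUp;
};
```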

@sergey-semenov (Contributor, Author) commented:

I mainly wanted to keep the number of cleanup-vector mutex locks to a minimum: currently either the graph mutex or the deferred-cleanup-vector mutex is locked once per attempted cleanup. I hadn't considered that, if needed, the same could be achieved with a thread-local member variable in the scheduler. Also, if we use a mutex instead of a thread-local variable, we still cannot minimize its locking (to one lock per Scheduler entry point) with a local cache at the GraphProcessor::enqueue level. So out of those three options I would prefer either the current solution or a thread-local member variable.

I currently make the check inside Command::enqueue and populate the vector there, since the function only reports that the command has been enqueued at some point rather than as part of this specific Command::enqueue call. We could return that second piece of information from Command::enqueue to move this logic to GraphProcessor::enqueue, but considering that it is only needed for cleanup, I don't think that would be a significant improvement.

So, do you think that switching to a thread-local member variable in the scheduler and populating it in Command::enqueue and GraphBuilder::updateLeaves makes sense?
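
For context, the thread-local alternative being weighed here would look roughly like the sketch below (hypothetical names; in C++ a thread-local member has to be a static data member):

```cpp
#include <vector>

struct Command;

class Scheduler {
public:
  // Hypothetical: populated from Command::enqueue and
  // GraphBuilder::updateLeaves without any locking, since each thread only
  // ever touches its own copy; drained at Scheduler entry points.
  static thread_local std::vector<Command *> MDeferredCleanupCommands;
};

thread_local std::vector<Command *> Scheduler::MDeferredCleanupCommands;
```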

@romanovvlad (Contributor) commented:

I don't think using thread-local variables here is any better. While it might end up being less code, I don't think maintaining it will be easy.

@sergey-semenov (Contributor, Author) commented:

I tend to agree. I suggest keeping the current solution for now then, unless any other reviewers disagree with that. @romanovvlad Feel free to continue the review post-merge so that we can revisit this topic if needed.

@sergey-semenov (Contributor, Author) commented Dec 21, 2021:

@intel/llvm-reviewers-runtime Since Vlad is on vacation, could someone else please review and weigh in on some of the unresolved discussions (mainly the one about the current solution vs. using a thread-local member vector of commands)?

@sergey-semenov (Contributor, Author) commented:

@alexbatashev @intel/llvm-reviewers-runtime Added unit tests for the change. Please take another look.

@sergey-semenov (Contributor, Author) commented:

/verify with intel/llvm-test-suite#631

@bader (Contributor) left a comment:

sycl/doc/EnvironmentVariables.md looks good to me.

@bader (Contributor) commented Dec 23, 2021:

Do you know why SYCL :: Tracing/pi_tracing_test.cpp fails?

@bader (Contributor) commented Dec 23, 2021:

I suggest merging with the recent sycl head, updating intel/llvm-test-suite#631 with more fixes if necessary, and restarting Jenkins testing. I think we should clean the pre-commit results for this PR. After that, we can check the summary job status again.

@sergey-semenov (Contributor, Author) commented:

I think the tracing test fails in the "verify with" run because this branch has been merged with #5172, while the llvm-test-suite branch doesn't have intel/llvm-test-suite#637. Still, I agree that it's better to be safe here; I've updated both PRs.

@sergey-semenov (Contributor, Author) commented:

/verify with intel/llvm-test-suite#631

@bader (Contributor) commented Dec 24, 2021:

"Generate Doxygen documentation / build (pull_request)" will fail due to issues caused by pulling llorg commits. Please ignore this failure.

@sergey-semenov (Contributor, Author) commented:

/summary:run

@bader bader merged commit 6fd6098 into intel:sycl Dec 27, 2021
@sergey-semenov sergey-semenov deleted the graphcleanup2electricboogaloo branch December 27, 2021 13:22
void setNeedsCleanupAfterWait(bool NeedsCleanupAfterWait) {
MNeedsCleanupAfterWait = NeedsCleanupAfterWait;
}
bool needsCleanupAfterWait() { return MNeedsCleanupAfterWait; }
A Contributor commented on this snippet:

shouldn't this be const?
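
For illustration, the const-qualified form the comment is asking about would look like this (the enclosing class is a simplified stand-in for the real one):

```cpp
class Command {
public:
  // The setter mutates state, so it stays non-const.
  void setNeedsCleanupAfterWait(bool NeedsCleanupAfterWait) {
    MNeedsCleanupAfterWait = NeedsCleanupAfterWait;
  }
  // The getter only reads the flag, so it can be marked const as suggested.
  bool needsCleanupAfterWait() const { return MNeedsCleanupAfterWait; }

private:
  bool MNeedsCleanupAfterWait = false;
};
```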

dm-vodopyanov added a commit to dm-vodopyanov/llvm that referenced this pull request on Jan 14, 2022:

This patch reverts a performance regression for the Level Zero backend introduced in intel#5070.