[SYCL] Enable post-enqueue execution graph cleanup #5070
Conversation
/summary:run
A very rough implementation of cleaning up command nodes after they're enqueued and stop being leaves, alloca commands excluded. Handles only a subset of cases.
Did a partial review.
Please, look at the latest comment first.
// The command is removed from the record's read and write leaf lists; an
// already enqueued command whose leaf counter drops to zero becomes a
// candidate for post-enqueue cleanup.
Cmd->MLeafCounter -= Record->MReadLeaves.remove(Cmd);
Cmd->MLeafCounter -= Record->MWriteLeaves.remove(Cmd);
if (WasLeaf && Cmd->MLeafCounter == 0 && Cmd->isSuccessfullyEnqueued() &&
I'm a little bit lost.
Wouldn't it be simpler to:
1. Add a readyForCleanUp method to Command which does all the required checks for a command.
2. Add a "scheduler global" vector ToCleanUp which is guarded by a mutex (can later be converted to thread local if needed).
3. In GraphProcessor::enqueue, check if a dep of a command is readyForCleanUp (1.) and, if yes, add it to the ToCleanUp vector (can have a local cache if needed to reduce the number of locks for accessing the vector).
4. In all* Scheduler high-level entry points, check if we have something in the ToCleanUp vector and run cleanup.
?
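A rough sketch of the shape being proposed here, as I read it: a readyForCleanUp predicate on Command plus a mutex-guarded ToCleanUp vector owned by the Scheduler. All names, members and signatures below are illustrative assumptions, not the actual intel/llvm code:

```cpp
// Illustrative sketch of the proposal above; not the actual implementation.
#include <mutex>
#include <vector>

class Command {
public:
  virtual ~Command() = default;
  // Hypothetical predicate doing all the required checks: the command has
  // been enqueued and is no longer referenced as a leaf.
  bool readyForCleanUp() const { return MEnqueued && MLeafCounter == 0; }

  bool MEnqueued = false;
  unsigned MLeafCounter = 0;
};

class Scheduler {
public:
  // Called from GraphProcessor::enqueue for every dependency of a command.
  void deferCleanUpIfReady(Command *Dep) {
    if (!Dep->readyForCleanUp())
      return;
    std::lock_guard<std::mutex> Lock(MToCleanUpMutex);
    MToCleanUp.push_back(Dep);
  }

  // Called at the end of the high-level Scheduler entry points.
  void cleanUpDeferred() {
    std::vector<Command *> Commands;
    {
      std::lock_guard<std::mutex> Lock(MToCleanUpMutex);
      Commands.swap(MToCleanUp);
    }
    for (Command *Cmd : Commands)
      delete Cmd; // real cleanup would also unlink the node from the graph
  }

private:
  // "Scheduler global" cleanup vector guarded by a mutex, as suggested.
  std::mutex MToCleanUpMutex;
  std::vector<Command *> MToCleanUp;
};
```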
I mainly wanted to keep the number of cleanup vector mutex locks to a minimum: currently either the graph or the deferred cleanup vector mutex is locked once per attempted cleanup. I hadn't considered that, if needed, the same can be achieved by using a thread local member variable in the scheduler. Also, if we use a mutex instead of a thread local variable, we still cannot minimize its locking (to one lock per Scheduler entry point) with a local cache at the GraphProcessor::enqueue level. So I would prefer the current solution or a thread local member variable out of those three options.

I currently make the check inside Command::enqueue and populate the vector there, since the function only reports that the command has been enqueued at some point rather than as part of this specific Command::enqueue call. We could return the second piece of information from Command::enqueue to move this logic to GraphProcessor::enqueue, but considering that this info is only needed for cleanup, I don't think it would be a significant improvement.

So, do you think that switching to a thread local member variable in the scheduler and populating it in Command::enqueue and GraphBuilder::updateLeaves makes sense?
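For contrast, a minimal sketch of the thread local member variable alternative discussed above: per-thread storage removes the need for locking while candidates are collected, at the cost of per-thread bookkeeping. Names below are hypothetical, not the actual intel/llvm code:

```cpp
// Illustrative sketch of the thread-local alternative; names are made up.
#include <vector>

class Command {
public:
  virtual ~Command() = default;
};

class Scheduler {
public:
  // Populated from Command::enqueue / GraphBuilder::updateLeaves on the
  // calling thread; no mutex is needed because the storage is per-thread.
  void deferCleanUp(Command *Cmd) { MToCleanUp.push_back(Cmd); }

  // Drained at the end of each Scheduler entry point, on the same thread
  // that collected the candidates.
  void cleanUpDeferred() {
    for (Command *Cmd : MToCleanUp)
      delete Cmd; // real cleanup would also unlink the node from the graph
    MToCleanUp.clear();
  }

private:
  // One cleanup list per thread.
  static thread_local std::vector<Command *> MToCleanUp;
};

thread_local std::vector<Command *> Scheduler::MToCleanUp;
```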
I don't think using thread local vars here is any better. While it might end up being less code, I don't think maintaining it will be easy.
I tend to agree. I suggest keeping the current solution for now then, unless any other reviewers disagree with that. @romanovvlad Feel free to continue the review post-merge so that we can revisit this topic if needed.
@intel/llvm-reviewers-runtime Since Vlad is on vacation, could someone else please review and weigh in on some of the unresolved discussions (mainly, the one about the current solution vs using a thread local member vector of commands)?
@alexbatashev @intel/llvm-reviewers-runtime Added unit tests for the change. Please, take another look.
/verify with intel/llvm-test-suite#631
sycl/doc/EnvironmentVariables.md looks good to me.
Do you know why |
I suggest merging with recent |
I think the tracing test fails in the "verify with" run because this branch has been merged with #5172, while the llvm-test-suite branch doesn't have intel/llvm-test-suite#637. Still, I agree that it's better to be safe here, updated both PRs.
/verify with intel/llvm-test-suite#631
"Generate Doxygen documentation / build (pull_request)" will fail due to issues caused by pulling llorg commits. Please, ignore this failure.
/summary:run
void setNeedsCleanupAfterWait(bool NeedsCleanupAfterWait) {
  MNeedsCleanupAfterWait = NeedsCleanupAfterWait;
}
bool needsCleanupAfterWait() { return MNeedsCleanupAfterWait; }
shouldn't this be const?
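For illustration, the const-qualified version being suggested would presumably look like this (assuming the getter touches no other state, as its body implies):

```cpp
// Getter marked const, since it only reads MNeedsCleanupAfterWait.
bool needsCleanupAfterWait() const { return MNeedsCleanupAfterWait; }
```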
This patch reverts a performance regression for the Level Zero backend introduced in intel#5070
This patch contains the initial implementation of post-enqueue cleanup of
command nodes. It is primarily motivated by the significant overhead of
post-wait cleanup in queue::wait when lowered to piQueueFinish, which stems
from the fact that we cannot alternate between waiting for individual events
and cleaning up their commands while other tasks are still executing on the
device.

Post-enqueue cleanup is performed for enqueued non-leaf nodes, so it can be
triggered either by the enqueue process or by removing an enqueued node from
the leaves.

There are multiple exceptions in the initial implementation: host tasks
(which currently cannot be cleaned up after enqueue), kernels with streams
(stream handling is tied to finished command cleanup) and CGs without
dependencies (still deleted in addCG as before, for now). Because of this,
finished command cleanup is still triggered unconditionally in event::wait()
and in queue::wait() for host tasks and kernels with streams.

In addition, this patch removes the queue::wait workarounds for Level Zero
that were required to bypass the finished command cleanup overhead.
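A rough sketch of the eligibility check this description implies: a node qualifies for post-enqueue cleanup once it is enqueued and stops being a leaf, with the listed exceptions (and alloca commands) excluded. The Command stand-in and helper names below are assumptions for illustration, not the actual intel/llvm code:

```cpp
// Illustrative sketch only; this Command is a hypothetical stand-in, not the
// actual intel/llvm scheduler class.
struct Command {
  bool Enqueued = false;    // successfully enqueued to the backend
  unsigned LeafCounter = 0; // number of leaf lists still referencing it
  bool IsHostTask = false;
  bool HasStreams = false;
  bool IsAlloca = false;
};

// Exceptions from the description: host tasks, kernels with streams and
// alloca commands are not eligible for post-enqueue cleanup.
bool supportsPostEnqueueCleanup(const Command &Cmd) {
  return !Cmd.IsHostTask && !Cmd.HasStreams && !Cmd.IsAlloca;
}

// A node becomes a cleanup candidate once it is enqueued and is no longer a
// leaf, which can happen during enqueue or when it is removed from leaves.
bool readyForPostEnqueueCleanup(const Command &Cmd) {
  return Cmd.Enqueued && Cmd.LeafCounter == 0 &&
         supportsPostEnqueueCleanup(Cmd);
}
```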