-
Notifications
You must be signed in to change notification settings - Fork 786
[SYCL][L0] Fix interop event leak #5912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Byoungro So <byoungro.so@intel.com>
Retain should be called only for OpenCL backend. Signed-off-by: Byoungro So <byoungro.so@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
is there a matching test over on llvm-test-suite?
@bso-intel, are these failures caused by the changes from your patch? |
@bader , |
Signed-off-by: Byoungro So <byoungro.so@intel.com>
@@ -5247,7 +5247,8 @@ pi_result _pi_event::cleanup(pi_queue LockedQueue) { | |||
// NOTE: that this needs to be done only once for an event so | |||
// this is guarded with the CleanedUp flag. | |||
// | |||
PI_CALL(EventRelease(this, LockedQueue)); | |||
if (RefCount > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the event with "0" reference count is still flying around?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We call cleanup() when the RefCount of an Event reaches zero.
if (--(Event->RefCount) == 0) {
if (!Event->CleanedUp)
Event->cleanup(LockedQueue);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then do we call cleanup()
when RefCount is > 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just answering to your question "Why is the event with "0" reference count is still flying around?".
We call cleanup() when RefCount becomes zero in piEventRelease(), and then We call EventRetain() in cleanup(), it is a circular dependency and causes a crash.
That's why we need to avoid calling piEventRelease() when cleanup() is called from piEventRelease().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If cleanup
is only called on RefCount == 0 then why do we call EventRelease
from it? I am not understanding the reason for the circular dependency now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, cleanup() is called even when RefCount is not zero.
That's why there is a call to piEventRelease() from cleanup().
For example, cleanup() is called from piEventsWait() and pi_queue::resetCommandList().
If cleanup() has been called only when RefCount is zero, this code should have been already broken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so event.cleanup()
is called as long as the event is known to get completed. But the at cleanup() it will set the CleanedUp
to true, so there should be no other cleanup()
called on it (at destruction). There is a comment about it:
llvm/sycl/plugins/level_zero/pi_level_zero.cpp
Line 5330 in b01d820
// NOTE: that this needs to be done only once for an event so |
Please explain why this check for RefCount == 0
is necessary (show the exact scenario).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bso-intel : please follow up. I really don't want unnecessary RefCount checks which are indicative of some design misses if they are really needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I forgot to remove that.
Done now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also remove call to the cleanup from piEventRelease.
Event is retained explicitly when created in createEventAndAssociateQueue() to ensure that the event is not destroyed before it is really signaled. Ref count is decremented in Event->cleanup() after event is singaled. This
means that if reference count for an event is zero then cleanup was already called for this event and should not be called anymore. Interop events and user events don't have associated command list, queue,
kernel and waitlist, so we don't need to call cleanup() for interop and user
events.
I had an intention to do this earlier in 94943f3 but forgot about it.
Opened a PR for this small change: #6073.
Signed-off-by: Byoungro So <byoungro.so@intel.com>
@smaslov-intel |
Signed-off-by: Byoungro So <byoungro.so@intel.com>
@smaslov-intel |
// To match with the regular piEventCreate() and piEventRelease(), | ||
// we increase the ref count of the pi_event we return to SYCL RT. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please put a more clear comment for this important change.
Where does this "retain" represent and where is the pairing "release"?
Signed-off-by: Byoungro So <byoungro.so@intel.com>
// However, an interop event created in this function would go through | ||
// the same life cycle, meaning it will be released twice. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize the interop PI event created here would be released by the SYCL RT (at SYCL event's destructor) but what will wait until the event is completed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't know. I think we should not make any assumption about interop event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we will not do the wait in SYCL RT, and hence the extra retain will cause this PI event to never be released.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, I just checked SYCL_PI_TRACE and found that interop event is different than regular event.
Regular event created in piEventCreate() has ref count=2 at the beginning and then decremented twice before it is deallocated.
However, interop event is not created by createEventAndAssociateQueue(), so its ref count is 1, and piEventRelease() is called by RT once.
So, we don't have memory leak for interop event.
Signed-off-by: Byoungro So <byoungro.so@intel.com>
// a different life cycle, meaning it will be released only once when RT | ||
// calls piEventRelease in the sycl::event_impl destructor. | ||
// So, we increase the ref count of the interop pi_event here. Refcount=1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the Event is conctructed with RefCnt=1, so this retain should change it to 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I wonder if we should really call to createEventAndAssociateQueue here as it may be having other wanted side-effects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I must have been too sick.
You are right. The refcount=2 after retain().
The current code is right because we actually release this interop event twice.
Once by SYCL RT and another time when we release the queue.
That's why we needed the retain here.
Signed-off-by: Byoungro So <byoungro.so@intel.com>
fixed in #6180 |
The retain call for Level Zero backend causes a memory leak.
We can avoid this unnecessary call by adding the condition.