Skip to content

[SYCL][L0] Fix interop event leak #5912

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 33 commits into from

Conversation

bso-intel
Copy link
Contributor

The retain call for Level Zero backend causes a memory leak.
We can avoid this unnecessary call by adding the condition.

Signed-off-by: Byoungro So <byoungro.so@intel.com>
Retain should be called only for OpenCL backend.

Signed-off-by: Byoungro So <byoungro.so@intel.com>
@bso-intel bso-intel requested a review from a team as a code owner March 28, 2022 23:24
@bso-intel bso-intel requested a review from cperkinsintel March 28, 2022 23:24
cperkinsintel
cperkinsintel previously approved these changes Apr 1, 2022
Copy link
Contributor

@cperkinsintel cperkinsintel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

is there a matching test over on llvm-test-suite?

@bader bader changed the title Interop event leak [SYCL][L0] Fix interop event leak Apr 2, 2022
@bader
Copy link
Contributor

bader commented Apr 6, 2022

@bso-intel, are these failures caused by the changes from your patch?

@bso-intel
Copy link
Contributor Author

@bader ,
Yes. I am investigating....

Signed-off-by: Byoungro So <byoungro.so@intel.com>
@@ -5247,7 +5247,8 @@ pi_result _pi_event::cleanup(pi_queue LockedQueue) {
// NOTE: that this needs to be done only once for an event so
// this is guarded with the CleanedUp flag.
//
PI_CALL(EventRelease(this, LockedQueue));
if (RefCount > 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the event with "0" reference count is still flying around?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We call cleanup() when the RefCount of an Event reaches zero.

 if (--(Event->RefCount) == 0) {
    if (!Event->CleanedUp)
      Event->cleanup(LockedQueue);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then do we call cleanup() when RefCount is > 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just answering to your question "Why is the event with "0" reference count is still flying around?".
We call cleanup() when RefCount becomes zero in piEventRelease(), and then We call EventRetain() in cleanup(), it is a circular dependency and causes a crash.
That's why we need to avoid calling piEventRelease() when cleanup() is called from piEventRelease().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If cleanup is only called on RefCount == 0 then why do we call EventRelease from it? I am not understanding the reason for the circular dependency now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, cleanup() is called even when RefCount is not zero.
That's why there is a call to piEventRelease() from cleanup().
For example, cleanup() is called from piEventsWait() and pi_queue::resetCommandList().
If cleanup() has been called only when RefCount is zero, this code should have been already broken.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so event.cleanup() is called as long as the event is known to get completed. But the at cleanup() it will set the CleanedUp to true, so there should be no other cleanup() called on it (at destruction). There is a comment about it:

// NOTE: that this needs to be done only once for an event so

Please explain why this check for RefCount == 0 is necessary (show the exact scenario).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bso-intel : please follow up. I really don't want unnecessary RefCount checks which are indicative of some design misses if they are really needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I forgot to remove that.
Done now.

Copy link
Contributor

@againull againull Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also remove call to the cleanup from piEventRelease.
Event is retained explicitly when created in createEventAndAssociateQueue() to ensure that the event is not destroyed before it is really signaled. Ref count is decremented in Event->cleanup() after event is singaled. This
means that if reference count for an event is zero then cleanup was already called for this event and should not be called anymore. Interop events and user events don't have associated command list, queue,
kernel and waitlist, so we don't need to call cleanup() for interop and user
events.
I had an intention to do this earlier in 94943f3 but forgot about it.
Opened a PR for this small change: #6073.

Signed-off-by: Byoungro So <byoungro.so@intel.com>
@bso-intel
Copy link
Contributor Author

@smaslov-intel
I added an additional piEventRetain() to match the reference count of events created by piEventCreate(). (line#5449).
Now all tests passed.

@bso-intel bso-intel requested a review from smaslov-intel April 27, 2022 18:21
Signed-off-by: Byoungro So <byoungro.so@intel.com>
@bso-intel
Copy link
Contributor Author

@smaslov-intel
I removed the condition.

Comment on lines 5446 to 5447
// To match with the regular piEventCreate() and piEventRelease(),
// we increase the ref count of the pi_event we return to SYCL RT.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put a more clear comment for this important change.
Where does this "retain" represent and where is the pairing "release"?

Signed-off-by: Byoungro So <byoungro.so@intel.com>
@bso-intel bso-intel requested a review from smaslov-intel April 29, 2022 19:48
Comment on lines 5451 to 5452
// However, an interop event created in this function would go through
// the same life cycle, meaning it will be released twice.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize the interop PI event created here would be released by the SYCL RT (at SYCL event's destructor) but what will wait until the event is completed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't know. I think we should not make any assumption about interop event.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we will not do the wait in SYCL RT, and hence the extra retain will cause this PI event to never be released.

Copy link
Contributor Author

@bso-intel bso-intel May 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, I just checked SYCL_PI_TRACE and found that interop event is different than regular event.
Regular event created in piEventCreate() has ref count=2 at the beginning and then decremented twice before it is deallocated.
However, interop event is not created by createEventAndAssociateQueue(), so its ref count is 1, and piEventRelease() is called by RT once.
So, we don't have memory leak for interop event.

bso-intel added 2 commits May 9, 2022 19:10
Signed-off-by: Byoungro So <byoungro.so@intel.com>
Signed-off-by: Byoungro So <byoungro.so@intel.com>
Comment on lines 5452 to 5454
// a different life cycle, meaning it will be released only once when RT
// calls piEventRelease in the sycl::event_impl destructor.
// So, we increase the ref count of the interop pi_event here. Refcount=1.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the Event is conctructed with RefCnt=1, so this retain should change it to 2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I wonder if we should really call to createEventAndAssociateQueue here as it may be having other wanted side-effects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I must have been too sick.
You are right. The refcount=2 after retain().
The current code is right because we actually release this interop event twice.
Once by SYCL RT and another time when we release the queue.
That's why we needed the retain here.

Signed-off-by: Byoungro So <byoungro.so@intel.com>
@smaslov-intel
Copy link
Contributor

fixed in #6180

@bso-intel bso-intel deleted the interop-event-leak branch May 31, 2022 17:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants