-
Notifications
You must be signed in to change notification settings - Fork 771
[SYCL] WA for RT hang when exit(1) unexpectedly called in backend & SYCL_PI_TRACE enabled #5630
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…xit() If exit() is called by module inside pi call wrapped by mutex for pi traces RT hangs on tear down. Introduces mutex lock skip in this case. Signed-off-by: Tikhomirova, Kseniya <kseniya.tikhomirova@intel.com>
…lock Signed-off-by: Tikhomirova, Kseniya <kseniya.tikhomirova@intel.com>
private: | ||
std::shared_ptr<RT::PiPlugin> MPlugin; | ||
backend MBackend; | ||
void *MLibraryHandle; // the handle returned from dlopen | ||
std::shared_ptr<std::mutex> TracingMutex; | ||
std::shared_ptr<SpinLock> MTracingSLock; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How using SpinLock here is better than std::mutex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our case the thread terminates while owning lock - undefined behavior for std::mutex; or if we skip mutex lock/unlock as I did - undefined behavior for std::mutex.
https://en.cppreference.com/w/cpp/thread/mutex/~mutex
"Destroys the mutex.
The behavior is undefined if the mutex is owned by any thread or if any thread terminates while holding any ownership of the mutex."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And what is the behavior of SpinLock
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in common, lock/unlock just change stored value in std::atomic_flag. no limitations as with mutex is documented.
// call sync primitive can be still locked. | ||
auto Guard = MShutdownModeEnabled | ||
? std::unique_lock<SpinLock>() | ||
: std::unique_lock<SpinLock>(*MTracingSLock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, you avoid locking if tracing called at shutdown, right? Can you explain what is the UB being addressed here? Also, is this really the only lock that is problematic in case of process termination?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, you understanding is correct. If anyone inside pi call calls exit(1) (e.g. assert failure) and traces are enabled we will have a hang because we have:
- mutex.lock
- pi call that calls exit() and never returns
- unreachable mutex.unlock;
Exit() causes normal process termination and provides opportunity to release resources. In this case we do shutdown and call piTearDown that is also wrapped by the same mutex.
In this case right now RT just hangs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the problematic places:
the situation with exit in backends may affect two places that I know:
- PI calls tracing - environment variable is documented; addressed by this PR
- ze calls tracing if it is called in piTearDown (?) - ZE_SERIALIZE=1, not documented, fully internal, not addressed. https://github.com/intel/llvm/blob/sycl/sycl/plugins/level_zero/pi_level_zero.cpp#L51
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double checked - ze calls tracing is not affected, piTearDown doesn't call any
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If anyone inside pi call calls exit(1)
Which code does that? Maybe we should disallow exit() from plugins? It should only return error codes to SYCL RT or terminate() in case of fatal error (and no shutdown() in this case).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should just turn the mutex around PI call to be "timed_mutex"? Also, holding the mutex around some PI calls (e.g. piProgramBuild) may be too expensive and freeze other threads for no good reasons. The "timed_mutex" will only hold it for, say 1 sec. Yes, some lengthy calls will have it's output mixed with other threads output. That would be addressed in long-term by using XPTI for tracing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is there a need for mutex at all?
Is it only due multiline output for PI tracing?
Is it feasible to switch to single-line output like e.g. strace?
Any implications for such a change?
@alexbatashev , do you have anything to advise here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mutex is to make trace readable, i.e. such that output from multiple threads is not mixed. Remember, some prints are before PI call, some after, and some other are from within the PI call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@s-kanaev I'm not sure if we need to change anything at all. The original hang was caused by a bug in the underlying runtime. This workaround is for that specific case, there's no guarantee that it won't fire in another case. XPTI tracing apparently is not affected by this issue, and it is almost ready to be merged: #5389, so, we should probably focus on improving that tool, rather than putting some questionable w/as all over runtime. If we still want to fix SYCL_PI_TRACE, @smaslov-intel has a good advice - timed mutex with long enough timeout. That'll be less of a hack than what's proposed by this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://en.cppreference.com/w/cpp/thread/timed_mutex/~timed_mutex
timed mutex has the same UB conditions as std::mutex. Not sure if it would be safe enough to use it.
The behavior is undefined if the mutex is owned by any thread or if any thread terminates while holding any ownership of the mutex.
WAs the case when underlying layer calls exit(1). The only intension of this commit is to eliminate RT hang in this case.