EventStream::~EventStream() causes dead-lock when exit() is called in alpaka kernel #3241
Comments
That depends on how you define "end of simulation": it is not at the intended end of the simulation. Clang's OpenMP target offload runtime encounters an error in or around a target region (i.e. inside the alpaka TaskKernel*::operator() function). After printing an error message with zero information, it calls exit(), which ends the program right there. I think it would be nicer of the runtime to raise an exception or trigger SIGABRT (like assert()), but at the moment it does not, and it may not be the last runtime to decide to go this route.
I thought |
No, I think you are right about exit. The problem seems to be that some destructors are called, but the stack is not unwound properly.
Ah, thanks for the clarification. So it seems we can't really force stack unwinding in this case, which feels bad.
cppreference.com confirms that std::exit does not unwind the stack. It talks about std::atexit to register handlers, but using this does seem to work.
I filed an enhancement request with clang for this issue. I think the library should call abort() in case of a fatal error.
Some runtimes (Clang's OpenMP target offload) call exit() when an error occurs.
In picongpu (at least using a blocking queue) this causes a deadlock when the pmacc::EventStream destructor is called, which calls cuplaStreamSynchronize(), which uses the alpaka queue's wait functionality, which tries to lock the queue's mutex. The lock on this mutex is held by the enqueue function, which never returned in this case.
I do not know if this can be feasibly fixed; I just want to bring it up for discussion.
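The lock chain described above can be illustrated with a small single-threaded sketch. This is not alpaka or pmacc code: `ToyLock`, `ToyBlockingQueue`, `waitWouldBlock()` and the two helper functions are hypothetical names, and an atomic flag stands in for the queue's mutex so that the probe can report "would block" instead of actually hanging. The assumption is that a blocking queue runs the task in the caller's thread while holding its lock, as described for the enqueue function.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Hypothetical stand-in for the queue's mutex. An atomic flag is used so that
// probing the lock from the owning thread is well defined.
class ToyLock
{
public:
    // Returns true if the lock was free and is now held by the caller.
    bool tryLock() { return !m_locked.exchange(true); }
    void unlock() { m_locked.store(false); }

private:
    std::atomic<bool> m_locked{false};
};

// Hypothetical blocking queue: the task runs inside enqueue(), under the lock.
class ToyBlockingQueue
{
public:
    void enqueue(std::function<void()> const& task)
    {
        bool const acquired = m_lock.tryLock();
        assert(acquired); // this sketch is single-threaded, so the lock is free
        (void) acquired;
        task();           // if the task called exit() here, unlock() would never run
        m_lock.unlock();
    }

    // Non-blocking probe standing in for wait(): true means a real wait()
    // would have to block because the lock is currently held.
    bool waitWouldBlock()
    {
        if(m_lock.tryLock())
        {
            m_lock.unlock();
            return false;
        }
        return true;
    }

private:
    ToyLock m_lock;
};

// Probes the lock from inside a running task, i.e. at the point where a
// runtime might call exit() and trigger destructors that wait on the queue.
inline bool lockHeldWhileTaskRuns()
{
    ToyBlockingQueue queue;
    bool held = false;
    queue.enqueue([&] { held = queue.waitWouldBlock(); });
    return held;
}

// Probes the lock after enqueue() has returned normally.
inline bool lockFreeAfterEnqueueReturns()
{
    ToyBlockingQueue queue;
    queue.enqueue([] {});
    return !queue.waitWouldBlock();
}
```

In this sketch the task returns normally, so the lock is released afterwards. In the reported case the task never returns because exit() is called inside it, so a destructor that waits on the queue during exit() finds the lock still held.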