Want fallback on thread creation failure #525

Open
markati opened this issue Mar 28, 2015 · 8 comments

@markati

markati commented Mar 28, 2015

It seems that OpenBLAS just dies when pthread_create fails. It should instead continue execution with the threads already created, or at least fall back to single-threaded mode. pthread_create often fails on a many-core machine when many instances of an application are launched in parallel.

@xianyi
Collaborator

xianyi commented Mar 29, 2015

Thanks for the suggestion. I will implement this feature.

@groutr

groutr commented Sep 17, 2015

Any news on this feature?

@xianyi
Collaborator

xianyi commented Oct 5, 2015

@groutr, I haven't implemented it yet.

@xianyi
Collaborator

xianyi commented Oct 27, 2015

I merged the patch, which raises a signal when pthread_create fails.

70642fe

Is it enough for this feature request?

@markati
Author

markati commented Nov 6, 2015

I'm afraid the patch makes the situation worse.

When raise(SIGINT) is called, the installed signal handler is invoked.
The signal handler returns, and then raise(SIGINT) returns 0.
The for-loop in which raise(SIGINT) was called continues creating
threads, assuming, without any check, that the stillborn thread has somehow
been dealt with by the signal handler.

Since this is a BLAS library, it is often embedded in an application
that has no specific dependence on OpenBLAS,
and the application may have installed a signal handler
that is unrelated to OpenBLAS, perhaps for the application's own purposes.
What if such a handler is called by raise(SIGINT)?
OpenBLAS will resume execution with some worker threads left dead.

Application authors therefore must make sure that they have installed
an appropriate signal handler, or that no signal handler has
been installed, before calling BLAS functions.

Furthermore, even if application authors are aware that they
must write a signal handler, there is nothing useful they can do
in the handler. For instance, a signal handler cannot access
blas_threads[i], in which a handle to the stillborn thread
has been stored, since 'blas_threads' is a static global variable and 'i' is
an auto variable. The signal handler can do nothing but call exit()
because, if the handler returns, OpenBLAS will behave insanely.
(The handler cannot perform longjmp() either. The behavior is undefined.)

@martin-frbg
Collaborator

martin-frbg commented May 13, 2018

Revisiting this (and associated PR #668), wouldn't a better error behaviour be to

  • exit the thread creation loop
  • set blas_num_threads to the number successfully created up to that point
  • write a message to stderr
  • continue running with what we have ?

A caller could probably still query the number of threads actually created and raise a signal if desired,
or call goto_set_num_threads() to retry the creation of any "missing" threads.
Additionally, it seems #668 made no attempt to handle the other potential source of pthread_create()
failures, in goto_set_num_threads().
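
The four steps proposed above could look roughly like the following standalone sketch. It is not the OpenBLAS implementation: `start_workers`, `join_workers`, `worker_main`, `flaky_create`, and `MAX_WORKERS` are hypothetical names, and the creation function is passed in as a pointer only so that a failure can be simulated.

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORKERS 64 /* hypothetical upper bound for this sketch */

static pthread_t workers[MAX_WORKERS];

/* Hypothetical worker body; a real worker would wait for jobs. */
static void *worker_main(void *arg)
{
    (void)arg;
    return NULL;
}

/* Same shape as pthread_create, so a failing stand-in can be injected. */
typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
                         void *(*)(void *), void *);

/* Test double: succeeds twice, then reports EAGAIN like an exhausted system. */
int flaky_create(pthread_t *t, const pthread_attr_t *a,
                 void *(*fn)(void *), void *arg)
{
    static int calls = 0;
    if (calls >= 2)
        return EAGAIN;
    calls++;
    return pthread_create(t, a, fn, arg);
}

/*
 * Tries to start `requested` workers. On the first failure it leaves the
 * loop, prints a message to stderr, and returns the number of threads
 * actually created so the caller can continue with that many.
 */
int start_workers(int requested, create_fn create)
{
    int created = 0;
    if (requested > MAX_WORKERS)
        requested = MAX_WORKERS;

    for (int i = 0; i < requested; i++) {
        int rc = create(&workers[i], NULL, worker_main, NULL);
        if (rc != 0) {
            fprintf(stderr,
                    "thread creation failed (%s); continuing with %d of %d\n",
                    strerror(rc), created, requested);
            break;
        }
        created++;
    }
    return created;
}

/* Joins the first `count` workers started by start_workers. */
void join_workers(int count)
{
    for (int i = 0; i < count; i++)
        pthread_join(workers[i], NULL);
}
```

The return value plays the role of the adjusted thread count: a caller can compare it with the requested number and decide whether to retry, raise a signal, or carry on.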

@cipri-tom

Is there any workaround for the moment?

@martin-frbg
Collaborator

That depends on your use case. Trivially, you could build OpenBLAS single-threaded.
Otherwise, reverting the patch and removing the "exit" statement from the previous code may be a start, and/or checking in your own code whether thread creation can be expected to work before calling OpenBLAS.
Does what I sketched out in my earlier message two weeks ago sound plausible?
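
For the last workaround, checking in your own code whether thread creation can be expected to work, one possible sketch is a small probe that tries to hold a given number of threads alive at once. `probe_thread_capacity`, `probe_wait`, and `PROBE_MAX` are hypothetical names, not part of OpenBLAS, and the result is only a hint: conditions can change between the probe and the point where the library creates its own threads.

```c
#include <pthread.h>

enum { PROBE_MAX = 256 }; /* hypothetical cap for this sketch */

/* Gate that keeps all probe threads alive until the probe is finished. */
static pthread_mutex_t probe_gate = PTHREAD_MUTEX_INITIALIZER;

/* Each probe thread blocks on the gate, then exits immediately. */
static void *probe_wait(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&probe_gate);
    pthread_mutex_unlock(&probe_gate);
    return NULL;
}

/*
 * Tries to create up to `wanted` threads that exist simultaneously.
 * Returns how many could be created before pthread_create failed.
 */
int probe_thread_capacity(int wanted)
{
    pthread_t tids[PROBE_MAX];
    int created = 0;

    if (wanted > PROBE_MAX)
        wanted = PROBE_MAX;

    /* Hold the gate so that no probe thread can finish early. */
    pthread_mutex_lock(&probe_gate);
    for (int i = 0; i < wanted; i++) {
        if (pthread_create(&tids[created], NULL, probe_wait, NULL) != 0)
            break;
        created++;
    }
    pthread_mutex_unlock(&probe_gate);

    /* Release and reap every thread that was started. */
    for (int i = 0; i < created; i++)
        pthread_join(tids[i], NULL);

    return created;
}
```

If the probe returns fewer threads than intended, the application could choose to run with a smaller thread count or with a single-threaded build.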
