SIGSEGV when calling some CBLAS and LAPACKE functions #1641
Indeed it was, and test_fork should be passing. You should be able to spawn up to MAX_CPU_NUMBER * 2 * MAX_PARALLEL_NUMBER threads. How many are you attempting to spawn? Are you doing this manually or letting blas_server do it for you?
Our tests don't use
There is no direct control of OS threads in Go, they are handled by its runtime. In my testing environment the Go runtime can see 4 logical CPUs. This value can be changed. If I add a debug output to The debug output is always printed 3 times before the actual test is run, the remaining output is printed during the test. I cannot interpret these numbers because I don't know when
For this comment I've been using our test for
Did your build environment change recently? If you did not specify NUM_THREADS, it will be guessed from the number of CPUs seen on the build host - this is where the "Max num-threads is 4" at the end of the build info text comes from.
I didn't change the build environment, and I don't specify NUM_THREADS. I just use the defaults. It's not clear to me what changed in
I'm sorry I can't provide better information; this touches on areas that I'm not very familiar with (Go runtime internals, OpenBLAS internals).
The whole thread startup code and the management of the thread-local memory was rewritten, but the expectation was that the old limits would still be valid. Wish there was something less complex than gonum to reproduce the problem...
Possibly a silly suggestion, but could you try to double the value in the definition of MAX_ALLOCATING_THREADS (line 500 of memory.c) just to see if this makes the problem go away or leads to crashes elsewhere?
If I double the value of
I'll try to provide a minimal reproducer for the current value soon.
Thank you. That should at least provide a workaround that allows us to include the new code in 0.3.1.
I've prepared a reproducer: OpenBLAS_issue_1641.zip It requires a working Go installation but otherwise is self-contained. To run the test I use the following command inside the unpacked
The code could be smaller but at least it reflects exactly what our test does. In my effort to minimize it I made an interesting observation. The test generates an exhaustive set of all possible input arguments for some limited number of matrix sizes. It does this with various nested for loops that each loop over a set of values for a given parameter. When I removed one or two of these loops by fixing the corresponding argument to a single value, the number of debug output lines from my printf before line 547 in memory.c went down. I don't speak about threads here on purpose because I really don't know who is creating them in this case, the Go runtime or OpenBLAS; it is not Gonum or our tests. Does this help? Let me know if you run into any issues with running the test.
Thank you very much. This allowed me to reproduce the problem (after doing a
Hmm. Same test, but building and running on a dual-core laptop (so 4 threads by default) passes without any issues (and without needing any changes to MAX_ALLOCATING_THREADS). This was with go 1.7.5 rather than go 1.9, however, due to using distribution packages for different OS versions.
Setting GOMAXPROCS to the same number as NUM_THREADS makes tests pass with go 1.9 (and 1 added to MAX_ALLOCATING_THREADS) in tests with a reduced NUM_THREADS setting as well. go 1.7.5 just seems to be using a less aggressive default (?) |
Actually it looks like an off-by-one, and I somehow managed to link to a version with the "+1" already included in my tests with go 1.7.5 on the dual-core box to get the contradicting "results"... |
Just to let you know, I still get the segfaults with the latest |
I also see the segfaults with the patch above (for Zgbmv and Dgeev). |
Do you still see the segfaults when you set GOMAXPROCS to the number of threads as shown on the OpenBLAS build message (i.e., the NUM_THREADS in use for building OpenBLAS)?
Re-running your reproducer, I now see the strange effect that most runs start 24 threads or less (and thus run flawlessly on the six-core cpu), but every now and then Go tries to get a 25th for no apparent reason, and OpenBLAS crashes. Even the successful runs differ in the distribution of the thread invocations, sometimes Go has called for 10 threads before the "RUN TestZgbmv" is printed, sometimes 11 (but only goes for a maximum of 23), or just 9 and then 24. This behaviour appears to be totally random, apparently caused by the random nature of the input matrix elements - so the reproducer is able to reproduce both "fixed" and "broken" results.
I'm also seeing this error. If I set OMP_NUM_THREADS=1, the test passes.
OpenBLAS was built with gcc 8.1.1.
@marrowsuck you can probably get around this by setting NUM_THREADS to a higher number at build time. |
The default GOMAXPROCS is now num_cpu which does match the NUM_THREADS reported at the completion of make, but explicitly setting GOMAXPROCS to that value does still have the segfault behaviour. Setting GOMAXPROCS to any value lower than 6 (for a build with NUM_THREADS=8) though allows the Zgbmv test to complete without segfaulting (higher than this does fail). |
So it seems the crude "factor of 2" correction to MAX_ALLOCATING_THREADS was correct after all. But I still cannot claim to understand where it comes from (though MAX_ALLOCATING_THREADS may be a misnomer as well) |
(Sorry was moving house over the weekend) To be perfectly honest I deduced the factor of 2 from the NUM_BUFFERS definition, and empirically it seemed to be necessary. There's not really a cost to increasing that number, though, it just makes the table of pointers larger, and that's trivially small. At the same time, it would be good to either figure out exactly what the "right" number is, or make the whole thing dynamic without impacting performance. |
Making this thing dynamic would be ideal, if it is possible without impacting performance. At the moment, however, I am still struggling to understand if/how the second (the BUFFERS_PER_THREAD) dimension of your local_memory_table is used at all?
Looking at this again, we can probably simplify it. Right now, we have two scenarios: BUFFERS_PER_THREAD was an attempt to capture the maximum number of buffers any thread could allocate, for a) and b). |
My current impression is that they used to allocate more than one buffer, but with the new code any attempt to do this increments the new_memory_table_pos counter same as if it was for an entirely new thread. I may be wrong though. In any case we need to get rid of the current breakage first, I take it just increasing MAX_ALLOCATING_THREADS by a factor of two (or more) would do?
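As a rough model of the failure mode (this is an illustrative toy in Go, not the actual memory.c code; the names and the constant 4 are invented for the sketch): each allocating thread claims one slot in a fixed-size table via a monotonically increasing position counter, and once more threads allocate than the table has slots, the lookup runs off the end — which is what raising MAX_ALLOCATING_THREADS papers over:

```go
package main

import (
	"errors"
	"fmt"
)

// maxAllocatingThreads stands in for MAX_ALLOCATING_THREADS in memory.c;
// the real value is derived from NUM_THREADS at build time.
const maxAllocatingThreads = 4

// memoryTable is a toy analogue of local_memory_table: a fixed array of
// per-thread slots claimed by an ever-increasing position counter
// (analogue of the new_memory_table_pos counter mentioned above).
type memoryTable struct {
	slots [maxAllocatingThreads]int64 // records the owning thread ID
	pos   int                         // next free slot; never decreases
}

// claim registers a new allocating thread, failing once the table is full
// (where the real code would instead index past the end and segfault).
func (t *memoryTable) claim(tid int64) error {
	if t.pos >= len(t.slots) {
		return errors.New("memory table exhausted: too many allocating threads")
	}
	t.slots[t.pos] = tid
	t.pos++
	return nil
}

func main() {
	var table memoryTable
	// One more thread than the table holds triggers the failure.
	for tid := int64(0); tid < maxAllocatingThreads+1; tid++ {
		if err := table.claim(tid); err != nil {
			fmt.Printf("thread %d: %v\n", tid, err)
			return
		}
	}
}
```

The model also shows why re-allocation by an existing thread is as costly as a new thread here: every claim consumes a fresh slot, so the table fills by total claims, not by distinct threads.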
I think so, yes. It just means more pointers in the table, shouldn't hurt anything. |
Closing as 0.3.2 is released now.
I've just tested this with a variety of GOMAXPROCS values (unset, <NUM_CPU, =NUM_CPU and >>NUM_CPU). All pass. Thanks. |
Some of our tests at Gonum that use OpenBLAS as the backing BLAS and LAPACK implementation have recently started to fail due to a segfault. I've bisected it to #1625 which is also supported by OpenBLAS output that I see:
that comes from that PR.
OpenBLAS build info:
Let me know what other information could help investigate this.