
SIGSEGV when calling some CBLAS and LAPACKE functions #1641

Closed
vladimir-ch opened this issue Jun 25, 2018 · 29 comments · Fixed by #1655

@vladimir-ch
Contributor

Some of our tests at Gonum that use OpenBLAS as the backing BLAS and LAPACK implementation have recently started to fail with a segfault. I've bisected it to #1625, which is also supported by the OpenBLAS output that I see:

OpenBLAS : Program will terminate because you tried to start too many threads.

that comes from that PR.

OpenBLAS build info:

 OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)

  OS               ... Linux             
  Architecture     ... x86_64               
  BINARY           ... 64bit                 
  C compiler       ... GCC  (command line : cc)
  Fortran compiler ... GFORTRAN  (command line : gfortran)
  Library Name     ... libopenblas_haswellp-r0.3.1.dev.a (Multi threaded; Max num-threads is 4)

Let me know what other information could help investigate this.

@martin-frbg
Collaborator

Are your gonum tests using fork() at some point? A related issue was noticed in the review phase of #1625, but believed fixed before the PR was merged. @oon3m0oo?

@oon3m0oo
Contributor

Indeed it was, and test_fork should be passing.

You should be able to spawn up to MAX_CPU_NUMBER * 2 * MAX_PARALLEL_NUMBER threads. How many are you attempting to spawn? Are you doing this manually or letting blas_server do it for you?
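For illustration, here is a minimal sketch of how that ceiling is computed; the values below are placeholders, and the real definitions come from the OpenBLAS build configuration:

  #include <stdio.h>

  /* Illustrative values; the real ones come from the OpenBLAS build. */
  #define MAX_CPU_NUMBER      4   /* NUM_THREADS chosen at build time  */
  #define MAX_PARALLEL_NUMBER 1   /* concurrent callers of the library */

  /* ceiling on threads that may register in the memory table */
  #define MAX_ALLOCATING_THREADS (MAX_CPU_NUMBER * 2 * MAX_PARALLEL_NUMBER)

  int main(void) {
    printf("up to %d allocating threads\n", MAX_ALLOCATING_THREADS);
    return 0;
  }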

@vladimir-ch
Contributor Author

Are your gonum tests using fork() at some point?

Our tests don't use fork, and as far as I understand, neither does Go's testing package.

How many are you attempting to spawn?

There is no direct control of OS threads in Go, they are handled by its runtime.

In my testing environment the Go runtime can see 4 logical CPUs. This value can be changed. If I add a debug output to memory.c before if (next_memory_table_pos > MAX_ALLOCATING_THREADS) and run the test with 1 CPU, I can see the debug output 4 times (typically), sometimes 5 times. If I set the number of CPUs to 2, the debug output appears 6 times (typically), sometimes 7 times. And so on, until the segfault.

The debug output is always printed 3 times before the actual test is run, the remaining output is printed during the test. I cannot interpret these numbers because I don't know when get_memory_table is called. Hopefully you can extract something from it.
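For reference, the instrumentation amounts to something like the following. Only the bounds check and the error message are taken from memory.c; the function name and surrounding code are an illustrative sketch:

  #include <stdio.h>
  #include <stdlib.h>

  #define MAX_ALLOCATING_THREADS 8   /* placeholder for the real value */

  static int next_memory_table_pos = 0;

  static void claim_table_slot(void) {
    /* the debug output, placed just before the bounds check */
    fprintf(stderr, "table pos %d (limit %d)\n",
            next_memory_table_pos, MAX_ALLOCATING_THREADS);
    next_memory_table_pos++;
    if (next_memory_table_pos > MAX_ALLOCATING_THREADS) {
      fprintf(stderr, "OpenBLAS : Program will terminate because you tried to start too many threads.\n");
      exit(1);
    }
  }

  int main(void) {
    for (int i = 0; i <= MAX_ALLOCATING_THREADS; i++)
      claim_table_slot();   /* the final iteration trips the limit */
    return 0;
  }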

Are you doing this manually or letting blas_server do it for you?

For this comment I've been using our test for Zgbmv which calls internally cblas_zgbmv.

@martin-frbg
Collaborator

Did your build environment change recently? If you did not specify NUM_THREADS, it will be guessed from the number of cpus seen on the build host - this is where the "Max num-threads is 4" at the end of the build info text comes from.

@vladimir-ch
Contributor Author

I didn't change the build environment, and I don't specify NUM_THREADS. I just use the defaults.

It's not clear to me what changed in memory.c such that previously there was no segfault and now there is. Is the check now more rigorous? Or is there now a new limit on the number of threads?

I'm sorry I can't provide better information; this touches on areas that I'm not very familiar with (Go runtime internals, OpenBLAS internals).

@martin-frbg
Collaborator

The whole thread startup code and the management of the thread-local memory were rewritten, but the expectation was that the old limits would still be valid. I wish there were something less complex than Gonum to reproduce the problem...

@martin-frbg
Collaborator

Possibly a silly suggestion, but could you try to double the value in the definition of MAX_ALLOCATING_THREADS (line 500 of memory.c), just to see if this makes the problem go away or leads to crashes elsewhere?
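Assuming the definition matches the limit quoted earlier (MAX_CPU_NUMBER * 2 * MAX_PARALLEL_NUMBER), the experiment would amount to roughly this change (a sketch, not an actual patch against memory.c):

  /* memory.c, around line 500 -- experiment only */
  -#define MAX_ALLOCATING_THREADS (MAX_CPU_NUMBER * 2 * MAX_PARALLEL_NUMBER)
  +#define MAX_ALLOCATING_THREADS (MAX_CPU_NUMBER * 4 * MAX_PARALLEL_NUMBER)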

@vladimir-ch
Contributor Author

vladimir-ch commented Jun 29, 2018

If I double the value of MAX_ALLOCATING_THREADS, the problem goes away.

I'll try to provide a minimal reproducer for the current value soon.

@martin-frbg
Collaborator

Thank you. That should at least provide a workaround that allows including the new code in 0.3.1.

@vladimir-ch
Contributor Author

I've prepared a reproducer: OpenBLAS_issue_1641.zip

It requires a working Go installation but otherwise is self-contained. To run the test I use the following command inside the unpacked cblas directory:

env LD_LIBRARY_PATH=$HOME/src/OpenBLAS CGO_LDFLAGS="-L$HOME/src/OpenBLAS -lopenblas" go test -v

The code could be smaller but at least it reflects exactly what our test does.

In my effort to minimize it I made an interesting observation. The test generates an exhaustive set of all possible input arguments for some limited number of matrix sizes. It does this with various nested for loops, each looping over a set of values for a given parameter. When I removed one or two of these loops by fixing the corresponding argument to a single value, the number of debug output lines from my printf before line 547 in memory.c went down, and local_memory_table_pos never exceeded MAX_ALLOCATING_THREADS (8 in my case). There was no segfault.

I deliberately don't talk about threads here because I really don't know who is creating them in this case, the Go runtime or OpenBLAS. It is not Gonum or our tests. Does this help?

Let me know if you run into any issues with running the test.

@martin-frbg
Collaborator

Thank you very much. This allowed me to reproduce the problem (after doing a go get golang.org/x/exp/rand), but I am not sure yet if it is just an off-by-one error or if MAX_ALLOCATING_THREADS actually needs to be doubled.
At least I can make the test pass by just adding 1 to the MAX_ALLOCATING_THREADS variable when I run it on a machine with 6 cores/12 threads. However, limiting the OpenBLAS build to NUM_THREADS=4 to mimic your case makes it fail again, unless I double MAX_ALLOCATING_THREADS. (I am not sure yet if or how I could tell Go to limit itself to just four threads, however, so this could well be an expected failure.) I will do some more experiments with both old and new versions of memory.c later to try to determine how many threads or buffers it is actually trying to allocate.

@martin-frbg
Collaborator

Hmm. Same test, but building and running on a dual-core laptop (so 4 threads by default), passes without any issues (and without needing any changes to MAX_ALLOCATING_THREADS). This was with go 1.7.5 rather than go 1.9, however, due to using distribution packages for different OS versions.

@martin-frbg
Collaborator

Setting GOMAXPROCS to the same number as NUM_THREADS makes the tests pass with go 1.9 (and 1 added to MAX_ALLOCATING_THREADS) even with a reduced NUM_THREADS setting. go 1.7.5 just seems to be using a less aggressive default (?)

@martin-frbg
Collaborator

Actually it looks like an off-by-one, and I somehow managed to link to a version with the "+1" already included in my tests with go 1.7.5 on the dual-core box, which gave the contradicting "results"...
On both systems, Go will enter the function to try to set up NUM_THREADS*2 threads, which makes next_memory_table_pos exceed MAX_ALLOCATING_THREADS by exactly 1. (And I recall the initial form of the PR containing a comment that the number of threads here was "the server plus the number of threads in the pool" - maybe it is the server that needs to be counted in here.)
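In other words, the accounting would be (names illustrative, not the memory.c originals):

  /* Off-by-one accounting sketch; names illustrative, not from memory.c. */
  #define NUM_THREADS  4                   /* build-time setting          */
  #define WORKER_SLOTS (NUM_THREADS * 2)   /* what Go provokes here       */
  #define SERVER_SLOT  1                   /* the server needs a slot too */
  /* WORKER_SLOTS + SERVER_SLOT = 9, one more than a limit of 8 */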

@vladimir-ch
Contributor Author

Just to let you know, I still get the segfaults with the latest develop branch. (I focused on Zgbmv in order not to bring LAPACK into the game, but some of our LAPACK tests also fail because of this.) I wrote a smaller reproducer in Go which does not use the testing framework, and I can reproduce the segfault. Then I re-wrote it in C, and there is no segfault. This means that the problem comes from how threads are handled/launched by the Go runtime, so at the moment the behavior is out of our control. We will consider what to do next on our (Gonum) side. Thanks for your help with this.

@kortschak
Contributor

I also see the segfaults with the patch above (for Zgbmv and Dgeev).

@martin-frbg
Collaborator

martin-frbg commented Jul 2, 2018

Do you still see the segfaults when you set GOMAXPROCS to the number of threads shown in the OpenBLAS build message (i.e., the NUM_THREADS in use for building OpenBLAS)? This should matter only when running Go/Gonum on a bigger machine than the one used for the OpenBLAS build, but that is the only situation where I still saw segfaults with the patch. (OpenBLAS should handle this more gracefully, but that is another topic - #525.)
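For example, with a NUM_THREADS=4 build, the reproducer invocation from earlier would become something like (same hypothetical paths as before):

env GOMAXPROCS=4 LD_LIBRARY_PATH=$HOME/src/OpenBLAS CGO_LDFLAGS="-L$HOME/src/OpenBLAS -lopenblas" go test -v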

@martin-frbg reopened this Jul 2, 2018
@martin-frbg
Collaborator

Re-running your reproducer, I now see the strange effect that most runs start 24 threads or fewer (and thus run flawlessly on the six-core cpu), but every now and then Go tries to get a 25th for no apparent reason, and OpenBLAS crashes. Even the successful runs differ in the distribution of the thread invocations: sometimes Go has called for 10 threads before "RUN TestZgbmv" is printed, sometimes 11 (but then goes for a maximum of only 23), or just 9 and then 24. This behaviour appears to be totally random, apparently caused by the random nature of the input matrix elements - so the reproducer is able to reproduce both "fixed" and "broken" results.
I have not yet tried to run this with the "old" memory.c to see how many times the corresponding(?) routine gets entered there. @oon3m0oo any idea what is going on?

@marrowsuck

I'm also seeing this error
OpenBLAS : Program will terminate because you tried to start too many threads.
whilst building Octave.
The error is triggered by a test script in the makefile.

If I set OMP_NUM_THREADS=1 the test passes.

openblas_config= USE_64BITINT= DYNAMIC_ARCH= DYNAMIC_OLDER= NO_CBLAS= NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= SANDYBRIDGE MAX_THREADS=4
version=0.3.1
extralib=-lm -lpthread -lgfortran -lm -lpthread -lgfortran

OpenBLAS was built with gcc-8.1.1.

@martin-frbg
Collaborator

@marrowsuck you can probably get around this by setting NUM_THREADS to a higher number at build time.
@oon3m0oo I am not sure if I am comparing apples and oranges, but there seems to be a huge number of calls to blas_memory_alloc with "procpos" set to 1 (that is, originating from the STACK_ALLOC macro in common_stackalloc.h) now compared to a few weeks ago.
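For context, the pattern behind such a macro is roughly the following. This is a hypothetical simplification (malloc() stands in for blas_memory_alloc(), which is what actually registers in the memory table), not the real common_stackalloc.h code:

  #include <stdio.h>
  #include <stdlib.h>

  /* Hypothetical threshold and macro: small scratch buffers stay on the
     stack, larger ones fall back to heap allocation. */
  #define MAX_STACK_ALLOC 2048

  #define SCRATCH_PTR(bytes, stackbuf) \
    ((bytes) <= sizeof(stackbuf) ? (void *)(stackbuf) : malloc(bytes))

  int main(void) {
    char small[MAX_STACK_ALLOC];
    void *p = SCRATCH_PTR(64, small);       /* stack: no table traffic  */
    void *q = SCRATCH_PTR(1 << 20, small);  /* heap: would claim a slot */
    printf("%p %p\n", p, q);
    if (q != (void *)small) free(q);
    return 0;
  }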

@kortschak
Contributor

Do you still see the segfaults when you set GOMAXPROCS to the number of threads as shown on the OpenBLAS build message (i.e., the NUM_THREADS in use for building OpenBLAS) ?

The default GOMAXPROCS is now num_cpu, which does match the NUM_THREADS reported at the completion of make, but explicitly setting GOMAXPROCS to that value still shows the segfault behaviour. Setting GOMAXPROCS to any value lower than 6 (for a build with NUM_THREADS=8), though, allows the Zgbmv test to complete without segfaulting (anything higher than this does fail).

@martin-frbg
Collaborator

So it seems the crude "factor of 2" correction to MAX_ALLOCATING_THREADS was correct after all. But I still cannot claim to understand where it comes from (though MAX_ALLOCATING_THREADS may be a misnomer as well).

@oon3m0oo
Contributor

oon3m0oo commented Jul 3, 2018

(Sorry was moving house over the weekend)

To be perfectly honest I deduced the factor of 2 from the NUM_BUFFERS definition, and empirically it seemed to be necessary. There's not really a cost to increasing that number, though, it just makes the table of pointers larger, and that's trivially small. At the same time, it would be good to either figure out exactly what the "right" number is, or make the whole thing dynamic without impacting performance.

@martin-frbg
Collaborator

Making this thing dynamic would be ideal, if it is possible without impacting performance. At the moment, however, I am still struggling to understand if/how the second (BUFFERS_PER_THREAD) dimension of your local_memory_table is used at all?

@oon3m0oo
Contributor

oon3m0oo commented Jul 3, 2018

Looking at this again, we can probably simplify it. Right now, we have two scenarios:
a) No threading or using OpenMP
b) Using pthreads
In a), there is only one thread that allocates, and in OpenMP's case the server allocates memory for all the threads. So we end up using a 1xNUM_BUFFERS array. In b), there are pretty much NUM_BUFFERS+1 allocators (server + workers), and each of the workers manages its own memory. Do they ever need more than one buffer? If not, then we have the same thing as before. The server might do more, though; I'm not 100% sure.

BUFFERS_PER_THREAD was an attempt to capture the maximum number of buffers any thread could allocate, for a) and b).
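A sketch of the layout being discussed, with illustrative sizes rather than the memory.c definitions:

  /* Illustrative sizes, not the memory.c definitions. */
  #define MAX_ALLOCATING_THREADS 8
  #define BUFFERS_PER_THREAD     2

  static void *local_memory_table[MAX_ALLOCATING_THREADS][BUFFERS_PER_THREAD];

  /* a) one allocator: only row 0 is used, i.e. a 1 x NUM_BUFFERS view */
  /* b) pthreads: each worker owns a row; the second dimension matters */
  /*    only if a single thread ever holds more than one buffer        */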

@martin-frbg
Collaborator

My current impression is that they used to allocate more than one buffer, but with the new code any attempt to do this increments the next_memory_table_pos counter just as if it were for an entirely new thread (see the sketch below). I may be wrong though. In any case, we need to get rid of the current breakage first; I take it just increasing MAX_ALLOCATING_THREADS by a factor of two (or more) would do?
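To illustrate the distinction: if rows were found by thread id, a second buffer for the same thread would reuse its row, and only a lookup miss would grow the counter. A hedged sketch, not the memory.c original:

  #include <pthread.h>

  /* Illustrative thread-keyed lookup, not the memory.c original. */
  #define MAX_ALLOCATING_THREADS 8

  static pthread_t owner[MAX_ALLOCATING_THREADS];
  static int next_memory_table_pos = 0;

  static int row_for(pthread_t self) {
    for (int i = 0; i < next_memory_table_pos; i++)
      if (pthread_equal(owner[i], self))
        return i;                     /* existing thread: row is reused */
    owner[next_memory_table_pos] = self;
    return next_memory_table_pos++;   /* lookup miss: counter grows */
  }

  int main(void) {
    return row_for(pthread_self());   /* first caller claims row 0 */
  }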

@oon3m0oo
Contributor

oon3m0oo commented Jul 3, 2018

I think so, yes. It just means more pointers in the table, shouldn't hurt anything.

@martin-frbg
Collaborator

Closing, as 0.3.2 is released now.

@kortschak
Contributor

I've just tested this with a variety of GOMAXPROCS values (unset, <NUM_CPU, =NUM_CPU and >>NUM_CPU). All pass. Thanks.
