Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit initial OpenBLAS thread count #46844

Merged
merged 2 commits into from
Nov 15, 2022
Merged

Conversation

staticfloat
Copy link
Member

We set OpenBLAS's initial thread count to 1 to prevent runaway allocation within OpenBLAS's initial thread startup. LinearAlgebra will later call BLAS.set_num_threads() to the actual value we require.

@vtjnash
Copy link
Member

vtjnash commented Sep 20, 2022

Also GOTO_NUM_THREADS? And I think you can recover the test for this from
https://github.com/JuliaLang/julia/pull/42442/files#diff-e299fbf3f1935b1bb9e7ef61e4644dda078ae51fab7559caba8da03817943c9b

@staticfloat
Copy link
Member Author

staticfloat commented Sep 20, 2022

I think you can recover the test for this from
https://github.com/JuliaLang/julia/pull/42442/files#diff-e299fbf3f1935b1bb9e7ef61e4644dda078ae51fab7559caba8da03817943c9b

I don't see how that is testing this functionality; that's testing that we limit the number of threads to a maximum of 8, but that's not what we do anymore; and in fact, our logic is much closer to what OpenBLAS itself natively does, so it's difficult to test that this is working.

@vtjnash
Copy link
Member

vtjnash commented Sep 20, 2022

Ah, true. Now that we are later also explicitly changing it, we would need to do that check in __init__, before we set it.

@maleadt maleadt added the backport 1.8 Change should be backported to release-1.8 label Sep 21, 2022
staticfloat added a commit to JuliaCI/sandboxed-buildkite-agent that referenced this pull request Sep 21, 2022
@JeffBezanson
Copy link
Member

LinearAlgebra.__init__ only calls set_num_threads if !haskey(ENV, "OPENBLAS_NUM_THREADS"); does this interfere with that?

@staticfloat
Copy link
Member Author

No; so what happens is, if a thread count is already set, neither this logic nor the LinearAlgebra __init__() logic messes with it.

@vtjnash
Copy link
Member

vtjnash commented Sep 22, 2022

I think I agree with Jeff that this will break LinearAlgebra.__init__, since it will be set now

@staticfloat
Copy link
Member Author

Oh, of course, my bad. I totally misunderstood the question, and confused myself.

One possibility is to unset the environment variable after dlopen() has returned, but I'm not crazy about that idea. What do you think, Jameson?

@vtjnash
Copy link
Member

vtjnash commented Sep 22, 2022

I would perhaps rather just patch openblas to default to 1 thread at startup, though this might harm other apps that use it

@staticfloat
Copy link
Member Author

Okay, I'll just add a short-circuit to OpenBLAS's default thread calculation then. X-ref: JuliaPackaging/Yggdrasil#5555

staticfloat added a commit to staticfloat/OpenBLAS that referenced this pull request Sep 22, 2022
This allows Julia to set a default number of threads (usually `1`) to be
used when no other thread counts are specified [0], to short-circuit the
default OpenBLAS thread initialization routine that spins up a different
number of threads than Julia would otherwise choose.

The reason to add a new environment variable is that we want to be able
to configure OpenBLAS to avoid performing its initial memory
allocation/thread startup, as that can consume significant amounts of
memory, but we still want to be sensitive to legacy codebases that set
things like `OMP_NUM_THREADS` or `GOTOBLAS_NUM_THREADS`.  Creating a new
environment variable that is openblas-specific and is not already
publicly used to control the overall number of threads of programs like
Julia seems to be the best way forward.

[0] JuliaLang/julia#46844
@staticfloat
Copy link
Member Author

I am attempting to upstream this patch here: OpenMathLib/OpenBLAS#3773

@PallHaraldsson
Copy link
Contributor

PallHaraldsson commented Sep 27, 2022

I'm exited to see this merged, so I can test/use on master, since its only point is faster Julia startup, right? How faster expected?

So, is the "1 failing" check a false alarm? Does the test need to be rerun?

At the very start for freebsd64:

fcntl(): Bad file descriptor
Executing precompile statements... 124/1927
fcntl(): Bad file descriptor
[..]
Executing precompile statements... 682/1927
Killed
gmake[1]: *** [sysimage.mk:89: /usr/home/julia/buildbot/w1_builder/package_freebsd64/build/usr/lib/julia/sys-o.a] Error 1
Executing precompile statements... 683/1927
[..]

@staticfloat staticfloat force-pushed the sf/openblas_threading_control branch 2 times, most recently from 184b918 to 89969c6 Compare September 29, 2022 18:29
staticfloat added a commit to staticfloat/OpenBLAS that referenced this pull request Sep 29, 2022
This allows Julia to set a default number of threads (usually `1`) to be
used when no other thread counts are specified [0], to short-circuit the
default OpenBLAS thread initialization routine that spins up a different
number of threads than Julia would otherwise choose.

The reason to add a new environment variable is that we want to be able
to configure OpenBLAS to avoid performing its initial memory
allocation/thread startup, as that can consume significant amounts of
memory, but we still want to be sensitive to legacy codebases that set
things like `OMP_NUM_THREADS` or `GOTOBLAS_NUM_THREADS`.  Creating a new
environment variable that is openblas-specific and is not already
publicly used to control the overall number of threads of programs like
Julia seems to be the best way forward.

[0] JuliaLang/julia#46844
staticfloat added a commit to staticfloat/OpenBLAS that referenced this pull request Sep 29, 2022
This allows Julia to set a default number of threads (usually `1`) to be
used when no other thread counts are specified [0], to short-circuit the
default OpenBLAS thread initialization routine that spins up a different
number of threads than Julia would otherwise choose.

The reason to add a new environment variable is that we want to be able
to configure OpenBLAS to avoid performing its initial memory
allocation/thread startup, as that can consume significant amounts of
memory, but we still want to be sensitive to legacy codebases that set
things like `OMP_NUM_THREADS` or `GOTOBLAS_NUM_THREADS`.  Creating a new
environment variable that is openblas-specific and is not already
publicly used to control the overall number of threads of programs like
Julia seems to be the best way forward.

[0] JuliaLang/julia#46844
deps/checksums/openblas Outdated Show resolved Hide resolved
staticfloat added a commit to staticfloat/OpenBLAS that referenced this pull request Sep 30, 2022
This allows Julia to set a default number of threads (usually `1`) to be
used when no other thread counts are specified [0], to short-circuit the
default OpenBLAS thread initialization routine that spins up a different
number of threads than Julia would otherwise choose.

The reason to add a new environment variable is that we want to be able
to configure OpenBLAS to avoid performing its initial memory
allocation/thread startup, as that can consume significant amounts of
memory, but we still want to be sensitive to legacy codebases that set
things like `OMP_NUM_THREADS` or `GOTOBLAS_NUM_THREADS`.  Creating a new
environment variable that is openblas-specific and is not already
publicly used to control the overall number of threads of programs like
Julia seems to be the best way forward.

[0] JuliaLang/julia#46844
@KristofferC KristofferC mentioned this pull request Sep 30, 2022
37 tasks
@staticfloat
Copy link
Member Author

Sadly, I can confirm that while this does help a little, it's not very significant. Here are the "commit charge" graphs for the start of the testset for both master and this branch:

master

master

this branch

branch

Isolation tests

With gdb, I have verified that the changes cause OpenBLAS to see only one thread at startup (which is subsequently increased when LinearAlgebra.__init__() sets the number of threads). Looking at the startup commit charge for master versus this branch, I do see a difference:

image

In the above screenshot, PID 5820 is this branch, whereas PID 2052 is master, and both are simply sitting at the REPL after startup with no other options applied. So we should be saving some amount of memory, but apparently it is insignificant compared to the memory that we are using during testing.

@KristofferC KristofferC mentioned this pull request Nov 8, 2022
26 tasks
@staticfloat staticfloat added the backport 1.9 Change should be backported to release-1.9 label Nov 15, 2022
@staticfloat
Copy link
Member Author

Alright, we finally managed to do some more testing on this on some large core-count Windows machines, and this helps significantly, so I'm going to merge.

@staticfloat staticfloat merged commit 58b559f into master Nov 15, 2022
@staticfloat staticfloat deleted the sf/openblas_threading_control branch November 15, 2022 17:36
@KristofferC
Copy link
Member

Just for my understanding, for a large core-count machine this would at most save half of the memory compared to before? With this change, we set OpenBLAS to 1 thread, then LinearAlgebra.__init__ runs:

BLAS.set_num_threads(max(1, Sys.CPU_THREADS ÷ 2))

and set the number of OpenBLAS threads to Sys.CPU_THREADS ÷ 2 (which is half the threads OpenBLAS would set earlier`)

If so, even with this, it is probably worth doing something like #46844 (setting the number of openblas threads to 1 for most of the test suite).

@DilumAluthge
Copy link
Member

Yeah, ref JuliaCI/julia-buildkite#247 for changes specific to CI.

@staticfloat
Copy link
Member Author

Actually, something seemed fishy to me about how much this helped the internal workload, and I just tried the following on Julia v1.8.2:

julia> using Distributed
       addprocs(1)
       @everywhere begin
           using LinearAlgebra
           @show LinearAlgebra.BLAS.get_num_threads()
       end

LinearAlgebra.BLAS.get_num_threads() = 16
      From worker 2:    LinearAlgebra.BLAS.get_num_threads() = 1

So it appears that Distributed already attempts to set the BLAS threads to 1. Looking into the source, I found:

So what this means is that we actually already try to restrict OpenBLAS to 1 thread on CI, but because OpenBLAS used to start up and immediately set a number of threads, we would have the problem of initially starting up and consuming a bunch of memory, then never letting go of it.

So all that being said, I believe that with this PR merged, we've actually solved the root problem. We should watch it closely on CI, but I think we've only ever tried to exercise single-threaded BLAS on CI so far, so we should just continue to do that. :)

@KristofferC
Copy link
Member

KristofferC commented Nov 15, 2022

That has to run after LinearAlgebra.__init__, no? So how does that help with memory? OpenBLAS doesn't free buffers if you reduce the numer of threads, right? So LinearAlgebra will bump the threads then Distributed will reduce it, but at that point it is already too late?

I was actually about to open an issue that we need to set the env var for OpenBLAS when spawning distributed workers exactly due to the argumentation above.

Is there a way to query OpenBLAS how many buffer it currently has allocated?

@staticfloat
Copy link
Member Author

Ah, you're right; with this PR we have moved from:

OpenBLAS DLL init (N threads) -> LinearAlgebra.__init__() (N/2 threads) -> disable_library_threading() (1 thread)

to now:

OpenBLAS DLL init (1 thread) -> LinearAlgebra.__init__() (N/2 threads) -> disable_library_threading() (1 thread)

So perhaps we need to change __init__() to not set those threads when we're a Distributed worker.

@KristofferC
Copy link
Member

So perhaps we need to change init() to not set those threads when we're a Distributed worker.

Yes, or set env var when spawning workers unless OpenBLAS threading is explicitly enabled.

@staticfloat
Copy link
Member Author

Yes, or set env var when spawning workers unless OpenBLAS threading is explicitly enabled.

I would be totally okay with that, but I'm not sure exactly where to put that, as I have a hard time disentangling the different cluster managers in Distributed, to find the local Distributed code paths, as opposed to the remote/SSH ones.

@KristofferC
Copy link
Member

Is there anyway we could be more lazy and not set the number of BLAS threads explicitly until a BLAS operation is actually executed?

@staticfloat
Copy link
Member Author

There is always the option of adding an if statement at the beginning of every BLAS operation, but I don't know how feasible that is.

@KristofferC
Copy link
Member

KristofferC commented Nov 16, 2022

Can it be done with libblastrampoline somehow? That intercepts every call going into BLAS so maybe the logic can be there?

@PallHaraldsson
Copy link
Contributor

PallHaraldsson commented Nov 16, 2022

Thanks for merging!

on some large core-count Windows machines, and this helps significantly

It also helps on my (16-vthread) Linux taking 4.4 ms off (for down to 152.7 ms min. But best mean time is though 2.5 ms worse, likely because I got very lucky with the max value) and every ms counts for some benchmarking (where we're really close to other languages). I've gotten startup down by half with a non-default sysimage, for an older version, a much larger effect, but this too would help (yes, it's only 2.8% faster startup, with sysimage the gain should be amplified to about 5.6% faster).

$ hyperfine 'OPENBLAS_NUM_THREADS=1 julia-1.10-DEV-ee0f3fc334/bin/julia --startup-file=no --history-file=no --compile=min -O0 -e ""'
Benchmark 1: OPENBLAS_NUM_THREADS=1 julia-1.10-DEV-ee0f3fc334/bin/julia --startup-file=no --history-file=no --compile=min -O0 -e ""
  Time (mean ± σ):     181.9 ms ±  14.8 ms    [User: 110.3 ms, System: 71.7 ms]
  Range (min … max):   152.7 ms … 197.4 ms    15 runs

$ hyperfine 'julia-1.10-DEV-ee0f3fc334/bin/julia --startup-file=no --history-file=no --compile=min -O0 -e ""'
Benchmark 1: julia-1.10-DEV-ee0f3fc334/bin/julia --startup-file=no --history-file=no --compile=min -O0 -e ""
  Time (mean ± σ):     173.9 ms ±  14.0 ms    [User: 101.3 ms, System: 72.6 ms]
  Range (min … max):   157.1 ms … 194.3 ms    15 runs

I ran hyperfine many times, could never get min that far down without the ENV, while with it often higher than the min for without it.

I don't think I'm just measuring noise, some real effect, though the machine likely heats up while benchmarking. I tried to minimize all noise, turned off the web browser, but load didn't go down to 0.

KristofferC pushed a commit that referenced this pull request Nov 17, 2022
* Limit initial OpenBLAS thread count

We set OpenBLAS's initial thread count to `1` to prevent runaway
allocation within OpenBLAS's initial thread startup.  LinearAlgebra will
later call `BLAS.set_num_threads()` to the actual value we require.

* Support older names

(cherry picked from commit 58b559f)
KristofferC pushed a commit that referenced this pull request Nov 17, 2022
* Limit initial OpenBLAS thread count

We set OpenBLAS's initial thread count to `1` to prevent runaway
allocation within OpenBLAS's initial thread startup.  LinearAlgebra will
later call `BLAS.set_num_threads()` to the actual value we require.

* Support older names

(cherry picked from commit 58b559f)
KristofferC pushed a commit that referenced this pull request Nov 28, 2022
* Limit initial OpenBLAS thread count

We set OpenBLAS's initial thread count to `1` to prevent runaway
allocation within OpenBLAS's initial thread startup.  LinearAlgebra will
later call `BLAS.set_num_threads()` to the actual value we require.

* Support older names

(cherry picked from commit 58b559f)
@KristofferC KristofferC removed the backport 1.8 Change should be backported to release-1.8 label Dec 16, 2022
@KristofferC KristofferC removed the backport 1.9 Change should be backported to release-1.9 label Dec 27, 2022
KristofferC added a commit that referenced this pull request Dec 31, 2022
This was add to OpenBLAS in OpenMathLib/OpenBLAS#3773 and was supposed to be used in #46844 but was likely typod
KristofferC added a commit that referenced this pull request Jan 1, 2023
…#48064)

This was add to OpenBLAS in OpenMathLib/OpenBLAS#3773 and was supposed to be used in #46844 but was likely typod
KristofferC added a commit that referenced this pull request Jan 2, 2023
…#48064)

This was add to OpenBLAS in OpenMathLib/OpenBLAS#3773 and was supposed to be used in #46844 but was likely typod

(cherry picked from commit 75bc5ee)
KristofferC added a commit that referenced this pull request Jan 2, 2023
…#48064)

This was add to OpenBLAS in OpenMathLib/OpenBLAS#3773 and was supposed to be used in #46844 but was likely typod

(cherry picked from commit 75bc5ee)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants