-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use max(1, Sys.CPU_THREADS)
BLAS threads for aarch64
.
#46085
Conversation
None of the common ARM CPUs that come to mind have hyperthreading.
Looks good to me. Probably should be backported to 1.8 too. |
@@ -582,7 +582,11 @@ function __init__() | |||
Base.at_disable_library_threading(() -> BLAS.set_num_threads(1)) | |||
|
|||
if !haskey(ENV, "OPENBLAS_NUM_THREADS") | |||
BLAS.set_num_threads(max(1, Sys.CPU_THREADS ÷ 2)) | |||
if Sys.ARCH === :aarch64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is right. It's only on CPUs where we know Sys.CPU_THREADS
is already the "right" number of threads for BLAS that we should use Sys.CPU_THREADS
instead of Sys.CPU_THREADS ÷ 2
. Unfortunately we don't have any way at the moment to query this, and the only system where we know for sure we have to do it is the M1 (because we have specific code for setting the value Sys.CPU_THREADS
), which incidentally coincides with all known aarch64-apple-darwin platforms. Also, this should use @static
and I think a better test is Sys.isapple() && Base.BinaryPlatforms.arch(Base.BinaryPlatforms.HostPlatform()) == "aarch64"
(HostPlatform()
does a normalisation of the arch, which in some cases may be arm64
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of the common ARM CPUs that come to mind have hyperthreading.
Probably it's only/mainly Intel which does hyperthreading no? You're excluding 32-bit ARM, which I also doubt does hyperthreading
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AMD Cpus also have it (it's called SMT for them). Power I think does it too. For someone running asahi for example this is probably wrong, same for someone running the new Intel ones. But it was wrong before and i don't think we figured out a way to query performance cores on linux.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Power I think does it too.
Power does 4x or 8x, depending on the chip, meaning we may want to divide by 4 or 8 for Power? Is that detectable?
For someone running asahi for example this is probably wrong
On asahi, would the 4big4L read as 8 threads, and 8big2L read as 10 threads?
Just by chance, 4 would've been correct for the former, but 5 for the latter isn't exactly ideal either. I'd be curious whether 5 or 10 are faster.
It's only on CPUs where we know
Sys.CPU_THREADS
is already the "right" number of threads for BLAS that we should useSys.CPU_THREADS
instead ofSys.CPU_THREADS ÷ 2
I think this is a rather strong statement, in that 5 doesn't seem obviously better than 10 for 8 big core + 2 little core systems.
Also, this should use
@static
Seems unnecessary when using Sys.ARCH
, given the syntax is valid and the compiler can optimize it away just fine.
julia> foo() = Sys.ARCH === :aarch64
foo (generic function with 1 method)
julia> @code_typed foo()
CodeInfo(
1 ─ return false
) => Bool
but I can use @static
when switching to Base.BinaryPlatforms.arch(Base.BinaryPlatforms.HostPlatform()) == "aarch64"
I think a better test is
Sys.isapple()
Do you think Asahi linux is more common than Raspberry Pi 4, Graviton, Ampere, A64FX?
Not sure about Nvidea's products.
If we start seeing more ARM laptops, they'll probably have performance + efficiency cores for Windows and Linux.
So it's possible, but I do not know the market shares here.
There are Intel CPUs without hyperthreading, and you can also turn it off in the bios for CPUs with it (this is reasonably common in HPC environments).
So combined with the P+E cores from Alder Lake and beyond, this all does get pretty messy. Would be interesting to see benchmarks there.
Intel advertises their E cores as being efficient in terms of performance/area, not performance/power, so they're marketing their E cores as improving multithreading performance. Their next generation of CPUs will add a lot more E cores without adding extra P cores. Maybe we really do want to use 1 BLAS thread per E core, hoping that the BLAS library does reasonable load balancing (which isn't a given; MKL seems to, but I wouldn't bet on OpenBLAS or BLIS). We'd still probably only want 1 thread per P core, and these unfortunately do have hyperthreading...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alder lake is really annoying because the E cores don't have SMT but the P cores do. Ideally we should always find the number of P cores and set NUM_THREADS to that.
A quick test shows that, at least for BLAS, adding the E cores makes it quite a bit slower.
versioninfo()
Julia Version 1.9.0-DEV.688
Commit 3eaed8b54c (2022-05-31 15:31 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × 12th Gen Intel(R) Core(TM) i5-12600K
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, alderlake)
Threads: 6 on 16 virtual cores
Environment:
JULIA_NUM_THREADS = 6
JULIA_NUM_PRECOMPILE_TASKS = 6
julia> using LinearAlgebra
A1=rand(8192,8192); A2=copy(A1);
julia> @btime lu!($A1);
1.022 s (2 allocations: 64.05 KiB)
julia> BLAS.set_num_threads(10)
julia> @btime lu!($A2);
1.847 s (2 allocations: 64.05 KiB)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about with using MKL
? I'm curious if MKL does better.
I get the argument for mixing cores optimized for absolute performance with cores optimized for performance/area, but with it hurting benchmarks, so far that sounds more like marketing than reality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you also set JULIA_NUM_PRECOMPILE_TASKS = 6
, do you think it even hurts this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I set that because in macos there were some issues with the E cores and precompiling (it made the sound crackle). With MKL and 6 cores it didn't change much, but with 10 it didn't get worse interestingly. So mkl might do something not to use them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So mkl might do something not to use them
That seems likely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, when I said "Intel" above I was referring to x86-64 in general, not chips made by Intel specifically.
Anyway, since it was said that most ARM CPUs (and not just Aaarch64, and not just on macOS) will typically not have multithreading, then the condition should change to
Base.BinaryPlatforms.arch(Base.BinaryPlatforms.HostPlatform()) ∈ ("aarch64", "armv6l", "armv7l")
? It remains the already mentioned problem of Asahi Linux, where we don't set correctly the total number of CPUs, but that's a more general problem of detecting the efficiency/power cores, and in any case this is only a default value of threads, which can be always changed (and people strongly caring about the precise number of threads to be used on their machine should set it manually to be safe against changes of defaults in any case).
Use binarybuilder to normalize arch, and for now only use 1 BLAS thread per CPU thread for `Sys.isapple()` and `aarch64`
(cherry picked from commit 97df6db)
@@ -582,7 +582,11 @@ function __init__() | |||
Base.at_disable_library_threading(() -> BLAS.set_num_threads(1)) | |||
|
|||
if !haskey(ENV, "OPENBLAS_NUM_THREADS") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, this should look at GOTO_NUM_THREADS
AND OMP_NUM_THREADS
as well: https://github.com/xianyi/OpenBLAS/blob/c43ec53bdd00d9423fc609d7b7ecb35e7bf41b85/README.md#setting-the-number-of-threads-using-environment-variables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GOTO_NUM_THREADS comes from the original GOTOBLAS library. Nobody should be using it!
For OMP, I thought that only applies if you build with OpenMP for threading - which we don't. I think we should only look at OPENBLAS_NUM_THREADS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a matter of fact OpenBLAS respects those variables. If you set them you can see with BLAS.get_num_threads that OpenBLAS changes the setting accordingly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup it does. I was just saying that we may only want to use one of them and only document that. However, I don't have a strong opinion if we want to use the other ones - but what should we do if more than one of those environment variables are present?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If any of those variables is set, we shouldn't override it. See #46118 🙂
None of the common ARM CPUs that come to mind have hyperthreading.
Fixes #46071