
Speed up extrema 3-50x #58280


Open, wants to merge 2 commits into master
Conversation

@mbauman (Member) commented Apr 29, 2025

This method is no longer an optimization over what Julia can do with the naive definition on most (if not all) architectures.

As with #58267, I asked for a smattering of crowdsourced multi-architecture benchmarking of this simple example:

using BenchmarkTools
A = rand(10000);
b1 = @benchmark extrema($A)
b2 = @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
println("$(Sys.CPU_NAME): $(round(median(b1).time/median(b2).time, digits=1))x faster")

With results:

cortex-a72: 13.2x faster
cortex-a76: 15.8x faster
neoverse-n1: 16.4x faster
neoverse-v2: 23.4x faster
a64fx: 46.5x faster

apple-m1: 54.9x faster
apple-m4*: 43.7x faster

znver2: 8.6x faster
znver4: 12.8x faster
znver5: 16.7x faster

haswell (32-bit): 3.5x faster
skylake-avx512: 7.4x faster
rocketlake: 7.8x faster
alderlake: 5.2x faster
cascadelake: 8.8x faster
cascadelake: 7.1x faster

The results are even more dramatic for Float32s, here on my M1:

julia> A = rand(Float32, 10000);

julia> @benchmark extrema($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  49.083 μs … 151.750 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     49.375 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.731 μs ±   2.350 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅██▅▁       ▁▂▂       ▁▂▁                                    ▂
  ██████▇▇▇▇█▇████▇▆▆▆▇▇███▇▇▆▆▆▅▆▅▃▄▃▄▅▄▄▆▆▅▃▁▄▃▅▄▅▄▄▁▄▄▅▃▄▁▄ █
  49.1 μs       Histogram: log(frequency) by time      56.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
BenchmarkTools.Trial: 10000 samples with 191 evaluations per sample.
 Range (min … max):  524.435 ns … 1.104 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     525.089 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   529.323 ns ± 20.876 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃      ▁ ▃▃                                                 ▁
  █████▇███▇███▇▇▇▇▇▇▇▇▅▆▆▆▆▆▅▅▄▆▃▄▄▃▅▅▄▃▅▄▄▄▅▅▅▃▅▄▄▁▄▄▅▆▄▄▅▄▅ █
  524 ns        Histogram: log(frequency) by time       609 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Closes #34790, closes #31442, closes #44606.

@mbauman added the performance (Must go faster) and fold (sum, maximum, reduce, foldl, etc.) labels on Apr 29, 2025
@giordano (Contributor)

Perhaps this and #58267 would make good benchmarks in https://github.com/JuliaCI/BaseBenchmarks.jl, to keep better track of LLVM upgrades?
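
A minimal sketch of how these could be registered in a BaseBenchmarks.jl-style suite (the group and key names below are hypothetical, not the actual layout of that repository):

using BenchmarkTools

# Hypothetical suite layout: benchmarks are organized into nested
# BenchmarkGroups under a top-level SUITE.
const SUITE = BenchmarkGroup()
g = addgroup!(SUITE, "reduce")

A = rand(10000)
g["extrema"] = @benchmarkable extrema($A)
g["extrema_mapreduce"] = @benchmarkable mapreduce(x -> (x, x),
    ((lo1, hi1), (lo2, hi2)) -> (min(lo1, lo2), max(hi1, hi2)), $A)

# run(SUITE) would then execute both, allowing before/after comparisons
# across LLVM upgrades.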

@giordano (Contributor)

Test failures look relevant

@mbauman (Member, author) commented Apr 30, 2025

Yeah, interesting. Looks like some platforms don't maintain a consistent argument ordering of NaNs. I'm not sure if that's

  • something that we actually need extrema to satisfy; maybe we can relax this test?
  • something that we actually guarantee that min and max must satisfy; maybe this is a separate bug?

Here's the MWE on a skylake-avx512 machine:

julia> f((min1, max1), (min2, max2)) = (min(min1, min2), max(max1, max2));

julia> x = -NaN, -NaN;

julia> y = NaN, NaN;

julia> signbit.(f(x, y))
(false, true)

julia> signbit(min(-NaN,NaN))
false

julia> signbit(max(-NaN,NaN))
true

The test asserts that the signs returned by f here are the same, i.e. that it either always returns the first arguments to min/max or always the second. That mismatch doesn't happen on v1.11, but it does on beta2 and master.

@oscardssmith (Member)

In general LLVM (and therefore we) make no guarantees about which NaN you will get.

@mbauman (Member, author) commented Apr 30, 2025

OK, great, then we can just relax that test. I suspect it was written trying to allow for any ordering, but it missed the heterogeneous case.
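
For reference, a sketch of what a relaxed check could look like, assuming the test only needs to verify that NaNs propagate, not which NaN comes back (this is illustrative, not the actual test in Base's suite):

julia> using Test

julia> f((min1, max1), (min2, max2)) = (min(min1, min2), max(max1, max2));

julia> lo, hi = f((-NaN, -NaN), (NaN, NaN));

julia> # Only require NaN results; don't constrain the sign bit, since LLVM
       # makes no guarantee about which NaN min/max return.
       @test isnan(lo) && isnan(hi)
Test Passed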
