Speed up extrema 3-50x #58280

mbauman · 2025-04-29T20:11:07Z

This method is no longer an optimization over what Julia can do with the naive definition on most (if not all) architectures.

Like #58267, I asked for a smattering of crowdsourced multi-architecture benchmarking of this simple example:

using BenchmarkTools
A = rand(10000);
b1 = @benchmark extrema($A)
b2 = @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
println("$(Sys.CPU_NAME): $(round(median(b1).time/median(b2).time, digits=1))x faster")

With results:

cortex-a72: 13.2x faster
cortex-a76: 15.8x faster
neoverse-n1: 16.4x faster
neoverse-v2: 23.4x faster
a64fx: 46.5x faster

apple-m1: 54.9x faster
apple-m4*: 43.7x faster

znver2: 8.6x faster
znver4: 12.8x faster
znver5: 16.7x faster

haswell (32-bit): 3.5x faster
skylake-avx512: 7.4x faster
rocketlake: 7.8x faster
alderlake: 5.2x faster
cascadelake: 8.8x faster
cascadelake: 7.1x faster
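
For reference, the naive definition used as `b2` above can be wrapped in a named function and checked against `extrema`. This is only a sketch: `naive_extrema` is a hypothetical name chosen for illustration, not something defined in this PR, and the body simply mirrors the `mapreduce` call from the benchmark.

```julia
# Hypothetical helper name; the body mirrors the mapreduce call benchmarked above.
naive_extrema(A) = mapreduce(x -> (x, x),
                             ((min1, max1), (min2, max2)) -> (min(min1, min2), max(max1, max2)),
                             A)

A = rand(10000)

# Both definitions should agree on NaN-free input.
@assert naive_extrema(A) == extrema(A)
@assert naive_extrema([3.0, -1.0, 2.0]) == (-1.0, 3.0)
```

Each element is mapped to a `(min, max)` pair, and pairs are combined elementwise, so the reduction carries both running extrema in a single pass.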

The results are even more dramatic for Float32s, here on my M1:

julia> A = rand(Float32, 10000);

julia> @benchmark extrema($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  49.083 μs … 151.750 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     49.375 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.731 μs ±   2.350 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅██▅▁       ▁▂▂       ▁▂▁                                    ▂
  ██████▇▇▇▇█▇████▇▆▆▆▇▇███▇▇▆▆▆▅▆▅▃▄▃▄▅▄▄▆▆▅▃▁▄▃▅▄▅▄▄▁▄▄▅▃▄▁▄ █
  49.1 μs       Histogram: log(frequency) by time      56.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
BenchmarkTools.Trial: 10000 samples with 191 evaluations per sample.
 Range (min … max):  524.435 ns …  1.104 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     525.089 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   529.323 ns ± 20.876 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃      ▁ ▃▃                                                 ▁
  █████▇███▇███▇▇▇▇▇▇▇▇▅▆▆▆▆▆▅▅▄▆▃▄▄▃▅▅▄▃▅▄▄▄▅▅▅▃▅▄▄▁▄▄▅▆▄▄▅▄▅ █
  524 ns        Histogram: log(frequency) by time       609 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
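
As a quick sanity check on the two Float32 trials above, the ratio of the reported medians can be computed by hand. This is a sketch that hard-codes the median times printed above (49.375 μs and 525.089 ns); the variable names are illustrative only.

```julia
# Median times copied from the two trials above, converted to nanoseconds.
median_extrema_ns   = 49.375e3   # 49.375 μs for extrema($A)
median_mapreduce_ns = 525.089    # 525.089 ns for the mapreduce definition

# Same ratio-and-round computation as in the benchmark script.
ratio = round(median_extrema_ns / median_mapreduce_ns, digits=1)
println("Float32 on M1: $(ratio)x faster")
```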

Closes #34790, closes #31442, closes #44606.

giordano · 2025-04-29T20:14:31Z

Probably this and #58267 would make good benchmarks in https://github.com/JuliaCI/BaseBenchmarks.jl to better control upgrades of LLVM?

giordano · 2025-04-30T08:13:59Z

Test failures look relevant

mbauman · 2025-04-30T13:33:46Z

Yeah, interesting. Looks like some platforms don't maintain a consistent argument ordering of NaNs. I'm not sure whether that's something that we actually need extrema to satisfy (maybe we can relax this test?) or something that we actually guarantee that min and max must satisfy (maybe this is a separate bug?).

Here's the MWE on a skylake-avx512 machine:

julia> f((min1, max1), (min2, max2)) = (min(min1, min2), max(max1, max2));

julia> x = -NaN, -NaN;

julia> y = NaN, NaN;

julia> signbit.(f(x, y))
(false, true)

julia> signbit(min(-NaN,NaN))
false

julia> signbit(max(-NaN,NaN))
true

The test asserts that the signs returned by f here are the same; that is, that f either always returns the first arguments to min/max or always returns the second arguments. The sign mismatch shown above doesn't happen on v1.11, but does on beta2 and master.

oscardssmith · 2025-04-30T13:40:34Z

In general LLVM (and therefore we) make no guarantees about what NaN you will get.

mbauman · 2025-04-30T13:48:59Z

OK, great, then we can just relax that test. I suspect it was written trying to allow for any ordering, but it missed the heterogeneous case.
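
A relaxed check along these lines might look like the sketch below. It is not the actual test changed in this PR; it only illustrates, under the assumption that no particular NaN is guaranteed, a check that accepts any NaN regardless of sign bit, including the heterogeneous case where min and max return NaNs of different signs.

```julia
# The reduction operator from the MWE above.
f((min1, max1), (min2, max2)) = (min(min1, min2), max(max1, max2))

# Hypothetical check (not the PR's test): try every sign combination of NaN
# inputs and only require that both results are NaN, whatever their sign.
nans = (NaN, -NaN)
for a in nans, b in nans, c in nans, d in nans
    lo, hi = f((a, b), (c, d))
    @assert isnan(lo) && isnan(hi)
end
```

Using `isnan` rather than comparing `signbit` results avoids depending on which argument's NaN is propagated.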

Speed up extrema 3-50x

dcfb327

This method is no longer an optimization over what Julia can do with the naive definition on most (if not all) architectures

mbauman added the labels "performance" (Must go faster) and "fold" (sum, maximum, reduce, foldl, etc.) on Apr 29, 2025

relax unordered test to actually allow any ordering

f3c440c
