Description
Array.max
, for example, can be made more than an order of magnitude faster with vectorisation on appropriate hardware. The money shot (forgive the rather random naming, I copy-pasted some code which used to be about factorials, for example), on a machine with Vector<byte>.Count = 16
:
BenchmarkDotNet v0.13.10, macOS Sonoma 14.1 (23B74) [Darwin 23.1.0]
Apple M1 Max, 1 CPU, 10 logical and 10 physical cores
.NET SDK 7.0.403
[Host] : .NET 7.0.13 (7.0.1323.51816), Arm64 RyuJIT AdvSIMD DEBUG
DefaultJob : .NET 7.0.13 (7.0.1323.51816), Arm64 RyuJIT AdvSIMD
| Method | ArraySize | Vectorized | Mean | Error | StdDev |
|------------------- |---------- |----------- |----------------:|-------------:|-------------:|
| BenchmarkFactorial | 10 | False | 13.95 ns | 0.011 ns | 0.009 ns |
| BenchmarkFactorial | 10 | True | 14.23 ns | 0.015 ns | 0.012 ns |
| BenchmarkFactorial | 100 | False | 71.63 ns | 0.765 ns | 0.716 ns |
| BenchmarkFactorial | 100 | True | 19.84 ns | 0.035 ns | 0.027 ns |
| BenchmarkFactorial | 1000 | False | 641.30 ns | 1.559 ns | 1.382 ns |
| BenchmarkFactorial | 1000 | True | 67.48 ns | 0.125 ns | 0.098 ns |
| BenchmarkFactorial | 10000 | False | 6,283.97 ns | 12.787 ns | 11.336 ns |
| BenchmarkFactorial | 10000 | True | 458.34 ns | 1.151 ns | 0.899 ns |
| BenchmarkFactorial | 100000 | False | 62,312.46 ns | 21.059 ns | 17.585 ns |
| BenchmarkFactorial | 100000 | True | 4,435.52 ns | 7.009 ns | 5.853 ns |
| BenchmarkFactorial | 1000000 | False | 622,888.31 ns | 415.604 ns | 368.422 ns |
| BenchmarkFactorial | 1000000 | True | 44,163.61 ns | 42.545 ns | 35.527 ns |
| BenchmarkFactorial | 2000000 | False | 1,245,989.22 ns | 1,678.470 ns | 1,401.599 ns |
| BenchmarkFactorial | 2000000 | True | 87,850.09 ns | 64.740 ns | 54.061 ns |
// * Hints *
Outliers
Bench.BenchmarkFactorial: Default -> 2 outliers were removed, 3 outliers were detected (14.98 ns, 15.04 ns, 15.14 ns)
Bench.BenchmarkFactorial: Default -> 3 outliers were removed (15.41 ns..15.44 ns)
Bench.BenchmarkFactorial: Default -> 3 outliers were removed (21.06 ns..21.11 ns)
Bench.BenchmarkFactorial: Default -> 1 outlier was removed (647.56 ns)
Bench.BenchmarkFactorial: Default -> 3 outliers were removed, 5 outliers were detected (69.05 ns, 69.08 ns, 69.73 ns..70.75 ns)
Bench.BenchmarkFactorial: Default -> 1 outlier was removed (6.33 us)
Bench.BenchmarkFactorial: Default -> 3 outliers were removed (465.11 ns..465.85 ns)
Bench.BenchmarkFactorial: Default -> 2 outliers were removed (63.04 us, 63.09 us)
Bench.BenchmarkFactorial: Default -> 2 outliers were removed (4.48 us, 4.48 us)
Bench.BenchmarkFactorial: Default -> 1 outlier was removed (628.06 us)
Bench.BenchmarkFactorial: Default -> 2 outliers were removed (44.30 us, 44.38 us)
Bench.BenchmarkFactorial: Default -> 2 outliers were removed (1.25 ms, 1.30 ms)
Bench.BenchmarkFactorial: Default -> 2 outliers were removed (88.18 us, 89.12 us)
// * Legends *
ArraySize : Value of the 'ArraySize' parameter
Vectorized : Value of the 'Vectorized' parameter
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
1 ns : 1 Nanosecond (0.000000001 sec)
It's trivial to write the property-based test that asserts this agrees with Array.max
; and I reran this using Array.max
directly (rather than my vendored version) and got basically the same numbers.
Describe the solution you'd like
All appropriate array methods should be vectorised. There are examples in System.Linq.
One difficulty is that Vector<'a>
requires 'a : (new : unit -> 'a) and 'a : struct and 'a :> System.ValueType
. I do not know how to make this change in a way that doesn't break the API, because I don't know how to provide one implementation for the 'a
which satisfy those requirements and one for 'a
which do not. But it would be great if we could simply provide various implementations of Array.max
and have the correct one be selected.
Describe alternatives you've considered
We can always not do this; people can write this themselves, or call out to System.Linq manually. It's sad if they have to do that, though.
Additional context
Program:
namespace Vectorised
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running
module Vectorised =
let inline maxFromFSharpCore (array: _[]) =
if isNull array then
nullArg "array"
if array.Length = 0 then
invalidArg "array" "array was empty"
let mutable acc = array.[0]
for i = 1 to array.Length - 1 do
let curr = array.[i]
if curr > acc then
acc <- curr
acc
open System.Numerics
let inline max<'a when 'a : comparison and 'a : (new : unit -> 'a) and 'a : struct and 'a :> System.ValueType> (array : 'a []) : 'a =
if isNull array then
nullArg "array"
let span = System.MemoryExtensions.AsSpan array
if span.IsEmpty then
invalidArg "array" "array was empty"
let mutable index = 0
let mutable acc = Unchecked.defaultof<'a>
if (Vector.IsHardwareAccelerated && span.Length >= Vector<'a>.Count * 2) then
let mutable maxes = Vector<'a> span
index <- Vector<'a>.Count
maxes <- Vector.Max(maxes, new Vector<'a>(span.Slice(index)));
index <- index + Vector<'a>.Count
while (index + Vector<'a>.Count <= span.Length) do
maxes <- Vector.Max(maxes, new Vector<'a>(span.Slice(index)));
index <- index + Vector<'a>.Count
acc <- maxes.[0]
for i = 1 to Vector<'a>.Count - 1 do
if maxes.[i] > acc then
acc <- maxes.[i]
else
acc <- span.[0]
index <- 1
for i = index to span.Length - 1 do
if span.[i] > acc then
acc <- span.[i]
acc
let random = System.Random ()
let arrays =
[| 10 ; 100 ; 1_000 ; 10_000 ; 100_000 ; 1_000_000 ; 2_000_000 |]
|> Array.map (fun i -> i, Array.zeroCreate i)
|> Array.map (fun (index, arr) ->
random.NextBytes arr
index, arr
)
|> Map.ofArray
type Bench () =
[<Params(10, 100, 1000, 10_000, 100_000, 1_000_000, 2_000_000)>]
member val ArraySize = 0 with get, set
[<Params (true, false)>]
member val Vectorized = false with get, set
// Define the method to benchmark
[<Benchmark>]
member this.BenchmarkFactorial() =
if this.Vectorized then
max arrays.[this.ArraySize]
|> ignore
else
maxFromFSharpCore arrays.[this.ArraySize]
|> ignore
[<EntryPoint>]
let main _ =
let summary = BenchmarkRunner.Run<Bench>()
if System.Object.ReferenceEquals (summary, null) then
failwith "got no summary"
0