Skip to content

SIMD vectorisation #16230

Open
Feature
@Smaug123

Description

@Smaug123

Array.max, for example, can be made more than an order of magnitude faster with vectorisation on appropriate hardware. The money shot (forgive the rather random naming, I copy-pasted some code which used to be about factorials, for example), on a machine with Vector<byte>.Count = 16:

BenchmarkDotNet v0.13.10, macOS Sonoma 14.1 (23B74) [Darwin 23.1.0]
Apple M1 Max, 1 CPU, 10 logical and 10 physical cores
.NET SDK 7.0.403
  [Host]     : .NET 7.0.13 (7.0.1323.51816), Arm64 RyuJIT AdvSIMD DEBUG
  DefaultJob : .NET 7.0.13 (7.0.1323.51816), Arm64 RyuJIT AdvSIMD


| Method             | ArraySize | Vectorized | Mean            | Error        | StdDev       |
|------------------- |---------- |----------- |----------------:|-------------:|-------------:|
| BenchmarkFactorial | 10        | False      |        13.95 ns |     0.011 ns |     0.009 ns |
| BenchmarkFactorial | 10        | True       |        14.23 ns |     0.015 ns |     0.012 ns |
| BenchmarkFactorial | 100       | False      |        71.63 ns |     0.765 ns |     0.716 ns |
| BenchmarkFactorial | 100       | True       |        19.84 ns |     0.035 ns |     0.027 ns |
| BenchmarkFactorial | 1000      | False      |       641.30 ns |     1.559 ns |     1.382 ns |
| BenchmarkFactorial | 1000      | True       |        67.48 ns |     0.125 ns |     0.098 ns |
| BenchmarkFactorial | 10000     | False      |     6,283.97 ns |    12.787 ns |    11.336 ns |
| BenchmarkFactorial | 10000     | True       |       458.34 ns |     1.151 ns |     0.899 ns |
| BenchmarkFactorial | 100000    | False      |    62,312.46 ns |    21.059 ns |    17.585 ns |
| BenchmarkFactorial | 100000    | True       |     4,435.52 ns |     7.009 ns |     5.853 ns |
| BenchmarkFactorial | 1000000   | False      |   622,888.31 ns |   415.604 ns |   368.422 ns |
| BenchmarkFactorial | 1000000   | True       |    44,163.61 ns |    42.545 ns |    35.527 ns |
| BenchmarkFactorial | 2000000   | False      | 1,245,989.22 ns | 1,678.470 ns | 1,401.599 ns |
| BenchmarkFactorial | 2000000   | True       |    87,850.09 ns |    64.740 ns |    54.061 ns |

// * Hints *
Outliers
  Bench.BenchmarkFactorial: Default -> 2 outliers were removed, 3 outliers were detected (14.98 ns, 15.04 ns, 15.14 ns)
  Bench.BenchmarkFactorial: Default -> 3 outliers were removed (15.41 ns..15.44 ns)
  Bench.BenchmarkFactorial: Default -> 3 outliers were removed (21.06 ns..21.11 ns)
  Bench.BenchmarkFactorial: Default -> 1 outlier  was  removed (647.56 ns)
  Bench.BenchmarkFactorial: Default -> 3 outliers were removed, 5 outliers were detected (69.05 ns, 69.08 ns, 69.73 ns..70.75 ns)
  Bench.BenchmarkFactorial: Default -> 1 outlier  was  removed (6.33 us)
  Bench.BenchmarkFactorial: Default -> 3 outliers were removed (465.11 ns..465.85 ns)
  Bench.BenchmarkFactorial: Default -> 2 outliers were removed (63.04 us, 63.09 us)
  Bench.BenchmarkFactorial: Default -> 2 outliers were removed (4.48 us, 4.48 us)
  Bench.BenchmarkFactorial: Default -> 1 outlier  was  removed (628.06 us)
  Bench.BenchmarkFactorial: Default -> 2 outliers were removed (44.30 us, 44.38 us)
  Bench.BenchmarkFactorial: Default -> 2 outliers were removed (1.25 ms, 1.30 ms)
  Bench.BenchmarkFactorial: Default -> 2 outliers were removed (88.18 us, 89.12 us)

// * Legends *
  ArraySize  : Value of the 'ArraySize' parameter
  Vectorized : Value of the 'Vectorized' parameter
  Mean       : Arithmetic mean of all measurements
  Error      : Half of 99.9% confidence interval
  StdDev     : Standard deviation of all measurements
  1 ns       : 1 Nanosecond (0.000000001 sec)

It's trivial to write the property-based test that asserts this agrees with Array.max; and I reran this using Array.max directly (rather than my vendored version) and got basically the same numbers.

Describe the solution you'd like

All appropriate array methods should be vectorised. There are examples in System.Linq.

One difficulty is that Vector<'a> requires 'a : (new : unit -> 'a) and 'a : struct and 'a :> System.ValueType. I do not know how to make this change in a way that doesn't break the API, because I don't know how to provide one implementation for the 'a which satisfy those requirements and one for 'a which do not. But it would be great if we could simply provide various implementations of Array.max and have the correct one be selected.

Describe alternatives you've considered

We can always not do this; people can write this themselves, or call out to System.Linq manually. It's sad if they have to do that, though.

Additional context
Program:

namespace Vectorised

open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

module Vectorised =
    let inline maxFromFSharpCore (array: _[]) =
        if isNull array then
            nullArg "array"

        if array.Length = 0 then
            invalidArg "array" "array was empty"

        let mutable acc = array.[0]

        for i = 1 to array.Length - 1 do
            let curr = array.[i]

            if curr > acc then
                acc <- curr

        acc

    open System.Numerics

    let inline max<'a when 'a : comparison and 'a : (new : unit -> 'a) and 'a : struct and 'a :> System.ValueType> (array : 'a []) : 'a =
        if isNull array then
            nullArg "array"

        let span = System.MemoryExtensions.AsSpan array
        if span.IsEmpty then
            invalidArg "array" "array was empty"

        let mutable index = 0
        let mutable acc = Unchecked.defaultof<'a>

        if (Vector.IsHardwareAccelerated && span.Length >= Vector<'a>.Count * 2) then
            let mutable maxes = Vector<'a> span
            index <- Vector<'a>.Count

            maxes <- Vector.Max(maxes, new Vector<'a>(span.Slice(index)));
            index <- index + Vector<'a>.Count

            while (index + Vector<'a>.Count <= span.Length) do
                maxes <- Vector.Max(maxes, new Vector<'a>(span.Slice(index)));
                index <- index + Vector<'a>.Count

            acc <- maxes.[0]
            for i = 1 to Vector<'a>.Count - 1 do
                if maxes.[i] > acc then
                    acc <- maxes.[i]

        else
            acc <- span.[0]
            index <- 1

        for i = index to span.Length - 1 do
            if span.[i] > acc then
                acc <- span.[i]

        acc

    let random = System.Random ()

    let arrays =
        [| 10 ; 100 ; 1_000 ; 10_000 ; 100_000 ; 1_000_000 ; 2_000_000 |]
        |> Array.map (fun i -> i, Array.zeroCreate i)
        |> Array.map (fun (index, arr) ->
            random.NextBytes arr
            index, arr
        )
        |> Map.ofArray

    type Bench () =

        [<Params(10, 100, 1000, 10_000, 100_000, 1_000_000, 2_000_000)>]
        member val ArraySize = 0 with get, set

        [<Params (true, false)>]
        member val Vectorized = false with get, set

        // Define the method to benchmark
        [<Benchmark>]
        member this.BenchmarkFactorial() =
            if this.Vectorized then
                max arrays.[this.ArraySize]
                |> ignore
            else
                maxFromFSharpCore arrays.[this.ArraySize]
                |> ignore

    [<EntryPoint>]
    let main _ =
        let summary = BenchmarkRunner.Run<Bench>()
        if System.Object.ReferenceEquals (summary, null) then
            failwith "got no summary"

        0

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions