This project explores how to implement and test SIMD (Single Instruction, Multiple Data) operations in F# using intrinsics

64J0/fsharp--simd-vector-addition

F# SIMD Vector Addition

Overview

This project explores how to implement and test SIMD (Single Instruction, Multiple Data) operations in F#, specifically using:

  • 128-bit SIMD via SSE (System.Runtime.Intrinsics.X86.Sse, where SSE stands for Streaming SIMD Extensions)
  • Safe memory access with Span<T> and MemoryMarshal.Cast
  • Automated unit testing with Expecto
  • Performance benchmarking with BenchmarkDotNet

It compares scalar vs SIMD performance on vector addition of float32 arrays.

This program uses hardware intrinsics, which let us issue specific CPU instructions from high-level languages like F#, C#, or C++ without writing assembly code. Intrinsics expose hardware features (such as SIMD or cryptography instructions) as functions or methods that map directly to processor instructions.
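To make the mapping concrete, here is a minimal sketch of a single intrinsic call. The function name `addFourLanes` is hypothetical and is not taken from this repository. It assumes .NET Core 3.0 or later, where `Sse.Add` on `Vector128<float32>` is expected to compile to one `addps` instruction on x86/x64 hardware.

```fsharp
open System.Runtime.Intrinsics
open System.Runtime.Intrinsics.X86

/// Adds two 128-bit vectors holding four float32 lanes each.
/// (Hypothetical helper for illustration only.)
let addFourLanes (a: Vector128<float32>) (b: Vector128<float32>) : Vector128<float32> =
    if Sse.IsSupported then
        // Hardware path: one SSE add covers all four lanes.
        Sse.Add(a, b)
    else
        // Portable path for CPUs without SSE (for example ARM): add lane by lane.
        Vector128.Create(
            Vector128.GetElement(a, 0) + Vector128.GetElement(b, 0),
            Vector128.GetElement(a, 1) + Vector128.GetElement(b, 1),
            Vector128.GetElement(a, 2) + Vector128.GetElement(b, 2),
            Vector128.GetElement(a, 3) + Vector128.GetElement(b, 3))

let lanesA = Vector128.Create(1.0f, 2.0f, 3.0f, 4.0f)
let lanesB = Vector128.Create(10.0f, 20.0f, 30.0f, 40.0f)
let lanesSum = addFourLanes lanesA lanesB
```

Checking `Sse.IsSupported` first keeps the sketch usable on machines without SSE; the JIT treats that check as a constant, so the untaken branch should not cost anything at run time.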

Assembly code

To inspect the generated x86 assembly code, follow this link to godbolt (Compiler Explorer).

Features

  • Safe, GC-friendly SIMD implementation (no unsafe blocks)
  • Fallback to scalar operations for trailing elements
  • Unit tests for correctness
  • Benchmarking suite to evaluate performance benefits
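The first two features can be sketched together as follows. This is an illustrative reconstruction under stated assumptions, not the repository's actual code: the function name `simdAdd` and its signature are hypothetical. It reinterprets the arrays as spans of `Vector128<float32>` with `MemoryMarshal.Cast`, adds whole vectors with `Sse.Add`, and then handles the trailing elements that do not fill a vector with a scalar loop.

```fsharp
open System
open System.Runtime.InteropServices
open System.Runtime.Intrinsics
open System.Runtime.Intrinsics.X86

/// Element-wise sum of two float32 arrays (hypothetical sketch).
let simdAdd (a: float32[]) (b: float32[]) : float32[] =
    if a.Length <> b.Length then
        invalidArg "b" "Arrays must have the same length."

    let result = Array.zeroCreate<float32> a.Length

    if Sse.IsSupported then
        // Number of float32 lanes in a 128-bit vector (4).
        let lanes = Vector128<float32>.Count

        // View the same memory as spans of 128-bit vectors. No copying and
        // no pointers; a remainder shorter than one vector is left out.
        let va = MemoryMarshal.Cast<float32, Vector128<float32>>(ReadOnlySpan<float32>(a))
        let vb = MemoryMarshal.Cast<float32, Vector128<float32>>(ReadOnlySpan<float32>(b))
        let vr = MemoryMarshal.Cast<float32, Vector128<float32>>(Span<float32>(result))

        // Vectorized part: four additions per iteration.
        for i in 0 .. va.Length - 1 do
            vr.[i] <- Sse.Add(va.[i], vb.[i])

        // Scalar fallback for the trailing elements.
        for i in va.Length * lanes .. a.Length - 1 do
            result.[i] <- a.[i] + b.[i]
    else
        // No SSE available: plain scalar loop over everything.
        for i in 0 .. a.Length - 1 do
            result.[i] <- a.[i] + b.[i]

    result

// Five elements: one full vector plus one trailing element.
let sums = simdAdd [| 1.0f; 2.0f; 3.0f; 4.0f; 5.0f |] [| 10.0f; 20.0f; 30.0f; 40.0f; 50.0f |]
// sums = [| 11.0f; 22.0f; 33.0f; 44.0f; 55.0f |]
```

Because the spans are bounds-checked views over managed arrays, the garbage collector can still move or track the underlying memory, which is what makes this approach workable without pinning or raw pointers.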

Run Unit Tests

dotnet test

Run Benchmarks

Compile and run in Release mode:

dotnet run -c Release

Sample output (running as sudo):

// * Summary *

BenchmarkDotNet v0.15.2, Linux Ubuntu 24.04.2 LTS (Noble Numbat)
13th Gen Intel Core i7-13620H 4.90GHz, 1 CPU, 16 logical and 10 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2 DEBUG
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2


| Method         | Length  | Mean            | Error         | StdDev        | Median          | Ratio | RatioSD | Gen0     | Gen1     | Gen2     | Allocated | Alloc Ratio |
|--------------- |-------- |----------------:|--------------:|--------------:|----------------:|------:|--------:|---------:|---------:|---------:|----------:|------------:|
| ScalarAdd      | 100     |        62.84 ns |      0.357 ns |      0.334 ns |        62.81 ns |  1.00 |    0.01 |   0.0337 |        - |        - |     424 B |        1.00 |
| SimdAddGeneric | 100     |        25.79 ns |      0.425 ns |      0.377 ns |        25.78 ns |  0.41 |    0.01 |   0.0338 |        - |        - |     424 B |        1.00 |
| SimdAdd        | 100     |        38.33 ns |      0.836 ns |      2.412 ns |        37.32 ns |  0.61 |    0.04 |   0.0337 |        - |        - |     424 B |        1.00 |
| SimdUnsafe     | 100     |        25.22 ns |      0.581 ns |      0.905 ns |        25.03 ns |  0.40 |    0.01 |   0.0337 |        - |        - |     424 B |        1.00 |
|                |         |                 |               |               |                 |       |         |          |          |          |           |             |
| ScalarAdd      | 10000   |     4,779.32 ns |     95.245 ns |    237.194 ns |     4,695.79 ns |  1.00 |    0.07 |   3.1738 |        - |        - |   40024 B |        1.00 |
| SimdAddGeneric | 10000   |     2,695.04 ns |     41.373 ns |     42.487 ns |     2,686.53 ns |  0.57 |    0.03 |   3.1738 |        - |        - |   40024 B |        1.00 |
| SimdAdd        | 10000   |     3,259.33 ns |     64.696 ns |     79.453 ns |     3,236.80 ns |  0.68 |    0.04 |   3.1738 |        - |        - |   40024 B |        1.00 |
| SimdUnsafe     | 10000   |     2,679.35 ns |     42.744 ns |     39.983 ns |     2,670.81 ns |  0.56 |    0.03 |   3.1738 |        - |        - |   40024 B |        1.00 |
|                |         |                 |               |               |                 |       |         |          |          |          |           |             |
| ScalarAdd      | 1000000 | 1,112,382.34 ns | 22,194.992 ns | 45,338.491 ns | 1,112,908.00 ns |  1.00 |    0.06 | 152.3438 | 152.3438 | 152.3438 | 4000125 B |        1.00 |
| SimdAddGeneric | 1000000 |   959,651.77 ns | 19,088.075 ns | 32,925.994 ns |   951,350.95 ns |  0.86 |    0.05 | 154.2969 | 154.2969 | 154.2969 | 4000126 B |        1.00 |
| SimdAdd        | 1000000 | 1,024,855.41 ns | 20,493.475 ns | 24,396.026 ns | 1,024,238.89 ns |  0.92 |    0.04 | 152.3438 | 152.3438 | 152.3438 | 4000125 B |        1.00 |
| SimdUnsafe     | 1000000 |   976,061.11 ns | 19,248.694 ns | 25,696.456 ns |   973,510.02 ns |  0.88 |    0.04 | 154.2969 | 154.2969 | 154.2969 | 4000126 B |        1.00 |

// * Hints *
Outliers
  AddTwoArraysBenchmark.SimdAddGeneric: Default -> 3 outliers were removed (30.80 ns..31.28 ns)
  AddTwoArraysBenchmark.SimdAdd: Default        -> 4 outliers were removed (48.29 ns..49.23 ns)
  AddTwoArraysBenchmark.SimdUnsafe: Default     -> 3 outliers were removed (33.05 ns..35.77 ns)
  AddTwoArraysBenchmark.ScalarAdd: Default      -> 6 outliers were removed (5.39 us..5.47 us)
  AddTwoArraysBenchmark.SimdAddGeneric: Default -> 4 outliers were removed (2.87 us..3.15 us)
  AddTwoArraysBenchmark.SimdAdd: Default        -> 3 outliers were removed (3.49 us..3.67 us)
  AddTwoArraysBenchmark.SimdAdd: Default        -> 1 outlier  was  removed (1.11 ms)

// * Legends *
  Length      : Value of the 'Length' parameter
  Mean        : Arithmetic mean of all measurements
  Error       : Half of 99.9% confidence interval
  StdDev      : Standard deviation of all measurements
  Median      : Value separating the higher half of all measurements (50th percentile)
  Ratio       : Mean of the ratio distribution ([Current]/[Baseline])
  RatioSD     : Standard deviation of the ratio distribution ([Current]/[Baseline])
  Gen0        : GC Generation 0 collects per 1000 operations
  Gen1        : GC Generation 1 collects per 1000 operations
  Gen2        : GC Generation 2 collects per 1000 operations
  Allocated   : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
  Alloc Ratio : Allocated memory ratio distribution ([Current]/[Baseline])
  1 ns        : 1 Nanosecond (0.000000001 sec)

// * Diagnostic Output - MemoryDiagnoser *


// ***** BenchmarkRunner: End *****
Run time: 00:05:54 (354.29 sec), executed benchmarks: 12

Global total time: 00:05:57 (357.46 sec), executed benchmarks: 12
