This project explores how to implement and test SIMD (Single Instruction, Multiple Data) operations in F#, specifically using:
- 128-bit SIMD via SSE (`System.Runtime.Intrinsics.X86.Sse`, where SSE stands for Streaming SIMD Extensions)
- Safe memory access with `Span<T>` and `MemoryMarshal.Cast`
- Automated unit testing with `Expecto`
- Performance benchmarking with `BenchmarkDotNet`

It compares scalar vs SIMD performance on vector addition of `float32` arrays.
This program uses hardware intrinsics, which let us issue low-level CPU instructions directly from high-level languages like F#, C#, or C++ without writing assembly code. Intrinsics expose specific hardware features (such as SIMD or cryptography instructions) as functions or methods that map directly to processor instructions.
If you’d like to inspect the generated x86 assembly code, see this link from godbolt.
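As a minimal illustration of what an intrinsic call looks like from F#, the sketch below adds two four-lane vectors. The function name `addFourLanes` is hypothetical and not taken from this project. `Sse.Add` is the managed wrapper for the SSE `ADDPS` instruction; the fallback branch assumes .NET 7 or later, where `Vector128.Add` is available.

```fsharp
open System.Runtime.Intrinsics
open System.Runtime.Intrinsics.X86

/// Adds two vectors of four float32 lanes each.
/// `addFourLanes` is an illustrative name, not part of this project.
let addFourLanes (x: Vector128<float32>) (y: Vector128<float32>) : Vector128<float32> =
    if Sse.IsSupported then
        // Maps to a single ADDPS instruction on x86/x64 CPUs with SSE.
        Sse.Add(x, y)
    else
        // Portable fallback for hardware without SSE (assumes .NET 7+).
        Vector128.Add(x, y)

let left = Vector128.Create(1.0f, 2.0f, 3.0f, 4.0f)
let right = Vector128.Create(10.0f, 20.0f, 30.0f, 40.0f)
let sum = addFourLanes left right
printfn "%O" sum
```

Checking `Sse.IsSupported` before calling an `Sse` method matters because the intrinsic throws `PlatformNotSupportedException` on hardware that lacks the instruction set.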
- Safe, GC-friendly SIMD implementation (no `unsafe` blocks)
- Fallback to scalar operations for trailing elements
- Unit tests for correctness
- Benchmarking suite to evaluate performance benefits
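The safe SIMD path and the scalar fallback for trailing elements can be sketched roughly as follows. This is an illustrative version under stated assumptions, not necessarily the project's actual code: the function names `scalarAdd` and `simdAdd` are assumed, and the sketch simply reinterprets each array as a span of `Vector128<float32>` with `MemoryMarshal.Cast`, so no `unsafe` code or pinning is involved.

```fsharp
open System
open System.Runtime.InteropServices
open System.Runtime.Intrinsics
open System.Runtime.Intrinsics.X86

/// Baseline: element-by-element addition (assumed name).
let scalarAdd (a: float32[]) (b: float32[]) : float32[] =
    Array.map2 (+) a b

/// 128-bit SSE addition: four float32 lanes per step, scalar loop for the tail (assumed name).
let simdAdd (a: float32[]) (b: float32[]) : float32[] =
    if a.Length <> b.Length then invalidArg "b" "Arrays must have the same length"
    if not Sse.IsSupported then
        scalarAdd a b
    else
        let result = Array.zeroCreate<float32> a.Length
        let lanes = Vector128<float32>.Count // 4 lanes for float32
        // Reinterpret the arrays as blocks of Vector128<float32> without copying.
        // Cast drops any trailing elements that do not fill a whole vector.
        let va = MemoryMarshal.Cast<float32, Vector128<float32>>(ReadOnlySpan<float32>(a))
        let vb = MemoryMarshal.Cast<float32, Vector128<float32>>(ReadOnlySpan<float32>(b))
        let vr = MemoryMarshal.Cast<float32, Vector128<float32>>(Span<float32>(result))
        for i in 0 .. va.Length - 1 do
            vr.[i] <- Sse.Add(va.[i], vb.[i])
        // Scalar fallback for the remaining (length % 4) elements.
        for i in va.Length * lanes .. a.Length - 1 do
            result.[i] <- a.[i] + b.[i]
        result

// Seven elements: one full vector of four plus three trailing elements.
let xs = [| 1.0f; 2.0f; 3.0f; 4.0f; 5.0f; 6.0f; 7.0f |]
let ys = [| 10.0f; 20.0f; 30.0f; 40.0f; 50.0f; 60.0f; 70.0f |]
let zs = simdAdd xs ys
printfn "%A" zs
```

Because `Span<T>` is a stack-only type, the spans stay inside the function body and are only used in plain `for` loops; capturing them in a closure or lambda would not compile.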
Run the unit tests:
dotnet test
Compile and run in Release mode:
dotnet run -c Release
Sample output (running as `sudo`):
// * Summary *

BenchmarkDotNet v0.15.2, Linux Ubuntu 24.04.2 LTS (Noble Numbat)
13th Gen Intel Core i7-13620H 4.90GHz, 1 CPU, 16 logical and 10 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2 DEBUG
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

| Method         | Length  | Mean            | Error         | StdDev        | Median          | Ratio | RatioSD | Gen0     | Gen1     | Gen2     | Allocated | Alloc Ratio |
|--------------- |-------- |----------------:|--------------:|--------------:|----------------:|------:|--------:|---------:|---------:|---------:|----------:|------------:|
| ScalarAdd      | 100     | 62.84 ns        | 0.357 ns      | 0.334 ns      | 62.81 ns        | 1.00  | 0.01    | 0.0337   | -        | -        | 424 B     | 1.00        |
| SimdAddGeneric | 100     | 25.79 ns        | 0.425 ns      | 0.377 ns      | 25.78 ns        | 0.41  | 0.01    | 0.0338   | -        | -        | 424 B     | 1.00        |
| SimdAdd        | 100     | 38.33 ns        | 0.836 ns      | 2.412 ns      | 37.32 ns        | 0.61  | 0.04    | 0.0337   | -        | -        | 424 B     | 1.00        |
| SimdUnsafe     | 100     | 25.22 ns        | 0.581 ns      | 0.905 ns      | 25.03 ns        | 0.40  | 0.01    | 0.0337   | -        | -        | 424 B     | 1.00        |
|                |         |                 |               |               |                 |       |         |          |          |          |           |             |
| ScalarAdd      | 10000   | 4,779.32 ns     | 95.245 ns     | 237.194 ns    | 4,695.79 ns     | 1.00  | 0.07    | 3.1738   | -        | -        | 40024 B   | 1.00        |
| SimdAddGeneric | 10000   | 2,695.04 ns     | 41.373 ns     | 42.487 ns     | 2,686.53 ns     | 0.57  | 0.03    | 3.1738   | -        | -        | 40024 B   | 1.00        |
| SimdAdd        | 10000   | 3,259.33 ns     | 64.696 ns     | 79.453 ns     | 3,236.80 ns     | 0.68  | 0.04    | 3.1738   | -        | -        | 40024 B   | 1.00        |
| SimdUnsafe     | 10000   | 2,679.35 ns     | 42.744 ns     | 39.983 ns     | 2,670.81 ns     | 0.56  | 0.03    | 3.1738   | -        | -        | 40024 B   | 1.00        |
|                |         |                 |               |               |                 |       |         |          |          |          |           |             |
| ScalarAdd      | 1000000 | 1,112,382.34 ns | 22,194.992 ns | 45,338.491 ns | 1,112,908.00 ns | 1.00  | 0.06    | 152.3438 | 152.3438 | 152.3438 | 4000125 B | 1.00        |
| SimdAddGeneric | 1000000 | 959,651.77 ns   | 19,088.075 ns | 32,925.994 ns | 951,350.95 ns   | 0.86  | 0.05    | 154.2969 | 154.2969 | 154.2969 | 4000126 B | 1.00        |
| SimdAdd        | 1000000 | 1,024,855.41 ns | 20,493.475 ns | 24,396.026 ns | 1,024,238.89 ns | 0.92  | 0.04    | 152.3438 | 152.3438 | 152.3438 | 4000125 B | 1.00        |
| SimdUnsafe     | 1000000 | 976,061.11 ns   | 19,248.694 ns | 25,696.456 ns | 973,510.02 ns   | 0.88  | 0.04    | 154.2969 | 154.2969 | 154.2969 | 4000126 B | 1.00        |

// * Hints *
Outliers
  AddTwoArraysBenchmark.SimdAddGeneric: Default -> 3 outliers were removed (30.80 ns..31.28 ns)
  AddTwoArraysBenchmark.SimdAdd: Default -> 4 outliers were removed (48.29 ns..49.23 ns)
  AddTwoArraysBenchmark.SimdUnsafe: Default -> 3 outliers were removed (33.05 ns..35.77 ns)
  AddTwoArraysBenchmark.ScalarAdd: Default -> 6 outliers were removed (5.39 us..5.47 us)
  AddTwoArraysBenchmark.SimdAddGeneric: Default -> 4 outliers were removed (2.87 us..3.15 us)
  AddTwoArraysBenchmark.SimdAdd: Default -> 3 outliers were removed (3.49 us..3.67 us)
  AddTwoArraysBenchmark.SimdAdd: Default -> 1 outlier was removed (1.11 ms)

// * Legends *
  Length      : Value of the 'Length' parameter
  Mean        : Arithmetic mean of all measurements
  Error       : Half of 99.9% confidence interval
  StdDev      : Standard deviation of all measurements
  Median      : Value separating the higher half of all measurements (50th percentile)
  Ratio       : Mean of the ratio distribution ([Current]/[Baseline])
  RatioSD     : Standard deviation of the ratio distribution ([Current]/[Baseline])
  Gen0        : GC Generation 0 collects per 1000 operations
  Gen1        : GC Generation 1 collects per 1000 operations
  Gen2        : GC Generation 2 collects per 1000 operations
  Allocated   : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
  Alloc Ratio : Allocated memory ratio distribution ([Current]/[Baseline])
  1 ns        : 1 Nanosecond (0.000000001 sec)

// * Diagnostic Output - MemoryDiagnoser *

// ***** BenchmarkRunner: End *****
Run time: 00:05:54 (354.29 sec), executed benchmarks: 12

Global total time: 00:05:57 (357.46 sec), executed benchmarks: 12