Goals
- Gain an understanding of SIMD operations and use cases
- Port ML.NET C++ SIMD algorithms to C#
- Increase ML.NET performance by using AVX operations when supported and where beneficial
- Ensure C# Hardware Intrinsics feature meets the needs of ML.NET
- Unit test all functions and get performance benchmark numbers for before and after changes
- (Stretch) provide software fallback implementations to support more architectures
- (Stretch) Implement ARM64 SIMD algorithms
Progress
Week 1: Get familiar with .NET development
- Get familiar with C# (Tutorial and Quick starts)
- Make a console app, use the debugger in Visual Studio
- Learn the SIMD operations used in ML.NET from the Intel Intrinsics Guide, and how the C functions map to C# functions
- Adopt the team's GitHub workflow and fork a local repo for work
- Implement SSE support and software fallbacks in managed code for DotU on a new .NET Core 2.1 console app with a NuGet package reference
- Write a unit test for the managed code of DotU using XUnit
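The DotU task above pairs an SSE path with a software fallback. A minimal sketch of that pattern follows, assuming the `System.Runtime.Intrinsics` API shape that shipped in .NET Core 3.0; the NuGet package available for .NET Core 2.1 may have exposed a different surface. `DotSketch` is a hypothetical name, and the sketch uses `MemoryMarshal.Cast` rather than unsafe pointers so that it stays self-contained. It is not the code from the ML.NET repo.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical class name, for illustration only.
public static class DotSketch
{
    // Dot product of two equal-length float spans.
    public static float DotU(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Spans must have the same length.");

        float result = 0f;
        int i = 0;

        if (Sse.IsSupported)
        {
            // Reinterpret whole groups of four floats as 128-bit vectors.
            // Elements that do not fill a full vector are left for the scalar loop.
            ReadOnlySpan<Vector128<float>> va = MemoryMarshal.Cast<float, Vector128<float>>(a);
            ReadOnlySpan<Vector128<float>> vb = MemoryMarshal.Cast<float, Vector128<float>>(b);

            Vector128<float> acc = Vector128<float>.Zero;
            for (int j = 0; j < va.Length; j++)
                acc = Sse.Add(acc, Sse.Multiply(va[j], vb[j]));

            // Horizontal sum of the four accumulator lanes.
            result = acc.GetElement(0) + acc.GetElement(1)
                   + acc.GetElement(2) + acc.GetElement(3);
            i = va.Length * 4;
        }

        // Software fallback: handles everything when SSE is unavailable,
        // and the leftover tail elements otherwise.
        for (; i < a.Length; i++)
            result += a[i] * b[i];

        return result;
    }
}
```

Because the scalar loop doubles as the tail handler, a unit test can use an input length that is not a multiple of four to cover both paths on SSE hardware.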
Week 2: Learn SIMD operations and use them in .NET outside of ML.NET
- Complete first connect with recruiter
- Implement SSE support and software fallbacks in managed code for all key intrinsics
- Comply with coding style standard
- Implement working unit tests for all key intrinsics
- Implement working performance tests for all key intrinsics using BenchmarkDotNet (slides and recording)
- Present performance results in a table (SsePerf-report-github.pdf)
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.1155 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515623 Hz, Resolution=284.4446 ns, Timer=TSC
.NET Core SDK=2.1.300
[Host] : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT
DefaultJob : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT
Method | Mean | Error | StdDev |
---|---|---|---|
NativeDotUPerf | 363.2 us | 7.7293 us | 18.8143 us |
MyDotUPerf | 340.2 us | 6.7218 us | 8.0018 us |
NativeDotSUPerf | 2,178.3 us | 43.4641 us | 40.6563 us |
MyDotSUPerf | 2,144.7 us | 19.1638 us | 16.0027 us |
NativeSumSqUPerf | 540.6 us | 3.0299 us | 2.8342 us |
MySumSqUPerf | 538.8 us | 2.5507 us | 2.3859 us |
NativeAddUPerf | 313.9 us | 2.5163 us | 2.3537 us |
MyAddUPerf | 303.3 us | 4.5125 us | 4.2210 us |
NativeAddSUPerf | 2,691.8 us | 29.4588 us | 27.5558 us |
MyAddSUPerf | 2,658.1 us | 51.3336 us | 64.9206 us |
NativeAddScaleUPerf | 300.0 us | 5.5529 us | 5.1941 us |
MyAddScaleUPerf | 309.8 us | 5.3974 us | 4.7846 us |
NativeAddScaleSUPerf | 2,550.9 us | 21.8322 us | 20.4218 us |
MyAddScaleSUPerf | 2,805.3 us | 20.5171 us | 19.1917 us |
NativeScaleUPerf | 131.4 us | 0.6347 us | 0.5626 us |
MyScaleUPerf | 130.7 us | 1.2159 us | 1.1373 us |
NativeDist2Perf | 336.4 us | 2.0555 us | 1.9227 us |
MyDist2Perf | 335.2 us | 8.3427 us | 11.4196 us |
NativeSumAbsUPerf | 258.0 us | 1.6470 us | 1.5406 us |
MySumAbsqUPerf | 258.9 us | 0.9447 us | 0.7889 us |
NativeMulElementWiseUPerf | 466.4 us | 1.9625 us | 1.6388 us |
MyMulElementWiseUPerf | 467.2 us | 4.3560 us | 4.0747 us |
Week 3-5: Port algo to C#, write unit tests and performance tests, check in code
- Investigate why the managed code for "sparse" intrinsics is slower than the native code
- Apply real data to test implemented managed code using BenchmarkDotNet
- Schedule a meeting for midpoint review with Dan, Eric, Santi, Tanner, and Ivan on Skype at the end of Week 5 on July 20
- Get familiar with the entire ML.NET pipeline by creating an ML project
- Integrate local code into the ML.NET repo to prepare for checking in code, including:
  - C# implementations of intrinsics
  - Unit tests
  - Performance tests
- Implement additional unit tests to test the complete code paths for two different target frameworks
- Enable multi-targeting
- Add a switch to turn the implemented code on or off at will with the UseIntrinsics build attribute
- Check in code with the PR "Port C# key hardware intrinsics APIs for SSE from SIMD native algorithms" (dotnet/machinelearning#562)
Week 6
- Participate in Microsoft Hackathon
- Attend IEEE conference
Week 7
- Respond to PR comments and Intel partners
- Fix build issues in multi-targeting and disabling netcoreapp3.0 test projects
- Hard-code unit tests
- Introduce a custom random seed in perf tests, read from environment variables, for better testing
- Make major style changes to make the best use of existing libraries and apply aggressive inlining wherever needed
- Document follow-up action items for performance enhancement in an issue page (Suggestions on CpuMath enhancement #2)
- Fix perf issues of some SSE intrinsics in compliance with C# 7.3 updates
- Fix merge conflicts and obtain green builds for PR
- PR on SSE key intrinsics, as well as their unit tests and perf tests, with multi-targeting, is approved
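The environment-variable seed mentioned in the Week 7 list can be sketched as below. This is a minimal illustration under assumptions: the variable name `CPUMATH_SEED`, the default seed value, and the `PerfTestSeed` class are all hypothetical and are not taken from the ML.NET perf tests.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class PerfTestSeed
{
    // Assumed variable name and default; the real tests may use different ones.
    public const string SeedVariable = "CPUMATH_SEED";
    public const int DefaultSeed = 253421;

    // Falls back to the default when the value is missing or not an integer.
    public static int ParseSeed(string raw) =>
        int.TryParse(raw, out int seed) ? seed : DefaultSeed;

    // Reads the seed from the environment.
    public static int GetSeed() =>
        ParseSeed(Environment.GetEnvironmentVariable(SeedVariable));

    // Generates a reproducible input array with values in [-1, 1).
    public static float[] CreateInput(int length, int seed)
    {
        var rand = new Random(seed);
        var data = new float[length];
        for (int i = 0; i < length; i++)
            data[i] = (float)(rand.NextDouble() * 2 - 1);
        return data;
    }
}
```

Keeping the parsing separate from the environment lookup lets the fallback behavior be tested without modifying the process environment.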
Week 8-9
- Scale up implementation, unit tests, and performance tests to cover all SSE intrinsics
- Write AVX implementations
- Performance test before and after. We should see some perf gains here.
- Check in code to ML.NET (submitted PR)
Perf test results for all active SSE hardware intrinsics:
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
[Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT
Toolchain=InProcessToolchain
Method | Mean | Error | StdDev | Median |
---|---|---|---|---|
NativeAddScalarUPerf | 221.7 us | 4.323 us | 5.467 us | 220.8 us |
ManagedAddScalarUPerf | 217.3 us | 4.207 us | 3.729 us | 215.5 us |
NativeScaleUPerf | 219.0 us | 2.368 us | 2.215 us | 218.9 us |
ManagedScaleUPerf | 182.2 us | 2.677 us | 2.504 us | 182.4 us |
NativeScaleSrcUPerf | 252.4 us | 4.404 us | 3.904 us | 250.8 us |
ManagedScaleSrcUPerf | 271.5 us | 5.357 us | 6.377 us | 272.0 us |
NativeScaleAddUPerf | 230.6 us | 3.230 us | 3.021 us | 230.5 us |
ManagedScaleAddUPerf | 232.3 us | 3.281 us | 2.908 us | 231.8 us |
NativeAddScaleUPerf | 317.5 us | 4.360 us | 4.079 us | 316.0 us |
ManagedAddScaleUPerf | 317.1 us | 4.778 us | 3.990 us | 317.5 us |
NativeAddScaleSUPerf | 4,135.9 us | 66.596 us | 62.294 us | 4,126.9 us |
ManagedAddScaleSUPerf | 4,812.6 us | 39.148 us | 34.704 us | 4,803.0 us |
NativeAddScaleCopyUPerf | 505.4 us | 5.658 us | 4.725 us | 503.8 us |
ManagedAddScaleCopyUPerf | 481.7 us | 9.140 us | 8.550 us | 480.0 us |
NativeAddUPerf | 316.5 us | 5.698 us | 5.330 us | 314.7 us |
ManagedAddUPerf | 335.2 us | 12.130 us | 23.944 us | 321.9 us |
NativeAddSUPerf | 4,249.0 us | 58.001 us | 54.255 us | 4,254.0 us |
ManagedAddSUPerf | 4,583.9 us | 78.739 us | 73.652 us | 4,556.6 us |
NativeMulElementWiseUPerf | 552.5 us | 7.078 us | 5.911 us | 551.5 us |
ManagedMulElementWiseUPerf | 507.9 us | 7.059 us | 6.258 us | 507.8 us |
NativeSumUPerf | 289.2 us | 5.435 us | 5.084 us | 287.6 us |
ManagedSumUPerf | 288.3 us | 2.815 us | 2.350 us | 287.8 us |
NativeSumSqUPerf | 283.2 us | 1.572 us | 1.393 us | 283.3 us |
ManagedSumSqUPerf | 289.8 us | 2.493 us | 2.210 us | 288.8 us |
NativeSumSqDiffUPerf | 289.4 us | 3.621 us | 3.387 us | 289.4 us |
ManagedSumSqDiffUPerf | 290.9 us | 2.772 us | 2.593 us | 290.0 us |
NativeSumAbsUPerf | 289.2 us | 4.836 us | 4.524 us | 287.0 us |
ManagedSumAbsUPerf | 293.1 us | 1.338 us | 1.186 us | 293.2 us |
NativeSumAbsDiffUPerf | 290.7 us | 5.000 us | 4.677 us | 288.8 us |
ManagedSumAbsDiffUPerf | 294.4 us | 5.242 us | 4.903 us | 293.0 us |
NativeMaxAbsUPerf | 288.0 us | 3.924 us | 3.671 us | 285.8 us |
ManagedMaxAbsUPerf | 290.1 us | 2.614 us | 2.317 us | 289.0 us |
NativeMaxAbsDiffUPerf | 292.1 us | 4.805 us | 4.495 us | 289.6 us |
ManagedMaxAbsDiffUPerf | 290.6 us | 2.083 us | 1.846 us | 290.3 us |
NativeDotUPerf | 328.8 us | 3.844 us | 3.407 us | 328.6 us |
ManagedDotUPerf | 333.8 us | 2.154 us | 1.910 us | 333.3 us |
NativeDotSUPerf | 3,414.2 us | 67.058 us | 68.864 us | 3,393.7 us |
ManagedDotSUPerf | 3,753.1 us | 37.440 us | 33.189 us | 3,737.5 us |
NativeDist2Perf | 332.3 us | 3.152 us | 2.632 us | 332.0 us |
ManagedDist2Perf | 333.7 us | 4.368 us | 3.647 us | 332.0 us |
NativeSdcaL1UpdateUPerf | 607.5 us | 8.506 us | 7.957 us | 608.7 us |
ManagedSdcaL1UpdateUPerf | 600.8 us | 12.003 us | 27.820 us | 591.3 us |
NativeSdcaL1UpdateSUPerf | 13,445.5 us | 116.336 us | 108.821 us | 13,447.1 us |
ManagedSdcaL1UpdateSUPerf | 13,824.3 us | 97.564 us | 86.488 us | 13,795.3 us |
Perf test results for all managed intrinsics with AVX enhancement:
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
[Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT
Toolchain=InProcessToolchain
Method | Mean | Error | StdDev |
---|---|---|---|
ManagedAddScalarUPerf | 157.3 us | 1.3138 us | 1.1647 us |
ManagedScaleUPerf | 177.0 us | 3.5143 us | 7.5649 us |
ManagedScaleSrcUPerf | 260.5 us | 0.9317 us | 0.8715 us |
ManagedScaleAddUPerf | 170.3 us | 1.6569 us | 1.5499 us |
ManagedAddScaleUPerf | 272.5 us | 5.4200 us | 9.2035 us |
ManagedAddScaleSUPerf | 5,253.6 us | 105.0419 us | 163.5375 us |
ManagedAddScaleCopyUPerf | 448.2 us | 11.0005 us | 19.8362 us |
ManagedAddUPerf | 263.4 us | 2.5347 us | 2.2469 us |
ManagedAddSUPerf | 4,256.5 us | 38.0944 us | 33.7697 us |
ManagedMulElementWiseUPerf | 441.7 us | 3.2423 us | 2.8742 us |
ManagedSumUPerf | 161.0 us | 1.3688 us | 1.2134 us |
ManagedSumSqUPerf | 165.0 us | 0.4772 us | 0.4230 us |
ManagedSumSqDiffUPerf | 179.5 us | 1.1673 us | 1.0919 us |
ManagedSumAbsUPerf | 174.9 us | 3.4667 us | 5.9799 us |
ManagedSumAbsDiffUPerf | 178.7 us | 0.6264 us | 0.4529 us |
ManagedMaxAbsUPerf | 168.2 us | 1.1892 us | 1.0542 us |
ManagedMaxAbsDiffUPerf | 179.7 us | 1.9884 us | 1.7626 us |
ManagedDotUPerf | 258.1 us | 2.6630 us | 2.2237 us |
ManagedDotSUPerf | 3,297.7 us | 23.2337 us | 19.4012 us |
ManagedDist2Perf | 258.8 us | 3.9883 us | 3.5355 us |
ManagedSdcaL1UpdateUPerf | 545.0 us | 10.7959 us | 17.1234 us |
ManagedSdcaL1UpdateSUPerf | 13,624.1 us | 34.6645 us | 32.4252 us |
Combined summary of the AVX, native, and SSE results:
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
[Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT
Toolchain=InProcessToolchain
Type | Method | Mean | Error | StdDev | Median |
---|---|---|---|---|---|
AvxPerformanceTests | AddScalarU | 192.3 us | 3.835 us | 5.2489 us | 192.1 us |
NativePerformanceTests | AddScalarU | 225.9 us | 4.407 us | 6.7300 us | 225.0 us |
SsePerformanceTests | AddScalarU | 240.7 us | 5.306 us | 15.3944 us | 237.7 us |
AvxPerformanceTests | ScaleU | 163.9 us | 2.477 us | 2.0687 us | 163.7 us |
NativePerformanceTests | ScaleU | 188.9 us | 2.688 us | 2.2447 us | 189.3 us |
SsePerformanceTests | ScaleU | 234.1 us | 6.896 us | 20.3319 us | 234.2 us |
AvxPerformanceTests | ScaleSrcU | 281.5 us | 4.158 us | 3.6856 us | 280.5 us |
NativePerformanceTests | ScaleSrcU | 298.0 us | 7.632 us | 21.8963 us | 292.2 us |
SsePerformanceTests | ScaleSrcU | 271.6 us | 5.157 us | 5.0645 us | 271.0 us |
AvxPerformanceTests | ScaleAddU | 182.6 us | 3.654 us | 3.2395 us | 181.7 us |
NativePerformanceTests | ScaleAddU | 231.1 us | 3.641 us | 3.2279 us | 230.9 us |
SsePerformanceTests | ScaleAddU | 210.4 us | 7.888 us | 23.1345 us | 198.2 us |
AvxPerformanceTests | AddScaleU | 295.9 us | 5.907 us | 15.5625 us | 296.1 us |
NativePerformanceTests | AddScaleU | 336.4 us | 5.054 us | 4.7274 us | 336.4 us |
SsePerformanceTests | AddScaleU | 330.2 us | 7.823 us | 10.7077 us | 328.1 us |
AvxPerformanceTests | AddScaleSU | 4,603.2 us | 113.641 us | 326.0574 us | 4,494.6 us |
NativePerformanceTests | AddScaleSU | 3,985.1 us | 54.772 us | 45.7368 us | 3,982.6 us |
SsePerformanceTests | AddScaleSU | 4,441.8 us | 83.317 us | 77.9344 us | 4,416.8 us |
AvxPerformanceTests | AddScaleCopyU | 534.5 us | 10.504 us | 23.2753 us | 531.9 us |
NativePerformanceTests | AddScaleCopyU | 548.6 us | 10.743 us | 15.0600 us | 543.8 us |
SsePerformanceTests | AddScaleCopyU | 504.5 us | 9.430 us | 9.2616 us | 505.9 us |
AvxPerformanceTests | AddU | 272.2 us | 5.391 us | 12.7072 us | 271.3 us |
NativePerformanceTests | AddU | 331.7 us | 6.306 us | 6.7473 us | 333.0 us |
SsePerformanceTests | AddU | 283.4 us | 5.639 us | 11.2608 us | 278.0 us |
AvxPerformanceTests | AddSU | 4,482.2 us | 90.556 us | 200.6652 us | 4,408.2 us |
NativePerformanceTests | AddSU | 4,132.2 us | 81.246 us | 113.8950 us | 4,109.6 us |
SsePerformanceTests | AddSU | 4,164.2 us | 82.393 us | 88.1599 us | 4,144.5 us |
AvxPerformanceTests | MulElementWiseU | 470.3 us | 8.353 us | 7.4044 us | 467.7 us |
NativePerformanceTests | MulElementWiseU | 465.5 us | 8.192 us | 6.8406 us | 465.1 us |
SsePerformanceTests | MulElementWiseU | 392.9 us | 7.107 us | 6.6481 us | 390.3 us |
AvxPerformanceTests | SumU | 154.2 us | 2.413 us | 2.2572 us | 153.6 us |
NativePerformanceTests | SumU | 283.2 us | 3.950 us | 3.6952 us | 282.1 us |
SsePerformanceTests | SumU | 271.7 us | 2.715 us | 2.5394 us | 271.3 us |
AvxPerformanceTests | SumSqU | 180.7 us | 3.583 us | 8.1606 us | 180.7 us |
NativePerformanceTests | SumSqU | 282.3 us | 5.702 us | 5.6003 us | 280.8 us |
SsePerformanceTests | SumSqU | 270.2 us | 1.125 us | 0.9397 us | 270.0 us |
AvxPerformanceTests | SumSqDiffU | 165.9 us | 2.453 us | 2.1745 us | 166.0 us |
NativePerformanceTests | SumSqDiffU | 287.9 us | 3.850 us | 3.6011 us | 288.0 us |
SsePerformanceTests | SumSqDiffU | 276.2 us | 5.080 us | 4.7515 us | 273.5 us |
AvxPerformanceTests | SumAbsU | 160.1 us | 3.095 us | 3.0401 us | 159.8 us |
NativePerformanceTests | SumAbsU | 289.0 us | 5.743 us | 6.6134 us | 286.2 us |
SsePerformanceTests | SumAbsU | 278.2 us | 1.676 us | 1.3994 us | 278.3 us |
AvxPerformanceTests | SumAbsDiffU | 163.8 us | 1.891 us | 1.5792 us | 163.8 us |
NativePerformanceTests | SumAbsDiffU | 288.5 us | 5.688 us | 5.3210 us | 288.7 us |
SsePerformanceTests | SumAbsDiffU | 278.6 us | 4.304 us | 4.0259 us | 277.7 us |
AvxPerformanceTests | MaxAbsU | 157.9 us | 2.158 us | 2.0189 us | 157.7 us |
NativePerformanceTests | MaxAbsU | 281.5 us | 2.903 us | 2.5732 us | 281.9 us |
SsePerformanceTests | MaxAbsU | 278.0 us | 2.890 us | 2.7033 us | 277.3 us |
AvxPerformanceTests | MaxAbsDiffU | 168.7 us | 2.555 us | 2.3895 us | 168.2 us |
NativePerformanceTests | MaxAbsDiffU | 285.9 us | 5.610 us | 5.5096 us | 283.7 us |
SsePerformanceTests | MaxAbsDiffU | 276.0 us | 3.051 us | 2.7046 us | 274.7 us |
AvxPerformanceTests | DotU | 229.6 us | 4.586 us | 4.2898 us | 228.6 us |
NativePerformanceTests | DotU | 314.1 us | 5.461 us | 4.8413 us | 313.5 us |
SsePerformanceTests | DotU | 295.9 us | 4.912 us | 4.5950 us | 293.9 us |
AvxPerformanceTests | DotSU | 3,302.5 us | 49.913 us | 44.2461 us | 3,294.7 us |
NativePerformanceTests | DotSU | 3,741.2 us | 112.502 us | 178.4404 us | 3,720.5 us |
SsePerformanceTests | DotSU | 3,492.2 us | 56.641 us | 47.2981 us | 3,485.0 us |
AvxPerformanceTests | Dist2 | 234.0 us | 4.405 us | 3.9045 us | 233.6 us |
NativePerformanceTests | Dist2 | 319.0 us | 6.373 us | 7.0833 us | 319.7 us |
SsePerformanceTests | Dist2 | 299.2 us | 5.823 us | 5.1618 us | 298.7 us |
AvxPerformanceTests | SdcaL1UpdateU | 604.1 us | 11.995 us | 35.3680 us | 593.6 us |
NativePerformanceTests | SdcaL1UpdateU | 664.3 us | 12.715 us | 12.4873 us | 661.9 us |
SsePerformanceTests | SdcaL1UpdateU | 593.3 us | 11.658 us | 16.3430 us | 594.1 us |
AvxPerformanceTests | SdcaL1UpdateSU | 12,363.5 us | 161.361 us | 143.0421 us | 12,339.6 us |
NativePerformanceTests | SdcaL1UpdateSU | 12,678.7 us | 202.557 us | 179.5616 us | 12,661.0 us |
SsePerformanceTests | SdcaL1UpdateSU | 11,670.2 us | 122.880 us | 108.9298 us | 11,645.9 us |
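
The AVX, SSE, and software paths compared in the tables above are typically selected at run time. A minimal sketch for a sum reduction follows, assuming the `System.Runtime.Intrinsics` API shape that shipped in .NET Core 3.0. `SumSketch` and `HorizontalSum` are hypothetical names, and the sketch avoids unsafe pointers for self-containment, so it should not be read as the ML.NET implementation.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical class name, for illustration only.
public static class SumSketch
{
    // Sums all elements, preferring AVX (8 floats per step), then SSE (4), then scalar.
    public static float SumU(ReadOnlySpan<float> src)
    {
        float result = 0f;
        int i = 0;

        if (Avx.IsSupported)
        {
            ReadOnlySpan<Vector256<float>> v = MemoryMarshal.Cast<float, Vector256<float>>(src);
            Vector256<float> acc = Vector256<float>.Zero;
            for (int j = 0; j < v.Length; j++)
                acc = Avx.Add(acc, v[j]);

            // Fold the upper 128 bits onto the lower 128 bits, then sum four lanes.
            result = HorizontalSum(Sse.Add(acc.GetLower(), acc.GetUpper()));
            i = v.Length * 8;
        }
        else if (Sse.IsSupported)
        {
            ReadOnlySpan<Vector128<float>> v = MemoryMarshal.Cast<float, Vector128<float>>(src);
            Vector128<float> acc = Vector128<float>.Zero;
            for (int j = 0; j < v.Length; j++)
                acc = Sse.Add(acc, v[j]);

            result = HorizontalSum(acc);
            i = v.Length * 4;
        }

        // Scalar loop: full software fallback, or the tail left by the vector loop.
        for (; i < src.Length; i++)
            result += src[i];

        return result;
    }

    // Small helper marked for aggressive inlining, as is common for hot-path helpers.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static float HorizontalSum(Vector128<float> v) =>
        v.GetElement(0) + v.GetElement(1) + v.GetElement(2) + v.GetElement(3);
}
```

The wider AVX registers halve the number of loop iterations for dense inputs, which is consistent with the larger gains the tables show for reductions such as SumU than for the sparse (SU) variants.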
Week 10-11 (Stretch)
- Provide software fallback implementations (stretch goals)
- Respond to PR feedback for AVX intrinsics
- Streamline perf test layout
- Report improvement in running time of intrinsics: 17.78% on average
- Report improvement in running time of end-to-end real-life user scenarios: 13.88%
- Get ML.NET to run on Raspberry Pi
- Present on August 31 (11am-12 noon, 25/3365, also on Skype)
Week 12
- Improve perf at the assembly/instruction level by optimizing loops and fixing alignment issues (Suggestions on CpuMath enhancement #2)
- Clean up, presentation, close out remaining issues
- Write blog post on how ML.NET is taking advantage of .NET Core hardware intrinsics, and AVX vs SSE comparisons (both implementation and runtime perf)
Latest perf results:
BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
[Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT
Toolchain=InProcessToolchain
Type | Method | Mean | Error | StdDev | Median |
---|---|---|---|---|---|
AvxPerformanceTests | AddScalarU | 157.3 us | 2.680 us | 2.376 us | 157.3 us |
NativePerformanceTests | AddScalarU | 186.7 us | 3.253 us | 3.043 us | 185.7 us |
SsePerformanceTests | AddScalarU | 184.0 us | 3.382 us | 2.824 us | 183.5 us |
AvxPerformanceTests | ScaleU | 157.5 us | 1.754 us | 1.465 us | 157.3 us |
NativePerformanceTests | ScaleU | 174.9 us | 3.437 us | 3.529 us | 173.8 us |
SsePerformanceTests | ScaleU | 184.4 us | 3.158 us | 2.799 us | 184.2 us |
AvxPerformanceTests | ScaleSrcU | 271.6 us | 4.723 us | 3.944 us | 270.1 us |
NativePerformanceTests | ScaleSrcU | 281.0 us | 3.579 us | 3.173 us | 280.7 us |
SsePerformanceTests | ScaleSrcU | 284.6 us | 4.786 us | 4.242 us | 283.6 us |
AvxPerformanceTests | ScaleAddU | 181.4 us | 2.791 us | 2.610 us | 181.6 us |
NativePerformanceTests | ScaleAddU | 192.1 us | 2.769 us | 2.312 us | 191.6 us |
SsePerformanceTests | ScaleAddU | 189.6 us | 2.190 us | 1.829 us | 189.4 us |
AvxPerformanceTests | AddScaleU | 284.1 us | 6.002 us | 5.615 us | 282.2 us |
NativePerformanceTests | AddScaleU | 327.1 us | 5.215 us | 4.623 us | 326.5 us |
SsePerformanceTests | AddScaleU | 321.2 us | 3.093 us | 2.742 us | 321.0 us |
AvxPerformanceTests | AddScaleSU | 4,630.5 us | 58.590 us | 51.939 us | 4,619.5 us |
NativePerformanceTests | AddScaleSU | 3,910.6 us | 43.011 us | 40.233 us | 3,910.2 us |
SsePerformanceTests | AddScaleSU | 4,487.7 us | 88.687 us | 82.958 us | 4,489.2 us |
AvxPerformanceTests | AddScaleCopyU | 465.9 us | 9.862 us | 9.225 us | 463.1 us |
NativePerformanceTests | AddScaleCopyU | 493.9 us | 5.991 us | 5.604 us | 494.0 us |
SsePerformanceTests | AddScaleCopyU | 501.1 us | 6.755 us | 5.988 us | 500.8 us |
AvxPerformanceTests | AddU | 281.8 us | 3.346 us | 2.794 us | 281.4 us |
NativePerformanceTests | AddU | 353.7 us | 4.312 us | 3.600 us | 353.2 us |
SsePerformanceTests | AddU | 351.6 us | 2.268 us | 1.894 us | 352.2 us |
AvxPerformanceTests | AddSU | 4,435.3 us | 38.197 us | 31.896 us | 4,433.4 us |
NativePerformanceTests | AddSU | 4,309.1 us | 50.212 us | 46.968 us | 4,313.8 us |
SsePerformanceTests | AddSU | 4,821.4 us | 60.796 us | 53.894 us | 4,812.6 us |
AvxPerformanceTests | MulElementWiseU | 522.2 us | 7.380 us | 6.543 us | 521.1 us |
NativePerformanceTests | MulElementWiseU | 472.6 us | 9.435 us | 17.721 us | 476.1 us |
SsePerformanceTests | MulElementWiseU | 470.9 us | 8.913 us | 7.901 us | 467.3 us |
AvxPerformanceTests | SumU | 165.3 us | 1.332 us | 1.180 us | 165.0 us |
NativePerformanceTests | SumU | 291.6 us | 2.791 us | 2.474 us | 291.5 us |
SsePerformanceTests | SumU | 288.7 us | 1.568 us | 1.390 us | 288.8 us |
AvxPerformanceTests | SumSqU | 167.8 us | 1.376 us | 1.220 us | 167.9 us |
NativePerformanceTests | SumSqU | 262.7 us | 2.607 us | 2.439 us | 261.9 us |
SsePerformanceTests | SumSqU | 263.3 us | 1.857 us | 1.646 us | 262.9 us |
AvxPerformanceTests | SumSqDiffU | 181.2 us | 2.185 us | 1.937 us | 180.6 us |
NativePerformanceTests | SumSqDiffU | 297.9 us | 5.733 us | 5.888 us | 294.8 us |
SsePerformanceTests | SumSqDiffU | 297.9 us | 2.855 us | 2.671 us | 297.1 us |
AvxPerformanceTests | SumAbsU | 187.8 us | 3.503 us | 3.277 us | 186.7 us |
NativePerformanceTests | SumAbsU | 261.9 us | 1.809 us | 1.510 us | 262.6 us |
SsePerformanceTests | SumAbsU | 274.4 us | 1.539 us | 1.439 us | 274.3 us |
AvxPerformanceTests | SumAbsDiffU | 190.1 us | 1.878 us | 1.568 us | 190.6 us |
NativePerformanceTests | SumAbsDiffU | 294.4 us | 2.982 us | 2.644 us | 293.7 us |
SsePerformanceTests | SumAbsDiffU | 311.4 us | 2.179 us | 1.931 us | 311.0 us |
AvxPerformanceTests | MaxAbsU | 186.8 us | 2.503 us | 2.219 us | 187.6 us |
NativePerformanceTests | MaxAbsU | 263.0 us | 2.535 us | 2.371 us | 262.5 us |
SsePerformanceTests | MaxAbsU | 274.8 us | 1.778 us | 1.576 us | 274.3 us |
AvxPerformanceTests | MaxAbsDiffU | 192.3 us | 3.816 us | 3.918 us | 190.8 us |
NativePerformanceTests | MaxAbsDiffU | 295.9 us | 1.960 us | 1.737 us | 295.7 us |
SsePerformanceTests | MaxAbsDiffU | 311.4 us | 2.292 us | 2.144 us | 311.0 us |
AvxPerformanceTests | DotU | 279.6 us | 4.530 us | 4.237 us | 279.4 us |
NativePerformanceTests | DotU | 358.4 us | 7.314 us | 16.207 us | 351.9 us |
SsePerformanceTests | DotU | 357.9 us | 3.730 us | 3.306 us | 356.8 us |
AvxPerformanceTests | DotSU | 3,374.0 us | 43.577 us | 38.630 us | 3,373.5 us |
NativePerformanceTests | DotSU | 3,443.8 us | 49.761 us | 46.546 us | 3,422.8 us |
SsePerformanceTests | DotSU | 3,959.1 us | 60.141 us | 56.256 us | 3,968.8 us |
AvxPerformanceTests | Dist2 | 268.9 us | 3.041 us | 2.845 us | 268.0 us |
NativePerformanceTests | Dist2 | 364.2 us | 4.073 us | 3.401 us | 363.7 us |
SsePerformanceTests | Dist2 | 359.5 us | 4.037 us | 3.578 us | 359.1 us |
AvxPerformanceTests | SdcaL1UpdateU | 588.4 us | 12.117 us | 15.756 us | 588.0 us |
NativePerformanceTests | SdcaL1UpdateU | 635.4 us | 12.245 us | 10.855 us | 632.8 us |
SsePerformanceTests | SdcaL1UpdateU | 628.8 us | 5.655 us | 4.722 us | 628.7 us |
AvxPerformanceTests | SdcaL1UpdateSU | 13,943.0 us | 127.516 us | 113.040 us | 13,973.4 us |
NativePerformanceTests | SdcaL1UpdateSU | 13,014.6 us | 124.704 us | 116.649 us | 13,024.6 us |
SsePerformanceTests | SdcaL1UpdateSU | 13,957.6 us | 55.439 us | 49.145 us | 13,956.9 us |
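
To read a per-intrinsic improvement off the table above, one option is the relative reduction in mean running time against the native baseline. The helper below is a sketch of that arithmetic only; `PerfSummary` is a hypothetical name, and the method used to arrive at the averaged figures reported in Week 10-11 is not stated in this issue, so this does not attempt to reproduce them.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class PerfSummary
{
    // Percentage reduction in mean running time relative to a baseline.
    // Positive means the candidate is faster; negative means it is slower.
    public static double ImprovementPercent(double baselineUs, double candidateUs)
    {
        if (baselineUs <= 0)
            throw new ArgumentOutOfRangeException(nameof(baselineUs));
        return (baselineUs - candidateUs) / baselineUs * 100.0;
    }
}
```

With the rows above, `ImprovementPercent(291.6, 165.3)` for SumU (native vs. AVX) gives roughly 43.3%, while `ImprovementPercent(3910.6, 4630.5)` for AddScaleSU gives roughly -18.4%, meaning the AVX path is slower than native for that sparse intrinsic.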