Description
On the main progress page, the performance tests that originally sat in the src\Native\CpuMath\ folder give comparable results for the native and managed implementations of the key SSE intrinsics.
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.1155 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515623 Hz, Resolution=284.4446 ns, Timer=TSC
.NET Core SDK=2.1.300
[Host] : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT
DefaultJob : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT
Method | Mean | Error | StdDev |
---|---|---|---|
NativeDotUPerf | 363.2 us | 7.7293 us | 18.8143 us |
MyDotUPerf | 340.2 us | 6.7218 us | 8.0018 us |
NativeDotSUPerf | 2,178.3 us | 43.4641 us | 40.6563 us |
MyDotSUPerf | 2,144.7 us | 19.1638 us | 16.0027 us |
NativeSumSqUPerf | 540.6 us | 3.0299 us | 2.8342 us |
MySumSqUPerf | 538.8 us | 2.5507 us | 2.3859 us |
NativeAddUPerf | 313.9 us | 2.5163 us | 2.3537 us |
MyAddUPerf | 303.3 us | 4.5125 us | 4.2210 us |
NativeAddSUPerf | 2,691.8 us | 29.4588 us | 27.5558 us |
MyAddSUPerf | 2,658.1 us | 51.3336 us | 64.9206 us |
NativeAddScaleUPerf | 300.0 us | 5.5529 us | 5.1941 us |
MyAddScaleUPerf | 309.8 us | 5.3974 us | 4.7846 us |
NativeAddScaleSUPerf | 2,550.9 us | 21.8322 us | 20.4218 us |
MyAddScaleSUPerf | 2,805.3 us | 20.5171 us | 19.1917 us |
NativeScaleUPerf | 131.4 us | 0.6347 us | 0.5626 us |
MyScaleUPerf | 130.7 us | 1.2159 us | 1.1373 us |
NativeDist2Perf | 336.4 us | 2.0555 us | 1.9227 us |
MyDist2Perf | 335.2 us | 8.3427 us | 11.4196 us |
NativeSumAbsUPerf | 258.0 us | 1.6470 us | 1.5406 us |
MySumAbsqUPerf | 258.9 us | 0.9447 us | 0.7889 us |
NativeMulElementWiseUPerf | 466.4 us | 1.9625 us | 1.6388 us |
MyMulElementWiseUPerf | 467.2 us | 4.3560 us | 4.0747 us |
However, after the tests were moved into the test\Microsoft.ML.CpuMath.PerformanceTests\ folder, with multi-targeting, Span<T>, and a lower TargetCount (from ~20 to 3) in the toolchain, the managed DotU, SumSqU, Dist2, and SumAbsU deviate noticeably from their native counterparts, running roughly 1.6x to 2.2x slower in the .NET Core App 3.0 run. Two relevant tables are shown below.
Run in .NET Core App 3.0 (ManagedXPerf uses the managed X)
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.1209 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515626 Hz, Resolution=284.4444 ns, Timer=TSC
.NET Core SDK=3.0.100-alpha1-20180720-2
[Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT
Toolchain=InProcessToolchain LaunchCount=1 TargetCount=3
WarmupCount=3
Method | Mean | Error | StdDev |
---|---|---|---|
NativeDotUPerf | 346.2 us | 63.44 us | 3.584 us |
ManagedDotUPerf | 662.3 us | 31.82 us | 1.798 us |
NativeDotSUPerf | 2,291.9 us | 280.57 us | 15.853 us |
ManagedDotSUPerf | 2,303.4 us | 275.33 us | 15.557 us |
NativeSumSqUPerf | 551.5 us | 20.80 us | 1.175 us |
ManagedSumSqUPerf | 882.4 us | 385.73 us | 21.794 us |
NativeAddUPerf | 326.1 us | 104.46 us | 5.902 us |
ManagedAddUPerf | 324.6 us | 70.70 us | 3.995 us |
NativeAddSUPerf | 2,982.1 us | 5,531.05 us | 312.514 us |
ManagedAddSUPerf | 2,763.0 us | 951.16 us | 53.742 us |
NativeAddScaleUPerf | 327.4 us | 90.74 us | 5.127 us |
ManagedAddScaleUPerf | 324.5 us | 118.36 us | 6.688 us |
NativeAddScaleSUPerf | 2,675.9 us | 590.91 us | 33.387 us |
ManagedAddScaleSUPerf | 2,693.1 us | 62.59 us | 3.536 us |
NativeScaleUPerf | 140.2 us | 51.56 us | 2.913 us |
ManagedScaleUPerf | 155.3 us | 238.58 us | 13.480 us |
NativeDist2Perf | 348.5 us | 125.00 us | 7.063 us |
ManagedDist2Perf | 671.6 us | 518.96 us | 29.322 us |
NativeSumAbsUPerf | 272.4 us | 79.46 us | 4.490 us |
ManagedSumAbsqUPerf | 601.2 us | 86.91 us | 4.910 us |
NativeMulElementWiseUPerf | 497.3 us | 404.78 us | 22.871 us |
ManagedMulElementWiseUPerf | 493.7 us | 145.72 us | 8.233 us |
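To put a number on the deviation, the managed-to-native ratio of the mean times in the table above can be computed directly. The following is a small illustrative sketch in Python, not part of the benchmark project; the function names and the 1.25 cutoff are assumptions chosen only for this example.

```python
# Mean times in microseconds, copied from the .NET Core App 3.0 table above.
# Each entry maps an operation to (native mean, managed mean).
MEANS_US = {
    "DotU": (346.2, 662.3),
    "DotSU": (2291.9, 2303.4),
    "SumSqU": (551.5, 882.4),
    "AddU": (326.1, 324.6),
    "AddSU": (2982.1, 2763.0),
    "AddScaleU": (327.4, 324.5),
    "AddScaleSU": (2675.9, 2693.1),
    "ScaleU": (140.2, 155.3),
    "Dist2": (348.5, 671.6),
    "SumAbsU": (272.4, 601.2),
    "MulElementWiseU": (497.3, 493.7),
}

# Hypothetical cutoff: flag an operation when managed is >25% slower than native.
THRESHOLD = 1.25


def slowdown_ratios(means):
    """Return {operation: managed mean / native mean}."""
    return {name: managed / native for name, (native, managed) in means.items()}


def flag_regressions(means, threshold=THRESHOLD):
    """Return the operations whose managed/native ratio exceeds the threshold."""
    return [name for name, ratio in slowdown_ratios(means).items() if ratio > threshold]


if __name__ == "__main__":
    for name, ratio in slowdown_ratios(MEANS_US).items():
        marker = "  <-- deviates" if ratio > THRESHOLD else ""
        print(f"{name:<16} {ratio:5.2f}x{marker}")
```

With these numbers, only DotU, SumSqU, Dist2, and SumAbsU exceed the cutoff, at roughly 1.9x, 1.6x, 1.9x, and 2.2x respectively. Note that the Error column is large for a 3-iteration run, so the ratios are indicative rather than precise.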
Run in .NET Core App 2.1 (ManagedXPerf uses the native X)
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.1209 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515626 Hz, Resolution=284.4444 ns, Timer=TSC
.NET Core SDK=2.2.100-refac-20180613-1
[Host] : .NET Core 2.1.2 (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT
Toolchain=InProcessToolchain LaunchCount=1 TargetCount=3
WarmupCount=3
Method | Mean | Error | StdDev |
---|---|---|---|
NativeDotUPerf | 352.5 us | 45.87 us | 2.592 us |
ManagedDotUPerf | 346.7 us | 72.99 us | 4.124 us |
NativeDotSUPerf | 2,274.1 us | 729.91 us | 41.241 us |
ManagedDotSUPerf | 2,264.7 us | 220.67 us | 12.468 us |
NativeSumSqUPerf | 601.9 us | 41.43 us | 2.341 us |
ManagedSumSqUPerf | 562.9 us | 453.06 us | 25.599 us |
NativeAddUPerf | 333.9 us | 140.60 us | 7.944 us |
ManagedAddUPerf | 330.2 us | 143.10 us | 8.086 us |
NativeAddSUPerf | 2,839.8 us | 4,658.38 us | 263.207 us |
ManagedAddSUPerf | 2,726.4 us | 467.48 us | 26.413 us |
NativeAddScaleUPerf | 330.6 us | 58.80 us | 3.322 us |
ManagedAddScaleUPerf | 327.8 us | 88.76 us | 5.015 us |
NativeAddScaleSUPerf | 2,755.9 us | 563.51 us | 31.839 us |
ManagedAddScaleSUPerf | 2,752.0 us | 598.46 us | 33.814 us |
NativeScaleUPerf | 141.8 us | 29.23 us | 1.652 us |
ManagedScaleUPerf | 150.2 us | 202.48 us | 11.441 us |
NativeDist2Perf | 350.6 us | 44.27 us | 2.501 us |
ManagedDist2Perf | 350.2 us | 23.96 us | 1.354 us |
NativeSumAbsUPerf | 270.0 us | 82.27 us | 4.648 us |
ManagedSumAbsqUPerf | 272.9 us | 159.45 us | 9.009 us |
NativeMulElementWiseUPerf | 502.1 us | 275.94 us | 15.591 us |
ManagedMulElementWiseUPerf | 503.3 us | 125.91 us | 7.114 us |
TODOs
When I ran the performance tests in the first half of the PR review period, the results looked fine, but the most recent run above looks quite different. I will look into what causes this.
Experiments run to find the cause of the issue
- Changed the perf test job from ShortRun to Default, i.e. increasing LaunchCount and other warm-up steps to make the perf measurements more accurate.
Conclusion: Not the main factor.
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.1209 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515626 Hz, Resolution=284.4444 ns, Timer=TSC
.NET Core SDK=3.0.100-alpha1-20180720-2
[Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT
Toolchain=InProcessToolchain
Method | Mean | Error | StdDev | Median |
---|---|---|---|---|
NativeDotUPerf | 351.8 us | 5.815 us | 5.154 us | 350.7 us |
ManagedDotUPerf | 664.8 us | 5.631 us | 5.268 us | 664.7 us |
NativeDotSUPerf | 2,416.7 us | 74.286 us | 207.080 us | 2,342.7 us |
ManagedDotSUPerf | 2,311.7 us | 56.298 us | 49.907 us | 2,308.0 us |
NativeSumSqUPerf | 547.4 us | 2.866 us | 2.237 us | 546.8 us |
ManagedSumSqUPerf | 910.6 us | 24.005 us | 35.186 us | 894.9 us |
NativeAddUPerf | 328.8 us | 5.920 us | 4.943 us | 327.8 us |
ManagedAddUPerf | 357.8 us | 14.013 us | 39.524 us | 337.0 us |
NativeAddSUPerf | 2,749.6 us | 50.185 us | 100.224 us | 2,722.3 us |
ManagedAddSUPerf | 2,873.1 us | 25.477 us | 22.585 us | 2,871.1 us |
NativeAddScaleUPerf | 334.3 us | 8.223 us | 8.076 us | 331.2 us |
ManagedAddScaleUPerf | 334.6 us | 3.100 us | 2.748 us | 333.9 us |
NativeAddScaleSUPerf | 2,729.2 us | 32.378 us | 30.286 us | 2,730.1 us |
ManagedAddScaleSUPerf | 2,670.1 us | 29.478 us | 23.014 us | 2,662.7 us |
NativeScaleUPerf | 140.0 us | 1.780 us | 1.390 us | 140.0 us |
ManagedScaleUPerf | 143.3 us | 2.711 us | 2.784 us | 142.9 us |
NativeDist2Perf | 350.2 us | 3.081 us | 2.573 us | 349.6 us |
ManagedDist2Perf | 664.7 us | 2.621 us | 2.046 us | 664.6 us |
NativeSumAbsUPerf | 271.8 us | 2.229 us | 1.741 us | 271.8 us |
ManagedSumAbsUPerf | 600.1 us | 3.051 us | 2.854 us | 600.6 us |
NativeMulElementWiseUPerf | 503.8 us | 9.875 us | 8.754 us | 501.3 us |
ManagedMulElementWiseUPerf | 518.0 us | 25.485 us | 39.676 us | 498.5 us |
- Removed the dependency on Span<T> and used plain float arrays as input instead.
Conclusion: Not the main factor.
- Removed the dependency on the VectorSum function and used the original code instead.
Conclusion: This is the main factor.
Perf results after the fix:
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.1209 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515626 Hz, Resolution=284.4444 ns, Timer=TSC
.NET Core SDK=3.0.100-alpha1-20180720-2
[Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT
Toolchain=InProcessToolchain LaunchCount=1 TargetCount=3
WarmupCount=3
Method | Mean | Error | StdDev |
---|---|---|---|
NativeDotUPerf | 550.7 us | 1,839.99 us | 103.963 us |
ManagedDotUPerf | 486.4 us | 61.79 us | 3.492 us |
NativeDotSUPerf | 2,446.8 us | 405.57 us | 22.915 us |
ManagedDotSUPerf | 2,620.6 us | 219.16 us | 12.383 us |
NativeSumSqUPerf | 569.0 us | 18.54 us | 1.047 us |
ManagedSumSqUPerf | 579.5 us | 68.04 us | 3.845 us |
NativeAddUPerf | 389.9 us | 562.21 us | 31.766 us |
ManagedAddUPerf | 368.6 us | 48.36 us | 2.733 us |
NativeAddSUPerf | 4,324.0 us | 10,768.20 us | 608.423 us |
ManagedAddSUPerf | 3,118.3 us | 109.61 us | 6.193 us |
NativeAddScaleUPerf | 512.1 us | 1,694.78 us | 95.758 us |
ManagedAddScaleUPerf | 480.3 us | 252.98 us | 14.294 us |
NativeAddScaleSUPerf | 3,425.0 us | 6,916.49 us | 390.795 us |
ManagedAddScaleSUPerf | 3,161.6 us | 808.89 us | 45.704 us |
NativeScaleUPerf | 153.5 us | 52.31 us | 2.955 us |
ManagedScaleUPerf | 152.5 us | 59.76 us | 3.377 us |
NativeDist2Perf | 394.8 us | 126.76 us | 7.162 us |
ManagedDist2Perf | 386.7 us | 145.84 us | 8.240 us |
NativeSumAbsUPerf | 304.7 us | 610.29 us | 34.483 us |
ManagedSumAbsqUPerf | 291.5 us | 277.25 us | 15.665 us |
NativeMulElementWiseUPerf | 563.3 us | 124.22 us | 7.018 us |
ManagedMulElementWiseUPerf | 572.0 us | 295.43 us | 16.692 us |
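The VectorSum helper itself is not shown in this issue. Assuming it reduces the four lanes of a 128-bit accumulator to a single total (a horizontal sum), which is the last step of reductions such as DotU, SumSqU, Dist2, and SumAbsU, the pattern can be emulated in plain Python as below. The names hadd, vector_sum, and sum_sq_u are hypothetical, and the code models lane arithmetic only; it does not explain why calling the helper was slower in managed code.

```python
def hadd(a, b):
    """Emulate the lane pattern of the SSE3 horizontal add on two 4-float vectors."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]


def vector_sum(v):
    """Assumed behaviour of VectorSum: every output lane holds v0 + v1 + v2 + v3."""
    tmp = hadd(v, v)        # [v0+v1, v2+v3, v0+v1, v2+v3]
    return hadd(tmp, tmp)   # all four lanes now hold the full sum


def sum_sq_u(values):
    """Sum of squares using a 4-lane accumulator plus a scalar tail loop."""
    acc = [0.0, 0.0, 0.0, 0.0]
    full = len(values) // 4 * 4          # number of elements covered by whole vectors
    for i in range(0, full, 4):
        for lane in range(4):
            acc[lane] += values[i + lane] * values[i + lane]
    total = vector_sum(acc)[0]           # horizontal reduction of the accumulator
    for x in values[full:]:              # leftover elements that do not fill a vector
        total += x * x
    return total


if __name__ == "__main__":
    print(sum_sq_u([1.0, 2.0, 3.0, 4.0, 5.0]))
```

The horizontal reduction runs once per call, after the main loop, so in this model it is a small fixed cost.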