Description
#821 references and aims to solve this issue.
Suggested by @ahsonkhan to avoid integer overflow in bound checking inside SSE/AVX intrinsics implementation, i.e. change all while (pCurrent + 8 OR 4 <= pEnd)
into while (pEnd - pCurrent >= 8 OR 4)
.
Perf tests results before and after the change are shown below:
Before the change:
BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
[Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT
Toolchain=InProcessToolchain
Type | Method | Mean | Error | StdDev |
---|---|---|---|---|
AvxPerformanceTests | SumU | 159.4 us | 1.104 us | 0.9784 us |
NativePerformanceTests | SumU | 283.5 us | 5.492 us | 4.8687 us |
SsePerformanceTests | SumU | 281.2 us | 1.472 us | 1.3045 us |
AvxPerformanceTests | AddU | 276.1 us | 3.018 us | 2.520 us |
NativePerformanceTests | AddU | 330.1 us | 3.585 us | 3.178 us |
SsePerformanceTests | AddU | 325.6 us | 6.883 us | 7.926 us |
After the change:
BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
[Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT
Toolchain=InProcessToolchain
Type | Method | Mean | Error | StdDev |
---|---|---|---|---|
AvxPerformanceTests | SumU | 183.5 us | 3.621 us | 3.023 us |
NativePerformanceTests | SumU | 281.6 us | 5.261 us | 4.921 us |
SsePerformanceTests | SumU | 294.1 us | 2.080 us | 1.946 us |
AvxPerformanceTests | AddU | 296.3 us | 5.185 us | 4.850 us |
NativePerformanceTests | AddU | 335.1 us | 3.053 us | 2.707 us |
SsePerformanceTests | AddU | 345.0 us | 2.155 us | 1.800 us |
Both SSE and AVX implementations are slower by 10-20% after this change.
In my opinion, after seeing the perf results, I may not recommend merging this PR. I may wait until the alternative suggested by @tannergooding in an earlier PR review has been implemented (2nd item under "Functionality" in briancylui#2):
var remainder = count % elementsPerIteration;
float* pEnd = pdst + (count - remainder);
while (pDstCurrent < pEnd)
{ … }
Another question I have is: would pDstCurrent + 8 OR 4
ever have the possibility to result in integer overflow? According to my knowledge, pEnd
is initialized as pDstCurrent + count
, and there are Contract.Asserts
in the wrapper class to check that count
does not exceed the original array length. I'm not sure, and am open to any PR comments and advice.
cc: @danmosemsft @eerhardt @tannergooding @ahsonkhan @markusweimer