
Port all relevant AVX hardware intrinsics C# APIs from SIMD native algorithms #691


Merged: 29 commits into dotnet:master on Aug 29, 2018

Conversation

@briancylui (Contributor) commented Aug 18, 2018

What's new:

  1. Implemented all relevant AVX intrinsics, sharing the same software fallbacks previously implemented in CpuMathUtils
  2. Implemented unit tests in a way compatible with both AVX and SSE alignments, and both netcoreapp and netstandard
  3. Implemented separate performance tests for AVX and SSE intrinsics, except for those that accept AlignedArray as an argument
  4. Performance test results for all applicable AVX intrinsics are updated in briancylui/machinelearning#1 (Progress on porting ML.NET native SIMD algorithms to managed code)
  5. Note: This time, most AVX intrinsics with the U suffix are implemented from scratch and do not yet have support from native code, which only contains their counterparts with the X suffix.

Description from #668:

  1. Implemented all remaining active SSE intrinsics, including their software fallbacks and passing unit tests
  2. Implemented the performance tests of all remaining active SSE intrinsics, except for those that accept AlignedArray as an argument
  3. Performance test results for all applicable, active SSE intrinsics are updated in briancylui/machinelearning#1 (Progress on porting ML.NET native SIMD algorithms to managed code)

Description from #562:

  • with unit tests and performance tests for two target frameworks: .NET Core App 3.0 and .NET Standard 2.0.
  • .NET Core App 3.0 gets the new C# hardware intrinsics APIs, while .NET Standard 2.0 gets the original native SIMD algorithms.
  • Several things have made this multi-targeting feature possible.
    1. The new CpuMathUtils class that contains the new APIs is implemented as a partial class with method definitions split into two separate files (src\Microsoft.ML.CpuMath\CpuMathUtils.[target].cs), only one of which is compiled based on the target framework.
    2. The Microsoft.ML.CpuMath.csproj file makes the switching decision to compile the right files based on the target framework.
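A rough sketch of what such conditional compilation can look like in a .csproj (hypothetical fragment; the real Microsoft.ML.CpuMath.csproj conditions and file names may differ):

```xml
<!-- Hypothetical sketch: compile only one half of the partial class,
     chosen by the target framework being built. -->
<ItemGroup Condition="'$(TargetFramework)' == 'netcoreapp3.0'">
  <Compile Include="CpuMathUtils.netcoreapp.cs" />
</ItemGroup>
<ItemGroup Condition="'$(TargetFramework)' == 'netstandard2.0'">
  <Compile Include="CpuMathUtils.netstandard.cs" />
</ItemGroup>
```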

Structure:

  1. All new hardware intrinsics APIs are contained in src\Microsoft.ML.CpuMath.
  2. Unit tests for the two target frameworks live in test\Microsoft.ML.CpuMath.UnitTests.[target], and contain the same content except for the target framework in .csproj.
  3. Performance tests live in test\Microsoft.ML.CpuMath.PerformanceTests.

Changes to users:

  1. Originally, users called intrinsics APIs via the SseUtils class in src\Microsoft.ML.CpuMath\Sse.cs; now they call them via the new CpuMathUtils class, which will handle switching between SSE and AVX in the future.
  2. CpuMathUtils methods and SseUtils methods share the same signature, but CpuMathUtils methods additionally call a new helper class (SseIntrinsics) for C# implementations of SSE operations.

Future follow-up for CpuMath enhancement

  1. Suggestions on CpuMath enhancement in this PR that are scheduled for future follow-up have been compiled into an issue page: briancylui/machinelearning#2 (Suggestions on CpuMath enhancement).

cc: @eerhardt @safern @tannergooding @danmosemsft

@briancylui (Contributor Author):

All 5 checks have passed, with no merge conflicts. This PR is now ready for review.

@@ -10,6 +10,11 @@
<LangVersion>7.3</LangVersion>
</PropertyGroup>

<PropertyGroup Condition="'$(Configuration)|$(TargetFramework)|$(Platform)'=='Release-Intrinsics|netstandard2.0|AnyCPU'">
Member:

What is this change for?

Contributor Author:

I also want to know! I think one day Visual Studio added this to the .csproj file on its own... I guess it tries to optimize stuff? But not sure...

Contributor Author:

Should I delete this?

Member:

Yes, please.

@@ -20,7 +20,7 @@

namespace Microsoft.ML.Runtime.Internal.CpuMath
{
internal static class SseIntrinsics
public static class SseIntrinsics
Member:

Why is this public? I don't think it should be.

Contributor Author:

SseIntrinsics has been made public since I wanted to call SseIntrinsics methods directly from another assembly in perf tests. Since now the original CpuMathUtils code path mostly picks up the AVX intrinsics, I may need to call SseIntrinsics methods directly to get the SSE intrinsics.

Member:

IMO, we should use InternalsVisibleTo for things like this.

{
fixed (float* pdst = dst)
{
float* pDstEnd = pdst + dst.Length;
float* pDstCurrent = pdst;

Vector128<float> x1 = Sse.SetAllVector128(scale);
Vector128<float> scalarVector = Sse.SetAllVector128(scalar);
Member:

Typically, changes like this should go in a separate PR. We shouldn't be adding AVX support, and then doing unrelated cleanup/refactoring of SSE code in the same change.

Contributor Author:

Sorry my bad... I'll keep this in mind.

private readonly AlignedArray[] testMatrices;
private readonly AlignedArray[] testSrcVectors;
private readonly AlignedArray[] testDstVectors;
private readonly float[][] _testArrays;
Member:

Making changes like this makes it super hard to review what actual changes are being done. Almost every line in this file has changed just because of these renames.

Typically, these kinds of clean ups should happen in a separate PR so it is easier to review the separate changes.

Contributor Author:

Sorry again... Should have dealt with suggestions in briancylui#2 in a separate PR. Thank you so much for your review.

AlignedArray dst = testDstVectors[dstTest];
AlignedArray mat = _testMatrices[matTest];
AlignedArray src = _testSrcVectors[srcTest];
AlignedArray dst = _testDstVectors[dstTest];

CpuMathUtils.MatTimesSrc(false, false, mat, src, dst, dst.Size);
Member:

One unfortunate result of our unit testing approach is that we are probably only hitting the AVX code paths, since all our CI machines probably have AVX supported.

@tannergooding, do you have any ideas on how we can run the same tests in CI with different environment variables, so we can turn off AVX alone, and then both AVX and SSE?

Contributor Author:

The .netstandard/UnitTests.cs would be hitting the SSE code paths, since the CpuMathUtils for .netstandard is hard-coded to call SseUtils methods. I could use the following example to accept an env var:

private static int GetSeed()
{
    int seed = DEFAULT_SEED;
    if (Environment.GetEnvironmentVariable("CPUMATH_SEED") != null)
    {
        string CPUMATH_SEED = Environment.GetEnvironmentVariable("CPUMATH_SEED");
        if (!int.TryParse(CPUMATH_SEED, out seed))
        {
            if (string.Equals(CPUMATH_SEED, "random", StringComparison.OrdinalIgnoreCase))
            {
                seed = new Random().Next();
            }
            else
            {
                seed = DEFAULT_SEED;
            }
        }
    }
    Console.WriteLine("Random seed: " + seed + "; set environment variable CPUMATH_SEED to this value to reproduce results");
    return seed;
}

The new code in CpuMathUtils could be:

private static readonly string _simdArch = Environment.GetEnvironmentVariable("SIMD_ARCH");
private static readonly bool _useAvx = _simdArch == null || string.Equals(_simdArch, "avx", StringComparison.OrdinalIgnoreCase);
private static readonly bool _useSse = _simdArch == null || string.Equals(_simdArch, "sse", StringComparison.OrdinalIgnoreCase);

In the switching function, do:

if (Avx.IsSupported && _useAvx)
{
    // AVX path
}
else if (Sse.IsSupported && _useSse)
{
    // SSE path
}
else
{
    // software fallback
}

How does it sound?

Member:

There are already environment variables for turning off Intrinsics in .NET Core.

For now, I think we could use COMPlus_EnableAVX=0 to disable AVX, and that way test the SSE code path. I'm not sure how to turn SSE off as well. If we had that, then we could test the software fallbacks too.

See https://github.com/dotnet/coreclr/issues/19519 for adding more flags support.

Reply:

and that way test the SSE code path. I'm not sure how to turn SSE off as well.

We can use COMPlus_FeatureSIMD=0 to disable SSE.
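A sketch of how CI could exercise all three code paths with these knobs (hypothetical script; the test project path is illustrative, and the knobs are read by CoreCLR at process startup):

```shell
# Run the same test suite three times, once per code path, by toggling
# CoreCLR's startup configuration knobs. Guarded so the script is a no-op
# on machines without the .NET SDK installed.
PROJ=test/Microsoft.ML.CpuMath.UnitTests.netcoreapp   # illustrative path

if command -v dotnet >/dev/null 2>&1; then
    dotnet test "$PROJ"                               # default: AVX path on AVX-capable machines
    env COMPlus_EnableAVX=0 dotnet test "$PROJ"       # force the SSE code path
    env COMPlus_FeatureSIMD=0 dotnet test "$PROJ"     # disable SIMD: software fallbacks
fi
```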

{
public class AvxPerformanceTests
{
private const int EXP_MAX = 127;
Member:

A lot of this appears to be duplicated with the SsePerformanceTests. Can we refactor the two classes so there is less duplication?

@@ -6,8 +6,16 @@ namespace Microsoft.ML.Runtime.Internal.CpuMath
{
public static partial class CpuMathUtils
{
// REVIEW NEEDED: AVX support cannot be checked in .NET Core App 2.0, so we assume Vector128 alignment for SSE. Is it okay?
Member:

This should probably say netstandard2.0 and not .NET Core App 2.0. This code is also used on the Windows-only .NET Framework.

@@ -6,8 +6,16 @@ namespace Microsoft.ML.Runtime.Internal.CpuMath
{
public static partial class CpuMathUtils
{
// REVIEW NEEDED: AVX support cannot be checked in .NET Core App 2.0, so we assume Vector128 alignment for SSE. Is it okay?
Member:

Actually, we can check AVX support by calling into the native method that does this:

// Test whether Avx is available.
EXPORT_API(bool) ChkAvx()
{
#ifdef _WIN32
    int cpuInfo[4];
    __cpuid(cpuInfo, 1);
    // The 28th bit of the second integer of cpuInfo denotes whether AVX is supported by the CPU.
    // Reference: http://msdn.microsoft.com/en-us/library/hskdteyh(v=vs.100).aspx
    return cpuInfo[2] & (1 << 28) || false;
#else
    unsigned char buffer[16];
    (void) getcpuid(1, buffer);
    // Taken from https://github.com/dotnet/coreclr/blob/b5f4d2df2e087401f2c3aab2c37021e326707915/src/vm/codeman.cpp#L1381
    return ((buffer[11] & 0x18) == 0x18) && (xmmYmmStateSupport() == 1);
#endif
}

Contributor Author:

I see - should I make the .netstandard implementation call some AVX intrinsics too since it can also check for AVX support, or should I leave out this comment and keep GetVectorAlignment() in this file .netstandard.cs to return only Vector128Alignment?

Member:

Let's leave it the way you have it for now. But this is something that could be addressed in the future.

@eerhardt (Member) commented Aug 20, 2018

Looking at your perf results, the following AVX tests appear to be slower than the SSE counterparts:

ManagedScaleSrcUPerf
ManagedAddScaleSUPerf
ManagedAddSUPerf

We should investigate why this is. Is there something about the AVX implementation that we did incorrectly? If using AVX in these scenarios doesn't lead to faster execution, maybe we shouldn't use AVX here?

All the other perf tests seem to have fairly decent gains by using AVX over SSE.

/cc @helloguo @sduttac @fiigii - since I know they are interested in AVX performance.

}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static void ZeroUpper()
Member:

@fiigii @tannergooding - Can either of you comment on this? I don't see why this is necessary.

Reply:

I believe JIT chooses VEX encoding when running on AVX capable machines. vzeroupper is not necessary for avx hw intrinsic implementations.

Reply:

Right, we do not need ZeroUpper in the JIT compilation environment, and we won't add ZeroUpper intrinsic in the future.

Contributor Author:

Thanks for your comment - I will proceed to delete this placeholder (ZeroUpper()) in my next commit.


Vector128<float> scaleVector128 = Sse.SetAllVector128(scale);

while (pIdxCurrent + 4 <= pEnd)
Member:

Do we really need a loop here? Since we know we are within 8 elements of the end, this section can execute at most once.

Contributor Author:

No, we don't need to. I should probably change all while to if regarding this case in all methods - thanks!

Meanwhile, I'm looking forward to implementing the 'double-compute' optimization suggested by @tannergooding (the bottom item in briancylui#2), which may drastically change the layout of the intrinsics files and thus will be included in a separate PR.

@helloguo:

For operations on sparse arrays, such as DotSU, AddScaleSU, AddSU and so on, GatherVector256 will help a lot. This instruction will be supported with this PR dotnet/coreclr#19392

@eerhardt I believe the perf of ManagedScaleSrcUPerf, ManagedAddScaleSUPerf, ManagedAddSUPerf will become better once we use GatherVector256 intrinsic. The current implementation without gather instruction generates a lot of memory operations and movs.

}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector256<float> ToVector256(in Vector128<float> a, in Vector128<float> b)
Member:

This should just be Avx.SetHighLow

}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector128<float> GetLow(in Vector256<float> x)
Member:

For many of these, it might be better (long term) to just call the inner method directly (for example, just call Avx.GetLowerHalf, rather than GetLow)

}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector128<float> Rotate(in Vector128<float> x)
Member:

This was already defined in the Sse code, we shouldn't duplicate when the code is identical between the two.

Contributor Author:

Would there be any perf hit if I place these duplicated functions in a separate class IntrinsicsUtils in the same assembly and have the intrinsics in Sse/AvxIntrinsics call them as IntrinsicsUtils.Rotate?

Member:

Why not just have them in, and call them from, SseIntrinsics?

Member:

but no, there shouldn't be any perf hit by having the static methods in another static class.

}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector128<float> VectorSum128(in Vector128<float> vector)
Member:

Same for this, and the other *128 functions, if the code is identical to the 128-bit code in the SseIntrinsics code, we shouldn't duplicate it here.

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector256<float> VectorMax256(in Vector256<float> vector)
{
Vector256<float> x1 = Avx.Shuffle(vector, vector, 0xB1);
Member:

The 0xB1 and 0x02 shuffle control values should be documented. It would also be useful to document why you do two shuffle/max operations.

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector256<float> GetAbsMask256()
{
return Avx.StaticCast<int, float>(Avx.SetAllVector256(0x7FFFFFFF));
Member:

The Avx.SetAllVector256 should be in a static readonly, to help with perf.

Contributor Author:

To clarify, do you mean that there should be a static readonly variable like Vector256<float> absMask256 = Avx.StaticCast<int, float>(Avx.SetAllVector256(0x7FFFFFFF)); instead of a function GetAbsMask256?

Member:

Yes.

Vector256<float> xSign = Avx.And(xDst1, signMask); // result = 0x8000 0000 if xDst1 is negative or 0x0000 0000 otherwise
Vector256<float> xDst1Abs = Avx.Xor(xDst1, xSign);

// REVIEW NEEDED: Do we want Signaling or NonSignaling? The original functionality is NonSignaling, which does not throw an exception even when there is a NaN.
Member:

CoreCLR doesn't support signaling NaN (because it doesn't support IEEE exception handling).

}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector256<float> GetNewDst256(in Vector256<float> xDst1, in Vector256<float> signMask, in Vector256<float> xThreshold)
Member:

Why is signMask being passed in? Shouldn't that be a constant?

public static int GetVectorAlignment()
{
// Assumes SSE support on machines that run ML.NET.
return Avx.IsSupported ? Vector256Alignment : Vector128Alignment;
Member:

We shouldn't assume SSE support, this may break for ARM32/ARM64 when/if they come online.

Contributor Author:

ML.NET currently assumes SSE3 support for the active existing intrinsics in SseUtils to function properly. The original alignment value was a constant value 16. When ARM32/ARM64 is adopted in the future, IMO those PRs should address this assumption, but not sure...

Member:

It would be better to, IMO, assert that this method is only called if Sse.IsSupported is true then (as the HWIntrinsic code paths should not be getting called otherwise).

It is also undesirable to assume SSE3 support, as there are platforms/OS's where SSE3 support might not be available (I don't think I saw any code that assumed this condition, for the HWIntrinsics, however).

Contributor Author:

I think adding Contracts.Assert(Sse.IsSupported); is good, but then (1) software fallbacks would fail at the assert even though they don't care about alignment, (2) external calls to GetVectorAlignment particularly in RffTransform.cs would fail before calling any software fallbacks, and (3) I'll need to add the same assert to .netstandard.cs (i.e. the original code path), which could be fine since those machines without SSE support would have failed to call SSE intrinsics anyway.

I agree with adding Contracts.Assert(Sse.IsSupported) somewhere, but also want to avoid the issues in (1) and (2), so I'd love to know about how I should proceed along this direction.

I agree that there may be some systems without SSE3 support. It seems to me that the Sse3.HorizontalAdd intrinsic, which is frequently called for getting the VectorSum, is only available in SSE3 but unfortunately not in SSE, but I may be mistaken...

Member:

To handle the software fallbacks (1), you should probably be returning the natural alignment of float (which is generally 4) or just not caring about alignment for that case in general (since the JIT will handle it for primitive types).

Member:

That is to say, there is no point in copying the AlignedArray to be 16-byte aligned, when the software fallback will only be accessing elements that require 4-byte alignment.

And long term, we probably shouldn't even have AlignedArray, and should be properly handling the head/tail of the arrays (either via double-compute or scalar access) to ensure the rest of the data is aligned.

Contributor Author:

Awesome! Will do this - thanks!

Note: In some other files like CpuAligenedMathUtils (yes it is spelled this way currently), they are checking if a.cbAlign == CpuMathUtils.GetVectorAlignment. I will change it to (a.cbAlign % CpuMathUtils.GetVectorAlignment) == 0 in my next wave of commits.

Contributor Author:

I can't do the same checks Sse/Avx.IsSupported in the GetVectorAlignment() in .netstandard.cs, so I may assume SSE support there and always return Vector128Alignment since it was the original behavior.

Member:

In all honesty, I plan on removing AlignedArray all together at some point. So I wouldn't spend too much time polishing this, as it will go away.


public static void MatTimesSrc(bool tran, bool add, AlignedArray mat, AlignedArray src, AlignedArray dst, int crun)
{
Contracts.Assert(mat.Size == dst.Size * src.Size);
Contracts.Assert(crun >= 0);

if (Sse.IsSupported)
if (Avx.IsSupported)
Member:

We should ensure that using 256-bit operations is a net-win for the entire application before "enabling" it by default. For certain workloads, it can downclock the processor and while this generally results in a higher throughput for the specific 256-bit workload, it can potentially cause a perf hit for any non 256-bit workloads that get executed before the processor goes back to its "highest" frequency level.

As per 15.26 Skylake Server Power Management (from the "Intel® 64 and IA-32 Architectures Optimization Reference Manual")
(screenshot of the cited manual section omitted)

Member:

It is not always easy to predict whether a program's performance will improve from building it to target
Intel AVX-512 instructions. Programs that enjoy high performance gains from the use of xmm or ymm
registers may expect performance improvement by moving to the use of zmm registers. However, some
programs that use zmm registers may not gain as much, or may even lose performance. It is recommended
to try multiple build options and measure the performance of the program.

The same logic applies, but to a lesser extent, to AVX-256 instructions.

Member:

@briancylui - there is one end-to-end benchmark in the repo today - https://github.com/dotnet/machinelearning/tree/master/test/Microsoft.ML.Benchmarks. Can you try running it with/without your changes and post the results?

Reply:

In addition to @tannergooding 's concern, what will be the end-to-end scenarios or applications to drive the optimizations? https://github.com/dotnet/machinelearning/tree/master/test/Microsoft.ML.Benchmarks lists one end-to-end benchmark, but it seems iris.txt is a small data set. Do we have more end-to-end scenarios to test?

Member:

Yes, we are currently formulating more benchmarks to add to the repo. See #711.

Contributor Author:

@eerhardt: Feel free to check out the perf results for end-to-end perf scenarios on briancylui#7

Contributor Author:

@tannergooding suggested earlier about checking if the input array length is greater than a certain AvxLengthLimit along with checking Avx.IsSupported. I could do some perf tests that accept a Param that is the length of the input array to see what value AvxLengthLimit should take. Same for SseLengthLimit.

I think one of the problems about AVX that impacts perf right now is that we haven't implemented the "double-compute" to take care of alignment that @tannergooding mentioned in briancylui#2 (bottom item). I look forward to implementing it after this PR, since it would change the layout of every function in a significant way.

@eerhardt (Member):

@briancylui -

Given that the commented issue (attached below) has been closed, I tried to remove the package reference System.Runtime.Intrinsics.Experimental, but then building src*CpuMath failed. How should I proceed?

Don't make any changes in this PR regarding that. This PR should strictly only be concerned with enabling AVX. That can be handled in a separate PR.

BTW - in order to remove that, you need to move both your 3.0 SDK and runtime up to a newer build.

private static Vector256<float> SetHighLow(in Vector128<float> a, in Vector128<float> b)
    => Avx.InsertVector128(Avx.ExtendToVector256(b), a, 1);
Member:

By "this should be SetHighLow", I meant that there is an Avx.SetHighLow method, which does the right functionality 😄

@briancylui (Contributor Author):

Responded to PR comments today in the latest wave of commits (mostly refactoring code). Also made two AlignedArray-related intrinsics helper functions inline, in hopes of improving perf.

This PR is now ready for a third round of review, if any.

@briancylui (Contributor Author):

This PR is now ready for a third round of review if any.

All previous PR comments have been responded to. Looking forward to optimizing perf in a separate PR.

@danmoseley (Member):

@eerhardt @tannergooding @fiigii @helloguo do you have more feedback on this PR? I would love to get this merged soon because Brian's internship will soon be ending.

@tannergooding (Member) left a comment:

Overall, LGTM.

I have a few nits here and there (mostly cleanup or improvements) but nothing I saw that would be considered blocking.

@eerhardt (Member) left a comment:

:shipit:

@briancylui (Contributor Author):

Responded to all PR feedback so far with two reviewer approvals. IMO, this PR is ready to be merged.

@safern safern merged commit 4c3759a into dotnet:master Aug 29, 2018
@safern (Member) commented Aug 29, 2018

Merged since @briancylui doesn't have write permissions.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2022
7 participants