
Port all relevant AVX hardware intrinsics C# APIs from SIMD native algorithms #691


Merged: 29 commits into dotnet:master on Aug 29, 2018

Conversation

@briancylui (Contributor) commented Aug 18, 2018

What's new:

  1. Implemented all relevant AVX intrinsics, sharing the same software fallbacks previously implemented in CpuMathUtils
  2. Implemented unit tests in a way compatible with both AVX and SSE alignments, and both netcoreapp and netstandard
  3. Implemented separate performance tests for AVX and SSE intrinsics, except for those that accept AlignedArray as an argument
  4. Performance test results for all applicable AVX intrinsics are updated in briancylui/machinelearning#1 (Progress on porting ML.NET native SIMD algorithms to managed code)
  5. Note: This time, most AVX intrinsics with the U suffix are implemented from scratch and do not yet have support from native code, which only contains their counterparts with the X suffix.

Description from #668:

  1. Implemented all remaining active SSE intrinsics, including their software fallbacks and passing unit tests
  2. Implemented the performance tests of all remaining active SSE intrinsics, except for those that accept AlignedArray as an argument
  3. Performance test results for all applicable, active SSE intrinsics are updated in briancylui/machinelearning#1 (Progress on porting ML.NET native SIMD algorithms to managed code)

Description from #562:

  • with unit tests and performance tests for two target frameworks: .NET Core App 3.0 and .NET Standard 2.0.
  • .NET Core App 3.0 gets the new C# hardware intrinsics APIs, while .NET Standard 2.0 gets the original native SIMD algorithms.
  • Several things have made this multi-targeting feature possible.
    1. The new CpuMathUtils class that contains the new APIs is implemented as a partial class with method definitions split into two separate files (src\Microsoft.ML.CpuMath\CpuMathUtils.[target].cs), only one of which is compiled based on the target framework.
    2. The Microsoft.ML.CpuMath.csproj file makes the switching decision to compile the right files based on the target framework.
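A rough sketch of what such conditional compilation can look like in a .csproj (hypothetical fragment; the real Microsoft.ML.CpuMath.csproj conditions and file names may differ):

```xml
<!-- Hypothetical sketch: compile only one half of the partial class,
     chosen by the target framework being built. -->
<ItemGroup Condition="'$(TargetFramework)' == 'netcoreapp3.0'">
  <Compile Include="CpuMathUtils.netcoreapp.cs" />
</ItemGroup>
<ItemGroup Condition="'$(TargetFramework)' == 'netstandard2.0'">
  <Compile Include="CpuMathUtils.netstandard.cs" />
</ItemGroup>
```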

Structure:

  1. All new hardware intrinsics APIs are contained in src\Microsoft.ML.CpuMath.
  2. Unit tests for the two target frameworks live in test\Microsoft.ML.CpuMath.UnitTests.[target], and contain the same content except for the target framework in .csproj.
  3. Performance tests live in test\Microsoft.ML.CpuMath.PerformanceTests.

Changes to users:

  1. Originally, users called intrinsics APIs via the SseUtils class in src\Microsoft.ML.CpuMath\Sse.cs; now they call them via the new CpuMathUtils class, which will handle switching between SSE and AVX in the future.
  2. CpuMathUtils methods and SseUtils methods share the same signature, but CpuMathUtils methods additionally call a new helper class (SseIntrinsics) for C# implementations of SSE operations.

Future follow-up for CpuMath enhancement

  1. Suggestions on CpuMath enhancement in this PR that are scheduled for future follow-up have been compiled into an issue page: briancylui/machinelearning#2 (Suggestions on CpuMath enhancement).

cc: @eerhardt @safern @tannergooding @danmosemsft

@briancylui (Contributor Author):

All 5 checks have passed, with no merge conflicts. This PR is now ready for review.

@@ -10,6 +10,11 @@
<LangVersion>7.3</LangVersion>
</PropertyGroup>

<PropertyGroup Condition="'$(Configuration)|$(TargetFramework)|$(Platform)'=='Release-Intrinsics|netstandard2.0|AnyCPU'">
Member:

What is this change for?

Contributor Author:

I also want to know! I think one day Visual Studio added this to the .csproj file on its own... I guess it tries to optimize stuff? But not sure...

Contributor Author:

Should I delete this?

Member:

Yes, please.

@@ -20,7 +20,7 @@

namespace Microsoft.ML.Runtime.Internal.CpuMath
{
internal static class SseIntrinsics
public static class SseIntrinsics
Member:

Why is this public? I don't think it should be.

Contributor Author:

SseIntrinsics has been made public since I wanted to call SseIntrinsics methods directly from another assembly in perf tests. Since now the original CpuMathUtils code path mostly picks up the AVX intrinsics, I may need to call SseIntrinsics methods directly to get the SSE intrinsics.

Member:

IMO, we should use InternalsVisibleTo for things like this.

{
fixed (float* pdst = dst)
{
float* pDstEnd = pdst + dst.Length;
float* pDstCurrent = pdst;

Vector128<float> x1 = Sse.SetAllVector128(scale);
Vector128<float> scalarVector = Sse.SetAllVector128(scalar);
Member:

Typically, changes like this should go in a separate PR. We shouldn't be adding AVX support, and then doing unrelated cleanup/refactoring of SSE code in the same change.

Contributor Author:

Sorry my bad... I'll keep this in mind.

private readonly AlignedArray[] testMatrices;
private readonly AlignedArray[] testSrcVectors;
private readonly AlignedArray[] testDstVectors;
private readonly float[][] _testArrays;
Member:

Making changes like this makes it super hard to review what actual changes are being done. Almost every line in this file has changed just because of these renames.

Typically, these kinds of clean ups should happen in a separate PR so it is easier to review the separate changes.

Contributor Author:

Sorry again... Should have dealt with suggestions in briancylui#2 in a separate PR. Thank you so much for your review.

AlignedArray dst = testDstVectors[dstTest];
AlignedArray mat = _testMatrices[matTest];
AlignedArray src = _testSrcVectors[srcTest];
AlignedArray dst = _testDstVectors[dstTest];

CpuMathUtils.MatTimesSrc(false, false, mat, src, dst, dst.Size);
Member:

One unfortunate result of our unit testing approach is that we are probably only hitting the AVX code paths, since all our CI machines probably have AVX supported.

@tannergooding, do you have any ideas on how we can run the same tests in CI with different environment variables, so we can turn off AVX alone, and then both AVX and SSE?

Contributor Author:

The .netstandard/UnitTests.cs would be hitting the SSE code paths, since the CpuMathUtils for .netstandard is hard-coded to call SseUtils methods. I could use the following example to accept an env var:

private static int GetSeed()
{
    int seed = DEFAULT_SEED;
    if (Environment.GetEnvironmentVariable("CPUMATH_SEED") != null)
    {
        string CPUMATH_SEED = Environment.GetEnvironmentVariable("CPUMATH_SEED");
        if (!int.TryParse(CPUMATH_SEED, out seed))
        {
            if (string.Equals(CPUMATH_SEED, "random", StringComparison.OrdinalIgnoreCase))
            {
                seed = new Random().Next();
            }
            else
            {
                seed = DEFAULT_SEED;
            }
        }
    }
    Console.WriteLine("Random seed: " + seed + "; set environment variable CPUMATH_SEED to this value to reproduce results");
    return seed;
}

The new code in CpuMathUtils could be:

private static readonly string _simdArch = Environment.GetEnvironmentVariable("SIMD_ARCH");
private static readonly bool _useAvx = _simdArch == null || string.Equals(_simdArch, "avx", StringComparison.OrdinalIgnoreCase);
private static readonly bool _useSse = _simdArch == null || string.Equals(_simdArch, "sse", StringComparison.OrdinalIgnoreCase);

In the switching function, do:

if (Avx.IsSupported && _useAvx)
{
    // AVX path
}
else if (Sse.IsSupported && _useSse)
{
    // SSE path
}
else
{
    // software fallback
}

How does it sound?

Member:

There are already environment variables for turning off Intrinsics in .NET Core.

For now, I think we could use COMPlus_EnableAVX=0 to disable AVX, and that way test the SSE code path. I'm not sure how to turn SSE off as well. If we had that, then we could test the software fallbacks too.

See https://github.com/dotnet/coreclr/issues/19519 for adding more flags support.

Reply:

and that way test the SSE code path. I'm not sure how to turn SSE off as well.

We can use COMPlus_FeatureSIMD=0 to disable SSE.
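A sketch of how CI could exercise all three code paths with these knobs (hypothetical script; the test project path is illustrative, and the knobs are read by CoreCLR at process startup):

```shell
# Run the same test suite three times, once per code path, by toggling
# CoreCLR's startup configuration knobs. Guarded so the script is a no-op
# on machines without the .NET SDK installed.
PROJ=test/Microsoft.ML.CpuMath.UnitTests.netcoreapp   # illustrative path

if command -v dotnet >/dev/null 2>&1; then
    dotnet test "$PROJ"                               # default: AVX path on AVX-capable machines
    env COMPlus_EnableAVX=0 dotnet test "$PROJ"       # force the SSE code path
    env COMPlus_FeatureSIMD=0 dotnet test "$PROJ"     # disable SIMD: software fallbacks
fi
```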

{
public class AvxPerformanceTests
{
private const int EXP_MAX = 127;
Member:

A lot of this appears to be duplicated with the SsePerformanceTests. Can we refactor the two classes so there is less duplication?

@@ -6,8 +6,16 @@ namespace Microsoft.ML.Runtime.Internal.CpuMath
{
public static partial class CpuMathUtils
{
// REVIEW NEEDED: AVX support cannot be checked in .NET Core App 2.0, so we assume Vector128 alignment for SSE. Is it okay?
Member:

This should probably say netstandard2.0 and not .NET Core App 2.0. This code is also used on the Windows-only .NET Framework.

@@ -6,8 +6,16 @@ namespace Microsoft.ML.Runtime.Internal.CpuMath
{
public static partial class CpuMathUtils
{
// REVIEW NEEDED: AVX support cannot be checked in .NET Core App 2.0, so we assume Vector128 alignment for SSE. Is it okay?
Member:

Actually, we can check AVX support by calling into the native method that does this:

// Test whether Avx is available.
EXPORT_API(bool) ChkAvx()
{
#ifdef _WIN32
    int cpuInfo[4];
    __cpuid(cpuInfo, 1);
    // The 28th bit of the second integer of cpuInfo denotes whether AVX is supported by the CPU.
    // Reference: http://msdn.microsoft.com/en-us/library/hskdteyh(v=vs.100).aspx
    return cpuInfo[2] & (1 << 28) || false;
#else
    unsigned char buffer[16];
    (void) getcpuid(1, buffer);
    // Taken from https://github.com/dotnet/coreclr/blob/b5f4d2df2e087401f2c3aab2c37021e326707915/src/vm/codeman.cpp#L1381
    return ((buffer[11] & 0x18) == 0x18) && (xmmYmmStateSupport() == 1);
#endif
}

Contributor Author:

I see - should I make the .netstandard implementation call some AVX intrinsics too since it can also check for AVX support, or should I leave out this comment and keep GetVectorAlignment() in this file .netstandard.cs to return only Vector128Alignment?

Member:

Let's leave it the way you have it for now. But this is something that could be addressed in the future.

@eerhardt (Member) commented Aug 20, 2018

Looking at your perf results, the following AVX tests appear to be slower than the SSE counterparts:

ManagedScaleSrcUPerf
ManagedAddScaleSUPerf
ManagedAddSUPerf

We should investigate why this is. Is there something about the AVX implementation that we did incorrectly? If using AVX in these scenarios doesn't lead to faster execution, maybe we shouldn't use AVX here?

All the other perf tests seem to have fairly decent gains by using AVX over SSE.

/cc @helloguo @sduttac @fiigii - since I know they are interested in AVX performance.

}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static void ZeroUpper()
Member:

@fiigii @tannergooding - Can either of you comment on this? I don't see why this is necessary.

Reply:

I believe JIT chooses VEX encoding when running on AVX capable machines. vzeroupper is not necessary for avx hw intrinsic implementations.

Reply:

Right, we do not need ZeroUpper in the JIT compilation environment, and we won't add ZeroUpper intrinsic in the future.

Contributor Author:

Thanks for your comment - I will proceed to delete this placeholder (ZeroUpper()) in my next commit.


Vector128<float> scaleVector128 = Sse.SetAllVector128(scale);

while (pIdxCurrent + 4 <= pEnd)
Member:

Do we really need a loop here? Since we know we are within 8 elements of the end, this section can execute at most once.

Contributor Author:

No, we don't need to. I should probably change all while to if regarding this case in all methods - thanks!

Meanwhile, I'm looking forward to implementing the 'double-compute' optimization suggested by @tannergooding (the bottom item in briancylui#2), which may drastically change the layout of the intrinsics files and thus will be included in a separate PR.

@helloguo:

For operations on sparse arrays, such as DotSU, AddScaleSU, AddSU and so on, GatherVector256 will help a lot. This instruction will be supported with this PR dotnet/coreclr#19392

@eerhardt I believe the perf of ManagedScaleSrcUPerf, ManagedAddScaleSUPerf, ManagedAddSUPerf will become better once we use GatherVector256 intrinsic. The current implementation without gather instruction generates a lot of memory operations and movs.

}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector256<float> ToVector256(in Vector128<float> a, in Vector128<float> b)
Member:

This should just be Avx.SetHighLow

}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector128<float> GetLow(in Vector256<float> x)
Member:

For many of these, it might be better (long term) to just call the inner method directly (for example, just call Avx.GetLowerHalf, rather than GetLow)

}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector128<float> Rotate(in Vector128<float> x)
Member:

This was already defined in the Sse code, we shouldn't duplicate when the code is identical between the two.

Contributor Author:

Would there be any perf hit if I place these duplicated functions in a separate class IntrinsicsUtils in the same assembly and have the intrinsics in Sse/AvxIntrinsics call them as IntrinsicsUtils.Rotate?

Member:

Why not just have them in, and call them from, SseIntrinsics?

Member:

but no, there shouldn't be any perf hit by having the static methods in another static class.

}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector128<float> VectorSum128(in Vector128<float> vector)
Member:

Same for this, and the other *128 functions, if the code is identical to the 128-bit code in the SseIntrinsics code, we shouldn't duplicate it here.

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector256<float> VectorMax256(in Vector256<float> vector)
{
Vector256<float> x1 = Avx.Shuffle(vector, vector, 0xB1);
Member:

The 0xB1 and 0x02 shuffle control values should be documented. It would also be useful to document why you do two shuffle/max operations.

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector256<float> GetAbsMask256()
{
return Avx.StaticCast<int, float>(Avx.SetAllVector256(0x7FFFFFFF));
Member:

The Avx.SetAllVector256 should be in a static readonly, to help with perf.

Contributor Author:

To clarify, do you mean that there should be a static readonly variable like Vector256<float> absMask256 = Avx.StaticCast<int, float>(Avx.SetAllVector256(0x7FFFFFFF)); instead of a function GetAbsMask256?

Member:

Yes.

Vector256<float> xSign = Avx.And(xDst1, signMask); // result = 0x8000 0000 if xDst1 is negative or 0x0000 0000 otherwise
Vector256<float> xDst1Abs = Avx.Xor(xDst1, xSign);

// REVIEW NEEDED: Do we want Signaling or NonSignaling? The original functionality is NonSignaling, which does not throw an exception even when there is a NaN.
Member:

CoreCLR doesn't support signaling NaN (because it doesn't support IEEE exception handling).

}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector256<float> GetNewDst256(in Vector256<float> xDst1, in Vector256<float> signMask, in Vector256<float> xThreshold)
Member:

Why is signMask being passed in? Shouldn't that be a constant?

public static int GetVectorAlignment()
{
// Assumes SSE support on machines that run ML.NET.
return Avx.IsSupported ? Vector256Alignment : Vector128Alignment;
Member:

We shouldn't assume SSE support, this may break for ARM32/ARM64 when/if they come online.

Contributor Author:

ML.NET currently assumes SSE3 support for the active existing intrinsics in SseUtils to function properly. The original alignment value was a constant value 16. When ARM32/ARM64 is adopted in the future, IMO those PRs should address this assumption, but not sure...

Member:

It would be better to, IMO, assert that this method is only called if Sse.IsSupported is true then (as the HWIntrinsic code paths should not be getting called otherwise).

It is also undesirable to assume SSE3 support, as there are platforms/OS's where SSE3 support might not be available (I don't think I saw any code that assumed this condition, for the HWIntrinsics, however).

Contributor Author:

I think adding Contracts.Assert(Sse.IsSupported); is good, but then (1) software fallbacks would fail at the assert even though they don't care about alignment, (2) external calls to GetVectorAlignment particularly in RffTransform.cs would fail before calling any software fallbacks, and (3) I'll need to add the same assert to .netstandard.cs (i.e. the original code path), which could be fine since those machines without SSE support would have failed to call SSE intrinsics anyway.

I agree with adding Contracts.Assert(Sse.IsSupported) somewhere, but also want to avoid the issues in (1) and (2), so I'd love to know about how I should proceed along this direction.

I agree that there may be some systems without SSE3 support. It seems to me that the Sse3.HorizontalAdd intrinsic, which is frequently called for getting the VectorSum, is only available in SSE3 but unfortunately not in SSE, but I may be mistaken...

Member:

To handle the software fallbacks (1), you should probably be returning the natural alignment of float (which is generally 4) or just not caring about alignment for that case in general (since the JIT will handle it for primitive types).

Member:

That is to say, there is no point in copying the AlignedArray to be 16-byte aligned, when the software fallback will only be accessing elements that require 4-byte alignment.

And long term, we probably shouldn't even have AlignedArray, and should be properly handling the head/tail of the arrays (either via double-compute or scalar access) to ensure the rest of the data is aligned.

Contributor Author:

Awesome! Will do this - thanks!

Note: In some other files like CpuAligenedMathUtils (yes it is spelled this way currently), they are checking if a.cbAlign == CpuMathUtils.GetVectorAlignment. I will change it to (a.cbAlign % CpuMathUtils.GetVectorAlignment) == 0 in my next wave of commits.

Contributor Author:

I can't do the same checks Sse/Avx.IsSupported in the GetVectorAlignment() in .netstandard.cs, so I may assume SSE support there and always return Vector128Alignment since it was the original behavior.

Member:

In all honesty, I plan on removing AlignedArray all together at some point. So I wouldn't spend too much time polishing this, as it will go away.


public static void MatTimesSrc(bool tran, bool add, AlignedArray mat, AlignedArray src, AlignedArray dst, int crun)
{
Contracts.Assert(mat.Size == dst.Size * src.Size);
Contracts.Assert(crun >= 0);

if (Sse.IsSupported)
if (Avx.IsSupported)
Member:

We should ensure that using 256-bit operations is a net-win for the entire application before "enabling" it by default. For certain workloads, it can downclock the processor and while this generally results in a higher throughput for the specific 256-bit workload, it can potentially cause a perf hit for any non 256-bit workloads that get executed before the processor goes back to its "highest" frequency level.

As per 15.26 Skylake Server Power Management (from the "Intel® 64 and IA-32 Architectures Optimization Reference Manual")
(screenshot of the cited manual section omitted)

Member:

It is not always easy to predict whether a program's performance will improve from building it to target
Intel AVX-512 instructions. Programs that enjoy high performance gains from the use of xmm or ymm
registers may expect performance improvement by moving to the use of zmm registers. However, some
programs that use zmm registers may not gain as much, or may even lose performance. It is recommended
to try multiple build options and measure the performance of the program.

The same logic applies, but to a lesser extent, to AVX-256 instructions.

Member:

@briancylui - there is one end-to-end benchmark in the repo today - https://github.com/dotnet/machinelearning/tree/master/test/Microsoft.ML.Benchmarks. Can you try running it with/without your changes and post the results?

Reply:

In addition to @tannergooding 's concern, what will be the end-to-end scenarios or applications to drive the optimizations? https://github.com/dotnet/machinelearning/tree/master/test/Microsoft.ML.Benchmarks lists one end-to-end benchmark, but it seems iris.txt is a small data set. Do we have more end-to-end scenarios to test?

Member:

Yes, we are currently formulating more benchmarks to add to the repo. See #711.

Contributor Author:

@eerhardt: Feel free to check out the perf results for end-to-end perf scenarios on briancylui#7

Contributor Author:

@tannergooding suggested earlier about checking if the input array length is greater than a certain AvxLengthLimit along with checking Avx.IsSupported. I could do some perf tests that accept a Param that is the length of the input array to see what value AvxLengthLimit should take. Same for SseLengthLimit.

I think one of the problems about AVX that impacts perf right now is that we haven't implemented the "double-compute" to take care of alignment that @tannergooding mentioned in briancylui#2 (bottom item). I look forward to implementing it after this PR, since it would change the layout of every function in a significant way.

@eerhardt (Member):

@briancylui -

Given that the commented issue (attached below) has been closed, I tried to remove the package reference System.Runtime.Intrinsics.Experimental, but then building src*CpuMath failed. How should I proceed?

Don't make any changes in this PR regarding that. This PR should strictly only be concerned with enabling AVX. That can be handled in a separate PR.

BTW - in order to remove that, you need to move both your 3.0 SDK and runtime up to a newer build.

private static Vector256<float> SetHighLow(in Vector128<float> a, in Vector128<float> b)
    => Avx.InsertVector128(Avx.ExtendToVector256(b), a, 1);
Member:

By "this should be SetHighLow", I meant that there is an Avx.SetHighLow method, which does the right functionality 😄

@briancylui (Contributor Author):

Responded to PR comments today in the latest wave of commits (mostly refactoring code). Also made two AlignedArray-related intrinsics helper functions inline, in hopes of improving perf.

This PR is now ready for a third round of review, if any.

@briancylui (Contributor Author):

This PR is now ready for a third round of review if any.

All previous PR comments have been responded to. Looking forward to optimizing perf in a separate PR.

@danmoseley (Member):

@eerhardt @tannergooding @fiigii @helloguo do you have more feedback on this PR? I would love to get this merged soon because Brian's internship will soon be ending.

@tannergooding (Member) left a comment:

Overall, LGTM.

I have a few nits here and there (mostly cleanup or improvements) but nothing I saw that would be considered blocking.

@eerhardt (Member) left a comment:

:shipit:

@briancylui (Contributor Author):

Responded to all PR feedback so far with two reviewer approvals. IMO, this PR is ready to be merged.

@safern safern merged commit 4c3759a into dotnet:master Aug 29, 2018
@safern (Member) commented Aug 29, 2018

Merged since @briancylui doesn't have write permissions.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2022
7 participants