Port all relevant AVX hardware intrinsics C# APIs from SIMD native algorithms #691
Conversation
…the intrinsics Note: Building the perf tests succeeds, but running the perf tests for AVX intrinsics ends without results.
…r all used AVX intrinsics Note: Except matrix operations
… passing basic unit tests
…rix operations with longer input arrays
…different alignments separately, with some style changes to readonly variables
…tcoreapp and netstandard
All 5 checks have passed, with no merge conflicts. This PR is now ready for review.
@@ -10,6 +10,11 @@
    <LangVersion>7.3</LangVersion>
  </PropertyGroup>

  <PropertyGroup Condition="'$(Configuration)|$(TargetFramework)|$(Platform)'=='Release-Intrinsics|netstandard2.0|AnyCPU'">
What is this change for?
I also want to know! I think one day Visual Studio added this to the .csproj file on its own... I guess it tries to optimize stuff? But not sure...
Should I delete this?
Yes, please.
@@ -20,7 +20,7 @@

namespace Microsoft.ML.Runtime.Internal.CpuMath
{
-   internal static class SseIntrinsics
+   public static class SseIntrinsics
Why is this public? I don't think it should be.
SseIntrinsics has been made public because I wanted to call SseIntrinsics methods directly from another assembly in the perf tests. Since the original CpuMathUtils code path now mostly picks up the AVX intrinsics, I may need to call SseIntrinsics methods directly to get the SSE intrinsics.
IMO, we should use InternalsVisibleTo for things like this.
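A minimal sketch of that approach (assuming the perf-test assembly is named Microsoft.ML.CpuMath.PerformanceTests, as in the test folder referenced later in this thread): keep the class internal and expose internals only to the test assembly.

// In the CpuMath assembly (e.g. in an AssemblyInfo.cs or any source file):
using System.Runtime.CompilerServices;

[assembly: InternalsVisibleTo("Microsoft.ML.CpuMath.PerformanceTests")]
// If the assemblies are strong-named, the attribute also needs a
// ", PublicKey=..." suffix containing the test assembly's full public key.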
{
    fixed (float* pdst = dst)
    {
        float* pDstEnd = pdst + dst.Length;
        float* pDstCurrent = pdst;

-       Vector128<float> x1 = Sse.SetAllVector128(scale);
+       Vector128<float> scalarVector = Sse.SetAllVector128(scalar);
Typically, changes like this should go in a separate PR. We shouldn't be adding AVX support, and then doing unrelated cleanup/refactoring of SSE code in the same change.
Sorry my bad... I'll keep this in mind.
private readonly AlignedArray[] testMatrices;
private readonly AlignedArray[] testSrcVectors;
private readonly AlignedArray[] testDstVectors;
private readonly float[][] _testArrays;
Making changes like this makes it super hard to review what actual changes are being done. Almost every line in this file has changed just because of these renames.
Typically, these kinds of clean ups should happen in a separate PR so it is easier to review the separate changes.
Sorry again... Should have dealt with suggestions in briancylui#2 in a separate PR. Thank you so much for your review.
AlignedArray dst = testDstVectors[dstTest];
AlignedArray mat = _testMatrices[matTest];
AlignedArray src = _testSrcVectors[srcTest];
AlignedArray dst = _testDstVectors[dstTest];

CpuMathUtils.MatTimesSrc(false, false, mat, src, dst, dst.Size);
One unfortunate result of our unit testing approach is that we are probably only hitting the AVX code paths, since all our CI machines probably have AVX supported.
@tannergooding, do you have any ideas on how we can run the same tests in CI with different environment variables, to turn off AVX alone and then both AVX and SSE?
The .netstandard/UnitTests.cs would be hitting the SSE code paths, since the CpuMathUtils for .netstandard is hard-coded to call SseUtils methods. I could use the following example to accept an env var (machinelearning/test/Microsoft.ML.CpuMath.PerformanceTests/SsePerformanceTests.cs, lines 37 to 61 in 94401d5):
private static int GetSeed()
{
    int seed = DEFAULT_SEED;
    if (Environment.GetEnvironmentVariable("CPUMATH_SEED") != null)
    {
        string CPUMATH_SEED = Environment.GetEnvironmentVariable("CPUMATH_SEED");
        if (!int.TryParse(CPUMATH_SEED, out seed))
        {
            if (string.Equals(CPUMATH_SEED, "random", StringComparison.OrdinalIgnoreCase))
            {
                seed = new Random().Next();
            }
            else
            {
                seed = DEFAULT_SEED;
            }
        }
    }
    Console.WriteLine("Random seed: " + seed + "; set environment variable CPUMATH_SEED to this value to reproduce results");
    return seed;
}
The new code in CpuMathUtils could be:
private static readonly string _simdArch = Environment.GetEnvironmentVariable("SIMD_ARCH");
private static readonly bool _useAvx = (_simdArch == null) || string.Equals(_simdArch, "avx", StringComparison.OrdinalIgnoreCase);
private static readonly bool _useSse = (_simdArch == null) || string.Equals(_simdArch, "sse", StringComparison.OrdinalIgnoreCase);
In the switching function, do:
if (Avx.IsSupported && _useAvx) { ... }
else if (Sse.IsSupported && _useSse) { ... }
else { /* software fallback */ }
How does that sound?
There are already environment variables for turning off intrinsics in .NET Core. For now, I think we could use COMPlus_EnableAVX=false to disable AVX, and that way test the SSE code path. I'm not sure how to turn SSE off as well; if we had that, then we could test the software fallbacks too. See https://github.com/dotnet/coreclr/issues/19519 for adding more flags support.
"and that way test the SSE code path. I'm not sure how to turn SSE off as well."

We can use COMPlus_FeatureSIMD=0 to disable SSE.
{
    public class AvxPerformanceTests
    {
        private const int EXP_MAX = 127;
A lot of this appears to be duplicated with the SsePerformanceTests. Can we refactor the two classes so there is less duplication?
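One possible shape for that refactoring (a sketch only; the base-class name, array length, and seed here are illustrative, and BenchmarkDotNet attributes are assumed since these are benchmark classes):

using System;
using BenchmarkDotNet.Attributes;

// Shared setup lives in a base class; AvxPerformanceTests and
// SsePerformanceTests then only declare the [Benchmark] methods that
// call into AvxIntrinsics or SseIntrinsics respectively.
public abstract class CpuMathPerformanceTestsBase
{
    protected const int Length = 1000003; // illustrative array length
    protected float[] src;
    protected float[] dst;

    [GlobalSetup]
    public void Setup()
    {
        var rand = new Random(42); // illustrative fixed seed
        src = new float[Length];
        dst = new float[Length];
        for (int i = 0; i < Length; i++)
        {
            src[i] = (float)rand.NextDouble();
            dst[i] = (float)rand.NextDouble();
        }
    }
}

Each derived class would then contain only the benchmark methods for its own code path, so the setup logic is written once.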
@@ -6,8 +6,16 @@ namespace Microsoft.ML.Runtime.Internal.CpuMath
{
    public static partial class CpuMathUtils
    {
        // REVIEW NEEDED: AVX support cannot be checked in .NET Core App 2.0, so we assume Vector128 alignment for SSE. Is it okay?
This should probably say netstandard2.0 and not .NET Core App 2.0. This code is also used on the Windows-only .NET Framework.
Actually, we can check AVX support by calling into the native method that does this (machinelearning/src/Native/CpuMathNative/Sse.cpp, lines 88 to 105 in 94401d5):
// Test whether Avx is available.
EXPORT_API(bool) ChkAvx()
{
#ifdef _WIN32
    int cpuInfo[4];
    __cpuid(cpuInfo, 1);
    // 28th bit of second integer of Cpu Info denotes whether the Avx is supported in CPU or not
    // Reference http://msdn.microsoft.com/en-us/library/hskdteyh(v=vs.100).aspx
    return cpuInfo[2] & (1 << 28) || false;
#else
    unsigned char buffer[16];
    (void) getcpuid(1, buffer);
    // taken from https://github.com/dotnet/coreclr/blob/b5f4d2df2e087401f2c3aab2c37021e326707915/src/vm/codeman.cpp#L1381
    return ((buffer[11] & 0x18) == 0x18) && (xmmYmmStateSupport() == 1);
#endif
}
I see - should I make the .netstandard implementation call some AVX intrinsics too, since it can also check for AVX support, or should I leave out this comment and keep GetVectorAlignment() in this file (.netstandard.cs) returning only Vector128Alignment?
Let's leave it the way you have it for now. But this is something that could be addressed in the future.
Looking at your perf results, the following AVX tests appear to be slower than their SSE counterparts:
We should investigate why this is. Is there something about the AVX implementation that we did incorrectly? If using AVX in these scenarios doesn't lead to faster execution, maybe we shouldn't use AVX here? All the other perf tests seem to have fairly decent gains from using AVX over SSE. /cc @helloguo @sduttac @fiigii - since I know they are interested in AVX performance.
}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static void ZeroUpper()
@fiigii @tannergooding - Can either of you comment on this? I don't see why this is necessary.
I believe the JIT chooses VEX encoding when running on AVX-capable machines. vzeroupper is not necessary for AVX hardware intrinsic implementations.
Right, we do not need ZeroUpper in the JIT compilation environment, and we won't add a ZeroUpper intrinsic in the future.
Thanks for your comment - I will delete this placeholder (ZeroUpper()) in my next commit.
Vector128<float> scaleVector128 = Sse.SetAllVector128(scale);

while (pIdxCurrent + 4 <= pEnd)
Do we really need a loop here? Since we know we are within 8 elements of the end, this section can execute at most once.
No, we don't need to. I should probably change all the while loops to if statements for this case in all methods - thanks!
Meanwhile, I'm looking forward to implementing the 'double-compute' optimization suggested by @tannergooding (the bottom item in briancylui#2), which may drastically change the layout of the intrinsics files and will therefore be included in a separate PR.
For operations on sparse arrays, such as DotSU, AddScaleSU, AddSU and so on, @eerhardt I believe the perf of
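To illustrate the while-to-if point above with a simplified dense-array sketch (written against the preview-era intrinsics API this PR targets; the method shape is illustrative, not the PR's exact code): after the 8-wide AVX loop, at most one 4-wide SSE step can remain before the scalar tail.

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class TailHandlingSketch
{
    public static unsafe void ScaleU(float scale, Span<float> dst)
    {
        fixed (float* pdst = dst)
        {
            float* pDstCurrent = pdst;
            float* pEnd = pdst + dst.Length;

            Vector256<float> scaleVector256 = Avx.SetAllVector256(scale);
            Vector128<float> scaleVector128 = Sse.SetAllVector128(scale);

            // Main 8-wide AVX loop.
            while (pDstCurrent + 8 <= pEnd)
            {
                Vector256<float> x = Avx.LoadVector256(pDstCurrent);
                Avx.Store(pDstCurrent, Avx.Multiply(x, scaleVector256));
                pDstCurrent += 8;
            }

            // At most one 4-wide SSE step can remain, so 'if' replaces 'while'.
            if (pDstCurrent + 4 <= pEnd)
            {
                Vector128<float> x = Sse.LoadVector128(pDstCurrent);
                Sse.Store(pDstCurrent, Sse.Multiply(x, scaleVector128));
                pDstCurrent += 4;
            }

            // Scalar tail for the final 0-3 elements.
            while (pDstCurrent < pEnd)
            {
                *pDstCurrent = *pDstCurrent * scale;
                pDstCurrent++;
            }
        }
    }
}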
}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector256<float> ToVector256(in Vector128<float> a, in Vector128<float> b)
This should just be Avx.SetHighLow
}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector128<float> GetLow(in Vector256<float> x)
For many of these, it might be better (long term) to just call the inner method directly (for example, just call Avx.GetLowerHalf rather than GetLow).
}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector128<float> Rotate(in Vector128<float> x)
This was already defined in the Sse code; we shouldn't duplicate it when the code is identical between the two.
Would there be any perf hit if I place these duplicated functions in a separate class IntrinsicsUtils in the same assembly and have the intrinsics in Sse/AvxIntrinsics call them as IntrinsicsUtils.Rotate?
Why not just have them in, and call them from, SseIntrinsics?
But no, there shouldn't be any perf hit from having the static methods in another static class.
}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector128<float> VectorSum128(in Vector128<float> vector)
Same for this and the other *128 functions: if the code is identical to the 128-bit code in SseIntrinsics, we shouldn't duplicate it here.
[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector256<float> VectorMax256(in Vector256<float> vector)
{
    Vector256<float> x1 = Avx.Shuffle(vector, vector, 0xB1);
You should have 0xB1 and 0x02 documented. It would also be useful to document why you do two shuffle/max operations.
[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector256<float> GetAbsMask256()
{
    return Avx.StaticCast<int, float>(Avx.SetAllVector256(0x7FFFFFFF));
The Avx.SetAllVector256 should be in a static readonly field, to help with perf.
To clarify, do you mean that there should be a static readonly variable like Vector256<float> absMask256 = Avx.StaticCast<int, float>(Avx.SetAllVector256(0x7FFFFFFF)); instead of a function GetAbsMask256()?
Yes.
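A sketch of what that would look like, against the preview-era API used in this PR (the class, field, and method names here are illustrative):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class AvxMaskSketch
{
    // Built once instead of on every call to GetAbsMask256().
    private static readonly Vector256<float> AbsMask256 =
        Avx.StaticCast<int, float>(Avx.SetAllVector256(0x7FFFFFFF));

    // Example use: clear the sign bit of every lane.
    internal static Vector256<float> Abs(in Vector256<float> x) => Avx.And(x, AbsMask256);
}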
Vector256<float> xSign = Avx.And(xDst1, signMask); // result = 0x8000 0000 if xDst1 is negative or 0x0000 0000 otherwise
Vector256<float> xDst1Abs = Avx.Xor(xDst1, xSign);

// REVIEW NEEDED: Do we want Signaling or NonSignaling? The original functionality is NonSignaling, which does not throw an exception even when there is an NaN.
CoreCLR doesn't support signaling NaN (because it doesn't support IEEE exception handling).
}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector256<float> GetNewDst256(in Vector256<float> xDst1, in Vector256<float> signMask, in Vector256<float> xThreshold)
Why is signMask being passed in? Shouldn't that be a constant?
public static int GetVectorAlignment()
{
    // Assumes SSE support on machines that run ML.NET.
    return Avx.IsSupported ? Vector256Alignment : Vector128Alignment;
We shouldn't assume SSE support; this may break for ARM32/ARM64 when/if they come online.
ML.NET currently assumes SSE3 support for the active existing intrinsics in SseUtils to function properly. The original alignment value was a constant 16. When ARM32/ARM64 is adopted in the future, IMO those PRs should address this assumption, but I'm not sure...
It would be better, IMO, to assert that this method is only called when Sse.IsSupported is true (as the HWIntrinsic code paths should not be getting called otherwise).
It is also undesirable to assume SSE3 support, as there are platforms/OSes where SSE3 support might not be available (I don't think I saw any code that assumed this condition for the HWIntrinsics, however).
I think adding Contracts.Assert(Sse.IsSupported); is good, but then (1) the software fallbacks would fail at the assert even though they don't care about alignment, (2) external calls to GetVectorAlignment, particularly in RffTransform.cs, would fail before calling any software fallbacks, and (3) I'll need to add the same assert to .netstandard.cs (i.e. the original code path), which could be fine since machines without SSE support would have failed to call SSE intrinsics anyway.
I agree with adding Contracts.Assert(Sse.IsSupported) somewhere, but I also want to avoid the issues in (1) and (2), so I'd love to know how I should proceed in this direction.
I agree that there may be some systems without SSE3 support. It seems to me that the Sse3.HorizontalAdd intrinsic, which is frequently called to compute the VectorSum, is only available in SSE3 and unfortunately not in SSE, but I may be mistaken...
To handle the software fallbacks (1), you should probably return the natural alignment of float (which is generally 4), or just not care about alignment for that case in general (since the JIT will handle it for primitive types).
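A sketch of the resulting method (FloatAlignment is a hypothetical name for the natural alignment of float):

using System.Runtime.Intrinsics.X86;

public static partial class CpuMathUtils
{
    private const int Vector256Alignment = 32;
    private const int Vector128Alignment = 16;
    private const int FloatAlignment = 4; // natural alignment of float

    public static int GetVectorAlignment()
    {
        if (Avx.IsSupported)
            return Vector256Alignment;
        if (Sse.IsSupported)
            return Vector128Alignment;
        return FloatAlignment; // software fallback: the JIT aligns primitives itself
    }
}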
That is to say, there is no point in copying the AlignedArray to be 16-byte aligned when the software fallback will only be accessing elements that require 4-byte alignment.
And long term, we probably shouldn't even have AlignedArray; we should properly handle the head/tail of the arrays (either via double-compute or scalar access) to ensure the rest of the data is aligned.
Awesome! Will do this - thanks!
Note: some other files, like CpuAligenedMathUtils (yes, it is spelled this way currently), check whether a.cbAlign == CpuMathUtils.GetVectorAlignment. I will change that to (a.cbAlign % CpuMathUtils.GetVectorAlignment) == 0 in my next wave of commits.
I can't do the same Sse/Avx.IsSupported checks in the GetVectorAlignment() in .netstandard.cs, so I may assume SSE support there and always return Vector128Alignment, since that was the original behavior.
In all honesty, I plan on removing AlignedArray altogether at some point, so I wouldn't spend too much time polishing this, as it will go away.
public static void MatTimesSrc(bool tran, bool add, AlignedArray mat, AlignedArray src, AlignedArray dst, int crun)
{
    Contracts.Assert(mat.Size == dst.Size * src.Size);
    Contracts.Assert(crun >= 0);

-   if (Sse.IsSupported)
+   if (Avx.IsSupported)
We should ensure that using 256-bit operations is a net-win for the entire application before "enabling" it by default. For certain workloads, it can downclock the processor and while this generally results in a higher throughput for the specific 256-bit workload, it can potentially cause a perf hit for any non 256-bit workloads that get executed before the processor goes back to its "highest" frequency level.
As per 15.26 Skylake Server Power Management (from the "Intel® 64 and IA-32 Architectures Optimization Reference Manual")
It is not always easy to predict whether a program's performance will improve from building it to target
Intel AVX-512 instructions. Programs that enjoy high performance gains from the use of xmm or ymm
registers may expect performance improvement by moving to the use of zmm registers. However, some
programs that use zmm registers may not gain as much, or may even lose performance. It is recommended
to try multiple build options and measure the performance of the program.
The same logic applies, but to a lesser extent, to AVX-256 instructions.
@briancylui - there is one end-to-end benchmark in the repo today - https://github.com/dotnet/machinelearning/tree/master/test/Microsoft.ML.Benchmarks. Can you try running it with/without your changes and post the results?
In addition to @tannergooding's concern, what will be the end-to-end scenarios or applications that drive the optimizations? https://github.com/dotnet/machinelearning/tree/master/test/Microsoft.ML.Benchmarks lists one end-to-end benchmark, but it seems iris.txt is a small data set. Do we have more end-to-end scenarios to test?
Yes, we are currently formulating more benchmarks to add to the repo. See #711.
@eerhardt: Feel free to check out the perf results for end-to-end perf scenarios on briancylui#7
@tannergooding suggested earlier checking whether the input array length is greater than a certain AvxLengthLimit along with checking Avx.IsSupported. I could do some perf tests that accept a Param for the length of the input array to see what value AvxLengthLimit should take; same for SseLengthLimit.
I think one of the problems with AVX that impacts perf right now is that we haven't implemented the "double-compute" handling of alignment that @tannergooding mentioned in briancylui#2 (bottom item). I look forward to implementing it after this PR, since it would change the layout of every function in a significant way.
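A sketch of that dispatch shape (AvxLengthLimit and SseLengthLimit are hypothetical thresholds that the perf tests would determine, and the method names are illustrative):

// Inside CpuMathUtils (hypothetical threshold values shown):
private const int AvxLengthLimit = 256;
private const int SseLengthLimit = 8;

public static void Scale(float scale, Span<float> dst)
{
    if (Avx.IsSupported && dst.Length >= AvxLengthLimit)
        AvxIntrinsics.ScaleU(scale, dst);
    else if (Sse.IsSupported && dst.Length >= SseLengthLimit)
        SseIntrinsics.ScaleU(scale, dst);
    else
        ScaleSoftwareFallback(scale, dst); // hypothetical scalar path
}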
Don't make any changes in this PR regarding that. This PR should strictly be concerned only with enabling AVX; that can be handled in a separate PR. BTW - in order to remove that, you need to move both your 3.0 SDK and runtime up to a newer build.
    return Avx.InsertVector128(Avx.ExtendToVector256(b), a, 1);
}

private static Vector256<float> SetHighLow(in Vector128<float> a, in Vector128<float> b)
    => Avx.InsertVector128(Avx.ExtendToVector256(b), a, 1);
By "this should be SetHighLow", I meant that there is an Avx.SetHighLow method, which does the right functionality 😄
Responded to PR comments today in the latest wave of commits (mostly on refactoring code). Also made two intrinsics helper functions.
This PR is now ready for a third round of review, if any. All previous PR comments have been responded to. Looking forward to optimizing perf in a separate PR.
@eerhardt @tannergooding @fiigii @helloguo do you have more feedback on this PR? I would love to get this merged soon because Brian's internship will soon be ending.
Overall, LGTM.
I have a few nits here and there (mostly cleanup or improvements) but nothing I saw that would be considered blocking.
Responded to all PR feedback so far, with two reviewer approvals. IMO, this PR is ready to be merged.
Merged since @briancylui doesn't have write permissions.
What's new:
- CpuMathUtils
- AlignedArray as an argument
- Methods with the U suffix are implemented from scratch and do not yet have support from native code, which only contains their counterparts with the X suffix.

Description from #668:
- AlignedArray as an argument

Description from #562:

Structure:

Changes to users:

Future follow-up for CpuMath enhancement:
- CpuMath enhancements in this PR scheduled for future follow-up have been compiled into an issue page (Suggestions on CpuMath enhancement, briancylui/machinelearning#2).

cc: @eerhardt @safern @tannergooding @danmosemsft