
Add AVX and FMA intrinsics in Factorization Machine #3940

Merged 1 commit into dotnet:master on Aug 6, 2019

Conversation

@pkumar07 pkumar07 commented Jul 1, 2019

Added AVX and FMA C# intrinsics in Microsoft.ML.StandardTrainers for the Factorization Machine algorithm, which currently uses native C++ SSE code, as suggested in #3000 and #3785.

@wschin @eerhardt

dnfclas commented Jul 1, 2019

CLA assistant check
All CLA requirements met.

hanblee commented Jul 2, 2019

@dotnet-bot test this please

hanblee commented Jul 2, 2019

@wschin @eerhardt Would you please help trigger the tests again? They seem to have failed for reasons unrelated to the changes in this PR.

eerhardt commented Jul 2, 2019

Would you please help trigger the tests again?

I have retried the failing tests. If you want to do this, in Azure Pipelines, sign in, and then at the top right is a ... button - drop that down and click "Retry".

@@ -1,7 +1,8 @@
 <Project Sdk="Microsoft.NET.Sdk">

   <PropertyGroup>
-    <TargetFramework>netstandard2.0</TargetFramework>
+    <TargetFramework Condition="'$(UseIntrinsics)' != 'true'">netstandard2.0</TargetFramework>
+    <TargetFrameworks Condition="'$(UseIntrinsics)' == 'true'">netstandard2.0;netcoreapp3.0</TargetFrameworks>
Member

This type of change isn't going to work for the same reasons as I outlined here: #3785 (comment).

Instead, the intrinsics classes should be moved to Microsoft.ML.CpuMath, which is set up to multi-target for netstandard2.0 and netcoreapp3.0.

Author

Would you please help trigger the tests again?

I have retried the failing tests. If you want to do this, in Azure Pipelines, sign in, and then at the top right is a ... button - drop that down and click "Retry".

Thanks Eric!

Author
@pkumar07 pkumar07 Jul 3, 2019

This type of change isn't going to work for the same reasons as I outlined here: #3785 (comment).

Instead, the intrinsics classes should be moved to Microsoft.ML.CpuMath, which is set up to multi-target for netstandard2.0 and netcoreapp3.0.

Should I move the intrinsic methods into the existing AvxIntrinsics class, or keep them in a separate class in Microsoft.ML.CpuMath?

Member

The separate class(es) you made sound good to me. Keeping the logic separate makes sense.

internal static unsafe class FieldAwareFactorizationMachineInterface
{
    internal const string NativePath = "FactorizationMachineNative";
    public const int CbAlign = 16;
Member

We shouldn't be duplicating all this code. Instead, the pattern is to have 3 files:

  1. a netcoreapp.cs file with the code that only applies to .NET Core.
  2. a netstandard.cs file with the code that only applies to .NET Standard.
  3. A FactorizationMachine.cs file that contains the common code between them.

Then the class is marked partial, so multiple files can add members to the class.


namespace Microsoft.ML.Internal.CpuMath
{
    internal static class FactorizationMachineAvxIntrinsics
Member
@eerhardt eerhardt Jul 8, 2019

It is a best practice to keep the file name and the class name in sync.
Similarly, keeping the folder structure and the namespace in sync as well.

So you should either:

  1. Keep the existing folder/file structure, and change the namespace to Microsoft.ML.Internal.CpuMath.FactorizationMachine and change this class name to AvxIntrinsics.
  2. Change the folder/file structure to remove the FactorizationMachine folder and rename the file to FactorizationMachineAvxIntrinsics.cs.
  3. Make some other change where the namespace and folder structures are aligned, and the class names and file names are aligned.

This makes it easier to navigate the folders/files in the repository.

Same for FmaIntrinsics.cs. #Resolved

Author

I will keep the existing file structure and change the namespace to Microsoft.ML.Internal.CpuMath.FactorizationMachine

else if (Avx.IsSupported)
    FactorizationMachineAvxIntrinsics.CalculateIntermediateVariables(fieldCount, latentDim, count, pf, pi, px, pw, Ptr(latentWeights, pv), Ptr(latentSum, pq), pr);
else
    CalculateIntermediateVariablesNative(fieldCount, latentDim, count, pf, pi, px, pw, Ptr(latentWeights, pv), Ptr(latentSum, pq), pr);
Member
@eerhardt eerhardt Jul 8, 2019

When Fma and Avx are not supported on the current machine, what do you think about not using native code, but instead falling back to a C# implementation?

The advantage of this is that we can support other architectures (like ARM) without having to compile directly to ARM code. (Note: that doesn't have to happen in this PR, but can have an issue logged for it.)

Author

I agree, it would be good to have a fallback implementation to support other architectures. I can open an issue for the same.

{
    internal static unsafe partial class FieldAwareFactorizationMachineInterface
    {
        internal const string NativePath = "FactorizationMachineNative";
Member
@eerhardt eerhardt Jul 8, 2019

Unfortunately, now that we've moved this code into CpuMath, this dependency is now backwards.

The FactorizationMachineNative assembly ships in the Microsoft.ML nuget package, which has a dependency on the Microsoft.ML.CpuMath nuget package. However, this now creates a runtime dependency from CpuMath back to an assembly shipping in the Microsoft.ML nuget package.

I think the simplest fix for this would be to remove the FactorizationMachineNative native assembly altogether, and put its 2 functions into the CpuMathNative assembly, which ships in this same nuget package. #Resolved

Member
@eerhardt eerhardt left a comment

This is really good work, @pkumar07! Thank you so much for it.

Do you happen to have any performance results to share for this change? I'd be very interested in seeing how switching between the FMA, AVX, and C++ SSE implementations affects perf.

}

[DllImport(NativePath), SuppressUnmanagedCodeSecurity]
public static extern void CalculateIntermediateVariablesNative(int fieldCount, int latentDim, int count, int* /*const*/ fieldIndices, int* /*const*/ featureIndices,
Member

These two DllImport methods should be marked private. No one outside of this class should be able to call them.

{
    internal static unsafe partial class FieldAwareFactorizationMachineInterface
    {
        internal const string NativePath = "CpuMathNative";
Member

These two const variables can be made private.

y = Avx.Add(y, Avx.Multiply(Vector256.Create(0.5f), tmp));
tmp = Avx.Add(y, Avx.Permute2x128(y, y, 1));
tmp = Avx.HorizontalAdd(tmp, tmp);
y = Avx.HorizontalAdd(tmp, tmp);
Member

Why the difference here from the C++ code?

In C++ we have:

_tmp = _mm_add_ps(_y, _mm_movehl_ps(_y, _y));
_y = _mm_add_ps(_tmp, _mm_shuffle_ps(_tmp, _tmp, 1)); // The lowest slot is the response value.

But here, we have:

tmp = Avx.Add(y, Avx.Permute2x128(y, y, 1));
tmp = Avx.HorizontalAdd(tmp, tmp);
y = Avx.HorizontalAdd(tmp, tmp);

Every other line in this method reads line-for-line identical to the C++ code (besides the SSE vs AVX differences).

Author

There is no AVX equivalent instruction for _mm_movehl_ps(_y, _y). Permuting and doing horizontal add on the two floating point elements (tmp) does the same computation.
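To see why the permute-and-horizontal-add sequence leaves the full sum in the lowest slot, here is a scalar model of the three AVX operations involved (a sketch for illustration only — a `Vector256<float>` is modeled as a list of 8 floats, and the helper names are not the real intrinsics):

```python
# Scalar model of the AVX horizontal-sum sequence discussed above.

def permute2x128_swap(v):
    """Models Avx.Permute2x128(v, v, 1): swap the two 128-bit lanes."""
    return v[4:] + v[:4]

def hadd(a, b):
    """Models Avx.HorizontalAdd(a, b): sum adjacent pairs within each 128-bit lane."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3],
            a[4] + a[5], a[6] + a[7], b[4] + b[5], b[6] + b[7]]

y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# tmp = Avx.Add(y, Avx.Permute2x128(y, y, 1))
tmp = [yi + si for yi, si in zip(y, permute2x128_swap(y))]
# tmp = Avx.HorizontalAdd(tmp, tmp); y = Avx.HorizontalAdd(tmp, tmp)
tmp = hadd(tmp, tmp)
y = hadd(tmp, tmp)

# Every slot (including the lowest) now holds the full sum 1+2+...+8.
print(y[0])  # 36.0
```

After the lane swap, each element holds a pair sum across lanes, and the two horizontal adds reduce those pairs, so the lowest slot matches the result of the SSE movehl/shuffle reduction.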

Vector256<float> h = Fma.MultiplyAdd(gLatent, gLatent, Avx.LoadVector256(hvjfprime + k));

// Perform ADAGRAD update rule to adjust latent vector.
v = Fma.MultiplyAddNegated(lr, Avx.Multiply(Avx.ReciprocalSqrt(h), gLatent), v);
Member

For my knowledge (I'm not a SIMD expert), what is the difference between MultiplyAddNegated and MultiplySubtract?

Author

The main difference is the sign applied to the product.

MultiplyAddNegated maps to __m128 _mm_fnmadd_ps (__m128 a, __m128 b, __m128 c); the result is computed as -(a*b) + c.

MultiplySubtract maps to __m128 _mm_fmsub_ps (__m128 a, __m128 b, __m128 c); the result is computed as (a*b) - c.
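The two formulas can be checked with plain scalar arithmetic (a sketch only — the real fused instructions also avoid rounding the intermediate product, which scalar math does not model):

```python
# Scalar semantics of the two FMA variants discussed above.
a, b, c = 2.0, 3.0, 10.0

fnmadd = -(a * b) + c   # MultiplyAddNegated / _mm_fnmadd_ps
fmsub = (a * b) - c     # MultiplySubtract  / _mm_fmsub_ps

print(fnmadd, fmsub)  # 4.0 -4.0

# The two results are always negations of each other: -(a*b) + c == -((a*b) - c)
assert fnmadd == -fmsub
```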


namespace Microsoft.ML.Internal.CpuMath.FactorizationMachine
{
    internal static class FmaIntrinsics
Member

Since this class is nearly identical to the AvxIntrinsics class, would it make more sense to collapse these two classes together, and then just interleave if (Fma.IsSupported) checks inline in the methods when you want to use an Fma intrinsic?

That way there isn't so much duplicated code.

We already follow that pattern here:

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector256<float> MultiplyAdd(Vector256<float> src1, Vector256<float> src2, Vector256<float> src3)
{
    if (Fma.IsSupported)
    {
        return Fma.MultiplyAdd(src1, src2, src3);
    }
    else
    {
        Vector256<float> product = Avx.Multiply(src1, src2);
        return Avx.Add(product, src3);
    }
}

Author

I agree. We can interleave in the methods to avoid code duplication. I separated the implementation into two classes to make it more readable. I will update the code to reflect these changes.

Member
@eerhardt eerhardt left a comment

This is looking really great, @pkumar07. Thank you so much for removing FactorizationMachineNative assembly.

My only major concern is the code duplication between the Avx and the Fma path. I think if we can resolve that this will be ready to be merged.

eerhardt commented Jul 9, 2019

Adding @tannergooding to give a 2nd pair of eyes on the Fma and Avx implementation.

pkumar07 commented Jul 9, 2019

This is really good work, @pkumar07! Thank you so much for it.

Do you happen to have any performance results to share for this change? I'd be very interested in seeing how switching between FMA, AVX, and the C++'s SSE implementations affect perf.

Thanks @eerhardt! I ran some performance tests and see a 14% gain over the SSE baseline for AVX and a 15% gain for FMA.

Summary:
OS=Windows 10.0.17763.55
Processor=Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview5-011568
Baseline=ML.NET Release 1.0

Command: .\dotnet.exe ..\Microsoft.ML.Console\netcoreapp3.0\MML.dll train data=higgs.idv loader=BinaryLoader{} trainer=ffm{} out=model_FM.zip

Dataset: higgs.idv

Baseline (Release 1.0) | AVX    | AVX+FMA
---------------------- | ------ | -------
75.47s                 | 66.08s | 65.89s
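As a sanity check, the quoted percentages follow from the wall-clock numbers above when the gain is measured relative to each new implementation's own time (gain = baseline/new - 1):

```python
# Reproduce the reported speedups from the timings in the table above.
baseline, avx, fma = 75.47, 66.08, 65.89

gain_avx = (baseline - avx) / avx * 100
gain_fma = (baseline - fma) / fma * 100

print(round(gain_avx), round(gain_fma))  # 14 15
```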

// Compute the output value of the field-aware factorization, as the sum of the linear part and the latent part.
// The linear part is the inner product of linearWeights and featureValues.
// The latent part is the sum of all intra-field interactions in one field f, for all fields possible
public static unsafe void CalculateIntermediateVariables(int fieldCount, int latentDim, int count, int* fieldIndices, int* featureIndices, float* featureValues,
Member

Is there any perf concern around the number of parameters being passed here?

Author

Can you elaborate more on this? For now, I have kept the parameters identical to the C++ code.

Member

C# isn't always as good about inlining as C/C++, and there are more parameters being passed than will stay in registers (even for fastcall). I was wondering if that was a concern, since most of these will end up shadow-copied to the stack and be double indirections.

Member

If it is, and if inlining isn't a valid option, reordering the parameters so the pointers come first would ensure they are the ones passed in registers (keeping them single indirections).

Author

There is a very slight difference in the codegen. Changing the order of parameters reduces one instruction. I can reorder the parameters in the code.

// The latent part is the sum of all intra-field interactions in one field f, for all fields possible
public static unsafe void CalculateIntermediateVariables(int fieldCount, int latentDim, int count, int* fieldIndices, int* featureIndices, float* featureValues,
    float* linearWeights, float* latentWeights, float* latentSum, float* response)
{
Member

Might be nice to assert that AVX is supported, given the function is dependent on it.

Vector256<float> vjfprimeBuffer = Avx.LoadVector256(vjfprime + k);
Vector256<float> q = Avx.LoadVector256(qffprime + k);
q = Avx.Add(q, Avx.Multiply(vjfprimeBuffer, x));
Avx.Store(qffprime + k, q);
Member

What's the codegen for this sequence? I would hope it folds both loads, but I've seen suboptimal codegen here sometimes.

Author

Codegen:

movsxd rbx, eax
vmovups ymm3, ymmword ptr [r8+rbx*4]
movsxd rbx, eax
vmovups ymm4, ymmword ptr [r10+rbx*4]
vmulps ymm3, ymm3, ymm2
vaddps ymm4, ymm4, ymm3
movsxd rbx, eax
vmovups ymmword ptr [r10+rbx*4], ymm4

Author

What's the codegen for this sequence? I would hope it folds both loads, but I've seen suboptimal codegen here sometimes.

@tannergooding
Should this be created as an issue in coreclr?

Member

I already logged one for it a while back: https://github.com/dotnet/coreclr/issues/25008

}
}

y = Avx.Add(y, Avx.Multiply(Vector256.Create(0.5f), tmp));
Member

The Vector256.Create(0.5f) isn't optimized right now, and this will broadcast each time. Might be better to manually hoist it to a readonly static?

Vector256<float> gLatent = Avx.Multiply(lambdav, v);

// Calculate loss function's gradient.
if (fprime != f)
Member

is this better than creating a temporary to hold the multiplier and then doing gLatent = Avx.Add(gLatent, Avx.Multiply(sx, tmp)) outside the if/else?

Author

I tried this as suggested and I see similar performance.

@@ -0,0 +1,35 @@
using System.Runtime.InteropServices;
Member

All new code files need a copyright header. You can copy one from an existing .cs file in the repo.

Member
@eerhardt eerhardt left a comment

This is really great! Thanks for the work, @pkumar07.

@pkumar07
Author

This is really great! Thanks for the work, @pkumar07.

Thanks @eerhardt!

@pkumar07
Author

@eerhardt I have addressed your comments.

@eerhardt
Member

@codemzs @wschin @artidoro - this change still needs another reviewer.

@pkumar07
Author

@codemzs @wschin @artidoro Is there any update on this PR?

@eerhardt
Member

@codemzs @wschin @artidoro - this change still needs another reviewer.

eerhardt commented Aug 1, 2019

@pkumar07 - can you sync this PR up with the latest code? Once it is synced and green, I will merge it into master.


pkumar07 commented Aug 5, 2019

@eerhardt Could you please trigger the failing tests? I don't have access to Azure Pipelines to Retry the tests.


codemzs commented Aug 6, 2019

@pkumar07 done.

@codemzs codemzs merged commit 5d90c11 into dotnet:master Aug 6, 2019

pkumar07 commented Aug 7, 2019

Thanks @codemzs! :)

@pkumar07 pkumar07 deleted the fm_managedintrinsics branch August 26, 2019 18:26
@ghost ghost locked as resolved and limited conversation to collaborators Mar 21, 2022