
Add AVX and FMA intrinsics in Factorization Machine #3940

Merged 1 commit into dotnet:master on Aug 6, 2019

Conversation

@pkumar07 pkumar07 commented Jul 1, 2019

Added AVX and FMA C# intrinsics in Microsoft.ML.StandardTrainers for the Factorization Machine algorithm, which currently uses native C++ SSE code, as suggested in #3000 and #3785.

@wschin @eerhardt

dnfclas commented Jul 1, 2019

CLA assistant check
All CLA requirements met.

hanblee commented Jul 2, 2019

@dotnet-bot test this please

hanblee commented Jul 2, 2019

@wschin @eerhardt Would you please help trigger the tests again? They seem to have failed for reasons unrelated to the changes in this PR.

eerhardt commented Jul 2, 2019

Would you please help trigger the tests again?

I have retried the failing tests. If you want to do this, in Azure Pipelines, sign in, and then at the top right is a ... button - drop that down and click "Retry".

@@ -1,7 +1,8 @@
 <Project Sdk="Microsoft.NET.Sdk">

   <PropertyGroup>
-    <TargetFramework>netstandard2.0</TargetFramework>
+    <TargetFramework Condition="'$(UseIntrinsics)' != 'true'">netstandard2.0</TargetFramework>
+    <TargetFrameworks Condition="'$(UseIntrinsics)' == 'true'">netstandard2.0;netcoreapp3.0</TargetFrameworks>
Member

This type of change isn't going to work for the same reasons as I outlined here: #3785 (comment).

Instead, the intrinsics classes should be moved to Microsoft.ML.CpuMath, which is set up to multi-target for netstandard2.0 and netcoreapp3.0.

Author

Would you please help trigger the tests again?

I have retried the failing tests. If you want to do this, in Azure Pipelines, sign in, and then at the top right is a ... button - drop that down and click "Retry".

Thanks Eric!

Author
@pkumar07 pkumar07 Jul 3, 2019

This type of change isn't going to work for the same reasons as I outlined here: #3785 (comment).

Instead, the intrinsics classes should be moved to Microsoft.ML.CpuMath, which is set up to multi-target for netstandard2.0 and netcoreapp3.0.

Should I move the intrinsic methods into the existing AvxIntrinsics class, or keep them in a separate class in Microsoft.ML.CpuMath?

Member

The separate class(es) you made sound good to me. Keeping the logic separate makes sense.

internal static unsafe class FieldAwareFactorizationMachineInterface
{
    internal const string NativePath = "FactorizationMachineNative";
    public const int CbAlign = 16;
Member

We shouldn't be duplicating all this code. Instead, the pattern is to have 3 files:

  1. a netcoreapp.cs file with the code that only applies to .NET Core.
  2. a netstandard.cs file with the code that only applies to .NET Standard.
  3. A FactorizationMachine.cs file that contains the common code between them.

Then the class is marked partial, so multiple files can add members to the class.


namespace Microsoft.ML.Internal.CpuMath
{
    internal static class FactorizationMachineAvxIntrinsics
Member
@eerhardt eerhardt Jul 8, 2019

It is a best practice to keep the file name and the class name in sync.
Similarly, keeping the folder structure and the namespace in sync as well.

So you should either:

  1. Keep the existing folder/file structure, and change the namespace to Microsoft.ML.Internal.CpuMath.FactorizationMachine and change this class name to AvxIntrinsics.
  2. Change the folder/file structure to remove the FactorizationMachine folder and rename the file to FactorizationMachineAvxIntrinsics.cs.
  3. Make some other change where the namespace and folder structures are aligned, and the class names and file names are aligned.

This makes it easier to navigate the folders/files in the repository.

Same for FmaIntrinsics.cs. #Resolved

Author

I will keep the existing file structure and change the namespace to Microsoft.ML.Internal.CpuMath.FactorizationMachine

else if (Avx.IsSupported)
    FactorizationMachineAvxIntrinsics.CalculateIntermediateVariables(fieldCount, latentDim, count, pf, pi, px, pw, Ptr(latentWeights, pv), Ptr(latentSum, pq), pr);
else
    CalculateIntermediateVariablesNative(fieldCount, latentDim, count, pf, pi, px, pw, Ptr(latentWeights, pv), Ptr(latentSum, pq), pr);
Member
@eerhardt eerhardt Jul 8, 2019

When Fma and Avx are not supported on the current machine, what do you think about not using native code, but instead falling back to a C# implementation?

The advantage of this is that we can support other architectures (like ARM) without having to compile directly to ARM code. (Note: that doesn't have to happen in this PR, but can have an issue logged for it.)

Author

I agree, it would be good to have a fallback implementation to support other architectures. I can open an issue for the same.

{
    internal static unsafe partial class FieldAwareFactorizationMachineInterface
    {
        internal const string NativePath = "FactorizationMachineNative";
Member
@eerhardt eerhardt Jul 8, 2019

Unfortunately, now that we've moved this code into CpuMath, this dependency is now backwards.

The FactorizationMachineNative assembly ships in the Microsoft.ML nuget package, which has a dependency on the Microsoft.ML.CpuMath nuget package. However, this now creates a runtime dependency from CpuMath back to an assembly shipping in the Microsoft.ML nuget package.

I think the simplest fix for this would be to remove the FactorizationMachineNative native assembly altogether, and put its 2 functions into the CpuMathNative assembly, which ships in this same nuget package. #Resolved

Member
@eerhardt eerhardt left a comment

This is really good work, @pkumar07! Thank you so much for it.

Do you happen to have any performance results to share for this change? I'd be very interested in seeing how switching between the FMA, AVX, and C++ SSE implementations affects perf.

}

[DllImport(NativePath), SuppressUnmanagedCodeSecurity]
public static extern void CalculateIntermediateVariablesNative(int fieldCount, int latentDim, int count, int* /*const*/ fieldIndices, int* /*const*/ featureIndices,
Member

These two DllImport methods should be marked private. No one outside of this class should be able to call them.

{
    internal static unsafe partial class FieldAwareFactorizationMachineInterface
    {
        internal const string NativePath = "CpuMathNative";
Member

These two const variables can be made private.

y = Avx.Add(y, Avx.Multiply(Vector256.Create(0.5f), tmp));
tmp = Avx.Add(y, Avx.Permute2x128(y, y, 1));
tmp = Avx.HorizontalAdd(tmp, tmp);
y = Avx.HorizontalAdd(tmp, tmp);
Member

Why the difference here from the C++ code?

In C++ we have:

_tmp = _mm_add_ps(_y, _mm_movehl_ps(_y, _y));
_y = _mm_add_ps(_tmp, _mm_shuffle_ps(_tmp, _tmp, 1)); // The lowest slot is the response value.

But here, we have:

tmp = Avx.Add(y, Avx.Permute2x128(y, y, 1));
tmp = Avx.HorizontalAdd(tmp, tmp);
y = Avx.HorizontalAdd(tmp, tmp);

Every other line in this method reads line-for-line identical to the C++ code (besides the SSE vs AVX differences).

Author

There is no AVX equivalent instruction for _mm_movehl_ps(_y, _y). Permuting and doing horizontal add on the two floating point elements (tmp) does the same computation.
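To see why the permute-and-horizontal-add sequence leaves the full sum in the lowest slot, here is a scalar model of the three AVX operations involved (a sketch for illustration only — a `Vector256<float>` is modeled as a list of 8 floats, and the helper names are not the real intrinsics):

```python
# Scalar model of the AVX horizontal-sum sequence discussed above.

def permute2x128_swap(v):
    """Models Avx.Permute2x128(v, v, 1): swap the two 128-bit lanes."""
    return v[4:] + v[:4]

def hadd(a, b):
    """Models Avx.HorizontalAdd(a, b): sum adjacent pairs within each 128-bit lane."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3],
            a[4] + a[5], a[6] + a[7], b[4] + b[5], b[6] + b[7]]

y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# tmp = Avx.Add(y, Avx.Permute2x128(y, y, 1))
tmp = [yi + si for yi, si in zip(y, permute2x128_swap(y))]
# tmp = Avx.HorizontalAdd(tmp, tmp); y = Avx.HorizontalAdd(tmp, tmp)
tmp = hadd(tmp, tmp)
y = hadd(tmp, tmp)

# Every slot (including the lowest) now holds the full sum 1+2+...+8.
print(y[0])  # 36.0
```

After the lane swap, each element holds a pair sum across lanes, and the two horizontal adds reduce those pairs, so the lowest slot matches the result of the SSE movehl/shuffle reduction.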

Vector256<float> h = Fma.MultiplyAdd(gLatent, gLatent, Avx.LoadVector256(hvjfprime + k));

// Perform ADAGRAD update rule to adjust latent vector.
v = Fma.MultiplyAddNegated(lr, Avx.Multiply(Avx.ReciprocalSqrt(h), gLatent), v);
Member

For my knowledge (I'm not a SIMD expert), what is the difference between MultiplyAddNegated and MultiplySubtract?

Author

The main difference is the sign applied to the product.

MultiplyAddNegated maps to __m128 _mm_fnmadd_ps (__m128 a, __m128 b, __m128 c); the result is computed as -(a*b) + c.

MultiplySubtract maps to __m128 _mm_fmsub_ps (__m128 a, __m128 b, __m128 c); the result is computed as (a*b) - c.
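The two formulas can be checked with plain scalar arithmetic (a sketch only — the real fused instructions also avoid rounding the intermediate product, which scalar math does not model):

```python
# Scalar semantics of the two FMA variants discussed above.
a, b, c = 2.0, 3.0, 10.0

fnmadd = -(a * b) + c   # MultiplyAddNegated / _mm_fnmadd_ps
fmsub = (a * b) - c     # MultiplySubtract  / _mm_fmsub_ps

print(fnmadd, fmsub)  # 4.0 -4.0

# The two results are always negations of each other: -(a*b) + c == -((a*b) - c)
assert fnmadd == -fmsub
```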


namespace Microsoft.ML.Internal.CpuMath.FactorizationMachine
{
    internal static class FmaIntrinsics
Member

Since this class is nearly identical to the AvxIntrinsics class, would it make more sense to collapse these two classes together, and then just interleave if (Fma.IsSupported) checks inline in the methods when you want to use an Fma intrinsic?

That way there isn't so much duplicated code.

We already follow that pattern here:

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector256<float> MultiplyAdd(Vector256<float> src1, Vector256<float> src2, Vector256<float> src3)
{
    if (Fma.IsSupported)
    {
        return Fma.MultiplyAdd(src1, src2, src3);
    }
    else
    {
        Vector256<float> product = Avx.Multiply(src1, src2);
        return Avx.Add(product, src3);
    }
}

Author

I agree. We can interleave in the methods to avoid code duplication. I separated the implementation into two classes to make it more readable. I will update the code to reflect these changes.

Member
@eerhardt eerhardt left a comment

This is looking really great, @pkumar07. Thank you so much for removing FactorizationMachineNative assembly.

My only major concern is the code duplication between the Avx and the Fma path. I think if we can resolve that this will be ready to be merged.

eerhardt commented Jul 9, 2019

Adding @tannergooding to give a 2nd pair of eyes on the Fma and Avx implementation.

pkumar07 commented Jul 9, 2019

This is really good work, @pkumar07! Thank you so much for it.

Do you happen to have any performance results to share for this change? I'd be very interested in seeing how switching between FMA, AVX, and the C++'s SSE implementations affect perf.

Thanks @eerhardt! I ran some performance tests and see a 14% gain over the SSE baseline for AVX and a 15% gain for FMA.

Summary:
OS=Windows 10.0.17763.55
Processor=Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview5-011568
Baseline=ML.NET Release 1.0

Command: .\dotnet.exe ..\Microsoft.ML.Console\netcoreapp3.0\MML.dll train data=higgs.idv loader=BinaryLoader{} trainer=ffm{} out=model_FM.zip

Dataset: higgs.idv

Baseline (Release 1.0) | AVX    | AVX+FMA
---------------------- | ------ | -------
75.47s                 | 66.08s | 65.89s
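As a sanity check, the quoted percentages follow from the wall-clock numbers above when the gain is measured relative to each new implementation's own time (gain = baseline/new - 1):

```python
# Reproduce the reported speedups from the timings in the table above.
baseline, avx, fma = 75.47, 66.08, 65.89

gain_avx = (baseline - avx) / avx * 100
gain_fma = (baseline - fma) / fma * 100

print(round(gain_avx), round(gain_fma))  # 14 15
```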

// Compute the output value of the field-aware factorization, as the sum of the linear part and the latent part.
// The linear part is the inner product of linearWeights and featureValues.
// The latent part is the sum of all intra-field interactions in one field f, for all fields possible
public static unsafe void CalculateIntermediateVariables(int fieldCount, int latentDim, int count, int* fieldIndices, int* featureIndices, float* featureValues,
Member

Is there any perf concern around the number of parameters being passed here?

Author

Can you elaborate more on this? For now, I have kept the parameters identical to the C++ code.

Member

C# isn't always as good about inlining as C/C++, and there are more parameters being passed than will stay in registers (even for fastcall). I was wondering if that was a concern, since most of these will end up shadow-copied to the stack and be double indirections.

Member

If it is, and if inlining isn't a valid option, reordering the parameters so the pointers come first would ensure they are the ones passed in registers (keeping them single indirections).

Author

There is a very slight difference in the codegen. Changing the order of parameters reduces one instruction. I can reorder the parameters in the code.

// The latent part is the sum of all intra-field interactions in one field f, for all fields possible
public static unsafe void CalculateIntermediateVariables(int fieldCount, int latentDim, int count, int* fieldIndices, int* featureIndices, float* featureValues,
    float* linearWeights, float* latentWeights, float* latentSum, float* response)
{
Member

Might be nice to assert that AVX is supported, given the function is dependent on it.

Vector256<float> vjfprimeBuffer = Avx.LoadVector256(vjfprime + k);
Vector256<float> q = Avx.LoadVector256(qffprime + k);
q = Avx.Add(q, Avx.Multiply(vjfprimeBuffer, x));
Avx.Store(qffprime + k, q);
Member

What's the codegen for this sequence? I would hope it folds both loads, but I've seen suboptimal codegen here sometimes.

Author

Codegen:

movsxd rbx, eax
vmovups ymm3, ymmword ptr [r8+rbx*4]
movsxd rbx, eax
vmovups ymm4, ymmword ptr [r10+rbx*4]
vmulps ymm3, ymm3, ymm2
vaddps ymm4, ymm4, ymm3
movsxd rbx, eax
vmovups ymmword ptr [r10+rbx*4], ymm4

Author

What's the codegen for this sequence? I would hope it folds both loads, but I've seen suboptimal codegen here sometimes.

@tannergooding
Should this be created as an issue in coreclr?

Member

I already logged one for it a while back: https://github.com/dotnet/coreclr/issues/25008

}
}

y = Avx.Add(y, Avx.Multiply(Vector256.Create(0.5f), tmp));
Member

The Vector256.Create(0.5f) isn't optimized right now, and this will broadcast each time. Might be better to manually hoist it to a readonly static?

Vector256<float> gLatent = Avx.Multiply(lambdav, v);

// Calculate loss function's gradient.
if (fprime != f)
Member

is this better than creating a temporary to hold the multiplier and then doing gLatent = Avx.Add(gLatent, Avx.Multiply(sx, tmp)) outside the if/else?

Author

I tried this as suggested and I see similar performance.

@@ -0,0 +1,35 @@
using System.Runtime.InteropServices;
Member

All new code files need a copyright header. You can copy one from an existing .cs file in the repo.

Member
@eerhardt eerhardt left a comment

This is really great! Thanks for the work, @pkumar07.

@pkumar07
Author

This is really great! Thanks for the work, @pkumar07.

Thanks @eerhardt!

@pkumar07
Author

@eerhardt I have addressed your comments.

@eerhardt
Member

@codemzs @wschin @artidoro - this change still needs another reviewer.

@pkumar07
Author

@codemzs @wschin @artidoro Is there any update on this PR?

@eerhardt
Member

@codemzs @wschin @artidoro - this change still needs another reviewer.

eerhardt commented Aug 1, 2019

@pkumar07 - can you sync this PR up with the latest code? Once it is synced and green, I will merge it into master.


pkumar07 commented Aug 5, 2019

@eerhardt Could you please trigger the failing tests? I don't have access to Azure Pipelines to Retry the tests.


codemzs commented Aug 6, 2019

@pkumar07 done.

@codemzs codemzs merged commit 5d90c11 into dotnet:master Aug 6, 2019

pkumar07 commented Aug 7, 2019

Thanks @codemzs! :)

@pkumar07 pkumar07 deleted the fm_managedintrinsics branch August 26, 2019 18:26
@ghost ghost locked as resolved and limited conversation to collaborators Mar 21, 2022