Skip to content

Port C# key hardware intrinsics APIs for SSE from SIMD native algorithms #562

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 43 commits into from
Aug 6, 2018

Conversation

briancylui
Copy link
Contributor

@briancylui briancylui commented Jul 20, 2018

  • with unit tests and performance tests for two target frameworks: .NET Core App 3.0 and .NET Standard 2.0.
  • .NET Core App 3.0 gets the new C# hardware intrinsics APIs, while .NET Standard 2.0 gets the original native SIMD algorithms.
  • Several things have made this multi-targeting feature possible.
    1. The new CpuMathUtils class that contains the new APIs is implemented as a partial class with method definitions split into two separate files (src\Microsoft.ML.CpuMath\CpuMathUtils.[target].cs), only one of which is compiled based on the target framework.
    2. The Microsoft.ML.CpuMath.csproj file makes the switching decision to compile the right files based on the target framework.

Structure:

  1. All new hardware intrinsics APIs are contained in src\Microsoft.ML.CpuMath.
  2. Unit tests for the two target frameworks live in test\Microsoft.ML.CpuMath.UnitTests.[target], and contain the same content except for the target framework in .csproj.
  3. Performance tests live in test\Microsoft.ML.CpuMath.PerformanceTests.

Changes to users:

  1. Originally, users call intrinsics APIs via the SseUtils class in src\Microsoft.ML.CpuMath\Sse.cs, but now they call them via the new CpuMathUtils class, which will handle switching between SSE and AVX in the future.
  2. CpuMathUtils methods and SseUtils methods share the same signature, but CpuMathUtils methods additionally call a new helper class (SseIntrinsics) for C# implementations of SSE operations.

Future follow-up for CpuMath enhancement

  1. Suggestions on CpuMath enhancement in this PR scheduled for future follow-ups have been compiled into an issue page here.

briancylui and others added 24 commits July 19, 2018 16:06
Need to multi-target CpuMath for netstandard and netcoreapp2.1.  Also, since we are going to move CpuMath into its own NuGet package, remove the dependency from CpuMath to the ML.Core project.
…tting.

Note: It will not compile until Microsoft.ML.CpuMath.csproj is changed to adapt to multi-targetting.
…cs to simplify CpuMathUtils.DotNetCoreApp.cs
- due to working in Mac OSX
@dnfclas
Copy link

dnfclas commented Jul 20, 2018

CLA assistant check
All CLA requirements met.

@eerhardt
Copy link
Member

Everything in this src\Native\CpuMath folder should be deleted. We don't need it in the official code base.


Refers to: src/Native/CpuMath/Intrinsics.csproj:1 in f0f81a5. [](commit_id = f0f81a5, deletion_comment = False)

@@ -0,0 +1,451 @@
using System;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All code files need a copyright header on them. See other .cs files for what the header should look like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. Some new files include the header, while some others don't. Will fix it right away.


namespace Microsoft.ML.CpuMath.UnitTests.DotNetCoreApp
{
public class UnitTests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should give this class a more descriptive name of what it is testing. Probably something similar to CpuMathUtilsTests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edited thanks

@briancylui
Copy link
Contributor Author

Issues at the end of today:

  1. Removing NETCoreAppMaximumVersion tag from green build results in errors in running build.cmd -Release. Follow-ups needed.
  2. Building and running perf tests in local machines fail because:
    The application to execute does not exist: 'C:\Users\t-brlui\repos\machinelearning\bin\AnyCPU.Release\Microsoft.ML.CpuMath.PerformanceTests\netcoreapp3.0\Microsoft.ML.CpuMath.PerformanceTests.dll'. Follow-ups needed.

<Project>
<PropertyGroup>
<MSBuildAllProjects>$(MSBuildAllProjects);$(MSBuildThisFileFullPath)</MSBuildAllProjects>
<VSTestTargets Condition="'$(UseIntrinsics)' != 'true' and '$(TargetFramework)' == 'netcoreapp3.0'">ignore.targets</VSTestTargets>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do? I'm not familiar. Also, since you are also overriding VSTest - isn't that enough?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep this property setting, we should factor out '$(UseIntrinsics)' != 'true' and '$(TargetFramework)' == 'netcoreapp3.0' into an MSBuild property so it is only declared once.


In reply to: 206735143 [](ancestors = 206735143)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out that the place where we try to blank VSTest (https://github.com/dotnet/machinelearning/pull/562/files/c6f5f8569ea31bc30e0b7ae54870b9cbf51e2bdb#diff-77ebf330684fd059d7aecec3ce7e6218R19) is processed before the program builds the actual VSTest, so without this line, even though we would have blanked VSTest, the entire VSTest would be built all over again. Because of this order of the build, one way to solve this issue is to ignore the built targets of VSTest to ignore the test projects. Not sure if it is the best way to do it so feel free to suggest any optimization.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I helped him with this yesterday and by doing a pp log I found out that Microsoft.TestPlatform.targets where imported after the actual Empty.targets file even though they are specified to be imported after the common targets, I don't know if that is the way it is designed or if it is a bug.

So I found out that there was a VSTestTargets overridable property declared in: 15.0\Microsoft.Common.targets\ImportAfter\Microsoft.TestPlatform.ImportAfter.targets:17 that points to Microsoft.TestPlatform.targets and this ImportAfter.targets file will import that project if the specified path exists. So in order to override the VSTest target I suggested to just set that property to a non-existent targets file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I'm reading this correctly, that means this line in Empty.targets:

<Target Name="VSTest"/>

Isn't doing anything and can be removed, right?

Also - it would probably make sense to move setting this property into Empty.targets. That would resolve my "we should factor this condition out" comment above as well. build/AfterCommonTargets.targets is responsible for deciding whether Empty.targets should be imported. Empty.targets decides how to ensure no targets are run (which includes setting this property).

Copy link
Member

@safern safern Aug 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I'm reading this correctly, that means this line in Empty.targets:

<Target Name="VSTest"/>

Isn't doing anything and can be removed, right?

It is actually needed because in run-tests.proj we're calling the VSTest target on every test project: https://github.com/dotnet/machinelearning/blob/master/test/run-tests.proj#L10

And if we don't declare it, it will complain saying that the VSTest target doesn't exist in both test projects targeting netcoreapp3.0.

Also - it would probably make sense to move setting this property into Empty.targets. That would resolve my "we should factor this condition out" comment above as well. build/AfterCommonTargets.targets is responsible for deciding whether Empty.targets should be imported. Empty.targets decides how to ensure no targets are run (which includes setting this property).

That is right, no sense having the same condition twice, just declare it when Empty.targets is imported, which is conditioned already.

I also thought of just simply excluding both projects from: https://github.com/dotnet/machinelearning/blob/master/test/run-tests.proj#L5

But it makes more sense to me to do everything inside Empty.targets since whenever the repo is ready to build and target netcoreapp3.0 we can just remove this files and we're good to go.

<TargetFramework>netcoreapp3.0</TargetFramework>
<IsPackable>false</IsPackable>
<!-- The SDK we currently restore and use has 2.0 as the maximum supported version -->
<NETCoreAppMaximumVersion>3.0</NETCoreAppMaximumVersion>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be able to be removed.

<TargetFramework>netcoreapp3.0</TargetFramework>
<AllowUnsafeBlocks>true</AllowUnsafeBlocks>
<!-- The SDK we currently restore and use has 2.0 as the maximum supported version -->
<NETCoreAppMaximumVersion>3.0</NETCoreAppMaximumVersion>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed.

@eerhardt
Copy link
Member

eerhardt commented Aug 1, 2018

Let's not merge this until after we branch for v0.4. We shouldn't add risk to v0.4 with this change...

vector = Sse3.HorizontalAdd(vector, vector);
vector = Sse3.HorizontalAdd(vector, vector);
}
else
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tannergooding: (Responding to your previous PR comment) Factoring the VectorSum function out of the intrinsics has worked quite well, but my old implementation of VectorSum (which outputs a Vector128<float>) seemed to cause some perf issues (doubled the running time of 4 intrinsics that call VectorSum). I changed it to ToVectorSum, which changes vector in place, but the same perf issue persisted. Then, I (thanks to Santi's help) tried the AggressiveInlining attribute above to make the function inline, which has improved the perf results back to normal. Curious about how it sounds to you, and if there is any better way to do it.

Side question: Should I also make the other private functions - Load1, Load4, Rotate, RotateReverse, and Store4 - inline with the AggressiveInlining attribute to improve perf?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll probably want to change private static void ToVectorSum(ref Vector128<float> vector) to private static Vector128<T> ToVectorSum(in Vector128<float> vector), as Vector128<T> was made into a readonly struct, but it otherwise looks good.

Should I also make the other private functions - Load1, Load4, Rotate, RotateReverse, and Store4 - inline with the AggressiveInlining attribute to improve perf

I would think they are already being inlined (except for Store4), if that isn't the case, then the methods are probably small enough that inlining would help. I would recommend benchmarking and looking at the codegen for both having them explicitly inlined and not.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not hurt to put AggressiveInlining attribute on these short and internal functions.

For your old code, even though VectorSum is inlined, the codegen is not optimized due to register spill and reload. It seems JIT has optimization opportunity over there. Do you mind opening an issue to discuss about it on CoreCLR github repo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the input! Adding AggressiveInlining attributes for those short functions to mimic the behavior of the existing native code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll pick up logging a bug for CoreCLR. We should get a minimal repro and ensure it isn't already tracked by one of the "first class struct" issues.

Copy link
Member

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. 2 last things (plus resolving the merge conflict on the .sln file), and I think this can be merged.

Nice work, @briancylui

@@ -86,6 +88,9 @@ if %__IntermediatesDir% == "" (
set "__CMakeBinDir=%__CMakeBinDir:\=/%"
set "__IntermediatesDir=%__IntermediatesDir:\=/%"

if [%CMAKE_BUILD_TYPE:~-11%] == [-Intrinsics] (
set CMAKE_BUILD_TYPE=%CMAKE_BUILD_TYPE:~0,-11%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you stick a comment here saying what is happening (it isn't obvious from the code).

"Strip the -Intrinsics from the build type"

<TargetFramework Condition="'$(UseIntrinsics)' != 'true'">netstandard2.0</TargetFramework>
<TargetFrameworks Condition="'$(UseIntrinsics)' == 'true'">netstandard2.0;netcoreapp3.0</TargetFrameworks>
<IncludeInPackage>Microsoft.ML.CpuMath</IncludeInPackage>
<AllowUnsafeBlocks>true</AllowUnsafeBlocks>
<DefineConstants>$(DefineConstants);CORECLR;PRIVATE_CONTRACTS</DefineConstants>
<LangVersion>7.3</LangVersion>
<!-- The SDK we currently restore and use has 2.0 as the maximum supported version -->
<NETCoreAppMaximumVersion>3.0</NETCoreAppMaximumVersion>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with removing the NETCoreAppMaximumVersion tag.

In a recent fix of the VectorSum function in SseIntrinsics.cs, since we need to rely on the read-only feature of Vector128 introduced in C# 7.2 to change the vector in-place, the Lang Version has to be at least 7.2. However, since we use the feature of extensible fixed statements in SseIntrinsics.cs, which is a feature in C# 7.3, the LangVersion restriction has been set to 7.3. Reference: dotnet/roslyn#24469

Please let me know how I should proceed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I was specifically talking about <NETCoreAppMaximumVersion>3.0</NETCoreAppMaximumVersion>. The LangVersion change is needed.

@briancylui
Copy link
Contributor Author

Responded to the second round of review, and now this PR is ready for a third round if any. Gathering thoughts on this PR before fixing the merge conflicts in the sln file. Please let me know if there is anywhere else I should improve my code. How does it sound to everyone?

@briancylui
Copy link
Contributor Author

Merge conflicts have been fixed. Green builds for all 5 checks.

Copy link
Member

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Nice work, Brian.

.Run(null, CreateClrVsCoreConfig());
}

private static IConfig CreateClrVsCoreConfig()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries. We can leave it like this for now.

private const int DEFAULT_SEED = 253421;
private const float DEFAULT_SCALE = 1.11f;

private float[] src, dst, original, src1, src2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This all non constant privates should be prefixed with _ but it can be updated in your next PR to not block this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the issue page (briancylui#2) to include this action item for future follow-up. Thank you for your input!

Copy link
Member

@safern safern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an expert in the intrinsic operations but looks good to me. Nice work Brian!

@briancylui
Copy link
Contributor Author

Thank you all reviewers for your time reviewing this PR and making great suggestions to fix issues or improve performance. I have learnt a lot from all of you.

Since I don't have write access to the dotnet/machinelearning repo, may anyone with write access please help merging this PR to the master? Much appreciated!

After the merging, I can then start writing more SSE intrinsics in a new branch.

@eerhardt eerhardt merged commit ac4fead into dotnet:master Aug 6, 2018
@briancylui briancylui changed the title Port C# hardware intrinsics APIs for SSE from SIMD native algorithms Port C# key hardware intrinsics APIs for SSE from SIMD native algorithms Aug 8, 2018
@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants