This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Add SoA raytracer as a CQ test for Intel hardware intrinsic #18839

Merged: 1 commit, Aug 10, 2018

fiigii commented Jul 9, 2018

This PR ports the current SIMD benchmark RayTracer (using Vector3) to a SoA algorithm using AVX/AVX2 intrinsics. The new benchmark keeps the shading approach of the original raytracer as much as possible, so the two generate the same images and can be compared directly.

Performance data (rendering a 2k image)

Execution time          Windows   Linux
Baseline (RayTracer)    6.00s     4.13s
PacketTracer            0.83s     0.93s
Performance gains       7.23x     4.44x

Updated the performance data with #19663

The data were collected on:

  • Intel Core i9 7900X (Skylake-X) @ 3.3GHz, HT on, Turbo on, 16GB DDR4 2666MHz
  • Windows 10 and Ubuntu 16.04

VTune characterization (module level)

Windows

[image: module-level VTune profile, Windows]

Linux

[image: module-level VTune profile, Linux]

According to the execution time and the module-level VTune data, we can see that

The most obvious module-level difference between the baseline and SoA is that

VTune characterization (managed code)

Windows

[image: managed-code VTune profile, Windows]

Linux

[image: managed-code VTune profile, Linux]

The codegen issues of RyuJIT have been logged at

VTune characterization (CoreCLR runtime)

Windows

[image: CoreCLR runtime VTune profile, Windows]

Linux

[image: CoreCLR runtime VTune profile, Linux]

Close https://github.com/dotnet/coreclr/issues/17798

Author

fiigii commented Jul 9, 2018

@tannergooding
Member

Quoting @fiigii: "wrote a brief VectorMath library that provides vectorized Exp(), Log(), and Pow() functions needed by the SoA raytracer, but the result precision looks not good (@tannergooding could you help?)"

@fiigii, You can find open-source/vectorized versions of these functions here (MIT Licensed, Microsoft Owned): https://github.com/Microsoft/DirectXMath/blob/master/Inc/DirectXMathVector.inl

They should be good starting points for speed/precision.

Author

fiigii commented Jul 9, 2018

@tannergooding Thank you so much! Will try to port it to C#.

Member

eerhardt commented Jul 9, 2018

Quoting @fiigii: "the build configuration PacketTracer.csproj is incorrect (The type or namespace name 'Vector256<>' could not be found), @eerhardt could you help?"

You need to add a PackageReference to System.Runtime.Intrinsics.Experimental. See

<PackageReference Include="System.Runtime.Intrinsics.Experimental">
  <Version>$(MicrosoftPrivateCoreFxNETCoreAppPackageVersion)</Version>
</PackageReference>
as an example.

Author

fiigii commented Jul 11, 2018

The precision issue of vectorized Pow() is fixed, and this program no longer crashes now that #18849 has been merged (thanks, @tannergooding).

@@ -103,6 +103,9 @@
<PackageReference Include="xunit.runner.utility">
<Version>$(XunitPackageVersion)</Version>
</PackageReference>
<PackageReference Include="System.Runtime.Intrinsics.Experimental">
<Version>$(MicrosoftPrivateCoreFxNETCoreAppPackageVersion)</Version>
Author

@eerhardt I added this reference, but the project still does not build. Do I need to add a new csproj under tests\src\JIT\config, or do something like #16378? cc @AndyAyersMS

@AndyAyersMS
Member

Benchmarks are still targeting netstandard1.4, and the new package is not compatible:

D:\repos\coreclr\tests\src\JIT\config\benchmark\benchmark.csproj : 
error NU1202: Package System.Runtime.Intrinsics.Experimental 4.6.0-preview1-26704-01 is not compatible with netstandard1.4 (.NETStandard,Version=v1.4). 
Package System.Runtime.Intrinsics.Experimental 4.6.0-preview1-26704-01 supports: netcoreapp2.1 (.NETCoreApp,Version=v2.1) 
[D:\repos\coreclr\tests\build.proj]

Seems like it ought to be simple to update the main benchmark dependency to netcoreapp2.1 or later, but this has proved difficult to fix. See #16126 for some attempts.

You might be able to clone the benchmark config, target netcoreapp2.1 in the cloned version, and then refer to the clone for your new test.

<Platform Condition=" '$(Platform)' == '' ">AnyCPU</Platform>
<CLRTestKind>BuildOnly</CLRTestKind>
<NugetTargetMoniker>.NETCoreApp,Version=v3.0</NugetTargetMoniker>
<NugetTargetMonikerShort>netcoreapp3.0</NugetTargetMonikerShort>
Author

@AndyAyersMS Thank you so much for the help. I created a "benchmark+intrinsic.csproj" that targets netcoreapp3.0, so it can be built now.

fiigii changed the title from "[WIP] Add SoA raytracer as a CQ test for Intel hardware intrinsic" to "Add SoA raytracer as a CQ test for Intel hardware intrinsic" on Jul 13, 2018
Author

fiigii commented Jul 13, 2018

All the issues have been addressed; I think this PR is ready for review. @CarolEidt @tannergooding @AndyAyersMS @mikedn @eerhardt PTAL

Now the SoA raytracer outputs exactly the same picture as the original AoS raytracer, and I will provide detailed profiling data later.
[image: pt]

Author

fiigii commented Jul 13, 2018

@dotnet-bot test Windows_NT x64 Checked jitincompletehwintrinsic please
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx please
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx2 please
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnosimd please
@dotnet-bot test Windows_NT x64 Checked jitnox86hwintrinsic please

@dotnet-bot test Windows_NT x86 Checked jitincompletehwintrinsic please
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx please
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx2 please
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnosimd please
@dotnet-bot test Windows_NT x86 Checked jitnox86hwintrinsic please

@dotnet-bot test Ubuntu x64 Checked jitincompletehwintrinsic please
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx please
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx2 please
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnosimd please
@dotnet-bot test Ubuntu x64 Checked jitnox86hwintrinsic please

@tannergooding
Member

Quoting the PR description: "This PR ports the current SIMD benchmark RayTracer (using Vector3) to a SoA algorithm using AVX/AVX2 intrinsics."

It would be nice to have an AoS version (for direct comparison). The same could be said for a SoA version of the System.Numerics.Vector implementation.

Having an Sse version would also be nice, for benchmarking hardware without AVX/AVX2 support.

None of this is required now, of course, but these could be future "up for grabs" work items.

// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.
//

Member

General nit: It would be useful to format all these documents according to the recommended conventions....

Things like one statement per line, braces on their own line, no stray newlines, no trailing whitespace, a newline after the top level using statements and before the first namespace, etc....

Member

I believe VS should, for the most part, do this if you run the "Format Document" command.

Author

Thanks! Will do.

Member

@fiigii, did this happen? It looks like we still have things like:

  • Stray or missing newlines
  • Unsorted usings
  • Methods entirely on a single line
  • etc

public Vector256<float> Distances;
public Vector256<int> ThingIndeces;

public static readonly Vector256<float> NullDistance = SetAllVector256<float>(float.MaxValue);
Member

Why float.MaxValue, instead of something like NaN or Infinity?

Author

Because we need to compare the NullDistance against the result of Intersect in method MinIntersect.
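
The reply above does not spell out the reasoning. One relevant fact is that NaN never compares equal to anything, itself included, under an ordered equality test such as the EqualOrderedNonSignaling mode used in this file, so a NaN sentinel could not be detected that way. The thread does not say why Infinity was not chosen. A minimal scalar sketch of the NaN point follows; `SentinelDemo` and `IsMiss` are hypothetical names and do not appear in the PR.

```csharp
using System;

// Illustrative only: a scalar stand-in for the per-lane packet comparison.
// SentinelDemo and IsMiss are hypothetical names, not code from the PR.
public static class SentinelDemo
{
    // Returns true when `distance` still holds the "no hit" sentinel,
    // using ordinary (ordered) floating-point equality.
    public static bool IsMiss(float distance, float sentinel)
    {
        return distance == sentinel;
    }
}
```

With this helper, `SentinelDemo.IsMiss(float.MaxValue, float.MaxValue)` is true, while `SentinelDemo.IsMiss(float.NaN, float.NaN)` is false.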

internal struct Intersections
{
public Vector256<float> Distances;
public Vector256<int> ThingIndeces;
Member

nit: Indices

Author

Thanks.

public static Int32RGBPacket256 ConvertToIntRGB(this VectorPacket256 colors)
{
var one = SetAllVector256<float>(1.0f);
var max = SetAllVector256<float>(255.0f);
Member

Is a static readonly more efficient, since we don't have JIT support for this yet?

Member

and, even if we did have JIT support, since the constants are per method, a static readonly might still be better...

Author

Thanks, will cache them in static fields.
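
As a rough sketch of the suggestion above, the two constants could be hoisted into static readonly fields so they are built once rather than on every call. This uses `Vector256.Create` from the released System.Runtime.Intrinsics API instead of the preview-era `SetAllVector256` shown in the diff, and `PacketConstants` is a hypothetical name; whether this is actually faster depends on the JIT version and is not measured here.

```csharp
using System.Runtime.Intrinsics;

// Hypothetical holder for per-packet constants; not the PR's actual code.
public static class PacketConstants
{
    // Each field broadcasts one scalar to all eight float lanes, once, at type initialization.
    public static readonly Vector256<float> One = Vector256.Create(1.0f);
    public static readonly Vector256<float> Max = Vector256.Create(255.0f);
}
```

A method such as ConvertToIntRGB would then read `PacketConstants.One` and `PacketConstants.Max` instead of rebuilding the vectors locally.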

public static Camera Create(VectorPacket256 pos, VectorPacket256 lookAt)
{
VectorPacket256 forward = (lookAt - pos).Normalize();
VectorPacket256 down = new VectorPacket256(SetAllVector256<float>(0), SetAllVector256<float>(-1), SetAllVector256<float>(0));
Member

SetAllVector256<float>(0) should be SetZeroVector256?

VectorPacket256 forward = (lookAt - pos).Normalize();
VectorPacket256 down = new VectorPacket256(SetAllVector256<float>(0), SetAllVector256<float>(-1), SetAllVector256<float>(0));
VectorPacket256 right = SetAllVector256<float>(1.5f) * VectorPacket256.CrossProduct(forward, down).Normalize();
VectorPacket256 up = SetAllVector256<float>(1.5f) * VectorPacket256.CrossProduct(forward, right).Normalize();
Member

Might be more efficient to create a single instance of SetAllVector256(1.5f) and reuse it?

Author

Will do.

{
var cmp = Compare(dis, NullDistance, FloatComparisonMode.EqualOrderedNonSignaling);
var zero = SetZeroVector256<int>();
var mask = Avx2.CompareEqual(zero, zero);
Member

Why are we getting a mask of ones this way?

Author

I think this is the most efficient way; some C++ compilers also generate this sequence.

Member

A comment here would be useful then.
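
For readers unfamiliar with the idiom, here is a minimal sketch of what such a commented helper might look like. It assumes the released System.Runtime.Intrinsics API (.NET 5 or later), whose names differ from the preview names in this PR: `Vector256<int>.Zero` stands in for `SetZeroVector256<int>()`. `AllOnesMask` is a hypothetical name, and the non-AVX2 fallback is added only so the sketch runs on any hardware.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper; not the PR's actual code.
public static class AllOnesMask
{
    public static Vector256<int> Create()
    {
        if (Avx2.IsSupported)
        {
            // Every lane of a register equals itself, so comparing a vector
            // with itself for equality sets all bits in every lane.
            Vector256<int> zero = Vector256<int>.Zero;
            return Avx2.CompareEqual(zero, zero);
        }

        // Portable fallback when AVX2 is unavailable.
        return Vector256<int>.AllBitsSet;
    }
}
```

Either path yields a vector in which every 32-bit lane is -1 (all bits set).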

public ObjectPool(Func<T> generator, IProducerConsumerCollection<T> collection)
: base(collection)
{
if (generator == null) throw new ArgumentNullException("generator");
Member

generator is null?

Member

nameof(generator)?

public T GetObject()
{
T value;
return base.TryTake(out value) ? value : _generator();
Member

out T value?

{
var items = new List<T>();
T value;
while (base.TryTake(out value)) items.Add(value);
Member

out T value?


protected override bool TryAdd(T item)
{
PutObject(item);
Member

Hmmm.... PutObject itself calls base.TryAdd, which returns a bool....

Seems we should have a TryPutObject and shouldn't just always return true
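
One possible shape for that suggestion is sketched below, under the assumption that the pool simply wraps an `IProducerConsumerCollection<T>`. `SimplePool<T>` is a hypothetical, simplified class and not the ObjectPool<T> in this PR; it also folds in the `nameof` and inline `out T value` nits raised above.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical, simplified pool; not the ObjectPool<T> from the PR.
public sealed class SimplePool<T>
{
    private readonly IProducerConsumerCollection<T> _items;
    private readonly Func<T> _generator;

    public SimplePool(Func<T> generator, IProducerConsumerCollection<T> items)
    {
        _generator = generator ?? throw new ArgumentNullException(nameof(generator));
        _items = items ?? throw new ArgumentNullException(nameof(items));
    }

    // Reports whether the underlying collection accepted the item,
    // instead of discarding the result of TryAdd.
    public bool TryPutObject(T item) => _items.TryAdd(item);

    // Takes a pooled item if one is available; otherwise creates a new one.
    public T GetObject() => _items.TryTake(out T value) ? value : _generator();
}
```

For example, with `new SimplePool<int[]>(() => new int[8], new ConcurrentBag<int[]>())`, a buffer handed to `TryPutObject` is returned by the next `GetObject` call.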


protected override bool TryTake(out T item)
{
item = GetObject();
Member

Same here...

Author

This file is copied directly from the AoS raytracer, so it does not need to change.

Member

If it is an existing file, I agree it doesn't need to be changed in this PR.

It would still be nice to get it cleaned up separately, however.


public Packet256Tracer(int _width, int _hight)
{
if (_width % VectorPacket256.Packet256Size != 0)
Member

Parentheses around mathematical expressions are helpful 😄

Member

(not everyone may agree, but that is my preference)

Author

Will do.

internal class Packet256Tracer
{
public int Width { get; private set; }
public int Hight { get; private set; }
Member

nit: Height?

Author

Thanks!

// See the LICENSE file in the project root for more information.
//

using System.Runtime.Intrinsics;
Member

These usings are all unused.


internal abstract class ObjectPacket256
{
public Surface Surface { get; private set; }
Member

Don't need private set

{
if ((_width % VectorPacket256.Packet256Size) != 0)
{
_width += VectorPacket256.Packet256Size - _width % VectorPacket256.Packet256Size;
Member

tannergooding commented Jul 27, 2018

nit: My preference is to add parentheses where they help improve readability or where they help clarify operator precedence.
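
To illustrate, the width adjustment in the snippet above rounds the width up to the next multiple of the packet size, and the parenthesized form makes the order of operations explicit. `PacketWidth.RoundUp` is a hypothetical helper, not code from the PR, and it assumes a non-negative width and a positive packet size.

```csharp
// Hypothetical helper; not the PR's actual code.
public static class PacketWidth
{
    // Rounds `width` up to the next multiple of `packetSize`.
    // Assumes width >= 0 and packetSize > 0.
    public static int RoundUp(int width, int packetSize)
    {
        int remainder = width % packetSize;
        return (remainder == 0) ? width : (width + (packetSize - remainder));
    }
}
```

For example, `PacketWidth.RoundUp(10, 8)` gives 16 and `PacketWidth.RoundUp(16, 8)` gives 16. A packet size of 8 is used here because a Vector256<float> holds eight floats; the actual value of Packet256Size is not shown in this thread.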


private static readonly Vector256<float> SevenToZero = SetVector256(7f, 6f, 5f, 4f, 3f, 2f, 1f, 0f);

public Packet256Tracer(int _width, int _height)
Member

Parameter names should not start with _

Vector256<float> Xs = Add(SetAllVector256(fx), SevenToZero);
var dirs = GetPoints(Xs, SetAllVector256<float>(y), camera);
var rayPacket256 = new RayPacket256(camera.Pos, dirs);
var SoAcolors = TraceRay(rayPacket256, scene, 0);
Member

nit: Make 0 a named parameter to help improve readability

Author

0 seems already very clear as the first value of depth 😄

Member

I was indicating that, if you were to just read TraceRay(rayPacket256, scene, 0);, it isn't obvious that 0 is for depth (you have to go look at the TraceRay signature).

While TraceRay(rayPacket256, scene, depth: 0); makes this immediately obvious

Author

Oh, I see, misunderstood your words...


// `FastTranspose` returns an "incomplete" AoS structure,
// which can be written into memory 16-byte by 16-byte.
// Now, .NET Core does not guarantee the 32-byte alignment,
Member

I didn't think .NET Core guaranteed 16-byte alignment either (just that the first local was 16-byte aligned, or something similar)...

Avx2.ExtractVector128(output + 20, intAoS.Bs, 1);
}

/*
Member

This looks like you commented out some unused code?

float heightRate1 = Height / 2.0f;
float heightRate2 = Height * 2.0f;

var recenteredX = Divide(Subtract(x, SetAllVector256(widthRate1)), SetAllVector256(widthRate2));
Member

Is this more or less performant than a SetAllVector256(Width) and SetAllVector256(2) along with a Multiply and Divide call?

Author

fiigii commented Jul 30, 2018

@tannergooding Thank you so much for the review, I have addressed your feedback.

Author

fiigii commented Jul 30, 2018

Logged the CRT issue at https://github.com/dotnet/coreclr/issues/19203

Author

fiigii commented Jul 31, 2018

Removed the package dependency on S.R.Intrinsic.Experimental to match the recent CoreFX update.

Member

tannergooding left a comment

Overall LGTM.

I would still really like to see some of the document formatting fixed/improved (even if it is an up-for-grabs bug). There are a number of stray or missing newlines, multiple statements on a single line, unused or unsorted usings, code simplifications (sometimes from new language features, sometimes where the code was unnecessarily verbose), etc.

@4creators

@dotnet-bot test OSX10.12 x64 Checked CoreFX Tests
@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test

@tannergooding
Member

@dotnet-bot test OSX10.12 x64 Checked CoreFX Tests

@tannergooding
Member

@CarolEidt, @AndyAyersMS, @eerhardt, @fiigii. Unless someone has feedback saying otherwise, I plan on merging this after the tests show as all green.

@CarolEidt

I am fine with merging this. I had hoped to take time to review, but just haven't found the time, and would indeed like to see this added.

I second Tanner's general comments about improving the documentation and formatting, and am fine with making that a future work item. One specific comment I had was that I hadn't (yet) found any attribution about where this came from. The existing RayTracer was derived from a sample program that was available on MSDN, and if this was also derived from that it would be good to note it somewhere.

Author

fiigii commented Aug 10, 2018

@CarolEidt Thanks for the comment. I will continue to improve the code/doc of this benchmark in future PRs. In particular, I have made a prototype of struct promotion to address the GC overhead from VectorPacket, so this benchmark may be changed to be struct-based.

Quoting @CarolEidt: "and if this was also derived from that it would be good to note it somewhere."

This benchmark is not derived from the current RayTracer. I am just using its input data (models) and some top-level framework code with a totally new algorithm, so that these two benchmarks can be directly compared. I will add some comments to note the reused code in the next PR. Thanks!

@tannergooding
Member

Merging. The CoreFX tests aren't impacted by this change and are failing with unrelated issues.

8 participants