Skip to content

Conversation

@JimBobSquarePants
Copy link
Member

@JimBobSquarePants JimBobSquarePants commented Oct 26, 2020

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following matches the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Adds new SimdUtils.Shuffle4Channel methods with overloads for float and byte. This is the first part of a series of PRs that will complete #1354

Methods are SIMD optimized on NET Core 3.1+

public static void Shuffle4Channel(ReadOnlySpan<float> source, Span<float> dest, byte control) Avx, Sse
public static void Shuffle4Channel(ReadOnlySpan<byte> source, Span<byte> dest, byte control) Avx2, Ssse3.

There don't appear to be any equivalent methods for System.Numerics so I don't think we can add any optimization there.

Additional utility methods have also been added to SimdUtils.Shuffle which allow the generation of shuffle controls.

Benchmarks

4 Channel Float

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.2.20479.15
  [Host]          : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  AVX             : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  No HwIntrinsics : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  SSE             : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

Runtime=.NET Core 3.1
Method Job EnvironmentVariables Count Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
Shuffle4Channel AVX Empty 128 14.49 ns 0.244 ns 0.217 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 128 87.74 ns 0.524 ns 0.490 ns 6.06 0.09 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 128 23.65 ns 0.101 ns 0.094 ns 1.63 0.03 - - - -
Shuffle4Channel AVX Empty 256 25.87 ns 0.492 ns 0.673 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 256 159.52 ns 0.901 ns 0.843 ns 6.12 0.12 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 256 45.47 ns 0.404 ns 0.378 ns 1.75 0.03 - - - -
Shuffle4Channel AVX Empty 512 49.51 ns 0.088 ns 0.083 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 512 297.96 ns 0.926 ns 0.821 ns 6.02 0.02 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 512 90.77 ns 0.191 ns 0.169 ns 1.83 0.00 - - - -
Shuffle4Channel AVX Empty 1024 113.09 ns 1.913 ns 3.090 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 1024 604.58 ns 1.464 ns 1.298 ns 5.29 0.18 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 1024 179.44 ns 0.208 ns 0.184 ns 1.57 0.05 - - - -
Shuffle4Channel AVX Empty 2048 217.95 ns 1.314 ns 1.165 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 2048 1,152.04 ns 3.941 ns 3.494 ns 5.29 0.03 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 2048 349.52 ns 0.587 ns 0.520 ns 1.60 0.01 - - - -

4 Channel Byte

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.2.20479.15
  [Host]          : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  AVX             : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  No HwIntrinsics : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  SSE             : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

Runtime=.NET Core 3.1
Method Job EnvironmentVariables Count Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
Shuffle4Channel AVX Empty 128 20.51 ns 0.270 ns 0.211 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 128 63.00 ns 0.991 ns 0.927 ns 3.08 0.06 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 128 17.25 ns 0.066 ns 0.058 ns 0.84 0.01 - - - -
Shuffle4Channel AVX Empty 256 24.57 ns 0.248 ns 0.219 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 256 124.55 ns 2.501 ns 2.456 ns 5.06 0.10 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 256 21.80 ns 0.094 ns 0.088 ns 0.89 0.01 - - - -
Shuffle4Channel AVX Empty 512 28.51 ns 0.130 ns 0.115 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 512 256.52 ns 1.424 ns 1.332 ns 9.00 0.07 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 512 29.72 ns 0.217 ns 0.203 ns 1.04 0.01 - - - -
Shuffle4Channel AVX Empty 1024 36.40 ns 0.357 ns 0.334 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 1024 492.71 ns 1.498 ns 1.251 ns 13.52 0.12 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 1024 44.71 ns 0.264 ns 0.234 ns 1.23 0.02 - - - -
Shuffle4Channel AVX Empty 2048 59.38 ns 0.180 ns 0.159 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 2048 975.05 ns 2.043 ns 1.811 ns 16.42 0.05 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 2048 81.83 ns 0.212 ns 0.198 ns 1.38 0.01 - - - -

@codecov
Copy link

codecov bot commented Oct 26, 2020

Codecov Report

Merging #1404 into master will increase coverage by 0.09%.
The diff coverage is 99.59%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1404      +/-   ##
==========================================
+ Coverage   82.89%   82.99%   +0.09%     
==========================================
  Files         690      692       +2     
  Lines       31017    31189     +172     
  Branches     3561     3578      +17     
==========================================
+ Hits        25713    25884     +171     
- Misses       4581     4582       +1     
  Partials      723      723              
Flag Coverage Δ
#unittests 82.99% <99.59%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/ImageSharp/Common/Helpers/SimdUtils.Shuffle.cs 97.29% <97.29%> (ø)
src/ImageSharp/Common/Helpers/IComponentShuffle.cs 100.00% <100.00%> (ø)
...mageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs 96.52% <100.00%> (+3.07%) ⬆️
...ions/Generated/Argb32.PixelOperations.Generated.cs 100.00% <100.00%> (ø)
...ions/Generated/Bgra32.PixelOperations.Generated.cs 100.00% <100.00%> (ø)
...ions/Generated/Rgba32.PixelOperations.Generated.cs 100.00% <100.00%> (ø)
...rc/ImageSharp/PixelFormats/Utils/PixelConverter.cs 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 120080b...dabc237. Read the comment docs.

@JimBobSquarePants
Copy link
Member Author

Obligatory tag to @saucecontrol

Copy link
Member

@antonfirsov antonfirsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first part of a series of PRs that will complete #1354

Can you describe how the pipelines will look like after the refactor for all the following cases (What is the chain of calls starting from PixelOperations<Rgba32>.FromRgb24(...) and PixelOperations<Rgb24>.FromRgba32(...)):

  1. With HwIntrinsics
  2. Without HwIntrinsics, but with accelerated Vector<T> (so we don't regress significantly on 2.1 and Framework)
  3. Without accelerated SIMD

I want to make sure we are on the same page regarding high level plan before looking at details in code.

// Deal with the remainder:
if (source.Length > 0)
{
ShuffleRemainder4Channel(source, dest, control);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The non-SIMD shuffling has optimized implementations in PixelConverter, although I'm unsure what runtimes support those optimizations.

Copy link
Member Author

@JimBobSquarePants JimBobSquarePants Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I saw that but there's a limited set of shuffles (Though we could maybe T4 template it?). With this API we could potentially expose the ability to use any shuffle combination. for example XXXX. We can still use the optimizations in the explicit pixel converters though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really depends on where do we want to get in the end, and how the overall pipelines would look like. (See my main comment.)

Copy link
Member

@antonfirsov antonfirsov Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to avoid regression on non-x86 and/or non 3.1+, I think we have 2 options now:

  1. Handle the cases right in the PixelOperations method:
            public override void FromRgba32(Configuration configuration, ReadOnlySpan<Rgba32> sourcePixels, Span<Argb32> destinationPixels)
            {
                Guard.NotNull(configuration, nameof(configuration));
                Guard.DestinationShouldNotBeTooShort(sourcePixels, destinationPixels, nameof(destinationPixels));

#if SUPPORTS_RUNTIME_INTRINSICS
                if (Avx.IsSupported || Sse.IsSupported)
                {
                    var source = MemoryMarshal.Cast<Rgba32, byte>(sourcePixels);
                    var dest = MemoryMarshal.Cast<Argb32, byte>(destinationPixels);

                    SimdUtils.Shuffle4Channel(source, dest, SimdUtils.Shuffle.WXYZ);
                }
                else
                
#else
                {
                    ref uint sourceRef = ref Unsafe.As<Rgba32, uint>(ref MemoryMarshal.GetReference(sourcePixels));
                    ref uint destRef = ref Unsafe.As<Argb32, uint>(ref MemoryMarshal.GetReference(destinationPixels));

                    for (int i = 0; i < sourcePixels.Length; i++)
                    {
                        uint sp = Unsafe.Add(ref sourceRef, i);
                        Unsafe.Add(ref destRef, i) = PixelConverter.FromRgba32.ToArgb32(sp);
                    }
                }
#endif
            }
  1. Add a switch statement to ShuffleRemainder4Channel to handle well-known cases for control with the code in PixelConverter.

What is the best depends on the plans to handle the 4->3 and the 3->4 cases, so we have consistency in the resulting code. See #1404 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 is probably better

Copy link
Member

@antonfirsov antonfirsov Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually there is one other option, you probably meant this:
3. T4 multiple variants of Shuffle4Channel, so we can avoid the switch-branch, but it quite a lot of work, unsure if the perf gains will justify it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There you go.... I haven't wired them up yet but I've optimized our existing shuffles so there's no slowdown on older NET Core.

About 4x faster than they were previously.

Method Job EnvironmentVariables Count Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
Shuffle4Channel AVX Empty 128 21.56 ns 0.454 ns 0.651 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 128 19.50 ns 0.252 ns 0.224 ns 0.89 0.03 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 128 17.66 ns 0.125 ns 0.111 ns 0.81 0.03 - - - -
Shuffle4Channel AVX Empty 256 24.49 ns 0.284 ns 0.252 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 256 38.02 ns 0.783 ns 1.219 ns 1.51 0.04 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 256 29.11 ns 0.317 ns 0.281 ns 1.19 0.02 - - - -
Shuffle4Channel AVX Empty 512 29.75 ns 0.233 ns 0.218 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 512 79.31 ns 1.252 ns 1.171 ns 2.67 0.05 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 512 29.84 ns 0.603 ns 0.592 ns 1.00 0.02 - - - -
Shuffle4Channel AVX Empty 1024 38.62 ns 0.244 ns 0.228 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 1024 150.61 ns 0.293 ns 0.245 ns 3.90 0.02 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 1024 46.57 ns 0.606 ns 0.567 ns 1.21 0.02 - - - -
Shuffle4Channel AVX Empty 2048 59.99 ns 0.352 ns 0.312 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 2048 283.23 ns 0.869 ns 0.770 ns 4.72 0.02 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 2048 85.03 ns 1.708 ns 1.899 ns 1.42 0.04 - - - -

@JimBobSquarePants
Copy link
Member Author

JimBobSquarePants commented Oct 27, 2020

Can you describe how the pipelines will look like after the refactor for all the following cases (What is the chain of calls starting
from PixelOperations.FromRgb24(...) and PixelOperations.FromRgba32(...)):

  1. With HwIntrinsics
  2. Without HwIntrinsics, but with accelerated Vector (so we don't regress significantly on 2.1 and Framework)
  3. Without accelerated SIMD

I want to make sure we are on the same page regarding high level plan before looking at details in code.

I would add the accelerated SIMD utilities directly into the PixelOperations implementations.

For example in Argb32.PixelOperations.Generated

            public override void FromRgba32(Configuration configuration, ReadOnlySpan<Rgba32> sourcePixels, Span<Argb32> destinationPixels)
            {
                Guard.NotNull(configuration, nameof(configuration));
                Guard.DestinationShouldNotBeTooShort(sourcePixels, destinationPixels, nameof(destinationPixels));

#if SUPPORTS_RUNTIME_INTRINSICS
                var source = MemoryMarshal.Cast<Rgba32, byte>(sourcePixels);
                var dest = MemoryMarshal.Cast<Argb32, byte>(destinationPixels);

                SimdUtils.Shuffle4Channel(source, dest, SimdUtils.Shuffle.WXYZ);
#else
                ref uint sourceRef = ref Unsafe.As<Rgba32, uint>(ref MemoryMarshal.GetReference(sourcePixels));
                ref uint destRef = ref Unsafe.As<Argb32, uint>(ref MemoryMarshal.GetReference(destinationPixels));

                for (int i = 0; i < sourcePixels.Length; i++)
                {
                    uint sp = Unsafe.Add(ref sourceRef, i);
                    Unsafe.Add(ref destRef, i) = PixelConverter.FromRgba32.ToArgb32(sp);
                }
#endif
            }

The only reason I'm splitting the PRs up is because hitting the generators means lots of files touched which explodes the review.

@antonfirsov
Copy link
Member

Ah ok, makes sense, this is quite simple for the 4 channel -> 4 channel case.

Will be more more tricky for the 3->4 and 4->3 cases:

  • We can do Convert + Shuffle in one step when HwIntriniscs are available.
  • When not (my case 2.), we need Convert -> Shuffle (or Shuffle->Convert, depending on direction) using temporary buffers, where shuffle is best done with PixelConverter.

@saucecontrol
Copy link
Contributor

Looks good from here! You could probably squeeze out some more perf by manually unrolling the shuffle loops since the loop overhead is basically the same as the SIMD work in those, but clearly it's already a nice win. That MmShuffleSpan method is clever :)

@JimBobSquarePants
Copy link
Member Author

JimBobSquarePants commented Oct 28, 2020

That MmShuffleSpan method is clever :)

@saucecontrol Thanks! Was simply looking for a way to save typing! 😁

I like useful helper methods like that, I'd love to write one for blending as I find using _MM_SHUFFLE for that unintuitive.

Btw, loop unrolling was a great suggestion. We're at warp speed now! 🚀

4 Channel Float

Method Job EnvironmentVariables Count Mean Error StdDev Median Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
Shuffle4Channel AVX Empty 128 11.45 ns 0.257 ns 0.477 ns 11.70 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 128 66.32 ns 1.316 ns 1.167 ns 65.76 ns 6.02 0.19 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 128 16.78 ns 0.362 ns 0.520 ns 16.89 ns 1.49 0.06 - - - -
Shuffle4Channel AVX Empty 256 16.29 ns 0.183 ns 0.172 ns 16.24 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 256 131.97 ns 1.197 ns 1.119 ns 132.08 ns 8.11 0.12 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 256 27.77 ns 0.234 ns 0.304 ns 27.72 ns 1.71 0.03 - - - -
Shuffle4Channel AVX Empty 512 33.36 ns 0.143 ns 0.134 ns 33.34 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 512 250.81 ns 0.702 ns 0.622 ns 250.86 ns 7.52 0.04 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 512 58.19 ns 0.497 ns 0.441 ns 58.11 ns 1.74 0.01 - - - -
Shuffle4Channel AVX Empty 1024 74.98 ns 0.311 ns 0.290 ns 74.92 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 1024 493.54 ns 1.703 ns 1.593 ns 493.57 ns 6.58 0.03 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 1024 108.31 ns 0.350 ns 0.327 ns 108.25 ns 1.44 0.01 - - - -
Shuffle4Channel AVX Empty 2048 152.54 ns 1.937 ns 1.717 ns 152.78 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 2048 974.79 ns 1.463 ns 1.368 ns 974.75 ns 6.39 0.07 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 2048 214.87 ns 1.048 ns 0.980 ns 214.71 ns 1.41 0.02 - - - -

4 Channel Byte

Method Job EnvironmentVariables Count Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
Shuffle4Channel AVX Empty 128 27.62 ns 0.582 ns 0.693 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 128 19.95 ns 0.431 ns 0.632 ns 0.73 0.03 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 128 17.91 ns 0.369 ns 0.362 ns 0.65 0.02 - - - -
Shuffle4Channel AVX Empty 256 22.89 ns 0.262 ns 0.232 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 256 37.37 ns 0.605 ns 0.566 ns 1.63 0.03 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 256 27.88 ns 0.467 ns 0.437 ns 1.22 0.03 - - - -
Shuffle4Channel AVX Empty 512 28.29 ns 0.564 ns 0.579 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 512 84.90 ns 1.193 ns 1.057 ns 3.00 0.07 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 512 28.50 ns 0.572 ns 0.535 ns 1.01 0.03 - - - -
Shuffle4Channel AVX Empty 1024 35.75 ns 0.731 ns 0.718 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 1024 161.61 ns 2.470 ns 2.310 ns 4.52 0.09 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 1024 40.17 ns 0.830 ns 0.776 ns 1.12 0.03 - - - -
Shuffle4Channel AVX Empty 2048 54.62 ns 0.473 ns 0.443 ns 1.00 0.00 - - - -
Shuffle4Channel No HwIntrinsics COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 2048 304.62 ns 5.085 ns 4.757 ns 5.58 0.09 - - - -
Shuffle4Channel SSE COMPlus_EnableAVX=0 2048 65.00 ns 1.306 ns 1.222 ns 1.19 0.03 - - - -

@JimBobSquarePants
Copy link
Member Author

@antonfirsov I decided to wire up the methods in the converters. Very happy with how it came together now. No performance loss compared to current code and much faster with new intrinsics.

@antonfirsov
Copy link
Member

Nitpick loosely related to this PR:

// In HwIntrinsics_SSE_AVX :
                if (Avx.IsSupported)
                {
                    this.AddJob(Job.Default.WithRuntime(CoreRuntime.Core31)
                        .WithId("AVX").AsBaseline());
                }

I think baseline should be "No HwIntrinsics". Instead of "slowdown" compared to the fastest version, would be better to see the speedup for the non-accelerated ones.

@JimBobSquarePants
Copy link
Member Author

JimBobSquarePants commented Oct 28, 2020

I think baseline should be "No HwIntrinsics". Instead of "slowdown" compared to the fastest version, would be better to see the speedup for the non-accelerated ones.

@antonfirsov You know it never occurred to me that the order displayed reflected the order registered! Very daft!

I'll fix it when I do 3<=>4 channel shuffling

Copy link
Member

@antonfirsov antonfirsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job!

I think there is still space for improvement in the remainder path.

I would even consider applying this pattern to shorten the code in the methods without T4 (now or in a follow-up PR):

interface IComponentShuffle
{
    int Control { get; }
    void RunFallbackShuffle(ReadOnlySpan<byte> source, Span<byte> dest);
}

struct DefaultShuffle : IComponentShuffle
{
    public DefaultShuffle(int control) { Control = control; }
    public int Control { get; }

    [MethodImpl(InliningOptions.ShortMethod)]
    public void RunFallbackShuffle(ReadOnlySpan<byte> source, Span<byte> dest)
    {
        ref byte sBase = ref MemoryMarshal.GetReference(source);
        ref byte dBase = ref MemoryMarshal.GetReference(dest);
        Shuffle.InverseMmShuffle(control, out int p3, out int p2, out int p1, out int p0);

        for (int i = 0; i < source.Length; i += 4)
        {
            Unsafe.Add(ref dBase, i) = Unsafe.Add(ref sBase, p0 + i);
            Unsafe.Add(ref dBase, i + 1) = Unsafe.Add(ref sBase, p1 + i);
            Unsafe.Add(ref dBase, i + 2) = Unsafe.Add(ref sBase, p2 + i);
            Unsafe.Add(ref dBase, i + 3) = Unsafe.Add(ref sBase, p3 + i);
        }
    }
}

struct YZWX : IComponentShuffle
{
    public int Control => (0 << 6) | (3 << 4) | (2 << 2) | 1;

    [MethodImpl(InliningOptions.ShortMethod)]
    public void RunFallbackShuffle(ReadOnlySpan<byte> source, Span<byte> dest)
    {
        ReadOnlySpan<uint> s = MemoryMarshal.Cast<byte, uint>(source);
        Span<uint> d = MemoryMarshal.Cast<byte, uint>(dest);
        ref uint sBase = ref MemoryMarshal.GetReference(s);
        ref uint dBase = ref MemoryMarshal.GetReference(d);

        for (int i = 0; i < s.Length; i++)
        {
            uint packed = Unsafe.Add(ref sBase, i);

            // packed              = [W Z Y X]
            // ROTR(8, packedArgb) = [Y Z W X]
            Unsafe.Add(ref dBase, i) = (packed >> 8) | (packed << 24);
        }
    }
}


public static void Shuffle4Channel<TShuffle>(
    ReadOnlySpan<byte> source,
    Span<byte> dest,
    TShuffle shuffle)
    where TShuffle : struct, IComponentShuffle
{
    // use shuffle.Control for the SIMD and the (inlined) RunFallbackShuffle method for the fallback path
}

Copy link
Member

@antonfirsov antonfirsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for pushing through the suggested refactor!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants