-
-
Notifications
You must be signed in to change notification settings - Fork 889
Add 4 Channel Shuffle Intrinsics #1404
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1404 +/- ##
==========================================
+ Coverage 82.89% 82.99% +0.09%
==========================================
Files 690 692 +2
Lines 31017 31189 +172
Branches 3561 3578 +17
==========================================
+ Hits 25713 25884 +171
- Misses 4581 4582 +1
Partials 723 723
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
|
Obligatory tag to @saucecontrol |
antonfirsov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the first part of a series of PRs that will complete #1354
Can you describe how the pipelines will look like after the refactor for all the following cases (What is the chain of calls starting from PixelOperations<Rgba32>.FromRgb24(...) and PixelOperations<Rgb24>.FromRgba32(...)):
- With HwIntrinsics
- Without HwIntrinsics, but with accelerated
Vector<T>(so we don't regress significantly on 2.1 and Framework) - Without accelerated SIMD
I want to make sure we are on the same page regarding high level plan before looking at details in code.
| // Deal with the remainder: | ||
| if (source.Length > 0) | ||
| { | ||
| ShuffleRemainder4Channel(source, dest, control); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The non-SIMD shuffling has optimized implementations in PixelConverter, although I'm unsure what runtimes support those optimizations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I saw that but there's a limited set of shuffles (Though we could maybe T4 template it?). With this API we could potentially expose the ability to use any shuffle combination. for example XXXX. We can still use the optimizations in the explicit pixel converters though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really depends on where do we want to get in the end, and how the overall pipelines would look like. (See my main comment.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to avoid regression on non-x86 and/or non 3.1+, I think we have 2 options now:
- Handle the cases right in the
PixelOperationsmethod:
public override void FromRgba32(Configuration configuration, ReadOnlySpan<Rgba32> sourcePixels, Span<Argb32> destinationPixels)
{
Guard.NotNull(configuration, nameof(configuration));
Guard.DestinationShouldNotBeTooShort(sourcePixels, destinationPixels, nameof(destinationPixels));
#if SUPPORTS_RUNTIME_INTRINSICS
if (Avx.IsSupported || Sse.IsSupported)
{
var source = MemoryMarshal.Cast<Rgba32, byte>(sourcePixels);
var dest = MemoryMarshal.Cast<Argb32, byte>(destinationPixels);
SimdUtils.Shuffle4Channel(source, dest, SimdUtils.Shuffle.WXYZ);
}
else
#else
{
ref uint sourceRef = ref Unsafe.As<Rgba32, uint>(ref MemoryMarshal.GetReference(sourcePixels));
ref uint destRef = ref Unsafe.As<Argb32, uint>(ref MemoryMarshal.GetReference(destinationPixels));
for (int i = 0; i < sourcePixels.Length; i++)
{
uint sp = Unsafe.Add(ref sourceRef, i);
Unsafe.Add(ref destRef, i) = PixelConverter.FromRgba32.ToArgb32(sp);
}
}
#endif
}- Add a
switchstatement toShuffleRemainder4Channelto handle well-known cases forcontrolwith the code inPixelConverter.
What is the best depends on the plans to handle the 4->3 and the 3->4 cases, so we have consistency in the resulting code. See #1404 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 is probably better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually there is one other option, you probably meant this:
3. T4 multiple variants of Shuffle4Channel, so we can avoid the switch-branch, but it quite a lot of work, unsure if the perf gains will justify it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There you go.... I haven't wired them up yet but I've optimized our existing shuffles so there's no slowdown on older NET Core.
About 4x faster than they were previously.
| Method | Job | EnvironmentVariables | Count | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Shuffle4Channel | AVX | Empty | 128 | 21.56 ns | 0.454 ns | 0.651 ns | 1.00 | 0.00 | - | - | - | - |
| Shuffle4Channel | No HwIntrinsics | COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 | 128 | 19.50 ns | 0.252 ns | 0.224 ns | 0.89 | 0.03 | - | - | - | - |
| Shuffle4Channel | SSE | COMPlus_EnableAVX=0 | 128 | 17.66 ns | 0.125 ns | 0.111 ns | 0.81 | 0.03 | - | - | - | - |
| Shuffle4Channel | AVX | Empty | 256 | 24.49 ns | 0.284 ns | 0.252 ns | 1.00 | 0.00 | - | - | - | - |
| Shuffle4Channel | No HwIntrinsics | COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 | 256 | 38.02 ns | 0.783 ns | 1.219 ns | 1.51 | 0.04 | - | - | - | - |
| Shuffle4Channel | SSE | COMPlus_EnableAVX=0 | 256 | 29.11 ns | 0.317 ns | 0.281 ns | 1.19 | 0.02 | - | - | - | - |
| Shuffle4Channel | AVX | Empty | 512 | 29.75 ns | 0.233 ns | 0.218 ns | 1.00 | 0.00 | - | - | - | - |
| Shuffle4Channel | No HwIntrinsics | COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 | 512 | 79.31 ns | 1.252 ns | 1.171 ns | 2.67 | 0.05 | - | - | - | - |
| Shuffle4Channel | SSE | COMPlus_EnableAVX=0 | 512 | 29.84 ns | 0.603 ns | 0.592 ns | 1.00 | 0.02 | - | - | - | - |
| Shuffle4Channel | AVX | Empty | 1024 | 38.62 ns | 0.244 ns | 0.228 ns | 1.00 | 0.00 | - | - | - | - |
| Shuffle4Channel | No HwIntrinsics | COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 | 1024 | 150.61 ns | 0.293 ns | 0.245 ns | 3.90 | 0.02 | - | - | - | - |
| Shuffle4Channel | SSE | COMPlus_EnableAVX=0 | 1024 | 46.57 ns | 0.606 ns | 0.567 ns | 1.21 | 0.02 | - | - | - | - |
| Shuffle4Channel | AVX | Empty | 2048 | 59.99 ns | 0.352 ns | 0.312 ns | 1.00 | 0.00 | - | - | - | - |
| Shuffle4Channel | No HwIntrinsics | COMPlus_EnableHWIntrinsic=0,COMPlus_FeatureSIMD=0 | 2048 | 283.23 ns | 0.869 ns | 0.770 ns | 4.72 | 0.02 | - | - | - | - |
| Shuffle4Channel | SSE | COMPlus_EnableAVX=0 | 2048 | 85.03 ns | 1.708 ns | 1.899 ns | 1.42 | 0.04 | - | - | - | - |
I would add the accelerated SIMD utilities directly into the For example in public override void FromRgba32(Configuration configuration, ReadOnlySpan<Rgba32> sourcePixels, Span<Argb32> destinationPixels)
{
Guard.NotNull(configuration, nameof(configuration));
Guard.DestinationShouldNotBeTooShort(sourcePixels, destinationPixels, nameof(destinationPixels));
#if SUPPORTS_RUNTIME_INTRINSICS
var source = MemoryMarshal.Cast<Rgba32, byte>(sourcePixels);
var dest = MemoryMarshal.Cast<Argb32, byte>(destinationPixels);
SimdUtils.Shuffle4Channel(source, dest, SimdUtils.Shuffle.WXYZ);
#else
ref uint sourceRef = ref Unsafe.As<Rgba32, uint>(ref MemoryMarshal.GetReference(sourcePixels));
ref uint destRef = ref Unsafe.As<Argb32, uint>(ref MemoryMarshal.GetReference(destinationPixels));
for (int i = 0; i < sourcePixels.Length; i++)
{
uint sp = Unsafe.Add(ref sourceRef, i);
Unsafe.Add(ref destRef, i) = PixelConverter.FromRgba32.ToArgb32(sp);
}
#endif
}The only reason I'm splitting the PRs up is because hitting the generators means lots of files touched which explodes the review. |
|
Ah ok, makes sense, this is quite simple for the 4 channel -> 4 channel case. Will be more more tricky for the 3->4 and 4->3 cases:
|
|
Looks good from here! You could probably squeeze out some more perf by manually unrolling the shuffle loops since the loop overhead is basically the same as the SIMD work in those, but clearly it's already a nice win. That |
@saucecontrol Thanks! Was simply looking for a way to save typing! 😁 I like useful helper methods like that, I'd love to write one for blending as I find using _MM_SHUFFLE for that unintuitive. Btw, loop unrolling was a great suggestion. We're at warp speed now! 🚀 4 Channel Float
4 Channel Byte
|
|
@antonfirsov I decided to wire up the methods in the converters. Very happy with how it came together now. No performance loss compared to current code and much faster with new intrinsics. |
|
Nitpick loosely related to this PR: // In HwIntrinsics_SSE_AVX :
if (Avx.IsSupported)
{
this.AddJob(Job.Default.WithRuntime(CoreRuntime.Core31)
.WithId("AVX").AsBaseline());
}I think baseline should be "No HwIntrinsics". Instead of "slowdown" compared to the fastest version, would be better to see the speedup for the non-accelerated ones. |
@antonfirsov You know it never occurred to me that the order displayed reflected the order registered! Very daft! I'll fix it when I do 3<=>4 channel shuffling |
antonfirsov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job!
I think there is still space for improvement in the remainder path.
I would even consider applying this pattern to shorten the code in the methods without T4 (now or in a follow-up PR):
interface IComponentShuffle
{
int Control { get; }
void RunFallbackShuffle(ReadOnlySpan<byte> source, Span<byte> dest);
}
struct DefaultShuffle : IComponentShuffle
{
public DefaultShuffle(int control) { Control = control; }
public int Control { get; }
[MethodImpl(InliningOptions.ShortMethod)]
public void RunFallbackShuffle(ReadOnlySpan<byte> source, Span<byte> dest)
{
ref byte sBase = ref MemoryMarshal.GetReference(source);
ref byte dBase = ref MemoryMarshal.GetReference(dest);
Shuffle.InverseMmShuffle(control, out int p3, out int p2, out int p1, out int p0);
for (int i = 0; i < source.Length; i += 4)
{
Unsafe.Add(ref dBase, i) = Unsafe.Add(ref sBase, p0 + i);
Unsafe.Add(ref dBase, i + 1) = Unsafe.Add(ref sBase, p1 + i);
Unsafe.Add(ref dBase, i + 2) = Unsafe.Add(ref sBase, p2 + i);
Unsafe.Add(ref dBase, i + 3) = Unsafe.Add(ref sBase, p3 + i);
}
}
}
struct YZWX : IComponentShuffle
{
public int Control => (0 << 6) | (3 << 4) | (2 << 2) | 1;
[MethodImpl(InliningOptions.ShortMethod)]
public void RunFallbackShuffle(ReadOnlySpan<byte> source, Span<byte> dest)
{
ReadOnlySpan<uint> s = MemoryMarshal.Cast<byte, uint>(source);
Span<uint> d = MemoryMarshal.Cast<byte, uint>(dest);
ref uint sBase = ref MemoryMarshal.GetReference(s);
ref uint dBase = ref MemoryMarshal.GetReference(d);
for (int i = 0; i < s.Length; i++)
{
uint packed = Unsafe.Add(ref sBase, i);
// packed = [W Z Y X]
// ROTR(8, packedArgb) = [Y Z W X]
Unsafe.Add(ref dBase, i) = (packed >> 8) | (packed << 24);
}
}
}
public static void Shuffle4Channel<TShuffle>(
ReadOnlySpan<byte> source,
Span<byte> dest,
TShuffle shuffle)
where TShuffle : struct, IComponentShuffle
{
// use shuffle.Control for the SIMD and the (inlined) RunFallbackShuffle method for the fallback path
}
antonfirsov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for pushing through the suggested refactor!
Add 4 Channel Shuffle Intrinsics
Prerequisites
Description
Adds new
SimdUtils.Shuffle4Channelmethods with overloads forfloatandbyte. This is the first part of a series of PRs that will complete #1354Methods are SIMD optimized on NET Core 3.1+
public static void Shuffle4Channel(ReadOnlySpan<float> source, Span<float> dest, byte control)Avx, Ssepublic static void Shuffle4Channel(ReadOnlySpan<byte> source, Span<byte> dest, byte control)Avx2, Ssse3.There don't appear to be any equivalent methods for
System.Numericsso I don't think we can add any optimization there.Additional utility methods have also been added to
SimdUtils.Shufflewhich allow the generation of shuffle controls.Benchmarks
4 Channel Float
4 Channel Byte