Low-hanging fruit codegen optimizations#2401
Low-hanging fruit codegen optimizations#2401JimBobSquarePants merged 14 commits intoSixLabors:mainfrom
Conversation
src/ImageSharp/Formats/Jpeg/Components/Decoder/ArithmeticScanDecoder.cs
Outdated
Show resolved
Hide resolved
JimBobSquarePants
left a comment
There was a problem hiding this comment.
Thanks for this... I've gone a fair way through but want to comment now.
I think we might need to take a step back and try to find a good balance between creating the best codegen and readability.
I see two issues.
- How we're handling loops is inconsistent, sometimes we cast in diffferent ways in the loop declaration
nint i = ... (nint)(uint)vsnint i = ... (uint)and sometimes we we cast inside the loop. - We're doing casting during field assignment outside of hotpaths. I don't want to introduce changes that will only have very marginal effect sacrificing readabilty in the process.
| colorWhite.FromRgba32(Color.White); | ||
| ref byte dataRef = ref MemoryMarshal.GetReference(data); | ||
| for (nint y = top; y < top + height; y++) | ||
| for (nint y = top; y < (nint)(uint)(top + height); y++) |
There was a problem hiding this comment.
In some places we haven't dont the second explicit cast from uint
e.g. the loops are written as
for (nint y = top; y < (uint)(top + height); y++)I think we should strive to be super consistent here.
There was a problem hiding this comment.
Having another thought on this, I think using nuint is the best option.
- the cast to
uintis enough, so no need for(nint)(uint) - the compiler errors for
for (nuint i = 0; i < span.Length; ++i)whilst there's no error ifnint i = 0is used -- so it's impossible to miss one optimization - for 32-bit optimized code is emitted too
Hence I'll update the usages to nuint.
| public static void AddGreenToBlueAndRed(Span<uint> pixelData) | ||
| { | ||
| if (Avx2.IsSupported) | ||
| if (Avx2.IsSupported && pixelData.Length >= 8) |
There was a problem hiding this comment.
Did the length check early and made a do-while-loop.
This prevents the creation* of the vector addGreenToBlueAndRedMaskAvx2 when not needed.
The same on other loops in this file.
* actually it's a memory load from the data-segment in recent .NET versions
There was a problem hiding this comment.
I've seen similar in the recent CRC PR to the runtime.
| nint u = n - m; | ||
|
|
||
| for (int i = 0; i < u; i += 4) | ||
| for (nint i = 0; i < u; i += 4) |
There was a problem hiding this comment.
Because of the n - m the loop with nint is kept. With nuint one has to take care of underflow due the subtraction.
Alternatives with using nuint would be casting to a signed type or having a if + do-while looping construct. I think the nint as used here is simpler.
There was a problem hiding this comment.
Yep, that makes sense. Thanks for the additional explanation!
| out uint p3, | ||
| out uint p2, | ||
| out uint p1, | ||
| out uint p0) |
There was a problem hiding this comment.
The uint here saves lots of casts elsewhere.
| source.Save( | ||
| path, | ||
| encoder ?? source.GetConfiguration().ImageFormatsManager.FindEncoder(<#= fmt #>Format.Instance)); | ||
| encoder ?? source.GetConfiguration().ImageFormatsManager.GetEncoder(<#= fmt #>Format.Instance)); |
There was a problem hiding this comment.
Wow! I thought I'd redone these. Thanks!!
|
@gfoidl I'm still reviewing this. There's a LOT to read so it's gonna take me a few nights. Thanks for your patience. |
|
No rush (and ultiimately it's up to you when to review 😉). The pain point with such a PR is lots of touched files. But on the pro side the code-base gets in a consistent and good shape (regarding the micro-optimizations). |
JimBobSquarePants
left a comment
There was a problem hiding this comment.
This is looking really good! Just a few comments and questions.
| ref byte dBase = ref MemoryMarshal.GetReference(dest); | ||
|
|
||
| Shuffle.InverseMMShuffle(this.Control, out int p3, out int p2, out int p1, out int p0); | ||
| Shuffle.InverseMMShuffle(this.Control, out uint p3, out uint p2, out uint p1, out uint p0); |
There was a problem hiding this comment.
This is one of those. "Why didn't I think of this!" moments. 😁
There was a problem hiding this comment.
Well, then it would be boring for me 😉.
src/ImageSharp/Common/Helpers/SimdUtils.FallbackIntrinsics128.cs
Outdated
Show resolved
Hide resolved
| nint u = n - m; | ||
|
|
||
| for (int i = 0; i < u; i += 4) | ||
| for (nint i = 0; i < u; i += 4) |
There was a problem hiding this comment.
Yep, that makes sense. Thanks for the additional explanation!
| out uint p3, | ||
| out uint p2, | ||
| out uint p1, | ||
| out uint p0) |
| // We can assign the Bgr24 value directly to last three bytes of this instance. | ||
| ref byte thisRef = ref Unsafe.As<Abgr32, byte>(ref this); | ||
| ref byte thisRefFromB = ref Unsafe.AddByteOffset(ref thisRef, new IntPtr(1)); | ||
| ref byte thisRefFromB = ref Unsafe.AddByteOffset(ref thisRef, 1); |
There was a problem hiding this comment.
Wouldn't this be better as (nuint)1u?
There was a problem hiding this comment.
AddByteOffset has overloads for nuint and IntPtr, thus C#-compiler picks the nuint overload and treats the literal 1 as nuint-constant. No need for the cast here.
src/ImageSharp/Processing/Processors/Binarization/BinaryThresholdProcessor{TPixel}.cs
Show resolved
Hide resolved
src/ImageSharp/Processing/Processors/Convolution/ConvolutionProcessor{TPixel}.cs
Outdated
Show resolved
Hide resolved
src/ImageSharp/Processing/Processors/Convolution/EdgeDetectorCompassProcessor{TPixel}.cs
Outdated
Show resolved
Hide resolved
| for (int i = 0; i < this.kernelSize; i++) | ||
| { | ||
| int currentYIndex = Unsafe.Add(ref sampleRowBase, i); | ||
| int currentYIndex = Unsafe.Add(ref sampleRowBase, (uint)i); |
There was a problem hiding this comment.
Cast in the outer loop for consistency?
There was a problem hiding this comment.
Just below i is needed for slicing as int.
It would be more consistent, but also two casts instead of one in the loop body.
I have no strong opinion what's better here. Consistency or less casts?
gfoidl
left a comment
There was a problem hiding this comment.
Feedback is addressed / commented.
Once you're happy with the changes I need to push another commit to fix cases like
which I de-optimized in a previous commit of this PR (got the wrong casts here in order to emit fastest code).
Super happy with everything so far! Fire ahead 😄 |
Here we go: 2bbf1cb (only changed that division by vector length (which I unfortunately optimized, then later de-optimized, and with that commit optimized again -- now it's enough of that ping-poing 😉)) |
JimBobSquarePants
left a comment
There was a problem hiding this comment.
Amazing stuff. Thank you!
Prerequisites
Description
Scanned through the code-base (Regex based search) and did some low-hanging fruit optimizations.
Unsafe.Add-usage tweaked the code / loops to don't emit sign extending movs (movsxd->movormovzx)Sorry for touching so many files...
Cf. #2397 (comment)