Let's finally beat System.Drawing on the JPEG Load->Resize->Save scenario!
As discussed in #1064, it's finally possible thanks to the Intel SIMD intrinsics in .NET Core 3.1. Opening an issue so we can track this work, and hopefully get some help & feedback from the community.
Current pipeline
Summary of steps currently done by ConvertColorsInto:
[D]: Data representation
(T): Bulk transformation between data representations
(case a) Y+Cb+Cr planes --> Single Rgba32 buffer
| [D] | 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr) |
| (T) | Color convert and pack into a single Vector4 buffer |
| [D] | Floating point RGBA data as Memory<Vector4> |
| (T) | Convert the Vector4 buffer to an Rgba32 buffer. In the Rgba32 case, the input buffer can be treated as a homogeneous float buffer in which every individual float value is converted to a byte. The conversion is implemented in BulkConvertNormalizedFloatToByteClampOverflows, which uses AVX2 conversion and narrowing operations through Vector<T> |
| [D] | The result image as an Rgba32 buffer |
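To make the float-to-byte step concrete, here is a minimal sketch of clamped narrowing through `Vector<T>`. It is not the actual `BulkConvertNormalizedFloatToByteClampOverflows` code: the class and method names are hypothetical, and it assumes the input floats are normalized to 0-1, so they are scaled by 255 before narrowing.

```csharp
using System;
using System.Numerics;

// Hypothetical helper for illustration only, not ImageSharp's implementation.
public static class NarrowingSketch
{
    // Assumption: source values are normalized to [0, 1]; out-of-range values are clamped.
    public static void NormalizedFloatToByteClamp(ReadOnlySpan<float> source, Span<byte> destination)
    {
        if (destination.Length < source.Length)
        {
            throw new ArgumentException("Destination is too short.", nameof(destination));
        }

        int lanes = Vector<float>.Count;   // 8 floats per vector when AVX2 is available
        int step = lanes * 4;              // 4 float vectors narrow into 1 byte vector
        int i = 0;

        for (; i + step <= source.Length; i += step)
        {
            Vector<uint> a = LoadScaled(source.Slice(i));
            Vector<uint> b = LoadScaled(source.Slice(i + lanes));
            Vector<uint> c = LoadScaled(source.Slice(i + (2 * lanes)));
            Vector<uint> d = LoadScaled(source.Slice(i + (3 * lanes)));

            // Vector.Narrow keeps element order: uint -> ushort -> byte.
            Vector<ushort> lo = Vector.Narrow(a, b);
            Vector<ushort> hi = Vector.Narrow(c, d);
            Vector.Narrow(lo, hi).CopyTo(destination.Slice(i));
        }

        // Scalar tail for the elements that do not fill a whole vector batch.
        for (; i < source.Length; i++)
        {
            float v = Math.Clamp((source[i] * 255f) + 0.5f, 0f, 255f);
            destination[i] = (byte)v;
        }
    }

    private static Vector<uint> LoadScaled(ReadOnlySpan<float> s)
    {
        var max = new Vector<float>(255f);
        Vector<float> v = (new Vector<float>(s) * max) + new Vector<float>(0.5f);
        v = Vector.Min(Vector.Max(v, Vector<float>.Zero), max);

        // Values are non-negative after clamping, so reinterpreting int as uint is safe.
        return Vector.AsVectorUInt32(Vector.ConvertToInt32(v));
    }
}
```

Because an `Rgba32` buffer is just a run of bytes, the same routine can process a `Vector4` buffer reinterpreted as a flat float span, which is the "homogeneous float buffer" observation from the table above.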
(case b) Y+Cb+Cr planes --> Single Rgb24 buffer
| [D] | 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr) |
| (T) | Color convert and pack into a single Vector4 buffer |
| [D] | Floating point RGBA data as Memory<Vector4> |
| (T) | Convert the Vector4 buffer to an Rgba32 buffer using BulkConvertNormalizedFloatToByteClampOverflows, which uses AVX2 conversion and narrowing operations through Vector<T> |
| [D] | Temporary Rgba32 buffer |
| (T) | PixelOperations<Rgb24>.FromRgba32() (sub-optimal, extra transformation!) |
| [D] | The result image as an Rgb24 buffer |
Optimized pipeline
(default Rgb24 case) Y+Cb+Cr planes --> Single Rgb24 buffer
| D1 | 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr) |
| (T) | Color convert the 3 planes and write the results back to the originating buffers |
| D2 | 3 Planes of Buffer2D<float>, R+G+B |
| (T) | Narrow the float buffers to byte buffers using SimdUtils.BulkConvertNormalizedFloatToByteClampOverflows |
| D3 | 3 Planes of Buffer2D<byte>, R+G+B |
| (T) | PACK the separate image planes (color channels) into a single Rgb24 buffer |
| D4 | The result image as an Rgb24 buffer |
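The D1->D4 steps above can be sketched in scalar form as follows. This is only an illustration of the data flow, not the proposed SIMD implementation: the class and method names are hypothetical, the color conversion assumes the standard JFIF (full-range BT.601) coefficients, and the result is written as raw interleaved R, G, B bytes instead of a real `Rgb24` span.

```csharp
using System;

// Hypothetical scalar reference for the planar pipeline; names are illustrative only.
public static class PlanarPipelineSketch
{
    // D1 -> D2: convert Y+Cb+Cr planes (values 0..255) to R+G+B in place.
    // Assumption: JFIF full-range coefficients.
    public static void ColorConvertInPlace(Span<float> c0, Span<float> c1, Span<float> c2)
    {
        for (int i = 0; i < c0.Length; i++)
        {
            float y = c0[i];
            float cb = c1[i] - 128f;
            float cr = c2[i] - 128f;

            c0[i] = y + (1.402f * cr);                      // R
            c1[i] = y - (0.344136f * cb) - (0.714136f * cr); // G
            c2[i] = y + (1.772f * cb);                      // B
        }
    }

    // D2 -> D3: round, clamp to 0..255 and narrow one float plane to bytes.
    public static void NarrowToBytes(ReadOnlySpan<float> source, Span<byte> destination)
    {
        for (int i = 0; i < source.Length; i++)
        {
            destination[i] = (byte)Math.Clamp(MathF.Round(source[i]), 0f, 255f);
        }
    }

    // D3 -> D4: interleave the three byte planes into packed RGB24 (3 bytes per pixel).
    public static void PackRgb24(
        ReadOnlySpan<byte> r,
        ReadOnlySpan<byte> g,
        ReadOnlySpan<byte> b,
        Span<byte> destination)
    {
        for (int i = 0; i < r.Length; i++)
        {
            destination[(3 * i) + 0] = r[i];
            destination[(3 * i) + 1] = g[i];
            destination[(3 * i) + 2] = b[i];
        }
    }
}
```

Keeping each step planar until the final pack means every loop touches homogeneous data, which is what makes each of them a good candidate for vectorization.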
(TPixel case) Y+Cb+Cr planes --> Single TPixel buffer
| D1 | 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr) |
| (T) | All the steps from the default Rgb24 case |
| D4 | Memory<Rgb24> |
| (T) | Convert the Rgb24 buffer to TPixel buffer using PixelOperations<T> |
| D5 | The result image as a TPixel buffer |
The magic is mostly in the D3->D4 transition: we can now do the pixel packing with shuffle and permute intrinsics when those are available. The other fun thing is that when decoding to Image<Rgb24> (case b), we can omit the unnecessary intermediate Rgba32 step.
API proposal for packing
The best thing is that we can handle this big task incrementally:
- First, extend PixelOperations<T> with new packing operations
- Then, apply the changes to JpegImagePostProcessor as described in the Optimized pipeline paragraph
The packing API is pretty straightforward:
public class PixelOperations<TPixel>
{
// ...
public void PackFromRgbPlanes(
Configuration configuration,
ReadOnlySpan<byte> redChannel,
ReadOnlySpan<byte> greenChannel,
ReadOnlySpan<byte> blueChannel,
Span<TPixel> destination);
}
We can define a default implementation in the base PixelOperations<TPixel> class, and specialize it for Rgba32 and Rgb24. An optional hardcore task is to use T4 to generate a SIMD implementation for all the RGB(A)-like formats.
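A default (non-SIMD) base implementation could look roughly like the sketch below. To stay self-contained it does not reference ImageSharp: `Rgb24Sketch`, `Rgba32Sketch`, `IPixelSketch` and `PixelOperationsSketch<TPixel>` are hypothetical stand-ins for the real types, and the `Configuration` parameter from the proposal is left out.

```csharp
using System;

// Hypothetical stand-ins for the real pixel types; for illustration only.
public struct Rgb24Sketch
{
    public byte R, G, B;

    public Rgb24Sketch(byte r, byte g, byte b)
    {
        this.R = r;
        this.G = g;
        this.B = b;
    }
}

public interface IPixelSketch
{
    void FromRgb24(Rgb24Sketch source);
}

public struct Rgba32Sketch : IPixelSketch
{
    public byte R, G, B, A;

    public void FromRgb24(Rgb24Sketch source)
    {
        this.R = source.R;
        this.G = source.G;
        this.B = source.B;
        this.A = 255; // opaque alpha, since the planes carry no alpha channel
    }
}

public class PixelOperationsSketch<TPixel>
    where TPixel : struct, IPixelSketch
{
    // Generic per-pixel fallback; specialized subclasses could override it with SIMD code.
    public virtual void PackFromRgbPlanes(
        ReadOnlySpan<byte> redChannel,
        ReadOnlySpan<byte> greenChannel,
        ReadOnlySpan<byte> blueChannel,
        Span<TPixel> destination)
    {
        int count = redChannel.Length;
        if (greenChannel.Length != count || blueChannel.Length != count)
        {
            throw new ArgumentException("All channels must have the same length.");
        }

        if (destination.Length < count)
        {
            throw new ArgumentException("Destination is too short.", nameof(destination));
        }

        for (int i = 0; i < count; i++)
        {
            // The span indexer returns a reference, so the pixel is updated in place.
            destination[i].FromRgb24(new Rgb24Sketch(redChannel[i], greenChannel[i], blueChannel[i]));
        }
    }
}
```

Making the method `virtual` is one way to let the Rgba32 and Rgb24 specializations replace the per-pixel loop with shuffle-based code while every other pixel format keeps working through the fallback.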
Note
It is possible to optimize the conversion even further by doing D1->D3 in a single step, but I consider it a very hard task, both implementation- and architecture-wise, and prefer incremental evolution instead.