Speed up JPEG Decoder color conversion #1121

@antonfirsov

Let's finally beat System.Drawing on the JPEG Load->Resize->Save scenario!

As discussed in #1064, it's finally possible thanks to the Intel SIMD intrinsics in .NET Core 3.1. Opening an issue so we can track this work, and hopefully get some help & feedback from the community.

/cc @Sergio0694 @saucecontrol

Current pipeline

Summary of steps currently done by ConvertColorsInto:

[D]: Data representation
(T): Bulk transformation between data representations

(case a) Y+Cb+Cr planes --> Single Rgba32 buffer

[D] 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr)
(T) Color convert and pack into a single Vector4 buffer
[D] Floating point RGBA data as Memory<Vector4>
(T) Convert the Vector4 buffer to an Rgba32 buffer. In the Rgba32 case, the input buffer can be handled as a homogeneous float buffer, where every individual float value is converted to a byte. The conversion is implemented in BulkConvertNormalizedFloatToByteClampOverflows, utilizing AVX2 conversion and narrowing operations through Vector<T>
[D] The result image as an Rgba32 buffer

(case b) Y+Cb+Cr planes --> Single Rgb24 buffer

[D] 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr)
(T) Color convert and pack into a single Vector4 buffer
[D] Floating point RGBA data as Memory<Vector4>
(T) Convert the Vector4 buffer to an Rgba32 buffer with BulkConvertNormalizedFloatToByteClampOverflows, utilizing AVX2 conversion and narrowing operations through Vector<T>
[D] Temporary Rgba32 buffer
(T) PixelOperations<Rgb24>.FromRgba32() (sub-optimal, extra transformation!)
[D] The result image as an Rgb24 buffer

Optimized pipeline

(default Rgb24 case) Y+Cb+Cr planes --> Single Rgb24 buffer

D1 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr)
(T) Color convert the 3 planes and write the results back to the originating buffers
D2 3 Planes of Buffer2D<float>, R+G+B
(T) Narrow the float buffers to byte buffers using SimdUtils.BulkConvertNormalizedFloatToByteClampOverflows
D3 3 Planes of Buffer2D<byte>, R+G+B
(T) PACK the separate image planes (color channels) into a single Rgb24 buffer
D4 The result image as an Rgb24 buffer
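
To make the D1->D2->D3 steps above concrete, here is a minimal scalar sketch for a single row. It is not the ImageSharp implementation: the class and method names are made up for illustration, the coefficients are the standard JFIF (BT.601 full-range) ones, and the narrowing loop only stands in for SimdUtils.BulkConvertNormalizedFloatToByteClampOverflows, assuming the floats are in the 0-255 range as listed in D1.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class JpegColorSketch
{
    // D1 -> D2: convert one row of Y/Cb/Cr floats (0-255) to R/G/B,
    // writing the results back into the same three buffers.
    public static void ConvertYCbCrToRgbInPlace(Span<float> c0, Span<float> c1, Span<float> c2)
    {
        for (int i = 0; i < c0.Length; i++)
        {
            float y = c0[i];
            float cb = c1[i] - 128f;
            float cr = c2[i] - 128f;

            c0[i] = y + (1.402f * cr);                       // R
            c1[i] = y - (0.344136f * cb) - (0.714136f * cr); // G
            c2[i] = y + (1.772f * cb);                       // B
        }
    }

    // D2 -> D3: round each float and clamp it to the byte range.
    // Assumption: values are already scaled to 0-255, not 0-1.
    public static void NarrowToBytes(ReadOnlySpan<float> source, Span<byte> destination)
    {
        for (int i = 0; i < source.Length; i++)
        {
            float rounded = MathF.Round(source[i]);
            destination[i] = (byte)Math.Clamp(rounded, 0f, 255f);
        }
    }
}
```

The point of keeping the data planar until D3 is that both loops touch one homogeneous buffer at a time, which is the shape that vectorizes well.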

(TPixel case) Y+Cb+Cr planes --> Single TPixel buffer

D1 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr)
(T) All the steps from the default Rgb24 case
D4 Memory<Rgb24>
(T) Convert the Rgb24 buffer to TPixel buffer using PixelOperations<T>
D5 The result image as a TPixel buffer

The magic is mostly in the D3->D4 transition: we can now do the pixel packing with shuffle and permute intrinsics when those are available. The other fun thing is that when decoding to Image<Rgb24> (case b) we can omit the unnecessary Rgba32 -> Rgb24 conversion step.
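
A rough idea of what shuffle-based packing for D3->D4 could look like, using only SSSE3 byte shuffles (no AVX2 permutes). This is a sketch under assumptions, not the proposed implementation: the class name, the method name and the raw byte destination are hypothetical, and the masks are computed at startup rather than hardcoded to keep the example short. Each iteration takes 16 pixels from every plane and emits 48 interleaved bytes.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper, for illustration only.
public static class ShufflePackSketch
{
    // Masks[chunk * 3 + channel]: shuffle mask for one plane and one 16-byte output chunk.
    private static readonly Vector128<byte>[] Masks = BuildMasks();

    private static Vector128<byte>[] BuildMasks()
    {
        var masks = new Vector128<byte>[9];
        Span<byte> m = stackalloc byte[16];
        for (int chunk = 0; chunk < 3; chunk++)
        {
            for (int channel = 0; channel < 3; channel++)
            {
                for (int j = 0; j < 16; j++)
                {
                    // Output byte n of the 48-byte group holds channel n % 3 of pixel n / 3.
                    // 0x80 makes the shuffle write zero, so the three results can be OR-ed.
                    int n = (16 * chunk) + j;
                    m[j] = (n % 3 == channel) ? (byte)(n / 3) : (byte)0x80;
                }

                masks[(chunk * 3) + channel] = MemoryMarshal.Read<Vector128<byte>>(m);
            }
        }

        return masks;
    }

    public static void PackRgb(ReadOnlySpan<byte> r, ReadOnlySpan<byte> g, ReadOnlySpan<byte> b, Span<byte> rgb)
    {
        if (g.Length != r.Length || b.Length != r.Length || rgb.Length < 3 * r.Length)
        {
            throw new ArgumentException("Plane lengths must match and the destination must hold 3 bytes per pixel.");
        }

        int i = 0;
        if (Ssse3.IsSupported)
        {
            Vector128<byte>[] masks = Masks;
            for (; i + 16 <= r.Length; i += 16)
            {
                Vector128<byte> vr = MemoryMarshal.Read<Vector128<byte>>(r.Slice(i));
                Vector128<byte> vg = MemoryMarshal.Read<Vector128<byte>>(g.Slice(i));
                Vector128<byte> vb = MemoryMarshal.Read<Vector128<byte>>(b.Slice(i));

                for (int chunk = 0; chunk < 3; chunk++)
                {
                    Vector128<byte> packed = Sse2.Or(
                        Sse2.Or(
                            Ssse3.Shuffle(vr, masks[chunk * 3]),
                            Ssse3.Shuffle(vg, masks[(chunk * 3) + 1])),
                        Ssse3.Shuffle(vb, masks[(chunk * 3) + 2]));

                    // Slice(start, 16) bounds-checks the unaligned 16-byte store.
                    Span<byte> target = rgb.Slice((3 * i) + (16 * chunk), 16);
                    Unsafe.WriteUnaligned(ref MemoryMarshal.GetReference(target), packed);
                }
            }
        }

        // Scalar path: remainder pixels, or everything when SSSE3 is unavailable.
        for (; i < r.Length; i++)
        {
            rgb[3 * i] = r[i];
            rgb[(3 * i) + 1] = g[i];
            rgb[(3 * i) + 2] = b[i];
        }
    }
}
```

A real implementation would likely hoist the masks into locals, unroll the chunk loop and add an AVX2 path; the sketch only shows that the interleave reduces to three shuffles and two ORs per 16 output bytes.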

API proposal for packing

The best thing is that we can handle this big task incrementally:

  • First, extend PixelOperations<T> by new packing operations
  • Then, adapt the changes in JpegImagePostProcessor as described in the Optimized pipeline paragraph

The packing API is pretty straightforward:

public class PixelOperations<TPixel>
{
    // ...

    public void PackFromRgbPlanes(
        Configuration configuration,
        ReadOnlySpan<byte> redChannel,
        ReadOnlySpan<byte> greenChannel,
        ReadOnlySpan<byte> blueChannel,
        Span<TPixel> destination);
}

We can define a default implementation in the base PixelOperations<TPixel> class, and specialize it for Rgba32 and Rgb24. An optional hardcore task is to use T4 to generate a SIMD implementation for all the RGB(A)-like formats.
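
For illustration, the default implementation could be as simple as the scalar loop below. Everything here is a stand-in so the snippet compiles on its own: PixelOperationsSketch, Rgb24Sketch and the FromRgb hook are hypothetical names, not ImageSharp types, and the Configuration parameter from the proposed signature is left out.

```csharp
using System;

// Hypothetical stand-in for Rgb24.
public struct Rgb24Sketch
{
    public byte R;
    public byte G;
    public byte B;

    public Rgb24Sketch(byte r, byte g, byte b)
    {
        R = r;
        G = g;
        B = b;
    }
}

// Hypothetical stand-in for PixelOperations<TPixel>.
public abstract class PixelOperationsSketch<TPixel>
    where TPixel : struct
{
    // Hypothetical per-pixel hook used by the default implementation.
    protected abstract TPixel FromRgb(byte r, byte g, byte b);

    // Default path: works for any pixel type, one pixel at a time.
    // Specialized pixel types can override this with a SIMD version.
    public virtual void PackFromRgbPlanes(
        ReadOnlySpan<byte> redChannel,
        ReadOnlySpan<byte> greenChannel,
        ReadOnlySpan<byte> blueChannel,
        Span<TPixel> destination)
    {
        int count = redChannel.Length;
        if (greenChannel.Length != count || blueChannel.Length != count || destination.Length < count)
        {
            throw new ArgumentException("Channel lengths must match and fit into the destination.");
        }

        for (int i = 0; i < count; i++)
        {
            destination[i] = FromRgb(redChannel[i], greenChannel[i], blueChannel[i]);
        }
    }
}

public sealed class Rgb24OperationsSketch : PixelOperationsSketch<Rgb24Sketch>
{
    protected override Rgb24Sketch FromRgb(byte r, byte g, byte b) => new Rgb24Sketch(r, g, b);
}
```

Making the method virtual keeps the incremental plan intact: the base version ships first, and Rgba32 and Rgb24 overrides can be added later without changing callers.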

Note

It is possible to optimize the conversion even further by doing D1->D3 in a single step, but I consider it a very hard task both implementation-wise and architecture-wise, and prefer incremental evolution instead.
