-
Notifications
You must be signed in to change notification settings - Fork 40
Yuv color gamut conversion and transformYuv420 using Arm Neon. #124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
lib/src/aarch64/gainmapmath_neon.cpp
Outdated
w += 8; | ||
} while (w < image->width / 2); | ||
y0_ptr += image->luma_stride * 2; | ||
y1_ptr += image->luma_stride * 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this may go out of boundary if the image height is odd. Could you double check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Dichen,
This function reads the same data as the base implementation, if there is an odd number of rows the final row will not be processed, as is the case with the base implementation.
The standard implementation also reads two rows of Y data here:
https://github.com/google/libultrahdr/blob/main/lib/src/gainmapmath.cpp#L513
Also if there is an odd number of rows the final row will not be processed due to integer truncation here:
https://github.com/google/libultrahdr/blob/main/lib/src/gainmapmath.cpp#L508
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change conflicts with https://github.com/google/libultrahdr/pull/126/commits
@cmacdonald-arm, few observations,
- folder structure is slightly different. I am using lib/src/dsp/. Is lib/src/dsp/arm/ better ? (will be similar to libvpx)
- enabled aosp build (Android.bp as well)
- option to disable all intrinsics via macro at configure level
- support all resolutions, widths may not align to 16 always, example 1080x1920.
pls let me know if you have any suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Ram,
Personally I have no issues with the creation of a dsp/arm directory. I named the directory AArch64 to signify that all code is only 64-bit compatible. Also if some redesign is taking place to help with the integration of Neon implementations this could help greatly.
With your final point, images with widths that are not multiples to 16 should be fine as along as the buffer containing the images is padded? Currently the ALIGNM macro is used to allow a multiple of 16 scan-lines to be passed to the jpeg encoder, is this likely to change?
ALIGNM is used in API-0: https://github.com/google/libultrahdr/blob/main/lib/src/jpegr.cpp#L225
and API-1: https://github.com/google/libultrahdr/blob/main/lib/src/jpegr.cpp#L225
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aligned to 16 is an acceptable assumption. This is not likely to change. So the intrinsics should be fine. I have updated the folder structure to lib/src/dsp/arm/*.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Ram,
Would the prefered approach here be, first merge your PR, with the updates to the build system and file structure/names. Second rebase these changes and rename the aarch64 Neon files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As pull request #126 contains build changes for different platforms (android and host), may be, it can be merged first. Then this change can be rebased on top of main.
We have introduced a macro UHDR_ENABLE_INTRINSICS to enable/disable intrinsics if required. I think that also needs to be incorporated in your change. Currently the arch64 is the only gating option.
There is gainmapmath_neon.h, can this be merged with gainmapmath.h just like editor_helper.h. This is to avoid more header files. If not, when x86 intrinsics are introduced there will be more header files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DichenZhang1 & @ram-mohan, is there any more changes you would like made to this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks good to me. Thank you.
Add color gamut conversion function using Arm Neon and associated tests. This implementation is only enabled/compatible with AArch64 systems. An important difference between the base C implementation and the Neon version is that the Neon version is "generic". There is a single Neon function that requires the conversion coefficents to be passed as an input vector to the function, this is due to how the function is called in a loop. This change reduces the number of times the coefficents need to be loaded. As of this commit the function is only used as part of the unit tests and will be used in subsequent patches. Change-Id: I0380c31db4ecbb40d7a19375865b2e18ced64b56
Add AArch64 Neon version of transformYuv420 and functional test. This version uses the fixed-point Arm Neon version of the color gamut conversion function, as such the results compared to the base floating-point implementation can contain an off by one error. Change-Id: I9446acdd85223d7072fd972f80b4945321ffc6a4
079101d
to
d900dc0
Compare
@ram-mohan, @DichenZhang1 Do we need to do anything else here before this can be merged? |
@jwright-arm The changes look good to me. I will check perf numbers with and with out the changes once. Nothing is required from you side. Thank you. |
@cmacdonald-arm @jwright-arm I have profiled from my side. The numbers look really good. Thank you for this change. Couple of observations, with neon enabled the change is not bit exact. Is this expected. Further i am seeing an improvement of 16x for convertYuvNeon method. This seems to be on the larger side because neon module is processing 8 pixels per iteration in comparison with non-vector implementation which is doing 1 per iteration. I am unable to explain the additional 2x gain. |
@ram-mohan This change is expected to not to be bit exact, it contains a possible off by one errors. This is due to the usage of fixed point rather than floating point coefficients. In terms of the uplift, 16 Y values and 8 U & V values are processed per loop iteration, also with the usage of fixed point, conversions from integer to floating point and vice versa are no longer needed. |
Thank you, @DichenZhang1 This can be merged as well. |
Thank you! @cmacdonald-arm @ram-mohan |
These 2 patches add a Arm Neon version for the YUV color gamut conversion as well as the transformYuv420 function.