Description
We hit a performance issue with ACES 2.0 output transforms when using the Metal shader. The Metal implementation is very slow, especially on Apple silicon, and the performance hit increases with input size. Preliminary analysis points to the 362-element const float array that the transform uses as the main culprit. Since the OCIO Metal shader implementation uses a "wrapper" struct to encapsulate the functions and data, this array ends up being a member of the encapsulating struct, and it looks like the array is re-created many times at run time, thrashing the L1 cache (90% miss rate). See ocio_gamut_cusp_table_0_hues_array, defined on line 88 of the attached shader code, which was generated from the v2.4.1 tag.
The OpenGL and HLSL generators don't use a wrapper and thus don't suffer from the same issue, even on the same hardware. Pulling the offending array out of the struct and marking it "constant float" also fixes the performance problem (see the attached shader, which has the array outside of the struct).
shader generated with v2.4.1.txt
shader where array pulled outside of the struct
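To make the layout difference concrete, here is a minimal sketch in plain C++ (not the generated Metal code). All names are hypothetical and the table values are zero-filled placeholders; only the 362-element size is taken from the report. It shows that a table declared as a struct member is part of every wrapper instance, while a table declared outside the struct is a single shared read-only object, which is roughly what a program-scope "constant float" array gives you in Metal.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical size, matching the 362-element array from the report.
constexpr std::size_t kTableSize = 362;

// Variant A: the table is a non-static data member, so every instance of
// the wrapper carries (and has to initialize) its own copy of the data.
struct WrapperWithMemberTable {
    const float hues[kTableSize] = {};  // placeholder values (all zero)

    float lookup(std::size_t i) const {
        assert(i < kTableSize);
        return hues[i];
    }
};

// Variant B: the table lives outside the wrapper as one read-only object
// shared by all instances. Assumption: this mirrors a program-scope
// `constant float` array in Metal.
constexpr float kHues[kTableSize] = {};  // placeholder values (all zero)

struct WrapperWithSharedTable {
    float lookup(std::size_t i) const {
        assert(i < kTableSize);
        return kHues[i];
    }
};

// The member table is paid for per instance; the shared table is not.
static_assert(sizeof(WrapperWithMemberTable) >= kTableSize * sizeof(float),
              "member table is part of every wrapper instance");
static_assert(sizeof(WrapperWithSharedTable) < sizeof(WrapperWithMemberTable),
              "wrapper without the member table is smaller");
```

Whether the Metal compiler actually materializes the member array per thread or per invocation is not established here; the sketch only shows why the member form makes that possible and the program-scope form does not.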
Repro steps
- check out the OCIO v2.4.1 tag and compile
- download the test config that's provided here: https://github.com/AcademySoftwareFoundation/OpenColorIO/wiki/ACES-2.0-optimization
- set OCIO environment variable to point to that config.
- run the ociodisplay utility with the "-metal" and "-gpuinfo" flags and pass a relatively large (4K) image as the input:
  ociodisplay -metal -gpuinfo large_image.exr
- with the right-click menu in the UI, set
  - the image color space to "ACES / ACES2065-1"
  - the display to "sRGB - Display"
  - the view to "ACES 2.0 - SDR 100 nits (Rec.709)"
You'll see the generated code in the console, and if you take a Metal capture, the draw call performance report will show a very high buffer L1 cache miss rate.