Skip to content

Conversation

@ssh4net
Copy link
Contributor

@ssh4net ssh4net commented Jan 7, 2026

Optional SIMD optimizations for selected ImageBufAlgo operations using the Google Highway library: • add/sub
• mul/div
• mad
• resample
Adds CMake and build system support, new implementation helpers, and developer documentation.

Checklist:

  • I have read the guidelines on contributions and code review procedures.
  • I have updated the documentation if my PR adds features or changes
    behavior.
  • I am sure that this PR's changes are tested somewhere in the
    testsuite
    .
  • I have run and passed the testsuite in CI before submitting the
    PR, by pushing the changes to my fork and seeing that the automated CI
    passed there. (Exceptions: If most tests pass and you can't figure out why
    the remaining ones fail, it's ok to submit the PR and ask for help. Or if
    any failures seem entirely unrelated to your change; sometimes things break
    on the GitHub runners.)
  • My code follows the prevailing code style of this project and I
    fixed any problems reported by the clang-format CI test.
  • If I added or modified a public C++ API call, I have also amended the
    corresponding Python bindings. If altering ImageBufAlgo functions, I also
    exposed the new functionality as oiiotool options.

Optional SIMD optimizations for selected ImageBufAlgo operations using the Google Highway library:
• add/sub
• mul/div
• mad
• resample
Adds CMake and build system support, new implementation helpers, and developer documentation.

Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
@lgritz
Copy link
Collaborator

lgritz commented Jan 7, 2026

I suspect you used LLM for some of this? Which is fine, but I think you should document in the PR description (commit comment) which tool you used and for what parts.

Comment on lines +115 to +128

# Google Highway SIMD acceleration for selected ImageBufAlgo ops. This is an
# optional optimization dependency: if enabled but not found, it will be
# compiled out.
#
# Back-compat: honor -DUSE_HWY=OFF by mapping it to OIIO_USE_HWY if the latter
# was not explicitly provided.
if (DEFINED USE_HWY AND NOT DEFINED OIIO_USE_HWY)
set (OIIO_USE_HWY ${USE_HWY} CACHE BOOL
"Enable Google Highway SIMD optimizations (if Highway is available)" FORCE)
else ()
option (OIIO_USE_HWY "Enable Google Highway SIMD optimizations (if Highway is available)" ON)
endif ()

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems unnecessarily complex. Maybe just replace this whole thing with

Suggested change
# Google Highway SIMD acceleration for selected ImageBufAlgo ops. This is an
# optional optimization dependency: if enabled but not found, it will be
# compiled out.
#
# Back-compat: honor -DUSE_HWY=OFF by mapping it to OIIO_USE_HWY if the latter
# was not explicitly provided.
if (DEFINED USE_HWY AND NOT DEFINED OIIO_USE_HWY)
set (OIIO_USE_HWY ${USE_HWY} CACHE BOOL
"Enable Google Highway SIMD optimizations (if Highway is available)" FORCE)
else ()
option (OIIO_USE_HWY "Enable Google Highway SIMD optimizations (if Highway is available)" ON)
endif ()
option (OIIO_USE_HWY "Enable experimental Google Highway SIMD optimizations (if Highway is available)" OFF)

Also note that I changed the default to OFF, and described it as experimental, which I think is better at this point, in case there is an unforeseen issue.



# Google Highway for SIMD (optional optimization)
checked_find_package (hwy ENABLE ${OIIO_USE_HWY})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curiously, the checked_find_package accepts an ENABLE parameter but doesn't use it! (That's my fault, I guess.) I made another suggestion about the top level enabling variable, so I would recommend changing this to

Suggested change
checked_find_package (hwy ENABLE ${OIIO_USE_HWY})
if (OIIO_USE_HWY)
checked_find_package (hwy)
endif ()

#endif

#ifdef _MSC_VER
# include <malloc.h> // for alloca
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't make sense to me. We already use alloca extensively without trouble, why did this need to be added now?

Comment on lines +316 to +317
#elif defined(_MSC_VER)
# define OIIO_ALLOCA(type, size) (assert(size < (1<<20)), (size) != 0 ? ((type*)_alloca((size) * sizeof(type))) : nullptr)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? I think this definition is identical to the #else clause below, so the MS case would already be handled properly, right?

Comment on lines +13 to +16
#if defined(_WIN32)
# include <malloc.h> // for alloca
#endif

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? It always worked before




#if defined(OIIO_USE_HWY) && OIIO_USE_HWY
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the cmake side, you see to always define OIIO_USE_HWY. So these can just be

Suggested change
#if defined(OIIO_USE_HWY) && OIIO_USE_HWY
#if OIIO_USE_HWY

Or else, if you want cmake to only add the definition at all if enabled, then #ifdef is what should be in the cpp.

Comment on lines +127 to +131
template<class Rtype, class Atype, class Btype>
static bool
add_impl_hwy(ImageBuf& R, const ImageBuf& A, const ImageBuf& B, ROI roi,
int nthreads)
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't done a line-by-line comparison, but it seems to me that the only difference between add_impl_hwy, sub_impl_hwy, and mul_impl_hwy is likely going to be

[](auto d, auto a, auto b) { return hn::Add(a, b); }

versus that one lambda changing for Sub and Mul.

I would love for even the initial commit to reduce this whole thing to a shared hwy_binary_perpixel_op() template that takes the lambda housing the op kernel as a templated parameter.

Comment on lines +155 to +161
// Process pixel by pixel (scalar fallback for strided channels)
for (int x = roi.xbegin; x < roi.xend; ++x) {
Rtype* r_ptr = ChannelPtr<Rtype>(Rv, x, y, roi.chbegin);
const Atype* a_ptr = ChannelPtr<Atype>(Av, x, y,
roi.chbegin);
const Btype* b_ptr = ChannelPtr<Btype>(Bv, x, y,
roi.chbegin);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should benchmark the strided case and see how it compares to the contiguous case and the full scalar fallback that we've always had. If there is no big speed gain, I would be in favor of eliminating this whole clause and let non-contiguous strides use the old scalar path, then there is much less template expansion for hwy in the cases where there is not a large gain to be had. Note that this means that the "to hwy or not to hwy" test would need to test contiguity in addition to just localpixels().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants