AcademySoftwareFoundation · ssh4net · Jan 7, 2026 · Jan 13, 2026 · Jan 13, 2026 · Jan 13, 2026
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -112,6 +112,7 @@ else ()
 endif ()
 option (${PROJ_NAME}_BUILD_TOOLS "Build the command-line tools" ON)
 option (${PROJ_NAME}_BUILD_TESTS "Build the unit tests" ON)
+option (OIIO_USE_HWY "Enable experimental Google Highway SIMD optimizations (if Highway is available)" OFF)
 set (OIIO_LIBNAME_SUFFIX "" CACHE STRING
      "Optional name appended to ${PROJECT_NAME} libraries that are built")
 option (BUILD_OIIOUTIL_ONLY "If ON, will build *only* libOpenImageIO_Util" OFF)

diff --git a/docs/dev/Architecture.md b/docs/dev/Architecture.md
@@ -117,6 +117,10 @@ objects. These algorithms include simple operations like copying, resizing,
 and compositing images, as well as more complex operations like color
 conversions, resizing, filtering, etc.
 
+Some performance-critical `ImageBufAlgo` implementations have SIMD-accelerated
+paths using Google Highway. For implementation details and guidance for adding
+new kernels, see `docs/dev/ImageBufAlgo_Highway.md`.
+
 ## Image caching: TextureSystem and ImageCache
 
 There are situations where ImageBuf is still not the right abstraction,

diff --git a/docs/dev/ImageBufAlgo_Highway.md b/docs/dev/ImageBufAlgo_Highway.md
@@ -0,0 +1,264 @@
+ImageBufAlgo Highway (hwy) Implementation Guide
+==============================================
+
+This document explains how OpenImageIO uses Google Highway (hwy) to accelerate
+selected `ImageBufAlgo` operations, and how to add or modify kernels in a way
+that preserves OIIO semantics while keeping the code maintainable.
+
+This is a developer-facing document about the implementation structure in
+`src/libOpenImageIO/`. It does not describe the public API behavior of the
+algorithms.
+
+
+Goals and non-goals
+-------------------
+
+Goals:
+- Make the hwy-backed code paths easy to read and easy to extend.
+- Centralize repetitive boilerplate (type conversion, tails, ROI pointer math).
+- Preserve OIIO's numeric semantics (normalized integer model).
+- Keep scalar fallbacks as the source of truth for tricky layout cases.
+
+Non-goals:
+- Explain Highway itself. Refer to the upstream Highway documentation.
+- Guarantee that every ImageBufAlgo op has a hwy implementation.
+
+
+Where the code lives
+--------------------
+
+Core helpers:
+- `src/libOpenImageIO/imagebufalgo_hwy_pvt.h`
+
+Typical hwy call sites:
+- `src/libOpenImageIO/imagebufalgo_addsub.cpp`
+- `src/libOpenImageIO/imagebufalgo_muldiv.cpp`
+- `src/libOpenImageIO/imagebufalgo_mad.cpp`
+- `src/libOpenImageIO/imagebufalgo_pixelmath.cpp`
+- `src/libOpenImageIO/imagebufalgo_xform.cpp` (some ops are hwy-accelerated)
+
+
+Enabling and gating the hwy path
+-------------------------------
+
+The hwy path is only used when:
+- Highway usage is enabled at runtime (`OIIO::pvt::enable_hwy`).
+- The relevant `ImageBuf` objects have local pixel storage (`localpixels()` is
+  non-null), meaning the data is in process memory rather than accessed through
+  an `ImageCache` tile abstraction.
+- The operation can be safely expressed as contiguous streams of pixels/channels
+  for the hot path, or the code falls back to a scalar implementation for
+  strided/non-contiguous layouts.
+
+The common gating pattern looks like:
+- In a typed `*_impl` dispatcher: check `OIIO::pvt::enable_hwy` and `localpixels`
+  and then call a `*_impl_hwy` function; otherwise call `*_impl_scalar`.
+
+Important: the hwy path is an optimization. Correctness must not depend on hwy.
+
+
+OIIO numeric semantics: why we promote to float
+----------------------------------------------
+
+OIIO treats integer image pixels as normalized values:
+- Unsigned integers represent [0, 1].
+- Signed integers represent approximately [-1, 1] with clamping for INT_MIN.
+
+Therefore, most pixel math must be performed in float (or double) space, even
+when the stored data is integer. This is why the hwy layer uses the
+"LoadPromote/Operate/DemoteStore" pattern.
+
+For additional discussion (and pitfalls of saturating integer arithmetic), see:
+- `HIGHWAY_SATURATING_ANALYSIS.md`
+
+
+The core pattern: LoadPromote -> RunHwy* -> DemoteStore
+-------------------------------------------------------
+
+The helper header `imagebufalgo_hwy_pvt.h` defines the reusable building blocks:
+
+1) Computation type selection
+   - `SimdMathType<T>` selects `float` for most types, and `double` only when
+     the destination type is `double`.
+
+   Rationale:
+   - Float math is significantly faster on many targets.
+   - For OIIO, integer images are normalized to [0,1] (or ~[-1,1]), so float
+     precision is sufficient for typical image processing workloads.
+
+2) Load and promote (with normalization)
+   - `LoadPromote(d, ptr)` and `LoadPromoteN(d, ptr, count)` load values and
+     normalize integer ranges into the computation space.
+
+   Rationale:
+   - Consolidates all normalization and conversion logic in one place.
+   - Prevents subtle drift where each operation re-implements integer scaling.
+   - Ensures tail handling ("N" variants) is correct and consistent.
+
+3) Demote and store (with denormalization/clamp/round)
+   - `DemoteStore(d, ptr, v)` and `DemoteStoreN(d, ptr, v, count)` reverse the
+     normalization and store results in the destination pixel type.
+
+   Rationale:
+   - Centralizes rounding and clamping behavior for all destination types.
+   - Ensures output matches OIIO scalar semantics.
+
+4) Generic kernel runners (streaming arrays)
+   - `RunHwyUnaryCmd`, `RunHwyCmd` (binary), `RunHwyTernaryCmd`
+   - These are the primary entry points for most hwy kernels.
+
+   Rationale:
+   - Encapsulates lane iteration and tail processing once.
+   - The call sites only provide the per-lane math lambda, not the boilerplate.
+
+
+Native integer runners: when they are valid
+-------------------------------------------
+
+Some operations are "scale-invariant" under OIIO's normalized integer model.
+For example, for unsigned integer add:
+- `(a/max + b/max)` in float space, then clamped to [0,1], then scaled by max
+  matches saturated integer add `SaturatedAdd(a, b)` for the same bit depth.
+
+For those cases, `imagebufalgo_hwy_pvt.h` provides:
+- `RunHwyUnaryNativeInt<T>`
+- `RunHwyBinaryNativeInt<T>`
+
+These should only be used when all of the following are true:
+- The operation is known to be scale-invariant under the normalization model.
+- Input and output types are the same integral type.
+- The operation does not depend on mixed types or float-range behavior.
+
+Rationale:
+- Avoids promotion/demotion overhead and can be materially faster.
+- Must be opt-in and explicit, because many operations are NOT compatible with
+  raw integer arithmetic (e.g. multiplication, division, pow).
+
+
+Local pixel pointer helpers: reducing boilerplate safely
+-------------------------------------------------------
+
+Most hwy call sites need repeated pointer and stride computations:
+- Pixel size in bytes.
+- Scanline size in bytes.
+- Base pointer to local pixels.
+- Per-row pointer for a given ROI and scanline.
+- Per-pixel pointer for non-contiguous fallbacks.
+
+To centralize that, `imagebufalgo_hwy_pvt.h` defines:
+- `HwyPixels(ImageBuf&)` and `HwyPixels(const ImageBuf&)`
+  returning a small view (`HwyLocalPixelsView`) with:
+  - base pointer (`std::byte*` / `const std::byte*`)
+  - `pixel_bytes`, `scanline_bytes`
+  - `xbegin`, `ybegin`, `nchannels`
+- `RoiNChannels(roi)` for `roi.chend - roi.chbegin`
+- `ChannelsContiguous<T>(view, nchannels)`:
+  true only when the pixel stride exactly equals `nchannels * sizeof(T)`
+- `PixelBase(view, x, y)`, `ChannelPtr<T>(view, x, y, ch)`
+- `RoiRowPtr<T>(view, y, roi)` for the start of the ROI row at `roi.xbegin` and
+  `roi.chbegin`.
+
+Rationale:
+- Avoids duplicating fragile byte-offset math across many ops.
+- Makes it visually obvious what the code is doing: "get row pointer" vs
+  "compute offset by hand."
+- Makes non-contiguous fallback paths less error-prone by reusing the same
+  pointer computations.
+
+Important: these helpers are only valid for `ImageBuf` instances with local
+pixels (`localpixels()` non-null). The call sites must check that before using
+them.
+
+
+Contiguous fast path vs non-contiguous fallback
+-----------------------------------------------
+
+Most operations implement two paths:
+
+1) Contiguous fast path:
+   - Used when pixels are tightly packed for the ROI's channel range.
+   - The operation is executed as a 1D stream of length:
+     `roi.width() * (roi.chend - roi.chbegin)`
+   - Uses `RunHwy*Cmd` (or native-int runner) and benefits from:
+     - fewer branches
+     - fewer pointer computations
+     - auto tail handling
+
+2) Non-contiguous fallback:
+   - Used when pixels have padding, unusual strides, or channel subsets that do
+     not form a dense stream.
+   - Typically loops pixel-by-pixel and channel-by-channel.
+   - May still use the `ChannelPtr` helpers to compute correct addresses.
+
+Rationale:
+- The contiguous path is where SIMD delivers large gains.
+- Trying to SIMD-optimize arbitrary strided layouts often increases complexity
+  and risk for marginal benefit. Keeping a scalar fallback preserves
+  correctness and maintainability.
+
+
+How to add a new hwy kernel
+---------------------------
+
+Step 1: Choose the kernel shape
+- Unary: `R = f(A)` -> use `RunHwyUnaryCmd`
+- Binary: `R = f(A, B)` -> use `RunHwyCmd`
+- Ternary: `R = f(A, B, C)` -> use `RunHwyTernaryCmd`
+
+Step 2: Decide if a native-int fast path is valid
+- Only for scale-invariant ops and same-type integral inputs/outputs.
+- Use `RunHwyUnaryNativeInt` / `RunHwyBinaryNativeInt` when safe.
+- Otherwise, always use the promote/demote runners.
+
+Step 3: Implement the hwy body with a contig check
+Typical structure inside `*_impl_hwy`:
+- Acquire views once:
+  - `auto Rv = HwyPixels(R);`
+  - `auto Av = HwyPixels(A);` etc.
+- In the parallel callback:
+  - compute `nchannels = RoiNChannels(roi)`
+  - compute `contig = ChannelsContiguous<...>(...)` for each image
+  - for each scanline y:
+    - `Rtype* r_row = RoiRowPtr<Rtype>(Rv, y, roi);`
+    - `const Atype* a_row = RoiRowPtr<Atype>(Av, y, roi);` etc.
+    - if contig: call `RunHwy*` with `n = roi.width() * nchannels`
+    - else: fall back per pixel, per channel
+
+Step 4: Keep the scalar path as the reference
+- The scalar implementation should remain correct for all layouts and types.
+- The hwy path should match scalar results for supported cases.
+
+
+Design rationale summary
+------------------------
+
+This design intentionally separates concerns:
+- Type conversion and normalization are centralized (`LoadPromote`,
+  `DemoteStore`).
+- SIMD lane iteration and tail handling are centralized (`RunHwy*` runners).
+- Image address computations are centralized (`HwyPixels`, `RoiRowPtr`,
+  `ChannelPtr`).
+- Operation-specific code is reduced to short lambdas expressing the math.
+
+This makes the hwy layer:
+- Easier to maintain: fewer places to fix bugs when semantics change.
+- Easier to extend: adding an op mostly means writing the math lambda and the
+  dispatch glue.
+- Safer: correctness for unusual layouts remains in scalar fallbacks.
+
+
+Notes on `half`
+---------------
+
+The hwy conversion helpers handle `half` by converting through
+`hwy::float16_t`. This currently assumes the underlying `half` representation
+is compatible with how Highway loads/stores 16-bit floats.
+
+If this assumption is revisited in the future, it should be changed as a
+separate, explicit correctness/performance project.
+
+
+<!-- SPDX-License-Identifier: CC-BY-4.0 -->
+<!-- Copyright Contributors to the OpenImageIO Project. -->
+
+
diff --git a/src/cmake/externalpackages.cmake b/src/cmake/externalpackages.cmake
@@ -225,6 +225,11 @@ if (USE_QT AND OPENGL_FOUND)
 endif ()
 
 
+# Google Highway for SIMD (optional optimization)
+if (OIIO_USE_HWY)
+    checked_find_package (hwy)
+endif ()
+
 # Tessil/robin-map
 checked_find_package (Robinmap REQUIRED
                       VERSION_MIN 1.2.0

diff --git a/src/doc/imagebufalgo.rst b/src/doc/imagebufalgo.rst
@@ -152,6 +152,68 @@ the computation without spawning additional threads, which might tend to
 crowd out the other application threads.
 
 
+SIMD Performance and Data Types
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Many ImageBufAlgo operations use SIMD (Single Instruction, Multiple Data)
+optimizations powered by the Google Highway library to achieve significant
+performance improvements, particularly for integer image formats.
+
+**Integer Type Optimizations:**
+
+OpenImageIO treats all integer images as normalized Standard Dynamic Range
+(SDR) data:
+
+* Unsigned integers (``uint8``, ``uint16``, ``uint32``, ``uint64``) are
+  normalized to the [0.0, 1.0] range: ``float_value = int_value / max_value``
+* Signed integers (``int8``, ``int16``, ``int32``, ``int64``) are normalized
+  to approximately the [-1.0, 1.0] range: ``float_value = int_value / max_value``
+
+Most ImageBufAlgo operations convert integer data to float, perform the
+operation, and convert back. Highway SIMD provides 3-5x speedup for these
+operations compared to scalar code.
+
+**Scale-Invariant Operations:**
+
+Certain operations are *scale-invariant*, meaning they produce identical
+results whether performed on raw integers or normalized floats. For these
+operations, OpenImageIO uses native integer SIMD paths that avoid float
+conversion entirely, achieving 6-12x speedup (2-3x faster than the float
+promotion path):
+
+* ``add``, ``sub`` (with saturation)
+* ``min``, ``max``
+* ``abs``, ``absdiff``
+
+These optimizations automatically activate when all input and output images
+have matching integer types (e.g., all ``uint8``). When types differ or when
+mixing integer and float images, the standard float promotion path is used.
+
+**Controlling SIMD Optimizations:**
+
+Highway SIMD is enabled by default. To disable it globally::
+
+    OIIO::attribute("enable_hwy", 0);
+
+Or via environment variable::
+
+    export OPENIMAGEIO_ENABLE_HWY=0
+
+This is primarily useful for debugging or performance comparison. In normal
+use, the optimizations should remain enabled for best performance.
+
+**Performance Expectations:**
+
+Typical speedups with Highway SIMD (compared to scalar code):
+
+* Float operations: 3-5x faster
+* Integer operations (with float conversion): 3-5x faster
+* Integer scale-invariant operations (native int): 6-12x faster
+* Half-float operations: 3-5x faster
+
+Actual performance depends on the specific operation, image size, data types,
+and hardware capabilities (AVX2, AVX-512, ARM NEON, etc.).
+
 
 .. _sec-iba-patterns: