x86: optimize CumulativeSum with SIMD Kogge-Stone prefix scan#6693
x86: optimize CumulativeSum with SIMD Kogge-Stone prefix scan#6693crafcat7 wants to merge 9 commits into
Conversation
Summary: Add x86 SIMD fast-path for CumulativeSum targeting the serial-scan axes (dims=1, dims=2 axis=1, dims=3 axis=2) where the base scalar code carries a true prefix-sum data dependency and the compiler cannot auto-vectorize. The kernel performs an in-register Kogge-Stone scan (AVX2 8-lane / SSE2 4-lane) with a running tile base, turning N scalar adds into log2(vec) SIMD adds per tile. Bandwidth-bound axes with no inner dependency are left to the base implementation since the compiler already auto-vectorizes them at memory bandwidth. Changes: 1. Add src/layer/x86/cumulativesum_x86.h declaring CumulativeSum_x86 2. Implement prefix_sum_row() with AVX2 8-wide Kogge-Stone scan (3 stages + cross-128 propagation) and SSE2 4-wide fallback 3. Route dims=1 / dims=2 axis=1 / dims=3 axis=2 to the SIMD scan path; dims=3 axis=2 runs in parallel over (channel, row) via OpenMP collapse(2) 4. Fall back to CumulativeSum::forward_inplace for non-pack1, non-fp32, and bandwidth-bound axes
Summary:
Add dedicated boundary test cases that exercise w at, just below, and
just above common SIMD vector widths (4 / 8 / 16). The cases cover the
single-tile no-tail path, the running base propagation across multiple
tiles, and the scalar tail seam (sum carried via ptr[j-1]). The existing
tests only hit these paths incidentally; the new cases make the coverage
explicit and remain valid for any vectorized backend.
Changes:
1. Add test_cumulativesum_boundary() covering dims=1/2/3 with w in {7, 8, 9, 15, 16, 17, 32}
2. Include axis=1 (row-parallel) and axis=2 (collapse(2) channel+row) variants to validate the OpenMP parallel paths
3. Wire the new function into main() alongside the existing 1d/2d/3d groups
|
@codex review |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6693 +/- ##
==========================================
+ Coverage 95.69% 95.70% +0.01%
==========================================
Files 944 947 +3
Lines 410509 409756 -753
==========================================
- Hits 392834 392160 -674
+ Misses 17675 17596 -79 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e4f8e403ac
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| { | ||
| int j = 0; | ||
|
|
||
| #if __AVX2__ |
There was a problem hiding this comment.
Enable AVX2 kernel in runtime-dispatch builds
The AVX2 fast path is guarded by #if __AVX2__, but this layer only adds cumulativesum_x86.cpp and no cumulativesum_x86_avx2.cpp helper. In the default NCNN_RUNTIME_CPU=ON x86 build, this source is instantiated as avx/fma variants (see cmake/ncnn_add_layer.cmake flags -mavx and -mavx -mfma), so __AVX2__ is false on AVX2/FMA CPUs without AVX512 and the code falls back to SSE2/scalar. That means the new 8-lane kernel is effectively unavailable for the most common AVX2-only runtime target.
Useful? React with 👍 / 👎.
Summary: Add an x86-specific CumulativeSum implementation that accelerates inner-axis prefix scans with AVX and an out-of-line AVX2 Kogge-Stone helper, while covering all valid base-layer dims/axis cases directly. This also replaces the tail base-class bounce with an explicit invalid-input return path and adds boundary tests for SIMD tile edges. Changes: 1. Add x86 cumulativesum AVX2 helper source and runtime dispatch from AVX-capable variants. 2. Implement SIMD prefix-scan and outer-axis accumulation paths for all valid 1D, 2D, and 3D cases in the x86 layer. 3. Extend cumulativesum tests with boundary widths to validate vector tails and running-base propagation.
|
I've revised my submission and updated the current PR information. The main changes are as follows:
|
Summary: Move the CumulativeSum x86 SIMD implementation and AVX2 runtime dispatch into a shared header included by both the layer wrapper and the AVX2 wrapper. This matches the existing x86 packed helper structure while preserving optimized prefix scan behavior. Changes: 1. Add a shared CumulativeSum x86 packed helper header with prefix/add kernels and forwarding logic 2. Simplify the x86 layer implementation to call the shared forward path 3. Export an AVX2 wrapper translation unit for runtime CPU dispatch
|
I've restructured the code, splitting it into:
|
There was a problem hiding this comment.
Pull request overview
Adds an x86-specific CumulativeSum implementation to accelerate inner-dimension prefix scans using SIMD (AVX2 Kogge-Stone prefix scan with AVX/SSE fallbacks) while keeping all valid (dims, axis) cases handled within the x86 layer and extending test coverage to boundary sizes that stress SIMD tiling/tails.
Changes:
- Introduces
CumulativeSum_x86and an x86 implementation offorward_inplace()that covers all(dims, axis)cases. - Adds a SIMD prefix-sum kernel for contiguous inner-dimension scans plus a SIMD “cur += prev” helper for outer-axis scans, including runtime AVX2 dispatch for
__AVX__ && !__AVX2__builds. - Extends
test_cumulativesumwith boundary cases targeting vector-width edges and tail handling.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_cumulativesum.cpp | Adds boundary-focused test cases for 1D/2D/3D shapes to exercise SIMD tiling and tails. |
| src/layer/x86/cumulativesum_x86.h | Declares the x86-specific CumulativeSum_x86 layer override. |
| src/layer/x86/cumulativesum_x86.cpp | Implements CumulativeSum_x86::forward_inplace() and routes to the packed SIMD implementation. |
| src/layer/x86/cumulativesum_x86_packed.h | Provides the SIMD prefix-sum (AVX2/AVX/SSE2) and add helpers plus the main x86 forward implementation with runtime AVX2 dispatch. |
| src/layer/x86/cumulativesum_x86_avx2.cpp | Defines the out-of-line AVX2 entry point used by runtime dispatch builds. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…rd path
Summary:
Restructure cumulativesum_x86 to follow the established x86 layer
pattern (e.g. deconvolution, interp, cast): normal implementation in
x86.cpp, ISA-specific variant in packed.h + avx2.cpp wrapper.
Changes:
1. Move normal forward_inplace logic and SSE/AVX helpers into
cumulativesum_x86.cpp
2. Rename packed.h helpers to _avx2 suffix and guard with #if __AVX2__
3. Add runtime AVX2 dispatch in forward_inplace via
cumulative_sum_forward_inplace_avx2 free function
4. Update cumulativesum_x86_avx2.cpp wrapper to call _impl variant
|
Thanks for working on this optimization. I think the AVX2 dispatch structure should be adjusted to better match the existing ncnn x86 style. The current implementation dispatches at the whole I would prefer keeping Something along these lines would be cleaner:
There is also a possible behavior issue with the current structure: in compile-time The optimization itself looks reasonable, but I think the dispatch should be moved down to the helper functions before merging. |
Summary: Refactor the x86 CumulativeSum AVX2 runtime dispatch to follow the standard ncnn helper pattern. Previously the dispatch was at the whole forward_inplace() level, duplicating the full dims/axis control flow. This also fixes a bug where compile-time __AVX2__ builds never used the Kogge-Stone prefix-scan kernel. Changes: 1. Rename cumulativesum_x86_packed.h to cumulativesum_x86_helper.h 2. Move AVX2 runtime dispatch into cumulative_sum_prefix_sum_row() and cumulative_sum_add() 3. Remove duplicated forward_inplace logic from the helper header 4. Make cumulativesum_x86_avx2.cpp a thin wrapper exposing only the two kernel functions 5. Keep a single copy of the dims/axis forward logic in cumulativesum_x86.cpp
Summary: The base cumulativesum.cpp was updated to support 4D matrices but the x86 optimized implementation was not, causing test_cumulativesum to fail in CI when the x86 layer returned -100 for dims==4 inputs. Changes: 1. Add dims==4 axis==0/1/2/3 handling to CumulativeSum_x86::forward_inplace() 2. Reuse existing cumulative_sum_add and cumulative_sum_prefix_sum_row helpers
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dbf3b5a774
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| __m128 t = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(v), 4)); | ||
| v = _mm_add_ps(v, t); | ||
| t = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(v), 8)); | ||
| v = _mm_add_ps(v, t); |
There was a problem hiding this comment.
Preserve left-to-right cumulative sums
For any inner-axis scan that reaches the SIMD path, this Kogge-Stone helper changes the floating-point association from the existing scalar recurrence (ptr[k] = ptr[k] + ptr[k - 1]). Inputs with cancellation or a large dynamic range can therefore produce very different prefixes; for example [1e20f, -1e20f, 1.f] should leave the third prefix as 1 under the generic layer's left-to-right order, while the tree computes (-1e20f + 1.f) + 1e20f and yields 0. The new tests use small random values, so this x86/generic divergence is not covered.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b2fe54205a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| int j = 0; | ||
| float sum = 0.f; | ||
|
|
||
| #if __AVX__ |
There was a problem hiding this comment.
Call the AVX2 prefix helper in fixed-ISA builds
In fixed-ISA x86 builds (NCNN_RUNTIME_CPU=OFF with NCNN_AVX2=ON), this source is compiled with AVX2 enabled, so __AVX2__ is true and the runtime-dispatch block above is skipped; execution then falls through to this generic __AVX__ loop, leaving the newly added cumulative_sum_prefix_sum_row_avx2_impl() dead. Fresh evidence compared with the prior AVX2-dispatch comment is that the helper now exists, but it is still never selected when the main x86 source itself is built for AVX2, so those non-runtime AVX2 builds miss the intended Kogge-Stone fast path.
Useful? React with 👍 / 👎.
Summary:
Add #if __AVX2__ direct-call branches so that when the x86 source
itself is compiled with AVX2 enabled (NCNN_RUNTIME_CPU=OFF), the
Kogge-Stone prefix-sum fast path is actually selected instead of
falling through to the slower split-128-bit AVX loop.
Changes:
1. Add #if __AVX2__ branch in cumulative_sum_prefix_sum_row to call
cumulative_sum_prefix_sum_row_avx2_impl directly
2. Add #if __AVX2__ branch in cumulative_sum_add for consistency
with the runtime-dispatch path
Summary
Adds an x86-specific implementation of
CumulativeSumthat replaces the scalar prefix-sum loop on inner-dim scans with an AVX2 8-lane Kogge-Stone helper plus AVX / SSE fallbacks in the main x86 source. On the refreshed benchmark matrix this yields 1.75×–2.46× single-thread speedup and 1.46×–2.00× 8-thread speedup across the four Stage 2 demos. The x86 layer now covers all valid(dims, axis)cases; outer-axis scans use a simple SIMDcur += prevhelper instead of falling back to the native implementation.Motivation
CumulativeSum::forward_inplaceinsrc/layer/cumulativesum.cppwalks data with an inner serial recurrence:Of the 6
(dims, axis)cases handled by the base, three have the scan along the inner contiguous dimension and suffer from this serial dependency:dims == 1dims == 2, axis == 1dims == 3, axis == 2The other three cases (
dims=2 axis=0,dims=3 axis=0/1) scan across the outer dimension, so each step updates an independent row/channel slice withcur += prev. These paths are memory-bandwidth-bound, but the x86 implementation still handles them with a small SIMD add helper so all valid cases stay inside the x86 layer.Algorithm
Inner row of
wfloats, in-place SIMD prefix sum using a classic Kogge-Stone tree on 8 lanes:_mm256_slli_si256by 4B), then add.After stages 1–2 each half of the register holds the prefix sum of its 4 lanes.
_mm256_permute2f128_ps(v, v, 0x08) + _mm256_shuffle_ps(_, _, 0xff), then add. The full 8-lane prefix is now inv.Per-lane top-to-bottom view of one 8-lane tile (lanes 0..7 =
a..h). Each column is one Kogge-Stone stage; values inside a stage are computed in parallel.Tail (<8 lanes) is finished with a scalar accumulator. The SSE2 path uses the same structure on 4 lanes with just stages 1 and 2; the AVX path stitches two 128-bit prefix sums into one 256-bit tile when
__AVX__is available but__AVX2__is not.We intentionally do not extend this to a 16-lane AVX-512 version: that requires a 4th stage plus 3 cross-lane permutes, lengthening the serial dependency chain faster than the lane count grows. Empirically the 8-lane AVX2 path already saturates the single-core ALU throughput on Zen5.
Dispatch
The x86 layer handles all six valid
(dims, axis)cases.support_packingis left at its inherited default so the packing machinery auto-unpacks topack1beforeforward_inplace.AVX2 dispatch:
cumulativesum_x86.cppcontains the x86 layer and compile-time AVX2 / AVX / SSE kernels.cumulativesum_x86.hdeclares both the x86 layer class and the AVX2 helper interface, matching existing x86 patterns that avoid extra ISA-only headers for small helper shims.cumulativesum_x86_avx2.cppcontains out-of-line AVX2 helpers compiled byncnn_add_arch_opt_source()with-mavx2or/arch:AVX2.forward_inplace()call only from generatedfma/avxvariants whencpu_support_x86_avx2()is true. This fixes the common AVX2-only target where the layer creator resolves to thefma/avxvariant, where__AVX2__is otherwise false, while keeping compile-time AVX2 code in the main x86 source for fixed-ISA builds.Multi-threading:
dims == 2, axis == 1:#pragma omp parallel forover rows.dims == 3, axis == 2: flattened#pragma omp parallel forover(channel, row)— same parallelism ascollapse(2), but compatible with MSVC OpenMP.Correctness
tests/test_cumulativesum.cppis extended with boundary cases that exercise every edge of the SIMD tiling:axis=1widths 1, 3, 8, 16, 17 with 5 rows.axis=2widths 1, 3, 8, 16, 17 with 5 rows × 3 channels — verifies the flattened(channel, row)parallel region.All 22 cases pass on the SIMD build.
Performance
Environment: Linux / WSL2, AMD Ryzen 7 9800X3D (Zen5, full AVX-512 family), g++,
-O3 -DNDEBUG, AVX/AVX2/FMA/AVX-512 all enabled. Measurements usebenchncnnwithloop=100,cooldown=0,taskset -c 0-7; reported as theminmetric (most stable at sub-millisecond workloads).Baseline build is rebuilt in a detached source tree with
src/layer/x86/cumulativesum_x86.{h,cpp}removed so the base scalar implementation is used.Benchmark matrix
Four demo graphs, each stacking 3
CumulativeSumlayers to amortize benchmark harness noise:cumsum_1d_demo[65536]cumsum_2d_axis1_demo[512, 512]cumsum_axis2_demo[256, 256, 32]cumsum_demo[256, 256, 32]All numbers are
minmilliseconds over 100 iterations on cores 0-7.Interpretation
axis=2demo at 2.46× (1T) and 1.89× (8T).cumulativesum_x86.cppcaused generatedavx512variants on the benchmark machine to bypass the dedicated helper and regress noticeably. Routing all__AVX__variants back through the out-of-line AVX2 helper recovered the earlier benchmark envelope.axis=2layer still carries the largest win, while the first two layers benefit from the x86 SIMD add helper.No regressions
All other layers and networks are unaffected — the new class only overrides
forward_inplaceforCumulativeSum, preserves the existing in-place API, and falls back to the base only for unexpected invalid cases.