Skip to content

x86: optimize CumulativeSum with SIMD Kogge-Stone prefix scan#6693

Open
crafcat7 wants to merge 9 commits into
Tencent:masterfrom
crafcat7:feat/x86-cumulativesum
Open

x86: optimize CumulativeSum with SIMD Kogge-Stone prefix scan#6693
crafcat7 wants to merge 9 commits into
Tencent:masterfrom
crafcat7:feat/x86-cumulativesum

Conversation

@crafcat7

@crafcat7 crafcat7 commented Apr 24, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds an x86-specific implementation of CumulativeSum that replaces the scalar prefix-sum loop on inner-dim scans with an AVX2 8-lane Kogge-Stone helper plus AVX / SSE fallbacks in the main x86 source. On the refreshed benchmark matrix this yields 1.75×–2.46× single-thread speedup and 1.46×–2.00× 8-thread speedup across the four Stage 2 demos. The x86 layer now covers all valid (dims, axis) cases; outer-axis scans use a simple SIMD cur += prev helper instead of falling back to the native implementation.

Motivation

CumulativeSum::forward_inplace in src/layer/cumulativesum.cpp walks data with an inner serial recurrence:

for (int k = 1; k < w; k++)
    ptr[k] = ptr[k] + ptr[k - 1];  // RAW dependency → cannot auto-vectorize

Of the 6 (dims, axis) cases handled by the base, three have the scan along the inner contiguous dimension and suffer from this serial dependency:

  • dims == 1
  • dims == 2, axis == 1
  • dims == 3, axis == 2

The other three cases (dims=2 axis=0, dims=3 axis=0/1) scan across the outer dimension, so each step updates an independent row/channel slice with cur += prev. These paths are memory-bandwidth-bound, but the x86 implementation still handles them with a small SIMD add helper so all valid cases stay inside the x86 layer.

Algorithm

Inner row of w floats, in-place SIMD prefix sum using a classic Kogge-Stone tree on 8 lanes:

  1. Stage 1 — shift-by-1 within each 128-bit half (_mm256_slli_si256 by 4B), then add.
  2. Stage 2 — shift-by-2 within each 128-bit half (8B), then add.
    After stages 1–2 each half of the register holds the prefix sum of its 4 lanes.
  3. Stage 3 — broadcast the last lane of the low half into all 4 lanes of the high half via _mm256_permute2f128_ps(v, v, 0x08) + _mm256_shuffle_ps(_, _, 0xff), then add. The full 8-lane prefix is now in v.
  4. Running base — add a broadcast of the previous tile's last lane, then update the base with the new last lane for the next tile.

Per-lane top-to-bottom view of one 8-lane tile (lanes 0..7 = a..h). Each column is one Kogge-Stone stage; values inside a stage are computed in parallel.

 lane  | input |  stage 1   |   stage 2    |     stage 3      |  + running_base
       |       | shift-1 +  | shift-2 +    | broadcast low.3  |
       |       | add (half) | add (half)   | → high + add     |
-------+-------+------------+--------------+------------------+--------------------
   0   |   a   |     a      |      a       |        a         |  a + base
   1   |   b   |    a+b     |     a+b      |       a+b        |  a+b + base
   2   |   c   |    b+c     |    a+b+c     |      a+b+c       |  a+b+c + base
   3   |   d   |    c+d     |   a+b+c+d    |     a+b+c+d      |  a+b+c+d + base
   ----  128-bit lane boundary  ------------------------------+--------------------
   4   |   e   |     e      |      e       |    a+b+c+d + e   |  ...+e + base
   5   |   f   |    e+f     |     e+f      |   a+b+c+d + e+f  |  ...+ef + base
   6   |   g   |    f+g     |    e+f+g     |  a+b+c+d + e+f+g |  ...+efg + base
   7   |   h   |    g+h     |   e+f+g+h    | a+b+c+d + e+f+g+h|  ...+efgh + base
                                                                       │
                                              base for next tile  ◄────┘
                                              (broadcast of lane 7)

Tail (<8 lanes) is finished with a scalar accumulator. The SSE2 path uses the same structure on 4 lanes with just stages 1 and 2; the AVX path stitches two 128-bit prefix sums into one 256-bit tile when __AVX__ is available but __AVX2__ is not.

We intentionally do not extend this to a 16-lane AVX-512 version: that requires a 4th stage plus 3 cross-lane permutes, lengthening the serial dependency chain faster than the lane count grows. Empirically the 8-lane AVX2 path already saturates the single-core ALU throughput on Zen5.

Dispatch

The x86 layer handles all six valid (dims, axis) cases. support_packing is left at its inherited default so the packing machinery auto-unpacks to pack1 before forward_inplace.

AVX2 dispatch:

  • cumulativesum_x86.cpp contains the x86 layer and compile-time AVX2 / AVX / SSE kernels.
  • cumulativesum_x86.h declares both the x86 layer class and the AVX2 helper interface, matching existing x86 patterns that avoid extra ISA-only headers for small helper shims.
  • cumulativesum_x86_avx2.cpp contains out-of-line AVX2 helpers compiled by ncnn_add_arch_opt_source() with -mavx2 or /arch:AVX2.
  • Runtime CPU builds select the AVX2 helper once per forward_inplace() call only from generated fma / avx variants when cpu_support_x86_avx2() is true. This fixes the common AVX2-only target where the layer creator resolves to the fma / avx variant, where __AVX2__ is otherwise false, while keeping compile-time AVX2 code in the main x86 source for fixed-ISA builds.

Multi-threading:

  • dims == 2, axis == 1: #pragma omp parallel for over rows.
  • dims == 3, axis == 2: flattened #pragma omp parallel for over (channel, row) — same parallelism as collapse(2), but compatible with MSVC OpenMP.

Correctness

ctest --test-dir build -R test_cumulativesum --output-on-failure
1/1 Test #44: test_cumulativesum ............... Passed

tests/test_cumulativesum.cpp is extended with boundary cases that exercise every edge of the SIMD tiling:

  • 1D lengths 1, 2, 3, 4, 5, 7, 8, 9, 15, 16, 17, 32 — covers tail-only, exact vector, partial-tail, and the first tile past the running-base propagation.
  • 2D axis=1 widths 1, 3, 8, 16, 17 with 5 rows.
  • 3D axis=2 widths 1, 3, 8, 16, 17 with 5 rows × 3 channels — verifies the flattened (channel, row) parallel region.

All 22 cases pass on the SIMD build.

Performance

Environment: Linux / WSL2, AMD Ryzen 7 9800X3D (Zen5, full AVX-512 family), g++, -O3 -DNDEBUG, AVX/AVX2/FMA/AVX-512 all enabled. Measurements use benchncnn with loop=100, cooldown=0, taskset -c 0-7; reported as the min metric (most stable at sub-millisecond workloads).

Baseline build is rebuilt in a detached source tree with src/layer/x86/cumulativesum_x86.{h,cpp} removed so the base scalar implementation is used.

Benchmark matrix

Four demo graphs, each stacking 3 CumulativeSum layers to amortize benchmark harness noise:

Demo Shape Axis path 1T base 1T opt 1T × 8T base 8T opt 8T ×
cumsum_1d_demo [65536] dims=1 0.07 0.04 1.75× 0.07 0.04 1.75×
cumsum_2d_axis1_demo [512, 512] dims=2 axis=1 0.29 0.13 2.23× 0.06 0.03 2.00×
cumsum_axis2_demo [256, 256, 32] dims=3 axis=2 2.24 0.91 2.46× 0.53 0.28 1.89×
cumsum_demo [256, 256, 32] mixed axis=0→1→2 1.05 0.56 1.88× 0.41 0.28 1.46×

All numbers are min milliseconds over 100 iterations on cores 0-7.

Interpretation

  • RAW-dependency paths (dims=1, dims=2 axis=1, dims=3 axis=2): the scalar recurrence is still the core bottleneck, and the dedicated AVX2 helper restores the stronger performance profile after the temporary AVX-main-path regression. The best current result is the pure axis=2 demo at 2.46× (1T) and 1.89× (8T).
  • Regression note: inlining the AVX2 prefix kernel into cumulativesum_x86.cpp caused generated avx512 variants on the benchmark machine to bypass the dedicated helper and regress noticeably. Routing all __AVX__ variants back through the out-of-line AVX2 helper recovered the earlier benchmark envelope.
  • Mixed demo (1.88× at 1T, 1.46× at 8T): the final axis=2 layer still carries the largest win, while the first two layers benefit from the x86 SIMD add helper.

No regressions

All other layers and networks are unaffected — the new class only overrides forward_inplace for CumulativeSum, preserves the existing in-place API, and falls back to the base only for unexpected invalid cases.

Summary:
  Add x86 SIMD fast-path for CumulativeSum targeting the serial-scan
  axes (dims=1, dims=2 axis=1, dims=3 axis=2) where the base scalar
  code carries a true prefix-sum data dependency and the compiler
  cannot auto-vectorize. The kernel performs an in-register Kogge-Stone
  scan (AVX2 8-lane / SSE2 4-lane) with a running tile base, turning
  N scalar adds into log2(vec) SIMD adds per tile. Bandwidth-bound
  axes with no inner dependency are left to the base implementation
  since the compiler already auto-vectorizes them at memory bandwidth.

Changes:
  1. Add src/layer/x86/cumulativesum_x86.h declaring CumulativeSum_x86
  2. Implement prefix_sum_row() with AVX2 8-wide Kogge-Stone scan (3 stages + cross-128 propagation) and SSE2 4-wide fallback
  3. Route dims=1 / dims=2 axis=1 / dims=3 axis=2 to the SIMD scan path; dims=3 axis=2 runs in parallel over (channel, row) via OpenMP collapse(2)
  4. Fall back to CumulativeSum::forward_inplace for non-pack1, non-fp32, and bandwidth-bound axes
Summary:
  Add dedicated boundary test cases that exercise w at, just below, and
  just above common SIMD vector widths (4 / 8 / 16). The cases cover the
  single-tile no-tail path, the running base propagation across multiple
  tiles, and the scalar tail seam (sum carried via ptr[j-1]). The existing
  tests only hit these paths incidentally; the new cases make the coverage
  explicit and remain valid for any vectorized backend.

Changes:
  1. Add test_cumulativesum_boundary() covering dims=1/2/3 with w in {7, 8, 9, 15, 16, 17, 32}
  2. Include axis=1 (row-parallel) and axis=2 (collapse(2) channel+row) variants to validate the OpenMP parallel paths
  3. Wire the new function into main() alongside the existing 1d/2d/3d groups
@nihui

nihui commented Apr 27, 2026

Copy link
Copy Markdown
Member

@codex review

@codecov-commenter

codecov-commenter commented Apr 27, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 99.49495% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 95.70%. Comparing base (104edd7) to head (914c52e).
⚠️ Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
src/layer/x86/cumulativesum_x86.cpp 99.02% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6693      +/-   ##
==========================================
+ Coverage   95.69%   95.70%   +0.01%     
==========================================
  Files         944      947       +3     
  Lines      410509   409756     -753     
==========================================
- Hits       392834   392160     -674     
+ Misses      17675    17596      -79     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e4f8e403ac

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/layer/x86/cumulativesum_x86.cpp Outdated
{
int j = 0;

#if __AVX2__

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enable AVX2 kernel in runtime-dispatch builds

The AVX2 fast path is guarded by #if __AVX2__, but this layer only adds cumulativesum_x86.cpp and no cumulativesum_x86_avx2.cpp helper. In the default NCNN_RUNTIME_CPU=ON x86 build, this source is instantiated as avx/fma variants (see cmake/ncnn_add_layer.cmake flags -mavx and -mavx -mfma), so __AVX2__ is false on AVX2/FMA CPUs without AVX512 and the code falls back to SSE2/scalar. That means the new 8-lane kernel is effectively unavailable for the most common AVX2-only runtime target.

Useful? React with 👍 / 👎.

Summary:
  Add an x86-specific CumulativeSum implementation that accelerates inner-axis prefix scans with AVX and an out-of-line AVX2 Kogge-Stone helper, while covering all valid base-layer dims/axis cases directly. This also replaces the tail base-class bounce with an explicit invalid-input return path and adds boundary tests for SIMD tile edges.

Changes:
  1. Add x86 cumulativesum AVX2 helper source and runtime dispatch from AVX-capable variants.
  2. Implement SIMD prefix-scan and outer-axis accumulation paths for all valid 1D, 2D, and 3D cases in the x86 layer.
  3. Extend cumulativesum tests with boundary widths to validate vector tails and running-base propagation.
@crafcat7

crafcat7 commented Apr 28, 2026

Copy link
Copy Markdown
Contributor Author

I've revised my submission and updated the current PR information. The main changes are as follows:

  • Added x86-optimized implementation for CumulativeSum
  1. Covered all valid (dims, axis) scenarios in the base layer in src/layer/x86/cumulativesum_x86.cpp
  2. Used SIMD paths for inner-axis prefix scan
  3. Used SIMD cur += prev paths for outer-axis accumulation
  • Added a separate AVX2 helper
  1. src/layer/x86/cumulativesum_x86_avx2.cpp
  2. Handled stronger ISA paths with out-of-line AVX2 Kogge-Stone prefix scan
  3. Upgraded to AVX2 helper via runtime dispatch in the AVX-capable x86 variant
  • Corrected the behavior boundaries of the x86 layer
  1. No longer falls back to the base class for repeated execution
  2. Directly returns for uncovered illegal input -100, consistent with the semantics of the base class's missed branch
  • Supplementary tests
  1. Extend tests/test_cumulativesum.cpp
  2. Add boundary width tests to cover SIMD tile, tail, and running-base propagation scenarios

Summary:

  Move the CumulativeSum x86 SIMD implementation and AVX2 runtime dispatch into a shared header included by both the layer wrapper and the AVX2 wrapper. This matches the existing x86 packed helper structure while preserving optimized prefix scan behavior.

Changes:

  1. Add a shared CumulativeSum x86 packed helper header with prefix/add kernels and forwarding logic

  2. Simplify the x86 layer implementation to call the shared forward path

  3. Export an AVX2 wrapper translation unit for runtime CPU dispatch
@crafcat7

Copy link
Copy Markdown
Contributor Author

I've restructured the code, splitting it into:

  • cumulativesum_x86_packed.h -> Contains the main implementation

  • cumulativesum_x86.h -> External interface

  • cumulativesum_x86.cpp -> Forward

  • cumulativesum_x86_avx2.cpp -> Forward

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an x86-specific CumulativeSum implementation to accelerate inner-dimension prefix scans using SIMD (AVX2 Kogge-Stone prefix scan with AVX/SSE fallbacks) while keeping all valid (dims, axis) cases handled within the x86 layer and extending test coverage to boundary sizes that stress SIMD tiling/tails.

Changes:

  • Introduces CumulativeSum_x86 and an x86 implementation of forward_inplace() that covers all (dims, axis) cases.
  • Adds a SIMD prefix-sum kernel for contiguous inner-dimension scans plus a SIMD “cur += prev” helper for outer-axis scans, including runtime AVX2 dispatch for __AVX__ && !__AVX2__ builds.
  • Extends test_cumulativesum with boundary cases targeting vector-width edges and tail handling.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
tests/test_cumulativesum.cpp Adds boundary-focused test cases for 1D/2D/3D shapes to exercise SIMD tiling and tails.
src/layer/x86/cumulativesum_x86.h Declares the x86-specific CumulativeSum_x86 layer override.
src/layer/x86/cumulativesum_x86.cpp Implements CumulativeSum_x86::forward_inplace() and routes to the packed SIMD implementation.
src/layer/x86/cumulativesum_x86_packed.h Provides the SIMD prefix-sum (AVX2/AVX/SSE2) and add helpers plus the main x86 forward implementation with runtime AVX2 dispatch.
src/layer/x86/cumulativesum_x86_avx2.cpp Defines the out-of-line AVX2 entry point used by runtime dispatch builds.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…rd path

Summary:
  Restructure cumulativesum_x86 to follow the established x86 layer
  pattern (e.g. deconvolution, interp, cast): normal implementation in
  x86.cpp, ISA-specific variant in packed.h + avx2.cpp wrapper.

Changes:
  1. Move normal forward_inplace logic and SSE/AVX helpers into
     cumulativesum_x86.cpp
  2. Rename packed.h helpers to _avx2 suffix and guard with #if __AVX2__
  3. Add runtime AVX2 dispatch in forward_inplace via
     cumulative_sum_forward_inplace_avx2 free function
  4. Update cumulativesum_x86_avx2.cpp wrapper to call _impl variant

nihui commented May 25, 2026

Copy link
Copy Markdown
Member

Thanks for working on this optimization.

I think the AVX2 dispatch structure should be adjusted to better match the existing ncnn x86 style. The current implementation dispatches at the whole forward_inplace() level and duplicates the full (dims, axis) control flow in cumulativesum_x86_packed.h. That makes the layer logic harder to maintain and does not feel consistent with the usual pattern used by other x86 helpers.

I would prefer keeping CumulativeSum_x86::forward_inplace() as a single implementation, and doing the ISA dispatch at the small kernel-helper level instead, for example around cumulative_sum_prefix_sum_row() and cumulative_sum_add(). The AVX2 prefix-scan kernel can live in the same private helper header and be included by both cumulativesum_x86.cpp and cumulativesum_x86_avx2.cpp, with the _avx2.cpp file only providing the out-of-line wrappers needed by runtime CPU dispatch.

Something along these lines would be cleaner:

  • keep only one copy of the dims/axis forward logic
  • dispatch to AVX2 inside cumulative_sum_prefix_sum_row() rather than redirecting the whole layer
  • keep the AVX2 Kogge-Stone implementation in the same helper .h as the SSE/AVX fallback kernels
  • make cumulativesum_x86_avx2.cpp a thin wrapper around those helper kernels
  • avoid the packed naming here, since this layer is still pack1 and does not enable packing layout

There is also a possible behavior issue with the current structure: in compile-time __AVX2__ builds, forward_inplace() does not take the runtime-dispatch branch, and the local cumulative_sum_prefix_sum_row() appears to use the generic __AVX__ path instead of the AVX2 Kogge-Stone helper. So the fixed-ISA AVX2/AVX512 variants may not actually use the intended AVX2 prefix-scan kernel.

The optimization itself looks reasonable, but I think the dispatch should be moved down to the helper functions before merging.

crafcat7 added 2 commits May 25, 2026 23:28
Summary:
  Refactor the x86 CumulativeSum AVX2 runtime dispatch to follow the
  standard ncnn helper pattern. Previously the dispatch was at the whole
  forward_inplace() level, duplicating the full dims/axis control flow.
  This also fixes a bug where compile-time __AVX2__ builds never used
  the Kogge-Stone prefix-scan kernel.

Changes:
  1. Rename cumulativesum_x86_packed.h to cumulativesum_x86_helper.h
  2. Move AVX2 runtime dispatch into cumulative_sum_prefix_sum_row() and cumulative_sum_add()
  3. Remove duplicated forward_inplace logic from the helper header
  4. Make cumulativesum_x86_avx2.cpp a thin wrapper exposing only the two kernel functions
  5. Keep a single copy of the dims/axis forward logic in cumulativesum_x86.cpp
Summary:
  The base cumulativesum.cpp was updated to support 4D matrices but the
  x86 optimized implementation was not, causing test_cumulativesum to
  fail in CI when the x86 layer returned -100 for dims==4 inputs.

Changes:
  1. Add dims==4 axis==0/1/2/3 handling to CumulativeSum_x86::forward_inplace()
  2. Reuse existing cumulative_sum_add and cumulative_sum_prefix_sum_row helpers

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dbf3b5a774

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +12 to +15
__m128 t = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(v), 4));
v = _mm_add_ps(v, t);
t = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(v), 8));
v = _mm_add_ps(v, t);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve left-to-right cumulative sums

For any inner-axis scan that reaches the SIMD path, this Kogge-Stone helper changes the floating-point association from the existing scalar recurrence (ptr[k] = ptr[k] + ptr[k - 1]). Inputs with cancellation or a large dynamic range can therefore produce very different prefixes; for example [1e20f, -1e20f, 1.f] should leave the third prefix as 1 under the generic layer's left-to-right order, while the tree computes (-1e20f + 1.f) + 1e20f and yields 0. The new tests use small random values, so this x86/generic divergence is not covered.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b2fe54205a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

int j = 0;
float sum = 0.f;

#if __AVX__

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Call the AVX2 prefix helper in fixed-ISA builds

In fixed-ISA x86 builds (NCNN_RUNTIME_CPU=OFF with NCNN_AVX2=ON), this source is compiled with AVX2 enabled, so __AVX2__ is true and the runtime-dispatch block above is skipped; execution then falls through to this generic __AVX__ loop, leaving the newly added cumulative_sum_prefix_sum_row_avx2_impl() dead. Fresh evidence compared with the prior AVX2-dispatch comment is that the helper now exists, but it is still never selected when the main x86 source itself is built for AVX2, so those non-runtime AVX2 builds miss the intended Kogge-Stone fast path.

Useful? React with 👍 / 👎.

Summary:
  Add #if __AVX2__ direct-call branches so that when the x86 source
  itself is compiled with AVX2 enabled (NCNN_RUNTIME_CPU=OFF), the
  Kogge-Stone prefix-sum fast path is actually selected instead of
  falling through to the slower split-128-bit AVX loop.

Changes:
  1. Add #if __AVX2__ branch in cumulative_sum_prefix_sum_row to call
     cumulative_sum_prefix_sum_row_avx2_impl directly
  2. Add #if __AVX2__ branch in cumulative_sum_add for consistency
     with the runtime-dispatch path
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants