GH-47572: [C++][Parquet] Uniform unpack interface #47573
The results for this are not good.
I'm testing performance of this PR locally on an AMD Zen 2 CPU on Ubuntu, and I'm seeing large regressions in unpack32 performance.
A quick experiment shows that much of the original performance can be regained by replacing the array of function pointers with a switch statement.
@pitrou what about the Read/Decoder benchmarks? The algorithms are mostly similar, except:
I'm gonna investigate, but I suspect these benchmarks are highly unstable.
pitrou left a comment:
Here is a first round of comments (I haven't looked at all changes).
```cpp
int unpack_width(const uint8_t* in, UnpackedUInt* out, int batch_size) {
  using UnpackerForWidth = Unpacker<UnpackedUInt, kPackedBitWidth>;

  constexpr auto kValuesUnpacked = UnpackerForWidth::kValuesUnpacked;
```
Should we `ARROW_DCHECK_EQ(batch_size % kValuesUnpacked, 0)`?
I think the currently implicit contract of these functions is to unpack *up to* `batch_size` values, but possibly fewer.
At least, that's how they are called. Best would be to have the epilogue in these functions so that they extract exactly `batch_size` values.
```
*test* \
*_generated.h \
*-benchmark.cc \
*_codegen.py \
```
I'm surprised, does Doxygen actually look at `*.py` files?
I was too, but it was giving me trouble with it.
`cpp/src/arrow/CMakeLists.txt` (outdated):

```
util/bitmap_ops.cc
util/bpacking.cc
util/bpacking_scalar.cc
util/bpacking_simd_min.cc
```
What is "min" for?
It was either Neon or SSE4.2 (the latter was removed). I could rename it `_neon`, but that would not highlight that it is not compiled any differently than the regular files.
```cpp
ARROW_EXPORT int unpack_scalar(const uint8_t* in, Uint* out, int batch_size,
                               int num_bits);

extern template ARROW_TEMPLATE_EXPORT int unpack_scalar<uint8_t>(const uint8_t*, uint8_t*,
```
I don't think ARROW_TEMPLATE_EXPORT is useful, because this probably wouldn't be called across DLL boundaries (or would it?).
Yes, only the benchmarks and tests need it. Any suggestion for how we might change that?
Ah. Well, we can keep it like this for now if that doesn't degrade performance...
@ursabot please benchmark lang=C++
Benchmark runs are scheduled for commit dd1d942. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
Indeed, the picture is different when looking at the Read/Decode benchmarks, where there isn't much of a difference between the array of function pointers and the switch/case statement. Here is the diff between this PR and git main: https://gist.github.com/pitrou/39f2c7ea2faec482c4d74a26e1c98f06
I've reverted to a
@github-actions crossbow submit -g cpp |
Revision: 38aa052. Submitted crossbow builds: ursacomputing/crossbow @ actions-beeb73de91
pitrou left a comment:
CI failures are unrelated, I'll merge.
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 7a38744. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 17 possible false positives for unstable benchmarks that are known to sometimes produce them.
### Rationale for this change

#47573 broke the Windows ARM64 MSVC build by trying to compile xsimd even when `ARROW_SIMD_LEVEL` is set to `NONE`.

### What changes are included in this PR?

Make sure that xsimd is only included when SIMD is requested.

### Are these changes tested?

Yes, in the Windows ARM64 MSVC build pipeline that I was developing on #47811 when I noticed the change.

### Are there any user-facing changes?

No.

* GitHub Issue: #47909

Authored-by: Jonathan Giannuzzi <jonathan@giannuzzi.me>
Signed-off-by: Antoine Pitrou <antoine@python.org>
### Rationale for this change

Refactor the unpack function code generator and dispatcher to accommodate more use cases:

- ~`uint16_t`~ and `uint64_t` new sizes
- ~A *dispatch once* function returning a function pointer to the correct bit width~

*Update 2025-10-13*: the dispatch once and `uint16_t` implementations were removed as they turned out to be slower.

### What changes are included in this PR?

The diffs are hard to look at, but the important files are:

- The two Python files for code generation accommodate new sizes (~with the exception of `uint16` on Avx512, for which the algorithm's assumptions break~);
- The two code generators have a uniform structure for the "batch unpackers" they generate: each one of them is a specialization of a struct template, e.g. `unpack29_32` -> `Unpacker<uint32_t, 29>::unpack`
  - Using specialization instead of hard-coded numbers in function names makes it possible to use them in more generic code;
  - Wrapping the functions in a struct makes it possible to carry information along with the function (such as the number of values that said function unpacks) and leaves the door open for partial specialization for future improvements.
- The public functions in `bpacking_internal.h` are also template specializations: `unpack32` -> `unpack<uint32_t>`
- The large `switch` statements and for loops used to dispatch the bit width to its appropriate implementation are now all generic, with a `constexpr` generated jump table. The logic is in `bpacking_dispatch_internal.h`

From a performance perspective:

- There is no improvement to the individual generated "batch unpackers";
- The SIMD code is actually doing worse than scalar on SSE4.2 or `uint16_t`;
- There are new sizes that can bring improvements:
  - `unpack<uint64_t>` has a SIMD implementation that should benefit `DeltaBitPackDecoder`;
  - ~Novel `unpack<uint16_t>` should benefit the level decoders;~
  - ~The dispatch once mechanism should benefit all repeated calls to unpack functions (still needs mixing with dynamic dispatch, but see `get_unpack_fn` for the building block).~

*Update 2025-10-13*:

- Added an unpack implementation for `uint8_t` and `uint16_t` that calls the `uint32_t` version on a local buffer, combined with a `static_cast` loop (what was done at the call site before).
- Performance should remain on par with the previous implementations. The PR as it now stands mainly changes the interface of the unpack functions for future iterations and cleaner use.

### Are these changes tested?

Very much.

### Are there any user-facing changes?

* GitHub Issue: apache#47572

Lead-authored-by: AntoinePrv <AntoinePrv@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>