GH-47572: [C++][Parquet] Uniform unpack interface #47573
The results for this are not good.
I'm testing performance of this PR locally on an AMD Zen 2 CPU on Ubuntu, and I'm seeing large regressions in unpack32 performance.
A quick experiment shows that much of the original performance can be regained by replacing the array of function pointers with a switch statement.
@pitrou what about the Read/Decoder benchmarks? The algorithms are mostly similar, except:
I'm gonna investigate, but I suspect these benchmarks are highly unstable.
pitrou left a comment:
Here is a first round of comments (I haven't looked at all changes).
```cpp
int unpack_width(const uint8_t* in, UnpackedUInt* out, int batch_size) {
  using UnpackerForWidth = Unpacker<UnpackedUInt, kPackedBitWidth>;

  constexpr auto kValuesUnpacked = UnpackerForWidth::kValuesUnpacked;
```
Should we `ARROW_DCHECK_EQ(batch_size % kValuesUnpacked, 0)`?
I think the currently implicit contract of these functions is to unpack *up to* `batch_size` values, but possibly fewer.
At least, that's how they are called. Best would be to have the epilogue in these functions so that they extract exactly `batch_size` values.
```
*test* \
*_generated.h \
*-benchmark.cc \
*_codegen.py \
```
I'm surprised, does Doxygen actually look at `*.py` files?
I was too, but it was giving me trouble with it.
`cpp/src/arrow/CMakeLists.txt` (outdated):

```
util/bitmap_ops.cc
util/bpacking.cc
util/bpacking_scalar.cc
util/bpacking_simd_min.cc
```
What is "min" for?
It was either Neon or SSE4.2 (the latter was removed). I could rename it `_neon`, but that would not highlight that it is not compiled any differently than the regular files.
```cpp
ARROW_EXPORT int unpack_scalar(const uint8_t* in, Uint* out, int batch_size,
                               int num_bits);

extern template ARROW_TEMPLATE_EXPORT int unpack_scalar<uint8_t>(const uint8_t*, uint8_t*,
```
I don't think ARROW_TEMPLATE_EXPORT is useful, because this probably wouldn't be called across DLL boundaries (or would it?).
Yes, only the benchmarks and tests need it. Any suggestion for how we might change that?
Ah. Well, we can keep it like this for now if that doesn't degrade performance...
@ursabot please benchmark lang=C++
Benchmark runs are scheduled for commit dd1d942. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
Indeed, the picture is different when looking at the Read/Decode benchmarks, where there isn't much of a difference between the array of function pointers and the switch/case statement. Here is the diff between this PR and git main: https://gist.github.com/pitrou/39f2c7ea2faec482c4d74a26e1c98f06
I've reverted to a
@github-actions crossbow submit -g cpp |
Revision: 38aa052. Submitted crossbow builds: ursacomputing/crossbow @ actions-beeb73de91
pitrou left a comment:
CI failures are unrelated, I'll merge.
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 7a38744. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 17 possible false positives for unstable benchmarks that are known to sometimes produce them.
### Rationale for this change

#47573 broke the Windows ARM64 MSVC build by trying to compile xsimd even when `ARROW_SIMD_LEVEL` is set to `NONE`.

### What changes are included in this PR?

Make sure that xsimd is only included when SIMD is requested.

### Are these changes tested?

Yes, in the Windows ARM64 MSVC build pipeline that I was developing on #47811 when I noticed the change.

### Are there any user-facing changes?

No.

* GitHub Issue: #47909

Authored-by: Jonathan Giannuzzi <jonathan@giannuzzi.me>
Signed-off-by: Antoine Pitrou <antoine@python.org>
### Rationale for this change

Refactor the unpack function code generator and dispatcher to accommodate more use cases:

- ~`uint16_t`~ and `uint64_t` new sizes
- ~A *dispatch once* function returning a function pointer to the correct bit width~

*Update 2025-10-13*: the dispatch once and `uint16_t` implementations were removed as they turned out to be slower.

### What changes are included in this PR?

The diffs are hard to look at, but the important files are:

- The two Python files for code generation accommodate new sizes (~with the exception of `uint16` on Avx512, for which the algorithm's assumptions break~);
- The two code generators have a uniform structure for the "batch unpackers" they generate: each one of them is a specialization of a struct template, e.g. `unpack29_32` -> `Unpacker<uint32_t, 29>::unpack`
  - Using specialization instead of hard-coded numbers in function names makes it possible to use them in more generic code;
  - Wrapping the functions in a struct makes it possible to carry information along with the function (such as the number of values that said function unpacks) and leaves the door open for partial specialization for future improvements.
- The public functions in `bpacking_internal.h` are also template specializations: `unpack32` -> `unpack<uint32_t>`
- The large `switch` statements and for loops used to dispatch the bit width to its appropriate implementation are now all generic, with a `constexpr` generated jump table. The logic is in `bpacking_dispatch_internal.h`

From a performance perspective:

- There is no improvement to the individual generated "batch unpackers";
- The SIMD code is actually doing worse than scalar on SSE4.2 or `uint16_t`;
- There are new sizes that can bring improvements:
  - `unpack<uint64_t>` has a SIMD implementation that should benefit `DeltaBitPackDecoder`;
  - ~Novel `unpack<uint16_t>` should benefit the level decoders;~
  - ~The dispatch once mechanism should benefit all repeated calls to unpack functions (still needs mixing with dynamic dispatch, but see `get_unpack_fn` for the building block).~

*Update 2025-10-13*:

- Added an unpack implementation for `uint8_t` and `uint16_t` that calls the `uint32_t` version on a local buffer, combined with a `static_cast` loop (what was done at the call site before).
- Performance should remain on par with the previous implementations. The PR as it now stands mainly changes the interface of the unpack functions for future iterations and cleaner use.

### Are these changes tested?

Very much.

### Are there any user-facing changes?

* GitHub Issue: apache#47572

Lead-authored-by: AntoinePrv <AntoinePrv@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>