-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve Blosc compression operator #2592
improve Blosc compression operator #2592
Conversation
Thank you for fixing this compressor. Unfortunately, you caught us with the docs, there is no section on operators/compressors. |
@psychocoderHPC Can you please extend the tests in |
@psychocoderHPC thanks, that was long overdue, only bzip2 was doing that and the hope was to migrate to blosc2 eventually as in this issue. @pnorbert there is almost nothing in the docs on compression operations since each engine supports its own set. |
ae5dc35
to
7327660
Compare
I updated the blosc tests and added test combinations for |
void *dataOut, const size_t sizeOut, | ||
Params &info) const; | ||
|
||
class __attribute__((packed)) DataHeader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pnorbert I fixed hopefully the ubsan
issue in the CI by disallowing the compiler to add padding.
The disadvantage is that there is no standard attribute in C++ :-(
Is this way ok? If not, do you have a suggestion on how I can fix this issue instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it helps, there is an internal helper function to add padding in a portable way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this helps. The problem is that maybe the ptr for the output is not aligned during the compression. If I use the helper and shift my data a few bytes and later during decompress I get the data within an array with a different base address. If so it would be not possible to find the data because the required padding is different.
If I understand the ubsan error for the old code correctly and what I found in the net the packed attribute is required to guarantee that the struct has the same layout even if you compile blosc on a different system. This attributes forbids the compiler to transform the struct which is IMO correct.
I can also remove the header class und work with two pure integers but I found it better readable to use the header class because it allows the code to shift with sizeof(Header)
instead of using a magic number 8byte.
I am not sure which solution guarantees portability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it is enough and better if I align the struct to 4 byte.
Is ADIOS guaranteeing that my pointers given to compress operators are 4 byte aligned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that maybe the ptr for the output is not aligned during the compression.
The function I shared can verify that returning the required padding in bytes (if 0 no padding is required).
Is ADIOS guaranteeing that my pointers given to compress operators are 4 byte aligned?
That's a good question. It depends on the serialization buffer consuming this interface. The BP buffer is char
so it shouldn't be a problem. I'd check with @pnorbert for other existing strategies.
I am not sure which solution guarantees portability.
Yup, that's the challenge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like Blosc is not covered by CI and thus breaking on MSVC: #2745
I found this on SO, we could add an extra handling for MSVC and assume other compilers support the currently GCC-like syntax for now: https://stackoverflow.com/a/3312896/2719194
I also hoped to switch to blosc2 and it is solving the 2GiB issue but unluckily as I know they now support larger size-types but you need to explicitly pack the data with the framing API and there is no simple function available which is doing it transparently. |
The current implementation of the Blosc ADIOS2 operator is due to limitations of Blosc not supporting variables larger than 2GiB. - Add option `threshold` to set a threshold under which data will only be copied instead of compressed (equivalent to Blosc in ADIOS1). - Use chunk format to avoid 2GiB limit for variables (datasets). - Add support to decompress blosc data written with ADIOS before this PR.
- add test combination for new option `threshold` - add test combination for shuffling
7327660
to
c83fcb7
Compare
I've rebased the topic on release_27 as a candidate for the 2.7.1 patch release. Please update your local branch with the remote. |
improve Blosc compression operator
Add support for improved blosc support: ornladios/ADIOS2#2592
Add support for improved blosc support: ornladios/ADIOS2#2592
* setup.py: bump version to 0.13.4.post0 * ADIOS: 2.6.0 -> 2.7.1 Add support for improved blosc support: ornladios/ADIOS2#2592 * BLOSC: 1.20.1 -> 1.21.0 * ADIOS2: Linux PThread Issue w/ Blosc ``` /usr/local/lib/libblosc.a(blosc.c.o): In function `blosc_init.part.9': blosc.c:(.text+0x43e): undefined reference to `pthread_atfork' ``` * ADIOS 2.7.1: No Build/Install Test * Windows: ADIOS 2.7.0 Better an older ADIOS than no BLOSC support at all. * Blosc: Avoid Pickup of Static pthread lib * Nope, keep LDFLAGS * Win: ADIOS 2.7.1 w/ MSVC Blosc Patch
fix: #2018
The current implementation of the Blosc ADIOS2 operator is due to limitations of Blosc not supporting adios variables larger than 2GiB.
ADIOS1 was originally using a chunked format to work around this limitation.
threshold
to set a threshold under which data will only be copied instead of compressed (equivalent to Blosc in ADIOS1).The header written to decompressed the date is different from the ADIOS1 implementation.
Compared to ADIOS1 this PR implements compression and decompression much cleaner.
@maintainer Where do I should document the new option
threshold
for the Blosc operator. I have not found anything in the documentation.Thanks to @franzpoeschel for providing the test code testBlosc.txt
.
CC-ing: @ax3l