Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve Blosc compression operator #2592

Merged

Conversation

psychocoderHPC
Copy link
Contributor

@psychocoderHPC psychocoderHPC commented Jan 20, 2021

fix: #2018

The current implementation of the Blosc ADIOS2 operator is due to limitations of Blosc not supporting adios variables larger than 2GiB.
ADIOS1 was originally using a chunked format to work around this limitation.

  • Add option threshold to set a threshold under which data will only be copied instead of compressed (equivalent to Blosc in ADIOS1).
  • Use chunk format to avoid 2GiB limit for variables (datasets).
  • Add support to decompress Blosc data written with ADIOS before this PR.
    • update blosc tests

The header written to decompressed the date is different from the ADIOS1 implementation.
Compared to ADIOS1 this PR implements compression and decompression much cleaner.

@maintainer Where do I should document the new option threshold for the Blosc operator. I have not found anything in the documentation.

Thanks to @franzpoeschel for providing the test code testBlosc.txt
.
CC-ing: @ax3l

@pnorbert
Copy link
Contributor

Thank you for fixing this compressor. Unfortunately, you caught us with the docs, there is no section on operators/compressors.
@chuckatkins Can you please add redirect this to the patch release?

@pnorbert
Copy link
Contributor

@psychocoderHPC Can you please extend the tests in testing/adios2/engine/bp/operations/TestBPWriteReadBlosc.cpp so that the threshold settings are tested? You need to look at this test anyway as the memory sanitizer fails on it...

@williamfgc
Copy link
Contributor

@psychocoderHPC thanks, that was long overdue, only bzip2 was doing that and the hope was to migrate to blosc2 eventually as in this issue. @pnorbert there is almost nothing in the docs on compression operations since each engine supports its own set.

@psychocoderHPC psychocoderHPC force-pushed the topic-bloscImprovements branch from ae5dc35 to 7327660 Compare January 20, 2021 15:51
@psychocoderHPC
Copy link
Contributor Author

@psychocoderHPC Can you please extend the tests in testing/adios2/engine/bp/operations/TestBPWriteReadBlosc.cpp so that the threshold settings are tested? You need to look at this test anyway as the memory sanitizer fails on it...

I updated the blosc tests and added test combinations for shuffling and threshold

void *dataOut, const size_t sizeOut,
Params &info) const;

class __attribute__((packed)) DataHeader
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pnorbert I fixed hopefully the ubsan issue in the CI by disallowing the compiler to add padding.
The disadvantage is that there is no standard attribute in C++ :-(
Is this way ok? If not, do you have a suggestion on how I can fix this issue instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it helps, there is an internal helper function to add padding in a portable way.

Copy link
Contributor Author

@psychocoderHPC psychocoderHPC Jan 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this helps. The problem is that maybe the ptr for the output is not aligned during the compression. If I use the helper and shift my data a few bytes and later during decompress I get the data within an array with a different base address. If so it would be not possible to find the data because the required padding is different.

If I understand the ubsan error for the old code correctly and what I found in the net the packed attribute is required to guarantee that the struct has the same layout even if you compile blosc on a different system. This attributes forbids the compiler to transform the struct which is IMO correct.

I can also remove the header class und work with two pure integers but I found it better readable to use the header class because it allows the code to shift with sizeof(Header) instead of using a magic number 8byte.

I am not sure which solution guarantees portability.

Copy link
Contributor Author

@psychocoderHPC psychocoderHPC Jan 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it is enough and better if I align the struct to 4 byte.
Is ADIOS guaranteeing that my pointers given to compress operators are 4 byte aligned?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that maybe the ptr for the output is not aligned during the compression.

The function I shared can verify that returning the required padding in bytes (if 0 no padding is required).

Is ADIOS guaranteeing that my pointers given to compress operators are 4 byte aligned?

That's a good question. It depends on the serialization buffer consuming this interface. The BP buffer is char so it shouldn't be a problem. I'd check with @pnorbert for other existing strategies.

I am not sure which solution guarantees portability.

Yup, that's the challenge.

Copy link
Contributor

@ax3l ax3l Jun 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like Blosc is not covered by CI and thus breaking on MSVC: #2745

I found this on SO, we could add an extra handling for MSVC and assume other compilers support the currently GCC-like syntax for now: https://stackoverflow.com/a/3312896/2719194

@psychocoderHPC
Copy link
Contributor Author

that and the hope was to migrate to blosc2 eventually as in this issue.

I also hoped to switch to blosc2 and it is solving the 2GiB issue but unluckily as I know they now support larger size-types but you need to explicitly pack the data with the framing API and there is no simple function available which is doing it transparently.
Therefore I decided to stay with the well-tested approach from ADIOS1 with blosc1.

The current implementation of the Blosc ADIOS2 operator is due to
limitations of Blosc not supporting variables larger than 2GiB.

- Add option `threshold` to set a threshold under which data will only
be copied instead of compressed (equivalent to Blosc in ADIOS1).
- Use chunk format to avoid 2GiB limit for variables (datasets).
- Add support to decompress blosc data written with ADIOS before this
PR.
- add test combination for new option `threshold`
- add test combination for shuffling
@chuckatkins chuckatkins force-pushed the topic-bloscImprovements branch from 7327660 to c83fcb7 Compare January 20, 2021 18:36
@chuckatkins chuckatkins changed the base branch from master to release_27 January 20, 2021 18:37
@chuckatkins
Copy link
Contributor

I've rebased the topic on release_27 as a candidate for the 2.7.1 patch release. Please update your local branch with the remote.

@chuckatkins chuckatkins merged commit a7d9507 into ornladios:release_27 Jan 22, 2021
chuckatkins pushed a commit that referenced this pull request Jan 22, 2021
@psychocoderHPC psychocoderHPC deleted the topic-bloscImprovements branch March 3, 2021 07:29
ax3l added a commit to ax3l/openPMD-api that referenced this pull request Jun 4, 2021
Add support for improved blosc support:
ornladios/ADIOS2#2592
ax3l added a commit to ax3l/openPMD-api that referenced this pull request Jun 4, 2021
Add support for improved blosc support:
ornladios/ADIOS2#2592
ax3l added a commit to openPMD/openPMD-api that referenced this pull request Jun 4, 2021
* setup.py: bump version to 0.13.4.post0

* ADIOS: 2.6.0 -> 2.7.1

Add support for improved blosc support:
ornladios/ADIOS2#2592

* BLOSC: 1.20.1 -> 1.21.0

* ADIOS2: Linux PThread Issue w/ Blosc

```
/usr/local/lib/libblosc.a(blosc.c.o): In function `blosc_init.part.9':
blosc.c:(.text+0x43e): undefined reference to `pthread_atfork'
```

* ADIOS 2.7.1: No Build/Install Test

* Windows: ADIOS 2.7.0

Better an older ADIOS than no BLOSC support at all.

* Blosc: Avoid Pickup of Static pthread lib

* Nope, keep LDFLAGS

* Win: ADIOS 2.7.1 w/ MSVC Blosc Patch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

C-Blosc: Support for Large Vars
5 participants