Description
I have a proposal for an enhancement and would like to get feedback on this and potentially better ways of achieving the same goal.
In many use cases that are read-I/O-bandwidth bound, like dataloaders for AI training, compression is typically turned off because decompression would become the bottleneck. We can get upwards of 10 GB/s of read throughput from an object store or from a distributed filesystem like Lustre, but CPU decompression typically maxes out at ~1 GB/s, and usually only with coarse-grained multithreaded parallelism.

A nice solution to this problem would be to decompress data on the GPU, which we can do at > 50 GB/s. Decompression would then no longer be the bottleneck, and every other part of the pipeline speeds up as well: lower storage usage, less data fetched from storage, higher effective cache capacity when caching locally, and faster CPU-GPU transfers. The catch with any kind of parallel decompression is that the API needs to support batched or concurrent decompression rather than calling decompression on chunks serially in a loop.
Here is some data comparing the throughput of GPU decompression algorithms in nvcomp with multithreaded zstd on the CPU:
This is conceptually similar to #547 (concurrent chunk access), but for compression/decompression.
The idea is to allow `Codec`s in numcodecs to implement a `batched_encode` and `batched_decode` in addition to the current `encode` and `decode` methods. When a codec has these methods available, zarr can dispatch a batch of chunks for encode/decode. The codec implementation can then either use a serial loop, multi-threaded parallelism, or parallelize on the GPU using nvcomp. I'm envisioning this to be quite similar to `getitems` here:

Line 2061 in 2ff8875
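To make the shape of the API concrete, here is a minimal sketch of the dispatch side, roughly analogous to the `getitems` fallback. The `decode_batch` helper and the `batched_decode` method name are assumptions for illustration, not existing zarr/numcodecs API:

```python
from numcodecs import LZ4  # any existing numcodecs codec works here


def decode_batch(codec, chunks, out=None):
    # If the codec advertises a batched path (e.g. one backed by nvcomp),
    # hand it the whole batch in a single call; otherwise fall back to the
    # serial per-chunk loop that zarr effectively does today.
    if hasattr(codec, "batched_decode"):
        return codec.batched_decode(chunks, out=out)
    out = out if out is not None else [None] * len(chunks)
    return [codec.decode(c, out=o) for c, o in zip(chunks, out)]


# Usage: decode a batch of independently compressed chunks.
codec = LZ4()
chunks = [codec.encode(bytes(1024)) for _ in range(8)]
decoded = decode_batch(codec, chunks)
```

Because the fallback is just the existing serial loop, codecs that never implement the batched methods would keep working unchanged.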
From a GPU (de)compression standpoint, we had thought about using the sharding transformer format in ZEP0002 as the internal format in nvcomp. After some consideration, though, this batched compression approach seems much more useful because:
- We can start using GPU decompression with existing zarr datasets that don't have the shard structure. The new sharding format would make GPU compression/decompression much more efficient but would not be required for functionality.
- We can support GPU decompression when accessing a subset of chunks within a shard, or chunks from multiple shards, such as when slicing along a dimension.
- We wouldn't have to add any new shard access API in zarr or any multi-dimension-aware codec implementations.
- We don't need separate CPU and GPU codec implementations for the same underlying compression algorithm. For example, a single LZ4 codec could use the GPU in batched mode when one is available, fall back to a multi-threaded CPU path, or run single-threaded on the CPU. Data compressed on a GPU would then be directly compatible with CPU decompression, so users don't need a GPU to read the file (a rough sketch follows this list).
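As a rough illustration of that last point, here is a sketch of what a single LZ4 codec with an optional batched GPU path could look like. Everything here is hypothetical: the class, the codec id, and the nvcomp helpers are placeholders, since the actual nvcomp bindings would depend on the implementation.

```python
import numcodecs
from numcodecs.abc import Codec


def _nvcomp_available():
    # Placeholder capability check; a real implementation would probe for a
    # CUDA device and the nvcomp bindings.
    return False


def _nvcomp_lz4_decompress_batch(chunks):
    # Placeholder for a single batched nvcomp LZ4 decompression call.
    raise NotImplementedError


class BatchedLZ4(Codec):
    """Hypothetical codec: one LZ4 codec id, with both CPU and GPU paths."""

    codec_id = "lz4-batched"  # illustrative id only

    def __init__(self):
        self._cpu = numcodecs.LZ4()

    # The existing per-chunk API is unchanged, so CPU-only consumers can
    # always decode data produced by the GPU path.
    def encode(self, buf):
        return self._cpu.encode(buf)

    def decode(self, buf, out=None):
        return self._cpu.decode(buf, out=out)

    # Proposed batched API: one call per batch instead of one per chunk.
    def batched_decode(self, chunks, out=None):
        if _nvcomp_available():
            # Hand the whole batch to the GPU in a single call.
            return _nvcomp_lz4_decompress_batch(chunks)
        # CPU fallback: same bytes, same result, just a serial loop.
        out = out if out is not None else [None] * len(chunks)
        return [self.decode(c, out=o) for c, o in zip(chunks, out)]
```

The key property is that the compressed bytes are identical regardless of which path produced them, so the GPU is purely an optional accelerator.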
Would really appreciate any suggestions, concerns or other ideas that I might be missing. Also cc @jakirkham since he mentioned that there might be some intersection with Blosc.