Description
I have a proposal for an enhancement and would like to get feedback on this and potentially better ways of achieving the same goal.
In many use cases that are read-I/O-bandwidth bound, like dataloaders for AI training, compression is typically turned off because decompression would become the bottleneck. We can get upwards of 10 GB/s of read throughput from an object store or from a distributed filesystem like Lustre, but CPU decompression typically maxes out at ~1 GB/s, and usually only with coarse-grained multithreaded parallelism.

A nice solution to this problem would be to decompress data on the GPU, which we can do at > 50 GB/s. Decompression would then no longer be the bottleneck, and every other part of the pipeline speeds up as well: lower storage usage, less data fetched from storage, higher effective cache capacity when caching locally, and faster CPU-GPU transfers. The catch with any kind of parallel decompression is that the API needs to support batched or concurrent decompression rather than calling decompression on chunks serially in a loop.
Here is some data comparing the throughput of GPU decompression algorithms in nvcomp with multithreaded zstd on the CPU:
This is conceptually similar to #547 (concurrent chunk access), but for compression/decompression.
The idea is to allow `Codec`s in numcodecs to implement a `batched_encode` and `batched_decode` in addition to the current `encode` and `decode` methods. When a codec has these methods available, zarr can dispatch a batch of chunks for encode/decode. The codec implementation can then either use a serial loop, multi-threaded parallelism, or parallelize on the GPU using nvcomp. I'm envisioning this to be quite similar to `getitems` here:

Line 2061 in 2ff8875
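To make the shape of the API concrete, here is a minimal sketch of the dispatch side, roughly analogous to the `getitems` fallback. The `decode_batch` helper and the `batched_decode` method name are assumptions for illustration, not existing zarr/numcodecs API:

```python
from numcodecs import LZ4  # any existing numcodecs codec works here


def decode_batch(codec, chunks, out=None):
    # If the codec advertises a batched path (e.g. one backed by nvcomp),
    # hand it the whole batch in a single call; otherwise fall back to the
    # serial per-chunk loop that zarr effectively does today.
    if hasattr(codec, "batched_decode"):
        return codec.batched_decode(chunks, out=out)
    out = out if out is not None else [None] * len(chunks)
    return [codec.decode(c, out=o) for c, o in zip(chunks, out)]


# Usage: decode a batch of independently compressed chunks.
codec = LZ4()
chunks = [codec.encode(bytes(1024)) for _ in range(8)]
decoded = decode_batch(codec, chunks)
```

Because the fallback is just the existing serial loop, codecs that never implement the batched methods would keep working unchanged.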
From a GPU (de)compression standpoint, we had thought about using the sharding transformer format in ZEP0002 as the internal format in nvcomp. After some consideration, though, this batched compression approach seems much more useful because:
- We can start using GPU decompression with existing zarr datasets that don't have the shard structure. The new sharding format would make GPU compression/decompression much more efficient but would not be required for functionality.
- We can support GPU decompression when accessing a subset of chunks within a shard, or chunks from multiple shards, such as when slicing along a dimension.
- We wouldn't have to add any new shard access API in zarr or any multi-dimension-aware codec implementations.
- We don't need separate CPU and GPU codec implementations for the same underlying compression algorithm. For example, a single LZ4 codec could use the GPU in batched mode when one is available, fall back to a multi-threaded CPU path, or run single-threaded on the CPU. Data compressed on a GPU would then be directly compatible with CPU decompression, so users don't need a GPU to read the file (a rough sketch follows this list).
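As a rough illustration of that last point, here is a sketch of what a single LZ4 codec with an optional batched GPU path could look like. Everything here is hypothetical: the class, the codec id, and the nvcomp helpers are placeholders, since the actual nvcomp bindings would depend on the implementation.

```python
import numcodecs
from numcodecs.abc import Codec


def _nvcomp_available():
    # Placeholder capability check; a real implementation would probe for a
    # CUDA device and the nvcomp bindings.
    return False


def _nvcomp_lz4_decompress_batch(chunks):
    # Placeholder for a single batched nvcomp LZ4 decompression call.
    raise NotImplementedError


class BatchedLZ4(Codec):
    """Hypothetical codec: one LZ4 codec id, with both CPU and GPU paths."""

    codec_id = "lz4-batched"  # illustrative id only

    def __init__(self):
        self._cpu = numcodecs.LZ4()

    # The existing per-chunk API is unchanged, so CPU-only consumers can
    # always decode data produced by the GPU path.
    def encode(self, buf):
        return self._cpu.encode(buf)

    def decode(self, buf, out=None):
        return self._cpu.decode(buf, out=out)

    # Proposed batched API: one call per batch instead of one per chunk.
    def batched_decode(self, chunks, out=None):
        if _nvcomp_available():
            # Hand the whole batch to the GPU in a single call.
            return _nvcomp_lz4_decompress_batch(chunks)
        # CPU fallback: same bytes, same result, just a serial loop.
        out = out if out is not None else [None] * len(chunks)
        return [self.decode(c, out=o) for c, o in zip(chunks, out)]
```

The key property is that the compressed bytes are identical regardless of which path produced them, so the GPU is purely an optional accelerator.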
Would really appreciate any suggestions, concerns or other ideas that I might be missing. Also cc @jakirkham since he mentioned that there might be some intersection with Blosc.