
Zstd Codec on the GPU #2863

Draft · wants to merge 3 commits into main
Conversation

akshaysubr
Contributor

This PR adds a Zstd codec that runs on the GPU using the nvCOMP 4.2 python APIs.

TODO:

  • Make fully async
  • Performance benchmarking
  • CPU-GPU roundtrip testing
  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)
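
For reference, a roundtrip through the nvCOMP Python API looks roughly like the sketch below (a minimal sketch based on the nvidia-nvcomp-cu12 package; the nvcomp.Codec and nvcomp.as_array spellings should be checked against the nvCOMP 4.2 docs rather than read as this PR's exact code):

import cupy as cp
from nvidia import nvcomp

# a device buffer to compress (already resident on the GPU)
data = cp.arange(1 << 20, dtype=cp.uint8)

# assumption: the algorithm is selected by name
codec = nvcomp.Codec(algorithm="Zstd")

nv_arr = nvcomp.as_array(data)           # wrap the CuPy buffer as an nvcomp array
compressed = codec.encode(nv_arr)        # compress on the GPU
decompressed = codec.decode(compressed)  # decompress on the GPU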

@github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) on Feb 25, 2025
@dstansby
Contributor

Thanks for opening this PR! At the moment we do not have any codecs implemented in the zarr-python package; instead they live in numcodecs. So although it looks like some zarr-python-specific changes are needed to support the new GPU codec, the actual codec should be implemented in numcodecs and then imported into zarr-python.

@akshaysubr
Contributor Author

@dstansby My understanding was that numcodecs is where Python bindings to native codec implementations live, and that with v3 the Codec class itself lives in zarr-python. The GPU codecs and Python bindings are implemented in nvCOMP and imported through the nvidia-nvcomp-cu12 package, so I'm not sure which part of this would need to go in numcodecs. What did you have in mind?

# Convert to nvcomp arrays
filtered_inputs, none_indices = await self._convert_to_nvcomp_arrays(chunks_and_specs)

outputs = self._zstd_codec.decode(filtered_inputs) if len(filtered_inputs) > 0 else []
Contributor

Question related to #2904 (which is looking into memory usage). Would it be possible for nvcomp-python to provide an out argument to decode? If I'm reading the C++ docs correctly, that does seem to decompress into an output buffer. Eventually it would be nice to do that all the way into Zarr's out buffer.
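
Purely to illustrate the ask (the out= keyword is hypothetical and does not exist in nvcomp-python today; decoded_sizes is a stand-in for the known decompressed chunk sizes):

import cupy

# hypothetical: decode directly into preallocated device buffers
out_buffers = [cupy.empty(nbytes, dtype=cupy.uint8) for nbytes in decoded_sizes]
self._zstd_codec.decode(filtered_inputs, out=out_buffers)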

Contributor

It's not currently supported, but I've seen similar requests from elsewhere. Hopefully someday.

chunks_and_specs = list(chunks_and_specs)

# Convert to nvcomp arrays
filtered_inputs, none_indices = await self._convert_to_nvcomp_arrays(chunks_and_specs)
Contributor

This should just be reinterpreting some bytes / pointers, right? There's no possibility of doing any kind of (blocking) I/O? If so, then I'd recommend making this a regular sync method.
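
For example, a sync version might look like the sketch below (the chunks_and_specs shape and the nvcomp.as_array call are assumptions for illustration, not the PR's actual code):

def _convert_to_nvcomp_arrays(self, chunks_and_specs):
    # pure bytes/pointer reinterpretation, no I/O, so a plain sync method suffices
    filtered_inputs, none_indices = [], []
    for i, (chunk, _spec) in enumerate(chunks_and_specs):
        if chunk is None:
            none_indices.append(i)
        else:
            filtered_inputs.append(nvcomp.as_array(chunk))
    return filtered_inputs, none_indices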

Contributor

Same comment on L161 with the return await self._convert_from_nvcomp_arrays.

Contributor

Also in encode.

# Convert to nvcomp arrays
filtered_inputs, none_indices = await self._convert_to_nvcomp_arrays(chunks_and_specs)

outputs = self._zstd_codec.decode(filtered_inputs) if len(filtered_inputs) > 0 else []
Contributor

This is the line where we should be careful about what happens where.

Do you know whether nvcomp.Codec.decode is blocking, or does it just asynchronously schedule the decode on the GPU? If it's non-blocking, then for simplicity I'd recommend using asyncio.to_thread around a regular sync Python function that schedules the decode and then blocks on the stream synchronization (e.g. an event.wait()).

def decode_wrapper(codec, filtered_inputs):
    result = codec.decode(filtered_inputs)
    # wait for the decode to complete, e.g. a cupy.cuda.Event or stream synchronize
    return result

outputs = await asyncio.to_thread(decode_wrapper, self._zstd_codec, filtered_inputs)

I have some longer-term thoughts around how zarr-python handles concurrency for different types of workloads, but for now I think it's probably best to follow the other codecs, which use to_thread.

@TomAugspurger
Contributor · Jun 9, 2025

More specifically, I'm thinking something like this on the class:

    async def _synchronize_stream(self) -> None:
        # this is the blocking operation. Offload it to a worker thread to not block the main thread
        await asyncio.to_thread(self.stream.synchronize)

and then in decode:

        ...
        decoded = self._convert_from_nvcomp_arrays(outputs, chunks_and_specs)
        # Uphold zarr-python's guarantee that the decode is finished before returning
        await self._synchronize_stream()
        return decoded

That way, the main thread will schedule everything to happen on the GPU via self._zstd_codec.decode(), but the actual stream synchronization will happen on another thread, so that it doesn't block the event loop.

That requires putting a stream (and maybe Device?) object on the ZstdCodec class so that we can make sure we synchronize the right stream (the same one passed to nvcomp.Codec).
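
A rough sketch of that shape, assuming a CuPy stream held on the codec (how the stream is handed to nvcomp.Codec is left as a placeholder, since the exact keyword isn't pinned down here):

import asyncio
import cupy as cp

class ZstdCodec:
    def __init__(self) -> None:
        # dedicated stream so we know exactly which GPU work to synchronize
        self.stream = cp.cuda.Stream(non_blocking=True)
        # placeholder: the same stream would also be given to nvcomp.Codec here
        self._zstd_codec = ...  # e.g. nvcomp.Codec(algorithm="Zstd", ...)

    async def _synchronize_stream(self) -> None:
        # stream.synchronize() blocks, so offload it to a worker thread
        await asyncio.to_thread(self.stream.synchronize)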
