Arbitrary chunking of uncompressed files (e.g. netCDF3) #86

Open
TomNicholas opened this issue Apr 19, 2024 · 2 comments · May be fixed by #199
Labels
enhancement New feature or request Kerchunk Relating to the kerchunk library / specification itself zarr-python Relevant to zarr-python upstream

Comments

@TomNicholas
Member

@rabernat made the interesting point to me that uncompressed files (e.g. netCDF3 files) have no inherent chunking: you can start reading bytes from any point in the file immediately, because there is no minimum unit that has to be decompressed first.

I'm not totally sure what this implies for VirtualiZarr generating references from netCDF3 files, as it's still meaningful to treat each file as one chunk and concatenate those chunks together in a manifest.

Perhaps this is something that zarr readers should ultimately account for: reading bytes from an uncompressed array does not require loading an entire chunk into memory first.
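To make the point concrete, here is a minimal sketch (standard library only) of reading a few rows out of an uncompressed, C-ordered array by seeking straight to the right byte offset. The helper name `read_rows`, the 16-byte stand-in header, and the float32 layout are assumptions for illustration; this is not how zarr-python or VirtualiZarr read data today.

```python
import io
import struct


def read_rows(f, offset, row_start, row_stop, ncols, itemsize=4, fmt="f"):
    """Read rows [row_start, row_stop) of a C-ordered 2D array.

    The array is assumed to be stored uncompressed, starting at byte
    `offset` of the file-like object `f`. Only the requested rows are read.
    """
    row_nbytes = ncols * itemsize
    f.seek(offset + row_start * row_nbytes)
    nvals = (row_stop - row_start) * ncols
    raw = f.read(nvals * itemsize)
    # netCDF classic formats store values big-endian, hence ">".
    return struct.unpack(f">{nvals}{fmt}", raw)


# A fake "file": a 16-byte header followed by a 3x4 float32 array.
header = b"\x00" * 16
payload = struct.pack(">12f", *(float(i) for i in range(12)))
buf = io.BytesIO(header + payload)

print(read_rows(buf, offset=16, row_start=1, row_stop=2, ncols=4))
# (4.0, 5.0, 6.0, 7.0)
```

No "chunk" is ever materialised here: the only bytes read are the 16 bytes of the requested row.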

@TomNicholas TomNicholas added the zarr-python Relevant to zarr-python upstream label Apr 19, 2024
@TomNicholas TomNicholas added enhancement New feature or request Kerchunk Relating to the kerchunk library / specification itself labels Jul 20, 2024
@TomNicholas
Member Author

It seems great minds think alike: https://medium.com/pangeo/using-kerchunk-with-uncompressed-netcdf-64-bit-offset-files-cloud-optimized-access-to-hycom-ocean-9008ba6d0d67

That post uses kerchunk.utils.subchunk (which I did not know existed), but I wonder whether a neater API for a similar feature in virtualizarr would be to add a chunks argument to open_virtual_dataset, one that can only be used for uncompressed files?
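As a rough illustration of what sub-chunking a reference involves, the sketch below splits one (path, offset, length) byte range into equal pieces along the first axis. The function `split_reference` and the example file name are hypothetical; this is not the implementation of kerchunk.utils.subchunk, nor an existing VirtualiZarr API.

```python
from math import prod


def split_reference(path, offset, length, shape, itemsize, factor):
    """Split one uncompressed chunk reference into `factor` sub-chunks.

    Splitting happens along the first (slowest-varying) axis only, which
    keeps every sub-chunk contiguous on disk for C-ordered data.
    Returns the new chunk shape and a dict of zarr-style chunk keys
    mapped to (path, offset, length).
    """
    if length != itemsize * prod(shape):
        raise ValueError("byte range does not match an uncompressed array of this shape")
    if shape[0] % factor != 0:
        raise ValueError("first axis is not divisible by factor")

    sub_length = length // factor
    sub_shape = (shape[0] // factor, *shape[1:])
    trailing = ["0"] * (len(shape) - 1)
    refs = {
        ".".join([str(i), *trailing]): (path, offset + i * sub_length, sub_length)
        for i in range(factor)
    }
    return sub_shape, refs


# A 4x6 float32 variable stored at byte 100 of a hypothetical file.
sub_shape, refs = split_reference("file.nc", 100, 96, (4, 6), 4, 2)
print(sub_shape)  # (2, 6)
print(refs)       # {'0.0': ('file.nc', 100, 48), '1.0': ('file.nc', 148, 48)}
```

A chunks argument on open_virtual_dataset could plausibly do the same arithmetic internally, but the exact semantics are an open design question in this issue.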

@TomNicholas
Member Author

In fact one could imagine a .rechunk method on a ManifestArray, but it would raise unless the ManifestArray pointed to contiguous-on-disk uncompressed data.
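A sketch of the guard such a method would need, using a toy stand-in class. `ToyManifestArray`, its fields, and the `rows_per_chunk` parameter are all hypothetical and only loosely modelled on the idea of a ManifestArray; the real class has a different structure.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class ToyManifestArray:
    """Hypothetical single-chunk stand-in, not VirtualiZarr's ManifestArray."""

    path: str
    offset: int
    length: int
    shape: tuple
    itemsize: int
    compressor: object = None
    filters: tuple = ()

    def rechunk(self, rows_per_chunk):
        """Return (path, offset, length) entries for chunks of `rows_per_chunk` rows."""
        # Compressed or filtered bytes cannot be sliced at arbitrary offsets.
        if self.compressor is not None or self.filters:
            raise ValueError("cannot rechunk: data is compressed or filtered")
        # The byte range must cover exactly one contiguous, uncompressed array.
        if self.length != self.itemsize * prod(self.shape):
            raise ValueError("cannot rechunk: byte range is not a contiguous array")
        if self.shape[0] % rows_per_chunk != 0:
            raise ValueError("rows_per_chunk must divide the first axis")

        step = rows_per_chunk * self.itemsize * prod(self.shape[1:])
        nchunks = self.shape[0] // rows_per_chunk
        return [(self.path, self.offset + i * step, step) for i in range(nchunks)]


arr = ToyManifestArray("a.nc", offset=0, length=32, shape=(4, 2), itemsize=4)
print(arr.rechunk(2))  # [('a.nc', 0, 16), ('a.nc', 16, 16)]
```

Raising early keeps the method honest: for compressed data the only correct way to rechunk is to decompress and rewrite, which a virtual reference cannot express.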
