Description
opened on Oct 23, 2023
An emerging approach to performant (especially cloud-native) reads of NetCDF/HDF5 files (as well as GRIB2 and other formats) is to use a "reference file system" --- essentially, a sidecar file that maps specific byte ranges within a file onto chunks that can be read by the Zarr library.
For a more detailed description, see the Kerchunk documentation.
Here's a snippet from inside a reference file system description in JSON format:
```json
"Rad/0.0.22": [
    "/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/23/OR_ABI-L1b-RadF-M6C01_G17_s20220012350320_e20220012359386_c20220012359432.nc",
    39046,
    7986
],
"Rad/0.0.23": [
    "/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/23/OR_ABI-L1b-RadF-M6C01_G17_s20220012350320_e20220012359386_c20220012359432.nc",
    47032,
    7343
],
"Rad/0.0.24": [
    "/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/23/OR_ABI-L1b-RadF-M6C01_G17_s20220012350320_e20220012359386_c20220012359432.nc",
    11399,
    5174
],
```
Each triplet here corresponds to (1) the target file, (2) the starting byte offset, and (3) the number of bytes in that chunk. The JSON key (e.g., `Rad/0.0.22`) identifies the chunk, and is exactly equivalent to the path of the corresponding chunk blob in a typical Zarr store.
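To make the triplet semantics concrete, here is a minimal Python sketch of resolving one reference to raw chunk bytes. The file name, offset, and length below are invented for the demo rather than taken from a real reference file:

```python
import os
import tempfile

def read_chunk(ref):
    """Resolve a reference triplet [target_path, offset, length]
    to the raw bytes of one chunk."""
    target, offset, length = ref
    with open(target, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Demo: write a scratch file and pull a byte range out of it, the same
# way a reference triplet would direct a reader into a NetCDF/HDF5 file.
scratch = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(scratch, "wb") as f:
    f.write(b"x" * 100 + b"CHUNKDATA" + b"y" * 50)

chunk = read_chunk([scratch, 100, 9])  # -> b"CHUNKDATA"
```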
Since NetCDF already has the machinery to read Zarr archives via its `libnczarr` driver, it would be nice to extend that driver to be able to read "virtual" Zarr datasets described in this format.
In principle, this should be straightforward --- since we can already read byte ranges from file systems or S3, NetCDF would just need:

1. To grab dataset attributes, dimensions, etc. from the JSON file itself.
2. To read the binary byte streams based on the triplet references in the JSON (rather than from wherever that information lives in HDF5).
3. Once we have the raw byte streams, everything downstream --- shuffle, compression, stitching arrays together from chunks, etc. --- should be identical to the current `libnczarr` implementation.
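The steps above can be sketched end-to-end in Python. This is a hedged illustration, not the proposed C implementation: the `.zarray` metadata and the zlib compressor are assumptions for the demo (real datasets may use blosc, shuffle filters, etc.), and the "byte-range read" of step (2) is simulated in memory:

```python
import array
import json
import zlib

# (1) Array metadata as it would appear under a ".zarray" key inside the
#     reference JSON (illustrative values, not from a real dataset).
zarray = json.loads('{"chunks": [4], "dtype": "<f8", "compressor": {"id": "zlib"}}')

# (2) Pretend these bytes came from a seek+read of a [target, offset,
#     length] triplet; here we fabricate a zlib-compressed chunk instead.
values = array.array("d", [1.0, 2.0, 3.0, 4.0])
raw_chunk = zlib.compress(values.tobytes())

# (3) Everything downstream is ordinary Zarr chunk decoding: decompress,
#     then reinterpret the bytes according to the dtype in the metadata.
decoded = array.array("d")
decoded.frombytes(zlib.decompress(raw_chunk))
# list(decoded) -> [1.0, 2.0, 3.0, 4.0]
```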
@ksharonin has done some excellent preliminary work on reading chunks given such a reference triplet that should serve as a useful template for getting this implemented in the core NetCDF library: https://github.com/ksharonin/kerchunkC/tree/master/code/c%2B%2B
CC: @amdasilva