
"Reference" file system reads (like Kerchunk + Zarr) #2777

Open

Description

An emerging approach to performant (especially cloud-native) reads of NetCDF/HDF5 files (as well as GRIB2 and other formats) is to use a "reference file system": basically, a sidecar file that maps each chunk the Zarr library expects to read onto a specific byte range within the original file.

For a more detailed description, see the Kerchunk documentation.

Here's a snippet from inside a reference file system description in JSON format:

        "Rad/0.0.22": [
            "/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/23/OR_ABI-L1b-RadF-M6C01_G17_s20220012350320_e20220012359386_c20220012359432.nc",
            39046,
            7986
        ],
        "Rad/0.0.23": [
            "/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/23/OR_ABI-L1b-RadF-M6C01_G17_s20220012350320_e20220012359386_c20220012359432.nc",
            47032,
            7343
        ],
        "Rad/0.0.24": [
            "/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/23/OR_ABI-L1b-RadF-M6C01_G17_s20220012350320_e20220012359386_c20220012359432.nc",
            11399,
            5174
        ],

Each triplet here corresponds to (1) the target file, (2) the starting byte offset, and (3) the number of bytes in that chunk. The JSON key (e.g., Rad/0.0.22) identifies the chunk and is exactly equivalent to the path of a chunk blob one would encounter in a typical Zarr store.
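
To make the mapping concrete, here is a minimal C sketch (a hypothetical helper, not an existing NetCDF function) of what resolving one such triplet amounts to: open the target file, seek to the starting byte offset, and read the given number of bytes.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical helper: fetch the raw bytes for one reference triplet
     * (path, offset, length), e.g. offset 39046 and length 7986 for the
     * Rad/0.0.22 entry above.  The bytes come back exactly as stored in the
     * HDF5 file, i.e. still shuffled/compressed. */
    static unsigned char *read_chunk(const char *path, long offset, size_t length)
    {
        FILE *fp = fopen(path, "rb");
        if (fp == NULL)
            return NULL;

        unsigned char *buf = malloc(length);
        if (buf == NULL || fseek(fp, offset, SEEK_SET) != 0
            || fread(buf, 1, length, fp) != length) {
            free(buf);
            fclose(fp);
            return NULL;
        }
        fclose(fp);
        return buf;
    }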

Since NetCDF already has the machinery to read Zarr archives via its libnczarr driver, it would be nice to extend that driver to be able to read "virtual" Zarr datasets described in such a format.
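
For reference, this is roughly what opening an on-disk Zarr store through libnczarr looks like today (the store path here is a placeholder; the #mode fragment selects the Zarr dispatch):

    #include <stdio.h>
    #include <netcdf.h>

    int main(void)
    {
        int ncid;
        /* Open an existing Zarr store on the local file system through
         * libnczarr; "/tmp/example.zarr" is a placeholder path. */
        int stat = nc_open("file:///tmp/example.zarr#mode=zarr,file",
                           NC_NOWRITE, &ncid);
        if (stat != NC_NOERR) {
            fprintf(stderr, "nc_open failed: %s\n", nc_strerror(stat));
            return 1;
        }
        /* ... inquire dimensions/variables, read data as usual ... */
        nc_close(ncid);
        return 0;
    }

One could imagine exposing a reference-based read through this same URL fragment mechanism, with a mode tag that points libnczarr at the JSON reference file instead of a directory tree.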

In principle, this should be straightforward: we already read byte ranges from local file systems and S3, so NetCDF would just need:

(1) To grab dataset attributes, dimensions, and other metadata from the JSON file itself.
(2) To read the raw binary byte streams based on the triplet references in the JSON (rather than from wherever that information lives in the HDF5 layout); a sketch of these two steps follows below.
(3) To hand those byte streams to everything downstream in libnczarr (shuffle, decompression, stitching arrays together from chunks, and so on), which should be identical to the current implementation.
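
As a rough sketch of steps (1) and (2), the lookup side might look something like the following. cJSON is used purely for illustration (libnczarr would presumably use its own JSON machinery), read_chunk is the hypothetical helper sketched earlier, and the layout assumed is Kerchunk's version-1 reference format, where the key/triplet mapping shown above sits under a top-level "refs" object and metadata keys such as Rad/.zarray and Rad/.zattrs carry the Zarr metadata inline as JSON-encoded strings.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cjson/cJSON.h>

    /* Hypothetical lookup of one chunk in a parsed reference file
     * (assuming the version-1 layout: {"version": 1, "refs": {...}}).
     * Inline (non-triplet) references are ignored for brevity. */
    static unsigned char *fetch_chunk(const cJSON *root, const char *key,
                                      size_t *lenp)
    {
        const cJSON *refs  = cJSON_GetObjectItemCaseSensitive(root, "refs");
        const cJSON *entry = cJSON_GetObjectItemCaseSensitive(refs, key);
        if (!cJSON_IsArray(entry) || cJSON_GetArraySize(entry) != 3)
            return NULL;   /* not a [path, offset, length] triplet */

        const char *path = cJSON_GetArrayItem(entry, 0)->valuestring;
        long offset      = (long)cJSON_GetArrayItem(entry, 1)->valuedouble;
        size_t length    = (size_t)cJSON_GetArrayItem(entry, 2)->valuedouble;

        *lenp = length;
        /* Step (2): raw byte-range read with the helper sketched earlier;
         * step (3) then hands this buffer to the usual shuffle/decompress
         * pipeline already in libnczarr. */
        return read_chunk(path, offset, length);
    }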

@ksharonin has done some excellent preliminary work on reading chunks from such reference triplets, which should serve as a useful template for implementing this in the core NetCDF library: https://github.com/ksharonin/kerchunkC/tree/master/code/c%2B%2B

CC: @amdasilva
