
"Reference" file system reads (like Kerchunk + Zarr) #2777

Open

Description

An emerging approach to performant (especially cloud-native) reads of NetCDF/HDF5 files (as well as GRIB2 and other formats) is to use a "reference file system": basically, a sidecar file that maps each chunk the Zarr library expects to read onto a specific byte range within the original file.

For a more detailed description, see the Kerchunk documentation.

Here's a snippet from inside a reference file system description in JSON format:

        "Rad/0.0.22": [
            "/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/23/OR_ABI-L1b-RadF-M6C01_G17_s20220012350320_e20220012359386_c20220012359432.nc",
            39046,
            7986
        ],
        "Rad/0.0.23": [
            "/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/23/OR_ABI-L1b-RadF-M6C01_G17_s20220012350320_e20220012359386_c20220012359432.nc",
            47032,
            7343
        ],
        "Rad/0.0.24": [
            "/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/23/OR_ABI-L1b-RadF-M6C01_G17_s20220012350320_e20220012359386_c20220012359432.nc",
            11399,
            5174
        ],

Each triplet here corresponds to (1) the target file, (2) the starting byte offset, and (3) the number of bytes in that chunk. The JSON key (e.g., Rad/0.0.22) identifies the chunk and is exactly equivalent to the path of a chunk blob one would encounter in a typical Zarr store.
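
To make the mapping concrete, here is a minimal C sketch (a hypothetical helper, not an existing NetCDF function) of what resolving one such triplet amounts to: open the target file, seek to the starting byte offset, and read the given number of bytes.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical helper: fetch the raw bytes for one reference triplet
     * (path, offset, length), e.g. offset 39046 and length 7986 for the
     * Rad/0.0.22 entry above.  The bytes come back exactly as stored in the
     * HDF5 file, i.e. still shuffled/compressed. */
    static unsigned char *read_chunk(const char *path, long offset, size_t length)
    {
        FILE *fp = fopen(path, "rb");
        if (fp == NULL)
            return NULL;

        unsigned char *buf = malloc(length);
        if (buf == NULL || fseek(fp, offset, SEEK_SET) != 0
            || fread(buf, 1, length, fp) != length) {
            free(buf);
            fclose(fp);
            return NULL;
        }
        fclose(fp);
        return buf;
    }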

Since NetCDF already has the machinery to read Zarr archives via its libnczarr driver, it would be nice to extend that driver to be able to read "virtual" Zarr datasets described in such a format.
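
For reference, this is roughly what opening an on-disk Zarr store through libnczarr looks like today (the store path here is a placeholder; the #mode fragment selects the Zarr dispatch):

    #include <stdio.h>
    #include <netcdf.h>

    int main(void)
    {
        int ncid;
        /* Open an existing Zarr store on the local file system through
         * libnczarr; "/tmp/example.zarr" is a placeholder path. */
        int stat = nc_open("file:///tmp/example.zarr#mode=zarr,file",
                           NC_NOWRITE, &ncid);
        if (stat != NC_NOERR) {
            fprintf(stderr, "nc_open failed: %s\n", nc_strerror(stat));
            return 1;
        }
        /* ... inquire dimensions/variables, read data as usual ... */
        nc_close(ncid);
        return 0;
    }

One could imagine exposing a reference-based read through this same URL fragment mechanism, with a mode tag that points libnczarr at the JSON reference file instead of a directory tree.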

In principle, this should be straightforward: we already read byte ranges from local file systems and S3, so NetCDF would just need:

(1) To grab dataset attributes, dimensions, and other metadata from the JSON file itself.
(2) To read the raw binary byte streams based on the triplet references in the JSON (rather than from wherever that information lives in the HDF5 layout); a sketch of these two steps follows below.
(3) To hand those byte streams to everything downstream in libnczarr (shuffle, decompression, stitching arrays together from chunks, and so on), which should be identical to the current implementation.
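
As a rough sketch of steps (1) and (2), the lookup side might look something like the following. cJSON is used purely for illustration (libnczarr would presumably use its own JSON machinery), read_chunk is the hypothetical helper sketched earlier, and the layout assumed is Kerchunk's version-1 reference format, where the key/triplet mapping shown above sits under a top-level "refs" object and metadata keys such as Rad/.zarray and Rad/.zattrs carry the Zarr metadata inline as JSON-encoded strings.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cjson/cJSON.h>

    /* Hypothetical lookup of one chunk in a parsed reference file
     * (assuming the version-1 layout: {"version": 1, "refs": {...}}).
     * Inline (non-triplet) references are ignored for brevity. */
    static unsigned char *fetch_chunk(const cJSON *root, const char *key,
                                      size_t *lenp)
    {
        const cJSON *refs  = cJSON_GetObjectItemCaseSensitive(root, "refs");
        const cJSON *entry = cJSON_GetObjectItemCaseSensitive(refs, key);
        if (!cJSON_IsArray(entry) || cJSON_GetArraySize(entry) != 3)
            return NULL;   /* not a [path, offset, length] triplet */

        const char *path = cJSON_GetArrayItem(entry, 0)->valuestring;
        long offset      = (long)cJSON_GetArrayItem(entry, 1)->valuedouble;
        size_t length    = (size_t)cJSON_GetArrayItem(entry, 2)->valuedouble;

        *lenp = length;
        /* Step (2): raw byte-range read with the helper sketched earlier;
         * step (3) then hands this buffer to the usual shuffle/decompress
         * pipeline already in libnczarr. */
        return read_chunk(path, offset, length);
    }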

@ksharonin has done some excellent preliminary work on reading chunks from such reference triplets, which should serve as a useful template for implementing this in the core NetCDF library: https://github.com/ksharonin/kerchunkC/tree/master/code/c%2B%2B

CC: @amdasilva
