Nested HDF5 Data / HEC-RAS #490

Open
@thwllms

Description

I'm working on development of the rashdf library for reading HEC-RAS HDF5 data. A big part of the motivation for the library is stochastic hydrologic/hydraulic modeling.

We want to be able to generate Zarr metadata for stochastic HEC-RAS outputs, so that, e.g., results from many different stochastic flood simulations of a given RAS model can be opened as a single xarray Dataset. For example, results from 100 simulations could be concatenated along a new simulation dimension, with the index number of each simulation as the coordinate. It took me a while to figure out how to make that happen, because RAS HDF5 data is highly nested and doesn't conform to typical conventions.
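To illustrate the target shape of the combined data, here is a minimal in-memory sketch using plain xr.concat (the variable and dimension names are hypothetical stand-ins for RAS results; the real goal is to get the same structure lazily via Zarr metadata rather than loading everything):

```python
import numpy as np
import xarray as xr

# Hypothetical stand-ins for per-simulation RAS results
# (3 timesteps x 2 cells each).
sims = [
    xr.Dataset({"depth": (("time", "cell"), np.random.rand(3, 2))})
    for _ in range(5)
]

# Concatenate along a new "simulation" dimension, with the simulation
# index as the coordinate -- the structure the generated Zarr metadata
# should ultimately expose.
combined = xr.concat(sims, dim="simulation")
combined = combined.assign_coords(simulation=range(5))
print(dict(combined.sizes))
```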

The way I implemented it is hacky:

  1. Given an xr.Dataset pulled from the HDF5 file and the path of each child xr.DataArray within that file,
  2. Get the filters for each DataArray: filters = SingleHdf5ToZarr._decode_filters(None, hdf_ds)
  3. Get the storage info for each DataArray: storage_info = SingleHdf5ToZarr._storage_info(None, hdf_ds)
  4. Build out metadata for chunks using storage_info
  5. "Write" the xr.Dataset to a zarr.MemoryStore with compute=False, to generate the framework of what's needed for the Zarr metadata
  6. Read the objects generated by writing to zarr.MemoryStore and decode
  7. Assemble the zarr.MemoryStore objects, filters, and storage_info into a dictionary and finally return
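Steps 4 and 7 above can be sketched with plain dictionaries. Here storage_info is a hypothetical value mirroring the shape returned by SingleHdf5ToZarr._storage_info (chunk index tuple → byte offset/size), and the variable path and file URL are made up for illustration:

```python
import json

# Hypothetical output shaped like SingleHdf5ToZarr._storage_info(None, dset):
# maps each chunk's index tuple to its byte range inside the HDF5 file.
storage_info = {
    (0, 0): {"offset": 4016, "size": 8192},
    (0, 1): {"offset": 12208, "size": 8192},
}

def chunk_refs(var_path, hdf5_url, storage_info):
    """Turn storage-info entries into kerchunk-style chunk references.

    Each key becomes "<var_path>/<i>.<j>" and each value the
    [url, offset, size] triple that a Zarr reference filesystem expects.
    """
    refs = {}
    for idx, info in storage_info.items():
        key = var_path + "/" + ".".join(str(i) for i in idx)
        refs[key] = [hdf5_url, info["offset"], info["size"]]
    return refs

refs = chunk_refs("Results/depth", "s3://bucket/sim_001.hdf", storage_info)
print(json.dumps(refs, indent=2))
```

Merging dictionaries like this with the decoded zarr.MemoryStore objects and filters yields the final reference dict.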

I suppose my questions are:

  • Is there a better way to approach highly nested or otherwise idiosyncratic HDF5 data with Kerchunk?
  • Could Kerchunk's SingleHdf5ToZarr._decode_filters and _storage_info methods be made public?
