Skip to content

Add a function to produce a ManifestStore from HDF5 files #516

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Apr 4, 2025

Conversation

maxrjones
Copy link
Member

First draft of first steps for addressing #498, based on changes originally proposed in #490

@maxrjones
Copy link
Member Author

FYI I'll return to this PR later to fix the failing test - the issue is that we now have support for multiple chunk_key_encodings but no way yet to specify which is used in the Zarr V3 array metadata.

@TomNicholas
Copy link
Member

This seems good - you could potentially go the whole way to using open_zarr to implement loadable_variables inside the HDF reader in this PR. Then we do that for every other reader, then we factor that outside of the readers (i.e. reduce them to each just being a function).

@maxrjones
Copy link
Member Author

maxrjones commented Mar 28, 2025

This seems good - you could potentially go the whole way to using open_zarr to implement loadable_variables inside the HDF reader in this PR. Then we do that for every other reader, then we factor that outside of the readers (i.e. reduce them to each just being a function).

The abstraction described here is a necessary prerequisite. Basically we need a utility function like get_default_object_store(filepath) -> ObjectStore that goes from a valid URL to an ObjectStore instance in order to make the store argument in _create_manifest_store() optional.

While there's a lot of code in https://github.com/developmentseed/obstore/blob/main/obstore/python/obstore/fsspec.py that would help the development of that function, I think it's a big enough task to warrant its own PR.

Somewhat relatedly, I'm planning to take tomorrow off and don't want to slow you down. So, my vote would be for you or @sharkinsspatial to take over this PR if you'd want to make progress before Monday.

@maxrjones
Copy link
Member Author

The typing error is because we don't have a way to handle "empty" chunk manifests in a ManifestArray. We need an analogous solution to the current approach of an empty numpy array in a xarray Variable, but I don't understand the circumstances requiring an empty chunk manifest well enough to propose a solution.

Comment on lines +65 to +66
@staticmethod
def _construct_manifest_array(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every staticmethod here aside from open_virtual_dataset may as well just be an actual function. You will need to do that anyway later to refactor to make the reader just one function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you could leave that to do in #498.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 That is the plan. I think @maxrjones intent here was to make a very small change set to demonstrate the feasibility of using ManifestStore and that we would refactor the structure in a PR for #498.

Comment on lines +27 to +29
manifest = HDFVirtualBackend._dataset_chunk_manifest(
path=empty_dataset_hdf5_file, dataset=ds
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally it's better to write tests that use higher-level public API instead of directly calling internals if you can help it, but that's just a nit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I think the tests will be easier to structure when we move away from VirtualBacked base to something more functional and the returned ManifestStore will be simpler to introspect and reason about than a complete xr.Dataset.

@sharkinsspatial sharkinsspatial merged commit e707375 into develop Apr 4, 2025
10 checks passed
@sharkinsspatial sharkinsspatial deleted the hdf-reader-function branch April 4, 2025 00:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants