Merged
18 changes: 13 additions & 5 deletions kerchunk/hdf.py
@@ -1,4 +1,5 @@
 import base64
+import io
 import logging
 from typing import Union, BinaryIO
@@ -47,7 +48,7 @@ class SingleHdf5ToZarr:
         to BinaryIO is optional), in which case must also provide url. If a str,
         file will be opened using fsspec and storage_options.
     url : string
-        URI of the HDF5 file, if passing a file-like object
+        URI of the HDF5 file, if passing a file-like object or h5py File/Group
     spec : int
         The version of output to produce (see README of this repo)
     inline_threshold : int
@@ -73,7 +74,7 @@ class SingleHdf5ToZarr:

     def __init__(
         self,
-        h5f: "BinaryIO | str",
+        h5f: "BinaryIO | str | h5py.File | h5py.Group",
         url: str = None,
         spec=1,
         inline_threshold=500,
@@ -89,15 +90,22 @@ def __init__(
             fs, path = fsspec.core.url_to_fs(h5f, **(storage_options or {}))
             self.input_file = fs.open(path, "rb")
             url = h5f
-        else:
+            self._h5f = h5py.File(self.input_file, mode="r")
+        elif isinstance(h5f, io.IOBase):
             self.input_file = h5f
+            self._h5f = h5py.File(self.input_file, mode="r")
+        elif isinstance(h5f, (h5py.File, h5py.Group)):
+            # assume h5py object (File or group/dataset)
+            self._h5f = h5f
+            fs, path = fsspec.core.url_to_fs(url, **(storage_options or {}))
+            self.input_file = fs.open(path, "rb")
Contributor: I don't think you need these two lines anymore (they certainly mess up my use case, where the file is an S3 object): the file is already loaded as a File object up in the first branch of the conditional, and if h5f is an h5py.Group it should be kept that way, with self._h5f set to it.

Member Author: _h5f is indeed set to the input two lines above. These two lines exist for any inlining that might happen, which requires getting bytes directly from the original file, not going via h5py.

> mess up my use case

What happens? I think providing the URL/options will certainly be required.

Contributor: in my case it's looking for a local file even if I pass valid S3 storage_options. Leave it like this for now; I'll need to do a wee bit more testing to understand what's going on, and will get back to you if Kerchunk needs changing 👍

Member Author: The URL starts with "s3://"?

Contributor: yes and no 🤣 It's a very peculiar bucket. The storage-options dict that s3fs recognizes is:

    {'key': 'xxxx', 'secret': 'xxxx', 'client_kwargs': {'endpoint_url': 'https://uor-aces-o.s3-ext.jc.rl.ac.uk'}, 'default_fill_cache': False, 'default_cache_type': 'first'}

and the call to s3fs that is able to read such a strange bucket is as follows:

    fs = s3fs.S3FileSystem(**storage_options)
    with fs.open(file_url, 'rb') as s3file:
        ...

but file_url needs to be the truncated form (bucket + file name), i.e. bnl/da193a_25_day__198807-198807.nc in this case, and s3fs assembles the full URL from the endpoint URL plus that truncated bucket + filename. It's odd, and I'm not 100% sure why this type of S3 storage wants that configuration, but the bottom line is that when Kerchunk tries to open it as a regular S3 file it doesn't work. Even if I prepend a correct full s3://... path to the file, I get Forbidden access, since the storage identification is done wrongly.

Member Author:

    s3://uor-aces-o.s3-ext.jc.rl.ac.uk/bnl/da193a_25_day__198807-198807.nc

This is definitely not the right URL: the first part should be the bucket, not a server name (I'm surprised it even attempts to connect). The URL should be "s3://bnl/da193a_25_day__198807-198807.nc", as the server/endpoint is already included in the storage options.
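The convention described here is that an s3:// URL carries only the bucket and key, while the server comes from endpoint_url in the storage options. A minimal sketch with a hypothetical helper (object_http_url is not part of s3fs or kerchunk) shows how the two pieces combine into the real object address:

```python
from urllib.parse import urlparse

def object_http_url(s3_url: str, endpoint_url: str) -> str:
    # Hypothetical illustration: the s3:// URL contributes only
    # bucket and key; the server comes from the endpoint option.
    parts = urlparse(s3_url)
    bucket = parts.netloc            # first component of the s3:// URL
    key = parts.path.lstrip("/")     # remainder is the object key
    return f"{endpoint_url.rstrip('/')}/{bucket}/{key}"

print(object_http_url(
    "s3://bnl/da193a_25_day__198807-198807.nc",
    "https://uor-aces-o.s3-ext.jc.rl.ac.uk",
))
# → https://uor-aces-o.s3-ext.jc.rl.ac.uk/bnl/da193a_25_day__198807-198807.nc
```

Putting the server name into the s3:// URL, as in the quoted example above, would make the first path component (the server) be treated as the bucket, which explains the Forbidden access error.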

Contributor: blast! That worked! I knew I wasn't doing something right 😆

Contributor: though I am getting fairly long times from visititems(), very much comparable to the times when no kerchunking is done on a single Group but rather on the entire file.

Contributor: ah, that's because this self._h5f = h5py.File(self.input_file, mode="r") is a few lines down 😁

Member Author: (oops, fixed)

+        else:
+            raise ValueError("type of input `h5f` not recognised")
         self.spec = spec
         self.inline = inline_threshold
         if vlen_encode not in ["embed", "null", "leave", "encode"]:
             raise NotImplementedError
         self.vlen = vlen_encode
-        self._h5f = h5py.File(self.input_file, mode="r")

         self.store = out or {}
         self._zroot = zarr.group(store=self.store, overwrite=True)