[Draft] Zarr reader #271

Open · wants to merge 5 commits into main

Conversation

@norlandrhagen (Collaborator) commented Oct 24, 2024:

WIP PR to add a Zarr reader. Thanks to @TomNicholas for the "how to write a reader" guide.

  • Closes #262 (Add Zarr Reader(s))
  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation

Scope of this PR:

  • Read v2 Zarr

Future PR(s):

  • Read v3 Zarr
  • Add in existing Zarr v3 Chunk Manifest
  • sharded v3 data
  • get chunk size with zarr-python (Store.getsize, added in zarr-python#2426) instead of fsspec
  • optimizations (e.g. using async interface to list lengths of chunks for each variable concurrently)

To Do:

  • Open the store using zarr-python v3 (behind a protected import). This should handle both v2 and v3 stores for us.
  • Use zarr-python to list the variables in the store, and check that all loadable_variables are present
    For each virtual variable:
  • Use zarr-python to get the attributes and the dimension names, and coordinate names (which come from the .zmetadata or zarr.json)
  • Use zarr-python to also get the dtype and chunk grid info + everything else needed to create the virtualizarr.zarr.ZArray object (eventually we can skip this step and use a zarr-python array metadata class directly instead of virtualizarr.zarr.ZArray, see
    Replace VirtualiZarr.ZArray with zarr ArrayMetadata #175)
  • Use the knowledge of the store location, variable name, and the zarr format to deduce which directory / S3 prefix the chunks must live in.
  • List all the chunks in that directory using fsspec.ls(detail=True), as that should also return the nbytes of each chunk. Remember that chunks are allowed to be missing.
  • The offset of each chunk is just 0 (ignoring sharding for now), and the length is the file size fsspec returned. The paths are just all the paths fsspec listed.
  • Parse the path and length information returned by fsspec into the structure that we can pass to ChunkManifest.init
  • Create a ManifestArray from our ChunkManifest and ZArray
  • Wrap that ManifestArray in an xarray.Variable, using the dims and attrs we read before (a rough sketch of this wrapping follows this list)
  • Get the loadable_variables by just using xr.open_zarr on the same store (should use drop_variables to avoid handling the virtual variables that we already have).
  • Use separate_coords to set the correct variables as coordinate variables (and avoid building indexes whilst doing it)
  • Merge all the variables into one xr.Dataset and return it.
  • All the above should be wrapped in a virtualizarr.readers.zarr.open_virtual_dataset function, which then should be called as a method from a ZarrVirtualBackend(VirtualBackend) subclass.
  • Finally add that ZarrVirtualBackend to the list of readers in virtualizarr.backend.py
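Purely as an illustration of the ChunkManifest -> ManifestArray -> xarray.Variable steps above, here is a minimal sketch. The function name is illustrative, not the final API, and it assumes ChunkManifest accepts a dict of chunk-key entries and ManifestArray takes zarray/chunkmanifest keyword arguments:

# Minimal sketch of the per-variable wrapping described in the steps above.
# Assumes ChunkManifest(entries=...) and ManifestArray(zarray=..., chunkmanifest=...).
import xarray as xr

from virtualizarr.manifests import ChunkManifest, ManifestArray
from virtualizarr.zarr import ZArray


def wrap_as_virtual_variable(
    chunk_entries: dict,  # e.g. {"0.0.0": {"path": "store/air/0.0.0", "offset": 0, "length": 1234}}
    zarray: ZArray,
    dims: list[str],
    attrs: dict,
) -> xr.Variable:
    # build the manifest of chunk references, then wrap it so xarray sees an array-like
    manifest = ChunkManifest(entries=chunk_entries)
    marr = ManifestArray(zarray=zarray, chunkmanifest=manifest)
    return xr.Variable(dims=dims, data=marr, attrs=attrs)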

@norlandrhagen (Collaborator, Author) commented:

#273

@norlandrhagen (Collaborator, Author) commented:

Bit of an update. With help from @sharkinsspatial, @abarciauskas-bgse, and @maxrjones, I got a Zarr store loaded as a virtual dataset.

<xarray.Dataset> Size: 3kB
Dimensions:  (time: 10, lat: 9, lon: 18)
Coordinates:
    lat      (lat) float32 36B ManifestArray<shape=(9,), dtype=float32, chunk...
    lon      (lon) float32 72B ManifestArray<shape=(18,), dtype=float32, chun...
  * time     (time) datetime64[ns] 80B 2013-01-01 ... 2013-01-03T06:00:00
Data variables:
    air      (time, lat, lon) int16 3kB ManifestArray<shape=(10, 9, 18), dtyp...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

Next up is how to deal with fill_values.

When I try to write it to Kerchunk JSON, I’m running into some fill_value dtype issues in the Zarray.

ZArray(shape=(10,), chunks=(10,), dtype='<f4', fill_value=np.float32(nan), order='C', compressor=None, filters=None, zarr_format=2)

Where fill_value=np.float32(nan). When I try to write these to JSON via ds.virtualize.to_kerchunk(format="dict"), I get TypeError: np.float32(nan) is not JSON serializable.

Wondering how fill_values like np.float32(nan) should be handled.

There seems to be some logic for converting fill_values in @sharkinsspatial's HDF reader. It also looks like there is some fill_value handling in zarr.py.
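For reference, one possible way to coerce such fill_values into something json.dumps accepts, as a sketch only and not necessarily what the HDF reader or zarr.py does; note that the Zarr v2 spec stores NaN fill values as the string "NaN" in .zarray metadata:

import math
import numpy as np


def json_safe_fill_value(fill_value):
    """Convert numpy scalar fill_values into plain Python values for json.dumps."""
    if fill_value is None:
        return None
    if isinstance(fill_value, np.floating):
        fill_value = float(fill_value)
    elif isinstance(fill_value, np.integer):
        fill_value = int(fill_value)
    # Zarr v2 metadata represents NaN fill values as the string "NaN"
    if isinstance(fill_value, float) and math.isnan(fill_value):
        return "NaN"
    return fill_value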

@TomNicholas (Member) commented:

> I got a Zarr loaded as a virtual dataset.

Amazing!

> When I try to write it to Kerchunk JSON, I’m running into some fill_value dtype issues in the Zarray.
>
> ZArray(shape=(10,), chunks=(10,), dtype='<f4', fill_value=np.float32(nan), order='C', compressor=None, filters=None, zarr_format=2)
>
> Where fill_value=np.float32(nan). When I try to write these to JSON via ds.virtualize.to_kerchunk(format="dict"), I get TypeError: np.float32(nan) is not JSON serializable.
>
> Wondering how fill_values like np.float32(nan) should be handled.

This seems like an issue that should actually be orthogonal to this PR (if it weren't for the ever-present difficulty of testing). Either the problem is in the ZArray class and what types it allows, or it's in the Kerchunk writer not knowing how to serialize a valid ZArray. Either way, if np.float32(nan) is a valid fill_value for a zarr array, then it's not the fault of the new zarr reader.

@TomNicholas (Member) left a comment:

This is a great start! I think the main thing here is that we don't actually need kerchunk in order to test this reader.

Comment on lines +93 to +99
# we should parameterize for:
# - group
# - drop_variables
# - loadable_variables
# - store readable over cloud storage?
# - zarr store version v2, v3
# testing out pytest parameterization with dataclasses :shrug: -- we can revert to a more normal style

I think it's great to test all these cases, but they don't need to be simultaneously parametrized over, because we don't need to test the matrix of all possible combinations of these things.


I think parametrizing over v2 vs v3 would be good though, as every feature should be tested for both versions.
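Something along these lines might be enough (a sketch; create_example_store is a hypothetical helper that writes a small store in the requested format, and the expected dims mirror the air example above):

import pytest

from virtualizarr import open_virtual_dataset


@pytest.mark.parametrize("zarr_format", [2, 3])
def test_open_virtual_dataset(tmp_path, zarr_format):
    # create_example_store is a hypothetical helper fixture/function
    store = create_example_store(tmp_path, zarr_format=zarr_format)
    vds = open_virtual_dataset(store)
    assert set(vds.dims) == {"time", "lat", "lon"}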

Comment on lines +148 to +150
# Do we have a good way in XRT to compare virtual datasets to xarray datasets? assert_duckarray_allclose? or just roundtrip it.
# from xarray.testing import assert_duckarray_allclose
# xrt.assert_allclose(ds, vds)

Before adding to test_integration.py I would first create a tests/test_readers.py/test_zarr.py and put tests in there. Those tests should open a virtual dataset from a zarr store and assert things about the contents of the ManifestArrays, checking they match what you would expect based on the contents of the store. That's important because it's a kerchunk-free way to do useful testing.
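For example, something like this (a sketch; example_zarr_store is a hypothetical fixture pointing at a small local store, and it assumes the ChunkManifest can be viewed as a dict of chunk entries):

from virtualizarr import open_virtual_dataset
from virtualizarr.manifests import ManifestArray


def test_chunk_manifest_matches_store(example_zarr_store):
    vds = open_virtual_dataset(example_zarr_store)
    marr = vds["air"].data
    assert isinstance(marr, ManifestArray)
    assert marr.shape == (10, 9, 18)
    # every chunk entry should point into the store itself, with offset 0
    for entry in marr.manifest.dict().values():
        assert entry["offset"] == 0
        assert entry["path"].startswith(str(example_zarr_store))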

loadable_variables,
)

# can we avoid fsspec here if we are using zarr-python for all the reading?

We should be aiming for that yes.
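One possible fsspec-free direction, assuming zarr-python v3's async Store.list_prefix together with the Store.getsize method referenced above (zarr-python#2426); exact signatures may differ:

import asyncio


async def chunk_lengths(store, prefix: str) -> dict[str, int]:
    """List chunk keys under a prefix and look up their sizes via the zarr store API."""
    lengths = {}
    async for key in store.list_prefix(prefix):
        lengths[key] = await store.getsize(key)
    return lengths


# usage sketch, e.g.: lengths = asyncio.run(chunk_lengths(store, "air/"))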

Comment on lines +59 to +63
assert set(
    loadable_variables
).issubset(
    set(zarr_arrays)
), f"loadable_variables ({loadable_variables}) is not a subset of variables in existing Zarr store. This zarr contains: {zarr_arrays}"

It's bad practice to use assert for runtime checks, because Python strips assert statements entirely when run with the -O optimization flag.

Suggested change
- assert set(
-     loadable_variables
- ).issubset(
-     set(zarr_arrays)
- ), f"loadable_variables ({loadable_variables}) is not a subset of variables in existing Zarr store. This zarr contains: {zarr_arrays}"
+ missing_vars = set(loadable_variables) - set(zarr_arrays)
+ if missing_vars:
+     raise ValueError(
+         f"Some loadable variables specified are not present in this zarr store: {missing_vars}"
+     )


# 1b.

zg = zarr.open_group(filepath, mode="r")

Everything below here is for a single group, so should be refactored into a function that indicates as such (so that we can call that function multiple times to construct a datatree).
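Roughly the shape that refactor could take (a sketch; virtual_variable_from_array is a hypothetical per-array helper, in line with the single-array comments below):

import zarr
import xarray as xr


def virtual_dataset_from_group(
    group: zarr.Group,
    drop_variables: set[str],
    loadable_variables: set[str],
) -> xr.Dataset:
    """Build the virtual variables for a single zarr group, so a later PR can
    call this once per node when assembling a DataTree."""
    virtual_variables = {
        name: virtual_variable_from_array(array)  # hypothetical per-array helper
        for name, array in group.arrays()
        if name not in drop_variables and name not in loadable_variables
    }
    return xr.Dataset(virtual_variables, attrs=group.attrs.asdict())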

Comment on lines +73 to +74
# 3. For each virtual variable:
for var in virtual_vars:

Everything inside here is for a single array so should also probably be refactored into a function

Comment on lines +123 to +136
# array path to use for all chunks
array_path = fsspec.utils._unstrip_protocol(
    array_mapper.root, array_mapper.fs
)

array_chunk_sizes = {
    val["name"].split("/")[-1]: {
        "path": array_path,
        "offset": 0,
        "length": val["size"],
    }
    for val in array_mapper.fs.ls(array_mapper.root, detail=True)
    if not val["name"].endswith((".zarray", ".zattrs", ".zgroup"))
}

Should also be a function
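For instance, the comprehension above pulled out into its own helper (a sketch; the name is illustrative):

import fsspec

ZARR_V2_METADATA_FILES = (".zarray", ".zattrs", ".zgroup")


def chunk_entries_for_array(array_mapper) -> dict:
    """Map each chunk key to its path/offset/length using an fsspec listing."""
    array_path = fsspec.utils._unstrip_protocol(array_mapper.root, array_mapper.fs)
    return {
        val["name"].split("/")[-1]: {
            "path": array_path,
            "offset": 0,
            "length": val["size"],
        }
        for val in array_mapper.fs.ls(array_mapper.root, detail=True)
        if not val["name"].endswith(ZARR_V2_METADATA_FILES)
    }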


all_array_dims.extend([dim for dim in array_dims])

coord_names = list(set(all_array_dims))

This seems incorrect? I also don't actually see why you need to collect all_array_dims

coord_names = list(set(all_array_dims))

# 4 Get the loadable_variables by just using xr.open_zarr on the same store (should use drop_variables to avoid handling the virtual variables that we already have).
# We want to drop 'drop_variables' but also virtual variables since we already **manifested** them.

groan

loadable_vars=loadable_vars,
indexes=indexes,
coord_names=coord_names,
attrs=zg.attrs.asdict(),

you remembered the group-level attributes, nice
