[Draft] Zarr reader #271
base: main
Conversation
Bit of an update. With help from @sharkinsspatial, @abarciauskas-bgse, and @maxrjones I got a Zarr loaded as a virtual dataset:

<xarray.Dataset> Size: 3kB
Dimensions: (time: 10, lat: 9, lon: 18)
Coordinates:
lat (lat) float32 36B ManifestArray<shape=(9,), dtype=float32, chunk...
lon (lon) float32 72B ManifestArray<shape=(18,), dtype=float32, chun...
* time (time) datetime64[ns] 80B 2013-01-01 ... 2013-01-03T06:00:00
Data variables:
air (time, lat, lon) int16 3kB ManifestArray<shape=(10, 9, 18), dtyp...
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

Next up is how to deal with fill_value. When I try to write it to Kerchunk JSON, I'm running into some issues.
Wondering how to handle this. There seems to be some conversion logic in @sharkinsspatial's HDF reader for converting fill_values. It also looks like there is some fill_value handling in zarr.py.
Amazing!
This seems like an issue that should actually be orthogonal to this PR (if it weren't for the ever-present difficulty of testing). Either the problem is in the reader or in the Kerchunk writer.
This is a great start! I think the main thing here is that we don't actually need kerchunk in order to test this reader.
# we should parameterize for:
# - group
# - drop_variables
# - loadable_variables
# - store readable over cloud storage?
# - zarr store version v2, v3
# testing out pytest parameterization with dataclasses :shrug: -- we can revert to a more normal style
I think it's great to test all these cases, but they don't need to be simultaneously parametrized over, because we don't need to test the matrix of all possible combinations of these things.
I think parametrizing over v2 vs v3 would be good though, as every feature should be tested for both versions.
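A minimal sketch of what that parametrization could look like, assuming a recent xarray that accepts the zarr_format keyword in to_zarr, and using VirtualiZarr's open_virtual_dataset entry point; the test body is illustrative, not the PR's actual code:

import pytest
import xarray as xr

from virtualizarr import open_virtual_dataset


@pytest.mark.parametrize("zarr_format", [2, 3])
def test_open_virtual_dataset(tmp_path, zarr_format):
    # write a tiny store in the requested format, then open it virtually
    ds = xr.Dataset({"air": (("x",), [1, 2, 3])})
    store = str(tmp_path / "store.zarr")
    ds.to_zarr(store, zarr_format=zarr_format)

    vds = open_virtual_dataset(store)
    assert set(vds.variables) == set(ds.variables)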
# Do we have a good way in XRT to compare virtual datasets to xarray datasets? assert_duckarray_allclose? or just roundtrip it.
# from xarray.testing import assert_duckarray_allclose
# xrt.assert_allclose(ds, vds)
Before adding to test_integration.py I would first create a tests/test_readers/test_zarr.py and put tests in there. Those tests should open a virtual dataset from a zarr store and assert things about the contents of the ManifestArrays, checking they match what you would expect based on the contents of the store. That's important because it's a kerchunk-free way to do useful testing.
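For example, a kerchunk-free test might look roughly like this; the .manifest attribute and its .dict() accessor follow VirtualiZarr's ManifestArray API, but treat the details (the fixture name, the variable name "air") as assumptions:

from virtualizarr import open_virtual_dataset
from virtualizarr.manifests import ManifestArray


def test_chunk_references_point_into_store(zarr_store_path):
    vds = open_virtual_dataset(zarr_store_path)

    marr = vds["air"].data
    assert isinstance(marr, ManifestArray)

    # every chunk reference should point at a file inside the store itself
    for entry in marr.manifest.dict().values():
        assert str(zarr_store_path) in entry["path"]
        assert entry["length"] > 0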
loadable_variables,
)

# can we avoid fsspec here if we are using zarr-python for all the reading?
We should be aiming for that, yes.
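For a zarr-python 2 store, for instance, the store mapping itself can enumerate keys and report sizes, so a sketch like this could replace the fsspec listing (it assumes the store implements getsize, as zarr 2's file-based stores do, and reuses the filepath defined later in the reader):

import zarr

zg = zarr.open_group(filepath, mode="r")

# chunk keys are everything that isn't zarr metadata
chunk_keys = [
    key for key in zg.store.keys()
    if not key.endswith((".zarray", ".zattrs", ".zgroup"))
]
chunk_lengths = {key: zg.store.getsize(key) for key in chunk_keys}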
assert set(
    loadable_variables
).issubset(
    set(zarr_arrays)
), f"loadable_variables ({loadable_variables}) is not a subset of variables in existing Zarr store. This zarr contains: {zarr_arrays}"
It's bad practice to use assert for runtime checks, as in theory some Python compilers will remove all assert statements as an optimization.
Suggested change:
-assert set(
-    loadable_variables
-).issubset(
-    set(zarr_arrays)
-), f"loadable_variables ({loadable_variables}) is not a subset of variables in existing Zarr store. This zarr contains: {zarr_arrays}"
+missing_vars = set(loadable_variables) - set(zarr_arrays)
+if missing_vars:
+    raise ValueError(
+        f"Some loadable variables specified are not present in this zarr store: {missing_vars}"
+    )
# 1b.
zg = zarr.open_group(filepath, mode="r")
Everything below here is for a single group, so should be refactored into a function that indicates as such (so that we can call that function multiple times to construct a datatree).
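Something along these lines, where virtual_dataset_from_group and open_virtual_datatree are hypothetical names, and DataTree assumes an xarray version that ships it:

import xarray as xr
import zarr


def virtual_dataset_from_group(group: zarr.Group) -> xr.Dataset:
    """Build a virtual dataset for the variables of a single zarr group."""
    ...  # the per-group logic currently inlined below


def open_virtual_datatree(filepath: str) -> xr.DataTree:
    # call the per-group function once per group to assemble a datatree
    root = zarr.open_group(filepath, mode="r")
    datasets = {"/": virtual_dataset_from_group(root)}
    datasets.update(
        {name: virtual_dataset_from_group(group) for name, group in root.groups()}
    )
    return xr.DataTree.from_dict(datasets)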
# 3. For each virtual variable:
for var in virtual_vars:
Everything inside here is for a single array, so should also probably be refactored into a function.
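A hedged sketch of that extraction, pairing one zarr array's metadata with its chunk manifest via VirtualiZarr's ZArray / ChunkManifest / ManifestArray building blocks; the function name and the _ARRAY_DIMENSIONS lookup (a zarr v2 convention) are assumptions:

import xarray as xr

from virtualizarr.manifests import ChunkManifest, ManifestArray
from virtualizarr.zarr import ZArray


def virtual_variable_from_zarr_array(arr, chunk_entries: dict) -> xr.Variable:
    # translate the zarr array's metadata into VirtualiZarr's ZArray
    zarray = ZArray(
        shape=arr.shape,
        chunks=arr.chunks,
        dtype=arr.dtype,
        fill_value=arr.fill_value,
    )
    manifest = ChunkManifest(entries=chunk_entries)
    marr = ManifestArray(zarray=zarray, chunkmanifest=manifest)
    dims = arr.attrs.get("_ARRAY_DIMENSIONS", [])  # v2 dimension names
    return xr.Variable(dims=dims, data=marr, attrs=dict(arr.attrs))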
# array path to use for all chunks
array_path = fsspec.utils._unstrip_protocol(
    array_mapper.root, array_mapper.fs
)

array_chunk_sizes = {
    val["name"].split("/")[-1]: {
        "path": array_path,
        "offset": 0,
        "length": val["size"],
    }
    for val in array_mapper.fs.ls(array_mapper.root, detail=True)
    if not val["name"].endswith((".zarray", ".zattrs", ".zgroup"))
}
Should also be a function
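Wrapped into a function it might read like this (logic unchanged from the diff above, including the shared array_path, whose correctness per chunk is a separate question):

import fsspec


def chunk_entries_for_array(array_mapper) -> dict:
    # list the array's chunk files and turn them into manifest entries
    array_path = fsspec.utils._unstrip_protocol(array_mapper.root, array_mapper.fs)
    return {
        val["name"].split("/")[-1]: {
            "path": array_path,
            "offset": 0,
            "length": val["size"],
        }
        for val in array_mapper.fs.ls(array_mapper.root, detail=True)
        if not val["name"].endswith((".zarray", ".zattrs", ".zgroup"))
    }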
all_array_dims.extend([dim for dim in array_dims])

coord_names = list(set(all_array_dims))
This seems incorrect? I also don't actually see why you need to collect all_array_dims.
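For reference, readers usually take coordinate names to be the variables that share a name with one of their own dimensions (the dimension-coordinate rule), which needs no global dim collection; a one-line sketch, assuming virtual_vars maps names to variables:

coord_names = [name for name, var in virtual_vars.items() if name in var.dims]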
coord_names = list(set(all_array_dims))

# 4. Get the loadable_variables by just using xr.open_zarr on the same store (should use drop_variables to avoid handling the virtual variables that we already have).
# We want to drop 'drop_variables' but also virtual variables since we already **manifested** them.
groan
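The call itself is simple enough; a sketch of step 4, assuming drop_variables, virtual_vars, and filepath are the objects built above:

import xarray as xr

# load only what wasn't already virtualized
to_drop = set(drop_variables or []) | set(virtual_vars)
loadable_ds = xr.open_zarr(filepath, drop_variables=list(to_drop))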
loadable_vars=loadable_vars,
indexes=indexes,
coord_names=coord_names,
attrs=zg.attrs.asdict(),
you remembered the group-level attributes, nice
WIP PR to add a Zarr reader. Thanks to @TomNicholas for the "how to write a reader" guide.
- docs/releases.rst
- api.rst

Scope of this PR:
Future PR(s):

To Do:
- For each virtual variable: see "Replace VirtualiZarr.ZArray with zarr ArrayMetadata" (#175)
- Get the loadable_variables by just using xr.open_zarr on the same store (should use drop_variables to avoid handling the virtual variables that we already have).