Description
What is your issue?
The upcoming zarr-python PR: zarr-developers/zarr-python#2874 will change how zarr python 3 handles dtypes on zarr stores which will have implications for Xarray. I've run the branch against the Xarray tests in order to see what changes will be necessary here. I've found two categories changes required
1. Encoding Fill Values
Currently xarray
handles the encoding and decoding of fill values via it's own FillValueCoder
xarray/xarray/backends/zarr.py
Line 116 in 7551a7a
with the dtypes refactors the new zarr dtypes will provide their own methods to encode and decode fill values. This is a pretty simple update. the reading can be handled with a commit like this: ianhi@70a7e8b
2. Fixed Length Strings and Appending
This other types of tests that fail seem to be related to the validation logic of when appending can happen.
With the changes the test for appending mismatched length strings:
xarray/xarray/tests/test_backends.py
Lines 2882 to 2891 in 7551a7a
fails on this validation check:
xarray/xarray/backends/zarr.py
Lines 544 to 570 in 7551a7a
ValueError: Mismatched dtypes for variable temperature between Zarr store on disk and dataset to append. Store has dtype |S2 but dataset to append has dtype |S3.
/Users/ian/Documents/dev/xarray/xarray/backends/zarr.py:576: ValueError
If I bypass the validation check to allow the code to continue as it would have for prior versions of zarr then the expected dataset no longer matches the actual:
_____________________________________________________________________ TestZarrDirectoryStore.test_append_string_length_mismatch_works[3-S] _____________________________________________________________________
self = <xarray.tests.test_backends.TestZarrDirectoryStore object at 0x111684a10>, dtype = 'S'
@pytest.mark.parametrize("dtype", ["U", "S"])
def test_append_string_length_mismatch_works(self, dtype) -> None:
skip_if_zarr_format_2("This doesn't work with Zarr format 2")
# ...but it probably would if we used object dtype
ds, ds_to_append = create_append_string_length_mismatch_test_data(dtype)
expected = xr.concat([ds, ds_to_append], dim="time")
with self.create_zarr_target() as store_target:
ds.to_zarr(store_target, mode="w", **self.version_kwargs)
ds_to_append.to_zarr(store_target, append_dim="time", **self.version_kwargs)
actual = xr.open_dataset(store_target, engine="zarr")
> xr.testing.assert_identical(expected, actual)
E AssertionError: Left and right Dataset objects are not identical
E Differing data variables:
E L temperature (time) |S3 18B b'aa' b'bb' b'cc' b'aaa' b'bbb' b'ccc'
E R temperature (time) |S2 12B b'aa' b'bb' b'cc' b'aa' b'bb' b'cc'
/Users/ian/Documents/dev/xarray/xarray/tests/test_backends.py:2897: AssertionError
the comment in the validation function by @dcherian notes that fixed length strings are not easy to append, but seemingly the function used to pass them through as ok. I'm not 100% clear on what the intended behavior is here, any advice appreciated