Skip to content

Zarr v3 Dtype refactor changes needed in Xarray #10329

Open
@ianhi

Description

@ianhi

What is your issue?

The upcoming zarr-python PR: zarr-developers/zarr-python#2874 will change how zarr python 3 handles dtypes on zarr stores which will have implications for Xarray. I've run the branch against the Xarray tests in order to see what changes will be necessary here. I've found two categories changes required

1. Encoding Fill Values

Currently xarray handles the encoding and decoding of fill values via it's own FillValueCoder

class FillValueCoder:

with the dtypes refactors the new zarr dtypes will provide their own methods to encode and decode fill values. This is a pretty simple update. the reading can be handled with a commit like this: ianhi@70a7e8b

2. Fixed Length Strings and Appending

This other types of tests that fail seem to be related to the validation logic of when appending can happen.

With the changes the test for appending mismatched length strings:

def test_append_string_length_mismatch_works(self, dtype) -> None:
skip_if_zarr_format_2("This doesn't work with Zarr format 2")
# ...but it probably would if we used object dtype
ds, ds_to_append = create_append_string_length_mismatch_test_data(dtype)
expected = xr.concat([ds, ds_to_append], dim="time")
with self.create_zarr_target() as store_target:
ds.to_zarr(store_target, mode="w", **self.version_kwargs)
ds_to_append.to_zarr(store_target, append_dim="time", **self.version_kwargs)
actual = xr.open_dataset(store_target, engine="zarr")
xr.testing.assert_identical(expected, actual)

fails on this validation check:

def _validate_datatypes_for_zarr_append(vname, existing_var, new_var):
"""If variable exists in the store, confirm dtype of the data to append is compatible with
existing dtype.
"""
if (
np.issubdtype(new_var.dtype, np.number)
or np.issubdtype(new_var.dtype, np.datetime64)
or np.issubdtype(new_var.dtype, np.bool_)
or new_var.dtype == object
or (new_var.dtype.kind in ("S", "U") and existing_var.dtype == object)
):
# We can skip dtype equality checks under two conditions: (1) if the var to append is
# new to the dataset, because in this case there is no existing var to compare it to;
# or (2) if var to append's dtype is known to be easy-to-append, because in this case
# we can be confident appending won't cause problems. Examples of dtypes which are not
# easy-to-append include length-specified strings of type `|S*` or `<U*` (where * is a
# positive integer character length). For these dtypes, appending dissimilar lengths
# can result in truncation of appended data. Therefore, variables which already exist
# in the dataset, and with dtypes which are not known to be easy-to-append, necessitate
# exact dtype equality, as checked below.
pass
elif not new_var.dtype == existing_var.dtype:
raise ValueError(
f"Mismatched dtypes for variable {vname} between Zarr store on disk "
f"and dataset to append. Store has dtype {existing_var.dtype} but "
f"dataset to append has dtype {new_var.dtype}."
)

ValueError: Mismatched dtypes for variable temperature between Zarr store on disk and dataset to append. Store has dtype |S2 but dataset to append has dtype |S3.

/Users/ian/Documents/dev/xarray/xarray/backends/zarr.py:576: ValueError

If I bypass the validation check to allow the code to continue as it would have for prior versions of zarr then the expected dataset no longer matches the actual:

_____________________________________________________________________ TestZarrDirectoryStore.test_append_string_length_mismatch_works[3-S] _____________________________________________________________________

self = <xarray.tests.test_backends.TestZarrDirectoryStore object at 0x111684a10>, dtype = 'S'

    @pytest.mark.parametrize("dtype", ["U", "S"])
    def test_append_string_length_mismatch_works(self, dtype) -> None:
        skip_if_zarr_format_2("This doesn't work with Zarr format 2")
        # ...but it probably would if we used object dtype
        ds, ds_to_append = create_append_string_length_mismatch_test_data(dtype)
        expected = xr.concat([ds, ds_to_append], dim="time")
        with self.create_zarr_target() as store_target:
            ds.to_zarr(store_target, mode="w", **self.version_kwargs)
            ds_to_append.to_zarr(store_target, append_dim="time", **self.version_kwargs)
            actual = xr.open_dataset(store_target, engine="zarr")
>           xr.testing.assert_identical(expected, actual)
E           AssertionError: Left and right Dataset objects are not identical
E           Differing data variables:
E           L   temperature  (time) |S3 18B b'aa' b'bb' b'cc' b'aaa' b'bbb' b'ccc'
E           R   temperature  (time) |S2 12B b'aa' b'bb' b'cc' b'aa' b'bb' b'cc'

/Users/ian/Documents/dev/xarray/xarray/tests/test_backends.py:2897: AssertionError

the comment in the validation function by @dcherian notes that fixed length strings are not easy to append, but seemingly the function used to pass them through as ok. I'm not 100% clear on what the intended behavior is here, any advice appreciated

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions