
Control CF-encoding in to_zarr() #5405

@forman

Description

Is your feature request related to a problem? Please describe.

I believe xarray's dataset.to_zarr() is somewhat inconsistent between creating variables and appending data to existing variables: when creating variables, it can deal with writing already encoded data; when appending, it expects decoded data.

When appending data, xarray always CF-encodes the variable data according to the encoding information of the existing variables before the new data is appended. This is fine if the data to be appended is decoded, but if it is already encoded (e.g. because it was previously read by dataset = xr.open_dataset(..., decode_cf=False)), this leads to entirely corrupt data.

See also xarray issue #5263 and my actual problem described in bcdev/nc2zarr#35.
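
The corruption is plain wrap-around arithmetic: appending divides the already-encoded uint16 values by scale_factor a second time, and the result overflows the uint16 range. Here is a minimal sketch of this, assuming the float32 intermediate dtype that xarray picks for small integer dtypes (the numbers match the failing test below):

import numpy as np

scale_factor = 0.0001
already_encoded = np.array([0, 10000, 15000, 20000], dtype=np.uint16)
# Re-applying the store's encoding divides by scale_factor a second time,
# e.g. 10000 / 0.0001 = 1e8, far outside the uint16 range.
re_encoded = already_encoded.astype(np.float32) / np.float32(scale_factor)
# The cast back to uint16 wraps modulo 2**16: 1e8 % 65536 == 57600.
print(re_encoded.astype(np.int64) % 2**16)  # [    0 57600 53632 49664]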

Describe the solution you'd like

A possible hack, described in #5263, is to redundantly apply dataset = decode_cf(dataset) before appending, so that the subsequent encoding ends up being a no-op. Of course, this also costs extra CPU time for a useless computation.
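
A minimal sketch of that workaround, using the paths and dimension from the test below:

import xarray as xr

# Data read with decode_cf=False is still encoded; decoding it first makes
# the implicit re-encoding in to_zarr() a no-op on append.
encoded_dataset = xr.open_zarr('src_ds_2.zarr', decode_cf=False)
decoded_dataset = xr.decode_cf(encoded_dataset)
decoded_dataset.to_zarr('dst_ds.zarr', append_dim='t')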

I'd like to be able to control whether the data is encoded when appending. If I already have encoded data, I'd like to call encoded_dataset.to_zarr(..., append_dim='time', encode_cf=False).
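
With such a flag, the append from the workaround above would reduce to the following (note that encode_cf is the proposed parameter; it does not exist yet):

encoded_dataset = xr.open_zarr('src_ds_2.zarr', decode_cf=False)
encoded_dataset.to_zarr('dst_ds.zarr', append_dim='t', encode_cf=False)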

For example, commenting out the following block (at line 469 in xarray/backends/zarr.py), so that already-encoded variables are appended as-is, also fixes this issue:

if existing_variables:
    # there are variables to append
    # their encoding must be the same as in the store
    ds = open_zarr(self.ds.store, group=self.ds.path, chunks=None)
    variables_with_encoding = {}
    for vn in existing_variables:
        variables_with_encoding[vn] = variables[vn].copy(deep=False)
        variables_with_encoding[vn].encoding = ds[vn].encoding
    variables_with_encoding, _ = self.encode(variables_with_encoding, {})
    variables_encoded.update(variables_with_encoding)

Minimal Complete Verifiable Example:

Here is a test that demonstrates the observed inconsistency.

import shutil
import unittest

import numpy as np
import xarray as xr
import zarr

SRC_DS_1_PATH = 'src_ds_1.zarr'
SRC_DS_2_PATH = 'src_ds_2.zarr'
DST_DS_PATH = 'dst_ds.zarr'


class XarrayToZarrAppendInconsistencyTest(unittest.TestCase):
    @classmethod
    def del_paths(cls):
        for path in (SRC_DS_1_PATH, SRC_DS_2_PATH, DST_DS_PATH):
            shutil.rmtree(path, ignore_errors=True)

    def setUp(self):
        self.del_paths()

        scale_factor = 0.0001
        self.v_values_encoded = np.array([[0, 10000, 15000, 20000]], dtype=np.uint16)
        self.v_values_decoded = np.array([[np.nan, 1., 1.5, 2.]], dtype=np.float32)

        # The variable for the two source datasets
        v = xr.DataArray(self.v_values_encoded,
                         dims=('t', 'x'),
                         attrs=dict(scale_factor=scale_factor, _FillValue=0))

        # Create two source datasets
        src_ds = xr.Dataset(data_vars=dict(v=v))
        src_ds.to_zarr(SRC_DS_1_PATH)
        src_ds.to_zarr(SRC_DS_2_PATH)

        # Assert we have written encoded data
        a1 = zarr.convenience.open_array(SRC_DS_1_PATH + '/v')
        a2 = zarr.convenience.open_array(SRC_DS_2_PATH + '/v')
        np.testing.assert_equal(a1, self.v_values_encoded)  # succeeds
        np.testing.assert_equal(a2, self.v_values_encoded)  # succeeds

        # Assert we correctly decode data
        src_ds_1 = xr.open_zarr(SRC_DS_1_PATH, decode_cf=True)
        src_ds_2 = xr.open_zarr(SRC_DS_2_PATH, decode_cf=True)
        np.testing.assert_equal(src_ds_1.v.data, self.v_values_decoded)  # succeeds
        np.testing.assert_equal(src_ds_2.v.data, self.v_values_decoded)  # succeeds

    def tearDown(self):
        self.del_paths()

    def test_decode_cf_true(self):
        """
        This test succeeds.
        """
        # Open the two source datasets
        src_ds_1 = xr.open_zarr(SRC_DS_1_PATH, decode_cf=True)
        src_ds_2 = xr.open_zarr(SRC_DS_2_PATH, decode_cf=True)
        # Expect data is decoded
        np.testing.assert_equal(src_ds_1.v.data, self.v_values_decoded)  # succeeds
        np.testing.assert_equal(src_ds_2.v.data, self.v_values_decoded)  # succeeds

        # Write 1st source datasets to new dataset, append the 2nd source
        src_ds_1.to_zarr(DST_DS_PATH, mode='w-')
        src_ds_2.to_zarr(DST_DS_PATH, append_dim='t')

        # Open the new dataset
        dst_ds = xr.open_zarr(DST_DS_PATH, decode_cf=True)
        dst_ds_1 = dst_ds.isel(t=slice(0, 1))
        dst_ds_2 = dst_ds.isel(t=slice(1, 2))
        # Expect data is decoded
        np.testing.assert_equal(dst_ds_1.v.data, self.v_values_decoded)  # succeeds
        np.testing.assert_equal(dst_ds_2.v.data, self.v_values_decoded)  # succeeds

    def test_decode_cf_false(self):
        """
        This test fails by the last assertion with

        AssertionError:
        Arrays are not equal

        Mismatched elements: 3 / 4 (75%)
        Max absolute difference: 47600
        Max relative difference: 4.76
         x: array([[    0, 57600, 53632, 49664]], dtype=uint16)
         y: array([[    0, 10000, 15000, 20000]], dtype=uint16)
        """
        # Open the two source datasets
        src_ds_1 = xr.open_zarr(SRC_DS_1_PATH, decode_cf=False)
        src_ds_2 = xr.open_zarr(SRC_DS_2_PATH, decode_cf=False)
        # Expect data is NOT decoded (still encoded)
        np.testing.assert_equal(src_ds_1.v.data, self.v_values_encoded)  # succeeds
        np.testing.assert_equal(src_ds_2.v.data, self.v_values_encoded)  # succeeds

        # Write 1st source datasets to new dataset, append the 2nd source
        src_ds_1.to_zarr(DST_DS_PATH, mode='w-')
        # Avoid ValueError: failed to prevent overwriting existing key scale_factor in attrs. ...
        del src_ds_2.v.attrs['scale_factor']
        del src_ds_2.v.attrs['_FillValue']
        src_ds_2.to_zarr(DST_DS_PATH, append_dim='t')

        # Open the new dataset
        dst_ds = xr.open_zarr(DST_DS_PATH, decode_cf=False)
        dst_ds_1 = dst_ds.isel(t=slice(0, 1))
        dst_ds_2 = dst_ds.isel(t=slice(1, 2))
        # Expect data is NOT decoded (still encoded)
        np.testing.assert_equal(dst_ds_1.v.data, self.v_values_encoded)  # succeeds
        np.testing.assert_equal(dst_ds_2.v.data, self.v_values_encoded)  # fails

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 15:50:08) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: de_DE.cp1252
libhdf5: 1.10.6
libnetcdf: 4.7.4
xarray: 0.17.0
pandas: 1.2.2
numpy: 1.20.1
scipy: 1.6.0
netCDF4: 1.5.6
pydap: installed
h5netcdf: None
h5py: None
Nio: None
zarr: 2.6.1
cftime: 1.4.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.0
cfgrib: None
iris: None
bottleneck: None
dask: 2021.02.0
distributed: 2021.02.0
matplotlib: 3.3.4
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.0.1
conda: None
pytest: 6.2.2
IPython: 7.21.0
sphinx: 3.5.1
