Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

attributes that are lists of strings with a single member don't survive a round-trip #4798

Closed
mikapfl opened this issue Jan 12, 2021 · 8 comments · Fixed by #9700
Closed
Labels
topic-backends topic-documentation topic-metadata Relating to the handling of metadata (i.e. attrs and encoding)

Comments

@mikapfl
Copy link

mikapfl commented Jan 12, 2021

What happened:

If I create a Dataset with an attr that is a list of strings with only a single member, write it to netcdf and open it again, the attr is a single string instead of a list of strings.

What you expected to happen:

Writing the Dataset to netcdf and opening it should not change the attrs.

Minimal Complete Verifiable Example:

import xarray as xr

ds = xr.Dataset(attrs={"foo": ["bar"]},)

ds.to_netcdf("ds.nc", engine="netcdf4")
rd = xr.open_dataset("ds.nc", engine="netcdf4")

assert ds.attrs["foo"] == rd.attrs["foo"]

Anything else we need to know?:

This was implemented in PR2045, including the test test_setncattr_string. However, there assert_array_equal is used to compare the attrs after a roundtrip (line 1454 ), which unfortunately compares element-wise if one of the elements is a scalar. Changing the
test like this:

-                assert_array_equal(one_element_list_of_strings, totest.attrs["bar"])
+                assert one_element_list_of_strings == totest.attrs["bar"]

corrects it (the test then fails currently).

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: f52a95c
python: 3.8.6 (default, Sep 25 2020, 09:36:53)
[GCC 10.2.0]
python-bits: 64
OS: Linux
OS-release: 5.8.0-36-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.12.0
libnetcdf: 4.7.4

xarray: 0.16.3.dev75+gf52a95cb
pandas: 1.2.0
numpy: 1.19.5
scipy: None
netCDF4: 1.5.5.1
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.3.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2020.12.0
distributed: 2020.12.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 51.1.2
pip: 20.3.3
conda: None
pytest: 6.2.1
IPython: None
sphinx: None

@mathause
Copy link
Collaborator

mathause commented Jan 12, 2021

There is an explicit condition for the size to be > 1

def _is_list_of_strings(value):
if np.asarray(value).dtype.kind in ["U", "S"] and np.asarray(value).size > 1:

However, even when setting the condition to > 0 it is saved as a string and not as a list of strings - maybe check if this can be achieved with netCDF4 at all.

@mathause mathause added topic-backends topic-metadata Relating to the handling of metadata (i.e. attrs and encoding) labels Jan 13, 2021
@mikapfl
Copy link
Author

mikapfl commented Jan 13, 2021

Actually, this is a problem which is specific to the netcdf4 backend, using h5netcdf, this works:

import xarray as xr

ds = xr.Dataset(attrs={"foo": ["bar"]},)

ds.to_netcdf("ds.nc", engine="h5netcdf")
rd = xr.open_dataset("ds.nc", engine="h5netcdf")

assert ds.attrs["foo"] == rd.attrs["foo"]

and runs without problems.

As soon as I involve netcdf4 in reading or writing, this fails.

I don't know enough about netcdf's on-disk format to really debug what is going on. The ncdump output looks identical to me for both engines.

@kmuehlbauer
Copy link
Contributor

kmuehlbauer commented Jan 13, 2021

@mikapfl That's by design in netCDF4:

https://github.com/Unidata/netcdf4-python/blob/bcbe3a0c62e6aa012b7afe7c9b2cfc9b2abc5424/src/netCDF4/_netCDF4.pyx#L1478-L1482

        if len(result) == 1:
            return result[0]
        else:
            return result

@mikapfl
Copy link
Author

mikapfl commented Jan 13, 2021

@kmuehlbauer thanks for finding that out!

Now I'm wondering if that should be documented or some other solution; certainly, for me it was surprising, but maybe also my expectations were just wrong.

@kmuehlbauer
Copy link
Contributor

@mikapfl This was added with Unidata/netcdf4-python#597. But I can't say anything profound why the single element is extracted from the list.

@dcherian
Copy link
Contributor

Could add a note to the docstring...

@kmuehlbauer
Copy link
Contributor

h5netcdf aligned with netCDF4 in h5netcdf/h5netcdf#116.

@kmuehlbauer
Copy link
Contributor

kmuehlbauer commented Oct 31, 2024

As this recently surfaced in #9699, it would be great to at least mention this in the docs. Those are the locations where attrs are mentioned.

Should we add this to the data structures or the faq section? We could then link to there from the three API docstrings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-backends topic-documentation topic-metadata Relating to the handling of metadata (i.e. attrs and encoding)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants