Error while saving an altered dataset to NetCDF when loaded from a file #8694

Closed
@tarik

Description

What happened?

When attempting to save an altered Xarray dataset to a NetCDF file using the to_netcdf method, an error occurs, but only if the original dataset was loaded from a file. The same operations succeed when the dataset is created directly in memory.

What did you expect to happen?

The altered Xarray dataset is saved as a NetCDF file using the to_netcdf method.

Minimal Complete Verifiable Example

import xarray as xr


ds = xr.Dataset(
    data_vars=dict(
        win_1=("attempt", [True, False, True, False, False, True]),
        win_2=("attempt", [False, True, False, True, False, False]),
    ),
    coords=dict(
        attempt=[1, 2, 3, 4, 5, 6],
        player_1=("attempt", ["paper", "paper", "scissors", "scissors", "paper", "paper"]),
        player_2=("attempt", ["rock", "scissors", "paper", "rock", "paper", "rock"]),
    )
)
ds.to_netcdf("dataset.nc")

ds_from_file = xr.load_dataset("dataset.nc")

ds_altered = ds_from_file.where(ds_from_file["player_1"] == "paper", drop=True)
ds_altered.to_netcdf("dataset_altered.nc")

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

Traceback (most recent call last):
  File "example.py", line 20, in <module>
    ds_altered.to_netcdf("dataset_altered.nc")
  File ".../python3.9/site-packages/xarray/core/dataset.py", line 2303, in to_netcdf
    return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
  File ".../python3.9/site-packages/xarray/backends/api.py", line 1315, in to_netcdf
    dump_to_store(
  File ".../python3.9/site-packages/xarray/backends/api.py", line 1362, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File ".../python3.9/site-packages/xarray/backends/common.py", line 356, in store
    self.set_variables(
  File ".../python3.9/site-packages/xarray/backends/common.py", line 398, in set_variables
    writer.add(source, target)
  File ".../python3.9/site-packages/xarray/backends/common.py", line 243, in add
    target[...] = source
  File ".../python3.9/site-packages/xarray/backends/scipy_.py", line 78, in __setitem__
    data[key] = value
  File ".../python3.9/site-packages/scipy/io/_netcdf.py", line 1019, in __setitem__
    self.data[index] = data
ValueError: could not broadcast input array from shape (4,5) into shape (4,8)

Anything else we need to know?

Findings:

The issue is related to the dataset's encoding becoming invalid after filtering data with the where method: to_netcdf uses the available encoding information instead of considering the actual shape of the data.

In the provided example, the maximum length of the strings stored in "player_1" and "player_2" is originally 8 characters. After filtering with the where method, the maximum string length drops to 5 in "player_1" but remains 8 in "player_2". The encoding of both variables, however, still records a character dimension of length 8, in particular via the char_dim_name entry.
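For what it's worth, the mismatch can be reproduced with plain Python on the MVCE's data: after keeping only the attempts where player_1 == "paper", the longest string in player_1 shrinks from 8 characters ("scissors") to 5 ("paper"), while player_2 still needs 8. These are exactly the shapes (4, 5) and (4, 8) from the traceback.

```python
# Pure-Python illustration of the mismatch, mirroring the MVCE data
# (no xarray needed): filtering shortens the longest string in player_1
# but not in player_2, while the stored encoding still assumes 8.
player_1 = ["paper", "paper", "scissors", "scissors", "paper", "paper"]
player_2 = ["rock", "scissors", "paper", "rock", "paper", "rock"]

# Keep the attempts where player_1 == "paper", as where(..., drop=True) does.
keep = [p == "paper" for p in player_1]
p1_filtered = [p for p, k in zip(player_1, keep) if k]
p2_filtered = [p for p, k in zip(player_2, keep) if k]

print(len(p1_filtered))            # 4 attempts remain
print(max(map(len, player_1)))     # 8 ("scissors"), the encoded char dim
print(max(map(len, p1_filtered)))  # 5 ("paper"): shape (4, 5) at write time
print(max(map(len, p2_filtered)))  # 8: player_2 still fits its encoding
```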

Workaround:

A workaround is to call the drop_encoding method on the dataset before saving it with to_netcdf. This removes the stale encoding information, forcing to_netcdf to derive the on-disk layout from the actual shapes of the data, which prevents the broadcasting error.
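A minimal sketch of the workaround on a toy dataset (the char_dim_name value "string8" is set by hand here to stand in for what loading from a file attaches; this snippet does not round-trip through a real file):

```python
import xarray as xr

da = xr.DataArray(["paper", "scissors"], dims="attempt", name="player_1")
# Hand-set stale encoding, simulating what load_dataset attaches;
# "string8" is a made-up char-dimension name for 8-character strings.
da.encoding["char_dim_name"] = "string8"
ds = da.to_dataset()

clean = ds.drop_encoding()  # copy of the dataset with all encoding removed
print(ds["player_1"].encoding)     # {'char_dim_name': 'string8'}
print(clean["player_1"].encoding)  # {}
# clean.to_netcdf("dataset_altered.nc")  # now derives shapes from the data
```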

Environment

INSTALLED VERSIONS

commit: None
python: 3.9.14 (main, Aug 24 2023, 14:01:46)
[GCC 11.4.0]
python-bits: 64
OS: Linux
OS-release: 6.3.1-060301-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2024.1.1
pandas: 2.2.0
numpy: 1.26.3
scipy: 1.12.0
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.0.3
pip: 23.3.2
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None
