Description
What happened?
When saving an altered Xarray dataset to a NetCDF file using the to_netcdf method, a ValueError is raised, but only if the original dataset was loaded from a file. The error does not occur when the same dataset is created directly in memory.
What did you expect to happen?
The altered Xarray dataset is saved as a NetCDF file using the to_netcdf method.
Minimal Complete Verifiable Example
import xarray as xr

ds = xr.Dataset(
    data_vars=dict(
        win_1=("attempt", [True, False, True, False, False, True]),
        win_2=("attempt", [False, True, False, True, False, False]),
    ),
    coords=dict(
        attempt=[1, 2, 3, 4, 5, 6],
        player_1=("attempt", ["paper", "paper", "scissors", "scissors", "paper", "paper"]),
        player_2=("attempt", ["rock", "scissors", "paper", "rock", "paper", "rock"]),
    ),
)
ds.to_netcdf("dataset.nc")
ds_from_file = xr.load_dataset("dataset.nc")
ds_altered = ds_from_file.where(ds_from_file["player_1"] == "paper", drop=True)
ds_altered.to_netcdf("dataset_altered.nc")
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
- Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
Traceback (most recent call last):
File "example.py", line 20, in <module>
ds_altered.to_netcdf("dataset_altered.nc")
File ".../python3.9/site-packages/xarray/core/dataset.py", line 2303, in to_netcdf
return to_netcdf( # type: ignore # mypy cannot resolve the overloads:(
File ".../python3.9/site-packages/xarray/backends/api.py", line 1315, in to_netcdf
dump_to_store(
File ".../python3.9/site-packages/xarray/backends/api.py", line 1362, in dump_to_store
store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
File ".../python3.9/site-packages/xarray/backends/common.py", line 356, in store
self.set_variables(
File ".../python3.9/site-packages/xarray/backends/common.py", line 398, in set_variables
writer.add(source, target)
File ".../python3.9/site-packages/xarray/backends/common.py", line 243, in add
target[...] = source
File ".../python3.9/site-packages/xarray/backends/scipy_.py", line 78, in __setitem__
data[key] = value
File ".../python3.9/site-packages/scipy/io/_netcdf.py", line 1019, in __setitem__
self.data[index] = data
ValueError: could not broadcast input array from shape (4,5) into shape (4,8)
Anything else we need to know?
Findings:
The issue is caused by the encoding information of the dataset becoming stale after filtering data with the where method: to_netcdf writes the file according to the recorded encoding instead of the actual shape of the filtered data.
In the provided example, the maximum length of the strings stored in "player_1" and "player_2" is originally 8 characters ("scissors"). After filtering with where, the maximum string length drops to 5 in "player_1" but remains 8 in "player_2". The encoding of both variables, however, still references the original 8-character dimension, in particular through the char_dim_name entry, which no longer matches the data in "player_1" and leads to the broadcasting error shown above.
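The stale encoding can be inspected directly after filtering. A minimal sketch; the exact keys and the dimension name (e.g. "string8") depend on the backend that wrote the file and are assumptions here:

import xarray as xr

ds_from_file = xr.load_dataset("dataset.nc")
ds_altered = ds_from_file.where(ds_from_file["player_1"] == "paper", drop=True)

# Encoding recorded when the file was read; expected to still reference
# the original 8-character dimension via the char_dim_name entry,
# e.g. {'char_dim_name': 'string8', ...} (assumed output).
print(ds_altered["player_1"].encoding)
print(ds_altered["player_2"].encoding)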
Workaround:
Calling the drop_encoding method on the dataset before saving it with to_netcdf resolves the issue. With the encoding information removed, to_netcdf is forced to derive the on-disk shapes from the actual data, which prevents the broadcasting error.
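Applied to the example, the workaround is a one-liner (drop_encoding is available in the xarray version listed below):

ds_altered.drop_encoding().to_netcdf("dataset_altered.nc")

Alternatively, removing only the stale char_dim_name entry from each variable's encoding should avoid the shared, wrongly sized character dimension; this is an untested sketch based on the findings above:

for var in ds_altered.variables.values():
    # Drop only the stale character-dimension name (hypothetical fix).
    var.encoding.pop("char_dim_name", None)
ds_altered.to_netcdf("dataset_altered.nc")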
Environment
INSTALLED VERSIONS
commit: None
python: 3.9.14 (main, Aug 24 2023, 14:01:46)
[GCC 11.4.0]
python-bits: 64
OS: Linux
OS-release: 6.3.1-060301-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 2024.1.1
pandas: 2.2.0
numpy: 1.26.3
scipy: 1.12.0
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.0.3
pip: 23.3.2
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None