Description
What happened: When I combine monthly ERA5 files, select a single location, and save it to its own netCDF file, reading that file back in yields different values or nan values compared to the data that was written.
What you expected to happen: The data in memory and the data read back from the file should be identical. This is the case when, for example, only one month is read.
Minimal Complete Verifiable Example:
import xarray as xr  # using version 0.18.2
import numpy as np
import dask
# Run dask single-threaded: only one CPU is requested, and more threads don't seem to be used anyway.
dask.config.set(scheduler='synchronous')  # only needed because of the cluster I work on; kept here in case it is relevant
model_level_file_name_format = "{:d}_europe_{:d}_130_131_132_133_135.nc"
ml_files = [model_level_file_name_format.format(2012, 9), model_level_file_name_format.format(2012, 10)]
ds = xr.open_mfdataset(ml_files, decode_times=True)
# Select single location data
lons = ds['longitude'].values
lats = ds['latitude'].values
i_lat, i_lon = 27,30
ds_loc = ds.sel(latitude=lats[i_lat], longitude=lons[i_lon])
# Save to file
ds_loc.to_netcdf('europe_i_lat_{i_lat}_i_lon_{i_lon}.nc'.format(i_lat=i_lat, i_lon=i_lon))
# Read in again
ds_loc_1 = xr.open_dataset('europe_i_lat_{i_lat}_i_lon_{i_lon}.nc'.format(i_lat=i_lat, i_lon=i_lon), decode_times=True)
print('Test all q values same: ', np.all(ds_loc.q.values == ds_loc_1.q.values))
Anything else we need to know?: I tested this using these two months. In many cases the saved output matches; in others the values are slightly different (in the 6th digit). With a larger timespan (2010-2012), nan values appear as well. The issue does not seem to be restricted to the q variable, and I have not yet found the pattern.
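To help narrow down the pattern, a comparison that separates nan mismatches from small numerical differences may be more informative than a single `np.all(a == b)`, which also reports `False` when both arrays hold nan at the same position. The following is a minimal sketch; the helper name `compare_arrays` is hypothetical and it assumes both arrays have the same shape.

```python
import numpy as np


def compare_arrays(a, b):
    """Summarise how two equally shaped float arrays differ."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    nan_a, nan_b = np.isnan(a), np.isnan(b)
    both_valid = ~nan_a & ~nan_b
    # An element differs if only one side is nan, or both are numbers but unequal.
    mismatch = (nan_a != nan_b) | (both_valid & (a != b))
    abs_diff = np.abs(a[both_valid] - b[both_valid])
    where = np.argwhere(mismatch)
    return {
        "identical": not mismatch.any(),
        "n_mismatch": int(mismatch.sum()),
        "nan_only_in_a": int((nan_a & ~nan_b).sum()),
        "nan_only_in_b": int((~nan_a & nan_b).sum()),
        "max_abs_diff": float(abs_diff.max()) if abs_diff.size else 0.0,
        # Index of the first differing element (row-major order), or None.
        "first_mismatch_index": tuple(int(i) for i in where[0]) if where.size else None,
    }


# Synthetic example: one small numerical difference and one nan that
# appears only in the second array.
original = np.array([[1.0, 2.0], [np.nan, 4.0]])
reread = np.array([[1.0, 2.0 + 1e-6], [np.nan, np.nan]])
print(compare_arrays(original, reread))
```

Applied to the example above, this would be called as `compare_arrays(ds_loc.q.values, ds_loc_1.q.values)`; the first mismatch index then indicates where along the time axis the discrepancies start.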
I've included a more detailed assessment (output, data, code) at https://uni-bonn.sciebo.de/s/OLHhid8zJg65IFB:
- only one month: no discrepancies
- two months: discrepancies (in the second month)
- 2010-2013: discrepancies and nan values
I'm not sure where the issue comes from. Since the data is read in correctly at first, the reading side seems fine, which would leave only the step of writing the netCDF output in xarray. I've tested this for a few years, and for two months I always find that not all q values are the same. Because I don't know where the problem lies, I don't know where to start for a more minimal example. I hope this is ok.
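One possible cause on the writing side that may be worth ruling out, under the assumption (not verified for these files) that each monthly file stores its variables as packed 16-bit integers with its own scale_factor and add_offset: if the combined data were written back using the packing parameters of only one input file, values from other months could lose precision or fall outside the representable range. The NumPy-only sketch below simulates that mechanism on synthetic numbers. The helper names `packing_params` and `roundtrip` are hypothetical, and flagging out-of-range values as nan is a simplification; what an actual netCDF writer does with them may differ.

```python
import numpy as np

INT16_MIN, INT16_MAX = -32768, 32767


def packing_params(values, n_bits=16):
    """Return (scale_factor, add_offset) mapping [min, max] of values onto the integer range."""
    vmin, vmax = float(np.min(values)), float(np.max(values))
    scale = (vmax - vmin) / (2**n_bits - 2)
    offset = (vmax + vmin) / 2
    return scale, offset


def roundtrip(values, scale, offset):
    """Pack to the int16 range with the given parameters, then unpack again."""
    packed = np.round((np.asarray(values, dtype=float) - offset) / scale)
    out_of_range = (packed < INT16_MIN) | (packed > INT16_MAX)
    unpacked = packed * scale + offset
    # Simplification: flag values that do not fit into int16 as nan.
    unpacked[out_of_range] = np.nan
    return unpacked


# Synthetic stand-ins for two months with different value ranges.
month_1 = np.array([0.001, 0.0017, 0.003])
month_2 = np.array([0.002, 0.004, 0.006])

# Packing parameters derived from the first month only.
scale, offset = packing_params(month_1)

print("month 1 round trip:", roundtrip(month_1, scale, offset))
print("month 2 round trip:", roundtrip(month_2, scale, offset))
```

In this simulation the first month comes back with only small rounding differences, while values of the second month that lie outside the first month's range are lost. Whether this is what happens here is open; comparing `ds.q.encoding` of the combined dataset with the scale_factor and add_offset attributes of each monthly file would be one way to check.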
Cheers, Lavinia
Environment:
INSTALLED VERSIONS
commit: None
python: 3.9.4 | packaged by conda-forge | (default, May 10 2021, 22:13:33)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.25.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.utf8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.8.0
xarray: 0.18.2
pandas: 1.2.4
numpy: 1.20.3
scipy: 1.6.3
netCDF4: 1.5.6
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.06.0
distributed: 2021.06.0
matplotlib: 3.4.2
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.1.2
conda: None
pytest: None
IPython: None
sphinx: None