Description
I want to bring up an issue that has tripped up my workflow with large climate models many times. I am dealing with large arrays of vertical cell thickness. These are 4D arrays (x, y, z, time), but in the xarray data model I would define them as coordinates, not data_variables (e.g. they should not be scaled when the dataset is multiplied by a value).
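To illustrate what I mean by the coordinate semantics, here is a tiny made-up example (unrelated to the actual data, just to show that arithmetic scales data variables but leaves coordinates untouched):

import xarray as xr
import numpy as np

# toy dataset: `thickness` is a coordinate, `temp` is a data variable
small = xr.Dataset(
    {'temp': ('z', np.array([1.0, 2.0, 3.0]))},
    coords={'thickness': ('z', np.array([10.0, 20.0, 30.0]))},
)
doubled = small * 2
print(doubled['temp'].values)       # [2. 4. 6.]    -> data variable is scaled
print(doubled['thickness'].values)  # [10. 20. 30.] -> coordinate is left alone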
These sorts of coordinates might become more prevalent with newer ocean models like MOM6.
Whenever I assign these arrays as coordinates, operations on the dataset seem to trigger computation, whereas they don't if I set them up as data_variables. The example below shows this behavior.
Is this a bug or done on purpose? Is there a workaround to keep these vertical thicknesses as coordinates?
import xarray as xr
import numpy as np
import dask.array as dsa
# create a dataset with vertical thickness `dz` as a data variable
data = xr.DataArray(dsa.random.random([30, 50, 200, 1000]), dims=['x','y', 'z', 't'])
dz = xr.DataArray(dsa.random.random([30, 50, 200, 1000]), dims=['x','y', 'z', 't'])
ds = xr.Dataset({'data':data, 'dz':dz})
# another dataset with `dz` as a coordinate
ds_new = xr.Dataset({'data':data})
ds_new.coords['dz'] = dz
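(As a quick sanity check on my end: the assignment itself does not appear to compute anything; both versions still seem to wrap dask arrays at this point.)

# both `dz` variables should still be dask-backed here, so the compute
# apparently happens during the arithmetic below, not during the assignment
print(type(ds['dz'].data))
print(type(ds_new['dz'].data))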
%%time
test = ds['data'] * ds['dz']
CPU times: user 1.94 ms, sys: 19.1 ms, total: 21 ms
Wall time: 21.6 ms
%%time
test = ds_new['data'] * ds_new['dz']
CPU times: user 17.4 s, sys: 1.98 s, total: 19.4 s
Wall time: 12.5 s
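The only workaround I have come up with so far (a rough sketch, I am not sure it is the intended pattern) is to demote `dz` back to a data variable around the arithmetic:

# temporarily turn the `dz` coordinate back into a data variable,
# which makes the multiplication behave like the fast case above
tmp = ds_new.reset_coords('dz')
test = tmp['data'] * tmp['dz']

But that defeats the purpose of keeping `dz` as a coordinate in the first place.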
Output of xr.show_versions()
xarray: 0.13.0+24.g4254b4af
pandas: 0.25.1
numpy: 1.17.2
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.5.0
distributed: 2.5.1
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 41.2.0
pip: 19.2.3
conda: None
pytest: 5.2.0
IPython: 7.8.0
sphinx: None