Description
What is your issue?
I make use of fsspec to quickly open netCDF files in the cloud and pull out slices of data without needing to read the entire file. Quick and dirty, this is just `ds = xr.open_dataset(fs.open("gs://..."))`.
This works great: a file many GB in size can be lazily loaded as a dataset in a few hundred milliseconds, since only the netCDF headers are parsed via under-the-hood byte-range requests. But this only holds if the netCDF was written from dask-backed arrays. Somehow, writing from numpy-backed arrays produces a different netCDF that requires reading much deeper into the file before it can be parsed as a dataset.
I spent some time digging into the backends and see that xarray ultimately hands the store write off to dask.array here. A look at ncdump and Dataset.encoding didn't reveal any obvious differences between the files, but there is clearly something. Does anyone know why the plain xarray store methods would produce a different netCDF structure, despite the underlying data and encoding being identical?
This should work as an MCVE:
```python
import os
import string

import fsspec
import numpy as np
import xarray as xr

fs = fsspec.filesystem("gs")
bucket = "gs://<your-bucket>"

# Create a ~160MB dataset with 20 variables
variables = {
    v: (["x", "y"], np.random.random(size=(1000, 1000)))
    for v in string.ascii_letters[:20]
}
ds = xr.Dataset(variables)

# Save one version from numpy-backed arrays and one from dask-backed arrays
ds.compute().to_netcdf("numpy.nc")
ds.chunk().to_netcdf("dask.nc")

# Copy these to a bucket of your choice
fs.put("numpy.nc", bucket)
fs.put("dask.nc", bucket)
```
Then time reading these files in as datasets with fsspec:
```python
%timeit xr.open_dataset(fs.open(os.path.join(bucket, "numpy.nc")))
# 2.15 s ± 40.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit xr.open_dataset(fs.open(os.path.join(bucket, "dask.nc")))
# 187 ms ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```