
Differences in to_netcdf for dask and numpy backed arrays #7522

Open
@slevang

What is your issue?

I make use of fsspec to quickly open netcdf files in the cloud and pull out slices of data without needing to read the entire file. The quick-and-dirty version is just ds = xr.open_dataset(fs.open("gs://...")).

This works great: a many-GB file can be lazy-loaded as a dataset in a few hundred milliseconds, since only the netcdf headers are parsed via under-the-hood byte range requests. But this only holds if the netcdf was written from dask-backed arrays. Somehow, writing from numpy-backed arrays produces a different netcdf that requires reading much deeper into the file before it can be parsed as a dataset.
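As a toy illustration of why the header-only pattern is cheap (this sketch is mine, not from the issue thread): netCDF-4 files are HDF5 files underneath, and the format can be recognized from the 8-byte HDF5 signature alone, so a seekable file object never has to touch the payload:

```python
import io

# Toy stand-in for a remote file: the 8-byte HDF5 signature (netCDF-4 files
# are HDF5 underneath) followed by a large opaque payload.
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"
blob = io.BytesIO(HDF5_MAGIC + b"\x00" * 10_000_000)

# A reader only needs a tiny byte range to identify the format; the 10 MB
# payload is never read.
blob.seek(0)
print(blob.read(8) == HDF5_MAGIC)  # True
```

With a real remote file, each seek-and-read like this becomes one byte range request, which is why header parsing stays fast as long as the metadata sits near the front of the file.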

I spent some time digging into the backends and see that xarray ultimately passes the store write off to dask.array here. A look at ncdump and Dataset.encoding didn't reveal any obvious differences between the two files, but there is clearly something. Does anyone know why the straight xarray store methods would produce a different netcdf structure, despite the underlying data and encoding being identical?

This should work as an MCVE:

import os
import string
import fsspec
import numpy as np
import xarray as xr

fs = fsspec.filesystem("gs")
bucket = "gs://<your-bucket>"

# create a ~160MB dataset with 20 variables
variables = {
    v: (["x", "y"], np.random.random(size=(1000, 1000)))
    for v in string.ascii_letters[:20]
}
ds = xr.Dataset(variables)

# Save one version from numpy backed arrays and one from dask backed arrays
ds.compute().to_netcdf("numpy.nc")
ds.chunk().to_netcdf("dask.nc")

# Copy these to a bucket of your choice
fs.put("numpy.nc", bucket)
fs.put("dask.nc", bucket)

Then time reading in these files as datasets with fsspec:

%timeit xr.open_dataset(fs.open(os.path.join(bucket, "numpy.nc")))
# 2.15 s ± 40.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit xr.open_dataset(fs.open(os.path.join(bucket, "dask.nc")))
# 187 ms ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
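One way to probe where the two files actually differ (a hypothetical diagnostic of mine, not something established in this issue): netCDF-4 variables are HDF5 datasets that can be stored contiguously or in chunks, and h5py exposes that layout. The sketch below builds two tiny files just to demonstrate the probe; the same loop could be pointed at numpy.nc and dask.nc from the MCVE instead.

```python
import h5py
import numpy as np

# Hypothetical diagnostic: create one file with a contiguous dataset and one
# with a chunked dataset, then read back the storage layout h5py reports.
data = np.random.random((100, 100))

with h5py.File("contiguous.h5", "w") as f:
    f.create_dataset("var", data=data)                   # contiguous layout
with h5py.File("chunked.h5", "w") as f:
    f.create_dataset("var", data=data, chunks=(50, 50))  # chunked layout

for fname in ["contiguous.h5", "chunked.h5"]:
    with h5py.File(fname, "r") as f:
        # .chunks is None for contiguous storage, a tuple for chunked storage
        print(fname, "chunks:", f["var"].chunks)
```

If the two netcdf files report different layouts (or different metadata placement), that would be a concrete lead on why one needs deeper reads to open.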
