There seems to be a common pattern for writing zarrs from a distributed set of machines, in parallel. It's somewhat described in the prose of the io docs. Quoting:
- Creating the template — "the first step is creating an initial Zarr store without writing all of its array data. This can be done by first creating a `Dataset` with dummy values stored in dask, and then calling `to_zarr` with `compute=False` to write only metadata to Zarr"
- Writing out each region from workers — "a Zarr store with the correct variable shapes and attributes exists that can be filled out by subsequent calls to `to_zarr`. The `region` provides a mapping from dimension names to Python `slice` objects indicating where the data should be written (in index space, not coordinate space)"
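In code, the two steps look roughly like this (a minimal sketch: the store path, shapes, and the `write_slab` helper are all made up for illustration):

```python
import dask.array
import numpy as np
import xarray as xr

# Step 1 — the template: dask-backed dummy values, written with
# compute=False so that only metadata (plus the eagerly-written index
# coordinates) goes to the store.
template = xr.Dataset(
    {"air": (("time", "lat"), dask.array.zeros((1000, 25), chunks=(100, 25)))},
    coords={"time": np.arange(1000), "lat": np.linspace(-60, 60, 25)},
)
template.to_zarr("store.zarr", mode="w", compute=False)

# Step 2 — each worker writes its own slab in index space, after dropping
# the variables that don't overlap the region being written.
def write_slab(slab: xr.Dataset, start: int, stop: int) -> None:
    slab.drop_vars(["time", "lat"]).to_zarr(
        "store.zarr", region={"time": slice(start, stop)}
    )

# e.g. on one worker:
write_slab(template.isel(time=slice(0, 100)).compute(), 0, 100)
```

Since `region` takes index-space slices, each worker only needs to know which rows of the template it owns.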
I've been using this fairly successfully recently. It's much better than writing hundreds or thousands of data variables, since many small data variables create a huge number of files.
Are there some tools we can provide to make this easier? Some ideas:
- `compute=False` is arguably a less-than-obvious kwarg meaning "write metadata". Maybe this should be a method, maybe it's a candidate for renaming? Or maybe `make_template` can be an abstraction over it. Something like `xarray_beam.make_template`, to make the template from a `Dataset`? (See the sketch after this list.) (edit: `metadata_only` param to `.to_zarr`? #8343)
- What happens if one worker's data isn't aligned on some dimensions? Will that write to the wrong location? Could we offer an option, similar to the above, to reindex on the template dimensions?
- When writing a region, we need to drop other vars. Can we offer this as a kwarg? Occasionally I'll add a dimension with an index to a dataset, run the function to write it — and it'll fail, because I forgot to add that index to the `.drop_vars` call that precedes the write. When we're writing a template, all the indexes are written up front anyway. (See the second sketch below for one possible shape.) (edit: allow coordinates to be independent of `region` selection in `to_zarr` #6260)
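A rough sketch of what a `make_template` helper could look like (hypothetical, not an existing xarray API, though `xarray_beam.make_template` is built along these lines):

```python
import xarray as xr

def make_template(ds: xr.Dataset) -> xr.Dataset:
    # A lazy, dask-backed, zero-filled copy of ``ds``: cheap to build,
    # and ready for ``template.to_zarr(store, compute=False)``.
    return xr.zeros_like(ds.chunk())
```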
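And for the drop-other-vars point, a kwarg could compute the drop list from the region itself rather than making the caller maintain it by hand. A sketch (the `to_region` name is made up):

```python
import xarray as xr

def to_region(ds: xr.Dataset, store: str, region: dict) -> None:
    # Drop every variable with no dimension in the region, since to_zarr
    # refuses to write such variables during a region write.
    to_drop = [
        name
        for name, var in ds.variables.items()
        if not set(var.dims) & set(region)
    ]
    ds.drop_vars(to_drop).to_zarr(store, region=region)

# e.g. to_region(slab, "store.zarr", {"time": slice(0, 100)})
```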
More minor papercuts:

- I've hit an issue where writing a region seemed to cause the worker to attempt to load the whole array into memory — can we offer guarantees for when (non-metadata) data will be loaded during `to_zarr`?
- How about adding `raise_if_dask_computes` to our public API? The alternative I've been doing is watching `htop` and exiting if I see memory ballooning, which is less cerebral... (see the snippet after this list)
- It doesn't seem easy to write coords on a `DataArray`. For example, with `ds = xr.tutorial.load_dataset('air_temperature')`, writing `ds.assign_coords(lat2=ds.lat + 2, a=(('lon',), ['a'] * len(ds.lon))).chunk().to_zarr('foo.zarr', compute=False)` will cause the non-index coords to be written as empty. But writing them separately conflicts with having a single variable. Currently I manually load each coord before writing, which is not super-friendly.
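For reference, `raise_if_dask_computes` currently lives in xarray's private test utilities, so using it today means importing non-public API, which is exactly the ask. Reusing the `template` from the first sketch:

```python
from xarray.tests import raise_if_dask_computes  # private API today

# Fail loudly if writing the template triggers a dask compute, instead
# of silently ballooning worker memory.
with raise_if_dask_computes():
    template.to_zarr("store.zarr", mode="w", compute=False)
```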
Some things that have been removed from the list here, as they've been completed!!
- Requiring `region` to be specified as an int range can be inconvenient — would it be feasible to have a function that grabs the template metadata, calculates the region ints, and then calculates the implied indexes? (edit: `to_zarr(region=...)` rather than passing indexes #7702)