Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add to_icechunk for xarray #357

Draft
wants to merge 11 commits into
base: main
Choose a base branch
from
Draft

Add to_icechunk for xarray #357

wants to merge 11 commits into from

Conversation

dcherian
Copy link
Contributor

@dcherian dcherian commented Oct 29, 2024

Adds icechunk.xarray.to_icechunk.

TODO:

@dcherian
Copy link
Contributor Author

dcherian commented Oct 31, 2024

This adds two APIs. One is "easy" and does everything in one step, the other allows extra control over initializing the store, and subsequent writes.

  1. icechunk.xarray.to_icechunk(dataset, store=store, ...) which is like ds.to_zarr without the compute kwarg.
  2. To "initialize" the repo, and execute a distributed writer in a separate step, the API is:
from icechunk.xarray import XarrayDatasetWriter

writer = XarrayDatasetWriter(ds, store=store)
writer.write_metadata(group="new2", mode="w") # write metadata
writer.write_eager() # write in-memory arrays
writer.write_lazy() # eagerly write dask arrays

The write_eager method could execute an async write of all the eager in-memory coordinate arrays, but does not do so right now.


Write metadata for arrays in one step. This "initializes" the store but has not written an real values yet.
```python
writer.write_metadata(group="new2", mode="w")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think about appends & region writes here.

@TomNicholas
Copy link
Collaborator

Naive question: if icechunk.xarray.to_icechunk is functionally the same as ds.to_zarr(icechunkstore) then why does it need to exist?

Also should we call icechunk.xarray.to_icechunk from inside virtualizarr's .virtualize.to_icechunk accessor method? Since that can also write non-virtual variables.

cc @mpiannucci

@dcherian
Copy link
Contributor Author

dcherian commented Nov 5, 2024

is functionally the same as

It is not. to_zarr will not work for distributed writes. It would write the chunks, but the commit wouldn't be meaningful.

@TomNicholas
Copy link
Collaborator

.to_zarr will not work for distributed writes. It would write the chunks, but the commit wouldn't be meaningful.

Can you expand on what this means in practice? If I have a big distributed dask array, I call to .to_zarr, then commit, what exactly would go wrong?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants