Skip to content

Add mode="r+" for to_zarr and use consolidated writes/reads by default #5252

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Jun 17, 2021

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented May 3, 2021

mode="r+" only allows for modifying pre-existing array values in a
Zarr store. This makes it a safer default mode when doing a limited
region write. It also offers a nice performance bonus when using
consolidated metadata, because the store to modify can be opened in
"consolidated" mode -- rather than painfully slow non-consolidated
mode.

This PR includes several related changes to to_zarr():

  1. It adds support for the new mode="r+".
  2. consolidated=True in to_zarr() now means "open in consolidated
    mode" if using using mode="r+", instead of "write in consolidated
    mode" (which would not make sense for r+).
  3. It allows setting consolidated=True when using region, mostly
    for the sake of fast store opening with r+.
  4. Validation in to_zarr() has been reorganized to always use the
    existing Zarr group, rather than re-opening zar stores from
    scratch, which could require additional network requests.
  5. Incidentally, I've renamed the ZarrStore.ds attribute to
    ZarrStore.zarr_group, which is a much more descriptive name.

These changes gave me a ~5x boost in write performance in a large
parallel job making use of to_zarr with region.

  • Tests added
  • Passes pre-commit run --all-files
  • User visible changes (including notable bug fixes) are documented in whats-new.rst

`mode="r+"` only allows for modifying pre-existing array values in a
Zarr store. This makes it a safer default `mode` when doing a limited
`region` write. It also offers a nice performance bonus when using
consolidated metadata, because the store to modify can be opened in
"consolidated" mode.

This PR includes several related changes to `to_zarr()`:

1. It adds support for the new `mode="r+"`.
2. `consolidated=True` in `to_zarr()` now means "open in consolidated
   mode" if using using `mode="r+"`, instead of "write in consolidated
   mode" (which would not make sense for r+).
3. It allows setting `consolidated=True` when using `region`, mostly
   for the sake of fast store opening with r+.
4. Validation in `to_zarr()` has been reorganized to always use the
   _existing_ Zarr group, rather than re-opening zar stores from
   scratch, which could require additional network requests.
5. Incidentally, I've renamed the `ZarrStore.ds` attribute to
   `ZarrStore.zarr_group`, which is a much more descriptive name.

These changes gave me a ~5x boost in write performance in a large
parallel job making use of `to_zarr` with `region`.
Copy link
Collaborator

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙌

@rabernat
Copy link
Contributor

rabernat commented May 4, 2021

Question: does this mode still require eager loading of dimension coordinates?

@shoyer
Copy link
Member Author

shoyer commented May 4, 2021 via email

@shoyer
Copy link
Member Author

shoyer commented May 4, 2021

I pushed a commit making consolidated metadata the default (see #5251). This was easier to do building on top of my existing refactor here (but conceivably could be separated).

@shoyer shoyer changed the title Add mode="r+" for to_zarr and improve to_zarr performance Add mode="r+" for to_zarr and use consolidated writes/reads by default May 4, 2021
@dcherian dcherian requested a review from rabernat May 4, 2021 21:40
@shoyer
Copy link
Member Author

shoyer commented May 6, 2021

Thanks @joshmoore for the copy-editing!

@pep8speaks
Copy link

pep8speaks commented May 6, 2021

Hello @shoyer! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-06-17 16:06:29 UTC

@shoyer
Copy link
Member Author

shoyer commented Jun 10, 2021

@rabernat gentle ping for review here :)

@shoyer
Copy link
Member Author

shoyer commented Jun 10, 2021

@nbren12 @spencerkclark are either of you up for a review here?

@dcherian
Copy link
Contributor

I vote to merge since it's so useful and @shoyer has probably found all the bugs already :)

@shoyer
Copy link
Member Author

shoyer commented Jun 17, 2021

I vote to merge since it's so useful and @shoyer has probably found all the bugs already :)

Hah :). We've been using this branch at Google for a while now and it seems to be working pretty well.

@dcherian
Copy link
Contributor

it seems to be working pretty well.

best kind of code review!

Thanks @shoyer

@dcherian dcherian merged commit 158bc52 into pydata:master Jun 17, 2021
@github-actions
Copy link
Contributor

Unit Test Results

0 files  ±0  0 suites  ±0   0s ⏱️ ±0s
0 tests ±0  0 ✔️ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 158bc52. ± Comparison against base commit 158bc52.

@rabernat
Copy link
Contributor

Really sorry I didn't get around to review. My excuse is that I moved back to NYC last week and fell behind on everything. Thanks for moving it forward. 💪

@nbren12
Copy link
Contributor

nbren12 commented Jun 22, 2021

I'm sorry too! I don't have any good excuse though...

@shoyer
Copy link
Member Author

shoyer commented Jun 22, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants