Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement dask methods on DataTree #9670

Merged
merged 26 commits into from
Oct 24, 2024

Conversation

keewis
Copy link
Collaborator

@keewis keewis commented Oct 24, 2024

For now this only contains compute and load, but in implementing those I've found a bug in copy: we don't actually (shallow) copy the variables, which means that my implementation of compute would modify the original tree.

@keewis keewis added topic-DataTree Related to the implementation of a DataTree class topic-chunked-arrays Managing different chunked backends, e.g. dask labels Oct 24, 2024
xarray/core/datatree.py Outdated Show resolved Hide resolved
xarray/core/datatree.py Outdated Show resolved Hide resolved
xarray/core/datatree.py Outdated Show resolved Hide resolved
Comment on lines +870 to +872
data = self._to_dataset_view(rebuild_dims=False, inherit=inherit)._copy(
deep=deep, memo=memo
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

@keewis
Copy link
Collaborator Author

keewis commented Oct 24, 2024

we're still missing an implementation for persist, which I skipped mostly because I wasn't sure how to test that. @TomNicholas, would you like to help me with that? If so, feel free to directly push to this PR.

{
dim: size
for dim, size in combined_chunks.items()
if dim in node.dataset.dims
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to use node.dataset.dims to avoid including inherited dims (which we can't chunk anyways because we only inherit indexed dims, and there is no chunked index so far)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otherwise I just saw _node_dims in _get_all_dims, so that might have less overhead. I'll go ahead and use that instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will also be more explicit as it will avoid using "rebuilt" dims (which doesn't really matter anyway because we can't chunk indexes).

@TomNicholas
Copy link
Member

we're still missing an implementation for persist, which I skipped mostly because I wasn't sure how to test that. @TomNicholas, would you like to help me with that? If so, feel free to directly push to this PR.

I'm happy to help, but I think that (a) persist is much more niche than load and compute, so isn't as important to be in there before release, and (b) it's a separate method rather than a bugfix to this PR so it would make more sense for me to add that in a follow-up PR and we just merge this one without persist.

@keewis
Copy link
Collaborator Author

keewis commented Oct 24, 2024

that would be fine with me.

That leaves us with the question whether we should guard against chunk with non-existing dims, and after resolving that we should be ready for a final review.

@TomNicholas
Copy link
Member

whether we should guard against chunk with non-existing dims

We should, see #9670 (comment)

@keewis
Copy link
Collaborator Author

keewis commented Oct 24, 2024

We should, see #9670 (comment)

yeah, I saw that one after posting. Should be done now, though.

Comment on lines 1990 to 1991
Mapping from group paths to a mapping of dimension names to block lengths for this datatree's data, or None if
the underlying data is not a dask array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently this docstring (and the one for .chunksizes on Dataset etc.) are wrong because this method will never return a None.

@keewis
Copy link
Collaborator Author

keewis commented Oct 24, 2024

actually, I just tried to compute a DataTree object without chunked arrays (while trying to investigated your comment on chunksizes), and I got an ugly error. I'm investigating, but should not take too long to fix.

@keewis
Copy link
Collaborator Author

keewis commented Oct 24, 2024

Tell me what you think about the reworded docstrings of chunksizes / Dataset.chunks, but otherwise this should be good to go, too.

@TomNicholas
Copy link
Member

Looks good! Let's merge.

@keewis
Copy link
Collaborator Author

keewis commented Oct 24, 2024

(I just noticed that DataArray.chunksizes has the same issue, so I'll change that, too)

@keewis keewis enabled auto-merge (squash) October 24, 2024 18:41
@TomNicholas
Copy link
Member

NamedArray.chunksizes too hahaha

@keewis
Copy link
Collaborator Author

keewis commented Oct 24, 2024

should be good now, hopefully

@TomNicholas TomNicholas enabled auto-merge (squash) October 24, 2024 18:52
@TomNicholas TomNicholas merged commit 1f5889a into pydata:main Oct 24, 2024
29 checks passed
@keewis keewis deleted the datatree-dask-methods branch October 24, 2024 19:16
dcherian added a commit to dcherian/xarray that referenced this pull request Oct 29, 2024
* main:
  Add `DataTree.persist` (pydata#9682)
  Typing annotations for arithmetic overrides (e.g., DataArray + Dataset) (pydata#9688)
  Raise `ValueError` for unmatching chunks length in `DataArray.chunk()` (pydata#9689)
  Fix inadvertent deep-copying of child data in DataTree (pydata#9684)
  new blank whatsnew (pydata#9679)
  v2024.10.0 release summary (pydata#9678)
  drop the length from `numpy`'s fixed-width string dtypes (pydata#9586)
  fixing behaviour for group parameter in `open_datatree` (pydata#9666)
  Use zarr v3 dimension_names (pydata#9669)
  fix(zarr): use inplace array.resize for zarr 2 and 3 (pydata#9673)
  implement `dask` methods on `DataTree` (pydata#9670)
  support `chunks` in `open_groups` and `open_datatree` (pydata#9660)
  Compatibility for zarr-python 3.x (pydata#9552)
  Update to_dataframe doc to match current behavior (pydata#9662)
  Reduce graph size through writing indexes directly into graph for ``map_blocks`` (pydata#9658)
dcherian added a commit to dcherian/xarray that referenced this pull request Nov 3, 2024
* main: (85 commits)
  Refactor out utility functions from to_zarr (pydata#9695)
  Use the same function to floatize coords in polyfit and polyval (pydata#9691)
  Add `DataTree.persist` (pydata#9682)
  Typing annotations for arithmetic overrides (e.g., DataArray + Dataset) (pydata#9688)
  Raise `ValueError` for unmatching chunks length in `DataArray.chunk()` (pydata#9689)
  Fix inadvertent deep-copying of child data in DataTree (pydata#9684)
  new blank whatsnew (pydata#9679)
  v2024.10.0 release summary (pydata#9678)
  drop the length from `numpy`'s fixed-width string dtypes (pydata#9586)
  fixing behaviour for group parameter in `open_datatree` (pydata#9666)
  Use zarr v3 dimension_names (pydata#9669)
  fix(zarr): use inplace array.resize for zarr 2 and 3 (pydata#9673)
  implement `dask` methods on `DataTree` (pydata#9670)
  support `chunks` in `open_groups` and `open_datatree` (pydata#9660)
  Compatibility for zarr-python 3.x (pydata#9552)
  Update to_dataframe doc to match current behavior (pydata#9662)
  Reduce graph size through writing indexes directly into graph for ``map_blocks`` (pydata#9658)
  Add close() method to DataTree and use it to clean-up open files in tests (pydata#9651)
  Change URL for pydap test (pydata#9655)
  Fix multiple grouping with missing groups (pydata#9650)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-chunked-arrays Managing different chunked backends, e.g. dask topic-DataTree Related to the implementation of a DataTree class
Projects
Development

Successfully merging this pull request may close these issues.

3 participants