
Reduce graph size through writing indexes directly into graph for map_blocks #9658

Merged: 4 commits into pydata:main from phofl:graph-size on Oct 22, 2024

Conversation

@phofl (Contributor) commented on Oct 22, 2024

When looking at the map_blocks variant of the benchmark here, I noticed that we had a 30 MiB graph for the medium variant. 10 MiB of that was just the repeated embedding of the PandasIndexes as an argument for every map_blocks task. Writing them directly into the graph de-duplicates the value, so the object is stored only once instead of many times, and the function arguments can then reference its key.

The tokenise call adds some overhead, so there is a drawback to this.
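For illustration, here is a minimal sketch of the two graph layouts. The key names and the `apply` placeholder are hypothetical, not xarray's actual `map_blocks` internals:

```python
import pandas as pd
from dask.base import tokenize

index = pd.RangeIndex(1_000_000)  # stand-in for a repeated PandasIndex


def apply(block, idx):
    # placeholder for the real per-chunk wrapper
    return block


# Before: the index object is embedded in every task tuple, so
# serializing the graph repeats it once per chunk.
graph_before = {
    ("mapped", i): (apply, ("blocks", i), index) for i in range(100)
}

# After: the index is written into the graph exactly once under a
# deterministic key (this tokenize call is where the extra overhead
# comes from), and every task references the key instead.
index_key = "index-" + tokenize(index)
graph_after = {index_key: index}
graph_after.update(
    {("mapped", i): (apply, ("blocks", i), index_key) for i in range(100)}
)
```

Because dask substitutes a key's value into any task that references it, each task now carries only the short key string while the index itself is serialized a single time.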

Happy to open an issue if required

Is this something that you all would consider merging?

cc @dcherian

@dcherian (Contributor) commented on Oct 22, 2024

> Is this something that you all would consider merging?

Yes. This would extend what was done in #8412 to address #8409.

I think I initially chose this approach to make the graph absolutely embarrassingly parallel.

Can you add a test to dask or confirm that dask has a test that makes sure the ordering/scheduling is "nice" for this kind of case (widely duplicated dependency)?

@phofl (Contributor, Author) commented on Oct 22, 2024

> Can you add a test to dask or confirm that dask has a test that makes sure the ordering/scheduling is "nice" for this kind of case (widely duplicated dependency)?

This is the same pattern as we have with the open_dataset tasks and in the new dask array take implementation; we have existing tests covering these two patterns already.

@dcherian enabled auto-merge (squash) on October 22, 2024 at 15:55
@dcherian (Contributor) commented on Oct 22, 2024

> This is the same pattern as we have with the open_dataset tasks

Ya this pattern used to suck, so I avoided it. But good to know it's fixed. Thanks for the PR!

@phofl (Contributor, Author) commented on Oct 22, 2024

> Ya this pattern used to suck, so I avoided it. But good to know it's fixed. Thanks for the PR!

TL;DR: if your task does not have a callable, it will be ignored during ordering.
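A minimal sketch of that rule with hypothetical keys (the exact priorities returned depend on the dask version):

```python
import numpy as np
from dask.order import order

shared = np.arange(10)  # stand-in for the index written into the graph

dsk = {
    # Plain data, no callable in the value: per the TL;DR above this
    # key is treated as already-available data, not work to schedule.
    "shared-data": shared,
    # These values contain a callable (np.sum), so they are real
    # tasks and do participate in ordering.
    ("sum", 0): (np.sum, "shared-data"),
    ("sum", 1): (np.sum, "shared-data"),
}

priorities = order(dsk)  # maps keys to static scheduling priorities
```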

@dcherian merged commit 5632c8e into pydata:main on Oct 22, 2024
34 checks passed
@phofl deleted the graph-size branch on October 22, 2024 at 16:36
dcherian added a commit to dcherian/xarray that referenced this pull request Oct 29, 2024
* main:
  Add `DataTree.persist` (pydata#9682)
  Typing annotations for arithmetic overrides (e.g., DataArray + Dataset) (pydata#9688)
  Raise `ValueError` for unmatching chunks length in `DataArray.chunk()` (pydata#9689)
  Fix inadvertent deep-copying of child data in DataTree (pydata#9684)
  new blank whatsnew (pydata#9679)
  v2024.10.0 release summary (pydata#9678)
  drop the length from `numpy`'s fixed-width string dtypes (pydata#9586)
  fixing behaviour for group parameter in `open_datatree` (pydata#9666)
  Use zarr v3 dimension_names (pydata#9669)
  fix(zarr): use inplace array.resize for zarr 2 and 3 (pydata#9673)
  implement `dask` methods on `DataTree` (pydata#9670)
  support `chunks` in `open_groups` and `open_datatree` (pydata#9660)
  Compatibility for zarr-python 3.x (pydata#9552)
  Update to_dataframe doc to match current behavior (pydata#9662)
  Reduce graph size through writing indexes directly into graph for ``map_blocks`` (pydata#9658)
dcherian added a commit to dcherian/xarray that referenced this pull request Nov 3, 2024
* main: (85 commits)
  Refactor out utility functions from to_zarr (pydata#9695)
  Use the same function to floatize coords in polyfit and polyval (pydata#9691)
  Add `DataTree.persist` (pydata#9682)
  Typing annotations for arithmetic overrides (e.g., DataArray + Dataset) (pydata#9688)
  Raise `ValueError` for unmatching chunks length in `DataArray.chunk()` (pydata#9689)
  Fix inadvertent deep-copying of child data in DataTree (pydata#9684)
  new blank whatsnew (pydata#9679)
  v2024.10.0 release summary (pydata#9678)
  drop the length from `numpy`'s fixed-width string dtypes (pydata#9586)
  fixing behaviour for group parameter in `open_datatree` (pydata#9666)
  Use zarr v3 dimension_names (pydata#9669)
  fix(zarr): use inplace array.resize for zarr 2 and 3 (pydata#9673)
  implement `dask` methods on `DataTree` (pydata#9670)
  support `chunks` in `open_groups` and `open_datatree` (pydata#9660)
  Compatibility for zarr-python 3.x (pydata#9552)
  Update to_dataframe doc to match current behavior (pydata#9662)
  Reduce graph size through writing indexes directly into graph for ``map_blocks`` (pydata#9658)
  Add close() method to DataTree and use it to clean-up open files in tests (pydata#9651)
  Change URL for pydap test (pydata#9655)
  Fix multiple grouping with missing groups (pydata#9650)
  ...
Development

Successfully merging this pull request may close these issues.

Task graphs on .map_blocks with many chunks can be huge
2 participants