Reduce graph size through writing indexes directly into graph for map_blocks
#9658
Conversation
Yes. This would extend what was done in #8412 to address #8409, I think. I initially chose this approach to make the graph absolutely embarrassingly parallel. Can you add a test to dask, or confirm that dask already has a test, that makes sure the ordering/scheduling is "nice" for this kind of case (a widely duplicated dependency)?
This is the same pattern as we have with the open_dataset tasks and in the new dask array take implementation; we already have tests covering these two patterns.
Ya, this pattern used to suck, so I avoided it. But good to know it's fixed. Thanks for the PR!
TLDR is: if your task does not have a callable, it will be ignored during ordering.
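To make that distinction concrete, here is a minimal sketch using dask's legacy dict-based graph spec (the key names and the `len` task are made up for illustration, not taken from xarray or this PR): a value that is a tuple with a callable in the first position is a task, while a bare value is treated as already-available data.

```python
from dask.core import istask

# Minimal illustrative graph (keys and functions are hypothetical).
dsk = {
    "index": [1, 2, 3],            # raw data: no callable, so not a task
    ("block", 0): (len, "index"),  # a task: tuple with a callable first
    ("block", 1): (len, "index"),
}

print(istask(dsk["index"]))        # False -> treated as data, skipped during ordering
print(istask(dsk[("block", 0)]))   # True  -> scheduled and ordered like any other task
```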
Addresses #8409 (`.map_blocks` with many chunks can be huge).

When looking at the `map_blocks` variant of the benchmark here, I noticed that we had a 30 MiB graph for the medium variant. 10 MiB of that was just the repeated embedding of the PandasIndexes as an argument to every `map_blocks` task. Writing them directly into the graph de-duplicates the value, so the object is stored only once instead of many, many times, and the task arguments can then reference its key. The tokenize call adds some overhead, so there is a drawback to this.
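For illustration, a rough sketch of the two graph layouts using dask's dict-based graph spec (the `work` function, key names, and sizes are hypothetical; the actual change lives in xarray's `map_blocks` implementation):

```python
import dask
from dask.base import tokenize

# Stand-in for a large object, such as a PandasIndex, that every block needs.
big_index = list(range(100_000))


def work(block_id, index):
    # Placeholder for the per-block function applied by map_blocks.
    return block_id, len(index)


# Before: the large object is embedded in every task, so it is repeated
# once per chunk when the graph is built and serialized.
graph_embedded = {("block", i): (work, i, big_index) for i in range(100)}

# After: write the object into the graph once under its own key (the tokenize
# call adds some overhead here) and let each task reference that key instead.
index_key = "index-" + tokenize(big_index)
graph_deduped = {index_key: big_index}
graph_deduped.update({("block", i): (work, i, index_key) for i in range(100)})

# Both graphs compute the same results, e.g. with the synchronous scheduler.
results = dask.get(graph_deduped, [("block", i) for i in range(3)])
```

In the embedded variant the index is carried in all 100 tasks; in the de-duplicated variant it appears in the graph exactly once and the tasks only hold a small key reference.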
Happy to open an issue if required.
Is this something that you all would consider merging?
cc @dcherian