Make dask names change when chunking Variables by different amounts. #3584
dcherian merged 9 commits into pydata:master
When rechunking by the current chunk size, name should not change. Add a __dask_tokenize__ method for ReprObject so that this behaviour is present when DataArrays are converted to temporary Datasets and back.
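The naming rule described above can be illustrated without dask. The following is a minimal sketch, assuming a toy `toy_tokenize` function as a stand-in for `dask.base.tokenize` and a hypothetical `chunked_name` helper; neither is xarray's or dask's actual implementation. It only shows why a deterministic token gives the same name for the same chunks, a different name for different chunks, and why a sentinel object needs a `__dask_tokenize__` hook to tokenize by value rather than by identity.

```python
import hashlib


def toy_tokenize(*args):
    """Toy stand-in for dask.base.tokenize (an assumption, not dask's code)."""
    parts = []
    for arg in args:
        hook = getattr(arg, "__dask_tokenize__", None)
        if hook is not None:
            # The object tells us what its token should depend on.
            parts.append(repr(hook()))
        elif isinstance(arg, (str, int, float, tuple)):
            # Plain values are tokenized by value.
            parts.append(repr(arg))
        else:
            # Unknown objects fall back to an identity-based representation.
            parts.append(object.__repr__(arg))
    return hashlib.md5("|".join(parts).encode()).hexdigest()


class ReprObject:
    """Sentinel whose token depends on its value, not on object identity."""

    def __init__(self, value):
        self._value = value

    def __repr__(self):
        return self._value

    def __dask_tokenize__(self):
        return (type(self).__name__, self._value)


class PlainSentinel:
    """Same idea, but without the tokenize hook."""

    def __init__(self, value):
        self._value = value


def chunked_name(var_name, chunks):
    """Hypothetical helper: derive a graph name from the name and the chunks."""
    return f"xarray-{var_name}-{toy_tokenize(var_name, chunks)}"


same = chunked_name("x", (3,)) == chunked_name("x", (3,))
different = chunked_name("x", (3,)) != chunked_name("x", (5,))
```

With the hook, two `ReprObject("<this-array>")` instances produce the same token, so a name built from them survives a round trip through a temporary object; two `PlainSentinel` instances do not.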
…chunk-unique-token * 'chunk-unique-token' of github.com:dcherian/xarray: remove more computes.
The tests fail on
import dask
import xarray as xr
ds = xr.Dataset({'x': (('y',), dask.array.ones(10, chunks=(3,)))})
mapped = ds.map_blocks(lambda x: x)
mapped.compute()  # works
xr.testing.assert_equal(mapped, ds)  # does not work
xr.testing.assert_equal(mapped, ds.compute())  # works
xr.testing.assert_equal(mapped.compute(), ds)  # works
xr.testing.assert_equal(mapped.compute(), ds.compute())  # works
The traceback is
This key is not in
    for name, layer in graph.layers.items():
        deps = graph.dependencies[name]
        if (
            isinstance(layer, Blockwise)
            and len(deps) > 1
            and not any(dependencies[dep] for dep in deps)  # no need to fuse if 0 or 1
            and all(len(dependents[dep]) == 1 for dep in deps)
        ):
            new = toolz.merge(layer, *[layers[dep] for dep in deps])
            new, _ = fuse(new, keys, ave_width=len(deps))
I'm not sure whether this is a bug in
xarray/xarray/core/parallel.py Line 315 in 69c85b8
cc @mrocklin
So this is enough to fix this in Dask:
diff --git a/dask/blockwise.py b/dask/blockwise.py
index 52a36c246..84e0ecc08 100644
--- a/dask/blockwise.py
+++ b/dask/blockwise.py
@@ -818,7 +818,7 @@ def fuse_roots(graph: HighLevelGraph, keys: list):
         if (
             isinstance(layer, Blockwise)
             and len(deps) > 1
-            and not any(dependencies[dep] for dep in deps)  # no need to fuse if 0 or 1
+            and not any(dependencies.get(dep, {}) for dep in deps)  # no need to fuse if 0 or 1
             and all(len(dependents[dep]) == 1 for dep in deps)
         ):
             new = toolz.merge(layer, *[layers[dep] for dep in deps])
I'm trying to understand why we're getting this KeyError though. I want to make sure that we have a valid HighLevelGraph before making that change.
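To make the difference between the two lookups concrete, here is a small stdlib-only sketch. It reuses the layer names from the Pdb output in this thread as plain dictionaries; the helper names `all_roots_strict` and `all_roots_lenient` are hypothetical and only isolate the one condition that the diff changes, not the rest of `fuse_roots`.

```python
# Dependencies copied from the Pdb output in this thread.
dependencies = {
    "eq-e98e52fb2b8e27b4b5158d399330c72d": {
        "lambda-0f1d0bc5e7df462d7125839aed006e04-x",
        "ones-c4a83f4b990021618d55e0fa61a351d6",
    },
    "lambda-0f1d0bc5e7df462d7125839aed006e04": {
        "ones-c4a83f4b990021618d55e0fa61a351d6"
    },
    "ones-c4a83f4b990021618d55e0fa61a351d6": set(),
}
deps = dependencies["eq-e98e52fb2b8e27b4b5158d399330c72d"]


def all_roots_strict(deps, dependencies):
    """Original condition: indexing fails on a name that is not a key."""
    return not any(dependencies[dep] for dep in deps)


def all_roots_lenient(deps, dependencies):
    """Patched condition: an unknown name is treated as having no dependencies."""
    return not any(dependencies.get(dep, {}) for dep in deps)


try:
    all_roots_strict(deps, dependencies)
    strict_error = None
except KeyError as err:
    # The missing key is the dependency name that is not a layer name.
    strict_error = err.args[0]

lenient_result = all_roots_lenient(deps, dependencies)
```

The lenient version no longer raises, but it also silently treats the unknown `...-x` name as a root, which is why checking that the graph itself is valid matters before relaxing the lookup.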
@mrocklin if you get a chance, can you confirm that the values in
So in the following, the
(Pdb) pp list(self.layers)
['eq-e98e52fb2b8e27b4b5158d399330c72d',
 'lambda-0f1d0bc5e7df462d7125839aed006e04',
 'ones-c4a83f4b990021618d55e0fa61a351d6']
(Pdb) pp self.dependencies
{'eq-e98e52fb2b8e27b4b5158d399330c72d': {'lambda-0f1d0bc5e7df462d7125839aed006e04-x',
                                         'ones-c4a83f4b990021618d55e0fa61a351d6'},
 'lambda-0f1d0bc5e7df462d7125839aed006e04': {'ones-c4a83f4b990021618d55e0fa61a351d6'},
 'ones-c4a83f4b990021618d55e0fa61a351d6': set()}
That's coming from the
That sounds like a reasonable expectation, but honestly it's been a while, so I don't fully trust my knowledge here. It might be worth adding some runtime checks into the
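One form such a runtime check could take is sketched below. This is an assumption about what a check might look like, not an existing dask API: the function `dangling_dependencies` is hypothetical and works on plain collections, testing only the expectation discussed here, namely that every dependency name is also a layer name.

```python
def dangling_dependencies(layers, dependencies):
    """Return {layer name: dependency names that are not layer names}.

    An empty result means every dependency refers to a known layer.
    """
    known = set(layers)
    problems = {}
    for name, deps in dependencies.items():
        missing = set(deps) - known
        if missing:
            problems[name] = missing
    return problems


# Layer names and dependencies copied from the Pdb output in this thread.
layers = [
    "eq-e98e52fb2b8e27b4b5158d399330c72d",
    "lambda-0f1d0bc5e7df462d7125839aed006e04",
    "ones-c4a83f4b990021618d55e0fa61a351d6",
]
dependencies = {
    "eq-e98e52fb2b8e27b4b5158d399330c72d": {
        "lambda-0f1d0bc5e7df462d7125839aed006e04-x",
        "ones-c4a83f4b990021618d55e0fa61a351d6",
    },
    "lambda-0f1d0bc5e7df462d7125839aed006e04": {
        "ones-c4a83f4b990021618d55e0fa61a351d6"
    },
    "ones-c4a83f4b990021618d55e0fa61a351d6": set(),
}

problems = dangling_dependencies(layers, dependencies)
```

For the graph shown above, the check flags the `eq-...` layer, whose dependency on the `...-x` name has no matching layer.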
This fixes an issue with the HighLevelGraph noted in pydata#3584, and exposed by a recent change in Dask to do more HLG fusion.
* Fix map_blocks HLG layering
  This fixes an issue with the HighLevelGraph noted in #3584, and exposed by a recent change in Dask to do more HLG fusion.
* update
* black
* update
* upstream/master: Fix map_blocks HLG layering (pydata#3598) Silence sphinx warnings: Round 2 (pydata#3592) 2x~5x speed up for isel() in most cases (pydata#3533) remove xarray again (pydata#3591) fix plotting with transposed nondim coords. (pydata#3441) make coarsen reductions consistent with reductions on other classes (pydata#3500) Resolve the version issues on RTD (pydata#3589) Add bottleneck & rasterio git tip to upstream-dev CI (pydata#3585)
…oken * 'master' of github.com:pydata/xarray: Add nanmedian for dask arrays (pydata#3604) added pyinterp to related projects (pydata#3655) Allow incomplete hypercubes in combine_by_coords (pydata#3649) concat keeps attrs from first variable. (pydata#3637) Extend DatetimeAccessor properties and support `.dt` accessor for Timedelta (pydata#3612) update readthedocs.yml (pydata#3639) silence sphinx warnings round 3 (pydata#3602) Fix/quantile wrong errmsg (pydata#3635) Provide shape info in shape mismatch error. (pydata#3619) Minor doc fixes (pydata#3615) Respect user-specified coordinates attribute. (pydata#3487) Add Facetgrid.row_labels & Facetgrid.col_labels (pydata#3597) Fix pint integration tests (pydata#3600) Minor fix to combine_by_coords to allow for the combination of CFTimeIndexes separated by large time intervals (pydata#3543)
gentle ping @crusaderky
Co-Authored-By: crusaderky <crusaderky@gmail.com>
Thanks @crusaderky
* upstream/master: allow passing any iterable to drop when dropping variables (pydata#3693) Typo on DataSet/DataArray.to_dict documentation (pydata#3692) Fix mypy type checking tests failure in ds.merge (pydata#3690) Explicitly convert result of pd.to_datetime to a timezone-naive type (pydata#3688) ds.merge(da) bugfix (pydata#3677) fix docstring for combine_first: returns a Dataset (pydata#3683) Add option to choose mfdataset attributes source. (pydata#3498) How do I add a new variable to dataset. (pydata#3679) Add map_blocks example to whats-new (pydata#3682) Make dask names change when chunking Variables by different amounts. (pydata#3584) raise an error when renaming dimensions to existing names (pydata#3645) Support swap_dims to dimension names that are not existing variables (pydata#3636) Add map_blocks example to docs. (pydata#3667) add multiindex level name checking to .rename() (pydata#3658)
* upstream/master: Add an example notebook using apply_ufunc to vectorize 1D functions (pydata#3629) Use encoding['dtype'] over data.dtype when possible within CFMaskCoder.encode (pydata#3652) allow passing any iterable to drop when dropping variables (pydata#3693) Typo on DataSet/DataArray.to_dict documentation (pydata#3692) Fix mypy type checking tests failure in ds.merge (pydata#3690) Explicitly convert result of pd.to_datetime to a timezone-naive type (pydata#3688) ds.merge(da) bugfix (pydata#3677) fix docstring for combine_first: returns a Dataset (pydata#3683) Add option to choose mfdataset attributes source. (pydata#3498) How do I add a new variable to dataset. (pydata#3679) Add map_blocks example to whats-new (pydata#3682) Make dask names change when chunking Variables by different amounts. (pydata#3584) raise an error when renaming dimensions to existing names (pydata#3645) Support swap_dims to dimension names that are not existing variables (pydata#3636) Add map_blocks example to docs. (pydata#3667) add multiindex level name checking to .rename() (pydata#3658)
* upstream/master: (23 commits) Feature/align in dot (pydata#3699) ENH: enable `H5NetCDFStore` to work with already open h5netcdf.File a… (pydata#3618) One-off isort run (pydata#3705) hardcoded xarray.__all__ (pydata#3703) Bump mypy to v0.761 (pydata#3704) remove DataArray and Dataset constructor deprecations for 0.15 (pydata#3560) Tests for variables with units (pydata#3654) Add an example notebook using apply_ufunc to vectorize 1D functions (pydata#3629) Use encoding['dtype'] over data.dtype when possible within CFMaskCoder.encode (pydata#3652) allow passing any iterable to drop when dropping variables (pydata#3693) Typo on DataSet/DataArray.to_dict documentation (pydata#3692) Fix mypy type checking tests failure in ds.merge (pydata#3690) Explicitly convert result of pd.to_datetime to a timezone-naive type (pydata#3688) ds.merge(da) bugfix (pydata#3677) fix docstring for combine_first: returns a Dataset (pydata#3683) Add option to choose mfdataset attributes source. (pydata#3498) How do I add a new variable to dataset. (pydata#3679) Add map_blocks example to whats-new (pydata#3682) Make dask names change when chunking Variables by different amounts. (pydata#3584) raise an error when renaming dimensions to existing names (pydata#3645) ...
black . && mypy . && flake8
whats-new.rst for all changes and api.rst for new API