Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change isinstance checks to duck Dask Array checks #4208 #4221

Merged
merged 27 commits into from
Sep 2, 2020

Conversation

rpmanser
Copy link
Contributor

@rpmanser rpmanser commented Jul 13, 2020

I added the discussed is_duck_dask_array(x) function and replaced isinstance(x, dask_array_type) checks with is_duck_dask_array(x). Existing tests are passing. I did not change isinstance(x, dask_array_type) checks in the testing modules since I am unsure if that is appropriate.

As for additional tests, I am leaning towards testing Xarray ( Pint ( Dask ) ) ) objects as implemented in upcoming Pint v0.15 rather than a mock class, but I am open to discussion. I am unfamiliar with the release schedules of Pint and Xarray but I imagine that would influence which direction the testing should go.

Copy link
Contributor

@jthielen jthielen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for going through and finding/adjusting all of these Dask checks!

I've left some comments below, which, other than the one about is_duck_dask_array still using isinstance instead of is_dask_collection, are mostly there to help get discussion going on a few points.

xarray/core/dask_array_compat.py Outdated Show resolved Hide resolved
xarray/core/duck_array_ops.py Outdated Show resolved Hide resolved
xarray/core/duck_array_ops.py Outdated Show resolved Hide resolved
xarray/core/formatting.py Outdated Show resolved Hide resolved
xarray/core/formatting.py Outdated Show resolved Hide resolved
xarray/core/indexing.py Outdated Show resolved Hide resolved
xarray/core/nanops.py Outdated Show resolved Hide resolved
xarray/core/rolling_exp.py Outdated Show resolved Hide resolved
xarray/core/variable.py Outdated Show resolved Hide resolved
xarray/core/variable.py Outdated Show resolved Hide resolved
@jthielen
Copy link
Contributor

Also, a broader discussion that's I've seen hinted to in the past, but is brought to the forefront by this PR: how should xarray be checking for general duck array types it can wrap?

Right now, it looks like there is a mix of

is_array_like(data)

and

hasattr(data, "__array_function__") or isinstance(data, dask_array_type)

and it would be nice to bring consistency. However, both these checks seem like they'd let too many types through. For example, is_array_like as currently implemented would still let through too much, such as MemoryCachedArray and xarray Variables/DataArrays, and __array_function__ doesn't truly indicate a NumPy duck array on its own (and it really only works to only capture downcast types right now since xarray does not yet implement __array_function__ itself, #3917).

@keewis
Copy link
Collaborator

keewis commented Jul 14, 2020

Revisiting it I now realize that is_array_like is actually not a good name for that function: officially, "array_like" means that it can be casted using np.asarray. Maybe we should rename that to is_duck_array? But then we'd also need to check for __array_ufunc__ and __array_function__ because we also need those to properly wrap duck arrays.

@jthielen
Copy link
Contributor

jthielen commented Jul 14, 2020

Revisiting it I now realize that is_array_like is actually not a good name for that function: officially, "array_like" means that it can be casted using np.asarray. Maybe we should rename that to is_duck_array? But then we'd also need to check for __array_ufunc__ and __array_function__ because we also need those to properly wrap duck arrays.

As long as that doesn't break any of the current uses, I think that would be the best way forwards. This would require xarray to be on NumPy 1.16+ (in order to ensure hasattr(np.ndarray, "__array_function__"), but that's only 9 days away by NEP 29, so hopefully that isn't too much of an issue. Then, most of the checks for "can this be wrapped by xarray" should be able to become something like

is_duck_array(x) and not isinstance(x, (DataArray, Variable))

with is_duck_dask_array only being needed for checking Dask Array-like support.

Although, while we're at it, do we also need more careful handling of upcast types? I'm not sure if there even are any out there right now (not sure if a HoloViews Dataset counts here), but that doesn't necessarily mean there never will be any.

@pep8speaks
Copy link

pep8speaks commented Jul 14, 2020

Hello @rpmanser! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-31 21:28:01 UTC

@rpmanser
Copy link
Contributor Author

I'm getting a bit lost within the testing suite while trying to figure out the implications of this change and what kinds of tests are necessary for it. Would a test module for this

  • Individually test all functions that include the newly implemented is_duck_dask_array()?
  • Test that a duck dask Array has attributes of a dask Array and redirects the appropriate methods to those defined by Variable, DataArray, and Dataset?
  • Be similar to test_dask.py?
  • All, some, or none of the above?

Any suggestions would be greatly appreciated.

@jthielen
Copy link
Contributor

@rpmanser For what it's worth, I'd think doing tests like test_dask.py but with a post-hgrecco/pint#1129 Pint Quantity or suitably mocked wrapper class between xarray and Dask would be best. But, I think this is more of a maintainer-level question.

@dcherian
Copy link
Contributor

Thanks for working on this @rpmanser.

Since you've changed every isinstance check, if test_dask passes, you're doing fine in general.

Test that a duck dask Array has attributes of a dask Array and redirects the appropriate methods to those defined by Variable, DataArray, and Dataset?

Seems like this is covered by test_dask already but I could be wrong.

We could add specific pint-as-dask tests to test_units with appropriate xfails (they should run on the upstream-dev test config).

This could be a discussion topic for our next dev meeting on wednesday.

@rpmanser
Copy link
Contributor Author

I have erred on the side of leaving the isinstance checks in place where @jthielen reviewed them. If it is decided that no further tests are needed, then this should be ready for review.

On another note, if it is appropriate to apply these changes:

Revisiting it I now realize that is_array_like is actually not a good name for that function: officially, "array_like" means that it can be casted using np.asarray. Maybe we should rename that to is_duck_array? But then we'd also need to check for __array_ufunc__ and __array_function__ because we also need those to properly wrap duck arrays.

As long as that doesn't break any of the current uses, I think that would be the best way forwards. This would require xarray to be on NumPy 1.16+ (in order to ensure hasattr(np.ndarray, "__array_function__"), but that's only 9 days away by NEP 29, so hopefully that isn't too much of an issue. Then, most of the checks for "can this be wrapped by xarray" should be able to become something like

is_duck_array(x) and not isinstance(x, (DataArray, Variable))

with is_duck_dask_array only being needed for checking Dask Array-like support.

I can do that as well. If not, should a new issue be opened for this?

xarray/backends/common.py Show resolved Hide resolved
xarray/backends/common.py Outdated Show resolved Hide resolved
xarray/core/duck_array_ops.py Outdated Show resolved Hide resolved
xarray/core/duck_array_ops.py Show resolved Hide resolved
xarray/core/nanops.py Outdated Show resolved Hide resolved
xarray/core/variable.py Show resolved Hide resolved
@rpmanser rpmanser changed the title WIP: Change isinstance checks to duck Dask Array checks #4208 Change isinstance checks to duck Dask Array checks #4208 Jul 27, 2020
@rpmanser rpmanser marked this pull request as ready for review July 27, 2020 21:25
Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rpmanser, this is quite close! Thanks for your efforts here.

* Rename `is_array_like` to `is_duck_array`

* `is_duck_array` checks for `__array_function__` and `__array_ufunc__`
  in addition to previous checks

* Replace checks for `is_duck_dask_array` and `__array_function__` with
  `is_duck_array`
@keewis
Copy link
Collaborator

keewis commented Aug 27, 2020

we went with

or (not IS_NEP18_ACTIVE and isinstance(data, np.ndarray))
in #4379, you could use something like that here:
if is_duck_array(a_mapping[k]) or is_duck_array(b_mapping[k]):
too, e.g. with:

def is_duck_array_or_ndarray(array):
    return is_duck_array(array) or (not IS_NEP18_ACTIVE and isinstance(data, np.ndarray))

xarray/core/utils.py Outdated Show resolved Hide resolved
* Add explicit check for NumPy array to is_duck_array

* Replace is_duck_array_or_ndarray checks with is_duck_array
@rpmanser
Copy link
Contributor Author

I meant is_duck_array_or_ndarray, not is_duck_dask_array_or_ndarray. I don't suppose there's a way to fix that at this point?

@keewis
Copy link
Collaborator

keewis commented Aug 28, 2020

you mean the commit message? If so, you can use git commit --amend to change it (make sure you don't have anything staged because that would be added to the commit) and force-push. It's not that important, though, I doubt anyone really reads the commit messages.

Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rpmanser

xarray/core/variable.py Outdated Show resolved Hide resolved
xarray/core/variable.py Outdated Show resolved Hide resolved
Co-authored-by: keewis <keewis@users.noreply.github.com>
Copy link
Collaborator

@keewis keewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two comments, otherwise looks good to me.

xarray/tests/test_units.py Outdated Show resolved Hide resolved
xarray/core/variable.py Outdated Show resolved Hide resolved
@@ -389,7 +389,7 @@ def _bottleneck_reduce(self, func, **kwargs):
valid = (slice(None),) * axis + (slice(-shift, None),)
padded = padded.pad({self.dim[0]: (0, -shift)}, mode="constant")

if isinstance(padded.data, dask_array_type):
if is_duck_dask_array(padded.data):
raise AssertionError("should not be reachable")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pydata/xarray: why is that here? If it is never reachable, it should be fine to just not do the branching (and if it is necessary, it should be fine to use assert not is_duck_dask_array(padded.data), "should not be reachable").

Also, if we raise the code below will never be reached, so we should be able to remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's from #3040 . We would like to use this code path but there are bugs IIRC

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then I guess we should raise a NotImplementedError or a ValueError instead of a AssertionError and improve the error message?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could comment it out and add a TODO note. The if condition in Line 420 makes sure that this is never reached.

xarray/tests/test_dask.py Outdated Show resolved Hide resolved
xarray/tests/test_dask.py Outdated Show resolved Hide resolved
xarray/tests/test_dask.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@keewis keewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The reason for requires_pint_0_15 would be nice but I don't feel strongly about it so it's up to you.

Thanks for being patient, @rpmanser.

@rpmanser
Copy link
Contributor Author

LGTM. The reason for requires_pint_0_15 would be nice but I don't feel strongly about it so it's up to you.

Thanks for being patient, @rpmanser.

No worries, thanks for the reviews!

@dcherian
Copy link
Contributor

dcherian commented Sep 2, 2020

Things seem to have died down here so let's test it out.

Thanks @rpmanser This is an amazing first PR. Thanks for your efforts and welcome to xarray!

@dcherian dcherian merged commit dc2dd89 into pydata:master Sep 2, 2020
@rpmanser
Copy link
Contributor Author

rpmanser commented Sep 2, 2020

@dcherian No problem! Thanks for all your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for duck Dask Arrays
6 participants