add 'no_conflicts' as compat option for merging non-conflicting data #996
Conversation
```diff
@@ -352,6 +352,43 @@ def test_broadcast_equals(self):
         self.assertFalse(a.broadcast_equals(c))
         self.assertFalse(c.broadcast_equals(a))

+    def test_notnull_equals(self):
```
Rather than writing such exhaustive test cases for each xarray class, consider writing comprehensive unit tests just for `array_notnull_equiv`.
Yes, I'll move some of the redundant tests to do with different types of overlap etc. into `test_ops`.
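For reference, the check under discussion can be sketched as a small numpy function. This is a simplified sketch, not the exact xarray implementation: the name `array_notnull_equiv` comes from this thread, and the sketch treats `NaN` as the only null value.

```python
import numpy as np

def array_notnull_equiv(arr1, arr2):
    # Two arrays are compatible if, at every position, the values are
    # equal or at least one of the two values is null (NaN here).
    arr1 = np.asarray(arr1, dtype=float)
    arr2 = np.asarray(arr2, dtype=float)
    if arr1.shape != arr2.shape:
        return False
    flag_array = (arr1 == arr2)
    flag_array |= np.isnan(arr1) | np.isnan(arr2)
    return bool(flag_array.all())
```

With this, `array_notnull_equiv([1.0, np.nan], [1.0, 2.0])` is compatible (the mismatch is masked by a null), while `[1.0, 2.0]` vs `[1.0, 3.0]` is a genuine conflict.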
This is quite clever! I'm impressed this was so easy.

One potential concern here is that performance is not going to be so great if you attempt to combine a bunch of variables with lazy data loaded with dask, because each comparison will trigger a separate computation. To that end, it would be nice to do the safety check in a single dask operation.

To be fair, this is already an issue if you're trying to merge lots of datasets together. But I expect this will become more of an issue once there are useful ways to merge a bunch of datasets, which is what this PR enables.

```python
flag_array = (arr1 == arr2)
```

```python
# GH837, GH861
```
I'm not sure why we missed this before, but these comments (and the ignoring of `TypeError`) should probably go in the `isnull` function instead, rather than being repeated here.
Yes, it would be good not to repeat. However, if this goes in `isnull`, then `TypeError` might be suppressed in cases where it is useful. Swallowing the error can, I think, only work in situations (such as these two functions) where an array is being modified by the result and the whole operation can be skipped if it fails. But maybe there is another way to do it?

The only (quite ugly) way I can think of is something like this (leaving `isnull` as-is):
```python
def isnull_with_no_type_fail(a):
    # comments about why needed etc.
    try:
        b = isnull(a)
    except TypeError:
        b = None
    return b
```
and then in `array_equiv`/`array_notnull_equiv`:

```python
...
arr1_isnull = isnull_with_no_type_fail(arr1)
if arr1_isnull is not None:
    arr2_isnull = isnull_with_no_type_fail(arr2)
    if arr2_isnull is not None:
        flag_array |= (arr1_isnull | arr2_isnull)
...
```
How about this version?

```python
def isnull(a):
    try:
        # delegate to the underlying null check (pd.isnull in xarray's
        # ops); calling this wrapper itself here would recurse forever
        b = pd.isnull(a)
    except TypeError:
        b = np.zeros(a.shape, bool)
    return b
```

The `TypeError` should be pretty rare, so it's OK to do a little extra work in that case.
👍 it's much less obvious to treat missing variables as "missing data."
Yes, this is consistent with the xarray/pandas type system. (Which is not ideal, but that's another, bigger issue.)
One of the commits in this PR is not yours. Take a look here -- you want to rebase instead of merge (yes, git is confusing). I think you can still safely rebase now.
Ah sorry - yes, I rebased locally and then mistakenly merged the remote fork...

Yes, I thought that might be better also, the advantages of ...

I will have a look into how to do this, but am not that familiar with dask.
(force-pushed from c3ad9c7 to ae90fa1)
I think this is nearly ready. We just need to decide on the final keyword name (e.g., ...). I'm not sure we really need this new method for Dataset/DataArray (for the current implementation, adding it on Variable would be enough). Given that the user can catch ...
OK, I have stripped the Dataset/DataArray methods, which I agree were largely redundant. Since this sets this type of comparison/merge slightly apart, ... And I've done a first pass at updating the docs.
and fix Raises ValueError -> MergeError
Update combining.rst
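The first commit message above refers to switching the raised exception from `ValueError` to `MergeError`. A minimal sketch of why that rename can be made without breaking callers (the class name is from the commit; the `ValueError` base reflects how xarray keeps old `except ValueError` code working, though the actual class definition may differ):

```python
class MergeError(ValueError):
    """Raised when variables conflict during a merge.

    Subclassing ValueError means existing code written as
    `except ValueError` still catches merge conflicts.
    """

# hypothetical caller that predates the rename
try:
    raise MergeError("conflicting values for variable 'x'")
except ValueError as err:
    caught = str(err)
```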
This looks great -- thanks!
```python
xr.merge([ds1, ds2], compat='no_conflicts')
```

Note that due to the underlying representation of missing values as floating
point numbers (``NaN``) or generic objects (``None``), variable data type is not
I deleted

> or generic objects (``None``)

We only use `NaN` (like pandas).
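The dtype caveat the quoted docs describe can be seen directly in plain numpy. This is an illustrative snippet, not code from the PR:

```python
import numpy as np

ints = np.array([1, 2, 3])  # integer dtype
# There is no integer NaN, so introducing a missing value
# forces a cast to a floating-point dtype.
filled = np.where([False, True, False], np.nan, ints)
print(filled.dtype)  # float64 -- the original integer dtype is lost
```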
This solves #742 and partially #835 (it does not address a `combine_first` option yet). It essentially adds a `notnull_equals` method to `Variable` (and currently `DataArray`/`Dataset`) that `merge` can use to compare objects, subsequently combining them with `fillna`. Used as such:
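The usage snippet did not survive this capture. As a stand-in, the `no_conflicts` semantics can be sketched in plain numpy with hypothetical arrays; in xarray this corresponds to `xr.merge([ds1, ds2], compat='no_conflicts')`:

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])
y = np.array([1.0, 2.0, np.nan])

# compatible: values only differ where at least one side is null
assert ((x == y) | np.isnan(x) | np.isnan(y)).all()

# fillna-style combination: take x where present, otherwise y
merged = np.where(np.isnan(x), y, x)
```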
It should be very easy to add other merging options, such as overwriting left to right and vice versa.
TODO

- ... `equals` ... and might not be necessary (e.g. for numpy structured arrays).

ISSUES

Currently all tests are passing (6 new), and no flake8 errors on the diff...