
add 'no_conflicts' as compat option for merging non-conflicting data #996

Merged
merged 18 commits on Sep 15, 2016

Conversation

@jcmgray (Contributor) commented Aug 31, 2016

This solves #742 and partially addresses #835 (it does not add a combine_first option yet). It essentially adds a notnull_equals method to Variable (and currently DataArray/Dataset) that merge can use to compare objects, subsequently combining them with 'fillna'. Used as follows:

>>> import xarray as xr
>>> ds1 = xr.Dataset(data_vars={'x': ('a', [1, 2])})
>>> ds2 = xr.Dataset(data_vars={'x': ('a', [2, 3])}, coords={'a': [1, 2]})
>>> xr.merge([ds1, ds2], compat='notnull_equals')
<xarray.Dataset>
Dimensions:  (a: 3)
Coordinates:
  * a        (a) int64 0 1 2
Data variables:
    x        (a) float64 1.0 2.0 3.0
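
For context, the combining step itself is just fillna between aligned datasets; a minimal sketch of that mechanism in isolation (illustrative data only, with explicit coordinates so alignment is unambiguous):

>>> import numpy as np
>>> ds_a = xr.Dataset({'x': ('a', [1.0, np.nan])}, coords={'a': [0, 1]})
>>> ds_b = xr.Dataset({'x': ('a', [np.nan, 2.0])}, coords={'a': [0, 1]})
>>> ds_a.fillna(ds_b)  # missing values in ds_a are filled from ds_b, so x becomes [1.0, 2.0]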

It should be very easy to add other merging options, such as overwriting left to right and vice versa.

TODO

  • docs
  • tests for various type combinations
  • some of the error handling (catching TypeError) is mirrored from equals and might not be necessary (e.g. for numpy structured arrays).

ISSUES

  • It seemed natural to add notnull_equals as a method to Dataset, but it might be unnecessary/not useful. Current version is conservative in that it still requires aligned Datasets with all the same variables.
  • Due to the float nature of NaN, type is sometimes not preserved, e.g. merging two int arrays yields a float array, even when the final array has no NaN values itself.
  • float rounding errors can cause unintended failures, so a notnull_allclose option might be nice.
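
For illustration, a hypothetical allclose-style element check might look roughly like this (names and signature are made up, and nothing here is implemented in this PR; it assumes float arrays):

import numpy as np

def array_notnull_allclose(arr1, arr2, rtol=1e-05, atol=1e-08):
    # values are treated as compatible if they are close, or missing on either side
    flag_array = np.isclose(arr1, arr2, rtol=rtol, atol=atol)
    flag_array |= np.isnan(arr1) | np.isnan(arr2)
    return bool(flag_array.all())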

Currently all tests passing (6 new), and no flake8 errors on the diff...

@@ -352,6 +352,43 @@ def test_broadcast_equals(self):
self.assertFalse(a.broadcast_equals(c))
self.assertFalse(c.broadcast_equals(a))

def test_notnull_equals(self):
Member

Rather than writing such exhaustive test cases for each xarray class, consider writing comprehensive unit tests just for array_notnull_equiv.
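
For instance, a couple of direct cases (a hypothetical pytest-style test; the import path is an assumption):

import numpy as np
from xarray.core.ops import array_notnull_equiv  # module path is an assumption

def test_array_notnull_equiv():
    # overlapping values agree, and NaN never counts as a conflict
    assert array_notnull_equiv(np.array([1.0, np.nan]), np.array([1.0, 2.0]))
    # a genuine disagreement makes the check fail
    assert not array_notnull_equiv(np.array([1.0, 2.0]), np.array([1.0, 3.0]))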

Contributor Author


Yes, I'll move some of the redundant tests to do with different types of overlap etc. into test_ops.

@shoyer (Member) commented Sep 1, 2016

This is quite clever! I'm impressed this was so easy using fillna.

One potential concern here is that performance is not going to be so great if you attempt to combine a bunch of variables with lazy data loaded with dask, because each comparison will trigger a separate computation. To that end, it would be nice to do the safety check in a single dask operation.

To be fair, this is already an issue if you're trying to merge lots of datasets together. But I expect this will become more of an issue when there are useful ways to merge a bunch of datasets, which is what this PR enables.
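
A rough sketch of how the checks could be batched into one computation (an illustration only, assuming dask-backed float arrays; not code from this PR):

import dask
import dask.array as da
import numpy as np

def lazy_no_conflicts(arr1, arr2):
    # lazily True only where values agree or either side is missing
    ok = (arr1 == arr2) | da.isnan(arr1) | da.isnan(arr2)
    return ok.all()

x1 = da.from_array(np.array([1.0, np.nan, 3.0]), chunks=2)
x2 = da.from_array(np.array([1.0, 2.0, 3.0]), chunks=2)

# a single dask.compute call evaluates every pending comparison in one pass
(result,) = dask.compute(lazy_no_conflicts(x1, x2))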


flag_array = (arr1 == arr2)

# GH837, GH861
Member


I'm not sure why we missed this before, but these comments (and the ignoring of TypeError) should probably go in the isnull function, rather than being repeated here.

Contributor Author


Yes, it would be good not to repeat it. However, if the handling is put in isnull, then TypeError might be suppressed in cases where it is actually useful. I think catching the error can only work in situations (such as these two functions) where an array is being modified by the result and the whole operation can be skipped if it fails. But maybe there is another way to do it?

The only (quite ugly) way I can think of is something like this (leaving isnull as-is):

def isnull_with_no_type_fail(a):
    # comments about why needed etc.
    try:
        b = isnull(a)
    except TypeError:
        b = None
    return b

and then in array_equiv / array_notnull_equiv:

...
arr1_isnull = isnull_with_no_type_fail(arr1)
if arr1_isnull is not None:
    arr2_isnull = isnull_with_no_type_fail(arr2)
    if arr2_isnull is not None:
        flag_array |= (arr1_isnull | arr2_isnull)
...

Member


How about this version?

def isnull(a):
    try:
        b = pd.isnull(a)
    except TypeError:
        # fall back to "no values are null" for dtypes pandas cannot handle
        b = np.zeros(a.shape, bool)
    return b

The TypeError should be pretty rare, so it's OK to do a little extra work in that case.
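
With the TypeError handled inside isnull like that, the element-wise helper can stay short, roughly along these lines (a sketch of the shape it could take, not necessarily the merged code):

def array_notnull_equiv(arr1, arr2):
    """True if all elements are equal, or missing in either array."""
    flag_array = (arr1 == arr2)
    flag_array |= isnull(arr1)
    flag_array |= isnull(arr2)
    return bool(flag_array.all())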

@shoyer (Member) commented Sep 1, 2016

Current version is conservative in that it still requires aligned Datasets with all the same variables.

👍 it's much less obvious to treat missing variables as "missing data."

Due to the float nature of NaN, type is sometimes not preserved, e.g. merging two int arrays yields a float array, even when the final array has no NaN values itself.

Yes, this is consistent with the xarray/pandas type system. (Which is not ideal, but that's another bigger issue.)

@shoyer (Member) commented Sep 1, 2016

compat='no_conflicts' is possibly a better keyword argument.

@shoyer (Member) commented Sep 3, 2016

One of the commits in this PR is not yours. Take a look here -- you want to rebase instead of merge (yes git is confusing):
http://stackoverflow.com/questions/5968964/avoid-unwanted-merge-commits-and-other-commits-when-doing-pull-request-on-github

I think you can still safely rebase now.

@jcmgray (Contributor Author) commented Sep 4, 2016

Ah, sorry - yes, I rebased locally and then mistakenly merged the remote fork...

compat='no_conflicts' is possibly a better keyword argument.

Yes, I thought that might be better too. The advantages of 'notnull_equals' (a slightly more precise description of the operation, and slightly better 'grammar' when used as a method: if ds1.notnull_equals(ds2): ... vs if ds1.no_conflicts(ds2): ...) are probably negligible, since 'no_conflicts' is more intuitive for merge, which is likely the main usage.

@jcmgray (Contributor Author) commented Sep 4, 2016

One potential concern here is that performance is not going to be so great if you attempt to combine a bunch of variables with lazy data loaded with dask, because each comparison will trigger a separate computation. To that end, it would be nice to do the safety check in a single dask operation.

I will have a look into how to do this, but am not that familiar with dask.

@jcmgray force-pushed the master branch 2 times, most recently from c3ad9c7 to ae90fa1 (September 7, 2016, 22:20)
@shoyer (Member) commented Sep 13, 2016

I think this is nearly ready. We just need to decide on the final keyword name (e.g., notnull_equals or no_conflicts) and add mention of this in the documentation / what's new.

I'm not sure we really need this new method for Dataset/DataArray (for the current implementation, adding it on Variable would be enough). Given that the user can catch MergeError, it seems straightforward enough to just call merge to see whether merging is safe.
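
For example, a caller could simply attempt the merge and catch the error (a sketch; the import location of MergeError is an assumption about where it lived at the time):

import xarray as xr
from xarray.core.merge import MergeError  # import path is an assumption

def can_merge_without_conflicts(*datasets):
    # attempt the merge and treat a MergeError as "conflicting data"
    try:
        xr.merge(datasets, compat='no_conflicts')
        return True
    except MergeError:
        return False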

@jcmgray changed the title from "add 'notnull_equals' as compat option for merging non-conflicting data" to "add 'no_conflicts' as compat option for merging non-conflicting data" on Sep 14, 2016
@jcmgray (Contributor Author) commented Sep 14, 2016

OK, I have stripped the Dataset/DataArray methods, which I agree were largely redundant. Since that sets this type of comparison/merge slightly apart, 'no_conflicts' seems the more intuitive wording when it is used only as a compat option, so I've changed it to that.

And I've done a first pass at updating the docs.

@shoyer (Member) left a comment

This looks great -- thanks!

xr.merge([ds1, ds2], compat='no_conflicts')

Note that due to the underlying representation of missing values as floating
point numbers (``NaN``) or generic objects (``None``), variable data type is not always preserved.
@shoyer (Member), Sep 15, 2016

I deleted

or generic objects (``None``)

We only use NaN (like pandas)
