-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
combine_first by using apply_ufunc in ops.fillna #1204
Changes from 6 commits
8c6a9ab
2fdd135
c190f84
02a4a14
f0c0866
b172eda
6e65c8b
67a599f
8c46c51
3c59009
a72e2a1
8bf856e
56c752a
4bf4efe
367e3ca
793552e
f5ebf78
998cd71
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,6 +13,7 @@ Combining data | |
|
||
* For combining datasets or data arrays along a dimension, see concatenate_. | ||
* For combining datasets with different variables, see merge_. | ||
# For combining datasets or data arrays with outer-join alignment, see combine_. | ||
|
||
.. _concatenate: | ||
|
||
|
@@ -116,6 +117,38 @@ used in the :py:class:`~xarray.Dataset` constructor: | |
|
||
xr.Dataset({'a': arr[:-1], 'b': arr[1:]}) | ||
|
||
.. _combine: | ||
|
||
Combine | ||
~~~~~~~ | ||
|
||
The instance method ``combine_first`` combines two datasets/data arrays and | ||
defaults to non-null values in the calling object, using values from the called | ||
object to fill holes. The resulting coordinates are the union of coordinate labels. | ||
Vacant cells as a result of the outer-join are filled with nan. | ||
|
||
Mimics the behavior of ``pandas.Dataframe.combine_first`` | ||
|
||
For data array, | ||
|
||
.. ipython:: python | ||
|
||
ar0 = DataArray([[0, 0], [0, 0]], [('x', ['a', 'b']), ('y', [-1, 0])]) | ||
ar1 = DataArray([[1, 1], [1, 1]], [('x', ['b', 'c']), ('y', [0, 1])]) | ||
ar2 = DataArray([2], [('x', ['d'])]) | ||
ar0.combine_first(ar1) | ||
ar1.combine_first(ar0) | ||
ar0.combine_first(ar2) | ||
|
||
For datasets, ``ds0.combine_first(ds1)`` works just like ``xr.merge([ds0, ds1])`` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Again, this isn't quite true -- it succeeds in cases where |
||
|
||
.. ipython:: python | ||
|
||
dsx0 = DataArray([0, 0], [('x', ['a', 'b'])]).to_dataset(name='dsx0') | ||
dsx1 = DataArray([1, 1], [('x', ['b', 'c'])]).to_dataset(name='dsx1') | ||
dsx0.combine_first(dsx1) | ||
xr.merge([dsx1, dsx0]) | ||
|
||
.. _update: | ||
|
||
Update | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -106,6 +106,13 @@ Deprecations | |
|
||
Enhancements | ||
~~~~~~~~~~~~ | ||
|
||
- Added the xarray equivalent of `pandas.Dataframe.combine_first` as an instance | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. just a note -- this will need to go in a new section for 0.9.1, since 0.9.0 will probably be released first. |
||
method to DataArray/Dataset objects, facilitated by the new `ops.fillna` that | ||
uses `apply_ufunc` for different `join` options. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would drop implementation details like the fact that this uses |
||
(see :ref:`combine`) | ||
By `Chun-Wei Yuan <https://github.com/chunweiyuan>`_. | ||
|
||
- Added the ability to change default automatic alignment (arithmetic_join="inner") | ||
for binary operations via :py:func:`~xarray.set_options()` | ||
(see :ref:`automatic alignment`). | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1075,7 +1075,7 @@ def dropna(self, dim, how='any', thresh=None): | |
ds = self._to_temp_dataset().dropna(dim, how=how, thresh=thresh) | ||
return self._from_temp_dataset(ds) | ||
|
||
def fillna(self, value): | ||
def fillna(self, value, join="left"): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. add |
||
"""Fill missing values in this object. | ||
|
||
This operation follows the normal broadcasting and alignment rules that | ||
|
@@ -1097,10 +1097,28 @@ def fillna(self, value): | |
if utils.is_dict_like(value): | ||
raise TypeError('cannot provide fill value as a dictionary with ' | ||
'fillna on a DataArray') | ||
out = self._fillna(value) | ||
out = ops.fillna(self, value, join=join) | ||
out.attrs = self.attrs | ||
return out | ||
|
||
def combine_first(self, other): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure it's worth making a separate method here to save a few characters. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think either we have There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The fact that |
||
"""Combine two DataArray objects, with union of coordinates. | ||
|
||
This operation follows the normal broadcasting and alignment rules of | ||
``join='outer'``. Default to non-null values of array calling the | ||
method. Use np.nan to fill in vacant cells after alignment. | ||
|
||
Parameters | ||
---------- | ||
other : DataArray | ||
Used to fill all matching missing values in this array. | ||
|
||
Returns | ||
------- | ||
DataArray | ||
""" | ||
return self.fillna(other, join="outer") | ||
|
||
def reduce(self, func, dim=None, axis=None, keep_attrs=False, **kwargs): | ||
"""Reduce this array by applying `func` along some dimension(s). | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1955,7 +1955,35 @@ def fillna(self, value): | |
------- | ||
Dataset | ||
""" | ||
out = self._fillna(value) | ||
if utils.is_dict_like(value): | ||
value_keys = value.data_vars.keys() if isinstance(value, Dataset)\ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. maybe slightly cleaner to use duck typing, e.g., |
||
else value.keys() | ||
if not set(value_keys) <= set(self.data_vars.keys()): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we will need to separate the
The later should probably be renamed to something like |
||
raise ValueError('all variables in the argument to `fillna` ' | ||
'must be contained in the original dataset') | ||
out = ops.fillna(self, value, join="left") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
out._copy_attrs_from(self) | ||
return out | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This would be a good time to remove all the |
||
|
||
def combine_first(self, other): | ||
"""Combine two Datasets, default to data_vars of self. | ||
|
||
The new coordinates follow the normal broadcasting and alignment rules | ||
of ``join='outer'``. Vacant cells in the expanded coordinates are | ||
filled with np.nan. | ||
|
||
Renders the same result as xr.merge([self, other]). | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is not quite right -- |
||
|
||
Parameters | ||
---------- | ||
other : DataArray | ||
Used to fill all matching missing values in this array. | ||
|
||
Returns | ||
------- | ||
DataArray | ||
""" | ||
out = ops.fillna(self, other, join="outer") | ||
out._copy_attrs_from(self) | ||
return out | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -382,7 +382,15 @@ def fillna(self, value): | |
Dataset.fillna | ||
DataArray.fillna | ||
""" | ||
return self._fillna(value) | ||
def _yield_applied(this, other): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So, if I replace the whole method with just There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. One solution is perhaps to add something into There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I like this -- it would be useful for other functions in Or if you don't want to bother with that, could just use
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I copped out and took your simple approach. It's mainly because I'm not sure how to deal with more than 2 objects in |
||
"""apply fillna to each individual groupby ds""" | ||
for group_value, obj in this: | ||
other_sel = other.sel(**{this._group.name: group_value}) | ||
yield obj.fillna(other_sel) | ||
|
||
datasets = _yield_applied(self, value) | ||
combined = self._combine(datasets) | ||
return combined | ||
|
||
def where(self, cond): | ||
"""Return an object of the same shape with all entries where cond is | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -279,11 +279,18 @@ def count(data, axis=None): | |
return sum(~isnull(data), axis=axis) | ||
|
||
|
||
def fillna(data, other): | ||
def fillna(data, other, join="left"): | ||
"""Fill missing values in this object with data from the other object. | ||
Follows normal broadcasting and alignment rules. | ||
""" | ||
return where(isnull(data), other, data) | ||
from .computation import apply_ufunc | ||
|
||
def _fillna(data, other): | ||
left, right = np.broadcast_arrays(data, other) | ||
result = left.copy() # view must be copied before being written | ||
result[isnull(result)] = right[isnull(result)] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. An advantage of using |
||
return result | ||
return apply_ufunc(_fillna, data, other, join=join) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. should have There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It might be helpful to separate the
For |
||
|
||
|
||
def where_method(data, cond, other=np.nan): | ||
|
@@ -445,11 +452,6 @@ def inject_binary_ops(cls, inplace=False): | |
for name, f in [('eq', array_eq), ('ne', array_ne)]: | ||
setattr(cls, op_str(name), cls._binary_op(f)) | ||
|
||
# patch in fillna | ||
f = _func_slash_method_wrapper(fillna) | ||
method = cls._binary_op(f, join='left', fillna=True) | ||
setattr(cls, '_fillna', method) | ||
|
||
# patch in where | ||
f = _func_slash_method_wrapper(where_method, 'where') | ||
setattr(cls, '_where', cls._binary_op(f)) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2370,6 +2370,27 @@ def test_binary_op_join_setting(self): | |
expected = xr.DataArray([np.nan, 2, 4, np.nan], [(dim, [0, 1, 2, 3])]) | ||
self.assertDataArrayEqual(actual, expected) | ||
|
||
def test_combine_first(self): | ||
ar0 = DataArray([[0, 0], [0, 0]], [('x', ['a', 'b']), ('y', [-1, 0])]) | ||
ar1 = DataArray([[1, 1], [1, 1]], [('x', ['b', 'c']), ('y', [0, 1])]) | ||
ar2 = DataArray([2], [('x', ['d'])]) | ||
|
||
actual = ar0.combine_first(ar1) | ||
expected = DataArray([[0, 0, np.nan], [0, 0, 1], [np.nan, 1, 1]], | ||
[('x', ['a', 'b', 'c']), ('y', [-1, 0, 1])]) | ||
self.assertDataArrayEqual(actual, expected) | ||
|
||
actual = ar1.combine_first(ar0) | ||
expected = DataArray([[0, 0, np.nan], [0, 1, 1], [np.nan, 1, 1]], | ||
[('x', ['a', 'b', 'c']), ('y', [-1, 0, 1])]) | ||
self.assertDataArrayEqual(actual, expected) | ||
|
||
actual = ar0.combine_first(ar2) | ||
expected = DataArray([[0, 0], [0, 0], [2, 2]], | ||
[('x', ['a', 'b', 'd']), ('y', [-1, 0])]) | ||
self.assertDataArrayEqual(actual, expected) | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: two line breaks should separate code blocks, not three |
||
|
||
|
||
@pytest.fixture(params=[1]) | ||
def da(request): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3132,17 +3132,17 @@ def test_filter_by_attrs(self): | |
|
||
def test_binary_op_join_setting(self): | ||
# arithmetic_join applies to data array coordinates | ||
missing_2 = xr.Dataset({'x':[0, 1]}) | ||
missing_0 = xr.Dataset({'x':[1, 2]}) | ||
missing_2 = xr.Dataset({'x': [0, 1]}) | ||
missing_0 = xr.Dataset({'x': [1, 2]}) | ||
with xr.set_options(arithmetic_join='outer'): | ||
actual = missing_2 + missing_0 | ||
expected = xr.Dataset({'x':[0, 1, 2]}) | ||
expected = xr.Dataset({'x': [0, 1, 2]}) | ||
self.assertDatasetEqual(actual, expected) | ||
|
||
# arithmetic join also applies to data_vars | ||
ds1 = xr.Dataset({'foo': 1, 'bar': 2}) | ||
ds2 = xr.Dataset({'bar': 2, 'baz': 3}) | ||
expected = xr.Dataset({'bar': 4}) # default is inner joining | ||
expected = xr.Dataset({'bar': 4}) # default is inner joining | ||
actual = ds1 + ds2 | ||
self.assertDatasetEqual(actual, expected) | ||
|
||
|
@@ -3165,7 +3165,7 @@ def test_full_like(self): | |
# For more thorough tests, see test_variable.py | ||
# Note: testing data_vars with mismatched dtypes | ||
ds = Dataset({ | ||
'd1': DataArray([1,2,3], dims=['x'], coords={'x': [10,20,30]}), | ||
'd1': DataArray([1,2,3], dims=['x'], coords={'x': [10, 20, 30]}), | ||
'd2': DataArray([1.1, 2.2, 3.3], dims=['y']) | ||
}, attrs={'foo': 'bar'}) | ||
actual = full_like(ds, 2) | ||
|
@@ -3186,6 +3186,23 @@ def test_full_like(self): | |
self.assertEqual(expect['d2'].dtype, bool) | ||
self.assertDatasetIdentical(expect, actual) | ||
|
||
def test_combine_first(self): # works just like xr.merge([self, other]) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would move this comment below, to just above the code block where you compare to the result of using |
||
dsx0 = DataArray([0, 0], [('x', ['a', 'b'])]).to_dataset(name='dsx0') | ||
dsx1 = DataArray([1, 1], [('x', ['b', 'c'])]).to_dataset(name='dsx1') | ||
|
||
actual = dsx0.combine_first(dsx1) | ||
expected = Dataset({'dsx0': ('x', [0, 0, np.nan]), | ||
'dsx1': ('x', [np.nan, 1, 1])}, | ||
coords={'x': ['a', 'b', 'c']}) | ||
self.assertDatasetEqual(actual, expected) | ||
self.assertDatasetEqual(actual, xr.merge([dsx0, dsx1])) | ||
|
||
dsy2 = DataArray([2, 2, 2], [('x', ['b', 'c', 'd'])]).\ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Per PEP8, please avoid writing |
||
to_dataset(name='dsy2') | ||
actual = dsx0.combine_first(dsy2) | ||
expected = xr.merge([dsy2, dsx0]) | ||
self.assertDatasetEqual(actual, expected) | ||
|
||
### Py.test tests | ||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe:
For combining datasets with different coordinates or missing values