
combine_first by using apply_ufunc in ops.fillna #1204

Merged (18 commits, Jan 23, 2017)
Changes from 6 commits
33 changes: 33 additions & 0 deletions doc/combining.rst
@@ -13,6 +13,7 @@ Combining data

* For combining datasets or data arrays along a dimension, see concatenate_.
* For combining datasets with different variables, see merge_.
* For combining datasets or data arrays with outer-join alignment, see combine_.
Member:

Maybe: For combining datasets with different coordinates or missing values


.. _concatenate:

@@ -116,6 +117,38 @@ used in the :py:class:`~xarray.Dataset` constructor:

xr.Dataset({'a': arr[:-1], 'b': arr[1:]})

.. _combine:

Combine
~~~~~~~

The instance method ``combine_first`` combines two datasets/data arrays,
preferring the non-null values of the calling object and filling holes with
values from the passed object. The resulting coordinates are the union of the
two sets of coordinate labels; cells left vacant by the outer join are filled
with ``NaN``.

This mimics the behavior of ``pandas.DataFrame.combine_first``.

For data arrays:

.. ipython:: python

ar0 = xr.DataArray([[0, 0], [0, 0]], [('x', ['a', 'b']), ('y', [-1, 0])])
ar1 = xr.DataArray([[1, 1], [1, 1]], [('x', ['b', 'c']), ('y', [0, 1])])
ar2 = xr.DataArray([2], [('x', ['d'])])
ar0.combine_first(ar1)
ar1.combine_first(ar0)
ar0.combine_first(ar2)

For datasets, ``ds0.combine_first(ds1)`` works just like ``xr.merge([ds0, ds1])``
Member:

Again, this isn't quite true -- it succeeds in cases where merge will raise an error. It would be nice to highlight the difference here.
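
A minimal sketch of the difference this comment points out (the shared variable name 'a' is chosen here purely for illustration): when both datasets define the same variable with conflicting non-null values, ``combine_first`` keeps the caller's values, while ``xr.merge`` raises a merge error.

import xarray as xr

# both datasets define variable 'a', with conflicting values at x='b'
ds0 = xr.DataArray([0, 0], [('x', ['a', 'b'])]).to_dataset(name='a')
ds1 = xr.DataArray([1, 1], [('x', ['b', 'c'])]).to_dataset(name='a')

ds0.combine_first(ds1)   # 'a' -> [0, 0, 1] on x=['a', 'b', 'c']; the caller wins at 'b'
xr.merge([ds0, ds1])     # raises a merge error because the values at x='b' conflict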


.. ipython:: python

dsx0 = xr.DataArray([0, 0], [('x', ['a', 'b'])]).to_dataset(name='dsx0')
dsx1 = xr.DataArray([1, 1], [('x', ['b', 'c'])]).to_dataset(name='dsx1')
dsx0.combine_first(dsx1)
xr.merge([dsx1, dsx0])

.. _update:

Update
7 changes: 7 additions & 0 deletions doc/whats-new.rst
@@ -106,6 +106,13 @@ Deprecations

Enhancements
~~~~~~~~~~~~

- Added the xarray equivalent of `pandas.DataFrame.combine_first` as an instance
Member:

just a note -- this will need to go in a new section for 0.9.1, since 0.9.0 will probably be released first.

method to DataArray/Dataset objects, facilitated by the new `ops.fillna`,
which uses `apply_ufunc` to support different `join` options.
Member:

I would drop implementation details like the fact that this uses apply_ufunc from the user facing release notes.

(see :ref:`combine`)
By `Chun-Wei Yuan <https://github.com/chunweiyuan>`_.

- Added the ability to change default automatic alignment (arithmetic_join="inner")
for binary operations via :py:func:`~xarray.set_options()`
(see :ref:`automatic alignment`).
1 change: 1 addition & 0 deletions xarray/core/computation.py
@@ -343,6 +343,7 @@ def apply_dataset_ufunc(func, *args, **kwargs):
list_of_coords = build_output_coords(args, signature, exclude_dims)

args = [getattr(arg, 'data_vars', arg) for arg in args]

result_vars = apply_dict_of_variables_ufunc(
func, *args, signature=signature, join=join, fill_value=fill_value)

22 changes: 20 additions & 2 deletions xarray/core/dataarray.py
@@ -1075,7 +1075,7 @@ def dropna(self, dim, how='any', thresh=None):
ds = self._to_temp_dataset().dropna(dim, how=how, thresh=thresh)
return self._from_temp_dataset(ds)

def fillna(self, value):
def fillna(self, value, join="left"):
Member:

add join to the docstring
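
A sketch of what the docstring entry might look like (numpydoc style; the listed options mirror xarray's align joins and are an assumption here):

join : {'outer', 'inner', 'left', 'right'}, optional
    Method used to join the indexes of this object and ``value`` before
    filling. Defaults to ``'left'``, i.e. keep this object's index.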

"""Fill missing values in this object.

This operation follows the normal broadcasting and alignment rules that
@@ -1097,10 +1097,28 @@ def fillna(value):
if utils.is_dict_like(value):
raise TypeError('cannot provide fill value as a dictionary with '
'fillna on a DataArray')
out = self._fillna(value)
out = ops.fillna(self, value, join=join)
out.attrs = self.attrs
return out

def combine_first(self, other):
Member:

I'm not sure it's worth making a separate method here to save a few characters.

Contributor Author:

I think either we have combine_first for both Dataset and DataArray, or we lump everything into fillna. The use of combine_first is just that there's this concrete direct analogy from pandas. What's the decision here?

Member:

The fact that data_vars_join could sensibly differ for combine_first (as you have implemented it) suggests that a second method is reasonable. Consistency with pandas is also a nice factor. So I guess I've come around!
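
For reference, a quick sketch of the pandas behavior being mirrored (values made up):

import numpy as np
import pandas as pd

s0 = pd.Series([0.0, np.nan], index=['a', 'b'])
s1 = pd.Series([1.0, 1.0], index=['b', 'c'])
# the caller's non-null values win; the result index is the union of both
s0.combine_first(s1)   # a: 0.0, b: 1.0, c: 1.0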

"""Combine two DataArray objects, with union of coordinates.

This operation follows the normal broadcasting and alignment rules of
``join='outer'``. It defaults to the non-null values of the array calling
the method; vacant cells left after alignment are filled with np.nan.

Parameters
----------
other : DataArray
Used to fill all matching missing values in this array.

Returns
-------
DataArray
"""
return self.fillna(other, join="outer")

def reduce(self, func, dim=None, axis=None, keep_attrs=False, **kwargs):
"""Reduce this array by applying `func` along some dimension(s).

30 changes: 29 additions & 1 deletion xarray/core/dataset.py
@@ -1955,7 +1955,35 @@ def fillna(self, value):
-------
Dataset
"""
out = self._fillna(value)
if utils.is_dict_like(value):
value_keys = value.data_vars.keys() if isinstance(value, Dataset)\
Member:

maybe slightly cleaner to use duck typing, e.g., value_keys = getattr(value, 'data_vars', value).keys()

else value.keys()
if not set(value_keys) <= set(self.data_vars.keys()):
Member:

I think we will need to separate the join argument to apply_ufunc into two parts:

  1. The join used for aligning indexes.
  2. The join used for aligning data variables between datasets.

The latter should probably be renamed to something like data_vars_join
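
A small sketch of the two kinds of alignment being distinguished, using a plain binary op (the data is made up for illustration):

import xarray as xr

ds1 = xr.Dataset({'foo': ('x', [1, 2]), 'bar': ('x', [1, 2])}, coords={'x': [0, 1]})
ds2 = xr.Dataset({'bar': ('x', [10, 20]), 'baz': ('x', [10, 20])}, coords={'x': [1, 2]})

# index alignment decides which 'x' labels survive; data-variable alignment
# decides which of foo/bar/baz the result keeps. With the defaults both are
# inner joins, so only 'bar' at the shared label x=1 remains:
ds1 + ds2   # Dataset with bar = [12] on x=[1]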

raise ValueError('all variables in the argument to `fillna` '
'must be contained in the original dataset')
out = ops.fillna(self, value, join="left")
Member:

join should be a parameter for Dataset.fillna, too.
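
A sketch of how the method might read with that change (and the duck-typing suggestion above) applied; it is not standalone and reuses the helpers already visible in this diff:

def fillna(self, value, join="left"):
    """Fill missing values in this object; ``join`` controls index alignment."""
    if utils.is_dict_like(value):
        value_keys = getattr(value, 'data_vars', value).keys()
        if not set(value_keys) <= set(self.data_vars.keys()):
            raise ValueError('all variables in the argument to `fillna` '
                             'must be contained in the original dataset')
    out = ops.fillna(self, value, join=join)
    out._copy_attrs_from(self)
    return out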

out._copy_attrs_from(self)
return out
Member:

This would be a good time to remove all the fillna specific logic from Dataset._calculate_binary_op that is no longer used


def combine_first(self, other):
"""Combine two Datasets, default to data_vars of self.

The new coordinates follow the normal broadcasting and alignment rules
of ``join='outer'``. Vacant cells in the expanded coordinates are
filled with np.nan.

Renders the same result as xr.merge([self, other]).
Member:

This is not quite right -- combine_first/fillna are OK with conflicting values in self and other, but merge will raise an error.


Parameters
----------
other : Dataset
Used to fill all matching missing values in this dataset.

Returns
-------
Dataset
"""
out = ops.fillna(self, other, join="outer")
out._copy_attrs_from(self)
return out

10 changes: 9 additions & 1 deletion xarray/core/groupby.py
@@ -382,7 +382,15 @@ def fillna(self, value):
Dataset.fillna
DataArray.fillna
"""
return self._fillna(value)
def _yield_applied(this, other):
Member:

I think apply_ufunc should even work on groupby objects, so you shouldn't need this helper function. I would be interested to know I'm wrong about that, though.

Contributor Author:

So, if I replace the whole method with just return apply_ufunc(ops.fillna, self, value), I'd fail line 2507 of test_dataset.py, where self.assertEqual(actual.attrs, ds.attrs) fails, because the attrs aren't copied over.

Contributor Author:

One solution is perhaps to add something into apply_groupby_ufunc to make sure the attrs are copied over, similar to what's in _copy_attrs_from in dataset.py/dataarray.py.

Member:

> One solution is perhaps to add something into apply_groupby_ufunc to make sure the attrs are copied over, similar to what's in _copy_attrs_from in dataset.py/dataarray.py.

I like this -- it would be useful for other functions in apply_ufunc, too. For consistency, I would call the new argument keep_attrs.

Or if you don't want to bother with that, could just use _copy_attrs_from in this method:

out = ops.fillna(self, other)
out._copy_attrs_from(self._obj)
return out

Contributor Author:

I copped out and took your simple approach. It's mainly because I'm not sure how to deal with more than 2 objects in apply_ufunc when keep_attrs=True. Do I only take the first object's attributes? Maybe best left for another PR.
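
For reference, a small usage sketch of the group-wise fill this method implements (the data and group labels here are made up):

import numpy as np
import xarray as xr

ds = xr.Dataset({'a': ('x', [np.nan, 1.0, np.nan, 3.0])},
                coords={'x': [0, 1, 2, 3], 'g': ('x', ['p', 'p', 'q', 'q'])})
fill = xr.DataArray([10.0, 20.0], coords={'g': ['p', 'q']}, dims='g')

# each group's NaNs are filled from the value selected by its group label
ds.groupby('g').fillna(fill)   # a -> [10.0, 1.0, 20.0, 3.0]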

"""apply fillna to each individual groupby ds"""
for group_value, obj in this:
other_sel = other.sel(**{this._group.name: group_value})
yield obj.fillna(other_sel)

datasets = _yield_applied(self, value)
combined = self._combine(datasets)
return combined

def where(self, cond):
"""Return an object of the same shape with all entries where cond is
16 changes: 9 additions & 7 deletions xarray/core/ops.py
@@ -279,11 +279,18 @@ def count(data, axis=None):
return sum(~isnull(data), axis=axis)


def fillna(data, other):
def fillna(data, other, join="left"):
"""Fill missing values in this object with data from the other object.
Follows normal broadcasting and alignment rules.
"""
return where(isnull(data), other, data)
from .computation import apply_ufunc

def _fillna(data, other):
left, right = np.broadcast_arrays(data, other)
result = left.copy() # view must be copied before being written
result[isnull(result)] = right[isnull(result)]
Member:

An advantage of using where(isnull(data), other, data) instead of all the logic in this function is that it works on dask arrays, too.

return result
return apply_ufunc(_fillna, data, other, join=join)
Member:

should have dask_array='allowed'

Member:

It might be helpful to separate the join argument to apply_ufunc into two parts:

  1. The join used for aligning indexes.
  2. The join used for aligning data variables between datasets.

For fillna, I think we would always want data_vars_join='left', though I guess it's equivalent to data_vars_join='left' when we constrain value not to contain new keys.
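
A standalone sketch of the where-based kernel suggested above (fillna_kernel is an illustrative name; the module version would use ops.where/ops.isnull, which also dispatch to dask arrays, rather than the NumPy calls used here):

import numpy as np

def fillna_kernel(data, other):
    # take `other` where `data` is null, otherwise keep `data`; there is no
    # in-place assignment, so the same expression also works lazily on dask
    # arrays when spelled with ops.where/ops.isnull
    return np.where(np.isnan(data), other, data)

fillna_kernel(np.array([1.0, np.nan, 3.0]), np.array([9.0, 9.0, 9.0]))
# -> array([1., 9., 3.])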



def where_method(data, cond, other=np.nan):
@@ -445,11 +452,6 @@ def inject_binary_ops(cls, inplace=False):
for name, f in [('eq', array_eq), ('ne', array_ne)]:
setattr(cls, op_str(name), cls._binary_op(f))

# patch in fillna
f = _func_slash_method_wrapper(fillna)
method = cls._binary_op(f, join='left', fillna=True)
setattr(cls, '_fillna', method)

# patch in where
f = _func_slash_method_wrapper(where_method, 'where')
setattr(cls, '_where', cls._binary_op(f))
2 changes: 1 addition & 1 deletion xarray/core/variable.py
@@ -855,7 +855,7 @@ def unstack(self, **dimensions):
return result

def fillna(self, value):
return self._fillna(value)
return ops.fillna(self, value)

def where(self, cond):
return self._where(cond)
21 changes: 21 additions & 0 deletions xarray/test/test_dataarray.py
@@ -2370,6 +2370,27 @@ def test_binary_op_join_setting(self):
expected = xr.DataArray([np.nan, 2, 4, np.nan], [(dim, [0, 1, 2, 3])])
self.assertDataArrayEqual(actual, expected)

def test_combine_first(self):
ar0 = DataArray([[0, 0], [0, 0]], [('x', ['a', 'b']), ('y', [-1, 0])])
ar1 = DataArray([[1, 1], [1, 1]], [('x', ['b', 'c']), ('y', [0, 1])])
ar2 = DataArray([2], [('x', ['d'])])

actual = ar0.combine_first(ar1)
expected = DataArray([[0, 0, np.nan], [0, 0, 1], [np.nan, 1, 1]],
[('x', ['a', 'b', 'c']), ('y', [-1, 0, 1])])
self.assertDataArrayEqual(actual, expected)

actual = ar1.combine_first(ar0)
expected = DataArray([[0, 0, np.nan], [0, 1, 1], [np.nan, 1, 1]],
[('x', ['a', 'b', 'c']), ('y', [-1, 0, 1])])
self.assertDataArrayEqual(actual, expected)

actual = ar0.combine_first(ar2)
expected = DataArray([[0, 0], [0, 0], [2, 2]],
[('x', ['a', 'b', 'd']), ('y', [-1, 0])])
self.assertDataArrayEqual(actual, expected)

Member:

nit: two line breaks should separate code blocks, not three



@pytest.fixture(params=[1])
def da(request):
27 changes: 22 additions & 5 deletions xarray/test/test_dataset.py
@@ -3132,17 +3132,17 @@ def test_binary_op_join_setting(self):

def test_binary_op_join_setting(self):
# arithmetic_join applies to data array coordinates
missing_2 = xr.Dataset({'x':[0, 1]})
missing_0 = xr.Dataset({'x':[1, 2]})
missing_2 = xr.Dataset({'x': [0, 1]})
missing_0 = xr.Dataset({'x': [1, 2]})
with xr.set_options(arithmetic_join='outer'):
actual = missing_2 + missing_0
expected = xr.Dataset({'x':[0, 1, 2]})
expected = xr.Dataset({'x': [0, 1, 2]})
self.assertDatasetEqual(actual, expected)

# arithmetic join also applies to data_vars
ds1 = xr.Dataset({'foo': 1, 'bar': 2})
ds2 = xr.Dataset({'bar': 2, 'baz': 3})
expected = xr.Dataset({'bar': 4}) # default is inner joining
expected = xr.Dataset({'bar': 4}) # default is inner joining
actual = ds1 + ds2
self.assertDatasetEqual(actual, expected)

Expand All @@ -3165,7 +3165,7 @@ def test_full_like(self):
# For more thorough tests, see test_variable.py
# Note: testing data_vars with mismatched dtypes
ds = Dataset({
'd1': DataArray([1,2,3], dims=['x'], coords={'x': [10,20,30]}),
'd1': DataArray([1,2,3], dims=['x'], coords={'x': [10, 20, 30]}),
'd2': DataArray([1.1, 2.2, 3.3], dims=['y'])
}, attrs={'foo': 'bar'})
actual = full_like(ds, 2)
@@ -3186,6 +3186,23 @@ def test_full_like(self):
self.assertEqual(expect['d2'].dtype, bool)
self.assertDatasetIdentical(expect, actual)

def test_combine_first(self): # works just like xr.merge([self, other])
Member:

I would move this comment below, to just above the code block where you compare to the result of using merge

dsx0 = DataArray([0, 0], [('x', ['a', 'b'])]).to_dataset(name='dsx0')
dsx1 = DataArray([1, 1], [('x', ['b', 'c'])]).to_dataset(name='dsx1')

actual = dsx0.combine_first(dsx1)
expected = Dataset({'dsx0': ('x', [0, 0, np.nan]),
'dsx1': ('x', [np.nan, 1, 1])},
coords={'x': ['a', 'b', 'c']})
self.assertDatasetEqual(actual, expected)
self.assertDatasetEqual(actual, xr.merge([dsx0, dsx1]))

dsy2 = DataArray([2, 2, 2], [('x', ['b', 'c', 'd'])]).\
Member:

Per PEP8, please avoid writing \. Either split this on two lines or use parentheses for the implicit line continuation.
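
For example, the same statement using implicit continuation inside parentheses:

dsy2 = (DataArray([2, 2, 2], [('x', ['b', 'c', 'd'])])
        .to_dataset(name='dsy2'))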

to_dataset(name='dsy2')
actual = dsx0.combine_first(dsy2)
expected = xr.merge([dsy2, dsx0])
self.assertDatasetEqual(actual, expected)

### Py.test tests

