convert DataArray to DataSet before combine #3312

Open · wants to merge 7 commits into base: main

Conversation

friedrichknuth (Author)

Enables combine_by_coords on DataArrays by converting each DataArray to a Dataset before proceeding.

As mentioned in #3248, this will still fail if the DataArray is unnamed, but at least the error message tells the user why.

Previously, combining both named

import numpy as np
import xarray as xr

da1 = xr.DataArray(name='foo',
                  data=np.random.rand(3,3), 
                  coords=[('x', [1, 2, 3]),
                          ('y', [1, 2, 3])])

da2 = xr.DataArray(name='foo2', 
                  data=np.random.rand(3,3), 
                  coords=[('x', [5, 6, 7]),
                          ('y', [5, 6, 7])])

xr.combine_by_coords([da1, da2])

and unnamed DataArrays

da1 = xr.DataArray(data=np.random.rand(3,3), 
                  coords=[('x', [1, 2, 3]),
                          ('y', [1, 2, 3])])

da2 = xr.DataArray(data=np.random.rand(3,3), 
                  coords=[('x', [5, 6, 7]),
                          ('y', [5, 6, 7])])

xr.combine_by_coords([da1, da2])

failed with ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

With this PR, combining the named DataArrays returns a combined Dataset, while the unnamed example raises
ValueError: unable to convert unnamed DataArray to a Dataset without providing an explicit name
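
The conversion itself is thin. A minimal sketch of the idea, with a hypothetical helper name (not the exact code in this PR):

import numpy as np
import xarray as xr

def _promote_to_dataset(obj):
    # Hypothetical helper, not the exact code in this PR: promote a DataArray
    # to a Dataset so the combine machinery only ever sees Datasets. An
    # unnamed DataArray raises the ValueError quoted above via to_dataset().
    if isinstance(obj, xr.DataArray):
        return obj.to_dataset()
    return obj

da1 = xr.DataArray(name='foo',
                   data=np.random.rand(3, 3),
                   coords=[('x', [1, 2, 3]), ('y', [1, 2, 3])])
da2 = xr.DataArray(name='foo2',
                   data=np.random.rand(3, 3),
                   coords=[('x', [5, 6, 7]), ('y', [5, 6, 7])])

# With the inputs promoted, the existing Dataset code path handles the rest.
combined = xr.combine_by_coords([_promote_to_dataset(da) for da in (da1, da2)])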

pep8speaks commented Sep 16, 2019

Hello @friedrichknuth! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-09-17 21:05:00 UTC

dcherian (Contributor) left a comment

Thanks @friedrichknuth! Welcome to xarray

This looks like a good start.

  1. Can you extend these changes to the other combine functions too please?

  2. You'll also need to add tests to test_combine.py. We have some documentation on that here: https://xarray.pydata.org/en/stable/contributing.html#using-pytest
    The example code in the issues and this PR should be a good start for the tests (see the sketch after this list).

  3. You should add a line to whats-new.rst and take credit for your work.
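
For item 2, tests along these lines would cover the named and unnamed cases from the PR description (a rough sketch only; the test names, placement, and assertions are illustrative rather than what the PR actually adds):

import numpy as np
import pytest
import xarray as xr

def test_combine_by_coords_named_dataarrays():
    # Named DataArrays should be promoted to Datasets and combined.
    da1 = xr.DataArray(name="foo", data=np.random.rand(3, 3),
                       coords=[("x", [1, 2, 3]), ("y", [1, 2, 3])])
    da2 = xr.DataArray(name="foo2", data=np.random.rand(3, 3),
                       coords=[("x", [5, 6, 7]), ("y", [5, 6, 7])])
    combined = xr.combine_by_coords([da1, da2])
    assert isinstance(combined, xr.Dataset)
    assert set(combined.data_vars) == {"foo", "foo2"}

def test_combine_by_coords_unnamed_dataarrays_raise():
    # Unnamed DataArrays cannot be promoted, so a clear error is expected.
    da1 = xr.DataArray(data=np.random.rand(3, 3),
                       coords=[("x", [1, 2, 3]), ("y", [1, 2, 3])])
    da2 = xr.DataArray(data=np.random.rand(3, 3),
                       coords=[("x", [5, 6, 7]), ("y", [5, 6, 7])])
    with pytest.raises(ValueError, match="unnamed DataArray"):
        xr.combine_by_coords([da1, da2])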

friedrichknuth (Author) commented Sep 17, 2019

pytest -q xarray/tests/test_combine.py is telling me that

def test_concat_name_symmetry(self):
    """Inspired by the discussion on GH issue #2777"""

    da1 = DataArray(name="a", data=[[0]], dims=["x", "y"])
    da2 = DataArray(name="b", data=[[1]], dims=["x", "y"])
    da3 = DataArray(name="a", data=[[2]], dims=["x", "y"])
    da4 = DataArray(name="b", data=[[3]], dims=["x", "y"])

    x_first = combine_nested([[da1, da2], [da3, da4]], concat_dim=["x", "y"])

fails with:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-13-bc4a941bd0c3> in <module>
      3 da3 = xr.DataArray(name="a", data=[[2]], dims=["x", "y"])
      4 da4 = xr.DataArray(name="b", data=[[3]], dims=["x", "y"])
----> 5 xr.combine_nested([[da1, da2], [da3, da4]], concat_dim=["x", "y"])

~/repos/contribute/xarray/xarray/core/combine.py in combine_nested(objects, concat_dim, compat, data_vars, coords, fill_value, join)
    468         ids=False,
    469         fill_value=fill_value,
--> 470         join=join,
    471     )
    472 

~/repos/contribute/xarray/xarray/core/combine.py in _nested_combine(datasets, concat_dims, compat, data_vars, coords, ids, fill_value, join)
    305         coords=coords,
    306         fill_value=fill_value,
--> 307         join=join,
    308     )
    309     return combined

~/repos/contribute/xarray/xarray/core/combine.py in _combine_nd(combined_ids, concat_dims, data_vars, coords, compat, fill_value, join)
    196             compat=compat,
    197             fill_value=fill_value,
--> 198             join=join,
    199         )
    200     (combined_ds,) = combined_ids.values()

~/repos/contribute/xarray/xarray/core/combine.py in _combine_all_along_first_dim(combined_ids, dim, data_vars, coords, compat, fill_value, join)
    218         datasets = combined_ids.values()
    219         new_combined_ids[new_id] = _combine_1d(
--> 220             datasets, dim, compat, data_vars, coords, fill_value, join
    221         )
    222     return new_combined_ids

~/repos/contribute/xarray/xarray/core/combine.py in _combine_1d(datasets, concat_dim, compat, data_vars, coords, fill_value, join)
    246                 compat=compat,
    247                 fill_value=fill_value,
--> 248                 join=join,
    249             )
    250         except ValueError as err:

~/repos/contribute/xarray/xarray/core/concat.py in concat(objs, dim, data_vars, coords, compat, positions, fill_value, join)
    131             "objects, got %s" % type(first_obj)
    132         )
--> 133     return f(objs, dim, data_vars, coords, compat, positions, fill_value, join)
    134 
    135 

~/repos/contribute/xarray/xarray/core/concat.py in _dataset_concat(datasets, dim, data_vars, coords, compat, positions, fill_value, join)
    363     for k in datasets[0].variables:
    364         if k in concat_over:
--> 365             vars = ensure_common_dims([ds.variables[k] for ds in datasets])
    366             combined = concat_vars(vars, dim, positions)
    367             assert isinstance(combined, Variable)

~/repos/contribute/xarray/xarray/core/concat.py in <listcomp>(.0)
    363     for k in datasets[0].variables:
    364         if k in concat_over:
--> 365             vars = ensure_common_dims([ds.variables[k] for ds in datasets])
    366             combined = concat_vars(vars, dim, positions)
    367             assert isinstance(combined, Variable)

~/repos/contribute/xarray/xarray/core/utils.py in __getitem__(self, key)
    383 
    384     def __getitem__(self, key: K) -> V:
--> 385         return self.mapping[key]
    386 
    387     def __iter__(self) -> Iterator[K]:

KeyError: 'a'

It looks like the existing combine_nested() routine actually wants a DataArray and fails if passed a Dataset.

The following should work with current master.

da1 = xr.DataArray(name="a", data=[[0]], dims=["x", "y"])
da2 = xr.DataArray(name="b", data=[[1]], dims=["x", "y"])
da3 = xr.DataArray(name="a", data=[[2]], dims=["x", "y"])
da4 = xr.DataArray(name="b", data=[[3]], dims=["x", "y"])
xr.combine_nested([[da1, da2], [da3, da4]], concat_dim=["x", "y"])

Converting to Datasets first, however, causes the same error raised by the test.

ds1 = da1.to_dataset()
ds2 = da2.to_dataset()
ds3 = da3.to_dataset()
ds4 = da4.to_dataset()
xr.combine_nested([[ds1, ds2], [ds3, ds4]], concat_dim=["x", "y"])

friedrichknuth (Author) commented

Opened #3315 regarding combine_nested() failing when passed nested Datasets.

friedrichknuth changed the title from "convert DataArray to DataSet before combine_by_coords" to "convert DataArray to DataSet before combine" on Sep 18, 2019

shoyer (Member) commented Sep 19, 2019

Thinking about the interface here, would it make sense for the combine methods to always return an object of the same type as the inputs? E.g., list of DataArray -> DataArray, list of Dataset -> Dataset? Effectively, we would disable the automatic merging when working with DataArray objects and only do concat (ignoring DataArray names).

I guess the confusing part is that in concat we always return the same type as the inputs, vs merge where we always return a Dataset. And of course, combine can do both!

TomNicholas (Member) commented Sep 25, 2019

would it make sense for the combine methods to always return an object of the same type as the inputs? E.g., list of DataArray -> DataArray, list of Dataset -> Dataset?

I don't think so. Even for an input of only DataArrays, depending on the actual names and values in the DataArrays, the result of a combine could be a DataArray or a Dataset. So would it not be simpler to

  1. Promote all inputs to Datasets (or @friedrichknuth's "dict_like_objects")
  2. Do the combining
  3. If the result has only a single variable then demote from Dataset to DataArray?

That way the result is always the simplest object that can hold the result of combining that particular set of inputs, and the combining internals only have to handle Dataset objects (see the sketch below).

EDIT: Oh wait but that won't work in the case of unnamed DataArrays right?
EDIT2: Actually _to_temp_dataset will handle that case too?
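
A rough sketch of that promote/combine/demote flow, using a placeholder name for unnamed inputs in the spirit of _to_temp_dataset (hypothetical wrapper, not xarray's actual implementation):

import xarray as xr

_TEMP_NAME = "__combine_temp__"  # placeholder for unnamed DataArrays (illustration only)

def combine_then_maybe_demote(objects, combine_func, **kwargs):
    # 1. Promote every input to a Dataset.
    datasets = [
        obj.to_dataset(name=obj.name if obj.name is not None else _TEMP_NAME)
        if isinstance(obj, xr.DataArray) else obj
        for obj in objects
    ]
    # 2. Do the combining on Datasets only.
    combined = combine_func(datasets, **kwargs)
    # 3. Demote back to a DataArray if only a single variable remains.
    if len(combined.data_vars) == 1:
        (only_var,) = combined.data_vars
        result = combined[only_var]
        if only_var == _TEMP_NAME:
            result.name = None  # restore the "unnamed" state
        return result
    return combined

Used as, e.g., combine_then_maybe_demote([da1, da2], xr.combine_by_coords). Whether step 3 is desirable at all is exactly what is debated below.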

dcherian (Contributor) commented

  1. Promote all inputs to Datasets (or @friedrichknuth's "dict_like_objects")
  2. Do the combining

I agree.

  3. If the result has only a single variable then demote from Dataset to DataArray?

I think the simplest solution is to always return a Dataset? Then the behaviour is always predictable. Otherwise you could combine a bunch of Datasets with a single variable and end up with a DataArray. Alternatively, return a DataArray only when all inputs are DataArrays with the same name.
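
A minimal sketch of that last check, with a hypothetical helper name, just to make the alternative concrete:

import xarray as xr

def all_dataarrays_same_name(objects):
    # Demote the combined result to a DataArray only when every input is a
    # DataArray and they all share a single name.
    return (
        all(isinstance(obj, xr.DataArray) for obj in objects)
        and len({obj.name for obj in objects}) == 1
    )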

shoyer (Member) commented Sep 25, 2019

Even for an input of only DataArrays, depending on the actual names and values in the DataArrays, the result of a combine could be a DataArray or a Dataset.

If we gave all the DataArray objects the same name when converted into Dataset objects, then I think the result could always be a single DataArray?

This would be consistent with how we do arithmetic with DataArrays: the names get ignored, and then we assign a name to the result only if there are no conflicting names on the inputs.
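
For reference, this is the arithmetic naming behaviour being described (current xarray behaviour, shown for illustration):

import xarray as xr

a = xr.DataArray([1, 2, 3], dims="x", name="foo")
b = xr.DataArray([4, 5, 6], dims="x", name="bar")
c = xr.DataArray([7, 8, 9], dims="x", name="foo")

print((a + b).name)  # None: conflicting names are dropped
print((a + c).name)  # 'foo': a common name is kept on the result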

TomNicholas (Member) commented

If we gave all the DataArray objects the same name when converted into Dataset objects, then I think the result could always be a single DataArray?

I suppose so, but this seems like an odd way to handle it to me. You're throwing away data (the names) which in other circumstances would be used.

This would be consistent with how we do arithmetic with a DataArrays: the names get ignored, and then we assign a name to the result only if there are no conflicting names on the inputs.

Do we want consistency with arithmetic, or consistency with merge? I strongly feel it should be the latter, as combine wraps merge (in fact, combine_nested(dataarrays, concat_dim=None) == merge(dataarrays)). More generally, I think we should try to make the behaviour of all our "combining" functions (i.e. merge, concat, update, combine_nested, and combine_by_coords) name-aware.

Let me try to clarify by summarizing. Currently, merge and combine_nested both do the same thing for named DataArrays (return a Dataset) and unnamed DataArrays (throw an error).

da1 = xr.DataArray(name='foo', 
                  data=np.random.rand(3,3), 
                  coords=[('x', [1, 2, 3]),
                          ('y', [1, 2, 3])])
da2 = xr.DataArray(name='foo2', 
                  data=np.random.rand(3,3), 
                  coords=[('x', [5, 6, 7]),
                          ('y', [5, 6, 7])])
xr.merge([da1, da2])

and

da1 = xr.DataArray(name='foo', 
                  data=np.random.rand(3,3), 
                  coords=[('x', [1, 2, 3]),
                          ('y', [1, 2, 3])])
da2 = xr.DataArray(name='foo2', 
                  data=np.random.rand(3,3), 
                  coords=[('x', [5, 6, 7]),
                          ('y', [5, 6, 7])])
xr.combine_nested([da1, da2], concat_dim=None) 

both return

<xarray.Dataset>
Dimensions:  (x: 6, y: 6)
Coordinates:
  * x        (x) int64 1 2 3 5 6 7
  * y        (y) int64 1 2 3 5 6 7
Data variables:
    foo      (x, y) float64 0.5235 0.4114 0.7112 nan nan ... nan nan nan nan nan
    foo2     (x, y) float64 nan nan nan nan nan ... nan 0.08344 0.8844 0.7462

This all makes intuitive sense to me.

combine_by_coords is basically the same operation as combine_nested, just with automated rather than manual ordering. It will also merge different variables together, so it should do the same thing as merge and combine_nested: fill the gaps in the hypercube with NaNs (as per #3649) and return a Dataset with two variables named after the input DataArrays (and throw an error for unnamed DataArrays).

However, as shown above, combine_by_coords is not consistent with merge or combine_nested, which this PR will fix.

This is all different to the arithmetic logic, but I think it makes way more intuitive sense. It's okay for arithmetic and combining logic to be different, as they are used in different contexts and it's an unambiguous delineation to ignore names in arithmetic, and use them in top-level combining functions.

Also, to complete the consistency of the "combining" functions, I think we should make concat name-aware, as described in #3315.

In short: I propose that "combining" isn't arithmetic, and should be treated separately (and consistently across all types of combine functions).

TomNicholas (Member) commented

Also separately I think I've discovered a related weird bug:

da1 = xr.DataArray([0], name='a')
da2 = xr.DataArray([1], name='b')
xr.combine_by_coords([da1, da2])                                                                                                                                                        

returns

/xarray/xarray/core/dataarray.py:682: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  return key in self.data
/xarray/xarray/core/dataarray.py:682: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  return key in self.data
Out[5]:
<xarray.Dataset>
Dimensions:  (dim_0: 1)
Dimensions without coordinates: dim_0
Data variables:
    a        (dim_0) int64 0
    b        (dim_0) int64 1

which it shouldn't do: it should have failed immediately because there are no dimension coordinates, and in any case single-element arrays should behave the same as the other examples above. I think it's reading the single-element data 0 or 1 as a boolean at some point...

TomNicholas (Member) commented

@shoyer I just re-encountered this bug whilst doing actual work. I would like to get it fixed, but need your input, in particular on

Do we want consistency with arithmetic, or consistency with merge?

(if this is particularly complex we could perhaps discuss it in the bi-weekly meeting?)
