
merge and align DataArrays/Datasets on different domains #742

Closed
jcmgray opened this issue Feb 2, 2016 · 11 comments

jcmgray commented Feb 2, 2016

Firstly, I think xarray is great, and for the type of physics simulations I run, n-dimensional labelled arrays are exactly what I need. But, and I may be missing something, is there a way to merge (or concatenate/update) DataArrays with different domains on the same coordinates?

For example consider this setup:

import xarray as xr

x1 = [100]
y1 = [1, 2, 3, 4, 5]
dat1 = [[101, 102, 103, 104, 105]]

x2 = [200]
y2 = [3, 4, 5, 6]  # different size and domain
dat2 = [[203, 204, 205, 206]]

da1 = xr.DataArray(dat1, dims=['x', 'y'], coords={'x': x1, 'y': y1})
da2 = xr.DataArray(dat2, dims=['x', 'y'], coords={'x': x2, 'y': y2})

I would like to aggregate such DataArrays into a single new DataArray with NaN padding, such that:

>>> merge(da1, da2, align=True)  # made up syntax
<xarray.DataArray (x: 2, y: 6)>
array([[ 101.,  102.,  103.,  104.,  105.,   nan],
       [  nan,   nan,  203.,  204.,  205.,  206.]])
Coordinates:
  * x        (x) int64 100 200
  * y        (y) int64 1 2 3 4 5 6

Here is a quick function I wrote to do this, but I would be worried about the performance of 'expanding' the new data to the old data's size every iteration (i.e. supposing that the first argument is a large DataArray that you are adding to, but which doesn't necessarily already contain the new dimensions).

def xrmerge(*das, accept_new=True):
    da = das[0]
    for new_da in das[1:]:
        # Expand both to have same dimensions, padding with NaN
        da, new_da = xr.align(da, new_da, join='outer')
        # Fill NaNs one way or the other re. accept_new
        da = new_da.fillna(da) if accept_new else da.fillna(new_da)
    return da
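
For example, with the da1 and da2 defined above:

merged = xrmerge(da1, da2)  # gives the desired (x: 2, y: 6) result shown above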

Might this be (or is this already!) possible in a simpler form in xarray? I know Datasets have merge and update methods, but I couldn't make them work as above.
I also notice there are possible plans (#417) to introduce a merge function for DataArrays.


shoyer commented Feb 3, 2016

This is actually closer to the functionality of concat than merge. Hypothetically, something like the following would do what you want:

# note: this is *not* valid syntax currently! the dims argument
# does not yet exist.
# this would hypothetically only align along the 'y' dimension, not 'x'
aligned = xr.align(*das, join='outer', dims='y')
combined = xr.concat(aligned, dim='x')

In cases where each array does not already have the dimension you want to concat along, this already works fine, because you can simply omit dims in align.
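
A minimal sketch of that already-working case, using hypothetical 1-D versions of the arrays above (no 'x' dimension yet, so a plain outer align only touches 'y'):

import pandas as pd
import xarray as xr

# hypothetical 1-D versions of da1/da2 from above, without the 'x' dimension
da1_y = xr.DataArray([101, 102, 103, 104, 105], dims=['y'],
                     coords={'y': [1, 2, 3, 4, 5]})
da2_y = xr.DataArray([203, 204, 205, 206], dims=['y'],
                     coords={'y': [3, 4, 5, 6]})

# outer join pads both out to y = 1..6 with NaN; only 'y' needs aligning
aligned = xr.align(da1_y, da2_y, join='outer')
# concat along a brand-new 'x' dimension, with explicit coordinate values
combined = xr.concat(aligned, dim=pd.Index([100, 200], name='x'))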

@JamesPHoughton

I'm having a similar issue, with the added complexity that I want to concatenate across multiple dimensions. I'm not sure that's a cogent way to explain it, so here's an example. I have:

m = xr.DataArray(data=[[[1.1, 1.2, 1.3]]], dims=['Dim2', 'Dim3', 'Dim1'],
                 coords={'Dim1': ['A', 'B', 'C'], 'Dim2': ['D'], 'Dim3': ['F']})
n = xr.DataArray(data=[[[2.1, 2.2, 2.3]]], dims=['Dim2', 'Dim3', 'Dim1'],
                 coords={'Dim1': ['A', 'B', 'C'], 'Dim2': ['E'], 'Dim3': ['F']})
o = xr.DataArray(data=[[[3.1, 3.2, 3.3]]], dims=['Dim2', 'Dim3', 'Dim1'],
                 coords={'Dim1': ['A', 'B', 'C'], 'Dim2': ['D'], 'Dim3': ['G']})
p = xr.DataArray(data=[[[4.1, 4.2, 4.3]]], dims=['Dim2', 'Dim3', 'Dim1'],
                 coords={'Dim1': ['A', 'B', 'C'], 'Dim2': ['E'], 'Dim3': ['G']})

Which I want to merge into a single, fully populated array similar to what I'd get if I did:

data =[[[ 1.1,  1.2,  1.3],
        [ 3.1,  3.2,  3.3]],

       [[ 2.1,  2.2,  2.3],
        [ 4.1,  4.2,  4.3]]]

xr.DataArray(data=data, dims=['Dim2', 'Dim3', 'Dim1'],
             coords={'Dim1': ['A', 'B', 'C'], 'Dim2': ['D', 'E'], 'Dim3': ['F', 'G']})

i.e.

<xarray.DataArray (Dim2: 2, Dim3: 2, Dim1: 3)>
array([[[ 1.1,  1.2,  1.3],
        [ 3.1,  3.2,  3.3]],

       [[ 2.1,  2.2,  2.3],
        [ 4.1,  4.2,  4.3]]])
Coordinates:
  * Dim2     (Dim2) |S1 'D' 'E'
  * Dim3     (Dim3) |S1 'F' 'G'
  * Dim1     (Dim1) |S1 'A' 'B' 'C'

@jcmgray's function is pretty close, although the array indices are described slightly differently (I'm not sure if this is a big deal or not...). Note the 'object' type for Dim2 and Dim3:

<xarray.DataArray (Dim2: 2, Dim3: 2, Dim1: 3)>
array([[[ 1.1,  1.2,  1.3],
        [ 3.1,  3.2,  3.3]],

       [[ 2.1,  2.2,  2.3],
        [ 4.1,  4.2,  4.3]]])
Coordinates:
  * Dim2     (Dim2) object 'D' 'E'
  * Dim3     (Dim3) object 'F' 'G'
  * Dim1     (Dim1) |S1 'A' 'B' 'C'
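
(For what it's worth, nesting concat calls does produce the fully populated array today, assuming the blocks tile a regular grid - a sketch using the m, n, o, p above:

row_D = xr.concat([m, o], dim='Dim3')  # Dim2 = 'D': blocks F then G
row_E = xr.concat([n, p], dim='Dim3')  # Dim2 = 'E': blocks F then G
full = xr.concat([row_D, row_E], dim='Dim2')

but that requires knowing the grid layout up front.)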

It would be great to have a canonical way to do this. What should I try?


jcmgray commented Jun 15, 2016

Just a comment that the appearance of object dtypes is likely due to the fact that numpy's NaN is inherently a float - so this will be an issue for any method with an intermediate 'missing data' stage whenever non-floats are being used.
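
A minimal illustration of that promotion, assuming a fixed-width |S1 array like the coordinates above - introducing missing values forces object dtype:

import numpy as np
import xarray as xr

s = xr.DataArray(np.array([b'D'], dtype='S1'), dims=['Dim2'], coords={'Dim2': [0]})
print(s.reindex(Dim2=[0, 1]).dtype)  # object - NaN can't live in a fixed-width string array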

I still use the align-and-fillna method, since I mostly deal with floats/complex numbers, although @shoyer's suggestion of a partial align and then concat could definitely be cleaner when the added coordinates are all 'new'.


shoyer commented Jun 15, 2016

I think this could make it into merge, which I am in the process of refactoring in #857.

The key difference from @jcmgray's implementation that I would want is a check to make sure that the data is all on different domains when using fillna. merge should not run the risk of removing non-NaN data.
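
A sketch of that risk with two hypothetical overlapping arrays - a plain fillna silently discards conflicting data:

import numpy as np
import xarray as xr

a = xr.DataArray([1.0, np.nan], dims=['x'], coords={'x': [0, 1]})
b = xr.DataArray([9.0, 2.0], dims=['x'], coords={'x': [0, 1]})
print(a.fillna(b).values)  # [1. 2.] - the conflicting 9.0 at x=0 is dropped without warning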

@JamesPHoughton I agree with @jcmgray that the dtype=object is what you should expect here. It's hard to create fixed length strings in xarray/pandas because that precludes the possibility of missing values, so we tend to convert strings to object dtype when merged/concatenated.

JamesPHoughton commented Jun 16, 2016

Something akin to pandas' DataFrame.update would have value - then you could create an empty array structure and populate it as necessary:

import pandas as pd

df = pd.DataFrame(index=range(5), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(index=range(3), columns=['a'], data=range(3))
df.update(df2)
print(df)
     a    b    c    d
0    0  NaN  NaN  NaN
1    1  NaN  NaN  NaN
2    2  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN

But I'm not sure if empty array construction like that is supported?


jcmgray commented Jun 16, 2016

Yes, following a similar line of thought to you, I recently wrote an 'all missing' dataset constructor (rather than 'empty', which I think of as having no variables):

import numpy as np
import xarray as xr

def all_missing_ds(coords, var_names, var_dims, var_types):
    """Make a dataset whose data is all missing."""
    # Empty dataset with appropriate coordinates
    ds = xr.Dataset(coords=coords)
    for v_name, v_dims, v_type in zip(var_names, var_dims, var_types):
        shape = tuple(ds[d].size for d in v_dims)
        if v_type == int or v_type == float:
            # Warn about up-casting int to float?
            nodata = np.tile(np.nan, shape)
        elif v_type == complex:
            # astype(complex) would produce (nan + 0.0j), so build complex NaN directly
            nodata = np.tile(np.nan + np.nan * 1.0j, shape)
        else:
            nodata = np.tile(np.nan, shape).astype(object)
        ds[v_name] = (v_dims, nodata)
    return ds
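
For example, a hypothetical call might look like:

ds = all_missing_ds(coords={'x': [100, 200], 'y': [1, 2, 3]},
                    var_names=['data'], var_dims=[('x', 'y')], var_types=[float])
# ds['data'] is now a (2, 3) all-NaN array, ready to be filled in piecewise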

To go with this (and this might be a separate issue), a set_value method would be helpful --- just so that one does not have to remember which particular combination of

ds.sel(...).var = new_values
ds.sel(...)['var'] = new_values
ds.var.sel(...) = new_values
ds['var'].sel(...) = new_values

guarantees assigning a new value (currently only the last syntax, I believe).


shoyer commented Jun 20, 2016

@JamesPHoughton @jcmgray For empty array creation, take a look at #277 and #878 -- this functionality would certainly be welcome.

To go with this (and this might be a separate issue), a set_value method would be helpful --- just so that one does not have to remember which particular combination of...

@jcmgray Beware -- none of these are actually supported! See the big warning here in the docs. If you think a set_value method would be a better reminder than such warnings in the docs I would be totally open to it. But let's open another issue to discuss it.


jcmgray commented Jun 21, 2016

Whoops - I actually meant to put

ds['var'].loc[{...}]

in there as the one that works ... my understanding is that this is supported as long as the specified coordinates are 'nice' (according to pandas) slices/scalars.
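
A minimal sketch of that supported route, on a hypothetical dataset:

import numpy as np
import xarray as xr

ds = xr.Dataset({'var': (('x',), np.zeros(3))}, coords={'x': [10, 20, 30]})
ds['var'].loc[{'x': 20}] = 99.0  # label-based assignment that actually modifies ds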

And yes, default values for DataArray/Dataset would definitely fill the "create_all_missing" need.


jcmgray commented Aug 24, 2016

@shoyer My 2 cents on how this might work after 0.8+ (auto-align during concat, merge and auto_combine already goes a long way toward solving this) is that the compat option of merge etc. could gain a 4th option, 'nonnull_equals' (or a better name...), with compatibility tested by e.g.:

import xarray.ufuncs as xrufuncs

def nonnull_compatible(first, second):
    """ Check whether two (aligned) datasets have any conflicting non-null values. """

    # mask for where both objects are not null
    both_not_null = xrufuncs.logical_not(first.isnull() | second.isnull())

    # check remaining values are equal
    return first.where(both_not_null).equals(second.where(both_not_null))
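
For instance, a hypothetical check on the da1/da2 from the top of this thread:

a, b = xr.align(da1, da2, join='outer')
print(nonnull_compatible(a, b))  # True - their non-null values never overlap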

And then use fillna to combine variables. Looking now, I think this is very similar to what you are suggesting in #835.

jcmgray changed the title from "merge and align DataArrays on different domains" to "merge and align DataArrays/Datasets on different domains" on Aug 24, 2016

shoyer commented Aug 24, 2016

@jcmgray Yes, that looks about right to me. The place to add this in would be the unique_variable function:
https://github.com/pydata/xarray/blob/master/xarray/core/merge.py#L39

I would use 'notnull_equals' rather than 'nonnull_equals' just because that's the pandas term.


shoyer commented Jan 23, 2017

Fixed by #996

shoyer closed this as completed Jan 23, 2017