merge and align DataArrays/Datasets on different domains #742

jcmgray · 2016-02-02T17:27:17Z

Firstly, I think xarray is great, and for the type of physics simulations I run, n-dimensional labelled arrays are exactly what I need. But, and I may be missing something, is there a way to merge (or concatenate/update) DataArrays with different domains on the same coordinates?

For example consider this setup:

import xarray as xr

x1 = [100]
y1 = [1, 2, 3, 4, 5]
dat1 = [[101, 102, 103, 104, 105]]

x2 = [200]
y2 = [3, 4, 5, 6]  # different size and domain
dat2 = [[203, 204, 205, 206]]

da1 = xr.DataArray(dat1, dims=['x', 'y'], coords={'x': x1, 'y': y1})
da2 = xr.DataArray(dat2, dims=['x', 'y'], coords={'x': x2, 'y': y2})

I would like to aggregate such DataArrays into a new, single DataArray with nan padding such that:

>>> merge(da1, da2, align=True)  # made up syntax
<xarray.DataArray (x: 2, y: 6)>
array([[ 101.,  102.,  103.,  104.,  105.,   nan],
       [  nan,   nan,  203.,  204.,  205.,  206.]])
Coordinates:
  * x        (x) int64 100 200
  * y        (y) int64 1 2 3 4 5 6

Here is a quick function I wrote to do this, but I would be worried about the performance of 'expanding' the new data to the old data's size on every iteration (i.e. supposing that the first argument is a large DataArray that you are adding to, but which doesn't necessarily contain the dimensions already).

def xrmerge(*das, accept_new=True):
    da = das[0]
    for new_da in das[1:]:
        # Expand both to have same dimensions, padding with NaN
        da, new_da = xr.align(da, new_da, join='outer')
        # Fill NaNs one way or the other re. accept_new
        da = new_da.fillna(da) if accept_new else da.fillna(new_da)
    return da

Might this be (or is this already!) possible in a simpler form in xarray? I know Datasets have merge and update methods, but I couldn't make them work as above.
I also notice there are possible plans (#417) to introduce a merge function for DataArrays.
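For comparison, the align-then-fill pattern in xrmerge can be sketched with DataArray.combine_first, which outer-joins the coordinates and fills gaps in the calling array with values from the other. This is a minimal sketch that assumes a later xarray release than the one discussed in this thread; it is not the method the thread itself settled on.

```python
import numpy as np
import xarray as xr

da1 = xr.DataArray([[101, 102, 103, 104, 105]], dims=['x', 'y'],
                   coords={'x': [100], 'y': [1, 2, 3, 4, 5]})
da2 = xr.DataArray([[203, 204, 205, 206]], dims=['x', 'y'],
                   coords={'x': [200], 'y': [3, 4, 5, 6]})

# Outer-join the coordinates of both arrays, then fill the missing
# cells of da1 with the values of da2 (da1 wins where both have data).
combined = da1.combine_first(da2)
print(combined)
```

Because the two arrays here never overlap, the choice of which array "wins" does not matter; with overlapping data it plays the role of accept_new in xrmerge.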


shoyer · 2016-02-03T04:23:31Z

This is actually closer to the functionality of concat than merge. Hypothetically, something like the following would do what you want:

# note: this is *not* valid syntax currently! the dims argument
# does not yet exist.
# this would hypothetically only align along the 'y' dimension, not 'x'
aligned = xr.align(*das, join='outer', dims='y')
combined = xr.concat(aligned, dim='x')

In cases where each array does not already have the dimension you want to concat along, this already works fine, because you can simply omit dims in align.
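As a hedged illustration of the idea above: later xarray releases give xr.align an exclude argument, which lets the same effect be expressed by naming the dimension to leave alone rather than the dimension to align. The sketch below assumes such a release and reuses the arrays from the first post; it is not the hypothetical dims syntax itself.

```python
import numpy as np
import xarray as xr

da1 = xr.DataArray([[101, 102, 103, 104, 105]], dims=['x', 'y'],
                   coords={'x': [100], 'y': [1, 2, 3, 4, 5]})
da2 = xr.DataArray([[203, 204, 205, 206]], dims=['x', 'y'],
                   coords={'x': [200], 'y': [3, 4, 5, 6]})

# Outer-align along 'y' only; 'x' is excluded so each array keeps
# its own single 'x' label instead of being padded along 'x'.
aligned = xr.align(da1, da2, join='outer', exclude=['x'])

# The arrays now share the same 'y' index and can be stacked along 'x'.
combined = xr.concat(aligned, dim='x')
print(combined)
```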

JamesPHoughton · 2016-06-14T20:18:11Z

I'm having a similar issue, expanding the complexity in that I want to concatenate across multiple dimensions. I'm not sure if that's a cogent way to explain it, but here's an example. I have:

# dims is given explicitly: each data array has shape (1, 1, 3)
m = xr.DataArray(data=[[[1.1, 1.2, 1.3]]], dims=['Dim2', 'Dim3', 'Dim1'],
                 coords={'Dim1': ['A', 'B', 'C'], 'Dim2': ['D'], 'Dim3': ['F']})
n = xr.DataArray(data=[[[2.1, 2.2, 2.3]]], dims=['Dim2', 'Dim3', 'Dim1'],
                 coords={'Dim1': ['A', 'B', 'C'], 'Dim2': ['E'], 'Dim3': ['F']})
o = xr.DataArray(data=[[[3.1, 3.2, 3.3]]], dims=['Dim2', 'Dim3', 'Dim1'],
                 coords={'Dim1': ['A', 'B', 'C'], 'Dim2': ['D'], 'Dim3': ['G']})
p = xr.DataArray(data=[[[4.1, 4.2, 4.3]]], dims=['Dim2', 'Dim3', 'Dim1'],
                 coords={'Dim1': ['A', 'B', 'C'], 'Dim2': ['E'], 'Dim3': ['G']})

Which I want to merge into a single, fully populated array similar to what I'd get if I did:

data =[[[ 1.1,  1.2,  1.3],
        [ 3.1,  3.2,  3.3]],

       [[ 2.1,  2.2,  2.3],
        [ 4.1,  4.2,  4.3]]]

xr.DataArray(data=data, dims=['Dim2', 'Dim3', 'Dim1'],
             coords={'Dim1': ['A', 'B', 'C'], 'Dim2': ['D', 'E'], 'Dim3': ['F', 'G']})

i.e.

<xarray.DataArray (Dim2: 2, Dim3: 2, Dim1: 3)>
array([[[ 1.1,  1.2,  1.3],
        [ 3.1,  3.2,  3.3]],

       [[ 2.1,  2.2,  2.3],
        [ 4.1,  4.2,  4.3]]])
Coordinates:
  * Dim2     (Dim2) |S1 'D' 'E'
  * Dim3     (Dim3) |S1 'F' 'G'
  * Dim1     (Dim1) |S1 'A' 'B' 'C'

@jcmgray's function is pretty close, although the array indices are described slightly differently (I'm not sure if this is a big deal or not...). Note the 'object' type for Dim2 and Dim3:

<xarray.DataArray (Dim2: 2, Dim3: 2, Dim1: 3)>
array([[[ 1.1,  1.2,  1.3],
        [ 3.1,  3.2,  3.3]],

       [[ 2.1,  2.2,  2.3],
        [ 4.1,  4.2,  4.3]]])
Coordinates:
  * Dim2     (Dim2) object 'D' 'E'
  * Dim3     (Dim3) object 'F' 'G'
  * Dim1     (Dim1) |S1 'A' 'B' 'C'

It would be great to have a canonical way to do this. What should I try?

jcmgray · 2016-06-15T12:59:08Z

Just a comment that the appearance of object types is likely due to the fact that numpy's NaNs are inherently 'floats', so this will be an issue for any method with an intermediate 'missing data' stage if non-floats are being used.

I still use the align and fillna method since I mostly deal with floats/complex numbers, although @shoyer's suggestion of a partial align and then concat could definitely be cleaner when the added coordinates are all 'new'.
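The dtype change described above is easy to see with integer data. The following is a small sketch with made-up arrays (a and b are illustrative names, not from the thread): an outer align has to pad with NaN, and since NaN is a float, the integer data is up-cast.

```python
import xarray as xr

a = xr.DataArray([1, 2, 3], dims=['y'], coords={'y': [1, 2, 3]})
b = xr.DataArray([4, 5], dims=['y'], coords={'y': [3, 4]})

# Outer alignment pads both arrays to y = [1, 2, 3, 4] using NaN.
a_aligned, b_aligned = xr.align(a, b, join='outer')

# 'i' is numpy's kind code for signed integers, 'f' for floats.
print(a.dtype.kind, '->', a_aligned.dtype.kind)
```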

shoyer · 2016-06-15T16:54:45Z

I think this could make it into merge, which I am in the process of refactoring in #857.

The key difference from @jcmgray's implementation that I would want is a check to make sure that the data is all on different domains when using fillna. merge should not run the risk of removing non-NaN data.

@JamesPHoughton I agree with @jcmgray that the dtype=object is what you should expect here. It's hard to create fixed length strings in xarray/pandas because that precludes the possibility of missing values, so we tend to convert strings to object dtype when merged/concatenated.

JamesPHoughton · 2016-06-16T13:36:28Z

Something akin to the pandas dataframe update would have value - then you could create an empty array structure and populate it as necessary:

import pandas as pd
df = pd.DataFrame(index=range(5), columns=['a','b','c','d'])
df2 = pd.DataFrame(index=range(3), columns=['a'], data=range(3))
df.update(df2)

     a    b    c    d
0    0  NaN  NaN  NaN
1    1  NaN  NaN  NaN
2    2  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN

But, not sure if empty array construction is supported?

jcmgray · 2016-06-16T16:57:48Z

Yes, following a similar line of thought to you, I recently wrote an 'all missing' dataset constructor (rather than 'empty', which I think of as having no variables):

import numpy as np
import xarray as xr

def all_missing_ds(coords, var_names, var_dims, var_types):
    """
    Make a dataset whose data is all missing.
    """
    # Empty dataset with appropriate coordinates
    ds = xr.Dataset(coords=coords)
    for v_name, v_dims, v_type in zip(var_names, var_dims, var_types):
        shape = tuple(ds[d].size for d in v_dims)
        if v_type == int or v_type == float:
            # Warn about up-casting int to float?
            nodata = np.tile(np.nan, shape)
        elif v_type == complex:
            # astype(complex) produces (nan + 0.0j)
            nodata = np.tile(np.nan + np.nan*1.0j, shape)
        else:
            nodata = np.tile(np.nan, shape).astype(object)
        ds[v_name] = (v_dims, nodata)
    return ds

To go with this (and this might be separate issue), a set_value method would be helpful --- just so that one does not have to remember which particular combination of

ds.sel(...).var = new_values
ds.sel(...)['var'] = new_values
ds.var.sel(...) = new_values
ds['var'].sel(...) = new_values

guarantees assigning a new value (currently only the last syntax, I believe).

shoyer · 2016-06-20T05:29:59Z

@JamesPHoughton @jcmgray For empty array creation, take a look at #277 and #878 -- this functionality would certainly be welcome.

"To go with this (and this might be separate issue), a set_value method would be helpful --- just so that one does not have to remember which particular combination of..."

@jcmgray Beware -- none of these are actually supported! See the big warning here in the docs. If you think a set_value method would be a better reminder than such warnings in the docs I would be totally open to it. But let's open another issue to discuss it.

jcmgray · 2016-06-21T21:11:21Z

Whoops - I actually meant to put

ds['var'].loc[{...}]

in there as the one that works ... my understanding is that this is supported as long as the specified coordinates are 'nice' (according to pandas) slices/scalars.

And yes, default values for DataArray/Dataset would definitely fill the "create_all_missing" need.
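A minimal sketch of the .loc assignment mentioned above, using a small made-up Dataset (the variable name 'var' follows the earlier post; the coordinates and values are illustrative). It assumes label-based assignment through ds['var'].loc[...] writes into the Dataset's own data, which is the behaviour described in this comment.

```python
import numpy as np
import xarray as xr

# An "all missing" variable on a 2 x 3 grid.
ds = xr.Dataset(
    {'var': (('x', 'y'), np.full((2, 3), np.nan))},
    coords={'x': [100, 200], 'y': [1, 2, 3]},
)

# Assign a single cell by coordinate labels.
ds['var'].loc[{'x': 100, 'y': 2}] = 5.0

# Assign a whole row by label; the values are ordered along 'y'.
ds['var'].loc[{'x': 200}] = np.array([1.0, 2.0, 3.0])

print(ds['var'].values)
```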

jcmgray · 2016-08-24T23:05:49Z

@shoyer My 2 cents for how this might work after 0.8+ (auto-align during concat, merge and auto_combine go a long way towards solving this already) is that the compat option of merge etc. could have a 4th option, 'nonnull_equals' (or something better named...), with compatibility tested by e.g.

import xarray as xr

def nonnull_compatible(first, second):
    """ Check whether two (aligned) datasets have any conflicting non-null values. """

    # mask for where both objects are not null
    both_not_null = first.notnull() & second.notnull()

    # check remaining values are equal
    return first.where(both_not_null).equals(second.where(both_not_null))

And then fillna to combine variables. Looking now I think this is very similar to what you are suggesting in #835.

shoyer · 2016-08-24T23:27:41Z

@jcmgray Yes, that looks about right to me. The place to add this in would be the unique_variable function:
https://github.com/pydata/xarray/blob/master/xarray/core/merge.py#L39

I would use 'notnull_equals' rather than 'nonnull_equals' just because that's the pandas term.

shoyer · 2017-01-23T22:42:18Z

Fixed by #996
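For readers landing here, the option added by #996 (named 'no_conflicts' in that PR's title) can be sketched roughly as follows. The arrays a, b and c and the variable name 'v' are made up for illustration, and the sketch assumes a release that includes that PR.

```python
import xarray as xr

a = xr.DataArray([1.0, 2.0], dims=['y'], coords={'y': [1, 2]}, name='v')
b = xr.DataArray([2.0, 3.0], dims=['y'], coords={'y': [2, 3]}, name='v')

# a and b overlap at y=2 but agree there, so they merge cleanly
# into one variable covering y = [1, 2, 3].
merged = xr.merge([a, b], compat='no_conflicts', join='outer')
print(merged['v'].values)

# c disagrees with a at y=2 (9.0 versus 2.0), so the merge is refused
# rather than silently overwriting non-NaN data.
c = xr.DataArray([9.0, 3.0], dims=['y'], coords={'y': [2, 3]}, name='v')
try:
    xr.merge([a, c], compat='no_conflicts', join='outer')
    conflict_detected = False
except xr.MergeError:
    conflict_detected = True
print(conflict_detected)
```

This matches the requirement stated earlier in the thread that merge should not run the risk of removing non-NaN data.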

jcmgray changed the title ~~merge and align DataArrays on different domains~~ merge and align DataArrays/Datasets on different domains Aug 24, 2016

jcmgray mentioned this issue Aug 31, 2016

add 'no_conflicts' as compat option for merging non-conflicting data #996

Merged

shoyer mentioned this issue Nov 2, 2016

Allow concat() to drop/replace duplicate index labels? #1072

Closed

shoyer closed this as completed Jan 23, 2017
