Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concat doesn't concatenate dimension coordinates along new dims #7539

Open
TomNicholas opened this issue Feb 16, 2023 · 4 comments
Open

Concat doesn't concatenate dimension coordinates along new dims #7539

TomNicholas opened this issue Feb 16, 2023 · 4 comments

Comments

@TomNicholas
Copy link
Member

What is your issue?

xr.concat doesn't concatenate dimension coordinates along new dimensions, which leads to pretty unintuitive behavior.

Take this example (motivated by #7532 (reply in thread))

segments = []
for i in range(2):
    time = np.sort(np.random.random(4))
    da = xr.DataArray(
        np.random.randn(4,2),
        dims=["time", "cols"],
        coords=dict(time=('time', time), cols=["col1", "col2"]),
        )
    segments.append(da)
In [86]: segments
Out[86]: 
[<xarray.DataArray (time: 4, cols: 2)>
 array([[-0.61199576, -0.9012078 ],
        [-0.54187577,  1.30509994],
        [-3.53720471,  0.97607797],
        [ 0.2593455 ,  0.95920031]])
 Coordinates:
   * time     (time) float64 0.1048 0.168 0.869 0.9432
   * cols     (cols) <U4 'col1' 'col2',
 <xarray.DataArray (time: 4, cols: 2)>
 array([[ 0.90266408, -0.54294821],
        [-1.09087103, -0.17484417],
        [-0.21679558, -0.57377412],
        [ 0.07570151,  0.27433728]])
 Coordinates:
   * time     (time) float64 0.03627 0.09754 0.2434 0.592
   * cols     (cols) <U4 'col1' 'col2']
In [85]: xr.concat(segments, dim='new')
Out[85]: 
<xarray.DataArray (new: 2, time: 8, cols: 2)>
array([[[        nan,         nan],
        [        nan,         nan],
        [-0.61199576, -0.9012078 ],
        [-0.54187577,  1.30509994],
        [        nan,         nan],
        [        nan,         nan],
        [-3.53720471,  0.97607797],
        [ 0.2593455 ,  0.95920031]],

       [[ 0.90266408, -0.54294821],
        [-1.09087103, -0.17484417],
        [        nan,         nan],
        [        nan,         nan],
        [-0.21679558, -0.57377412],
        [ 0.07570151,  0.27433728],
        [        nan,         nan],
        [        nan,         nan]]])
Coordinates:
  * time     (time) float64 0.03627 0.09754 0.1048 0.168 ... 0.592 0.869 0.9432
  * cols     (cols) <U4 'col1' 'col2'
Dimensions without coordinates: new

I would have expected to get a result of size {new: 2, time: 4, cols: 2}. That would be intuitive, because the default is coords='different', and that would be the result of concatenating each time coordinate (which have different values) and just propagating the cols coordinate (as they have the same values).

Instead what happened is that xr.concat treats the dimension coordinates as indexes to align, and defaults to an outer join. This auto-alignment behaviour has been discussed at length before, I'm just trying to point out another place in which its problematic.

This is kind of briefly mentioned in the concat docstring under coords='all':

“all”: All coordinate variables will be concatenated, except those corresponding to other dimensions.

but it's not even mentioned under coords='different'

I don't really know what I would prefer to happen with the coordinates. I guess to have created a time coordinate of size {new: 2, time: 4, cols: 2}, but then I don't know what that implies for the underlying index. @benbovy do you have any thoughts?

At the very least we should make this a lot clearer in the docs.

@benbovy
Copy link
Member

benbovy commented Feb 21, 2023

In general I also find that xr.concat is a powerful feature (incl. auto-alignment and merge options) at the expense that it may sometimes (often?) be hard to reason about. Would it make sense to have a simpler version? To avoid making xr.concat signature even more complicated, maybe another top-level function like xr.concat_noalign? Or any suggestion in #7045 to deactivate auto-alignment Xarray-wise. Or indeed at least make it clearer in the docs that something like drop_indexes or reset_coords should be used first in order to skip auto-alignment for some variables.

I don't really know what I would prefer to happen with the coordinates. I guess to have created a time coordinate of size {new: 2, time: 4, cols: 2}, but then I don't know what that implies for the underlying index. @benbovy do you have any thoughts?

I guess easiest for a concat version with no auto-alignment would be to drop the index when such case happens. (note: one problem in your example is that the Xarray data model still does not allow having a multi-dimensional "time" variable with "time" as also one of its dimensions, but this could be now relaxed).

I've been also wondering whether some kind of NDPandasIndex would make any sense, i.e., a n-d coordinate variable with an internal 1-d (flattened) pandas index and some logic to convert between those n-d vs. 1-d spaces. This is the kind of approach used in xoak for using a kd-tree with coordinates of arbitrary dimensions, where labels in the form of nd-arrays for each coordinate are mapped into the [n_points, n_coords] shape (and inversely for getting the integer indices back as nd-arrays). This works well for point-wise indexing, but I doubt it would be very useful beyond that (e.g., slicing, etc.).

@dcherian
Copy link
Contributor

We could get halfway to a better xr.concat by changing the defaults IMO.

I propose join="exact", data_vars="minimal", coords="minimal", compat="override"

  1. this would only concatenate variables that have concat_dim (if none do, then I think it will just concatenate all of them?)
  2. It would skip any expensive equality checks and concat() with compat='no_conflicts' on dask arrays has accidentally quadratic runtime #5381

This would also drastically improve open_mfdataset by default. It will be pretty backwards-incompatible though.

@TomNicholas
Copy link
Member Author

Or indeed at least make it clearer in the docs that something like drop_indexes or reset_coords should be used first in order to skip auto-alignment for some variables.

I think we should do this regardless. I don't know of anywhere in the docs that these kind of subtleties with concat are clearly documented.

I guess easiest for a concat version with no auto-alignment would be to drop the index when such case happens.

Right - in this case that would have given the intuitive result.

We could get halfway to a better xr.concat by changing #2064 IMO.

I propose join="exact", data_vars="minimal", coords="minimal", compat="override"

That wouldn't have helped with this specific issue though right?

@dcherian
Copy link
Contributor

That's true,join="exact" would've raised an error, which would've been less confusing perhaps.

If we do want this use-case to work out of the box, then I too agree that a new top-level method would be best. stack would've been a good name but that;s taken.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants