Improved default behavior when concatenating DataArrays #2777
Conversation
Force-pushed from 280ce92 to a2df249
This seems like a nice usability improvement!
Thanks for the support and quick review @shoyer! Any idea when Xarray 0.12 might be out? I'm teaching some remote sensing workshops in mid-March and would love to have this merged, as a colleague's review of those notebooks prompted this PR 😄
Force-pushed from ab26245 to 562ffb8
The docs build failed due to a (transient) http error when loading tutorial data for the docs, so I've also finalised the planned conversion from …

Hmm, it looks like the failure to download the naturalearth coastlines.zip wasn't so transient after all - but it does work on my machine!
Force-pushed from d6e15cc to 1b51681
OK! @shoyer, I've got everything passing and it's ready for review. Even the accidental tutorial/docs fixes 😄
I'll check on the doc build failure, but I would rather not remove
load_dataset yet -- we didn't deprecate it that long ago.
…On Fri, Feb 22, 2019 at 5:28 AM, Zac Hatfield-Dodds commented on this pull request, in `xarray/core/combine.py` (#2777):

```diff
-        elif name != arr.name:
-            if compat == 'identical':
-                raise ValueError('array names not identical')
-            else:
-                arr = arr.rename(name)
-        datasets.append(arr._to_temp_dataset())
+    name = result_name(arrays)
+    names = [arr.name for arr in arrays]
+    if compat == 'identical' and len(set(names)) != 1:
+        raise ValueError(
+            "compat='identical', but array names {!r} are not identical"
+            .format(names if len(names) <= 10 else sorted(set(names)))
+        )
+    datasets = [arr.rename(name)._to_temp_dataset() for arr in arrays]
+
+    if isinstance(dim, str) and len(set(names) - {None}) == len(names) \
```

Not for this - we're checking that names is a list of unique strings.
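To see the checking logic in isolation, here is a minimal, hedged sketch in plain Python. `result_name` below is a stand-in for xarray's internal helper of the same name (the real code operates on DataArray objects rather than bare name lists), and `check_identical_names` mirrors the `ValueError` branch in the diff:

```python
def result_name(names):
    """Pick the common name if all inputs agree, else None.

    A stand-in for xarray's internal result_name() helper.
    """
    unique = set(names)
    return unique.pop() if len(unique) == 1 else None


def check_identical_names(names, compat):
    """The uniqueness check from the diff: under compat='identical',
    differing names are an error rather than being silently renamed."""
    if compat == 'identical' and len(set(names)) != 1:
        raise ValueError(
            "compat='identical', but array names {!r} are not identical"
            .format(names if len(names) <= 10 else sorted(set(names)))
        )


check_identical_names(['a', 'a', 'a'], 'identical')  # passes silently
print(result_name(['a', 'a']))  # a
print(result_name(['a', 'b']))  # None
```

Note the `.format(...)` trick: with more than ten inputs the error message shows the sorted set of distinct names rather than the full (possibly huge) list.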
Force-pushed from 1b51681 to 97ac4a6
@shoyer - don't worry about the docs build, I'm pretty sure that was just a flaky network from Travis and it's working now in any case. I've left …
This looks good to me now. I'll merge in a day or two unless anyone else has review comments to add.
Hello @Zac-HD! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on February 27, 2019 at 00:51 UTC
@pep8speaks seems to have gone haywire -- maybe you have a syntax error? Thinking about this a little more, one hazard of converting names into index labels is that we lose the invariant that you get the same result regardless of the order in which you call concat, e.g., something like these expressions could now give different results:
vs
I'm not entirely sure this is a deal-breaker, but it makes me a little nervous. In particular, it might break some of the invariants we're relying upon for the next version of …
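To make the order-dependence concern concrete, here is a toy model in plain Python (emphatically not xarray's implementation): each "array" is a dict, and `toy_concat` turns the input names into index labels on the new dimension whenever those names are unique non-None strings, roughly mirroring the condition in this PR. Nesting the same three arrays differently then yields different labels, because the intermediate result is nameless:

```python
def toy_concat(arrays, dim):
    """Toy model of the proposed behaviour: names become index labels
    on `dim` when they are all unique and non-None."""
    names = [a["name"] for a in arrays]
    coords = {}
    for a in arrays:
        coords.update(a["coords"])
    if None not in names and len(set(names)) == len(names):
        coords[dim] = tuple(names)  # names become index labels
    return {
        # like result_name(): keep the name only if all inputs agree
        "name": names[0] if len(set(names)) == 1 else None,
        "coords": coords,
        "data": [a["data"] for a in arrays],
    }


def da(name, value):
    """A minimal stand-in for a named DataArray with no coords."""
    return {"name": name, "coords": {}, "data": value}


# Left-nested vs right-nested concatenation of the same three arrays:
left = toy_concat([toy_concat([da("a", 0), da("b", 1)], "x"), da("c", 2)], "x")
right = toy_concat([da("a", 0), toy_concat([da("b", 1), da("c", 2)], "x")], "x")
print(left["coords"])   # {'x': ('a', 'b')}
print(right["coords"])  # {'x': ('b', 'c')}
```

In both cases the inner concat produces a nameless intermediate, so the outer concat skips relabelling and the `x` labels depend entirely on which pair was concatenated first.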
Force-pushed from e0ae758 to 4f3751e
...yep, an unmatched paren. 😥
I think it's impossible to avoid this when using inference in the general case. Two options I think would be decent-if-unsatisfying:
This is really nice to have when using concat to produce faceted plots of various kinds, and harmless when it's useless.
it's still deprecated, but we'll leave it for a bit longer before removal.
Force-pushed from 4f3751e to 63da214
@Zac-HD forgive me for this, but I think this PR is unnecessary because what you need basically already exists in the API. Going back to your original example, you could have got the same indexing by creating a DataArray to use as a coordinate to concatenate over:

```python
colors = "blue green red".split()
ds = xr.Dataset({
    k: xr.DataArray(np.random.random((2, 2)), dims="x y".split(), name=k)
    for k in colors
})
band = xr.DataArray(colors, name="band", dims=["band"])
xr.concat([ds.blue, ds.green, ds.red], dim=band).plot.imshow(col="band")
```

This still leaves the wrong label on the colorbar, but that could be fixed separately, and has to do with concat using the attrs of the first dataset in the list for the final dataset (a similar problem to #2382). I think it would be easier to change that behaviour instead (perhaps to …). Creating a new coordinate using a DataArray is in the docstring for …, but I think there should be an example there too. (Also I think this is relevant to #1646)
@shoyer I agree - although I like the idea, I think this could introduce all sorts of complex concatenation edge cases. At the very least the new API should have symmetry properties, something like:

```python
da1 = DataArray(name='a', data=[[0]], dims=['x', 'y'])
da2 = DataArray(name='b', data=[[1]], dims=['x', 'y'])
da3 = DataArray(name='a', data=[[2]], dims=['x', 'y'])
da4 = DataArray(name='b', data=[[3]], dims=['x', 'y'])

xr.manual_combine([[da1, da2], [da3, da4]], concat_dim=['x', 'y'])
# should give the same result as
xr.manual_combine([[da1, da3], [da2, da4]], concat_dim=['y', 'x'])
```

but with this PR I don't think it would. In the first case the x coord would be created with values …. I think my suggestion for naming would pass this test, because the result would be nameless and have no coords either way. I might have got that wrong, but I definitely think this kind of change should be carefully considered 😕 (EDIT: I just added this example as a test to #2616)
I've just submitted a PR which solves this issue in the way I just suggested: #2792.
@Zac-HD there's actually another way to get the indexing behaviour you wanted with the current API:

```python
colors = "blue green red".split()
das = [xr.DataArray(np.random.random((2, 2)), dims="x y".split(),
                    coords={"band": k})
       for k in colors]
xr.concat(das, dim="band").plot.imshow(col="band")
```

Here, instead of using the name attribute to label each band, I've used a scalar coordinate called `band`. This never touches the names, so it actually gives the desired output without the need for #2792:
I guess we should probably roll back the "name to scalar coordinates" part of this change. @Zac-HD do you want to do that here, or should we go with @TomNicholas's PR?
* concatenates along a single dimension
* Wrote function to find correct tile_IDs from nested list of datasets
* Wrote function to check that combined_tile_ids structure is valid
* Added test of 2d-concatenation
* Tests now check that dataset ordering is correct
* Test concatenation along a new dimension
* Started generalising auto_combine to N-D by integrating the N-D concatenation algorithm
* All unit tests now passing
* Fixed a failing test which I didn't notice because I don't have pseudoNetCDF
* Began updating open_mfdataset to handle N-D input
* Refactored to remove duplicate logic in open_mfdataset & auto_combine
* Implemented Shoyer's suggestion in #2553 to rewrite the recursive nested list traverser as an iterator
* --amend
* Now raises ValueError if input not ordered correctly before concatenation
* Added some more prototype tests defining desired behaviour more clearly
* Now raises informative errors on invalid forms of input
* Refactoring to also merge along each dimension
* Refactored to literally just apply the old auto_combine along each dimension
* Added unit tests for open_mfdataset
* Removed TODOs
* Removed format strings
* test_get_new_tile_ids now doesn't assume dicts are ordered
* Fixed failing tests on python3.5 caused by accidentally assuming dict was ordered
* Test for getting new tile id
* Fixed itertoolz import so that it's compatible with older versions
* Increased test coverage
* Added toolz as an explicit dependency to pass tests on python2.7
* Updated 'what's new'
* No longer attempts to shortcut all concatenation at once if concat_dims=None
* Rewrote using itertools.groupby instead of toolz.itertoolz.groupby to remove hidden dependency on toolz
* Fixed erroneous removal of utils import
* Updated docstrings to include an example of multidimensional concatenation
* Clarified auto_combine docstring for N-D behaviour
* Added unit test for nested list of Datasets with different variables
* Minor spelling and pep8 fixes
* Started working on a new api with both auto_combine and manual_combine
* Wrote basic function to infer concatenation order from coords. Needs better error handling though.
* Attempt at finalised version of public-facing API. All the internals still need to be redone to match though.
* No longer uses entire old auto_combine internally, only concat or merge
* Updated what's new
* Removed unneeded addition to what's new for old release
* Fixed incomplete merge in docstring for open_mfdataset
* Tests for manual combine passing
* Tests for auto_combine now passing
* xfailed weird behaviour with manual_combine trying to determine concat_dim
* Add auto_combine and manual_combine to API page of docs
* Tests now passing for open_mfdataset
* Completed merge so that #2648 is respected, and added tests. Also moved concat to its own file to avoid a circular dependency
* Separated the tests for concat and both combines
* Some PEP8 fixes
* Pre-empting a test which will fail with opening uamiv format
* Satisfy pep8speaks bot
* Python 3.5 compatible after changing some error string formatting
* Order coords using pandas.Index objects
* Fixed performance bug from GH #2662
* Removed ToDos about natural sorting of string coords
* Generalized auto_combine to handle monotonically-decreasing coords too
* Added more examples to docstring for manual_combine
* Added note about globbing aspect of open_mfdataset
* Removed auto-inferring of concatenation dimension in manual_combine
* Added example to docstring for auto_combine
* Minor correction to docstring
* Another very minor docstring correction
* Added test to guard against issue #2777
* Started deprecation cycle for auto_combine
* Fully reverted open_mfdataset tests
* Updated what's new to match deprecation cycle
* Reverted uamiv test
* Removed dependency on itertools
* Deprecation tests fixed
* Satisfy pycodestyle
* Started deprecation cycle of auto_combine
* Added specific error for edge case combine_manual can't handle
* Check that global coordinates are monotonic
* Highlighted weird behaviour when concatenating with no data variables
* Added test for impossible-to-auto-combine coordinates
* Removed unneeded test
* Satisfy linter
* Added airspeedvelocity benchmark for combining functions
* Benchmark will take longer now
* Updated version numbers in deprecation warnings to fit with recent release of 0.12
* Updated api docs for new function names
* Fixed docs build failure
* Revert "Fixed docs build failure" (this reverts commit ddfc6dd)
* Updated documentation with section explaining new functions
* Suppressed deprecation warnings in test suite
* Resolved ToDo by pointing to issue with concat, see #2975
* Various docs fixes
* Slightly renamed tests to match new name of tested function
* Included minor suggestions from shoyer
* Removed trailing whitespace
* Simplified error message for case combine_manual can't handle
* Removed filter for deprecation warnings, and added test for if user doesn't supply concat_dim
* Simple fixes suggested by shoyer
* Change deprecation warning behaviour
* linting
* whats-new.rst for all changes
* api.rst for new API

This is really nice to have when producing faceted plots of satellite observations in various bands, and should be somewhere between useful and harmless in other cases.
Example code:
Before - facets have an index, colorbar has misleading label:
After - facets have meaningful labels, colorbar has no label: