Avoid auto creation of indexes in concat by TomNicholas · Pull Request #8872 · pydata/xarray

TomNicholas · 2024-03-25T05:16:33Z

If we create a Coordinates object using the concatenated result_indexes, and pass that to the Dataset constructor, we can explicitly set the correct indexes from the start, instead of auto-creating the wrong ones and then trying to overwrite them with the correct indexes later (which is what the current implementation does).
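
As a rough illustration of the idea (not the code in this PR), the sketch below builds a Coordinates object with an explicitly empty set of indexes and passes it to the Dataset constructor. It assumes an xarray version in which the Coordinates constructor accepts an indexes argument; the variable names are invented for the example.

```python
import numpy as np
import xarray as xr

values = np.array([1, 2, 3])

# Default behaviour: a dimension coordinate gets an index auto-created.
with_index = xr.Dataset(coords={"x": ("x", values)})

# Explicit behaviour: indexes={} asks xarray not to build any default index,
# so the Dataset is constructed with the intended (empty) set of indexes.
coords = xr.Coordinates({"x": ("x", values)}, indexes={})
without_index = xr.Dataset({"a": ("x", values * 10)}, coords=coords)

print("x" in with_index.xindexes)
print("x" in without_index.xindexes)
```

The point of the design is that the indexes are decided once, at construction time, rather than being created implicitly and corrected afterwards.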

Possible fix for Concatenation automatically creates indexes where none existed #8871
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
~~New functions/methods are listed in api.rst~~

TomNicholas · 2024-03-25T05:21:35Z

All the tests in test_concat.py pass, but this change causes one failure in test_groupby.py:

FAILED xarray/tests/test_groupby.py::test_groupby_drops_nans - ValueError: no coordinate variables found for these indexes: {'id'}

benbovy · 2024-03-26T12:12:56Z

xarray/core/concat.py

-        result_indexes[dim] = index
-
-    # TODO: add indexes at Dataset creation (when it is supported)
-    result = result._overwrite_indexes(result_indexes)


Removing those lines doesn't break any existing test because the line above, result[dim] = index_vars[dim], actually re-creates a default PandasIndex when assigning the new dim variable. However, this unnecessarily creates a new index (or re-wraps an existing one), and it may stop working in the future if we allow passing a custom xarray index as the dim argument to concat.

It would be better to explicitly add both index and index_vars to result. The best way would be to assign them to result_indexes and coord_vars respectively before constructing the Coordinates object and then the result object, unless there are cases where result.drop_vars(unlabeled_dims) would delete the index coordinate.
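
The implicit index creation mentioned above can be seen in a small sketch. This is an illustration, not the concat internals: it only relies on xarray's usual behaviour of building a default index when a 1D variable named after its own dimension is assigned to a Dataset.

```python
import xarray as xr

ds = xr.Dataset({"a": ("x", [10, 20, 30])})
had_index_before = "x" in ds.xindexes  # no "x" coordinate yet, so no index

# Assigning a variable whose name matches its dimension turns it into a
# dimension coordinate, and a default pandas-backed index is built for it.
ds["x"] = ("x", [1, 2, 3])
has_index_after = "x" in ds.xindexes
index_type = type(ds.xindexes["x"]).__name__

print(had_index_before, has_index_after, index_type)
```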

Thanks @benbovy. I think your comment might explain the behaviour I just noticed in https://github.com/TomNicholas/VirtualiZarr/issues/18#issuecomment-2023955860

I think I've addressed your comment now @benbovy

TomNicholas · 2024-03-28T04:13:30Z

#8884 is now merged into this PR but all that should do is make the MergeError I'm currently seeing get emitted slightly earlier.

EDIT: Since reverted in light of #8886

This reverts commit beb665a.

This reverts commit 22f361d.

This reverts commit 55166fc.

TomNicholas · 2024-03-28T16:29:51Z

xarray/tests/test_concat.py


        assert np.issubdtype(actual.x2.dtype, dtype)

+    def test_concat_avoids_index_auto_creation(self) -> None:


Reminder to myself: this test didn't catch the problem described in https://github.com/TomNicholas/VirtualiZarr/issues/18#issuecomment-2023955860 because it checks that datasets that start without indexes stay without indexes after concatenation, whereas that problem comes from concatenating datasets that do have indexes and having their coordinate variables silently replaced by IndexVariable objects created from the index data.
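
For readers who want to try the first scenario themselves, here is a minimal sketch of the behaviour the test is meant to protect. It assumes an xarray release that includes this PR (earlier versions would auto-create an index on the result), and it uses plain NumPy arrays rather than the test suite's helper array classes; the helper function name is hypothetical.

```python
import numpy as np
import xarray as xr


def make_dataset_without_index(start):
    """Build a small dataset whose "x" coordinate has no index (hypothetical helper)."""
    values = np.arange(start, start + 3)
    coords = xr.Coordinates({"x": ("x", values)}, indexes={})
    return xr.Dataset({"a": ("x", values * 10)}, coords=coords)


datasets = [make_dataset_without_index(0), make_dataset_without_index(3)]

# Concatenate along the existing, un-indexed dimension.
combined = xr.concat(datasets, dim="x")

print(combined.sizes["x"])
print(len(combined.xindexes))
```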

TomNicholas · 2024-03-28T20:47:51Z

The groupby test is still failing:

FAILED xarray/tests/test_groupby.py::test_groupby_drops_nans - ValueError: no coordinate variables found for these indexes: {'id'}

This is weird: groupby appears to call concat (on line 1861 of groupby.py) to concatenate along dim='id', where id exists on the inputs as an index and as a data variable, but not as a coordinate variable. The Coordinates constructor I added inside concat then complains about this situation. I can't tell whether this is a bug in groupby's input to concat that I've exposed (i.e. it should have made 'id' a coord), or an edge case that concat should handle.

TomNicholas · 2024-03-28T20:51:37Z

xarray/core/concat.py

+    else:
+        if dim in result_data_vars:
+            coord_vars[dim] = result_data_vars[dim]
+            result_data_vars.pop(dim)


I was able to fix the groupby test failure by adding this piece of code, but I'm still not sure that I should have needed to add this. See #8872 (comment).
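
The logic of that snippet can be sketched in isolation with plain dictionaries standing in for xarray's variable mappings. The function name and sample values are hypothetical, and the commit list for this PR records the "possible fix for failing groupby test" as later reverted, so this shows the idea being discussed rather than the merged behaviour.

```python
def promote_dim_to_coord(dim, result_data_vars, coord_vars):
    """Move `dim` from the data variables into the coordinate variables.

    Plain dicts stand in for xarray's variable mappings; the inputs are
    copied, so the caller's dictionaries are left untouched.
    """
    data_vars = dict(result_data_vars)
    coords = dict(coord_vars)
    if dim in data_vars:
        # The concat dimension was only present as a data variable,
        # so treat it as a coordinate variable instead.
        coords[dim] = data_vars.pop(dim)
    return data_vars, coords


data_vars, coords = promote_dim_to_coord(
    "id",
    {"id": [1, 2, 2], "variable": [0.1, 0.2, 0.3]},
    {},
)
print(sorted(data_vars), sorted(coords))
```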

@dcherian your input here might be helpful.

> which it attempts to concatenate along dim='id', but id exists as an index and as a data variable on the inputs, but not as a coordinate variable.

This sounds like a bug in handling id elsewhere.

dcherian

LGTM

keewis

I can't spot anything important, either.

It looks like the parameter type spec nit has been undone, though.

TomNicholas · 2024-05-08T19:39:33Z

Thank you both! ❤️

* test not creating indexes on concatenation
* construct result dataset using Coordinates object with indexes passed explicitly
* remove unnecessary overwriting of indexes
* ConcatenatableArray class
* use ConcatenableArray in tests
* add regression tests
* fix by performing check
* refactor assert_valid_explicit_coords and rename dims->sizes
* Revert "add regression tests" This reverts commit beb665a.
* Revert "fix by performing check" This reverts commit 22f361d.
* Revert "refactor assert_valid_explicit_coords and rename dims->sizes" This reverts commit 55166fc.
* fix failing test
* possible fix for failing groupby test
* Revert "possible fix for failing groupby test" This reverts commit 6e9ead6.
* test expand_dims doesn't create Index
* add option to not create 1D index in expand_dims
* refactor tests to consider data variables and coordinate variables separately
* test expand_dims doesn't create Index
* add option to not create 1D index in expand_dims
* refactor tests to consider data variables and coordinate variables separately
* fix bug causing new test to fail
* test index auto-creation when iterable passed as new coordinate values
* make test for iterable pass
* added kwarg to dataarray
* whatsnew
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Revert "refactor tests to consider data variables and coordinate variables separately" This reverts commit ba5627e.
* Revert "add option to not create 1D index in expand_dims" This reverts commit 95d453c.
* test that concat doesn't raise if create_1d_index=False
* make test pass by passing create_1d_index down through concat
* assert that an UnexpectedDataAccess error is raised when create_1d_index=True
* eliminate possibility of xarray internals bypassing UnexpectedDataAccess error by accessing .array
* update tests to use private versions of assertions
* create_1d_index->create_index
* Update doc/whats-new.rst Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* Rename create_1d_index -> create_index
* fix ConcatenatableArray
* formatting
* whatsnew
* add new create_index kwarg to overloads
* split vars into data_vars and coord_vars in one loop
* avoid mypy error by using new variable name
* warn if create_index=True but no index created because dimension variable was a data var not a coord
* add string marks in warning message
* regression test for dtype changing in to_stacked_array
* correct doctest
* Remove outdated comment
* test we can skip creation of indexes during shape promotion
* make shape promotion test pass
* point to issue in whatsnew
* don't create dimension coordinates just to drop them at the end
* Remove ToDo about not using Coordinates object to pass indexes Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* get rid of unlabeled_dims variable entirely
* move ConcatenatableArray and similar to new file
* formatting nit Co-authored-by: Justus Magin <keewis@users.noreply.github.com>
* renamed create_index -> create_index_for_new_dim in concat
* renamed create_index -> create_index_for_new_dim in expand_dims
* fix incorrect arg name
* add example to docstring
* add example of using new kwarg to docstring of expand_dims
* add example of using new kwarg to docstring of concat
* re-nit the nit Co-authored-by: Justus Magin <keewis@users.noreply.github.com>
* more instances of the nit
* fix docstring doctest formatting nit

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

TomNicholas added 2 commits March 25, 2024 01:11

test not creating indexes on concatenation

22995e9

construct result dataset using Coordinates object with indexes passed…

7142c9f

… explicitly

TomNicholas requested a review from benbovy March 25, 2024 05:16

TomNicholas changed the title ~~Concat avoid index auto creation~~ Avoid auto creation of indexes in concat Mar 25, 2024

TomNicholas mentioned this pull request Mar 25, 2024

Concatenation automatically creates indexes where none existed #8871

Open

5 tasks

TomNicholas commented Mar 25, 2024

remove unnecessary overwriting of indexes

7fb075a

This was referenced Mar 25, 2024

Test concat of dimension coordinate not backed by an index zarr-developers/VirtualiZarr#44

Merged

Required changes in xarray to avoid creating indexes zarr-developers/VirtualiZarr#14

Closed

benbovy reviewed Mar 26, 2024

TomNicholas mentioned this pull request Mar 27, 2024

Inferring concatenation order from coordinate data values zarr-developers/VirtualiZarr#18

Closed

TomNicholas added 6 commits March 27, 2024 20:43

ConcatenatableArray class

285c1de

use ConcatenableArray in tests

cc24757

Merge branch 'main' into concat-avoid-index-auto-creation

90a2592

add regression tests

beb665a

fix by performing check

22f361d

refactor assert_valid_explicit_coords and rename dims->sizes

55166fc

TomNicholas mentioned this pull request Mar 28, 2024

Coordinates object permits invalid state #8883

Closed

5 tasks

Merge branch 'forbid_invalid_coordinates' into concat-avoid-index-aut…

322b76e

…o-creation

TomNicholas added 3 commits March 28, 2024 10:48

Revert "add regression tests"

da6692b

This reverts commit beb665a.

Revert "fix by performing check"

35dfb67

This reverts commit 22f361d.

Revert "refactor assert_valid_explicit_coords and rename dims->sizes"

fd3de2b

This reverts commit 55166fc.

TomNicholas commented Mar 28, 2024

TomNicholas added 2 commits March 28, 2024 14:44

Merge branch 'main' into concat-avoid-index-auto-creation

0a60172

fix failing test

21afbb1

possible fix for failing groupby test

6e9ead6

TomNicholas commented Mar 28, 2024

TomNicholas mentioned this pull request May 8, 2024

Stricter check for .array attribute #9016

Open

TomNicholas and others added 5 commits May 8, 2024 14:11

move ConcatenatableArray and similar to new file

6d825e5

formatting nit

b88b5a6

Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

Merge branch 'concat-avoid-index-auto-creation' of https://github.com…

30c7408

…/TomNicholas/xarray into concat-avoid-index-auto-creation

renamed create_index -> create_index_for_new_dim in concat

b243150

renamed create_index -> create_index_for_new_dim in expand_dims

9e9e168

dcherian approved these changes May 8, 2024

TomNicholas added 5 commits May 8, 2024 14:52

fix incorrect arg name

dca2fb9

add example to docstring

c979672

add example of using new kwarg to docstring of expand_dims

ac27ce0

add example of using new kwarg to docstring of concat

d73ac48

Merge branch 'main' into concat-avoid-index-auto-creation

9ebbb33

keewis approved these changes May 8, 2024

re-nit the nit

d1b656d

Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

keewis reviewed May 8, 2024

keewis and others added 3 commits May 8, 2024 21:05

more instances of the nit

ac998e9

fix docstring doctest formatting nit

0849b94

Merge branch 'concat-avoid-index-auto-creation' of https://github.com…

25764ca

…/TomNicholas/xarray into concat-avoid-index-auto-creation

TomNicholas enabled auto-merge (squash) May 8, 2024 19:11

TomNicholas merged commit 6057128 into pydata:main May 8, 2024

TomNicholas deleted the concat-avoid-index-auto-creation branch May 8, 2024 19:39

TomNicholas mentioned this pull request May 9, 2024

Adding reader_options kwargs to open_virtual_dataset. zarr-developers/VirtualiZarr#67

Merged

jsignell added a commit to jsignell/VirtualiZarr that referenced this pull request May 10, 2024

Install xarray from main now that [#8872](pydata/xarray#8872) has merged

eabdcdb

jsignell mentioned this pull request May 10, 2024

Install xarray from main zarr-developers/VirtualiZarr#106

Merged

TomNicholas pushed a commit to zarr-developers/VirtualiZarr that referenced this pull request May 10, 2024

Install xarray from main (#106)

c5c38fb

* Install xarray from main now that [#8872](pydata/xarray#8872) has merged * Remove note in docs

TomNicholas mentioned this pull request May 13, 2024

Depend on latest version of xarray zarr-developers/VirtualiZarr#109

Merged

