Avoid auto creation of indexes in concat #8872
All the tests in FAILED xarray/tests/test_groupby.py::test_groupby_drops_nans - ValueError: no coordinate variables found for these indexes: {'id'}
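For context, a `ValueError` with this wording can be triggered directly: as far as I can tell, the `xarray.Coordinates` constructor raises it when `indexes` contains a name that has no matching coordinate variable. The sketch below assumes a recent xarray release; the `id` index and the `other` coordinate are made up for illustration, and it does not reproduce the groupby code path itself.

```python
import numpy as np
import pandas as pd
import xarray as xr
from xarray.indexes import PandasIndex

# Hypothetical index named "id", mirroring the name in the error above.
index = PandasIndex(pd.Index([1, 2, 3]), dim="id")

try:
    # "id" is listed in `indexes`, but no coordinate variable named "id" is passed.
    xr.Coordinates(coords={"other": ("id", np.arange(3))}, indexes={"id": index})
except ValueError as err:
    message = str(err)
else:
    message = None

print(message)
```

If the failing test hits the same check, the fix is to make sure every index handed over has its coordinate variable passed alongside it.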
xarray/core/concat.py
Outdated
result_indexes[dim] = index

# TODO: add indexes at Dataset creation (when it is supported)
result = result._overwrite_indexes(result_indexes)
Removing those lines doesn't break any existing test because the line above, `result[dim] = index_vars[dim]`, actually re-creates a default `PandasIndex` when assigning the new `dim` variable. However, this unnecessarily re-creates a new index (or re-wraps an existing one), and it may not work in the future if we allow passing a custom xarray index as the `dim` argument to `concat`.

It would be better to explicitly add both `index` and `index_vars` to `result`. The best way would be to assign them to `result_indexes` and `coord_vars` respectively before constructing the `Coordinates` object and then the `result` object, unless there are cases where `result.drop_vars(unlabeled_dims)` would delete the index coordinate.
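The implicit index creation described in this comment can be seen outside of `concat`. The following is a minimal sketch, assuming a recent xarray release; the dataset and variable names are invented for illustration and are not taken from `concat.py`.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"a": ("x", np.arange(3))})
has_index_before = "x" in ds.xindexes  # no coordinate named "x" yet, so no index

# Assigning a 1D variable named after its own dimension makes it a dimension
# coordinate, and xarray builds a default index for it as a side effect.
ds["x"] = ("x", np.array([10, 20, 30]))
index_type = type(ds.xindexes["x"]).__name__

print(has_index_before, index_type)
```

Because the index is built from the assigned values, any index object that already existed for that coordinate is not reused, which is the concern raised above.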
Thanks @benbovy. I think your comment might explain the behaviour I just noticed in zarr-developers/VirtualiZarr#18 (comment)
I think I've addressed your comment now @benbovy
@@ -978,6 +979,34 @@ def test_concat_str_dtype(self, dtype, dim) -> None:

        assert np.issubdtype(actual.x2.dtype, dtype)

    def test_concat_avoids_index_auto_creation(self) -> None:
Reminder to myself: this test didn't catch the problem described in zarr-developers/VirtualiZarr#18 (comment) because it checks that datasets that start without indexes stay without indexes after concatenation, whereas that problem comes from concatenating datasets with indexes and having the coordinate variables silently replaced by `IndexVariable` objects created from the index data.
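To make the "start without indexes, stay without indexes" case concrete, here is a rough sketch using plain NumPy data instead of the test's `ConcatenatableArray`. It assumes an xarray release that includes this PR; the second half uses the `create_index_for_new_dim` keyword named in the commit list, and the dataset names are illustrative.

```python
import numpy as np
import xarray as xr

# A coordinate "x" along dimension "x" with no index (empty `indexes` mapping).
coords = xr.Coordinates({"x": ("x", np.array([1, 2, 3]))}, indexes={})
unindexed = xr.Dataset(coords=coords)

# Concatenating along the existing, unindexed dimension should not add an index.
combined = xr.concat([unindexed, unindexed], dim="x")
combined_index_names = list(combined.xindexes)

# Concatenating along a new dimension built from a scalar coordinate:
# an index is created by default, and the keyword opts out of that.
scalar = xr.Dataset(coords={"x": 0})
default_index_names = list(xr.concat([scalar, scalar], dim="x").xindexes)
skipped_index_names = list(
    xr.concat([scalar, scalar], dim="x", create_index_for_new_dim=False).xindexes
)

print(combined_index_names, default_index_names, skipped_index_names)
```

Neither case covers inputs that already carry indexes, which is the gap this reminder points out.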
The groupby test is still failing:
FAILED xarray/tests/test_groupby.py::test_groupby_drops_nans - ValueError: no coordinate variables found for these indexes: {'id'}
This is weird, because it seems groupby creates a call to concat (on line 1861 of
Co-authored-by: Justus Magin <keewis@users.noreply.github.com>
…/TomNicholas/xarray into concat-avoid-index-auto-creation
LGTM
I can't spot anything important, either.
It looks like the parameter type spec nit has been undone, though.
Thank you both! ❤️
* test not creating indexes on concatenation
* construct result dataset using Coordinates object with indexes passed explicitly
* remove unnecessary overwriting of indexes
* ConcatenatableArray class
* use ConcatenableArray in tests
* add regression tests
* fix by performing check
* refactor assert_valid_explicit_coords and rename dims->sizes
* Revert "add regression tests"
  This reverts commit beb665a.
* Revert "fix by performing check"
  This reverts commit 22f361d.
* Revert "refactor assert_valid_explicit_coords and rename dims->sizes"
  This reverts commit 55166fc.
* fix failing test
* possible fix for failing groupby test
* Revert "possible fix for failing groupby test"
  This reverts commit 6e9ead6.
* test expand_dims doesn't create Index
* add option to not create 1D index in expand_dims
* refactor tests to consider data variables and coordinate variables separately
* test expand_dims doesn't create Index
* add option to not create 1D index in expand_dims
* refactor tests to consider data variables and coordinate variables separately
* fix bug causing new test to fail
* test index auto-creation when iterable passed as new coordinate values
* make test for iterable pass
* added kwarg to dataarray
* whatsnew
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Revert "refactor tests to consider data variables and coordinate variables separately"
  This reverts commit ba5627e.
* Revert "add option to not create 1D index in expand_dims"
  This reverts commit 95d453c.
* test that concat doesn't raise if create_1d_index=False
* make test pass by passing create_1d_index down through concat
* assert that an UnexpectedDataAccess error is raised when create_1d_index=True
* eliminate possibility of xarray internals bypassing UnexpectedDataAccess error by accessing .array
* update tests to use private versions of assertions
* create_1d_index->create_index
* Update doc/whats-new.rst
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* Rename create_1d_index -> create_index
* fix ConcatenatableArray
* formatting
* whatsnew
* add new create_index kwarg to overloads
* split vars into data_vars and coord_vars in one loop
* avoid mypy error by using new variable name
* warn if create_index=True but no index created because dimension variable was a data var not a coord
* add string marks in warning message
* regression test for dtype changing in to_stacked_array
* correct doctest
* Remove outdated comment
* test we can skip creation of indexes during shape promotion
* make shape promotion test pass
* point to issue in whatsnew
* don't create dimension coordinates just to drop them at the end
* Remove ToDo about not using Coordinates object to pass indexes
  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* get rid of unlabeled_dims variable entirely
* move ConcatenatableArray and similar to new file
* formatting nit
  Co-authored-by: Justus Magin <keewis@users.noreply.github.com>
* renamed create_index -> create_index_for_new_dim in concat
* renamed create_index -> create_index_for_new_dim in expand_dims
* fix incorrect arg name
* add example to docstring
* add example of using new kwarg to docstring of expand_dims
* add example of using new kwarg to docstring of concat
* re-nit the nit
  Co-authored-by: Justus Magin <keewis@users.noreply.github.com>
* more instances of the nit
* fix docstring doctest formatting nit

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Justus Magin <keewis@users.noreply.github.com>
* main:
  Avoid auto creation of indexes in concat (#8872)
  Fix benchmark CI (#9013)
  Avoid extra read from disk when creating Pandas Index. (#8893)
  Add a benchmark to monitor performance for large dataset indexing (#9012)
  Zarr: Optimize `region="auto"` detection (#8997)
  Trigger CI only if code files are modified. (#9006)
  Fix for ruff 0.4.3 (#9007)
  Port negative frequency fix for `pandas.date_range` to `cftime_range` (#8999)
  Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (#9004)
  Speed up localize (#8536)
  Simplify fast path (#9001)
  Add argument check_dims to assert_allclose to allow transposed inputs (#5733) (#8991)
  Fix syntax error in test related to cupy (#9000)
* backend-indexing:
  Trigger CI only if code files are modified. (pydata#9006)
  Enable explicit use of key tuples (instead of *Indexer objects) in indexing adapters and explicitly indexed arrays (pydata#8870)
  add `.oindex` and `.vindex` to `BackendArray` (pydata#8885)
  temporary enable CI triggers on feature branch
  Avoid auto creation of indexes in concat (pydata#8872)
  Fix benchmark CI (pydata#9013)
  Avoid extra read from disk when creating Pandas Index. (pydata#8893)
  Add a benchmark to monitor performance for large dataset indexing (pydata#9012)
  Zarr: Optimize `region="auto"` detection (pydata#8997)
  Trigger CI only if code files are modified. (pydata#9006)
  Fix for ruff 0.4.3 (pydata#9007)
  Port negative frequency fix for `pandas.date_range` to `cftime_range` (pydata#8999)
  Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (pydata#9004)
  Speed up localize (pydata#8536)
  Simplify fast path (pydata#9001)
  Add argument check_dims to assert_allclose to allow transposed inputs (pydata#5733) (pydata#8991)
  Fix syntax error in test related to cupy (pydata#9000)
* Install xarray from main now that [#8872](pydata/xarray#8872) has merged
* Remove note in docs
If we create a `Coordinates` object using the concatenated `result_indexes`, and pass that to the `Dataset` constructor, we can explicitly set the correct indexes from the start, instead of auto-creating the wrong ones and then trying to overwrite them with the correct indexes later (which is what the current implementation does).
`whats-new.rst`
New functions/methods are listed in `api.rst`
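The approach in the PR description (build a `Coordinates` object with explicit indexes, then pass it to the `Dataset` constructor) can be sketched with public xarray API. This is an illustration under assumptions rather than the code in `concat.py`: the `PandasIndex` here is a stand-in for an index produced by concatenation, and the names `result_indexes`, `coord_vars` and `result` are borrowed from the discussion only for readability.

```python
import numpy as np
import pandas as pd
import xarray as xr
from xarray.indexes import PandasIndex

# Stand-in for an index obtained by concatenating the inputs' indexes.
index = PandasIndex(pd.Index([10, 20, 30]), dim="x")
result_indexes = {"x": index}
coord_vars = index.create_variables()  # coordinate variable(s) backed by the index

# Indexes are passed explicitly, so no default index is auto-created and
# nothing needs to be overwritten afterwards.
coords = xr.Coordinates(coords=coord_vars, indexes=result_indexes)
result = xr.Dataset({"a": ("x", np.arange(3))}, coords=coords)

index_values = list(result.indexes["x"])
print(index_values)
```

Passing `indexes={}` instead would yield coordinates with no index at all, which is the route that lets `concat` leave unindexed inputs unindexed.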