Conversation
max-sixty reviewed on May 5, 2024
xarray/core/variable.py (Outdated)
if fastpath and getattr(data, "ndim", 0) > 0:
    # can't use fastpath (yet) for scalars
    return cast("T_DuckArray", _maybe_wrap_data(data))
ndim = getattr(data, "ndim", None)
Collaborator
A small point about avoiding _maybe_wrap_data: wouldn't we only want to get ndim if fastpath is True? Then we could just change the comparison to 0 into a comparison to None on L272?
Contributor (Author)
Thank you for the thorough and quick review. You are correct.
Contributor
Looks like both pd.Index and ExtensionArray define ndim, so this should be OK. It'd be nice to add this as a comment.
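Not part of the diff above, just a quick self-contained check of that claim: pd.Index, pandas ExtensionArray, and numpy arrays all expose `ndim`, so an `is not None` test (rather than `> 0`) admits them to the fast path while plain Python containers fall through. The helper name below is made up for illustration.

```python
import numpy as np
import pandas as pd


def takes_fast_path(data, fastpath: bool = True) -> bool:
    # Only look up ndim when the fast path is requested, and compare against
    # None rather than 0 so 0-d (scalar) duck arrays also qualify.
    return fastpath and getattr(data, "ndim", None) is not None


assert takes_fast_path(np.arange(3))
assert takes_fast_path(pd.Index([1, 2, 3]))
assert takes_fast_path(pd.array([1, 2, 3]))  # a pandas ExtensionArray
assert not takes_fast_path([1, 2, 3])        # plain list: no ndim attribute
```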
Collaborator
Merging pending tests. I'm not 100% on the implications of skipping
hmaarrfk added a commit to hmaarrfk/xarray that referenced this pull request on May 6, 2024
This targets optimization for datasets with many "scalar" variables (that is, variables without any dimensions). This can happen when you have many pieces of small metadata that relate to various facts about an experimental condition. For example, we have about 80 of these in our datasets (and I want to increase this number).

Our datasets are quite large (on the order of 1 TB uncompressed), so we often have one dimension that is in the tens of thousands. However, it has become quite slow to index into the dataset. We therefore often "carefully slice out the metadata we need" prior to doing anything with our dataset, but that isn't quite possible when you want to orchestrate things with a parent application.

These optimizations are likely "minor", but considering the results of the benchmark, I think they are quite worthwhile:

* main (as of pydata#9001): 2.5k its/s
* with pydata#9002: 4.2k its/s
* with this pull request (on top of pydata#9002): 6.1k its/s

Thanks for considering.
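Not the script that produced the numbers above; a rough, hypothetical reproduction of the setup it describes (one long dimension plus many dimensionless metadata variables, timed on repeated positional indexing):

```python
import timeit

import numpy as np
import xarray as xr

# One long dimension plus many 0-d "metadata" variables, loosely mirroring
# the dataset layout described in the commit message above.
n_scalars = 80
ds = xr.Dataset(
    data_vars={
        "data": ("time", np.random.rand(50_000)),
        **{f"meta_{i}": ((), float(i)) for i in range(n_scalars)},
    },
    coords={"time": np.arange(50_000)},
)

# Time repeated scalar indexing; absolute numbers will differ from the
# its/s figures quoted above.
n = 1_000
seconds = timeit.timeit(lambda: ds.isel(time=5), number=n)
print(f"{n / seconds:.0f} isel calls per second")
```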
hmaarrfk added a commit to hmaarrfk/xarray that referenced this pull request on May 6, 2024
hmaarrfk added a commit to hmaarrfk/xarray that referenced this pull request on May 6, 2024
andersy005 pushed a commit that referenced this pull request on May 10, 2024
andersy005 added a commit that referenced this pull request on May 10, 2024
* main:
  Avoid auto creation of indexes in concat (#8872)
  Fix benchmark CI (#9013)
  Avoid extra read from disk when creating Pandas Index. (#8893)
  Add a benchmark to monitor performance for large dataset indexing (#9012)
  Zarr: Optimize `region="auto"` detection (#8997)
  Trigger CI only if code files are modified. (#9006)
  Fix for ruff 0.4.3 (#9007)
  Port negative frequency fix for `pandas.date_range` to `cftime_range` (#8999)
  Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (#9004)
  Speed up localize (#8536)
  Simplify fast path (#9001)
  Add argument check_dims to assert_allclose to allow transposed inputs (#5733) (#8991)
  Fix syntax error in test related to cupy (#9000)
andersy005 added a commit to hmaarrfk/xarray that referenced this pull request on May 10, 2024
* backend-indexing:
  Trigger CI only if code files are modified. (pydata#9006)
  Enable explicit use of key tuples (instead of *Indexer objects) in indexing adapters and explicitly indexed arrays (pydata#8870)
  add `.oindex` and `.vindex` to `BackendArray` (pydata#8885)
  temporary enable CI triggers on feature branch
  Avoid auto creation of indexes in concat (pydata#8872)
  Fix benchmark CI (pydata#9013)
  Avoid extra read from disk when creating Pandas Index. (pydata#8893)
  Add a benchmark to monitor performance for large dataset indexing (pydata#9012)
  Zarr: Optimize `region="auto"` detection (pydata#8997)
  Trigger CI only if code files are modified. (pydata#9006)
  Fix for ruff 0.4.3 (pydata#9007)
  Port negative frequency fix for `pandas.date_range` to `cftime_range` (pydata#8999)
  Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (pydata#9004)
  Speed up localize (pydata#8536)
  Simplify fast path (pydata#9001)
  Add argument check_dims to assert_allclose to allow transposed inputs (pydata#5733) (pydata#8991)
  Fix syntax error in test related to cupy (pydata#9000)
hmaarrfk added a commit to hmaarrfk/xarray that referenced this pull request on Jun 12, 2024
hmaarrfk added a commit to hmaarrfk/xarray that referenced this pull request on Jun 12, 2024
hmaarrfk added a commit to hmaarrfk/xarray that referenced this pull request on Jun 19, 2024
hmaarrfk added a commit to hmaarrfk/xarray that referenced this pull request on Jun 22, 2024
I noticed that slicing into datetime64[ns] arrays was really starting to limit our overall performance. What is strangest is that it is faster to do it on a lazy array than on an in-memory array.
I think it comes down to the fact that the code on the main branch triggers xarray/core/variable.py, line 220 (as of aaa778c), which does a numpy -> pandas -> numpy conversion for safety. But that seems a little over the top.
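The exact code behind that line isn't reproduced in this thread; the general shape of the round trip being described is something like the sketch below (illustrative only, not xarray's implementation), which is pure overhead when the data is already a well-formed datetime64[ns] array.

```python
import numpy as np
import pandas as pd

# A large, already well-formed datetime64[ns] array.
values = np.datetime64("2024-01-01", "ns") + np.arange(1_000_000) * np.timedelta64(1, "s")

# The kind of numpy -> pandas -> numpy "safety" conversion described above.
round_tripped = np.asarray(pd.Series(values.ravel())).reshape(values.shape)

assert round_tripped.dtype == values.dtype  # same data and dtype, extra work
```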
My benchmark of interest is:
This often occurs when we slice into our larger datasets.
on main
on this branch:
The benchmark
I can try to expand the benchmark, though it is difficult to "see" these slowdowns with toy examples sometimes.
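For what it's worth, a toy example along those lines (not the attached benchmark; the variable name is a placeholder borrowed from the attachment below):

```python
import timeit

import numpy as np
import xarray as xr

# A dataset with one large datetime64[ns] variable, sliced repeatedly.
times = np.datetime64("2024-01-01", "ns") + np.arange(100_000) * np.timedelta64(1, "s")
ds = xr.Dataset(
    {"software_timestamp": ("time", times)},
    coords={"time": np.arange(100_000)},
)

n = 1_000
seconds = timeit.timeit(lambda: ds.isel(time=slice(0, 10)), number=n)
print(f"{n / seconds:.0f} slices per second")
```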
software_timestamp.zip
whats-new.rst
api.rst
xref: #2799
xref: #7045