Zarr: Optimize `region="auto"` detection #8997
-                for k, v in self.get_variables().items()
-                if k in existing_variable_names
-            },
+            {k: self.open_store_variable(name=k) for k in existing_variable_names},
Just open the needed variables instead of opening all of them.
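To illustrate the point of this change, here is a minimal sketch using a hypothetical `CountingStore` class (not part of xarray) that records how many variables get opened. The method names mirror those in the diff, but the bodies are stand-ins, and the variable names are made up for the example.

```python
class CountingStore:
    """Hypothetical stand-in for a store where opening a variable is costly."""

    def __init__(self, names):
        self._names = list(names)
        self.opened = []  # records every variable that was actually opened

    def open_store_variable(self, name):
        self.opened.append(name)
        return f"<variable {name}>"

    def get_variables(self):
        # Opens every variable in the store, whether it is needed or not.
        return {k: self.open_store_variable(k) for k in self._names}


store = CountingStore(["foo", "bar", "baz", "x"])
existing_variable_names = {"foo", "x"}

# Old approach: open everything, then filter down to the names we need.
before = {
    k: v for k, v in store.get_variables().items() if k in existing_variable_names
}
opened_before = len(store.opened)

# New approach: open only the variables we need.
store.opened.clear()
after = {k: store.open_store_variable(name=k) for k in existing_variable_names}
opened_after = len(store.opened)
```

Both approaches produce the same mapping, but the second opens only the variables named in `existing_variable_names`.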
variable = conventions.decode_cf_variable(
    dim, self.open_store_variable(dim).compute()
)
assert variable.dims == (dim,)
index = pd.Index(variable.data)
idxs = index.get_indexer(ds[dim].data)
Lines 812-817: This is the main logic change.
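As a rough sketch of the idea in this hunk: build a `pd.Index` from the coordinate already stored on disk, look up the positions of the incoming labels with `get_indexer`, and turn the first and last positions into a slice. The helper name `detect_region` and the two validation checks are assumptions added for illustration; they are not taken from the xarray code shown here.

```python
import numpy as np
import pandas as pd


def detect_region(stored_labels, new_labels):
    """Return the positional slice of stored_labels covered by new_labels.

    Hypothetical helper for illustration; not the xarray implementation.
    """
    index = pd.Index(stored_labels)
    # get_indexer returns the position of each label, or -1 if it is missing.
    idxs = index.get_indexer(new_labels)
    # Assumed extra checks (not in the diff): reject missing labels and
    # label sets that do not form one contiguous, increasing block.
    if (idxs == -1).any():
        raise KeyError("some labels are not present in the stored coordinate")
    if (np.diff(idxs) != 1).any():
        raise ValueError("labels do not map to a contiguous, increasing region")
    return slice(int(idxs[0]), int(idxs[-1]) + 1)


stored_x = np.arange(0, 300, 10)  # 30 labels: 0, 10, ..., 290
chunk_x = stored_x[10:20]         # labels 100, 110, ..., 190
region = detect_region(stored_x, chunk_x)  # slice(10, 20)
```

Note that `pd.Index.get_indexer` requires the stored index to be unique, so this sketch only applies to coordinates without duplicate labels.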
xarray/backends/zarr.py (outdated)
            region[dim] = slice(idxs[0], idxs[-1] + 1)
        return region

    def _validate_and_autodetect_region(self, ds, region) -> dict[str, slice]:
directly copied over
@@ -1815,6 +1710,16 @@ def to_zarr(
         write_empty=write_empty_chunks,
     )

+    if region is not None:
Moved down so we only open the Zarr store once.
2acb6f4 to 7489aba
@@ -890,7 +890,7 @@ where the data should be written (in index space, not label space), e.g.,

 # For convenience, we'll slice a single dataset, but in the real use-case
 # we would create them separately possibly even from separate processes.
-ds = xr.Dataset({"foo": ("x", np.arange(30))})
+ds = xr.Dataset({"foo": ("x", np.arange(30))}, coords={"x": np.arange(30)})
 # Any of the following region specifications are valid
 ds.isel(x=slice(0, 10)).to_zarr(path, region="auto")
 ds.isel(x=slice(10, 20)).to_zarr(path, region={"x": "auto"})
This last line does not do what it looks like it's doing if there are no indexes!
Looks great!
Thanks for taking a look, Max!
* Zarr: Optimize region detection
* Fix for unindexed dimensions.
* Better example
* small cleanup
* main:
  Avoid auto creation of indexes in concat (#8872)
  Fix benchmark CI (#9013)
  Avoid extra read from disk when creating Pandas Index. (#8893)
  Add a benchmark to monitor performance for large dataset indexing (#9012)
  Zarr: Optimize `region="auto"` detection (#8997)
  Trigger CI only if code files are modified. (#9006)
  Fix for ruff 0.4.3 (#9007)
  Port negative frequency fix for `pandas.date_range` to `cftime_range` (#8999)
  Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (#9004)
  Speed up localize (#8536)
  Simplify fast path (#9001)
  Add argument check_dims to assert_allclose to allow transposed inputs (#5733) (#8991)
  Fix syntax error in test related to cupy (#9000)
* backend-indexing:
  Trigger CI only if code files are modified. (pydata#9006)
  Enable explicit use of key tuples (instead of *Indexer objects) in indexing adapters and explicitly indexed arrays (pydata#8870)
  add `.oindex` and `.vindex` to `BackendArray` (pydata#8885)
  temporary enable CI triggers on feature branch
  Avoid auto creation of indexes in concat (pydata#8872)
  Fix benchmark CI (pydata#9013)
  Avoid extra read from disk when creating Pandas Index. (pydata#8893)
  Add a benchmark to monitor performance for large dataset indexing (pydata#9012)
  Zarr: Optimize `region="auto"` detection (pydata#8997)
  Trigger CI only if code files are modified. (pydata#9006)
  Fix for ruff 0.4.3 (pydata#9007)
  Port negative frequency fix for `pandas.date_range` to `cftime_range` (pydata#8999)
  Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (pydata#9004)
  Speed up localize (pydata#8536)
  Simplify fast path (pydata#9001)
  Add argument check_dims to assert_allclose to allow transposed inputs (pydata#5733) (pydata#8991)
  Fix syntax error in test related to cupy (pydata#9000)
- Move the region detection code into `ZarrStore` so we only open the store once.
- Use `pd.Index` directly to "auto"-infer the region.

The diff is large mostly because a bunch of code moved from `backends/api.py` to `backends/zarr.py`.