Skip to content

Conversation

@dcherian
Copy link
Contributor

@dcherian dcherian commented May 3, 2024

  1. This moves the region detection code into ZarrStore so we only open the store once.
  2. Instead of opening the store as a dataset, construct a pd.Index directly to "auto"-infer the region.

The diff is large mostly because a bunch of code moved from backends/api.py to backends/zarr.py

for k, v in self.get_variables().items()
if k in existing_variable_names
},
{k: self.open_store_variable(name=k) for k in existing_variable_names},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just open the needed variables instead of opening all of them.

Comment on lines +807 to +812
variable = conventions.decode_cf_variable(
dim, self.open_store_variable(dim).compute()
)
assert variable.dims == (dim,)
index = pd.Index(variable.data)
idxs = index.get_indexer(ds[dim].data)
Copy link
Contributor Author

@dcherian dcherian May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 812-817: This is the main logic change.

region[dim] = slice(idxs[0], idxs[-1] + 1)
return region

def _validate_and_autodetect_region(self, ds, region) -> dict[str, slice]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

directly copied over

write_empty=write_empty_chunks,
)

if region is not None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved down so we only open the Zarr store once.

@dcherian dcherian force-pushed the cleanup-zarr-region branch from 2acb6f4 to 7489aba Compare May 3, 2024 22:16
ds = xr.Dataset({"foo": ("x", np.arange(30))}, coords={"x": np.arange(30)})
# Any of the following region specifications are valid
ds.isel(x=slice(0, 10)).to_zarr(path, region="auto")
ds.isel(x=slice(10, 20)).to_zarr(path, region={"x": "auto"})
Copy link
Contributor Author

@dcherian dcherian May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This last line does not do what it looks like it's doing if there are no indexes!

@dcherian dcherian changed the title Zarr: Optimize region detection Zarr: Optimize region="auto" detection May 3, 2024
@max-sixty
Copy link
Collaborator

Looks great!

@dcherian dcherian added the plan to merge Final call for comments label May 7, 2024
@dcherian
Copy link
Contributor Author

dcherian commented May 7, 2024

Thanks for taking a look, Max!

@dcherian dcherian merged commit dcf2ac4 into pydata:main May 7, 2024
andersy005 pushed a commit that referenced this pull request May 10, 2024
* Zarr: Optimize region detection

* Fix for unindexed dimensions.

* Better example

* small cleanup
andersy005 added a commit that referenced this pull request May 10, 2024
* main:
  Avoid auto creation of indexes in concat (#8872)
  Fix benchmark CI (#9013)
  Avoid extra read from disk when creating Pandas Index. (#8893)
  Add a benchmark to monitor performance for large dataset indexing (#9012)
  Zarr: Optimize `region="auto"` detection (#8997)
  Trigger CI only if code files are modified. (#9006)
  Fix for ruff 0.4.3 (#9007)
  Port negative frequency fix for `pandas.date_range` to `cftime_range` (#8999)
  Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (#9004)
  Speed up localize (#8536)
  Simplify fast path (#9001)
  Add argument check_dims to assert_allclose to allow transposed inputs (#5733) (#8991)
  Fix syntax error in test related to cupy (#9000)
andersy005 added a commit to hmaarrfk/xarray that referenced this pull request May 10, 2024
* backend-indexing:
  Trigger CI only if code files are modified. (pydata#9006)
  Enable explicit use of key tuples (instead of *Indexer objects) in indexing adapters and explicitly indexed arrays (pydata#8870)
  add `.oindex` and `.vindex` to `BackendArray` (pydata#8885)
  temporary enable CI triggers on feature branch
  Avoid auto creation of indexes in concat (pydata#8872)
  Fix benchmark CI (pydata#9013)
  Avoid extra read from disk when creating Pandas Index. (pydata#8893)
  Add a benchmark to monitor performance for large dataset indexing (pydata#9012)
  Zarr: Optimize `region="auto"` detection (pydata#8997)
  Trigger CI only if code files are modified. (pydata#9006)
  Fix for ruff 0.4.3 (pydata#9007)
  Port negative frequency fix for `pandas.date_range` to `cftime_range` (pydata#8999)
  Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (pydata#9004)
  Speed up localize (pydata#8536)
  Simplify fast path (pydata#9001)
  Add argument check_dims to assert_allclose to allow transposed inputs (pydata#5733) (pydata#8991)
  Fix syntax error in test related to cupy (pydata#9000)
@dcherian dcherian deleted the cleanup-zarr-region branch July 11, 2024 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

plan to merge Final call for comments

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants