
Micro optimize dataset.isel for speed on large datasets
This targets optimization for datasets with many "scalar" variables
(that is, variables without any dimensions). This can happen when you
have many pieces of small metadata that describe various facts about an
experimental condition.
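
For instance, a dimensionless variable looks like this (a minimal
illustration; the names are made up):

```python
import xarray as xr

# "Scalar" variables: no dimensions, just single pieces of metadata.
ds = xr.Dataset({"exposure_ms": ((), 12.5), "operator": ((), "alice")})
print(ds["exposure_ms"].dims)  # () -> an empty tuple, i.e. zero dimensions
```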

For example, we have about 80 of these in our datasets (and I want to
increase this number).

Our datasets are quite large (on the order of 1 TB uncompressed), so we
often have one dimension that is in the tens of thousands.

However, indexing into the dataset has become quite slow.

We therefore often "carefully slice out the metadata we need" before
doing anything with our dataset, but that isn't quite possible when you
want to orchestrate things from a parent application.

These optimizations are likely "minor", but considering the benchmark
results, I think they are quite worthwhile (a sketch of the kind of
workload measured follows the list):

* main (as of pydata#9001) - 2.5k its/s
* with pydata#9002 - 4.2k its/s
* with this pull request (on top of pydata#9002) - 6.1k its/s
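
A minimal sketch of the kind of workload being measured (an illustrative
reconstruction, not the actual benchmark; the variable names and sizes are
assumptions based on the description above):

```python
import timeit

import numpy as np
import xarray as xr

# Hypothetical dataset shaped like the one described above: many
# dimensionless metadata variables plus one long dimension.
ds = xr.Dataset(
    data_vars={f"meta_{i}": ((), i) for i in range(80)},
    coords={"time": np.arange(50_000)},
)
ds["signal"] = ("time", np.random.rand(50_000))

# The hot path: repeated positional indexing along the long dimension.
n = 1_000
elapsed = timeit.timeit(lambda: ds.isel(time=0), number=n)
print(f"{n / elapsed:.0f} isel calls per second")
```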

Thanks for considering.
hmaarrfk committed May 6, 2024
1 parent 50f8726 commit 472b66f
1 changed file: xarray/core/dataset.py (19 additions, 4 deletions)
@@ -2980,20 +2980,35 @@ def isel(
         coord_names = self._coord_names.copy()
 
         indexes, index_variables = isel_indexes(self.xindexes, indexers)
+        all_keys = set(indexers.keys())
 
         for name, var in self._variables.items():
             # preserve variable order
             if name in index_variables:
                 var = index_variables[name]
-            else:
-                var_indexers = {k: v for k, v in indexers.items() if k in var.dims}
-                if var_indexers:
+                dims.update(zip(var.dims, var.shape))
+            # Fast path: skip all of this for variables with no dimensions.
+            # Keep the result cached for the dictionary update below.
+            elif var_dims := var.dims:
+                # Large datasets with a lot of metadata will have many scalars
+                # without any relevant dimensions for slicing.
+                # Pick those out quickly.
+                # Very likely many variables will not interact with the keys
+                # at all; avoid iterating through them.
+                var_indexer_keys = all_keys.intersection(var_dims)
+                if var_indexer_keys:
+                    var_indexers = {
+                        k: indexers[k]
+                        for k in var_indexer_keys
+                    }
                     var = var.isel(var_indexers)
                     if drop and var.ndim == 0 and name in coord_names:
                         coord_names.remove(name)
                         continue
+                # Update after slicing.
+                var_dims = var.dims
+                dims.update(zip(var_dims, var.shape))
             variables[name] = var
-            dims.update(zip(var.dims, var.shape))
 
         return self._construct_direct(
             variables=variables,
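
To illustrate why the fast path pays off, here is a standalone
micro-comparison (hypothetical code, not part of the commit) of the old
per-variable dict comprehension against the new precomputed-set
intersection, as seen by a dimensionless variable:

```python
import timeit

indexers = {"time": slice(0, 10)}
all_keys = set(indexers)

# A "scalar" variable has an empty dims tuple.
scalar_dims: tuple[str, ...] = ()

def old_path():
    # Old code: build a dict comprehension over all indexers for
    # every variable, even one with no dimensions.
    var_indexers = {k: v for k, v in indexers.items() if k in scalar_dims}
    return bool(var_indexers)

def new_path():
    # New code: bail out immediately for dimensionless variables;
    # otherwise intersect a precomputed key set with the dims.
    if not scalar_dims:
        return False
    return bool(all_keys.intersection(scalar_dims))

print("old:", timeit.timeit(old_path, number=1_000_000))
print("new:", timeit.timeit(new_path, number=1_000_000))
```

The saving per variable is tiny, but with roughly 80 scalar variables it
is paid on every isel call.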
