Insertion speed of new dataset elements #7224

Open
hmaarrfk opened this issue Oct 26, 2022 · 3 comments

Comments

@hmaarrfk
Contributor

What is your issue?

In #7221 I showed that a major contributor to the slowdown in inserting a new element was the cost of an internal-only debugging assert statement.
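
(Not from the issue, just for context.) A minimal sketch of why an assert in a per-insert code path is measurable; check_invariant here is a made-up stand-in for an internal consistency check, and under python -O the assert is stripped so the two variants converge:

import timeit

def check_invariant(variables):
    # hypothetical stand-in for a consistency check that walks every variable
    return all(isinstance(name, str) for name in variables)

def insert_with_assert(variables, name, value):
    assert check_invariant(variables)  # re-validates the whole mapping on every insert
    variables[name] = value

def insert_plain(variables, name, value):
    variables[name] = value

variables = {f"var_{i}": i for i in range(1000)}
print(timeit.timeit(lambda: insert_with_assert(variables, "x", 0), number=10_000))
print(timeit.timeit(lambda: insert_plain(variables, "x", 0), number=10_000))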

The benchmark results in #7221 and #7222 are pretty useful to look at.

Thank you for encouraging the creation of a "benchmark" so that we can monitor the performance of element insertion.

Unfortunately, that was the only "free" lunch I got.

A few other minor improvements can be obtained with:
#7222

However, it seems to me that the fundamental reason this is "slow" is that element insertion is not so much an "insertion" as it is:

  • a Dataset merge
  • a replacement of the Dataset's internal state

This is really solidified in https://github.com/pydata/xarray/blob/main/xarray/core/dataset.py#L4918
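
(A rough sketch, not from the issue; the exact frame names depend on the xarray version.) Profiling a loop of insertions and filtering the stats for merge-related calls should show that most of the cumulative time under Dataset.__setitem__ is spent in the merge machinery:

import cProfile
import pstats

import numpy as np
import xarray as xr

ds = xr.Dataset(coords={"time": np.arange(1_000)})
values = np.zeros(1_000)

def insert_many(ds, n=500):
    # each assignment goes through a merge, not a plain dict insert
    for i in range(n):
        ds[f"var_{i}"] = ("time", values)

cProfile.run("insert_many(ds)", "insert.prof")
pstats.Stats("insert.prof").sort_stats("cumulative").print_stats("merge")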

In my benchmarks, I found that in the limit of large datasets, list comprehensions over 1000 or more elements were often used to "search" for variables that were "indexed", e.g. the one starting with:

indexed_elements = [

I think a few speedups can be obtained by avoiding these kinds of "searches" and list comprehensions. However, I think the dataset would have to provide this kind of information to the merge_core routine, instead of merge_core recreating it every time.

Ultimately, I think you trade a dataset's "memory footprint" (due to the extra data structures you keep around) against "speed".
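
(A purely illustrative sketch; the names below are hypothetical and not xarray internals.) The idea would be to replace the per-merge list-comprehension search with a set of indexed variable names that the dataset keeps up to date and hands to merge_core, at the cost of one extra data structure living on the dataset:

# today, something like this is re-derived from scratch on every insertion
def indexed_elements_scan(variables, indexes):
    return [(name, var) for name, var in variables.items() if name in indexes]

# whereas the dataset could maintain the set incrementally and pass it along,
# trading a little memory for per-insert speed
class IndexedNameCache:
    def __init__(self):
        self._indexed = set()  # extra state kept alive on the dataset

    def record(self, name, is_indexed):
        if is_indexed:
            self._indexed.add(name)

    def indexed_elements(self, variables):
        return [(name, var) for name, var in variables.items() if name in self._indexed]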

Anyway, I just wanted to share how far I got.

@hmaarrfk added the needs triage label Oct 26, 2022
@TomNicholas added the topic-performance label and removed the needs triage label Oct 26, 2022
@Illviljan
Contributor

I have thought a little about this as well and went back and looked at my old code. Building a data_vars dict with the data first and then creating the Dataset at the end seems to be the way to go:

import numpy as np
import xarray as xr
from time import perf_counter


# %% Inputs
names = np.char.add("long_variable_name", np.arange(0, 100).astype(str))
time = np.array([0, 1])
coords = dict(time=time)
value = np.array(["0", "b"], dtype=str)

# %% Insert to Dataset with DataArray:
time_start = perf_counter()

ds = xr.Dataset(coords=coords)
for v in names:
    ds[v] = xr.DataArray(data=value, coords=coords)

time_end = perf_counter()
time_elapsed = time_end - time_start
print("Insert to Dataset with DataArray:", time_elapsed)

# %% Insert to Dataset with Variable:
time_start = perf_counter()

ds = xr.Dataset(coords=coords)
for v in names:
    ds[v] = xr.Variable("time", value)

time_end = perf_counter()
time_elapsed = time_end - time_start
print("Insert to Dataset with Variable:", time_elapsed)

# %% Insert to Dataset with tuple:
time_start = perf_counter()

ds = xr.Dataset(coords=coords)
for v in names:
    ds[v] = ("time", value)

time_end = perf_counter()
time_elapsed = time_end - time_start
print("Insert to Dataset with tuple:", time_elapsed)


# %% Dict of DataArray then create Dataset:
time_start = perf_counter()

data_vars = dict()
for v in names:
    data_vars[v] = xr.DataArray(data=value, coords=coords)
ds = xr.Dataset(data_vars=data_vars, coords=coords)

time_end = perf_counter()
time_elapsed = time_end - time_start
print("Dict of DataArrays then create Dataset:", time_elapsed)

# %% Dict of Variables then create Dataset:
time_start = perf_counter()

data_vars = dict()
for v in names:
    data_vars[v] = xr.Variable("time", value)
ds = xr.Dataset(data_vars=data_vars, coords=coords)

time_end = perf_counter()
time_elapsed = time_end - time_start
print("Dict of Variables then create Dataset:", time_elapsed)

# %% Dict of tuples then create Dataset:
time_start = perf_counter()

data_vars = dict()
for v in names:
    data_vars[v] = ("time", value)
ds = xr.Dataset(data_vars=data_vars, coords=coords)

time_end = perf_counter()
time_elapsed = time_end - time_start
print("Dict of tuples then create Dataset:", time_elapsed)

Output:

Insert to Dataset with DataArray: 0.3787728999996034
Insert to Dataset with Variable: 0.3083788999997523
Insert to Dataset with tuple: 0.30018929999960164
Dict of DataArrays then create Dataset: 0.07277609999982815
Dict of Variables then create Dataset: 0.005166500000086671
Dict of tuples then create Dataset: 0.003186699999787379  # Winner! :)

hmaarrfk added commits to hmaarrfk/xarray that referenced this issue Oct 29, 2022
@hmaarrfk
Contributor Author

Ok, I don't think I have the right tools to really get to the bottom of this. The Spyder profiler just seems to slow down the code too much. Any other tools to recommend?
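
(A suggestion, not from the thread.) Two lower-overhead options are py-spy, a sampling profiler that attaches from outside the process (e.g. py-spy record -o profile.svg -- python your_script.py), and the standard library's cProfile scoped to just the hot section so setup costs don't drown the signal:

import cProfile
import pstats

import numpy as np
import xarray as xr

ds = xr.Dataset(coords={"time": np.arange(1_000)})

# profile only the insertion loop (the context manager form needs Python 3.8+)
with cProfile.Profile() as profiler:
    for i in range(500):
        ds[f"var_{i}"] = ("time", np.zeros(1_000))

pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)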

@hmaarrfk
Contributor Author

xref: pandas-dev/pandas#49393

Illviljan added a commit that referenced this issue Oct 31, 2022
* Expand benchmarks for dataset insertion and creation

Taken from discussions in #7224 (comment)

Thank you @Illviljan
