Insertion speed of new dataset elements #7224
hmaarrfk added the `needs triage` label (issue that has not been reviewed by an xarray team member) on Oct 26, 2022.
TomNicholas added the `topic-performance` label and removed the `needs triage` label on Oct 26, 2022.
I have thought a little about this as well and went and looked in my old code:

```python
import numpy as np
import xarray as xr
from time import perf_counter

# %% Inputs
names = np.core.defchararray.add("long_variable_name", np.arange(0, 100).astype(str))
time = np.array([0, 1])
coords = dict(time=time)
value = np.array(["0", "b"], dtype=str)

# %% Insert to Dataset with DataArray:
time_start = perf_counter()
ds = xr.Dataset(coords=coords)
for v in names:
    ds[v] = xr.DataArray(data=value, coords=coords)
time_end = perf_counter()
time_elapsed = time_end - time_start
print("Insert to Dataset with DataArray:", time_elapsed)

# %% Insert to Dataset with Variable:
time_start = perf_counter()
ds = xr.Dataset(coords=coords)
for v in names:
    ds[v] = xr.Variable("time", value)
time_end = perf_counter()
time_elapsed = time_end - time_start
print("Insert to Dataset with Variable:", time_elapsed)

# %% Insert to Dataset with tuple:
time_start = perf_counter()
ds = xr.Dataset(coords=coords)
for v in names:
    ds[v] = ("time", value)
time_end = perf_counter()
time_elapsed = time_end - time_start
print("Insert to Dataset with tuple:", time_elapsed)

# %% Dict of DataArrays then create Dataset:
time_start = perf_counter()
data_vars = dict()
for v in names:
    data_vars[v] = xr.DataArray(data=value, coords=coords)
ds = xr.Dataset(data_vars=data_vars, coords=coords)
time_end = perf_counter()
time_elapsed = time_end - time_start
print("Dict of DataArrays then create Dataset:", time_elapsed)

# %% Dict of Variables then create Dataset:
time_start = perf_counter()
data_vars = dict()
for v in names:
    data_vars[v] = xr.Variable("time", value)
ds = xr.Dataset(data_vars=data_vars, coords=coords)
time_end = perf_counter()
time_elapsed = time_end - time_start
print("Dict of Variables then create Dataset:", time_elapsed)

# %% Dict of tuples then create Dataset:
time_start = perf_counter()
data_vars = dict()
for v in names:
    data_vars[v] = ("time", value)
ds = xr.Dataset(data_vars=data_vars, coords=coords)
time_end = perf_counter()
time_elapsed = time_end - time_start
print("Dict of tuples then create Dataset:", time_elapsed)
```

Output:

```
Insert to Dataset with DataArray: 0.3787728999996034
Insert to Dataset with Variable: 0.3083788999997523
Insert to Dataset with tuple: 0.30018929999960164
Dict of DataArrays then create Dataset: 0.07277609999982815
Dict of Variables then create Dataset: 0.005166500000086671
Dict of tuples then create Dataset: 0.003186699999787379  # Winner! :)
```
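As a follow-up usage note (my sketch, not part of the original comment): when the variables have to go into an already existing Dataset rather than being available up front, batching them into a single call still avoids paying the per-insertion merge cost once per variable. Assuming the same `names`, `coords`, and `value` as above:

```python
# Build all new variables first, then hand them to xarray in one call, so the
# internal merge machinery runs once instead of once per variable.
new_vars = {v: ("time", value) for v in names}

# Either create the Dataset directly (the winning case above) ...
ds = xr.Dataset(data_vars=new_vars, coords=coords)

# ... or, if a Dataset already exists, add them in one batched assignment.
ds = xr.Dataset(coords=coords)
ds = ds.assign(new_vars)
```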
hmaarrfk added several commits to hmaarrfk/xarray that referenced this issue on Oct 29, 2022 ("Taken from discussions in pydata#7224 (comment). Thank you @Illviljan").
Ok, I don't think I have the right tools to really get to the bottom of this. The Spyder profiler just seems to slow the code down too much. Any other tools to recommend?
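(Not from the original thread, but as an illustration of one option:) the standard-library cProfile can show where time goes inside the insertion loop without an IDE in the way, and a sampling profiler such as py-spy adds very little overhead. A minimal cProfile sketch, reusing the shapes from the benchmark above:

```python
import cProfile
import pstats

import numpy as np
import xarray as xr

# Hypothetical reproduction of the slow path: repeated single-variable insertion.
names = [f"long_variable_name{i}" for i in range(100)]
coords = dict(time=np.array([0, 1]))
value = np.array(["0", "b"], dtype=str)

def insert_one_by_one():
    ds = xr.Dataset(coords=coords)
    for v in names:
        ds[v] = ("time", value)
    return ds

profiler = cProfile.Profile()
profiler.enable()
insert_one_by_one()
profiler.disable()

# Print the 20 most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```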
xref: pandas-dev/pandas#49393
Illviljan added a commit that referenced this issue on Oct 31, 2022: "Expand benchmarks for dataset insertion and creation", taken from discussions in #7224 (comment), with suggestions from code review (co-authored by Illviljan and pre-commit-ci[bot]).
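As a rough illustration only (not the actual file added by that commit), an asv benchmark for this insertion pattern in `asv_bench/benchmarks/merge.py` might look something like the following; the class, parameter names, and sizes here are assumptions for the sketch:

```python
import numpy as np
import xarray as xr

class DatasetAddVariable:
    # Parameterize over how many variables already exist in the Dataset.
    param_names = ["existing_elements"]
    params = [[0, 10, 100, 1000]]

    def setup(self, existing_elements):
        data_vars = {
            f"var{i}": ("time", np.array([1, 2])) for i in range(existing_elements)
        }
        self.dataset = xr.Dataset(data_vars, coords={"time": np.array([0, 1])})

    def time_variable_insertion(self, existing_elements):
        # Measure the cost of inserting one more variable into an existing Dataset.
        self.dataset["new_var"] = ("time", np.array([1, 2]))
```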
What is your issue?
In #7221 I showed that a major contributor to the slowdown when inserting a new element was the cost of an internal-only debugging assert statement.
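As a rough illustration of why such a check matters (my example, not xarray's actual assertion): an assert whose cost scales with the number of variables runs on every insertion under normal Python, but is compiled out entirely when the interpreter runs with `-O`.

```python
import timeit

variables = {f"long_variable_name{i}": i for i in range(1000)}

def insert_with_check(mapping, name):
    mapping[name] = 0
    # Stand-in consistency check: cost grows with the number of variables.
    # Under `python -O`, this assert statement is stripped at compile time.
    assert all(isinstance(k, str) for k in mapping)

print(timeit.timeit(lambda: insert_with_check(variables, "new_var"), number=1_000))
```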
The benchmark results in #7221 and #7222 are pretty useful to look at.
Thank you for encouraging the creation of a "benchmark" so that we can monitor the performance of element insertion.
Unfortunately, that was the only "free" lunch I got.
A few other minor improvements can be obtained with #7222.
However, it seems to me that the fundamental reason this is "slow" is that element insertion is not so much "insertion" as it is a full re-merge and rebuild of the dataset.
This is really apparent in the implementation at https://github.com/pydata/xarray/blob/main/xarray/core/dataset.py#L4918
In my benchmarks, I found that for large datasets, list comprehensions over 1000 or more elements were often used to "search" for variables that were "indexed" (see xarray/core/merge.py, line 267 at ca57e5c).
I think a few speedups can be obtained by avoiding these kinds of "searches" and list comprehensions. However, the dataset would have to provide this kind of information to the `merge_core` routine, instead of `merge_core` recreating it every time. Ultimately, I think you trade off the "memory footprint" of a dataset (due to the extra data structures you keep around) against "speed".
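A minimal sketch of that trade-off (hypothetical names only, not real xarray internals): keep a cached set of "indexed" variable names on the container, updated at insertion time, instead of re-deriving it with a list comprehension on every merge.

```python
from dataclasses import dataclass, field

@dataclass
class VariableStore:
    variables: dict = field(default_factory=dict)
    indexes: dict = field(default_factory=dict)
    _indexed_names: set = field(default_factory=set)  # extra memory, kept in sync

    def add(self, name, variable, is_indexed=False):
        self.variables[name] = variable
        if is_indexed:
            self.indexes[name] = variable
            self._indexed_names.add(name)  # O(1) bookkeeping per insertion

    def indexed_names(self):
        # O(1) lookup instead of re-scanning all variables on every merge, e.g.
        # [n for n in self.variables if n in self.indexes]
        return self._indexed_names
```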
Anyway, I just wanted to share where I got.