
Insertion speed of new dataset elements #7224

Open
@hmaarrfk

Description


What is your issue?

In #7221 I showed that a major contributor to the slowdown in inserting a new element was the cost associated with an internal-only debugging assert statement.

The benchmark results from #7221 and #7222 are pretty useful to look at.

Thank you for encouraging the creation of a "benchmark" so that we can monitor the performance of element insertion.

Unfortunately, that was the only "free" lunch I got.

A few other minor improvements can be obtained with:
#7222

However, it seems to me that the fundamental reason this is "slow" is that element insertion is not so much "insertion" as it is:

  • a dataset merge, followed by
  • a replacement of the dataset's internal state via its internal methods.

This is really apparent in https://github.com/pydata/xarray/blob/main/xarray/core/dataset.py#L4918.
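A minimal sketch (not from the issue) of what this implies in practice: since each assignment runs the full merge/replace path described above, inserting variables one at a time repeats that cost per element, whereas building the new variables separately and merging once pays it a single time.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(coords={"time": np.arange(1000)})

# One merge per assignment: each __setitem__ goes through the merge machinery,
# so the dataset's internals are rebuilt on every iteration.
for i in range(100):
    ds[f"var_{i}"] = ("time", np.random.rand(1000))

# Versus constructing the new variables separately and merging once.
ds2 = xr.Dataset(coords={"time": np.arange(1000)})
new_vars = xr.Dataset({f"var_{i}": ("time", np.random.rand(1000)) for i in range(100)})
ds2 = ds2.merge(new_vars)
```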

In my benchmarks, I found that in the limit of large datasets, list comprehensions over 1000 elements or more were often used to "search" for the variables that were "indexed":

indexed_elements = [

I think a few speedups can be obtained by avoiding these kinds of "searches" and list comprehensions. However, the dataset would then have to provide this information to the merge_core routine, instead of merge_core recreating it every time; a sketch of that idea follows below.
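Here is a hypothetical sketch of that contrast (the names are illustrative, not xarray's actual internals): the first function rescans every candidate on each merge, in the spirit of the list comprehension linked above; the second relies on a mapping the dataset would maintain and hand to the merge routine.

```python
from typing import Any

def find_indexed_by_scan(elements: list[tuple[Any, Any]]) -> list[tuple[Any, Any]]:
    # O(n) rescan on every merge call: walk all elements and keep the ones
    # that carry an index.
    return [(var, idx) for var, idx in elements if idx is not None]

def find_indexed_from_cache(indexed_cache: dict[str, Any], names: list[str]) -> list[Any]:
    # If the dataset maintained `indexed_cache` incrementally, the merge step
    # could do O(1) lookups per name instead of scanning everything.
    return [indexed_cache[name] for name in names if name in indexed_cache]
```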

Ultimately, I think you trade off the "memory footprint" of a dataset (due to the additional data structures you keep around) against "speed".
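Purely as an illustration of that trade-off (none of these names correspond to real xarray attributes), keeping an incrementally maintained set of indexed names costs extra memory per dataset but removes the rescan from the hot path:

```python
class VariableStore:
    """Toy container that keeps an extra set of indexed names to avoid full scans."""

    def __init__(self) -> None:
        self._variables: dict[str, object] = {}
        self._indexed: set[str] = set()  # the extra memory kept around

    def insert(self, name: str, var: object, has_index: bool) -> None:
        self._variables[name] = var
        if has_index:
            self._indexed.add(name)  # maintained incrementally, no rescan needed

    def indexed_names(self) -> set[str]:
        return self._indexed
```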

Anyway, I just wanted to share how far I got.
