Performance: numpy indexes small amounts of data 1000x faster than xarray #2799

Open
@nbren12

Machine learning applications often require iterating over every index along some of the dimensions of a dataset, for instance over all the (lat, lon) pairs in a 4D dataset with dimensions (time, level, lat, lon). Unfortunately, this is very slow with xarray objects compared to numpy (or h5py) arrays. When the Pangeo machine learning working group met today, we found that several of us have struggled with this.
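To make the access pattern concrete, here is a minimal sketch of the kind of loop described above. The array sizes, the name `iter_columns`, and the use of random data are assumptions chosen for illustration; they are not taken from any real workload.

```python
import itertools

import numpy as np
import xarray as xr

# Small synthetic 4D array; the shape is illustrative only.
data = np.random.rand(4, 3, 5, 6)
da = xr.DataArray(data, dims=("time", "level", "lat", "lon"))


def iter_columns(arr):
    """Yield (lat_idx, lon_idx, sample) for every horizontal grid point."""
    n_lat, n_lon = arr.sizes["lat"], arr.sizes["lon"]
    for i, j in itertools.product(range(n_lat), range(n_lon)):
        # Each isel call returns a small (time, level) DataArray.
        yield i, j, arr.isel(lat=i, lon=j)


samples = list(iter_columns(da))
```

Each iteration pulls out only a tiny slice, so the per-call overhead of the indexing machinery dominates the cost of the loop.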

I made some simplified benchmarks, which show that xarray is about 1000 times slower than numpy when repeatedly grabbing a small amount of data from an array. This is a problem with both `isel` and `[]` indexing. After doing some profiling, the main culprits seem to be xarray routines like `_validate_indexers` and `_broadcast_indexes`.
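A benchmark along these lines could look roughly like the sketch below. It is not the original benchmark: the array shape, the repetition count `n`, and the choice of `timeit` and `cProfile` are assumptions. The measured ratio and the function names that show up in the profile will vary with the xarray version and the machine.

```python
import cProfile
import io
import pstats
import timeit

import numpy as np
import xarray as xr

values = np.random.rand(1000, 100)
da = xr.DataArray(values, dims=("x", "y"))

n = 1000  # number of repeated single-element lookups

# Time the same scalar lookup through NumPy and through xarray.
t_numpy = timeit.timeit(lambda: values[0, 0], number=n)
t_isel = timeit.timeit(lambda: da.isel(x=0, y=0), number=n)
t_getitem = timeit.timeit(lambda: da[0, 0], number=n)

print(f"numpy   : {t_numpy / n * 1e6:.2f} us per lookup")
print(f"isel    : {t_isel / n * 1e6:.2f} us per lookup")
print(f"getitem : {t_getitem / n * 1e6:.2f} us per lookup")
print(f"isel / numpy ratio: {t_isel / t_numpy:.0f}")

# Profile the isel loop to see which internal routines take the time.
profiler = cProfile.Profile()
profiler.enable()
for _ in range(n):
    da.isel(x=0, y=0)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```

Sorting by cumulative time puts the outermost indexing entry points first, which makes it easier to see how much of each call is spent in validation and broadcasting helpers rather than in the actual data access.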

While Python will always be slower than C when iterating over an array in this fashion, I would hope that xarray could be nearly as fast as numpy. I am not sure what the best way to improve this is, though.
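Until the overhead is reduced inside xarray itself, one possible user-side workaround (an assumption on my part, not a fix for the underlying problem) is to drop to a NumPy array once, outside the loop, and index that instead. The sketch below uses made-up dimensions and loses coordinate labels inside the loop, which may or may not be acceptable.

```python
import numpy as np
import xarray as xr

# Illustrative 4D array with the dimension names used in this issue.
data = np.random.rand(4, 3, 5, 6)
da = xr.DataArray(data, dims=("time", "level", "lat", "lon"))

# Put the looped dimensions first, then extract the NumPy array a single time.
raw = da.transpose("lat", "lon", "time", "level").values

# Plain NumPy indexing inside the loop; each item has shape (time, level).
columns = [
    raw[i, j]
    for i in range(raw.shape[0])
    for j in range(raw.shape[1])
]
```

This pays the xarray overhead once for the `transpose` and `.values` calls instead of once per grid point.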
