-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance: numpy indexes small amounts of data 1000 faster than xarray #2799
Comments
cc @rabernat |
I'm sure it's possible to optimize this significantly, but short of rewriting this logic in a lower level language it's pretty much impossible to match the speed of NumPy. This benchmark might give some useful context:
On my machine, this is already twice as slow as your NumPy benchmark (497 µs vs 251 µs) , and all it's doing is parsing Right now we're at about 130 µs per indexing operation. In the best case, we might make this 10x faster but even that would be quite challenging, e.g., consider that even creating a DataArray takes about 20 µs. |
Thanks so much @shoyer. I didn't realize there was that much overhead for a single function call. OTOH, 2x slower than numpy would be way better than 1000x. After looking at the profiling info more, I tend to agree with your 10x maximum speed-up. A couple of particularly slow functions (e.g. |
To be clear, pull requests improving performance (without significantly loss of readability) would be very welcome. Be sure to include a new benchmark in our benchmark suite. |
Thanks for the benchmarks @nbren12, and for the clear explanation @shoyer While we could do some performance work on that loop, I think we're likely to see a material change by enabling the external library to access directly from the array, without a looped python call. That's consistent with the ideas @jhamman had a few days ago. |
@max-sixty I tend to agree this use case could be outside of the scope of xarray. It sounds like significant progress might require re-implementing core |
You can always use xarray to process the data, and then extract the underlying array ( |
Sure, I've been using that as a workaround as well. Unfortunately, that approach throws away all the nice info (e.g. metadata, coordinate) that xarray objects have and requires duplicating much of xarray's indexing logic. |
To put the relative speed of numpy access into perspective, I found this insightful: https://jakevdp.github.io/blog/2012/08/08/memoryview-benchmarks/ (it's now a few years out of date, but I think the fundamentals still stand) Pasted from there:
So if we're running an inner loop on an array, accessing it using numpy in python is an order of magnitude slower than accessing it using numpy in C (and that's an order of magnitude slower than using a slice, and that's an order of magnitude slower than using raw pointers) So - let's definitely speed xarray up (your benchmarks are excellent, thank you again, and I think you're right there are opportunities for significant increases). But where speed is paramount above all else, we shouldn't use any access in python, let alone the niceties of xarray access. |
Cython + memoryviews isn't quite the right comparison here. I'm sure ordering here is correct, but relative magnitude of the performance difference should be smaller. Xarray's core is bottlenecked on:
C++ offers very low-cost abstraction but dynamism is still slow. Even then, compilers are much better at speeding up tight numeric loops than complex domain logic. As a point of reference, it would be interesting to see these performance numbers running pypy, which I think should be able to handle everything in xarray. You'll note that pypy is something like 7x faster than CPython in their benchmark suite, which I suspect is closer to what we'd see if we wrote xarray's core in a language like C++, e.g., as Python interface to xframe. |
Right, tbc, I'm only referring to the top two lines of the pasted benchmark; i.e. once we enter python (even if only to access a numpy array) we're already losing a lot of the speed relative to the loop staying in C / Cython. So even if xarray were a python front-end to a C++ library, it still wouldn't be competitive if performance were paramount. |
It might be interesting to see, if pythran is an alternative to Cython. It seems like it handles high level But it seems like other libraries like e.g. scikit-image made some good experience with it. Sadly I can't be of much help, as I lack experience (and most importantly time). |
This is not a great start :( It's the first time I hear about Pythran. At first sight it looks somewhat like a hybrid between Cython (for the ahead-of-time transpiling to C++) and numba (for having python-compatible syntax). That said, I didn't see anything that hints at potential speedups on the python boilerplate code. I already had experience with compiling pure-python code (tight This said, I'd have to spend more time on it to get a more informed opinion. |
Not really. Pythran always releases the GIL and does a bunch of optimizations between transpilation and compilations. A good approach would be try out different compilers and see what performance is obtained, without losing readability (#2799 (comment)). See scikit-image/scikit-image/issues/4199 where the package |
I simplified the benchmark: from itertools import product
import numpy as np
import xarray as xr
shape = (10, 10, 10, 10)
index = (0, 0, 0, 0)
np_arr = np.ones(shape)
arr = xr.DataArray(np_arr)
named_index = dict(zip(arr.dims, index))
print(index)
print(named_index)
%timeit -n 1000 arr[index]
%timeit -n 1000 arr.isel(**named_index)
%timeit -n 1000 np_arr[index]
%%prun -s cumulative
for _ in range(10000):
arr[index]
Time breakdown:
I can spot a few low-hanging fruits there:
So in short while I don't think we can feasibly close the order-of-magnitude gap (800x) with numpy, I suspect we could get at least a 5x speedup here. |
All those integer indexes were cast into Variables. #3375 stops that. |
After #3375:
The offending lines in Dataset.isel are these, and I strongly suspect they are improvable: Lines 1922 to 1930 in 4254b4a
|
Great analysis, thanks Do we have any idea of which of those lines are offending? I used a tool |
I tried playing around with pypy 3.6. #!/bin/bash
set -o errexit
set -o pipefail
set -o nounset
set -o xtrace
tar -xvjf Downloads/pypy3.6-v7.1.1-linux64.tar.bz2
cd pypy3.6-v7.1.1-linux64/bin
./pypy3 -m ensurepip
./pip3.6 install -U pip wheel
./pip list | awk 'NR > 2 {print $1}' | grep -v greenlet | xargs ./pip install -U
# sudo apt-get install libopenblas-dev gfortran
./pip install numpy pandas xarray
import time
import numpy as np
import xarray as xr
shape = (10, 10, 10, 10)
index = (0, 0, 0, 0)
np_arr = np.ones(shape)
arr = xr.DataArray(np_arr)
N = 10000
def bench_slice(obj):
for _ in range(4):
t0 = time.time()
for _ in range(N):
obj[index]
t1 = time.time()
t_ns = (t1 - t0) / N * 1e9
print(f"{t_ns:6.0f} ns {obj.__class__.__name__}")
bench_slice(arr)
bench_slice(np_arr) Benchmark outputs:
PyPy 7.1 3.6:
Big important reminder: all results are for a very small array. I would expect the gap between CPython and pypy to get narrower in % (both for numpy and xarray) as the array size gets larger and more time is spent in the pure C numpy code. |
I suspect system jitter in the profiling as the time for |
Hmm, slicing should basically be a no-op. The fact that xarray makes it about 100x slower is a real killer. It seems from this conversation that it might be hard to workaround import xarray as xr
import numpy as np
n = np.zeros(shape=(1024, 1024))
x = xr.DataArray(n, dims=('y', 'x'))
the_slice = np.s_[256:512, 256:512]
%timeit n[the_slice]
%timeit x[the_slice]
186 ns ± 0.778 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
70.3 µs ± 593 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) |
TBC I think there's plenty we could do with relatively little complexity to speed up indexing operations on That's a different problem from making xarray as fast as indexing a numpy array, or allowing libraries to iterate through a |
Sure, I just wanted to make the note that this operation should be more or less constant time, as opposed to dependent on the size of the array. Somebody had mentionned it should increase with the size of the array. |
Yes, I think this is still the case for slicing in xarray. There's just much larger constant overhead than in NumPy. (And this is difficult to fix short of rewriting xarray's core in C.) |
One note: if you're indexing into a dataarray and don't care about the coords, index into the variable. 2x numpy time, rather than 30x: In [26]: da = xr.tutorial.open_dataset('air_temperature')['air']
In [27]: da
Out[27]:
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
[3869000 values with dtype=float32]
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name: 4xDaily Air temperature at sigma level 995
units: degK
precision: 2
GRIB_id: 11
GRIB_name: TMP
var_desc: Air temperature
dataset: NMC Reanalysis
level_desc: Surface
statistic: Individual Obs
parent_stat: Other
actual_range: [185.16 322.1 ]
In [20]: %timeit da.variable[0]
28.2 µs ± 2.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [21]: %timeit da[0]
459 µs ± 37.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [22]: %timeit da.variable.values[0]
14.1 µs ± 183 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) |
This |
That's great that's helpful @nbren12 . Maybe we should add to docs (we don't really have a performance section at the moment, maybe we start something on performance tips?) There's some info on the differences in the Terminology that @gwgundersen wrote: https://github.com/pydata/xarray/blob/master/doc/terminology.rst#L18 Essentially: by indexing on the variable, you ignore the coordinates, and so skip a bunch of code that takes the object apart and puts it back together. A variable is much more similar to a numpy array, so you can't do |
#3533 closes the gap between DataArray and numpy from 500x slower to "just" 100x slower :) |
Hi, I'm working on a machine learning application where I want to stream data and use xarray containers to store them in a buffer (with an additional "lag" dimension) and guaranty good alignement of the coordinates on various dimensions of the streamed data. I have the impression that for this kind of applications or more generally for intensive algorithmic usages, also as stated at the begining of this issue, a light (with less functionalities and checks) and fast version of xarray DataArray and Dataset containers could be developped. Do you think this could be something doable in the scope of xarray? |
@jhamman Weren't you talking about an xarray lite (TM) package? |
I agree, I think a "xarray lite" package with only named dimensions could indeed be a valuable contribution. I'd love to optimize xarray further, but I suspect you would probably have to write the core in a language like C++ to achieve similar performance to NumPy. |
I hope the following can help users that struggle with the speed of xarray: I've found that when doing numerical computation, I often use the xarray to grab all the metadata relevant to my computation. Scale, chromaticity, experimental information. Eventually, i create a function that acts as a barrier:
The low level implementation can operate on the fast numpy arrays. I've found this to be the struggle with creating high level APIs that do things like sanitize inputs (xarray routines like For the example that @nbren12 brought up originally, it might be better to create xarray routines (if they don't exist already) that can create fast iterators for the underlying numpy arrays given a set of dimensions that the user cares about. |
FWIW, I think the xarray-lite concept would be a great chunk of work to write a small-ish proposal around. I think we could target the next round of CZI EOSS with such a concept. |
Thanks all for your prompt responses! @hmaarrfk , I share your recommendation and it's a great thing to be able to fallback to numpy arrays when the algorithmic part is well decoupled from the data preparation process. It's what I also do when I can. @shoyer I think a first "lite" implementation fully implemented in python could be already a great thing. An additional suggestion: if the target is computational workflows, trying to have some compatibility with packages such as eagerpy would enabling working with other tensor frameworks commonly used in machine learning. @jhamman, @shoyer I would be pleased to share my work on buffer data array if you think it could serve as kind of use-case. In this context, I experimented a bit with a « crafted » lite version of xarray and I could achieve a x10 factor in performance improvement. |
In case it could be usefull, and be reused for benchmarks, I released on my github two notebooks with an implementation of a faster (but somehow very simplified and not very optimized in term of code architecture) version of DataArray and Dataset containers. The second notebook contains some line profilings for buffer experiments with various containers. This permits to point on operations which are slow in Datarray implementation for this use case. |
@jhamman I'll be happy to participate in the discussion. |
Hello, I don't want to disrupt the issue too much (so let me know if you'd rather we continue the discussion outside). Somewhat related to the discussions in this issue, I recently released an open-source library: WAX-ML, https://github.com/eserie/wax-ml, where I implement an accessor to unroll JAX transformations on DataSet and Dataarray xarray containers along a time dimension. I hope this can help! |
I'm really not understanding why indexing is so slow. My dataarray has 2 dims, one axis 1.5 million long ('node') and the other 1500 ('time'). Trying to pull a single timeseries by indexing 1 node takes 16 seconds. the Variable workaround or playing around with chunking doesn't change anything. The only thing loading into memory should be array of 1500 values. Not sure what's going on under the hood but there may be a way to specify that you're only looking to optimize indexing along 1 dim. Once it gets indexed it becomes a very tiny data set. I would think |
As I've been recently going down this performance rabbit hole, I think the discussion around #7045 is relevant and provides some additional historical context as to "why" this performance penalty might be happening. |
So a workaround I was able to use was to load the whole thing into a np array (18GB!) in 1 minute |
You all know this, but I didn't realize how very slow read gunzip / write gzip can be.
Here xarray and numpy times, both with no compression, are not so far apart -- Not surprisingly, xarray with complevel 1 is terribly slow, gunzip / gzip. Two obvious suggestions:
|
Machine learning applications often require iterating over every index along some of the dimensions of a dataset. For instance, iterating over all the
(lat, lon)
pairs in a 4D dataset with dimensions(time, level, lat, lon)
. Unfortunately, this is very slow with xarray objects compared to numpy (or h5py) arrays. When the Pangeo machine learning working group met today, we found that several of us have struggled with this.I made some simplified benchmarks, which show that xarray is about 1000 times slower than numpy when repeatedly grabbing a small amount of data from an array. This is a problem with both
isel
or[]
indexing. After doing some profiling, the main culprits seem to be xarray routines like_validate_indexers
and_broadcast_indexes
.While python will always be slower than C when iterating over an array in this fashion, I would hope that xarray could be nearly as fast as numpy. I am not sure what the best way to improve this is though.
The text was updated successfully, but these errors were encountered: