What is your issue?
I'm doing some benchmarks on Xarray + Zarr vs. some other formats, and I'm getting quite a surprising result: on a very simple array, xarray adds a lot of overhead to reading a Zarr array.
Here's a quick script, super simple, just a single chunk. It's 800MB of data, so not some tiny array where reading a metadata JSON file or allocating an index is going to skew the results.
import numpy as np
import zarr
import xarray as xr
import dask
print(zarr.__version__, xr.__version__, dask.__version__)
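# Write an ~800 MB float64 array (10000 x 10000) as a single Zarr chunk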
(
xr.DataArray(np.random.rand(10000, 10000), name="foo")
.to_dataset()
.chunk(None)
.to_zarr("test.zarr", mode="w")
)
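# Time a full read through xarray vs. reading the Zarr array directly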
%timeit xr.open_zarr("test.zarr").compute()
%timeit zarr.open("test.zarr")["foo"][:]
2.17.2 2024.5.1.dev37+gce196d56 2024.5.2
551 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
183 ms ± 2.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So:
- 551ms for xarray
- 183ms for zarr
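One way to narrow this down, a sketch assuming the extra time comes from dask rather than xarray's backend itself: open_zarr accepts chunks=None, which loads through xarray's lazy-indexing machinery without building a dask graph, so timing that path splits the overhead between the two layers.

# Sketch: skip dask entirely (chunks=None) to see how much of the
# overhead is xarray's backend vs. dask's scheduler
%timeit xr.open_zarr("test.zarr", chunks=None)["foo"].values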
Having a quick look with py-spy suggests there might be some thread contention, but I'm not sure how much is really contention vs. idle threads waiting.
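To test the contention hypothesis directly, a minimal sketch, assuming the default threaded scheduler is in play: re-time the read on dask's single-threaded scheduler, and if the gap shrinks, thread contention (or idle waiting) is implicated. Dataset.compute forwards keyword arguments to dask.compute, so the scheduler can be passed inline:

# Sketch: force dask's single-threaded scheduler to rule threading in or out
%timeit xr.open_zarr("test.zarr").compute(scheduler="synchronous")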
Making the array 10x bigger (with 10 chunks) reduces the relative difference, but it's still fairly large:
2.17.2 2024.5.1.dev37+gce196d56 2024.5.2
6.88 s ± 353 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.15 s ± 264 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
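For reference, a sketch of how the larger benchmark might be set up; the exact shape and store name here are assumptions, scaling the first dimension 10x and splitting it into 10 chunks:

# Sketch of the 10x benchmark: ~8 GB split into 10 chunks along dim_0
(
    xr.DataArray(np.random.rand(100000, 10000), name="foo")
    .to_dataset()
    .chunk({"dim_0": 10000})  # 10 chunks of ~800 MB each
    .to_zarr("test_big.zarr", mode="w")
)
%timeit xr.open_zarr("test_big.zarr").compute()
%timeit zarr.open("test_big.zarr")["foo"][:]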
Any thoughts on what might be happening? Is the benchmark at least correct?