xr.open_zarr is 3x slower than zarr.open, even at scale #9111

Closed
@max-sixty

Description

What is your issue?

I'm doing some benchmarks on Xarray + Zarr vs. some other formats, and I'm getting a surprising result: even for a very simple array, xarray adds a lot of overhead to reading a Zarr store.

Here's a quick script. It's super simple, just a single chunk, and it's 800MB of data, so not some tiny array where reading a metadata JSON file or allocating an index is going to throw off the results.

import numpy as np
import zarr
import xarray as xr
import dask

print(zarr.__version__, xr.__version__, dask.__version__)

# Write 800MB (10000 x 10000 float64) to a Zarr store as a single chunk.
(
    xr.DataArray(np.random.rand(10000, 10000), name="foo")
    .to_dataset()
    .chunk(None)
    .to_zarr("test.zarr", mode="w")
)

# Read it back: a full load through xarray vs. indexing the zarr array directly.
%timeit xr.open_zarr("test.zarr").compute()
%timeit zarr.open("test.zarr")["foo"][:]
2.17.2 2024.5.1.dev37+gce196d56 2024.5.2
551 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
183 ms ± 2.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So:

  • 551ms for xarray
  • 183ms for zarr
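
A quick way to separate dask overhead from xarray's own overhead might be to pass chunks=None to xr.open_zarr, which skips creating dask arrays and loads eagerly instead. This isn't part of the benchmark above, just a sketch of a follow-up measurement:

# chunks=None avoids dask entirely; any remaining gap vs. bare zarr
# would be xarray's own overhead (e.g. decoding, indexes, lazy-array wrappers).
%timeit xr.open_zarr("test.zarr", chunks=None).compute()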

Having a quick look with py-spy suggests there might be some thread contention, but I'm not sure how much is real contention vs. idle threads waiting.
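
If contention is a suspect, one way to probe it (not something I ran above, just a standard dask knob) is to force the single-threaded scheduler and see whether the timing moves:

# Force dask's synchronous (single-threaded) scheduler for the session.
# If the xarray timing barely changes, the gap is per-read overhead
# rather than threads fighting over the GIL.
dask.config.set(scheduler="synchronous")
%timeit xr.open_zarr("test.zarr").compute()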

Making the array 10x bigger (with 10 chunks) reduces the relative difference, but it's still fairly large:

2.17.2 2024.5.1.dev37+gce196d56 2024.5.2
6.88 s ± 353 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.15 s ± 264 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
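
For reference, a sketch of the scaled-up setup (the exact shape isn't shown above; this assumes 10x the rows, split into 10 row-wise chunks):

# Hypothetical reconstruction of the 10x case: ~8GB in 10 chunks.
# "dim_0" is the default name xarray gives the first dimension.
(
    xr.DataArray(np.random.rand(100000, 10000), name="foo")
    .to_dataset()
    .chunk({"dim_0": 10000})
    .to_zarr("test_big.zarr", mode="w")
)

%timeit xr.open_zarr("test_big.zarr").compute()
%timeit zarr.open("test_big.zarr")["foo"][:]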

Any thoughts on what might be happening? Is the benchmark at least correct?
