Skip to content

How should Xarray control asynchronous calls? #10622

@TomNicholas

Description

@TomNicholas

Need for async

There are several places in Xarray where utilizing asynchronous calls would hugely boost performance. This comes up particularly when interacting with high-latency storage backends, such as Zarr on remote object storage:

Each of these performance issues can be solved by leveraging async code somewhere in the stack, requiring explicit control of concurrency somewhere.

Where to async?

But where in the stack should that control be implemented?

As @rabernat noted in #8965 (comment), historically we have avoided this question originally by outsourcing nearly all parallelism to dask. Now that zarr-python v3 learned how to async, some of the above issues were addressed by controlling concurrency in that layer.

But pushing everything down to zarr only gets us so far. It requires zarr-python to implement API that would be more natural to live in xarray's zarr backend, prevents other backends benefiting from async (such as PyDAP), and sometimes we do want to expose the async functions publicly to the user.

Inter-library coordination

In general there are perhaps 5 layers at which concurrency might need be controlled:

  1. User code
  2. Xarray
  3. Dask / Cubed
  4. Zarr-python
  5. Icechunk / fsspec

Xarray is clearly one amongst many so we need to think about how to coordinate across layers to avoid opaque and unpredictable interactions between libraries.

Choices in Xarray

If xarray does not explicitly control concurrency but does use asyncio.gather calls to issue requests across multiple variables then it could lead to oversubscribing threads, and some of the above issues are awkward to solve with this limitation.

If xarray does explicitly control concurrency then it requires either a separate event loop, or a threadpool with some way to configure the number of threads, plus some code to get this to run inside an already-running event loop (e.g. inside ipython/jupyter). It also has other risks.

#10327 adds a new primitive (the async method xr.Variable.load_async()) that will be useful for solving many of the above issues, but faces this question of whether or not to limit concurrency (see thread). For now I'm leaning towards punting on the question by not limiting it, simply exposing the awaitables all the way to the top, but with the understanding that once we have agreement here on a general approach we should go back and add in some concurrency-limiting code.

cc @d-v-b, @TomAugspurger, @jhamman, @dcherian, @shoyer

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions