Need for async
There are several places in Xarray where utilizing asynchronous calls would hugely boost performance. This comes up particularly when interacting with high-latency storage backends, such as Zarr on remote object storage:
- When opening data, for loading metadata for many arrays/groups (Reducing IO in ZarrStore #9853)
- When loading multiple variables / groups
- For creating indexes up-front (open_dataset creates default indexes sequentially, causing significant latency in cloud high-latency stores #10579)
- For on-demand (lazy) loading (Add an asynchronous load method? #10326 / Support concurrent loading of variables #8965)
- For awaiting lazily-loading variables (Dataset.to_zarr compute=False should allow access to awaitable #6383)
- When loading multiple datasets (Add an asynchronous load method? #10326)
- When saving multiple variables / groups (DataTree.to_zarr() is very slow writing to high latency store #9455)
Each of these performance issues can be solved by leveraging async code somewhere in the stack, which in turn requires explicit control of concurrency somewhere.
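To make the potential win concrete, here is a minimal sketch (the fetch_array_metadata coroutine and the 0.1 s sleep are hypothetical stand-ins for round-trips to a remote store): issuing the requests concurrently with asyncio.gather overlaps the round-trips instead of paying for them one after another.

```python
import asyncio

async def fetch_array_metadata(name: str) -> dict:
    # Hypothetical coroutine standing in for one round-trip to a
    # high-latency store (e.g. reading one zarr array's metadata).
    await asyncio.sleep(0.1)  # pretend network latency
    return {"name": name}

async def open_all(names: list[str]) -> list[dict]:
    # Sequential awaiting would cost ~0.1 s * len(names);
    # gathering lets the round-trips overlap, costing ~0.1 s total.
    return await asyncio.gather(*(fetch_array_metadata(n) for n in names))

metadata = asyncio.run(open_all([f"var_{i}" for i in range(100)]))
```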
Where to async?
But where in the stack should that control be implemented?
As @rabernat noted in #8965 (comment), historically we have avoided this question by outsourcing nearly all parallelism to dask. Now that zarr-python v3 supports async, some of the above issues have been addressed by controlling concurrency in that layer.
But pushing everything down to zarr only gets us so far. It requires zarr-python to implement API that would more naturally live in xarray's zarr backend, it prevents other backends (such as PyDAP) from benefiting from async, and sometimes we do want to expose the async functions publicly to the user.
Inter-library coordination
In general there are perhaps 5 layers at which concurrency might need to be controlled:
- User code
- Xarray
- Dask / Cubed
- Zarr-python
- Icechunk / fsspec
Xarray is clearly only one layer amongst many, so we need to think about how to coordinate across layers to avoid opaque and unpredictable interactions between libraries.
Choices in Xarray
If xarray does not explicitly control concurrency but does use asyncio.gather calls to issue requests across multiple variables, then it could oversubscribe threads, and some of the above issues are awkward to solve with this limitation.
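To illustrate that concern (all names here are hypothetical), if xarray gathers over variables while a lower layer gathers over chunks, and neither layer caps concurrency, the number of in-flight requests becomes the product of the two fan-outs:

```python
import asyncio

async def read_chunk(var: str, i: int) -> bytes:
    # Hypothetical single request to object storage.
    await asyncio.sleep(0.1)
    return b""

async def load_variable(var: str, n_chunks: int = 1000) -> list[bytes]:
    # The layer below xarray (e.g. zarr) fans out over chunks...
    return await asyncio.gather(*(read_chunk(var, i) for i in range(n_chunks)))

async def load_dataset(variables: list[str]) -> None:
    # ...and xarray fans out over variables. With no limit anywhere,
    # len(variables) * n_chunks requests are in flight at once.
    await asyncio.gather(*(load_variable(v) for v in variables))
```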
If xarray does explicitly control concurrency then it requires either a separate event loop, or a threadpool with some way to configure the number of threads, plus some code to get this to run inside an already-running event loop (e.g. inside ipython/jupyter). It also has other risks.
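As a rough sketch of what explicit control could look like (not a proposal for actual xarray API; gather_limited, run_sync, and the limit of 32 are all invented here), concurrency can be bounded with an asyncio.Semaphore, and synchronous callers can block on async work via a dedicated background event loop, which also works when the calling thread already has a running loop, as in IPython/Jupyter:

```python
import asyncio
import threading
from concurrent.futures import Future

async def gather_limited(coros, limit: int = 32):
    # Like asyncio.gather, but at most `limit` awaitables run at once.
    # `limit` is the knob xarray would need to expose somehow
    # (global option, keyword argument, ...).
    semaphore = asyncio.Semaphore(limit)

    async def _run(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(_run(c) for c in coros))

# A dedicated event loop in a background thread, so synchronous callers can
# block on async work without calling asyncio.run() in the current thread
# (which fails if that thread's loop is already running, e.g. in Jupyter).
_loop = asyncio.new_event_loop()
threading.Thread(target=_loop.run_forever, daemon=True).start()

def run_sync(coro):
    future: Future = asyncio.run_coroutine_threadsafe(coro, _loop)
    return future.result()
```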
#10327 adds a new primitive (the async method xr.Variable.load_async()) that will be useful for solving many of the above issues, but it faces this same question of whether or not to limit concurrency (see the thread there). For now I'm leaning towards punting on the question by not limiting it, and simply exposing the awaitables all the way to the top, with the understanding that once we have agreement here on a general approach we should go back and add some concurrency-limiting code.
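For illustration, user code might consume that primitive roughly like this (load_async comes from #10327; the load_all helper and the unlimited gather are assumptions made for the sake of the example):

```python
import asyncio
import xarray as xr

async def load_all(ds: xr.Dataset) -> None:
    # Kick off every variable's load at once and await them together;
    # with no limit in xarray itself, the effective concurrency is
    # whatever the layers below (zarr, the store) permit.
    await asyncio.gather(
        *(ds[name].variable.load_async() for name in ds.data_vars)
    )
```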
cc @d-v-b, @TomAugspurger, @jhamman, @dcherian, @shoyer