-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
distarray backend? #591
Comments
I think this is an interesting idea. I was just asking @shoyer the other day if we could use @shoyer will be able to speak more to the applicability and effort required to implement One major drawback I see is that we don't have a parallel netCDF library in python so I/O will be an issue. |
Right now we have some custom logic in xray to support dask. This is OK, but ideally we would have generic hooks that would let other libraries supply ndarray objects suitable for putting in xray data structures that we don't know anything about. My hope is to resolve this upstream in NumPy (see numpy/numpy#4164 and related issues). In particularly, I'd like to start with a proposal for bolt-project/bolt#58 provides an overview of the array API that we need to sensibly wrap another array library as an alternative to numpy/dask. So let me make this a challenge: if anyone has an open source array library that checks off most of those boxes and that they want to use with xray, I will gladly help you wrap it. Even before we have a full "duck array" API in NumPy itself, if you can provide a compatibility module that makes your array library work like NumPy (e.g., like Distarray does look interesting, but it seems like it hasn't gotten much real world use (yet), so it's hard to say how well it will work. The project has tackled some pretty hard problems with representing and manipulating fully distributed arrays in a variety of underlying representations, but that means they've had less time to focus on building out the long tail of features necessary for a useful ndarray library. That said, I would love to be proven wrong -- if distarray can do most of the items on my list and a numpy compatible API, then it's worth thinking about what it would take to wrap it. Perhaps @bgrant can help clarify? (I enjoyed your talk at SciPy, by the way!) Here are a few other projects that are also worth keeping an eye on:
|
I think @shoyer's characterization of the current state of DistArray is pretty good. That said, I think this proposal is a good target for us to think about, and bolt-project/bolt#58 is helpful. I'll try to look into this more if I can make some time. |
@shoyer I really like the "challenge" proposed here, and terrific summary of the state of things. It's definitely on our to-do list with Bolt and will keep you posted on progress, though we're currently knee-deep in the very related problem of context switching between different backend services (local, dask, distributed w/ spark, etc.) |
Closing this old issue. The distrarray project seems to be done and we have more recent issues discussing flexible array interfaces. |
This is probably a long shot, but I think a distarray backend could potentially be very useful in xray. Distarray implements the numpy interface, so it should be possible in principle.
Distarray has a different architecture from dask (using MPI for parallelization) and in this way is more similar to traditional HPC codes. The application I have in mind is very high resolution GCM output where one wants to tile the data spatially across multiple nodes on a cluster. (This is how a GCM itself works.)
The text was updated successfully, but these errors were encountered: