distarray backend? #591

Closed
rabernat opened this issue Sep 28, 2015 · 5 comments

@rabernat
Contributor

This is probably a long shot, but I think a distarray backend could potentially be very useful in xray. Distarray implements the numpy interface, so it should be possible in principle.

Distarray has a different architecture from dask (using MPI for parallelization) and in this way is more similar to traditional HPC codes. The application I have in mind is very high resolution GCM output where one wants to tile the data spatially across multiple nodes on a cluster. (This is how a GCM itself works.)
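As a rough illustration of the spatial tiling described above, here is a minimal sketch in plain NumPy. It does not use distarray or MPI; the helper name `tile_slices` and the 2 x 4 process grid are assumptions made up for this example. In a real MPI setting each rank would hold only its own tile rather than slicing one in-memory array.

```python
import numpy as np


def tile_slices(shape, grid):
    """Yield (rank, slices) for a 2D block decomposition.

    Hypothetical helper for illustration only; not part of distarray or xray.
    `shape` is (ny, nx) and `grid` is (py, px), the number of tiles per axis.
    """
    ny, nx = shape
    py, px = grid
    # Integer edges so that the tiles cover the domain exactly once.
    y_edges = [k * ny // py for k in range(py + 1)]
    x_edges = [k * nx // px for k in range(px + 1)]
    rank = 0
    for j in range(py):
        for i in range(px):
            yield rank, (
                slice(y_edges[j], y_edges[j + 1]),
                slice(x_edges[i], x_edges[i + 1]),
            )
            rank += 1


# A small stand-in for one horizontal level of model output.
field = np.arange(48).reshape(6, 8)

# Each "rank" gets one rectangular tile of the global field.
tiles = {rank: field[sl] for rank, sl in tile_slices(field.shape, (2, 4))}

# Reassemble the tiles row by row to check that nothing was lost.
rebuilt = np.block([[tiles[j * 4 + i] for i in range(4)] for j in range(2)])
```

The integer edge calculation also handles domain sizes that do not divide evenly by the grid; in that case some tiles are one row or column larger than others.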

@jhamman
Member

jhamman commented Sep 29, 2015

I think this is an interesting idea. I was just asking @shoyer the other day if we could use dask.distributed with xray objects.

@shoyer will be able to speak more to the applicability and effort required to implement distarray. I'm not sure how many array abstractions we want in xray. Currently we have two (numpy and dask). Fully distributed arrays may be the logical extension.

One major drawback I see is that we don't have a parallel netCDF library in python so I/O will be an issue.

@shoyer
Member

shoyer commented Sep 30, 2015

Right now we have some custom logic in xray to support dask. This is OK, but ideally we would have generic hooks that would let other libraries supply ndarray objects suitable for putting in xray data structures that we don't know anything about. My hope is to resolve this upstream in NumPy (see numpy/numpy#4164 and related issues). In particular, I'd like to start with a proposal for __array_concatenate__ and then add enough special methods that we can get rid of almost all the isinstance checks in xray.core.ops.
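To make the idea of a generic hook concrete, here is a minimal sketch of how dispatch on a special method could replace an isinstance check. It assumes `__array_concatenate__` takes the list of arrays and an axis; that signature is a guess for illustration, since the method is only a proposal here and NumPy's ndarray does not define it. `TinyDuckArray` and the module-level `concatenate` are hypothetical names, not xray or NumPy code.

```python
import numpy as np


class TinyDuckArray:
    """Hypothetical third-party array type that wraps a NumPy array."""

    def __init__(self, data):
        self._data = np.asarray(data)

    @property
    def shape(self):
        return self._data.shape

    @property
    def dtype(self):
        return self._data.dtype

    def __array_concatenate__(self, arrays, axis=0):
        # Assumed protocol: receive every operand, return the combined array.
        parts = [
            a._data if isinstance(a, TinyDuckArray) else np.asarray(a)
            for a in arrays
        ]
        return TinyDuckArray(np.concatenate(parts, axis=axis))


def concatenate(arrays, axis=0):
    """Dispatch on the hook instead of checking for specific array types."""
    arrays = list(arrays)
    for a in arrays:
        hook = getattr(type(a), "__array_concatenate__", None)
        if hook is not None:
            # The first operand that implements the hook handles the call.
            return hook(a, arrays, axis=axis)
    # No operand implements the hook, so fall back to plain NumPy.
    return np.concatenate(arrays, axis=axis)


mixed = concatenate([TinyDuckArray([1, 2]), np.array([3])])
plain = concatenate([np.array([1]), np.array([2])])
```

With this shape, the calling code never needs to know which array library produced its inputs; supporting a new library means implementing the hook there rather than adding another branch in the caller.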

bolt-project/bolt#58 provides an overview of the array API that we need to sensibly wrap another array library as an alternative to numpy/dask.

So let me make this a challenge: if anyone has an open source array library that checks off most of those boxes and that they want to use with xray, I will gladly help you wrap it. Even before we have a full "duck array" API in NumPy itself, if you can provide a compatibility module that makes your array library work like NumPy (e.g., like dask.array), I will do the necessary hacks to make it work in xray.

Distarray does look interesting, but it seems like it hasn't gotten much real world use (yet), so it's hard to say how well it will work. The project has tackled some pretty hard problems with representing and manipulating fully distributed arrays in a variety of underlying representations, but that means they've had less time to focus on building out the long tail of features necessary for a useful ndarray library. That said, I would love to be proven wrong -- if distarray can do most of the items on my list and has a numpy compatible API, then it's worth thinking about what it would take to wrap it. Perhaps @bgrant can help clarify? (I enjoyed your talk at SciPy, by the way!)

Here are a few other projects that are also worth keeping an eye on:

@bgrant

bgrant commented Oct 5, 2015

I think @shoyer's characterization of the current state of DistArray is pretty good. That said, I think this proposal is a good target for us to think about, and bolt-project/bolt#58 is helpful. I'll try to look into this more if I can make some time.

@freeman-lab

@shoyer I really like the "challenge" proposed here, and terrific summary of the state of things. It's definitely on our to-do list with Bolt, and we will keep you posted on progress, though we're currently knee-deep in the very related problem of context switching between different backend services (local, dask, distributed w/ spark, etc.)

@jhamman
Member

jhamman commented Jan 13, 2019

Closing this old issue. The distarray project seems to be done and we have more recent issues discussing flexible array interfaces.

@jhamman jhamman closed this as completed Jan 13, 2019