Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate parallel backend from frontend #198

Open
CJ-Wright opened this issue Aug 12, 2018 · 2 comments
Open

Separate parallel backend from frontend #198

CJ-Wright opened this issue Aug 12, 2018 · 2 comments

Comments

@CJ-Wright
Copy link
Member

Recently I've been trying to split my pipelines up into chunks so the chunks can be re-used as much as possible. One issue I foresee running into is that scatter may not work as intended. scatter will send the data to the cluser for processing, but since the downstream nodes are already defined (I'm linking them via connect) they won't properly operate on the data (since they aren't DaskStream nodes).

It seems that it may be beneficial to separate the front end pipeline definition from the decision about which backend is being used, at least at the class level. This would make pipelines more modular since we don't need to re-write the same pipeline for each backend. Additionally this would allow the user executing a pipeline to decide what backend they want/are able to use.

One approach to handle this could be to specify a client for the entire pipeline, where a client mimics some of the dask client API (mostly submit and loop I think). If the client associated with any node is changed/updated then the entire pipeline shifts to use that client. This would also require nodes to know if they are downstream of a scatter but haven't been gathered yet, since then they would know to use the pipeline's client rather than standard local execution (which most likely should be a dummy client). Maybe this can be inspected when the client is set, since it could change during the pipeline build process?

@martindurant
Copy link
Member

I believe you had a simpler implementation of this idea that what is being described above? Having scatter/gather not use dask explicitly seems like an OK ting to want to me, but personally I also think there's a lot to be gained from the extra knowledge we have about the execution framework.

@CJ-Wright
Copy link
Member Author

Can you elaborate on what is gained from the knowledge of the framework?
I do have a working implementation of this which I'll try to PR soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants