Recently I've been trying to split my pipelines up into chunks so the chunks can be re-used as much as possible. One issue I foresee running into is that `scatter` may not work as intended. `scatter` will send the data to the cluster for processing, but since the downstream nodes are already defined (I'm linking them via `connect`), they won't properly operate on the data (since they aren't `DaskStream` nodes).
It seems that it may be beneficial to separate the front-end pipeline definition from the decision about which backend is being used, at least at the class level. This would make pipelines more modular, since we don't need to re-write the same pipeline for each backend. Additionally, this would allow the user executing a pipeline to decide what backend they want/are able to use.
One approach to handle this could be to specify a client for the entire pipeline, where a `client` mimics some of the `dask` client API (mostly `submit` and `loop` I think). If the client associated with any node is changed/updated then the entire pipeline shifts to use that client. This would also require nodes to know if they are downstream of a scatter but haven't been gathered yet, since then they would know to use the pipeline's client rather than standard local execution (which most likely should be a dummy `client`). Maybe this can be inspected when the client is set, since it could change during the pipeline build process?
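To make the idea concrete, here is a minimal sketch of what a dummy local client could look like. The names `DummyClient` and `Pipeline` are hypothetical, not existing API; the only assumption is that the client exposes a `submit`/`gather` surface compatible with `dask.distributed.Client`, so swapping in a real dask client would move the same pipeline onto a cluster.

```python
from concurrent.futures import Future

class DummyClient:
    """Local stand-in mimicking the dask Client API: submit runs eagerly."""

    def submit(self, fn, *args, **kwargs):
        # Run the task immediately and wrap the result in a Future so
        # downstream code can treat local and dask execution identically.
        fut = Future()
        try:
            fut.set_result(fn(*args, **kwargs))
        except Exception as exc:
            fut.set_exception(exc)
        return fut

    def gather(self, futures):
        return [f.result() for f in futures]


class Pipeline:
    """Toy pipeline whose nodes all share one client; replacing the
    client switches the backend for the whole pipeline."""

    def __init__(self, client=None):
        self.client = client or DummyClient()
        self.steps = []

    def add(self, fn):
        self.steps.append(fn)
        return self

    def run(self, x):
        # Each step goes through the client, whatever backend it is.
        for step in self.steps:
            x = self.client.submit(step, x).result()
        return x


pipe = Pipeline().add(lambda x: x + 1).add(lambda x: x * 2)
result = pipe.run(3)  # (3 + 1) * 2 == 8 with the dummy client
```

The point of the sketch is only that the pipeline definition (`add` calls) never mentions the backend; only the `client` attribute does, which is what would let the pipeline shift wholesale when the client is changed.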
I believe you had a simpler implementation of this idea than what is being described above? Having scatter/gather not use dask explicitly seems like an OK thing to want to me, but personally I also think there's a lot to be gained from the extra knowledge we have about the execution framework.