Script does not finish with dask distributed #257
Comments
Could it be that your main process is ending immediately after you send the last file for processing, so that the cluster gets shut down at that point and remaining processing tasks in flight are lost? In that case, you need your script to wait until all processing is done, perhaps by awaiting on the futures coming back from the pipeline, or sleeping, or polling the cluster to see if it's still working.
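As a concrete illustration of the "polling the cluster" option, here is a minimal sketch. It assumes the script already holds the dask.distributed Client used by the pipeline; note that Client.processing() only reports tasks the workers are currently running, so this is a rough check rather than a guarantee.

```python
import time
from dask.distributed import Client

client = Client()  # the same client the streamz pipeline is using

# ... build the pipeline and emit all events here ...

# Before letting the script exit, poll the scheduler until no worker
# reports tasks still running (queued-but-unassigned work is not
# visible to this check, so treat it as a rough heuristic).
while any(tasks for tasks in client.processing().values()):
    time.sleep(1)

client.close()
```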
That is probably the issue I'm facing.
How would I do this?
Start emitting to the source inside a daemon thread. On the main thread, have a long-running while loop hold the program open until everything has finished.

from threading import Thread
from time import sleep

from streamz import Source

source = Source()
source.sink(print)

def start_emitting(s):
    # push events into the pipeline from the background (daemon) thread
    for i in range(1000):
        s.emit(i)

if __name__ == "__main__":
    activated_thread = Thread(daemon=True, target=start_emitting, args=(source,))
    activated_thread.start()
    while True:       # keep the main thread alive so processing can finish
        sleep(5)
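A small variant of the same idea (again only a sketch): join the emitter thread instead of looping forever, so the script exits once every emit has been issued. With a dask-backed pipeline you would still need to wait for in-flight tasks to drain afterwards, for example along the lines of the polling sketch earlier in this thread.

```python
if __name__ == "__main__":
    emitter = Thread(daemon=True, target=start_emitting, args=(source,))
    emitter.start()
    emitter.join()  # block until start_emitting has issued every emit
```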
@kivo360: your situation does not seem to involve distributed at all; perhaps a different issue? Please also provide the way in which you are stalling the main thread. In any case, could it be that simply the
I believe the list will be populated with values.
Nah, it is involved with distributed. You need to leave the program time to finish. Oftentimes I default to giving the program infinite time by keeping the main process open for extended periods.
What I mean is, your code doesn't invoke distributed. You should be able to achieve something similar with event loops, but your way may be simpler when none of the source nodes need an event loop anyway (but distributed always has one!). There may perhaps be a way to say "run until done" on a source (i.e., stop when all of the events have been processed), which in the case with no timing or backpressure would be immediate.
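For reference, the event-loop style mentioned here might look like the asynchronous pattern from the streamz documentation, where the stream is tied to a running event loop and each emit can be awaited (a sketch; increment is just a stand-in function):

```python
from streamz import Stream
from tornado.ioloop import IOLoop

def increment(x):
    return x + 1

async def run():
    source = Stream(asynchronous=True)   # tie the stream to the running event loop
    source.map(increment).sink(print)

    for x in range(10):
        await source.emit(x)             # resumes once this event has been processed

IOLoop.current().run_sync(run)
```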
Does
In the original case of working with distributed, yes, you can wait on futures from emit to be processed, but not in the simpler case.
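A sketch of that distributed case, assuming the asynchronous streamz/dask pattern: scatter/gather move work to and from the cluster, and awaiting emit means the script only moves on once the event has made it through the dask part of the pipeline (increment is again a stand-in function).

```python
from dask.distributed import Client
from streamz import Stream
from tornado.ioloop import IOLoop

def increment(x):
    return x + 1

async def run():
    client = await Client(asynchronous=True, processes=False)
    source = Stream(asynchronous=True)
    (source.scatter()          # ship each element to the dask cluster
           .map(increment)     # work happens on dask workers
           .gather()           # results come back to the local process
           .sink(print))

    for x in range(10):
        await source.emit(x)   # waits until this element has been processed

    await client.close()

IOLoop.current().run_sync(run)
```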
In the simpler case can the thread be |
Original issue description:
I'm using a pipeline that reads text from files via Apache Tika, performs some pre-processing, and writes the results into MongoDB.
The following is a truncated version of my script.
Everything works well, but the last few documents are missing. It seems like the dask process is killed before the tasks have finished. The resulting warnings/errors below support this.
Is this behavior expected and am I using the interface wrong, or is this a bug?
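A minimal sketch of a pipeline along these lines, assuming streamz with dask.distributed; none of the names below (extract_text, preprocess, store, file_paths, the MongoDB database/collection) come from the original script, and the Tika call is only indicated in a comment.

```python
from dask.distributed import Client
from pymongo import MongoClient
from streamz import Stream

client = Client()                      # dask cluster doing the heavy lifting
mongo = MongoClient()                  # assumes a local MongoDB instance

def extract_text(path):
    # the real pipeline uses Apache Tika here,
    # e.g. tika.parser.from_file(path)["content"];
    # a plain read keeps this sketch self-contained
    with open(path, errors="ignore") as f:
        return f.read()

def preprocess(text):
    return text.strip().lower()        # placeholder pre-processing

def store(text):
    mongo.mydb.documents.insert_one({"text": text})

source = Stream()
(source.scatter()                      # send file paths to the dask cluster
       .map(extract_text)
       .map(preprocess)
       .buffer(8)
       .gather()                       # results return to the local process
       .sink(store))                   # write to MongoDB from the local process

file_paths = ["a.txt", "b.txt"]        # placeholder input list

for path in file_paths:
    source.emit(path)

# Without an explicit wait here, the script can exit while tasks are still
# in flight on the cluster, which matches the missing-documents symptom.
```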
Update:
This does not happen when run in a Jupyter notebook. Could this be related to the event loop?
Relevant package versions: