Docker containers are kept open when they crash #7560
Comments
I managed to locate more information about what is going on inside the hanging Docker container (the
Still an issue as of 2.10.4 - as far as I can see it's related to the task runner not shutting down properly when everything crashes. Stacktrace from the hanging Docker container:
@eudyptula do you have an MRE so I can debug this?
@madkinsz Unfortunately not on hand, but my best guess is that you'll need the following:
The timeouts themselves are covered by #9323, but they seem to trigger this issue. The last one I got looked like the following in the agent logs. It managed to create 274 of about 1000 task runs before failing.
Which API call they fail on doesn't seem relevant; I also had them failing on other calls. The 500 errors seem to consistently leave the containers hanging for us. I might try the other task runners later to see if it makes a difference...
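For anyone trying to put together an MRE, here is a minimal sketch of the kind of flow described above: Docker infrastructure, the ConcurrentTaskRunner, and on the order of a thousand mapped task runs. The task body and names are hypothetical and only illustrate the shape of a flow that appears to trigger the hang, not a confirmed reproduction.

```python
from prefect import flow, task
from prefect.task_runners import ConcurrentTaskRunner


@task
def process_item(item: int) -> int:
    # Placeholder work; in the real flows each task run hits the API for
    # state updates, which is where the 500s / timeouts show up.
    return item * 2


@flow(task_runner=ConcurrentTaskRunner())
def many_mapped_tasks():
    # Roughly the scale mentioned above (~1000 task runs created via .map()).
    futures = process_item.map(list(range(1000)))
    return [f.result() for f in futures]


if __name__ == "__main__":
    many_mapped_tasks()
```

Running this through an agent with Docker infrastructure, as in the original report, is assumed rather than shown here.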
DaskTaskRunner generally seems to perform much better than ConcurrentTaskRunner, but 500 Internal Server Errors do cause hanging containers as well. I just switched to it. A dump from the hanging container with Dask:
After today's testing, the solution seems to be to increase the default timeout values, add retries on HTTP 500 to the clients, and switch to the DaskTaskRunner.
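A hedged sketch of what that combination might look like. The DaskTaskRunner import assumes the prefect-dask collection is installed; the setting names in the comments are assumptions to check against your Prefect version, and PREFECT_CLIENT_RETRY_EXTRA_CODES in particular only exists in newer 2.x releases.

```python
from prefect import flow, task
from prefect_dask import DaskTaskRunner  # provided by the prefect-dask collection (assumed installed)

# Assumed environment overrides for the agent / flow-run container; verify the
# exact names and defaults against your Prefect version:
#   PREFECT_API_REQUEST_TIMEOUT=120        # client-side API request timeout (seconds)
#   PREFECT_CLIENT_RETRY_EXTRA_CODES=500   # retry HTTP 500s; only in newer 2.x releases


@task
def process_item(item: int) -> int:
    return item * 2  # placeholder work


@flow(task_runner=DaskTaskRunner())  # no args: spins up a temporary local Dask cluster
def many_mapped_tasks_dask():
    futures = process_item.map(list(range(1000)))
    return [f.result() for f in futures]
```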
Just a quick update: DaskTaskRunner left a container running with 100% CPU usage, which ended up causing several issues across VMs that were sharing the same host - including our Prefect production setup. Reverted all the way back to SequentialTaskRunner... hopefully that is more stable than the other two.
This looks like a duplicate of #9229 - the forked subprocesses here are likely the cause.
First check
Bug summary
On the agent we have long-running containers that are never closed (seen running 40+ hours):
Simple flows (few tasks, no subflows, etc.) seem to run fine, but our more advanced flows (starting many tasks with map, etc.) are consistently crashing. It does seem that the containers sometimes stop correctly when a flow crashes, and sometimes they do not. We will be looking into the flows and whether we made some errors there, but either way a container should be stopped when a flow crashes.
The container and server logs indicate that an HTTP 500 is caused by a database timeout, so I will try to increase PREFECT_ORION_DATABASE_TIMEOUT. Also, notice the successful calls before and after in the server logs.
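For reference, a minimal sketch of raising that timeout. The value is an arbitrary example; in practice the setting has to reach the environment of the process running the Orion server, for example as an environment variable on the server container or via `prefect config set`.

```python
import os

# Must be set in the server process's environment before Orion starts;
# the value (in seconds) is an arbitrary example, not a recommendation.
os.environ["PREFECT_ORION_DATABASE_TIMEOUT"] = "30"
```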
We will try to update Prefect in the near future as well, but we're also occasionally experiencing the network issue (#7512), so we're doing a little trial and error with versions at the moment.
Logs from Docker
Logs from the server
Reproduction
# Install and run prefect...
Error
No response
Versions
Additional context
No response