Internal server error & Lack of retries on Agents when server is failing #9323
After having spent more time looking into this, am I right in the following?
If all queuing is done in the connection pool with only a 5 second timeout, this could explain our issue (at least turning it up seems to help, but I'm still testing at the moment).
Hi! Sorry to hear you're having a tough time. Please try to remember we're a small team with a lot of objectives, and maintaining your own server is not trivial at scale. We do not retry on 500 Internal Server Error responses by default because the request is not guaranteed to be idempotent. The relevant client retry handling is in prefect/src/prefect/client/base.py, lines 254 to 257 (commit ca59d72).
If your database is under heavy load, you may indeed benefit from increasing that connection timeout. Note that the server-side stack trace you provided is not for the request that failed on the agent. Are you running replicas of the API? Are you using something to manage the Postgres connection pool, e.g. pgbouncer?
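For reference, a minimal sketch of what raising these knobs can look like; the setting names are taken from the Prefect 2.x settings module rather than from this thread, so verify them with `prefect config view --show-defaults` for your version:

```python
import os

# Server side: time to wait for a connection from the SQLAlchemy pool
# (default is around 5 s) and per-query timeout (default is around 10 s).
os.environ["PREFECT_API_DATABASE_CONNECTION_TIMEOUT"] = "30"
os.environ["PREFECT_API_DATABASE_TIMEOUT"] = "30"

# Client side: how many times retryable responses (e.g. 429/503) are retried.
os.environ["PREFECT_CLIENT_MAX_RETRIES"] = "10"
```

The same values can also be set with `prefect config set ...` or as plain environment variables on the server and agent containers.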
Hmm, yeah, I see the mismatch in the logs now, sorry about that. I can't seem to find any matching errors in the server logs right now; I'll try to reproduce it later so I hopefully have matching logs.

I also notice in the Prefect code (begin_task_map in engine.py) that tasks created with map seem to be created one by one. That means one API call and one DB insert query per task created, and we have a couple of flows that create 5000 tasks. I assume it's a similar story with logging, cancelling, etc. You really should consider batch processing/inserting these instead.

No replication of PostgreSQL, just a single instance dedicated to Prefect, running as a Docker container on the same server as Prefect. No extra pooling either, but that should all be handled inside Prefect by sqlalchemy/asyncpg, so I don't really see a need for it.
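To make that fan-out pattern concrete, here is a minimal, hypothetical sketch (not the reporter's actual flow) of a mapped task: each element of the iterable becomes its own task run, and each task run is currently created through its own API call:

```python
from prefect import flow, task

@task
def process(item: int) -> int:
    # Placeholder work; each mapped task run also produces its own
    # state transitions and log writes against the API.
    return item * 2

@flow
def fan_out():
    # .map creates one task run per element, so ~5000 elements means
    # roughly 5000 individual task-run creation requests.
    return process.map(range(5000))
```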
Increasing timeouts does seem to help us a lot. I tried setting them back to default to reproduce the stack trace, and this time I got the following (it does seem to be a load issue, so I suppose the failing API call will vary from time to time):

Agent:

Server:
Okay, on top of increasing timeouts and setting retries, I also discovered that switching to DaskTaskRunner (default settings) seems to help a lot. I'm guessing that's because it limits the number of simultaneous task runs to the number of cores available.
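A minimal sketch of making that concurrency bound explicit instead of relying on core detection; it assumes the `prefect-dask` collection is installed, and the worker counts here are illustrative:

```python
from prefect import flow, task
from prefect_dask import DaskTaskRunner

@task
def work(n: int) -> int:
    return n * n

# 4 workers x 1 thread each = at most 4 task runs hitting the API at once.
@flow(task_runner=DaskTaskRunner(
    cluster_kwargs={"n_workers": 4, "threads_per_worker": 1}
))
def bounded_flow():
    return work.map(range(100))
```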
Thanks for the additional information! Some notes:
The limiter will be very helpful. For now we're on the SequentialTaskRunner, as we did not have much success with the Dask one: it kept the Docker containers running at 100% CPU. I didn't spend the time to debug that issue and just switched to Sequential instead. It seems to do the job for now, but it may give us scalability issues later; that'll be a challenge for the future.
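For anyone landing here, Prefect's existing tag-based task concurrency limits can approximate a limiter today; this is a rough sketch, with the tag name and limit made up for illustration:

```python
from prefect import flow, task

# Create the limit once, e.g.:
#   prefect concurrency-limit create db-heavy 10
# Tasks carrying the "db-heavy" tag are then capped at 10 concurrent runs.

@task(tags=["db-heavy"])
def write_row(i: int) -> None:
    ...

@flow
def limited_flow():
    write_row.map(range(1000))
```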
This issue is stale because it has been open 30 days with no activity. To keep this issue open, remove the stale label or comment.
This issue was closed because it has been stale for 14 days with no activity. If this issue is important or you have more to add, feel free to re-open it.
First check
Bug summary
With Prefect 2.10.4, we're still seeing Internal Server Errors on the Prefect API from the agents, causing our flows to fail on a regular basis:
I've attached a stack trace from the server below, from a couple of minutes earlier (the closest I could find).
I assume this still happens when the DB cannot keep up and the retry code you added has retried too many times.
Do you have sufficient delay in the retry code, or is it just retrying a couple of times in quick succession?
If the Prefect API fails with a 500 Internal Server Error, wouldn't it make sense for the Agent to wait a little while and then retry instead of giving up on the first attempt?
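As an illustration of the kind of spacing being asked about, a generic exponential backoff with jitter in plain Python (this is not Prefect's actual retry code):

```python
import random
import time

def call_with_backoff(fn, attempts=5, base=2.0, cap=60.0):
    """Retry fn with exponentially growing, jittered delays instead of
    hammering the API in quick succession. fn must be safe to repeat."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```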
We've been really struggling to get Prefect to run our semi-critical control systems reliably, and we're spending way too many man-hours on platform upgrades and maintenance. We were hoping that Prefect would make our lives easier, not the other way around. Do you have plans for reaching a stable release soon?
Reproduction
# Flow that started 40 tasks
Error
Versions
Additional context
No response