Intermittent fatal error running deployment with docker infra #6519
Comments
Thanks for the report! It seems like this has one of two causes:
If you could add a
I just got another crash, after having added the print (I actually used self.logger.info) as instructed. I've also re-created the VM where the agent runs to quickly upgrade to 2.2.0, so here's my current "prefect version" output:
Print from a functioning run:
Print from a crashing run:
The two prints look identical to me. Let me know if there's anything else I can do to narrow this down.
Had the same crash today. This time I decided to add a snippet of code to inspect the docker object. I also have a theory about the cause.

I haven't paid much attention to the circumstances of when the crash happens, but it seems to be much more common when two deployments are started at almost exactly the same time, which has been the case these last 3 days. The first thing our main flow does is start about 10 async tasks that use prefect.client to start flows. Both these tasks and these flows are limited by task and flow concurrency limits set to 2, so in effect this results in two near-simultaneous calls to the Orion API. If you look at the successful and crashing run logs below, you can see that they are 0.3 seconds apart. The main flow runs on a different agent than these 10 async flows.

Is it possible that when the main flow runs (via schedule or manual trigger), it manages to lazy-load the docker module, but when it then uses prefect.client to trigger two flows on a different VM/agent at almost exactly the same time, there's some sort of race condition where one execution starts lazy-loading the docker module while the other tries to use the barely-loaded instance? My knowledge of the inner workings of Python is essentially zero, so this is pure guesswork.

Working run:
Crashing run:
Not sure what the 9 attributes not shown are about, but maybe it'll tell you something. Since this is a pretty strange error I'll include some more info. The agent runs on Ubuntu 20.04 on an Azure VM that was last re-created on 2022-08-24 with the following cloud config:
These are the contents of the task that runs deployments:
I'll post again if I discover anything else that might be helpful.
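To make the triggering pattern concrete, here is a minimal, hypothetical sketch of two near-simultaneous deployment triggers via the Prefect client. It is not the reporter's actual task, the deployment name is made up, and the client import path may vary slightly across 2.x versions.

    import asyncio

    from prefect import get_client


    async def trigger(deployment_name: str):
        # Create one flow run from an existing deployment; the name is illustrative
        async with get_client() as client:
            deployment = await client.read_deployment_by_name(deployment_name)
            return await client.create_flow_run_from_deployment(deployment.id)


    async def main():
        # Fire two triggers at (nearly) the same time, mirroring the
        # concurrency limit of 2 described above
        await asyncio.gather(
            trigger("my-flow/my-deployment"),
            trigger("my-flow/my-deployment"),
        )


    asyncio.run(main())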
After setting the concurrency limit to 1, we've been running with no issues for 3 days. The issue likely relates to starting two deployments at the same (or nearly the same) time, probably due to hitting the lazy loader twice within a short time span.
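For reference, a rough sketch of setting such a limit programmatically is below; the tag name is made up, and it assumes the client exposes create_concurrency_limit in the version in use (the equivalent CLI is prefect concurrency-limit create <tag> <limit>).

    import asyncio

    from prefect import get_client


    async def main():
        async with get_client() as client:
            # Allow only one task with this tag to run at a time;
            # "run-deployment" is an illustrative tag name
            await client.create_concurrency_limit(
                tag="run-deployment", concurrency_limit=1
            )


    asyncio.run(main())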
Thanks for all these additional details! Looks like there is definitely some sort of concurrency issue with the lazy loader. We're going to need to create an MRE that uses lazy loading of the Docker module to isolate this from all of the other mechanisms. Something like:

    import anyio

    from prefect.utilities.importtools import lazy_import

    docker = lazy_import("docker")


    async def load_attribute():
        docker.errors


    async def main():
        async with anyio.create_task_group() as tg:
            for _ in range(20):
                tg.start_soon(load_attribute)


    anyio.run(main)

Once we've reproduced the issue in this isolated context we can investigate a fix. I'm a bit confused after looking at our implementation as it seems like it should be robust to concurrency.
The MRE looks reasonable to me. Unfortunately it runs on my local machine with no issues, so maybe the problem isn't as simple as I was hoping. Very strange.
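One observation that may explain the non-reproduction (a guess, not from the thread): load_attribute never awaits, so the anyio tasks run to completion one at a time on a single thread and the deferred import can't actually be interleaved. A thread-based variant of the same sketch, shown below under that assumption, exercises real concurrency.

    import threading

    from prefect.utilities.importtools import lazy_import

    docker = lazy_import("docker")


    def load_attribute():
        # Touching an attribute is what forces the deferred import to run
        docker.errors


    threads = [threading.Thread(target=load_attribute) for _ in range(20)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()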
I've just got the same problem:
prefect version: 2.6.7
Having the same issue with prefect 2.7.10. Has anyone tried just renaming the infrastructure/docker.py file to something else? (docker/docker-py#1370) Is it possible to configure Prefect to automatically retry on these errors? As @bjorhn stated, re-running the flow seems to be a temporary workaround.
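On the retry question: Prefect does offer flow-level retries, sketched below, though whether they cover runs that crash at the infrastructure level (as opposed to failing inside the flow) is not something this thread confirms.

    from prefect import flow


    # Retry the flow run up to twice, waiting 30 seconds between attempts
    @flow(retries=2, retry_delay_seconds=30)
    def my_flow():
        ...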
I ran into this issue too with prefect 2.7.10. It occurred when the same flow was run twice at the same time because the agent process had stopped and the flow was behind schedule.
Thanks for the link to the docker-py issue; we can just rename our module to something else once we have the deprecation utils to do so in a backwards-compatible way (#8469).
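For illustration only, a backwards-compatible module rename usually boils down to keeping the old module path as a thin alias for the new one; the package and module names below are made up and are not Prefect's actual layout.

    # mypackage/docker.py -- old path kept as an alias after the real code
    # moves to mypackage/container.py; importing the old path still works
    # but emits a deprecation warning.
    import warnings

    from mypackage.container import DockerContainer  # noqa: F401

    warnings.warn(
        "mypackage.docker is deprecated; import from mypackage.container instead",
        DeprecationWarning,
        stacklevel=2,
    )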
Any updates on this? We still have critical control flows crashing because of this error (we're still on Prefect 2.7.12, though).
Any updates on the status of the fix? We're running Prefect 2.9.0 and still see a lot of these issues (especially at times of high load on the Prefect database in our self-hosted setup; not sure how causal that link is, but the correlation is quite visible).
There's WIP at the linked pull request (#8788), but I'm too busy to get it done right now. I'd be happy if someone took it over.
I've managed to remedy this problem by adding
Haven't seen the issue while testing with the new workers though (yet). |
Interested in the progress on this too; we've had the same issues.
Likewise, interested in this as we are having the same issue. |
@tsugliani & @toby-coleman: We still haven't seen this issue since we switched from the agents to the new workers, so it seems to have been fixed there.
@eudyptula, how complex was the switch from agents to workers? Are they a near drop-in replacement for agents?
I found it pretty much a drop-in replacement for agents. I had already updated our agents to queues, pools, etc. when I did it, on Prefect 2.10.6. The only minor issue was with env vars, where they stopped supporting a specific naming scheme (#9397). Probably still wise to attempt it on a test setup beforehand :)
I have a report that this did not resolve the issue; perhaps because of the backwards-compatibility aliases?
Just wanted to say we saw this error again today with prefect==2.10.12
I'm looking at migrating from a Prefect agent to Prefect workers to mitigate this on the setup I have. Currently I have:
Would I be correct in thinking that this needs to change, i.e. to the following:
The default image for the Docker work pool is the Prefect image, which is not what I want to run the jobs in. |
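If it helps, a rough sketch of overriding the image per deployment when targeting a Docker work pool is below. The flow import and names are made up, and it assumes infra_overrides is the field that feeds the pool's job variables in the version in use, so treat it as a starting point rather than a confirmed recipe.

    from prefect.deployments import Deployment

    from my_project.flows import my_flow  # hypothetical module and flow


    deployment = Deployment.build_from_flow(
        flow=my_flow,
        name="my-deployment",
        work_pool_name="my-docker-pool",
        # Override the image the Docker work pool would otherwise default to
        infra_overrides={"image": "my-registry/my-image:latest"},
    )
    deployment.apply()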
I keep getting the message "State message: Flow run infrastructure exited with non-zero status code 1." when running on Docker. It's intermittent and only happens a few times, but since I run quite a few flows a day, it's becoming a problem for me. It happens with any flow at any time, with no pattern. I need help...
Sorry to keep bumping this issue, but as @leonardorochaperazzini points out, this becomes quite a big problem when you have a lot of flows running. I've investigated migrating from Prefect agents to workers, but this is currently blocked for us by #12933.
First check
Bug summary
We run 40-50 deployments per day and one or two of them will usually crash within the first 5 seconds with a docker error. Re-running the deployment always works. All of our deployments are docker deployments. No logs are sent to the server GUI, but they can be retrieved from the agent.
Let me know if there's any additional information you need.
Reproduction
Since the problem is intermittent it's difficult to put together a minimal example.
Error
Versions
Additional context
Here's the deployment file, in case it's of any help, with certain strings removed.