Intermittent fatal error running deployment with docker infra #6519
Comments
Thanks for the report! It seems like this has one of two causes:
If you could add a
I just got another crash, after having added the print (I actually used self.logger.info) as instructed. I've also re-created the VM where the agent runs to quickly upgrade to 2.2.0, so here's my current "prefect version" output:
Print from a functioning run:
Print from a crashing run:
The two prints look identical to me. Let me know if there's anything else I can do to narrow this down.
Had the same crash today. This time I decided to add a snippet of code to inspect the docker object. I also have a theory about the cause.

I haven't paid much attention to the circumstances of when the crash happens, but it seems to be much more common when two deployments are started at almost exactly the same time, which has been the case these last 3 days. The first thing our main flow does is start about 10 async tasks that use prefect.client to start flows. Both these tasks and these flows are limited by task and flow concurrency limits set to 2, so in effect this results in two near-simultaneous calls to the Orion API. If you look at the successful and crashing run logs below, you can see that they are 0.3 seconds apart. The main flow runs on a different agent than these 10 async flows.

Is it possible that when the main flow runs (via schedule or manual trigger), it manages to lazy-load the docker module, but when it then uses prefect.client to trigger two flows on a different VM/agent at almost exactly the same time, there's some sort of race condition where one execution starts lazy-loading the docker module while the other tries to use the barely-loaded instance? My knowledge of the inner workings of Python is essentially zero, so this is pure guesswork.

Working run:
Crashing run:
Not sure what the 9 attributes not shown are about, but maybe it'll tell you something. Since this is a pretty strange error I'll include some more info. The agent runs on Ubuntu 20.04 on an Azure VM that was last re-created on 2022-08-24 with the following cloud config:
These are the contents of the task that runs deployments:
I'll post again if I discover anything else that might be helpful.
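To make the triggering pattern concrete, here is a minimal, hypothetical sketch of two near-simultaneous deployment triggers via the Prefect client. It is not the reporter's actual task, the deployment name is made up, and the client import path may vary slightly across 2.x versions.

    import asyncio

    from prefect import get_client


    async def trigger(deployment_name: str):
        # Create one flow run from an existing deployment; the name is illustrative
        async with get_client() as client:
            deployment = await client.read_deployment_by_name(deployment_name)
            return await client.create_flow_run_from_deployment(deployment.id)


    async def main():
        # Fire two triggers at (nearly) the same time, mirroring the
        # concurrency limit of 2 described above
        await asyncio.gather(
            trigger("my-flow/my-deployment"),
            trigger("my-flow/my-deployment"),
        )


    asyncio.run(main())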
After setting the concurrency limit to 1, we've been running with no issues for 3 days. The issue likely relates to starting two deployments at the same (or nearly the same) time, probably due to hitting the lazy loader twice within a short time span.
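For reference, a rough sketch of setting such a limit programmatically is below; the tag name is made up, and it assumes the client exposes create_concurrency_limit in the version in use (the equivalent CLI is prefect concurrency-limit create <tag> <limit>).

    import asyncio

    from prefect import get_client


    async def main():
        async with get_client() as client:
            # Allow only one task with this tag to run at a time;
            # "run-deployment" is an illustrative tag name
            await client.create_concurrency_limit(
                tag="run-deployment", concurrency_limit=1
            )


    asyncio.run(main())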
Thanks for all these additional details! Looks like there is definitely some sort of concurrency issue with the lazy loader. We're going to need to create an MRE that uses lazy loading of the Docker module to isolate this from all of the other mechanisms. Something like:

    import anyio

    from prefect.utilities.importtools import lazy_import

    docker = lazy_import("docker")


    async def load_attribute():
        docker.errors


    async def main():
        async with anyio.create_task_group() as tg:
            for _ in range(20):
                tg.start_soon(load_attribute)


    anyio.run(main)

Once we've reproduced the issue in this isolated context we can investigate a fix. I'm a bit confused after looking at our implementation as it seems like it should be robust to concurrency.
The MRE looks reasonable to me. Unfortunately it runs on my local machine with no issues, so maybe the problem isn't as simple as I was hoping. Very strange.
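One observation that may explain the non-reproduction (a guess, not from the thread): load_attribute never awaits, so the anyio tasks run to completion one at a time on a single thread and the deferred import can't actually be interleaved. A thread-based variant of the same sketch, shown below under that assumption, exercises real concurrency.

    import threading

    from prefect.utilities.importtools import lazy_import

    docker = lazy_import("docker")


    def load_attribute():
        # Touching an attribute is what forces the deferred import to run
        docker.errors


    threads = [threading.Thread(target=load_attribute) for _ in range(20)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()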
I've just got the same problem:
prefect version: 2.6.7
Having the same issue with prefect 2.7.10. Has anyone tried just renaming the infrastructure/docker.py file to something else? (docker/docker-py#1370) Is it possible to configure Prefect to automatically retry on these errors? As @bjorhn stated, re-running the flow seems to be a temporary workaround.
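On the retry question: Prefect does offer flow-level retries, sketched below, though whether they cover runs that crash at the infrastructure level (as opposed to failing inside the flow) is not something this thread confirms.

    from prefect import flow


    # Retry the flow run up to twice, waiting 30 seconds between attempts
    @flow(retries=2, retry_delay_seconds=30)
    def my_flow():
        ...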
I ran into this issue too with prefect 2.7.10. It occurred when the same flow was run twice at the same time because the agent process had stopped and the flow was behind schedule.
Thanks for the link to the docker-py issue; we can just rename our module to something else once we have the deprecation utils to do so in a backwards-compatible way (#8469).
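For illustration only, a backwards-compatible module rename usually boils down to keeping the old module path as a thin alias for the new one; the package and module names below are made up and are not Prefect's actual layout.

    # mypackage/docker.py -- old path kept as an alias after the real code
    # moves to mypackage/container.py; importing the old path still works
    # but emits a deprecation warning.
    import warnings

    from mypackage.container import DockerContainer  # noqa: F401

    warnings.warn(
        "mypackage.docker is deprecated; import from mypackage.container instead",
        DeprecationWarning,
        stacklevel=2,
    )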
Any updates on this? We still have critical control flows crashing because of this error (we're still on Prefect 2.7.12, though).
Any updates on the status of the fix? We're running Prefect 2.9.0 and still see a lot of these issues (especially at times of high load on the Prefect database in our self-hosted setup; not sure how causal that link is, but the correlation is quite visible).
There's WIP at the linked pull request (#8788), but I'm too busy to get it done right now. I'd be happy if someone took it over.
I've managed to remedy this problem by adding
Haven't seen the issue while testing with the new workers though (yet). |
Interested in the progress on this too; we've had the same issues.
Likewise, interested in this as we are having the same issue. |
@tsugliani & @toby-coleman: We still haven't seen this issue since we switched from the agents to the new workers, so it seems to have been fixed there.
@eudyptula, how complex was the switch from agents to workers? Are they a near drop-in replacement for agents?
I found it pretty much a drop-in replacement for agents. I had already updated our agents to queues, pools, etc. when I did it, on Prefect 2.10.6. The only minor issue was with env vars, where they stopped supporting a specific naming scheme (#9397). Probably still wise to attempt it on a test setup beforehand :)
I have a report that this did not resolve the issue; perhaps because of the backwards-compatibility aliases?
Just wanted to say we saw this error again today with prefect==2.10.12
I'm looking at migrating from a Prefect agent to Prefect workers to mitigate this on the setup I have. Currently I have:
Would I be correct in thinking that this needs to change, i.e. to the following:
The default image for the Docker work pool is the Prefect image, which is not what I want to run the jobs in. |
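If it helps, a rough sketch of overriding the image per deployment when targeting a Docker work pool is below. The flow import and names are made up, and it assumes infra_overrides is the field that feeds the pool's job variables in the version in use, so treat it as a starting point rather than a confirmed recipe.

    from prefect.deployments import Deployment

    from my_project.flows import my_flow  # hypothetical module and flow


    deployment = Deployment.build_from_flow(
        flow=my_flow,
        name="my-deployment",
        work_pool_name="my-docker-pool",
        # Override the image the Docker work pool would otherwise default to
        infra_overrides={"image": "my-registry/my-image:latest"},
    )
    deployment.apply()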
I keep getting the message "State message: Flow run infrastructure exited with non-zero status code 1." when running on Docker. It's intermittent and only happens a few times, but since I run quite a few flows a day, it's becoming a problem for me. It happens with any flow at any time, with no pattern. I need help...
Sorry to keep bumping this issue, but as @leonardorochaperazzini points out, this becomes quite a big problem when you have a lot of flows running. I've investigated migrating from Prefect agents to workers, but this is currently blocked for us by #12933.
First check
Bug summary
We run 40-50 deployments per day and one or two of them will usually crash within the first 5 seconds with a docker error. Re-running the deployment always works. All of our deployments are docker deployments. No logs are sent to the server GUI, but they can be retrieved from the agent.
Let me know if there's any additional information you need.
Reproduction
Since the problem is intermittent it's difficult to put together a minimal example.
Error
Versions
Additional context
Here's the deployment file, in case it's of any help, with certain strings removed.