
Intermittent fatal error running deployment with docker infra #6519

Open
4 tasks done
bjorhn opened this issue Aug 23, 2022 · 25 comments · Fixed by #8788
Labels
bug Something isn't working needs:research Blocked by investigation into feasibility and cause

Comments

@bjorhn
Contributor

bjorhn commented Aug 23, 2022

First check

  • I added a descriptive title to this issue.
  • I used the GitHub search to find a similar issue and didn't find it.
  • I searched the Prefect documentation for this issue.
  • I checked that this issue is related to Prefect and not one of its dependencies.

Bug summary

We run 40-50 deployments per day and one or two of them will usually crash within the first 5 seconds with a docker error. Re-running the deployment always works. All of our deployments are docker deployments. No logs are sent to the server GUI, but they can be retrieved from the agent.

Let me know if there's any additional information you need.

Reproduction

Since the problem is intermittent it's difficult to put together a minimal example.

Error

08:11:55.728 | INFO    | prefect.agent - Submitting flow run '00dc9668-fb5b-4935-9bfb-0892a865d982'
08:11:56.201 | ERROR   | prefect.agent - Failed to submit flow run '00dc9668-fb5b-4935-9bfb-0892a865d982' to infrastructure.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/prefect/infrastructure/docker.py", line 443, in _get_client
    docker_client = docker.from_env()
AttributeError: module 'docker' has no attribute 'from_env'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/prefect/agent.py", line 199, in submit_run
    await self.task_group.start(submit_flow_run, flow_run, infrastructure)
  File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 807, in start
    return await future
  File "/usr/local/lib/python3.8/dist-packages/prefect/infrastructure/submission.py", line 48, in submit_flow_run
    return await infrastructure.run(task_status=task_status)
  File "/usr/local/lib/python3.8/dist-packages/prefect/infrastructure/docker.py", line 209, in run
    container_id = await run_sync_in_worker_thread(self._create_and_start_container)
  File "/usr/local/lib/python3.8/dist-packages/prefect/utilities/asyncutils.py", line 56, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(call, cancellable=True)
  File "/usr/local/lib/python3.8/dist-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.8/dist-packages/prefect/infrastructure/docker.py", line 245, in _create_and_start_container
    docker_client = self._get_client()
  File "/usr/local/lib/python3.8/dist-packages/prefect/infrastructure/docker.py", line 445, in _get_client
    except docker.errors.DockerException as exc:
AttributeError: module 'docker' has no attribute 'errors'

Versions

Version:             2.1.1
API version:         0.8.0
Python version:      3.8.10
Git commit:          dc2ba222
Built:               Thu, Aug 18, 2022 10:18 AM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         ephemeral
Server:
  Database:          sqlite
  SQLite version:    3.31.1

Additional context

Here's the deployment file, in case it's of any help, with certain strings removed.

###
### A complete description of a Prefect Deployment for flow 'RA Integration'
###
name: RA Integration (prod)
description: Flow for the Reporting Analytics (RA) integration.
version: fbd05955547891a8bc78436ca1598ffd
# The work queue that will handle this deployment's runs
work_queue_name: integration-prod
tags:
- prod
- integration
parameters: {}
schedule: null
infra_overrides:
  image: REMOVED
infrastructure:
  type: docker-container
  env: {}
  labels: {}
  name: null
  command:
  - python
  - -m
  - prefect.engine
  image: prefecthq/prefect:2.1.1-python3.8
  image_pull_policy: null
  image_registry: null
  networks: []
  network_mode: null
  auto_remove: false
  volumes: []
  stream_output: true

###
### DO NOT EDIT BELOW THIS LINE
###
flow_name: RA Integration
manifest_path: null
storage:
  bucket_path: flows/ra_integration
  azure_storage_connection_string: REMOVED
  azure_storage_account_name: null
  azure_storage_account_key: null
  _is_anonymous: true
  _block_document_name: REMOVED
  _block_document_id: REMOVED
  _block_type_slug: azure
path: null
entrypoint: orchestration/flows/ra_integration.py:default
parameter_openapi_schema:
  title: Parameters
  type: object
  properties:
    loglevel:
      title: loglevel
      default: INFO
      type: string
    integration_parameters:
      title: integration_parameters
      default:
        namespace:
        - LOCATION_ACTIVITY_DB
        - COREDB
        backfill: false
        manual stream select: null
        manual delta key override: null
        manual delta range override:
          lower-bound: null
          upper-bound: null
  required: null
  definitions: null
@bjorhn bjorhn added bug Something isn't working status:triage labels Aug 23, 2022
@zanieb
Contributor

zanieb commented Aug 23, 2022

Thanks for the report! It seems like this has one of two causes:

  • The docker.py file we have is on your PYTHONPATH and is being loaded instead of the docker module. It seems weird that this would be intermittent.
  • The lazy module loading we use for docker breaks intermittently.

If you could add a print(docker) line in _get_client before we call from_env, so we can see where the module is coming from, it'd help us narrow things down.
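For reference, printing a module object (or its __spec__.origin) shows exactly where it was loaded from. A standalone sketch, using json as a stand-in since docker may not be installed in every environment:

```python
# Inspecting a module's identity and origin. If a local docker.py were
# shadowing the real package, repr() and __spec__.origin would reveal it.
import json  # stand-in for "docker"

print(json)                   # <module 'json' from '...'>
print(json.__spec__.origin)   # filesystem path the module was loaded from
print(hasattr(json, "loads")) # a real attribute of the fully loaded module
```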

@bjorhn
Contributor Author

bjorhn commented Aug 24, 2022

I just got another crash, after having added the print (actually used self.logger.info) as instructed. I've also re-created my VM where the agent runs for a quick upgrade to 2.2.0, so here's my current "prefect version":

Version:             2.2.0
API version:         0.8.0
Python version:      3.8.10
Git commit:          e3651362
Built:               Tue, Aug 23, 2022 2:18 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         ephemeral
Server:
  Database:          sqlite
  SQLite version:    3.31.1

Print from a functioning run:
08:52:53.712 | INFO | prefect.infrastructure.docker-container - madkinsz-log: <module 'docker' from '/usr/local/lib/python3.8/dist-packages/docker/__init__.py'>

Print from a crashing run:

08:52:58.952 | INFO    | prefect.infrastructure.docker-container - madkinsz-log: <module 'docker' from '/usr/local/lib/python3.8/dist-packages/docker/__init__.py'>
08:52:58.954 | ERROR   | prefect.agent - Failed to submit flow run '2e3b5431-2235-4345-9eb1-e39e0e31eaf1' to infrastructure. AttributeError: module 'docker' has no attribute 'from_env'

During handling of the above exception, another exception occurred:
AttributeError: module 'docker' has no attribute 'errors'

The two prints look identical to me. Let me know if there's anything else I can do in order to narrow this down.

@bjorhn
Contributor Author

bjorhn commented Aug 25, 2022

Had the same crash today. This time I decided to add a snippet of code to inspect the docker object. I also have a theory regarding the cause.

I haven't paid much attention to the circumstances of when the crash happens, but it seems to be much more common when two deployments are started at nearly the same exact time. That's been the case these last 3 days.

The first thing our main flow does is start about 10 async tasks that use prefect.client to start flows. Both these tasks and these flows are limited by task- and flow concurrency limits set to 2, so in effect this results in two near-simultaneous calls to the Orion API. If you look at the successful and crashing run logs below you can see that they are 0.3 sec apart.

The main flow is run on a different agent than these 10 async flows. Is it possible that when the main flow runs (via schedule or manual trigger), it manages to lazy-load the docker module, but when it then uses prefect.client to trigger two flows on a different VM/agent at nearly the same time, there's some sort of race condition where one execution starts lazy-loading the docker module while the other tries to use the barely-loaded instance? My knowledge of the inner workings of Python is essentially zero, so this is pure guesswork.

Working run:

06:53:00.120 | INFO    | prefect.infrastructure.docker-container - Inspecting docker object...
╭─ <module 'docker' from '/usr/local/lib/python3.8/dist-packages/docker/__init─╮
│          api = <module 'docker.api' from                                     │
│                '/usr/local/lib/python3.8/dist-packages/docker/api/__init__.… │
│         auth = <module 'docker.auth' from                                    │
│                '/usr/local/lib/python3.8/dist-packages/docker/auth.py'>      │
│       client = <module 'docker.client' from                                  │
│                '/usr/local/lib/python3.8/dist-packages/docker/client.py'>    │
│    constants = <module 'docker.constants' from                               │
│                '/usr/local/lib/python3.8/dist-packages/docker/constants.py'> │
│      context = <module 'docker.context' from                                 │
│                '/usr/local/lib/python3.8/dist-packages/docker/context/__ini… │
│  credentials = <module 'docker.credentials' from                             │
│                '/usr/local/lib/python3.8/dist-packages/docker/credentials/_… │
│       errors = <module 'docker.errors' from                                  │
│                '/usr/local/lib/python3.8/dist-packages/docker/errors.py'>    │
│       models = <module 'docker.models' from                                  │
│                '/usr/local/lib/python3.8/dist-packages/docker/models/__init… │
│          tls = <module 'docker.tls' from                                     │
│                '/usr/local/lib/python3.8/dist-packages/docker/tls.py'>       │
│    transport = <module 'docker.transport' from                               │
│                '/usr/local/lib/python3.8/dist-packages/docker/transport/__i… │
│        types = <module 'docker.types' from                                   │
│                '/usr/local/lib/python3.8/dist-packages/docker/types/__init_… │
│        utils = <module 'docker.utils' from                                   │
│                '/usr/local/lib/python3.8/dist-packages/docker/utils/__init_… │
│      version = <module 'docker.version' from                                 │
│                '/usr/local/lib/python3.8/dist-packages/docker/version.py'>   │
│    APIClient = class APIClient(base_url=None, version=None, timeout=60,      │
│                tls=False, user_agent='docker-sdk-python/6.0.0',              │
│                num_pools=None, credstore_env=None, use_ssh_client=False,     │
│                max_pool_size=10): A low-level client for the Docker Engine   │
│                API.                                                          │
│      Context = class Context(name, orchestrator=None, host=None,             │
│                endpoints=None, tls=False): A context.                        │
│   ContextAPI = class ContextAPI():                                           │
│                Context API.                                                  │
│                Contains methods for context management:                      │
│                create, list, remove, get, inspect.                           │
│ DockerClient = class DockerClient(*args, **kwargs): A client for             │
│                communicating with a Docker server.                           │
│     from_env = def from_env(**kwargs): Return a client configured from       │
│                environment variables.                                        │
│    TLSConfig = class TLSConfig(client_cert=None, ca_cert=None, verify=None,  │
│                ssl_version=None, assert_hostname=None,                       │
│                assert_fingerprint=None): TLS configuration.                  │
╰──────────────────────────────────────────────────────────────────────────────╯
06:53:00.160 | INFO    | prefect.infrastructure.docker-container - Pulling image 'my-container-registry.azurecr.io/my:tag'...

Crashing run:

06:52:59.704 | INFO    | prefect.infrastructure.docker-container - Inspecting docker object...
╭─ <module 'docker' from '/usr/local/lib/python3.8/dist-packages/docker/__init─╮
│ 9 attribute(s) not shown. Run inspect(inspect) for options.                  │
╰──────────────────────────────────────────────────────────────────────────────╯
06:52:59.708 | ERROR   | prefect.agent - Failed to submit flow run 'eb7c3b16-0877-4acb-b0cb-8256aab64241' to infrastructure.

Not sure what the 9 attributes not shown are about, but maybe it'll tell you something.

Since this is a pretty strange error I'll include some more info. The agent runs on Ubuntu 20.04 on an Azure VM that was last re-created on 2022-08-24 with the following cloud config:

#cloud-config
package_upgrade: true
apt:
  sources:
    docker.list:
      source: deb [arch=amd64] https://download.docker.com/linux/ubuntu $RELEASE stable
      keyid: 9DC858229FC7DD38854AE2D88D81803C0EBFCD88
packages:
  - python3-pip
  - docker-ce
  - docker-ce-cli
write_files:
  - path: /etc/systemd/system/prefect-agent.service
    permissions: '0644'
    content: |
      [Unit]
      Description=Prefect Agent

      [Service]
      User=myuser
      WorkingDirectory=/home/myuser
      Environment=PREFECT_API_URL="https://api.prefect.cloud/api/accounts/my-account/workspaces/my-workspace"
      Environment=PREFECT_API_KEY="{6}"
      ExecStart=/usr/local/bin/prefect agent start --work-queue "{1}-{0}"
      Restart=always
      StandardOutput=append:/var/log/prefect.out.log
      StandardError=append:/var/log/prefect.err.log

      [Install]
      WantedBy=multi-user.target
runcmd:
  - pip3 install --upgrade pip
  - pip3 install prefect==2.2.0
  - runuser -l myuser -c "docker login {3}.azurecr.io -u {4} -p {5}"
  - systemctl daemon-reload
  - systemctl start prefect-agent

These are the contents of the task that runs deployments:

@task
async def run_deployment(flow_name: str, env: Env, run_name: str, parameters: Dict[str, Any] = None, fire_and_forget: bool = False):
    """
    Creates a flow run from a deployment and polls it until it reaches a terminal state.
    If fire and forget is true, no polling is performed and the function exits early.
    """
    logger = get_run_logger()

    async with get_client() as client:
        deployments = await client.read_deployments()

        for deployment in deployments:
            if deployment.name == f"{flow_name} ({env.value})":
                deployment_id = deployment.id
                break

        flow_run = await client.create_flow_run_from_deployment(deployment_id=deployment_id, name=run_name, parameters=parameters)
        flow_run_id = flow_run.id

        logger.info(f"Created flow run for flow {flow_name} with name: {flow_run.name} and id: {flow_run.id}")

        # If fire_and_forget is true, don't poll, just return.
        while not fire_and_forget:
            flow_run = await client.read_flow_run(flow_run_id=flow_run_id)

            if flow_run.state_type in TERMINAL_STATES:
                if flow_run.state_type == StateType.COMPLETED:
                    logger.info(f"Child flow run {flow_name} / {flow_run.name} completed successfully.")
                elif flow_run.state_type in (StateType.CANCELLED, StateType.FAILED, StateType.CRASHED):
                    logger.error(f"Child flow run {flow_name} / {flow_run.name} exited non-successfully.")
                    return Failed()
                else:
                    logger.critical(f"Encountered an unknown terminal state in child flow run {flow_name}: {flow_run.state_type}")
                    return Failed()

                return

            logger.info(f"Waiting for child flow run {flow_name} / {flow_run.name} with state {flow_run.state_type}")

            await asyncio.sleep(30)

I'll post again if I discover anything else that might be helpful.

@bjorhn
Contributor Author

bjorhn commented Aug 27, 2022

After setting the concurrency limit to 1 we've been running with no issues for 3 days. The issue likely relates to starting two deployments at the same (or nearly the same) time, probably due to hitting the lazy loader twice within a short time span.

@zanieb
Copy link
Contributor

zanieb commented Aug 28, 2022

Thanks for all these additional details! Looks like there is definitely some sort of concurrency issue with the lazy loader. We're going to need to create an MRE that uses lazy loading of the Docker module to isolate this from all of the other mechanisms. Something like:

import anyio
from prefect.utilities.importtools import lazy_import

docker = lazy_import("docker")

async def load_attribute():
    docker.errors

async def main():
    async with anyio.create_task_group() as tg:
        for _ in range(20):
            tg.start_soon(load_attribute)

anyio.run(main)

Once we've reproduced the issue in this isolated context we can investigate a fix. I'm a bit confused after looking at our implementation as it seems like it should be robust to concurrency.
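For illustration, one common way to make a lazy proxy robust under concurrency is double-checked locking around the deferred load. A minimal sketch of the general technique (not Prefect's implementation; SafeLazyModule is a hypothetical name):

```python
# Toy thread-safe lazy module proxy: only one thread performs the real
# import, and no thread can observe a half-initialized module.
import importlib
import threading

class SafeLazyModule:
    def __init__(self, name):
        self._name = name
        self._lock = threading.Lock()
        self._module = None

    def __getattr__(self, attr):
        # __getattr__ only fires for attributes missing on the proxy itself,
        # so _name/_lock/_module lookups above never recurse here.
        if self._module is None:          # fast path, unsynchronized read
            with self._lock:
                if self._module is None:  # re-check under the lock
                    self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)

lazy_math = SafeLazyModule("math")  # math stands in for docker
print(lazy_math.sqrt(9.0))
```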

@bjorhn
Contributor Author

bjorhn commented Sep 1, 2022

The MRE looks reasonable to me. Unfortunately it runs on my local machine with no issues, so maybe the problem isn't as simple as I was hoping. Very strange...
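One possible reason the async MRE doesn't reproduce it: the task group touches the module from a single event-loop thread, whereas the failing path in the traceback goes through run_sync_in_worker_thread, i.e. real worker threads. A thread-based sketch built directly on importlib.util.LazyLoader (an assumption about what lazy_import resembles, not Prefect's actual code; decimal stands in for docker) exercises that window, though whether it actually fails depends on Python version and timing:

```python
# Hammer a lazily imported module's first attribute access from many threads.
import importlib.util
import sys
import threading

def lazy_import(name):
    # The standard-library lazy import recipe: module execution is deferred
    # until the first attribute access.
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)
    return module

mod = lazy_import("decimal")  # stand-in for "docker"
failures = []

def touch():
    try:
        mod.Decimal  # first access triggers the deferred exec
    except AttributeError as exc:
        failures.append(exc)

threads = [threading.Thread(target=touch) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("failures:", len(failures))
```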

@zanieb zanieb added the needs:research Blocked by investigation into feasibility and cause label Sep 1, 2022
@nibrag

nibrag commented Nov 17, 2022

I've just got the same problem:

State message: Submission failed. Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/prefect/infrastructure/docker.py", line 541, in _get_client
    docker_client = docker.from_env()
AttributeError: module 'docker' has no attribute 'from_env'

During handling of the above exception, another exception occurred:

AttributeError: module 'docker' has no attribute 'errors'

prefect version: 2.6.7

@eudyptula
Contributor

eudyptula commented Feb 1, 2023

Having the same issue with prefect-2.7.10.

Anyone tried just renaming the infrastructure/docker.py file to something else? (docker/docker-py#1370)

Is it possible to configure Prefect to automatically retry on these errors? As @bjorhn stated, rerunning the flow seems to be a temporary solution.
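No built-in submission retry comes up in this thread, but since rerunning reportedly always works, a user-side retry wrapper is one possible stopgap. A hedged sketch (call_with_retries is a hypothetical helper; the attempts and delay values are arbitrary):

```python
# Retry an async call a few times before giving up, e.g. wrapping
# client.create_flow_run_from_deployment(...) in a lambda.
import asyncio

async def call_with_retries(factory, attempts=3, delay=2.0):
    """Await factory() up to `attempts` times, sleeping `delay` seconds
    between failures; re-raise the last error if all attempts fail."""
    last_exc = None
    for _ in range(attempts):
        try:
            return await factory()
        except Exception as exc:  # in practice, narrow this to AttributeError
            last_exc = exc
            await asyncio.sleep(delay)
    raise last_exc
```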

@ncoy-hlp

ncoy-hlp commented Feb 9, 2023

I ran into this issue too with prefect 2.7.10. It occurred when the same flow was run twice at the same time, due to the agent process having stopped and the flow being behind schedule.

@zanieb
Contributor

zanieb commented Feb 9, 2023

Thanks for the link to the docker-py issue; we can just rename our module to something else once we have the deprecation utils to do so in a backwards-compatible way (#8469).

@eudyptula
Contributor

Any updates on this? We still have critical control flows crashing because of this error (still on Prefect 2.7.12, though).

@Pierre-VF

Pierre-VF commented Apr 18, 2023

Any updates on the status of the fix? We're running Prefect 2.9.0 and still see a lot of these issues, especially at times of high load on the Prefect database in our self-hosted setup (not sure how causal that link is, but the correlation is quite visible).

@zanieb
Contributor

zanieb commented Apr 18, 2023

There's WIP at the linked pull request (#8788) but I'm too busy to get it done right now. I'd be happy if someone took it over.

@runarfu

runarfu commented Apr 24, 2023

I've managed to remedy this problem by adding import docker in the module that contains my @task that is run concurrently (with my_task.map(input_values) or my_task.submit(input_value)). Looking forward to a more permanent fix :)
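The assumed mechanism behind that workaround: an eager import fully executes the module and leaves a complete entry in sys.modules, so later lookups resolve real attributes instead of racing a half-built lazy proxy. A standalone sketch, with json standing in for docker:

```python
# After an eager import, the module in sys.modules is fully initialized;
# any subsequent attribute access is a plain dict lookup with no deferred
# loading left to race on.
import sys

import json  # stand-in for `import docker`

print("json" in sys.modules)              # the fully loaded module is cached
print(hasattr(sys.modules["json"], "dumps"))  # attributes already present
```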

@eudyptula
Contributor

The import docker solution may make this less likely to happen, but I've still seen it.

Haven't seen the issue while testing with the new workers though (yet).

@tsugliani

Interested in the progress on this too, had the same issues.

@toby-coleman
Contributor

Likewise, interested in this as we are having the same issue.

@eudyptula
Contributor

@tsugliani & @toby-coleman: We still haven't seen this issue since we switched from the agents to the new workers, so they seem to have fixed this issue there.

@toby-coleman
Contributor

@eudyptula how complex was the switch from agents to workers? Are they a near drop-in replacement for agents?

@eudyptula
Contributor

I found it pretty much a drop-in replacement for agents. I had already updated our agents to queues, pools, etc. when I did it, on Prefect 2.10.6.

Only had a minor issue with env. vars, where they stopped supporting a specific naming scheme (#9397).

Probably still wise to attempt it on a test setup beforehand :)

@zanieb
Contributor

zanieb commented May 26, 2023

I have a report that this did not resolve the issue; perhaps because of the backwards compatibility aliases?

@jashbycu

jashbycu commented Jun 7, 2023

Just wanted to say we saw this error again today with prefect==2.10.12

Submission failed. Traceback (most recent call last):
  File "/home/oit-ds-prefect/miniconda3/envs/prefect-2/lib/python3.8/site-packages/prefect/infrastructure/container.py", line 650, in _get_client
    docker_client = docker.from_env()
AttributeError: module 'docker' has no attribute 'from_env'

During handling of the above exception, another exception occurred:

AttributeError: module 'docker' has no attribute 'errors'

@toby-coleman
Contributor

toby-coleman commented Jul 6, 2023

I'm looking at migrating from Prefect agents to Prefect workers to mitigate this on my setup. Currently I have:

  • A single work pool, with a prefect agent subscribed to it;
  • A Docker infrastructure block on each deployment, which specifies the image that should be used for that job.

Would I be correct in thinking that this needs to change, i.e. to the following:

  • A work pool for each Docker image that is in use;
  • An instance of the worker for each work pool?

The default image for the Docker work pool is the Prefect image, which is not what I want to run the jobs in.

@leonardorochaperazzini

I keep getting the message "State message: Flow run infrastructure exited with non-zero status code 1." when running on Docker. It's intermittent and only happens a few times, but since I run quite a few flows a day, it's becoming a problem for me. It happens with any flow at any time; there is no pattern. I need help...

@toby-coleman
Contributor

Sorry to keep bumping this issue... but as @leonardorochaperazzini points out this becomes quite a big problem when you have a lot of flows running. I've investigated migrating from Prefect agents to workers, but this is currently blocked for us by #12933.
