
Scheduler not terminating in case of repeated DB errors. #43440

@iw-pavan

Apache Airflow version

2.10.2

If "Other Airflow 2 version" selected, which one?

No response

What happened?

The scheduler was running and launching tasks normally.
Suddenly, database operations started failing with an authentication error:

psycopg2.OperationalError: connection to server at "<Host>" (<IP>), port 6432 failed: FATAL:  server login has been failing, try again later (server_login_retry)
connection to server at "<HOST>" (<IP>), port 6432 failed: FATAL:  server login has been failing, try again later (server_login_retry)


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.11/site-packages/airflow/jobs/scheduler_job_runner.py", line 984, in _execute
    self._run_scheduler_loop()

After a few retries the scheduler exited its loop, but the process was not terminated.
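
To illustrate the symptom outside Airflow, here is a minimal sketch with invented names (not Airflow internals): a non-daemon background worker, standing in for the Kubernetes watcher, keeps the interpreter alive even after the main loop has already died on a DB error.

```python
# Minimal, Airflow-independent sketch of the symptom (invented names,
# not Airflow internals).
import threading
import time


def fake_kubernetes_watch() -> None:
    # Stands in for the watcher: loops forever, restarting its watch.
    while True:
        time.sleep(30)
        print("Kubernetes watch timed out waiting for events. Restarting watch.")


def fake_scheduler_loop() -> None:
    # Stands in for the scheduler loop: fails once the DB becomes unreachable.
    raise RuntimeError("connection to server failed: server_login_retry")


if __name__ == "__main__":
    watcher = threading.Thread(target=fake_kubernetes_watch)  # note: daemon=False
    watcher.start()
    try:
        fake_scheduler_loop()
    except Exception:
        print("Scheduler loop exited with an exception")
    # The main thread returns here, but the interpreter does not exit because
    # the watcher is still running: the same "loop exited, process did not
    # terminate" behaviour as in the report.
```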

What you think should happen instead?

After shutting down all executors and the dag_processor, the process should exit.
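
As a sketch with illustrative names (not the actual SchedulerJobRunner code), this is roughly the shutdown guarantee I would expect: whatever happens in the scheduler loop, all executors and the DAG processor agent are ended, the exception still propagates, and the process terminates.

```python
# Sketch of the expected shutdown guarantee (illustrative names, not the
# actual SchedulerJobRunner code).
import sys


def run_scheduler(executors, processor_agent, scheduler_loop) -> None:
    try:
        scheduler_loop()
    finally:
        for executor in executors:
            try:
                executor.end()  # should also stop any watcher threads/processes it owns
            except Exception:
                print(f"Failed to end executor {executor!r}")
        try:
            processor_agent.end()
        except Exception:
            print("Failed to end the DAG processor agent")


if __name__ == "__main__":
    class DummyExecutor:
        def end(self) -> None:
            print("executor ended")

    class DummyAgent:
        def end(self) -> None:
            print("dag processor agent ended")

    def failing_loop() -> None:
        raise RuntimeError("FATAL: server login has been failing (server_login_retry)")

    try:
        run_scheduler([DummyExecutor()], DummyAgent(), failing_loop)
    except RuntimeError:
        sys.exit(1)  # the process actually exits instead of lingering
```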

How to reproduce

Use hybrid executors with Celery and Kubernetes.

Introduce DB errors (for example, repeated authentication failures at the database or connection pooler); one way to inject them is sketched below.
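
A sketch of one way to inject such errors when experimenting locally (not my exact setup; the connection URL is an example and psycopg2 must be installed): a SQLAlchemy `do_connect` hook that makes every new connection fail, similar to PgBouncer rejecting logins with `server_login_retry`.

```python
# Inject repeated DB connection failures via a SQLAlchemy connect-time hook
# (illustrative sketch; URL and credentials are examples).
from sqlalchemy import create_engine, event, text
from sqlalchemy.exc import OperationalError

engine = create_engine("postgresql+psycopg2://airflow:airflow@localhost:6432/airflow")


@event.listens_for(engine, "do_connect")
def _fail_connect(dialect, conn_rec, cargs, cparams):
    # Mimics: FATAL: server login has been failing, try again later (server_login_retry)
    raise OperationalError("connect", {}, Exception("server_login_retry"))


if __name__ == "__main__":
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
    except OperationalError as exc:
        print("Injected DB error:", exc)
```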

Operating System

Mac/Linux

Versions of Apache Airflow Providers

No response

Deployment

Other Docker-based deployment

Deployment details

No response

Anything else?

The logs below keep repeating, which indicates that some threads have not exited.

[2024-10-27T06:17:10.658+0000] {kubernetes_executor_utils.py:101} INFO - Kubernetes watch timed out waiting for events. Restarting watch.
[2024-10-27T06:17:11.658+0000] {kubernetes_executor_utils.py:140} INFO - Event: and now my watch begins starting at resource_version: 0
[2024-10-27T06:17:11.702+0000] {kubernetes_executor_utils.py:309} INFO - Event: 666aac59b268675b6b2590ff-bs-8ace-s4sjuxfo is Running, annotations: <omitted>
[2024-10-27T06:17:11.712+0000] {kubernetes_executor_utils.py:309} INFO - Event: 666aac59b268675b6b2590ff-bs-44fe-iwuzjfao is Running, annotations: <omitted>
[2024-10-27T06:17:41.715+0000] {kubernetes_executor_utils.py:101} INFO - Kubernetes watch timed out waiting for events. Restarting watch.

I see an old PR for a similar issue: #28685.
Should I change the catch block to catch all exceptions? A hypothetical sketch of what that could look like is below.
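
Hypothetical illustration of the question (invented names, not the actual kubernetes_executor_utils code): a watcher-style loop that only catches a narrow exception keeps running, or dies without cleanup, on unexpected errors such as DB auth failures. Broadening the except clause lets the worker log the error and stop, so the parent process can shut down.

```python
# Hypothetical "catch all exceptions" change in a watcher-style loop
# (invented names, not Airflow's code).
import logging
import time

log = logging.getLogger(__name__)


class WatchTimeout(Exception):
    """Stands in for the narrow exception currently being caught."""


def watcher_loop(watch_once) -> None:
    while True:
        try:
            watch_once()
        except WatchTimeout:
            log.info("Watch timed out waiting for events. Restarting watch.")
            time.sleep(1)
        except Exception:
            # Proposed behaviour: catch everything else, log it, and break out
            # so the thread/process running this loop actually terminates.
            log.exception("Unexpected error in watcher loop, exiting")
            break
```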

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
