
celery worker task goes out of memory silently #61521


Apache Airflow version

Other Airflow 3 version (please specify below)

If "Other Airflow 3 version" selected, which one?

3.1.6

What happened?

We have an issue where a task suddenly gets killed due to running out of memory (exit_code=-9) without this being shown in the UI's task logs.
In some cases we do see a CRITICAL log entry for a given task saying that it indeed exited with code -9. In other cases an empty log is shown (here for attempt 2):

[screenshot: the task log for attempt 2 is shown empty in the UI]

If I check the logs for attempt 1, I see the following, which also does not explain why the worker process was killed:

airflow@airflow-worker-1:/opt/airflow/logs$ cat "/opt/airflow/logs/dag_id=hep_create_dag/run_id=579c45ef-2c6d-471f-a893-1fd4cc26fbb2/task_id=halt_for_approval_if_new_or_reject_if_not_relevant.update_inspire_categories/attempt=1.log"
{"timestamp":"2026-02-06T09:37:41.890482Z","level":"warning","event":"pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.","category":"UserWarning","filename":"/home/airflow/.local/lib/python3.11/site-packages/inspire_schemas/utils.py","lineno":50,"logger":"py.warnings"}
{"timestamp":"2026-02-06T09:37:42.406828Z","level":"info","event":"DAG bundles loaded: dags-folder","logger":"airflow.dag_processing.bundles.manager.DagBundlesManager","filename":"manager.py","lineno":179}
{"timestamp":"2026-02-06T09:37:42.407277Z","level":"info","event":"Filling up the DagBag from /opt/airflow/dags/literature/hep_create_dag.py","logger":"airflow.models.dagbag.DagBag","filename":"dagbag.py","lineno":593}
{"timestamp":"2026-02-06T09:37:46.352446Z","level":"info","event":"AWS Connection (conn_id='s3_conn', conn_type='aws') credentials retrieved from login and password.","logger":"airflow.providers.amazon.aws.utils.connection_wrapper.AwsConnectionWrapper","filename":"connection_wrapper.py","lineno":331}
{"timestamp":"2026-02-06T09:37:48.926710Z","level":"info","event":"Done. Returned value was: None","logger":"airflow.task.operators.airflow.providers.standard.decorators.python._PythonDecoratedOperator","filename":"python.py","lineno":217}
{"timestamp":"2026-02-06T09:37:48.982790Z","level":"error","event":"Top level error","logger":"task","filename":"task_runner.py","lineno":1482,"error_detail":[{"exc_type":"AirflowRuntimeError","exc_value":"API_SERVER_ERROR: {'status_code': 409, 'message': 'Server returned error', 'detail': {'detail': {'reason': 'invalid_state', 'message': 'TI was not in the running state so it cannot be updated', 'previous_state': 'failed'}}}","exc_notes":[],"syntax_error":null,"is_cause":false,"frames":[{"filename":"/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/task_runner.py","lineno":1475,"name":"main"},{"filename":"/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/task_runner.py","lineno":1013,"name":"run"},{"filename":"/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/comms.py","lineno":207,"name":"send"},{"filename":"/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/comms.py","lineno":271,"name":"_get_response"},{"filename":"/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/comms.py","lineno":258,"name":"_from_frame"}],"is_group":false,"exceptions":[]}]}
{"timestamp":"2026-02-06T09:37:49.121613Z","level":"warning","event":"Process exited abnormally","exit_code":1,"logger":"task"}

Only by checking the full worker logs can I tell that the process was indeed killed due to OOM:

2026-02-06T09:37:41.897416Z [info     ] Task execute_workload[c20747ad-5381-4fea-8395-c3aa6a9e92cf] received [celery.worker.strategy] loc=strategy.py:161
2026-02-06T09:37:41.919012Z [info     ] [c20747ad-5381-4fea-8395-c3aa6a9e92cf] Executing workload in Celery: token='eyJ***' ti=TaskInstance(id=UUID('019c30bc-6c04-7dba-8588-3c592e04a708'), dag_version_id=UUID('019c299c-4324-7618-8b47-ed60ca990ba7'), task_id='halt_for_approval_if_new_or_reject_if_not_relevant.update_inspire_categories', dag_id='hep_create_dag', run_id='579c45ef-2c6d-471f-a893-1fd4cc26fbb2', try_number=2, map_index=-1, pool_slots=1, queue='default', priority_weight=32, executor_config=None, parent_context_carrier={}, context_carrier={}) dag_rel_path=PurePosixPath('literature/hep_create_dag.py') bundle_info=BundleInfo(name='dags-folder', version=None) log_path='dag_id=hep_create_dag/run_id=579c45ef-2c6d-471f-a893-1fd4cc26fbb2/task_id=halt_for_approval_if_new_or_reject_if_not_relevant.update_inspire_categories/attempt=2.log' type='ExecuteTask' [airflow.providers.celery.executors.celery_executor_utils] loc=celery_executor_utils.py:156
2026-02-06T09:37:41.952184Z [info     ] Secrets backends loaded for worker [supervisor] backend_classes=['EnvironmentVariablesBackend'] count=1 loc=supervisor.py:1975
2026-02-06T09:37:42.004915Z [info     ] Process exited                 [supervisor] exit_code=-9 loc=supervisor.py:710 pid=836631 signal_sent=SIGKILL
2026-02-06T09:37:42.017909Z [error    ] Task execute_workload[c20747ad-5381-4fea-8395-c3aa6a9e92cf] raised unexpected: ServerResponseError('Server returned error') [celery.app.trace] loc=trace.py:285
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/celery/app/trace.py", line 479, in trace_task
    R = retval = fun(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/celery/app/trace.py", line 779, in __protected_call__
    return self.run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/celery/executors/celery_executor_utils.py", line 164, in execute_workload
    supervise(
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/supervisor.py", line 1984, in supervise
    process = ActivitySubprocess.start(
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/supervisor.py", line 955, in start
    proc._on_child_started(ti=what, dag_rel_path=dag_rel_path, bundle_info=bundle_info)
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/supervisor.py", line 966, in _on_child_started
    ti_context = self.client.task_instances.start(ti.id, self.pid, start_date)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/api/client.py", line 215, in start
    resp = self.client.patch(f"task-instances/{id}/run", content=body.model_dump_json())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/httpx/_client.py", line 1218, in patch
    return self.request(
           ^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 338, in wrapped_f
    return copy(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 477, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 378, in iter
    result = action(retry_state)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 400, in <lambda>
    self._add_action_func(lambda rs: rs.outcome.result())
                                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/python/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/python/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 480, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/api/client.py", line 885, in request
    return super().request(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/httpx/_client.py", line 825, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/httpx/_client.py", line 914, in send
    response = self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/httpx/_client.py", line 942, in _send_handling_auth
    response = self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/httpx/_client.py", line 999, in _send_handling_redirects
    raise exc
  File "/home/airflow/.local/lib/python3.11/site-packages/httpx/_client.py", line 982, in _send_handling_redirects
    hook(response)
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/api/client.py", line 186, in raise_on_4xx_5xx_with_note
    return get_json_error(response) or response.raise_for_status()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/api/client.py", line 176, in get_json_error
    raise err
airflow.sdk.api.client.ServerResponseError: Server returned error
Correlation-id=019c3250-7e4c-7c55-b16a-2d607aceb0ef
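
For context (this is standard POSIX behaviour, not something stated in the logs themselves): a negative exit code means the child process was terminated by the signal with that number, so exit_code=-9 corresponds to SIGKILL, which is exactly what the kernel OOM killer sends. A quick check in Python:

import signal

# A negative returncode means the child was killed by a signal:
# returncode == -signum, so -9 maps to SIGKILL (the OOM killer's signal).
exit_code = -9
print(signal.Signals(-exit_code).name)  # SIGKILL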

What do you think should happen instead?

A CRITICAL error should be displayed in the UI specifying that the process died due to OOM. I've seen this being shown sometimes (I see it also implemented here), but there are cases where that code is not reached, as shown above, even though the worker logs clearly show a SIGKILL with exit_code=-9.
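
To illustrate what I mean (a minimal sketch with a hypothetical helper, not the actual supervisor code), the supervisor could translate a signal-based exit code into an explicit CRITICAL task-log entry instead of leaving the log empty:

import logging
import signal

log = logging.getLogger("task")

def report_abnormal_exit(exit_code: int) -> None:
    # Hypothetical helper: turn a signal-based exit code into an explicit
    # task-log message instead of failing silently.
    if exit_code < 0:
        sig = signal.Signals(-exit_code)
        hint = " (likely killed by the OOM killer)" if sig is signal.SIGKILL else ""
        log.critical("Task process died with signal %s%s", sig.name, hint)
    elif exit_code != 0:
        log.error("Task process exited with code %d", exit_code)

report_abnormal_exit(-9)  # CRITICAL: Task process died with signal SIGKILL (likely killed by the OOM killer)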

How to reproduce

Run many small tasks on a Celery worker that doesn't have enough memory to process them all.
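
A minimal sketch of such a task (hypothetical DAG; assumes a worker container with a tight memory limit, e.g. a few hundred MiB on Kubernetes):

from airflow.sdk import dag, task  # Airflow 3 Task SDK

@dag(schedule=None)
def oom_repro():
    @task
    def eat_memory():
        # Allocate memory until the container limit is hit; the kernel
        # OOM killer then SIGKILLs the task process (exit_code=-9).
        blob = []
        while True:
            blob.append(bytearray(50 * 1024 * 1024))  # 50 MiB per iteration

    eat_memory()

oom_repro()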

Operating System

Debian GNU/Linux 12 (bookworm)

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

Helm chart deployed on Kubernetes.

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
