Conversation

@jscheffl
Contributor

@jscheffl jscheffl commented Nov 30, 2025

Another small increment to remove global statements for PR #58116

This removes a global statement in the Celery provider, a classic case where functools.cache() can help. Centralizing the config retrieval here also improves the code in general.

global is evil.
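The pattern being removed can be sketched like this (hypothetical names and config values, not the actual provider code): a lazily initialized module global mutated via `global` becomes a `functools.cache`-decorated accessor.

```python
import functools


def _read_config(key: str) -> str:
    # Stand-in for an Airflow ``conf.get()`` lookup (hypothetical value).
    return {"broker_url": "redis://localhost:6379/0"}[key]


# Before (the anti-pattern): a module-level ``_broker_url = None``
# plus a getter that does ``global _broker_url`` on first use.
#
# After: compute once on first call, then serve from the cache --
# the same lazy-singleton effect, without mutable module state.
@functools.cache
def broker_url() -> str:
    return _read_config("broker_url")
```

A side benefit for tests: the decorated function exposes `broker_url.cache_clear()`, which resets the memoized value explicitly instead of reaching into module globals.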

@jscheffl jscheffl force-pushed the bugfix/remove-global-from-celery branch 2 times, most recently from 7d647ea to 9a1f29d Compare December 2, 2025 21:41
@jscheffl jscheffl force-pushed the bugfix/remove-global-from-celery branch from 9a1f29d to e326e6d Compare December 3, 2025 05:49
@jscheffl jscheffl merged commit 28e6122 into apache:main Dec 3, 2025
79 checks passed
bbovenzi pushed a commit to astronomer/airflow that referenced this pull request Dec 3, 2025
* Remove global from celery provider

* Remove global from celery provider, add caching
itayweb pushed a commit to itayweb/airflow that referenced this pull request Dec 6, 2025
* Remove global from celery provider

* Remove global from celery provider, add caching
@michaelosthege
Contributor

michaelosthege commented Jan 29, 2026

We're pretty sure that we traced worker processes getting SIGKILLed with the following error to the changes from this PR between 3.13.1 and 3.14.1:

```
[error] Timed out waiting for UP message from <ForkProcess(ForkPoolWorker-2, started daemon)> [celery.concurrency.asynpool] loc=asynpool.py:631
[2026-01-29 15:56:58,818: ERROR/MainProcess] Process 'ForkPoolWorker-2' pid:18 exited with 'signal 9 (SIGKILL)'
```

We upgraded Airflow 3.1.1 → 3.1.6 and while doing so went celery 5.5.3 → 5.6.2 and apache-airflow-providers-celery 3.12.4 → 3.15.0.

By downgrading apache-airflow-providers-celery and stepping through previous versions, we found that the change ~~from 3.14.0 to 3.14.1 (this PR)~~ 3.13.1 → 3.14.0 broke it.

UPDATE: Looks like we got confused during debugging and it was actually the change 3.13.1 → 3.14.0 from #58591.
Maybe the subprocesses within the worker are given the same name, and that's why the check introduced in #58591 kills them before they send their UP message?
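A minimal stdlib sketch of that hypothesis (explicitly not Celery's code): children created with the `fork` start method all inherit the parent's environment, so any identity derived from the hostname alone is identical across sibling workers.

```python
import multiprocessing as mp
import socket


def report_hostname(q: "mp.Queue") -> None:
    # A forked child sees exactly the parent's hostname, so a
    # duplicate check keyed on hostname alone cannot tell
    # sibling pool workers apart.
    q.put(socket.gethostname())


if __name__ == "__main__":
    ctx = mp.get_context("fork")  # POSIX-only start method
    q = ctx.Queue()
    procs = [ctx.Process(target=report_hostname, args=(q,)) for _ in range(3)]
    for p in procs:
        p.start()
    names = [q.get() for _ in procs]
    for p in procs:
        p.join()
    # All three children report one and the same hostname.
    print(len(set(names)))
```

Whether the real check in #58591 keys on the hostname this way is exactly the open question here; the sketch only shows why such a key would collide.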

@jscheffl
Contributor Author

> UPDATE: Looks like we got confused during debugging and it was actually the change 3.13.1 → 3.14.0 from #58591. Maybe the subprocesses within the worker are given the same name, and that's why the check introduced in #58591 kills them before they send their UP message?

@michaelosthege a hypothesis ... if it is not a duplicate hostname, could it be that the duplicate-hostname check sometimes takes longer and runs into the timeout as well? Not sure about the specific error message, though...

@michaelosthege
Contributor

> @michaelosthege hypothesis ... if not duplicate hostname could it be that the duplicate hostname check sometimes needs longer running into timeout as well? Not sure about a specific error message...

No idea how long that code takes to execute, but we had tried increasing celery.concurrency.asynpool.PROC_ALIVE_TIMEOUT to 60 seconds and all it did was delay the killing. If the hostname check raised SystemExit, the child process must already have been dead at that point.
We never saw a log message by those worker child processes. The check from #58591 also appears to run before logging is configured.

How does the UP message travel to the parent? Via the message broker?

We also observed a "clocks out of sync" warning in the logs, but with the worker and Redis containers running on the same host that doesn't make any sense.
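For intuition, here is a stdlib-only sketch (explicitly not Celery's implementation) of the handshake those log lines describe: the parent waits a bounded time for an UP token on a pipe to the child and SIGKILLs it on timeout. As far as we can tell, Celery's UP message is likewise local parent–child IPC rather than broker traffic, with the deadline set by `celery.concurrency.asynpool.PROC_ALIVE_TIMEOUT` (default around 4 seconds).

```python
import multiprocessing as mp
import os
import signal
import time

# Short deadline for the sketch; analogous to PROC_ALIVE_TIMEOUT.
UP_TIMEOUT = 1.0  # seconds

CTX = mp.get_context("fork")  # POSIX-only; SIGKILL also assumes POSIX


def child(conn, delay: float) -> None:
    # Simulates worker boot time before announcing readiness.
    time.sleep(delay)
    conn.send("UP")
    conn.close()


def spawn_and_wait(delay: float) -> str:
    parent_conn, child_conn = CTX.Pipe()
    p = CTX.Process(target=child, args=(child_conn, delay))
    p.start()
    if parent_conn.poll(UP_TIMEOUT):
        msg = parent_conn.recv()  # child booted in time
        p.join()
        return msg
    # "Timed out waiting for UP message" -> forcibly kill the child.
    os.kill(p.pid, signal.SIGKILL)
    p.join()
    return "SIGKILL"
```

With `delay` below the timeout this returns `"UP"`; above it, the parent kills the child, matching the `signal 9 (SIGKILL)` line in the report. Raising the timeout (as tried with 60 seconds) only moves the deadline, which is consistent with what was observed.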
