@kaxil kaxil commented May 23, 2025

closes #50500

Adds a new safeguard for cases where the task subprocess closes before all pipe sockets send EOF.

The supervisor now records the process exit time and forcibly closes any sockets still open after `workers.socket_cleanup_delay`. This stops the supervisor loop from hanging indefinitely and allows the process to exit cleanly.
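
The safeguard described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `MiniSupervisor`, `mark_exited`, `_force_close_stale_sockets` and `demo` are hypothetical names, and only the ideas of recording the exit time, checking the exit code, and applying a cleanup delay come from the description.

```python
import selectors
import socket
import time


class MiniSupervisor:
    """Hypothetical stand-in for the supervisor loop; not Airflow's actual class."""

    def __init__(self, socket_cleanup_delay=0.5):
        self.selector = selectors.DefaultSelector()
        self.socket_cleanup_delay = socket_cleanup_delay  # seconds (assumed unit)
        self.exit_code = None          # set only once the task process has exited
        self.process_exit_time = None  # monotonic timestamp of that exit
        self.open_sockets = 0

    def register(self, sock):
        self.selector.register(sock, selectors.EVENT_READ)
        self.open_sockets += 1

    def mark_exited(self, exit_code):
        # Record when the task process exited so the delay can be measured.
        self.exit_code = exit_code
        self.process_exit_time = time.monotonic()

    def _close(self, sock):
        self.selector.unregister(sock)
        sock.close()
        self.open_sockets -= 1

    def _force_close_stale_sockets(self):
        # Only act after the process has exited and the delay has elapsed.
        if self.exit_code is None or self.process_exit_time is None:
            return
        if time.monotonic() - self.process_exit_time < self.socket_cleanup_delay:
            return
        for key in list(self.selector.get_map().values()):
            self._close(key.fileobj)

    def run(self, poll_interval=0.05):
        # Without the safeguard, this loop would spin forever if EOF never arrives.
        while self.open_sockets > 0:
            for key, _ in self.selector.select(timeout=poll_interval):
                if not key.fileobj.recv(4096):  # empty read means EOF
                    self._close(key.fileobj)
            self._force_close_stale_sockets()
        return self.exit_code


def demo():
    supervisor_end, child_end = socket.socketpair()
    sup = MiniSupervisor(socket_cleanup_delay=0.2)
    sup.register(supervisor_end)
    start = time.monotonic()
    sup.mark_exited(0)  # the "process" exits, but child_end stays open: no EOF
    code = sup.run()
    elapsed = time.monotonic() - start
    child_end.close()
    return code, elapsed, sup.open_sockets
```

In `demo()`, the peer socket is deliberately left open so EOF never arrives; the loop still terminates once the delay has passed. Any data that arrives after the forced close would be dropped in this sketch.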



closes apache#50500

Adds a new safeguard for cases where the task subprocess closes
before all pipe sockets send EOF.

The supervisor now records the
process exit time and forcibly closes any sockets still open after
`workers.task_supervisor_socket_cleanup_delay`. This stops the
supervisor loop from hanging indefinitely and allows the process
to exit cleanly.
@kaxil kaxil force-pushed the supervisor-socket-delay branch from 754b1f7 to 12d8452 Compare May 23, 2025 22:38
@kaxil kaxil added the backport-to-v3-1-test label May 23, 2025

ashb commented May 24, 2025

I don't love this - it "shouldn't" get in this state, and something might be missed as a result


kaxil commented May 24, 2025

> I don't love this - it "shouldn't" get in this state, and something might be missed as a result

True, I was being conservative here but I have a different lead too. Verifying 🕐


kaxil commented May 24, 2025

> > I don't love this - it "shouldn't" get in this state, and something might be missed as a result
>
> True, I was being conservative here but I have a different lead too. Verifying 🕐

That being said, while I don't love this either, it might not be a bad safeguard, since it checks `exit_code`, which should only be set once the task process has exited.

@kaxil kaxil closed this May 25, 2025

kaxil commented May 25, 2025

I have managed to reproduce the bug - not consistently but enough that I can debug this further

I thought I did -- but not anymore

@kaxil kaxil deleted the supervisor-socket-delay branch December 31, 2025 02:19
Successfully merging this pull request may close these issues.

The task supervisor continues running indefinitely, even after the associated task process has completed

2 participants