@ashb (Member) commented Jun 19, 2025

  • Switch the Supervisor/task process from line-based to length-prefixed

The existing JSON Lines based approach had three major drawbacks:

  1. In the case of really large lines (in the region of 10 or 20MB) the Python
    line buffering could sometimes result in a partial read
  2. The JSON based approach didn't have the ability to add any metadata (such
    as errors).
  3. Not every message type/call-site waited for a response, which meant those
    client functions could never get told about an error

One of the ways this line-based approach fell down was if you suddenly tried
to run 100s of triggers at the same time you would get an error like this:

Traceback (most recent call last):
  File "/Users/ash/.local/share/uv/python/cpython-3.12.7-macos-aarch64-none/lib/python3.12/asyncio/streams.py", line 568, in readline
    line = await self.readuntil(sep)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ash/.local/share/uv/python/cpython-3.12.7-macos-aarch64-none/lib/python3.12/asyncio/streams.py", line 663, in readuntil
    raise exceptions.LimitOverrunError(
asyncio.exceptions.LimitOverrunError: Separator is found, but chunk is longer than limit
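
For illustration, the same exception can be provoked without any triggers at all. This is a minimal sketch, not code from the PR: it assumes a deliberately tiny `limit` and feeds a hand-built `StreamReader` (the function name `overlong_line_error` is made up for this example), then calls `readuntil`, the frame that raises in the traceback above.

```python
import asyncio


async def overlong_line_error(limit: int = 16) -> str:
    # Constructed inside a running loop. Building a StreamReader by hand is
    # only done here to keep the demo free of real sockets.
    reader = asyncio.StreamReader(limit=limit)

    # One "line" that is four times longer than the configured limit.
    reader.feed_data(b"x" * (limit * 4) + b"\n")
    reader.feed_eof()

    try:
        await reader.readuntil(b"\n")
    except asyncio.LimitOverrunError as exc:
        return str(exc)
    return "no error"


if __name__ == "__main__":
    print(asyncio.run(overlong_line_error()))
```

The separator is present in the buffer, but the line before it exceeds the reader's limit, so a line-oriented reader cannot hand it over in one piece.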

The other way this caused problems was if you parse a large dag (as in one
with 20k tasks or more) the DagFileProcessor could end up getting a partial
read which would be invalid JSON.

This changes the communications protocol in a few ways.

First off, at the Python level the separate send and receive methods in the
client/task side have been removed and replaced with a single send() that
sends the request, reads the response and raises an error if one is returned.
(But note, right now almost nothing in the supervisor side sets the error;
that will be a future PR.)
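
A rough sketch of what a combined request/response `send()` can look like is shown here. It is not the Task SDK implementation: `Client`, `RemoteError`, `toy_supervisor` and the JSON body with `"body"`/`"error"` keys are all hypothetical names chosen for the example. Only the overall shape (send, wait for the reply, raise if the reply carries an error) comes from the description above.

```python
import json
import socket
import threading


class RemoteError(Exception):
    """Raised on the client when the response carries an error (hypothetical)."""


def _recv_exactly(sock: socket.socket, n: int) -> bytes:
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("peer closed the connection")
        buf += chunk
    return bytes(buf)


def _send_msg(sock: socket.socket, obj: dict) -> None:
    data = json.dumps(obj).encode()
    sock.sendall(len(data).to_bytes(4, "big") + data)  # 4-byte length, then body


def _recv_msg(sock: socket.socket) -> dict:
    length = int.from_bytes(_recv_exactly(sock, 4), "big")
    return json.loads(_recv_exactly(sock, length))


class Client:
    def __init__(self, sock: socket.socket) -> None:
        self._sock = sock
        self._lock = threading.Lock()  # one request/response pair at a time

    def send(self, request: dict):
        with self._lock:
            _send_msg(self._sock, request)
            response = _recv_msg(self._sock)
        if response.get("error") is not None:
            raise RemoteError(response["error"])
        return response["body"]


def toy_supervisor(sock: socket.socket) -> None:
    # Hypothetical handler: answers "ping", reports an error for anything else.
    try:
        while True:
            request = _recv_msg(sock)
            if request.get("type") == "ping":
                _send_msg(sock, {"body": "pong", "error": None})
            else:
                _send_msg(sock, {"body": None, "error": f"unknown request {request!r}"})
    except EOFError:
        pass
    finally:
        sock.close()


def demo():
    client_sock, supervisor_sock = socket.socketpair()
    worker = threading.Thread(target=toy_supervisor, args=(supervisor_sock,), daemon=True)
    worker.start()
    client = Client(client_sock)
    ok = client.send({"type": "ping"})
    try:
        client.send({"type": "bogus"})
        err = None
    except RemoteError as exc:
        err = str(exc)
    client_sock.close()
    worker.join(timeout=5)
    return ok, err
```

Because every call waits for a reply, every call site has a place where an error from the other side can surface as an exception.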

Secondly, the JSON Lines approach has been changed from a line-based protocol
to a binary "frame" one. The protocol (which is the same for whichever side is
sending) is length-prefixed, i.e. we first send the length of the data as a
4-byte big-endian integer, followed by the data itself. This should remove the
possibility of JSON parse errors due to reading incomplete lines.
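
The framing described above can be sketched in a few lines of Python. This is an illustration under assumptions rather than the code in this PR: the helper names `encode_frame` and `read_frame` are invented, and the length is treated as unsigned (`>I`), which the description does not spell out.

```python
import io
import struct
from typing import BinaryIO

_HEADER = struct.Struct(">I")  # 4-byte big-endian length (unsigned is an assumption)


def encode_frame(payload: bytes) -> bytes:
    """Prefix the payload with its length."""
    return _HEADER.pack(len(payload)) + payload


def _read_exactly(stream: BinaryIO, n: int) -> bytes:
    # A single read() may return fewer bytes than asked for, so keep reading
    # until the whole span has arrived or the stream ends.
    buf = bytearray()
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            raise EOFError(f"stream ended after {len(buf)} of {n} bytes")
        buf += chunk
    return bytes(buf)


def read_frame(stream: BinaryIO) -> bytes:
    """Read one length-prefixed frame and return its payload."""
    (length,) = _HEADER.unpack(_read_exactly(stream, _HEADER.size))
    return _read_exactly(stream, length)


if __name__ == "__main__":
    # Newlines inside the payload are just data; they no longer delimit anything.
    stream = io.BytesIO(encode_frame(b'{"a": 1}\n{"b": 2}') + encode_frame(b""))
    print(read_frame(stream), read_frame(stream))
```

The reader always knows how many bytes it is still owed, so a short read turns into "keep reading" instead of a truncated JSON document being handed to the parser.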

Finally, the last change made in this PR is to remove the "extra" requests
socket/channel. Upon closer examination of this comms path I realised that
this socket is unnecessary: since we are in 100% control of the client side we
can make use of the bi-directional nature of socketpair and save file
handles. This also happens to help the run_as_user feature, which is
currently broken: without extra config in the sudoers file, sudo will close
all filehandles other than stdin, stdout, and stderr -- so by introducing this
change we make it easier to re-add run_as_user support.
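
As a small illustration of the bi-directional point, a single `socket.socketpair()` is enough for traffic in both directions. The function name `round_trip` and the payloads are made up for this sketch, and a real reader would loop on `recv` rather than rely on one call returning the whole message.

```python
import socket


def round_trip() -> tuple[bytes, bytes]:
    # One pair of connected sockets; each end can both send and receive.
    supervisor_end, task_end = socket.socketpair()
    with supervisor_end, task_end:
        task_end.sendall(b"request")
        seen_by_supervisor = supervisor_end.recv(1024)

        supervisor_end.sendall(b"response")
        seen_by_task = task_end.recv(1024)
    return seen_by_supervisor, seen_by_task


if __name__ == "__main__":
    print(round_trip())
```

Requests and responses share the same file descriptor on each side, so no second channel has to be opened or kept alive across the process boundary.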

To support this in the DagFileProcessor (where the proc manager uses a single
selector for multiple processes), I have moved the on_close callback to be
part of the object we store in the selector object in the supervisors.
Previously it was just the "on_read" callback; now we store a tuple of
(on_read, on_close), and on_close is called once universally.
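
The idea of keeping both callbacks in the selector's `data` slot can be sketched as follows. This is a simplified stand-in, not the supervisor code: `serve`, `demo` and the convention that `on_read` returns `False` at EOF are assumptions made for the example.

```python
import selectors
import socket


def serve(sel: selectors.BaseSelector) -> None:
    """Dispatch events until every registered socket has been closed."""
    while sel.get_map():
        for key, _events in sel.select(timeout=1.0):
            on_read, on_close = key.data  # the tuple stored at register() time
            still_open = on_read(key.fileobj)
            if not still_open:
                # The close handling lives in one place, whichever process
                # the socket belongs to, so on_close runs exactly once.
                sel.unregister(key.fileobj)
                on_close(key.fileobj)


def demo() -> list[str]:
    log: list[str] = []
    sel = selectors.DefaultSelector()
    peers = []
    for name in ("a", "b"):
        ours, theirs = socket.socketpair()
        peers.append(theirs)

        def on_read(sock, name=name) -> bool:
            data = sock.recv(4096)
            if data:
                log.append(f"{name} read {data!r}")
            return bool(data)  # empty read means the peer closed

        def on_close(sock, name=name) -> None:
            log.append(f"{name} closed")
            sock.close()

        sel.register(ours, selectors.EVENT_READ, (on_read, on_close))

    for peer in peers:
        peer.sendall(b"hi")
        peer.close()

    serve(sel)
    sel.close()
    return log
```

With one selector shared across several processes, the loop does not need to know which process a socket belongs to; it only unpacks the pair it was given.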

This also changes the way comms are handled from the (async) TriggerRunner
process. Previously we had a sync+async lock, but that made it possible to end
up deadlocking things. The change now is to have send on
TriggerCommsDecoder "go back" to the async event loop via async_to_sync, so
that only async code deals with the socket, and we can use an async lock
(rather than the hybrid sync and async lock we tried before). This seems to
help the deadlock issue. I'm not 100% sure it will remove it entirely, but
it makes it much, much harder to hit - I've not been able to reproduce it with
this change.
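
The pattern can be approximated with the standard library alone. Note the swap: this sketch uses `asyncio.run_coroutine_threadsafe` in place of asgiref's `async_to_sync`, and the `Comms` class, its counters and the `sleep` standing in for socket I/O are all hypothetical. It only shows the shape: sync callers hop onto the event loop, and a plain `asyncio.Lock` serialises everyone.

```python
import asyncio


class Comms:
    def __init__(self, loop: asyncio.AbstractEventLoop) -> None:
        self._loop = loop
        self._lock = asyncio.Lock()  # the only lock; async code owns the "socket"
        self.in_flight = 0
        self.max_in_flight = 0

    async def asend(self, msg: str) -> str:
        async with self._lock:
            self.in_flight += 1
            self.max_in_flight = max(self.max_in_flight, self.in_flight)
            await asyncio.sleep(0.01)  # stands in for the request/response round trip
            self.in_flight -= 1
            return f"reply to {msg}"

    def send(self, msg: str) -> str:
        # Sync entry point. Must be called from a thread other than the one
        # running the loop, otherwise waiting on the result would block it.
        future = asyncio.run_coroutine_threadsafe(self.asend(msg), self._loop)
        return future.result(timeout=5)


async def main() -> tuple[list[str], int]:
    comms = Comms(asyncio.get_running_loop())
    replies = await asyncio.gather(
        comms.asend("async-1"),
        asyncio.to_thread(comms.send, "sync-1"),
        comms.asend("async-2"),
        asyncio.to_thread(comms.send, "sync-2"),
    )
    return list(replies), comms.max_in_flight


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because the sync path never touches the lock or the socket itself, there is no second locking primitive that could be held while the first one is awaited.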

  • Deal with compat in tests

This compat issue is only in tests, as nothing in the runtime of airflow-core
imports/calls methods directly on SUPERVISOR_COMMS; we are only importing it
in tests to make assertions about the behaviour/to stub the return values.

(cherry picked from commit 492518e)

@ashb (Member, Author) commented Jun 19, 2025

This is a manual backport of #51699 as the automatic one failed with conflicts

@ashb force-pushed the v3-backport-rework-tasksdk-supervisor-comms-protocol branch 3 times, most recently from 8e75460 to 3174938, on June 19, 2025 09:43

@ashb (Member, Author) commented Jun 19, 2025

I've clearly messed something up in the backport, these tests are not stable or working.

@ashb force-pushed the v3-backport-rework-tasksdk-supervisor-comms-protocol branch from ae328a1 to 8cfc277 on June 20, 2025 11:36
@ashb marked this pull request as draft on June 20, 2025 15:15
@ashb force-pushed the v3-backport-rework-tasksdk-supervisor-comms-protocol branch from 8cfc277 to 1205a25 on June 20, 2025 15:15

@gopidesupavan (Member) commented Jun 22, 2025

@ashb can we add commit #51992 to this backport PR?

@ashb (Member, Author) commented Jun 23, 2025

@gopidesupavan We try to do 1:1 of PRs when backporting

@ashb force-pushed the v3-backport-rework-tasksdk-supervisor-comms-protocol branch from 1100163 to bf76fbc on June 23, 2025 11:21
@ashb marked this pull request as ready for review on June 23, 2025 12:02

@amoghrajesh (Contributor) left a comment

LGTM if CI is green!

@ashb merged commit accfbc3 into v3-0-test on Jun 23, 2025
133 of 135 checks passed
@ashb deleted the v3-backport-rework-tasksdk-supervisor-comms-protocol branch on June 23, 2025 12:49
Lee-W added a commit to astronomer/airflow that referenced this pull request Jun 23, 2025
Lee-W added a commit that referenced this pull request Jun 23, 2025