
K8s worker fails monitoring flow and sets it to crashed. #14954

Closed
meggers opened this issue Aug 15, 2024 · 9 comments · Fixed by #15142 or #15220
Labels: bug (Something isn't working)

Comments

meggers commented Aug 15, 2024

Bug summary

We are seeing an issue with the latest k8s worker monitoring flow runs. After some amount of time (5-10 minutes?), we see the worker fail to monitor the flow run with the message provided in additional context. The worker then marks the flow as crashed. The flow continues to run until it tries to update its task status, at which point it fails because the run is already in a terminal state.

Version info (prefect version output)

prefect-client: 2.20.0
prefect-kubernetes: 0.4.1
chart version: 2024.8.3144127

Additional context

Error during task execution: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data for satisfy transfer length header.'>
NoneType: None
meggers added the bug label on Aug 15, 2024
@desertaxle (Member)

Thanks for the issue @meggers!

I'm trying to reproduce this issue, but I haven't been able to do so yet. Do you have any additional logs or tracebacks that you can share?

meggers commented Aug 19, 2024

Thank you @desertaxle! The worker logs don't have anything more than what I added in the additional context. If there is a setting that would expose more information, such as a stack trace, I'm definitely happy to try and get more info!

@desertaxle (Member)

Hmmm, it looks like we're light on logging in the worker. I'll need to add some logging to get more info on the failure. While in there, I'll see if I can add some error handling and retries to the watch logic. I'll respond here once there's a new version to try!
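
For context, the retry would look roughly like this. This is only a minimal sketch against the sync kubernetes client, not the actual change; the field selector and retry count here are illustrative:

import urllib3
from kubernetes import client, config, watch

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
batch_v1 = client.BatchV1Api()

def watch_job_events(namespace: str, job_name: str, max_retries: int = 3):
    """Stream job events, re-establishing the watch if the connection drops."""
    retries = 0
    while True:
        try:
            for event in watch.Watch().stream(
                batch_v1.list_namespaced_job,
                namespace=namespace,
                field_selector=f"metadata.name={job_name}",
            ):
                retries = 0  # reset once events are flowing again
                yield event
            return  # the watch ended normally
        except urllib3.exceptions.ProtocolError:
            # The stream was cut mid-response (e.g. by a proxy or mesh idle
            # timeout); retry the watch instead of crashing the flow run.
            retries += 1
            if retries > max_retries:
                raise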

meggers commented Aug 20, 2024

Thank you! We are eager to resolve this, so please let us know when there's a new version to try!

@jeanluciano (Contributor)

Other users are reporting the same error. We might have to handle this in the worker ourselves.

meggers commented Aug 21, 2024

I wonder if it is something like this: tomplus/kubernetes_asyncio#235 (comment)

I am following up internally to see what we might have in the middle, but it could be something like Istio.

I wonder if it is possible to configure the k8s aiohttp client with keepalives? There is an example in aio-libs/aiohttp#3904 (comment).

@kevingrismore (Contributor)

@meggers That's a good callout. The sync version of the worker did have this, but it doesn't seem like the async version does.
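
For reference, the sync approach was essentially setting TCP keepalive socket options on the client's underlying urllib3 pool. A rough sketch of that idea (the interval values are illustrative and the TCP_KEEP* constants are platform-dependent):

import socket
from kubernetes.client import ApiClient

def enable_socket_keepalive(api_client: ApiClient) -> None:
    """Ask the OS to send TCP keepalive probes so idle watch connections
    aren't silently dropped by intermediaries like a service mesh."""
    socket_options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]

    if hasattr(socket, "TCP_KEEPIDLE"):
        # seconds of idle time before the first probe (Linux)
        socket_options.append((socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30))
    if hasattr(socket, "TCP_KEEPINTVL"):
        # seconds between probes
        socket_options.append((socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 6))
    if hasattr(socket, "TCP_KEEPCNT"):
        # failed probes before the connection is considered dead
        socket_options.append((socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3))

    # urllib3 forwards connection_pool_kw to new connections it opens
    api_client.rest_client.pool_manager.connection_pool_kw[
        "socket_options"
    ] = socket_options

The async worker talks to the API through kubernetes_asyncio/aiohttp, so an equivalent would need to be wired into the aiohttp connector rather than urllib3.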

meggers commented Aug 21, 2024

Makes sense. I was able to confirm we have a timeout of ~5 minutes in our mesh, which aligns with the timing we see with this issue. I think a keepalive would be helpful.

@charlotte-shreve

I'm not sure about all of the teams experiencing this, but at least my team sees the following error immediately after the 400 TransferEncodingError:

Could not determine exit code for 'messy-python-hgtwd'.Exit code will be reported as -1.First container status info did not report an exit code.First container info: {'allocated_resources': None,
 'container_id': 'containerd://336b5ee0dc994e4fa2a635724c9e354f9f1b859b213f2c64af16959a7c822517',
 'image': 'r1k8sacrdev.azurecr.io/istio/proxyv2:1.21.5-distroless',
 'image_id': 'sha256:3b779807c2c39177a38a0713f6b692cee5241afe749a9ecd78171549f2eb7ecf',
 'last_state': {'running': None, 'terminated': None, 'waiting': None},
 'name': 'istio-proxy',
 'ready': True,
 'resources': None,
 'restart_count': 0,
 'started': True,
 'state': {'running': {'started_at': datetime.datetime(2024, 8, 15, 14, 27, 54, tzinfo=tzlocal())},
           'terminated': None,
           'waiting': None},'volume_mounts': None}

Which looks like it might be related to this bit here:

most_recent_pod = pods.items[0] if pods.items else None

When we run our job pods, there are 3 containers inside a given pod, and it appears that on occasion this check lands on the istio-proxy sidecar instead of the job container...
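
If that's what's happening, one option (just a sketch of the idea, not a claim about what the actual fix does) would be to pick the container status by name instead of by position. This assumes the flow-run container is named 'prefect-job'; adjust to whatever your job template uses:

from kubernetes.client import V1Pod

def job_container_exit_code(pod: V1Pod, container_name: str = "prefect-job"):
    """Return the exit code of the named container, skipping sidecars such as
    istio-proxy; returns None if that container hasn't terminated yet."""
    statuses = pod.status.container_statuses if pod.status else None
    for status in statuses or []:
        if status.name != container_name:
            continue  # ignore sidecar containers
        terminated = status.state.terminated if status.state else None
        return terminated.exit_code if terminated else None
    return None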

For my jobs, when we encounter 400 TransferEncodingError followed by Could not determine exit code like above, the UI shows the flow as Crashed until the job container completes, and at that exact moment the flow is marked Successful - we can see the color change from Blue -> Orange -> Green in the graph in the UI.

Previously, when my team was running versions 2024.5.23194919, 2.16.5, and 0.3.7, we periodically would see the following error message:

An error occurred while monitoring flow run '31893f0f-ddc0-4b60-87cf-b58138515726'. The flow run will not be marked as failed, but an issue may have occurred.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/prefect/workers/base.py", line 908, in _submit_run_and_capture_errors
    result = await self.run(
  File "/usr/local/lib/python3.10/dist-packages/prefect_kubernetes/worker.py", line 599, in run
    status_code = await run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/prefect/utilities/asyncutils.py", line 136, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/prefect_kubernetes/worker.py", line 1028, in _watch_job
    for event in self._job_events(
  File "/usr/local/lib/python3.10/dist-packages/kubernetes/watch/watch.py", line 178, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.10/dist-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for segment in resp.stream(amt=None, decode_content=False):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 1057, in stream
    yield from self.read_chunked(amt, decode_content=decode_content)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 1206, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 1136, in _update_chunk_length
    raise ProtocolError("Response ended prematurely") from None
urllib3.exceptions.ProtocolError: Response ended prematurely

While this error showed up in the worker logs and the UI, the flow was still marked as Running, though the UI would stop updating subflow/task status. Then, when the job container completed, the flow would be marked as Successful/Failed as appropriate and the UI would fill in.

I'm not sure if the Istio-Proxy container being referenced in the new error message is actually relevant to the issue, or if it's something else and then it's just further tripping up trying to write out an error message.
