
K8s worker fails monitoring flow and sets it to crashed. #14954

Closed
meggers opened this issue Aug 15, 2024 · 9 comments · Fixed by #15142 or #15220
Labels: bug (Something isn't working)

Comments

meggers commented Aug 15, 2024

Bug summary

We are seeing an issue with the latest k8s worker monitoring flow runs. After some amount of time (5-10 minutes?), we see the worker fail to monitor the flow run with the message provided in additional context. The worker then marks the flow as crashed. The flow continues to run until it tries to update its task status, at which point it fails because the run is already in a terminal state.

Version info (prefect version output)

prefect-client: 2.20.0
prefect-kubernetes: 0.4.1
chart version: 2024.8.3144127

Additional context

Error during task execution: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data for satisfy transfer length header.'>
NoneType: None
meggers added the bug label on Aug 15, 2024
@desertaxle (Member)

Thanks for the issue @meggers!

I'm trying to reproduce this issue, but I haven't been able to do so yet. Do you have any additional logs or tracebacks that you can share?

meggers commented Aug 19, 2024

Thank you @desertaxle! The worker logs don't have anything more than what I added in the additional context. If there is a setting that would expose more information, such as a stack trace, I'm definitely happy to try and get more info!

@desertaxle (Member)

Hmmm, it looks like we're light on logging in the worker. I'll need to add some logging to get more info on the failure. While in there, I'll see if I can add some error handling and retries to the watch logic. I'll respond here once there's a new version to try!
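
For context, the retry would look roughly like this. This is only a minimal sketch against the sync kubernetes client, not the actual change; the field selector and retry count here are illustrative:

import urllib3
from kubernetes import client, config, watch

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
batch_v1 = client.BatchV1Api()

def watch_job_events(namespace: str, job_name: str, max_retries: int = 3):
    """Stream job events, re-establishing the watch if the connection drops."""
    retries = 0
    while True:
        try:
            for event in watch.Watch().stream(
                batch_v1.list_namespaced_job,
                namespace=namespace,
                field_selector=f"metadata.name={job_name}",
            ):
                retries = 0  # reset once events are flowing again
                yield event
            return  # the watch ended normally
        except urllib3.exceptions.ProtocolError:
            # The stream was cut mid-response (e.g. by a proxy or mesh idle
            # timeout); retry the watch instead of crashing the flow run.
            retries += 1
            if retries > max_retries:
                raise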

meggers commented Aug 20, 2024

Thank you! We are eager to resolve this, so please let us know when there's a new version to try!

@jeanluciano (Contributor)

Other users are reporting the same error. We might have to handle this in the worker ourselves.

meggers commented Aug 21, 2024

I wonder if it is something like this: tomplus/kubernetes_asyncio#235 (comment)

I am following up internally to see what we might have in the middle, but it could be something like Istio.

I wonder if it is possible to configure the k8s aiohttp client with keepalives? There is an example in aio-libs/aiohttp#3904 (comment).

@kevingrismore (Contributor)

@meggers That's a good callout. The sync version of the worker did have this, but it doesn't seem like the async version does.
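
For reference, the sync approach was essentially setting TCP keepalive socket options on the client's underlying urllib3 pool. A rough sketch of that idea (the interval values are illustrative and the TCP_KEEP* constants are platform-dependent):

import socket
from kubernetes.client import ApiClient

def enable_socket_keepalive(api_client: ApiClient) -> None:
    """Ask the OS to send TCP keepalive probes so idle watch connections
    aren't silently dropped by intermediaries like a service mesh."""
    socket_options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]

    if hasattr(socket, "TCP_KEEPIDLE"):
        # seconds of idle time before the first probe (Linux)
        socket_options.append((socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30))
    if hasattr(socket, "TCP_KEEPINTVL"):
        # seconds between probes
        socket_options.append((socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 6))
    if hasattr(socket, "TCP_KEEPCNT"):
        # failed probes before the connection is considered dead
        socket_options.append((socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3))

    # urllib3 forwards connection_pool_kw to new connections it opens
    api_client.rest_client.pool_manager.connection_pool_kw[
        "socket_options"
    ] = socket_options

The async worker talks to the API through kubernetes_asyncio/aiohttp, so an equivalent would need to be wired into the aiohttp connector rather than urllib3.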

meggers commented Aug 21, 2024

Makes sense. I was able to confirm we have a timeout of ~5 minutes in our mesh, which aligns with the timing we see with this issue. I think a keepalive would be helpful.

@charlotte-shreve

I'm not sure about all of the teams experiencing this, but at least my team sees the following error immediately after the 400 TransferEncodingError:

Could not determine exit code for 'messy-python-hgtwd'.Exit code will be reported as -1.First container status info did not report an exit code.First container info: {'allocated_resources': None,
 'container_id': 'containerd://336b5ee0dc994e4fa2a635724c9e354f9f1b859b213f2c64af16959a7c822517',
 'image': 'r1k8sacrdev.azurecr.io/istio/proxyv2:1.21.5-distroless',
 'image_id': 'sha256:3b779807c2c39177a38a0713f6b692cee5241afe749a9ecd78171549f2eb7ecf',
 'last_state': {'running': None, 'terminated': None, 'waiting': None},
 'name': 'istio-proxy',
 'ready': True,
 'resources': None,
 'restart_count': 0,
 'started': True,
 'state': {'running': {'started_at': datetime.datetime(2024, 8, 15, 14, 27, 54, tzinfo=tzlocal())},
           'terminated': None,
           'waiting': None},'volume_mounts': None}

Which looks like it might be related to this bit here:

most_recent_pod = pods.items[0] if pods.items else None

When we run our job pods, there are 3 containers inside a given pod, and it appears that on occasion this check lands on the istio-proxy sidecar instead of the job container...
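
If that's what's happening, one option (just a sketch of the idea, not a claim about what the actual fix does) would be to pick the container status by name instead of by position. This assumes the flow-run container is named 'prefect-job'; adjust to whatever your job template uses:

from kubernetes.client import V1Pod

def job_container_exit_code(pod: V1Pod, container_name: str = "prefect-job"):
    """Return the exit code of the named container, skipping sidecars such as
    istio-proxy; returns None if that container hasn't terminated yet."""
    statuses = pod.status.container_statuses if pod.status else None
    for status in statuses or []:
        if status.name != container_name:
            continue  # ignore sidecar containers
        terminated = status.state.terminated if status.state else None
        return terminated.exit_code if terminated else None
    return None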

For my jobs, when we encounter 400 TransferEncodingError followed by Could not determine exit code like above, the UI shows the flow as Crashed until the job container completes, and at that exact moment the flow is marked Successful - we can see the color change from Blue -> Orange -> Green in the graph in the UI.

Previously, when my team was running versions 2024.5.23194919, 2.16.5, and 0.3.7, we periodically would see the following error message:

An error occurred while monitoring flow run '31893f0f-ddc0-4b60-87cf-b58138515726'. The flow run will not be marked as failed, but an issue may have occurred.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/prefect/workers/base.py", line 908, in _submit_run_and_capture_errors
    result = await self.run(
  File "/usr/local/lib/python3.10/dist-packages/prefect_kubernetes/worker.py", line 599, in run
    status_code = await run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/prefect/utilities/asyncutils.py", line 136, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/prefect_kubernetes/worker.py", line 1028, in _watch_job
    for event in self._job_events(
  File "/usr/local/lib/python3.10/dist-packages/kubernetes/watch/watch.py", line 178, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.10/dist-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for segment in resp.stream(amt=None, decode_content=False):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 1057, in stream
    yield from self.read_chunked(amt, decode_content=decode_content)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 1206, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 1136, in _update_chunk_length
    raise ProtocolError("Response ended prematurely") from None
urllib3.exceptions.ProtocolError: Response ended prematurely

While this error showed up in the worker logs and the UI, the flow was still marked as Running, though the UI would stop updating subflow/task status. Then, when the job container completed, the flow would be marked as Successful/Failed as appropriate and the UI would fill in.

I'm not sure if the Istio-Proxy container being referenced in the new error message is actually relevant to the issue, or if it's something else and then it's just further tripping up trying to write out an error message.
