-
+1: Experiencing exactly the same issue here after upgrading to 2.8.3, running on GKE with Kubernetes 1.29. Briefly:
Restarting the scheduler solves both issues, but it's only a matter of time before the scheduler gets stuck again. Any advice on how to reproduce or how to debug the issue would be most welcome.
-
Yes, I have temporarily solved it by passing a request timeout so that the client throws after 4 minutes.
BTW, one might be tempted to use [...] As far as I can tell it is something in the Python kubernetes library, but I'm not 100% sure.
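A minimal sketch of what passing such a request timeout can look like with the Python kubernetes client; the namespace and the exact 240-second value are assumptions based on the "4 minutes" above, and this is not necessarily how Airflow wires it internally:

```python
# Sketch: bound a pod watch so a silently hung connection raises after ~4
# minutes instead of blocking forever. Namespace and timeout are assumptions.
import urllib3
from kubernetes import client, config, watch

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()
w = watch.Watch()

while True:
    try:
        # _request_timeout is forwarded to the underlying HTTP call; the watch
        # raises ReadTimeoutError if the server sends nothing for 240 seconds.
        for event in w.stream(
            v1.list_namespaced_pod,
            namespace="airflow",
            _request_timeout=240,
        ):
            pod = event["object"]
            print(event["type"], pod.metadata.name, pod.status.phase)
    except urllib3.exceptions.ReadTimeoutError:
        # Timed out: resubscribe instead of sitting on a dead connection.
        continue
```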
-
After updating from Airflow 2.4.3 and cncf-provider 7.5.1 to Airflow 2.8.3 and cncf-provider 8.0.1, we are observing that the executor pod does not get deleted when the task succeeds or fails, which ends up accumulating completed pods in the namespace.
Additionally (not completely sure it is related), after some time of accumulating completed pods, clearing the state on a task results in it being marked as scheduled but never actually starting to run.
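To see whether completed executor pods are piling up, a quick sketch with the Python kubernetes client; the namespace and the label selector are assumptions, so adjust them to whatever labels your worker pods actually carry:

```python
# Sketch: count completed (Succeeded/Failed) worker pods left behind in the
# Airflow namespace. Namespace and label selector are assumptions.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="airflow",
    label_selector="kubernetes_executor=True",
)
leftover = [p for p in pods.items if p.status.phase in ("Succeeded", "Failed")]
for p in leftover:
    print(p.metadata.name, p.status.phase, p.status.start_time)
print(f"{len(leftover)} completed worker pods still present")
```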
Restarting the scheduler solves both issues; in the case of the completed pods, when restarting the scheduler I can see that it attempts to adopt the completed pods and then deletes them.
I cannot reliably reproduce this: after a scheduler restart it might work for a while before the same behaviour starts again (it does, every time), but I am not sure what triggers it (hence a discussion, not an issue).
Both delete_worker_pods and delete_worker_pods_on_failure are set to True, and we have migrated from is_delete_operator_pod to on_finish_action (left at its default value).
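A minimal way to confirm what the scheduler actually sees, assuming these options live in the `kubernetes_executor` section on 2.8.x (in older versions they sat under `kubernetes`):

```python
# Sketch: print the pod-deletion settings as the running Airflow resolves them.
# Assumes Airflow 2.8.x, where these options sit in [kubernetes_executor].
from airflow.configuration import conf

print(conf.getboolean("kubernetes_executor", "delete_worker_pods"))
print(conf.getboolean("kubernetes_executor", "delete_worker_pods_on_failure"))
```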
Anyone any suggestions?
Update:
A few things that bring this closer to something that could potentially be reproduced. A few more facts:
Now, looking through the code and the kubernetes-client library, and observing the logs, the KubernetesJobWatcher process should fail every 5 minutes or so due to this bug:
kubernetes-client/python#2081
And we are observing exactly that: the watcher dies with the specified error and then the health check creates a new one.
As said, this happens every 5 minutes or so, but at some point something fails silently or the health checker fails to kick in: we stop seeing this error every 5 minutes or so (even with no activity on the cluster), and after that there are no more event log lines from the executor utils, which makes us think that the watcher is actually dead.
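To illustrate the watcher-plus-health-check pattern described above, a generic sketch (illustrative only, not Airflow's actual KubernetesJobWatcher code): a child process runs the watch loop, and a periodic check restarts it if it has died.

```python
# Generic sketch of the pattern described above: a watcher subprocess plus a
# periodic health check that recreates it when it dies.
import multiprocessing
import time


def run_watcher() -> None:
    # Placeholder for the real watch loop (see the _request_timeout sketch above).
    while True:
        time.sleep(1)


def start_watcher() -> multiprocessing.Process:
    proc = multiprocessing.Process(target=run_watcher, daemon=True)
    proc.start()
    return proc


if __name__ == "__main__":
    watcher = start_watcher()
    while True:
        time.sleep(30)
        if not watcher.is_alive():
            # If the process died (e.g. the watch call blew up), replace it.
            print("watcher died, restarting")
            watcher = start_watcher()
        # Note: the failure mode described above is the opposite case: the
        # process stays alive while its HTTP stream is silently dead, so an
        # is_alive() check like this one never fires.
```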