-
+1: Experiencing exactly the same issue here after upgrading to 2.8.3, running on GKE with Kubernetes 1.29. Briefly:
Restarting the scheduler solves both issues, but it's only a matter of time before the scheduler gets stuck again. Any advice on how to reproduce or how to debug the issue would be most welcome.
-
Yes, I have temporarily solved it by passing a request timeout so that the client throws after 4 minutes.
BTW, one might be tempted to use [...] As far as I can tell it is something in the Python kubernetes library, but I'm not 100% sure.
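A minimal sketch of what passing such a request timeout can look like with the Python kubernetes client; the namespace and the exact 240-second value are assumptions based on the "4 minutes" above, and this is not necessarily how Airflow wires it internally:

```python
# Sketch: bound a pod watch so a silently hung connection raises after ~4
# minutes instead of blocking forever. Namespace and timeout are assumptions.
import urllib3
from kubernetes import client, config, watch

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()
w = watch.Watch()

while True:
    try:
        # _request_timeout is forwarded to the underlying HTTP call; the watch
        # raises ReadTimeoutError if the server sends nothing for 240 seconds.
        for event in w.stream(
            v1.list_namespaced_pod,
            namespace="airflow",
            _request_timeout=240,
        ):
            pod = event["object"]
            print(event["type"], pod.metadata.name, pod.status.phase)
    except urllib3.exceptions.ReadTimeoutError:
        # Timed out: resubscribe instead of sitting on a dead connection.
        continue
```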
-
After updating from Airflow 2.4.3 and cncf-provider 7.5.1 to Airflow 2.8.3 and cncf-provider 8.0.1, we are observing that the executor pod does not get deleted when the task succeeds or fails, which ends up accumulating completed pods in the namespace.
Additionally (not completely sure it is related), after some time of accumulating completed pods, clearing the state on a task results in it being marked as scheduled but never actually starting to run.
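To see whether completed executor pods are piling up, a quick sketch with the Python kubernetes client; the namespace and the label selector are assumptions, so adjust them to whatever labels your worker pods actually carry:

```python
# Sketch: count completed (Succeeded/Failed) worker pods left behind in the
# Airflow namespace. Namespace and label selector are assumptions.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="airflow",
    label_selector="kubernetes_executor=True",
)
leftover = [p for p in pods.items if p.status.phase in ("Succeeded", "Failed")]
for p in leftover:
    print(p.metadata.name, p.status.phase, p.status.start_time)
print(f"{len(leftover)} completed worker pods still present")
```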
Restarting the scheduler solves both issues; in the case of the completed pods, when restarting the scheduler I can see that it attempts to adopt the completed pods and then deletes them.
I cannot reliably reproduce this: after a scheduler restart it might work for a while before the same behaviour starts again (it does, every time), but I am not sure what triggers it (hence a discussion, not an issue).
Both delete_worker_pods and delete_worker_pods_on_failure are set to True, and we have migrated from is_delete_operator_pod to on_finish_action (left at its default value).
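A minimal way to confirm what the scheduler actually sees, assuming these options live in the `kubernetes_executor` section on 2.8.x (in older versions they sat under `kubernetes`):

```python
# Sketch: print the pod-deletion settings as the running Airflow resolves them.
# Assumes Airflow 2.8.x, where these options sit in [kubernetes_executor].
from airflow.configuration import conf

print(conf.getboolean("kubernetes_executor", "delete_worker_pods"))
print(conf.getboolean("kubernetes_executor", "delete_worker_pods_on_failure"))
```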
Anyone any suggestions?
Update:
A few things that bring this closer to something that could potentially be reproduced. A few more facts:
Now, looking through the code and the kubernetes-client library, and observing the logs, the KubernetesJobWatcher process should fail every 5 minutes or so due to this bug:
kubernetes-client/python#2081
And we are observing exactly that: the watcher dies with the specified error and then the health check creates a new one.
As said, this happens every 5 minutes or so, but at some point something fails silently or the health checker fails to kick in: we stop seeing this error every 5 minutes or so (even with no activity on the cluster), and after that there are no more event log lines from the executor utils, which makes us think that the watcher is actually dead.
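To illustrate the watcher-plus-health-check pattern described above, a generic sketch (illustrative only, not Airflow's actual KubernetesJobWatcher code): a child process runs the watch loop, and a periodic check restarts it if it has died.

```python
# Generic sketch of the pattern described above: a watcher subprocess plus a
# periodic health check that recreates it when it dies.
import multiprocessing
import time


def run_watcher() -> None:
    # Placeholder for the real watch loop (see the _request_timeout sketch above).
    while True:
        time.sleep(1)


def start_watcher() -> multiprocessing.Process:
    proc = multiprocessing.Process(target=run_watcher, daemon=True)
    proc.start()
    return proc


if __name__ == "__main__":
    watcher = start_watcher()
    while True:
        time.sleep(30)
        if not watcher.is_alive():
            # If the process died (e.g. the watch call blew up), replace it.
            print("watcher died, restarting")
            watcher = start_watcher()
        # Note: the failure mode described above is the opposite case: the
        # process stays alive while its HTTP stream is silently dead, so an
        # is_alive() check like this one never fires.
```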