Conversation

@e-galan e-galan commented Apr 30, 2024

Prevent KubernetesPodOperator from raising an exception in a rare scenario: the task runs in sync mode with the reattach_on_restart parameter set to False, and the first task attempt fails because the task process is killed externally by the Kubernetes cluster or another process.

If the task is killed externally, the kill breaks the execution flow (including any try/except blocks) and immediately exits the task, so the pod created for the first try is not properly deleted or updated. This leads to the "More than one pod running with labels" exception, which repeats on the next task tries until the DAG fails completely.

Behavior before the fix:

  1. KubernetesPodOperator starts a new task.
  2. A k8s pod is created to process the task.
  3. For some reason the task process in the pod is killed externally and exits with a non-zero code (for example, -9).
  4. Since the reattach_on_restart parameter is set to False, the operator does not reattach to the existing pod on the next attempt; instead it tries to create a new pod while the original one still exists with the same labels.
  5. The new pod is created.
  6. Before continuing the task, KubernetesPodOperator tries to find the pod using the pod labels stored in the task context.
  7. Two pods with those labels are found, resulting in the exception ("More than one pod running with labels"); a sketch of this lookup step is shown after the list.
  8. The exception continues to be raised on subsequent tries.
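
For illustration, here is a minimal sketch of what the label-based lookup in steps 6-7 amounts to, using the Kubernetes Python client. The function and variable names are illustrative, not the operator's actual find_pod implementation:

```python
from typing import Optional

from airflow.exceptions import AirflowException
from kubernetes import client, config


def find_pod_by_labels(namespace: str, label_selector: str) -> Optional[client.V1Pod]:
    """Illustrative lookup: return the single pod matching the task's labels."""
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    core_v1 = client.CoreV1Api()
    pod_list = core_v1.list_namespaced_pod(namespace=namespace, label_selector=label_selector)
    if len(pod_list.items) > 1:
        # Before the fix, the leftover pod from the killed first attempt also matches
        # the selector, so this branch is hit on every retry.
        raise AirflowException(f"More than one pod running with labels {label_selector}")
    return pod_list.items[0] if pod_list.items else None
```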

Behavior after the fix:
1-6. Same as before the fix.
7. Two pods with the same labels are found.
8. If reattach_on_restart is False, we loop through the matching pods and pick the one that was created last to be used for the current attempt.
9. We update the labels of the previous pod and, depending on the value of the on_finish_action parameter, either keep or remove it.
10. The task continues without the exception (a sketch of the selection step follows this list).
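
A minimal sketch of the selection step, assuming a plain Kubernetes client; the helper name is illustrative and this is not the exact code merged in this PR:

```python
from kubernetes import client


def pick_latest_pod(core_v1: client.CoreV1Api, namespace: str, label_selector: str) -> client.V1Pod:
    """Illustrative sketch: when several pods match, use the most recently created one."""
    pods = core_v1.list_namespaced_pod(namespace=namespace, label_selector=label_selector).items
    # The pod created for the current attempt carries the newest creation timestamp;
    # older matches are leftovers from previous attempts and get relabeled and then
    # kept or deleted according to on_finish_action.
    return max(pods, key=lambda p: p.metadata.creation_timestamp)
```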



@boring-cyborg boring-cyborg bot added the area:providers and provider:cncf-kubernetes labels Apr 30, 2024

@dirrao dirrao left a comment

Incorrect use of find_pod is causing this problem. Use the create_pod response object V1Pod for further operations instead of calling find_pod. And clean up the existing pods before starting the new pod.

@jedcunningham, @hussein-awala WDYT
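
A rough sketch of the alternative being suggested here, with illustrative names rather than the operator's actual methods: reuse the V1Pod returned when the pod is created instead of re-discovering it by labels.

```python
from kubernetes import client


def start_task_pod(core_v1: client.CoreV1Api, namespace: str, pod_spec: client.V1Pod) -> client.V1Pod:
    created = core_v1.create_namespaced_pod(namespace=namespace, body=pod_spec)
    # `created` already identifies the exact pod for this attempt, so later steps
    # (log streaming, status checks, cleanup) can use it directly instead of a
    # label-based lookup that might also match leftover pods from earlier attempts.
    return created
```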

@e-galan e-galan force-pushed the fix-kubernetes-pod-operator-reattach-on-restart-parameter branch from 89f92c1 to 0519403 on May 6, 2024 11:12

e-galan commented May 6, 2024

> Incorrect use of find_pod is causing this problem. Use the create_pod response object V1Pod for further operations instead of calling find_pod. And clean up the existing pods before starting the new pod.
>
> @jedcunningham, @hussein-awala WDYT

@dirrao It does not seem to me that cleanup() was designed to be run at the beginning of the execute_sync() method, given the many checks it contains. Should I refactor it, or just create another method and call it at the start of execute_sync()?

Also, in the case with reattach_on_restart=False we still need to run find_pod() to actually find out that there is an extra pod left over from a previous task attempt, and then update its labels before calling cleanup() (or something with similar functionality). Otherwise it won't be called at all. A sketch of that relabel-and-clean-up step is shown below.
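
For illustration, a hedged sketch of the "update labels, then keep or remove" step for the leftover pod; the helper name and the literal on_finish_action value are assumptions for the example, not the provider's exact code:

```python
from kubernetes import client


def handle_leftover_pod(core_v1: client.CoreV1Api, pod: client.V1Pod, on_finish_action: str) -> None:
    # Relabel the leftover pod so future label-selector lookups no longer match it.
    core_v1.patch_namespaced_pod(
        name=pod.metadata.name,
        namespace=pod.metadata.namespace,
        body={"metadata": {"labels": {"already_checked": "True"}}},
    )
    if on_finish_action == "delete_pod":  # assumed value for the example
        core_v1.delete_namespaced_pod(name=pod.metadata.name, namespace=pod.metadata.namespace)
```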

@e-galan e-galan requested review from Taragolis and dirrao May 7, 2024 09:20

e-galan commented May 8, 2024

@jedcunningham @hussein-awala could you take a look please?


VladaZakharova commented May 16, 2024

Hi @potiuk @hussein-awala!
Can you please check the changes one more time? Thanks!

@eladkal, this fix is somewhat urgent. If it is merged soon, could we also include it in the google-provider release you said we will have this weekend with the AutoML changes? Thank you :)


@shahar1 shahar1 left a comment

LGTM, small comment

@e-galan e-galan force-pushed the fix-kubernetes-pod-operator-reattach-on-restart-parameter branch from 0519403 to 494467b on June 4, 2024 09:44
@e-galan e-galan force-pushed the fix-kubernetes-pod-operator-reattach-on-restart-parameter branch from 494467b to bd68ec5 on June 4, 2024 12:09

eladkal commented Jun 4, 2024

Tests are failing. You will need to add test_exceptions.py.
I think we have a similar convention in the docker provider and other providers, so you can check the tests there for reference.

@e-galan e-galan force-pushed the fix-kubernetes-pod-operator-reattach-on-restart-parameter branch from bd68ec5 to 89949e4 on June 4, 2024 13:12
@eladkal eladkal merged commit 0f6e31f into apache:main Jun 5, 2024
fdemiane pushed a commit to fdemiane/airflow that referenced this pull request Jun 6, 2024
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Jul 26, 2024