Fix Pod Number Exception in the sync mode if reattach_on_restart parameter is False #39329
Conversation
Incorrect use of find_pod is causing this problem. Use the V1Pod object returned by create_pod for further operations instead of calling find_pod, and clean up the existing pods before starting the new pod.
@jedcunningham, @hussein-awala WDYT
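For illustration only, a minimal sketch of what that suggestion could look like using the plain Kubernetes Python client; the helper name start_fresh_pod and its arguments are assumptions made for the example, not the operator's actual internals:

```python
# Sketch of the reviewer's suggestion: clean up leftover pods first, then keep
# the V1Pod returned by the create call instead of re-finding the pod by labels.
# `core_v1`, `pod_request_obj`, and `label_selector` are assumed names.
from kubernetes import client as k8s


def start_fresh_pod(core_v1: k8s.CoreV1Api, namespace: str,
                    pod_request_obj: k8s.V1Pod, label_selector: str) -> k8s.V1Pod:
    # Clean up any leftover pods carrying the same labels before launching a new one.
    leftovers = core_v1.list_namespaced_pod(namespace, label_selector=label_selector)
    for pod in leftovers.items:
        core_v1.delete_namespaced_pod(pod.metadata.name, namespace)

    # Use the V1Pod returned by the create call for all further operations
    # instead of re-discovering the pod by labels later.
    created = core_v1.create_namespaced_pod(namespace, pod_request_obj)
    return created
```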
@dirrao It does not seem to me that ... Also, in the case with ...
@jedcunningham @hussein-awala could you take a look please?
Hi @potiuk @hussein-awala ! @eladkal , this fix is kinda urgent; if it gets merged soon, can we please also include it in the google-provider release that you said we will have this weekend with the changes for AutoML? Thank you : )
shahar1
left a comment
LGTM, small comment
Tests are failing. You will need to add ...
Prevent KubernetesPodOperator from raising an exception in a rare scenario where the task runs in sync mode with the parameter reattach_on_restart set to False and the first task attempt fails because the task process is killed externally by the Kubernetes cluster or another process. When the task is killed externally, the execution flow (including any try/except blocks) is broken and the task exits immediately, so the pod created for the first try is never properly deleted or updated. This leads to the pod number exception, which repeats on every subsequent try until the DAG fails completely.
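For context, a hedged example of a task configured the way the description assumes: sync (non-deferrable) mode, reattach_on_restart set to False, and retries enabled, so an externally killed first attempt leads to a second pod carrying the same labels. The DAG id, image, and command are made up for illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="kpo_reattach_off_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    task = KubernetesPodOperator(
        task_id="long_running_pod",
        name="long-running-pod",
        namespace="default",
        image="busybox",
        cmds=["sh", "-c", "sleep 600"],
        reattach_on_restart=False,    # do not reattach to the pod from the failed try
        on_finish_action="delete_pod",
        retries=2,                    # the retry is what triggers the duplicate-label lookup
    )
```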
Behavior before the fix:

1. KubernetesPodOperator starts a new task.
2. The task process is killed externally by the Kubernetes cluster or another process.
3. The kill breaks the execution flow (including any try/except blocks) and the task exits immediately, so the pod created for the first try is not properly deleted or updated.
4. Airflow starts the next task try.
5. Since the reattach_on_restart parameter is set to False, the operator does not try to restart the task in the same pod for the next attempt, and tries to create a new one while the original pod still exists with the same labels.
6. KubernetesPodOperator tries to find the pod using the pod labels stored in the task context.
7. 2 pods with such labels are found.
8. The pod number exception is raised, and it repeats on the following tries until the DAG fails completely.

Behavior after the fix (a rough sketch of the new logic follows this list):

1-6. Same behavior.
7. 2 pods with such labels are found.
9. If reattach_on_restart is False, then we loop through the pods and pick the one that was created last and assign it to be used for the next attempt.
10. We will update the labels of the previous pod and, depending on the value of the on_finish_action parameter, either keep or remove it.
11. The task will continue without the exception.
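The sketch below illustrates the pick-latest-and-relabel idea with the plain Kubernetes client. The helper name, the core_v1 argument, and the already_checked label are assumptions for illustration, not necessarily the names the operator uses internally:

```python
# Rough sketch of the post-fix behavior: when several pods match the task's
# labels, keep the newest one for the retry and relabel the stale ones so
# later lookups no longer match them.
from kubernetes import client as k8s


def pick_pod_for_retry(core_v1: k8s.CoreV1Api, namespace: str,
                       label_selector: str) -> k8s.V1Pod:
    pods = core_v1.list_namespaced_pod(namespace, label_selector=label_selector).items
    # Sort by creation time so the most recently created pod comes last.
    pods.sort(key=lambda p: p.metadata.creation_timestamp)
    newest, older = pods[-1], pods[:-1]
    for pod in older:
        # Mark the stale pod so it is excluded from future label lookups;
        # it can then be kept or deleted according to on_finish_action.
        core_v1.patch_namespaced_pod(
            pod.metadata.name,
            namespace,
            {"metadata": {"labels": {"already_checked": "True"}}},
        )
    return newest
```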
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.