-
Notifications
You must be signed in to change notification settings - Fork 16.4k
EksCreateNodegroupOperator could leave nodegroups running after failure #61145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
EksCreateNodegroupOperator could leave nodegroups running after failure #61145
Conversation
| status_message="Nodegroup status is", | ||
| status_args=["nodegroup.status"], | ||
| ) | ||
| except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we're now issuing a delete request in all cases, without even trying to determine if the exception was a AWS related error. We're simply catching any exception and deleting. I could also see users wanting the cadaver left behind to investigate why exactly the provisioning failed.
Overall I'm not fully +1 on these changes, but curious to hear from @vincbeck and @ferruzzi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. We should:
- Filter the exceptions
- Have a flag to configure whether we want that auto deletion in case of failure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. We should:
- Filter the exceptions
- Have a flag to configure whether we want that auto deletion in case of failure
Exactly what I was considering when I saw the comment by @o-nikolas .
For the exceptions, do you want to limit it to WaiterError for now? And for the flag, what are your thoughts on opt-in (no cleanup by default) or opt-out (cleanup by default). I believe it should be opt-out because a lot of users might not initially be aware of this flag and then learn of it later after experiencing a bad resource leak. And if the user prefers observability over hard prevention of resource leaks, it should be an intentional decision. In either case, the user should be notified of the orphaned node group via logging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep we can start with WaiterError and extend later if needed but I think it should be enough. And I agree with you, I think it should be opted in by default.
Also, can you make similar changes to your other open PRs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to WaiterError
I don't feel too strongly about the on by default or off by default. It is definitely a behaviour change to have it on by default, so I think perhaps off by default is the right thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, can you make similar changes to your other open PRs?
@vincbeck
Yes. Please give me a day or 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sure 👌
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Looks good to me. I'll wait to see if there are any concerns this this PR, and once merged, I'll merge the others
345ab8b to
2ff71f0
Compare
occur after successful creation (e.g. waiter failures due to missing DescribeNodegroup permissions). This change adds best-effort cleanup when post-create steps fail by attempting to delete the nodegroup that was successfully created. Cleanup errors are logged but do not mask the original exception. This mode is opt-in by default. Tests cover successful cleanup on waiter failure and ensure cleanup failures do not override the original error.
1ed447c to
26b4853
Compare
Description
Added best-effort cleanup for EKS managed nodegroups to ensure nodegroups are deleted when failures occur after a nodegroup has been successfully created. Cleanup behavior is guarded by a flag and is opted in by default.
Previously, nodegroup creation could succeed via
create_nodegroup, but the operator could then fail during post-creation steps (for example, when waiting for nodegroup readiness withwait_for_completion=Trueand missingeks:DescribeNodegrouppermissions). In these cases, the Airflow task failed while the EKS managed nodegroup continued provisioning or running in AWS.Cleanup logic has now been added to the internal
_create_computehelper. If an exception is raised after nodegroup creation during the wait phase, the operator attempts a best-effort deletion of the nodegroup. Cleanup failures are logged but do not mask or replace the original exception.Cleanup is only triggered for post-start EKS nodegroup failures (including
WaiterError), ensuring deletion is attempted only when a nodegroup was successfully created and avoiding interception of non-AWS exceptions.Rationale
EKS managed nodegroups are external resources whose lifecycle extends beyond the execution of the Airflow task. If nodegroup creation succeeds but subsequent steps fail, Airflow may lose the ability to observe or manage the resource, potentially leaving nodegroups running unexpectedly.
Failures after nodegroup creation can occur for multiple reasons, including partial IAM permissions (for example, allowing
eks:CreateNodegroupbut denyingeks:DescribeNodegroup, which is required by the waiter). In such cases, the nodegroup may continue provisioning even though the Airflow task has failed.This change applies only to nodegroup creation and does not affect cluster creation, deletion, or Fargate profiles. Cleanup is scoped narrowly to nodegroups created during the current execution and is only attempted when nodegroup creation has already completed successfully. This prevents interference with unrelated resources while avoiding orphaned EKS-managed infrastructure on post-create failures.
Restricting cleanup to post-creation EKS nodegroup failures prevents unintended deletion in unrelated failure paths while still addressing orphaned nodegroups created during execution.
Notes
These series of changes intentionally avoid introducing a shared abstraction for AWS operator cleanup logic. Resource creation, ownership tracking, and cleanup semantics vary significantly across AWS services, and a generic solution would add complexity without clear benefit. Cleanup is therefore implemented locally where behavior and failure modes are well understood.
Tests
Documentation
The docstring for
EksCreateNodegroupOperatorhas been updated with a brief description of the new flagdelete_nodegroup_on_failure.Backwards Compatibility
A new flag called
delete_nodegroup_on_failurehas been added toEksCreateNodegroupOperatorwith a default setting ofTrue. Best-effort cleanup will now be attempted if a post-creation failure (includingWaiterError) occurs after the nodegroup has been successfully created.Closes: #61142