
Conversation

Contributor

@SameerMesiah97 commented Jan 27, 2026

Description

Added best-effort cleanup for EKS managed nodegroups to ensure nodegroups are deleted when failures occur after a nodegroup has been successfully created. Cleanup behavior is guarded by a flag and is enabled by default.

Previously, nodegroup creation could succeed via create_nodegroup, but the operator could then fail during post-creation steps (for example, when waiting for nodegroup readiness with wait_for_completion=True and missing eks:DescribeNodegroup permissions). In these cases, the Airflow task failed while the EKS managed nodegroup continued provisioning or running in AWS.

Cleanup logic has now been added to the internal _create_compute helper. If an exception is raised after nodegroup creation during the wait phase, the operator attempts a best-effort deletion of the nodegroup. Cleanup failures are logged but do not mask or replace the original exception.

Cleanup is triggered only for post-create EKS nodegroup failures (including WaiterError), ensuring deletion is attempted only when a nodegroup was successfully created and that non-AWS exceptions are not intercepted.
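
A minimal sketch of the pattern described above, assuming a plain boto3 EKS client (the helper name and logging are illustrative, not the provider's actual _create_compute code):

    import logging

    import boto3
    from botocore.exceptions import ClientError, WaiterError

    log = logging.getLogger(__name__)

    def create_nodegroup_with_cleanup(
        cluster: str, nodegroup: str, delete_on_failure: bool = True, **create_kwargs
    ) -> None:
        eks = boto3.client("eks")
        # The nodegroup exists in AWS as soon as this call succeeds.
        eks.create_nodegroup(clusterName=cluster, nodegroupName=nodegroup, **create_kwargs)
        try:
            # Waiting requires eks:DescribeNodegroup; a partial IAM policy fails here.
            eks.get_waiter("nodegroup_active").wait(
                clusterName=cluster, nodegroupName=nodegroup
            )
        except (ClientError, WaiterError):
            if delete_on_failure:
                try:
                    eks.delete_nodegroup(clusterName=cluster, nodegroupName=nodegroup)
                except Exception:
                    # Log cleanup failures, but never mask the original exception.
                    log.exception("Best-effort cleanup of nodegroup %s failed", nodegroup)
            raise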

Rationale

EKS managed nodegroups are external resources whose lifecycle extends beyond the execution of the Airflow task. If nodegroup creation succeeds but subsequent steps fail, Airflow may lose the ability to observe or manage the resource, potentially leaving nodegroups running unexpectedly.

Failures after nodegroup creation can occur for multiple reasons, including partial IAM permissions (for example, allowing eks:CreateNodegroup but denying eks:DescribeNodegroup, which is required by the waiter). In such cases, the nodegroup may continue provisioning even though the Airflow task has failed.

This change applies only to nodegroup creation and does not affect cluster creation, deletion, or Fargate profiles. Cleanup is scoped narrowly to nodegroups created during the current execution and is only attempted when nodegroup creation has already completed successfully. This prevents interference with unrelated resources while avoiding orphaned EKS-managed infrastructure on post-create failures.

Restricting cleanup to post-creation EKS nodegroup failures prevents unintended deletion in unrelated failure paths while still addressing orphaned nodegroups created during execution.

Notes

This series of changes intentionally avoids introducing a shared abstraction for AWS operator cleanup logic. Resource creation, ownership tracking, and cleanup semantics vary significantly across AWS services, and a generic solution would add complexity without clear benefit. Cleanup is therefore implemented locally, where behavior and failure modes are well understood.

Tests

  • Added a unit test verifying that nodegroup deletion is attempted when a failure occurs during the wait phase after successful creation.
  • Added a unit test ensuring that failures during cleanup do not mask or override the original exception (both behaviors are sketched in the test below).
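
A minimal sketch of such a test, reusing the hypothetical create_nodegroup_with_cleanup helper sketched earlier (illustrative only, not the PR's actual tests):

    from unittest import mock

    import pytest
    from botocore.exceptions import WaiterError

    def test_cleanup_attempted_and_original_error_preserved():
        eks = mock.MagicMock()
        # Simulate the waiter failing after a successful create_nodegroup call.
        eks.get_waiter.return_value.wait.side_effect = WaiterError(
            name="nodegroup_active", reason="Max attempts exceeded", last_response={}
        )
        # Make cleanup itself fail, to verify the original error is not masked.
        eks.delete_nodegroup.side_effect = RuntimeError("cleanup failed too")
        with mock.patch("boto3.client", return_value=eks):
            with pytest.raises(WaiterError):  # the waiter error, not the RuntimeError
                create_nodegroup_with_cleanup("my-cluster", "my-nodegroup")
        eks.delete_nodegroup.assert_called_once()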

Documentation

The docstring for EksCreateNodegroupOperator has been updated with a brief description of the new flag delete_nodegroup_on_failure.

Backwards Compatibility

A new flag, delete_nodegroup_on_failure, has been added to EksCreateNodegroupOperator with a default of True. Best-effort cleanup is now attempted when a post-create failure (including WaiterError) occurs after the nodegroup has been successfully created.
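
For users who prefer to keep a failed nodegroup around for debugging, the flag can be disabled. A hypothetical DAG snippet (the other operator arguments are illustrative):

    from airflow.providers.amazon.aws.operators.eks import EksCreateNodegroupOperator

    create_nodegroup = EksCreateNodegroupOperator(
        task_id="create_nodegroup",
        cluster_name="my-cluster",
        nodegroup_name="my-nodegroup",
        nodegroup_subnets=["subnet-0123456789abcdef0"],
        nodegroup_role_arn="arn:aws:iam::123456789012:role/eks-nodegroup-role",
        wait_for_completion=True,
        delete_nodegroup_on_failure=False,  # keep the nodegroup for post-mortem analysis
    )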

Closes: #61142 (EksCreateNodegroupOperator leaks EKS nodegroup on failure with partial IAM permissions)

@boring-cyborg bot added the area:providers and provider:amazon (AWS/Amazon - related issues) labels Jan 27, 2026
            status_message="Nodegroup status is",
            status_args=["nodegroup.status"],
        )
    except Exception:
Contributor

It seems like we're now issuing a delete request in all cases, without even trying to determine whether the exception was an AWS-related error. We're simply catching any exception and deleting. I could also see users wanting the cadaver left behind to investigate why exactly the provisioning failed.
Overall I'm not fully +1 on these changes, but curious to hear from @vincbeck and @ferruzzi

Contributor

I agree. We should:

  • Filter the exceptions
  • Have a flag to configure whether we want that auto deletion in case of failure

Contributor Author

I agree. We should:

  • Filter the exceptions
  • Have a flag to configure whether we want that auto deletion in case of failure

Exactly what I was considering when I saw the comment by @o-nikolas.

For the exceptions, do you want to limit it to WaiterError for now? And for the flag, what are your thoughts on opt-in (no cleanup by default) versus opt-out (cleanup by default)? I believe it should be opt-out, because many users won't initially be aware of this flag and would only learn of it after experiencing a bad resource leak. If a user prefers observability over hard prevention of resource leaks, that should be an intentional decision. In either case, the user should be notified of the orphaned nodegroup via logging.

Contributor

Yep, we can start with WaiterError and extend later if needed, but I think that should be enough. And I agree with you: I think cleanup should be enabled by default.

Also, can you make similar changes to your other open PRs?

Contributor

+1 to WaiterError

I don't feel too strongly about the on by default or off by default. It is definitely a behaviour change to have it on by default, so I think perhaps off by default is the right thing.

Contributor Author

@SameerMesiah97 Jan 28, 2026

Also, can you make similar changes to your other open PRs?

@vincbeck
Yes. Please give me a day or two.

Contributor

@vincbeck Jan 29, 2026

For sure 👌

Contributor Author

@vincbeck

Similar changes have been made to the PRs for EcsRunTaskOperator (#61051) and EmrCreateJobFlowOperator (#61010). PR descriptions have also been updated where applicable.

Contributor

Thank you! Looks good to me. I'll wait to see if there are any concerns with this PR, and once it's merged, I'll merge the others.

@SameerMesiah97 force-pushed the 61142-EKSCreateNodeGroupOperator-Cleanup branch 2 times, most recently from 345ab8b to 2ff71f0 on January 28, 2026 at 19:24
EksCreateNodegroupOperator could leave nodegroups orphaned when failures
occur after successful creation (e.g. waiter failures due to missing
DescribeNodegroup permissions).

This change adds best-effort cleanup when post-create steps fail by attempting
to delete the nodegroup that was successfully created. Cleanup errors are
logged but do not mask the original exception. This behavior is enabled by default.

Tests cover successful cleanup on waiter failure and ensure cleanup failures
do not override the original error.
@SameerMesiah97 force-pushed the 61142-EKSCreateNodeGroupOperator-Cleanup branch from 1ed447c to 26b4853 on January 28, 2026 at 20:19