Add best-effort cleanup to EksCreateNodegroupOperator on post-create failure by SameerMesiah97 · Pull Request #61145 · apache/airflow

SameerMesiah97 · 2026-01-27T19:04:20Z

Description

Added best-effort cleanup for EKS managed nodegroups to ensure nodegroups are deleted when failures occur after a nodegroup has been successfully created. Cleanup behavior is guarded by a flag, delete_nodegroup_on_failure, and is enabled by default.

Previously, nodegroup creation could succeed via create_nodegroup, but the operator could then fail during post-creation steps (for example, when waiting for nodegroup readiness with wait_for_completion=True and missing eks:DescribeNodegroup permissions). In these cases, the Airflow task failed while the EKS managed nodegroup continued provisioning or running in AWS.

Cleanup logic has now been added to the internal _create_compute helper. If an exception is raised after nodegroup creation during the wait phase, the operator attempts a best-effort deletion of the nodegroup. Cleanup failures are logged but do not mask or replace the original exception.

Cleanup is only triggered for post-start EKS nodegroup failures (including WaiterError), ensuring deletion is attempted only when a nodegroup was successfully created and avoiding interception of non-AWS exceptions.
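As a rough illustration of the control flow described above (not the actual operator code), the sketch below uses hypothetical stand-ins: FakeEksHook, FakeClientError, FakeWaiterError, and create_nodegroup_with_cleanup are invented for this example. It assumes the real implementation catches AWS errors such as botocore's ClientError and WaiterError around the wait step; the exact exception list is an assumption.

```python
import logging

log = logging.getLogger(__name__)


class FakeClientError(Exception):
    """Hypothetical stand-in for botocore.exceptions.ClientError."""


class FakeWaiterError(Exception):
    """Hypothetical stand-in for botocore.exceptions.WaiterError."""


class FakeEksHook:
    """Hypothetical hook that records calls instead of talking to AWS."""

    def __init__(self, wait_error=None, delete_error=None):
        self.wait_error = wait_error
        self.delete_error = delete_error
        self.calls = []

    def create_nodegroup(self, cluster, nodegroup):
        self.calls.append("create")

    def wait_until_active(self, cluster, nodegroup):
        self.calls.append("wait")
        if self.wait_error is not None:
            raise self.wait_error

    def delete_nodegroup(self, cluster, nodegroup):
        self.calls.append("delete")
        if self.delete_error is not None:
            raise self.delete_error


def create_nodegroup_with_cleanup(hook, cluster, nodegroup, delete_on_failure=True):
    # If creation itself fails, nothing exists yet, so no cleanup is attempted.
    hook.create_nodegroup(cluster, nodegroup)
    try:
        hook.wait_until_active(cluster, nodegroup)
    except (FakeClientError, FakeWaiterError):
        # Only AWS-style errors reach this branch; other exceptions propagate untouched.
        if delete_on_failure:
            try:
                hook.delete_nodegroup(cluster, nodegroup)
            except Exception:
                # Log the cleanup failure, then fall through to re-raise the original error.
                log.exception("Best-effort cleanup of nodegroup %s failed", nodegroup)
        raise


hook = FakeEksHook(
    wait_error=FakeWaiterError("missing eks:DescribeNodegroup"),
    delete_error=RuntimeError("delete denied"),
)
try:
    create_nodegroup_with_cleanup(hook, "demo-cluster", "demo-nodegroup")
except FakeWaiterError as err:
    print("task fails with the original error:", err)
print(hook.calls)
```

The bare raise sits outside the inner try block, so the waiter error is what the caller sees even when deletion also fails.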

Rationale

EKS managed nodegroups are external resources whose lifecycle extends beyond the execution of the Airflow task. If nodegroup creation succeeds but subsequent steps fail, Airflow may lose the ability to observe or manage the resource, potentially leaving nodegroups running unexpectedly.

Failures after nodegroup creation can occur for multiple reasons, including partial IAM permissions (for example, allowing eks:CreateNodegroup but denying eks:DescribeNodegroup, which is required by the waiter). In such cases, the nodegroup may continue provisioning even though the Airflow task has failed.
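To make the partial-permissions scenario concrete, here is a minimal sketch of an IAM policy document, built as a Python dict, that would allow creation but deny the describe call used by the waiter. The Sid values and the wildcard Resource are illustrative assumptions, and a working role would likely need additional permissions that are omitted here.

```python
import json

# Illustrative policy only: allows creating nodegroups but denies describing them,
# which is the combination that makes the waiter fail after a successful create.
partial_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCreate",  # hypothetical statement id
            "Effect": "Allow",
            "Action": ["eks:CreateNodegroup"],
            "Resource": "*",
        },
        {
            "Sid": "DenyDescribe",  # hypothetical statement id
            "Effect": "Deny",
            "Action": ["eks:DescribeNodegroup"],
            "Resource": "*",
        },
    ],
}

print(json.dumps(partial_policy, indent=2))
```

Note that the best-effort cleanup calls the delete API, so it can only succeed if the same role is also allowed eks:DeleteNodegroup; otherwise the cleanup failure is logged and the original error is still raised.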

This change applies only to nodegroup creation and does not affect cluster creation, deletion, or Fargate profiles. Cleanup is scoped narrowly to nodegroups created during the current execution and is only attempted when nodegroup creation has already completed successfully. This prevents interference with unrelated resources while avoiding orphaned EKS-managed infrastructure on post-create failures.

Restricting cleanup to post-creation EKS nodegroup failures prevents unintended deletion in unrelated failure paths while still addressing orphaned nodegroups created during execution.

Notes

This series of changes intentionally avoids introducing a shared abstraction for AWS operator cleanup logic. Resource creation, ownership tracking, and cleanup semantics vary significantly across AWS services, and a generic solution would add complexity without clear benefit. Cleanup is therefore implemented locally where behavior and failure modes are well understood.

Tests

Added a unit test verifying that nodegroup deletion is attempted when a failure occurs during the wait phase after successful creation.
Added a unit test ensuring that failures during cleanup do not mask or override the original exception.
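
The two tests above could look roughly like the following sketch. It does not reproduce the PR's actual tests: create_then_wait and WaitFailed are hypothetical stand-ins for the operator logic and the waiter error, and unittest.mock is used here only to simulate the hook.

```python
from unittest import mock


class WaitFailed(Exception):
    """Hypothetical stand-in for a waiter failure."""


def create_then_wait(hook, name):
    """Minimal hypothetical function under test: create, wait, clean up on failure."""
    hook.create(name)
    try:
        hook.wait(name)
    except WaitFailed:
        try:
            hook.delete(name)
        except Exception:
            pass  # the real code would log the cleanup failure here
        raise


def test_delete_attempted_on_wait_failure():
    hook = mock.MagicMock()
    hook.wait.side_effect = WaitFailed("wait failed")
    try:
        create_then_wait(hook, "ng")
    except WaitFailed:
        pass
    else:
        raise AssertionError("expected WaitFailed")
    # Deletion must have been attempted exactly once for the created nodegroup.
    hook.delete.assert_called_once_with("ng")


def test_cleanup_failure_does_not_mask_original():
    hook = mock.MagicMock()
    original = WaitFailed("wait failed")
    hook.wait.side_effect = original
    hook.delete.side_effect = RuntimeError("delete failed")
    try:
        create_then_wait(hook, "ng")
    except WaitFailed as raised:
        # The caller sees the waiter error, not the cleanup error.
        assert raised is original
    else:
        raise AssertionError("expected WaitFailed")


test_delete_attempted_on_wait_failure()
test_cleanup_failure_does_not_mask_original()
```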

Documentation

The docstring for EksCreateNodegroupOperator has been updated with a brief description of the new flag delete_nodegroup_on_failure.

Backwards Compatibility

A new flag called delete_nodegroup_on_failure has been added to EksCreateNodegroupOperator with a default setting of True. Best-effort cleanup will now be attempted if a post-creation failure (including WaiterError) occurs after the nodegroup has been successfully created.
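
Users who prefer the previous behavior can presumably pass delete_nodegroup_on_failure=False. The sketch below shows one way that might look. The helper make_nodegroup_task is hypothetical, all argument values are placeholders, and the operator class is passed in as a parameter so the snippet runs without Airflow installed.

```python
def make_nodegroup_task(operator_cls):
    """Build a nodegroup-creation task with cleanup disabled (placeholder values)."""
    return operator_cls(
        task_id="create_nodegroup",
        cluster_name="demo-cluster",  # placeholder
        nodegroup_name="demo-nodegroup",  # placeholder
        nodegroup_subnets=["subnet-aaaa1111", "subnet-bbbb2222"],  # placeholders
        nodegroup_role_arn="arn:aws:iam::123456789012:role/demo-node-role",  # placeholder
        wait_for_completion=True,
        # Opt out of the new default so a failed wait leaves the nodegroup in place.
        delete_nodegroup_on_failure=False,
    )


# Using dict as a stand-in class just to show the arguments that would be passed.
print(make_nodegroup_task(dict))
```

With a provider release that includes this change, passing EksCreateNodegroupOperator (from airflow.providers.amazon.aws.operators.eks) as operator_cls would create the task itself.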

Closes: #61142

providers/amazon/src/airflow/providers/amazon/aws/operators/eks.py

occur after successful creation (e.g. waiter failures due to missing DescribeNodegroup permissions). This change adds best-effort cleanup when post-create steps fail by attempting to delete the nodegroup that was successfully created. Cleanup errors are logged but do not mask the original exception. This behavior is enabled by default. Tests cover successful cleanup on waiter failure and ensure cleanup failures do not override the original error.

…failure (apache#61145)

SameerMesiah97 requested a review from o-nikolas as a code owner January 27, 2026 19:04

boring-cyborg bot added area:providers provider:amazon AWS/Amazon - related issues labels Jan 27, 2026

o-nikolas requested review from ferruzzi and vincbeck January 28, 2026 00:18

o-nikolas reviewed Jan 28, 2026

providers/amazon/src/airflow/providers/amazon/aws/operators/eks.py Outdated

SameerMesiah97 force-pushed the 61142-EKSCreateNodeGroupOperator-Cleanup branch 2 times, most recently from 345ab8b to 2ff71f0 Compare January 28, 2026 19:24

SameerMesiah97 force-pushed the 61142-EKSCreateNodeGroupOperator-Cleanup branch from 1ed447c to 26b4853 Compare January 28, 2026 20:19

vincbeck approved these changes Jan 28, 2026

SameerMesiah97 mentioned this pull request Jan 30, 2026

Restrict EC2CreateInstanceOperator cleanup to waiter failures and add guard flag #61272

Merged

shahar1 changed the title ~~EksCreateNodegroupOperator could leave nodegroups running after failure~~ Add best-effort cleanup to EksCreateNodegroupOperator on post-create failure Feb 10, 2026

shahar1 merged commit 6ca21bf into apache:main Feb 10, 2026
88 of 89 checks passed

shahar1 mentioned this pull request Feb 11, 2026

Status of testing Providers that were prepared on February 10, 2026 #61766

Closed

81 tasks

Alok-kumar-priyadarshi pushed a commit to Alok-kumar-priyadarshi/airflow that referenced this pull request Feb 11, 2026

Add best-effort cleanup to EksCreateNodegroupOperator on post-create …

a4f0705

…failure (apache#61145)

Ratasa143 pushed a commit to Ratasa143/airflow that referenced this pull request Feb 15, 2026

Add best-effort cleanup to EksCreateNodegroupOperator on post-create …

1cfde13

…failure (apache#61145)

choo121600 pushed a commit to choo121600/airflow that referenced this pull request Feb 22, 2026

Add best-effort cleanup to EksCreateNodegroupOperator on post-create …

de3957d

…failure (apache#61145)

potiuk mentioned this pull request Feb 26, 2026

Status of testing Providers that were prepared on February 26, 2026 #62537

Open

AkshayArali pushed a commit to AkshayArali/airflow_630 that referenced this pull request Feb 27, 2026

Add best-effort cleanup to EksCreateNodegroupOperator on post-create …

a99c577

…failure (apache#61145)

AkshayArali pushed a commit to AkshayArali/airflow_630 that referenced this pull request Feb 27, 2026

Add best-effort cleanup to EksCreateNodegroupOperator on post-create …

b21a244

…failure (apache#61145)

shahar1 merged 1 commit into apache:main from SameerMesiah97:61142-EKSCreateNodeGroupOperator-Cleanup


4 participants
