Skip to content

Fix EKSPodOperator credential refresh errors and improve error handling (#57585)#58743

Merged
vincbeck merged 13 commits intoapache:mainfrom
aviralgarg05:fix/eks-pod-operator-credential-refresh-issue-57585
Feb 24, 2026
Merged

Fix EKSPodOperator credential refresh errors and improve error handling (#57585)#58743
vincbeck merged 13 commits intoapache:mainfrom
aviralgarg05:fix/eks-pod-operator-credential-refresh-issue-57585

Conversation

@aviralgarg05
Copy link
Contributor

This commit fixes the issue where long-running EKS Pod tasks fail with 'NoneType' object has no attribute 'groups' error during AWS credential refresh and where logs get duplicated after reconnection.

Changes:

  1. Enhanced COMMAND template in eks.py to validate token generation output

    • Added validation to ensure last_line is not empty
    • Added validation to ensure timestamp and token are successfully parsed
    • Provides clear error messages when token parsing fails
    • Prevents NoneType errors from bash script failures
  2. Improved error handling in pod.py _handle_api_exception method

    • Added try-except block around credential refresh logic
    • Logs detailed error information when credential refresh fails
    • Helps diagnose the root cause of credential refresh failures

These changes prevent cryptic NoneType errors and provide better visibility into credential refresh failures for EKS Pod operators.

Fixes #57585

@boring-cyborg
Copy link

boring-cyborg bot commented Nov 26, 2025

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@boring-cyborg boring-cyborg bot added area:providers provider:amazon AWS/Amazon - related issues provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Nov 26, 2025
Copy link
Contributor

@o-nikolas o-nikolas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit fixes the issue where long-running EKS Pod tasks fail with 'NoneType' object has no attribute 'groups' error during AWS credential refresh and where logs get duplicated after reconnection.

Does this PR actually fix anything? I'm just seeing improved logging, which is nice, but not a fix right? Or am I missing something?

@aviralgarg05
Copy link
Contributor Author

This commit fixes the issue where long-running EKS Pod tasks fail with 'NoneType' object has no attribute 'groups' error during AWS credential refresh and where logs get duplicated after reconnection.

Does this PR actually fix anything? I'm just seeing improved logging, which is nice, but not a fix right? Or am I missing something?

Ahh! I missed a few things from the issue, will fix them and commit the changes

@aviralgarg05 aviralgarg05 force-pushed the fix/eks-pod-operator-credential-refresh-issue-57585 branch 2 times, most recently from cbf7880 to 8fd41b7 Compare November 27, 2025 08:48
@o-nikolas
Copy link
Contributor

This commit fixes the issue where long-running EKS Pod tasks fail with 'NoneType' object has no attribute 'groups' error during AWS credential refresh and where logs get duplicated after reconnection.

Does this PR actually fix anything? I'm just seeing improved logging, which is nice, but not a fix right? Or am I missing something?

Ahh! I missed a few things from the issue, will fix them and commit the changes

I'm still not really seeing any fixes for the credential management here, even with the new commits?

Copilot AI review requested due to automatic review settings December 3, 2025 08:42
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes credential refresh failures and duplicate log issues in long-running EKS Pod tasks. The changes address the root cause of 'NoneType' object errors during AWS credential refresh by adding proper validation and implementing a credential refresh mechanism, while also preventing duplicate logs after reconnection.

  • Adds credential validation checks before token generation to prevent NoneType errors
  • Implements credential refresh mechanism for long-running EKS Pod tasks
  • Adds log timestamp tracking to prevent duplicate log entries after reconnection

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/utils/pod_manager.py Adds container_log_times dictionary to track last log timestamps per container and passes since_time to prevent duplicate logs after reconnection
providers/amazon/src/airflow/providers/amazon/aws/operators/eks.py Adds credential validation, implements _refresh_cached_properties() override to update credentials file during long-running tasks, and adds helper method to write credentials
providers/amazon/src/airflow/providers/amazon/aws/utils/eks_get_token.py Adds credential validation before token generation to provide clear error messages when credentials are missing
providers/amazon/tests/unit/amazon/aws/operators/test_eks.py Adds comprehensive test coverage for credential refresh success and failure scenarios

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@aviralgarg05
Copy link
Contributor Author

I will fix all the failing tests

@vincbeck
Copy link
Contributor

vincbeck commented Dec 8, 2025

There are some conflicts and one static error:

  providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/utils/pod_manager.py:495: error: Incompatible types in assignment (expression has type "DateTime | None", target has type "DateTime") 

@aviralgarg05
Copy link
Contributor Author

I'll fix it

@aviralgarg05 aviralgarg05 force-pushed the fix/eks-pod-operator-credential-refresh-issue-57585 branch from 0900bbd to 211edfb Compare December 19, 2025 18:12
aviralgarg05 added a commit to aviralgarg05/airflow that referenced this pull request Jan 9, 2026
- Use log.exception instead of log.error for better exception traceback
  in _refresh_cached_properties() method (addresses uranusjr's review comment)
- Add comprehensive tests for container_log_times functionality:
  - test_container_log_times_tracks_last_timestamp: verifies timestamp storage
  - test_fetch_requested_container_logs_uses_since_time: verifies since_time parameter usage
  (addresses uranusjr's question about test coverage)
@NilsJPWerner
Copy link
Contributor

Very excited for this fix!

@potiuk
Copy link
Member

potiuk commented Jan 23, 2026

Can you fix the remaining issus / rebase @aviralgarg05 ?

@aviralgarg05 aviralgarg05 force-pushed the fix/eks-pod-operator-credential-refresh-issue-57585 branch from 1357b3b to dd02999 Compare January 23, 2026 15:10
@eladkal eladkal force-pushed the fix/eks-pod-operator-credential-refresh-issue-57585 branch from 1ce012c to 545f658 Compare February 5, 2026 08:13
@eladkal eladkal requested a review from jscheffl as a code owner February 5, 2026 08:13
eladkal pushed a commit to aviralgarg05/airflow that referenced this pull request Feb 5, 2026
- Use log.exception instead of log.error for better exception traceback
  in _refresh_cached_properties() method (addresses uranusjr's review comment)
- Add comprehensive tests for container_log_times functionality:
  - test_container_log_times_tracks_last_timestamp: verifies timestamp storage
  - test_fetch_requested_container_logs_uses_since_time: verifies since_time parameter usage
  (addresses uranusjr's question about test coverage)
@shahar1 shahar1 requested a review from vincbeck February 10, 2026 13:10
@eladkal
Copy link
Contributor

eladkal commented Feb 11, 2026

@aviralgarg05 can you look into the failing tests?

@eladkal eladkal requested a review from o-nikolas February 11, 2026 17:17
…perly (apache#57585)

This commit addresses the root cause of the issue where long-running EKS Pod
tasks fail with 'NoneType' errors when AWS credentials expire.

Root Cause:
When AWS credentials expire during long-running pod execution, session.get_credentials()
returns None. The code was calling methods on this None object, causing AttributeError.

Changes:
1. eks_get_token.py: Check if credentials are None before using them in RequestSigner
   - Raise descriptive ValueError if credentials are not available
   - This prevents NoneType errors when the token script is executed

2. eks.py (EksPodOperator):
   - Add None check in execute() method before calling get_frozen_credentials()
   - Add None check in trigger_reentry() method for pod reattachment scenarios
   - Raise clear AirflowException if credentials are None

This fix ensures proper error handling at the source rather than masking the
issue with downstream validations.

Fixes apache#57585
When KubernetesPodOperator retries due to credential expiration (or other
reasons), it restarts log fetching from the beginning, causing duplicated logs.

This change updates PodManager to track the last captured log timestamp for
each container. When fetch_requested_container_logs is called again (during retry),
it resumes fetching logs from the last recorded timestamp.

Changes:
- Add container_log_times to PodManager to track progress
- Update fetch_container_logs to save the last timestamp
- Update fetch_requested_container_logs and fetch_requested_init_container_logs
  to use the saved timestamp as 'since_time'

This ensures seamless log streaming even across retries.
When EKS pods run longer than STS token lifetime (~15 minutes for
assumed roles), the Kubernetes API returns 401 errors. The existing
_refresh_cached_properties() method only resets the hook/client/pod_manager
but doesn't refresh the AWS credentials file that the kubeconfig
exec credential plugin reads from.

This commit adds proper credential refresh by:

1. Storing the credentials file path during execute() and trigger_reentry()
   so it can be updated during reconnection attempts

2. Overriding _refresh_cached_properties() in EksPodOperator to:
   - Get fresh credentials from AWS via EksHook session
   - Write the new credentials to the existing credentials file
   - Call the parent implementation to reset Kubernetes clients

3. Adding _write_credentials_to_file() helper method to update the
   credentials file that the exec credential plugin sources

This ensures that when a 401 error triggers _refresh_cached_properties(),
the kubeconfig will use fresh credentials for subsequent authentication
attempts.

Fixes apache#57585
1. Fix CodeQL security alert for clear-text credential storage:
   - Use os.open with O_WRONLY|O_TRUNC and restrictive permissions (0600)
   - Use os.fdopen to properly handle file descriptor
   - Add proper exception handling for file descriptor cleanup
   - Match the same security pattern used in EksHook._secure_credential_context

2. Address vincbeck's review question about where 401 handling occurs:
   - Updated _refresh_cached_properties docstring with detailed call chain
   - Documents that KubernetesPodOperator._handle_api_exception triggers refresh
   - References the exact file path: providers/cncf/kubernetes/operators/pod.py
   - Explains the complete flow from 401 error to credential refresh

3. Move os and stat imports to module level for consistency with codebase style
Split long dictionary key assignment lines to comply with line length
requirements. This fixes the ruff-format pre-commit check failure.
Add None check before assigning to container_log_times dictionary
to satisfy mypy type checking. The dictionary is typed as
dict[tuple[str, str, str], DateTime], so we must ensure we don't
assign None values.

Fixes the error:
providers/cncf/kubernetes/.../pod_manager.py:495: error:
Incompatible types in assignment (expression has type 'DateTime | None',
target has type 'DateTime')
- Use log.exception instead of log.error for better exception traceback
  in _refresh_cached_properties() method (addresses uranusjr's review comment)
- Add comprehensive tests for container_log_times functionality:
  - test_container_log_times_tracks_last_timestamp: verifies timestamp storage
  - test_fetch_requested_container_logs_uses_since_time: verifies since_time parameter usage
  (addresses uranusjr's question about test coverage)
- Fix NameError in type hint in pod_manager.py
- Handle missing resource_version_match in older kubernetes clients in pod_manager.py
- Add assertions in eks.py to satisfy MyPy type narrowing
- Fix EKS test: Un-nest test_execute_with_wait_when_nodegroup_does_not_already_exist
  which was incorrectly indented inside test_execute_when_nodegroup_does_not_already_exist

- Fix assertion typo: Change 'create_nodegroup_params' to 'create_nodegroup_kwargs'
  in both nodegroup test methods to check the correct key

- Make rollout status check robust: kubectl output can vary between versions/resource types.
  Accept either 'successfully rolled out' or 'roll out complete' to reduce test flakiness.

Fixes both failing CI checks mentioned in PR feedback.
…rror

The result variable was only assigned inside the with block, but was
referenced outside it. If kubernetes_stream() or any operation inside
the with block raised an exception before result was assigned, accessing
result at line 920 would cause a NameError.

Initialize result = None before the with block to ensure it's always defined.
@eladkal eladkal force-pushed the fix/eks-pod-operator-credential-refresh-issue-57585 branch from f91f9fe to c70a232 Compare February 24, 2026 09:38
@vincbeck vincbeck merged commit 880efe6 into apache:main Feb 24, 2026
105 checks passed
@boring-cyborg
Copy link

boring-cyborg bot commented Feb 24, 2026

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:providers provider:amazon AWS/Amazon - related issues provider:cncf-kubernetes Kubernetes (k8s) provider related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

EKSPodOperator reconnection errors and log duplication

8 participants