Fix EKSPodOperator credential refresh errors and improve error handling (#57585)#58743
Conversation
|
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
|
o-nikolas
left a comment
There was a problem hiding this comment.
This commit fixes the issue where long-running EKS Pod tasks fail with 'NoneType' object has no attribute 'groups' error during AWS credential refresh and where logs get duplicated after reconnection.
Does this PR actually fix anything? I'm just seeing improved logging, which is nice, but not a fix right? Or am I missing something?
Ahh! I missed a few things from the issue, will fix them and commit the changes |
cbf7880 to
8fd41b7
Compare
I'm still not really seeing any fixes for the credential management here, even with the new commits? |
There was a problem hiding this comment.
Pull request overview
This PR fixes credential refresh failures and duplicate log issues in long-running EKS Pod tasks. The changes address the root cause of 'NoneType' object errors during AWS credential refresh by adding proper validation and implementing a credential refresh mechanism, while also preventing duplicate logs after reconnection.
- Adds credential validation checks before token generation to prevent NoneType errors
- Implements credential refresh mechanism for long-running EKS Pod tasks
- Adds log timestamp tracking to prevent duplicate log entries after reconnection
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/utils/pod_manager.py | Adds container_log_times dictionary to track last log timestamps per container and passes since_time to prevent duplicate logs after reconnection |
| providers/amazon/src/airflow/providers/amazon/aws/operators/eks.py | Adds credential validation, implements _refresh_cached_properties() override to update credentials file during long-running tasks, and adds helper method to write credentials |
| providers/amazon/src/airflow/providers/amazon/aws/utils/eks_get_token.py | Adds credential validation before token generation to provide clear error messages when credentials are missing |
| providers/amazon/tests/unit/amazon/aws/operators/test_eks.py | Adds comprehensive test coverage for credential refresh success and failure scenarios |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
providers/amazon/src/airflow/providers/amazon/aws/operators/eks.py
Outdated
Show resolved
Hide resolved
|
I will fix all the failing tests |
|
There are some conflicts and one static error: |
|
I'll fix it |
0900bbd to
211edfb
Compare
providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/utils/pod_manager.py
Show resolved
Hide resolved
providers/amazon/src/airflow/providers/amazon/aws/operators/eks.py
Outdated
Show resolved
Hide resolved
- Use log.exception instead of log.error for better exception traceback in _refresh_cached_properties() method (addresses uranusjr's review comment) - Add comprehensive tests for container_log_times functionality: - test_container_log_times_tracks_last_timestamp: verifies timestamp storage - test_fetch_requested_container_logs_uses_since_time: verifies since_time parameter usage (addresses uranusjr's question about test coverage)
|
Very excited for this fix! |
|
Can you fix the remaining issus / rebase @aviralgarg05 ? |
1357b3b to
dd02999
Compare
1ce012c to
545f658
Compare
- Use log.exception instead of log.error for better exception traceback in _refresh_cached_properties() method (addresses uranusjr's review comment) - Add comprehensive tests for container_log_times functionality: - test_container_log_times_tracks_last_timestamp: verifies timestamp storage - test_fetch_requested_container_logs_uses_since_time: verifies since_time parameter usage (addresses uranusjr's question about test coverage)
|
@aviralgarg05 can you look into the failing tests? |
…perly (apache#57585) This commit addresses the root cause of the issue where long-running EKS Pod tasks fail with 'NoneType' errors when AWS credentials expire. Root Cause: When AWS credentials expire during long-running pod execution, session.get_credentials() returns None. The code was calling methods on this None object, causing AttributeError. Changes: 1. eks_get_token.py: Check if credentials are None before using them in RequestSigner - Raise descriptive ValueError if credentials are not available - This prevents NoneType errors when the token script is executed 2. eks.py (EksPodOperator): - Add None check in execute() method before calling get_frozen_credentials() - Add None check in trigger_reentry() method for pod reattachment scenarios - Raise clear AirflowException if credentials are None This fix ensures proper error handling at the source rather than masking the issue with downstream validations. Fixes apache#57585
When KubernetesPodOperator retries due to credential expiration (or other reasons), it restarts log fetching from the beginning, causing duplicated logs. This change updates PodManager to track the last captured log timestamp for each container. When fetch_requested_container_logs is called again (during retry), it resumes fetching logs from the last recorded timestamp. Changes: - Add container_log_times to PodManager to track progress - Update fetch_container_logs to save the last timestamp - Update fetch_requested_container_logs and fetch_requested_init_container_logs to use the saved timestamp as 'since_time' This ensures seamless log streaming even across retries.
When EKS pods run longer than STS token lifetime (~15 minutes for assumed roles), the Kubernetes API returns 401 errors. The existing _refresh_cached_properties() method only resets the hook/client/pod_manager but doesn't refresh the AWS credentials file that the kubeconfig exec credential plugin reads from. This commit adds proper credential refresh by: 1. Storing the credentials file path during execute() and trigger_reentry() so it can be updated during reconnection attempts 2. Overriding _refresh_cached_properties() in EksPodOperator to: - Get fresh credentials from AWS via EksHook session - Write the new credentials to the existing credentials file - Call the parent implementation to reset Kubernetes clients 3. Adding _write_credentials_to_file() helper method to update the credentials file that the exec credential plugin sources This ensures that when a 401 error triggers _refresh_cached_properties(), the kubeconfig will use fresh credentials for subsequent authentication attempts. Fixes apache#57585
1. Fix CodeQL security alert for clear-text credential storage: - Use os.open with O_WRONLY|O_TRUNC and restrictive permissions (0600) - Use os.fdopen to properly handle file descriptor - Add proper exception handling for file descriptor cleanup - Match the same security pattern used in EksHook._secure_credential_context 2. Address vincbeck's review question about where 401 handling occurs: - Updated _refresh_cached_properties docstring with detailed call chain - Documents that KubernetesPodOperator._handle_api_exception triggers refresh - References the exact file path: providers/cncf/kubernetes/operators/pod.py - Explains the complete flow from 401 error to credential refresh 3. Move os and stat imports to module level for consistency with codebase style
Split long dictionary key assignment lines to comply with line length requirements. This fixes the ruff-format pre-commit check failure.
Add None check before assigning to container_log_times dictionary to satisfy mypy type checking. The dictionary is typed as dict[tuple[str, str, str], DateTime], so we must ensure we don't assign None values. Fixes the error: providers/cncf/kubernetes/.../pod_manager.py:495: error: Incompatible types in assignment (expression has type 'DateTime | None', target has type 'DateTime')
- Use log.exception instead of log.error for better exception traceback in _refresh_cached_properties() method (addresses uranusjr's review comment) - Add comprehensive tests for container_log_times functionality: - test_container_log_times_tracks_last_timestamp: verifies timestamp storage - test_fetch_requested_container_logs_uses_since_time: verifies since_time parameter usage (addresses uranusjr's question about test coverage)
- Fix NameError in type hint in pod_manager.py - Handle missing resource_version_match in older kubernetes clients in pod_manager.py - Add assertions in eks.py to satisfy MyPy type narrowing
- Fix EKS test: Un-nest test_execute_with_wait_when_nodegroup_does_not_already_exist which was incorrectly indented inside test_execute_when_nodegroup_does_not_already_exist - Fix assertion typo: Change 'create_nodegroup_params' to 'create_nodegroup_kwargs' in both nodegroup test methods to check the correct key - Make rollout status check robust: kubectl output can vary between versions/resource types. Accept either 'successfully rolled out' or 'roll out complete' to reduce test flakiness. Fixes both failing CI checks mentioned in PR feedback.
…rror The result variable was only assigned inside the with block, but was referenced outside it. If kubernetes_stream() or any operation inside the with block raised an exception before result was assigned, accessing result at line 920 would cause a NameError. Initialize result = None before the with block to ensure it's always defined.
f91f9fe to
c70a232
Compare
|
Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions. |
This commit fixes the issue where long-running EKS Pod tasks fail with 'NoneType' object has no attribute 'groups' error during AWS credential refresh and where logs get duplicated after reconnection.
Changes:
Enhanced COMMAND template in eks.py to validate token generation output
Improved error handling in pod.py _handle_api_exception method
These changes prevent cryptic NoneType errors and provide better visibility into credential refresh failures for EKS Pod operators.
Fixes #57585