[Feature][Engine][Checkpoint Storage] Add Kubernetes IRSA support for S3 checkpoint storage #10858
suhyeon729 wants to merge 12 commits
DanielLeens
left a comment
Thanks for working on this. I went through the full diff and also traced the actual checkpoint-storage path in the current codebase. The dependency upgrade is moving in the right direction, and I confirmed locally that the old aws-java-sdk-bundle:1.11.271 really does not contain com.amazonaws.auth.WebIdentityTokenCredentialsProvider, while 1.12.770 does. So the underlying problem is real. I did find one blocking gap before merge though: the IRSA user-facing configuration is still not documented in the checkpoint storage docs.
What problem this PR solves
- User pain point
For type: hdfs + storage.type: s3, SeaTunnel eventually initializes Hadoop S3A from HdfsStorage.initStorage(...). Before this PR, the distributed AWS SDK was 1.11.271, and that jar does not contain com.amazonaws.auth.WebIdentityTokenCredentialsProvider. In Kubernetes IRSA setups, that means the runtime classpath is not sufficient for the default AWS credential chain to pick up Web Identity credentials correctly.
- Fix approach
This PR upgrades the AWS SDK to 1.12.770 in both seatunnel-dist/pom.xml and checkpoint-storage-hdfs/pom.xml, then adds focused tests to verify two things: the S3A credential provider config is propagated into the Hadoop Configuration, and WebIdentityTokenCredentialsProvider is present on the test classpath.
- One-line summary
This is a classpath-level fix that makes the S3 checkpoint path capable of supporting IRSA, but the final user-facing documentation handoff is still incomplete.
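The class-availability check described above can be sketched in plain Java. This is a minimal illustration and not the PR's test code: `ClasspathProbe` and `isLoadable` are hypothetical names, and the result for the AWS class depends entirely on which aws-java-sdk-bundle version is on the classpath where it runs.

```java
// Hypothetical helper, not part of SeaTunnel: probes whether a class can be loaded.
class ClasspathProbe {

    /** Returns true if the named class can be located on the current classpath. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String provider = "com.amazonaws.auth.WebIdentityTokenCredentialsProvider";
        // The printed value depends on the SDK bundle version available at runtime.
        System.out.println(provider + " loadable: " + isLoadable(provider));
    }
}
```

Running a probe like this against the distribution classpath is one way to confirm the upgrade took effect before trying a real IRSA deployment.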
I. Code review
1.1 Core logic analysis
Precise change summary
- Main files
seatunnel-dist/pom.xml:104-108
seatunnel-engine/seatunnel-engine-storage/checkpoint-storage-plugins/checkpoint-storage-hdfs/pom.xml:33-63
.../S3ConfigurationProviderTest.java:37-58
.../S3FileCheckpointWithIRSATest.java:31-45
- Main runtime path
HdfsStorage.initStorage(...) -> HdfsStorage.getConfiguration(...) -> S3Configuration.buildConfiguration(...) -> FileSystem.get(hadoopConf)
- Important point
The production code path itself is unchanged. What changes here is the runtime classpath that Hadoop S3A sees when FileSystem.get(...) initializes the S3 filesystem.
Before / after snippets
Before:
<!-- seatunnel-dist/pom.xml -->
<aws-java-sdk.version>1.11.271</aws-java-sdk.version>
<!-- checkpoint-storage-hdfs/pom.xml -->
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-bundle</artifactId>
<version>1.11.271</version>
<scope>provided</scope>
</dependency>

After:
<!-- seatunnel-dist/pom.xml -->
<aws-java-sdk.version>1.12.770</aws-java-sdk.version>
<!-- checkpoint-storage-hdfs/pom.xml -->
<aws-java-sdk.version>1.12.770</aws-java-sdk.version>
...
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-bundle</artifactId>
<version>${aws-java-sdk.version}</version>
<scope>provided</scope>
</dependency>

The new tests verify:
Configuration hadoopConf = s3Config.buildConfiguration(config);
assertEquals(
"com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
hadoopConf.get("fs.s3a.aws.credentials.provider"));
assertDoesNotThrow(
() -> Class.forName("com.amazonaws.auth.WebIdentityTokenCredentialsProvider"));

Key findings
- The normal path really does hit this PR when checkpoint storage is configured as S3: HdfsStorage.initStorage(...) always ends at FileSystem.get(hadoopConf).
- The fix is targeted at a real classpath problem, not at a hypothetical one. I verified locally that aws-java-sdk-bundle:1.11.271 does not contain WebIdentityTokenCredentialsProvider, while 1.12.770 does.
- The change is a precise dependency/classpath repair. It does not add a new SeaTunnel checkpoint branch.
- The remaining gap is documentation, not runtime code: the checkpoint-storage docs still do not tell users what to configure for IRSA, or that this relies on DefaultAWSCredentialsProviderChain picking up Web Identity support from the newer SDK.
Deep logic analysis
Full runtime path:
SeaTunnel Engine checkpoint initialization
-> HdfsStorage.initStorage(configuration) [HdfsStorage.java:62-72]
-> getConfiguration(configuration) [HdfsStorage.java:75-83]
-> FileConfiguration.valueOf(storageType).getConfiguration()
-> S3Configuration.buildConfiguration(config) [S3Configuration.java:61-77]
-> validate s3.bucket
-> choose s3n / s3a from the bucket prefix
-> setExtraConfiguration(hadoopConf, config, "fs.s3a.")
so fs.s3a.aws.credentials.provider is passed through unchanged
-> FileSystem.get(hadoopConf) [HdfsStorage.java:69]
-> Hadoop S3AFileSystem initialization
-> loads the AWS SDK provider chain from the runtime classpath
-> old 1.11.271: no WebIdentityTokenCredentialsProvider
-> new 1.12.770: class is available, so IRSA can be supported
That means:
- It works in the real checkpoint path when users store checkpoints in S3.
- It does not affect non-S3 checkpoint backends.
- The value of this PR only becomes usable to end users if the docs show the actual IRSA configuration entry.
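The pass-through step in the trace above can be illustrated with a simplified sketch. It assumes the step copies every plugin-config entry whose key starts with the prefix, unchanged, into the Hadoop configuration; a plain `Map` stands in for Hadoop's `Configuration`, and `PrefixPassThrough`, `copyWithPrefix`, and the bucket name are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for the pass-through step; a Map replaces Hadoop's Configuration.
class PrefixPassThrough {

    /** Copies every entry whose key starts with the prefix, leaving key and value unchanged. */
    static Map<String, String> copyWithPrefix(Map<String, String> pluginConfig, String prefix) {
        Map<String, String> hadoopConf = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : pluginConfig.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                hadoopConf.put(entry.getKey(), entry.getValue());
            }
        }
        return hadoopConf;
    }

    public static void main(String[] args) {
        Map<String, String> pluginConfig = new LinkedHashMap<>();
        pluginConfig.put("storage.type", "s3");
        pluginConfig.put("s3.bucket", "s3a://example-bucket"); // placeholder bucket name
        pluginConfig.put("fs.s3a.aws.credentials.provider",
                "com.amazonaws.auth.DefaultAWSCredentialsProviderChain");
        // Only the fs.s3a.* entry is forwarded; the SeaTunnel-specific keys are not.
        System.out.println(copyWithPrefix(pluginConfig, "fs.s3a."));
    }
}
```

Because nothing in this step inspects the provider value, whether a given provider works is decided later, when Hadoop S3A tries to load the class from the runtime classpath.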
1.2 Compatibility impact
- Conclusion: fully compatible
- API: no change
- Config options: no added/removed/renamed options
- Defaults: unchanged
- Protocol: unchanged (type: hdfs + storage.type: s3)
- Serialization format: unchanged
- Historical behavior: unchanged for non-S3 checkpoint users; for S3 checkpoint users this broadens runtime capability without changing checkpoint file format
1.3 Performance / side effects
- CPU / memory / GC: no new hot-path logic
- Network: still handled by Hadoop S3A
- Concurrency / locking: unchanged
- Retry / idempotency: unchanged
- Resource release: unchanged
- Main side effect surface is dependency behavior after the AWS SDK upgrade, not SeaTunnel logic regression
1.4 Error handling and logging
- The existing failure path in HdfsStorage.initStorage(...) is preserved: FileSystem.get(...) failures are still wrapped as CheckpointStorageException.
- No new logging branch or sensitive-data risk is introduced here.
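The wrapping pattern mentioned above can be sketched as follows. This is an illustration only: `StorageInitSketch`, `FsOpener`, and `StorageInitException` are hypothetical stand-ins, not SeaTunnel's `HdfsStorage` or `CheckpointStorageException`, and the error message is invented.

```java
import java.io.IOException;

// Hypothetical stand-ins; this is not SeaTunnel's actual HdfsStorage implementation.
class StorageInitSketch {

    /** Stand-in for a storage-specific checked exception. */
    static class StorageInitException extends Exception {
        StorageInitException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Stand-in for the call that opens the filesystem and may fail with an IOException. */
    interface FsOpener {
        Object open() throws IOException;
    }

    /** Opens the filesystem and wraps any IOException, keeping the original as the cause. */
    static Object initStorage(FsOpener opener) throws StorageInitException {
        try {
            return opener.open();
        } catch (IOException e) {
            throw new StorageInitException("Failed to get file system", e);
        }
    }

    public static void main(String[] args) {
        try {
            initStorage(() -> {
                throw new IOException("no credentials available");
            });
        } catch (StorageInitException e) {
            System.out.println(e.getMessage() + ": " + e.getCause().getMessage());
        }
    }
}
```

Keeping the original exception as the cause matters for IRSA debugging, since the underlying credential-resolution error is usually the informative part.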
Issue 1: The IRSA capability reaches the code, but the checkpoint-storage docs still do not expose a usable configuration entry for users
- Location:
docs/en/engines/zeta/checkpoint-storage.md:150, docs/zh/engines/zeta/checkpoint-storage.md:163
- Description:
The runtime path fixed by this PR is HdfsStorage.initStorage -> S3Configuration.buildConfiguration -> FileSystem.get(...). That path can now support IRSA at the classpath level, but the official checkpoint-storage docs still only show InstanceProfileCredentialsProvider and MinIO-style SimpleAWSCredentialsProvider. There is still no explicit IRSA/Kubernetes example telling users to configure fs.s3a.aws.credentials.provider: com.amazonaws.auth.DefaultAWSCredentialsProviderChain, and there is no note explaining that IRSA support depends on the default provider chain being able to pick up Web Identity from the newer AWS SDK.
- Risk:
This leaves a real main-path delivery gap. Users will reasonably expect the feature in the PR title to be directly usable after reading the docs, but right now they still cannot derive the correct IRSA configuration from the official guide. In practice that means misconfiguration, repeated support questions, and “the fix landed but I still can't use it” confusion.
- Best improvement:
Option A: add an explicit IRSA/Kubernetes example to both docs/en/engines/zeta/checkpoint-storage.md and docs/zh/engines/zeta/checkpoint-storage.md, including fs.s3a.aws.credentials.provider: com.amazonaws.auth.DefaultAWSCredentialsProviderChain, and explain that the newer AWS SDK is what makes Web Identity work here.
Option B: if you want to keep it smaller, at least add a note right after the existing S3 example clarifying the difference between EC2 instance profile usage and Kubernetes IRSA usage.
- Severity: High
II. Code quality
2.1 Code conventions
- The overall style is consistent with the project.
- No new production method was introduced, so I did not find a missing-comment issue on newly added core runtime methods.
2.2 Test coverage
- Added coverage
S3ConfigurationProviderTest checks provider propagation and class availability. S3FileCheckpointWithIRSATest provides an environment-dependent example path.
- Remaining blind spot
S3FileCheckpointWithIRSATest is disabled, so CI still does not automatically prove a real IRSA-backed FileSystem.get(...) initialization. I see that as residual risk rather than a blocker for this PR.
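One way to decide whether an environment-dependent test like this is worth enabling is to check for the variables the AWS SDK's web identity support reads, `AWS_WEB_IDENTITY_TOKEN_FILE` and `AWS_ROLE_ARN`. The sketch below is an assumption about how such a guard could look, not part of this PR; `IrsaEnvCheck` and `hasIrsaEnv` are hypothetical names.

```java
import java.util.Map;

// Hypothetical guard, not part of this PR: detects whether IRSA-style variables are set.
class IrsaEnvCheck {

    /** Returns true only if both web identity variables are present and non-blank. */
    static boolean hasIrsaEnv(Map<String, String> env) {
        return isSet(env.get("AWS_WEB_IDENTITY_TOKEN_FILE")) && isSet(env.get("AWS_ROLE_ARN"));
    }

    private static boolean isSet(String value) {
        return value != null && !value.trim().isEmpty();
    }

    public static void main(String[] args) {
        // In an EKS pod with IRSA configured, both variables are injected into the container.
        System.out.println("IRSA environment detected: " + hasIrsaEnv(System.getenv()));
    }
}
```

A check like this only shows that the variables exist. It does not prove that the role can actually access the checkpoint bucket, so a real end-to-end run is still needed.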
2.3 Documentation updates
- This is a user-visible behavior change, so both docs/en and docs/zh need to be updated.
- Right now:
- the S3 connector docs only update the jar version text,
- the Chinese checkpoint-storage doc adds a MinIO example,
- neither checkpoint-storage doc adds the actual IRSA usage guidance.
- So the documentation handoff is still incomplete, which is why I’m treating Issue 1 as blocking.
III. Architecture
3.1 Elegance of the solution
- This is a precise fix.
- The root cause is in the runtime dependency set, and the PR addresses it there instead of adding workaround branches into checkpoint logic.
3.2 Maintainability
- The code change is small and easy to reason about.
- The remaining maintainability risk is mostly operational: users still do not know the correct entry point unless the docs are updated.
3.3 Extensibility
- The underlying design is already reasonably extensible because S3Configuration.buildConfiguration(...) passes fs.s3a.* through to Hadoop instead of hardcoding a tiny allowlist.
- That makes the documentation even more important here: the runtime is flexible, but users should not have to guess the right provider setting.
3.4 Historical-version compatibility
- Still compatible with existing behavior.
- No checkpoint data migration is needed.
- The missing piece is documentation, not compatibility handling.
IV. Issue summary
| No. | Issue | Location | Severity |
|---|---|---|---|
| 1 | The IRSA capability reaches the code, but the checkpoint-storage docs still do not expose a usable configuration entry for users | docs/en/engines/zeta/checkpoint-storage.md:150; docs/zh/engines/zeta/checkpoint-storage.md:163 | High |
V. Merge conclusion
Conclusion: can merge after fixes
- Blocking items
- Issue 1: the feature is not fully delivered to users until the official checkpoint-storage docs explain the real IRSA configuration entry and provider choice.
- Suggested but non-blocking improvements
- No additional medium/low-severity items from me in this round.
Overall assessment:
The technical direction here makes sense, and the root issue is real. The classpath fix is the right place to solve it. The remaining gap is the last-mile user handoff: once the IRSA usage is documented clearly in both English and Chinese checkpoint-storage docs, this should be in good shape for merge.
Possible alternatives:
- Option A: keep the current dependency upgrade and add explicit IRSA docs in both languages. This is the smallest complete fix, and my recommended path.
- Option B: if you want to strengthen the safety net later, add a more focused automated test around the HdfsStorage -> S3Configuration initialization path. I don’t see that as the primary blocker for this PR right now.
CI note:
- The current Build failure is in all-connectors-it-7 / connector-file-sftp-it, which is outside the files touched by this PR. I did not see a checkpoint-storage-specific failure signal in the reported failing jobs. After the doc fix, I’d recommend rerunning CI or rebasing if needed.
DanielLeens
left a comment
Thanks for the update. I re-reviewed the latest head from scratch. The main blocker from the previous round was that the SeaTunnel-side IRSA configuration path was still under-documented. The current revision closes that gap, and I do not see a blocking code issue in the current version.
What problem this PR solves
- User pain point
Having code-level support for S3 checkpoint storage is not enough if users still do not know how to configure SeaTunnel itself for Kubernetes IRSA. Without an explicit seatunnel.yaml example, the feature remains technically present but practically hard to use.
- Fix approach
The PR keeps the existing fs.s3a.* pass-through configuration path and now documents the IRSA provider explicitly in both English and Chinese docs. It also adds tests for provider propagation and for class availability.
- One-line summary
This turns IRSA support from “the code can probably do it” into “users can actually configure it in SeaTunnel”.
1. Code review
1.1 Core logic analysis
Precise change scope
The important pieces are:
- runtime path
seatunnel-engine/seatunnel-engine-storage/checkpoint-storage-plugins/checkpoint-storage-hdfs/src/main/java/org/apache/seatunnel/engine/checkpoint/storage/hdfs/common/S3Configuration.java:61-77
- tests
seatunnel-engine/.../S3ConfigurationProviderTest.java:37-58
seatunnel-engine/.../S3FileCheckpointWithIRSATest.java:31-46
- docs
docs/en/engines/zeta/checkpoint-storage.md:188-205
docs/zh/engines/zeta/checkpoint-storage.md
Before / after
The runtime path is still based on the existing pass-through model:
hadoopConf.set(FS_DEFAULT_NAME_KEY, config.get(S3_BUCKET_KEY));
hadoopConf.set(formatKey(protocol, HDFS_IMPL_KEY), fsImpl);
setExtraConfiguration(hadoopConf, config, FS_KEY + protocol + SPLIT_CHAR);

So if the user provides:
fs.s3a.aws.credentials.provider: com.amazonaws.auth.WebIdentityTokenCredentialsProvider
inside plugin-config, it will be forwarded into the Hadoop Configuration.
What this revision really adds is the missing user-facing contract and the regression coverage around it:
- explicit IRSA config examples in the docs,
- a test that verifies credentials-provider propagation,
- and a test that verifies the WebIdentityTokenCredentialsProvider class is available with the current dependency set.
Key findings
- The normal runtime path definitely hits S3Configuration.buildConfiguration(...), because checkpoint storage initialization goes through that configuration builder.
- This revision does not alter checkpoint storage protocol or defaults; it mainly makes the SeaTunnel-side configuration contract explicit.
- The previous blocking concern was the missing SeaTunnel-side IRSA example. That is now fixed.
- The new tests do not replace a real Kubernetes IRSA environment, but they do cover the minimum regression surface for provider propagation and dependency visibility.
Full runtime chain
seatunnel.yaml defines checkpoint storage
-> seatunnel.engine.checkpoint.storage.plugin-config
-> for example storage.type=s3, s3.bucket=..., fs.s3a.aws.credentials.provider=...
Checkpoint storage initialization
-> S3Configuration.buildConfiguration(config) [61-77]
-> detect s3a based on bucket prefix [64-68]
-> set fs.defaultFS and S3A filesystem impl [69-75]
-> setExtraConfiguration(..., "fs.s3a.") [76]
-> propagate fs.s3a.aws.credentials.provider into Hadoop Configuration
Runtime S3 checkpoint access
-> Hadoop S3A resolves the configured credentials provider
-> if provider = WebIdentityTokenCredentialsProvider
-> use IRSA web identity credentials in Kubernetes
Validation / docs
-> docs/en/.../checkpoint-storage.md [188-205]
-> S3ConfigurationProviderTest [37-58]
-> S3FileCheckpointWithIRSATest [31-46]
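The protocol-detection and prefix-building steps in the chain above can be sketched like this. The sketch assumes the bucket value is inspected with a simple prefix check, that s3n is the fallback, and that the prefix is assembled as "fs." + protocol + "." (consistent with the "fs.s3a." prefix shown in the chain). `S3ProtocolSketch` and its methods are hypothetical, not the actual `S3Configuration` code.

```java
// Hypothetical sketch of protocol detection; not the actual S3Configuration code.
class S3ProtocolSketch {

    /** Assumption: buckets starting with "s3a" select s3a, everything else falls back to s3n. */
    static String detectProtocol(String bucket) {
        return bucket.startsWith("s3a") ? "s3a" : "s3n";
    }

    /** Builds the pass-through prefix, e.g. "fs.s3a." for the s3a protocol. */
    static String extraConfigPrefix(String protocol) {
        return "fs." + protocol + ".";
    }

    public static void main(String[] args) {
        String bucket = "s3a://example-bucket"; // placeholder bucket name
        String protocol = detectProtocol(bucket);
        System.out.println(protocol + " -> " + extraConfigPrefix(protocol));
    }
}
```

If the prefix follows the detected protocol as assumed here, an fs.s3a.aws.credentials.provider entry would only be forwarded when the bucket is written with the s3a scheme, which is worth calling out in the IRSA docs.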
1.2 Compatibility impact
Conclusion: fully compatible.
- API: unchanged
- Configs: no existing config removed or renamed
- Defaults: unchanged
- Protocol: unchanged
- Serialization: unchanged
- Historical behavior: existing credential-provider paths still work as before
1.3 Performance / side effects
- CPU / memory / GC: no meaningful change
- Network: no extra polling or new network loop introduced
- Concurrency: no new shared mutable state
- Retry / idempotency: unchanged
- Resource lifecycle: no new leak introduced here
1.4 Error handling and logging
I did not find a blocking code issue in the current version.
CI note:
- GitHub Build is currently red, but the actual failing logs come from the fork workflow suhyeon729/seatunnel.
- The failing jobs include:
  - unit-test (8, ubuntu-latest) with CoordinatorServiceTest.testRestoreUsesProvidedJobInfoInitializationTimestamp
  - all-connectors-it-3
  - all-connectors-it-7
- Those failures do not point to the changed checkpoint-storage-hdfs / S3 provider propagation path, and they do not fail in the new IRSA-related tests added by this PR.
2. Code quality
2.1 Code style
The new tests and docs fit the existing project style well. The important part here is that the configuration contract is now much clearer for real users.
2.2 Test coverage
The coverage matches the main risk surface of this PR:
- provider propagation into Hadoop config,
- availability of WebIdentityTokenCredentialsProvider,
- and a manually enabled IRSA-oriented checkpoint test entrypoint.
2.3 Documentation
This doc update is necessary, and the current version does it well:
- docs/en/engines/zeta/checkpoint-storage.md:188-205 now includes an IRSA example
- docs/zh/engines/zeta/checkpoint-storage.md is updated in parallel
The English and Chinese docs are aligned in direction.
3. Architecture assessment
3.1 Solution quality
This is a precise completion, not an overreach.
3.2 Maintainability
The pass-through configuration model stays simple, and the docs now explain the user-facing contract clearly.
3.3 Extensibility
This approach scales well to other credentials providers too, because it still relies on the generic Hadoop configuration pass-through boundary.
3.4 Historical compatibility
No migration burden introduced by this change.
4. Issue summary
| No. | Issue | Location | Severity |
|---|---|---|---|
| - | No formal issue found in the current revision | - | - |
5. Merge conclusion
Conclusion: can merge
- Blocking items
  - No code blocker from my side in the current revision.
- Non-blocking suggestions
  - The GitHub Build check is still red, but the current failures point to unrelated unit-test / connector-IT jobs in the fork workflow. I would suggest syncing the latest dev or rerunning CI before merge.
Overall, the previous blocker was about the missing SeaTunnel-side IRSA configuration contract. That part is now covered by the docs and backed by regression tests, so I do not have a new blocker in the current version.
… S3 checkpoint storage

This commit adds support for Kubernetes IRSA (IAM Roles for Service Accounts) authentication when using S3 as checkpoint storage in SeaTunnel Engine.

Changes:
- Upgraded hadoop-aws from 3.1.4 to 3.3.6 to support WebIdentityTokenCredentialsProvider
- Upgraded aws-java-sdk-bundle from 1.11.271 to 1.12.770 for IRSA compatibility
- Added WebIdentityTokenCredentialsProvider option for direct IRSA authentication
- Added DefaultAWSCredentialsProviderChain for automatic credential detection
- Updated documentation with IRSA configuration examples and Kubernetes setup
- Added unit test framework for IRSA authentication (requires K8s environment)

The changes are backward compatible: existing credential providers (SimpleAWSCredentialsProvider, InstanceProfileCredentialsProvider) continue to work unchanged.
The hadoop-aws 3.3.6 upgrade (for IRSA support) created a version mismatch with the seatunnel-hadoop3-3.1.4-uber jar, which was still using Hadoop 3.1.4. This would cause:
- NoSuchMethodError or ClassCastException at runtime
- S3A initialization failures
- Checkpoint storage failures breaking fault tolerance

This commit upgrades the Hadoop uber jar from 3.1.4 → 3.3.6 to match hadoop-aws, ensuring:
- Version alignment across the distribution
- IRSA (WebIdentityToken) support remains functional (requires Hadoop 3.3+)
- No runtime classpath conflicts

Note: Artifact name kept as 'seatunnel-hadoop3-3.1.4-uber' to avoid updating 19+ connector references. The artifact name can be refactored in a future PR.
- Support WebIdentityTokenCredentialsProvider and DefaultAWSCredentialsProviderChain
in S3Configuration to enable IRSA-based authentication on Kubernetes
- Upgrade aws-java-sdk-bundle to 1.12.770 in seatunnel-dist and align
checkpoint-storage-hdfs to use shared version property
- Add S3ConfigurationProviderTest to verify credential provider propagation
and IRSA class availability without requiring AWS connectivity
- Update S3File connector docs (en/zh) to list all supported credential providers
- Revert hadoop3-uber jar to 3.1.4; Hadoop runtime upgrade is a separate concern
…d test conventions
…int docs, and update aws-java-sdk-bundle version in S3File docs
…ime SDK dependency is resolved
8120daa to 6fd666e
DanielLeens
left a comment
Thanks for the update. I re-checked the latest head from the real checkpoint-storage path instead of only looking at the docs/test diff, and I did not find a new source-level blocker in the current implementation.
What problem this PR solves
- User pain point
In Kubernetes IRSA setups, Zeta checkpoint storage needs to use an S3A credentials provider such as WebIdentityTokenCredentialsProvider, but the older AWS SDK bundle did not expose the required class.
- Fix approach
This PR upgrades the AWS SDK bundle, keeps the existing fs.s3a.* passthrough path intact, and adds docs/tests for configuring fs.s3a.aws.credentials.provider.
- One-line summary
This enables IRSA-style S3 checkpoint authentication through the standard Hadoop S3A provider mechanism.
1. Code change review
1.1 Core logic analysis
The real runtime path is:
checkpoint-storage plugin-config
-> S3Configuration.buildConfiguration(...) [S3Configuration.java:61-77]
-> setExtraConfiguration(..., "fs.s3a.")
-> Hadoop S3A reads fs.s3a.aws.credentials.provider
So the core of this PR is not a new hard-coded IRSA branch. The important part is:
- the provider key is still propagated into Hadoop Configuration, and
- the runtime classpath now contains com.amazonaws.auth.WebIdentityTokenCredentialsProvider.
That matches the new tests:
- S3ConfigurationProviderTest checks the provider key is propagated.
- S3ConfigurationProviderTest also checks that WebIdentityTokenCredentialsProvider is loadable.
- S3FileCheckpointWithIRSATest documents the real-environment IRSA path and is intentionally disabled in CI.
1.2 Compatibility impact
Fully compatible from the configuration-contract perspective.
This does not change protocol fields, serialization, config names, or defaults. It extends the set of provider classes that can actually work on the existing fs.s3a.* path.
1.3 Performance / side effects
I do not see a new CPU, memory, GC, concurrency, retry, idempotency, or resource-release problem in the current source path. The remaining watchpoint is the dependency upgrade itself, which CI still needs to validate.
1.4 Error handling and logging
No new source-level blocking issue found.
2. Code quality evaluation
2.1 Code style
The scope is clean and focused.
2.2 Test coverage and stability
- The added tests cover provider propagation and class availability.
- I did not find a flaky-test pattern in the new test code.
Test stability rating: Stable.
2.3 Documentation
Both English and Chinese docs were updated with IRSA examples, which is the right thing for this user-visible configuration path.
3. Architecture
3.1 Solution quality
This is a precise fix. It reuses the existing Hadoop S3A provider contract instead of introducing a SeaTunnel-specific IRSA abstraction.
3.2 Maintainability
Good. Future provider types can still use the same fs.s3a.* passthrough path.
3.3 Extensibility
Good. The solution is not IRSA-specific in code shape; it just makes the standard provider mechanism usable for IRSA.
3.4 Historical compatibility
Compatible from the source/config point of view.
4. Issue summary
No new source-level blocking issue found.
5. Merge conclusion
Conclusion: can merge after fixes
- Blocking items
I did not find a new code blocker on the latest head, but the Build check is still red. At minimum, unit-test (8, ubuntu-latest) and connector-file-sftp-it (11, ubuntu-latest) should be cleared before merge.
- Suggested follow-ups
None.
From the current source path, the IRSA support direction looks good to me. The remaining gate is CI, not a new logic issue in the S3 checkpoint implementation.
Fixes: #10302
Purpose of this pull request
This PR adds support for Kubernetes IRSA (IAM Roles for Service Accounts) in S3-based authentication for the Zeta engine's S3 checkpoint storage. This is a follow-up to address the technical requirements from #10324.
As discussed with @DanielLeens, this PR follows a staged approach to ensure project stability:
- Upgrade the aws-java-sdk-bundle version to 1.12.770 in seatunnel-dist. This ensures WebIdentityTokenCredentialsProvider and DefaultAWSCredentialsProviderChain are available at runtime, preventing ClassNotFoundException.
- Keep the Hadoop runtime at 3.1.4 to avoid the risks of a broad runtime upgrade. Full runtime compatibility (requiring Hadoop 3.3.x) will be handled in a subsequent dedicated PR.

Does this PR introduce any user-facing change?
Yes (Structural changes for future IRSA support).
- DefaultAWSCredentialsProviderChain can now detect EKS workload identity environment variables.
- Added a unit test (S3ConfigurationProviderTest) to verify that the S3Configuration correctly propagates the IRSA provider class into the Hadoop configuration.
- Added S3FileCheckpointWithIRSATest (annotated with @Disabled) to verify class availability. This test can be enabled locally in a live EKS environment to validate end-to-end checkpoint IO.

Check list
New License Guide
incompatible-changes.md to describe the incompatibility caused by this PR.