[Fix][Connectors-v2] Avoid Iceberg DROP_DATA full-table delete commit#10851
[Fix][Connectors-v2] Avoid Iceberg DROP_DATA full-table delete commit#10851corgy-w wants to merge 3 commits into
Conversation
2dc9413 to
d37f651
Compare
davidzollo
left a comment
There was a problem hiding this comment.
Thanks for the contribution. I reviewed the full current head locally on seatunnel-review-10851 against upstream/dev. This was a source-level review only; I did not run local Maven in this round.
What This PR Fixes
- User pain:
DROP_DATAon Iceberg should avoid a heavy full-table delete commit when a safer metadata-level reset is possible. - Fix approach:
keepSchemaDropData()now routes intotruncateTable(), andtruncateTable()replaces the olddeleteFromRowFilter(alwaysTrue())path with a metadata reset. - One-line summary: the metadata-reset direction is good for plain tables, but the current head introduces a new hard failure for tables that have non-main snapshot refs, which is a visible behavior narrowing compared with the old path.
1. Code Review
1.1 Core Logic Analysis
-
Exact change scope:
seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/sink/IcebergSaveModeHandler.java:43-51seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/catalog/IcebergCatalog.java:276-314
-
Before / after:
// older truncate path
catalog.loadTable(icebergTableIdentifier)
.newDelete()
.deleteFromRowFilter(Expressions.alwaysTrue())
.commit();// current path
Set<String> nonMainRefs = table.refs().keySet().stream()
.filter(refName -> !SnapshotRef.MAIN_BRANCH.equals(refName))
.collect(Collectors.toCollection(LinkedHashSet::new));
if (!nonMainRefs.isEmpty()) {
throw new CatalogException(...);
}
// then perform metadata reset-
Key findings:
- The normal
KEEP_SCHEMA + DROP_DATAruntime path definitely hits this change. - The metadata-reset direction itself is reasonable for plain tables.
- The new implementation now rejects any table that has non-main snapshot refs.
- That is a real behavior narrowing compared with the old path, not just an internal refactor.
- The normal
-
Runtime chain I checked:
save mode path
-> DefaultSaveModeHandler.keepSchemaDropData()
-> IcebergSaveModeHandler.keepSchemaDropData()
-> icebergCatalog.truncateTable(tablePath, true)
-> loadTable(...)
-> inspect table.refs()
-> if non-main refs exist -> throw CatalogException
-> else perform metadata reset commit
1.2 Compatibility Impact
- Verdict: partially incompatible.
- API/config/default/protocol/serialization are unchanged.
- The incompatible part is runtime behavior: a table that previously could still be cleared through the older path is now rejected immediately if it has non-main refs.
1.3 Performance / Side Effects
- For plain tables, the metadata-reset approach is a reasonable performance win.
- The main side effect is compatibility-related: the supported table shape is narrower than before.
1.4 Error Handling and Logs
Issue 1: Tables with non-main snapshot refs are now rejected outright
- Location:
IcebergCatalog.java:285 - Why this is a real problem:
this code sits directly on the realKEEP_SCHEMA + DROP_DATApath. If the table has non-main refs, the PR now turns that into a hard failure instead of falling back to a behavior-compatible truncation path. - Risk:
users with Iceberg tag/branch usage can hit a main-path failure after upgrading to this implementation. - Suggested fix:
if non-main refs are present, fall back to the older delete-based truncation path instead of failing the operation outright. If you intentionally want to keep the new limitation, it needs to be documented explicitly as an incompatible behavior change. - Severity: High
2. Code Quality
- The implementation is cleanly structured.
- I would still add one contract-style test around tables with non-main refs, because that is exactly where the current compatibility regression sits.
3. Architecture
- For plain tables this is a good direction.
- The blocker is that the current head narrows the supported table shape without a compatibility-safe fallback.
4. Issue Summary
| No. | Issue | Location | Severity |
|---|---|---|---|
| 1 | Tables with non-main snapshot refs are now rejected outright | IcebergCatalog.java:285 |
High |
5. Merge Conclusion
Conclusion: merge after fixes
- Blocking items
- Issue 1 must be fixed before merge.
- Suggested non-blocking follow-up
- Add one contract test around the non-main-ref path so this boundary stays explicit.
The metadata-reset idea is worth keeping. The blocker is the new compatibility regression on tagged/branched Iceberg tables, not the overall direction.
Posted from the wrong account by mistake; superseded by the DanielLeens review.
DanielLeens
left a comment
There was a problem hiding this comment.
Thanks for the contribution. I reviewed the full current head locally on seatunnel-review-10851 against upstream/dev. This was a source-level review only; I did not run local Maven in this round.
What This PR Fixes
- User pain:
DROP_DATAon Iceberg should avoid a heavy full-table delete commit when a safer metadata-level reset is possible. - Fix approach:
keepSchemaDropData()now routes intotruncateTable(), andtruncateTable()replaces the olddeleteFromRowFilter(alwaysTrue())path with a metadata reset. - One-line summary: the metadata-reset direction is good for plain tables, but the current head introduces a new hard failure for tables that have non-main snapshot refs, which is a visible behavior narrowing compared with the old path.
1. Code Review
1.1 Core Logic Analysis
-
Exact change scope:
seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/sink/IcebergSaveModeHandler.java:43-51seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/catalog/IcebergCatalog.java:276-314
-
Before / after:
// older truncate path
catalog.loadTable(icebergTableIdentifier)
.newDelete()
.deleteFromRowFilter(Expressions.alwaysTrue())
.commit();// current path
Set<String> nonMainRefs = table.refs().keySet().stream()
.filter(refName -> !SnapshotRef.MAIN_BRANCH.equals(refName))
.collect(Collectors.toCollection(LinkedHashSet::new));
if (!nonMainRefs.isEmpty()) {
throw new CatalogException(...);
}
// then perform metadata reset-
Key findings:
- The normal
KEEP_SCHEMA + DROP_DATAruntime path definitely hits this change. - The metadata-reset direction itself is reasonable for plain tables.
- The new implementation now rejects any table that has non-main snapshot refs.
- That is a real behavior narrowing compared with the old path, not just an internal refactor.
- The normal
-
Runtime chain I checked:
save mode path
-> DefaultSaveModeHandler.keepSchemaDropData()
-> IcebergSaveModeHandler.keepSchemaDropData()
-> icebergCatalog.truncateTable(tablePath, true)
-> loadTable(...)
-> inspect table.refs()
-> if non-main refs exist -> throw CatalogException
-> else perform metadata reset commit
1.2 Compatibility Impact
- Verdict: partially incompatible.
- API/config/default/protocol/serialization are unchanged.
- The incompatible part is runtime behavior: a table that previously could still be cleared through the older path is now rejected immediately if it has non-main refs.
1.3 Performance / Side Effects
- For plain tables, the metadata-reset approach is a reasonable performance win.
- The main side effect is compatibility-related: the supported table shape is narrower than before.
1.4 Error Handling and Logs
Issue 1: Tables with non-main snapshot refs are now rejected outright
- Location:
IcebergCatalog.java:285 - Why this is a real problem:
this code sits directly on the realKEEP_SCHEMA + DROP_DATApath. If the table has non-main refs, the PR now turns that into a hard failure instead of falling back to a behavior-compatible truncation path. - Risk:
users with Iceberg tag/branch usage can hit a main-path failure after upgrading to this implementation. - Suggested fix:
if non-main refs are present, fall back to the older delete-based truncation path instead of failing the operation outright. If you intentionally want to keep the new limitation, it needs to be documented explicitly as an incompatible behavior change. - Severity: High
2. Code Quality
- The implementation is cleanly structured.
- I would still add one contract-style test around tables with non-main refs, because that is exactly where the current compatibility regression sits.
3. Architecture
- For plain tables this is a good direction.
- The blocker is that the current head narrows the supported table shape without a compatibility-safe fallback.
4. Issue Summary
| No. | Issue | Location | Severity |
|---|---|---|---|
| 1 | Tables with non-main snapshot refs are now rejected outright | IcebergCatalog.java:285 |
High |
5. Merge Conclusion
Conclusion: merge after fixes
- Blocking items
- Issue 1 must be fixed before merge.
- Suggested non-blocking follow-up
- Add one contract test around the non-main-ref path so this boundary stays explicit.
The metadata-reset idea is worth keeping. The blocker is the new compatibility regression on tagged/branched Iceberg tables, not the overall direction.
There was a problem hiding this comment.
| .conditional( | |
| IcebergSinkOptions.DATA_SAVE_MODE, | |
| DataSaveMode.DROP_DATA, | |
| IcebergDropDataStrategy) |
DanielLeens
left a comment
There was a problem hiding this comment.
Thanks for the update. I re-reviewed the latest head locally against the current upstream/dev. This was a source-level review only; I did not run local Maven in this round.
What This PR Fixes
- User pain:
data_save_mode = DROP_DATAon Iceberg can be expensive when it always goes through a full-table delete commit. - Fix approach: introduce an explicit
iceberg.drop-data.strategy; keepDELETE_COMMITas the default for compatibility, and add an opt-inHARD_METADATA_RESETpath plus branch / docs / tests. - One-line summary: the current head resolves the compatibility blocker from my previous review by restoring the legacy default behavior and making metadata reset an explicit opt-in strategy.
Runtime Chain I Checked
SaveMode path
-> IcebergSink.getSaveModeHandler()
-> new IcebergSaveModeHandler(..., dropDataStrategy, commitBranch)
-> handleDataSaveMode()
-> keepSchemaDropData()
-> IcebergCatalog.truncateTable(tablePath, true, strategy, commitBranch)
-> strategy = DELETE_COMMIT
-> table.newDelete()
-> optional toBranch(commitBranch)
-> deleteFromRowFilter(alwaysTrue())
-> commit()
-> strategy = HARD_METADATA_RESET
-> TableOperations.current()
-> remove all refs + snapshots from metadata
-> operations.commit(base, resetMetadata)
-> recreateCommitBranchIfNeeded(non-main branch only)
Key Findings
- The normal
data_save_mode = DROP_DATApath definitely hits this change. - The default strategy is now
DELETE_COMMIT, so existing jobs keep the legacy behavior. - The destructive metadata-reset path is opt-in, and both EN/ZH docs now say explicitly that it clears all snapshot refs / snapshots and leaves orphan cleanup to separate Iceberg maintenance.
- The blocker from my previous review is resolved: the PR no longer hard-fails tagged / branched tables on the default path.
I also checked the upstream Iceberg ManageSnapshots#createBranch(String) contract from the published source jar. When the current snapshot is null, Iceberg creates a new empty snapshot for that branch instead of failing, so the non-main branch recreation logic is valid at the API-contract level.
No blocking issue found.
Conclusion: can merge
- Blocking items
- None.
- Suggested non-blocking follow-up
- If this area evolves further, one more integration-style check around the first write after a hard metadata reset would still be useful.
Overall, this version fixes the compatibility problem from the last round in the right way: metadata reset remains available, but it is now behind an explicit opt-in knob and the default path stays backward-compatible.
Purpose of this pull request
Iceberg
DROP_DATApreviously used a full-table delete commit. That path can load metadata, snapshots, and manifests aggressively enough to OOM large tables before writes start.A previous attempt to always switch
DROP_DATAto metadata reset avoided that path, but it also regressed the existingiceberg.table.commit-branchcapability by rejecting tables with non-mainrefs.This patch adds an explicit Iceberg-only
iceberg.drop-data.strategyoption with two modes:DELETE_COMMIT(default): keeps the historical delete-commit behavior and targetsiceberg.table.commit-branchwhen configuredHARD_METADATA_RESET: removes all snapshot refs and snapshots throughTableOperations.commit, then recreates the configured non-maincommit branch if neededThe patch also removes the redundant
instanceof IcebergCatalogguard from the sink save-mode path, fixes the Iceberg catalog factory error text, updates the Iceberg sink docs, and adds unit coverage for both strategies.Does this PR introduce any user-facing change?
Yes. Iceberg sink users can now choose
iceberg.drop-data.strategy.The default remains
DELETE_COMMIT, so existingDROP_DATAjobs keep their historical behavior, including branch-targeted delete commits wheniceberg.table.commit-branchis configured.Users who explicitly choose
HARD_METADATA_RESETget a table-wide metadata reset that clears all refs and snapshots, then recreates the configured non-maincommit branch. Orphan file cleanup is still left to normal Iceberg maintenance.How was this patch tested?
JAVA_HOME=/Users/coolcorgy/Library/Java/JavaVirtualMachines/azul-17.0.11/Contents/Home PATH=/Users/coolcorgy/Library/Java/JavaVirtualMachines/azul-17.0.11/Contents/Home/bin:$PATH ./mvnw -o -nsu -pl seatunnel-connectors-v2/connector-iceberg spotless:applyIcebergCatalogMetadataResetTestandIcebergSaveModeHandlerTestto cover both strategies and branch handling