[Fix][Connectors-v2] Avoid Iceberg DROP_DATA full-table delete commit by corgy-w · Pull Request #10851 · apache/seatunnel

corgy-w · 2026-05-07T07:35:33Z

Purpose of this pull request

Iceberg DROP_DATA previously used a full-table delete commit. That path can load metadata, snapshots, and manifests aggressively enough to OOM large tables before writes start.

A previous attempt to always switch DROP_DATA to metadata reset avoided that path, but it also regressed the existing iceberg.table.commit-branch capability by rejecting tables with non-main refs.

This patch adds an explicit Iceberg-only iceberg.drop-data.strategy option with two modes:

DELETE_COMMIT (default): keeps the historical delete-commit behavior and targets iceberg.table.commit-branch when configured
HARD_METADATA_RESET: removes all snapshot refs and snapshots through TableOperations.commit, then recreates the configured non-main commit branch if needed

The patch also removes the redundant instanceof IcebergCatalog guard from the sink save-mode path, fixes the Iceberg catalog factory error text, updates the Iceberg sink docs, and adds unit coverage for both strategies.

Does this PR introduce any user-facing change?

Yes. Iceberg sink users can now choose iceberg.drop-data.strategy.

The default remains DELETE_COMMIT, so existing DROP_DATA jobs keep their historical behavior, including branch-targeted delete commits when iceberg.table.commit-branch is configured.

Users who explicitly choose HARD_METADATA_RESET get a table-wide metadata reset that clears all refs and snapshots, then recreates the configured non-main commit branch. Orphan file cleanup is still left to normal Iceberg maintenance.

How was this patch tested?

Ran JAVA_HOME=/Users/coolcorgy/Library/Java/JavaVirtualMachines/azul-17.0.11/Contents/Home PATH=/Users/coolcorgy/Library/Java/JavaVirtualMachines/azul-17.0.11/Contents/Home/bin:$PATH ./mvnw -o -nsu -pl seatunnel-connectors-v2/connector-iceberg spotless:apply
Updated IcebergCatalogMetadataResetTest and IcebergSaveModeHandlerTest to cover both strategies and branch handling

davidzollo

Thanks for the contribution. I reviewed the full current head locally on seatunnel-review-10851 against upstream/dev. This was a source-level review only; I did not run local Maven in this round.

What This PR Fixes

User pain: DROP_DATA on Iceberg should avoid a heavy full-table delete commit when a safer metadata-level reset is possible.
Fix approach: keepSchemaDropData() now routes into truncateTable(), and truncateTable() replaces the old deleteFromRowFilter(alwaysTrue()) path with a metadata reset.
One-line summary: the metadata-reset direction is good for plain tables, but the current head introduces a new hard failure for tables that have non-main snapshot refs, which is a visible behavior narrowing compared with the old path.

1. Code Review

1.1 Core Logic Analysis

Exact change scope:
- seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/sink/IcebergSaveModeHandler.java:43-51
- seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/catalog/IcebergCatalog.java:276-314
Before / after:

// older truncate path
catalog.loadTable(icebergTableIdentifier)
        .newDelete()
        .deleteFromRowFilter(Expressions.alwaysTrue())
        .commit();

// current path
Set<String> nonMainRefs = table.refs().keySet().stream()
        .filter(refName -> !SnapshotRef.MAIN_BRANCH.equals(refName))
        .collect(Collectors.toCollection(LinkedHashSet::new));
if (!nonMainRefs.isEmpty()) {
    throw new CatalogException(...);
}
// then perform metadata reset

Key findings:
1. The normal KEEP_SCHEMA + DROP_DATA runtime path definitely hits this change.
2. The metadata-reset direction itself is reasonable for plain tables.
3. The new implementation now rejects any table that has non-main snapshot refs.
4. That is a real behavior narrowing compared with the old path, not just an internal refactor.
Runtime chain I checked:

save mode path
  -> DefaultSaveModeHandler.keepSchemaDropData()
      -> IcebergSaveModeHandler.keepSchemaDropData()
          -> icebergCatalog.truncateTable(tablePath, true)
              -> loadTable(...)
              -> inspect table.refs()
              -> if non-main refs exist -> throw CatalogException
              -> else perform metadata reset commit

1.2 Compatibility Impact

Verdict: partially incompatible.
API/config/default/protocol/serialization are unchanged.
The incompatible part is runtime behavior: a table that previously could still be cleared through the older path is now rejected immediately if it has non-main refs.

1.3 Performance / Side Effects

For plain tables, the metadata-reset approach is a reasonable performance win.
The main side effect is compatibility-related: the supported table shape is narrower than before.

1.4 Error Handling and Logs

Issue 1: Tables with non-main snapshot refs are now rejected outright

Location: IcebergCatalog.java:285
Why this is a real problem:
this code sits directly on the real KEEP_SCHEMA + DROP_DATA path. If the table has non-main refs, the PR now turns that into a hard failure instead of falling back to a behavior-compatible truncation path.
Risk:
users with Iceberg tag/branch usage can hit a main-path failure after upgrading to this implementation.
Suggested fix:
if non-main refs are present, fall back to the older delete-based truncation path instead of failing the operation outright. If you intentionally want to keep the new limitation, it needs to be documented explicitly as an incompatible behavior change.
Severity: High

2. Code Quality

The implementation is cleanly structured.
I would still add one contract-style test around tables with non-main refs, because that is exactly where the current compatibility regression sits.

3. Architecture

For plain tables this is a good direction.
The blocker is that the current head narrows the supported table shape without a compatibility-safe fallback.

4. Issue Summary

No.	Issue	Location	Severity
1	Tables with non-main snapshot refs are now rejected outright	`IcebergCatalog.java:285`	High

5. Merge Conclusion

Conclusion: merge after fixes

Blocking items

Issue 1 must be fixed before merge.

Suggested non-blocking follow-up

Add one contract test around the non-main-ref path so this boundary stays explicit.

The metadata-reset idea is worth keeping. The blocker is the new compatibility regression on tagged/branched Iceberg tables, not the overall direction.

Posted from the wrong account by mistake; superseded by the DanielLeens review.

DanielLeens

Thanks for the contribution. I reviewed the full current head locally on seatunnel-review-10851 against upstream/dev. This was a source-level review only; I did not run local Maven in this round.

What This PR Fixes

User pain: DROP_DATA on Iceberg should avoid a heavy full-table delete commit when a safer metadata-level reset is possible.
Fix approach: keepSchemaDropData() now routes into truncateTable(), and truncateTable() replaces the old deleteFromRowFilter(alwaysTrue()) path with a metadata reset.
One-line summary: the metadata-reset direction is good for plain tables, but the current head introduces a new hard failure for tables that have non-main snapshot refs, which is a visible behavior narrowing compared with the old path.

1. Code Review

1.1 Core Logic Analysis

Exact change scope:
- seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/sink/IcebergSaveModeHandler.java:43-51
- seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/catalog/IcebergCatalog.java:276-314
Before / after:

// older truncate path
catalog.loadTable(icebergTableIdentifier)
        .newDelete()
        .deleteFromRowFilter(Expressions.alwaysTrue())
        .commit();

// current path
Set<String> nonMainRefs = table.refs().keySet().stream()
        .filter(refName -> !SnapshotRef.MAIN_BRANCH.equals(refName))
        .collect(Collectors.toCollection(LinkedHashSet::new));
if (!nonMainRefs.isEmpty()) {
    throw new CatalogException(...);
}
// then perform metadata reset

Key findings:
1. The normal KEEP_SCHEMA + DROP_DATA runtime path definitely hits this change.
2. The metadata-reset direction itself is reasonable for plain tables.
3. The new implementation now rejects any table that has non-main snapshot refs.
4. That is a real behavior narrowing compared with the old path, not just an internal refactor.
Runtime chain I checked:

save mode path
  -> DefaultSaveModeHandler.keepSchemaDropData()
      -> IcebergSaveModeHandler.keepSchemaDropData()
          -> icebergCatalog.truncateTable(tablePath, true)
              -> loadTable(...)
              -> inspect table.refs()
              -> if non-main refs exist -> throw CatalogException
              -> else perform metadata reset commit

1.2 Compatibility Impact

Verdict: partially incompatible.
API/config/default/protocol/serialization are unchanged.
The incompatible part is runtime behavior: a table that previously could still be cleared through the older path is now rejected immediately if it has non-main refs.

1.3 Performance / Side Effects

For plain tables, the metadata-reset approach is a reasonable performance win.
The main side effect is compatibility-related: the supported table shape is narrower than before.

1.4 Error Handling and Logs

Issue 1: Tables with non-main snapshot refs are now rejected outright

Location: IcebergCatalog.java:285
Why this is a real problem:
this code sits directly on the real KEEP_SCHEMA + DROP_DATA path. If the table has non-main refs, the PR now turns that into a hard failure instead of falling back to a behavior-compatible truncation path.
Risk:
users with Iceberg tag/branch usage can hit a main-path failure after upgrading to this implementation.
Suggested fix:
if non-main refs are present, fall back to the older delete-based truncation path instead of failing the operation outright. If you intentionally want to keep the new limitation, it needs to be documented explicitly as an incompatible behavior change.
Severity: High

2. Code Quality

The implementation is cleanly structured.
I would still add one contract-style test around tables with non-main refs, because that is exactly where the current compatibility regression sits.

3. Architecture

For plain tables this is a good direction.
The blocker is that the current head narrows the supported table shape without a compatibility-safe fallback.

4. Issue Summary

No.	Issue	Location	Severity
1	Tables with non-main snapshot refs are now rejected outright	`IcebergCatalog.java:285`	High

5. Merge Conclusion

Conclusion: merge after fixes

Blocking items

Issue 1 must be fixed before merge.

Suggested non-blocking follow-up

Add one contract test around the non-main-ref path so this boundary stays explicit.

The metadata-reset idea is worth keeping. The blocker is the new compatibility regression on tagged/branched Iceberg tables, not the overall direction.

chl-wxp · 2026-05-08T01:51:53Z

Suggested change

.conditional(

IcebergSinkOptions.DATA_SAVE_MODE,

DataSaveMode.DROP_DATA,

IcebergDropDataStrategy)

DanielLeens

Thanks for the update. I re-reviewed the latest head locally against the current upstream/dev. This was a source-level review only; I did not run local Maven in this round.

What This PR Fixes

User pain: data_save_mode = DROP_DATA on Iceberg can be expensive when it always goes through a full-table delete commit.
Fix approach: introduce an explicit iceberg.drop-data.strategy; keep DELETE_COMMIT as the default for compatibility, and add an opt-in HARD_METADATA_RESET path plus branch / docs / tests.
One-line summary: the current head resolves the compatibility blocker from my previous review by restoring the legacy default behavior and making metadata reset an explicit opt-in strategy.

Runtime Chain I Checked

SaveMode path
  -> IcebergSink.getSaveModeHandler()
      -> new IcebergSaveModeHandler(..., dropDataStrategy, commitBranch)
      -> handleDataSaveMode()
          -> keepSchemaDropData()
              -> IcebergCatalog.truncateTable(tablePath, true, strategy, commitBranch)
                  -> strategy = DELETE_COMMIT
                       -> table.newDelete()
                       -> optional toBranch(commitBranch)
                       -> deleteFromRowFilter(alwaysTrue())
                       -> commit()
                  -> strategy = HARD_METADATA_RESET
                       -> TableOperations.current()
                       -> remove all refs + snapshots from metadata
                       -> operations.commit(base, resetMetadata)
                       -> recreateCommitBranchIfNeeded(non-main branch only)

Key Findings

The normal data_save_mode = DROP_DATA path definitely hits this change.
The default strategy is now DELETE_COMMIT, so existing jobs keep the legacy behavior.
The destructive metadata-reset path is opt-in, and both EN/ZH docs now say explicitly that it clears all snapshot refs / snapshots and leaves orphan cleanup to separate Iceberg maintenance.
The blocker from my previous review is resolved: the PR no longer hard-fails tagged / branched tables on the default path.

I also checked the upstream Iceberg ManageSnapshots#createBranch(String) contract from the published source jar. When the current snapshot is null, Iceberg creates a new empty snapshot for that branch instead of failing, so the non-main branch recreation logic is valid at the API-contract level.

No blocking issue found.

Conclusion: can merge

Blocking items

None.

Suggested non-blocking follow-up

If this area evolves further, one more integration-style check around the first write after a hard metadata reset would still be useful.

Overall, this version fixes the compatibility problem from the last round in the right way: metadata reset remains available, but it is now behind an explicit opt-in knob and the default path stays backward-compatible.

Avoid Iceberg DROP_DATA full-table delete commit

c7e9429

github-actions Bot added connectors-v2 iceberg labels May 7, 2026

[Fix][Connectors-v2] Use metadata reset for Iceberg DROP_DATA

d37f651

corgy-w force-pushed the corgy/dev-iceberg-drop-data-recreate branch from 2dc9413 to d37f651 Compare May 7, 2026 09:20

davidzollo previously requested changes May 7, 2026

View reviewed changes

DanielLeens suggested changes May 7, 2026

View reviewed changes

[Fix][Connectors-v2] Add configurable Iceberg DROP_DATA strategy

6994616

github-actions Bot added the document label May 7, 2026

chl-wxp reviewed May 8, 2026

View reviewed changes

DanielLeens approved these changes May 8, 2026

View reviewed changes

github-actions Bot added the reviewed label May 8, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Fix][Connectors-v2] Avoid Iceberg DROP_DATA full-table delete commit#10851

[Fix][Connectors-v2] Avoid Iceberg DROP_DATA full-table delete commit#10851
corgy-w wants to merge 3 commits into
apache:devfrom
corgy-w:corgy/dev-iceberg-drop-data-recreate

corgy-w commented May 7, 2026 •

edited

Loading

Uh oh!

davidzollo left a comment

Uh oh!

DanielLeens left a comment

Uh oh!

chl-wxp May 8, 2026

Uh oh!

DanielLeens left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

+                        .conditional(
+                                IcebergSinkOptions.DATA_SAVE_MODE,
+                                DataSaveMode.DROP_DATA,
+                                IcebergDropDataStrategy)

Conversation

corgy-w commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose of this pull request

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

davidzollo left a comment

Choose a reason for hiding this comment

What This PR Fixes

1. Code Review

1.1 Core Logic Analysis

1.2 Compatibility Impact

1.3 Performance / Side Effects

1.4 Error Handling and Logs

2. Code Quality

3. Architecture

4. Issue Summary

5. Merge Conclusion

Conclusion: merge after fixes

Uh oh!

DanielLeens left a comment

Choose a reason for hiding this comment

What This PR Fixes

1. Code Review

1.1 Core Logic Analysis

1.2 Compatibility Impact

1.3 Performance / Side Effects

1.4 Error Handling and Logs

2. Code Quality

3. Architecture

4. Issue Summary

5. Merge Conclusion

Conclusion: merge after fixes

Uh oh!

chl-wxp May 8, 2026

Choose a reason for hiding this comment

Uh oh!

DanielLeens left a comment

Choose a reason for hiding this comment

What This PR Fixes

Runtime Chain I Checked

Key Findings

Conclusion: can merge

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

corgy-w commented May 7, 2026 •

edited

Loading