HIVE-29437: Iceberg: Fix concurrency issues between compaction and co…#6292
Open
difin wants to merge 2 commits into apache:master
…ncurrent write operations.
deniskuzZ
reviewed
Feb 3, 2026
      IcebergCompactionUtil.getDataFiles(table, snapshotId, partitionPath, fileSizeThreshold);
  List<DeleteFile> existingDeleteFiles = fileSizeThreshold == -1 ?
-     IcebergCompactionUtil.getDeleteFiles(table, partitionPath) : Collections.emptyList();
+     IcebergCompactionUtil.getDeleteFiles(table, snapshotId, partitionPath) : Collections.emptyList();
Member
Please add a test.
As an example, you could use TestConflictingDataFiles#testConflictingUpdateAndDelete.
deniskuzZ
reviewed
Feb 3, 2026
  Table deletesTable =
      MetadataTableUtils.createMetadataTableInstance(table, MetadataTableType.POSITION_DELETES);
- CloseableIterable<ScanTask> deletesScanTasks = deletesTable.newBatchScan().planFiles();
+ CloseableIterable<ScanTask> deletesScanTasks = deletesTable.newBatchScan().useSnapshot(snapshotId).planFiles();
Member
Why do you use newBatchScan() here but newScan() in getDataFiles? Should we use BatchScan in both places?
Contributor
Author
It was in the existing code, changed to newScan().
Changed to use BatchScan in both places.
b842441 to
5433c78
What changes were proposed in this pull request?
Fixing concurrency issues between compaction and concurrent write operations.
Why are the changes needed?
Downstream testing found that when Hive Iceberg compaction runs in parallel with Spark write operations on the same table, compaction sometimes produces wrong results. At commit time, Hive already has the compacted data files and must collect the existing uncompacted data and delete files in the table or partition so that it can replace them with the compacted files. The issue is that Hive collects those uncompacted data and delete files from the latest Iceberg snapshot instead of the original snapshot (the one compaction started from). The latest snapshot may contain different files because of concurrent write operations, which can lead to data corruption.
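The race described above can be illustrated with a small self-contained model. This is only a sketch and does not use Hive or Iceberg classes: `ToyTable`, `filesToReplaceFromLatest`, and `filesToReplacePinned` are hypothetical names invented for illustration. The actual patch instead passes `snapshotId` into the `IcebergCompactionUtil` helpers.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model; none of these names exist in Hive or Iceberg.
class SnapshotPinningSketch {

  // Stand-in for a table: each snapshot id maps to the set of live file names.
  static final class ToyTable {
    private final Map<Long, Set<String>> snapshots = new HashMap<>();
    private long currentSnapshotId = 0L;

    ToyTable(String... initialFiles) {
      snapshots.put(currentSnapshotId, new TreeSet<>(Arrays.asList(initialFiles)));
    }

    long currentSnapshotId() {
      return currentSnapshotId;
    }

    // Models a concurrent writer: appending files produces a new snapshot.
    long append(String... newFiles) {
      Set<String> next = new TreeSet<>(snapshots.get(currentSnapshotId));
      next.addAll(Arrays.asList(newFiles));
      currentSnapshotId++;
      snapshots.put(currentSnapshotId, next);
      return currentSnapshotId;
    }

    // Returns a copy of the files that were live in the given snapshot.
    Set<String> filesAt(long snapshotId) {
      return new TreeSet<>(snapshots.get(snapshotId));
    }
  }

  // Problematic variant: collects files from whatever snapshot is latest at commit time.
  static Set<String> filesToReplaceFromLatest(ToyTable table) {
    return table.filesAt(table.currentSnapshotId());
  }

  // Pinned variant: collects files from the snapshot that compaction started from.
  static Set<String> filesToReplacePinned(ToyTable table, long snapshotId) {
    return table.filesAt(snapshotId);
  }

  public static void main(String[] args) {
    ToyTable table = new ToyTable("data-1", "data-2", "delete-1");
    long compactionSnapshotId = table.currentSnapshotId();

    // A concurrent write lands while compaction is still running.
    table.append("data-3");

    System.out.println("latest: " + filesToReplaceFromLatest(table));
    System.out.println("pinned: " + filesToReplacePinned(table, compactionSnapshotId));
  }
}
```

In this model the "latest" variant also picks up `data-3`, a file that compaction never read, so replacing it with the compacted output would drop the concurrently written rows. The pinned variant returns only the three files that existed when compaction started.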
Does this PR introduce any user-facing change?
No
How was this patch tested?
The fix was validated downstream with concurrent Spark write operations and Hive Iceberg compaction.