Description
Apache Iceberg version
1.8.1 (latest release)
Query engine
Spark
Please describe the bug 🐞
When running two concurrent MERGE INTO operations on an Apache Iceberg table, I expect them to be idempotent, meaning Iceberg should either detect conflicts and resolve them or fail one of the jobs to prevent data inconsistencies.
However, Iceberg determines the operation type dynamically based on the result of the join condition, which can lead to unexpected behavior:
- If a match is found, Iceberg treats it as an overwrite operation and fails the second job due to conflicting commits.
- If no match is found, Iceberg considers it an append operation and attempts to resolve conflicts by creating a new manifest for appended data, as explained in the Cost of Retries doc.
This behavior introduces a problem:
If the dataset is large enough and neither job finds a match, both will proceed with appending data independently, causing duplicate records.
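To make the race concrete, here is a minimal toy model in plain Python. It is not Iceberg code and does not use any Iceberg API: `ToyTable`, `plan_merge`, `commit`, and `run_concurrent` are hypothetical names, and the commit rules are only an approximation of the behavior described above (a stale overwrite is rejected, a stale append is retried on top of the new snapshot without re-running the join).

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when a plan cannot be committed on top of a newer snapshot."""


@dataclass
class ToyTable:
    # Toy stand-in for a table with optimistic commits (NOT Iceberg's API).
    rows: list = field(default_factory=list)
    version: int = 0


def plan_merge(table, src_ids):
    """Plan an insert-only merge against the snapshot visible at read time."""
    existing = set(table.rows)
    matched = [i for i in src_ids if i in existing]
    inserts = [i for i in src_ids if i not in existing]
    # Assumption taken from the report: the operation type follows the join result.
    operation = "overwrite" if matched else "append"
    return {"base_version": table.version, "operation": operation, "inserts": inserts}


def commit(table, plan):
    """Commit a plan; only overwrites are validated against newer snapshots."""
    stale = plan["base_version"] != table.version
    if stale and plan["operation"] == "overwrite":
        raise CommitConflict("overwrite was planned against a stale snapshot")
    # A stale append is applied as-is: the join is not evaluated again.
    table.rows.extend(plan["inserts"])
    table.version += 1


def run_concurrent(dest_rows, src_ids):
    """Two jobs plan against the same snapshot, then commit one after the other."""
    table = ToyTable(rows=list(dest_rows))
    plan_a = plan_merge(table, src_ids)
    plan_b = plan_merge(table, src_ids)
    commit(table, plan_a)
    commit(table, plan_b)
    return table.rows
```

In this model, `run_concurrent([], [1, 2])` leaves ids 1 and 2 in the table twice because both plans are appends, while `run_concurrent([1], [1, 2])` raises `CommitConflict` because the second plan is an overwrite built on a stale snapshot.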
Reproduction Steps
Running the following query in concurrent jobs can result in duplicate data if no matching records exist in dest:
MERGE INTO dest
USING src
ON dest.id = src.id
WHEN NOT MATCHED THEN
INSERT *
-- even with update action, we'll have the same issue
-- WHEN MATCHED THEN
-- UPDATE SET *
I initially expected the operation type to be determined by the query itself (i.e., always "append" for a query with no UPDATE action). However, through testing, I found that Iceberg decides the operation type at runtime, based on the actual join results. This makes MERGE INTO non-idempotent, leading to unintended duplicate inserts.
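For anyone trying to reproduce this, a small PySpark helper along the following lines may help check which operation each commit recorded and whether duplicates were written. It is a sketch under assumptions: the function names are hypothetical, `spark` is an existing SparkSession configured with the Iceberg catalog, `table` is passed as a fully qualified name, and the key column is `id` as in the query above. The only Iceberg feature it relies on is the `snapshots` metadata table and its `committed_at`, `snapshot_id`, and `operation` columns.

```python
def snapshot_operations_sql(table: str) -> str:
    """SQL listing the operation recorded for each snapshot of `table`."""
    return (
        f"SELECT committed_at, snapshot_id, operation "
        f"FROM {table}.snapshots ORDER BY committed_at"
    )


def duplicate_ids_sql(table: str, key: str = "id") -> str:
    """SQL returning every key value that appears more than once in `table`."""
    return (
        f"SELECT {key}, COUNT(*) AS copies "
        f"FROM {table} GROUP BY {key} HAVING COUNT(*) > 1"
    )


def inspect_merge_result(spark, table: str, key: str = "id"):
    """Collect snapshot operations and duplicated keys after the concurrent jobs finish.

    `spark` is assumed to be a SparkSession with the Iceberg catalog configured.
    """
    operations = spark.sql(snapshot_operations_sql(table)).collect()
    duplicates = spark.sql(duplicate_ids_sql(table, key)).collect()
    return operations, duplicates
```

A call such as `inspect_merge_result(spark, "glue_catalog.db.dest")` (the catalog and namespace here are placeholders) would be expected to show two `append` snapshots and a non-empty duplicate list when the problem occurs.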
Expected Behavior
Iceberg should ensure idempotency for MERGE INTO, preventing duplicate data when no matches are found.
Additional Context
- Iceberg version: 1.8.1
- Iceberg catalog: Glue catalog (type glue) with S3 FileIO
- Spark version: 3.5.5
Would love to hear if others have encountered this or if there's a recommended workaround.