Spark: MERGE INTO Statements with only WHEN NOT MATCHED Clauses are always executed at Snapshot Isolation #12653

Open
@hussein-awala

Apache Iceberg version

1.8.1 (latest release)

Query engine

Spark

Please describe the bug 🐞

When running two concurrent MERGE INTO operations on an Apache Iceberg table, I expect the combined result to be idempotent: Iceberg should either detect conflicts and resolve them, or fail one of the jobs to prevent data inconsistencies.

However, Iceberg determines the operation type dynamically based on the result of the join condition, which can lead to unexpected behavior:

  • If a match is found, Iceberg treats it as an overwrite operation and fails the second job due to conflicting commits.
  • If no match is found, Iceberg considers it an append operation and attempts to resolve conflicts by creating a new manifest for appended data, as explained in the Cost of Retries doc.

This behavior introduces a problem:
If the dataset is large enough for the two jobs to overlap in time and neither job finds a match, both proceed with appending data independently, which produces duplicate records.
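
The race described above can be illustrated with a toy model of optimistic commits. This is a simplified sketch, not Iceberg's implementation: `ToyTable`, `plan_merge`, `CommitConflict`, and `run_two_concurrent_merges` are hypothetical names, rows are reduced to bare ids, and the only rules modeled are the two described in this issue (an overwrite fails if the table changed after it was read, while an append is replayed on top of the latest snapshot without re-running the join).

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when a commit cannot be replayed on top of a newer snapshot."""


@dataclass
class ToyTable:
    # Hypothetical model: each snapshot is the full list of row ids,
    # and the last entry is the current table state.
    snapshots: list = field(default_factory=lambda: [[]])

    def current_version(self):
        return len(self.snapshots) - 1

    def read(self, version):
        return list(self.snapshots[version])

    def commit(self, base_version, operation, rows):
        stale = base_version != self.current_version()
        if operation == "overwrite":
            # Overwrites are validated against the snapshot they were planned on.
            if stale:
                raise CommitConflict("table changed since this merge read it")
            self.snapshots.append(list(rows))
        else:
            # Appends are replayed on the latest snapshot; the join is NOT re-run.
            self.snapshots.append(self.snapshots[-1] + list(rows))


def plan_merge(table, src_ids):
    """Plan a merge on dest.id = src.id; the operation type depends on the join result."""
    base = table.current_version()
    dest_ids = table.read(base)
    new_ids = [i for i in src_ids if i not in dest_ids]
    if len(new_ids) < len(src_ids):
        # At least one source row matched, so existing data is rewritten.
        return base, "overwrite", dest_ids + new_ids
    # No match at all: the merge degenerates into a plain append.
    return base, "append", new_ids


def run_two_concurrent_merges(initial_ids, src_ids):
    table = ToyTable(snapshots=[list(initial_ids)])
    # Both jobs plan against the same snapshot before either one commits.
    job_a = plan_merge(table, src_ids)
    job_b = plan_merge(table, src_ids)
    table.commit(*job_a)
    table.commit(*job_b)
    return table.read(table.current_version())
```

In this model, `run_two_concurrent_merges([], [1, 2, 3])` returns `[1, 2, 3, 1, 2, 3]`, because both jobs take the append path and every source row is inserted twice. With `run_two_concurrent_merges([1], [1, 2, 3])`, both jobs see a match, take the overwrite path, and the second commit raises `CommitConflict`.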

Reproduction Steps

Running the following query in concurrent jobs can result in duplicate data if no matching records exist in dest:

MERGE INTO dest
USING src
ON dest.id = src.id
-- Adding an update action does not avoid the issue:
-- WHEN MATCHED THEN
--   UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *

I initially expected the operation type to be determined by the query itself (i.e., always "append" for a query without an UPDATE action). However, testing showed that Iceberg decides the operation type at runtime, based on the actual join results. This makes MERGE INTO non-idempotent and leads to unintended duplicate inserts.

Expected Behavior

Iceberg should ensure idempotency for MERGE INTO, preventing duplicate data when no matches are found.
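
One possible reading of this expectation, sketched as a toy commit-time check: before an insert-only merge is committed, the keys it is about to insert are compared with the keys added by commits that landed after the merge read the table. This is an illustration only, not Iceberg's API or a proposed patch. `DuplicateKeyConflict` and `commit_insert_only_merge` are hypothetical names, and the table is modeled as an append-only list of ids, so "rows added since the base snapshot" is simply the tail of the list.

```python
class DuplicateKeyConflict(Exception):
    """Raised when a concurrent commit already inserted some of the same keys."""


def commit_insert_only_merge(table_rows, base_len, keys_to_insert):
    """Commit an insert-only merge against a toy append-only table.

    table_rows     -- current list of row ids in the table
    base_len       -- number of rows the table had when the merge ran its join
    keys_to_insert -- ids the merge found to be NOT MATCHED at that time
    """
    # Rows committed by other writers after this merge read the table.
    added_since_base = table_rows[base_len:]
    overlap = set(keys_to_insert) & set(added_since_base)
    if overlap:
        # The NOT MATCHED decision is stale for these keys, so fail the commit
        # instead of silently appending duplicates.
        raise DuplicateKeyConflict(
            f"keys already inserted by a concurrent commit: {sorted(overlap)}"
        )
    return table_rows + list(keys_to_insert)
```

Under these assumptions, a first job committing `[1, 2, 3]` on an empty table succeeds, a concurrent job that planned the same keys against the empty table fails with `DuplicateKeyConflict`, and a concurrent job inserting disjoint keys such as `[4, 5]` still commits.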

Additional Context

  • Iceberg version: 1.8.1
  • Iceberg catalog: Glue catalog (type glue) with S3 FileIO
  • Spark version: 3.5.5

Would love to hear if others have encountered this or if there's a recommended workaround.

