
Iceberg Glue Concurrent Update can result in missing metadata_location #9411

Closed as not planned

Description

@shaeqahmed

Apache Iceberg version

1.4.2 (latest release)

Query engine

None

Please describe the bug 🐞

A similar issue that I found, which was supposedly fixed in an older version: #7151

We have Java Iceberg code that processes messages from a FIFO queue and commits to Iceberg in a single-threaded fashion. I have confirmed that we are not committing to the same table from more than one place at the same time. However, after a few back-to-back commits, we encountered the following WARN log indicating that Glue detected a concurrent update and the commit was being retried:

Retrying task after failure: Cannot commit glue_catalog.matano.cloudflare_http_request because Glue detected concurrent update org.apache.iceberg.exceptions.CommitFailedException: Cannot commit glue_catalog.matano.cloudflare_http_request because Glue detected concurrent update at org.apache.iceberg.aws.glue.GlueTableOperations.handleAWSExceptions(GlueTableOperations.java:355) ~[output.jar:?] at org.apache.iceberg.aws.glue.GlueTableOperations.doCommit(GlueTableOperations.java:180) 
...

But immediately after this log, while attempting to refresh the Iceberg metadata, we get a NotFoundException because the current metadata location does not exist (or no longer exists):

INFO BaseMetastoreTableOperations - Refreshing table metadata from new version: s3://redacted-bucket/lake/cloudflare_http_request/metadata/xxx-e3e8a38dbdc4.metadata.json

ERROR IcebergMetadataWriter - org.apache.iceberg.exceptions.NotFoundException: Location does not exist: s3://redacted-bucket/lake/cloudflare_http_request/metadata/xxx-e3e8a38dbdc4.metadata.json

This has resulted in our table becoming corrupt and the availability of our data lake service being affected until we manually fixed the table by referencing Glue's previous_metadata_location and overriding the invalid current metadata_location with it.
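
For reference, the manual fix amounted to roughly the following (a rough sketch against the AWS SDK for Java v2; restoreMetadataLocation is a hypothetical helper, and the database/table names are placeholders — it simply copies previous_metadata_location back over metadata_location):

    import software.amazon.awssdk.services.glue.GlueClient
    import software.amazon.awssdk.services.glue.model.GetTableRequest
    import software.amazon.awssdk.services.glue.model.TableInput
    import software.amazon.awssdk.services.glue.model.UpdateTableRequest

    fun restoreMetadataLocation(glue: GlueClient, database: String, tableName: String) {
        val table = glue.getTable(
            GetTableRequest.builder().databaseName(database).name(tableName).build()
        ).table()

        val params = HashMap(table.parameters())
        // roll the pointer back to the last metadata file Glue recorded as committed
        val previous = params["previous_metadata_location"]
            ?: error("no previous_metadata_location to roll back to")
        params["metadata_location"] = previous

        val input = TableInput.builder()
            .name(table.name())
            .tableType(table.tableType())
            .storageDescriptor(table.storageDescriptor())
            .parameters(params)
            .build()

        glue.updateTable(
            UpdateTableRequest.builder().databaseName(database).tableInput(input).build()
        )
    }

In practice you would also want to confirm that the previous metadata file still exists in S3 before switching the pointer back to it.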

My understanding is that a CommitFailedException (CFE) is retried internally and, even if all retries fail, should never leave the table corrupt (the retry behaviour itself can be tuned via table properties; see the sketch after the code block below). Our code looks as follows; we catch all exceptions:

// tableObj is our class, a thin wrapper around the Iceberg Java Table class

        logger.info("Committing for tables: ${tableObjs.keys}")
        start = System.currentTimeMillis()
        runBlocking {
            for (tableObj in tableObjs.values) {
                // one commit per table, dispatched on the IO pool
                launch(Dispatchers.IO) {
                    try {
                        if (tableObj.isInitalized()) {
                            tableObj.getAppendFiles().commit()
                        }
                    } catch (e: Exception) {
                        // on any failure, requeue the SQS messages for this table
                        logger.error(e.message)
                        e.printStackTrace()
                        failures.addAll(tableObj.sqsMessageIds)
                    }
                }
            }
        }

        logger.info("Committed tables in ${System.currentTimeMillis() - start} ms")

Is this a bug in the Iceberg Glue catalog code, or how should we protect ourselves from a situation where the Iceberg table is left pointing to an invalid metadata location after commits fail due to concurrent modifications detected by Glue?
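
The only mitigation I can think of on our side is to treat the two commit failure modes differently, i.e. handle CommitStateUnknownException separately from a plain CommitFailedException (again just a sketch against our wrapper class above; I'm not sure it actually prevents the corruption described here):

    import org.apache.iceberg.exceptions.CommitFailedException
    import org.apache.iceberg.exceptions.CommitStateUnknownException

    try {
        if (tableObj.isInitalized()) {
            tableObj.getAppendFiles().commit()
        }
    } catch (e: CommitFailedException) {
        // commit definitively failed and was cleaned up; safe to requeue and retry later
        logger.warn("Commit failed for ${tableObj}: ${e.message}")
        failures.addAll(tableObj.sqsMessageIds)
    } catch (e: CommitStateUnknownException) {
        // commit may or may not have been applied; don't requeue blindly,
        // flag the table for manual inspection instead
        logger.error("Commit state unknown for ${tableObj}", e)
    } catch (e: Exception) {
        logger.error(e.message)
        failures.addAll(tableObj.sqsMessageIds)
    }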
