Description
Apache Iceberg version
1.4.2 (latest release)
Query engine
None
Please describe the bug 🐞
A similar issue I found that was supposedly fixed in an older version: #7151
We have Java Iceberg code that processes messages from a FIFO queue and commits to Iceberg in a single-threaded fashion. I have confirmed that we never make concurrent commits to the same table. However, after making a few commits back to back, at some point we encountered the following WARN log indicating that Glue detected a concurrent update and was retrying:
```
Retrying task after failure: Cannot commit glue_catalog.matano.cloudflare_http_request because Glue detected concurrent update
org.apache.iceberg.exceptions.CommitFailedException: Cannot commit glue_catalog.matano.cloudflare_http_request because Glue detected concurrent update
    at org.apache.iceberg.aws.glue.GlueTableOperations.handleAWSExceptions(GlueTableOperations.java:355) ~[output.jar:?]
    at org.apache.iceberg.aws.glue.GlueTableOperations.doCommit(GlueTableOperations.java:180)
```
...
But immediately after this log, the subsequent attempt to refresh the Iceberg metadata fails with an Iceberg NotFoundException, because the current metadata location no longer exists:
```
INFO  BaseMetastoreTableOperations - Refreshing table metadata from new version: s3://redacted-bucket/lake/cloudflare_http_request/metadata/xxx-e3e8a38dbdc4.metadata.json
ERROR IcebergMetadataWriter - org.apache.iceberg.exceptions.NotFoundException: Location does not exist: s3://redacted-bucket/lake/cloudflare_http_request/metadata/xxx-e3e8a38dbdc4.metadata.json
```
This corrupted our table and affected the availability of our data lake service until we manually fixed the table by taking the previous_metadata_location from Glue and overriding the invalid current metadata_location with it.
My understanding is that a CommitFailedException (CFE) is retried internally, and in any case it should not leave the table corrupt even if all retries fail. Our code looks as follows; note that we catch all exceptions:
```kotlin
// tableObj is our class, a thin wrapper around the Iceberg Java Table class
logger.info("Committing for tables: ${tableObjs.keys}")
start = System.currentTimeMillis()
runBlocking {
    for (tableObj in tableObjs.values) {
        launch(Dispatchers.IO) {
            try {
                if (tableObj.isInitalized()) {
                    tableObj.getAppendFiles().commit()
                }
            } catch (e: Exception) {
                logger.error(e.message)
                e.printStackTrace()
                failures.addAll(tableObj.sqsMessageIds)
            }
        }
    }
}
logger.info("Committed tables in ${System.currentTimeMillis() - start} ms")
```
Is this a bug in the Glue Iceberg code, or how should we protect ourselves from a situation where the Iceberg table is left pointing to an invalid location after commits fail with concurrent-modification errors from Glue?
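For what it's worth, one defensive pattern we are considering is to classify commit outcomes instead of catching all exceptions uniformly: only a clean CommitFailedException is treated as retryable, while an unknown commit state is surfaced for manual reconciliation rather than retried or cleaned up. This is a sketch, not a confirmed fix; the exception classes below are local stand-ins for `org.apache.iceberg.exceptions.CommitFailedException` and `CommitStateUnknownException` so the snippet is self-contained:

```kotlin
// Local stand-ins for the real org.apache.iceberg.exceptions.* classes,
// so this sketch compiles on its own.
class CommitFailedException(msg: String) : RuntimeException(msg)
class CommitStateUnknownException(msg: String) : RuntimeException(msg)

sealed class CommitResult {
    object Committed : CommitResult()
    object FailedCleanly : CommitResult()   // commit definitely did not happen; safe to re-queue
    object Unknown : CommitResult()         // outcome unknown; do NOT delete files, reconcile later
}

fun commitWithClassification(attempts: Int, commit: () -> Unit): CommitResult {
    repeat(attempts) {
        try {
            commit()
            return CommitResult.Committed
        } catch (e: CommitFailedException) {
            // The commit definitely failed; retrying is safe.
        } catch (e: CommitStateUnknownException) {
            // The commit may or may not have succeeded; bail out without cleanup.
            return CommitResult.Unknown
        }
    }
    return CommitResult.FailedCleanly
}
```

With this shape, only `FailedCleanly` results would re-queue the SQS messages, while `Unknown` would trigger an alert and a manual check of the Glue metadata_location, which might have avoided the blind catch-all in our current code.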