
[SUPPORT] Properties file corruption caused by write failure #11835

Open
Ytimetravel opened this issue Aug 27, 2024 · 7 comments
Labels
priority:blocker writer-core Issues relating to core transactions/write actions

Comments

@Ytimetravel
Contributor

Describe the problem you faced
Dear community,
Recently I discovered a case where a write failure can leave the hoodie.properties file corrupted.
Problem site: [screenshot]
This causes other write tasks to fail.
The process in which this situation occurs is as follows:

  1. Executing the commit triggers the maybeDeleteMetadataTable process (if needed). [screenshot]
  2. An exception occurs during that process, causing the hoodie.properties write to fail. [screenshots]

File status: properties corrupted (len=0); properties_backup intact.

  3. The failure then triggers a rollback. [screenshots]
  4. Since the table version cannot be correctly read at this point, the rollback triggers an upgrade from version 0 to 6. [screenshots]

File status: properties corrupted (len=0); properties_backup removed.

  5. The upgrade then attempts to create a new properties_backup file. [screenshots]

I think that when performing recoverIfNeeded we should not only check whether the hoodie.properties file exists; we need more information to verify that the file is actually correct, rather than skipping the file processing and deleting the backup outright.
Any suggestions?
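
To illustrate the point, here is a minimal Java sketch (not Hudi's actual implementation; the helper names and recovery flow are assumptions for illustration) of the kind of validation recoverIfNeeded could perform before trusting hoodie.properties and deleting the backup:

```java
// Minimal sketch, NOT Hudi's actual code: validate hoodie.properties before
// deleting the backup. Helper names and the recovery flow are assumptions.
import java.io.InputStream;
import java.util.Properties;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class PropertiesRecoverySketch {

  // A file that exists but is empty or unparseable (the len=0 case above)
  // should be treated as corrupted, not as "present and fine".
  static boolean isValid(FileSystem fs, Path props) {
    try {
      if (!fs.exists(props) || fs.getFileStatus(props).getLen() == 0) {
        return false;
      }
      Properties p = new Properties();
      try (InputStream in = fs.open(props)) {
        p.load(in);
      }
      // A mandatory key like hoodie.table.name must be present; a stored
      // checksum could be re-verified here as well.
      return p.getProperty("hoodie.table.name") != null;
    } catch (Exception e) {
      return false;
    }
  }

  static void recoverIfNeeded(FileSystem fs, Path props, Path backup) throws Exception {
    if (isValid(fs, props)) {
      // Original verified: only now is it safe to drop the backup.
      fs.delete(backup, false);
    } else if (fs.exists(backup)) {
      // Original is corrupted: restore it from the backup instead of
      // deleting the backup and losing the only good copy.
      fs.delete(props, false);
      FileUtil.copy(fs, backup, fs, props, false, fs.getConf());
    }
  }
}
```

The key difference from the failing flow above is that a zero-length file no longer counts as a valid original, so the backup survives until a verified copy is in place.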

Environment Description

  • Hudi version : 0.14.0

  • Spark version : 2.4

  • Hadoop version : 2.6

  • Storage (HDFS/S3/GCS..) : HDFS

Stacktrace
Caused by: org.apache.hudi.exception.HoodieException: Error updating table configs.
at org.apache.hudi.internal.DataSourceInternalWriterHelper.commit(DataSourceInternalWriterHelper.java:91)
at org.apache.hudi.internal.HoodieDataSourceInternalWriter.commit(HoodieDataSourceInternalWriter.java:91)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:76)
... 69 more
Suppressed: java.lang.IllegalArgumentException: hoodie.table.name property needs to be specified
at org.apache.hudi.common.table.HoodieTableConfig.generateChecksum(HoodieTableConfig.java:523)
at org.apache.hudi.common.table.HoodieTableConfig.getOrderedPropertiesWithTableChecksum(HoodieTableConfig.java:321)
at org.apache.hudi.common.table.HoodieTableConfig.storeProperties(HoodieTableConfig.java:339)
at org.apache.hudi.common.table.HoodieTableConfig.modify(HoodieTableConfig.java:438)
at org.apache.hudi.common.table.HoodieTableConfig.delete(HoodieTableConfig.java:481)
at org.apache.hudi.table.upgrade.UpgradeDowngrade.run(UpgradeDowngrade.java:151)
at org.apache.hudi.client.BaseHoodieWriteClient.tryUpgrade(BaseHoodieWriteClient.java:1399)
at org.apache.hudi.client.BaseHoodieWriteClient.doInitTable(BaseHoodieWriteClient.java:1255)
at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1296)
at org.apache.hudi.client.BaseHoodieWriteClient.rollback(BaseHoodieWriteClient.java:769)
at org.apache.hudi.internal.DataSourceInternalWriterHelper.abort(DataSourceInternalWriterHelper.java:99)
at org.apache.hudi.internal.HoodieDataSourceInternalWriter.abort(HoodieDataSourceInternalWriter.java:96)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:82)
... 69 more
Caused by: org.apache.hudi.exception.HoodieIOException: Error updating table configs.
at org.apache.hudi.common.table.HoodieTableConfig.modify(HoodieTableConfig.java:466)
at org.apache.hudi.common.table.HoodieTableConfig.update(HoodieTableConfig.java:475)
at org.apache.hudi.common.table.HoodieTableConfig.setMetadataPartitionState(HoodieTableConfig.java:816)
at org.apache.hudi.common.table.HoodieTableConfig.clearMetadataPartitions(HoodieTableConfig.java:847)
at org.apache.hudi.metadata.HoodieTableMetadataUtil.deleteMetadataTable(HoodieTableMetadataUtil.java:1396)
at org.apache.hudi.metadata.HoodieTableMetadataUtil.deleteMetadataTable(HoodieTableMetadataUtil.java:275)
at org.apache.hudi.table.HoodieTable.maybeDeleteMetadataTable(HoodieTable.java:995)
at org.apache.hudi.table.HoodieSparkTable.getMetadataWriter(HoodieSparkTable.java:116)
at org.apache.hudi.table.HoodieTable.getMetadataWriter(HoodieTable.java:947)
at org.apache.hudi.client.BaseHoodieWriteClient.writeTableMetadata(BaseHoodieWriteClient.java:359)
at org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:285)
at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:236)
at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:211)
at org.apache.hudi.internal.DataSourceInternalWriterHelper.commit(DataSourceInternalWriterHelper.java:88)
... 71 more
Caused by: java.io.InterruptedIOException: Interrupted while waiting for data to be acknowledged by pipeline
at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:3520)
at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:3498)
at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:3690)
at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:3625)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:80)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:115)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:80)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:115)
at org.apache.hudi.common.fs.SizeAwareFSDataOutputStream.close(SizeAwareFSDataOutputStream.java:75)
at org.apache.hudi.common.table.HoodieTableConfig.modify(HoodieTableConfig.java:449)
... 84 more

@danny0405
Contributor

The update to the properties file should be atomic, and we already do that for HoodieTableConfig.modify; it just throws to the writer if any exception happens, while the reader would still work by reading the backup file.

we need more information to ensure that the hoodie.properties file is correct, rather than directly skipping file processing and deleting the backup file.

+1 for this; we need to strengthen the handling of properties file exceptions for the invoker.
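
For context, here is a rough sketch of the backup-then-rewrite protocol being described (simplified; the real HoodieTableConfig.modify has more steps, and the method shape here is an assumption):

```java
// Simplified sketch of a backup-then-rewrite update, NOT Hudi's actual code.
import java.io.OutputStream;
import java.util.Properties;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class TableConfigModifySketch {
  static void modify(FileSystem fs, Path props, Path backup, Properties updated)
      throws Exception {
    // 1. Snapshot the current file to the backup location.
    FileUtil.copy(fs, props, fs, backup, false, fs.getConf());
    // 2. Rewrite the original in place. This is the window where a stream
    //    failure (the HDFS InterruptedIOException in the stacktrace) can
    //    leave a truncated or zero-length file behind.
    try (OutputStream out = fs.create(props, true)) {
      updated.store(out, "updated by writer");
    }
    // 3. Drop the backup only after the rewrite succeeded.
    fs.delete(backup, false);
  }
}
```

As the stacktrace shows, the interruption hit step 2, so the original was left at len=0 while the backup stayed intact; the real damage came from the later recovery step deleting that backup.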

@Ytimetravel
Contributor Author

@danny0405
My current understanding is as follows:

  1. The properties_backup is a copy of the original properties file.
  2. The expected outcome is that the original properties file is identical to properties_backup.

Can we check whether the original properties file is error-free by comparing file sizes?

@danny0405
Contributor

Can we check whether the original properties file is error-free by comparing file sizes?

We have a checksum in the properties file.
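
For reference, a sketch of what checksum validation could look like (the exact algorithm behind hoodie.table.checksum in HoodieTableConfig.generateChecksum may differ; CRC32 over the sorted entries is an assumption here):

```java
// Hypothetical checksum validation; NOT Hudi's actual algorithm.
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import java.util.zip.CRC32;

public class ChecksumSketch {
  static final String CHECKSUM_KEY = "hoodie.table.checksum";

  // True only when the stored checksum matches a recomputation over every
  // other entry. A zero-length or truncated file fails trivially, because
  // the checksum key (or the keys it covers) will be missing or mismatched.
  static boolean checksumMatches(Properties props) {
    String stored = props.getProperty(CHECKSUM_KEY);
    if (stored == null) {
      return false;
    }
    CRC32 crc = new CRC32();
    for (Map.Entry<Object, Object> e : new TreeMap<>(props).entrySet()) {
      if (!CHECKSUM_KEY.equals(e.getKey())) {
        String line = e.getKey() + "=" + e.getValue();
        crc.update(line.getBytes(StandardCharsets.UTF_8));
      }
    }
    return stored.equals(String.valueOf(crc.getValue()));
  }
}
```

This is stronger than a size comparison: two files can have the same length while one is silently truncated or altered.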

@ad1happy2go ad1happy2go added the writer-core Issues relating to core transactions/write actions label Aug 29, 2024
@github-project-automation github-project-automation bot moved this to ⏳ Awaiting Triage in Hudi Issue Support Aug 29, 2024
@Ytimetravel
Contributor Author

@danny0405 Sounds good. Can I optimize the decision-making process here?

@danny0405
Contributor

Sure, would be glad to review your fix.

@ad1happy2go
Collaborator

@Ytimetravel Did you get a chance to work on this? Do we have a JIRA for it?

@nsivabalan
Contributor

Sorry, I am not sure I fully understand how exactly we got into the corrupted state.

From what I see, createMetaClient(true) fails. But if we chase the chain of calls, it ends up in

public static TypedProperties fetchConfigs(

which actually accounts for reading from either the backup or the original properties file.

Can you help me understand a bit more?
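
For the flow being referenced, here is a rough sketch of that fallback read (names and shape simplified; the real fetchConfigs differs):

```java
// Simplified fallback read; NOT Hudi's actual fetchConfigs.
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FetchConfigsSketch {
  // Try the original first, then the backup. The failure mode reported in
  // this issue is that step 4's recovery deleted the backup while the
  // original was already zero-length, so by the time this fallback runs
  // there is no readable candidate left.
  static Properties fetchConfigs(FileSystem fs, Path props, Path backup) throws IOException {
    for (Path candidate : new Path[] {props, backup}) {
      try {
        if (fs.exists(candidate) && fs.getFileStatus(candidate).getLen() > 0) {
          Properties p = new Properties();
          try (InputStream in = fs.open(candidate)) {
            p.load(in);
          }
          return p;
        }
      } catch (IOException ignored) {
        // Fall through to the next candidate.
      }
    }
    throw new IOException("Neither hoodie.properties nor its backup is readable");
  }
}
```

So the fallback itself is sound, but it only helps while the backup still exists; per the report above, the earlier recovery step removed the backup before a valid original was restored.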

@ad1happy2go ad1happy2go moved this from ⏳ Awaiting Triage to 👤 User Action in Hudi Issue Support Nov 9, 2024