-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delta Migration: Handle delta lake's automatic log clean and VACUUM #6880
Delta Migration: Handle delta lake's automatic log clean and VACUUM #6880
Conversation
delta-lake/src/integration/java/org/apache/iceberg/delta/TestSnapshotDeltaLakeTable.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the fix, overall looks good to me, just nit for test
This seems needed as a fix for Delta conversion to work properly, because most users run VACUUM against the table. Mark it as required for 1.2 release |
delta-lake/src/main/java/org/apache/iceberg/delta/BaseSnapshotDeltaLakeTableAction.java
Show resolved
Hide resolved
appendFiles.commit(); | ||
|
||
return constructableStartVersion; | ||
} catch (NotFoundException | IllegalArgumentException | DeltaStandaloneException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to confirm we don't need to explicitly handle IOException
here for when the file isn't found, we'll get a NotFoundException
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for pointing this out. The HadoopFileIO
re-throw FileNotFoundExceptiion
as NotFoundException
and general IOException
as RuntimeIOException
.
iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopInputFile.java
Lines 159 to 170 in b5a31a1
private FileStatus lazyStat() { | |
if (stat == null) { | |
try { | |
this.stat = fs.getFileStatus(path); | |
} catch (FileNotFoundException e) { | |
throw new NotFoundException(e, "File does not exist: %s", path); | |
} catch (IOException e) { | |
throw new RuntimeIOException(e, "Failed to get status for file: %s", path); | |
} | |
} | |
return stat; | |
} |
In this case, I think we only need to handle the NotFoundException
since VACUUM
may delete the data file. I've added a code block in buildDataFileFromAction
to explicitly check file existence and throw NotFoundException
if necessary.
Please let me know if you have any other concern here.
delta-lake/src/main/java/org/apache/iceberg/delta/BaseSnapshotDeltaLakeTableAction.java
Show resolved
Hide resolved
delta-lake/src/integration/java/org/apache/iceberg/delta/TestSnapshotDeltaLakeTable.java
Outdated
Show resolved
Hide resolved
delta-lake/src/integration/java/org/apache/iceberg/delta/TestSnapshotDeltaLakeTable.java
Outdated
Show resolved
Hide resolved
delta-lake/src/integration/java/org/apache/iceberg/delta/TestSnapshotDeltaLakeTable.java
Show resolved
Hide resolved
Thanks @jackye1995, @amogh-jahagirdar for reviewing this. I added the file existence check and refactored the integration tests based on your suggestions. @amogh-jahagirdar May I ask if you could review this PR again? Thank you in advance for your help! |
delta-lake/src/integration/java/org/apache/iceberg/delta/TestSnapshotDeltaLakeTable.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me!
Merging, thanks for the fix @JonasJ-ap , and thanks for the review @yyanyy and @amogh-jahagirdar ! |
Issue Fixed
This PR fixes #6781 .
The automatic log clean whenever a new checkpoint is created and the manual data file clean by
VACUUM
sometimes lead to inconsistency between the first constructable delta lake version and the earliest delta lake version in the table. The inconsistency will cause file not found errors during the migration.This PR fix the issue by forcing the snapshot delta lake action to start from the first constructable delta lake version instead of the earliest version.
Test
Added Integration tests to mock the file clean scenario.