-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Compacted position delete files should use the max data sequence number of source files #7651
Core: Compacted position delete files should use the max data sequence number of source files #7651
Conversation
…e number of source files
* @param dataSequenceNumber data sequence number to append on the file | ||
* @return this for method chaining | ||
*/ | ||
default RewriteFiles addFile(DeleteFile deleteFile, long dataSequenceNumber) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My original thought was to use DeleteFile.dataSequenceNumber()
if set (could be done by extending the builder) but this approach can be a bit more reliable as it is explicit. Question: is there any case when the added new delete file data sequence number can be wrong?
Any thoughts, @rdblue @RussellSpitzer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we consume it directly from the file, it would mean this would apply for appends too, which I am not sure a bad thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, we don't have a way to set the sequence number in the public API. I think that's a good thing because we don't want people to think that this is something they can generally control. So I do prefer setting this here.
However, for the implementation in BaseRewriteDataFiles
I think it would make sense to set the sequence number and avoid needing the DeleteFileHolder
. Not a big deal either way, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, we don't have a way to set the sequence number in the public API.
That's can be easily changed by extending TableMetadata$Builder
. I am not a big fan of DeleteFileHolder
and I do think places that actually set data sequence numbers (like rewrite position deletes) would benefit from passing it as part of delete file object. That said, I am concerned about edge cases when the passed object may contain a wrong data sequence number. Are there use cases we can think of? Like cherry-picks maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, cherry picking seems like a case where we'd have the wrong sequence number. The good news is that we don't allow cherry picking unless the commit is an append or a dynamic partition overwrite. Otherwise we lose too much information.
I'm also a bit concerned about letting sequence number on the file be used. I'd probably opt to ignore it and have an explicit sequence number in the API like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, let's stick with a safer option. Let me review the rest of the PR.
private final Long dataSequenceNumber; | ||
|
||
/** | ||
* Queue a delete file for commit with a given data sequence number |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Not sure if queue
is the right word here, it's a bit redundant considering the class name but could we just use the word Holds
or Contains
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, changed the word.
...ark/src/test/java/org/apache/iceberg/spark/actions/TestRewritePositionDeleteFilesAction.java
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/actions/RewritePositionDeletesGroup.java
Outdated
Show resolved
Hide resolved
@@ -246,12 +246,21 @@ protected void add(DataFile file) { | |||
|
|||
/** Add a delete file to the new snapshot. */ | |||
protected void add(DeleteFile file) { | |||
Preconditions.checkNotNull(file, "Invalid delete file: null"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove this check? Should this add it back in add(DeleteFileHolder)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check moved to the DeleteFileHolder
constructor, but is present in only one of two constructors now. It would be more explicit in add
, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea initially tried to move it , but did it incomplete. Added back to original location.
* | ||
* <p>This rewrite operation may change the size or layout of the delete files. When applicable, | ||
* it is also recommended to discard delete records for files that are no longer part of the table | ||
* state. However, the set of applicable delete records must never change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a note explaining that when rewriting position delete files, the sequence number of the new file must be the max sequence number of the original files? We may want to note that rewriting equality deletes that belong to different data sequence numbers is not allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a paragraph below.
@@ -55,12 +55,13 @@ public void commit(Set<RewritePositionDeletesGroup> fileGroups) { | |||
RewriteFiles rewriteFiles = table.newRewrite().validateFromSnapshot(startingSnapshotId); | |||
|
|||
for (RewritePositionDeletesGroup group : fileGroups) { | |||
long maxSequenceNumber = group.maxRewrittenDataSequenceNumber(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about either calling this maxRewrittenDataSequenceNumber
and adding an empty line before the for loop below or using group.maxRewrittenDataSequenceNumber()
directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, use the call directly
core/src/main/java/org/apache/iceberg/actions/RewritePositionDeletesGroup.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I had a few minor comments only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @szehon-ho !
Test fail looks unrelated |
Thanks @rdblue @aokolnychyi @amogh-jahagirdar for reviews! |
Fixes: #7624