-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Add dataSequenceNumber to ManifestEntry #5913
Conversation
@@ -126,7 +126,7 @@ static Schema schema() { | |||
/** Returns the sequence number of the commit that added the manifest file. */ | |||
long sequenceNumber(); | |||
|
|||
/** Returns the lowest sequence number of any data file in the manifest. */ | |||
/** Returns the lowest data sequence number of any live file in the manifest. */ | |||
long minSequenceNumber(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is what was happening before as entries with status DELETED had null as their sequence numbers, which made the writer ignore such entries while computing the min sequence number.
No behavior change, just updating the comment.
@Override | ||
public Long sequenceNumber() { | ||
return sequenceNumber; | ||
if (sequenceNumber != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kept both fields for now as sequenceNumber
may be set explicitly during inheritance (DELETEd only).
* | ||
* @deprecated since 1.0.0, will be removed in 1.1.0; use {@link #dataSequenceNumber()} instead. | ||
*/ | ||
@Deprecated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am debating whether we should deprecate this method or simply remove (prior to 1.0). It is core, not API. It is actually pretty dangerous to use this method as it behaves inconsistently.
If needed, we may add a new method to return the sequence number of the manifest entry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer removing it.
@@ -368,8 +368,9 @@ private boolean manifestHasDeletedFiles( | |||
deletePaths.contains(file.path()) | |||
|| dropPartitions.contains(file.specId(), file.partition()) | |||
|| (isDelete | |||
&& entry.sequenceNumber() > 0 | |||
&& entry.sequenceNumber() < minSequenceNumber); | |||
&& entry.isLive() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Attention! I think a delete manifest has files to delete ONLY if the entry status is ADDED or EXISTING. Also, we can't use entry.dataSequenceNumber()
otherwise as it may return null for old V2 manifests with DELETED entries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me.
&& (minSequenceNumber == null || entry.sequenceNumber() < minSequenceNumber)) { | ||
this.minSequenceNumber = entry.sequenceNumber(); | ||
|
||
if (entry.isLive() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Attention! I think there is no behavior change as sequenceNumber()
was returning null for DELETEd entries.
See here for details.
Preconditions.checkState( | ||
wrapped.snapshotId() == null || wrapped.snapshotId().equals(commitSnapshotId), | ||
"Found unassigned sequence number for an entry from snapshot: %s", | ||
wrapped.snapshotId()); | ||
|
||
// inheritance should work only for ADDED entries | ||
Preconditions.checkState( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is new validation. Any cases when it is not safe to validate this? This class is used during writes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds good to me.
@@ -243,7 +243,7 @@ public void testValidateDataFilesExistFromSnapshot() { | |||
// manifest with FILE_A deleted | |||
validateManifest( | |||
snap.dataManifests(table.io()).get(1), | |||
seqs(2, 1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These test changes reflect the new behavior for DELETED entries.
// in v2 tables, the data sequence number should be inherited iff the entry status is ADDED | ||
if (manifestEntry.dataSequenceNumber() == null | ||
&& (sequenceNumber == 0 || manifestEntry.status() == ManifestEntry.Status.ADDED)) { | ||
manifestEntry.setDataSequenceNumber(sequenceNumber); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
this.sequenceNumber = toCopy.sequenceNumber; | ||
this.file = toCopy.file().copy(fullCopy); | ||
} | ||
|
||
ManifestEntry<F> wrapExisting(Long newSnapshotId, Long newSequenceNumber, F newFile) { | ||
ManifestEntry<F> wrapExisting(Long newSnapshotId, Long newDataSequenceNumber, F newFile) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: it would be fewer changes if we didn't rename newSequenceNumber
to newDataSequenceNumber
but I can see why you'd want to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me think. I wanted to be explicit in this case as we will also add fileSequenceNumber
next.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to go with the rename as we have two variables now and I plan to add fileSequenceNumber
next.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aokolnychyi , What class do you plan to add fileSequenceNumber
? DataFile
and DeleteFile
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chenjunjiedada, I'd say we add it to ManifestEntry first (to store as part of manifest entry) and then we can propagate it via ContentFile
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aokolnychyi, Thanks for the explanation, that will be very helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chenjunjiedada, I recall you had a PR for propagating seq numbers to ContentFile
. I'll reach out once this part is done so that you can rebase your work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aokolnychyi , Yeap, actually I have two PRs regarding sequence number, will rebase them when you finish these.
core/src/main/java/org/apache/iceberg/GenericManifestEntry.java
Outdated
Show resolved
Hide resolved
Overall, this looks good to me. I think that the removed method should be kept in favor of one that throws UnsupportedOperationException though. |
Thanks for reviewing, @rdblue! |
This PR fixes the behavior of
sequence_number
inManifestEntry
to always represent the data sequence number.