Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Add dataSequenceNumber to ManifestEntry #5913

Merged
merged 3 commits into from
Oct 9, 2022

Conversation

aokolnychyi
Copy link
Contributor

This PR fixes the behavior of sequence_number in ManifestEntry to always represent the data sequence number.

@@ -126,7 +126,7 @@ static Schema schema() {
/** Returns the sequence number of the commit that added the manifest file. */
long sequenceNumber();

/** Returns the lowest sequence number of any data file in the manifest. */
/** Returns the lowest data sequence number of any live file in the manifest. */
long minSequenceNumber();
Copy link
Contributor Author

@aokolnychyi aokolnychyi Oct 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is what was happening before as entries with status DELETED had null as their sequence numbers, which made the writer ignore such entries while computing the min sequence number.

No behavior change, just updating the comment.

@Override
public Long sequenceNumber() {
return sequenceNumber;
if (sequenceNumber != null) {
Copy link
Contributor Author

@aokolnychyi aokolnychyi Oct 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept both fields for now as sequenceNumber may be set explicitly during inheritance (DELETEd only).

*
* @deprecated since 1.0.0, will be removed in 1.1.0; use {@link #dataSequenceNumber()} instead.
*/
@Deprecated
Copy link
Contributor Author

@aokolnychyi aokolnychyi Oct 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am debating whether we should deprecate this method or simply remove (prior to 1.0). It is core, not API. It is actually pretty dangerous to use this method as it behaves inconsistently.

If needed, we may add a new method to return the sequence number of the manifest entry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer removing it.

@@ -368,8 +368,9 @@ private boolean manifestHasDeletedFiles(
deletePaths.contains(file.path())
|| dropPartitions.contains(file.specId(), file.partition())
|| (isDelete
&& entry.sequenceNumber() > 0
&& entry.sequenceNumber() < minSequenceNumber);
&& entry.isLive()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attention! I think a delete manifest has files to delete ONLY if the entry status is ADDED or EXISTING. Also, we can't use entry.dataSequenceNumber() otherwise as it may return null for old V2 manifests with DELETED entries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.

&& (minSequenceNumber == null || entry.sequenceNumber() < minSequenceNumber)) {
this.minSequenceNumber = entry.sequenceNumber();

if (entry.isLive()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attention! I think there is no behavior change as sequenceNumber() was returning null for DELETEd entries.
See here for details.

Preconditions.checkState(
wrapped.snapshotId() == null || wrapped.snapshotId().equals(commitSnapshotId),
"Found unassigned sequence number for an entry from snapshot: %s",
wrapped.snapshotId());

// inheritance should work only for ADDED entries
Preconditions.checkState(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is new validation. Any cases when it is not safe to validate this? This class is used during writes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds good to me.

@@ -243,7 +243,7 @@ public void testValidateDataFilesExistFromSnapshot() {
// manifest with FILE_A deleted
validateManifest(
snap.dataManifests(table.io()).get(1),
seqs(2, 1),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These test changes reflect the new behavior for DELETED entries.

// in v2 tables, the data sequence number should be inherited iff the entry status is ADDED
if (manifestEntry.dataSequenceNumber() == null
&& (sequenceNumber == 0 || manifestEntry.status() == ManifestEntry.Status.ADDED)) {
manifestEntry.setDataSequenceNumber(sequenceNumber);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

this.sequenceNumber = toCopy.sequenceNumber;
this.file = toCopy.file().copy(fullCopy);
}

ManifestEntry<F> wrapExisting(Long newSnapshotId, Long newSequenceNumber, F newFile) {
ManifestEntry<F> wrapExisting(Long newSnapshotId, Long newDataSequenceNumber, F newFile) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: it would be fewer changes if we didn't rename newSequenceNumber to newDataSequenceNumber but I can see why you'd want to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me think. I wanted to be explicit in this case as we will also add fileSequenceNumber next.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to go with the rename as we have two variables now and I plan to add fileSequenceNumber next.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aokolnychyi , What class do you plan to add fileSequenceNumber? DataFile and DeleteFile?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenjunjiedada, I'd say we add it to ManifestEntry first (to store as part of manifest entry) and then we can propagate it via ContentFile.

Copy link
Collaborator

@chenjunjiedada chenjunjiedada Oct 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aokolnychyi, Thanks for the explanation, that will be very helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenjunjiedada, I recall you had a PR for propagating seq numbers to ContentFile. I'll reach out once this part is done so that you can rebase your work.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aokolnychyi , Yeap, actually I have two PRs regarding sequence number, will rebase them when you finish these.

@rdblue
Copy link
Contributor

rdblue commented Oct 6, 2022

Overall, this looks good to me. I think that the removed method should be kept in favor of one that throws UnsupportedOperationException though.

@aokolnychyi aokolnychyi merged commit 5e5c6d1 into apache:master Oct 9, 2022
@aokolnychyi
Copy link
Contributor Author

Thanks for reviewing, @rdblue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants