Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add position in manifest to DataFile and DeleteFile #1723

Merged
merged 5 commits into from
Nov 10, 2020

Conversation

rdblue
Copy link
Contributor

@rdblue rdblue commented Nov 5, 2020

This updates ManifestReader to project the file position metadata column so that DataFile and DeleteFile objects read from manifests include the position within the manifest. This is in support of streaming reads, where the data in a table is turned into a stream of snapshots, then new manifests, then data files. The position of each file in a manifest can be encoded as a recoverable offset.

@rdblue
Copy link
Contributor Author

rdblue commented Nov 5, 2020

FYI @jerryshao and @HeartSaVioR. Since we can project the position of a record in a file, I added this to metadata so that we can use it in streaming offsets. Please have a look, I think it should avoid needing an extra boolean in the offsets.

@HeartSaVioR
Copy link
Contributor

I'm not familiar with the details on the code, but it looks OK in general. This would help us to read a manifest file incrementally across multiple batches without a custom representation.

@HeartSaVioR
Copy link
Contributor

HeartSaVioR commented Nov 5, 2020

Looks like the CI failure is related:

/home/travis/build/apache/iceberg/spark/src/main/java/org/apache/iceberg/spark/SparkDataFile.java:35: error: SparkDataFile is not abstract and does not override abstract method pos() in ContentFile

@github-actions github-actions bot added the spark label Nov 6, 2020
Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

/**
* Returns the ordinal position of the file in a manifest, or null if it was not read from a manifest.
*/
Long pos();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq: do we want to include anything in the name to indicate that it is a position in the manifest?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be open to that, but I couldn't come up with a very good name for this field so I just used "pos" because we use that elsewhere to indicate position in a file.

@aokolnychyi
Copy link
Contributor

Will this field also appear in the metadata table output?

@rdblue
Copy link
Contributor Author

rdblue commented Nov 10, 2020

Will this field also appear in the metadata table output?

No, it won't. This change only adds pos to files that are read by ManifestReader, and it doesn't change any of the read or write schemas outside of that.

@rdblue
Copy link
Contributor Author

rdblue commented Nov 10, 2020

Thanks for reviewing, @HeartSaVioR and @aokolnychyi! I'll merge this.

@rdblue rdblue merged commit 7c3b0d7 into apache:master Nov 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants