-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API, Core: Add manifestLocation API to ContentFile #11044
API, Core: Add manifestLocation API to ContentFile #11044
Conversation
72323a7
to
2d88443
Compare
core/src/main/java/org/apache/iceberg/InheritableMetadataFactory.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think going via InheritableMetadata
is the easiest option.
I only have a question about CharSequence
, will check the tests tomorrow.
* Returns the path of the manifest which this file is referenced in or null if it was not read | ||
* from a manifest. | ||
*/ | ||
default CharSequence manifestPath() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we put this method before pos()
so both are co-located?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to discuss using CharSequence
for paths in our APIs. I am not sure this approach is working out very well. I believe the idea was to expose Avro Utf8
directly. The problem is that Utf8
is not serializable. Therefore, we always convert Utf8
to String
. Working with CharSequence
adds a lot of complexity for no performance benefit. Puffin and a few other places use String
for paths.
@rdblue, is my assumption about CharSequence
correct? What are your thoughts now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, path()
in ManifestFile
returns String
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be fine to use String here and always expect to convert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to use String. I also published #11092 to deprecate the path API which returns CharSequence, in favor of a location
API which returns String. This removal would happen in 2.0
core/src/main/java/org/apache/iceberg/InheritableMetadataFactory.java
Outdated
Show resolved
Hide resolved
I took another look. I am on board. My high-level comments are below.
|
Great points on these, will run distributed planning benchmarks to make sure there's not a regression from adding this. |
b923d60
to
494339a
Compare
@aokolnychyi I ran the plan benchmarks, and added to the PR description. there's no significant difference in planning before and after.
For tests, we actually already have SparkDistributedDataScanTestBase which tests different combinations of local/distributed for data file and delete file planning. We have TestSparkDistributedDataScanJavaSerialization and TestSparkDistributedDataScanKryoSerialization for testing java/kryo serialization respectively. So I think we're effectively getting coverage in different planning + serialization combinations for the new code path. Please let me know if you think there's a coverage gap though! |
core/src/main/java/org/apache/iceberg/InheritableMetadataFactory.java
Outdated
Show resolved
Hide resolved
494339a
to
32145d7
Compare
32145d7
to
499011d
Compare
core/src/test/java/org/apache/iceberg/DataTableScanTestBase.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one nit in the tests. Thanks, @amogh-jahagirdar!
Verified TestSparkReaderDeletes#testMultiplePosDeleteFiles() failure is unrelated (ran locally a few times with my changes and it passed). The test may be flaky. I'll push another change to address the nit and we can verify if it's flaky.
|
…path to a manifest from which the content file resides in
499011d
to
bbf159b
Compare
This time |
I'm going to go ahead and merge. Thanks for reviewing @aokolnychyi @rdblue! |
This change adds a manifestPath API to ContentFile which will return the path to a manifest from which the content file resides in. If the data or delete file was not read from a manifest, the API returns null. The planned use case for this API is to enable a more optimized determination of which manifests need to be rewritten during commits.
Note: To be very clear this is not a spec change, this is an additional read API on ContentFile which surfaces the manifest path on reading of the manifest entry
Benchmark results prior to the changes:
Benchmark results after the changes: