Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API, Core: Add manifestLocation API to ContentFile #11044

Merged

Conversation

amogh-jahagirdar
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar commented Aug 29, 2024

This change adds a manifestPath API to ContentFile which will return the path to a manifest from which the content file resides in. If the data or delete file was not read from a manifest, the API returns null. The planned use case for this API is to enable a more optimized determination of which manifests need to be rewritten during commits.

Note: To be very clear this is not a spec change, this is an additional read API on ContentFile which surfaces the manifest path on reading of the manifest entry

Benchmark results prior to the changes:

Benchmark                                                                           Mode  Cnt   Score   Error  Units
PlanningBenchmark.distributedPlanningWithMinMaxFilter                                 ss    5   1 ± 0.160   s/op
PlanningBenchmark.localPlanningWithMinMaxFilter                                       ss    5   1 ± 0.028   s/op
PlanningBenchmark.distributedPlanningWithPartitionAndMinMaxFilter                     ss    5   0.129 ± 0.034   s/op
PlanningBenchmark.localPlanningWithPartitionAndMinMaxFilter                           ss    5   0.134 ± 0.036   s/op
PlanningBenchmark.distributedPlanningWithoutFilter                                    ss    5   2 ± 0.509   s/op
PlanningBenchmark.localPlanningWithoutFilter                                          ss    5   3 ± 0.152   s/op
PlanningBenchmark.localDataDistributedDeletesPlanningWithoutFilterWithStats           ss    5   10 ± 0.501   s/op
PlanningBenchmark.localPlanningViaDistributedScanWithoutFilterWithStats               ss    5   10 ± 0.862   s/op
PlanningBenchmark.localPlanningWithoutFilterWithStats                                 ss    5   10 ± 0.520   s/op

Benchmark results after the changes:

Benchmark                                                                           Mode  Cnt   Score   Error  Units
PlanningBenchmark.distributedPlanningWithMinMaxFilter                                 ss    5   1 ± 0.041   s/op
PlanningBenchmark.localPlanningWithMinMaxFilter                                       ss    5   1 ± 0.049   s/op
PlanningBenchmark.distributedPlanningWithPartitionAndMinMaxFilter                     ss    5   0.129 ± 0.039   s/op
PlanningBenchmark.localPlanningWithPartitionAndMinMaxFilter                           ss    5   0.126 ± 0.046   s/op
PlanningBenchmark.distributedPlanningWithoutFilter                                    ss    5   2 ± 0.654   s/op
PlanningBenchmark.localPlanningWithoutFilter                                          ss    5   3 ± 0.184   s/op
PlanningBenchmark.localDataDistributedDeletesPlanningWithoutFilterWithStats           ss    5   9 ± 0.954   s/op
PlanningBenchmark.localPlanningViaDistributedScanWithoutFilterWithStats               ss    5   10 ± 0.689   s/op
PlanningBenchmark.localPlanningWithoutFilterWithStats                                 ss    5   10 ± 0.901   s/op

@amogh-jahagirdar amogh-jahagirdar force-pushed the manifest-reference-content-file branch from 72323a7 to 2d88443 Compare August 29, 2024 19:21
@amogh-jahagirdar amogh-jahagirdar marked this pull request as ready for review August 29, 2024 19:21
Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think going via InheritableMetadata is the easiest option.

I only have a question about CharSequence, will check the tests tomorrow.

* Returns the path of the manifest which this file is referenced in or null if it was not read
* from a manifest.
*/
default CharSequence manifestPath() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we put this method before pos() so both are co-located?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to discuss using CharSequence for paths in our APIs. I am not sure this approach is working out very well. I believe the idea was to expose Avro Utf8 directly. The problem is that Utf8 is not serializable. Therefore, we always convert Utf8 to String. Working with CharSequence adds a lot of complexity for no performance benefit. Puffin and a few other places use String for paths.

@rdblue, is my assumption about CharSequence correct? What are your thoughts now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, path() in ManifestFile returns String.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be fine to use String here and always expect to convert.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use String. I also published #11092 to deprecate the path API which returns CharSequence, in favor of a location API which returns String. This removal would happen in 2.0

@aokolnychyi
Copy link
Contributor

aokolnychyi commented Sep 6, 2024

I took another look. I am on board. My high-level comments are below.

  • Use String instead of CharSequence.
  • Co-locate pos() and manifestLocation() in ContentFile.
  • Extend DataTableScanTestBase or whatever needed to cover distributed planning.
  • Run benchmarks for distributed planning to see the impact of serializing the manifest location.

@amogh-jahagirdar
Copy link
Contributor Author

Extend DataTableScanTestBase or whatever needed to cover distributed planning.
Run benchmarks for distributed planning to see the impact of serializing the manifest location.

Great points on these, will run distributed planning benchmarks to make sure there's not a regression from adding this.

@amogh-jahagirdar amogh-jahagirdar force-pushed the manifest-reference-content-file branch 5 times, most recently from b923d60 to 494339a Compare September 9, 2024 23:48
@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Sep 9, 2024

@aokolnychyi I ran the plan benchmarks, and added to the PR description. there's no significant difference in planning before and after.

Extend DataTableScanTestBase or whatever needed to cover distributed planning.

For tests, we actually already have SparkDistributedDataScanTestBase which tests different combinations of local/distributed for data file and delete file planning. We have TestSparkDistributedDataScanJavaSerialization and TestSparkDistributedDataScanKryoSerialization for testing java/kryo serialization respectively. So I think we're effectively getting coverage in different planning + serialization combinations for the new code path. Please let me know if you think there's a coverage gap though!

@amogh-jahagirdar amogh-jahagirdar changed the title API, Core: Add manifestPath API to ContentFile which will return the path to a manifest from which the content file resides in API, Core: Add manifestLocation API to ContentFile which will return the path to a manifest from which the content file resides in Sep 11, 2024
@amogh-jahagirdar amogh-jahagirdar force-pushed the manifest-reference-content-file branch from 494339a to 32145d7 Compare September 11, 2024 15:18
@amogh-jahagirdar amogh-jahagirdar force-pushed the manifest-reference-content-file branch from 32145d7 to 499011d Compare September 12, 2024 19:37
Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one nit in the tests. Thanks, @amogh-jahagirdar!

@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Sep 12, 2024

Verified TestSparkReaderDeletes#testMultiplePosDeleteFiles() failure is unrelated (ran locally a few times with my changes and it passed). The test may be flaky. I'll push another change to address the nit and we can verify if it's flaky.

TestSparkReaderDeletes > testMultiplePosDeleteFiles() > format = orc, vectorized = false, planningMode = DISTRIBUTED FAILED
    java.lang.AssertionError: [Metric values were not finalized] 
    Expecting actual not to be null
        at org.apache.iceberg.spark.source.SparkSQLExecutionHelper.lastExecutedMetricValue(SparkSQLExecutionHelper.java:63)
        at org.apache.iceberg.spark.source.TestSparkReaderDeletes.deleteCount(TestSparkReaderDeletes.java:205)

…path to a manifest from which the content file resides in
@amogh-jahagirdar amogh-jahagirdar force-pushed the manifest-reference-content-file branch from 499011d to bbf159b Compare September 12, 2024 23:49
@amogh-jahagirdar
Copy link
Contributor Author

This time TestDataFrameWrites > testFaultToleranceOnWrite() failed which is known to be flaky: #10811

@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Sep 13, 2024

I'm going to go ahead and merge. Thanks for reviewing @aokolnychyi @rdblue!

@amogh-jahagirdar amogh-jahagirdar changed the title API, Core: Add manifestLocation API to ContentFile which will return the path to a manifest from which the content file resides in API, Core: Add manifestPath API to ContentFile Sep 13, 2024
@amogh-jahagirdar amogh-jahagirdar changed the title API, Core: Add manifestPath API to ContentFile API, Core: Add manifestLocation API to ContentFile Sep 13, 2024
@amogh-jahagirdar amogh-jahagirdar merged commit 7991206 into apache:main Sep 13, 2024
47 checks passed
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants