-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Metadata table queries fail if a partition column was reused in V2 #4662
Conversation
I still need to ramp a bit on this, but I thought in that discussion we are going to collapse all of the same partition fields (if some are re-used) and they are uniquely identified with specid? Ie, just "data" and "id". example for reference:
I am wondering, does the code try to re-use old partition values in both metadata (ManifestEvaluator) and data filtering? I thought it does, but need to check, let me know if assumption is incorrect. Example here, we would hopefully use "data" partition values written by the table's initial spec, for queries after re-adding "data" to the spec. So then it may make sense to collapse the fields together in the partitions/metadata table? Welcome any thoughts from @rdblue , @aokolnychyi who have more background in the initial discussion. |
Hey @szehon-ho this was just intended to be a quick fix so we don't throw exceptions for the case you mentioned too.
Perhaps if we could rework the re-addition of "data" column to spec so that no new field ID is generated for it? I'm not sure if this will cause other issues though.. |
Yea I understand, my gist of the earlier conversation was that the most optimal solution is to try to re-use the field-id if its the equivalent transformation as a partition field before? I think the thing to check would be whether the regular metadata filtering is affected or not (today it seems to work, via logic in Projection filter being able to resolve it) |
@szehon-ho I think we were probably talking about the same thing for generating no new ID for a PartitionField that was already seen. I have amended my change based on this. I looked into this question in some detail and have found that this is actually kind of already happening when we are removing part of the PartitionFields:
Looking at the tests this should work, there's only one failure I get now in
This made me think though.. it seems like we have two different ways of generating the partition field column name: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change looks like it's the right direction for me.
Is there any way to validate the data scan is correct filtering out the right files in this scenario (when we remove and add back partition filter), maybe by adding a test in "TestDataTableScan"?
In regards to your question in TestMetadataTableScans.testFilesTableScanWithDroppedPartition, yes I agree the former seems the correct name to me and seems we should try to align it, but I don't really have much context and maybe @aokolnychyi , @rdblue or @RussellSpitzer have more thoughts about that.
core/src/main/java/org/apache/iceberg/BaseUpdatePartitionSpec.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/BaseUpdatePartitionSpec.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks better to me, just some minor comments and some test failures.
@aokolnychyi , @rdblue or @RussellSpitzer could you please share your opinion on #4662 (review) aka on the discrepancy in generating PartitionField names? |
… V2 - the collapse method Change-Id: I91808def7cea2005e47b57d3033298afef26ca32
Change-Id: Ibba90e196e43fbde998d95ad02785f341c77f7ee
Change-Id: If25577df7655a13e21f4d6dd983b1b6c41601a2a
Change-Id: Ib70d0cefbce629e07d670a2f7876b5f638d87963
…leScans Change-Id: I9b3e0cc02b728986607481de3b9bda1d7513b2c3
Since the decision on #4868 is to stick with the old naming convention of default partition names, I think this is the best solution here. Please call out if you disagree, otherwise I will merge this soon. Thanks, |
Thanks for the fix @szlta and the @szehon-ho and @rdblue for the review! |
First go at fixing #4661.
The proposed solution described at #3411 (comment) seems very hardly feasible, so this would be a quick fix to just avoid
Invalid schema
errors thrown in any metadata query where the a partition col was reused.