Parquet: Update parquet to 1.13.1 #7301
Conversation
Should we also revert #5681 ?
IMHO we should; I was thinking the same, but in a separate commit, after validating that the issue stays fixed even without that change. At present this upgrade takes a dependency on Hadoop 3.2, and interestingly some of the BloomFilter unit tests fail as well. I'm taking a look into that too.
There is a dependency on Hadoop 3.2; I see there is a PR #5024 for the same, so I'll wait for it to get in.
I've created a PR to support Hadoop 2.7.3 as well: apache/parquet-java#1084
Looks like this commit made the bloom tests on the Iceberg side fail.
The PR: apache/parquet-java#1033
This is an interesting find @Fokko. This implies that even when we enable BFs via Iceberg table configs (https://iceberg.apache.org/docs/latest/configuration/#write-properties), Parquet may decide not to write a BF for a column if all of its pages are dictionary-encoded (I am assuming this is because there is no benefit to writing a BF in that case). Do we need to mention this in our docs as well, since the given table property is only a hint for Parquet to write a BF, and it may choose not to?
@singhpk234 Yes, that's indeed the case. The dictionary is a better version of the bloom filter because it cannot contain false positives, which makes it redundant to also generate the bloom filter.
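The point above can be illustrated with a toy sketch (this is not Parquet's implementation; the class names and the two ad-hoc hash functions are made up for illustration): an exact dictionary answers membership queries with no errors in either direction, while a bloom filter guarantees only "no false negatives" and may report absent values as present.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Toy illustration of why a fully dictionary-encoded column makes a bloom
// filter redundant: the dictionary is an exact membership structure, while
// the bloom filter is approximate (false positives are possible).
public class DictVsBloom {

  // Minimal bloom filter: two hash functions over an m-bit array.
  static class ToyBloomFilter {
    private final BitSet bits;
    private final int m;

    ToyBloomFilter(int m) {
      this.m = m;
      this.bits = new BitSet(m);
    }

    private int h1(String v) {
      return Math.floorMod(v.hashCode(), m);
    }

    private int h2(String v) {
      return Math.floorMod(v.hashCode() * 31 + 17, m);
    }

    void add(String v) {
      bits.set(h1(v));
      bits.set(h2(v));
    }

    // Never false for a value that was added (no false negatives), but may
    // be true for a value that was never added (false positive).
    boolean mightContain(String v) {
      return bits.get(h1(v)) && bits.get(h2(v));
    }
  }

  public static void main(String[] args) {
    Set<String> dictionary = new HashSet<>();      // exact membership
    ToyBloomFilter bloom = new ToyBloomFilter(16); // tiny, so collisions are likely

    for (String v : new String[] {"us-east-1", "us-west-2", "eu-central-1"}) {
      dictionary.add(v);
      bloom.add(v);
    }

    // The dictionary never lies in either direction.
    System.out.println(dictionary.contains("us-east-1"));  // true
    System.out.println(dictionary.contains("ap-south-1")); // false

    // The bloom filter has no false negatives...
    System.out.println(bloom.mightContain("us-east-1"));   // true
    // ...but mightContain on an absent value can be either true or false.
    System.out.println(bloom.mightContain("ap-south-1"));
  }
}
```

Since dictionary encoding targets low-cardinality columns, the dictionary is small anyway, so skipping the bloom filter costs nothing and removes a source of false positives.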
@Fokko, added a fix for the failing UTs due to the BF changes and also updated the docs with the current behaviour of Parquet. I think once your PR apache/parquet-java#1084 is in and Parquet 1.13.1 is out, we should be good to go. Please let me know your thoughts!
Ran a benchmark on the 1.2.1 branch:
@singhpk234 can I invite you to give the 1.13.1 RC a try: https://lists.apache.org/thread/0yokbmfcbhz76ftjbxktwxfo5vrt57od It would be great if you could vote on the RC.
Thank you @Fokko for the awesome work! Added my +1 :)
aliyun/src/test/java/org/apache/iceberg/aliyun/oss/mock/AliyunOSSMockLocalStore.java
Non-blocking nit, this looks great @singhpk234 !
Thanks @singhpk234 for picking this up, and @amogh-jahagirdar for the review!
| write.parquet.row-group-size-bytes | 134217728 (128 MB) | Parquet row group size |
| write.parquet.page-size-bytes | 1048576 (1 MB) | Parquet page size |
| write.parquet.page-row-limit | 20000 | Parquet page row limit |
| write.parquet.dictionary.enabled | true | Enable dictionary encoding |
This doesn't match the setting above.
@@ -135,6 +135,9 @@ private TableProperties() {}
  public static final String DELETE_PARQUET_PAGE_ROW_LIMIT = "write.delete.parquet.page-row-limit";
  public static final int PARQUET_PAGE_ROW_LIMIT_DEFAULT = 20_000;

  public static final String PARQUET_DICT_ENABLED = "write.parquet.enable.dictionary";
This doesn't match the format of Iceberg options. It should be write.parquet.dict-enabled to match dict-size-bytes, or should use enabled as the last word if you're using dictionary as a part of the hierarchy.
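The naming rule described above can be sketched as a toy check (illustrative only, not Iceberg code; the class and method names here are made up): a boolean table property should end with "enabled", either as its own dot-separated segment or as a kebab-case suffix.

```java
// Illustrative toy check (not part of Iceberg) for the naming convention
// discussed above: boolean flags end with "enabled", either as the final
// dot-separated segment or as a kebab-case suffix like "dict-enabled".
public class FlagNameCheck {
  static boolean endsLikeBooleanFlag(String property) {
    String[] segments = property.split("\\.");
    String last = segments[segments.length - 1];
    return last.equals("enabled") || last.endsWith("-enabled");
  }

  public static void main(String[] args) {
    System.out.println(endsLikeBooleanFlag("write.parquet.dict-enabled"));       // true
    System.out.println(endsLikeBooleanFlag("write.parquet.dictionary.enabled")); // true
    // The spelling under review puts the verb first, so it breaks the rule:
    System.out.println(endsLikeBooleanFlag("write.parquet.enable.dictionary"));  // false
  }
}
```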
+1, the name doesn't match our pattern
Apologies for my oversight here; I intended to have .enabled as the suffix, which I mentioned in the table property, but messed it up here.
| write.update.mode | copy-on-write | Mode used for update commands: copy-on-write or merge-on-read (v2 only) |
| write.update.isolation-level | serializable | Isolation level for update commands: serializable or snapshot |
| write.merge.mode | copy-on-write | Mode used for merge commands: copy-on-write or merge-on-read (v2 only) |
| write.merge.isolation-level | serializable | Isolation level for merge commands: serializable or snapshot |
Why was this entire table reformatted?
@@ -197,6 +198,7 @@ public void createInputFile() throws IOException {
  try (FileAppender<Record> appender =
      Parquet.write(outFile)
          .schema(FILE_SCHEMA)
          .set(PARQUET_DICT_ENABLED, "false")
@singhpk234 @Fokko, I don't think it makes sense to create an option that is public and must be supported by Iceberg moving forward just for this test case. Is it possible to set this with a Hadoop option or to do this some other way?
I guess there were a few oversights here that we want to address before release. I was thinking that it would make sense for certain use cases to disable dictionary encoding if they wanted to use bloom filters (controlling the space tradeoff). But this probably isn't that valuable if I think about it: dictionary encoding is useful for low-cardinality columns, so the space difference between the two is negligible, with the tradeoff being deterministic lookups vs. false positives from the bloom filter.

So if it's about the test, I suppose we can see about adding a package-private method to the builder for disabling dictionary encoding, or some alternative configuration.

So we should conclude:
1. Do we really want a table property for controlling dictionary encoding? Curious to know others' opinions. cc: @Fokko @rdblue @singhpk234 @aokolnychyi
2. If the answer to 1 is yes, let's go back and correct the naming in the properties and docs.
I agree with @amogh-jahagirdar. It is probably not worth adding an extra table property for the sake of the test. I doubt anyone would ever configure it. Can we fix the test?
About the change
Parquet 1.13.0 release notes: https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.0/CHANGES.md?plain=1#L22-L78
This has the fix for: https://issues.apache.org/jira/browse/PARQUET-2160
Earlier Iceberg PR to fix the issue observed above:
cc @rdblue @bryanck