
Parquet: Update parquet to 1.13.1 #7301

Merged: 8 commits merged into apache:master on May 19, 2023
Conversation

singhpk234
Contributor

@github-actions github-actions bot added the build label Apr 7, 2023
@bryanck
Contributor

bryanck commented Apr 8, 2023

Should we also revert #5681 ?

@singhpk234
Contributor Author

singhpk234 commented Apr 10, 2023

Should we also revert #5681 ?

IMHO we should. I was thinking the same, but in a separate commit, after validating that the issue is fixed even without that change.

Presently this upgrade takes a dependency on Hadoop 3.2, and interestingly some of the BloomFilter unit tests fail as well. Taking a look into those too.

@singhpk234
Contributor Author

There is a dependency on Hadoop 3.2; I see there is a PR #5024 for the same, so I will wait for it to get in.

@Fokko
Contributor

Fokko commented Apr 29, 2023

I've created a PR to support Hadoop 2.7.3 as well: apache/parquet-java#1084

@Fokko
Contributor

Fokko commented Apr 29, 2023

Looks like this commit made the bloom tests on the Iceberg side fail:

➜  parquet-mr git:(4e9e79c89) ✗  git bisect bad
4e9e79c895e61775066e11240729b574e2264b8b is the first bad commit
commit 4e9e79c895e61775066e11240729b574e2264b8b
Author: ChenLiang.Lu <31469905+yabola@users.noreply.github.com>
Date:   Mon Feb 27 15:45:46 2023 +0800

    PARQUET-2251 Avoid generating Bloomfilter when all pages of a column are encoded by dictionary in parquet v1 (#1033)

 .../apache/parquet/hadoop/ParquetFileWriter.java   |   3 +-
 .../apache/parquet/hadoop/TestBloomFiltering.java  |  18 ++-
 .../parquet/hadoop/TestStoreBloomFilter.java       | 132 +++++++++++++++++++++
 3 files changed, 151 insertions(+), 2 deletions(-)
 create mode 100644 parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestStoreBloomFilter.java

The PR apache/parquet-java#1033
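The bisect session above follows the standard workflow. As a self-contained toy illustration (a throwaway repo stands in for parquet-mr, and a `grep` stands in for running the failing Iceberg bloom filter tests; everything here is illustrative):

```shell
# Toy, self-contained illustration of the `git bisect` workflow quoted above.
# A throwaway repo stands in for parquet-mr; commit 3 of 5 introduces the "bug".
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect
for i in 1 2 3 4 5; do
  echo "change $i" > file.txt
  if [ "$i" -ge 3 ]; then echo BROKEN >> file.txt; fi  # regression lands here
  git add file.txt
  git commit -qm "commit $i"
done
git bisect start HEAD HEAD~4          # newest commit is bad, oldest is good
git bisect run sh -c '! grep -q BROKEN file.txt' > /dev/null  # exit 0 = good
bad_subject=$(git show -s --format=%s refs/bisect/bad)
git bisect reset > /dev/null
echo "first bad commit: $bad_subject"  # -> first bad commit: commit 3
```

`git bisect run` automates the good/bad classification, which is how a failing test suite can drive the search without manual `git bisect good`/`git bisect bad` calls at each step.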

@singhpk234
Contributor Author

Looks like this commit made the bloom tests on the Iceberg side fail:

This is an interesting find @Fokko. It implies that even when we enable BFs via Iceberg table configs (https://iceberg.apache.org/docs/latest/configuration/#write-properties), Parquet may decide not to write a BF for a column if all of its pages are dictionary-encoded (I assume because there would be no benefit to writing a BF in that case). Do we need to mention this in our docs as well, given that the table property is a hint for Parquet to write a BF, but Parquet may choose not to?
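For context, the table property in question hints Parquet to write a bloom filter for a specific column, e.g. via Spark SQL (the table and column names here are illustrative):

```sql
ALTER TABLE db.events SET TBLPROPERTIES (
  'write.parquet.bloom-filter-enabled.column.user_id' = 'true'
);
```

Per the behaviour discussed in this thread, even with this property set, Parquet may skip the filter when every page of the column is dictionary-encoded.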

@Fokko
Contributor

Fokko commented May 1, 2023

@singhpk234 Yes, that's indeed the case. The dictionary is a better version of the bloom filter because it cannot produce false positives, which makes it redundant to also generate the bloom filter.
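To make the reasoning concrete, here is a toy sketch (not parquet-mr internals; the class and hash functions are made up for illustration) of why a complete dictionary subsumes a bloom filter: the dictionary answers membership exactly, while the bloom filter can only answer "definitely absent" or "maybe present".

```java
import java.util.BitSet;
import java.util.Set;

// Toy sketch: dictionary = exact membership, Bloom filter = probabilistic.
public class DictVsBloom {
  static final int BITS = 64; // tiny on purpose; real filters size by NDV/FPP

  static void bloomAdd(BitSet bits, String v) {
    bits.set(Math.floorMod(v.hashCode(), BITS));
    bits.set(Math.floorMod(v.hashCode() * 31 + 17, BITS));
  }

  static boolean bloomMightContain(BitSet bits, String v) {
    return bits.get(Math.floorMod(v.hashCode(), BITS))
        && bits.get(Math.floorMod(v.hashCode() * 31 + 17, BITS));
  }

  public static void main(String[] args) {
    Set<String> dictionary = Set.of("a", "b", "c"); // all pages dict-encoded
    BitSet bloom = new BitSet(BITS);
    dictionary.forEach(v -> bloomAdd(bloom, v));

    // The dictionary is exact, so once every page is dictionary-encoded
    // the Bloom filter adds no pruning power, only file size.
    System.out.println(dictionary.contains("zz"));      // false, guaranteed
    System.out.println(bloomMightContain(bloom, "a"));  // true, no false negatives
  }
}
```

A reader can use the dictionary to rule a value in or out with certainty, which is exactly why PARQUET-2251 skips the bloom filter when all pages of a column are dictionary-encoded.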

@singhpk234
Contributor Author

@Fokko, I added a fix for the UTs failing due to the BF changes and also updated the docs with the current behaviour of Parquet. I think once your PR apache/parquet-java#1084 is in and Parquet 1.13.1 is out, we should be good to go. Please let me know your thoughts!

@Fokko
Contributor

Fokko commented May 11, 2023

Ran a benchmark on the 1.2.1 branch:

Apache Iceberg 1.2.1 with Apache Parquet 1.12.3

Benchmark                                                                            Mode  Cnt  Score   Error  Units
IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized                  ss    5  5,068 ± 0,151   s/op
IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized                     ss    5  1,961 ± 0,081   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIceberg                                  ss    5  2,438 ± 0,028   s/op
IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  1,061 ± 0,120   s/op
IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  0,490 ± 0,186   s/op
IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIceberg                    ss    5  0,377 ± 0,052   s/op

Apache Iceberg 1.2.1 with Apache Parquet 1.13.1-SNAPSHOT

Benchmark                                                                            Mode  Cnt  Score   Error  Units
IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized                  ss    5  4,509 ± 0,047   s/op
IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized                     ss    5  2,137 ± 0,176   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIceberg                                  ss    5  2,446 ± 0,056   s/op
IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  1,033 ± 0,085   s/op
IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  0,471 ± 0,043   s/op
IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIceberg                    ss    5  0,372 ± 0,039   s/op

@Fokko
Contributor

Fokko commented May 12, 2023

@singhpk234 can I invite you to give the 1.13.1 RC a try: https://lists.apache.org/thread/0yokbmfcbhz76ftjbxktwxfo5vrt57od It would be great if you could also vote on the RC.

@singhpk234
Contributor Author

Thank you @Fokko for the awesome work! Added my +1 :)

@singhpk234 singhpk234 changed the title Parquet: Update parquet to 1.13.0 Parquet: Update parquet to 1.13.1 May 16, 2023
@github-actions github-actions bot removed the ALIYUN label May 19, 2023
@amogh-jahagirdar (Contributor) left a comment

Non-blocking nit, this looks great @singhpk234 !

@Fokko Fokko merged commit 14a30c0 into apache:master May 19, 2023
@Fokko
Contributor

Fokko commented May 19, 2023

Thanks @singhpk234 for picking this up, and @amogh-jahagirdar for the review!

@bryanck do you want the honor of reverting #5681?

| write.parquet.row-group-size-bytes | 134217728 (128 MB) | Parquet row group size |
| write.parquet.page-size-bytes | 1048576 (1 MB) | Parquet page size |
| write.parquet.page-row-limit | 20000 | Parquet page row limit |
| write.parquet.dictionary.enabled | true | Enable dictionary encoding |
Contributor

This doesn't match the setting above.

@@ -135,6 +135,9 @@ private TableProperties() {}
public static final String DELETE_PARQUET_PAGE_ROW_LIMIT = "write.delete.parquet.page-row-limit";
public static final int PARQUET_PAGE_ROW_LIMIT_DEFAULT = 20_000;

public static final String PARQUET_DICT_ENABLED = "write.parquet.enable.dictionary";
Contributor

This doesn't match the format of Iceberg options. It should be write.parquet.dict-enabled to match dict-size-bytes, or it should use enabled as the last word if you're using dictionary as part of the hierarchy.

Contributor

+1, the name does not match our pattern

Contributor Author

Apologies for my oversight here. I intended to have .enabled as the suffix, which is what I put in the table prop, but messed it up here.

| write.update.mode | copy-on-write | Mode used for update commands: copy-on-write or merge-on-read (v2 only) |
| write.update.isolation-level | serializable | Isolation level for update commands: serializable or snapshot |
| write.merge.mode | copy-on-write | Mode used for merge commands: copy-on-write or merge-on-read (v2 only) |
| write.merge.isolation-level | serializable | Isolation level for merge commands: serializable or snapshot |
Contributor

Why was this entire table reformatted?

@@ -197,6 +198,7 @@ public void createInputFile() throws IOException {
try (FileAppender<Record> appender =
Parquet.write(outFile)
.schema(FILE_SCHEMA)
.set(PARQUET_DICT_ENABLED, "false")
Contributor

@singhpk234 @Fokko, I don't think it makes sense to create an option that is public and must be supported by Iceberg moving forward just for this test case. Is it possible to set this with a Hadoop option or to do this some other way?

@amogh-jahagirdar
Contributor

amogh-jahagirdar commented May 19, 2023

I guess there were a few oversights here that we want to address before the release. I was thinking that it would make sense for certain use cases to disable dictionary encoding if they wanted to use bloom filters (controlling the space tradeoff), but this probably isn't that valuable on reflection.

Dictionary encoding is useful for low-cardinality columns, so the space difference between the two is negligible, with the tradeoff being deterministic lookups vs. false positives from the bloom filter.

So if it's about the test, I suppose we could add a package-private method to the builder for disabling dictionary encoding, or some alternative configuration.

So we should conclude:

1. Do we really want a table property for controlling dictionary encoding? Curious to hear others' opinions. cc: @Fokko @rdblue @singhpk234 @aokolnychyi

2. If the answer to 1 is yes, let's go back and correct the naming in the properties and docs.

@aokolnychyi
Contributor

I agree with @amogh-jahagirdar. It is probably not worth adding an extra table property for the sake of the test; I doubt anyone would ever configure it. Can we fix the test instead?
