
Parquet: Update parquet to 1.13.1 #7301

Merged: 8 commits merged into apache:master on May 19, 2023
Conversation

singhpk234
Contributor

@github-actions github-actions bot added the build label Apr 7, 2023
@bryanck
Contributor

bryanck commented Apr 8, 2023

Should we also revert #5681 ?

@singhpk234
Contributor Author

singhpk234 commented Apr 10, 2023

Should we also revert #5681 ?

IMHO we should. I was thinking the same, but in a separate commit, after validating that the issue is fixed even without that change.

Presently this upgrade takes a dependency on Hadoop 3.2, and interestingly some of the BloomFilter unit tests fail as well. Taking a look into those too.

@singhpk234
Contributor Author

There is a dependency on Hadoop 3.2; I see there is a PR #5024 for the same, so I will wait for it to get in.

@Fokko
Contributor

Fokko commented Apr 29, 2023

I've created a PR to support Hadoop 2.7.3 as well: apache/parquet-java#1084

@Fokko
Contributor

Fokko commented Apr 29, 2023

Looks like this commit made the bloom tests on the Iceberg side fail:

➜  parquet-mr git:(4e9e79c89) ✗  git bisect bad
4e9e79c895e61775066e11240729b574e2264b8b is the first bad commit
commit 4e9e79c895e61775066e11240729b574e2264b8b
Author: ChenLiang.Lu <31469905+yabola@users.noreply.github.com>
Date:   Mon Feb 27 15:45:46 2023 +0800

    PARQUET-2251 Avoid generating Bloomfilter when all pages of a column are encoded by dictionary in parquet v1 (#1033)

 .../apache/parquet/hadoop/ParquetFileWriter.java   |   3 +-
 .../apache/parquet/hadoop/TestBloomFiltering.java  |  18 ++-
 .../parquet/hadoop/TestStoreBloomFilter.java       | 132 +++++++++++++++++++++
 3 files changed, 151 insertions(+), 2 deletions(-)
 create mode 100644 parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestStoreBloomFilter.java

The PR apache/parquet-java#1033
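The bisect session above follows the standard workflow. As a self-contained toy illustration (a throwaway repo stands in for parquet-mr, and a `grep` stands in for running the failing Iceberg bloom filter tests; everything here is illustrative):

```shell
# Toy, self-contained illustration of the `git bisect` workflow quoted above.
# A throwaway repo stands in for parquet-mr; commit 3 of 5 introduces the "bug".
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect
for i in 1 2 3 4 5; do
  echo "change $i" > file.txt
  if [ "$i" -ge 3 ]; then echo BROKEN >> file.txt; fi  # regression lands here
  git add file.txt
  git commit -qm "commit $i"
done
git bisect start HEAD HEAD~4          # newest commit is bad, oldest is good
git bisect run sh -c '! grep -q BROKEN file.txt' > /dev/null  # exit 0 = good
bad_subject=$(git show -s --format=%s refs/bisect/bad)
git bisect reset > /dev/null
echo "first bad commit: $bad_subject"  # -> first bad commit: commit 3
```

`git bisect run` automates the good/bad classification, which is how a failing test suite can drive the search without manual `git bisect good`/`git bisect bad` calls at each step.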

@singhpk234
Contributor Author

Looks like this commit made the bloom tests on the Iceberg side fail:

This is an interesting find @Fokko. It implies that even when we enable BFs via Iceberg table configs (https://iceberg.apache.org/docs/latest/configuration/#write-properties), Parquet may decide not to write a BF for a column if all of its pages are dictionary-encoded (I assume because there would be no benefit to writing a BF in that case). Do we need to mention this in our docs as well, given that the table property is a hint for Parquet to write a BF, but Parquet may choose not to?
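For context, the table property in question hints Parquet to write a bloom filter for a specific column, e.g. via Spark SQL (the table and column names here are illustrative):

```sql
ALTER TABLE db.events SET TBLPROPERTIES (
  'write.parquet.bloom-filter-enabled.column.user_id' = 'true'
);
```

Per the behaviour discussed in this thread, even with this property set, Parquet may skip the filter when every page of the column is dictionary-encoded.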

@Fokko
Contributor

Fokko commented May 1, 2023

@singhpk234 Yes, that's indeed the case. The dictionary is a better version of the bloom filter because it cannot produce false positives, which makes it redundant to also generate the bloom filter.
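To make the reasoning concrete, here is a toy sketch (not parquet-mr internals; the class and hash functions are made up for illustration) of why a complete dictionary subsumes a bloom filter: the dictionary answers membership exactly, while the bloom filter can only answer "definitely absent" or "maybe present".

```java
import java.util.BitSet;
import java.util.Set;

// Toy sketch: dictionary = exact membership, Bloom filter = probabilistic.
public class DictVsBloom {
  static final int BITS = 64; // tiny on purpose; real filters size by NDV/FPP

  static void bloomAdd(BitSet bits, String v) {
    bits.set(Math.floorMod(v.hashCode(), BITS));
    bits.set(Math.floorMod(v.hashCode() * 31 + 17, BITS));
  }

  static boolean bloomMightContain(BitSet bits, String v) {
    return bits.get(Math.floorMod(v.hashCode(), BITS))
        && bits.get(Math.floorMod(v.hashCode() * 31 + 17, BITS));
  }

  public static void main(String[] args) {
    Set<String> dictionary = Set.of("a", "b", "c"); // all pages dict-encoded
    BitSet bloom = new BitSet(BITS);
    dictionary.forEach(v -> bloomAdd(bloom, v));

    // The dictionary is exact, so once every page is dictionary-encoded
    // the Bloom filter adds no pruning power, only file size.
    System.out.println(dictionary.contains("zz"));      // false, guaranteed
    System.out.println(bloomMightContain(bloom, "a"));  // true, no false negatives
  }
}
```

A reader can use the dictionary to rule a value in or out with certainty, which is exactly why PARQUET-2251 skips the bloom filter when all pages of a column are dictionary-encoded.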

@singhpk234
Contributor Author

@Fokko, I added a fix for the UTs failing due to the BF changes and also updated the docs with the current behaviour of Parquet. I think once your PR apache/parquet-java#1084 is in and Parquet 1.13.1 is out, we should be good to go. Please let me know your thoughts!

@Fokko
Contributor

Fokko commented May 11, 2023

Ran a benchmark on the 1.2.1 branch:

Apache Iceberg 1.2.1 with Apache Parquet 1.12.3

Benchmark                                                                            Mode  Cnt  Score   Error  Units
IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized                  ss    5  5,068 ± 0,151   s/op
IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized                     ss    5  1,961 ± 0,081   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIceberg                                  ss    5  2,438 ± 0,028   s/op
IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  1,061 ± 0,120   s/op
IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  0,490 ± 0,186   s/op
IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIceberg                    ss    5  0,377 ± 0,052   s/op

Apache Iceberg 1.2.1 with Apache Parquet 1.13.1-SNAPSHOT

Benchmark                                                                            Mode  Cnt  Score   Error  Units
IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized                  ss    5  4,509 ± 0,047   s/op
IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized                     ss    5  2,137 ± 0,176   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIceberg                                  ss    5  2,446 ± 0,056   s/op
IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  1,033 ± 0,085   s/op
IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  0,471 ± 0,043   s/op
IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIceberg                    ss    5  0,372 ± 0,039   s/op

@Fokko
Contributor

Fokko commented May 12, 2023

@singhpk234 can I invite you to give the 1.13.1 RC a try: https://lists.apache.org/thread/0yokbmfcbhz76ftjbxktwxfo5vrt57od It would be great if you could also vote on the RC.

@singhpk234
Contributor Author

Thank you @Fokko for the awesome work! Added my +1 :)

@singhpk234 singhpk234 changed the title Parquet: Update parquet to 1.13.0 Parquet: Update parquet to 1.13.1 May 16, 2023
@github-actions github-actions bot removed the ALIYUN label May 19, 2023
@amogh-jahagirdar (Contributor) left a comment

Non-blocking nit, this looks great @singhpk234 !

@Fokko Fokko merged commit 14a30c0 into apache:master May 19, 2023
@Fokko
Contributor

Fokko commented May 19, 2023

Thanks @singhpk234 for picking this up, and @amogh-jahagirdar for the review!

@bryanck do you want the honor of reverting #5681?

| write.parquet.row-group-size-bytes | 134217728 (128 MB) | Parquet row group size |
| write.parquet.page-size-bytes | 1048576 (1 MB) | Parquet page size |
| write.parquet.page-row-limit | 20000 | Parquet page row limit |
| write.parquet.dictionary.enabled | true | Enable dictionary encoding |
Contributor

This doesn't match the setting above.

@@ -135,6 +135,9 @@ private TableProperties() {}
public static final String DELETE_PARQUET_PAGE_ROW_LIMIT = "write.delete.parquet.page-row-limit";
public static final int PARQUET_PAGE_ROW_LIMIT_DEFAULT = 20_000;

public static final String PARQUET_DICT_ENABLED = "write.parquet.enable.dictionary";
Contributor

This doesn't match the format of Iceberg options. It should be write.parquet.dict-enabled to match dict-size-bytes, or it should use enabled as the last word if you're using dictionary as part of the hierarchy.

Contributor

+1, the name does not match our pattern

Contributor Author

Apologies for my oversight here. I intended to have .enabled as the suffix, which is what I put in the table prop, but messed it up here.

| write.update.mode | copy-on-write | Mode used for update commands: copy-on-write or merge-on-read (v2 only) |
| write.update.isolation-level | serializable | Isolation level for update commands: serializable or snapshot |
| write.merge.mode | copy-on-write | Mode used for merge commands: copy-on-write or merge-on-read (v2 only) |
| write.merge.isolation-level | serializable | Isolation level for merge commands: serializable or snapshot |
Contributor

Why was this entire table reformatted?

@@ -197,6 +198,7 @@ public void createInputFile() throws IOException {
try (FileAppender<Record> appender =
Parquet.write(outFile)
.schema(FILE_SCHEMA)
.set(PARQUET_DICT_ENABLED, "false")
Contributor

@singhpk234 @Fokko, I don't think it makes sense to create an option that is public and must be supported by Iceberg moving forward just for this test case. Is it possible to set this with a Hadoop option or to do this some other way?

@amogh-jahagirdar
Contributor

amogh-jahagirdar commented May 19, 2023

I guess there were a few oversights here that we want to address before the release. I was thinking that it would make sense for certain use cases to disable dictionary encoding if they wanted to use bloom filters (controlling the space tradeoff), but this probably isn't that valuable on reflection.

Dictionary encoding is useful for low-cardinality columns, so the space difference between the two is negligible, with the tradeoff being deterministic lookups vs. false positives from the bloom filter.

So if it's about the test, I suppose we could add a package-private method to the builder for disabling dictionary encoding, or some alternative configuration.

So we should conclude:

1. Do we really want a table property for controlling dictionary encoding? Curious to hear others' opinions. cc: @Fokko @rdblue @singhpk234 @aokolnychyi

2. If the answer to 1 is yes, let's go back and correct the naming in the properties and docs.

@aokolnychyi
Contributor

I agree with @amogh-jahagirdar. It is probably not worth adding an extra table property for the sake of the test; I doubt anyone would ever configure it. Can we fix the test instead?
