-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-26346][BUILD][SQL] Upgrade Parquet to 1.11.1 #26804
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #115016 has finished for PR 26804 at commit
|
retest this please |
Test build #115021 has finished for PR 26804 at commit
|
Retest this please. |
Is the generated result different from the Jenkins (the following)?
|
Probably a good idea, but does it pick up any key new features or fixes, or breaking changes? |
Oops, clicked the wrong button. Didn't mean to close it. |
Test build #115048 has finished for PR 26804 at commit
|
Test build #115098 has finished for PR 26804 at commit
|
Since parquet is the default format of Spark, I am wondering how stable it is? How many systems are using this new version? Do we need to do it in 3.0? or we can just wait for the next release? |
cc @rdblue for @gatorsmile 's question. |
@gatorsmile . I fully understand the concerns. |
I feel we should do some benchmark first, and we can have more meaningful discussion if we should have it in Spark 3.0. |
What does Parquet 1.11 buy in terms of improvements? perf or bug fixes? |
Test build #115184 has finished for PR 26804 at commit
|
The main change is that Parquet 1.11.0 will now write column indexes near the Parquet footer for page-level skipping. Skipping is not turned on by default. There are also some changes with how logical types are tracked in metadata that allow storing extra information, like whether a timestamp is I don't know of anyone running 1.11.0 in production yet. I think @mccheah runs a branch close to Parquet master and may be running something like 1.11.0. I tend to agree with the cautious approach. Let's at least run the benchmarks to verify there is no perf regression from the changes. I'd also be fine delaying this until 3.1. |
Yeah we run pretty far ahead in our Parquet dependency. It's worked fine for us: https://github.com/palantir/spark/blob/master/pom.xml#L143, see also https://github.com/palantir/parquet-mr |
...src/test/scala/org/apache/spark/sql/execution/benchmark/ParquetFilterPushdownBenchmark.scala
Outdated
Show resolved
Hide resolved
...n/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java
Outdated
Show resolved
Hide resolved
Test build #115532 has finished for PR 26804 at commit
|
...n/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java
Outdated
Show resolved
Hide resolved
...java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
Outdated
Show resolved
Hide resolved
Test build #116279 has finished for PR 26804 at commit
|
Test build #116339 has finished for PR 26804 at commit
|
This pull request upgrades the Avro transitive dependency version to 1.9.1 without upgrading the Spark Avro dependency version, which is 1.8.2. This will cause runtime exceptions such as
Perhaps if the (edit) No additional tests fail with the exclusions removed. Might it be useful to add a (failing) test here or to a separate pull request that demonstrates the above runtime exception? |
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #134296 has finished for PR 26804 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #134301 has finished for PR 26804 at commit
|
@@ -127,6 +127,9 @@ class ParquetFileFormat | |||
conf.setEnum(ParquetOutputFormat.JOB_SUMMARY_LEVEL, JobSummaryLevel.NONE) | |||
} | |||
|
|||
// PARQUET-1746: Disables page-level CRC checksums by default. | |||
conf.setBooleanIfUnset(ParquetOutputFormat.PAGE_WRITE_CHECKSUM_ENABLED, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks dangerous. Also cc @bbraams
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wangyum Any chance you could elaborate on this a bit more? Are we convinced that the issue you pointed out in #26804 (comment) is actually a regression caused by parquet and not a problem with the test itself (e.g. caused by any non-trivial assumptions made w.r.t. the output files)? Considering the benefit of having checksums enabled by default (e.g. much improved visibility into hard to debug data corruption issues), I'd propose further investigation before disabling the feature entirely and having Spark diverge from the parquet-mr
defaults.
Regarding the defaults, support for checksums was added back in PARQUET-1580. These changes were included and released with parquet-mr
1.11.0 (see CHANGES), and writing out checksums has been enabled by default since the release, see ParquetProperties.java
in:
I also noticed that PARQUET-1746 was raised and a PR was opened for it to set the default to false
, but that the issue has already been marked as resolved and the PR closed without merging the changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Disable it to fix this regression: [SPARK-26346][BUILD][SQL] Upgrade Parquet to 1.11.1 #26804 (review).
- Writing out checksums has minimal performance impact.
- Do we really need this feature? I haven't seen Spark SQL users request this feature before. This change just disable it by default, users can still enable this feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Conflicts: # pom.xml
Kubernetes integration test starting |
Test build #134400 has finished for PR 26804 at commit
|
Kubernetes integration test status failure |
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #134487 has finished for PR 26804 at commit
|
LGTM The current PR looks good to me. However, based on the pervious experience, Parquet upgrade always causes various issues. We might revert the upgrade at the last minute. @wangyum Could you create a 3.2.0 blocker JIRA? Before the release, we need to double check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and then decide whether we should upgrade/revert Parquet. At the same time, we should encourage the whole community to do the compatibility and performance tests for their production workloads, including both read and write code paths. |
|
Thank you all! |
Merged to master. |
Thank you, @wangyum and @gatorsmile ! |
BTW, Apache Parquet 1.12 is also one of the candidate we can choose in Apache Spark 3.2.0 timeframe. |
Thank you @dongjoon-hyun I will evaluate Parquet 1.12 soon. |
Nice work @wangyum and all! is there anything else to be done in order to get the full page skipping feature with column indexes? looking at PARQUET-1739 I was under the impression that the vectorized path needs some more work. |
@wangyum 👏 great work ! |
### What changes were proposed in this pull request? This PR upgrade Parquet to 1.11.1. Parquet 1.11.1 new features: - [PARQUET-1201](https://issues.apache.org/jira/browse/PARQUET-1201) - Column indexes - [PARQUET-1253](https://issues.apache.org/jira/browse/PARQUET-1253) - Support for new logical type representation - [PARQUET-1388](https://issues.apache.org/jira/browse/PARQUET-1388) - Nanosecond precision time and timestamp - parquet-mr More details: https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.1/CHANGES.md ### Why are the changes needed? Support column indexes to improve query performance. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing test. Closes apache#26804 from wangyum/SPARK-26346. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>
What changes were proposed in this pull request?
This PR upgrade Parquet to 1.11.1.
Parquet 1.11.1 new features:
More details:
https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.1/CHANGES.md
Why are the changes needed?
Support column indexes to improve query performance.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing test.