[SPARK-26346][BUILD][SQL] Upgrade Parquet to 1.11.1 #26804

Closed
wants to merge 11 commits

Conversation

wangyum
Member

@wangyum wangyum commented Dec 9, 2019

What changes were proposed in this pull request?

This PR upgrades Parquet to 1.11.1.

Parquet 1.11.1 new features:

- [PARQUET-1201](https://issues.apache.org/jira/browse/PARQUET-1201) - Column indexes
- [PARQUET-1253](https://issues.apache.org/jira/browse/PARQUET-1253) - Support for new logical type representation
- [PARQUET-1388](https://issues.apache.org/jira/browse/PARQUET-1388) - Nanosecond precision time and timestamp - parquet-mr

More details:
https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.1/CHANGES.md

Why are the changes needed?

Support column indexes to improve query performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.
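The headline feature here is column indexes, which let a reader skip whole data pages using per-page min/max statistics. The following is a conceptual sketch only (not parquet-mr code, and not how Spark invokes it): it shows the page-selection idea that makes range predicates cheaper.

```python
# Conceptual sketch (not parquet-mr code): how a column index lets a reader
# skip pages. Each page records min/max statistics; a range predicate only
# needs pages whose [min, max] interval overlaps the queried range.

def pages_to_read(column_index, lower, upper):
    """Return indices of pages that may contain values in [lower, upper]."""
    selected = []
    for i, (page_min, page_max) in enumerate(column_index):
        # A page can be skipped iff its value range is disjoint from the query range.
        if page_max >= lower and page_min <= upper:
            selected.append(i)
    return selected

# Hypothetical column index for four pages of an integer column.
index = [(0, 99), (100, 199), (200, 299), (300, 399)]
print(pages_to_read(index, 150, 250))  # only pages 1 and 2 overlap -> [1, 2]
```

With statistics like these, a predicate such as `WHERE x BETWEEN 150 AND 250` touches two pages instead of four; without the index, every page in the row group must be decompressed and decoded.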

@SparkQA

SparkQA commented Dec 9, 2019

Test build #115016 has finished for PR 26804 at commit 4d12d7f.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Dec 9, 2019

retest this please

@SparkQA

SparkQA commented Dec 9, 2019

Test build #115021 has finished for PR 26804 at commit 4d12d7f.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 9, 2019

Is the generated result different from the Jenkins (the following)?

diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-1.2 b/dev/pr-deps/spark-deps-hadoop-2.7-hive-1.2
index 4d2176f..37bc50f 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-1.2
+++ b/dev/pr-deps/spark-deps-hadoop-2.7-hive-1.2
@@ -107,6 +107,7 @@ jakarta.ws.rs-api-2.1.6.jar
 jakarta.xml.bind-api-2.3.2.jar
 janino-3.0.15.jar
 javassist-3.22.0-CR2.jar
+javax.annotation-api-1.3.2.jar
 javax.inject-1.jar
 javax.servlet-api-3.1.0.jar
 javolution-5.5.1.jar

@srowen
Member

srowen commented Dec 9, 2019

Probably a good idea, but does it pick up any key new features or fixes, or breaking changes?
Also how well does this work with the Avro that we use? I know that's always a problem area, but maybe it's behind us now.

@srowen srowen closed this Dec 9, 2019
@srowen srowen reopened this Dec 9, 2019
@srowen
Member

srowen commented Dec 9, 2019

Oops, clicked the wrong button. Didn't mean to close it.

@SparkQA

SparkQA commented Dec 9, 2019

Test build #115048 has finished for PR 26804 at commit 4d12d7f.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 10, 2019

Test build #115098 has finished for PR 26804 at commit 2ceb9ff.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Since Parquet is the default format of Spark, I am wondering how stable it is. How many systems are using this new version? Do we need to do this in 3.0, or can we just wait for the next release?

@dongjoon-hyun
Member

cc @rdblue for @gatorsmile 's question.

@dongjoon-hyun dongjoon-hyun requested a review from rdblue December 10, 2019 16:46
@dongjoon-hyun
Member

@gatorsmile . I fully understand the concerns.
However, if the tests pass, I believe this is a good candidate for Preview2.
cc @dbtsai , @aokolnychyi , @felixcheung

@dbtsai
Member

dbtsai commented Dec 10, 2019

I feel we should run some benchmarks first; then we can have a more meaningful discussion about whether we should have it in Spark 3.0.

@srowen
Member

srowen commented Dec 11, 2019

What does Parquet 1.11 buy in terms of improvements? Perf or bug fixes?
I'd generally think we want to make this change for Spark 3.0 rather than, say, 3.1, to reduce risk.
But yes, it always bears understanding whether it is going to change behavior in important ways.
Is there a perf angle here? Is the benchmark idea to detect improvements or regressions?

@SparkQA

SparkQA commented Dec 11, 2019

Test build #115184 has finished for PR 26804 at commit 40be254.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Dec 11, 2019

The main change is that Parquet 1.11.0 will now write column indexes near the Parquet footer for page-level skipping. Skipping is not turned on by default. There are also some changes with how logical types are tracked in metadata that allow storing extra information, like whether a timestamp is timestamp with time zone or timestamp without time zone.

I don't know of anyone running 1.11.0 in production yet. I think @mccheah runs a branch close to Parquet master and may be running something like 1.11.0.

I tend to agree with the cautious approach. Let's at least run the benchmarks to verify there is no perf regression from the changes. I'd also be fine delaying this until 3.1.
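The second change rdblue mentions, richer logical-type metadata, can be illustrated with a toy model (this is not the parquet-mr `LogicalTypeAnnotation` API, just a sketch of the concept): Parquet's TIMESTAMP annotation now carries a time unit plus an `isAdjustedToUTC` flag, which is what distinguishes instant semantics ("timestamp with time zone") from local semantics ("timestamp without time zone").

```python
# Toy model (not the parquet-mr API) of the new logical-type metadata:
# a TIMESTAMP annotation carries a unit and an isAdjustedToUTC flag.
from dataclasses import dataclass

@dataclass(frozen=True)
class TimestampType:
    unit: str                  # e.g. "MILLIS", "MICROS", "NANOS"
    is_adjusted_to_utc: bool   # True = instant semantics, False = local semantics

    def sql_name(self):
        # Instant semantics map to TIMESTAMP WITH TIME ZONE; local semantics do not.
        return ("timestamp with time zone" if self.is_adjusted_to_utc
                else "timestamp without time zone")

print(TimestampType("MICROS", True).sql_name())   # timestamp with time zone
print(TimestampType("MICROS", False).sql_name())  # timestamp without time zone
```

Before this representation existed, readers had to infer such semantics from the older converted-type annotations, which could not express the distinction.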

@mccheah
Contributor

mccheah commented Dec 11, 2019

Yeah we run pretty far ahead in our Parquet dependency. It's worked fine for us: https://github.com/palantir/spark/blob/master/pom.xml#L143, see also https://github.com/palantir/parquet-mr

@SparkQA

SparkQA commented Dec 19, 2019

Test build #115532 has finished for PR 26804 at commit a6575de.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 8, 2020

Test build #116279 has finished for PR 26804 at commit 4756e67.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 9, 2020

Test build #116339 has finished for PR 26804 at commit 15d9f96.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@heuermh
Contributor

heuermh commented Jan 28, 2020

Also how well does this work with the Avro that we use? I know that's always a problem area, but maybe it's behind us now.

This pull request upgrades the Avro transitive dependency version to 1.9.1 without upgrading the Spark Avro dependency version, which is 1.8.2.

This will cause runtime exceptions such as

Caused by: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
	at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
	at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
	at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
	at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
	at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
	at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
	at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
	at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
	at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
	at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Perhaps if the parquet-avro test-scope dependency did not exclude the Avro 1.9.1 transitive dependencies, these runtime issues would show up in Spark unit tests rather than in downstream projects. I am testing this hypothesis today.

(edit) No additional tests fail with the exclusions removed. Might it be useful to add a (failing) test here or to a separate pull request that demonstrates the above runtime exception?
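Until the Spark-side Avro version catches up, a hypothetical downstream workaround (version number taken from the comment above; adjust for your build) is to pin a single Avro version so that parquet-avro and Spark resolve the same classes:

```xml
<!-- Hypothetical downstream pom.xml fragment: force one Avro version so that
     parquet-avro 1.11.x and Spark agree, avoiding NoSuchMethodError at runtime. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
      <version>1.9.1</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

This only papers over the mismatch in one project; the real fix is aligning the versions in Spark itself, as discussed above.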

@SparkQA

SparkQA commented Jan 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38884/

@SparkQA

SparkQA commented Jan 21, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38884/

@SparkQA

SparkQA commented Jan 21, 2021

Test build #134296 has finished for PR 26804 at commit 802eb36.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38887/

@SparkQA

SparkQA commented Jan 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38887/

@SparkQA

SparkQA commented Jan 21, 2021

Test build #134301 has finished for PR 26804 at commit a89c61d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -127,6 +127,9 @@ class ParquetFileFormat
conf.setEnum(ParquetOutputFormat.JOB_SUMMARY_LEVEL, JobSummaryLevel.NONE)
}

// PARQUET-1746: Disables page-level CRC checksums by default.
conf.setBooleanIfUnset(ParquetOutputFormat.PAGE_WRITE_CHECKSUM_ENABLED, false)
Member


This looks dangerous. Also cc @bbraams


@wangyum Any chance you could elaborate on this a bit more? Are we convinced that the issue you pointed out in #26804 (comment) is actually a regression caused by parquet and not a problem with the test itself (e.g. caused by any non-trivial assumptions made w.r.t. the output files)? Considering the benefit of having checksums enabled by default (e.g. much improved visibility into hard to debug data corruption issues), I'd propose further investigation before disabling the feature entirely and having Spark diverge from the parquet-mr defaults.

Regarding the defaults, support for checksums was added back in PARQUET-1580. These changes were included and released with parquet-mr 1.11.0 (see CHANGES), and writing out checksums has been enabled by default since the release, see ParquetProperties.java in:

I also noticed that PARQUET-1746 was raised and a PR was opened for it to set the default to false, but that the issue has already been marked as resolved and the PR closed without merging the changes.

Member Author

@wangyum wangyum Jan 25, 2021


  1. Disable it to fix this regression: [SPARK-26346][BUILD][SQL] Upgrade Parquet to 1.11.1 #26804 (review).
  2. Writing out checksums has minimal performance impact.
  3. Do we really need this feature? I haven't seen Spark SQL users request this feature before. This change just disables it by default; users can still enable it.
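For users who do want the checksums back, the opt-in is a single Hadoop property. A sketch of a `spark-defaults.conf` entry, assuming the key behind `ParquetOutputFormat.PAGE_WRITE_CHECKSUM_ENABLED` is `parquet.page.write-checksum.enabled` (verify against your parquet-mr version):

```
# Re-enable Parquet page-level CRC checksums for writes (disabled by default
# in Spark after this change). Key name assumed; check ParquetOutputFormat.
spark.hadoop.parquet.page.write-checksum.enabled  true
```

The same property can be set per session through the Hadoop configuration instead of cluster-wide.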


I see it's been addressed in 72c52b6, thanks for the quick fix @wangyum! 👍

@SparkQA

SparkQA commented Jan 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38986/

@SparkQA

SparkQA commented Jan 23, 2021

Test build #134400 has finished for PR 26804 at commit eb1c95e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class BitwiseGet(left: Expression, right: Expression)
  • new RuntimeException(s\"class$`

@SparkQA

SparkQA commented Jan 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38986/

@SparkQA

SparkQA commented Jan 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39073/

@SparkQA

SparkQA commented Jan 26, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39073/

@SparkQA

SparkQA commented Jan 26, 2021

Test build #134487 has finished for PR 26804 at commit 72c52b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

The current PR looks good to me. However, based on previous experience, Parquet upgrades always cause various issues. We might revert the upgrade at the last minute.

@wangyum Could you create a 3.2.0 blocker JIRA? Before the release, we need to double check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and then decide whether we should upgrade/revert Parquet. At the same time, we should encourage the whole community to do the compatibility and performance tests for their production workloads, including both read and write code paths.

@wangyum
Member Author

wangyum commented Jan 28, 2021

Could you create a 3.2.0 blocker JIRA?

OK, https://issues.apache.org/jira/browse/SPARK-34276.

@wangyum wangyum closed this in a7683af Jan 29, 2021
@wangyum
Member Author

wangyum commented Jan 29, 2021

Thank you all!

@wangyum
Member Author

wangyum commented Jan 29, 2021

Merged to master.

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 29, 2021

Thank you, @wangyum and @gatorsmile !

@dongjoon-hyun
Member

BTW, Apache Parquet 1.12 is also one of the candidates we can choose in the Apache Spark 3.2.0 timeframe.
The Apache Parquet 1.12.0 RC1 vote has already started.

@wangyum
Member Author

wangyum commented Jan 29, 2021

Thank you, @dongjoon-hyun. I will evaluate Parquet 1.12 soon.

@wangyum wangyum deleted the SPARK-26346 branch January 29, 2021 00:34
@sunchao
Member

sunchao commented Jan 29, 2021

Nice work, @wangyum and all! Is there anything else to be done in order to get the full page-skipping feature with column indexes? Looking at PARQUET-1739, I was under the impression that the vectorized path needs some more work.

@wangyum
Member Author

wangyum commented Jan 29, 2021

@sunchao #31393

@iemejia
Member

iemejia commented Jan 29, 2021

@wangyum 👏 great work !

skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
### What changes were proposed in this pull request?

This PR upgrades Parquet to 1.11.1.

Parquet 1.11.1 new features:

- [PARQUET-1201](https://issues.apache.org/jira/browse/PARQUET-1201) - Column indexes
- [PARQUET-1253](https://issues.apache.org/jira/browse/PARQUET-1253) - Support for new logical type representation
- [PARQUET-1388](https://issues.apache.org/jira/browse/PARQUET-1388) - Nanosecond precision time and timestamp - parquet-mr

More details:
https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.1/CHANGES.md

### Why are the changes needed?
Support column indexes to improve query performance.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes apache#26804 from wangyum/SPARK-26346.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>