[SPARK-26346][BUILD][SQL] Upgrade Parquet to 1.11.1 #26804

Closed
wants to merge 11 commits

Conversation

wangyum
Member

@wangyum wangyum commented Dec 9, 2019

What changes were proposed in this pull request?

This PR upgrades Parquet to 1.11.1.

Parquet 1.11.1 new features:

- [PARQUET-1201](https://issues.apache.org/jira/browse/PARQUET-1201) - Column indexes
- [PARQUET-1253](https://issues.apache.org/jira/browse/PARQUET-1253) - Support for new logical type representation
- [PARQUET-1388](https://issues.apache.org/jira/browse/PARQUET-1388) - Nanosecond precision time and timestamp - parquet-mr

More details:
https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.1/CHANGES.md

Why are the changes needed?

Support column indexes to improve query performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.
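The headline feature here is column indexes, which let a reader skip whole data pages using per-page min/max statistics. The following is a conceptual sketch only (not parquet-mr code, and not how Spark invokes it): it shows the page-selection idea that makes range predicates cheaper.

```python
# Conceptual sketch (not parquet-mr code): how a column index lets a reader
# skip pages. Each page records min/max statistics; a range predicate only
# needs pages whose [min, max] interval overlaps the queried range.

def pages_to_read(column_index, lower, upper):
    """Return indices of pages that may contain values in [lower, upper]."""
    selected = []
    for i, (page_min, page_max) in enumerate(column_index):
        # A page can be skipped iff its value range is disjoint from the query range.
        if page_max >= lower and page_min <= upper:
            selected.append(i)
    return selected

# Hypothetical column index for four pages of an integer column.
index = [(0, 99), (100, 199), (200, 299), (300, 399)]
print(pages_to_read(index, 150, 250))  # only pages 1 and 2 overlap -> [1, 2]
```

With statistics like these, a predicate such as `WHERE x BETWEEN 150 AND 250` touches two pages instead of four; without the index, every page in the row group must be decompressed and decoded.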

@SparkQA

SparkQA commented Dec 9, 2019

Test build #115016 has finished for PR 26804 at commit 4d12d7f.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Dec 9, 2019

retest this please

@SparkQA

SparkQA commented Dec 9, 2019

Test build #115021 has finished for PR 26804 at commit 4d12d7f.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 9, 2019

Is the generated result different from the Jenkins (the following)?

diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-1.2 b/dev/pr-deps/spark-deps-hadoop-2.7-hive-1.2
index 4d2176f..37bc50f 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-1.2
+++ b/dev/pr-deps/spark-deps-hadoop-2.7-hive-1.2
@@ -107,6 +107,7 @@ jakarta.ws.rs-api-2.1.6.jar
 jakarta.xml.bind-api-2.3.2.jar
 janino-3.0.15.jar
 javassist-3.22.0-CR2.jar
+javax.annotation-api-1.3.2.jar
 javax.inject-1.jar
 javax.servlet-api-3.1.0.jar
 javolution-5.5.1.jar

@srowen
Member

srowen commented Dec 9, 2019

Probably a good idea, but does it pick up any key new features or fixes, or breaking changes?
Also how well does this work with the Avro that we use? I know that's always a problem area, but maybe it's behind us now.

@srowen srowen closed this Dec 9, 2019
@srowen srowen reopened this Dec 9, 2019
@srowen
Member

srowen commented Dec 9, 2019

Oops, clicked the wrong button. Didn't mean to close it.

@SparkQA

SparkQA commented Dec 9, 2019

Test build #115048 has finished for PR 26804 at commit 4d12d7f.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 10, 2019

Test build #115098 has finished for PR 26804 at commit 2ceb9ff.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Since Parquet is the default format of Spark, I am wondering how stable it is. How many systems are using this new version? Do we need to do this in 3.0, or can we just wait for the next release?

@dongjoon-hyun
Member

cc @rdblue for @gatorsmile 's question.

@dongjoon-hyun dongjoon-hyun requested a review from rdblue December 10, 2019 16:46
@dongjoon-hyun
Member

@gatorsmile . I fully understand the concerns.
However, if the tests pass, I believe this is a good candidate for Preview2.
cc @dbtsai , @aokolnychyi , @felixcheung

@dbtsai
Member

dbtsai commented Dec 10, 2019

I feel we should run some benchmarks first; then we can have a more meaningful discussion about whether we should have it in Spark 3.0.

@srowen
Member

srowen commented Dec 11, 2019

What does Parquet 1.11 buy in terms of improvements? Perf or bug fixes?
I'd generally think we want to make this change for Spark 3.0 rather than, say, 3.1, to reduce risk.
But yes, it always bears understanding whether it is going to change behavior in important ways.
Is there a perf angle here? Is the benchmark idea to detect improvements or regressions?

@SparkQA

SparkQA commented Dec 11, 2019

Test build #115184 has finished for PR 26804 at commit 40be254.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Dec 11, 2019

The main change is that Parquet 1.11.0 will now write column indexes near the Parquet footer for page-level skipping. Skipping is not turned on by default. There are also some changes with how logical types are tracked in metadata that allow storing extra information, like whether a timestamp is timestamp with time zone or timestamp without time zone.

I don't know of anyone running 1.11.0 in production yet. I think @mccheah runs a branch close to Parquet master and may be running something like 1.11.0.

I tend to agree with the cautious approach. Let's at least run the benchmarks to verify there is no perf regression from the changes. I'd also be fine delaying this until 3.1.
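The second change rdblue mentions, richer logical-type metadata, can be illustrated with a toy model (this is not the parquet-mr `LogicalTypeAnnotation` API, just a sketch of the concept): Parquet's TIMESTAMP annotation now carries a time unit plus an `isAdjustedToUTC` flag, which is what distinguishes instant semantics ("timestamp with time zone") from local semantics ("timestamp without time zone").

```python
# Toy model (not the parquet-mr API) of the new logical-type metadata:
# a TIMESTAMP annotation carries a unit and an isAdjustedToUTC flag.
from dataclasses import dataclass

@dataclass(frozen=True)
class TimestampType:
    unit: str                  # e.g. "MILLIS", "MICROS", "NANOS"
    is_adjusted_to_utc: bool   # True = instant semantics, False = local semantics

    def sql_name(self):
        # Instant semantics map to TIMESTAMP WITH TIME ZONE; local semantics do not.
        return ("timestamp with time zone" if self.is_adjusted_to_utc
                else "timestamp without time zone")

print(TimestampType("MICROS", True).sql_name())   # timestamp with time zone
print(TimestampType("MICROS", False).sql_name())  # timestamp without time zone
```

Before this representation existed, readers had to infer such semantics from the older converted-type annotations, which could not express the distinction.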

@mccheah
Contributor

mccheah commented Dec 11, 2019

Yeah we run pretty far ahead in our Parquet dependency. It's worked fine for us: https://github.com/palantir/spark/blob/master/pom.xml#L143, see also https://github.com/palantir/parquet-mr

@SparkQA

SparkQA commented Dec 19, 2019

Test build #115532 has finished for PR 26804 at commit a6575de.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 8, 2020

Test build #116279 has finished for PR 26804 at commit 4756e67.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 9, 2020

Test build #116339 has finished for PR 26804 at commit 15d9f96.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@heuermh
Contributor

heuermh commented Jan 28, 2020

Also how well does this work with the Avro that we use? I know that's always a problem area, but maybe it's behind us now.

This pull request upgrades the Avro transitive dependency version to 1.9.1 without upgrading the Spark Avro dependency version, which is 1.8.2.

This will cause runtime exceptions such as

Caused by: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
	at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
	at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
	at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
	at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
	at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
	at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
	at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
	at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
	at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
	at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Perhaps if the parquet-avro test-scope dependency did not exclude the Avro 1.9.1 transitive dependencies, these runtime issues would show up in Spark unit tests rather than in downstream projects. I am testing this hypothesis today.

(edit) No additional tests fail with the exclusions removed. Might it be useful to add a (failing) test here or to a separate pull request that demonstrates the above runtime exception?
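Until the Spark-side Avro version catches up, a hypothetical downstream workaround (version number taken from the comment above; adjust for your build) is to pin a single Avro version so that parquet-avro and Spark resolve the same classes:

```xml
<!-- Hypothetical downstream pom.xml fragment: force one Avro version so that
     parquet-avro 1.11.x and Spark agree, avoiding NoSuchMethodError at runtime. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
      <version>1.9.1</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

This only papers over the mismatch in one project; the real fix is aligning the versions in Spark itself, as discussed above.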

@SparkQA

SparkQA commented Jan 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38884/

@SparkQA

SparkQA commented Jan 21, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38884/

@SparkQA

SparkQA commented Jan 21, 2021

Test build #134296 has finished for PR 26804 at commit 802eb36.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38887/

@SparkQA

SparkQA commented Jan 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38887/

@SparkQA

SparkQA commented Jan 21, 2021

Test build #134301 has finished for PR 26804 at commit a89c61d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -127,6 +127,9 @@ class ParquetFileFormat
conf.setEnum(ParquetOutputFormat.JOB_SUMMARY_LEVEL, JobSummaryLevel.NONE)
}

// PARQUET-1746: Disables page-level CRC checksums by default.
conf.setBooleanIfUnset(ParquetOutputFormat.PAGE_WRITE_CHECKSUM_ENABLED, false)
Member


This looks dangerous. Also cc @bbraams


@wangyum Any chance you could elaborate on this a bit more? Are we convinced that the issue you pointed out in #26804 (comment) is actually a regression caused by parquet and not a problem with the test itself (e.g. caused by any non-trivial assumptions made w.r.t. the output files)? Considering the benefit of having checksums enabled by default (e.g. much improved visibility into hard to debug data corruption issues), I'd propose further investigation before disabling the feature entirely and having Spark diverge from the parquet-mr defaults.

Regarding the defaults, support for checksums was added back in PARQUET-1580. These changes were included and released with parquet-mr 1.11.0 (see CHANGES), and writing out checksums has been enabled by default since the release, see ParquetProperties.java in:

I also noticed that PARQUET-1746 was raised and a PR was opened for it to set the default to false, but that the issue has already been marked as resolved and the PR closed without merging the changes.

Member Author

@wangyum wangyum Jan 25, 2021


  1. Disable it to fix this regression: [SPARK-26346][BUILD][SQL] Upgrade Parquet to 1.11.1 #26804 (review).
  2. Writing out checksums has minimal performance impact.
  3. Do we really need this feature? I haven't seen Spark SQL users request this feature before. This change just disables it by default; users can still enable it.
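For users who do want the checksums back, the opt-in is a single Hadoop property. A sketch of a `spark-defaults.conf` entry, assuming the key behind `ParquetOutputFormat.PAGE_WRITE_CHECKSUM_ENABLED` is `parquet.page.write-checksum.enabled` (verify against your parquet-mr version):

```
# Re-enable Parquet page-level CRC checksums for writes (disabled by default
# in Spark after this change). Key name assumed; check ParquetOutputFormat.
spark.hadoop.parquet.page.write-checksum.enabled  true
```

The same property can be set per session through the Hadoop configuration instead of cluster-wide.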


I see it's been addressed in 72c52b6, thanks for the quick fix @wangyum! 👍

@SparkQA

SparkQA commented Jan 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38986/

@SparkQA

SparkQA commented Jan 23, 2021

Test build #134400 has finished for PR 26804 at commit eb1c95e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class BitwiseGet(left: Expression, right: Expression)
  • new RuntimeException(s\"class$`

@SparkQA

SparkQA commented Jan 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38986/

@SparkQA

SparkQA commented Jan 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39073/

@SparkQA

SparkQA commented Jan 26, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39073/

@SparkQA

SparkQA commented Jan 26, 2021

Test build #134487 has finished for PR 26804 at commit 72c52b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

The current PR looks good to me. However, based on previous experience, Parquet upgrades always cause various issues. We might revert the upgrade at the last minute.

@wangyum Could you create a 3.2.0 blocker JIRA? Before the release, we need to double check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and then decide whether we should upgrade/revert Parquet. At the same time, we should encourage the whole community to do the compatibility and performance tests for their production workloads, including both read and write code paths.

@wangyum
Member Author

wangyum commented Jan 28, 2021

Could you create a 3.2.0 blocker JIRA?

OK, https://issues.apache.org/jira/browse/SPARK-34276.

@wangyum wangyum closed this in a7683af Jan 29, 2021
@wangyum
Member Author

wangyum commented Jan 29, 2021

Thank you all!

@wangyum
Member Author

wangyum commented Jan 29, 2021

Merged to master.

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 29, 2021

Thank you, @wangyum and @gatorsmile !

@dongjoon-hyun
Member

BTW, Apache Parquet 1.12 is also one of the candidates we can choose in the Apache Spark 3.2.0 timeframe.
The Apache Parquet 1.12.0 RC1 vote has already started.

@wangyum
Member Author

wangyum commented Jan 29, 2021

Thank you, @dongjoon-hyun. I will evaluate Parquet 1.12 soon.

@wangyum wangyum deleted the SPARK-26346 branch January 29, 2021 00:34
@sunchao
Member

sunchao commented Jan 29, 2021

Nice work, @wangyum and all! Is there anything else to be done in order to get the full page-skipping feature with column indexes? Looking at PARQUET-1739, I was under the impression that the vectorized path needs some more work.

@wangyum
Member Author

wangyum commented Jan 29, 2021

@sunchao #31393

@iemejia
Member

iemejia commented Jan 29, 2021

@wangyum 👏 great work !

skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
### What changes were proposed in this pull request?

This PR upgrades Parquet to 1.11.1.

Parquet 1.11.1 new features:

- [PARQUET-1201](https://issues.apache.org/jira/browse/PARQUET-1201) - Column indexes
- [PARQUET-1253](https://issues.apache.org/jira/browse/PARQUET-1253) - Support for new logical type representation
- [PARQUET-1388](https://issues.apache.org/jira/browse/PARQUET-1388) - Nanosecond precision time and timestamp - parquet-mr

More details:
https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.1/CHANGES.md

### Why are the changes needed?
Support column indexes to improve query performance.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes apache#26804 from wangyum/SPARK-26346.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>