[SPARK-25407][SQL] Ensure we pass a compatible pruned schema to ParquetRowConverter #22880

mallman · 2018-10-29T19:16:59Z

What changes were proposed in this pull request?

(Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-25407)

As part of schema clipping in ParquetReadSupport.scala, we add fields in the Catalyst requested schema which are missing from the Parquet file schema to the Parquet clipped schema. However, nested schema pruning requires that we ignore unrequested field data when reading from a Parquet file. Therefore we pass two schemas to ParquetRecordMaterializer: the schema of the file data we want to read and the schema of the rows we want to return. The reader is responsible for reconciling the differences between the two.

Aside from checking whether schema pruning is enabled, there is an additional complication to constructing the Parquet requested schema. Spark's two Parquet readers differ in how they reconcile the differences between the Parquet requested schema and the Catalyst requested schema. Spark's vectorized reader does not (currently) support reading Parquet files with complex types in their schema. Further, it assumes that the Parquet requested schema includes all fields requested in the Catalyst requested schema. It includes logic in its read path to skip fields in the Parquet requested schema which are not present in the file.

Spark's parquet-mr based reader supports reading Parquet files of any kind of complex schema, and it supports nested schema pruning as well. Unlike the vectorized reader, the parquet-mr reader requires that the Parquet requested schema include only those fields present in the underlying Parquet file's schema. Therefore, in the case where we use the parquet-mr reader we intersect the Parquet clipped schema with the Parquet file's schema to construct the Parquet requested schema that's set in the ReadContext.
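
To illustrate the difference between the two readers described above, here is a minimal, self-contained Scala sketch. It does not use the parquet-mr API: Field, Leaf, Group, intersect and requestedSchema are hypothetical stand-ins, and dropping groups that end up empty is an assumption of this sketch rather than something stated in the description.

```scala
// Simplified stand-ins for Parquet schema types (hypothetical, NOT the parquet-mr classes).
sealed trait Field { def name: String }
final case class Leaf(name: String, typeName: String) extends Field
final case class Group(name: String, fields: List[Field]) extends Field

object SchemaIntersection {
  /** Keep only those fields of `clipped` that also exist in `file`, recursing into groups. */
  def intersect(clipped: Group, file: Group): Group = {
    val fileFields: Map[String, Field] = file.fields.map(f => f.name -> f).toMap
    val kept: List[Field] = clipped.fields.flatMap { field =>
      val surviving: Option[Field] = (field, fileFields.get(field.name)) match {
        case (g: Group, Some(fg: Group)) =>
          val inner = intersect(g, fg)
          // Assumption of this sketch: a group with no surviving fields is dropped.
          if (inner.fields.isEmpty) None else Some(inner)
        case (l: Leaf, Some(_: Leaf)) => Some(l)
        case _ => None // field is missing from the file (or its kind differs)
      }
      surviving
    }
    Group(clipped.name, kept)
  }

  /**
   * Vectorized reader: pass the clipped schema through unchanged.
   * parquet-mr reader: request only the fields that are present in the file.
   */
  def requestedSchema(clipped: Group, file: Group, vectorized: Boolean): Group =
    if (vectorized) clipped else intersect(clipped, file)
}
```

With a file schema that lacks some of the requested fields, requestedSchema(clipped, file, vectorized = false) returns only the fields the file actually contains, while vectorized = true returns the clipped schema unchanged.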

How was this patch tested?

A previously ignored test case which exercises the failure scenario this PR addresses has been enabled.

SparkQA · 2018-10-29T22:38:00Z

Test build #98225 has finished for PR 22880 at commit e5e60ad.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

.../src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala

viirya · 2018-10-30T08:16:01Z

...e/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala

+    // parquet-mr reader requires that parquetRequestedSchema include only those fields present in
+    // the underlying parquetFileSchema. Therefore, in the case where we use the parquet-mr reader
+    // we intersect the parquetClippedSchema with the parquetFileSchema to construct the
+    // parquetRequestedSchema set in the ReadContext.


For the vectorized reader, if we also do this additional intersectParquetGroups, will it cause any problem?

For the vectorized reader, if we also do this additional intersectParquetGroups, will it cause any problem?

Yes. The relevant passage is:

Further, [the vectorized reader] assumes that parquetRequestedSchema includes all fields requested in catalystRequestedSchema. It includes logic in its read path to skip fields in parquetRequestedSchema which are not present in the file.

If we break this assumption by giving the vectorized reader a Parquet requested schema which does not include all of the fields in the Catalyst requested schema, then it will fail with an exception. This scenario is covered by the tests. (Comment out the relevant code below and run the tests to see.)

viirya · 2018-10-30T08:24:22Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala

@@ -202,11 +204,15 @@ private[parquet] class ParquetRowConverter(

  override def start(): Unit = {
    var i = 0
-    while (i < currentRow.numFields) {
+    while (i < fieldConverters.length) {
      fieldConverters(i).updater.start()
      currentRow.setNullAt(i)


Now fieldConverters(i) may not be matched to currentRow(i)?

That is correct. Now that we're passing a Parquet schema that's a (non-strict) subset of the Catalyst schema, we cannot assume that their fields are in 1:1 correspondence.

Yea, I think it should be fine to call setNullAt on a non-corresponding field, right?

Yes. The following while loop at

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala

Lines 212 to 215 in 6b19f57

    while (i < currentRow.numFields) {
      currentRow.setNullAt(i)
      i += 1
    }

ensures all remaining columns/fields in the current row are nulled-out.

I am also lost here. The index i seems to follow the Parquet fields, so isn't updater.ordinal the correct index for updating currentRow?

I would expect something like:

    val updater = fieldConverters(i).updater
    updater.start()
    currentRow.setNullAt(updater.ordinal)

@viirya @attilapiros Hi guys. Does my explanation make sense? If so, do you want me to change the code as I suggested or leave it as-is in the current PR commit?

I see. I think doing this separately is better, and you can rewrite the setNullAt part as a one-liner, like:

(0 until currentRow.numFields).foreach(currentRow.setNullAt)

I'm fine with the current commit. It seems it can save some redundant iterations.

Thank you both for your feedback.

It seems it can save some redundant iterations.

That was my motivation in writing the code this way. While the code is not as clear as it could be, it is very performance critical.

I'm going to push a new commit keeping the current code but with a brief explanatory comment.

I'm going to push a new commit keeping the current code but with a brief explanatory comment.

On further careful consideration, I believe that separating the calls to currentRow.setNullAt(i) into their own loop actually won't incur any significant performance degradation—if any at all.

The performance of the start() method is dominated by the calls to fieldConverters(i).updater.start() and currentRow.setNullAt(i). Putting the latter calls into their own loop won't change the count of those method calls, just the order. @viirya LMK if you disagree with my analysis.

I will push a new commit with separate while loops. I won't use the more elegant (0 until currentRow.numFields).foreach(currentRow.setNullAt) because that's not a loop, and I doubt either the Spark or HotSpot optimizer can turn that into a loop.
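
The two-loop version described above can be sketched roughly as follows. RowBuffer, Updater and SketchRowConverter are hypothetical simplifications for illustration, not the Spark classes; the sketch only assumes that there may be fewer field converters than row fields.

```scala
// Hypothetical stand-in for the mutable row being filled by the converter.
final class RowBuffer(val numFields: Int) {
  private val values: Array[Any] = Array.fill[Any](numFields)("stale")
  def setNullAt(i: Int): Unit = { values(i) = null }
  def isNullAt(i: Int): Boolean = values(i) == null
}

// Hypothetical stand-in for a field updater; `ordinal` is the target cell in the row.
final class Updater(val ordinal: Int) {
  var started: Boolean = false
  def start(): Unit = { started = true }
}

final class SketchRowConverter(row: RowBuffer, updaters: Array[Updater]) {
  require(updaters.length <= row.numFields, "more converters than row fields")

  def start(): Unit = {
    // First loop: start every field converter (one per Parquet field being read).
    var i = 0
    while (i < updaters.length) {
      updaters(i).start()
      i += 1
    }
    // Second loop: null out every cell of the row, including cells no converter writes to.
    i = 0
    while (i < row.numFields) {
      row.setNullAt(i)
      i += 1
    }
  }
}
```

Compared with a single interleaved loop, the number of start() and setNullAt calls is the same; only their order changes.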

viirya · 2018-10-30T08:31:12Z

...e/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala

+      s"""Going to read the following fields from the Parquet file with the following schema:
+         |Parquet file schema:
+         |$fileSchema
+         |Parquet read schema:


This might increase a lot of log data. Do we need to output fileSchema?

This detailed, formatted information was very helpful in developing and debugging this patch. Perhaps this should be logged at the debug level instead? Even the original message does seem rather technical for info-level logging. What do you think?

I think it is useful for debugging this patch, but it may not be useful for end users and will increase log size. Making it debug level sounds good to me. But let's wait for others' opinions too.

Yea, maybe we should change these to debug level. I would still log them somewhere at debug level.

dbtsai · 2018-10-30T18:28:44Z

I can confirm that this fixes https://issues.apache.org/jira/browse/SPARK-25879

cc @cloud-fan @gatorsmile @beettlle

Thanks.

SparkQA · 2018-10-30T22:46:17Z

Test build #98278 has finished for PR 22880 at commit 6b19f57.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mallman · 2018-11-05T21:22:01Z

cc @HyukjinKwon

Would you like to review this PR? It's a bug fix.

SparkQA · 2018-11-06T20:43:43Z

Test build #98528 has finished for PR 22880 at commit 598d965.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mallman · 2018-11-07T19:44:26Z

Jenkins retest please.

mallman · 2018-11-07T20:04:53Z

Can someone with Jenkins retest privileges please kick off a retest?

viirya · 2018-11-07T22:28:56Z

retest this please.

SparkQA · 2018-11-08T01:47:14Z

Test build #98572 has finished for PR 22880 at commit 598d965.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mallman · 2018-11-08T18:48:22Z

@gatorsmile How do you feel about merging this in? Anyone else I should ping for review?

HyukjinKwon · 2018-11-09T05:30:39Z

Let me take a look this weekend.

HyukjinKwon · 2018-11-11T14:07:21Z

Looks good. I or someone else should take a closer look before getting this in.

HyukjinKwon · 2018-11-11T14:10:41Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala

-      case ((parquetFieldType, catalystField), ordinal) =>
-        // Converted field value should be set to the `ordinal`-th cell of `currentRow`
-        newConverter(parquetFieldType, catalystField.dataType, new RowUpdater(currentRow, ordinal))
+    parquetType.getFields.asScala.map {


Also, nit: use parquetType.getFields.asScala.map { parquetField => per https://github.com/databricks/scala-style-guide#pattern-matching

HyukjinKwon · 2018-11-11T14:15:40Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala

-    parquetType.getFieldCount == catalystType.length,
-    s"""Field counts of the Parquet schema and the Catalyst schema don't match:
+    parquetType.getFieldCount <= catalystType.length,
+    s"""Field count of the Parquet schema is greater than the field count of the Catalyst schema:


Can we assert this only when pruning is enabled? For instance, we could change the condition to enabled && parquetType.getFieldCount <= catalystType.length || parquetType.getFieldCount == catalystType.length.

Why do you ask? Is it for safety, clarity? My concern is around reducing complexity, but I'm not strictly against this.
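
For reference, a small sketch of the suggested condition. Since && binds more tightly than || in Scala, the expression is read as (enabled && a <= b) || a == b, which is equivalent to the explicit branch below. The object, method and parameter names are hypothetical.

```scala
object FieldCountCheck {
  /**
   * With pruning enabled, the Parquet schema may have fewer fields than the
   * Catalyst schema; otherwise the field counts must match exactly.
   */
  def fieldCountsCompatible(
      pruningEnabled: Boolean,
      parquetFieldCount: Int,
      catalystFieldCount: Int): Boolean =
    if (pruningEnabled) parquetFieldCount <= catalystFieldCount
    else parquetFieldCount == catalystFieldCount
}
```

The trade-off raised in the reply is that the stricter check adds a dependency on the pruning flag in exchange for keeping the original equality assertion when pruning is off.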

HyukjinKwon · 2018-11-11T14:19:06Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala

+      i += 1
+    }
+    i = 0
+    while (i < currentRow.numFields) {


Can we loop once with if?

Yes, but I think it's clearer this way. If @viirya has an opinion either way I'll take it as a "tie-breaker".

SparkQA · 2018-11-13T22:25:17Z

Test build #98791 has finished for PR 22880 at commit 4dfd459.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mallman · 2018-12-05T22:25:59Z

Hi @dbtsai @HyukjinKwon @gatorsmile @viirya. Can we merge this to master?

MaxGekk

LGTM

gatorsmile · 2019-03-06T18:01:38Z

ping @hvanhovell

dongjoon-hyun · 2019-04-03T16:38:59Z

Hi, @mallman. I can review and help with this PR. Could you rebase once more?

dongjoon-hyun · 2019-04-05T20:46:32Z

Okay. I'll take over this with @mallman 's authorship in a new PR.

dongjoon-hyun · 2019-04-06T17:35:53Z

Ping @mallman here, too. Since you are back, please rebase this one. I can help you here, as I mentioned earlier. In the PR I made you are also the author, but I don't like creating that kind of PR.

Ensure we pass a compatible pruned schema to ParquetRowConverter

e5e60ad

mallman mentioned this pull request Oct 30, 2018

[SPARK-4502][SQL] Parquet nested column pruning - foundation #21320

Closed

viirya reviewed Oct 30, 2018

Replace an unnecessarily partial function with a "total" function

6b19f57

Extract all calls to currentRow.setNullAt(i) in ParquetRowConverter.start() into their own loop for clarity

598d965

HyukjinKwon reviewed Nov 11, 2018

Change some log levels and make a stylistic change

4dfd459

MaxGekk approved these changes Mar 6, 2019

dongjoon-hyun mentioned this pull request Apr 5, 2019

[SPARK-25407][SQL] Allow nested access for non-existent field for Parquet file when nested pruning is enabled #24307

Closed

HyukjinKwon closed this in 215609d Apr 8, 2019

kimtkyeom mentioned this pull request Mar 12, 2020

[SPARK-31116][SQL] Fix nested schema case-sensitivity in ParquetRowConverter #27888

Closed
