[SPARK-26447][SQL]Allow OrcColumnarBatchReader to return less partition columns #23387


Closed

Conversation

gengliangwang
Member

What changes were proposed in this pull request?

Currently OrcColumnarBatchReader returns all the partition column values in the batch read.
In data source V2, we can improve it by returning the required partition column values only.

This PR is part of #23383 . As @cloud-fan suggested, create a new PR to make review easier.

Also, this PR doesn't improve OrcFileFormat, since in the method buildReaderWithPartitionValues, the requiredSchema filters out all the partition columns, so we can't know which partition columns are required.
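For illustration, a minimal Scala sketch of the schemas involved (field names here are made up, not from the PR): the reader receives a requiredSchema holding data columns only, and its output batch follows a result schema that appends the partition columns after them.

import org.apache.spark.sql.types._

// Data columns requested by the query; FileSourceStrategy has already
// stripped the partition columns out of this schema.
val requiredSchema = StructType(Seq(
  StructField("col0", IntegerType),
  StructField("col2", StringType)))
// Partition columns, whose values come from directory names rather than
// from the ORC files themselves.
val partitionSchema = StructType(Seq(StructField("p0", IntegerType)))
// The combined layout of the batch returned by the reader.
val resultSchema = StructType(requiredSchema.fields ++ partitionSchema.fields)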

How was this patch tested?

Unit test

@@ -58,10 +58,16 @@

   /**
    * The column IDs of the physical ORC file schema which are required by this reader.
-   * -1 means this required column doesn't exist in the ORC file.
+   * -1 means this required column is partition column, or it doesn't exist in the ORC file.
@cloud-fan cloud-fan (Contributor) Dec 26, 2018

I think we need more comments here.

Ideally a partition column should never appear in the physical file and should only appear in the directory name. However, Spark is OK with partition columns inside the physical file; Spark will discard the values from the file and use the partition values derived from the directory name. The column order will be preserved, though.
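A hypothetical example of that layout (paths and values made up): for a table partitioned by column p, the partition value is encoded in the directory name, so even if a column named p also exists inside the ORC files, Spark ignores those stored values.

/table/p=1/part-00000.orc   -- rows from this file get p = 1, taken from the path
/table/p=2/part-00001.orc   -- rows from this file get p = 2, taken from the path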

*/
private int[] requestedColIds;

/**
* The column IDs of the ORC file partition schema which are required by this reader.
* -1 means this required column doesn't exist in the ORC partition columns.
Contributor

-1 means this required column is not a partition column

*/
private int[] requestedColIds;

/**
* The column IDs of the ORC file partition schema which are required by this reader.
Contributor

it's not a column ID, it's the index of the partition column


@SparkQA

SparkQA commented Dec 26, 2018

Test build #100457 has finished for PR 23387 at commit 799f429.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

adding @dongjoon-hyun

@dongjoon-hyun
Member

Thank you for pinging me, @HyukjinKwon .

-      val requestedColIds = requestedColIdsOrEmptyFile.get
-      assert(requestedColIds.length == requiredSchema.length,
+      val requestedColIds =
+        requestedColIdsOrEmptyFile.get ++ Array.fill(partitionSchema.length)(-1)
Member

This semantic change also affects the non-vectorized code path. Can we isolate the scope of this PR?

-      assert(requestedColIds.length == requiredSchema.length,
+      val requestedColIds =
+        requestedColIdsOrEmptyFile.get ++ Array.fill(partitionSchema.length)(-1)
+      assert(requestedColIds.length == resultSchema.length,
         "[BUG] requested column IDs do not match required schema")
Member

Could you adjust the error message to match your PR? The required schema is not used in the new assertion.

@dongjoon-hyun dongjoon-hyun (Member) left a comment

Hi, @gengliangwang.
It seems that we missed new test cases for the new feature suggested here: "Allow OrcColumnarBatchReader to return less partition columns". Could you add some?

@gengliangwang
Member Author

gengliangwang commented Dec 27, 2018

@cloud-fan @dongjoon-hyun Thanks a lot for the review.

@dongjoon-hyun I tried adding a test case for the improvement, but the implementation seems too low-level for constructing and initializing OrcColumnarBatchReader.
This PR is mainly for the ORC V2 migration. It is an independent PR since the code context is a bit complex. I prefer to add a test case checking that the output rows of the ORC V2 reader are pruned.
Is that OK with you?

@cloud-fan
Contributor

LGTM. How would you add a test? IIUC this is just a code refactor now, nothing will be changed. It will become a real optimization when migrating ORC to the V2 API.

@SparkQA

SparkQA commented Dec 27, 2018

Test build #100473 has finished for PR 23387 at commit 49ae28b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  if (requestedPartitionColIds[i] != -1) {
    requestedDataColIds[i] = -1;
  }
}
Member

Does this loop work as expected? The intention seems clear, but here we initialized the arrays like the following:

val requestedDataColIds = requestedColIds ++ Array.fill(partitionSchema.length)(-1)
val requestedPartitionColIds = Array.fill(requiredSchema.length)(-1) ++ Range(0, partitionSchema.length)

So, logically, in this for loop, the range of i satisfying requestedPartitionColIds[i] != -1 seems to be exactly the range that was filled with Array.fill(partitionSchema.length)(-1)? Did I understand correctly?

gengliangwang (Member Author)

Yes. This is because in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191
the required schema always filters out all the partition columns.

Now it can be easily fixed in ORC V2, but fixing FileFormat may affect the Parquet reader as well.
In this PR, I check requestedPartitionColIds against requiredSchema, so that it will be easier if the improvement is someday made for FileFormat.
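For illustration, a runnable Scala sketch of the two arrays being discussed; the column IDs and schema lengths are made up, and only the merge loop mirrors the code under review.

// Suppose the required schema holds two data columns that map to physical
// ORC columns 0 and 2, and there is one partition column.
val requestedColIds = Array(0, 2)
val requiredSchemaLength = requestedColIds.length
val partitionSchemaLength = 1

// Layout over the result schema: data columns first, then partition columns.
val requestedDataColIds =
  requestedColIds ++ Array.fill(partitionSchemaLength)(-1)                // Array(0, 2, -1)
val requestedPartitionColIds =
  Array.fill(requiredSchemaLength)(-1) ++ Range(0, partitionSchemaLength) // Array(-1, -1, 0)

// The loop under discussion: a slot served by a partition column must not
// also be read as a data column. Because FileSourceStrategy strips partition
// columns from requiredSchema today, the two arrays never overlap and the
// loop is a no-op; it is a safety net in case requiredSchema ever contains
// partition columns.
for (i <- requestedDataColIds.indices) {
  if (requestedPartitionColIds(i) != -1) {
    requestedDataColIds(i) = -1
  }
}
assert(requestedDataColIds.sameElements(Array(0, 2, -1)))
assert(requestedPartitionColIds.sameElements(Array(-1, -1, 0)))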

Member

Thanks.

Member

The suggested test suite also covers this logic.

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 27, 2018

@cloud-fan and @gengliangwang.
Could you review and merge gengliangwang#3 into this PR?
This PR claims an improvement by returning only the required partition column values. We had better add test coverage for the newly added logic here.

> IIUC this is just a code refactor now, nothing will be changed

@gengliangwang
Member Author

@dongjoon-hyun Thanks for the test suite. I have merged it and updated the test case. Please review it again.

val reader = getReader(requestedDataColIds, requestedPartitionColIds,
Array(dataSchema.fields(0), partitionSchema.fields(0)))
val batch = reader.columnarBatch
assert(batch.numCols() === 2)
gengliangwang (Member Author)

Here we can see that the result columns are pruned.

@SparkQA

SparkQA commented Dec 28, 2018

Test build #100489 has finished for PR 23387 at commit 1b09dae.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 28, 2018

Test build #100491 has finished for PR 23387 at commit b87ea1e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please

@SparkQA

SparkQA commented Dec 28, 2018

Test build #100493 has finished for PR 23387 at commit b87ea1e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 28, 2018

Test build #100498 has finished for PR 23387 at commit 1b58df8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -539,6 +539,25 @@ object PartitioningUtils {
}).asNullable
}

def requestedPartitionColumnIds(
Member

Can we have a more intuitive name? This function name looks weird to me because requestedPartitionColumnIds returns the full schema.

gengliangwang (Member Author)

The returned value depends on the parameter requiredSchema, which can be the full schema or the requested schema.
Do you have a suggestion for the method name?

Member

If you don't mind, I prefer to revert the last commit, 1b58df8.

gengliangwang (Member Author)

Sure, I have reverted it.

@SparkQA

SparkQA commented Jan 1, 2019

Test build #100613 has finished for PR 23387 at commit 5ed34d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in e2dbafd Jan 3, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…ion columns

## What changes were proposed in this pull request?

Currently OrcColumnarBatchReader returns all the partition column values in the batch read.
In data source V2, we can improve it by returning the required partition column values only.

This PR is part of apache#23383 . As cloud-fan suggested, create a new PR to make review easier.

Also, this PR doesn't improve `OrcFileFormat`, since in the method `buildReaderWithPartitionValues`, the `requiredSchema` filters out all the partition columns, so we can't know which partition columns are required.

## How was this patch tested?

Unit test

Closes apache#23387 from gengliangwang/refactorOrcColumnarBatch.

Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Gengliang Wang <ltnwgl@gmail.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>