[SPARK-23896][SQL]Improve PartitioningAwareFileIndex #21004
Conversation
Test build #89034 has finished for PR 21004 at commit
Test build #89044 has finished for PR 21004 at commit
retest this please.
Test build #89049 has finished for PR 21004 at commit
// we need to cast into the data type that user specified.
def castPartitionValuesToUserSchema(row: InternalRow) = {
  InternalRow((0 until row.numFields).map { i =>
    val expr = inferredPartitionSpec.partitionColumns.fields(i).dataType match {
      case StringType => Literal.create(row.getUTF8String(i), StringType)
why special case string type?
row.get(i, StringType) throws an exception.
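The behavior under discussion can be sketched in plain Scala (hypothetical, simplified names; no Spark dependencies): partition values parsed from directory names start out as strings, and when the user supplies a partition schema, each raw string is cast to the declared type rather than type-inferred.

```scala
// Hypothetical, simplified sketch (not Spark's actual code): partition values
// extracted from paths like "year=2018/month=4" are always strings; with a
// user-supplied partition schema, each raw string is cast to the declared type.
sealed trait SimpleType
case object SimpleStringType extends SimpleType
case object SimpleIntType extends SimpleType

object PartitionCastSketch {
  // Cast one raw partition value (always a string) to the user's declared type.
  def castToUserType(raw: String, userType: SimpleType): Any = userType match {
    case SimpleStringType => raw       // already a string, no conversion needed
    case SimpleIntType    => raw.toInt // may throw if the directory name is not numeric
  }
}
```

Here `SimpleType` and `castToUserType` are invented for illustration; in the PR the string case is special because, as noted above, `row.get(i, StringType)` throws an exception on the internal representation.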
@@ -81,7 +81,7 @@ class PartitionProviderCompatibilitySuite
   HiveCatalogMetrics.reset()
   assert(spark.sql("select * from test where partCol < 2").count() == 2)
   assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount() == 2)
-  assert(HiveCatalogMetrics.METRIC_FILES_DISCOVERED.getCount() == 2)
+  assert(HiveCatalogMetrics.METRIC_FILES_DISCOVERED.getCount() == 7)
what happened here?
All the files should be parsed once when creating the file index, so it is 5 + 2 = 7.
Test build #89215 has finished for PR 21004 at commit
retest this please.
Test build #89224 has finished for PR 21004 at commit
Test build #89233 has finished for PR 21004 at commit
retest this please.
retest this please.
val (dataSchema, partitionSchema) = getOrInferFileFormatSchema(format, fileStatusCache)
checkAndGlobPathIfNecessary(checkEmptyGlobPath = true, checkFilesExist = checkFilesExist)
val (dataSchema, partitionSchema) =
  getOrInferFileFormatSchema(format)
now it can be merged into the line above
val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
val (dataSchema, partitionSchema) = getOrInferFileFormatSchema(format, fileStatusCache)
checkAndGlobPathIfNecessary(checkEmptyGlobPath = true, checkFilesExist = checkFilesExist)
now we may glob the path twice?
Yes. Originally it globbed the path twice too. I don't have a good solution to avoid this.
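One possible way to avoid the double glob (a hypothetical sketch, not something this PR does) would be to memoize the expensive listing so that both call sites share a single result:

```scala
// Hypothetical sketch: memoize an expensive glob so that two call sites
// (schema inference and file-index creation) trigger it only once.
// `GlobOnce` and `doGlob` are invented names for illustration.
class GlobOnce(pattern: String, doGlob: String => Seq[String]) {
  private var calls = 0
  // lazy val: the glob runs at most once, on first access, and is cached.
  lazy val globbedPaths: Seq[String] = {
    calls += 1
    doGlob(pattern)
  }
  def globCount: Int = calls
}
```

With this shape, schema inference and file-index creation can both read `globbedPaths` while the underlying listing happens only once; whether this fits the real control flow in DataSource is an open question, which is presumably why no clean solution was found here.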
Test build #89244 has finished for PR 21004 at commit
Test build #89245 has finished for PR 21004 at commit
  }.toArray
  new InMemoryFileIndex(sparkSession, globbedPaths, options, None, fileStatusCache)
}
optionalFileIndex: Option[FileIndex] = None): (StructType, StructType) = {
existingFileIndex
val index = fileIndex match {
  case i: InMemoryFileIndex => i
  case _ => tempFileIndex
}
why?
Test build #89259 has finished for PR 21004 at commit
Test build #89272 has finished for PR 21004 at commit
Test build #89273 has finished for PR 21004 at commit
retest this please.
Test build #89277 has finished for PR 21004 at commit
Test build #89288 has finished for PR 21004 at commit
retest this please.
// The operations below are expensive therefore try not to do them if we don't need to, e.g.,
// in streaming mode, we have already inferred and registered partition columns, we will
// never have to materialize the lazy val below
private lazy val tempFileIndex = {
it's only used once, no need to be a lazy val, we can just inline it.
I moved it here on purpose, so that it may avoid being created twice in the future.
I am OK with inlining it.
let's just inline it. People can still create a new index in the future; technically this can't prevent them from doing that.
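The trade-off behind this exchange can be illustrated in plain Scala (hypothetical names, not Spark code): a lazy val only pays off when a value may be referenced more than once (or never); for a value referenced exactly once, inlining the expression is equivalent.

```scala
// Hypothetical illustration: count how many times an expensive build runs.
object LazyValSketch {
  var builds = 0
  def buildIndex(): String = { builds += 1; "index" }

  // A lazy val defers the build until first access and caches the result
  // across repeated accesses...
  lazy val cached: String = buildIndex()

  // ...but for a value used exactly once, a plain inline call costs the same.
  def inlined(): String = buildIndex()
}
```

Accessing `cached` any number of times runs `buildIndex()` once, while each `inlined()` call runs it again; with exactly one use site, the two are indistinguishable, which is the reviewer's point.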
checkAndGlobPathIfNecessary(checkEmptyGlobPath = true, checkFilesExist = checkFilesExist)
val useCatalogFileIndex = sparkSession.sqlContext.conf.manageFilesourcePartitions &&
  catalogTable.isDefined && catalogTable.get.tracksPartitionsInCatalog &&
  catalogTable.get.partitionSchema.nonEmpty
use `partitionColumnNames` over `partitionSchema`, since `partitionColumnNames` is a val and `partitionSchema` is a def
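The distinction the reviewer is drawing can be shown with a small plain-Scala sketch (hypothetical class, not Spark's CatalogTable): a val is evaluated once at construction, while a def re-evaluates its body on every call.

```scala
// Hypothetical sketch of the val-vs-def distinction behind the review comment.
class TableSketch(names: Seq[String]) {
  var schemaBuilds = 0
  // val: computed once when the instance is constructed.
  val partitionColumnNames: Seq[String] = names
  // def: the body runs again on every call.
  def partitionSchema: Seq[String] = { schemaBuilds += 1; names }
}
```

Reading `partitionColumnNames` repeatedly never re-runs anything, while each call to `partitionSchema` does, so in a hot path the val is the cheaper member to reference.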
@@ -552,6 +523,40 @@ case class DataSource(
      sys.error(s"${providingClass.getCanonicalName} does not allow create table as select.")
    }
  }

  /** Returns an [[InMemoryFileIndex]] that can be used to get partition schema and file list. */
  private def createInMemoryFileIndex(globbedPaths: Seq[Path]): InMemoryFileIndex = {
this can be def createInMemoryFileIndex(checkEmptyGlobPath: Boolean)
and we can merge checkAndGlobPathIfNecessary and createInMemoryFileIndex
No, we can't. In some cases we need to check the globbed paths but don't need to create an InMemoryFileIndex.
Test build #89306 has finished for PR 21004 at commit
Force-pushed from 60d5b6b to 12ac191.
Test build #89315 has finished for PR 21004 at commit
retest this please.
Test build #89319 has finished for PR 21004 at commit
thanks, merging to master!
(let's avoid describing the PR with a title that just says "improvement" next time)
userPartitionSchema match {
val inferredPartitionSpec = PartitioningUtils.parsePartitions(
  leafDirs,
  typeInference = sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled,
this is causing a behavior change in Spark 2.4.0 reported in SPARK-26188. Why did we need this change?
Before this patch, there was a subtle difference between the cases with and without a user-provided partition schema:
- with a user-provided partition schema, we should not infer data types; we should infer the values as strings and cast them to the user-provided types
- without a user-provided partition schema, we should infer the data types (controlled by a config)

So it was wrong to unify these 2 code paths. @gengliangwang can you change it back?
@mgaido91 Thanks for the investigation!
I will fix it and add a test case.
Actually the investigation was done by the reporter of SPARK-26188; I did nothing... Thanks for doing that @gengliangwang, and thanks for your comment @cloud-fan.
What changes were proposed in this pull request?
Currently PartitioningAwareFileIndex accepts an optional parameter userPartitionSchema. If provided, it will combine the inferred partition schema with the parameter. However, to get userPartitionSchema, we need to combine the inferred partition schema with userSpecifiedSchema, and only after that can a final version of PartitioningAwareFileIndex be created.

This can be improved by passing userSpecifiedSchema to PartitioningAwareFileIndex directly. With the improvement, we can reduce redundant code and avoid parsing the file partitions twice.
How was this patch tested?
Unit test