
[SPARK-18661] [SQL] Creating a partitioned datasource table should not scan all files for table #16090


Closed
wants to merge 6 commits

Conversation

ericl (Contributor) commented Nov 30, 2016

What changes were proposed in this pull request?

Even though in 2.1 creating a partitioned datasource table will not populate the partition data by default (until the user issues MSCK REPAIR TABLE), it seems we still scan the filesystem for no good reason.

We should avoid doing this when the user specifies a schema.
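For illustration, a minimal sketch of the scenario this targets; the table name, columns, and path below are made up and are not part of the patch:

    import org.apache.spark.sql.SparkSession

    object CreatePartitionedTableExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("spark-18661-example").getOrCreate()

        // With an explicit schema and partition columns, CREATE TABLE should only
        // write metastore metadata; nothing under the path should be listed yet.
        spark.sql("""
          CREATE TABLE logs (id BIGINT, event STRING, day STRING)
          USING parquet
          OPTIONS (path '/data/logs')
          PARTITIONED BY (day)
        """)

        // Partition discovery remains an explicit, separate step.
        spark.sql("MSCK REPAIR TABLE logs")
      }
    }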

How was this patch tested?

Perf stat tests.

@@ -58,13 +58,20 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boolean)
// Create the relation to validate the arguments before writing the metadata to the metastore,
// and infer the table schema and partition if users didn't specify schema in CREATE TABLE.
val pathOption = table.storage.locationUri.map("path" -> _)
val uncreatedTable = table.copy(

Contributor Author:

@cloud-fan is there a better way to do this?

Contributor:

Can we do this before we pass the CatalogTable, e.g. in the parser and DataFrameWriter?

Contributor Author:

Hm, that seems more brittle since you'd have to duplicate the logic. I added a comment describing why we need to do this here.

ericl (Contributor Author) commented Nov 30, 2016

@rxin

SparkQA commented Nov 30, 2016

Test build #69435 has finished for PR 16090 at commit 51d2a41.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 1, 2016

Test build #69437 has finished for PR 16090 at commit 275f6b9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -565,7 +571,8 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
val table = catalog.getTableMetadata(TableIdentifier("tbl"))
assert(table.tableType == CatalogTableType.MANAGED)
assert(table.provider == Some("parquet"))
- assert(table.schema == new StructType().add("a", IntegerType).add("b", IntegerType))
+ // a is ordered last since it is a user-specified partitioning column
+ assert(table.schema == new StructType().add("b", IntegerType).add("a", IntegerType))

Contributor Author:

@yhuai this is the minor behavior change for CREATE TABLE in 2.1 that we discussed.

SparkQA commented Dec 2, 2016

Test build #69514 has finished for PR 16090 at commit 89b0a64.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -58,13 +58,21 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boolean)
// Create the relation to validate the arguments before writing the metadata to the metastore,
// and infer the table schema and partition if users didn't specify schema in CREATE TABLE.
val pathOption = table.storage.locationUri.map("path" -> _)
// Fill in some default table options from the session conf
val uncreatedTable = table.copy(

Contributor:

how about tableWithDefaultOptions?

Contributor Author:

Done

identifier = table.identifier.copy(
database = Some(
table.identifier.database.getOrElse(sessionState.catalog.getCurrentDatabase))),
tracksPartitionsInCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions)

Contributor:

Logically we don't know the value of tracksPartitionsInCatalog here, as the partition columns are not inferred yet.

Contributor Author:

I think this is true for all new tables. If the table is unpartitioned, the flag is harmless.

  className = table.provider.get,
  bucketSpec = table.bucketSpec,
- options = table.storage.properties ++ pathOption).resolveRelation()
+ options = table.storage.properties ++ pathOption,
+ catalogTable = Some(uncreatedTable)).resolveRelation()

Contributor:

Why do we need to pass in the catalogTable here?

Contributor Author:

Otherwise, we will construct an InMemoryFileIndex which scans the filesystem.

@@ -312,7 +312,13 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
pathToNonPartitionedTable,
userSpecifiedSchema = Option("num int, str string"),
userSpecifiedPartitionCols = partitionCols,
- expectedSchema = new StructType().add("num", IntegerType).add("str", StringType),
+ expectedSchema = if (partitionCols.isDefined) {

Contributor:

Shall we just change the test to use str as the partition column?

Contributor Author:

I think that would be testing something slightly different.

val dataSource: BaseRelation =
DataSource(
sparkSession = sparkSession,
userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema),
partitionColumns = table.partitionColumnNames,

Contributor:

It looks to me that this line and https://github.com/apache/spark/pull/16090/files#diff-7a6cb188d2ae31eb3347b5629a679cecR135 are the key changes. Did I miss something?

Contributor Author:

You also need to pass catalogTable in, so that on line 390 of DataSource we create a CatalogFileIndex instead of an InMemoryFileIndex.

cloud-fan (Contributor), Dec 2, 2016:

I think it's fine to create an InMemoryFileIndex for this case, as we call DataSource.resolveRelation here just to infer the schema and partition columns.

Contributor Author:

You don't want to do that, though. resolveRelation also does not always scan the filesystem if you pass in a user-defined schema.

Contributor:

I mean, if passing the catalogTable or not doesn't affect correctness (or performance), we can remove tableWithDefaultOptions and make the code simpler, right?

Contributor Author:

You do need to pass it in though.

        val fileCatalog = if (sparkSession.sqlContext.conf.manageFilesourcePartitions &&
            catalogTable.isDefined && catalogTable.get.tracksPartitionsInCatalog) {
          new CatalogFileIndex(
            sparkSession,
            catalogTable.get,
            catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(0L))
        } else {
          new InMemoryFileIndex(sparkSession, globbedPaths, options, Some(partitionSchema))
        }

Otherwise, this code will perform a full filesystem scan, independent of the other change to prevent getOrInferFileFormatSchema from performing a scan as well.

Contributor:

Then can we just make the InMemoryFileIndex scan the files lazily? If we only need to infer the schema and partition columns, it should not do the scan.

Contributor Author:

That's a pretty big change, considering how many classes depend on the eager behavior of InMemoryFileIndex.

SparkQA commented Dec 2, 2016

Test build #69534 has finished for PR 16090 at commit 2940d55.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 2, 2016

Test build #69536 has finished for PR 16090 at commit b405635.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor):

My main concern is that, in CreateDataSourceTableCommand, we call DataSource.resolveRelation to infer the schema and partition columns. At that time, the table is not created yet, so logically we should not pass a CatalogTable to DataSource and create a CatalogFileIndex inside it, which looks like a hack.

It seems to me that it's more logical to tweak InMemoryFileIndex to scan the files lazily, to avoid an unnecessary file scan for cases like this one.
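
To make the lazy-scan idea concrete, here is a tiny self-contained sketch of the eager vs. lazy difference in plain Scala; the types are made up and are not Spark's actual FileIndex API:

    import java.io.File

    // Eager: the listing happens in the constructor, like InMemoryFileIndex today.
    class EagerIndexSketch(root: File) {
      private val files: Seq[String] =
        Option(root.listFiles()).getOrElse(Array.empty[File]).map(_.getPath).toSeq // scan happens here
      def listFiles(): Seq[String] = files
    }

    // Lazy: callers that only need schema/partition inference never trigger the listing.
    class LazyIndexSketch(root: File) {
      private lazy val files: Seq[String] =
        Option(root.listFiles()).getOrElse(Array.empty[File]).map(_.getPath).toSeq // deferred until first use
      def listFiles(): Seq[String] = files
    }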

ericl (Contributor Author) commented Dec 2, 2016

I looked at avoiding the creation of a CatalogFileIndex, but the way table resolution works right now, the only way is to create some sort of dummy file index class that does not support scans. It's not clear to me that this is any better than just creating a CatalogFileIndex, even if the table is not yet ready.

We can probably clean this up so it is not necessary to create a file index for table creation, but that would be a pretty big change to land for 2.1.
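
For reference, the "dummy file index" alternative mentioned above would look roughly like this; purely illustrative, not Spark's FileIndex trait:

    // Illustrative only: a placeholder index that can stand in during table creation
    // but fails loudly if anything actually tries to scan through it.
    class NoScanIndexSketch {
      def listFiles(): Seq[String] =
        throw new UnsupportedOperationException(
          "file listing is not supported while the table is being created")
    }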

ericl (Contributor Author) commented Dec 3, 2016

Seems like we also create an InMemoryFileIndex twice for non-catalog tables. Let me try to fix that too.

ericl (Contributor Author) commented Dec 3, 2016

Fixed by adding a private cache to DataSource, which is used to avoid the duplicate file listing with InMemoryFileIndex.
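
Conceptually the cache works like the simplified sketch below; the names are made up and this is not the actual DataSource code:

    import java.io.File

    class DataSourceSketch(paths: Seq[String]) {
      // Computed at most once and shared, so schema inference and relation
      // resolution do not each pay for a separate file listing.
      private lazy val cachedFileList: Seq[String] =
        paths.flatMap { p =>
          Option(new File(p).listFiles()).getOrElse(Array.empty[File]).map(_.getPath)
        }

      def inferSchema(): Seq[String] = cachedFileList      // first caller triggers the listing
      def resolveRelation(): Seq[String] = cachedFileList  // later callers reuse it
    }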

ericl (Contributor Author) commented Dec 3, 2016

cc @rxin please merge unless wenchen gets to it first

SparkQA commented Dec 3, 2016

Test build #69600 has finished for PR 16090 at commit 5a250ad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ericl (Contributor Author) commented Dec 3, 2016 via email

cloud-fan (Contributor):

If we are going to hack it, how about this?

val dataSource = DataSource(...)
if (classOf[FileFormat].isAssignableFrom(dataSource.providingClass)) {
  dataSource.getOrInferFileFormatSchema()
} else {
  dataSource.resolveRelation().schema -> new StructType
}

Then we don't need to create a FileIndex or scan files.

ericl (Contributor Author) commented Dec 4, 2016

Not sure I follow. Could you explain more about why that would resolve the issue?

Btw, I reverted this PR to b405635, which passes all tests.

SparkQA commented Dec 4, 2016

Test build #69630 has finished for PR 16090 at commit b405635.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor):

After looking more at the code, I now agree with your approach. One question: it seems we still scan the files when creating an unpartitioned external data source table?

ericl (Contributor Author) commented Dec 4, 2016 via email

asfgit pushed a commit that referenced this pull request Dec 4, 2016
[SPARK-18661][SQL] Creating a partitioned datasource table should not scan all files for table

Author: Eric Liang <ekl@databricks.com>

Closes #16090 from ericl/spark-18661.

(cherry picked from commit d9eb4c7)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

cloud-fan (Contributor):

LGTM, merging to master/2.1!
@ericl please create tickets for the other 2 issues

asfgit closed this in d9eb4c7 on Dec 4, 2016

ericl (Contributor Author) commented Dec 5, 2016

robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
[SPARK-18661][SQL] Creating a partitioned datasource table should not scan all files for table
Closes apache#16090 from ericl/spark-18661.
asfgit pushed a commit that referenced this pull request Jan 13, 2017
…the saved files

### What changes were proposed in this pull request?
`DataFrameWriter`'s [save() API](https://github.com/gatorsmile/spark/blob/5d38f09f47a767a342a0a8219c63efa2943b5d1f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L207) performs an unnecessary full filesystem scan for the saved files. The save() API is the most basic/core API in `DataFrameWriter`, so we should avoid the scan there.

The related PR: #16090

### How was this patch tested?
Updated the existing test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16481 from gatorsmile/saveFileScan.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
[SPARK-18661][SQL] Creating a partitioned datasource table should not scan all files for table
Closes apache#16090 from ericl/spark-18661.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…the saved files
Closes apache#16481 from gatorsmile/saveFileScan.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…the saved files
Closes apache#16481 from gatorsmile/saveFileScan.