[SPARK-16596] [SQL] Refactor DataSourceScanExec to do partition discovery at execution instead of planning time #14241


Closed · wants to merge 23 commits

Conversation

@ericl (Contributor) commented Jul 17, 2016

What changes were proposed in this pull request?

Partition discovery is rather expensive, so we should do it at execution time instead of during physical planning. Right now there is not much benefit, since ListingFileCatalog still scans all partitions at planning time anyway, but this can be optimized in the future. Also, information useful for partition pruning may become available at execution time that is not known during planning.

This PR moves much of the file scan logic from planning to execution time. All file scan operations are now handled by FileSourceScanExec, which covers both batched and non-batched file scans. This introduces some duplication with RowDataSourceScanExec, but it is probably worth it so that FileSourceScanExec does not need to depend on an input RDD.
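
To make the mechanism concrete, here is a minimal sketch, assuming the Spark 2.0-era internals (a HadoopFsRelation exposing a FileCatalog with listFiles); it illustrates the deferral idea only and is not the PR's exact code:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
    import org.apache.spark.sql.execution.LeafExecNode
    import org.apache.spark.sql.execution.datasources.HadoopFsRelation

    case class FileSourceScanExec(
        relation: HadoopFsRelation,
        output: Seq[Attribute],
        partitionFilters: Seq[Expression]) extends LeafExecNode {

      // Lazy and transient: listFiles() is not invoked while the planner
      // builds this node; it runs on first access, inside doExecute().
      @transient private lazy val selectedPartitions =
        relation.location.listFiles(partitionFilters)

      protected override def doExecute(): RDD[InternalRow] = {
        // Partition discovery happens here, at execution time. Building
        // the actual file-scan RDD over the discovered partitions is
        // elided in this sketch.
        val partitions = selectedPartitions
        ??? // construct and return the scan RDD over `partitions`
      }
    }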

TODO: In another PR, move DataSourceScanExec to its own file.

How was this patch tested?

Existing tests. (It might be worth adding a test that catalog.listFiles() is deferred until execution, but that can wait until there is an actual benefit to doing so.)

@rxin (Contributor) commented Jul 17, 2016

This doesn't actually give us a way to add additional filter constraints in the physical operator, does it?

@SparkQA commented Jul 17, 2016

Test build #62438 has finished for PR 14241 at commit 0d4642a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ericl (Contributor, author) commented Jul 18, 2016

You should be able to add additional filter constraints in buildScan(), e.g. in FileDataSourceStrategy. I don't think it matters much whether that code lives in buildScan() or in the operator itself.

@rxin (Contributor) commented Jul 19, 2016

@ericl I was talking with @marmbrus -- it'd be better to create an API in the physical scan operator that accepts a list of filters, and then do pruning there. That is to say, we also want to move all the pruning code from physical planning into the physical operators.
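
A sketch of what such an operator-side API could look like; the trait and method names here are hypothetical, not from this PR:

    import org.apache.spark.sql.catalyst.expressions.Expression

    // Hypothetical mixin for a physical scan node. The point is that
    // pruning state lives in the operator and can be tightened after
    // planning.
    trait AcceptsScanFilters {
      // Mutable on purpose (see the comment thread below): later passes
      // can push additional filters in before execution starts.
      private var pushedFilters: Seq[Expression] = Nil

      def pushFilters(filters: Seq[Expression]): Unit = {
        pushedFilters ++= filters
      }

      // Called from the scan's doExecute(): combine planning-time filters
      // with any that arrived later, then prune partitions against them.
      protected def effectiveFilters(planned: Seq[Expression]): Seq[Expression] =
        planned ++ pushedFilters
    }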

// Metadata keys
val INPUT_PATHS = "InputPaths"
val PUSHED_FILTERS = "PushedFilters"
private def genCodeColumnVector(ctx: CodegenContext, columnVar: String, ordinal: String,
Review comment from @ericl (author):

All these functions below were moved verbatim.

@SparkQA commented Jul 20, 2016

Test build #62619 has finished for PR 14241 at commit b45e253.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 20, 2016

Test build #62623 has finished for PR 14241 at commit ebf2102.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 20, 2016

Test build #62624 has finished for PR 14241 at commit bbf89a1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 20, 2016

Test build #62625 has finished for PR 14241 at commit 358eb9f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 20, 2016

Test build #62627 has finished for PR 14241 at commit 2d78051.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 21, 2016

Test build #62691 has finished for PR 14241 at commit a3d2c69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

override val outputPartitioning: Partitioning,
override val metadata: Map[String, String],
outputSchema: StructType,
partitionFilters: Seq[Expression],
Review comment (Contributor):

can you add classdoc documenting what partitionFilters and dataFilters do? It's a little bit confusing because they are both filters, but have different types.

Follow-up (Contributor):

BTW in order to make this more dynamic, we'd need to make these mutable.

@ericl (author) replied:

Done
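
For reference, the requested classdoc could look roughly like the sketch below. As the review notes, both parameters are "filters" but have different types and act at different stages; the exact wording here is illustrative, not the merged text:

    /**
     * Physical plan node for scanning data from HadoopFsRelations.
     *
     * @param partitionFilters Predicates over the partition columns,
     *        evaluated against the catalog's partition values to prune
     *        whole directories before any file is opened.
     * @param dataFilters Source-level filters pushed down into the file
     *        format reader and evaluated inside the files that survive
     *        pruning.
     */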

@SparkQA commented Jul 25, 2016

Test build #62845 has finished for PR 14241 at commit 780fec5.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 26, 2016

Test build #62847 has finished for PR 14241 at commit ddb202e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -358,11 +358,11 @@ class BucketedReadSuite extends QueryTest with SQLTestUtils with TestHiveSinglet
df1.write.parquet(tableDir.getAbsolutePath)

val agged = spark.table("bucketed_table").groupBy("i").count()
-    val error = intercept[RuntimeException] {
+    val error = intercept[Exception] {
Review comment (Contributor):

Nit: can't we catch the proper exception?

@ericl (author) replied:

It's a nested exception, which is quite hard to match. The following assert checks for the right error message, which is the important bit, I think.
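
The resulting pattern, sketched from the diff above (the assertion message shown is illustrative):

    // Catch the broad Exception: the real failure arrives wrapped in
    // framework exceptions, so matching the precise type is brittle.
    val error = intercept[Exception] {
      agged.count()
    }
    // Assert on the message instead; this is the part that actually
    // verifies the right failure occurred.
    assert(error.toString.contains("Invalid bucket file"))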

@hvanhovell (Contributor) commented:
This looks pretty good. I have left a few comments.

-/** Physical plan node for scanning data from a batched relation. */
-private[sql] case class BatchedDataSourceScanExec(
+/**
+ * Physical plan node for scanning data from files.
Review comment (Contributor):

...from HadoopFsRelations?

@ericl (author) replied:

Done

@SparkQA commented Jul 28, 2016

Test build #62979 has finished for PR 14241 at commit 18f5543.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -275,62 +272,161 @@ private[sql] case class RowDataSourceScanExec(
|}
""".stripMargin
}

// Ignore rdd when checking results
override def sameResult(plan: SparkPlan): Boolean = plan match {
Review comment (Contributor):

let's make sure we fix this one
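
The concern: the node captures an RDD, so two otherwise-identical scans would never compare equal unless the RDD is ignored. A plausible shape for the fix, sketched here rather than quoted from the PR:

    // Compare only the logical ingredients that determine the result,
    // never the materialized input RDD.
    override def sameResult(plan: SparkPlan): Boolean = plan match {
      case other: RowDataSourceScanExec =>
        relation == other.relation && metadata == other.metadata
      case _ => false
    }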

@davies (Contributor) commented Aug 2, 2016

LGTM

@SparkQA commented Aug 2, 2016

Test build #63139 has finished for PR 14241 at commit a76b432.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Serializable
    • case class MonotonicallyIncreasingID() extends LeafExpression with Nondeterministic
    • case class SparkPartitionID() extends LeafExpression with Nondeterministic
    • case class AggregateExpression(
    • case class Least(children: Seq[Expression]) extends Expression
    • case class Greatest(children: Seq[Expression]) extends Expression
    • case class CurrentDatabase() extends LeafExpression with Unevaluable
    • class GenericInternalRow(val values: Array[Any]) extends BaseGenericInternalRow
    • class AbstractScalaRowIterator[T] extends Iterator[T]
    • implicit class SchemaAttribute(f: StructField)

@SparkQA commented Aug 3, 2016

Test build #63141 has finished for PR 14241 at commit 704511e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies (Contributor) commented Aug 3, 2016

@hvanhovell Have you finished your round of review?

@hvanhovell (Contributor) commented:
LGTM

@davies (Contributor) commented Aug 3, 2016

Merging this into master, thanks!

@asfgit closed this in e6f226c on Aug 3, 2016