[SPARK-23203][SQL] make DataSourceV2Relation immutable #20448

Closed
wants to merge 1 commit into from

Conversation

cloud-fan
Contributor

cloud-fan commented Jan 31, 2018

What changes were proposed in this pull request?

This is inspired by #20387, but focuses only on making the plan immutable.

The idea is simple: instead of keeping the mutable DataSourceReader in the plan, we keep the DataSourceV2 instance and create the reader when needed. The pushdown information is stored in the plan itself, instead of relying on the mutable reader.

This also lets us remove 2 unnecessary APIs from SupportsPushDownCatalystFilters and SupportsPushDownFilters.
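
A rough sketch of the idea (assumptions: the field set mirrors the options/userSpecifiedSchema/filters parameters quoted later in this review, and newReader is an illustrative helper, not the PR's exact code):

import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}
import org.apache.spark.sql.catalyst.plans.logical.LeafNode
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema}
import org.apache.spark.sql.sources.v2.reader.DataSourceReader
import org.apache.spark.sql.types.StructType

// Sketch: the plan keeps the stateless DataSourceV2 instance and the pushdown
// results; a mutated reader never lives inside the plan tree.
case class DataSourceV2Relation(
    output: Seq[AttributeReference],
    source: DataSourceV2,
    options: DataSourceOptions,
    userSpecifiedSchema: Option[StructType],
    filters: Set[Expression]) extends LeafNode {

  // Create a fresh reader on demand; the pushdown state recorded in `filters`
  // would be re-applied to it here (filter translation omitted for brevity).
  def newReader(): DataSourceReader = source match {
    case s: ReadSupportWithSchema if userSpecifiedSchema.isDefined =>
      s.createReader(userSpecifiedSchema.get, options)
    case s: ReadSupport =>
      s.createReader(options)
    case _ => throw new IllegalStateException()
  }
}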

Since this PR adds a lot of new parameters to DataSourceV2Relation, the explain result of this plan became a little messy. I cleaned it up a bit; the explain output now looks like:

== Parsed Logical Plan ==
Relation SimpleDataSourceV2[i#0, j#1]

== Analyzed Logical Plan ==
i: int, j: int
Relation SimpleDataSourceV2[i#0, j#1]

== Optimized Logical Plan ==
Relation SimpleDataSourceV2[i#0, j#1]

== Physical Plan ==
*(1) Scan SimpleDataSourceV2[i#0, j#1]

== Parsed Logical Plan ==
'Filter ('i > 6)
+- AnalysisBarrier
      +- Project [j#78]
         +- Relation JavaAdvancedDataSourceV2[i#77, j#78] ()

== Analyzed Logical Plan ==
j: int
Project [j#78]
+- Filter (i#77 > 6)
   +- Project [j#78, i#77]
      +- Relation JavaAdvancedDataSourceV2[i#77, j#78] ()

== Optimized Logical Plan ==
Relation JavaAdvancedDataSourceV2[j#78] (PushedFilter: [isnotnull(i#77), (i#77 > 6)])

== Physical Plan ==
*(1) Scan JavaAdvancedDataSourceV2[j#78] (PushedFilter: [isnotnull(i#77), (i#77 > 6)])

How was this patch tested?

I improved the tests in DataSourceV2Suite to make sure this change doesn't break column pruning and filter pushdown.

@cloud-fan
Contributor Author

cloud-fan commented Jan 31, 2018

cc @rdblue @tdas @gatorsmile @ericl

  case _ => false
}

override def hashCode(): Int = {
  metadata.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * a + b)
}

lazy val output: Seq[Attribute] = reader.readSchema().map(_.name).map { name =>
Contributor Author

We don't need to do this anymore. Now that the plan is immutable, we have to create a new plan when applying pushdown optimizations, and we can update the output at that time.
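
For illustration, a hedged sketch of that flow (relation, pushedFilters, and newReader are illustrative names from the sketch in the description, not the PR's exact code):

// Push filters and prune columns against a throwaway reader, then return a
// new immutable relation whose output reflects the reader's pruned schema.
val reader = relation.newReader()
// ... apply filter pushdown and column pruning to `reader` here ...
val prunedOutput = reader.readSchema().map { field =>
  relation.output.find(_.name == field.name).get
}
relation.copy(output = prunedOutput, filters = pushedFilters)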

@SparkQA

SparkQA commented Jan 31, 2018

Test build #86861 has finished for PR 20448 at commit 7441334.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 31, 2018

Test build #86862 has finished for PR 20448 at commit 0665282.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.


/**
 * A base class for data source reader holder with customized equals/hashCode methods.
 * A base class for data source v2 related query plan. It defines the equals/hashCode methods
 * according to some common information.
Member

We might need to emphasize this is for both physical and logical plans.

 * @param options The options specified for this scan, used to create the `DataSourceReader`.
 * @param userSpecifiedSchema The user specified schema, used to create the `DataSourceReader`.
 * @param filters The predicates which are pushed and handled by this data source.
 * @param existingReader An mutable reader carrying some temporary stats during optimization and
Member

An -> A

    options: DataSourceOptions,
    userSpecifiedSchema: Option[StructType],
    filters: Set[Expression],
    existingReader: Option[DataSourceReader]) extends LeafNode with DataSourceV2QueryPlan {
Member

Why doesn't this plan extend MultiInstanceRelation?

Member

Could you add a test for self join? Just to ensure it still works.
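
For example, a minimal self-join test could look like this (a sketch only; SimpleDataSourceV2 is the source shown in the explain output in the description, and expected rows are omitted since they depend on its data):

test("self join") {
  import org.apache.spark.sql.functions.col
  val df = spark.read.format(classOf[SimpleDataSourceV2].getName).load()
  // Without MultiInstanceRelation, both sides of the join share the same
  // attribute IDs, so the join condition may fail to resolve unambiguously.
  val joined = df.as("l").join(df.as("r"), col("l.i") === col("r.i"))
  joined.collect()
}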

Contributor Author

Good catch! Yes, this is a bug, but to respect the rule about solving different issues in different PRs, I'd like to fix it in a new PR.

@SparkQA

SparkQA commented Jan 31, 2018

Test build #86863 has finished for PR 20448 at commit d96a48f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

    case _ => throw new IllegalStateException()
  }
}

Member

Do we need to override doCanonicalize?

Member

What is the output of this node in Explain?

Member

What is the behavior we expect when users call REFRESH TABLE?

Another potential issue is storing the statistics in the external catalog. Do we still have the issues previously discussed in #14712?

Contributor Author

data source v2 doesn't support tables yet, so we don't have this problem now.

@SparkQA

SparkQA commented Jan 31, 2018

Test build #86865 has finished for PR 20448 at commit 11220db.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 31, 2018

Test build #86867 has finished for PR 20448 at commit 11220db.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

class StreamingDataSourceV2Relation(
    fullOutput: Seq[AttributeReference],
    reader: DataSourceReader) extends DataSourceV2Relation(fullOutput, reader) {
case class StreamingDataSourceV2Relation(
Contributor Author

Similar to LogicalRelation, I think we can simply add an isStream parameter to DataSourceV2Relation, roughly as sketched below. This can be addressed in a follow-up PR.
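
A hedged sketch of that follow-up shape, mirroring LogicalRelation's streaming flag (imports and field names as in the earlier sketch; this is an assumption about the follow-up, not its actual code):

// One relation class for both batch and streaming, distinguished by a flag.
case class DataSourceV2Relation(
    output: Seq[AttributeReference],
    source: DataSourceV2,
    options: DataSourceOptions,
    override val isStreaming: Boolean = false) extends LeafNode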

@SparkQA

SparkQA commented Jan 31, 2018

Test build #86876 has finished for PR 20448 at commit 11220db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Jan 31, 2018

@cloud-fan, please close this PR. There is already a pull request for these changes, #20387, and ongoing discussion there.

If you want the proposed implementation to change, please ask for changes in a review.
