
[SPARK-27576][SQL] table capability to skip the output column resolution #24469


Closed
cloud-fan wants to merge 3 commits into master from cloud-fan/schema-check

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

Currently we have an analyzer rule that resolves the output columns of data source v2 write plans, to make sure the schema of the input query is compatible with the table.

However, not all data sources need this check. For example, the `NoopDataSource` doesn't care about the schema of the input query at all.

This PR introduces a new table capability: `ACCEPT_ANY_SCHEMA`. If a table reports this capability, we skip resolving output columns for it during write.

Note that we already skip resolving output columns for `NoopDataSource` because it implements `SupportsSaveMode`. However, `SupportsSaveMode` is a hack and will be removed soon.
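
For reference, a minimal sketch of a table reporting the new capability. The class is hypothetical, and the package names reflect the DSv2 API at the time of this PR (they moved to `org.apache.spark.sql.connector.catalog` in later versions):

```scala
import java.util

import org.apache.spark.sql.sources.v2.{Table, TableCapability}
import org.apache.spark.sql.types.StructType

// Hypothetical sink table that accepts any input schema on write.
class AnySchemaSinkTable extends Table {
  override def name(): String = "any_schema_sink"

  // The schema is irrelevant for this sink, so report an empty one.
  override def schema(): StructType = new StructType()

  // Reporting ACCEPT_ANY_SCHEMA tells the analyzer to skip
  // output column resolution when writing to this table.
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.ACCEPT_ANY_SCHEMA)
}
```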

How was this patch tested?

New test cases.


@SparkQA

SparkQA commented Apr 26, 2019

Test build #104937 has finished for PR 24469 at commit 65cefd3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Apr 30, 2019

Test build #105020 has finished for PR 24469 at commit 65cefd3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -21,4 +21,7 @@ import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

trait NamedRelation extends LogicalPlan {
  def name: String

  // When true, the schema of input data must match the schema of this relation, during write.
  def requireSchemaMatch: Boolean = true
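
For illustration, a minimal sketch of a relation opting out of the check via this flag (the class name and mixins are hypothetical):

```scala
import org.apache.spark.sql.catalyst.analysis.NamedRelation
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LeafNode

// Hypothetical relation that accepts any input schema during write.
case class AnySchemaRelation(output: Seq[Attribute])
    extends LeafNode with NamedRelation {
  override def name: String = "any-schema-relation"

  // Opt out of the analyzer's output column check.
  override def requireSchemaMatch: Boolean = false
}
```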
Contributor

Separate from the discussion about the method name, I don't think it makes sense for this to be in NamedRelation.

NamedRelation exists to help create better error messages from the generic rules that apply across any relation. This addition doesn't fit with that purpose. I think this was added to NamedRelation because the v2 relation class is not available to catalyst, but this rule is in catalyst. If that's the case, then this depends on moving v2 into catalyst, and I think it makes sense to do that first.

Contributor Author

Moving DS v2 to the catalyst module should be done soon; we can wait for it. BTW, do we still need NamedRelation after that? It looks to me like NamedRelation is mostly for testing. Currently only the v2 relation class extends it.

Contributor

We probably don't need it any more.

Contributor

@cloud-fan, we may want to keep NamedRelation to be able to use UnresolvedRelation in v2 plans. If we updated AppendData (for example) to use DataSourceV2Relation, then we would not be able to create it with an UnresolvedRelation and delegate resolving the table to the analyzer.

val query = TestRelation(StructType(Seq(
  StructField("s", StringType))).toAttributes)

val plan1 = byName(table, query)
Contributor

The rest of the test cases in this class test either by name or by position, and I would like this one to keep using that convention. When test cases fail, it is easier to see what happened because each case is specific to one path. It also avoids non-specific names like plan1 and plan2.

@cloud-fan
Contributor Author

Hi @rdblue, I think we need to make an exception now.

#24233 removes SupportsSaveMode, but it has a hack to bypass the schema check for tables that report an empty schema.

To get rid of that hack, this PR adds a new table capability to bypass the schema check. However, this PR has a hack in NamedRelation, because the v2 relation is not available in catalyst.

To remove the hack in this PR, #24416 was created to move data source v2 to catalyst. However, #24416 is blocked because SupportsSaveMode refers to SaveMode, which is in sql/core.

I think we need to merge either #24233 or this PR first, even though it has a hack. Do you have a better idea?

@rdblue
Contributor

rdblue commented May 15, 2019

@cloud-fan, I think the cleanest way to fix it is to commit this PR with the new method in NamedRelation. That should unblock the other PRs and we can remove NamedRelation in #24416.

Does that work for you?

  checkAnalysis(parsedPlan, parsedPlan)
}

withClue("byPosition") {
Contributor Author

@rdblue, if two tests are very similar, it's recommended to use withClue and merge them into one. For example, when the byPosition case fails, we will see:

bypass output column resolution *** FAILED *** (36 milliseconds)
[info]   byPosition (DataSourceV2AnalysisSuite.scala:473)
[info]   org.scalatest.exceptions.TestFailedException:
...
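
A minimal sketch of the merged-test pattern using ScalaTest's withClue (helpers such as byName, byPosition, checkAnalysis, table, and query are assumed to come from the surrounding suite):

```scala
test("bypass output column resolution") {
  // The clue string is prepended to any failure message,
  // identifying which sub-case failed.
  withClue("byName") {
    val parsedPlan = byName(table, query)
    checkAnalysis(parsedPlan, parsedPlan)
  }
  withClue("byPosition") {
    val parsedPlan = byPosition(table, query)
    checkAnalysis(parsedPlan, parsedPlan)
  }
}
```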

Contributor Author

I'll send a follow-up PR later to update the entire test suite to use withClue.

Contributor

Okay, sounds good.

@SparkQA
Copy link

SparkQA commented May 16, 2019

Test build #105446 has finished for PR 24469 at commit 31ce720.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 16, 2019

Test build #105447 has finished for PR 24469 at commit 15a02f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented May 16, 2019

+1 from me to unblock the circular dependency.

@dongjoon-hyun, does this look okay to you?

@dongjoon-hyun
Member

Thank you for asking. Yes. This looks good to me, too. My comments are also addressed. :)

@dongjoon-hyun left a comment (Member)

+1, LGTM. Thank you, @cloud-fan and @rdblue .
Merged to master to unblock DSv2 dev.

mccheah pushed a commit to palantir/spark that referenced this pull request May 24, 2019

Closes apache#24469 from cloud-fan/schema-check.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

mccheah pushed a commit to palantir/spark that referenced this pull request Jun 6, 2019

Closes apache#24469 from cloud-fan/schema-check.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>