
[SPARK-27576][SQL] table capability to skip the output column resolution #24469


Closed
cloud-fan wants to merge 3 commits into master from cloud-fan/schema-check

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

Currently we have an analyzer rule that resolves the output columns of data source v2 write plans, to make sure the schema of the input query is compatible with the table.

However, not all data sources need this check. For example, the `NoopDataSource` doesn't care about the schema of the input query at all.

This PR introduces a new table capability: `ACCEPT_ANY_SCHEMA`. If a table reports this capability, we skip resolving output columns for it during write.

Note that we already skip resolving output columns for `NoopDataSource` because it implements `SupportsSaveMode`. However, `SupportsSaveMode` is a hack and will be removed soon.
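
For reference, a minimal sketch of a table reporting the new capability. The class is hypothetical, and the package names reflect the DSv2 API at the time of this PR (they moved to `org.apache.spark.sql.connector.catalog` in later versions):

```scala
import java.util

import org.apache.spark.sql.sources.v2.{Table, TableCapability}
import org.apache.spark.sql.types.StructType

// Hypothetical sink table that accepts any input schema on write.
class AnySchemaSinkTable extends Table {
  override def name(): String = "any_schema_sink"

  // The schema is irrelevant for this sink, so report an empty one.
  override def schema(): StructType = new StructType()

  // Reporting ACCEPT_ANY_SCHEMA tells the analyzer to skip
  // output column resolution when writing to this table.
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.ACCEPT_ANY_SCHEMA)
}
```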

How was this patch tested?

New test cases.


@SparkQA

SparkQA commented Apr 26, 2019

Test build #104937 has finished for PR 24469 at commit 65cefd3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Apr 30, 2019

Test build #105020 has finished for PR 24469 at commit 65cefd3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -21,4 +21,7 @@ import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

trait NamedRelation extends LogicalPlan {
  def name: String

  // When true, the schema of input data must match the schema of this relation, during write.
  def requireSchemaMatch: Boolean = true
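
For illustration, a minimal sketch of a relation opting out of the check via this flag (the class name and mixins are hypothetical):

```scala
import org.apache.spark.sql.catalyst.analysis.NamedRelation
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LeafNode

// Hypothetical relation that accepts any input schema during write.
case class AnySchemaRelation(output: Seq[Attribute])
    extends LeafNode with NamedRelation {
  override def name: String = "any-schema-relation"

  // Opt out of the analyzer's output column check.
  override def requireSchemaMatch: Boolean = false
}
```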
Contributor

Separate from the discussion about the method name, I don't think it makes sense for this to be in NamedRelation.

NamedRelation exists to help create better error messages from the generic rules that apply across any relation. This addition doesn't fit with that purpose. I think this was added to NamedRelation because the v2 relation class is not available to catalyst, but this rule is in catalyst. If that's the case, then this depends on moving v2 into catalyst, and I think it makes sense to do that first.

Contributor Author

Moving DS v2 to the catalyst module should be done soon; we can wait for it. BTW, do we still need NamedRelation after that? It looks to me like NamedRelation is mostly for testing. Currently only the v2 relation class extends it.

Contributor

We probably don't need it any more.

Contributor

@cloud-fan, we may want to keep NamedRelation to be able to use UnresolvedRelation in v2 plans. If we updated AppendData (for example) to use DataSourceV2Relation, then we would not be able to create it with an UnresolvedRelation and delegate resolving the table to the analyzer.

val query = TestRelation(StructType(Seq(
  StructField("s", StringType))).toAttributes)

val plan1 = byName(table, query)
Contributor

The rest of the test cases in this class test either by name or by position, and I would like this one to keep using that convention. When test cases fail, it is easier to see what happened because each case is specific to one path. It also avoids non-specific names like plan1 and plan2.

@cloud-fan
Contributor Author

Hi @rdblue, I think we need to make an exception now.

#24233 removes SupportsSaveMode, but it has a hack to bypass the schema check for tables that report an empty schema.

To get rid of that hack, this PR adds a new table capability to bypass the schema check. However, this PR has a hack in NamedRelation, because the v2 relation is not available in catalyst.

To remove the hack in this PR, #24416 was created to move data source v2 to catalyst. However, #24416 is blocked because SupportsSaveMode refers to SaveMode, which is in sql/core.

I think we need to merge either #24233 or this PR first, even though it has a hack. Do you have a better idea?

@rdblue
Contributor

rdblue commented May 15, 2019

@cloud-fan, I think the cleanest way to fix it is to commit this PR with the new method in NamedRelation. That should unblock the other PRs and we can remove NamedRelation in #24416.

Does that work for you?

  checkAnalysis(parsedPlan, parsedPlan)
}

withClue("byPosition") {
Contributor Author

@rdblue, if two tests are very similar, it's recommended to use withClue and merge them into one. For example, when the byPosition case fails, we will see:

bypass output column resolution *** FAILED *** (36 milliseconds)
[info]   byPosition (DataSourceV2AnalysisSuite.scala:473)
[info]   org.scalatest.exceptions.TestFailedException:
...
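
A minimal sketch of the merged-test pattern using ScalaTest's withClue (helpers such as byName, byPosition, checkAnalysis, table, and query are assumed to come from the surrounding suite):

```scala
test("bypass output column resolution") {
  // The clue string is prepended to any failure message,
  // identifying which sub-case failed.
  withClue("byName") {
    val parsedPlan = byName(table, query)
    checkAnalysis(parsedPlan, parsedPlan)
  }
  withClue("byPosition") {
    val parsedPlan = byPosition(table, query)
    checkAnalysis(parsedPlan, parsedPlan)
  }
}
```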

Contributor Author

I'll send a follow-up PR later to update the entire test suite to use withClue.

Contributor

Okay, sounds good.

@SparkQA
Copy link

SparkQA commented May 16, 2019

Test build #105446 has finished for PR 24469 at commit 31ce720.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 16, 2019

Test build #105447 has finished for PR 24469 at commit 15a02f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented May 16, 2019

+1 from me to unblock the circular dependency.

@dongjoon-hyun, does this look okay to you?

@dongjoon-hyun
Member

Thank you for asking. Yes. This looks good to me, too. My comments are also addressed. :)

@dongjoon-hyun left a comment (Member)

+1, LGTM. Thank you, @cloud-fan and @rdblue .
Merged to master to unblock DSv2 dev.

mccheah pushed a commit to palantir/spark that referenced this pull request May 24, 2019

Closes apache#24469 from cloud-fan/schema-check.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

mccheah pushed a commit to palantir/spark that referenced this pull request Jun 6, 2019

Closes apache#24469 from cloud-fan/schema-check.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>