[SPARK-27576][SQL] table capability to skip the output column resolution #24469
Conversation
Test build #104937 has finished for PR 24469 at commit
Retest this please.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NamedRelation.scala
...core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
Test build #105020 has finished for PR 24469 at commit
```diff
@@ -21,4 +21,7 @@ import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

 trait NamedRelation extends LogicalPlan {
   def name: String
+
+  // When true, the schema of input data must match the schema of this relation, during write.
+  def requireSchemaMatch: Boolean = true
 }
```
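As a rough illustration of how a write-side rule could consult this flag, here is a self-contained sketch. The relation classes and the `schemaAccepted` helper are simplified stand-ins for illustration, not Spark's actual types:

```scala
// Simplified stand-ins for the types in the diff above. Spark's real
// NamedRelation extends LogicalPlan and exposes output attributes rather
// than a plain schema; both are elided to keep the sketch self-contained.
trait NamedRelation {
  def name: String
  def schema: Seq[String]
  // When true, the schema of input data must match the schema of this relation, during write.
  def requireSchemaMatch: Boolean = true
}

case class StrictRelation(name: String, schema: Seq[String]) extends NamedRelation

case class LenientRelation(name: String, schema: Seq[String]) extends NamedRelation {
  override def requireSchemaMatch: Boolean = false
}

// A write-resolution rule would only compare schemas when the relation asks for it.
def schemaAccepted(table: NamedRelation, querySchema: Seq[String]): Boolean =
  !table.requireSchemaMatch || table.schema == querySchema
```

With this, a lenient relation accepts any input schema while a strict one only accepts an exact match.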
Separate from the discussion about the method name, I don't think it makes sense for this to be in `NamedRelation`.

`NamedRelation` exists to help create better error messages from the generic rules that apply across any relation. This addition doesn't fit with that purpose. I think the reason why this was added to `NamedRelation` is because the v2 relation class is not available to catalyst, but this rule is in catalyst. If that's the case, then this depends on moving v2 into catalyst, and I think it makes sense to do that first.
Moving DS v2 to the catalyst module should be done soon; we can wait for it. BTW, do we still need `NamedRelation` after that? It looks to me that `NamedRelation` is mostly for testing. Currently only the v2 relation class extends it.
We probably don't need it any more.
@cloud-fan, we may want to keep `NamedRelation` to be able to use `UnresolvedRelation` in v2 plans. If we updated `AppendData` (for example) to use `DataSourceV2Relation`, then we would not be able to create it with an `UnresolvedRelation` and delegate resolving the table to the analyzer.
```scala
val query = TestRelation(StructType(Seq(
  StructField("s", StringType))).toAttributes)

val plan1 = byName(table, query)
```
The rest of the test cases in this class test either by name or by position, and I would like for this to keep using that convention. When test cases fail, it is easier to see what happened because the case is specific to one path. It also avoids using names that are not specific, like `plan1` and `plan2`.
Hi @rdblue, I think we need to make an exception now. #24233 removes `SupportsSaveMode`. To get rid of that hack, this PR adds a new table capability to bypass the schema check. However, this PR has a hack in `NamedRelation`. To remove the hack in this PR, #24416 was created to move data source v2 to catalyst. However, #24416 is blocked by this circular dependency. I think we need to merge either #24233 or this PR first, even though it has a hack. Do you have a better idea?
@cloud-fan, I think the cleanest way to fix it is to commit this PR with the new method in `NamedRelation`. Does that work for you?
```scala
  checkAnalysis(parsedPlan, parsedPlan)
}

withClue("byPosition") {
```
@rdblue, if two tests are very similar, it's recommended to use `withClue` and merge the two tests into one. For example, when the byPosition case fails, we will see:

```
bypass output column resolution *** FAILED *** (36 milliseconds)
[info]   byPosition (DataSourceV2AnalysisSuite.scala:473)
[info]   org.scalatest.exceptions.TestFailedException:
...
```
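The effect of the `withClue` pattern can be mirrored with a small standalone helper. This is a hypothetical re-implementation for illustration, not ScalaTest's actual (richer) version; it simply prefixes assertion failures with a clue string so a merged test still reports which branch failed:

```scala
// Minimal sketch of the withClue pattern: run the body, and if an assertion
// fails, rethrow with the clue prepended to the failure message.
def withClue[T](clue: String)(body: => T): T =
  try body
  catch {
    case e: AssertionError =>
      throw new AssertionError(s"$clue: ${e.getMessage}", e)
  }
```

With this, `withClue("byPosition") { assert(cond) }` reports any failure with the `byPosition:` prefix, as in the output above.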
I'll send a followup PR to update the entire test suite to use `withClue` later.
Okay, sounds good.
Test build #105446 has finished for PR 24469 at commit
Test build #105447 has finished for PR 24469 at commit
+1 from me to unblock the circular dependency. @dongjoon-hyun, does this look okay to you?

Thank you for asking. Yes. This looks good to me, too. My comments are also addressed. :)
+1, LGTM. Thank you, @cloud-fan and @rdblue .
Merged to master to unblock DSv2 dev.
Currently we have an analyzer rule, which resolves the output columns of data source v2 writing plans, to make sure the schema of input query is compatible with the table. However, not all data sources need this check. For example, the `NoopDataSource` doesn't care about the schema of input query at all. This PR introduces a new table capability: ACCEPT_ANY_SCHEMA. If a table reports this capability, we skip resolving output columns for it during write. Note that, we already skip resolving output columns for `NoopDataSource` because it implements `SupportsSaveMode`. However, `SupportsSaveMode` is a hack and will be removed soon. new test cases Closes apache#24469 from cloud-fan/schema-check. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
Currently we have an analyzer rule, which resolves the output columns of data source v2 writing plans, to make sure the schema of the input query is compatible with the table.

However, not all data sources need this check. For example, `NoopDataSource` doesn't care about the schema of the input query at all.

This PR introduces a new table capability: `ACCEPT_ANY_SCHEMA`. If a table reports this capability, we skip resolving output columns for it during write.

Note that we already skip resolving output columns for `NoopDataSource` because it implements `SupportsSaveMode`. However, `SupportsSaveMode` is a hack and will be removed soon.

How was this patch tested?
new test cases
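The capability mechanism described in this PR can be sketched with plain Scala stand-ins. Spark's real capability enum lives in the connector catalog API (`org.apache.spark.sql.connector.catalog.TableCapability`); the types below are simplified for illustration and are not the actual Spark interfaces:

```scala
// Simplified model of the table-capability check. In Spark, tables expose
// a java.util.Set of TableCapability values; here a Scala Set suffices.
object TableCapability extends Enumeration {
  val BATCH_WRITE, ACCEPT_ANY_SCHEMA = Value
}

trait Table {
  def name: String
  def capabilities: Set[TableCapability.Value]
}

// A noop-style sink that accepts any input schema.
case class NoopTable(name: String) extends Table {
  override def capabilities: Set[TableCapability.Value] =
    Set(TableCapability.BATCH_WRITE, TableCapability.ACCEPT_ANY_SCHEMA)
}

// An ordinary table that expects the analyzer to resolve output columns.
case class StrictTable(name: String) extends Table {
  override def capabilities: Set[TableCapability.Value] =
    Set(TableCapability.BATCH_WRITE)
}

// The analyzer-side decision: skip output column resolution during write
// for tables reporting ACCEPT_ANY_SCHEMA.
def skipOutputColumnResolution(t: Table): Boolean =
  t.capabilities.contains(TableCapability.ACCEPT_ANY_SCHEMA)
```

A table opts out of the schema check simply by including `ACCEPT_ANY_SCHEMA` in its reported capability set.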